Ruhee Ansari - Advanced Statistic Project SCB


Table of Contents

Problem 1A:
1.1 State the null and the alternate hypothesis for conducting one-way ANOVA for both Education and Occupation individually.
1.2 Perform one-way ANOVA for Education with respect to the variable ‘Salary’. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
1.3 Perform one-way ANOVA for variable Occupation with respect to the variable ‘Salary’. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
1.4 If the null hypothesis is rejected in either (1.2) or in (1.3), find out which class means are significantly different. Interpret the result.

Problem 1B:
1.5 What is the interaction between the two treatments? Analyze the effects of one variable on the other (Education and Occupation) with the help of an interaction plot.
1.6 Perform a two-way ANOVA based on the Education and Occupation (along with their interaction Education*Occupation) with the variable ‘Salary’. State the null and alternative hypotheses and state your results. How will you interpret this result?
1.7 Explain the business implications of performing ANOVA for this particular case study.

Problem 2:
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?
2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.
2.3 Comment on the comparison between the covariance and the correlation matrices from this data. [on scaled data]
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?
2.5 Extract the eigenvalues and eigenvectors. [print both]
2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features.
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors and corresponding features]
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?
2.9 Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained]

Problem 1A:
Salary is hypothesized to depend on educational qualification and
occupation. To understand the dependency, the salaries of 40 individuals
[SalaryData.csv] are collected and each person’s educational qualification
and occupation are noted. Educational qualification is at three levels, High
school graduate, Bachelor, and Doctorate. Occupation is at four levels,
Administrative and clerical, Sales, Professional or specialty, and Executive
or managerial. A different number of observations are in each level of
education – occupation combination.
[Assume that the data follows a normal distribution. In reality, the
normality assumption may not always hold if the sample size is small.]

Solution:
1. Loaded the dataset and checked the first 5 rows of the table.

2. Checked the shape.

3. Used the describe function to get summary statistics (mean, median,
standard deviation, quartiles, and so on).

4. Checked the data types.

5. Checked the number of observations in each category.
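
The inspection steps above can be sketched with pandas roughly as follows. This is a minimal sketch, not the report's actual code: the inline rows are made-up stand-ins for SalaryData.csv, and the column names Education, Occupation, and Salary, as well as some of the category labels, are assumptions based on the problem statement.

```python
import io
import pandas as pd

# Made-up rows standing in for SalaryData.csv (illustrative values only).
csv_text = """Education,Occupation,Salary
Doctorate,Adm-clerical,150000
Doctorate,Sales,160000
Bachelors,Sales,120000
Bachelors,Exec-managerial,170000
HS-grad,Adm-clerical,60000
HS-grad,Prof-specialty,70000
"""
# With the real file this would be pd.read_csv("SalaryData.csv").
df = pd.read_csv(io.StringIO(csv_text))

first_rows = df.head()        # step 1: first 5 rows
n_rows, n_cols = df.shape     # step 2: shape
summary = df.describe()       # step 3: summary statistics of the numeric columns
column_types = df.dtypes      # step 4: data types
# step 5: number of observations in each category
level_counts = {col: df[col].value_counts() for col in ["Education", "Occupation"]}

print(first_rows, (n_rows, n_cols), summary, column_types, sep="\n\n")
```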

1.1 State the null and the alternate hypothesis for conducting one-way
ANOVA for both Education and Occupation individually.

Solution:
For Education:
H0: The mean salary is the same for all education levels.
Ha: At least one education level has a different mean salary.
For Occupation:
H0: The mean salary is the same for all occupations.
Ha: At least one occupation has a different mean salary.
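
In symbols, both pairs of hypotheses have the same form. The notation below is an assumption introduced for illustration: \mu_i is the population mean salary of group i, with k = 3 groups for Education and k = 4 groups for Occupation.

```latex
H_0:\ \mu_1 = \mu_2 = \dots = \mu_k
\qquad\text{versus}\qquad
H_a:\ \mu_i \neq \mu_j \ \text{for at least one pair } i \neq j
```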

1.2 Perform one-way ANOVA for Education with respect to the variable
‘Salary’. State whether the null hypothesis is accepted or rejected based
on the ANOVA results.

Solution:

Since the p-value from the ANOVA is less than the significance level
(0.05), we reject the null hypothesis and conclude that the mean salary
differs for at least one education level.
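
To show what the one-way ANOVA computes, here is a minimal sketch of the F statistic worked out by hand. The salary values are made up for illustration and the function name is hypothetical; the report's own figures are not reproduced here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples, one per group."""
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total

    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up salaries (in thousands) for three education levels.
groups = [[60, 70, 65], [120, 130, 125], [150, 170, 160]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f_stat, df_b, df_w)
```

A large F means the group means differ by more than the within-group spread would explain. In practice a library routine such as scipy.stats.f_oneway returns this statistic together with its p-value, which is what gets compared with 0.05.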

1.3 Perform one-way ANOVA for variable Occupation with respect to
the variable ‘Salary’. State whether the null hypothesis is accepted or
rejected based on the ANOVA results.

Solution:

The p-value from the ANOVA (0.458508) is greater than the significance
level (0.05), so we fail to reject the null hypothesis: there is no
evidence that the mean salary differs between occupation categories.

1.5 What is the interaction between the two treatments? Analyze the
effects of one variable on the other (Education and Occupation) with the
help of an interaction plot.

Solution:

Here, for both Education and Occupation, the p-value is now less than
the significance level (0.05).

From the interaction plot, Doctorate and Bachelor's degree holders
working in Adm-clerical and Sales occupations have approximately similar
salaries. For the other occupations, the salary of Doctorate holders
increases and is the highest. HS-grad individuals have the lowest
salaries, and HS-grads working in Sales have the lowest salary of all.
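
An interaction plot is drawn from the mean salary of each Education and Occupation combination. A minimal sketch of that table of cell means follows, assuming columns named Education, Occupation, and Salary and using made-up values:

```python
import pandas as pd

# Made-up observations; the values are illustrative only.
df = pd.DataFrame({
    "Education":  ["HS-grad", "HS-grad", "Bachelors", "Bachelors", "Doctorate", "Doctorate"],
    "Occupation": ["Sales", "Adm-clerical", "Sales", "Adm-clerical", "Sales", "Adm-clerical"],
    "Salary":     [50, 70, 150, 160, 155, 165],
})

# One row per occupation, one column per education level, each cell a mean salary.
cell_means = (
    df.groupby(["Occupation", "Education"])["Salary"]
      .mean()
      .unstack("Education")
)
print(cell_means)
```

Each column of this table becomes one line in the plot. Roughly parallel lines suggest little interaction, while lines that cross or diverge suggest that the effect of occupation depends on education.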

1.6 Perform a two-way ANOVA based on the Education and Occupation
(along with their interaction Education*Occupation) with the variable
‘Salary’. State the null and alternative hypotheses and state your results.
How will you interpret this result?
Solution:

H0: There is no interaction between Education and Occupation.
Ha: There is an interaction between Education and Occupation.
After performing the two-way ANOVA, the p-value for the interaction term
is less than the significance level (0.05), so we reject the null
hypothesis and conclude that there is an interaction between Education
and Occupation.
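
The model behind this test can be written out as below. The symbols are assumptions introduced for illustration: \alpha_i is the effect of education level i, \beta_j the effect of occupation j, (\alpha\beta)_{ij} their interaction, and \varepsilon_{ijk} the error for the k-th person in that combination.

```latex
\begin{aligned}
\text{Salary}_{ijk} &= \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk} \\
H_0 &:\ (\alpha\beta)_{ij} = 0 \ \text{for all } i, j \\
H_a &:\ (\alpha\beta)_{ij} \neq 0 \ \text{for at least one pair } (i, j)
\end{aligned}
```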

1.7 Explain the business implications of performing ANOVA for this
particular case study.
Solution:
ANOVA lets us test how each factor affects salary on its own and how the
two factors behave when they interact, which is what a business needs to
know when an outcome depends on more than one attribute.
In this case, the analysis shows an interaction between Education and
Occupation with respect to salary, so salary should be considered in
terms of both factors together.
Education has a strong influence on both the type of occupation and the
salary.

Problem 2:
The dataset Education - Post 12th Standard.csv contains
information on various colleges. You are expected to do a
Principal Component Analysis for this case study according to
the instructions given. The data dictionary of the 'Education -
Post 12th Standard.csv' can be found in the following file: Data
Dictionary.xlsx

2.1 Perform Exploratory Data Analysis [both univariate and multivariate
analysis to be performed]. What insight do you draw from the EDA?
Solution:
The main purpose of univariate analysis is to describe and summarize the
data and to find patterns in it.
1. We start by loading the dataset and checking its shape and the data
types of the variables.

2. Then we use the describe function to summarize the data; it reports
the mean, standard deviation, and quartiles (from which the IQR follows)
of the numeric columns.

3. Then we use a distplot (density plot) to check normality, that is,
whether each variable is approximately normally distributed.

4. After that, we carry out the multivariate analysis using the
correlation matrix and a heatmap.

From this we learn that the dataset has 777 rows and 18 columns. Only the
Name column is of object data type, S.F. Ratio is of float data type, and
all the remaining columns are of integer data type.
Among the numeric columns, only Top 25 perc has no outliers; all the
other columns have outliers.
There are no duplicate rows and no null values in the dataset.
Many of the columns are strongly correlated, for example applications
accepted (the Accept column) and applications received (the Apps column).
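
The duplicate, null, and correlation checks behind these findings can be sketched as follows. This is an illustration under stated assumptions: the small frame is made up, and the column names are placeholders modelled on the ones mentioned in the text rather than the exact headers of the CSV.

```python
import pandas as pd

# Made-up rows; the real data has 777 rows and 18 columns.
df = pd.DataFrame({
    "Name":      ["College A", "College B", "College C", "College D", "College E"],
    "Apps":      [1000, 2000, 3000, 4000, 5000],
    "Accept":    [800, 1500, 2100, 3000, 3600],
    "S.F.Ratio": [18.1, 12.2, 12.9, 7.7, 11.9],
})

n_duplicates = int(df.duplicated().sum())     # fully duplicated rows
n_missing = int(df.isnull().sum().sum())      # null values across all columns

# Correlation is only defined for the numeric columns.
numeric = df.select_dtypes(include="number")
corr = numeric.corr()
print(n_duplicates, n_missing)
print(corr.round(2))
```

The corr table is what a heatmap visualizes; values close to 1 or -1, such as Apps versus Accept here, indicate strongly correlated columns.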
2.2 Is scaling necessary for PCA in this case? Give justification and
perform scaling.
Solution:
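The main objective of scaling (standardization) is to bring the features
onto a common scale. It is a data pre-processing step applied to the
independent variables, or features, of the data. PCA is driven by
variance, so features measured on larger scales would otherwise dominate
the components. The variables here are on very different scales (for
example, application counts versus S.F. Ratio), so scaling is necessary
in this case.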

Before standardizing, we need to remove the outliers that are present in
the dataset.

After removing the outliers:

Most of the outliers have been removed.
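
The report does not state which outlier treatment was used. One common option, shown here purely as an assumption, is to cap values at the IQR whiskers (1.5 times the IQR beyond the quartiles). The function name and the sample values are hypothetical.

```python
import numpy as np

def cap_outliers_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return np.clip(values, lower, upper)

# Made-up column with one extreme value.
apps = [10, 12, 11, 13, 12, 11, 100]
capped = cap_outliers_iqr(apps)
print(capped)
```

Capping keeps every row in the dataset and only pulls extreme values in to the whisker limits, which is why it is often preferred over dropping rows before PCA.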


Standardization is not possible on columns containing strings, so we take
only the numerical columns and apply the z-score to them.
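
A minimal sketch of the z-score step is given below, assuming the numeric columns are held in a NumPy array. The function name and the sample values are made up for illustration.

```python
import numpy as np

def zscore(X):
    """Standardize each column to mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    # ddof=0 (population standard deviation) is an assumption of this sketch.
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

# Made-up rows: an application count and a student/faculty ratio.
X = [[1000, 18.0],
     [2000, 12.0],
     [3000, 9.0]]
Z = zscore(X)
print(Z.round(3))
```

After this step both columns are on the same scale even though the raw values differ by orders of magnitude, and integer inputs become floats.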

We then called info() again to check the column details and see the
changes.

All the columns are numeric, and all of them have been converted from
integer to float data type. We have taken all the numeric columns, 17 in
total.

2.3 Comment on the comparison between the covariance and the
correlation matrices from this data. [on scaled data]
Solution:
Correlation is a scaled version of covariance, so the two always have the
same sign (positive, negative, or 0).
When the sign is positive, the variables are said to be positively
correlated; when the sign is negative, the variables are said to be
negatively correlated; and when the value is 0, the variables are said to
be uncorrelated.

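In simple terms, correlation measures both the strength and the direction
of the linear relationship between two variables.
Covariance measures how much two variables change in tandem. It indicates
the direction of the linear relationship between the variables but, being
scale-dependent, not its strength. On scaled (z-scored) data every
variable has unit variance, so the covariance matrix and the correlation
matrix are essentially the same.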
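
The relationship between the two matrices on scaled data can be checked numerically. The sketch below uses randomly generated data, not the college dataset, and standardizes with the sample standard deviation (ddof=1) so that it matches the divisor used by np.cov.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three made-up features on very different scales.
X = rng.normal(size=(50, 3)) * [1.0, 10.0, 100.0]
X[:, 1] += 5.0 * X[:, 0]   # make two of the columns correlated

# Standardize each column with the sample standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_scaled = np.cov(Z, rowvar=False)       # covariance of the scaled data
corr_raw = np.corrcoef(X, rowvar=False)    # correlation of the raw data

print(np.allclose(cov_scaled, corr_raw))
```

If the scaling uses the population standard deviation (ddof=0) instead, the covariance matrix differs from the correlation matrix only by the constant factor n/(n-1).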

2.4 Check the dataset for outliers before and after scaling. What insight
do you derive here?

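The number of outliers has been reduced, and for some of the columns there
are no outliers present after scaling. Since z-score scaling is a linear
transformation, it does not by itself change which points are outliers,
so the reduction comes from the outlier treatment carried out before
scaling.
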
2.5 Extract the eigenvalues and eigen vectors. [ print both]
Eigenvalues and eigenvectors are mainly used to capture the key
information stored in a large matrix.
• They provide a summary of a large matrix.
• Performing computations on a large matrix is slow and requires more
memory and CPU. Eigenvectors and eigenvalues can improve efficiency
in computationally intensive tasks by reducing dimensions while
ensuring that most of the key information is retained.
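
Extracting and printing both can be sketched with NumPy as follows. The 3x3 correlation matrix is made up for illustration; in the report the matrix would come from the 17 scaled columns.

```python
import numpy as np

# Made-up correlation matrix of three scaled features.
corr = np.array([[1.0, 0.8, 0.3],
                 [0.8, 1.0, 0.2],
                 [0.3, 0.2, 1.0]])

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(corr)

# Reorder so that the largest eigenvalue (the first PC) comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # column i belongs to eigenvalues[i]

print("Eigenvalues:", eigenvalues.round(3))
print("Eigenvectors (columns):")
print(eigenvectors.round(3))
```

The eigenvalues sum to the number of features when the input is a correlation matrix, which is why each eigenvalue divided by that total can be read as a share of the variance.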

2.7 Write down the explicit form of the first PC.
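
The general form of the first PC is shown below; the actual coefficients are not given in the text, so none are filled in here. The symbols are assumptions introduced for illustration: z_j is the j-th scaled feature, a_{1j} is the j-th entry of the first eigenvector rounded to two decimals, and p is the number of numeric features (17 in this dataset).

```latex
PC_1 = a_{11} z_1 + a_{12} z_2 + \dots + a_{1p} z_p
```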

2.6 Perform PCA and export the data of the Principal Component
(eigenvectors) into a data frame with the original features. (solving both
questions together).
2.8 Consider the cumulative values of the eigenvalues. How does it help
you to decide on the optimum number of principal components? What
do the eigenvectors indicate?

Solution:
To decide how many eigenvalues/eigenvectors to keep, we should first
clearly define the objective of doing PCA.
If there are no strict constraints, we should plot the cumulative sum of
the eigenvalues. If each eigenvalue is divided by the total sum of the
eigenvalues before plotting, the plot shows the fraction of total variance
retained versus the number of eigenvalues.
The plot then gives a good indication of the point of diminishing returns.
The eigenvectors indicate the direction of each principal component, that
is, the weights of the original features in that component.

Scree plot: A scree plot helps the analyst visualize the relative
importance of the components; a sharp drop in the plot signals that the
subsequent components are ignorable.
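
The cumulative-variance rule described above can be sketched as follows. The eigenvalues and the 85% threshold are assumptions chosen for illustration, and the function name is hypothetical.

```python
import numpy as np

def components_needed(eigenvalues, threshold=0.85):
    """Return (k, cumulative), where k is the smallest number of components
    whose cumulative share of the variance reaches the threshold."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # Index of the first cumulative value that is >= threshold, plus one.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against rounding pushing k past the number of components.
    k = min(k, len(eigenvalues))
    return k, cumulative

# Made-up eigenvalues.
eigenvalues = [4.0, 3.0, 2.0, 1.0]
k, cumulative = components_needed(eigenvalues, threshold=0.85)
print(k, cumulative)
```

Plotting cumulative against the component number gives the curve discussed above; k marks where it first crosses the chosen threshold.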

To find the principal components we use the PCA class from sklearn.
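
A minimal sketch of this step follows, including the export of the components into a data frame labelled with the original feature names. The data is randomly generated, the four feature names are placeholders modelled on columns mentioned in the text, and the choice of two components is only for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
features = ["Apps", "Accept", "Top25perc", "S.F.Ratio"]   # placeholder names

# Made-up data with two correlated columns, then z-scored.
X = rng.normal(size=(100, 4))
X[:, 1] += 2.0 * X[:, 0]
Z = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA(n_components=2)
scores = pca.fit_transform(Z)   # each row projected onto the PCs

# Rows are components, columns are the original features.
loadings = pd.DataFrame(pca.components_, columns=features, index=["PC1", "PC2"])
cumulative = np.cumsum(pca.explained_variance_ratio_)

print(loadings.round(2))
print(cumulative.round(3))
```

Each row of loadings gives the coefficients of one PC in terms of the original features, which is the form needed to write out the explicit equation of the first PC.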

The cumulative % gives the percentage of variance accounted for by the
first n components. In the above array, the first component explains
33.2% of the variance in the dataset, the first two explain 61.6%, and so
on. With 7 components, we capture about 87.2% of the variance in the
dataset.

Correlation Between Components and Features.

2.9 Explain the business implication of using the Principal Component
Analysis for this case study. How may PCs help in the further analysis?
[Hint: Write Interpretations of the Principal Components Obtained].
Solution:
1. PCA is a well-established statistical technique that uses an
orthogonal transformation to convert a set of observations of possibly
correlated variables into a set of values of linearly uncorrelated
variables. It reduces multidimensional data to fewer dimensions while
retaining as much of the variation as possible.
2. PCA can only be done on continuous variables.
3. The dataset has 18 columns, of which 17 are numeric. By applying PCA
we reduce these to just 7 components, which capture about 87.2% of the
variance in the dataset.
4. Depending upon the business need, we can decide the number of PCs.
