Adv Stats Proj

The document provides details about a business report on advance statistics including ANOVA, EDA, and PCA. It discusses using ANOVA to analyze how salary depends on education level and occupation using sample salary data. It performs one-way ANOVA for education level and occupation individually. It also analyzes the interaction between education and occupation using a two-way ANOVA and interaction plots. The implications for human resource departments are discussed. Principal component analysis is also proposed to analyze college data from another dataset.

Uploaded by

Zohaib Imam

BUSINESS REPORT

On Advance Statistics (ANOVA, EDA, PCA)

By- Zohaib Imam.

Problem 1A:
Salary is hypothesized to depend on educational qualification and occupation. To understand
the dependency, the salaries of 40 individuals [SalaryData.csv] are collected and each person’s
educational qualification and occupation are noted. Educational qualification is at three levels,
High school graduate, Bachelor, and Doctorate. Occupation is at four levels, Administrative and
clerical, Sales, Professional or specialty, and Executive or managerial. A different number of
observations are in each level of education – occupation combination.
[Assume that the data follows a normal distribution. In reality, the normality assumption may
not always hold if the sample size is small.]

Q.1.1) State the null and the alternate hypothesis for conducting one-way ANOVA for both Education and
Occupation individually.

Null and Alternate Hypothesis for Education.

H0: The mean 'Salary' is equal across all Education levels.

H1: The mean 'Salary' of at least one Education level differs from the others.

Null and Alternate Hypothesis for Occupation.

H0: The mean 'Salary' is equal across all Occupation levels.

H1: The mean 'Salary' of at least one Occupation level differs from the others.
Q.1.2) Perform one-way ANOVA for Education with respect to the variable ‘Salary’. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.

Since the p-value in this scenario is less than alpha (0.05), we reject the null hypothesis (H0): at least one Education level has a different mean Salary.

Q.1.3) Perform one-way ANOVA for variable Occupation with respect to the variable ‘Salary’. State
whether the null hypothesis is accepted or rejected based on the ANOVA results.

Since the p-value in this scenario is greater than alpha (0.05), we fail to reject the null
hypothesis (H0): there is no significant evidence that mean Salary differs across Occupation levels.
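The one-way tests in Q.1.2 and Q.1.3 can be sketched as follows. SalaryData.csv is not reproduced in this report, so the snippet builds a small synthetic sample instead; the generated values, the column names Education, Occupation and Salary, and the helper one_way_anova are assumptions for illustration, not the report's actual code or results.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for SalaryData.csv (all values are made up).
rng = np.random.default_rng(0)
base_salary = {"HS-grad": 80_000, "Bachelors": 160_000, "Doctorate": 200_000}
occupations = ["Adm-clerical", "Sales", "Prof-specialty", "Exec-managerial"]
rows = []
for edu, base in base_salary.items():
    for i in range(12):
        rows.append({
            "Education": edu,
            "Occupation": occupations[i % 4],
            "Salary": base + rng.normal(0, 20_000),
        })
df = pd.DataFrame(rows)


def one_way_anova(data, factor, response="Salary"):
    """Return (F, p) for a one-way ANOVA of `response` across the levels of `factor`."""
    groups = [g[response].to_numpy() for _, g in data.groupby(factor)]
    return stats.f_oneway(*groups)


alpha = 0.05
for factor in ["Education", "Occupation"]:
    f_stat, p_value = one_way_anova(df, factor)
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"{factor}: F = {f_stat:.2f}, p = {p_value:.4f} -> {decision}")
```

The decision rule is the one used in the answers above: reject H0 when the p-value is below alpha, otherwise fail to reject it.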
Q. 1.4) If the null hypothesis is rejected in either (1.2) or in (1.3), find out which class means are
significantly different. Interpret the result.

The Bachelors and Doctorate class means are significantly different from the other class means.
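One common way to find which class means differ after a significant one-way ANOVA is Tukey's HSD test. The report does not show which post-hoc method was used, so the sketch below is an assumption; it runs statsmodels' pairwise_tukeyhsd on made-up salaries.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Made-up salaries: 12 observations per education level.
rng = np.random.default_rng(1)
means = {"HS-grad": 80_000, "Bachelors": 160_000, "Doctorate": 200_000}
salary = np.concatenate([rng.normal(m, 10_000, size=12) for m in means.values()])
education = np.repeat(list(means.keys()), 12)

# One row per pair of levels; the `reject` column flags significantly different means.
tukey = pairwise_tukeyhsd(endog=salary, groups=education, alpha=0.05)
print(tukey.summary())
```

Each row of the summary compares two education levels; reject = True means that pair of class means differs at the chosen alpha.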

Q. 1.5) What is the interaction between the two treatments? Analyze the effects of one variable on the
other (Education and Occupation) with the help of an interaction plot.
• As seen from the two interaction plots above, there appears to be strong interaction between
Doctorate and Bachelors in the Adm-clerical and Sales occupations.
• There is also some interaction between Bachelors and HS-grad in the Prof-specialty occupation.
• There is very little or no interaction between Doctorate and HS-grad in any occupation.
• This suggests that a Doctorate graduate may not be strongly preferred for some job roles
and may be considered over-qualified, which results in a wage on par with, or not
significantly higher than, that of a Bachelor's degree holder.
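An interaction plot is a plot of cell means, so the idea can be sketched with a table of mean Salary per Education and Occupation combination. The figures below are made up, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical salaries (in thousands): two observations per Education x Occupation cell.
df = pd.DataFrame({
    "Education": ["Bachelors"] * 4 + ["Doctorate"] * 4,
    "Occupation": ["Sales", "Sales", "Prof-specialty", "Prof-specialty"] * 2,
    "Salary": [150, 170, 120, 140, 160, 180, 240, 260],
})

# Cell means: one row per Occupation, one column per Education level.
cell_means = df.pivot_table(index="Occupation", columns="Education",
                            values="Salary", aggfunc="mean")
print(cell_means)

# If the gap between education levels changes from one occupation to the next,
# the lines of an interaction plot are not parallel, which suggests interaction.
gap = cell_means["Doctorate"] - cell_means["Bachelors"]
print(gap)
```

Plotting each column of cell_means against Occupation gives the interaction plot itself; lines that cross or diverge point to an interaction effect.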

Q.1.6) Perform a two-way ANOVA based on the Education and Occupation (along with their interaction
Education*Occupation) with the variable ‘Salary’. State the null and alternative hypotheses and state your
results. How will you interpret this result?

H0: The mean 'Salary' is equal across all Education and Occupation categories.

H1: The mean 'Salary' of at least one Education or Occupation category differs from the others.
• For the variable Education, P(>F) is less than 0.05 (the significance level), so the null
hypothesis is rejected; Education has a significant impact on mean Salary.
• For the variable Occupation, P(>F) is greater than 0.05, so the null hypothesis cannot be
rejected; there is little to no statistical evidence that Occupation has a significant effect on
mean Salary.
• For the interaction term, i.e., C(Education):C(Occupation), P(>F) is less than 0.05,
indicating statistical evidence of an interaction between the two variables (consistent with
the earlier inference drawn from Fig 1B.1); the interaction has a significant impact on Salary.
• However, more independent variables need to be incorporated to understand to what
extent the interaction of Education and Occupation supports a reliable estimate of mean
Salary. For example, years of work experience is a likely independent variable to consider
in future analyses.
Q. 1.7) Explain the business implications of performing ANOVA for this particular case study.

Assuming the report is intended for the HR department of a company or an HR consulting firm,
the key takeaways are as follows:
• An employee's or graduate's salary depends significantly more on their level of education
than on their occupation or job role.
• Given the statistically significant interaction effect of education and occupation on
salary, job role still has some impact on salary despite occupation's lesser significance
on its own.
• It is also noteworthy that for a few occupations a Bachelor's degree holder may be awarded
a higher salary than a Doctorate counterpart. This highlights an important shortcoming of
the dataset, which reduces the accuracy of the tests and analyses performed: it omits
other important independent variables that affect salary, such as work experience,
specialization/domain, and industry.
• On average, a Doctorate would probably earn a higher salary than Bachelors
and HS-grads. However, being a Doctorate does not necessarily mean a
significantly higher salary than Bachelor's degree graduates/employees, as was observed in Fig
• This suggests that a Doctorate graduate may not be strongly preferred for some job roles
and may be considered over-qualified, which results in a wage on par with, or not
significantly higher than, that of a Bachelor's degree holder.
• Hence, HR professionals may need a more comprehensive approach to setting
salary bands. Similar job titles can command different salary packages across industries,
depending on the job requirements and description. Nevertheless, work experience remains an
important factor in deciding salary.
• The ANOVA test indicates that occupation level coupled with higher educational
qualification has a significant impact on salary, even though occupation type/level alone
may not be as significant an influence as education.

Problem 2:
The dataset Education - Post 12th Standard.csv contains information on various colleges. You
are expected to do a Principal Component Analysis for this case study according to the
instructions given. The data dictionary of the 'Education - Post 12th Standard.csv' can be found
in the following file: Data Dictionary.xlsx.

Q2.1) Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed].
What insight do you draw from the EDA?

Exploratory Data Analysis:

• The dataset has 777 observations and 18 variables.
• There are no categorical variables in this dataset.
• All the variables are of int64 type except ‘Names’, which is of object datatype, and ‘S.F.Ratio’,
which is of float64 datatype.
• There are no missing values in the dataset.
• There are no duplicate rows in the dataset.
• The dataset contains many outliers that have not been treated, so the results
of the analyses may not be fully accurate. Although this simplistic approach might
distort the output, it still gives an approximate result that is adequate for drawing inferences.
• The maximum percentage of faculty with a PhD exceeds 100 (i.e., 103); this has been fixed by
imputing the median value. Likewise, the maximum graduation rate (i.e., 118), which ideally
cannot exceed 100, has been corrected by imputing the median value.
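The median imputation of impossible percentages described in the last bullet can be sketched as follows. The column names PhD and Grad.Rate and the sample values are assumptions, and the column median is taken over all rows, including the invalid one.

```python
import pandas as pd

# Tiny made-up sample with one impossible value (> 100) in each column.
df = pd.DataFrame({
    "PhD": [72.0, 103.0, 85.0, 60.0, 90.0],
    "Grad.Rate": [65.0, 118.0, 70.0, 55.0, 100.0],
})

for col in ["PhD", "Grad.Rate"]:
    # mask() replaces the entries where the condition is True with the column median.
    df[col] = df[col].mask(df[col] > 100, df[col].median())

print(df)
```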
Boxplot before treating outliers:

Boxplot after treating outliers:


Univariate Analysis:
Univariate analysis was performed on students from top 10% schools, total applications, total
acceptances, PhD, S.F.Ratio, Grad Rate, and Room Board.

Univariate analysis of students enrolling from top 10% schools:

• The mean percentage of students in a university coming from the best (top 10%) schools is
27%; the median is 23%.
• More importantly, almost 120 universities out of 777 have students coming from the best
schools.
• Almost 20 universities have more than 50% of their student population coming from the
best schools.
• Approximately 10 universities each have 80% and 90% of their student population coming
from the top 10% schools.
Univariate analysis of total application:

• The mean number of applications is 3001, whereas the median is 1558.
• The maximum number of applications stands at 48094, with roughly 10 institutions in
that range.
• The distribution is highly right-skewed.
• Quite a few universities receive more than 10,000 applications.
• Comparing applications with acceptances, the mean ratio of acceptances to total
applications is 67% and the median ratio is also 67%, so the two are comparable.
Univariate analysis of total acceptance:

• The mean number of acceptances is 2018, whereas the median is 1110.
• The maximum number of acceptances stands at 26330, with fewer than roughly 10
institutions in that range.
• The distribution is highly right-skewed.
Univariate analysis of faculties with PhD
Insights:

• The distribution is left-skewed.
• The mean percentage of faculty with a PhD is 72%, whereas the median is 75%.
• Approximately 20 universities have 100% PhD-qualified faculty.
• The lowest 25% of universities, when ranked by PhD percentage, have at most about 62%
PhD-qualified faculty.
Univariate analysis of student to faculty ratio
Insights:

• The mean and median S.F. ratios are 14 and 13 respectively.
• The distribution is nearly normal.
• Approximately 120 universities have an S.F. ratio of 12.
• However, the maximum S.F. ratio is 40, which applies to roughly 3 universities.
• A few universities (approximately 5) have a very low S.F. ratio (<5).
Univariate analysis of graduation rate
Insights:

• The mean and median graduation rates are each 65%.
• The distribution is roughly normal.
• The lowest graduation rate is 10%, which accounts for fewer than 10 universities.
• Interestingly, approximately 40 universities have a 100% graduation rate.
Univariate analysis of boarding expenses
Insights:

• The mean and median boarding expenses are $4357 and $4200 respectively.
• The distribution is roughly normal.
• The highest expense stands at $8124, which accounts for fewer than 3 universities.
• The 75th percentile of boarding expenses is a comparable $5050.
Multivariate Analysis:

Heatmap showing correlation coefficients.

Observation and inference:

• A few pairs have very high correlation, namely:
o Application and Acceptance
o Students from top 10% schools and graduation rate
o Terminal and PhD qualified faculties
o Full time undergrad students and enrolment
o Students from top 10% schools and 25% schools
• The heatmap exhibits multicollinearity, visible in the significant number of highly
correlated pairs of features. Multicollinearity is a problem because it
undermines the statistical significance of an independent variable or feature.
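Highly correlated pairs like those listed above can also be pulled out of the correlation matrix programmatically rather than read off the heatmap. In this sketch the data are synthetic, the column names Apps, Accept and S.F.Ratio are assumptions, and the 0.8 cut-off is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
apps = rng.normal(3000, 1000, size=200)
df = pd.DataFrame({
    "Apps": apps,
    "Accept": 0.67 * apps + rng.normal(0, 100, size=200),  # built to track Apps closely
    "S.F.Ratio": rng.normal(14, 3, size=200),               # unrelated to the others
})

corr = df.corr()
cols = corr.columns

# Walk the upper triangle only, so each pair of features is reported once.
threshold = 0.8
high_pairs = [
    (cols[i], cols[j], corr.iloc[i, j])
    for i, j in zip(*np.triu_indices(len(cols), k=1))
    if abs(corr.iloc[i, j]) > threshold
]
for first, second, r in high_pairs:
    print(f"{first} vs {second}: r = {r:.2f}")
```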
Q.2.2) Is scaling necessary for PCA in this case? Give justification and perform scaling.

• The main objective of scaling or standardization is to bring the data onto a common scale. It is a
data preprocessing step applied to the independent variables or features. Scaling also helps
speed up the calculations in an algorithm.
• If the attributes share a well-defined meaning and unit (say, latitude and longitude), then we
should not scale the data, because this will cause distortion.
• In this case, however, we have mixed numerical data in which each attribute is something
entirely different (like Room.Board and Grad.Rate) and has different units attached (price,
graduation rate, ...), so the values are not directly comparable. We therefore need to scale the
data. Scaling can be done with the z-score or with StandardScaler from sklearn.
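Z-score scaling subtracts each column's mean and divides by its standard deviation. A minimal sketch with made-up values follows; the column names are assumptions, and ddof=0 is chosen to match the population standard deviation that sklearn's StandardScaler uses.

```python
import pandas as pd

# Two columns on very different scales (values made up).
df = pd.DataFrame({
    "Room.Board": [3300.0, 4200.0, 5050.0, 8124.0],
    "Grad.Rate": [10.0, 65.0, 70.0, 100.0],
})

# z = (x - mean) / std, computed column by column.
scaled = (df - df.mean()) / df.std(ddof=0)
print(scaled)
print(scaled.mean().round(6))       # each column mean is ~0
print(scaled.std(ddof=0).round(6))  # each column standard deviation is ~1
```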

Q.2.3) Comment on the comparison between the covariance and the correlation matrices from this data.

• Correlation is a scaled version of covariance; the two always have the same sign
(positive, negative, or 0). When the sign is positive, the variables are said to be positively
correlated; when the sign is negative, they are said to be negatively correlated; and when
the value is 0, they are said to be uncorrelated.
• Put simply, correlation measures both the strength and the direction of the linear relationship
between two variables.
• Covariance measures how much two variables change in tandem. It indicates the direction of
the linear relationship between the variables, but its magnitude depends on their units.
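The statement that correlation is a scaled version of covariance can be checked numerically: dividing each covariance by the product of the two standard deviations gives the correlation, and the covariance matrix of z-scored data equals the correlation matrix of the raw data. The sketch below uses synthetic data.

```python
import numpy as np

rng = np.random.default_rng(4)
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 2.0]])
x = rng.normal(size=(100, 3)) @ mixing
x[:, 0] *= 1000  # put the first column on a much larger scale

cov = np.cov(x, rowvar=False)
corr = np.corrcoef(x, rowvar=False)

# corr_ij = cov_ij / (s_i * s_j)
std = np.sqrt(np.diag(cov))
corr_from_cov = cov / np.outer(std, std)

# Covariance of the standardized columns (same ddof as np.cov).
z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
cov_of_scaled = np.cov(z, rowvar=False)

print(np.allclose(corr_from_cov, corr))
print(np.allclose(cov_of_scaled, corr))
```

This is why, once the data are standardized, running PCA on the covariance matrix and running it on the correlation matrix give the same components.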

Q.2.4) Check the dataset for outliers before and after scaling. What insight do you derive here?

• The outliers were checked with boxplots during the univariate analysis; after standardizing,
we check the outliers again.

Boxplot of outliers after standardizing.


Observation:

• The scaled variables all have similar maximum values and comparable minimum values.
• The mean of each variable is approximately 0 and the standard deviation is approximately 1.
• The ranges of the variables are therefore standardized, and all are unit-less quantities.
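One further point worth checking is that z-score scaling is a linear transformation, so it changes the units of a variable but not which observations a boxplot flags as outliers. The sketch below applies the usual 1.5 x IQR boxplot rule to made-up application counts; the helper iqr_outlier_mask is hypothetical.

```python
import numpy as np


def iqr_outlier_mask(values):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the usual boxplot rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)


# Made-up application counts with one extreme value.
applications = np.array([500.0, 900.0, 1200.0, 1558.0, 2000.0, 2600.0, 3100.0, 48094.0])
scaled = (applications - applications.mean()) / applications.std()

before = iqr_outlier_mask(applications)
after = iqr_outlier_mask(scaled)
print(before.sum(), after.sum())
print(np.array_equal(before, after))
```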

Q.2.5) Perform PCA and export the data of the Principal Component scores into a data frame.

• The cumulative % gives the percentage of variance accounted for by the first n components. For
example, the cumulative percentage for the second component is the sum of the percentages of
variance for the first and second components. It helps in deciding the number of components: we
keep the leading components that together explain a high share of the variance.

• In the above array we see that the first principal component explains 33.3% of the variance within
our dataset, while the first two explain 62.1%, and so on. If we employ 7 components, we capture
~87.6% of the variance within the dataset.
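The PCA step and the export of the component scores into a data frame can be sketched with scikit-learn as below. The data are synthetic, and the feature names and the choice of 4 components are assumptions for illustration; they are not the report's figures.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 6 correlated features driven by 2 latent factors.
rng = np.random.default_rng(5)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.3 * rng.normal(size=(200, 6))
features = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(1, 7)])

# Standardize first, then project onto the principal components.
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=4)
scores = pca.fit_transform(scaled)

# Principal component scores as a data frame, one column per component.
pc_scores = pd.DataFrame(scores, columns=[f"PC{i}" for i in range(1, 5)])
cumulative = np.cumsum(pca.explained_variance_ratio_)

print(pc_scores.head())
print(np.round(cumulative, 3))  # cumulative share of variance explained
```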

Below are the principal component scores in a data frame:

Features marked with a rectangular red box are the ones with the maximum loading on the respective
component. We use these marked features to decide the context that the component represents.

Correlation between components and features:


Q.2.6) Extract the eigenvalues and eigenvectors. [Print both.]

• Eigenvalues and eigenvectors are mainly used to capture the key information stored in a large
matrix.
• They provide a summary of a large matrix.
• Performing computations on a large matrix is slow and requires more memory and CPU;
eigenvectors and eigenvalues can improve efficiency in computationally intensive tasks by
reducing dimensions while ensuring that most of the key information is maintained.
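Extracting the eigenvalues and eigenvectors can be sketched with NumPy as below. The data are synthetic, and using np.linalg.eigh on the correlation matrix of the standardized data is an assumption about the method; the report's own output is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(150, 4))
X[:, 1] += 0.8 * X[:, 0]  # introduce some correlation between two columns

# Standardize, then take the covariance, which is the correlation matrix of X.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
corr = np.cov(Z, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(corr)

# Reorder so the largest eigenvalue (the first PC) comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # column i pairs with eigenvalues[i]

print(eigenvalues)
print(eigenvectors)
```

For a correlation matrix the eigenvalues sum to the number of features, so each eigenvalue divided by that total is the share of variance explained by its component.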
Q.2.7) Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places
of decimals only).

• Explicit form of first PC;

Eigen Vectors:
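The explicit form of a principal component is a weighted sum of the scaled features, with the weights taken from the corresponding eigenvector. The sketch below shows how such an expression can be assembled with coefficients rounded to two decimals; the three feature names and the correlation matrix are made up, so the output is not the report's first PC.

```python
import numpy as np

feature_names = ["Apps", "Accept", "Enroll"]  # hypothetical subset of columns
corr = np.array([[1.0, 0.9, 0.8],
                 [0.9, 1.0, 0.9],
                 [0.8, 0.9, 1.0]])            # made-up correlation matrix

eigenvalues, eigenvectors = np.linalg.eigh(corr)
first_vector = eigenvectors[:, np.argmax(eigenvalues)]

# The sign of an eigenvector is arbitrary; flip it so the weights read as positive.
if first_vector.sum() < 0:
    first_vector = -first_vector

# Signed coefficients with two decimals, one term per feature.
terms = [f"{coef:+.2f} * {name}" for coef, name in zip(first_vector, feature_names)]
explicit_form = "PC1 = " + " ".join(terms)
print(explicit_form)
```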

Q.2.8) Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum
number of principal components? What do the eigenvectors indicate?

• The first eigenvalue explains 33.20% of the information represented by all 17 features.
• Similarly, the 1st and 2nd eigenvalues together explain 61.5% of the information, and so on.
• Considering up to the 8th eigenvalue, the total variance that can be explained is 90.33%.
• Based on advice from the business about the acceptable percentage of variance explained,
one can identify the optimum number of PCs.
• In this case, if 90.33% is assumed to be acceptable for the scope of the analyses, the
first 8 of the 17 PCs generated explain a significant amount of variance; in simpler
words, they represent 90.33% of the information present in the 17 numeric features of
the original dataset.
• Another way of visualizing and approximating the optimal number of PCs is a scree
plot. Note: the y-axis represents the % of variance explained.

• The eigenvectors are the coefficients of the new components: each component is obtained by
multiplying the eigenvector values by the corresponding features and summing the results.
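Choosing the number of components from the cumulative eigenvalues can be sketched as follows. The eigenvalues and the 85% threshold are made up, and the helper components_needed is hypothetical.

```python
import numpy as np


def components_needed(eigenvalues, threshold):
    """Smallest number of leading components whose cumulative variance share reaches `threshold`."""
    values = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(values) / values.sum()
    # searchsorted returns the first index where cumulative >= threshold.
    return int(np.searchsorted(cumulative, threshold) + 1), cumulative


eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.3, 0.2]  # made-up eigenvalues
k, cumulative = components_needed(eigenvalues, threshold=0.85)
print(np.round(cumulative, 3))
print(k)
```

The threshold is a business decision, as noted above; the function only reports how many leading components are needed to reach it.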
Q.2.9) Explain the business implication of using the Principal Component Analysis for this case study. How
many PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained].

• PCA is a well-established statistical technique that uses an orthogonal transformation to convert
a set of observations of possibly correlated variables into a set of values of linearly uncorrelated
variables. It is also a tool for reducing multidimensional data to lower dimensions while keeping
as much of the variation as possible.
• PCA can only be applied to continuous variables.
• There are 18 variables in the dataset (17 of them numeric); by applying PCA we reduce these to
just 7 components, which capture 87.6% of the variance in the dataset.
