Sayan Pal – Business Report: Advanced Statistics Assignment


Advanced Statistics Assignment

Problem 1A:

Salary is hypothesized to depend on educational qualification and occupation. To understand the
dependency, the salaries of 40 individuals [SalaryData.csv] were collected, and each person's
educational qualification and occupation were noted. Educational qualification has three levels:
High school graduate, Bachelor, and Doctorate. Occupation has four levels: Administrative and
clerical, Sales, Professional or specialty, and Executive or managerial. The number of observations
differs across the education–occupation combinations.

[Assume that the data follows a normal distribution. In reality, the normality assumption may not
always hold if the sample size is small.]

1.1 State the null and the alternate hypothesis for conducting one-way ANOVA for both Education
and Occupation individually.

Solution:

Formulation of hypotheses for the one-way ANOVA of salary across education levels:

 H0: Mean salary is the same across all education levels (salary does not depend on education).
 Ha: Mean salary differs for at least one education level (salary depends on education).
 Significance level (α) = 0.05

Formulation of hypotheses for the one-way ANOVA of salary across occupation levels:

 H0: Mean salary is the same across all occupation levels (salary does not depend on occupation).
 Ha: Mean salary differs for at least one occupation level (salary depends on occupation).
 Significance level (α) = 0.05

1.2 Perform one-way ANOVA for Education with respect to the variable ‘Salary’. State whether the
null hypothesis is accepted or rejected based on the ANOVA results.

Solution:

To perform the one-way ANOVA for education with respect to the variable 'Salary', we fit the model in
the Jupyter notebook and generate the ANOVA (AOV) table. We get the following output:

From the above table, we find that the p-value is less than 0.05; hence we reject the null hypothesis
and conclude that salary depends on education qualification.

1.3 Perform one-way ANOVA for variable Occupation with respect to the variable ‘Salary’. State
whether the null hypothesis is accepted or rejected based on the ANOVA results.

Solution:

To perform the one-way ANOVA for occupation with respect to the variable 'Salary', we fit the model in
the Jupyter notebook and generate the ANOVA (AOV) table. We get the following output:

From the above table, we find that the p-value is greater than 0.05; hence we fail to reject the null
hypothesis, i.e. salary does not depend significantly on occupation.

Problem 1B:

1.5 What is the interaction between the two treatments? Analyze the effects of one variable on
the other (Education and Occupation) with the help of an interaction plot.

Solution:

As seen in the interaction plots below, there appears to be a moderate interaction between the two
categorical variables.

Administrative/clerical and sales professionals with Bachelor's and Doctorate degrees earn similar
salary packages.
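A hedged sketch of such an interaction plot using statsmodels' `interaction_plot`; the data frame below is synthetic (SalaryData.csv is not reproduced here), and Education is mapped to illustrative numeric codes so the x-axis is ordered:

```python
# Interaction plot sketch (synthetic data; a headless backend lets this run
# without a display).
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from statsmodels.graphics.factorplots import interaction_plot

df = pd.DataFrame({
    "Education": ["HS-grad", "Bachelors", "Doctorate"] * 4,
    "Occupation": (["Adm-clerical"] * 3 + ["Sales"] * 3 +
                   ["Prof-specialty"] * 3 + ["Exec-managerial"] * 3),
    "Salary": [25000, 55000, 95000, 27000, 60000, 90000,
               30000, 65000, 110000, 40000, 70000, 120000],
})

edu_order = {"HS-grad": 0, "Bachelors": 1, "Doctorate": 2}  # illustrative coding
fig, ax = plt.subplots(figsize=(8, 5))
interaction_plot(df["Education"].map(edu_order), df["Occupation"], df["Salary"], ax=ax)
fig.savefig("interaction_plot.png")
```

Roughly parallel lines indicate little interaction; crossing lines indicate a strong one.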

1.6 Perform a two-way ANOVA based on the Education and Occupation (along with their
interaction Education*Occupation) with the variable ‘Salary’. State the null and alternative
hypotheses and state your results. How will you interpret this result?

Solution:

Formulation of hypotheses for the two-way ANOVA of salary on education and occupation (with
interaction):

 H0 (Education): Mean salary is the same across all education levels.
 H0 (Occupation): Mean salary is the same across all occupation levels.
 H0 (Interaction): There is no interaction effect of education and occupation on salary.
 Ha: In each case, the corresponding effect exists (at least one mean differs, or the interaction is present).
 Significance level (α) = 0.05

Considering both factors, education is a significant factor, as its p-value is < 0.05, whereas
occupation is not significant, as its p-value is > 0.05.

1.7 Explain the business implications of performing ANOVA for this particular case study.

Solution:

By performing ANOVA on the given data set, we can conclude that salary depends on educational
qualification, while occupation alone is not a significant driver. For salary benchmarking, education
level is therefore the more useful predictor in this data.

Problem 2:

The dataset Education - Post 12th Standard.csv contains information on various colleges. You are
expected to do a Principal Component Analysis for this case study according to the instructions
given.

2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed]. What insight do you draw from the EDA?

Solution:

First, after importing the relevant libraries in the Jupyter notebook, we load the data set. Then we
perform EDA to extract and examine patterns in the data.

The data set has a shape of (777, 18). We check the top 5 rows and then look for missing values; as
per the output, there are none.
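These initial checks can be sketched as follows; the assignment CSV is not reproduced here, so a two-row stand-in with an illustrative subset of columns is used:

```python
# Initial EDA checks on a small stand-in for the assignment CSV.
import io
import pandas as pd

csv = io.StringIO(
    "Names,Apps,Accept,Enroll\n"
    "College A,1660,1232,721\n"
    "College B,2186,1924,512\n"
)
df = pd.read_csv(csv)
print(df.shape)                  # (777, 18) for the full assignment data
print(df.head())                 # top rows
print(df.isnull().sum().sum())   # 0 -> no missing values
```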

We then check the statistical summary of the data set, which is represented below

Next, we perform univariate analysis by defining a function, "univariateAnalysis_numeric", covering
the 17 numeric variables, with a loop applying it to each of them. The function accepts a column name
and a number of bins as arguments.

The analysis of all these variables includes:

 Statistical description of the numeric variable
 Distribution of the column, using a histogram or distplot
 Boxplot representation of the column: five-point summary and outliers, if any

The output displays a total of 17 × 3 = 51 distinct charts/summaries, so only the screenshot for one
variable, Apps, is included here (please refer to the Python notebook for the rest).
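One way the helper could look (a sketch; the notebook's exact `univariateAnalysis_numeric` may differ): describe the column, then save a histogram and a boxplot. A single synthetic "Apps" column stands in for the 17 real variables:

```python
# Sketch of a per-column univariate analysis helper plus the driving loop.
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def univariateAnalysis_numeric(df, column, nbins):
    """Statistical summary plus histogram and boxplot for one numeric column."""
    print(df[column].describe())
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))
    ax_hist.hist(df[column], bins=nbins)
    ax_hist.set_title(f"Distribution of {column}")
    ax_box.boxplot(df[column].dropna())
    ax_box.set_title(f"Boxplot of {column}")
    fig.savefig(f"univariate_{column}.png")
    plt.close(fig)

df = pd.DataFrame({"Apps": np.random.default_rng(1).integers(100, 20000, 50)})
for col in df.select_dtypes(include=np.number).columns:  # 17 columns in the real data
    univariateAnalysis_numeric(df, col, nbins=20)
```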

Further, we perform multivariate analysis, using correlation function in which we get below output.
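The multivariate step can be sketched with a correlation matrix. The columns below are synthetic, with "Accept" built from "Apps" so the strong Apps–Accept correlation shows up:

```python
# Correlation matrix sketch on synthetic stand-in columns.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
apps = rng.integers(100, 20000, 100)
df = pd.DataFrame({
    "Apps": apps,
    "Accept": (apps * 0.7).astype(int),    # correlated with Apps by construction
    "Books": rng.integers(300, 800, 100),  # unrelated
})
corr = df.corr()
print(corr.round(2))
```

A seaborn heatmap (`sns.heatmap(corr, annot=True)`) is a common way to visualize this matrix.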

Insights:

 The average (mean) number of applications received by the listed universities is around
3,001
 The number of applications accepted ranges from 72 to 26,330
 Average student enrolment is around 880
 The median percentage of new students from the top 10% of their higher-secondary class is 23%
 The average book cost is around 550
 The minimum student/faculty (S.F.) ratio is around 2.5
 The average percentage of faculty with Ph.D.s is 72.6
 A considerable number of variables are highly correlated
 “Apps” has a high correlation with “Accept” and “Enroll”

2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.

Solution:

Yes, scaling is necessary before PCA here. In the given data set, applications and several other
variables take values in the thousands, while variables such as percentages have only two digits.
Because these variables are on very different scales, they are hard to compare directly.

PCA computes a new projection of the data, and the new axes are based on the variance of the
variables, so a variable with a high standard deviation would receive a higher weight in the axis
calculation than one with a low standard deviation. Scaling removes this artefact and makes the
variables comparable.

After scaling with the Z-score, we get the following output.
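The scaling step can be sketched with scipy's `zscore` (sklearn's `StandardScaler` is an equivalent choice), on two synthetic columns whose scales differ the way the real ones do:

```python
# Z-score scaling sketch: each column becomes mean 0, standard deviation 1.
import pandas as pd
from scipy.stats import zscore

df = pd.DataFrame({
    "Apps": [1660, 2186, 1428, 417, 193],   # values in the thousands
    "Top10perc": [23, 16, 22, 60, 28],      # two-digit percentages
})
df_scaled = df.apply(zscore)
print(df_scaled.round(2))
```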



2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?

Solution:

Before scaling, let’s plot a boxplot to check the outliers in all the variables. We get the following
output:

Post scaling, let’s plot a boxplot to check the outliers in all the variables. We get the following output:
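The two boxplots can be sketched side by side on synthetic columns whose scales differ by orders of magnitude, as in the assignment data:

```python
# Before/after-scaling boxplots on synthetic stand-in columns.
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import zscore

rng = np.random.default_rng(3)
df = pd.DataFrame({"Apps": rng.integers(100, 20000, 100),
                   "Books": rng.integers(300, 800, 100)})
df_scaled = df.apply(zscore)

fig, (ax_before, ax_after) = plt.subplots(1, 2, figsize=(10, 4))
df.boxplot(ax=ax_before)
ax_before.set_title("Before scaling")
df_scaled.boxplot(ax=ax_after)
ax_after.set_title("After scaling (z-scores)")
fig.savefig("boxplots_scaling.png")
```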

Insights:

 After scaling, all variables have the same standard deviation and hence the same weight, so PCA
computes relevant axes.
 Before scaling, only one variable (Top25perc) had no outliers; after scaling, several variables
have only negligible outliers, which is achieved by normalizing the scale of the variables.

2.5 Perform PCA and export the data of the Principal Component scores into a data frame.

Solution:

For performing PCA, we follow these steps:

Step 1: Generate the covariance matrix.

Step 2: Extract the eigenvalues and eigenvectors.

Step 3: View the scree plot to identify the number of components to build.

Step 4: Perform PCA on the scaled data set by importing PCA from sklearn.decomposition.
We get the following component output:
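The four steps above can be sketched end to end on synthetic scaled data (5 features stand in for the assignment's 17):

```python
# Steps 1-4 of the PCA recipe on synthetic scaled data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Steps 1-2: covariance matrix, then its eigenvalues
cov = np.cov(X_scaled, rowvar=False)
eig_vals = np.linalg.eigh(cov)[0][::-1]       # descending order

# Step 3: scree-plot data = share of variance per component
print((eig_vals / eig_vals.sum()).round(3))

# Step 4: PCA on the scaled data via sklearn
pca = PCA(n_components=5)
scores = pca.fit_transform(X_scaled)          # principal component scores
print(pca.components_.shape)                  # (n_components, n_features)
```

The `scores` array is what later gets exported into a data frame.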

Then, we examine the loading of each feature on the components.



After that, we load these components into a dataframe along with the list of columns considered
earlier in df_num_scaled.

Below is a representative screenshot of df_pca_loading, in which the principal component scores are
exported into a data frame.

2.6 Extract the eigenvalues, and eigenvectors.

Solution:

We extract the eigenvalues and eigenvectors from the covariance matrix.

The snapshot below shows the extracted eigenvalues and eigenvectors:
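The extraction can be sketched as follows on synthetic stand-in data; `np.linalg.eigh` suits symmetric matrices such as a covariance matrix and returns real values in ascending order:

```python
# Eigenvalue/eigenvector extraction from a covariance matrix.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 4))
cov = np.cov(X, rowvar=False)

eig_vals, eig_vecs = np.linalg.eigh(cov)     # ascending eigenvalues
order = np.argsort(eig_vals)[::-1]           # re-sort descending
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
print(eig_vals.round(2))
print(eig_vecs.round(2))
```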



2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two
places of decimals only).

Solution:

Eigenvector of the first PC:

[2.42, 3.24, 9.77, -1.02, 2.28, -4.76, 1.23, -3.41, -1.84, -1.34, -6.79, -1.51, 5.73, 2.54, -3.50, 4.76, -2.73]

If we sort the eigenvectors in descending order of their eigenvalues, the first eigenvector accounts
for the largest spread in the data, the second for the second largest, and so on.
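Writing the first PC explicitly as a linear combination of the original features can be sketched as below; the three feature names and the resulting coefficients are illustrative, not the assignment's values:

```python
# Express PC1 as a linear combination of feature names, two decimals per coefficient.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
X = rng.normal(size=(60, 3))
features = ["Apps", "Accept", "Enroll"]      # illustrative subset

pca = PCA().fit(X)
coefs = pca.components_[0]                   # first eigenvector
terms = [f"{c:+.2f}*{name}" for c, name in zip(coefs, features)]
print("PC1 = " + " ".join(terms))
```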

2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate?

Solution:

From the screenshot of the cumulative eigenvalues below, we can see that around 8 principal
components explain over 90% of the variance. Thus, the optimum number of principal components is 8.

Furthermore, the eigenvectors indicate the directions of the principal components; multiplying the
original data by the eigenvectors re-orients the data onto the new axes.
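The rule for choosing the component count can be sketched as: accumulate the explained-variance ratios and pick the smallest k reaching 90%. Correlated synthetic columns make a few components dominate, as in the assignment data:

```python
# Pick the smallest number of components whose cumulative variance reaches 90%.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
base = rng.normal(size=(200, 4))
extra = base @ rng.normal(size=(4, 4)) + rng.normal(size=(200, 4)) * 0.1
X = np.hstack([base, extra])                 # 8 columns, effective rank ~4

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cum, 0.90) + 1)      # first k with cum >= 90%
print(cum.round(3))
print("components for 90% variance:", k)
```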

2.9 Explain the business implication of using the Principal Component Analysis for this case study.
How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components
Obtained]

Solution:

We know that each principal component describes a share of the total variance that can be explained
by a single dimension of the data. As mentioned above, we retained 8 PCA dimensions; these 8 PCs can
be used for further analysis, as they represent more than 90% of the variance.

In this case study, we had 17 numeric variables to be assessed; with PCA, we reduced the
dimensionality from 17 to 8 (representing more than 90% of the variance).

However, the cumulative variance above shows that even 5 PCA dimensions represent around 80% of the
variance; to be on the safer side, we chose the 90% threshold.

Thus, as far as the business implication of using PCA is concerned, in this case we reduce a
high-dimensional space (17 variables) to a lower-dimensional one without (theoretically) losing much
of the explanatory power.

Following are the interpretations of the obtained PCs:

 PC1: Dominated by out-of-state tuition and instructional expenditure per student
 PC2: Represents the highly correlated variables Apps, Enroll, and Accept
 PC3: Highlights the estimated cost of books for a student
 PC4: Represents the percentage of faculty with Ph.D.s and terminal degrees
 PC5: Explains the percentage of new students from the top 10% and top 25% of their
higher-secondary class, along with the cost of room and board
 PC6: Captures the student/faculty ratio
 PC7: Highlights estimated personal spending for a student and the graduation rate
 PC8: Explains the number of part-time undergraduates and the percentage of alumni who donate
