Jamboree Case Study
Introduction
Jamboree is a renowned educational institution that has successfully assisted numerous students in gaining
admission to top colleges abroad. With their proven problem-solving methods, they have helped students
achieve exceptional scores on exams like the GMAT, GRE and SAT with minimal effort. To further support
students, Jamboree has recently introduced a new feature on their website that enables students to
assess their probability of admission to Ivy League colleges, taking into account the unique perspective
of Indian applicants.
What is expected
Conduct a thorough analysis to assist Jamboree in understanding the crucial factors impacting graduate
admissions and their interrelationships. Additionally, provide predictive insights to determine an individual's
admission chances based on various variables.
1. Data
The analysis was done on the data located at:
https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/001/839/original/Jamboree_Admission.csv
2. Libraries
Below are the libraries required for analysing and visualizing the data.
import pandas as pd
from scipy import stats
from sklearn.preprocessing import MinMaxScaler
import statsmodels.api as sm
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
3. Data Loading
Loading the data into a Pandas DataFrame for easy handling of the data.
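The loading cell itself is not preserved in this export; a minimal sketch, assuming the CSV is read directly from the URL in Section 1:
df = pd.read_csv('https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/001/839/original/Jamboree_Admission.csv')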
*************************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Serial No. 500 non-null int64
1 GRE Score 500 non-null int64
2 TOEFL Score 500 non-null int64
3 University Rating 500 non-null int64
4 SOP 500 non-null float64
5 LOR 500 non-null float64
6 CGPA 500 non-null float64
7 Research 500 non-null int64
8 Chance of Admit 500 non-null float64
dtypes: float64(4), int64(5)
memory usage: 35.3 KB
None
*************************************************
*************************************************
Shape of the dataset is (500, 9)
*************************************************
*************************************************
Number of nan/null values in each column:
Serial No. 0
GRE Score 0
TOEFL Score 0
University Rating 0
SOP 0
LOR 0
CGPA 0
Research 0
Chance of Admit 0
dtype: int64
*************************************************
*************************************************
Number of unique values in each column:
Serial No. 500
GRE Score 49
TOEFL Score 29
University Rating 5
SOP 9
LOR 9
CGPA 184
Research 2
Chance of Admit 61
dtype: int64
*************************************************
*************************************************
Duplicate entries:
False 500
Name: count, dtype: int64
*************************************************
In [3]: df.columns
Out[3]: Index(['Serial No.', 'GRE Score', 'TOEFL Score', 'University Rating', 'SOP',
               'LOR ', 'CGPA', 'Research', 'Chance of Admit '],
              dtype='object')
In [5]: df.describe()
Out[5]:
       Serial No.   GRE Score  TOEFL Score  University Rating         SOP        LOR        CGPA    Research  Chance of Admit
count  500.000000  500.000000   500.000000         500.000000  500.000000  500.00000  500.000000  500.000000        500.00000
mean   250.500000  316.472000   107.192000           3.114000    3.374000    3.48400    8.576440    0.560000          0.72174
std    144.481833   11.295148     6.081868           1.143512    0.991004    0.92545    0.604813    0.496884          0.14114
min      1.000000  290.000000    92.000000           1.000000    1.000000    1.00000    6.800000    0.000000          0.34000
25%    125.750000  308.000000   103.000000           2.000000    2.500000    3.00000    8.127500    0.000000          0.63000
50%    250.500000  317.000000   107.000000           3.000000    3.500000    3.50000    8.560000    1.000000          0.72000
75%    375.250000  325.000000   112.000000           4.000000    4.000000    4.00000    9.040000    1.000000          0.82000
max    500.000000  340.000000   120.000000           5.000000    5.000000    5.00000    9.920000    1.000000          0.97000
Insight
There are 500 unique applicants
There are no null values
There are no duplicates
There is a trailing space in the 'LOR ' and 'Chance of Admit ' column names
The column Serial No. can be dropped as it doesn't provide any information beyond what is already
provided by the dataframe's index.
The GRE Score in the dataset ranges from 290 to 340 and hence can be converted to datatype int16
The TOEFL Score in the dataset ranges from 92 to 120 and hence can be converted to datatype int8
The University Rating in the dataset ranges from 1 to 5 and hence can be converted to datatype int8
The SOP in the dataset ranges from 1 to 5 and hence can be converted to datatype float32
The LOR in the dataset ranges from 1 to 5 and hence can be converted to datatype float32
The CGPA in the dataset ranges from 6.8 to 9.92 and hence can be converted to datatype float32
The Research in the dataset has values 0 and 1 and hence can be converted to datatype bool
The Chance of Admit in the dataset ranges from 0.34 to 0.97 and hence can be converted to datatype
float32
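A minimal sketch of the cleanup cell these insights imply (column names taken from the df.columns output above):
# Strip the trailing spaces, drop Serial No., and downcast dtypes
df = df.rename(columns={'LOR ': 'LOR', 'Chance of Admit ': 'Chance of Admit'})
df = df.drop(columns=['Serial No.'])
df = df.astype({'GRE Score': 'int16', 'TOEFL Score': 'int8',
                'University Rating': 'int8', 'SOP': 'float32',
                'LOR': 'float32', 'CGPA': 'float32',
                'Research': 'bool', 'Chance of Admit': 'float32'})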
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 GRE Score 500 non-null int16
1 TOEFL Score 500 non-null int8
2 University Rating 500 non-null int8
3 SOP 500 non-null float32
4 LOR 500 non-null float32
5 CGPA 500 non-null float32
6 Research 500 non-null bool
7 Chance of Admit 500 non-null float32
dtypes: bool(1), float32(4), int16(1), int8(2)
memory usage: 10.4 KB
In [7]: df.head()
Out[7]: GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
Insight
The memory usage of the dataframe was reduced by roughly 70%, from 35.3 KB to 10.4 KB
4. Exploratory Data Analysis
4.1. Detecting outliers
4.1.1. Outliers for every continuous variable
In [8]: # helper function to detect outliers
def detectOutliers(df):
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3-q1
lower_outliers = df[df<(q1-1.5*iqr)]
higher_outliers = df[df>(q3+1.5*iqr)]
return lower_outliers, higher_outliers
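A loop along these lines produces the printout below (a sketch; the exact cell is not preserved in this export):
# Check each continuous column for IQR outliers
for col in ['GRE Score', 'TOEFL Score', 'CGPA', 'Chance of Admit']:
    lower, higher = detectOutliers(df[col])
    print('*' * 50)
    print(f"Outliers of '{col}' column are:")
    print('Lower outliers:')
    print(lower)
    print('Higher outliers:')
    print(higher)
    print('*' * 50)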
**************************************************
Outliers of 'GRE Score' column are:
Lower outliers:
Series([], Name: GRE Score, dtype: int16)
Higher outliers:
Series([], Name: GRE Score, dtype: int16)
**************************************************
**************************************************
Outliers of 'TOEFL Score' column are:
Lower outliers:
Series([], Name: TOEFL Score, dtype: int8)
Higher outliers:
Series([], Name: TOEFL Score, dtype: int8)
**************************************************
**************************************************
Outliers of 'CGPA' column are:
Lower outliers:
Series([], Name: CGPA, dtype: float32)
Higher outliers:
Series([], Name: CGPA, dtype: float32)
**************************************************
**************************************************
Outliers of 'Chance of Admit' column are:
Lower outliers:
92 0.34
376 0.34
Name: Chance of Admit, dtype: float32
Higher outliers:
Series([], Name: Chance of Admit, dtype: float32)
**************************************************
Insight
From the above plots and analysis, the only outliers are two mild lower-tail values (0.34) in Chance of Admit, so I will not remove any outliers
Insight
A large chunk of applicants, 32.4%, are associated with a university of rating 3
SOP 4 has the maximum applicants, 89
LOR 3 has the maximum applicants, 99
56% of the applicants have research experience
Insight
GRE Score, TOEFL Score and CGPA exhibit a linear relation with Chance of Admit
Applicants with a high University Rating, SOP and LOR have a higher chance of admission
It is also very evident that an applicant with research experience has a higher chance of admission
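The train/test split cell is not preserved in this export; a sketch consistent with the X_train and X_test used below (the 80/20 split and the random seed are assumptions):
from sklearn.model_selection import train_test_split

X = df.drop(columns=['Chance of Admit'])
y = df['Chance of Admit']
# test_size and random_state are assumed, not taken from the original notebook
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)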
In [19]: X_train.head()
Out[19]: GRE Score TOEFL Score University Rating SOP LOR CGPA Research
In [20]: columns_to_scale = ['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR', 'CGPA']
# Initialize an object of class MinMaxScaler
min_max_scaler = MinMaxScaler()
# Fit min_max_scaler to training data
min_max_scaler.fit(X_train[columns_to_scale])
# Scale the training and testing data
X_train[columns_to_scale] = min_max_scaler.transform(X_train[columns_to_scale])
X_test[columns_to_scale] = min_max_scaler.transform(X_test[columns_to_scale])
In [21]: X_train.head()
Out[21]: GRE Score TOEFL Score University Rating SOP LOR CGPA Research
107 0.96 0.892857 0.75 0.625 0.875 0.852564 1
Model 1
In [23]: X_train_1 = sm.add_constant(X_train)  # add an intercept column (X_train_1 is not defined elsewhere in the export)
model_1 = sm.OLS(y_train, X_train_1).fit()
print(model_1.summary())
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Insight
The R-squared and Adj. R-squared are close to each other, indicating that all the features/predictors are
relevant
SOP has a very high p-value of 0.918
I will retrain the model after dropping the SOP column
6.2. Drop columns with p-value > 0.05 (if any) and re-train
the model
Model 2
In [28]: X_train_2 = X_train.drop(columns=['SOP'])
X_test_2 = X_test.drop(columns=['SOP'])
X_train_2 = sm.add_constant(X_train_2)
X_test_2 = sm.add_constant(X_test_2)
model_2 = sm.OLS(y_train, X_train_2).fit()
print(model_2.summary())
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Out[30]: <matplotlib.collections.PathCollection at 0x1ac051c1810>
Insight
All the model performance metrics have remained the same, implying that the SOP column was not very
important
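The VIF table below was presumably computed with statsmodels' variance_inflation_factor; a minimal sketch on the full design matrix (including the constant):
# VIF for each column of the design matrix, sorted from highest to lowest
vif = pd.DataFrame({
    'feature': X_train_1.columns,
    'VIF': [variance_inflation_factor(X_train_1.values, i)
            for i in range(X_train_1.shape[1])]
}).sort_values('VIF', ascending=False)
print(vif.round(2))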
    feature      VIF
0     const  1511.50
6      CGPA     4.78
4       SOP     2.84
5       LOR     2.03
7  Research     1.49
The VIF score for the const term is high, as expected: the constant term (intercept) is perfectly
collinear with the sum of all the other predictors, making its VIF high
As none of the features have a VIF > 5, there is no multicollinearity; but for the sake of
experimentation I will drop CGPA, the feature with the highest VIF among the features, and compute
the VIF for the remaining features again
    feature      VIF
0     const  1485.48
4       SOP     2.74
5       LOR     1.94
6  Research     1.49
Insight
Finally, based on the VIF scores, the features GRE Score, TOEFL Score, University Rating, SOP, LOR
and Research do not exhibit multicollinearity
Model 3
Retrain the model using only the features GRE Score, TOEFL Score, University Rating, SOP, LOR and Research
In [33]: X_train.head()
Out[33]: GRE Score TOEFL Score University Rating SOP LOR CGPA Research
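The Model 3 cell is not preserved in this export; a sketch that matches the description above (drop CGPA and refit):
# Drop CGPA per the VIF experiment, add the intercept, and refit OLS
X_train_3 = sm.add_constant(X_train.drop(columns=['CGPA']))
X_test_3 = sm.add_constant(X_test.drop(columns=['CGPA']))
model_3 = sm.OLS(y_train, X_train_3).fit()
print(model_3.summary())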
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Out[37]: 0.0041030675622921835
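The value above is the mean of the residuals; a sketch of how it is obtained, assuming the residuals come from the model_3 fit:
residuals = model_3.resid  # residuals on the training data
print(residuals.mean())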
Insight
As the mean of the residuals is close to 0, the model can be considered unbiased
Insight
From the above regression plots and the Pearson correlation values, GRE Score, TOEFL Score and
CGPA exhibit a strong linear relation with the dependent variable Chance of Admit
From the above, it looks like the variance of the residuals decreases with the independent variable.
Goldfeld-Quandt test for homoscedasticity:
H0: Homoscedasticity is present
H1: Heteroscedasticity is present
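A sketch of the test using statsmodels (assuming it is run on the Model 3 design matrix):
# het_goldfeldquandt returns (F statistic, p-value, ordering)
f_stat, p_value, _ = sms.het_goldfeldquandt(y_train, X_train_3)
print(f'F statistic: {f_stat:.4f}, p-value: {p_value:.4f}')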
Insight
As the p-value of the Goldfeld-Quandt test is greater than 0.05, we can conclude that the
regression model exhibits homoscedasticity
Shapiro-Wilk test for normality:
H0: The data is normally distributed
H1: The data is not normally distributed
In [43]: stats.shapiro(residuals)
Out[43]: ShapiroResult(statistic=0.939202606678009, pvalue=0.00017242538160644472)
As the p-value is less than 0.05, we reject H0: the residuals deviate somewhat from normality
{'GRE Score': 0.1037473, 'TOEFL Score': 0.07275442, 'University Rating': 0.025786784, 'SOP': 0.0013959017, 'LOR': 0.072684325, 'CGPA': 0.34205964, 'Research': 0.025733968}
Mean Absolute Error (MAE) for the model: 0.05
Root Mean Squared Error (RMSE) for the model: 0.06
R2 Score for the model: 0.77
Adjusted R2 Score for the model: 0.75
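A sketch of how these metrics can be computed on the test set (variable names follow the earlier sketches; the computation itself is standard scikit-learn):
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = model_3.predict(X_test_3)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)
n, p_with_const = X_test_3.shape
k = p_with_const - 1  # number of predictors, excluding the constant
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(f'MAE: {mae:.2f}, RMSE: {rmse:.2f}, R2: {r2:.2f}, Adjusted R2: {adj_r2:.2f}')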
Insight
It can be observed that the performance of both Ridge (with alpha=0.1) and Lasso (with alpha=0.001)
is similar to Linear Regression in terms of performance metrics (MAE, RMSE, R2 score and
Adjusted R2 score) as well as the scatter plot.
The similar behaviour of Ridge implies that the predictors in the dataset are not highly correlated with
each other. This is in line with the VIF scores too.
The similar behaviour of Lasso implies that the dataset does not have many irrelevant predictors.
SOP was the only feature with a very low coefficient value.
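The Ridge and Lasso fits referenced above can be sketched as follows (the alphas are taken from the insight; everything else is an assumption):
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=0.1).fit(X_train, y_train)
lasso = Lasso(alpha=0.001).fit(X_train, y_train)
# Lasso shrinks the least useful coefficients towards zero; SOP ends up smallest
print(dict(zip(X_train.columns, lasso.coef_.round(4))))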
9. Insights
There are 500 unique applicants
A large chunk of applicants, 32.4%, are associated with a university of rating 3
SOP 4 has the maximum applicants, 89
LOR 3 has the maximum applicants, 99
56% of the applicants have research experience
All the columns/features have a good correlation with Chance of Admit
GRE Score, TOEFL Score and CGPA are highly correlated with each other as well as with the target
Chance of Admit
It is also very evident that an applicant with research experience has a higher chance of admission
None of the features exhibit multicollinearity
CGPA is the most significant predictor and SOP is the least significant predictor based on the model
coefficients
10. Recommendation
The most important factor impacting admission is CGPA. A student with a higher CGPA is also likely
to perform well on the GRE and TOEFL.
Jamboree can ignore SOP while assessing the probability of admission, as it has the least impact on
the model's performance.
Jamboree should encourage more students to gain research experience so as to increase their chance
of admission.