Jamboree Linear Regression Version 2 Jupyter Notebook
In [ ]: import pandas as pd
import warnings
#warnings.filterwarnings("ignore")
df = pd.read_csv('/Users/suraaj/Desktop/Datasets/Admission_Predict_Ver1.1.csv')
df.head()
Out[1]: [First five rows of the DataFrame, with columns: Serial No., GRE Score, TOEFL Score, University Rating, SOP, LOR, CGPA, Research, Chance of Admit]
In [ ]: df.shape
Out[2]: (500, 9)
Now, let us drop the irrelevant column and check whether there are any null values in the dataset.
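The cell performing this step is not visible in this export; a minimal sketch, assuming the irrelevant column is Serial No. (it is only a row identifier):

df = df.drop(columns=['Serial No.'])  # Serial No. is just an identifier, not a predictor
print(df.isnull().sum())              # number of null values in each column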
In [ ]: imporx plt
import seaborn as sns
fig = sns.distplot(df['GRE Score'], kde=False)
plt.title("Distribution of GRE Scores")
plt.show()
fig = sns.distplot(df['TOEFL Score'], kde=False)
plt.title("Distribution of TOEFL Scores")
plt.show()
fig = sns.distplot(df['University Rating'], kde=False)
plt.title("Distribution of University Rating")
plt.show()
fig = sns.distplot(df['SOP'], kde=False)
plt.title("Distribution of SOP Ratings")
plt.show()
fig = sns.distplot(df['CGPA'], kde=False)
plt.title("Distribution of CGPA")
plt.show()
It is clear from the distributions that students with varied merit apply to the university.
Understanding the relation between different factors responsible for graduate admissions
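The plots behind the observations below are not included in this export; a minimal sketch that would reproduce this kind of view, assuming pairwise scatter plots were used:

sns.pairplot(df)  # pairwise relationships between all factors and the chance of admission
plt.show()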
People with higher GRE scores also tend to have higher TOEFL scores, which is justified because both TOEFL and GRE have a verbal section that, although not identical, is related.
Although there are exceptions, people with a higher CGPA usually have higher GRE scores, perhaps because they are smart or hard-working.
LORs are not strongly related to CGPA, so it is clear that a person's LOR does not depend on that person's academic excellence. Having research experience is usually associated with a good LOR, which might be justified by the fact that supervisors interact personally with the students performing research, and this usually results in good LORs.
GRE scores and LORs are also not strongly related; people with different kinds of LORs have all kinds of GRE scores.
CGPA and SOP are not strongly related, even though the Statement of Purpose reflects academic performance. Since people with a good CGPA tend to be more hard-working, they have good things to say in their SOP, which might explain the slight skew towards higher CGPA along with good SOPs.
Applicants with different kinds of SOPs have different kinds of TOEFL scores, so the quality of the SOP is not always related to the applicant's English skills.
In [ ]: import numpy as np
corr = df.corr()
#fig, ax = plt.subplots(figsize=(8, 8))
#colormap = sns.diverging_palette(220, 10, as_cmap=True)
#dropSelf = np.zeros_like(corr)
#dropSelf[np.triu_indices_from(dropSelf)] = True
#colormap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, linewidths=.5, annot=True)
plt.show()
Let's split the dataset into training and test sets and prepare the inputs and outputs.
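The cell that prepares X and y is not visible in this export; a minimal sketch, assuming Chance of Admit is the target (the lookup strips whitespace in case the CSV header carries a trailing space):

target_col = [c for c in df.columns if c.strip() == 'Chance of Admit'][0]  # robust to a trailing space in the header
X = df.drop(columns=[target_col])  # predictors
y = df[target_col]                 # target: chance of admission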
In [ ]: from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True)
In [ ]: X_train
Out[15]: [Training-set rows of the DataFrame, with columns: GRE Score, TOEFL Score, University Rating, SOP, LOR, CGPA, Research]
In [ ]: y_train
In [ ]: #Standardization
from sklearn.preprocessing import StandardScaler
X_train_columns=X_train.columns
std=StandardScaler()
X_train_std=std.fit_transform(X_train)
In [ ]: X_train_std
In [ ]: X_train=pd.DataFrame(X_train_std, columns=X_train_columns)
In [ ]: X_train
Out[20]: [Standardized training data as a DataFrame, with columns: GRE Score, TOEFL Score, University Rating, SOP, LOR, CGPA, Research]
Let's use a few different algorithms and see which model performs better.
Adjusted R-squared reflects the fit of the model. R-squared values range from 0 to 1, where a higher value generally indicates a better fit, assuming certain conditions are met.
The const coefficient is the Y-intercept: if every predictor were zero, the expected output Y would equal the const coefficient.
Each predictor's coefficient (for example, GRE Score or CGPA here) represents the change in the output Y due to a change of one unit in that predictor, everything else held constant.
std err reflects the level of accuracy of the coefficients; the lower it is, the higher the level of accuracy.
P>|t| is the p-value; a p-value of less than 0.05 is conventionally considered statistically significant.
The confidence interval represents the range in which each coefficient is likely to fall (with a likelihood of 95%).
In [ ]: import statsmodels.api as sm
X_train = sm.add_constant(X_train)
model = sm.OLS(y_train.values, X_train).fit()
print(model.summary())
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [ ]: X_train_new=X_train.drop(columns='SOP')
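The refit on the reduced feature set is not shown in this export; a minimal sketch, where model_new is a hypothetical name for the refitted model:

model_new = sm.OLS(y_train.values, X_train_new).fit()  # refit OLS without the SOP feature
print(model_new.summary())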
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The VIF score of an independent variable represents how well that variable is explained by the other independent variables: regress it on the remaining predictors and compute VIF = 1 / (1 − R²).
So, the closer that R² value is to 1, the higher the VIF and the higher the multicollinearity associated with that particular independent variable.
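The calculate_vif helper used below is not included in this export; a minimal sketch, assuming it takes the feature DataFrame and a list of columns to exclude:

from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif(X, exclude_cols):
    # VIF of each remaining column, computed against all the other columns
    cols = [c for c in X.columns if c not in exclude_cols]
    values = X[cols].values
    return pd.DataFrame({'feature': cols,
                         'VIF': [variance_inflation_factor(values, i) for i in range(len(cols))]})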
In [ ]: calculate_vif(X_train_new, [])
   feature    VIF
0  const      1.000000
4  LOR        1.953575
5  CGPA       4.871981
6  Research   1.471018
The VIF values look fine, and hence we can go ahead with the predictions.
In [ ]: X_test_std = std.transform(X_test)
In [ ]: X_test = pd.DataFrame(X_test_std, columns=X_train_columns)  # apply the same scaling used on the training data
X_test = sm.add_constant(X_test)
In [ ]: X_test_del=list(set(X_test.columns).difference(set(X_train_new.columns)))
In [ ]: X_test_new=X_test.drop(columns=X_test_del)
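The prediction cell is not visible in this export; a minimal sketch, reusing the hypothetical model_new from the refit above:

pred = model_new.predict(X_test_new)  # predicted chance of admission for the test set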
Mean of Residuals
In [ ]: residuals = y_test.values-pred
mean_residuals = np.mean(residuals)
print("Mean of Residuals {}".format(mean_residuals))
Here the null hypothesis is that the error terms are homoscedastic, and since the p-value is > 0.05, we fail to reject the null hypothesis.
Normality of residuals
In [ ]: p = sns.distplot(residuals,kde=True)
p = plt.title('Normality of error terms/residuals')
Bias-Variance Tradeoff
The preferred model is one with low bias and low variance.
Dimensionality reduction and feature selection can decrease variance by simplifying models.
Similarly, a larger training set tends to decrease variance.
For reducing bias: change the model, ensure the data is truly representative (the training data should be diverse and represent all possible groups or outcomes), and tune parameters.
The bias–variance decomposition forms the conceptual basis for regression regularization methods such as
Lasso and ridge regression.
Regularization methods introduce bias into the regression solution that can reduce variance considerably
relative to the ordinary least squares (OLS) solution.
Although the OLS solution provides unbiased regression estimates, the lower-variance solutions produced by regularization techniques can provide superior MSE performance.
Linear and Generalized linear models can be regularized to decrease their variance at the cost of increasing
their bias.
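As an illustration (not part of the original notebook), a minimal sketch of regularized linear models on the same standardized features, using scikit-learn's Ridge and Lasso with assumed alpha values:

from sklearn.linear_model import Ridge, Lasso

features = X_train_new.drop(columns='const')       # scikit-learn fits its own intercept
ridge = Ridge(alpha=1.0).fit(features, y_train)     # L2 penalty shrinks coefficients towards zero
lasso = Lasso(alpha=0.01).fit(features, y_train)    # L1 penalty can zero out weak features entirely
print('Ridge R^2:', ridge.score(X_test_new.drop(columns='const'), y_test))
print('Lasso R^2:', lasso.score(X_test_new.drop(columns='const'), y_test))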