Graduate Admission Prediction Using Machine Learning (December 2020)
Abstract— The student admission problem is very important in educational institutions. This paper applies machine learning models to predict the chance of a student being admitted to a master's program, which helps students know in advance whether they have a chance of being accepted. The models used are multiple linear regression, k-nearest neighbor, random forest, and multilayer perceptron. Experiments show that the multilayer perceptron model surpasses the other models.

Keywords— K-nearest neighbors, Multilayer Perceptron, Multiple linear regression, Random forest, Student admission

• Statement of Purpose (SOP): a document written to show the candidate's life, ambitions, and motivation for the chosen degree and university. The score is out of 5 points.

• Letter of Recommendation Strength (LOR): verifies the candidate's professional experience, builds credibility, boosts confidence, and confirms competency. The score is out of 5 points.

• Undergraduate GPA (CGPA): out of 10 points.

1 Graduate Record Examination, www.ets.org/gre
2 Test Of English as a Foreign Language, www.ets.org/toefl
independent variables, presents a linear relationship between them, and fits them into a linear equation. The format of the linear equation is as follows [7]:

yi = β0 + β1xi1 + ... + βnxin + ϵ    (1)

where, for i = 1, ..., n observations:
yi = dependent variable
xi1 ... xin = independent variables
β0 = y-intercept
β1 ... βn = slope coefficients for each independent variable
ϵ = the model's error term (residuals)
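To make equation (1) concrete, the following minimal sketch fits a multiple linear regression with scikit-learn; the synthetic data is an illustrative assumption, not the paper's admission dataset.

```python
# Minimal sketch of multiple linear regression (equation 1) with scikit-learn.
# The toy data below is an assumption for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 3))                 # xi1 ... xin: independent variables
true_beta = np.array([0.4, 0.3, 0.2])
y = 0.1 + X @ true_beta + rng.normal(0, 0.01, 100)   # yi = b0 + sum(bj * xij) + eps

model = LinearRegression().fit(X, y)
print("beta_0 (intercept):", model.intercept_)
print("beta_1..n (slopes):", model.coef_)
```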
C. K-Nearest Neighbor
K-nearest neighbor (KNN) is a supervised machine learning algorithm used for classification and regression problems [8], [9]. It is based on measuring similarity: to predict a new value, the neighboring points are taken into consideration. KNN calculates the distances between points to find the neighbors. In a regression problem, KNN returns the mean of the k nearest labels, while in classification problems the mode of the k labels is returned [10].
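A short sketch of the mean-versus-mode behavior described above, using scikit-learn's KNN estimators; the tiny dataset is a made-up illustration.

```python
# Sketch of KNN prediction: mean of the k nearest labels for regression,
# mode of the k nearest labels for classification. Toy data only.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_reg = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
y_cls = np.array([0, 0, 1, 1, 1])

knn_r = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)
knn_c = KNeighborsClassifier(n_neighbors=3).fit(X, y_cls)

print(knn_r.predict([[2.5]]))   # mean of the 3 nearest regression labels
print(knn_c.predict([[2.5]]))   # mode of the 3 nearest class labels
```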
D. Random Forest
The random forest algorithm is one of the most popular and powerful machine learning algorithms, capable of performing both regression and classification tasks [11]. The algorithm builds a forest from a number of decision trees; therefore, the more data is available, the more accurate and robust the results will be [12]. The random forest method can handle large datasets with high dimensionality without overfitting the model. In addition, it can handle missing values while maintaining accuracy [12].
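The sketch below illustrates the ensemble idea with scikit-learn: each prediction averages many decision trees. The synthetic data is an assumption for illustration.

```python
# Sketch of a random forest regressor: an ensemble of decision trees whose
# averaged predictions become more robust as more data is available.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.05, 200)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:3]))    # each prediction averages 100 trees
```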
E. Multilayer Perceptron
The multilayer perceptron is a supervised deep artificial neural network algorithm used to predict the value of a dependent variable according to weights and biases [13]. The weights are updated continuously whenever a classification error is found. The first layer is the input layer; one or more hidden layers come next, where each layer computes a combination of the previous layer's outputs; and the final layer is the output layer, which makes decisions and predictions [14]. Both a forward pass and a backward pass are performed. In the forward pass, information flows from left to right, that is, from the input through the hidden layers to the output. In the backward pass, the weights are adjusted according to the gradient flowing in the opposite direction [14].
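As a minimal illustration of this input-hidden-output architecture, the sketch below trains scikit-learn's MLP regressor (which performs the forward and backward passes internally); the data is a synthetic stand-in.

```python
# Sketch of a multilayer perceptron regressor: input layer -> hidden layers
# -> output layer. Training alternates a forward pass (compute predictions)
# with a backward pass (adjust weights along the gradient).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 4))
y = 0.2 + X @ np.array([0.4, 0.3, 0.2, 0.1]) + rng.normal(0, 0.01, 300)

mlp = MLPRegressor(hidden_layer_sizes=(16, 8), max_iter=2000,
                   random_state=0).fit(X, y)
print(mlp.predict(X[:3]))
```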
III. RELATED WORK
A great number of studies have been carried out on graduate admission datasets using different types of machine learning algorithms. One impressive work by Acharya et al. [15] compared four different regression algorithms (Linear Regression, Support Vector Regression, Decision Trees, and Random Forest) to predict the chance of admission; the best model, with the least MSE, was multiple linear regression.

In addition, Chakrabarty et al. [16] compared linear regression and gradient boosting regression for predicting the chance of admission, pointing out that gradient boosting regression showed better results.

Gupta et al. [17] developed a model that studies the graduate admission process in American universities using machine learning techniques. The purpose of this study was to guide students in finding the best educational institution to apply to. Five machine learning models were built in this paper, including SVM (linear kernel), AdaBoost, and logistic classifiers.

Waters and Miikkulainen [18] proposed a remarkable approach that ranks graduate admission applications according to the level of acceptance and enhances the performance of reviewing applications using statistical machine learning.

Sujay [19] applied linear regression to predict the chance of admitting graduate students to master's programs as a percentage; however, no other models were evaluated.

IV. DATASET PROCESSING AND FEATURE SELECTION

A. Correlated variables
The dataset contained an independent variable representing the serial number of each application. Based on domain knowledge, it does not correlate with the dependent variable; hence, it was removed from the dataset. It is worth mentioning that tests showed there are no missing values in any row of the dataset.

B. Outliers
Outliers are data values that differ greatly from the majority of a set of data. There are many methods that can be used to find outliers, such as scatterplots and boxplots. In this paper, outliers are investigated using the boxplot method.

In a boxplot, the middle part of the plot spans the first and third quartiles. The line near the middle of the box represents the median. The whiskers on either side of the IQR cover the lowest and highest portions of the data; the ends of the whiskers mark the minimum and maximum values within the fences, and the individual circles beyond the whiskers represent outliers in the dataset [20].

Figure 1 Boxplot of Chance of Admit
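A plot like Figure 1 can be reproduced in a few lines with matplotlib; the sketch below is a minimal illustration, and the simulated values merely stand in for the paper's actual Chance of Admit column.

```python
# Sketch of producing a boxplot like Figure 1; the data here is simulated,
# not the paper's dataset.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
chance_of_admit = np.clip(rng.normal(0.72, 0.14, 400), 0.34, 0.97)  # stand-in

plt.boxplot(chance_of_admit)
plt.title("Boxplot of Chance of Admit")
plt.ylabel("Chance of Admit")
plt.show()
```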
As noticed from Figure 1, there are outliers beneath the lower boundary of the boxplot. There are many different methods to deal with such outliers. Note that the first and third quartiles are found using the following equations, where N is the number of data points:

First Quartile (Q1) = (N + 1) × 0.25    (2)
Third Quartile (Q3) = (N + 1) × 0.75    (3)
The summary statistics of the dependent variable Chance of Admit are:

Feature               Value
Minimum Value         0.3600
First Quartile (Q1)   0.6325
Median                0.7200
Mean                  0.7233
Third Quartile (Q3)   0.8200
Maximum Value         0.9700
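As an illustration of equations (2) and (3) and the statistics above, this sketch computes the quartiles with the (N + 1)-rank rule and flags points beyond the Tukey fences as outliers; the simulated data is an assumption standing in for the real column.

```python
# Sketch of the (N + 1)-rank quartile rule from equations (2) and (3) and a
# boxplot-style outlier check; simulated stand-in data only.
import numpy as np

rng = np.random.default_rng(0)
chance = np.sort(np.clip(rng.normal(0.72, 0.14, 400), 0.34, 0.97))

def quartile(sorted_vals, frac):
    """Value at rank (N + 1) * frac, interpolating between order statistics."""
    rank = (len(sorted_vals) + 1) * frac   # 1-based fractional rank
    lo = int(rank) - 1                     # 0-based index just below the rank
    hi = min(lo + 1, len(sorted_vals) - 1)
    w = rank - int(rank)
    return (1 - w) * sorted_vals[lo] + w * sorted_vals[hi]

q1, q3 = quartile(chance, 0.25), quartile(chance, 0.75)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = chance[(chance < low_fence) | (chance > high_fence)]
print(f"Q1={q1:.4f}  median={np.median(chance):.4f}  Q3={q3:.4f}")
print(f"points beyond the fences: {outliers.size}")
```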
Figure 2 Histogram of Chance of Admit
Figure 2 shows the histogram of the dependent variable Chance of Admit, with a skewness of −0.25, i.e., slightly skewed to the left.

The histograms of the most important independent variables, identified by the importance test, are also presented.
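For reference, the skewness statistic quoted alongside Figure 2 can be computed directly; this minimal sketch assumes SciPy and uses simulated stand-in values.

```python
# Sketch computing sample skewness as reported for the histograms;
# a negative value indicates a left-skewed distribution. Simulated data only.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
chance = np.clip(rng.normal(0.72, 0.14, 400), 0.34, 0.97)
print("skewness:", skew(chance))   # < 0 means left-skewed, as in Figure 2
```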
Figure 3 Histogram of CGPA

Figure 3 shows the histogram of CGPA, the most important independent variable, with a skewness of −0.0283553, i.e., slightly skewed to the left.

Figure 4 Histogram of GRE

Figure 4 shows the histogram of GRE scores, with a skewness of −0.04, i.e., slightly skewed to the left.

C. Normality Test
A normality test shows whether a parameter is normally distributed or not. The Shapiro-Wilk test is used to perform the normality test: if the p-value is greater than 0.05, the variable is considered normally distributed; otherwise, it is not.

Table 3 Shapiro-Wilk Normality Test

Table 3 displays the p-value of each parameter obtained from the Shapiro-Wilk test; all three parameters' p-values are less than 0.05. Therefore, the null hypothesis is rejected, and the variables are not normally distributed.
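The Shapiro-Wilk decision rule described above is straightforward to reproduce; the sketch below uses SciPy with a simulated stand-in sample.

```python
# Sketch of the Shapiro-Wilk normality test with SciPy: p < 0.05 rejects
# the null hypothesis of normality, matching the paper's criterion.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
sample = np.clip(rng.normal(0.72, 0.14, 400), 0.34, 0.97)  # stand-in column

stat, p_value = shapiro(sample)
print(f"W={stat:.4f}, p={p_value:.4g}")
if p_value < 0.05:
    print("Reject H0: not normally distributed")
else:
    print("Cannot reject H0: consistent with a normal distribution")
```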
D. Multicollinearity Issue
Multicollinearity is a serious issue that arises whenever an independent variable is highly correlated with one or more other independent variables in a multiple regression equation. If the variance inflation factor (VIF) is greater than 10, high multicollinearity is present. This problem can lead to an unstable regression model: any slight change in the data can cause a large change in the coefficients of the multiple linear regression model [21], [22].

In conclusion, there is no multicollinearity problem since all VIF values are less than 10. This also indicates that the regression model is stable.
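A minimal sketch of the VIF check using statsmodels; the feature names and simulated values are illustrative assumptions, and VIF > 10 would flag high multicollinearity.

```python
# Sketch of the VIF check with statsmodels; feature names are illustrative.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "GRE": rng.normal(316, 11, 400),
    "TOEFL": rng.normal(107, 6, 400),
    "CGPA": rng.normal(8.6, 0.6, 400),
})
X = add_constant(df)
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))
```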
E. Linear Regression
According to the linear regression model applied, the fitted regression equation is:

Regression model = −1.33 + (0.002 × GRE + 0.0026 × TOEFL + 0.005 × Uni.Rating + 0.004 × SOP + 0.013 × LOR + 0.118 × CGPA + 0.023 × Research)
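Applying the fitted equation is a matter of plugging in an applicant's scores. The sketch below wraps the paper's reported coefficients in a hypothetical helper function; the input values are made up for illustration.

```python
# Sketch applying the paper's fitted regression equation; the function name
# and the example applicant's scores are hypothetical.
def chance_of_admit(gre, toefl, uni_rating, sop, lor, cgpa, research):
    return (-1.33 + 0.002 * gre + 0.0026 * toefl + 0.005 * uni_rating
            + 0.004 * sop + 0.013 * lor + 0.118 * cgpa + 0.023 * research)

print(chance_of_admit(gre=320, toefl=110, uni_rating=4, sop=4.5,
                      lor=4.0, cgpa=9.0, research=1))   # about 0.77
```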
According to the Pr(>|t|) values from the linear regression test, all variables play a statistically significant role except for columns 3 and 4, which are Uni.Rating and SOP. Also, the R-squared value is 0.83, which means that 83% of the variation in our dataset can be explained by the model. The overall p-value is 2.2e-16, which is far less than 0.05, so we reject the null hypothesis and conclude that the model is statistically significant.
VI. RESULTS AND DISCUSSIONS

A. Statistical Test
According to the normality test, the dependent variable is not normally distributed. Therefore, a nonparametric test is performed using PHStat. The test used is a one-way ANOVA, which is performed to determine whether three or more samples have statistically significant differences between their means [23].

The test shows a p-value of 0.97, which is greater than 0.05; thus, the null hypothesis cannot be rejected, and the tests are not statistically different.
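The paper runs this test in PHStat; the equivalent one-way ANOVA can be sketched with SciPy as below, where the three groups are simulated stand-ins for the compared samples.

```python
# Sketch of a one-way ANOVA with SciPy (the paper uses PHStat); a p-value
# above 0.05 means the group means are not statistically different.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
group_a = rng.normal(0.72, 0.14, 100)   # stand-in samples
group_b = rng.normal(0.72, 0.14, 100)
group_c = rng.normal(0.72, 0.14, 100)

stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F={stat:.3f}, p={p_value:.3f}")  # p > 0.05 -> cannot reject H0
```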
B. Mean absolute error
The different regression models are run on the admission dataset through Weka in order to decide which model performs best based on the mean absolute error (MAE). The results are shown in Table 4.

Table 4 Performance Analysis
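The paper performs this comparison in Weka; an equivalent MAE comparison of the same four model families can be sketched in scikit-learn, with synthetic data standing in for the admission dataset.

```python
# Sketch comparing the four models by mean absolute error; the synthetic
# data is an assumption standing in for the admission dataset.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 7))
y = 0.1 + X @ rng.uniform(0.05, 0.2, 7) + rng.normal(0, 0.02, 400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Multiple linear regression": LinearRegression(),
    "k-nearest neighbor": KNeighborsRegressor(),
    "Random forest": RandomForestRegressor(random_state=0),
    "Multilayer perceptron": MLPRegressor(max_iter=2000, random_state=0),
}
for name, model in models.items():
    mae = mean_absolute_error(y_te, model.fit(X_tr, y_tr).predict(X_te))
    print(f"{name}: MAE={mae:.4f}")
```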
Authors' Contribution: Sara Aljasmi wrote the paper. Ali Bou Nassif, Ismail Shahin, and Ashraf Elnagar revised the experiments and the whole paper.