Educational Data Mining For Predicting Students' Academic Performance Using Machine Learning Algorithms
Article history:
Received 31 March 2021
Received in revised form 24 May 2021
Accepted 28 May 2021
Available online xxxx

Keywords:
Educational data mining
Regression
Academic performance
Prediction
Support vector regression

Abstract
Educational data mining has gained impressive attention in recent years. The primary focus of educational institutions is to provide quality education for students to enhance academic performance. The performance of students depends on several aspects, i.e., personal, academic, and behavioural features. The present study deals with predicting students' academic performance in a technical institution in India. A dataset was obtained using a questionnaire-based survey and the academic section of the chosen institution. Data pre-processing and factor analysis have been performed on the obtained dataset to remove the anomalies in the data, reduce its dimensionality, and obtain the most correlated features. The Python 3 tool is used for the comparison of machine learning algorithms. The support vector regression algorithm with a linear kernel (SVR_linear) provided superior prediction.

© 2021 Elsevier Ltd. All rights reserved.
Selection and peer-review under responsibility of the International Conference on Sustainable Materials, Manufacturing and Renewable Technologies 2021.
https://doi.org/10.1016/j.matpr.2021.05.646

Please cite this article as: P. Dabhade, R. Agarwal, K.P. Alameen et al., Educational data mining for predicting students' academic performance using machine learning algorithms, Materials Today: Proceedings, https://doi.org/10.1016/j.matpr.2021.05.646
performance measures used in the evaluation of the algorithms. Section 4 presents the results obtained and the inferences drawn. Section 5 outlines the conclusions from the present work.

2. Problem statement

The present study is performed to understand the various factors on which students' performance depends and to predict the performance for the upcoming semester based on students' data and past performance. Students who are not performing to their institution's expectations need to be spotted and offered special care to boost their performance. The objective is to identify and enumerate the possible factors affecting students' performance and that of the institution, and to compare and examine the selected attributes' relationships with students' academic performance. A prediction model has been developed that predicts students' performance based on the chosen attributes. The prediction model is demonstrated based on the results obtained from the analysis.

3. Research methodology

For the present study, students' performance in the previous semesters and their personal features are considered for the students of an educational institution in South India. The attributes used in the present study are extracted from the available literature, and some new attributes are added. A questionnaire-based survey was conducted, with 98 attributes and 112 questions framed in a Google Form consisting of sections dealing with personal, educational, behavioural, and extra-curricular details. Some of these features are listed in Table 1.

Table 1
Students' features considered for questionnaire preparation.

Category          Description
Personal          Demographic details; hobbies; interests; family income; time spent on social media and watching movies
Educational       Schooling details; entrance exam score; reasons for joining the academic program; past semesters' GPA; career goals; attention in academic studies; skills; internship details
Behavioural       Participation of students in academic activities; effect of friends circle; interaction with faculty
Extra-curricular  Participation of students in extracurricular activities, namely sports or club activities in college

A dataset of 85 students has been collected by circulating the Google Form among the final-year undergraduate students of a particular specialization. Other specific academic details, such as the grade point average (GPA) and admission details of the students, have been collected from the academic section of the institution to achieve better accuracy.

3.1. Data pre-processing

The collected data have been processed and prepared for analysis using the following techniques:

Cleaning
Integration
Transformation
Feature Scaling
Reduction/Factor Analysis

Data obtained from the questionnaire-based form and the academic section of the institution were integrated in MS Excel. Data cleaning removes irrelevant, insufficient, or inadequate data and fills the null values. For the present model, data cleaning was performed by manually checking the data in the Excel sheet and by using the pandas, matplotlib, and seaborn libraries with Python 3 in a Jupyter notebook. Along with cleaning, data visualization was also carried out. The data have been corrected by dropping some irrelevant/insufficient attributes and generalizing the labels for multiple-response data, e.g., hobbies and interests. Categorical data have then been transformed into numerical values using one-hot encoding and integer encoding, after dividing the categorical data into two sets, i.e., nominal and ordinal data.

Regions with dashed white dots in Fig. 1 indicate the missing values or irrelevant data in the current study's raw dataset. The raw data contain responses to the Google Form questionnaire in their original form. Cleaning of the data resulted in a 100% cleaned dataset, shown in Fig. 2. The cleaned data contain all the variables corresponding to the student attributes used in the research. Some of the variables used in Fig. 1 and Fig. 2 are listed in the Appendix. Due to size restrictions on the plots, not all variables are shown in Fig. 1 and Fig. 2.

Feature scaling is a technique to scale the features of the data into a specific range. In the present study, it was performed to weigh each attribute on the same scale and to handle the variance in the magnitudes of the features, so as to achieve better results from the algorithms. Data reduction was performed to lower the dimensionality of the data by decreasing the number of features without losing the information they carry. Principal Component Analysis (PCA) has been used to reduce the dimensionality of the data [9]. PCA is based on the correlation matrix of the features involved. Fig. 3 shows the correlation matrix for some of the used features. Reducing the dimensionality of data without losing its value provides easy data visualization and faster performance from machine learning algorithms. The dataset with 92 features has been reduced to a dimensionality of 2, while the grade point average of the semester has been kept untouched.

3.2. Model and algorithm

Machine learning is categorised into supervised learning and unsupervised learning. Supervised learning can further be classified into regression and classification algorithms. A regression model is applied for continuous data, while classification is applied for discrete data. In the present study, multiple linear regression and support vector regression are used. To evaluate the effect of the features and attributes on the desired result, built-in models of machine learning algorithms have been selected from scikit-learn and implemented in the Jupyter notebook using the Python programming language.

3.3. Linear regression

In linear regression, there exists a linear relationship between the input and the output, as shown below [10]:

y = a + bx + e

Here, x: input variable, y: output variable, a: y-intercept, b: slope, and e: error term.

Multiple Linear Regression (MLR) consists of the linear relationship between two or more input variables and the desired output and is formulated as follows:

y = a1x1 + a2x2 + … + anxn + e

where a1, a2, …, an are the coefficients of the factors x1, x2, …, xn, respectively.

Polynomial regression models the non-linear relationship between the input variables and the corresponding conditional mean of y, denoted as E(y|x).
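As a minimal sketch of the multiple linear regression step described above, the scikit-learn model named in Section 3.2 can be fitted as follows. The data here are synthetic placeholders, not the study's dataset: the real model used 92 pre-processed student attributes with the sixth-semester GPA as the target.

```python
# Sketch of fitting MLR (y = a1*x1 + ... + an*xn + e) with scikit-learn.
# X and y are synthetic stand-ins for the scaled student attributes and GPA.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((85, 5))  # 85 students, 5 illustrative features
y = 6.0 + X @ np.array([0.5, 0.3, 0.1, 0.4, 0.2]) + rng.normal(0.0, 0.1, 85)

# The study used an 80/20 train/test split of the 85 records.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
mlr = LinearRegression().fit(X_train, y_train)
print("R2 on the held-out 20%:", round(mlr.score(X_test, y_test), 3))
```

`LinearRegression.score` returns the R-squared value used throughout the paper as the accuracy measure.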
In general, we can model the expected value of y as an nth-degree polynomial, yielding the general polynomial regression model given by the equation below:

y = a0 + a1x + a2x² + a3x³ + … + anxⁿ + e

where E(y|x) is the conditional mean and a1, a2, …, an are the coefficients of the factors x, x², …, xⁿ, respectively.

3.4. Support vector regression

Support Vector Machine (SVM) is an algorithm used in both regression and classification techniques. SVM applied to regression is known as Support Vector Regression (SVR). SVR is similar to linear regression for 2-D features, where the equation of the line is as shown below:

y = a + bx

In SVR, the straight line separating the independent attributes is called a hyperplane. The boundary lines are plotted using the data points called support vectors, which are the data points nearest to the hyperplane. Ordinary regression models try to minimize the error between the actual and the predicted values, whereas SVR attempts to fit the best line within a threshold ε (the distance between the hyperplane and the boundary line). Thus, SVR tries to fulfil the condition −ε < y − a − bx < ε. The value is predicted using the points along this boundary. In the present research, SVR models with linear, rbf, and polynomial kernels are used. For SVR to work with non-linear models, kernels are required. A kernel is a function that takes the data as input and transforms it into the desired form. In the case of non-linear data, the kernel projects it into a higher-dimensional space called the feature space. Then, a linear model is used in this new high-dimensional feature space. The linear model in the feature space corresponds to a non-linear model in the input space. The choice of kernel depends on the type of data: the linear kernel works best with linear data, whereas more complex data require the rbf and polynomial kernels.

3.5. Performance measures

The following performance measures are used to evaluate the regression models:

Mean Absolute Error (MAE):    MAE = (1/N) Σ |Yi − Ŷi|

Mean Squared Error (MSE):     MSE = (1/N) Σ (Yi − Ŷi)²

Root Mean Squared Error (RMSE):    RMSE = √[(1/N) Σ (Yi − Ŷi)²]
R-Squared (R²):    R² = 1 − SSRES/SSTOT = 1 − Σ (Yi − Ŷi)² / Σ (Yi − Ȳ)²

where Yi: actual expected output, Ŷi: predicted value, Ȳ: mean of the actual values, and N: number of data points.

4. Results and discussion

According to the methodology used, after data pre-processing, out of the 85 students' records collected, 80% of the data was used as the training dataset and 20% was kept as the test dataset. All the selected attributes, along with the GPAs up to the fifth semester, were taken as input variables, and the output variable was the sixth-semester GPA of the final-year students of the chosen institution, used to train and test the model. The data were modelled on the training dataset using multiple linear regression and support vector machines, predictions were made for the test data, and the results were analysed and compared. The R-squared score shows the accuracy of the model. The R-squared score must be high, and the error metrics, i.e., the mean squared error and the root mean squared error, must be low to obtain the best fit, as shown in Table 2.

Fig. 4 shows that the R² score decreases with increasing polynomial order. Hence, the best fit is obtained for order 1, i.e., linear regression, for the available dataset. Fig. 5 shows the goodness of fit obtained using MLR for the available dataset, with an R² score of 83.27%. The results were also analysed by checking the variation in the model fit while changing the test set size from 20% to 30%. From the plot shown in Fig. 9, it can be noted that as the test data size increases and the training data size decreases, the R² score of the model drops because of the small dataset (only 85 students' data). As the dataset grows, it may become possible to obtain better accuracy even with a larger percentage of the data used for testing. From Figs. 6, 7 and 8, the SVR_rbf algorithm provides an accuracy of 49.61%, showing a bad fit for the given dataset. SVR_linear provides an R² score of 83.44%, which is nearly equal to that of multiple linear regression. The best R² score was obtained by the SVR model with the linear kernel, followed by MLR and SVR_poly with degree 1. The SVR model with the rbf kernel was also applied, but it provides a lower R² score. From these results, it can be inferred that our dataset is linear. The prediction power of an algorithm can be identified by the values of the performance measures, as summarised in Table 2.
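The kernel comparison and the error metrics of Section 3.5 can be sketched with scikit-learn as follows. The data below are synthetic stand-ins (assumed, not the study's dataset), shaped to mimic the 85 records reduced to two PCA components.

```python
# Sketch of comparing SVR kernels using MAE, MSE, RMSE and R2.
# Synthetic data: 85 samples with 2 features, mimicking the PCA-reduced dataset.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(1)
X = rng.random((85, 2))
y = 5.0 + 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0.0, 0.05, 85)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}
for kernel in ("linear", "rbf", "poly"):
    pred = SVR(kernel=kernel).fit(X_tr, y_tr).predict(X_te)
    mse = mean_squared_error(y_te, pred)
    results[kernel] = {
        "MAE": mean_absolute_error(y_te, pred),
        "MSE": mse,
        "RMSE": mse ** 0.5,  # RMSE is the square root of MSE
        "R2": r2_score(y_te, pred),
    }

for kernel, m in results.items():
    print(kernel, "R2 =", round(m["R2"], 3))
```

On a linear target such as this, the linear kernel would be expected to score well, mirroring the paper's observation that SVR_linear gave the best fit on its (linear) dataset.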
Table 2
Performance measures for regression.

Performance parameter    MLR        SVR (kernel = linear)    SVR (kernel = rbf)    SVR (kernel = poly)
MAE                      0.44079    0.41413                  0.45836               0.41261
MSE                      0.23040    0.22813                  0.69424               0.25302
RMSE                     0.48000    0.47763                  0.83321               0.50301
R² score                 0.83278    0.83442                  0.49613               0.81637
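The test-size sensitivity check described in Section 4 (varying the test fraction from 20% to 30%, cf. Fig. 9) can be sketched as a simple loop; the data here are again synthetic placeholders.

```python
# Sketch of refitting the model while the test fraction grows from 20% to 30%,
# recording the R2 score at each split size (synthetic stand-in data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.random((85, 5))
y = X @ np.array([0.6, 0.2, 0.3, 0.1, 0.4]) + rng.normal(0.0, 0.1, 85)

scores = {}
for frac in (0.20, 0.25, 0.30):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=frac, random_state=42)
    model = LinearRegression().fit(X_tr, y_tr)
    scores[frac] = model.score(X_te, y_te)  # R2 on the held-out fraction

print({f: round(s, 3) for f, s in scores.items()})
```

With only 85 records, larger test fractions leave fewer training samples, which is the mechanism the paper cites for the falling R² score.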