

Educational data mining for predicting students' academic performance using machine learning algorithms

Pranav Dabhade a,*, Ravina Agarwal a, K.P. Alameen a, A.T. Fathima a, R. Sridharan a, G. Gopakumar b

a Department of Mechanical Engineering, National Institute of Technology, Calicut 673601, Kerala, India
b Department of Computer Science and Engineering, National Institute of Technology, Calicut 673601, Kerala, India

Article info

Article history:
Received 31 March 2021
Received in revised form 24 May 2021
Accepted 28 May 2021
Available online xxxx

Keywords:
Educational data mining
Regression
Academic performance
Prediction
Support vector regression

Abstract

Educational data mining has gained impressive attention in recent years. The primary focus of educational institutions is to provide quality education for students to enhance academic performance. The performance of students depends on several aspects, i.e., personal, academic, and behavioural features. The present study deals with predicting students' academic performance in a technical institution in India. A dataset was obtained using a questionnaire-based survey and the academic section of the chosen institution. Data pre-processing and factor analysis have been performed on the obtained dataset to remove the anomalies in the data, reduce the dimensionality of the data and obtain the most correlated features. The Python 3 tool is used for the comparison of machine learning algorithms. The support vector regression_linear algorithm provided superior prediction.

© 2021 Elsevier Ltd. All rights reserved.
Selection and peer-review under responsibility of the International Conference on Sustainable Materials, Manufacturing and Renewable Technologies 2021.

1. Introduction

Students' academic performance in institutions is a source of great concern and interest for most researchers and for the parents, institutions, and government of every country. It is essential for educational institutions to monitor their students' academic performance and take respective improvement measures [1]. Educators should evaluate students' performance in a university or college to meet the set objectives and foster an environment of continuous improvement [2]. Several factors are considered for assessing the performance of an educational institution. Based on these factors, the institution needs to enhance its ranking. Educational institutions are concerned with providing quality education to obtain outstanding performance. Students' academic performance is reflected as a key factor in the rankings of higher educational institutions [3]. Educators can obtain insights on the obstacles encountered by students in achieving superior academic performance. An educator can upskill the students and implement remedial actions to improve students' poor performance [4]. The present study illustrates the methodology to implement data mining techniques for performance prediction and generates the required results. Thus, the present study focuses on implementing performance prediction techniques and provides information about the high potential of data mining techniques. Educational data mining (EDM) is a methodology adopted to extract significant knowledge and patterns from academic databases [5,6].

In the present study, data collection is accomplished using a questionnaire-based survey from students and data from the academic section of the chosen institute. Student features can be classified into personal features, educational features, and behavioural features. The behavioural features consist of attributes linked to students' learning experience [7]. It is observed that classification techniques and regression are widely considered for prediction. Authors in [8] have used a KNIME-based data mining technique and other algorithms to predetermine the probability of graduation of students so that early intervention may be deployed. Only significant academic attributes have been considered for prediction in most of the past studies. There is a need to consider students' background statistics, behavioural features and extracurricular activities.

The remaining sections of the paper are structured as follows. Section 2 provides the problem statement and the objective of the research presented in this paper. Section 3 describes the methodology adopted, which includes details about data pre-processing, details of the machine learning algorithms and the performance measures used in the evaluation of the algorithms. Section 4 provides the results obtained and the inferences drawn. Section 5 outlines the conclusions from the present work.

* Corresponding author.
E-mail address: [email protected] (P. Dabhade).

https://doi.org/10.1016/j.matpr.2021.05.646
2214-7853/© 2021 Elsevier Ltd. All rights reserved.

2. Problem statement

The present study is performed to understand the various factors on which students' performance depends and to predict the performance for the upcoming semester based on students' data and past performance. Students who are not performing to their institution's expectations need to be spotted and offered special care to boost their performance. The objective is to identify and enumerate the possible factors affecting the performance of students and of the institution, and to compare and examine the selected attributes' relationships with students' academic performance. A prediction model has been developed that predicts students' performance based on the chosen attributes. The prediction model is demonstrated based on the results obtained from the analysis.

3. Research methodology

For the present study, students' performance in the previous semesters and their personal features are considered for the students of an educational institution in South India. The attributes used in the present study are extracted from the available literature, and some new attributes are added. A questionnaire-based survey has been conducted with 98 attributes and 112 questions framed in a Google form consisting of sections dealing with personal, educational, behavioural, and extra-curricular details. Some of these features are mentioned in Table 1.

Table 1
Students' features considered for questionnaire preparation.

Personal: Demographic details; hobbies; interests; family income; time spent on social media and watching movies
Educational: Schooling details; entrance exam score; reasons for joining the academic program; past semesters' GPA; career goals; attention in academic studies; skills; internship details
Behavioural: Participation of students in academic activities; effect of friends circle; interaction with faculties
Extra-curricular: Participation of students in extracurricular activities, namely sports or club activities in college

A dataset of 85 students has been collected by circulating the Google form among the final year undergraduate students of a particular specialization. Other specific academic details, such as the grade point average (GPA) and admission details of the students, have been collected from the academic section of the institution to achieve better accuracy.
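As an illustration of this integration of the two data sources, the sketch below merges a hypothetical export of the Google-form responses with the academic-section records using pandas; the file names, the roll_number key and the expected shape are assumptions for illustration, not the study's actual files.

import pandas as pd

# Hypothetical spreadsheet exports; the survey responses and the academic-section
# records were integrated before analysis (illustrative file and column names).
survey = pd.read_excel("survey_responses.xlsx")      # questionnaire items per student
academics = pd.read_excel("academic_records.xlsx")   # GPA and admission details

# Join the two sources on a common student identifier (assumed column name).
dataset = survey.merge(academics, on="roll_number", how="inner")

print(dataset.shape)  # roughly (85, total number of combined attributes) for such a cohort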
3.1. Data pre-processing

The collected data have been processed and prepared for analysis using the following techniques:

 Cleaning
 Integration
 Transformation
 Feature Scaling
 Reduction/Factor Analysis

Data obtained from the questionnaire-based form and the academic section of the institution were integrated with MS-Excel. Data cleaning removes the irrelevant, insufficient, or inadequate data and fills the null values. For the present model, data cleaning was performed by manually checking the data in the Excel sheet and by using the pandas, matplotlib, and seaborn libraries with the Python 3 tool in a Jupyter notebook. Along with cleaning, data visualization was also carried out. The data have been corrected by dropping some irrelevant/insufficient attributes and generalizing the labels for multiple-response data, e.g., hobbies and interests. Categorical data have then been transformed into numerical values using one-hot encoding and integer encoding, after dividing the categorical data into two sets, i.e., nominal and ordinal data.
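A minimal sketch of this nominal/ordinal encoding step with pandas is given below; the column names, category ordering and toy values are illustrative assumptions rather than the study's actual questionnaire labels.

import pandas as pd

# Toy frame standing in for a few questionnaire columns (illustrative names only).
df = pd.DataFrame({
    "hobby": ["Sports", "Music", None, "Sports"],
    "attention_in_lectures": ["Low", "High", "Medium", "High"],
    "sem5_gpa": [7.8, 8.9, 6.5, 9.1],
})

# Cleaning: fill missing labels with a generalized category.
df["hobby"] = df["hobby"].fillna("No hobbies")

# Nominal data -> one-hot encoding.
df = pd.get_dummies(df, columns=["hobby"], prefix="hobby")

# Ordinal data -> integer encoding with an explicit order.
order = {"Low": 0, "Medium": 1, "High": 2}
df["attention_in_lectures"] = df["attention_in_lectures"].map(order)

print(df.head())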
The regions with dashed-dot white in Fig. 1 indicate the missing values or irrelevant data in the current study's raw data set. The raw data contain the responses to the Google form questionnaire in their original form. Cleaning of the data resulted in a 100% cleaned dataset, shown in Fig. 2. The cleaned data contain all the variables corresponding to the student attributes used in the research. Some of the variables used in Fig. 1 and Fig. 2 are listed in the Appendix. Due to size restrictions on the plots, not all the variables are shown in Fig. 1 and Fig. 2.

Feature scaling is a technique to scale the features of the data into a certain specific range. In the present study, it was performed to weigh each attribute on the same scale and to handle the variance in the magnitudes of the features, so as to achieve better results from the algorithms. Data reduction was performed to reduce the dimensionality of the data by decreasing the number of features without losing the information they carry. Principal Component Analysis (PCA) has been used to lower the dimensionality of the data [9]. PCA is based on the correlation matrix of the features involved. Fig. 3 shows the correlation matrix for some of the used features. Reducing the dimensionality of the data without losing its value provides easy data visualization and faster performance from the machine learning algorithms. The dataset with 92 features has been reduced to a dimensionality of 2, while the grade point average of the semester has been kept untouched.
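The scaling and dimensionality-reduction steps can be reproduced with scikit-learn roughly as sketched below; the randomly generated matrix simply stands in for the 85 x 92 encoded attribute table, with the target GPA kept aside as described above.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(85, 92))        # stand-in for the 92 encoded student features
y = rng.uniform(6.0, 10.0, size=85)  # stand-in for the semester GPA (kept untouched)

# Feature scaling: put every attribute on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# Factor-analysis step: reduce the dimensionality to 2 with PCA.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                      # (85, 2)
print(pca.explained_variance_ratio_.sum())  # variance retained by the 2 components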
3.2. Model and algorithm

Machine learning is categorised as supervised learning and unsupervised learning. Supervised learning can further be classified into regression and classification algorithms. A regression model is applied for continuous data, while classification is applied for discrete data. In the present study, multiple linear regression and support vector regression are used. To evaluate the effect of the features and attributes on the desired result, built-in models of the machine learning algorithms have been selected from scikit-learn and implemented in a Jupyter notebook using the Python programming language.

3.3. Linear regression

In linear regression, there exists a linear relationship between the input and the output, as shown below [10]:

y = a + bx + e

Here, x: input variable, y: output variable, a: y-intercept, b: slope, and e: error term.

Multiple Linear Regression (MLR) consists of the linear relationship between two or more input variables and the desired output and is formulated as follows:

y = a1x1 + a2x2 + ... + anxn + e

where a1, a2, ..., an are the coefficients of the factors x1, x2, ..., xn, respectively.
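A sketch of fitting the MLR model of this subsection with scikit-learn is shown below; the synthetic inputs merely stand in for the reduced feature matrix and the sixth-semester GPA target, so the fitted coefficients are only illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(85, 2))  # stand-in for the reduced student features
y = 8.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.2, size=85)

# Fits y = a + a1*x1 + a2*x2 by ordinary least squares.
mlr = LinearRegression()
mlr.fit(X, y)

print(mlr.intercept_, mlr.coef_)  # estimated intercept a and coefficients a1..an
print(mlr.predict(X[:3]))         # predicted GPA for the first three students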

Fig 1. Uncleaned/raw data plot.

Polynomial regression implements a non-linear relationship between the input variables and the corresponding conditional mean of y, denoted as E(y|x). In general, we can model the expected value of y as an nth degree polynomial, yielding the general polynomial regression model given by the equation shown below:

y = a0 + a1x + a2x^2 + a3x^3 + ... + anx^n + e

where E(y|x) is the conditional mean and a1, a2, ..., an are the coefficients of the factors x, x^2, ..., x^n, respectively.

3.4. Support vector regression

Support Vector Machine (SVM) is an algorithm practiced in regression and classification techniques. SVM applied to regression is known as Support Vector Regression (SVR). SVR is similar to linear regression for 2-D features, where the equation of the line is as shown below:

y = a + bx

In SVR, the straight line separating the independent attributes is called a hyperplane. The boundary lines are plotted using the data points called support vectors, which are the data points very near to the hyperplane. Ordinary regression models try to minimize the error between the actual and the predicted value, whereas SVR attempts to fit the best line within a threshold value a (the distance between the hyperplane and the boundary line). Thus, SVR tries to fulfil the condition -a < y - (a + bx) < a. The value is predicted using the points along this boundary. In the present research, SVR models with linear, rbf and polynomial kernels are used. For SVR to work with nonlinear models, kernels are required. The kernel is a function that takes the data as input and transforms it into the desired form. In the case of non-linear data, the kernel projects it into a higher dimensional space called the feature space. Then, a linear model is used in this new high-dimensional feature space. The linear model in the feature space corresponds to a non-linear model in the input space. The choice of kernel depends on the type of data: the linear kernel works best with linear data, whereas more complex data require the rbf and polynomial kernels.
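The three SVR variants used in this work can be instantiated in scikit-learn roughly as sketched below; epsilon plays the role of the threshold a (the tube half-width), and the hyperparameter settings shown are defaults/assumptions rather than the study's tuned values.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.normal(size=(85, 2))  # stand-in for the reduced student features
y = 8.0 + 0.5 * X[:, 0] + rng.normal(scale=0.2, size=85)

# epsilon is the half-width of the tube around the hyperplane (the threshold "a" above).
models = {
    "SVR_linear": SVR(kernel="linear", epsilon=0.1),
    "SVR_rbf": SVR(kernel="rbf", epsilon=0.1),
    "SVR_poly": SVR(kernel="poly", degree=1, epsilon=0.1),
}

for name, model in models.items():
    model.fit(X, y)
    print(name, model.predict(X[:2]))  # predictions for the first two students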

Fig 2. Cleaned data set plot.

3.5. Performance measures

The following performance measures are used to evaluate the regression models:

 Mean Absolute Error (MAE): MAE = (1/N) Σ_i |Y_i − Ŷ_i|
 Mean Squared Error (MSE): MSE = (1/N) Σ_i (Y_i − Ŷ_i)²
 Root Mean Squared Error (RMSE): RMSE = √[ (1/N) Σ_i (Y_i − Ŷ_i)² ]
 R-squared: R² = 1 − SS_RES/SS_TOT = 1 − [ Σ_i (Y_i − Ŷ_i)² ] / [ Σ_i (Y_i − Ȳ)² ]

where Y_i is the actual (expected) output, Ŷ_i is the predicted value, Ȳ is the mean of the actual values, and N is the number of data points; each sum runs over i = 1, ..., N.
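These measures map directly onto scikit-learn's metrics module, as the sketch below illustrates with made-up GPA values; RMSE is obtained as the square root of the MSE returned by mean_squared_error.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative actual and predicted sixth-semester GPAs (not the study's data).
y_true = np.array([8.2, 7.5, 9.0, 6.8, 8.8])
y_pred = np.array([8.0, 7.9, 8.7, 7.1, 8.5])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)            # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)  # 1 - SS_RES / SS_TOT

print(f"MAE={mae:.4f}  MSE={mse:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")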


Fig 3. Correlation plot.

4. Results and discussion

According to the methodology used, after data pre-processing, out of the total of 85 students' records collected, 80% of the data was used as the training dataset and 20% was kept as the test dataset. The effects of all the selected attributes, along with the GPAs up to the fifth semester, were taken into account as input variables, and the output variable was the sixth-semester GPA of the final year students of the chosen institution, used to train and test the model. The data have been modelled for the training dataset using multiple linear regression and support vector machines, and predictions were made for the test data. The results have been analysed and compared. The R-squared score shows the accuracy of the model. The R-squared score must be high, together with low error metrics (mean squared error or root mean squared error), to obtain the best fit, as shown in Table 2.

Fig. 4 shows that the R² score decreases with increasing order of the polynomial. Hence, the best fit is obtained for order 1, i.e., linear regression, for the available dataset. Fig. 5 shows the goodness of fit obtained using MLR for the available dataset, with an R² score of 83.27%. The results have also been analysed by checking the variation in the model fit when changing the test set size from 20% to 30%. From the plot shown in Fig. 9, it can be noted that as the test data size increases and the training data size decreases, the R² score of the model drops because of the availability of only a small dataset (85 students' data). As the data size increases, it may even be possible to obtain better accuracy with a larger percentage of data used for testing. From Figs. 6, 7 and 8, the SVR_rbf algorithm provides an accuracy of 49.61%, showing a bad fit for the given dataset. SVR_linear provides an R² score of 83.44%, which is nearly equal to that of multiple linear regression. The best R² score was obtained by the SVR model with the linear kernel, followed by MLR and SVR_poly with degree 1. The SVR model with the rbf kernel was also applied, but it provides a lower R² score. From these results, it can be inferred that our dataset is linear. The prediction power of an algorithm can be identified by the values of the evaluation measures.
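The 80/20 split and the 20%–30% test-size sweep reported here can be reproduced along the lines of the following sketch; the synthetic data again stand in for the 85-student feature matrix and the sixth-semester GPA target, so the printed scores are not the study's results.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)
X = rng.normal(size=(85, 2))  # stand-in for the reduced student features
y = 8.0 + 0.6 * X[:, 0] + rng.normal(scale=0.3, size=85)

# Sweep the held-out fraction from 20% to 30%, as examined in Fig. 9.
for test_size in (0.20, 0.25, 0.30):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=42)
    model = SVR(kernel="linear").fit(X_tr, y_tr)
    print(f"test_size={test_size:.2f}  R2={r2_score(y_te, model.predict(X_te)):.3f}")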


Table 2
Performance measures for regression.

Performance parameter | MLR     | SVR (kernel = linear) | SVR (kernel = rbf) | SVR (kernel = poly)
MAE                   | 0.44079 | 0.41413               | 0.45836            | 0.41261
MSE                   | 0.23040 | 0.22813               | 0.69424            | 0.25302
RMSE                  | 0.48000 | 0.47763               | 0.83321            | 0.50301
R2 score              | 0.83278 | 0.83442               | 0.49613            | 0.81637

Fig 6. Actual vs predicted GPA for SVR (kernel = linear).

Fig 4. Accuracy vs order of polynomials.

Fig 7. Actual vs predicted GPA for SVR (kernel = rbf).

Fig 5. Actual vs predicted GPA for MLR.

Among the evaluation measures, RMSE and R-squared are the most significant ones. The measures MAE, MSE and RMSE give the error values of the prediction: MAE is the mean of the absolute errors, MSE is the mean of the squared errors, and RMSE is the square root of MSE. The R-squared measure represents the proportion of the variance of the dependent variable that is explained by the independent variables in a regression model, and thus indicates how much of the output variability the model explains. MAE, MSE and RMSE are better suited to comparing the performance of different regression models; RMSE is the preferred measure, MSE can also be used when the errors are not too large, and MSE and RMSE penalize large errors more heavily than MAE. Hence, a higher R-squared and a lower RMSE are desired, and both are exhibited by the SVR_linear model.

5. Conclusion and future scope

In the present study, machine learning algorithms such as multiple linear regression, support vector regression_rbf, support vector regression_poly and support vector regression_linear are applied for determining the academic performance of the final year undergraduate students of the chosen academic institution. The resulting dataset had 112 variables. After detailed analysis, 92 variables have been chosen for further study. Using factor analysis, the dimensionality has been reduced to 2. The dataset has been split into a training dataset and a test dataset, and the model was trained. It is observed that the linear model provides the best fit, with an accuracy of 83.44%. Further, the proposed models provide concrete evidence that recent past performance is most important for the prediction of future performance. For a larger data set, variations can be observed in the accuracy of prediction. The obtained results show that there is a relation between students' behavioural features and academic performance. With a larger dataset, better training of the model can be achieved, and such enhanced training will lead to better prediction. The present research can be extended by adopting other machine learning algorithms, namely neural networks and other regression algorithms.

CRediT authorship contribution statement

Pranav Dabhade: Investigation, Data curation, Writing - original draft, Methodology, Project administration, Visualization. Ravina Agarwal: Investigation, Data curation, Methodology, Visualization, Writing - original draft. K.P. Al Ameen: Formal analysis, Methodology, Software, Validation, Visualization. A.T. Fathima Raniya: Investigation, Methodology, Visualization, Writing - original draft. R. Sridharan: Conceptualization, Resources, Supervision, Writing - review & editing. G. Gopakumar: Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Fig 8. Actual vs predicted GPA for SVR (kernel = polynomial).
Fig 9. Actual vs predicted GPA for MLR with different test sizes.

Appendix

List of variables used in Fig. 1

Column1, Username, Category, State/UT, How fluent are you in English, How many members are there in your family, Who all earns in your family, Your annual family income, Mother's occupation, Do any of your family member motivates you to study, Height (in cm), Do you include any of the following fitness activities, Which sports do you play, Have any previous serious health issues/accidents/injuries that affected your study?, Which of the following habits do you have?, The avearge time you spent weekly on online/computer games (in hours), Skills you possess, Average number of hours you sleep daily (approximately), Do you have an addiction to online platforms (like Netflix, Amazon Prime, You tube, etc.), How often do you use Whatsapp?, Do you have a girlfriend/boyfriend?, Medium of primary schooling, Location of primary school, Type of secondary school, Board of class 10th, Did coaching/tution in class 10th?, Marks in 10th standard (% or CGPA), Board of class 12th, Subjects taken in class 12th other than PCM, Did coaching/tution in class 12th?, JEE-Advanced General Rank, Did coaching for JEE preparation?, Number of year attempts of JEE-main to get into NITC, Reason for joining B.Tech, Least favourite subject in B.Tech, Do you use any online e-learning platform?, Your experience with online learning, Your interest in your own branch studies, Gadgets available for online learning/classes, Have you been taking any scholarships, How many hours do you give for self study on average per week?, Was/Is that internship pertains to your branch of study?, Are you aware of the placement process?, What type of learner are you?, How attentive are you in lectures?, Did you or any of your family member dignosed with COVID?, How did COVID affected your family financially?, How often do you use Linkedin?, How often do you have discussions with friends related to academic activities?, Do your friends encourage and guide you in your studies, How actively do you participate in the academic tasks assigned to you?, Your personality type, How often do you participate in the academic tasks assigned to you?, Your personality type?, How often do you participate in seminars/workshops/expert talks/placement talks?, Are you a part of any Technical club at NITC?, Do you take part in sports tournaments?, Are you the representative of any student body? (e.g. SAC member/CR/BR/PR/Hostel & Mess reps., etc.), Entrance Rank, sem-1 gpa, sem-3 gpa, SEM-5 GPA, CGPA

List of variables used in Fig. 2

Roll_number, CIWG, OBC, Bihar, Karnsataka, Odisha, Telangana, English_fluency, Are you doing any part-time job, Do any of your family members motivate you to study, Which of the following habits do you have?, Average number of hours you sleep daily (approximately), Do you have a girlfriend/boyfriend?, Medium of secondary schooling, Hosteller/Day Scholar in class 10th, Marks in 10th standard (% or CGPA), Hosteller/Day Scholar in class 12th, Marks in 12th standard (% or CGPA), Reason for joining B.Tech, Gadgets available for online learning/classes, Was/Is that internship pertains to your branch of study?, How attentive are you in lectures?, How did COVID affected your family financially?, How your friend circle affects your studies?, Entrance Rank, sem-3 gpa, sem-6 gpa, Hindi, Business_f, Financial advisor_f, Labour_f, Retired employee_f, Working in educational institutions_f, House wife, Working in eduactional institutions_m, Entertainment, No hobbies, Sports, Business & Management, No other interest, Sports & Fitness, Leadership, Problem-solving, Biology, Electronics, Moral Science, Don't know, Jobs in pvt. sector, Start-Up, The geographical location of your home-town_Semi-Urban, Have you been taking any scholarships?_Yes, What type of learner are you?_Regular, Do you prefer a group study?_No, How often do you interact with faculties?_Frequently, Your personality type?_Ambivert, How often do you participate in Seminars/Workshops/expert talks/Placement talks?_Frequently, How often do you participate in Seminars/Workshops/expert talks/Placement talks?_Sometimes, Do you take part in sports tournaments?_No, Are you the representative of any student body? (e.g. SAC member/CR/BR/PR/Hostel & Mess reps., etc.)_Yes, Entrance Exam_DASA, Have_pressure_from_family, whatsapp_veryfreq, Coaching after 12th class (year drop), During class 12th only, Not_interested_programming

List of variables used in Fig. 3

Male, Female, Entrance Rank, sem-1 gpa, sem-2 gpa, sem-3 gpa, sem-4 gpa, sem-5 gpa, 10th_marks, 12th_marks

References

[1] R. Singh, S. Pal, Machine learning algorithms and ensemble technique to improve prediction of students performance, International Journal of Advanced Trends in Computer Science and Engineering 9 (3) (2020) 3970-3976.
[2] H. Talal, S. Saeed, A study on adoption of data mining techniques to analyze academic performance, ICIC Express Letters, Part B: Applications 10 (8) (2019) 681-687.
[3] W.F.W. Yaacob, S.A.M. Nasir, W.F.W. Yaacob, N.M. Sobri, Supervised data mining approach for predicting student performance, Indonesian Journal of Electrical Engineering and Computer Science 16 (3) (2019) 1584-1592.
[4] A. Almasri, E. Celebi, R.S. Alkhawaldeh, EMT: ensemble meta-based tree model for predicting student performance, Scientific Programming (2019), art. no. 3610248.
[5] J. Sultana, M. Usha Rani, M.A.H. Farquad, Student's performance prediction using deep learning and data mining methods, International Journal of Recent Technology and Engineering 8 (1 Special Issue 4) (2019) 1018-1021.
[6] F. Makombe, M. Lall, A predictive model for the determination of academic performance in private higher education institutions, International Journal of Advanced Computer Science and Applications 11 (9) (2020) 415-419.
[7] E.A. Amrieh, T. Hamtini, I. Aljarah, Mining educational data to predict student's academic performance using ensemble methods, Int. J. Database Theory Appl. 9 (8) (2016) 119-136.
[8] A.I. Adekitan, O. Salau, The impact of engineering students' performance in the first three years on their graduation result using educational data mining, Heliyon 5 (2) (2019), art. no. e01250.
[9] A. Kumar, K.K. Eldhose, R. Sridharan, V.V. Panicker, Students academic performance prediction using regression: a case study, The International Conference on Systems, Computation, Automation, and Networking, 2020.
[10] R.R. Rajalaxmi, P. Natesan, N. Krishnamoorthy, S. Ponni, Regression model for predicting engineering students academic performance, International Journal of Recent Technology and Engineering (IJRTE) 7 (6S3) (2019), ISSN: 2277-3878.
