Feature Selection Algorithms For Predicting Students Academic Performance Using Data Mining Techniques
Abstract: Educational Data Mining (EDM) is used by educational organizations to enhance the academic progress of students. For predicting the academic achievement of a student, EDM offers many feature selection and machine learning techniques. The purpose of using these feature selection techniques is to remove from the student academic datasets the unwanted elements that are not required for performance prediction. By using feature selection techniques, the quality of the students' datasets is improved, and with it, the predictive accuracy of various data mining techniques is also enhanced. Taking these facts into consideration, four feature selection and six classification techniques are implemented on a student dataset to check the predictive accuracy. After the implementation of the FS and classification techniques, only the CfsSubsetEval and GainRatioAttributeEval feature selection techniques improved efficiency, by up to 5%.
Index Terms: Educational Data Mining, Attribute Selection, Classification, Prediction, Accuracy
of data, the authors used the Chi-Square and IG feature selection algorithms and found better prediction with the CART decision tree algorithm. E. Osmanbegović et al., in their study "Determining Dominant Factor for students Performance Prediction by using Data Mining Classification Algorithms," calculated the academic performance of secondary school students at Tuzla. For the pre-processing phase of the collected dataset, they used the Gain Ratio (GR) feature selection algorithm. They found the best prediction accuracy with the Random Forest (RF) algorithm as compared to other classification algorithms. A. Figueira, in "Predicting Grades by Principal Component Analysis: A Data Mining Approach to Learning Analytics," predicted students' academic grades in a Bachelor degree program. For the pre-processing phase, the author used the Principal Component Analysis (PCA) feature selection algorithm; in this study, PCA was used to build a decision tree, which was then used to predict the academic grade of a student. N. Rachburee and W. Punlumjeak, in their study "A comparison of feature selection approach between greedy, IG-ratio, Chi-square, and mRMR in educational mining," compared different feature selection algorithms such as IG-ratio, Chi-Square, Greedy Forward selection and mRMR. The work was conducted on a first-year students' dataset (with 15 attributes) from the University of Technology, Thailand. In this research, the authors found better prediction accuracy by using Greedy Forward (GF) selection with an Artificial Neural Network (ANN) as compared to other classification algorithms (Decision Tree, K-NN and Naive Bayes). M. Zaffar, M. A. Hashmani et al., in their study "Performance analysis of feature selection algorithm for educational data mining," implemented different filter feature selection algorithms on a selected student dataset. In this research, the authors used two different student datasets with a number of feature selection algorithms and analyzed the results for prediction accuracy.

3 RESEARCH METHODOLOGY
The dataset considered for this study has been taken from Kalboard-360, which is a multi-agent Learning Management System. In this technological era, such an online learning platform provides the user with unlimited access to educational resources from several places and on any device with an internet connection. The dataset consists of 16 features and 480 student records, of which 305 are male and 175 are female. These features can be classified into three major categories, namely demographics, academic background and behavioural characteristics. In Table 1, all 16 attributes are considered for the analysis of the dataset mentioned above, and the following output is achieved, where Naïve Bayes provides the most correctly classified instances with accuracy up to 74.30%.
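As a rough sketch of this baseline step, the snippet below loads the Kalboard-360 data into WEKA and estimates the Naïve Bayes accuracy with 10-fold cross-validation. The ARFF file name (xAPI-Edu-Data.arff) and the assumption that the class label is the last attribute are ours, not stated in the paper.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaselineNaiveBayes {
    public static void main(String[] args) throws Exception {
        // Load the student dataset (file name is an assumption).
        Instances data = DataSource.read("xAPI-Edu-Data.arff");
        // Assume the class attribute (performance level) is the last attribute.
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation of Naive Bayes on all 16 features.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
        System.out.printf("Correctly classified instances: %.2f%%%n", eval.pctCorrect());
    }
}
```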
Different Feature Selection (FS) Algorithms:
Here, four FS algorithms, namely CfsSubsetEval, GainRatioAttributeEval, InfoGainAttributeEval and ReliefAttributeEval, are evaluated. The classification algorithms Naive Bayes (NB), Logistic Regression (LR), DecisionTable (DT), JRip, J48 and Random Forest (RF) are evaluated with them on the academic dataset.
cfsSubsetEval: Attribute subsets are evaluated based on both the predictive ability and the degree of redundancy of each feature; subsets whose features are highly correlated with the class while having low intercorrelation with each other are preferred.
Attribute Subset Evaluator (cfsSubsetEval) + Search Method (Best first (forwards)): In Table 2, the best seven attributes (gender, Relation, raisedhands, VisITedResources, AnnouncementsView, ParentAnsweringSurvey, StudentAbsenceDays) are selected based on the above FS algorithm. It provides the most correctly classified instances with accuracy up to 77.29% (with RF), and the least is JRip with accuracy up to 73.75%.
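A minimal WEKA sketch of this selection step, under the same assumptions as the earlier snippet (file name, class as last attribute), pairs CfsSubsetEval with a best-first search and prints the names of the selected attributes:

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CfsSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("xAPI-Edu-Data.arff"); // file name is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval()); // correlation-based subset evaluator
        selector.setSearch(new BestFirst());        // best-first search (forward by default)
        selector.SelectAttributes(data);

        // Indices of the selected attributes (the class index is appended last).
        for (int idx : selector.selectedAttributes()) {
            System.out.println(data.attribute(idx).name());
        }
    }
}
```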
GainRatioAttributeEval: Estimates the worth of an attribute by calculating its gain ratio with respect to the class attribute, GainR(Class, Attribute) = (H(Class) - H(Class | Attribute)) / H(Attribute), where H denotes entropy.
Attribute Evaluator (GainRatioAttributeEval) + Search Method (Ranker): In Table 2, the best seven attributes (StudentAbsenceDays, raisedhands, VisITedResources, AnnouncementsView, ParentAnsweringSurvey, Discussion, Relation) are selected based on the above FS algorithm, and RF and J48 provide the most correctly classified instances with accuracy up to 76.45%, while the least is DecisionTable with accuracy up to 71.66%.
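A corresponding sketch for this ranking step, again under the same assumptions, pairs GainRatioAttributeEval with the Ranker search, keeps the seven top-ranked attributes, and uses WEKA's supervised AttributeSelection filter to produce the reduced dataset on which the classifiers can be re-run:

```java
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class GainRatioSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("xAPI-Edu-Data.arff"); // file name is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        // Rank all attributes by gain ratio and keep the seven best.
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(7);

        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new GainRatioAttributeEval());
        filter.setSearch(ranker);
        filter.setInputFormat(data);

        // Reduced dataset: the seven selected attributes plus the class.
        Instances reduced = Filter.useFilter(data, filter);
        System.out.println(reduced.numAttributes() + " attributes retained");
    }
}
```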
4 RESULTS AND DISCUSSIONS
In the given work, our primary focus is on evaluating the performance of the four FS algorithms on the student academic performance dataset. An FS algorithm's performance can be measured through various parameters such as recall, precision, F-measure and predictive accuracy; the F-measure is defined as the harmonic mean of precision and recall. The results of the four FS algorithms are reported, by applying the six classifiers, in Table 2 to Table 5. These tables represent the results obtained by each FS algorithm with their evaluation parameters.
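These parameters can all be read from WEKA's Evaluation object after cross-validation; the sketch below, under the same assumptions as before, prints them for a single classifier (Random Forest is used here purely as an example):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluationMetrics {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("xAPI-Edu-Data.arff"); // file name is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new RandomForest(), data, 10, new Random(1));

        // Accuracy, precision, recall and F-measure (weighted over the class values).
        System.out.printf("Accuracy : %.2f%%%n", eval.pctCorrect());
        System.out.printf("Precision: %.3f%n", eval.weightedPrecision());
        System.out.printf("Recall   : %.3f%n", eval.weightedRecall());
        System.out.printf("F-measure: %.3f%n", eval.weightedFMeasure());
    }
}
```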
Table 5: Correctly Classified Instances (CCI) of each ML algorithm with all attributes and with each FS algorithm (CfsSubsetEval, GainRatioAttributeEval, InfoGainAttributeEval, ReliefAttributeEval).
In Table 5 below, six algorithms are used, of which Naïve Bayes and J48 give their maximum output with the GainRatioAttributeEval attributes, at 75.20% and 76.45% correctly classified instances, respectively. JRip, RF, DT and LR, on the other hand, give their maximum output with CfsSubsetEval, at 73.75%, 77.29%, 72.91% and 76.25% correctly classified instances, respectively. The graphical representation of Table 5 is shown in Figure 1.
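The grid of correctly classified instances behind Table 5 and Figure 1 could be reproduced along the following lines. This is only a sketch under the earlier assumptions (ARFF file name, class as the last attribute); it uses WEKA's ReliefFAttributeEval class for the ReliefAttributeEval column, and keeping the seven top-ranked attributes for the ranker-based evaluators is carried over from the GainRatioAttributeEval description rather than stated for InfoGain or Relief in the paper.

```java
import java.util.Random;
import weka.attributeSelection.*;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.rules.DecisionTable;
import weka.classifiers.rules.JRip;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;

public class Table5Grid {
    public static void main(String[] args) throws Exception {
        Instances full = DataSource.read("xAPI-Edu-Data.arff"); // file name is an assumption
        full.setClassIndex(full.numAttributes() - 1);

        // One column per feature-selection setting; null stands for "all attributes".
        ASEvaluation[] evaluators = { null, new CfsSubsetEval(), new GainRatioAttributeEval(),
                new InfoGainAttributeEval(), new ReliefFAttributeEval() };
        Classifier[] classifiers = { new NaiveBayes(), new Logistic(), new DecisionTable(),
                new JRip(), new J48(), new RandomForest() };

        for (ASEvaluation evaluator : evaluators) {
            Instances data = full;
            if (evaluator != null) {
                // Reduce the dataset with the chosen evaluator before classification.
                weka.filters.supervised.attribute.AttributeSelection fs =
                        new weka.filters.supervised.attribute.AttributeSelection();
                fs.setEvaluator(evaluator);
                if (evaluator instanceof CfsSubsetEval) {
                    fs.setSearch(new BestFirst());    // subset evaluator: best-first search
                } else {
                    Ranker ranker = new Ranker();
                    ranker.setNumToSelect(7);         // keep the seven top-ranked attributes
                    fs.setSearch(ranker);
                }
                fs.setInputFormat(full);
                data = Filter.useFilter(full, fs);
            }
            String setting = (evaluator == null) ? "all attributes"
                    : evaluator.getClass().getSimpleName();
            for (Classifier c : classifiers) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(c, data, 10, new Random(1));
                System.out.printf("%-25s %-15s CCI = %.2f%%%n",
                        setting, c.getClass().getSimpleName(), eval.pctCorrect());
            }
        }
    }
}
```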
Figure 1: Graphical representation of Correctly Classified Instances

5 CONCLUSION
In this work, different FS algorithms are evaluated and analyzed with different classification algorithms (Naive Bayes, Logistic Regression, DecisionTable, JRip, J48 and Random Forest). The implementation results of these FS algorithms do not show any significant change, ranging from 67.9167% to 77.2917% using the WEKA toolkit. The cfsSubsetEval algorithm with the Random Forest algorithm gave the highest accuracy, up to 77.2917%, and the ReliefAttributeEval algorithm with DecisionTable gave the lowest accuracy, up to 67.9167%. From Figure 1, it is very clear that Random Forest, with almost all feature selection algorithms, shows better accuracy than the other algorithms in combination. In future, more feature selection algorithms will be analyzed with different classification algorithms to get better efficiency. The same work can also be done on different student academic datasets. Apart from this, we cannot overlook the benefits of feature selection techniques in Data Mining.

6 REFERENCES
[1]. S. Sivakumar, S. Venkataraman, and R. Selvaraj, "Predictive Modeling of Student Dropout Indicators in Educational Data Mining using Improved Decision Tree," Indian Journal of Science and Technology, vol. 9, 2016.
[2]. K. W. Stephen, "Data Mining Model for Predicting Student Enrolment in STEM Courses in Higher Education Institutions," 2016.
[3]. E. Osmanbegović, M. Suljić, and H. Agić, "Determining Dominant Factor for students Performance Prediction by using Data Mining Classification Algorithms," Tranzicija, vol. 16, pp. 147-158, 2015.
[4]. A. Figueira, "Predicting Grades by Principal Component Analysis: A Data Mining Approach to Learning Analytics," in Advanced Learning Technologies (ICALT), 2016 IEEE 16th International Conference on, 2016, pp. 465-467.
[5]. N. Rachburee and W. Punlumjeak, "A comparison of feature selection approach between greedy, IG-ratio, Chi-square, and mRMR in educational mining," in Information Technology and Electrical Engineering (ICITEE), 2015 7th International Conference on, 2015, pp. 420-424.
[6]. M. Zaffar, M. A. Hashmani, and K. Savita, "Performance analysis of feature selection algorithm for educational data mining," in Big Data and Analytics (ICBDA), 2017 IEEE Conference on, 2017, pp. 7-12.