Predicting Student Performance Using Feature Selection Algorithms For Deep Learning Models
Abstract—Feature selection is an integral part of feature engineering prior to deep learning (DL) model development. The idea is to reduce the complexity of high-dimensional data structures by keeping only relevant information in the data mining process. The critical issue in developing a DL model to predict student performance is the high dimensionality of students’ profiles, which results in a DL model with low performance metrics. Students' profiles involve different aspects such as demographic information, academic records, technological resources, social attitudes, family background, and/or socio-economic status. Empirically, the diversity of these data produces complexity in terms of dimension. In this paper, we compared the effectiveness of four feature selection algorithms (Information Gain Based, ReliefF, Boruta, and Recursive Feature Elimination) on deep learning models using an educational dataset from Portugal. The effectiveness is measured using the following model performance metrics: training accuracy, validation accuracy, testing accuracy, kappa statistic, and f-measure. Results revealed the robustness of the Boruta algorithm in dimensionality reduction, as it allowed the deep learning model to achieve its highest performance metrics compared to the other feature selection algorithms.

Keywords — deep learning, feature selection algorithms, dimensionality reduction, prediction.

I. INTRODUCTION

Educational institutions all over the world deal with grade retention issues. According to statistical reports from the National Center for Education Statistics in the U.S. (https://nces.ed.gov/), between 2000 and 2016 the percentage of students retained in a grade decreased from 3.1% to 1.9%. This pattern has been observed mainly among white, black, and Hispanic students. This retention rate needs to be reduced as much as possible to allow student academic progress to happen smoothly and continuously. Besides that, some psychological researchers do not see grade retention as a fair strategy for students who did not achieve at least the average required for the next level. In fact, grade retention has shown numerous deleterious effects on student performance, such as poor peer interactions, an aversion to school, behavioral problems, and poor self-concept [1]. Considering this scenario, it is highly relevant to follow students’ academic progress (performance) in order to identify those at risk for grade retention and/or dropout.

Predicting student performance using an educational dataset is a topic that has been researched for several years, as can be found in [2], [3], and [4], with limitations of this research discussed in [5]. The basic principle is to identify students who need assistance as early in the course as possible in order to adopt pedagogical actions before they fail or drop out. Educational institutions can then create specific programs to help students who are in vulnerable situations and improve their learning conditions. For example, if the school knows that students who do not have Internet access at home or are of low income will have a high chance of retention or dropping out, then some financial or technological issues could be worked on in order to keep them learning. However, limitations of the models formulated in [4] and [5] to predict student performance include: (1) directly applying a classification model without studying the nature of the features; (2) not analyzing features before applying a machine learning model; and (3) the need to extract or select more effective features to increase prediction accuracy.

In the literature, we can find several libraries for machine learning applications, such as TensorFlow (Google), Theano, and CNTK (Microsoft) [6], that can be used to create mathematical models to discover students’ patterns from personal and/or academic data. Thus, this paper focuses on analyzing how deep learning (DL) models perform in predicting students’ performance when using different dimensionality reduction methods (Information Gain, ReliefF Algorithm, Boruta Algorithm, and Recursive Feature Elimination). For the experiments, an educational dataset from two Portuguese public schools was used [7].

This paper is organized as follows: Section II gives an overview of the steps needed to build a DL model. Section III provides an overview of the feature selection algorithms considered in this study (Information Gain Based, ReliefF Algorithm, Boruta Algorithm, and Recursive Feature Elimination). Section IV presents the methodology employed to run the experiments, as well as the dataset. Section V compares and evaluates the performance of the DL models using different feature selection algorithms, and, finally, Section VI presents the conclusions.

II. DL MODELS USING TENSORFLOW AND KERAS

TensorFlow is an open-source library developed by the Google Brain Team (AI research group at Google) in 2015, written in C++ for Artificial Intelligence applications. It is multiplatform (Windows, macOS, and Linux) and can run on CPUs (Central Processing Units), GPUs (Graphics Processing Units), and TPUs (Tensor Processing Units) [8]. Currently, it is considered the most used software for machine learning and deep learning applications [9]. Google, Intel, Uber, Airbnb, and Dropbox, for example, are some of the companies that already use it.

One major advantage of this framework is its use of data flow graphs, in which nodes represent units of computation and edges represent the tensors (multidimensional arrays) that link the nodes. Its main feature is the ability to quickly generate a trained predictive model,
Authorized licensed use limited to: University of the Phillippines Diliman. Downloaded on December 01,2023 at 10:22:46 UTC from IEEE Xplore. Restrictions apply.
ALGORITHM II. RELIEFF (SOURCE: ZHOU & WANG, 2015)

Input: Feature data matrix: D, repeat times: n, the number of neighbors: K
Output: Vector W for the feature attributes ranking
Begin
  for j = 1 to n do
    Randomly select an instance Rⱼ
    Find K nearest hits H and nearest misses M;
    for i = 1 to all features do
      Update estimation Wᵢ by Equation (2);
    end
  end
End

D. Recursive Feature Elimination (RFE)

RFE as a feature selection algorithm is a wrapper type that can be used as a core method given any machine learning algorithm. Recursively, RFE ranks features given a dataset according to some measure of importance [21]. Pseudocode for RFE is presented in Algorithm IV. Utilizing the given dataset, the algorithm starts with training a specified machine learning model where, for every iteration, feature importance is measured and the less important features are removed. RFE requires a number of features to be kept, but in some instances a variable importance measure may be implemented, such as the varImp() function from the ‘caret’ package in R [12], which returns a final list of features ranked according to their variable importance.
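As an illustration of the elimination loop described for RFE, the following is a minimal sketch in plain Python. It is not the ‘caret’ implementation used in the study: the "specified machine learning model" is assumed here to be an ordinary least-squares linear model, the importance measure is assumed to be the absolute value of each fitted coefficient, and the function names `least_squares` and `rfe` as well as the synthetic data are hypothetical.

```python
import random


def least_squares(X, y):
    """Fit y ~ X w by solving the normal equations with Gauss-Jordan elimination.

    Assumes X^T X is non-singular (no perfectly collinear features).
    """
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(X, y))]
        for i in range(p)
    ]
    for c in range(p):
        # Partial pivoting: bring the largest remaining entry of column c to row c.
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def rfe(X, y, n_keep):
    """Refit the model and drop the least important feature until n_keep remain."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > n_keep:
        sub = [[row[f] for f in remaining] for row in X]
        weights = least_squares(sub, y)
        # Assumed importance measure: |coefficient| (features on comparable scales).
        worst = min(range(len(remaining)), key=lambda k: abs(weights[k]))
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated


# Synthetic data (hypothetical): only features 0 and 3 influence the target.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [3 * r[0] - 2 * r[3] + 0.1 * rng.gauss(0, 1) for r in X]

kept, eliminated = rfe(X, y, n_keep=2)
print("kept:", kept, "eliminated (first to last):", eliminated)
```

Reading `eliminated` in reverse order gives a ranking of the discarded features from most to least important, which is the kind of ranked list the text attributes to a variable importance measure.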
FIG 1. METHODOLOGICAL FRAMEWORK

TABLE I. FEATURES OF THE USED DATASET

Feature      Description
school       student's school ('GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
sex          student's sex ('F' - female or 'M' - male)
age          student's age (15 - 22)
address      student's home address type ('U' - urban or 'R' - rural)
famsize      family size ('LE3' - less or equal to 3 or 'GT3' - greater than 3)
guardian     student's guardian ('mother', 'father' or 'other')
goout        going out with friends (1 - 5)
Walc         weekend alcohol consumption (1 - 5)
health       current health status (1 - 5)
absences     number of school absences (0 - 93)
G1           first period grade (0 - 20)
G2           second period grade (0 - 20)
G3 (Class)   final grade (0 - 20)
The target class ‘G3’ ranges from 0 to 20. G3 was categorized into three classes based on the grading system in Portugal [24]: grades 0 - 9 were categorized as ‘Failed’, 10 - 13 as ‘Average’, and 14 - 20 as ‘AboveAverage’. Classes were then encoded as 0 for ‘Failed’, 1 for ‘Average’, and 2 for ‘AboveAverage’.

C. Phase 3 - Feature Selection

Four feature selection algorithms were used to identify relevant features for the classification task. Feature selection processes were implemented in RStudio. Package ‘FSelector’ [10] was used for the Information Gain Based algorithm, package ‘caret’ [12] for the Recursive Feature Elimination algorithm, package ‘Boruta’ [11] for the Boruta algorithm, and package ‘CORElearn’ [13] for the ReliefF algorithm. Snippets for each feature selection process are presented in Fig. 2. Note that only the core functions for each algorithm are presented.

D. Phase 4 - Deep Learning Model Development

Prior to the development of the DL models, a data up-sampling technique was applied because the target classes are imbalanced: Class 0 has only 230 observations and Class 2 only 294, while Class 1 has 520. Without up-sampling, the DL model would devote more of its training to the majority class (Class 1) than to the minority classes (Class 0 and Class 2), which may cause poor performance when it is tested on new observations. After up-sampling, the dataset has a total of 1560 observations, with 520 observations in each class. In addition, one-hot encoding is applied to features whose categories have no ordinal relationship, including the target classes. This prevents the DL model from assuming during training that one category is greater or less than another.

Four sequential deep learning models were developed in this phase, based on the results obtained from phase 3. To ensure unbiased experimentation, the hyperparameters were set the same for the four DL models. Two hidden layers were used, with 53 hidden neurons in the first layer and 13 hidden neurons in the second layer. The rectified linear unit (relu) was used as the activation function in the hidden layers, while the softmax function was used as the activation function in the output layer. An L2 kernel regularizer with a value of 0.01 was applied to the two hidden layers. Adaptive moment estimation (Adam) was used as the model optimizer and categorical cross-entropy as the model loss function. The learning rate was set to 0.001, the batch size to 32, and the number of epochs to 200. Neither early stopping nor dropout was used during DL model development.

The experiments in the fourth phase were implemented in Google Colaboratory using the Python programming language. TensorFlow was used as the core module to build the deep learning model. From TensorFlow, submodules of keras were imported to define the architecture of the model. These submodules include Sequential, regularizers, Dense, Activation, Optimizers, and BatchNormalization. Other modules such as NumPy, pandas, matplotlib (visualization), random (for reproducibility of results), and sklearn (data partitioning and metrics evaluation) were also imported. A sample of reusable source code used for training the deep learning model is presented in Fig. 3.

The whole dataset was partitioned into 80% for training the deep learning model and 20% for model testing and evaluation of performance metrics.

FIG 3. CODE SNIPPET FOR TRAINING DL MODELS
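The data preparation steps described in this section (binning G3 into three classes, up-sampling to the majority class size, one-hot encoding, and the 80/20 partition) can be sketched in plain Python as follows. This is not the code of Fig. 3 and not the authors' implementation: the function names are hypothetical, random duplication is assumed as the up-sampling technique, and the rows are placeholder indices that only reproduce the class counts reported in the text.

```python
import random
from collections import Counter


def grade_to_class(g3):
    """Bin a final grade G3 (0-20) into 0 = Failed, 1 = Average, 2 = AboveAverage."""
    if not 0 <= g3 <= 20:
        raise ValueError("G3 is expected to lie in 0..20")
    if g3 <= 9:
        return 0
    if g3 <= 13:
        return 1
    return 2


def upsample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until all classes match the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label in sorted(by_class):
        members = by_class[label]
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def one_hot(label, n_classes=3):
    """Encode a class index as a 0/1 indicator vector, so no order is implied."""
    vec = [0] * n_classes
    vec[label] = 1
    return vec


def split_80_20(rows, labels, seed=0):
    """Shuffle, then keep 80% for training and 20% for testing."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    cut = len(order) * 4 // 5
    train = [(rows[i], labels[i]) for i in order[:cut]]
    test = [(rows[i], labels[i]) for i in order[cut:]]
    return train, test


# Placeholder rows (just indices) with the class counts reported in the text.
labels = [0] * 230 + [1] * 520 + [2] * 294
rows = list(range(len(labels)))

rows_up, labels_up = upsample(rows, labels)
class_counts = Counter(labels_up)
train, test = split_80_20(rows_up, [one_hot(c) for c in labels_up])
print(class_counts, len(train), len(test))
```

With the reported class sizes, the sketch yields 520 observations per class (1560 in total), split into 1248 training and 312 testing observations.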
that random fluctuations in the training data are learned as concepts by the model [25]. In this case, an overfitted model is not able to generalize, which has a negative impact when it is applied to new data. The three remaining performance metrics are obtained after the developed DL models make predictions on the testing set. A confusion matrix analysis is derived after the prediction to analyze the model's performance.

The accuracy is given by the equation: (3)

TABLE II. IMPORTANT FEATURES CONFIRMED BY FS ALGORITHMS

Algorithm                Number of Features   Selected Features
Information Gain Based   17                   school, sex, address, Pstatus, Medu, Fjob, Mjob, reason, guardian, studytime, failures, schoolsup, higher, Dalc, G1 and G2
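To make the confusion matrix analysis mentioned in this section concrete, the sketch below computes accuracy, the kappa statistic, and the f-measure from a three-class confusion matrix. It rests on assumptions that the text does not confirm: accuracy is taken as the share of correct predictions, kappa as the unweighted Cohen's kappa, and the f-measure as the macro-average of the per-class F1 scores. The function names and the example matrix are hypothetical and are not results from the paper.

```python
def accuracy(cm):
    """Share of correct predictions: diagonal sum over the total count."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total


def cohen_kappa(cm):
    """Unweighted Cohen's kappa: agreement corrected for chance agreement."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    observed = sum(cm[k][k] for k in range(n)) / total
    row_tot = [sum(cm[k]) for k in range(n)]
    col_tot = [sum(cm[i][k] for i in range(n)) for k in range(n)]
    expected = sum(row_tot[k] * col_tot[k] for k in range(n)) / total ** 2
    return (observed - expected) / (1 - expected)


def macro_f1(cm):
    """Average of per-class F1 scores; rows are true classes, columns predictions."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp
        fn = sum(cm[k]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Hypothetical confusion matrix for the classes Failed, Average, AboveAverage.
cm = [
    [9, 1, 0],
    [1, 8, 1],
    [0, 1, 9],
]
print(round(accuracy(cm), 3), round(cohen_kappa(cm), 3), round(macro_f1(cm), 3))
# → 0.867 0.8 0.867
```

Kappa is lower than accuracy in this example because it discounts the agreement expected by chance, which is one third for three equally sized classes.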
FIG 5. MODEL TRAINING LOSS OF DL MODEL WITH BORUTA

In this case, the accuracy plot shows that the model's accuracy on the training dataset increased over time, with the predicted class increasingly matching the true class, especially after the 175th epoch. Additionally, the loss plot shows that the model had a smooth training process (good learning rate), reducing the loss over time.

Additionally, for educators who want to replicate this kind of experiment in their institutions, the main steps are:

1. Collect the students’ academic/personal data using web forms, interviews, or surveys;
2. Preprocess the students’ data (cleaning and transforming) to keep the quality of data and useful information and then use feature selection algorithms (Information Gain Based, ReliefF, Boruta, and Recursive Feature Elimination);
3. Code the experiments using commands/methods from machine learning libraries, defining the deep learning model (neural network architecture), compiling, and fitting it;
4. Evaluate the potential of the generated models using performance metrics;
5. Make predictions for real data.

VI. CONCLUSION AND RECOMMENDATION

In this paper we have examined four feature selection algorithms (Information Gain Based, ReliefF, Boruta, and Recursive Feature Elimination) according to their relative effectiveness in training a deep learning model to predict students’ performance using the open source data of two Portuguese schools. Relevant features confirmed by the Boruta algorithm helped the deep learning model obtain a high testing accuracy rate of 92.9% with a Kappa statistic of 0.891 and an f-measure of 0.922. This shows that, among the four feature selection algorithms compared for dimensionality reduction, the Boruta algorithm is the most effective. The developed deep learning model using the Boruta algorithm as a feature selection method may be implemented to predict students' performance at an early stage of the course in order to identify students at risk for grade retention and/or dropout.

REFERENCES

[1] Jimerson, S. R., & Renshaw, T. L. (2012). Retention and Social Promotion. Principal Leadership Journal, September, 12-16. https://www.researchgate.net/publication/271652282_Retention_and_Social_Promotion
[2] Olivé, D. M., Huynh, D. Q., Reynolds, M., Dougiamas, M., & Wiese, D. (2018). A supervised learning framework for learning management systems. Proceedings of the First International Conference on Data Science, E-Learning and Information Systems, 1–8. https://doi.org/10.1145/3279996.3280014
[3] Teruel, M., & Alonso Alemany, L. (2018). Co-embeddings for Student Modeling in Virtual Learning Environments. Proceedings of the 26th Conference on User Modeling, Adaptation and Personalization, 73–80. https://doi.org/10.1145/3209219.3209227
[4] Castro-Wunsch, K., Ahadi, A., & Petersen, A. (2017). Evaluating Neural Networks as a Method for Identifying Students in Need of Assistance. Proceedings of the 2017 ACM SIGCSE Technical Symposium on Computer Science Education, 111–116. https://doi.org/10.1145/3017680.3017792
[5] Injadat, M., Moubayed, A., Nassif, A. B., & Shami, A. (2020). Multi-split optimized bagging ensemble model selection for multi-class educational data mining. Applied Intelligence. https://doi.org/10.1007/s10489-020-01776-3
[6] Bahrampour, S., Ramakrishnan, N., Schott, L., & Shah, M. (2016). Comparative Study of Deep Learning Software Frameworks. arXiv:1511.06435v3
[7] Cortez, P., & Silva, A. M. G. (2008). Using Data Mining to Predict Secondary School Student Performance. In A. Brito & J. Teixeira (Eds.), Proceedings of 5th Annual Future Business Technology Conference, Porto, 5-12.
[8] TensorFlow. An end-to-end open source machine learning platform. (2019, Sep. 23). Available: https://www.tensorflow.org/
[9] Allaire, J. (2012). RStudio: integrated development environment for R. Boston, MA, 770, 394.
[10] Romanski, P., Kotthoff, L., & Kotthoff, M. L. (2013). Package ‘FSelector’. URL http://cran.r-project.org/web/packages/FSelector/index.html
[11] Kursa, M. B., Rudnicki, W. R., & Kursa, M. M. B. (2020). Package ‘Boruta’.
[12] Kuhn, M. (2009). The caret package. Journal of Statistical Software, 28(5).
[13] Robnik-Sikonja, M., Savicky, P., & Robnik-Sikonja, M. M. (2021). Package ‘CORElearn’.
[14] Tang, J., Alelyani, S., & Liu, H. (2014). Feature selection for classification: A review. Data classification: Algorithms and applications, 37.
[15] Paja, W., Pancerz, K., & Grochowalski, P. (2018). Generational feature elimination and some other ranking feature selection methods. In Advances in Feature Selection for Data and Pattern Recognition (pp. 97-112). Springer, Cham.
[16] Hssina, B., Merbouha, A., Ezzikouri, H., & Erritali, M. (2014). A comparative study of decision tree ID3 and C4.5. International Journal of Advanced Computer Science and Applications, 4(2), 13-19.
[17] Kursa, M. B., & Rudnicki, W. R. (2010). Feature selection with the Boruta package. J Stat Softw, 36(11), 1-13.
[18] Zhou, X. (2015). Feature selection for image classification based on a new ranking criterion. Journal of Computer and Communications, 3(03), 74.
[19] Wang, Z., Zhang, Y., Chen, Z., Yang, H., Sun, Y., Kang, J., … Liang, X. (2016). Application of ReliefF algorithm to selecting feature sets for classification of high resolution remote sensing image. 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). doi:10.1109/igarss.2016.7729190
[20] Robnik-Šikonja, M., & Kononenko, I. (2003). Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53(1), 23-69.
[21] Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389–422.
[22] Mitchell, T. M. (1997). Machine Learning. McGraw-Hill International, p. 58.
[23] Cortez, P., & Silva, A. (2008). Using Data Mining to Predict Secondary School Student Performance. In A. Brito & J. Teixeira (Eds.), Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008), pp. 5-12, Porto, Portugal, April 2008, EUROSIS, ISBN 978-9077381-39-7.
[24] The Portuguese Grading System. Available at: https://www.studyineurope.eu/study-in-portugal/grades
[25] Brownlee, J. (2016). Overfitting and Underfitting With Machine Learning Algorithms. Available at: https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/
[26] Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provisions for scaled disagreement or partial credit. Psychological Bulletin, 70, 213-220.