
2021 XVI Latin American Conference on Learning Technologies (LACLO)
DOI: 10.1109/LACLO54177.2021.00009

Predicting Student Performance Using Feature Selection Algorithms for Deep Learning Models

Thereza P. P. Padilha, Department of Exact Sciences, Federal University of Paraíba, Rio Tinto, Brazil ([email protected])
Gernel Lumacad, Senior HS STEM Strand Department, St. Rita's College of Balingasag, Cagayan de Oro, Philippines ([email protected])
Richard Catrambone, School of Psychology, Georgia Institute of Technology, Atlanta, United States ([email protected])

Abstract—Feature selection is an integral process for feature engineering prior to deep learning (DL) model development. The idea is to reduce the complexity of high-dimensional data structures by keeping only relevant information in the data mining process. The critical part of developing a DL model to predict student performance is the high dimensionality of students' profiles, which results in a DL model with low performance metrics. Students' profiles/data involve different aspects such as demographic information, academic records, technological resources, social attitudes, family background, and/or socio-economic status. Empirically, the diversity of these data produces complexity in terms of dimension. In this paper, we compare the effectiveness of four feature selection algorithms (Information Gain Based, ReliefF, Boruta, and Recursive Feature Elimination) on deep learning models using an educational dataset from Portugal. Effectiveness is measured using the following model performance metrics: training accuracy, validation accuracy, testing accuracy, kappa statistic, and f-measure. The results revealed the robustness of the Boruta algorithm in dimensionality reduction, as it allowed the deep learning model to achieve its highest performance metrics compared to the other feature selection algorithms.

Keywords—deep learning, feature selection algorithms, dimensionality reduction, prediction

I. INTRODUCTION

Educational institutions all over the world deal with grade retention issues. According to statistical reports from the National Center for Education Statistics in the U.S. (https://nces.ed.gov/), between the years 2000 and 2016 the percentage of students retained in a grade decreased from 3.1% to 1.9%. This pattern has been observed mainly among white, black, and Hispanic students. This retention rate needs to be reduced as much as possible to allow student academic progress to happen smoothly and continuously. Besides that, some psychological researchers do not see grade retention as a fair strategy for students who did not achieve at least the average needed for the next level. In fact, grade retention has shown numerous deleterious effects on student performance, such as poor peer interactions, an aversion to school, behavioral problems, and poor self-concept [1].

Considering this scenario, it is highly relevant to keep up with students' academic progress (performance) in order to identify those at risk for grade retention and/or dropout. Predicting students' performance using an educational dataset is a topic that has been researched for several years, as can be found in [2], [3], and [4], and in the limitations noted in [5]. The basic principle is to identify students who need assistance as early in the course as possible, so as to adopt pedagogical-learning actions before they fail or drop out. Educational institutions can then create specific programs to help students who are in vulnerable situations and improve their learning conditions. For example, if the school knows that students who do not have Internet access at home, or who are of low income, have a high chance of retention or dropping out, then certain financial or technological issues could be worked on in order to keep those students learning. However, limitations of the models formulated in [4] and [5] to predict student performance include: (1) directly applying a classification model without studying the nature of the features; (2) not analyzing the features before applying a machine learning model; and (3) not extracting or selecting more effective features to increase performance accuracy.

In the literature we can find several libraries for machine learning applications, such as TensorFlow (Google), Theano, and CNTK (Microsoft) [6], that can be used to create mathematical models to discover students' patterns from personal and/or academic data. Thus, this paper focuses on analyzing how Deep Learning (DL) models act to predict students' performance under different dimensionality reduction methods (Information Gain, ReliefF Algorithm, Boruta Algorithm, and Recursive Feature Elimination). For the experiments, an educational dataset from two Portuguese public schools was used [7].

This paper is organized as follows: Section II gives an overview of the steps needed to build a DL model. Section III provides an overview of the feature selection algorithms considered in this study (Information Gain Based, ReliefF Algorithm, Boruta Algorithm, and Recursive Feature Elimination). Section IV presents the methodology employed to run the experiments, as well as the dataset. Section V compares and evaluates the performance of the DL models using the different feature selection algorithms, and, finally, Section VI presents conclusions.

II. DL MODELS USING TENSORFLOW AND KERAS

TensorFlow is an open-source library developed by the Google Brain Team (the AI research group at Google) in 2015, written in C++ for Artificial Intelligence applications. It is multiplatform (Windows, MacOS, and Linux) and can be run on CPUs (Central Processing Units), GPUs (Graphics Processing Units), and TPUs (Tensor Processing Units) [8]. Currently, it is considered the most used software for Machine Learning and Deep Learning applications [9]. Google, Intel, Uber, Airbnb, and Dropbox, for example, are some of the companies that already use it.

One major advantage of using this framework is its use of data flow graphs, in which nodes represent units of computation and edges represent the tensors (multidimensional arrays) that link the nodes. Its main feature is the ability to quickly generate a trained predictive model with high precision, eliminating the need to reimplement it.
In 2019, Google created a new version, called TensorFlow 2, that integrated the Keras API directly and thereby allowed deep learning models to be built with just a few lines of code. In building these models, five main steps need to be followed (a sketch follows the list):

1. Define the model: sets up the neural network according to the type of task, including the number and type of layers, the input of each layer, and the weight parameters;
2. Compile the model: configures the type of loss, the optimizer, and the metrics for future analysis;
3. Fit the model: presents the training set to the proposed model, specifying, for example, the number of iterations (epochs) and the sample size per iteration (batch_size);
4. Evaluate the model: verifies the performance of the trained model using test data. New data that the model has never seen are presented to it to see how well or badly it performs. Training/testing accuracy, F-score, and Kappa statistics are useful metrics to validate the model;
5. Make predictions: uses the built model to make real predictions.
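As a minimal illustration of these five steps in TensorFlow 2/Keras, consider the sketch below. The layer sizes, stand-in data, and variable names are illustrative assumptions, not the configuration used in this study (see Section IV for that):

import numpy as np
from tensorflow import keras

# Illustrative stand-in data: 100 samples, 10 features, 3 one-hot classes.
x_train = np.random.rand(100, 10)
y_train = keras.utils.to_categorical(np.random.randint(3, size=100), 3)

# Step 1 - Define the model: layers, their inputs, and (implicitly) weights.
model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    keras.layers.Dense(3, activation="softmax"),
])

# Step 2 - Compile the model: loss, optimizer, and metrics.
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# Step 3 - Fit the model: epochs and batch_size control the iterations.
model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Step 4 - Evaluate the model on data it has never seen.
x_test = np.random.rand(20, 10)
y_test = keras.utils.to_categorical(np.random.randint(3, size=20), 3)
loss, acc = model.evaluate(x_test, y_test)

# Step 5 - Make predictions with the trained model.
predicted_classes = np.argmax(model.predict(x_test), axis=1)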
labeled by , , …, and going to subtrees ID3
III. FEATURE SELECTION ALGORITHMS

Dimensionality reduction is a technique for reducing the complexity of high-dimensional data, and it can be categorized into two types: feature extraction and feature selection. Feature extraction reduces the dimension of the data by combining information from the original features into a new feature space of lower dimensionality. Feature selection selects the subset of the original features that gives the greatest relevance to the target class [14]. Feature selection is supervised in nature, as it accounts for the information shared between the features of the data and the target class. Literature concerning the four feature selection algorithms compared in this study is discussed in this section.

A. Information Gain Based

Information Gain (IG) is a feature selection method based on the reduction in information entropy, quantifying the maximal information each feature carries about the classification task. Basically, IG is a metric for decision tree algorithms, specifically the Iterative Dichotomiser 3 (ID3) [22]. IG depends on a measure of impurity called entropy. In the feature selection process, decision trees are built from a given dataset by splitting on each feature with respect to the target class, calculating the weighted entropies of the resulting branches, and subtracting them from the original entropy. IG is given by the equation:

IG(F, S) = H(S) - H(F, S)    (1)

where IG(F, S) is the information gain value of feature F on dataset S, H(S) is the information entropy of the target class, and H(F, S) is the weighted information entropy of the target class after splitting on feature F. A comprehensive flow of ID3 is presented in Algorithm I.

ALGORITHM I. ID3 WITH INFORMATION GAIN (SOURCE: HSSINA, 2014)

Inputs: R: a set of non-target attributes, C: the target attribute, S: training data
Output: returns a decision tree
Start
  Initialize to empty tree;
  if S is empty then
    return a single node with failure value
  end if
  if S consists only of records with the same value of the target then
    return a single node with this value
  end if
  if R is empty then
    return a single node whose value is the most common value of the target attribute found in S
  end if
  D ← the attribute with the largest Information Gain IG(D, S) among all attributes of R
  {d_1, d_2, ..., d_m} ← attribute values of D
  return a tree whose root is D and whose arcs, labeled by d_1, d_2, ..., d_m, go to the subtrees built by recursive calls to ID3
end

Evaluating the information gain value of each feature and selecting the features that maximize the information gain of the decision tree, thereby minimizing its entropy, helps the classification task to be performed better. A feature is considered more relevant when it yields lower entropy and thus produces higher information gain.

B. ReliefF

The ReliefF algorithm is a feature estimator that assesses the quality of each feature in a dataset with strong dependencies between features. ReliefF is highly efficient in dealing with complex datasets, both continuous and discrete [19], and it can also handle incomplete and noisy data [20]. The key idea of ReliefF is to estimate the quality of features according to how well their values distinguish between instances that are similar to each other. The measure of quality W[A] of each feature A is given by the equation:

W[A] := W[A] - sum_{j=1..k} diff(A, R_i, H_j) / (n * k)
             + sum_{C != class(R_i)} [ P(C) / (1 - P(class(R_i))) ] * sum_{j=1..k} diff(A, R_i, M_j(C)) / (n * k)    (2)

where the diff terms sum the distance between the selected instance R_i and its k nearest neighbors among the hits H (same class) and the misses M (each other class C), and P(C) is the prior probability of class C. A more detailed explanation of Equation (2) can be found in [18], and a comprehensive flow of ReliefF is presented in Algorithm II.

ALGORITHM II. RELIEFF (SOURCE: ZHOU & WANG, 2015) D. Recursive Feature Elimination (RFE)
RFE as a feature selection algorithm is a wrapper type that
Input: Feature data matrix: D, repeat times: n, the number can be used as a core method given any machine learning
of neighbors: K algorithm. Recursively, RFE ranks features given a dataset
Output: Vector W for the feature attributes ranking according to some measure of importance [21]. Pseudocode
Begin for RFE is presented in Algorithm IV. Utilizing the given
for j = 1 to n do dataset, the algorithm starts with training a specified machine
Randomly select an instance Rⱼ learning model where, for every iteration, feature importance
Find K nearest hits H and nearest misses M; is being measured and the less important features are
for i = 1 to all features do removed. RFE requires a number of features to be kept, but
Updating estimation ᵢ by Equation (2); in some instances a variable importance measure may be
end implemented such as the varImp() function from the ‘caret’
end package in R [12] where it returns a final list of features
End ranked according to their variable importance.
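As a concrete illustration of the loop in Algorithm II, the following is a toy NumPy sketch of a simplified ReliefF-style update (one nearest hit/miss instead of K, L1 distance, features assumed scaled to [0, 1]); it is an assumption-laden sketch, not the CORElearn implementation used later in this study:

import numpy as np

def relieff_scores(X, y, n_iter=100, seed=0):
    # Toy ReliefF-style scoring with k = 1 neighbor per class side.
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iter):
        i = rng.integers(n_samples)                  # randomly select instance R_i
        dists = np.abs(X - X[i]).sum(axis=1)         # L1 distance to every instance
        dists[i] = np.inf                            # never pick R_i itself
        hit = np.argmin(np.where(y == y[i], dists, np.inf))   # nearest same-class
        miss = np.argmin(np.where(y != y[i], dists, np.inf))  # nearest other-class
        # Equation (2), reduced to one hit and one miss per iteration:
        w -= np.abs(X[i] - X[hit]) / n_iter          # penalize differing from the hit
        w += np.abs(X[i] - X[miss]) / n_iter         # reward differing from the miss
    return w

# Feature 2 alone determines the class, so it should get the largest weight.
X = np.random.default_rng(1).random((100, 3))
y = (X[:, 2] > 0.5).astype(int)
print(relieff_scores(X, y))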

C. Boruta

The Boruta algorithm (BA) is a feature selection method built as a wrapper around the random forest algorithm that iteratively attempts to capture the important features and remove the irrelevant ones. The implementation of BA starts by feeding in a set of features with n observations with respect to a target class, following the pseudocode presented in Algorithm III. BA returns normHits and a decision (confirmed or rejected) for each feature; the greater the normHits of a feature, the more likely the feature is to be confirmed as important.

ALGORITHM III. BORUTA (SOURCE: PAJA, 2018)

Inputs: originalData - input dataset; RFruns - the number of random forest runs
Output: finalSet that contains relevant and irrelevant features
confirmedSet = ∅
rejectedSet = ∅
for each of RFruns do
  originalPredictors ← originalData(predictors)
  shadowAttr ← permute(originalPredictors)
  extendedPredictors ← cbind(originalPredictors, shadowAttr)
  extendedData ← cbind(extendedPredictors, originalData(decisions))
  zScoreSet ← randomForest(extendedData)
  MZSA ← max(zScoreSet(shadowAttr))
  for each a ∈ originalPredictors do
    if zScoreSet(a) > MZSA then
      hit(a)++
for each a ∈ originalPredictors do
  significance(a) ← twoSidedEqualityTest(a)
  if significance(a) >> MZSA then
    confirmedSet ← confirmedSet ∪ {a}
  else if significance(a) << MZSA then
    rejectedSet ← rejectedSet ∪ {a}
return finalSet ← confirmedSet ∪ rejectedSet

Most conventional feature selection algorithms solve the minimal-optimal problem; in contrast, the Boruta algorithm is an all-relevant feature selection algorithm. A detailed discussion of this distinction can be found in [11].

D. Recursive Feature Elimination (RFE)

RFE is a wrapper-type feature selection algorithm that can be built around any machine learning algorithm as its core method. RFE recursively ranks the features of a given dataset according to some measure of importance [21]. Pseudocode for RFE is presented in Algorithm IV. Using the given dataset, the algorithm starts by training a specified machine learning model; at every iteration, feature importance is measured and the less important features are removed. RFE requires the number of features to keep to be specified, but in some instances a variable importance measure may be used instead, such as the varImp() function from the 'caret' package in R [12], which returns a final list of features ranked according to their variable importance.

ALGORITHM IV. RECURSIVE FEATURE ELIMINATION (SOURCE: KURSA, 2020)

Tune/train the model on the training set using all predictors
Calculate model performance
Calculate variable importance or rankings
for each subset size S_i, i = 1, 2, ..., do
  Keep the S_i most important variables
  [Optional] Pre-process the data
  Tune/train the model on the training set using S_i predictors
  Calculate model performance
  [Optional] Recalculate the rankings for each predictor
end
Calculate the performance profile over the S_i
Determine the appropriate number of predictors
Use the model corresponding to the optimal S_i

IV. METHODOLOGY

Five phases were implemented in the methodology of this study, as shown in Fig. 1. Phase 1 was the collection of data. Phase 2 was the data pre-processing phase, which includes cleaning and transforming the collected data. Phase 3 was the feature selection process using the four algorithms under consideration: IG, ReliefF, Boruta, and RFE. Phase 4 was the development of the deep learning models using the results obtained in the third phase. Finally, Phase 5 was the evaluation of the performance metrics generated by the four deep learning models developed in Phase 4.

A. Phase 1 - Data Collection

The dataset utilized in this study was adopted from [23] and can be downloaded from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Student+Performance. It describes student achievement in secondary education at two Portuguese schools. The dataset includes features relating to the students' profiles: academic records, demographic information, technological resources, social attitudes, family background, and socio-economic status. It was gathered through school reports and survey questionnaires. In total, it has 33 features, as shown in Table I.

FIG 1. METHODOLOGICAL FRAMEWORK

TABLE I. FEATURES OF THE USED DATASET

Feature: Description (Values)
school: student's school ('GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
sex: student's sex ('F' - female or 'M' - male)
age: student's age (15 - 22)
address: student's home address type ('U' - urban or 'R' - rural)
famsize: family size ('LE3' - less or equal to 3 or 'GT3' - greater than 3)
Pstatus: parent's cohabitation status ('T' - living together or 'A' - apart)
Medu: mother's education (0 - 4)
Fedu: father's education (0 - 4)
Mjob: mother's job ('teacher', 'health', 'services', 'at_home' or 'other')
Fjob: father's job ('teacher', 'health', 'services', 'at_home' or 'other')
reason: reason to choose this school ('home', 'reputation', 'course' or 'other')
guardian: student's guardian ('mother', 'father' or 'other')
traveltime: home to school travel time (1 - 4)
studytime: weekly study time (1 - 4)
failures: number of past class failures (1 - 4)
schoolsup: extra educational support (yes or no)
famsup: family educational support (yes or no)
paid: extra paid classes within the course subject (yes or no)
activities: extra-curricular activities (yes or no)
nursery: attended nursery school (yes or no)
higher: wants to take higher education (yes or no)
internet: Internet access at home (yes or no)
romantic: in a romantic relationship (yes or no)
famrel: quality of family relationships (1 - 5)
freetime: free time after school (1 - 5)
goout: going out with friends (1 - 5)
Dalc: workday alcohol consumption (1 - 5)
Walc: weekend alcohol consumption (1 - 5)
health: current health status (1 - 5)
absences: number of school absences (0 - 93)
G1: first period grade (0 - 20)
G2: second period grade (0 - 20)
G3 (Class): final grade (0 - 20)

B. Phase 2 - Data Pre-processing

Cleaning and transformation of the collected data were implemented in RStudio, an integrated development environment for R programming [9]. The dataset was examined to determine whether there were missing or N/A values. Categorical encoding was then applied to the features with categorical variables: binary encoding for dichotomous categories (romantic, internet, etc.), ordinal encoding for categories that follow a rank order (Medu, Fedu, etc.), and conversion into factor variables for categories that do not follow a rank order, such as Mjob and Fjob. Data normalization was then applied to all features except G3, the target class column. It is imperative to normalize the data so that the deep learning algorithm is not dominated by variables with larger scales, which would affect model performance.
The target class G3 ranges from 0 to 20. G3 was categorized into three classes based on the grading system in Portugal [24]: 0 - 9 was categorized as 'Failed', 10 - 13 as 'Average', and 14 - 20 as 'AboveAverage'. The classes were then encoded as 0 for 'Failed', 1 for 'Average', and 2 for 'AboveAverage'.
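The authors performed this pre-processing in R. Purely as an illustration, a rough Python/pandas analogue of the same steps (binary encoding, dummy encoding for nominal categories, min-max normalization, and the three-class binning of G3) might look as follows; the file and column names are taken from the UCI dataset but should be treated as assumptions here:

import pandas as pd

# The UCI files (student-mat.csv / student-por.csv) are semicolon-separated.
df = pd.read_csv("student-por.csv", sep=";")

# Binary encoding for the yes/no (dichotomous) categories.
for col in ["schoolsup", "famsup", "paid", "activities", "nursery",
            "higher", "internet", "romantic"]:
    df[col] = (df[col] == "yes").astype(int)
# (Two-valued columns such as sex or address could be mapped the same way;
#  Medu, Fedu, traveltime, studytime, etc. are already ordinal integers.)

# Nominal categories with no rank order get one-hot (dummy) columns.
df = pd.get_dummies(df, columns=["Mjob", "Fjob", "reason", "guardian"])

# Three-class target: 0-9 Failed (0), 10-13 Average (1), 14-20 AboveAverage (2).
df["G3_class"] = pd.cut(df["G3"], bins=[-1, 9, 13, 20], labels=[0, 1, 2])

# Min-max normalization of every numeric predictor except the target columns.
predictors = df.drop(columns=["G3", "G3_class"]).select_dtypes("number").columns
df[predictors] = (df[predictors] - df[predictors].min()) / (
    df[predictors].max() - df[predictors].min())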
C. Phase 3 - Feature Selection

Four feature selection algorithms were used to define the relevant features for the classification task. The feature selection processes were implemented in RStudio: package 'FSelector' [10] was used for the Information Gain Based algorithm, package 'caret' [12] for the Recursive Feature Elimination algorithm, package 'Boruta' [11] for the Boruta algorithm, and package 'CORElearn' [13] for the ReliefF algorithm. Snippets for each feature selection process are presented in Fig. 2; note that only the core functions of each algorithm are shown.

FIG 2. CODE SNIPPET FOR EACH FEATURE SELECTION ALGORITHM
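As a rough Python analogue of these selection steps (the study itself used the R packages above, so this is an assumption-laden sketch, not the authors' code), two of the four selectors have direct scikit-learn counterparts; information gain is approximated here by mutual information:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, mutual_info_classif

# X: numeric feature matrix, y: encoded G3 class (stand-in data below).
X = np.random.rand(200, 10)
y = np.random.randint(3, size=200)

# Information-gain-like ranking: mutual information between feature and target.
mi = mutual_info_classif(X, y)
ig_ranking = np.argsort(mi)[::-1]    # feature indices, most informative first

# Recursive Feature Elimination around a random forest, keeping 5 features.
rfe = RFE(RandomForestClassifier(n_estimators=100), n_features_to_select=5)
rfe.fit(X, y)
selected = np.where(rfe.support_)[0]

# Boruta and ReliefF have third-party Python ports ('boruta'/BorutaPy and
# 'skrebate') that follow the same fit-then-inspect pattern; see their
# documentation for exact parameters.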
D. Phase 4 - Deep Learning Model Development

Prior to the development of the DL models, an up-sampling technique was applied, since there is an imbalance in the target classes: Class 0 has only 230 observations and Class 2 has only 294 observations, while Class 1 has 520 observations. If an up-sampling technique were not applied, then upon training the DL model would devote more learning to the majority class (Class 1) than to the minority classes (Class 0 and Class 2), which could cause the model to perform poorly when tested on new observations. After up-sampling, the dataset has a total of 1560 observations, with an equal number of observations (520) for each class. In addition, one-hot encoding was applied to features with categories that have no ordinal relationship, including the target classes. This gives the DL model better interpretability of these kinds of variables during training, by assuming that no category is greater or less than the others.

Four sequential deep learning models were developed in this phase, based on the results obtained in Phase 3. To ensure unbiased experimentation, the hyperparameters were set the same for all four DL models. Two hidden layers were used, with 53 hidden neurons in the first layer and 13 hidden neurons in the second. The rectified linear unit (relu) was used as the activation function in the hidden layers, while the softmax function was used as the activation function in the output layer. An L2 kernel regularizer with a value of 0.01 was applied to the two hidden layers. The adaptive moment estimator (Adam) was used as the model optimizer, and categorical cross-entropy as the model loss function. The learning rate was set to 0.001, the batch size to 32, and the number of epochs to 200. No early stopping or node dropout was included during DL model development.

The experiments in this fourth phase were implemented in Google Colaboratory using the Python programming language. TensorFlow was utilized as the core module to build the deep learning model. From TensorFlow, keras submodules were imported to define the architecture of the model, including Sequential, regularizers, Dense, Activation, Optimizers, and BatchNormalization. Other modules such as NumPy, pandas, matplotlib (visualization), random (for reproducibility of results), and sklearn (data partitioning and metrics evaluation) were also imported. A sample of the reusable source code used for training the deep learning model is presented in Fig. 3.

The whole dataset was partitioned into 80% for training the deep learning model and 20% for model testing and evaluation of performance metrics.

FIG 3. CODE SNIPPET FOR TRAINING DL MODELS
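The following is a minimal sketch of a model consistent with the configuration described above (53 and 13 hidden units, relu/softmax, L2 = 0.01, Adam at learning rate 0.001, categorical cross-entropy, batch size 32, 200 epochs, 80/20 split). It approximates, rather than reproduces, the authors' Fig. 3 code; n_features and the stand-in data are assumptions:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from sklearn.model_selection import train_test_split

n_features = 14   # e.g., the size of the Boruta-selected subset (assumed here)

# Stand-in for the up-sampled, encoded dataset: 1560 rows, 3 one-hot classes.
X = np.random.rand(1560, n_features).astype("float32")
y = keras.utils.to_categorical(np.random.randint(3, size=1560), 3)

# 80% training / 20% testing, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

model = keras.Sequential([
    layers.Dense(53, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01),
                 input_shape=(n_features,)),
    layers.Dense(13, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01)),
    layers.Dense(3, activation="softmax"),
])

model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# 200 epochs, batch size 32, no early stopping or dropout; the validation
# split supplies the validation accuracy reported in Table III.
history = model.fit(X_train, y_train, epochs=200, batch_size=32,
                    validation_split=0.2, verbose=0)
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)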

E. Phase 5 - Evaluation of DL Performance

The four DL models developed in the fourth phase were evaluated using the following performance metrics: training accuracy, validation accuracy, testing accuracy, kappa statistic, and f-measure. Training accuracy and validation accuracy describe whether the developed model is overfitted or not. If validation accuracy is greater than training accuracy, the developed DL model is overfitted, which means that random fluctuations in the training data are learned as concepts by the model [25].
In this case, an overfitted model is not able to generalize and has a negative impact when applied to new data. The three remaining performance metrics are obtained after the developed DL models make predictions on the testing set. A confusion matrix analysis is derived after prediction to analyze each model's performance.

The accuracy is given by the equation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (3)

where TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives, respectively.

The Kappa statistic [26] measures the interrater reliability of the developed DL models; the measure ranges from 0 to 1. It is given by the equation:

K = (Po - Pe) / (1 - Pe)    (4)

where K is the kappa coefficient, Po is the probability of correct classification (observed agreement), and Pe is the probability of random classification (the probability sum over correctly and incorrectly classified instances). Interpretations of the Kappa coefficient are as follows: 0.01 - 0.20 (none to slight), 0.21 - 0.40 (fair), 0.41 - 0.60 (moderate), 0.61 - 0.80 (substantial), and 0.81 - 1.00 (almost perfect).

A high f-measure value implies that the model has a high rate of predicting true positives and true negatives in the classification task. It is given by the equation:

F = 2 * (Precision * Recall) / (Precision + Recall)    (5)

where Recall is TP / (TP + FN) and Precision is TP / (TP + FP).
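Equations (3)-(5) and the confusion matrix all have standard implementations in scikit-learn (the sklearn module mentioned among the imports in Phase 4). A brief sketch with stand-in labels, assuming y_pred comes from argmax over model.predict():

import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, f1_score)

y_true = np.random.randint(3, size=100)   # stand-in test-set labels
y_pred = np.random.randint(3, size=100)   # stand-in model predictions

print("Testing accuracy:", accuracy_score(y_true, y_pred))        # Equation (3)
print("Kappa statistic:", cohen_kappa_score(y_true, y_pred))      # Equation (4)
print("F-measure:", f1_score(y_true, y_pred, average="macro"))    # Equation (5)
print(confusion_matrix(y_true, y_pred))   # per-class breakdown of predictions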


V. RESULTS AND DISCUSSION

The relevant features confirmed by each feature selection algorithm are summarized in Table II (their number and names). The Information Gain Based algorithm recorded 17 features with lower entropies. The ReliefF algorithm considered only 10 features important. The Boruta algorithm confirmed 14 features as relevant for the classification task. The Recursive Feature Elimination algorithm confirmed only 13 features as relevant.

TABLE II. IMPORTANT FEATURES CONFIRMED BY FS ALGORITHMS

Algorithm | Number of Features | Selected Features
Information Gain Based | 17 | school, sex, address, Pstatus, Medu, Fjob, Mjob, reason, guardian, studytime, failures, schoolsup, higher, Dalc, G1 and G2
ReliefF | 10 | G2, G1, failures, Fedu, Medu, studytime, school, reason, Fjob and Mjob
Boruta | 14 | age, Medu, Fedu, Mjob, failures, schoolsup, higher, internet, goout, Dalc, Walc, absences, G1 and G2
Recursive Feature Elimination | 13 | G2, paid, schoolsup, G1, school, higher, Pstatus, sex, failures, Fjob, traveltime, internet and famsize

Performance metrics of the four developed DL models are presented in Table III. Based on the results, none of the DL models is overfitted, since each model's validation accuracy is less than its training accuracy. Among the four developed DL models, the model using the features considered relevant by the Boruta algorithm achieved the highest testing accuracy rate, 92.9%. Its Kappa statistic of 0.891 indicates that the model is 'almost perfect'. Its f-measure of 0.922 also indicates that the DL model has a highly balanced rate of classifying true Class 0, true Class 1, and true Class 2. The model accuracy and model loss of the DL model with the Boruta algorithm are shown in Fig. 4 and Fig. 5, respectively, over 200 epochs.

TABLE III. PERFORMANCE METRICS OF DL MODELS

FS Method | Training Accuracy | Validation Accuracy | Testing Accuracy | Kappa Statistic | F-score
All Features | 1.000 | 0.890 | 0.893 | 0.839 | 0.896
ID3 | 0.990 | 0.910 | 0.898 | 0.848 | 0.898
ReliefF | 0.960 | 0.910 | 0.904 | 0.855 | 0.900
Boruta | 0.970 | 0.880 | 0.929 | 0.891 | 0.922
RFE | 0.960 | 0.920 | 0.923 | 0.884 | 0.920

FIG 4. MODEL TRAINING ACCURACY OF DL MODEL WITH BORUTA

FIG 5. MODEL TRAINING LOSS OF DL MODEL WITH BORUTA

In this case, the plot of accuracy shows that the model increased its accuracy over time, with the predicted class matching the true class, especially after the 175th epoch on the training dataset. Additionally, the plot of loss shows that the model had a smooth training process (a good learning rate), reducing the loss over time.

Additionally, for educators who want to replicate this kind of experiment in their institutions, the main steps are:
1. Collect the students' academic/personal data using web forms, interviews, or surveys;
2. Preprocess the students' data (cleaning and transforming) to preserve data quality and useful information, and then apply feature selection algorithms (Information Gain Based, ReliefF, Boruta, and Recursive Feature Elimination);
3. Code the experiments using commands/methods from machine learning libraries, defining the deep learning model (neural network architecture), compiling it, and fitting it;
4. Evaluate the potential of the generated models using performance metrics;
5. Make predictions on real data.

VI. CONCLUSION AND RECOMMENDATION

In this paper we have examined four feature selection algorithms (Information Gain Based, ReliefF, Boruta, and Recursive Feature Elimination) according to their relative effectiveness in training a deep learning model to predict students' performance, using the open-source data of two Portuguese schools. The relevant features confirmed by the Boruta algorithm helped the deep learning model obtain a high testing accuracy rate of 92.9%, with a Kappa statistic of 0.891 and an f-measure of 0.922. Comparing the four feature selection algorithms for dimensionality reduction, the Boruta algorithm was thus found to be the most effective. The deep learning model developed with the Boruta algorithm as its feature selection method may be implemented to predict students' performance at an early stage of a course, in order to identify students at risk for grade retention and/or dropout.

REFERENCES

[1] Jimerson, S. R., & Renshaw, T. L. (2012). Retention and Social Promotion. Principal Leadership Journal, September, 12-16. https://www.researchgate.net/publication/271652282_Retention_and_Social_Promotion
[2] Olivé, D. M., Huynh, D. Q., Reynolds, M., Dougiamas, M., & Wiese, D. (2018). A supervised learning framework for learning management systems. Proceedings of the First International Conference on Data Science, E-Learning and Information Systems, 1-8. https://doi.org/10.1145/3279996.3280014
[3] Teruel, M., & Alonso Alemany, L. (2018). Co-embeddings for Student Modeling in Virtual Learning Environments. Proceedings of the 26th Conference on User Modeling, Adaptation and Personalization, 73-80. https://doi.org/10.1145/3209219.3209227
[4] Castro-Wunsch, K., Ahadi, A., & Petersen, A. (2017). Evaluating Neural Networks as a Method for Identifying Students in Need of Assistance. Proceedings of the 2017 ACM SIGCSE Technical Symposium on Computer Science Education, 111-116. https://doi.org/10.1145/3017680.3017792
[5] Injadat, M., Moubayed, A., Nassif, A. B., & Shami, A. (2020). Multi-split optimized bagging ensemble model selection for multi-class educational data mining. Applied Intelligence. https://doi.org/10.1007/s10489-020-01776-3
[6] Bahrampour, S., Ramakrishnan, N., Schott, L., & Shah, M. (2016). Comparative Study of Deep Learning Software Frameworks. arXiv:1511.06435v3
[7] Cortez, P., & Silva, A. M. G. (2008). Using Data Mining to Predict Secondary School Student Performance. In A. Brito & J. Teixeira (Eds.), Proceedings of the 5th Annual Future Business Technology Conference, Porto, 5-12.
[8] TensorFlow. An end-to-end open source machine learning platform. (2019, Sept. 23). Available: https://www.tensorflow.org/
[9] Allaire, J. (2012). RStudio: integrated development environment for R. Boston, MA, 770, 394.
[10] Romanski, P., Kotthoff, L., & Kotthoff, M. L. (2013). Package 'FSelector'. URL https://cran.r-project.org/web/packages/FSelector/index.html
[11] Kursa, M. B., Rudnicki, W. R., & Kursa, M. M. B. (2020). Package 'Boruta'.
[12] Kuhn, M. (2009). The caret package. Journal of Statistical Software, 28(5).
[13] Robnik-Šikonja, M., Savicky, P., & Robnik-Šikonja, M. M. (2021). Package 'CORElearn'.
[14] Tang, J., Alelyani, S., & Liu, H. (2014). Feature selection for classification: A review. Data Classification: Algorithms and Applications, 37.
[15] Paja, W., Pancerz, K., & Grochowalski, P. (2018). Generational feature elimination and some other ranking feature selection methods. In Advances in Feature Selection for Data and Pattern Recognition (pp. 97-112). Springer, Cham.
[16] Hssina, B., Merbouha, A., Ezzikouri, H., & Erritali, M. (2014). A comparative study of decision tree ID3 and C4.5. International Journal of Advanced Computer Science and Applications, 4(2), 13-19.
[17] Kursa, M. B., & Rudnicki, W. R. (2010). Feature selection with the Boruta package. Journal of Statistical Software, 36(11), 1-13.
[18] Zhou, X. (2015). Feature selection for image classification based on a new ranking criterion. Journal of Computer and Communications, 3(03), 74.
[19] Wang, Z., Zhang, Y., Chen, Z., Yang, H., Sun, Y., Kang, J., ... Liang, X. (2016). Application of ReliefF algorithm to selecting feature sets for classification of high resolution remote sensing image. 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). doi:10.1109/IGARSS.2016.7729190
[20] Robnik-Šikonja, M., & Kononenko, I. (2003). Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53(1), 23-69.
[21] Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389-422.
[22] Mitchell, T. M. (1997). Machine Learning. McGraw-Hill International, p. 58.
[23] Cortez, P., & Silva, A. (2008). Using Data Mining to Predict Secondary School Student Performance. In A. Brito & J. Teixeira (Eds.), Proceedings of the 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008), pp. 5-12, Porto, Portugal, April 2008, EUROSIS. ISBN 978-9077381-39-7.
[24] The Portuguese Grading System. Available at: https://www.studyineurope.eu/study-in-portugal/grades
[25] Brownlee, J. (2016). Overfitting and Underfitting With Machine Learning Algorithms. Available at: https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/
[26] Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provisions for scaled disagreement or partial credit. Psychological Bulletin, 70, 213-220.

