
Received October 7, 2021, accepted October 25, 2021, date of publication October 29, 2021, date of current version November 19, 2021.


Digital Object Identifier 10.1109/ACCESS.2021.3124270

Predicting and Interpreting Student Performance Using Ensemble Models and Shapley Additive Explanations

HAYAT SAHLAOUI1, EL ARBI ABDELLAOUI ALAOUI2, ANAND NAYYAR3,4, SAID AGOUJIL1, AND MUSTAFA MUSA JABER5,6
1 Department of Computer Science, Faculty of Sciences and Techniques at Errachidia, University of Moulay Ismaïl, Meknes, Errachidia 52000, Morocco
2 Department of Sciences, Ecole Normale Supérieure, University of Moulay Ismaïl, Meknes 52000, Morocco
3 Graduate School, Duy Tan University, Da Nang 550000, Vietnam
4 Faculty of Information Technology, Duy Tan University, Da Nang 550000, Vietnam
5 Department of Computer Science, Al-Turath University College, Baghdad 10021, Iraq
6 Department of Computer Science, Dijlah University College, Baghdad 10021, Iraq

Corresponding authors: Anand Nayyar ([email protected]) and Mustafa Musa Jaber ([email protected])

ABSTRACT In several areas, including education, the use of machine learning, such as artificial neural networks, has resulted in significant improvements in prediction tasks. The opacity of these models is one of the problems with their use. Decision-makers in education prefer prediction models that offer valuable insights while still being simple to comprehend. Hence, this study suggests an approach that improves on previous student performance prediction by enhancing performance and explaining why a student attains a certain score. A prediction model was proposed and tested using machine learning models. Our models outperform previous models developed on the same dataset. Using a combined framework of data-level and algorithm-level approaches, the proposed model achieves an accuracy of over 98%, implying a 20.3% improvement compared with previous models. As a balancing technique for upsampling data, we use the default strategy of the synthetic minority oversampling technique (SMOTE) to oversample all classes to the number of examples in the majority class. We also use ensemble methods. For tuning the parameters, we use a simple grid search algorithm provided by scikit-learn to estimate the optimal parameters of our model. This hyperparameter optimization, along with a ten-fold cross-validation process, demonstrates the dependability of the novel model. In addition, a novel visual and intuitive technique is used to help determine the factors that most influence the score, which helps to interpret and understand the entire model and visualizes feature attributions at the observation level for the machine learning model. Therefore, SHAP values are a powerful tool that should be incorporated within the student performance prediction framework: with the prediction and explanation created through the experiment, educators can recognize students at risk early and provide suitable advice in a timely manner.

INDEX TERMS Ensemble methods, game theory, machine learning, student performance prediction, SHAP values.

The associate editor coordinating the review of this manuscript and approving it for publication was Shen Yin.

I. INTRODUCTION
Data mining techniques play an essential role in many application fields, such as business analytics, security analytics, financial analytics, and learning analytics. In this study, we are primarily concerned with applications of data mining in the education environment. This area of research focuses on the design and application of algorithms on educational datasets to have a good understanding of students and their educational system [1]. A primal application of educational data mining (EDM) is investigating the student learning process and predicting student performance to improve educational practices. In this context, we are attempting to approximate student performance, experience, ranking, or grade [2] by pulling out features from traditional recorded or logged data. Student performance values could be numeric in the case of a regression problem or categorical in the case of a classification problem. Currently, one of the most promising research fields of information technology is machine learning [3].
The application of machine learning in the education domain is the primary subject of our research [4]. This technology can help predict student performance and alert teachers about students at risk so as to provide them with the support they need. Therefore, the question of predicting student performance comes down to building a learning classifier: students' observed records represent the training set, and matching student historical data represent the features, with a label that represents the actual performance [5]. The final goal is to alert teachers about students at risk of dropping out and to suggest means to improve their performance and increase their retention and completion rates [6].

Linear regression models are used in many existing student performance studies. Linear models are easy to implement and interpret, but they are inaccurate in terms of modeling student grades. Studies have also addressed the shortcomings of linear models using nonlinear models, such as artificial neural networks [7]. In contrast to linear models, nonlinear models have been shown to achieve better performance [8]. However, interpreting such a model is difficult (e.g., determining which factors make student performance inefficient). This kind of predictive model answers the "how much" question, not the reason why. Therefore, the best case is to have a robust model that can also be interpreted, resulting in improved adoption [9].

Robust machine learning algorithms can usually make accurate predictions, but their well-known opaque nature hampers their adoption. The interpretability of the model corresponds to the tag or label on a pill bottle: the label needs to be transparent for easy adoption. Decision-makers in education prefer prediction models that provide useful insights, as well as ones that are easy to understand [10], [11]. This clear-cut view additionally provides quality assurance to the pipeline. Intuition helps researchers determine whether any significant errors exist when inputting or executing a model. Moreover, it is a smart way to reproduce the outcome of the prediction algorithm. Most machine learning-based projects focus on results, not interpretability. Here, we highlight a recent method for interpretability, the SHAP value, to illustrate and explain the accuracy of the model.

This paper describes the first attempt to use machine learning model interpretability in the context of student performance prediction. Previous studies [12] focused on improving the accuracy of student grade models. This paper outlines, on the one hand, the first attempt to use a combined framework of data-level and algorithm-level approaches to improve classifier performance measures, for instance, the SMOTE technique for upsampling data and bagging-based (ET) and boosting-based (XGBoost) algorithms for improving the performance of single classifiers; and, on the other hand, the use of SHAP values [13] and associated visualizations in the quest for interpretability and explainability of models in an education context. This work aims to improve performance measure values by using the ensemble method and then compare the results with previous studies that used the Jordan dataset to investigate student performance prediction. We also apply the SHAP value to explain the inner logic of the ensemble method.

In this study, we attempt to answer the following questions: i) from previous studies related to student performance, which input attributes affect student performance, which sets of algorithms are used for prediction, what are their accuracy measures, and which tools are used?; ii) can the predictive models used in student performance be improved by using ensemble methods?; iii) is there an opportunity to use these proposed models to improve the interpretability of the models, which can enhance their use by decision-makers in education?; iv) what are the reasons and key indicators for explaining student prediction models using game theory?; v) what are Shapley values and how do these values help improve prediction accuracy and model transparency?

The key contributions of this study are as follows:

General objectives:
• This research could greatly benefit school administrators in terms of their knowledge of the causes of student performance problems and in finding solutions to them.
• The results of this study could benefit officials in the ministry of education in terms of finding educational plans, solutions, programs, and activities for students at risk.
• The study could contribute to theoretical literature to extend the scope and nourish the machine learning repository and provide noteworthy insights and models to improve education across the globe.
• This study could open horizons for other researchers to conduct other studies in this field.

Specific objectives:
• The combination of machine learning models and SHAP values could overcome the challenges of the trade-off between student performance models' interpretability and complexity and improve model accuracy and transparency.
• Improving student performance model accuracy without compromising interpretability via the combination of a data-level and an algorithm-level approach:
- by solving class imbalance problems, which implies strategies such as data augmentation for the minority class, referred to as SMOTE;
- by leveraging two innovative model types, bagging- and boosting-based algorithms, in particular ET and XGBoost, which were examined for the first time in the context of student performance data;
- by using a hyperparameter optimization technique, choosing a set of optimal hyperparameters for the learning algorithm, to improve precision and avoid model overfitting. Tenfold cross validation was used to train and evaluate our models and analyze the performance of the algorithms with proper metrics.
• Augmenting our ensemble models with SHAP values [13]. In real-world machine learning-based applications, model interpretability can sometimes be more important than accuracy. Therefore, SHAP provides a more visual, intuitive, and comprehensive approach to increase the transparency of ensemble models.
• Evaluating and comparing the proposed approach with previous works: by running our set of models, we were able to validate empirically that the proposed model outperforms previous ones.
This paper is organized as follows. In Section II, related work on the prediction of the performance of university students is reviewed. Section III presents the methodology followed to show how ensemble methods are used to build a student performance model, in order to improve its performance, together with the methods that help explain the model; this process consists of four phases: a preprocessing phase, a model building phase, a model evaluation phase, and a model interpretability phase. Section IV explores the experimental protocols, the results obtained, and the discussion, providing a complete description of the variables used in predicting student grades; the focus then shifts to implementing and interpreting the model, showing the details of the experimental implementation, its accuracy in comparison with traditional models, and how to explain the model used in student performance prediction. Finally, Section V concludes the paper with limitations, suggestions, and future scope.

II. RELATED WORK
In this section, we provide a research background for our study. For this, we reviewed previous studies related to student performance prediction and tried to lay out the input attributes that affect student performance, the sets of algorithms used for prediction, and their accuracy metrics, as well as the methods used to complete this critical mission in web-based educational contexts.

A. DETERMINANT ATTRIBUTES/FACTORS THAT AFFECT STUDENT ACADEMIC PERFORMANCE
The improvement of the education system depends on the factors that affect student academic success [14]. Therefore, to predict student performance with efficiency and accuracy, studying the variables that influence student academic performance is of paramount importance. A comprehensive framework on the factors that influence student academic performance is presented. The framework includes five variable aspects that considerably influence student performance: the personal, academic, economic, social, and institutional domains, as shown in Figure 1. Each of the five domains contributes to the performance measurement of students and is made up of attributes that work individually and conjointly for learner success. However, the degree of complexity and influence of each domain on student performance varies.

FIGURE 1. Determinant factors of student academic performance.

1) PERSONAL FACTORS
Personal factors include student behavior, such as feelings, thoughts, or actions, as well as students' demographics, such as age and gender, which affect their performance. Age and gender are the most often used factors for prediction because they are considered internal factors of variability, which are simple to define and measure. The researchers in [15]–[17] looked at how psychometric factors tend to affect the performance of students.

2) ACADEMIC FACTORS
These factors refer to indicators that explain the success or failure of students in the academic track in universities (e.g., the score a student obtains). According to [18], the cumulative grade point average (CGPA) is the most important attribute that has been frequently used because of its huge effect in shaping education's future. In [19]–[21], the authors considered student GPA. In [22], the authors disclosed that previous academic performance and parents' educational background are the most important attributes in predicting the future academic performance of a student. Others looked into the effect of previous academic achievement in determining the performance of students in the future [23].

3) FINANCIAL FACTORS
Financial factors refer to the financial ability of the parents to finance their children's education and shape their future career. Research on these factors focused on the socioeconomic status of the student [24]. Some researchers studied the correlation between academic performance and parents' educational level and income [25].

4) FAMILY FACTORS
Family factors are related to parents' educational background and their ability to provide educational assistance to their children and create a propitious environment for learning. In [26], results revealed that the type of school does not influence student performance, but the parent's job plays a critical role in predicting grades. In [22], the authors found that previous academic performance and parents' educational background are the most important attributes in predicting a student's future academic performance. Some studies tend to analyze the influence of parents' education background and income on academic performance [25].


5) INSTITUTIONAL FACTORS
The factors that correspond to this category relate to the academic program and the resources that the institution allocates for the best academic performance of its students. The authors of [15]–[17] looked at how psychometric factors tend to influence the performance of students.

B. DIFFERENT ALGORITHMS IN DEVELOPING A MODEL TO PREDICT THE ACADEMIC PERFORMANCE OF STUDENTS
The above review has shown that different researchers have used different algorithms in building a model to predict the academic performance of students. Among these approaches, the most frequently used ones are decision trees, linear regression, naïve Bayes classifiers, support vector machines, and neural networks. A few studies have used a coalition of classifiers in the quest to improve predictions [27]–[29]. Many of the studies reviewed tended to choose the optimal algorithm by trying a set of machine learning approaches. A decision tree consists of a series of rules organized in a stratified structure. Most researchers used this technique because it is straightforward; that is, it can be changed into numerous classification rules. The decision tree algorithm is widely used in predictive modeling because it is suitable for large and small amounts of data [1]. It also handles numeric and categorical variables, does not require lengthy data preparation, and can be easily understood and interpreted by users because it is based on simple rules. The survey by [30] showed that decision tree models are easy to understand due to their inner logic process and can be converted directly into if-then rules. Among decision tree algorithms, C4.5, CART, and ID3 were found to be the best classifiers for predicting student performance [5]. In [12], some well-known decision tree algorithms, such as C4.5 and CART, were used to predict students' final grades using their log data extracted from the Moodle system. Meanwhile, [31] used the ID3 and C4.5 algorithms on students' internal marks to identify students who are likely to fail. Experiments were conducted on a dataset of 200 engineering students, and their performance was predicted by applying a decision tree; the results revealed that the accuracy of the classifiers was 98.5%. A comparative study of classification techniques to predict academic performance was conducted by [32]. This study used the data of 350 students. The result indicated that, out of the classifiers, J48 achieved an accuracy of 97% [32], [33].

Neural networks are the most established model used in educational data mining. Neural networks function similarly to the human brain by linking neurons, or nodes, that collaborate to produce an output task [34]. Neural network algorithms are also widely used by researchers because they are well suited to large amounts of data and can be applied to structured and unstructured data; however, they are difficult to interpret and require more execution time for training. A neural network was used in [35] to predict student performance; results indicated that the model achieved an accuracy of 83.4%. Arsad et al. [36] used an artificial neural network algorithm to predict the performance of undergraduate students in the field of engineering. This work discovered that undergraduate subjects have a major effect on the final CGPA after graduation. In the work of [19], it was pointed out that, with the SMOTE method on an imbalanced dataset, classification models for neural networks and naïve Bayes had an accuracy of 75%; when using the discretization method, the models for naïve Bayes and neural networks achieve almost identical levels of accuracy.

The authors in [37] built a model to predict the GPA of students using the naïve Bayes technique and K-means clustering. Their models produced an accuracy of 98.8%. By contrast, [38] indicated that a naïve Bayes classifier using the cfsSubsetEval feature selection technique on a dataset of 257 academic records had an accuracy of 84%.

In [16], the support vector machine was found to have a decent generalization potential and to be quicker than other approaches. Meanwhile, an analysis performed in [17] for the purpose of recognizing students at risk of failure obtained the best accuracy in model prediction by using a support vector algorithm. In his academic work, [39] applied a support vector machine to predict which students would pursue doctoral studies; results revealed that the method achieved an accuracy of 96.7%, while the authors in [40] revealed that the polynomial kernel technique achieved 97.62% accuracy. The support vector machine algorithm is the least favored by researchers because it is only suitable when the dataset is small and the algorithm is not transparent. Its main disadvantage is the complexity of the model. Such complexity contradicts the general requirement that models in intelligent learning systems should be transparent. In addition, selecting proper kernel functions and other parameters is difficult, and different experiments have to be conducted empirically.

Ensemble methods were introduced by Breiman, Freund, and Schapire in 1984. The most popular algorithms, such as arcing [41], boosting [42], bagging [43], and random forests [44], have gained keen interest in the literature. Ensemble methods have two major families. The first is called homogeneous because it combines similar algorithms, such as multiple classification trees or multiple SVMs; bagging [43] and boosting [42] are examples of such methods. The second is called heterogeneous because it combines algorithms of a different nature (e.g., classification trees, SVMs, and neural networks); the most popular method here is stacking [45].

In his paper, [5] applied the ensemble method to improve the results of the best-selected classifier, J48, in predicting student performance. The results revealed a huge improvement; that is, the proposed model using ensemble methods could reach an accuracy of up to 98.5%. By contrast, [12] showed that the proposed algorithms (i.e., bagging and boosting) obtained an accuracy of over 80%, confirming the soundness of the approach. To address the student performance prediction problem, various techniques that provide accurate results are being identified.


Recently, many academic papers (as shown in Table 1) have elaborated on the proper features that affect student performance, and all of them agreed that all models can provide reasonable results when predicting student performance. However, they cannot identify a technique that is clearly superior because the accuracy of the prediction depends mainly on the context, data, parameters, and hyperparameters of the technique. Any potential alternative must consider these factors.

TABLE 1. Attributes, algorithms, and data mining techniques frequently used to predict students' academic performance.

C. TOOLS USED
The most popular tools are WEKA and SPSS Modeler, most likely due to their automation and versatility features in predictive analytics [46]. In addition, R and Python packages are used due to their mature and supportive communities and hundreds of open-source libraries and frameworks [47].

D. ADDITIONAL VALUE OF THE SUGGESTED APPROACH
By scouting the literature regarding student performance prediction, we found a gap specific to student performance improvement: how to boost model performance without neglecting to clear out model opacity, so that models remain easy for educators to understand, and how to provide guidance on using such models to improve school administrators' decision-making skills and their confidence in the scores calculated by the ensemble models, thereby achieving better model adoption, for better interpretability leads to better acceptance [9]. This study is the first to examine two novel types of models on student performance data (i.e., XGBoost and ET) and use them to boost the predictive accuracy of student performance models. In contrast to previous machine learning models, the algorithms used are scalable, provide accurate results, and require less execution time for training. Moreover, they do not require a huge amount of training data, in contrast to other robust models such as deep learning. We also solved the problem of class imbalance, which implies strategies such as data oversampling for the less represented class, known as the SMOTE technique; as a balancing technique, SMOTE was used to oversample all classes to the number of examples in the majority class. To improve precision and avoid model overfitting, we used a hyperparameter optimization technique by choosing a set of optimal hyperparameters for the learning algorithm: a simple grid search algorithm provided by scikit-learn was used to estimate the optimal parameters of our models (the exact settings are given in Section III). Ten-fold cross validation was used to train and evaluate our models; cross validation is mainly used to gauge the capabilities of a model on new data. We also analyzed the performance of the algorithms with proper metrics, such as precision and recall, which capture information that accuracy cannot. When considering models for predicting student performance, misclassifying students at risk as "not at risk" has more consequences than the opposite case; thus, accuracy may not be the right choice as a performance evaluation technique here.

This paper describes the first attempt to use the interpretability of machine learning models in the context of student performance prediction. Previous studies [12] focused on improving the accuracy of student grade models. This paper outlines the first attempt to use SHAP values [13] and associated visualizations in the quest for the interpretability and explainability of models in the education context, by augmenting our ensemble models with SHAP values [13]. In real-world machine learning-based applications, model interpretability can sometimes be more important than accuracy. Therefore, SHAP provides a more visual, intuitive, and comprehensive approach to increase the transparency of ensemble models, helping to interpret and understand the entire model and to visualize feature attributions at the observation level for any machine learning model (the approach is model agnostic). SHAP also supports outlier and missing value detection and variable selection, which can help in discovering the root cause of a decrease in model performance and in proposing changes that might redress the issues. These capabilities augment the trustworthiness of the ensemble models and lead to their better adoption [9]. Finally, the suggested models were assessed and compared with previous work: by running our set of models, we could empirically validate the performance of our student performance system against previous models.

III. RESEARCH METHODOLOGY
In this section, we show how ensemble methods are used to build a student performance model, as well as the methods that help explain the model.

A. SYSTEM MODEL
Figure 2 depicts the four stages of this operation: the preprocessing phase, the development phase, the appraisal phase, and the interpretability phase. The first phase begins with a data preprocessing step in which the collected data are converted into a suitable format. The discretization process is then used to translate student performance from numerical values to nominal values that correspond to the class tags of the classification problem. In this process, we divided the dataset into three nominal intervals depending on the student's average grade: the values of the low performers range from 0 to 69, those of the middle performers from 70 to 89, and those of the high performers from 90 to 100. The post-discretization dataset includes several low-performing, average-performing, and high-performing students.
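To make the discretization step concrete, the following sketch bins a numeric average grade into the three nominal classes with pandas; the column name and toy values are our own illustration, not taken from the paper's released code:

    import pandas as pd

    # Hypothetical numeric grades; thresholds follow the text above:
    # 0-69 -> low (L), 70-89 -> middle (M), 90-100 -> high (H)
    df = pd.DataFrame({"avg_grade": [45, 72, 95, 88, 60]})
    df["performance"] = pd.cut(
        df["avg_grade"],
        bins=[0, 69, 89, 100],
        labels=["L", "M", "H"],
        include_lowest=True,
    )
    print(df)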

Afterward, we solved the problem of class imbalance, which implies strategies such as data oversampling for the less represented class, known as SMOTE, by balancing the classes in the training data before inputting the data to the machine learning algorithm. This process plays a vital role in improving the accuracy measures. The original training set contained 480 instances, of which 127 belonged to the low-performing class, 211 to the average performers, and 142 to the high performers. As a balancing technique, SMOTE was used to oversample all classes to the number of examples in the majority class.
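A minimal sketch of this balancing step, assuming the imbalanced-learn library (the paper only states that SMOTE's default strategy was used) and assuming X_train and y_train come from the preprocessing step:

    from collections import Counter
    from imblearn.over_sampling import SMOTE

    # The default sampling strategy oversamples every class except the
    # majority class up to the majority count (211 in the original set).
    sm = SMOTE(random_state=7)
    X_res, y_res = sm.fit_resample(X_train, y_train)

    print("before:", Counter(y_train))  # e.g. Counter({'M': 211, 'H': 142, 'L': 127})
    print("after: ", Counter(y_res))    # e.g. Counter({'M': 211, 'H': 211, 'L': 211})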
Afterward, supervised classification algorithms were used to train models to learn the mapping function from the input features to the output labels, by leveraging two novel model types related to student performance data (i.e., XGBoost and ET). These models were used to boost the predictive accuracy of student performance models. The testing set was used to test the prediction performance of the models trained by the different algorithms. To improve precision and avoid model overfitting, we used a hyperparameter optimization technique by choosing a set of optimal hyperparameters for the learning algorithm. Ten-fold cross validation was used to train and evaluate our models. For tuning the parameters, we used a simple grid search algorithm provided by scikit-learn to estimate the optimal parameters of XGBoost (booster = dart, n_estimators = 1000, learning_rate = 0.1, nthread = 4, objective = binary:logistic, eval_metric = mlogloss, eta = 0.7, gamma = 4, max_depth = 9, min_child_weight = 1, max_delta_step = 0, subsample = 0.8, colsample_bytree = 0.8, silent = 1, seed = 7, base_score = 0.7). For the ET, the optimized parameters were (n_estimators = 300, max_features = 2).
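A sketch of this tuning procedure with scikit-learn's GridSearchCV; the grids below are small illustrative subsets around the reported optima, not the exact grids searched in the paper, and X_res/y_res denote the SMOTE-balanced data from the previous step:

    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.preprocessing import LabelEncoder
    from xgboost import XGBClassifier

    y_enc = LabelEncoder().fit_transform(y_res)   # XGBoost expects integer labels
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)

    # Extra Trees: the reported optimum was n_estimators=300, max_features=2
    et_search = GridSearchCV(
        ExtraTreesClassifier(random_state=7),
        param_grid={"n_estimators": [100, 300, 500], "max_features": [2, 4, "sqrt"]},
        cv=cv, scoring="accuracy")
    et_search.fit(X_res, y_enc)
    print(et_search.best_params_, et_search.best_score_)

    # XGBoost: a few of the reported settings, searching over two of them
    xgb_search = GridSearchCV(
        XGBClassifier(booster="dart", n_estimators=1000, learning_rate=0.1,
                      subsample=0.8, colsample_bytree=0.8, random_state=7),
        param_grid={"max_depth": [3, 6, 9], "gamma": [0, 4]},
        cv=cv, scoring="accuracy")
    xgb_search.fit(X_res, y_enc)
    print(xgb_search.best_params_, xgb_search.best_score_)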

We recorded the accuracy, precision, and recall values for model comparison. Afterward, we focused on model interpretation. For this, we used the SHAP value to explain the inner logic of the ensemble methods. As indicated by the Shapley value, every time we provide data as input to the interpreter and the predictor, the predictor model supplies the accuracy value, whereas the interpreter depicts the effect of negative and positive features. As a visualization technique, the SHAP value provides a view of the prediction model's functionality and increases model transparency. Specifically, the SHAP force visualization provides a clear view of the student performance features that affect the score, which nontechnical users can interpret immediately [10].

TABLE 2. Attributes, algorithms, and data mining techniques frequently used to predict students' academic performance.
B. DATA PROCESSING
In this step, preprocessing for the purpose of balancing the data was conducted using SMOTE. SMOTE is a technique that involves creating new samples for the minority class. In this method, a new sample is built from a sample x_i of the minority class by finding the k closest neighbors of x_i within the minority class. Then, a neighbor is randomly picked, and a synthetic sample that combines x_i and the selected neighbor is finally created [48]–[50]. Algorithm 1 clearly depicts this operation. Several other SMOTE techniques have emerged from this strategy, e.g., weighted SMOTE [51], SMOTEBoost [52], and borderline SMOTE [53]; these techniques were not the subject of this study but can be the subject of future research.

Algorithm 1 SMOTE Algorithm
Input: N: number of minority class instances; n: amount of SMOTE (in %); k: number of nearest neighbors; minority data D = {x_i ∈ X, i = 1, 2, ..., N}.
Output: D′: synthetic data generated from D.
n = (int)(n / 100)
for i = 1 to N do
    Find the k nearest neighbors of x_i
    while n ≠ 0 do
        Choose one of the k nearest neighbors of x_i, denoted x̂
        Choose a random number α ∈ [0, 1]
        x_new = x_i + α (x̂ − x_i)
        Append x_new to D′
        n = n − 1
    end while
end for
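For reference, a direct Python transcription of Algorithm 1 might look as follows; this is a sketch of the textbook procedure, not the implementation used in the experiments:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote(D, n_percent=100, k=5, random_state=0):
        """Return synthetic samples built from the minority data D (Algorithm 1)."""
        rng = np.random.default_rng(random_state)
        n = n_percent // 100                      # synthetic samples per original point
        nn = NearestNeighbors(n_neighbors=k + 1).fit(D)
        _, idx = nn.kneighbors(D)                 # idx[:, 0] is the point itself
        synthetic = []
        for i in range(len(D)):
            for _ in range(n):
                j = rng.choice(idx[i, 1:])        # pick one of the k nearest neighbors
                alpha = rng.random()              # random number in [0, 1]
                synthetic.append(D[i] + alpha * (D[j] - D[i]))
        return np.asarray(synthetic)

    # Example: 4 minority points in 2-D, 200% SMOTE with k=3
    D = np.array([[0.0, 0.0], [0.5, 0.2], [1.0, 0.1], [0.2, 0.8]])
    print(smote(D, n_percent=200, k=3).shape)     # (8, 2)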

C. MODEL CONSTRUCTION
In this section, we apply supervised classification algorithms to train models to learn the mapping function from the input features to the output labels. The testing set was used to test the prediction performance of the models trained by the different algorithms. To improve precision, we used ten-fold cross validation to train and evaluate our models. We experimented with the classification algorithms described below:

1) K-NEIGHBOR CLASSIFIER (KNC)
The classification of a new sample is given by the most common classification among the K neighbors closest to that sample in our dataset, with K a positive integer provided along with the new sample.


FIGURE 2. Global ensemble-based system for the classification of students’ academic performance.

2) DECISION TREES (DT)
Decision trees are rule-based, tree-inspired structured algorithms that represent features like a tree [54]. Each node represents a feature, each connection between nodes represents a decision rule, and the output is represented by the leaf nodes. DTs can also be seen as a flow chart, where the flow starts at the root node and ends with the predictions made at the leaves. A decision tree is a decision support tool, often viewed as a tree-like diagram displaying the predictions that result from a sequence of feature-based divisions [55].

3) RANDOM FOREST (RF)
The random forest classifier [44] (also referred to as a forest of decision trees) is a classification algorithm that trims the risk of overfitting by including randomness. Such trimming is performed by combining multiple decision trees, creating samples with replacement, and dividing nodes according to the best split using a random subset of features. The RF classifier was introduced by Leo Breiman and Adele Cutler in 2001 and is considered one of the most efficient algorithms, requiring little data preprocessing.

4) BAGGING CLASSIFIER
Bagging classifiers, also known as bootstrap aggregation, were introduced in 1996 by Breiman [43] to fix the problem of CART instability. Bagging is based on the majority vote of learner classifiers for classification, or on their average for regression.

5) EXTRA TREE CLASSIFIER
Extra Tree (ET) classifiers, also known as extremely randomized trees [56], are classification algorithms that reduce the risk of over-learning from the data by including randomness. Such reduction occurs by combining multiple decision trees, creating samples without replacement, and dividing nodes according to a random split using a random subset of features. Let a training dataset L = {(x_1, y_1), ..., (x_n, y_n)}, where the sample x_i = {f_1, f_2, ..., f_D} is a D-dimensional vector with f_j as a feature and j ∈ {1, 2, ..., D}. ET creates M independent decision trees. For every decision tree, S_p indicates the subset of the training dataset L at child node p; afterward, for every node p, the ET algorithm selects the best split with regard to S_p. The ET algorithm is described below:

Algorithm 2 Extra Tree
Input: training subset S_p = {s_1, s_2, ..., s_{Q_p}}, where the sample s_i = {f_1, f_2, ..., f_D} is a D-dimensional vector; the number of attributes to select randomly, K; and the minimum number of samples required to split a node, n_min.
if Q_p < n_min then
    Stop splitting and define the node as a leaf node.
else
    Select a random subgroup of K features.
end if
for every feature k in the subgroup do
    Identify f_k^max and f_k^min as the maximal and minimal values of feature k in subset S_p
    Obtain a random cut-point f_k^c uniformly in the range [f_k^min, f_k^max]
    Set [f_k < f_k^c] as a candidate split
end for
Select the split [f_* < f_*^c] whose score is the best among the K candidates, i.e., score(f_*^c) = max_{k=1,...,K} score(f_k^c).
Output: the best split [f_* < f_*^c] at the child node p.
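In practice, this splitting rule does not need to be implemented by hand: scikit-learn's ExtraTreesClassifier realizes Algorithm 2 internally. A minimal sketch with the parameters reported later in this paper (n_estimators=300, max_features=2); X_res and y_res are the balanced training data from the preprocessing phase:

    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import cross_val_score

    et = ExtraTreesClassifier(n_estimators=300, max_features=2, random_state=7)
    scores = cross_val_score(et, X_res, y_res, cv=10, scoring="accuracy")
    print(f"10-fold accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")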


6) XGBoost (XGB)
The extreme gradient boosting (XGBoost) algorithm is similar to the gradient boosting algorithm; however, it is more efficient and faster because it combines a linear model and a tree model, and it can perform parallel calculations on a single machine. The trees in a gradient boosting algorithm are constructed in series so that a gradient descent step can be performed to minimize a loss function. Unlike RF, the XGBoost algorithm builds each tree itself in a parallel fashion: essentially, the information contained in each column can have statistics computed on it in parallel. The importance of the variables is calculated in the same way as for RF, by computing and averaging the values by which a variable decreases the impurity of the tree at each step.

In accordance with [57], at the t-th iteration, the objective function of XGBoost can be represented as

\Theta^{(t)} = \Phi^{(t)} + \Omega^{(t)} = \sum_{i=1}^{n} \Phi\big(y_i, \hat{y}_i^{(t)}\big) + \sum_{k=1}^{t} \Omega(f_k)    (1)

where n is the number of training instances and the prediction \hat{y}_i^{(t)} can be represented as

\hat{y}_i^{(t)} = \sum_{m=1}^{t} f_m(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)    (2)

The regularization term \Omega(f_k), added to the convex differentiable loss function, is defined by

\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2    (3)

The term \Omega is interpreted as a combination of ridge regularization with coefficient \lambda and Lasso-style penalization with coefficient \gamma.

FIGURE 3. XGBoost algorithm procedure.
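To make the comparison between the conventional and ensemble classifiers concrete, here is a sketch of the cross-validated training loop over the models described in this section; the baseline settings are scikit-learn defaults (an assumption on our part), and y_enc is the integer-encoded label vector from the tuning sketch above:

    from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                                  RandomForestClassifier)
    from sklearn.model_selection import cross_validate
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from xgboost import XGBClassifier

    models = {
        "KNC": KNeighborsClassifier(),
        "DT": DecisionTreeClassifier(random_state=7),
        "RF": RandomForestClassifier(random_state=7),
        "Bagging": BaggingClassifier(random_state=7),
        "ET": ExtraTreesClassifier(n_estimators=300, max_features=2, random_state=7),
        "XGB": XGBClassifier(booster="dart", n_estimators=1000, learning_rate=0.1,
                             max_depth=9, gamma=4, subsample=0.8,
                             colsample_bytree=0.8, random_state=7),
    }
    for name, model in models.items():
        res = cross_validate(model, X_res, y_enc, cv=10,
                             scoring=["accuracy", "precision_macro", "recall_macro"])
        print(name, {k: round(v.mean(), 4) for k, v in res.items()
                     if k.startswith("test_")})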
values and how they help improve prediction accuracy and
model transparency [9]. SHAP values refer to Shapley values,
a game theory jargon [59]. These values consist mainly of two
components: game and players, where ‘‘game’’ represents
the outcome of the predictive model and ‘‘players’’ represent
the features of the model. Shapley measures the contribution
value each player makes to the game. By applying these
components to our case, the SHAP value measures how much
the feature is contributing to the output of the model. The
contribution of an the individual player is determined by
considering all possible coalitions of players, that is, all i
feature combinations possible (i goes from 0 to n, where n
is a total number of features available). The order of adding
the feature to the model is important and influences the pre-
diction. Lloyd Shaply proposed this approach in 1953 (hence
FIGURE 3. XGBoost algorithm procedure. the name ‘‘Shapley values’’ for phi values measured in this

E. MODEL INTERPRETATION
In this part, we examine one approach, namely game theory, to interpret the ensemble methods used in predicting student performance in Jordan. We introduce the reasons and key indicators for explaining student prediction models using game theory, and we define Shapley values and how they help improve prediction accuracy and model transparency [9]. SHAP values refer to Shapley values, a game theory term [59]. These values consist mainly of two components, the game and the players, where the "game" represents the outcome of the predictive model and the "players" represent the features of the model. The Shapley value measures the contribution each player makes to the game. Applying these components to our case, the SHAP value measures how much each feature contributes to the output of the model. The contribution of an individual player is determined by considering all possible coalitions of players, that is, all possible combinations of i features (with i going from 0 to n, where n is the total number of features available). The order in which features are added to the model is important and influences the prediction. Lloyd Shapley proposed this approach in 1953 (hence the name "Shapley values" for the phi values measured in this manner). Given a prediction p (the prediction made by the complex model), the Shapley value for a specific feature i (out of n total features, where S is a subset of the feature set excluding i) is [9]:

\phi_i(p) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n - |S| - 1)!}{n!} \big[ p(S \cup \{i\}) - p(S) \big]    (7)
This equation helps calculate the importance, or influence, of a feature by computing the difference in model prediction with and without feature i. The shift in the model estimate is essentially the feature's consequence. In summary, SHAP values are used whenever we have an input-output model and want to understand the decision made by this model, that is, why the model suggested that a student was more or less likely to pass. The predictive model answers the "how much" question, and the SHAP value identifies the reason why [59]. Thus, we used SHAP values to interpret the model; improved interpretability leads to improved acceptance [9].

Robust machine learning algorithms can usually make accurate predictions, but their opaque nature hampers their adoption. The interpretability of the model corresponds to the tag or label on a pill bottle: the label needs to be transparent for easy adoption. Shapley values and the SHAP library are a valuable toolkit for uncovering the golden nuggets that a machine learning algorithm has discovered. SHAP concerns the local interpretability of a predictive model. In particular, by considering the effects of features at the individual level, instead of over all individuals in the entire dataset and then summing the results, the interactions of features can also be revealed, making it possible to gain more impressive knowledge than with techniques for global feature importance [60].

As a visualization technique, SHAP values provide a view of the prediction model's functionality and increase the transparency of the model. Specifically, the SHAP force visualization provides a clear view of the student performance features that affect the score, which nontechnical users can interpret immediately [10].
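A sketch of how these phi values are obtained in practice with the shap library; TreeExplainer is the standard choice for tree ensembles, and the fitted model et and held-out features X_test are assumed from the previous sections:

    import shap

    explainer = shap.TreeExplainer(et)
    shap_values = explainer.shap_values(X_test)   # one set of phi values per class

    # The expected value is the baseline p(S) with no features (one entry per class)
    print(explainer.expected_value)

    # Phi values for the first student and the first class: one number per feature;
    # together with the baseline they sum to that student's predicted probability.
    print(shap_values[0][0])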
IV. EXPERIMENTS AND RESULT ANALYSIS
This section examines the settings and protocols of the experiments; the results obtained are then analyzed and discussed. First, we lay out a complete description of the variables used in predicting the performance of students; afterward, we focus on model implementation and interpretation. We define the experimental procedure used to test and validate the proposed models for predicting student performance, evaluate the results based on the training and cross-validation scores of the different machine learning methods, and record the accuracy, precision, and recall values for model comparison. The evaluation goals are:
• Performance: how performance improves by using ensemble methods in comparison with previous studies that used the Jordan dataset for student performance prediction.
• Interpretability: whether the SHAP value can be used to explain the inner logic of the ensemble methods; the predictor model supplies the accuracy value, and the interpreter depicts the effect of negative and positive features.

A. DATA DESCRIPTION
This part describes the open-source data retrieved from the Kaggle repository (https://www.kaggle.com/aljarah/xAPI-Edu-Data) in connection with the study made by [12]. These data are leveraged to create a scalable student performance model. The educational dataset used in this work consists of 480 student observations with 17 features. The full description of the variables used to predict a student's performance in the Jordanian school system is given in Table 4. These variables consist of personal features, such as gender and place of origin; educational background features, such as class participation, digital resources used, and student absence days; institutional features, such as education level and grades; and social features, such as the parent in charge of the student, the parent who responded to the survey, and the parents' review of the school. The Jordanian dataset is used with the same variables chosen by the previous study to build traditional and ensemble predictive models. By running the different models, the performance of the student performance system can be empirically assessed and compared with previous working models. All code, records, and full documentation on how to reproduce the results are publicly available in the GitHub repository.

TABLE 4. Description of the variables used in predicting the performance of students.
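A sketch for loading and encoding this dataset; the file name follows the Kaggle distribution, and the one-hot encoding step is our assumption:

    import pandas as pd

    df = pd.read_csv("xAPI-Edu-Data.csv")
    print(df.shape)                        # (480, 17)
    print(df["Class"].value_counts())      # M: 211, H: 142, L: 127

    # One-hot encode the categorical predictors; 'Class' is the target label
    X = pd.get_dummies(df.drop(columns="Class"))
    y = df["Class"]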
C. RESULT ANALYSIS
B. DESCRIPTION OF EXPERIMENTAL PROTOCOL
This section explains the experimental procedure for testing and evaluating the performance of the proposed models for student performance prediction. The implementations were conducted on a Windows framework using Python 3.7 and primarily the sklearn library. The machine in use is an "HP" model with the following configuration: 16 GB of RAM, an Intel Core i7 CPU, and an NVIDIA GeForce 930M graphics card.

C. RESULT ANALYSIS
In this section, we used a tenfold cross-validation approach to train our models. Considering that validation is a critical phase in building practical predictive models, we mainly used two groups of machine learning algorithms to build our student performance prediction model: conventional methods (decision tree and K-neighbor classifiers) and ensemble methods, for which we applied two frameworks (XGBoost and extra trees). We stored the basic performance metrics: accuracy, precision, and recall. Table 5 shows the summary of the results on the validation and test sets. We found that the ensemble methods, i.e., ET and XGBoost, always obtain the best results, whereas the traditional methods always yield the lowest results through validation and testing.

TABLE 5. Resultant analysis for various classifiers.

The boosting and bagging methods beat the other traditional stand-alone methods; that is, the boosting methods improve the accuracy from 76.7% to 99.47%, indicating that the number of students accurately classified increased from 367 to 477 out of a total of 480 students. The recall results increased from 77.097% to 99.47%, indicating that, out of the pool of misclassified and correctly classified students, 477 students were ranked correctly. The precision scores also climbed from 77.37% to 99.37%, meaning that, out of the total of 480 students, 476 students were classified correctly and four were classified incorrectly. The performance of the boosting and bagging classifiers is not significantly different. As shown in Table 5, the prediction models achieved accuracy and precision over 98%. These results were obtained via the use of strategies and techniques such as SMOTE, hyperparameter optimization, and a cross-validation process, which demonstrate the dependability of the novel model. These values show that we can predict student performance and enhance the prediction by using ensemble methods.

1) CONFUSION MATRIX
The overall accuracy of the boosting and bagging methods on the test and validation sets was 0.998 (accurate predictions/total, or true positives/total). Nevertheless, the confusion matrix provides additional insight into accuracy by class and into precision and recall efficiency. One insight we can obtain from the matrix is that the ensemble models were accurate at classifying low and average performers, with an accuracy of TP/total = 1.0; however, accuracy for high performers was slightly lower (0.99). If, for any rational motive, successful classification of a particular student class is especially important for the case at hand, then the confusion matrix helps outline differences between the classes and identify which classifier performs better. The boosting and bagging classifiers beat the other traditional individual classifiers. For instance, the accuracy by class for the boosting classifier increased from 77 to 100 for the average performers, from 88 to 100 for low performers, and from 75 to 99 for high performers.

FIGURE 4. Confusion matrix.

2) MODEL INTERPRETATIONS
Ensemble methods, such as ET and XGBoost, can attain good accuracy rates; however, they are difficult for people to comprehend because these models are generally viewed as a black box. After we introduce SHAP values, however, these models can be explained in a steady manner. As a great way to increase the transparency of ensemble models, the SHAP framework provides global and local interpretation. Global interpretation uses the feature importance plot and the summary plot as visualizations. By contrast, local interpretation uses a force plot to show the individual SHAP values for a single observation and visualizes feature attributions at the observation level to provide more accurate explanations and understanding of model decisions.

The idea behind SHAP feature importance is simple: feature importance values are obtained by calculating the average of the SHAP values of each feature. A feature importance plot ranks the most relevant features in decreasing order. The top features contribute more to the predictive power of the model than the bottom ones and thus have a huge predictive effect. The feature importance of the ensemble models for the different classes of student performance is plotted in a traditional bar chart, as illustrated in Figure 5.

FIGURE 5. Traditional feature importance plots.
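A sketch of the two global visualizations discussed here, reusing the explainer from the earlier sketch; the class ordering in class_names is an assumption:

    import shap

    # Global feature importance: mean |SHAP value| per feature, as a bar chart
    shap.summary_plot(shap_values, X_test, plot_type="bar",
                      class_names=["L", "M", "H"])

    # Summary (beeswarm) plot for one class: effect direction and magnitude
    shap.summary_plot(shap_values[0], X_test)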
Based on Figure 5, student absences have a higher effect than the other features, indicating that a change in this feature has a more noticeable influence than changes in the others. If student absences are higher, then student performance is reduced accordingly; however, this feature has a stronger negative effect for low and average performers than for high performers. If we look at the visited resources feature, we find that the more resources visited, the more likely student performance improves. A similar result is found for class participation: the more a student participates, the higher the student performance.
These latter features have a higher positive impact for all classes of performers. By contrast, features such as stage ID, section ID, and semester do not have a huge effect on student performance for any class of performers, indicating that changes in these features do not have a noticeable influence on the model prediction. As far as the summary plot is concerned, the y-axis lists the features, the x-axis depicts the Shapley value, and the color shows the level of the feature value (red means high, whereas blue means low); the higher the SHAP value, the larger the effect. At a global level, the summary plot provides feature importance together with feature effects. The features are ranked in accordance with their predictive power, and the plot shows which features affected the model the most, how changes in their values affect the model's prediction, how much each feature contributes, either positively or negatively, to the model output, and which variables strongly correlate with the target variable, making SHAP a tool of great benefit in variable selection.

The summary plot shows that student absence days, parent in charge, and raised hands are the most important model features, because the values of these features (high and low) are strongly correlated with high and low SHAP values. By contrast, other model features, such as stage ID, section ID, and semester, are less important, as their corresponding SHAP values are closer to zero and thus have less effect on the model. Looking at the first rows of the summary plot, we can tell that high values of attributes such as student absence days, raised hands, and visited resources have a strong positive effect on student performance ("high" comes from the red color, and "positive" is shown on the x-axis). When the number of student absence days is under seven, the feature has a positive effect on student performance. Class participation also has a positive effect on student performance. Furthermore, the digital resources used feature is positively related to student performance. In terms of which parent is in charge, a negative effect on student performance is found when the dad is in charge (negative SHAP value). This finding is consistent with human intuition. This plot is a great tool for obtaining an improved understanding of how certain features affect the model decision. To obtain a deeper understanding of our model, we use other SHAP tools, such as force plots, which provide feature contributions for a specific observation.

FIGURE 6. Summary plots.
In this study, the SHAP value of a feature for a single record prediction is acquired from the contribution of that feature towards the prediction. The force plot in Figure 7 highlights the features responsible for predicting student performance and the features driving the model output from the baseline value to the actual output. Features that have more predictive control are shown in red, whereas features that have lower predictive control are shown in blue. The output value is the prediction with features, whereas the base value is the value that would be predicted without any features, that is, the mean prediction. This chart can inform decision-makers in education about the key features responsible for student performance at the observation level. In the plot, each feature value is a force that either increases or decreases the prediction starting from the baseline. For instance, here the baseline is 0.5937 and the actual prediction is 0.98: the student absence days and visited resources features increase the prediction, whereas the parent in charge and announcements view features decrease the final output prediction.

FIGURE 7. Force plot depicting feature contribution towards a single prediction.

152700 VOLUME 9, 2021


H. Sahlaoui et al.: Predicting and Interpreting Student Performance

FIGURE 7. Force plot depicting feature contribution towards a single prediction.
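
A force plot such as the one in Figure 7 can be generated from the same explainer. The following sketch is illustrative and reuses the placeholder names explainer, shap_values, and X from the previous snippet, with an arbitrary record index i.

import shap

# Minimal sketch: force plot for a single student record (index i).
i = 0                  # arbitrary record index, for illustration only
shap.initjs()          # enables the JavaScript renderer in notebooks

# The base value is the mean prediction; red and blue forces push the
# output from the base value towards the actual prediction for record i.
# For a multiclass model, select one class, e.g.,
# explainer.expected_value[c] and shap_values[c][i, :].
shap.force_plot(explainer.expected_value, shap_values[i, :], X.iloc[i, :])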

TABLE 6. Comparison between our results and that of the Jordan dataset based on accuracy, precision, and recall.

3) DISCUSSION
This paper describes the use of two types of predictive models that show an improvement in accuracy over the usual traditional methods. The results obtained show the following. The tested and cross-validated ensemble models improved the prediction of student performance, attaining an accuracy and precision of over 98% and outperforming the results in [12], which used the same dataset, as illustrated in Table 6. We examined two types of models that are novel for student performance data, namely XGBoost and ET, and used them to boost the predictive accuracy of student performance models. We addressed the problem of class imbalance by oversampling the less represented classes with SMOTE, optimized the hyperparameters by choosing an optimal set for each learning algorithm, and applied a tenfold cross-validation method to improve the classification algorithms and avoid overfitting the model. The results obtained via the aforementioned strategies and techniques demonstrate the dependability of the novel model. In addition, a novel technique is used to help determine which factors influence the score (i.e., SHAP values, which increase model transparency) [9]. We apply SHAP values to student data to explain how ensemble methods predict student performance. The SHAP values and the associated visualizations provide a view into the inner operations of the prediction models and increase the transparency of the model. Specifically, the SHAP force visualization provides a clear view of the student performance features that affect the score, which non-technical users can interpret immediately [10]. The effect of features on the model prediction can be considered at the macro-level, across all individuals, by identifying the main features that contribute to the goal; for example, student absences have a greater effect than other features. It can also be considered at the micro-level, by identifying which features were most important for an individual: force plots highlight the features responsible for predicting student performance and the features contributing to the prediction for a single observation (e.g., in terms of student absences, the lower the number of absences, the more positive the effect on student performance; in terms of the parent in charge, a negative effect is observed when the father is in charge; in terms of digital resources used, the more resources, the more positive the correlation with student performance). To our knowledge, this is the first attempt to use SHAP values in the educational context, and this contribution fills the gap in current prediction model interpretation, particularly in the field of education. We consider SHAP values a powerful tool to include in the framework for predicting student performance because they solve important issues about the interpretability of more opaque models. Model interpretation is a critical step that could lead to the creation of new knowledge and improve our understanding of student performance. As a result, when the prediction and explanation are established through our experiment, educators can identify students at risk early on and provide appropriate interventions in promising ways.
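
To make this training strategy concrete, the sketch below chains SMOTE, a grid search, and tenfold cross-validation using scikit-learn and imbalanced-learn. The parameter grid and the names X_train and y_train are illustrative assumptions, not the exact settings of our experiments.

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Placing SMOTE inside the pipeline ensures that oversampling is applied
# only to the training folds during cross-validation, which avoids leakage.
pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", XGBClassifier(eval_metric="mlogloss")),
])

# Illustrative grid, not our exact settings.
param_grid = {
    "clf__n_estimators": [100, 300],
    "clf__max_depth": [3, 6],
    "clf__learning_rate": [0.05, 0.1],
}

search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)   # X_train and y_train assumed to be prepared
print(search.best_params_, search.best_score_)
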
V. CONCLUSION AND FUTURE WORK
This paper presents our preliminary research and depicts the importance of tree-based machine learning algorithms and their application in predicting student performance. The aim of this work is to improve performance measure values by using ensemble methods and comparing them with previous studies that used the Jordan dataset to investigate student performance prediction. We also apply SHAP values to explain the inner logic of the ensemble methods. The results show that our models have better accuracy than the traditional models. The prediction models achieve an accuracy of over 98%, which outperforms the result obtained by [12] on the same dataset. These results are obtained via the use of strategies and techniques such as SMOTE, hyperparameter optimization, and a cross-validation process, which demonstrate the dependability of the novel model. Moreover, the SHAP values and the associated visualizations provide a view into the inner operations of the prediction models and increase the transparency of the model. Specifically, the SHAP force visualization provides a clear view of the features of student performance that affect the score, which non-technical users can interpret immediately. As a result, when the prediction and explanation are established through our experiment, educators can identify students at risk early on and provide appropriate interventions in promising ways.

A. LIMITATIONS AND FUTURE WORK
This study has certain limitations that must be highlighted. The study relies on publicly available datasets and not on a dedicated student dataset. Moreover, the dataset was small, having under a thousand records; research with more data may offer more conclusive insights. Most researchers in EDM are currently hesitant to share their research datasets for two main reasons: the first is privacy, ethics, and legality; the second is that dataset acquisition is a time-consuming, labor-intensive, and expensive task. We recommend that machine learning researchers disclose more educational datasets based on a blend of privacy protection, economic impact, and academic implications. This study also used offline data, although an ever-increasing amount of online data remains unused; with such data, we could train the model to predict online student performance in real time on a distinctive educational dataset that can be tested by our models. If we are given a large dataset, we can utilize the most recent big data technology to build a new model and validate the outcomes. Furthermore, we can obtain additional data and attempt deep learning methods to improve model performance by using additional features, such as examining how the use of social media would influence the performance of students. Moreover, additional experiments could be conducted with other machine learning techniques. This research used decision trees, KNC, and ensemble methods, such as XGBoost and ET, for classification; other methods, such as clustering and deep neural networks, can be used for classification or regression problems to gain an improved perception of the importance of method selection. Another area that can be improved is the process of feature engineering: given limited data, the amount of feature engineering that can be done is likewise limited. For model interpretability, the exact Shapley value method needs to pass over ''all possible combinations'' of the features; when the number of features is large, the number of combinations is extremely large, yielding a large amount of Shapley value computation and immense time complexity, resulting in computational infeasibility. In the future, we will consider utilizing other existing datasets with more features and will extend the scope of the study to include other African countries. We are currently exploring student data obtained from the Kaggle repository related to Nigerian schools, and we look forward to using dynamic selection algorithms to improve model performance and extract knowledge that may be useful to school administrators. Our focus is to extend the scope of EDM and provide noteworthy insights and models to improve education across Africa.

Another area of research for model interpretability is to improve the SHAP library to calculate Shapley values more quickly than if a model prediction had to be computed for each conceivable combination of features.
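
For completeness, this cost comes from the definition of the exact Shapley value, which averages the marginal contribution of a feature i over every coalition S of the remaining features in the feature set F [13]:

\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right],

so an exact evaluation enumerates on the order of 2^{|F|} feature subsets, which is precisely what tree-specific approximations such as TreeSHAP are designed to avoid.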
REFERENCES
[1] R. Asif, A. Merceron, S. A. Ali, and N. G. Haider, ''Analyzing undergraduate students' performance using educational data mining,'' Comput. Educ., vol. 113, pp. 177–194, Oct. 2017.
[2] M. Goyal and R. Vohra, ''Applications of data mining in higher education,'' Int. J. Comput. Sci. Issues, vol. 9, no. 2, p. 113, 2012.
[3] D. Kučak, V. Juričić, and G. Ðambić, ''Machine learning in education—A survey of current research trends,'' Ann. DAAAM Proc., vol. 29, pp. 406–410, 2018.
[4] A. Sayghe, Y. Hu, I. Zografopoulos, X. Liu, R. G. Dutta, Y. Jin, and C. Konstantinou, ''Survey of machine learning methods for detecting false data injection attacks in power systems,'' IET Smart Grid, vol. 3, no. 5, pp. 581–595, Oct. 2020. [Online]. Available: https://digital-library.theiet.org/content/journals/10.1049/iet-stg.2020.0015
[5] A. Almasri, E. Celebi, and R. S. Alkhawaldeh, ''EMT: Ensemble meta-based tree model for predicting student performance,'' Sci. Program., vol. 2019, pp. 1–13, Feb. 2019.
[6] O. Adejo and T. Connolly, ''An integrated system framework for predicting students' academic performance in higher educational institutions,'' Int. J. Comput. Sci. Inf. Technol., vol. 9, no. 3, pp. 149–157, Jun. 2017.
[7] M. Yalcintas and U. A. Ozturk, ''An energy benchmarking model based on artificial neural network method utilizing U.S. commercial buildings energy consumption survey (CBECS) database,'' Int. J. Energy Res., vol. 31, no. 4, pp. 412–421, 2007.
[8] Y. Wei, X. Zhang, Y. Shi, L. Xia, S. Pan, J. Wu, M. Han, and X. Zhao, ''A review of data-driven approaches for prediction and classification of building energy consumption,'' Renew. Sustain. Energy Rev., vol. 82, pp. 1027–1047, Feb. 2018.
[9] Explain Your Model With the SHAP Values. Accessed: Sep. 14, 2019. [Online]. Available: https://towardsdatascience.com/explain-any-models-with-the-shap-values-use-the-kernelexplainer-79de9464897a
[10] P. Arjunan, K. Poolla, and C. Miller, ''EnergyStar++: Towards more accurate and explanatory building energy benchmarking,'' Appl. Energy, vol. 276, Oct. 2020, Art. no. 115413.
[11] C. Konstantinou, ''Cyber-physical systems security education through hands-on lab exercises,'' IEEE Des. Test. Comput., vol. 37, no. 6, pp. 47–55, Dec. 2020.
[12] E. A. Amrieh, T. Hamtini, and I. Aljarah, ''Mining educational data to predict student's academic performance using ensemble methods,'' Int. J. Database Theory Appl., vol. 9, no. 8, pp. 119–136, 2016.
[13] S. M. Lundberg and S.-I. Lee, ''A unified approach to interpreting model predictions,'' in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 4765–4774.
[14] B. Şen, E. Uçar, and D. Delen, ''Predicting and analyzing secondary education placement-test scores: A data mining approach,'' Expert Syst. Appl., vol. 39, no. 10, pp. 9468–9476, Aug. 2012.
[15] M. Ramaswami and R. Bhaskaran, ''A CHAID based performance prediction model in educational data mining,'' 2010, arXiv:1002.1144.
[16] S. Sembiring, M. Zarlis, D. Hartama, S. Ramliana, and E. Wani, ''Prediction of student academic performance by an application of data mining techniques,'' in Proc. Int. Conf. Manage. Artif. Intell., 2011, vol. 6, no. 1, pp. 110–114.
[17] G. Gray, C. McGuinness, and P. Owende, ''An application of classification models to predict learner progression in tertiary education,'' in Proc. IEEE Int. Advance Comput. Conf. (IACC), Feb. 2014, pp. 549–554.
[18] A. M. Shahiri, W. Husain, and A. A. Rashid, ''A review on predicting student's performance using data mining techniques,'' Proc. Comput. Sci., vol. 72, pp. 414–422, 2015.
[19] S. T. Jishan, R. I. Rashu, N. Haque, and R. M. Rahman, ''Improving accuracy of students' final grade prediction model using optimal equal width binning and synthetic minority over-sampling technique,'' Decis. Anal., vol. 2, no. 1, p. 1, Dec. 2015.
[20] F. Ahmad, N. H. Ismail, and A. A. Aziz, ''The prediction of students' academic performance using classification data mining techniques,'' Appl. Math. Sci., vol. 9, no. 129, pp. 6415–6426, 2015.
[21] A. Tekin, ''Early prediction of students' grade point averages at graduation: A data mining approach,'' Eurasian J. Educ. Res., vol. 14, no. 54, pp. 207–226, Feb. 2014.
[22] J. Bansode, ''Mining educational data to predict students' academic performance,'' Int. J. Recent Innov. Trends Comput. Commun., vol. 4, no. 1, pp. 1–5, 2016.
[23] D. Kabakchieva, ''Predicting student performance by using data mining methods for classification,'' Cybern. Inf. Technol., vol. 13, no. 1, pp. 61–72, 2013.
[24] Z. Kovacic, ''Early prediction of student success: Mining students' enrolment data,'' in Proc. Informing Sci. IT Educ. Conf. (InSITE), 2010.
[25] Z. Kovacic, ''Predicting student success by mining enrolment data,'' Res. Higher Educ., Tech. Rep. 15, 2012.
[26] V. Ramesh, P. Parkavi, and K. Ramar, ''Predicting student performance: A statistical and data mining approach,'' Int. J. Comput. Appl., vol. 63, no. 8, pp. 35–39, Feb. 2013.
[27] H. Bydovska and L. Popelínskỳ, ''Predicting student performance in higher education,'' in Proc. 24th Int. Workshop Database Expert Syst. Appl., Aug. 2013, pp. 141–145.
[28] M. S. B. M. Azmi and I. H. B. M. Paris, ''Academic performance prediction based on voting technique,'' in Proc. IEEE 3rd Int. Conf. Commun. Softw. Netw., May 2011, pp. 24–27.
[29] B. Minaei-Bidgoli, D. A. Kashy, G. Kortemeyer, and W. F. Punch, ''Predicting student performance: An application of data mining methods with an educational web-based system,'' in Proc. 33rd Annu. Frontiers Educ. (FIE), vol. 1, 2003, p. T2A-13.
[30] C. Romero, S. Ventura, P. G. Espejo, and C. Hervás, ''Data mining algorithms to classify students,'' in Proc. 1st Int. Conf. Educ. Data Mining, Montréal, QC, Canada, Jun. 2008.
[31] K. B. Bhegade and S. V. Shinde, ''Student performance prediction system with educational data mining,'' Int. J. Comput. Appl., vol. 146, no. 5, pp. 32–35, Jul. 2016.
[32] R. Sumitha, E. Vinothkumar, and P. Scholar, ''Prediction of students outcome using data mining techniques,'' Int. J. Sci. Eng. Appl. Sci., vol. 2, no. 6, p. 8, 2016.
[33] A. Tribhuvan, P. Tribhuvan, and J. Gade, ''Applying Naïve Bayesian classifier for predicting performance of a student using WEKA,'' Adv. Comput. Res., vol. 7, no. 1, p. 239, 2015.
[34] M. F. Møller, A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning. Aarhus, Denmark: Aarhus Univ., Computer Science Department, 1990.
[35] P. P. Bendangnuksung, ''Students' performance prediction using deep neural network,'' Int. J. Appl. Eng. Res., vol. 13, no. 2, pp. 1171–1176, 2018.
[36] P. M. Arsad, N. Buniyamin, and J.-L.-A. Manan, ''A neural network students' performance prediction model (NNSPPM),'' in Proc. IEEE Int. Conf. Smart Instrum., Meas. Appl. (ICSIMA), Nov. 2013, pp. 1–5.
[37] F. Razaque, N. Soomro, S. A. Shaikh, S. Soomro, J. A. Samo, N. Kumar, and H. Dharejo, ''Using Naïve Bayes algorithm to students' bachelor academic performances analysis,'' in Proc. 4th IEEE Int. Conf. Eng. Technol. Appl. Sci. (ICETAS), Dec. 2017, pp. 1–5.
[38] A. Srinivas and V. Ramaraju, ''Detection of failures by analysis of student academic performance using Naïve Bayes classifier,'' Int. J. Comput. Math. Sci., vol. 7, no. 3, pp. 277–283, 2017.
[39] K. Eashwar, R. Venkatesan, and D. Ganesh, ''Student performance prediction using SVM,'' Int. J. Mech. Eng. Technol., vol. 8, no. 11, pp. 649–662, 2017.
[40] M. Jamuna and S. A. Shoba, ''Educational data mining students performance prediction using SVM techniques,'' Int. Res. J. Eng. Technol., vol. 4, no. 8, pp. 1248–1254, 2017.
[41] L. Breiman, ''Arcing classifier (with discussion and a rejoinder by the author),'' Ann. Statist., vol. 26, no. 3, pp. 801–849, 1998.
[42] Y. Freund and R. E. Schapire, ''A decision-theoretic generalization of on-line learning and an application to boosting,'' J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, Aug. 1997.
[43] L. Breiman, ''Bagging predictors,'' Mach. Learn., vol. 24, no. 2, pp. 123–140, 1996.
[44] L. Breiman, ''Random forests,'' Mach. Learn., vol. 45, no. 1, pp. 5–32, Oct. 2001.
[45] D. H. Wolpert, ''Stacked generalization,'' Neural Netw., vol. 5, no. 2, pp. 241–259, 1992.
[46] M. M. Ashenafi, M. Ronchetti, and G. Riccardi, ''Predicting student progress from peer-assessment data,'' in Proc. 9th Int. Conf. Educ. Data Mining, Raleigh, NC, USA, 2016.


[47] (2020). Top 10 Reasons Why Python is so Popular With Developers in 2020. [Online]. Available: https://www.upgrad.com/blog/reasons-why-python-popular-with-developers/
[48] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, ''SMOTE: Synthetic minority over-sampling technique,'' J. Artif. Intell. Res., vol. 16, no. 1, pp. 321–357, 2002.
[49] A. Roy, R. M. O. Cruz, R. Sabourin, and G. D. C. Cavalcanti, ''A study on combining dynamic selection and data preprocessing for imbalance learning,'' Neurocomputing, vol. 286, pp. 179–192, Apr. 2018.
[50] H. Faris, R. Abukhurma, W. Almanaseer, M. Saadeh, A. M. Mora, P. A. Castillo, and I. Aljarah, ''Improving financial bankruptcy prediction in a highly imbalanced class distribution using oversampling and ensemble learning: A case from the Spanish market,'' Prog. Artif. Intell., vol. 9, no. 1, pp. 31–53, Mar. 2020.
[51] S. Barua, M. M. Islam, X. Yao, and K. Murase, ''MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning,'' IEEE Trans. Knowl. Data Eng., vol. 26, no. 2, pp. 405–425, Feb. 2012.
[52] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, ''SMOTEBoost: Improving prediction of the minority class in boosting,'' in Proc. Eur. Conf. Princ. Data Mining Knowl. Discovery. Berlin, Germany: Springer, 2003, pp. 107–119.
[53] H. Han, W.-Y. Wang, and B.-H. Mao, ''Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning,'' in Proc. Int. Conf. Intell. Comput. Berlin, Germany: Springer, 2005, pp. 878–887.
[54] V. L. Miguéis, A. Freitas, P. J. V. Garcia, and A. Silva, ''Early segmentation of students according to their academic performance: A predictive modelling approach,'' Decis. Support Syst., vol. 115, pp. 36–51, Nov. 2018.
[55] J. R. Quinlan, ''Induction of decision trees,'' Mach. Learn., vol. 1, no. 1, pp. 81–106, 1986.
[56] P. Geurts, D. Ernst, and L. Wehenkel, ''Extremely randomized trees,'' Mach. Learn., vol. 63, no. 1, pp. 3–42, 2006.
[57] T. Chen and C. Guestrin, ''XGBoost: A scalable tree boosting system,'' in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2016, pp. 785–794.
[58] I. Lykourentzou, I. Giannoukos, V. Nikolopoulos, G. Mpardis, and V. Loumos, ''Dropout prediction in e-learning courses through the combination of machine learning techniques,'' Comput. Educ., vol. 53, no. 3, pp. 950–965, Nov. 2009.
[59] S. Mazzanti. (Jan. 4, 2020). SHAP Values Explained Exactly How You Wished Someone Explained to You. [Online]. Available: https://towardsdatascience.com/shap-explained-the-way-i-wish-someone-explained-it-to-me-ab81cc69ef30
[60] K. Coussement, M. Phan, A. De Caigny, D. F. Benoit, and A. Raes, ''Predicting student dropout in subscription-based online learning environments: The beneficial impact of the logit leaf model,'' Decis. Support Syst., vol. 135, Aug. 2020, Art. no. 113325.

HAYAT SAHLAOUI received the B.S. degree in web programming from the Faculty of Science, University Moulay Ismail, Meknes, Morocco, in 2009, and the master's degree in mathematics education and technology from the Ecole Normale Superieure Tetouan, in 2013. She is currently pursuing the Ph.D. degree in machine learning with the Faculty of Science and Technique Errachidia, University Moulay Ismail. Since 2011, she has been a distinguished educational professional in various institutions of the Meknes district, Morocco, with ten years of teaching expertise and an unparalleled ability to explain complicated mathematical concepts in an easily understandable manner. She has a talent for employing unique teaching strategies to effectively engage all students and foster a fun and fascinating learning environment. Her research interests include education strategies, e-learning, web application development, learning analytics, data mining, and machine learning in education.

EL ARBI ABDELLAOUI ALAOUI received the Ph.D. degree in computer science from the Faculty of Sciences and Technology—Errachidia, University of Moulay Ismail, Meknes, Morocco, in 2017. He is currently a Research Professor with the Ecole Normale Supérieure de Meknes, Moulay Ismail University. His main research interests include wireless networking, machine learning, DTN networks, game theory, the Internet of Things (IoT), and smart cities.

ANAND NAYYAR received the Ph.D. degree in computer science from Desh Bhagat University, in 2017, with a focus on wireless sensor networks and swarm intelligence. He is currently working with the Graduate School, Duy Tan University, Da Nang, Vietnam. He works in the areas of wireless sensor networks, the IoT, swarm intelligence, cloud computing, artificial intelligence, blockchain, cyber security, network simulation, and wireless communications. He is a Certified Professional with more than 75 professional certificates from CISCO, Microsoft, Oracle, Google, Beingcert, EXIN, GAQM, and Cyberoam. He has published more than 450 research papers in various national and international conferences and international journals (Scopus/SCI/SCIE/SSCI indexed) with high impact factor. He has authored, coauthored, or edited more than 30 books of computer science. He has five Australian patents to his credit in the areas of wireless communications, artificial intelligence, the IoT, and image processing. He is a member of more than 50 associations as a senior member and a life member, and is acting as an ACM Distinguished Speaker. He is associated with more than 500 international conferences as a program committee member, chair, advisory board member, or review board member. He has received more than 30 awards for teaching and research, including the Young Scientist Award, the Best Scientist Award, the Young Researcher Award, the Outstanding Researcher Award, and the Excellence in Teaching Award. He is acting as an Associate Editor for Wireless Networks (Springer), IET Quantum Communication, IET Wireless Sensor Systems, IET Networks, IJDST, IJISP, and IJCINI, and as the Editor-in-Chief of the International Journal of Smart Vehicles and Smart Transportation (IJSVST) (IGI Global, USA).

SAID AGOUJIL received the M.S. and Ph.D. degrees in mathematics from the Faculty of Sciences and Technology of Marrakech (FSTM), Morocco, in 2004 and 2008, respectively. He is currently a Professor with the Department of Computer Science, Faculty of Sciences and Technology, Moulay Ismail University, Errachidia, Morocco. His current research interests include numerical analysis, wireless networks, linear algebra, and speech coding.

MUSTAFA MUSA JABER received the Ph.D. degree from the Technical University of Malaysia. He held a postdoctoral position at Universiti Tun Hussein Onn Malaysia. He is currently the Director of the Research Center with the Dijlah University College, Baghdad, Iraq. His research interests include telemedicine, machine learning, and human factors.
