
Received October 7, 2021, accepted October 25, 2021, date of publication October 29, 2021, date of current version November 19, 2021.


Digital Object Identifier 10.1109/ACCESS.2021.3124270

Predicting and Interpreting Student Performance Using Ensemble Models and Shapley Additive Explanations

HAYAT SAHLAOUI1, EL ARBI ABDELLAOUI ALAOUI2, ANAND NAYYAR3,4, SAID AGOUJIL1, AND MUSTAFA MUSA JABER5,6
1 Department of Computer Science, Faculty of Sciences and Techniques at Errachidia, University of Moulay Ismaïl, Meknes, Errachidia 52000, Morocco
2 Department of Sciences, Ecole Normale Supérieure, University of Moulay Ismaïl, Meknes 52000, Morocco
3 Graduate School, Duy Tan University, Da Nang 550000, Vietnam
4 Faculty of Information Technology, Duy Tan University, Da Nang 550000, Vietnam
5 Department of Computer Science, Al-Turath University College, Baghdad 10021, Iraq
6 Department of Computer Science, Dijlah University College, Baghdad 10021, Iraq

Corresponding authors: Anand Nayyar ([email protected]) and Mustafa Musa Jaber ([email protected])

ABSTRACT In several areas, including education, the use of machine learning, such as artificial neural networks, has resulted in significant improvements in prediction tasks. The opacity of these models is one of the problems with their use. Decision-makers in education prefer prediction models that offer valuable insights while still being simple to comprehend. Hence, this study suggests an approach that improves on previous student performance prediction by enhancing performance and explaining why a student attains a certain score. A prediction model was proposed and tested using machine learning models. Our models outperform previous models developed on the same dataset. Using a combined framework of data-level and algorithm-level approaches, the proposed model achieves an accuracy of over 98%, implying a 20.3% improvement compared with previous models. As a balancing technique for upsampling data, we use the default strategy of the synthetic minority oversampling technique (SMOTE) to oversample all classes to the number of examples in the majority class. We also use ensemble methods. For tuning the parameters, we use a simple grid search algorithm provided by scikit-learn to estimate the optimal parameters of our model. This hyperparameter optimization, along with a ten-fold cross-validation process, demonstrates the dependability of the novel model. In addition, a novel visual and intuitive technique is used to help determine the factors that most influence the score, which helps to interpret and understand the entire model and visualizes feature attributions at the observation level for the machine learning model. Therefore, SHAP values are a powerful tool that should be incorporated within the student performance prediction framework: with the prediction and explanation created through the experiment, educators can recognize students at risk early and provide suitable advice in a timely manner.

INDEX TERMS Ensemble methods, game theory, machine learning, student performance prediction, SHAP values.

The associate editor coordinating the review of this manuscript and approving it for publication was Shen Yin.

I. INTRODUCTION
Data mining techniques play an essential role in many application fields, such as business analytics, security analytics, financial analytics, and learning analytics. In this study, we are primarily concerned with applications of data mining in the education environment. This area of research focuses on the design and application of algorithms on educational datasets to have a good understanding of students and their educational system [1]. A primal application of educational data mining (EDM) is investigating the student learning process and predicting student performance to improve educational practices. In this context, we are attempting to approximate student performance, experience, ranking, or grade [2] by pulling out features from traditional recorded or logged data. Student performance values could be numeric in the case of a regression problem or categorical in the case of a classification problem. Currently, one of the most promising research fields of information technology is machine learning [3].
The application of machine learning in the education domain is the primary subject of our research [4]. This technology can help predict student performance and alert teachers about students at risk so as to provide them with the support they need. Therefore, the question of predicting student performance comes down to building a learning classifier: students' observed records represent the training set, and matching student historical data represent the features, with a label that represents the actual performance [5]. The final goal is to alert teachers about students at risk of dropping out and to suggest means to improve their performance and increase their retention and completion rates [6].

Linear regression models are used in many existing student performance studies. Linear models are easy to implement and interpret, but they are inaccurate in terms of modeling student grades. Studies have also addressed the shortcomings of linear models using nonlinear models, such as artificial neural networks [7]. In contrast to linear models, nonlinear models have been shown to achieve better performance [8]. However, interpreting such a model is difficult (e.g., determining which factors make student performance inefficient). This kind of predictive model answers the "how much" question, not the reason why. Therefore, the best case is to have a robust model that can also be interpreted, resulting in improved adoption [9].

Robust machine learning algorithms can usually make accurate predictions, but their well-known opaque nature hampers their adoption. The interpretability of the model corresponds to the tag or label on a pill bottle: the label needs to be transparent for easy adoption. Decision-makers in education prefer prediction models that provide useful insights, as well as ones that are easy to understand [10], [11]. This clear-cut view additionally provides quality assurance to the pipeline. Intuition helps researchers determine whether any significant errors exist when inputting or executing a model. Moreover, it is a smart way to reproduce the outcome of the prediction algorithm. Most machine learning-based projects focus on results, not interpretability. Here, we highlight a recent method for interpretability, the SHAP value, to illustrate and explain the accuracy of the model.

This paper describes the first attempt to use machine learning model interpretability in the context of student performance prediction. Previous studies [12] focused on improving the accuracy of student grade models. This paper outlines, on the one hand, the first attempt to use a combined framework of data-level and algorithm-level approaches to improve classifier performance measures, for instance, the SMOTE technique for upsampling data and bagging-based (ET) and boosting-based (XGBoost) algorithms for improving the performance of single classifiers; and, on the other hand, the use of SHAP values [13] and associated visualizations in the quest for interpretability and explainability of models in an education context. This work aims to improve performance measure values by using the ensemble method and then compare the results with previous studies that used the Jordan dataset to investigate student performance prediction. We also apply the SHAP value to explain the inner logic of the ensemble method.

In this study, we attempt to answer the following questions: i) from previous studies related to student performance, which input attributes affect student performance, which sets of algorithms are used for prediction, what are their accuracy measures, and which tools are used?; ii) can the predictive models used in student performance be improved by using ensemble methods?; iii) is there an opportunity to use these proposed models to improve the interpretability of the models, which can enhance their use by decision-makers in education?; iv) what are the reasons and key indicators for explaining student prediction models using game theory?; v) what are Shapley values and how do these values help improve prediction accuracy and model transparency?

The key contributions of this study are as follows:

General objectives:
• This research could greatly benefit school administrators in terms of their knowledge of the causes of student performance problems and in finding solutions to them.
• The results of this study could benefit officials in the ministry of education in terms of finding educational plans, solutions, programs, and activities for students at risk.
• The study could contribute to theoretical literature to extend the scope and nourish the machine learning repository and provide noteworthy insights and models to improve education across the globe.
• This study could open horizons for other researchers to conduct other studies in this field.

Specific objectives:
• The combination of machine learning models and SHAP values could overcome the challenges of the trade-off between student performance models' interpretability and complexity and improve model accuracy and transparency.
• Improving student performance model accuracy without compromising interpretability via the combination of a data-level and an algorithm-level approach:
- by solving class imbalance problems, which implies strategies such as data augmentation for the minority class, referred to as SMOTE;
- by leveraging two innovative model types, bagging- and boosting-based algorithms, in particular ET and XGBoost, which were examined for the first time in the context of student performance data;
- by using a hyperparameter optimization technique, choosing a set of optimal hyperparameters for the learning algorithm, to improve precision and avoid model overfitting. Tenfold cross validation was used to train and evaluate our models and analyze the performance of the algorithms with proper metrics.
• Augmenting our ensemble models with SHAP values [13]. In real-world machine learning-based applications, model interpretability can sometimes be more important than accuracy. Therefore, SHAP provides a more visual, intuitive, and comprehensive approach to increase the transparency of ensemble models.
• Evaluating and comparing the proposed approach with previous works: by running our set of models, we were able to validate empirically that the proposed model outperforms previous ones.
This paper is organized as follows. In Section II, related work on the prediction of the performance of university students is reviewed. Section III presents the methodology followed to show how ensemble methods are used to build a student performance model, in order to improve its performance, together with the methods that help explain the model; this process consists of four phases: a preprocessing phase, a model building phase, a model evaluation phase, and a model interpretability phase. Section IV explores the experimental protocols, the results obtained, and the discussion, providing a complete description of the variables used in predicting student grades; the focus then shifts to implementing and interpreting the model, showing the details of the experimental implementation, its accuracy in comparison with traditional models, and how to explain the model used in student performance prediction. Finally, Section V concludes the paper with limitations, suggestions, and future scope.

II. RELATED WORK
In this section, we provide a research background for our study. For this, we reviewed previous studies related to student performance prediction and tried to lay out the input attributes that affect student performance, the sets of algorithms used for prediction, and their accuracy metrics, as well as the methods used to complete this critical mission in web-based educational contexts.

A. DETERMINANT ATTRIBUTES/FACTORS THAT AFFECT STUDENT ACADEMIC PERFORMANCE
The improvement of the education system depends on the factors that affect student academic success [14]. Therefore, to predict student performance with efficiency and accuracy, studying the variables that influence student academic performance is of paramount importance. A comprehensive framework on the factors that influence student academic performance is presented. The framework includes five variable aspects that considerably influence student performance: the personal, academic, economic, social, and institutional domains, as shown in Figure 1. Each of the five domains contributes to the performance measurement of students and is made up of attributes that work individually and conjointly for learner success. However, the degree of complexity and influence of each domain on student performance varies.

FIGURE 1. Determinant factors of student academic performance.

1) PERSONAL FACTORS
Personal factors include student behavior, such as feelings, thoughts, or actions, as well as students' demographics, such as age and gender, which affect their performance. Age and gender are the most often used factors for prediction because they are considered internal factors of variability, which are simple to define and measure. The researchers in [15]–[17] looked at how psychometric factors tend to affect the performance of students.

2) ACADEMIC FACTORS
These factors refer to indicators that explain the success or failure of students in the academic track in universities (e.g., the score a student obtains). According to [18], the cumulative grade point average (CGPA) is the most important attribute that has been frequently used because of its huge effect in shaping education's future. In [19]–[21], the authors considered student GPA. In [22], the authors disclosed that previous academic performance and parents' educational background are the most important attributes in predicting the future academic performance of a student. Others looked into the effect of previous academic achievement in determining the performance of students in the future [23].

3) FINANCIAL FACTORS
Financial factors refer to the financial ability of the parents to finance their children's education and shape their future career. Research on these factors focused on the socioeconomic status of the student [24]. Some researchers studied the correlation between academic performance and parents' educational level and income [25].

4) FAMILY FACTORS
Family factors are related to parents' educational background and their ability to provide educational assistance to their children and create a propitious environment for learning. In [26], results revealed that the type of school does not influence student performance, but the parent's job plays a critical role in predicting grades. In [22], the authors found that previous academic performance and parents' educational background are the most important attributes in predicting a student's future academic performance. Some studies tend to analyze the influence of parents' education background and income on academic performance [25].


5) INSTITUTIONAL FACTORS
The factors that correspond to this category relate to the academic program and the resources that the institution allocates for the best academic performance of its students. The authors of [15]–[17] looked at how psychometric factors tend to influence the performance of students.

B. DIFFERENT ALGORITHMS IN DEVELOPING A MODEL TO PREDICT THE ACADEMIC PERFORMANCE OF STUDENTS
The above review has shown that different researchers have used different algorithms in building a model to predict the academic performance of students. Among these approaches, the most frequently used ones are decision trees, linear regression, naïve Bayes classifiers, support vector machines, and neural networks. A few studies have used a coalition of classifiers in the quest to improve predictions [27]–[29]. Many of the studies reviewed tended to choose the optimal algorithm by trying a set of machine learning approaches. A decision tree consists of a series of rules organized in a stratified structure. Most researchers used this technique because it is straightforward; that is, it can be changed into numerous classification rules. The decision tree algorithm is widely used in predictive modeling because it is suitable for large and small amounts of data [1]. It also handles numeric and categorical variables, does not require lengthy data preparation, and can be easily understood and interpreted by users because it is based on simple rules. The survey by [30] showed that decision tree models are easy to understand due to their inner logic process and can be converted directly into if-then rules. Among decision tree algorithms, C4.5, CART, and ID3 were found to be the best classifiers for predicting student performance [5]. In [12], some well-known decision tree algorithms, such as C4.5 and CART, were used to predict students' final grades using their log data extracted from the Moodle system. Meanwhile, [31] used the ID3 and C4.5 algorithms on students' internal marks to identify students who are likely to fail. Experiments were conducted on a dataset of 200 engineering students, and their performance was predicted by applying a decision tree; the results revealed that the accuracy of the classifiers was 98.5%. A comparative study of classification techniques to predict academic performance was conducted by [32]. This study used the data of 350 students. The result indicated that, out of the classifiers, J48 achieved an accuracy of 97% [32], [33].

Neural networks are the most established model used in educational data mining. Neural networks function similarly to the human brain by linking neurons, or nodes, that collaborate to produce an output task [34]. Neural network algorithms are also widely used by researchers because they are well suited to large amounts of data and can be applied to structured and unstructured data; however, they are difficult to interpret and require more execution time for training. A neural network was used in [35] to predict student performance; results indicated that the model achieved an accuracy of 83.4%. Arsad et al. [36] used an artificial neural network algorithm to predict the performance of undergraduate students in the field of engineering. This work discovered that undergraduate subjects have a major effect on the final CGPA after graduation. In the work of [19], it was pointed out that, with the SMOTE method on an imbalanced dataset, classification models for neural networks and naïve Bayes had an accuracy of 75%; when using the discretization method, the models for naïve Bayes and neural networks achieve almost identical levels of accuracy.

The authors in [37] built a model to predict the GPA of students using the naïve Bayes technique and K-means clustering. Their models produced an accuracy of 98.8%. By contrast, [38] indicated that a naïve Bayes classifier using the cfsSubsetEval feature selection technique on a dataset of 257 academic records had an accuracy of 84%.

In [16], the support vector machine was found to have a decent generalization potential and to be quicker than other approaches. Meanwhile, an analysis performed in [17] for the purpose of recognizing students at risk of failure obtained the best accuracy in model prediction by using a support vector algorithm. In his academic work, [39] applied a support vector machine to predict which students would pursue doctoral studies; results revealed that the method achieved an accuracy of 96.7%, while the authors in [40] revealed that the polynomial kernel technique achieved 97.62% accuracy. The support vector machine algorithm is the least favored by researchers because it is only suitable when the dataset is small and the algorithm is not transparent. Its main disadvantage is the complexity of the model. Such complexity contradicts the general requirement that models in intelligent learning systems should be transparent. In addition, selecting proper kernel functions and other parameters is difficult, and different experiments have to be conducted empirically.

Ensemble methods were introduced by Breiman, Freund, and Schapire in 1984. The most popular algorithms, such as arcing [41], boosting [42], bagging [43], and random forests [44], have gained keen interest in the literature. Ensemble methods have two major families. The first is called homogeneous because it combines similar algorithms, such as multiple classification trees or multiple SVMs; bagging [43] and boosting [42] are examples of such methods. The second is called heterogeneous because it combines algorithms of a different nature (e.g., classification trees, SVMs, and neural networks); the most popular method here is stacking [45].

In his paper, [5] applied the ensemble method to improve the results of the best-selected classifier, J48, in predicting student performance. The results revealed a huge improvement; that is, the proposed model using ensemble methods could reach an accuracy of up to 98.5%. By contrast, [12] showed that the proposed algorithms (i.e., bagging and boosting) obtained an accuracy of over 80%, confirming the soundness of the approach. To address the student performance prediction problem, various techniques that provide accurate results are being identified.


Recently, many academic papers (as shown in Table 1) have elaborated on the proper features that affect student performance, and all of them agreed that all models can provide reasonable results when predicting student performance. However, they cannot identify a technique that is clearly superior because the accuracy of the prediction depends mainly on the context, data, parameters, and hyperparameters of the technique. Any potential alternative must consider these factors.

TABLE 1. Attributes, algorithms, and data mining techniques frequently used to predict students' academic performance.

C. TOOLS USED
The most popular tools are WEKA and SPSS Modeler, most likely due to their automation and versatility features in predictive analytics [46]. In addition, R and Python packages are used due to their mature and supportive communities and hundreds of open-source libraries and frameworks [47].

D. ADDITIONAL VALUE OF THE SUGGESTED APPROACH
By scouting the literature regarding student performance prediction, we found a gap specific to student performance improvement: how to boost model performance without neglecting to clear out model opacity, so that models remain easy for educators to understand, and how to provide guidance on using such models to improve school administrators' decision-making skills and their confidence in the scores calculated by the ensemble models, thereby achieving better model adoption, for better interpretability leads to better acceptance [9]. This study is the first to examine two novel types of models on student performance data (i.e., XGBoost and ET) and use them to boost the predictive accuracy of student performance models. In contrast to previous machine learning models, the algorithms used are scalable, provide accurate results, and require less execution time for training. Moreover, they do not require a huge amount of training data, in contrast to other robust models such as deep learning. We also solved the problem of class imbalance, which implies strategies such as data oversampling for the less represented class, known as the SMOTE technique; as a balancing technique, SMOTE was used to oversample all classes to the number of examples in the majority class. To improve precision and avoid model overfitting, we used a hyperparameter optimization technique by choosing a set of optimal hyperparameters for the learning algorithm: a simple grid search algorithm provided by scikit-learn was used to estimate the optimal parameters of our models (the exact settings are given in Section III). Ten-fold cross validation was used to train and evaluate our models; cross validation is mainly used to gauge the capabilities of a model on new data. We also analyzed the performance of the algorithms with proper metrics, such as precision and recall, which capture information that accuracy cannot. When considering models for predicting student performance, misclassifying students at risk as "not at risk" has more consequences than the opposite case; thus, accuracy may not be the right choice as a performance evaluation technique here.

This paper describes the first attempt to use the interpretability of machine learning models in the context of student performance prediction. Previous studies [12] focused on improving the accuracy of student grade models. This paper outlines the first attempt to use SHAP values [13] and associated visualizations in the quest for the interpretability and explainability of models in the education context, by augmenting our ensemble models with SHAP values [13]. In real-world machine learning-based applications, model interpretability can sometimes be more important than accuracy. Therefore, SHAP provides a more visual, intuitive, and comprehensive approach to increase the transparency of ensemble models, helping to interpret and understand the entire model and to visualize feature attributions at the observation level for any machine learning model (the approach is model agnostic). SHAP also supports outlier and missing value detection and variable selection, which can help in discovering the root cause of a decrease in model performance and in proposing changes that might redress the issues. These capabilities augment the trustworthiness of the ensemble models and lead to their better adoption [9]. Finally, the suggested models were assessed and compared with previous work: by running our set of models, we could empirically validate the performance of our student performance system against previous models.

III. RESEARCH METHODOLOGY
In this section, we show how ensemble methods are used to build a student performance model, as well as the methods that help explain the model.

A. SYSTEM MODEL
Figure 2 depicts the four stages of this operation: the preprocessing phase, the development phase, the appraisal phase, and the interpretability phase. The first phase begins with a data preprocessing step in which the collected data are converted into a suitable format. The discretization process is then used to translate student performance from numerical values to nominal values that correspond to the class tags of the classification problem. In this process, we divided the dataset into three nominal intervals depending on the student's average grade: the values of the low performers range from 0 to 69, those of the middle performers from 70 to 89, and those of the high performers from 90 to 100. The post-discretization dataset includes several low-performing, average-performing, and high-performing students.
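To make the discretization step concrete, the following sketch bins a numeric average grade into the three nominal classes with pandas; the column name and toy values are our own illustration, not taken from the paper's released code:

    import pandas as pd

    # Hypothetical numeric grades; thresholds follow the text above:
    # 0-69 -> low (L), 70-89 -> middle (M), 90-100 -> high (H)
    df = pd.DataFrame({"avg_grade": [45, 72, 95, 88, 60]})
    df["performance"] = pd.cut(
        df["avg_grade"],
        bins=[0, 69, 89, 100],
        labels=["L", "M", "H"],
        include_lowest=True,
    )
    print(df)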

Afterward, we solved the problem of class imbalance, which implies strategies such as data oversampling for the less represented class, known as SMOTE, by balancing the classes in the training data before inputting the data to the machine learning algorithm. This process plays a vital role in improving the accuracy measures. The original training set contained 480 instances, of which 127 belonged to the low-performing class, 211 to the average performers, and 142 to the high performers. As a balancing technique, SMOTE was used to oversample all classes to the number of examples in the majority class.
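A minimal sketch of this balancing step, assuming the imbalanced-learn library (the paper only states that SMOTE's default strategy was used) and assuming X_train and y_train come from the preprocessing step:

    from collections import Counter
    from imblearn.over_sampling import SMOTE

    # The default sampling strategy oversamples every class except the
    # majority class up to the majority count (211 in the original set).
    sm = SMOTE(random_state=7)
    X_res, y_res = sm.fit_resample(X_train, y_train)

    print("before:", Counter(y_train))  # e.g. Counter({'M': 211, 'H': 142, 'L': 127})
    print("after: ", Counter(y_res))    # e.g. Counter({'M': 211, 'H': 211, 'L': 211})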
Afterward, supervised classification algorithms were used to train models to learn the mapping function from the input features to the output labels, by leveraging two novel model types related to student performance data (i.e., XGBoost and ET). These models were used to boost the predictive accuracy of student performance models. The testing set was used to test the prediction performance of the models trained by the different algorithms. To improve precision and avoid model overfitting, we used a hyperparameter optimization technique by choosing a set of optimal hyperparameters for the learning algorithm. Ten-fold cross validation was used to train and evaluate our models. For tuning the parameters, we used a simple grid search algorithm provided by scikit-learn to estimate the optimal parameters of XGBoost (booster = dart, n_estimators = 1000, learning_rate = 0.1, nthread = 4, objective = binary:logistic, eval_metric = mlogloss, eta = 0.7, gamma = 4, max_depth = 9, min_child_weight = 1, max_delta_step = 0, subsample = 0.8, colsample_bytree = 0.8, silent = 1, seed = 7, base_score = 0.7). For the ET, the optimized parameters were (n_estimators = 300, max_features = 2).
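A sketch of this tuning procedure with scikit-learn's GridSearchCV; the grids below are small illustrative subsets around the reported optima, not the exact grids searched in the paper, and X_res/y_res denote the SMOTE-balanced data from the previous step:

    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.preprocessing import LabelEncoder
    from xgboost import XGBClassifier

    y_enc = LabelEncoder().fit_transform(y_res)   # XGBoost expects integer labels
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)

    # Extra Trees: the reported optimum was n_estimators=300, max_features=2
    et_search = GridSearchCV(
        ExtraTreesClassifier(random_state=7),
        param_grid={"n_estimators": [100, 300, 500], "max_features": [2, 4, "sqrt"]},
        cv=cv, scoring="accuracy")
    et_search.fit(X_res, y_enc)
    print(et_search.best_params_, et_search.best_score_)

    # XGBoost: a few of the reported settings, searching over two of them
    xgb_search = GridSearchCV(
        XGBClassifier(booster="dart", n_estimators=1000, learning_rate=0.1,
                      subsample=0.8, colsample_bytree=0.8, random_state=7),
        param_grid={"max_depth": [3, 6, 9], "gamma": [0, 4]},
        cv=cv, scoring="accuracy")
    xgb_search.fit(X_res, y_enc)
    print(xgb_search.best_params_, xgb_search.best_score_)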

We recorded the accuracy, precision, and recall values for model comparison. Afterward, we focused on model interpretation. For this, we used the SHAP value to explain the inner logic of the ensemble methods. As indicated by the Shapley value, every time we provide data as input to the interpreter and the predictor, the predictor model supplies the accuracy value, whereas the interpreter depicts the effect of negative and positive features. As a visualization technique, the SHAP value provides a view of the prediction model's functionality and increases model transparency. Specifically, the SHAP force visualization provides a clear view of the student performance features that affect the score, which nontechnical users can interpret immediately [10].

TABLE 2. Attributes, algorithms, and data mining techniques frequently used to predict students' academic performance.
B. DATA PROCESSING
In this step, preprocessing for the purpose of balancing the data was conducted using SMOTE. SMOTE is a technique that involves creating new samples for the minority class. In this method, a new sample is built from a sample x_i of the minority class by finding the k closest neighbors of x_i within the minority class. Then, a neighbor is randomly picked, and a synthetic sample that combines x_i and the selected neighbor is finally created [48]–[50]. Algorithm 1 clearly depicts this operation. Several other SMOTE techniques have emerged from this strategy, e.g., weighted SMOTE [51], SMOTEBoost [52], and borderline SMOTE [53]; these techniques were not the subject of this study but can be the subject of future research.

Algorithm 1 SMOTE Algorithm
Input: N: number of minority class instances; n: amount of SMOTE (in %); k: number of nearest neighbors; minority data D = {x_i ∈ X, i = 1, 2, ..., N}.
Output: D′: synthetic data generated from D.
n = (int)(n / 100)
for i = 1 to N do
    Find the k nearest neighbors of x_i
    while n ≠ 0 do
        Choose one of the k nearest neighbors of x_i, denoted x̂
        Choose a random number α ∈ [0, 1]
        x_new = x_i + α (x̂ − x_i)
        Append x_new to D′
        n = n − 1
    end while
end for
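For reference, a direct Python transcription of Algorithm 1 might look as follows; this is a sketch of the textbook procedure, not the implementation used in the experiments:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote(D, n_percent=100, k=5, random_state=0):
        """Return synthetic samples built from the minority data D (Algorithm 1)."""
        rng = np.random.default_rng(random_state)
        n = n_percent // 100                      # synthetic samples per original point
        nn = NearestNeighbors(n_neighbors=k + 1).fit(D)
        _, idx = nn.kneighbors(D)                 # idx[:, 0] is the point itself
        synthetic = []
        for i in range(len(D)):
            for _ in range(n):
                j = rng.choice(idx[i, 1:])        # pick one of the k nearest neighbors
                alpha = rng.random()              # random number in [0, 1]
                synthetic.append(D[i] + alpha * (D[j] - D[i]))
        return np.asarray(synthetic)

    # Example: 4 minority points in 2-D, 200% SMOTE with k=3
    D = np.array([[0.0, 0.0], [0.5, 0.2], [1.0, 0.1], [0.2, 0.8]])
    print(smote(D, n_percent=200, k=3).shape)     # (8, 2)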

C. MODEL CONSTRUCTION
In this section, we apply supervised classification algorithms to train models to learn the mapping function from the input features to the output labels. The testing set was used to test the prediction performance of the models trained by the different algorithms. To improve precision, we used ten-fold cross validation to train and evaluate our models. We experimented with the classification algorithms described below:

1) K-NEIGHBOR CLASSIFIER (KNC)
The classification of a new sample is given by the most common classification among the K neighbors closest to that sample in our dataset, with K a positive integer provided along with the new sample.


FIGURE 2. Global ensemble-based system for the classification of students’ academic performance.

2) DECISION TREES (DT)
Decision trees are rule-based, tree-inspired structured algorithms that represent features like a tree [54]. Each node represents a feature, each connection between nodes represents a decision rule, and the output is represented by the leaf nodes. DTs can also be seen as a flow chart, where the flow starts at the root node and ends with the predictions made at the leaves. A decision tree is a decision support tool, often viewed as a tree-like diagram displaying the predictions that result from a sequence of feature-based divisions [55].

3) RANDOM FOREST (RF)
The random forest classifier [44] (also referred to as a forest of decision trees) is a classification algorithm that trims the risk of overfitting by including randomness. Such trimming is performed by combining multiple decision trees, creating samples with replacement, and dividing nodes according to the best split using a random subset of features. The RF classifier was introduced by Leo Breiman and Adele Cutler in 2001 and is considered one of the most efficient algorithms, requiring little data preprocessing.

4) BAGGING CLASSIFIER
Bagging classifiers, also known as bootstrap aggregation, were introduced in 1996 by Breiman [43] to fix the problem of CART instability. Bagging is based on the majority vote of learner classifiers for classification, or on their average for regression.

5) EXTRA TREE CLASSIFIER
Extra Tree (ET) classifiers, also known as extremely randomized trees [56], are classification algorithms that reduce the risk of over-learning from the data by including randomness. Such reduction occurs by combining multiple decision trees, creating samples without replacement, and dividing nodes according to a random split using a random subset of features. Let a training dataset L = {(x_1, y_1), ..., (x_n, y_n)}, where the sample x_i = {f_1, f_2, ..., f_D} is a D-dimensional vector with f_j as a feature and j ∈ {1, 2, ..., D}. ET creates M independent decision trees. For every decision tree, S_p indicates the subset of the training dataset L at child node p; afterward, for every node p, the ET algorithm selects the best split with regard to S_p. The ET algorithm is described below:

Algorithm 2 Extra Tree
Input: training subset S_p = {s_1, s_2, ..., s_{Q_p}}, where the sample s_i = {f_1, f_2, ..., f_D} is a D-dimensional vector; the number of attributes to select randomly, K; and the minimum number of samples required to split a node, n_min.
if Q_p < n_min then
    Stop splitting and define the node as a leaf node.
else
    Select a random subgroup of K features.
end if
for every feature k in the subgroup do
    Identify f_k^max and f_k^min as the maximal and minimal values of feature k in subset S_p
    Obtain a random cut-point f_k^c uniformly in the range [f_k^min, f_k^max]
    Set [f_k < f_k^c] as a candidate split
end for
Select the split [f_* < f_*^c] whose score is the best among the K candidates, i.e., score(f_*^c) = max_{k=1,...,K} score(f_k^c).
Output: the best split [f_* < f_*^c] at the child node p.
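In practice, this splitting rule does not need to be implemented by hand: scikit-learn's ExtraTreesClassifier realizes Algorithm 2 internally. A minimal sketch with the parameters reported later in this paper (n_estimators=300, max_features=2); X_res and y_res are the balanced training data from the preprocessing phase:

    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import cross_val_score

    et = ExtraTreesClassifier(n_estimators=300, max_features=2, random_state=7)
    scores = cross_val_score(et, X_res, y_res, cv=10, scoring="accuracy")
    print(f"10-fold accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")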


6) XGBoost (XGB)
The extreme gradient boosting (XGBoost) algorithm is similar to the gradient boosting algorithm; however, it is more efficient and faster because it combines a linear model and a tree model, and it can perform parallel calculations on a single machine. The trees in a gradient boosting algorithm are constructed in series so that a gradient descent step can be performed to minimize a loss function. Unlike RF, the XGBoost algorithm builds each tree itself in a parallel fashion: essentially, the information contained in each column can have statistics computed on it in parallel. The importance of the variables is calculated in the same way as for RF, by computing and averaging the values by which a variable decreases the impurity of the tree at each step.

In accordance with [57], at the t-th iteration, the objective function of XGBoost can be represented as

\Theta^{(t)} = \Phi^{(t)} + \Omega^{(t)} = \sum_{i=1}^{n} \Phi\big(y_i, \hat{y}_i^{(t)}\big) + \sum_{k=1}^{t} \Omega(f_k)    (1)

where n is the number of training instances and the prediction \hat{y}_i^{(t)} can be represented as

\hat{y}_i^{(t)} = \sum_{m=1}^{t} f_m(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)    (2)

The regularization term \Omega(f_k), added to the convex differentiable loss function, is defined by

\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2    (3)

The term \Omega is interpreted as a combination of ridge regularization with coefficient \lambda and Lasso-style penalization with coefficient \gamma.

FIGURE 3. XGBoost algorithm procedure.
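To make the comparison between the conventional and ensemble classifiers concrete, here is a sketch of the cross-validated training loop over the models described in this section; the baseline settings are scikit-learn defaults (an assumption on our part), and y_enc is the integer-encoded label vector from the tuning sketch above:

    from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                                  RandomForestClassifier)
    from sklearn.model_selection import cross_validate
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from xgboost import XGBClassifier

    models = {
        "KNC": KNeighborsClassifier(),
        "DT": DecisionTreeClassifier(random_state=7),
        "RF": RandomForestClassifier(random_state=7),
        "Bagging": BaggingClassifier(random_state=7),
        "ET": ExtraTreesClassifier(n_estimators=300, max_features=2, random_state=7),
        "XGB": XGBClassifier(booster="dart", n_estimators=1000, learning_rate=0.1,
                             max_depth=9, gamma=4, subsample=0.8,
                             colsample_bytree=0.8, random_state=7),
    }
    for name, model in models.items():
        res = cross_validate(model, X_res, y_enc, cv=10,
                             scoring=["accuracy", "precision_macro", "recall_macro"])
        print(name, {k: round(v.mean(), 4) for k, v in res.items()
                     if k.startswith("test_")})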
values and how they help improve prediction accuracy and
model transparency [9]. SHAP values refer to Shapley values,
a game theory jargon [59]. These values consist mainly of two
components: game and players, where ‘‘game’’ represents
the outcome of the predictive model and ‘‘players’’ represent
the features of the model. Shapley measures the contribution
value each player makes to the game. By applying these
components to our case, the SHAP value measures how much
the feature is contributing to the output of the model. The
contribution of an the individual player is determined by
considering all possible coalitions of players, that is, all i
feature combinations possible (i goes from 0 to n, where n
is a total number of features available). The order of adding
the feature to the model is important and influences the pre-
diction. Lloyd Shaply proposed this approach in 1953 (hence
FIGURE 3. XGBoost algorithm procedure. the name ‘‘Shapley values’’ for phi values measured in this

E. MODEL INTERPRETATION
In this part, we examine one approach, namely game theory, to interpret the ensemble methods used in predicting student performance in Jordan. We introduce the reasons and key indicators for explaining student prediction models using game theory, and we define Shapley values and how they help improve prediction accuracy and model transparency [9]. SHAP values refer to Shapley values, a game theory term [59]. These values consist mainly of two components, the game and the players, where the "game" represents the outcome of the predictive model and the "players" represent the features of the model. The Shapley value measures the contribution each player makes to the game. Applying these components to our case, the SHAP value measures how much each feature contributes to the output of the model. The contribution of an individual player is determined by considering all possible coalitions of players, that is, all possible combinations of i features (with i going from 0 to n, where n is the total number of features available). The order in which features are added to the model is important and influences the prediction. Lloyd Shapley proposed this approach in 1953 (hence the name "Shapley values" for the phi values measured in this manner). Given a prediction p (the prediction made by the complex model), the Shapley value for a specific feature i (out of n total features, where S is a subset of the feature set excluding i) is [9]:

\phi_i(p) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n - |S| - 1)!}{n!} \big[ p(S \cup \{i\}) - p(S) \big]    (7)
This equation helps calculate the importance, or influence, of a feature by computing the difference in model prediction with and without feature i. The shift in the model estimate is essentially the feature's consequence. In summary, SHAP values are used whenever we have an input-output model and want to understand the decision made by this model, that is, why the model suggested that a student was more or less likely to pass. The predictive model answers the "how much" question, and the SHAP value identifies the reason why [59]. Thus, we used SHAP values to interpret the model; improved interpretability leads to improved acceptance [9].

Robust machine learning algorithms can usually make accurate predictions, but their opaque nature hampers their adoption. The interpretability of the model corresponds to the tag or label on a pill bottle: the label needs to be transparent for easy adoption. Shapley values and the SHAP library are a valuable toolkit for uncovering the golden nuggets that a machine learning algorithm has discovered. SHAP concerns the local interpretability of a predictive model. In particular, by considering the effects of features at the individual level, instead of over all individuals in the entire dataset and then summing the results, the interactions of features can also be revealed, making it possible to gain more impressive knowledge than with techniques for global feature importance [60].

As a visualization technique, SHAP values provide a view of the prediction model's functionality and increase the transparency of the model. Specifically, the SHAP force visualization provides a clear view of the student performance features that affect the score, which nontechnical users can interpret immediately [10].
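A sketch of how these phi values are obtained in practice with the shap library; TreeExplainer is the standard choice for tree ensembles, and the fitted model et and held-out features X_test are assumed from the previous sections:

    import shap

    explainer = shap.TreeExplainer(et)
    shap_values = explainer.shap_values(X_test)   # one set of phi values per class

    # The expected value is the baseline p(S) with no features (one entry per class)
    print(explainer.expected_value)

    # Phi values for the first student and the first class: one number per feature;
    # together with the baseline they sum to that student's predicted probability.
    print(shap_values[0][0])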
IV. EXPERIMENTS AND RESULT ANALYSIS
This section examines the settings and protocols of the experiments; the results obtained are then analyzed and discussed. First, we lay out a complete description of the variables used in predicting the performance of students; afterward, we focus on model implementation and interpretation. We define the experimental procedure used to test and validate the proposed models for predicting student performance, evaluate the results based on the training and cross-validation scores of the different machine learning methods, and record the accuracy, precision, and recall values for model comparison. The evaluation goals are:
• Performance: how performance improves by using ensemble methods in comparison with previous studies that used the Jordan dataset for student performance prediction.
• Interpretability: whether the SHAP value can be used to explain the inner logic of the ensemble methods; the predictor model supplies the accuracy value, and the interpreter depicts the effect of negative and positive features.

A. DATA DESCRIPTION
This part describes the open-source data retrieved from the Kaggle repository (https://www.kaggle.com/aljarah/xAPI-Edu-Data) in connection with the study made by [12]. These data are leveraged to create a scalable student performance model. The educational dataset used in this work consists of 480 student observations with 17 features. The full description of the variables used to predict a student's performance in the Jordanian school system is given in Table 4. These variables consist of personal features, such as gender and place of origin; educational background features, such as class participation, digital resources used, and student absence days; institutional features, such as education level and grades; and social features, such as the parent in charge of the student, the parent who responded to the survey, and the parents' review of the school. The Jordanian dataset is used with the same variables chosen by the previous study to build traditional and ensemble predictive models. By running the different models, the performance of the student performance system can be empirically assessed and compared with previous working models. All code, records, and full documentation on how to reproduce the results are publicly available in the GitHub repository.

TABLE 4. Description of the variables used in predicting the performance of students.
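A sketch for loading and encoding this dataset; the file name follows the Kaggle distribution, and the one-hot encoding step is our assumption:

    import pandas as pd

    df = pd.read_csv("xAPI-Edu-Data.csv")
    print(df.shape)                        # (480, 17)
    print(df["Class"].value_counts())      # M: 211, H: 142, L: 127

    # One-hot encode the categorical predictors; 'Class' is the target label
    X = pd.get_dummies(df.drop(columns="Class"))
    y = df["Class"]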
C. RESULT ANALYSIS
B. DESCRIPTION OF EXPERIMENTAL PROTOCOL
This section explains the experimental procedure for testing and evaluating the performance of the proposed models for student performance prediction. The implementations were conducted on a Windows framework using Python 3.7 and primarily the sklearn library. The machine in use is an "HP" model with the following configuration: 16 GB of RAM, an Intel Core i7 CPU, and an NVIDIA GeForce 930M graphics card.

C. RESULT ANALYSIS
In this section, we used a tenfold cross-validation approach to train our models. Considering that validation is a critical phase in building practical predictive models, we mainly used two groups of machine learning algorithms to build our student performance prediction model: conventional methods (decision tree and K-neighbor classifiers) and ensemble methods, for which we applied two frameworks (XGBoost and extra trees). We stored the basic performance metrics: accuracy, precision, and recall. Table 5 shows the summary of the results on the validation and test sets. We found that the ensemble methods, i.e., ET and XGBoost, always obtain the best results, whereas the traditional methods always yield the lowest results through validation and testing.

TABLE 5. Resultant analysis for various classifiers.

The boosting and bagging methods beat the other traditional stand-alone methods; that is, the boosting methods improve the accuracy from 76.7% to 99.47%, indicating that the number of students accurately classified increased from 367 to 477 out of a total of 480 students. The recall results increased from 77.097% to 99.47%, indicating that, out of the pool of misclassified and correctly classified students, 477 students were ranked correctly. The precision scores also climbed from 77.37% to 99.37%, meaning that, out of the total of 480 students, 476 students were classified correctly and four were classified incorrectly. The performance of the boosting and bagging classifiers is not significantly different. As shown in Table 5, the prediction models achieved accuracy and precision over 98%. These results were obtained via the use of strategies and techniques such as SMOTE, hyperparameter optimization, and a cross-validation process, which demonstrate the dependability of the novel model. These values show that we can predict student performance and enhance the prediction by using ensemble methods.

1) CONFUSION MATRIX
The overall accuracy of the boosting and bagging methods on the test and validation sets was 0.998 (accurate predictions/total, or true positives/total). Nevertheless, the confusion matrix provides additional insight into accuracy by class and into precision and recall efficiency. One insight we can obtain from the matrix is that the ensemble models were accurate at classifying low and average performers, with an accuracy of TP/total = 1.0; however, accuracy for high performers was slightly lower (0.99). If, for any rational motive, successful classification of a particular student class is especially important for the case at hand, then the confusion matrix helps outline differences between the classes and identify which classifier performs better. The boosting and bagging classifiers beat the other traditional individual classifiers. For instance, the accuracy by class for the boosting classifier increased from 77 to 100 for the average performers, from 88 to 100 for low performers, and from 75 to 99 for high performers.

FIGURE 4. Confusion matrix.

2) MODEL INTERPRETATIONS
Ensemble methods, such as ET and XGBoost, can attain good accuracy rates; however, they are difficult for people to comprehend because these models are generally viewed as a black box. After we introduce SHAP values, however, these models can be explained in a steady manner. As a great way to increase the transparency of ensemble models, the SHAP framework provides global and local interpretation. Global interpretation uses the feature importance plot and the summary plot as visualizations. By contrast, local interpretation uses a force plot to show the individual SHAP values for a single observation and visualizes feature attributions at the observation level to provide more accurate explanations and understanding of model decisions.

The idea behind SHAP feature importance is simple: feature importance values are obtained by calculating the average of the SHAP values of each feature. A feature importance plot ranks the most relevant features in decreasing order. The top features contribute more to the predictive power of the model than the bottom ones and thus have a huge predictive effect. The feature importance of the ensemble models for the different classes of student performance is plotted in a traditional bar chart, as illustrated in Figure 5.

FIGURE 5. Traditional feature importance plots.
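A sketch of the two global visualizations discussed here, reusing the explainer from the earlier sketch; the class ordering in class_names is an assumption:

    import shap

    # Global feature importance: mean |SHAP value| per feature, as a bar chart
    shap.summary_plot(shap_values, X_test, plot_type="bar",
                      class_names=["L", "M", "H"])

    # Summary (beeswarm) plot for one class: effect direction and magnitude
    shap.summary_plot(shap_values[0], X_test)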
Based on Figure 5, student absences have a higher effect than the other features, indicating that a change in this feature has a more noticeable influence than changes in the others. If student absences are higher, then student performance is reduced accordingly; however, this feature has a stronger negative effect for low and average performers than for high performers. If we look at the visited resources feature, we find that the more resources visited, the more likely student performance improves. A similar result is found for class participation: the more a student participates, the higher the student performance.
These latter features have a higher positive impact for all classes of performers. By contrast, features such as stage ID, section ID, and semester do not have a huge effect on student performance for any class of performers, indicating that changes in these features do not have a noticeable influence on the model prediction. As far as the summary plot is concerned, the y-axis lists the features, the x-axis depicts the Shapley value, and the color shows the level of the feature value (red means high, whereas blue means low); the higher the SHAP value, the larger the effect. At a global level, the summary plot provides feature importance together with feature effects. The features are ranked in accordance with their predictive power, and the plot shows which features affected the model the most, how changes in their values affect the model's prediction, how much each feature contributes, either positively or negatively, to the model output, and which variables strongly correlate with the target variable, making SHAP a tool of great benefit in variable selection.

The summary plot shows that student absence days, parent in charge, and raised hands are the most important model features, because the values of these features (high and low) are strongly correlated with high and low SHAP values. By contrast, other model features, such as stage ID, section ID, and semester, are less important, as their corresponding SHAP values are closer to zero and thus have less effect on the model. Looking at the first rows of the summary plot, we can tell that high values of attributes such as student absence days, raised hands, and visited resources have a strong positive effect on student performance ("high" comes from the red color, and "positive" is shown on the x-axis). When the number of student absence days is under seven, the feature has a positive effect on student performance. Class participation also has a positive effect on student performance. Furthermore, the digital resources used feature is positively related to student performance. In terms of which parent is in charge, a negative effect on student performance is found when the dad is in charge (negative SHAP value). This finding is consistent with human intuition. This plot is a great tool for obtaining an improved understanding of how certain features affect the model decision. To obtain a deeper understanding of our model, we use other SHAP tools, such as force plots, which provide feature contributions for a specific observation.

FIGURE 6. Summary plots.
In this study, the SHAP value of a feature for a single record prediction is acquired from the contribution of that feature towards the prediction. The force plot in Figure 7 highlights the features responsible for predicting student performance and the features driving the model output from the baseline value to the actual output. Features that have more predictive control are shown in red, whereas features that have lower predictive control are shown in blue. The output value is the prediction with features, whereas the base value is the value that would be predicted without any features, that is, the mean prediction. This chart can inform decision-makers in education about the key features responsible for student performance at the observation level. In the plot, each feature value is a force that either increases or decreases the prediction starting from the baseline. For instance, here the baseline is 0.5937 and the actual prediction is 0.98: the student absence days and visited resources features increase the prediction, whereas the parent in charge and announcements view features decrease the final output prediction.

FIGURE 7. Force plot depicting feature contribution towards a single prediction.

152700 VOLUME 9, 2021


H. Sahlaoui et al.: Predicting and Interpreting Student Performance

FIGURE 7. Force plot depicting feature contribution towards a single prediction.
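
A force plot such as the one in Figure 7 can be generated from the same explainer. The following sketch is illustrative and reuses the placeholder names explainer, shap_values, and X from the previous snippet, with an arbitrary record index i.

import shap

# Minimal sketch: force plot for a single student record (index i).
i = 0                  # arbitrary record index, for illustration only
shap.initjs()          # enables the JavaScript renderer in notebooks

# The base value is the mean prediction; red and blue forces push the
# output from the base value towards the actual prediction for record i.
# For a multiclass model, select one class, e.g.,
# explainer.expected_value[c] and shap_values[c][i, :].
shap.force_plot(explainer.expected_value, shap_values[i, :], X.iloc[i, :])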

TABLE 6. Comparison between our results and that of the Jordan dataset based on accuracy, precision, and recall.

3) DISCUSSION
This paper describes the use of two types of predictive models that show an improvement in accuracy over the usual traditional methods. The results obtained show the following. The tested and cross-validated ensemble models improved the prediction of student performance, attaining an accuracy and precision of over 98% and outperforming the results in [12], which used the same dataset, as illustrated in Table 6. We examined two types of models that are novel for student performance data, namely XGBoost and ET, and used them to boost the predictive accuracy of student performance models. We addressed the problem of class imbalance by oversampling the less represented classes with SMOTE, optimized the hyperparameters by choosing an optimal set for each learning algorithm, and applied a tenfold cross-validation method to improve the classification algorithms and avoid overfitting the model. The results obtained via the aforementioned strategies and techniques demonstrate the dependability of the novel model. In addition, a novel technique is used to help determine which factors influence the score (i.e., SHAP values, which increase model transparency) [9]. We apply SHAP values to student data to explain how ensemble methods predict student performance. The SHAP values and the associated visualizations provide a view into the inner operations of the prediction models and increase the transparency of the model. Specifically, the SHAP force visualization provides a clear view of the student performance features that affect the score, which non-technical users can interpret immediately [10]. The effect of features on the model prediction can be considered at the macro-level, across all individuals, by identifying the main features that contribute to the goal; for example, student absences have a greater effect than other features. It can also be considered at the micro-level, by identifying which features were most important for an individual: force plots highlight the features responsible for predicting student performance and the features contributing to the prediction for a single observation (e.g., in terms of student absences, the lower the number of absences, the more positive the effect on student performance; in terms of the parent in charge, a negative effect is observed when the father is in charge; in terms of digital resources used, the more resources, the more positive the correlation with student performance). To our knowledge, this is the first attempt to use SHAP values in the educational context, and this contribution fills the gap in current prediction model interpretation, particularly in the field of education. We consider SHAP values a powerful tool to include in the framework for predicting student performance because they solve important issues about the interpretability of more opaque models. Model interpretation is a critical step that could lead to the creation of new knowledge and improve our understanding of student performance. As a result, when the prediction and explanation are established through our experiment, educators can identify students at risk early on and provide appropriate interventions in promising ways.
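
To make this training strategy concrete, the sketch below chains SMOTE, a grid search, and tenfold cross-validation using scikit-learn and imbalanced-learn. The parameter grid and the names X_train and y_train are illustrative assumptions, not the exact settings of our experiments.

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Placing SMOTE inside the pipeline ensures that oversampling is applied
# only to the training folds during cross-validation, which avoids leakage.
pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", XGBClassifier(eval_metric="mlogloss")),
])

# Illustrative grid, not our exact settings.
param_grid = {
    "clf__n_estimators": [100, 300],
    "clf__max_depth": [3, 6],
    "clf__learning_rate": [0.05, 0.1],
}

search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)   # X_train and y_train assumed to be prepared
print(search.best_params_, search.best_score_)
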
V. CONCLUSION AND FUTURE WORK
This paper presents our preliminary research and depicts the importance of tree-based machine learning algorithms and their application in predicting student performance. The aim of this work is to improve performance measure values by using ensemble methods and comparing them with previous studies that used the Jordan dataset to investigate student performance prediction. We also apply SHAP values to explain the inner logic of the ensemble methods. The results show that our models have better accuracy than the traditional models. The prediction models achieve an accuracy of over 98%, which outperforms the result obtained by [12] on the same dataset. These results are obtained via the use of strategies and techniques such as SMOTE, hyperparameter optimization, and a cross-validation process, which demonstrate the dependability of the novel model. Moreover, the SHAP values and the associated visualizations provide a view into the inner operations of the prediction models and increase the transparency of the model. Specifically, the SHAP force visualization provides a clear view of the features of student performance that affect the score, which non-technical users can interpret immediately. As a result, when the prediction and explanation are established through our experiment, educators can identify students at risk early on and provide appropriate interventions in promising ways.

A. LIMITATIONS AND FUTURE WORK
This study has certain limitations that must be highlighted. The study relies on publicly available datasets and not on a dedicated student dataset. Moreover, the dataset was small, having under a thousand records; research with more data may offer more conclusive insights. Most researchers in EDM are currently hesitant to share their research datasets for two main reasons: the first is privacy, ethics, and legality; the second is that dataset acquisition is a time-consuming, labor-intensive, and expensive task. We recommend that machine learning researchers disclose more educational datasets based on a blend of privacy protection, economic impact, and academic implications. This study also used offline data, although an ever-increasing amount of online data remains unused; with such data, we could train the model to predict online student performance in real time on a distinctive educational dataset that can be tested by our models. If we are given a large dataset, we can utilize the most recent big data technology to build a new model and validate the outcomes. Furthermore, we can obtain additional data and attempt deep learning methods to improve model performance by using additional features, such as examining how the use of social media would influence the performance of students. Moreover, additional experiments could be conducted with other machine learning techniques. This research used decision trees, KNC, and ensemble methods, such as XGBoost and ET, for classification; other methods, such as clustering and deep neural networks, can be used for classification or regression problems to gain an improved perception of the importance of method selection. Another area that can be improved is the process of feature engineering: given limited data, the amount of feature engineering that can be done is likewise limited. For model interpretability, the exact Shapley value method needs to pass over ''all possible combinations'' of the features; when the number of features is large, the number of combinations is extremely large, yielding a large amount of Shapley value computation and immense time complexity, resulting in computational infeasibility. In the future, we will consider utilizing other existing datasets with more features and will extend the scope of the study to include other African countries. We are currently exploring student data obtained from the Kaggle repository related to Nigerian schools, and we look forward to using dynamic selection algorithms to improve model performance and extract knowledge that may be useful to school administrators. Our focus is to extend the scope of EDM and provide noteworthy insights and models to improve education across Africa.

Another area of research for model interpretability is to improve the SHAP library to calculate Shapley values more quickly than if a model prediction had to be computed for each conceivable combination of features.
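
For completeness, this cost comes from the definition of the exact Shapley value, which averages the marginal contribution of a feature i over every coalition S of the remaining features in the feature set F [13]:

\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right],

so an exact evaluation enumerates on the order of 2^{|F|} feature subsets, which is precisely what tree-specific approximations such as TreeSHAP are designed to avoid.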
REFERENCES
[1] R. Asif, A. Merceron, S. A. Ali, and N. G. Haider, ''Analyzing undergraduate students' performance using educational data mining,'' Comput. Educ., vol. 113, pp. 177–194, Oct. 2017.
[2] M. Goyal and R. Vohra, ''Applications of data mining in higher education,'' Int. J. Comput. Sci. Issues, vol. 9, no. 2, p. 113, 2012.
[3] D. Kučak, V. Juričić, and G. Ðambić, ''Machine learning in education—A survey of current research trends,'' Ann. DAAAM Proc., vol. 29, pp. 406–410, 2018.
[4] A. Sayghe, Y. Hu, I. Zografopoulos, X. Liu, R. G. Dutta, Y. Jin, and C. Konstantinou, ''Survey of machine learning methods for detecting false data injection attacks in power systems,'' IET Smart Grid, vol. 3, no. 5, pp. 581–595, Oct. 2020. [Online]. Available: https://digital-library.theiet.org/content/journals/10.1049/iet-stg.2020.0015
[5] A. Almasri, E. Celebi, and R. S. Alkhawaldeh, ''EMT: Ensemble meta-based tree model for predicting student performance,'' Sci. Program., vol. 2019, pp. 1–13, Feb. 2019.
[6] O. Adejo and T. Connolly, ''An integrated system framework for predicting students' academic performance in higher educational institutions,'' Int. J. Comput. Sci. Inf. Technol., vol. 9, no. 3, pp. 149–157, Jun. 2017.
[7] M. Yalcintas and U. A. Ozturk, ''An energy benchmarking model based on artificial neural network method utilizing U.S. commercial buildings energy consumption survey (CBECS) database,'' Int. J. Energy Res., vol. 31, no. 4, pp. 412–421, 2007.
[8] Y. Wei, X. Zhang, Y. Shi, L. Xia, S. Pan, J. Wu, M. Han, and X. Zhao, ''A review of data-driven approaches for prediction and classification of building energy consumption,'' Renew. Sustain. Energy Rev., vol. 82, pp. 1027–1047, Feb. 2018.
[9] Explain Your Model With the SHAP Values. Accessed: Sep. 14, 2019. [Online]. Available: https://towardsdatascience.com/explain-any-models-with-the-shap-values-use-the-kernelexplainer-79de9464897a
[10] P. Arjunan, K. Poolla, and C. Miller, ''EnergyStar++: Towards more accurate and explanatory building energy benchmarking,'' Appl. Energy, vol. 276, Oct. 2020, Art. no. 115413.
[11] C. Konstantinou, ''Cyber-physical systems security education through hands-on lab exercises,'' IEEE Des. Test. Comput., vol. 37, no. 6, pp. 47–55, Dec. 2020.
[12] E. A. Amrieh, T. Hamtini, and I. Aljarah, ''Mining educational data to predict student's academic performance using ensemble methods,'' Int. J. Database Theory Appl., vol. 9, no. 8, pp. 119–136, 2016.
[13] S. M. Lundberg and S.-I. Lee, ''A unified approach to interpreting model predictions,'' in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 4765–4774.
[14] B. Şen, E. Uçar, and D. Delen, ''Predicting and analyzing secondary education placement-test scores: A data mining approach,'' Expert Syst. Appl., vol. 39, no. 10, pp. 9468–9476, Aug. 2012.
[15] M. Ramaswami and R. Bhaskaran, ''A CHAID based performance prediction model in educational data mining,'' 2010, arXiv:1002.1144.
[16] S. Sembiring, M. Zarlis, D. Hartama, S. Ramliana, and E. Wani, ''Prediction of student academic performance by an application of data mining techniques,'' in Proc. Int. Conf. Manage. Artif. Intell., 2011, vol. 6, no. 1, pp. 110–114.
[17] G. Gray, C. McGuinness, and P. Owende, ''An application of classification models to predict learner progression in tertiary education,'' in Proc. IEEE Int. Advance Comput. Conf. (IACC), Feb. 2014, pp. 549–554.
[18] A. M. Shahiri, W. Husain, and A. A. Rashid, ''A review on predicting student's performance using data mining techniques,'' Proc. Comput. Sci., vol. 72, pp. 414–422, 2015.
[19] S. T. Jishan, R. I. Rashu, N. Haque, and R. M. Rahman, ''Improving accuracy of students' final grade prediction model using optimal equal width binning and synthetic minority over-sampling technique,'' Decis. Anal., vol. 2, no. 1, p. 1, Dec. 2015.
[20] F. Ahmad, N. H. Ismail, and A. A. Aziz, ''The prediction of students' academic performance using classification data mining techniques,'' Appl. Math. Sci., vol. 9, no. 129, pp. 6415–6426, 2015.
[21] A. Tekin, ''Early prediction of students' grade point averages at graduation: A data mining approach,'' Eurasian J. Educ. Res., vol. 14, no. 54, pp. 207–226, Feb. 2014.
[22] J. Bansode, ''Mining educational data to predict students' academic performance,'' Int. J. Recent Innov. Trends Comput. Commun., vol. 4, no. 1, pp. 1–5, 2016.
[23] D. Kabakchieva, ''Predicting student performance by using data mining methods for classification,'' Cybern. Inf. Technol., vol. 13, no. 1, pp. 61–72, 2013.
[24] Z. Kovacic, ''Early prediction of student success: Mining students' enrolment data,'' in Proc. Informing Sci. IT Educ. Conf. (InSITE), 2010.
[25] Z. Kovacic, ''Predicting student success by mining enrolment data,'' Res. Higher Educ., Tech. Rep. 15, 2012.
[26] V. Ramesh, P. Parkavi, and K. Ramar, ''Predicting student performance: A statistical and data mining approach,'' Int. J. Comput. Appl., vol. 63, no. 8, pp. 35–39, Feb. 2013.
[27] H. Bydovska and L. Popelínskỳ, ''Predicting student performance in higher education,'' in Proc. 24th Int. Workshop Database Expert Syst. Appl., Aug. 2013, pp. 141–145.
[28] M. S. B. M. Azmi and I. H. B. M. Paris, ''Academic performance prediction based on voting technique,'' in Proc. IEEE 3rd Int. Conf. Commun. Softw. Netw., May 2011, pp. 24–27.
[29] B. Minaei-Bidgoli, D. A. Kashy, G. Kortemeyer, and W. F. Punch, ''Predicting student performance: An application of data mining methods with an educational web-based system,'' in Proc. 33rd Annu. Frontiers Educ. (FIE), vol. 1, 2003, p. T2A-13.
[30] C. Romero, S. Ventura, P. G. Espejo, and C. Hervás, ''Data mining algorithms to classify students,'' in Proc. 1st Int. Conf. Educ. Data Mining, Montréal, QC, Canada, Jun. 2008.
[31] K. B. Bhegade and S. V. Shinde, ''Student performance prediction system with educational data mining,'' Int. J. Comput. Appl., vol. 146, no. 5, pp. 32–35, Jul. 2016.
[32] R. Sumitha, E. Vinothkumar, and P. Scholar, ''Prediction of students outcome using data mining techniques,'' Int. J. Sci. Eng. Appl. Sci., vol. 2, no. 6, p. 8, 2016.
[33] A. Tribhuvan, P. Tribhuvan, and J. Gade, ''Applying Naïve Bayesian classifier for predicting performance of a student using WEKA,'' Adv. Comput. Res., vol. 7, no. 1, p. 239, 2015.
[34] M. F. Møller, A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning. Aarhus, Denmark: Aarhus Univ., Computer Science Department, 1990.
[35] P. P. Bendangnuksung, ''Students' performance prediction using deep neural network,'' Int. J. Appl. Eng. Res., vol. 13, no. 2, pp. 1171–1176, 2018.
[36] P. M. Arsad, N. Buniyamin, and J.-L.-A. Manan, ''A neural network students' performance prediction model (NNSPPM),'' in Proc. IEEE Int. Conf. Smart Instrum., Meas. Appl. (ICSIMA), Nov. 2013, pp. 1–5.
[37] F. Razaque, N. Soomro, S. A. Shaikh, S. Soomro, J. A. Samo, N. Kumar, and H. Dharejo, ''Using Naïve Bayes algorithm to students' bachelor academic performances analysis,'' in Proc. 4th IEEE Int. Conf. Eng. Technol. Appl. Sci. (ICETAS), Dec. 2017, pp. 1–5.
[38] A. Srinivas and V. Ramaraju, ''Detection of failures by analysis of student academic performance using Naïve Bayes classifier,'' Int. J. Comput. Math. Sci., vol. 7, no. 3, pp. 277–283, 2017.
[39] K. Eashwar, R. Venkatesan, and D. Ganesh, ''Student performance prediction using SVM,'' Int. J. Mech. Eng. Technol., vol. 8, no. 11, pp. 649–662, 2017.
[40] M. Jamuna and S. A. Shoba, ''Educational data mining students performance prediction using SVM techniques,'' Int. Res. J. Eng. Technol., vol. 4, no. 8, pp. 1248–1254, 2017.
[41] L. Breiman, ''Arcing classifier (with discussion and a rejoinder by the author),'' Ann. Statist., vol. 26, no. 3, pp. 801–849, 1998.
[42] Y. Freund and R. E. Schapire, ''A decision-theoretic generalization of on-line learning and an application to boosting,'' J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, Aug. 1997.
[43] L. Breiman, ''Bagging predictors,'' Mach. Learn., vol. 24, no. 2, pp. 123–140, 1996.
[44] L. Breiman, ''Random forests,'' Mach. Learn., vol. 45, no. 1, pp. 5–32, Oct. 2001.
[45] D. H. Wolpert, ''Stacked generalization,'' Neural Netw., vol. 5, no. 2, pp. 241–259, 1992.
[46] M. M. Ashenafi, M. Ronchetti, and G. Riccardi, ''Predicting student progress from peer-assessment data,'' in Proc. 9th Int. Conf. Educ. Data Mining, Raleigh, NC, USA, 2016.


[47] (2020). Top 10 Reasons Why Python is so Popular With Developers in 2020. [Online]. Available: https://www.upgrad.com/blog/reasons-why-python-popular-with-developers/
[48] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, ''SMOTE: Synthetic minority over-sampling technique,'' J. Artif. Intell. Res., vol. 16, no. 1, pp. 321–357, 2002.
[49] A. Roy, R. M. O. Cruz, R. Sabourin, and G. D. C. Cavalcanti, ''A study on combining dynamic selection and data preprocessing for imbalance learning,'' Neurocomputing, vol. 286, pp. 179–192, Apr. 2018.
[50] H. Faris, R. Abukhurma, W. Almanaseer, M. Saadeh, A. M. Mora, P. A. Castillo, and I. Aljarah, ''Improving financial bankruptcy prediction in a highly imbalanced class distribution using oversampling and ensemble learning: A case from the Spanish market,'' Prog. Artif. Intell., vol. 9, no. 1, pp. 31–53, Mar. 2020.
[51] S. Barua, M. M. Islam, X. Yao, and K. Murase, ''MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning,'' IEEE Trans. Knowl. Data Eng., vol. 26, no. 2, pp. 405–425, Feb. 2012.
[52] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, ''SMOTEBoost: Improving prediction of the minority class in boosting,'' in Proc. Eur. Conf. Princ. Data Mining Knowl. Discovery. Berlin, Germany: Springer, 2003, pp. 107–119.
[53] H. Han, W.-Y. Wang, and B.-H. Mao, ''Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning,'' in Proc. Int. Conf. Intell. Comput. Berlin, Germany: Springer, 2005, pp. 878–887.
[54] V. L. Miguéis, A. Freitas, P. J. V. Garcia, and A. Silva, ''Early segmentation of students according to their academic performance: A predictive modelling approach,'' Decis. Support Syst., vol. 115, pp. 36–51, Nov. 2018.
[55] J. R. Quinlan, ''Induction of decision trees,'' Mach. Learn., vol. 1, no. 1, pp. 81–106, 1986.
[56] P. Geurts, D. Ernst, and L. Wehenkel, ''Extremely randomized trees,'' Mach. Learn., vol. 63, no. 1, pp. 3–42, 2006.
[57] T. Chen and C. Guestrin, ''XGBoost: A scalable tree boosting system,'' in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2016, pp. 785–794.
[58] I. Lykourentzou, I. Giannoukos, V. Nikolopoulos, G. Mpardis, and V. Loumos, ''Dropout prediction in e-learning courses through the combination of machine learning techniques,'' Comput. Educ., vol. 53, no. 3, pp. 950–965, Nov. 2009.
[59] S. Mazzanti. (Jan. 4, 2020). SHAP Values Explained Exactly How You Wished Someone Explained to You. [Online]. Available: https://towardsdatascience.com/shap-explained-the-way-i-wish-someone-explained-it-to-me-ab81cc69ef30
[60] K. Coussement, M. Phan, A. De Caigny, D. F. Benoit, and A. Raes, ''Predicting student dropout in subscription-based online learning environments: The beneficial impact of the logit leaf model,'' Decis. Support Syst., vol. 135, Aug. 2020, Art. no. 113325.

HAYAT SAHLAOUI received the B.S. degree in web programming from the Faculty of Science, University Moulay Ismail, Meknes, Morocco, in 2009, and the master's degree in mathematics education and technology from the Ecole Normale Superieure Tetouan, in 2013. She is currently pursuing the Ph.D. degree in machine learning with the Faculty of Science and Technique Errachidia, University Moulay Ismail. Since 2011, she has been a distinguished educational professional in various institutions of the Meknes district, Morocco, with ten years of teaching expertise and an unparalleled ability to explain complicated mathematical concepts in an easily understandable manner. She has a talent for employing unique teaching strategies to effectively engage all students and foster a fun and fascinating learning environment. Her research interests include education strategies, e-learning, web application development, learning analytics, data mining, and machine learning in education.

EL ARBI ABDELLAOUI ALAOUI received the Ph.D. degree in computer science from the Faculty of Sciences and Technology—Errachidia, University of Moulay Ismail, Meknes, Morocco, in 2017. He is currently a Research Professor with the Ecole Normale Supérieure de Meknes, Moulay Ismail University. His main research interests include wireless networking, machine learning, DTN networks, game theory, the Internet of Things (IoT), and smart cities.

ANAND NAYYAR received the Ph.D. degree in computer science from Desh Bhagat University, in 2017, with a focus on wireless sensor networks and swarm intelligence. He is currently working with the Graduate School, Duy Tan University, Da Nang, Vietnam. He works in the areas of wireless sensor networks, the IoT, swarm intelligence, cloud computing, artificial intelligence, blockchain, cyber security, network simulation, and wireless communications. He is a Certified Professional with more than 75 professional certificates from CISCO, Microsoft, Oracle, Google, Beingcert, EXIN, GAQM, and Cyberoam. He has published more than 450 research papers in various national and international conferences and international journals (Scopus/SCI/SCIE/SSCI indexed) with high impact factor. He has authored, coauthored, or edited more than 30 books of computer science. He has five Australian patents to his credit in the areas of wireless communications, artificial intelligence, the IoT, and image processing. He is a member of more than 50 associations as a senior member and a life member, and is acting as an ACM Distinguished Speaker. He is associated with more than 500 international conferences as a program committee member, chair, advisory board member, or review board member. He has received more than 30 awards for teaching and research, including the Young Scientist Award, the Best Scientist Award, the Young Researcher Award, the Outstanding Researcher Award, and the Excellence in Teaching Award. He is acting as an Associate Editor for Wireless Networks (Springer), IET Quantum Communication, IET Wireless Sensor Systems, IET Networks, IJDST, IJISP, and IJCINI, and as the Editor-in-Chief of the International Journal of Smart Vehicles and Smart Transportation (IJSVST) (IGI Global, USA).

SAID AGOUJIL received the M.S. and Ph.D. degrees in mathematics from the Faculty of Sciences and Technology of Marrakech (FSTM), Morocco, in 2004 and 2008, respectively. He is currently a Professor with the Department of Computer Science, Faculty of Sciences and Technology, Moulay Ismail University, Errachidia, Morocco. His current research interests include numerical analysis, wireless networks, linear algebra, and speech coding.

MUSTAFA MUSA JABER received the Ph.D. degree from the Technical University of Malaysia. He held a postdoctoral position at Universiti Tun Hussein Onn Malaysia. He is currently the Director of the Research Center with the Dijlah University College, Baghdad, Iraq. His research interests include telemedicine, machine learning, and human factors.
