Student Dropout Prediction
1 Introduction
Artificial Intelligence is changing many aspects of our society and our lives, since it provides the technological basis for new services and tools that support decision making in everyday life. Education is not immune to this revolution. Indeed, AI and machine learning tools can improve the learning process in several ways. A critical aspect in this context is the possibility of developing new predictive tools that can be used to help students improve their academic careers.
Among the many different observable phenomena in students' careers, University dropout is one of the most complex and adverse events, both for students and institutions. A dropout is a potentially devastating event in the life of a student, and it also negatively impacts the University from an economic point of view [6]. Furthermore, it can be a signal of potential issues in the organisation and the quality of the courses. Dropout prediction is a task that can be addressed by exploiting machine learning techniques, which have already proved effective in the field of education for evaluating students' performance [1,6,8–10].
In this work, we face the challenge of early dropout prediction for freshmen by adopting a data-driven approach. Through an automated learning process, we aim to develop a model that is capable of capturing information concerning the particular context in which dropout takes place.
We built our model by taking into account the following three design principles. First, we want to estimate the risk of quitting an academic course at an early stage, either before the student starts the course or during the first year. Statistical evidence shows that this time frame is one of the most critical periods for dropout. Targeting first-year students means that the only data we can use to train our predictive models are personal information and academic records from high school (e.g., gender, age, high school attended, final mark) and the number of credits acquired during the first months of the first year. Second, we do not focus on a specific predictive model; instead, we conducted a thorough study considering several machine learning techniques in order to construct a baseline and assess the difficulty of the problem under analysis. Last, we conducted the training and test processes on real data, collecting samples of approximately 15,000 students from a specific academic year of a major University.
The remainder of this paper has the following structure. Related approaches are discussed in Sect. 2. In Sect. 3 we describe the machine learning methods used in our analysis, the dataset we collected, and the preprocessing techniques applied to it. In Sect. 4 we evaluate the selected models by comparing their performance: first, for different values of the models' parameters; second, with respect to the features used in the train and test sets; and finally, for each academic school considered separately. We then draw final remarks in Sect. 5 and present possible uses and extensions of this work.
2 Related Work
3 Methodology
We considered a specific set of well-known classification algorithms to provide a tool enabling a reasonably accurate prediction of the dropout phenomenon.
Dataset. The dataset used for this work has been extracted from a collection
of real data. More precisely, we considered pseudo-anonymized data describing
15,000 students enrolled in several courses of the academic year 2016/2017.
The decision to focus our research on the first year is based on statistical evidence from the source data. This evidence indicates a concentration of career dropouts in the first year of the course and a progressive decrease of the phenomenon in the following years. More specifically, students who leave within the first year account for 14.8% of the total registered, while those who leave by the third year amount to 21.6%. In other words, only 6.8% of registered students abandon in the subsequent years, compared with the 14.8% who leave during the first year; this confirms the importance of acting within the first year of the program to prevent the dropout phenomenon.
Table 1 shows a detailed description of the information available in the dataset. The first column lists the names of the features, while the second column describes the possible values or ranges. The first two features represent personal data of the students, while the third and the fourth contain information related to the high school attended by the student.
Concerning the Age feature, its three possible values represent three different age ranges at the moment of enrolment: the value 1 is assigned to students up to 19 years old, 2 to students aged between 20 and 23, and 3 otherwise. The values of High school id indicate ten different kinds of high school
where the student obtained the diploma. The High school final mark represents the mark that the student received when graduating from high school.

Table 1. Available features for each student in the original dataset, along with the possible value ranges.

The flag
Additional Learning Requirements (ALR) represents the possibility of mandatory additional credits in the first academic year. In fact, some degree programs have an admission test; if it is failed, the student has to attend some further specific courses and pass the related examinations (within a given deadline) in order to be able to continue in that program. The values of the ALR feature indicate three possible situations: the value 1 describes degree programs without ALR; the value 2 stands for an ALR examination that has been passed; the value 3 indicates that the student failed to pass the ALR examination, although it was required. Academic school id represents the academic school chosen by the student: there are eleven possible schools according
to the present dataset. Course Credits indicates the number of credits acquired
by the student. We use this attribute only when evaluating students who are already enrolled, and we consider only those credits acquired before the end of the first year, in order to obtain indications on the student's situation before the condition of abandonment arises. The Boolean attribute Dropout
represents the event of a student who abandons the degree course. This feature
also represents the class for the supervised classification task and the outcome
of the inference process—i.e. the prediction. Since the dropout assumes values
True (1) or False (0), the problem treated in this work is a binary classification
one.
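To make the encoding above concrete, the following sketch shows how the Age bucketing and the binary Dropout label could be expressed in code; the field names and the sample record are hypothetical illustrations, not the actual pseudo-anonymized schema.

```python
# Hypothetical illustration of the Age bucketing and of the binary Dropout label
# described in the text; field names are assumptions, not the original schema.

def age_bucket(age_at_enrolment: int) -> int:
    """Map the age at enrolment to the three-valued Age feature."""
    if age_at_enrolment <= 19:
        return 1          # up to 19 years old
    elif age_at_enrolment <= 23:
        return 2          # between 20 and 23 years old
    return 3              # 24 or older

# A toy record with the features listed in Table 1 (values are made up).
student = {
    "Gender": "F",
    "Age": age_bucket(19),          # -> 1
    "High school id": 3,            # one of ten kinds of high school
    "High school final mark": 87,
    "ALR": 1,                       # 1: no ALR, 2: ALR passed, 3: ALR failed
    "Academic school id": 5,        # one of eleven academic schools
    "Course credits": 12,           # credits acquired during the first year
    "Dropout": 0,                   # binary target: 1 = dropout, 0 = no dropout
}
```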
It is possible to evaluate the amount of relevant information contained in the presented features by computing the Information Gain for each of them. This quantity is based on the concept of entropy; it is usually exploited to build decision trees, but it also makes it possible to rank the available features by their relevance. In our case, some of the most relevant ones are (in descending order) ALR, High school final mark, High school id, and Academic school id.
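A minimal sketch of how such a ranking could be computed with scikit-learn is reported below; it uses the mutual information estimate, which for discrete features corresponds to the Information Gain, and assumes that an already-encoded feature matrix X and the dropout labels y are available.

```python
# Sketch: rank the available features by mutual information with the Dropout
# label. X is assumed to be a pandas DataFrame of encoded features and y the
# binary dropout labels; neither is reproduced from the original dataset.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def rank_features(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
    return pd.Series(scores, index=X.columns).sort_values(ascending=False)

# Example usage (with hypothetical data):
# ranking = rank_features(X, y)
# print(ranking)  # e.g. ALR, High school final mark, High school id, ...
```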
Data Preprocessing. Before training, we applied some transformations to the data. First, we observed that in the original dataset some of the values contain an implicit ordering that is not representative of the feature itself and can bias the model. These are High school id, Academic school id, and ALR. We represent these three features as categorical, and thus not as numerical, by transforming each value with a One-hot encoding representation. As one can expect, the dataset is highly unbalanced, since the students who abandon the enrolled course are a minority (less than 12.3%); in particular, the ratio between the negative (non-dropout) and positive (dropout) examples is around 7:1. Even
though this is good for the educational institution, training a machine learning
model for binary classification with a highly unbalanced dataset may result in
poor final performance, mainly because in such a scenario the classifier would
underestimate the class with a lower number of samples [16]. For this reason, we
randomly select half of the positive samples (i.e., the students who effectively drop out) and use them in the train set; an equal number of instances of the other class
is randomly sampled from the dataset and added to the train set. In doing so, we
obtain a balanced train set, which is used to train the supervised models. The
remaining samples constitute an unbalanced test set which we use to measure
the performance of the trained models. This procedure is repeated ten times; in each trial we randomise the selection while keeping the number of samples of the two classes in the train set balanced. The final evaluation is
obtained by averaging the results of the ten trials on the test sets.
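The following sketch outlines this preparation step under some assumptions: the data are stored in a pandas DataFrame df with the column names introduced above, the three categorical features are One-hot encoded, and each trial builds a balanced train set from half of the dropout samples plus an equally sized random subset of the non-dropout samples.

```python
# Sketch of the preprocessing described above; column names and the DataFrame
# `df` are assumptions standing in for the pseudo-anonymized dataset.
import pandas as pd

CATEGORICAL = ["High school id", "Academic school id", "ALR"]

def encode(df: pd.DataFrame) -> pd.DataFrame:
    # One-hot encode the features whose numeric values carry no real ordering.
    return pd.get_dummies(df, columns=CATEGORICAL)

def balanced_split(df: pd.DataFrame, seed: int):
    # Half of the dropout samples go into the train set, together with an equally
    # sized random subset of non-dropout samples; everything else forms the test set.
    dropouts = df[df["Dropout"] == 1]
    others = df[df["Dropout"] == 0]
    train_pos = dropouts.sample(frac=0.5, random_state=seed)
    train_neg = others.sample(n=len(train_pos), random_state=seed)
    train = pd.concat([train_pos, train_neg]).sample(frac=1, random_state=seed)
    test = df.drop(train.index)
    return train, test

# Ten randomized trials; the final figures are averages over the ten test sets.
# encoded = encode(df)
# splits = [balanced_split(encoded, seed=trial) for trial in range(10)]
```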
4 Experimental Results

Fig. 1. Results obtained: (a) using RFs with an increasing number of estimators without rescaling the data; (b) using SVM for different values of C without rescaling the data; (c) using SVM for different values of C with standard rescaling; (d) using SVM for different values of C with min-max rescaling.
All the experiments have been performed using the Python programming language (version 3.7) and the scikit-learn framework [13] (version 0.22.1), which provides access to the implementation of several machine learning algorithms. Training and testing were run on a Linux workstation equipped with an 8-core Xeon 2.1 GHz processor and 96 GB of memory.
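As an illustration of the explorations summarized in Fig. 1, the sketch below varies the number of RF estimators and the SVM parameter C, with and without rescaling; the parameter grids and the accuracy metric are indicative choices rather than the exact experimental configuration, and the train/test splits are assumed to come from the procedure described in Sect. 3.

```python
# Sketch of the model/parameter exploration summarized in Fig. 1; the grids and
# the accuracy metric are illustrative, not the exact experimental setup.
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

def evaluate(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

def explore(X_train, y_train, X_test, y_test):
    # (a) Random Forests with an increasing number of estimators, no rescaling.
    rf_scores = {
        n: evaluate(RandomForestClassifier(n_estimators=n, random_state=0),
                    X_train, y_train, X_test, y_test)
        for n in (10, 50, 100, 200)
    }
    # (b)-(d) SVMs for different values of C: no rescaling, standard, min-max.
    svm_scores = {}
    scalers = {"raw": None, "standard": StandardScaler(), "minmax": MinMaxScaler()}
    for name, scaler in scalers.items():
        for C in (0.1, 1, 10, 100):
            svm = SVC(C=C) if scaler is None else make_pipeline(scaler, SVC(C=C))
            svm_scores[(name, C)] = evaluate(svm, X_train, y_train, X_test, y_test)
    return rf_scores, svm_scores
```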
Dropout Analysis per Academic School. The results in Table 2 are useful to understand the general behavior of the predictive model, but it may be difficult for the University governance to extract useful information from them. Dividing the results by academic school allows an analysis of the performance of the models at a higher resolution. This could be an important feature that helps local administrations (those of the schools) interpret the results that concern the students of their degree courses. From Table 2, we selected the best models among those trained with the basic + ALR and basic + ALR + CC features: RF for the former and SVM for the latter. The results divided by school are shown in Table 3.
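A breakdown such as the one in Table 3 can be obtained by grouping the test predictions by school before computing the metric; a minimal sketch, assuming a fitted classifier and a separate Series holding the Academic school id of each test sample, is given below.

```python
# Sketch of a per-school breakdown of the predictions; `model` is a fitted
# classifier, X_test/y_test the (unbalanced) test set, and `school_ids` a pandas
# Series with the Academic school id of each test sample, kept aside before the
# One-hot encoding step.
import pandas as pd
from sklearn.metrics import accuracy_score

def per_school_scores(model, X_test, y_test, school_ids):
    predictions = pd.Series(model.predict(X_test), index=y_test.index)
    return y_test.groupby(school_ids).apply(
        lambda labels: accuracy_score(labels, predictions.loc[labels.index]))

# Example usage (hypothetical objects):
# print(per_school_scores(best_rf, X_test, y_test, school_ids))
```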
For completeness, we report in Fig. 2 an overview of the dataset composition with respect to the school (horizontal axis) and the number of samples (vertical axis), divided into dropouts, in green, and the remaining students, in blue. The results in Table 3 highlight a non-negligible variability between the results for each school and suggest that each school contributes differently to the predictive model. For instance, the results for schools 4, 9, and 10 are higher than those of schools 3, 7, and 8, and all of these schools show results that differ significantly from the general ones (Table 2), both for basic + ALR and basic + ALR + CC. In this
case, the number of dropout samples for schools 4, 9, and 10 is respectively 207, 66, and 231 examples (504 in total), against the number of dropout samples for schools 3, 7, and 8, which is respectively 76, 63, and 89 examples (228 in total).
Table 2. Experimental results for LDA, SVM and RF classifiers over different feature sets.
Fig. 2. Number of students per school. Green represents dropout students, blue represents the students who enrolled in the second academic year. (Color figure online)
Table 3. Experimental results for each academic school: (left) RF model trained using
Basic + ALR features; (right) SVM model trained using Basic + ALR + CC features.
References
1. Aulck, L., Velagapudi, N., Blumenstock, J., West, J.: Predicting student dropout
in higher education. In: 2016 ICML Workshops #Data4Good Machine Learning,
New York, vol. abs/1606.06364, pp. 16–20 (2016)
2. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/
10.1023/A:1010933404324
3. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM
Trans. Intell. Syst. Technol. (TIST) 2(3), 27 (2011)
4. Freeman, E.A., Moisen, G.G.: A comparison of the performance of threshold criteria for binary classification in terms of predicted prevalence and kappa. Ecol. Modell. 217(1–2), 48–58 (2008)
5. Hellas, A., et al.: Predicting academic performance: a systematic literature review.
In: Proceedings Companion of the 23rd Annual ACM Conference on Innovation
and Technology in Computer Science Education, pp. 175–199. ACM (2018)
6. Jadrić, M., Garača, Ž., Ćukušić, M.: Student dropout analysis with application of
data mining methods. Manag. J. Contemp. Manag. Issues 15(1), 31–46 (2010)
7. Kadar, M., Sarraipa, J., Guevara, J.C., Restrepo, E.G.: An integrated approach
for fighting dropout and enhancing students’ satisfaction in higher education. In:
Proceedings of the 8th International Conference on Software Development and
Technologies for Enhancing Accessibility and Fighting Info-exclusion, DSAI 2019,
Thessaloniki, Greece, 20–22 June 2018, pp. 240–247 (2018)
8. Kotsiantis, S.B., Pierrakeas, C.J., Pintelas, P.E.: Preventing student dropout in
distance learning using machine learning techniques. In: Palade, V., Howlett, R.J.,
Jain, L. (eds.) KES 2003. LNCS (LNAI), vol. 2774, pp. 267–274. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45226-3_37
9. Li, H., Lynch, C.F., Barnes, T.: Early prediction of course grades: models and
feature selection. In: Conference on Educational Data Mining, pp. 492–495 (2018)
10. Márquez-Vera, C., Romero Morales, C., Ventura Soto, S.: Predicting school failure
and dropout by using data mining techniques. Rev. Iberoam. Tecnol. del Aprendiz.
8(1), 7–14 (2013)
11. Martinho, V.R.D.C., Nunes, C., Minussi, C.R.: An intelligent system for prediction
of school dropout risk group in higher education classroom based on artificial neural
networks. In: 2013 IEEE 25th International Conference on Tools with Artificial
Intelligence, pp. 159–166, November 2013
12. Pal, S.: Mining educational data using classification to decrease dropout rate of
students. CoRR abs/1206.3078 (2012)
13. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12(Oct), 2825–2830 (2011)
14. Serra, A., Perchinunno, P., Bilancia, M.: Predicting student dropouts in higher
education using supervised classification algorithms. In: Gervasi, O., et al. (eds.)
ICCSA 2018. LNCS, vol. 10962, pp. 18–33. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-95168-3_2
15. Whitehill, J., Mohan, K., Seaton, D.T., Rosen, Y., Tingley, D.: Delving deeper into
MOOC student dropout prediction. CoRR abs/1702.06404 (2017)
16. Zheng, Z., Li, Y., Cai, Y.: Oversampling method for imbalanced classification.
Comput. Inform. 34, 1017–1037 (2015)