A Predictive Model For The Early Identification of Student Dropout Using Data Classification, Clustering, and Association Methods

Abstract— Technology development has led to increased data generated in education, sparking interest in information extraction to support educational management through automated data analysis. Using such data to create models identifying students likely to drop out has drawn research interest. A crucial factor in reducing dropout rates is the systematic and early identification of the level of student engagement, especially by detecting the students’ behavior profile in the virtual environment, such as grades in assessments. There are predictive models based on data mining processes that identify students prone to dropping out. Unfortunately, these predictive models do not characterize the profiles of these students or the specific trends associated with these profiles. This article aims to fill that gap by presenting a study that identifies and tracks the profiles of undergraduate students likely to drop out, starting with an analysis of academic performance. We propose a predictive model that goes beyond classification by combining data mining techniques such as decision trees, clustering, and frequent pattern analysis. Decision trees, a data mining technique that uses a tree-like graph to represent decisions and their possible consequences, identify students at risk of failure from the entire dataset. Clustering analysis, a data mining technique that groups similar data points together, groups students based on similar characteristics (e.g., students who scored between 0 and 30 points on a specific activity). Frequent pattern analysis, a data mining technique that identifies patterns that occur frequently in a dataset, uncovers the underlying factors contributing to low performance (e.g., identifying which activities had the most significant influence on a specific group’s low performance). This integrated approach predicts dropout risk with 93.9% precision and provides a deeper understanding of student profiles and the trends associated with academic failure. The model’s practical application is demonstrated through a study.

Index Terms— Data analysis, data mining, performance prediction.

Received 14 July 2024; revised 6 December 2024; accepted 8 January 2025. Date of publication 13 January 2025; date of current version 7 February 2025. The work of Fabricia Roos-Frantz was supported by the Brazilian National Council for Scientific and Technological Development (CNPq) under Project 311011/2022-5. The work of Rafael Z. Frantz was supported by CNPq under Project 309425/2023-9 and Project 402915/2023-2. (Corresponding author: Patricia Mariotto Mozzaquatro Chicon.)

Patricia Mariotto Mozzaquatro Chicon is with the Distance Education Center, University of Cruz Alta (UNICRUZ), Cruz Alta, Rio Grande do Sul 98005-972, Brazil (e-mail: [email protected]).

Leo Natan Paschoal is with the Institute of Exact Sciences and Technology, Paulista University (UNIP), Araraquara, São Paulo 14804-300, Brazil (e-mail: [email protected]).

Sandro Sawicki, Fabricia Roos-Frantz, and Rafael Z. Frantz are with the Department of Exact Sciences and Engineering, Regional University of Northwestern Rio Grande do Sul (UNIJUÍ), Ijuí, Rio Grande do Sul 98700-000, Brazil (e-mail: [email protected]; [email protected]; [email protected]).

A Spanish version of this article is available as supplementary material at https://doi.org/10.1109/RITA.2025.3528369.

Digital Object Identifier 10.1109/RITA.2025.3528369

I. INTRODUCTION

School dropouts are a problem faced by private and community Higher Education Institutions (HEI) in Brazil and are a recurring challenge affecting university management [1], [2]. Various studies and definitions seek to clarify the concept of dropping out [3], [4], [5], [6], [7], [8], [9], [10]. In this study, dropping out is considered to be failing to complete a subject or a student giving up on a learning program, such as a school term.

According to De Brito et al. [1], several initiatives have been launched in Brazil to increase access to HEI, such as offering additional places and creating funding programs. However, there need to be more initiatives aimed at helping students to complete their studies successfully. One key aspect is monitoring student learning in the distance education environment. In this educational context, it is more difficult to monitor student performance due to the lack of face-to-face contact between teachers and students and difficulties in perceiving interactions, including student behavior. This often results in a lack of student motivation [11].

One of the unresolved challenges is identifying behavioral profiles that influence students’ success in distance education and, consequently, their permanence in the subject or course [12], [13]. The development of methods to monitor student performance through task completion in a Learning Management System (LMS) can effectively identify these behavioral profiles and reduce the dropout rate in distance education courses.

Identifying students at risk of dropping out, in anticipation, can help educational institutions make better decisions and reduce the chance of a student failing [14]. One of the advantages of early detection of the risk of dropping out is that it eliminates the need to wait for the final grades of the subjects in a given period. Students’ performance can be monitored in real time. As a result, action can be taken in advance to prevent students from failing the examined subjects [15].
Data mining techniques can help to identify students who are prone to dropping out early, making it possible to predict academic performance in educational environments [16], [17], [18]. Performance indicators are related to the overall performance of students, based on their academic performance [19], [20].

Several authors have used the performance indicator, obtained via LMS data, to extract information and identify patterns in the stored data [21], [22], [23], [24]. Although predictions of students’ academic performance and dropout rates are topics that have been widely researched in the literature [21], we have found that most predictive models focus solely on identifying the risk of dropping out without characterizing the profiles of at-risk students or revealing implicit patterns and trends associated with these profiles [25]. So far, predictive models have predominantly focused on identifying the number of students with low academic performance. Generally, these approaches apply classification techniques to build models that indicate whether a student is at risk of failing and, consequently, of school dropout [22], [24], [26]. However, there is little combined use of data mining techniques that go beyond this initial classification. There is little evidence in the literature regarding the combined application of data mining techniques capable of identifying at-risk students, characterizing their profiles, and discovering relationships between the associated data. This gap limits a deeper and more contextualized understanding of the factors influencing academic performance and dropout.

Once predictions are generated, it is possible, in addition to knowing the number of students with low performance, to obtain information about the profiles of these students by applying additional data mining tasks. With this information in hand, improving the performance of students at risk of failure becomes possible.

This article aims to present a prediction model to identify the profiles of students who are likely to underachieve in school and, thus, prone to dropping out. Data mining techniques are used to identify the risk of dropping out, characterize groups prone to dropping out, and uncover patterns and trends in the data. This valuable information can help university administrators develop effective student retention strategies. It is worth highlighting that the article’s primary contribution is a predictive model that identifies students at risk of failing and dropping out, groups students with similar performance, and links activities and content that affect the risk of failure. Combining these characteristics in a single model has not been done before, making our work original.

It is essential to highlight that the J48, K-means, and Apriori algorithms were employed to construct the model using different data mining techniques. J48 is a decision tree-based algorithm that predicts the number of students likely to fail a specific course. K-means, in turn, is a partitioning clustering algorithm that allows the creation of groups of students with similar profiles. Finally, Apriori generates association rules, identifying recurring patterns within the group of students at risk of failing. These algorithms were chosen based on the study by Chicon et al. [27].

II. RELATED WORKS

Various methods using data mining techniques have been developed to predict the actions and behavior of students in learning environments. This section presents a non-exhaustive review of predictive models capable of identifying, in anticipation, students at risk of dropping out using data mining tasks. It is important to note that this review is based on systematic investigations that have analyzed existing predictive models [27], [28].

Sukhbaatar et al. [29] proposed a dropout prediction model that considers characteristics acquired from previous years’ data, such as online course grades. The model uses a decision tree, a data mining method, to classify students who tend to drop out during the academic semester. Once the model has been developed, the student’s course grades are used to assess those at risk of dropping out in the current semester. As a result, the model provides the percentage of students at risk of dropping out.

Azcona et al. [30] presented a prediction model that aims to identify, every week, students with a tendency to drop out. The model uses various data sources, including student demographics and dynamic data (e.g., the number of interactions with content, interactions with educational materials, and tasks). Different classification methods were tested to build the model, including logistic regression, support vector machine with linear kernel, support vector machine with RBF kernel, random forest, decision tree, and k-nearest neighbours. They selected k-nearest neighbours because it achieved the best results during the evaluation. As a result, the model can generate weekly predictions throughout the semester.

Brandão et al. [31] proposed a model that utilizes data mining techniques to understand the influence of variables related to participants’ performance on dropout rates. The model analyzed data such as the number of interactions, submitted assignments, and forum responses. Classification and clustering tasks were employed to construct the model, which was implemented using the decision tree, random forest, and k-means algorithms. The results indicated that the decision tree algorithm achieved the best performance. However, clustering proved promising, as it can contribute to analyzing behaviors that, once identified, enable the planning of specific pedagogical actions.

Milinković and Vujović [15] proposed prediction models to identify students who tend to drop out in the first and second semesters of an undergraduate course. The prediction models were built using the decision tree classifier with the J48 algorithm. The analysis involved two sets of input data in identifying factors that could influence potential academic failure and dropout. The model considered pre-university personal data (e.g., high school grades, information related to entrance exams, among others), administrative data (e.g., student income), and demographic data (e.g., number of subjects taken, age, place of residence). In the first analysis, the predictive model was created using all the attributes from the input data sets. In the second analysis, the predictive model was created after applying a process of selecting relevant attributes.
The results revealed that the best predictive performance was achieved using the most data possible, with 50% of the attributes chosen linked to student activities in the first semester. Milinković and Vujović [15] concluded that the characteristics of the respective data sets strongly influence the quality of predictive models.

Umer et al. [25] also developed a prediction model to assess students’ academic performance every week. The model extracted four data sets from four face-to-face and remote learning courses. Sixteen data sets were created for each course, using student logs and grades assigned to their completed activities. Creating the prediction model involved analyzing different classification algorithms, including random forest, naive bayes, logistic regression, and linear discriminant analysis, among other classifiers. As a result, the random forest classifier was chosen since it performed the best on the analyzed data sets. Thus, the prediction model showed the percentage of students at risk of failing or dropping out for each data set.

Usman et al. [32] developed a prediction model to predict student performance before school exams. The model considered student interactions and educational activity submissions via LMS Moodle. During the construction of the prediction model, the following classifiers were tested: decision tree, naive bayes, and k-nearest neighbors. As a result, the decision tree was selected as the classifier with the best performance compared to the others. Furthermore, the research found that uploading activities was the most important and impactful resource for predicting student performance.

Lopes Filho and Silveira [33] developed a predictive model to identify students most likely to drop out at the end of each bimester. The model used administrative data from state school students. Four classification algorithms were evaluated to build the predictive model, including decision forest, logistic regression, and Bayes point machine. As a result, the decision forest classifier was considered the most effective. The algorithm, trained on data from the previous year, identified students at risk of dropping out in each bimester of 2019.

Recently, Kim et al. [34] proposed a student dropout prediction system by developing a model capable of identifying students likely to leave the university. The model was constructed using classification and clustering tasks, implemented with the K-means, logistic regression, artificial neural networks, gradient boosting, and PCA algorithms. As a result, the model categorized the reasons for dropout into four groups: employed, did not register, personal issues, and admitted to other universities. Additionally, by predicting the reasons for dropout, specific recommendations were provided to each department, enabling students to receive personalized counseling.

The analysis of existing studies on dropout prediction models reveals two distinct categories. The first group of studies compares the techniques, methods, and algorithms used to build prediction models [30], [31], [32], [33]. The second group of studies presents prediction models that provide information on the number of students likely to underperform at school and drop out [15], [25], [29], [34].

Table I summarizes the predictive models used to detect school dropouts. It was observed that most of the predictive models analyzed utilized data such as grades in activities and the number of interactions with content and activities. In 50% of the studies, algorithms were compared before being applied to detect students’ performance; the other 50% of the predictive models were directly constructed with previously defined data mining techniques. Furthermore, it was found that most of the predictive models were applied in undergraduate courses. The most commonly used data mining tasks were classification and clustering, and the most cited algorithms in the analyzed models include decision tree, J48, k-means, logistic regression, random forest, and naive bayes.

The research reported in this article goes beyond the state of the art in that it proposes a prediction model that goes beyond simply indicating the percentage of students at risk of dropping out due to poor academic performance. This model can characterize the profiles of truant students according to the type of activity and the content covered in the assignment. In addition, the model also discovers patterns of relationships between the data in these profiles, revealing associations between activity records that may be repeated in future activities, such as repetition of grades from previous tasks.

III. MATERIAL AND METHODS

A. Description of the Predictive Model

The predictive model aims to identify, in anticipation, students at risk of failing due to poor performance and those at risk of dropping out. It also characterizes the profile of groups of students prone to failure and identifies underlying patterns and trends in student data. To achieve this, the proposed model uses student grades, which vary over time. This approach enables the development of a generic model that can be applied to student data from various subjects and courses.

This predictive model was built based on the theoretical model of Rovai [35], as updated by Ramos et al. [36], which addresses factors linked to student dropout and persistence, including dropout indicators associated with student success in distance education courses. The model uses data mining techniques to carry out this process, accounting for dropout indicators associated with student performance in subjects taught via LMS Moodle. In particular, the model makes predictions based on students’ grades in subject activities throughout an academic semester. Based on these data, the model calculates the performance score of students at risk of dropping out and assigns them to one of the following categories (a code sketch of this labeling rule is given below):

• Good performance (A): this category includes students who scored between 7.0 and 10.0 points. This range indicates that there is no risk of dropout.
• Low performance (R): this category covers students who scored between 0.0 and 6.9 points. This range indicates that the student is at risk of dropping out.

The data mining technique was applied to identify students at risk of dropout by implementing the Knowledge Discovery in Databases (KDD) process. KDD seeks to identify valid and valuable behavior patterns. Therefore, the data mining stage is an integral part of this process [37].
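In code, this category assignment reduces to a single threshold on the 0–10 grade scale. The sketch below is our illustration, not part of the authors’ implementation; the function name is hypothetical.

def classify_performance(score: float) -> str:
    """Assign the model's performance category to a 0-10 grade:
    'A' (good performance, 7.0-10.0, no dropout risk) or
    'R' (low performance, 0.0-6.9, at risk of dropping out)."""
    return "A" if score >= 7.0 else "R"

assert classify_performance(8.5) == "A"
assert classify_performance(6.9) == "R"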
TABLE I
OVERVIEW OF DATA, TASKS, TECHNIQUES, AND ALGORITHMS IN RELATED WORKS
1) Data Selection: The data used to develop the prediction model were defined and selected to train the model. The training dataset was composed of data from 96 students who completed various courses in the distance learning format, including a Computer Science course (taught in previous years) for students in the Law and Agronomy programs, which shared the same syllabus. Each of the 96 students completed five activities throughout the course, resulting in 480 data points. The data used were extracted from the mdl_grade_grades and mdl_grade_items tables in Moodle, which store information about grades and assessment items.

2) Pre-Processing of Data: After selection, the data were prepared to be used with the mining algorithms and predictions. At this step, any faults in the data were discovered and corrected before being submitted to knowledge extraction methods. To do this, a data cleaning exercise was conducted, which removes inconsistent data [38]. Incomplete records, incorrect values, and inconsistent data were all corrected. The missing values in the database had to be addressed. A technique called “Replace Missing Values”1 was used to replace all missing values in the dataset. For numerical data, missing values were replaced with the average of the attribute’s existing values. For nominal attributes, the missing values were replaced by the mode, i.e., the most frequent value of that attribute. This technique is accessible in the Weka software.2

3) Data Transformation: During the pre-processing of data, the data were modified to meet the requirements of each algorithm that would be used in the model. Normalizing the values was a key step in this process, as it transformed them into a specific, standardized range. To conduct this task, the “Normalization” technique3 from the Weka software was employed.
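The following is a minimal pandas/scikit-learn sketch of these two preparation steps, mean/mode replacement of missing values and normalization to a common scale. It is our illustration of the idea, with hypothetical column names; the authors used Weka’s own filters.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical extract of the Moodle grade data: one row per student,
# one column per activity grade (0-10), plus a nominal class attribute.
df = pd.DataFrame({
    "activity1": [7.5, None, 9.0],
    "activity2": [6.0, 4.5, None],
    "class":     ["A", "R", None],
})

# Pre-processing: numerical attributes get the column mean,
# nominal attributes get the mode (the most frequent value).
num_cols = ["activity1", "activity2"]
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
df["class"] = df["class"].fillna(df["class"].mode()[0])

# Transformation: normalize the grades to a standardized [0, 1] range.
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])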
4) Data Mining: Data mining is carried out in the next stage, playing a key role in generating grade predictions and identifying potential failures. To achieve this, we used the classification task with the decision tree technique4 and the J48 algorithm5 to assign objects to specific classes [38]. This combination allowed us to visualize the output of the predictive model as a decision tree.

We used a clustering process to group the data, dividing the records into distinct groups called clusters [37]. This stage involved applying the partitional clustering technique6 and the K-means algorithm. K-means uses centroids as group representatives, calculated as the average of the objects in the group. The iterative process ends when the centroids stop changing or after a set number of iterations.

To identify frequent patterns, the association task was used to find sets of items that occur simultaneously and frequently in a data collection [37]. For this stage, we used the frequent patterns technique, which looks for strong relationships between attribute values or items [38]. The Apriori algorithm was chosen to extract data using association rules. Apriori creates candidate sets of large item sets, counts the occurrences in each set, and selects the large item sets that reach a predetermined minimum support [40].

A training base of 96 instances was created to support the data classification task.7 This step is fundamental because the model must be trained using a data collection to generate rules for decision-making. The classification algorithm then applies these rules to make predictions. The training base and other devices used to design this model are accessible in a laboratory package.
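As an illustration of the clustering and association steps just described, the sketch below uses scikit-learn’s KMeans and the Apriori implementation from the mlxtend library as stand-ins for their Weka counterparts. The data, the discretization into “low grade” items, and thresholds such as min_support are our assumptions, not values reported in the article.

import pandas as pd
from sklearn.cluster import KMeans
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical grade matrix: one row per student, one column per activity.
grades = pd.DataFrame(
    [[8.0, 7.5, 9.0, 8.5, 7.0],
     [2.0, 3.5, 1.0, 2.5, 4.0],
     [3.0, 2.0, 1.5, 3.5, 4.5],
     [5.0, 6.0, 4.5, 5.5, 6.5]],
    columns=[f"activity{i}" for i in range(1, 6)],
)

# Clustering: K-means groups students with similar grade profiles; each
# group is represented by its centroid (the mean of its members).
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(grades)

# Association: discretize grades into boolean items (grade below 7.0 =
# "low") and mine frequent item sets and rules with Apriori.
items = grades.lt(7.0).add_suffix("=low")
frequent = apriori(items, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.9)
print(rules[["antecedents", "consequents", "support", "confidence"]])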
1 The “Replace Missing Values” technique is often used in data mining to replace missing values with appropriate values in data sets. This is done to maintain data integrity and allow mining algorithms to be used later [39].
2 More information is available at: https://www.cs.waikato.ac.nz/ml/weka/index.html
3 The Normalization technique is often used in data mining to adjust the values of a data set to a specific scale. This is to balance out the attribute ranges and ensure that the values are within the same range [39].
4 A decision tree is a map of the possible outcomes of a series of related choices [38].
5 The J48 algorithm aims to generate a decision tree based on a set of training data [37].
6 The partitional clustering technique divides data objects into non-overlapping subsets (groups) so that each data object is in precisely one subset [37].
7 Instances refer to the amount of student data used in the training base.
Listing 1. Fragment of the training base.

The code snippet presented in Listing 1 illustrates part of the created training base. This database was structured according to the Weka tool’s executable standards. The database comprises the grades obtained by students in five activities, represented in lines 2 to 6. In addition, a letter indicates the classification result, as described in line 7. Some of the training data are presented starting on line 9. A reconstruction of this format is shown below.
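Listing 1 appears only as an image in the original; the fragment below is our reconstruction of what a training base with the described structure looks like in Weka’s ARFF format. The relation and attribute names and the grade values are illustrative assumptions.

@relation training_base
@attribute activity1 numeric
@attribute activity2 numeric
@attribute activity3 numeric
@attribute activity4 numeric
@attribute activity5 numeric
@attribute class {A,R}
@data
8.5,7.0,9.0,8.0,7.5,A
3.0,4.5,2.0,1.5,5.0,R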
Fig. 1. The decision tree created using the J48 algorithm.

A decision tree was created to make the predictions. Figure 1 allows us to make the following interpretations (a sketch of how such a tree can be reproduced and read follows the list):
• Students who achieve 2 points or less in the fourth activity are more likely to fail the course;
• Students who achieve 8 points or less in the third activity are likely to fail the course;
• Students who score more than 8 points in the third activity tend to pass the course;
• Students who achieve 8.2 points or less in the first activity are likely to fail the course;
• Students who score more than 8.2 points in the first activity are likely to pass the course;
• Students who score more than 3.5 points in the third activity are likely to pass the course.
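The tree itself is shown only as a figure; the sketch below illustrates how a comparable tree can be trained and printed as threshold rules outside Weka, using scikit-learn’s entropy-based decision tree as a stand-in for J48 (which implements C4.5). The grades and resulting split values are hypothetical.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data: grades in five activities (0-10 scale)
# and the class label 'A' (pass) or 'R' (at risk of failing).
X = np.array([[8.5, 7.0, 9.0, 8.0, 7.5],
              [3.0, 4.5, 2.0, 1.5, 5.0],
              [9.0, 8.5, 8.5, 9.5, 8.0],
              [4.0, 5.0, 3.5, 2.0, 4.5]])
y = ["A", "R", "A", "R"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# export_text prints the tree as nested threshold rules, the same form
# in which Figure 1 is read (e.g., "activity4 <= 2.0" leads to class R).
print(export_text(tree, feature_names=[f"activity{i}" for i in range(1, 6)]))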
5) Post-Processing: The final stage in constructing the model was to validate the accuracy of the predictions. To achieve this, the cross-validation technique was used. This entails randomly dividing the data into ten folds; nine are used as the training set and one as the test set [41]. This division was conducted repeatedly until each part had served as the test set. After each iteration, the model’s accuracy for each specific division was calculated. Upon completion, the classifier’s final accuracy was calculated as the average of the accuracy ratings. The algorithm’s 93.8% accuracy reveals how effectively the model can produce accurate predictions.
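A sketch of this validation step follows, with scikit-learn’s ten-fold stratified cross-validation standing in for Weka’s evaluation and synthetic data standing in for the 96-instance training base; the labeling threshold below is chosen only to keep the synthetic classes balanced.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, cohen_kappa_score)

# Synthetic stand-in for the training base: five activity grades per
# student, labeled from the mean grade.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(96, 5))
y = np.where(X.mean(axis=1) >= 5.0, "A", "R")

# Ten folds: nine serve for training and one for testing, rotating
# until every fold has served as the test set, as described above.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
y_pred = cross_val_predict(clf, X, y, cv=cv)

print("accuracy :", accuracy_score(y, y_pred))
print("precision:", precision_score(y, y_pred, pos_label="R"))
print("recall   :", recall_score(y, y_pred, pos_label="R"))
print("f-measure:", f1_score(y, y_pred, pos_label="R"))
print("kappa    :", cohen_kappa_score(y, y_pred))

In Weka, the MAE and RMSE reported in Table II are derived from the classifier’s class-probability estimates rather than the hard labels above, which is why they are omitted from this sketch.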
TABLE II
MODEL VALIDATION RESULTS

Various evaluation metrics were used to measure the quality of the training model. These metrics provide information on different aspects of the model’s performance. The primary metrics used are described below, according to Pandey and Rajpoot [42].
• Precision: a metric that compares the proportion of accurately identified positive cases to the total number of cases initially labeled as positive.
• Recall: a metric that compares the percentage of accurately detected positive cases to the total number of cases in the positive class.
• F-measure: a metric that combines the precision and recall metrics to assess the performance of the classification model. This metric considers the model’s ability to identify and classify true positives accurately.
• Kappa Statistic: an index that compares the observed agreement between predicted and actual values to the agreement expected at random.
• Mean Absolute Error: measures the difference between predicted and actual values across all cases. It measures how much predictions differ from the actual value in absolute terms.
• Root Mean Square Error: measures the success of numerical predictions.

The described metrics were applied to the model generated using the training sets. The results are presented in Table II. All of the metrics examined indicate satisfactory model performance. The Precision, Recall, F-Measure, and Kappa Statistic achieved scores close to 1, indicating that the model can identify true positives while minimizing false negatives. The F-Measure, in particular, achieved 0.961, a high score, proving that the model can correctly identify positive cases while minimizing false negatives. Regarding error metrics, the Mean Absolute Error achieved 0.0735, indicating that the model fits the data well. The Root Mean Squared Error also scored a low mark, indicating that the model predicts values relatively accurately.

B. Model Overview

As shown in Figure 2, the dropout prediction model consists of three distinct phases: input, processing, and output.
• During the input phase, the model uses student performance to indicate dropping out while considering their grades in educational activities on LMS Moodle.
• During the processing phase, the data is prepared for mining, and data mining techniques are applied.
– The classification task is used to identify the risk of failure.
– The clustering task is used to characterize the profiles of students prone to failure.
– The association task is used to discover underlying patterns and trends in the data.
• During the output phase, the model generates early predictions of students with a high risk of failing and, therefore, a higher probability of dropping out. It also groups similar profiles and identifies common patterns within the groups.

An educational administrator or teacher interested in using the model must follow a series of steps, as depicted in Figure 3. These steps include: (i) selecting the desired subject and extracting grades from LMS Moodle; (ii) cleaning the data, removing duplicate records and missing values; (iii) preparing the data for application of data mining algorithms; (iv) applying the tasks, methods, and algorithms using a data mining tool; and (v) validating the predictions with a tool that supports this feature.

Fig. 3. Steps for instantiating the model.

IV. TESTING THE PREDICTIVE MODEL’S APPLICATION

After creating the model and conducting the statistical validation using the metrics specified in the previous section, it was necessary to test its performance in a real situation. This test is important because the developed model aims to identify students with low performance and those likely to drop out early.

A. Context Description

To illustrate the application of the model, we used sample data from 48 students enrolled in a Computer Science course in a Law degree program at a higher education institution. This subject was taught via distance education during the second half of 2020. The data were collected from ten activities conducted during this period. Table III presents a brief description of these activities. Notably, five of the ten activities account for 50% of the grade for that semester, with the remaining five making up the remaining half. The data was collected from the LMS Moodle database’s mdl_grades table because it relates to the student performance indicator [27], [28].

B. Model Application Results

The model was developed by following the steps shown in Figure 3; this is how the database was set up to produce the predictions. The model was run before one of the semester’s exams to make an early prediction of the students at risk of dropping out.8 Table IV shows the results of the generated predictions from the first two terms and the academic semester.

8 The teacher conducts two exams per term to assess the student’s performance.
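In code, this early run amounts to applying the already-trained classifier to the partial grades recorded so far in the term. The sketch below reuses the hypothetical tree trained in the earlier example; the mid-term grades are invented for illustration.

import numpy as np

# Hypothetical grades of three current students on the activities
# completed so far this term (0-10 scale), before the term's exams.
current_term = np.array([[8.0, 7.5, 9.0, 8.0, 7.0],
                         [3.0, 2.5, 4.0, 1.5, 5.0],
                         [6.5, 7.0, 5.5, 6.0, 6.5]])

# 'tree' is the classifier trained in the earlier sketch; every 'R'
# flags a student at risk of failing before any final grade exists.
predicted = tree.predict(current_term)
share_at_risk = (predicted == "R").mean() * 100
print(f"{share_at_risk:.1f}% of students predicted at risk of dropping out")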
TABLE III
PROPOSED ACTIVITIES DURING THE SECOND SEMESTER OF 2020

TABLE IV
PREDICTIONS MADE DURING THE FIRST TERM, THE SECOND TERM, AND THE ACADEMIC SEMESTER
TABLE V
ASSOCIATION RULES GENERATED FOR DATA FROM THE FIRST BIMESTER

Fig. 8. Number of repetitions of the association rules in the second bimester.

The ninth activity involved using the Microsoft Word text editor again, with students having to file an academic paper. In the tenth activity, students were required to complete a 20-question multiple-choice questionnaire covering all course areas.

Figure 9 depicts the number of occurrences of activities in the rules generated for the academic semester as a whole. The sixth, ninth, and tenth activities significantly influenced the students’ low performance, appearing twelve times in the generated rules.

Over the course of the semester, students appeared to have more difficulty with content related to the Microsoft Word text editor, spreadsheets, and multimedia presentations.

V. CONCLUSION

This research presented a predictive model developed to identify students likely to underachieve and drop out of education at an early stage based on their performance in educational activities in LMS. The model was developed using 96 samples, and predictions were made using the classification task, the decision tree technique, and the J48 algorithm. The clustering task, the partitional grouping technique, and the K-means algorithm were used to model the students’ profiles. In addition, the association task was used to identify common patterns in student behavior using the frequent patterns technique and the Apriori algorithm. The model was validated using a set of metrics widely used in data mining. The results show that the prediction model achieved 93.8% precision.

Therefore, as a contribution, this research has presented a prediction model that goes beyond the widely used classification algorithms to identify potential dropout risks based on students’ academic performance. The model also employs a clustering algorithm to represent the profile of students likely to underperform and drop out. This enabled the identification of characteristics that contribute to students’ poor performance and the discovery of patterns of relationships between data from these students’ profiles. Furthermore, the model identified the activities that contribute to poor student performance.
The prediction model has the potential to help teachers make decisions by identifying student behavior, allowing them to plan more appropriate learning paths for each student throughout their educational career. Furthermore, the model can be applied in different contexts using the data mining model, providing insights for university administrators when planning effective strategies to maintain student attendance and engagement.

Once the model was built, a practical demonstration was conducted to test its applicability. This demonstration used sample data from 48 students enrolled in the Law course’s Computer Science module to make predictions. The demonstration results showed that the model could fulfill its purpose: to identify the percentage of students likely to drop out based on poor academic performance. In addition, the model generated profiles that helped to identify the activities that most influenced students’ poor performance based on their scores.

It is worth highlighting that the model’s applicability was demonstrated using a specific example: the Computer Science course. Future research can explore the model’s behavior in predicting poor performance and dropout in other subjects and fields of study, enabling a more comprehensive understanding of the model’s performance in different educational contexts.

As part of future research, we plan to include other indicators for dropout in the predictive model, such as students’ access data and socioeconomic status. Including additional dropout indicators in the model can provide a more comprehensive understanding of the students’ profiles and the recurring patterns that influence dropout. By enriching the prediction data (i.e., the data the model uses to generate its predictions), these additional factors could enhance the accuracy and relevance of the predictions. In addition, it is important to highlight that the model developed does not yet provide concrete proposals for teachers developing initiatives to prevent poor performance. However, this area could be explored in future research, considering that the learning analytics approach allows for personalized interventions.

REFERENCES

[1] B. C. P. D. Brito, R. F. L. D. Mello, and G. Alves, “Identificação de atributos relevantes na evasão no ensino superior público brasileiro,” in Proc. Simpósio Brasileiro de Informática na Educação (SBIE), Nov. 2020, pp. 1032–1041.
[2] F. Paz and S. Cazella, “Identificando o perfil de evasão de alunos de graduação através da mineração de dados educacionais: Um estudo de caso de uma universidade comunitária,” in Proc. Workshops do Congresso Brasileiro de Informática na Educação, 2017, pp. 624–633.
[3] A. Ramesh, D. Goldwasser, B. Huang, H. Daumé III, and L. Getoor, “Modeling learner engagement in MOOCs using probabilistic soft logic,” in Proc. Workshop Data Driven Educ., 2013, pp. 1–7.
[4] J. Kim, P. J. Guo, D. T. Seaton, P. Mitros, K. Z. Gajos, and R. C. Miller, “Understanding in-video dropouts and interaction peaks in online lecture videos,” in Proc. 1st ACM Conf. Learn. Scale Conf., Mar. 2014, pp. 31–40.
[5] B. Drăgulescu, M. Bucos, and R. Vasiu, “Predicting assignment submissions in a multi-class classification problem,” TEM J., vol. 4, no. 3, pp. 244–254, 2015.
[6] J. Whitehill, J. E. D. Williams, G. Lopez, C. Coleman, and J. Reich, “Beyond prediction: First steps toward automatic intervention in MOOC student stopout,” in Proc. Int. Conf. Educ. Data Mining, 2015, pp. 1–8.
[7] C. Piech et al., “Deep knowledge tracing,” in Proc. Adv. Neural Inf. Process. Syst., vol. 28, 2015, pp. 505–513.
[8] J. Liang, C. Li, and L. Zheng, “Machine learning application in MOOCs: Dropout prediction,” in Proc. 11th Int. Conf. Comput. Sci. Educ. (ICCSE), Aug. 2016, pp. 52–57.
[9] W. Wang, H. Yu, and C. Miao, “Deep model for dropout prediction in MOOCs,” in Proc. 2nd Int. Conf. Crowd Sci. Eng., Jul. 2017, pp. 26–32.
[10] W. M. Ramos and C. I. Boll, “Persistência e evasão na educação a distância,” in Dicionário Crítico de Educação e Tecnologias e de Educação a Distância, D. Mill, Ed. Campinas, Brazil: Papirus, 2018, pp. 500–504.
[11] J. Wang, W. J. Doll, X. Deng, K. Park, and M. G. M. Yang, “The impact of faculty perceived reconfigurability of learning management systems on effective teaching practices,” Comput. Educ., vol. 61, pp. 146–157, Feb. 2013.
[12] M. Phan, A. De Caigny, and K. Coussement, “A decision support framework to incorporate textual data for early student dropout prediction in higher education,” Decis. Support Syst., vol. 168, May 2023, Art. no. 113940.
[13] J. G. C. Krüger, A. D. S. Britto, and J. P. Barddal, “An explainable machine learning approach for student dropout prediction,” Exp. Syst. Appl., vol. 233, Dec. 2023, Art. no. 120933.
[14] F. Del Bonifro, M. Gabbrielli, G. Lisanti, and S. P. Zingaro, “Student dropout prediction,” in Artificial Intelligence in Education, I. I. Bittencourt, M. Cukurova, K. Muldner, R. Luckin, and E. Millán, Eds., Cham, Switzerland: Springer International Publishing, 2020, pp. 129–140.
[15] S. Milinković and V. Vujović, “Students’ success predictive models based on selected input parameters set,” in Proc. 18th Int. Symp. INFOTEH-JAHORINA (INFOTEH), Mar. 2019, pp. 1–6.
[16] J. Niyogisubizo, L. Liao, E. Nziyumva, E. Murwanashyaka, and P. C. Nshimyumukiza, “Predicting student’s dropout in university classes using two-layer ensemble machine learning approach: A novel stacked generalization,” Comput. Educ., Artif. Intell., vol. 3, Jan. 2022, Art. no. 100066.
[17] J. Chen, B. Fang, H. Zhang, and X. Xue, “A systematic review for MOOC dropout prediction from the perspective of machine learning,” Interact. Learn. Environ., vol. 32, no. 5, pp. 1642–1655, 2022.
[18] N. Mduma, “Data balancing techniques for predicting student dropout using machine learning,” Data, vol. 8, no. 3, p. 49, Feb. 2023.
[19] R. Mazza and V. Dimitrova, “CourseVis: A graphical student monitoring tool for supporting instructors in web-based distance courses,” Int. J. Hum.-Comput. Stud., vol. 65, no. 2, pp. 125–139, Feb. 2007.
[20] V. Realinho, J. Machado, L. Baptista, and M. V. Martins, “Predicting student dropout and academic success,” Data, vol. 7, no. 11, p. 146, Oct. 2022.
[21] D. West, D. Heath, and H. Huijser, “Let’s talk learning analytics: A framework for implementation in relation to student retention,” Online Learn., vol. 20, no. 2, pp. 1–21, Dec. 2015.
[22] M. Teruel and L. Alonso Alemany, “Co-embeddings for student modeling in virtual learning environments,” in Proc. 26th Conf. User Model., Adaptation Personalization, Jul. 2018, pp. 73–80.
[23] E. Alqurashi, “Predicting student satisfaction and perceived learning within online learning environments,” Distance Educ., vol. 40, no. 1, pp. 133–148, Jan. 2019.
[24] T. Y. Tan, M. Jain, T. Obaid, and J. C. Nesbit, “What can completion time of quizzes tell us about students’ motivations and learning strategies?” J. Comput. Higher Educ., vol. 32, no. 2, pp. 389–405, Aug. 2020.
[25] R. Umer, A. Mathrani, T. Susnjak, and S. Lim, “Mining activity log data to predict student’s outcome in a course,” in Proc. Int. Conf. Big Data Educ., Mar. 2019, pp. 52–58.
[26] M. Brito, F. Medeiros, and E. P. Bezerra, “An infographics-based tool for monitoring dropout risk on distance learning in higher education,” in Proc. 18th Int. Conf. Inf. Technol. Based Higher Educ. Training (ITHET), Sep. 2019, pp. 1–7.
[27] P. M. M. Chicon, L. N. Paschoal, F. C. R. Frantz, R. Z. Frantz, and S. Sawicki, “Análise da construção de modelos preditivos sob a perspectiva de indicadores de evasão,” Renote, vol. 19, no. 1, pp. 341–350, Jul. 2021.
[28] P. M. M. Chicon, L. N. Paschoal, and F. C. R. Frantz, “Indicadores de evasão em ambientes virtuais de aprendizagem no contexto da educação a distância: Um mapeamento sistemático,” Renote, vol. 18, no. 2, pp. 111–120, Jan. 2021.
[29] S. Sukhbaatar, E. Denton, A. Szlam, and R. Fergus, “Learning goal embeddings via self-play for hierarchical reinforcement learning,” 2018, arXiv:1811.09083.
[30] D. Azcona, I.-H. Hsiao, and A. F. Smeaton, “Detecting students-at-risk in computer programming classes with learning analytics from students’ digital footprints,” User Model. User-Adapted Interact., vol. 29, no. 4, pp. 759–788, Sep. 2019.
[31] I. V. Brandao, J. P. C. L. da Costa, G. A. Santos, B. J. G. Praciano, F. C. M. D. Junior, and R. T. D. S. Junior, “Classification and predictive analysis of educational data to improve the quality of distance learning courses,” in Proc. Workshop Commun. Netw. Power Syst. (WCNPS), Oct. 2019, pp. 1–6.
[32] U. I. Usman, A. Salisu, A. I. Barroon, and A. Yusuf, “A comparative study of base classifiers in predicting students’ performance based on interaction with LMS platform,” FUDMA J. Sci., vol. 3, no. 1, pp. 231–239, 2019.
[33] J. A. B. Lopes Filho and I. F. Silveira, “Detecção precoce de estudantes em risco de evasão usando dados administrativos e aprendizagem de máquina,” Revista Ibérica de Sistemas e Tecnologias de Informação, vol. 1, no. 40, pp. 480–495, 2021.
[34] S. Kim, E. Choi, Y.-K. Jun, and S. Lee, “Student dropout prediction for university with high precision and recall,” Appl. Sci., vol. 13, no. 10, p. 6275, May 2023.
[35] A. P. Rovai, “In search of higher persistence rates in distance education online programs,” Internet Higher Educ., vol. 6, no. 1, pp. 1–16, Jan. 2003.
[36] W. M. Ramos, R. N. M. Bicalho, and J. D. Sousa, “Evasão e persistência em cursos superiores a distância: O estado da arte da literatura internacional,” in Proc. Conferência Forges, 2014, pp. 38–64.
[37] R. Goldschmidt, E. Passos, and E. Bezerra, Data Mining. Amsterdam, The Netherlands: Elsevier, 2015.
[38] J. Han, J. Pei, and M. Kamber, Data Mining: Concepts and Techniques. Amsterdam, The Netherlands: Elsevier, 2011.
[39] M. d. A. Silva, “Pré-processamento em mineração de dados como método de suporte à modelagem algorítmica,” Dissertação de Mestrado, Modelagem Computacional de Sistemas, Univ. Federal do Tocantins, Brazil, 2014.
[40] R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases,” in Proc. ACM SIGMOD Int. Conf. Manag. Data, Jun. 1993, pp. 207–216.
[41] C. Romero, P. G. Espejo, A. Zafra, J. R. Romero, and S. Ventura, “Web usage mining for predicting final marks of students that use moodle courses,” Comput. Appl. Eng. Educ., vol. 21, no. 1, pp. 135–146, Mar. 2013.
[42] A. K. Pandey and D. S. Rajpoot, “A comparative study of classification techniques by utilizing WEKA,” in Proc. Int. Conf. Signal Process. Commun. (ICSC), Dec. 2016, pp. 219–224.
[43] G. Stålmarck and M. Säflund, “Modeling and verifying systems and software in propositional logic,” in Safety of Computer Control Systems. Amsterdam, The Netherlands: Elsevier, 1990, pp. 31–36.

Patricia Mariotto Mozzaquatro Chicon received the M.Sc. degree in computer science from the Federal University of Santa Maria (UFSM) in 2010 and the Ph.D. degree in mathematical modeling from the Regional University of Northwestern Rio Grande do Sul (UNIJUÍ) in 2022. Her research interests include computing applied to education, focusing on mobile learning, educational data mining, and learning management systems (LMS).

Leo Natan Paschoal received the B.Sc. degree in computer science from the University of Cruz Alta (UNICRUZ), Brazil, in 2017, and the M.Sc. degree in computer science and the Ph.D. degree in computer science and computational mathematics from the University of Sao Paulo (USP), Brazil, in 2019 and 2024, respectively. His current research focuses on chatbots and software engineering education.

Sandro Sawicki received the B.Sc. degree in informatics from the Regional University of Northwestern Rio Grande do Sul (UNIJUÍ), Brazil, in 1999, and the M.Sc. and Ph.D. degrees in computer science from the Federal University of Rio Grande do Sul (UFRGS), Brazil, in 2002 and 2009, respectively. Currently, he is a Professor and a Researcher with the Postgraduate Program on Mathematical Modeling, Department of Exact Sciences and Engineering, UNIJUÍ, and a member of the Applied Computing Research Group (GCA). His research interests include mathematical optimization, graph theory, hypergraph partitioning, and search-based software engineering.

Fabricia Roos-Frantz received the Ph.D. degree in software engineering from the University of Seville, Spain. She is currently a Professor with the Department of Exact Sciences and Engineering, Regional University of Northwestern Rio Grande do Sul (UNIJUÍ), Brazil. Her current research interests include software product lines and search-based software engineering.

Rafael Z. Frantz received the Ph.D. degree in technology and software engineering from the University of Seville, Spain. He is currently a Professor with the Regional University of Northwestern Rio Grande do Sul (UNIJUÍ), Brazil, where he leads the Applied Computing Research Group (GCA). In industry, he has five years of experience as a Consulting Expert in software development.