Predicting Students' Performance
Abstract-In recent years, Massive Open Online Courses (MOOCs) and online education have seen a surge in
popularity. The considerable interest in online learning has brought to light numerous issues pertaining to
assessing student engagement, performance, and retention. Owing to the increased demands and difficulties of
online education, many academics have investigated methods to predict student outcomes, such as performance
and failure in online courses. MOOCs have been embraced by prestigious universities and institutions as an
effective dashboard platform that allows students from all over the world to take part in these courses.
Computer-marked assessments are used to evaluate the learning progress of the students; specifically, after a
student finishes an online test, the computer provides instant feedback. In addition to the degree of
engagement, researchers assert that the success rate of students in an online course can be linked to their
performance in the previous session. The literature has not adequately assessed whether student performance
and engagement on earlier tests could affect student achievement on subsequent tests. This paper thoroughly
reviews the state-of-the-art research that uses machine learning techniques to analyze the data of online
learners and forecast their results. The contributions of this study include identifying and classifying the
characteristics of online courses used to predict learners' outcomes, determining prediction outputs,
identifying strategies and feature extraction methodologies, describing evaluation metrics, offering a
taxonomy to analyze related studies, and summarizing the field's limitations and challenges. In this study,
the Decision Tree (90.64%) and Gradient Boosting (86.64%) algorithms show the highest accuracy, with Gradient
Boosting slightly lagging behind. SVM performed moderately, while Random Forest performed poorly, as indicated
by its significantly lower accuracy score.
Keywords: Machine Learning, Student Performance Prediction, MOOCs, Learning Analytics (LA), Tutor-Marked
Assessments, Computer-Marked Assessments, Recursive Feature Elimination.
I. INTRODUCTION
Education is an essential tool for a better life: it fosters self-assurance and meets essential
needs. Owing to changes in technology (such as AI/ML and IoT) and in teaching methods, most
educational institutions are incorporating these into their traditional teaching practices [1]. A student's
academic performance is the key indicator of educational progress. Institutions therefore try
to predict a student's academic success from factors such as gender, age, staff, and literacy.
Improved student performance in turn improves placements and institutional rankings, as it is a primary
factor analyzed by employers [2]. Higher education institutions face challenges in providing quality
education and in formulating techniques for predicting and assessing student performance and requirements.
Advanced teaching methods and massive open online course (MOOC) platforms are being leveraged
to create automatic grading systems, recommenders, and adaptive systems. The main issues with MOOC
systems are the lack of standard evaluation tools, the cost of the courses, and the difficulty of
identifying specific requirements due to the absence of direct communication. Online platforms use
long-term records to evaluate student performance and courses [3]. Numerous scholars have proposed various
machine learning models for the effective analysis of academic activities, an area that remains poorly
understood and where efficient data-engineering algorithms can be developed [4]. In general, machine
learning algorithms generalize from externally supplied examples to develop hypotheses that make
predictions about future cases [5].
On the other hand, data mining is key to sifting through massive quantities of data to find
relevant information, and it supports decision-making. Data mining is applicable in many ways in
educational data analysis [6]. The main task of learning analytics is gathering and analysing data from
learners to improve learning outcomes and enhance learning abilities [7]. This need can be met, and
improvements to course design and delivery can be recommended, by classifying students based on their
data. To scrutinize the factors impacting student performance and motivation, the main objective is to
identify significant criteria in an academic environment and to examine the associations between these
criteria using the concepts of learning analytics and educational data mining [8].
The study also compares different machine learning algorithms, including SVM, Decision Tree,
Random Forest, and Gradient Boosting, to estimate their predictive performance in forecasting student
outcomes. The research addresses technical gaps in predicting student performance by focusing on
alternative algorithms rather than artificial neural networks. The proposed model stands out with its
innovative clustering technique, comprehensive comparative analysis, and practical application in
forecasting student performance. The proposed models offer new insight into determining the most critical
learning activities and help educators keep track of timely student performance. To the best of our
knowledge, student performance in online courses has previously been estimated using only two targets,
“success” and “fail”. Our model predicts performance with three class labels: “success”, “fail”, and
“withdrew”.
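As a concrete illustration of this three-class target, the following minimal sketch maps raw course
outcomes onto the labels used in this study. The column name "final_result" and its raw values are
illustrative assumptions, not the actual dataset schema.

```python
# Minimal sketch of the three-class target set-up; the column name
# "final_result" and its raw values are illustrative assumptions.
import pandas as pd

records = pd.DataFrame({
    "final_result": ["Pass", "Distinction", "Fail", "Withdrawn", "Pass"],
})

# Collapse raw outcomes onto the three class markers used in this study.
label_map = {
    "Pass": "success",
    "Distinction": "success",
    "Fail": "fail",
    "Withdrawn": "withdrew",
}
records["target"] = records["final_result"].map(label_map)
print(records["target"].value_counts())
```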
In MOOCs, learner accomplishment is measured by its effect on their learning. According to
analysts, learning analytics and machine learning are practical tools for tracking student information.
They also believe that ML could help convey data about the academic process, giving researchers the
ability to both visualize and analyze the data collected from each learner group. Consequently,
comparable courses can be used to learn an accurate predictive model [11][12][13]. The final performance
of students in an online course is predicted by their quiz scores and marks from the first assessment,
combined with social factors [14].
Two predictive models were presented. Logistic regression was used in the first model to predict
whether students received an ordinary or distinction certificate. Logistic regression was also used in
the second predictive model to predict whether or not students achieved certification. The results showed
that the most influential factor in earning a distinction is the number of peer assessments. The average
test results were considered the most reliable indicator for receiving a certificate. The accuracy rates
of the distinction and ordinary models were 92.6% for the first model and 79.6% for the second [14].
Despite its advantages, logistic regression has some limitations: it assumes that the independent
variables have a linear relationship with the log-odds of the outcome.
The association between Virtual Learning Environment (VLE) data and grade performance has been
examined at the University of Maryland, Baltimore County (UMBC) [12]. LA was applied through the Check My
Activity (CMA) tool. The courses showed that students who engage with the course frequently are more
likely to earn a grade of C or higher than those who do not engage frequently [12]. Despite their various
advantages, VLE tools face challenges that limit their effectiveness. One major issue is the dependence
on user engagement, as students must consistently interact with the platform to benefit from its
features.
Figure 1 presents and describes the stages of the process we followed to reach our results. The
system aims to provide a robust methodology for the early prediction and classification of student
outcomes such as "Pass," "Fail," or "Withdrawn". The prediction system is built on a foundation of ML
models: SVM, Random Forest (RF), Decision Tree (DT), and Gradient Boosting (GBM). Each model is evaluated
using key metrics: Accuracy, Precision, Recall, F1-Score, and RMSE. The feature selection process employs
Recursive Feature Elimination (RFE) to prioritize the most influential variables, ensuring the models
remain both efficient and interpretable.
Figure 1: Demonstration of Proposed System
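To make the pipeline of Figure 1 concrete, the following is a minimal sketch of RFE-based feature
selection followed by the four classifiers compared in this study, assuming scikit-learn. Synthetic data
stands in for the real course records, and the hyperparameters and the choice of a Decision Tree as the
RFE ranking estimator are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of the proposed pipeline: RFE feature selection, then the four
# classifiers compared in this study. Synthetic data and hyperparameters
# are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for student features with three outcome classes
# (success / fail / withdrew).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Recursive Feature Elimination: rank features and keep the top 10.
selector = RFE(DecisionTreeClassifier(random_state=42),
               n_features_to_select=10)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

models = {
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train_sel, y_train)
    acc = accuracy_score(y_test, model.predict(X_test_sel))
    print(f"{name}: accuracy = {acc:.4f}")
```

RFE repeatedly refits the ranking estimator and drops the weakest features; the paper does not specify
the ranking estimator, so any model exposing feature importances could be substituted.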
Chi-Square: χ² = Σ_k (O_k − E_k)² / E_k ---------- (3)
where O_k and E_k are the observed and expected frequencies, respectively.
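For concreteness, a small worked example of equation (3); the observed and expected counts below are
illustrative placeholders, not values from the study's data.

```python
# Worked example of the chi-square statistic in equation (3); the counts
# are illustrative placeholders.
import numpy as np

observed = np.array([50.0, 30.0, 20.0])   # O_k: observed frequencies
expected = np.array([40.0, 35.0, 25.0])   # E_k: expected frequencies

chi_square = np.sum((observed - expected) ** 2 / expected)
print(f"chi-square = {chi_square:.3f}")  # 4.214
```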
The extracted features help the models to differentiate between students likely to perform well and those at
risk of dropping out.
Metric          Formula
Precision (P)   P = TP / (TP + FP)
Recall (R)      R = TP / (TP + FN)
Accuracy        (TP + TN) / (TP + TN + FP + FN)
F1-score        2PR / (P + R)
MSE             (1/n) Σ (y_i − ŷ_i)²
RMSE            √MSE
MAE             (1/n) Σ |y_i − ŷ_i|
where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives,
and y_i and ŷ_i are the actual and predicted values.
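A minimal sketch of how these metrics can be computed with scikit-learn; the toy labels below are
placeholders, and macro averaging is assumed for the multi-class precision, recall, and F1 values.

```python
# Sketch of the evaluation metrics from the table above; toy labels,
# macro averaging assumed for the multi-class metrics.
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, mean_absolute_error,
                             mean_squared_error, precision_score, recall_score)

y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1-score :", f1_score(y_true, y_pred, average="macro"))

# The error metrics below treat class indices as numeric values, which is a
# simplification when the labels are categorical.
mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("MAE :", mean_absolute_error(y_true, y_pred))
```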
The Decision Tree algorithm achieved the highest accuracy of 89.50%, with a macro-average F1-
score of 82% and a weighted-average F1-score of 91%. The Gradient Boosting algorithm also performed
well, achieving an accuracy of 86.64%. Its macro-average F1-score is 72%, while the weighted-average F1-
score is 87%. The Random Forest algorithm performed poorly, with an accuracy of only 4.39%. Both its
macro-average and weighted-average F1-scores were very low, indicating that it struggled to generalize on
this dataset. The SVM algorithm achieved an accuracy of 69.27%, with a macro-average F1-score of 41%
and a weighted-average F1-score of 69%. Its performance, while moderate, indicates that it struggles with
datasets containing imbalanced or non-linear relationships, as evidenced by its limited ability to capture all
performance classes accurately.
Figure 4: Confusion matrices of the Decision Tree, Random Forest, Gradient Boosting, and Support Vector Machine models
Figure 5: Heat maps of the Decision Tree, Random Forest, Gradient Boosting, and Support Vector Machine models
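As a minimal sketch of how a per-model confusion matrix and heat map like those in Figures 4 and 5 could
be produced, the snippet below uses toy test labels and predictions in place of the study's actual
outputs.

```python
# Sketch of a confusion-matrix heat map for one model; the labels and
# predictions are toy placeholders.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix

y_test = np.array([0, 1, 2, 2, 1, 0, 2, 1, 0, 2])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2, 0, 2])

labels = ["success", "fail", "withdrew"]
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=labels, yticklabels=labels)
plt.xlabel("Predicted class")
plt.ylabel("True class")
plt.title("Confusion matrix (illustrative)")
plt.tight_layout()
plt.show()
```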
The results validate the efficiency of feature selection in improving model performance,
and the Decision Tree algorithm emerged as the most reliable model for accurate predictions in
this application. Finally, the results are analyzed to uncover meaningful insights: understanding
how different features impact student performance, identifying potential areas of improvement in
the model, and visualizing the results for better interpretability.
V. CONCLUSION
In this research, we established a data-driven approach to analyzing and predicting learning performance
on a MOOC platform. Nowadays, Massive Open Online Courses (MOOCs) and online education have seen a surge
in popularity. In a data-driven environment, ML is a tool for predicting student performance in online
courses, with the goal of enhancing educational outcomes. By leveraging various ML models, we identify
key factors that influence student success and develop predictive models that can forecast academic
performance. The results of the research demonstrated the effectiveness of these models in providing
accurate predictions, with Decision Tree and Gradient Boosting outperforming the other algorithms in
terms of accuracy, making them the preferred choices for predicting student success. A critical finding
of this research is the importance of prior academic performance in predicting future success. The
predictive models confirmed that factors such as previous grades, engagement with course materials, and
demographic features play significant roles in determining whether a student will pass, fail, or withdraw
from a course.
The analysis showed that incorporating both static and dynamic features significantly improved the
accuracy of predictions, especially when using ensemble methods such as Gradient Boosting. This paper
also highlighted the challenges posed by class imbalance and
high-dimensional data, issues that were addressed through techniques like feature selection and data
high-dimensional data, issues that were addressed through techniques like feature selection and data
balancing. This paper emphasizes the increasing relevance of Learning Analytics (LA) in education, which
utilizes data to understand student behaviours, predict outcomes, and implement timely interventions. The
ability to predict student performance early in the course provides an opportunity for educators to identify
at-risk students and offer targeted support, potentially improving retention rates and overall student success.
Furthermore, this paper highlights the need for effective model deployment in real-world educational
settings. By developing models that predict student performance, educational institutions can leverage these
tools to enhance course design, adapt content delivery, and allocate resources more efficiently. This
approach not only improves the learning experience for students but also ensures that educational
institutions can effectively support their students and optimize the learning process. In summary, this paper
demonstrates the potential of machine learning in educational data mining, offering a powerful tool for
predicting student performance and supporting timely interventions. By combining predictive analytics
with adaptive learning strategies, educational institutions can foster more inclusive, effective, and data-
driven learning environments. As technology continues to advance, these models will evolve, further
enhancing their ability to support student success and contribute to the ongoing transformation of
education.