Revised PROOFREAD Thesis Document
APPROVAL SHEET
The Thesis attached hereto, entitled "Machine Learning Model to Predict Student
Attrition of College of Computing Studies Freshmen Students in Western
Mindanao State University", prepared and submitted by Lester C. Dela Torre, John
Paul M. Madroñal, and Keano Rae N. Sevilla, in partial fulfilment of the requirements for
the degree of Bachelor of Science in Computer Science, is hereby recommended for
Oral Examination.
ABSTRACT
Student attrition is a complex issue that affects both students and the educational
institutions they attend. It can result in various negative consequences, such as declining
enrollment numbers and damage to the institution's reputation and resources. Therefore,
there is a pressing need to mitigate student attrition rates and identify factors that
contribute to this issue. This study aimed to develop a machine learning model that
could predict whether a student in their first year at Western Mindanao State University's
College of Computing Studies would continue their studies into their second year. The
researchers collected and processed academic and personal data from the first-year
students and employed various machine learning algorithms, including Random Forest,
Logistic Regression, Support Vector Machine, Naïve Bayes, K-Nearest Neighbours,
Artificial Neural Network, and Decision Tree. Several data pre-processing and
processing techniques were employed to optimize the model. The study revealed that
the SMOTE method, coupled with Bagging Ensemble, was the most accurate algorithm
for predicting student attrition. The model was integrated into a web application, which
provided an accessible tool for students and educators to assess the likelihood of
attrition. Although a low correlation among the considered variables was observed,
potential factors that contribute to student attrition were identified. Further studies could
deepen our understanding of these factors and lead to more effective measures for
mitigating student attrition rates.
Keywords: Student Attrition, Machine Learning Model, Algorithm, Predict, Data Pre-Processing, Data
Processing, Webapp
ACKNOWLEDGEMENT
To Hyrene N. Sevilla, Roderick R. Sevilla, Jane M. Madroñal, Jaime D. Madroñal,
Lommel G. Dela Torre, and Rachel C. Dela Torre for your unwavering and unconditional
love and support.
To Engr. Marvic A. Lines, our thesis adviser, for your assistance, advice, and
guidance throughout our study.
To Ma'am Lucy F. Sadiwa, Sir Gadmar M. Belamide and Engr. Marjorie A. Rojas
for the critique and suggestions for improvements to our study and program.
To Sir Salimar B. Tahil for helping and believing that we could, despite various
constraints, finish our study on time for the final defence.
To Sir Jaydee C. Ballaho for his help during our data gathering stage, and for
assisting us with his ideas.
To Mark Angelo S. Panaguiton and Justin Paul M. Dalay for their inputs and
suggestions in the development of our program.
To BSCS 4-A Class of 2023 for their unfailing companionship through even the
most difficult of days.
And above all, to the Almighty, whose power and mercy know no bounds.
Table of Contents
APPROVAL SHEET…………………………………………………………………….i
ABSTRACT……………………………………………………………………………...ii
ACKNOWLEDGEMENT……………………………………………………………….iii
Chapter 1………………………………………………………………………….…….1
INTRODUCTION………………………………………………………………………..1
1.1 Background of the Study……………………………………………………………1
1.2 Statement of the Problem…………………………………………………………..2
1.3 Objectives……………………………………………………………………………3
1.4 Purpose of Specific Objectives…………………………………………………….3
1.5 Significance of the Study…………………………………………………………...4
1.6 Scope and Limitation………………………………………………………………..4
Chapter 2………………………………………………………………………………..5
REVIEW OF RELATED LITERATURE.…………………….………………………..5
2.1 Related Studies…………………………………...………………………………...5
2.2 Synthesis………………………………...…………………………………………..8
2.3 Definition of Terms………………………………………………………………….10
2.4 Conceptual Framework……………………...……………………………………13
Chapter 3………………………………………………………………………………14
METHODOLOGY……………………………………………………………………...14
3.1 Research Design……………………………………………………………………14
3.2 Respondents………………………………………………………………………...14
3.3 Data Gathering Instruments, Techniques, and Procedures…………………….14
3.4 Statistical and Analytical Tools…………………………………………………….15
3.5 Technical Tools……………………………………………………………………...18
3.6 Software Process Model…………………………………………………………...19
Chapter 4………………………………………………………………………………23
RESULTS AND DISCUSSION………………………………………………………23
4.1 Exploratory Data Analysis……………………………………………..…………23
4.2 Identify the Features to be used in Machine Learning Model………………..27
4.3 Model Development…………………………………………………………………30
4.4 Stratify Target Variable During Train Test Split…………………………………..38
4.5 Retraining of Models………………………………………………………………..40
4.6 Comparative Analysis of Algorithms………………………………………………41
4.7 Stacking and Bootstrap Aggregating Ensemble Techniques…………………...41
4.8 Identifying the Possible Factors That Cause Student Attrition…………………42
4.9 Deployment of Machine Learning Model in Web Application…………………..43
Chapter 5………………………………………………………………………………45
CONCLUSION AND RECOMMENDATIONS……………………………………...45
5.1 Conclusion……………………………………………………………………...….45
5.2 Recommendations………………………………………………………...………46
Appendices……………………………………………………………………………47
APPENDIX A…………………………………………………………………………...47
APPENDIX B…………………………………………………………………………...48
APPENDIX C…………………….………………………………………..…………...49
APPENDIX D…………………………………………………………………………...53
APPENDIX E…………………………………………………………………………...54
APPENDIX F……………………………………………………………………………55
APPENDIX G…………………………………………………………………………...57
Bibliography…………………………………………………………………………..60
LIST OF FIGURES
This project aims to predict the attrition of WMSU’s College of Computing Studies
(CCS) freshmen students using machine learning models, thereby helping universities
develop effective strategies, tactics, and administration for decision-making processes.
By identifying students who are more likely to drop out or shift to another course, this
approach can aid in minimizing student dropouts and promoting academic success.
Objectives
▪ General Objective
This study aims to create a machine learning model that predicts
student attrition among freshmen students at Western Mindanao
State University.
▪ Specific Objectives
⮚ To identify and create features to use for creating a machine learning
model through testing various methods and techniques and choosing
based on results.
⮚ To create a machine learning model that predicts student attrition by
using the following algorithms for processing data:
● Random Forest
● Logistic Regression
● Support Vector Machine
● Naïve Bayes
● K-Nearest Neighbours
● Artificial Neural Network
● Decision Tree
⮚ To apply stacking and bootstrap aggregating ensemble techniques to
further process and refine the dataset.
⮚ To conduct a comparative analysis on each algorithm to determine which
one has the highest accuracy.
⮚ To identify the factors/predictors that cause student dropouts or shifting
to another course by examining the results given by the algorithm.
⮚ To identify which machine learning model gives the highest accuracy
result.
⮚ To deploy the machine learning model in a simple web-application using
HTML, CSS and Flask.
Significance of the Study
Over the past few years, many researchers have been using machine learning
and data mining techniques to study student retention, as these produce better results
than traditional statistical methods [13]. The results of this study will help the university
identify students who are at risk of dropping out or likely to shift to another course, as well
as the factors involved, so that the university can make the necessary adjustments to improve
student retention.
Beneficiaries:
1.) Freshmen students – CCS first-year students will benefit from this
study through early prediction of whether they are likely to drop out, shift, or
be retained, helping minimize attrition.
2.) College of Computing Studies faculty – By learning which
factors/predictors cause student attrition, CCS faculty can adjust the
department’s plans, methods, and decision-making strategies.
3.) Western Mindanao State University – This study will help the university
reduce its attrition rate.
that contribute to student attrition, thereby aiding faculty members in addressing them
more effectively.
Chapter II
Student attrition rate is defined as the number of individuals who leave a program
of study before completing it [7]. This may be caused by several issues and factors such
as time pressure, lack of support, and lifestyle [1]. These factors contribute to the
students’ decision to drop out from subjects or, in more severe cases, drop from the
course entirely. As student attrition or non-continuation rates are a barometer for the
performance of higher education institutions, it is vital to know the severity of these
cases to prevent, mitigate, and lower said cases.
For some students, there is a psychological factor in play as to why they decide
to leave their course prematurely. Loss of interest in the study program may lead to
increased tendencies for non-continuation. Aside from this, financial status also has
something to do with students stopping early. Students may need to work early to earn,
therefore dropping out to look for a job. Another example of a reason why student
attrition rates are constant is academic performance. If a student feels like they are not
fit to continue in their course, they may opt to stop and enroll in another one.
Loss of interest in studying causes many different issues to arise within a student
[16]. This in turn is a major contributor to student attrition. If a student feels that they are
not interested anymore in the lessons that their course is offering, then their attention will
be focused on things other than their course. Students may find the course offerings
becoming increasingly difficult or, at the other end of the spectrum, easier, leading to
diminishing interest in their studies.
A trait common with the different articles on student attrition that the researchers
noticed was that most of the factors that were included as causes for attrition in the
study were included in other studies, such as loss of interest [16], external factors [1],
among others. These were leading causes among students in different studies. In the
case of the College of Computing Studies at Western Mindanao State University, these
were also determined to be part of the larger issue of freshman attrition.
Students are not the only ones affected by student attrition. The
educational institutions they come from are also affected by the rate of non-
continuation [21]. It gives a negative impression of the institution and makes it appear to
be performing inadequately because of the number of dropouts. Student dropouts pose a
major concern for educational communities and institutions [11]. Institutions strive to
replenish the resources they have spent on the students that do not continue while the
students themselves lose valuable time and energy on building up knowledge and
experience from the course, only to not pursue it until the end.
This makes it easier to work with a dataset that could have over a
hundred different possible entries. Although a dataset becomes more difficult to handle
as it grows in size [16], it will be small enough not to require extensive
programming expertise. These models have been invaluable to researchers across the
globe due to the speed, accuracy, and precision of their calculations and data analysis.
Apart from the speed of calculations, machine learning models are also
consistent even when supplied with big datasets, which ordinary manual
calculations struggle with because of human error and oversight. One big difference
between human error and machine learning model error is predictability [13]. Preventing
mistakes with a machine learning model is easier than with manual calculation
methods, as these are systematic and can be programmed, unlike human methods.
Machine learning models have been used to great effect in the study of student
performance [1,2,8]. One test on the variables that influence reading
proficiency in English among Filipino learners achieved 81.2% accuracy using a
dataset from the OECD PISA 2018 database comprising 7,233 students [5]. Another test
done with machine learning, this time on the academic affect of Filipino students during
online examinations, yielded an accuracy of 92.66% [8]. That test was done with a
dataset comprising 75 students. This shows that machine learning models are a reliable
way of cleaning, plotting, and interpreting large datasets.
A case study [10] was conducted to predict the retention of Catholic
University of Maule students from the 1st year to the 3rd year level. Using the Knowledge
Discovery in Databases (KDD) process as their core method, the authors used data mining to
formulate four models (any level, 1st year, 2nd year, 3rd year) with the
following machine learning algorithms: Decision Trees, K-Nearest Neighbors, Logistic
Regression, Naïve Bayes, Random Forest, and Support Vector Machines. The
evaluation metrics they used were accuracy, TP rate, FP rate, precision, F-measure, RMSE,
k-statistics, and the Friedman value. The results show that all models exceed an accuracy of
80%. It was noted that it is necessary to balance the data to yield higher accuracy. The
Random Forest algorithm achieved the highest accuracy at each level: 88.43% (any
level), 93.65% (1st year), 95.76% (2nd year), and 96.92% (3rd year). They also evaluated
the performance of each machine learning algorithm used in their study with the
Friedman value test. The result of this statistical analysis shows that the Random Forest
algorithm ranked first among the algorithms, further indicating that it was the best-
performing algorithm in their study.
Another study [14] predicted the dropout of computer
science students at the University of the South Pacific. The Random Forest
algorithm was used to build the model. The dataset consists of 963 observations and 33 features.
Two models were created to train and test the data: Model 1 used 5-fold cross-
validation and Model 2 used 10-fold cross-validation. Accuracy, specificity, sensitivity,
and kappa were used as performance metrics, while a confusion matrix was used to
evaluate the models. The results show that the model with 5-fold cross-
validation achieved the higher accuracy of the two models, with an
accuracy of 0.82. It was also found that the strongest predictor of dropout was
academic performance in the 1st-year programming course.
The difference between this study and others was its relatively small dataset, due
in part to the scope of the study. A large dataset is much preferred, as it makes the results
of data processing and calculations more refined [3]. While the dataset size was enough
for processing that yielded plausible results in initial testing, it still paled in comparison to
other studies with more entries.
Synthesis
Title: Using ensemble decision tree model to predict student dropout in computing science [6]
Gap: The study used the grades of only one subject.
Improvement (Researchers' Study): The researchers worked with four subjects, with their corresponding laboratory and lecture grades split, for more refined results.

Title: A Review on Predicting Student's Performance using Data Mining Techniques [7]
Gap: Did not check whether the dataset is balanced.
Improvement (Researchers' Study): During data pre-processing, the researchers checked the ratio of the target classes.
Definition of Terms
Term Definition
problems. [21]
groups that a given data sample is to be
split into. As such, the procedure is often
called k-fold cross-validation. [17]
Conceptual Framework
The chart above shows the concept of the system's workflow. In the input stage,
the raw data were collected and then analyzed through exploratory data analysis to look for
patterns and relationships between the different variables. In the process stage, the
collected data are subjected to various pre-processing techniques such as data cleaning,
normalization, and feature selection. This stage is essential to guarantee that the data are of
good quality and pertinent to the research issue. After pre-processing, the data are split into
training and testing sets and then submitted to a variety of
machine learning methods, including Random Forest, Logistic Regression, Support
Vector Machine, Naive Bayes, K-Nearest Neighbors, Artificial Neural Networks, and
Decision Tree. In the output stage, the results of the analysis are presented in the form
of predictive data. Evaluation metrics such as precision, recall, accuracy, and F1-score
are used to assess the efficacy and correctness of the machine learning algorithms.
The best model was then integrated into a web application that
serves as the data collection method for further inputs.
Research Design
This study employed an applied research methodology to investigate the
effectiveness of a preventive intervention in addressing the issue of freshman attrition in
the College of Computing Studies. The applied research approach is highly beneficial
because it facilitates the practical application of scientific principles to solve problems
that impact individuals, groups, and society. By identifying a problem, formulating
research hypotheses, and conducting experiments to test these hypotheses, the
researcher can leverage empirical methodologies to solve real-world problems. The
practical nature of this research design enables it to yield results that inform decision-
making, provide practical solutions, and improve outcomes. The research design is not
only relevant but also useful as it enables the researchers to work closely with
stakeholders to develop a functional and beneficial system to address the problem at
hand. [15]
Respondents
and whether or not they continued their course and enrolled in the second-year 1st
semester. The researchers sought and obtained appropriate permissions from the
relevant authorities to collect the data. Data were collected in both electronic and hard
copy formats and were subsequently stored in a Comma-Separated Values (CSV) file to
enable efficient management and analysis using Jupyter Notebook, which was used to
build the predictive model.
missing values on a specific row/column or on all columns. The Interquartile
Range (IQR) was the measure used to find the outliers in the
dataset. The IQR has the formula Q3 – Q1, where Q3 = 75th percentile
and Q1 = 25th percentile. Values outside of the IQR are considered
outliers.
● Evaluation Metrics
Evaluation metrics quantify the performance of the machine
learning model. The results of the chosen evaluation metrics were the basis
for choosing the best machine learning model for deployment. Accuracy
was the initial evaluation metric used in the study, as it is one of the most
common evaluation metrics based on research and related studies. But
since it was found that the dataset was imbalanced, the researchers looked
for another evaluation metric, since accuracy tends to be biased with an
imbalanced dataset. The F1 score with macro average was chosen as the
main metric for evaluating the models, and the researchers also used
precision, recall, and the AUC-ROC score to evaluate the result for each
class of the target variable, while a confusion matrix was used to visualize
the actual and predicted values (see the sketch after this section for how
these metrics can be computed). The following are the definitions of each
evaluation metric:
2. Precision - used to determine the number of true positives divided
by the number of predicted positives. The formula for precision is
shown below.

Precision Formula:
Precision = TP / (TP + FP)

F1 Score Formula:
F1 Score = 2 · (Precision · Recall) / (Precision + Recall)
6. AUC - ROC Score - used to compute the Area Under the Receiver
Operating Characteristic Curve (ROC AUC) from prediction
scores.
Figure 1.1 AUC - ROC Score
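To illustrate how these metrics can be obtained in practice, the following is a minimal sketch using scikit-learn's metrics module; the arrays y_test, y_pred, and y_scores are illustrative placeholders and are not taken from the study's actual data or code.

```python
# Minimal sketch of computing the evaluation metrics with scikit-learn.
# The arrays below are illustrative placeholders, not the study's data.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_test = [0, 1, 1, 1, 0, 1, 1, 1]            # actual labels (1 = enrolled in 2nd year)
y_pred = [0, 1, 1, 1, 1, 1, 0, 1]            # predicted labels
y_scores = [0.2, 0.9, 0.8, 0.7, 0.6, 0.95, 0.4, 0.85]  # predicted probabilities for class 1

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))            # TP / (TP + FP)
print("Recall   :", recall_score(y_test, y_pred))               # TP / (TP + FN)
print("F1 macro :", f1_score(y_test, y_pred, average="macro"))  # main metric in the study
print("ROC AUC  :", roc_auc_score(y_test, y_scores))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```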
Technical Tools
In this study, the researchers utilized a range of technical tools for building the
predictive model and web application. Jupyter Notebook, a popular code editor for data
science projects, was used for exploring, analyzing, cleaning, preprocessing, and
building the predictive model. Python, a widely used programming language for data
science, was the language of choice for this study due to its versatility and abundance of
useful libraries.
The following Python libraries were used for data analysis, visualization, mathematical
computations, and building the predictive model: Pandas, Numpy, Scikit-learn,
Matplotlib, Imbalanced-Learn, and Seaborn.
For developing the web application, HTML, CSS, and Bootstrap were employed to
create the structure and design of the web pages. Visual Studio Code, an integrated
development environment (IDE) with support for numerous Python extensions, including
Flask, was used for integrating the predictive model into the web application.
PythonAnywhere was utilized as the web hosting service for deploying the web
application online. By using these technical tools, the researchers were able to develop
a robust and effective predictive model and a user-friendly web application.
Software Process Model
CRISP – DM (Cross Industry Standard Process for Data Mining) process model
was used as the main framework in this study. There are six sequential steps in CRISP
– DM framework that were used to gather insights and formulate the predictive model.
Figure 1.2 illustrates the six steps of CRISP – DM.
A. Business Understanding
B. Data Understanding
C. Data Preparation
D. Modeling
The research problem addressed in the thesis project belongs to the
classification task of supervised learning. To achieve the project’s objectives,
the researchers identified several classification algorithms, including Random
Forest, Logistic Regression, Support Vector Machine, Naïve Bayes, K-
Nearest Neighbors, Artificial Neural Networks, and Decision Tree. These
algorithms were used to build the predictive model and were subsequently
evaluated in the next step of the project.
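As a rough sketch of this modeling step, the classifiers named above can be instantiated with scikit-learn as shown below; the parameter values are library defaults used for illustration only and do not reproduce the tuned settings listed in Table 1.

```python
# Illustrative instantiation of the classification algorithms used in the study.
# Parameters shown are scikit-learn defaults, not the study's tuned values.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM Linear": SVC(kernel="linear", probability=True),
    "SVM Poly": SVC(kernel="poly", probability=True),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Artificial Neural Network": MLPClassifier(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
# Each model is later fit on the training data and evaluated on the test data:
# model.fit(X_train, y_train); y_pred = model.predict(X_test)
```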
E. Evaluation
F. Deployment
In the last stage of the study, the predictive model developed by the
researchers was deployed in a simple web-based application. The application
was intended to showcase the functionality of the model and allow end-users
to evaluate the accuracy of the model's predictions. It is important to note that
the application was designed for demonstration purposes only and was not
developed with the intent of integrating it into a system, as this task requires
specialized skills beyond the expertise of the researchers.
Chapter IV
The researchers conducted Exploratory Data Analysis to analyze the data and to
learn key insights. The dataset consists of 454 rows and 30 columns. Figure 2 shows
the columns of the dataset and their respective data types. The researchers noticed that the
columns for major subject grades (CC100, CC101, CC102, CS111/IT112) have an object
data type. An investigation of the unique values of each grade column was conducted,
and it was found that there are two (2) string values in those grade columns, namely
INC and UW. With this information, the researchers came up with a plan, together
with the suggestion of the thesis adviser, to create four (4) different routes for building the
model. The routes are the following:
Figure 2. Dataset Columns and its Data Type
The researchers also noticed that the SHS/HS GPA column has an object data
type. Upon inspecting its unique values, the researchers found that it contains Transferee and
Shiftee values. There are 24 rows in total with those two values, and the researchers decided
to drop them from the dataset. Aside from male and female, the Gender variable has two
other categories, which are "prefer not to say" and "androgyne". Since there is only 1
instance of androgyne in the dataset, this category was dropped from the Gender
variable. Figure 3 shows the number of missing data for each column.
Figure 3. Number of Missing Data
The researchers plotted data visualizations for some variables to gather more insights
about the dataset. First, a graph was plotted showing the distribution of first-year CCS
students who did or did not enroll in the 2nd year, across the four school years of data available.
Figure 4 shows that school years 2018 and 2021 had first-year CCS students
who did not enroll for their second year in college, while school years 2019 and 2020
had all students enrolled in the second year. The graph also shows that the number of
students enrolled increases in succeeding years, but since the dataset only consists of
454 students across four school years, this observation might not be valid.
Figure 4. Frequency of enrolled students in each school year
Figure 5 shows the frequency of first-year CCS students by the Track that they
took during Senior High School. STEM and TVL-ICT have the greatest number of
students. This is not surprising, since some core subjects are already taught in those two
SHS Tracks. Figure 6 shows the frequency of the different grades for each major
subject in BSCS and BSIT from school years 2018 – 2021.
Figure 6. Frequency of grades of major subjects from school year 2018 - 2021
The next process was handling missing data. As shown in Figure 3 above,
the retained features have missing data. The researchers plotted the distribution of the
numeric variables to see whether they follow a normal distribution.
Figure 7 shows that only CET OAPR follows a bell-shaped symmetrical curve,
while the rest are skewed to the right. This graph helped decide which method
to use in handling missing data. Since the majority of the data are skewed, the appropriate
method is to fill the missing data with the median. The researchers computed the
median and used the fillna() function from pandas to fill the missing data with it. For
categorical variables, the researchers simply filled the missing data with the mode. Next
was handling the outliers. Figure 4. shows a box plot for the numeric variables; the
black circles in the plot are the outliers. After finding the outliers, the researchers
decided to keep them in the dataset: although considered outliers, these data points
are rare occurrences and still fall within the range of accepted values for each variable,
so they were retained. The next step was categorical encoding. The researchers
investigated the unique values of each categorical variable and determined that only the
SHS Strand, Gender, and Year Started in Program variables are nominal data, while the
others are ordinal data. For nominal data, the researchers used dummy encoding,
which, for a variable with k categories, encodes k-1 dummy variables. For ordinal data,
a mapping function was used to encode the data according to their rank/order.
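A compact sketch of these preparation steps in pandas is shown below. The column names ('CET OAPR', 'SHS Strand', 'Gender', 'CC100') come from the text, but the file name, the 1.5 × IQR fences, and the ordinal grade mapping are illustrative assumptions rather than the study's exact code.

```python
# Sketch of the preprocessing steps described above, assuming a pandas DataFrame.
import pandas as pd

df = pd.read_csv("freshmen_dataset.csv")   # hypothetical file name

# 1. Fill missing numeric values with the median (data are skewed),
#    and missing categorical values with the mode.
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].median())
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].fillna(df[col].mode()[0])

# 2. Flag outliers with the IQR (Q3 - Q1); here the common 1.5*IQR fences are used.
#    As in the study, flagged rows are kept in the dataset.
q1, q3 = df["CET OAPR"].quantile(0.25), df["CET OAPR"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["CET OAPR"] < q1 - 1.5 * iqr) | (df["CET OAPR"] > q3 + 1.5 * iqr)]

# 3. Dummy-encode nominal variables (k categories -> k-1 dummy columns) ...
df = pd.get_dummies(df, columns=["SHS Strand", "Gender"], drop_first=True)
# ... and map ordinal variables to their rank order (illustrative mapping only).
grade_order = {"1.0": 1, "1.25": 2, "1.5": 3, "1.75": 4, "2.0": 5}
df["CC100"] = df["CC100"].map(grade_order)
```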
Figure 8. Pearson’s Correlation Heatmap
Model Development
The researchers started by creating two variables: one for the features and the
other for the target variable. Next, the data were split into training and test sets using an
80:20 ratio. Before going forward, the researchers examined the class distribution of the
target variable and found that the dataset is highly imbalanced, as shown in Figure 11.
So, before moving forward with building the models, the researchers applied
oversampling and undersampling methods to the training data to handle the imbalanced
dataset.
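The split and resampling steps can be sketched as follows with scikit-learn and imbalanced-learn; the synthetic dataset, class weights, and random_state values are illustrative stand-ins for the study's actual feature matrix X and target y.

```python
# Sketch: 80:20 train/test split, then balancing the training data only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Stand-in for the study's features and target (454 students, heavily imbalanced classes).
X, y = make_classification(n_samples=454, weights=[0.05, 0.95], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Oversample the minority class with SMOTE ...
X_train_sm, y_train_sm = SMOTE(random_state=0).fit_resample(X_train, y_train)
# ... or undersample the majority class with Random Undersampling.
X_train_rus, y_train_rus = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)
```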
Figure 11. Distribution of Class label
Figure 12. Distribution of Class label using SMOTE
Figure 13. Distribution of Class label after using Random Undersampling
The evaluation metrics used for evaluating the models were the f1 score with macro
average, precision, recall, accuracy, the ROC-AUC score, and the confusion matrix. These
metrics help the researchers distinguish the differences among the models. The f1
score was then used as the final evaluation metric to determine the best predictive model
for deployment, since it is one of the common evaluation metrics used with an
imbalanced dataset. Table 1 shows the parameters used in building the models. The
parameters chosen were based on related studies and examples using these algorithms
found online.
The researchers included two ensemble techniques in building the models:
stacking and bootstrap aggregating, or bagging for short. A stacking ensemble
combines the results of two or more models so that a better prediction can be
achieved, while a bagging ensemble trains models on random subsets of the dataset
and then aggregates their individual results to form the final prediction. In stacking, all
the algorithms used for modelling were combined, while in bagging the Decision Tree
algorithm was used as the base estimator.
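A minimal sketch of the two ensemble setups with scikit-learn is shown below; the base estimators, parameters, and estimator count are illustrative and do not reproduce the exact configurations used in the study.

```python
# Sketch of the stacking and bagging (bootstrap aggregating) ensembles.
from sklearn.ensemble import StackingClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stacking: combine the predictions of several base models with a final estimator.
stacking = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("svm", SVC(kernel="linear", probability=True)),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)

# Bagging: train Decision Trees on random bootstrap subsets and aggregate their votes.
# Note: the `estimator` argument assumes a recent scikit-learn version
# (older releases used `base_estimator`).
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(random_state=0),
                            n_estimators=50, random_state=0)

# Both are fit and evaluated like any other scikit-learn model:
# stacking.fit(X_train, y_train); bagging.fit(X_train, y_train)
```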
Table 2 shows the results of the best models of each route without using any
method to balance the dataset. The highest f1 score in this testing is 0.494, and three
models produced this same result: Random Forest, K-Nearest Neighbors, and Stacking
Ensemble. Table 3 shows the results of the best models of each route using the SMOTE
method. The SVM Linear algorithm produced the highest result, with an f1 score of
0.606.
Algorithm | Parameter/s

Table 2. Best models of each route without any balancing method.
Route | Algorithm | Precision (Class 0) | Precision (Class 1) | Recall (Class 0) | Recall (Class 1) | F1 (Class 0) | F1 (Class 1) | F1 Score (Macro) | Accuracy
1 | Random Forest | 0.000 | 0.964 | 0.000 | 1.000 | 0.000 | 0.982 | 0.491 | 0.964
1 | Logistic Regression | 0.000 | 0.964 | 0.000 | 1.000 | 0.000 | 0.982 | 0.491 | 0.964
1 | SVM Poly | 0.000 | 0.964 | 0.000 | 1.000 | 0.000 | 0.982 | 0.491 | 0.964
1 | SVM Linear | 0.000 | 0.964 | 0.000 | 1.000 | 0.000 | 0.982 | 0.491 | 0.964
1 | K-Nearest Neighbors | 0.000 | 0.964 | 0.000 | 1.000 | 0.000 | 0.982 | 0.491 | 0.964
1 | Artificial Neural Network | 0.000 | 0.964 | 0.000 | 1.000 | 0.000 | 0.982 | 0.491 | 0.964
1 | Stacking Ensemble | 0.000 | 0.964 | 0.000 | 1.000 | 0.000 | 0.982 | 0.491 | 0.964
2 | Random Forest | 0.000 | 0.974 | 0.000 | 1.000 | 0.000 | 0.987 | 0.494 | 0.974
2 | K-Nearest Neighbors | 0.000 | 0.974 | 0.000 | 1.000 | 0.000 | 0.987 | 0.494 | 0.974
2 | Stacking Ensemble | 0.000 | 0.974 | 0.000 | 1.000 | 0.000 | 0.987 | 0.494 | 0.974
3 | Random Forest | 0.000 | 0.953 | 0.000 | 1.000 | 0.000 | 0.976 | 0.488 | 0.953
3 | Logistic Regression | 0.000 | 0.953 | 0.000 | 1.000 | 0.000 | 0.976 | 0.488 | 0.953
3 | SVM Linear | 0.000 | 0.953 | 0.000 | 1.000 | 0.000 | 0.976 | 0.488 | 0.953
3 | SVM Poly | 0.000 | 0.953 | 0.000 | 1.000 | 0.000 | 0.976 | 0.488 | 0.953
3 | K-Nearest Neighbors | 0.000 | 0.953 | 0.000 | 1.000 | 0.000 | 0.976 | 0.488 | 0.953
3 | Decision Tree | 0.000 | 0.953 | 0.000 | 1.000 | 0.000 | 0.976 | 0.488 | 0.953
3 | Artificial Neural Network | 0.000 | 0.953 | 0.000 | 1.000 | 0.000 | 0.976 | 0.488 | 0.953
3 | Stacking Ensemble | 0.000 | 0.953 | 0.000 | 1.000 | 0.000 | 0.976 | 0.488 | 0.953
4 | Random Forest | 0.000 | 0.953 | 0.000 | 1.000 | 0.000 | 0.976 | 0.488 | 0.953
4 | Logistic Regression | 0.000 | 0.953 | 0.000 | 1.000 | 0.000 | 0.976 | 0.488 | 0.953
4 | SVM Linear | 0.000 | 0.953 | 0.000 | 1.000 | 0.000 | 0.976 | 0.488 | 0.953
4 | SVM Poly | 0.000 | 0.953 | 0.000 | 1.000 | 0.000 | 0.976 | 0.488 | 0.953
4 | K-Nearest Neighbors | 0.000 | 0.953 | 0.000 | 1.000 | 0.000 | 0.976 | 0.488 | 0.953
4 | Artificial Neural Network | 0.000 | 0.953 | 0.000 | 1.000 | 0.000 | 0.976 | 0.488 | 0.953
4 | Stacking Ensemble | 0.000 | 0.953 | 0.000 | 1.000 | 0.000 | 0.976 | 0.488 | 0.953
4 | Bagging Ensemble | 0.000 | 0.953 | 0.000 | 1.000 | 0.000 | 0.976 | 0.488 | 0.953
Table 3. Best models of each route using the SMOTE method.
Route | Algorithm | Precision (Class 0) | Precision (Class 1) | Recall (Class 0) | Recall (Class 1) | F1 (Class 0) | F1 (Class 1) | F1 Score (Macro) | Accuracy
1 | SVM Linear | 0.200 | 0.975 | 0.333 | 0.951 | 0.250 | 0.963 | 0.606 | 0.929
2 | Decision Tree | 0.130 | 0.984 | 0.750 | 0.753 | 0.222 | 0.853 | 0.538 | 0.753
3 | Gaussian Naïve Bayes | 0.095 | 0.969 | 0.500 | 0.765 | 0.160 | 0.855 | 0.508 | 0.753
4 | Gaussian Naïve Bayes | 0.100 | 0.969 | 0.500 | 0.778 | 0.167 | 0.853 | 0.515 | 0.765

Table 4. Best models of each route using the Random Undersampling method.
Route | Algorithm | Precision (Class 0) | Precision (Class 1) | Recall (Class 0) | Recall (Class 1) | F1 (Class 0) | F1 (Class 1) | F1 Score (Macro) | Accuracy
1 | Decision Tree | 0.095 | 0.984 | 0.667 | 0.765 | 0.167 | 0.861 | 0.514 | 0.762
1 | Stacking Ensemble | 0.095 | 0.984 | 0.667 | 0.765 | 0.167 | 0.861 | 0.514 | 0.762
2 | SVM Poly | 0.067 | 0.984 | 0.500 | 0.816 | 0.118 | 0.892 | 0.505 | 0.808
3 | K-Nearest Neighbors | 0.136 | 0.984 | 0.750 | 0.765 | 0.231 | 0.861 | 0.546 | 0.765
4 | K-Nearest Neighbors | 0.130 | 0.984 | 0.750 | 0.753 | 0.222 | 0.853 | 0.538 | 0.753
Before deployment, the researchers did a batch prediction using the SVM Linear
model to check the model's f1 score once more, using 80 random samples from the test
data. Table 5 compares the result of the batch prediction with the result obtained during
model building. Both tests yielded essentially the same f1 score (0.606 vs 0.605), with
only minuscule differences in some of the other evaluation metrics. With this result, the
researchers were convinced to use the SVM Linear model for deployment.
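The batch-prediction check can be sketched as follows; `model`, `X_test`, and `y_test` refer to the trained SVM Linear model and the held-out data from the earlier sketches (assumed here to be pandas objects), and only the sample size of 80 comes from the text.

```python
# Sketch: re-check the macro f1 score on a batch of 80 random samples from the test data.
from sklearn.metrics import f1_score

batch = X_test.sample(n=80, random_state=0)    # assumes X_test is a pandas DataFrame
batch_labels = y_test.loc[batch.index]         # matching true labels
batch_pred = model.predict(batch)

print("Batch f1 (macro):", f1_score(batch_labels, batch_pred, average="macro"))
```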
Table 5. Comparison of the model-building (single prediction) and batch prediction results.
 | Precision (Class 0) | Precision (Class 1) | Recall (Class 0) | Recall (Class 1) | F1 (Class 0) | F1 (Class 1) | F1 Score (Macro) | Accuracy
Single Prediction | 0.200 | 0.975 | 0.333 | 0.951 | 0.250 | 0.963 | 0.606 | 0.929
Batch Prediction | 0.200 | 0.973 | 0.333 | 0.948 | 0.250 | 0.961 | 0.605 | 0.925
Before conducting the alpha test, the missing CET results of some students in
the dataset were received from the CS Department Head, and it was suggested to stratify
the target variable during the splitting of the train and test data, because stratifying the
target variable preserves its class distribution in both the train and test data, which is
helpful for an imbalanced dataset.
The researchers initially tested using a 20% split, but then tested again using
10/20/30/40/50 splits of the target variable in the train and test data, with Random State
parameters of 0 and 42, which are common values for this parameter. Table 6 shows the
test results. Before using the stratify parameter, the best available model had an f1 score
of .606, but with the stratify parameter the best model reached an f1 score of .738, about
13 percentage points higher than the initial best model. Similarly, it was also tested
through batch prediction to verify the results. Table 7 compares the result of the batch
prediction with the result during the modelling phase; the batch prediction yields the
same result as the modelling phase. Thus, the researchers used this new model for the
deployment of the predictive model in a simple web application.
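A sketch of this stratified-split experiment is given below: it loops over the test-set proportions and random states mentioned in the text and records the macro f1 score of an illustrative model for each combination. X and y are as defined in the earlier split sketch, and the SVC choice here is illustrative rather than the study's per-route model selection.

```python
# Sketch: compare several stratified train/test splits and random states.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import f1_score

results = {}
for test_size in (0.10, 0.20, 0.30, 0.40, 0.50):
    for random_state in (0, 42):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=random_state,
            stratify=y)                                 # keep the class ratio in both splits
        model = SVC(kernel="linear").fit(X_tr, y_tr)    # illustrative model choice
        score = f1_score(y_te, model.predict(X_te), average="macro")
        results[(test_size, random_state)] = round(score, 3)

best = max(results, key=results.get)
print("Best (test_size, random_state):", best, "f1 macro:", results[best])
```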
Route | Split | Random State | Algorithm | F1 Score (Macro)
Table 6. Results of best model for each route with Stratify parameter.
Table 7. Comparison of single and batch prediction results with the Stratify parameter.
 | Precision (Class 0) | Precision (Class 1) | Recall (Class 0) | Recall (Class 1) | F1 (Class 0) | F1 (Class 1) | F1 Score (Macro) | Accuracy
Single Prediction | 0.333 | 1.000 | 1.000 | 0.952 | 0.500 | 0.976 | 0.738 | 0.953
Batch Prediction | 0.333 | 1.000 | 1.000 | 0.952 | 0.500 | 0.976 | 0.738 | 0.953
Retraining of Models
The researchers were advised to retrain the models and remove the admission
year data, so that data outside 2018 – 2021 can be used as inputs to the model in the
future. Table 8 shows that the Bagging Ensemble with the SMOTE method in route 1
yielded the best result. Comparing this with the best model before retraining, shown in
Table 7, both best models came from route 1, although the Split and Random State
parameters and the algorithm differ.
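The retraining step can be sketched as dropping the admission-year column before rebuilding the best model; the column name 'Year Started in Program' is an assumption based on the dataset description, X is assumed to be a pandas DataFrame, and the bagging/SMOTE setup mirrors the earlier sketches rather than the study's exact parameters.

```python
# Sketch: retrain without the admission-year feature so future inputs
# are not tied to the 2018-2021 school years.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X_no_year = X.drop(columns=["Year Started in Program"])   # assumed column name
X_tr, X_te, y_tr, y_te = train_test_split(X_no_year, y, test_size=0.20,
                                          random_state=0, stratify=y)
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
retrained = BaggingClassifier(estimator=DecisionTreeClassifier(random_state=0),
                              random_state=0).fit(X_bal, y_bal)
```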
Route | Split | Random State | Algorithm | F1 Score (Macro)
3 | .50 | 0 | Logistic Regression/SVM Linear | .603
4 | .50 | 0 | Logistic Regression/SVM Linear | .603
Table 8. Results of best model for each route with Stratify parameter after Retraining.
Comparative Analysis of Algorithms
Based on the results from Tables 2, 3, 4, and 6, only the SVM Linear, Decision Tree,
and Logistic Regression algorithms produced f1 scores of at least 60%. Of the three
models, the Decision Tree yielded the most consistent predictions. One observation was
that during testing with the imbalanced data, only the Gaussian Naïve Bayes algorithm
produced non-zero precision, recall, and f1-score results for the minority class. Although
it produced low results, with around a 40% - 50% f1 score, it can be concluded that
among the classification algorithms, Gaussian Naïve Bayes was the only one not totally
biased towards the majority class in an imbalanced dataset. When balancing techniques
were applied to the dataset, the Logistic Regression and Decision Tree algorithms
produced the best results in each route.
Bagging Ensemble
Stacking Ensemble
Deployment of Machine Learning Model in Web Application
The researchers used the Visual Studio Code IDE to create the web application. For
building the frontend, HTML, CSS, and the Bootstrap framework were used. For backend
integration, the researchers used Python and the Flask framework. The web application
has two pages: a landing page and a prediction page (see Figures 16 and 17).
After completing both the frontend and backend of the web application, the
algorithms were integrated into the system. The system was then hosted on the
PythonAnywhere web hosting service for online use. The web application accepts data
inputs from users and feeds the data into the model, which in turn processes the data
and outputs the result for the user to view.
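As a rough illustration of how the model could be served by the Flask backend described above, the sketch below loads a pickled model and returns a prediction from submitted form data. The file names, route names, template names, and output labels are hypothetical and are not taken from the project's actual code.

```python
# Minimal Flask sketch for serving the trained model (hypothetical names throughout).
from flask import Flask, render_template, request
import pickle

app = Flask(__name__)
with open("model.pkl", "rb") as f:            # hypothetical pickled predictive model
    model = pickle.load(f)

@app.route("/")
def landing():
    return render_template("index.html")      # landing page

@app.route("/predict", methods=["POST"])
def predict():
    # Convert the submitted form fields into the numeric feature vector the model expects.
    features = [float(value) for value in request.form.values()]
    result = model.predict([features])[0]
    label = "Likely to enroll in 2nd year" if result == 1 else "At risk of attrition"
    return render_template("predict.html", prediction=label)

if __name__ == "__main__":
    app.run(debug=True)
```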
Figure 17. Prediction Page
CHAPTER V
Conclusion
With an imbalanced dataset, classification algorithms have a tendency to disregard the minority class, leading to
biased results. In this study, the researchers used resampling techniques such as the
SMOTE method for oversampling and the Random Undersampling method for
undersampling to balance the dataset. Tables 3, 4, 5, 7, and 9 show that algorithms with
applied resampling methods yielded higher f1 scores compared to algorithms without
any applied methods in handling imbalanced data. Furthermore, it is essential to use
other evaluation metrics such as f1 score with macro average in determining the best
model for imbalanced datasets.
Based on the results of the correlation analysis between the dependent and independent
variables, only a weak correlation was observed. Additionally, the evaluation of
the minority class yielded low scores for precision, recall, and f1 score even after
balancing the data. These observations suggest that a small dataset can produce poor
model performance.
The Decision Tree algorithm with SMOTE method produced the best f1 score
before retraining the models, while the Bagging Ensemble algorithm with SMOTE
method produced the best f1 score after retraining. Therefore, the researchers
recommend using resampling techniques and f1 score with macro average as evaluation
metrics in handling imbalanced datasets.
Recommendations
One suggestion for future research is to consider incorporating a more extensive
dataset. Although the current study focuses on the academic years from 2018-2021, the
dataset is relatively small and does not encompass all first-year CCS students during
that period. Obtaining a larger dataset may potentially reveal more significant findings
and enhance the predictive accuracy of the model. A second recommendation for future
research is to implement the predictive model in conjunction with an administrative
system. By doing so, the system could be utilized as a decision-making tool for CCS
faculty when addressing student attrition concerns.
APPENDICES
Appendix A
Gantt Chart
Appendix B
Flowchart
Appendix C
Code Tests
Appendix D
Relevant Source Code
Appendix E
Screenshot/Picture of the System
Appendix F
User Manual
1) Click the "Go!" button to redirect from the home page to the data entry page.
3) Press the submit button and the system will display the result.
Appendix G
Curriculum Vitae
Bibliography
[1] Abimbola Iyanda et al. 2018. Predicting Student Academic Performance in Computer
Science Courses: A Comparison of Neural Network Models.
DOI: http://dx.doi.org/10.5815/ijmecs.2018.06.01
[2] Agathe Merceron, Mahmood K. Pathan, and Raheela Asif. 2014. Predicting Student
Academic Performance at Degree Level: A Case Study. Retrieved from
https://www.researchgate.net/publication/287718318_Predicting_Student_Academic_Pe
rformance_at_Degree_Level_A_Case_Study
[3] Ahmad Alwosheel, Caspar G. Chorus, and Sander van Cranenburgh. 2018. Is your
dataset big enough? Sample size requirements when using artificial neural networks for
discrete choice analysis. Journal of choice modelling, 28, 167-182.
[4] Ajitesh Kumar. 2022. Machine Learning – Sensitivity vs Specificity Difference.
Retrieved from https://vitalflux.com/ml-metrics-sensitivity-vs-specificity-difference/
[5] Allan B.I. Bernardo et al. 2021. Using Machine Learning Approaches to Explore Non-
Cognitive Variables Influencing Reading Proficiency in English among Filipino Learners.
Retrieved from https://www.mdpi.com/2227-7102/11/10/628
[6] Aman Goel Lal et al. 2020. Using ensemble decision tree model to predict student
dropout in computing science. Retrieved from https://core.ac.uk/reader/404196578
[7] Amira Mohamed Shahiri, Nur’aini Abdul Rashid, and Wahidah Husain. 2015. A
Review on Predicting Student’s Performance using Data Mining Techniques. Retrieved
from https://www.sciencedirect.com/science/article/pii/S1877050915036182
[8] Antero R. Arias Jr., Mideth Abisado, and Ramon Rodriguez. 2019. Modeling Filipino
Academic Affect during Online Examination using Machine Learning. Retrieved from
https://www.researchgate.net/publication/336078841_Modeling_Filipino_
Academic_Affect_during_Online_Examination_using_Machine_Learning
[9] Bruce Ratner. (n.d.). The Correlation Coefficient: Definition. Retrieved from
http://www.dmstat1.com/res/TheCorrelationCoefficientDefined.html
[10] Carlos A. Palacios et al. 2021. Knowledge Discovery for Higher Education Student
Retention Based on Data Mining: Machine Learning Algorithms and Case Study in Chile.
Retrieved from https://www.mdpi.com/1099-4300/23/4/485/htm#sec2-entropy-23-00485
[11] Charles Gbollie, and Harriett P. Keamu. 2017. Student Academic Performance: The
Role of Motivation, Strategies, and Perceived Factors Hindering Liberian Junior and
Senior High School Students Learning. Education Research International, 2017, 1–11.
DOI: https://doi.org/10.1155/2017/1789084
[12] Chris Parsons. 2021. What Is a Machine Learning Model?. Retrieved from
https://blogs.nvidia.com/blog/2021/08/16/what-is-a-machine-learning-model/
[13] Dech Thammasiri et al. 2014. A critical assessment of imbalanced class distribution
problem: the case of predicting freshmen student attrition. Retrieved from
https://core.ac.uk/reader/32327521
[14] Dina Machuve, Khamisi Kalegele, and Neema Mduma. 2019. A Survey of Machine
Learning Approaches and Techniques for Student Dropout Prediction A Survey of
Machine Learning Approaches. Retrieved from
https://datascience.codata.org/articles/10.5334/dsj-2019-014
[15] Formplus Blog. (n.d.). What is Applied Research? + [Types, Examples & Method].
Retrieved from https://www.formpl.us/blog/applied-research
[16] IBM Cloud Education. 2020. What is Supervised Learning?. Retrieved from
https://www.ibm.com/cloud/learn/supervised-learning
[17] Jason Brownlee. 2020. A Gentle Introduction to k-fold Cross-Validation. Retrieved
from https://machinelearningmastery.com/k-fold-cross-validation/
[18] Johannes Berens et al. 2019. Early Detection of Students at Risk – Predicting
Student Dropouts Using Administrative Student Data and Machine Learning Methods.
DOI: https://doi.org/10.5281/zenodo.3594771
[19] Kenneth Jensen. 2012. A diagram showing the relationship between the different
phases of CRISP-DM and illustrates the recursive nature of a data mining project.
https://commons.wikimedia.org/wiki/File:CRISP-DM_Process_Diagram.png
[20] Manila Bulletin. 2012. College Education For Poor Students. Retrieved from
https://ph.news.yahoo.com/college-education-poor-students-090706071.html
[21] Onesmus Mbaabu. 2020. Introduction to Random Forest in Machine Learning.
Retrieved from https://www.section.io/engineering-education/introduction-to-random-
forest-in-machine-learning/
[22] Psychology Wiki (n.d.). Student attrition. Retrieved from
https://psychology.fandom.com/wiki/Student_attrition
[23] SAS® Insights. (n.d.). Machine Learning: What it is and why it matters. Retrieved
from https://www.sas.com/en_ph/insights/analytics/machine-learning.html
[24] Science Direct. 2020. Machine Learning Algorithm. Retrieved from
https://www.sciencedirect.com/topics/engineering/machine-learning-algorithm
[25] Vladan Devedzic. 2001. KNOWLEDGE DISCOVERY AND DATA MINING IN
DATABASES, Handbook of Software Engineering and Knowledge Engineering pp.615 -
637. Retrieved from https://www.worldscientific.com/doi/10.1142/9789812389718_0025
[26] ZACH. 2020. Friedman Test: Definition, Formula, and Example. Retrieved from
https://www.statology.org/friedman-test/