
Republic of the Philippines

Western Mindanao State University


College of Computing Studies
DEPARTMENT OF COMPUTER SCIENCE
Zamboanga City

Machine Learning Model to Predict Student Attrition of
College of Computing Studies Freshmen Students in
Western Mindanao State University

A Thesis presented to the faculty of


Department of Computer Science
College of Computing Studies

In partial fulfilment of the requirements for the degree of


Bachelor of Science in Computer Science

Lester C. Dela Torre


John Paul M. Madroñal
Keano Rae N. Sevilla
Researchers

Marvic A. Lines, MEnggEd-ICT


Adviser

March 31, 2023


Republic of the Philippines
Western Mindanao State University
College of Computing Studies
DEPARTMENT OF COMPUTER SCIENCE
Zamboanga City

APPROVAL SHEET
The Thesis attached hereto, entitled "Machine Learning Model to Predict Student
Attrition of College of Computing Studies Freshmen Students in Western
Mindanao State University", prepared and submitted by Lester C. Dela Torre, John
Paul M. Madroñal, and Keano Rae N. Sevilla, in partial fulfilment of the requirements for
the degree of Bachelor of Science in Computer Science, is hereby recommended for
Oral Examination.

ENGR. MARVIC A. LINES, MEnggEd-ICT


Adviser
______________________________________________________________________
APPROVED by the Oral Examination Committee on _Jan 10, 2023_ with a rating of
PASSED.

ENGR. GADMAR M. BELAMIDE, MEnggEd-ICT


Chairperson
ENGR. MARJORIE A. ROJAS MR. JAYDEE C. BALLAHO
Member Member
______________________________________________________________________
ACCEPTED in partial fulfilment of the requirements for the degree of Bachelor of
Science in Computer Science

MS. LUCY F. SADIWA, MSCS


Head, Department of Computer Science

RODERICK P. GO, Ph.D.


Dean, College of Computing Studies
ABSTRACT

Student attrition is a complex issue that affects both students and the educational
institutions they attend. It can result in various negative consequences, such as declining
enrollment numbers and damage to the institution's reputation and resources. Therefore,
there is a pressing need to mitigate student attrition rates and identify factors that
contribute to this issue. This study aimed to develop a machine learning model that
could predict whether a student in their first year at Western Mindanao State University's
College of Computing Studies would continue their studies into their second year. The
researchers collected and processed academic and personal data from the first-year
students and employed various machine learning algorithms, including Random Forest,
Logistic Regression, Support Vector Machine, Naïve Bayes, K-Nearest Neighbours,
Artificial Neural Network, and Decision Tree. Several data pre-processing and
processing techniques were employed to optimize the model. The study revealed that
the SMOTE method, coupled with a Bagging Ensemble, was the most accurate approach
for predicting student attrition. The model was integrated into a web application, which
provides an accessible tool for students and educators to assess the likelihood of
attrition. Although a low correlation among the considered variables was observed,
potential factors that contribute to student attrition were identified. Further studies could
deepen our understanding of these factors and lead to more effective measures for
mitigating student attrition rates.

Keywords: Student Attrition, Machine Learning Model, Algorithm, Predict, Data Pre-Processing, Data
Processing, Webapp
ACKNOWLEDGEMENT
To Hyrene N. Sevilla, Roderick R. Sevilla, Jane M. Madroñal, Jaime D. Madroñal,
Lommel G. Dela Torre, and Rachel C. Dela Torre for your unwavering and unconditional
love and support.

To Engr. Marvic A. Lines, our thesis adviser, for your assistance, advice, and
guidance throughout our study.

To Ma'am Lucy F. Sadiwa, Sir Gadmar M. Belamide and Engr. Marjorie A. Rojas
for the critique and suggestions for improvements to our study and program.

To Sir Salimar B. Tahil for helping and believing that we could, despite various
constraints, finish our study on time for the final defence.

To Sir Jaydee C. Ballaho for his help during our data gathering stage, and for
assisting us with his ideas.

To Mark Angelo S. Panaguiton and Justin Paul M. Dalay for their inputs and
suggestions in the development of our program.

To BSCS 4-A Class of 2023 for their unfailing companionship through even the
most difficult of days.

And above all, to the Almighty for His power and mercy knows no bounds.
Table of Contents
APPROVAL SHEET…………………………………………………………………….i
ABSTRACT……………………………………………………………………………...ii
ACKNOWLEDGEMENT……………………………………………………………….iii
Chapter 1………………………………………………………………………………...1
INTRODUCTION………………………………………………………………………..1
1.1 Background of the Study…………………………………………………………..1
1.2 Statement of the Problem………………………………………………………….2
1.3 Objectives……………………………………………………………………………3
1.4 Purpose of Specific Objectives……………………………………………………3
1.5 Significance of the Study…………………………………………………………..4
1.6 Scope and Limitation……………………………………………………………….4

Chapter 2………………………………………………………………………………...5
REVIEW OF RELATED LITERATURE………………………………………………5
2.1 Related Studies……………………………………………………………………..5
2.2 Synthesis…………………………………………………………………………….8
2.3 Definition of Terms………………………………………………………………...10
2.4 Conceptual Framework…………………………………………………………...13

Chapter 3……………………………………………………………………………….14
METHODOLOGY……………………………………………………………………...14
3.1 Research Design…………………………………………………………………..14
3.2 Respondents……………………………………………………………………….14
3.3 Data Gathering Instruments, Techniques, and Procedures…………………..14
3.4 Statistical and Analytical Tools…………………………………………………...15
3.5 Technical Tools…………………………………………………………………….18
3.6 Software Process Model………………………………………………………….19

Chapter 4……………………………………………………………………………….23
RESULTS AND DISCUSSION……………………………………………………….23
4.1 Exploratory Data Analysis………………………………………………………..23
4.2 Identify the Features to be used in Machine Learning Model………………..27
4.3 Model Development……………………………………………………………….30
4.4 Stratify Target Variable During Train Test Split………………………………...38
4.5 Retraining of Models………………………………………………………………40
4.6 Comparative Analysis of Algorithms…………………………………………….41
4.7 Stacking and Bootstrap Aggregating Ensemble Techniques…………………41
4.8 Identifying the Possible Factors That Cause Student Attrition……………….42
4.9 Deployment of Machine Learning Model in Web Application…………………43

Chapter 5……………………………………………………………………………….45
CONCLUSION AND RECOMMENDATIONS………………………………………45
5.1 Conclusion………………………………………………………………………….45
5.2 Recommendations…………………………………………………………………46

Appendices……………………………………………………………………………..47
APPENDIX A……………………………………………………………………………47
APPENDIX B……………………………………………………………………………48
APPENDIX C……………………………………………………………………………49
APPENDIX D……………………………………………………………………………53
APPENDIX E……………………………………………………………………………54
APPENDIX F……………………………………………………………………………55
APPENDIX G……………………………………………………………………………57

Bibliography…………………………………………………………………………….60
LIST OF FIGURES

Figure 1 - Confusion Matrix……………………………………………………………19
Figure 1.1 - AUC-ROC Score…………………………………………………………19
Figure 1.2 - CRISP-DM Diagram……………………………………………………..22
Figure 2 - Dataset Columns and its Data Types……………………………………25
Figure 3 - Number of Missing Data…………………………………………………..26
Figure 4 - Frequency of Enrolled Students in each School Year…………………27
Figure 5 - Frequency of Students in each Senior High School Strand…………..27
Figure 6 - Frequency of Grades in Major Subjects from School Years 2018-2021…28
Figure 7 - Route 1 Histogram and Box Plot of Numeric Values…………………..29
Figure 8 - Pearson's Correlation Heatmap………………………………………….31
Figure 9 - Spearman's Correlation Heatmap……………………………………….31
Figure 10 - Pie Chart of Class Labels for each Route……………………………..32
Figure 11 - Distribution of Class Labels……………………………………………..33
Figure 12 - Distribution of Class Labels using SMOTE……………………………33
Figure 13 - Distribution of Class Labels after using Random Undersampling….34
Figure 14 - Batch Prediction Confusion Matrix…………………………………….39
Figure 15 - Batch Prediction Confusion Matrix (Stratified)………………………..41
Figure 16 - Landing Page……………………………………………………………..43
Figure 17 - Prediction Page…………………………………………………………..43

LIST OF TABLES

Table 1 - Algorithm Parameters for Modelling………………………………………35
Table 2 - Result of Predictive Model with Imbalanced Data………………………35
Table 3 - Result of Predictive Model using SMOTE Method……………………...36
Table 4 - Result of Predictive Model using Random Undersampling……………38
Table 5 - Result of Batch Prediction and Single Prediction……………………….39
Table 6 - Result of Best Model for each Route with Stratify Parameter…………40
Table 7 - Result of Batch Prediction and Single Prediction (Stratify)…………….41
Table 8 - Results of Best Model for each Route with Stratify Parameter after Retraining…42
Table 9 - Results of Best Models using Ensemble Technique……………………43
Chapter 1 – Introduction

Background of the Study

Freshman student attrition is a persistent issue that poses significant challenges


for higher education institutions. In recent years, this issue has emerged as a critical
concern for university administrators who strive to secure student success or completion
while simultaneously reducing freshman student dropout rates. The global concern
about attrition has been fueled by a variety of factors, including the expansion of higher
education, increased diversity in the demographics and background of students, and a
renewed focus on the quality, impact, and outcomes of higher education systems.

Notably, institutions with high attrition rates experience significant financial


losses, as they lose money in terms of tuition, fees, and possible alumni contributions.
Research indicates that approximately half of all student attrition cases occur in the first
year of college, or the freshman year. Consequently, it is essential to identify vulnerable
freshmen students who are more likely to drop out in their first year.

Institutions play a pivotal role in supporting student completion, as they can


adjust to help freshmen students stay in school or shift to other courses. Multiple factors,
including a student's family history, personal qualities, socioeconomic status, past
education, prior academic success, and interactions between the freshman student and
faculty, can affect student attrition rates. Additionally, students must monitor, regulate,
and control their cognition, motivation, and behavior as part of self-regulated learning.
External and internal factors motivate individuals to remain engaged and devoted to their
profession, function, subject, or goal.

Western Mindanao State University (WMSU) faces significant challenges


associated with student attrition, as evidenced by students retaking courses, shifting to
other courses, and dropping all subjects due to various socioeconomic factors.
Addressing this difficulty requires a comprehensive understanding of the underlying
causes, as well as appropriate intervention strategies. Machine learning has emerged as
a viable approach to tackle the issue of freshman student dropout, as it can efficiently
identify at-risk freshmen students and facilitate early intervention planning.

This project aims to predict the attrition of WMSU’s College of Computing Studies
(CCS) freshmen students using machine learning models, thereby helping universities
develop effective strategies, tactics, and administration for decision-making processes.
By identifying students who are more likely to drop out or shift to another course, this
approach can aid in minimizing student dropouts and promoting academic success.

Statement of the Problem

The issue of freshman student attrition in the College of Computing Studies at


Western Mindanao State University is a multifaceted problem that can have detrimental
effects on both students and the educational institution. The negative consequences of
student attrition include reputational harm, resource wastage, and social and
psychological repercussions for the students themselves. Despite the potential losses
incurred, traditional statistical methods for identifying students at risk of dropping out
have proven to be insufficient. As such, there is a pressing need to explore more
advanced techniques, such as data mining and machine learning, to better manage
student retention. This study aims to develop a machine learning model utilizing various
algorithms to predict which students are most at risk of dropping out or likely to shift to
another course in the College of Computing Studies at Western Mindanao State
University. By identifying these students early on, the educational institution can take
appropriate measures to address their needs and improve their chances of completing
their studies.

Objectives

▪ General Objective
This study aims to create a machine learning model that predicts
the attrition of freshmen students at Western Mindanao
State University.
▪ Specific Objectives
⮚ To identify and create features for building a machine learning
model by testing various methods and techniques and choosing
based on the results.
⮚ To create a machine learning model that predicts student attrition by
using the following algorithms for processing data:
● Random Forest
● Logistic Regression
● Support Vector Machine
● Naïve Bayes
● K-Nearest Neighbours
● Artificial Neural Network
● Decision Tree
⮚ To apply stacking and bootstrap aggregating ensemble techniques to
further refine the predictive models.
⮚ To conduct a comparative analysis on each algorithm to determine which
one has the highest accuracy.
⮚ To identify the factors/predictors that cause student dropouts or shifting
to another course by examining the results given by the algorithms.
⮚ To identify which machine learning model gives the highest accuracy
result.
⮚ To deploy the machine learning model in a simple web-application using
HTML, CSS and Flask.

Significance of the Study

Over the past few years, many researchers have used machine learning
and data mining techniques to study student retention, as these produce better results
than traditional statistical methods [13]. The results of this study will help the university
identify students who are at risk of dropping out or likely to shift to another course, and the
factors involved, so that the university can make the necessary adjustments to improve
student retention.

Beneficiaries:
1.) Freshmen students – CCS first-year students will benefit from this
study through early prediction of whether they are likely to drop out, shift,
or be retained, so that attrition can be minimized at an early stage.
2.) College of Computing Studies faculty – By learning which
factors/predictors cause student attrition, CCS faculty can adjust the
department’s plans, methods, and decision-making strategies.
3.) Western Mindanao State University – This study will help the
university reduce its attrition rate.

Scope and Limitation

The present study is centered on the analysis of Western Mindanao State


University's (WMSU) College of Computing Studies (CCS) freshmen students, and it
aims to assist CCS faculty members in identifying factors that contribute to student
attrition. The data utilized in this study was acquired by obtaining permission from the
CCS faculty, and it is confined to the academic years 2018 to 2021. While this study
does not offer strategies to mitigate attrition, it proposes the development of a predictive
model that can be employed by CCS faculty members to identify first-year students who
may be at risk of attrition. This model will be designed to accurately identify the factors

that contribute to student attrition, thereby aiding faculty members in addressing them
more effectively.

Chapter II

Review of Related Literature

Student attrition rate is defined as the number of individuals who leave a program
of study before completing it [7]. This may be caused by several issues and factors such
as time pressure, lack of support, lifestyle, etc. [1]. These factors contribute to the
students’ decision to drop out from subjects or, in more severe cases, drop from the
course entirely. As student attrition or non-continuation rates are a barometer for the
performance of higher education institutions, it is vital to know the severity of these
cases to prevent, mitigate, and lower said cases.

For some students, there is a psychological factor in play as to why they decide
to leave their course prematurely. Loss of interest in the study program may lead to
increased tendencies for non-continuation. Aside from this, financial status also has
something to do with students stopping early: students may need to work early to earn,
and therefore drop out to look for a job. Another reason why student
attrition rates persist is academic performance. If a student feels like they are not
fit to continue in their course, they may opt to stop and enroll in another one.

Loss of interest in studying causes many different issues to arise within a student
[16]. This in turn is a major contributor to student attrition. If a student feels that they are
not interested anymore in the lessons that their course is offering, then their attention will
be focused on things other than their course. Students may find the course offerings
becoming increasingly difficult or, at the other end of the spectrum, easier, leading to
diminishing interest in their studies.

A trait common with the different articles on student attrition that the researchers
noticed was that most of the factors that were included as causes for attrition in the
study were included in other studies, such as loss of interest [16], external factors [1],

among others. These were leading causes among students in different studies. In the
case of the College of Computing Studies in Western Mindanao State University, these
were also determined to be part of the larger issue of freshman attrition.

Students are not the only ones affected by student attrition. The
educational institutions they come from are also impacted by the rate of non-
continuation [21]. It gives a negative impression of the institution and makes it look like it
is performing inadequately due to the number of dropouts. Student dropouts pose a
major concern for educational communities and institutions [11]. Institutions strive to
replenish the resources they have spent on the students that do not continue while the
students themselves lose valuable time and energy on building up knowledge and
experience from the course, only to not pursue it until the end.

Data collection, testing, and plotting methods must be reviewed by the
educational institutions to determine which are the most effective in mapping out student
attrition cases in an institution. Manual calculating methods are acceptable, but only with
small data sets. Issues and difficulties will arise when dealing with much larger data sets.
A machine learning model will be of great use in completing this task because of its
reliability and consistency in dealing with large amounts of data [14]. Machine learning
models are shown to have consistently at least 95% of human accuracy throughout
different disciplines [11]. As studies that involve student populations require accuracy with
large datasets, a machine learning model is apt for the undertaking. Studies have shown
that machine learning models can keep up with and even surpass human performance
in certain systematic tasks [1].

This makes it easier to work with a dataset with the potential to have over a
hundred different possible entries. Although a dataset becomes more difficult to handle
as it grows in size [16], this one will be small enough to not require extensive
programming expertise. These models have been invaluable to researchers across the
globe due to the speed, accuracy and precision of the calculations and data analysis of
the systems. Apart from the speed of calculations, the machine learning models are also
consistent even when being supplied with big datasets which ordinary manual
calculations will struggle with due to human error and oversight. One big difference
between human error and machine learning model error is predictability [13]. Preventing

mistakes with the machine learning model is easier than with manual calculating
methods as these are systematic and can be programmed, unlike human methods.

Machine learning models have been used to great effect in the study of student
performance [1,2,8]. A 2018 study on variables that influence reading
proficiency in English among Filipino learners achieved 81.2% accuracy using a
dataset of 7,233 students from the OECD PISA 2018 database [7]. Another test
done with machine learning, this time on the academic affect of Filipino students during
online examinations, yielded an accuracy of 92.66% [11]. That test was done with a
dataset comprising 75 students. This shows that machine learning models are a reliable
way of cleaning data, plotting it out, and interpreting large datasets.

A case study [10] was conducted that predicted the retention of Catholic
University of Maule students from 1st-year to 3rd-year level. Using the Knowledge
Discovery in Databases (KDD) process as their core method, the authors used data
mining to formulate four models (any level, 1st year, 2nd year, 3rd year) with the
following machine learning algorithms: Decision Trees, K-Nearest Neighbors, Logistic
Regression, Naïve Bayes, Random Forest, and Support Vector Machines. The
evaluation metrics they used were accuracy, true positive rate, false positive rate,
precision, F-measure, RMSE, k-statistics, and the Friedman value. The results show
that all models exceed an accuracy of 80%. It was noted that it is necessary to balance
the data to yield higher accuracy. The Random Forest algorithm achieved the highest
accuracy at each level: 88.43% (any level), 93.65% (1st year), 95.76% (2nd year), and
96.92% (3rd year). They also evaluated the performance of each machine learning
algorithm used in their study with the Friedman value test. This statistical analysis
shows that the Random Forest algorithm ranked first among the other algorithms,
further indicating that it was the best-performing algorithm in their study.

Another study [14] predicted the dropout of computer
science students from the University of the South Pacific. It used the Random Forest
algorithm to build the model. The dataset consists of 963 observations and 33 features.
Two models were created to train and test the data: Model 1 used 5-fold cross-
validation and Model 2 used 10-fold cross-validation. Accuracy, specificity, sensitivity,
and kappa were used as performance metrics, while the Confusion Matrix was used to
evaluate the models. The results of the study show that the model with 5-fold cross-
validation achieved the higher accuracy of the two models, with an accuracy of 0.82.
It was also found that the strongest predictor of dropout was academic performance
in the 1st-year programming course.

One difference between this study and others is the relatively small dataset, due
in part to the scope of the study. A large dataset is much preferred, as it makes the results
of data processing and calculations more refined [3]. While the dataset size was enough
for processing that yielded plausible results in initial testing, it still paled in comparison to
other studies that have more entries.

Synthesis

Title: Early Detection of Students at Risk – Predicting Student Dropouts Using
Administrative Student Data and Machine Learning Methods [18]
Gap: Socio-economic factors are not as fleshed out, due to the difference between the
economies of Germany and the Philippines.
Improvement (researchers' study): The researchers used significantly more socio-
economic variables to fit the status of Filipino students, particularly in Western
Mindanao State University.

Title: Knowledge Discovery for Higher Education Student Retention Based on Data
Mining: Machine Learning Algorithms and Case Study in Chile [10]
Gap: The study suggested using an ensemble technique.
Improvement (researchers' study): The researchers used stacking and bagging
ensemble techniques in modelling.

Title: Using ensemble decision tree model to predict student dropout in computing
science [6]
Gap: The study used the grades of only one subject.
Improvement (researchers' study): The researchers worked with four subjects and their
corresponding laboratory and lecture grade split for more refined results.

Title: Predicting Student Academic Performance in Computer Science Courses: A
Comparison of Neural Network Models [1]
Gap: Student academic performance alone is not enough as a basis for attrition rate
prediction.
Improvement (researchers' study): Different data cleaning and pre-processing
techniques were employed in the processing of the data to maintain accuracy and
consistency.

Title: Modeling Filipino Academic Affect during Online Examination using Machine
Learning [8]
Gap: Though the results were more than acceptable, the dataset is small when used to
refer to Filipino students in general.
Improvement (researchers' study): While still narrow in scope, this study used
significantly more participants and data inputs.

Title: Predicting Student Academic Performance at Degree Level: A Case Study [2]
Gap: Samples from the minority class were copied multiple times to balance the
dataset, but compared to the original models, accuracy was not improved.
Improvement (researchers' study): This study uses SMOTE (oversampling) and
Random Undersampling (undersampling) techniques to balance the dataset.

Title: A Review on Predicting Student's Performance using Data Mining Techniques [7]
Gap: The study did not check whether the dataset is balanced.
Improvement (researchers' study): During data pre-processing, the researchers
checked the ratio of the target classes.

Title: Using Machine Learning Approaches to Explore Non-Cognitive Variables
Influencing Reading Proficiency in English among Filipino Learners [5]
Gap: Accuracy was the only evaluation metric used.
Improvement (researchers' study): The F1 score (macro average), precision, recall, and
AUC-ROC score were also used as evaluation metrics to evaluate the models.

Definition of Terms

1. Student Attrition – the number of individuals who leave a programme of study
before completing it. Attrition, or non-continuation rate, is a performance indicator of
higher education providers. [22]

2. Random Forest Algorithm – a machine learning technique used to solve regression
and classification problems. It utilizes ensemble learning, a technique that combines
many classifiers to provide solutions to complex problems. [21]

3. Knowledge Discovery in Databases (KDD) – the process of automatic discovery of
previously unknown patterns, rules, and other regular contents implicitly present in
large volumes of data. [25]

4. Machine Learning – a method of data analysis that automates analytical model
building. It is a branch of artificial intelligence based on the idea that systems can
learn from data, identify patterns, and make decisions with minimal human
intervention. [23]

5. Machine Learning Model – an expression of an algorithm that combs through
mountains of data to find patterns or make predictions. Fueled by data, machine
learning (ML) models are the mathematical engines of artificial intelligence. [12]

6. Machine Learning Algorithm – a part of AI that uses an assortment of accurate,
probabilistic, and upgraded techniques that empower computers to learn from past
points of reference and perceive hard-to-perceive patterns from massive, noisy, or
complex datasets. [24]

7. k-Fold Cross-Validation – a resampling procedure used to evaluate machine
learning models on a limited data sample. The procedure has a single parameter, k,
that refers to the number of groups that a given data sample is to be split into; as
such, the procedure is often called k-fold cross-validation. [17]

8. Sensitivity – the proportion of true positives that are correctly predicted by the
model. [4]

9. Specificity – the proportion of true negatives that are correctly predicted by the
model. [4]

10. Friedman Value Test – a non-parametric alternative to the Repeated Measures
ANOVA. It is used to determine whether there is a statistically significant difference
between the means of three or more groups in which the same subjects show up in
each group. [26]

Conceptual Framework

The chart above shows the concept of the system's workflow. In the input stage,
the raw data was collected and then analyzed through exploratory data analysis to look
for patterns and relationships between the different variables. In the process stage, the
collected data is subjected to various pre-processing techniques such as data cleaning,
normalization, and feature selection. To guarantee that the data are of good quality and
pertinent to the research issue, it is essential to complete this stage. The data is split into
training and testing sets after pre-processing, and then it is submitted to a variety of
machine learning methods, including Random Forest, Logistic Regression, Support
Vector Machine, Naive Bayes, K-Nearest Neighbors, Artificial Neural Networks, and
Decision Tree. In the output stage, the results of the analysis are presented in the form
of predictive data. Precision, recall, accuracy, and F1-score are a few of the
evaluation metrics used to assess the efficacy and correctness of the machine
learning algorithms. The best model was then integrated into a web application that
serves as the data collection method for further inputs.

Chapter III – Methodology

Research Design
This study employed an applied research methodology to investigate the
effectiveness of a preventive intervention in addressing the issue of freshman attrition in
the College of Computing Studies. The applied research approach is highly beneficial
because it facilitates the practical application of scientific principles to solve problems
that impact individuals, groups, and society. By identifying a problem, formulating
research hypotheses, and conducting experiments to test these hypotheses, the
researcher can leverage empirical methodologies to solve real-world problems. The
practical nature of this research design enables it to yield results that inform decision-
making, provide practical solutions, and improve outcomes. The research design is not
only relevant but also useful as it enables the researchers to work closely with
stakeholders to develop a functional and beneficial system to address the problem at
hand. [15]

Respondents

This study focused on a sample of first-year students enrolled in the College of


Computing Studies (CCS) at Western Mindanao State University. The data collected
from these students was used to construct a predictive model. The end-users of the final
product, which is a web-based application implementing the model, are the CCS faculty
members. These individuals play a vital role in evaluating the efficacy of the predictive
model in addressing the issue of student attrition. Their feedback will inform future
refinements and improvements to the model, thus contributing to the ongoing efforts to
enhance the effectiveness of retention strategies in higher education.

Data Gathering Instruments, Techniques, and Procedures

The researchers employed a retrospective study design in this thesis project,


using pre-existing data obtained from Western Mindanao State University, College of
Computing Studies (CCS) faculty. The data collected included various demographic
information such as gender, age, and financial status, as well as prior educational
background (Senior High School GWA and Strand), first-year 1st and 2nd semester
major subject grades and class schedules, college ID number, the year of enrollment,

and whether or not they continued their course and enrolled in the second-year 1st
semester. The researchers sought and obtained appropriate permissions from the
relevant authorities to collect the data. Data were collected in both electronic and hard
copy formats and were subsequently stored in a Comma-Separated Values (CSV) file to
enable efficient management and analysis using Jupyter Notebook, which was used to
build the predictive model.

Statistical and Analytical Tools


The following are the statistical and analytical tools used in building the predictive
model:
● Exploration and Analysis of Data
In this section, the researchers explored and analyzed the raw
dataset to look for patterns and also to gain valuable insights about the
data. The researchers used the pandas library to visualize and analyze
the dataset using the statistical tools from the mentioned library. The
following are the pandas library tools/functions used in exploring and
analyzing of raw data: Dataframe.columns, dtypes, values, and shape
were used to display the columns of the dataset, the size of its column
and row, the datatypes of each columns, and to display an array of
values. Dataframe.mean(), median(), mode(), and std() are functions that
were used to determine the measure of central tendency of some
columns and were also used in some calculations. Dataframe.corr() was
used to compute the correlation of columns. Dataframe.describe() was
used to display a summary of descriptive statistics of the dataset and
Dataframe.isna() was used to display the number of missing data.
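As an illustration, a minimal sketch of this exploration step in Python; the
CSV file name here is a placeholder, not the study's actual file:

import pandas as pd

# Load the gathered data; "students.csv" is a placeholder file name.
df = pd.read_csv("students.csv")

# Structure of the dataset: column names, data types, and dimensions.
print(df.columns)
print(df.dtypes)
print(df.shape)

# Measures of central tendency and spread for the numeric columns.
print(df.mean(numeric_only=True))
print(df.median(numeric_only=True))
print(df.std(numeric_only=True))

# Summary statistics, pairwise correlations, and missing-data counts.
print(df.describe())
print(df.corr(numeric_only=True))
print(df.isna().sum())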

● Data Cleaning and Preprocessing

Data cleaning and preprocessing is where the researchers
transform the raw dataset into meaningful information. The following
techniques and methods from the pandas library were used in
cleaning and preprocessing the data: DataFrame.fillna() was used to fill in
missing data. DataFrame.dropna() and drop() were used to remove
missing values on a specific row/column or on all columns. The Interquartile
Range (IQR) was the measurement used to find the outliers in the
dataset. The IQR has the formula Q3 – Q1, where Q3 = 75th percentile
and Q1 = 25th percentile. Values falling outside the IQR fences
(conventionally Q1 – 1.5·IQR to Q3 + 1.5·IQR) are considered outliers.
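A sketch of the IQR rule in Python; the 1.5 multiplier is the conventional fence,
and the exact columns checked are an assumption:

def iqr_outliers(series):
    # Q1 and Q3 are the 25th and 75th percentiles; IQR = Q3 - Q1.
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Values outside the fences are flagged as outliers.
    return series[(series < lower) | (series > upper)]

# Example with one numeric column from the dataset:
# outliers = iqr_outliers(df["CET OAPR"])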

● Evaluation Metrics
Evaluation metrics quantify the performance of the machine
learning model. The results of the chosen evaluation metrics were the basis
for choosing the best machine learning model for deployment. Accuracy
was the initial evaluation metric for the machine learning models
in this study, as it is one of the most common evaluation metrics based
on research and related studies. But since the dataset was found to be
imbalanced, the researchers looked for another evaluation
metric, since accuracy tends to be biased on an imbalanced dataset. The F1
score with macro average weight was chosen as the main metric for
evaluating the models, and the researchers also used precision, recall, and
the AUC-ROC score to evaluate the result for each class of the target
variable, while a confusion matrix was used to visualize the actual and
predicted values. The following are the definitions of each evaluation
metric:

1. Accuracy - used to measure the percentage of correctly predicted


output by the model. It can be defined as the ratio of the number
of correct predictions and the total number of predictions. The
formula for accuracy is shown below.
Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP = True Positive, TN = True Negative, FP = False Positive, and FN = False Negative.
2. Precision - used to determine the number of true positives divided
by the number of predicted positives. The formula for precision is
shown below.
Precision = TP / (TP + FP)

3. Recall - used to determine the number of true positives divided by


the total number of actual positives. The formula for recall is
shown below.
Recall = TP / (TP + FN)

4. F1 Score - It gives a combined idea about Precision and Recall


metrics. It is maximum when Precision is equal to Recall. The
formula for F1 Score is shown below.

F1 Score = 2 · (Precision · Recall) / (Precision + Recall)

5. Confusion Matrix - a table that was used to describe the


performance of a classification model on a set of the test data for
which the true values are known. Figure 1 shows the illustration of
Confusion Matrix.

Figure 1. Confusion Matrix

6. AUC - ROC Score - used to compute the Area Under the Receiver
Operating Characteristic Curve (ROC AUC) from prediction
scores.
Figure 1.1 AUC - ROC Score
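A sketch of how these metrics can be computed with scikit-learn, assuming
y_test, y_pred, and y_score come from a fitted classifier (variable names are
illustrative):

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# y_test: true labels; y_pred: predicted labels;
# y_score: predicted probability of the positive class (all assumed).
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred, average="macro"))  # main metric in this study
print(confusion_matrix(y_test, y_pred))
print(roc_auc_score(y_test, y_score))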

Technical Tools
In this study, the researchers utilized a range of technical tools for building the
predictive model and web application. Jupyter Notebook, a popular code editor for data
science projects, was used for exploring, analyzing, cleaning, preprocessing, and
building the predictive model. Python, a widely used programming language for data
science, was the language of choice for this study due to its versatility and abundance of
useful libraries.
The following Python libraries were used for data analysis, visualization, mathematical
computations, and building the predictive model: Pandas, NumPy, Scikit-learn,
Matplotlib, Imbalanced-learn, and Seaborn.

For developing the web application, HTML, CSS, and Bootstrap were employed to
create the structure and design of the web pages. Visual Studio Code, an integrated
development environment (IDE) with support for numerous Python extensions, including
Flask, was used for integrating the predictive model into the web application.
Pythonanywhere was utilized as the web hosting service for deploying the web
application online. By using these technical tools, the researchers were able to develop
a robust and effective predictive model and a user-friendly web application.
Software Process Model

The CRISP-DM (Cross Industry Standard Process for Data Mining) process model
was used as the main framework in this study. There are six sequential steps in the
CRISP-DM framework that were used to gather insights and formulate the predictive
model. Figure 1.2 illustrates the six steps of CRISP-DM.

Figure 1.2 CRISP-DM Diagram [19]

A. Business Understanding

According to a statement released by the Commission on Higher
Education (CHED) in 2012, the Philippines is facing an alarming college
dropout rate of 83.7 percent, resulting in an estimated 2.13 million college
dropouts annually [20]. Studies have shown that a significant number of


student attrition cases occur during the first year of college [13]. Early
identification of students at risk of dropping out or shifting to another course
can aid faculties in devising strategies and making necessary adjustments to
support these students and enable them to complete their college education.
The proposed predictive model serves as a valuable tool for identifying first-
year students who are at risk of dropping out or changing courses. The
findings from the model can provide faculties with a deeper understanding of
the factors influencing student attrition, thereby enhancing their strategies to
combat the problem.

B. Data Understanding

During this phase, the researchers employed a data collection and


exploration approach to gather and analyze relevant data for the thesis
project. The collected data was stored in a CSV file, which was subsequently
loaded into the Jupyter Notebook for further analysis. Using various data
analysis techniques, the research team conducted a thorough investigation of
the data to determine patterns and relationships between variables, and
identify potential missing data points that could affect the accuracy of the
predictive model. By carefully analyzing the data, the researchers were able
to obtain valuable insights that will help inform their subsequent research
activities and model development.

C. Data Preparation

Following the exploration and analysis of the dataset, the subsequent


step entailed the application of data cleaning and pre-processing techniques.
This process involved the removal of irrelevant, noisy, and inconsistent data,
as well as addressing missing data. Additionally, a new feature/attribute was
derived from the existing dataset to further enhance the quality of the data.
The final step in this process was saving the cleaned dataset into
another CSV file. This ensured that the dataset was prepared and organized
optimally to guarantee accurate and reliable outcomes in subsequent phases
of the project.

D. Modeling

This step constitutes the beginning of the model-building process. This
study employed the technique of supervised learning for constructing the
predictive model. Specifically, the researchers divided the dataset into two
subsets, i.e., the train and test datasets, where the former contains labelled
data or the correct output [16]. The model was then trained using these data
to enable it to learn from them and ultimately predict the output accurately.

The research problem addressed in the thesis project belongs to the
classification task of supervised learning. To achieve the project’s objectives,
the researchers identified several classification algorithms, including Random
Forest, Logistic Regression, Support Vector Machine, Naïve Bayes, K-
Nearest Neighbors, Artificial Neural Networks, and Decision Tree. These
algorithms were used to build the predictive model and were subsequently
evaluated in the next step of the project.

E. Evaluation

In this phase, the researchers assessed the performance of each


algorithm employed in the previous phase. The evaluation process was
crucial in determining the effectiveness of the models in predicting the
outcomes of the study. The assessment was conducted using several
standard evaluation metrics, including accuracy, sensitivity, specificity,
geometric mean (G-Mean), precision-recall metric, and 5-fold cross-
validation. In addition, the researchers also included and adjusted other
evaluation metrics as needed to ensure the validity and reliability of the
models.
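As an illustration of the 5-fold cross-validation mentioned above, a minimal
sketch using scikit-learn; the model choice and macro-F1 scoring here are
assumptions for the example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X and y are the prepared feature matrix and target variable (assumed).
model = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
print(scores.mean(), scores.std())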

F. Deployment

In the last stage of the study, the predictive model developed by the
researchers was deployed in a simple web-based application. The application
was intended to showcase the functionality of the model and allow end-users
to evaluate the accuracy of the model's predictions. It is important to note that
the application was designed for demonstration purposes only and was not
developed with the intent of integrating it into a system, as this task requires
specialized skills beyond the expertise of the researchers.
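A minimal sketch of such a deployment with Flask, assuming the trained model
was serialized to a pickle file and the form fields arrive in the same order as
the training features; the file name, template names, and field handling are all
assumptions:

import pickle
from flask import Flask, render_template, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:  # hypothetical path to the saved model
    model = pickle.load(f)

@app.route("/")
def index():
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    # Collect the submitted form values as a single feature row.
    features = [float(v) for v in request.form.values()]
    label = model.predict([features])[0]
    return render_template("result.html", prediction=label)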

Chapter IV

RESULTS AND DISCUSSION

Exploratory Data Analysis

The researchers conducted Exploratory Data Analysis to analyze the data and to
learn key insights. The dataset consists of 454 rows and 30 columns. Figure 2 shows
the columns of the dataset and their respective data types. The researchers noticed
that the columns for major subject grades (CC100, CC101, CC102, CS111/IT112) have
an object data type. An investigation of the unique values of each grade column
revealed two (2) string values in those grade columns, INC and UW. With this
information, and on the suggestion of the thesis adviser, the researchers came up with
a plan to create four (4) different routes in building the model. The routes are the
following (a sketch of the route transformations appears after the list):

1) Treat the grade data as categorical variables.
2) Drop both INC and UW data, then convert the grade columns to numeric data.
3) Drop UW data and replace INC with the median value, then convert the grade
columns to numeric data.
4) Drop UW data and replace INC with a 3.0 value, then convert the grade
columns to numeric data.
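A sketch of routes 2–4 in pandas; the grade column names follow the text, but
the exact cleaning code is an assumption:

import pandas as pd

grade_cols = ["CC100", "CC101", "CC102", "CS111/IT112"]

def route_2(df):
    # Drop rows holding INC or UW, then convert grades to numeric.
    out = df[~df[grade_cols].isin(["INC", "UW"]).any(axis=1)].copy()
    out[grade_cols] = out[grade_cols].astype(float)
    return out

def route_3(df):
    # Drop UW rows, replace INC with the column median, convert to numeric.
    out = df[~df[grade_cols].isin(["UW"]).any(axis=1)].copy()
    for col in grade_cols:
        numeric = pd.to_numeric(out[col], errors="coerce")  # INC becomes NaN
        out[col] = numeric.fillna(numeric.median())
    return out

def route_4(df):
    # Drop UW rows, replace INC with a grade of 3.0, convert to numeric.
    out = df[~df[grade_cols].isin(["UW"]).any(axis=1)].copy()
    for col in grade_cols:
        out[col] = pd.to_numeric(out[col], errors="coerce").fillna(3.0)
    return out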

Figure 2. Dataset Columns and its Data Type

The researchers also noticed that the SHS/HS GPA column has an object data
type. Upon inspecting its unique values, the researchers found that it contains Transferee
and Shiftee values. There are 24 total rows with those two values, and the researchers
decided to drop them from the dataset. Aside from male and female, the Gender variable
has two other categories, "prefer not to say" and "androgyne". Since there is only 1
instance of androgyne in the dataset, this category was dropped from the Gender
variable. Figure 3 shows the number of missing data for each column.

Figure 3. Number of Missing Data

The researchers plotted visualizations of some variables to gather more insights
about the dataset. First, a graph was plotted showing the distribution of first-year CCS
students who did or did not enroll in 2nd year across the four school years of data.
Figure 4 shows that school years 2018 and 2021 had first-year CCS students
who did not enroll for their second year in college, while in school years 2019 and 2020
all students enrolled in second year. The graph also shows that the number of
students enrolled increases in succeeding years, but since the dataset only consists of
454 students across four school years, this observation might not be valid.

Figure 4. Frequency of enrolled students in each school year

Figure 5 shows the frequency of first-year CCS students by the Track they
took during Senior High School. STEM and TVL-ICT have the greatest number of
students. This is not surprising, since some core subjects are already taught in those two
SHS Tracks. Figure 6 shows the frequency of different grades for each major
subject in BSCS and BSIT from school years 2018 – 2021.

Figure 5. Frequency of students on each SHS Strand

Figure 6. Frequency of grades of major subjects from school year 2018 - 2021

Identify the Features to be used in Machine Learning Model


The researchers have discussed some insights derived from Exploratory Data
Analysis. The subsequent parts discuss the rest of the pre-processing procedures done
for each route. The researchers chose the "Enrolled in 2nd Year" data as the target
variable, since it tells whether the student continued with their studies or not. After that,
the researchers determined which among the columns/features would likely influence the
result of the target variable. The researchers decided to drop features that hold personal
information, which are Student ID, Name, First Name, Last Name, Birthday, and all
features that contain schedule data. The rest of the variables were evaluated for their
correlation with the target variable after the categorical encoding.

The next process was handling missing data. As shown in Figure 3 above,
the retained features have missing data. The researchers plotted graphs to inspect
the distribution of the numeric variables.

Figure 7. Route 1 Histogram and Box plot of numeric variables

Figure 7 shows that only CET OAPR follows a bell-shaped symmetrical curve
while the rest are skewed to the right. This graph helped decide which method to use in
handling missing data: since the majority of the data are skewed, the appropriate
method is to fill the missing data with the median. The researchers got the
median and used the fillna() function from pandas to fill the missing data. For
categorical variables, the researchers simply filled the missing data with the mode. Next
was handling the outliers. Figure 7 also shows a box plot for the numeric variables; the
black circles in the plot are the outliers. After finding the outliers, the researchers
decided to leave them in the dataset: although considered outliers, these are rare
occurrences, and those data points are still within the range of accepted values for
each variable, so the outliers were kept in the dataset. The next step was categorical
encoding. The researchers investigated the unique values of each categorical variable
and determined that only the SHS Strand, Gender, and Year Started in Program
variables are categorized as nominal data, while the others are considered ordinal
data. For nominal data, the researchers used dummy encoding, which creates dummy
variables equal to the number of categories (k) in the variable and encodes k−1
dummy variables. For ordinal data, the researchers used a mapping function to encode
the data according to their rank/order.
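A sketch of both encodings in pandas; the nominal column names follow the
text, but the ordinal mapping shown is illustrative, not the study's actual
category ranks:

import pandas as pd

# Dummy (k-1) encoding for the nominal variables named in the text.
nominal_cols = ["SHS Strand", "Gender", "Year Started in Program"]
df = pd.get_dummies(df, columns=nominal_cols, drop_first=True)

# Ordinal variables are mapped by rank; this mapping is a made-up example.
income_rank = {"Low": 0, "Middle": 1, "High": 2}
df["Monthly Family Income"] = df["Monthly Family Income"].map(income_rank)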

As mentioned earlier, after the categorical encoding, the correlation
coefficients of the remaining variables with the target variable were calculated. As
shown in Figures 8 and 9, the highest correlation coefficients are 0.23 and 0.17
respectively. A coefficient of 0.23 can be considered a moderate positive relationship,
while the rest of the variables have correlation coefficients below ±0.20. This means
these variables have very weak relationships with the target variable and, on their own,
would not be able to determine the output of the target variable. But since there was
only one variable with at least a moderate relationship, the researchers decided to use
all the variables as features for building the predictive model.

Figure 8. Pearson’s Correlation Heatmap

Figure 9. Spearman’s Correlation Heatmap

Model Development

The researchers started by creating two variables: one for the features, and the
other for the target variable. Next, the data were split in an 80:20 ratio into train and test
data respectively. But before going forward, the researchers looked at the distribution of
the classes of the target variable and found that the dataset is highly imbalanced, as
shown in Figure 11. So before moving forward with building the models, the
researchers applied oversampling and undersampling methods to the training data to
handle the imbalanced dataset.

Figure 10. Pie Chart of Class Labels for each Route

For oversampling, the researchers used the Synthetic Minority Oversampling
Technique (SMOTE), which generates new instances from the existing minority class to
even out the number of instances of both classes. For undersampling, the researchers
used the Random Undersampling method, which removes random instances of the
majority class from the training data to even out the number of instances of both classes.
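A sketch of both balancing steps with the imbalanced-learn library, applied only
to the training split; the random_state values are assumptions:

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Oversample the minority class with SMOTE.
X_train_sm, y_train_sm = SMOTE(random_state=0).fit_resample(X_train, y_train)

# Alternatively, randomly undersample the majority class.
rus = RandomUnderSampler(random_state=0)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)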

Figure 11. Distribution of Class label

Figure 12. Distribution of Class label using SMOTE

Figure 13. Distribution of Class label after using Random Undersampling
The evaluation metrics used for evaluating the models are the F1 score with macro
average, precision, recall, accuracy, the ROC-AUC score, and the Confusion Matrix.
These metrics help the researchers distinguish the differences among the models. The
F1 score was then used as the final evaluation metric to determine the best predictive
model for deployment, since it is one of the common evaluation metrics for an
imbalanced dataset. Table 1 shows the parameters used in building the models. The
parameters chosen were based on related studies and examples using these algorithms
found online.

The researchers included two ensemble techniques in building the models,
stacking and bootstrap aggregating, or bagging for short. A stacking ensemble
combines the results of two or more models so that a better prediction result can be
achieved, while a bagging ensemble selects random subsets of data in the dataset to
train on and then aggregates their individual results to form the final prediction. In
stacking, all the algorithms used for modelling were combined, while in bagging the
Decision Tree algorithm was used as the base estimator.
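A sketch of both ensembles in scikit-learn; only a subset of base learners is
shown in the stack for brevity, where the study combined all seven algorithms:

from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stacking: base learners feed a final meta-estimator.
stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        ("dt", DecisionTreeClassifier()),
    ],
    final_estimator=LogisticRegression(),
)

# Bagging with a Decision Tree base estimator, as in the study.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)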

Table 2 shows the results of the best models of each route without using any
method to balance the dataset. The highest F1 score in this testing is 0.494. Three
models produced this same result: Random Forest, K-Nearest Neighbors, and the
Stacking Ensemble. Table 3 shows the results of the best models of each route using
the SMOTE method. The SVM Linear algorithm produced the highest result, with a
0.606 F1 score.

Table 4 shows the results using Random Undersampling. K-Nearest
Neighbors produced the highest result, with a 0.546 F1 score. Based on these results,
balancing the data yields better predictions than predicting with imbalanced data.
The best model in this test's results was Route 1's SVM Linear kernel, with an F1 score
of 0.606.

Algorithm                    | Parameter/s
Random Forest                | n_estimators = 100
Logistic Regression          | default
Support Vector Machine       | kernel = linear
Support Vector Machine       | kernel = poly
Gaussian Naïve Bayes         | default
K-Nearest Neighbors          | n_neighbors = 3
Artificial Neural Network    | random_state = 1, max_iter = 300
Decision Tree                | default

Table 1. Algorithm Parameters for Modelling
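A sketch instantiating the candidate models with the Table 1 parameters,
assuming scikit-learn implementations (with MLPClassifier standing in for the
Artificial Neural Network):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# One entry per Table 1 row; defaults are left untouched where listed.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Logistic Regression": LogisticRegression(),
    "SVM Linear": SVC(kernel="linear"),
    "SVM Poly": SVC(kernel="poly"),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=3),
    "Artificial Neural Network": MLPClassifier(random_state=1, max_iter=300),
    "Decision Tree": DecisionTreeClassifier(),
}

# Fit each candidate on the (balanced) training data, assumed from above.
for name, model in models.items():
    model.fit(X_train, y_train)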

Route | Algorithm | Precision (Class 0 / 1) | Recall (Class 0 / 1) | F1 Score (Class 0 / 1) | F1 Score (Macro) | Accuracy
1 | Random Forest | 0.000 / 0.964 | 0.000 / 1.000 | 0.000 / 0.982 | 0.491 | 0.964
1 | Logistic Regression | 0.000 / 0.964 | 0.000 / 1.000 | 0.000 / 0.982 | 0.491 | 0.964
1 | SVM Poly | 0.000 / 0.964 | 0.000 / 1.000 | 0.000 / 0.982 | 0.491 | 0.964
1 | SVM Linear | 0.000 / 0.964 | 0.000 / 1.000 | 0.000 / 0.982 | 0.491 | 0.964
1 | K-Nearest Neighbors | 0.000 / 0.964 | 0.000 / 1.000 | 0.000 / 0.982 | 0.491 | 0.964
1 | Artificial Neural Network | 0.000 / 0.964 | 0.000 / 1.000 | 0.000 / 0.982 | 0.491 | 0.964
1 | Stacking Ensemble | 0.000 / 0.964 | 0.000 / 1.000 | 0.000 / 0.982 | 0.491 | 0.964
2 | Random Forest | 0.000 / 0.974 | 0.000 / 1.000 | 0.000 / 0.987 | 0.494 | 0.974
2 | K-Nearest Neighbors | 0.000 / 0.974 | 0.000 / 1.000 | 0.000 / 0.987 | 0.494 | 0.974
2 | Stacking Ensemble | 0.000 / 0.974 | 0.000 / 1.000 | 0.000 / 0.987 | 0.494 | 0.974
3 | Random Forest | 0.000 / 0.953 | 0.000 / 1.000 | 0.000 / 0.976 | 0.488 | 0.953
3 | Logistic Regression | 0.000 / 0.953 | 0.000 / 1.000 | 0.000 / 0.976 | 0.488 | 0.953
3 | SVM Linear | 0.000 / 0.953 | 0.000 / 1.000 | 0.000 / 0.976 | 0.488 | 0.953
3 | SVM Poly | 0.000 / 0.953 | 0.000 / 1.000 | 0.000 / 0.976 | 0.488 | 0.953
3 | K-Nearest Neighbors | 0.000 / 0.953 | 0.000 / 1.000 | 0.000 / 0.976 | 0.488 | 0.953
3 | Decision Tree | 0.000 / 0.953 | 0.000 / 1.000 | 0.000 / 0.976 | 0.488 | 0.953
3 | Artificial Neural Network | 0.000 / 0.953 | 0.000 / 1.000 | 0.000 / 0.976 | 0.488 | 0.953
3 | Stacking Ensemble | 0.000 / 0.953 | 0.000 / 1.000 | 0.000 / 0.976 | 0.488 | 0.953
4 | Random Forest | 0.000 / 0.953 | 0.000 / 1.000 | 0.000 / 0.976 | 0.488 | 0.953
4 | Logistic Regression | 0.000 / 0.953 | 0.000 / 1.000 | 0.000 / 0.976 | 0.488 | 0.953
4 | SVM Linear | 0.000 / 0.953 | 0.000 / 1.000 | 0.000 / 0.976 | 0.488 | 0.953
4 | SVM Poly | 0.000 / 0.953 | 0.000 / 1.000 | 0.000 / 0.976 | 0.488 | 0.953
4 | K-Nearest Neighbors | 0.000 / 0.953 | 0.000 / 1.000 | 0.000 / 0.976 | 0.488 | 0.953
4 | Artificial Neural Network | 0.000 / 0.953 | 0.000 / 1.000 | 0.000 / 0.976 | 0.488 | 0.953
4 | Stacking Ensemble | 0.000 / 0.953 | 0.000 / 1.000 | 0.000 / 0.976 | 0.488 | 0.953
4 | Bagging Ensemble | 0.000 / 0.953 | 0.000 / 1.000 | 0.000 / 0.976 | 0.488 | 0.953

Table 2. Result of predictive model with imbalanced data

Route | Algorithm            | Precision (0 / 1) | Recall (0 / 1) | F1 (0 / 1)    | F1 (Macro) | Accuracy
1     | SVM Linear           | 0.200 / 0.975     | 0.333 / 0.951  | 0.250 / 0.963 | 0.606      | 0.929
2     | Decision Tree        | 0.130 / 0.984     | 0.750 / 0.753  | 0.222 / 0.853 | 0.538      | 0.753
3     | Gaussian Naïve Bayes | 0.095 / 0.969     | 0.500 / 0.765  | 0.160 / 0.855 | 0.508      | 0.753
4     | Gaussian Naïve Bayes | 0.100 / 0.969     | 0.500 / 0.778  | 0.167 / 0.853 | 0.515      | 0.765

Table 3. Results of the predictive models using the SMOTE method

Route | Algorithm           | Precision (0 / 1) | Recall (0 / 1) | F1 (0 / 1)    | F1 (Macro) | Accuracy
1     | Decision Tree       | 0.095 / 0.984     | 0.667 / 0.765  | 0.167 / 0.861 | 0.514      | 0.762
1     | Stacking Ensemble   | 0.095 / 0.984     | 0.667 / 0.765  | 0.167 / 0.861 | 0.514      | 0.762
2     | SVM Poly            | 0.067 / 0.984     | 0.500 / 0.816  | 0.118 / 0.892 | 0.505      | 0.808
3     | K-Nearest Neighbors | 0.136 / 0.984     | 0.750 / 0.765  | 0.231 / 0.861 | 0.546      | 0.765
4     | K-Nearest Neighbors | 0.130 / 0.984     | 0.750 / 0.753  | 0.222 / 0.853 | 0.538      | 0.753

Table 4. Results of the predictive models using Random Undersampling

Figure 14. Batch Prediction Confusion Matrix

Before deployment, the researchers ran a batch prediction with the SVM Linear model to check the model's f1 score once more, using 80 random samples from the test data. Table 5 compares the batch prediction with the single prediction obtained during model building: both tests yield the same macro f1 score of 0.606, with only minuscule differences in some of the other evaluation metrics. With this result, the researchers were convinced to use the SVM Linear model for deployment.
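
A hedged sketch of such a batch check is shown below, assuming the trained model was serialized with joblib and the test split is held in pandas structures; the file and variable names are illustrative, not the researchers' actual ones.

import joblib
from sklearn.metrics import f1_score

# Hypothetical filename; the model is assumed to have been saved earlier.
model = joblib.load("svm_linear_model.joblib")

batch = X_test.sample(n=80, random_state=0)   # 80 random test samples
preds = model.predict(batch)

# Macro-averaged f1, matching the evaluation metric used in the tables.
print("macro f1:", f1_score(y_test.loc[batch.index], preds, average="macro"))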

                  | Precision (0 / 1) | Recall (0 / 1) | F1 (0 / 1)    | F1 (Macro) | Accuracy
Single Prediction | 0.200 / 0.975     | 0.333 / 0.951  | 0.250 / 0.963 | 0.606      | 0.929
Batch Prediction  | 0.200 / 0.973     | 0.333 / 0.948  | 0.250 / 0.961 | 0.605      | 0.925

Table 5. Results of Batch Prediction and Single Prediction

Stratify Target Variable During Train Test Split

Before conducting the alpha test, the missing CET results of some students in the dataset were received from the CS Department Head, and it was suggested to stratify the target variable when splitting the data into train and test sets. Stratifying the target variable preserves its class distribution across the train and test data, which is helpful for an imbalanced dataset.
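
A minimal sketch of a stratified split with scikit-learn is shown below, assuming a feature matrix X and target vector y have already been prepared.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,   # the splits tested ranged from 0.10 to 0.50
    random_state=0,   # 0 and 42 were the values tested
    stratify=y,       # keep the class ratio equal in both partitions
)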
The researchers initially tested using a 20% split, then tested again using 10%, 20%, 30%, 40%, and 50% splits of the target variable in the train and test data, with random-state parameters of 0 and 42, two commonly used values for this parameter. Table 6 shows the test results. Before using the stratify parameter, the best available model had an f1 score of 0.606; with the stratify parameter, the best model reached an f1 score of 0.738, about 13 percentage points higher than the initial best model. This was likewise verified through batch prediction: Table 7 compares the batch prediction with the modelling phase, and the batch prediction yields the same result. The researchers therefore used this new model when deploying the predictive model in a simple web application.

Route | Split | Random State | Algorithm                   | F1 Score (Macro)
1     | .10   | 0            | Decision Tree (SMOTE)       | .738
2     | .20   | 0            | Decision Tree (SMOTE)       | .626
3     | .10   | 42           | Logistic Regression (SMOTE) | .681
4     | .50   | 0            | Logistic Regression         | .590

Table 6. Results of the best model for each route with the stratify parameter

Figure 15. Batch Prediction Confusion Matrix (Stratified)

                  | Precision (0 / 1) | Recall (0 / 1) | F1 (0 / 1)    | F1 (Macro) | Accuracy
Single Prediction | 0.333 / 1.000     | 1.000 / 0.952  | 0.500 / 0.976 | 0.738      | 0.953
Batch Prediction  | 0.333 / 1.000     | 1.000 / 0.952  | 0.500 / 0.976 | 0.738      | 0.953

Table 7. Results of Batch Prediction and Single Prediction (Stratified)

Retraining of Model
The researchers were advised to retrain the model with the admission-year data removed, so that data outside 2018–2021 can be used as inputs to the model in the future. Table 8 shows that the Bagging Ensemble with the SMOTE method in Route 1 yielded the best result. Compared with the best model before retraining (Table 7), both best models came from Route 1, but they differ in algorithm and in the Split and Random State parameters.
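
A minimal sketch of this retraining step is shown below, assuming the dataset is loaded as a pandas DataFrame; the file and column names here are hypothetical.

import pandas as pd

df = pd.read_csv("students.csv")  # hypothetical file
# Drop the admission year so future cohorts can be scored, and separate
# the (hypothetical) target column from the features.
X = df.drop(columns=["admission_year", "dropped_out"])
y = df["dropped_out"]
# ...then repeat the stratified split, resampling, and model fitting as before.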

Route | Split | Random State | Algorithm                        | F1 Score (Macro)
1     | .10   | 42           | Bagging Ensemble (SMOTE)         | .738
2     | .50   | 42           | SVM Linear                       | .617
3     | .50   | 0            | Logistic Regression / SVM Linear | .603
4     | .50   | 0            | Logistic Regression / SVM Linear | .603

Table 8. Results of the best model for each route with the stratify parameter after retraining

Comparative Analysis of Algorithms
Based on the results in Tables 2, 3, 4, and 6, only the SVM Linear, Decision Tree, and Logistic Regression algorithms produced an f1 score of at least 60%. Of these three, the Decision Tree yielded the most consistent predictions. One notable observation is that, during testing with imbalanced data, only the Gaussian Naïve Bayes algorithm produced non-zero precision, recall, and f1 scores for the minority class. Although its results were low, at around 40%–50% macro f1, it can be concluded that among the classification algorithms tested, Gaussian Naïve Bayes was the only one not totally biased towards the majority class on the imbalanced dataset. When balancing techniques were applied to the dataset, Logistic Regression and the Decision Tree were the algorithms that produced the best results in each route.
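
For reference, the per-class precision, recall, and f1 scores reported in the tables, along with the macro-averaged f1 score, can be computed as sketched below, assuming predictions preds for a held-out y_test from an earlier fit.

from sklearn.metrics import classification_report, f1_score

# Per-class precision, recall, and f1, as reported in Tables 2-4.
print(classification_report(y_test, preds, digits=3))

# Macro average: the unweighted mean of the per-class f1 scores, so the
# minority class counts as much as the majority class.
print("macro f1:", f1_score(y_test, preds, average="macro"))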

Stacking and Bootstrap Aggregating Ensemble Techniques


Table 9 below shows the best model for each route using the Bagging and Stacking Ensembles. The Bagging Ensemble achieved the highest single f1 score of the two techniques, but the Stacking Ensemble showed a higher average f1 score across routes. Compared with the other algorithms shown in the tables above, the ensemble techniques did not often appear as the best model across the different routes and sampling methods.

Bagging Ensemble

Route | Split | Random State | Resampling Method | F1 Score (Macro)
1     | .10   | 42           | SMOTE             | .738
2     | .50   | 42           | None              | .283
3     | .50   | 0            | SMOTE             | .545
4     | .50   | 0            | RUS               | .469

Stacking Ensemble

Route | Split | Random State | Resampling Method | F1 Score (Macro)
1     | .10   | 42           | RUS               | .253
2     | .50   | 42           | SMOTE             | .555
3     | .50   | 0            | SMOTE             | .571
4     | .50   | 0            | SMOTE             | .564

Table 9. Results of the best models using Ensemble Techniques

Identifying the Possible Factors That Cause Student Attrition

Based on research and related studies, the researchers used the correlation between the dependent and independent variables as the basis for identifying the factors that cause student attrition. A moderate to strong positive or negative correlation would mean that the independent variable is a possible factor contributing to attrition. A correlation between ±0.7 and ±1.0 is considered strong, between ±0.3 and ±0.7 moderate, and between 0 and ±0.3 weak [9]. Referring to the Pearson and Spearman correlation heatmaps in Figures 8 and 9, the highest correlation obtained is for the Age variable, at 0.23; the remaining variables have correlation values below 0.20. This means that the dependent and independent variables are only very weakly correlated, so the researchers cannot precisely identify which of the variables are possible factors that caused student attrition.
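
A minimal sketch of this correlation check is shown below, assuming the cleaned dataset, target column included, is held in a pandas DataFrame df.

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrices over all numeric columns, target included.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

# Heatmap of the Pearson matrix (the Spearman one is plotted the same way).
sns.heatmap(pearson, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Pearson correlation")
plt.show()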

Deployment of Machine Learning Model in a Web Application
The researchers used the Visual Studio Code IDE to create the web application. The frontend was built with HTML, CSS, and the Bootstrap framework; for backend integration, the researchers used Python and the Flask framework. The web application has two pages: a landing page and a prediction page (see Figures 16 and 17).
After completing both the frontend and backend of the web application, the predictive model was integrated into the system, which was then hosted on the PythonAnywhere web hosting service for online use. The web application accepts data inputs from the user, feeds them into the model, and displays the model's output for the user to view.
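
A minimal sketch of such a Flask backend is shown below; the route names, template files, form handling, and model filename are illustrative assumptions, not the researchers' actual code.

import joblib
from flask import Flask, render_template, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # hypothetical saved model

@app.route("/")
def landing():
    # Landing page with the "Go!" button.
    return render_template("index.html")

@app.route("/predict", methods=["GET", "POST"])
def predict():
    if request.method == "POST":
        # Collect the submitted form fields and feed them to the model.
        features = [[float(v) for v in request.form.values()]]
        result = model.predict(features)[0]
        return render_template("predict.html", result=result)
    return render_template("predict.html")

if __name__ == "__main__":
    app.run()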

Figure 16. Landing Page

Figure 17. Prediction Page

CHAPTER V

CONCLUSION AND RECOMMENDATION

Conclusion

It is imperative to investigate the class distribution of the target variable to ensure the optimal performance of the classification algorithms used. Accuracy is a widely used evaluation metric for determining the best model, but on an imbalanced dataset the classification algorithms tend to disregard the minority class, leading to biased results. In this study, the researchers used resampling techniques, namely the SMOTE method for oversampling and the Random Undersampling method for undersampling, to balance the dataset. Tables 3, 4, 5, 7, and 9 show that algorithms with resampling methods applied yielded higher f1 scores than algorithms without any method for handling imbalanced data. Furthermore, it is essential to use other evaluation metrics, such as the f1 score with macro averaging, in determining the best model for an imbalanced dataset.
Based on the results of the correlation between the dependent and independent variables, only weak relationships were observed. Additionally, the evaluation of the minority class yielded low precision, recall, and f1 scores even after balancing the data. These observations suggest that a small dataset can produce poor model performance.
The Decision Tree algorithm with the SMOTE method produced the best f1 score before retraining the models, while the Bagging Ensemble with the SMOTE method produced the best f1 score after retraining. The researchers therefore recommend using resampling techniques, together with the f1 score with macro averaging as an evaluation metric, when handling imbalanced datasets.

Recommendations
One suggestion for future research is to incorporate a more extensive dataset. Although the current study focuses on the academic years 2018–2021, the dataset is relatively small and does not encompass all first-year CCS students during that period; obtaining a larger dataset may reveal more significant findings and enhance the predictive accuracy of the model. A second recommendation is to implement the predictive model in conjunction with an administrative system, so that it can serve as a decision-making tool for CCS faculty when addressing student attrition concerns.

APPENDICES

Appendix A
Gantt Chart

Appendix B
Flowchart

Appendix C
Code Tests

Appendix D
Relevant Source Code

Source code zip file:


https://drive.google.com/drive/u/1/folders/1cxCHYMf2NYzQ2GJYaFB5HUtnJgoe2zD9

Appendix E
Screenshot/Picture of the System

Appendix F
User Manual

Instructions for users to use the WebApp:

1) Click the "Go!" button to redirect from the home page to the data entry page.

2) Input the necessary data in the fields (all fields are required).

3) Press the submit button and the system will display the result.

Appendix G
Curriculum Vitae

Bibliography

[1] Abimbola Iyanda et al. 2018. Predicting Student Academic Performance in Computer Science Courses: A Comparison of Neural Network Models. DOI: http://dx.doi.org/10.5815/ijmecs.2018.06.01
[2] Agathe Merceron, Mahmood K. Pathan, and Raheela Asif. 2014. Predicting Student Academic Performance at Degree Level: A Case Study. Retrieved from https://www.researchgate.net/publication/287718318_Predicting_Student_Academic_Performance_at_Degree_Level_A_Case_Study
[3] Ahmad Alwosheel, Caspar G. Chorus, and Sander van Cranenburgh. 2018. Is your dataset big enough? Sample size requirements when using artificial neural networks for discrete choice analysis. Journal of Choice Modelling, 28, 167–182.
[4] Ajitesh Kumar. 2022. Machine Learning – Sensitivity vs Specificity Difference. Retrieved from https://vitalflux.com/ml-metrics-sensitivity-vs-specificity-difference/
[5] Allan B.I. Bernardo et al. 2021. Using Machine Learning Approaches to Explore Non-Cognitive Variables Influencing Reading Proficiency in English among Filipino Learners. Retrieved from https://www.mdpi.com/2227-7102/11/10/628
[6] Aman Goel Lal et al. 2020. Using ensemble decision tree model to predict student dropout in computing science. Retrieved from https://core.ac.uk/reader/404196578
[7] Amira Mohamed Shahiri, Nur'aini Abdul Rashid, and Wahidah Husain. 2015. A Review on Predicting Student's Performance using Data Mining Techniques. Retrieved from https://www.sciencedirect.com/science/article/pii/S1877050915036182
[8] Antero R. Arias Jr., Mideth Abisado, and Ramon Rodriguez. 2019. Modeling Filipino Academic Affect during Online Examination using Machine Learning. Retrieved from https://www.researchgate.net/publication/336078841_Modeling_Filipino_Academic_Affect_during_Online_Examination_using_Machine_Learning
[9] Bruce Ratner. (n.d.). The Correlation Coefficient: Definition. Retrieved from http://www.dmstat1.com/res/TheCorrelationCoefficientDefined.html
[10] Carlos A. Palacios et al. 2021. Knowledge Discovery for Higher Education Student Retention Based on Data Mining: Machine Learning Algorithms and Case Study in Chile. Retrieved from https://www.mdpi.com/1099-4300/23/4/485/htm#sec2-entropy-23-00485
[11] Charles Gbollie and Harriett P. Keamu. 2017. Student Academic Performance: The Role of Motivation, Strategies, and Perceived Factors Hindering Liberian Junior and Senior High School Students Learning. Education Research International, 2017, 1–11. DOI: https://doi.org/10.1155/2017/1789084
[12] Chris Parsons. 2021. What Is a Machine Learning Model?. Retrieved from https://blogs.nvidia.com/blog/2021/08/16/what-is-a-machine-learning-model/
[13] Dech Thammasiri et al. 2014. A critical assessment of imbalanced class distribution problem: the case of predicting freshmen student attrition. Retrieved from https://core.ac.uk/reader/32327521
[14] Dina Machuve, Khamisi Kalegele, and Neema Mduma. 2019. A Survey of Machine Learning Approaches and Techniques for Student Dropout Prediction. Retrieved from https://datascience.codata.org/articles/10.5334/dsj-2019-014
[15] Formplus Blog. (n.d.). What is Applied Research? + [Types, Examples & Method]. Retrieved from https://www.formpl.us/blog/applied-research
[16] IBM Cloud Education. 2020. What is Supervised Learning?. Retrieved from https://www.ibm.com/cloud/learn/supervised-learning
[17] Jason Brownlee. 2020. A Gentle Introduction to k-fold Cross-Validation. Retrieved from https://machinelearningmastery.com/k-fold-cross-validation/
[18] Johannes Berens et al. 2019. Early Detection of Students at Risk – Predicting Student Dropouts Using Administrative Student Data and Machine Learning Methods. DOI: https://doi.org/10.5281/zenodo.3594771
[19] Kenneth Jensen. 2012. A diagram showing the relationship between the different phases of CRISP-DM and illustrates the recursive nature of a data mining project. https://commons.wikimedia.org/wiki/File:CRISP-DM_Process_Diagram.png
[20] Manila Bulletin. 2012. College Education For Poor Students. Retrieved from https://ph.news.yahoo.com/college-education-poor-students-090706071.html
[21] Onesmus Mbaabu. 2020. Introduction to Random Forest in Machine Learning. Retrieved from https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/
[22] Psychology Wiki. (n.d.). Student attrition. Retrieved from https://psychology.fandom.com/wiki/Student_attrition
[23] SAS® Insights. (n.d.). Machine Learning: What it is and why it matters. Retrieved from https://www.sas.com/en_ph/insights/analytics/machine-learning.html
[24] Science Direct. 2020. Machine Learning Algorithm. Retrieved from https://www.sciencedirect.com/topics/engineering/machine-learning-algorithm
[25] Vladan Devedzic. 2001. Knowledge Discovery and Data Mining in Databases. Handbook of Software Engineering and Knowledge Engineering, pp. 615–637. Retrieved from https://www.worldscientific.com/doi/10.1142/9789812389718_0025
[26] Zach. 2020. Friedman Test: Definition, Formula, and Example. Retrieved from https://www.statology.org/friedman-test/