STUDENT MENTAL HEALTH CLASSIFICATION BETWEEN
TRADITIONAL METHODS AND MACHINE LEARNING ALGORITHMS
R.K. CHOTE
committee
ir. F. Zamberlan
dr. M. van Wingerden
location
Tilburg University
School of Humanities and Digital Sciences
Department of Cognitive Science &
Artificial Intelligence
Tilburg, The Netherlands
date
July 4, 2022
acknowledgments
I want to thank ir. F. Zamberlan for his supervision and encouragement
during the entire process. My gratitude also extends to dr. M. van
Wingerden, who functioned as a second reader and provided me with
valuable feedback. Lastly, I want to thank my parents and sisters for
their unconditional support this entire year.
A COMPARATIVE STUDY ON STUDENT MENTAL HEALTH
CLASSIFICATION BETWEEN TRADITIONAL METHODS AND
MACHINE LEARNING ALGORITHMS
r.k. chote
Abstract
The prevalence of mental illness among students is on the rise, with
severe consequences for all those affected. Prevention and promotion
intervention strategies that identify students prior to the onset or
worsening of mental health problems could play a substantial role in
reducing this burden. The purpose of this thesis was to use the Mental
Health Quotient questionnaire to investigate a feasible prevention
strategy that focuses on the early detection of students who exhibit
risk factors for poor mental health. Using the completed questionnaires
of n = 15,906 students between the ages of 18 and 24, a classification
task was developed to label individuals as either at-risk regarding
their mental health or normal/healthy. Machine learning models were
compared to a baseline linear model. Model selection included feature
selection and hyperparameter tuning using 10-fold cross-validation.
The results reveal that, in terms of their F1-score and ROC-AUC,
Naive Bayes (NB), K-nearest neighbors (KNN), Random Forest (RF), and a
Feedforward Neural Network (FNN) do not outperform the baseline Binary
Logistic Regression (BLR). No statistically significant difference in
performance was found between BLR and the Support Vector Machine (SVM).
SVM and BLR performed the best, while the FNN performed the worst.
Directions for future research are given and the implications are
discussed.
1 data source/code/ethics statement
Work on this thesis did not involve collecting data from human participants or
animals. The original owner of the data used in this thesis retains ownership
of the data during and after the completion of this thesis. The author of
this thesis acknowledges that they do not have any legal claim to this data
or code.
2 introduction
2.1 The need for mental health prevention and promotion strategies
Prevention of mental illnesses involves identifying, monitoring, and
managing risk factors, with the goal of acting before mental health
problems worsen, or even preventing their onset, by providing timely
treatment (Arango et al., 2018). Promotion of mental health focuses on
enhancing protective factors and healthy behaviors with the goal of
improving mental health, which in turn reduces the incidence
of psychiatric disease. Given that multiple studies show that both
prevention and promotion are effective, cost-effective supplements
and/or alternatives to treatment, more research is needed to explore
ways in which students could benefit from these interventions (Arango
et al., 2018; Arumugam, 2019; Mihalopoulos et al., 2011; Purtle et al.,
2020).
3 related work
This section discusses the existing literature on the use of machine
learning in mental health research, based on the systematic reviews by
Shatte et al. (2017). Very few articles considered ethics in relation
to the privacy of their participants' personal or private data (Conway
& O'Connor, 2016; Rubeis, 2022).
Finally, the research conducted by Srividya et al. (2018) appears to align
most closely with the objective of this thesis. The purpose of their research
was to identify the mentally distressed individuals in the target population.
Using a questionnaire designed in consultation with a psychologist, they
collected data from 656 participants. The target population consisted of high
school and college students (n = 300) and working professionals with less
than five years of experience from various organizations (n = 356). Clustering
was utilized to obtain labels, which were then used to train classification
algorithms. Logistic regression, Naive Bayes, Support Vector Machines,
Decision Trees, and K-Nearest Neighbors were the classification algorithms
used. Support vector machines and K-Nearest Neighbors performed best
as individual classifiers, obtaining just above 80% accuracy. Random forest
and the bagging ensemble method produced the best results, obtaining
around 95% accuracy. There are, however, a few limitations regarding this
paper. Firstly, while the questionnaire used to assess the mental health
of the participants was designed in consultation with a psychologist, no
reliability or validity results were provided to verify the quality of
the obtained data, which could be problematic during generalization
(Shrout & Lane, 2012). Secondly, the logistic regression did not perform
significantly worse than more advanced machine learning algorithms like
Random Forest, which leaves unresolved the question of whether machine
learning algorithms are needed in psychological research. That is, why
use advanced machine learning techniques, with their potential for
drawing wrong conclusions, if traditional methods achieve the same
results and provide more interpretability? Thirdly, a small sample size
(n = 300) was used, which could have led to an overestimate of the
obtained predictive accuracy (Cui & Gong, 2018).
In conclusion, the results and limitations of relevant articles were dis-
cussed, and currently, there is no prediction model for assessing the overall
mental health of students with a focus on early detection. While the pilot
study by Srividya et al. (2018) is a solid starting point, the limitations of
the questionnaire and the small sample size, the lack of consensus on which
algorithm should be used, and the question of whether machine learning is
necessary in psychological research indicate that there is still a significant
gap in the literature on this topic. Therefore, the research presented
in this thesis aims to propose a classification model that may be
utilized as an early detection method to identify students who are at
risk in terms of their mental health, using both descriptive and
contextual features from the
Sapien Labs’ Dynamic Dataset of Population Mental Wellbeing (Newson &
Thiagarajan, 2021). Inspired by the choice of algorithms of Joshi et al. (2018)
and Srividya et al. (2018), the aim of this thesis extends to determining
whether a baseline Binary Logistic Regression model is outperformed by
Naive Bayes, K-nearest neighbors, Support Vector Machine, Random Forest,
and a Feedforward Neural Network.
4 method
The dataset used for this thesis is Sapien Labs' Dynamic Dataset of
Population Mental Wellbeing (Newson & Thiagarajan, 2021). Upon request,
the creators at https://sapienlabs.org granted downloadable access to
the dataset. The Mental Health Quotient
(MHQ) questionnaire was administered online to collect information on the
mental health of the global population. Respondents were recruited through
Facebook and Google Ads-based outreach initiatives. Observations are peri-
odically added to the dataset, making it an ongoing collection. When the
thesis was written, the dataset comprised approximately 250,000 observations.
Each observation represents an individual who completed the questionnaire
and is described by two types of features: MHQ scores in their (a) raw form,
and (b) composite forms. The raw MHQ scores comprise 47 distinct
mental wellbeing indicators and 30 contextual descriptors (i.e., life context
metrics and descriptive features) that provide demographical information
and information about a person’s particular circumstances. The composite
MHQ scores are the aggregated scores of the 47 individual indicators across
six dimensions of mental health, along with an overall MHQ score that
represents the individual’s mental health as a whole. It is important to note
that, for the purpose of this thesis, only the 30 contextual descriptors will
be considered as possible input features for the classification model in which
the overall MHQ will function as a target variable. As there is a summative
relationship between the calculation of the overall MHQ score and the 47
mental wellbeing indicators, they are ineligible for use as input features.
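As a minimal illustration of this setup, the sketch below assumes the dataset is loaded as a pandas DataFrame; the column names and the at-risk cutoff are hypothetical placeholders, not the dataset's actual labels.

```python
import pandas as pd

# Hypothetical column names and cutoff; the real labels and the exact
# at-risk threshold are defined by the MHQ dataset and scoring scheme.
CONTEXTUAL_COLUMNS = ["age", "gender", "country", "income"]  # ... 30 in total
AT_RISK_CUTOFF = 0

df = pd.read_csv("mhq_data.csv")  # hypothetical file name

# Only the 30 contextual descriptors serve as input features; the 47
# wellbeing indicators are excluded because they sum to the overall MHQ.
X = df[CONTEXTUAL_COLUMNS]

# Binary target: at-risk (1) vs. normal/healthy (0).
y = (df["overall_mhq"] < AT_RISK_CUTOFF).astype(int)
```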
4.2 Pre-processing
The first step was to select the observations most likely to be part
of the target group, namely individuals who are currently studying.
Therefore, only individuals who answered ‘studying’ to the question ‘What is
your current occupational status?’ were selected for further preprocessing. It
is also important to note that only individuals between the ages of 18 and 24
were included because this age group is assumed to be most representative
of undergraduate students who are particularly at-risk for mental health
problems. See Figure 3 for the number of individuals per current educational
level.
Lastly, to protect the external validity of the study, only people from nations
with a minimum of 500 observations were included (Kukull & Ganguli,
2012). See Figure 4 for the number of individuals per country.
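A sketch of these filtering steps, continuing the snippet above and again using hypothetical column names (`occupational_status`, `age`, `country`):

```python
# Keep only individuals who are currently studying and aged 18-24.
students = df[
    (df["occupational_status"] == "studying")
    & df["age"].between(18, 24)
]

# Keep only countries contributing at least 500 observations.
counts = students["country"].value_counts()
students = students[students["country"].isin(counts[counts >= 500].index)]
```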
Finally, having ensured that the data was "tidy" and "clean," the non-
dummy coded features were standardized. The dataset was split at a
ratio of 80:20, with 80 percent of the data going into the training set. A
stratification technique was utilized to maintain the proportion of class
distribution depicted in Figure 5. The class distribution is roughly equal,
with 8058 individuals being labeled as normal/healthy and 7848 individuals
being labeled as at-risk. The total number of individuals involved in the
prediction analysis is 15,906.
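The split and scaling could look as follows; `continuous_cols` is an assumed placeholder for the non-dummy features, and fitting the scaler on the training set alone is one reasonable, leakage-avoiding reading of the procedure described above.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild X and y from the filtered student subset; categorical
# descriptors are assumed to have been dummy coded already.
X = students[CONTEXTUAL_COLUMNS]
y = (students["overall_mhq"] < AT_RISK_CUTOFF).astype(int)

# 80:20 split, stratified to preserve the class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Standardize only the non-dummy-coded features.
continuous_cols = ["age", "sleep_hours"]  # hypothetical subset
scaler = StandardScaler().fit(X_train[continuous_cols])
X_train.loc[:, continuous_cols] = scaler.transform(X_train[continuous_cols])
X_test.loc[:, continuous_cols] = scaler.transform(X_test[continuous_cols])
```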
4.3 Cross-validation
Feature selection will consist of correlation analysis between the features and
target variable, as well as inter-correlation analysis to assess multicollinearity.
Since the target variable for this task is binary, Point Biserial Correlation is used,
which allows for the investigation of the association between continuous and
binary variables (Kornbrot, 2014). The idea behind correlation analysis as
a feature selection method is to only retain the set of features that show a
strong linear relationship with the target variable and discard the features
that show a weak relationship. The assumption is made that features that
show a weak linear relationship with the target variable are not informative
and could in turn negatively affect the prediction results. Correlation ranges
from -1 to 1, in which 1 represents a perfect positive linear relationship and
-1 a perfect negative linear relationship. Different values within this range
of -1 and 1 will serve as a minimum correlation value that features have to
meet in order to be selected. For example, a minimum value of 0.4 would
mean that only features with a correlation value of ≤ -0.4 or ≥ 0.4 will be
selected.
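A minimal sketch of this selection rule, using scipy's point-biserial correlation and the example threshold of 0.4 mentioned above (variable names continue the earlier hypothetical snippets):

```python
from scipy.stats import pointbiserialr

def select_by_correlation(X, y, threshold=0.4):
    """Keep features whose absolute point-biserial correlation with
    the binary target meets the minimum threshold."""
    selected = []
    for col in X.columns:
        r, _ = pointbiserialr(y, X[col])
        if abs(r) >= threshold:
            selected.append(col)
    return selected

corr_features = select_by_correlation(X_train, y_train, threshold=0.4)
```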
However, correlation analysis is restricted to linear relationships.
Consequently, a tree-based feature selection method will also be
utilized: specifically, feature selection through Random Forest, which
enables the exploration of nonlinear relationships. Important features
are identified by their impurity
importance, which is often measured by the Gini impurity index (Louppe,
2014; Nembrini et al., 2018). This operation will be carried out utilizing the
Boruta package (Kursa & Rudnicki, 2010).
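The original Boruta package (Kursa & Rudnicki, 2010) is an R implementation; the sketch below uses the Python port BorutaPy, which is an assumption about the tooling rather than a statement of what was actually used:

```python
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

# Random Forest whose (Gini) impurity importances drive Boruta.
rf = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=42)

# BorutaPy expects NumPy arrays rather than DataFrames.
boruta = BorutaPy(rf, n_estimators="auto", random_state=42)
boruta.fit(X_train.values, y_train.values)

# Features confirmed as important.
boruta_features = X_train.columns[boruta.support_].tolist()
```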
Both feature selection via correlational analysis and feature selection
through Random Forest will yield a selection of features. To determine which
of these two will be utilized during the hyperparameter tuning and testing
phase, their performance on a Random Forest Classifier will be compared
using the F1-score as an evaluation metric. See section 4.6 for additional
information regarding the rationale behind the selected evaluation metric.
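The comparison between the two candidate feature sets might be sketched as follows, scoring each subset with a Random Forest under 10-fold cross-validation (variable names continue the earlier snippets):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def mean_f1(features):
    """Mean 10-fold cross-validated F1-score of a Random Forest
    restricted to the given feature subset."""
    clf = RandomForestClassifier(random_state=42)
    return cross_val_score(
        clf, X_train[features], y_train, cv=10, scoring="f1"
    ).mean()

# Keep whichever selection performs better on the validation folds.
best_features = max([corr_features, boruta_features], key=mean_f1)
```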
This section describes the algorithms used during the classification prediction
analysis and, if available, their hyperparameter search space.
4.5.3 K-nearest-neighbor
K-nearest neighbor (KNN) is a supervised non-parametric machine learning
algorithm that can be used for classification or regression (Azadkia, 2019).
Its performance is heavily reliant on the employed distance metric and
the number of neighbors k.
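Since KNN's performance hinges on the distance metric and the number of neighbors, its tuning could be sketched as a grid search; the search space shown is a hypothetical example, not the thesis's actual grid:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    "n_neighbors": [3, 5, 7, 9, 11],
    "metric": ["euclidean", "manhattan", "chebyshev"],
    "weights": ["uniform", "distance"],
}

# 10-fold cross-validated grid search, scored on F1.
knn_search = GridSearchCV(
    KNeighborsClassifier(), param_grid, cv=10, scoring="f1"
)
knn_search.fit(X_train[best_features], y_train)
print(knn_search.best_params_, knn_search.best_score_)
```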
4.6 Evaluation
The evaluation metrics used for the classification task are the area
under the receiver operating characteristic curve (ROC-AUC), the
F1-score, and a confusion
matrix. To gain more insights into the models’ performance, error analysis
will be performed. Finally, model comparison will be performed by measuring
statistical differences between the models.
Although steps were taken in this thesis to take the imbalance of the
class distribution into account when training the models (i.e.,
stratification), the decision was made to also include the F1-score as
an evaluation metric. The F1-score is the harmonic mean of the precision
and recall scores. It focuses more on the classification performance of
the model with regard to the positive class (Raschka, 2014), that is,
the students who are at-risk regarding their overall mental health. In
comparison to ROC-AUC, it is a more robust metric against imbalanced
datasets (DeVries et al., 2021).
Lastly, the confusion matrix provides a summary table of the number of
true and false positives as well as true and false negatives (Rosenbusch et
al., 2021).
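Concretely, with precision P and recall R, F1 = 2PR / (P + R). All three metrics can be computed with scikit-learn, as sketched below (continuing the hypothetical variable names used earlier):

```python
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

y_pred = knn_search.predict(X_test[best_features])
y_proba = knn_search.predict_proba(X_test[best_features])[:, 1]

print("F1:     ", f1_score(y_test, y_pred))       # harmonic mean of P and R
print("ROC-AUC:", roc_auc_score(y_test, y_proba))
print(confusion_matrix(y_test, y_pred))           # [[TN, FP], [FN, TP]]
```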
5 results
This section describes the results of the classification prediction task. The
first part of this section consists of the cross-validation results. Namely, the
feature selection results and the classification scores that were obtained on
the validation sets. The second portion of this section will focus on the
results that were obtained during testing. Both validation and test results
are described by F1-scores, ROC-AUC scores, and confusion matrices. The
third portion of this section will focus on error analysis. This section will
conclude with a model comparison.
SVM performed the best out of all machine learning models, obtaining a
similar score to BLR on both the F1 and ROC-AUC metrics.
Table 2: Validation results
The confusion matrices show that all models are capable of distinguishing
between the negative class and positive class given the number of true
negatives and true positives. However, BLR, SVM, and RF tend to produce
fewer false negatives in comparison to the other models.
The F1-scores and ROC-AUC scores for each model on the test set are
presented in Table 3. On both the F1-score and ROC-AUC, the outcomes
reveal that BLR performs slightly better than SVM. The run duration in
seconds displayed in the first column indicates that, with the exception of
NB, the baseline logistic regression model is significantly faster than the
machine learning and deep learning models.
Figure 7 depicts the confusion matrices for all models on the test set.
Similar to the validation results, the confusion matrices demonstrate that all
models appear to be able to differentiate between the negative and positive
classes, as indicated by the true negative and true positive rates. Also similar
to the validation results, BLR, SVM, and RF tend to have a more balanced
distribution between true and false negatives and positives, in contrast
to NB, KNN, and FNN, where the number of false negatives exceeds the
number of false positives.
This section discusses the results of the error analysis. The results
reveal a couple of things. Firstly, there is an indication that extreme
cases of mental health were overall easier to classify than less extreme
cases, where extreme cases can reflect either a positive experience of
mental health, such as being very satisfied with one's life, or a
negative experience, such as describing
one’s mood as very negative. More specifically, when an individual has an
extremely positive or negative experience regarding a certain aspect of their
life, the number of correct classifications tends to be higher and the number
of incorrect classifications tends to be lower. This was a consistent
finding across all models. An example is given in Figure 8, which shows
the correct and incorrect classifications of the BLR model on three features.
As seen in the figure, when someone is either very satisfied with their
life, indicated by a score of 8 or 9, or not satisfied at all, indicated by a score
of 1 or 2, the classifications are mostly correct, with only a small number
of incorrect classifications. The same goes for "overall mood" and
"mental alertness."
Figure 8: Incorrect and correct classification for BLR model on three features
Figure 9: SVM false positives and false negatives on the feature life satisfaction
Figure 10: SVM false positives and false negatives on the feature life satisfaction
This section discusses the model comparison results between the baseline
BLR and all other models. Shapiro-Wilk normality tests showed that the
score distributions from each model, for both F1-scores and ROC-AUC
scores, appear to be normally distributed, as shown in Table 4.
Two-sample t-tests were therefore used to compare each model's F1-score
and ROC-AUC score distributions against the baseline BLR; the results
are shown in the tables below.
Model df t p-value
NB 1998 70.486 .000∗
KNN 1998 23.677 .000∗
SVM 1998 -1.632 .103
RF 1998 59.453 .000∗
FNN 1998 100.646 .000∗
∗p-value < .05
Model df t p-value
NB 1998 53.898 .000∗
KNN 1998 15.282 .000∗
SVM 1998 -1.780 .075
RF 1998 67.439 .000∗
FNN 1998 109.860 .000∗
∗p-value < .05
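As an illustration of this comparison procedure, the sketch below checks normality with a Shapiro-Wilk test and then runs a two-sample t-test against the baseline; the input score arrays (e.g., per-resample F1-scores for BLR and another model) are hypothetical.

```python
from scipy.stats import shapiro, ttest_ind

def compare_to_baseline(baseline_scores, model_scores, alpha=0.05):
    """Two-sample t-test of a model's scores against the baseline,
    preceded by a Shapiro-Wilk normality check."""
    _, p_norm = shapiro(model_scores)
    t, p = ttest_ind(baseline_scores, model_scores)
    return {"t": t, "p": p,
            "normal": p_norm > alpha, "significant": p < alpha}

# Hypothetical arrays of 1,000 scores each (matching df = 1998 above).
result = compare_to_baseline(blr_f1_scores, svm_f1_scores)
```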
6 discussion
Strengths of this thesis include the use of a systematic approach to
model comparison; addressing limitations of previous studies, such as
using a larger sample size; and the optimization of hyperparameter
tuning. In addition, the relative performance of machine learning
algorithms and traditional methods in predicting student mental health
was previously unclear. The findings of this thesis show that traditional methods,
such as binary logistic regression, should continue to play a key role in mental
health research. This is especially relevant given that traditional methods
are not only simple to implement but are also easily interpretable, whereas
machine learning methods are often viewed as "black boxes" that do not
provide the user with readily available information regarding the importance
of individual features (Kuhle et al., 2018).
However, several limitations need to be acknowledged. The first limita-
tion is the self-reporting nature of the questionnaire. Although participants
were encouraged to provide accurate responses, there is always the possibility
that they misunderstood the questions or provided socially desirable
responses. That being said, there is no real alternative other than
evaluating the psychological status of individuals in a clinical
setting, which is infeasible on a large scale. A second limitation is
that only a relatively small number of features were utilized. With more
variables or observations, it is plausible that the machine learning algorithms
included in this thesis could have been more discriminative. Future research
should investigate methods for utilizing not only the maximum number of
observations but also the maximum number of features. A third limitation
is that while the run-time of models was considered, the primary focus
of this thesis was the comparison of performance. However, in real-life
settings, implementation and
applicability could be just as important, which should be investigated by
future research. Lastly, the quality of the features used in the prediction
task could be a potential limitation. The average F1-scores and ROC-AUC
scores achieved by the models included in this thesis, using only
demographic and contextual characteristics, are around 75%. As demonstrated by the error
analysis, although the models are capable of differentiating between the
positive and negative classes, there is a considerable number of inaccurate
classifications, as seen in the false positive and false negative rates. While
demographics alone can predict mental health to a limited extent, these
results seem to suggest that they do not provide sufficient information on
their own. For example, Sano et al. (2015) found an average of 80% accuracy
utilizing only personal data. However, when data from wearable sensors
was included, the predictive performance was increased. Typically, risk
prediction models use demographic patient data to forecast the occurrence
of a given unfavorable event in conjunction with psychological patient traits
that can be directly related to a particular mental condition, and these
features were not present in the current study (Shatte et al., 2019; Thieme
et al., 2020; Sheldon et al., 2012). Future research could study this topic
further by comparing the predictive performance of models containing solely
features relating to demographic characteristics to those containing both
demographic characteristics and various psychological aspects.
7 conclusion
Composite scores
Overall mental health: Overall MHQ
Dimensions of mental health: Core Cognition, Complex Cognition, Mood & Outlook, Drive & Motivation, Social Self, Mind-Body Connection
Contextual factors
Age
Gender
Time of day
Country
State
Zip Code
Race/Ethnicity
Income
Employment
Job Role
Household
Education
How satisfied are you with your life in general?
In general, I get as much sleep as I need:
How regularly do you engage in physical exercise (30 minutes or more)?
How regularly do you socialize with friends in person?
Please select which substances you consume regularly (at least every
week)
Do you have a diagnosed medical disorder that significantly impacts
your way of life?
Medical Condition
Are you currently seeking treatment for any mental health concerns?
You answered "No" to the previous question. Please explain further by
selecting the following/ What kind of mental health support have you
sought/are you currently seeking?
Life Trauma
How would you describe your overall mood right now on a scale from
very negative to very positive?
How mentally alert are you feeling right now?
Approximately how many hours did you sleep last night?
Approximately how many hours ago was your last meal?
Physical complaints
Are you currently pregnant?
Covid-health impact
Covid-financial and social impact