Optimized Stress Level Classification
Abstract: This research explores an innovative approach to stress level classification using machine learning techniques applied to environmental and physical activity data. The dataset consists of features such as humidity, temperature, and step count, which are used to predict an individual’s stress level. The methodology involves preprocessing, feature selection, normalization, decision-making using the TOPSIS method, and classification with various machine learning models. A comparative study of classifiers is conducted, and the Logistic Regression model emerges as the best performer, offering high accuracy and reliability. The study contributes to stress detection technologies by providing a robust and interpretable machine learning framework.

Index Terms: Stress Classification, Machine Learning, Feature Selection, TOPSIS, Logistic Regression

I. INTRODUCTION

Stress is a prevalent issue affecting individuals’ mental and physical health, often leading to conditions such as anxiety, depression, and cardiovascular diseases [5]. Early identification and intervention are crucial in mitigating its negative impact. Traditional stress detection methods rely on self-reported surveys and physiological sensors [5], which can be intrusive and require specialized equipment.

Advancements in wearable technology and environmental sensors have enabled the collection of real-time data on various factors influencing stress levels [5]. These include physical activity, ambient temperature, and humidity, which have been found to correlate with stress responses [6]. Leveraging machine learning techniques for stress classification based on such non-invasive data sources presents a promising avenue for scalable and cost-effective stress monitoring solutions [7].

Furthermore, the integration of decision-making techniques such as the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) enhances the interpretability of stress classification results [8]. Machine learning models, when combined with structured decision-making processes, provide a more comprehensive and accurate assessment of stress levels. Machine learning techniques have also been used to detect diseases such as breast cancer [8]–[11] and brain tumors [13], [15]. The ability to analyze real-time data from wearable devices and environmental sensors allows for continuous stress monitoring without the need for intrusive methods.

Recent advances in artificial intelligence and data-driven models have enabled the development of personalized stress prediction systems. These models utilize historical data to establish behavioral patterns and predict stress levels based on real-time inputs [5]. The ability to continuously monitor stress patterns can facilitate timely interventions, thereby improving overall well-being. Additionally, integrating stress classification systems into everyday devices such as smartphones and smartwatches can enhance accessibility and usability, making stress detection a seamless part of daily life [12].

This research applies machine learning techniques to classify stress levels based on environmental and activity-related parameters such as humidity, temperature, and step count. The objective is to develop an efficient model that can predict stress levels with high accuracy, providing a practical tool for continuous stress assessment and management [5]. By incorporating the TOPSIS decision-making approach and evaluating multiple classifiers, this study aims to establish an optimal framework for stress detection using readily available data sources. The findings from this study can contribute to the development of real-time stress monitoring applications, enabling individuals to take proactive measures in managing their stress levels effectively.

The rest of the paper is structured as follows. The related work is summarized in Section II. Section III describes the proposed methodology. The comparative analysis and experimental findings are covered in Section IV. Finally, concluding remarks are given in Section V.

II. RELATED WORK

Stress detection has been a widely studied field, with research spanning physiological, behavioral, and environmental data sources [2]. Traditional approaches have primarily relied on physiological signals such as heart rate variability (HRV), electrodermal activity (EDA), and electroencephalogram (EEG) readings to assess stress level. These biometric signals are often captured through wearable devices and medical-grade sensors, enabling researchers to develop stress classification models using machine learning techniques such as Support Vector Machines (SVM), Decision Trees, and Neural Networks [2]. While these methods have demonstrated high accuracy, their reliance on specialized hardware limits their scalability and accessibility for broader applications.
More recent research has explored behavioral data, including physical activity, sleep patterns, and mobile phone usage, as alternative indicators of stress [2]. Studies have leveraged data from smartphone sensors, accelerometers, and GPS tracking to analyze movement patterns, step counts, and user interactions [2]. Machine learning models such as Random Forest, K-Nearest Neighbors (KNN), and deep learning-based approaches have been used to classify stress levels based on these behavioral patterns [2]. However, challenges remain in ensuring generalizability across different populations, as stress responses can vary significantly between individuals.

A growing body of work has also focused on environmental factors, such as temperature, humidity, noise levels, and air quality, in stress classification. Researchers have found correlations between environmental conditions and stress, leading to the development of hybrid models that integrate multiple data sources for improved accuracy. Feature selection techniques, including Information Gain, Correlation Analysis, and Jaccard Similarity, have been applied to refine predictive models and enhance classification performance [5].

Despite advancements in stress classification, limited research has explored integrating Multi-Criteria Decision-Making (MCDM) techniques like the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) with machine learning models [4]. TOPSIS ranks stress levels using weighted decision matrices, enhancing interpretability and structured classification [2]. By combining feature selection, decision-making, and classification, this study aims to develop a scalable and accurate stress detection framework. Logistic Regression provides probabilistic outputs that help in understanding feature importance.

Our approach improves upon traditional models [2] by integrating Logistic Regression with the TOPSIS ranking method, enhancing feature selection and decision-making [5]. This structured methodology refines classification accuracy by prioritizing significant stress indicators while reducing noise [7]. Compared to complex tree-based models like Random Forest and deep learning methods, our model maintains high accuracy with reduced computational overhead. The combination of feature selection, ranking, and Logistic Regression ensures a scalable and accessible framework for stress detection without relying on specialized biometric sensors [2].

III. PROPOSED METHODOLOGY

The primary objective of this study is to develop a robust and interpretable stress classification model using environmental and physical activity data. The methodology involves data preprocessing (handling missing values, normalization), feature selection (Information Gain, Correlation, Jaccard Similarity), and stress level ranking with TOPSIS. The selected features are classified using models like Logistic Regression, KNN, SVM, Naïve Bayes, Decision Tree, AdaBoost, XGBoost, ANN, and Random Forest, with performance evaluated through accuracy, precision, recall, sensitivity, and specificity. This approach enhances model reliability and ensures that only the most relevant features improve classification accuracy, as illustrated in Fig. 1 [3].

A. Dataset Description

The dataset comprises 2001 records with three input variables (humidity, temperature, and step count) and one target variable, stress level (0: low, 1: moderate, 2: high) [2]. Humidity and temperature reflect environmental conditions, while step count represents physical activity [6]. The dataset includes numerical features (humidity, temperature, step count) and a categorical target variable (stress level). Correlation analysis highlights the influence of environmental factors and physical activity on stress levels, with a balanced distribution ensuring unbiased model training [7].

Fig. 1. The overall work flow of the proposed model

B. Dataset Used

To enhance model performance, the dataset underwent preprocessing to handle inconsistencies, followed by feature selection to identify the most relevant attributes. The data was then split into training and testing sets, ensuring a structured approach for evaluation, as presented in TABLE I.

TABLE I
DISTRIBUTION OF SAMPLES IN TRAINING AND TESTING SETS

Class               Training Set   Testing Set
Low Stress (0)      400 samples    101 samples
Medium Stress (1)   632 samples    158 samples
High Stress (2)     568 samples    142 samples

C. Data Preprocessing

Data preprocessing is a crucial step in preparing the dataset for model training and evaluation [2]. Raw data often contain inconsistencies, missing values, and features with different scales, which can impact model performance. Therefore, several preprocessing techniques were applied to ensure the quality and efficiency of the stress classification model [5].
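The preprocessing steps named in this subsection (mean imputation, Min-Max scaling to [0, 1], and a stratified train/test split) can be sketched in plain Python as below. This is only an illustration under stated assumptions, not the authors' implementation: the helper names, the toy values, and the 20% test fraction (chosen to roughly match the proportions in TABLE I) are all hypothetical.

```python
import random
from collections import defaultdict


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def min_max_scale(column):
    """Scale values to [0, 1]; a constant column is mapped to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy values, not taken from the paper's dataset.
temperature = [20.0, None, 30.0, 40.0]
scaled = min_max_scale(impute_mean(temperature))

labels = [0] * 10 + [1] * 20 + [2] * 10
train_idx, test_idx = stratified_split(labels)
```

In practice the scaling bounds and imputation means would usually be computed on the training split only and then reused for the test split, so that no test information leaks into preprocessing; the text does not say which order the authors used.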
a) Handling Missing Values: One of the primary challenges in real-world datasets is missing data [7]. The dataset used in this study included various environmental and activity-based parameters such as temperature, humidity, step count, and other physiological indicators [6]. Missing values in numerical attributes were handled using mean imputation, where missing entries were replaced with the mean of the respective feature. This approach helps maintain data integrity without significantly altering the distribution [5].

b) Feature Normalization: Since the dataset consists of numerical values with different ranges, it was essential to normalize the features [2]. For instance, temperature values range from 20°C to 40°C, while step counts can vary from a few hundred to thousands. If left unnormalized, features with larger numeric values may dominate model learning, leading to biased predictions [7]. To address this issue, Min-Max normalization was applied to scale all numerical attributes to a fixed range of [0,1], ensuring that each feature contributes equally to the model [5]. The normalization function is given in equation (1):

\[ X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{1} \]

c) Dataset Splitting: After preprocessing, the dataset was split into training and testing sets to ensure unbiased model evaluation [2]. The training set was used to develop the model, while the testing set assessed its generalization capability [5]. A stratified split was performed to maintain class balance, ensuring that each class was proportionally represented in both subsets [7]. The dataset split details are provided in TABLE I.

D. Feature Selection

To enhance model performance and reduce computational complexity, feature selection was performed. Three key techniques were utilized:
• Information Gain: Measures the importance of each feature in predicting stress levels based on entropy reduction.
• Correlation Analysis: Evaluates the relationship between different features and their impact on stress classification. Features with high correlation were either removed or combined to avoid redundancy.

a) Constructing the Decision Matrix: The decision matrix consists of multiple features that influence stress levels, such as environmental parameters (e.g., temperature, humidity) and activity data (e.g., step count, heart rate) [2].

b) Normalization of Decision Matrix: To ensure fair comparisons, feature values are normalized using equation (2) [5].

\[ X'_{ij} = \frac{X_{ij}}{\sqrt{\sum_{i=1}^{n} X_{ij}^{2}}} \tag{2} \]

where $X'_{ij}$ is the normalized value, and $X_{ij}$ is the original feature value.

c) Weighting the Normalized Matrix: Feature importance is determined based on predefined weights using statistical techniques such as Information Gain, Correlation, and Jaccard Similarity [5]. The weighted normalized matrix is computed as follows:

\[ V_{ij} = w_j \cdot X'_{ij} \tag{3} \]

where $w_j$ represents the weight assigned to feature $j$.

d) Determination of Ideal and Negative-Ideal Solutions: The ideal and negative-ideal solutions are identified by equation (4).

\[ A^{+} = \{ \max_{i} V_{ij} \mid j \in J \}, \quad A^{-} = \{ \min_{i} V_{ij} \mid j \in J \} \tag{4} \]

The ideal solution represents the best-case scenario, while the negative-ideal solution represents the worst-case scenario [8].

e) Distance Calculation: The Euclidean distance of each alternative from the ideal and negative-ideal solutions is calculated using equation (5).

\[ D_i^{+} = \sqrt{\sum_{j=1}^{m} \left( V_{ij} - A_j^{+} \right)^{2}}, \quad D_i^{-} = \sqrt{\sum_{j=1}^{m} \left( V_{ij} - A_j^{-} \right)^{2}} \tag{5} \]
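As a rough illustration of equations (2) to (5), the sketch below runs the TOPSIS steps on a tiny decision matrix in plain Python. Several things here are assumptions rather than details from the paper: the function name `topsis`, the toy matrix and equal weights, treating every column as a benefit criterion (as equation (4) is written), and the final closeness score $D_i^{-} / (D_i^{+} + D_i^{-})$, which is the conventional last step of TOPSIS but does not appear in the equations above.

```python
import math


def topsis(matrix, weights):
    """Apply Eqs. (2)-(5) to a decision matrix (rows = alternatives, columns = features).

    Assumes every column has at least one non-zero entry and that the rows are
    not all identical (otherwise a division by zero would occur).
    """
    n_cols = len(matrix[0])

    # Eq. (2): divide each column by its Euclidean norm.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]

    # Eq. (3): multiply the normalized values by the feature weights.
    weighted = [
        [weights[j] * row[j] / norms[j] for j in range(n_cols)] for row in matrix
    ]

    # Eq. (4): column-wise best (ideal) and worst (negative-ideal) values.
    ideal = [max(row[j] for row in weighted) for j in range(n_cols)]
    negative_ideal = [min(row[j] for row in weighted) for j in range(n_cols)]

    # Eq. (5): Euclidean distance of each row to both reference points.
    d_plus = [
        math.sqrt(sum((row[j] - ideal[j]) ** 2 for j in range(n_cols)))
        for row in weighted
    ]
    d_minus = [
        math.sqrt(sum((row[j] - negative_ideal[j]) ** 2 for j in range(n_cols)))
        for row in weighted
    ]

    # Conventional TOPSIS closeness score (assumption: not shown in the text above).
    closeness = [dm / (dp + dm) for dp, dm in zip(d_plus, d_minus)]
    return d_plus, d_minus, closeness


# Hypothetical toy matrix with two features and equal weights.
toy_matrix = [[3.0, 4.0], [4.0, 3.0], [0.0, 0.0]]
d_plus, d_minus, closeness = topsis(toy_matrix, weights=[0.5, 0.5])
```

A higher closeness score means the row lies nearer the ideal solution; in the toy matrix the first two rows tie and the all-zero row scores lowest.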
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{8} \]
c) Recall (Sensitivity): Measures the model’s ability to
correctly identify stress cases. A high recall score suggests
the model effectively detects stress instances without missing
significant cases. Mathematically, recall is given by equation
(9).
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{9} \]

Fig. 2. Confusion matrix obtained for the model
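Equations (8) and (9) can be evaluated per class from a multi-class confusion matrix by treating one class as positive and the rest as negative. The sketch below shows this in plain Python; the function name and the 3x3 matrix are hypothetical and are not the values from Fig. 2 or the paper's tables.

```python
def precision_recall(confusion, positive):
    """Per-class precision (Eq. 8) and recall (Eq. 9).

    confusion[t][p] is the number of samples with true class t predicted as class p.
    """
    n_classes = len(confusion)
    tp = confusion[positive][positive]
    # False positives: other true classes predicted as the positive class.
    fp = sum(confusion[t][positive] for t in range(n_classes) if t != positive)
    # False negatives: positive-class samples predicted as another class.
    fn = sum(confusion[positive][p] for p in range(n_classes) if p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical confusion matrix for classes 0 (low), 1 (moderate), 2 (high).
toy_confusion = [
    [8, 2, 0],
    [1, 18, 1],
    [0, 0, 10],
]
low_precision, low_recall = precision_recall(toy_confusion, positive=0)
high_precision, high_recall = precision_recall(toy_confusion, positive=2)
```

For the toy matrix, class 0 has 8 true positives, 1 false positive, and 2 false negatives, so its precision is 8/9 and its recall is 8/10.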
TABLE II presents the classification performance of various machine learning models trained using the TOPSIS-based approach [8]. The Logistic Regression model achieves the highest accuracy of 99.75%, demonstrating superior performance in stress classification [7]. Other models, including KNN, Decision Tree, Random Forest, AdaBoost, and XGBoost, exhibit comparable accuracy, precision, recall, and sensitivity, highlighting their robustness [5]. The SVM and Naïve Bayes models show slightly lower accuracy but still maintain competitive results [2]. These findings reinforce the effectiveness of traditional and ensemble-based models in achieving reliable stress detection [12].

ECG-based methods, and deep learning approaches [7]. While these models have shown promising results, their accuracy remains lower than that of our proposed method [12].

Our proposed Logistic Regression model outperforms all previous approaches, achieving an impressive accuracy of 99.75% [8]. This superior performance is attributed to the robust feature selection process, optimized training strategy, and the model’s ability to generalize well across different stress levels [4].

TABLE IV
PERFORMANCE COMPARISON WITH EXISTING MODELS
TABLE III
CLASSIFICATION REPORT FOR STRESS-LYSIS DATASET