The proposed algorithm was applied to improve the classification performance of the UCI-HAR dataset via a feature selection approach. In this section, the experimental settings, the results of the proposed approach, the comparisons with other models, and the classification rates for the concerned dataset with comparison to other studies in the literature are presented. Moreover, a critical analysis of the obtained results using the proposed HAR system is given.
5.1. UCI-HAR Dataset
The performance of GBOGWO was exhaustively compared to a set of 11 optimization algorithms for feature selection. Basic continuous versions of the GBO, GWO, genetic algorithm (GA) [53], differential evolution (DE) [54], moth–flame optimization (MFO) [55], sine–cosine algorithm (SCA) [56], Harris hawks optimization (HHO) [57], and manta ray foraging optimization (MRFO) [58] were implemented, in addition to the binary particle swarm optimization (B-PSO) [59], binary bat algorithm (B-BAT) [60], and binary sine–cosine algorithm (B-SCA) [56]. The settings and parameter values of all algorithms used in the comparison are provided in Table 2.
As a classification task, the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) rates define the commonly used performance metrics for HAR systems.
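For reference, the standard definitions of the metrics used below, expressed in terms of these four counts, are:

```latex
\begin{align*}
\mathrm{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}, &
\mathrm{Precision\ (PR)} &= \frac{TP}{TP + FP},\\[4pt]
\mathrm{Sensitivity\ (Sens.)} &= \frac{TP}{TP + FN}, &
\mathrm{Specificity\ (Spec.)} &= \frac{TN}{TN + FP}.
\end{align*}
```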
Evaluation metrics of the comparison involve the mean (M) and standard deviation (std) of the precision (PR), the M and std of the number of selected features (# F), the percentage of feature reduction (red (%)), and the execution time. The Wilcoxon statistical test was used to determine the degree of significant difference between GBOGWO and each compared algorithm in terms of the null hypothesis indicator H and the significance level (p-value). Each algorithm was repeated for 10 independent runs; this may be considered a practical lower bound for examining the behavior of such stochastic optimization techniques, dictated by the long execution time of training a multi-class SVM on high-dimensional records (each training record contains 561 features). The classification rates obtained by the proposed approach were compared to those of the original paper of the dataset under study as well as a recent study in the literature. Moreover, the performance of GBOGWO was compared to filter-based methods commonly used in feature-selection applications, namely the t-test and ReliefF [61].
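The per-algorithm significance check described above can be sketched as a two-sided Wilcoxon rank-sum test over the 10 run-wise precision values of two algorithms. This is an illustrative Python sketch, not the authors' Matlab code; the sample values are invented for demonstration, a normal approximation gives the p-value, and tie correction is omitted for brevity.

```python
import math

def rank_sum_test(a, b):
    """Return (H, p): H = 1 if the null hypothesis of equal medians is
    rejected at the 5% level (normal approximation, no tie correction)."""
    pooled = sorted((v, i) for i, v in enumerate(a + b))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):            # assign average ranks to tied values
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = avg
        i = j + 1
    n1, n2 = len(a), len(b)
    w = sum(ranks[:n1])               # rank sum of the first sample
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    p = 1 - math.erf(abs(z) / math.sqrt(2))  # two-sided tail probability
    return (1 if p < 0.05 else 0), p

# Invented 10-run precision samples (%) for two algorithms.
gbogwo = [98.2, 98.1, 98.0, 98.3, 98.1, 98.2, 98.0, 98.1, 98.2, 98.1]
other  = [97.3, 97.1, 97.4, 97.2, 97.0, 97.3, 97.2, 97.1, 97.4, 97.2]
H, p = rank_sum_test(gbogwo, other)
print(H, round(p, 4))
```

With two clearly separated samples such as these, the test rejects the null hypothesis (H = 1) with a very small p-value, matching the reporting style (H, p) used in the comparison tables.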
All algorithms were implemented in the Matlab 2018a (MathWorks Inc., Natick, MA, USA) environment on a machine with a 2.6 GHz CPU and 10 GB of RAM.
5.2. Numerical Results of Experiments
Table 3 summarizes the results obtained for the proposed HAR model using various optimizers. GBOGWO as a feature selector outperforms the other techniques: the SVM model gives a mean PR of 98.13% using 304 features on average. Furthermore, the average accuracy reaches 98%. Thus, the number of features is reduced from 561 to 304, which achieves a reduction ratio of 45.8%. The standard deviations of the proposed model and of GA are the smallest in this comparison (0.12 and 0.119, respectively), which reflects the good precision of the feature-selection approach for this problem. MRFO found a smaller feature set with a cardinality of 286.6 on average (i.e., a 52.12% reduction ratio); however, it seems that some important features were missing, as its mean PR dropped to 97.77%. Conversely, HHO selected more features (approximately 428.6 on average), yet its mean PR was only 97.25%. The results of the Wilcoxon test show that the performance of GBOGWO is statistically distinguishable: the p-value is <0.05 for all pairwise comparisons, together with H = 1, which reflects the superiority of the proposed technique. Under the experimental settings shown in Table 2, GBOGWO with the multiclass SVM model consumes 50.8 min on average for a single run. This execution time is very close to those of the faster optimizers, such as GWO, B-SCA, SCA, and DE, with 49.02, 49.35, 49.8, and 49.8 min, respectively. In comparison, HHO takes a notably long execution time of 128.3 min.
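The reduction ratios quoted above follow directly from the average feature counts, as this small check illustrates (feature counts are the run averages from Table 3):

```python
# Reduction ratio relative to the full 561-feature UCI-HAR set.
def reduction_ratio(selected, total=561):
    return round(100 * (1 - selected / total), 1)

print(reduction_ratio(304.0))    # GBOGWO -> 45.8 (%)
print(reduction_ratio(428.6))    # HHO    -> 23.6 (%)
```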
Figure 2 shows a summary of the results reported in Table 3 in a normalized fashion, which gives a clearer intuition about the behavior of GBOGWO according to the different evaluation metrics.
The confusion matrix, presented in
Table 4, provides the rates of PR, sensitivity (Sens.), and specificity (Spec.) for each single activity. Walking downstairs (WD), lying down (LD), and walking (WK) were the best-recognized activities, with PR rates of 100%, 100%, and 99.2%, respectively, while the worst PR rate, 93.57%, was for the standing (SD) activity. The recall of most activities was high, except for sitting (ST) at 92.46%. It can also be noticed that the Spec. for all activities is quite good (>98.51%). The proposed model was able to distinguish well between the group of periodic activities (WK, WU, WD) and that of static or single-transition activities (ST, SD, LD), where the rate of misclassification is almost zero (only one wrong label between WU and ST in
Table 4).
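The per-activity rates in Table 4 follow the usual one-vs-rest reading of a multi-class confusion matrix. The sketch below shows the computation on an invented 3-class matrix (rows are true classes, columns are predicted classes), not the paper's six-activity matrix:

```python
def per_class_metrics(cm):
    """One-vs-rest precision, sensitivity, and specificity per class."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    out = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # missed members of class c
        fp = sum(cm[r][c] for r in range(n)) - tp  # wrongly labeled as c
        tn = total - tp - fn - fp
        out.append({
            "precision":   tp / (tp + fp),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
        })
    return out

# Illustrative 3-class confusion matrix.
cm = [[50,  2,  0],
      [ 3, 45,  2],
      [ 0,  1, 49]]
m = per_class_metrics(cm)
print(round(m[0]["precision"], 3))   # 50 / (50 + 3) -> 0.943
```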
Figure 3 presents a 2D visualization of the basic feature records of the activities (i.e., with 561 features), obtained via principal component analysis and clustering. In Figure 3, (WK, WU, WD), in (dark green, blue, black), can be linearly separated from (SD, ST, LD), in (red, yellow, light green), except for very few records that are assigned to the wrong clusters between WU and ST. On the other hand, there is a high degree of similarity between the extracted features of SD and ST. This similarity complicates the classification task; thus, there is notable confusion between SD and ST (on average, 36 wrong labels between them).
To summarize the conducted experiments, the proposed feature set for the UCI-HAR dataset in [
43] was useful for the targeted recognition task; however, discarding some misleading features using the proposed technique proved very useful for improving the overall performance of such an HAR model. The feature set was successfully reduced by 45.8% while, at the same time, the mean PR reached 98.13% and the mean accuracy was 98%.
5.4. Comparison with Filter-Based Methods
Filter-based methods such as statistical tests and the ReliefF algorithm [62] are commonly used for feature selection tasks. Such methods are time-efficient, and their classifier-independent nature simplifies passing the selected feature set to any further classifier [63]. As a statistical test, the t-test examines the similarity between classes for each individual feature via mean and standard deviation calculations. Features can then be ranked according to their significance and, finally, a cut-off threshold is defined to select a feature set. The ReliefF algorithm applies a penalty scheme in which features that map to different values for neighbors of the same class are penalized (i.e., given negative weight) and otherwise rewarded. The feature subset with non-negative weights is then expected to better represent the concerned classes.
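The penalty/reward scheme just described can be sketched as the basic (binary) Relief weight update; ReliefF generalizes this to k neighbors and multiple classes. The toy data below are invented: feature 0 is informative, feature 1 is pure noise.

```python
import math, random

def relief_weights(X, y, n_iter=100, seed=0):
    """Basic Relief: reward features that differ from the nearest miss,
    penalize features that differ from the nearest hit."""
    rng = random.Random(seed)
    n_feat = len(X[0])
    w = [0.0] * n_feat

    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    for _ in range(n_iter):
        i = rng.randrange(len(X))
        hit = min((j for j in range(len(X)) if j != i and y[j] == y[i]),
                  key=lambda j: dist(X[i], X[j]))   # nearest same-class sample
        miss = min((j for j in range(len(X)) if y[j] != y[i]),
                   key=lambda j: dist(X[i], X[j]))  # nearest other-class sample
        for f in range(n_feat):
            w[f] += (abs(X[i][f] - X[miss][f])
                     - abs(X[i][f] - X[hit][f])) / n_iter
    return w

# Two classes: feature 0 separates them, feature 1 is noise.
rng = random.Random(1)
X = [[cls + rng.gauss(0, 0.1), rng.gauss(0, 1.0)]
     for cls in (0, 1) for _ in range(20)]
y = [cls for cls in (0, 1) for _ in range(20)]
w = relief_weights(X, y)
print(w[0] > w[1])   # the informative feature receives the larger weight
```

Thresholding the weights at zero then yields the selected subset, mirroring the "non-negative weights" rule above.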
Table 6 gives the results of the comparison between the proposed model and the filter-based approaches using the t-test and ReliefF. ReliefF was able to extract the smallest feature set, achieving a reduction ratio of 67%, but GBOGWO was outstanding in terms of the resulting accuracy, sensitivity, and precision. The feature set selected by the t-test was enlarged to 350 dimensions, but this did not improve the performance. In Table 6, for a typical value of the fitness weighting parameter, the GBOGWO fitness was 97.15%; for a weighting biased more towards reducing the feature set, the fitness of GBOGWO reaches 88.37%. In both cases, the proposed approach is superior to the examined filter-based methods.
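Wrapper-selection fitness functions of the kind referred to by Equation (21) typically combine classification accuracy with the feature-reduction ratio through a weighting parameter. The form and the weight value below are assumptions for illustration only, not taken from the paper:

```python
# Hypothetical wrapper fitness: alpha trades accuracy against feature
# reduction (alpha and the exact form of Equation (21) are assumptions).
def fitness(acc, n_selected, n_total, alpha=0.99):
    reduction = 1 - n_selected / n_total
    return alpha * acc + (1 - alpha) * reduction

print(round(fitness(0.9813, 304, 561), 4))   # -> 0.9761
```

A smaller alpha shifts the optimizer's pressure from accuracy towards shrinking the feature set, which matches the two weighting regimes compared in Table 6.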
The superior performance of the developed method over all other tested methods can be noticed from the previous discussion. However, the developed method still suffers from several limitations, such as the relatively large feature set required to achieve reasonable performance (i.e., 304 features on average for six activities). Thus, it would be reasonable to realize such an HAR system in a smartphone environment to examine both the model size and the real-time behavior. Moreover, enlarging the set of targeted activities is expected to add time complexity when training a classifier such as the multi-class SVM.
5.5. Evaluating the Proposed GBOGWO on the WISDM Dataset
For further evaluation, we tested the proposed GBOGWO on another HAR dataset, the WISDM dataset [64]. This dataset contains six activities, namely walking (WK), walking upstairs (WU), walking downstairs (WD), sitting (ST), standing (SD), and jogging (JG).
Table 7 shows the results of the proposed GBOGWO and several optimization methods, including the GWO, GA, MFO, MRFO, and GBO. From the table, we can see that the proposed method achieves the best results. It is worth mentioning that the best previously reported results for the WISDM dataset were achieved using the random forest (RF) classifier; therefore, for the WISDM dataset, we also used RF.
A basic version of the RF algorithm with 50 decision trees gives an average accuracy of 97.5% for the feature set defined in Table 8. Following the pre-processing steps of the UCI-HAR dataset, each activity signal was separated into a body acceleration signal and a gravity component signal. Then, segments of 128 points (i.e., the same segment length used for the UCI-HAR dataset) with 50% overlap were generated for the purposes of real-time applications. The feature set in Table 8 was generated using simple time-domain statistics on the three axes of each segment, notably the mean, standard deviation (STD), the coefficients of an auto-regressive (AR) model of order 4, and histogram counts with 5 bins, among others. Moreover, the mean, max, and median frequencies of each segment in the three axes enrich the feature set. Considering that the proposed features are generated for both the body signal and the gravity component, the cardinality of the feature set reaches 150. Thus, such a feature set can help distinguish the behavior of the compared algorithms on the WISDM dataset. Since previous studies that addressed the WISDM dataset have used Accuracy to evaluate their algorithms, the classification error is set to 1 − mean(Accuracy), as shown in Figure 4b.
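The segmentation and time-domain statistics described above can be sketched as follows; this is an illustrative Python fragment (the full feature set in Table 8 also includes AR coefficients, histogram counts, and frequency-domain features):

```python
import math

def segment(signal, length=128, overlap=0.5):
    """Split a 1D signal into fixed-length windows with fractional overlap."""
    step = int(length * (1 - overlap))
    return [signal[i:i + length]
            for i in range(0, len(signal) - length + 1, step)]

def window_features(w):
    """Per-window mean and (population) standard deviation."""
    mean = sum(w) / len(w)
    std = math.sqrt(sum((v - mean) ** 2 for v in w) / len(w))
    return [mean, std]

# Synthetic 1-axis signal standing in for a body-acceleration component.
sig = [math.sin(2 * math.pi * i / 25) for i in range(640)]
windows = segment(sig)
feats = [window_features(w) for w in windows]
print(len(windows))   # (640 - 128) / 64 + 1 = 9 windows
```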
Since the search space of UCI-HAR, as a feature selection problem, is high-dimensional, it is a suitable testbed for the compared algorithms. Thus, to avoid redundancy, only the top six algorithms according to the results in Table 3, namely GBOGWO, GWO, GA, MFO, MRFO, and GBO, were included in the experiments on the WISDM dataset.
In Table 7, GBOGWO is able to achieve the highest mean accuracy (Acc), which is a notable improvement over the basic model trained on the whole feature set of 150 features. GBOGWO outperforms the other algorithms in classification Acc while using only 32.7 features on average (a reduction ratio of about 78.2%). MFO uses the largest feature set among the examined optimizers, with 59.9 features, yet its mean Acc remains below that of GBOGWO. GBO attains the smallest feature set, with a cardinality of 25, but this seems insufficient to reach the accuracy of GBOGWO. It was noticed that the STD for all algorithms was small, which may be attributed to the relatively limited search space (the feature set size is 150). Moreover, the Wilcoxon test results in Table 7 confirm that GBOGWO is well distinguished from the other compared algorithms.
In Table 9, the selection power of GBOGWO outperforms both the t-test and ReliefF, which tend to attain large feature sets of sizes 124 and 108, respectively, whilst yielding a lower mean Acc. According to the fitness criterion defined in Equation (21), GBOGWO outperforms both methods both when most importance is given to Acc and when it is given to feature-set reduction.
Table 10 shows the confusion matrix of the held-out test set. The activities ST, SD, and WK were recognized well, with high mean PR. It was noticed that the rates of PR, Sens., and Spec. were close for most activities, which reflects that the classification model (features + classifier) is balanced across these metrics. Most conflicts occur between WU and WD, as well as between WU and JG, where the misclassifications reach 27 and 15, respectively. Such conflicts may be caused by the sensor position (in the pocket); thus, for such applications, it is suggested to collect activity signals from different positions on the body, such as the pocket, wrist, waist, and shoulder.
Figure 5 presents the selections of each of the top six algorithms for both datasets.
Table 11 focuses on the most frequent features in the optimized feature sets of each algorithm. For UCI-HAR, only features attained by all considered algorithms (i.e., count = 6) are shown. These features are generated from the body signals of both the accelerometer (BodyAcc) and the gyroscope (BodyGyro), in both the time domain (prefix t) and the frequency domain (prefix f). For further explanation of these features, the reader can refer to [43]. For WISDM, the skewness of the y-axis of the body signal (Skewness-Y) appears to be the most important feature, as it is selected by every algorithm. Similarly, the tilt angle (TA), the STD of the jerk of the x-axis body signal (STD-Jerk-X), and the first coefficient of the AR model of the magnitude signal (AR-Magnitude,1) have a frequency of 5. The maximum frequency of the z-axis body signal (Max-Freq-Z) shows the most notable effectiveness among the generated frequency-domain features, with a count of 4. It is reasonable to find that body-signal statistics are more useful than those of the gravity components for such applications; accordingly, only Gravity-STD-Y and Gravity-Kurtosis-Y appear in the elite feature set.