Article

Human Activity Recognition via Score Level Fusion of Wi-Fi CSI Signals

1 School of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, Republic of Korea
2 Department of Applied Artificial Intelligence, Seoul National University of Science and Technology, Seoul 01811, Republic of Korea
* Author to whom correspondence should be addressed.
Submission received: 6 July 2023 / Revised: 12 August 2023 / Accepted: 18 August 2023 / Published: 21 August 2023
(This article belongs to the Special Issue Innovations in Wireless Sensor-Based Human Activity Recognition)

Abstract

Wi-Fi signals are ubiquitous and provide a convenient, covert, and non-invasive means of recognizing human activity, which is particularly useful for healthcare monitoring. In this study, we investigate a score-level fusion structure for human activity recognition using Wi-Fi channel state information (CSI) signals. The raw CSI signals undergo an important preprocessing stage before being classified using conventional classifiers at the first level. The output scores of two conventional classifiers are then fused via an analytic network that does not require an iterative search for learning. Our experimental results show that the fusion provides good generalization and a shorter learning processing time compared with state-of-the-art networks.

1. Introduction

Human activity recognition (HAR) is a field of research and technology that focuses on developing methods for automatically identifying and understanding human activities using sensor data [1,2]. HAR has a wide range of applications in various domains, including healthcare, security, sports, robotics, and surveillance (see e.g., [1,3]). In recent years, the importance of HAR has grown significantly due to its potential benefits and the many practical applications it offers [4].
There are three main approaches to HAR: vision-and-sound-based [5], mobile-device-based [6], and Wi-Fi-signal-based [7]. Each approach has its own advantages and challenges. The vision-and-sound-based approach can raise concerns about security and privacy due to the use of cameras and microphones. The mobile-device-based approach requires individuals to wear or carry smart devices, which can be expensive and inconvenient. In contrast, the Wi-Fi-based approach does not involve the use of cameras or microphones, so it does not raise the same concerns about security and privacy [8]. Additionally, individuals do not need to wear or carry any devices, as Wi-Fi signals are readily available in most environments. This makes the Wi-Fi-based approach a promising candidate for HAR. A brief review is provided in Section 2 regarding each of these approaches.
Wi-Fi signals, which operate within the radio frequency spectrum, can be affected by various factors such as interference, obstructions, and signal absorption. When humans move or interact with objects within the signal path, they can inadvertently affect the propagation of Wi-Fi signals [9]. This can lead to observable patterns in signal quality and connectivity. For example, the human body itself can act as an obstacle that attenuates or blocks Wi-Fi signals. When a person moves around a space, their physical presence can cause fluctuations in signal strength as the signals encounter the obstruction posed by their body. These disturbances can manifest as recognizable patterns in Wi-Fi connectivity, often seen as intermittent drops or variations in signal strength [7]. By capitalizing on this phenomenon, it is possible to use disturbed Wi-Fi CSI signals for HAR.
Both classical machine learning and deep learning methods have been used for HAR. Classical machine learning methods, such as Principal Component Analysis (PCA) (see e.g., [10]), K-Nearest Neighbors (KNN) (see e.g., [11]), and Support Vector Machines (SVM) [12], have the advantage of having low training demands but may compromise recognition accuracy. On the other hand, deep learning methods, such as Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU) [13], offer high recognition accuracy but require large amounts of data and significant training effort. Unlike face recognition, where there is a large collection of images available from social media, HAR using Wi-Fi signals does not have a large enough dataset to train a deep network effectively. This has motivated the study of fusion structures that balance sample size, training effort, and prediction accuracy.
The main contribution of this work is the development of a fast-learning fusion network that avoids over-fitting while using a relatively small number of training samples. This approach balances learning accuracy against training sample size. To evaluate the effectiveness of the preprocessing and fusion settings, an extensive series of experiments was conducted on three publicly available datasets.
The paper is structured as follows: Section 2 offers a concise overview of related works. Section 3 and Section 4 introduce the preliminaries and our proposed system for HAR. In Section 5, we present our experimental setup and results with discussion. Finally, Section 6 concludes the study.

2. Related Work

2.1. Vision-and-Sound-Based HAR

For vision-and-sound-based systems, a comprehensive survey of space–time and hierarchical approaches for HAR is provided by Aggarwal et al. [1]. In video-based systems, moving pictures are converted into 2D images, and image features are extracted for classification. For example, Anitha et al. [14] developed a system to recognize hand gestures by converting human action videos into 2D images and extracting features using Laplace Smoothing Transform (LST) and Kernel Principal Component Analysis (KPCA), with KNN used for classification. Local space–time features can also be used with classifiers such as SVM for recognizing human actions [15]. Ahmad et al. [16] used Spatio-Temporal Interest Points (STIPs) to detect important changes in images, extracting appearance and motion features using Histogram of Oriented Gradient (HOG) and Histogram of Optical Flow (HOF) descriptors, with SVM used for classification. Since using videos and images can jeopardize user privacy, other non-revealing sensors have been explored for HAR. For instance, Fu et al. [17] developed a motion detector with sub-millisecond temporal resolution using a contrast vision sensor. Sound signals have also been used, such as in the work of Stork et al. [18], who proposed a Non-Markovian Ensemble Voting (NEV) method to classify multiple human activities in real-time based on characteristic sounds. Additionally, 3D skeletal data have been utilized, as in the work of Ramezanpanah et al. [19], who represented 3D skeletal features of human action using Laban movement analysis and dynamic time warping, with SVM used for activity classification.
In addition to the classical machine learning methods mentioned above, recent work has focused on using deep learning for human action or activity recognition. For example, Wang et al. [20] developed a system for 3D human activity recognition using a re-configurable Convolutional Neural Network (CNN). Other examples of deep learning-based applications include the work of Dobhal et al. [21], who recognized human activity based on binary motion images and deep learning. In the work of Mahjoub et al. [22], image sequences were combined into a single Binary Motion Image (BMI) for feature representation, with a CNN used for classification.

2.2. Mobile-Device-Based HAR

In healthcare systems, wearable sensors are commonly used to recognize human activities. According to a survey by Thakur et al. [2], smartphone sensors such as accelerometers, gyroscopes, magnetometers, digital compasses, microphones, GPS, and cameras can be used to monitor and recognize human activities. For example, Anjum et al. [23] developed a smartphone application that uses the embedded motion sensor to track users’ physical activities and to provide feedback. The application estimates calories burned and breaks it down by activity such as walking, running, climbing stairs, descending stairs, driving, cycling, and being inactive. Another example is a framework proposed by Nandy et al. [24] that combines features from a smartphone accelerometer and a wearable heart rate sensor to recognize intense physical activity. The framework uses an ensemble model based on different classifiers. In addition to smartphone sensors, wearable acoustic sensors such as the bodyscope developed by Yatani et al. [25] can be used to detect and classify human activities. Human activities can also be remotely monitored via the use of wearable sensors that track heart rate, respiration rate, and body acceleration. Castro et al. [26] developed a remote monitoring system based on the Internet of Things (IoT) that uses these sensors.
The development of mobile-device-based HAR has followed a similar trend to vision-and-sound-based HAR in terms of the popularity of deep learning deployment. In addition to MLP and CNN, LSTM is also a popular choice. For example, Voicu et al. [27] used smartphone sensors such as accelerometers, gyroscopes, and gravity sensors to recognize physical activity, with an MLP used for learning and classification. Rustam et al. [28] employed sensor data from gyroscopes and accelerometers with a deep network model called Deep Stacked Multilayered-Perceptron (DS-MLP) for HAR. Chen et al. [29] constructed a CNN network for HAR based on a single accelerometer, with modified convolution kernels to adapt to the characteristics of tri-axial acceleration signals. Ghate et al. [30] constructed a hybrid deep learning model that combines deep neural networks with LSTM and GRU for effective classification of engineered features of CNN. The network integrates CNN with a Random Forest Classifier (DeepCNN-RF) to add randomness to the model.

2.3. Wi-Fi-Based HAR

Different from the mobile-device-based approach, Wi-Fi signals offer a device-free solution for HAR. Compared with vision-based methods, Wi-Fi signals also do not reveal detailed privacy-related information. Two types of Wi-Fi signals are commonly used for HAR: the RSSI and the CSI. The RSSI refers to the coarse-grained received signal strength indicator, while the CSI refers to the fine-grained channel state information. The RSSI is susceptible to signal fading, distortion, and inconsistency since it is a low-resolution measurement, indexed per packet with only a single value per packet. For this reason, the RSSI is being replaced by the CSI in Wi-Fi sensing solutions. The Wi-Fi CSI is a fine-grained signal measured over Orthogonal Frequency Division Multiplexing (OFDM) subcarriers. The following review is organized according to these two types of Wi-Fi technology.
In terms of using RSSI signals, four wireless technologies—Wi-Fi, BLE, Zigbee, and LoRaWAN—have been evaluated for indoor localization. According to a study by Sadowski et al. [31], Wi-Fi was found to be the most accurate among the investigated wireless technologies based on RSSI signals. In another example, Hsieh et al. [32] recognized human activity in an indoor environment using Wi-Fi RSSI and investigated the effectiveness of several machine learning methods, such as MLP and SVM, for activity detection.
In terms of using CSI signals, it has been found that CSI can capture unique patterns of small-scale fading caused by different human activities at a subcarrier level, which is not available in traditional received signal strength (RSS) extracted at the per-packet level [33]. Most methods that use CSI signals employ deep networks. For example, Chen et al. [34] constructed a deep learning network for HAR that uses an attention-based bi-directional long short-term memory model. Wang et al. [35] developed a deep learning network that combines hidden features from both temporal and spatial dimensions for accurate and reliable recognition. Other examples include the work of Lu et al. [36] and Islam et al. [37], who implemented a channel-exchanging fusion network to fuse CSI amplitude and phase features for HAR and constructed a spatio-temporal convolution with nested long short-term memory (STC-NLSTMNet) to extract spatial and temporal features concurrently for automatic recognition of human activities, respectively. Recent algorithms such as transfer learning and attention mechanisms have also been used in HAR. For instance, Yang et al. [38] used an attention mechanism with LSTM features at different dimensions for HAR, while Jung et al. [8] performed in-air handwritten signature recognition using transfer learning due to limited data availability.
Finally, we shall focus on the use of CNN, a popular network for CSI-based HAR. In the work of Moshiri et al. [39], CSI data were converted into images and a 2D CNN classifier was employed for HAR. Zhang et al. [40] exploited semantic activity features and temporal features from different dimensions to characterize activity at different locations. The semantic activity features were extracted using a CNN combined with a convolutional attention module (CBAM), while the temporal features were extracted using a Bidirectional Gated Recurrent Unit (BGRU) combined with a self-attention mechanism. In addition to semantic features, dimension reduction techniques have also been used with CNN for HAR. Showmik et al. [41] proposed a Principal Component-based Wavelet Convolutional Neural Network (PCWCNN), which uses PCA and Discrete Wavelet Transform (DWT) as preprocessing algorithms and a Wavelet CNN for classification. Zou et al. [42] fused a tailored CNN model with a variant of the C3D model using vision for HAR.
Summarizing the above work, it is noted that vision-and-sound-based approaches face fundamental limitations in securing the audio and video channels against potential violations of human privacy.

3. Preliminaries

3.1. Wi-Fi for Human Activity Recognition

As mentioned in the related work section, the Wi-Fi RSSI and the Wi-Fi CSI are two types of signal measurement that can be utilized for HAR [7]. The principle underlying these technologies is the Doppler effect, where the wavelength of reflected signals changes according to the relative motion of the human within the signal propagation space [43]. RSSI stands for received signal strength indicator, which is an energy characteristic of the Media Access Control layer [44]. However, due to its reliance on channel strength superposition, it may not accurately reflect changes in the channel, leading to a reduced detection rate. On the other hand, the CSI represents fine-grained channel state information, which includes specific indicators such as carrier signal strength, amplitude, phase, and signal delay [45]. At the physical layer, the CSI can capture micro-dynamic changes in human activity, making it possible to detect rapid changes caused by the superposition of multipath signals. Conceptually, the channel response relates to the RSSI much as a rainbow relates to a solar beam: OFDM can be thought of as the medium that refracts the RSSI into the CSI, allowing components of different wavelengths to be separated. As a result, the OFDM modulation system can make Wi-Fi-based HAR systems more robust against complex indoor environments, improving their effectiveness and accuracy [46].

3.2. Wi-Fi CSI Signal Features

The CSI is the channel response extracted from the OFDM subcarriers using fine-grained wireless channel measurement [7,45,47]. In an OFDM system, the CSI data on each subcarrier are modulated and converted into frequency domain via Fast Fourier Transform [46]. This provides an estimate of the amplitudes and phases of each subcarrier of the channel properties. During operation, the signal that is transmitted experiences multiple paths before arriving at the receiver. Each of these paths introduces distinct variations in time delay, amplitude attenuation, and phase shift where the Channel Impulse Response (CIR) [46] can be expressed as
$$h(t) = \sum_{i=1}^{n} a_i\, e^{-j\theta_i}\, \delta(t - \tau_i). \qquad (1)$$
In this equation, the signal from the $i$th path is represented by $a_i$ for its amplitude, $\theta_i$ for its phase, and $\tau_i$ for its time delay; $n$ denotes the total number of paths, and $\delta(t)$ refers to the Dirac delta function.
For data collection in practice, the Channel Frequency Response (CFR) can be utilized to model the transmitting channel in place of the CIR. This is because the commodity hardware may not have the required time resolution to capture rapid changes in the signal. Under the unlimited bandwidth condition, the CFR can be derived from the CIR by applying the Fast Fourier Transform.
In the frequency domain, the channel response of each subcarrier can be written as
$$H(f_k) = |H(f_k)|\, e^{j\angle H(f_k)}, \quad k = 1, \ldots, K, \qquad (2)$$
where $H(f_k)$ denotes the CSI of the $k$-th subcarrier and $\angle H(f_k)$ represents the corresponding phase shift information. By packing the CSI data according to the subcarrier index and the packet number, we can write
$$\mathbf{H} = \begin{bmatrix} H_{1,1} & H_{1,2} & \cdots & H_{1,P} \\ H_{2,1} & H_{2,2} & \cdots & H_{2,P} \\ \vdots & \vdots & \ddots & \vdots \\ H_{K,1} & H_{K,2} & \cdots & H_{K,P} \end{bmatrix}, \qquad (3)$$
where $\{1, 2, \ldots, K\}$ denotes the subcarrier indices and $\{1, 2, \ldots, P\}$ denotes the packet numbers.
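To make the CIR-to-CFR relationship concrete, the following NumPy sketch simulates a small multipath channel of the form (1) with hypothetical path parameters and evaluates the per-subcarrier response (2); each such evaluation yields one packet column of the $K \times P$ matrix in (3).

```python
import numpy as np

# Hypothetical multipath parameters: amplitude a_i, phase theta_i, delay tau_i.
amps   = np.array([1.0, 0.4, 0.2])         # a_i (amplitude attenuation)
phases = np.array([0.0, 1.1, 2.3])         # theta_i (radians)
delays = np.array([10e-9, 35e-9, 80e-9])   # tau_i (seconds)

# Illustrative OFDM grid: K = 52 subcarriers spaced 312.5 kHz around 2.437 GHz.
K = 52
f_k = 2.437e9 + np.arange(-K // 2, K // 2) * 312.5e3

# CFR: H(f_k) = sum_i a_i exp(-j theta_i) exp(-j 2 pi f_k tau_i), i.e., the
# Fourier transform of the CIR h(t) in (1) evaluated at the subcarrier frequencies.
H = (amps * np.exp(-1j * phases) * np.exp(-2j * np.pi * np.outer(f_k, delays))).sum(axis=1)

amp, phase = np.abs(H), np.angle(H)   # |H(f_k)| and its phase, as in (2)
print(amp.shape)                      # (52,): one packet column of the K x P matrix in (3)
```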

4. Proposed System

In this study, we adopt the score fusion strategy to learn and predict the class labels of human activities. Figure 1 shows the pipeline of the implemented system. Essentially, the raw Wi-Fi CSI data go through a preprocessing stage for normalization. The two differently normalized datasets, process1 and process2, are then classified separately via a linear model based on the linear Least Squares Error (LSE) and a nonlinear model based on the SVM with a Radial Basis Function (RBF) kernel. The learned scores are subsequently concatenated to form the input features for fusion learning, which uses ANnet or another LSE, KNN, or SVM-RBF classifier for the final decision. The final decision is based on the one-versus-rest technique.

4.1. Preprocessing

In view of their noisy nature, the raw Wi-Fi CSI signals are preprocessed before being fed into the learning algorithms. First, the signals are cropped according to each activity, at different lengths. Since the activities have different durations, the cropped CSI signals are resized via linear interpolation. The resized data are then packed in matrix form for further processing prior to learning and prediction. Subsequently, low-pass filtering is performed to remove high-frequency noise. Finally, z-score normalization is applied to remove the signal bias.
The preprocessed signals are packed as shown in (4) for the subsequent stage of learning and prediction:
$$\mathbf{H}_{processed} = \begin{bmatrix} H_{1,1} & H_{1,2} & \cdots & H_{1,C} \\ H_{2,1} & H_{2,2} & \cdots & H_{2,C} \\ \vdots & \vdots & \ddots & \vdots \\ H_{K,1} & H_{K,2} & \cdots & H_{K,C} \end{bmatrix}, \qquad (4)$$
where $k \in \{1, 2, \ldots, K\}$ denotes the subcarrier index and $C$ denotes the cropping size of the preprocessing. Figure 2 shows a sample of raw CSI signals and their preprocessed form before z-score normalization. The flow of the preprocessing steps is summarized in Figure 3. As illustrated in this figure, the set of signals cropped at length1 is named process1, while the set cropped at length2 is named process2. The signals of process1 and process2 are eventually standardized via z-score normalization in the preprocessing step.
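A minimal sketch of this preprocessing chain is given below, assuming a NumPy/SciPy implementation; the Butterworth filter and its fourth order are illustrative choices, since only the normalized pass-band is specified here.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_csi(csi, crop_len, passband, order=4):
    """Resize each subcarrier sequence to crop_len via linear interpolation,
    low-pass filter it, and z-score normalize. csi: (K, T) amplitude array."""
    K, T = csi.shape
    t_old = np.linspace(0.0, 1.0, T)
    t_new = np.linspace(0.0, 1.0, crop_len)
    resized = np.vstack([np.interp(t_new, t_old, row) for row in csi])
    if passband < 1.0:  # a normalized pass-band of 1.0 amounts to no filtering
        b, a = butter(order, passband, btype="low")
        resized = filtfilt(b, a, resized, axis=1)
    # z-score normalization per subcarrier to remove the signal bias.
    mu = resized.mean(axis=1, keepdims=True)
    sd = resized.std(axis=1, keepdims=True) + 1e-12
    return (resized - mu) / sd

# e.g., for HAR-RP (Table 5): process1 = (50, 0.1) and process2 = (100, 0.05)
# h1 = preprocess_csi(raw_sample, crop_len=50, passband=0.1)
```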

4.2. Classification Stage

The matrix for each activity sample $\mathbf{H}_{processed}$ is flattened to form a row vector $\mathbf{h}^T \in \mathbb{R}^{1 \times D}$ so that the $M$ training samples can be stacked in matrix form as follows:
$$\mathbf{X} = \begin{bmatrix} \mathbf{h}_1^T \\ \vdots \\ \mathbf{h}_M^T \end{bmatrix} \in \mathbb{R}^{M \times D}. \qquad (5)$$
Correspondingly, the $G$ target activities are encoded using a one-hot encoder such that
$$\mathbf{Y} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \in \mathbb{R}^{M \times G}, \qquad (6)$$
where each sample row contains a '1' at the column position corresponding to the class label. For first-level activity classification, two base classifiers, namely the LSE and the SVM with the RBF kernel, have been deployed.
For training the linear prediction model $\mathbf{Y} = \mathbf{X}\mathbf{W}$ based on the LSE, the learning weights $\mathbf{W} \in \mathbb{R}^{D \times G}$ can be found deterministically as
$$\hat{\mathbf{W}} = \begin{cases} (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{Y}, & \text{if } M \ge D, \\ \mathbf{X}^T(\mathbf{X}\mathbf{X}^T + \lambda\mathbf{I})^{-1}\mathbf{Y}, & \text{if } M < D, \end{cases} \qquad (7)$$
with regularization $\lambda = 0.001$, where the two cases account for over- and under-determined systems. Subsequently, the prediction of test samples can be computed using
$$\hat{\mathbf{Y}} = \mathbf{X}_t\hat{\mathbf{W}}, \qquad (8)$$
where $\mathbf{X}_t \in \mathbb{R}^{N \times D}$ denotes the test matrix. These output scores will be used in the subsequent fusion stage for final prediction.
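A minimal NumPy sketch of (7) and (8) follows; the helper names are illustrative, and the branch merely selects the cheaper of the two equivalent regularized inverses.

```python
import numpy as np

def lse_fit(X, Y, lam=1e-3):
    """Regularized LSE weights of (7); the branch picks the smaller inverse."""
    M, D = X.shape
    if M >= D:   # over-determined: invert the D x D Gram matrix
        return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)
    # under-determined: invert the M x M Gram matrix instead
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(M), Y)

def lse_predict(Xt, W):
    """Prediction scores of (8); argmax implements one-versus-rest."""
    return Xt @ W

# Y = np.eye(G)[labels]                        # one-hot targets as in (6)
# y_pred = lse_predict(Xt, W).argmax(axis=1)   # one-versus-rest decision
```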
For binary classification, the SVM learning can be written as
$$\text{minimize:} \quad \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i, \qquad (9)$$
$$\text{subject to:} \quad y_i\big(\phi(\mathbf{x}_i)^T\mathbf{w} + b\big) \ge 1 - \xi_i, \quad \xi_i \ge 0, \qquad (10)$$
where $\phi$ corresponds to the RBF mapping, and $\mathbf{x}_i$ and $y_i$ are the $i$th input feature vector and target value, respectively. For multiclass problems, multiple SVMs can be implemented with the one-versus-rest technique for class prediction. In our study, the output probability scores of the multicategory SVM are used as input features in the fusion stage.
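A sketch of this step is shown below, assuming a scikit-learn implementation (the library choice, the value C = 1.0, and the synthetic placeholder data are illustrative); the Platt-scaled per-class probabilities serve as the SVM portion of the fusion features.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 20))     # placeholder training features
y_train = rng.integers(0, 3, size=60)   # placeholder activity labels
X_test = rng.normal(size=(10, 20))

# One-versus-rest SVM with RBF kernel; probability=True yields Platt-scaled
# per-class scores, which are concatenated with the LSE scores for fusion.
ovr_svm = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, probability=True))
ovr_svm.fit(X_train, y_train)
Y_svm = ovr_svm.predict_proba(X_test)   # (N, G) probability score matrix
```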

4.3. Fusion Stage

We denote the prediction scores obtained from the first-level LSE and SVM, respectively, as $\hat{\mathbf{Y}}_{LSE} \in \mathbb{R}^{M \times G}$ and $\hat{\mathbf{Y}}_{SVM} \in \mathbb{R}^{M \times G}$, where $M$ denotes the sample size and $G$ denotes the number of activity categories. The scores for fusion can then be stacked as
$$\mathbf{F}_{scores} = [\hat{\mathbf{Y}}_{LSE}, \hat{\mathbf{Y}}_{SVM}] \in \mathbb{R}^{M \times 2G}. \qquad (11)$$
According to [48], a two-layer network with sufficient hidden nodes can learn well from data samples of limited size. Here, we implement a two-layer network known as ANnet [48], given by
$$\mathbf{Y} = \phi\big(\phi(\mathbf{F}_{scores})\mathbf{W}_1\big)\mathbf{W}_2, \qquad (12)$$
to learn the stacked output scores (11) from the first-level LSE and SVM classification. In our implementation, the arctangent ($\tan^{-1}$) function has been adopted as the nonlinear transformation $\phi$. Similar to the LSE, the learning target $\mathbf{Y}$ can adopt the one-hot encoding, where the weights can be learned based on
$$\hat{\mathbf{W}}_1 = \phi\big(\mathbf{F}_{scores} + \gamma\mathbf{R}(k)\big)^{\dagger}, \qquad (13)$$
$$\hat{\mathbf{W}}_2 = \phi\Big(\phi\big(\mathbf{F}_{scores} + \gamma\mathbf{R}(k)\big)\hat{\mathbf{W}}_1\Big)^{\dagger}\,\mathbf{Y}, \qquad (14)$$
where $\dagger$ denotes the Moore–Penrose inverse, which has been implemented using
$$\mathbf{A}^{\dagger} = \begin{cases} (\mathbf{A}^T\mathbf{A} + 0.01\,\mathbf{I})^{-1}\mathbf{A}^T, & \text{if } M \ge D, \\ \mathbf{A}^T(\mathbf{A}\mathbf{A}^T + 0.01\,\mathbf{I})^{-1}, & \text{if } M < D, \end{cases} \quad \mathbf{A} \in \mathbb{R}^{M \times D}, \qquad (15)$$
considering the stability of inversion in Python. In this learning, a random perturbation $\gamma\mathbf{R}(k) \in \mathbb{R}^{M \times 2G}$, where $\mathbf{R}$ is a random matrix, has been included to spread the data. The scaling factor $\gamma \in \mathbb{R}$ of the perturbation and the random perturbation seed $k \in \mathbb{N}$ can be treated as hyperparameters to be determined via cross-validation on the training set.
For prediction using unseen $\mathbf{F}_{testscores}$, the learned network weights can be substituted into (12) to obtain the estimated scores:
$$\hat{\mathbf{Y}} = \phi\big(\phi(\mathbf{F}_{testscores})\hat{\mathbf{W}}_1\big)\hat{\mathbf{W}}_2. \qquad (16)$$
The one-versus-rest technique can then be applied to obtain the class label prediction.
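To make the analytic learning concrete, a minimal NumPy sketch of the fusion stage (11)–(16) follows; the helper names and call pattern are our own, and $\gamma$ and the seed remain hyperparameters to be tuned via cross-validation as noted above.

```python
import numpy as np

phi = np.arctan  # nonlinear transformation used in (12)

def pinv_reg(A):
    """Regularized Moore-Penrose inverse as in (15)."""
    M, D = A.shape
    if M >= D:
        return np.linalg.solve(A.T @ A + 0.01 * np.eye(D), A.T)
    return np.linalg.solve(A @ A.T + 0.01 * np.eye(M), A).T

def annet_fit(F, Y, gamma, seed):
    """Analytic two-layer learning of (13)-(14); no iterative search."""
    R = np.random.default_rng(seed).standard_normal(F.shape)  # perturbation R(k)
    Fp = phi(F + gamma * R)
    W1 = pinv_reg(Fp)                 # (13)
    W2 = pinv_reg(phi(Fp @ W1)) @ Y   # (14)
    return W1, W2

def annet_predict(F_test, W1, W2):
    """Estimated scores of (16); argmax gives the one-versus-rest label."""
    return phi(phi(F_test) @ W1) @ W2

# F = np.hstack([Y_lse, Y_svm])   # stacked first-level scores (11), M x 2G
# gamma and seed are tuned via cross-validation on the training set.
```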

5. Experiments

5.1. Database

Three datasets have been utilized in our experimentation. The first CSI dataset, which we call HAR-RP, was obtained from [39]. This dataset contains seven different activities, namely RUN, WALK, FALL, BEND, SIT DOWN, STAND UP, and LIE DOWN, performed by three volunteers in an indoor environment. In total, the dataset consists of 420 samples of CSI signals, with 60 samples per activity. These time-series data are composed of 52 subcarriers, where the period of each activity determines the data length, ranging from 600 to 1100 sampling measurements. The extracted CSI signals consist of three types of subcarriers: the null subcarriers, the pilot subcarriers, and the data subcarriers. Among these, only the data subcarriers contain crucial information related to human activities. Therefore, following [39], the pilot and null subcarriers are not utilized in this experimentation.
The second CSI dataset [49], which we call HAR-RT, was collected using an Asus RT-AC86U at a channel bandwidth of 80 MHz. HAR-RT consists of six different activities: SIT, STAND, SIT-DOWN, STAND-UP, WALK, and FALL. This indoor CSI information was collected at different spots in the room and at different frequency bandwidths in order to avoid possible location and bandwidth dependency. As shown in Table 1, the HAR-RT dataset has a total of 1084 samples for the six activities with 256 subcarriers, and these time-series data have been normalized.
The third CSI dataset, which we call HAR-ARIL, was obtained from [50]. This dataset contains six distinct hand activities, namely hand up, hand down, hand left, hand right, hand circle, and hand cross. The 1440 data samples were collected as 15 samples from each activity at 16 different locations by 15 individuals. To ensure data quality, the dataset was manually curated down to 1394 samples.

5.2. Experimental Setup

We conducted three experiments to analyse the fusion system for HAR, as shown in Table 2. Under experiment I, various preprocessing parameters were tested with three learning classifiers on the HAR-RP, HAR-RT, and HAR-ARIL datasets in terms of their signal cropping size and the normalized pass-band of the low-pass filter. Under experiment II, various fusion combinations were evaluated on two differently preprocessed data (called process1 and process2) of HAR-RP, HAR-RT, and HAR-ARIL with and without data transformation or normalization. Under experiment III, a comparison between the proposed fusion combination and state-of-the-art (SOTA) methods from [39,49,50] was carried out to observe the accuracy standing of activity recognition.
For the HAR-RP database [39], we randomly selected 336 samples to form the training set and used the remaining 84 samples as test samples, following the 80/20 ratio of a five-fold partitioning. For the HAR-RT database [49], 867 samples were selected to form the training set, and the remaining 217 samples were used as the test set. For the HAR-ARIL database, 1116 samples were used for training, and the remaining 278 samples were used for testing. By permuting the above 80/20 ratios, five-fold cross-validation was employed to evaluate the testing accuracy for all three datasets. All experiments were conducted on a PC equipped with an i9 processor at 3.7 GHz and 32 GB of RAM.
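This five-fold protocol can be sketched as follows, assuming a scikit-learn implementation; the placeholder data and the KNN stand-in classifier are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(420, 52 * 50))   # placeholder: 420 flattened HAR-RP samples
y = np.repeat(np.arange(7), 60)       # 7 activities x 60 samples each

model = KNeighborsClassifier()        # stand-in for any classifier in this study
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Each fold holds out 20% for testing (84 of 420), permuting the 80/20 split.
accs = [model.fit(X[tr], y[tr]).score(X[te], y[te]) for tr, te in skf.split(X, y)]
print(f"mean five-fold test accuracy: {np.mean(accs):.3f}")
```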
In experiment I, the linear LSE method, the SVM-RBF, and the KNN were utilized as learning classifiers to determine the best combination of the signal cropping size and the normalized pass-band of the low-pass filter. This experiment was conducted in two steps. In step 1, various cropping sizes and normalized pass-bands of the low-pass filter were applied separately to preprocess the CSI signals to determine their desired operating ranges. In step 2, the top two combinations of the cropping size and the normalized pass-band were determined based on the recognition accuracy. These two combinations with top accuracies were named as process1 and process2 respectively.
In experiment II, a score level fusion of the above results was performed using either ANnet, or another LSE classifier, SVM-RBF classifier, or KNN classifier. This experiment was also conducted in two steps. In step 1, process1 and process2 (which were differently preprocessed data as described in above) from experiment I were trained with LSE and SVM-RBF individually in order to generate scores with and without feature transformation/normalization. In step 2, the corresponding scores from different classifier settings were concatenated to form a new set of features for score level fusion using another LSE, SVM-RBF, KNN or ANnet that adopted the training strategy given by (13)–(14).
In experiment III, the proposed score-level fusion was compared with the SOTA methods in [39,49,50]. The experiment for our fusion was conducted according to the training and test settings in [39,49,50], and the accuracy of activity recognition and the processing times were recorded. As for the SOTA methods on the HAR-RP dataset, Moshiri et al. [39] converted the CSI data into a 2D array to form a pseudo-color image and then utilized a 2D CNN for learning and recognition. No preprocessing was applied to the CSI amplitudes, since the authors believed that any extra filtering could remove essential information and affect the classification performance. Moreover, they recommended the CNN over the LSTM to avoid high computational complexity and long training time. For the HAR-RT dataset, Schäfer et al. [49] proposed using normalized raw CSI data directly in an LSTM network with 100 hidden nodes for HAR. The dataset was randomly divided into 80% for training and 20% for testing. The input dimension of the network was equal to the number of subcarriers, and the dropout rate of the LSTM was set at 40%. As for the HAR-ARIL dataset, the adopted classical machine learning SOTA baselines were the DTW+KNN and SVM-RBF classifiers [50].

5.3. Results and Discussion

5.3.1. Experiment I(a): Effect of Cropping Size and Pass-Band on the LSE Classifier

Table 3 shows the impact of the cropping size and the normalized pass-band on the accuracy of the LSE for the HAR-RP, HAR-RT, and HAR-ARIL databases. The cropping size refers to the number of effective points of the time-series data, and the normalized pass-band refers to the bandwidth of the low-pass filter. For the HAR-RP dataset, the result shows that a small cropping size leads to a high accuracy at an intermediate range of normalized pass-bands. However, when the normalized pass-band falls below a certain range, the accuracy of the LSE drops significantly. For example, at a normalized pass-band of 0.05, the accuracy with cropping size 50 is 40.4%. The result also shows that the combination of cropping size 100 and normalized pass-band 0.05 gives the highest accuracy of 73.6% among all tested combinations. This is closely followed by the combination of cropping size 50 and pass-band 0.1, which gives an accuracy of 72.8%. For the HAR-RT database, the result shows that accuracy increases as the pass-band value increases. For example, when the cropping size is 50, the accuracy scores at pass-band values of 0.1, 0.5, 0.8, and 1.0 are 30.4%, 45.1%, 48.8%, and 71.4%, respectively. The result also shows that the top two accuracy scores are 71.4% and 66.4% at cropping sizes 50 and 100, respectively. A similar trend in terms of the cropping size and the pass-band is observed for the HAR-ARIL database.

5.3.2. Experiment I(b): Effect of Cropping Size and Pass-Band on the SVM-RBF Classifier

As seen from Table 4, the accuracy of the SVM-RBF is generally higher than that of the LSE in Experiment I(a). For the HAR-RP database, it can be observed that the accuracy tends to be slightly higher when the cropping size is small. For example, at a cropping size of 50, the accuracy is 96.3% for pass-band 0.1, which is the highest in the table. The second-highest accuracy of 95.6% is achieved at a cropping size of 100 and pass-band 0.05. This suggests that with a moderate cropping size and pass-band, the SVM-RBF is able to achieve a high level of accuracy. For the HAR-RT database, a similar cropping-size pattern is observed at pass-band 1.0, where the top two accuracies of 95.4% and 92.6% are obtained. These values are significantly higher than all other values in the table, which are around 80%. For the HAR-ARIL database, the two best results are 72.5% at pass-band 0.3 with cropping size 50 and 73.5% at pass-band 0.5 with cropping size 100.
It is clear that the accuracy is affected by both the cropping size and the pass-band, and the combination patterns show certain common trends between the LSE and the SVM-RBF. Table 5 shows the chosen parameters, which correspond to the top two accuracies for each of the HAR-RP, HAR-RT, and HAR-ARIL datasets.

5.3.3. Experiment I(c): Effect of Cropping Size and Pass-Band on the KNN Classifier

According to Table 6, the KNN outperforms the LSE of Experiment I(a) in terms of accuracy, but it is not as accurate as the SVM-RBF of Experiment I(b). For the HAR-RP database, accuracy tends to be slightly higher at cropping sizes of 50 and 100. For example, the highest accuracy of 92.6% is achieved with a cropping size of 50 and a pass-band of 0.1, and the second-highest accuracy of 92.3% with a cropping size of 100 and a pass-band of 0.05. A similar pattern is observed in the HAR-RT database, where the two highest accuracies of 82.3% and 76.7% are achieved at a pass-band of 1.0 with cropping sizes of 50 and 100, respectively. In the HAR-ARIL database, although performance remains relatively consistent across different cropping sizes and pass-band values, the two highest accuracies are also observed at cropping sizes of 50 and 100. This suggests that the KNN can achieve high levels of accuracy with smaller cropping sizes and moderate pass-band values.

5.3.4. Experiment II: Fusion of First-Level LSE and SVM-RBF Scores Using LSE, SVM-RBF, KNN, and ANnet

The scores of the first-level LSE and SVM-RBF on the two differently preprocessed datasets from experiment I are next utilized for fusion under several settings using ANnet, KNN, and another set of LSE and SVM-RBF classifiers. In other words, the LSE and SVM-RBF methods are implemented individually on each preprocessed dataset (process1 and process2) before fusion, with and without transformation/normalization, to observe whether the scores from each method are distinguishable. Subsequently, the two individual score sets are fused to form a new set of features for the final classification decision using LSE, SVM-RBF, KNN, and ANnet. Table 7 shows that applying score-level fusion with transformation/normalization increases the accuracy in most cases. For the HAR-RP database, the highest accuracy of 97.6% is achieved by applying ANnet or SVM-RBF to the concatenated first-level LSE score of process1 and SVM-RBF score of process2 with transformation. For the HAR-RT database, the highest accuracy of 96.4% is achieved by applying SVM-RBF to the concatenated first-level SVM-RBF scores of process1 and process2 with transformation/normalization.

5.3.5. Experiment III: Comparison of the Proposed Fusion with SOTA Methods in Table 1

Figure 4 shows the training and test accuracies of the SOTA methods, namely the 2D CNN, the 1D CNN, and the BiLSTM, and of the proposed ANnet fusion on the HAR-RP database. The results show significant over-fitting of the SOTA methods compared with ANnet. Figure 5 shows the training and test accuracies of the compared algorithms, namely the LSTM and the proposed ANnet fusion, on the HAR-RT database. The results show significant over-fitting of the LSTM compared with ANnet. Figure 6 shows the training and test accuracies of the compared algorithms, namely the DTW+KNN, the SVM-RBF, and the proposed ANnet fusion, on the HAR-ARIL database. The results show test accuracies of the proposed ANnet comparable to those of the classical DTW+KNN method. In terms of training processing time, our fusion benefits from low computational complexity compared with the deep learning methods, as shown in Table 8, since our model is a combination of LSE and SVM-RBF with analytic learning.

5.4. Summary of Results and Observations

  • Expt I: This experiment reveals that the preprocessing steps of selecting the cropping size and the normalized pass-band have a significant impact on the recognition accuracy. In particular, each database shows its best accuracy at different combinations of settings. For example, the HAR-RP and HAR-ARIL datasets show that a small cropping size leads to a high accuracy at an intermediate range of normalized pass-bands. For the HAR-RT database, the accuracy increases as the pass-band value increases.
  • Expt II: This experiment shows that fusion using SVM-RBF and ANnet outperforms the LSE and KNN in general. Moreover, many of their fused results show an improved accuracy compared with that before fusion.
  • Expt III: This experiment shows that the proposed fusion has either comparable or better accuracy than that of SOTA. In particular, the SOTA methods show significant over-fitting in view of their higher model complexity than the proposed fusion method. In other words, the proposed fusion method has capitalized on the low model complexity but with sufficient mapping capability to generalize the prediction.

6. Conclusions

In response to the relatively poor generalization of complex network models on small data sizes, a fusion method of lower model complexity has been proposed for human activity recognition. This method fuses the scores of two first-level classifiers using an analytic network to make the final decision. Experiments have shown not only that this fusion improves recognition accuracy but also that the preprocessing of Wi-Fi signals plays a critical role in achieving good baseline recognition accuracy. In particular, varying the cropping size contributes to signal diversity for fusion gain. Additionally, the linear LSE and the SVM-RBF have been shown to carry complementary information for fusion gain. The proposed simple fusion structure has demonstrated good generalization compared with classical machine learning and state-of-the-art methods on the tested datasets.

Author Contributions

Conceptualization, K.-A.T.; methodology, K.-A.T., B.O. and D.K.; software, G.L.; investigation, G.L.; writing—original draft preparation, G.L.; writing—review and editing, K.-A.T., B.O. and D.K.; supervision, K.-A.T. and D.K.; project administration, K.-A.T. and D.K.; funding acquisition, K.-A.T. and D.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Research Foundation of Korea (NRF) under a grant funded by the Ministry of Education, Science and Technology (NRF-2021R1A2C1093425) and under the Basic Research Laboratory (BRL) program (NRF-2022R1A4A2000748).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All datasets used in the paper are publicly available.

Acknowledgments

This research was supported in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2021R1A2C1093425) and in part by the National Research Foundation of Korea (NRF) under the program of Basic Research Laboratory (BRL) (NRF-2022R1A4A2000748).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Aggarwal, J.K.; Ryoo, M.S. Human activity analysis: A review. ACM Comput. Surv. 2011, 43, 1–43. [Google Scholar] [CrossRef]
  2. Thakur, D.; Biswas, S. Smartphone based human activity monitoring and recognition using ML and DL: A comprehensive survey. J. Ambient. Intell. Humaniz. Comput. 2020, 11, 5433–5444. [Google Scholar] [CrossRef]
  3. Nweke, H.F.; Teh, Y.W.; Mujtaba, G.; Al-Garadi, M.A. Data fusion and multiple classifier systems for human activity detection and health monitoring: Review and open research directions. Inf. Fusion 2019, 46, 147–170. [Google Scholar] [CrossRef]
  4. Ariza-Colpas, P.P.; Vicario, E.; Oviedo-Carrascal, A.I.; Shariq Butt Aziz, M.A.P.M.; Quintero-Linero, A.; Patara, F. Human Activity Recognition Data Analysis: History, Evolutions, and New Trends. Sensors 2022, 22, 3401. [Google Scholar] [CrossRef]
  5. Jeon, J.H.; Oh, B.S.; Toh, K.A. A System for Hand Gesture Based Signature Recognition. In Proceedings of the International Conference on Control, Automation, Robotics and Vision (ICARCV 2012), Singapore, 5–7 December 2012; pp. 171–175. [Google Scholar]
  6. Bailador, G.; Sanchez-Avila, C.; Guerra-Casanova, J.; de Santos Sierra, A. Analysis of pattern recognition techniques for in-air signature biometrics. Pattern Recognit. 2011, 44, 2467–2478. [Google Scholar] [CrossRef]
  7. Wang, W.; Liu, A.X.; Shahzad, M.; Ling, K.; Lu, S. Understanding and modeling of wifi signal based human activity recognition. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, Paris, France, 7–11 September 2015; pp. 65–76. [Google Scholar]
  8. Jung, J.; Moon, H.C.; Kim, J.; Kim, D.; Toh, K.A. Wi-Fi based user identification using in-air handwritten signature. IEEE Access 2021, 9, 53548–53565. [Google Scholar] [CrossRef]
  9. Scholz, M.; Sigg, S.; Schmidtke, H.R.; Beigl, M. Challenges for device-free radio-based activity recognition. In Proceedings of the Workshop on Context Systems, Design, Evaluation and Optimisation; 2011. [Google Scholar]
  10. Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification, 2nd ed.; John Wiley & Sons, Inc.: New York, NY, USA, 2001. [Google Scholar]
  11. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: Berlin/Heidelberg, Germany, 2017. [Google Scholar]
  12. Ayodele, T.O. Types of machine learning algorithms. New Adv. Mach. Learn. 2010, 3, 19–48. [Google Scholar]
  13. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  14. Anitha, U.; Narmadha, R.; Sumanth, D.R.; Kumar, D.N. Robust human action recognition system via image processing. Procedia Comput. Sci. 2020, 167, 870–877. [Google Scholar] [CrossRef]
  15. Schuldt, C.; Laptev, I.; Caputo, B. Recognizing human actions: A local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition, Cambridge, UK, 23–26 August 2004; Volume 3, pp. 32–36. [Google Scholar]
  16. Ahmad, Z.; Illanko, K.; Khan, N.; Androutsos, D. Human action recognition using convolutional neural network and depth sensor data. In Proceedings of the 2019 International Conference on Information Technology and Computer Communications, Singapore, 16–18 August 2019; pp. 1–5. [Google Scholar]
  17. Fu, Z.; Culurciello, E.; Lichtsteiner, P.; Delbruck, T. Fall detection using an address-event temporal contrast vision sensor. In Proceedings of the 2008 IEEE International Symposium on Circuits and Systems (ISCAS), Washington, DC, USA, 18–21 May 2008; pp. 424–427. [Google Scholar]
  18. Stork, J.A.; Spinello, L.; Silva, J.; Arras, K.O. Audio-based human activity recognition using non-Markovian ensemble voting. In Proceedings of the 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication, Paris, France, 9–13 September 2012; pp. 509–514. [Google Scholar]
  19. Ramezanpanah, Z.; Mallem, M.; Davesne, F. Human action recognition using Laban movement analysis and dynamic time warping. Procedia Comput. Sci. 2020, 176, 390–399. [Google Scholar] [CrossRef]
  20. Wang, K.; Wang, X.; Lin, L.; Wang, M.; Zuo, W. 3D human activity recognition with reconfigurable convolutional neural networks. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 97–106. [Google Scholar]
  21. Dobhal, T.; Shitole, V.; Thomas, G.; Navada, G. Human activity recognition using binary motion image and deep learning. Procedia Comput. Sci. 2015, 58, 178–185. [Google Scholar] [CrossRef]
  22. Mahjoub, A.B.; Atri, M. Human action recognition using RGB data. In Proceedings of the 2016 11th International Design & Test Symposium (IDT), Hammamet, Tunisia, 18–20 December 2016; pp. 83–87. [Google Scholar]
  23. Anjum, A.; Ilyas, M.U. Activity recognition using smartphone sensors. In Proceedings of the 2013 IEEE 10th Consumer Communications and Networking Conference (CCNC), Las Vegas, NV, USA, 11–14 January 2013; pp. 914–919. [Google Scholar]
  24. Nandy, A.; Saha, J.; Chowdhury, C. Novel features for intensive human activity recognition based on wearable and smartphone sensors. Microsyst. Technol. 2020, 26, 1889–1903. [Google Scholar] [CrossRef]
  25. Yatani, K.; Truong, K.N. Bodyscope: A wearable acoustic sensor for activity recognition. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing, Pittsburgh, PA, USA, 5–8 September 2012; pp. 341–350. [Google Scholar]
  26. Castro, D.; Coral, W.; Rodriguez, C.; Cabra, J.; Colorado, J. Wearable-based human activity recognition using an IoT approach. J. Sens. Actuator Netw. 2017, 6, 28. [Google Scholar] [CrossRef]
  27. Voicu, R.A.; Dobre, C.; Bajenaru, L.; Ciobanu, R.I. Human physical activity recognition using smartphone sensors. Sensors 2019, 19, 458. [Google Scholar] [CrossRef]
  28. Rustam, F.; Reshi, A.A.; Ashraf, I.; Mehmood, A.; Ullah, S.; Khan, D.M.; Choi, G.S. Sensor-based human activity recognition using deep stacked multilayered perceptron model. IEEE Access 2020, 8, 218898–218910. [Google Scholar] [CrossRef]
  29. Chen, Y.; Xue, Y. A deep learning approach to human activity recognition based on single accelerometer. In Proceedings of the 2015 IEEE International Conference on Systems, Man, and Cybernetics, Hong Kong, 9–12 October 2015; pp. 1488–1492. [Google Scholar]
  30. Ghate, V. Hybrid deep learning approaches for smartphone sensor-based human activity recognition. Multimed. Tools Appl. 2021, 80, 35585–35604. [Google Scholar] [CrossRef]
  31. Sadowski, S.; Spachos, P. RSSI-based indoor localization with the internet of things. IEEE Access 2018, 6, 30149–30161. [Google Scholar] [CrossRef]
  32. Hsieh, C.F.; Chen, Y.C.; Hsieh, C.Y.; Ku, M.L. Device-free indoor human activity recognition using Wi-Fi RSSI: Machine learning approaches. In Proceedings of the 2020 IEEE International Conference on Consumer Electronics-Taiwan (ICCE-Taiwan), Taiwan, 28–30 September 2020; pp. 1–2. [Google Scholar]
  33. Wang, Y.; Liu, J.; Chen, Y.; Gruteser, M.; Yang, J.; Liu, H. E-eyes: Device-free location-oriented activity identification using fine-grained WiFi signatures. In Proceedings of the 20th Annual International Conference on Mobile Computing and Networking, Maui, HI, USA, 7–11 September 2014; pp. 617–628. [Google Scholar]
  34. Chen, Z.; Zhang, L.; Jiang, C.; Cao, Z.; Cui, W. WiFi CSI based passive human activity recognition using attention based BLSTM. IEEE Trans. Mob. Comput. 2018, 18, 2714–2724. [Google Scholar] [CrossRef]
  35. Wang, F.; Gong, W.; Liu, J. On spatial diversity in WiFi-based human activity recognition: A deep learning-based approach. IEEE Internet Things J. 2018, 6, 2035–2047. [Google Scholar] [CrossRef]
  36. Lu, X.; Li, Y.; Cui, W.; Wang, H. CeHAR: CSI-based Channel-Exchanging Human Activity Recognition. IEEE Internet Things J. 2022, 10, 5953–5961. [Google Scholar] [CrossRef]
  37. Islam, M.S.; Jannat, M.K.A.; Hossain, M.N.; Kim, W.S.; Lee, S.W.; Yang, S.H. STC-NLSTMNet: An Improved Human Activity Recognition Method Using Convolutional Neural Network with NLSTM from WiFi CSI. Sensors 2023, 23, 356. [Google Scholar] [CrossRef] [PubMed]
  38. Yang, X.; Cao, R.; Zhou, M.; Xie, L. Temporal-frequency attention-based human activity recognition using commercial WiFi devices. IEEE Access 2020, 8, 137758–137769. [Google Scholar] [CrossRef]
  39. Moshiri, P.F.; Shahbazian, R.; Nabati, M.; Ghorashi, S.A. A CSI-based Human Activity Recognition using deep learning. Sensors 2021, 21, 7225. [Google Scholar] [CrossRef]
  40. Zhang, Y.; Liu, Q.; Wang, Y.; Yu, G. CSI-Based Location-Independent Human Activity Recognition Using Feature Fusion. IEEE Trans. Instrum. Meas. 2022, 71, 5503312. [Google Scholar] [CrossRef]
  41. Showmik, I.A.; Sanam, T.F.; Imtiaz, H. Human Activity Recognition from Wi-Fi CSI Data Using Principal Component-Based Wavelet CNN. Digit. Signal Process. 2023, 138, 104056. [Google Scholar] [CrossRef]
  42. Zou, H.; Yang, J.; Prasanna Das, H.; Liu, H.; Zhou, Y.; Spanos, C.J. WiFi and vision multimodal learning for accurate and robust device-free human activity recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 1–8. [Google Scholar]
  43. Yang, D.; Wang, T.; Sun, Y.; Wu, Y. Doppler shift measurement using complex-valued CSI of WiFi in corridors. In Proceedings of the 2018 3rd International Conference on Computer and Communication Systems (ICCCS), Nagoya, Japan, 27–30 April 2018; pp. 67–371. [Google Scholar]
  44. IEEE Std 802.11a-1999; Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: High-Speed Physical Layer in the 5 GHz Band. IEEE: New York, NY, USA, 1999.
  45. Halperin, D.; Hu, W.; Sheth, A.; Wetherall, D. Predictable 802.11 packet delivery from wireless channel measurements. ACM SIGCOMM Comput. Commun. Rev. 2010, 40, 159–170. [Google Scholar] [CrossRef]
  46. Yang, Z.; Zhou, Z.; Liu, Y. From RSSI to CSI: Indoor localization via channel response. ACM Comput. Surv. 2013, 46, 1–32. [Google Scholar] [CrossRef]
  47. Nee, R.V.; Prasad, R. OFDM for Wireless Multimedia Communications; Artech House, Inc.: Norwood, MA, USA, 2000. [Google Scholar]
  48. Toh, K.A. Kernel and Range Approach to Analytic Network Learning. Int. J. Networked Distrib. Comput. 2018, 7, 20–28. [Google Scholar] [CrossRef]
  49. Schäfer, J.; Barrsiwal, B.R.; Kokhkharova, M.; Adil, H.; Liebehenschel, J. Human Activity Recognition Using CSI Information with Nexmon. Appl. Sci. 2021, 11, 8860. [Google Scholar] [CrossRef]
  50. Wang, F.; Feng, J.; Zhao, Y.; Zhang, X.; Zhang, S.; Han, J. Joint activity recognition and indoor localization with WiFi fingerprints. IEEE Access 2019, 7, 80058–80068. [Google Scholar] [CrossRef]
Figure 1. Pipeline of the fusion system.
Figure 2. (Left) CSI samples before preprocessing. (Right) CSI samples after cropping, resizing, and filtering. Each colored line represents a sample sequence.
Figure 3. Pipeline of preprocessing steps.
Figure 4. Comparison of our fusion method with methods in Table 1 for the HAR-RP database.
Figure 5. Comparison of our fusion method with methods in Table 1 for the HAR-RT database.
Figure 6. Comparison of our fusion method with methods in Table 1 for the HAR-ARIL database.
Table 1. Experimental scenarios.

| Method | Database | Remark |
|---|---|---|
| 1D-CNN, BiLSTM | HAR-RP: 3 volunteers × 7 activities × 20 samples = 420 samples (sit down, stand up, lie down, run, walk, fall, and bend) [39] | Raw CSI amplitude data with 52-dimensional vector |
| 2D-CNN | HAR-RP (as above) [39] | CSI signals converted to RGB images by pseudo-color map |
| LSTM | HAR-RT: 1084 samples with 6 activities (sit, sit down, stand, stand up, walk, and fall) [49] | Normalized raw CSI amplitude data with 256-dimensional vector |
| DTW+KNN, SVM-RBF | HAR-ARIL: 1394 samples with 6 activities (hand up, hand down, hand left, hand right, hand circle, and hand cross) [50] | Normalized raw CSI amplitude data with 52-dimensional vector |
Table 2. Experimental setup.

| Experiment | Brief Description | Database |
|---|---|---|
| I | Analysis of preprocessing parameters (cropping sizes, filtering band) with (a) LSE, (b) SVM, and (c) KNN | HAR-RP, HAR-RT, HAR-ARIL |
| II | Fusion of first-level LSE and SVM-RBF scores using LSE, SVM-RBF, KNN, and ANnet | HAR-RP, HAR-RT, HAR-ARIL |
| III | Comparison of proposed system with SOTA methods in Table 1 | HAR-RP, HAR-RT, HAR-ARIL |
Table 3. Experiment I(a): recognition accuracies (%) of LSE at various combinations of preprocessing parameters.

HAR-RP
| Size \ Band | 0.02 | 0.05 | 0.1 | 0.5 |
|---|---|---|---|---|
| 50 | 48.1 | 40.4 | 72.8 | 72.4 |
| 100 | 67.3 | 73.6 | 69.4 | 68.5 |
| 200 | 72.5 | 71.9 | 69.3 | 64.5 |
| 500 | 67.1 | 67.5 | 64.2 | 67.1 |

HAR-RT
| Size \ Band | 0.1 | 0.5 | 0.8 | 1.0 |
|---|---|---|---|---|
| 50 | 30.4 | 45.1 | 48.8 | 71.4 |
| 100 | 39.1 | 48.8 | 50.7 | 66.4 |
| 150 | 41.9 | 50.1 | 52.5 | 60.3 |
| 200 | 46.5 | 51.2 | 53.5 | 55.2 |

HAR-ARIL
| Size \ Band | 0.1 | 0.3 | 0.5 | 0.8 |
|---|---|---|---|---|
| 50 | 37.2 | 35.4 | 32.6 | 42.1 |
| 100 | 32.1 | 38.2 | 48.3 | 41.6 |
| 150 | 35.1 | 48.2 | 50.1 | 51.3 |
| 180 | 43.1 | 49.2 | 51.6 | 51.5 |
Table 4. Experiment I(b): recognition accuracies (%) of SVM-RBF at various combinations of preprocessing parameters.

HAR-RP
| Size \ Band | 0.02 | 0.05 | 0.1 | 0.5 |
|---|---|---|---|---|
| 50 | 94.2 | 94.2 | 96.3 | 95.4 |
| 100 | 95.1 | 95.6 | 95.4 | 95.1 |
| 200 | 94.5 | 94.8 | 94.9 | 94.5 |
| 500 | 94.2 | 94.5 | 94.6 | 94.2 |

HAR-RT
| Size \ Band | 0.1 | 0.5 | 0.8 | 1.0 |
|---|---|---|---|---|
| 50 | 69.5 | 81.1 | 81.1 | 95.4 |
| 100 | 78.8 | 81.5 | 81.4 | 92.6 |
| 150 | 81.1 | 81.5 | 81.6 | 88.0 |
| 200 | 81.1 | 81.5 | 81.6 | 79.2 |

HAR-ARIL
| Size \ Band | 0.1 | 0.3 | 0.5 | 0.8 |
|---|---|---|---|---|
| 50 | 66.7 | 72.5 | 72.1 | 68.4 |
| 100 | 68.9 | 70.3 | 73.5 | 69.1 |
| 150 | 69.2 | 68.2 | 68.1 | 69.1 |
| 180 | 69.5 | 68.1 | 68.4 | 68.7 |
Table 5. Summary of chosen parameters based on Table 3, Table 4 and Table 6.

| Database | Preprocessing | Cropping and Resizing | Normalized Pass-Band |
|---|---|---|---|
| HAR-RP | process1 | 50 | 0.1 |
| HAR-RP | process2 | 100 | 0.05 |
| HAR-RT | process1 | 50 | 1.0 |
| HAR-RT | process2 | 100 | 1.0 |
| HAR-ARIL | process1 | 50 | 0.3 |
| HAR-ARIL | process2 | 100 | 0.5 |
Table 6. Experiment I(c): recognition accuracies (%) of KNN at various combinations of preprocessing parameters.

HAR-RP
| Size \ Band | 0.02 | 0.05 | 0.1 | 0.5 |
|---|---|---|---|---|
| 50 | 89.2 | 90.4 | 92.6 | 90.8 |
| 100 | 92.1 | 92.3 | 90.4 | 91.8 |
| 200 | 91.6 | 89.2 | 92.0 | 90.6 |
| 500 | 92.0 | 90.8 | 91.8 | 90.4 |

HAR-RT
| Size \ Band | 0.1 | 0.5 | 0.8 | 1.0 |
|---|---|---|---|---|
| 50 | 56.7 | 63.2 | 72.4 | 82.3 |
| 100 | 54.5 | 60.4 | 68.4 | 76.7 |
| 150 | 52.3 | 55.2 | 52.5 | 65.1 |
| 200 | 50.3 | 54.5 | 53.5 | 60.2 |

HAR-ARIL
| Size \ Band | 0.1 | 0.3 | 0.5 | 0.8 |
|---|---|---|---|---|
| 50 | 63.9 | 65.7 | 63.8 | 62.5 |
| 100 | 62.4 | 64.5 | 64.9 | 63.5 |
| 150 | 63.5 | 64.1 | 63.5 | 63.2 |
| 180 | 63.1 | 64.2 | 63.7 | 63.9 |
Table 7. Experiment II: performance comparison of fusion methods (accuracy, %). Each row lists the two first-level classifiers with their input data and individual accuracies (in parentheses), followed by the accuracies of the four score-level fusion classifiers.

HAR-RP (first-level: LSE and SVM)
| Transform | First-Level 1 (Acc.) | First-Level 2 (Acc.) | LSE | SVM | KNN | ANnet |
|---|---|---|---|---|---|---|
| W/O | LSE on process1 (52.3) | SVM on process2 (95.2) | 52.3 | 86.9 | 82.4 | 88.0 |
| W/O | LSE on process2 (69.0) | SVM on process1 (94.0) | 69.0 | 91.7 | 89.1 | 90.4 |
| W | LSE on process1 (92.8) | SVM on process2 (96.4) | 92.8 | 97.6 | 95.4 | 97.6 |
| W | LSE on process2 (90.4) | SVM on process1 (94.0) | 94.0 | 94.0 | 94.0 | 95.2 |

HAR-RT (first-level: SVM and SVM)
| Transform | First-Level 1 (Acc.) | First-Level 2 (Acc.) | LSE | SVM | KNN | ANnet |
|---|---|---|---|---|---|---|
| W/O | SVM on process1 (95.4) | SVM on process2 (92.6) | 93.5 | 96.3 | 94.7 | 95.4 |
| W | SVM on process1 (95.4) | SVM on process2 (92.6) | 94.5 | 96.4 | 95.2 | 95.4 |

HAR-RT (first-level: LSE and SVM)
| Transform | First-Level 1 (Acc.) | First-Level 2 (Acc.) | LSE | SVM | KNN | ANnet |
|---|---|---|---|---|---|---|
| W/O | LSE on process1 (71.4) | SVM on process2 (92.6) | 79.2 | 93.5 | 90.3 | 92.6 |
| W/O | LSE on process2 (66.3) | SVM on process1 (95.4) | 79.7 | 94.5 | 91.5 | 94.0 |
| W | LSE on process1 (76.5) | SVM on process2 (92.6) | 92.1 | 93.0 | 92.6 | 92.6 |
| W | LSE on process2 (76.9) | SVM on process1 (95.4) | 88.9 | 92.1 | 89.3 | 90.8 |

HAR-ARIL (first-level: SVM and SVM)
| Transform | First-Level 1 (Acc.) | First-Level 2 (Acc.) | LSE | SVM | KNN | ANnet |
|---|---|---|---|---|---|---|
| W/O | SVM on process1 (72.5) | SVM on process2 (73.5) | 72.5 | 78.4 | 74.2 | 75.5 |
| W | SVM on process1 (80.1) | SVM on process2 (82.3) | 81.7 | 83.7 | 82.5 | 83.8 |

HAR-ARIL (first-level: KNN and SVM)
| Transform | First-Level 1 (Acc.) | First-Level 2 (Acc.) | LSE | SVM | KNN | ANnet |
|---|---|---|---|---|---|---|
| W/O | KNN on process1 (65.7) | SVM on process2 (73.5) | 73.6 | 75.2 | 74.7 | 75.2 |
| W/O | KNN on process2 (64.9) | SVM on process1 (72.5) | 72.5 | 74.8 | 73.1 | 72.6 |
| W | KNN on process1 (70.8) | SVM on process2 (82.3) | 74.3 | 82.9 | 82.4 | 82.5 |
| W | KNN on process2 (69.6) | SVM on process1 (80.1) | 73.0 | 81.4 | 80.5 | 81.5 |
Table 8. Experiment III: training processing time.

| Database | Method | Execution Time (s) |
|---|---|---|
| HAR-RP | ANnet-fusion | 15.7 |
| HAR-RP | 1D-CNN | 47.2 |
| HAR-RP | 2D-CNN | 59.8 |
| HAR-RP | BiLSTM | 67.9 |
| HAR-RT | ANnet-fusion | 58.3 |
| HAR-RT | LSTM | 376.1 |
| HAR-ARIL | ANnet-fusion | 18.6 |
| HAR-ARIL | DTW+KNN | 3356 |
| HAR-ARIL | SVM-RBF | 69 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
