Weak-Annotation of HAR Datasets using Vision Foundation Models
Marius Bock, Kristof Van Laerhoven, and Michael Moeller
Figure 1: Our proposed weak-annotation pipeline: visual embeddings extracted using Vision Foundation Models are clustered using Gaussian Mixture Models (GMMs). To decrease the required labelling effort, a human annotator is asked to annotate only each cluster's centroid video clip. Centroid labels are then propagated within each cluster. Transferred to the corresponding IMU data, the resulting weakly-annotated datasets can be used to train subsequent classifiers.
ABSTRACT
As wearable-based data annotation remains, to date, a tedious, time-consuming task requiring researchers to dedicate substantial time, benchmark datasets within the field of Human Activity Recognition lack richness and size compared to datasets available within related fields. Recently, vision foundation models such as CLIP have gained significant attention, helping the vision community advance in finding robust, generalizable feature representations. With the majority of researchers within the wearable community relying on vision modalities to overcome the limited expressiveness of wearable data and accurately label their to-be-released benchmark datasets offline, we propose a novel, clustering-based annotation pipeline to significantly reduce the amount of data that needs to be annotated by a human annotator. We show that using our approach, the annotation of centroid clips suffices to achieve average labelling accuracies close to 90% across three publicly available HAR benchmark datasets. Using the weakly annotated datasets, we further demonstrate that we can match the accuracy scores of fully-supervised deep learning classifiers across all three benchmark datasets. Code as well as supplementary figures and results are publicly downloadable via github.com/mariusbock/weak_har.

KEYWORDS
Data Annotation; Human Activity Recognition; Body-worn Sensors

1 INTRODUCTION
Though the automatic recognition of activities through wearable data has been identified as valuable information for numerous research fields [9], currently available wearable activity recognition benchmark datasets lack richness and size compared to datasets available within related fields. Compared with, for example, the newly released Ego4D dataset [17], it becomes apparent that currently used datasets within the inertial-based Human Activity Recognition (HAR) community are significantly smaller in terms of the number of participants, length of recordings, and variety of performed activities. One of the main drivers for this is that, even though body-worn sensing approaches allow recording large amounts of data with only minimal impact on users in various situations in daily life, wearable-based data annotation remains, to date, a tedious, time-consuming task and requires researchers to dedicate substantial time to it during data collection (taking up to 14 to 20 times longer than the actual recorded data [30]).
CCS CONCEPTS
• Human-centered computing → Ubiquitous and mobile computing design and evaluation methods.

Following the success in other vision-related fields, researchers within the video activity recognition community have made use of feature extraction methods which provide latent representations of video clips rather than using raw image data [23, 33]. Such feature extraction methods are usually pretrained on existing large benchmark corpora; though these corpora are often not particularly related to the task at hand, the methods are capable of transferring knowledge to the activity recognition task. Recently, vision foundation models [28, 29] have gained a lot of attention. Typically trained on a large amount of curated and uncurated benchmark datasets, these models have helped the community further advance in finding robust, generalizable visual feature representations.

With the majority of researchers within the wearable activity recognition community relying on the vision modality to overcome the lacking expressiveness of wearable data and to accurately label their to-be-released benchmark datasets offline (see e.g. [11, 18, 30]), we propose a novel annotation pipeline which makes use of visual embeddings extracted using pretrained foundation models to significantly limit the amount of data which needs to be annotated by a human annotator. Our contributions are three-fold:

(1) We find that visual embeddings extracted using publicly-available vision foundation models can be clustered activity-wise.
(2) We show that annotating only one clip per cluster suffices to achieve average labelling accuracies above 60% and close to 90% across three publicly available HAR benchmark datasets.
(3) We demonstrate that using the weakly annotated datasets, one is capable of matching accuracy scores of fully-supervised deep learning classifiers across all three benchmark datasets.

2 RELATED WORK
Vision Foundation Models. The term foundation models was coined by Devlin et al. [14] and refers to models which are pretrained on a large selection of datasets. The idea of pretraining models on large benchmark datasets has been prominent within the vision community for a long time. Within the video classification community, researchers demonstrated that pretrained methods such as I3D [10], VideoSwin [24] or SlowFast [15] extract discriminative feature embeddings which can be used to train subsequent classifiers. Following their success in Natural Language Processing [8, 14], researchers applied masked autoencoders to visual data input. Unlike previous methods, masked autoencoders are capable of pretraining themselves in a self-supervised manner, allowing the use of larger data sources. Two such methods are CLIP [29] and DINOv2 [28]. The former, published by OpenAI, is a vision-language model which tries to learn the alignment between text and images. According to the authors, CLIP is pretrained on a large corpus of image-text pairs scraped from the world wide web. Similarly, the recently released DINOv2 by META AI makes an effort to provide a foundation model which is capable of extracting general-purpose visual features, for which the authors collected data from curated and uncurated sources.

Weakly-Supervised Wearable HAR. With the activity labelling of body-worn sensor data being a tedious task, many researchers have looked at weakly-supervised learning techniques to reduce the required amount of annotations to train subsequent classifiers. Early works such as that of Stikic et al. [34] have shown how to reduce the labelling effort for training classical machine learning models through knowledge-driven approaches using graph-based label propagation [35], multi-instance learning [36] or probabilistic methods [16, 38]. Adaimi and Thomaz [2] followed the works of [34], proposing an active learning framework which focuses on asking users to label only the data which will gain the most classification performance boost. With the rise in popularity of Deep Learning, deep clustering algorithms have been proposed to cluster latent representations in unsupervised and semi-supervised fashion using e.g. autoencoders [4, 5, 22, 26, 42], recurrent networks [1, 16], self-supervised [31] and contrastive learning [3, 12, 20, 40, 43]. Recently, Xia et al. [43] and Tong et al. [40] demonstrated how vision foundation models such as CLIP [29] and I3D [10] can be used to create visual, complementary embeddings to inertial data such that a contrastive loss can be calculated. This work marks one of the few instances of researchers trying to use visual data to limit the amount of annotations required in wearable activity recognition. Our work ties into the works of Tong et al. [40] and Xia et al. [43], yet we propose instead to apply vision foundation models to perform automatic label propagation between similar embeddings.

3 METHODOLOGY
3.1 Annotation Pipeline
Latent Space Clustering via Vision Foundation Models. Within the first phase we divide the unlabeled dataset into (overlapping) video clips. Given an input video stream X of a sample participant, we apply a sliding window approach which shifts over X, dividing the input data into video clips, e.g. of four-second duration with a 75% overlap between consecutive windows. This process results in X = {x_1, x_2, ..., x_T} being discretized into time steps t ∈ {1, ..., T}, where T is the number of windows, i.e. video clips, per participant. Inspired by classification approaches originating from the temporal action localization community, we make use of pretrained vision foundation models to extract latent representations of each clip. That is, x_t ∈ R^E represents a one-dimensional feature embedding vector associated with the video clip at time step t, where E is the number of latent features the embedding vector consists of. In total we evaluated three popular pretrained foundation models: a two-stream inflated 3D-ConvNet (I3D) [10], pretrained on the RGB and optical flow features extracted from the Kinetics-400 dataset [19], as well as two transformer foundation models, CLIP [29] and DINOv2 [28], which were pretrained on a multitude of curated and uncurated data sources. Note that, unlike Carreira and Zisserman in [10], we use RAFT [39] instead of TV-L1 [37] optical flow estimation. As the CLIP and DINOv2 models are not explicitly trained on optical flow features, we also test complementing the embeddings of the two models by concatenating them with extracted embeddings of the inflated 3D-ConvNet trained on RAFT optical flow features of the Kinetics dataset. In order to obtain latent representations, we altered the models such that intermediate feature representations can be extracted. Table 1 details which layer's activations were considered to be the embedding of each pretrained method, as well as their dimension. To merge the frame-wise features outputted by the CLIP and DINOv2 models, we apply average pooling as detailed in [25] to obtain a single latent representation per sliding video clip.
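The first phase described above, together with the centroid-based label propagation and distance thresholding that Section 3.1 details next, can be sketched in plain Python. This is a minimal sketch, not the paper's implementation: it assumes per-frame embeddings have already been extracted, it uses diagonal covariances for brevity (the paper assumes a general covariance per GMM component), and all function names are illustrative. In practice, the cluster assignments, means and covariances would come from a fitted mixture model (e.g. one scikit-learn GaussianMixture per participant), and `annotate` stands in for the human annotator.

```python
import math

def sliding_clips(n_frames, fps, clip_sec=4.0, overlap=0.75):
    """Divide a video stream into overlapping clips (e.g. 4 s windows with
    75% overlap), returning (start_frame, end_frame) index pairs."""
    win = int(clip_sec * fps)
    stride = max(1, int(win * (1.0 - overlap)))
    return [(s, s + win) for s in range(0, n_frames - win + 1, stride)]

def pool_clip_embedding(frame_embeddings):
    """Average-pool per-frame feature vectors into a single clip embedding."""
    n = len(frame_embeddings)
    return [sum(f[d] for f in frame_embeddings) / n
            for d in range(len(frame_embeddings[0]))]

def diag_log_density(x, mean, var):
    """log N(x; mean, diag(var)). Diagonal covariance for brevity; the
    paper allows a general covariance matrix per mixture component."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def propagate_labels(embeddings, assignments, means, variances, annotate,
                     max_dist=None):
    """Pick each cluster's highest-density clip as its 'centroid clip',
    obtain one label for it via annotate(clip_index), and propagate that
    label to all cluster members. Members farther than max_dist (L2 norm)
    from the centroid clip are dropped from the labelling (label None)."""
    labels = [None] * len(embeddings)
    for c in sorted(set(assignments)):
        members = [i for i, a in enumerate(assignments) if a == c]
        centroid = max(members, key=lambda i: diag_log_density(
            embeddings[i], means[c], variances[c]))
        label = annotate(centroid)  # one human annotation per cluster
        for i in members:
            if max_dist is None or math.dist(embeddings[i],
                                             embeddings[centroid]) <= max_dist:
                labels[i] = label
    return labels
```

With 30 fps video, `sliding_clips(300, 30)` yields seven four-second windows whose start points lie one second apart, matching the 75% overlap described above.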
Table 1: Network layer used for extracting embeddings of the different vision foundation models [10, 28, 29]. Subsequent layers are omitted such that the network outputs latent representations at the point of the embedding layer. Note that the I3D network is used for extracting both RGB and flow features, and that we refer to the vision-based part of the CLIP model.

Model | Embedding Layer | Dimension
I3D | last average pool layer | R^1024
CLIP | last projection layer (vision-CLIP) | R^768
DINOv2 | last layer hidden state (clf. token) | R^1024

Having extracted latent representations of each video clip within the tested benchmark datasets, we apply Gaussian Mixture Models (GMMs) to cluster the embeddings on a per-participant level. Though GMMs are not originally intended to be used with high-dimensional data, they have been shown to provide good results clustering visual embeddings, especially in the context of action recognition [21, 41], and, unlike methods such as k-nearest neighbors, allow more flexibility regarding the shape of clusters. Training one GMM clustering algorithm per participant and applying it to said participant's embeddings assigns each video clip x_t ∈ R^E a cluster label x_c ∈ {1, ..., C}, where C is the number of GMM components, i.e. clusters, applied.

Weak-Labeling via Centroid Clips. Once each video clip of a study participant has been assigned a cluster label x_c, the second phase of our approach consists of a human annotator only needing to annotate one sample clip per cluster. Assuming the centroid of a cluster is most representative of all clips within that cluster, we can propagate the activity label a ∈ {1, ..., A} of said clip to all other clips, eliminating the need to annotate the other clips via a human annotator, where A is the number of activities within the dataset. As GMMs do not explicitly provide a definition of the centroid of a component, we calculate the centroid clip of each cluster component as the clip which has the highest density within said cluster. That is, given the covariance matrix Σ ∈ R^(E×E) of each mixture component (assuming each component has its own general covariance matrix) and its mean vector μ ∈ R^E, we calculate the density of each point as the logarithm of the probability density function of the multivariate normal distribution defined by μ and Σ. Having identified the centroid clip within each cluster, our approach propagates the annotation provided by the human annotator to all other clips which were also assigned to that cluster.

As our approach forces each video clip to be assigned an activity label, we augment our clustering with a subsequent distance-based thresholding in order to remove outlier clips from the automatic labelling. Assuming that the distance of a clip to the centroid resembles its likelihood of belonging to the same activity class, we omit clips from the dataset which exceed a certain distance from their respective centroid clip, with the distance being calculated as the L2-norm between two embedding vectors. Even though this approach decreases the amount of data which can be used to train subsequent classification algorithms, we show that it increases the overall labelling accuracy by a significant margin.

3.2 Weakly-supervised Training
Assuming inertial and video data are synchronised, we further evaluate how well the resulting annotated inertial data with non-uniform label noise is suited for training inertial-based deep learning classifiers. As our benchmark algorithms of choice we use two recently published state-of-the-art methods, namely the Shallow DeepConvLSTM [6] and the TinyHAR architecture [45]. We use both architectures as originally introduced by the authors, specifically using the same size and number of convolutional filters, convolutional layers and size of the recurrent layers. During training we apply a sliding window of one second with an overlap of 50%, as this proved to provide consistent classification performances across a multitude of HAR datasets [6]. We train each network for 30 epochs using the Adam optimizer (learning rate 1e-4 and weight decay 1e-6), applying a step-wise learning rate schedule with a decay factor of 0.9 after every 10 epochs. To mitigate the label noise introduced by our proposed weak-annotation pipeline, we calculate the loss during training using the weighted partially Huberised generalised cross-entropy (PHGCE) loss [27], which extends the definition of the generalized cross-entropy loss [44] with a variant of gradient clipping. To assess the validity of our approach, we compare amongst a set of (weakly-)annotated training approaches:

(1) Fully-supervised: Fully-supervised training using the original, fully-annotated benchmark datasets.
(2) Few-Shot-CE: Fully-supervised training using only the annotated clips and a weighted cross-entropy loss.
(3) Random-CE: Training using an equal amount of randomly annotated clips as in (2) and a weighted cross-entropy loss.
(4) Weak-CE: Weakly-supervised training using the weakly-annotated dataset and a weighted cross-entropy loss.
(5) Weak-PHGCE: Weakly-supervised training using the weakly-annotated dataset and a weighted PHGCE loss.

Table 2: Average labeling accuracy and standard deviation across study participants using different types and combinations of embeddings [10, 28, 29] extracted from three benchmark datasets [7, 13, 32], applying a GMM-based clustering using 100 clusters. Overall, a combination of CLIP and optical flow embeddings proved most consistent across all datasets.

Embedding | WEAR | Wetlab | ActionSense
(1) I3D | 82.62 (±4.65) | 66.08 (±9.53) | 53.47 (±5.95)
(2) CLIP | 82.47 (±6.03) | 72.70 (±6.42) | 59.85 (±4.42)
(3) DINOv2 | 79.20 (±4.04) | 69.28 (±8.12) | 60.25 (±4.04)
(4) RAFT | 76.86 (±4.79) | 51.50 (±6.96) | 45.64 (±5.19)
(1) + (4) | 85.17 (±4.48) | 60.91 (±8.36) | 53.00 (±4.66)
(2) + (4) | 83.96 (±4.99) | 66.23 (±7.86) | 57.29 (±5.64)
(3) + (4) | 79.13 (±4.30) | 70.18 (±9.79) | 56.55 (±4.51)

3.3 Datasets
WEAR. The WEAR dataset offers both inertial and egocentric video data of 18 participants performing a variety of 18 sports-related activities, including different styles of running, stretching, and
strength-based exercises. Recordings took place at changing outdoor locations. Each study participant was equipped with a head-mounted camera and four smartwatches, one worn on each limb in a fixed orientation, which captured 3D-accelerometer data.

ActionSense. Published by DelPreto et al. [13], the ActionSense dataset provides a multitude of sensors capturing data within an indoor, artificial kitchen setup. Amongst the sensors, participants wore Inertial Measurement Units (IMUs) on both wrists as well as smart glasses which captured the ego-view of each participant. During recordings, participants were tasked to perform various kitchen chores including chopping foods, setting a table and (un)loading a dishwasher. Within their original publication, the authors provide annotations of 19 activities of 10 participants. Note that the dataset download of the ActionSense dataset provides IMU and egocentric video data of only 9 instead of 10 participants.

Wetlab. Taking place in a wetlab laboratory environment, the Wetlab dataset [32] comprises data of 22 study participants who performed two DNA extraction experiments. For purposes of this paper we used the annotations provided by the authors of the reoccurring base activities (such as stirring, cutting, etc.) within the experimental protocol. During recordings, each participant wore a smartwatch in a fixed orientation on the wrist of their dominant hand, which captured 3D-accelerometer data. Unlike the WEAR and ActionSense dataset, the Wetlab dataset provides video data of a static camera which was mounted above the table at which the experiment was performed, thus capturing a birds-eye perspective of the experiment's surroundings.

Figure 2: Box-plot diagrams showing the distribution of labelling accuracies across study participants with increasing number of clusters. The bar plot below the box-plots provides, per cluster setting, the percentage of data an annotator would need to annotate compared to the total size of the three benchmark datasets [7, 13, 32]. One can see a clear trend that, with an increase in clusters, labelling accuracy increases while the deviation across study participants decreases.

4 RESULTS
To ensure that reported performance differences are not based on statistical variance, all reported experiments are repeated three times, applying a set of three predefined random seeds. This applies both for the annotation pipeline experiments as well as the weakly-supervised training results. During all annotation-based experiments mentioned in Section 4.1 we apply a clip length of four seconds along with a three-second overlap between clips. We assume that this clip length is suitable to be interpretable for a human annotator while simultaneously avoiding mixing multiple activities into one sliding window. Furthermore, during all experiments only the label of the centroid clip is propagated to all other cluster instances. Ablation experiments evaluating different clip lengths and different numbers of annotated clips per cluster used to determine the label to be propagated can be found within our code repository.

4.1 Annotation Pipeline
Table 2 shows the labelling accuracy averaged across participants obtained when applying our proposed annotation pipeline using various types of extracted visual embeddings. One can see that in case of the WEAR [7] and ActionSense dataset [13], labelling accuracy can be improved by combining both RGB and optical flow features in case of all embeddings. Overall, a combination of CLIP and optical flow features proves to be most consistent across our three benchmark datasets of choice, making it thus our embedding of choice for subsequent experiments. Applying a labelling strategy of only annotating the centroid clip of each cluster, Figure 2 presents a box-plot visualization of applying different numbers of clusters during the clustering of the participant-wise embeddings. One can see that by only annotating 100 clips per study participant, our proposed annotation pipeline is capable of reaching labelling accuracies above 85% in case of the WEAR and close to 70% in case of the Wetlab [32] and ActionSense dataset.

Furthermore, as evident by an overall shrinking boxplot with increasing number of clusters, our approach becomes more stable, with the standard deviation across study participants decreasing in case of all three datasets. As Figure 2 shows, applying a clustering of C = A, i.e. as many clusters as there are activities in the dataset, results in the clustering not being capable of differentiating the normal and complex variations of activities, different running styles and the null-class from all other activities. In general, we witness a trend that by applying a larger amount of clusters than activities present in the dataset, one gives the GMM clustering enough degrees of freedom to differentiate even activities which share similarities, yet slightly differ from each other. Lastly,
distance thresholding clusters and excluding instances which exceed a certain distance from their respective centroid helps increase the labelling accuracy significantly across all datasets. While a threshold of 4 helps increase the labelling accuracy well above 75% and even up to 93% in case of the WEAR dataset, the thresholding omits between 50% and up to 90% of the datasets.

4.2 Weakly-Supervised Training
As a combination of CLIP and optical-flow-based features proved to be most stable across all three datasets, we chose to use said embedding as the basis for our weakly-supervised training. Table 3 provides an overview across the eight evaluated training scenarios. Our proposed weakly-supervised training is not only capable of outperforming the few-shot training using only the annotated centroid clips, but in the case of applying 100 clusters comes close to matching accuracy scores of a fully-supervised training across all three benchmark datasets, for both inertial-based architectures. Compared to a normal cross-entropy loss, the PHGCE loss provides more stable results in case of higher label noise, e.g., when not applying a distance-based thresholding and/or applying a smaller number of clusters. In general, the distance-based thresholding significantly improved results across all datasets. Although thresholding significantly reduces the amount of training data, the resulting decrease in overall labelling noise, especially for approaches that applied a lower number of clusters, improved classification results. We provide a detailed overview of the influence of thresholding on labelling accuracy and dataset size within the paper's code repository.

Table 3: Deep learning results of applying two inertial-based models [6, 45] on various weakly-annotated versions of three public datasets [7, 13, 32]. Training using weakly-annotated datasets outperformed both few-shot training using only the annotated data as well as an equal amount of randomly annotated clips. With an increase in the number of clusters, our weakly-supervised approach comes close to matching the predictive performance of fully-supervised baselines while having manually annotated only a fraction of the actual dataset. The suffix T-6 (T-4) refers to training applying a threshold of 6 (4). Cells list Acc/F1.

WEAR | DeepConvLSTM c=19 | c=50 | c=100 | TinyHAR c=19 | c=50 | c=100
Fully-supervised | 79.89/78.36 | 79.89/78.36 | 79.89/78.36 | 77.83/71.89 | 77.83/71.89 | 77.83/71.89
Few-Shot-CE | 37.41/24.76 | 59.58/46.25 | 65.61/53.51 | 37.41/26.55 | 59.58/46.25 | 65.61/53.51
Random-CE | 45.90/31.13 | 59.46/46.98 | 65.91/53.38 | 23.73/23.73 | 59.72/46.34 | 66.27/55.00
Weak-CE | 42.55/34.09 | 64.17/54.59 | 73.38/63.23 | 49.45/38.75 | 66.68/54.05 | 71.10/59.65
Weak-PHGCE | 48.62/35.45 | 70.34/55.27 | 76.15/63.43 | 51.46/39.23 | 68.53/55.40 | 73.37/61.68
Weak-CE-T-6 | 59.70/46.63 | 73.06/60.60 | 76.28/66.13 | 57.22/46.71 | 68.19/55.56 | 72.03/60.31
Weak-PHGCE-T-6 | 59.39/44.97 | 73.17/58.84 | 77.55/64.77 | 58.78/46.27 | 69.47/56.29 | 74.05/61.67
Weak-CE-T-4 | 68.86/57.33 | 74.72/63.93 | 77.81/68.22 | 65.68/55.64 | 71.31/60.16 | 73.93/63.42
Weak-PHGCE-T-4 | 61.35/47.00 | 74.45/60.61 | 76.64/64.81 | 62.90/50.46 | 72.25/60.84 | 74.83/63.94

Wetlab | DeepConvLSTM c=9 | c=50 | c=100 | TinyHAR c=9 | c=50 | c=100
Fully-supervised | 45.27/38.64 | 45.27/38.64 | 45.27/38.64 | 38.75/28.85 | 38.75/28.85 | 38.75/28.85
Few-Shot-CE | 15.60/11.39 | 21.89/16.46 | 22.78/17.38 | 15.18/11.62 | 22.43/16.14 | 25.92/18.27
Random-CE | 16.33/8.95 | 26.48/18.62 | 26.50/20.05 | 34.23/24.38 | 27.74/17.79 | 29.70/19.37
Weak-CE | 16.97/8.53 | 32.57/25.78 | 36.51/29.72 | 23.90/14.70 | 34.23/24.38 | 36.30/25.76
Weak-PHGCE | 18.62/10.48 | 27.64/23.17 | 33.78/27.62 | 24.06/15.19 | 33.79/24.35 | 35.53/25.53
Weak-CE-T-6 | 23.01/18.12 | 33.78/27.08 | 38.41/29.77 | 26.14/18.63 | 33.79/24.21 | 36.20/25.76
Weak-PHGCE-T-6 | 20.26/13.72 | 28.77/23.90 | 32.29/26.71 | 25.22/18.41 | 33.34/24.25 | 34.93/25.17
Weak-CE-T-4 | 23.54/18.71 | 32.02/26.37 | 35.25/28.95 | 25.15/17.91 | 32.85/23.84 | 34.64/25.04
Weak-PHGCE-T-4 | 21.82/15.81 | 28.33/23.91 | 31.20/25.64 | 23.82/17.29 | 32.39/23.72 | 33.66/24.70

ActionSense | DeepConvLSTM c=20 | c=50 | c=100 | TinyHAR c=20 | c=50 | c=100
Fully-supervised | 20.73/14.82 | 20.73/14.82 | 20.73/14.82 | 22.19/18.04 | 22.19/18.04 | 22.19/18.04
Few-Shot-CE | 6.47/2.58 | 8.14/4.92 | 9.96/6.08 | 6.19/2.32 | 8.50/4.00 | 11.15/6.72
Random-CE | 4.69/1.63 | 9.42/5.09 | 10.02/5.77 | 5.01/1.70 | 7.80/4.19 | 11.44/6.50
Weak-CE | 12.32/7.67 | 13.91/9.94 | 17.35/12.20 | 13.74/9.75 | 15.58/11.55 | 17.16/13.45
Weak-PHGCE | 11.65/6.34 | 12.34/7.43 | 11.62/7.02 | 14.35/9.43 | 13.88/9.25 | 14.52/10.09
Weak-CE-T-6 | 14.67/8.81 | 15.95/10.07 | 15.29/10.69 | 13.35/9.32 | 15.70/11.22 | 17.53/13.15
Weak-PHGCE-T-6 | 9.91/4.92 | 10.57/5.44 | 11.78/6.79 | 12.17/7.52 | 13.72/8.16 | 14.53/9.01
Weak-CE-T-4 | 10.40/7.15 | 12.31/7.10 | 12.08/8.80 | 10.80/8.02 | 13.65/9.24 | 15.03/11.16
Weak-PHGCE-T-4 | 8.34/3.86 | 8.92/4.35 | 9.91/5.56 | 9.24/5.82 | 11.07/6.09 | 12.11/7.35
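The noise-robust loss referenced above can be illustrated on a single true-class probability p. The generalized cross-entropy follows [44]; the partial Huberisation is approximated here by replacing the loss below a probability threshold p0 with its tangent line, which bounds the gradient in the spirit of the gradient-clipping view of [27]. The values q = 0.7 and p0 = 0.2, and this exact linearisation, are illustrative assumptions; the weighted PHGCE used in the paper may differ:

```python
def gce(p, q=0.7):
    """Generalized cross-entropy [44]: (1 - p^q) / q for the true-class
    probability p; recovers CE as q -> 0 and MAE at q = 1."""
    return (1.0 - p ** q) / q

def phgce(p, q=0.7, p0=0.2):
    """Sketch of a partially Huberised GCE: below the threshold p0 the
    loss is replaced by its tangent line at p0, bounding the gradient
    (a form of gradient clipping [27]). NOTE: q, p0 and this exact
    linearisation are illustrative, not the paper's weighted PHGCE."""
    if p >= p0:
        return gce(p, q)
    slope = -p0 ** (q - 1.0)  # derivative of GCE at p0
    return gce(p0, q) + slope * (p - p0)
```

Since GCE is convex in p, the tangent line lies below it for p < p0: heavily mislabelled samples then contribute a bounded gradient of magnitude p0^(q-1) instead of one growing like p^(q-1) as p approaches 0.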
than vice versa. This caused the activities transfer and pouring, two classes which only have a few instances in the ground truth data and which are most frequently annotated incorrectly as NULL-class, to not be predicted correctly once across all study participants. Note that for the ActionSense dataset, classification results even in the fully-supervised setting are significantly worse compared to those of the other two datasets. As evident by a nevertheless large labelling accuracy using our approach, we assume that the label semantics of the dataset are too vision-centric (e.g. peeling a cucumber or a potato) to be correctly recognized using only inertial data. Nevertheless, per-class classification results of the fully- vs. weakly-supervised training show a similar confusion amongst classes, suggesting the learned patterns of the classifier are similar in both training scenarios.

[Figure: confusion matrices on the WEAR dataset, comparing ground-truth labels against the predictions of the fully-supervised and the Weak-CE-T-4 training across all activity classes.]

5 DISCUSSION & CONCLUSION
Within this paper we presented a weak-annotation pipeline for HAR datasets based on Vision Foundation Models. We showed that visual embeddings extracted using Vision Foundation Models can be clustered using Gaussian Mixture Models (GMMs). Decreasing the required labelling effort, using the suggested pipeline a human annotator is only asked to annotate each cluster's centroid video clip. By propagating the provided labels within each cluster, our approach is capable of achieving average labelling accuracies above 60% and close to 90% across three popular HAR benchmark datasets. We further showed that the resulting weakly-annotated wearable datasets can be used to train subsequent deep learning classifiers, with accuracy scores, in case of applying a sufficiently large number of clusters, coming close to matching those of a fully-supervised training across all three benchmark datasets.

Our results underscore one of the implications of recent advancements in the vision community in finding generalizable feature
[5] Behrooz Azadi, Michael Haslgrübler, Bernhard Anzengruber-Tanase, Georgios Sopidis, and Alois Ferscha. 2024. Robust Feature Representation Using Multi-Task Learning for Human Activity Recognition. Sensors 24, 2 (2024), 681. https://doi.org/10.3390/s24020681
[6] Marius Bock, Alexander Hoelzemann, Michael Moeller, and Kristof Van Laerhoven. 2021. Improving Deep Learning for HAR With Shallow LSTMs. In ACM International Symposium on Wearable Computers. https://doi.org/10.1145/3460421.3480419
[7] Marius Bock, Hilde Kuehne, Kristof Van Laerhoven, and Michael Moeller. 2023. WEAR: An Outdoor Sports Dataset for Wearable and Egocentric Activity Recognition. CoRR abs/2304.05088 (2023). https://arxiv.org/abs/2304.05088
[8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[9] Andreas Bulling, Ulf Blanke, and Bernt Schiele. 2014. A Tutorial on Human Activity Recognition Using Body-Worn Inertial Sensors. Comput. Surveys 46, 3 (2014), 1–33. https://doi.org/10.1145/2499621
[10] Joao Carreira and Andrew Zisserman. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In IEEE Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/cvpr.2017.502
[11] Shing Chan, Hang Yuan, Catherine Tong, Aidan Acquah, Abram Schonfeldt, Jonathan Gershuny, and Aiden Doherty. 2024. CAPTURE-24: A large dataset of wrist-worn activity tracker data collected in the wild for human activity recognition. CoRR abs/2402.19229 (2024). https://doi.org/10.48550/arXiv.2402.19229
[12] Shohreh Deldari, Hao Xue, Aaqib Saeed, Daniel V. Smith, and Flora D. Salim. 2022. COCOA: Cross Modality Contrastive Learning for Sensor Data. ACM
… IEEE Transactions on Image Processing 31 (2022). https://doi.org/10.1109/TIP.2022.3195321
[24] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2022. Video Swin Transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/cvpr52688.2022.00320
[25] Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2022. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning. Neurocomputing 508 (2022), 293–304. https://doi.org/10.1016/j.neucom.2022.07.028
[26] Haojie Ma, Zhijie Zhang, Wenzhong Li, and Sanglu Lu. 2021. Unsupervised Human Activity Representation Learning with Multi-task Deep Clustering. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 5, 1 (2021), 1–25. https://doi.org/10.1145/3448074
[27] Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. 2020. Can gradient clipping mitigate label noise? In International Conference on Learning Representations. https://openreview.net/forum?id=rklB76EKPr
[28] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. 2024. DINOv2: Learning Robust Visual Features without Supervision. CoRR abs/2304.07193 (2024). https://doi.org/10.48550/arXiv.2304.07193
[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. CoRR abs/2103.00020 (2021). https://doi.org/10.48550/arXiv.2103.00020
[30] Daniel Roggen, Alberto Calatroni, Mirco Rossi, Thomas Holleczek, Kilian Förster, Gerhard Tröster, Paul Lukowicz, David Bannach, Gerald Pirkl, Alois Ferscha, Jakob Doppler, Clemens Holzmann, Marc Kurz, Gerald Holl, Ricardo Chavarriaga, Hesam Sagha, Hamidreza Bayati, Marco Creatura, and José del R. Millàn.
on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, 3 (2022), 1–28. 2010. Collecting Complex Activity Datasets in Highly Rich Networked Sensor
https://doi.org/10.1145/3550316 Environments. In IEEE Seventh International Conference on Networked Sensing
[13] Joseph DelPreto, Chao Liu, Yiyue Luo, Michael Foshey, Yunzhu Li, Antonio Systems. https://doi.org/10.1109/INSS.2010.5573462
Torralba, Wojciech Matusik, and Daniela Rus. 2022. ActionSense: A Multimodal [31] Aaqib Saeed, Tanir Ozcelebi, and Johan Lukkien. 2019. Multi-task Self-Supervised
Dataset and Recording Framework for Human Activities Using Wearable Sensors Learning for Human Activity Detection. ACM on Interactive, Mobile, Wearable
in a Kitchen Environment. In Neural Information Processing Systems Track on and Ubiquitous Technologies 3, 2 (2019), 1–30. https://doi.org/10.1145/3328932
Datasets and Benchmarks. https://action-sense.csail.mit.edu [32] Philipp M. Scholl, Matthias Wille, and Kristof Van Laerhoven. 2015. Wearables
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina N. Toutanova. 2019. in the Wet Lab: A Laboratory System for Capturing and Guiding Experiments.
BERT: Pre-training of Deep Bidirectional Transformers for Language Under- In ACM International Joint Conference on Pervasive and Ubiquitous Computing.
standing. In Confernce of the North American Chapter of the Association for https://doi.org/10.1145/2750858.2807547
Computational Linguistics. https://arxiv.org/abs/1810.04805 [33] Dingfeng Shi, Yujie Zhong, Qiong Cao, Lin Ma, Jia Li, and Dacheng Tao.
[15] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. Slow- 2023. TriDet: Temporal Action Detection With Relative Boundary Modeling.
Fast Networks for Video Recognition. In International Conference on Computer In IEEE/CVF Conference on Computer Vision and Pattern Recognition. https:
Vision. https://doi.org/10.1109/ICCV.2019.00630 //doi.org/10.1109/cvpr52729.2023.01808
[16] Marjan Ghazvininejad, Hamid R. Rabiee, Nima Pourdamghani, and Parisa [34] Maja Stikic, Diane Larlus, Sandra Ebert, and Bernt Schiele. 2011. Weakly Super-
Khanipour. 2011. HMM based semi-supervised learning for activity recogni- vised Recognition of Daily Life Activities with Wearable Sensors. IEEE Trans-
tion. In ACM International Workshop on Situation Activity & Goal Awareness. actions on Pattern Analysis and Machine Intelligence 33, 12 (2011), 2521–2537.
https://doi.org/10.1145/2030045.2030065 https://doi.org/10.1109/tpami.2011.36
[17] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino [35] Maja Stikic, Diane Larlus, and Bernt Schiele. 2009. Multi-graph Based Semi-
Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, and Xingyu supervised Learning for Activity Recognition. In IEEE International Symposium
Liu. 2022. Ego4D: Around the World in 3,000 Hours of Egocentric Video. In on Wearable Computers. https://doi.org/10.1109/ISWC.2009.24
IEEE/CVF Conference on Computer Vision and Pattern Recognition. https://doi. [36] Maja Stikic and Bernt Schiele. 2009. Activity Recognition from Sparsely Labeled
org/10.1109/CVPR52688.2022.01842 Data Using Multi-Instance Learning. In Springer International Symposium on
[18] Alexander Hoelzemann, Julia L. Romero, Marius Bock, Kristof Van Laerhoven, Location- and Context-Awareness. https://doi.org/10.1007/978-3-642-01721-6_10
and Qin Lv. 2023. Hang-Time HAR: A Benchmark Dataset for Basketball Activity [37] Javier Sánchez Pérez, Enric Meinhardt-Llopis, and Gabriele Facciolo. 2013. TV-L1
Recognition Using Wrist-Worn Inertial Sensors. MDPI Sensors 23, 13 (2023). Optical Flow Estimation. Image Processing On Line 3 (2013), 137–150. https:
https://doi.org/10.3390/s23135879 //doi.org/10.5201/ipol.2013.26
[19] Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra [38] Jafar Tanha, Maarten Van Someren, and Hamideh Afsarmanesh. 2017. Semi-
Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa supervised self-training for decision tree classifiers. International Journal of
Suleyman, and Andrew Zisserman. 2017. The Kinetics Human Action Video Machine Learning and Cybernetics 8, 1 (2017), 355–370. https://doi.org/10.1007/
Dataset. CoRR abs/1705.06950 (2017). http://arxiv.org/abs/1705.06950 s13042-015-0328-7
[20] Bulat Khaertdinov, Esam Ghaleb, and Stylianos Asteriadis. 2021. Contrastive [39] Zachary Teed and Jia Deng. 2020. RAFT: Recurrent All-Pairs Field Transforms
Self-supervised Learning for Sensor-based Human Activity Recognition. In IEEE for Optical Flow. In European Conference on Computer Vision. https://doi.org/10.
International Joint Conference on Biometrics. https://doi.org/10.1109/IJCB52358. 1007/978-3-030-58536-5_24
2021.9484410 [40] Catherine Tong, Jinchen Ge, and Nicholas D. Lane. 2021. Zero-Shot Learning for
[21] Anna Kukleva, Hilde Kuehne, Fadime Sener, and Jurgen Gall. 2019. Unsupervised IMU-Based Activity Recognition Using Video Embeddings. ACM on Interactive,
Learning of Action Classes With Continuous Temporal Embedding. In 2019 Mobile, Wearable and Ubiquitous Technologies 5, 4 (2021), 1–23. https://doi.org/
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 10.1145/3494995
Long Beach, CA, USA, 12058–12066. https://doi.org/10.1109/CVPR.2019.01234 [41] Rosaura G. VidalMata, Walter J. Scheirer, Anna Kukleva, David Cox, and Hilde
[22] Yongmou Li, Dianxi Shi, Bo Ding, and Dongbo Liu. 2014. Unsupervised Feature Kuehne. 2021. Joint Visual-Temporal Embedding for Unsupervised Learning of
Learning for Human Activity Recognition Using Smartphone Sensors. In Springer Actions in Untrimmed Sequences. In IEEE Winter Conference on Applications of
Second International Conference on Mining Intelligence and Knowledge Exploration. Computer Vision. https://doi.org/10.1109/wacv48630.2021.00128
https://doi.org/10.1007/978-3-319-13817-6_11 [42] Aiguo Wang, Guilin Chen, Cuijuan Shang, Miaofei Zhang, and Li Liu. 2016.
[23] Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Shiwei Zhang, Song Bai, and Human Activity Recognition in a Smart Home Environment with Stacked De-
Xiang Bai. 2022. End-To-End Temporal Action Detection With Transformer. noising Autoencoders. In Web-Age Information Management, Shaoxu Song and
Marius Bock, Kristof Van Laerhoven, and Michael Moeller
Yongxin Tong (Eds.). Vol. 9998. Springer International Publishing, Cham, 29–40. https://doi.org/10.1007/978-3-319-47121-1_3
[43] Kang Xia, Wenzhong Li, Shiwei Gan, and Sanglu Lu. 2023. TS2ACT: Few-Shot Human Activity Sensing with Cross-Modal Co-Learning. ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7, 4 (2023), 1–22. https://doi.org/10.1145/3631445
[44] Zhilu Zhang and Mert R. Sabuncu. 2018. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In Advances in Neural Information Processing Systems. https://proceedings.neurips.cc/paper_files/paper/2018/file/f2925f97bc13ad2852a7a551802feea0-Paper.pdf
[45] Yexu Zhou, Haibin Zhao, Yiran Huang, Till Riedel, Michael Hefenbrock, and Michael Beigl. 2022. TinyHAR: A Lightweight Deep Learning Model Designed for Human Activity Recognition. In ACM International Symposium on Wearable Computers. https://doi.org/10.1145/3544794.3558467