Weak-Annotation of HAR Datasets using Vision Foundation Models
Marius Bock, Kristof Van Laerhoven, and Michael Moeller
Figure 1: Our proposed weak-annotation pipeline: visual embeddings extracted using Vision Foundation Models are clustered using Gaussian Mixture Models (GMMs). To decrease the required labelling effort, a human annotator is asked to annotate only each cluster's centroid video clip. Centroid labels are then propagated within each cluster. Transferred to the corresponding IMU data, the resulting weakly-annotated datasets can be used to train subsequent classifiers.
ABSTRACT
As wearable-based data annotation remains, to date, a tedious, time-consuming task requiring researchers to dedicate substantial time, benchmark datasets within the field of Human Activity Recognition lack richness and size compared to datasets available within related fields. Recently, vision foundation models such as CLIP have gained significant attention, helping the vision community advance in finding robust, generalizable feature representations. With the majority of researchers within the wearable community relying on vision modalities to overcome the limited expressiveness of wearable data and accurately label their to-be-released benchmark datasets offline, we propose a novel, clustering-based annotation pipeline to significantly reduce the amount of data that needs to be annotated by a human annotator. We show that using our approach, the annotation of centroid clips suffices to achieve average labelling accuracies close to 90% across three publicly available HAR benchmark datasets. Using the weakly annotated datasets, we further demonstrate that we can match the accuracy scores of fully-supervised deep learning classifiers across all three benchmark datasets. Code as well as supplementary figures and results are publicly downloadable via github.com/mariusbock/weak_har.

KEYWORDS
Data Annotation; Human Activity Recognition; Body-worn Sensors

1 INTRODUCTION
Though the automatic recognition of activities through wearable data has been identified as valuable information for numerous research fields [9], currently available wearable activity recognition benchmark datasets lack richness and size compared to datasets available within related fields. Compared with, for example, the newly released Ego4D dataset [17], it becomes apparent that currently used datasets within the inertial-based Human Activity Recognition (HAR) community are significantly smaller in terms of the number of participants, length of recordings, and variety of performed activities. One of the main drivers for this is that, even though body-worn sensing approaches allow recording large amounts of data with only minimal impact on users in various situations in daily life, wearable-based data annotation remains, to date, a tedious, time-consuming task and requires researchers to dedicate substantial time to it during data collection (taking up to 14 to 20 times longer than the actual recorded data [30]).
CCS CONCEPTS
• Human-centered computing → Ubiquitous and mobile computing design and evaluation methods.

Following the success in other vision-related fields, researchers within the video activity recognition community have made use of feature extraction methods which provide latent representations of video clips rather than using raw image data [23, 33]. Such feature extraction methods are usually pretrained on existing large benchmark corpora; though these corpora are often not particularly related to the task at hand, the methods are capable of transferring knowledge to the activity recognition task. Recently, vision foundation models [28, 29] have gained a lot of attention. Typically trained on a large amount of curated and uncurated benchmark datasets, these models have helped the community further advance in finding robust, generalizable visual feature representations.

With the majority of researchers within the wearable activity recognition community relying on the vision modality to overcome the lacking expressiveness of wearable data and to accurately label their to-be-released benchmark datasets offline (see e.g. [11, 18, 30]), we propose a novel annotation pipeline which makes use of visual embeddings extracted using pretrained foundation models to significantly limit the amount of data which needs to be annotated by a human annotator. Our contributions are three-fold:

(1) We find that visual embeddings extracted using publicly-available vision foundation models can be clustered activity-wise.
(2) We show that annotating only one clip per cluster suffices to achieve average labelling accuracies above 60% and close to 90% across three publicly available HAR benchmark datasets.
(3) We demonstrate that using the weakly annotated datasets, one is capable of matching accuracy scores of fully-supervised deep learning classifiers across all three benchmark datasets.

2 RELATED WORK
Vision Foundation Models. The term foundation models was coined by Devlin et al. [14] and refers to models which are pretrained on a large selection of datasets. The idea of pretraining models on large benchmark datasets has been prominent within the vision community for a long time. Within the video classification community, researchers demonstrated that pretrained methods such as I3D [10], VideoSwin [24] or SlowFast [15] extract discriminative feature embeddings which can be used to train subsequent classifiers. Following their success in Natural Language Processing [8, 14], researchers applied masked autoencoders to visual data input. Unlike previous methods, masked autoencoders are capable of pretraining themselves in a self-supervised manner, allowing the use of larger data sources. Two such methods are CLIP [29] and DINOv2 [28]. The former, published by OpenAI, is a vision-language model which tries to learn the alignment between text and images. According to the authors, CLIP is pretrained on a large corpus of image-text pairs scraped from the world wide web. Similarly, the recently released DINOv2 by META AI makes an effort to provide a foundation model which is capable of extracting general-purpose visual features, for which the authors collected data from curated and uncurated sources.

Weakly-Supervised Wearable HAR. With the activity labelling of body-worn sensor data being a tedious task, many researchers have looked at weakly-supervised learning techniques to reduce the required amount of annotations to train subsequent classifiers. Early works such as that of Stikic et al. [34] have shown how to reduce the labelling effort for training classical machine learning models through knowledge-driven approaches using graph-based label propagation [35], multi-instance learning [36] or probabilistic methods [16, 38]. Adaimi and Thomaz [2] followed the works of [34], proposing an active learning framework which focuses on asking users to label only the data which will gain the most classification performance boost. With the rise in popularity of Deep Learning, deep clustering algorithms have been proposed to cluster latent representations in unsupervised and semi-supervised fashion using e.g. autoencoders [4, 5, 22, 26, 42], recurrent networks [1, 16], self-supervised [31] and contrastive learning [3, 12, 20, 40, 43]. Recently, Xia et al. [43] and Tong et al. [40] demonstrated how vision foundation models such as CLIP [29] and I3D [10] can be used to create visual, complementary embeddings to inertial data such that a contrastive loss can be calculated. This work marks one of the few instances of researchers trying to use visual data to limit the amount of annotations required in wearable activity recognition. Our work ties into the works of Tong et al. [40] and Xia et al. [43], yet we propose instead to apply vision foundation models to perform automatic label propagation between similar embeddings.

3 METHODOLOGY
3.1 Annotation Pipeline
Latent Space Clustering via Vision Foundation Models. Within the first phase we divide the unlabeled dataset into (overlapping) video clips. Given an input video stream X of a sample participant, we apply a sliding window approach which shifts over X, dividing the input data into video clips, e.g. of four-second duration with a 75% overlap between consecutive windows. This process results in X = {x_1, x_2, ..., x_T} being discretized into time steps t ∈ {1, ..., T}, where T is the number of windows, i.e. video clips, per participant. Inspired by classification approaches originating from the temporal action localization community, we make use of pretrained vision foundation models to extract latent representations of each clip. That is, x_t ∈ R^E represents a one-dimensional feature embedding vector associated with the video clip at time step t, where E is the number of latent features the embedding vector consists of. In total we evaluated three popular pretrained foundation models: a two-stream inflated 3D-ConvNet (I3D) [10], pretrained on the RGB and optical flow features extracted from the Kinetics-400 dataset [19], as well as two transformer foundation models, CLIP [29] and DINOv2 [28], which were pretrained on a multitude of curated and uncurated data sources. Note that, unlike Carreira and Zisserman in [10], we use RAFT [39] instead of TV-L1 [37] optical flow estimation. As the CLIP and DINOv2 models are not explicitly trained on optical flow features, we also test complementing the embeddings of the two models by concatenating them with extracted embeddings of the inflated 3D-ConvNet trained on RAFT optical flow features of the Kinetics dataset. In order to obtain latent representations, we altered the models such that intermediate feature representations can be extracted. Table 1 details which layer's activations were considered to be the embedding of each pretrained method, as well as their dimension. To merge the frame-wise features outputted by the CLIP and DINOv2 models, we apply average pooling as detailed in [25] to obtain a single latent representation per sliding video clip.
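The first phase described above, together with the centroid-based label propagation and distance thresholding that Section 3.1 details next, can be sketched in plain Python. This is a minimal sketch, not the paper's implementation: it assumes per-frame embeddings have already been extracted, it uses diagonal covariances for brevity (the paper assumes a general covariance per GMM component), and all function names are illustrative. In practice, the cluster assignments, means and covariances would come from a fitted mixture model (e.g. one scikit-learn GaussianMixture per participant), and `annotate` stands in for the human annotator.

```python
import math

def sliding_clips(n_frames, fps, clip_sec=4.0, overlap=0.75):
    """Divide a video stream into overlapping clips (e.g. 4 s windows with
    75% overlap), returning (start_frame, end_frame) index pairs."""
    win = int(clip_sec * fps)
    stride = max(1, int(win * (1.0 - overlap)))
    return [(s, s + win) for s in range(0, n_frames - win + 1, stride)]

def pool_clip_embedding(frame_embeddings):
    """Average-pool per-frame feature vectors into a single clip embedding."""
    n = len(frame_embeddings)
    return [sum(f[d] for f in frame_embeddings) / n
            for d in range(len(frame_embeddings[0]))]

def diag_log_density(x, mean, var):
    """log N(x; mean, diag(var)). Diagonal covariance for brevity; the
    paper allows a general covariance matrix per mixture component."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def propagate_labels(embeddings, assignments, means, variances, annotate,
                     max_dist=None):
    """Pick each cluster's highest-density clip as its 'centroid clip',
    obtain one label for it via annotate(clip_index), and propagate that
    label to all cluster members. Members farther than max_dist (L2 norm)
    from the centroid clip are dropped from the labelling (label None)."""
    labels = [None] * len(embeddings)
    for c in sorted(set(assignments)):
        members = [i for i, a in enumerate(assignments) if a == c]
        centroid = max(members, key=lambda i: diag_log_density(
            embeddings[i], means[c], variances[c]))
        label = annotate(centroid)  # one human annotation per cluster
        for i in members:
            if max_dist is None or math.dist(embeddings[i],
                                             embeddings[centroid]) <= max_dist:
                labels[i] = label
    return labels
```

With 30 fps video, `sliding_clips(300, 30)` yields seven four-second windows whose start points lie one second apart, matching the 75% overlap described above.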
Table 1: Network layer used for extracting embeddings of the different vision foundation models [10, 28, 29]. Subsequent layers are omitted such that the network outputs latent representations at the point of the embedding layer. Note that the I3D network is used for extracting both RGB and flow features, and that we refer to the vision-based part of the CLIP model.

Model | Embedding Layer | Dimension
I3D | last average pool layer | R^1024
CLIP | last projection layer (vision-CLIP) | R^768
DINOv2 | last layer hidden state (clf. token) | R^1024

Having extracted latent representations of each video clip within the tested benchmark datasets, we apply Gaussian Mixture Models (GMMs) to cluster the embeddings on a per-participant level. Though GMMs are not originally intended to be used with high-dimensional data, they have been shown to provide good results clustering visual embeddings, especially in the context of action recognition [21, 41], and, unlike methods such as k-nearest neighbors, allow more flexibility regarding the shape of clusters. Training one GMM clustering algorithm per participant and applying it to said participant's embeddings assigns each video clip x_t ∈ R^E a cluster label x_c ∈ {1, ..., C}, where C is the number of GMM components, i.e. clusters, applied.

Weak-Labeling via Centroid Clips. Once each video clip of a study participant has been assigned a cluster label x_c, the second phase of our approach consists of a human annotator only needing to annotate one sample clip per cluster. Assuming the centroid of a cluster is most representative of all clips within that cluster, we can propagate the activity label a ∈ {1, ..., A} of said clip to all other clips, eliminating the need to annotate the other clips via a human annotator, where A is the number of activities within the dataset. As GMMs do not explicitly provide a definition of the centroid of a component, we calculate the centroid clip of each cluster component as the clip which has the highest density within said cluster. That is, given the covariance matrix Σ ∈ R^(E×E) of each mixture component (assuming each component has its own general covariance matrix) and its mean vector μ ∈ R^E, we calculate the density of each point as the logarithm of the probability density function of the multivariate normal distribution defined by μ and Σ. Having identified the centroid clip within each cluster, our approach propagates the annotation provided by the human annotator to all other clips which were also assigned to that cluster.

As our approach forces each video clip to be assigned an activity label, we augment our clustering with a subsequent distance-based thresholding in order to remove outlier clips from the automatic labelling. Assuming that the distance of a clip to the centroid resembles its likelihood of belonging to the same activity class, we omit clips from the dataset which exceed a certain distance from their respective centroid clip, with the distance being calculated as the L2-norm between two embedding vectors. Even though this approach decreases the amount of data which can be used to train subsequent classification algorithms, we show that it increases the overall labelling accuracy by a significant margin.

3.2 Weakly-supervised Training
Assuming inertial and video data are synchronised, we further evaluate how well the resulting annotated inertial data with non-uniform label noise is suited for training inertial-based deep learning classifiers. As our benchmark algorithms of choice we use two recently published state-of-the-art methods, namely the Shallow DeepConvLSTM [6] and the TinyHAR architecture [45]. We use both architectures as originally introduced by the authors, specifically using the same size and number of convolutional filters, convolutional layers and size of the recurrent layers. During training we apply a sliding window of one second with an overlap of 50%, as this proved to provide consistent classification performances across a multitude of HAR datasets [6]. We train each network for 30 epochs using the Adam optimizer (learning rate 1e-4 and weight decay 1e-6), applying a step-wise learning rate schedule with a decay factor of 0.9 after every 10 epochs. To mitigate the label noise introduced by our proposed weak-annotation pipeline, we calculate the loss during training using the weighted partially Huberised generalised cross-entropy (PHGCE) loss [27], which extends the definition of the generalized cross-entropy loss [44] with a variant of gradient clipping. To assess the validity of our approach, we compare amongst a set of (weakly-)annotated training approaches:

(1) Fully-supervised: Fully-supervised training using the original, fully-annotated benchmark datasets.
(2) Few-Shot-CE: Fully-supervised training using only the annotated clips and a weighted cross-entropy loss.
(3) Random-CE: Training using an equal amount of randomly annotated clips as in (2) and a weighted cross-entropy loss.
(4) Weak-CE: Weakly-supervised training using the weakly-annotated dataset and a weighted cross-entropy loss.
(5) Weak-PHGCE: Weakly-supervised training using the weakly-annotated dataset and a weighted PHGCE loss.

Table 2: Average labeling accuracy and standard deviation across study participants using different types and combinations of embeddings [10, 28, 29] extracted from three benchmark datasets [7, 13, 32], applying a GMM-based clustering using 100 clusters. Overall, a combination of CLIP and optical flow embeddings proved most consistent across all datasets.

Embedding | WEAR | Wetlab | ActionSense
(1) I3D | 82.62 (±4.65) | 66.08 (±9.53) | 53.47 (±5.95)
(2) CLIP | 82.47 (±6.03) | 72.70 (±6.42) | 59.85 (±4.42)
(3) DINOv2 | 79.20 (±4.04) | 69.28 (±8.12) | 60.25 (±4.04)
(4) RAFT | 76.86 (±4.79) | 51.50 (±6.96) | 45.64 (±5.19)
(1) + (4) | 85.17 (±4.48) | 60.91 (±8.36) | 53.00 (±4.66)
(2) + (4) | 83.96 (±4.99) | 66.23 (±7.86) | 57.29 (±5.64)
(3) + (4) | 79.13 (±4.30) | 70.18 (±9.79) | 56.55 (±4.51)

3.3 Datasets
WEAR. The WEAR dataset offers both inertial and egocentric video data of 18 participants performing a variety of 18 sports-related activities, including different styles of running, stretching, and
strength-based exercises. Recordings took place at changing outdoor locations. Each study participant was equipped with a head-mounted camera and four smartwatches, one worn on each limb in a fixed orientation, which captured 3D-accelerometer data.

ActionSense. Published by DelPreto et al. [13], the ActionSense dataset provides a multitude of sensors capturing data within an indoor, artificial kitchen setup. Amongst the sensors, participants wore Inertial Measurement Units (IMUs) on both wrists as well as smart glasses which captured the ego-view of each participant. During recordings, participants were tasked to perform various kitchen chores including chopping foods, setting a table and (un)loading a dishwasher. Within their original publication, the authors provide annotations of 19 activities of 10 participants. Note that the dataset download of the ActionSense dataset provides IMU and egocentric video data of only 9 instead of 10 participants.

Wetlab. Taking place in a wetlab laboratory environment, the Wetlab dataset [32] comprises data of 22 study participants who performed two DNA extraction experiments. For purposes of this paper we used the annotations provided by the authors of the reoccurring base activities (such as stirring, cutting, etc.) within the experimental protocol. During recordings, each participant wore a smartwatch in a fixed orientation on the wrist of their dominant hand, which captured 3D-accelerometer data. Unlike the WEAR and ActionSense dataset, the Wetlab dataset provides video data of a static camera which was mounted above the table at which the experiment was performed, thus capturing a birds-eye perspective of the experiment's surroundings.

Figure 2: Box-plot diagrams showing the distribution of labelling accuracies across study participants with increasing number of clusters. The bar plot below the box-plots provides, per cluster setting, the percentage of data an annotator would need to annotate compared to the total size of the three benchmark datasets [7, 13, 32]. One can see a clear trend that, with an increase in clusters, labelling accuracy increases while the deviation across study participants decreases.

4 RESULTS
To ensure that reported performance differences are not based on statistical variance, all reported experiments are repeated three times, applying a set of three predefined random seeds. This applies both for the annotation pipeline experiments as well as the weakly-supervised training results. During all annotation-based experiments mentioned in Section 4.1 we apply a clip length of four seconds along with a three-second overlap between clips. We assume that this clip length is suitable to be interpretable for a human annotator while simultaneously avoiding mixing multiple activities into one sliding window. Furthermore, during all experiments only the label of the centroid clip is propagated to all other cluster instances. Ablation experiments evaluating different clip lengths and different numbers of annotated clips per cluster used to determine the label to be propagated can be found within our code repository.

4.1 Annotation Pipeline
Table 2 shows the labelling accuracy averaged across participants obtained when applying our proposed annotation pipeline using various types of extracted visual embeddings. One can see that in case of the WEAR [7] and ActionSense dataset [13], labelling accuracy can be improved by combining both RGB and optical flow features in case of all embeddings. Overall, a combination of CLIP and optical flow features proves to be most consistent across our three benchmark datasets of choice, making it thus our embedding of choice for subsequent experiments. Applying a labelling strategy of only annotating the centroid clip of each cluster, Figure 2 presents a box-plot visualization of applying different numbers of clusters during the clustering of the participant-wise embeddings. One can see that by only annotating 100 clips per study participant, our proposed annotation pipeline is capable of reaching labelling accuracies above 85% in case of the WEAR and close to 70% in case of the Wetlab [32] and ActionSense dataset.

Furthermore, as evident by an overall shrinking boxplot with increasing number of clusters, our approach becomes more stable, with the standard deviation across study participants decreasing in case of all three datasets. As Figure 2 shows, applying a clustering of C = A, i.e. as many clusters as there are activities in the dataset, results in the clustering not being capable of differentiating the normal and complex variations of activities, different running styles and the null-class from all other activities. In general, we witness a trend that by applying a larger amount of clusters than activities present in the dataset, one gives the GMM clustering enough degrees of freedom to differentiate even activities which share similarities, yet slightly differ from each other. Lastly,
distance thresholding clusters and excluding instances which exceed a certain distance from their respective centroid helps increase the labelling accuracy significantly across all datasets. While a threshold of 4 helps increase the labelling accuracy well above 75% and even up to 93% in case of the WEAR dataset, the thresholding omits between 50% and up to 90% of the datasets.

4.2 Weakly-Supervised Training
As a combination of CLIP and optical-flow-based features proved to be most stable across all three datasets, we chose to use said embedding as the basis for our weakly-supervised training. Table 3 provides an overview across the eight evaluated training scenarios. Our proposed weakly-supervised training is not only capable of outperforming the few-shot training using only the annotated centroid clips, but in the case of applying 100 clusters comes close to matching accuracy scores of a fully-supervised training across all three benchmark datasets, for both inertial-based architectures. Compared to a normal cross-entropy loss, the PHGCE loss provides more stable results in case of higher label noise, e.g., when not applying a distance-based thresholding and/or applying a smaller number of clusters. In general, the distance-based thresholding significantly improved results across all datasets. Although thresholding significantly reduces the amount of training data, the resulting decrease in overall labelling noise, especially for approaches that applied a lower number of clusters, improved classification results. We provide a detailed overview of the influence of thresholding on labelling accuracy and dataset size within the paper's code repository.

Table 3: Deep learning results of applying two inertial-based models [6, 45] on various weakly-annotated versions of three public datasets [7, 13, 32]. Training using weakly-annotated datasets outperformed both few-shot training using only the annotated data as well as an equal amount of randomly annotated clips. With an increase in the number of clusters, our weakly-supervised approach comes close to matching the predictive performance of fully-supervised baselines while having manually annotated only a fraction of the actual dataset. The suffix T-6 (T-4) refers to training applying a threshold of 6 (4). Cells list Acc/F1.

WEAR | DeepConvLSTM c=19 | c=50 | c=100 | TinyHAR c=19 | c=50 | c=100
Fully-supervised | 79.89/78.36 | 79.89/78.36 | 79.89/78.36 | 77.83/71.89 | 77.83/71.89 | 77.83/71.89
Few-Shot-CE | 37.41/24.76 | 59.58/46.25 | 65.61/53.51 | 37.41/26.55 | 59.58/46.25 | 65.61/53.51
Random-CE | 45.90/31.13 | 59.46/46.98 | 65.91/53.38 | 23.73/23.73 | 59.72/46.34 | 66.27/55.00
Weak-CE | 42.55/34.09 | 64.17/54.59 | 73.38/63.23 | 49.45/38.75 | 66.68/54.05 | 71.10/59.65
Weak-PHGCE | 48.62/35.45 | 70.34/55.27 | 76.15/63.43 | 51.46/39.23 | 68.53/55.40 | 73.37/61.68
Weak-CE-T-6 | 59.70/46.63 | 73.06/60.60 | 76.28/66.13 | 57.22/46.71 | 68.19/55.56 | 72.03/60.31
Weak-PHGCE-T-6 | 59.39/44.97 | 73.17/58.84 | 77.55/64.77 | 58.78/46.27 | 69.47/56.29 | 74.05/61.67
Weak-CE-T-4 | 68.86/57.33 | 74.72/63.93 | 77.81/68.22 | 65.68/55.64 | 71.31/60.16 | 73.93/63.42
Weak-PHGCE-T-4 | 61.35/47.00 | 74.45/60.61 | 76.64/64.81 | 62.90/50.46 | 72.25/60.84 | 74.83/63.94

Wetlab | DeepConvLSTM c=9 | c=50 | c=100 | TinyHAR c=9 | c=50 | c=100
Fully-supervised | 45.27/38.64 | 45.27/38.64 | 45.27/38.64 | 38.75/28.85 | 38.75/28.85 | 38.75/28.85
Few-Shot-CE | 15.60/11.39 | 21.89/16.46 | 22.78/17.38 | 15.18/11.62 | 22.43/16.14 | 25.92/18.27
Random-CE | 16.33/8.95 | 26.48/18.62 | 26.50/20.05 | 34.23/24.38 | 27.74/17.79 | 29.70/19.37
Weak-CE | 16.97/8.53 | 32.57/25.78 | 36.51/29.72 | 23.90/14.70 | 34.23/24.38 | 36.30/25.76
Weak-PHGCE | 18.62/10.48 | 27.64/23.17 | 33.78/27.62 | 24.06/15.19 | 33.79/24.35 | 35.53/25.53
Weak-CE-T-6 | 23.01/18.12 | 33.78/27.08 | 38.41/29.77 | 26.14/18.63 | 33.79/24.21 | 36.20/25.76
Weak-PHGCE-T-6 | 20.26/13.72 | 28.77/23.90 | 32.29/26.71 | 25.22/18.41 | 33.34/24.25 | 34.93/25.17
Weak-CE-T-4 | 23.54/18.71 | 32.02/26.37 | 35.25/28.95 | 25.15/17.91 | 32.85/23.84 | 34.64/25.04
Weak-PHGCE-T-4 | 21.82/15.81 | 28.33/23.91 | 31.20/25.64 | 23.82/17.29 | 32.39/23.72 | 33.66/24.70

ActionSense | DeepConvLSTM c=20 | c=50 | c=100 | TinyHAR c=20 | c=50 | c=100
Fully-supervised | 20.73/14.82 | 20.73/14.82 | 20.73/14.82 | 22.19/18.04 | 22.19/18.04 | 22.19/18.04
Few-Shot-CE | 6.47/2.58 | 8.14/4.92 | 9.96/6.08 | 6.19/2.32 | 8.50/4.00 | 11.15/6.72
Random-CE | 4.69/1.63 | 9.42/5.09 | 10.02/5.77 | 5.01/1.70 | 7.80/4.19 | 11.44/6.50
Weak-CE | 12.32/7.67 | 13.91/9.94 | 17.35/12.20 | 13.74/9.75 | 15.58/11.55 | 17.16/13.45
Weak-PHGCE | 11.65/6.34 | 12.34/7.43 | 11.62/7.02 | 14.35/9.43 | 13.88/9.25 | 14.52/10.09
Weak-CE-T-6 | 14.67/8.81 | 15.95/10.07 | 15.29/10.69 | 13.35/9.32 | 15.70/11.22 | 17.53/13.15
Weak-PHGCE-T-6 | 9.91/4.92 | 10.57/5.44 | 11.78/6.79 | 12.17/7.52 | 13.72/8.16 | 14.53/9.01
Weak-CE-T-4 | 10.40/7.15 | 12.31/7.10 | 12.08/8.80 | 10.80/8.02 | 13.65/9.24 | 15.03/11.16
Weak-PHGCE-T-4 | 8.34/3.86 | 8.92/4.35 | 9.91/5.56 | 9.24/5.82 | 11.07/6.09 | 12.11/7.35
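The noise-robust loss referenced above can be illustrated on a single true-class probability p. The generalized cross-entropy follows [44]; the partial Huberisation is approximated here by replacing the loss below a probability threshold p0 with its tangent line, which bounds the gradient in the spirit of the gradient-clipping view of [27]. The values q = 0.7 and p0 = 0.2, and this exact linearisation, are illustrative assumptions; the weighted PHGCE used in the paper may differ:

```python
def gce(p, q=0.7):
    """Generalized cross-entropy [44]: (1 - p^q) / q for the true-class
    probability p; recovers CE as q -> 0 and MAE at q = 1."""
    return (1.0 - p ** q) / q

def phgce(p, q=0.7, p0=0.2):
    """Sketch of a partially Huberised GCE: below the threshold p0 the
    loss is replaced by its tangent line at p0, bounding the gradient
    (a form of gradient clipping [27]). NOTE: q, p0 and this exact
    linearisation are illustrative, not the paper's weighted PHGCE."""
    if p >= p0:
        return gce(p, q)
    slope = -p0 ** (q - 1.0)  # derivative of GCE at p0
    return gce(p0, q) + slope * (p - p0)
```

Since GCE is convex in p, the tangent line lies below it for p < p0: heavily mislabelled samples then contribute a bounded gradient of magnitude p0^(q-1) instead of one growing like p^(q-1) as p approaches 0.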
than vice versa. This caused the activities transfer and pouring, two classes which only have a few instances in the ground truth data and which are most frequently annotated incorrectly as NULL-class, to not be predicted correctly once across all study participants. Note that for the ActionSense dataset, classification results even in the fully-supervised setting are significantly worse compared to those of the other two datasets. As evident by a nevertheless large labelling accuracy using our approach, we assume that the label semantics of the dataset are too vision-centric (e.g. peeling a cucumber or a potato) to be correctly recognized using only inertial data. Nevertheless, per-class classification results of the fully- vs. weakly-supervised training show a similar confusion amongst classes, suggesting the learned patterns of the classifier are similar in both training scenarios.

[Figure: confusion matrices on the WEAR dataset, comparing ground-truth labels against the predictions of the fully-supervised and the Weak-CE-T-4 training across all activity classes.]

5 DISCUSSION & CONCLUSION
Within this paper we presented a weak-annotation pipeline for HAR datasets based on Vision Foundation Models. We showed that visual embeddings extracted using Vision Foundation Models can be clustered using Gaussian Mixture Models (GMMs). Decreasing the required labelling effort, using the suggested pipeline a human annotator is only asked to annotate each cluster's centroid video clip. By propagating the provided labels within each cluster, our approach is capable of achieving average labelling accuracies above 60% and close to 90% across three popular HAR benchmark datasets. We further showed that the resulting weakly-annotated wearable datasets can be used to train subsequent deep learning classifiers, with accuracy scores, in case of applying a sufficiently large number of clusters, coming close to matching those of a fully-supervised training across all three benchmark datasets.

Our results underscore one of the implications of recent advancements in the vision community in finding generalizable feature
[5] Behrooz Azadi, Michael Haslgrübler, Bernhard Anzengruber-Tanase, Georgios Sopidis, and Alois Ferscha. 2024. Robust Feature Representation Using Multi-Task Learning for Human Activity Recognition. Sensors 24, 2 (2024), 681. https://doi.org/10.3390/s24020681
[6] Marius Bock, Alexander Hoelzemann, Michael Moeller, and Kristof Van Laerhoven. 2021. Improving Deep Learning for HAR With Shallow LSTMs. In ACM International Symposium on Wearable Computers. https://doi.org/10.1145/3460421.3480419
[7] Marius Bock, Hilde Kuehne, Kristof Van Laerhoven, and Michael Moeller. 2023. WEAR: An Outdoor Sports Dataset for Wearable and Egocentric Activity Recognition. CoRR abs/2304.05088 (2023). https://arxiv.org/abs/2304.05088
[8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[9] Andreas Bulling, Ulf Blanke, and Bernt Schiele. 2014. A Tutorial on Human Activity Recognition Using Body-Worn Inertial Sensors. Comput. Surveys 46, 3 (2014), 1–33. https://doi.org/10.1145/2499621
[10] Joao Carreira and Andrew Zisserman. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In IEEE Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/cvpr.2017.502
[11] Shing Chan, Hang Yuan, Catherine Tong, Aidan Acquah, Abram Schonfeldt, Jonathan Gershuny, and Aiden Doherty. 2024. CAPTURE-24: A large dataset of wrist-worn activity tracker data collected in the wild for human activity recognition. CoRR abs/2402.19229 (2024). https://doi.org/10.48550/arXiv.2402.19229
[12] Shohreh Deldari, Hao Xue, Aaqib Saeed, Daniel V. Smith, and Flora D. Salim. 2022. COCOA: Cross Modality Contrastive Learning for Sensor Data. ACM
… IEEE Transactions on Image Processing 31 (2022). https://doi.org/10.1109/TIP.2022.3195321
[24] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2022. Video Swin Transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/cvpr52688.2022.00320
[25] Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2022. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning. Neurocomputing 508 (2022), 293–304. https://doi.org/10.1016/j.neucom.2022.07.028
[26] Haojie Ma, Zhijie Zhang, Wenzhong Li, and Sanglu Lu. 2021. Unsupervised Human Activity Representation Learning with Multi-task Deep Clustering. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 5, 1 (2021), 1–25. https://doi.org/10.1145/3448074
[27] Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. 2020. Can gradient clipping mitigate label noise? In International Conference on Learning Representations. https://openreview.net/forum?id=rklB76EKPr
[28] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. 2024. DINOv2: Learning Robust Visual Features without Supervision. CoRR abs/2304.07193 (2024). https://doi.org/10.48550/arXiv.2304.07193
[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. CoRR abs/2103.00020 (2021). https://doi.org/10.48550/arXiv.2103.00020
[30] Daniel Roggen, Alberto Calatroni, Mirco Rossi, Thomas Holleczek, Kilian Förster, Gerhard Tröster, Paul Lukowicz, David Bannach, Gerald Pirkl, Alois Ferscha, Jakob Doppler, Clemens Holzmann, Marc Kurz, Gerald Holl, Ricardo Chavarriaga, Hesam Sagha, Hamidreza Bayati, Marco Creatura, and José del R. Millàn.
on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, 3 (2022), 1–28. 2010. Collecting Complex Activity Datasets in Highly Rich Networked Sensor
https://doi.org/10.1145/3550316 Environments. In IEEE Seventh International Conference on Networked Sensing
[13] Joseph DelPreto, Chao Liu, Yiyue Luo, Michael Foshey, Yunzhu Li, Antonio Systems. https://doi.org/10.1109/INSS.2010.5573462
Torralba, Wojciech Matusik, and Daniela Rus. 2022. ActionSense: A Multimodal [31] Aaqib Saeed, Tanir Ozcelebi, and Johan Lukkien. 2019. Multi-task Self-Supervised
Dataset and Recording Framework for Human Activities Using Wearable Sensors Learning for Human Activity Detection. ACM on Interactive, Mobile, Wearable
in a Kitchen Environment. In Neural Information Processing Systems Track on and Ubiquitous Technologies 3, 2 (2019), 1–30. https://doi.org/10.1145/3328932
Datasets and Benchmarks. https://action-sense.csail.mit.edu [32] Philipp M. Scholl, Matthias Wille, and Kristof Van Laerhoven. 2015. Wearables
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina N. Toutanova. 2019. in the Wet Lab: A Laboratory System for Capturing and Guiding Experiments.
BERT: Pre-training of Deep Bidirectional Transformers for Language Under- In ACM International Joint Conference on Pervasive and Ubiquitous Computing.
standing. In Confernce of the North American Chapter of the Association for https://doi.org/10.1145/2750858.2807547
Computational Linguistics. https://arxiv.org/abs/1810.04805 [33] Dingfeng Shi, Yujie Zhong, Qiong Cao, Lin Ma, Jia Li, and Dacheng Tao.
[15] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. Slow- 2023. TriDet: Temporal Action Detection With Relative Boundary Modeling.
Fast Networks for Video Recognition. In International Conference on Computer In IEEE/CVF Conference on Computer Vision and Pattern Recognition. https:
Vision. https://doi.org/10.1109/ICCV.2019.00630 //doi.org/10.1109/cvpr52729.2023.01808
[16] Marjan Ghazvininejad, Hamid R. Rabiee, Nima Pourdamghani, and Parisa [34] Maja Stikic, Diane Larlus, Sandra Ebert, and Bernt Schiele. 2011. Weakly Super-
Khanipour. 2011. HMM based semi-supervised learning for activity recogni- vised Recognition of Daily Life Activities with Wearable Sensors. IEEE Trans-
tion. In ACM International Workshop on Situation Activity & Goal Awareness. actions on Pattern Analysis and Machine Intelligence 33, 12 (2011), 2521–2537.
https://doi.org/10.1145/2030045.2030065 https://doi.org/10.1109/tpami.2011.36
[17] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino [35] Maja Stikic, Diane Larlus, and Bernt Schiele. 2009. Multi-graph Based Semi-
Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, and Xingyu supervised Learning for Activity Recognition. In IEEE International Symposium
Liu. 2022. Ego4D: Around the World in 3,000 Hours of Egocentric Video. In on Wearable Computers. https://doi.org/10.1109/ISWC.2009.24
IEEE/CVF Conference on Computer Vision and Pattern Recognition. https://doi. [36] Maja Stikic and Bernt Schiele. 2009. Activity Recognition from Sparsely Labeled
org/10.1109/CVPR52688.2022.01842 Data Using Multi-Instance Learning. In Springer International Symposium on
[18] Alexander Hoelzemann, Julia L. Romero, Marius Bock, Kristof Van Laerhoven, Location- and Context-Awareness. https://doi.org/10.1007/978-3-642-01721-6_10
and Qin Lv. 2023. Hang-Time HAR: A Benchmark Dataset for Basketball Activity [37] Javier Sánchez Pérez, Enric Meinhardt-Llopis, and Gabriele Facciolo. 2013. TV-L1
Recognition Using Wrist-Worn Inertial Sensors. MDPI Sensors 23, 13 (2023). Optical Flow Estimation. Image Processing On Line 3 (2013), 137–150. https:
https://doi.org/10.3390/s23135879 //doi.org/10.5201/ipol.2013.26
[19] Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra [38] Jafar Tanha, Maarten Van Someren, and Hamideh Afsarmanesh. 2017. Semi-
Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa supervised self-training for decision tree classifiers. International Journal of
Suleyman, and Andrew Zisserman. 2017. The Kinetics Human Action Video Machine Learning and Cybernetics 8, 1 (2017), 355–370. https://doi.org/10.1007/
Dataset. CoRR abs/1705.06950 (2017). http://arxiv.org/abs/1705.06950 s13042-015-0328-7
[20] Bulat Khaertdinov, Esam Ghaleb, and Stylianos Asteriadis. 2021. Contrastive [39] Zachary Teed and Jia Deng. 2020. RAFT: Recurrent All-Pairs Field Transforms
Self-supervised Learning for Sensor-based Human Activity Recognition. In IEEE for Optical Flow. In European Conference on Computer Vision. https://doi.org/10.
International Joint Conference on Biometrics. https://doi.org/10.1109/IJCB52358. 1007/978-3-030-58536-5_24
2021.9484410 [40] Catherine Tong, Jinchen Ge, and Nicholas D. Lane. 2021. Zero-Shot Learning for
[21] Anna Kukleva, Hilde Kuehne, Fadime Sener, and Jurgen Gall. 2019. Unsupervised IMU-Based Activity Recognition Using Video Embeddings. ACM on Interactive,
Learning of Action Classes With Continuous Temporal Embedding. In 2019 Mobile, Wearable and Ubiquitous Technologies 5, 4 (2021), 1–23. https://doi.org/
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 10.1145/3494995
Long Beach, CA, USA, 12058–12066. https://doi.org/10.1109/CVPR.2019.01234 [41] Rosaura G. VidalMata, Walter J. Scheirer, Anna Kukleva, David Cox, and Hilde
[22] Yongmou Li, Dianxi Shi, Bo Ding, and Dongbo Liu. 2014. Unsupervised Feature Kuehne. 2021. Joint Visual-Temporal Embedding for Unsupervised Learning of
Learning for Human Activity Recognition Using Smartphone Sensors. In Springer Actions in Untrimmed Sequences. In IEEE Winter Conference on Applications of
Second International Conference on Mining Intelligence and Knowledge Exploration. Computer Vision. https://doi.org/10.1109/wacv48630.2021.00128
https://doi.org/10.1007/978-3-319-13817-6_11 [42] Aiguo Wang, Guilin Chen, Cuijuan Shang, Miaofei Zhang, and Li Liu. 2016.
[23] Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Shiwei Zhang, Song Bai, and Human Activity Recognition in a Smart Home Environment with Stacked De-
Xiang Bai. 2022. End-To-End Temporal Action Detection With Transformer. noising Autoencoders. In Web-Age Information Management, Shaoxu Song and
Marius Bock, Kristof Van Laerhoven, and Michael Moeller
Yongxin Tong (Eds.). Vol. 9998. Springer International Publishing, Cham, 29–40. https://doi.org/10.1007/978-3-319-47121-1_3
[43] Kang Xia, Wenzhong Li, Shiwei Gan, and Sanglu Lu. 2023. TS2ACT: Few-Shot Human Activity Sensing with Cross-Modal Co-Learning. ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7, 4 (2023), 1–22. https://doi.org/10.1145/3631445
[44] Zhilu Zhang and Mert R. Sabuncu. 2018. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In Advances in Neural Information Processing Systems. https://proceedings.neurips.cc/paper_files/paper/2018/file/f2925f97bc13ad2852a7a551802feea0-Paper.pdf
[45] Yexu Zhou, Haibin Zhao, Yiran Huang, Till Riedel, Michael Hefenbrock, and Michael Beigl. 2022. TinyHAR: A Lightweight Deep Learning Model Designed for Human Activity Recognition. In ACM International Symposium on Wearable Computers. https://doi.org/10.1145/3544794.3558467