22 Self Supervised Representation
Abstract
1 Introduction
Self-supervised representation learning (SSRL) has recently witnessed remarkable success, particularly in the fields of computer vision [19] and natural language processing [18]. Yet despite an abundance of raw data structured in tabular and time-series formats throughout the domains of healthcare, finance, the natural sciences, and more, extending the success of SSRL to these data modalities remains challenging [6].
The success of SSRL in computer vision and natural language processing primarily stems from well-designed pretext tasks that create heuristics from unlabelled data, allowing models to identify and encode useful information. Pretext tasks are often highly customized to specific applications based on a handful of underlying assumptions. In particular, pretext tasks that assume transformation invariance across data augmentation views show leading performance in multiple research domains. In computer vision, image representations are commonly guided to remain identical after cropping, rotating, flipping, or corrupting, among other transformations [11, 19]. Similarly, in natural language processing, sentences with similar words and semantic meaning are expected to have the same representation [45, 46]. In these and other domains, transformation invariance is encouraged explicitly through contrastive or momentum objectives that aim to bring together representations before and after transformation.
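Such contrastive objectives can be made concrete with a small sketch. The following minimal numpy example (our illustration, not any particular paper's loss; the names and the temperature value are placeholders) scores matched augmented views against mismatched ones:

```python
import numpy as np

def info_nce(z1, z2, temperature=0.5):
    """InfoNCE-style contrastive loss: pull each z1[i] toward its positive
    view z2[i] and push it away from the other z2[j] in the batch."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)   # cosine similarity
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                      # (m, m)
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # positives on diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
views = z + 0.01 * rng.normal(size=(8, 16))        # stand-in for augmented views
aligned = info_nce(z, views)                       # matched pairs: low loss
shuffled = info_nce(z, np.roll(views, 1, axis=0))  # mismatched pairs: high loss
print(aligned < shuffled)
```

Minimizing such a loss over many batches drives the representations of two views of the same datapoint together, which is exactly the transformation-invariance assumption made explicit.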
Self-supervised representation learning (SSRL) methods enable the extraction of informative and
compact representations from raw data without manual annotation or labelling. These methods rely
on large amounts of unlabeled data and pretext tasks to implicitly model the observed distribution and
optimize deep neural networks. Contrastive learning methods use heavy augmentation to generate
positive views – semantically similar examples which are optimized to have the same representation
as the original datapoint [15, 20, 40, 11]. To ensure the quality of learned representations, augmentations should be semantic-preserving [37, 6]. However, finding suitable augmentations for different
application domains can be a challenging task, and researchers have invested considerable effort
into this area to enhance downstream performance. Augmentation strategies that work well for one
modality may not directly translate to others due to inherent differences, and the choice of suitable
augmentations can also be influenced by the specific application domain.
While masking approaches offer general applicability to all data modalities, the most effective
frameworks often rely on transformer-based backbones for optimal performance [30, 23, 14]. In this
work, we focus on a model-agnostic SSRL approach. Classic autoencoder-based methods provide an
alternative to SSRL without relying explicitly on transformation invariance [24, 43, 52]. However,
these methods tend to prioritize low-level reconstruction over capturing high-level abstractions
required for downstream tasks, resulting in suboptimal performance in practical applications [28].
Tabular SSRL methods are understudied, as designing effective semantic-preserving augmentations is particularly challenging for structured data [49]. Unlike in computer vision tasks on photographic images, small changes to individual tabular features can drastically change the content, and it is often difficult for a human to determine whether two views should be considered semantically equivalent. To generate positive views, SubTab [38] uses different feature subsets. More
recently, SCARF [3] proposed to augment each record by corrupting a random subset of features.
Finally, STab [21] creates the contrastive views by imposing different regularization on the encoder
for the same input.
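The corruption idea behind SCARF can be sketched in a few lines of numpy (our simplified illustration, not the authors' implementation; the function name and corruption rate are placeholders). Each corrupted cell is resampled from the empirical marginal of its own column:

```python
import numpy as np

def scarf_corrupt(x, corruption_rate=0.6, rng=None):
    """SCARF-style augmentation sketch: replace a random subset of cells in
    each row with values drawn from the same column in other, randomly
    chosen rows (i.e., from that feature's empirical marginal)."""
    rng = rng or np.random.default_rng()
    m, d = x.shape
    mask = rng.random((m, d)) < corruption_rate     # which cells to corrupt
    donor_rows = rng.integers(0, m, size=(m, d))    # marginal resampling
    return np.where(mask, x[donor_rows, np.arange(d)], x)

rng = np.random.default_rng(1)
x = rng.normal(size=(32, 8))
view = scarf_corrupt(x, corruption_rate=0.6, rng=rng)
print(view.shape, np.any(view != x))
```

Because every corrupted value is a real value of that same feature, the view stays on the data manifold column-wise, which is what makes this corruption plausible as a semantic-preserving augmentation for tabular records.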
Time series Time series data often contains underlying patterns that are not easily identifiable
by humans, unlike images with recognizable features [29]. Consequently, designing effective data
augmentation methods for time series data poses significant challenges and often requires domain
knowledge. For example, augmentations for wearable sensor signals include rotation to simulate
different sensor placements and jittering to simulate sensor noise [39]. Other researchers have
focused on bio-signals and introduced channel augmentations that preserve the semantic information
in the context [31, 13]. Neighbourhood contrastive learning [48] proposed leveraging patient identity
information in online patient monitoring and using near-in-time data sequences of the same patient as
semantically equivalent pairs. However, these augmentations are often specifically designed for the
dataset and downstream task [51], and their performance may deteriorate when applied to other time
series data [17]. Therefore, identifying the optimal augmentation pipeline for each dataset and task
requires extensive analysis [25].
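As a concrete illustration of the sensor-signal augmentations mentioned above, here is a small numpy sketch (ours; the noise scale and window size are arbitrary, and the QR step is simply one standard way to sample a random orthogonal matrix):

```python
import numpy as np

def jitter(x, sigma=0.05, rng=None):
    """Add Gaussian noise to simulate sensor noise."""
    rng = rng or np.random.default_rng()
    return x + rng.normal(scale=sigma, size=x.shape)

def rotate(x, rng=None):
    """Apply one random 3-D rotation to a (timesteps, 3) tri-axial signal,
    simulating a different placement of the sensor on the body."""
    rng = rng or np.random.default_rng()
    Q, R = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
    Q = Q * np.sign(np.diag(R))                   # sign fix for uniformity
    return x @ Q.T

rng = np.random.default_rng(0)
signal = rng.normal(size=(128, 3))  # e.g. one accelerometer window
rotated = rotate(signal, rng=rng)
# a rotation preserves the per-timestep magnitude of the signal
print(np.allclose(np.linalg.norm(rotated, axis=1),
                  np.linalg.norm(signal, axis=1)))
```

The point of the magnitude check is that the rotation changes the per-axis readings (as a new sensor placement would) while leaving the physical acceleration magnitude untouched, which is why such views are plausibly semantically equivalent.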
The current landscape of SSRL research highlights the need for a more versatile and effective approach
capable of addressing a wider range of modalities, applications, and architectures, especially the
understudied tabular and time-series modalities.
In this section, we present learning from randomness (LFR), an efficient and general SSRL algorithm.
We recap the representation learning problem setting as follows: given observed raw data X = {· · · x_i · · · }, where all data points share the same feature domain X, the representation learning task is to learn a function f_θ(X) that produces a low-dimensional representation z_i ∈ Z for each raw data input x_i. The representation z_i should carry useful information about x_i such that for an arbitrary downstream task g(X) it is possible to learn a simple prediction function h_ϕ(Z) that replicates g(x_i) as h_ϕ(f_θ(x_i)) for all x_i ∈ X.
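As a toy numpy illustration of this setting (ours; the linear encoders and the task g are invented for the example), an encoder that preserves the information in x lets a simple least-squares probe replicate a downstream task, while a lossy encoder cannot:

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(500, 6))  # raw data x_i
def g(x):
    """An arbitrary downstream task g(x): a fixed linear functional."""
    return x @ np.array([1.0, -2.0, 0.5, 0.0, 0.0, 0.0])
y = g(X)

def probe_error(Z, y):
    """Fit a simple linear predictor h_phi on representations Z by least
    squares and return its relative error at replicating the task."""
    W, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return np.linalg.norm(Z @ W - y) / np.linalg.norm(y)

f_good = rng.normal(size=(6, 6))  # invertible encoder: z keeps all information
f_bad = rng.normal(size=(6, 1))   # lossy encoder: z collapses x to one number

print(probe_error(X @ f_good, y))  # near zero: the probe recovers g
print(probe_error(X @ f_bad, y))   # large: information needed for g is gone
```

The contrast makes the definition operational: a representation is good precisely when simple predictors on z can replicate downstream tasks that depend on x.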
As mentioned in the problem statement above, the ultimate purpose of representation learning is to
support arbitrary downstream predictive tasks. In reality, there is usually a small subset of downstream
tasks which are considered important. It is not a priori clear that directly learning to predict purely
random tasks would lead to good representations for important tasks.
To demonstrate the possibility of learning from randomness, we propose the surprising pretext task shown in Figure 1. The pretext task contains three components, namely a representation model f_θ(X), a set of randomly generated data projection functions G = {· · · g^(k)(X) · · · }, and a set of simple predictors H_Φ = {· · · h_ϕ^(k)(Z) · · · } that aim to predict the outcome of each random projection.
Figure 1: Our proposed architecture for learning from randomness. An input x is encoded by f_θ into a useful representation z, while also being fed to random projection functions g^(k). Simple, learnable predictor functions h_ϕ^(k) try to match the outputs y^(k) from the projectors g^(k), which is only possible when z contains rich information about the input.
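To make the three components concrete, the following numpy sketch runs one forward pass of the pretext task (our illustration; the layer sizes and ReLU maps are arbitrary choices, and in actual training f_θ and the h_ϕ^(k) would be optimized by gradient descent while the g^(k) stay frozen):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_in, d_z, d_out, K = 16, 10, 8, 6, 3

def relu(a):
    return np.maximum(a, 0.0)

# Encoder f_theta and the simple predictors h_phi^(k) are the trainable parts;
# the projectors g^(k) are randomly initialized and then frozen (stop-gradient).
W_enc = rng.normal(size=(d_in, d_z))
projectors = [(rng.normal(size=(d_in, 32)), rng.normal(size=(32, d_out)))
              for _ in range(K)]                                # frozen g^(k)
predictors = [rng.normal(size=(d_z, d_out)) for _ in range(K)]  # h_phi^(k)

x = rng.normal(size=(m, d_in))
z = relu(x @ W_enc)                                   # representation z = f_theta(x)

targets = [relu(x @ W1) @ W2 for W1, W2 in projectors]  # y^(k) = g^(k)(x)
preds = [z @ P for P in predictors]                     # y_hat^(k) = h_phi^(k)(z)

# Training would minimize a divergence D(y^(k), y_hat^(k)) with respect to
# f_theta and h_phi^(k); gradients never flow into the projectors g^(k).
print([t.shape for t in targets], [p.shape for p in preds])
```

Because each h_ϕ^(k) is simple, matching all K random targets is only possible when z itself retains rich information about x, which is the intuition the figure depicts.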
3.2 Divergence Measure: Batch-wise Barlow Twins
Now we return to options for the divergence D in Objective 1. While there are several common choices
of divergence used in machine learning such as Mean Squared Error (MSE), Cross Entropy (CE), or
the Contrastive [15] and Triplet [34] losses, they are often inadequate for enforcing identifications
between subtly different points as is crucial for representation learning tasks; MSE downweights the
importance of small errors, CE is ill suited for regression tasks, while the Contrastive and Triplet
losses introduce significant stochasticity.
The Barlow Twins loss [50] has garnered much interest in the SSRL literature as it disentangles
learned representations through redundancy reduction. We also note its ability to scale to very
high-dimensional vectors [50]. Thus, we introduce Batch-wise Barlow Twins (BBT), a variant that measures representation differences between data instances from two sources, the random projector g^(k) and the predictor h_ϕ^(k), rather than disentangling the representation encoding. We define the BBT loss as

$$\mathcal{L}_{\mathrm{BBT}} = \sum_{k} \sum_{i} \Big[ \big(1 - c^{(k)}_{ii}\big)^{2} + \lambda \sum_{j \neq i} \big(c^{(k)}_{ij}\big)^{2} \Big], \tag{5}$$

where the c^(k)_ij are the entries of a cosine similarity matrix,

$$c^{(k)}_{ij} = \frac{\big(y^{(k)}_{i}\big)^{\top} \hat{y}^{(k)}_{j}}{\big\|y^{(k)}_{i}\big\|_{2}\,\big\|\hat{y}^{(k)}_{j}\big\|_{2}}, \qquad y^{(k)}_{i} = g^{(k)}(x_{i}), \qquad \hat{y}^{(k)}_{i} = h^{(k)}_{\phi}(f_{\theta}(x_{i})), \tag{6}$$

and y^(k)_i, ŷ^(k)_i ∈ ℝ^{d^(k)}. Compared to the loss in [50], Eq. 5 has an extra summation over the ensemble index k. The main difference comes from the definition of the cosine similarity matrix: ours is an m × m matrix with m the batch size, whereas in Barlow Twins it is a d^(k) × d^(k) matrix.
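The BBT loss can be transcribed directly into numpy (a sketch under our reading of Eqs. 5-6; the value of λ here is an arbitrary placeholder, not the paper's setting):

```python
import numpy as np

def bbt_loss(ys, y_hats, lam=5e-3):
    """Batch-wise Barlow Twins loss of Eq. 5. ys[k] holds the projector
    targets y^(k) and y_hats[k] the predictor outputs, each of shape
    (m, d^(k)); the cosine matrix c^(k) is m x m (batch-wise), unlike the
    feature-wise d x d matrix of the original Barlow Twins."""
    total = 0.0
    for y, y_hat in zip(ys, y_hats):
        yn = y / np.linalg.norm(y, axis=1, keepdims=True)
        hn = y_hat / np.linalg.norm(y_hat, axis=1, keepdims=True)
        c = yn @ hn.T                                  # c_ij of Eq. 6
        diag = np.diag(c)
        on_diag = np.sum((1.0 - diag) ** 2)            # align matched instances
        off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # suppress cross terms
        total += on_diag + lam * off_diag
    return total

rng = np.random.default_rng(0)
y = rng.normal(size=(8, 6))
loss_match = bbt_loss([y], [2.0 * y])  # cosine is scale-invariant: diagonal term vanishes
loss_rand = bbt_loss([y], [rng.normal(size=(8, 6))])
print(loss_match < loss_rand)
```

Note that the diagonal term only requires each prediction to point in the same direction as its target, while the off-diagonal term penalizes a predictor that collapses distinct instances onto similar outputs.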
Learning from randomness aims to extract useful representations from random projection functions
g (k) (X ) ∈ G which mimic arbitrary downstream tasks. In practice we create multiple data projections
by randomly initializing neural networks of various architectures. Functions generated this way
can often capture similar information to each other when diversity is not specifically encouraged,
which limits the generalization capabilities of the representations learned by fθ . While increasing the
number of random projection functions could mitigate the diversity problem by brute force, such an
approach is computationally wasteful because it would maintain many similar projectors.
We propose a solution that picks K diverse projectors from N ≫ K randomly generated candidates.
The underlying hypothesis is that one sufficiently large batch of data can reveal the behavioral
differences between candidate random projectors. Presuming there is a batch of data X ∈ ℝ^{m×d}, for each of the N randomly generated projectors g^(k)(X) ∈ G we produce the normalized outputs

$$Y^{(k)} = g^{(k)}(X) / \big\|g^{(k)}(X)\big\|_{2}, \qquad Y^{(k)} \in \mathbb{R}^{m \times d^{(k)}}. \tag{7}$$

We then compute the cosine similarity over the batch of outputs for each projector as

$$A^{(k)} = Y^{(k)} \big(Y^{(k)}\big)^{\top}, \qquad A^{(k)} \in \mathbb{R}^{m \times m}. \tag{8}$$

By flattening the matrix A^(k) and again normalizing, we obtain a vector a^(k) ∈ ℝ^{m²×1}, which acts as the signature of the k'th projector with respect to the batch. Finally, to select K target models from the N candidates, we search for a subset that maximizes the following binary constrained optimization problem involving matrices Ã made from K stacked vectors a^(k),

$$\operatorname*{argmax}_{s} \; |\det(B)| \quad \text{s.t.} \quad B = \tilde{A}\tilde{A}^{\top}, \quad \tilde{A} = \big[\, a^{(k)} \mid k \in [0, N],\, s_{k} = 1 \,\big], \quad \sum_{k'} s_{k'} = K, \tag{9}$$

where the 1's in the binary vector s ∈ {0, 1}^N indicate the chosen projectors. While this problem
is known to be NP-hard, approximate methods such as the Fast Determinantal Point Process [10]
can find good solutions in reasonable time. It is worth noting that our diversity encouragement
solution does not involve gradient computations, and can be run once as a pre-processing step without
occupying computation resources during the SSRL training phase.
We summarize the full LFR algorithm in Appendix A.
4 Experiments and Evaluation
4.1 Datasets
We consider both time series and tabular data types in various domains to show the wide applicability
of learning from randomness. Datasets are further detailed in Appendix B.1.
Time series We utilized two standard time-series datasets, Human Activity Recognition (HAR) [2]
and Epileptic Seizure Recognition [1]. Both datasets were pre-processed using the same methods
as in TS-TCC [17]. As a larger scale test we also include the MIMIC-III dataset, a standard in the
medical domain for tasks involving electronic health record data. We utilized the pre-processed
version of the MIMIC-III Benchmark dataset [22], and focused on the length-of-stay task [48] which
is framed as a 10-class classification problem, where each class represents a different duration of stay.
Tabular We used three tabular UCI datasets in our experiments: Adult Income (Income) [26], First
Order Theorem Proving (Theorem) [9], and HEPMASS [4]. For Income, a binary classification
problem, we followed the data preprocessing steps in [38]. The Theorem dataset is framed as a 6-class
classification problem. The much larger HEPMASS dataset is another binary classification task
which includes 7 million training and 3.5 million testing events, each with 27 features.
4.2 Implementations
Evaluation All the downstream tasks in our study are treated as classification problems. To evaluate
the quality of the pre-trained representations, we employed supervised classifiers that are specific to
each dataset. For the MIMIC-III dataset we utilized an MLP classifier [48]. For tabular datasets, we
used logistic regression, similar to the approach in STab [21]. For the remaining datasets, a linear
classifier was employed. We experimented with both downstream evaluation where the classifiers
were trained on frozen representations, and finetuning where the classifier and representation model
are jointly trained. For finetuning on the larger datasets MIMIC-III and HEPMASS, we chose a
semi-supervised approach where we randomly selected 10% of labeled data from these datasets,
and we used a subset of the baseline methods. These decisions were driven by the computational
resources required when dealing with extensive data and finetuning large representation models.
Conversely, for smaller datasets, we conducted finetuning using the complete set of available labeled
data, enabling us to evaluate the model’s performance across the entirety of the datasets.
Metrics were then computed on the test set. Accuracy is our primary metric, except for MIMIC-III
where we adopted linearly weighted Cohen’s Kappa as in [48], with higher values indicating better
agreement. To ensure the robustness of our results, we conducted multiple random runs and report
the mean and standard deviation, using 5 runs for tabular datasets and 3 runs for time-series.
Model architectures Regarding the model architectures, we adopted similar backbone encoders as
previous works. For the HAR and Epilepsy datasets, we utilized the same 3-block convolutional layers
as TS-TCC [44]. For the MIMIC-III dataset, we employed the Temporal Convolutional Network used
by NCL [48]. For the Tabular datasets, we used 4-layer MLPs, following the approach in SCARF [3].
To avoid domain-specific projector design, in each case the random projectors reuse the architecture
from the encoder, but are scaled down. Complete details are in Appendix B.2.
Baseline methods Table 1 summarizes all baselines used in our experiments. It is worth noting that
while our proposed framework LFR is domain-agnostic, popular SSRL methods such as SimCLR and
SimSiam require domain-specific augmentations to achieve optimal performance. Specifically, the
default augmentations used for view creation in SimCLR and SimSiam are designed for natural image
classification, and may not be suitable for other modalities. In our experiments with tabular datasets,
we compare our approach to SCARF [3], which is a version of SimCLR adapted to tabular data that
uses random corruptions as augmentations, as well as STab [21] which is similar to SimSiam. For
more detailed information on the implementations and augmentations, please refer to Appendix B.3,
and B.4. Information on the computing resources used is in Appendix B.5.
The performance of LFR and baselines across multiple modalities and domains using downstream
evaluation is shown in Table 2. Our experiments show that for time series and tabular data where
there are no standardized augmentation pipelines, LFR had the strongest performance, outperforming the other self-supervised learning methods in most cases, including domain-agnostic ones such as DACL. For instance, on the HAR and Epilepsy datasets, LFR was the best performing method, beating the time-series-specific self-supervised learning method TS-TCC. Similarly, on the Income and Theorem datasets, LFR outperformed the tabular-specific self-supervised learning baselines SCARF and STab. Although on the HEPMASS dataset LFR was
not the best, it still performed well, comparable to the autoencoder and SCARF. Interestingly, for
the Income dataset, LFR even outperformed supervised training. For time series and tabular data,
augmentation-based methods like SimSiam tend to underperform. For example, SimSiam was worse
than a randomly initialized encoder in HAR and Income.
Table 2: Performance comparison across various application domains with downstream evaluation.
Results of the best self-supervised learning methods are in bold.
Table 3: Performance comparison across various application domains with finetuning. Results of the
best self-supervised learning methods are in bold.
Method     HAR           Epilepsy      MIMIC-III     Income        Theorem       HEPMASS
SimSiam    93.4 ± 0.6    97.9 ± 0.2    49.4 ± 0.3    85.2 ± 0.1    52.5 ± 0.8    90.7 ± 0.0
SimCLR     93.7 ± 1.1    97.8 ± 0.2    48.6 ± 0.8    -             -             -
SCARF      -             -             -             85.1 ± 0.2    53.8 ± 0.8    90.9 ± 0.0
STab       -             -             -             85.3 ± 0.2    53.0 ± 0.7    91.1 ± 0.0
LFR        94.7 ± 1.4    98.2 ± 0.2    49.6 ± 0.1    85.3 ± 0.1    54.3 ± 0.4    90.8 ± 0.0
Finetuning results are shown in Table 3. Through the finetuning process, all methods exhibit more
comparable performance across the datasets. LFR still achieved the best performance on a majority
of the datasets we used, although with overlapping error bars to other methods in those cases.
Overall, these experimental results reflect our hypothesis: it is feasible to learn high-quality data representations across all domains tested by predicting random data projections. LFR shows comparatively good performance on domains where semantic-preserving augmentations are difficult to create.
On the Theorem dataset we evaluated the performance of LFR and baseline SSRL approaches across latent dimension sizes. Figure 2 shows that increasing the latent dimension improved the accuracy of each approach up to about 256. LFR consistently outperformed all the other baselines across all the latent dimension settings.

Figure 2: Effect of embedding dimension on LFR and self-supervised learning baselines (SimSiam, Autoencoder, DIET, SCARF, STab).
References
[1] R. G. Andrzejak, K. Lehnertz, F. Mormann, C. Rieke, P. David, and C. E. Elger. Indications
of nonlinear deterministic and finite-dimensional structures in time series of brain electrical
activity: Dependence on recording region and brain state. Physical Review E, 64(6):061907,
2001.
[2] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz. A public domain dataset for
human activity recognition using smartphones. In The European Symposium on Artificial Neural
Networks, 2013.
[3] D. Bahri, H. Jiang, Y. Tay, and D. Metzler. Scarf: Self-supervised contrastive learning using
random feature corruption. In International Conference on Learning Representations, 2022.
[4] P. Baldi, K. Cranmer, T. Faucett, P. Sadowski, and D. Whiteson. Parameterized neural networks
for high-energy physics. The European Physical Journal C, 76(5):235, Apr 2016.
[5] R. Balestriero. Unsupervised Learning on a DIET: Datum IndEx as Target Free of Self-
Supervision, Reconstruction, Projector Head. arXiv:2302.10260, 2023.
[6] R. Balestriero, M. Ibrahim, V. Sobal, A. Morcos, S. Shekhar, T. Goldstein, F. Bordes, A. Bardes,
G. Mialon, Y. Tian, et al. A Cookbook of Self-Supervised Learning. arXiv:2304.12210, 2023.
[7] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives.
IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
[8] J. Brehmer and K. Cranmer. Flows for simultaneous manifold learning and density estimation.
In Advances in Neural Information Processing Systems, volume 33, 2020.
[9] J. P. Bridge, S. Holden, and L. C. Paulson. Machine Learning for First-Order Theorem Proving
- Learning to Select a Good Heuristic. J. Autom. Reason., 53:141–172, 2014.
[10] L. Chen, G. Zhang, and E. Zhou. Fast greedy map inference for determinantal point process to
improve recommendation diversity. Advances in Neural Information Processing Systems, 31,
2018.
[11] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning
of visual representations. In International conference on machine learning, pages 1597–1607.
PMLR, 2020.
[12] X. Chen and K. He. Exploring simple siamese representation learning. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758,
2021.
[13] J. Y. Cheng, H. Goh, K. Dogrusoz, O. Tuzel, and E. Azemi. Subject-aware contrastive learning
for biosignals. arXiv:2007.04871, 2020.
[14] M. Cheng, Q. Liu, Z. Liu, H. Zhang, R. Zhang, and E. Chen. Timemae: Self-supervised represen-
tations of time series with decoupled masked autoencoders. arXiv preprint arXiv:2303.00320,
2023.
[15] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with
application to face verification. In IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, volume 1, pages 539–546, 2005.
[16] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In Proceedings of NAACL-HLT, pages 4171–4186,
2019.
[17] E. Eldele, M. Ragab, Z. Chen, M. Wu, C. K. Kwoh, X. Li, and C. Guan. Time-series repre-
sentation learning via temporal and contextual contrasting. In Proceedings of the Thirtieth
International Joint Conference on Artificial Intelligence, pages 2352–2359, 2021.
[18] T. Gao, X. Yao, and D. Chen. SimCSE: Simple contrastive learning of sentence embeddings. In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,
pages 6894–6910, 2021.
[19] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch,
B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to
self-supervised learning. Advances in neural information processing systems, 33:21271–21284,
2020.
[20] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for
unnormalized statistical models. In Proceedings of the Thirteenth International Conference on
Artificial Intelligence and Statistics, volume 9, pages 297–304, 2010.
[21] E. Hajiramezanali, N. L. Diamant, G. Scalia, and M. W. Shen. STab: Self-supervised Learning
for Tabular Data. In NeurIPS 2022 First Table Representation Workshop, 2022.
[22] H. Harutyunyan, H. Khachatrian, D. C. Kale, G. Ver Steeg, and A. Galstyan. Multitask learning
and benchmarking with clinical time series data. Scientific data, 6(1):96, 2019.
[23] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable
vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 16000–16009, 2022.
[24] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks.
science, 313(5786):504–507, 2006.
[25] B. K. Iwana and S. Uchida. An empirical survey of data augmentation for time series classifica-
tion with neural networks. PLOS ONE, 16(7):1–32, 2021.
[26] R. Kohavi et al. Scaling up the accuracy of Naive-Bayes classifiers: A decision-tree hybrid. In
KDD, volume 96, pages 202–207, 1996.
[27] P. H. Le-Khac, G. Healy, and A. F. Smeaton. Contrastive representation learning: A framework
and review. IEEE Access, 8:193907–193934, 2020.
[28] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang. Self-supervised learning:
Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 35(1):857–
876, 2021.
[29] D. Luo, W. Cheng, Y. Wang, D. Xu, J. Ni, W. Yu, X. Zhang, Y. Liu, Y. Chen, H. Chen, et al.
Time series contrastive learning with information-aware augmentations. arXiv:2303.11911,
2023.
[30] K. Majmundar, S. Goyal, P. Netrapalli, and P. Jain. Met: Masked encoding for tabular data,
2022.
[31] M. N. Mohsenvand, M. R. Izadi, and P. Maes. Contrastive representation learning for elec-
troencephalogram classification. In Machine Learning for Health, pages 238–253. PMLR,
2020.
[32] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu.
Exploring the limits of transfer learning with a unified text-to-text transformer, 2020.
[33] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning Internal Representations by Error
Propagation, page 318–362. MIT Press, Cambridge, MA, USA, 1986.
[34] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition
and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 815–823, 2015.
[35] C. Shorten and T. M. Khoshgoftaar. A survey on image data augmentation for deep learning.
Journal of Big Data, 6(1):1–48, 2019.
[36] Y. Tian, X. Chen, and S. Ganguli. Understanding self-supervised learning dynamics without
contrastive pairs. In International Conference on Machine Learning, pages 10268–10278.
PMLR, 2021.
[37] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola. What makes for good views for
contrastive learning? Advances in Neural Information Processing Systems, 33:6827–6839, 2020.
[38] T. Ucar, E. Hajiramezanali, and L. Edwards. SubTab: Subsetting features of tabular data for
self-supervised representation learning. Advances in Neural Information Processing Systems,
34:18853–18865, 2021.
[39] T. T. Um, F. M. J. Pfister, D. Pichler, S. Endo, M. Lang, S. Hirche, U. Fietzek, and D. Kulić.
Data Augmentation of Wearable Sensor Data for Parkinson’s Disease Monitoring Using Con-
volutional Neural Networks. In Proceedings of the 19th ACM International Conference on
Multimodal Interaction, page 216–220, 2017.
[40] A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive
coding. arXiv:1807.03748, 2018.
[41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and
I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems,
volume 30, 2017.
[42] V. Verma, T. Luong, K. Kawaguchi, H. Pham, and Q. Le. Towards domain-agnostic contrastive
learning. In International Conference on Machine Learning, pages 10530–10541. PMLR, 2021.
[43] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust
features with denoising autoencoders. In Proceedings of the 25th International Conference on
Machine Learning, pages 1096–1103, 2008.
[44] Z. Wang, W. Yan, and T. Oates. Time series classification from scratch with deep neural
networks: A strong baseline. In 2017 International Joint Conference on Neural Networks
(IJCNN), pages 1578–1585. IEEE, 2017.
[45] J. Wei and K. Zou. EDA: Easy data augmentation techniques for boosting performance on text
classification tasks. arXiv:1901.11196, 2019.
[46] X. Wu, S. Lv, L. Zang, J. Han, and S. Hu. Conditional BERT contextual augmentation. In
Computational Science–ICCS 2019: 19th International Conference, Faro, Portugal, June 12–14,
2019, Proceedings, Part IV 19, pages 84–95. Springer, 2019.
[47] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu. SimMIM: A simple
framework for masked image modeling, 2022.
[48] H. Yèche, G. Dresdner, F. Locatello, M. Hüser, and G. Rätsch. Neighborhood contrastive learn-
ing applied to online patient monitoring. In Proceedings of the 38th International Conference
on Machine Learning, volume 139, pages 11964–11974, 2021.
[49] J. Yoon, Y. Zhang, J. Jordon, and M. van der Schaar. VIME: Extending the success of self-
and semi-supervised learning to tabular domain. Advances in Neural Information Processing
Systems, 33:11033–11043, 2020.
[50] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny. Barlow twins: Self-supervised learning via
redundancy reduction. In International Conference on Machine Learning, pages 12310–12320.
PMLR, 2021.
[51] J. Zhang and K. Ma. Rethinking the augmentation module in contrastive learning: Learning
hierarchical augmentation invariance with expanded views. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 16650–16659, 2022.
[52] R. Zhang, P. Isola, and A. A. Efros. Split-brain autoencoders: Unsupervised learning by cross-
channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 1058–1067, 2017.
A Algorithm
We summarize the LFR algorithm in Algorithm 1, which uses the subroutine in Algorithm 2.
B Implementation Details
Table 4 provides a summary of all datasets used in our experiments, along with the corresponding
downstream tasks and evaluation metrics.
Table 5: Details on LFR architectural parameters
HAR/Epilepsy: For LFR, we used a three-block convolutional network from [44, 17] as the
representation model f_θ. For the predictors h_ϕ^(k), we used a single linear layer. For the random
projectors g^(k), we adopted an architecture similar to the representation model but with slightly
decreased complexity: a two-block convolutional network with 16 and 32 channels, followed by two
sequential linear layers with a hidden dimension of 256. For all other self-supervised methods, we
used the same representation model for a fair comparison. For SimCLR [11] and SimSiam [11], we
used the same predictors as LFR, and a 3-layer ReLU network with a hidden dimension of 512 as a
projector; the first two linear layers are each followed by a batchnorm layer. To create the contrastive
views for SimCLR [11] and SimSiam [11], we adopted the same augmentations as designed in TS-TCC [17].
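The projector described above can be sketched as follows. This is a minimal illustration assuming PyTorch; the input and output dimensions (128) are illustrative choices, not values taken from the paper.

```python
import torch
import torch.nn as nn

# Hedged sketch of the 3-layer ReLU projector used for the SimCLR and
# SimSiam baselines: hidden dimension 512, with a batchnorm layer after
# each of the first two linear layers.
def make_projector(in_dim=128, hidden_dim=512, out_dim=128):
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )

projector = make_projector()
z = projector(torch.randn(4, 128))  # batch of 4 representations
```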
MIMIC-III: For all methods, we followed the encoder structure from [48] as the representation
model/encoder, except that we used flattened temporal convolutional network (TCN) features
followed by a linear layer, producing an embedding size of 64. We also disabled the
L2 normalization in the encoder. For the random projectors in LFR, we adopted a three-block TCN
with a kernel size of 2, followed by a linear layer, with an output channel size of 64 for each layer. The
two-layer ReLU predictor, with a hidden dimension of 256, is shared among LFR, SimCLR, and SimSiam.
We used the same projector and augmentation as in HAR/Epilepsy for SimCLR and SimSiam.
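A random projector of this shape can be sketched as below, assuming PyTorch. The dilation schedule, input channel count, and sequence length are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

# Hedged sketch of the MIMIC-III random projector: a three-block temporal
# convolutional network (TCN) with kernel size 2 and 64 channels per layer,
# whose flattened features feed a linear layer producing the 64-d embedding.
class RandomTCNProjector(nn.Module):
    def __init__(self, in_channels=16, embed_dim=64):
        super().__init__()
        blocks, ch = [], in_channels
        for i in range(3):  # three blocks with doubling dilation
            blocks += [
                nn.Conv1d(ch, 64, kernel_size=2, dilation=2 ** i, padding=2 ** i),
                nn.ReLU(inplace=True),
            ]
            ch = 64
        self.tcn = nn.Sequential(*blocks)
        self.head = nn.LazyLinear(embed_dim)  # flattened TCN features -> embedding

    def forward(self, x):  # x: (batch, channels, time)
        return self.head(self.tcn(x).flatten(1))

proj = RandomTCNProjector()
emb = proj(torch.randn(4, 16, 48))
```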
Income/Theorem: For LFR, we followed the setup in [21, 3] and used a 4-layer ReLU network
with a hidden dimension of 256 as the representation model, with a single linear layer predictor.
The random projector networks had a similar architecture but were less complex, using a 2-layer
ReLU network with a hidden dimension of 256. For the contrastive baselines, we employed the same
encoder and predictor for a fair comparison, and followed [3] by using a 2-layer ReLU network with
a hidden dimension of 256 as projectors. To generate the contrastive views, we used the SCARF [3]
augmentation technique to randomly corrupt features with values sampled from their empirical
distribution, ensuring that our SimCLR baseline was identical to SCARF.
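The SCARF-style corruption can be sketched as follows; the corruption rate p and all array shapes are illustrative, and the empirical distribution is taken over the rows passed in.

```python
import numpy as np

# Hedged sketch of SCARF-style corruption: with probability p, each entry
# of each example is replaced by a value drawn from that feature's
# empirical marginal distribution over the training set.
def scarf_corrupt(X: np.ndarray, p: float = 0.6, rng=None) -> np.ndarray:
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # A replacement for entry (i, j) is X[r, j] for a uniformly random row r,
    # i.e. a draw from feature j's empirical distribution.
    replacements = X[rng.integers(0, n, size=(n, d)), np.arange(d)]
    mask = rng.random((n, d)) < p  # which entries to corrupt
    return np.where(mask, replacements, X)

X = np.arange(12, dtype=float).reshape(4, 3)
X_aug = scarf_corrupt(X, p=0.6, rng=0)
```

Every corrupted entry stays within the observed values of its own column, which is what distinguishes this corruption from adding arbitrary noise.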
HEPMASS: For the HEPMASS dataset, we used the same network architecture as for the In-
come/Theorem datasets but with the output latent dimension of the encoder set to 16.
All supervised baselines use the same representation model as the SSL methods, with the final layer
being a linear classification layer. We summarize all architecture-related settings of LFR in Table 5.
Table 6 summarizes all the training settings used for LFR, while Table 7 outlines the evaluation
settings used for downstream tasks. We used a logistic regression classifier for all tabular datasets,
including Income, Theorem, and HEPMASS, while for MIMIC-III we used an MLP network to
predict the length of stay, following [48]. For the remaining datasets, we followed prior works such as
TS-TCC [17], SimSiam [11], and BYOL [27] by using a linear classifier for classification tasks.
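The linear-evaluation protocol for the tabular datasets can be sketched as below, assuming scikit-learn; random features stand in for the frozen-encoder embeddings, and all dimensions and labels are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hedged sketch of linear evaluation: the pretrained encoder is frozen and
# a logistic regression classifier is fit on its embeddings.
rng = np.random.default_rng(0)
Z_train = rng.normal(size=(200, 16))       # stand-in for frozen embeddings
y_train = (Z_train[:, 0] > 0).astype(int)  # synthetic downstream labels

clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
acc = clf.score(Z_train, y_train)
```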
Table 7: Details on Linear Evaluation Settings of SSL methods
Baseline training settings: All self-supervised baselines adopt the same training settings as LFR
unless stated otherwise. For DIET on MIMIC-III, we used a batch size of 512 and trained for 2000 steps
with 10 warmup epochs. We trained the autoencoder on MIMIC-III for 5000 epochs with
500 warmup epochs. For the other self-supervised methods on MIMIC-III, we also added 60 warmup
epochs. We summarize the training settings of supervised baselines in Table 8.
Dataset          Optimizer  Batch size  Learning rate  Optimizer parameters     Epochs  Augmentations
HAR/Epilepsy     Adam       128         3e-4           β=(0.9, 0.999), wd=3e-4  500     None
Income/Theorem   Adam       128         1e-3           β=(0.9, 0.999), wd=0     100     None
MIMIC-III        Adam       4096        5e-6           β=(0.9, 0.999), wd=5e-4  10      Same as SimCLR and SimSiam
TS-TCC: Our results for TS-TCC on the HAR and Epilepsy datasets had several discrepancies
with the values reported in the original TS-TCC paper [17]. We discovered that in the official
implementation of TS-TCC, the input data was augmented once and then kept the same throughout
training, rather than being randomly augmented in each forward pass. We fixed this bug and were able
to achieve better results. Additionally, we increased the number of training epochs for our supervised
baseline, which also led to improved performance. Lastly, we noticed that in the original TS-TCC
implementation, the random initialization ablation was evaluated using a randomly initialized linear
classification head that was not trained, whereas we evaluated with a trained linear classification layer
and saw a significant increase in accuracy for this ablation.
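The augmentation bug and our fix can be contrasted in a short sketch; the jitter augmentation and shapes here are illustrative, not the exact TS-TCC transforms.

```python
import numpy as np

# Buggy pattern: augment the dataset once before training and reuse the
# same views every epoch. Fixed pattern: draw fresh random augmentations
# in each forward pass.
def jitter(x: np.ndarray, sigma: float = 0.1, rng=None) -> np.ndarray:
    rng = np.random.default_rng(rng)
    return x + rng.normal(scale=sigma, size=x.shape)

x = np.zeros((8, 128))  # a batch of univariate time series

# Buggy: one fixed augmented view reused for the whole run.
x_aug_static = jitter(x, rng=0)

# Fixed: every epoch/forward pass sees a newly sampled view.
fresh_views = [jitter(x) for _ in range(2)]
```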
STab: The original STab paper [21] did not provide information about the random DropConnect
ratio or training hyperparameters used in their experiments. In our implementation, we used the same
training hyperparameters as other SSL methods and tested DropConnect ratios of 0.1, 0.2, 0.4, 0.6,
and 0.8, with the results shown in Figure 3. We selected the best-performing ratio for each experiment
and reported the corresponding results. We ended up selecting 0.1 for Income and 0.8 for Theorem.
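The DropConnect perturbation being swept can be sketched as follows; the layer shapes and the 0.4 ratio are illustrative, not STab's actual configuration.

```python
import numpy as np

# Hedged sketch of DropConnect on a linear layer: each weight is zeroed
# independently with probability `ratio` on every forward pass.
def dropconnect_linear(x, W, b, ratio=0.1, rng=None):
    rng = np.random.default_rng(rng)
    mask = rng.random(W.shape) >= ratio  # keep each weight with prob 1 - ratio
    return x @ (W * mask) + b

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 16))
b = np.zeros(16)
out = dropconnect_linear(x, W, b, ratio=0.4, rng=1)
```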
[Figure 3: STab accuracy as a function of dropout rate, plotted for Income and Theorem at latent dimensions 16, 32, and 256.]

The time series experiments with HAR and Epilepsy were conducted on a Tesla V100 GPU with 32
GB of memory, except for TS-TCC, which was conducted on a TITAN V with 12 GB of memory.
These experiments took a total of 102 GPU hours, including all baseline experiments. The MIMIC-III
experiments were conducted on an NVIDIA A100 GPU with 40 GB of memory, except for TS-TCC,
which was again conducted on a TITAN V with 12 GB of memory; these experiments cost 608 GPU
hours, including all baseline experiments. The tabular dataset experiments with Income, Theorem,
and HEPMASS were conducted on an NVIDIA TITAN V GPU with 12 GB of memory and took a
total of 70 GPU hours, including all baseline experiments.