VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain
Abstract
Self- and semi-supervised learning frameworks have made significant progress in
training machine learning models with limited labeled data in image and language
domains. These methods heavily rely on the unique structure in the domain datasets
(such as spatial relationships in images or semantic relationships in language). They
are not adaptable to general tabular data which does not have the same explicit
structure as image and language data. In this paper, we fill this gap by proposing
novel self- and semi-supervised learning frameworks for tabular data, which we
refer to collectively as VIME (Value Imputation and Mask Estimation). We create
a novel pretext task of estimating mask vectors from corrupted tabular data in
addition to the reconstruction pretext task for self-supervised learning. We also
introduce a novel tabular data augmentation method for self- and semi-supervised
learning frameworks. In experiments, we evaluate the proposed framework in
multiple tabular datasets from various application domains, such as genomics and
clinical data. VIME exceeds the state-of-the-art performance of the existing baseline methods.
1 Introduction
Tremendous successes have been achieved in a variety of applications (such as image classification [1],
object detection [2], and language translation [3]) with deep learning models via supervised learning
on large labeled datasets such as ImageNet [4]. Unfortunately, collecting sufficiently large labeled
datasets is expensive and even impossible in several domains (such as medical datasets concerned
with a particularly rare disease). In these settings, however, there is often a wealth of unlabeled data
available - datasets are often collected from a large population, but target labels are only available
for a small group of people. The 100,000 Genomes project [5], for instance, sequenced 100,000
genomes from around 85,000 NHS patients affected by a rare disease, such as cancer. By definition, a rare disease occurs in fewer than 1 in 2,000 people. Datasets like these present huge opportunities
for self- and semi-supervised learning algorithms, which can leverage the unlabeled data to further
improve the performance of a predictive model.
Unfortunately, existing self- and semi-supervised learning algorithms are not effective for tabular
data¹ because they heavily rely on the spatial or semantic structure of image or language data. A
¹Tabular data is a database that is structured in a tabular form. It arranges data elements in vertical columns (features) and horizontal rows (samples).
standard self-supervised learning framework designs a (set of) pretext task(s) to learn informative
representations from the raw input features. For the language domain, BERT introduces 4 different
pretext tasks (e.g. predicting future words from previous words) to learn representations of the
language data [6]. In the image domain, rotation [7], jigsaw puzzle [8], and colorization [9] can be
utilized as pretext tasks to learn representations of the images. Standard semi-supervised learning
methods also suffer from the same problem, since the regularizers they use for the predictive model
are based on some prior knowledge of these data structures. For example, the consistency regularizer
encourages the predictive model to have the same output distribution on a sample and its augmented
variants, e.g. an image and its rotated variants [7], or two images and their convex combination(s) [10].
The notion of rotation simply does not exist in tabular data. Moreover, in many settings, variables
are often categorical, and do not admit meaningful convex combinations. Even in a setting where all
variables are continuous, there is no guarantee that the data manifold is convex and as such taking
convex combinations will either generate out-of-distribution samples (therefore degrading model
performances) or be restricted to generating samples that are very close to real samples (limiting the
effectiveness of the data augmentation); for more details, see the Supplementary Materials (Section 4).
Contribution: In this paper, we propose novel self- and semi-supervised learning frameworks for
tabular data. For self-supervised learning, we introduce a novel pretext task, mask vector estimation, in addition to feature vector estimation. To solve these pretext tasks, an encoder function learns to
construct informative representations from the raw features in the unlabeled data. For semi-supervised
learning, we introduce a novel tabular data augmentation scheme. We use the trained encoder to
generate multiple augmented samples for each data point by masking each point using several
different masks and then imputing the corrupted values for each masked data point. Finally, we
propose a systematic self- and semi-supervised learning framework for tabular data, VIME (Value
Imputation and Mask Estimation), that combines our ideas to produce state-of-the-art performances
on several tabular datasets with a few labeled samples, from various domains.
2 Related Works
Self-supervised learning (Self-SL) frameworks are representation learning methods that use unlabeled data. They can be categorized into two types: those using pretext task(s) and those using contrastive learning. Most existing
works with pretext tasks are appropriate only for images or natural language: (i) surrogate classes
prediction (scaling and translation) [11], (ii) rotation degree predictions [7], (iii) colorization [9], (iv)
relative position of patches estimation [12], (v) jigsaw puzzle solving [8], (vi) image denoising [13],
(vii) partial-to-partial registration [14], and (viii) next- and previous-word prediction [6]. Most existing works with contrastive learning are also applicable only to images or natural language due to their data augmentation schemes and their use of temporal and spatial relationships for defining similarity:
(i) contrastive predictive coding [15, 16], (ii) contrastive multi-view coding [17], (iii) SimCLR [18],
(iv) momentum contrast [19, 20].
There is some existing work on self-supervised learning which can be applied to tabular data. In
Denoising auto-encoder [21], the pretext task is to recover the original sample from a corrupted
sample. In Context Encoder [22], the pretext task is to reconstruct the original sample from both the
corrupted sample and the mask vector. The pretext task for self-supervised learning in TabNet [23]
and TaBERT [24] is also recovering corrupted tabular data.
In this paper, we propose a new pretext task: recovering the mask vector, in addition to the original sample, with a novel corrupted sample generation scheme. Also, we propose a novel tabular data
augmentation scheme that can be combined with various contrastive learning frameworks to extend
the self-supervised learning to tabular domains.
Semi-supervised learning (Semi-SL) frameworks can be categorized into two types: entropy minimization and consistency regularization. Entropy minimization encourages a classifier to output low
entropy predictions on unlabeled data. For instance, [25] constructs hard labels from high-confidence
predictions on unlabeled data, and trains the network using these pseudo-labels together with labeled
data in a supervised way. Consistency regularization encourages some sort of consistency between a
sample and some stochastically altered version of itself. Π-model [26] uses an L2 loss to encourage
consistency between predictions. Mean teacher [27] uses an L2 loss to encourage consistency between
the intermediate representations. Virtual Adversarial Training (VAT) [28] encourages prediction
consistency by minimizing the maximum difference in predictions between a sample and multiple
augmented versions. MixMatch [29] and ReMixMatch [30] combine entropy minimization with
consistency regularization in one unified framework with MixUp [10] as the data augmentation
method. There is a series of interesting works on graph-based semi-supervised learning [31, 32, 33]
which consider a special case of network data where samples are connected by given edges, e.g. a citation network where an article is connected with its citations. Here, we introduce a novel data
augmentation method for general tabular data which can be combined with various semi-supervised
learning frameworks to train a predictive model in a semi-supervised way.
3 Problem Formulation
In this section, we introduce the general formulation of self- and semi-supervised learning. Suppose
we have a small labeled dataset Dl = {(xi, yi)}_{i=1}^{Nl} and a large unlabeled dataset Du = {xi}_{i=Nl+1}^{Nl+Nu}, where Nu is typically much larger than Nl.
Self-supervised learning aims to learn informative representations from unlabeled data. In this
subsection, we focus on self-supervised learning with various self-supervised/pretext tasks for a
pretext model to solve. These tasks are set to be challenging but highly relevant to the downstream
tasks that we attempt to solve. Ideally, the pretext model will extract some useful information from
the raw data in the process of solving the pretext tasks. Then the extracted information can be utilized
by the predictive model f in the downstream tasks. In general, self-supervised learning constructs an
encoder function e : X → Z that takes a sample x ∈ X and returns an informative representation
z = e(x) ∈ Z. The representation z is optimized to solve a pretext task defined with a pseudo-label
ys ∈ Ys and a self-supervised loss function lss . For example, the pretext task can be predicting the
rotation degree of some rotated image in the raw dataset, where ys is the true rotation degree and
lss is the squared difference between the predicted rotation degree and ys . We define the pretext
predictive model as h : Z → Ys , which is trained jointly with the encoder function e by minimizing
the expected self-supervised loss function lss as follows,
min_{e,h} E_{(xs, ys) ∼ pXs,Ys} [ lss(ys, (h ◦ e)(xs)) ]    (1)
where pXs ,Ys is a pretext distribution that generates pseudo-labeled samples (xs , ys ) for training the
encoder e and pretext predictive model h. Note that we have sufficient samples to approximate the
objective function above since for each input sample in Du , we can generate a pretext sample (xs , ys )
for free, e.g. rotating an image xi to create xs and taking the rotation degree as the label ys . After
training, the encoder function e can be used to extract better data representations from raw data for
solving various downstream tasks. Note that in settings where the downstream task (and a loss for it)
are known in advance, the encoder can be trained jointly with the downstream task’s model.
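To make the abstract objective in Equation (1) concrete, the following is a minimal sketch of one joint update of the encoder e and the pretext predictive model h on a pseudo-labeled batch. It is written in PyTorch with function and argument names of our own choosing (the paper does not prescribe this decomposition), and it assumes the optimizer was constructed over the parameters of both modules.

```python
import torch


def pretext_training_step(encoder, pretext_head, x_s, y_s, loss_fn, optimizer):
    """One gradient step on the objective in Equation (1):
    minimize l_ss(y_s, (h o e)(x_s)) jointly over e and h.

    encoder      : e, maps raw features x_s to representations z.
    pretext_head : h, maps z to the pseudo-label space Y_s.
    (x_s, y_s)   : a pseudo-labeled batch drawn from the pretext distribution.
    loss_fn      : the self-supervised loss l_ss.
    optimizer    : assumed to hold the parameters of both encoder and head.
    """
    optimizer.zero_grad()
    z = encoder(x_s)                        # z = e(x_s)
    loss = loss_fn(pretext_head(z), y_s)    # l_ss(y_s, (h o e)(x_s))
    loss.backward()                         # back-propagate through h and e jointly
    optimizer.step()
    return loss.item()
```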
Semi-supervised learning optimizes the predictive model f by minimizing the supervised loss
function jointly with some unsupervised loss function defined over the output space Y. Formally,
semi-supervised learning is formulated as an optimization problem as follows,
min_f E_{(x,y) ∼ pXY} [ l(y, f(x)) ] + β · E_{x ∼ pX, x′ ∼ p̃X(x′|x)} [ lu(f(x), f(x′)) ]    (2)
Figure 1: Block diagram of the proposed self-supervised learning framework on tabular data. (1)
Mask generator generates binary mask vector (m) which is combined with an input sample (x) to
create a masked and corrupted sample (x̃), (2) Encoder (e) transforms x̃ into a latent representation
(z), (3) Mask vector estimator (sm ) is trained by minimizing the cross-entropy loss with m, feature
vector estimator (sr ) is trained by minimizing the reconstruction loss with x, (4) Encoder (e) is trained
by minimizing the weighted sum of both losses.
The first term is estimated using the small labeled dataset Dl, while the second term is estimated using all input features in Du. The
unsupervised loss function (lu ) is often inspired by some prior knowledge of the downstream task. For
example, consistency regularization encourages the model f to produce the same output distribution
when its inputs are perturbed (x′).
4 Proposed Method: VIME
We introduce two pretext tasks: feature vector estimation and mask vector estimation. Our goal is to
optimize a pretext model to recover an input sample (a feature vector) from its corrupted variant, at
the same time as estimating the mask vector that has been applied to the sample.
In our framework, the two pretext tasks share a single pretext distribution pXs ,Ys . First, a mask
vector generator outputs a binary mask vector m = [m1, ..., md]ᵀ ∈ {0, 1}^d, where each mj is randomly sampled from a Bernoulli distribution with probability pm (i.e. pm = ∏_{j=1}^{d} Bern(mj | pm)). Then a
pretext generator gm : X × {0, 1}d → X takes a sample x from Du and a mask vector m as input,
and generates a masked sample x̃. The generating process of x̃ is given by
x̃ = gm(x, m) = m ⊙ x̄ + (1 − m) ⊙ x    (3)
where ⊙ denotes element-wise multiplication and the j-th feature of x̄ is sampled from the empirical distribution p̂_{Xj} = (1/Nu) Σ_{i=Nl+1}^{Nl+Nu} δ(xj = xi,j), where xi,j is the j-th feature of the i-th sample in Du (i.e. the empirical marginal distribution of each feature); see Figure 3 in the Supplementary Materials for further details.
process in Equation (3) ensures the corrupted sample x̃ is not only tabular but also similar to the
samples in Du. Compared with standard sample corruption approaches, e.g. adding Gaussian noise or replacing masked features with zeros, our approach generates an x̃ that is more difficult to distinguish from x. This difficulty is crucial for self-supervised learning, on which we will elaborate
more in the following sections.
There are two sources of randomness in our pretext distribution pXs,Ys. Explicitly, m is
a random vector sampled from a Bernoulli distribution. Implicitly, the pretext generator gm is
also a stochastic function whose randomness comes from x̄. Together, this randomness increases
the difficulty in reconstructing x from x̃. The level of difficulty can be adjusted by changing the
hyperparameter pm , the probability in Bern(·|pm ), which controls the proportion of features that will
be masked and corrupted.
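As an illustration, the corruption step of Equation (3) can be written in a few lines. The sketch below is ours (in PyTorch, with a hypothetical function name); it approximates the empirical marginal distribution p̂_{Xj} by shuffling each column within the current batch rather than sampling from all of Du, which is an assumption, not the paper's specification.

```python
import torch


def pretext_generator(x: torch.Tensor, p_m: float):
    """Mask-and-corrupt step of Equation (3) for a batch x of shape (n, d)."""
    n, d = x.shape
    # m_j ~ Bern(p_m), drawn independently for every feature of every sample.
    m = torch.bernoulli(torch.full((n, d), p_m))
    # x_bar: each feature drawn from its empirical marginal distribution,
    # approximated here by independently shuffling every column of the batch.
    x_bar = torch.stack([x[torch.randperm(n), j] for j in range(d)], dim=1)
    # Equation (3): keep unmasked entries, replace masked entries with noise.
    x_tilde = m * x_bar + (1.0 - m) * x
    return m, x_tilde
```

Raising p_m masks more features, which makes the corrupted sample harder to distinguish from a real one and thus makes the pretext tasks below more difficult.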
Following the convention of self-supervised learning, the encoder e first transforms the masked and
corrupted sample x̃ to a representation z, then a pretext predictive model will be introduced to recover
the original sample x from z. Arguably, this is a more challenging task than existing pretext tasks,
such as correcting the rotation of images or recolorizing a grayscale image. A rotated or grayscale
image still contains some information about the original features. In contrast, masking completely
removes some of the features from x and replaces them with a noise sample x̄ of which each feature
may come from a different random sample in Du. The resulting sample x̃ may not contain any information about the missing features, and it may even be hard to identify which features are missing. To solve
such a challenging task, we first divide it into two sub-tasks (pretext tasks):
(1) Mask vector estimation: predict which features have been masked;
(2) Feature vector estimation: predict the values of the features that have been corrupted.
We introduce a separate pretext predictive model for each pretext task. Both models operate on top
of the representation z given by the encoder e and try to estimate m and x collaboratively. The two
models and their functions are,
• Mask vector estimator, sm : Z → [0, 1]d , takes z as input and outputs a vector m̂ to predict
which features of x̃ have been replaced by a noisy counterpart (i.e., m);
• Feature vector estimator, sr : Z → X , takes z as input and returns x̂, an estimate of the original
sample x.
The encoder e and the pretext predictive models (in our case, the two estimators sm and sr ) are
trained jointly in the following optimization problem,
min_{e, sm, sr} E_{x ∼ pX, m ∼ pm, x̃ ∼ gm(x, m)} [ lm(m, m̂) + α · lr(x, x̂) ]    (4)
where m̂ = (sm ◦ e)(x̃) and x̂ = (sr ◦ e)(x̃). The first loss function lm is the sum of the binary
cross-entropy losses for each dimension of the mask vector²:
lm(m, m̂) = −(1/d) Σ_{j=1}^{d} [ mj log((sm ◦ e)j(x̃)) + (1 − mj) log(1 − (sm ◦ e)j(x̃)) ],    (5)
and the second loss function lr is the reconstruction loss,
lr(x, x̂) = (1/d) Σ_{j=1}^{d} (xj − (sr ◦ e)j(x̃))².    (6)
α adjusts the trade-off between the two losses. For categorical variables, we replace the squared error in Equation (6) with a cross-entropy loss. Figure 1 illustrates our entire self-supervised learning framework.
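A minimal PyTorch sketch of this self-supervised stage is given below; the single-layer architectures, the hidden size, and the sigmoid output on sr (which assumes min-max-scaled features in [0, 1], as used in our experiments) are illustrative choices of ours rather than the paper's specification. It reuses pretext_generator from the earlier sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VIMESelf(nn.Module):
    """Encoder e with the two pretext heads s_m and s_r (sizes are illustrative)."""

    def __init__(self, d: int, z_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d, z_dim), nn.ReLU())      # e
        self.mask_est = nn.Sequential(nn.Linear(z_dim, d), nn.Sigmoid())  # s_m
        self.feat_est = nn.Sequential(nn.Linear(z_dim, d), nn.Sigmoid())  # s_r

    def forward(self, x_tilde):
        z = self.encoder(x_tilde)
        return self.mask_est(z), self.feat_est(z)


def self_supervised_loss(model, x, p_m=0.3, alpha=2.0):
    """Batch estimate of the objective in Equation (4): l_m + alpha * l_r."""
    m, x_tilde = pretext_generator(x, p_m)   # corrupt the batch (Equation (3))
    m_hat, x_hat = model(x_tilde)
    l_m = F.binary_cross_entropy(m_hat, m)   # Equation (5), averaged over dimensions
    l_r = F.mse_loss(x_hat, x)               # Equation (6), continuous features
    return l_m + alpha * l_r
```

Minimizing this loss with any optimizer over model.parameters() trains e, sm, and sr jointly; only the encoder is kept for the downstream tasks.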
What has the encoder learned? These two loss functions share the encoder e. It is the only part
we will utilize in the downstream tasks. To understand how the encoder is going to benefit these
downstream tasks, we consider what the encoder must be able to do to solve our pretext tasks. We
make the following intuitive observation: it is important for e to capture the correlation among the
features of x and output some latent representations z that can recover x. In this case, sm can identify
the masked features from the inconsistency between feature values, and sr can impute the masked
features by learning from the correlated non-masked features. For instance, if the value of a feature is
very different from its correlated features, this feature is likely masked and corrupted. We note that
correlations are also learned in other self-supervised learning frameworks, e.g. spatial correlations in
rotated images and autocorrelations between future and previous words. Our framework is novel in
learning the correlations for tabular data whose correlation structure is less obvious than in images or
language. The learned representation that captures the correlation across different parts of the object,
regardless of the object type (e.g. language, image or tabular data), is an informative input for the
various downstream tasks.
We now show how the encoder function e from the previous subsection can be used in semi-supervised
learning. Our framework of semi-supervised learning follows the structure as given in Section 3. Let
²Subscript j represents the j-th element of the vector.
Figure 2: Block diagram of the proposed semi-supervised learning framework on tabular data. For an
unlabeled sample x in Du, (1) the mask generator generates K mask vectors and combines each of them with x to generate the corrupted samples x̃k, k = 1, ..., K, via the pretext generator (gm),
(2) Encoder (e) transforms these corrupted samples into latent representations zk , k = 1, ..., K as K
different augmented samples, (3) Predictive model is trained by minimizing the supervised loss on
(x, y) in Dl and the consistency loss on the augmented samples (zk , k = 1, ..., K) jointly. The block
diagram of the proposed self- and semi-supervised learning frameworks on exemplary tabular data
can be found in the Supplementary Materials (Figure 2).
fe = f ◦ e and ŷ = fe (x). We train the predictive model f by minimizing the objective function,
L_final = Ls + β · Lu.    (7)
The supervised loss Ls is given by
Ls = E_{(x,y) ∼ pXY} [ ls(y, fe(x)) ],    (8)
where ls is the standard supervised loss function, e.g. mean squared error for regression or categorical
cross-entropy for classification. The unsupervised (consistency) loss Lu is defined between original
samples (x) and their reconstructions from corrupted and masked samples (x̃),
Lu = E_{x ∼ pX, m ∼ pm, x̃ ∼ gm(x, m)} [ (fe(x̃) − fe(x))² ].    (9)
Our consistency loss is inspired by the idea of consistency regularization: encouraging the predictive model f to return similar output distributions when its inputs are perturbed. However, the perturbation in our framework is learned through our self-supervised framework, while in previous works the perturbation comes from a manually chosen distribution, such as rotation.
For a fixed sample x, the inner expectation in Equation (9) is taken with respect to pm and gm (x, m)
and could be interpreted as the variance of the predictions of corrupted and masked samples. β
is another hyper-parameter to adjust the supervised loss Ls and the consistency loss Lu . In each
iteration of training, for each sample x ∈ Du in the batch, we create K augmented samples x̃1 , ...,
x̃K by repeating the operation in Equation (3) K times. Every time the sample x ∈ Du is used in a
batch, we recreate these augmented samples. The stochastic approximation of Lu is given as
L̂u = (1/(Nb K)) Σ_{i=1}^{Nb} Σ_{k=1}^{K} (fe(x̃i,k) − fe(xi))² = (1/(Nb K)) Σ_{i=1}^{Nb} Σ_{k=1}^{K} (f(zi,k) − f(zi))²    (10)
where Nb is the batch size. During training, the predictive model f is regularized to make similar
predictions on zi and zi,k , k = 1, ..., K. After training f , the output for a new test sample xt is given
by ŷ = fe (xt ). Figure 2 illustrates the entire procedure of the proposed semi-supervised framework
on tabular data with a pre-trained encoder.
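The objective in Equations (7)-(10) can be sketched as follows, again in PyTorch and reusing pretext_generator from the earlier sketch. The use of cross-entropy for ls and the values of K, p_m, and beta are illustrative assumptions, not the paper's prescribed settings; per the text, the encoder is pre-trained and the predictor f is trained by minimizing this loss.

```python
import torch
import torch.nn.functional as F


def semi_supervised_loss(encoder, predictor, x_lab, y_lab, x_unlab,
                         K=3, p_m=0.3, beta=1.0):
    """One-batch estimate of Equation (7): L_final = L_s + beta * L_u.

    encoder   : the encoder e pre-trained in the self-supervised stage.
    predictor : the predictive model f, applied to representations z = e(x),
                returning class logits here.
    """
    # Supervised loss L_s (Equation (8)); cross-entropy for classification.
    l_s = F.cross_entropy(predictor(encoder(x_lab)), y_lab)

    # Consistency loss L_u (Equations (9)-(10)): K corrupted versions per sample,
    # averaged over the batch and output dimensions.
    pred = predictor(encoder(x_unlab))            # f(z_i)
    l_u = 0.0
    for _ in range(K):
        _, x_tilde = pretext_generator(x_unlab, p_m)
        pred_k = predictor(encoder(x_tilde))      # f(z_{i,k})
        l_u = l_u + ((pred_k - pred) ** 2).mean()
    l_u = l_u / K

    return l_s + beta * l_u
```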
5 Experiments
In this section, we conduct a series of experiments to demonstrate the efficacy of our framework
(VIME) on several tabular datasets from different application domains, including genomics and
clinical data. We use a min-max scaler to normalize the data between 0 and 1. For self-supervised
learning, we compare VIME against two benchmarks, Denoising auto-encoder (DAE) [21] and
Context Encoder [22]. For semi-supervised learning, we use the data augmentation method MixUp
[10] as the main benchmark. We exclude self- and semi-supervised learning benchmarks that
are applicable only to image or language data. As a baseline, we also include supervised learning benchmarks. Additional results with more baselines can be found in the Supplementary Materials. In the experiments, self- and semi-supervised learning methods use both labeled data and unlabeled data, while the supervised learning methods only use the labeled data. Implementation details and sensitivity analyses on the three hyperparameters (pm, α, β) can be found in
the Supplementary Materials (Section 5 & 6). The implementation of VIME can be found at
https://bitbucket.org/mvdschaar/mlforhealthlabpub/src/master/alg/vime/ and at
https://github.com/jsyoon0823/VIME.
In this subsection, we evaluate the methods on a large genomics dataset from UK Biobank consisting
of around 400,000 individuals’ genomics information (SNPs) and 6 corresponding blood cell traits:
(1) Mean Reticulocyte Volume (MRV), (2) Mean Platelet Volume (MPV), (3) Mean Cell Hemoglobin
(MCH), (4) Reticulocyte Fraction of Red Cells (RET), (5) Plateletcrit (PCT), and (6) Monocyte
Percentage of White Cells (MONO). The features of the dataset consist of around 700 SNPs (after
the standard p-value filtering process), where each SNP, taking a value in {0, 1, 2}, is treated as a
categorical variable (with three categories). Here, we have 6 different blood cell traits to predict, and
we treat each of them as an independent prediction task (selected SNPs are different across different
blood cell traits). Detailed data descriptions are provided in the Supplementary Materials (Section 2).
Note that all the variables are categorical features.
To test the effectiveness of self- and semi-supervised learning in the small labeled data setting, VIME
and the benchmarks are tasked to predict the 6 blood cell traits as we gradually increase the number of labeled data points from 1,000 to 100,000 samples, using the remaining data as unlabeled data
(more than 300,000 samples). We use a linear model (Elastic Net [34]) as the predictive model due to its superior performance in comparison to non-linear models such as multi-layer perceptrons and random forests [35] on genomics datasets.
Figure 3: MSE performance on 6 different blood cell traits across different sizes of the labeled genomics dataset (lower is better). Note that the x-axis is on a log scale.
In Figure 3, we show the MSE performance (y-axis) against the number of labeled data points (x-axis,
in log scale) increasing from 1,000 to 10,000³. The proposed model (VIME) outperforms all the benchmarks, including the purely supervised method ElasticNet, the self-supervised method Context Encoder, and the semi-supervised method MixUp. In fact, in many cases VIME shows similar performance to the benchmarks even when it has access to only half as many labeled data points.
³The performance for the 10,000 to 100,000 range can be found in the Supplementary Materials (Section 3).
In this subsection, we evaluate the methods on clinical data, using the UK and US prostate cancer
datasets (from Prostate Cancer UK and SEER datasets, respectively). The features consist of patients’
clinical information (e.g. age, grade, stage, Gleason scores), 28 features in total. We predict two possible treatments of UK prostate cancer patients: (1) hormone therapy (whether the patient received hormone therapy) and (2) radical therapy (whether the patient received radical therapy). Both tasks are binary classification. In the UK prostate cancer dataset, we have only around 10,000 labeled patient samples. The US prostate cancer dataset contains more than 200,000 unlabeled patient samples, twenty times larger than the labeled UK dataset. We use 50% of the UK dataset (as the labeled data)
and the entire US dataset (as the unlabeled data) for training, with the remainder of the UK data being
used as the testing set. We also test three popular supervised learning models: Logistic Regression, a
2-layer Multi-layer Perceptron and XGBoost.
Table 1 shows that VIME results in the best prediction performance, outperforming the benchmarks.
More importantly, VIME is the only self- or semi-supervised learning framework that significantly
outperforms supervised learning models. These results shed light on the unique advantage of using
VIME in leveraging a large unlabeled tabular dataset (e.g. the US dataset) to strengthen a model’s
predictive power. Here we also demonstrate that VIME can perform well even when there exists a
distribution shift between the UK labeled data and the US unlabeled data (see the Supplementary
Materials (Section 2) for further details).
Table 1: AUROC performance of patient treatment predictions on hormone and radical therapy (higher is better). (Mean ± standard deviation computed over 10 runs.)
Type      Models                 Hormone          Radical
SL        Logistic Regression    .8371 ± .0013    .8036 ± .0015
SL        2-layer Perceptron     .8351 ± .0023    .8146 ± .0022
SL        XGBoost                .8423 ± .0018    .8166 ± .0011
Self-SL   DAE                    .8335 ± .0049    .8144 ± .0061
Self-SL   Context Encoder        .8308 ± .0051    .8134 ± .0066
Semi-SL   MixUp                  .8448 ± .0021    .8214 ± .0029
          VIME                   .8602 ± .0029    .8391 ± .0021
To further verify the generalizability and allow for reproducibility of our results, we compare VIME
with the benchmarks using three public tabular datasets: MNIST (interpreted as tabular data with 784 features), UCI Income, and UCI Blog. We use 10% of the data as labeled data and the remaining 90% as unlabeled data. Prediction accuracy on a separate testing set is used as the
metric for all three datasets. As shown in Table 2 (Type - Supervised models, Self-supervised models,
Semi-supervised models and VIME), VIME achieves the best accuracy regardless of the application
domains. These results further confirm the superiority of VIME in a diverse range of tabular datasets.
In this section, we conduct an ablation study to analyze the performance gain of each component in
VIME on the tabular datasets introduced in Section 5.3. We define three variants of VIME:
• Supervised only: Exclude both self- and semi-supervised learning parts (i.e. 2-layer perceptron)
• Semi-SL only: Exclude self-supervised learning part (i.e. remove the encoder in Figure 2)
• Self-SL only: Exclude the semi-supervised learning part (i.e. β = 0). More specifically, we first train the encoder via self-supervised learning. Then, we train the predictive model with the loss function in Equation (7) with β = 0 (i.e. utilizing only the labeled data).
Table 2: Prediction accuracy of the methods on UCI Income, MNIST and UCI Blog datasets (Mean
± Std are computed over 10 runs).
Type                      Models                Income           MNIST            Blog
Supervised models         Logistic Regression   .8425 ± .0013    .8989 ± .0023    .6915 ± .0029
Supervised models         2-layer Perceptron    .8520 ± .0023    .9387 ± .0014    .7972 ± .0058
Supervised models         XGBoost               .8623 ± .0021    .9413 ± .0026    .7975 ± .0030
Self-supervised models    DAE                   .8578 ± .0028    .9431 ± .0032    .8001 ± .0039
Self-supervised models    Context Encoder       .8611 ± .0027    .9455 ± .0048    .8033 ± .0051
Semi-supervised models    MixUp                 .8701 ± .0021    .9461 ± .0023    .8088 ± .0038
Variants of VIME          Supervised only       .8520 ± .0023    .9387 ± .0014    .7972 ± .0058
Variants of VIME          Self-SL only          .8599 ± .0026    .9406 ± .0019    .8147 ± .0037
Variants of VIME          Semi-SL only          .8771 ± .0031    .9548 ± .0023    .8361 ± .0041
                          VIME                  .8804 ± .0030    .9577 ± .0022    .8389 ± .0044
Table 2 (Type - Variants of VIME and VIME) shows that both Self-SL only and Semi-SL only
show performance gains compared with Supervised only, and VIME is always better than its
variants. Every component in VIME can improve the performance of a predictive model, and the
best performance is achieved when they work collaboratively in our unified framework. We note
that Self-SL only leads to a larger performance drop than Semi-SL only because in the former the
predictive model is trained solely on a small labeled dataset without the unsupervised loss function
Lu , while in the latter the predictive model is trained via minimizing both losses but without the
encoder. Additional ablation study can be found in the Supplementary Materials.
6 Discussions: Why is the proposed model (VIME) needed for tabular data?
Image and tabular data are very different. The spatial correlations between pixels in images or the
sequential correlations between words in text data are well-known and consistent across different
datasets. By contrast, the correlation structure among features in tabular data is unknown and varies
across different datasets. In other words, there is no “common” correlation structure in tabular data
(unlike in image and text data). This makes the self- and semi-supervised learning in tabular data
more challenging. Note that methods that are promising in the image domain do not guarantee favorable results in the tabular domain (and vice versa). Also, most augmentations and pretext tasks used for image data are not applicable to tabular data, because they directly utilize the spatial relationships of the image for augmentation (e.g., rotation) and pretext tasks (e.g., jigsaw puzzles and colorization). To transfer the
successes of self- and semi-supervised learning from image to tabular domains, proposing applicable
and proper pretext tasks and augmentations for tabular data (our main novelty) is critical. Note that
better augmentations and pretext tasks can significantly improve self- and semi-supervised learning
performances.
Broader Impact
Tabular data is the most common data type in the real world. Most databases include tabular data such as demographic information in medical and finance datasets and SNPs in genomic datasets. However, the tremendous successes of deep learning (especially in the image and language domains) have not yet been fully extended to the tabular domain. Still, in the tabular domain, ensembles of decision trees achieve state-of-the-art performance. If we can efficiently extend the successful
deep learning methodologies from images and language to tabular data, the application of machine
learning in the real world can be greatly extended. This paper takes a step in this direction for self-
and semi-supervised learning frameworks which recently have achieved significant successes in
images and language. In addition, the proposed tabular data augmentation and representation learning
methodologies can be utilized in various fields such as tabular data encoding, balancing the labels of
tabular data, and missing data imputation.
Acknowledgements and Funding Sources
The authors would like to thank the reviewers for their helpful comments. This work was supported
by the National Science Foundation (NSF grant 1722516), the US Office of Naval Research (ONR),
and GlaxoSmithKline (GSK).
References
[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 770–778, 2016.
[2] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of
the IEEE international conference on computer vision, pages 2961–2969, 2017.
[3] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information
processing systems, pages 5998–6008, 2017.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-
scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern
recognition, pages 248–255. IEEE, 2009.
[5] Mark Peplow. The 100 000 Genomes Project. BMJ, 353:i1757, 2016.
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
2018.
[7] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by
predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
[8] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving
jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
[9] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European
conference on computer vision, pages 649–666. Springer, 2016.
[10] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond
empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
[11] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas
Brox. Discriminative unsupervised feature learning with exemplar convolutional neural net-
works. IEEE transactions on pattern analysis and machine intelligence, 38(9):1734–1747,
2015.
[12] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning
by context prediction. In Proceedings of the IEEE International Conference on Computer
Vision, pages 1422–1430, 2015.
[13] Samuli Laine, Tero Karras, Jaakko Lehtinen, and Timo Aila. High-quality self-supervised deep
image denoising. In Advances in Neural Information Processing Systems, pages 6968–6978,
2019.
[14] Yue Wang and Justin M Solomon. Prnet: Self-supervised learning for partial-to-partial registra-
tion. In Advances in Neural Information Processing Systems, pages 8812–8824, 2019.
[15] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive
predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[16] Olivier J Hénaff, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient
image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.
[17] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint
arXiv:1906.05849, 2019.
[18] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework
for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
[19] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for
unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
[20] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum
contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[21] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and
composing robust features with denoising autoencoders. In Proceedings of the 25th international
conference on Machine learning, pages 1096–1103. ACM, 2008.
[22] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context
encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 2536–2544, 2016.
[23] Sercan O Arik and Tomas Pfister. Tabnet: Attentive interpretable tabular learning. arXiv
preprint arXiv:1908.07442, 2019.
[24] Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. Tabert: Pretraining for
joint understanding of textual and tabular data. arXiv preprint arXiv:2005.08314, 2020.
[25] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for
deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3,
page 2, 2013.
[26] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transfor-
mations and perturbations for deep semi-supervised learning. In Advances in Neural Information
Processing Systems, pages 1163–1171, 2016.
[27] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged
consistency targets improve semi-supervised deep learning results. In Advances in neural
information processing systems, pages 1195–1204, 2017.
[28] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training:
a regularization method for supervised and semi-supervised learning. IEEE transactions on
pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
[29] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and
Colin Raffel. Mixmatch: A holistic approach to semi-supervised learning. arXiv preprint
arXiv:1905.02249, 2019.
[30] David Berthelot, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang,
and Colin Raffel. Remixmatch: Semi-supervised learning with distribution alignment and
augmentation anchoring. arXiv preprint arXiv:1911.09785, 2019.
[31] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using gaussian
fields and harmonic functions. In Proceedings of the 20th International conference on Machine
learning (ICML-03), pages 912–919, 2003.
[32] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional
networks. arXiv preprint arXiv:1609.02907, 2016.
[33] Otilia Stretcu, Krishnamurthy Viswanathan, Dana Movshovitz-Attias, Emmanouil Platanios,
Sujith Ravi, and Andrew Tomkins. Graph agreement models for semi-supervised learning. In
Advances in Neural Information Processing Systems, pages 8710–8720, 2019.
[34] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of
the royal statistical society: series B (statistical methodology), 67(2):301–320, 2005.
[35] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.