VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain
Abstract
Self- and semi-supervised learning frameworks have made significant progress in
training machine learning models with limited labeled data in image and language
domains. These methods heavily rely on the unique structure in the domain datasets
(such as spatial relationships in images or semantic relationships in language). They
are not adaptable to general tabular data which does not have the same explicit
structure as image and language data. In this paper, we fill this gap by proposing
novel self- and semi-supervised learning frameworks for tabular data, which we
refer to collectively as VIME (Value Imputation and Mask Estimation). We create
a novel pretext task of estimating mask vectors from corrupted tabular data in
addition to the reconstruction pretext task for self-supervised learning. We also
introduce a novel tabular data augmentation method for self- and semi-supervised
learning frameworks. In experiments, we evaluate the proposed framework in
multiple tabular datasets from various application domains, such as genomics and
clinical data. VIME exceeds the state-of-the-art performance of the existing baseline methods.
1 Introduction
Tremendous successes have been achieved in a variety of applications (such as image classification [1],
object detection [2], and language translation [3]) with deep learning models via supervised learning
on large labeled datasets such as ImageNet [4]. Unfortunately, collecting sufficiently large labeled
datasets is expensive and even impossible in several domains (such as medical datasets concerned
with a particularly rare disease). In these settings, however, there is often a wealth of unlabeled data
available - datasets are often collected from a large population, but target labels are only available
for a small group of people. The 100,000 Genomes project [5], for instance, sequenced 100,000
genomes from around 85,000 NHS patients affected by a rare disease, such as cancer. By definition, a rare disease occurs in fewer than 1 in 2,000 people. Datasets like these present huge opportunities
for self- and semi-supervised learning algorithms, which can leverage the unlabeled data to further
improve the performance of a predictive model.
Unfortunately, existing self- and semi-supervised learning algorithms are not effective for tabular
data¹ because they heavily rely on the spatial or semantic structure of image or language data. A
¹Tabular data is a database that is structured in a tabular form. It arranges data elements in vertical columns (features) and horizontal rows (samples).
standard self-supervised learning framework designs a (set of) pretext task(s) to learn informative
representations from the raw input features. For the language domain, BERT introduces 4 different
pretext tasks (e.g. predicting future words from previous words) to learn representations of the
language data [6]. In the image domain, rotation [7], jigsaw puzzle [8], and colorization [9] can be
utilized as pretext tasks to learn representations of the images. Standard semi-supervised learning
methods also suffer from the same problem, since the regularizers they use for the predictive model
are based on some prior knowledge of these data structures. For example, the consistency regularizer
encourages the predictive model to have the same output distribution on a sample and its augmented
variants, e.g. an image and its rotated variants [7], or two images and their convex combination(s) [10].
The notion of rotation simply does not exist in tabular data. Moreover, in many settings, variables
are often categorical, and do not admit meaningful convex combinations. Even in a setting where all
variables are continuous, there is no guarantee that the data manifold is convex and as such taking
convex combinations will either generate out-of-distribution samples (therefore degrading model
performances) or be restricted to generating samples that are very close to real samples (limiting the
effectiveness of the data augmentation); for more details, see the Supplementary Materials (Section 4).
Contribution: In this paper, we propose novel self- and semi-supervised learning frameworks for
tabular data. For self-supervised learning, we introduce a novel pretext task, mask vector estimation, in addition to feature vector estimation. To solve these pretext tasks, an encoder function learns to
construct informative representations from the raw features in the unlabeled data. For semi-supervised
learning, we introduce a novel tabular data augmentation scheme. We use the trained encoder to
generate multiple augmented samples for each data point by masking each point using several
different masks and then imputing the corrupted values for each masked data point. Finally, we
propose a systematic self- and semi-supervised learning framework for tabular data, VIME (Value
Imputation and Mask Estimation), that combines our ideas to produce state-of-the-art performances
on several tabular datasets with a few labeled samples, from various domains.
2 Related Works
Self-supervised learning (Self-SL) frameworks are representation learning methods that use unlabeled data. They can be categorized into two types: those using pretext task(s) and those using contrastive learning. Most existing
works with pretext tasks are appropriate only for images or natural language: (i) surrogate classes
prediction (scaling and translation) [11], (ii) rotation degree predictions [7], (iii) colorization [9], (iv)
relative position of patches estimation [12], (v) jigsaw puzzle solving [8], (vi) image denoising [13],
(vii) partial-to-partial registration [14], and (viii) next- and previous-word prediction [6]. Most existing works with contrastive learning are also applicable only to images or natural language due to their data augmentation schemes and their use of temporal and spatial relationships for defining similarity:
(i) contrastive predictive coding [15, 16], (ii) contrastive multi-view coding [17], (iii) SimCLR [18],
(iv) momentum contrast [19, 20].
There is some existing work on self-supervised learning which can be applied to tabular data. In
Denoising auto-encoder [21], the pretext task is to recover the original sample from a corrupted
sample. In Context Encoder [22], the pretext task is to reconstruct the original sample from both the
corrupted sample and the mask vector. The pretext task for self-supervised learning in TabNet [23]
and TaBERT [24] is also recovering corrupted tabular data.
In this paper, we propose a new pretext task: recovering the mask vector, in addition to the original sample, with a novel corrupted sample generation scheme. Also, we propose a novel tabular data
augmentation scheme that can be combined with various contrastive learning frameworks to extend
the self-supervised learning to tabular domains.
Semi-supervised learning (Semi-SL) frameworks can be categorized into two types: entropy minimization and consistency regularization. Entropy minimization encourages a classifier to output low
entropy predictions on unlabeled data. For instance, [25] constructs hard labels from high-confidence
predictions on unlabeled data, and trains the network using these pseudo-labels together with labeled
data in a supervised way. Consistency regularization encourages some sort of consistency between a
sample and some stochastically altered version of itself. Π-model [26] uses an L2 loss to encourage
consistency between predictions. Mean teacher [27] uses an L2 loss to encourage consistency between
the intermediate representations. Virtual Adversarial Training (VAT) [28] encourages prediction
consistency by minimizing the maximum difference in predictions between a sample and multiple
augmented versions. MixMatch [29] and ReMixMatch [30] combine entropy minimization with
consistency regularization in one unified framework with MixUp [10] as the data augmentation
method. There is a series of interesting works on graph-based semi-supervised learning [31, 32, 33]
which consider a special case of network data where samples are connected by given edges, e.g. a citation network where an article is connected with its citations. Here, we introduce a novel data
augmentation method for general tabular data which can be combined with various semi-supervised
learning frameworks to train a predictive model in a semi-supervised way.
3 Problem Formulation
In this section, we introduce the general formulation of self- and semi-supervised learning. Suppose
we have a small labeled dataset Dl = {(xi, yi)}_{i=1}^{Nl} and a large unlabeled dataset Du = {xi}_{i=Nl+1}^{Nl+Nu}, where Nu is typically much larger than Nl.
Self-supervised learning aims to learn informative representations from unlabeled data. In this
subsection, we focus on self-supervised learning with various self-supervised/pretext tasks for a
pretext model to solve. These tasks are set to be challenging but highly relevant to the downstream
tasks that we attempt to solve. Ideally, the pretext model will extract some useful information from
the raw data in the process of solving the pretext tasks. Then the extracted information can be utilized
by the predictive model f in the downstream tasks. In general, self-supervised learning constructs an
encoder function e : X → Z that takes a sample x ∈ X and returns an informative representation
z = e(x) ∈ Z. The representation z is optimized to solve a pretext task defined with a pseudo-label
ys ∈ Ys and a self-supervised loss function lss . For example, the pretext task can be predicting the
rotation degree of some rotated image in the raw dataset, where ys is the true rotation degree and
lss is the squared difference between the predicted rotation degree and ys . We define the pretext
predictive model as h : Z → Ys , which is trained jointly with the encoder function e by minimizing
the expected self-supervised loss function lss as follows,
min_{e,h} E_{(xs, ys) ∼ pXs,Ys} [ lss(ys, (h ◦ e)(xs)) ]    (1)
where pXs ,Ys is a pretext distribution that generates pseudo-labeled samples (xs , ys ) for training the
encoder e and pretext predictive model h. Note that we have sufficient samples to approximate the
objective function above since for each input sample in Du , we can generate a pretext sample (xs , ys )
for free, e.g. rotating an image xi to create xs and taking the rotation degree as the label ys . After
training, the encoder function e can be used to extract better data representations from raw data for
solving various downstream tasks. Note that in settings where the downstream task (and a loss for it)
are known in advance, the encoder can be trained jointly with the downstream task’s model.
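To make the abstract objective in Equation (1) concrete, the following is a minimal sketch of one joint update of the encoder e and the pretext predictive model h on a pseudo-labeled batch. It is written in PyTorch with function and argument names of our own choosing (the paper does not prescribe this decomposition), and it assumes the optimizer was constructed over the parameters of both modules.

```python
import torch


def pretext_training_step(encoder, pretext_head, x_s, y_s, loss_fn, optimizer):
    """One gradient step on the objective in Equation (1):
    minimize l_ss(y_s, (h o e)(x_s)) jointly over e and h.

    encoder      : e, maps raw features x_s to representations z.
    pretext_head : h, maps z to the pseudo-label space Y_s.
    (x_s, y_s)   : a pseudo-labeled batch drawn from the pretext distribution.
    loss_fn      : the self-supervised loss l_ss.
    optimizer    : assumed to hold the parameters of both encoder and head.
    """
    optimizer.zero_grad()
    z = encoder(x_s)                        # z = e(x_s)
    loss = loss_fn(pretext_head(z), y_s)    # l_ss(y_s, (h o e)(x_s))
    loss.backward()                         # back-propagate through h and e jointly
    optimizer.step()
    return loss.item()
```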
Semi-supervised learning optimizes the predictive model f by minimizing the supervised loss
function jointly with some unsupervised loss function defined over the output space Y. Formally,
semi-supervised learning is formulated as an optimization problem as follows,
min_f E_{(x,y) ∼ pXY} [ l(y, f(x)) ] + β · E_{x ∼ pX, x′ ∼ p̃X(x′|x)} [ lu(f(x), f(x′)) ]    (2)
Figure 1: Block diagram of the proposed self-supervised learning framework on tabular data. (1)
Mask generator generates binary mask vector (m) which is combined with an input sample (x) to
create a masked and corrupted sample (x̃), (2) Encoder (e) transforms x̃ into a latent representation
(z), (3) Mask vector estimator (sm ) is trained by minimizing the cross-entropy loss with m, feature
vector estimator (sr ) is trained by minimizing the reconstruction loss with x, (4) Encoder (e) is trained
by minimizing the weighted sum of both losses.
The first term is estimated using the small labeled dataset Dl, while the second term is estimated using all input features in Du. The
unsupervised loss function (lu ) is often inspired by some prior knowledge of the downstream task. For
example, consistency regularization encourages the model f to produce the same output distribution
when its inputs are perturbed (x′).
4 Proposed Method: VIME
We introduce two pretext tasks: feature vector estimation and mask vector estimation. Our goal is to
optimize a pretext model to recover an input sample (a feature vector) from its corrupted variant, at
the same time as estimating the mask vector that has been applied to the sample.
In our framework, the two pretext tasks share a single pretext distribution pXs ,Ys . First, a mask
vector generator outputs a binary mask vector m = [m1, ..., md]ᵀ ∈ {0, 1}^d, where each mj is randomly sampled from a Bernoulli distribution with probability pm (i.e. pm = ∏_{j=1}^{d} Bern(mj | pm)). Then a
pretext generator gm : X × {0, 1}d → X takes a sample x from Du and a mask vector m as input,
and generates a masked sample x̃. The generating process of x̃ is given by
x̃ = gm(x, m) = m ⊙ x̄ + (1 − m) ⊙ x    (3)
where ⊙ denotes element-wise multiplication and the j-th feature of x̄ is sampled from the empirical distribution p̂_{Xj} = (1/Nu) Σ_{i=Nl+1}^{Nl+Nu} δ(xj = xi,j), where xi,j is the j-th feature of the i-th sample in Du (i.e. the empirical marginal distribution of each feature); see Figure 3 in the Supplementary Materials for further details.
process in Equation (3) ensures the corrupted sample x̃ is not only tabular but also similar to the
samples in Du. Compared with standard sample corruption approaches, e.g. adding Gaussian noise or replacing masked features with zeros, our approach generates an x̃ that is more difficult to distinguish from x. This difficulty is crucial for self-supervised learning, on which we will elaborate
more in the following sections.
There are two sources of randomness in our pretext distribution pXs,Ys. Explicitly, m is
a random vector sampled from a Bernoulli distribution. Implicitly, the pretext generator gm is
also a stochastic function whose randomness comes from x̄. Together, this randomness increases
the difficulty in reconstructing x from x̃. The level of difficulty can be adjusted by changing the
hyperparameter pm , the probability in Bern(·|pm ), which controls the proportion of features that will
be masked and corrupted.
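As an illustration, the corruption step of Equation (3) can be written in a few lines. The sketch below is ours (in PyTorch, with a hypothetical function name); it approximates the empirical marginal distribution p̂_{Xj} by shuffling each column within the current batch rather than sampling from all of Du, which is an assumption, not the paper's specification.

```python
import torch


def pretext_generator(x: torch.Tensor, p_m: float):
    """Mask-and-corrupt step of Equation (3) for a batch x of shape (n, d)."""
    n, d = x.shape
    # m_j ~ Bern(p_m), drawn independently for every feature of every sample.
    m = torch.bernoulli(torch.full((n, d), p_m))
    # x_bar: each feature drawn from its empirical marginal distribution,
    # approximated here by independently shuffling every column of the batch.
    x_bar = torch.stack([x[torch.randperm(n), j] for j in range(d)], dim=1)
    # Equation (3): keep unmasked entries, replace masked entries with noise.
    x_tilde = m * x_bar + (1.0 - m) * x
    return m, x_tilde
```

Raising p_m masks more features, which makes the corrupted sample harder to distinguish from a real one and thus makes the pretext tasks below more difficult.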
Following the convention of self-supervised learning, the encoder e first transforms the masked and
corrupted sample x̃ to a representation z, then a pretext predictive model will be introduced to recover
the original sample x from z. Arguably, this is a more challenging task than existing pretext tasks,
such as correcting the rotation of images or recolorizing a grayscale image. A rotated or grayscale
image still contains some information about the original features. In contrast, masking completely
removes some of the features from x and replaces them with a noise sample x̄ of which each feature
may come from a different random sample in Du. The resulting sample x̃ may not contain any information about the missing features, and it may even be hard to identify which features are missing. To solve
such a challenging task, we first divide it into two sub-tasks (pretext tasks):
(1) Mask vector estimation: predict which features have been masked;
(2) Feature vector estimation: predict the values of the features that have been corrupted.
We introduce a separate pretext predictive model for each pretext task. Both models operate on top
of the representation z given by the encoder e and try to estimate m and x collaboratively. The two
models and their functions are,
• Mask vector estimator, sm : Z → [0, 1]d , takes z as input and outputs a vector m̂ to predict
which features of x̃ have been replaced by a noisy counterpart (i.e., m);
• Feature vector estimator, sr : Z → X , takes z as input and returns x̂, an estimate of the original
sample x.
The encoder e and the pretext predictive models (in our case, the two estimators sm and sr ) are
trained jointly in the following optimization problem,
min_{e, sm, sr} E_{x ∼ pX, m ∼ pm, x̃ ∼ gm(x, m)} [ lm(m, m̂) + α · lr(x, x̂) ]    (4)
where m̂ = (sm ◦ e)(x̃) and x̂ = (sr ◦ e)(x̃). The first loss function lm is the sum of the binary
cross-entropy losses for each dimension of the mask vector²:
lm(m, m̂) = −(1/d) Σ_{j=1}^{d} [ mj log((sm ◦ e)j(x̃)) + (1 − mj) log(1 − (sm ◦ e)j(x̃)) ],    (5)
and the second loss function lr is the reconstruction loss,
lr(x, x̂) = (1/d) Σ_{j=1}^{d} (xj − (sr ◦ e)j(x̃))².    (6)
α adjusts the trade-off between the two losses. For categorical variables, we replace the squared error in Equation (6) with a cross-entropy loss. Figure 1 illustrates our entire self-supervised learning framework.
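A minimal PyTorch sketch of this self-supervised stage is given below; the single-layer architectures, the hidden size, and the sigmoid output on sr (which assumes min-max-scaled features in [0, 1], as used in our experiments) are illustrative choices of ours rather than the paper's specification. It reuses pretext_generator from the earlier sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VIMESelf(nn.Module):
    """Encoder e with the two pretext heads s_m and s_r (sizes are illustrative)."""

    def __init__(self, d: int, z_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d, z_dim), nn.ReLU())      # e
        self.mask_est = nn.Sequential(nn.Linear(z_dim, d), nn.Sigmoid())  # s_m
        self.feat_est = nn.Sequential(nn.Linear(z_dim, d), nn.Sigmoid())  # s_r

    def forward(self, x_tilde):
        z = self.encoder(x_tilde)
        return self.mask_est(z), self.feat_est(z)


def self_supervised_loss(model, x, p_m=0.3, alpha=2.0):
    """Batch estimate of the objective in Equation (4): l_m + alpha * l_r."""
    m, x_tilde = pretext_generator(x, p_m)   # corrupt the batch (Equation (3))
    m_hat, x_hat = model(x_tilde)
    l_m = F.binary_cross_entropy(m_hat, m)   # Equation (5), averaged over dimensions
    l_r = F.mse_loss(x_hat, x)               # Equation (6), continuous features
    return l_m + alpha * l_r
```

Minimizing this loss with any optimizer over model.parameters() trains e, sm, and sr jointly; only the encoder is kept for the downstream tasks.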
What has the encoder learned? These two loss functions share the encoder e. It is the only part
we will utilize in the downstream tasks. To understand how the encoder is going to benefit these
downstream tasks, we consider what the encoder must be able to do to solve our pretext tasks. We
make the following intuitive observation: it is important for e to capture the correlation among the
features of x and output some latent representations z that can recover x. In this case, sm can identify
the masked features from the inconsistency between feature values, and sr can impute the masked
features by learning from the correlated non-masked features. For instance, if the value of a feature is
very different from its correlated features, this feature is likely masked and corrupted. We note that
correlations are also learned in other self-supervised learning frameworks, e.g. spatial correlations in
rotated images and autocorrelations between future and previous words. Our framework is novel in
learning the correlations for tabular data whose correlation structure is less obvious than in images or
language. The learned representation that captures the correlation across different parts of the object,
regardless of the object type (e.g. language, image or tabular data), is an informative input for the
various downstream tasks.
We now show how the encoder function e from the previous subsection can be used in semi-supervised
learning. Our framework of semi-supervised learning follows the structure as given in Section 3. Let
²Subscript j represents the j-th element of the vector.
Figure 2: Block diagram of the proposed semi-supervised learning framework on tabular data. For an
unlabeled sample x in Du, (1) the mask generator generates K mask vectors and combines each of them with x to generate the corrupted samples x̃k, k = 1, ..., K, via the pretext generator (gm),
(2) Encoder (e) transforms these corrupted samples into latent representations zk , k = 1, ..., K as K
different augmented samples, (3) Predictive model is trained by minimizing the supervised loss on
(x, y) in Dl and the consistency loss on the augmented samples (zk , k = 1, ..., K) jointly. The block
diagram of the proposed self- and semi-supervised learning frameworks on exemplary tabular data
can be found in the Supplementary Materials (Figure 2).
fe = f ◦ e and ŷ = fe (x). We train the predictive model f by minimizing the objective function,
L_final = Ls + β · Lu.    (7)
The supervised loss Ls is given by
Ls = E_{(x,y) ∼ pXY} [ ls(y, fe(x)) ],    (8)
where ls is the standard supervised loss function, e.g. mean squared error for regression or categorical
cross-entropy for classification. The unsupervised (consistency) loss Lu is defined between original
samples (x) and their reconstructions from corrupted and masked samples (x̃),
Lu = E_{x ∼ pX, m ∼ pm, x̃ ∼ gm(x, m)} [ (fe(x̃) − fe(x))² ].    (9)
Our consistency loss is inspired by the idea of consistency regularization: encouraging the predictive model f to return similar output distributions when its inputs are perturbed. However, the perturbation in our framework is learned through our self-supervised framework, while in previous works the perturbation comes from a manually chosen distribution, such as rotation.
For a fixed sample x, the inner expectation in Equation (9) is taken with respect to pm and gm (x, m)
and could be interpreted as the variance of the predictions of corrupted and masked samples. β
is another hyper-parameter to adjust the supervised loss Ls and the consistency loss Lu . In each
iteration of training, for each sample x ∈ Du in the batch, we create K augmented samples x̃1 , ...,
x̃K by repeating the operation in Equation (3) K times. Every time the sample x ∈ Du is used in a
batch, we recreate these augmented samples. The stochastic approximation of Lu is given as
L̂u = (1/(Nb K)) Σ_{i=1}^{Nb} Σ_{k=1}^{K} (fe(x̃i,k) − fe(xi))² = (1/(Nb K)) Σ_{i=1}^{Nb} Σ_{k=1}^{K} (f(zi,k) − f(zi))²    (10)
where Nb is the batch size. During training, the predictive model f is regularized to make similar
predictions on zi and zi,k , k = 1, ..., K. After training f , the output for a new test sample xt is given
by ŷ = fe (xt ). Figure 2 illustrates the entire procedure of the proposed semi-supervised framework
on tabular data with a pre-trained encoder.
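The objective in Equations (7)-(10) can be sketched as follows, again in PyTorch and reusing pretext_generator from the earlier sketch. The use of cross-entropy for ls and the values of K, p_m, and beta are illustrative assumptions, not the paper's prescribed settings; per the text, the encoder is pre-trained and the predictor f is trained by minimizing this loss.

```python
import torch
import torch.nn.functional as F


def semi_supervised_loss(encoder, predictor, x_lab, y_lab, x_unlab,
                         K=3, p_m=0.3, beta=1.0):
    """One-batch estimate of Equation (7): L_final = L_s + beta * L_u.

    encoder   : the encoder e pre-trained in the self-supervised stage.
    predictor : the predictive model f, applied to representations z = e(x),
                returning class logits here.
    """
    # Supervised loss L_s (Equation (8)); cross-entropy for classification.
    l_s = F.cross_entropy(predictor(encoder(x_lab)), y_lab)

    # Consistency loss L_u (Equations (9)-(10)): K corrupted versions per sample,
    # averaged over the batch and output dimensions.
    pred = predictor(encoder(x_unlab))            # f(z_i)
    l_u = 0.0
    for _ in range(K):
        _, x_tilde = pretext_generator(x_unlab, p_m)
        pred_k = predictor(encoder(x_tilde))      # f(z_{i,k})
        l_u = l_u + ((pred_k - pred) ** 2).mean()
    l_u = l_u / K

    return l_s + beta * l_u
```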
5 Experiments
In this section, we conduct a series of experiments to demonstrate the efficacy of our framework
(VIME) on several tabular datasets from different application domains, including genomics and
clinical data. We use a min-max scaler to normalize the data between 0 and 1. For self-supervised
learning, we compare VIME against two benchmarks, Denoising auto-encoder (DAE) [21] and
Context Encoder [22]. For semi-supervised learning, we use the data augmentation method MixUp
[10] as the main benchmark. We exclude self- and semi-supervised learning benchmarks that
are applicable only to image or language data. As a baseline, we also include supervised learning benchmarks. Additional results with more baselines can be found in the Supplementary Materials. In the experiments, self- and semi-supervised learning methods use both labeled data and unlabeled data, while the supervised learning methods only use the labeled data. Implementation details and sensitivity analyses on the three hyperparameters (pm, α, β) can be found in
the Supplementary Materials (Section 5 & 6). The implementation of VIME can be found at
https://bitbucket.org/mvdschaar/mlforhealthlabpub/src/master/alg/vime/ and at
https://github.com/jsyoon0823/VIME.
In this subsection, we evaluate the methods on a large genomics dataset from UK Biobank consisting
of around 400,000 individuals’ genomics information (SNPs) and 6 corresponding blood cell traits:
(1) Mean Reticulocyte Volume (MRV), (2) Mean Platelet Volume (MPV), (3) Mean Cell Hemoglobin
(MCH), (4) Reticulocyte Fraction of Red Cells (RET), (5) Plateletcrit (PCT), and (6) Monocyte
Percentage of White Cells (MONO). The features of the dataset consist of around 700 SNPs (after
the standard p-value filtering process), where each SNP, taking a value in {0, 1, 2}, is treated as a
categorical variable (with three categories). Here, we have 6 different blood cell traits to predict, and
we treat each of them as an independent prediction task (selected SNPs are different across different
blood cell traits). Detailed data descriptions are provided in the Supplementary Materials (Section 2).
Note that all the variables are categorical features.
To test the effectiveness of self- and semi-supervised learning in the small labeled data setting, VIME
and the benchmarks are tasked to predict the 6 blood cell traits as we gradually increase the number of labeled data points from 1,000 to 100,000 samples, using the remaining data as unlabeled data
(more than 300,000 samples). We use a linear model (Elastic Net [34]) as the predictive model due to its superior performance in comparison to non-linear models such as multi-layer perceptrons and random forests [35] on genomics datasets.
Figure 3: MSE performance on 6 different blood cell traits across different sizes of the labeled genomics dataset (lower is better). Note that the x-axis is on a log scale.
In Figure 3, we show the MSE performance (y-axis) against the number of labeled data points (x-axis,
in log scale) increasing from 1,000 to 10,000³. The proposed model (VIME) outperforms all the benchmarks, including the purely supervised method ElasticNet, the self-supervised method Context Encoder, and the semi-supervised method MixUp. In fact, in many cases VIME shows similar performance to the benchmarks even when it has access to only half as many labeled data points.
³The performance for the 10,000 to 100,000 range can be found in the Supplementary Materials (Section 3).
In this subsection, we evaluate the methods on clinical data, using the UK and US prostate cancer
datasets (from Prostate Cancer UK and SEER datasets, respectively). The features consist of patients’
clinical information (e.g. age, grade, stage, Gleason scores), 28 features in total. We predict two possible treatments of UK prostate cancer patients: (1) hormone therapy (whether the patient received hormone therapy) and (2) radical therapy (whether the patient received radical therapy). Both tasks are binary classification. In the UK prostate cancer dataset, we have only around 10,000 labeled patient samples. The US prostate cancer dataset contains more than 200,000 unlabeled patient samples, twenty times larger than the labeled UK dataset. We use 50% of the UK dataset (as the labeled data)
and the entire US dataset (as the unlabeled data) for training, with the remainder of the UK data being
used as the testing set. We also test three popular supervised learning models: Logistic Regression, a
2-layer Multi-layer Perceptron and XGBoost.
Table 1 shows that VIME results in the best prediction performance, outperforming the benchmarks.
More importantly, VIME is the only self- or semi-supervised learning framework that significantly
outperforms supervised learning models. These results shed light on the unique advantage of using
VIME in leveraging a large unlabeled tabular dataset (e.g. the US dataset) to strengthen a model’s
predictive power. Here we also demonstrate that VIME can perform well even when there exists a
distribution shift between the UK labeled data and the US unlabeled data (see the Supplementary
Materials (Section 2) for further details).
Table 1: AUROC performance of patient treatment predictions on hormone and radical therapy (higher is better). (Mean ± standard deviation computed over 10 runs.)
Type      Models                 Hormone          Radical
SL        Logistic Regression    .8371 ± .0013    .8036 ± .0015
SL        2-layer Perceptron     .8351 ± .0023    .8146 ± .0022
SL        XGBoost                .8423 ± .0018    .8166 ± .0011
Self-SL   DAE                    .8335 ± .0049    .8144 ± .0061
Self-SL   Context Encoder        .8308 ± .0051    .8134 ± .0066
Semi-SL   MixUp                  .8448 ± .0021    .8214 ± .0029
          VIME                   .8602 ± .0029    .8391 ± .0021
To further verify the generalizability and allow for reproducibility of our results, we compare VIME
with the benchmarks using three public tabular datasets: MNIST (interpreted as tabular data with 784 features), UCI Income, and UCI Blog. We use 10% of the data as labeled data and the remaining 90% as unlabeled data. Prediction accuracy on a separate testing set is used as the
metric for all three datasets. As shown in Table 2 (Type - Supervised models, Self-supervised models,
Semi-supervised models and VIME), VIME achieves the best accuracy regardless of the application
domains. These results further confirm the superiority of VIME in a diverse range of tabular datasets.
In this section, we conduct an ablation study to analyze the performance gain of each component in
VIME on the tabular datasets introduced in Section 5.3. We define three variants of VIME:
• Supervised only: Exclude both self- and semi-supervised learning parts (i.e. 2-layer perceptron)
• Semi-SL only: Exclude self-supervised learning part (i.e. remove the encoder in Figure 2)
• Self-SL only: Exclude the semi-supervised learning part (i.e. β = 0). More specifically, we first train the encoder via self-supervised learning. Then, we train the predictive model with the loss function in Equation (7) with β = 0 (i.e. utilizing only the labeled data).
Table 2: Prediction accuracy of the methods on UCI Income, MNIST and UCI Blog datasets (Mean
± Std are computed over 10 runs).
Type                      Models                Income           MNIST            Blog
Supervised models         Logistic Regression   .8425 ± .0013    .8989 ± .0023    .6915 ± .0029
Supervised models         2-layer Perceptron    .8520 ± .0023    .9387 ± .0014    .7972 ± .0058
Supervised models         XGBoost               .8623 ± .0021    .9413 ± .0026    .7975 ± .0030
Self-supervised models    DAE                   .8578 ± .0028    .9431 ± .0032    .8001 ± .0039
Self-supervised models    Context Encoder       .8611 ± .0027    .9455 ± .0048    .8033 ± .0051
Semi-supervised models    MixUp                 .8701 ± .0021    .9461 ± .0023    .8088 ± .0038
Variants of VIME          Supervised only       .8520 ± .0023    .9387 ± .0014    .7972 ± .0058
Variants of VIME          Self-SL only          .8599 ± .0026    .9406 ± .0019    .8147 ± .0037
Variants of VIME          Semi-SL only          .8771 ± .0031    .9548 ± .0023    .8361 ± .0041
                          VIME                  .8804 ± .0030    .9577 ± .0022    .8389 ± .0044
Table 2 (Type - Variants of VIME and VIME) shows that both Self-SL only and Semi-SL only
show performance gains compared with Supervised only, and VIME is always better than its
variants. Every component in VIME can improve the performance of a predictive model, and the
best performance is achieved when they work collaboratively in our unified framework. We note
that Self-SL only leads to a larger performance drop than Semi-SL only because in the former the
predictive model is trained solely on a small labeled dataset without the unsupervised loss function
Lu , while in the latter the predictive model is trained via minimizing both losses but without the
encoder. Additional ablation study can be found in the Supplementary Materials.
6 Discussions: Why is the proposed model (VIME) needed for tabular data?
Image and tabular data are very different. The spatial correlations between pixels in images or the
sequential correlations between words in text data are well-known and consistent across different
datasets. By contrast, the correlation structure among features in tabular data is unknown and varies
across different datasets. In other words, there is no “common” correlation structure in tabular data
(unlike in image and text data). This makes the self- and semi-supervised learning in tabular data
more challenging. Note that methods that are promising in the image domain do not guarantee favorable results in the tabular domain (and vice versa). Also, most augmentations and pretext tasks used for image data are not applicable to tabular data, because they directly utilize the spatial relationships of the image for augmentation (e.g., rotation) and pretext tasks (e.g., jigsaw puzzles and colorization). To transfer the
successes of self- and semi-supervised learning from image to tabular domains, proposing applicable
and proper pretext tasks and augmentations for tabular data (our main novelty) is critical. Note that
better augmentations and pretext tasks can significantly improve self- and semi-supervised learning
performances.
Broader Impact
Tabular data is the most common data type in the real world. Most databases include tabular data such as demographic information in medical and finance datasets and SNPs in genomic datasets. However, the tremendous successes of deep learning (especially in the image and language domains) have not yet been fully extended to the tabular domain. Still, in the tabular domain, ensembles of decision trees achieve state-of-the-art performance. If we can efficiently extend the successful
deep learning methodologies from images and language to tabular data, the application of machine
learning in the real world can be greatly extended. This paper takes a step in this direction for self-
and semi-supervised learning frameworks which recently have achieved significant successes in
images and language. In addition, the proposed tabular data augmentation and representation learning
methodologies can be utilized in various fields such as tabular data encoding, balancing the labels of
tabular data, and missing data imputation.
Acknowledgements and Funding Sources
The authors would like to thank the reviewers for their helpful comments. This work was supported
by the National Science Foundation (NSF grant 1722516), the US Office of Naval Research (ONR),
and GlaxoSmithKline (GSK).
References
[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 770–778, 2016.
[2] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of
the IEEE international conference on computer vision, pages 2961–2969, 2017.
[3] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information
processing systems, pages 5998–6008, 2017.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-
scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern
recognition, pages 248–255. IEEE, 2009.
[5] Mark Peplow. The 100 000 Genomes Project. BMJ, 353:i1757, 2016.
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
2018.
[7] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by
predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
[8] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving
jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
[9] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European
conference on computer vision, pages 649–666. Springer, 2016.
[10] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond
empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
[11] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas
Brox. Discriminative unsupervised feature learning with exemplar convolutional neural net-
works. IEEE transactions on pattern analysis and machine intelligence, 38(9):1734–1747,
2015.
[12] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning
by context prediction. In Proceedings of the IEEE International Conference on Computer
Vision, pages 1422–1430, 2015.
[13] Samuli Laine, Tero Karras, Jaakko Lehtinen, and Timo Aila. High-quality self-supervised deep
image denoising. In Advances in Neural Information Processing Systems, pages 6968–6978,
2019.
[14] Yue Wang and Justin M Solomon. Prnet: Self-supervised learning for partial-to-partial registra-
tion. In Advances in Neural Information Processing Systems, pages 8812–8824, 2019.
[15] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive
predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[16] Olivier J Hénaff, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient
image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.
[17] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint
arXiv:1906.05849, 2019.
[18] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework
for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
[19] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for
unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
[20] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum
contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[21] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and
composing robust features with denoising autoencoders. In Proceedings of the 25th international
conference on Machine learning, pages 1096–1103. ACM, 2008.
[22] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context
encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 2536–2544, 2016.
[23] Sercan O Arik and Tomas Pfister. Tabnet: Attentive interpretable tabular learning. arXiv
preprint arXiv:1908.07442, 2019.
[24] Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. Tabert: Pretraining for
joint understanding of textual and tabular data. arXiv preprint arXiv:2005.08314, 2020.
[25] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for
deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3,
page 2, 2013.
[26] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transfor-
mations and perturbations for deep semi-supervised learning. In Advances in Neural Information
Processing Systems, pages 1163–1171, 2016.
[27] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged
consistency targets improve semi-supervised deep learning results. In Advances in neural
information processing systems, pages 1195–1204, 2017.
[28] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training:
a regularization method for supervised and semi-supervised learning. IEEE transactions on
pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
[29] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and
Colin Raffel. Mixmatch: A holistic approach to semi-supervised learning. arXiv preprint
arXiv:1905.02249, 2019.
[30] David Berthelot, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang,
and Colin Raffel. Remixmatch: Semi-supervised learning with distribution alignment and
augmentation anchoring. arXiv preprint arXiv:1911.09785, 2019.
[31] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using gaussian
fields and harmonic functions. In Proceedings of the 20th International conference on Machine
learning (ICML-03), pages 912–919, 2003.
[32] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional
networks. arXiv preprint arXiv:1609.02907, 2016.
[33] Otilia Stretcu, Krishnamurthy Viswanathan, Dana Movshovitz-Attias, Emmanouil Platanios,
Sujith Ravi, and Andrew Tomkins. Graph agreement models for semi-supervised learning. In
Advances in Neural Information Processing Systems, pages 8710–8720, 2019.
[34] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of
the royal statistical society: series B (statistical methodology), 67(2):301–320, 2005.
[35] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.