Feature Learning
In supervised feature learning, features are learned using labeled input data. Labeled data
includes input-label pairs where the input is given to the model and it must produce the
ground truth label as the correct answer.[3] This can be leveraged to generate feature representations with the model that yield high label-prediction accuracy. Examples include supervised neural networks, multilayer perceptrons, and (supervised) dictionary learning.
In unsupervised feature learning, features are learned with unlabeled input data by
analyzing the relationship between points in the dataset.[4] Examples include dictionary
learning, independent component analysis, matrix factorization[5] and various forms of
clustering.[6][7][8]
In self-supervised feature learning, features are learned from unlabeled data, as in unsupervised learning; however, input-label pairs are constructed from each data point itself, enabling the structure of the data to be learned through supervised methods such as gradient descent.[9] Classical examples include word embeddings and autoencoders.[10][11]
SSL has since been applied to many modalities through the use of deep neural network
architectures such as CNNs and transformers.[9]
Supervised
Supervised feature learning is learning features from labeled data. The data label allows the system to
compute an error term, the degree to which the system fails to produce the label, which can then be used as
feedback to correct the learning process (reduce/minimize the error). Approaches include:
Dictionary learning develops a set (dictionary) of representative elements from the input data such that each
data point can be represented as a weighted sum of the representative elements. The dictionary elements
and the weights may be found by minimizing the average representation error (over the input data), together
with L1 regularization on the weights to enable sparsity (i.e., the representation of each data point has only
a few nonzero weights).
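In symbols, with dictionary D, per-point weight vectors w_i, and a sparsity parameter λ (conventional notation assumed here, not fixed by the text above), the problem can be written as

```latex
\min_{D,\,\{w_i\}} \; \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - D w_i \rVert_2^2 \;+\; \lambda \lVert w_i \rVert_1
```

where the first term is the average representation error and the L1 term encourages each w_i to have only a few nonzero entries.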
Supervised dictionary learning exploits both the structure underlying the input data and the labels for
optimizing the dictionary elements. For example, this[12] supervised dictionary learning technique applies
dictionary learning on classification problems by jointly optimizing the dictionary elements, weights for
representing data points, and parameters of the classifier based on the input data. In particular, a
minimization problem is formulated, where the objective function consists of the classification error, the
representation error, an L1 regularization on the representing weights for each data point (to enable sparse
representation of data), and an L2 regularization on the parameters of the classifier.
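Schematically, with a classification loss ℓ, a classifier f_θ acting on the weights, and trade-off parameters γ, λ, μ (hypothetical notation; the exact formulation in [12] differs in detail), the joint problem has the form

```latex
\min_{D,\,\{w_i\},\,\theta} \; \sum_{i=1}^{n} \Big( \ell\big(y_i, f_{\theta}(w_i)\big) \;+\; \gamma \lVert x_i - D w_i \rVert_2^2 \;+\; \lambda \lVert w_i \rVert_1 \Big) \;+\; \mu \lVert \theta \rVert_2^2
```

combining the classification error, the representation error, the L1 sparsity term on the weights, and the L2 regularization on the classifier parameters.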
Neural networks
Neural networks are a family of learning algorithms that use a "network" consisting of multiple layers of interconnected nodes. They are inspired by animal nervous systems, in which the nodes are viewed as neurons and the edges as synapses. Each edge has an associated weight, and the network defines
computational rules for passing input data from the network's input layer to the output layer. A network
function associated with a neural network characterizes the relationship between input and output layers,
which is parameterized by the weights. With appropriately defined network functions, various learning
tasks can be performed by minimizing a cost function over the network function (weights).
Multilayer neural networks can be used to perform feature learning, since they learn a representation of
their input at the hidden layer(s) which is subsequently used for classification or regression at the output
layer. One popular network architecture of this type is the Siamese network.
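A minimal sketch in Python may clarify how hidden-layer activations serve as learned features; the architecture, dimensions, and data below are illustrative assumptions, and in practice the weights would be trained by minimizing a cost function over labeled data.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(64, 10)), np.zeros(10)   # input layer (dim 64) -> 10 hidden units
W2, b2 = rng.normal(size=(10, 3)), np.zeros(3)     # 10 hidden features -> 3 output scores

def features(x):
    """Hidden-layer representation of x: the learned features."""
    return np.tanh(x @ W1 + b1)

def network(x):
    """Full network function: raw input -> hidden features -> output scores."""
    return features(x) @ W2 + b2

x = rng.normal(size=(5, 64))   # a batch of 5 raw input vectors
print(features(x).shape)       # (5, 10): each input mapped to a 10-dim representation
```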
Unsupervised
Unsupervised feature learning is learning features from unlabeled data. The goal of unsupervised feature
learning is often to discover low-dimensional features that capture some structure underlying the high-
dimensional input data. When the feature learning is performed in an unsupervised way, it enables a form of
semisupervised learning where features learned from an unlabeled dataset are then employed to improve
performance in a supervised setting with labeled data.[13][14] Several approaches are introduced below.
K-means clustering
K-means clustering is an approach for vector quantization. In particular, given a set of n vectors, k-means
clustering groups them into k clusters (i.e., subsets) in such a way that each vector belongs to the cluster
with the closest mean. The problem is computationally NP-hard, although suboptimal greedy algorithms
have been developed.
K-means clustering can be used to group an unlabeled set of inputs into k clusters, and then use the
centroids of these clusters to produce features. These features can be produced in several ways. The
simplest is to add k binary features to each sample, where each feature j has value one iff the jth centroid
learned by k-means is the closest to the sample under consideration.[6] It is also possible to use the distances
to the clusters as features, perhaps after transforming them through a radial basis function (a technique that
has been used to train RBF networks[15]). Coates and Ng note that certain variants of k-means behave
similarly to sparse coding algorithms.[16]
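The two feature constructions just described can be sketched with scikit-learn (the data and value of k are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 16))   # 200 unlabeled 16-dim samples
k = 8
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Option 1: k binary features, one-hot on the closest centroid.
one_hot = np.eye(k)[kmeans.predict(X)]                # shape (200, 8)

# Option 2: distances to all k centroids, optionally passed through an RBF.
dists = kmeans.transform(X)                           # shape (200, 8)
rbf = np.exp(-dists**2 / (2 * dists.std()**2))        # radial basis function transform
```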
In a comparative evaluation of unsupervised feature learning methods, Coates, Lee and Ng found that k-
means clustering with an appropriate transformation outperforms the more recently invented auto-encoders
and RBMs on an image classification task.[6] K-means also improves performance in the domain of NLP,
specifically for named-entity recognition;[17] there, it competes with Brown clustering, as well as with
distributed word representations (also known as neural word embeddings).[14]
Principal component analysis
Principal component analysis (PCA) is often used for dimension reduction. Given an unlabeled set of n
input data vectors, PCA generates p (which is much smaller than the dimension of the input data) right
singular vectors corresponding to the p largest singular values of the data matrix, where the kth row of the
data matrix is the kth input data vector shifted by the sample mean of the input (i.e., subtracting the sample
mean from the data vector). Equivalently, these singular vectors are the eigenvectors corresponding to the p
largest eigenvalues of the sample covariance matrix of the input vectors. These p singular vectors are the
feature vectors learned from the input data, and they represent directions along which the data has the
largest variations.
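The procedure can be sketched in a few lines of Python (the data dimensions and p are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))       # n = 500 input vectors of dimension 50
p = 5

Xc = X - X.mean(axis=0)              # shift each vector by the sample mean
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:p]                  # the p right singular vectors: the learned features
Z = Xc @ components.T                # p-dimensional representation of each input
```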
PCA is a linear feature learning approach since the p singular vectors are linear functions of the data matrix.
The singular vectors can be generated via a simple algorithm with p iterations. In the ith iteration, the projection of the data matrix onto the (i−1)th singular vector is subtracted, and the ith singular vector is found as the right singular vector corresponding to the largest singular value of the residual data matrix.
PCA has several limitations. First, it assumes that the directions with large variance are of most interest,
which may not be the case. PCA only relies on orthogonal transformations of the original data, and it
exploits only the first- and second-order moments of the data, which may not well characterize the data
distribution. Furthermore, PCA can effectively reduce dimension only when the input data vectors are
correlated (which results in a few dominant eigenvalues).
Local linear embedding
Local linear embedding (LLE) is a nonlinear learning approach for generating low-dimensional neighbor-
preserving representations from (unlabeled) high-dimension input. The approach was proposed by Roweis
and Saul (2000).[18][19] The general idea of LLE is to reconstruct the original high-dimensional data using
lower-dimensional points while maintaining some geometric properties of the neighborhoods in the original
data set.
LLE consists of two major steps. The first step is for "neighbor-preserving", where each input data point Xi
is reconstructed as a weighted sum of K nearest neighbor data points, and the optimal weights are found by
minimizing the average squared reconstruction error (i.e., difference between an input point and its
reconstruction) under the constraint that the weights associated with each point sum up to one. The second step is for "dimension reduction," which looks for vectors in a lower-dimensional space that minimize the representation error using the weights optimized in the first step. Note that in the first step, the weights are optimized with the data fixed, which can be solved as a least-squares problem. In the second step, the lower-dimensional points are optimized with the weights fixed, which can be solved via sparse eigenvalue decomposition.
The reconstruction weights obtained in the first step capture the "intrinsic geometric properties" of a
neighborhood in the input data.[19] It is assumed that original data lie on a smooth lower-dimensional
manifold, and the "intrinsic geometric properties" captured by the weights of the original data are also
expected to be on the manifold. This is why the same weights are used in the second step of LLE.
Compared with PCA, LLE is more powerful in exploiting the underlying data structure.
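Both steps are performed internally by scikit-learn's implementation; a minimal sketch (the synthetic curved data set is an illustrative assumption):

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)
t = 3 * np.pi * (1 + 2 * rng.random(500))
X = np.column_stack([t * np.cos(t), 20 * rng.random(500), t * np.sin(t)])  # 3-D "Swiss roll"

# Step 1 (neighbor-preserving weights) and step 2 (low-dimensional embedding)
# both happen inside fit_transform.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
Y = lle.fit_transform(X)   # 2-D neighbor-preserving representation
```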
Independent component analysis
Independent component analysis (ICA) is a technique for forming a data representation using a weighted sum of independent non-Gaussian components.[20] The assumption of non-Gaussianity is imposed because the weights cannot be uniquely determined when all the components follow a Gaussian distribution.
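A minimal sketch with scikit-learn's FastICA (the synthetic non-Gaussian sources and mixing matrix are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.column_stack([np.sign(np.sin(3 * t)),        # square wave: non-Gaussian
                     rng.laplace(size=2000)])       # Laplacian noise: non-Gaussian
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])                          # mixing matrix
X = S @ A.T                                         # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)   # recovered components, up to scale and ordering
```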
Unsupervised dictionary learning
Unsupervised dictionary learning does not utilize data labels and exploits the structure underlying the data
for optimizing dictionary elements. An example of unsupervised dictionary learning is sparse coding, which
aims to learn basis functions (dictionary elements) for data representation from unlabeled input data. Sparse
coding can be applied to learn overcomplete dictionaries, where the number of dictionary elements is larger
than the dimension of the input data.[21] Aharon et al. proposed algorithm K-SVD for learning a dictionary
of elements that enables sparse representation.[22]
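A minimal sketch of learning an overcomplete dictionary with scikit-learn; the data and hyperparameters are illustrative assumptions, and the solver stands in for algorithms such as K-SVD:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

X = np.random.default_rng(0).normal(size=(300, 20))   # 300 unlabeled 20-dim samples

# Overcomplete: 40 dictionary elements for 20-dimensional data.
dl = MiniBatchDictionaryLearning(n_components=40, alpha=1.0, random_state=0)
codes = dl.fit_transform(X)     # sparse weights representing each sample
dictionary = dl.components_     # the learned dictionary atoms (40 x 20)
```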
Multilayer/deep architectures
The hierarchical architecture of the biological neural system inspires deep learning architectures for feature
learning by stacking multiple layers of learning nodes.[23] These architectures are often designed based on
the assumption of distributed representation: observed data is generated by the interactions of many
different factors on multiple levels. In a deep learning architecture, the output of each intermediate layer can
be viewed as a representation of the original input data. Each level uses the representation produced by the previous level as input and produces new representations as output, which are then fed to higher levels. The
input at the bottom layer is raw data, and the output of the final layer is the final low-dimensional feature or
representation.
Restricted Boltzmann machines (RBMs) are often used as a building block for multilayer learning
architectures.[6][24] An RBM can be represented by an undirected bipartite graph consisting of a group of
binary hidden variables, a group of visible variables, and edges connecting the hidden and visible nodes. It
is a special case of the more general Boltzmann machine, with the constraint of no intra-layer connections.
Each edge in an RBM is associated with a weight. The weights together with the connections define an
energy function, based on which a joint distribution of visible and hidden nodes can be devised. Based on
the topology of the RBM, the hidden (visible) variables are independent, conditioned on the visible
(hidden) variables. Such conditional independence facilitates computations.
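For binary units, the energy function and the joint distribution it defines conventionally take the form (a and b are the visible and hidden bias vectors and W the weight matrix; this standard notation is assumed here rather than taken from the cited sources)

```latex
E(v, h) = -a^{\top} v - b^{\top} h - v^{\top} W h,
\qquad
P(v, h) = \frac{1}{Z}\, e^{-E(v, h)}
```

where Z is the partition function summing e^{-E(v,h)} over all configurations.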
An RBM can be viewed as a single layer architecture for unsupervised feature learning. In particular, the
visible variables correspond to input data, and the hidden variables correspond to feature detectors. The
weights can be trained by maximizing the probability of visible variables using Hinton's contrastive
divergence (CD) algorithm.[24]
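A minimal sketch of one CD-1 update for a binary RBM (biases are omitted for brevity, and the shapes and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 20, 10, 0.1
W = 0.01 * rng.normal(size=(n_visible, n_hidden))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0):
    """One CD-1 weight update from a batch v0 of binary visible vectors."""
    ph0 = sigmoid(v0 @ W)                               # P(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)    # sample hidden units
    pv1 = sigmoid(h0 @ W.T)                             # reconstruction P(v=1 | h0)
    ph1 = sigmoid(pv1 @ W)                              # P(h=1 | reconstruction)
    # Data statistics minus model statistics: the CD approximation to the gradient.
    return lr * (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]

v0 = (rng.random((32, n_visible)) < 0.5).astype(float)  # a batch of binary data
W += cd1_update(v0)
```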
In general, training an RBM by solving this maximization problem tends to result in non-sparse representations. The sparse RBM[25] was proposed to enable sparse representations. The idea is to add a regularization term to the data-likelihood objective that penalizes the deviation of the expected hidden variables from a small constant.
Autoencoder
An autoencoder, consisting of an encoder and a decoder, is a paradigm for deep learning architectures. An example is provided by Hinton and Salakhutdinov,[24] in which the encoder uses raw data (e.g., an image) as input and produces a feature or representation as output, and the decoder uses the extracted feature from the encoder as input and reconstructs the original raw input data as output. In that work the encoder and decoder are constructed by stacking multiple layers of RBMs. The parameters involved in the architecture were
originally trained in a greedy layer-by-layer manner: after one layer of feature detectors is learned, its outputs are fed as visible variables for training the next RBM. Current approaches typically apply end-to-end training with stochastic gradient descent methods. Training can be repeated until some stopping criteria are satisfied.
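A minimal end-to-end sketch in PyTorch (the dimensions and random training batch are illustrative assumptions; dense layers stand in for the stacked RBMs of the original construction):

```python
import torch
from torch import nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
model = nn.Sequential(encoder, decoder)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)                            # a batch of flattened 28x28 images
for step in range(100):                            # until some stopping criterion
    loss = nn.functional.mse_loss(model(x), x)     # reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()

features = encoder(x)   # 32-dimensional learned representation of each input
```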
Self-supervised
Self-supervised representation learning is learning features by training on the structure of unlabeled data
rather than relying on explicit labels for an information signal. This approach has enabled the combined use
of deep neural network architectures and larger unlabeled datasets to produce deep feature
representations.[9] Training tasks are typically classed as either contrastive, generative, or a combination of the two.[26]
Contrastive representation learning trains representations for associated data pairs, called positive samples,
to be aligned, while pairs with no relation, called negative samples, are contrasted (pushed apart). A large number of negative samples is typically necessary to prevent catastrophic collapse, in which all inputs are mapped to the same representation.[9] Generative representation learning tasks the model with
producing the correct data to either match a restricted input or reconstruct the full input from a lower
dimensional representation.[26]
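One widely used form of the contrastive objective is the InfoNCE/NT-Xent family of losses (the notation below is conventional rather than taken from the cited surveys): for a positive pair of representations (z_i, z_j),

```latex
\mathcal{L}_{i} = -\log \frac{\exp\!\big(\operatorname{sim}(z_i, z_j)/\tau\big)}
{\sum_{k \neq i} \exp\!\big(\operatorname{sim}(z_i, z_k)/\tau\big)}
```

where sim is typically cosine similarity, τ is a temperature hyperparameter, and the sum runs over the positive sample and the negative samples.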
A common setup for self-supervised representation learning of a certain data type (e.g. text, image, audio,
video) is to pretrain the model using large datasets of general context, unlabeled data.[11] Depending on the
context, the result of this is either a set of representations for common data segments (e.g. words) which
new data can be broken into, or a neural network able to convert each new data point (e.g. image) into a set
of lower dimensional features.[9] In either case, the output representations can then be used as an
initialization in many different problem settings where labeled data may be limited. Specialization of the model to specific tasks is typically done with supervised learning, either by fine-tuning the model or representations with the labels as the signal, or by freezing the representations and training an additional model that takes them as input.[11]
Many self-supervised training schemes have been developed for use in representation learning of various
modalities, often first showing successful application in text or image before being transferred to other data
types.[9]
Text
Word2vec is a word embedding technique which learns to represent words through self-supervision over
each word and its neighboring words in a sliding window across a large corpus of text.[27] The model has
two possible training schemes to produce word vector representations, one generative and one
contrastive.[26] The first is word prediction given each of the neighboring words as an input.[27] The
second is training on the representation similarity for neighboring words and representation dissimilarity for
random pairs of words.[10] A limitation of word2vec is that only the pairwise co-occurrence structure of the
data is used, and not the ordering or entire set of context words. More recent transformer-based
representation learning approaches attempt to solve this with word prediction tasks.[9] GPTs pretrain on
next word prediction using prior input words as context,[28] whereas BERT masks random tokens in order
to provide bidirectional context.[29]
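A minimal sketch of word2vec training with the gensim library (the toy corpus is an illustrative assumption):

```python
from gensim.models import Word2Vec

corpus = [
    ["feature", "learning", "discovers", "useful", "representations"],
    ["word", "embeddings", "are", "learned", "representations"],
]

# sg=1 selects the skip-gram (word-prediction) scheme; negative=5 enables the
# contrastive negative-sampling scheme over random word pairs.
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1,
                 sg=1, negative=5)
vec = model.wv["representations"]   # a 50-dimensional word vector
```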
Other self-supervised techniques extend word embeddings by finding representations for larger text
structures such as sentences or paragraphs in the input data.[9] Doc2vec extends the generative training
approach in word2vec by adding an additional input to the word prediction task based on the paragraph it is
within, and is therefore intended to represent paragraph level context.[30]
Image
The domain of image representation learning has employed many different self-supervised training
techniques, including transformation,[31] inpainting,[32] patch discrimination[33] and clustering.[34]
Examples of generative approaches are Context Encoders, which trains an AlexNet CNN architecture to
generate a removed image region given the masked image as input,[32] and iGPT, which applies the GPT-2
language model architecture to images by training on pixel prediction after reducing the image
resolution.[35]
Many other self-supervised methods use siamese networks, in which different views of an image are generated through various augmentations and then encoded to have similar representations. The challenge is avoiding collapsed solutions, in which the model encodes all images to the same representation.[36] SimCLR
is a contrastive approach which uses negative examples in order to generate image representations with a
ResNet CNN.[33] Bootstrap Your Own Latent (BYOL) removes the need for negative samples by
encoding one of the views with a slow moving average of the model parameters as they are being modified
during training.[37]
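A minimal sketch of an NT-Xent-style contrastive loss in the spirit of SimCLR (a simplified rendering, not the reference implementation; z1 and z2 are assumed to be representations of two augmented views of the same batch of images):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """Contrastive loss over a batch: each view's positive is the other view."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d) unit vectors
    sim = z @ z.t() / temperature                        # (2N, 2N) scaled cosine sims
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float("-inf"))  # drop self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])    # positive indices
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))
```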
Graph
The goal of many graph representation learning techniques is to produce an embedded representation of
each node based on the overall network topology.[38] node2vec extends the word2vec training technique to
nodes in a graph by using co-occurrence in random walks through the graph as the measure of
association.[39] Another approach is to maximize mutual information, a measure of similarity, between the
representations of associated structures within the graph.[9] An example is Deep Graph Infomax, which
uses contrastive self-supervision based on mutual information between the representation of a “patch”
around each node, and a summary representation of the entire graph. Negative samples are obtained by
pairing the graph representation with either representations from another graph in a multigraph training
setting, or corrupted patch representations in single graph training.[40]
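A minimal sketch of random-walk node embeddings in this spirit: unbiased walks are used for simplicity (node2vec itself biases the walks with return and in-out parameters p and q), and the example graph is an illustrative assumption.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()

def random_walk(start, length=10):
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(list(G.neighbors(walk[-1]))))
    return [str(n) for n in walk]          # word2vec treats node IDs as "words"

walks = [random_walk(n) for n in G.nodes() for _ in range(20)]
model = Word2Vec(sentences=walks, vector_size=32, window=5, min_count=1, sg=1)
embedding = model.wv["0"]                  # 32-dimensional representation of node 0
```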
Video
Video representation learning approaches are often similar to image techniques, with analogous results in masked prediction[41] and clustering,[42] but they must also exploit the temporal sequence of video frames as an additional learned structure. Examples include VCP, which masks video clips and trains the model to choose the correct one given a set of clip options, and the approach of Xu et al., who train a 3D-CNN to identify the original order of a shuffled set of video clips.[43]
Audio
Self-supervised representation techniques have also been applied to many audio data formats, particularly
for speech processing.[9] Wav2vec 2.0 discretizes the audio waveform into timesteps via temporal
convolutions, and then trains a transformer on masked prediction of random timesteps using a contrastive
loss.[44] This is similar to the BERT language model, except that, as in many SSL approaches to video, the model chooses among a set of candidate options rather than over the entire word vocabulary.[29][44]
Multimodal
Self-supervised learning has also been used to develop joint representations of multiple data types.[9]
Approaches usually rely on some natural or human-derived association between the modalities as an
implicit label, for instance video clips of animals or objects with characteristic sounds,[45] or captions
written to describe images.[46] CLIP produces a joint image-text representation space by training to align
image and text encodings from a large dataset of image-caption pairs using a contrastive loss.[46] MERLOT
Reserve trains a transformer-based encoder to jointly represent audio, subtitles and video frames from a
large dataset of videos through three joint pretraining tasks: contrastive masked prediction of either audio or
text segments given the video frames and surrounding audio and text context, along with contrastive
alignment of video frames with their corresponding captions.[45]
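A minimal sketch of a CLIP-style symmetric contrastive loss over a batch of image-text pairs (the random features stand in for encoder outputs; the dimension and temperature are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    img = F.normalize(image_features, dim=1)   # (N, d) image encodings
    txt = F.normalize(text_features, dim=1)    # (N, d) caption encodings
    logits = img @ txt.t() / temperature       # (N, N) pairwise similarities
    targets = torch.arange(logits.size(0))     # matching pairs lie on the diagonal
    # Symmetric cross-entropy: align each image to its caption and vice versa.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```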
Multimodal representation models are typically unable to assume direct correspondence of representations
in the different modalities, since the precise alignment can often be noisy or ambiguous. For example, the
text "dog" could be paired with many different pictures of dogs, and correspondingly a picture of a dog
could be captioned with varying degrees of specificity. This limitation means that downstream tasks may
require an additional generative mapping network between modalities to achieve optimal performance, as in DALL-E 2 for text-to-image generation.[47]
See also
Automated machine learning (AutoML)
Deep learning
Feature detection (computer vision)
Feature extraction
Word embedding
Vector quantization
Variational autoencoder
References
1. Goodfellow, Ian (2016). Deep learning. Yoshua Bengio, Aaron Courville. Cambridge,
Massachusetts. pp. 524–534. ISBN 0-262-03561-8. OCLC 955778308 (https://www.worldca
t.org/oclc/955778308).
2. Y. Bengio; A. Courville; P. Vincent (2013). "Representation Learning: A Review and New
Perspectives". IEEE Transactions on Pattern Analysis and Machine Intelligence. 35 (8):
1798–1828. arXiv:1206.5538 (https://arxiv.org/abs/1206.5538). doi:10.1109/tpami.2013.50 (h
ttps://doi.org/10.1109%2Ftpami.2013.50). PMID 23787338 (https://pubmed.ncbi.nlm.nih.gov/
23787338). S2CID 393948 (https://api.semanticscholar.org/CorpusID:393948).
3. Stuart J. Russell, Peter Norvig (2010) Artificial Intelligence: A Modern Approach, Third
Edition, Prentice Hall ISBN 978-0-13-604259-4.
4. Hinton, Geoffrey; Sejnowski, Terrence (1999). Unsupervised Learning: Foundations of
Neural Computation. MIT Press. ISBN 978-0-262-58168-4.
5. Nathan Srebro; Jason D. M. Rennie; Tommi S. Jaakkola (2004). Maximum-Margin Matrix
Factorization. NIPS.
6. Coates, Adam; Lee, Honglak; Ng, Andrew Y. (2011). An analysis of single-layer networks in
unsupervised feature learning (https://web.archive.org/web/20170813153615/http://machinel
earning.wustl.edu/mlpapers/paper_files/AISTATS2011_CoatesNL11.pdf) (PDF). Int'l Conf.
on AI and Statistics (AISTATS). Archived from the original (http://machinelearning.wustl.edu/
mlpapers/paper_files/AISTATS2011_CoatesNL11.pdf) (PDF) on 2017-08-13. Retrieved
2014-11-24.
7. Csurka, Gabriella; Dance, Christopher C.; Fan, Lixin; Willamowski, Jutta; Bray, Cédric
(2004). Visual categorization with bags of keypoints (https://www.cs.cmu.edu/~efros/courses/
LBMV07/Papers/csurka-eccv-04.pdf) (PDF). ECCV Workshop on Statistical Learning in
Computer Vision.
8. Daniel Jurafsky; James H. Martin (2009). Speech and Language Processing. Pearson
Education International. pp. 145–146.
9. Ericsson, Linus; Gouk, Henry; Loy, Chen Change; Hospedales, Timothy M. (May 2022).
"Self-Supervised Representation Learning: Introduction, advances, and challenges" (https://i
eeexplore.ieee.org/document/9770283). IEEE Signal Processing Magazine. 39 (3): 42–62.
arXiv:2110.09327 (https://arxiv.org/abs/2110.09327). Bibcode:2022ISPM...39c..42E (https://u
i.adsabs.harvard.edu/abs/2022ISPM...39c..42E). doi:10.1109/MSP.2021.3134634 (https://do
i.org/10.1109%2FMSP.2021.3134634). ISSN 1558-0792 (https://www.worldcat.org/issn/155
8-0792). S2CID 239017006 (https://api.semanticscholar.org/CorpusID:239017006).
10. Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S; Dean, Jeff (2013). "Distributed
Representations of Words and Phrases and their Compositionality" (https://proceedings.neu
rips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html). Advances
in Neural Information Processing Systems. Curran Associates, Inc. 26. arXiv:1310.4546 (http
s://arxiv.org/abs/1310.4546).
11. Goodfellow, Ian (2016). Deep learning. Yoshua Bengio, Aaron Courville. Cambridge,
Massachusetts. pp. 499–516. ISBN 0-262-03561-8. OCLC 955778308 (https://www.worldca
t.org/oclc/955778308).
12. Mairal, Julien; Bach, Francis; Ponce, Jean; Sapiro, Guillermo; Zisserman, Andrew (2009).
"Supervised Dictionary Learning". Advances in Neural Information Processing Systems.
13. Percy Liang (2005). Semi-Supervised Learning for Natural Language (http://people.csail.mit.
edu/pliang/papers/meng-thesis.pdf) (PDF) (M. Eng.). MIT. pp. 44–52.
14. Joseph Turian; Lev Ratinov; Yoshua Bengio (2010). Word representations: a simple and
general method for semi-supervised learning (https://web.archive.org/web/2014022620282
3/http://www.newdesign.aclweb.org/anthology/P/P10/P10-1040.pdf) (PDF). Proceedings of
the 48th Annual Meeting of the Association for Computational Linguistics. Archived from the
original (http://www.newdesign.aclweb.org/anthology/P/P10/P10-1040.pdf) (PDF) on 2014-
02-26. Retrieved 2014-02-22.
15. Schwenker, Friedhelm; Kestler, Hans A.; Palm, Günther (2001). "Three learning phases for
radial-basis-function networks". Neural Networks. 14 (4–5): 439–458.
CiteSeerX 10.1.1.109.312 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.109.3
12). doi:10.1016/s0893-6080(01)00027-2 (https://doi.org/10.1016%2Fs0893-6080%2801%2
900027-2). PMID 11411631 (https://pubmed.ncbi.nlm.nih.gov/11411631).
16. Coates, Adam; Ng, Andrew Y. (2012). "Learning feature representations with k-means". In G.
Montavon, G. B. Orr and K.-R. Müller (ed.). Neural Networks: Tricks of the Trade. Springer.
17. Dekang Lin; Xiaoyun Wu (2009). Phrase clustering for discriminative learning (http://wmmks.
csie.ncku.edu.tw/ACL-IJCNLP-2009/ACLIJCNLP/pdf/ACLIJCNLP116.pdf) (PDF). Proc. J.
Conf. of the ACL and 4th Int'l J. Conf. on Natural Language Processing of the AFNLP.
pp. 1030–1038.
18. Roweis, Sam T; Saul, Lawrence K (2000). "Nonlinear Dimensionality Reduction by Locally
Linear Embedding". Science. New Series. 290 (5500): 2323–2326.
Bibcode:2000Sci...290.2323R (https://ui.adsabs.harvard.edu/abs/2000Sci...290.2323R).
doi:10.1126/science.290.5500.2323 (https://doi.org/10.1126%2Fscience.290.5500.2323).
JSTOR 3081722 (https://www.jstor.org/stable/3081722). PMID 11125150 (https://pubmed.nc
bi.nlm.nih.gov/11125150). S2CID 5987139 (https://api.semanticscholar.org/CorpusID:59871
39).
19. Saul, Lawrence K; Roweis, Sam T (2000). "An Introduction to Locally Linear Embedding" (ht
tp://www.cs.toronto.edu/~roweis/lle/publications.html).
20. Hyvärinen, Aapo; Oja, Erkki (2000). "Independent Component Analysis: Algorithms and
Applications". Neural Networks. 13 (4): 411–430. doi:10.1016/s0893-6080(00)00026-5 (http
s://doi.org/10.1016%2Fs0893-6080%2800%2900026-5). PMID 10946390 (https://pubmed.n
cbi.nlm.nih.gov/10946390). S2CID 11959218 (https://api.semanticscholar.org/CorpusID:119
59218).
21. Lee, Honglak; Battle, Alexis; Raina, Rajat; Ng, Andrew Y (2007). "Efficient sparse coding
algorithms". Advances in Neural Information Processing Systems.
22. Aharon, Michal; Elad, Michael; Bruckstein, Alfred (2006). "K-SVD: An Algorithm for
Designing Overcomplete Dictionaries for Sparse Representation". IEEE Trans. Signal
Process. 54 (11): 4311–4322. Bibcode:2006ITSP...54.4311A (https://ui.adsabs.harvard.edu/
abs/2006ITSP...54.4311A). doi:10.1109/TSP.2006.881199 (https://doi.org/10.1109%2FTSP.
2006.881199). S2CID 7477309 (https://api.semanticscholar.org/CorpusID:7477309).
23. Bengio, Yoshua (2009). "Learning Deep Architectures for AI". Foundations and Trends in
Machine Learning. 2 (1): 1–127. doi:10.1561/2200000006 (https://doi.org/10.1561%2F22000
00006). S2CID 207178999 (https://api.semanticscholar.org/CorpusID:207178999).
24. Hinton, G. E.; Salakhutdinov, R. R. (2006). "Reducing the Dimensionality of Data with
Neural Networks" (http://www.cs.toronto.edu/~hinton/science.pdf) (PDF). Science. 313
(5786): 504–507. Bibcode:2006Sci...313..504H (https://ui.adsabs.harvard.edu/abs/2006Sci...
313..504H). doi:10.1126/science.1127647 (https://doi.org/10.1126%2Fscience.1127647).
PMID 16873662 (https://pubmed.ncbi.nlm.nih.gov/16873662). S2CID 1658773 (https://api.se
manticscholar.org/CorpusID:1658773).
25. Lee, Honglak; Ekanadham, Chaitanya; Andrew, Ng (2008). "Sparse deep belief net model
for visual area V2". Advances in Neural Information Processing Systems.
26. Liu, Xiao; Zhang, Fanjin; Hou, Zhenyu; Mian, Li; Wang, Zhaoyu; Zhang, Jing; Tang, Jie
(2021). "Self-supervised Learning: Generative or Contrastive" (https://ieeexplore.ieee.org/do
cument/9462394). IEEE Transactions on Knowledge and Data Engineering. 35 (1): 857–
876. arXiv:2006.08218 (https://arxiv.org/abs/2006.08218). doi:10.1109/TKDE.2021.3090866
(https://doi.org/10.1109%2FTKDE.2021.3090866). ISSN 1558-2191 (https://www.worldcat.or
g/issn/1558-2191). S2CID 219687051 (https://api.semanticscholar.org/CorpusID:21968705
1).
27. Mikolov, Tomas; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013-09-06). "Efficient Estimation
of Word Representations in Vector Space". arXiv:1301.3781 (https://arxiv.org/abs/1301.378
1) [cs.CL (https://arxiv.org/archive/cs.CL)].
28. "Improving Language Understanding by Generative Pre-Training" (https://s3-us-west-2.amaz
onaws.com/openai-assets/research-covers/language-unsupervised/language_understandin
g_paper.pdf) (PDF). Retrieved October 10, 2022.
29. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (June 2019). "BERT: Pre-
training of Deep Bidirectional Transformers for Language Understanding" (https://aclantholo
gy.org/N19-1423). Proceedings of the 2019 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, Volume 1
(Long and Short Papers). Minneapolis, Minnesota: Association for Computational
Linguistics: 4171–4186. doi:10.18653/v1/N19-1423 (https://doi.org/10.18653%2Fv1%2FN19
-1423). S2CID 52967399 (https://api.semanticscholar.org/CorpusID:52967399).
30. Le, Quoc; Mikolov, Tomas (2014-06-18). "Distributed Representations of Sentences and
Documents" (https://proceedings.mlr.press/v32/le14.html). International Conference on
Machine Learning. PMLR: 1188–1196. arXiv:1405.4053 (https://arxiv.org/abs/1405.4053).
31. Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation
learning by predicting image rotations. (https://openreview.net/pdf?id=S1v4N2l0-) In ICLR,
2018.
32. Pathak, Deepak; Krahenbuhl, Philipp; Donahue, Jeff; Darrell, Trevor; Efros, Alexei A. (2016).
"Context Encoders: Feature Learning by Inpainting" (https://openaccess.thecvf.com/content_
cvpr_2016/html/Pathak_Context_Encoders_Feature_CVPR_2016_paper.html): 2536–2544.
arXiv:1604.07379 (https://arxiv.org/abs/1604.07379).
33. Chen, Ting; Kornblith, Simon; Norouzi, Mohammad; Hinton, Geoffrey (2020-11-21). "A
Simple Framework for Contrastive Learning of Visual Representations" (https://proceedings.
mlr.press/v119/chen20j.html). International Conference on Machine Learning. PMLR: 1597–
1607.
34. Mathilde, Caron; Ishan, Misra; Julien, Mairal; Priya, Goyal; Piotr, Bojanowski; Armand, Joulin
(2020). "Unsupervised Learning of Visual Features by Contrasting Cluster Assignments" (htt
ps://proceedings.neurips.cc/paper/2020/hash/70feb62b69f16e0238f741fab228fec2-Abstract.
html). Advances in Neural Information Processing Systems. 33. arXiv:2006.09882 (https://ar
xiv.org/abs/2006.09882).
35. Chen, Mark; Radford, Alec; Child, Rewon; Wu, Jeffrey; Jun, Heewoo; Luan, David;
Sutskever, Ilya (2020-11-21). "Generative Pretraining From Pixels" (https://proceedings.mlr.p
ress/v119/chen20s.html). International Conference on Machine Learning. PMLR: 1691–
1703.
36. Chen, Xinlei; He, Kaiming (2021). "Exploring Simple Siamese Representation Learning" (htt
ps://openaccess.thecvf.com/content/CVPR2021/html/Chen_Exploring_Simple_Siamese_R
epresentation_Learning_CVPR_2021_paper.html): 15750–15758. arXiv:2011.10566 (http
s://arxiv.org/abs/2011.10566).
37. Jean-Bastien, Grill; Florian, Strub; Florent, Altché; Corentin, Tallec; Pierre, Richemond;
Elena, Buchatskaya; Carl, Doersch; Bernardo, Avila Pires; Zhaohan, Guo; Mohammad,
Gheshlaghi Azar; Bilal, Piot; koray, kavukcuoglu; Remi, Munos; Michal, Valko (2020).
"Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning" (https://proceed
ings.neurips.cc/paper/2020/hash/f3ada80d5c4ee70142b17b8192b2958e-Abstract.html).
Advances in Neural Information Processing Systems. 33.
38. Cai, HongYun; Zheng, Vincent W.; Chang, Kevin Chen-Chuan (September 2018). "A
Comprehensive Survey of Graph Embedding: Problems, Techniques, and Applications" (http
s://ieeexplore.ieee.org/document/8294302). IEEE Transactions on Knowledge and Data
Engineering. 30 (9): 1616–1637. arXiv:1709.07604 (https://arxiv.org/abs/1709.07604).
doi:10.1109/TKDE.2018.2807452 (https://doi.org/10.1109%2FTKDE.2018.2807452).
ISSN 1558-2191 (https://www.worldcat.org/issn/1558-2191). S2CID 13999578 (https://api.se
manticscholar.org/CorpusID:13999578).
39. Grover, Aditya; Leskovec, Jure (2016-08-13). "node2vec: Scalable Feature Learning for
Networks" (https://doi.org/10.1145/2939672.2939754). Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '16.
New York, NY, USA: Association for Computing Machinery. 2016: 855–864.
doi:10.1145/2939672.2939754 (https://doi.org/10.1145%2F2939672.2939754). ISBN 978-1-
4503-4232-2. PMC 5108654 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5108654).
PMID 27853626 (https://pubmed.ncbi.nlm.nih.gov/27853626).
40. Veličković, P., Fedus, W., Hamilton, W. L., Liò, P., Bengio, Y., and Hjelm, R. D. Deep Graph
InfoMax. (https://openreview.net/pdf?id=rklz9iAcKQ) In International Conference on Learning
Representations (ICLR’2019), 2019.
41. Luo, Dezhao; Liu, Chang; Zhou, Yu; Yang, Dongbao; Ma, Can; Ye, Qixiang; Wang, Weiping
(2020-04-03). "Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning" (http
s://ojs.aaai.org/index.php/AAAI/article/view/6840). Proceedings of the AAAI Conference on
Artificial Intelligence. 34 (7): 11701–11708. doi:10.1609/aaai.v34i07.6840 (https://doi.org/10.
1609%2Faaai.v34i07.6840). ISSN 2374-3468 (https://www.worldcat.org/issn/2374-3468).
S2CID 209531629 (https://api.semanticscholar.org/CorpusID:209531629).
42. Humam, Alwassel; Dhruv, Mahajan; Bruno, Korbar; Lorenzo, Torresani; Bernard, Ghanem;
Du, Tran (2020). "Self-Supervised Learning by Cross-Modal Audio-Video Clustering" (http
s://proceedings.neurips.cc/paper/2020/hash/6f2268bd1d3d3ebaabb04d6b5d099425-Abstra
ct.html). Advances in Neural Information Processing Systems. 33. arXiv:1911.12667 (https://
arxiv.org/abs/1911.12667).
43. Xu, Dejing; Xiao, Jun; Zhao, Zhou; Shao, Jian; Xie, Di; Zhuang, Yueting (June 2019). "Self-
Supervised Spatiotemporal Learning via Video Clip Order Prediction" (https://ieeexplore.iee
e.org/document/8953292). 2019 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR). pp. 10326–10335. doi:10.1109/CVPR.2019.01058 (https://doi.org/10.1
109%2FCVPR.2019.01058). ISBN 978-1-7281-3293-8. S2CID 195504152 (https://api.sema
nticscholar.org/CorpusID:195504152).
44. Alexei, Baevski; Yuhao, Zhou; Abdelrahman, Mohamed; Michael, Auli (2020). "wav2vec 2.0:
A Framework for Self-Supervised Learning of Speech Representations" (https://proceedings.
neurips.cc/paper/2020/hash/92d1e1eb1cd6f9fba3227870bb6d7f07-Abstract.html).
Advances in Neural Information Processing Systems. 33. arXiv:2006.11477 (https://arxiv.org/
abs/2006.11477).
45. Zellers, Rowan; Lu, Jiasen; Lu, Ximing; Yu, Youngjae; Zhao, Yanpeng; Salehi,
Mohammadreza; Kusupati, Aditya; Hessel, Jack; Farhadi, Ali; Choi, Yejin (2022). "MERLOT
Reserve: Neural Script Knowledge Through Vision and Language and Sound" (https://open
access.thecvf.com/content/CVPR2022/html/Zellers_MERLOT_Reserve_Neural_Script_Kno
wledge_Through_Vision_and_Language_and_CVPR_2022_paper.html): 16375–16387.
arXiv:2201.02639 (https://arxiv.org/abs/2201.02639).
46. Radford, Alec; Kim, Jong Wook; Hallacy, Chris; Ramesh, Aditya; Goh, Gabriel; Agarwal,
Sandhini; Sastry, Girish; Askell, Amanda; Mishkin, Pamela; Clark, Jack; Krueger, Gretchen;
Sutskever, Ilya (2021-07-01). "Learning Transferable Visual Models From Natural Language
Supervision" (https://proceedings.mlr.press/v139/radford21a.html). International Conference
on Machine Learning. PMLR: 8748–8763. arXiv:2103.00020 (https://arxiv.org/abs/2103.0002
0).
47. Ramesh, Aditya; Dhariwal, Prafulla; Nichol, Alex; Chu, Casey; Chen, Mark (2022-04-12).
"Hierarchical Text-Conditional Image Generation with CLIP Latents". arXiv:2204.06125 (http
s://arxiv.org/abs/2204.06125) [cs.CV (https://arxiv.org/archive/cs.CV)].