
A Survey of Deep Learning: From Activations to Transformers

Johannes Schneider1 and Michalis Vlachos2


1University of Liechtenstein, Vaduz, Liechtenstein
2University of Lausanne, Lausanne, Switzerland
[email protected], [email protected]

Keywords: survey, review, deep learning, architectures, layers, transformer, graphs

Abstract: Deep learning has made tremendous progress in the last decade. A key success factor is the large amount
of architectures, layers, objectives, and optimization techniques. They include a myriad of variants related to
attention, normalization, skip connections, transformers and self-supervised learning schemes – to name a few.
We provide a comprehensive overview of the most important, recent works in these areas to those who already
have a basic understanding of deep learning. We hope that a holistic and unified treatment of influential,
recent works helps researchers to form new connections between diverse areas of deep learning. We identify
and discuss multiple patterns that summarize the key strategies for many of the successful innovations over
the last decade as well as works that can be seen as rising stars. We also include a discussion on recent
commercially built, closed-source models such as OpenAI’s GPT-4 and Google’s PaLM 2.

1 Introduction

Deep learning is widely regarded as the driving force behind artificial intelligence. Its models have achieved top leaderboard rankings in various fields, including computer vision, speech, and natural language processing. One of the major advantages of deep learning is its layered, modular structure, which allows for the construction of models from individual components in a flexible manner. Researchers have created a large selection of layers, architectures, and objectives. Keeping up with the ongoing developments in the various aspects of deep learning is a difficult task. Although specific surveys are available, there is currently no comprehensive overview of recent progress covering multiple aspects of deep learning such as learning, layers, and architecture. There exist multiple reviews with a narrow focus, such as large language models (e.g., (Min et al., 2021)) and convolutional neural networks (e.g., (Khan et al., 2020)). Previous studies (Alom et al., 2019; Shrestha and Mahmood, 2019; Dong et al., 2021; Alzubaidi et al., 2021) with a wider focus have often overlooked new developments such as transformers and self-supervised learning. However, taking a more comprehensive and more holistic look at various disciplines can be extremely advantageous: for example, NLP and computer vision have often influenced each other; CNNs were initially introduced in computer vision but were later applied in NLP, while transformers were introduced in NLP and later adapted in computer vision. Therefore, removing barriers between disciplines can be highly beneficial. This paper takes up this motivation by surveying the recent progress of deep learning from a holistic standpoint, rather than focusing on a particular niche area. We also believe that this is a necessary step, since major innovations have slowed down, i.e., most architectures are now based on the transformer architecture, which dates back to 2017 (Vaswani et al., 2017).

It is difficult, if not impossible, to provide an encompassing overview of the field due to the sheer number of articles published yearly and the continual increase in relevant topics, such as transformers and self-supervised learning, that have become popular only recently. Our strategy is to choose influential works through (i) usage statistics and (ii) specialized surveys. We also offer an invigorating discussion of shared design patterns across areas that have been successful.

2 Overview

Figure 1: Categorization of deep learning and areas covered in the survey

Figure 1 provides an overview of the areas included in this survey. We have investigated deep learning design, including objectives and training. We have also given special attention to works that have been
somewhat established based on the usage statistics from the popular platform "Paperswithcode.com". There has been an increase in these types of platforms that enable the upload of papers (and models) and provide information on citations, as well as leaderboards. Although there are drawbacks when utilizing data from these platforms, we believe that it offers a new perspective compared to traditional survey methods that often select more arbitrarily. We have only included a selection of the most influential works published from 2016 onwards, as well as rising stars (from 2020 or newer) that have gained significant popularity in a short time.

The extent to which each topic is covered depends on the amount of recent research that has been conducted and its foundational nature. We do not discuss data or computational aspects such as data augmentation, model compression, and distributed machine learning. As a result of limited space, we had to be selective when it came to model families and left out relevant ones such as multi-modal models and autoencoders.

3 Loss functions and Optimization

We discuss common loss functions and optimizers.

3.1 Loss Functions

Loss functions (surveyed in (Wang et al., 2020)) often consist of multiple terms that are enhanced with a regularization term. Loss functions are often task-specific, but some general ideas are applicable across tasks. Commonly, multiple loss terms are aggregated in a weighted manner. Many papers improve prior work (simply) by using a different loss function.

The Triplet Loss (Dong and Shen, 2018) was introduced for Siamese networks (its origin dates back further (Schultz and Joachims, 2003)). The high-level idea is to compare a given input to a positive and a negative input and maximize association between positively associated inputs, while minimizing those of negative ones. It takes input pairs (x, y), each processed by a separate but identical network. It maximizes the joint probability p(x, y) of all pairs (x, y):

L(V_p, V_n) = −(1/(|V_p| · |V_n|)) ∑_{x∈V_p} ∑_{y∈V_n} log p(x, y)   (1)
            = −(1/(|V_p| · |V_n|)) ∑_{x∈V_p} ∑_{y∈V_n} log(1 + e^{x−y})   (2)

Here, V_p and V_n are the positive and negative score set, respectively.

Focal Loss (Lin et al., 2017) focuses learning on hard misclassified samples by altering the cross-entropy loss. It adds a factor (1 − p)^γ, where p denotes the probability of a sample stemming from the cross-entropy loss and γ is a tunable parameter:

L(p) = −(1 − p)^γ log(p)   (3)
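As a small illustration, the focal loss of Eq. (3) can be written in a few lines; the sketch below is a minimal PyTorch version for multi-class classification (the original paper additionally uses a class-balancing weight α, omitted here for brevity).

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss of Eq. (3), averaged over the batch.

    logits:  (N, C) unnormalized scores; targets: (N,) class indices.
    The (1 - p)^gamma factor down-weights well-classified samples so that
    training concentrates on hard, misclassified ones.
    """
    log_p = F.log_softmax(logits, dim=-1)                       # log-probabilities for every class
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)   # log p of the true class
    pt = log_pt.exp()                                           # p of the true class
    return (-(1.0 - pt) ** gamma * log_pt).mean()
```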
The Cycle Consistency Loss (Zhu et al., 2017) is tailored towards unpaired image-to-image translation with generative adversarial networks. For two image domains X and Y, the loss supports the learning of mappings G : X → Y and F : Y → X so that one reverses the other, i.e., F(G(x)) ≈ x and G(F(y)) ≈ y.

L(G, F) = E_{x∼p_data(x)}[||F(G(x)) − x||_1]   (4)
        + E_{y∼p_data(y)}[||G(F(y)) − y||_1]   (5)

The Supervised Contrastive Loss (Khosla et al., 2020) pulls together clusters of points of the same class in embedding space and pushes samples of different classes apart. It aims at leveraging label information more effectively than cross-entropy loss.

L_i^sup = −(1/(2N_ỹi − 1)) ·   (6)
          ∑_{j=1}^{2N} 1_{i≠j} · 1_{ỹi=ỹj} · log( exp(z_i · z_j/τ) / ∑_{k=1}^{2N} 1_{i≠k} · exp(z_i · z_k/τ) )   (7)
where N_ỹi is the total number of images in the mini-batch that have the same label ỹi as the anchor i. The total loss is the sum over the loss of all anchors i, i.e., L = ∑_i L_i^sup. The loss has important properties well suited for supervised learning:
• generalization to an arbitrary number of positives
• contrastive power increases with more negatives.

3.2 Regularization

Regularization techniques in machine learning (surveyed in (Moradi et al., 2020)) have proven very helpful for deep learning. Explicit regularization adds a loss term R(f) for a network f to the loss function L(x) for data (x_i, y_i) with a trade-off parameter λ:

min_f ∑_i L(x_i, y_i) + λ R(f)   (8)

Implicit regularization is all other regularization, e.g., early stopping or using a robust loss function. Classical L2-regularization and dropout (Srivastava et al., 2014), where activations of a random set of neurons are set to 0, are among the most widely used regularization techniques.

R1 Regularization (Mescheder et al., 2018) is used to penalize the discriminator in generative adversarial networks based on the gradient with the goal of stabilizing training:

R1(ψ) = (γ/2) · E_{p_D(x)}[||∇ D_ψ(x)||²]   (9)

Technically, the regularization term penalizes gradients orthogonal to the data manifold.
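In practice, the R1 penalty of Eq. (9) is computed with an extra backward pass through the discriminator on real data only; a minimal PyTorch sketch (assuming a discriminator D that returns one logit per sample) follows.

```python
import torch

def r1_penalty(D, real_images, gamma=10.0):
    """R1 gradient penalty of Eq. (9): (gamma/2) * E[ ||grad_x D(x)||^2 ] on real data."""
    real_images = real_images.detach().requires_grad_(True)
    scores = D(real_images).sum()   # summing lets one backward pass yield per-sample input gradients
    (grads,) = torch.autograd.grad(scores, real_images, create_graph=True)
    penalty = grads.flatten(start_dim=1).pow(2).sum(dim=1).mean()
    return 0.5 * gamma * penalty    # add this term to the discriminator loss
```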
Entropy Regularization (Mnih et al., 2016) aims at fostering diversity. Specifically, in asynchronous methods for deep reinforcement learning (Williams and Peng, 1991; Mnih et al., 2016), it ensures diversity of actions in reinforcement learning, i.e., it prevents over-optimization towards a small fraction of the environment. The entropy is simply computed over the probability distribution of actions given by the policy π(x) as:

H(x) = −∑_x π(x) · log π(x)   (10)

Path Length Regularization (Karras et al., 2020a) for generative adversarial networks aims at ensuring that a fixed-size step length in the latent space matches a fixed-magnitude change in the image. The idea is to encourage that a fixed-size step in the latent space W results in a non-zero, fixed-magnitude change in the image. The goal is to ensure better conditioning of GANs, simplifying architecture search and generator inversion. Gradients with respect to w ∈ W stemming from random directions in the image space should be almost equal in length, independent of w or the image space direction. The local metric scaling characteristics of the generator g : W → Y are captured by the Jacobian matrix J_w = δg(w)/δw. The regularizer becomes:

E_{w, y∼N(0,I)} [ (||J_w^T y||_2 − a)² ]   (11)

where y are random images with normally distributed pixel values, and w ∼ f(z), where z is normally distributed. The constant a is the exponential moving average of ||J_w^T y||_2. The paper further avoids the computationally expensive, explicit computation of the Jacobian.

DropBlock (Ghiasi et al., 2018) drops correlated areas of feature maps rather than selecting features to drop independently. This is especially suitable for convolutional neural networks, where feature maps exhibit spatial correlation and a (real-world) feature often corresponds to a contiguous spatial area in feature maps.

3.3 Optimization

Optimization (surveyed in (Sun, 2020)) is the process of estimating all network parameters so that the loss function is minimized. The two most widely known techniques are stochastic gradient descent (SGD) and Adam. Neither strictly outperforms the other in all cases in terms of generalization performance. SGD dates back at least to the 1950s (Kiefer and Wolfowitz, 1952), while Adam stems from 2014 (Kingma and Ba, 2014).

Adafactor (Shazeer and Stern, 2018) reduces the memory needs of the Adam optimizer by maintaining only row- and column-wise statistics of parameter matrices rather than per-element information.

Layerwise adaptive large batch optimization (LAMB) (You et al., 2019) builds on Adam and accelerates training using large mini-batches. It performs per-dimension and layerwise normalization.

Two Time-scale Update Rule (TTUR): For generative adversarial networks trained with stochastic gradient descent, TTUR (Heusel et al., 2017) uses a separate learning rate for the discriminator and generator. For a fixed generator, the discriminator reaches a local minimum. This still holds if the generator converges slowly, e.g., using a small(er) learning rate. This helps the convergence of the GAN and can improve performance, since the generator captures the feedback of the discriminator more profoundly before pushing it into new regions.
Decoupled Weight Decay Regularization for ADAM: AdamW (Loshchilov and Hutter, 2017) is built on a simple observation and implementation. The original Adam optimizer changes weights due to (L2-)regularization after the computation of the gradients for Adam. But intuitively, moving averages of gradients should not include regularization.

RAdam and AMSGrad: Both techniques tackle the convergence problem of Adam. Rectified Adam (Liu et al., 2019a) rectifies the variance of the adaptive learning rate, which is large initially. Thus, similar to the warm-up heuristic, small initial learning rates can help. AMSGrad (Reddi et al., 2019) uses the maximum of past squared gradients rather than the exponential average.

Stochastic Weight Averaging: Simple averaging of weights from different epochs during stochastic gradient descent with a constant or cyclic learning rate improves performance (Izmailov et al., 2018).

Sharpness-Aware Minimization (Foret et al., 2020) minimizes loss value and sharpness, which improves generalization. It finds parameters with neighborhoods of low loss value (rather than parameters that only themselves have a low loss value). The loss is:

min_w max_{||ε||_p ≤ ρ} L(w + ε)   (12)
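The two-step update behind Eq. (12) can be sketched compactly; the code below is a simplified, unofficial PyTorch sketch of one SAM training step (a single ascent step to w + ε, then a descent step using the gradient at the perturbed point), assuming a model, a loss function, and a base optimizer are given.

```python
import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    """One simplified SAM update: ascend to w + eps (inner max of Eq. 12), then descend from there."""
    # 1) gradient at w, and perturbation eps = rho * g / ||g||
    loss_fn(model(x), y).backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2) + 1e-12
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            eps = rho * p.grad / grad_norm
            p.add_(eps)                    # move to w + eps
            perturbations.append((p, eps))
    model.zero_grad()
    # 2) gradient at the perturbed point, then undo the perturbation and take the real step
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, eps in perturbations:
            p.sub_(eps)                    # back to w
    base_optimizer.step()                  # update w with the sharpness-aware gradient
    model.zero_grad()
```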
4 Self-, Semi-supervised and Contrastive Learning

Semi-supervised learning leverages a large amount of unlabelled data based on a small amount of labeled data (see (Yang et al., 2022) for a survey). Self-supervised learning benefits from self-generated (pseudo-)labels stemming from artificial tasks. Both reduce the burden of collecting (human) labeled data. Self-supervised (pre-)training combined with fine-tuning on a (small) human-annotated dataset can lead to state-of-the-art results. The paradigm has grown extensively in recent years (surveyed in (Ericsson et al., 2022)). It is commonly combined with contrastive learning. In contrastive learning, the goal is to learn to distinguish between similar and dissimilar data. Since data can be automatically distorted to different extents, creating "pseudo-labeled" data for self-supervised learning can be straightforward.

The simple framework for contrastive learning (SimCLR) (Chen et al., 2020) maximizes agreement between two inputs that result from augmenting the same data sample differently. Augmentations can be random cropping, color distortions, and Gaussian blur. To obtain representation vectors, a standard ResNet (He et al., 2016) is used. Representations are further processed using a simple MLP before the contrastive loss is applied.

Bootstrap Your Own Latent (BYOL) (Grill et al., 2020) uses an online and a target network. Both have the same architecture, consisting of an encoder, a projector, and a predictor, but they do not share weights. The target network's parameters are an exponential moving average of the online network's parameters. The online network has to predict the target network's representation given an augmentation of the (same) input.
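The exponential-moving-average target update at the heart of BYOL (and of MoCo's key encoder below) is only a few lines; a minimal sketch, assuming online and target networks with identical architectures:

```python
import torch

@torch.no_grad()
def ema_update(target_net, online_net, tau=0.996):
    """Target parameters trail the online parameters: theta_t <- tau * theta_t + (1 - tau) * theta_o."""
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.mul_(tau).add_((1.0 - tau) * p_o)
```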
Barlow Twins (Zbontar et al., 2021) rely on an objective function that aims to bring the cross-correlation C between outputs for a set of images Y^A and their distorted versions Y^B as close to the identity as possible, i.e., the loss (including λ as a tuning parameter) is:

L = ∑_i (1 − C_{i,i})² + λ · ∑_i ∑_{j≠i} C_{i,j}²   (13)
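Eq. (13) translates almost directly into code; a short, simplified sketch (embeddings standardized per dimension, cross-correlation computed over the batch):

```python
import torch

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Eq. (13): push the cross-correlation of two embedding batches towards the identity matrix.

    z_a, z_b: (N, D) embeddings of two augmented views of the same N images.
    """
    n, d = z_a.shape
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)    # standardize per dimension
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    c = (z_a.T @ z_b) / n                               # (D, D) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lam * off_diag
```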
lar data. Since data can be automatically distorted to
version of the image.
different extents, creating “pseudo-labeled” data for
self-supervised learning can be straightforward.
The simple framework for contrastive learning
(SimCLR)(Chen et al., 2020) maximizes agreement 5 Architectures and Layers
between two inputs that result from augmenting the
same data sample differently. Augmentation can We elaborate on four important layers types, i.e.,
be random cropping, color distortions, and Gaus- activation-, skip-, normalization-, and attention lay-
sian blur. To obtain reprsentation vectors, a standard ers followed by numerous contemporary architectures
based on transformers as well as graph neural networks.

5.1 Activation

Activation functions are usually non-linear. They have a profound impact on gradient flow and, thus, on learning. Early activation functions commonly used from the 1960s throughout the early 2000s, such as sigmoid and tanh, make training deep networks difficult due to the vanishing gradient when these functions saturate. The introduction of the rectified linear unit ReLU in 2010 (Nair and Hinton, 2010) marked a breakthrough result. While its original version is still commonly used, transformer architectures have popularized other activation functions and ReLU variants. Most of them still share qualitatively the behavior of ReLU, i.e., for negative inputs, outputs are of small magnitude, and for positive inputs, they are unbounded (see (Apicella et al., 2021) for a survey).

Gaussian Error Linear Units (GELU) (Hendrycks and Gimpel, 2016) weigh inputs by their percentile (ReLUs only use the sign). The activation is the product of the input and the standard Gaussian cumulative distribution function Φ(x), i.e.,

GELU(x) = x · Φ(x)   (14)

The Mish activation (Misra, 2019) originates from systematic experimentation inspired by Swish and ReLU:

f(x) = x · tanh(soft+(x))   (15)
with soft+(x) := ln(1 + e^x)   (16)

In comparison, the Swish activation (Ramachandran et al., 2017) is:

f(x) = x · sigmoid(βx)   (17)

Here β is a learnable parameter.
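For reference, the three activations of Eqs. (14)-(17) in a few lines of PyTorch (GELU is also available built-in as torch.nn.functional.gelu):

```python
import torch

def gelu(x):              # Eq. (14): input weighted by the standard Gaussian CDF
    return x * 0.5 * (1.0 + torch.erf(x / 2.0 ** 0.5))

def swish(x, beta=1.0):   # Eq. (17); beta can be made a learnable parameter
    return x * torch.sigmoid(beta * x)

def mish(x):              # Eqs. (15)-(16): x * tanh(softplus(x))
    return x * torch.tanh(torch.nn.functional.softplus(x))
```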
5.2 Skip connections

Skip connections originate from residual networks (He et al., 2016). In the simplest form, the output y for an input x of a single layer L (or a set of a few layers) with a skip connection is y(x) = L(x) + x. The original paper used the term residual since the layer L has to learn a residual L(x) = H(x) − x rather than the desired mapping H itself. Since then, skip connections have been used in many variations.
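A minimal residual block in PyTorch, just to make the y(x) = L(x) + x pattern concrete (a simplified sketch, not the exact block of (He et al., 2016)):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = L(x) + x with L = two 3x3 convolutions; the channel count is kept fixed for simplicity."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + x)   # the skip connection
```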
Inverted Residual Block (Sandler et al., 2018): By inverting the channel width to a narrow-wide-narrow layer sequence, instead of the original wide-narrow-wide order (He et al., 2016), in combination with depthwise convolutions for the wide layer, parameters are reduced and residual blocks execute faster.

A Dense Block (Huang et al., 2017) receives inputs from all prior layers (with matching feature-map sizes) and connects to all subsequent layers (with matching feature-map sizes).

ResNeXt Block (Xie et al., 2017): This split-transform-merge approach for residual blocks entails evaluating multiple residual blocks in parallel and aggregating them back into a single output.

5.3 Normalization

Since the introduction of batch normalization (Ioffe and Szegedy, 2015), normalization has been a very successful concept in improving training speed, stability, and generalization of neural networks. However, its need is debated (Shao et al., 2020), e.g., for some applications careful initialization and adjustments of learning rates might make it at least partially redundant. The idea of normalization is to transform a value x to a normalized value x̃ by subtracting the mean µ and scaling by the standard deviation σ, i.e., x̃ = (x − µ)/σ. Normalization approaches differ in the computation of µ and σ, e.g., µ and σ can be computed across different channels.

Layer Normalization: Given summed inputs, normalization statistics are computed (Ba et al., 2016) for a layer L with |L| neurons as:

µ = (1/|L|) ∑_{i=0}^{|L|−1} a_i        σ = sqrt( (1/|L|) ∑_{i=0}^{|L|−1} (a_i − µ)² )   (18)

In contrast to batch normalization, it poses no restrictions on batch size and also no dependencies between batches. In particular, it can be used with batch size 1.
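Layer normalization per Eq. (18), written out explicitly (PyTorch ships this as torch.nn.LayerNorm; the sketch omits the usual learnable gain and bias):

```python
import torch

def layer_norm(a, eps=1e-5):
    """Normalize the summed inputs of one layer; a has shape (..., |L|)."""
    mu = a.mean(dim=-1, keepdim=True)                           # mean over the layer's neurons
    sigma = a.var(dim=-1, keepdim=True, unbiased=False).sqrt()  # std over the layer's neurons
    return (a - mu) / (sigma + eps)
```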
Instance Normalization (Ulyanov et al., 2016) computes, for a 4-dimensional input such as an image with height H, width W, channels C, and batch size T:

µ_{t,c} = (1/(HW)) ∑_{w<W, h<H} x_{t,c,w,h}   (19)
σ_{t,c} = sqrt( (1/(HW)) ∑_{w<W, h<H} (x_{t,c,w,h} − µ_{t,c})² )   (20)

It can be used, e.g., to normalize contrast for an image. There exist multiple versions of it, e.g., a version that scales based on weight norms (Karras et al., 2020b).

LayerScale (Touvron et al., 2021) has been introduced in the context of transformers as a per-channel multiplication of the outputs of a residual block with a diagonal matrix:

x_l' = x_l + diag(λ_1, ..., λ_d) · SA(η(x))   (21)
x_{l+1} = x_l' + diag(λ_1, ..., λ_d) · FFN(η(x))   (22)

SA is the self-attention layer, FFN is the feed-forward network, and η the layer-normalisation (see Figure 2).

5.4 Attention

Attention mechanisms (surveyed in (Brauwers and Frasincar, 2021; Guo et al., 2022b)) allow for learning relevance scores for inputs, similar to how cognitive attention works. Some parts of the inputs can be deemed highly important, while others are disregarded as irrelevant. The relevance of a particular input can often be determined by contextual information, e.g., the relevance of a word in a text document often depends on nearby words.

Scaled Dot-Product Multi-Head Attention (Vaswani et al., 2017): Using dot products combined with down-scaling has proven very successful in computing attention scores. Attention takes a query Q, a key K, and a value V as inputs and outputs an attention score:

Att(Q, K, V) = softmax( QK^T / sqrt(d_k) ) · V   (23)

Using multiple, independent attention mechanisms in parallel allows attending to various aspects of the input. Formally, in multi-head attention, we learn matrices W:

MultiHead(Q, K, V) = [h_0, ..., h_{n−1}] W_0   (24)
where head h_i = Att(QW_i^Q, KW_i^K, VW_i^V)   (25)
i , KWi , VWi ) (25) learned.
Factorized (Self-)Attention (Child et al., 2019) re- Sliding Window Attention(Beltagy et al., 2020)
duces the computational and memory footprint of at- aims at improving the time and memory complexity
tention. While (full) self-attention(Vaswani et al., of attention. It reduces the number of considered in-
2017) allows attending to every prior input element, put pairs. More precisely, for a given window size w
factorized self-attention allows only to attend to a sub- each token attends to w2 tokens on each side.
set thereof. Formally, an output matrix is computed
given a matrix of input embeddings X and the con- 5.5 Transformers
nectivity pattern S = {S1 , ..., Sn }, where Si is the set
of indices of input vectors attended to by the ith out- Transformers have quickly become the dominant ar-
put vector. chitecture in deep learning. Combined with self-
FacAtt(X, S) = (A(xi , Si ))i∈[1,n] (26) supervised training on large datasets, they have
reached state-of-the-art on many benchmarks in
(Wq xi )KSTi NLP(see (Liu et al., 2023) for a survey) and computer
a(xi , Si ) = softmax( √ ) ·VSi (27)
d vision (surveyed in (Han et al., 2022; Khan et al.,
KSi = (Wk x j ) j∈Si VSi = (Wv x j ) j∈Si (28) 2022)). Since their introduction in 2017(Vaswani
et al., 2017) countless versions have emerged that
For full self-attention SiF := { j| j ̸= i} (indexes to prior tackle issues of the original transformer such as com-
inputs to i). In contrast, factorized self-attention has p putational overhead and data efficiency.
Transformers are said to have less inductive bias and are, in turn, more flexible than other architectures, such as convolutional neural networks and recurrent networks. Thus, they also require more training data to compensate for the lack of inductive bias. Since large amounts of labeled data are difficult to obtain, transformers are commonly trained using self-supervised learning, i.e., pseudo-labels. The original transformer (Vaswani et al., 2017), developed for natural language processing, employs an encoder and a decoder like earlier recurrent neural networks. It stacks multiple transformer blocks on top of each other, as illustrated in Figure 2. Key elements are multi-head attention, layer normalization, and skip connections. Furthermore, positional encodings and embeddings of inputs play an important role. The absolute positional encoding PE for position pos in (Vaswani et al., 2017) uses sine and cosine functions varying in frequency:

PE(pos, 2i) = sin(pos/10000^{2i/d})   (34)
PE(pos, 2i + 1) = cos(pos/10000^{2i/d})   (35)

where i is the dimension of the encoding and d is the number of dimensions. The choice was motivated by the fact that relative positions, which might be equally relevant as absolute ones, are a linear function of absolute position encodings.
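The sinusoidal encodings of Eqs. (34)-(35) can be precomputed once; a short sketch (assumes an even d_model):

```python
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions 0, 2, 4, ...
    angle = pos / torch.pow(10000.0, i / d_model)                   # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe                                                       # added to the input embeddings
```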
Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) yields contextual word embeddings using the encoder of the transformer architecture. It relies on a masked language model pre-training objective and self-supervised learning. The model must predict randomly chosen, masked input tokens given their context. Thus, the model has bidirectional information, i.e., it is fed tokens before and after the masked words. In classical next-word prediction, no tokens after the word to predict are given. As a second prediction task, the model must predict whether a sentence pair (A, B) consists of two consecutive sentences A and B within some document (or two possibly unrelated sentences). The pre-trained model based on self-supervised training can be fine-tuned for downstream tasks using labeled data.

The original BERT model has since been improved in many ways, e.g., (Sanh et al., 2019) reduced the computational burden of BERT, and (Liu et al., 2019b) trained models longer, on longer sequences, with bigger batches over more data, etc. This led to more robust and generalizable representations.

GPT to GPT-3 on to ChatGPT and GPT-4: GPT is based on the decoder of a transformer to predict tokens sequentially. GPT (Radford et al., 2018) first performs pre-training in an unsupervised way before applying supervised fine-tuning. Pre-training takes place on a large corpus of tokens U = (u_0, u_1, ..., u_{n−1}) by maximizing the (log-)likelihood of the next token given prior tokens:

L(U) = ∑_i log p(u_i | u_{i−k}, ..., u_{i−1})   (36)

where k is the size of the context window and the conditional probability is modeled using a neural network, i.e., using a multi-layer transformer decoder (Liu et al., 2018) obtained by dropping the encoder in (Vaswani et al., 2017). Rather than only predicting the next token given an input, the model is also trained to predict input tokens. Furthermore, the memory footprint of attention is lowered. GPT-2 (Radford et al., 2019) builds on GPT with few modifications, e.g., layer normalization locations were changed (moved to the input of each sub-block, and an extra normalization was added after the final self-attention block), the initialization of residual weights was scaled, and the vocabulary, context, and batch size were increased. GPT-3's (Brown et al., 2020) architecture is almost identical to that of GPT-2, but the number of parameters is more than 100 times larger and it differs in the (amount of) training data.
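In practice, the objective of Eq. (36) is implemented as a cross-entropy loss over shifted token sequences; a hedged sketch, assuming a decoder-only model that maps token ids to per-position logits:

```python
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Autoregressive LM loss: predict token u_i from u_<i (Eq. 36 as a negative log-likelihood).

    token_ids: (batch, seq_len) integer tensor.
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]   # shift by one position
    logits = model(inputs)                                  # assumed shape: (batch, seq_len - 1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```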
ChatGPT (OpenAI, 2022) is a sibling to InstructGPT (Ouyang et al., 2022), which is optimized towards following user intentions. InstructGPT applies fine-tuning of GPT-3 in a two-step process: (i) based on labeler demonstrations through supervised learning and (ii) based on human rankings of model outputs using reinforcement learning. ChatGPT follows the same procedure, i.e., (i) for supervised learning, human AI trainers provided conversations by playing both the human user and the AI assistant. The resulting dialogue dataset was enhanced with the InstructGPT dataset, which was transformed into a dialogue format. (ii) Conversations of AI trainers with ChatGPT were ranked, i.e., for a randomly selected model-written message, AI trainers ranked several alternative completions. The ranking dataset was used for reinforcement learning. The process was repeated multiple times.

Technical details of the successor of ChatGPT, i.e., GPT-4, have not been disclosed (OpenAI, 2023). The provided technical report indicates that it is similar to ChatGPT. GPT-4 is multi-modal, i.e., it can also process images; however, details are unknown. The report only points towards major improvements in training efficiency. The accomplishment was to predict the performance of large-scale models using the performance of small models (possibly trained on less data). This is highly important, as computational costs and time can be a key factor for large deep learning models.
Figure 2: Transformer with the four basic blocks on top and the encoder and decoder at the bottom

Text-to-Text Transfer Transformer (T5) (Raffel et al., 2020) views every text-based language task as generating an output text from a given input text. It differs from BERT (Devlin et al., 2018) by using causal masking during training for predicting the target. Causal masking prevents the network from accessing "future" tokens of the target. T5 also differs in pre-training tasks.

BART (Lewis et al., 2020) is a denoising autoencoder for pre-training sequence-to-sequence models that uses a standard transformer-based machine translation architecture. It has been shown to be effective for language generation, translation, and comprehension. Training is based on corrupting text with noising functions ranging from token deletion and masking to sentence permutation and document rotation. Learning stems from reconstructing the original text from its corrupted version. The flexibility in noising operations is attributed to BART's generalization of prior works such as BERT and GPT, i.e., the encoder is bi-directional (like BERT), while the decoder is autoregressive (like GPT).

XLNet (Yang et al., 2019) combines advantages of autoregressive modeling like GPT, predicting the next token, and denoising auto-encoding like BERT (Devlin et al., 2018), reconstructing x given a noisy input x̂ that originates through masking words of x. It does so by using a permutation language model that samples a permutation Z = (z_0, z_1, ..., z_{T−1}) of the sequence (0, 1, 2, ..., T − 1), leading to the objective:

max p(u_{z_T} | u_{z_0}, ..., u_{z_{T−1}})   (37)

There is no actual permutation of inputs, which would be unnatural (and not occurring during later fine-tuning tasks). Rather, the permutation impacts the attention mask to ensure that the factorization order given by Z is maintained.

The Vision Transformer (Dosovitskiy et al., 2020) relies heavily on the original transformer. An image is partitioned into small patches, which are flattened and linearly embedded with position embeddings. A standard transformer encoder then processes the created vector of each patch.

The Swin Transformer (Liu et al., 2021) for computer vision builds hierarchical feature maps rather than just a single (resolution) feature map. It also only computes self-attention within a local window, reducing computation time.

PaLM (2): The original PaLM (Chowdhery et al., 2022) is a large language model consisting of 540 billion parameters, similar to other prominent models such as GPT-3. The technical innovation discussed is mostly on the scaling of model training, i.e., a single model can be trained efficiently across tens of thousands of accelerator chips. The original transformer architecture (Vaswani et al., 2017) is also adjusted slightly: e.g., SwiGLU activations are used, i.e.,

Swish(xW) · xV   (38)

where Swish is given by Eq. 17; different positional embeddings (better for long sequences), multi-query attention (faster computation), no biases (better training stability), and shared input-output embeddings are also used.
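A SwiGLU feed-forward unit per Eq. (38), as a small sketch (W and V are two separate linear projections of the same input; biases omitted, matching the no-bias choice mentioned above):

```python
import torch.nn as nn

class SwiGLU(nn.Module):
    """Gated feed-forward unit: Swish(x W) * (x V), cf. Eq. (38)."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w = nn.Linear(d_model, d_hidden, bias=False)
        self.v = nn.Linear(d_model, d_hidden, bias=False)

    def forward(self, x):
        return nn.functional.silu(self.w(x)) * self.v(x)   # silu == Swish with beta = 1
```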
PaLM 2 (Google, 2023) is the better-performing successor of PaLM that differs in terms of dataset mixtures, e.g., using more diverse languages as well as domains (e.g., programming languages, mathematics). It also uses the classical transformer architecture. However, it uses a smaller model than the first PaLM version but more training compute. It also relies on
more diverse pre-training objectives (than simple next-word or masked-word prediction).

5.6 Graph Neural Networks

Graph neural networks (surveyed in (Wu et al., 2020)) can be seen as a generalization of CNNs and transformers. They operate on graph data, i.e., nodes connected with edges. We discuss graph models, including models to obtain node embeddings that can be used for downstream tasks.

Graph Convolutional Networks (Kipf and Welling, 2016) use CNNs for semi-supervised learning. They approximate spectral graph convolutions using polynomials of order k, which a CNN can compute with k linear layers.
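A single graph-convolution layer in the spirit of (Kipf and Welling, 2016) boils down to neighborhood averaging followed by a linear map; a simplified dense-matrix sketch (the original uses the symmetrically normalized adjacency with self-loops, usually in sparse form):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """H' = ReLU( D^{-1/2} (A + I) D^{-1/2} H W ) with a dense adjacency matrix A."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, adj):
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)   # add self-loops
        d_inv_sqrt = a_hat.sum(dim=1).clamp(min=1).pow(-0.5)
        a_norm = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return torch.relu(a_norm @ self.linear(h))                # aggregate neighbors, then transform
```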
Graph Attention Networks (Veličković et al., 2017) rely on masked self-attention layers allowing nodes to attend flexibly over their neighborhoods' features, i.e., node j obtains importance scores for node i's features. Masking allows considering only edges between node pairs that are actually connected. In contrast to GCNs, different importances for nodes in the same neighborhood can be assigned. Also, it does not rely on costly matrix operations for eigendecompositions.

Graph Transformer (Dwivedi and Bresson, 2020) extends the original transformer to graphs by using attention over the neighborhood connectivity of each node, generalizing the position encoding, replacing layer- with batch-normalization, and learning edge representations (in addition to node representations).

TuckER (Balažević et al., 2019) performs factorization for link prediction in knowledge graphs. Knowledge is represented as (subject, relation, object) triplets, and the task is to predict whether two entities are related. The graph can be represented as a binary tensor with the subjects, relations, and objects as dimensions. They use Tucker decompositions to decompose the binary tensor into a product of a core matrix and embedding matrices for subjects, relations, and objects.

Embedding by Relational Rotation (RotatE) (Sun et al., 2019) performs missing-link prediction in knowledge graphs (like the priorly described TuckER (Balažević et al., 2019)) and models more relational properties such as composition and inversion. They embed entities into a complex space and treat the relation as an element-wise rotation that is optimized to lead from one entity to the other.

Scalable Feature Learning for Networks (Node2Vec) (Grover and Leskovec, 2016) learns feature vectors that preserve a node's neighborhood. They use random walks to generate sample neighborhoods; thereby, nodes are viewed based on their role or the communities they belong to.

6 Discussion

Our survey focused on key design elements in building deep learning models. Taking a practical approach, we chose to ignore theoretical works, which should be further explored in future studies. Our findings suggest that despite many small and creative innovations since the original transformer architecture, there have not been any significant "breakthrough" discoveries that have led to much better leaderboard results. The last few years have been characterized by the enlargement of existing networks such as GPT, the increase of data volume (and quality), and a shift towards self-supervised learning. This could indicate a need for more daring approaches to research rather than incremental improvements of existing works. Combining different elements as outlined in this work could be one way to achieve this.

In addition, we noted a few general patterns that have proven effective in many areas:
• "Multi-X", i.e., using the same element multiple times in parallel, such as using multiple residual blocks (ResNeXt) or multi-head attention. This idea is also closely related to "ensemble learning".
• "Higher-order layers", i.e., classical CNNs and MLPs only apply linear layers and simple ReLUs, but layers like Mish or attention layers perform more complex operations.
• "Moving average", i.e., averaging weights, such as for SGD (stochastic weight averaging) and BYOL.
• "Decompose", i.e., decomposing matrices, such as for TuckER and large kernel attention.
• "Weighing functions", i.e., using parameterized weighing functions of inputs, as can be seen within the attention mechanism but also for GELU units. Therefore, rather than naively aggregating inputs, inputs are weighed and aggregated. The weight might stem from a function with learnt parameters. Such functions can also be seen as "gates" that only permit the flow of information within some range of the input parameters.

Our survey was also deliberately geared towards more recent, but still well-established works; this could be perceived as a strength or as a limitation. The selection of papers and areas was driven by a prominent platform providing leaderboards. While a
reader looking for "what works well and what is very promising" benefits from this approach, it could potentially leave out works with exciting ideas that require more research to reveal their full capabilities. This could be seen as perpetuating the "winner-takes-all" paradigm that reinforces already successful ideas. However, due to the sheer amount of papers, a selection is necessary for conducting a holistic survey of deep learning. We acknowledge that online platforms providing leaderboards etc. are very beneficial to the research community and that they should be further advanced. Still, we found that manual verification (e.g., by double-checking relevance with Google Scholar citations and by reading surveys and papers) was required, as we identified works and methods that were not listed correctly on the platform.

7 Conclusions

We have presented a brief but comprehensive overview of the deep learning design landscape. We have summarized key works from various significant areas that have emerged in recent years. We believe that our holistic overview in one paper can establish connections that could inspire novel ideas. We have also identified several patterns that characterize many improvements. To further advance the development of deep learning, we need to generate fundamentally new and successful approaches, as the improvements made in the past few years were numerous and often very creative but mainly incremental.
REFERENCES

Alom, M. Z., Taha, T. M., Yakopcic, C., Westberg, S., Sidike, P., Nasrin, M. S., Hasan, M., Van Essen, B. C., Awwal, A. A., and Asari, V. K. (2019). A state-of-the-art survey on deep learning theory and architectures. Electronics.
Alzubaidi, L., Zhang, J., Humaidi, A. J., Al-Dujaili, A., Duan, Y., Al-Shamma, O., Santamaría, J., Fadhel, M. A., Al-Amidie, M., and Farhan, L. (2021). Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data.
Apicella, A., Donnarumma, F., Isgrò, F., and Prevete, R. (2021). A survey on modern trainable activation functions. Neural Networks.
Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv:1607.06450.
Balažević, I., Allen, C., and Hospedales, T. M. (2019). TuckER: Tensor factorization for knowledge graph completion. arXiv:1901.09590.
Beltagy, I., Peters, M. E., and Cohan, A. (2020). Longformer: The long-document transformer. arXiv:2004.05150.
Brauwers, G. and Frasincar, F. (2021). A general survey on attention mechanisms in deep learning. Transactions on Knowledge and Data Engineering.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In Int. Conf. on Machine Learning.
Child, R., Gray, S., Radford, A., and Sutskever, I. (2019). Generating long sequences with sparse transformers. arXiv:1904.10509.
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. (2022). PaLM: Scaling language modeling with pathways. arXiv:2204.02311.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
Dong, S., Wang, P., and Abbas, K. (2021). A survey on deep learning and its applications. Computer Science Review.
Dong, X. and Shen, J. (2018). Triplet loss in Siamese network for object tracking. In European Conf. on Computer Vision (ECCV).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929.
Dwivedi, V. P. and Bresson, X. (2020). A generalization of transformer networks to graphs. arXiv:2012.09699.
Ericsson, L., Gouk, H., Loy, C. C., and Hospedales, T. M. (2022). Self-supervised representation learning: Introduction, advances, and challenges. Signal Processing Magazine.
Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. (2020). Sharpness-aware minimization for efficiently improving generalization. arXiv:2010.01412.
Ghiasi, G., Lin, T.-Y., and Le, Q. V. (2018). DropBlock: A regularization method for convolutional networks. Advances in Neural Information Processing Systems.
Google (2023). PaLM 2 technical report. https://ai.google/static/documents/palm2techreport.pdf.
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. (2020). Bootstrap your own latent - a new approach to self-supervised learning. Advances in Neural Information Processing Systems.
Grover, A. and Leskovec, J. (2016). node2vec: Scalable feature learning for networks. In ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining.
Guo, M.-H., Lu, C.-Z., Liu, Z.-N., Cheng, M.-M., and Hu, S.-M. (2022a). Visual attention network. arXiv:2202.09741.
Guo, M.-H., Xu, T.-X., Liu, J.-J., Liu, Z.-N., Jiang, P.-T., Mu, T.-J., Zhang, S.-H., Martin, R. R., Cheng, M.-M., and Hu, S.-M. (2022b). Attention mechanisms in computer vision: A survey. Computational Visual Media.
Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., Xiao, A., Xu, C., Xu, Y., et al. (2022). A survey on vision transformer. Transactions on Pattern Analysis and Machine Intelligence.
He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Conf. on Computer Vision and Pattern Recognition.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Conf. on Computer Vision and Pattern Recognition.
Hendrycks, D. and Gimpel, K. (2016). Gaussian error linear units (GELUs). arXiv:1606.08415.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems.
Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected convolutional networks. In Conf. on Computer Vision and Pattern Recognition.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Int. Conf. on Machine Learning.
Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. arXiv:1803.05407.
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. (2020a). Analyzing and improving the image quality of StyleGAN. In Conf. on Computer Vision and Pattern Recognition.
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. (2020b). Analyzing and improving the image quality of StyleGAN. In Conf. on Computer Vision and Pattern Recognition.
Khan, A., Sohail, A., Zahoora, U., and Qureshi, A. S. (2020). A survey of the recent architectures of deep convolutional neural networks. Artificial Intelligence Review.
Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S., and Shah, M. (2022). Transformers in vision: A survey. ACM Computing Surveys (CSUR).
Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., and Krishnan, D. (2020). Supervised contrastive learning. Advances in Neural Information Processing Systems.
Kiefer, J. and Wolfowitz, J. (1952). Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
Kipf, T. N. and Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv:1609.02907.
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. (2020). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871-7880.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal loss for dense object detection. In Int. Conf. on Computer Vision.
Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. (2019a). On the variance of the adaptive learning rate and beyond. arXiv:1908.03265.
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig, G. (2023). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys.
Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L., and Shazeer, N. (2018). Generating Wikipedia by summarizing long sequences. arXiv:1801.10198.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019b). RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Int. Conf. on Computer Vision.
Loshchilov, I. and Hutter, F. (2017). Decoupled weight decay regularization. arXiv:1711.05101.
Mescheder, L., Geiger, A., and Nowozin, S. (2018). Which training methods for GANs do actually converge? In Int. Conf. on Machine Learning.
Min, B., Ross, H., Sulem, E., Veyseh, A. P. B., Nguyen, T. H., Sainz, O., Agirre, E., Heinz, I., and Roth, D. (2021). Recent advances in natural language processing via large pre-trained language models: A survey. arXiv:2111.01243.
Misra, D. (2019). Mish: A self regularized non-monotonic activation function. arXiv:1908.08681.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In Int. Conf. on Machine Learning.
Moradi, R., Berangi, R., and Minaei, B. (2020). A survey of regularization strategies for deep models. Artificial Intelligence Review.
Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Int. Conf. on Machine Learning (ICML).
OpenAI (2022). ChatGPT: Optimizing language models for dialogue. https://openai.com/blog/chatgpt/.
OpenAI (2023). GPT-4 technical report.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155.
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. (2018). Improving language understanding by generative pre-training.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2019). Language models are unsupervised multitask learners. OpenAI Blog.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research.
Ramachandran, P., Zoph, B., and Le, Q. V. (2017). Searching for activation functions. arXiv:1710.05941.
Reddi, S. J., Kale, S., and Kumar, S. (2019). On the convergence of Adam and beyond. arXiv:1904.09237.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In Conf. on Computer Vision and Pattern Recognition.
Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108.
Schultz, M. and Joachims, T. (2003). Learning a distance metric from relative comparisons. Advances in Neural Information Processing Systems, 16.
Shao, J., Hu, K., Wang, C., Xue, X., and Raj, B. (2020). Is normalization indispensable for training deep neural network? Advances in Neural Information Processing Systems.
Shazeer, N. and Stern, M. (2018). Adafactor: Adaptive learning rates with sublinear memory cost. In Int. Conf. on Machine Learning.
Shrestha, A. and Mahmood, A. (2019). Review of deep learning algorithms and architectures. IEEE Access.
Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C. A., Cubuk, E. D., Kurakin, A., and Li, C.-L. (2020). FixMatch: Simplifying semi-supervised learning with consistency and confidence. Advances in Neural Information Processing Systems.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research.
Sun, R.-Y. (2020). Optimization for deep learning: An overview. Operations Research Society of China.
Sun, Z., Deng, Z.-H., Nie, J.-Y., and Tang, J. (2019). RotatE: Knowledge graph embedding by relational rotation in complex space. arXiv:1902.10197.
Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., and Jégou, H. (2021). Going deeper with image transformers. In Int. Conf. on Computer Vision.
Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. (2017). Graph attention networks. arXiv:1710.10903.
Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., and Tang, X. (2017). Residual attention network for image classification. In Conf. on Computer Vision and Pattern Recognition.
Wang, Q., Ma, Y., Zhao, K., and Tian, Y. (2020). A comprehensive survey of loss functions in machine learning. Annals of Data Science.
Williams, R. J. and Peng, J. (1991). Function optimization using connectionist reinforcement learning algorithms. Connection Science.
Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Philip, S. Y. (2020). A comprehensive survey on graph neural networks. Transactions on Neural Networks and Learning Systems.
Xie, Q., Luong, M.-T., Hovy, E., and Le, Q. V. (2020). Self-training with noisy student improves ImageNet classification. In Conf. on Computer Vision and Pattern Recognition.
Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. (2017). Aggregated residual transformations for deep neural networks. In Conf. on Computer Vision and Pattern Recognition.
Yang, X., Song, Z., King, I., and Xu, Z. (2022). A survey on deep semi-supervised learning. Transactions on Knowledge and Data Engineering.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., and Le, Q. V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems.
You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. (2019). Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv:1904.00962.
Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S. (2021). Barlow twins: Self-supervised learning via redundancy reduction. In Int. Conf. on Machine Learning.
Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Int. Conf. on Computer Vision.
