A Survey of Deep Learning - From Activations To Transformers
Abstract: Deep learning has made tremendous progress in the last decade. A key success factor is the large number
of architectures, layers, objectives, and optimization techniques. They include a myriad of variants related to
attention, normalization, skip connections, transformers and self-supervised learning schemes – to name a few.
We provide a comprehensive overview of the most important, recent works in these areas to those who already
have a basic understanding of deep learning. We hope that a holistic and unified treatment of influential,
recent works helps researchers to form new connections between diverse areas of deep learning. We identify
and discuss multiple patterns that summarize the key strategies for many of the successful innovations over
the last decade as well as works that can be seen as rising stars. We also include a discussion on recent
commercially built, closed-source models such as OpenAI’s GPT-4 and Google’s PaLM 2.
what established based on the usage statistics from the popular platform “Paperswithcode.com.” There has been an increase in these types of platforms that enable the upload of papers (and models) and provide information on citations, as well as leaderboards. Although there are drawbacks when utilizing data from these platforms, we believe that it offers a new perspective compared to traditional survey methods that

Loss functions (surveyed in (Wang et al., 2020)) often consist of multiple terms that are enhanced with a regularization term. Loss functions are often task-specific, but some general ideas are applicable across tasks. Commonly, multiple loss terms are aggregated in a weighted manner. Many papers improve prior work (simply) by using a different loss function.

The Triplet Loss (Dong and Shen, 2018) was introduced for Siamese networks (its origin dates back further (Schultz and Joachims, 2003)). The high-level idea is to compare a given input to a positive and a negative input and maximize association between positively associated inputs, while minimizing those of negative ones. It takes input pairs (x, y), each processed by a separate but identical network. It maximizes the joint probability p(x, y) of all pairs (x, y):

L(V_p, V_n) = −1/(|V_p| · |V_n|) · ∑_{x∈V_n} ∑_{y∈V_p} log p(x, y)    (1)

The Supervised Contrastive Loss (Khosla et al., 2020) pulls together clusters of points of the same class in embedding space and pushes samples of different classes apart. It aims at leveraging label information more effectively than the cross-entropy loss:

L_i^sup = −1/(2N_{ỹ_i} − 1) · ∑_{j=1}^{2N} 1_{i≠j} · 1_{ỹ_i=ỹ_j} · log( exp(z_i · z_j / τ) / ∑_{k=1}^{2N} 1_{i≠k} · exp(z_i · z_k / τ) )    (6, 7)
where N_{ỹ_i} is the total number of images in the mini-batch that have the same label ỹ_i as the anchor i. The total loss is the sum over the losses of all anchors i, i.e., L = ∑_i L_i^sup. The loss has important properties well suited for supervised learning (see the sketch below):
• generalization to an arbitrary number of positives
• contrastive power increases with more negatives.

and generator inversion. Gradients with respect to w ∈ W stemming from random directions in the image space should be almost equal in length independent of w or the image-space direction. The local metric scaling characteristics of the generator g : W → Y are captured by the Jacobian matrix J_w = δg(w)/δw. The regularizer is then defined in terms of this Jacobian.
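Returning to the supervised contrastive loss of Eq. (6, 7), the following is a minimal NumPy sketch of that formula; the batch layout, temperature value, and variable names are illustrative assumptions rather than the reference implementation of (Khosla et al., 2020).

```python
import numpy as np

def supervised_contrastive_loss(z, y, tau=0.1):
    """Supervised contrastive loss over a batch of 2N embeddings.

    z: (2N, d) array of L2-normalized embeddings (two augmented views per image).
    y: (2N,) array of integer class labels. tau: temperature.
    """
    n = z.shape[0]
    sim = z @ z.T / tau                           # z_i . z_j / tau for all pairs
    np.fill_diagonal(sim, -np.inf)                # implements the 1_{i != k} indicator
    log_denom = np.log(np.exp(sim).sum(axis=1))   # log sum_{k != i} exp(z_i . z_k / tau)

    total = 0.0
    for i in range(n):
        pos = (y == y[i]) & (np.arange(n) != i)   # positives: same label as anchor i, j != i
        if not pos.any():
            continue
        # L_i^sup = -1/(2N_y - 1) * sum over positives j of [sim_ij - log_denom_i]
        total += -(sim[i, pos] - log_denom[i]).sum() / pos.sum()
    return total

# Example with 8 embeddings and 4 classes, two views per class.
rng = np.random.default_rng(0)
z = rng.standard_normal((8, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)
print(supervised_contrastive_loss(z, np.array([0, 0, 1, 1, 2, 2, 3, 3])))
```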
T5 (Raffel et al., 2020) views every text-based language task as generating an output text from a given input text. It differs from BERT (Devlin et al., 2018) by using causal masking during training for predicting the target. Causal masking prevents the network from accessing “future” tokens of the target. T5 also differs in pre-training tasks.
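The causal masking described above can be illustrated with a lower-triangular attention mask, so that target position t may attend only to positions up to t. This is a generic sketch, not code from the T5 release.

```python
import numpy as np

def causal_mask(T):
    """Boolean (T, T) mask: entry (t, s) is True iff target position t may attend to position s <= t."""
    return np.tril(np.ones((T, T), dtype=bool))

# Example: with T = 4 target tokens, position 2 sees positions 0..2 but not the "future" position 3.
print(causal_mask(4).astype(int))
```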
BART (Lewis et al., 2020) is a denoising autoencoder for pretraining sequence-to-sequence models that uses a standard transformer-based machine translation architecture. It has been shown to be effective for language generation, translation, and comprehension. Training is based on corrupting text with noising functions ranging from token deletion and masking to sentence permutation and document rotation. Learning stems from reconstructing the original text from its corrupted version. The flexibility in noising options is attributed to BART's generalization of prior works such as BERT and GPT, i.e., the encoder is bi-directional (like BERT), while the decoder is autoregressive (like GPT).
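Such noising functions can be sketched in a few lines of Python; the mask token, corruption rates, and function names below are assumed placeholders for illustration, not BART's actual preprocessing code.

```python
import random

MASK = "<mask>"  # assumed placeholder mask symbol

def token_delete(tokens, p=0.15):
    """Token deletion: randomly drop tokens."""
    return [t for t in tokens if random.random() > p]

def token_mask(tokens, p=0.15):
    """Token masking: replace random tokens with a mask symbol."""
    return [MASK if random.random() < p else t for t in tokens]

def sentence_permute(sentences):
    """Sentence permutation: shuffle the order of sentences."""
    return random.sample(sentences, k=len(sentences))

def document_rotate(tokens):
    """Document rotation: rotate the document to start at a random token."""
    i = random.randrange(len(tokens))
    return tokens[i:] + tokens[:i]
```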
XLNet (Yang et al., 2019) combines advantages of autoregressive modeling like GPT, predicting the next token, and denoising auto-encoding like BERT (Devlin et al., 2018), reconstructing x given a noisy input x̂ that originates through masking words of x. It does so by using a permutation language model that samples a permutation Z = z_0, z_1, ..., z_{T−1} of the sequence (0, 1, 2, ..., T − 1), leading to the objective:

max p(u_{z_T} | u_{z_0}, ..., u_{z_{T−1}})    (37)

There is no actual permutation of inputs, which would be unnatural (and not occurring during later fine-tuning tasks). Rather, the permutation impacts the attention mask to ensure that the factorization order given by Z is maintained.
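How a sampled factorization order Z can be enforced purely through the attention mask, without physically permuting the inputs, can be sketched as follows; this is an illustration only, not XLNet's two-stream attention.

```python
import numpy as np

def permutation_mask(z):
    """Attention mask for a sampled factorization order z (a permutation of 0..T-1).

    mask[i, j] is True iff token i may attend to token j, i.e. j precedes i in the
    factorization order. The token sequence itself is left in its natural order.
    """
    T = len(z)
    rank = np.empty(T, dtype=int)
    rank[z] = np.arange(T)              # rank[i] = position of token i in the order z
    return rank[:, None] > rank[None, :]

# Example: order z = [2, 0, 3, 1] -> token 3 may attend to tokens 2 and 0 only.
print(permutation_mask(np.array([2, 0, 3, 1])).astype(int))
```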
The Vision Transformer (Dosovitskiy et al., 2020) relies heavily on the original transformer. An image is partitioned into small patches, which are flattened and linearly embedded together with position embeddings. A standard transformer encoder then processes the resulting sequence of patch vectors.
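A minimal NumPy sketch of this patch-embedding step; the patch size, embedding dimension, and random projection below are illustrative stand-ins for ViT's learned parameters.

```python
import numpy as np

def patch_embed(image, patch=16, dim=64, rng=np.random.default_rng(0)):
    """Split an (H, W, C) image into non-overlapping patches, flatten each patch,
    project it linearly, and add position embeddings."""
    H, W, C = image.shape
    ph, pw = H // patch, W // patch
    # (ph*pw, patch*patch*C): one flattened vector per patch
    patches = (image[:ph * patch, :pw * patch]
               .reshape(ph, patch, pw, patch, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(ph * pw, patch * patch * C))
    W_proj = rng.standard_normal((patches.shape[1], dim)) * 0.02   # stand-in for a learned projection
    pos = rng.standard_normal((ph * pw, dim)) * 0.02               # stand-in for learned position embeddings
    return patches @ W_proj + pos                                   # (num_patches, dim) tokens for the encoder

tokens = patch_embed(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 64)
```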
The Swin Transformer (Liu et al., 2021) for computer vision builds hierarchical feature maps rather than just a single (resolution) feature map. It also computes self-attention only within local windows, reducing computation time.

PaLM (2): The original PaLM (Chowdhery et al., 2022) is a large language model consisting of 540 billion parameters, similar to other, more prominent models such as GPT-3. The technical innovation discussed is mostly on the scaling of model training, i.e., a single model can be trained efficiently across tens of thousands of accelerator chips. The original transformer architecture (Vaswani et al., 2017) is also adjusted slightly, e.g., SwiGLU activations are used, i.e.,

Swish(xW) · xV    (38)

where Swish is given by Eq. 17, as well as different positional embeddings (better for long sequences), multi-query attention (faster computation), no biases (better training stability), and shared input-output embeddings.
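A minimal sketch of the SwiGLU unit of Eq. (38); since Eq. 17 is not reproduced in this excerpt, Swish is assumed here with slope β = 1, and the weight shapes are illustrative.

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x); beta = 1 assumed here."""
    return x / (1.0 + np.exp(-beta * x))

def swiglu(x, W, V):
    """SwiGLU unit as in Eq. 38: Swish(xW) multiplied elementwise with xV."""
    return swish(x @ W) * (x @ V)

# Example with illustrative shapes: a (batch, d_model) input and two (d_model, d_ff) projections.
rng = np.random.default_rng(0)
x, W, V = rng.standard_normal((2, 8)), rng.standard_normal((8, 32)), rng.standard_normal((8, 32))
print(swiglu(x, W, V).shape)  # (2, 32)
```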
PaLM 2 (Google, 2023) is the better-performing successor of PaLM that differs in terms of dataset mixtures, e.g., using more diverse languages as well as domains (e.g., programming languages, mathematics). It also uses the classical transformer architecture. However, it uses a smaller model than the first PaLM version but more training compute. It also relies on more diverse pre-training objectives (than simple next-word or masked-word prediction).

neighborhoods; thereby, nodes are viewed based on their role or the communities they belong to.