The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit

Noci, Lorenzo; Li, Chuning; Li, Mufan Bill; He, Bobby; Hofmann, Thomas; Maddison, Chris; Roy, Daniel M.

arXiv:2306.17759 (stat)

[Submitted on 30 Jun 2023 (v1), last revised 9 Dec 2023 (this version, v2)]

Abstract: In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer's attention mechanism is modified by centering the Softmax output at identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications.
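
The two modifications named in the abstract (centering the Softmax output at identity, and scaling the logits by a width-dependent temperature) can be illustrated with a small sketch. The abstract does not state the formula, so the form used here is an assumption: A = I + softmax(QK^T / tau) - (1/m) 11^T for m tokens, with tau = tau0 * n for width n. The function names, the parameter `tau0`, and the linear-in-width temperature are illustrative choices, not the authors' code.

```python
import math
import random


def softmax_rows(logits):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    out = []
    for row in logits:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def shaped_attention(q, k, tau0=1.0):
    """Sketch of a 'shaped' attention matrix for m tokens of width n.

    Assumed form (not given in the abstract):
        A = I + softmax(q k^T / tau) - (1/m) * ones(m, m),  tau = tau0 * n
    """
    m, n = len(q), len(q[0])
    tau = tau0 * n  # assumption: temperature grows linearly with the width
    logits = [
        [sum(a * b for a, b in zip(qi, kj)) / tau for kj in k]
        for qi in q
    ]
    probs = softmax_rows(logits)
    # Subtracting the uniform matrix 1/m and adding I centers the output at identity.
    return [
        [(1.0 if i == j else 0.0) + probs[i][j] - 1.0 / m for j in range(m)]
        for i in range(m)
    ]


def max_deviation_from_identity(a):
    """Largest entrywise gap between a square matrix and the identity."""
    m = len(a)
    return max(
        abs(a[i][j] - (1.0 if i == j else 0.0))
        for i in range(m)
        for j in range(m)
    )


def random_matrix(m, n, rng):
    """m x n matrix of independent standard Gaussian entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
```

Under these assumptions every row of the result sums to one, and with Gaussian queries and keys the logits shrink as the width grows, so the matrix approaches the identity. Calling `max_deviation_from_identity(shaped_attention(q, k))` for increasing widths is one way to observe this.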

Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG)
Cite as:	arXiv:2306.17759 [stat.ML]
	(or arXiv:2306.17759v2 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.2306.17759

Submission history

From: Mufan (Bill) Li
[v1] Fri, 30 Jun 2023 16:10:36 UTC (1,567 KB)
[v2] Sat, 9 Dec 2023 19:59:40 UTC (5,467 KB)
