Toy Models of Superposition
Abstract: Neural networks often pack many unrelated concepts into a single neuron – a puzzling
phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This
paper provides a toy model where polysemanticity can be fully understood, arising as a result of
models storing additional sparse features in "superposition." We demonstrate the existence of a
phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link
to adversarial examples. We also discuss potential implications for mechanistic interpretability.
We recommend reading this paper as an HTML article.
It would be very convenient if the individual neurons of artificial neural networks corresponded to
cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each
neuron would fire only in the presence of a specific visual feature, such as the color red, a left-
facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly
map to features. But it isn't always the case that features correspond so cleanly to neurons,
especially in large language models where it actually seems rare for neurons to correspond to clean
features. This brings up many questions. Why is it that neurons sometimes align with features and
sometimes don't? Why do some models and tasks have many of these clean neurons, while they're
vanishingly rare in others?
In this paper, we use toy models — small ReLU networks trained on synthetic data with sparse input
features — to investigate how and when models represent more features than they have
dimensions. We call this phenomenon superposition. When features are sparse, superposition
allows compression beyond what a linear model would do, at the cost of "interference" that requires
nonlinear filtering.
Consider a toy model where we train an embedding of five features of varying importance 1 in two
dimensions, add a ReLU afterwards for filtering, and vary the sparsity of the features. With dense
features, the model learns to represent an orthogonal basis of the most important two features
(similar to what Principal Component Analysis might give us), and the other three features are not
represented. But if we make the features sparse, this changes:
This figure and a few others can be reproduced using the toy model framework Colab notebook in our Github repo
Not only can models store additional features in superposition by tolerating some interference, but
we'll show that, at least in certain limited cases, models can perform computation while in
superposition. (In particular, we'll show that models can put simple circuits computing the absolute
value function in superposition.) This leads us to hypothesize that the neural networks we observe
in practice are in some sense noisily simulating larger, highly sparse networks. In other words, it's
possible that models we train can be thought of as doing “the same thing as” an imagined much-
larger model, representing the exact same features but with no interference.
Feature superposition isn't a novel idea. A number of previous interpretability papers have
speculated about it [1, 2] , and it's very closely related to the long-studied topic of compressed
sensing in mathematics [3] , as well as the ideas of distributed, dense, and population codes in
neuroscience [4] and deep learning [5] . What, then, is the contribution of this paper?
For interpretability researchers, our main contribution is providing a direct demonstration that
superposition occurs in artificial neural networks given a relatively natural setup, suggesting this
may also occur in practice. We offer a theory of when and why this occurs, revealing a phase
diagram for superposition. We also discover that, at least in our toy model, superposition exhibits
complex geometric structure.
But our results may also be of broader interest. We find preliminary evidence that superposition may
be linked to adversarial examples and grokking, and might also suggest a theory for the
performance of mixture of experts models. More broadly, the toy model we investigate has
unexpectedly rich structure, exhibiting phase changes, a geometric structure based on uniform
polytopes, "energy level"-like jumps during training, and a phenomenon which is qualitatively similar
to the fractional quantum Hall effect in physics. We originally investigated the subject to gain
understanding of cleanly-interpretable neurons in larger models, but we've found these toy models
to be surprisingly interesting in their own right.
Empirical Phenomena
When we talk about "features" and how they're represented, this is ultimately theory building
around several observed empirical phenomena. Before describing how we conceptualize those
results, we'll simply describe some of the major results motivating our thinking:
Word Embeddings - A famous result by Mikolov et al. [8] found that word embeddings appear to have directions which correspond to semantic properties, allowing for embedding vector arithmetic such as V("king") - V("man") + V("woman") = V("queen") (but see [9] ).
Latent Spaces - Similar "vector arithmetic" and interpretable direction results have also
been found for generative adversarial networks (e.g. [10] ).
Interpretable Neurons - There is a significant body of results finding neurons which appear
to be interpretable (in RNNs [11, 12] ; in CNNs [13, 14] ; in GANs [15] ), activating in response to
some understandable property. This work has faced some skepticism [16, 17] . In response,
several papers have aimed to give extremely detailed accounts of a few specific neurons, in
the hope of dispositively establishing examples of neurons which truly detect some
understandable property (notably Cammarata et al. [6] , but also [18, 19] ).
Universality - Many analogous neurons responding to the same properties can be found
across networks [20, 1, 18] .
Polysemantic Neurons - At the same time, there are also many neurons which appear to not
respond to an interpretable property of the input, and in particular, many polysemantic
neurons which appear to respond to unrelated mixtures of inputs [21] .
As a result, we tend to think of neural network representations as being composed of features which
are represented as directions. We'll unpack this idea in the following sections.
Features as Directions
As we've mentioned in previous sections, we generally think of features as being represented by
directions. For example, in word embeddings, "gender" and "royalty" appear to correspond to
directions, allowing arithmetic like V("king") - V("man") + V("woman") = V("queen")
[8] . Examples of interpretable neurons are also cases of features as directions, since the amount a
neuron activates corresponds to a basis direction in the representation.
Let's call a neural network representation linear if features correspond to directions in activation
space. In a linear representation, each feature f_i has a corresponding representation direction W_i. The presence of multiple features f_1, f_2, … activating with values x_{f_1}, x_{f_2}, … is represented by x_{f_1} W_{f_1} + x_{f_2} W_{f_2} + …. To be clear, the features being represented are almost certainly nonlinear functions of the input. It's only the map from features to activation vectors which is linear. Note that whether something is a linear representation depends on what you consider to be the features.
We don't think it's a coincidence that neural networks empirically seem to have linear
representations. Neural networks are built from linear functions interspersed with non-linearities. In
some sense, the linear functions are the vast majority of the computation (for example, as measured
in FLOPs). Linear representations are the natural format for neural networks to represent
information in! Concretely, there are three major benefits:
Linear representations are the natural outputs of obvious algorithms a layer might
implement. If one sets up a neuron to pattern match a particular weight template, it will fire
more as a stimulus matches the template better and less as it matches it less well.
Linear representations make features "linearly accessible." A typical neural network layer
is a linear function followed by a non-linearity. If a feature in the previous layer is represented
linearly, a neuron in the next layer can "select it" and have it consistently excite or inhibit that
neuron. If a feature were represented non-linearly, the model would not be able to do this in a
single step.
Statistical Efficiency. Representing features as different directions may allow non-local
generalization in models with linear transformations (such as the weights of neural nets),
increasing their statistical efficiency relative to models which can only locally generalize. This
view is especially advocated in some of Bengio's writing (e.g. [5] ). A more accessible
argument can be found in this blog post.
It is possible to construct non-linear representations, and retrieve information from them, if you use
multiple layers (although even these examples can be seen as linear representations with more
exotic features). We provide an example in the appendix. However, our intuition is that non-linear
representations are generally inefficient for neural networks.
One might think that a linear representation can only store as many features as it has dimensions,
but it turns out this isn't the case! We'll see that the phenomenon we call superposition will allow
models to store more features – potentially many more features – in linear representations.
For discussion on how this view of features squares with a conception of features as being
multidimensional manifolds, see the appendix “What about Multidimensional Features?”.
Several results from mathematics suggest that something like this might be plausible:
Almost Orthogonal Vectors. Although it's only possible to have n orthogonal vectors in an
n-dimensional space, it's possible to have exp(n) many "almost orthogonal" (< ϵ cosine
similarity) vectors in high-dimensional spaces. See the Johnson–Lindenstrauss lemma.
Compressed sensing. In general, if one projects a vector into a lower-dimensional space,
one can't reconstruct the original vector. However, this changes if one knows that the original
vector is sparse. In this case, it is often possible to recover the original vector.
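As a quick numerical illustration of the first point, the sketch below (our own, not from the paper's notebook) samples many random unit vectors and checks that their pairwise cosine similarities stay small even when there are far more vectors than dimensions:

```python
import torch

torch.manual_seed(0)
d, k = 256, 2000                     # dimensions, number of random directions
V = torch.randn(k, d)
V = V / V.norm(dim=1, keepdim=True)  # unit vectors

cos = V @ V.T                        # pairwise cosine similarities
cos.fill_diagonal_(0.0)
print(f"max |cos sim| among {k} vectors in {d} dims: {cos.abs().max():.3f}")
# The maximum stays well below 1 and shrinks as d grows, roughly like sqrt(log k / d).
```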
Concretely, in the superposition hypothesis, features are represented as almost-orthogonal
directions in the vector space of neuron outputs. Since the features are only almost-orthogonal, one
feature activating looks like other features slightly activating. Tolerating this "noise" or
"interference" comes at a cost. But for neural networks with highly sparse features, this cost may
be outweighed by the benefit of being able to represent more features! (Crucially, sparsity greatly
reduces the costs since sparse features are rarely active to interfere with each other, and non-linear
activation functions create opportunities to filter out small amounts of noise.)
One way to think of this is that a small neural network may be able to noisily "simulate" a sparse
larger model:
Although we've described superposition with respect to neurons, it can also occur in
representations with an unprivileged basis, such as a word embedding. Superposition simply means
that there are more features than dimensions.
The first two (decomposability and linearity) are properties we hypothesize to be widespread, while the latter two (non-superposition and basis-aligned) are properties we believe only sometimes occur.
Demonstrating Superposition
If one takes the superposition hypothesis seriously, a natural first question is whether neural
networks can actually noisily represent more features than they have neurons. If they can't, the
superposition hypothesis may be comfortably dismissed.
The intuition from linear models would be that this isn't possible: the best a linear model can do is to
store the principal components. But we'll see that adding just a slight nonlinearity can make models
behave in a radically different way! This will be our first demonstration of superposition. (It will also
be an object lesson in the complexity of even very simple neural networks.)
Experiment Setup
Our goal is to explore whether a neural network can project a high dimensional vector x ∈ R^n into a lower dimensional vector h ∈ R^m and then recover it. 5 We think of each dimension of x as a "feature". In a vision model, such a feature might correspond to a Gabor filter, a curve detector, or a floppy ear detector. In a language model, it might correspond to a token referring to a specific famous person, or a clause being a particular kind of description.
Since we don't have any ground truth for features, we need to create synthetic data for x which
simulates any important properties we believe features have from the perspective of modeling
them. We make three major assumptions:
Feature Sparsity: In the natural world, many features seem to be sparse in the sense that
they only rarely occur. For example, in vision, most positions in an image don't contain a
horizontal edge, or a curve, or a dog head [1] . In language, most tokens don't refer to Martin
Luther King or aren't part of a clause describing music [2] . This idea goes back to classical
work on vision and the statistics of natural images (see e.g. Olshausen, 1997, the section
"Why Sparseness?" [24] ). For this reason, we will choose a sparse distribution for our
features.
More Features Than Neurons: There are an enormous number of potentially useful features
a model might represent. 6 This imbalance between features and neurons in real models
seems like it must be a central tension in neural network representations.
Features Vary in Importance: Not all features are equally useful to a given task. Some can
reduce the loss more than others. For an ImageNet model, where classifying different species
of dogs is a central task, a floppy ear detector might be one of the most important features it
can have. In contrast, another feature might only very slightly improve performance. 7
Concretely, our synthetic data is defined as follows: The input vectors x are synthetic data intended
to simulate the properties we believe the true underlying features of our task have. We consider
each dimension x_i to be a "feature". Each one has an associated sparsity S_i and importance I_i. We let x_i = 0 with probability S_i, and otherwise sample it uniformly from [0, 1]. 8 In practice, we focus on the case where all features have the same sparsity, S_i = S.
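A minimal sketch of this sampling procedure (function and variable names are ours; the notebook in the repo may differ in details):

```python
import torch

def sample_features(batch: int, n: int, S: float) -> torch.Tensor:
    """Each x_i is 0 with probability S, and otherwise uniform on [0, 1]."""
    x = torch.rand(batch, n)               # uniform [0, 1] values
    active = torch.rand(batch, n) >= S     # each feature is active with prob. 1 - S
    return x * active

importance = 0.7 ** torch.arange(20).float()   # e.g. I_i = 0.7^i, as used below
x = sample_features(batch=1024, n=20, S=0.9)
```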
THE MODEL (X → X′ )
We will actually consider two models, which we motivate below. The first "linear model" is a well
understood baseline which does not exhibit superposition. The second "ReLU output model" is a
very simple model which does exhibit superposition. The two models vary only in the final activation
function.
Linear model:
h = W x
x′ = W^T h + b = W^T W x + b

ReLU output model:
h = W x
x′ = ReLU(W^T h + b) = ReLU(W^T W x + b)
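In code, the two models differ only in the final activation. A sketch under the setup above (W has shape m × n; the class name is ours):

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self, n: int, m: int, relu_output: bool):
        super().__init__()
        self.W = nn.Parameter(torch.randn(m, n) * 0.1)  # m x n: features -> hidden
        self.b = nn.Parameter(torch.zeros(n))
        self.relu_output = relu_output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x @ self.W.T                  # h = W x
        x_prime = h @ self.W + self.b     # x' = W^T h + b
        return torch.relu(x_prime) if self.relu_output else x_prime
```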
To recover the original vector, we'll use the transpose of the same matrix, W^T. This has the
advantage of avoiding any ambiguity regarding what direction in the lower-dimensional space really
corresponds to a feature. It also seems relatively mathematically principled 9 , and empirically works.
We also add a bias. One motivation for this is that it allows the model to set features it doesn't
represent to their expected value. But we'll see later that the ability to set a negative bias is
important for superposition for a second set of reasons – roughly, it allows models to discard small
amounts of noise.
The final step is whether to add an activation function. This turns out to be critical to whether
superposition occurs. In a real neural network, when features are actually used by the model to do
computation, there will be an activation function, so it seems principled to include one at the end.
THE LOSS
Our loss is the mean squared error weighted by the feature importances, I_i, described above:

L = ∑_x ∑_i I_i (x_i − x′_i)²
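Continuing the sketch, the importance-weighted loss is a one-liner:

```python
def loss_fn(x_prime: torch.Tensor, x: torch.Tensor, importance: torch.Tensor) -> torch.Tensor:
    # sum over features of I_i * (x_i - x'_i)^2, averaged over the batch
    return (importance * (x - x_prime) ** 2).sum(dim=-1).mean()
```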
Basic Results
Our first experiment will simply be to train a few ReLU output models with different sparsity levels
and visualize the results. (We'll also train a linear model – if optimized well enough, the linear model
solution does not depend on sparsity level.)
The main question is how to visualize the results. The simplest way is to visualize W^T W (a features by features matrix) and b (a feature-length vector). Note that features are arranged from most important to least, so the results have a fairly nice structure. Here's an example of what this type of visualization might look like, for a small model (n = 20; m = 5) which behaves in the "expected linear model-like" way, only representing as many features as it has dimensions:
But the thing we really care about is this hypothesized phenomenon of superposition – does the
model represent "extra features" by storing them non-orthogonally? Is there a way to get at it more
explicitly? Well, one question is just how many features the model learns to represent. For any
feature, whether or not it is represented is determined by ||W_i||, the norm of its embedding vector. We'd also like to understand whether a given feature shares its dimension with other features. For this, we calculate ∑_{j≠i} (Ŵ_i ⋅ W_j)², projecting all the other features onto the direction vector of W_i. This sum will be 0 if the feature is orthogonal to the other features (dark blue below). On the other hand, values ≥ 1 mean that there is some group of other features which can activate W_i as strongly as feature i itself!
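Both quantities can be read directly off a trained W (continuing the sketch above; W has shape m × n, so W[:, i] is feature i's embedding vector W_i):

```python
def superposition_stats(W: torch.Tensor):
    norms = W.norm(dim=0)                            # ||W_i||: is feature i represented?
    W_hat = W / norms.clamp(min=1e-8)                # unit direction of each feature
    overlaps = (W_hat.T @ W) ** 2                    # (Ŵ_i . W_j)^2 for all pairs i, j
    interference = overlaps.sum(dim=1) - norms ** 2  # sum over j != i
    return norms, interference
```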
Now that we have a way to visualize models, we can start to actually do experiments. We'll start by
considering models with only a few features (n = 20; m = 5; I_i = 0.7^i). This will make it easy to visually see what happens. We consider a linear model, and several ReLU-output models trained on data with different feature sparsities.
Mathematical Understanding
In the previous section, we observed a surprising empirical result: adding a ReLU to the output of
our model allowed a radically different solution – superposition – which doesn't occur in linear
models.
The model where it occurs is still quite mathematically simple. Can we analytically understand why
superposition is occurring? And for that matter, why does adding a single non-linearity make things
so different from the linear model case? It turns out that we can get a fairly satisfying answer,
revealing that our model is governed by balancing two competing forces – feature benefit and
interference – which will be useful intuition going forwards. We'll also discover a connection to the
famous Thomson Problem in chemistry.
Let's start with the linear case. This is well understood by prior work! If one wants to understand
why linear models don't exhibit superposition, the easy answer is to observe that linear models
essentially perform PCA. But this isn't fully satisfying: if we set aside all our knowledge and intuition
about linear functions for a moment, why exactly is it that superposition can't occur?
A deeper understanding can come from the results of Saxe et al. [26] who study the learning
dynamics of linear neural networks – that is, neural networks without activation functions. Such
models are ultimately linear functions, but because they are the composition of multiple linear
functions the dynamics are potentially quite complex. The punchline of their paper reveals that
neural network weights can be thought of as optimizing a simple closed-form expression for the loss. We can tweak their problem to be a bit more similar to our linear case. 10 The resulting expression reveals that there are fundamentally two competing forces which control learning dynamics in the considered model. Firstly, the model can attain a better loss by representing more
features (we've labeled this "feature benefit"). But it also gets a worse loss if it represents more
than it can fit orthogonally due to "interference" between features. 11 In fact, this makes it never
worthwhile for the linear model to represent more features than it has dimensions. 12
Can we achieve a similar kind of understanding for the ReLU output model? Concretely, we'd like to understand L = ∫_x ||I(x − ReLU(W^T W x + b))||² dp(x), where x is distributed such that x_i = 0 with probability S. The integral over x decomposes into a term for each sparsity pattern according to the binomial expansion of ((1−S) + S)^n. We can group terms of the same sparsity together, rewriting the loss as L = (1−S)^n L_n + … + (1−S)S^{n−1} L_1 + S^n L_0, with each L_k corresponding to the loss when the input is a k-sparse vector. Note that as S → 1, L_1 and L_0 dominate. The L_0 term, corresponding to the loss on a zero vector, is just a penalty on positive biases, ∑_i ReLU(b_i)². So, in the very sparse limit, understanding the model largely comes down to understanding the 1-sparse loss term L_1.
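To spell out the L_0 claim (a one-line check using the model definition above): on the all-zero input the model outputs ReLU(b), so, up to the importance weighting,

\[ L_0 = \sum_i I_i \big(0 - \mathrm{ReLU}(b_i)\big)^2 = \sum_i I_i\,\mathrm{ReLU}(b_i)^2 . \]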
The 1-sparse loss term is vaguely similar to the famous Thomson problem in chemistry. In particular, if we assume uniform importance, that there are a fixed number of features with ||W_i|| = 1 while the rest have ||W_i|| = 0, and that b_i = 0, then the feature benefit term is constant and the
interference term becomes a generalized Thomson problem – we're just packing points on the
surface of the sphere with a slightly unusual energy function. (We'll see this can be a productive
analogy when we resume our empirical investigation in the following sections!)
Another interesting property is that ReLU makes negative interference free in the 1-sparse case.
This explains why the solutions we've seen prefer to only have negative interference when possible.
Further, using a negative bias can convert small positive interferences into essentially being
negative interferences.
What about the terms corresponding to less sparse vectors? We leave explicitly writing these out to
the reader, but the main idea is that there are multiple compounding interferences, and the "active
features" can experience interference. In a later section, we'll see that features often organize
themselves into sparse interference graphs such that only a small number of features interfere with
another feature – it's interesting to note that this reduces the probability of compounding
interference and makes the 1-sparse loss term more important relative to others.
It turns out that the number of input features n doesn't matter as long as it's much larger than the number of hidden dimensions, n ≫ m. It also turns out that the number of hidden dimensions doesn't really matter as long as we're interested in the ratio of features learned to hidden dimensions: doubling the number of hidden dimensions just doubles the number of features the model learns.
A convenient way to measure the number of features the model has learned is to look at the Frobenius norm, ||W||²_F. Since ||W_i||² ≃ 1 if a feature is represented and ||W_i||² ≃ 0 if it is not, this is roughly the number of features the model has learned to represent. Conveniently, this norm is basis-independent, so it still behaves nicely in the dense regime S = 0, where the feature basis isn't privileged by anything and the model represents features with arbitrary directions instead.
We'll plot D∗ = m/||W||²_F, which we can think of as the "dimensions per feature":
Surprisingly, we find that this graph is "sticky" at 1 and 1/2. (This very vaguely resembles the
fractional quantum Hall effect – see e.g. this diagram.) Why is this? On inspection, the 1/2 "sticky
point" seems to correspond to a precise geometric arrangement where features come in "antipodal
pairs", each being exactly the negative of the other, allowing two features to be packed into each
hidden dimension. It appears that antipodal pairs are so effective that the model preferentially uses
them over a wide range of the sparsity regime.
It turns out that antipodal pairs are just the tip of the iceberg. Hiding underneath this curve are a
number of extremely specific geometric configurations of features.
FEATURE DIMENSIONALITY
In the previous section, we saw that there's a sticky regime where the model has "half a dimension
per feature" in some sense. This is an average statistical property of the features the model
represents, but it seems to hint at something interesting. Is there a way we could understand what
"fraction of a dimension" a specific feature gets?
We'll define the dimensionality of the ith feature, D_i, as:

D_i = ||W_i||² / ∑_j (Ŵ_i ⋅ W_j)²

where W_i is the column of W associated with the ith feature, and Ŵ_i is the unit vector in that direction.
Intuitively, the numerator represents the extent to which a given feature is represented, while the
denominator is "how many features share the dimension it is embedded in" by projecting each
feature onto its dimension. In the antipodal case, each feature participating in an antipodal pair will
have a dimensionality of D = 1/(1 + 1) = 1/2 while features which are not learned will have a
dimensionality of 0. Empirically, it seems that the dimensionalities of all features add up to the number
of embedding dimensions when the features are "packed efficiently" in some sense.
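Continuing the earlier code sketch, both the aggregate D* and the per-feature D_i are simple functions of W:

```python
def dimensionality(W: torch.Tensor):
    """W has shape (m, n). Returns D* = m / ||W||_F^2 and the per-feature D_i."""
    m = W.shape[0]
    norms_sq = (W ** 2).sum(dim=0)               # ||W_i||^2 for each feature
    D_star = m / norms_sq.sum()                  # dimensions per feature
    W_hat = W / norms_sq.clamp(min=1e-8).sqrt()  # unit directions
    denom = ((W_hat.T @ W) ** 2).sum(dim=1)      # sum_j (Ŵ_i . W_j)^2
    D_i = norms_sq / denom.clamp(min=1e-8)
    return D_star, D_i
```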
We can now break the above plot down on a per-feature basis. This reveals many more of these
"sticky points"! To help us understand this better, we're going to create a scatter plot annotated
with some additional information:
We start with the line plot we had in the previous section.
We overlay this with a scatter plot of the individual feature dimensionalities for each feature in
the models at each sparsity level.
The feature dimensionalities cluster at certain fractions, so we draw lines for those. (It turns
out that each fraction corresponds to a specific weight geometry – we'll discuss this shortly.)
We visualize the weight geometries for a few models with a "feature geometry graph" where each feature is a node and edge weights are based on the absolute value of the dot product between feature embedding vectors. So features are connected if they aren't orthogonal.
Let's look at the resulting plot, and then we'll try to figure out what it's showing us:
What is going on with the points clustering at specific fractions?? We'll see shortly that the model
likes to create specific weight geometries and kind of jumps between the different configurations.
In the previous section, we developed a theory of superposition as a phase change. But everything
on this plot between 0 (not learning a feature) and 1 (dedicating a dimension to a feature) is
superposition. Superposition is what happens when features have fractional dimensionality. That is
to say – superposition isn't just one thing!
How can we relate this to our original understanding of the phase change? We often think of water
as only having three phases: ice, water and steam. But this is a simplification: there are actually
many phases of ice, often corresponding to different crystal structures (eg. hexagonal vs cubic ice).
In a vaguely similar way, neural network features seem to also have many other phases within the
general category of "superposition."
This also suggests a possible reason why we observe 3D Thomson problem solutions, despite the
fact that we're actually studying a higher dimensional version of the problem. Just as many 3D
Thomson solutions are tegum products of 2D and 1D solutions, perhaps higher dimensional
solutions are often tegum products of 1D, 2D, and 3D solutions.
The orthogonality of factors in tegum products has interesting implications. For the purposes of
superposition, it means that there can't be any "interference" across tegum-factors. This may be
preferred by the toy model: having many features interfere simultaneously could be really bad for
it. (See related discussion in our earlier mathematical analysis.)
Non-Uniform Superposition
So far, this section has focused on the geometry of uniform superposition, where all features are of
equal importance, equal sparsity, and independent. The model is essentially solving a variant of the
Thomson problem. Because all features are the same, solutions corresponding to uniform polyhedra
get especially low loss. In this subsection, we'll study non-uniform superposition, where features
are somehow not uniform. They may vary in importance and sparsity, or have a correlational
structure that makes them not independent. This distorts the uniform geometry we saw earlier.
In practice, it seems like superposition in real neural networks will be non-uniform, so developing an
understanding of it seems important. Unfortunately, we're far from a comprehensive theory of the
geometry of non-uniform superposition at this point. As a result, the goal of this section will merely
be to highlight some of the more striking phenomena we observe:
Features varying in importance or sparsity causes smooth deformation of polytopes as
the imbalance builds, up until a critical breaking point at which they snap to another polytope.
Correlated features prefer to be orthogonal, often forming in different tegum factors. As a
result, correlated features may form an orthogonal local basis. When they can't be
orthogonal, they prefer to be side-by-side. In some cases correlated features merge into a
single feature: this hints at some kind of interaction between "superposition-like behavior"
and "PCA-like behavior".
Anti-correlated features prefer to be in the same tegum factor when superposition is
necessary. They prefer to have negative interference, ideally being antipodal.
We attempt to illustrate these phenomena with some representative experiments below.
PERTURBING A SINGLE FEATURE
The simplest kind of non-uniform superposition is to vary one feature and leave the others uniform.
As an experiment, let's represent n = 5 features in m = 2 dimensions. In the uniform case, with importance I = 1 and activation density 1 − S = 0.05, we get a regular pentagon. But if we vary one point – in this case by making it more or less sparse – we see the pentagon stretch to account for the new value. If we make it denser, activating more frequently (yellow), the other features repel from it, giving it more space. On the other hand, if we make it sparser, activating less frequently (blue), it takes less space and the other points push towards it.
If we make it sufficiently sparse, there's a phase change, and it collapses from a pentagon to a pair of digons with the sparser point at zero. The phase change corresponds to the loss curves for the two different geometries crossing over. (This observation allows us to directly confirm that it is genuinely a first-order phase change.)
To visualize the solutions, we canonicalize them, rotating them to align with each other in a
consistent manner.
These results seem to suggest that, at least in some cases, non-uniform superposition can be
understood as a deformation of uniform superposition and jumping between uniform superposition
configurations rather than a totally different regime. Since uniform superposition has a lot of
understandable structure, but real world superposition is almost certainly non-uniform, this seems
very promising!
The reason the pentagonal solutions are not on the unit circle is that models reduce the effect of positive interference by setting a slight negative bias to cut off noise and setting their weights to ||W_i|| = 1/(1 − b_i) to compensate. Distance from the unit circle can be interpreted as primarily driven by the amount of positive interference.
A note for reimplementations: optimizing with a two-dimensional hidden space makes this easier to study, but the actual optimization turns out to be really challenging for gradient descent – a lot harder than even just having three dimensions. Getting clean results required fitting each model multiple times and taking the solution with the lowest loss. However, there's a silver lining to this: visualizing the sub-optimal solutions on a scatter plot as above allows us to see the loss curves for different geometries and gain greater insight into the phase change.
As an experiment, we consider six features, organized into three sets of correlated pairs. Features in
each correlated pair are represented by a given color (red, green, and blue). The correlation is
created by having both features always activate together – they're either both zero or neither zero.
(The exact non-zero values they take when they activate is uncorrelated.)
As we vary the sparsity of the features, we find that in the very sparse regime, we observe
superposition as expected, with features arranged in a hexagon and correlated features side-by-
side. As we decrease sparsity, the features progressively "collapse" into their principal components.
In very dense regimes, the solution becomes equivalent to PCA.
These results seem to hint that PCA and superposition are in some sense complementary strategies
which trade off with one another. As features become more correlated, PCA becomes a better
strategy. As features become sparser, superposition becomes a better strategy. When features are
both sparse and correlated, mixtures of each strategy seem to occur. It would be nice to more
deeply understand this space of tradeoffs.
It's also interesting to think about this in the context of continuous equivariant features, such as
features which occur in different rotations.
The ϵ entries (which are solely an artifact of superposition "interference") create an obvious way for
an adversary to attack the most important feature. Note that this may remain true even in the infinite
data limit: the optimal behavior of the model fit to sparse infinite data is to use superposition to
represent more features, leaving it vulnerable to attack.
To test this, we generated L2 adversarial examples (allowing a max L2 attack norm of 0.1 of the
average input norm). We originally generated attacks with gradient descent, but found that for
extremely sparse examples where ReLU neurons are in the zero regime 99% of the time, attacks
were difficult, effectively due to gradient masking [32] . Instead, we found it worked better to analytically derive adversarial attacks by considering the optimal L2 attack for each feature (λ(W^T W)_i / ||(W^T W)_i||²) and taking whichever of these attacks most harms model performance.
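A sketch of that analytic procedure under our reading of the formula above (the scale is set by the norm budget; the original implementation may differ in details). It reuses the hypothetical loss_fn and ToyModel from earlier sketches:

```python
def analytic_l2_attack(model, x, importance, budget):
    """Try one candidate attack direction per feature and keep the most harmful one."""
    WtW = model.W.T @ model.W                           # (n, n); row i is (W^T W)_i
    best_delta, worst_loss = None, -float("inf")
    for i in range(WtW.shape[0]):
        direction = WtW[i] / WtW[i].norm().clamp(min=1e-8)
        delta = budget * direction                      # scale to the L2 budget
        loss = loss_fn(model(x + delta), x, importance)
        if loss > worst_loss:
            worst_loss, best_delta = loss, delta
    return best_delta
```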
h = ReLU(W x)
x′ = ReLU(W^T h + b)
Recall that we think of basis elements in the input as "features," and basis elements in the middle
layer as "neurons". Thus W is a map from features to neurons.
What we see in the above plot is that the features are aligning with neurons in a structured way!
Many of the neurons are simply dedicated to representing a feature! (This is the critical property
that justifies why neuron-focused interpretability approaches – such as much of the work in the
original Circuits thread – can be effective in some circumstances.)
Let's explore this in more detail.
VISUALIZING SUPERPOSITION IN TERMS OF NEURONS
Having a privileged basis opens up new possibilities for visualizing our models. As we saw above,
we can simply inspect W . We can also make a per-neuron stacked bar plot where, for every neuron,
we visualize its weights as a stack of rectangles on top of each other:
Each column in the stack plot visualizes one column of W .
Each rectangle represents one weight entry, with height corresponding to the absolute value.
The color of each rectangle corresponds to the feature it acts on (i.e. which row of W it's in).
Negative values go below the x-axis.
The order of the rectangles is not significant.
This stack plot visualization can be nice as models get bigger. It also makes polysemantic neurons
obvious: they simply correspond to having more than one weight.
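A minimal matplotlib sketch of such a plot (our own helper, not from the paper's notebook). Here W is indexed as (feature, neuron), matching the description above, so each stack-plot column is a column of W; if W is stored as neurons × features, pass its transpose:

```python
import matplotlib.pyplot as plt

def neuron_stack_plot(W):
    n_features, n_neurons = W.shape
    fig, ax = plt.subplots()
    for neuron in range(n_neurons):
        pos, neg = 0.0, 0.0                   # running tops of the +/- stacks
        for feature in range(n_features):
            w = float(W[feature, neuron])
            if w >= 0:
                ax.bar(neuron, w, bottom=pos, color=plt.cm.tab20(feature % 20))
                pos += w
            else:
                ax.bar(neuron, w, bottom=neg, color=plt.cm.tab20(feature % 20))
                neg += w                      # negative bars stack downward
    ax.axhline(0, color="black", linewidth=0.5)
    ax.set_xlabel("neuron")
    ax.set_ylabel("weights (stacked; negative below the axis)")
    return fig
```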
We'll now visualize a ReLU hidden layer toy model with n = 10; m = 5; I_i = 0.75^i and varying feature sparsity levels. We chose a very small model (only 5 neurons) both for ease of visualization, and to circumvent some issues with this toy model we'll discuss below.
However, we found that these small models were harder to optimize. For each model shown, we
trained 1000 models and visualized the one with the lowest loss. Although the typical solutions are
often similar to the minimal loss solutions shown, selecting the minimal loss solutions reveals even
more structure in how features align with neurons. It also reveals that there are ranges of sparsity
values where the optimal solution for all models trained on data with that sparsity have the same
weight configurations.
The solutions are visualized below, showing both the raw W and a neuron stacked bar plot. We color features in the stacked bar plot based on whether they're in superposition, and color neurons as monosemantic or polysemantic depending on whether they store more than one feature. Neuron order was chosen by hand (since it's arbitrary).
The most important thing to pay attention to is how there's a shift from monosemantic to
polysemantic neurons as sparsity increases. Monosemantic neurons do exist in some regimes!
Polysemantic neurons exist in others. And they can both exist in the same model! Moreover, while
it's not quite clear how to formalize this, it looks a great deal like there's a neuron-level phase
change, mirroring the feature phase changes we saw earlier.
It's also interesting to examine the structure of the polysemantic solutions, which turn out to be
surprisingly structured and neuron-aligned. Features typically correspond to sets of
neurons (monosemantic neurons might be seen as the special case where features only correspond
to singleton sets). There's also structure in how polysemantic the neurons are: they transition from monosemantic, to representing only a few features, to gradually representing more. However, it's unclear how much of this is generalizable to real models.
Computation in Superposition
So far, we've shown that neural networks can store sparse features in superposition and then
recover them. But we actually believe superposition is more powerful than this – we think that
neural networks can perform computation entirely in superposition rather than just using it as
storage. This model will also give us a more principled way to study a privileged basis where
features align with basis dimensions.
To explore this, we consider a new setup where we imagine our input and output layer to be the
layers of our hypothetical disentangled model, but have our hidden layer be a smaller layer we're
imagining to be the observed model which might use superposition. We'll then try to compute a
simple non-linear function and explore whether it can use superposition to do this. Since the model
will have (and need to use) the hidden layer non-linearity, we'll also see features align with a
privileged basis.
Specifically, we'll have the model compute y = abs(x). Absolute value is an appealing function to
study because there's a very simple way to compute it with ReLU neurons:
abs(x) = ReLU(x) + ReLU(−x). This simple structure will make it easy for us to study the
geometry of how the hidden layer is leveraged to do computation.
Since this model needs ReLU to compute absolute value, it doesn't have the issues the model in the
previous section had with trying to avoid the activation function.
Experiment Setup
The input feature vector, x, is still sparse, with each feature xi having probability Si of being 0.
However, since we want to have the model compute absolute value, we need to allow it to take on
non-positive values for this to be a non-trivial task. As a result, if it is non-zero, its value is now
sampled uniformly from [−1, 1]. The target output y is y = abs(x).
Following the previous section, we'll consider the "ReLU hidden layer" toy model variant, but no
longer tie the two weights to be identical:
h = ReLU(W_1 x)
y′ = ReLU(W_2 h + b)
The loss is still the mean squared error weighted by feature importances Ii as before.
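A sketch of this untied model, continuing the PyTorch sketch above (shapes follow the equations: W_1 maps features to neurons, W_2 maps neurons back to features; the class name is ours):

```python
class AbsToyModel(nn.Module):
    def __init__(self, n: int, m: int):
        super().__init__()
        self.W1 = nn.Parameter(torch.randn(m, n) * 0.1)  # features -> hidden neurons
        self.W2 = nn.Parameter(torch.randn(n, m) * 0.1)  # hidden neurons -> features
        self.b = nn.Parameter(torch.zeros(n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(x @ self.W1.T)              # h = ReLU(W1 x)
        return torch.relu(h @ self.W2.T + self.b)  # y' = ReLU(W2 h + b)
```

Inputs are now sampled from [−1, 1] when active and the target is y = abs(x); without superposition, a hand-constructed solution uses two neurons per feature, one reading +x_i and one reading −x_i, both wired into output i.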
Basic Results
With this model, it's a bit less straightforward to study how individual features get embedded; because of the ReLU on the hidden layer, we can't just study W_2^T W_1. And because W_2 and W_1 are now learned independently, we can't just study columns of W_1. We believe that with some
manipulation we could recover much of the simplicity of the earlier model by considering "positive
features" and "negative features" independently, but we're going to focus on another perspective
instead.
As we saw in the previous section, having a hidden layer activation function means that it makes
sense to visualize the weights in terms of neurons. We can visualize W directly or as a neuron stack
plot as we did before. We can also visualize it as a graph, which can sometimes be helpful for
understanding computation.
Let's look at what happens when we train a model with n = 3 features to perform absolute value on
m = 6 hidden layer neurons. Without superposition, the model needs two hidden layer neurons to
implement absolute value on one feature.
The resulting model – modulo a subtle issue about rescaling input and output weights 15 – performs absolute value exactly as one might expect. For each input feature x_i, it constructs a "positive side" neuron ReLU(x_i) and a "negative side" neuron ReLU(−x_i). It then adds these together to reconstruct the absolute value.
Superposition vs Sparsity
We've seen that – as expected – our toy model can learn to implement absolute value. But can it
use superposition to compute absolute value for more features? To test this, we train models with
n = 100 features and m = 40 neurons and a feature importance curve I_i = 0.8^i, varying feature sparsity. 16
One neuron does asymmetric superposition. In normal superposition, one might store features with equal weights (eg. W = [1, −1]) and then have equal output weights (W = [1, 1]). In asymmetric superposition, one stores the features with different magnitudes (eg. W = [2, −1/2]) and then has reciprocal output weights (eg. W = [1/2, 2]). This causes one feature to heavily interfere with the other. To avoid the consequences of that interference, the model has another neuron heavily inhibit the
feature in the case where there would have been positive interference. This essentially converts
positive interference (which could greatly increase the loss) into negative interference (which has
limited consequences due to the output ReLU).
There are a few other weights this doesn't explain. (We believe they're effectively small conditional
biases.) But this asymmetric superposition and inhibition pattern appears to be the primary story.
The Strategic Picture of Superposition
Although superposition is scientifically interesting, much of our interest comes from a pragmatic
motivation: we believe that superposition is deeply connected to the challenge of using
interpretability to make claims about the safety of AI systems. In particular, it is a clear challenge to what we see as the most promising path for being able to say that neural networks won't perform certain harmful behaviors, or for catching "unknown unknowns" safety problems. This is because superposition
is deeply linked to the ability to identify and enumerate over all features in a model, and the ability to
enumerate over all features would be a powerful primitive for making claims about model behavior.
We begin this section by describing how "solving superposition" in a certain sense is equivalent to
many strong interpretability properties which might be useful for safety. Next, we'll describe three
high level strategies one might take to "solving superposition." Finally, we'll describe a few other
additional strategic considerations.
Open Questions
This paper has shown that the superposition hypothesis is true in certain toy models. But if
anything, we're left with many more questions about it than we had at the start. In this final section, we review some of the questions which strike us as most important: what do we know, and what would we like future work to clarify?
Is there a statistical test for catching superposition?
How can we control whether superposition and polysemanticity occur? Put another way,
can we change the phase diagram such that features don't fall into the superposition regime?
Pragmatically, this seems like the most important question. L1 regularization of activations,
adversarial training, and changing the activation function all seem promising.
Are there any models of superposition which have a closed-form solution? Saxe et al.
[26] demonstrate that it's possible to create nice closed-form solutions for linear neural
networks. We made some progress towards this for the n = 2; m = 1 ReLU output model
(and Tom McGrath makes further progress in his comment), but it would be nice to solve this
more generally.
How realistic are these toy models? To what extent do they capture the important
properties of real models with respect to superposition? How can we tell?
Can we estimate the feature importance curve or feature sparsity curve of real
models? If one takes our toy models seriously, the most important properties for
understanding the problem are the feature importance and sparsity curves. Is there a way we
can estimate them for real models? (Likely, this would involve training models of varying sizes
or amounts of regularization, observing the loss and neuron sparsities, and trying to infer
something.)
Should we expect superposition to go away if we just scale enough? What assumptions
about the feature importance curve and sparsity would need to be true for that to be the
case? Alternatively, should we expect superposition to remain a constant fraction of
represented features, or even to increase as we scale?
Are we measuring the maximally principled things? For example, what is the most
principled definition of superposition / polysemanticity?
How important are polysemantic neurons? If X% of a model's neurons are interpretable and the remaining (100−X)% are polysemantic, how much should we believe we understand from understanding the X% of interpretable neurons? (See also the "feature packing principle" suggested above.)
How many features should we expect to be stored in superposition? This was briefly
discussed in the previous section. It seems like results from compressed sensing should be
able to give us useful upper-bounds, but it would be nice to have a clearer understanding –
and perhaps tighter bounds!
Does the apparent phase change we observe in features/neurons have any connection
to phase changes in compressed sensing?
How does superposition relate to non-robust features? An interesting paper by Gabriel
Goh (archive.org backup) explores features in a linear model in terms of the principal
components of the data. It focuses on a trade off between "usefulness" and "robustness" in
the principal component features, but it seems like one could also relate it to the
interpretability of features. How much would this perspective change if one believed the
superposition hypothesis – could it be that the useful, non-robust features are an artifact of
superposition?
To what extent can neural networks "do useful computation" on features in
superposition? Is the absolute value problem representative of computation in superposition
generally, or idiosyncratic? What class of computation is amenable to being performed in
superposition? Does it require a sparse structure to the computation?
How does superposition change if features are not independent? Can superposition pack
features more efficiently if they are anti-correlated?
Can models effectively use nonlinear representations? We suspect models will tend not to use them, but further experimentation could provide good evidence; see the appendix on nonlinear compression. For example, one could investigate the representations learned by autoencoders with multi-layer encoders and decoders and very small bottlenecks, trained on random uncorrelated data.
Related Work
INTERPRETABLE FEATURES
Our work is inspired by research exploring the features that naturally occur in neural networks. Many
models form at least some interpretable features. Word embeddings have semantic directions (see
[8] ). There is evidence of interpretable neurons in RNNs (e.g. [11, 12] ), convolutional neural networks
(see generally e.g. [13, 14, 41, 19] ; individual neuron families [6, 18] ), and in some limited cases,
transformer language models (see detailed discussion in our previous paper). However this work
has also found many "polysemantic" neurons which are not interpretable as a single concept [21] .
SUPERPOSITION
We're aware of two separate origins of the idea of superposition in neural networks. The first is the
superposition hypothesis explored in this paper. The existence of polysemantic neurons (described
in the previous section) led to the superposition hypothesis as one of the most plausible seeming
explanations [1] . This hypothesis is a kind of "feature level" superposition.
Separately, Cheung et al. [7] explore what one might describe as "model level" superposition: can
neural network parameters represent multiple completely independent models? Their investigation
is motivated by catastrophic forgetting, but seems quite related to the questions investigated in this
paper.
DISENTANGLEMENT
The goal of learning disentangled representations arises from Bengio et al.'s influential position
paper on representation learning [5] : "we would like our representations to disentangle the factors
of variation… to learn representations that separate the various explanatory sources." Since then, a literature has developed motivated by this goal, tending to focus on creating generative models which separate out major factors of variation in their latent spaces.
Concretely, disentanglement research often explores whether one can train a VAE or GAN where
basis dimensions correspond to the major features one might use to describe the problem (e.g.
rotation, lighting, gender… as relevant). In the language of this paper, the goal is to impose a strong privileged basis on the latent space of a generative model, which is often totally rotationally invariant by default. Early work often focused on semi-supervised approaches where the features were known in advance, but fully unsupervised approaches started to develop around 2016 [42, 43, 44].
How does superposition relate to disentanglement? Although our investigation was motivated
primarily by different examples, we see no reason to think that superposition doesn't also occur in
the latent spaces of generative models. If so, it may be that superposition is a major reason why
disentanglement is difficult. Superposition may allow generative models to be much more effective than they otherwise would be. Put another way, disentanglement often assumes a small
number of important latent variables explain the data. There are clearly examples of such variables,
like the orientation of objects – but what if a large number of sparse, rare, individually unimportant
features are collectively very important? Superposition would be the natural way for models to
represent this. 22
COMPRESSED SENSING
The toy problems we consider are quite similar to the problems considered in the field of
compressed sensing, which is also known as compressive sensing and sparse recovery. However,
there are some important differences:
Compressed sensing recovers vectors by solving an optimization problem using general
techniques, while our toy model must use a neural network layer. Compressed sensing
algorithms are, in principle, much more powerful than our toy model.
Compressed sensing works using the number of non-zero entries as the measure of sparsity,
while we use the probability that each dimension is zero as the sparsity. These are not wholly
unrelated: concentration of measure implies that our vectors have a bounded number of non-
zero entries with high probability.
Compressed sensing requires that the embedding matrix (usually called the measurement
matrix) have a certain “incoherent” structure [45] such as the restricted isometry property
[25] or nullspace property [46] . Our toy model learns the embedding matrix, and will often
simply ignore many input dimensions to make others easier to recover.
Features in our toy model have different "importances", which means the model will often
prefer to be able to recover “important” features more accurately, at the cost of not being
able to recover “less important” features at all.
In general, our toy model is solving a similar problem using less powerful machinery than compressed sensing algorithms, especially because its computational model is so much more restricted (just a single linear transformation and a non-linearity) compared to the arbitrary computation that might be used by a compressed sensing algorithm.
As a result, compressed sensing lower bounds—which give lower bounds on the dimension of the
embedding such that recovery is still possible—can be interpreted as giving an upper bound on the
amount of superposition in our toy model. In particular, in various compressed sensing settings, one
can recover an n-dimensional k-sparse vector from an m dimensional projection if and only if
m = Ω(k log(n/k)) [47, 48, 49] . While the connection is not entirely straightforward, we apply one
such result to the toy model in the appendix.
At first, this bound appears to allow a number of features that is exponential in m to be packed into
the m-dimensional embedding space. However, in our setting, the integer k for which all vectors
have at most k non-zero entries is determined by the fixed density parameter S as
k = O((1 − S)n). As a result, our bound is actually m = Ω(−n(1 − S) log(1 − S)). Therefore,
the number of features is linear in m but modulated by the sparsity. 23 This is good news if we are
hoping to eliminate superposition as a phenomenon! However, these bounds also allow for the
amount of superposition to increase dramatically with sparsity – hopefully this is an artifact of the
techniques in the proofs and not an inherent barrier to reducing or eliminating superposition.
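Up to constants, the substitution behind that statement is:

\[ m = \Omega\!\big(k \log(n/k)\big), \quad k = (1-S)\,n \;\;\Longrightarrow\;\; m = \Omega\!\Big((1-S)\,n \log \tfrac{1}{1-S}\Big) = \Omega\!\big(-n(1-S)\log(1-S)\big). \]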
A striking parallel between our toy model and compressed sensing is the existence of phase
changes. 24 In compressed sensing, if one considers a two-dimensional space defined by the
sparsity and dimensionality of the vectors, there are sharp phase changes where the vector can
almost surely be recovered in one regime and almost surely not in the other [50, 51] . It isn't
immediately obvious how to connect these phase changes in compressed sensing – which apply to
recovery of the entire vector, rather than one particular component – to the phase changes we
observe in features and neurons. But the parallel is suspicious.
Another interesting line of work has tried to build useful sparse recovery algorithms using neural
networks [52, 53, 54] . While we find it useful for analysis purposes to view the toy model as a sparse
recovery algorithm, so that we may apply sparse recovery lower bounds, we do not expect that the
toy model is useful for the problem of sparse recovery. However, there may be an exciting
opportunity to relate our understanding of the phenomenon of superposition to these and other
techniques.
THEORIES OF NEURAL CODING AND REPRESENTATION
Our work explores representations in artificial “neurons”. Neuroscientists study similar questions in
biological neurons. There are a variety of theories for how information could be encoded by a group
of neurons. At one extreme is a local code, in which every individual stimulus is represented by a
separate neuron. At the other extreme is a maximally-dense distributed code, in which the
information-theoretic capacity of the population is fully utilized, and every neuron in the population
plays a necessary role in representing every input.
One challenge in comparing our work with the neuroscience literature is that a “distributed
representation” seems to mean different things. Consider an overly-simplified example of a
population of neurons, each taking a binary value of active or inactive, and a stimulus set of sixteen
items: four shapes, with four colors (example borrowed from [4] ). A “local code” would be one with
a “red triangle” neuron, a “red square” neuron, and so on. In what sense could the representation be
made more “distributed”? One sense is by representing independent features separately — e.g. four
“shape” neurons and four “color” neurons. A second sense is by representing more items than
neurons — i.e. using a binary code over four neurons to encode 2^4 = 16 stimuli. In our framework,
these senses correspond to decomposability (representing stimuli as compositions of independent
features) and superposition (representing more features than neurons, at cost of interference if
features co-occur).
Decomposability doesn’t necessarily mean each feature gets its own neuron. Instead, it could be
that each feature corresponds to a “direction in activation-space” 25 , given scalar “activations”
(which in biological neurons would be firing rate). Then, only if there is a privileged basis, “feature
neurons” are incentivized to develop. In biological neurons, metabolic considerations are often
hypothesized to induce a privileged basis, and thus a “sparse code”. This would be expected if the
nervous system’s energy expenditure increases linearly or sublinearly with firing rate. 26 Additionally,
neurons are the units by which biological neural networks can implement non-linear
transformations, so if a feature needs to be non-linearly transformed, a “feature neuron” is a good
way to achieve that.
Any decomposable linear code that uses orthogonal feature vectors is functionally equivalent from
the viewpoint of a linear readout. So, a code can both be “maximally distributed” — in the sense that
every neuron participates in representing every input, making each neuron extremely polysemantic
— and also have no more features than it has dimensions. In this conception, it’s clear that a code
can be fully “distributed” and also have no superposition.
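The following small sketch (ours, not from the paper's notebooks) illustrates this point: rotating an orthogonal feature basis makes every neuron participate in every input, yet a linear readout recovers the features exactly, with no superposition and no interference.

```python
# Axis-aligned vs. rotated orthogonal feature codes: both are lossless for a
# linear readout, even though the rotated code has maximally polysemantic neurons.
import numpy as np

rng = np.random.default_rng(0)
n = 8                                   # features == dimensions, no superposition
features = rng.uniform(size=(1000, n))  # feature activations for 1000 stimuli

# Axis-aligned code: each feature gets its own neuron.
acts_local = features @ np.eye(n)

# "Distributed" code: features embedded along a random orthonormal basis Q,
# so every neuron responds to every feature.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
acts_distributed = features @ Q

# A linear readout (here just Q^T) recovers the features exactly in both cases.
recovered = acts_distributed @ Q.T
print(np.allclose(recovered, features))   # True: fully distributed, yet no superposition
```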
A notable difference between our work, and the neuroscience literature we have encountered, is
that we consider as a central concept the likelihood that features co-occur with some probability. 27
A “maximally-dense distributed code” makes the most sense in the case where items never co-
occur; if the network only needs to represent one item at a time, it can tolerate a very extreme
degree of superposition. By contrast, a network that could plausibly need to represent all the items
at once can do so without interference between the items if it uses a code with no superposition.
One example of high feature co-occurrence could be encoding spatial frequency in a receptive
field; these visual neurons need to be able to represent white noise, which has energy at all
frequencies. An example of limited co-occurrence could be a motor “reach” task to discrete targets,
far enough apart that only one can be reached at a time.
One hypothesis in neuroscience is that highly compressed representations might have an important
use in long-range communication between brain areas [57] . Under this theory, sparse
representations are used within a brain area to do computation, and then are compressed for
transmission across a small number of axons. Our experiments with the absolute value toy model
show that networks can do useful computation even under a code with a moderate degree of
superposition. This suggests that all neural codes, not just those used for efficient communication,
could plausibly be “compressed” to some degree; the regional code might not necessarily need to
be decompressed to a fully sparse one.
It's worth noting that the term "distributed representation" is also used in deep learning, and has
the same ambiguities of meaning there. Our sense is that some influential early works (e.g. [5] ) may
have primarily meant the "independent features are represented independently"
decomposability sense, but we believe that other work intends to suggest something similar to what
we call superposition.
Comments & Replications
Inspired by the original Circuits Thread and Distill's Discussion Article experiment, the authors
invited several external researchers who we had previously discussed our preliminary results with to
comment on this work. Their comments are included below.
Making this substitution renders the integral analytically tractable, which allows us to plot the
full loss surface and solve for the loss minima directly. We show some example loss surfaces
below:
Although many of these loss surfaces (Figure 1a, 1b) have minima qualitatively similar to one of
the network weights used in the section Superposition as a Phase Change, we also find a new
phase where W1 ≃ W2 ≃ 1/2: weights are similar rather than antipodal. This ‘confused
feature’ regime occurs when sparsity is low and both features are important (Figure 1c). This
could account for the Discrete "Energy Level" Jumps phenomenon as solutions hop between minima.
In some cases (e.g. when parameters approach those needed for a phase
transition) the global minimum can have a considerably smaller basin of attraction than local
minima. The transition between the antipodal and confused-feature solutions appears to be
discontinuous.
Original Authors' Response: This closed form analysis of the n = 2, m = 1 case is
fascinating. We hadn't realized that W1 ≃ W2 ≃ 1/2 could be a solution without correlated
features! The clarification of the "blurry behavior" and the observation about local minima are
also very interesting. More generally, we're very grateful for the independent replication of our
core results.
REPLICATION
Jeffrey Wu and Dan Mossing are members of the Alignment team at OpenAI.
We are very excited about these toy models of polysemanticity. This work sits at a rare
intersection of being plausibly very important for training more interpretable models and being
very simple and elegant. The results have been surprisingly easy to replicate -- we have
reproduced (with very little fuss) plots similar to those in the Demonstrating Superposition –
Basic Results, Geometry – Feature Dimensionality, and Learning Dynamics – Discrete "Energy
Level" Jumps sections.
Original Authors' Response: We really appreciate this replication of our basic results. Some
of our findings were quite surprising to us, and this gives us more confidence that they aren't
the result of an idiosyncratic quirk or bug in our implementations.
Code
We provide a notebook to reproduce some of the core diagrams in this article here. (It isn't comprehensive, since we
needed to rewrite code for our experiments to run outside our codebase.) We provide a separate notebook for the
theoretical phase change diagrams.
Note that the reproductions by other researchers mentioned in comments above were not based on this code, but are
instead fully independent replications with clean code from the description in an early draft of this article.
Acknowledgments
We're extremely grateful to a number of colleagues across several organizations for their invaluable support in our writing
of this paper.
Jeff Wu, Daniel Mossing, Tom McGrath, and Kshitij Sachan did independent replications of many of our experiments,
greatly increasing our confidence in our results. Kshitij Sachan's and Tom McGrath's additional investigations and
insightful questions both pushed us to clarify our understanding of the superposition phase change (both as reflected in
this paper, and in further understanding which we learned from them not captured here). Buck Shlegeris, Adam Scherlis,
and Adam Jermyn shared valuable insights into the mathematical nature of the toy problem and related work. Adam
Jermyn also coined the term "virtual neurons."
Gabriel Goh, Neel Nanda, Vladimir Mikulik, and Nick Cammarata gave detailed feedback which improved the paper, in
addition to being motivating. Alex Dimakis, Piotr Indyk, and Dan Yamins generously took time to discuss these results with us
and give advice on how they might connect to their area of expertise. Finally, we benefited from the feedback and
comments of James Bradbury, Sebastian Farquhar, Shan Carter, Patrick Mineault, Alex Tamkin, Paul Christiano, Evan
Hubinger, Ian McKenzie, and Sid Black.
Finally, we're very grateful to all our colleagues at Anthropic for their advice and support: Daniela Amodei, Jack Clark,
Tom Brown, Ben Mann, Nick Joseph, Danny Hernandez, Amanda Askell, Kamal Ndousse, Andy Jones, Timothy Telleen-
Lawton, Anna Chen, Yuntao Bai, Jeffrey Ladish, Deep Ganguli, Liane Lovitt, Nova DasSarma, Jia Yuan Loke, Jackson
Kernion, Tom Conerly, Scott Johnston, Jamie Kerr, Sheer El Showk, Stanislav Fort, Rebecca Raible, Saurav Kadavath,
Rune Kvist, Jarrah Bloomfield, Eli Tran-Johnson, Rob Gilson, Guro Khundadze, Filipe Dobreira, Ethan Perez, Sam
Bowman, Sam Ringer, Sebastian Conybeare, Jeeyoon Hyun, Michael Sellitto, Jared Mueller, Joshua Landau, Cameron
McKinnon, Sandipan Kundu, Jasmine Brazilek, Da Yan, Robin Larson, Noemí Mercado, Anna Goldie, Azalia Mirhoseini,
Jennifer Zhou, Erick Galankin, James Sully, Dustin Li, James Landis.
Author Contributions
Basic Results - The basic toy model results demonstrating the existence of superposition were done by Nelson Elhage
and Chris Olah. Chris suggested the toy model and Nelson ran the experiments.
Phase Change - Chris Olah ran the empirical phase change experiments, with help from Nelson Elhage. Martin
Wattenberg introduced the theoretical model where exact losses for specific weight configurations can be computed.
Geometry - The uniform superposition geometry results were discovered by Nelson Elhage and Nicholas Schiefer, with
help from Chris Olah. Nelson discovered the original mysterious "stickiness" of m/∣∣W∣∣²F. Chris introduced the definition of
feature dimensionality. Nicholas and Nelson then investigated the polytopes that formed. As for non-uniform
superposition, Martin Wattenberg performed the initial investigations of the resulting geometry, focusing on the behavior
of correlated features. Chris extended this with an investigation of the role of relative feature importance and sparsity.
Learning Dynamics - Nelson Elhage discovered the "energy level jump" phenomenon, in collaboration with Nicholas
Schiefer and Chris Olah. Martin Wattenberg discovered the "geometric transformations" phenomenon.
Adversarial Examples - Chris Olah and Catherine Olsson found evidence of a connection between superposition and
adversarial examples.
Superposition with a Privileged Basis / Doing Computation - Chris Olah did the basic investigation of superposition in
a privileged basis. Nelson Elhage, with help from Chris, investigated the "absolute value" model which provided a more
principled demonstration of superposition and showed that computation could be done while in superposition. Nelson
discovered the "asymmetric superposition" motif.
Theory - The theoretical picture articulated over the course of this paper (especially in the "mathematical
understanding" section) was developed in conversations between all authors, but especially Chris Olah, Jared Kaplan,
Martin Wattenberg, Nelson Elhage, Tristan Hume, Tom Henighan, Catherine Olsson, Nicholas Schiefer, Dawn Drain,
Shauna Kravec, Roger Grosse, Robert Lasenby, and Sam McCandlish. Jared introduced the strategy of rewriting the loss
by grouping terms with the number of active features. Both Jared and Martin independently noticed the value of
investigating the n = 2, m = 1 case as the simplest case to understand. Nicholas and Dawn clarified our understanding
of the connection to compressed sensing.
Strategic Picture - The strategic picture articulated in this paper – What does superposition mean for interpretability
and safety? What would a suitable solution be? How might one solve it? – developed in extensive conversations between
authors, and in particular Chris Olah, Tristan Hume, Nelson Elhage, Dario Amodei, Jared Kaplan. Nelson Elhage
recognized the potential importance of "enumerative safety", further articulated by Dario. Tristan brainstormed
extensively about ways one might solve superposition and pushed Chris on this topic.
Writing - The paper was primarily drafted by Chris Olah, with some sections by Nelson Elhage, Tristan Hume, Martin
Wattenberg, and Catherine Olsson. All authors contributed to editing, with particularly significant contributions from Zac
Hatfield Dodds, Robert Lasenby, Kipply Chen, and Roger Grosse.
Illustration - The paper was primarily illustrated by Chris Olah, with help from Tristan Hume, Nelson Elhage, and
Catherine Olsson.
Citation Information
Please cite as:
Elhage, et al., "Toy Models of Superposition", Transformer Circuits Thread, 2022.
BibTeX Citation:
@article{elhage2022superposition,
title={Toy Models of Superposition},
author={Elhage, Nelson and Hume, Tristan and Olsson, Catherine and Schiefer, Nicholas and Henighan,
Tom and Kravec, Shauna and Hatfield-Dodds, Zac and Lasenby, Robert and Drain, Dawn and Chen, Carol and
Grosse, Roger and McCandlish, Sam and Kaplan, Jared and Amodei, Dario and Wattenberg, Martin and Olah,
Christopher},
year={2022},
journal={Transformer Circuits Thread},
note={https://transformer-circuits.pub/2022/toy_model/index.html}
}
Footnotes
1. Where “importance” is a scalar multiplier on mean squared error loss. [↩]
2. In the context of vision, these have ranged from low-level neurons like curve detectors [6] and high-low frequency
detectors [18] , to more complex neurons like oriented dog-head detectors or car detectors [1] , to extremely abstract
neurons corresponding to famous people, emotions, geographic regions, and more [19] . In language models, researchers
have found word embedding directions such as a male-female or singular-plural direction [8] , low-level neurons
disambiguating words that occur in multiple languages, much more abstract neurons, and "action" output neurons that
help produce certain words [2] . [↩]
3. This definition is trickier than it seems. Specifically, something is a feature if there exists a large enough model size such
that it gets a dedicated neuron. This creates a kind of "epsilon-delta"-like definition. Our present understanding – as we'll
see in later sections – is that arbitrarily large models can still have a large fraction of their features be in superposition.
However, for any given feature, assuming the feature importance curve isn't flat, it should eventually be given a
dedicated neuron. This definition can be helpful in saying that something is a feature – curve detectors are a feature
because you find them across a range of models larger than some minimal size – but unhelpful for the much more
common case of features we only hypothesize about or observe in superposition. [↩]
4. A famous book by Lakatos [23] illustrates the importance of uncertainty about definitions and how important rethinking
definitions often is in the context of research. [↩]
5. This experiment setup could also be viewed as an autoencoder reconstructing x. [↩]
6. A vision model of sufficient generality might benefit from representing every species of plant and animal and every
manufactured object which it might potentially see. A language model might benefit from representing each person who
has ever been mentioned in writing. These are only scratching the surface of plausible features, but already there seem
to be more than any model has neurons. In fact, large language models demonstrably know about people of very
modest prominence – presumably more such people than they have neurons. This point is a common argument in
discussions of the plausibility of "grandmother neurons" in neuroscience, but seems even stronger for artificial neural
networks. [↩]
7. For computational reasons, we won't focus on it in this article, but we often imagine an infinite number of features with
importance asymptotically approaching zero. [↩]
8. The choice to have features distributed uniformly is arbitrary. An exponential or power law distribution would also be very
natural. [↩]
9. Recall that Wᵀ = W⁻¹ if W is orthonormal. Although W can't be literally orthonormal, our intuition from compressed
sensing is that it will be "almost orthonormal" in the sense of Candes & Tao [25] . [↩]
10. We have the model be x′ = WᵀWx, but leave x Gaussianly distributed as in Saxe. [↩]
11. As a brief aside, it's interesting to contrast the linear model interference, ∑ᵢ≠ⱼ ∣Wᵢ ⋅ Wⱼ∣², to the notion of coherence in
compressed sensing, maxᵢ≠ⱼ ∣Wᵢ ⋅ Wⱼ∣. We can see them as the L2 and L∞ norms of the same vector. [↩]
12. To prove that superposition is never optimal in a linear model, solve for the gradient of the loss being zero or consult
Saxe et al. [↩]
13. Here, we use “phase change” in the generalized sense of “discontinuous change”, rather than in the more technical sense
of a discontinuity arising in the limit of infinite system size. [↩]
14. Scaling the importance of all features by the same amount simply scales the loss, and does not change the optimal
solutions. [↩]
15. Note that there's a degree of freedom for the model in learning W1 : We can rescale any hidden unit by scaling its row of
W1 by α, and its column of W2 by α⁻¹, and arrive at the same model. For consistency in the visualization, we rescale
each hidden unit before visualizing so that the largest-magnitude weight to that neuron from W1 has magnitude 1. [↩]
16. These specific values were chosen to illustrate the phenomenon we're interested in: the absolute value model learns
more easily when there are more neurons, but we wanted to keep the numbers small enough that it could be easily
visualized. [↩]
17. One question you might ask is whether we can quantify the ability of superposition to enable extra computation by
examining the loss. Unfortunately, we can't easily do this. Superposition occurs when we change the task, making it
sparser. As a result, the losses of models with different amounts of superposition are not comparable – they're
measuring the loss on different tasks! [↩]
18. Ultimately we want to say that a model doesn't implement some class of behaviors. Enumerating over all features makes
it easy to say a feature doesn't exist (e.g. "there is no 'deceptive behavior' feature") but that isn't quite what we want. We
expect models that need to represent the world to represent unsavory behaviors. But it may be possible to build more
subtle claims such as "all 'deceptive behavior' features do not participate in circuits X, Y and Z." [↩]
19. Superposition also makes it harder to find interpretable directions in a model without a privileged basis. Without
superposition, one could try to do something like the Gram–Schmidt process, progressively identifying interpretable
directions and then removing them to make future features easier to identify. But with superposition, one can't simply
remove a direction even if one knows that it is a feature direction. [↩]
20. More formally, given a matrix H ∼ [d, m] = [h₀, h₁, …] of hidden layer activations h ∼ [m] sampled over d stimuli, if
we believe there are n underlying features, we can try to find matrices A ∼ [d, n] and B ∼ [n, m] such that A is
sparse. [↩]
21. In particular, it seems like we should expect to be able to reduce superposition at least a little bit with essentially no
effect on performance, just by doing something like L1 regularization without any architectural changes. Note that
models should have a level of superposition where the derivative of loss with respect to the amount of superposition is
zero – otherwise, they'd use more or less superposition. As a result, there should be at least some margin within which
we can reduce the amount of superposition without affecting model performance. [↩]
22. A more subtle issue is that GANs and VAEs often assume that their latent space is Gaussianly distributed. Sparse latent
variables are very non-Gaussian, but the central limit theorem means that the superposition of many such variables will
gradually look more Gaussian. So the latent spaces of some generative models may in fact force models to use
superposition! [↩]
23. Note that this has a nice information-theoretic interpretation: −log(1 − S) is the surprisal of a given dimension being non-
zero, and it is multiplied by n(1 − S), the expected number of non-zeros. [↩]
24. Note that in the compressed sensing case, the phase transition is in the limit as the number of dimensions becomes large
- for finite-dimensional spaces, the transition is fast but not discontinuous. [↩]
25. We haven’t encountered a specific term in the distributed coding literature that corresponds to this hypothesis
specifically (which may be due to ignorance on our part), although the idea of a “direction in activation-space” is common in
the literature. We call this hypothesis linearity. [↩]
26. Experimental evidence seems to support this [55]. [↩]
27. A related, but different, concept in the neuroscience literature is the “binding problem” [56]. A red triangle, for example,
is a co-occurrence of exactly one shape and exactly one color, which poses no representational challenge on its own; a
binding problem arises if a decomposed code must simultaneously also represent a blue square: which shape feature goes
with which color feature? Our work does not engage with the binding question, merely treating this as a co-occurrence of
“blue”, “red”, “triangle”, and “square”. [↩]
References
1. Zoom In: An Introduction to Circuits
Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M. and Carter, S., 2020. Distill. DOI: 10.23915/distill.00024.001
2. Softmax Linear Units
Elhage, N., Hume, T., Olsson, C., Nanda, N., Henighan, T., Johnston, S., ElShowk, S., Joseph, N., DasSarma, N., Mann, B.,
Hernandez, D., Askell, A., Ndousse, K., Jones, A., Drain, D., Chen, A., Bai, Y., Ganguli, D., Lovitt, L., Hatfield-Dodds, Z.,
Kernion, J., Conerly, T., Kravec, S., Fort, S., Kadavath, S., Jacobson, J., Tran-Johnson, E., Kaplan, J., Clark, J., Brown, T.,
McCandlish, S., Amodei, D. and Olah, C., 2022. Transformer Circuits Thread.
3. Compressed sensing
Donoho, D.L., 2006. IEEE Transactions on information theory, Vol 52(4), pp. 1289--1306. IEEE.
4. Local vs. Distributed Coding
Thorpe, S.J., 1989. Intellectica, Vol 8, pp. 3--40.
5. Representation learning: A review and new perspectives
Bengio, Y., Courville, A. and Vincent, P., 2013. IEEE transactions on pattern analysis and machine intelligence, Vol 35(8),
pp. 1798--1828. IEEE.
6. Curve Detectors [link]
Cammarata, N., Goh, G., Carter, S., Schubert, L., Petrov, M. and Olah, C., 2020. Distill.
7. Superposition of many models into one
Cheung, B., Terekhov, A., Chen, Y., Agrawal, P. and Olshausen, B., 2019. Advances in neural information processing
systems, Vol 32.
8. Linguistic regularities in continuous space word representations [PDF]
Mikolov, T., Yih, W. and Zweig, G., 2013. Proceedings of the 2013 conference of the north american chapter of the
association for computational linguistics: Human language technologies, pp. 746--751.
9. Linguistic regularities in sparse and explicit word representations
Levy, O. and Goldberg, Y., 2014. Proceedings of the eighteenth conference on computational natural language learning,
pp. 171--180.
10. Unsupervised representation learning with deep convolutional generative adversarial networks
Radford, A., Metz, L. and Chintala, S., 2015. arXiv preprint arXiv:1511.06434.
11. Visualizing and understanding recurrent networks [PDF]
Karpathy, A., Johnson, J. and Fei-Fei, L., 2015. arXiv preprint arXiv:1506.02078.
12. Learning to generate reviews and discovering sentiment [PDF]
Radford, A., Jozefowicz, R. and Sutskever, I., 2017. arXiv preprint arXiv:1704.01444.
13. Object detectors emerge in deep scene cnns [PDF]
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. and Torralba, A., 2014. arXiv preprint arXiv:1412.6856.
14. Network Dissection: Quantifying Interpretability of Deep Visual Representations [PDF]
Bau, D., Zhou, B., Khosla, A., Oliva, A. and Torralba, A., 2017. Computer Vision and Pattern Recognition.
15. Understanding the role of individual units in a deep neural network
Bau, D., Zhu, J., Strobelt, H., Lapedriza, A., Zhou, B. and Torralba, A., 2020. Proceedings of the National Academy of
Sciences, Vol 117(48), pp. 30071--30078. National Acad Sciences.
16. On the importance of single directions for generalization [PDF]
Morcos, A.S., Barrett, D.G., Rabinowitz, N.C. and Botvinick, M., 2018. arXiv preprint arXiv:1803.06959.
17. On Interpretability and Feature Representations: An Analysis of the Sentiment Neuron
Donnelly, J. and Roegiest, A., 2019. European Conference on Information Retrieval, pp. 795--802.
18. High-Low Frequency Detectors
Schubert, L., Voss, C., Cammarata, N., Goh, G. and Olah, C., 2021. Distill. DOI: 10.23915/distill.00024.005
19. Multimodal Neurons in Artificial Neural Networks
Goh, G., Cammarata, N., Voss, C., Carter, S., Petrov, M., Schubert, L., Radford, A. and Olah, C., 2021. Distill. DOI:
10.23915/distill.00030
20. Convergent learning: Do different neural networks learn the same representations?
Li, Y., Yosinski, J., Clune, J., Lipson, H., Hopcroft, J.E. and others, 2015. FE@ NIPS, pp. 196--212.
21. Feature Visualization [link]
Olah, C., Mordvintsev, A. and Schubert, L., 2017. Distill. DOI: 10.23915/distill.00007
22. Adversarial examples are not bugs, they are features
Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B. and Madry, A., 2019. Advances in neural information processing
systems, Vol 32.
23. Proofs and refutations
Lakatos, I., 1963. Nelson London.
24. Sparse coding with an overcomplete basis set: A strategy employed by V1?
Olshausen, B.A. and Field, D.J., 1997. Vision research, Vol 37(23), pp. 3311--3325. Elsevier.
25. Decoding by linear programming
Candes, E.J. and Tao, T., 2005. IEEE transactions on information theory, Vol 51(12), pp. 4203--4215. IEEE.
26. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
Saxe, A.M., McClelland, J.L. and Ganguli, S., 2014.
27. In-context Learning and Induction Heads [HTML]
Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly,
T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K.,
Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S. and Olah, C., 2022. Transformer Circuits Thread.
28. A Mechanistic Interpretability Analysis of Grokking [link]
Nanda, N. and Lieberum, T., 2022.
29. Grokking: Generalization beyond overfitting on small algorithmic datasets
Power, A., Burda, Y., Edwards, H., Babuschkin, I. and Misra, V., 2022. arXiv preprint arXiv:2201.02177.
30. The surprising simplicity of the early-time learning dynamics of neural networks
Hu, W., Xiao, L., Adlam, B. and Pennington, J., 2020. Advances in Neural Information Processing Systems, Vol 33, pp.
17116--17128.
31. A mathematical theory of semantic development in deep neural networks
Saxe, A.M., McClelland, J.L. and Ganguli, S., 2019. Proceedings of the National Academy of Sciences, Vol 116(23), pp.
11537--11546. National Acad Sciences.
32. Towards the science of security and privacy in machine learning
Papernot, N., McDaniel, P., Sinha, A. and Wellman, M., 2016. arXiv preprint arXiv:1611.03814.
33. Adversarial spheres
Gilmer, J., Metz, L., Faghri, F., Schoenholz, S.S., Raghu, M., Wattenberg, M. and Goodfellow, I., 2018. arXiv preprint
arXiv:1801.02774.
34. Adversarial robustness as a prior for learned representations
Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Tran, B. and Madry, A., 2019. arXiv preprint arXiv:1906.00945.
35. Delving into transferable adversarial examples and black-box attacks
Liu, Y., Chen, X., Liu, C. and Song, D., 2016. arXiv preprint arXiv:1611.02770.
36. An introduction to systems biology: design principles of biological circuits
Alon, U., 2019. CRC press. DOI: 10.1201/9781420011432
37. The Building Blocks of Interpretability [link]
Olah, C., Satyanarayan, A., Johnson, I., Carter, S., Schubert, L., Ye, K. and Mordvintsev, A., 2018. Distill. DOI:
10.23915/distill.00010
38. Visualizing Weights [link]
Voss, C., Cammarata, N., Goh, G., Petrov, M., Schubert, L., Egan, B., Lim, S.K. and Olah, C., 2021. Distill. DOI:
10.23915/distill.00024.007
39. A Review of Sparse Expert Models in Deep Learning
Fedus, W., Dean, J. and Zoph, B., 2022. arXiv preprint arXiv:2209.01667.
40. A Mathematical Framework for Transformer Circuits [HTML]
Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma,
N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D.,
Brown, T., Clark, J., Kaplan, J., McCandlish, S. and Olah, C., 2021. Transformer Circuits Thread.
41. An Overview of Early Vision in InceptionV1
Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M. and Carter, S., 2020. Distill. DOI: 10.23915/distill.00024.002
42. beta-vae: Learning basic visual concepts with a constrained variational framework
Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S. and Lerchner, A., 2016.
43. Infogan: Interpretable representation learning by information maximizing generative adversarial nets
Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I. and Abbeel, P., 2016. Advances in neural information
processing systems, Vol 29.
44. Disentangling by factorising
Kim, H. and Mnih, A., 2018. International Conference on Machine Learning, pp. 2649--2658.
45. Uncertainty principles and ideal atomic decomposition
Donoho, D.L., Huo, X. and others, 2001. IEEE transactions on information theory, Vol 47(7), pp. 2845--2862. Citeseer.
46. Compressed sensing and best k-term approximation
Cohen, A., Dahmen, W. and DeVore, R., 2009. Journal of the American mathematical society, Vol 22(1), pp. 211--231.
47. A remark on compressed sensing
Kashin, B.S. and Temlyakov, V.N., 2007. Mathematical notes, Vol 82(5), pp. 748--755. Springer.
48. Information-theoretic bounds on sparsity recovery in the high-dimensional and noisy setting
Wainwright, M., 2007. 2007 IEEE International Symposium on Information Theory, pp. 961--965.
49. Lower bounds for sparse recovery
Do Ba, K., Indyk, P., Price, E. and Woodruff, D.P., 2010. Proceedings of the twenty-first annual ACM-SIAM symposium on
Discrete Algorithms, pp. 1190--1197.
50. Neighborly polytopes and sparse solution of underdetermined linear equations
Donoho, D.L., 2005.
51. Compressed sensing: How sharp is the RIP
Blanchard, J.D., Cartis, C. and Tanner, J., 2009. SIAM Rev., accepted, Vol 10, pp. 090748160.
52. A deep learning approach to structured signal recovery
Mousavi, A., Patel, A.B. and Baraniuk, R.G., 2015. 2015 53rd annual allerton conference on communication, control, and
computing (Allerton), pp. 1336--1343.
53. Learned D-AMP: Principled neural network based compressive image recovery
Metzler, C., Mousavi, A. and Baraniuk, R., 2017. Advances in Neural Information Processing Systems, Vol 30.
54. Compressed Sensing using Generative Models [HTML]
Bora, A., Jalal, A., Price, E. and Dimakis, A.G., 2017. Proceedings of the 34th International Conference on Machine
Learning, Vol 70, pp. 537--546. PMLR.
55. Average firing rate rather than temporal pattern determines metabolic cost of activity in thalamocortical relay neurons
Yi, G. and Grill, W., 2019. Scientific reports, Vol 9(1), pp. 6940. DOI: 10.1038/s41598-019-43460-8
56. Distributed representations
Plate, T., 2003. Cognitive Science, pp. 1-15.
57. Compressed Sensing, Sparsity, and Dimensionality in Neuronal Information Processing and Data Analysis [link]
Ganguli, S. and Sompolinsky, H., 2012. Annual Review of Neuroscience, Vol 35(1), pp. 485-508. DOI: 10.1146/annurev-
neuro-062111-150410
Nonlinear Compression
This paper focuses on the assumption that representations are linear. But what if models don’t use linear feature
directions to represent information? What might such a thing concretely look like?
Neural networks have nonlinearities that make it theoretically possible to compress information even more compactly
than a linear superposition. There are reasons we think models are unlikely to pervasively use nonlinear compression
schemes:
The model needs to decompress things before it can compute with them naturally: Most of the computation in
the model is linear, so this kind of compression is likely only worth it to save space in the residual stream across
many layers before being decompressed to be computed with linearly again.
They’re probably difficult to learn: Nonlinear compression schemes may require finely tuned approximations of
discontinuities, and for the compression and decompression to line up, and may be difficult for gradient descent
to learn.
They probably take enough neurons that the benefit over superposition isn’t worth it:
Representing the piecewise linear functions in the simple example with ReLU neurons, using the universal
function approximation result that each line segment takes two neurons, would require 12 neurons per Z
segment. It therefore only starts to beat linear compression at a combined 36 neurons for the compression
and decompression.
This comparison has only one hidden dimension and dense features, which is somewhat of a degenerate
case for superposition. Superposition is much more powerful for compression of sparse features in many
dimensions. We suspect in large models the scaling is in favor of superposition, although this is just
intuition and it’s possible that scaling nonlinear compression is competitive.
Regardless of whether large models end up using nonlinear compression, it should be possible to view directions being
used with nonlinear compression as linear feature directions and reverse engineer the computation being used for
compression like any other circuit. If this kind of encoding is pervasive throughout the network then it may merit some
kind of automated decoding. It shouldn’t pose a fundamental challenge to interpretability unless the model learns a
scheme for doing complex computation while staying in a complicated nonlinear representation, which we suspect is
unlikely.
To help provide intuition, the simplest example of what a nonlinear compression scheme might look like is compressing
two [0,1) dimensions x and y into a single [0,1) dimension t:
t = (⌊Zx⌋ + y) / Z
This works by quantizing the x dimension using some integer Z such that the floating point precision of t is split
between x and y. This particular function needs the discontinuous floor function to compute, and the discontinuous
fmod function to invert, but models can't compute discontinuous functions. However, it's possible to replace the
discontinuities with steep linear segments that are only some epsilon value wide.
We can compare the mean squared error loss on random uniform dense values of x and y and see that even with
epsilons as large as 0.1 and Z values as small as 3 the nonlinear compression outperforms linear compression such as
picking one of the dimensions or using the average:
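As a concrete, somewhat idealized check of this comparison, here is a small numpy sketch of our own. It uses the exact floor and fmod rather than the epsilon-wide linear approximations described above, so it shows the best case for the nonlinear scheme; the Z value and the two linear baselines are the ones mentioned in the text.

```python
# Compare t = (floor(Z*x) + y) / Z against two linear compressions of the dense,
# uniform (x, y) pair into a single number. Exact floor/fmod are used here, i.e.
# this is the idealized scheme without the epsilon-wide smoothing a ReLU network
# would need to approximate it.
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.uniform(size=(2, 100_000))

def nonlinear_mse(x, y, Z=3):
    t = (np.floor(Z * x) + y) / Z        # compress two dims into one
    x_hat = np.floor(Z * t) / Z          # x back, up to 1/Z quantization
    y_hat = np.mod(Z * t, 1.0)           # y back (exactly, up to float precision)
    return np.mean((x - x_hat) ** 2 + (y - y_hat) ** 2)

def keep_one_mse(x, y):
    # Linear baseline: keep x, reconstruct y by its mean of 0.5.
    return np.mean((y - 0.5) ** 2)

def average_mse(x, y):
    # Linear baseline: keep the average of x and y.
    t = (x + y) / 2
    return np.mean((x - t) ** 2 + (y - t) ** 2)

print(nonlinear_mse(x, y))   # ~0.037  (= 1/(3 Z^2) quantization error on x)
print(keep_one_mse(x, y))    # ~0.083  (= 1/12, from discarding y)
print(average_mse(x, y))     # ~0.083  (= 1/12, from mixing x and y)
```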
Connection between compressed sensing lower bounds and the toy model
Here, we formalize the relationship between a compressed sensing lower bound and the toy model.
Let T(x) : ℝⁿ → ℝⁿ be the complete toy model autoencoder defined by T(x) = ReLU(W2W1x − b) for an m × n
matrix W1 and an n × m matrix W2.
We prove this result by framing our toy model as a compressed sensing algorithm. The primary barrier to doing so is that
our optimization only searches for a vector that is close in ℓ2 distance to the original vector, and that vector may not itself be exactly
k-sparse. The following lemma resolves this concern through a denoising step:
Lemma 1. Suppose that we have a toy model T (x) with the properties in Theorem 1. Then there exists a compressed
sensing algorithm f(y) : ℝᵐ → ℝⁿ for the measurement matrix W1.
Proof. We construct f(y) as follows. First, compute x̃ = ReLU(W2y − b), as in T(x). This produces the vector
x̃ = T(x), and so by supposition ∥T(x) − x∥₂ ≤ ε. Next, we threshold x̃ to obtain x̃′ by dropping all but its k largest
entries. Lastly, we solve the optimization problem minₓ′ ∥x′ − x̃′∥ subject to W1x′ = y, which is convex because x′ and
x̃′ have the same support. For sufficiently small ε (specifically, ε smaller than the (k + 1)th largest entry in x), both x̃′
and the nearest k-sparse vector to x have the same support, and so the convex optimization problem has a unique
solution: the nearest k-sparse vector to x. Therefore, f is a compressed sensing algorithm for W1 with approximation
factor 1.
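For intuition, here is a short numpy sketch (our own illustration, not part of the proof) of the recovery procedure f(y) constructed above. It assumes W1 restricted to the recovered support has full column rank, in which case the constrained optimization is pinned down by the measurement constraint and reduces to a small least-squares solve.

```python
# Sketch of f(y): denoise with the toy model decoder, keep the k largest
# entries, then re-solve for the values on that support so that W1 @ x = y.
# Assumes W1[:, support] has full column rank, so the constraint already
# determines the answer and a least-squares solve suffices.
import numpy as np

def lemma1_recovery(y, W1, W2, b, k):
    x_tilde = np.maximum(W2 @ y - b, 0.0)        # step 1: x~ = ReLU(W2 y - b)
    support = np.argsort(x_tilde)[-k:]           # step 2: indices of k largest entries
    A = W1[:, support]                           # m x k measurement submatrix
    z, *_ = np.linalg.lstsq(A, y, rcond=None)    # step 3: solve W1 x = y on the support
    x_hat = np.zeros(W1.shape[1])
    x_hat[support] = z
    return x_hat
```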
Lastly, we use the deterministic compressed sensing lower bound of Do Ba, Indyk, Price, and Woodruff [49] :
Theorem 2 (Corollary 3.1 in [49]). Given an m × n matrix A with the restricted isometry property, a sparse recovery
algorithm finds a k-sparse approximation x̂ of x ∈ ℝⁿ from Ax such that

∥x − x̂∥₁ ≤ C(k) · min{ ∥x − x′∥₁ : ∥x′∥₀ ≤ k }

for an approximation factor C(k). If C(k) = O(1), then a sparse recovery algorithm exists only if m = Ω(k log(n/k)).
Theorem 1 follows directly from Lemma 1 and Theorem 2.