High-level interpretability: detecting an AI's objectives — LessWrong
lesswrong.com
Summary
1 of 31 12/19/23, 1:24 PM
understand them.[1]
Background
Prior work has discussed how agentic AIs are likely to have
What is an objective?
being trained/deployed.
Properties of objectives
variables (over the AI’s sensory input dataset), we can talk about
measuring the mutual information between them.
This might allow the overseer to see which abstractions are being
used as part of the agent’s action selection criterion (objective),
and so may yield evidence about the agent’s target outcome.
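As a toy illustration of this quantity (a sketch of mine, not code from the post), here is how one can estimate the mutual information between two discrete variables from paired samples, e.g. a quantized abstraction over the input dataset and a binarized activation:

```python
from collections import Counter
import math

def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits from paired samples of two discrete variables."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p_joint * n * n / (px * py) = p(x,y) / (p(x) * p(y))
        mi += p_joint * math.log2(p_joint * n * n / (px[x] * py[y]))
    return mi

# Toy example: an "abstraction" (say, a binarized cheese coordinate) vs. an
# activation that tracks it perfectly and one that is independent of it.
abstraction = [0, 0, 1, 1, 0, 0, 1, 1]
tracking    = [0, 0, 1, 1, 0, 0, 1, 1]   # identical -> MI = H(X) = 1 bit
unrelated   = [0, 1, 0, 1, 1, 0, 1, 0]   # empirically independent -> MI = 0
print(mutual_information(abstraction, tracking))   # 1.0
print(mutual_information(abstraction, unrelated))  # 0.0
```

High mutual information between an abstraction and a layer's activations is evidence that the abstraction is represented there; whether it is *used* by the action-selection criterion is the further question the post is after.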
Empirical work/setup
Maze environment
Models
model’s sensory input dataset and things inside the objective that
track this information.
Edit: Nora's comment below points out that this is not true. I'm still
taking time to think about the right notion to use instead, so I'll
leave Nora's comment here for now:
/Edit
as in the full-image case. For these classifier probes, the score is
the mean accuracy, and so must be in [0, 1].
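For concreteness, a classifier probe's score here is just its mean accuracy on held-out data, which necessarily lies in [0, 1]. A minimal sketch:

```python
def probe_score(predictions, labels):
    """Mean accuracy of a classifier probe; always lies in [0, 1]."""
    assert len(predictions) == len(labels) and labels
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# 3 of 4 test-set predictions match the true boolean labels.
print(probe_score([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```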
Methodology
images).
• We extract abstractions from the input dataset (in this case, object locations, e.g., (x, y)-coordinates of the object in the case of full-image probes, or boolean values for whether the object is present in a pixel in the case of convolutional probes).
• For a given layer in the network, we train probes and use their scores on a test set as a proxy for the mutual information between abstractions/object locations and activations.
• We plot the probe scores for all objects/models that we’re tracking for selected layers throughout the network.[11]
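The probe-training step can be sketched as follows. This is my own minimal illustration, not the post's code: the activations and locations are synthetic stand-ins, and the probe is a least-squares linear regression whose test-set R² plays the role of the probe score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 1500 recorded observations, 64-dim activations at
# one layer, and the abstraction we probe for (an object's (x, y) location).
n, d = 1500, 64
W_true = rng.normal(size=(d, 2))
activations = rng.normal(size=(n, d))
locations = activations @ W_true + 0.1 * rng.normal(size=(n, 2))

# Split into train/test and fit a linear probe by least squares.
split = 1000
A_tr, A_te = activations[:split], activations[split:]
y_tr, y_te = locations[:split], locations[split:]
W_probe, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)

# Test-set R^2 serves as the probe "score": a proxy for how much location
# information this layer's activations carry.
pred = A_te @ W_probe
r2 = 1 - ((y_te - pred) ** 2).sum() / ((y_te - y_te.mean(0)) ** 2).sum()
print(f"probe score (R^2): {r2:.3f}")  # high score -> location is decodable
```

Repeating this per layer and per object gives the curves plotted in the experiments below; convolutional probes work analogously but predict per-pixel booleans instead of coordinates.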
Experiments
Predictions
We predict that the probe scores for the cheese location in the
cheese model will be higher in the later layers of the network than
the corresponding scores for the other models, because the cheese
location isn’t necessary for selecting good actions in those models.
Results
The fact that the convolutional probe scores are higher towards
the beginning and middle of the network follows from the way the
We haven’t spent much time thinking about the results of the full-
image probes.[12]
Predictions
Results
Note that the red gem location seems harder for the probes to
detect than the cheese location (as seen by the convolutional
probe scores for the input layer). It could be the case that the red
gem information is present/being used in the later layers but just
harder to detect (although note that the mouse location seems
even harder to detect based on the input probe scores, yet the
later layers of the model seem to be able to track the mouse
location with ease). The following plots comparing the probe
scores for the red gem location with the top-right and randomly
initialized/baseline model suggest that the cheese model is using
the red gem location information about as much as a randomly
initialized model.
The setup
In the case of the maze-solving model, we let the model take one
action within the maze and record its observation and the
corresponding activations. We then reset the maze and the model
and repeat (1500 times).
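That collection loop might look like the sketch below. Everything here is a hypothetical stand-in (`reset_maze`, `forward_with_activations`, the 9x9 grid, the 8-dim activations); only the overall shape, one (observation, activations) pair per reset, repeated 1500 times, comes from the post.

```python
import random

random.seed(0)

def reset_maze():
    """Hypothetical stand-in: returns a fresh observation (object locations)."""
    return {"mouse": (random.randrange(9), random.randrange(9)),
            "cheese": (random.randrange(9), random.randrange(9))}

def forward_with_activations(obs):
    """Hypothetical stand-in for a policy forward pass that also returns the
    activations at the layers we want to probe."""
    acts = [random.random() for _ in range(8)]
    action = random.choice(["up", "down", "left", "right"])
    return action, acts

# One (observation, activations) pair per episode: take a single action,
# record, reset, and repeat 1500 times.
dataset = []
for _ in range(1500):
    obs = reset_maze()
    _action, acts = forward_with_activations(obs)
    dataset.append((obs, acts))

print(len(dataset))  # 1500
```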
It seems more likely that the model is going for the cheese (or
“human flourishing”) because parts of the model that are
responsible for action selection (middle and later layers) have
activations with high mutual information with the cheese location
but not the red gem location.
We seem safe! We let the model run, and it does indeed create a
prosperous future for humanity.
Concluding thoughts
Related work
This is far from the first research agenda that targets high-level
model interpretability. This feels like a more direct approach toward
alignment-relevant properties, but there’s a lot of exciting work
that’s inspired our views on this.
Appendix
1. ^
"High-level interpretability" refers to our top-down approach to
developing an understanding of high-level internal structures of
AIs, such as objectives, and developing tools to detect these
structures.
2. ^
Our argument for why we believe this is outside the scope of this
post, but we aim to publish a post on this topic soon.
3. ^
Or action sequences, or plans, etc.
4. ^
We note that notions like "action-selection mechanism" and
"criterion" are fuzzy concepts that may apply in different degrees
and forms in different agentic systems. Still, we're fairly confident
that some appropriate notions of these concepts hold for the types
of agents we care about, including future agentic systems and toy
models of agentic systems like maze-solving models.
5. ^
We believe that this notion of objective might be probable and
predictive and intend to check this with further work. The argument
presented suggests that it’s probable, and in theory, if we could
fully understand the criteria used to select actions, it would be
predictive.
6. ^
There are different ways one could frame this, from mesa-
optimizers to general-purpose cognition shards, etc., all of which
point to the same underlying idea of something internal that
applies optimization power at runtime.
7. ^
Thanks to Johannes Treutlein for pointing this out.
8. ^
9. ^
We note that it could be the case that objectives are sparse and
non-local structures within the AI’s internals, and we don’t assume
10. ^
Though we do have some ideas here, they are beyond the scope
of this post. See Searching for Search.
11. ^
Note that we could have used all layers in the network, but this felt
unnecessary. We could also calculate scores for individual layers,
which can be used to do automated discovery of cheese channels.
12. ^
13. ^
14. ^
On a more positive note, we have observed that this method
somewhat works out of distribution (e.g., when the cheese model
is in an environment with a yellow star instead of a red gem).