
Interpretability

Demystifying the Black-Box LMs


Large Language Models: Introduction and Recent Advances
ELL881 · AIL821

Anwoy Chatterjee
PhD Student (Google PhD Fellow)
IIT Delhi
The Nascent Field of NLP Interpretability
• NLP researchers published focused analyses of linguistic structure in neural models as
early as 2016, primarily studying recurrent architectures like LSTMs.
• The growth of the field, however, also coincided with the adoption of Transformers!
• To serve the expanding NLP-Interpretability community, the first BlackBoxNLP workshop
was held in 2018.
• It immediately became one of the most popular workshops at any ACL conference.
• ACL implemented an “Interpretability and Analysis” main conference track in 2020
reflecting the mainstream success of the field.

Saphra and Wiegreffe, Mechanistic?




Broad Classification of Interpretability Techniques
• Behavior Localization
  • Input Attribution
  • Model Component Attribution: Logit Attribution, Causal Interventions (Activation Patching, Attribution Patching), Circuits Analysis
• Information Decoding
  • Probing
  • Decoding in Vocabulary Space
  • Dictionary Learning

Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models


Earlier Techniques in NLP Interpretability
• Distributional semantics and representational similarity
• Interest in vector semantics exploded in the NLP community after word2vec popularized many
approaches to interpreting word embeddings.
• Distributional semantics has generalized to representational similarity methods and vector space
analogical reasoning.
• Attention maps
• In BERT models, the concurrent discovery of both correlational and causal relationships between
syntax and attention made the case for attention maps as a window into how Transformer LMs
handle complex linguistic structure.
• Neuron analysis and localization
• Component analysis and probing
Saphra and Wiegreffe, Mechanistic?


Probing
• The probing classifier g: 𝑓^𝑙(𝑥) → 𝑧 maps intermediate representations 𝑓^𝑙(𝑥) to some input
features (labels) 𝑧, which can be, for instance, a part-of-speech tag, or other semantic or
syntactic information.

• From an information-theoretic perspective, training the probing classifier g can be seen as
estimating the mutual information 𝐼(𝑍; 𝐻) between the intermediate representations 𝑓^𝑙(𝑥) and
the property 𝑧, where 𝑍 is a random variable ranging over properties 𝑧, and 𝐻 is a random
variable ranging over representations 𝑓^𝑙(𝑥).

Belinkov, Probing Classifiers: Promises, Shortcomings, and Advances
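As a concrete sketch of this setup (illustrative names; assumes frozen layer-𝑙 representations and their property labels have already been cached from the LM), a linear probe g can be trained as follows:

```python
import torch
import torch.nn as nn

# Hypothetical cached data: reps[i] is the frozen representation f^l(x_i) of
# token i, and labels[i] is its property z (e.g., a POS-tag id).
reps = torch.randn(10_000, 768)            # stand-in for real activations
labels = torch.randint(0, 17, (10_000,))   # stand-in for 17 POS tags

probe = nn.Linear(768, 17)                 # the probing classifier g
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(1_000):
    idx = torch.randint(0, len(reps), (256,))
    loss = loss_fn(probe(reps[idx]), labels[idx])  # only g is trained;
    opt.zero_grad()                                # the LM stays frozen
    loss.backward()
    opt.step()

# High held-out accuracy suggests z is (linearly) decodable from f^l(x);
# it does not by itself show that the model *uses* this information.
```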




Motivation of Probe Tasks
• If we can train a classifier to predict a property of the input text based on its
representation, it means the property is encoded somewhere in the representation.

• If we cannot train a classifier to predict a property of the input text based on its
representation, it means the property is either not encoded in the representation or not encoded
in a useful way, given how the representation is likely to be used.



Probe Approach

Slide Credits: Mohit Iyyer, UMass CS685




Probe Complexity
• Arguments for “simple” probes
  • We want to find easily accessible information in a representation

• Arguments for “complex” probes
  • Useful properties might be encoded non-linearly

Slide Credits: Mohit Iyyer, UMass CS685




Control Tasks

Slide Credits: Mohit Iyyer, UMass CS685




Designing Control Tasks
• Independently sample a control behavior 𝐶(𝑣) for each word type 𝑣 in the vocabulary

• The behavior specifies how to define the label 𝑦_𝑖 ∈ 𝑌 for a word token 𝑥_𝑖 with word type 𝑣

• The control task is a function that maps each token 𝑥_𝑖 to the label specified by the behavior
𝐶(𝑥_𝑖)

Slide Credits: Mohit Iyyer, UMass CS685




Look at ‘selectivity’

Selectivity measures the probe model’s ability to make output decisions independently of the
linguistic properties of the representation. Concretely, it is the probe’s accuracy on the
linguistic task minus its accuracy on the control task.

Slide Credits: Mohit Iyyer, UMass CS685
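As a minimal sketch of the two ideas above (illustrative code, not Hewitt and Liang's implementation): sample a random control behavior per word type, relabel tokens with it, train an identical probe, and report the accuracy gap:

```python
import random

NUM_LABELS = 17                      # same output space as the real task

def make_control_behavior(vocab):
    # Independently sample a control behavior C(v) for each word *type* v.
    return {v: random.randrange(NUM_LABELS) for v in vocab}

def control_labels(tokens, C):
    # The control task maps each token x_i to the label C(x_i) of its type,
    # so it is learnable only by memorizing word identities.
    return [C[x] for x in tokens]

def selectivity(task_accuracy, control_accuracy):
    # High selectivity = the probe succeeds because of linguistic structure
    # in the representation, not because of its own capacity to memorize.
    return task_accuracy - control_accuracy
```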




Mechanistic Interpretability
A New Paradigm, or ‘Old Wine in a New Bottle’?


So, What is Mechanistic Interpretability (MI)?
• Elhage et al. (2021) provided the first explicit definition of MI:
“attempting to reverse engineer the detailed computations performed by Transformers,
similar to how a programmer might try to reverse engineer complicated binaries into
human-readable source code.”
• Recent definitions, such as that of the ICML 2024 MI workshop, use similar wording:
“. . . reverse engineering the algorithms implemented by neural networks into human-
understandable mechanisms, often by examining the weights and activations of neural
networks to identify circuits . . . that implement particular behaviors.”



Coinage of the Term MI and Initial Works
How do scientists understand complex systems?
• ZOOM IN to study the components of the systems
• For example, scientists study properties of materials based on the structure of their atoms
• Similarly, to study complex neural networks, studying individual neurons can be insightful
• This is the idea behind mechanistic interpretability
• First employed in Convolutional Neural Networks (CNNs) by Chris Olah et al.

Olah, et al., "Zoom In: An Introduction to Circuits", Distill, 2020.




‘Circuits’

• A circuit is a computational subgraph of a neural network, with neurons (or linear
combinations of neurons) as nodes, connected by the weighted edges between them in the
original network.

Olah, et al., "Zoom In: An Introduction to Circuits", Distill, 2020.






Circuit in GPT-2 for IOI Task

Wang, et al., Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small


MI Workflow for Finding Circuits
1. Observe a behavior (or task) that a neural network displays, create a dataset
that reproduces the behavior in question, and choose a metric to measure the
extent to which the model performs the task.
2. Define the scope of the interpretation, i.e. decide the level of granularity
(e.g. attention heads and MLP layers, individual neurons, whether these are
split by token position) at which one wants to analyze the network. This results
in a computational graph of interconnected model units.
3. Perform an extensive and iterative series of patching experiments with the goal
of removing as many unnecessary components and connections from the
model as possible.
Conmy et al., Towards Automated Circuit Discovery for Mechanistic Interpretability




MI Workflow for Finding Circuits: Step 1 Examples

Conmy et al., Towards Automated Circuit Discovery for Mechanistic Interpretability






MI Workflow for Finding Circuits: Step 2 Examples
• To find circuits for the behavior of interest, one must represent the internals of the model
as a computational directed acyclic graph (DAG).
• Current work chooses the abstraction level of the computational graph depending on the
level of detail of their explanations of model behavior.
• For example, at a coarse level, computational graphs can represent interactions between attention
heads and MLPs.
• At a more granular level, they could include separate query, key and value activations, the interactions
between individual neurons, or have a node for each token position.

Conmy et al., Towards Automated Circuit Discovery for Mechanistic Interpretability




MI Workflow for Finding Circuits Step 3: Activation Patching
The importance of nodes/edges is tested by using recursive activation patching:
i) overwrite the activation value of a node or edge with a corrupted activation,
ii) run a forward pass through the model, and
iii) compare the output values of the new model with the original model, using the chosen
metric

Conmy et al., Towards Automated Circuit Discovery for Mechanistic Interpretability






Activation Patching
The method involves a clean prompt (X_clean, e.g., “The Eiffel Tower is in”) with an associated answer r
(“Paris”), a corrupted prompt (X_corrupt, e.g., “The Colosseum is in”), and three model runs:
1. Clean run: run the model on X_clean and cache activations of a set of given model components, such as MLP
or attention head outputs.
2. Corrupted run: run the model on X_corrupt and record the model outputs.
3. Patched run: run the model on X_corrupt with a specific model component’s activation restored from the
cached value of the clean run.
Finally, we evaluate the patching effect, e.g., P(“Paris”) in the patched run (3) compared to the
corrupted run (2). Intuitively, corruption hurts model performance while patching restores it. The
patching effect measures how much the patching intervention restores performance, which
indicates the importance of the activation.

Zhang and Nanda., Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
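A minimal sketch of these three runs with plain PyTorch forward hooks (illustrative names; assumes `model` maps token ids to logits of shape [batch, seq, vocab], `module` is a component such as one MLP block whose output is a single tensor, and both prompts tokenize to the same length):

```python
import torch

def patching_effect(model, clean_ids, corrupt_ids, module, answer_id):
    cache = {}

    def save_hook(mod, inp, out):
        cache["clean"] = out.detach()

    def patch_hook(mod, inp, out):
        return cache["clean"]          # returning a value overwrites the output

    # 1. Clean run: cache this component's activation.
    h = module.register_forward_hook(save_hook)
    with torch.no_grad():
        model(clean_ids)
    h.remove()

    # 2. Corrupted run: record the baseline outputs.
    with torch.no_grad():
        corrupt_logits = model(corrupt_ids)

    # 3. Patched run: corrupted prompt, but with the clean activation restored.
    h = module.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_logits = model(corrupt_ids)
    h.remove()

    # How much probability mass on the clean answer (e.g. "Paris") is restored?
    p_corrupt = corrupt_logits[0, -1].softmax(-1)[answer_id]
    p_patched = patched_logits[0, -1].softmax(-1)[answer_id]
    return (p_patched - p_corrupt).item()
```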


Activation Patching: Metrics
• The patching effect is defined as the gap in model performance between the
corrupted and patched runs, under an evaluation metric. Let cl, ∗, and pt denote the clean,
corrupted, and patched runs, respectively.

Zhang and Nanda., Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
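For example, a common choice (a sketch; the paper also considers raw probability and KL-divergence metrics) is the logit difference between the correct answer r and a contrastive answer r′, normalized so that 0 means no restoration and 1 means full restoration of clean-run performance:

```latex
\mathrm{LD} = \mathrm{logit}(r) - \mathrm{logit}(r'), \qquad
\text{patching effect} = \frac{\mathrm{LD}_{\mathrm{pt}} - \mathrm{LD}_{*}}{\mathrm{LD}_{\mathrm{cl}} - \mathrm{LD}_{*}}
```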




Automatic Circuit DisCovery (ACDC)

Conmy et al., Towards Automated Circuit Discovery for Mechanistic Interpretability
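In outline, ACDC iterates over the edges of the computational DAG in reverse topological order and greedily removes each edge whose patching effect on the chosen metric falls below a threshold τ. A schematic sketch (illustrative names, not the authors' code):

```python
def acdc(edges, metric, tau):
    """Greedy circuit discovery (schematic, after Conmy et al.).
    `edges`: the computational graph's edges in reverse topological order.
    `metric(kept)`: evaluates the model when only `kept` edges carry clean
    activations and all others are activation-patched to corrupted values
    (e.g. KL divergence from the full model's outputs).
    `tau`: pruning threshold."""
    kept = set(edges)
    baseline = metric(kept)
    for e in edges:
        trial = metric(kept - {e})        # corrupt edge e, keep the rest
        if abs(trial - baseline) < tau:   # removing e barely changes behavior,
            kept.discard(e)               # so prune it from the candidate circuit
            baseline = trial
    return kept                           # surviving subgraph = discovered circuit
```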




ACDC Discovered Circuit Example

Conmy et al., Towards Automated Circuit Discovery for Mechanistic Interpretability




Attribution Patching
• Attribution patching is really fast and scalable!
• Once you do a clean forward pass, corrupted forward pass, and corrupted backward
pass, the attribution patch for any activation is just ((clean_act - corrupted_act) *
corrupted_grad_act).sum().

Nanda, Attribution Patching: Activation Patching At Industrial Scale
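A sketch of that computation for a single component, again with plain PyTorch hooks (illustrative names; assumes `module`'s output is a single tensor and `metric` maps logits to a scalar such as a logit difference):

```python
import torch

def attribution_patch(model, clean_ids, corrupt_ids, module, metric):
    """First-order (gradient-based) estimate of the patching effect for one
    component, schematic after Nanda's formula above."""
    saved = {}

    def save_hook(mod, inp, out):
        saved["act"] = out
        if out.requires_grad:
            out.retain_grad()              # so .grad survives the backward pass

    h = module.register_forward_hook(save_hook)

    # One clean forward pass: cache the clean activation.
    with torch.no_grad():
        model(clean_ids)
    clean_act = saved["act"]

    # One corrupted forward + backward pass: cache activation and gradient.
    logits = model(corrupt_ids)
    corrupt_act = saved["act"]
    metric(logits).backward()
    h.remove()

    # ((clean_act - corrupted_act) * corrupted_grad_act).sum()
    return ((clean_act - corrupt_act) * corrupt_act.grad).sum().item()
```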






Induction Heads
• An induction head is a circuit whose function is to look back over the sequence for previous instances of the
current token (call it A), find the token that came after it last time (call it B), and then predict that the same
completion will occur again
  • E.g., completing the sequence [A][B] … [A] → [B]
  • In other words, induction heads “complete the pattern” by copying and completing sequences that have occurred before.

• Mechanically, induction heads are implemented by a circuit of two attention heads:
  • the first head is a “previous token head”, which copies information from the previous token into the next token
  • the second head (the actual “induction head”) uses that information to find tokens preceded by the present
token.
• For 2-layer attention-only models, it is shown that induction heads implement this pattern-copying behavior
and appear to be the primary source of in-context learning.

Olsson, et al., In-context Learning and Induction Heads




Induction Heads



• ICL Score is defined as the loss of the 500th token in the context minus the loss of the 50th token.

• Prefix Matching Score is the average fraction of a head’s attention weight given to the token we
expect an induction head to attend to: the token where the prefix matches the present context.

Olsson, et al., In-context Learning and Induction Heads
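A quick way to measure this in practice is to feed the model a random block of tokens repeated twice and check how much attention the second copy of each token pays to the token that followed its first occurrence (a sketch of the prefix-matching measurement; names are illustrative):

```python
import torch

def prefix_matching_score(attn, seq_len):
    """`attn`: one head's attention pattern of shape [2*seq_len, 2*seq_len]
    (destination x source) on a random block of `seq_len` tokens repeated
    twice. For the second copy of token t_i (position seq_len + i), an
    induction head should attend to position i + 1: the token that followed
    t_i the first time around (the [A][B] ... [A] -> [B] pattern)."""
    dests = torch.arange(seq_len, 2 * seq_len - 1)   # second-copy positions
    sources = dests - seq_len + 1                    # token after the match
    return attn[dests, sources].mean().item()
```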


Mechanistic Understanding of CoT


Dutta et al., How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning


Mechanistic Understanding of CoT Reasoning
Goal: understanding the internal mechanisms of the models that facilitate CoT generation.
● Attention heads perform information movement (token mixing) between ontologically related (or
negatively related) tokens in the early layers.
● Multiple different neural pathways are deployed to compute the answer, and they operate in
parallel: multiple answer-writing heads imply multiple pathways in the model.

Dutta et al., How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning


Mechanistic Understanding of CoT Reasoning
● Parallel answer-generation pathways collect answers from different segments of the input:
heads collect the answer tokens from the generated context (green), the question context (blue),
and the few-shot context (red).
● There is a functional rift at the very middle of the LLM (the 16th decoder block in the case of
Llama-2 7B):
  ○ First-half heads: assist information movement between residual streams and align the
representations.
  ○ Second-half heads: the model employs multiple pathways to write the answer to the last
residual stream.

Dutta et al., How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning




Decoding in Vocabulary Space
Logit Lens
• The logit lens proposes projecting intermediate residual stream states 𝑥^𝑙 through the
unembedding matrix 𝑊_𝑈.
  • The logit lens can also be interpreted as the prediction the model would make if all later layers were
skipped, and can be used to analyze how the model refines the prediction throughout the forward
pass.

• However, the logit lens can fail to elicit plausible predictions in some particular models.
  • This phenomenon has inspired researchers to train translators: functions applied to the
intermediate representations prior to the unembedding projection.
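A minimal sketch of the projection (illustrative names; whether to apply the final layer norm before unembedding is a detail that varies across implementations):

```python
import torch

def logit_lens(x_l, W_U, ln_final=None, k=5):
    """Project an intermediate residual-stream state x^l (shape [d_model])
    through the unembedding matrix W_U (shape [d_model, vocab]) to read off
    the model's 'current guess' at layer l."""
    x = ln_final(x_l) if ln_final is not None else x_l
    logits = x @ W_U
    return logits.topk(k).indices   # top-k token ids predicted at this layer
```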



Logit Lens on Vision Models

Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations




Patchscopes: Patching and Probing

Modifying the activations of the model at inference time to explore where
information is encoded or learnt.

Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
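Conceptually this is activation patching across prompts: a hidden state from a source prompt is inserted into an inspection prompt whose continuation verbalizes what that state encodes. A sketch with the same hook machinery as before (illustrative names; assumes single-tensor module outputs):

```python
import torch

def patchscope(model, source_ids, target_ids, module, target_pos):
    saved = {}

    def save_hook(mod, inp, out):
        saved["h"] = out[:, -1].detach()   # hidden state to inspect

    def patch_hook(mod, inp, out):
        out = out.clone()
        out[:, target_pos] = saved["h"]    # splice it into the target prompt
        return out

    h = module.register_forward_hook(save_hook)
    with torch.no_grad():
        model(source_ids)                  # source run: grab the representation
    h.remove()

    h = module.register_forward_hook(patch_hook)
    with torch.no_grad():
        logits = model(target_ids)         # target run: decode it in context
    h.remove()
    return logits[0, -1].argmax().item()   # first token of the verbalization
```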






Dictionary Learning
Linear Representation Hypothesis
• Circuits define the way a model builds up its embeddings, but they do not clarify what
these embeddings mean.
• The linear representation hypothesis (LRH) assumes that “interpretable features” are
represented as linear directions in the latent space, which are activated when the
embeddings “align with” these directions.
• Because of superposition, individual dimensions (neurons) of the latent space may not be informative.



Interpretable Features

Toy Models of Superposition




Sparse Autoencoders
Under the LRH, we can learn the overcomplete feature space of a trained model by training what is
called a sparse autoencoder model, which learns a sparse decomposition of the activation:
the MLP activation (for one token) is approximated as a sparse set of feature activations over an
overcomplete basis (dictionary) of “interpretable directions”.


Sparse Autoencoders

Sparse autoencoders can be trained in an unsupervised way from a collection of
activations of the model.

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
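A minimal sketch of such a sparse autoencoder and its training loss (illustrative sizes and coefficients; actual setups differ in detail):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: d_hidden >> d_model gives an overcomplete
    dictionary of directions (the columns of the decoder weight)."""
    def __init__(self, d_model=512, d_hidden=4096):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # sparse feature activations f(x)
        x_hat = self.dec(f)           # reconstruction from the dictionary
        return x_hat, f

# Unsupervised training on cached MLP activations:
# loss = ||x - x_hat||^2 + l1_coeff * ||f||_1  (reconstruction + sparsity)
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(4096, 512)         # stand-in for real cached activations
for step in range(100):
    x = acts[torch.randint(0, len(acts), (256,))]
    x_hat, f = sae(x)
    loss = (x - x_hat).pow(2).mean() + 1e-3 * f.abs().sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```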



SAE Explanations in Billion-Scale LLMs

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet




Controlling Features

Manually increasing or decreasing a specific feature’s activation can elicit (or remove)
specific behaviors of the model (assuming the explanation is correct).

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet


