Interpretability: Demystifying the Black-Box LMs
Anwoy Chatterjee
PhD Student (Google PhD Fellow)
IIT Delhi
The Nascent Field of NLP Interpretability
• NLP researchers published focused analyses of linguistic structure in neural models as
early as 2016, primarily studying recurrent architectures like LSTMs.
• The growth of the field, however, also coincided with the adoption of Transformers!
• To serve the expanding NLP-Interpretability community, the first BlackBoxNLP workshop
was held in 2018.
• It immediately became one of the most popular workshops across ACL conferences.
• ACL introduced an “Interpretability and Analysis” main conference track in 2020, reflecting the field’s mainstream success.
[Figure: overview of interpretability approaches: Information Decoding, Dictionary Learning, Decoding in Vocabulary Space]
Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models
• From an information-theoretic perspective, training the probing classifier g can be seen as estimating the mutual information between the intermediate representations f_l(x) and the property z, written I(Z; H), where Z is a random variable ranging over properties z, and H is a random variable ranging over representations f_l(x).
• If we cannot train a classifier to predict a property of the input text from its representation, the property is either not encoded in the representation, or not encoded in a useful way, considering how the representation is likely to be used.
• A control task is a function that maps each token x_i to the label specified by the behavior C(x_i) (e.g., a random label fixed per word type); comparing the probe's accuracy on the real task with its accuracy on the control task (its selectivity) tests whether the probe is reading the property from the representation rather than learning the mapping itself.
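A minimal probing sketch in Python, assuming the layer-l representations have already been extracted; the synthetic arrays and the use of a scikit-learn logistic-regression probe are illustrative choices.

# Probing sketch: train a linear classifier g on frozen layer-l representations H
# to predict a property z, and compare against a control task with random labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
H = rng.normal(size=(2000, 768))       # stand-in for representations f_l(x)
z = rng.integers(0, 5, size=2000)      # stand-in for property labels z
c = rng.integers(0, 5, size=2000)      # control labels C(x_i) (random here; fixed per word type in a real control task)

H_tr, H_te, z_tr, z_te, c_tr, c_te = train_test_split(H, z, c, test_size=0.2, random_state=0)

probe_acc = LogisticRegression(max_iter=1000).fit(H_tr, z_tr).score(H_te, z_te)
control_acc = LogisticRegression(max_iter=1000).fit(H_tr, c_tr).score(H_te, c_te)

# High selectivity (probe accuracy minus control accuracy) suggests the property is
# encoded in H rather than memorized by an over-expressive probe.
print(f"probe={probe_acc:.3f}  control={control_acc:.3f}  selectivity={probe_acc - control_acc:.3f}")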
• A circuit is a computational subgraph of a neural network, with neurons (or linear combinations of neurons) as nodes, connected by the weighted edges that run between them in the original network.
Wang et al., Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small
Zhang and Nanda, Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
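A minimal activation-patching sketch using PyTorch forward hooks, assuming a HuggingFace GPT-2 model; the prompts, the patched layer, and the plain logit metric are illustrative choices rather than the exact setup of the cited papers.

# Activation-patching sketch: cache a layer's output from a clean run, splice it
# into a corrupted run via a PyTorch forward hook, and check how much of the
# clean behavior is restored (here: the logit of " Mary" at the last position).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean = tok("When Mary and John went to the store, John gave a drink to", return_tensors="pt")
corrupt = tok("When Mary and John went to the store, Tom gave a drink to", return_tensors="pt")
assert clean.input_ids.shape == corrupt.input_ids.shape  # prompts must align token-for-token

layer, cache = 5, {}

def save_hook(module, inputs, output):
    cache["act"] = output[0].detach()                # hidden states from the clean run

def patch_hook(module, inputs, output):
    return (cache["act"],) + output[1:]              # overwrite with cached clean activations

with torch.no_grad():
    handle = model.transformer.h[layer].register_forward_hook(save_hook)
    clean_logits = model(**clean).logits
    handle.remove()

    handle = model.transformer.h[layer].register_forward_hook(patch_hook)
    patched_logits = model(**corrupt).logits
    handle.remove()

mary = tok(" Mary", add_special_tokens=False).input_ids[0]
print("clean logit:", clean_logits[0, -1, mary].item(),
      "patched logit:", patched_logits[0, -1, mary].item())
# Real experiments patch one position or head at a time and use a normalized metric.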
• Mechanically, induction heads are implemented by a circuit of two attention heads:
• the first head is a “previous token head”, which copies information from the previous token into the next token;
• the second head (the actual “induction head”) uses that information to find tokens preceded by the present token.
• In 2-layer attention-only models, induction heads have been shown to implement this pattern-copying behavior and appear to be the primary source of in-context learning.
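A rough sketch of how induction heads can be located empirically, assuming a HuggingFace GPT-2 model: feed a repeated random sequence and score each head on the prefix-matching attention pattern; the sequence length and the threshold are arbitrary choices.

# Induction-head sketch: run a repeated random token sequence through the model
# and score each attention head on how strongly position i attends to position
# i - T + 1, i.e. the token right after the previous occurrence of the current
# token (the characteristic prefix-matching pattern of induction heads).
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

T = 50
block = torch.randint(100, 20000, (1, T))
tokens = torch.cat([block, block], dim=1)                 # [A B C ... | A B C ...]

with torch.no_grad():
    attns = model(tokens, output_attentions=True).attentions   # per-layer (1, heads, 2T, 2T)

for layer, attn in enumerate(attns):
    head_attn = attn[0]                                    # (heads, 2T, 2T)
    idx = torch.arange(T, 2 * T)                           # positions in the second copy
    scores = head_attn[:, idx, idx - T + 1].mean(dim=-1)   # induction score per head
    for head, s in enumerate(scores.tolist()):
        if s > 0.4:                                        # arbitrary threshold for display
            print(f"layer {layer}, head {head}: induction score {s:.2f}")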
• However, the logit lens can fail to elicit plausible predictions in some models.
• This phenomenon has inspired researchers to train translators, which are functions applied to the intermediate representations prior to the unembedding projection.
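A minimal logit-lens sketch, assuming a HuggingFace GPT-2 model with tied unembedding; a trained translator (as in the tuned lens) would insert a learned map before the final LayerNorm and unembedding, which is omitted here.

# Logit-lens sketch: project each layer's residual stream through the model's
# final LayerNorm and unembedding matrix to see which token the intermediate
# representation already "predicts" at the last position.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; the final entry already has ln_f
# applied, so re-applying it there is a harmless simplification of this sketch.
for layer, h in enumerate(out.hidden_states):
    resid = model.transformer.ln_f(h[:, -1, :])            # final LayerNorm on the last position
    logits = model.lm_head(resid)                          # unembedding projection
    top = logits.argmax(dim=-1)
    print(f"layer {layer:2d}: {tok.decode(top)!r}")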
Modifying the activations of the model at inference time to explore where information is encoded or learnt.
Manually increasing or decreasing a specific feature can elicit (or suppress) the corresponding behavior of the model (assuming the explanation of the feature is correct).
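A rough feature-steering sketch, assuming a HuggingFace GPT-2 model; the steering direction here is just a difference of residual-stream activations between two contrastive prompts, standing in for a learned dictionary feature, and the layer and scale are arbitrary.

# Feature-steering sketch: add a scaled direction to one layer's residual stream
# during generation. With a sparse-autoencoder dictionary, the direction would be
# a learned feature's decoder vector; here a contrastive activation difference
# stands in for it.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
layer, scale = 6, 8.0

def last_resid(prompt):
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states
    return hs[layer + 1][0, -1, :]                        # residual stream after block `layer`

direction = last_resid("I am feeling very happy") - last_resid("I am feeling very sad")
direction = direction / direction.norm()

def steer(module, inputs, output):
    # Nudge every position's residual stream along the chosen direction.
    return (output[0] + scale * direction,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(steer)
prompt = tok("Today I went outside and", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**prompt, max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(out[0]))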