An Attempt at Unraveling Token Prediction Refinement and Identifying Essential Layers of Large Language Models
Jaturong Kongmanee
Independent Researcher
Toronto, Canada
[email protected]
Abstract
This research aims to unravel how large language models (LLMs) iteratively refine
token predictions (or, more generally, vector predictions). We used the logit lens technique to analyze the model's token predictions derived from intermediate representations. Specifically, we focused on how LLMs access and use information from input contexts, and on how the positioning of relevant information affects the model's token prediction refinement process.
token prediction refinement process. Our findings for multi-document question
answering task, by varying input context lengths (the number of documents), using
GPT-2, revealed that the number of layers between the first layer that the model
predicted next tokens correctly and the later layers that the model finalized its
correct predictions, as a function of the position of relevant information (i.e.,
placing the relevant one at the beginning, middle, or end of the input context),
has a nearly inverted U shape. We found that the gap between these two layers,
on average, diminishes when relevant information is positioned at the beginning
or end of the input context, suggesting that the model requires more refinements
when processing longer contexts with relevant information situated in the middle,
and highlighting which layers are essential for determining the correct output.
Our analysis provides insights about how token predictions are distributed across
different conditions, and establishes important connections to existing hypotheses
and previous findings in AI safety research and development.
1 Introduction
Recent technological advances have led to an increasing number of applications of capable AI systems across domains. One example is large language models (LLMs), which have exhibited improved capabilities in content and code generation and in language translation. As LLMs rapidly advance in sophistication and generality, understanding them is essential to ensure their alignment with human values and to prevent catastrophic outcomes [Hendrycks and Mazeika, 2022, Hendrycks et al., 2023, Ji et al., 2024]. Research that aims to improve our understanding of LLMs goes beyond simple performance metrics to unravel their internal workings. This work used an observational approach for analyzing the inner workings of neural networks, the logit lens (discussed in detail in Section 2). We focused on how LLMs access and use information from input contexts, and on how the positioning of relevant information affects the model's token prediction refinement process, to gain insights that can potentially aid the development of methods that ensure trustworthiness in LLMs.
Mechanistic interpretability (MI) is a growing approach that aims to precisely define computations
within neural networks. It seeks to fully specify a neural network’s computations, aiming for a
granular understanding of model behavior, akin to reverse-engineering the model’s processes into
$[\mathrm{logit}_1, \ldots, \mathrm{logit}_{|V|}] = W_U \cdot \mathrm{LayerNorm}_L(h_L^{(n)}),$  (1)
where $W_U \in \mathbb{R}^{|V| \times d}$ denotes an embedding matrix, and $\mathrm{LayerNorm}_L$ is the pre-unembedding layer normalization. The logit lens applies the same unembedding operation to the earlier hidden states $h_l^{(i)}$:

$[\mathrm{logit}_{l,1}, \ldots, \mathrm{logit}_{l,|V|}] = W_U \cdot \mathrm{LayerNorm}_L(h_l^{(n)}),$  (2)

from which an intermediate predictive distribution over tokens at layer $l$, $p_l(t_{n+1} \mid t_1, \ldots, t_n)$, can be obtained.
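For concreteness, the following is a minimal sketch of the logit lens, applying the unembedding operation of Equations (1) and (2) to every intermediate hidden state, assuming the Hugging Face transformers implementation of GPT-2 (an illustrative reconstruction, not the exact code used in this study):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "Hinton is a prominent figure in the field of artificial intelligence and deep"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states is a tuple of (n_layers + 1) tensors of shape
# (batch, seq_len, d_model); index 0 is the embedding-layer output.
for layer_idx, h in enumerate(outputs.hidden_states):
    # Apply the final layer normalization and the (tied) unembedding matrix W_U,
    # as in Equations (1) and (2).
    logits = model.lm_head(model.transformer.ln_f(h))   # (batch, seq_len, |V|)
    probs = torch.softmax(logits[0, -1], dim=-1)        # intermediate next-token distribution p_l
    top_prob, top_id = probs.max(dim=-1)
    print(f"layer {layer_idx:2d}: top-1 = {tokenizer.decode([top_id.item()])!r} (p = {top_prob.item():.3f})")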
Figure 1: (a) Maximum probability. (b) Cross-entropy. The logit lens applied to the hidden states of GPT-2 processing "Hinton is a prominent figure in the field of artificial intelligence and deep learning." The vertical axis represents layers ranging from 0 to 12 (bottom to top), and the horizontal axis represents input tokens. Each cell shows the most likely next token (i.e., the top-1 token prediction) that the respective hidden state predicts. For maximum probability, darker cells correspond to higher probabilities; for cross-entropy and forward Kullback-Leibler (KL) divergence, lighter cells correspond to higher probabilities.

In this experiment, we applied the logit lens to the hidden states of GPT-2 processing internet text. For example, given the instance "Hinton is a prominent figure in the field of artificial intelligence and deep learning.", shown in Figure 1, we observed how the distributions over next-token predictions gradually converge to the final distribution. The general trend we observed is that the initial predictions of tokens
resemble plausible completions but seem far from the correct prediction. As the layers progress, the predictions become more accurate with respect to the ground-truth next token. Roughly in the middle layers, the model forms a correct prediction of the next token, with later layers refining this prediction toward higher predicted probabilities.
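The per-layer quantities visualized in Figure 1 can be computed from the same hidden states. The sketch below, under the same assumptions as the previous snippet and using a hypothetical helper layerwise_metrics, records the maximum probability, the cross-entropy against the ground-truth next token, and the forward KL divergence to the final layer's distribution:

import torch
import torch.nn.functional as F

def layerwise_metrics(hidden_states, ln_f, lm_head, target_id):
    """hidden_states: tuple of (batch, seq_len, d_model) tensors from GPT-2."""
    final_log_probs = F.log_softmax(lm_head(ln_f(hidden_states[-1]))[0, -1], dim=-1)
    rows = []
    for l, h in enumerate(hidden_states):
        log_probs = F.log_softmax(lm_head(ln_f(h))[0, -1], dim=-1)
        rows.append({
            "layer": l,
            "max_prob": log_probs.exp().max().item(),
            # Cross-entropy of this layer's distribution against the ground-truth next token.
            "cross_entropy": -log_probs[target_id].item(),
            # Forward KL(final || layer l): how far this layer's distribution is
            # from the model's final next-token distribution.
            "kl_to_final": F.kl_div(log_probs, final_log_probs,
                                    log_target=True, reduction="sum").item(),
        })
    return rows

Calling layerwise_metrics(outputs.hidden_states, model.transformer.ln_f, model.lm_head, target_id), where target_id is the id of the ground-truth next token, yields one row of values per layer for a given input position.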
Take the 6th and 9th columns of Figure 1 as an example involving linking words (e.g., conjunctions and subordinating conjunctions), in this case "figure in" and "field of". The model predicts the next word correctly in an early layer and does not switch to other tokens in later layers, while its confidence evolves across layers until it is highly confident in its final token prediction (predicted probabilities of nearly 100%). Despite being trained on vast amounts of text without explicit syntactic rules, language models, GPT-2 in this case, show an ability to capture dependency relations and structural aspects of language. A study by Hewitt and Manning [2019] shows that language models capture hierarchical syntactic structures, such as nested clauses, phrase boundaries, and subject-verb-object relationships.
[Figure 2 plots (GPT-2): accuracy and the layers at which the top-1 token first matched / was finalized, plotted against the position of the document with the answer tokens; see caption below.]
Figure 2: Left: Shifting the location of relevant information within the model's input context reveals a U-shaped performance pattern. The model performs best when the relevant information is at the very beginning or end of the input, while its performance, on average, declines significantly when the critical information appears in the middle of the context. Right: Similarly, changing the position of relevant information produces a nearly inverted-U pattern, illustrating when the top-1 token first aligns with the correct answer and when it is finalized. The gap between these two layers is, on average, smaller when the relevant information is placed at the beginning or end of the input, indicating that the model needs more prediction refinements when processing a long input context with the relevant information positioned in the middle. The experiments were repeated for 10 independent runs; dots represent mean (arithmetic average) performance, and shaded bands represent 95% confidence intervals (CI).
However, an intriguing question is how transformer-based models, trained solely on natural language data, manage to learn its hierarchical structure and generalize to sentences with unseen syntactic patterns, despite lacking any explicitly encoded structural bias.
Ahuja et al. [2024] explored the inductive biases in transformers that may drive this generalization. Through extensive experiments on synthetic datasets with various training objectives, they found that, while objectives such as sequence-to-sequence and prefix language modeling often fail to produce hierarchical generalization, models trained with the language modeling objective consistently do. The authors also conducted model pruning experiments to examine how transformers encode hierarchical structure, and found (pruned) sub-networks with distinct generalization behaviors aligned with hierarchical structures and with linear order. Model pruning reduces the size of a network by removing unnecessary parameters while maintaining performance. Pruning not only reduces model complexity and computational costs but also provides insights into model efficiency and interpretability; for example, pruning can reveal which parts of a network are most critical for certain tasks, shedding light on how models work internally.
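As a generic illustration of magnitude pruning (not the specific sub-network search procedure of Ahuja et al. [2024]), PyTorch's built-in pruning utilities can zero out low-magnitude weights and expose how sparse a layer can be made:

import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(768, 768)

# Zero out the 30% of weights with the smallest absolute value (L1 magnitude pruning).
prune.l1_unstructured(layer, name="weight", amount=0.3)

# The pruning mask is stored alongside the original weights; make it permanent.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"fraction of pruned weights: {sparsity:.2f}")  # ~0.30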
Another interesting aspect is the model's demonstrated ability to grasp compositional understanding. For instance, in the 11th and 14th columns, it recognizes how meaning arises from combining parts of a sentence, such as "artificial ... intelligence" and "deep ... learning," despite the many word choices that could follow "artificial" or "deep." Given the vast token space available at each layer, it is intriguing that the model not only selects the correct token but also reflects uncertainty in its decision through low probabilities. This highlights its nuanced handling of language structure and meaning. However, this ability may not extend to more complex cases where the length of the input context and the position of relevant information vary. In such scenarios, as shown in the preliminary results of Section 3, the model may struggle to maintain the same level of compositional understanding and accuracy.
3 Relation of Logit Lens to Input Context Lengths and the Position of
Relevant Information within Input Context
Input contexts for LLMs can contain thousands of tokens, especially when processing lengthy documents or integrating external information. For these tasks, LLMs must handle long sequences efficiently. Studies by Ivgi et al. [2023] and Liu et al. [2024] have explored how LLMs use input contexts to perform downstream tasks.
Liu et al. [2024] designed an experiment to assess how LLMs access and use information from input
contexts. They manipulated two factors: (1) the length of the input context and (2) the position of
relevant information within it, to evaluate their impact on model performance. They hypothesized
that if LLMs can reliably use information from long contexts, performance should remain stable
regardless of the position of relevant information.
Multi-document question answering was selected for the experiments, where models must reason
over multiple documents to extract relevant information and answer a question. This task reflects
the retrieval-augmented generation process used in various applications. In more detail, their setup controlled two key factors: (i) input context length, by varying the number of documents (simulating different retrieval volumes), and (ii) the position of relevant information, by altering the document order and placing the relevant document at the beginning, middle, or end of the context. Their findings showed that the position of relevant information in the input context significantly impacts model performance, revealing that current language models struggle to consistently access and use information in long contexts. Notably, they observed a U-shaped performance curve: models performed best when the relevant information was at the beginning or the end of the input context, but their performance declined sharply when the information was located in the middle. In our experiment, we followed the procedures and datasets used in Liu et al. [2024] to examine how token predictions are distributed as a function of input context length and the position of relevant information.
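A hedged sketch of how such inputs can be constructed is given below; the helper build_prompt and the prompt wording are illustrative assumptions, loosely following the multi-document format of Liu et al. [2024] rather than reproducing it exactly:

from typing import List

def build_prompt(question: str, gold_doc: str, distractors: List[str],
                 gold_position: int) -> str:
    """Place the document containing the answer at `gold_position` (0-based)
    among the distractor documents."""
    docs = distractors[:]
    docs.insert(gold_position, gold_doc)
    numbered = "\n".join(f"Document [{i + 1}]: {d}" for i, d in enumerate(docs))
    return (
        "Write a high-quality answer for the given question using only the "
        "provided search results.\n\n"
        f"{numbered}\n\nQuestion: {question}\nAnswer:"
    )

# Example: 5 documents in total, with the gold document in the middle (3rd of 5).
prompt = build_prompt(
    question="Who is a prominent figure in the field of deep learning?",
    gold_doc="(Title: Deep learning) Geoffrey Hinton is a prominent figure ...",
    distractors=["(Title: Doc A) ...", "(Title: Doc B) ...",
                 "(Title: Doc C) ...", "(Title: Doc D) ..."],
    gold_position=2,
)

Varying the number of distractors changes the input context length, and varying gold_position moves the relevant document to the beginning, middle, or end of the context.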
In which layer does the model's correct prediction first appear, and in which layer is it finalized? As illustrated in Figure 2, the top-1 token that matches the correct answer typically emerges in the middle or later layers. When handling long input contexts with the relevant information placed in the middle, the correct top-1 token is identified at a similar layer as in the other conditions, but it takes many additional layers before the model finalizes the prediction. The logit lens reveals how predictions are iteratively refined as they pass through each successive layer. Early layers may produce outputs that seem plausible but are far from accurate. The model begins with rough guesses, which are gradually improved as it incorporates more context and relevant information. This study indicates that, when identifying bottlenecks or inefficiencies in the model, it is crucial to consider the position of relevant information. This positioning affects, to some extent, the refinement process of token prediction, highlighting which layers on average are critical for determining the final output.
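The two layer indices plotted in Figure 2 can be extracted from the per-layer top-1 predictions. The helper below, refinement_gap, is a hypothetical implementation of one plausible definition; the exact criterion for "finalized" is an assumption, since no explicit pseudocode is given above:

from typing import List, Tuple

def refinement_gap(top1_per_layer: List[int], answer_id: int) -> Tuple[int, int, int]:
    """Return (first_matched_layer, finalized_layer, gap).

    top1_per_layer[l] is the logit-lens top-1 token id at layer l;
    answer_id is the token id of the correct answer.
    """
    first_matched = next((l for l, t in enumerate(top1_per_layer) if t == answer_id), -1)
    # "Finalized" here means: the earliest layer from which the top-1 token stays
    # equal to the correct answer through the final layer (an assumed criterion).
    finalized = -1
    for l in range(len(top1_per_layer) - 1, -1, -1):
        if top1_per_layer[l] == answer_id:
            finalized = l
        else:
            break
    gap = finalized - first_matched if first_matched >= 0 and finalized >= 0 else -1
    return first_matched, finalized, gap

# Example: the answer (id 42) first appears at layer 6, is overwritten at layer 7,
# and is stable from layer 8 onward.
print(refinement_gap([3, 3, 7, 7, 9, 9, 42, 7, 42, 42, 42, 42, 42], answer_id=42))
# -> (6, 8, 2)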
5 Relevance to AI Safety
Figure 3: High-level overview of probing (reproduced from Figure 2 in [Levinstein and Herrmann, 2023]).

Understanding the inner working mechanisms of LLMs is essential for ensuring their safe development as we move toward more powerful models. The mechanistic interpretability approach has the potential to significantly advance LLM/AI safety research by providing a richer, stronger foundation for model evaluation [Casper, 2023]. It can also offer early warnings of emergent capabilities, enabling us
to understand better how internal structures and representations incrementally evolve as models
learn, and to detect new skills or behaviors in models before they fully develop [Wei et al., 2022,
Steinhardt, 2023, Nanda et al., 2023, Barak et al., 2022]. Furthermore, interpretability can strengthen
theoretical risk models with concrete evidence, such as identifying inner misalignment (when a
model’s behavior diverges from its intended goals). By exposing potential risks or problematic
behaviors, interpretability could also prompt a shift within the AI community toward adopting more
rigorous safety protocols [Hubinger, 2019].
When it comes to specific AI risks [Hendrycks et al., 2023], interpretability is a powerful tool
for preventing malicious misuse by identifying and eliminating sensitive information embedded
in models [Meng et al., 2022, Nguyen et al., 2022]. It can also alleviate competitive pressures by
providing clear evidence of potential threats, fostering a culture of safety within organizations, and
reinforcing AI alignment—ensuring that AI systems stay aligned with intended goals—through
enhanced monitoring and evaluation [Hendrycks and Mazeika, 2022]. In addition, interpretability
offers essential safety checks throughout the AI development process. For example, prior to training,
it informs deliberate design choices to enhance safety [Hubinger, 2019]; during training, it detects
early signs of misalignment, enabling proactive shifts toward alignment [Hubinger, 2022, Sharkey,
2022]; and after training, it ensures rigorous evaluation of artificial cognition, verifying the model’s
honesty [Burns et al., 2023, Zou et al., 2023] and screening for deceptive behaviors [Park et al., 2023].
This comprehensive approach helps safeguard AI systems at every stage, driving safer and more
trustworthy advancements in AI.
The emergence of internal world models in LLMs holds transformative potential for AI alignment
research. If we can identify internal representations of human values and align the AI system’s
objectives accordingly, achieving alignment may become a straightforward task [Wentworth, 2022].
This is especially promising if the AI’s internal world model remains distinct from any notions of
goals or agency [Ruthenis, 2022]. In such cases, simply interpreting the world model could be enough
to ensure alignment [Ruthenis, 2023], providing an effective path to safer AI systems.
Mechanistic interpretability plays a vital role in advancing various AI alignment initiatives, including
the understanding and control of existing models, facilitating AI systems in solving alignment
challenges, and developing robust alignment theories [technicalities and Stag, 2023, Hubinger, 2020].
By enhancing strategies to detect deceptive alignment—a scenario where a model appears aligned
while pursuing misaligned goals without raising suspicion [Park et al., 2023]—and eliciting latent
knowledge from models [Christiano et al., 2021], we can significantly improve scalable oversight,
such as through iterative distillation and amplification [Chan, 2023]. Moreover, comprehensive
interpretability can serve as an alignment strategy by helping us identify internal representations of
human values and guiding the model to pursue those values through re-targeting an internal search
process [Wentworth, 2022]. Ultimately, the connection between understanding and control is crucial;
a deeper understanding enables more reliable control over AI systems.
7 Conclusion
Investigating the intermediate representations by examining how a model iteratively refines token
predictions, offers insights into how a model’s internal representations evolve. In this study, we
utilized the logit lens to meticulously examine next token prediction refinements across different
conditions, to some degree, to uncover critical nuances in how predictions are progressively refined. A
deeper grasp of the prediction refinement process will enhance the development in AI safety research
and development. That is, by establishing connections between the previous findings and existing
hypotheses, we aim to advance the understanding of LLM behavior. These insights are vital for the
development of AI technologies that are not only highly effective but also secure and aligned with
human values.
Acknowledgments
The author would like to thank the support and opportunity given by Mukaya (Tai) Panich and
Thanapong Boontaeng to research this topic. The author would also like to thank the members of the
SCB 10X team for their useful comments during the contract term from May to October 2024.
References
Dan Hendrycks and Mantas Mazeika. X-risk analysis for ai research. CoRR, June 2022. URL
https://arxiv.org/abs/2206.05862v7.
Dan Hendrycks, Mantas Mazeika, and Thomas Woodside. An overview of catastrophic ai risks.
CoRR, October 2023. URL http://arxiv.org/abs/2306.12001.
Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan,
Zhonghao He, Jiayi Zhou, Zhaowei Zhang, Fanzhi Zeng, Kwan Yee Ng, Juntao Dai, Xuehai Pan,
Aidan O’Gara, Yingshan Lei, Hua Xu, Brian Tse, Jie Fu, Stephen McAleer, Yaodong Yang, Yizhou
Wang, Song-Chun Zhu, Yike Guo, and Wen Gao. Ai alignment: A comprehensive survey. CoRR,
January 2024. doi: 10.48550/arXiv.2310.19852. URL http://arxiv.org/abs/2310.19852.
Christopher Olah. Mechanistic interpretability, variables, and the importance of interpretable
bases. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/
2022/mech-interp-essay/index.html.
Neel Nanda. A comprehensive mechanistic interpretability explainer & glossary. Neel Nanda’s Blog,
December 2022. URL https://www.neelnanda.io/mechanistic-interpretability/
glossary.
Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter.
Zoom in: An introduction to circuits. Distill, March 2020. URL https://distill.pub/2020/
circuits/zoom-in.
Lee Sharkey, Sid Black, and beren. Current themes in mechanistic interpretability research. AI Alignment Forum, November 2022. URL https://www.alignmentforum.org/posts/Jgs7LQwmvErxR9BCC/current-themes-in-mechanistic-interpretability-research.
Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, and
Alexander Mordvintsev. The building blocks of interpretability. Distill, March 2018. URL
https://distill.pub/2018/building-blocks.
Neel Nanda. Mechanistic interpretability quickstart guide. Neel Nanda’s Blog, January 2023. URL
https://www.neelnanda.io/mechanistic-interpretability/quickstart.
Neel Nanda. An extremely opinionated annotated list of my favourite
mechanistic interpretability papers v2. AI Alignment Forum, July 2024.
URL https://www.alignmentforum.org/posts/NfFST5Mio7BCAQHPA/
an-extremely-opinionated-annotated-list-of-my-favourite.
Saikat Das, Namita Agarwal, Deepak Venugopal, Frederick T Sheldon, and Sajjan Shiva. Taxonomy
and survey of interpretable machine learning method. In 2020 IEEE Symposium Series on
Computational Intelligence (SSCI), pages 670–677. IEEE, 2020.
Timo Speith. A review of taxonomies of explainable artificial intelligence (xai) methods. In
Proceedings of the 2022 ACM conference on fairness, accountability, and transparency, pages
2239–2250, 2022.
Leonard Bereska and Efstratios Gavves. Mechanistic interpretability for ai safety–a review. arXiv
preprint arXiv:2404.14082, 2024.
Alexandra Zytek, Ignacio Arnaldo, Dongyu Liu, Laure Berti-Equille, and Kalyan Veeramachaneni.
The need for interpretable features: Motivation and taxonomy. ACM SIGKDD Explorations
Newsletter, 24(1):1–13, 2022.
Yu Zhang, Peter Tiňo, Aleš Leonardis, and Ke Tang. A survey on neural network interpretability.
IEEE Transactions on Emerging Topics in Computational Intelligence, 5(5):726–742, 2021.
Guillaume Alain. Understanding intermediate layers using linear classifier probes. arXiv preprint
arXiv:1610.01644, 2016.
Allyson Ettinger, Ahmed Elgohary, and Philip Resnik. Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st workshop on evaluating vector-space representations for nlp, pages 134–139, 2016.
Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61:907–926, 2018.
Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language
models without supervision. arXiv preprint arXiv:2212.03827, 2022.
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. arXiv preprint arXiv:2211.00593, 2022.
Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, and Aryaman Arora. Localizing model
behavior with path patching. arXiv preprint arXiv:2304.05969, 2023.
nostalgebraist. interpreting gpt: the logit lens. LessWrong, 2020. URL https://www.lesswrong.
com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens.
John Hewitt and Christopher D Manning. A structural probe for finding syntax in word representations.
In Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),
pages 4129–4138, 2019.
Kabir Ahuja, Vidhisha Balachandran, Madhur Panwar, Tianxing He, Noah A Smith, Navin Goyal, and
Yulia Tsvetkov. Learning syntax without planting trees: Understanding when and why transformers
generalize hierarchically. arXiv preprint arXiv:2404.16367, 2024.
Maor Ivgi, Uri Shaham, and Jonathan Berant. Efficient long-text understanding with short-text
models. Transactions of the Association for Computational Linguistics, 11:284–299, 2023.
Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and
Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the
Association for Computational Linguistics, 12:157–173, 2024.
B. A. Levinstein and Daniel A. Herrmann. Still no lie detector for language models: Probing
empirical and conceptual roadblocks. CoRR, June 2023. doi: 10.48550/arXiv.2307.00175. URL
http://arxiv.org/abs/2307.00175.
John Hewitt and Percy Liang. Designing and interpreting probes with control tasks. arXiv preprint
arXiv:1909.03368, 2019.
Kenneth Li, Aspen K Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382, 2022.
Stephen Casper. The engineer’s interpretability sequence. AI Alignment Forum, February 2023. URL
https://www.alignmentforum.org/s/a6ne2ve5uturEEQK7.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama,
Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals,
Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. TMLR,
October 2022. doi: 10.48550/arXiv.2206.07682. URL http://arxiv.org/abs/2206.07682.
Jacob Steinhardt. Emergent deception and emergent optimization. Bounded Regret, February 2023.
URL https://bounded-regret.ghost.io/emergent-deception-optimization/.
Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for
grokking via mechanistic interpretability. ICLR, January 2023. doi: 10.48550/arXiv.2301.05217.
URL http://arxiv.org/abs/2301.05217.
Boaz Barak, Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang.
Hidden progress in deep learning: Sgd learns parities near the computational limit. NeurIPS, 2022.
doi: 10.48550/arXiv.2207.08799. URL http://arxiv.org/abs/2207.08799.
Evan Hubinger. Chris olah's views on agi safety. AI Alignment Forum, November 2019. URL https://www.alignmentforum.org/posts/X2i9dQQK3gETCyqh2/chris-olah-s-views-on-agi-safety.
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual
associations in gpt. NeurIPS, 2022. doi: 10.48550/arXiv.2202.05262. URL http://arxiv.org/
abs/2202.05262.
Thanh Tam Nguyen, Thanh Trung Huynh, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi Yin,
and Quoc Viet Hung Nguyen. A survey of machine unlearning. CoRR, October 2022. doi:
10.48550/arXiv.2209.02299. URL http://arxiv.org/abs/2209.02299.
Evan Hubinger. A transparency and interpretability tech tree. AI Alignment Forum, June 2022. URL https://www.alignmentforum.org/posts/nbq2bWLcYmSGup9aF/a-transparency-and-interpretability-tech-tree.
Lee Sharkey. Circumventing interpretability: How to defeat mind-readers. CoRR, December 2022.
doi: 10.48550/ARXIV.2212.11415. URL https://arxiv.org/abs/2212.11415.
Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language
models without supervision. ICLR, 2023. URL http://arxiv.org/abs/2212.03827.
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan,
Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J.
Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson,
J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to ai
transparency. CoRR, October 2023. doi: 10.48550/arXiv.2310.01405. URL http://arxiv.org/
abs/2310.01405.
Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, and Dan Hendrycks. Ai deception: A survey of examples, risks, and potential solutions. CoRR, August 2023. doi: 10.48550/arXiv.2308.14752. URL http://arxiv.org/abs/2308.14752.
John Wentworth. How to go from interpretability to alignment: Just
retarget the search. AI Alignment Forum, August 2022. URL
https://www.alignmentforum.org/posts/w4aeAFzSAguvqA5qu/
how-to-go-from-interpretability-to-alignment-just-retarget.
Thane Ruthenis. Internal interfaces are a high-priority interpretability target. AI Alignment Forum, December 2022. URL https://www.lesswrong.com/posts/nwLQt4e7bstCyPEXs/internal-interfaces-are-a-high-priority-interpretability.
Thane Ruthenis. World-model interpretability is all we need. AI Alignment Forum,
January 2023. URL https://www.alignmentforum.org/posts/HaHcsrDSZ3ZC2b4fK/
world-model-interpretability-is-all-we-need.
technicalities and Stag. Shallow review of live agendas in alignment & safety.
LessWrong, 2023. URL https://www.lesswrong.com/posts/zaaGsFBeDTpCsYHef/
shallow-review-of-live-agendas-in-alignment-and-safety.
Evan Hubinger. An overview of 11 proposals for building safe advanced ai. CoRR, December 2020.
doi: 10.48550/arXiv.2012.07532. URL http://arxiv.org/abs/2012.07532.
Paul Christiano, Ajeya Cotra, and Mark Xu. Eliciting latent knowledge, January
2021. URL https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_
EpsnjrC1dwZXR37PC8/edit?usp=sharing&usp=embed_facebook.
Lawrence Chan. What i would do if i wasn't at arc evals. AI Alignment Forum, May 2023. URL https://www.lesswrong.com/posts/6FkWnktH3mjMAxdRT/what-i-would-do-if-i-wasn-t-at-arc-evals.