An Attempt at Unraveling Token Prediction Refinement and Identifying Essential Layers of Large Language Models
Jaturong Kongmanee
Independent Researcher
Toronto, Canada
[email protected]
Abstract
This research aims to unravel how large language models (LLMs) iteratively refine
token predictions (or, more generally, vector predictions). We used the logit lens technique to analyze the model's token predictions derived from intermediate representations. Specifically, we focused on how LLMs access and use information from input contexts, and on how the positioning of relevant information affects the model's token prediction refinement process.
token prediction refinement process. Our findings for multi-document question
answering task, by varying input context lengths (the number of documents), using
GPT-2, revealed that the number of layers between the first layer that the model
predicted next tokens correctly and the later layers that the model finalized its
correct predictions, as a function of the position of relevant information (i.e.,
placing the relevant one at the beginning, middle, or end of the input context),
has a nearly inverted U shape. We found that the gap between these two layers,
on average, diminishes when relevant information is positioned at the beginning
or end of the input context, suggesting that the model requires more refinements
when processing longer contexts with relevant information situated in the middle,
and highlighting which layers are essential for determining the correct output.
Our analysis provides insights about how token predictions are distributed across
different conditions, and establishes important connections to existing hypotheses
and previous findings in AI safety research and development.
1 Introduction
Recent technological advances have led to an increasing number of applications of capable AI systems across domains. One example is large language models (LLMs), which have exhibited improved capabilities in content and code generation and in language translation. As LLMs rapidly advance in sophistication and generality, understanding them is essential to ensure their alignment with human values and to prevent catastrophic outcomes [Hendrycks and Mazeika, 2022, Hendrycks et al., 2023, Ji et al., 2024]. Research that aims to improve our understanding of LLMs goes beyond simple performance metrics to unravel their internal workings. This work used an observational approach for analyzing the inner workings of neural networks, the logit lens (discussed in detail in Section 2). We focused on how LLMs access and use information from input contexts, and on how the positioning of relevant information affects the model's token prediction refinement process, to gain insights that can potentially aid the development of methods that ensure trustworthiness in LLMs.
Mechanistic interpretability (MI) is a growing approach that aims to precisely define computations
within neural networks. It seeks to fully specify a neural network’s computations, aiming for a
granular understanding of model behavior, akin to reverse-engineering the model’s processes into
$[\mathrm{logit}_1, \ldots, \mathrm{logit}_{|V|}] = W_U \cdot \mathrm{LayerNorm}_L(h_L^{(n)}),$  (1)
where $W_U \in \mathbb{R}^{|V| \times d}$ denotes an embedding matrix, and $\mathrm{LayerNorm}_L$ is the pre-unembedding layer normalization. The logit lens applies the same unembedding operation to the earlier hidden states $h_l^{(i)}$:

$[\mathrm{logit}_{l,1}, \ldots, \mathrm{logit}_{l,|V|}] = W_U \cdot \mathrm{LayerNorm}_L(h_l^{(n)}),$  (2)

from which an intermediate predictive distribution over tokens at layer $l$, $p_l(t_{n+1} \mid t_1, \ldots, t_n)$, can be obtained.
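For concreteness, the following is a minimal sketch of the logit lens, applying the unembedding operation of Equations (1) and (2) to every intermediate hidden state, assuming the Hugging Face transformers implementation of GPT-2 (an illustrative reconstruction, not the exact code used in this study):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "Hinton is a prominent figure in the field of artificial intelligence and deep"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states is a tuple of (n_layers + 1) tensors of shape
# (batch, seq_len, d_model); index 0 is the embedding-layer output.
for layer_idx, h in enumerate(outputs.hidden_states):
    # Apply the final layer normalization and the (tied) unembedding matrix W_U,
    # as in Equations (1) and (2).
    logits = model.lm_head(model.transformer.ln_f(h))   # (batch, seq_len, |V|)
    probs = torch.softmax(logits[0, -1], dim=-1)        # intermediate next-token distribution p_l
    top_prob, top_id = probs.max(dim=-1)
    print(f"layer {layer_idx:2d}: top-1 = {tokenizer.decode([top_id.item()])!r} (p = {top_prob.item():.3f})")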
Figure 1: (a) Maximum probability. (b) Cross-entropy. The logit lens applied to the hidden states of GPT-2 processing "Hinton is a prominent figure in the field of artificial intelligence and deep learning." The vertical axis represents layers ranging from 0 to 12 (bottom to top), and the horizontal axis represents input tokens. Each cell shows the most likely next token (i.e., the top-1 token prediction) that the respective hidden state predicts. For maximum probability, darker cells correspond to higher probabilities; for cross-entropy and forward Kullback-Leibler (KL) divergence, lighter cells correspond to higher probabilities.

In this experiment, we applied the logit lens to the hidden states of GPT-2 processing internet text. For example, given the instance "Hinton is a prominent figure in the field of artificial intelligence and deep learning.", shown in Figure 1, we observed how the distributions over next-token predictions gradually converge to the final distribution. The general trend we observed is that the initial predictions of tokens
resemble plausible completions but seem far from the correct prediction. As the layers progress, the predictions become more accurate with respect to the ground-truth next token. Roughly in the middle layers, the model forms a correct prediction of the next token, with later layers refining this prediction toward higher predicted probabilities.
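The per-layer quantities visualized in Figure 1 can be computed from the same hidden states. The sketch below, under the same assumptions as the previous snippet and using a hypothetical helper layerwise_metrics, records the maximum probability, the cross-entropy against the ground-truth next token, and the forward KL divergence to the final layer's distribution:

import torch
import torch.nn.functional as F

def layerwise_metrics(hidden_states, ln_f, lm_head, target_id):
    """hidden_states: tuple of (batch, seq_len, d_model) tensors from GPT-2."""
    final_log_probs = F.log_softmax(lm_head(ln_f(hidden_states[-1]))[0, -1], dim=-1)
    rows = []
    for l, h in enumerate(hidden_states):
        log_probs = F.log_softmax(lm_head(ln_f(h))[0, -1], dim=-1)
        rows.append({
            "layer": l,
            "max_prob": log_probs.exp().max().item(),
            # Cross-entropy of this layer's distribution against the ground-truth next token.
            "cross_entropy": -log_probs[target_id].item(),
            # Forward KL(final || layer l): how far this layer's distribution is
            # from the model's final next-token distribution.
            "kl_to_final": F.kl_div(log_probs, final_log_probs,
                                    log_target=True, reduction="sum").item(),
        })
    return rows

Calling layerwise_metrics(outputs.hidden_states, model.transformer.ln_f, model.lm_head, target_id), where target_id is the id of the ground-truth next token, yields one row of values per layer for a given input position.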
Take the 6th and 9th columns of Figure 1 as an example involving linking words (e.g., conjunctions and subordinating conjunctions), in this case "figure in" and "field of". The model predicts the next word correctly in an early layer and does not switch to other tokens in later layers, while its confidence evolves across layers until it is highly confident in its final token prediction (predicted probabilities of nearly 100%). Despite being trained on vast amounts of text without explicit syntactic rules, language models, GPT-2 in this case, show an ability to capture dependency relations and structural aspects of language. A study by Hewitt and Manning [2019] shows that language models capture hierarchical syntactic structures, such as nested clauses, phrase boundaries, and subject-verb-object relationships.
[Figure 2 plots (GPT-2): accuracy and the layers at which the top-1 token first matched / was finalized, plotted against the position of the document with the answer tokens; see caption below.]
Figure 2: Left: Shifting the location of relevant information within the model's input context reveals a U-shaped performance pattern. The model performs best when the relevant information is at the very beginning or end of the input, while its performance, on average, declines significantly when the critical information appears in the middle of the context. Right: Similarly, changing the position of relevant information produces a nearly inverted-U pattern, illustrating when the top-1 token first aligns with the correct answer and when it is finalized. The gap between these two layers is, on average, smaller when the relevant information is placed at the beginning or end of the input, indicating that the model needs more prediction refinements when processing a long input context with the relevant information positioned in the middle. The experiments were repeated for 10 independent runs; dots represent mean (arithmetic average) performance, and shaded bands represent 95% confidence intervals (CI).
However, an intriguing question is how transformer-based models, trained solely on natural language data, manage to learn its hierarchical structure and generalize to sentences with unseen syntactic patterns, despite lacking any explicitly encoded structural bias.
Ahuja et al. [2024] explored the inductive biases in transformers that may drive this generalization. Through extensive experiments on synthetic datasets with various training objectives, they found that, while objectives such as sequence-to-sequence and prefix language modeling often fail to produce hierarchical generalization, models trained with the language modeling objective consistently do. The authors also conducted model pruning experiments to examine how transformers encode hierarchical structure, and found (pruned) sub-networks with distinct generalization behaviors aligned with hierarchical structures and with linear order. Model pruning reduces the size of a network by removing unnecessary parameters while maintaining performance. Pruning not only reduces model complexity and computational costs but also provides insights into model efficiency and interpretability; for example, pruning can reveal which parts of a network are most critical for certain tasks, shedding light on how models work internally.
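As a generic illustration of magnitude pruning (not the specific sub-network search procedure of Ahuja et al. [2024]), PyTorch's built-in pruning utilities can zero out low-magnitude weights and expose how sparse a layer can be made:

import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(768, 768)

# Zero out the 30% of weights with the smallest absolute value (L1 magnitude pruning).
prune.l1_unstructured(layer, name="weight", amount=0.3)

# The pruning mask is stored alongside the original weights; make it permanent.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"fraction of pruned weights: {sparsity:.2f}")  # ~0.30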
Another interesting aspect is the model's demonstrated ability to grasp compositional understanding. For instance, in the 11th and 14th columns, it recognizes how meaning arises from combining parts of a sentence, such as "artificial ... intelligence" and "deep ... learning," despite the many word choices that could follow "artificial" or "deep." Given the vast token space available at each layer, it is intriguing that the model not only selects the correct token but also reflects uncertainty in its decision through low probabilities. This highlights its nuanced handling of language structure and meaning. However, this ability may not extend to more complex cases where the length of the input context and the position of relevant information vary. In such scenarios, as shown in the preliminary results of Section 3, the model may struggle to maintain the same level of compositional understanding and accuracy.
3 Relation of Logit Lens to Input Context Lengths and the Position of
Relevant Information within Input Context
Input contexts for LLMs can contain thousands of tokens, especially when processing lengthy documents or integrating external information. For these tasks, LLMs must handle long sequences efficiently. Studies by Ivgi et al. [2023] and Liu et al. [2024] have explored how LLMs use input contexts to perform downstream tasks.
Liu et al. [2024] designed an experiment to assess how LLMs access and use information from input
contexts. They manipulated two factors: (1) the length of the input context and (2) the position of
relevant information within it, to evaluate their impact on model performance. They hypothesized
that if LLMs can reliably use information from long contexts, performance should remain stable
regardless of the position of relevant information.
Multi-document question answering was selected for the experiments, where models must reason
over multiple documents to extract relevant information and answer a question. This task reflects
the retrieval-augmented generation process used in various applications. In more detail, their setup controlled two key factors: (i) input context length, by varying the number of documents (simulating different retrieval volumes), and (ii) the position of relevant information, by altering the document order and placing the relevant document at the beginning, middle, or end of the context. Their findings showed that the position of relevant information in the input context significantly impacts model performance, revealing that current language models struggle to consistently access and use information in long contexts. Notably, they observed a U-shaped performance curve: models performed best when the relevant information was at the beginning or the end of the input context, but their performance declined sharply when the information was located in the middle. In our experiment, we followed the procedures and datasets used in Liu et al. [2024] to examine how token predictions are distributed as a function of input context length and the position of relevant information.
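A hedged sketch of how such inputs can be constructed is given below; the helper build_prompt and the prompt wording are illustrative assumptions, loosely following the multi-document format of Liu et al. [2024] rather than reproducing it exactly:

from typing import List

def build_prompt(question: str, gold_doc: str, distractors: List[str],
                 gold_position: int) -> str:
    """Place the document containing the answer at `gold_position` (0-based)
    among the distractor documents."""
    docs = distractors[:]
    docs.insert(gold_position, gold_doc)
    numbered = "\n".join(f"Document [{i + 1}]: {d}" for i, d in enumerate(docs))
    return (
        "Write a high-quality answer for the given question using only the "
        "provided search results.\n\n"
        f"{numbered}\n\nQuestion: {question}\nAnswer:"
    )

# Example: 5 documents in total, with the gold document in the middle (3rd of 5).
prompt = build_prompt(
    question="Who is a prominent figure in the field of deep learning?",
    gold_doc="(Title: Deep learning) Geoffrey Hinton is a prominent figure ...",
    distractors=["(Title: Doc A) ...", "(Title: Doc B) ...",
                 "(Title: Doc C) ...", "(Title: Doc D) ..."],
    gold_position=2,
)

Varying the number of distractors changes the input context length, and varying gold_position moves the relevant document to the beginning, middle, or end of the context.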
In which layer does the model's correct prediction first appear, and in which layer is it finalized? As illustrated in Figure 2, the top-1 token that matches the correct answer typically emerges in the middle or later layers. When handling long input contexts with the relevant information placed in the middle, the correct top-1 token is identified at a similar layer as in the other conditions, but it takes many additional layers before the model finalizes the prediction. The logit lens reveals how predictions are iteratively refined as they pass through each successive layer. Early layers may produce outputs that seem plausible but are far from accurate. The model begins with rough guesses, which are gradually improved as it incorporates more context and relevant information. This study indicates that, when identifying bottlenecks or inefficiencies in the model, it is crucial to consider the position of relevant information. This positioning affects, to some extent, the refinement process of token prediction, highlighting which layers on average are critical for determining the final output.
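The two layer indices plotted in Figure 2 can be extracted from the per-layer top-1 predictions. The helper below, refinement_gap, is a hypothetical implementation of one plausible definition; the exact criterion for "finalized" is an assumption, since no explicit pseudocode is given above:

from typing import List, Tuple

def refinement_gap(top1_per_layer: List[int], answer_id: int) -> Tuple[int, int, int]:
    """Return (first_matched_layer, finalized_layer, gap).

    top1_per_layer[l] is the logit-lens top-1 token id at layer l;
    answer_id is the token id of the correct answer.
    """
    first_matched = next((l for l, t in enumerate(top1_per_layer) if t == answer_id), -1)
    # "Finalized" here means: the earliest layer from which the top-1 token stays
    # equal to the correct answer through the final layer (an assumed criterion).
    finalized = -1
    for l in range(len(top1_per_layer) - 1, -1, -1):
        if top1_per_layer[l] == answer_id:
            finalized = l
        else:
            break
    gap = finalized - first_matched if first_matched >= 0 and finalized >= 0 else -1
    return first_matched, finalized, gap

# Example: the answer (id 42) first appears at layer 6, is overwritten at layer 7,
# and is stable from layer 8 onward.
print(refinement_gap([3, 3, 7, 7, 9, 9, 42, 7, 42, 42, 42, 42, 42], answer_id=42))
# -> (6, 8, 2)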
5 Relevance to AI Safety
Figure 3: High-level overview of probing (reproduced from Figure 2 in [Levinstein and Herrmann, 2023]).

Understanding the inner working mechanisms of LLMs is essential for ensuring their safe development as we move toward more powerful models. The mechanistic interpretability approach has the potential to significantly advance LLM/AI safety research by providing a richer, stronger foundation for model evaluation [Casper, 2023]. It can also offer early warnings of emergent capabilities, enabling us
to understand better how internal structures and representations incrementally evolve as models
learn, and to detect new skills or behaviors in models before they fully develop [Wei et al., 2022,
Steinhardt, 2023, Nanda et al., 2023, Barak et al., 2022]. Furthermore, interpretability can strengthen
theoretical risk models with concrete evidence, such as identifying inner misalignment (when a
model’s behavior diverges from its intended goals). By exposing potential risks or problematic
behaviors, interpretability could also prompt a shift within the AI community toward adopting more
rigorous safety protocols [Hubinger, 2019].
When it comes to specific AI risks [Hendrycks et al., 2023], interpretability is a powerful tool
for preventing malicious misuse by identifying and eliminating sensitive information embedded
in models [Meng et al., 2022, Nguyen et al., 2022]. It can also alleviate competitive pressures by
providing clear evidence of potential threats, fostering a culture of safety within organizations, and
reinforcing AI alignment—ensuring that AI systems stay aligned with intended goals—through
enhanced monitoring and evaluation [Hendrycks and Mazeika, 2022]. In addition, interpretability
offers essential safety checks throughout the AI development process. For example, prior to training,
it informs deliberate design choices to enhance safety [Hubinger, 2019]; during training, it detects
early signs of misalignment, enabling proactive shifts toward alignment [Hubinger, 2022, Sharkey,
2022]; and after training, it ensures rigorous evaluation of artificial cognition, verifying the model’s
honesty [Burns et al., 2023, Zou et al., 2023] and screening for deceptive behaviors [Park et al., 2023].
This comprehensive approach helps safeguard AI systems at every stage, driving safer and more
trustworthy advancements in AI.
The emergence of internal world models in LLMs holds transformative potential for AI alignment
research. If we can identify internal representations of human values and align the AI system’s
objectives accordingly, achieving alignment may become a straightforward task [Wentworth, 2022].
This is especially promising if the AI’s internal world model remains distinct from any notions of
goals or agency [Ruthenis, 2022]. In such cases, simply interpreting the world model could be enough
to ensure alignment [Ruthenis, 2023], providing an effective path to safer AI systems.
Mechanistic interpretability plays a vital role in advancing various AI alignment initiatives, including
the understanding and control of existing models, facilitating AI systems in solving alignment
challenges, and developing robust alignment theories [technicalities and Stag, 2023, Hubinger, 2020].
By enhancing strategies to detect deceptive alignment—a scenario where a model appears aligned
while pursuing misaligned goals without raising suspicion [Park et al., 2023]—and eliciting latent
knowledge from models [Christiano et al., 2021], we can significantly improve scalable oversight,
such as through iterative distillation and amplification [Chan, 2023]. Moreover, comprehensive
interpretability can serve as an alignment strategy by helping us identify internal representations of
human values and guiding the model to pursue those values through re-targeting an internal search
process [Wentworth, 2022]. Ultimately, the connection between understanding and control is crucial;
a deeper understanding enables more reliable control over AI systems.
7 Conclusion
Investigating the intermediate representations by examining how a model iteratively refines token
predictions, offers insights into how a model’s internal representations evolve. In this study, we
utilized the logit lens to meticulously examine next token prediction refinements across different
conditions, to some degree, to uncover critical nuances in how predictions are progressively refined. A
deeper grasp of the prediction refinement process will enhance the development in AI safety research
and development. That is, by establishing connections between the previous findings and existing
hypotheses, we aim to advance the understanding of LLM behavior. These insights are vital for the
development of AI technologies that are not only highly effective but also secure and aligned with
human values.
Acknowledgments
The author would like to thank the support and opportunity given by Mukaya (Tai) Panich and
Thanapong Boontaeng to research this topic. The author would also like to thank the members of the
SCB 10X team for their useful comments during the contract term from May to October 2024.
References
Dan Hendrycks and Mantas Mazeika. X-risk analysis for ai research. CoRR, June 2022. URL
https://arxiv.org/abs/2206.05862v7.
Dan Hendrycks, Mantas Mazeika, and Thomas Woodside. An overview of catastrophic ai risks.
CoRR, October 2023. URL http://arxiv.org/abs/2306.12001.
Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan,
Zhonghao He, Jiayi Zhou, Zhaowei Zhang, Fanzhi Zeng, Kwan Yee Ng, Juntao Dai, Xuehai Pan,
Aidan O’Gara, Yingshan Lei, Hua Xu, Brian Tse, Jie Fu, Stephen McAleer, Yaodong Yang, Yizhou
Wang, Song-Chun Zhu, Yike Guo, and Wen Gao. Ai alignment: A comprehensive survey. CoRR,
January 2024. doi: 10.48550/arXiv.2310.19852. URL http://arxiv.org/abs/2310.19852.
Christopher Olah. Mechanistic interpretability, variables, and the importance of interpretable
bases. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/
2022/mech-interp-essay/index.html.
Neel Nanda. A comprehensive mechanistic interpretability explainer & glossary. Neel Nanda’s Blog,
December 2022. URL https://www.neelnanda.io/mechanistic-interpretability/
glossary.
Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter.
Zoom in: An introduction to circuits. Distill, March 2020. URL https://distill.pub/2020/
circuits/zoom-in.
Lee Sharkey, Sid Black, and beren. Current themes in mechanistic interpretability research. AI Alignment Forum, November 2022. URL https://www.alignmentforum.org/posts/Jgs7LQwmvErxR9BCC/current-themes-in-mechanistic-interpretability-research.
Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, and
Alexander Mordvintsev. The building blocks of interpretability. Distill, March 2018. URL
https://distill.pub/2018/building-blocks.
Neel Nanda. Mechanistic interpretability quickstart guide. Neel Nanda’s Blog, January 2023. URL
https://www.neelnanda.io/mechanistic-interpretability/quickstart.
Neel Nanda. An extremely opinionated annotated list of my favourite
mechanistic interpretability papers v2. AI Alignment Forum, July 2024.
URL https://www.alignmentforum.org/posts/NfFST5Mio7BCAQHPA/
an-extremely-opinionated-annotated-list-of-my-favourite.
Saikat Das, Namita Agarwal, Deepak Venugopal, Frederick T Sheldon, and Sajjan Shiva. Taxonomy
and survey of interpretable machine learning method. In 2020 IEEE Symposium Series on
Computational Intelligence (SSCI), pages 670–677. IEEE, 2020.
Timo Speith. A review of taxonomies of explainable artificial intelligence (xai) methods. In
Proceedings of the 2022 ACM conference on fairness, accountability, and transparency, pages
2239–2250, 2022.
Leonard Bereska and Efstratios Gavves. Mechanistic interpretability for ai safety–a review. arXiv
preprint arXiv:2404.14082, 2024.
Alexandra Zytek, Ignacio Arnaldo, Dongyu Liu, Laure Berti-Equille, and Kalyan Veeramachaneni.
The need for interpretable features: Motivation and taxonomy. ACM SIGKDD Explorations
Newsletter, 24(1):1–13, 2022.
Yu Zhang, Peter Tiňo, Aleš Leonardis, and Ke Tang. A survey on neural network interpretability.
IEEE Transactions on Emerging Topics in Computational Intelligence, 5(5):726–742, 2021.
Guillaume Alain. Understanding intermediate layers using linear classifier probes. arXiv preprint
arXiv:1610.01644, 2016.
Allyson Ettinger, Ahmed Elgohary, and Philip Resnik. Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st workshop on evaluating vector-space representations for nlp, pages 134–139, 2016.
Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61:907–926, 2018.
Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language
models without supervision. arXiv preprint arXiv:2212.03827, 2022.
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. arXiv preprint arXiv:2211.00593, 2022.
Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, and Aryaman Arora. Localizing model
behavior with path patching. arXiv preprint arXiv:2304.05969, 2023.
nostalgebraist. interpreting gpt: the logit lens. LessWrong, 2020. URL https://www.lesswrong.
com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens.
John Hewitt and Christopher D Manning. A structural probe for finding syntax in word representations.
In Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),
pages 4129–4138, 2019.
Kabir Ahuja, Vidhisha Balachandran, Madhur Panwar, Tianxing He, Noah A Smith, Navin Goyal, and
Yulia Tsvetkov. Learning syntax without planting trees: Understanding when and why transformers
generalize hierarchically. arXiv preprint arXiv:2404.16367, 2024.
Maor Ivgi, Uri Shaham, and Jonathan Berant. Efficient long-text understanding with short-text
models. Transactions of the Association for Computational Linguistics, 11:284–299, 2023.
Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and
Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the
Association for Computational Linguistics, 12:157–173, 2024.
B. A. Levinstein and Daniel A. Herrmann. Still no lie detector for language models: Probing
empirical and conceptual roadblocks. CoRR, June 2023. doi: 10.48550/arXiv.2307.00175. URL
http://arxiv.org/abs/2307.00175.
John Hewitt and Percy Liang. Designing and interpreting probes with control tasks. arXiv preprint
arXiv:1909.03368, 2019.
Kenneth Li, Aspen K Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382, 2022.
Stephen Casper. The engineer’s interpretability sequence. AI Alignment Forum, February 2023. URL
https://www.alignmentforum.org/s/a6ne2ve5uturEEQK7.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama,
Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals,
Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. TMLR,
October 2022. doi: 10.48550/arXiv.2206.07682. URL http://arxiv.org/abs/2206.07682.
Jacob Steinhardt. Emergent deception and emergent optimization. Bounded Regret, February 2023.
URL https://bounded-regret.ghost.io/emergent-deception-optimization/.
Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for
grokking via mechanistic interpretability. ICLR, January 2023. doi: 10.48550/arXiv.2301.05217.
URL http://arxiv.org/abs/2301.05217.
Boaz Barak, Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang.
Hidden progress in deep learning: Sgd learns parities near the computational limit. NeurIPS, 2022.
doi: 10.48550/arXiv.2207.08799. URL http://arxiv.org/abs/2207.08799.
Evan Hubinger. Chris olah's views on agi safety. AI Alignment Forum, November 2019. URL https://www.alignmentforum.org/posts/X2i9dQQK3gETCyqh2/chris-olah-s-views-on-agi-safety.
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual
associations in gpt. NeurIPS, 2022. doi: 10.48550/arXiv.2202.05262. URL http://arxiv.org/
abs/2202.05262.
Thanh Tam Nguyen, Thanh Trung Huynh, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi Yin,
and Quoc Viet Hung Nguyen. A survey of machine unlearning. CoRR, October 2022. doi:
10.48550/arXiv.2209.02299. URL http://arxiv.org/abs/2209.02299.
Evan Hubinger. A transparency and interpretability tech tree. AI Alignment Forum, June 2022. URL https://www.alignmentforum.org/posts/nbq2bWLcYmSGup9aF/a-transparency-and-interpretability-tech-tree.
Lee Sharkey. Circumventing interpretability: How to defeat mind-readers. CoRR, December 2022.
doi: 10.48550/ARXIV.2212.11415. URL https://arxiv.org/abs/2212.11415.
Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language
models without supervision. ICLR, 2023. URL http://arxiv.org/abs/2212.03827.
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan,
Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J.
Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson,
J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to ai
transparency. CoRR, October 2023. doi: 10.48550/arXiv.2310.01405. URL http://arxiv.org/
abs/2310.01405.
Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, and Dan Hendrycks. Ai deception: A survey of examples, risks, and potential solutions. CoRR, August 2023. doi: 10.48550/arXiv.2308.14752. URL http://arxiv.org/abs/2308.14752.
John Wentworth. How to go from interpretability to alignment: Just
retarget the search. AI Alignment Forum, August 2022. URL
https://www.alignmentforum.org/posts/w4aeAFzSAguvqA5qu/
how-to-go-from-interpretability-to-alignment-just-retarget.
Thane Ruthenis. Internal interfaces are a high-priority interpretability target. AI Alignment Forum, December 2022. URL https://www.lesswrong.com/posts/nwLQt4e7bstCyPEXs/internal-interfaces-are-a-high-priority-interpretability.
Thane Ruthenis. World-model interpretability is all we need. AI Alignment Forum,
January 2023. URL https://www.alignmentforum.org/posts/HaHcsrDSZ3ZC2b4fK/
world-model-interpretability-is-all-we-need.
technicalities and Stag. Shallow review of live agendas in alignment & safety.
LessWrong, 2023. URL https://www.lesswrong.com/posts/zaaGsFBeDTpCsYHef/
shallow-review-of-live-agendas-in-alignment-and-safety.
Evan Hubinger. An overview of 11 proposals for building safe advanced ai. CoRR, December 2020.
doi: 10.48550/arXiv.2012.07532. URL http://arxiv.org/abs/2012.07532.
Paul Christiano, Ajeya Cotra, and Mark Xu. Eliciting latent knowledge, January
2021. URL https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_
EpsnjrC1dwZXR37PC8/edit?usp=sharing&usp=embed_facebook.
Lawrence Chan. What i would do if i wasn't at arc evals. AI Alignment Forum, May 2023. URL https://www.lesswrong.com/posts/6FkWnktH3mjMAxdRT/what-i-would-do-if-i-wasn-t-at-arc-evals.