LLM Attention
Imagine you're reading a long novel. You don't read every word with the same level of focus. Instead, you pay more
attention to certain parts, like the plot twists or character developments. This is similar to how attention works in a
language model.
The model presented in this repository is a simple language model that predicts the next word using a self-attention mechanism.
Self-Attention: The self-attention mechanism allows the model to focus on different parts of the input
sequence based on their relevance to the current output. This is achieved by assigning weights to each input
element, with larger weights indicating greater importance.
1. Embedding: Each word in the input sequence is converted into a numerical representation called an
embedding.
2. Query, Key, and Value: For each word, three vectors are calculated: a query, a key, and a value.
3. Attention Scores: The dot product between the query of one word and the keys of all other words is
calculated. This gives a score representing the similarity between the words.
4. Softmax: The scores are normalized using the softmax function to obtain attention weights.
5. Context Vector: The weighted sum of the value vectors, using the attention weights, creates a context vector.
This context vector captures the relevant information from the entire sequence.
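To make steps 1–5 concrete, here is a minimal numpy sketch (an illustration, not the repository's code): the embeddings and projection matrices are random placeholders, chosen only to show the shapes and the flow from embeddings to context vectors.

import numpy as np

np.random.seed(0)

# Step 1: toy embeddings for a 4-word sequence (embedding size 3)
embeddings = np.random.rand(4, 3)

# Step 2: project embeddings into query, key, and value vectors
# (random matrices stand in for learned projection weights)
W_q, W_k, W_v = (np.random.rand(3, 3) for _ in range(3))
Q, K, V = embeddings @ W_q, embeddings @ W_k, embeddings @ W_v

# Step 3: attention scores via dot products, scaled by sqrt of the key size
scores = Q @ K.T / np.sqrt(K.shape[1])

# Step 4: softmax turns each row of scores into attention weights
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# Step 5: context vectors are the attention-weighted sums of the values
context = weights @ V
print(context.shape)  # (4, 3): one context vector per word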
Word Embeddings: The code initializes random word embeddings. In practice, pre-trained embeddings like
Word2Vec or GloVe can be used for better performance.
Context Window: The code defines a context window, specifying the number of words to consider before the
current word.
Self-Attention: The code calculates attention scores, applies softmax, and creates the context vector.
Prediction: The context vector is used to predict the next word, often using a simple linear layer or a more
complex model.
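To make the Prediction step concrete, here is a minimal sketch of the "simple linear layer" option mentioned above. The context vector and weight matrix are random placeholders used only for illustration; they are not part of the repository's code.

import numpy as np

np.random.seed(1)
vocab_size, embed_dim = 9, 3

# Hypothetical context vector produced by self-attention
context_vector = np.random.rand(embed_dim)

# A simple linear layer projects the context vector onto the vocabulary
W_out = np.random.rand(embed_dim, vocab_size)  # stand-in for learned weights
logits = context_vector @ W_out

# Softmax over the vocabulary, then pick the most likely next word
probs = np.exp(logits - logits.max())
probs /= probs.sum()
predicted_index = int(np.argmax(probs))
print(predicted_index, probs[predicted_index])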
(Image credit: Attention Map-Guided Visual Explanations for Deep Neural Networks)
In the image above, the sentence ‘the train left the station on time’ contains 7 tokens, so we get a 7x7 matrix of attention scores.
According to the self-attention scores depicted in the picture, the word ‘train’ pays more attention to the word ‘station’ than to other words such as ‘on’ or ‘the’. Likewise, the word ‘station’ pays more attention to the word ‘train’ than to other words such as ‘on’ or ‘the’.
Attention is a mechanism that allows a language model to focus on different parts of its input sequence based on
their relevance to the current output. This is achieved by assigning weights to each input element, with larger
weights indicating greater importance. These weights are calculated using a similarity metric, such as the dot
product, between the query vector and each key vector in the input sequence.
For instance, in translation, attention helps the model concentrate on words or phrases that are semantically
connected, leading to more accurate translations. On the other hand, this same mechanism can be exploited to
generate misleading or biased text by directing the model's focus towards specific information.
Positive impact: Attention allows LLMs to focus on the most relevant parts of the input sequence when
generating a response. This leads to responses that are more coherent, relevant, and grammatically correct.
Negative impact:
Focus on misleading information: If the input contains misleading or irrelevant keywords, the LLM's attention may be drawn to them, resulting in inaccurate, nonsensical, or otherwise undesired responses.
Missing key information: The LLM might overlook crucial information if its wording differs from what the model was trained on.
2. Introduction
In this guide, we will explore the concept of attention through a Python code snippet: a basic language model that uses self-attention to predict the next word in a sentence.
Word: "cat" (represented by a short vector of numbers, its embedding)
These numbers, or embeddings, are randomly initialized. Over time, as the model learns from data, these embeddings adjust to better represent the meaning and context of the words.
Query: "jumps" (the word we are currently processing)
Keys: "The," "quick," "brown," "fox," "over," "the," "lazy," "dog" (all the words in the sentence)
The model calculates a similarity score between the query and each key. This score represents how relevant each word is to understanding "jumps." For instance, "fox" and "jumps" might have a high similarity score.
Similarity Scores: [0.2, 0.5, 0.1, 0.6, 0.3, 0.2, 0.4, 0.1]
Passing these scores through the softmax function turns them into probabilities, which indicate the importance of each word in the context of understanding "jumps."
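Taking the similarity scores listed above, a quick numpy sketch shows how softmax turns them into these probabilities (a toy illustration using the example's eight scores):

import numpy as np

scores = np.array([0.2, 0.5, 0.1, 0.6, 0.3, 0.2, 0.4, 0.1])  # similarity scores from the example

# Softmax turns the raw scores into a probability distribution
weights = np.exp(scores - scores.max())
weights /= weights.sum()

print(np.round(weights, 3))  # the highest scores (0.6 for "fox", 0.5 for "quick") get the largest probabilities
print(weights.sum())         # 1.0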
3.4. Predicting the Next Word
The model then uses these probabilities to weigh the influence of each word's embedding. The word with the
highest combined weight is predicted as the next word.
Word Embeddings: The model uses randomly generated embeddings to represent each word. These embeddings serve directly as the key and value vectors, and the query is the average of the context embeddings; the attention weights are computed from these vectors.
Overall, the model can be considered a simple language model with a self-attention mechanism for next word
prediction. It demonstrates the core idea of self-attention but lacks the complexity of more advanced models like
Transformers, which utilize this mechanism extensively.
import numpy as np
We start by importing the numpy library, which provides support for large, multi-dimensional arrays and matrices,
along with a large collection of high-level mathematical functions to operate on these arrays.
def softmax(x):
    """
    This softmax function is often used in machine learning and deep learning to convert
    a vector of real numbers into a probability distribution.
    Each output value is between 0 and 1 (inclusive), and the sum of all output values is 1.
    """
    # Subtract the max value in the input array from all elements for numerical stability.
    # This keeps the exponentials from overflowing, since every shifted value is at most 0.
    x = x - np.max(x)

    # Apply the exponential function to each element in the array.
    # This transforms each value in the array into a positive value.
    exp_x = np.exp(x)

    # Divide each element in the array by the sum of all elements in the array.
    # This normalizes the values so that they all add up to 1, which is a requirement for a probability distribution.
    softmax_x = exp_x / np.sum(exp_x)

    # Return the resulting array, which represents a probability distribution over the input array.
    return softmax_x
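For example, calling the function above on a small score vector shows that larger scores receive larger probabilities and that the outputs sum to 1 (the numbers here are arbitrary):

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # approximately [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0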
def create_word_representations(sentences):
    word_to_index = {}
    index_to_word = {}
    word_embeddings = []

    for sentence in sentences:
        for word in sentence.split():
            if word not in word_to_index:
                word_to_index[word] = len(word_to_index)
                index_to_word[len(index_to_word)] = word
                word_embeddings.append(np.random.rand(3))  # Random embeddings

    return np.array(word_embeddings), word_to_index, index_to_word
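As a quick check, calling this function on the example sentence used later in the guide yields one index per unique (case-sensitive) token and one random 3-dimensional embedding per token:

sentences = ["The quick brown fox jumps over the lazy dog"]
word_embeddings, word_to_index, index_to_word = create_word_representations(sentences)

print(word_to_index)          # {'The': 0, 'quick': 1, 'brown': 2, 'fox': 3, 'jumps': 4, 'over': 5, 'the': 6, 'lazy': 7, 'dog': 8}
print(word_embeddings.shape)  # (9, 3): one random embedding per unique word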
Initialization: Random embeddings provide a starting point for the model to learn meaningful representations
of words. Without them, the model wouldn't know where to begin and its outputs would likely be nonsensical.
Exploration: Randomness encourages the model to explore different directions in the solution space,
potentially leading to better performance as it learns from the data.
Arbitrary Starting Point: Random embeddings are essentially random guesses about how words should be
represented. They may not capture any inherent relationships between words initially.
Slower Learning: The model might take longer to converge on optimal word representations if the random
starting points are far from the ideal ones.
The quality of the word embeddings directly affects the model's output:
Better Embeddings, Better Outputs: If the model starts with good word representations that capture semantic
relationships, it will be better at predicting the next word in a sentence and generating more coherent and
relevant outputs.
Poor Embeddings, Poor Outputs: With random embeddings, the model might struggle to understand the
context and relationships between words. This can lead to nonsensical or grammatically incorrect outputs.
Example:
Consider the sentence "The quick brown fox jumps over the lazy dog."
With good embeddings: The model might identify the relationship between "fox" and "jumps" and predict
"jumps" as the next word.
With poor embeddings: The model might struggle to connect "fox" to any meaningful word and might predict
something unrelated, like "The" or "dog."
In summary, while randomly generated embeddings may seem arbitrary at first, they play a crucial role in initializing
the model and allowing it to learn meaningful representations of words.
8. Calculating Self-Attention
The calculate_self_attention function computes a scaled dot-product score between the query (the averaged context embeddings) and the key vector of each word in the sentence. It then exponentiates these scores to obtain attention weights, which are later normalized into probabilities.
The attention weights show how much importance the model assigns to each word in the context when predicting
the next word. Higher weights indicate greater relevance.
The: 2.1307
quick: 2.5428
brown: 1.9087
fox: 2.6365
jumps: 2.2119
over: 1.2500
the: 2.1166
lazy: 2.5802
dog: 1.5677
As you can see, the words "quick," "fox," and "lazy" have the highest weights, suggesting they are the most important
for predicting the next word.
(Image credit: An illustration of next word prediction with state-of-the-art network architectures like BERT, GPT, and XLNet)
This code is implementing a simple version of the self-attention mechanism, which is a key component in
Transformer models used in natural language processing. The self-attention mechanism allows the model to weigh
the importance of words in a sentence when predicting the next word.
calculate_self_attention(query, keys, values): This function calculates the self-attention weights. The scores are obtained by taking the dot product of the query with each key and scaling it by the square root of the key dimension; applying softmax afterwards turns the resulting weights into a probability distribution. In the canonical formulation of self-attention, the output vector is then the weighted sum of the value vectors, where the weights are the attention weights.
if __name__ == "__main__":
    sentences = [
        "The quick brown fox jumps over the lazy dog",
    ]

    word_embeddings, word_to_index, index_to_word = create_word_representations(sentences)
    current_word = "jumps"
    context_window_size = 2  # Considering two words before the current word

    for sentence in sentences:
        words = sentence.split()
        current_word_index = words.index(current_word)
        context_window = words[max(0, current_word_index - context_window_size):current_word_index]
        predicted_word, attention_probabilities = predict_next_word_with_self_attention(
            current_word, context_window, words, word_embeddings, word_to_index, index_to_word
        )
        print(f"\nGiven the word: {current_word}")
        print(f"Context: {' '.join(context_window)}")  # Print context window
        print(f"Sentence: {sentence}")
        print("Attention Probabilities:")
        for word, prob in zip(words, attention_probabilities):
            print(f"\t{word}: {prob:.4f}")
        print(f"Predicted next word: {predicted_word}\n")
        print("""
        The word embeddings are initialized randomly in this code.
        This means that the relationships between different words are not captured in the embeddings,
        which could lead to seemingly random attention probabilities.
        """)
        print(f"Prediction process: The model uses the context of the given word '{current_word}' to predict the next word. The attention mechanism assigns different weights to the words in the context based on their relevance. The word with the highest weight is considered as the most relevant word for the prediction.")
        print(f"Attention Impact: The attention probabilities show the relevance of each word in the context for the prediction. The higher the probability, the more impact the word has on the prediction.\n")
This code provides a basic model that uses self-attention to predict the next word in a sentence. It demonstrates the
core idea of self-attention but lacks the complexity of more advanced models like Transformers, which utilize this
mechanism extensively.
The size of the vocabulary and the complexity of the language can also affect the model's accuracy.
The "intelligence" of an LLM is directly tied to the quality and diversity of its training data. Here's how:
Data Bias: If the training data is biased, the LLM will also be biased in its outputs. For example, an LLM trained
on mostly news articles might struggle to understand sarcasm or humor.
Data Limitedness: The real world is vast and complex. LLMs can only process what they've been trained on.
Limited data can lead to incomplete understanding and difficulty handling unexpected situations.
Training Objectives: Ultimately, LLMs are optimized for the tasks they are trained on. An LLM trained for text
summarization may not excel at creative writing tasks, even if the data is vast.
NOTE:
In this example the word embeddings are fixed random vectors and never change. In real transformer models, by contrast, the embedding matrix is itself part of the model weights, so the embeddings are updated during training along with the rest of the parameters, improving the model's ability to represent and understand the data. (Embeddings stay constant only when pre-trained vectors are deliberately frozen.)
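As a toy illustration of that difference (not a real training loop, and not part of the repository's code), a single gradient-descent step on a squared-error loss would change an embedding row itself:

import numpy as np

np.random.seed(0)
embeddings = np.random.rand(9, 3)  # embedding matrix treated as a trainable parameter
target = np.random.rand(3)         # stand-in target vector, for illustration only
learning_rate = 0.1

before = embeddings[4].copy()
grad = 2 * (embeddings[4] - target)    # gradient of the squared error between row 4 and the target
embeddings[4] -= learning_rate * grad  # the embedding row itself is updated
print(before, embeddings[4])           # the row has moved toward the target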
13.1. Multi-Head Attention
Multi-head attention is a type of attention mechanism that allows the model to focus on different parts of the input
sequence simultaneously. It does this by splitting the input into multiple "heads" and applying the attention
mechanism to each head independently. This allows the model to capture various aspects of the input sequence,
such as different levels of abstraction or different types of relationships between words.
In the context of language models, multi-head attention can help the model understand complex sentences where
different words have different relationships with each other. For example, in the sentence "The cat sat on the mat,"
the word "cat" is related to "sat" (the action the cat is performing) and "mat" (the location of the action). Multi-head
attention allows the model to capture both of these relationships simultaneously.
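A minimal numpy sketch of the idea follows (toy sizes, random matrices standing in for learned projections; an illustration rather than production code): the input is projected to queries, keys, and values, split into two heads, attended to per head, and the head outputs are concatenated back together.

import numpy as np

np.random.seed(0)
seq_len, d_model, num_heads = 6, 8, 2
d_head = d_model // num_heads

x = np.random.rand(seq_len, d_model)  # toy input sequence

# Random projections stand in for the learned weight matrices
W_q, W_k, W_v = (np.random.rand(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

def split_heads(t):
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)

# Scaled dot-product attention, applied to each head independently
scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)         # (heads, seq, seq)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
heads_out = weights @ Vh                                       # (heads, seq, d_head)

# Concatenate the heads back into a single representation per position
output = heads_out.transpose(1, 0, 2).reshape(seq_len, d_model)
print(output.shape)  # (6, 8)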
14. Conclusion
The Challenges of LLM Intelligence
The "intelligence" of an LLM heavily depends on the quality and variety of its training data. Biases, limitations in the
data itself, and narrow training objectives can all hinder a model's ability to represent the real world's complexities.
Just as a student's highlighting doesn't guarantee comprehension, attention in LLMs doesn't guarantee true understanding.
14.1. FULL SOURCE CODE
import numpy as np

def softmax(x):
    """
    This softmax function is often used in machine learning and deep learning to convert
    a vector of real numbers into a probability distribution.
    Each output value is between 0 and 1 (inclusive), and the sum of all output values is 1.
    """
    # Subtract the max value in the input array from all elements for numerical stability.
    # This keeps the exponentials from overflowing, since every shifted value is at most 0.
    x = x - np.max(x)

    # Apply the exponential function to each element in the array.
    # This transforms each value in the array into a positive value.
    exp_x = np.exp(x)

    # Divide each element in the array by the sum of all elements in the array.
    # This normalizes the values so that they all add up to 1, which is a requirement for a probability distribution.
    softmax_x = exp_x / np.sum(exp_x)

    # Return the resulting array, which represents a probability distribution over the input array.
    return softmax_x

def create_word_representations(sentences):
    word_to_index = {}
    index_to_word = {}
    word_embeddings = []

    for sentence in sentences:
        for word in sentence.split():
            if word not in word_to_index:
                word_to_index[word] = len(word_to_index)
                index_to_word[len(index_to_word)] = word
                word_embeddings.append(np.random.rand(3))  # Random embeddings

    return np.array(word_embeddings), word_to_index, index_to_word

def calculate_self_attention(query, keys, values):
    scores = np.dot(query, keys.T) / np.sqrt(keys.shape[1])
    attention_weights = np.empty_like(scores)
    for i in range(len(scores)):
        if len(keys[i].shape) == 1:  # Check if 1D array
            attention_weights[i] = np.exp(scores[i])  # No need to sum for unique words
        else:
            attention_weights[i] = np.exp(scores[i]) / np.sum(np.exp(scores[i]), axis=1, keepdims=True)

    return attention_weights

def predict_next_word_with_self_attention(current_word, context_window, words, word_embeddings, word_to_index, index_to_word):
    context_embeddings = word_embeddings[[word_to_index[word] for word in context_window]]
    query = np.mean(context_embeddings, axis=0)  # Average context embeddings
    keys = values = np.array([word_embeddings[word_to_index[word]] for word in words])
    attention_weights = calculate_self_attention(query, keys, values)
    attention_probabilities = softmax(attention_weights)
    predicted_index = np.argmax(attention_probabilities)  # Select the word with the highest probability
    predicted_word = index_to_word[predicted_index]
    return predicted_word, attention_probabilities

if __name__ == "__main__":
    sentences = [
        "The quick brown fox jumps over the lazy dog",
    ]

    word_embeddings, word_to_index, index_to_word = create_word_representations(sentences)
    current_word = "jumps"
    context_window_size = 2  # Considering two words before the current word

    for sentence in sentences:
        words = sentence.split()
        current_word_index = words.index(current_word)
        context_window = words[max(0, current_word_index - context_window_size):current_word_index]
        predicted_word, attention_probabilities = predict_next_word_with_self_attention(
            current_word, context_window, words, word_embeddings, word_to_index, index_to_word
        )
        print(f"\nGiven the word: {current_word}")
        print(f"Context: {' '.join(context_window)}")  # Print context window
        print(f"Sentence: {sentence}")
        print("Attention Probabilities:")
        for word, prob in zip(words, attention_probabilities):
            print(f"\t{word}: {prob:.4f}")
        print(f"Predicted next word: {predicted_word}\n")
        print("""
        The word embeddings are initialized randomly in this code.
        This means that the relationships between different words are not captured in the embeddings,
        which could lead to seemingly random attention probabilities.
        """)
        print("""
        The input triggers the attention mechanism which is used to weight
        the importance of different words in the sentence for the prediction of the next word.
        """)
        print(f"Prediction process: The model uses the context of the given word '{current_word}' to predict the next word. The attention mechanism assigns different weights to the words in the context based on their relevance. The word with the highest weight is considered as the most relevant word for the prediction.")
        print(f"Attention Impact: The attention probabilities show the relevance of each word in the context for the prediction. The higher the probability, the more impact the word has on the prediction.\n")
14.2. Resources
https://eugeneyan.com/writing/attention/