Notes On Implementing Attention - Eli Bendersky
Some notes on implementing attention blocks in pure Python + Numpy. The focus here is on the exact
implementation in code, explaining all the shapes throughout the process. The motivation for why attention works is
not covered here - there are plenty of excellent online resources explaining it.
1. Basic scaled self-attention
The input is a 2D array of shape (N, D). N is the length of the sequence (how many tokens it contains) and D is the
embedding depth - the length of the embedding vector representing each token [1]. D could be something like 512,
or more, depending on the model.
A self-attention module is parameterized with three weight matrices, Wk, Wq and Wv. Some variants also have
accompanying bias vectors, but the AIAYN paper doesn't use them, so I'll skip them here. In the general case, the
shape of each weight matrix is (D, HS), where HS is some fraction of D. HS stands for "head size" and we'll see what
this means soon. This is a diagram of a self-attention module (the diagram assumes N=6, D is some large number
and so is HS). In the diagram, @ stands for matrix multiplication (Python/Numpy syntax):
Here's a basic Numpy implementation of this:
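The following is a minimal sketch of such an implementation (the function name and signature are illustrative), relying
on the softmax_lastdim helper defined below:

import numpy as np

def self_attention(x, Wk, Wq, Wv):
    # x is the input sequence, shape (N, D); each W is (D, HS).
    q = x @ Wq  # (N, HS)
    k = x @ Wk  # (N, HS)
    v = x @ Wv  # (N, HS)

    # Scaled dot products between every pair of tokens.
    kq = q @ k.T / np.sqrt(k.shape[1])  # (N, N)

    # Attention weights: each row sums up to 1.
    att = softmax_lastdim(kq)  # (N, N)
    return att @ v  # (N, HS)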
The "scaled" part is just dividing kq by the square root of HS , which is done to keep the values of the dot products
manageable (otherwise they would grow with the size of the contracted dimension).
The only dependency is a function for calculating Softmax across the last dimension of an input array:
def softmax_lastdim(x):
    """Compute softmax across last dimension of x.

    x is an arbitrary array with at least two dimensions. The returned array has
    the same shape as x, but its elements sum up to 1 across the last dimension.
    """
    # Subtract the max for numerical stability
    ex = np.exp(x - np.max(x, axis=-1, keepdims=True))
    # Divide by sums across last dimension
    return ex / np.sum(ex, axis=-1, keepdims=True)
When the input is 2D, the "last dimension" is the columns. Colloquially, this Softmax function acts on each row of x
separately; it applies the Softmax formula to the elements (columns) of the row, ending up with a row of numbers in
the range [0,1] that all sum up to 1.
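For example, on a small 2D array (the numbers are arbitrary):

x = np.array([[1.0, 2.0, 3.0],
              [1.0, 1.0, 1.0]])
print(softmax_lastdim(x))
# [[0.09003057 0.24472847 0.66524096]
#  [0.33333333 0.33333333 0.33333333]]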
Another note on the dimensions: it's possible for the Wv matrix to have a different second dimension from Wq and
Wk. If you look at the diagram, you can see this will work out, since the softmax produces (N, N), and whatever the
second dimension of V is will be the second dimension of the output. The AIAYN paper designates these dimensions
as d_k and d_v, but in practice d_k=d_v in all the variants it lists. I found that these dimensions are typically the same
in other papers as well. Therefore, for simplicity I just made them all equal to D in this post; if desired, a variant with
different d_k and d_v is a fairly trivial modification to this code.
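As a quick shape check of that claim, using the self_attention sketch from above with different (made-up) d_k and
d_v values:

N, D, DK, DV = 6, 512, 64, 32
x = np.random.randn(N, D)
Wq = np.random.randn(D, DK)
Wk = np.random.randn(D, DK)
Wv = np.random.randn(D, DV)

out = self_attention(x, Wk, Wq, Wv)
assert out.shape == (N, DV)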
2. Batched self-attention
In the real world, the input array is unlikely to be 2D because models are trained on batches of input sequences. To
leverage the parallelism of modern hardware, whole batches are typically processed in the same operation.
The batched version of scaled self-attention is very similar to the non-batched one, due to the magic of Numpy
matrix multiplication and broadcasts. Now the input shape is (B, N, D), where B is the batch dimension. The W*
matrices are still (D, HS); multiplying a (B, N, D) array by (D, HS) performs contraction between the last axis of the first
array and the first axis of the second array, resulting in (B, N, HS). Here's the code, with the dimensions annotated for
each operation:
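A sketch of what this looks like (again, the function name is illustrative):

# x has shape (B, N, D); each W is (D, HS).
def self_attention_batched(x, Wk, Wq, Wv):
    q = x @ Wq  # (B, N, HS)
    k = x @ Wk  # (B, N, HS)
    v = x @ Wv  # (B, N, HS)

    kq = q @ k.swapaxes(-2, -1) / np.sqrt(k.shape[-1])  # (B, N, N)

    att = softmax_lastdim(kq)  # (B, N, N)
    return att @ v  # (B, N, HS)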
Note that the only difference between this and the non-batched version is the line calculating kq:
- Since k is no longer 2D, the notion of "transpose" is ambiguous, so we explicitly ask to swap the last and the
penultimate axes, leaving the first axis (B) intact.
- When calculating the scaling factor we use k.shape[-1] to select the last dimension of k, instead of k.shape[1],
which only selects the last dimension for 2D arrays.
In fact, this function could also calculate the non-batched version! From now on, we'll assume that all inputs are
batched, and all operations are implicitly batched. I'm not going to be using the "batched" prefix or suffix on
functions any more.
The basic underlying idea of the attention module is to shift around the multi-dimensional representations of tokens
in the sequence towards a better representation of the entire sequence. The tokens attend to each other. Specifically,
the matrix produced by the Softmax operation is called the attention matrix. It's (N, N); for each token it specifies
how much information from every other token in the sequence should be taken into account. For example, a higher
number in cell (R, C) means a stronger relation between the token at index R in the sequence and the token at index
C.
Here's a nice example from the AIAYN paper, showing a word sequence and the weights produced by two attention
heads (purple and brown) for a given position in the input sequence:
This shows how the model is learning to resolve what the word "its" refers to in the sentence. Let's take just the
purple head as an example. The index of token "its" in the sequence is 8, and the index of "Law" is 1. In the attention
matrix for this head, the value at index (8, 1) will be very high (close to 1), with other values in the same row much
lower.
While this intuitive explanation isn't critical to understand how attention is implemented, it will become more
important when we talk about masked self-attention later on.
3. Multi-head attention
The attention mechanism we've seen so far has a single set of K, Q and V matrices. This is called one "head" of
attention. In today's models, there are typically multiple heads. Each head does its attention job separately, and in
the end all these results are concatenated and fed through a linear layer.
In what follows, NH is the number of heads and HS is the head size. Typically, NH times HS would be D; for example,
the AIAYN paper mentions several configurations for D=512: NH=8 and HS=64, NH=32 and HS=16, and so on [2].
However, the math works out even if this isn't the case, because the final linear ("projection") layer maps the output
back to (N, D).
Assuming the previous diagram showing a self-attention module is a single head with input (N, D) and output (N,
HS), this is how multiple heads are combined:
Each of the (NH) heads has its own parameter weights for Q, K and V. Each attention head outputs a (N, HS) matrix;
these are concatenated along the last dimension to (N, NH * HS), which is passed through a final linear projection.
Here's a function implementing (batched) multi-head attention; for now, please ignore the code inside do_mask
conditions:
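A sketch of such a function (its structure mirrors the cross-attention code shown later; the exact names are
illustrative):

# x has shape (B, N, D)
# Each W*s is a list of NH weight matrices of shape (D, HS).
# Wp is a weight matrix for the final linear projection, of shape (NH * HS, D)
# The result is (B, N, D)
def multihead_attention_list(x, Wqs, Wks, Wvs, Wp, do_mask=False):
    # Check shapes.
    NH = len(Wks)
    HS = Wks[0].shape[1]
    assert len(Wks) == len(Wqs) == len(Wvs)
    for W in Wqs + Wks + Wvs:
        assert W.shape[1] == HS
    assert Wp.shape[0] == NH * HS

    N = x.shape[1]
    if do_mask:
        # mask is a lower-triangular (N, N) matrix, with zeros above the
        # diagonal and ones on the diagonal and below.
        mask = np.tril(np.ones((N, N)))

    # List of head outputs
    head_outs = []

    for Wk, Wq, Wv in zip(Wks, Wqs, Wvs):
        q = x @ Wq  # (B, N, HS)
        k = x @ Wk  # (B, N, HS)
        v = x @ Wv  # (B, N, HS)

        kq = q @ k.swapaxes(-2, -1) / np.sqrt(k.shape[-1])  # (B, N, N)

        if do_mask:
            # Set the masked (upper-triangle) elements to -inf, so that the
            # softmax assigns them a weight of zero.
            kq = np.where(mask == 0, -np.inf, kq)

        att = softmax_lastdim(kq)  # (B, N, N)
        head_outs.append(att @ v)  # (B, N, HS)

    # Concatenate the head outputs and apply the final linear projection
    all_heads = np.concatenate(head_outs, axis=-1)  # (B, N, NH * HS)
    return all_heads @ Wp  # (B, N, D)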
It is possible to vectorize this code even further; you'll sometimes see the heads laid out in a separate (4th)
dimension instead of being a list. See the Vectorizing across the heads dimension section.
4. Masked self-attention
In the attention we've seen so far, every token attends to every other token in the sequence, including tokens that
come after it. However, for generative models this presents a problem: if during training a word attends to future
words, the model will just "cheat" and not really learn how to generate the next word from only past words.
Generation is done in a decoder block, and for this we need to add masking to attention.
When our attention code generates the att matrix, it's a square (N, N) matrix with attention weights from each
token to each other token in the sequence. What we want is for all the cells above the diagonal (shown in gray in the
diagram) to be zero, to ensure that a token doesn't attend to future tokens. The remaining (blue) cells, on and below
the diagonal, add up to 1 in each row after the softmax operation.
Now take a look at the previous code sample and see what happens when do_mask=True :
1. First, a (N, N) lower-triangular array is prepared with zeros above the diagonal and ones on the diagonal and
below.
2. Then, before we pass the scaled QK^T to softmax, we set its values to -∞ wherever the mask matrix is 0. This
ensures that the softmax function will assign zeros to outputs at these indices, while still producing the proper
values in the rest of the row.
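As a small standalone illustration of these two steps (the numbers and shapes here are arbitrary):

N = 4
scores = np.random.randn(N, N)  # stands in for the scaled QK^T
mask = np.tril(np.ones((N, N)))  # ones on and below the diagonal, zeros above

masked = np.where(mask == 0, -np.inf, scores)
att = softmax_lastdim(masked)
# Each row of att sums up to 1, and all cells above the diagonal are 0.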
Another name for masked self-attention is causal self-attention. This is a very good name that comes from causal
systems in control theory.
5. Cross-attention
So far we've been working with self-attention blocks, where the self suggests that elements in the input sequence
attend to other elements in the same input sequence.
Another variant of attention is cross-attention, where elements of one sequence attend to elements in another
sequence. This variant exists in the decoder block of the AIAYN paper. This is a single head of cross-attention:
Here we have two sequences with potentially different lengths: xq and xv. xq is used for the query part of
attention, while xv is used for the key and value parts. The rest of the dimensions remain as before. The output of
such a block is shaped (Nq, HS).
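In code, a single (non-batched) head of cross-attention could look like this sketch (the name is illustrative; the
multi-head version follows below):

def cross_attention_head(xq, xv, Wk, Wq, Wv):
    # xq is (Nq, D), xv is (Nv, D); each W is (D, HS).
    q = xq @ Wq  # (Nq, HS)
    k = xv @ Wk  # (Nv, HS)
    v = xv @ Wv  # (Nv, HS)

    kq = q @ k.T / np.sqrt(k.shape[-1])  # (Nq, Nv)

    att = softmax_lastdim(kq)  # (Nq, Nv)
    return att @ v  # (Nq, HS)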
This is an implementation of multi-head cross-attention; it doesn't include masking, since masking is not typically
necessary in cross attention - it's OK for elements of xq to attend to all elements of xv [3]:
# Cross attention between two input sequences that can have different lengths.
# xq has shape (B, Nq, D)
# xv has shape (B, Nv, D)
# In what follows:
# NH = number of heads
# HS = head size
# Each W*s is a list of NH weight matrices of shape (D, HS).
# Wp is a weight matrix for the final linear projection, of shape (NH * HS, D)
# The result is (B, Nq, D)
def multihead_cross_attention_list(xq, xv, Wqs, Wks, Wvs, Wp):
    # Check shapes.
    NH = len(Wks)
    HS = Wks[0].shape[1]
    assert len(Wks) == len(Wqs) == len(Wvs)
    for W in Wqs + Wks + Wvs:
        assert W.shape[1] == HS
    assert Wp.shape[0] == NH * HS

    # List of head outputs
    head_outs = []

    for Wk, Wq, Wv in zip(Wks, Wqs, Wvs):
        q = xq @ Wq  # (B, Nq, HS)
        k = xv @ Wk  # (B, Nv, HS)
        v = xv @ Wv  # (B, Nv, HS)

        kq = q @ k.swapaxes(-2, -1) / np.sqrt(k.shape[-1])  # (B, Nq, Nv)

        att = softmax_lastdim(kq)  # (B, Nq, Nv)
        head_outs.append(att @ v)  # (B, Nq, HS)

    # Concatenate the head outputs and apply the final linear projection
    all_heads = np.concatenate(head_outs, axis=-1)  # (B, Nq, NH * HS)
    return all_heads @ Wp  # (B, Nq, D)
6. Vectorizing across the heads dimension
The multi-head code shown so far keeps the heads in a Python list and loops over them; it's possible to vectorize
this work instead, laying the heads out in a separate dimension and computing them all with a few large matrix
multiplications. To understand the trick being used, consider a basic matmul of (8, 6) by (6, 2):
Now suppose we want to multiply our LHS by another (6, 2) matrix. We can do it all in the same operation by
concatenating the two RHS matrices along columns:
If the yellow RHS block in both diagrams is identical, the green block of the result will be as well. And the violet block
is just the matmul of the LHS by the red block of the RHS. This stems from the semantics of matrix multiplication, and
is easy to verify on paper.
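A quick numerical check of this property (shapes as in the diagrams, values arbitrary):

A = np.random.randn(8, 6)
B1 = np.random.randn(6, 2)
B2 = np.random.randn(6, 2)

# Multiplying by the concatenated RHS gives the same result as concatenating
# the two separate products.
combined = A @ np.concatenate([B1, B2], axis=1)  # (8, 4)
separate = np.concatenate([A @ B1, A @ B2], axis=1)  # (8, 4)
assert np.allclose(combined, separate)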
Now back to our multi-head attention. Note that we multiply the input x by a whole list of weight matrices - in fact,
by three lists (one list for Q, one for K, and another for V). We can use the same vectorization technique by
concatenating all these weight matrices into a single one. Assuming that NH * HS = D, the shape of the combined
matrix is (D, 3 * D). Here's the vectorized implementation:
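A sketch of what this can look like (the names are illustrative; masking is omitted here, and NH * HS = D is
assumed):

# x has shape (B, N, D)
# W has shape (D, 3 * D): the Q, K and V weight matrices concatenated along
# the last dimension, in this order.
# NH is the number of heads; D must be divisible by NH.
# Wp is a weight matrix for the final linear projection, of shape (D, D).
# The result is (B, N, D)
def multihead_attention_vec(x, W, NH, Wp):
    B, N, D = x.shape
    assert W.shape == (D, 3 * D)
    HS = D // NH

    # Compute Q, K and V in a single matmul, then split them apart.
    qkv = x @ W  # (B, N, 3 * D)
    q, k, v = np.split(qkv, 3, axis=-1)  # each (B, N, D)

    # Split D into (NH, HS) and move the heads next to the batch dimension.
    q = q.reshape(B, N, NH, HS).transpose(0, 2, 1, 3)  # (B, NH, N, HS)
    k = k.reshape(B, N, NH, HS).transpose(0, 2, 1, 3)  # (B, NH, N, HS)
    v = v.reshape(B, N, NH, HS).transpose(0, 2, 1, 3)  # (B, NH, N, HS)

    # Both B and NH act as batch dimensions here.
    kq = q @ k.swapaxes(-2, -1) / np.sqrt(k.shape[-1])  # (B, NH, N, N)

    att = softmax_lastdim(kq)  # (B, NH, N, N)
    out = att @ v  # (B, NH, N, HS)

    # Move the heads back, merge them into D, and apply the final projection.
    return out.transpose(0, 2, 1, 3).reshape(B, N, D) @ Wp  # (B, N, D)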
This code computes Q, K and V in a single matmul, and then splits them into separate arrays (note that on
accelerators these splits and later transposes may be very cheap or even free as they represent a different access
pattern into the same data).
Each of Q, K and V is initially (B, N, D), so they are reshaped into a more convenient shape by first splitting the D into
(NH, HS), and then changing the order of dimensions to get (B, NH, N, HS). In this format, both B and NH are
considered batch dimensions that are fully parallelizable. The QK^T computation can then proceed as before, and
Numpy will automatically perform the matmul over all the batch dimensions.
Sometimes you'll see an alternative notation used in papers for these matrix multiplications: numpy.einsum. For
example, in our last code sample the computation of kq could also be written as:

kq = np.einsum("bhqd,bhkd->bhqk", q, k) / np.sqrt(k.shape[-1])
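The two spellings compute the same thing; a quick check with random arrays in the (B, NH, N, HS) layout used
above:

B, NH, N, HS = 2, 4, 6, 8
q = np.random.randn(B, NH, N, HS)
k = np.random.randn(B, NH, N, HS)

assert np.allclose(np.einsum("bhqd,bhkd->bhqk", q, k),
                   q @ k.swapaxes(-2, -1))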
7. Code
The full code for these samples, with tests, is available in this repository.
[3] It's also not as easy to define mathematically: how do we make a non-square matrix triangular? And what does it
mean when the lengths of the two inputs are different?