
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, computed in one batched linear
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # output projection back to the embedding dimension
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.c_proj.GPT_SCALE_UNIT = 1  # flag consulted during weight initialization
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # causal mask: lower-triangular matrix of ones, shaped (1, 1, T, T)
        self.register_buffer("bias",
            torch.tril(torch.ones(config.block_size, config.block_size))
            .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()  # batch size, sequence length, embedding dimension
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)
        # reshape to (B, n_head, T, head_dim) so attention runs per head
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # scaled dot-product attention with the causal mask applied before softmax
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v
        # re-assemble the head outputs side by side, then project
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        y = self.c_proj(y)
        return y
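
As a quick sanity check (a sketch, not part of the original file), the module maps a `(B, T, C)` tensor to a tensor of the same shape. The `SimpleNamespace` config below merely stands in for whatever `GPTConfig` dataclass the full document defines, and the imports are repeated here even though the document presumably has them earlier:

```python
import math
from types import SimpleNamespace

import torch
import torch.nn as nn
import torch.nn.functional as F

config = SimpleNamespace(n_embd=768, n_head=12, block_size=1024)  # assumed values
attn = CausalSelfAttention(config)
x = torch.randn(2, 16, config.n_embd)   # (B=2, T=16, C=768)
y = attn(x)
print(y.shape)                          # torch.Size([2, 16, 768])
```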

MLP

The MLP class is a simple position-wise feed-forward network with a GELU (Gaussian Error Linear Unit)
activation. Unlike ReLU, which can suffer from "dying neurons" because its gradient is exactly zero for
negative inputs, GELU is smooth everywhere, which improves gradient propagation; the same activation is
used in BERT and the original GPT-2, and the tanh approximation is chosen here to match the original
GPT-2 implementation. The class consists of two linear layers: the first (c_fc) expands the embedding
dimension to four times its size, while the second (c_proj) projects it back to the original embedding
dimension, the standard expand-and-contract transformer feed-forward design. The forward method applies
the first linear layer, the GELU activation, and the second linear layer in sequence to the input tensor x.

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        # expand to 4x the embedding dimension, apply GELU, then project back down
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU(approximate="tanh")
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)
        self.c_proj.GPT_SCALE_UNIT = 1  # flag consulted during weight initialization

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x
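
As a small illustration (not from the document), the tanh-approximate GELU tracks the exact GELU very closely; the approximation is kept mainly for parity with the original GPT-2 implementation rather than for accuracy reasons:

```python
import torch
import torch.nn as nn

x = torch.linspace(-4, 4, steps=9)
exact = nn.GELU()(x)                       # exact (erf-based) GELU
approx = nn.GELU(approximate="tanh")(x)    # tanh approximation, as used in MLP above
print((exact - approx).abs().max())        # tiny gap, on the order of 1e-3 or smaller
```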

Block

The Block class is a fundamental component of the GPT-2 architecture, combining self-attention and MLP
(multi-layer perceptron) modules with layer normalization and residual connections. Unlike the original
transformer, where layer normalization is applied after the self-attention or MLP sub-layer, GPT-2 places
layer normalization at the input of each sub-block, keeping the residual path clean. This design choice lets
gradients flow smoothly from the top layers down to the input/token layer, which improves training. The class
initializes two layer normalization layers (ln_1 and ln_2), a CausalSelfAttention module in which tokens
exchange information, and an MLP that transforms each token independently. In the forward method, the input
tensor x is layer-normalized and passed through self-attention, and the result is added back to x as a
residual connection; the same pattern is then repeated with the MLP. Together, these elements let the block
aggregate information across tokens and refine each token's representation, forming a robust building block
for the overall model.

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        # pre-norm residual connections: normalize, transform, then add back to x
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
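
Because each Block maps a `(B, T, n_embd)` tensor back to the same shape, blocks can be stacked freely along the residual stream. A brief sketch (config values are assumptions, not from the document):

```python
from types import SimpleNamespace
import torch

config = SimpleNamespace(n_embd=768, n_head=12, block_size=1024)  # assumed values
blocks = torch.nn.Sequential(*[Block(config) for _ in range(4)])  # stack of 4 blocks
x = torch.randn(2, 32, config.n_embd)
print(blocks(x).shape)   # torch.Size([2, 32, 768]) -- shape preserved through the stack
```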

GPT

The GPT class is the top-level structure of the GPT-2 model, combining the components above into a
complete transformer network for text generation. It initializes the essential elements: token and position
embeddings (wte and wpe), a stack of transformer blocks (h), a final layer normalization (ln_f), and the
output linear layer (lm_head). These modules are organized in a ModuleDict for convenient access and
management. The embeddings map input indices to dense vectors, and the transformer blocks apply self-
attention and MLP operations to these embeddings, progressively refining the representations.
To keep training consistent and efficient, the class shares weights between the token embeddings and the
output layer, and initializes weights following the original GPT-2 methodology. During the forward pass, the
model adds positional information to the token embeddings and passes the result through the transformer
blocks; the final layer normalization and the linear head then map the refined representations to logits over
the vocabulary.
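
The constructor itself is not reproduced on this page; the sketch below reconstructs its likely shape from the description above. The module names (wte, wpe, h, ln_f, lm_head) come from the text, while the GPTConfig fields (vocab_size, block_size, n_layer, n_embd) are assumptions:

```python
import torch.nn as nn

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            wte=nn.Embedding(config.vocab_size, config.n_embd),   # token embeddings
            wpe=nn.Embedding(config.block_size, config.n_embd),   # position embeddings
            h=nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f=nn.LayerNorm(config.n_embd),                     # final layer norm
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        # weight sharing between the token embedding and the output projection
        self.transformer.wte.weight = self.lm_head.weight
```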