
Chapter 3. Recurrent Neural Network


In this chapter, we discuss a key neural network architecture that changed how machines handle
sequences: the recurrent neural network. We’ll cover its structure and use in language modeling.
While recurrent neural networks are largely superseded by transformers and other attention-
based architectures in many modern applications, they remain foundational to understanding how
machine learning approaches sequential data.

3.1. Elman RNN


A recurrent neural network, or RNN, is a neural network designed for sequential data. Unlike
feedforward neural networks, RNNs include loops in their connections, enabling information to
carry over from one step in the sequence to the next. This makes them well-suited for tasks like
time series analysis, natural language processing, and other sequential data problems.

To illustrate the sequential nature of RNNs, let’s consider a neural network with a single unit and the input document “Learning from text is cool.” Ignoring case and punctuation, the matrix representing this document would be as follows:

Word        Embedding vector

learning    [0.1, 0.2, 0.6]⊤
from        [0.2, 0.1, 0.4]⊤
text        [0.1, 0.3, 0.3]⊤
is          [0.0, 0.7, 0.1]⊤
cool        [0.5, 0.2, 0.7]⊤
PAD         [0.0, 0.0, 0.0]⊤

Each row of the matrix represents a word’s embedding learned during neural network training.
The order of words is preserved. The matrix dimensions are (sequence length, embedding
dimensionality). Sequence length specifies the maximum number of words in a document. Shorter
documents are padded with padding tokens, while longer ones are truncated. Padding uses
dummy embeddings, usually zero vectors.

More formally, the matrix would look like this:


$$
\mathbf{X} =
\begin{bmatrix}
0.1 & 0.2 & 0.6 \\
0.2 & 0.1 & 0.4 \\
0.1 & 0.3 & 0.3 \\
0.0 & 0.7 & 0.1 \\
0.5 & 0.2 & 0.7 \\
0.0 & 0.0 & 0.0
\end{bmatrix}
$$
Here, we have five 3D embedding vectors, 𝐱1 , … , 𝐱5 , representing each word in the document. For
instance, 𝐱1 = [0.1,0.2,0.6]⊤, 𝐱2 = [0.2,0.1,0.4]⊤ , and so on. The sixth vector is a padding vector. The
single-unit RNN used to process this sequence is structured as follows:


The same unit receives as input a sequence of embedding vectors, one at a time, and outputs, at
each time step 𝑡, the hidden state 𝐡𝑡 . This unit is known as the Elman RNN, named after Jeffrey
Locke Elman, who introduced the simple recurrent neural network in 1990.

Unlike a unit in a multilayer perceptron, which takes a vector and returns a scalar as shown in
Figure 1.1, an RNN unit returns a vector, functioning as an entire layer.

As shown, an RNN receives two inputs at each time step 𝑡: an input embedding vector 𝐱𝑡 (typically
a word embedding) and a hidden state vector 𝐡𝑡−1 from the previous time step. It outputs an
updated hidden state 𝐡𝑡 . This characteristic—using its own output from the previous time step as
an input—gives the network its “recurrent” nature. The initial hidden state 𝐡0 is usually initialized
with a zero vector.

A hidden state is a vector that stores information from all previous time steps in a sequence. It acts
as the network’s “memory,” enabling past information to influence future predictions. At each time
step, the hidden state is updated using the current input and the previous hidden state. This is
critical for sequential tasks like language modeling, where context from earlier words helps predict
the next word.
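Concretely, the Elman hidden-state update at time step 𝑡 can be written as follows (this mirrors the computation in the code later in this chapter, where the matrix 𝐖 multiplies the current input and 𝐔 multiplies the previous hidden state):

$$
\mathbf{h}_t = \tanh\left(\mathbf{W}\mathbf{x}_t + \mathbf{U}\mathbf{h}_{t-1} + \mathbf{b}\right)
$$

Here 𝐛 is a bias vector, and the tanh activation keeps each component of the hidden state in the range (−1, 1).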

To deepen the network, we add a second RNN layer. The first layer’s outputs, 𝐡𝑡 , become inputs to
the second, whose outputs are the network’s final outputs:


Figure 3.1: A two-layer Elman RNN. The first layer’s outputs serve as inputs to the second layer.

3.2. Mini-Batch Gradient Descent


Before coding the RNN model, we need to discuss the shape of the input data. In Section 1.7, we
used the entire dataset for each gradient descent step. Here, and for training all future models, we’ll
adopt mini-batch gradient descent, a widely used method for large models and datasets. Mini-
batch gradient descent calculates derivatives over smaller data subsets, which speeds up learning
and reduces memory usage.

With mini-batch gradient descent, the data shape is organized as (batch size, sequence length,
embedding dimensionality). This structure divides the training set into fixed-size mini-batches,
each containing sequences of embeddings with consistent lengths. (From this point on, “batch” and
“mini-batch” will be used interchangeably.)

For example, if the batch size is 2, the sequence length is 4, and the embedding dimensionality is 3,
the mini-batch can be represented as:
$$
\text{batch}_1 =
\begin{bmatrix}
\text{seq}_{1,1} & \text{seq}_{1,2} & \text{seq}_{1,3} & \text{seq}_{1,4} \\
\text{seq}_{2,1} & \text{seq}_{2,2} & \text{seq}_{2,3} & \text{seq}_{2,4}
\end{bmatrix}
$$

Here, seq𝑖,𝑗 , for 𝑖 ∈ {1,2} and 𝑗 ∈ {1, … ,4} is an embedding vector.

For example, if we have the following embeddings for each sequence:

$$
\text{seq}_1 =
\begin{bmatrix}
0.1 & 0.2 & 0.3 \\
0.4 & 0.5 & 0.6 \\
0.7 & 0.8 & 0.9 \\
1.0 & 1.1 & 1.2
\end{bmatrix},
\qquad
\text{seq}_2 =
\begin{bmatrix}
1.3 & 1.4 & 1.5 \\
1.6 & 1.7 & 1.8 \\
1.9 & 2.0 & 2.1 \\
2.2 & 2.3 & 2.4
\end{bmatrix}
$$

The mini-batch will look like this:

$$
\text{batch}_1 =
\begin{bmatrix}
[0.1, 0.2, 0.3] & [0.4, 0.5, 0.6] & [0.7, 0.8, 0.9] & [1.0, 1.1, 1.2] \\
[1.3, 1.4, 1.5] & [1.6, 1.7, 1.8] & [1.9, 2.0, 2.1] & [2.2, 2.3, 2.4]
\end{bmatrix}
$$

During each step of gradient descent, we:

1. Select a mini-batch from the training set,


2. Pass it through the neural network,
3. Compute the loss,
4. Calculate gradients,
5. Update model parameters,
6. Repeat from step 1.

Mini-batch gradient descent often achieves faster convergence compared to using the entire
training set per step. It efficiently handles large models and datasets by using modern hardware’s
parallel processing capabilities. In PyTorch, models require the first dimension of the input data to
be the batch dimension, even if there’s only one example in the batch.
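To make the (batch size, sequence length, embedding dimensionality) convention concrete, here is a minimal sketch that builds the mini-batch above as a PyTorch tensor; the values are the ones from the example, and the variable name batch_1 is ours:

import torch

# Two sequences of four 3-dimensional embeddings each:
# shape is (batch_size=2, seq_len=4, emb_dim=3)
batch_1 = torch.tensor([
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]],
    [[1.3, 1.4, 1.5], [1.6, 1.7, 1.8], [1.9, 2.0, 2.1], [2.2, 2.3, 2.4]],
])
print(batch_1.shape)  # torch.Size([2, 4, 3])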

3.3. Programming an RNN


Let’s implement an Elman RNN unit:

import torch
import torch.nn as nn

class ElmanRNNUnit(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.Uh = nn.Parameter(torch.randn(emb_dim, emb_dim)) ➊
        self.Wh = nn.Parameter(torch.randn(emb_dim, emb_dim)) ➋
        self.b = nn.Parameter(torch.zeros(emb_dim)) ➌

    def forward(self, x, h):
        return torch.tanh(x @ self.Wh + h @ self.Uh + self.b) ➍

In the constructor:

• Lines ➊ and ➋ initialize self.Uh and self.Wh, the weight matrices for the hidden state
and input vector, with random values.
• Line ➌ sets self.b, the bias vector, to zero.


In the forward method, line ➍ handles the computation for each time step. It processes the current
input x and the previous hidden state h, both shaped (batch_size, emb_dim), combines them with
the weight matrices and bias, and applies the tanh activation. The output is the new hidden state,
also of shape (batch_size, emb_dim).

The @ symbol is the matrix multiplication operator in PyTorch. We use x @ self.Wh rather than self.Wh
@ x because of the way PyTorch handles batch dimensions in matrix multiplication. When working
with batched inputs, x has a shape of (batch_size, emb_dim), while self.Wh has a shape of
(emb_dim, emb_dim). Remember from Section 1.6 that for two matrices to be multipliable, the
number of columns in the left matrix must be the same as the number of rows in the right matrix.
This is satisfied in x @ self.Wh.
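As a quick sanity check of the shapes, here is a minimal usage sketch of the ElmanRNNUnit defined above, continuing from the imports in the listing; the sizes are arbitrary:

# A single Elman unit processing one batch of inputs for one time step
unit = ElmanRNNUnit(emb_dim=3)
x = torch.randn(2, 3)   # (batch_size, emb_dim) input at time step t
h = torch.zeros(2, 3)   # (batch_size, emb_dim) previous hidden state
h_new = unit(x, h)
print(h_new.shape)      # torch.Size([2, 3])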

Now, let’s define the class ElmanRNN, which implements a two-layer Elman RNN using
ElmanRNNUnit as its core building block:

class ElmanRNN(nn.Module):
    def __init__(self, emb_dim, num_layers):
        super().__init__()
        self.emb_dim = emb_dim
        self.num_layers = num_layers
        self.rnn_units = nn.ModuleList(
            [ElmanRNNUnit(emb_dim) for _ in range(num_layers)]
        ) ➊

    def forward(self, x):
        batch_size, seq_len, emb_dim = x.shape ➋
        h_prev = [
            torch.zeros(batch_size, emb_dim, device=x.device) ➌
            for _ in range(self.num_layers)
        ]
        outputs = []
        for t in range(seq_len): ➍
            input_t = x[:, t]
            for l, rnn_unit in enumerate(self.rnn_units):
                h_new = rnn_unit(input_t, h_prev[l])
                h_prev[l] = h_new    # Update hidden state
                input_t = h_new      # Input for next layer
            outputs.append(input_t)  # Collect outputs
        return torch.stack(outputs, dim=1) ➎

In line ➊ of the constructor, we initialize the RNN layers by creating a ModuleList containing
ElmanRNNUnit instances—one per layer. Using ModuleList instead of a regular Python list
ensures the parent module (ElmanRNN) properly registers all RNN unit parameters. This
guarantees that calling .parameters() or .to(device) on the parent module includes
parameters from all modules in the ModuleList.
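As a small illustration of this registration behavior, each ElmanRNNUnit contributes the three parameter tensors Uh, Wh, and b, so a two-layer ElmanRNN should expose six parameter tensors in total:

# Each ElmanRNNUnit registers 3 parameter tensors (Uh, Wh, b),
# so a two-layer ElmanRNN exposes 6 parameter tensors.
rnn = ElmanRNN(emb_dim=3, num_layers=2)
print(len(list(rnn.parameters())))  # 6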

In the forward method:


• Line ➋ extracts batch_size, seq_len, and emb_dim from the input tensor x.
• Line ➌ initializes the hidden states h_prev for all layers with zero tensors. Each hidden
state in the list has the shape (batch_size, emb_dim).

We store hidden states for each layer in a list instead of a multidimensional tensor
because we need to modify them during processing. In-place modifications of tensors
can disrupt PyTorch’s automatic differentiation system, which might result in incorrect
gradient calculations.

• Line ➍ iterates over time steps t in the input sequence. For each t:
o Extract the input at time t: input_t = x[:, t].
o For each layer l:
▪ Compute the new hidden state h_new from input_t and h_prev[l].
▪ Update the hidden state: h_prev[l] = h_new (updates in place).
▪ Set input_t = h_new to pass to the next layer.
o Append the output of the last layer: outputs.append(input_t).
• Once all time steps are processed, line ➎ converts the outputs list into a tensor by
stacking it along the time dimension. The resulting tensor has the shape (batch_size,
seq_len, emb_dim).
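Putting the pieces together, a minimal forward-pass sketch of the two-layer ElmanRNN, with made-up sizes, looks like this:

# A random mini-batch: (batch_size=2, seq_len=4, emb_dim=3)
x = torch.randn(2, 4, 3)
rnn = ElmanRNN(emb_dim=3, num_layers=2)
out = rnn(x)
print(out.shape)  # torch.Size([2, 4, 3]): one hidden state per position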

3.4. RNN as a Language Model


An RNN-based language model uses ElmanRNN as its building block:

class RecurrentLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, num_layers, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(
            vocab_size,
            emb_dim,
            padding_idx=pad_idx
        ) ➊
        self.rnn = ElmanRNN(emb_dim, num_layers)
        self.fc = nn.Linear(emb_dim, vocab_size)

    def forward(self, x):
        embeddings = self.embedding(x)
        rnn_output = self.rnn(embeddings)
        logits = self.fc(rnn_output)
        return logits

The RecurrentLanguageModel class integrates three components: an embedding layer, the ElmanRNN defined earlier, and a final linear layer.


In the constructor, line ➊ defines the embedding layer. This layer transforms input token indices
into dense vectors. The padding_idx parameter ensures that padding tokens are represented by
zero vectors. (We’ll cover the embedding layer in the next section.)

Next, we initialize the custom ElmanRNN, specifying the embedding dimensionality and the number
of layers. Finally, we add a fully connected layer, which converts the RNN’s output into vocabulary-
sized logits for each token in the sequence.

In the forward method:

• We pass the input x through the embedding layer. Input x has shape (batch_size,
seq_len), and the output embeddings have shape (batch_size, seq_len, emb_dim).
• We then pass the embedded input through our ElmanRNN, obtaining rnn_output with
shape (batch_size, seq_len, emb_dim).
• Finally, we apply the fully connected layer to the RNN output, producing logits for each
token in the vocabulary at each position in the sequence. The output logits have shape
(batch_size, seq_len, vocab_size).
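These shapes can be verified with a short sketch; the token indices below are arbitrary, and the vocab_size and pad_idx values are made up for illustration:

# Hypothetical vocabulary of 10 tokens, padding index 0
model = RecurrentLanguageModel(vocab_size=10, emb_dim=3, num_layers=2, pad_idx=0)
x = torch.tensor([[1, 4, 7, 0],   # (batch_size=2, seq_len=4) of token indices
                  [2, 5, 8, 9]])
logits = model(x)
print(logits.shape)  # torch.Size([2, 4, 10])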

3.5. Embedding Layer


An embedding layer, implemented as nn.Embedding in PyTorch, maps token indices from a
vocabulary to dense, fixed-size vectors. It acts as a learnable lookup table, where each token is
assigned a unique embedding vector. During training, these vectors are adjusted to capture
meaningful numerical representations of the tokens.

Here’s an example to show how an embedding layer works. Imagine a vocabulary with five tokens,
indexed from 0 to 4. We want each token to have a three-dimensional embedding vector. To begin,
we create an embedding layer:

import torch
import torch.nn as nn

vocab_size = 5  # Number of unique tokens
emb_dim = 3     # Size of each embedding vector
emb_layer = nn.Embedding(vocab_size, emb_dim)

The embedding layer initializes the embedding matrix 𝐄 with random values. In this case, the
matrix has 5 rows (one for each token) and 3 columns (the embedding dimensionality):

$$
\mathbf{E} =
\begin{bmatrix}
0.2 & -0.4 & 0.1 \\
-0.3 & 0.8 & -0.5 \\
0.7 & 0.1 & -0.2 \\
-0.6 & 0.5 & 0.4 \\
0.9 & -0.7 & 0.3
\end{bmatrix}
$$
Each row in 𝐄 represents the embedding for a specific token in the vocabulary.

Now, let’s input a sequence of token indices:


token_indices = torch.tensor([0, 2, 4])

The embedding layer retrieves the rows of 𝐄 corresponding to the input indices:
$$
\text{Embeddings} =
\begin{bmatrix}
0.2 & -0.4 & 0.1 \\
0.7 & 0.1 & -0.2 \\
0.9 & -0.7 & 0.3
\end{bmatrix}
$$
This output is a matrix whose number of rows equals the input sequence length and whose number
of columns equals the embedding dimensionality:

embeddings = emb_layer(token_indices)
print(embeddings)

The output might look like this:

tensor([[ 0.2, -0.4,  0.1],
        [ 0.7,  0.1, -0.2],
        [ 0.9, -0.7,  0.3]])

The embedding layer can manage padding tokens as well. Padding ensures sequences in a mini-
batch have the same length. To prevent the model from updating embeddings for padding tokens
during training, the layer maps them to a zero vector that remains unchanged. For example, if we
define the padding index:

emb_layer = nn.Embedding(vocab_size, emb_dim, padding_idx=0)

The embedding for token 0 (padding token) is always [0,0,0]⊤ .

Given the input:

token_indices = torch.tensor([0, 2, 4])
embeddings = emb_layer(token_indices)
print(embeddings)

The result would be:

tensor([[ 0.0,  0.0,  0.0],   # Padding token
        [ 0.7,  0.1, -0.2],   # Token 2 embedding
        [ 0.9, -0.7,  0.3]])  # Token 4 embedding

With modern language models, vocabularies often include hundreds of thousands of tokens, and embedding dimensions typically run into the thousands. This makes the embedding matrix a significant part of the model, sometimes containing nearly 2 billion parameters.
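For a rough, hypothetical sense of scale: a vocabulary of 250,000 tokens with 8,192-dimensional embeddings corresponds to an embedding matrix of 250,000 × 8,192 ≈ 2 billion parameters.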

3.6. Training a Language Model


Start by importing libraries and defining utility functions:


import random  # needed by set_seed below

import torch, torch.nn as nn

def set_seed(seed):
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed) ➊
    torch.backends.cudnn.deterministic = True ➋
    torch.backends.cudnn.benchmark = False ➌

The set_seed function enforces reproducibility by setting random seeds. It sets the Python
random seed, the PyTorch CPU seed, and, in line ➊, the CUDA seed for all GPUs (Graphics Processing
Units). CUDA is NVIDIA’s parallel computing platform and API that enables significant performance
improvements in computing by leveraging the power of GPUs. Using
torch.cuda.manual_seed_all ensures consistent GPU-based random behavior, while lines ➋ and ➌ enforce deterministic cuDNN algorithms and disable cuDNN’s auto-tuner, making results reproducible across runs on the same hardware.

With the model class ready, we’ll train our neural language model. First, we install the
transformers package—an open-source library providing APIs and tools to easily download,
train and use pretrained models from the Hugging Face Hub:

$ pip3 install transformers

The package offers a Python API for training that works with both PyTorch and TensorFlow. For
now, we only need it to get a tokenizer.

Now we import transformers, set the tokenizer, define the hyperparameter values, prepare the
data, and instantiate the model, loss function, and optimizer objects:

from transformers import AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") ➊

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct"
) ➋
vocab_size = len(tokenizer) ➌

emb_dim, num_layers, batch_size, learning_rate, num_epochs = get_hyperparameters()

data_url = "https://www.thelmbook.com/data/news"
train_loader, test_loader = download_and_prepare_data(
    data_url, batch_size, tokenizer) ➍

model = RecurrentLanguageModel(
    vocab_size, emb_dim, num_layers, tokenizer.pad_token_id
)


initialize_weights(model) ➎
model.to(device)

criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id) ➏
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

Line ➊ detects a CUDA device if it’s available. Otherwise, it defaults to CPU.

Most models on the Hugging Face Hub include the tokenizer that was used to train them. Line ➋
initializes the Phi 3.5 mini tokenizer. It was trained on a large text corpus using the byte-pair
encoding algorithm and has a vocabulary size of 32,064.

Line ➌ retrieves the tokenizer’s vocabulary size. Line ➍ downloads and prepares the dataset—a
collection of news sentences from online articles—tokenizing them and creating DataLoader
objects that iterate over batches.

Line ➎ initializes the model parameters. Initial parameter values can greatly influence the training
process. They can affect how quickly training progresses and the final loss value. Certain
initialization techniques, like Xavier initialization, have shown good results in practice. The
initialize_weights function, implementing this method, is defined in the notebook.

Line ➏ creates the loss function with the ignore_index parameter. This ensures the loss is not
calculated for padding tokens.

Now, let’s look at the training loop:

for epoch in range(num_epochs): ➊
    model.train() ➋
    for batch in train_loader: ➌
        input_seq, target_seq = batch
        input_seq = input_seq.to(device) ➍
        target_seq = target_seq.to(device) ➎
        batch_size_current, seq_len = input_seq.shape ➏
        optimizer.zero_grad()
        output = model(input_seq)
        output = output.reshape(batch_size_current * seq_len, vocab_size) ➐
        target = target_seq.reshape(batch_size_current * seq_len) ➑
        loss = criterion(output, target) ➒
        loss.backward()
        optimizer.step()

Line ➊ iterates over epochs. An epoch is a single pass through the entire dataset. Training for
multiple epochs can improve the model, especially with limited training data. The number of epochs
is a hyperparameter that you adjust based on the model’s performance on the test set.

Line ➋ calls model.train() at the start of each epoch to set the model in training mode. This is
important for models that have layers behaving differently during training vs. evaluation.

Although our RNN model doesn’t use such layers, calling model.train() ensures the
model is properly configured for training. This avoids unexpected behavior and keeps
consistency, especially if future changes add layers dependent on the mode.

Line ➌ iterates over batches. Each batch is a tuple: one tensor contains input sequences, and the
other contains target sequences. Lines ➍ and ➎ move these tensors to the same device as the model.
If the model and data are on different devices, PyTorch raises an error.

Line ➏ retrieves the batch size and sequence length from input_seq (target_seq has the same
shape). These dimensions are needed to reshape the model’s output tensor
(batch_size_current, seq_len, vocab_size) and target tensor (batch_size_current,
seq_len) into compatible shapes for the cross-entropy loss function. In line ➐, the output is
reshaped to (batch_size_current * seq_len, vocab_size), and in line ➑, the target is
flattened to batch_size_current * seq_len, allowing the loss calculation in line ➒ to process
all tokens in the batch simultaneously and return the average loss per token.
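As a minimal illustration of this reshaping (with fake model output and a hypothetical vocabulary of 10 tokens), the following computes a valid cross-entropy loss over every position in the batch:

# Fake model output and targets: batch of 2 sequences, 4 tokens each
output = torch.randn(2, 4, 10)          # (batch_size, seq_len, vocab_size)
target = torch.randint(0, 10, (2, 4))   # (batch_size, seq_len)
loss = nn.CrossEntropyLoss()(
    output.reshape(2 * 4, 10),          # (batch_size * seq_len, vocab_size)
    target.reshape(2 * 4),              # (batch_size * seq_len,)
)
print(loss.item())  # average loss per token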

3.7. Training Data and Loss Computation


When studying neural language models, a key aspect is understanding the structure of a training
example. The text corpus is split into overlapping input and target sequences. Each input sequence
aligns with a target sequence shifted by one token. This setup trains the model to predict the next
word at each position in the sequence.

For instance, take the sentence “We train a recurrent neural network as a language model.” After
tokenizing it with the Phi 3.5 mini tokenizer, we get:

["_We", "_train", "_a", "_rec", "urrent", "_neural", "_network", "_as", "_a


", "_language", "_model", "."]

To create one training example, we convert the sentence into input and target sequences by shifting
tokens forward by one position:

Input: ["_We", "_train", "_a", "_rec", "urrent", "_neural", "_network", "_a


s", "_a", "_language", "_model"]
Target: ["_train", "_a", "_rec", "urrent", "_neural", "_network", "_as", "_
a", "_language", "_model", "."]

A training example doesn’t need to be a complete sentence. Modern language models process
sequences up to their context window length—a fixed maximum number of tokens they can
handle at once (like 2048, 4096, or 8192 tokens). This context window determines how much text
the model can “see” and reason about at any time, which affects its ability to understand
relationships between distant parts of text. The training corpus is therefore segmented into chunks
matching this context window length, with the target sequence for each chunk shifted forward one
position relative to the input.
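A minimal sketch of this chunking is shown below; the function name make_chunks is ours and is not from the book’s notebook:

def make_chunks(token_ids, context_len):
    # Split a long list of token ids into (input, target) pairs,
    # where each target is the input shifted forward by one token.
    examples = []
    for start in range(0, len(token_ids) - context_len, context_len):
        chunk = token_ids[start:start + context_len + 1]
        examples.append((chunk[:-1], chunk[1:]))
    return examples

# Example: chunks of 4 tokens from a toy sequence of token ids
print(make_chunks(list(range(10)), context_len=4))
# [([0, 1, 2, 3], [1, 2, 3, 4]), ([4, 5, 6, 7], [5, 6, 7, 8])]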


During training, the RNN processes one token at a time, updating its hidden states layer by layer.
At each step, it generates logits aimed at predicting the next token in the sequence. Each logit
corresponds to a vocabulary token and is converted into probabilities using softmax. These
probabilities are then used to compute the loss.

Each sentence results in multiple predictions and losses. For example, the model first processes
“_We” and tries to predict “_train” by assigning probabilities to all vocabulary tokens. The loss is
computed using the probability of “_train,” as defined in Equation 2.1. Next, the model processes
“_train” to predict “_a,” generating another loss. This continues for every token in the sequence. In
total, the model makes 11 predictions and calculates 11 losses for this example.

The losses are averaged across the tokens in a training example and all examples in the batch. The
average loss is then used in backpropagation to update the model’s parameters.

Predicting the next token at each position gives the model many “signals” to learn from, speeding
up learning compared to predicting just one hidden token for the whole sequence, as is the case
with masked language models.

Let’s break down the loss calculation for each position with some made-up numbers:

• Position 1:
o Target token: “_train”
o Logit for “_train”: −0.5
o After applying softmax to the logits, suppose the probability of “_train” is 0.1
o Contribution to the total loss by Equation 2.1 is −log(0.1) = 2.30
• Position 2:
o Target token: “_a”
o Logit for “_a”: 3.2
o After softmax, the probability for “_a”: 0.05
o Contribution to loss: −log(0.05) = 2.99
• Position 3:
o The probability for “_rec”: 0.02
o Contribution to loss: −log(0.02) = 3.91
• Position 4:
o The probability for “urrent”: 0.34
o Contribution to loss: −log(0.34) = 1.08

We continue until calculating the loss contribution for the final token, the period:

• Position 11:
o Target token: “.”
o Logit for “.”: −1.2
o After softmax, the probability for “.”: 0.11
o Contribution to loss: −log(0.11) = 2.21

The final loss is calculated by taking the average of these values:


$$
\frac{2.30 + 2.99 + 3.91 + 1.08 + \cdots + 2.21}{11} = 2.11 \ \text{(hypothetically)}
$$
During training, the objective is to minimize this loss. This involves improving the model so that it
assigns higher probabilities to the correct target tokens at each position.
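Each per-position contribution above is simply the negative logarithm of the target token’s probability, which can be checked with a few lines; the probabilities are the made-up values from the example:

import math

probs = [0.1, 0.02, 0.34, 0.11]       # made-up target-token probabilities
losses = [-math.log(p) for p in probs]
print([round(l, 2) for l in losses])  # [2.3, 3.91, 1.08, 2.21]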

The full code for training the RNN-based language model can be found in thelmbook.com/nb/3.1. I
used the following hyperparameter values: emb_dim = 128, num_layers = 2, batch_size =
128, learning_rate = 0.001, and num_epochs = 1.

Here are three continuations for the prompt “The President” generated at later training steps:

The President refused to comment on the best news in the five on BBC .
The President has been a `` very serious '' and `` unacceptable '' .
The President 's office is not the first time to be able to take the lead .

At the start of training, the model generated almost random token sequences. Over time, its outputs
improved: it now correctly closes quotes and parentheses in appropriate parts of sentences. Still,
the generated continuations remain below the level of advanced LLMs. For instance, the model’s
perplexity is 72.41, much higher than the 20 perplexity of the older, relatively small GPT-2 model
and far above the perplexity of around 5 achieved by leading LLMs.

This gap has several causes. First, our model is smaller than LLMs, with just 8,292,619 parameters,
most of which are in the embedding layer. Second, simple RNN architectures, like the Elman RNN,
have clear limitations. While they handle sequential data, they often fail to retain information from
earlier tokens as sequences grow. The hidden state gradually “forgets” past inputs. Lastly, RNNs
process tokens sequentially, which complicates training of larger models. Each token depends on
the processing of the previous one, forcing the GPU to process tokens one at a time rather than
leveraging parallel computation.

These limitations inspired the development of advanced recurrent architectures like long short-
term memory (LSTM) networks. LSTMs mitigate some RNN weaknesses but still struggle with
very long sequences, such as those spanning thousands of tokens, which are common in modern
language models.

The introduction of transformers, discussed in the next chapter, resolved many of these issues. By 2023, transformers had largely replaced RNNs in natural language processing because they handle long-range dependencies better and allow parallel computation.

Interest in RNNs was reignited in 2024 with the invention of the minLSTM and xLSTM
architectures, which achieve performance comparable to Transformer-based models.
This resurgence reflects a broader trend in AI research: no model type is ever
permanently obsolete. Researchers often revisit and refine older ideas, adapting them
to address modern challenges and leverage current hardware capabilities.


3.8. Simplified Model Representation


Now that we’ve covered the math behind language model layers and the structure of the training data, we can simplify the model’s representation by depicting each unit as a square, just as in Section 1.5. Below is a simplified diagram of the two-layer Elman RNN from Figure 3.1:

Here, we’ve adjusted the information flow in the diagram from left-to-right, as used in earlier
chapters, to bottom-to-top. This is the standard orientation for high-level language model diagrams
in the literature. We’ll keep this orientation when discussing the Transformer.

With that, we’ve finished covering recurrent neural networks and the language models built on
them. Next, we’ll explore transformer neural networks: how they differ from the models we’ve
studied and how they handle tasks like language modeling and document classification.
