
Chapter 3. Recurrent Neural Network


In this chapter, we discuss a key neural network architecture that changed how machines handle
sequences: the recurrent neural network. We’ll cover its structure and use in language modeling.
While recurrent neural networks are largely superseded by transformers and other attention-
based architectures in many modern applications, they remain foundational to understanding how
machine learning approaches sequential data.

3.1. Elman RNN


A recurrent neural network, or RNN, is a neural network designed for sequential data. Unlike
feedforward neural networks, RNNs include loops in their connections, enabling information to
carry over from one step in the sequence to the next. This makes them well-suited for tasks like
time series analysis, natural language processing, and other sequential data problems.

To illustrate the sequential nature of RNNs, let’s consider a neural network with a single unit and the input document “Learning from text is cool.” Ignoring case and punctuation, the matrix representing this document would be as follows:

Word        Embedding vector

learning    [0.1, 0.2, 0.6]⊤
from        [0.2, 0.1, 0.4]⊤
text        [0.1, 0.3, 0.3]⊤
is          [0.0, 0.7, 0.1]⊤
cool        [0.5, 0.2, 0.7]⊤
PAD         [0.0, 0.0, 0.0]⊤

Each row of the matrix represents a word’s embedding learned during neural network training.
The order of words is preserved. The matrix dimensions are (sequence length, embedding
dimensionality). Sequence length specifies the maximum number of words in a document. Shorter
documents are padded with padding tokens, while longer ones are truncated. Padding uses
dummy embeddings, usually zero vectors.

More formally, the matrix would look like this:


$$
\mathbf{X} =
\begin{bmatrix}
0.1 & 0.2 & 0.6 \\
0.2 & 0.1 & 0.4 \\
0.1 & 0.3 & 0.3 \\
0.0 & 0.7 & 0.1 \\
0.5 & 0.2 & 0.7 \\
0.0 & 0.0 & 0.0
\end{bmatrix}
$$
Here, we have five 3D embedding vectors, 𝐱1 , … , 𝐱5 , representing each word in the document. For
instance, 𝐱1 = [0.1,0.2,0.6]⊤, 𝐱2 = [0.2,0.1,0.4]⊤ , and so on. The sixth vector is a padding vector. The
single-unit RNN used to process this sequence is structured as follows:


The same unit receives as input a sequence of embedding vectors, one at a time, and outputs, at
each time step 𝑡, the hidden state 𝐡𝑡 . This unit is known as the Elman RNN, named after Jeffrey
Locke Elman, who introduced the simple recurrent neural network in 1990.

Unlike a unit in a multilayer perceptron, which takes a vector and returns a scalar as shown in
Figure 1.1, an RNN unit returns a vector, functioning as an entire layer.

As shown, an RNN receives two inputs at each time step 𝑡: an input embedding vector 𝐱𝑡 (typically
a word embedding) and a hidden state vector 𝐡𝑡−1 from the previous time step. It outputs an
updated hidden state 𝐡𝑡 . This characteristic—using its own output from the previous time step as
an input—gives the network its “recurrent” nature. The initial hidden state 𝐡0 is usually initialized
with a zero vector.

A hidden state is a vector that stores information from all previous time steps in a sequence. It acts
as the network’s “memory,” enabling past information to influence future predictions. At each time
step, the hidden state is updated using the current input and the previous hidden state. This is
critical for sequential tasks like language modeling, where context from earlier words helps predict
the next word.
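Concretely, the Elman hidden-state update at time step 𝑡 can be written as follows (this mirrors the computation in the code later in this chapter, where the matrix 𝐖 multiplies the current input and 𝐔 multiplies the previous hidden state):

$$
\mathbf{h}_t = \tanh\left(\mathbf{W}\mathbf{x}_t + \mathbf{U}\mathbf{h}_{t-1} + \mathbf{b}\right)
$$

Here 𝐛 is a bias vector, and the tanh activation keeps each component of the hidden state in the range (−1, 1).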

To deepen the network, we add a second RNN layer. The first layer’s outputs, 𝐡𝑡 , become inputs to
the second, whose outputs are the network’s final outputs:


Figure 3.1: A two-layer Elman RNN. The first layer’s outputs serve as inputs to the second layer.

3.2. Mini-Batch Gradient Descent


Before coding the RNN model, we need to discuss the shape of the input data. In Section 1.7, we
used the entire dataset for each gradient descent step. Here, and for training all future models, we’ll
adopt mini-batch gradient descent, a widely used method for large models and datasets. Mini-
batch gradient descent calculates derivatives over smaller data subsets, which speeds up learning
and reduces memory usage.

With mini-batch gradient descent, the data shape is organized as (batch size, sequence length,
embedding dimensionality). This structure divides the training set into fixed-size mini-batches,
each containing sequences of embeddings with consistent lengths. (From this point on, “batch” and
“mini-batch” will be used interchangeably.)

For example, if the batch size is 2, the sequence length is 4, and the embedding dimensionality is 3,
the mini-batch can be represented as:
$$
\text{batch}_1 =
\begin{bmatrix}
\text{seq}_{1,1} & \text{seq}_{1,2} & \text{seq}_{1,3} & \text{seq}_{1,4} \\
\text{seq}_{2,1} & \text{seq}_{2,2} & \text{seq}_{2,3} & \text{seq}_{2,4}
\end{bmatrix}
$$

Here, seq𝑖,𝑗 , for 𝑖 ∈ {1,2} and 𝑗 ∈ {1, … ,4} is an embedding vector.

For example, if we have the following embeddings for each sequence:

$$
\text{seq}_1 =
\begin{bmatrix}
0.1 & 0.2 & 0.3 \\
0.4 & 0.5 & 0.6 \\
0.7 & 0.8 & 0.9 \\
1.0 & 1.1 & 1.2
\end{bmatrix},
\qquad
\text{seq}_2 =
\begin{bmatrix}
1.3 & 1.4 & 1.5 \\
1.6 & 1.7 & 1.8 \\
1.9 & 2.0 & 2.1 \\
2.2 & 2.3 & 2.4
\end{bmatrix}
$$

The mini-batch will look like this:

$$
\text{batch}_1 =
\begin{bmatrix}
[0.1, 0.2, 0.3] & [0.4, 0.5, 0.6] & [0.7, 0.8, 0.9] & [1.0, 1.1, 1.2] \\
[1.3, 1.4, 1.5] & [1.6, 1.7, 1.8] & [1.9, 2.0, 2.1] & [2.2, 2.3, 2.4]
\end{bmatrix}
$$

During each step of gradient descent, we:

1. Select a mini-batch from the training set,


2. Pass it through the neural network,
3. Compute the loss,
4. Calculate gradients,
5. Update model parameters,
6. Repeat from step 1.

Mini-batch gradient descent often achieves faster convergence compared to using the entire
training set per step. It efficiently handles large models and datasets by using modern hardware’s
parallel processing capabilities. In PyTorch, models require the first dimension of the input data to
be the batch dimension, even if there’s only one example in the batch.
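To make the (batch size, sequence length, embedding dimensionality) convention concrete, here is a minimal sketch that builds the mini-batch above as a PyTorch tensor; the values are the ones from the example, and the variable name batch_1 is ours:

import torch

# Two sequences of four 3-dimensional embeddings each:
# shape is (batch_size=2, seq_len=4, emb_dim=3)
batch_1 = torch.tensor([
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]],
    [[1.3, 1.4, 1.5], [1.6, 1.7, 1.8], [1.9, 2.0, 2.1], [2.2, 2.3, 2.4]],
])
print(batch_1.shape)  # torch.Size([2, 4, 3])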

3.3. Programming an RNN


Let’s implement an Elman RNN unit:

import torch
import torch.nn as nn

class ElmanRNNUnit(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.Uh = nn.Parameter(torch.randn(emb_dim, emb_dim)) ➊
        self.Wh = nn.Parameter(torch.randn(emb_dim, emb_dim)) ➋
        self.b = nn.Parameter(torch.zeros(emb_dim)) ➌

    def forward(self, x, h):
        return torch.tanh(x @ self.Wh + h @ self.Uh + self.b) ➍

In the constructor:

• Lines ➊ and ➋ initialize self.Uh and self.Wh, the weight matrices for the hidden state
and input vector, with random values.
• Line ➌ sets self.b, the bias vector, to zero.


In the forward method, line ➍ handles the computation for each time step. It processes the current
input x and the previous hidden state h, both shaped (batch_size, emb_dim), combines them with
the weight matrices and bias, and applies the tanh activation. The output is the new hidden state,
also of shape (batch_size, emb_dim).

The @ symbol is the matrix multiplication operator in PyTorch. We use x @ self.Wh rather than self.Wh
@ x because of the way PyTorch handles batch dimensions in matrix multiplication. When working
with batched inputs, x has a shape of (batch_size, emb_dim), while self.Wh has a shape of
(emb_dim, emb_dim). Remember from Section 1.6 that for two matrices to be multipliable, the
number of columns in the left matrix must be the same as the number of rows in the right matrix.
This is satisfied in x @ self.Wh.
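As a quick sanity check of the shapes, here is a minimal usage sketch of the ElmanRNNUnit defined above, continuing from the imports in the listing; the sizes are arbitrary:

# A single Elman unit processing one batch of inputs for one time step
unit = ElmanRNNUnit(emb_dim=3)
x = torch.randn(2, 3)   # (batch_size, emb_dim) input at time step t
h = torch.zeros(2, 3)   # (batch_size, emb_dim) previous hidden state
h_new = unit(x, h)
print(h_new.shape)      # torch.Size([2, 3])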

Now, let’s define the class ElmanRNN, which implements a two-layer Elman RNN using
ElmanRNNUnit as its core building block:

class ElmanRNN(nn.Module):
    def __init__(self, emb_dim, num_layers):
        super().__init__()
        self.emb_dim = emb_dim
        self.num_layers = num_layers
        self.rnn_units = nn.ModuleList(
            [ElmanRNNUnit(emb_dim) for _ in range(num_layers)]
        ) ➊

    def forward(self, x):
        batch_size, seq_len, emb_dim = x.shape ➋
        h_prev = [
            torch.zeros(batch_size, emb_dim, device=x.device) ➌
            for _ in range(self.num_layers)
        ]
        outputs = []
        for t in range(seq_len): ➍
            input_t = x[:, t]
            for l, rnn_unit in enumerate(self.rnn_units):
                h_new = rnn_unit(input_t, h_prev[l])
                h_prev[l] = h_new    # Update hidden state
                input_t = h_new      # Input for next layer
            outputs.append(input_t)  # Collect outputs
        return torch.stack(outputs, dim=1) ➎

In line ➊ of the constructor, we initialize the RNN layers by creating a ModuleList containing
ElmanRNNUnit instances—one per layer. Using ModuleList instead of a regular Python list
ensures the parent module (ElmanRNN) properly registers all RNN unit parameters. This
guarantees that calling .parameters() or .to(device) on the parent module includes
parameters from all modules in the ModuleList.
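As a small illustration of this registration behavior, each ElmanRNNUnit contributes the three parameter tensors Uh, Wh, and b, so a two-layer ElmanRNN should expose six parameter tensors in total:

# Each ElmanRNNUnit registers 3 parameter tensors (Uh, Wh, b),
# so a two-layer ElmanRNN exposes 6 parameter tensors.
rnn = ElmanRNN(emb_dim=3, num_layers=2)
print(len(list(rnn.parameters())))  # 6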

In the forward method:


• Line ➋ extracts batch_size, seq_len, and emb_dim from the input tensor x.
• Line ➌ initializes the hidden states h_prev for all layers with zero tensors. Each hidden
state in the list has the shape (batch_size, emb_dim).

We store hidden states for each layer in a list instead of a multidimensional tensor
because we need to modify them during processing. In-place modifications of tensors
can disrupt PyTorch’s automatic differentiation system, which might result in incorrect
gradient calculations.

• Line ➍ iterates over time steps t in the input sequence. For each t:
o Extract the input at time t: input_t = x[:, t].
o For each layer l:
▪ Compute the new hidden state h_new from input_t and h_prev[l].
▪ Update the hidden state: h_prev[l] = h_new (updates in place).
▪ Set input_t = h_new to pass to the next layer.
o Append the output of the last layer: outputs.append(input_t).
• Once all time steps are processed, line ➎ converts the outputs list into a tensor by
stacking it along the time dimension. The resulting tensor has the shape (batch_size,
seq_len, emb_dim).
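Putting the pieces together, a minimal forward-pass sketch of the two-layer ElmanRNN, with made-up sizes, looks like this:

# A random mini-batch: (batch_size=2, seq_len=4, emb_dim=3)
x = torch.randn(2, 4, 3)
rnn = ElmanRNN(emb_dim=3, num_layers=2)
out = rnn(x)
print(out.shape)  # torch.Size([2, 4, 3]): one hidden state per position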

3.4. RNN as a Language Model


An RNN-based language model uses ElmanRNN as its building block:

class RecurrentLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, num_layers, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(
            vocab_size,
            emb_dim,
            padding_idx=pad_idx
        ) ➊
        self.rnn = ElmanRNN(emb_dim, num_layers)
        self.fc = nn.Linear(emb_dim, vocab_size)

    def forward(self, x):
        embeddings = self.embedding(x)
        rnn_output = self.rnn(embeddings)
        logits = self.fc(rnn_output)
        return logits

The RecurrentLanguageModel class integrates three components: an embedding layer, the ElmanRNN defined earlier, and a final linear layer.


In the constructor, line ➊ defines the embedding layer. This layer transforms input token indices
into dense vectors. The padding_idx parameter ensures that padding tokens are represented by
zero vectors. (We’ll cover the embedding layer in the next section.)

Next, we initialize the custom ElmanRNN, specifying the embedding dimensionality and the number
of layers. Finally, we add a fully connected layer, which converts the RNN’s output into vocabulary-
sized logits for each token in the sequence.

In the forward method:

• We pass the input x through the embedding layer. Input x has shape (batch_size,
seq_len), and the output embeddings have shape (batch_size, seq_len, emb_dim).
• We then pass the embedded input through our ElmanRNN, obtaining rnn_output with
shape (batch_size, seq_len, emb_dim).
• Finally, we apply the fully connected layer to the RNN output, producing logits for each
token in the vocabulary at each position in the sequence. The output logits have shape
(batch_size, seq_len, vocab_size).
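These shapes can be verified with a short sketch; the token indices below are arbitrary, and the vocab_size and pad_idx values are made up for illustration:

# Hypothetical vocabulary of 10 tokens, padding index 0
model = RecurrentLanguageModel(vocab_size=10, emb_dim=3, num_layers=2, pad_idx=0)
x = torch.tensor([[1, 4, 7, 0],   # (batch_size=2, seq_len=4) of token indices
                  [2, 5, 8, 9]])
logits = model(x)
print(logits.shape)  # torch.Size([2, 4, 10])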

3.5. Embedding Layer


An embedding layer, implemented as nn.Embedding in PyTorch, maps token indices from a
vocabulary to dense, fixed-size vectors. It acts as a learnable lookup table, where each token is
assigned a unique embedding vector. During training, these vectors are adjusted to capture
meaningful numerical representations of the tokens.

Here’s an example to show how an embedding layer works. Imagine a vocabulary with five tokens,
indexed from 0 to 4. We want each token to have a three-dimensional embedding vector. To begin,
we create an embedding layer:

import torch
import torch.nn as nn

vocab_size = 5  # Number of unique tokens
emb_dim = 3     # Size of each embedding vector
emb_layer = nn.Embedding(vocab_size, emb_dim)

The embedding layer initializes the embedding matrix 𝐄 with random values. In this case, the
matrix has 5 rows (one for each token) and 3 columns (the embedding dimensionality):

$$
\mathbf{E} =
\begin{bmatrix}
0.2 & -0.4 & 0.1 \\
-0.3 & 0.8 & -0.5 \\
0.7 & 0.1 & -0.2 \\
-0.6 & 0.5 & 0.4 \\
0.9 & -0.7 & 0.3
\end{bmatrix}
$$
Each row in 𝐄 represents the embedding for a specific token in the vocabulary.

Now, let’s input a sequence of token indices:


token_indices = torch.tensor([0, 2, 4])

The embedding layer retrieves the rows of 𝐄 corresponding to the input indices:
$$
\text{Embeddings} =
\begin{bmatrix}
0.2 & -0.4 & 0.1 \\
0.7 & 0.1 & -0.2 \\
0.9 & -0.7 & 0.3
\end{bmatrix}
$$
This output is a matrix whose number of rows equals the input sequence length and whose number
of columns equals the embedding dimensionality:

embeddings = emb_layer(token_indices)
print(embeddings)

The output might look like this:

tensor([[ 0.2, -0.4,  0.1],
        [ 0.7,  0.1, -0.2],
        [ 0.9, -0.7,  0.3]])

The embedding layer can manage padding tokens as well. Padding ensures sequences in a mini-
batch have the same length. To prevent the model from updating embeddings for padding tokens
during training, the layer maps them to a zero vector that remains unchanged. For example, if we
define the padding index:

emb_layer = nn.Embedding(vocab_size, emb_dim, padding_idx=0)

The embedding for token 0 (padding token) is always [0,0,0]⊤ .

Given the input:

token_indices = torch.tensor([0, 2, 4])
embeddings = emb_layer(token_indices)
print(embeddings)

The result would be:

tensor([[ 0.0,  0.0,  0.0],   # Padding token
        [ 0.7,  0.1, -0.2],   # Token 2 embedding
        [ 0.9, -0.7,  0.3]])  # Token 4 embedding

With modern language models, vocabularies often include hundreds of thousands of tokens, and embedding dimensions typically run into the thousands. This makes the embedding matrix a significant part of the model, sometimes containing nearly 2 billion parameters.
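For a rough, hypothetical sense of scale: a vocabulary of 250,000 tokens with 8,192-dimensional embeddings corresponds to an embedding matrix of 250,000 × 8,192 ≈ 2 billion parameters.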

3.6. Training a Language Model


Start by importing libraries and defining utility functions:


import random  # needed by set_seed below

import torch, torch.nn as nn

def set_seed(seed):
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed) ➊
    torch.backends.cudnn.deterministic = True ➋
    torch.backends.cudnn.benchmark = False ➌

The set_seed function enforces reproducibility by setting random seeds. It sets the Python
random seed, the PyTorch CPU seed, and, in line ➊, the CUDA seed for all GPUs (Graphics Processing
Units). CUDA is NVIDIA’s parallel computing platform and API that enables significant performance
improvements in computing by leveraging the power of GPUs. Using
torch.cuda.manual_seed_all ensures consistent GPU-based random behavior, while lines ➋ and ➌ enforce deterministic cuDNN algorithms and disable cuDNN’s auto-tuner, making results reproducible across runs on the same hardware.

With the model class ready, we’ll train our neural language model. First, we install the
transformers package—an open-source library providing APIs and tools to easily download,
train and use pretrained models from the Hugging Face Hub:

$ pip3 install transformers

The package offers a Python API for training that works with both PyTorch and TensorFlow. For
now, we only need it to get a tokenizer.

Now we import transformers, set the tokenizer, define the hyperparameter values, prepare the
data, and instantiate the model, loss function, and optimizer objects:

from transformers import AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") ➊

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct"
) ➋
vocab_size = len(tokenizer) ➌

emb_dim, num_layers, batch_size, learning_rate, num_epochs = get_hyperparameters()

data_url = "https://www.thelmbook.com/data/news"
train_loader, test_loader = download_and_prepare_data(
    data_url, batch_size, tokenizer) ➍

model = RecurrentLanguageModel(
    vocab_size, emb_dim, num_layers, tokenizer.pad_token_id
)


initialize_weights(model) ➎
model.to(device)

criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id) ➏
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

Line ➊ detects a CUDA device if it’s available. Otherwise, it defaults to CPU.

Most models on the Hugging Face Hub include the tokenizer that was used to train them. Line ➋
initializes the Phi 3.5 mini tokenizer. It was trained on a large text corpus using the byte-pair
encoding algorithm and has a vocabulary size of 32,064.

Line ➌ retrieves the tokenizer’s vocabulary size. Line ➍ downloads and prepares the dataset—a
collection of news sentences from online articles—tokenizing them and creating DataLoader
objects that iterate over batches.

Line ➎ initializes the model parameters. Initial parameter values can greatly influence the training
process. They can affect how quickly training progresses and the final loss value. Certain
initialization techniques, like Xavier initialization, have shown good results in practice. The
initialize_weights function, implementing this method, is defined in the notebook.

Line ➏ creates the loss function with the ignore_index parameter. This ensures the loss is not
calculated for padding tokens.

Now, let’s look at the training loop:

for epoch in range(num_epochs): ➊
    model.train() ➋
    for batch in train_loader: ➌
        input_seq, target_seq = batch
        input_seq = input_seq.to(device) ➍
        target_seq = target_seq.to(device) ➎
        batch_size_current, seq_len = input_seq.shape ➏
        optimizer.zero_grad()
        output = model(input_seq)
        output = output.reshape(batch_size_current * seq_len, vocab_size) ➐
        target = target_seq.reshape(batch_size_current * seq_len) ➑
        loss = criterion(output, target) ➒
        loss.backward()
        optimizer.step()

Line ➊ iterates over epochs. An epoch is a single pass through the entire dataset. Training for
multiple epochs can improve the model, especially with limited training data. The number of epochs
is a hyperparameter that you adjust based on the model’s performance on the test set.

Line ➋ calls model.train() at the start of each epoch to set the model in training mode. This is
important for models that have layers behaving differently during training vs. evaluation.

Although our RNN model doesn’t use such layers, calling model.train() ensures the
model is properly configured for training. This avoids unexpected behavior and keeps
consistency, especially if future changes add layers dependent on the mode.

Line ➌ iterates over batches. Each batch is a tuple: one tensor contains input sequences, and the
other contains target sequences. Lines ➍ and ➎ move these tensors to the same device as the model.
If the model and data are on different devices, PyTorch raises an error.

Line ➏ retrieves the batch size and sequence length from input_seq (target_seq has the same
shape). These dimensions are needed to reshape the model’s output tensor
(batch_size_current, seq_len, vocab_size) and target tensor (batch_size_current,
seq_len) into compatible shapes for the cross-entropy loss function. In line ➐, the output is
reshaped to (batch_size_current * seq_len, vocab_size), and in line ➑, the target is
flattened to batch_size_current * seq_len, allowing the loss calculation in line ➒ to process
all tokens in the batch simultaneously and return the average loss per token.
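As a minimal illustration of this reshaping (with fake model output and a hypothetical vocabulary of 10 tokens), the following computes a valid cross-entropy loss over every position in the batch:

# Fake model output and targets: batch of 2 sequences, 4 tokens each
output = torch.randn(2, 4, 10)          # (batch_size, seq_len, vocab_size)
target = torch.randint(0, 10, (2, 4))   # (batch_size, seq_len)
loss = nn.CrossEntropyLoss()(
    output.reshape(2 * 4, 10),          # (batch_size * seq_len, vocab_size)
    target.reshape(2 * 4),              # (batch_size * seq_len,)
)
print(loss.item())  # average loss per token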

3.7. Training Data and Loss Computation


When studying neural language models, a key aspect is understanding the structure of a training
example. The text corpus is split into overlapping input and target sequences. Each input sequence
aligns with a target sequence shifted by one token. This setup trains the model to predict the next
word at each position in the sequence.

For instance, take the sentence “We train a recurrent neural network as a language model.” After
tokenizing it with the Phi 3.5 mini tokenizer, we get:

["_We", "_train", "_a", "_rec", "urrent", "_neural", "_network", "_as", "_a


", "_language", "_model", "."]

To create one training example, we convert the sentence into input and target sequences by shifting
tokens forward by one position:

Input: ["_We", "_train", "_a", "_rec", "urrent", "_neural", "_network", "_a


s", "_a", "_language", "_model"]
Target: ["_train", "_a", "_rec", "urrent", "_neural", "_network", "_as", "_
a", "_language", "_model", "."]

A training example doesn’t need to be a complete sentence. Modern language models process
sequences up to their context window length—a fixed maximum number of tokens they can
handle at once (like 2048, 4096, or 8192 tokens). This context window determines how much text
the model can “see” and reason about at any time, which affects its ability to understand
relationships between distant parts of text. The training corpus is therefore segmented into chunks
matching this context window length, with the target sequence for each chunk shifted forward one
position relative to the input.
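A minimal sketch of this chunking is shown below; the function name make_chunks is ours and is not from the book’s notebook:

def make_chunks(token_ids, context_len):
    # Split a long list of token ids into (input, target) pairs,
    # where each target is the input shifted forward by one token.
    examples = []
    for start in range(0, len(token_ids) - context_len, context_len):
        chunk = token_ids[start:start + context_len + 1]
        examples.append((chunk[:-1], chunk[1:]))
    return examples

# Example: chunks of 4 tokens from a toy sequence of token ids
print(make_chunks(list(range(10)), context_len=4))
# [([0, 1, 2, 3], [1, 2, 3, 4]), ([4, 5, 6, 7], [5, 6, 7, 8])]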


During training, the RNN processes one token at a time, updating its hidden states layer by layer.
At each step, it generates logits aimed at predicting the next token in the sequence. Each logit
corresponds to a vocabulary token and is converted into probabilities using softmax. These
probabilities are then used to compute the loss.

Each sentence results in multiple predictions and losses. For example, the model first processes
“_We” and tries to predict “_train” by assigning probabilities to all vocabulary tokens. The loss is
computed using the probability of “_train,” as defined in Equation 2.1. Next, the model processes
“_train” to predict “_a,” generating another loss. This continues for every token in the sequence. In
total, the model makes 11 predictions and calculates 11 losses for this example.

The losses are averaged across the tokens in a training example and all examples in the batch. The
average loss is then used in backpropagation to update the model’s parameters.

Predicting the next token at each position gives the model many “signals” to learn from, speeding
up learning compared to predicting just one hidden token for the whole sequence, as is the case
with masked language models.

Let’s break down the loss calculation for each position with some made-up numbers:

• Position 1:
o Target token: “_train”
o Logit for “_train”: −0.5
o After applying softmax to the logits, suppose the probability of “_train” is 0.1
o Contribution to the total loss by Equation 2.1 is −log(0.1) = 2.30
• Position 2:
o Target token: “_a”
o Logit for “_a”: 3.2
o After softmax, the probability for “_a”: 0.05
o Contribution to loss: −log(0.05) = 2.99
• Position 3:
o The probability for “_rec”: 0.02
o Contribution to loss: −log(0.02) = 3.91
• Position 4:
o The probability for “urrent”: 0.34
o Contribution to loss: −log(0.34) = 1.08

We continue until calculating the loss contribution for the final token, the period:

• Position 11:
o Target token: “.”
o Logit for “.”: −1.2
o After softmax, the probability for “.”: 0.11
o Contribution to loss: −log(0.11) = 2.21

The final loss is calculated by taking the average of these values:


$$
\frac{2.30 + 2.99 + 3.91 + 1.08 + \cdots + 2.21}{11} = 2.11 \ \text{(hypothetically)}
$$
During training, the objective is to minimize this loss. This involves improving the model so that it
assigns higher probabilities to the correct target tokens at each position.
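Each per-position contribution above is simply the negative logarithm of the target token’s probability, which can be checked with a few lines; the probabilities are the made-up values from the example:

import math

probs = [0.1, 0.02, 0.34, 0.11]       # made-up target-token probabilities
losses = [-math.log(p) for p in probs]
print([round(l, 2) for l in losses])  # [2.3, 3.91, 1.08, 2.21]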

The full code for training the RNN-based language model can be found in thelmbook.com/nb/3.1. I
used the following hyperparameter values: emb_dim = 128, num_layers = 2, batch_size =
128, learning_rate = 0.001, and num_epochs = 1.

Here are three continuations for the prompt “The President” generated at later training steps:

The President refused to comment on the best news in the five on BBC .
The President has been a `` very serious '' and `` unacceptable '' .
The President 's office is not the first time to be able to take the lead .

At the start of training, the model generated almost random token sequences. Over time, its outputs
improved: it now correctly closes quotes and parentheses in appropriate parts of sentences. Still,
the generated continuations remain below the level of advanced LLMs. For instance, the model’s
perplexity is 72.41, much higher than the 20 perplexity of the older, relatively small GPT-2 model
and far above the perplexity of around 5 achieved by leading LLMs.

This gap has several causes. First, our model is smaller than LLMs, with just 8,292,619 parameters,
most of which are in the embedding layer. Second, simple RNN architectures, like the Elman RNN,
have clear limitations. While they handle sequential data, they often fail to retain information from
earlier tokens as sequences grow. The hidden state gradually “forgets” past inputs. Lastly, RNNs
process tokens sequentially, which complicates training of larger models. Each token depends on
the processing of the previous one, forcing the GPU to process tokens one at a time rather than
leveraging parallel computation.

These limitations inspired the development of advanced recurrent architectures like long short-
term memory (LSTM) networks. LSTMs mitigate some RNN weaknesses but still struggle with
very long sequences, such as those spanning thousands of tokens, which are common in modern
language models.

The introduction of transformers, discussed in the next chapter, resolved many of these issues. By 2023, transformers had largely replaced RNNs in natural language processing because they handle long-range dependencies better and allow parallel computation.

Interest in RNNs was reignited in 2024 with the invention of the minLSTM and xLSTM
architectures, which achieve performance comparable to Transformer-based models.
This resurgence reflects a broader trend in AI research: no model type is ever
permanently obsolete. Researchers often revisit and refine older ideas, adapting them
to address modern challenges and leverage current hardware capabilities.


3.8. Simplified Model Representation


Now that we’ve covered the math behind language model layers and the structure of the training data, we can simplify the model’s representation by depicting each unit as a square, just as in Section 1.5. Below is a simplified diagram of the two-layer Elman RNN from Figure 3.1:

Here, we’ve adjusted the information flow in the diagram from left-to-right, as used in earlier
chapters, to bottom-to-top. This is the standard orientation for high-level language model diagrams
in the literature. We’ll keep this orientation when discussing the Transformer.

With that, we’ve finished covering recurrent neural networks and the language models built on
them. Next, we’ll explore transformer neural networks: how they differ from the models we’ve
studied and how they handle tasks like language modeling and document classification.
