Lab 1X Appendix: Training Models with PyTorch
Using PyTorch is easy, but it can look complicated because it requires that
you either learn or remember that Python is an object-oriented language.
Implementing an algorithm that solves (1) is not as simple as calling
a function that performs the minimization. You have to create objects
that instantiate classes in which you specify the operations that are to be
performed. This results in code that can look strange and complicated but
that is easier to modify. And while it may look complicated, it is not, in
reality, complicated.
1.1 Classes: Attributes and Methods
import numpy as np

class LinearFunction:

    def __init__(self, m, n, A):
        self.m = m
        self.n = n
        self.A = A

    def evaluate(self, x):
        y = np.matmul(x, self.A)
        return y
The class definition contains two methods. The method __init__ plays a
special role in the creation of objects, which we will explain soon. At this
point, observe how it specifies the attributes that are part of the class. In
this specific example, the class contains three attributes: the dimensions
m and n and the matrix A. When you define a class, the __init__
method always has to be specified, and self always has to be its first
parameter.
1.2 Objects: Concrete Instances of Abstract Classes
The class is an abstract object with methods that specify how to ma-
nipulate its attributes. If we want to actually process data, we create a
specific instance. This is an object. For example, if we want a linear
transformation specified by a matrix A with 42 rows, 71 columns, and
random binary entries that are equally likely to be 0 or 1, we create the
object BernoulliMap as an instance of the class LinearFunction ,
m = 42
n = 71
A = np.random.binomial(n=1, p=0.5, size=(m, n))
BernoulliMap = LinearFunction(m, n, A)
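As a quick illustration, and assuming the definitions above, the object can now be used to process data through its evaluate method; the batch of inputs x here is hypothetical:

import numpy as np

# A batch of 3 inputs with m entries each; evaluate multiplies them by A,
# producing 3 outputs with n entries each.
x = np.random.rand(3, m)
y = BernoulliMap.evaluate(x)
print(y.shape)  # (3, 71)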
If this looks like a lot of trouble for a matrix product, it is because it is
indeed a lot of trouble. However, suppose that you now find a more efficient
algorithm for implementing matrix computations, perhaps because you
have decided to take advantage of a GPU. You go into the definition
of the LinearFunction class and update the evaluate method. The
change is now implemented in the hundreds of places in your code where
you had used matrix multiplication.
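To make this concrete, here is one hypothetical way the evaluate method could be rewritten to take advantage of a GPU. This is a sketch, not the lab's implementation; only the class definition changes, and every piece of code that calls evaluate benefits:

import torch

class LinearFunction:

    def __init__(self, m, n, A):
        self.m = m
        self.n = n
        self.A = A

    def evaluate(self, x):
        # Hypothetical GPU-backed evaluation: move the operands to a CUDA
        # device if one exists, multiply there, and return a NumPy array.
        device = "cuda" if torch.cuda.is_available() else "cpu"
        xT = torch.as_tensor(x, dtype=torch.float32, device=device)
        AT = torch.as_tensor(self.A, dtype=torch.float32, device=device)
        return torch.matmul(xT, AT).cpu().numpy()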
1.3 Inheritance
class LinearBernoulliFunction(LinearFunction):

    def __init__(self, m, n):
        self.m = m
        self.n = n
        self.A = np.random.binomial(n=1, p=0.5, size=(m, n))
With this new class, the creation of random Bernoulli maps simplifies to
the code
m = 42
n = 71
BernoulliMap = LinearBernoulliFunction(m, n)
AnotherBernoulliMap = LinearBernoulliFunction(m, n)
The code for the evaluation of the linear functions is still the same be-
cause it has been inherited. The most important advantage of defining a
new class is that modifications to the class will now propagate to all the
places where a Bernoulli map is defined. If, say, we decide that a proba-
bility p = 0.3 for drawing ones is more appropriate, it's just a matter of
changing the definition of the LinearBernoulliFunction.__init__
method. The change will propagate to all the places where we instantiate
an object belonging to the LinearBernoulliFunction class.
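For instance, that change amounts to editing a single line in the class definition:

import numpy as np

class LinearBernoulliFunction(LinearFunction):

    def __init__(self, m, n):
        self.m = m
        self.n = n
        # Ones are now drawn with probability p = 0.3 instead of p = 0.5.
        self.A = np.random.binomial(n=1, p=0.3, size=(m, n))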
The reason why training with PyTorch may look complicated is that part
of the operations are encapsulated in an object that inherits methods from
a parent class. Having developed an understanding of the encapsulation
of operations inside objects, it is now easy to understand how to write
a training loop in PyTorch.
In this section we focus on the problem in (1), in which the loss associated
with individual observations is the mean squared cost ℓ(y, ŷ) = ‖y − ŷ‖²
and the learning parametrization is the linear function ŷ = Hx.
Our first task is to specify the learning parametrization that we will use.
We do that by creating a class, which we will instantiate later, that we
will call Parametrization . This class must have an __init__ method,
as all classes do, and a method called forward . Most importantly, the
class must inherit from the Module class that is part of the torch.nn
library. This is what will allow its use in a training loop. To describe this
in more detail, here is a minimal code that defines a Parametrization
class for estimates ŷ = Hx,
import torch
import torch.nn as nn
class Parametrization(nn.Module):
Aside from that, we specify the __init__ method and the forward
method. The __init__ method is mostly formulaic. The first line of the
method initializes the parent class and the second line of the method speci-
fies that the variable self.H is a parameter. This means exactly what
you think it means: it indicates that self.H is a variable that we
will train, a fact that has to be specified for gradients to be computed
correctly. The Parametrization class could include other parameters
that are not trained. The specification of self.H further states that this
variable is a torch.Tensor with n rows and m columns, so that a batch
of inputs x arranged as rows can be multiplied as xH. This is just
a specification of a matrix.
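Consistent with this description, a minimal sketch of the __init__ method is the following; the use of torch.rand for the initial value and the exact signature are assumptions made here for illustration:

    def __init__(self, n, m):
        # Initialize the parent nn.Module class.
        super().__init__()
        # Declare H as a trainable parameter: an n x m tensor so that a
        # batch of row-vector inputs x is mapped to estimates x H.
        self.H = nn.parameter.Parameter(torch.rand(n, m))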
    def forward(self, x):
        # Linear estimates yHat = x H for a batch of row-vector inputs x.
        yHat = torch.matmul(x, self.H)
        return yHat
The following code trains the linear model that we encapsulated in the
Parametrization class defined in Section 2.1,
import torch
import torch.optim as optim

estimator = Parametrization(n, m)
optimizer = optim.SGD(estimator.parameters(), lr=eps, momentum=0)

iter = 0
while iter < nIters:
    x, y = getBatch(batchSize, xTrain, yTrain)
    estimator.zero_grad()
    yHat = estimator.forward(x)
    loss = torch.mean((yHat - y)**2)
    loss.backward()
    optimizer.step()
    iter += 1
The first line after the import commands in the code above instantiates
estimator as an object belonging to class Parametrization . For our
purposes, this object is essentially a matrix. Strictly speaking, it is not a
matrix; it is an object of class Parametrization , which inherits from
class nn.Module . This endows it with methods that allow the compu-
tation of gradients, but this machinery is transparent to us. All that
matters is that estimator holds a matrix that we are learning. If we
want to access the actual matrix we have to call estimator.H .
The loop iterates for nIters iterations. In each iteration there are three
separate actions: (i) we access a batch using the getBatch function;
(ii) we compute stochastic gradients; and (iii) we perform an SGD step
by calling optimizer.step() . The computation of gradients is under-
taken by the four commands from estimator.zero_grad() to
loss.backward() , which combine to perform the operations

$$\text{loss} \;=\; \frac{1}{\text{batchSize}} \sum_{i \in \text{Batch}} \big\|\, y[i,:] - x[i,:]\,H \,\big\|_2^2 . \qquad (2)$$
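The getBatch function is not part of PyTorch; it is assumed to be provided by the lab code. A minimal sketch of what it might look like, assuming xTrain and yTrain are tensors whose rows are samples, is:

import torch

def getBatch(batchSize, xTrain, yTrain):
    # Draw batchSize sample indices uniformly at random and return the
    # corresponding rows of the training inputs and outputs.
    idx = torch.randint(0, xTrain.shape[0], (batchSize,))
    return xTrain[idx, :], yTrain[idx, :]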
The mechanics of how gradients are computed are fascinating and worth
learning, but you don't need to know them to run training loops. Begin-
ners don't even need to modify the training loop; they just modify the
Parametrization class, and that suffices to train a different system. The
explanations here are enough to make us intermediate users. These are
the facts we have learned:
(L5) This gradient is accessed by the optimizer object to update the val-
ues of estimator.parameters() .
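As a quick sanity check (hypothetical usage), both the trained matrix and the gradient computed by the most recent call to loss.backward() can be inspected directly:

print(estimator.H)       # current value of the trained matrix
print(estimator.H.grad)  # gradient of the loss with respect to H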
We will revisit these learned facts by discussing the training of a Neural
Network in the next section.
If we keep using the squared Euclidean error loss ℓ(y, ŷ) = ‖y − ŷ‖²,
the ERM problem we want to solve is obtained by substituting the linear
parametrization ŷ = Hx used in (1) with the NN parametrization in (??).
This yields the ERM problem

$$H^{\ast} \;=\; \operatorname*{argmin}_{H_1,\,H_2} \; \frac{1}{Q} \sum_{q=1}^{Q} \frac{1}{2}\, \big\|\, y_q - H_2\, \sigma\!\left( H_1 x_q \right) \big\|_2^2 . \qquad (4)$$
To use PyTorch to train (4), the training loop doesn't have to change.
All we have to do is replace the Parametrization class with the class
TwoLayerNN that implements the parametrization in (??). This class has
an __init__ method and a forward method and is defined as follows,
import torch
import torch.nn as nn

class TwoLayerNN(nn.Module):

    def __init__(self, n, m, h):
        super().__init__()
        self.H1 = nn.parameter.Parameter(torch.rand(n, h))
        self.H2 = nn.parameter.Parameter(torch.rand(h, m))
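The forward method completes the class. A minimal sketch implementing the composition H2 σ(H1 x) of (4), written for batches of row-vector inputs and assuming σ is a ReLU nonlinearity, is:

    def forward(self, x):
        # Hidden layer: z = sigma(x H1), with sigma assumed to be a ReLU.
        sigma = nn.ReLU()
        z = sigma(torch.matmul(x, self.H1))
        # Output layer: yHat = z H2.
        yHat = torch.matmul(z, self.H2)
        return yHat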
We have said that the training loop does not change. This is true, except
that when we instantiate the estimator object we need to instantiate it
as a member of the TwoLayerNN class. For completeness, we rewrite the
training loop here with that modification,
import torch
import torch.optim as optim

estimator = TwoLayerNN(n, m, h)
optimizer = optim.SGD(estimator.parameters(), lr=eps, momentum=0)

iter = 0
while iter < nIters:
    x, y = getBatch(batchSize, xTrain, yTrain)
    estimator.zero_grad()
    yHat = estimator.forward(x)
    loss = torch.mean((yHat - y)**2)
    loss.backward()
    optimizer.step()
    iter += 1
The only difference between this training loop and the training loop for
linear parametrizations is the use of a different class for the estimator ob-
ject. In this loop, the combined calls to estimator.zero_grad() and
loss.backward() result in the computation of gradients with respect to
the NN parameters H1 and H2 . The call to optimizer.step()
results in a stochastic gradient update of these parameters. These changes
are implemented simply by replacing the definition of the
estimator object. If we want to train a graph neural network, we just
need to define a proper class and instantiate a proper object. The training
loop remains unchanged.
4 Code links
The implementation of the basic training loop with the linear parametriza-
tion can be found in the folder code simple loop.zip. This folder contains
the following files:
The implementation of the basic training loop with a two-layer fully con-
nected neural network can be found in the folder code simple loop nn.zip.
This folder contains the following files:
5 A More Comprehensive Learning Loop
5.1 Validation
5.2 Testing
Unlike the validation set used to tune the hyperparameters and keep
track of the best model, the samples in the test set are only accessed once
the training loop is over. The learned model is run on these samples
to compute the test error, which provides a measure of how well the
model generalizes to unseen data. In particular, for a good model the
gap between the training and the test error should be small. A large gap
usually indicates that the model has overfitted the training data.
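As a sketch of this step, assuming held-out tensors xTest and yTest with the same layout as the training data, the test error of the trained estimator can be computed once, after training, as:

import torch

# Evaluate the trained model on the test set without tracking gradients.
with torch.no_grad():
    yHatTest = estimator.forward(xTest)
    testError = torch.mean((yHatTest - yTest)**2)
print(testError.item())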
In most train-test splits, the largest portion of the data (80-90%) is used
for training and the rest for testing. The validation set is obtained by
setting aside a small fraction of the training data. Before splitting the data
between the training and test sets, the samples are randomized. This is
an important step because in real-world scenarios we don’t usually know
whether the available samples are random or ordered in some way. In
practice, randomizing the samples is also necessary to assess the quality
of the model parametrization independently of the quality of a particular
train-test split. This is done by running Monte-Carlo experiments, where
estimators are trained on multiple train-test splits to compute the average
test error realized by models with a given parametrization.
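A sketch of such a randomized split, where the 90/10 proportion and the tensor names xData and yData are assumptions made for illustration, is:

import torch

# Randomly permute the samples before splitting into train and test sets.
nSamples = xData.shape[0]
perm = torch.randperm(nSamples)
nTrain = int(0.9 * nSamples)
xTrain, yTrain = xData[perm[:nTrain]], yData[perm[:nTrain]]
xTest, yTest = xData[perm[nTrain:]], yData[perm[nTrain:]]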
In the basic training loop, the samples of a batch are selected at random
from the training set in each training step. If the number of training steps
is large enough, this is not an issue as it is highly likely that all training
samples have been included in a batch — and therefore used to train
the model — at least once. However, the randomness of this approach
might make it so that some samples are selected multiple times before
the dataset is considered in full. This affects the gradient descent path
and, if the number of training steps is not chosen judiciously, it can have
a negative effect on the resulting model.
To address this, we can train the model in epochs, which are full passes
over the dataset. In each epoch, the samples are permuted and parti-
tioned in fixed-size batches covering the entire dataset in order to use
every training sample an equal number of times. Training in epochs is
helpful because epochs are more interpretable than training steps — it
makes more sense to specify the number of full passes over the data than
the total number of steps. Given a certain number of epochs and the
size of a batch, the number of training steps is calculated as the number
of epochs multiplied by the number of batches necessary to cover the
training set.
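A sketch of an epoch-based version of the training loop is given below; the name nEpochs is an assumption, and the remaining names follow the loops above:

import torch

for epoch in range(nEpochs):
    # Permute the training samples and partition them into batches that
    # cover the whole dataset exactly once per epoch.
    perm = torch.randperm(xTrain.shape[0])
    for b in range(0, xTrain.shape[0], batchSize):
        idx = perm[b:b + batchSize]
        x, y = xTrain[idx], yTrain[idx]
        estimator.zero_grad()
        yHat = estimator.forward(x)
        loss = torch.mean((yHat - y)**2)
        loss.backward()
        optimizer.step()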
• main testing.py : This is the main script modified to include
validation and testing.