
Gradients without Backpropagation

Atılım Güneş Baydin 1, Barak A. Pearlmutter 2, Don Syme 3, Frank Wood 4, Philip Torr 5

arXiv:2202.08587v1 [cs.LG] 17 Feb 2022

Abstract

Using backpropagation to compute gradients of objective functions for optimization has remained a mainstay of machine learning. Backpropagation, or reverse-mode differentiation, is a special case within the general family of automatic differentiation algorithms that also includes the forward mode. We present a method to compute gradients based solely on the directional derivative that one can compute exactly and efficiently via the forward mode. We call this formulation the forward gradient, an unbiased estimate of the gradient that can be evaluated in a single forward run of the function, entirely eliminating the need for backpropagation in gradient descent. We demonstrate forward gradient descent in a range of problems, showing substantial savings in computation and enabling training up to twice as fast in some cases.

1. Introduction

Backpropagation (Linnainmaa, 1970; Rumelhart et al., 1985) and gradient-based optimization have been the core algorithms underlying many recent successes in machine learning (ML) (Goodfellow et al., 2016; Deisenroth et al., 2020). It is generally accepted that one of the factors contributing to the recent pace of advance in ML has been the ease with which differentiable ML code can be implemented via well-engineered libraries such as PyTorch (Paszke et al., 2019) or TensorFlow (Abadi et al., 2016) with automatic differentiation (AD) capabilities (Griewank & Walther, 2008; Baydin et al., 2018). These frameworks provide the computational infrastructure on which our field is built.

Until recently, all major software frameworks for ML have been built around the reverse mode of AD, a technique to evaluate derivatives of numeric code using a two-phase forward–backward algorithm, of which backpropagation is a special case conventionally applied to neural networks. This is mainly due to the central role of scalar-valued objectives in ML, whose gradient with respect to a very large number of inputs can be evaluated exactly and efficiently with a single evaluation of the reverse mode.

Reverse mode is a member of a larger family of AD algorithms that also includes the forward mode (Wengert, 1964), which has the favorable characteristic of requiring only a single forward evaluation of a function (i.e., not involving any backpropagation) at a significantly lower computational cost. Crucially, forward and reverse modes of AD evaluate different quantities. Given a function f : R^n → R^m, forward mode evaluates the Jacobian–vector product Jf v, where Jf ∈ R^{m×n} and v ∈ R^n; and reverse mode evaluates the vector–Jacobian product v^T Jf, where v ∈ R^m. For the case of f : R^n → R (e.g., an objective function in ML), forward mode gives us ∇f · v ∈ R, the directional derivative; and reverse mode gives us the full gradient ∇f ∈ R^n.[1]

From the perspective of AD applied to ML, a "holy grail" is whether the practical usefulness of gradient descent can be achieved using only the forward mode, eliminating the need for backpropagation. This could potentially change the computational complexity of typical ML training pipelines, reduce the time and energy costs of training, influence ML hardware design, and even have implications regarding the biological plausibility of backpropagation in the brain (Bengio et al., 2015; Lillicrap et al., 2020). In this work we present results that demonstrate stable gradient descent over a range of ML architectures using only forward mode AD.

Contributions:

• We define the "forward gradient", an estimator of the gradient that we prove to be unbiased, based on forward mode AD without backpropagation.

• We implement a forward AD system from scratch in PyTorch, entirely independent of the reverse AD implementation already present in this library.

• We use forward gradients in stochastic gradient descent (SGD) optimization of a range of architectures, and show that a typical modern ML training pipeline can be constructed with only forward AD and no backpropagation.

• We compare the runtime and loss performance characteristics of forward gradients and backpropagation, and demonstrate speedups of up to twice as fast compared with backpropagation in some cases.

Affiliations: 1 Department of Computer Science, University of Oxford; 2 Department of Computer Science, National University of Ireland Maynooth; 3 Microsoft; 4 Computer Science Department, University of British Columbia; 5 Department of Engineering Science, University of Oxford. Correspondence to: Atılım Güneş Baydin <[email protected]>. Under Review.

[1] We represent ∇f as a column vector.

A note on naming: When naming the technique, it is tempting to adopt names like "forward propagation" or "forward-prop" to contrast it with backpropagation. We do not use this name as it is commonly used to refer to the forward evaluation phase of backpropagation, distinct from forward AD. We observe that the simple name "forward gradient" is currently not used in ML, and it also captures the aspect that we are presenting a drop-in replacement for the gradient.

2. Background

In order to introduce our method, we start by briefly reviewing the two main modes of automatic differentiation.

2.1. Forward Mode AD

[Schematic: θ and v enter a single forward evaluation, which outputs f(θ) and Jf(θ) v.]

Given a function f : R^n → R^m and the values θ ∈ R^n, v ∈ R^n, forward mode AD computes f(θ) and the Jacobian–vector product[2] Jf(θ) v, where Jf(θ) ∈ R^{m×n} is the Jacobian matrix of all partial derivatives of f evaluated at θ, and v is a vector of perturbations.[3] For the case of f : R^n → R the Jacobian–vector product corresponds to a directional derivative ∇f(θ) · v, which is the projection of the gradient ∇f at θ onto the direction vector v, representing the rate of change along that direction.

It is important to note that the forward mode evaluates the function f and its Jacobian–vector product Jf v simultaneously in a single forward run. Also note that Jf v is obtained without having to compute the Jacobian Jf, a feature sometimes referred to as a matrix-free computation.[4]

[2] Popularized recently as a jvp operation in tensor frameworks such as JAX (Bradbury et al., 2018).
[3] Also called "tangents".
[4] The full Jacobian J can be computed with forward AD using n forward evaluations of J e_i, i = 1, ..., n, using standard basis vectors e_i, so that each forward run gives us a single column of J.
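As a concrete illustration of a Jacobian–vector product evaluated in a single forward pass, the sketch below uses torch.func.jvp, the forward-mode entry point available in recent PyTorch releases. This is an assumption made for illustration only; the forward AD engine used in this paper (Section 6) predates this API and does not rely on it, and the toy function f is ours.

```python
import torch
from torch.func import jvp  # forward-mode AD in recent PyTorch; not the paper's own engine

def f(theta):
    # Toy f : R^3 -> R^2, standing in for an arbitrary differentiable function.
    return torch.stack([theta.prod(), (theta ** 2).sum()])

theta = torch.randn(3)  # point of evaluation
v = torch.randn(3)      # perturbation ("tangent") vector

# A single forward evaluation returns both f(theta) and Jf(theta) v,
# without ever materializing the full Jacobian.
out, jvp_out = jvp(f, (theta,), (v,))
print(out.shape, jvp_out.shape)  # both live in R^2
```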
2.2. Reverse Mode AD

[Schematic: a forward evaluation of f at θ, followed by a backward pass that takes v and outputs v^T Jf(θ).]

Given a function f : R^n → R^m and the values θ ∈ R^n, v ∈ R^m, reverse mode AD computes f(θ) and the vector–Jacobian product[5] v^T Jf(θ), where Jf ∈ R^{m×n} is the Jacobian matrix of all partial derivatives of f evaluated at θ, and v ∈ R^m is a vector of adjoints. For the case of f : R^n → R and v = 1, reverse mode computes the gradient, i.e., the partial derivatives of f w.r.t. all n inputs, ∇f(θ) = [∂f/∂θ_1, ..., ∂f/∂θ_n]^T.

Note that v^T Jf is computed in a single forward–backward evaluation, without having to compute the Jacobian Jf.[6]

[5] Popularized recently as a vjp operation in tensor frameworks such as JAX (Bradbury et al., 2018).
[6] The full Jacobian J can be computed with reverse AD using m evaluations of e_i^T J, i = 1, ..., m, using standard basis vectors e_i, so that each run gives us a single row of J.
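For contrast, a minimal sketch of the corresponding vector–Jacobian product, written with torch.func.vjp. Again, this recent PyTorch API and the toy functions are assumptions for illustration; the backpropagation baselines in this paper use PyTorch's standard autograd instead.

```python
import torch
from torch.func import vjp

def f(theta):
    # Same toy f : R^3 -> R^2 as before.
    return torch.stack([theta.prod(), (theta ** 2).sum()])

theta = torch.randn(3)
out, pullback = vjp(f, theta)   # forward pass that records what the backward pass needs
u = torch.randn(2)              # adjoint vector v in R^m
(vjp_out,) = pullback(u)        # u^T Jf(theta), a vector in R^n

# For a scalar objective, an adjoint of 1 recovers the full gradient:
def loss(theta):
    return (theta ** 2).sum()

val, pull = vjp(loss, theta)
(grad,) = pull(torch.ones_like(val))  # equals the gradient 2 * theta
```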
2.3. Runtime Cost

The runtime costs of both modes of AD are bounded by a constant multiple of the time it takes to run the function f we are differentiating (Griewank & Walther, 2008). Reverse mode has a higher cost than forward mode, because it involves data-flow reversal and it needs to keep a record (a "tape", stack, or graph) of the results of operations encountered in the forward pass, because these are needed in the evaluation of derivatives in the backward pass that follows. The memory and computation cost characteristics ultimately depend on the features implemented by the AD system, such as exploiting sparsity (Gebremedhin et al., 2005) or checkpointing (Siskind & Pearlmutter, 2018).

The cost can be analyzed by assuming computational complexities of elementary operations such as fetches, stores, additions, multiplications, and nonlinear operations (Griewank & Walther, 2008). Denoting the time it takes to evaluate the original function f as runtime(f), we can express the time taken by the forward and reverse modes as Rf × runtime(f) and Rb × runtime(f) respectively. In practice, Rf is typically between 1 and 3, and Rb is typically between 5 and 10 (Hascoët, 2014), but these are highly program dependent.

Note that in ML the original function corresponds to the execution of the ML code without any derivative computation or training, i.e., just evaluating a given model with input data.[7] We will call this "base runtime" in this paper.

[7] Sometimes called "inference" by practitioners.

3. Method

3.1. Forward Gradients

Definition 1. Given a function f : R^n → R, we define the "forward gradient" g : R^n → R^n as

    g(θ) = (∇f(θ) · v) v ,    (1)

where θ ∈ R^n is the point at which we are evaluating the gradient, v ∈ R^n is a perturbation vector taken as a multivariate random variable v ∼ p(v) such that v's scalar components v_i are independent and have zero mean and unit variance for all i, and ∇f(θ) · v ∈ R is the directional derivative of f at point θ in direction v.

We first talk briefly about the intuition that led to this definition, before showing that g(θ) is an unbiased estimator of the gradient ∇f(θ) in Section 3.2.

As explained in Section 2, forward mode gives us the directional derivative ∇f(θ) · v = Σ_i (∂f/∂θ_i) v_i directly, without having to compute ∇f. Computing ∇f using only forward mode is possible by evaluating f forward n times with direction vectors taken as standard basis (or one-hot) vectors e_i ∈ R^n, i = 1, ..., n, where e_i denotes a vector with a 1 in the ith coordinate and 0s elsewhere. This allows the evaluation of the sensitivity of f w.r.t. each input ∂f/∂θ_i separately, which, when combined, gives us the gradient ∇f.

In order to have any chance of runtime advantage over backpropagation, we need to work with a single run of the forward mode per optimization iteration, not n runs.[8] In a single forward run, we can interpret the direction v as a weight vector in a weighted sum of sensitivities w.r.t. each input, that is Σ_i (∂f/∂θ_i) v_i, albeit without the possibility of distinguishing the contribution of each θ_i in the final total. We therefore use the weight vector v to attribute the overall sensitivity back to each individual parameter θ_i, proportional to the weight v_i of each parameter θ_i (e.g., a parameter with a small weight had a small contribution and a large one had a large contribution in the total sensitivity).

[8] This requirement can be relaxed depending on the problem setting, and we would expect the gradient estimation to get better with more forward runs per optimization iteration.

In summary, each time the forward gradient is evaluated, we simply do the following (a code sketch of these steps is given after the list):

• Sample a random perturbation vector v ∼ p(v), which has the same size as f's argument θ.

• Run f via forward-mode AD, which evaluates f(θ) and ∇f(θ) · v simultaneously in the same single forward run, without having to compute ∇f at all in the process. The directional derivative obtained, ∇f(θ) · v, is a scalar, and is computed exactly by AD (not an approximation).

• Multiply the scalar directional derivative ∇f(θ) · v with vector v and obtain g(θ), the forward gradient.
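A minimal sketch of this three-step procedure, using torch.func.jvp as a stand-in for a forward-mode AD engine (an assumption; the paper implements its own operator-overloading engine, see Section 6):

```python
import torch
from torch.func import jvp

def forward_gradient(f, theta):
    """One forward-gradient evaluation g(theta) = (grad f(theta) . v) v."""
    v = torch.randn_like(theta)        # step 1: v ~ N(0, I)
    _, d = jvp(f, (theta,), (v,))      # step 2: f(theta) and the exact directional derivative
    return d * v                       # step 3: scale the sampled direction by the scalar d

# Example: g = forward_gradient(lambda t: (t ** 2).sum(), torch.randn(10))
```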
Figure 1 illustrates the process, showing several evaluations of the forward gradient for the Beale function. We see how perturbations v_k (orange) transform into forward gradients (∇f · v_k) v_k (blue) for k ∈ [1, 5], sometimes reversing the sense to point towards the true gradient (red) while being constrained in orientation. The green arrow shows a Monte Carlo gradient estimate via averaged forward gradients, i.e., (1/K) Σ_{k=1..K} (∇f · v_k) v_k ≈ E[(∇f · v) v].

Figure 1. Five samples of forward gradient, the empirical mean of these five samples, and the true gradient for the Beale function (Section 5.1) at x = 1.5, y = −0.1. Star marks the global minimum.
[Legend entries: perturbations v_k; forward gradients (∇f · v_k) v_k; forward gradient (empirical mean); true gradient ∇f.]

3.2. Proof of Unbiasedness

Theorem 1. The forward gradient g(θ) is an unbiased estimator of the gradient ∇f(θ).

Proof. We start with the directional derivative of f evaluated at θ in direction v written out as follows:

    d(θ, v) = ∇f(θ) · v = Σ_i (∂f/∂θ_i) v_i
            = (∂f/∂θ_1) v_1 + (∂f/∂θ_2) v_2 + ··· + (∂f/∂θ_n) v_n .    (2)

We then expand the forward gradient g in Eq. (1) as

    g(θ) = d(θ, v) v
         = [ (∂f/∂θ_1) v_1^2  + (∂f/∂θ_2) v_1 v_2 + ··· + (∂f/∂θ_n) v_1 v_n ,
             (∂f/∂θ_1) v_1 v_2 + (∂f/∂θ_2) v_2^2  + ··· + (∂f/∂θ_n) v_2 v_n ,
             ... ,
             (∂f/∂θ_1) v_1 v_n + (∂f/∂θ_2) v_2 v_n + ··· + (∂f/∂θ_n) v_n^2 ]^T

and note that the components of g have the following form:

    g_i(θ) = (∂f/∂θ_i) v_i^2 + Σ_{j≠i} (∂f/∂θ_j) v_i v_j .    (3)

The expected value of each component g_i is

    E[g_i(θ)] = E[ (∂f/∂θ_i) v_i^2 + Σ_{j≠i} (∂f/∂θ_j) v_i v_j ]
              = E[ (∂f/∂θ_i) v_i^2 ] + E[ Σ_{j≠i} (∂f/∂θ_j) v_i v_j ]
              = E[ (∂f/∂θ_i) v_i^2 ] + Σ_{j≠i} E[ (∂f/∂θ_j) v_i v_j ]
              = (∂f/∂θ_i) E[v_i^2] + Σ_{j≠i} (∂f/∂θ_j) E[v_i v_j] ,    (4)

where the partial derivatives factor out of the expectations because they are constants at a fixed θ. The first expected value in Eq. (4) is of a squared random variable, and all expectations in the summation term are of two independent and identically distributed random variables multiplied.

Lemma 1. The expected value of a random variable v squared is E[v^2] = 1 when E[v] = 0 and Var[v] = 1.

Proof. Variance is Var[v] = E[(v − E[v])^2] = E[v^2] − E[v]^2. Rearranging and substituting E[v] = 0 and Var[v] = 1, we get E[v^2] = E[v]^2 + Var[v] = 0 + 1 = 1.

Lemma 2. The expected value of two i.i.d. random variables multiplied is E[v_i v_j] = 0 when E[v_i] = 0 or E[v_j] = 0.

Proof. For i.i.d. v_i and v_j the expected value is E[v_i v_j] = E[v_i] E[v_j] = 0 when E[v_i] = 0 or E[v_j] = 0.

Using Lemmas 1 and 2, Eq. (4) reduces to

    E[g_i(θ)] = ∂f/∂θ_i    (5)

and therefore

    E[g(θ)] = ∇f(θ) .    (6)
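The unbiasedness claim can also be checked numerically. The sketch below (an illustration of ours, not part of the paper's experiments) averages many forward gradients of the Beale function at the point used in Figure 1 and compares the empirical mean against the reverse-mode gradient:

```python
import torch
from torch.func import jvp

def beale(xy):
    x, y = xy[0], xy[1]
    return ((1.5 - x + x * y) ** 2
            + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)

theta = torch.tensor([1.5, -0.1])

# Reference gradient via reverse mode (backpropagation).
t = theta.clone().requires_grad_(True)
(true_grad,) = torch.autograd.grad(beale(t), t)

# Empirical mean of K forward gradients (grad f . v_k) v_k with v_k ~ N(0, I).
def forward_gradient(f, theta):
    v = torch.randn_like(theta)
    _, d = jvp(f, (theta,), (v,))
    return d * v

K = 10_000
estimate = torch.stack([forward_gradient(beale, theta) for _ in range(K)]).mean(dim=0)
print(true_grad, estimate)  # the two vectors should agree up to Monte Carlo noise
```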
3.3. Forward Gradient Descent

We construct a forward gradient descent (FGD) algorithm by replacing the gradient ∇f in standard GD with the forward gradient g (Algorithm 1). In practice we use a mini-batch stochastic version of this, where f_t changes per iteration as it depends on each mini-batch of data used during training.

Algorithm 1: Forward gradient descent (FGD)
Require: η: learning rate
Require: f: objective function
Require: θ_0: initial parameter vector
    t ← 0                                    ▷ Initialize
    while θ_t not converged do
        t ← t + 1
        v_t ∼ N(0, I)                        ▷ Sample perturbation
        f_t, d_t ← f(θ_t), ∇f(θ_t) · v_t     ▷ Forward AD (Section 3.1); computes f_t and d_t simultaneously, without computing ∇f
        g_t ← v_t d_t                        ▷ Forward gradient
        θ_{t+1} ← θ_t − η g_t                ▷ Parameter update
    end while
    return θ_t
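A direct transcription of Algorithm 1 into Python, using torch.func.jvp for the forward-mode step. This is a sketch under the assumption that the objective is a function of a single parameter tensor; the paper's experiments instead use a custom forward AD engine and mini-batched objectives.

```python
import torch
from torch.func import jvp

def fgd(f, theta0, lr=0.01, steps=1_000):
    """Forward gradient descent (Algorithm 1) on a scalar objective f."""
    theta = theta0.clone()
    for _ in range(steps):
        v = torch.randn_like(theta)        # v_t ~ N(0, I)
        f_t, d_t = jvp(f, (theta,), (v,))  # f(theta_t) and d_t = directional derivative, one forward pass
        g_t = d_t * v                      # forward gradient
        theta = theta - lr * g_t           # parameter update
    return theta

# Example: a simple quadratic whose minimizer is the vector of all 3s.
theta_star = fgd(lambda t: ((t - 3.0) ** 2).sum(), torch.randn(10), lr=0.02, steps=1_000)
print(theta_star)  # should approach 3 in every coordinate, up to forward-gradient noise
```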
We note that the directional derivative d_t in Algorithm 1 can have positive or negative sign. When the sign is negative, the forward gradient g_t corresponds to backtracking from the direction of v_t, or reversing the direction to point towards the true gradient in expectation. Figure 1 shows two v_k samples exemplifying this behavior.

In this paper we limit our scope to FGD to clearly study this fundamental algorithm and compare it to standard backpropagation, without confounding factors such as momentum or adaptive learning rate schemes. We believe that extensions of the method to other families of gradient-based optimization algorithms are possible.

3.4. Choice of Direction Distribution

As shown by the proof in Section 3.2, the multivariate distribution p(v) from which direction vectors v are sampled must have two properties: (1) the components must be independent from each other (e.g., a diagonal Gaussian) and (2) the components must have zero mean and unit variance.

In our experiments we use the multivariate standard normal as the direction distribution p(v), so that v ∼ N(0, I), that is, v_i ∼ N(0, 1) are independent for all i. We leave exploring other admissible distributions for future work.

4. Related Work

The idea of performing optimization by the use of random perturbations, thus avoiding adjoint computations, is the intuition behind a variety of approaches, including simulated annealing (Kirkpatrick et al., 1983), stochastic approximation (Spall et al., 1992), stochastic convex optimization (Nesterov & Spokoiny, 2017; Dvurechensky et al., 2021), and correlation-based learning methods (Barto et al., 1983), which lend themselves to efficient hardware implementation (Alspector et al., 1988). Our work here falls in the general class of so-called weight perturbation methods; see Pearlmutter (1994, §4.4) for an overview along with a description of a method for efficiently gathering second-order information during the perturbative process, which suggests that accelerated second-order variants of the present method may be feasible. Note that our method is novel in avoiding the truncation error of previous weight perturbation approaches by using AD rather than small but finite perturbations, thus completely avoiding the method of divided differences and its associated numeric issues.

In the neural network literature, proposed alternatives to backpropagation include target propagation (LeCun, 1986; 1987; Bengio, 2014; 2020; Meulemans et al., 2020), a technique that propagates target values rather than gradients backwards between layers. For recurrent neural networks (RNNs), various approaches to the online credit assignment problem have features in common with forward mode AD (Pearlmutter, 1995). An early example is the real-time recurrent learning (RTRL) algorithm (Williams & Zipser, 1989), which accumulates local sensitivities in an RNN during forward execution, in a manner similar to forward AD. A very recent example in the RTRL area is an anonymous submission we identified at the time of drafting this manuscript, where the authors are using directional derivatives to improve several gradient estimators, e.g., synthetic gradients (Jaderberg et al., 2017) and first-order meta-learning (Nichol et al., 2018), as applied to RNNs (Anonymous, 2022).

Coordinate descent (CD) algorithms (Wright, 2015) have a structure where in each optimization iteration only a single component ∂f/∂θ_i of the gradient ∇f is used to compute an update. Nesterov (2012) provides an extension of CD called random coordinate descent (RCD), based on coordinate directional derivatives, where the directions are constrained to randomly chosen coordinate axes in the function's domain, as opposed to the arbitrary directions we use in our method. A recent use of RCD is by Ding & Li (2021) in Langevin Monte Carlo sampling, where the authors report no computational gain as the RCD needs to be run multiple times per iteration in order to achieve a preset error tolerance. The SEGA (SkEtched GrAdient) method by Hanzely et al. (2018) is based on gradient estimation via random linear transformations of the gradient, called a "sketch", computed using finite differences. Jacobian sketching by Gower et al. (2018b) is designed to provide good estimates of the Jacobian, in a manner similar to how quasi-Newton methods update Hessian estimates (Gower et al., 2018a).

Lastly, there are other, and more distantly related, approaches concerning gradient estimation, such as synthetic gradients (Jaderberg et al., 2017), motivated by a need to break the sequential forward–backward structure of backpropagation, and Monte Carlo gradient estimation (Mohamed et al., 2020), where the gradient of an expectation of a function is computed with respect to the parameters defining the distribution that is integrated.

For a review of the origins of reverse mode AD and backpropagation, we refer the interested readers to Schmidhuber (2020) and Griewank (2012).

5. Experiments

We implement forward AD in PyTorch to perform the experiments (details given in Section 6). In all experiments, except in Section 5.1, we use learning rate decay with η_i = η_0 e^{−ik}, where η_i is the learning rate at iteration i, η_0 is the initial learning rate, and k = 10^−4. In all experiments we run forward gradients and backpropagation for an equal number of iterations. We run the code with CUDA on a Nvidia Titan XP GPU and use a minibatch size of 64.
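The learning rate schedule above is a simple exponential decay; a one-line sketch, using η_0 = 10^−4 only as an example value:

```python
import math

eta_0, k = 1e-4, 1e-4                      # initial learning rate and decay constant
eta = lambda i: eta_0 * math.exp(-i * k)   # learning rate at iteration i
```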
First we look at test functions for optimization, and compare the behavior of forward gradient and backpropagation in the R^2 space where we can plot and follow optimization trajectories. We then share results of experiments with training ML architectures of increasing complexity. We measured no practical difference in memory usage between the two methods (less than 0.1% difference in each experiment).

5.1. Optimization Trajectories of Test Functions

In Figure 2 we show the results of experiments with the following two test functions, which are also written out in code after the list:

• the Beale function, f(x, y) = (1.5 − x + xy)^2 + (2.25 − x + xy^2)^2 + (2.625 − x + xy^3)^2,

• the Rosenbrock function, f(x, y) = (a − x)^2 + b(y − x^2)^2, where a = 1, b = 100.
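For reference, both test functions written as plain PyTorch code (our own transcription; variable names are assumptions):

```python
import torch

def beale(xy):
    x, y = xy[0], xy[1]
    return ((1.5 - x + x * y) ** 2
            + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)

def rosenbrock(xy, a=1.0, b=100.0):
    x, y = xy[0], xy[1]
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

# Either function can be passed directly to a forward-gradient step or to the
# fgd sketch shown after Algorithm 1, e.g.
#   fgd(rosenbrock, torch.tensor([-1.0, 1.0]), lr=5e-4, steps=25_000)
```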
Note that forward gradient and backpropagation have roughly similar time complexity in these cases, forward gradient being slightly faster per iteration. Crucially, we see that forward gradient steps behave the same way as backpropagation in expectation, as seen in the loss per iteration (leftmost) and optimization trajectory (rightmost) plots.

Figure 2. Comparison of forward gradient and backpropagation in test functions, showing ten independent runs. Top row: Beale function, learning rate 0.01. Bottom row: Rosenbrock function, learning rate 5 × 10^−4. Rightmost column: optimization trajectories in each function's domain, shown over contour plots of the functions. Star symbol marks the global minimum in the contour plots.
[Panel titles indicate 1,000 iterations for Beale and 25,000 for Rosenbrock, with no learning rate decay.]

5.2. Empirical Measures of Complexity

In order to compare the two algorithms applied to ML problems in the rest of this section, we use several measures.

For runtime comparison we use the Rf and Rb factors defined in Section 2.3. In order to compute these factors, we measure runtime(f) as the time it takes to run a given architecture with a sample minibatch of data and compute the loss, without performing any derivative computation and parameter update. Note that in the measurements of Rf and Rb, the time taken by gradient descent (parameter updates) is included, in addition to the time spent in computing the derivatives. We also introduce the ratio Rf/Rb as a measure of the runtime cost of the forward gradient relative to the cost of backpropagation in a given architecture.

In order to compare loss performance, we define Tb as the time at which the lowest validation loss is achieved by backpropagation (averaged over runs). Tf is the time at which the same validation loss is achieved by the forward gradient for the same architecture. The Tf/Tb ratio gives us a measure of the time it takes for the forward mode to achieve the minimum validation loss relative to the time taken by backpropagation.
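A sketch of how such factors could be measured. The model, data, and step functions below are toy stand-ins of ours, not the paper's pipeline, and the crude wall-clock timing shown here would need torch.cuda.synchronize() calls around the timer for GPU runs like those in the paper:

```python
import time
import torch

def timed(fn, repeats=100):
    # Crude wall-clock average of fn().
    t0 = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - t0) / repeats

# Toy stand-ins: a linear classifier on fake MNIST-shaped data.
model = torch.nn.Linear(784, 10)
x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))
loss_fn = torch.nn.CrossEntropyLoss()

def base_step():          # runtime(f): forward pass and loss only, no derivatives or update
    with torch.no_grad():
        loss_fn(model(x), y)

def backprop_step():      # Rb measurement: backpropagation plus an SGD parameter update
    model.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= 1e-4 * p.grad

base = timed(base_step)
R_b = timed(backprop_step) / base
# A forward-gradient step (directional derivative plus update) would be timed the same way to
# obtain R_f, and then R_f / R_b and T_f / T_b are formed as described above.
print(base, R_b)
```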

5.3. Logistic Regression

Figure 3 gives the results of several runs of multinomial logistic regression for MNIST digit classification. We observe that the runtime costs of the forward gradient and backpropagation relative to the base runtime are Rf = 2.435 and Rb = 4.389, which are compatible with what one would expect from a typical AD system (Section 2.3). The ratios Rf/Rb = 0.555 and Tf/Tb = 0.553 indicate that the forward gradient is roughly twice as fast as backpropagation in both runtime and loss performance. In this simple problem these ratios coincide as both techniques have nearly identical behavior in the loss per iteration space, meaning that the runtime benefit is reflected almost directly in the loss per time space. In more complex models in the following subsections we will see that the relative loss and runtime ratios can be different in practice.

Figure 3. Comparison of forward gradient and backpropagation in logistic regression, showing five independent runs. Learning rate 10^−4.
[Figure annotations: Tf = 197.326 s, Tb = 356.756 s, Tf/Tb = 0.553; Rf = 2.435, Rb = 4.389, Rf/Rb = 0.555; batch size 64, learning rate decay 10^−4.]

5.4. Multi-Layer Neural Networks

Figure 4 shows two experiments with a multi-layer neural network (NN) for MNIST classification with different learning rates. The architecture we use has three fully-connected layers of size 1024, 1024, 10, with ReLU activation after the first two layers. In this model architecture, we observe the runtime costs of the forward gradient and backpropagation relative to the base runtime as Rf = 2.468 and Rb = 4.165, and the relative measure Rf/Rb = 0.592 on average. These are roughly the same as in the logistic regression case.
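A sketch of the described architecture in PyTorch (our own transcription; the paper's experiments build it on top of their custom forward AD engine rather than on nn.Module autograd):

```python
import torch.nn as nn

# Three fully-connected layers of size 1024, 1024, 10; ReLU after the first two.
mlp = nn.Sequential(
    nn.Flatten(),                 # 28x28 MNIST images -> 784 features
    nn.Linear(784, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)
```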
The top row (learning rate 2 × 10^−5) shows a result where forward gradient and backpropagation behave nearly identically in loss per iteration (leftmost plot), resulting in a Tf/Tb ratio close to Rf/Rb. We show this result to communicate an example where the behavior is similar to the one we observed for logistic regression, where the loss per iteration behavior of the two techniques is roughly the same and the runtime benefit is the main contributing factor in the loss per time behavior (second plot from the left).

Interestingly, in the second experiment (learning rate 2 × 10^−4) we see that forward gradient achieves faster descent in the loss per iteration plot. We believe that this behavior is due to the different nature of stochasticity between the regular SGD (backpropagation) and the forward SGD algorithms, and we speculate that the noise introduced by forward gradients might be beneficial in exploring the loss surface. When we look at the loss per time plot, which also incorporates the favorable runtime of the forward mode, we see a loss performance metric Tf/Tb value of 0.211, representing a case that is more than four times as fast as backpropagation in achieving the reference validation loss.

Figure 4. Comparison of forward gradient and backpropagation for the multi-layer NN, showing two learning rates. Top row: learning rate 2 × 10^−5. Bottom row: learning rate 2 × 10^−4. Showing five independent runs per experiment.
[Figure annotations: top row Tf = 165.842 s, Tb = 292.359 s, Tf/Tb = 0.567, Rf = 2.329, Rb = 3.971, Rf/Rb = 0.586; bottom row Tf = 68.014 s, Tb = 296.957 s, Tf/Tb = 0.229, Rf = 2.607, Rb = 4.359, Rf/Rb = 0.598. Panels show train/validation loss per iteration and per time, runtime per iteration, and validation accuracy per time.]

5.5. Convolutional Neural Networks

In Figure 5 we show a comparison between the forward gradient and backpropagation for a convolutional neural network (CNN) on the same MNIST classification task. The CNN has four convolutional layers with 3 × 3 kernels and 64 channels, followed by two linear layers of sizes 1024 and 10. All convolutions and the first linear layer are followed by ReLU activation, and there are two max-pooling layers with 2 × 2 kernels after the second and fourth convolutions.
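A sketch of this CNN in PyTorch (our transcription; padding is not stated in the paper, so the flattened size below assumes unpadded 3 × 3 convolutions on 28 × 28 MNIST inputs, which yields a 4 × 4 × 64 feature map):

```python
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 1024), nn.ReLU(),   # spatial sizes: 28 -> 26 -> 24 -> 12 -> 10 -> 8 -> 4
    nn.Linear(1024, 10),
)
```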
In this architecture we observe the best forward AD performance with respect to the base runtime, where the forward mode has Rf = 1.434, representing an overhead of only 43% on top of the base runtime. Backpropagation with Rb = 2.211 is very close to the ideal case one can expect from a reverse AD system, taking roughly double the time. Rf/Rb = 0.649 represents a significant benefit for the forward AD runtime with respect to backpropagation. In loss space, we get a ratio Tf/Tb = 0.514, which shows that forward gradients are close to twice as fast as backpropagation in achieving the reference level of validation loss.

Figure 5. Comparison of forward gradient and backpropagation for the CNN. Learning rate 2 × 10^−4. Showing five independent runs.
[Figure annotations: Tf = 362.247 s, Tb = 704.369 s, Tf/Tb = 0.514; Rf = 1.434, Rb = 2.211, Rf/Rb = 0.649; batch size 64, learning rate decay 10^−4.]

5.6. Scalability

The results in the previous subsections demonstrate that

• training without backpropagation can feasibly work within a typical ML training pipeline and do so in a computationally competitive way, and

• forward AD can even beat backpropagation in loss decrease per training time for the same choice of hyperparameters (learning rate and learning rate decay).

In order to investigate whether these results will scale to larger NNs with more layers, we measure runtime cost and memory usage as a function of NN size. In Figure 6 we show the results for the MLP architecture (Section 5.4), where we run experiments with an increasing number of layers in the range [1, 100]. The linear layers are of size 1,024, with no bias. We use a mini-batch size of 64 as before.

Looking at the cost relative to the base runtime, which also changes as a function of the number of layers, we see that backpropagation remains within Rb ∈ [4, 5] and forward gradient remains within Rf ∈ [3, 4] for a large proportion of the experiments. We also observe that forward gradients remain favorable for the whole range of layer sizes considered, with the Rf/Rb ratio staying below 0.6 up to ten layers and going slightly over 0.8 at 100 layers. Importantly, there is virtually no difference in memory consumption between the two methods.
Figure 6. Comparison of how the runtime cost and memory usage of forward gradients and backpropagation scale as a function of NN depth for the MLP architecture, where each layer is of size 1024. Showing mean and standard deviation over ten independent runs.
[Panel title: MLP without bias, SGD, layer size 1,024, batch size 64, ten runs, number of layers in {1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100}. Panels show runtime cost (Rf, Rb), relative runtime cost Rf/Rb, and memory use in bytes.]

6. Implementation

We implement a forward-mode AD system in Python and base this on PyTorch tensors in order to enable a fair comparison with a typical backpropagation pipeline in PyTorch, which is widely used by the ML community.[9] We release our implementation publicly.[10]

[9] We also experimented with the forward mode implementation in JAX (Bradbury et al., 2018) but decided to base our implementation on PyTorch due to its maturity and simplicity, allowing us to perform a clear comparison.
[10] To be shared in the upcoming revision.

Our forward-mode AD engine is implemented from scratch using operator overloading and non-differentiable PyTorch tensors (requires_grad=False) as a building block. This means that our forward AD implementation does not use PyTorch's reverse-mode implementation (called "autograd") and computation graph. We produce the backpropagation results in experiments using PyTorch's existing reverse-mode code (requires_grad=True and .backward()) as usual.
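To make the operator-overloading idea concrete, here is a minimal scalar dual-number sketch of the same principle (ours, for exposition only; the paper's engine operates on whole PyTorch tensors and covers a far larger set of operations):

```python
class Dual:
    """A value paired with a tangent (derivative part); arithmetic propagates both forward."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def _lift(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)
    __radd__ = __add__

    def __mul__(self, other):
        other = self._lift(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val, self.tan * other.val + self.val * other.tan)
    __rmul__ = __mul__

def directional_derivative(f, theta, v):
    """Seed each input with its component of v; the output tangent is exactly grad f(theta) . v."""
    return f([Dual(t, vi) for t, vi in zip(theta, v)]).tan

# Example: f(x, y) = x*x + 3*x*y at (2, 1), direction (1, 0) -> df/dx = 2x + 3y = 7
print(directional_derivative(lambda z: z[0] * z[0] + 3 * z[0] * z[1], [2.0, 1.0], [1.0, 0.0]))
```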
Note that empirical comparisons of the relative runtimes of forward- and reverse-mode AD are highly dependent on the implementation details in a given system and would show differences across different code bases. When implementing the forward mode of tensor operations common in ML (e.g., matrix multiplication, convolutions), we identified opportunities to make forward AD operations even more efficient (e.g., stacking channels of primal and derivative parts of tensors in a convolution). Note that the implementation we use in this paper does not currently have these. We expect the forward gradient performance to improve even further as high-quality forward-mode implementations find their way into mainstream ML libraries and get tightly integrated into tensor code.

Another implementation approach that can enable a straightforward application of forward gradients to existing code can be based on the complex-step method (Martins et al., 2003), a technique that can approximate directional derivatives with nothing but basic support for complex numbers.
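A sketch of the complex-step idea applied to a directional derivative (the standard complex-step trick, shown here with NumPy as an assumption; it requires the objective to be real-analytic and to accept complex inputs):

```python
import numpy as np

def complex_step_dir_deriv(f, theta, v, h=1e-20):
    """Approximates grad f(theta) . v as Im(f(theta + i*h*v)) / h, with no subtractive cancellation."""
    return np.imag(f(theta + 1j * h * v)) / h

# Example: f(theta) = sum(theta^2), so grad f . v = 2 * theta . v
f = lambda z: (z ** 2).sum()
theta, v = np.array([1.0, 2.0]), np.array([1.0, 0.0])
print(complex_step_dir_deriv(f, theta, v))   # approximately 2.0
```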
7. Conclusions

We have shown that a typical ML training pipeline can be constructed without backpropagation, using only forward AD, while still being computationally competitive. We expect this contribution to find use in distributed ML training, which is outside the scope of this paper. Furthermore, the runtime results we obtained with our forward AD prototype in PyTorch are encouraging, and we are cautiously optimistic that they might be the first step towards significantly decreasing the time taken to train ML architectures, or alternatively, enabling the training of more complex architectures with a given compute budget. We are excited to have the results confirmed and studied further by the research community.

The work presented here is the basis for several directions that we would like to follow. In particular, we are interested in working on gradient descent algorithms other than SGD, such as SGD with momentum, and adaptive learning rate algorithms such as Adam (Kingma & Ba, 2015). In this paper we deliberately excluded these to focus on the most isolated and clear case of SGD, in order to establish the technique and a baseline. We are also interested in experimenting with other ML architectures. The components used in our experiments (i.e., linear and convolutional layers, pooling, ReLU nonlinearity) are representative of the building blocks of many current architectures in practice, and we expect the results to apply to these as well.

Lastly, in the longer term we are interested in seeing whether the forward gradient algorithm can contribute to the mathematical understanding of the biological learning mechanisms in the brain, as backpropagation has been historically viewed as biologically implausible because it requires precise backward connectivity (Bengio et al., 2015; Lillicrap et al., 2016; 2020). In this context, one way to look at the role of the directional derivative in forward gradients is to interpret it as the feedback of a single global scalar quantity that is identical for all the computation nodes in the network.

We believe that forward AD has computational characteristics that are ripe for exploration by the ML community, and we expect that its addition to the conventional ML infrastructure will lead to major breakthroughs and new approaches.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.

Alspector, J., Allen, R., Hu, V., and Satyanarayana, S. Stochastic learning networks and their electronic implementation. In Anderson, D. (ed.), Neural Information Processing Systems. American Institute of Physics, 1988. URL https://proceedings.neurips.cc/paper/1987/file/f033ab37c30201f73f142449d037028d-Paper.pdf.

Anonymous. Learning by directional gradient descent. In Submitted to The Tenth International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=5i7lJLuhTm. Under review.

Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5):834–846, 1983. doi: 10.1109/TSMC.1983.6313077.

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research (JMLR), 18(153):1–43, 2018. URL http://jmlr.org/papers/v18/17-468.html.

Bengio, Y. How auto-encoders could provide credit assignment in deep networks via target propagation. arXiv preprint arXiv:1407.7906, 2014.

Bengio, Y. Deriving differential target propagation from iterating approximate inverses. arXiv preprint arXiv:2007.15139, 2020.

Bengio, Y., Lee, D.-H., Bornschein, J., Mesnard, T., and Lin, Z. Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156, 2015.

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., and Zhang, Q. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Deisenroth, M. P., Faisal, A. A., and Ong, C. S. Mathematics for Machine Learning. Cambridge University Press, 2020.

Ding, Z. and Li, Q. Langevin Monte Carlo: random coordinate descent and variance reduction. Journal of Machine Learning Research, 22(205):1–51, 2021.

Dvurechensky, P., Gorbunov, E., and Gasnikov, A. An accelerated directional derivative method for smooth stochastic convex optimization. European Journal of Operational Research, 290(2):601–621, 2021.

Gebremedhin, A. H., Manne, F., and Pothen, A. What color is your Jacobian? Graph coloring for computing derivatives. SIAM Review, 47(4):629–705, 2005.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. URL http://www.deeplearningbook.org.

Gower, R., Hanzely, F., Richtarik, P., and Stich, S. U. Accelerated stochastic matrix inversion: General theory and speeding up BFGS rules for faster second-order optimization. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018a.

Gower, R. M., Richtárik, P., and Bach, F. Stochastic quasi-gradient methods: Variance reduction via Jacobian sketching. arXiv preprint arXiv:1805.02632, 2018b.

Griewank, A. Who invented the reverse mode of differentiation? Documenta Mathematica, Extra Volume ISMP:389–400, 2012.

Griewank, A. and Walther, A. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM, 2008.

Hanzely, F., Mishchenko, K., and Richtárik, P. SEGA: Variance reduction via gradient sketching. arXiv preprint arXiv:1809.03054, 2018.

Hascoët, L. Adjoints by automatic differentiation. Advanced Data Assimilation for Geosciences: Lecture Notes of the Les Houches School of Physics: Special Issue, June 2012, pp. 349, 2014.

Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., Silver, D., and Kavukcuoglu, K. Decoupled neural interfaces using synthetic gradients. In International Conference on Machine Learning, pp. 1627–1635. PMLR, 2017.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

LeCun, Y. Learning process in an asymmetric threshold network. In Disordered Systems and Biological Organization, pp. 233–240. Springer, 1986.

LeCun, Y. Modèles connexionnistes de l'apprentissage (connectionist learning models). PhD thesis, Université P. et M. Curie (Paris 6), June 1987.

Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7(1):1–10, 2016.

Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J., and Hinton, G. Backpropagation and the brain. Nature Reviews Neuroscience, 21(6):335–346, 2020.

Linnainmaa, S. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's thesis (in Finnish), University of Helsinki, pp. 6–7, 1970.

Martins, J. R., Sturdza, P., and Alonso, J. J. The complex-step derivative approximation. ACM Transactions on Mathematical Software (TOMS), 29(3):245–262, 2003.

Meulemans, A., Carzaniga, F., Suykens, J., Sacramento, J., and Grewe, B. F. A theoretical framework for target propagation. Advances in Neural Information Processing Systems, 33:20024–20036, 2020.

Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A. Monte Carlo gradient estimation in machine learning. Journal of Machine Learning Research, 21(132):1–62, 2020.

Nesterov, Y. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

Nesterov, Y. and Spokoiny, V. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.

Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32:8026–8037, 2019.

Pearlmutter, B. A. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994. doi: 10.1162/neco.1994.6.1.147.

Pearlmutter, B. A. Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5):1212–1228, 1995.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.

Schmidhuber, J. Who invented backpropagation? 2020. URL https://people.idsia.ch/~juergen/who-invented-backpropagation.html.

Siskind, J. M. and Pearlmutter, B. A. Divide-and-conquer checkpointing for arbitrary programs with no user annotation. Optimization Methods and Software, 33(4-6):1288–1330, 2018.

Spall, J. C. et al. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332–341, 1992.

Wengert, R. E. A simple automatic derivative evaluation program. Communications of the ACM, 7(8):463–464, 1964.

Williams, R. J. and Zipser, D. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.

Wright, S. J. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.