Gradients Without Backpropagation
Atılım Güneş Baydin 1 Barak A. Pearlmutter 2 Don Syme 3 Frank Wood 4 Philip Torr 5
that a typical modern ML training pipeline can be constructed with only forward AD and no backpropagation.

• We compare the runtime and loss performance characteristics of forward gradients and backpropagation, and demonstrate speedups of up to twice as fast compared with backpropagation in some cases.

A note on naming: When naming the technique, it is tempting to adopt names like "forward propagation" or "forward-prop" to contrast it with backpropagation. We do not use this name as it is commonly used to refer to the forward evaluation phase of backpropagation, distinct from forward AD. We observe that the simple name "forward gradient" is currently not used in ML, and it also captures the aspect that we are presenting a drop-in replacement for the gradient.

2. Background

In order to introduce our method, we start by briefly reviewing the two main modes of automatic differentiation.

2.1. Forward Mode AD

[Diagram: (θ, v) → Forward → (f(θ), J_f(θ) v)]

Given a function f : R^n → R^m and the values θ ∈ R^n, v ∈ R^n, forward mode AD computes f(θ) and the Jacobian–vector product² J_f(θ) v, where J_f(θ) ∈ R^{m×n} is the Jacobian matrix of all partial derivatives of f evaluated at θ, and v is a vector of perturbations.³ For the case of f : R^n → R, the Jacobian–vector product corresponds to a directional derivative ∇f(θ) · v, which is the projection of the gradient ∇f at θ onto the direction vector v, representing the rate of change along that direction.

It is important to note that the forward mode evaluates the function f and its Jacobian–vector product J_f v simultaneously in a single forward run. Also note that J_f v is obtained without having to compute the Jacobian J_f, a feature sometimes referred to as a matrix-free computation.⁴
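To make the Jacobian–vector product concrete, here is a minimal sketch (ours, not the paper's code) that evaluates a scalar function and its directional derivative in one call using PyTorch's torch.autograd.functional.jvp; the function f and the evaluation point are illustrative. Note that this particular helper obtains the JVP via a double application of reverse mode under the hood, whereas newer PyTorch versions also provide torch.func.jvp, which uses true forward-mode AD like the engine described in Section 6.

    import torch
    from torch.autograd.functional import jvp

    def f(theta):
        # An illustrative scalar function f : R^n -> R.
        return (theta ** 2).sum() + torch.sin(theta).prod()

    theta = torch.randn(5)   # evaluation point
    v = torch.randn(5)       # direction (tangent / perturbation) vector

    # Returns f(theta) and J_f(theta) v in one evaluation; for scalar f this is
    # the directional derivative grad f(theta) . v, with no full gradient formed.
    value, dir_deriv = jvp(f, (theta,), (v,))
    print(value.item(), dir_deriv.item())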
2.2. Reverse Mode AD

[Diagram: θ → Forward → f(θ); v^⊤ J_f(θ) ← Backward ← v]

Given a function f : R^n → R^m and the values θ ∈ R^n, v ∈ R^m, reverse mode AD computes f(θ) and the vector–Jacobian product⁵ v^⊤ J_f(θ), where J_f ∈ R^{m×n} is the Jacobian matrix of all partial derivatives of f evaluated at θ, and v ∈ R^m is a vector of adjoints. For the case of f : R^n → R and v = 1, reverse mode computes the gradient, i.e., the partial derivatives of f w.r.t. all n inputs,

∇f(θ) = [∂f/∂θ_1, ..., ∂f/∂θ_n]^⊤.

Note that v^⊤ J_f is computed in a single forward–backward evaluation, without having to compute the Jacobian J_f.⁶

2.3. Runtime Cost

The runtime costs of both modes of AD are bounded by a constant multiple of the time it takes to run the function f we are differentiating (Griewank & Walther, 2008). Reverse mode has a higher cost than forward mode, because it involves data-flow reversal and it needs to keep a record (a "tape", stack, or graph) of the results of operations encountered in the forward pass, because these are needed in the evaluation of derivatives in the backward pass that follows. The memory and computation cost characteristics ultimately depend on the features implemented by the AD system, such as exploiting sparsity (Gebremedhin et al., 2005) or checkpointing (Siskind & Pearlmutter, 2018).

The cost can be analyzed by assuming computational complexities of elementary operations such as fetches, stores, additions, multiplications, and nonlinear operations (Griewank & Walther, 2008). Denoting the time it takes to evaluate the original function f as runtime(f), we can express the time taken by the forward and reverse modes as R_f × runtime(f) and R_b × runtime(f), respectively. In practice, R_f is typically between 1 and 3, and R_b is typically between 5 and 10 (Hascoët, 2014), but these are highly program dependent.

Note that in ML the original function corresponds to the execution of the ML code without any derivative computation or training, i.e., just evaluating a given model with input data.⁷ We will call this "base runtime" in this paper.

3. Method

3.1. Forward Gradients

Definition 1. Given a function f : R^n → R, we define the "forward gradient" g : R^n → R^n as

g(θ) = (∇f(θ) · v) v ,    (1)

² Popularized recently as a jvp operation in tensor frameworks such as JAX (Bradbury et al., 2018).
³ Also called "tangents".
⁴ The full Jacobian J can be computed with forward AD using n forward evaluations of J e_i, i = 1, ..., n, using standard basis vectors e_i so that each forward run gives us a single column of J.
⁵ Popularized recently as a vjp operation in tensor frameworks such as JAX (Bradbury et al., 2018).
⁶ The full Jacobian J can be computed with reverse AD using m evaluations of e_i^⊤ J, i = 1, ..., m, using standard basis vectors e_i so that each run gives us a single row of J.
⁷ Sometimes called "inference" by practitioners.
We first talk briefly about the intuition that led to this definition, before showing that g(θ) is an unbiased estimator of the gradient ∇f(θ) in Section 3.2.

As explained in Section 2, forward mode gives us the directional derivative ∇f(θ) · v = Σ_i (∂f/∂θ_i) v_i directly, without having to compute ∇f. Computing ∇f using only forward mode is possible by evaluating f forward n times with direction vectors taken as standard basis (or one-hot) vectors e_i ∈ R^n, i = 1 ... n, where e_i denotes a vector with a 1 in the ith coordinate and 0s elsewhere. This allows the evaluation of the sensitivity of f w.r.t. each input ∂f/∂θ_i separately, which, when combined, give us the gradient ∇f.

In order to have any chance of runtime advantage over backpropagation, we need to work with a single run of the forward mode per optimization iteration, not n runs.⁸ In a single forward run, we can interpret the direction v as a [...] sense to point towards the true gradient (red) while being constrained in orientation. The green arrow shows a Monte Carlo gradient estimate via averaged forward gradients, i.e., (1/K) Σ_{k=1}^{K} (∇f · v_k) v_k ≈ E[(∇f · v) v].

[Figure 1: plot in the (x, y) plane showing perturbations v_k, forward gradients (∇f · v_k) v_k, the forward gradient empirical mean, and the true gradient ∇f. Caption: Figure 1. Five samples of forward gradient, the empirical mean of these five samples, and the true gradient for the Beale function (Section 5.1) at x = 1.5, y = −0.1. Star marks the global minimum.]
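The following short sketch (our illustration, not the paper's code) checks this numerically for the Beale function at x = 1.5, y = −0.1, the setting of Figure 1: it draws K directions with components sampled from a standard normal (which makes E[(∇f · v) v] = ∇f), forms the forward gradients, and compares their empirical mean against the true gradient. For brevity the directional derivatives are taken as dot products with a reference gradient obtained by autograd; in the actual method each ∇f · v would instead come from a single forward-mode run.

    import torch

    def beale(p):
        x, y = p
        return ((1.5 - x + x * y) ** 2
                + (2.25 - x + x * y ** 2) ** 2
                + (2.625 - x + x * y ** 3) ** 2)

    theta = torch.tensor([1.5, -0.1], requires_grad=True)
    true_grad = torch.autograd.grad(beale(theta), theta)[0]  # reference only

    K = 100_000
    V = torch.randn(K, 2)                        # rows v_k ~ N(0, I)
    dir_derivs = V @ true_grad                   # stand-ins for forward-mode JVPs
    estimate = (dir_derivs[:, None] * V).mean(dim=0)  # mean of (grad . v_k) v_k

    print(true_grad)  # true gradient of the Beale function at (1.5, -0.1)
    print(estimate)   # empirical mean of K forward gradients (approximately equal)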
[...]ous approaches to the online credit assignment problem have features in common with forward mode AD (Pearlmutter, 1995). An early example is the real-time recurrent learning (RTRL) algorithm (Williams & Zipser, 1989), which accumulates local sensitivities in an RNN during forward execution, in a manner similar to forward AD. A very recent example in the RTRL area is an anonymous submission we identified at the time of drafting this manuscript, where the authors are using directional derivatives to improve several gradient estimators, e.g., synthetic gradients (Jaderberg et al., 2017) and first-order meta-learning (Nichol et al., 2018), as applied to RNNs (Anonymous, 2022).

Coordinate descent (CD) algorithms (Wright, 2015) have a structure where in each optimization iteration only a single component ∂f/∂θ_i of the gradient ∇f is used to compute an update. Nesterov (2012) provides an extension of CD called random coordinate descent (RCD), based on coordinate directional derivatives, where the directions are constrained to randomly chosen coordinate axes in the function's domain, as opposed to the arbitrary directions we use in our method. A recent use of RCD is by Ding & Li (2021) in Langevin Monte Carlo sampling, where the authors report no computational gain as the RCD needs to be run multiple times per iteration in order to achieve a preset error tolerance. The SEGA (SkEtched GrAdient) method by Hanzely et al. (2018) is based on gradient estimation via random linear transformations of the gradient, called a "sketch", computed using finite differences. Jacobian sketching by Gower et al. (2018b) is designed to provide good estimates of the Jacobian, in a manner similar to how quasi-Newton methods update Hessian estimates (Gower et al., 2018a).

Lastly, there are other, more distantly related, approaches concerning gradient estimation, such as synthetic gradients (Jaderberg et al., 2017), motivated by a need to break the sequential forward–backward structure of backpropagation, and Monte Carlo gradient estimation (Mohamed et al., 2020), where the gradient of an expectation of a function is computed with respect to the parameters defining the distribution that is integrated.

For a review of the origins of reverse mode AD and backpropagation, we refer the interested readers to Schmidhuber (2020) and Griewank (2012).

5. Experiments

We implement forward AD in PyTorch to perform the experiments (details given in Section 6). In all experiments, except in Section 5.1, we use learning rate decay with η_i = η_0 e^(−ik), where η_i is the learning rate at iteration i, η_0 is the initial learning rate, and k = 10⁻⁴. In all experiments we run forward gradients and backpropagation for an equal number of iterations. We run the code with CUDA on an Nvidia Titan XP GPU and use a minibatch size of 64.

First we look at test functions for optimization, and compare the behavior of forward gradient and backpropagation in the R² space where we can plot and follow optimization trajectories. We then share results of experiments with training ML architectures of increasing complexity. We measured no practical difference in memory usage between the two methods (less than 0.1% difference in each experiment).

5.1. Optimization Trajectories of Test Functions

In Figure 2 we show the results of experiments with

• the Beale function, f(x, y) = (1.5 − x + xy)² + (2.25 − x + xy²)² + (2.625 − x + xy³)²

• and the Rosenbrock function, f(x, y) = (a − x)² + b(y − x²)², where a = 1, b = 100.

Note that forward gradient and backpropagation have roughly similar time complexity in these cases, forward gradient being slightly faster per iteration. Crucially, we see that forward gradient steps behave the same way as backpropagation in expectation, as seen in the loss per iteration (leftmost) and optimization trajectory (rightmost) plots.

5.2. Empirical Measures of Complexity

In order to compare the two algorithms applied to ML problems in the rest of this section, we use several measures.

For runtime comparison we use the R_f and R_b factors defined in Section 2.3. In order to compute these factors, we measure runtime(f) as the time it takes to run a given architecture with a sample minibatch of data and compute the loss, without performing any derivative computation or parameter update. Note that in the measurements of R_f and R_b, the time taken by gradient descent (parameter updates) is included, in addition to the time spent in computing the derivatives. We also introduce the ratio R_f/R_b as a measure of the runtime cost of the forward gradient relative to the cost of backpropagation in a given architecture.

In order to compare loss performance, we define T_b as the time at which the lowest validation loss is achieved by backpropagation (averaged over runs). T_f is the time at which the same validation loss is achieved by the forward gradient for the same architecture. The T_f/T_b ratio gives us a measure of the time it takes for the forward mode to achieve the minimum validation loss relative to the time taken by backpropagation.
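As an illustration of how such measurements could be taken (a rough sketch under our own assumptions, not the paper's benchmarking harness), the snippet below times the base run and a backpropagation step for a small model on a fake minibatch; R_f would be obtained the same way by timing a forward-gradient step instead.

    import time
    import torch
    import torch.nn as nn

    # Illustrative setup: multinomial logistic regression on MNIST-shaped fake data.
    model = nn.Linear(784, 10)
    loss_fn = nn.CrossEntropyLoss()
    x = torch.randn(64, 784)                 # one minibatch of 64 samples
    y = torch.randint(0, 10, (64,))
    lr = 1e-4

    def timed(fn, repeats=100):
        # Average wall-clock time per call (add torch.cuda.synchronize() on a GPU).
        t0 = time.perf_counter()
        for _ in range(repeats):
            fn()
        return (time.perf_counter() - t0) / repeats

    def base_run():                          # forward pass and loss only
        with torch.no_grad():
            loss_fn(model(x), y)

    def backprop_step():                     # forward + backward + SGD update
        model.zero_grad()
        loss_fn(model(x), y).backward()
        with torch.no_grad():
            for p in model.parameters():
                p -= lr * p.grad

    base = timed(base_run)
    R_b = timed(backprop_step) / base
    print(base, R_b)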
[Figure 2: panels titled "Beale, lr=0.01, lr_decay=0.0, n=1,000" (top row) and "Rosenbrock, lr=0.0005, lr_decay=0.0, n=25,000" (bottom row); axes include loss f(x) vs. iterations, loss f(x) vs. time (s), time (s) vs. iterations, and (x, y) trajectories; legend: forward grad, backprop.]

Figure 2. Comparison of forward gradient and backpropagation in test functions, showing ten independent runs. Top row: Beale function, learning rate 0.01. Bottom row: Rosenbrock function, learning rate 5 × 10⁻⁴. Rightmost column: optimization trajectories in each function's domain, shown over contour plots of the functions. Star symbol marks the global minimum in the contour plots.
[Figure 3: panels titled "logreg, sgd, lr=0.0001, lr_decay=0.0001, batch_size=64"; axes include loss, iterations, and time (s).]

Figure 3. Comparison of forward gradient and backpropagation in logistic regression, showing five independent runs. Learning rate 10⁻⁴.
5.3. Logistic Regression

Figure 3 gives the results of several runs of multinomial logistic regression for MNIST digit classification. We observe that the runtime costs of the forward gradient and backpropagation relative to the base runtime are R_f = 2.435 and R_b = 4.389, which are compatible with what one would expect from a typical AD system (Section 2.3). The ratios R_f/R_b = 0.555 and T_f/T_b = 0.553 indicate that the forward gradient is roughly twice as fast as backpropagation in both runtime and loss performance. In this simple problem these ratios coincide as both techniques have nearly identical behavior in the loss per iteration space, meaning that the runtime benefit is reflected almost directly in the loss per time space. In the more complex models in the following subsections we will see that the relative loss and runtime ratios can be different in practice.

5.4. Multi-Layer Neural Networks

Figure 4 shows two experiments with a multi-layer neural network (NN) for MNIST classification with different learning rates. The architecture we use has three fully-connected layers of size 1024, 1024, 10, with ReLU activation after the first two layers. In this model architecture, we observe the runtime costs of the forward gradient and backpropagation relative to the base runtime as R_f = 2.468 and R_b = 4.165, and the relative measure R_f/R_b = 0.592 on average. These are roughly the same as in the logistic regression case.

The top row (learning rate 2 × 10⁻⁵) shows a result where forward gradient and backpropagation behave nearly identically in loss per iteration (leftmost plot), resulting in a T_f/T_b ratio close to R_f/R_b. We show this result to communicate an example where the behavior is similar to the one we observed for logistic regression, where the loss per iteration behavior of the two techniques is roughly the same and the runtime benefit is the main contributing factor in the loss per time behavior (second plot from the left).

Interestingly, in the second experiment (learning rate 2 × 10⁻⁴) we see that forward gradient achieves faster descent in the loss per iteration plot. We believe that this behavior is due to the different nature of stochasticity between the regular SGD (backpropagation) and the forward SGD algorithms, and we speculate that the noise introduced by forward gradients might be beneficial in exploring the loss surface. When we look at the loss per time plot, which also incorporates the favorable runtime of the forward mode, we see a loss performance metric T_f/T_b value of 0.211, representing a case that is more than four times as fast as backpropagation in achieving the reference validation loss.
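For concreteness, the sketch below builds the three-layer architecture described above in functional form and performs a single forward-gradient SGD update, with the learning-rate decay η_i = η_0 e^(−ik) from Section 5; torch.autograd.functional.jvp stands in for the paper's own forward-mode engine, and the minibatch, initialization, and parameterization details are our own illustrative choices.

    import math
    import torch
    import torch.nn.functional as F
    from torch.autograd.functional import jvp

    x = torch.randn(64, 784)                 # placeholder MNIST-shaped minibatch
    y = torch.randint(0, 10, (64,))

    def loss_fn(w1, b1, w2, b2, w3, b3):
        # 784 -> 1024 -> 1024 -> 10 with ReLU after the first two layers.
        h = F.relu(F.linear(x, w1, b1))
        h = F.relu(F.linear(h, w2, b2))
        return F.cross_entropy(F.linear(h, w3, b3), y)

    params = (torch.randn(1024, 784) * 0.01, torch.zeros(1024),
              torch.randn(1024, 1024) * 0.01, torch.zeros(1024),
              torch.randn(10, 1024) * 0.01, torch.zeros(10))

    eta0, k, i = 2e-4, 1e-4, 0               # learning-rate decay eta_i = eta0 * exp(-i k)
    eta_i = eta0 * math.exp(-i * k)

    # One forward-gradient step: sample a direction v, obtain the directional
    # derivative from a single JVP, and move along (grad f . v) v.
    v = tuple(torch.randn_like(p) for p in params)
    loss, dir_deriv = jvp(loss_fn, params, v)
    params = tuple(p - eta_i * dir_deriv * vp for p, vp in zip(params, v))
    print(loss.item(), dir_deriv.item())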
[Figure 4: panels titled "mlp, sgd, lr=2e-05, lr_decay=0.0001, batch_size=64" (top row) and "mlp, sgd, lr=0.0002, lr_decay=0.0001, batch_size=64" (bottom row); axes include loss vs. iterations, loss vs. time (s), time (s) vs. iterations, and validation accuracy vs. time (s), with the base runtime marked. In-plot annotations: top row T_f = 165.842 s, T_b = 292.359 s, T_f/T_b = 0.567, R_f = 2.329, R_b = 3.971, R_f/R_b = 0.586; bottom row T_f = 68.014 s, T_b = 296.957 s, T_f/T_b = 0.229, R_f = 2.607, R_b = 4.359, R_f/R_b = 0.598.]

Figure 4. Comparison of forward gradient and backpropagation for the multi-layer NN, showing two learning rates. Top row: learning rate 2 × 10⁻⁵. Bottom row: learning rate 2 × 10⁻⁴. Showing five independent runs per experiment.
[Figure 5: panels titled "cnn4b, sgd, lr=0.0002, lr_decay=0.0001, batch_size=64"; axes include loss vs. iterations, loss vs. time (s), time (s) vs. iterations, and validation accuracy vs. time (s). In-plot annotations: T_f = 362.247 s, T_b = 704.369 s, T_f/T_b = 0.514, R_f = 1.434, R_b = 2.211, R_f/R_b = 0.649.]

Figure 5. Comparison of forward gradient and backpropagation for the CNN. Learning rate 2 × 10⁻⁴. Showing five independent runs.
5.5. Convolutional Neural Networks

In Figure 5 we show a comparison between the forward gradient and backpropagation for a convolutional neural network (CNN) for the same MNIST classification task. The CNN has four convolutional layers with 3 × 3 kernels and 64 channels, followed by two linear layers of sizes 1024 and 10. All convolutions and the first linear layer are followed by ReLU activation, and there are two max-pooling layers with 2 × 2 kernel after the second and fourth convolutions.

In this architecture we observe the best forward AD performance with respect to the base runtime, where the forward mode has R_f = 1.434, representing an overhead of only 43% on top of the base runtime. Backpropagation with R_b = 2.211 is very close to the ideal case one can expect from a reverse AD system, taking roughly double the time. R_f/R_b = 0.649 represents a significant benefit for the forward AD runtime with respect to backpropagation. In loss space, we get a ratio T_f/T_b = 0.514, which shows that forward gradients are close to twice as fast as backpropagation in achieving the reference level of validation loss.
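A PyTorch sketch matching this description is given below; details the text does not specify (padding, the flattened feature size, initialization) are our assumptions, and nn.LazyLinear is used so that the first linear layer's input size is inferred automatically.

    import torch
    import torch.nn as nn

    # Four 3x3 convolutions with 64 channels, ReLU after each, 2x2 max-pooling
    # after the 2nd and 4th convolutions, then linear layers of sizes 1024 and 10.
    cnn = nn.Sequential(
        nn.Conv2d(1, 64, 3), nn.ReLU(),
        nn.Conv2d(64, 64, 3), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(64, 64, 3), nn.ReLU(),
        nn.Conv2d(64, 64, 3), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.LazyLinear(1024), nn.ReLU(),      # input features inferred on first call
        nn.Linear(1024, 10),
    )

    logits = cnn(torch.randn(64, 1, 28, 28)) # one MNIST-shaped minibatch
    print(logits.shape)                      # torch.Size([64, 10])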
5.6. Scalability

The results in the previous subsections demonstrate that

• training without backpropagation can feasibly work within a typical ML training pipeline and do so in a computationally competitive way, and

• forward AD can even beat backpropagation in loss decrease per training time for the same choice of hyperparameters (learning rate and learning rate decay).

In order to investigate whether these results will scale to larger NNs with more layers, we measure runtime cost and memory usage as a function of NN size. In Figure 6 we show the results for the MLP architecture (Section 5.4), where we run experiments with an increasing number of layers in the range [1, 100]. The linear layers are of size 1,024, with no bias. We use a minibatch size of 64 as before.

Looking at the cost relative to the base runtime, which also changes as a function of the number of layers, we see that backpropagation remains within R_b ∈ [4, 5] and forward gradient remains within R_f ∈ [3, 4] for a large proportion of the experiments. We also observe that forward gradients remain favorable for the whole range of layer sizes considered, with the R_f/R_b ratio staying below 0.6 up to ten layers and going slightly over 0.8 at 100 layers. Importantly, there is virtually no difference in memory consumption between the two methods.
[Figure 6: panel titled "mlp_no_bias, sgd, layer_size=1,024, batch_size=64, runs=10, n=100, num_layers=[1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]"; plots of runtime cost, relative runtime cost R_f/R_b, and memory use (bytes) vs. number of layers; legend: forward grad, backprop.]

Figure 6. Comparison of how the runtime cost and memory usage of forward gradients and backpropagation scale as a function of NN depth for the MLP architecture, where each layer is of size 1024. Showing mean and standard deviation over ten independent runs.

6. Implementation

We implement a forward-mode AD system in Python and base this on PyTorch tensors in order to enable a fair comparison with a typical backpropagation pipeline in PyTorch, which is widely used by the ML community.⁹ We release our implementation publicly.¹⁰

Our forward-mode AD engine is implemented from scratch using operator overloading and non-differentiable PyTorch tensors (requires_grad=False) as a building block. This means that our forward AD implementation does not use PyTorch's reverse-mode implementation (called "autograd") and computation graph. We produce the backpropagation results in experiments using PyTorch's existing reverse-mode code (requires_grad=True and .backward()) as usual.
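To give a flavor of what an operator-overloading forward mode looks like (a minimal scalar sketch of the general technique, not the released implementation), a dual number carries a primal value and a tangent through every operation, so a single evaluation yields both f(θ) and ∇f(θ) · v:

    import math

    class Dual:
        """Forward-mode AD value: primal part x and tangent (derivative) part dx."""
        def __init__(self, x, dx=0.0):
            self.x, self.dx = x, dx

        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.x + other.x, self.dx + other.dx)
        __radd__ = __add__

        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.x * other.x, self.dx * other.x + self.x * other.dx)
        __rmul__ = __mul__

    def sin(a):
        return Dual(math.sin(a.x), math.cos(a.x) * a.dx) if isinstance(a, Dual) else math.sin(a)

    def f(t1, t2):
        return t1 * t2 + sin(t1)

    # Directional derivative of f at theta = (2, 3) along v = (1, 0.5),
    # obtained in a single forward evaluation by seeding the tangents with v.
    out = f(Dual(2.0, 1.0), Dual(3.0, 0.5))
    print(out.x, out.dx)   # f(theta) and grad f(theta) . v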
Note that empirical comparisons of the relative runtimes of forward- and reverse-mode AD are highly dependent on the implementation details in a given system and would show differences across different code bases. When implementing the forward mode of tensor operations common in ML (e.g., matrix multiplication, convolutions), we identified opportunities to make the forward AD operations even more efficient (e.g., stacking channels of primal and derivative parts of tensors in a convolution). Note that the implementation we use in this paper does not currently have these. [...] can be based on the complex-step method (Martins et al., 2003), a technique that can approximate directional derivatives with nothing but basic support for complex numbers.
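As a small illustration of that idea (our sketch; the paper does not provide one), the complex-step trick perturbs the input along v by an imaginary step and reads the directional derivative off the imaginary part of the output, avoiding the cancellation error of finite differences:

    import numpy as np

    def beale(p):
        x, y = p
        return ((1.5 - x + x * y) ** 2
                + (2.25 - x + x * y ** 2) ** 2
                + (2.625 - x + x * y ** 3) ** 2)

    def directional_derivative(f, theta, v, h=1e-20):
        # Complex-step approximation of grad f(theta) . v.
        return np.imag(f(theta + 1j * h * v)) / h

    theta = np.array([1.5, -0.1])
    v = np.random.randn(2)
    print(directional_derivative(beale, theta, v))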
⁹ We also experimented with the forward mode implementation in JAX (Bradbury et al., 2018) but decided to base our implementation on PyTorch due to its maturity and simplicity, allowing us to perform a clear comparison.
¹⁰ To be shared in the upcoming revision.

7. Conclusions

We have shown that a typical ML training pipeline can be constructed without backpropagation, using only forward AD, while still being computationally competitive. We expect this contribution to find use in distributed ML training, which is outside the scope of this paper. Furthermore, the runtime results we obtained with our forward AD prototype in PyTorch are encouraging, and we are cautiously optimistic that they might be the first step towards significantly decreasing the time taken to train ML architectures, or alternatively, enabling the training of more complex architectures with a given compute budget. We are excited to have the results confirmed and studied further by the research community.

The work presented here is the basis for several directions that we would like to follow. In particular, we are interested in working on gradient descent algorithms other than SGD, such as SGD with momentum, and adaptive learning rate algorithms such as Adam (Kingma & Ba, 2015). In this paper we deliberately excluded these to focus on the most isolated and clear case of SGD, in order to establish the technique and a baseline. We are also interested in experimenting with other ML architectures. The components used in our experiments (i.e., linear and convolutional layers, pooling, ReLU nonlinearity) are representative of the building blocks of many current architectures in practice, and we expect the results to apply to these as well.

Lastly, in the longer term we are interested in seeing whether the forward gradient algorithm can contribute to the mathematical understanding of the biological learning mechanisms in the brain, as backpropagation has been historically viewed as biologically implausible because it requires a precise backward connectivity (Bengio et al., 2015; Lillicrap et al., 2016; 2020). In this context, one way to look at the role of the directional derivative in forward gradients is to interpret it as the feedback of a single global scalar quantity that is identical for all the computation nodes in the network.

We believe that forward AD has computational characteristics that are ripe for exploration by the ML community, and we expect that its addition to the conventional ML infrastructure will lead to major breakthroughs and new approaches.
References

Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5):834–846, 1983. doi: 10.1109/TSMC.1983.6313077.

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research (JMLR), 18(153):1–43, 2018. URL http://jmlr.org/papers/v18/17-468.html.

Bengio, Y. How auto-encoders could provide credit assignment in deep networks via target propagation. arXiv preprint arXiv:1407.7906, 2014.

Bengio, Y. Deriving differential target propagation from iterating approximate inverses. arXiv preprint arXiv:2007.15139, 2020.

Bengio, Y., Lee, D.-H., Bornschein, J., Mesnard, T., and Lin, Z. Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156, 2015.

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., and Zhang, Q. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Deisenroth, M. P., Faisal, A. A., and Ong, C. S. Mathematics for Machine Learning. Cambridge University Press, 2020.

Ding, Z. and Li, Q. Langevin Monte Carlo: random coordinate descent and variance reduction. Journal of Machine Learning Research, 22(205):1–51, 2021.

Gower, R. M., Richtárik, P., and Bach, F. Stochastic quasi-gradient methods: Variance reduction via Jacobian sketching. arXiv preprint arXiv:1805.02632, 2018b.

Griewank, A. Who invented the reverse mode of differentiation? Documenta Mathematica, Extra Volume ISMP:389–400, 2012.

Griewank, A. and Walther, A. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM, 2008.

Hanzely, F., Mishchenko, K., and Richtárik, P. SEGA: Variance reduction via gradient sketching. arXiv preprint arXiv:1809.03054, 2018.

Hascoët, L. Adjoints by automatic differentiation. Advanced Data Assimilation for Geosciences: Lecture Notes of the Les Houches School of Physics: Special Issue, June 2012, (2012):349, 2014.

Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., Silver, D., and Kavukcuoglu, K. Decoupled neural interfaces using synthetic gradients. In International Conference on Machine Learning, pp. 1627–1635. PMLR, 2017.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

LeCun, Y. Learning process in an asymmetric threshold network. In Disordered Systems and Biological Organization, pp. 233–240. Springer, 1986.
LeCun, Y. PhD thesis: Modeles connexionnistes de l'apprentissage (connectionist learning models). Universite P. et M. Curie (Paris 6), June 1987.

Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7(1):1–10, 2016.

Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J., and Hinton, G. Backpropagation and the brain. Nature Reviews Neuroscience, 21(6):335–346, 2020.

Linnainmaa, S. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, pp. 6–7, 1970.

Martins, J. R., Sturdza, P., and Alonso, J. J. The complex-step derivative approximation. ACM Transactions on Mathematical Software (TOMS), 29(3):245–262, 2003.

Meulemans, A., Carzaniga, F., Suykens, J., Sacramento, J., and Grewe, B. F. A theoretical framework for target propagation. Advances in Neural Information Processing Systems, 33:20024–20036, 2020.

Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A. Monte Carlo gradient estimation in machine learning. Journal of Machine Learning Research, 21(132):1–62, 2020.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32:8026–8037, 2019.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.

Schmidhuber, J. Who invented backpropagation? 2020. URL https://people.idsia.ch/~juergen/who-invented-backpropagation.html.

Siskind, J. M. and Pearlmutter, B. A. Divide-and-conquer checkpointing for arbitrary programs with no user annotation. Optimization Methods and Software, 33(4-6):1288–1330, 2018.

Spall, J. C. et al. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332–341, 1992.

Wengert, R. E. A simple automatic derivative evaluation program. Communications of the ACM, 7(8):463–464, 1964.

Williams, R. J. and Zipser, D. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.

Wright, S. J. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.