OptNet: Differentiable Optimization as a Layer in Neural Networks

Brandon Amos, J. Zico Kolter
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. Correspondence to: Brandon Amos <[email protected]>, J. Zico Kolter <[email protected]>.
Abstract

This paper presents OptNet, a network architecture that integrates optimization problems (here, specifically in the form of quadratic programs) as individual layers in larger end-to-end trainable deep networks. These layers encode constraints and complex dependencies between the hidden states that traditional convolutional and fully-connected layers often cannot capture. In this paper, we explore the foundations for such an architecture: we show how techniques from sensitivity analysis, bilevel optimization, and implicit differentiation can be used to exactly differentiate through these layers and with respect to layer parameters; we develop a highly efficient solver for these layers that exploits fast GPU-based batch solves within a primal-dual interior point method, and which provides backpropagation gradients with virtually no additional cost on top of the solve; and we highlight the application of these approaches in several problems. In one notable example, we show that the method is capable of learning to play mini-Sudoku (4x4) given just input and output games, with no a priori information about the rules of the game; this highlights the ability of our architecture to learn hard constraints better than other neural architectures.

1. Introduction

In this paper, we consider how to treat exact, constrained optimization as an individual layer within a deep learning architecture. Unlike traditional feedforward networks, where the output of each layer is a relatively simple (though non-linear) function of the previous layer, our optimization framework allows individual layers to capture much richer behavior, expressing complex operations that in total can reduce the overall depth of the network while preserving richness of representation. Specifically, we build a framework where the output of the (i+1)th layer in a network is the solution to a constrained optimization problem based upon previous layers. This framework naturally encompasses a wide variety of inference problems expressed within a neural network, allowing for the potential of much richer end-to-end training for complex tasks that require such inference procedures.

Concretely, in this paper we specifically consider the task of solving small quadratic programs as individual layers. These optimization problems are well-suited to capturing interesting behavior and can be efficiently solved with GPUs. Specifically, we consider layers of the form

$$z_{i+1} = \operatorname*{argmin}_{z} \;\; \tfrac{1}{2} z^T Q(z_i) z + q(z_i)^T z \quad \text{subject to} \quad A(z_i) z = b(z_i), \;\; G(z_i) z \leq h(z_i) \tag{1}$$

where z is the optimization variable and Q(z_i), q(z_i), A(z_i), b(z_i), G(z_i), and h(z_i) are the parameters of the optimization problem. As the notation suggests, these parameters can depend in any differentiable way on the previous layer z_i, and can eventually be optimized just like any other weights in a neural network. These layers can be learned by taking the gradients of some loss function with respect to the parameters. In this paper, we derive the gradients of (1) by taking matrix differentials of the KKT conditions of the optimization problem at its solution.
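To fix ideas, the following is a minimal sketch of the forward pass of a single layer of the form (1). It uses CVXPY purely for illustration (the batched GPU solver we actually use is described in Section 3.1), and the random problem data stand in for the differentiable functions Q(z_i), q(z_i), A(z_i), b(z_i), G(z_i), h(z_i).

```python
import numpy as np
import cvxpy as cp


def optnet_forward(Q, q, A, b, G, h):
    """Forward pass of a single layer of the form (1): solve the QP that the
    previous layer's activations parameterize and return its minimizer."""
    z = cp.Variable(q.shape[0])
    objective = cp.Minimize(0.5 * cp.quad_form(z, Q) + q @ z)
    problem = cp.Problem(objective, [A @ z == b, G @ z <= h])
    problem.solve()
    return z.value


# Tiny example: in a network, this data would be produced from z_i.
rng = np.random.default_rng(0)
n, m, p = 5, 2, 3
M = rng.standard_normal((n, n))
Q = M.T @ M + 1e-3 * np.eye(n)        # positive definite quadratic term
q = rng.standard_normal(n)
A = rng.standard_normal((m, n))
G = rng.standard_normal((p, n))
z0 = rng.standard_normal(n)           # a point used to keep the QP feasible
b = A @ z0
h = G @ z0 + rng.uniform(size=p)
z_next = optnet_forward(Q, q, A, b, G, h)
```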
In order to make the approach practical for larger networks, we develop a custom solver that can simultaneously solve multiple small QPs in batch form. We do so by developing a custom primal-dual interior point method tailored specifically to dense batch operations on a GPU. In total, the solver can solve batches of quadratic programs over 100 times faster than existing highly tuned quadratic programming solvers such as Gurobi and CPLEX. One crucial algorithmic insight in the solver is that by using a specific factorization of the primal-dual interior point update, we can obtain a backward pass over the optimization layer virtually "for free" (i.e., requiring no additional factorization once the optimization problem itself has been solved). Together, these innovations enable parameterized optimization problems to be inserted within the architecture of existing deep networks.
We begin by highlighting background and related work, and then present our optimization layer itself. Using matrix differentials, we derive rules for computing all the necessary backpropagation updates. We then detail our specific solver for these quadratic programs, based upon a state-of-the-art primal-dual interior point method, and highlight the novel elements as they apply to our formulation, such as the aforementioned fact that we can compute backpropagation at very little additional cost. We then provide experimental results that demonstrate the capabilities of the architecture, highlighting potential tasks that these architectures can solve and illustrating improvements upon existing approaches.

2. Background and related work

Optimization plays a key role in modeling complex phenomena and providing concrete decision-making processes in sophisticated environments. A full treatment of optimization applications is beyond our scope (Boyd & Vandenberghe, 2004), but these methods have broad applicability in control frameworks (Morari & Lee, 1999; Sastry & Bodson, 2011), numerous statistical and mathematical formalisms (Sra et al., 2012), and physical simulation problems like rigid body dynamics (Lötstedt, 1984). Generally speaking, our work is a step towards learning the optimization problems behind real-world processes from data end-to-end rather than requiring human specification and intervention.

In the machine learning setting, a wide array of applications consider optimization as a means to perform inference in learning. Among many other applications, these architectures are well-studied for generic classification and structured prediction tasks (Goodfellow et al., 2013; Stoyanov et al., 2011; Brakel et al., 2013; LeCun et al., 2006; Belanger & McCallum, 2016; Belanger et al., 2017; Amos et al., 2017) and in vision for tasks such as denoising (Tappen et al., 2007; Schmidt & Roth, 2014), and Metz et al. (2016) use unrolled optimization within a network to stabilize the convergence of generative adversarial networks (Goodfellow et al., 2014). Indeed, the general idea of solving restricted classes of optimization problems using neural networks goes back many decades (Kennedy & Chua, 1988; Lillo et al., 1993), but has seen a number of advances in recent years. These models are often trained by one of the following four methods.

Energy-based learning methods These methods can be used for tasks like (structured) prediction, where the training method shapes the energy function to be low around the observed data manifold and high elsewhere (LeCun et al., 2006). In recent years, there has been a strong push to further incorporate structured prediction methods like conditional random fields as the "last layer" of a deep network architecture (Peng et al., 2009; Zheng et al., 2015; Chen et al., 2015), as well as in deeper energy-based architectures (Belanger & McCallum, 2016; Belanger et al., 2017; Amos et al., 2017). Learning in this context requires observed data, which isn't present in some of the contexts we consider in this paper, and may also suffer from instability issues when combined with deep energy-based architectures, as observed in Belanger & McCallum (2016); Belanger et al. (2017); Amos et al. (2017).

Analytically If an analytic solution to the argmin can be found, such as in an unconstrained quadratic minimization, the gradients can often also be computed analytically. This is done in Tappen et al. (2007); Schmidt & Roth (2014). We cannot use these methods for the constrained optimization problems we consider in this paper because there are no known analytic solutions.

Unrolling The argmin operation over an unconstrained objective can be approximated by a first-order gradient-based method and unrolled. These architectures typically introduce an optimization procedure such as gradient descent into the inference procedure. This is done in Domke (2012); Amos et al. (2017); Belanger et al. (2017); Metz et al. (2016); Goodfellow et al. (2013); Stoyanov et al. (2011); Brakel et al. (2013). The optimization procedure is unrolled automatically or manually (Domke, 2012) to obtain derivatives during training that incorporate the effects of these in-the-loop optimization procedures. However, unrolling the computation of a method like gradient descent typically requires a substantially larger network and adds substantially to the computational complexity of the network.

In all of these existing cases, the optimization problem is unconstrained and unrolling gradient descent is often easy to do. When constraints are added to the optimization problem, iterative algorithms often use a projection operator that may be difficult to unroll through. In this paper, we do not unroll an optimization procedure but instead use argmin differentiation as described in the next section.

Argmin differentiation Most closely related to our own work, there have been several papers that propose some form of differentiation through argmin operators. These techniques also come up in bilevel optimization (Gould et al., 2016; Kunisch & Pock, 2013) and sensitivity analysis (Bertsekas, 1999; Fiacco & Ishizuka, 1990; Bonnans & Shapiro, 2013). In the case of Gould et al. (2016), the authors describe general techniques for differentiation through optimization problems, but only describe the case of exact equality constraints rather than both equality and inequality constraints (in the case of inequality constraints, they add these via a barrier function).
Amos et al. (2017) consider argmin differentiation within the context of a specific optimization problem (the bundle method) but do not consider a general setting. Johnson et al. (2016) perform implicit differentiation on (multi-)convex objectives with coordinate subspace constraints, but do not consider inequality constraints and do not consider in detail general linear equality constraints. Their optimization problem appears only in the final layer of a variational inference network, while we propose to insert optimization problems anywhere in the network. A special case of OptNet layers (with no inequality constraints) has a natural interpretation in terms of Gaussian inference, and so Gaussian graphical models (or CRF ideas more generally) provide tools for making the computation more efficient and for interpreting or constraining its structure. Similarly, the older work of Mairal et al. (2012) considered argmin differentiation for a LASSO problem, deriving specific rules for this case and presenting an efficient algorithm based upon our ability to solve the LASSO problem efficiently.

In this paper, we use implicit differentiation (Dontchev & Rockafellar, 2009; Griewank & Walther, 2008) and techniques from matrix differential calculus (Magnus & Neudecker, 1988) to derive the gradients from the KKT matrix of the problem we are interested in. A notable difference from other work within ML that we are aware of is that we analytically differentiate through inequality as well as equality constraints by differentiating the complementarity conditions; this differs from, e.g., Gould et al. (2016), where they instead approximately convert the problem to an unconstrained one via a barrier method. We have also developed methods to make this approach practical and reasonably scalable within the context of deep architectures.

3. OptNet: solving optimization within a neural network

Although in its most general form an OptNet layer can be any optimization problem, in this paper we will study OptNet layers defined by a quadratic program

$$\begin{array}{ll} \underset{z}{\text{minimize}} & \tfrac{1}{2} z^T Q z + q^T z \\ \text{subject to} & Az = b, \;\; Gz \leq h \end{array} \tag{2}$$

where z ∈ R^n is our optimization variable, Q ∈ R^{n×n} (a positive semidefinite matrix), q ∈ R^n, A ∈ R^{m×n}, b ∈ R^m, G ∈ R^{p×n}, and h ∈ R^p are the problem data; we leave out the dependence on the previous layer z_i shown in (1) for notational convenience. As is well known, these problems can be solved in polynomial time using a variety of methods; if one desires exact (to numerical precision) solutions, then primal-dual interior point methods, as we will use in a later section, are the current state of the art. In the neural network setting, the optimal solution (or, more generally, a subset of the optimal solution) of this optimization problem becomes the output of our layer, denoted z_{i+1}, and any of the problem data Q, q, A, b, G, h can depend on the value of the previous layer z_i. The forward pass in our OptNet architecture thus involves simply setting up and finding the solution to this optimization problem.

Training deep architectures, however, requires that we not just have a forward pass in our network but also a backward pass. This requires that we compute the derivative of the solution to the QP with respect to its input parameters, a general topic we discussed previously. To obtain these derivatives, we differentiate the KKT conditions (sufficient and necessary conditions for optimality) of (2) at a solution to the problem using techniques from matrix differential calculus (Magnus & Neudecker, 1988). Our analysis here can be extended to more general convex optimization problems.

The Lagrangian of (2) is given by

$$L(z, \nu, \lambda) = \tfrac{1}{2} z^T Q z + q^T z + \nu^T (Az - b) + \lambda^T (Gz - h) \tag{3}$$

where ν are the dual variables on the equality constraints and λ ≥ 0 are the dual variables on the inequality constraints. The KKT conditions for stationarity, primal feasibility, and complementary slackness are

$$\begin{aligned} Qz^\star + q + A^T \nu^\star + G^T \lambda^\star &= 0 \\ Az^\star - b &= 0 \\ D(\lambda^\star)(Gz^\star - h) &= 0, \end{aligned} \tag{4}$$

where D(·) creates a diagonal matrix from a vector and z^⋆, ν^⋆, and λ^⋆ are the optimal primal and dual variables. Taking the differentials of these conditions gives the equations

$$\begin{aligned} dQ\, z^\star + Q\, dz + dq + dA^T \nu^\star + A^T d\nu + dG^T \lambda^\star + G^T d\lambda &= 0 \\ dA\, z^\star + A\, dz - db &= 0 \\ D(Gz^\star - h)\, d\lambda + D(\lambda^\star)(dG\, z^\star + G\, dz - dh) &= 0 \end{aligned} \tag{5}$$

or, written more compactly in matrix form,

$$\begin{bmatrix} Q & G^T & A^T \\ D(\lambda^\star) G & D(Gz^\star - h) & 0 \\ A & 0 & 0 \end{bmatrix} \begin{bmatrix} dz \\ d\lambda \\ d\nu \end{bmatrix} = \begin{bmatrix} -dQ\, z^\star - dq - dG^T \lambda^\star - dA^T \nu^\star \\ -D(\lambda^\star)\, dG\, z^\star + D(\lambda^\star)\, dh \\ -dA\, z^\star + db \end{bmatrix}. \tag{6}$$

Using these equations, we can form the Jacobians of z^⋆ (or of λ^⋆ and ν^⋆, though we don't consider this case here) with respect to any of the data parameters.
For example, if we wished to compute the Jacobian ∂z^⋆/∂b ∈ R^{n×m}, we would simply substitute db = I (and set all other differential terms in the right-hand side to zero), solve the equation, and the resulting value of dz would be the desired Jacobian.

In the backpropagation algorithm, however, we never want to explicitly form the actual Jacobian matrices, but rather want to form the left matrix-vector product with some previous backward pass vector ∂ℓ/∂z^⋆ ∈ R^n, i.e., (∂ℓ/∂z^⋆)(∂z^⋆/∂b). We can do this efficiently by noting that the solution for (dz, dλ, dν) involves multiplying the inverse of the left-hand-side matrix in (6) by some right-hand side. Thus, if we multiply the backward pass vector by the transpose of the differential matrix,

$$\begin{bmatrix} d_z \\ d_\lambda \\ d_\nu \end{bmatrix} = - \begin{bmatrix} Q & G^T D(\lambda^\star) & A^T \\ G & D(Gz^\star - h) & 0 \\ A & 0 & 0 \end{bmatrix}^{-1} \begin{bmatrix} \left(\frac{\partial \ell}{\partial z^\star}\right)^T \\ 0 \\ 0 \end{bmatrix} \tag{7}$$

then the relevant gradients with respect to all the QP parameters are given by

$$\begin{aligned} \frac{\partial \ell}{\partial Q} &= \tfrac{1}{2}\left(d_z z^T + z d_z^T\right) & \frac{\partial \ell}{\partial q} &= d_z \\ \frac{\partial \ell}{\partial A} &= d_\nu z^T + \nu d_z^T & \frac{\partial \ell}{\partial b} &= -d_\nu \\ \frac{\partial \ell}{\partial G} &= D(\lambda^\star)\left(d_\lambda z^T + \lambda d_z^T\right) & \frac{\partial \ell}{\partial h} &= -D(\lambda^\star)\, d_\lambda \end{aligned} \tag{8}$$

where, as in standard backpropagation, all these terms are at most the size of the parameter matrices. We note that some of these parameters may depend on the previous layer z_i, and the gradients with respect to the previous layer can then be obtained through the chain rule. As we will see in the next section, the solution to an interior point method in fact already provides a factorization we can use to compute these gradients efficiently.
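To make (7)-(8) concrete, the following is a minimal NumPy sketch of the backward pass for a single (non-batched) QP. It is for illustration only: it assumes the matrix in (7) is nonsingular at the solution, and it does not reuse a cached factorization the way the batched solver of Section 3.1 does.

```python
import numpy as np


def qp_layer_backward(Q, G, A, h, z_star, lam_star, nu_star, dl_dz):
    """Backward pass of the QP layer via (7)-(8) for one problem.

    Q: (n, n), G: (p, n), A: (m, n), h: (p,); z_star, lam_star, nu_star are the
    optimal primal/dual variables; dl_dz is the incoming gradient dl/dz*.
    """
    n, p, m = Q.shape[0], G.shape[0], A.shape[0]
    D_lam = np.diag(lam_star)
    D_slack = np.diag(G @ z_star - h)          # D(Gz* - h)

    # Transposed differential matrix from (7).
    K = np.block([
        [Q,  G.T @ D_lam,       A.T],
        [G,  D_slack,           np.zeros((p, m))],
        [A,  np.zeros((m, p)),  np.zeros((m, m))],
    ])
    rhs = np.concatenate([dl_dz, np.zeros(p), np.zeros(m)])
    d = -np.linalg.solve(K, rhs)
    dz, dlam, dnu = d[:n], d[n:n + p], d[n + p:]

    # Parameter gradients from (8).
    return {
        "Q": 0.5 * (np.outer(dz, z_star) + np.outer(z_star, dz)),
        "q": dz,
        "A": np.outer(dnu, z_star) + np.outer(nu_star, dz),
        "b": -dnu,
        "G": D_lam @ (np.outer(dlam, z_star) + np.outer(lam_star, dz)),
        "h": -lam_star * dlam,                 # -D(lambda*) d_lambda
    }
```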
3.1. An efficient batched QP solver

Deep networks are typically trained in mini-batches to take advantage of efficient data-parallel GPU operations. Without mini-batching on the GPU, many modern deep learning architectures become intractable for all practical purposes. However, today's state-of-the-art QP solvers like Gurobi and CPLEX do not have the capability of solving multiple optimization problems on the GPU in parallel across the entire minibatch. This makes larger OptNet layers quickly become intractable compared to a fully-connected layer with the same number of parameters.

To overcome this performance bottleneck in our quadratic program layers, we have implemented a GPU-based primal-dual interior point method (PDIPM) based on Mattingley & Boyd (2012) that solves a batch of quadratic programs, and which provides the necessary gradients needed to train these in an end-to-end fashion. Our performance experiments in Section 4.1 show that our solver is significantly faster than the standard non-batch solvers Gurobi and CPLEX.

Following the method of Mattingley & Boyd (2012), our solver introduces slack variables on the inequality constraints and iteratively minimizes the residuals from the KKT conditions over the primal variable z ∈ R^n, slack variable s ∈ R^p, and dual variables ν ∈ R^m associated with the equality constraints and λ ∈ R^p associated with the inequality constraints. Each iteration computes the affine scaling directions by solving

$$K \begin{bmatrix} \Delta z^{\mathrm{aff}} \\ \Delta s^{\mathrm{aff}} \\ \Delta \lambda^{\mathrm{aff}} \\ \Delta \nu^{\mathrm{aff}} \end{bmatrix} = \begin{bmatrix} -(A^T \nu + G^T \lambda + Qz + q) \\ -S\lambda \\ -(Gz + s - h) \\ -(Az - b) \end{bmatrix} \tag{9}$$

where

$$K = \begin{bmatrix} Q & 0 & G^T & A^T \\ 0 & D(\lambda) & D(s) & 0 \\ G & I & 0 & 0 \\ A & 0 & 0 & 0 \end{bmatrix},$$

and then the centering-plus-corrector directions by solving

$$K \begin{bmatrix} \Delta z^{\mathrm{cc}} \\ \Delta s^{\mathrm{cc}} \\ \Delta \lambda^{\mathrm{cc}} \\ \Delta \nu^{\mathrm{cc}} \end{bmatrix} = \begin{bmatrix} 0 \\ \sigma \mu \mathbf{1} - D(\Delta s^{\mathrm{aff}}) \Delta \lambda^{\mathrm{aff}} \\ 0 \\ 0 \end{bmatrix} \tag{10}$$

where μ = s^T λ / p is the duality gap and σ is defined in Mattingley & Boyd (2012). Each variable v is then updated as Δv = Δv^aff + Δv^cc with an appropriate step size. We actually solve a symmetrized version of the KKT conditions, obtained by scaling the second row block by D(1/s). We analytically decompose these systems into smaller symmetric systems and pre-factorize the portions of them that don't change (i.e., that don't involve D(λ/s)) between iterations.
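The following is a simplified, unsymmetrized NumPy sketch of a single iteration of this scheme for one QP. It is illustrative only: it forms K explicitly and uses a plain dense LU solve rather than the cached, symmetrized batch factorizations of our GPU solver, and the fraction-to-the-boundary step-size rule shown is one standard choice rather than our exact rule.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve


def pdipm_iteration(Q, q, G, h, A, b, z, s, lam, nu, sigma=0.1):
    """One simplified primal-dual interior point iteration following (9)-(10)."""
    n, p, m = Q.shape[0], G.shape[0], A.shape[0]
    K = np.block([
        [Q,                 np.zeros((n, p)),  G.T,               A.T],
        [np.zeros((p, n)),  np.diag(lam),      np.diag(s),        np.zeros((p, m))],
        [G,                 np.eye(p),         np.zeros((p, p)),  np.zeros((p, m))],
        [A,                 np.zeros((m, p)),  np.zeros((m, p)),  np.zeros((m, m))],
    ])
    lu = lu_factor(K)                     # one factorization, reused for both solves

    # Affine scaling directions, equation (9).
    r_aff = np.concatenate([
        -(A.T @ nu + G.T @ lam + Q @ z + q),
        -s * lam,
        -(G @ z + s - h),
        -(A @ z - b),
    ])
    dz_a, ds_a, dlam_a, dnu_a = np.split(lu_solve(lu, r_aff), [n, n + p, n + 2 * p])

    # Centering-plus-corrector directions, equation (10); mu is the duality gap.
    mu = s @ lam / p
    r_cc = np.concatenate([
        np.zeros(n),
        sigma * mu * np.ones(p) - ds_a * dlam_a,
        np.zeros(p),
        np.zeros(m),
    ])
    dz_c, ds_c, dlam_c, dnu_c = np.split(lu_solve(lu, r_cc), [n, n + p, n + 2 * p])

    dz, ds, dlam, dnu = dz_a + dz_c, ds_a + ds_c, dlam_a + dlam_c, dnu_a + dnu_c
    # Fraction-to-the-boundary step size keeping s and lambda strictly positive.
    ratios = [-v / dv for v, dv in zip(np.concatenate([s, lam]),
                                       np.concatenate([ds, dlam])) if dv < 0]
    alpha = min(1.0, 0.99 * min(ratios)) if ratios else 1.0
    return z + alpha * dz, s + alpha * ds, lam + alpha * dlam, nu + alpha * dnu
```

Both right-hand sides reuse the same factorization of K; Section 3.1.1 shows how the factorization available at the solution is reused once more for the backward pass.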
We have implemented a batched version of this method with the PyTorch library² and have released it as an open-source library at https://github.com/locuslab/qpth. It uses a custom CUBLAS extension that provides an interface to solve multiple matrix factorizations and solves in parallel, and which provides the necessary backpropagation gradients for use in an end-to-end learning system.

² https://pytorch.org
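Below is a sketch of how the released qpth interface can be used to implement a layer of the form (1) inside a PyTorch module. The exact QPFunction signature may differ across qpth versions, and the particular parameterization (Q as L L^T + εI to keep it positive definite, a learned inequality-only constraint set) is illustrative rather than the configuration used in our experiments.

```python
import torch
from torch import nn
from qpth.qp import QPFunction   # interface may differ across qpth versions


class OptNetLayer(nn.Module):
    """A QP layer z_{i+1} = argmin_z 1/2 z^T Q z + q(z_i)^T z  s.t.  G z <= h.

    Q, G, h are learned; the linear term is produced from the previous layer.
    Q is parameterized as L L^T + eps*I so that it stays positive definite.
    """

    def __init__(self, n_in, n_hidden, n_ineq, eps=1e-4):
        super().__init__()
        self.fc = nn.Linear(n_in, n_hidden)                 # maps z_i to q(z_i)
        self.L = nn.Parameter(torch.tril(torch.rand(n_hidden, n_hidden)))
        self.G = nn.Parameter(torch.randn(n_ineq, n_hidden))
        self.h = nn.Parameter(torch.ones(n_ineq))
        self.eps = eps

    def forward(self, z):
        L = torch.tril(self.L)
        Q = L @ L.t() + self.eps * torch.eye(L.size(0), device=z.device)
        q = self.fc(z)                                      # batched linear term
        e = torch.Tensor()                                  # no equality constraints
        return QPFunction(verbose=False)(Q, q, self.G, self.h, e, e)
```

Here the previous layer's activation enters only through the linear term, but any of Q, G, h, A, b could similarly be produced by differentiable functions of z_i as in (1).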
3.1.1. Efficiently computing gradients

A key point of the particular form of primal-dual interior point method that we employ is that it is possible to compute the backward pass gradients "for free" after solving the original QP, without an additional matrix factorization or solve. Specifically, at each iteration of the primal-dual interior point method, we are computing an LU decomposition of
the matrix K_sym.³ This matrix is essentially a symmetrized version of the matrix needed for computing the backpropagated gradients, and we can similarly compute the d_{z,λ,ν} terms by solving the linear system

$$K_{\mathrm{sym}} \begin{bmatrix} d_z \\ d_s \\ \tilde{d}_\lambda \\ d_\nu \end{bmatrix} = \begin{bmatrix} -\left(\frac{\partial \ell}{\partial z_{i+1}}\right)^T \\ 0 \\ 0 \\ 0 \end{bmatrix} \tag{11}$$

where d̃_λ = D(λ^⋆) d_λ for d_λ as defined in (7). Thus, all the backward pass gradients can be computed using the factored KKT matrix at the solution. Crucially, because the bottleneck of solving this linear system is computing the factorization of the KKT matrix (cubic time, as opposed to the quadratic time for solving via backsubstitution once the factorization is computed), the additional time required for computing all the necessary gradients in the backward pass is virtually nonexistent compared with the time of computing the solution. To the best of our knowledge, this is the first time that this fact has been exploited in the context of learning end-to-end systems.

³ We actually perform an LU decomposition of a certain subset of the matrix formed by eliminating variables, so that only a p × p matrix (p being the number of inequality constraints) needs to be factored during each iteration of the primal-dual algorithm, plus one m × m and one n × n matrix once at the start of the algorithm, though we omit the details here. We use an LU decomposition because this routine is provided in batch form by CUBLAS, but we could potentially use a (faster) Cholesky factorization if and when the appropriate functionality is added to CUBLAS.

3.2. Properties and representational power

In this section we briefly highlight some of the mathematical properties of OptNet layers. The proofs are straightforward and mostly based upon well-known results in convex analysis, so they are deferred to the appendix. The first result simply highlights that (because the solution of strictly convex QPs is continuous) OptNet layers are subdifferentiable everywhere, and differentiable at all but a measure-zero set of points.

Theorem 1. Let z^⋆(θ) be the output of an OptNet layer, where θ = {Q, q, A, b, G, h}. Assuming Q ≻ 0 and that A has full row rank, z^⋆(θ) is subdifferentiable everywhere: ∂z^⋆(θ) ≠ ∅, where ∂z^⋆(θ) denotes the Clarke generalized subdifferential (Clarke, 1975) (an extension of the subgradient to non-convex functions), and it has a single unique element (the Jacobian) for all but a measure-zero set of points θ.

The next two results show the representational power of the OptNet layer, specifically how an OptNet layer compares to the common linear layer followed by a ReLU activation. The first theorem shows that an OptNet layer can approximate arbitrary elementwise piecewise-linear functions, and so, among other things, can represent a ReLU layer.

Theorem 2. Let f : R^n → R^n be an elementwise piecewise linear function with k linear regions. Then the function can be represented as an OptNet layer using O(nk) parameters. Additionally, the layer z_{i+1} = max{W z_i + b, 0} for W ∈ R^{n×m}, b ∈ R^n can be represented by an OptNet layer with O(mn) parameters.
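As a concrete instance of the second part of Theorem 2, note that max{W z_i + b, 0} is the unique solution of the separable QP min_z ½‖z − (W z_i + b)‖² subject to z ≥ 0, which has the form (2) with Q = I, q = −(W z_i + b), G = −I, and h = 0. A quick numerical check (using CVXPY purely for illustration):

```python
import numpy as np
import cvxpy as cp


def relu_via_qp(W, b, z_prev):
    """max(W z_prev + b, 0) as the solution of a QP of the form (2):
    argmin_z 1/2 ||z - (W z_prev + b)||^2  subject to  z >= 0."""
    y = W @ z_prev + b
    z = cp.Variable(y.shape[0])
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(z - y)), [z >= 0]).solve()
    return z.value


rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
b = rng.standard_normal(4)
z_prev = rng.standard_normal(3)
assert np.allclose(relu_via_qp(W, b, z_prev),
                   np.maximum(W @ z_prev + b, 0), atol=1e-4)
```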
Finally, we show that the converse does not hold: that there
are function representable by an OptNet layer which cannot
where d˜λ = D(λ? )dλ for dλ as defined in (7). Thus, all
be represented exactly by a two-layer ReLU layer, which
the backward pass gradients can be computed using the
take exponentially many units to approximate (known to
factored KKT matrix at the solution. Crucially, because
be a universal function approximator). A simple ex-
the bottleneck of solving this linear system is computing
ample of such a layer (and one which we use in the
the factorization of the KKT matrix (cubic time as op-
proof) is just the max over three linear functions f (z) =
posed to the quadratic time for solving via backsubstitution
max{aT1 x, aT2 x, aT3 x}.
once the factorization is computed), the additional time re-
quirements for computing all the necessary gradients in the Theorem 3. Let f (z) : Rn → R be a scalar-valued func-
backward pass is virtually nonexistent compared with the tion specified by an POptNet layer with p parameters. Con-
m
time of computing the solution. To the best of our knowl- versely, let f 0 (z) = i=1 wi max{aTi z + bi , 0} be the out-
edge, this is the first time that this fact has been exploited put of a two-layer ReLU network. Then there exist func-
in the context of learning end-to-end systems. tions that the ReLU network cannot represent exactly over
all of R, and which require O(cp ) parameters to approxi-
3.2. Properties and representational power mate over a finite region.
In this section we briefly highlight some of the mathe- 3.3. Limitations of the method
matical properties of OptNet layers. The proofs here are
straightforward, and are mostly based upon well-known re- Although, as we will show shortly, the OptNet layer has
sults in convex analysis, so are deferred to the appendix. several strong points, we also want to highlight the poten-
The first result simply highlights that (because the solution tial drawbacks of this approach. First, although, with an
of strictly convex QPs is continuous), that OptNet layers efficient batch solver, integrating an OptNet layer into ex-
are subdifferentiable everywhere, and differentiable at all isting deep learning architectures is potentially practical,
but a measure-zero set of points. we do note that solving optimization problems exactly as
we do here has has cubic complexity in the number of vari-
Theorem 1. Let z ? (θ) be the output of an OptNet layer,
ables and/or constraints. This contrasts with the quadratic
where θ = {Q, p, A, b, G, h}. Assuming Q 0 and that
complexity of standard feedforward layers. This means that
A has full row rank, then z ? (θ) is subdifferentiable every-
we are ultimately limited to settings where the number of
where: ∂z ? (θ) 6= ∅, where ∂z ? (θ) denotes the Clarke
hidden variables in an OptNet layer is not too large (less
generalized subdifferential (Clarke, 1975) (an extension of
than 1000 dimensions seems to be the limits of what we
the subgradient to non-convex functions), and has a single
currently find to the be practical, and substantially less if
unique element (the Jacobian) for all but a measure zero
one wants real-time results for an architecture).
set of points θ.
Secondly, there are many improvements to the OptNet lay-
The next two results show the representational power of the ers that are still possible. Our QP solver, for instance, uses
OptNet layer, specifically how an OptNet layer compares fully dense matrix operations, which makes the solves very
to the common linear layer followed by a ReLU activation. efficient for GPU solutions, and which also makes sense for
The first theorem shows that an OptNet layer can approxi- our general setting where the coefficients of the quadratic
3 problem can be learned. However, for setting many real-
We actually perform an LU decomposition of a certain sub-
set of the matrix formed by eliminating variables to create only world optimization problems (and hence for architectures
a p × p matrix (the number of inequality constraints) that needs that wish to more closely mimic some real-world opti-
to be factor during each iteration of the primal-dual algorithm, mization problem), there is often substantial structure (e.g.,
and one m × m and one n × n matrix once at the start of the sparsity), in the data matrices that can be exploited for ef-
primal-dual algorithm, though we omit the detail here. We also ficiency. There is of course no prohibition of incorporat-
use an LU decomposition as this routine is provided in batch form
by CUBLAS, but could potentially use a (faster) Cholesky fac- ing sparse matrix methods into the fast custom solver, but
torization if and when the appropriate functionality is added to doing so would require substantial added complexity, espe-
CUBLAS). cially regarding efforts like finding minimum fill orderings
for different sparsity patterns of the KKT systems. In our open source solver qpth, we have started experimenting with cuSOLVER's batched sparse QR factorizations and solves.

Lastly, we note that while OptNet layers can be trained just as any other neural network layer, since they are a new creation and since they have manifolds in the parameter space which have no effect on the resulting solution (e.g., scaling the rows of a constraint matrix and its right-hand side does not change the optimization problem), there is admittedly more tuning required to get them to work. This situation is common when developing new neural network architectures and has also been reported in the similar architecture of Schmidt & Roth (2014). Our hope is that techniques for overcoming some of the challenges in learning these layers will continue to be developed in future work.

4. Experimental results

In this section, we present several experimental results that highlight the capabilities of the QP OptNet layer. Specifically, we look at 1) computational efficiency over existing solvers; 2) the ability to improve upon existing convex problems such as those used in signal denoising; 3) integrating the architecture into generic deep learning architectures; and 4) performance of our approach on a problem that is challenging for current approaches. In particular, we want to emphasize the results of our system on learning the game of (4x4) mini-Sudoku, a well-known logical puzzle; our layer is able to directly learn the necessary constraints using just gradient information and no a priori knowledge of the rules of Sudoku. The code and data for our experiments are open sourced in the icml2017 branch of https://github.com/locuslab/optnet and our batched QP solver is available as a library at https://github.com/locuslab/qpth.

4.1. Batch QP solver performance

All of the OptNet performance results in this section are run on an unloaded Titan X GPU. Gurobi is run on an unloaded quad-core Intel Core i7-5960X CPU @ 3.00GHz.

Our OptNet layers are much more computationally expensive than a linear or convolutional layer, and a natural question is to ask what the performance difference is. We set up an experiment comparing a linear layer to a QP OptNet layer with a mini-batch size of 128 on CUDA with randomly generated input vectors of size 10, 50, 100, and 500. Each layer maps this input to an output of the same dimension; the linear layer does this with a batched matrix-vector multiplication, and the OptNet layer does this by taking the argmin of a random QP that has the same number of inequality constraints as the dimensionality of the problem. Figure 1 shows the profiling results (averaged over 10 trials) of the forward and backward passes. The OptNet layer is significantly slower than the linear layer, as expected, yet still tractable in many practical contexts.

Figure 1. Performance of a linear layer and a QP layer (batch size 128); runtime (s) vs. number of variables (and inequality constraints).

Our next experiment illustrates why standard baseline QP solvers like CPLEX and Gurobi without batch support are too computationally expensive for QP OptNet layers to be tractable. We set up random QPs of the form (1) that have 100 variables and 100 inequality constraints in Gurobi and in the serialized and batched versions of our solver qpth, and vary the batch size.⁴

⁴ Experimental details: we sample the entries of a matrix U from a random uniform distribution and set Q = U^T U + 10^{-3} I, sample G with random normal entries, and set h by generating a random normal z_0 and a random uniform s_0 and setting h = G z_0 + s_0 (we didn't include equality constraints just for simplicity, and since the number of inequality constraints is the primary driver of complexity for the iterations in a primal-dual interior point method). This choice of h guarantees the problem is feasible.
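For reference, the following is a sketch of the random problem generation described in footnote 4; the linear term q is not specified there, so a standard normal draw is assumed here.

```python
import numpy as np


def random_feasible_qp(n, m, seed=0):
    """Random QP instances as in footnote 4: Q = U^T U + 1e-3 I with U uniform,
    G normal, and h = G z0 + s0 with s0 >= 0 so that z0 is strictly feasible;
    no equality constraints. (q is an assumption; it is not given in the text.)"""
    rng = np.random.default_rng(seed)
    U = rng.uniform(size=(n, n))
    Q = U.T @ U + 1e-3 * np.eye(n)
    q = rng.standard_normal(n)
    G = rng.standard_normal((m, n))
    z0, s0 = rng.standard_normal(n), rng.uniform(size=m)
    h = G @ z0 + s0
    return Q, q, G, h


Q, q, G, h = random_feasible_qp(n=100, m=100)   # the setting used in Figure 2
```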
Figure 2 shows the means and standard deviations of running each trial 10 times, showing that our batched solver outperforms Gurobi, itself a highly tuned solver, for reasonable batch sizes. For a minibatch size of 128, we solve all problems in an average of 0.18 seconds, whereas Gurobi takes an average of 4.7 seconds. In the context of training a deep architecture, this type of speed difference for a single minibatch can make the difference between a practical and a completely unusable solution.

Figure 2. Performance of Gurobi and our QP solver; runtime (s) vs. batch size.

4.2. Total variation denoising

Our next experiment studies how we can use the OptNet architecture to improve upon signal processing techniques
Figure 4. Example mini-Sudoku initial problem and solution.

4.3. MNIST

One compelling use case of an OptNet layer is to learn constraints and dependencies over the output or latent space of a model. As a simple example to illustrate that OptNet layers can be included in existing architectures and that the gradients can be efficiently propagated through the layer, we show the performance of a fully-connected feedforward network with and without an OptNet layer in Section A of the supplemental material.

4.4. Sudoku

Finally, we present the main illustrative example of the representational power of our approach: the task of learning the game of Sudoku. Sudoku is a popular logical puzzle in which a (typically 9x9) grid of cells must be filled in, given some initial entries, so that each row, each column, and each 3x3 subgrid contains each of the numbers 1 through 9. We consider the simpler case of 4x4 Sudoku puzzles, with numbers 1 through 4, as shown in Figure 4.

Sudoku is fundamentally a constraint satisfaction problem and is trivial for computers to solve when told the rules of the game. However, if we do not know the rules of the game but are only presented with examples of unsolved puzzles and their corresponding solutions, this is a challenging task. We consider this to be an interesting benchmark task for algorithms that seek to capture complex strict relationships between all input and output variables. The input to the algorithm consists of a 4x4 grid (really a 4x4x4 tensor with a one-hot encoding for known entries and all zeros for unknown entries), and the desired output is a 4x4x4 tensor of the one-hot encoding of the solution.

This is a problem where traditional neural networks have difficulties learning the necessary hard constraints. As a baseline inspired by the models at https://github.com/Kyubyong/sudoku, we implemented a multilayer feedforward network to attempt to solve Sudoku problems. Specifically, we report results for a network that has 10 convolutional layers with 512 3x3 filters each, and we tried other architectures as well. The OptNet layer we use on this task is a completely generic QP in "standard form": it has only positivity inequality constraints (z ≥ 0) but an arbitrary learned equality constraint matrix, Az = b, a small Q = 0.1I to make the problem strictly convex, and a linear term q that comes simply from the input one-hot encoding of the Sudoku problem. We know that Sudoku can be approximated well with a linear program (indeed, integer programming is a typical solution method for such problems), but the model here is told nothing about the rules of Sudoku.
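For concreteness, a sketch of such a Sudoku layer using the qpth interface is shown below. The number of learned equality constraints, the parameter initialization, and the sign convention on q are illustrative assumptions rather than the exact configuration used in our experiments, and the QPFunction interface may differ across qpth versions.

```python
import torch
from torch import nn
from qpth.qp import QPFunction   # interface may differ across qpth versions


class SudokuOptNet(nn.Module):
    """Generic "standard form" QP layer for 4x4 mini-Sudoku:
    minimize 1/2 z^T (0.1 I) z + q^T z  s.t.  A z = b,  z >= 0,
    where q comes from the one-hot puzzle encoding and A, b are learned."""

    def __init__(self, n=4 ** 3, n_eq=40):            # n_eq is a placeholder choice
        super().__init__()
        self.A = nn.Parameter(torch.rand(n_eq, n))
        self.b = nn.Parameter(torch.rand(n_eq))
        self.register_buffer("Q", 0.1 * torch.eye(n))
        self.register_buffer("G", -torch.eye(n))       # G z <= h encodes z >= 0
        self.register_buffer("h", torch.zeros(n))

    def forward(self, puzzle_one_hot):                 # (batch, 4, 4, 4) tensor
        q = -puzzle_one_hot.flatten(1)                 # sign convention is an assumption
        z = QPFunction(verbose=False)(self.Q, q, self.G, self.h, self.A, self.b)
        return z.view_as(puzzle_one_hot)
```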
Figure 5. Sudoku training plots (loss and error vs. epoch).

We trained these models using ADAM (Kingma & Ba, 2014) to minimize the MSE (which we refer to as "loss") on a dataset we created consisting of 9000 training puzzles, and we then tested the models on 1000 different held-out puzzles. The error rate is the percentage of puzzles not solved correctly when each cell is assigned to whichever index is largest in the prediction. Figure 5 shows that the convolutional network is able to learn all of the necessary logic for the task but ends up over-fitting to the training data. We contrast this with the performance of the OptNet network, which learns most of the correct hard constraints within the first three epochs and is able to generalize much better to unseen examples.

5. Conclusion

We have presented OptNet, a neural network architecture in which we use optimization problems as individual layers in the network. We have derived the algorithmic formulation for differentiating through these layers, allowing for backpropagation in end-to-end architectures. We have also developed an efficient batch solver for these optimizations based upon a primal-dual interior point method, and developed a method for attaining the necessary gradient information "for free" from this approach. Our experiments highlight the potential power of these networks, showing that they can solve problems where existing networks are very poorly suited, such as learning Sudoku problems purely from data. There are many future directions of research for these approaches, but we feel that they add another important primitive to the toolbox of neural network practitioners.
References

Bertsekas, Dimitri P. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999.

Bonnans, J. Frédéric and Shapiro, Alexander. Perturbation Analysis of Optimization Problems. Springer Science & Business Media, 2013.

Boyd, Stephen and Vandenberghe, Lieven. Convex Optimization. Cambridge University Press, 2004.

Griewank, Andreas and Walther, Andrea. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM, 2008.

Johnson, Matthew, Duvenaud, David K., Wiltschko, Alex, Adams, Ryan P., and Datta, Sandeep R. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pp. 2946–2954, 2016.