OptNet: Differentiable Optimization as a Layer in Neural Networks

Brandon Amos, J. Zico Kolter
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. Correspondence to: Brandon Amos <[email protected]>, J. Zico Kolter <[email protected]>.
Abstract

This paper presents OptNet, a network architecture that integrates optimization problems (here, specifically in the form of quadratic programs) as individual layers in larger end-to-end trainable deep networks. These layers encode constraints and complex dependencies between the hidden states that traditional convolutional and fully-connected layers often cannot capture. In this paper, we explore the foundations for such an architecture: we show how techniques from sensitivity analysis, bilevel optimization, and implicit differentiation can be used to exactly differentiate through these layers and with respect to layer parameters; we develop a highly efficient solver for these layers that exploits fast GPU-based batch solves within a primal-dual interior point method, and which provides backpropagation gradients with virtually no additional cost on top of the solve; and we highlight the application of these approaches in several problems. In one notable example, we show that the method is capable of learning to play mini-Sudoku (4x4) given just input and output games, with no a priori information about the rules of the game; this highlights the ability of our architecture to learn hard constraints better than other neural architectures.

1. Introduction

In this paper, we consider how to treat exact, constrained optimization as an individual layer within a deep learning architecture. Unlike traditional feedforward networks, where the output of each layer is a relatively simple (though non-linear) function of the previous layer, our optimization framework allows individual layers to capture much richer behavior, expressing complex operations that in total can reduce the overall depth of the network while preserving richness of representation. Specifically, we build a framework where the output of the (i+1)th layer in a network is the solution to a constrained optimization problem based upon previous layers. This framework naturally encompasses a wide variety of inference problems expressed within a neural network, allowing for the potential of much richer end-to-end training for complex tasks that require such inference procedures.

Concretely, in this paper we specifically consider the task of solving small quadratic programs as individual layers. These optimization problems are well-suited to capturing interesting behavior and can be efficiently solved with GPUs. Specifically, we consider layers of the form

$$z_{i+1} = \operatorname*{argmin}_{z} \;\; \tfrac{1}{2} z^T Q(z_i) z + q(z_i)^T z \quad \text{subject to} \quad A(z_i) z = b(z_i), \;\; G(z_i) z \leq h(z_i) \tag{1}$$

where z is the optimization variable and Q(z_i), q(z_i), A(z_i), b(z_i), G(z_i), and h(z_i) are the parameters of the optimization problem. As the notation suggests, these parameters can depend in any differentiable way on the previous layer z_i, and can eventually be optimized just like any other weights in a neural network. These layers can be learned by taking the gradients of some loss function with respect to the parameters. In this paper, we derive the gradients of (1) by taking matrix differentials of the KKT conditions of the optimization problem at its solution.
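To fix ideas, the following is a minimal sketch of the forward pass of a single layer of the form (1). It uses CVXPY purely for illustration (the batched GPU solver we actually use is described in Section 3.1), and the random problem data stand in for the differentiable functions Q(z_i), q(z_i), A(z_i), b(z_i), G(z_i), h(z_i).

```python
import numpy as np
import cvxpy as cp


def optnet_forward(Q, q, A, b, G, h):
    """Forward pass of a single layer of the form (1): solve the QP that the
    previous layer's activations parameterize and return its minimizer."""
    z = cp.Variable(q.shape[0])
    objective = cp.Minimize(0.5 * cp.quad_form(z, Q) + q @ z)
    problem = cp.Problem(objective, [A @ z == b, G @ z <= h])
    problem.solve()
    return z.value


# Tiny example: in a network, this data would be produced from z_i.
rng = np.random.default_rng(0)
n, m, p = 5, 2, 3
M = rng.standard_normal((n, n))
Q = M.T @ M + 1e-3 * np.eye(n)        # positive definite quadratic term
q = rng.standard_normal(n)
A = rng.standard_normal((m, n))
G = rng.standard_normal((p, n))
z0 = rng.standard_normal(n)           # a point used to keep the QP feasible
b = A @ z0
h = G @ z0 + rng.uniform(size=p)
z_next = optnet_forward(Q, q, A, b, G, h)
```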
In order to make the approach practical for larger networks, we develop a custom solver that can simultaneously solve multiple small QPs in batch form. We do so by developing a custom primal-dual interior point method tailored specifically to dense batch operations on a GPU. In total, the solver can solve batches of quadratic programs over 100 times faster than existing highly tuned quadratic programming solvers such as Gurobi and CPLEX. One crucial algorithmic insight in the solver is that by using a specific factorization of the primal-dual interior point update, we can obtain a backward pass over the optimization layer virtually "for free" (i.e., requiring no additional factorization once the optimization problem itself has been solved). Together, these innovations enable parameterized optimization problems to be inserted within the architecture of existing deep networks.
We begin by highlighting background and related work, and then present our optimization layer itself. Using matrix differentials, we derive rules for computing all the necessary backpropagation updates. We then detail our specific solver for these quadratic programs, based upon a state-of-the-art primal-dual interior point method, and highlight the novel elements as they apply to our formulation, such as the aforementioned fact that we can compute backpropagation at very little additional cost. We then provide experimental results that demonstrate the capabilities of the architecture, highlighting potential tasks that these architectures can solve and illustrating improvements upon existing approaches.

2. Background and related work

Optimization plays a key role in modeling complex phenomena and providing concrete decision-making processes in sophisticated environments. A full treatment of optimization applications is beyond our scope (Boyd & Vandenberghe, 2004), but these methods have broad applicability in control frameworks (Morari & Lee, 1999; Sastry & Bodson, 2011), numerous statistical and mathematical formalisms (Sra et al., 2012), and physical simulation problems like rigid body dynamics (Lötstedt, 1984). Generally speaking, our work is a step towards learning the optimization problems behind real-world processes from data end-to-end rather than requiring human specification and intervention.

In the machine learning setting, a wide array of applications consider optimization as a means to perform inference in learning. Among many other applications, these architectures are well-studied for generic classification and structured prediction tasks (Goodfellow et al., 2013; Stoyanov et al., 2011; Brakel et al., 2013; LeCun et al., 2006; Belanger & McCallum, 2016; Belanger et al., 2017; Amos et al., 2017) and in vision for tasks such as denoising (Tappen et al., 2007; Schmidt & Roth, 2014), and Metz et al. (2016) use unrolled optimization within a network to stabilize the convergence of generative adversarial networks (Goodfellow et al., 2014). Indeed, the general idea of solving restricted classes of optimization problems using neural networks goes back many decades (Kennedy & Chua, 1988; Lillo et al., 1993), but has seen a number of advances in recent years. These models are often trained by one of the following four methods.

Energy-based learning methods These methods can be used for tasks like (structured) prediction, where the training method shapes the energy function to be low around the observed data manifold and high elsewhere (LeCun et al., 2006). In recent years, there has been a strong push to further incorporate structured prediction methods like conditional random fields as the "last layer" of a deep network architecture (Peng et al., 2009; Zheng et al., 2015; Chen et al., 2015), as well as in deeper energy-based architectures (Belanger & McCallum, 2016; Belanger et al., 2017; Amos et al., 2017). Learning in this context requires observed data, which isn't present in some of the contexts we consider in this paper, and may also suffer from instability issues when combined with deep energy-based architectures, as observed in Belanger & McCallum (2016); Belanger et al. (2017); Amos et al. (2017).

Analytically If an analytic solution to the argmin can be found, such as in an unconstrained quadratic minimization, the gradients can often also be computed analytically. This is done in Tappen et al. (2007); Schmidt & Roth (2014). We cannot use these methods for the constrained optimization problems we consider in this paper because there are no known analytic solutions.

Unrolling The argmin operation over an unconstrained objective can be approximated by a first-order gradient-based method and unrolled. These architectures typically introduce an optimization procedure such as gradient descent into the inference procedure. This is done in Domke (2012); Amos et al. (2017); Belanger et al. (2017); Metz et al. (2016); Goodfellow et al. (2013); Stoyanov et al. (2011); Brakel et al. (2013). The optimization procedure is unrolled automatically or manually (Domke, 2012) to obtain derivatives during training that incorporate the effects of these in-the-loop optimization procedures. However, unrolling the computation of a method like gradient descent typically requires a substantially larger network and adds substantially to the computational complexity of the network.

In all of these existing cases, the optimization problem is unconstrained and unrolling gradient descent is often easy to do. When constraints are added to the optimization problem, iterative algorithms often use a projection operator that may be difficult to unroll through. In this paper, we do not unroll an optimization procedure but instead use argmin differentiation as described in the next section.

Argmin differentiation Most closely related to our own work, there have been several papers that propose some form of differentiation through argmin operators. These techniques also come up in bilevel optimization (Gould et al., 2016; Kunisch & Pock, 2013) and sensitivity analysis (Bertsekas, 1999; Fiacco & Ishizuka, 1990; Bonnans & Shapiro, 2013). In the case of Gould et al. (2016), the authors describe general techniques for differentiation through optimization problems, but only describe the case of exact equality constraints rather than both equality and inequality constraints (in the case of inequality constraints, they add these via a barrier function).
Amos et al. (2017) consider argmin differentiation within the context of a specific optimization problem (the bundle method) but do not consider a general setting. Johnson et al. (2016) perform implicit differentiation on (multi-)convex objectives with coordinate subspace constraints, but do not consider inequality constraints and do not consider in detail general linear equality constraints. Their optimization problem appears only in the final layer of a variational inference network, while we propose to insert optimization problems anywhere in the network. A special case of OptNet layers (with no inequality constraints) has a natural interpretation in terms of Gaussian inference, and so Gaussian graphical models (or CRF ideas more generally) provide tools for making the computation more efficient and for interpreting or constraining its structure. Similarly, the older work of Mairal et al. (2012) considered argmin differentiation for a LASSO problem, deriving specific rules for this case and presenting an efficient algorithm based upon our ability to solve the LASSO problem efficiently.

In this paper, we use implicit differentiation (Dontchev & Rockafellar, 2009; Griewank & Walther, 2008) and techniques from matrix differential calculus (Magnus & Neudecker, 1988) to derive the gradients from the KKT matrix of the problem we are interested in. A notable difference from other work within ML that we are aware of is that we analytically differentiate through inequality as well as equality constraints by differentiating the complementarity conditions; this differs from, e.g., Gould et al. (2016), where they instead approximately convert the problem to an unconstrained one via a barrier method. We have also developed methods to make this approach practical and reasonably scalable within the context of deep architectures.

3. OptNet: solving optimization within a neural network

Although in its most general form an OptNet layer can be any optimization problem, in this paper we will study OptNet layers defined by a quadratic program

$$\begin{array}{ll} \underset{z}{\text{minimize}} & \tfrac{1}{2} z^T Q z + q^T z \\ \text{subject to} & Az = b, \;\; Gz \leq h \end{array} \tag{2}$$

where z ∈ R^n is our optimization variable, Q ∈ R^{n×n} (a positive semidefinite matrix), q ∈ R^n, A ∈ R^{m×n}, b ∈ R^m, G ∈ R^{p×n}, and h ∈ R^p are the problem data; we leave out the dependence on the previous layer z_i shown in (1) for notational convenience. As is well known, these problems can be solved in polynomial time using a variety of methods; if one desires exact (to numerical precision) solutions, then primal-dual interior point methods, as we will use in a later section, are the current state of the art. In the neural network setting, the optimal solution (or, more generally, a subset of the optimal solution) of this optimization problem becomes the output of our layer, denoted z_{i+1}, and any of the problem data Q, q, A, b, G, h can depend on the value of the previous layer z_i. The forward pass in our OptNet architecture thus involves simply setting up and finding the solution to this optimization problem.

Training deep architectures, however, requires that we not just have a forward pass in our network but also a backward pass. This requires that we compute the derivative of the solution to the QP with respect to its input parameters, a general topic we discussed previously. To obtain these derivatives, we differentiate the KKT conditions (sufficient and necessary conditions for optimality) of (2) at a solution to the problem using techniques from matrix differential calculus (Magnus & Neudecker, 1988). Our analysis here can be extended to more general convex optimization problems.

The Lagrangian of (2) is given by

$$L(z, \nu, \lambda) = \tfrac{1}{2} z^T Q z + q^T z + \nu^T (Az - b) + \lambda^T (Gz - h) \tag{3}$$

where ν are the dual variables on the equality constraints and λ ≥ 0 are the dual variables on the inequality constraints. The KKT conditions for stationarity, primal feasibility, and complementary slackness are

$$\begin{aligned} Qz^\star + q + A^T \nu^\star + G^T \lambda^\star &= 0 \\ Az^\star - b &= 0 \\ D(\lambda^\star)(Gz^\star - h) &= 0, \end{aligned} \tag{4}$$

where D(·) creates a diagonal matrix from a vector and z^⋆, ν^⋆, and λ^⋆ are the optimal primal and dual variables. Taking the differentials of these conditions gives the equations

$$\begin{aligned} dQ\, z^\star + Q\, dz + dq + dA^T \nu^\star + A^T d\nu + dG^T \lambda^\star + G^T d\lambda &= 0 \\ dA\, z^\star + A\, dz - db &= 0 \\ D(Gz^\star - h)\, d\lambda + D(\lambda^\star)(dG\, z^\star + G\, dz - dh) &= 0 \end{aligned} \tag{5}$$

or, written more compactly in matrix form,

$$\begin{bmatrix} Q & G^T & A^T \\ D(\lambda^\star) G & D(Gz^\star - h) & 0 \\ A & 0 & 0 \end{bmatrix} \begin{bmatrix} dz \\ d\lambda \\ d\nu \end{bmatrix} = \begin{bmatrix} -dQ\, z^\star - dq - dG^T \lambda^\star - dA^T \nu^\star \\ -D(\lambda^\star)\, dG\, z^\star + D(\lambda^\star)\, dh \\ -dA\, z^\star + db \end{bmatrix}. \tag{6}$$

Using these equations, we can form the Jacobians of z^⋆ (or of λ^⋆ and ν^⋆, though we don't consider this case here) with respect to any of the data parameters.
For example, if we wished to compute the Jacobian ∂z^⋆/∂b ∈ R^{n×m}, we would simply substitute db = I (and set all other differential terms in the right-hand side to zero), solve the equation, and the resulting value of dz would be the desired Jacobian.

In the backpropagation algorithm, however, we never want to explicitly form the actual Jacobian matrices, but rather want to form the left matrix-vector product with some previous backward pass vector ∂ℓ/∂z^⋆ ∈ R^n, i.e., (∂ℓ/∂z^⋆)(∂z^⋆/∂b). We can do this efficiently by noting that the solution for (dz, dλ, dν) involves multiplying the inverse of the left-hand-side matrix in (6) by some right-hand side. Thus, if we multiply the backward pass vector by the transpose of the differential matrix,

$$\begin{bmatrix} d_z \\ d_\lambda \\ d_\nu \end{bmatrix} = - \begin{bmatrix} Q & G^T D(\lambda^\star) & A^T \\ G & D(Gz^\star - h) & 0 \\ A & 0 & 0 \end{bmatrix}^{-1} \begin{bmatrix} \left(\frac{\partial \ell}{\partial z^\star}\right)^T \\ 0 \\ 0 \end{bmatrix} \tag{7}$$

then the relevant gradients with respect to all the QP parameters are given by

$$\begin{aligned} \frac{\partial \ell}{\partial Q} &= \tfrac{1}{2}\left(d_z z^T + z d_z^T\right) & \frac{\partial \ell}{\partial q} &= d_z \\ \frac{\partial \ell}{\partial A} &= d_\nu z^T + \nu d_z^T & \frac{\partial \ell}{\partial b} &= -d_\nu \\ \frac{\partial \ell}{\partial G} &= D(\lambda^\star)\left(d_\lambda z^T + \lambda d_z^T\right) & \frac{\partial \ell}{\partial h} &= -D(\lambda^\star)\, d_\lambda \end{aligned} \tag{8}$$

where, as in standard backpropagation, all these terms are at most the size of the parameter matrices. We note that some of these parameters may depend on the previous layer z_i, and the gradients with respect to the previous layer can then be obtained through the chain rule. As we will see in the next section, the solution to an interior point method in fact already provides a factorization we can use to compute these gradients efficiently.
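To make (7)-(8) concrete, the following is a minimal NumPy sketch of the backward pass for a single (non-batched) QP. It is for illustration only: it assumes the matrix in (7) is nonsingular at the solution, and it does not reuse a cached factorization the way the batched solver of Section 3.1 does.

```python
import numpy as np


def qp_layer_backward(Q, G, A, h, z_star, lam_star, nu_star, dl_dz):
    """Backward pass of the QP layer via (7)-(8) for one problem.

    Q: (n, n), G: (p, n), A: (m, n), h: (p,); z_star, lam_star, nu_star are the
    optimal primal/dual variables; dl_dz is the incoming gradient dl/dz*.
    """
    n, p, m = Q.shape[0], G.shape[0], A.shape[0]
    D_lam = np.diag(lam_star)
    D_slack = np.diag(G @ z_star - h)          # D(Gz* - h)

    # Transposed differential matrix from (7).
    K = np.block([
        [Q,  G.T @ D_lam,       A.T],
        [G,  D_slack,           np.zeros((p, m))],
        [A,  np.zeros((m, p)),  np.zeros((m, m))],
    ])
    rhs = np.concatenate([dl_dz, np.zeros(p), np.zeros(m)])
    d = -np.linalg.solve(K, rhs)
    dz, dlam, dnu = d[:n], d[n:n + p], d[n + p:]

    # Parameter gradients from (8).
    return {
        "Q": 0.5 * (np.outer(dz, z_star) + np.outer(z_star, dz)),
        "q": dz,
        "A": np.outer(dnu, z_star) + np.outer(nu_star, dz),
        "b": -dnu,
        "G": D_lam @ (np.outer(dlam, z_star) + np.outer(lam_star, dz)),
        "h": -lam_star * dlam,                 # -D(lambda*) d_lambda
    }
```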
3.1. An efficient batched QP solver

Deep networks are typically trained in mini-batches to take advantage of efficient data-parallel GPU operations. Without mini-batching on the GPU, many modern deep learning architectures become intractable for all practical purposes. However, today's state-of-the-art QP solvers like Gurobi and CPLEX do not have the capability of solving multiple optimization problems on the GPU in parallel across the entire minibatch. This makes larger OptNet layers quickly become intractable compared to a fully-connected layer with the same number of parameters.

To overcome this performance bottleneck in our quadratic program layers, we have implemented a GPU-based primal-dual interior point method (PDIPM) based on Mattingley & Boyd (2012) that solves a batch of quadratic programs, and which provides the necessary gradients needed to train these in an end-to-end fashion. Our performance experiments in Section 4.1 show that our solver is significantly faster than the standard non-batch solvers Gurobi and CPLEX.

Following the method of Mattingley & Boyd (2012), our solver introduces slack variables on the inequality constraints and iteratively minimizes the residuals from the KKT conditions over the primal variable z ∈ R^n, slack variable s ∈ R^p, and dual variables ν ∈ R^m associated with the equality constraints and λ ∈ R^p associated with the inequality constraints. Each iteration computes the affine scaling directions by solving

$$K \begin{bmatrix} \Delta z^{\mathrm{aff}} \\ \Delta s^{\mathrm{aff}} \\ \Delta \lambda^{\mathrm{aff}} \\ \Delta \nu^{\mathrm{aff}} \end{bmatrix} = \begin{bmatrix} -(A^T \nu + G^T \lambda + Qz + q) \\ -S\lambda \\ -(Gz + s - h) \\ -(Az - b) \end{bmatrix} \tag{9}$$

where

$$K = \begin{bmatrix} Q & 0 & G^T & A^T \\ 0 & D(\lambda) & D(s) & 0 \\ G & I & 0 & 0 \\ A & 0 & 0 & 0 \end{bmatrix},$$

and then the centering-plus-corrector directions by solving

$$K \begin{bmatrix} \Delta z^{\mathrm{cc}} \\ \Delta s^{\mathrm{cc}} \\ \Delta \lambda^{\mathrm{cc}} \\ \Delta \nu^{\mathrm{cc}} \end{bmatrix} = \begin{bmatrix} 0 \\ \sigma \mu \mathbf{1} - D(\Delta s^{\mathrm{aff}}) \Delta \lambda^{\mathrm{aff}} \\ 0 \\ 0 \end{bmatrix} \tag{10}$$

where μ = s^T λ / p is the duality gap and σ is defined in Mattingley & Boyd (2012). Each variable v is then updated as Δv = Δv^aff + Δv^cc with an appropriate step size. We actually solve a symmetrized version of the KKT conditions, obtained by scaling the second row block by D(1/s). We analytically decompose these systems into smaller symmetric systems and pre-factorize the portions of them that don't change (i.e., that don't involve D(λ/s)) between iterations.
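The following is a simplified, unsymmetrized NumPy sketch of a single iteration of this scheme for one QP. It is illustrative only: it forms K explicitly and uses a plain dense LU solve rather than the cached, symmetrized batch factorizations of our GPU solver, and the fraction-to-the-boundary step-size rule shown is one standard choice rather than our exact rule.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve


def pdipm_iteration(Q, q, G, h, A, b, z, s, lam, nu, sigma=0.1):
    """One simplified primal-dual interior point iteration following (9)-(10)."""
    n, p, m = Q.shape[0], G.shape[0], A.shape[0]
    K = np.block([
        [Q,                 np.zeros((n, p)),  G.T,               A.T],
        [np.zeros((p, n)),  np.diag(lam),      np.diag(s),        np.zeros((p, m))],
        [G,                 np.eye(p),         np.zeros((p, p)),  np.zeros((p, m))],
        [A,                 np.zeros((m, p)),  np.zeros((m, p)),  np.zeros((m, m))],
    ])
    lu = lu_factor(K)                     # one factorization, reused for both solves

    # Affine scaling directions, equation (9).
    r_aff = np.concatenate([
        -(A.T @ nu + G.T @ lam + Q @ z + q),
        -s * lam,
        -(G @ z + s - h),
        -(A @ z - b),
    ])
    dz_a, ds_a, dlam_a, dnu_a = np.split(lu_solve(lu, r_aff), [n, n + p, n + 2 * p])

    # Centering-plus-corrector directions, equation (10); mu is the duality gap.
    mu = s @ lam / p
    r_cc = np.concatenate([
        np.zeros(n),
        sigma * mu * np.ones(p) - ds_a * dlam_a,
        np.zeros(p),
        np.zeros(m),
    ])
    dz_c, ds_c, dlam_c, dnu_c = np.split(lu_solve(lu, r_cc), [n, n + p, n + 2 * p])

    dz, ds, dlam, dnu = dz_a + dz_c, ds_a + ds_c, dlam_a + dlam_c, dnu_a + dnu_c
    # Fraction-to-the-boundary step size keeping s and lambda strictly positive.
    ratios = [-v / dv for v, dv in zip(np.concatenate([s, lam]),
                                       np.concatenate([ds, dlam])) if dv < 0]
    alpha = min(1.0, 0.99 * min(ratios)) if ratios else 1.0
    return z + alpha * dz, s + alpha * ds, lam + alpha * dlam, nu + alpha * dnu
```

Both right-hand sides reuse the same factorization of K; Section 3.1.1 shows how the factorization available at the solution is reused once more for the backward pass.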
We have implemented a batched version of this method with the PyTorch library² and have released it as an open-source library at https://github.com/locuslab/qpth. It uses a custom CUBLAS extension that provides an interface to solve multiple matrix factorizations and solves in parallel, and which provides the necessary backpropagation gradients for use in an end-to-end learning system.

² https://pytorch.org
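Below is a sketch of how the released qpth interface can be used to implement a layer of the form (1) inside a PyTorch module. The exact QPFunction signature may differ across qpth versions, and the particular parameterization (Q as L L^T + εI to keep it positive definite, a learned inequality-only constraint set) is illustrative rather than the configuration used in our experiments.

```python
import torch
from torch import nn
from qpth.qp import QPFunction   # interface may differ across qpth versions


class OptNetLayer(nn.Module):
    """A QP layer z_{i+1} = argmin_z 1/2 z^T Q z + q(z_i)^T z  s.t.  G z <= h.

    Q, G, h are learned; the linear term is produced from the previous layer.
    Q is parameterized as L L^T + eps*I so that it stays positive definite.
    """

    def __init__(self, n_in, n_hidden, n_ineq, eps=1e-4):
        super().__init__()
        self.fc = nn.Linear(n_in, n_hidden)                 # maps z_i to q(z_i)
        self.L = nn.Parameter(torch.tril(torch.rand(n_hidden, n_hidden)))
        self.G = nn.Parameter(torch.randn(n_ineq, n_hidden))
        self.h = nn.Parameter(torch.ones(n_ineq))
        self.eps = eps

    def forward(self, z):
        L = torch.tril(self.L)
        Q = L @ L.t() + self.eps * torch.eye(L.size(0), device=z.device)
        q = self.fc(z)                                      # batched linear term
        e = torch.Tensor()                                  # no equality constraints
        return QPFunction(verbose=False)(Q, q, self.G, self.h, e, e)
```

Here the previous layer's activation enters only through the linear term, but any of Q, G, h, A, b could similarly be produced by differentiable functions of z_i as in (1).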
3.1.1. Efficiently computing gradients

A key point of the particular form of primal-dual interior point method that we employ is that it is possible to compute the backward pass gradients "for free" after solving the original QP, without an additional matrix factorization or solve. Specifically, at each iteration of the primal-dual interior point method, we are computing an LU decomposition of
the matrix K_sym.³ This matrix is essentially a symmetrized version of the matrix needed for computing the backpropagated gradients, and we can similarly compute the d_{z,λ,ν} terms by solving the linear system

$$K_{\mathrm{sym}} \begin{bmatrix} d_z \\ d_s \\ \tilde{d}_\lambda \\ d_\nu \end{bmatrix} = \begin{bmatrix} -\left(\frac{\partial \ell}{\partial z_{i+1}}\right)^T \\ 0 \\ 0 \\ 0 \end{bmatrix} \tag{11}$$

where d̃_λ = D(λ^⋆) d_λ for d_λ as defined in (7). Thus, all the backward pass gradients can be computed using the factored KKT matrix at the solution. Crucially, because the bottleneck of solving this linear system is computing the factorization of the KKT matrix (cubic time, as opposed to the quadratic time for solving via backsubstitution once the factorization is computed), the additional time required for computing all the necessary gradients in the backward pass is virtually nonexistent compared with the time of computing the solution. To the best of our knowledge, this is the first time that this fact has been exploited in the context of learning end-to-end systems.

³ We actually perform an LU decomposition of a certain subset of the matrix formed by eliminating variables, so that only a p × p matrix (p being the number of inequality constraints) needs to be factored during each iteration of the primal-dual algorithm, plus one m × m and one n × n matrix once at the start of the algorithm, though we omit the details here. We use an LU decomposition because this routine is provided in batch form by CUBLAS, but we could potentially use a (faster) Cholesky factorization if and when the appropriate functionality is added to CUBLAS.

3.2. Properties and representational power

In this section we briefly highlight some of the mathematical properties of OptNet layers. The proofs are straightforward and mostly based upon well-known results in convex analysis, so they are deferred to the appendix. The first result simply highlights that (because the solution of strictly convex QPs is continuous) OptNet layers are subdifferentiable everywhere, and differentiable at all but a measure-zero set of points.

Theorem 1. Let z^⋆(θ) be the output of an OptNet layer, where θ = {Q, q, A, b, G, h}. Assuming Q ≻ 0 and that A has full row rank, z^⋆(θ) is subdifferentiable everywhere: ∂z^⋆(θ) ≠ ∅, where ∂z^⋆(θ) denotes the Clarke generalized subdifferential (Clarke, 1975) (an extension of the subgradient to non-convex functions), and it has a single unique element (the Jacobian) for all but a measure-zero set of points θ.

The next two results show the representational power of the OptNet layer, specifically how an OptNet layer compares to the common linear layer followed by a ReLU activation. The first theorem shows that an OptNet layer can approximate arbitrary elementwise piecewise-linear functions, and so, among other things, can represent a ReLU layer.

Theorem 2. Let f : R^n → R^n be an elementwise piecewise linear function with k linear regions. Then the function can be represented as an OptNet layer using O(nk) parameters. Additionally, the layer z_{i+1} = max{W z_i + b, 0} for W ∈ R^{n×m}, b ∈ R^n can be represented by an OptNet layer with O(mn) parameters.
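As a concrete instance of the second part of Theorem 2, note that max{W z_i + b, 0} is the unique solution of the separable QP min_z ½‖z − (W z_i + b)‖² subject to z ≥ 0, which has the form (2) with Q = I, q = −(W z_i + b), G = −I, and h = 0. A quick numerical check (using CVXPY purely for illustration):

```python
import numpy as np
import cvxpy as cp


def relu_via_qp(W, b, z_prev):
    """max(W z_prev + b, 0) as the solution of a QP of the form (2):
    argmin_z 1/2 ||z - (W z_prev + b)||^2  subject to  z >= 0."""
    y = W @ z_prev + b
    z = cp.Variable(y.shape[0])
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(z - y)), [z >= 0]).solve()
    return z.value


rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
b = rng.standard_normal(4)
z_prev = rng.standard_normal(3)
assert np.allclose(relu_via_qp(W, b, z_prev),
                   np.maximum(W @ z_prev + b, 0), atol=1e-4)
```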
Finally, we show that the converse does not hold: that there
are function representable by an OptNet layer which cannot
where d˜λ = D(λ? )dλ for dλ as defined in (7). Thus, all
be represented exactly by a two-layer ReLU layer, which
the backward pass gradients can be computed using the
take exponentially many units to approximate (known to
factored KKT matrix at the solution. Crucially, because
be a universal function approximator). A simple ex-
the bottleneck of solving this linear system is computing
ample of such a layer (and one which we use in the
the factorization of the KKT matrix (cubic time as op-
proof) is just the max over three linear functions f (z) =
posed to the quadratic time for solving via backsubstitution
max{aT1 x, aT2 x, aT3 x}.
once the factorization is computed), the additional time re-
quirements for computing all the necessary gradients in the Theorem 3. Let f (z) : Rn → R be a scalar-valued func-
backward pass is virtually nonexistent compared with the tion specified by an POptNet layer with p parameters. Con-
m
time of computing the solution. To the best of our knowl- versely, let f 0 (z) = i=1 wi max{aTi z + bi , 0} be the out-
edge, this is the first time that this fact has been exploited put of a two-layer ReLU network. Then there exist func-
in the context of learning end-to-end systems. tions that the ReLU network cannot represent exactly over
all of R, and which require O(cp ) parameters to approxi-
3.2. Properties and representational power mate over a finite region.
In this section we briefly highlight some of the mathe- 3.3. Limitations of the method
matical properties of OptNet layers. The proofs here are
straightforward, and are mostly based upon well-known re- Although, as we will show shortly, the OptNet layer has
sults in convex analysis, so are deferred to the appendix. several strong points, we also want to highlight the poten-
The first result simply highlights that (because the solution tial drawbacks of this approach. First, although, with an
of strictly convex QPs is continuous), that OptNet layers efficient batch solver, integrating an OptNet layer into ex-
are subdifferentiable everywhere, and differentiable at all isting deep learning architectures is potentially practical,
but a measure-zero set of points. we do note that solving optimization problems exactly as
we do here has has cubic complexity in the number of vari-
Theorem 1. Let z ? (θ) be the output of an OptNet layer,
ables and/or constraints. This contrasts with the quadratic
where θ = {Q, p, A, b, G, h}. Assuming Q 0 and that
complexity of standard feedforward layers. This means that
A has full row rank, then z ? (θ) is subdifferentiable every-
we are ultimately limited to settings where the number of
where: ∂z ? (θ) 6= ∅, where ∂z ? (θ) denotes the Clarke
hidden variables in an OptNet layer is not too large (less
generalized subdifferential (Clarke, 1975) (an extension of
than 1000 dimensions seems to be the limits of what we
the subgradient to non-convex functions), and has a single
currently find to the be practical, and substantially less if
unique element (the Jacobian) for all but a measure zero
one wants real-time results for an architecture).
set of points θ.
Secondly, there are many improvements to the OptNet lay-
The next two results show the representational power of the ers that are still possible. Our QP solver, for instance, uses
OptNet layer, specifically how an OptNet layer compares fully dense matrix operations, which makes the solves very
to the common linear layer followed by a ReLU activation. efficient for GPU solutions, and which also makes sense for
The first theorem shows that an OptNet layer can approxi- our general setting where the coefficients of the quadratic
3 problem can be learned. However, for setting many real-
We actually perform an LU decomposition of a certain sub-
set of the matrix formed by eliminating variables to create only world optimization problems (and hence for architectures
a p × p matrix (the number of inequality constraints) that needs that wish to more closely mimic some real-world opti-
to be factor during each iteration of the primal-dual algorithm, mization problem), there is often substantial structure (e.g.,
and one m × m and one n × n matrix once at the start of the sparsity), in the data matrices that can be exploited for ef-
primal-dual algorithm, though we omit the detail here. We also ficiency. There is of course no prohibition of incorporat-
use an LU decomposition as this routine is provided in batch form
by CUBLAS, but could potentially use a (faster) Cholesky fac- ing sparse matrix methods into the fast custom solver, but
torization if and when the appropriate functionality is added to doing so would require substantial added complexity, espe-
CUBLAS). cially regarding efforts like finding minimum fill orderings
for different sparsity patterns of the KKT systems. In our open source solver qpth, we have started experimenting with cuSOLVER's batched sparse QR factorizations and solves.

Lastly, we note that while OptNet layers can be trained just as any other neural network layer, since they are a new creation and since they have manifolds in the parameter space which have no effect on the resulting solution (e.g., scaling the rows of a constraint matrix and its right-hand side does not change the optimization problem), there is admittedly more tuning required to get them to work. This situation is common when developing new neural network architectures and has also been reported in the similar architecture of Schmidt & Roth (2014). Our hope is that techniques for overcoming some of the challenges in learning these layers will continue to be developed in future work.

4. Experimental results

In this section, we present several experimental results that highlight the capabilities of the QP OptNet layer. Specifically, we look at 1) computational efficiency over existing solvers; 2) the ability to improve upon existing convex problems such as those used in signal denoising; 3) integrating the architecture into generic deep learning architectures; and 4) performance of our approach on a problem that is challenging for current approaches. In particular, we want to emphasize the results of our system on learning the game of (4x4) mini-Sudoku, a well-known logical puzzle; our layer is able to directly learn the necessary constraints using just gradient information and no a priori knowledge of the rules of Sudoku. The code and data for our experiments are open sourced in the icml2017 branch of https://github.com/locuslab/optnet and our batched QP solver is available as a library at https://github.com/locuslab/qpth.

4.1. Batch QP solver performance

All of the OptNet performance results in this section are run on an unloaded Titan X GPU. Gurobi is run on an unloaded quad-core Intel Core i7-5960X CPU @ 3.00GHz.

Our OptNet layers are much more computationally expensive than a linear or convolutional layer, and a natural question is to ask what the performance difference is. We set up an experiment comparing a linear layer to a QP OptNet layer with a mini-batch size of 128 on CUDA with randomly generated input vectors of size 10, 50, 100, and 500. Each layer maps this input to an output of the same dimension; the linear layer does this with a batched matrix-vector multiplication, and the OptNet layer does this by taking the argmin of a random QP that has the same number of inequality constraints as the dimensionality of the problem. Figure 1 shows the profiling results (averaged over 10 trials) of the forward and backward passes. The OptNet layer is significantly slower than the linear layer, as expected, yet still tractable in many practical contexts.

Figure 1. Performance of a linear layer and a QP layer (batch size 128); runtime (s) vs. number of variables (and inequality constraints).

Our next experiment illustrates why standard baseline QP solvers like CPLEX and Gurobi without batch support are too computationally expensive for QP OptNet layers to be tractable. We set up random QPs of the form (1) that have 100 variables and 100 inequality constraints in Gurobi and in the serialized and batched versions of our solver qpth, and vary the batch size.⁴

⁴ Experimental details: we sample the entries of a matrix U from a random uniform distribution and set Q = U^T U + 10^{-3} I, sample G with random normal entries, and set h by generating a random normal z_0 and a random uniform s_0 and setting h = G z_0 + s_0 (we didn't include equality constraints just for simplicity, and since the number of inequality constraints is the primary driver of complexity for the iterations in a primal-dual interior point method). This choice of h guarantees the problem is feasible.
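For reference, the following is a sketch of the random problem generation described in footnote 4; the linear term q is not specified there, so a standard normal draw is assumed here.

```python
import numpy as np


def random_feasible_qp(n, m, seed=0):
    """Random QP instances as in footnote 4: Q = U^T U + 1e-3 I with U uniform,
    G normal, and h = G z0 + s0 with s0 >= 0 so that z0 is strictly feasible;
    no equality constraints. (q is an assumption; it is not given in the text.)"""
    rng = np.random.default_rng(seed)
    U = rng.uniform(size=(n, n))
    Q = U.T @ U + 1e-3 * np.eye(n)
    q = rng.standard_normal(n)
    G = rng.standard_normal((m, n))
    z0, s0 = rng.standard_normal(n), rng.uniform(size=m)
    h = G @ z0 + s0
    return Q, q, G, h


Q, q, G, h = random_feasible_qp(n=100, m=100)   # the setting used in Figure 2
```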
Figure 2 shows the means and standard deviations of running each trial 10 times, showing that our batched solver outperforms Gurobi, itself a highly tuned solver, for reasonable batch sizes. For a minibatch size of 128, we solve all problems in an average of 0.18 seconds, whereas Gurobi takes an average of 4.7 seconds. In the context of training a deep architecture, this type of speed difference for a single minibatch can make the difference between a practical and a completely unusable solution.

Figure 2. Performance of Gurobi and our QP solver; runtime (s) vs. batch size.

4.2. Total variation denoising

Our next experiment studies how we can use the OptNet architecture to improve upon signal processing techniques
Figure 4. Example mini-Sudoku initial problem and solution.

4.3. MNIST

One compelling use case of an OptNet layer is to learn constraints and dependencies over the output or latent space of a model. As a simple example to illustrate that OptNet layers can be included in existing architectures and that the gradients can be efficiently propagated through the layer, we show the performance of a fully-connected feedforward network with and without an OptNet layer in Section A of the supplemental material.

4.4. Sudoku

Finally, we present the main illustrative example of the representational power of our approach: the task of learning the game of Sudoku. Sudoku is a popular logical puzzle in which a (typically 9x9) grid of cells must be filled in, given some initial entries, so that each row, each column, and each 3x3 subgrid contains each of the numbers 1 through 9. We consider the simpler case of 4x4 Sudoku puzzles, with numbers 1 through 4, as shown in Figure 4.

Sudoku is fundamentally a constraint satisfaction problem and is trivial for computers to solve when told the rules of the game. However, if we do not know the rules of the game but are only presented with examples of unsolved puzzles and their corresponding solutions, this is a challenging task. We consider this to be an interesting benchmark task for algorithms that seek to capture complex strict relationships between all input and output variables. The input to the algorithm consists of a 4x4 grid (really a 4x4x4 tensor with a one-hot encoding for known entries and all zeros for unknown entries), and the desired output is a 4x4x4 tensor of the one-hot encoding of the solution.

This is a problem where traditional neural networks have difficulties learning the necessary hard constraints. As a baseline inspired by the models at https://github.com/Kyubyong/sudoku, we implemented a multilayer feedforward network to attempt to solve Sudoku problems. Specifically, we report results for a network that has 10 convolutional layers with 512 3x3 filters each, and we tried other architectures as well. The OptNet layer we use on this task is a completely generic QP in "standard form": it has only positivity inequality constraints (z ≥ 0) but an arbitrary learned equality constraint matrix, Az = b, a small Q = 0.1I to make the problem strictly convex, and a linear term q that comes simply from the input one-hot encoding of the Sudoku problem. We know that Sudoku can be approximated well with a linear program (indeed, integer programming is a typical solution method for such problems), but the model here is told nothing about the rules of Sudoku.
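For concreteness, a sketch of such a Sudoku layer using the qpth interface is shown below. The number of learned equality constraints, the parameter initialization, and the sign convention on q are illustrative assumptions rather than the exact configuration used in our experiments, and the QPFunction interface may differ across qpth versions.

```python
import torch
from torch import nn
from qpth.qp import QPFunction   # interface may differ across qpth versions


class SudokuOptNet(nn.Module):
    """Generic "standard form" QP layer for 4x4 mini-Sudoku:
    minimize 1/2 z^T (0.1 I) z + q^T z  s.t.  A z = b,  z >= 0,
    where q comes from the one-hot puzzle encoding and A, b are learned."""

    def __init__(self, n=4 ** 3, n_eq=40):            # n_eq is a placeholder choice
        super().__init__()
        self.A = nn.Parameter(torch.rand(n_eq, n))
        self.b = nn.Parameter(torch.rand(n_eq))
        self.register_buffer("Q", 0.1 * torch.eye(n))
        self.register_buffer("G", -torch.eye(n))       # G z <= h encodes z >= 0
        self.register_buffer("h", torch.zeros(n))

    def forward(self, puzzle_one_hot):                 # (batch, 4, 4, 4) tensor
        q = -puzzle_one_hot.flatten(1)                 # sign convention is an assumption
        z = QPFunction(verbose=False)(self.Q, q, self.G, self.h, self.A, self.b)
        return z.view_as(puzzle_one_hot)
```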
Figure 5. Sudoku training plots (loss and error vs. epoch).

We trained these models using ADAM (Kingma & Ba, 2014) to minimize the MSE (which we refer to as "loss") on a dataset we created consisting of 9000 training puzzles, and we then tested the models on 1000 different held-out puzzles. The error rate is the percentage of puzzles not solved correctly when each cell is assigned to whichever index is largest in the prediction. Figure 5 shows that the convolutional network is able to learn all of the necessary logic for the task but ends up over-fitting to the training data. We contrast this with the performance of the OptNet network, which learns most of the correct hard constraints within the first three epochs and is able to generalize much better to unseen examples.

5. Conclusion

We have presented OptNet, a neural network architecture in which we use optimization problems as individual layers in the network. We have derived the algorithmic formulation for differentiating through these layers, allowing for backpropagation in end-to-end architectures. We have also developed an efficient batch solver for these optimizations based upon a primal-dual interior point method, and developed a method for attaining the necessary gradient information "for free" from this approach. Our experiments highlight the potential power of these networks, showing that they can solve problems where existing networks are very poorly suited, such as learning Sudoku problems purely from data. There are many future directions of research for these approaches, but we feel that they add another important primitive to the toolbox of neural network practitioners.
References

Bertsekas, Dimitri P. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999.

Bonnans, J. Frédéric and Shapiro, Alexander. Perturbation Analysis of Optimization Problems. Springer Science & Business Media, 2013.

Boyd, Stephen and Vandenberghe, Lieven. Convex Optimization. Cambridge University Press, 2004.

Griewank, Andreas and Walther, Andrea. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM, 2008.

Johnson, Matthew, Duvenaud, David K., Wiltschko, Alex, Adams, Ryan P., and Datta, Sandeep R. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pp. 2946–2954, 2016.