Differentiable Convex Optimization Layers
Akshay Agrawal, Brandon Amos, Shane Barratt (authors listed in alphabetical order)
Abstract
Recent work has shown how to embed differentiable optimization problems (that is,
problems whose solutions can be backpropagated through) as layers within deep
learning architectures. This method provides a useful inductive bias for certain
problems, but existing software for differentiable optimization layers is rigid and
difficult to apply to new settings. In this paper, we propose an approach to differ-
entiating through disciplined convex programs, a subclass of convex optimization
problems used by domain-specific languages (DSLs) for convex optimization. We
introduce disciplined parametrized programming, a subset of disciplined convex
programming, and we show that every disciplined parametrized program can be
represented as the composition of an affine map from parameters to problem data,
a solver, and an affine map from the solver’s solution to a solution of the original
problem (a new form we refer to as affine-solver-affine form). We then demonstrate
how to efficiently differentiate through each of these components, allowing for
end-to-end analytical differentiation through the entire convex program. We im-
plement our methodology in version 1.1 of CVXPY, a popular Python-embedded
DSL for convex optimization, and additionally implement differentiable layers for
disciplined convex programs in PyTorch and TensorFlow 2.0. Our implementation
significantly lowers the barrier to using convex optimization problems in differen-
tiable programs. We present applications in linear machine learning models and in
stochastic control, and we show that our layer is competitive (in execution time)
compared to specialized differentiable solvers from past work.
1 Introduction
Recent work has shown how to differentiate through specific subclasses of convex optimization
problems, which can be viewed as functions mapping problem data to solutions [6, 31, 10, 1,
4]. These layers have found several applications [40, 6, 35, 27, 5, 53, 75, 52, 12, 11], but many
applications remain relatively unexplored (see, e.g., [4, §8]).
While convex optimization layers can provide useful inductive bias in end-to-end models, their
adoption has been slowed by how difficult they are to use. Existing layers (e.g., [6, 1]) require users
to transform their problems into rigid canonical forms by hand. This process is tedious, error-prone,
and time-consuming, and often requires familiarity with convex analysis. Domain-specific languages
(DSLs) for convex optimization abstract away the process of converting problems to canonical forms,
letting users specify problems in a natural syntax; programs are then lowered to canonical forms and
supplied to numerical solvers behind-the-scenes [3]. DSLs enable rapid prototyping and make convex
optimization accessible to scientists and engineers who are not necessarily experts in optimization.
The point of this paper is to do what DSLs have done for convex optimization, but for differentiable
convex optimization layers. In this work, we show how to efficiently differentiate through disciplined
convex programs [45]. This is a large class of convex optimization problems that can be parsed and
solved by most DSLs for convex optimization, including CVX [44], CVXPY [29, 3], Convex.jl [72],
and CVXR [39]. Concretely, we introduce disciplined parametrized programming (DPP), a grammar
for producing parametrized disciplined convex programs. Given a program produced by DPP, we
show how to obtain an affine map from parameters to problem data, and an affine map from a solution
of the canonicalized problem to a solution of the original problem. We refer to this representation of
a problem — i.e., the composition of an affine map from parameters to problem data, a solver, and an
affine map to retrieve a solution — as affine-solver-affine (ASA) form.
Our contributions are three-fold:
1. We introduce DPP, a new grammar for parametrized convex optimization problems, and ASA
form, which ensures that the mapping from problem parameters to problem data is affine. DPP and
ASA form make it possible to differentiate through DSLs for convex optimization without explicitly
backpropagating through the operations of the canonicalizer. We present DPP and ASA form in §4.
2. We implement the DPP grammar and a reduction from parametrized programs to ASA form in
CVXPY 1.1. We also implement differentiable convex optimization layers in PyTorch [66] and
TensorFlow 2.0 [2]. Our software substantially lowers the barrier to using convex optimization layers
in differentiable programs and neural networks (§5).
3. We present applications to sensitivity analysis for linear machine learning models, and to learning
control-Lyapunov policies for stochastic control (§6). We also show that for quadratic programs
(QPs), our layer’s runtime is competitive with OptNet’s specialized solver qpth [6] (§7).
2 Related work
DSLs for convex optimization. DSLs for convex optimization allow users to specify convex
optimization problems in a natural way that follows the math. At the foundation of these languages is
a ruleset from convex analysis known as disciplined convex programming (DCP) [45]. A mathematical
program written using DCP is called a disciplined convex program, and all such programs are convex.
Disciplined convex programs can be canonicalized to cone programs by expanding each nonlinear
function into its graph implementation [43]. DPP can be seen as a subset of DCP that mildly restricts
the way parameters (symbolic constants) can be used; a similar grammar is described in [26]. The
techniques used in this paper to canonicalize parametrized programs are similar to the methods
used by code generators for optimization problems, such as CVXGEN [60], which targets QPs, and
QCML, which targets second-order cone programs (SOCPs) [26, 25].
3 Background
Convex optimization problems. A parametrized convex optimization problem can be represented
as
minimize    f_0(x; θ)
subject to  f_i(x; θ) ≤ 0,   i = 1, …, m_1,        (1)
            g_i(x; θ) = 0,   i = 1, …, m_2,
where x ∈ R^n is the optimization variable and θ ∈ R^p is the parameter vector [22, §4.2]. The functions f_i are convex in x, and the functions g_i are affine in x. A solution to (1) is any vector x⋆ ∈ R^n that minimizes the objective function, among all choices that satisfy the constraints. The problem (1) can be viewed as a (possibly multi-valued) function that maps a parameter to solutions. In this paper, we consider the case when this solution map is single-valued, and we denote it by S : R^p → R^n; the function S maps a parameter θ to a solution x⋆. From the perspective of end-to-end learning, θ (or parameters it depends on) is learned in order to minimize some scalar function of x⋆. In this paper, we show how to obtain the derivative of S with respect to θ, when (1) is a DPP-compliant program (and when the derivative exists).
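For intuition, consider the scalar problem of minimizing (x − θ)² over x ∈ R: its solution map is S(θ) = θ, so DS(θ) = 1, and the gradient of any downstream loss ℓ(x⋆) passes through the layer unchanged. The programs treated in this paper are far richer, but the object being computed is the same.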
We focus on convex optimization because it is a powerful modeling tool, with applications in control
[20, 16, 71], finance [57, 19], energy management [63], supply chain [17, 15], physics [51, 8],
computational geometry [73], aeronautics [48], and circuit design [47, 21], among other fields.
Disciplined convex programming. DCP is a grammar for constructing convex optimization prob-
lems [45, 43]. It consists of functions, or atoms, and a single rule for composing them. An atom is a
function with known curvature (affine, convex, or concave) and per-argument monotonicities. The
composition rule is based on the following theorem from convex analysis. Suppose h : R^k → R is convex, nondecreasing in the arguments indexed by a set I_1 ⊆ {1, 2, …, k}, and nonincreasing in the arguments indexed by I_2. Suppose also that g_i : R^n → R are convex for i ∈ I_1, concave for i ∈ I_2, and affine for i ∈ (I_1 ∪ I_2)^c. Then the composition f(x) = h(g_1(x), g_2(x), …, g_k(x)) is convex.
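For example, exp(‖x‖₂) is convex: exp is convex and nondecreasing, and ‖x‖₂ is convex.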
DCP allows atoms to be composed so long as the composition satisfies this composition theorem.
Every disciplined convex program is a convex optimization problem, but the converse is not true.
This is not a limitation in practice, because atom libraries are extensible (i.e., the class corresponding
to DCP is parametrized by which atoms are implemented). In this paper, we consider problems of the
form (1) in which the functions fi and gi are constructed using DPP, a version of DCP that performs
parameter-dependent curvature analysis (see §4.1).
Cone programs. Our method works by canonicalizing disciplined parametrized programs to cone programs. A cone program is a convex optimization problem of the form

minimize    cᵀx
subject to  Ax + s = b,        (2)
            s ∈ K,

with variable x ∈ R^n and slack variable s ∈ R^m, where K ⊆ R^m is a nonempty, closed, convex cone; the problem data are A ∈ R^{m×n}, b ∈ R^m, and c ∈ R^n.

4 Differentiating through disciplined convex programs

In this section, we show how to reduce a disciplined parametrized program to ASA form and differentiate through it. (We use Df(x) to denote the derivative of a function f evaluated at x, and Dᵀf(x) to denote the adjoint of the derivative at x.) We consider
the special case of canonicalizing a disciplined convex program to a cone program. With little extra
effort, our method can be extended to other targets.
We express S as the composition R ∘ s ∘ C: the canonicalizer C maps parameters to cone problem data (A, b, c); the cone solver s solves the cone problem, furnishing a solution x̃⋆; and the retriever R maps x̃⋆ to a solution x⋆ of the original problem. A problem is in ASA form if C and R are affine. By the chain rule, the adjoint of the derivative of a disciplined convex program is

DᵀS(θ) = DᵀC(θ) Dᵀs(A, b, c) DᵀR(x̃⋆).
The remainder of this section proceeds as follows. In §4.1, we present DPP, a ruleset for constructing
disciplined convex programs reducible to ASA form. In §4.2, we describe the canonicalization
procedure and show how to represent C as a sparse matrix. In §4.3, we review how to differentiate
through cone programs, and in §4.4, we describe the form of R.
4.1 Disciplined parametrized programming

DPP is a grammar for producing parametrized disciplined convex programs from a set of functions, or
atoms, with known curvature (constant, affine, convex, or concave) and per-argument monotonicities.
A program produced using DPP is called a disciplined parametrized program. Like DCP, DPP is
based on the well-known composition theorem for convex functions, and it guarantees that every
function appearing in a disciplined parametrized program is affine, convex, or concave. Unlike DCP,
DPP also guarantees that the produced program can be reduced to ASA form.
A disciplined parametrized program is an optimization problem of the form

minimize    f_0(x, θ)
subject to  f_i(x, θ) ≤ f̃_i(x, θ),   i = 1, …, m_1,        (3)
            g_i(x, θ) = g̃_i(x, θ),   i = 1, …, m_2,

where x ∈ R^n is a variable, θ ∈ R^p is a parameter, the f_i are convex, the f̃_i are concave, the g_i and g̃_i are affine, and the expressions are constructed using DPP. An expression can be thought of as a tree,
where the nodes are atoms and the leaves are variables, constants, or parameters. A parameter is a
symbolic constant with known properties such as sign but unknown numeric value. An expression is
said to be parameter-affine if it does not have variables among its leaves and is affine in its parameters;
an expression is parameter-free if it is not parametrized, and variable-free if it does not have variables.
Every DPP program is also DCP, but the converse is not true. DPP generates programs reducible to
ASA form by introducing two restrictions on expressions involving parameters:
1. In DCP, we classify the curvature of each subexpression appearing in the problem description
as convex, concave, affine, or constant. All parameters are classified as constant. In DPP,
parameters are classified as affine, just like variables.
2. In DCP, the product atom φprod (x, y) = xy is affine if x or y is a constant (i.e., variable-free).
Under DPP, the product is affine when at least one of the following is true:
• x or y is constant (i.e., both parameter-free and variable-free);
• one of the expressions is parameter-affine and the other is parameter-free.
The DPP specification can (and may in the future) be extended to handle several other combinations
of expressions and parameters.
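To make the product rule concrete, here is a small sketch using the DPP check that CVXPY 1.1 exposes (we assume the is_dpp method on expressions, matching the problem-level check used in §5):

import cvxpy as cp

x = cp.Variable()
p1 = cp.Parameter()
p2 = cp.Parameter()

print((p1 * x).is_dpp())   # True: p1 is parameter-affine, x is parameter-free
print((p1 * p2).is_dpp())  # False: both arguments are parametrized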
Example. As a running example, consider the program

minimize    ‖Fx − g‖₂ + λ‖x‖₂
subject to  x ≥ 0,        (4)

where x ∈ R^n is the variable and F ∈ R^{m×n}, g ∈ R^m, and λ ≥ 0 are parameters. This program is DPP:
• Fx − g is affine, because Fx and −g are affine and the sum of affine expressions is affine;
• ‖Fx − g‖₂ is convex, because ‖·‖₂ is convex and a convex function of an affine expression is convex;
• φ_prod(λ, ‖x‖₂) is convex, because the product is affine (λ is parameter-affine and ‖x‖₂ is parameter-free), it is increasing in ‖x‖₂ (because λ is nonnegative), and ‖x‖₂ is convex;
• the objective is convex because the sum of convex expressions is convex.
When an expression involving parameters is not DPP, it can often be rewritten in a DPP-compliant way:

• The expression φ_prod(p_1, p_2) is not DPP because both of its arguments are parametrized. It can be rewritten in a DPP-compliant way by introducing a variable s, replacing p_1 p_2 with the expression p_1 s, and adding the constraint s = p_2.
• Let e be an expression. The quotient e/p_1 is not DPP, but it can be rewritten as e p_2, where p_2 is a new parameter representing 1/p_1.
• The expression log |p_1| is not DPP because log is concave and increasing while |·| is convex. It can be rewritten as log p_2, where p_2 is a new parameter representing |p_1|.
• If P_1 ∈ R^{n×n} is a parameter representing a (symmetric) positive semidefinite matrix and x ∈ R^n is a variable, the expression φ_quadform(x, P_1) = xᵀP_1 x is not DPP. It can be rewritten as ‖P_2 x‖₂², where P_2 is a new parameter representing P_1^{1/2}.
4.2 Canonicalization
The canonicalization of a disciplined parametrized program to ASA form is similar to the canoni-
calization of a disciplined convex program to a cone program. All nonlinear atoms are expanded
into their graph implementations [43], generating affine expressions of variables. The resulting
expressions are also affine in the problem parameters due to the DPP rules. Because these expressions
represent the problem data for the cone program, the function C from parameters to problem data is
affine.
As an example, the DPP program (4) can be canonicalized to the cone program

minimize    t_1 + λt_2
subject to  (t_1, Fx − g) ∈ Q^{m+1},
            (t_2, x) ∈ Q^{n+1},        (5)
            x ∈ R^n_+,

where (t_1, t_2, x) is the variable, Q^n is the n-dimensional second-order cone, and R^n_+ is the nonnegative orthant. When rewritten in the standard form (2), this problem has data

A = [ −1   0    0 ]        [  0 ]        [ 1 ]
    [  0   0   −F ]    b = [ −g ]    c = [ λ ],    K = Q^{m+1} × Q^{n+1} × R^n_+,
    [  0  −1    0 ]        [  0 ]        [ 0 ]
    [  0   0   −I ]        [  0 ]
    [  0   0   −I ]        [  0 ]

where the row blocks of A and b correspond to the three cones in K. In this case, the parameters F, g, and λ are just negated and copied into the problem data.
The canonicalization map. The full canonicalization procedure (which includes expanding graph
implementations) only runs the first time the problem is canonicalized. When the same problem
is canonicalized in the future (e.g., with new parameter values), the problem data (A, b, c) can be
obtained by multiplying a sparse matrix representing C by the parameter vector (and reshaping);
the adjoint of the derivative can be computed by just transposing the matrix. The naïve alternative
— expanding graph implementations and extracting new problem data every time parameters are
updated (and differentiating through this algorithm in the backward pass) — is much slower (see §7).
The following lemma tells us that C can be represented as a sparse matrix.
Lemma 1. The canonicalizer map C for a disciplined parametrized program can be represented with a sparse matrix Q ∈ R^{n×(p+1)} and a sparse tensor R ∈ R^{m×(n+1)×(p+1)}, where m is the dimension of the constraints. Letting θ̃ ∈ R^{p+1} denote the concatenation of θ and the scalar offset 1, the problem data can be obtained as

c = Qθ̃,    [A  b] = Σ_{i=1}^{p+1} R_{[:,:,i]} θ̃_i.
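Concretely, applying C and its adjoint reduces to sparse matrix and tensor contractions; a minimal NumPy sketch follows (dense arrays stand in for the sparse Q and R, and the dimensions are made up):

import numpy as np

n, m, p = 3, 4, 2                        # illustrative cone-program sizes
Q = np.zeros((n, p + 1))                 # sparse in the real implementation
R = np.zeros((m, n + 1, p + 1))          # sparse in the real implementation

theta_tilde = np.append(np.array([1.5, -0.5]), 1.0)   # θ̃ = (θ, 1)

c = Q @ theta_tilde                                    # c = Q θ̃
Ab = np.tensordot(R, theta_tilde, axes=([2], [0]))     # [A b] = Σ_i R[:, :, i] θ̃_i
A, b = Ab[:, :n], Ab[:, n]

# adjoint of C: transpose the matrix / reverse the tensor contraction
dc, dAb = np.ones(n), np.ones((m, n + 1))
dtheta = (Q.T @ dc + np.tensordot(R, dAb, axes=([0, 1], [0, 1])))[:p]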
4.3 Differentiating through cone programs

By applying the implicit function theorem [36, 34] to the optimality conditions of a cone program, it is possible to compute its derivative Ds(A, b, c). To compute Dᵀs(A, b, c), we follow the methods presented in [1] and [4, §7.3]. Our calculations are given in Appendix B.
If the cone program is not differentiable at a solution, we compute a heuristic quantity, as is common
practice in automatic differentiation [46, §14]. In particular, at non-differentiable points, a linear
system that arises in the computation of the derivative might fail to be invertible. When this happens,
we compute a least-squares solution to the system instead. See Appendix B for details.
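A minimal sketch of this fallback, with a made-up singular matrix M standing in for the system matrix of Appendix B:

import numpy as np
from scipy.sparse.linalg import lsqr

M = np.array([[1.0, 0.0], [0.0, 0.0]])   # singular: a direct solve of Mᵀg = dz fails
dz = np.array([1.0, 1.0])
g_ls = lsqr(M.T, dz)[0]                   # least-squares solution, used as the heuristic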
4.4 Solution retrieval

The cone program obtained by canonicalizing a DPP-compliant problem uses the variable x̃ = (x, s) ∈ R^n × R^k, where s ∈ R^k is a slack variable. If x̃⋆ = (x⋆, s⋆) is optimal for the cone program, then x⋆ is optimal for the original problem (up to reshaping and scaling by a constant). As such, a solution to the original problem can be obtained by slicing, i.e., R(x̃⋆) = x⋆. This map is evidently linear.
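In code, the retrieval map is just a slice (a one-line sketch, assuming the original variable occupies the first n entries of the cone-program solution):

x_star = x_tilde_star[:n]   # R(x̃⋆): drop the slack components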
5 Implementation
We have implemented DPP and the reduction to ASA form in version 1.1 of CVXPY, a Python-
embedded DSL for convex optimization [29, 3]; our implementation extends CVXCanon, an open-
source library that reduces affine expression trees to matrices [62]. We have also implemented
differentiable convex optimization layers in PyTorch and TensorFlow 2.0. These layers implement
the forward and backward maps described in §4; they also efficiently support batched inputs (see §7).
We use the diffcp package [1] to obtain derivatives of cone programs. We modified this package for performance: we ported much of it from Python to C++, added an option to compute the derivative using a dense direct solve, and made the forward and backward passes amenable to parallelization.
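For readers who want to differentiate through a cone program directly, here is a sketch of calling diffcp (we assume its solve_and_derivative entry point, with A a SciPy CSC matrix and the cone described in diffcp's dictionary format):

import numpy as np
import scipy.sparse as sp
import diffcp

# a tiny LP: minimize c^T x subject to x >= 0, encoded as -x + s = 0, s in R^2_+
A = sp.csc_matrix(-np.eye(2))
b = np.zeros(2)
c = np.array([1.0, 2.0])
cones = {"l": 2}   # two nonnegative-cone rows

x, y, s, D, DT = diffcp.solve_and_derivative(A, b, c, cones)
dA, db, dc = DT(np.ones(2), np.zeros(2), np.zeros(2))   # adjoint applied to dx = 1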
Our implementation of DPP and ASA form, coupled with our PyTorch and TensorFlow layers, makes
our software the first DSL for differentiable convex optimization layers. Our software is open-source.
CVXPY and our layers are available at
https://www.cvxpy.org, https://www.github.com/cvxgrp/cvxpylayers.
Example. Below is an example of how to specify the problem (4) using CVXPY 1.1.
import cvxpy as cp

m, n = 20, 10
x = cp.Variable((n, 1))
F = cp.Parameter((m, n))
g = cp.Parameter((m, 1))
lambd = cp.Parameter((1, 1), nonneg=True)
objective_fn = cp.norm(F @ x - g) + lambd * cp.norm(x)
constraints = [x >= 0]
problem = cp.Problem(cp.Minimize(objective_fn), constraints)
assert problem.is_dpp()
The code below shows how to use our PyTorch layer to solve and backpropagate through problem (the code for our TensorFlow layer is almost identical; see Appendix D).
Figure 1: Gradients (black lines) of the logistic test loss with respect to the training data.
Figure 2: Per-iteration cost while learning an ADP policy for stochastic control.
1  import torch
2  from cvxpylayers.torch import CvxpyLayer
3
4  F_t = torch.randn(m, n, requires_grad=True)
5  g_t = torch.randn(m, 1, requires_grad=True)
6  lambd_t = torch.rand(1, 1, requires_grad=True)
7  layer = CvxpyLayer(
8      problem, parameters=[F, g, lambd], variables=[x])
9  x_star, = layer(F_t, g_t, lambd_t)
10 x_star.sum().backward()
Constructing layer in lines 7-8 canonicalizes problem to extract C and R, as described in §4.2. Calling layer in line 9 applies the map R ∘ s ∘ C from §4, returning a solution of the problem. Line 10 computes the gradient of the sum of x_star with respect to F_t, g_t, and lambd_t.
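The layer also supports batched parameters (a sketch; each parameter carries a leading batch dimension, and the returned solution inherits it):

F_b = torch.randn(32, m, n, requires_grad=True)
g_b = torch.randn(32, m, 1, requires_grad=True)
lambd_b = torch.rand(32, 1, 1, requires_grad=True)
x_b, = layer(F_b, g_b, lambd_b)   # x_b has shape (32, n, 1)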
6 Examples
In this section, we present two applications of differentiable convex optimization, meant to be
suggestive of possible use cases for our layer. We give more examples in Appendix E.
Numerical example. We consider 30 training points and 30 test points in R², and we fit a logistic model with elastic-net regularization. This problem can be written using DPP, with the training points x_i as parameters (see Appendix C for the code). We used our convex optimization layer to fit this model and obtain
the gradient of the test loss with respect to the training data. Figure 1 visualizes the results. The
orange and blue points are training data, belonging to two different classes. The red (dashed) line is the hyperplane learned by fitting the model, while the blue (solid) line is the hyperplane that minimizes the test loss. The gradients are visualized as black lines, attached to the data points.
Moving the points in the gradient directions torques the learned hyperplane away from the optimal
hyperplane for the test set.
ADP policy. A common heuristic for solving the stochastic control problem (7) (choosing a policy φ, which maps states x_t to controls u ∈ U, to minimize an expected cost under linear dynamics) is approximate dynamic programming (ADP), which parametrizes φ and replaces the minimization over functions φ with a minimization over parameters. In this example, we take U to be the unit ball and we represent φ as a quadratic control-Lyapunov policy [74]. Evaluating φ corresponds to solving the SOCP

minimize    uᵀPu + x_tᵀQu + qᵀu
subject to  ‖u‖₂ ≤ 1,        (8)

with variable u and parameters P, Q, q, and x_t. We can run stochastic gradient descent (SGD) on P, Q, and q to approximately solve (7), which requires differentiating through (8). Note that if u were unconstrained, (7) could be solved exactly via linear quadratic regulator (LQR) theory [50]. The policy (8) can be written using DPP (see Appendix C for the code).
Numerical example. Figure 2 plots the estimated average cost for each iteration of gradient descent
for a numerical example, with x ∈ R2 and u ∈ R3 , a time horizon of T = 25, and a batch size of
8. We initialize our policy’s parameters with the LQR solution, ignoring the constraint on u. This
method decreased the average cost by roughly 40%.
7 Evaluation
Our implementation substantially lowers the barrier to using convex optimization layers. Here, we
show that our implementation substantially reduces canonicalization time. Additionally, for dense
problems, our implementation is competitive (in execution time) with a specialized solver for QPs;
for sparse problems, our implementation is much faster.
Canonicalization. Table 1 reports the time it takes to canonicalize the logistic regression and stochastic control problems from §6, comparing CVXPY version 1.0.23 with CVXPY 1.1. Each canonicalization was performed on a single core of an unloaded Intel i7-8700K processor. We report the average time and standard deviation across 10 runs, excluding a warm-up run. Our extension achieves an order-of-magnitude speed-up on average, since computing C via a sparse matrix multiplication is much more efficient than re-running the DSL's canonicalization machinery.

Table 1: Time (ms) to canonicalize examples, across 10 runs.
(a) Dense QP, batch size of 128. (b) Sparse QP, batch size of 32.
Figure 3: Comparison of our PyTorch CvxpyLayer to qpth, over 10 trials. For cvxpylayers, we
separate out the canonicalization and solution retrieval times, to allow for a fair comparison.
Comparison to specialized layers. We have implemented a batched solver and backward pass for
our differentiable CVXPY layer that makes it competitive with the batched QP layer qpth from [6].
Figure 3 compares the runtimes of our PyTorch CvxpyLayer and qpth on a dense and sparse QP.
The sparse problem is too large for qpth to run in GPU mode. The QPs have the form

minimize    (1/2)xᵀQx + pᵀx
subject to  Ax = b,        (9)
            Gx ≤ h,

with variable x.
8 Discussion
Other solvers. Solvers that are specialized to subclasses of convex programs are often faster than
more general conic solvers. For example, one might use OSQP [69] to solve QPs, or gradient-based
methods like L-BFGS [54] or SAGA [28] for empirical risk minimization. Because CVXPY lets
developers add specialized solvers as additional back-ends, our implementation of DPP and ASA
form can be easily extended to other problem classes. We plan to interface QP solvers in future work.
Acknowledgments
We gratefully acknowledge discussions with Eric Chu, who designed and implemented a code
generator for SOCPs [26, 25], Nicholas Moehle, who designed and implemented a basic version of a
code generator for convex optimization in unpublished work, and Brendan O’Donoghue. We also
would like to thank the anonymous reviewers, who provided us with useful suggestions that improved
the paper. S. Barratt is supported by the National Science Foundation Graduate Research Fellowship
under Grant No. DGE-1656518.
References
[1] A. Agrawal, S. Barratt, S. Boyd, E. Busseti, and W. Moursi. Differentiating through a cone
program. In: Journal of Applied and Numerical Optimization 1.2 (2019), pp. 107–115.
[2] A. Agrawal, A. N. Modi, A. Passos, A. Lavoie, A. Agarwal, A. Shankar, I. Ganichev, J.
Levenberg, M. Hong, R. Monga, and S. Cai. TensorFlow Eager: A multi-stage, Python-
embedded DSL for machine learning. In: Proc. Systems for Machine Learning. 2019.
[3] A. Agrawal, R. Verschueren, S. Diamond, and S. Boyd. A rewriting system for convex
optimization problems. In: Journal of Control and Decision 5.1 (2018), pp. 42–60.
[4] B. Amos. Differentiable optimization-based modeling for machine learning. PhD thesis.
Carnegie Mellon University, 2019.
[5] B. Amos, I. Jimenez, J. Sacks, B. Boots, and J. Z. Kolter. Differentiable MPC for end-to-end
planning and control. In: Advances in Neural Information Processing Systems. 2018, pp. 8299–
8310.
[6] B. Amos and J. Z. Kolter. OptNet: Differentiable optimization as a layer in neural networks.
In: Intl. Conf. Machine Learning. 2017.
[7] B. Amos, V. Koltun, and J. Z. Kolter. The limited multi-label projection layer. 2019. arXiv:
1906.08707.
[8] G. Angeris, J. Vučković, and S. Boyd. Computational Bounds for Photonic Design. In: ACS
Photonics 6.5 (2019), pp. 1232–1239.
[9] M. ApS. MOSEK optimization suite. http://docs.mosek.com/9.0/intro.pdf. 2019.
[10] S. Barratt. On the differentiability of the solution to convex optimization problems. 2018. arXiv:
1804.05098.
[11] S. Barratt and S. Boyd. Fitting a Kalman smoother to data. 2019. arXiv: 1910.08615.
[12] S. Barratt and S. Boyd. Least squares auto-tuning. 2019. arXiv: 1904.05460.
[13] S. Barratt and S. Boyd. Stochastic control with affine dynamics and extended quadratic costs.
2018. arXiv: 1811.00168.
[14] D. Belanger, B. Yang, and A. McCallum. End-to-end learning for structured prediction energy
networks. In: Intl. Conf. Machine Learning. 2017.
[15] A. Ben-Tal, B. Golany, A. Nemirovski, and J.-P. Vial. Retailer-supplier flexible commit-
ments contracts: A robust optimization approach. In: Manufacturing & Service Operations
Management 7.3 (2005), pp. 248–271.
[16] D. P. Bertsekas. Dynamic programming and optimal control. 3rd ed. Vol. 1. Athena scientific
Belmont, 2005.
[17] D. Bertsimas and A. Thiele. A robust optimization approach to supply chain management. In:
Proc. Intl. Conf. on Integer Programming and Combinatorial Optimization. Springer. 2004,
pp. 86–100.
[18] B. Biggio and F. Roli. Wild patterns: Ten years after the rise of adversarial machine learning.
In: Pattern Recognition 84 (2018), pp. 317–331.
[19] S. Boyd, E. Busseti, S. Diamond, R. Kahn, K. Koh, P. Nystrup, and J. Speth. Multi-period
trading via convex optimization. In: Foundations and Trends in Optimization 3.1 (2017),
pp. 1–76.
[20] S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan. Linear matrix inequalities in system
and control theory. SIAM, 1994.
[21] S. Boyd, S.-J. Kim, D. Patil, and M. Horowitz. Digital circuit optimization via geometric
programming. In: Operations Research 53.6 (2005).
[22] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[23] P. Brakel, D. Stroobandt, and B. Schrauwen. Training energy-based models for time-series
imputation. In: Journal of Machine Learning Research 14.1 (2013), pp. 2771–2797.
[24] E. Busseti, W. Moursi, and S. Boyd. Solution refinement at regular points of conic problems.
2018. arXiv: 1811.02157.
[25] E. Chu and S. Boyd. QCML: Quadratic Cone Modeling Language. https://github.com/
cvxgrp/qcml. 2017.
[26] E. Chu, N. Parikh, A. Domahidi, and S. Boyd. Code generation for embedded second-order
cone programming. In: 2013 European Control Conference (ECC). IEEE. 2013, pp. 1547–
1552.
[27] F. de Avila Belbute-Peres, K. Smith, K. Allen, J. Tenenbaum, and J. Z. Kolter. End-to-end
differentiable physics for learning and control. In: Advances in Neural Information Processing
Systems. 2018, pp. 7178–7189.
[28] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with
support for non-strongly convex composite objectives. In: Advances in Neural Information
Processing Systems. 2014, pp. 1646–1654.
[29] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex
optimization. In: Journal of Machine Learning Research 17.1 (2016), pp. 2909–2913.
[30] S. Diamond, V. Sitzmann, F. Heide, and G. Wetzstein. Unrolled optimization with deep priors.
2017. arXiv: 1705.08041.
[31] J. Djolonga and A. Krause. Differentiable learning of submodular models. In: Advances in
Neural Information Processing Systems. 2017, pp. 1013–1023.
[32] A. Domahidi, E. Chu, and S. Boyd. ECOS: An SOCP solver for embedded systems. In: Control
Conference (ECC), 2013 European. IEEE. 2013, pp. 3071–3076.
[33] J. Domke. Generic methods for optimization-based modeling. In: AISTATS. Vol. 22. 2012,
pp. 318–326.
[34] A. L. Dontchev and R. T. Rockafellar. Implicit functions and solution mappings. In: Springer
Monogr. Math. (2009).
[35] P. Donti, B. Amos, and J. Z. Kolter. Task-based end-to-end model learning in stochastic
optimization. In: Advances in Neural Information Processing Systems. 2017, pp. 5484–5494.
[36] A. Fiacco and G. McCormick. Nonlinear programming: Sequential unconstrained minimiza-
tion techniques. John Wiley and Sons, Inc., New York-London-Sydney, 1968, pp. xiv+210.
[37] A. V. Fiacco. Introduction to sensitivity and stability analysis in nonlinear programming.
Vol. 165. Mathematics in Science and Engineering. Academic Press, Inc., Orlando, FL, 1983,
pp. xii+367.
[38] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep
networks. In: 34th Intl. Conf. Machine Learning-Volume 70. JMLR. org. 2017, pp. 1126–1135.
[39] A. Fu, B. Narasimhan, and S. Boyd. CVXR: An R package for disciplined convex optimization.
In: arXiv preprint arXiv:1711.07582 (2017).
[40] Z. Geng, D. Johnson, and R. Fedkiw. Coercing machine learning to output physically accurate
results. 2019. arXiv: 1910.09671 [physics.comp-ph].
[41] I. Goodfellow, M. Mirza, A. Courville, and Y. Bengio. Multi-prediction deep Boltzmann
machines. In: Advances in Neural Information Processing Systems. 2013, pp. 548–556.
[42] S. Gould, R. Hartley, and D. Campbell. Deep declarative networks: A new hope. 2019. arXiv:
1909.04866.
[43] M. Grant and S. Boyd. Graph implementations for nonsmooth convex programs. In: Recent
Advances in Learning and Control. Ed. by V. Blondel, S. Boyd, and H. Kimura. Lecture Notes
in Control and Information Sciences. Springer, 2008, pp. 95–110.
[44] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version
2.1. http://cvxr.com/cvx. 2014.
[45] M. Grant, S. Boyd, and Y. Ye. Disciplined convex programming. In: Global optimization.
Springer, 2006, pp. 155–210.
[46] A. Griewank and A. Walther. Evaluating derivatives: principles and techniques of algorithmic
differentiation. SIAM, 2008.
[47] M. Hershenson, S. Boyd, and T. Lee. Optimal design of a CMOS op-amp via geometric
programming. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems 20.1 (2001), pp. 1–21.
[48] W. Hoburg and P. Abbeel. Geometric programming for aircraft design optimization. In: AIAA
Journal 52.11 (2014), pp. 2414–2426.
[49] M. Jagielski, A. Oprea, B. Biggio, C. Liu, C. Nita-Rotaru, and B. Li. Manipulating machine
learning: Poisoning attacks and countermeasures for regression learning. In: IEEE Symposium
on Security and Privacy. IEEE. 2018, pp. 19–35.
[50] R. Kalman. When is a linear control system optimal? In: Journal of Basic Engineering 86.1
(1964), pp. 51–60.
[51] Y. Kanno. Nonsmooth Mechanics and Convex Optimization. CRC Press, Boca Raton, FL,
2011.
[52] K. Lee, S. Maji, A. Ravichandran, and S. Soatto. Meta-learning with differentiable convex
optimization. In: arXiv preprint arXiv:1904.03758 (2019).
[53] C. K. Ling, F. Fang, and J. Z. Kolter. What game are we playing? End-to-end learning in
normal and extensive form games. 2018. arXiv: 1805.02777.
[54] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization.
In: Mathematical programming 45.1-3 (1989), pp. 503–528.
[55] C. Malaviya, P. Ferreira, and A. F. Martins. Sparse and constrained attention for neural
machine translation. 2018. arXiv: 1805.08241.
[56] M. Mardani, Q. Sun, S. Vasawanala, V. Papyan, H. Monajemi, J. Pauly, and D. Donoho. Neural
proximal gradient descent for compressive imaging. 2018. arXiv: 1806.03963 [cs.CV].
[57] H. Markowitz. Portfolio selection. In: Journal of Finance 7.1 (1952), pp. 77–91.
[58] A. Martins and R. Astudillo. From softmax to sparsemax: A sparse model of attention and
multi-label classification. In: Intl. Conf. Machine Learning. 2016, pp. 1614–1623.
[59] A. F. Martins and J. Kreutzer. Learning what’s easy: Fully differentiable neural easy-first
taggers. In: 2017 Conference on Empirical Methods in Natural Language Processing. 2017,
pp. 349–362.
[60] J. Mattingley and S. Boyd. CVXGEN: A code generator for embedded convex optimization.
In: Optimization and Engineering 13.1 (2012), pp. 1–27.
[61] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks.
2016. arXiv: 1611.02163.
[62] J. Miller, J. Zhu, and P. Quigley. CVXCanon. https://github.com/cvxgrp/CVXcanon/.
2015.
[63] N. Moehle, E. Busseti, S. Boyd, and M. Wytock. Dynamic energy management. 2019. arXiv:
1903.06230.
[64] B. O’Donoghue, E. Chu, N. Parikh, and S. Boyd. SCS: Splitting conic solver, version 2.1.0.
https://github.com/cvxgrp/scs. 2017.
[65] C. C. Paige and M. A. Saunders. LSQR: An algorithm for sparse linear equations and sparse
least squares. In: ACM Transactions on Mathematical Software (TOMS) 8.1 (1982), pp. 43–71.
[66] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison,
L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In: NIPS Autodiff Workshop
(2017).
[67] H. Pirnay, R. López-Negrete, and L. T. Biegler. Optimal sensitivity based on IPOPT. In:
Mathematical Programming Computation 4.4 (2012), pp. 307–331.
[68] S. Robinson. Strongly regular generalized equations. In: Mathematics of Operations Research
5.1 (1980), pp. 43–62.
[69] B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd. OSQP: An operator splitting
solver for quadratic programs. 2017. arXiv: 1711.08013.
[70] V. Stoyanov, A. Ropson, and J. Eisner. Empirical risk minimization of graphical model
parameters given approximate inference, decoding, and model structure. In: AISTATS. 2011,
pp. 725–733.
[71] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In:
2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE. 2012,
pp. 5026–5033.
[72] M. Udell, K. Mohan, D. Zeng, J. Hong, S. Diamond, and S. Boyd. Convex optimization in
Julia. In: SC14 Workshop on High Performance Technical Computing in Dynamic Languages
(2014). arXiv: 1410.4821 [math.OC].
[73] M. Van Kreveld, O. Schwarzkopf, M. de Berg, and M. Overmars. Computational geometry
algorithms and applications. Springer, 2000.
[74] Y. Wang and S. Boyd. Fast evaluation of quadratic control-Lyapunov policy. In: IEEE Transac-
tions on Control Systems Technology 19.4 (2010), pp. 939–946.
[75] B. Wilder, B. Dilkina, and M. Tambe. Melding the data-decisions pipeline: Decision-focused
learning for combinatorial optimization. 2018. arXiv: 1809.05504.
[76] Y. Ye, M. J. Todd, and S. Mizuno. An O(√n L)-iteration homogeneous and self-dual linear programming algorithm. In: Mathematics of Operations Research 19.1 (1994), pp. 53–67.
A The canonicalization map
In this appendix, we provide a proof of Lemma 1. We compute Q and R via a reduction on the affine
expression trees that represent the canonicalized problem. Let f be the root node with arguments
(descendants) g1 , . . . , gn . Then we obtain tensors T1 , . . . , Tn representing the (linear) action of f
on each argument. We recurse on each subtree g_i and obtain tensors S_1, …, S_n. Due to the DPP rules, for i = 1, …, n, we either have (T_i)_{j,k,ℓ} = 0 for ℓ ≠ p + 1, or (S_i)_{j,k,ℓ} = 0 for ℓ ≠ p + 1. We define an operation ψ(T_i, S_i) such that, in the first case,

ψ(T_i, S_i) = Σ_{ℓ=1}^{p+1} (T_i)_{[:,:,p+1]} (S_i)_{[:,:,ℓ]},

and, in the second case,

ψ(T_i, S_i) = Σ_{ℓ=1}^{p+1} (T_i)_{[:,:,ℓ]} (S_i)_{[:,:,p+1]}.

The tree rooted at f then evaluates to S_0 = ψ(T_1, S_1) + · · · + ψ(T_n, S_n).
The base case of the recursion corresponds to the tensors produced when a variable, parameter, or
constant node is evaluated. (These are the leaf nodes of an affine expression tree.)
B Differentiating through cone programs

A cone program and its dual can be written as

minimize    cᵀx            maximize    −bᵀy
subject to  Ax + s = b,    subject to  Aᵀy + c = 0,        (10)
            s ∈ K,                     y ∈ K*.

Here x ∈ R^n is the primal variable, y ∈ R^m is the dual variable, and s ∈ R^m is the primal slack variable. The set K ⊆ R^m is a nonempty, closed, convex cone with dual cone K* ⊆ R^m. We call (x, y, s) a solution of the primal-dual cone program (10) if it satisfies the KKT conditions

Ax + s = b,    Aᵀy + c = 0,    s ∈ K,    y ∈ K*,    sᵀy = 0.
Homogeneous self-dual embedding. The homogeneous self-dual embedding reduces the process of solving (10) to finding a zero of a certain residual map [76]. Letting N = n + m + 1, the embedding uses the variable z ∈ R^N, which we partition as (u, v, w) ∈ R^n × R^m × R. The normalized residual map introduced in [24] is the function N : R^N × R^{N×N} → R^N defined by

N(z, Q) = ((Q − I)Π + I)(z/|w|),

where Π denotes projection onto R^n × K* × R_+, and Q is the skew-symmetric matrix

    [  0    Aᵀ   c ]
Q = [ −A    0    b ]        (11)
    [ −cᵀ  −bᵀ   0 ].
If N(z, Q) = 0 and w > 0, we can use z to construct a solution of the primal-dual pair (10) as

(x, y, s) = (u, Π_{K*}(v), Π_{K*}(v) − v)/w,        (12)

where Π_{K*}(v) denotes the projection of v onto K*. From here onward, we assume that w = 1. (If this is not the case, we can scale z so that it is.)
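A minimal NumPy sketch of the normalized residual map (Pi here is an assumed, user-supplied projection onto R^n × K* × R_+):

import numpy as np

def normalized_residual(z, Q, Pi):
    # N(z, Q) = ((Q - I) Π + I)(z / |w|), where w is the last component of z
    u = z / abs(z[-1])
    return (Q - np.eye(Q.shape[0])) @ Pi(u) + u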
Differentiation. A conic solver is a numerical algorithm for solving (10). We can view a conic solver as a function ψ : R^{m×n} × R^m × R^n → R^{n+2m} mapping the problem data (A, b, c) to a solution (x, y, s). (We assume that the cone K is fixed.) In this section we derive expressions for the derivative of ψ, assuming that ψ is in fact differentiable. Interlaced with our derivations, we describe how to numerically evaluate the adjoint of the derivative map, which is necessary for backpropagation. Following [1] and [4, Section 7], we can express ψ as the composition φ ∘ s ∘ Q, where Q maps the problem data (A, b, c) to the skew-symmetric matrix Q given in (11), s maps Q to a zero z of the normalized residual map (so that N(s(Q), Q) = 0), and φ maps z to the solution (x, y, s) via (12).

To backpropagate through ψ, we need to compute the adjoint of the derivative of ψ at (A, b, c) applied to the vector (dx, dy, ds), that is,

(dA, db, dc) = Dᵀψ(A, b, c)(dx, dy, ds) = DᵀQ(A, b, c) Dᵀs(Q) Dᵀφ(z)(dx, dy, ds).

Since our layer only outputs the primal solution x, we can simplify the calculation by taking dy = ds = 0. By (12),

dz = Dᵀφ(z)(dx, 0, 0) = (dx, 0, −xᵀdx).
We can compute Ds(Q) by implicitly differentiating the normalized residual map:

Ds(Q) = −(D_z N(s(Q), Q))^{−1} D_Q N(s(Q), Q).        (13)

This gives

dQ = Dᵀs(Q) dz = −(M^{−T} dz) Π(z)ᵀ,

where M = (Q − I)DΠ(z) + I. Computing g = M^{−T} dz via a direct method (i.e., materializing M, factorizing it, and back-solving) can be impractical when M is large. Instead, one might use a Krylov method like LSQR [65] to solve

minimize_g  ‖Mᵀg − dz‖₂².        (14)
Non-differentiability. To implicitly differentiate the solution map in (13), we assumed that M was invertible. When M is not invertible, we approximate dQ as −g^{ls} Π(z)ᵀ, where g^{ls} is a least-squares solution to (14).
C Examples
This appendix includes code for the examples presented in §6.
Logistic regression. The code for the logistic regression problem is below; the dimensions N and n and the labels Y are assumptions here, matching the setup in §6:

import cvxpy as cp
import numpy as np
from cvxpylayers.torch import CvxpyLayer

N, n = 30, 2                              # 30 training points in R^2 (see §6)
Y = np.random.randint(0, 2, size=(N, 1))  # placeholder labels

beta = cp.Variable((n, 1))
b = cp.Variable((1, 1))
X = cp.Parameter((N, n))

log_likelihood = (1. / N) * cp.sum(
    cp.multiply(Y, X @ beta + b) - cp.logistic(X @ beta + b)
)
regularization = -0.1 * cp.norm(beta, 1) - 0.1 * cp.sum_squares(beta)

prob = cp.Problem(cp.Maximize(log_likelihood + regularization))
fit_logreg = CvxpyLayer(prob, parameters=[X], variables=[beta, b])
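A sketch of using this layer to differentiate a scalar loss with respect to the training features (the loss here is a stand-in; §6 uses the logistic test loss on held-out data):

import torch

X_t = torch.randn(N, n, requires_grad=True)
beta_t, b_t = fit_logreg(X_t)
loss = beta_t.sum() + b_t.sum()   # placeholder scalar loss
loss.backward()                   # X_t.grad now holds dloss/dX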
Stochastic control. The code for the stochastic control problem (7) is below; the dimensions n and m are assumptions here, matching §6:

import cvxpy as cp
from cvxpylayers.torch import CvxpyLayer

n, m = 2, 3   # state and control dimensions (see §6)

x_cvxpy = cp.Parameter((n, 1))
P_sqrt_cvxpy = cp.Parameter((m, m))
P_21_cvxpy = cp.Parameter((n, m))
q_cvxpy = cp.Parameter((m, 1))

u_cvxpy = cp.Variable((m, 1))
y_cvxpy = cp.Variable((n, 1))

objective = (.5 * cp.sum_squares(P_sqrt_cvxpy @ u_cvxpy)
             + x_cvxpy.T @ y_cvxpy + q_cvxpy.T @ u_cvxpy)
prob = cp.Problem(cp.Minimize(objective),
                  [cp.norm(u_cvxpy) <= .5, y_cvxpy == P_21_cvxpy @ u_cvxpy])

policy = CvxpyLayer(prob,
                    parameters=[x_cvxpy, P_sqrt_cvxpy, P_21_cvxpy, q_cvxpy],
                    variables=[u_cvxpy])
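A sketch of evaluating the policy differentiably (the parameter values here are arbitrary):

import torch

x_t = torch.randn(n, 1)
P_sqrt_t = torch.eye(m, requires_grad=True)
P_21_t = torch.randn(n, m, requires_grad=True)
q_t = torch.randn(m, 1, requires_grad=True)
u_t, = policy(x_t, P_sqrt_t, P_21_t, q_t)   # control action; gradients flow to P and q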
D TensorFlow layer
In §5, we showed how to implement the problem (4) using our PyTorch layer. The below code shows
how to implement the same problem using our TensorFlow 2.0 layer.
import tensorflow as tf
from cvxpylayers.tensorflow import CvxpyLayer

F_t = tf.Variable(tf.random.normal(F.shape))
g_t = tf.Variable(tf.random.normal(g.shape))
lambd_t = tf.Variable(tf.random.normal(lambd.shape))
layer = CvxpyLayer(problem, parameters=[F, g, lambd], variables=[x])
with tf.GradientTape() as tape:
    x_star, = layer(F_t, g_t, lambd_t)
dF, dg, dlambd = tape.gradient(x_star, [F_t, g_t, lambd_t])
E Additional examples
In this appendix we provide additional examples of constructing differentiable convex optimization
layers using our implementation. We present the implementation of common neural networks layers,
even though analytic solutions exist for some of these operations. These layers can be modified in
simple ways such that they do not have analytical solutions. In the below problems, the optimization
variable is y (unless stated otherwise). We also show how prior work on differentiable convex
optimization layers such as OptNet [6] is captured by our framework.
The ReLU, defined by f(x) = max{0, x}, can be interpreted as projecting a point x ∈ R^n onto the nonnegative orthant:

minimize    (1/2)‖x − y‖₂²
subject to  y ≥ 0.
We can implement this layer with:
x = cp.Parameter(n)
y = cp.Variable(n)
obj = cp.Minimize(cp.sum_squares(x - y))
cons = [y >= 0]
prob = cp.Problem(obj, cons)
layer = CvxpyLayer(prob, parameters=[x], variables=[y])
The sigmoid or logistic function, defined by f(x) = (1 + e^{−x})^{−1}, can be interpreted as projecting a point x ∈ R^n onto the interior of the unit hypercube:

minimize    −xᵀy − H_b(y)
subject to  0 < y < 1,

where H_b(y) = −Σ_i (y_i log y_i + (1 − y_i) log(1 − y_i)) is the binary entropy function. This is proved, e.g., in [4, Section 2.4]. We can implement this layer with:
x = cp.Parameter(n)
y = cp.Variable(n)
obj = cp.Minimize(-x.T @ y - cp.sum(cp.entr(y) + cp.entr(1. - y)))
prob = cp.Problem(obj)
layer = CvxpyLayer(prob, parameters=[x], variables=[y])
The softmax, defined by f(x)_j = e^{x_j} / Σ_i e^{x_i}, can be interpreted as projecting a point x ∈ R^n onto the interior of the (n − 1)-simplex Δ_{n−1} = {p ∈ R^n | 1ᵀp = 1, p ≥ 0}:

minimize    −xᵀy − H(y)
subject to  0 < y < 1,
            1ᵀy = 1,

where H(y) = −Σ_i y_i log y_i is the entropy function. This is proved, e.g., in [4, Section 2.4]. We can implement this layer with:
x = cp.Parameter(d)
y = cp.Variable(d)
obj = cp.Minimize(-x.T @ y - cp.sum(cp.entr(y)))
cons = [cp.sum(y) == 1.]
prob = cp.Problem(obj, cons)
layer = CvxpyLayer(prob, parameters=[x], variables=[y])
The sparsemax [58] computes a Euclidean projection onto the simplex:

minimize    (1/2)‖y − x‖₂²
subject to  1ᵀy = 1,
            y ≥ 0.
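Following the pattern of the layers above, a sketch of an implementation:

x = cp.Parameter(d)
y = cp.Variable(d)
obj = cp.Minimize(cp.sum_squares(y - x))
cons = [cp.sum(y) == 1., y >= 0]
prob = cp.Problem(obj, cons)
layer = CvxpyLayer(prob, parameters=[x], variables=[y])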
The Limited Multi-Label (LML) layer [7] solves the optimization problem of projecting onto the points in the unit hypercube that sum to exactly k:

minimize    −xᵀy − H_b(y)
subject to  0 < y < 1,
            1ᵀy = k,

where H_b is the binary entropy function defined above.
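A sketch of an implementation (the value of k, the number of labels to select, is an assumption here):

k = 2
x = cp.Parameter(d)
y = cp.Variable(d)
obj = cp.Minimize(-x.T @ y - cp.sum(cp.entr(y) + cp.entr(1. - y)))
cons = [cp.sum(y) == k]
prob = cp.Problem(obj, cons)
layer = CvxpyLayer(prob, parameters=[x], variables=[y])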
The OptNet QP. We can re-implement the OptNet QP layer [6] in a few lines of code. The OptNet layer is a solution to a convex quadratic program of the form

minimize    (1/2)xᵀQx + qᵀx
subject to  Ax = b,
            Gx ≤ h,

where x ∈ R^n is the optimization variable, and the problem data are Q ∈ R^{n×n} (which is positive semidefinite), q ∈ R^n, A ∈ R^{m×n}, b ∈ R^m, G ∈ R^{p×n}, and h ∈ R^p. We can implement this with:
Q_sqrt = cp.Parameter((n, n))
q = cp.Parameter(n)
A = cp.Parameter((m, n))
b = cp.Parameter(m)
G = cp.Parameter((p, n))
h = cp.Parameter(p)
x = cp.Variable(n)
obj = cp.Minimize(0.5 * cp.sum_squares(Q_sqrt @ x) + q.T @ x)
cons = [A @ x == b, G @ x <= h]
prob = cp.Problem(obj, cons)
layer = CvxpyLayer(prob, parameters=[Q_sqrt, q, A, b, G, h], variables=[x])
Note that we take the matrix square root of Q in PyTorch, outside the CVXPY layer, to obtain the derivative with respect to Q itself; DPP does not allow the quadratic form atom to be parametrized, as discussed in §4.1.