Search | arXiv e-print repository

Structured Semidefinite Programming for Recovering Structured Preconditioners

Authors: Arun Jambulapati, Jerry Li, Christopher Musco, Kirankumar Shiragur, Aaron Sidford, Kevin Tian

Abstract: We develop a general framework for finding approximately-optimal preconditioners for solving linear systems. Leveraging this framework we obtain improved runtimes for fundamental preconditioning and linear system solving problems including the following. We give an algorithm which, given positive definite $\mathbf{K} \in \mathbb{R}^{d \times d}$ with $\mathrm{nnz}(\mathbf{K})$ nonzero entries, com… ▽ More We develop a general framework for finding approximately-optimal preconditioners for solving linear systems. Leveraging this framework we obtain improved runtimes for fundamental preconditioning and linear system solving problems including the following. We give an algorithm which, given positive definite $\mathbf{K} \in \mathbb{R}^{d \times d}$ with $\mathrm{nnz}(\mathbf{K})$ nonzero entries, computes an $ε$-optimal diagonal preconditioner in time $\widetilde{O}(\mathrm{nnz}(\mathbf{K}) \cdot \mathrm{poly}(κ^\star,ε^{-1}))$, where $κ^\star$ is the optimal condition number of the rescaled matrix. We give an algorithm which, given $\mathbf{M} \in \mathbb{R}^{d \times d}$ that is either the pseudoinverse of a graph Laplacian matrix or a constant spectral approximation of one, solves linear systems in $\mathbf{M}$ in $\widetilde{O}(d^2)$ time. Our diagonal preconditioning results improve state-of-the-art runtimes of $Ω(d^{3.5})$ attained by general-purpose semidefinite programming, and our solvers improve state-of-the-art runtimes of $Ω(d^ω)$ where $ω> 2.3$ is the current matrix multiplication constant. We attain our results via new algorithms for a class of semidefinite programs (SDPs) we call matrix-dictionary approximation SDPs, which we leverage to solve an associated problem we call matrix-dictionary recovery. △ Less

Submitted 27 October, 2023; originally announced October 2023.

Comments: Merge of arXiv:1812.06295 and arXiv:2008.01722

arXiv:2301.00457 [pdf, other]

ReSQueing Parallel and Private Stochastic Convex Optimization

Authors: Yair Carmon, Arun Jambulapati, Yujia Jin, Yin Tat Lee, Daogao Liu, Aaron Sidford, Kevin Tian

Abstract: We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density. Combining ReSQue with recent advances in ball oracle acceleration [CJJJLST20, ACJJS21], we develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings. For a SCO obj… ▽ More We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density. Combining ReSQue with recent advances in ball oracle acceleration [CJJJLST20, ACJJS21], we develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings. For a SCO objective constrained to the unit ball in $\mathbb{R}^d$, we obtain the following results (up to polylogarithmic factors). We give a parallel algorithm obtaining optimization error $ε_{\text{opt}}$ with $d^{1/3}ε_{\text{opt}}^{-2/3}$ gradient oracle query depth and $d^{1/3}ε_{\text{opt}}^{-2/3} + ε_{\text{opt}}^{-2}$ gradient queries in total, assuming access to a bounded-variance stochastic gradient estimator. For $ε_{\text{opt}} \in [d^{-1}, d^{-1/4}]$, our algorithm matches the state-of-the-art oracle depth of [BJLLS19] while maintaining the optimal total work of stochastic gradient descent. Given $n$ samples of Lipschitz loss functions, prior works [BFTT19, BFGT20, AFKT21, KLL21] established that if $n \gtrsim d ε_{\text{dp}}^{-2}$, $(ε_{\text{dp}}, δ)$-differential privacy is attained at no asymptotic cost to the SCO utility. However, these prior works all required a superlinear number of gradient queries. We close this gap for sufficiently large $n \gtrsim d^2 ε_{\text{dp}}^{-3}$, by using ReSQue to design an algorithm with near-linear gradient query complexity in this regime. △ Less

Submitted 27 October, 2023; v1 submitted 1 January, 2023; originally announced January 2023.

arXiv:2210.06728 [pdf, ps, other]

On the Efficient Implementation of High Accuracy Optimality of Profile Maximum Likelihood

Authors: Moses Charikar, Zhihao Jiang, Kirankumar Shiragur, Aaron Sidford

Abstract: We provide an efficient unified plug-in approach for estimating symmetric properties of distributions given $n$ independent samples. Our estimator is based on profile-maximum-likelihood (PML) and is sample optimal for estimating various symmetric properties when the estimation error $ε\gg n^{-1/3}$. This result improves upon the previous best accuracy threshold of $ε\gg n^{-1/4}$ achievable by pol… ▽ More We provide an efficient unified plug-in approach for estimating symmetric properties of distributions given $n$ independent samples. Our estimator is based on profile-maximum-likelihood (PML) and is sample optimal for estimating various symmetric properties when the estimation error $ε\gg n^{-1/3}$. This result improves upon the previous best accuracy threshold of $ε\gg n^{-1/4}$ achievable by polynomial time computable PML-based universal estimators [ACSS21, ACSS20]. Our estimator reaches a theoretical limit for universal symmetric property estimation as [Han21] shows that a broad class of universal estimators (containing many well known approaches including ours) cannot be sample optimal for every $1$-Lipschitz property when $ε\ll n^{-1/3}$. △ Less

Submitted 13 October, 2022; originally announced October 2022.

Comments: Accepted at NeurIPS 2022

arXiv:2203.15260 [pdf, other]

Efficient Convex Optimization Requires Superlinear Memory

Authors: Annie Marsden, Vatsal Sharan, Aaron Sidford, Gregory Valiant

Abstract: We show that any memory-constrained, first-order algorithm which minimizes $d$-dimensional, $1$-Lipschitz convex functions over the unit ball to $1/\mathrm{poly}(d)$ accuracy using at most $d^{1.25 - δ}$ bits of memory must make at least $\tildeΩ(d^{1 + (4/3)δ})$ first-order queries (for any constant $δ\in [0, 1/4]$). Consequently, the performance of such memory-constrained algorithms are a polyno… ▽ More We show that any memory-constrained, first-order algorithm which minimizes $d$-dimensional, $1$-Lipschitz convex functions over the unit ball to $1/\mathrm{poly}(d)$ accuracy using at most $d^{1.25 - δ}$ bits of memory must make at least $\tildeΩ(d^{1 + (4/3)δ})$ first-order queries (for any constant $δ\in [0, 1/4]$). Consequently, the performance of such memory-constrained algorithms are a polynomial factor worse than the optimal $\tilde{O}(d)$ query bound for this problem obtained by cutting plane methods that use $\tilde{O}(d^2)$ memory. This resolves a COLT 2019 open problem of Woodworth and Srebro. △ Less

Submitted 24 July, 2024; v1 submitted 29 March, 2022; originally announced March 2022.

Comments: 33 pages, 1 figure

arXiv:2203.04002 [pdf, ps, other]

Semi-Random Sparse Recovery in Nearly-Linear Time

Authors: Jonathan A. Kelner, Jerry Li, Allen Liu, Aaron Sidford, Kevin Tian

Abstract: Sparse recovery is one of the most fundamental and well-studied inverse problems. Standard statistical formulations of the problem are provably solved by general convex programming techniques and more practical, fast (nearly-linear time) iterative methods. However, these latter "fast algorithms" have previously been observed to be brittle in various real-world settings. We investigate the brittl… ▽ More Sparse recovery is one of the most fundamental and well-studied inverse problems. Standard statistical formulations of the problem are provably solved by general convex programming techniques and more practical, fast (nearly-linear time) iterative methods. However, these latter "fast algorithms" have previously been observed to be brittle in various real-world settings. We investigate the brittleness of fast sparse recovery algorithms to generative model changes through the lens of studying their robustness to a "helpful" semi-random adversary, a framework which tests whether an algorithm overfits to input assumptions. We consider the following basic model: let $\mathbf{A} \in \mathbb{R}^{n \times d}$ be a measurement matrix which contains an unknown subset of rows $\mathbf{G} \in \mathbb{R}^{m \times d}$ which are bounded and satisfy the restricted isometry property (RIP), but is otherwise arbitrary. Letting $x^\star \in \mathbb{R}^d$ be $s$-sparse, and given either exact measurements $b = \mathbf{A} x^\star$ or noisy measurements $b = \mathbf{A} x^\star + ξ$, we design algorithms recovering $x^\star$ information-theoretically optimally in nearly-linear time. We extend our algorithm to hold for weaker generative models relaxing our planted RIP assumption to a natural weighted variant, and show that our method's guarantees naturally interpolate the quality of the measurement matrix to, in some parameter regimes, run in sublinear time. Our approach differs from prior fast iterative methods with provable guarantees under semi-random generative models: natural conditions on a submatrix which make sparse recovery tractable are NP-hard to verify. We design a new iterative method tailored to the geometry of sparse recovery which is provably robust to our semi-random model. We hope our approach opens the door to new robust, efficient algorithms for natural statistical inverse problems. △ Less

Submitted 8 March, 2022; originally announced March 2022.

Comments: 42 pages, comments welcome!

arXiv:2111.03137 [pdf, other]

Big-Step-Little-Step: Efficient Gradient Methods for Objectives with Multiple Scales

Authors: Jonathan Kelner, Annie Marsden, Vatsal Sharan, Aaron Sidford, Gregory Valiant, Honglin Yuan

Abstract: We provide new gradient-based methods for efficiently solving a broad class of ill-conditioned optimization problems. We consider the problem of minimizing a function $f : \mathbb{R}^d \rightarrow \mathbb{R}$ which is implicitly decomposable as the sum of $m$ unknown non-interacting smooth, strongly convex functions and provide a method which solves this problem with a number of gradient evaluatio… ▽ More We provide new gradient-based methods for efficiently solving a broad class of ill-conditioned optimization problems. We consider the problem of minimizing a function $f : \mathbb{R}^d \rightarrow \mathbb{R}$ which is implicitly decomposable as the sum of $m$ unknown non-interacting smooth, strongly convex functions and provide a method which solves this problem with a number of gradient evaluations that scales (up to logarithmic factors) as the product of the square-root of the condition numbers of the components. This complexity bound (which we prove is nearly optimal) can improve almost exponentially on that of accelerated gradient methods, which grow as the square root of the condition number of $f$. Additionally, we provide efficient methods for solving stochastic, quadratic variants of this multiscale optimization problem. Rather than learn the decomposition of $f$ (which would be prohibitively expensive), our methods apply a clean recursive "Big-Step-Little-Step" interleaving of standard methods. The resulting algorithms use $\tilde{\mathcal{O}}(d m)$ space, are numerically stable, and open the door to a more fine-grained understanding of the complexity of convex optimization beyond condition number. △ Less

Submitted 4 November, 2021; originally announced November 2021.

Comments: 95 pages, 4 figures; authors are listed in alphabetical order

arXiv:2011.02761 [pdf, other]

Instance Based Approximations to Profile Maximum Likelihood

Authors: Nima Anari, Moses Charikar, Kirankumar Shiragur, Aaron Sidford

Abstract: In this paper we provide a new efficient algorithm for approximately computing the profile maximum likelihood (PML) distribution, a prominent quantity in symmetric property estimation. We provide an algorithm which matches the previous best known efficient algorithms for computing approximate PML distributions and improves when the number of distinct observed frequencies in the given instance is s… ▽ More In this paper we provide a new efficient algorithm for approximately computing the profile maximum likelihood (PML) distribution, a prominent quantity in symmetric property estimation. We provide an algorithm which matches the previous best known efficient algorithms for computing approximate PML distributions and improves when the number of distinct observed frequencies in the given instance is small. We achieve this result by exploiting new sparsity structure in approximate PML distributions and providing a new matrix rounding algorithm, of independent interest. Leveraging this result, we obtain the first provable computationally efficient implementation of PseudoPML, a general framework for estimating a broad class of symmetric properties. Additionally, we obtain efficient PML-based estimators for distributions with small profile entropy, a natural instance-based complexity measure. Further, we provide a simpler and more practical PseudoPML implementation that matches the best-known theoretical guarantees of such an estimator and evaluate this method empirically. △ Less

Submitted 5 November, 2020; originally announced November 2020.

Comments: Accepted at Thirty-fourth Conference on Neural Information Processing Systems (NeurIPS 2020)

arXiv:2010.05893 [pdf, other]

Large-Scale Methods for Distributionally Robust Optimization

Authors: Daniel Levy, Yair Carmon, John C. Duchi, Aaron Sidford

Abstract: We propose and analyze algorithms for distributionally robust optimization of convex losses with conditional value at risk (CVaR) and $χ^2$ divergence uncertainty sets. We prove that our algorithms require a number of gradient evaluations independent of training set size and number of parameters, making them suitable for large-scale applications. For $χ^2$ uncertainty sets these are the first such… ▽ More We propose and analyze algorithms for distributionally robust optimization of convex losses with conditional value at risk (CVaR) and $χ^2$ divergence uncertainty sets. We prove that our algorithms require a number of gradient evaluations independent of training set size and number of parameters, making them suitable for large-scale applications. For $χ^2$ uncertainty sets these are the first such guarantees in the literature, and for CVaR our guarantees scale linearly in the uncertainty level rather than quadratically as in previous work. We also provide lower bounds proving the worst-case optimality of our algorithms for CVaR and a penalized version of the $χ^2$ problem. Our primary technical contributions are novel bounds on the bias of batch robust risk estimation and the variance of a multilevel Monte Carlo gradient estimator due to [Blanchet & Glynn, 2015]. Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods. △ Less

Submitted 10 December, 2020; v1 submitted 12 October, 2020; originally announced October 2020.

Comments: 63 pages, NeurIPS 2020

arXiv:2008.12776 [pdf, other]

Efficiently Solving MDPs with Stochastic Mirror Descent

Authors: Yujia Jin, Aaron Sidford

Abstract: We present a unified framework based on primal-dual stochastic mirror descent for approximately solving infinite-horizon Markov decision processes (MDPs) given a generative model. When applied to an average-reward MDP with $A_{tot}$ total state-action pairs and mixing time bound $t_{mix}$ our method computes an $ε$-optimal policy with an expected $\widetilde{O}(t_{mix}^2 A_{tot} ε^{-2})$ samples f… ▽ More We present a unified framework based on primal-dual stochastic mirror descent for approximately solving infinite-horizon Markov decision processes (MDPs) given a generative model. When applied to an average-reward MDP with $A_{tot}$ total state-action pairs and mixing time bound $t_{mix}$ our method computes an $ε$-optimal policy with an expected $\widetilde{O}(t_{mix}^2 A_{tot} ε^{-2})$ samples from the state-transition matrix, removing the ergodicity dependence of prior art. When applied to a $γ$-discounted MDP with $A_{tot}$ total state-action pairs our method computes an $ε$-optimal policy with an expected $\widetilde{O}((1-γ)^{-4} A_{tot} ε^{-2})$ samples, matching the previous state-of-the-art up to a $(1-γ)^{-1}$ factor. Both methods are model-free, update state values and policies simultaneously, and run in time linear in the number of samples taken. We achieve these results through a more general stochastic mirror descent framework for solving bilinear saddle-point problems with simplex and box domains and we demonstrate the flexibility of this framework by providing further applications to constrained MDPs. △ Less

Submitted 28 August, 2020; originally announced August 2020.

Comments: ICML 2020

arXiv:2008.01722 [pdf, ps, other]

Fast and Near-Optimal Diagonal Preconditioning

Authors: Arun Jambulapati, Jerry Li, Christopher Musco, Aaron Sidford, Kevin Tian

Abstract: The convergence rates of iterative methods for solving a linear system $\mathbf{A} x = b$ typically depend on the condition number of the matrix $\mathbf{A}$. Preconditioning is a common way of speeding up these methods by reducing that condition number in a computationally inexpensive way. In this paper, we revisit the decades-old problem of how to best improve $\mathbf{A}$'s condition number by… ▽ More The convergence rates of iterative methods for solving a linear system $\mathbf{A} x = b$ typically depend on the condition number of the matrix $\mathbf{A}$. Preconditioning is a common way of speeding up these methods by reducing that condition number in a computationally inexpensive way. In this paper, we revisit the decades-old problem of how to best improve $\mathbf{A}$'s condition number by left or right diagonal rescaling. We make progress on this problem in several directions. First, we provide new bounds for the classic heuristic of scaling $\mathbf{A}$ by its diagonal values (a.k.a. Jacobi preconditioning). We prove that this approach reduces $\mathbf{A}$'s condition number to within a quadratic factor of the best possible scaling. Second, we give a solver for structured mixed packing and covering semidefinite programs (MPC SDPs) which computes a constant-factor optimal scaling for $\mathbf{A}$ in $\widetilde{O}(\text{nnz}(\mathbf{A}) \cdot \text{poly}(κ^\star))$ time; this matches the cost of solving the linear system after scaling up to a $\widetilde{O}(\text{poly}(κ^\star))$ factor. Third, we demonstrate that a sufficiently general width-independent MPC SDP solver would imply near-optimal runtimes for the scaling problems we consider, and natural variants concerned with measures of average conditioning. Finally, we highlight connections of our preconditioning techniques to semi-random noise models, as well as applications in reducing risk in several statistical regression models. △ Less

Submitted 3 November, 2021; v1 submitted 4 August, 2020; originally announced August 2020.

Comments: 46 pages

arXiv:2004.02425 [pdf, other]

The Bethe and Sinkhorn Permanents of Low Rank Matrices and Implications for Profile Maximum Likelihood

Authors: Nima Anari, Moses Charikar, Kirankumar Shiragur, Aaron Sidford

Abstract: In this paper we consider the problem of computing the likelihood of the profile of a discrete distribution, i.e., the probability of observing the multiset of element frequencies, and computing a profile maximum likelihood (PML) distribution, i.e., a distribution with the maximum profile likelihood. For each problem we provide polynomial time algorithms that given $n$ i.i.d.\ samples from a discr… ▽ More In this paper we consider the problem of computing the likelihood of the profile of a discrete distribution, i.e., the probability of observing the multiset of element frequencies, and computing a profile maximum likelihood (PML) distribution, i.e., a distribution with the maximum profile likelihood. For each problem we provide polynomial time algorithms that given $n$ i.i.d.\ samples from a discrete distribution, achieve an approximation factor of $\exp\left(-O(\sqrt{n} \log n) \right)$, improving upon the previous best-known bound achievable in polynomial time of $\exp(-O(n^{2/3} \log n))$ (Charikar, Shiragur and Sidford, 2019). Through the work of Acharya, Das, Orlitsky and Suresh (2016), this implies a polynomial time universal estimator for symmetric properties of discrete distributions in a broader range of error parameter. We achieve these results by providing new bounds on the quality of approximation of the Bethe and Sinkhorn permanents (Vontobel, 2012 and 2014). We show that each of these are $\exp(O(k \log(N/k)))$ approximations to the permanent of $N \times N$ matrices with non-negative rank at most $k$, improving upon the previous known bounds of $\exp(O(N))$. To obtain our results on PML, we exploit the fact that the PML objective is proportional to the permanent of a certain Vandermonde matrix with $\sqrt{n}$ distinct columns, i.e. with non-negative rank at most $\sqrt{n}$. As a by-product of our work we establish a surprising connection between the convex relaxation in prior work (CSS19) and the well-studied Bethe and Sinkhorn approximations. △ Less

Submitted 6 April, 2020; originally announced April 2020.

Comments: 59 pages

arXiv:2003.00844 [pdf, other]

A General Framework for Symmetric Property Estimation

Authors: Moses Charikar, Kirankumar Shiragur, Aaron Sidford

Abstract: In this paper we provide a general framework for estimating symmetric properties of distributions from i.i.d. samples. For a broad class of symmetric properties we identify the easy region where empirical estimation works and the difficult region where more complex estimators are required. We show that by approximately computing the profile maximum likelihood (PML) distribution \cite{ADOS16} in th… ▽ More In this paper we provide a general framework for estimating symmetric properties of distributions from i.i.d. samples. For a broad class of symmetric properties we identify the easy region where empirical estimation works and the difficult region where more complex estimators are required. We show that by approximately computing the profile maximum likelihood (PML) distribution \cite{ADOS16} in this difficult region we obtain a symmetric property estimation framework that is sample complexity optimal for many properties in a broader parameter regime than previous universal estimation approaches based on PML. The resulting algorithms based on these pseudo PML distributions are also more practical. △ Less

Submitted 2 March, 2020; originally announced March 2020.

Comments: Published in Neural Information Processing Systems 2019

arXiv:1908.11071 [pdf, ps, other]

Solving Discounted Stochastic Two-Player Games with Near-Optimal Time and Sample Complexity

Authors: Aaron Sidford, Mengdi Wang, Lin F. Yang, Yinyu Ye

Abstract: In this paper, we settle the sampling complexity of solving discounted two-player turn-based zero-sum stochastic games up to polylogarithmic factors. Given a stochastic game with discount factor $γ\in(0,1)$ we provide an algorithm that computes an $ε$-optimal strategy with high-probability given $\tilde{O}((1 - γ)^{-3} ε^{-2})$ samples from the transition function for each state-action-pair. Our a… ▽ More In this paper, we settle the sampling complexity of solving discounted two-player turn-based zero-sum stochastic games up to polylogarithmic factors. Given a stochastic game with discount factor $γ\in(0,1)$ we provide an algorithm that computes an $ε$-optimal strategy with high-probability given $\tilde{O}((1 - γ)^{-3} ε^{-2})$ samples from the transition function for each state-action-pair. Our algorithm runs in time nearly linear in the number of samples and uses space nearly linear in the number of state-action pairs. As stochastic games generalize Markov decision processes (MDPs) our runtime and sample complexities are optimal due to Azar et al (2013). We achieve our results by showing how to generalize a near-optimal Q-learning based algorithms for MDP, in particular Sidford et al (2018), to two-player strategy computation algorithms. This overcomes limitations of standard Q-learning and strategy iteration or alternating minimization based approaches and we hope will pave the way for future reinforcement learning results by facilitating the extension of MDP results to multi-agent settings with little loss. △ Less

Submitted 29 August, 2019; originally announced August 2019.

arXiv:1906.11985 [pdf, other]

Near-Optimal Methods for Minimizing Star-Convex Functions and Beyond

Authors: Oliver Hinder, Aaron Sidford, Nimit S. Sohoni

Abstract: In this paper, we provide near-optimal accelerated first-order methods for minimizing a broad class of smooth nonconvex functions that are strictly unimodal on all lines through a minimizer. This function class, which we call the class of smooth quasar-convex functions, is parameterized by a constant $γ\in (0,1]$, where $γ= 1$ encompasses the classes of smooth convex and star-convex functions, and… ▽ More In this paper, we provide near-optimal accelerated first-order methods for minimizing a broad class of smooth nonconvex functions that are strictly unimodal on all lines through a minimizer. This function class, which we call the class of smooth quasar-convex functions, is parameterized by a constant $γ\in (0,1]$, where $γ= 1$ encompasses the classes of smooth convex and star-convex functions, and smaller values of $γ$ indicate that the function can be "more nonconvex." We develop a variant of accelerated gradient descent that computes an $ε$-approximate minimizer of a smooth $γ$-quasar-convex function with at most $O(γ^{-1} ε^{-1/2} \log(γ^{-1} ε^{-1}))$ total function and gradient evaluations. We also derive a lower bound of $Ω(γ^{-1} ε^{-1/2})$ on the worst-case number of gradient evaluations required by any deterministic first-order method, showing that, up to a logarithmic factor, no deterministic first-order method can improve upon ours. △ Less

Submitted 24 February, 2023; v1 submitted 27 June, 2019; originally announced June 2019.

Comments: 48 pages. Published as a conference paper at COLT 2020

arXiv:1906.00618 [pdf, other]

A Direct $\tilde{O}(1/ε)$ Iteration Parallel Algorithm for Optimal Transport

Authors: Arun Jambulapati, Aaron Sidford, Kevin Tian

Abstract: Optimal transportation, or computing the Wasserstein or ``earth mover's'' distance between two distributions, is a fundamental primitive which arises in many learning and statistical settings. We give an algorithm which solves this problem to additive $ε$ with $\tilde{O}(1/ε)$ parallel depth, and $\tilde{O}\left(n^2/ε\right)$ work. Barring a breakthrough on a long-standing algorithmic open problem… ▽ More Optimal transportation, or computing the Wasserstein or ``earth mover's'' distance between two distributions, is a fundamental primitive which arises in many learning and statistical settings. We give an algorithm which solves this problem to additive $ε$ with $\tilde{O}(1/ε)$ parallel depth, and $\tilde{O}\left(n^2/ε\right)$ work. Barring a breakthrough on a long-standing algorithmic open problem, this is optimal for first-order methods. Blanchet et. al. '18, Quanrud '19 obtained similar runtimes through reductions to positive linear programming and matrix scaling. However, these reduction-based algorithms use complicated subroutines which may be deemed impractical due to requiring solvers for second-order iterations (matrix scaling) or non-parallelizability (positive LP). The fastest practical algorithms run in time $\tilde{O}(\min(n^2 / ε^2, n^{2.5} / ε))$ (Dvurechensky et. al. '18, Lin et. al. '19). We bridge this gap by providing a parallel, first-order, $\tilde{O}(1/ε)$ iteration algorithm without worse dependence on dimension, and provide preliminary experimental evidence that our algorithm may enjoy improved practical performance. We obtain this runtime via a primal-dual extragradient method, motivated by recent theoretical improvements to maximum flow (Sherman '17). △ Less

Submitted 3 June, 2019; originally announced June 2019.

Comments: 23 pages, 2 figures

arXiv:1905.08448 [pdf, other]

Efficient Profile Maximum Likelihood for Universal Symmetric Property Estimation

Authors: Moses Charikar, Kirankumar Shiragur, Aaron Sidford

Abstract: Estimating symmetric properties of a distribution, e.g. support size, coverage, entropy, distance to uniformity, are among the most fundamental problems in algorithmic statistics. While each of these properties have been studied extensively and separate optimal estimators are known for each, in striking recent work, Acharya et al. 2016 showed that there is a single estimator that is competitive fo… ▽ More Estimating symmetric properties of a distribution, e.g. support size, coverage, entropy, distance to uniformity, are among the most fundamental problems in algorithmic statistics. While each of these properties have been studied extensively and separate optimal estimators are known for each, in striking recent work, Acharya et al. 2016 showed that there is a single estimator that is competitive for all symmetric properties. This work proved that computing the distribution that approximately maximizes \emph{profile likelihood (PML)}, i.e. the probability of observed frequency of frequencies, and returning the value of the property on this distribution is sample competitive with respect to a broad class of estimators of symmetric properties. Further, they showed that even computing an approximation of the PML suffices to achieve such a universal plug-in estimator. Unfortunately, prior to this work there was no known polynomial time algorithm to compute an approximate PML and it was open to obtain a polynomial time universal plug-in estimator through the use of approximate PML. In this paper we provide a algorithm (in number of samples) that, given $n$ samples from a distribution, computes an approximate PML distribution up to a multiplicative error of $\exp(n^{2/3} \mathrm{poly} \log(n))$ in time nearly linear in $n$. Generalizing work of Acharya et al. 2016 on the utility of approximate PML we show that our algorithm provides a nearly linear time universal plug-in estimator for all symmetric functions up to accuracy $ε= Ω(n^{-0.166})$. Further, we show how to extend our work to provide efficient polynomial-time algorithms for computing a $d$-dimensional generalization of PML (for constant $d$) that allows for universal plug-in estimation of symmetric relationships between distributions. △ Less

Submitted 21 May, 2019; originally announced May 2019.

Comments: 68 pages

arXiv:1904.08544 [pdf, ps, other]

Memory-Sample Tradeoffs for Linear Regression with Small Error

Authors: Vatsal Sharan, Aaron Sidford, Gregory Valiant

Abstract: We consider the problem of performing linear regression over a stream of $d$-dimensional examples, and show that any algorithm that uses a subquadratic amount of memory exhibits a slower rate of convergence than can be achieved without memory constraints. Specifically, consider a sequence of labeled examples $(a_1,b_1), (a_2,b_2)\ldots,$ with $a_i$ drawn independently from a $d$-dimensional isotro… ▽ More We consider the problem of performing linear regression over a stream of $d$-dimensional examples, and show that any algorithm that uses a subquadratic amount of memory exhibits a slower rate of convergence than can be achieved without memory constraints. Specifically, consider a sequence of labeled examples $(a_1,b_1), (a_2,b_2)\ldots,$ with $a_i$ drawn independently from a $d$-dimensional isotropic Gaussian, and where $b_i = \langle a_i, x\rangle + η_i,$ for a fixed $x \in \mathbb{R}^d$ with $\|x\|_2 = 1$ and with independent noise $η_i$ drawn uniformly from the interval $[-2^{-d/5},2^{-d/5}].$ We show that any algorithm with at most $d^2/4$ bits of memory requires at least $Ω(d \log \log \frac{1}ε)$ samples to approximate $x$ to $\ell_2$ error $ε$ with probability of success at least $2/3$, for $ε$ sufficiently small as a function of $d$. In contrast, for such $ε$, $x$ can be recovered to error $ε$ with probability $1-o(1)$ with memory $O\left(d^2 \log(1/ε)\right)$ using $d$ examples. This represents the first nontrivial lower bounds for regression with super-linear memory, and may open the door for strong memory/sample tradeoffs for continuous optimization. △ Less

Submitted 10 October, 2020; v1 submitted 17 April, 2019; originally announced April 2019.

Comments: A few minor edits over previous version

arXiv:1903.02675 [pdf, other]

A Rank-1 Sketch for Matrix Multiplicative Weights

Authors: Yair Carmon, John C. Duchi, Aaron Sidford, Kevin Tian

Abstract: We show that a simple randomized sketch of the matrix multiplicative weight (MMW) update enjoys (in expectation) the same regret bounds as MMW, up to a small constant factor. Unlike MMW, where every step requires full matrix exponentiation, our steps require only a single product of the form $e^A b$, which the Lanczos method approximates efficiently. Our key technique is to view the sketch as a… ▽ More We show that a simple randomized sketch of the matrix multiplicative weight (MMW) update enjoys (in expectation) the same regret bounds as MMW, up to a small constant factor. Unlike MMW, where every step requires full matrix exponentiation, our steps require only a single product of the form $e^A b$, which the Lanczos method approximates efficiently. Our key technique is to view the sketch as a $\textit{randomized mirror projection}$, and perform mirror descent analysis on the $\textit{expected projection}$. Our sketch solves the online eigenvector problem, improving the best known complexity bounds by $Ω(\log^5 n)$. We also apply this sketch to semidefinite programming in saddle-point form, yielding a simple primal-dual scheme with guarantees matching the best in the literature. △ Less

Submitted 12 August, 2019; v1 submitted 6 March, 2019; originally announced March 2019.

arXiv:1711.08426 [pdf, ps, other]

Leverage Score Sampling for Faster Accelerated Regression and ERM

Authors: Naman Agarwal, Sham Kakade, Rahul Kidambi, Yin Tat Lee, Praneeth Netrapalli, Aaron Sidford

Abstract: Given a matrix $\mathbf{A}\in\mathbb{R}^{n\times d}$ and a vector $b \in\mathbb{R}^{d}$, we show how to compute an $ε$-approximate solution to the regression problem $ \min_{x\in\mathbb{R}^{d}}\frac{1}{2} \|\mathbf{A} x - b\|_{2}^{2} $ in time $ \tilde{O} ((n+\sqrt{d\cdotκ_{\text{sum}}})\cdot s\cdot\logε^{-1}) $ where… ▽ More Given a matrix $\mathbf{A}\in\mathbb{R}^{n\times d}$ and a vector $b \in\mathbb{R}^{d}$, we show how to compute an $ε$-approximate solution to the regression problem $ \min_{x\in\mathbb{R}^{d}}\frac{1}{2} \|\mathbf{A} x - b\|_{2}^{2} $ in time $ \tilde{O} ((n+\sqrt{d\cdotκ_{\text{sum}}})\cdot s\cdot\logε^{-1}) $ where $κ_{\text{sum}}=\mathrm{tr}\left(\mathbf{A}^{\top}\mathbf{A}\right)/λ_{\min}(\mathbf{A}^{T}\mathbf{A})$ and $s$ is the maximum number of non-zero entries in a row of $\mathbf{A}$. Our algorithm improves upon the previous best running time of $ \tilde{O} ((n+\sqrt{n \cdotκ_{\text{sum}}})\cdot s\cdot\logε^{-1})$. We achieve our result through a careful combination of leverage score sampling techniques, proximal point methods, and accelerated coordinate descent. Our method not only matches the performance of previous methods, but further improves whenever leverage scores of rows are small (up to polylogarithmic factors). We also provide a non-linear generalization of these results that improves the running time for solving a broader class of ERM problems. △ Less

Submitted 22 November, 2017; originally announced November 2017.

arXiv:1710.09430 [pdf, ps, other]

doi 10.4230/LIPIcs.FSTTCS.2017.2

A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares)

Authors: Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, Venkata Krishna Pillutla, Aaron Sidford

Abstract: This work provides a simplified proof of the statistical minimax optimality of (iterate averaged) stochastic gradient descent (SGD), for the special case of least squares. This result is obtained by analyzing SGD as a stochastic process and by sharply characterizing the stationary covariance matrix of this process. The finite rate optimality characterization captures the constant factors and addre… ▽ More This work provides a simplified proof of the statistical minimax optimality of (iterate averaged) stochastic gradient descent (SGD), for the special case of least squares. This result is obtained by analyzing SGD as a stochastic process and by sharply characterizing the stationary covariance matrix of this process. The finite rate optimality characterization captures the constant factors and addresses model mis-specification. △ Less

Submitted 21 July, 2018; v1 submitted 25 October, 2017; originally announced October 2017.

Comments: Lemma 1 has been updated in v2

arXiv:1704.08227 [pdf, other]

Accelerating Stochastic Gradient Descent For Least Squares Regression

Authors: Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, Aaron Sidford

Abstract: There is widespread sentiment that it is not possible to effectively utilize fast gradient methods (e.g. Nesterov's acceleration, conjugate gradient, heavy ball) for the purposes of stochastic optimization due to their instability and error accumulation, a notion made precise in d'Aspremont 2008 and Devolder, Glineur, and Nesterov 2014. This work considers these issues for the special case of stoc… ▽ More There is widespread sentiment that it is not possible to effectively utilize fast gradient methods (e.g. Nesterov's acceleration, conjugate gradient, heavy ball) for the purposes of stochastic optimization due to their instability and error accumulation, a notion made precise in d'Aspremont 2008 and Devolder, Glineur, and Nesterov 2014. This work considers these issues for the special case of stochastic approximation for the least squares regression problem, and our main result refutes the conventional wisdom by showing that acceleration can be made robust to statistical errors. In particular, this work introduces an accelerated stochastic gradient method that provably achieves the minimax optimal statistical risk faster than stochastic gradient descent. Critical to the analysis is a sharp characterization of accelerated stochastic gradient descent as a stochastic process. We hope this characterization gives insights towards the broader question of designing simple and effective accelerated stochastic methods for more general convex and non-convex optimization problems. △ Less

Submitted 31 July, 2018; v1 submitted 26 April, 2017; originally announced April 2017.

Comments: 54 pages, 3 figures, 1 table; updated acknowledgements, minor title change. Paper appeared in the proceedings of the Conference on Learning Theory (COLT), 2018

arXiv:1610.03774 [pdf, other]

Parallelizing Stochastic Gradient Descent for Least Squares Regression: mini-batching, averaging, and model misspecification

Authors: Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, Aaron Sidford

Abstract: This work characterizes the benefits of averaging schemes widely used in conjunction with stochastic gradient descent (SGD). In particular, this work provides a sharp analysis of: (1) mini-batching, a method of averaging many samples of a stochastic gradient to both reduce the variance of the stochastic gradient estimate and for parallelizing SGD and (2) tail-averaging, a method involving averagin… ▽ More This work characterizes the benefits of averaging schemes widely used in conjunction with stochastic gradient descent (SGD). In particular, this work provides a sharp analysis of: (1) mini-batching, a method of averaging many samples of a stochastic gradient to both reduce the variance of the stochastic gradient estimate and for parallelizing SGD and (2) tail-averaging, a method involving averaging the final few iterates of SGD to decrease the variance in SGD's final iterate. This work presents non-asymptotic excess risk bounds for these schemes for the stochastic approximation problem of least squares regression. Furthermore, this work establishes a precise problem-dependent extent to which mini-batch SGD yields provable near-linear parallelization speedups over SGD with batch size one. This allows for understanding learning rate versus batch size tradeoffs for the final iterate of an SGD method. These results are then utilized in providing a highly parallelizable SGD method that obtains the minimax risk with nearly the same number of serial updates as batch gradient descent, improving significantly over existing SGD methods. A non-asymptotic analysis of communication efficient parallelization schemes such as model-averaging/parameter mixing methods is then provided. Finally, this work sheds light on some fundamental differences in SGD's behavior when dealing with agnostic noise in the (non-realizable) least squares regression problem. In particular, the work shows that the stepsizes that ensure minimax risk for the agnostic case must be a function of the noise properties. This paper builds on the operator view of analyzing SGD methods, introduced by Defossez and Bach (2015), followed by developing a novel analysis in bounding these operators to characterize the excess risk. These techniques are of broader interest in analyzing computational aspects of stochastic approximation. △ Less

Submitted 31 July, 2018; v1 submitted 12 October, 2016; originally announced October 2016.

Comments: 39 pages. Published in the Journal of Machine Learning Research (JMLR)

arXiv:1604.03930 [pdf, ps, other]

Efficient Algorithms for Large-scale Generalized Eigenvector Computation and Canonical Correlation Analysis

Authors: Rong Ge, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, Aaron Sidford

Abstract: This paper considers the problem of canonical-correlation analysis (CCA) (Hotelling, 1936) and, more broadly, the generalized eigenvector problem for a pair of symmetric matrices. These are two fundamental problems in data analysis and scientific computing with numerous applications in machine learning and statistics (Shi and Malik, 2000; Hardoon et al., 2004; Witten et al., 2009). We provide si… ▽ More This paper considers the problem of canonical-correlation analysis (CCA) (Hotelling, 1936) and, more broadly, the generalized eigenvector problem for a pair of symmetric matrices. These are two fundamental problems in data analysis and scientific computing with numerous applications in machine learning and statistics (Shi and Malik, 2000; Hardoon et al., 2004; Witten et al., 2009). We provide simple iterative algorithms, with improved runtimes, for solving these problems that are globally linearly convergent with moderate dependencies on the condition numbers and eigenvalue gaps of the matrices involved. We obtain our results by reducing CCA to the top-$k$ generalized eigenvector problem. We solve this problem through a general framework that simply requires black box access to an approximate linear system solver. Instantiating this framework with accelerated gradient descent we obtain a running time of $O(\frac{z k \sqrtκ}ρ \log(1/ε) \log \left(kκ/ρ\right))$ where $z$ is the total number of nonzero entries, $κ$ is the condition number and $ρ$ is the relative eigenvalue gap of the appropriate matrices. Our algorithm is linear in the input size and the number of components $k$ up to a $\log(k)$ factor. This is essential for handling large-scale matrices that appear in practice. To the best of our knowledge this is the first such algorithm with global linear convergence. We hope that our results prompt further research and ultimately improve the practical running time for performing these important data analysis procedures on large data sets. △ Less

Submitted 27 May, 2016; v1 submitted 13 April, 2016; originally announced April 2016.

Comments: International Conference on Machine Learning (ICML) 2016

arXiv:1602.06929 [pdf, ps, other]

Streaming PCA: Matching Matrix Bernstein and Near-Optimal Finite Sample Guarantees for Oja's Algorithm

Authors: Prateek Jain, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, Aaron Sidford

Abstract: This work provides improved guarantees for streaming principle component analysis (PCA). Given $A_1, \ldots, A_n\in \mathbb{R}^{d\times d}$ sampled independently from distributions satisfying $\mathbb{E}[A_i] = Σ$ for $Σ\succeq \mathbf{0}$, this work provides an $O(d)$-space linear-time single-pass streaming algorithm for estimating the top eigenvector of $Σ$. The algorithm nearly matches (and in… ▽ More This work provides improved guarantees for streaming principle component analysis (PCA). Given $A_1, \ldots, A_n\in \mathbb{R}^{d\times d}$ sampled independently from distributions satisfying $\mathbb{E}[A_i] = Σ$ for $Σ\succeq \mathbf{0}$, this work provides an $O(d)$-space linear-time single-pass streaming algorithm for estimating the top eigenvector of $Σ$. The algorithm nearly matches (and in certain cases improves upon) the accuracy obtained by the standard batch method that computes top eigenvector of the empirical covariance $\frac{1}{n} \sum_{i \in [n]} A_i$ as analyzed by the matrix Bernstein inequality. Moreover, to achieve constant accuracy, our algorithm improves upon the best previous known sample complexities of streaming algorithms by either a multiplicative factor of $O(d)$ or $1/\mathrm{gap}$ where $\mathrm{gap}$ is the relative distance between the top two eigenvalues of $Σ$. These results are achieved through a novel analysis of the classic Oja's algorithm, one of the oldest and most popular algorithms for streaming PCA. In particular, this work shows that simply picking a random initial point $w_0$ and applying the update rule $w_{i + 1} = w_i + η_i A_i w_i$ suffices to accurately estimate the top eigenvector, with a suitable choice of $η_i$. We believe our result sheds light on how to efficiently perform streaming PCA both in theory and in practice and we hope that our analysis may serve as the basis for analyzing many variants and extensions of streaming PCA. △ Less

Submitted 28 March, 2016; v1 submitted 22 February, 2016; originally announced February 2016.

Comments: Updated title

arXiv:1602.06872 [pdf, ps, other]

Principal Component Projection Without Principal Component Analysis

Authors: Roy Frostig, Cameron Musco, Christopher Musco, Aaron Sidford

Abstract: We show how to efficiently project a vector onto the top principal components of a matrix, without explicitly computing these components. Specifically, we introduce an iterative algorithm that provably computes the projection using few calls to any black-box routine for ridge regression. By avoiding explicit principal component analysis (PCA), our algorithm is the first with no runtime dependenc… ▽ More We show how to efficiently project a vector onto the top principal components of a matrix, without explicitly computing these components. Specifically, we introduce an iterative algorithm that provably computes the projection using few calls to any black-box routine for ridge regression. By avoiding explicit principal component analysis (PCA), our algorithm is the first with no runtime dependence on the number of top principal components. We show that it can be used to give a fast iterative method for the popular principal component regression problem, giving the first major runtime improvement over the naive method of combining PCA with regression. To achieve our results, we first observe that ridge regression can be used to obtain a "smooth projection" onto the top principal components. We then sharpen this approximation to true projection using a low-degree polynomial approximation to the matrix step function. Step function approximation is a topic of long-term interest in scientific computing. We extend prior theory by constructing polynomials with simple iterative structure and rigorously analyzing their behavior under limited precision. △ Less

Submitted 26 November, 2019; v1 submitted 22 February, 2016; originally announced February 2016.

arXiv:1506.07512 [pdf, other]

Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization

Authors: Roy Frostig, Rong Ge, Sham M. Kakade, Aaron Sidford

Abstract: We develop a family of accelerated stochastic algorithms that minimize sums of convex functions. Our algorithms improve upon the fastest running time for empirical risk minimization (ERM), and in particular linear least-squares regression, across a wide range of problem settings. To achieve this, we establish a framework based on the classical proximal point algorithm. Namely, we provide several a… ▽ More We develop a family of accelerated stochastic algorithms that minimize sums of convex functions. Our algorithms improve upon the fastest running time for empirical risk minimization (ERM), and in particular linear least-squares regression, across a wide range of problem settings. To achieve this, we establish a framework based on the classical proximal point algorithm. Namely, we provide several algorithms that reduce the minimization of a strongly convex function to approximate minimizations of regularizations of the function. Using these results, we accelerate recent fast stochastic algorithms in a black-box fashion. Empirically, we demonstrate that the resulting algorithms exhibit notions of stability that are advantageous in practice. Both in theory and in practice, the provided algorithms reap the computational benefits of adding a large strongly convex regularization term, without incurring a corresponding bias to the original problem. △ Less

Submitted 24 June, 2015; originally announced June 2015.

arXiv:1412.6606 [pdf, other]

Competing with the Empirical Risk Minimizer in a Single Pass

Authors: Roy Frostig, Rong Ge, Sham M. Kakade, Aaron Sidford

Abstract: In many estimation problems, e.g. linear and logistic regression, we wish to minimize an unknown objective given only unbiased samples of the objective function. Furthermore, we aim to achieve this using as few samples as possible. In the absence of computational constraints, the minimizer of a sample average of observed data -- commonly referred to as either the empirical risk minimizer (ERM) or… ▽ More In many estimation problems, e.g. linear and logistic regression, we wish to minimize an unknown objective given only unbiased samples of the objective function. Furthermore, we aim to achieve this using as few samples as possible. In the absence of computational constraints, the minimizer of a sample average of observed data -- commonly referred to as either the empirical risk minimizer (ERM) or the $M$-estimator -- is widely regarded as the estimation strategy of choice due to its desirable statistical convergence properties. Our goal in this work is to perform as well as the ERM, on every problem, while minimizing the use of computational resources such as running time and space usage. We provide a simple streaming algorithm which, under standard regularity assumptions on the underlying problem, enjoys the following properties: * The algorithm can be implemented in linear time with a single pass of the observed data, using space linear in the size of a single sample. * The algorithm achieves the same statistical rate of convergence as the empirical risk minimizer on every problem, even considering constant factors. * The algorithm's performance depends on the initial error at a rate that decreases super-polynomially. * The algorithm is easily parallelizable. Moreover, we quantify the (finite-sample) rate at which the algorithm becomes competitive with the ERM. △ Less

Submitted 25 February, 2015; v1 submitted 20 December, 2014; originally announced December 2014.

arXiv:1408.5099 [pdf, ps, other]

Uniform Sampling for Matrix Approximation

Authors: Michael B. Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, Aaron Sidford

Abstract: Random sampling has become a critical tool in solving massive matrix problems. For linear regression, a small, manageable set of data rows can be randomly selected to approximate a tall, skinny data matrix, improving processing time significantly. For theoretical performance guarantees, each row must be sampled with probability proportional to its statistical leverage score. Unfortunately, leverag… ▽ More Random sampling has become a critical tool in solving massive matrix problems. For linear regression, a small, manageable set of data rows can be randomly selected to approximate a tall, skinny data matrix, improving processing time significantly. For theoretical performance guarantees, each row must be sampled with probability proportional to its statistical leverage score. Unfortunately, leverage scores are difficult to compute. A simple alternative is to sample rows uniformly at random. While this often works, uniform sampling will eliminate critical row information for many natural instances. We take a fresh look at uniform sampling by examining what information it does preserve. Specifically, we show that uniform sampling yields a matrix that, in some sense, well approximates a large fraction of the original. While this weak form of approximation is not enough for solving linear regression directly, it is enough to compute a better approximation. This observation leads to simple iterative row sampling algorithms for matrix approximation that run in input-sparsity time and preserve row structure and sparsity at all intermediate steps. In addition to an improved understanding of uniform sampling, our main proof introduces a structural result of independent interest: we show that every matrix can be made to have low coherence by reweighting a small subset of its rows. △ Less

Submitted 21 August, 2014; originally announced August 2014.

Showing 1–28 of 28 results for author: Sidford, A