A Symplectic Analysis of Alternating Mirror Descent

Authors: Jonas Katona, Xiuyuan Wang, Andre Wibisono

Abstract: Motivated by understanding the behavior of the Alternating Mirror Descent (AMD) algorithm for bilinear zero-sum games, we study the discretization of continuous-time Hamiltonian flow via the symplectic Euler method. We provide a framework for analysis using results from Hamiltonian dynamics, Lie algebra, and symplectic numerical integrators, with an emphasis on the existence and properties of a conserved quantity, the modified Hamiltonian (MH), for the symplectic Euler method. We compute the MH in closed-form when the original Hamiltonian is a quadratic function, and show that it generally differs from the other conserved quantity known previously in that case. We derive new error bounds on the MH when truncated at orders in the stepsize in terms of the number of iterations, $K$, and use these bounds to show an improved $\mathcal{O}(K^{1/5})$ total regret bound and an $\mathcal{O}(K^{-4/5})$ duality gap of the average iterates for AMD. Finally, we propose a conjecture which, if true, would imply that the total regret for AMD scales as $\mathcal{O}\left(K^{\varepsilon}\right)$ and the duality gap of the average iterates as $\mathcal{O}\left(K^{-1+\varepsilon}\right)$ for any $\varepsilon>0$, and we can take $\varepsilon=0$ upon certain convergence conditions for the MH.

Submitted 28 May, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

Comments: 94 pages, 3 figures

arXiv:2402.17067

On Independent Samples Along the Langevin Diffusion and the Unadjusted Langevin Algorithm

Authors: Jiaming Liang, Siddharth Mitra, Andre Wibisono

Abstract: We study the rate at which the initial and current random variables become independent along a Markov chain, focusing on the Langevin diffusion in continuous time and the Unadjusted Langevin Algorithm (ULA) in discrete time. We measure the dependence between random variables via their mutual information. For the Langevin diffusion, we show the mutual information converges to $0$ exponentially fast when the target is strongly log-concave, and at a polynomial rate when the target is weakly log-concave. These rates are analogous to the mixing time of the Langevin diffusion under similar assumptions. For the ULA, we show the mutual information converges to $0$ exponentially fast when the target is strongly log-concave and smooth. We prove our results by developing the mutual version of the mixing time analyses of these Markov chains. We also provide alternative proofs based on strong data processing inequalities for the Langevin diffusion and the ULA, and by showing regularity results for these processes in mutual information.

Submitted 26 February, 2024; originally announced February 2024.

Comments: 41 pages

arXiv:2312.08823

Fast sampling from constrained spaces using the Metropolis-adjusted Mirror Langevin algorithm

Authors: Vishwak Srinivasan, Andre Wibisono, Ashia Wilson

Abstract: We propose a new method called the Metropolis-adjusted Mirror Langevin algorithm for approximate sampling from distributions whose support is a compact and convex set. This algorithm adds an accept-reject filter to the Markov chain induced by a single step of the Mirror Langevin algorithm (Zhang et al., 2020), which is a basic discretisation of the Mirror Langevin dynamics. Due to the inclusion of this filter, our method is unbiased relative to the target, while known discretisations of the Mirror Langevin dynamics including the Mirror Langevin algorithm have an asymptotic bias. For this algorithm, we also give upper bounds for the number of iterations taken to mix to a constrained distribution whose potential is relatively smooth, convex, and Lipschitz continuous with respect to a self-concordant mirror function. As a consequence of the reversibility of the Markov chain induced by the inclusion of the Metropolis-Hastings filter, we obtain an exponentially better dependence on the error tolerance for approximate constrained sampling. We also present numerical experiments that corroborate our theoretical findings.

Submitted 21 June, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

Comments: 49 pages, 6 figures, 2 tables. Shorter version without experiments accepted to COLT 2024

arXiv:2309.14155

Extragradient Type Methods for Riemannian Variational Inequality Problems

Authors: Zihao Hu, Guanghui Wang, Xi Wang, Andre Wibisono, Jacob Abernethy, Molei Tao

Abstract: Riemannian convex optimization and minimax optimization have recently drawn considerable attention. Their appeal lies in their capacity to adeptly manage the non-convexity of the objective function as well as constraints inherent in the feasible set in the Euclidean sense. In this work, we delve into monotone Riemannian Variational Inequality Problems (RVIPs), which encompass both Riemannian convex optimization and minimax optimization as particular cases. In the context of Euclidean space, it is established that the last-iterates of both the extragradient (EG) and past extragradient (PEG) methods converge to the solution of monotone variational inequality problems at a rate of $O\left(\frac{1}{\sqrt{T}}\right)$ (Cai et al., 2022). However, analogous behavior on Riemannian manifolds remains an open question. To bridge this gap, we introduce the Riemannian extragradient (REG) and Riemannian past extragradient (RPEG) methods. We demonstrate that both exhibit $O\left(\frac{1}{\sqrt{T}}\right)$ last-iterate convergence. Additionally, we show that the average-iterate convergence of both REG and RPEG is $O\left(\frac{1}{T}\right)$, aligning with observations in the Euclidean case (Mokhtari et al., 2020). These results are enabled by judiciously addressing the holonomy effect so that additional complications in Riemannian cases can be reduced and the Euclidean proof inspired by the performance estimation problem (PEP) technique or the sum-of-squares (SOS) technique can be applied again.

Submitted 1 June, 2024; v1 submitted 25 September, 2023; originally announced September 2023.

Comments: Published in Proceedings of The 27th International Conference on Artificial Intelligence and Statistics (AISTATS 2024)

arXiv:2305.17244

Mitigating Catastrophic Forgetting in Long Short-Term Memory Networks

Authors: Ketaki Joshi, Raghavendra Pradyumna Pothukuchi, Andre Wibisono, Abhishek Bhattacharjee

Abstract: Continual learning on sequential data is critical for many machine learning (ML) deployments. Unfortunately, LSTM networks, which are commonly used to learn on sequential data, suffer from catastrophic forgetting and are limited in their ability to learn multiple tasks continually. We discover that catastrophic forgetting in LSTM networks can be overcome in two novel and readily-implementable ways -- separating the LSTM memory either for each task or for each target label. Our approach eschews the need for explicit regularization, hypernetworks, and other complex methods. We quantify the benefits of our approach on recently-proposed LSTM networks for computer memory access prefetching, an important sequential learning problem in ML-based computer system optimization. Compared to state-of-the-art weight regularization methods to mitigate catastrophic forgetting, our approach is simple, effective, and enables faster learning. We also show that our proposal enables the use of small, non-regularized LSTM networks for complex natural language processing in the offline learning scenario, which was previously considered difficult.

Submitted 26 May, 2023; originally announced May 2023.

arXiv:2302.07851

Continuized Acceleration for Quasar Convex Functions in Non-Convex Optimization

Authors: Jun-Kun Wang, Andre Wibisono

Abstract: Quasar convexity is a condition that allows some first-order methods to efficiently minimize a function even when the optimization landscape is non-convex. Previous works develop near-optimal accelerated algorithms for minimizing this class of functions, however, they require a subroutine of binary search which results in multiple calls to gradient evaluations in each iteration, and consequently the total number of gradient evaluations does not match a known lower bound. In this work, we show that a recently proposed continuized Nesterov acceleration can be applied to minimizing quasar convex functions and achieves the optimal bound with a high probability. Furthermore, we find that the objective functions of training generalized linear models (GLMs) satisfy quasar convexity, which broadens the applicability of the relevant algorithms, while known practical examples of quasar convexity in non-convex learning are sparse in the literature. We also show that if a smooth and one-point strongly convex, Polyak-Lojasiewicz, or quadratic-growth function satisfies quasar convexity, then attaining an accelerated linear rate for minimizing the function is possible under certain conditions, while acceleration is not known in general for these classes of functions.

Submitted 15 February, 2023; originally announced February 2023.

Comments: Accepted at ICLR (International Conference on Learning Representations), 2023

arXiv:2211.01512

Convergence of the Inexact Langevin Algorithm and Score-based Generative Models in KL Divergence

Authors: Kaylee Yingxi Yang, Andre Wibisono

Abstract: We study the Inexact Langevin Dynamics (ILD), Inexact Langevin Algorithm (ILA), and Score-based Generative Modeling (SGM) when utilizing estimated score functions for sampling. Our focus lies in establishing stable biased convergence guarantees in terms of the Kullback-Leibler (KL) divergence. To achieve these guarantees, we impose two key assumptions: 1) the target distribution satisfies the log-Sobolev inequality (LSI), and 2) the score estimator exhibits a bounded Moment Generating Function (MGF) error. Notably, the MGF error assumption we adopt is more lenient compared to the $L^\infty$ error assumption used in existing literature. However, it is stronger than the $L^2$ error assumption utilized in recent works, which often leads to unstable bounds. We explore the question of how to obtain a provably accurate score estimator that satisfies the MGF error assumption. Specifically, we demonstrate that a simple estimator based on kernel density estimation fulfills the MGF error assumption for sub-Gaussian target distribution, at the population level.

Submitted 2 June, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

arXiv:2210.16181

Aggregation in the Mirror Space (AIMS): Fast, Accurate Distributed Machine Learning in Military Settings

Authors: Ryan Yang, Haizhou Du, Andre Wibisono, Patrick Baker

Abstract: Distributed machine learning (DML) can be an important capability for modern military to take advantage of data and devices distributed at multiple vantage points to adapt and learn. The existing distributed machine learning frameworks, however, cannot realize the full benefits of DML, because they are all based on the simple linear aggregation framework, but linear aggregation cannot handle the $\textit{divergence challenges}$ arising in military settings: the learning data at different devices can be heterogeneous ($\textit{i.e.}$, Non-IID data), leading to model divergence, but the ability for devices to communicate is substantially limited ($\textit{i.e.}$, weak connectivity due to sparse and dynamic communications), reducing the ability for devices to reconcile model divergence. In this paper, we introduce a novel DML framework called aggregation in the mirror space (AIMS) that allows a DML system to introduce a general mirror function to map a model into a mirror space to conduct aggregation and gradient descent. Adapting the convexity of the mirror function according to the divergence force, AIMS allows automatic optimization of DML. We conduct both rigorous analysis and extensive experimental evaluations to demonstrate the benefits of AIMS. For example, we prove that AIMS achieves a loss of $O\left((\frac{m^{r+1}}{T})^{\frac1r}\right)$ after $T$ network-wide updates, where $m$ is the number of devices and $r$ the convexity of the mirror function, with existing linear aggregation frameworks being a special case with $r=2$. Our experimental evaluations using EMANE (Extendable Mobile Ad-hoc Network Emulator) for military communications settings show similar results: AIMS can improve DML convergence rate by up to 57% and scale well to more devices with weak connectivity, all with little additional computation overhead compared to traditional linear aggregation.

Submitted 28 October, 2022; originally announced October 2022.

Comments: 9 pages. To be published in MILCOM 2022

arXiv:2210.10019

Towards Understanding GD with Hard and Conjugate Pseudo-labels for Test-Time Adaptation

Authors: Jun-Kun Wang, Andre Wibisono

Abstract: We consider a setting in which a model needs to adapt to a new domain under distribution shifts, given that only unlabeled test samples from the new domain are accessible at test time. A common idea in most of the related works is constructing pseudo-labels for the unlabeled test samples and applying gradient descent (GD) to a loss function with the pseudo-labels. Recently, [GSRK22] proposed conjugate labels, a new kind of pseudo-label for self-training at test time. They empirically show that the conjugate label outperforms other ways of pseudo-labeling on many domain adaptation benchmarks. However, provably showing that GD with conjugate labels learns a good classifier for test-time adaptation remains open. In this work, we aim at theoretically understanding GD with hard and conjugate labels for a binary classification problem. We show that for square loss, GD with conjugate labels converges to an $ε$-optimal predictor under a Gaussian model for any arbitrarily small $ε$, while GD with hard pseudo-labels fails in this task. We also analyze them under different loss functions for the update. Our results shed light on when and why GD with hard labels or conjugate labels works in test-time adaptation.

Submitted 25 February, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

Comments: Accepted at ICLR (International Conference on Learning Representations), 2023

arXiv:2207.02189

Accelerating Hamiltonian Monte Carlo via Chebyshev Integration Time

Authors: Jun-Kun Wang, Andre Wibisono

Abstract: Hamiltonian Monte Carlo (HMC) is a popular method in sampling. While there are quite a few works of studying this method on various aspects, an interesting question is how to choose its integration time to achieve acceleration. In this work, we consider accelerating the process of sampling from a distribution $π(x) \propto \exp(-f(x))$ via HMC via time-varying integration time. When the potential $f$ is $L$-smooth and $m$-strongly convex, i.e., for sampling from a log-smooth and strongly log-concave target distribution $π$, it is known that under a constant integration time, the number of iterations that ideal HMC takes to get an $ε$ Wasserstein-2 distance to the target $π$ is $O( κ\log \frac{1}ε )$, where $κ:= \frac{L}{m}$ is the condition number. We propose a scheme of time-varying integration time based on the roots of Chebyshev polynomials. We show that in the case of quadratic potential $f$, i.e., when the target $π$ is a Gaussian distribution, ideal HMC with this choice of integration time only takes $O( \sqrtκ \log \frac{1}ε )$ number of iterations to reach Wasserstein-2 distance less than $ε$; this improvement on the dependence on condition number is akin to acceleration in optimization. The design and analysis of HMC with the proposed integration time is built on the tools of Chebyshev polynomials. Experiments find the advantage of adopting our scheme of time-varying integration time even for sampling from distributions with smooth strongly convex potentials that are not quadratic.

Submitted 14 February, 2023; v1 submitted 5 July, 2022; originally announced July 2022.

Comments: Accepted at ICLR (International Conference on Learning Representations), 2023

arXiv:2206.11872

Provable Acceleration of Heavy Ball beyond Quadratics for a Class of Polyak-Łojasiewicz Functions when the Non-Convexity is Averaged-Out

Authors: Jun-Kun Wang, Chi-Heng Lin, Andre Wibisono, Bin Hu

Abstract: Heavy Ball (HB) nowadays is one of the most popular momentum methods in non-convex optimization. It has been widely observed that incorporating the Heavy Ball dynamic in gradient-based methods accelerates the training process of modern machine learning models. However, the progress on establishing its theoretical foundation of acceleration is apparently far behind its empirical success. Existing provable acceleration results are of the quadratic or close-to-quadratic functions, as the current techniques of showing HB's acceleration are limited to the case when the Hessian is fixed. In this work, we develop some new techniques that help show acceleration beyond quadratics, which is achieved by analyzing how the change of the Hessian at two consecutive time points affects the convergence speed. Based on our technical results, a class of Polyak-Łojasiewicz (PL) optimization problems for which provable acceleration can be achieved via HB is identified. Moreover, our analysis demonstrates a benefit of adaptively setting the momentum parameter. (Update: 08/29/2023) Erratum is added in Appendix J. This is an updated version that fixes an issue in the previous version. An additional condition needs to be satisfied for the acceleration result of HB beyond quadratics in this work, which naturally holds when the dimension is one or, more broadly, when the Hessian is diagonal. We elaborate on the issue in Appendix J.

Submitted 29 August, 2023; v1 submitted 22 June, 2022; originally announced June 2022.

Comments: Proceedings of the 39th International Conference on Machine Learning (ICML 2022)

arXiv:2206.04160

Alternating Mirror Descent for Constrained Min-Max Games

Authors: Andre Wibisono, Molei Tao, Georgios Piliouras

Abstract: In this paper we study two-player bilinear zero-sum games with constrained strategy spaces. An instance of natural occurrences of such constraints is when mixed strategies are used, which correspond to a probability simplex constraint. We propose and analyze the alternating mirror descent algorithm, in which each player takes turns to take action following the mirror descent algorithm for constrained optimization. We interpret alternating mirror descent as an alternating discretization of a skew-gradient flow in the dual space, and use tools from convex optimization and modified energy function to establish an $O(K^{-2/3})$ bound on its average regret after $K$ iterations. This quantitatively verifies the algorithm's better behavior than the simultaneous version of mirror descent algorithm, which is known to diverge and yields an $O(K^{-1/2})$ average regret bound. In the special case of an unconstrained setting, our results recover the behavior of alternating gradient descent algorithm for zero-sum games which was studied in (Bailey et al., COLT 2020).

Submitted 8 June, 2022; originally announced June 2022.

arXiv:2201.12488

Achieving Efficient Distributed Machine Learning Using a Novel Non-Linear Class of Aggregation Functions

Authors: Haizhou Du, Ryan Yang, Yijian Chen, Qiao Xiang, Andre Wibisono, Wei Huang

Abstract: Distributed machine learning (DML) over time-varying networks can be an enabler for emerging decentralized ML applications such as autonomous driving and drone fleeting. However, the commonly used weighted arithmetic mean model aggregation function in existing DML systems can result in high model loss, low model accuracy, and slow convergence speed over time-varying networks. To address this issue, in this paper, we propose a novel non-linear class of model aggregation functions to achieve efficient DML over time-varying networks. Instead of taking a linear aggregation of neighboring models as most existing studies do, our mechanism uses a nonlinear aggregation, a weighted power-p mean (WPM), as the aggregation function of local models from neighbors. The subsequent optimizing steps are taken using mirror descent defined by a Bregman divergence that maintains convergence to optimality. In this paper, we analyze properties of the WPM and rigorously prove convergence properties of our aggregation mechanism. Additionally, through extensive experiments, we show that when p > 1, our design significantly improves the convergence speed of the model and the scalability of DML under time-varying networks compared with arithmetic mean aggregation functions, with little additional computation overhead.

Submitted 19 February, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

Comments: 13 pages, 26 figures

ACM Class: I.2.11

arXiv:2109.12077

The Mirror Langevin Algorithm Converges with Vanishing Bias

Authors: Ruilin Li, Molei Tao, Santosh S. Vempala, Andre Wibisono

Abstract: The technique of modifying the geometry of a problem from Euclidean to Hessian metric has proved to be quite effective in optimization, and has been the subject of study for sampling. The Mirror Langevin Diffusion (MLD) is a sampling analogue of mirror flow in continuous time, and it has nice convergence properties under log-Sobolev or Poincare inequalities relative to the Hessian metric, as shown by Chewi et al. (2020). In discrete time, a simple discretization of MLD is the Mirror Langevin Algorithm (MLA) studied by Zhang et al. (2020), who showed a biased convergence bound with a non-vanishing bias term (does not go to zero as step size goes to zero). This raised the question of whether we need a better analysis or a better discretization to achieve a vanishing bias. Here we study the basic Mirror Langevin Algorithm and show it indeed has a vanishing bias. We apply mean-square analysis based on Li et al. (2019) and Li et al. (2021) to show the mixing time bound for MLA under the modified self-concordance condition introduced by Zhang et al. (2020).

Submitted 11 October, 2021; v1 submitted 24 September, 2021; originally announced September 2021.

arXiv:1911.08418

Fast Convergence of Fictitious Play for Diagonal Payoff Matrices

Authors: Jacob Abernethy, Kevin A. Lai, Andre Wibisono

Abstract: Fictitious Play (FP) is a simple and natural dynamic for repeated play in zero-sum games. Proposed by Brown in 1949, FP was shown to converge to a Nash Equilibrium by Robinson in 1951, albeit at a slow rate that may depend on the dimension of the problem. In 1959, Karlin conjectured that FP converges at the more natural rate of $O(1/\sqrt{t})$. However, Daskalakis and Pan disproved a version of this conjecture in 2014, showing that a slow rate can occur, although their result relies on adversarial tie-breaking. In this paper, we show that Karlin's conjecture is indeed correct for the class of diagonal payoff matrices, as long as ties are broken lexicographically. Specifically, we show that FP converges at a $O(1/\sqrt{t})$ rate in the case when the payoff matrix is diagonal. We also prove this bound is tight by showing a matching lower bound in the identity payoff case under the lexicographic tie-breaking assumption.

Submitted 15 November, 2020; v1 submitted 19 November, 2019; originally announced November 2019.

arXiv:1911.01469

Proximal Langevin Algorithm: Rapid Convergence Under Isoperimetry

Authors: Andre Wibisono

Abstract: We study the Proximal Langevin Algorithm (PLA) for sampling from a probability distribution $ν= e^{-f}$ on $\mathbb{R}^n$ under isoperimetry. We prove a convergence guarantee for PLA in Kullback-Leibler (KL) divergence when $ν$ satisfies log-Sobolev inequality (LSI) and $f$ has bounded second and third derivatives. This improves on the result for the Unadjusted Langevin Algorithm (ULA), and matches the fastest known rate for sampling under LSI (without Metropolis filter) with a better dependence on the LSI constant. We also prove convergence guarantees for PLA in Rényi divergence of order $q > 1$ when the biased limit satisfies either LSI or Poincaré inequality.

Submitted 4 November, 2019; originally announced November 2019.

arXiv:1906.02027

Last-iterate convergence rates for min-max optimization

Authors: Jacob Abernethy, Kevin A. Lai, Andre Wibisono

Abstract: While classic work in convex-concave min-max optimization relies on average-iterate convergence results, the emergence of nonconvex applications such as training Generative Adversarial Networks has led to renewed interest in last-iterate convergence guarantees. Proving last-iterate convergence is challenging because many natural algorithms, such as Simultaneous Gradient Descent/Ascent, provably diverge or cycle even in simple convex-concave min-max settings, and previous work on global last-iterate convergence rates has been limited to the bilinear and convex-strongly concave settings. In this work, we show that the Hamiltonian Gradient Descent (HGD) algorithm achieves linear convergence in a variety of more general settings, including convex-concave problems that satisfy a "sufficiently bilinear" condition. We also prove similar convergence rates for the Consensus Optimization (CO) algorithm of [MNG17] for some parameter settings of CO.

Submitted 25 October, 2019; v1 submitted 5 June, 2019; originally announced June 2019.

arXiv:1903.08568 [pdf, other]

Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices

Authors: Santosh S. Vempala, Andre Wibisono

Abstract: We study the Unadjusted Langevin Algorithm (ULA) for sampling from a probability distribution $ν= e^{-f}$ on $\mathbb{R}^n$. We prove a convergence guarantee in Kullback-Leibler (KL) divergence assuming $ν$ satisfies a log-Sobolev inequality and the Hessian of $f$ is bounded. Notably, we do not assume convexity or bounds on higher derivatives. We also prove convergence guarantees in Rényi divergence of order $q > 1$ assuming the limit of ULA satisfies either the log-Sobolev or Poincaré inequality. We also prove a bound on the bias of the limiting distribution of ULA assuming third-order smoothness of $f$, without requiring isoperimetry.

Submitted 2 March, 2022; v1 submitted 20 March, 2019; originally announced March 2019.

Comments: v4: Updated discussion and added properties of biased limit. v3: Simplified analysis of Rényi divergence, improved exposition, and added figures. v2: Added analysis of Rényi divergence and Poincaré assumption.

arXiv:1805.01401 [pdf, ps, other]

Convexity of mutual information along the Ornstein-Uhlenbeck flow

Authors: Andre Wibisono, Varun Jog

Abstract: We study the convexity of mutual information as a function of time along the flow of the Ornstein-Uhlenbeck process. We prove that if the initial distribution is strongly log-concave, then mutual information is eventually convex, i.e., convex for all large time. In particular, if the initial distribution is sufficiently strongly log-concave compared to the target Gaussian measure, then mutual information is always a convex function of time. We also prove that if the initial distribution is either bounded or has finite fourth moment and Fisher information, then mutual information is eventually convex. Finally, we provide counterexamples to show that mutual information can be nonconvex at small time.

Submitted 31 July, 2018; v1 submitted 3 May, 2018; originally announced May 2018.

Comments: 12 pages, 1 figure. To appear at the International Symposium on Information Theory and Its Applications (ISITA), October 2018

arXiv:1802.08089 [pdf, ps, other]

Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem

Authors: Andre Wibisono

Abstract: We study sampling as optimization in the space of measures. We focus on gradient flow-based optimization with the Langevin dynamics as a case study. We investigate the source of the bias of the unadjusted Langevin algorithm (ULA) in discrete time, and consider how to remove or reduce the bias. We point out the difficulty is that the heat flow is exactly solvable, but neither its forward nor backward method is implementable in general, except for Gaussian data. We propose the symmetrized Langevin algorithm (SLA), which should have a smaller bias than ULA, at the price of implementing a proximal gradient step in space. We show SLA is in fact consistent for Gaussian target measure, whereas ULA is not. We also illustrate various algorithms explicitly for Gaussian target measure, including gradient descent, proximal gradient, and Forward-Backward, and show they are all consistent.

Submitted 6 June, 2018; v1 submitted 22 February, 2018; originally announced February 2018.

Comments: To appear at the Conference on Learning Theory (COLT), July 2018

arXiv:1801.06968 [pdf, ps, other]

Convexity of mutual information along the heat flow

Authors: Andre Wibisono, Varun Jog

Abstract: We study the convexity of mutual information along the evolution of the heat equation. We prove that if the initial distribution is log-concave, then mutual information is always a convex function of time. We also prove that if the initial distribution is either bounded, or has finite fourth moment and Fisher information, then mutual information is eventually convex, i.e., convex for all large time. Finally, we provide counterexamples to show that mutual information can be nonconvex at small time.

Submitted 7 May, 2018; v1 submitted 22 January, 2018; originally announced January 2018.

Comments: 10 pages, 1 figure. To appear at the IEEE International Symposium on Information Theory (ISIT), June 2018

arXiv:1702.03656 [pdf, ps, other]

Information and estimation in Fokker-Planck channels

Authors: Andre Wibisono, Varun Jog, Po-Ling Loh

Abstract: We study the relationship between information- and estimation-theoretic quantities in time-evolving systems. We focus on the Fokker-Planck channel defined by a general stochastic differential equation, and show that the time derivatives of entropy, KL divergence, and mutual information are characterized by estimation-theoretic quantities involving an appropriate generalization of the Fisher information. Our results vastly extend De Bruijn's identity and the classical I-MMSE relation.

Submitted 13 February, 2017; originally announced February 2017.

arXiv:1603.04245 [pdf, ps, other]

doi 10.1073/pnas.1614734113

A Variational Perspective on Accelerated Methods in Optimization

Authors: Andre Wibisono, Ashia C. Wilson, Michael I. Jordan

Abstract: Accelerated gradient methods play a central role in optimization, achieving optimal rates in many settings. While many generalizations and extensions of Nesterov's original acceleration method have been proposed, it is not yet clear what is the natural scope of the acceleration concept. In this paper, we study accelerated methods from a continuous-time perspective. We show that there is a Lagrangian functional that we call the \emph{Bregman Lagrangian} which generates a large class of accelerated methods in continuous time, including (but not limited to) accelerated gradient descent, its non-Euclidean extension, and accelerated higher-order gradient methods. We show that the continuous-time limit of all of these methods corresponds to traveling the same curve in spacetime at different speeds. From this perspective, Nesterov's technique and many of its generalizations can be viewed as a systematic way to go from the continuous-time curves generated by the Bregman Lagrangian to a family of discrete-time accelerated algorithms.

Submitted 14 March, 2016; originally announced March 2016.

Comments: 38 pages. Subsumes an earlier working draft arXiv:1509.03616

arXiv:1312.2139 [pdf, ps, other]

Optimal rates for zero-order convex optimization: the power of two function evaluations

Authors: John C. Duchi, Michael I. Jordan, Martin J. Wainwright, Andre Wibisono

Abstract: We consider derivative-free algorithms for stochastic and non-stochastic convex optimization problems that use only function values rather than gradients. Focusing on non-asymptotic bounds on convergence rates, we show that if pairs of function values are available, algorithms for $d$-dimensional optimization that use gradient estimates based on random perturbations suffer a factor of at most $\sqrt{d}$ in convergence rate over traditional stochastic gradient methods. We establish such results for both smooth and non-smooth cases, sharpening previous analyses that suggested a worse dimension dependence, and extend our results to the case of multiple ($m \ge 2$) evaluations. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, establishing the sharpness of our achievable results up to constant (sometimes logarithmic) factors.

Submitted 20 August, 2014; v1 submitted 7 December, 2013; originally announced December 2013.

Comments: 34 pages

arXiv:1307.6769 [pdf, other]

Streaming Variational Bayes

Authors: Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, Michael I. Jordan

Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data---a case where SVI may be applied---and in the streaming setting, where SVI does not apply.

Submitted 20 November, 2013; v1 submitted 25 July, 2013; originally announced July 2013.

Comments: 25 pages, 3 figures, 1 table

arXiv:1210.4251 [pdf]

Performance Analysis Cluster and GPU Computing Environment on Molecular Dynamic Simulation of BRV-1 and REM2 with GROMACS

Authors: Heru Suhartanto, Arry Yanuar, Ari Wibisono

Abstract: One application that needs high-performance computing resources is molecular dynamics. Several software packages perform molecular dynamics simulations; one of these is the well-known GROMACS. Our previous experiments simulating the molecular dynamics of Indonesian-grown herbal compounds showed sufficient speedup on a 32-node cluster computing environment. To obtain a reliable simulation, one usually needs to run the experiment on the scale of hundreds of nodes, but this is expensive to develop and maintain. Since the invention of Graphical Processing Units that are also useful for general programming, many applications have been developed to run on them. This paper reports our experiments evaluating the performance of GROMACS running on two different environments: cluster computing resources and GPU-based PCs. We run the experiments on the BRV-1 and REM2 compounds. Four different GPUs are installed on the same type of quad-core PCs: GeForce GTS 250, GTX 465, GTX 470, and Quadro 4000. We build a cluster of 16 nodes based on these four quad-core PCs. The preliminary experiment shows that runs on the GTX 470 perform best among the GPU types and also compared with the cluster computing resource. A speedup of around 11 and 12 is gained, while the cost of a computer with a GPU is only about 25 percent that of the cluster we built.

Submitted 16 October, 2012; originally announced October 2012.

Comments: 5 pages, 1 figure, 5 tables

Journal ref: Int. J. Comp. Sci. Issue (2011), Vol. 8, Issue 4, No 2, p131-135

arXiv:1202.2585 [pdf, ps, other]

Minimax Option Pricing Meets Black-Scholes in the Limit

Authors: Jacob Abernethy, Rafael M. Frongillo, Andre Wibisono

Abstract: Option contracts are a type of financial derivative that allow investors to hedge risk and speculate on the variation of an asset's future market price. In short, an option has a particular payout that is based on the market price for an asset on a given date in the future. In 1973, Black and Scholes proposed a valuation model for options that essentially estimates the tail risk of the asset price under the assumption that the price will fluctuate according to geometric Brownian motion. More recently, DeMarzo et al., among others, have proposed more robust valuation schemes, where we can even assume an adversary chooses the price fluctuations. This framework can be considered as a sequential two-player zero-sum game between the investor and Nature. We analyze the value of this game in the limit, where the investor can trade at smaller and smaller time intervals. Under weak assumptions on the actions of Nature (an adversary), we show that the minimax option price asymptotically approaches exactly the Black-Scholes valuation. The key piece of our analysis is showing that Nature's minimax optimal dual strategy converges to geometric Brownian motion in the limit.

Submitted 12 February, 2012; originally announced February 2012.

Comments: 19 pages

Showing 1–27 of 27 results for author: Wibisono, A