Search | arXiv e-print repository

An Equilibrium Dynamic Traffic Assignment Model with Linear Programming Formulation

Authors: Victoria Guseva, Ilya Sklonin, Irina Podlipnova, Demyan Yarmoshik, Alexander Gasnikov

Abstract: In this paper, we consider a dynamic equilibrium transportation problem. There is a fixed number of cars moving from origin to destination areas. Preferences for arrival times are expressed as a cost of arriving before or after the preferred time at the destination. Each driver aims to minimize the time spent during the trip, making the time spent a measure of cost. The chosen routes and departure times impact the network loading. The goal is to find an equilibrium distribution across departure times and routes. For a relatively simplified transportation model we show that an equilibrium traffic distribution can be found as a solution to a linear program. In earlier works linear programming formulations were only obtained for social optimum dynamic traffic assignment problems. We also discuss algorithmic approaches for solving the equilibrium problem using time-expanded networks.

Submitted 30 August, 2024; originally announced August 2024.

arXiv:2408.05606 [pdf, other]

Exploring Applications of State Space Models and Advanced Training Techniques in Sequential Recommendations: A Comparative Study on Efficiency and Performance

Authors: Mark Obozov, Makar Baderko, Stepan Kulibaba, Nikolay Kutuzov, Alexander Gasnikov

Abstract: Recommender systems aim to estimate the dynamically changing user preferences and sequential dependencies between historical user behaviour and metadata. Although transformer-based models have proven to be effective in sequential recommendations, their state growth is proportional to the length of the sequence that is being processed, which makes them expensive in terms of memory and inference costs. Our research focused on three promising directions in sequential recommendations: enhancing speed through the use of State Space Models (SSM), as they can achieve SOTA results in the sequential recommendations domain with lower latency, memory, and inference costs, as proposed by arXiv:2403.03900; improving the quality of recommendations with Large Language Models (LLMs) via Monolithic Preference Optimization without Reference Model (ORPO); and implementing adaptive batch- and step-size algorithms to reduce costs and accelerate training processes.

Submitted 10 August, 2024; originally announced August 2024.

Comments: arXiv admin note: text overlap with arXiv:2403.07691 by other authors

arXiv:2406.04443 [pdf, other]

Gradient Clipping Improves AdaGrad when the Noise Is Heavy-Tailed

Authors: Savelii Chezhegov, Yaroslav Klyukin, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáč, Eduard Gorbunov

Abstract: Methods with adaptive stepsizes, such as AdaGrad and Adam, are essential for training modern Deep Learning models, especially Large Language Models. Typically, the noise in the stochastic gradients is heavy-tailed for the latter ones. Gradient clipping provably helps to achieve good high-probability convergence for such noises. However, despite the similarity between AdaGrad/Adam and Clip-SGD, the high-probability convergence of AdaGrad/Adam has not been studied in this case. In this work, we prove that AdaGrad (and its delayed version) can have provably bad high-probability convergence if the noise is heavy-tailed. To fix this issue, we propose a new version of AdaGrad called Clip-RAdaGradD (Clipped Reweighted AdaGrad with Delay) and prove its high-probability convergence bounds with polylogarithmic dependence on the confidence level for smooth convex/non-convex stochastic optimization with heavy-tailed noise. Our empirical evaluations, including NLP model fine-tuning, highlight the superiority of clipped versions of AdaGrad/Adam in handling the heavy-tailed noise.

Submitted 6 June, 2024; originally announced June 2024.

Comments: 37 pages, 8 figures
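
The gradient clipping that the abstract above refers to can be illustrated with a minimal sketch. This is the standard norm-clipping operator combined with a plain SGD step, not the authors' Clip-RAdaGradD method; the function names `clip` and `clipped_sgd_step` and the parameters `lam` (clipping level) and `step` (step size) are hypothetical names chosen for illustration.

```python
import math


def clip(g, lam):
    """Norm clipping: return min(1, lam / ||g||) * g for a list of floats g."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm <= lam:
        # The gradient is already inside the ball of radius lam: leave it as is.
        return list(g)
    scale = lam / norm
    return [scale * x for x in g]


def clipped_sgd_step(x, g, step, lam):
    """One SGD step that uses the clipped stochastic gradient (illustrative only)."""
    clipped = clip(g, lam)
    return [xi - step * gi for xi, gi in zip(x, clipped)]
```

Clipping bounds the norm of every update by `step * lam`, so a single heavy-tailed gradient sample cannot move the iterate arbitrarily far.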

arXiv:2406.00846 [pdf, other]

Local Methods with Adaptivity via Scaling

Authors: Savelii Chezhegov, Sergey Skorik, Nikolas Khachaturov, Danil Shalagin, Aram Avetisyan, Aleksandr Beznosikov, Martin Takáč, Yaroslav Kholodov, Alexander Gasnikov

Abstract: The rapid development of machine learning and deep learning has introduced increasingly complex optimization challenges that must be addressed. Indeed, training modern, advanced models has become difficult to implement without leveraging multiple computing nodes in a distributed environment. Distributed optimization is also fundamental to emerging fields such as federated learning. Specifically, there is a need to organize the training process to minimize the time lost due to communication. A widely used and extensively researched technique to mitigate the communication bottleneck involves performing local training before communication. This approach is the focus of our paper. Concurrently, adaptive methods that incorporate scaling, notably led by Adam, have gained significant popularity in recent years. Therefore, this paper aims to merge the local training technique with the adaptive approach to develop efficient distributed learning methods. We consider the classical Local SGD method and enhance it with a scaling feature. A crucial aspect is that the scaling is described generically, allowing us to analyze various approaches, including Adam, RMSProp, and OASIS, in a unified manner. In addition to theoretical analysis, we validate the performance of our methods in practice by training a neural network.

Submitted 12 June, 2024; v1 submitted 2 June, 2024; originally announced June 2024.

Comments: 41 pages, 2 algorithms, 6 figures, 1 table

arXiv:2405.18031 [pdf, other]

Lower Bounds and Optimal Algorithms for Non-Smooth Convex Decentralized Optimization over Time-Varying Networks

Authors: Dmitry Kovalev, Ekaterina Borodich, Alexander Gasnikov, Dmitrii Feoktistov

Abstract: We consider the task of minimizing the sum of convex functions stored in a decentralized manner across the nodes of a communication network. This problem is relatively well-studied in the scenario when the objective functions are smooth, or the links of the network are fixed in time, or both. In particular, lower bounds on the number of decentralized communications and (sub)gradient computations required to solve the problem have been established, along with matching optimal algorithms. However, the remaining and most challenging setting of non-smooth decentralized optimization over time-varying networks is largely underexplored, as neither lower bounds nor optimal algorithms are known in the literature. We resolve this fundamental gap with the following contributions: (i) we establish the first lower bounds on the communication and subgradient computation complexities of solving non-smooth convex decentralized optimization problems over time-varying networks; (ii) we develop the first optimal algorithm that matches these lower bounds and offers substantially improved theoretical performance compared to the existing state of the art.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2404.03323 [pdf, other]

Sparse Concept Bottleneck Models: Gumbel Tricks in Contrastive Learning

Authors: Andrei Semenov, Vladimir Ivanov, Aleksandr Beznosikov, Alexander Gasnikov

Abstract: We propose a novel architecture and method of explainable classification with Concept Bottleneck Models (CBMs). While SOTA approaches to the image classification task work as black boxes, there is a growing demand for models that provide interpretable results. Such models often learn to predict the distribution over class labels using an additional description of the target instances, called concepts. However, existing bottleneck methods have a number of limitations: their accuracy is lower than that of a standard model, and CBMs require an additional set of concepts to leverage. We provide a framework for creating a Concept Bottleneck Model from a pre-trained multi-modal encoder and new CLIP-like architectures. By introducing a new type of layer known as Concept Bottleneck Layers, we outline three methods for training them: with $\ell_1$-loss, contrastive loss, and a loss function based on the Gumbel-Softmax distribution (Sparse-CBM), while the final FC layer is still trained with Cross-Entropy. We show a significant increase in accuracy using sparse hidden layers in CLIP-based bottleneck models, which means that a sparse representation of the concept activation vector is meaningful in Concept Bottleneck Models. Moreover, with our Concept Matrix Search algorithm we can improve CLIP predictions on complex datasets without any additional training or fine-tuning. The code is available at: https://github.com/Andron00e/SparseCBM.

Submitted 4 April, 2024; originally announced April 2024.

Comments: 23 pages, 1 algorithm, 36 figures

MSC Class: I.2.6; I.2.10; I.4.10; I.5.1; I.5.4; I.5.5 ACM Class: I.2.6; I.2.10; I.4.10; I.5.1; I.5.4; I.5.5

arXiv:2403.13117 [pdf, other]

Optimal Flow Matching: Learning Straight Trajectories in Just One Step

Authors: Nikita Kornilov, Petr Mokrov, Alexander Gasnikov, Alexander Korotin

Abstract: Over the past several years, there has been a boom in the development of Flow Matching (FM) methods for generative modeling. One intriguing property pursued by the community is the ability to learn flows with straight trajectories which realize the Optimal Transport (OT) displacements. Straightness is crucial for the fast integration (inference) of the learned flow's paths. Unfortunately, most existing flow straightening methods are based on non-trivial iterative FM procedures which accumulate the error during training or exploit heuristics based on minibatch OT. To address these issues, we develop and theoretically justify the novel Optimal Flow Matching approach which allows recovering the straight OT displacement for the quadratic transport in just one FM step. The main idea of our approach is the employment of vector fields for FM which are parameterized by convex functions.

Submitted 25 May, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

arXiv:2402.05264 [pdf, other]

AdaBatchGrad: Combining Adaptive Batch Size and Adaptive Step Size

Authors: Petr Ostroukhov, Aigerim Zhumabayeva, Chulu Xiang, Alexander Gasnikov, Martin Takáč, Dmitry Kamzolov

Abstract: This paper presents a novel adaptation of the Stochastic Gradient Descent (SGD), termed AdaBatchGrad. This modification seamlessly integrates an adaptive step size with an adjustable batch size. An increase in batch size and a decrease in step size are well-known techniques to tighten the area of convergence of SGD and decrease its variance. A range of studies by R. Byrd and J. Nocedal introduced various testing techniques to assess the quality of mini-batch gradient approximations and choose the appropriate batch sizes at every step. Methods that utilized exact tests were observed to converge within $O(LR^2/\varepsilon)$ iterations. Conversely, inexact test implementations sometimes resulted in non-convergence and erratic performance. To address these challenges, AdaBatchGrad incorporates both adaptive batch and step sizes, enhancing the method's robustness and stability. For exact tests, our approach converges in $O(LR^2/\varepsilon)$ iterations, analogous to standard gradient descent. For inexact tests, it achieves convergence in $O(\max\lbrace LR^2/\varepsilon, σ^2 R^2/\varepsilon^2 \rbrace )$ iterations. This makes AdaBatchGrad markedly more robust and computationally efficient relative to prevailing methods. To substantiate the efficacy of our method, we experimentally show how the introduction of adaptive step size and adaptive batch size gradually improves the performance of regular SGD. The results imply that AdaBatchGrad surpasses alternative methods, especially when applied to inexact tests.

Submitted 7 February, 2024; originally announced February 2024.

arXiv:2401.07809 [pdf, other]

doi 10.1134/S1064562423701600

Optimal Data Splitting in Distributed Optimization for Machine Learning

Authors: Daniil Medyakov, Gleb Molodtsov, Aleksandr Beznosikov, Alexander Gasnikov

Abstract: The distributed optimization problem has become increasingly relevant recently. It has a lot of advantages such as processing a large amount of data in less time compared to non-distributed methods. However, most distributed approaches suffer from a significant bottleneck: the cost of communications. Therefore, a large amount of research has recently been directed at solving this problem. One such approach uses local data similarity. In particular, there exists an algorithm provably optimally exploiting the similarity property. But this result, as well as results from other works, addresses the communication bottleneck by focusing only on the fact that communication is significantly more expensive than local computing, and does not take into account the various capacities of network devices and the different relationships between communication time and local computing expenses. We consider this setup, and the objective of this study is to achieve an optimal ratio of distributed data between the server and local machines for any costs of communications and local computations. The running times of the network are compared between uniform and optimal distributions. The superior theoretical performance of our solutions is experimentally validated.

Submitted 26 March, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

Comments: 17 pages, 2 figures

arXiv:2401.07788 [pdf, other]

doi 10.1134/S1064562423701314

Activations and Gradients Compression for Model-Parallel Training

Authors: Mikhail Rudakov, Aleksandr Beznosikov, Yaroslav Kholodov, Alexander Gasnikov

Abstract: Large neural networks require enormous computational clusters of machines. Model-parallel training, when the model architecture is partitioned sequentially between workers, is a popular approach for training modern models. Information compression can be applied to decrease workers' communication time, as it is often a bottleneck in such systems. This work explores how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence. We analyze compression methods such as quantization and TopK compression, and also experiment with error compensation techniques. Moreover, we employ TopK with the AQ-SGD per-batch error feedback approach. We conduct experiments on image classification and language model fine-tuning tasks. Our findings demonstrate that gradients require milder compression rates than activations. We observe that $K=10\%$ is the lowest TopK compression level which does not harm model convergence severely. Experiments also show that models trained with TopK perform well only when compression is also applied during inference. We find that error feedback techniques do not improve model-parallel training compared to plain compression, but allow model inference without compression with almost no quality drop. Finally, when applied with the AQ-SGD approach, TopK stronger than with $K=30\%$ worsens model performance significantly.

Submitted 26 March, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

Comments: 17 pages, 6 figures, 5 tables
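
The TopK compression discussed in the abstract above can be sketched as follows. This is a generic magnitude-based TopK operator on a flat list of floats, written under the assumption that $K$ denotes the fraction of entries kept; it is not the authors' implementation, and the name `top_k_compress` and its `fraction` parameter are hypothetical.

```python
import math


def top_k_compress(v, fraction):
    """Keep the ceil(fraction * len(v)) largest-magnitude entries of v, zero the rest."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must lie in (0, 1]")
    k = max(1, math.ceil(fraction * len(v)))
    # Indices sorted by decreasing absolute value; the first k survive.
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]
```

For example, `top_k_compress([0.1, -5.0, 2.0, 0.3], 0.5)` keeps the two largest-magnitude entries, `-5.0` and `2.0`, and zeroes the other two. In a real system only the surviving values and their indices would be transmitted.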

arXiv:2311.04161 [pdf, other]

Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems

Authors: Nikita Puchkin, Eduard Gorbunov, Nikolay Kutuzov, Alexander Gasnikov

Abstract: We consider stochastic optimization problems with heavy-tailed noise with structured density. For such problems, we show that it is possible to get faster rates of convergence than $\mathcal{O}(K^{-2(α- 1)/α})$, when the stochastic gradients have finite moments of order $α\in (1, 2]$. In particular, our analysis allows the noise norm to have an unbounded expectation. To achieve these results, we stabilize stochastic gradients, using smoothed medians of means. We prove that the resulting estimates have negligible bias and controllable variance. This allows us to carefully incorporate them into clipped-SGD and clipped-SSTM and derive new high-probability complexity bounds in the considered setup.

Submitted 17 April, 2024; v1 submitted 7 November, 2023; originally announced November 2023.

Comments: AISTATS 2024. 60 pages, 3 figures. Changes in V2: small typos were fixed, extra experiments and discussion were added. Code: https://github.com/Kutuz4/AISTATS2024_SMoM
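
As background for the estimator named in the abstract above, here is a minimal sketch of the classical (unsmoothed) median-of-means estimator for scalar samples. It is a simpler stand-in, not the smoothed variant the paper analyzes; the function name `median_of_means` and the parameter `num_blocks` are hypothetical.

```python
import statistics


def median_of_means(samples, num_blocks):
    """Split samples into num_blocks equal blocks, average each, return the median."""
    if num_blocks < 1 or num_blocks > len(samples):
        raise ValueError("num_blocks must be between 1 and len(samples)")
    block_size = len(samples) // num_blocks
    # Trailing samples that do not fill a complete block are ignored.
    means = [
        sum(samples[b * block_size:(b + 1) * block_size]) / block_size
        for b in range(num_blocks)
    ]
    return statistics.median(means)
```

With `samples = [1.0, 1.0, 1.0, 1.0, 1.0, 1000.0]` and `num_blocks = 3`, the block means are 1.0, 1.0 and 500.5, so the estimate is 1.0, whereas the plain sample mean is pulled to 167.5 by the single outlier.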

arXiv:2310.01860 [pdf, other]

High-Probability Convergence for Composite and Distributed Stochastic Minimization and Variational Inequalities with Heavy-Tailed Noise

Authors: Eduard Gorbunov, Abdurakhmon Sadiev, Marina Danilova, Samuel Horváth, Gauthier Gidel, Pavel Dvurechensky, Alexander Gasnikov, Peter Richtárik

Abstract: High-probability analysis of stochastic first-order optimization methods under mild assumptions on the noise has been gaining a lot of attention in recent years. Typically, gradient clipping is one of the key algorithmic ingredients to derive good high-probability guarantees when the noise is heavy-tailed. However, if implemented naïvely, clipping can spoil the convergence of the popular methods for composite and distributed optimization (Prox-SGD/Parallel SGD) even in the absence of any noise. Due to this reason, many works on high-probability analysis consider only unconstrained non-distributed problems, and the existing results for composite/distributed problems do not include some important special cases (like strongly convex problems) and are not optimal. To address this issue, we propose new stochastic methods for composite and distributed optimization based on the clipping of stochastic gradient differences and prove tight high-probability convergence results (including nearly optimal ones) for the new methods. Using similar ideas, we also develop new methods for composite and distributed variational inequalities and analyze the high-probability convergence of these methods.

Submitted 24 July, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

Comments: ICML 2024; changes in version 2: minor corrections (typos were fixed and the structure was modified)

arXiv:2305.15938 [pdf, ps, other]

First Order Methods with Markovian Noise: from Acceleration to Variational Inequalities

Authors: Aleksandr Beznosikov, Sergey Samsonov, Marina Sheshukova, Alexander Gasnikov, Alexey Naumov, Eric Moulines

Abstract: This paper delves into stochastic optimization problems that involve Markovian noise. We present a unified approach for the theoretical analysis of first-order gradient methods for stochastic optimization and variational inequalities. Our approach covers scenarios for both non-convex and strongly convex minimization problems. To achieve an optimal (linear) dependence on the mixing time of the underlying noise sequence, we use the randomized batching scheme, which is based on the multilevel Monte Carlo method. Moreover, our technique allows us to eliminate the limiting assumptions of previous research on Markov noise, such as the need for a bounded domain and uniformly bounded stochastic gradients. Our extension to variational inequalities under Markovian noise is original. Additionally, we provide lower bounds that match the oracle complexity of our method in the case of strongly convex optimization problems.

Submitted 30 March, 2024; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: Appears in: Advances in Neural Information Processing Systems 36 (NeurIPS 2023). 41 pages, 3 algorithms, 2 tables

Journal ref: https://proceedings.neurips.cc/paper_files/paper/2023/hash/8c3e38ce55a0fa44bc325bc6fdb7f4e5-Abstract-Conference.html

arXiv:2305.06743 [pdf, other]

Implicitly normalized forecaster with clipping for linear and non-linear heavy-tailed multi-armed bandits

Authors: Yuriy Dorn, Nikita Kornilov, Nikolay Kutuzov, Alexander Nazin, Eduard Gorbunov, Alexander Gasnikov

Abstract: The Implicitly Normalized Forecaster (INF) algorithm is considered to be an optimal solution for adversarial multi-armed bandit (MAB) problems. However, most of the existing complexity results for INF rely on restrictive assumptions, such as bounded rewards. Recently, a related algorithm was proposed that works for both adversarial and stochastic heavy-tailed MAB settings. However, this algorithm fails to fully exploit the available data. In this paper, we propose a new version of INF called the Implicitly Normalized Forecaster with clipping (INF-clip) for MAB problems with heavy-tailed reward distributions. We establish convergence results under mild assumptions on the rewards distribution and demonstrate that INF-clip is optimal for linear heavy-tailed stochastic MAB problems and works well for non-linear ones. Furthermore, we show that INF-clip outperforms the best-of-both-worlds algorithm in cases where it is difficult to distinguish between different arms.

Submitted 26 December, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

arXiv:2302.07615 [pdf, other]

Similarity, Compression and Local Steps: Three Pillars of Efficient Communications for Distributed Variational Inequalities

Authors: Aleksandr Beznosikov, Martin Takáč, Alexander Gasnikov

Abstract: Variational inequalities are a broad and flexible class of problems that includes minimization, saddle point, and fixed point problems as special cases. Therefore, variational inequalities are used in various applications ranging from equilibrium search to adversarial learning. With the increasing size of data and models, today's instances demand parallel and distributed computing for real-world machine learning problems, most of which can be represented as variational inequalities. Meanwhile, most distributed approaches have a significant bottleneck - the cost of communications. The three main techniques to reduce the total number of communication rounds and the cost of one such round are the similarity of local functions, compression of transmitted information, and local updates. In this paper, we combine all these approaches. Such a triple synergy did not exist before for variational inequalities and saddle problems, nor even for minimization problems. The methods presented in this paper have the best theoretical guarantees of communication complexity and are significantly ahead of other methods for distributed variational inequalities. The theoretical results are confirmed by adversarial learning experiments on synthetic and real datasets.

Submitted 30 March, 2024; v1 submitted 15 February, 2023; originally announced February 2023.

Comments: Appears in: Advances in Neural Information Processing Systems 36 (NeurIPS 2023) (https://proceedings.neurips.cc/paper_files/paper/2023/hash/5b4a459db23e6db9be2a128380953d96-Abstract-Conference.html). 36 pages, 3 algorithms, 1 figure, 1 table

arXiv:2302.00999 [pdf, ps, other]

High-Probability Bounds for Stochastic Optimization and Variational Inequalities: the Case of Unbounded Variance

Authors: Abdurakhmon Sadiev, Marina Danilova, Eduard Gorbunov, Samuel Horváth, Gauthier Gidel, Pavel Dvurechensky, Alexander Gasnikov, Peter Richtárik

Abstract: During recent years the interest of optimization and machine learning communities in high-probability convergence of stochastic optimization methods has been growing. One of the main reasons for this is that high-probability complexity bounds are more accurate and less studied than in-expectation ones. However, SOTA high-probability non-asymptotic convergence results are derived under strong assumptions such as the boundedness of the gradient noise variance or of the objective's gradient itself. In this paper, we propose several algorithms with high-probability convergence results under less restrictive assumptions. In particular, we derive new high-probability convergence results under the assumption that the gradient/operator noise has bounded central $α$-th moment for $α\in (1,2]$ in the following setups: (i) smooth non-convex / Polyak-Lojasiewicz / convex / strongly convex / quasi-strongly convex minimization problems, (ii) Lipschitz / star-cocoercive and monotone / quasi-strongly monotone variational inequalities. These results justify the usage of the considered methods for solving problems that do not fit standard functional classes studied in stochastic optimization.

Submitted 18 July, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

Comments: ICML 2023. 86 pages. Changes in v2: ICML formatting was applied along with minor edits of the text

arXiv:2212.14439 [pdf, other]

An Optimal Algorithm for Strongly Convex Min-min Optimization

Authors: Alexander Gasnikov, Dmitry Kovalev, Grigory Malinovsky

Abstract: In this paper we study the smooth strongly convex minimization problem $\min_{x}\min_y f(x,y)$. The existing optimal first-order methods require $\mathcal{O}(\sqrt{\max\{κ_x,κ_y\}} \log 1/ε)$ computations of both $\nabla_x f(x,y)$ and $\nabla_y f(x,y)$, where $κ_x$ and $κ_y$ are condition numbers with respect to variable blocks $x$ and $y$. We propose a new algorithm that only requires $\mathcal{O}(\sqrt{κ_x} \log 1/ε)$ computations of $\nabla_x f(x,y)$ and $\mathcal{O}(\sqrt{κ_y} \log 1/ε)$ computations of $\nabla_y f(x,y)$. In some applications $κ_x \gg κ_y$, and computation of $\nabla_y f(x,y)$ is significantly cheaper than computation of $\nabla_x f(x,y)$. In this case, our algorithm substantially outperforms the existing state-of-the-art methods.

Submitted 8 February, 2023; v1 submitted 29 December, 2022; originally announced December 2022.

Comments: 12 pages, 2 figures, 1 algorithm

arXiv:2210.05994 [pdf, other]

doi 10.1007/978-3-031-31654-8_3

SARAH-based Variance-reduced Algorithm for Stochastic Finite-sum Cocoercive Variational Inequalities

Authors: Aleksandr Beznosikov, Alexander Gasnikov

Abstract: Variational inequalities are a broad formalism that encompasses a vast number of applications. Motivated by applications in machine learning and beyond, stochastic methods are of great importance. In this paper we consider the problem of stochastic finite-sum cocoercive variational inequalities. For this class of problems, we investigate the convergence of the method based on the SARAH variance reduction technique. We show that for strongly monotone problems it is possible to achieve linear convergence to a solution using this method. Experiments confirm the importance and practical applicability of our approach.

Submitted 12 October, 2022; originally announced October 2022.

Comments: 11 pages, 1 algorithm, 1 figure, 1 theorem

arXiv:2208.13592 [pdf, ps, other]

doi 10.4171/MAG/112

Smooth Monotone Stochastic Variational Inequalities and Saddle Point Problems: A Survey

Authors: Aleksandr Beznosikov, Boris Polyak, Eduard Gorbunov, Dmitry Kovalev, Alexander Gasnikov

Abstract: This paper is a survey of methods for solving smooth (strongly) monotone stochastic variational inequalities. To begin with, we give the deterministic foundation from which the stochastic methods eventually evolved. Then we review methods for the general stochastic formulation, and look at the finite sum setup. The last parts of the paper are devoted to various recent (not necessarily stochastic) advances in algorithms for variational inequalities.

Submitted 2 April, 2023; v1 submitted 29 August, 2022; originally announced August 2022.

Comments: 12 pages

arXiv:2206.09446 [pdf, other]

doi 10.1007/978-3-031-22543-7_11

Compression and Data Similarity: Combination of Two Techniques for Communication-Efficient Solving of Distributed Variational Inequalities

Authors: Aleksandr Beznosikov, Alexander Gasnikov

Abstract: Variational inequalities are an important tool that covers minimization, saddle point problems, games, and fixed-point problems. Modern large-scale and computationally expensive practical applications make distributed methods for solving these problems popular. Meanwhile, most distributed systems share a basic problem: a communication bottleneck. There are various techniques to deal with it. In particular, in this paper we consider a combination of two popular approaches: compression and data similarity. We show that this synergy can be more effective than each of the approaches separately in solving distributed smooth strongly monotone variational inequalities. Experiments confirm the theoretical conclusions.

Submitted 3 September, 2022; v1 submitted 19 June, 2022; originally announced June 2022.

Comments: v2: minor changes. 19 pages, 1 algorithm, 1 figure, 1 table, 1 theorem

arXiv:2206.08303 [pdf, other]

On Scaled Methods for Saddle Point Problems

Authors: Aleksandr Beznosikov, Aibek Alanov, Dmitry Kovalev, Martin Takáč, Alexander Gasnikov

Abstract: Methods with adaptive scaling of different features play a key role in solving saddle point problems, primarily due to Adam's popularity for solving adversarial machine learning problems, including GAN training. This paper carries out a theoretical analysis of the following scaling techniques for solving SPPs: the well-known Adam and RMSProp scaling and the newer AdaHessian and OASIS based on the Hutchinson approximation. We use the Extra Gradient and its improved version with negative momentum as the basic method. Experimental studies on GANs show good applicability not only for Adam, but also for other less popular methods.

Submitted 21 June, 2023; v1 submitted 16 June, 2022; originally announced June 2022.

Comments: 54 pages, 2 algorithms with 4 options for each, 12 figures, 5 tables, 2 theorems
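For readers unfamiliar with the Extra Gradient base method named in the abstract above, the following is a minimal sketch of the plain, unscaled extragradient step on the toy bilinear saddle point problem $\min_x \max_y xy$. The toy objective, the step size, and the function names `operator`, `extragradient`, and `gda` are illustrative assumptions; this is not the scaled method analyzed in the paper.

```python
def operator(x, y):
    """Saddle operator F(z) = (df/dx, -df/dy) for the toy objective f(x, y) = x * y."""
    return y, -x


def extragradient(x, y, step, iters):
    """Plain extragradient: extrapolate to a half point, then update from the original point."""
    for _ in range(iters):
        gx, gy = operator(x, y)
        # Extrapolation step to the intermediate point.
        xh, yh = x - step * gx, y - step * gy
        # Update step uses the operator evaluated at the intermediate point.
        gx, gy = operator(xh, yh)
        x, y = x - step * gx, y - step * gy
    return x, y


def gda(x, y, step, iters):
    """Simultaneous gradient descent-ascent, shown only for comparison."""
    for _ in range(iters):
        gx, gy = operator(x, y)
        x, y = x - step * gx, y - step * gy
    return x, y
```

On this toy problem the extragradient iterates contract toward the saddle point $(0, 0)$ for step sizes below 1, while simultaneous descent-ascent moves away from it for any positive step size, which is one common motivation for using extragradient as a base method.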

arXiv:2206.01666 [pdf, other]

Algorithm for Constrained Markov Decision Process with Linear Convergence

Authors: Egor Gladin, Maksim Lavrik-Karmazin, Karina Zainullina, Varvara Rudenko, Alexander Gasnikov, Martin Takáč

Abstract: The problem of constrained Markov decision process is considered. An agent aims to maximize the expected accumulated discounted reward subject to multiple constraints on its costs (the number of constraints is relatively small). A new dual approach is proposed with the integration of two ingredients: entropy regularized policy optimizer and Vaidya's dual optimizer, both of which are critical to achieving faster convergence. The finite-time error bound of the proposed approach is provided. Despite the challenge of the nonconcave objective subject to nonconcave constraints, the proposed approach is shown to converge (with linear rate) to the global optimum. The complexity expressed in terms of the optimality gap and the constraint violation significantly improves upon the existing primal-dual approaches.

Submitted 19 October, 2022; v1 submitted 3 June, 2022; originally announced June 2022.

Comments: 27 pages, 2 figures, 3 tables. Improved presentation of the material, added a table with results, stated contributions more clearly, changed article template

arXiv:2206.01095 [pdf, other]

Clipped Stochastic Methods for Variational Inequalities with Heavy-Tailed Noise

Authors: Eduard Gorbunov, Marina Danilova, David Dobre, Pavel Dvurechensky, Alexander Gasnikov, Gauthier Gidel

Abstract: Stochastic first-order methods such as Stochastic Extragradient (SEG) or Stochastic Gradient Descent-Ascent (SGDA) for solving smooth minimax problems and, more generally, variational inequality problems (VIP) have been gaining a lot of attention in recent years due to the growing popularity of adversarial formulations in machine learning. However, while high-probability convergence bounds are known to reflect the actual behavior of stochastic methods more accurately, most convergence results are provided in expectation. Moreover, the only known high-probability complexity results have been derived under restrictive sub-Gaussian (light-tailed) noise and bounded domain assumption [Juditsky et al., 2011]. In this work, we prove the first high-probability complexity results with logarithmic dependence on the confidence level for stochastic methods for solving monotone and structured non-monotone VIPs with non-sub-Gaussian (heavy-tailed) noise and unbounded domains. In the monotone case, our results match the best-known ones in the light-tails case [Juditsky et al., 2011], and are novel for structured non-monotone problems such as negative comonotone, quasi-strongly monotone, and/or star-cocoercive ones. We achieve these results by studying SEG and SGDA with clipping. In addition, we numerically validate that the gradient noise of many practical GAN formulations is heavy-tailed and show that clipping improves the performance of SEG/SGDA.

Submitted 1 November, 2022; v1 submitted 2 June, 2022; originally announced June 2022.

Comments: NeurIPS 2022. 74 pages, 18 figures. Changes in v2: a few typos were fixed, new experiments with clipped-SEG were added. Code: https://github.com/busycalibrating/clipped-stochastic-methods
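The clipping mentioned in the abstract above can be illustrated with the standard norm-clipping operator, which rescales a stochastic gradient or operator estimate whenever its Euclidean norm exceeds a chosen level. The sketch below is a generic illustration under that assumption; the function name `clip` and the parameter `level` are hypothetical and are not taken from the paper or its repository.

```python
import math


def clip(vec, level):
    """Return a copy of vec rescaled so that its Euclidean norm is at most `level`."""
    norm = math.sqrt(sum(v * v for v in vec))
    # Vectors already inside the ball of radius `level` are returned unchanged.
    if norm <= level:
        return list(vec)
    scale = level / norm
    return [v * scale for v in vec]
```

In a clipped stochastic method, each noisy estimate would be passed through an operator of this kind before the update, so that a single heavy-tailed sample cannot move the iterate arbitrarily far.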

arXiv:2205.15136 [pdf, other]

Optimal Gradient Sliding and its Application to Distributed Optimization Under Similarity

Authors: Dmitry Kovalev, Aleksandr Beznosikov, Ekaterina Borodich, Alexander Gasnikov, Gesualdo Scutari

Abstract: We study structured convex optimization problems, with additive objective $r:=p + q$, where $r$ is ($μ$-strongly) convex, $q$ is $L_q$-smooth and convex, and $p$ is $L_p$-smooth, possibly nonconvex. For such a class of problems, we propose an inexact accelerated gradient sliding method that can skip the gradient computation for one of these components while still achieving optimal complexity of gradient calls of $p$ and $q$, that is, $\mathcal{O}(\sqrt{L_p/μ})$ and $\mathcal{O}(\sqrt{L_q/μ})$, respectively. This result is much sharper than the classic black-box complexity $\mathcal{O}(\sqrt{(L_p+L_q)/μ})$, especially when the difference between $L_p$ and $L_q$ is large. We then apply the proposed method to solve distributed optimization problems over master-worker architectures, under agents' function similarity, due to statistical data similarity or otherwise. The distributed algorithm achieves for the first time lower complexity bounds on {\it both} communication and local gradient calls, with the former having been a long-standing open problem. Finally the method is extended to distributed saddle-problems (under function similarity) by means of solving a class of variational inequalities, achieving lower communication and computation complexity bounds.

Submitted 30 May, 2022; originally announced May 2022.

Comments: 24 pages, 2 new algorithms, 12 theorems, 2 figures

arXiv:2205.09647 [pdf, other]

The First Optimal Acceleration of High-Order Methods in Smooth Convex Optimization

Authors: Dmitry Kovalev, Alexander Gasnikov

Abstract: In this paper, we study the fundamental open question of finding the optimal high-order algorithm for solving smooth convex minimization problems. Arjevani et al. (2019) established the lower bound $Ω\left(ε^{-2/(3p+1)}\right)$ on the number of the $p$-th order oracle calls required by an algorithm to find an $ε$-accurate solution to the problem, where the $p$-th order oracle stands for the computation of the objective function value and the derivatives up to the order $p$. However, the existing state-of-the-art high-order methods of Gasnikov et al. (2019b); Bubeck et al. (2019); Jiang et al. (2019) achieve the oracle complexity $\mathcal{O}\left(ε^{-2/(3p+1)} \log (1/ε)\right)$, which does not match the lower bound. The reason for this is that these algorithms require performing a complex binary search procedure, which makes them neither optimal nor practical. We fix this fundamental issue by providing the first algorithm with $\mathcal{O}\left(ε^{-2/(3p+1)}\right)$ $p$-th order oracle complexity.

Submitted 19 May, 2022; originally announced May 2022.

arXiv:2205.05653 [pdf, other]

The First Optimal Algorithm for Smooth and Strongly-Convex-Strongly-Concave Minimax Optimization

Authors: Dmitry Kovalev, Alexander Gasnikov

Abstract: In this paper, we revisit the smooth and strongly-convex-strongly-concave minimax optimization problem. Zhang et al. (2021) and Ibrahim et al. (2020) established the lower bound $Ω\left(\sqrt{κ_xκ_y} \log \frac{1}ε\right)$ on the number of gradient evaluations required to find an $ε$-accurate solution, where $κ_x$ and $κ_y$ are condition numbers for the strong convexity and strong concavity assumptions. However, the existing state-of-the-art methods do not match this lower bound: algorithms of Lin et al. (2020) and Wang and Li (2020) have gradient evaluation complexity $\mathcal{O}\left( \sqrt{κ_xκ_y}\log^3\frac{1}ε\right)$ and $\mathcal{O}\left( \sqrt{κ_xκ_y}\log^3 (κ_xκ_y)\log\frac{1}ε\right)$, respectively. We fix this fundamental issue by providing the first algorithm with $\mathcal{O}\left(\sqrt{κ_xκ_y}\log\frac{1}ε\right)$ gradient evaluation complexity. We design our algorithm in three steps: (i) we reformulate the original problem as a minimization problem via the pointwise conjugate function; (ii) we apply a specific variant of the proximal point algorithm to the reformulated problem; (iii) we compute the proximal operator inexactly using the optimal algorithm for operator norm reduction in monotone inclusions.

Submitted 11 May, 2022; originally announced May 2022.

arXiv:2202.02771 [pdf, other]

Optimal Algorithms for Decentralized Stochastic Variational Inequalities

Authors: Dmitry Kovalev, Aleksandr Beznosikov, Abdurakhmon Sadiev, Michael Persiianov, Peter Richtárik, Alexander Gasnikov

Abstract: Variational inequalities are a formalism that includes games, minimization, saddle point, and equilibrium problems as special cases. Methods for variational inequalities are therefore universal approaches for many applied tasks, including machine learning problems. This work concentrates on the decentralized setting, which is increasingly important but not well understood. In particular, we consider decentralized stochastic (sum-type) variational inequalities over fixed and time-varying networks. We present lower complexity bounds for both communication and local iterations and construct optimal algorithms that match these lower bounds. Our algorithms are the best among the available literature not only in the decentralized stochastic case, but also in the decentralized deterministic and non-distributed stochastic cases. Experimental results confirm the effectiveness of the presented algorithms.

Submitted 2 April, 2023; v1 submitted 6 February, 2022; originally announced February 2022.

Comments: Appears in: Advances in Neural Information Processing Systems 35 (NeurIPS 2022). Minor modifications with respect to the NeurIPS version. 58 pages, 6 algorithms, 9 figures, 4 tables

Journal ref: https://proceedings.neurips.cc/paper_files/paper/2022/hash/c959bb2cb164d37569a17fa67494d69a-Abstract-Conference.html

arXiv:2112.15199 [pdf, ps, other]

Accelerated Primal-Dual Gradient Method for Smooth and Convex-Concave Saddle-Point Problems with Bilinear Coupling

Authors: Dmitry Kovalev, Alexander Gasnikov, Peter Richtárik

Abstract: In this paper we study the convex-concave saddle-point problem $\min_x \max_y f(x) + y^T \mathbf{A} x - g(y)$, where $f(x)$ and $g(y)$ are smooth and convex functions. We propose an Accelerated Primal-Dual Gradient Method (APDG) for solving this problem, achieving (i) an optimal linear convergence rate in the strongly-convex-strongly-concave regime, matching the lower complexity bound (Zhang et al., 2021), and (ii) an accelerated linear convergence rate in the case when only one of the functions $f(x)$ and $g(y)$ is strongly convex or even none of them are. Finally, we obtain a linearly convergent algorithm for the general smooth and convex-concave saddle point problem $\min_x \max_y F(x,y)$ without the requirement of strong convexity or strong concavity.

Submitted 9 March, 2022; v1 submitted 30 December, 2021; originally announced December 2021.

arXiv:2110.12347 [pdf, other]

Acceleration in Distributed Optimization under Similarity

Authors: Ye Tian, Gesualdo Scutari, Tianyu Cao, Alexander Gasnikov

Abstract: We study distributed (strongly convex) optimization problems over a network of agents, with no centralized nodes. The loss functions of the agents are assumed to be \textit{similar}, due to statistical data similarity or otherwise. In order to reduce the number of communications to reach a solution accuracy, we propose a {\it preconditioned, accelerated} distributed method. An $\varepsilon$-solution is achieved in $\tilde{\mathcal{O}}\big(\sqrt{\frac{β/μ}{1-ρ}}\log1/\varepsilon\big)$ communication steps, where $β/μ$ is the relative condition number between the global and local loss functions, and $ρ$ characterizes the connectivity of the network. This rate matches (up to poly-log factors) lower complexity communication bounds of distributed gossip-algorithms applied to the class of problems of interest. Numerical results show significant communication savings with respect to existing accelerated distributed schemes, especially when solving ill-conditioned problems.

Submitted 9 April, 2022; v1 submitted 24 October, 2021; originally announced October 2021.

arXiv:2110.03313 [pdf, other]

Distributed Methods with Compressed Communication for Solving Variational Inequalities, with Theoretical Guarantees

Authors: Aleksandr Beznosikov, Peter Richtárik, Michael Diskin, Max Ryabinin, Alexander Gasnikov

Abstract: Variational inequalities in general and saddle point problems in particular are increasingly relevant in machine learning applications, including adversarial learning, GANs, transport and robust optimization. With increasing data and problem sizes necessary to train high performing models across various applications, we need to rely on parallel and distributed computing. However, in distributed training, communication among the compute nodes is a key bottleneck during training, and this problem is exacerbated for high dimensional and over-parameterized models. Due to these considerations, it is important to equip existing methods with strategies that reduce the volume of transmitted information during training while obtaining a model of comparable quality. In this paper, we present the first theoretically grounded distributed methods for solving variational inequalities and saddle point problems using compressed communication: MASHA1 and MASHA2. Our theory and methods allow for the use of both unbiased (such as Rand$k$; MASHA1) and contractive (such as Top$k$; MASHA2) compressors. The new algorithms support bidirectional compression, and can also be modified for the stochastic setting with batches and for federated learning with partial participation of clients. We empirically validated our conclusions using two experimental setups: a standard bilinear min-max problem, and large-scale distributed adversarial training of transformers.

Submitted 2 April, 2023; v1 submitted 7 October, 2021; originally announced October 2021.

Comments: Appears in: Advances in Neural Information Processing Systems 35 (NeurIPS 2022). Minor modifications with respect to the NeurIPS version. 73 pages, 9 algorithms, 2 figures, 2 tables

Journal ref: https://proceedings.neurips.cc/paper_files/paper/2022/hash/5ac1428c23b5da5e66d029646ea3206d-Abstract-Conference.html

arXiv:2107.10706 [pdf, other]

Distributed Saddle-Point Problems Under Similarity

Authors: Aleksandr Beznosikov, Gesualdo Scutari, Alexander Rogozin, Alexander Gasnikov

Abstract: We study solution methods for (strongly-)convex-(strongly)-concave Saddle-Point Problems (SPPs) over two types of networks: master/workers (thus centralized) architectures and meshed (thus decentralized) networks. The local functions at each node are assumed to be similar, due to statistical data similarity or otherwise. We establish lower complexity bounds for a fairly general class of algorithms solving the SPP. We show that a given suboptimality $ε>0$ is achieved over master/workers networks in $Ω\big(Δ\cdot δ/μ\cdot \log (1/\varepsilon)\big)$ rounds of communications, where $δ>0$ measures the degree of similarity of the local functions, $μ$ is their strong convexity constant, and $Δ$ is the diameter of the network. The lower communication complexity bound over meshed networks reads $Ω\big(1/{\sqrtρ} \cdot δ/μ\cdot\log (1/\varepsilon)\big)$, where $ρ$ is the (normalized) eigengap of the gossip matrix used for the communication between neighbouring nodes. We then propose algorithms matching the lower bounds over either type of network (up to log-factors). We assess the effectiveness of the proposed algorithms on a robust logistic regression problem.

Submitted 22 August, 2022; v1 submitted 22 July, 2021; originally announced July 2021.

Comments: Appears in: Advances in Neural Information Processing Systems 34 (NeurIPS 2021). Minor modifications with respect to the NeurIPS version. 35 pages, 3 algorithms, 4 figures, 1 table

Journal ref: https://proceedings.neurips.cc/paper/2021/hash/44e65d3e9bc2f88b2b3d566de51a5381-Abstract.html

arXiv:2107.05951 [pdf, other]

One-Point Gradient-Free Methods for Composite Optimization with Applications to Distributed Optimization

Authors: Ivan Stepanov, Artyom Voronov, Aleksandr Beznosikov, Alexander Gasnikov

Abstract: This work is devoted to solving the composite optimization problem with the mixture oracle: for the smooth part of the problem, we have access to the gradient, and for the non-smooth part, only to the one-point zero-order oracle. For such a setup, we present a new method based on the sliding algorithm. Our method allows separating the oracle complexities and computing the gradient for one of the functions as rarely as possible. The paper also presents the applicability of our new method to the problems of distributed optimization and federated learning. Experimental results confirm the theory.

Submitted 16 February, 2022; v1 submitted 13 July, 2021; originally announced July 2021.

Comments: New in v2: completely new text of the paper; 26 pages, 1 figure, 2 tables, 1 algorithm

arXiv:2106.08315 [pdf, other]

Decentralized Local Stochastic Extra-Gradient for Variational Inequalities

Authors: Aleksandr Beznosikov, Pavel Dvurechensky, Anastasia Koloskova, Valentin Samokhin, Sebastian U Stich, Alexander Gasnikov

Abstract: We consider distributed stochastic variational inequalities (VIs) on unbounded domains with the problem data that is heterogeneous (non-IID) and distributed across many devices. We make a very general assumption on the computational network that, in particular, covers the settings of fully decentralized calculations with time-varying networks and centralized topologies commonly used in Federated Learning. Moreover, multiple local updates on the workers can be made for reducing the communication frequency between the workers. We extend the stochastic extragradient method to this very general setting and theoretically analyze its convergence rate in the strongly-monotone, monotone, and non-monotone (when a Minty solution exists) settings. The provided rates explicitly exhibit the dependence on network characteristics (e.g., mixing time), iteration counter, data heterogeneity, variance, number of devices, and other standard parameters. As a special case, our method and analysis apply to distributed stochastic saddle-point problems (SPP), e.g., to the training of Deep Generative Adversarial Networks (GANs) for which decentralized training has been reported to be extremely challenging. In experiments for the decentralized training of GANs we demonstrate the effectiveness of our proposed approach.

Submitted 2 April, 2023; v1 submitted 15 June, 2021; originally announced June 2021.

Comments: Appears in: Advances in Neural Information Processing Systems 35 (NeurIPS 2022). Minor modifications with respect to the NeurIPS version. 43 pages, 1 algorithm, 6 figures, 2 tables

Journal ref: https://proceedings.neurips.cc/paper_files/paper/2022/hash/f9379afacdbabfdc6b060972b60f9ab8-Abstract-Conference.html

arXiv:2106.07289 [pdf, other]

Decentralized Personalized Federated Learning for Min-Max Problems

Authors: Ekaterina Borodich, Aleksandr Beznosikov, Abdurakhmon Sadiev, Vadim Sushko, Nikolay Savelyev, Martin Takáč, Alexander Gasnikov

Abstract: Personalized Federated Learning (PFL) has witnessed remarkable advancements, enabling the development of innovative machine learning applications that preserve the privacy of training data. However, existing theoretical research in this field has primarily focused on distributed optimization for minimization problems. This paper is the first to study PFL for saddle point problems, which encompass a broader range of optimization problems that require more than just solving minimization problems. In this work, we consider a recently proposed PFL setting with the mixing objective function, an approach combining the learning of a global model together with locally distributed learners. Unlike most previous work, which considered only the centralized setting, we work in a more general and decentralized setup that allows us to design and analyze more practical and federated ways to connect devices to the network. We propose new algorithms to address this problem and provide a theoretical analysis of the smooth (strongly) convex-(strongly) concave saddle point problems in stochastic and deterministic cases. Numerical experiments for bilinear problems and neural networks with adversarial noise demonstrate the effectiveness of the proposed methods.

Submitted 17 April, 2024; v1 submitted 14 June, 2021; originally announced June 2021.

Comments: 33 pages, 3 algorithms, 5 figures, 2 tables

arXiv:2106.05958 [pdf, other]

High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise

Authors: Eduard Gorbunov, Marina Danilova, Innokentiy Shibaev, Pavel Dvurechensky, Alexander Gasnikov

Abstract: Stochastic first-order methods are standard for training large-scale machine learning models. Random behavior may cause a particular run of an algorithm to result in a highly suboptimal objective value, whereas theoretical guarantees are usually proved for the expectation of the objective value. Thus, it is essential to theoretically guarantee that algorithms provide small objective residual with high probability. Existing methods for non-smooth stochastic convex optimization have complexity bounds with the dependence on the confidence level that is either negative-power or logarithmic but under an additional assumption of sub-Gaussian (light-tailed) noise distribution that may not hold in practice. In our paper, we resolve this issue and derive the first high-probability convergence results with logarithmic dependence on the confidence level for non-smooth convex stochastic optimization problems with non-sub-Gaussian (heavy-tailed) noise. To derive our results, we propose novel stepsize rules for two stochastic methods with gradient clipping. Moreover, our analysis works for generalized smooth objectives with Hölder-continuous gradients, and for both methods, we provide an extension for strongly convex problems. Finally, our results imply that the first (accelerated) method we consider also has optimal iteration and oracle complexity in all the regimes, and the second one is optimal in the non-smooth setting.

Submitted 30 August, 2024; v1 submitted 10 June, 2021; originally announced June 2021.

Comments: 61 pages, 12 figures. Changes in V2: different presentation of the results, different structure, new experiments. Changes in V3: some typos were fixed

arXiv:2106.04469 [pdf, other]

Lower Bounds and Optimal Algorithms for Smooth and Strongly Convex Decentralized Optimization Over Time-Varying Networks

Authors: Dmitry Kovalev, Elnur Gasanov, Peter Richtárik, Alexander Gasnikov

Abstract: We consider the task of minimizing the sum of smooth and strongly convex functions stored in a decentralized manner across the nodes of a communication network whose links are allowed to change in time. We solve two fundamental problems for this task. First, we establish the first lower bounds on the number of decentralized communication rounds and the number of local computations required to find an $ε$-accurate solution. Second, we design two optimal algorithms that attain these lower bounds: (i) a variant of the recently proposed algorithm ADOM (Kovalev et al., 2021) enhanced via a multi-consensus subroutine, which is optimal in the case when access to the dual gradients is assumed, and (ii) a novel algorithm, called ADOM+, which is optimal in the case when access to the primal gradients is assumed. We corroborate the theoretical efficiency of these algorithms by performing an experimental comparison with existing state-of-the-art methods.

Submitted 8 June, 2021; originally announced June 2021.

arXiv:2102.09234 [pdf, other]

ADOM: Accelerated Decentralized Optimization Method for Time-Varying Networks

Authors: Dmitry Kovalev, Egor Shulgin, Peter Richtárik, Alexander Rogozin, Alexander Gasnikov

Abstract: We propose ADOM - an accelerated method for smooth and strongly convex decentralized optimization over time-varying networks. ADOM uses a dual oracle, i.e., we assume access to the gradient of the Fenchel conjugate of the individual loss functions. Up to a constant factor, which depends on the network structure only, its communication complexity is the same as that of the accelerated Nesterov gradient method (Nesterov, 2003). To the best of our knowledge, only the algorithm of Rogozin et al. (2019) has a convergence rate with similar properties. However, their algorithm converges under the very restrictive assumption that the number of network changes cannot be greater than a tiny percentage of the number of iterations. This assumption is hard to satisfy in practice, as the network topology changes usually cannot be controlled. In contrast, ADOM merely requires the network to stay connected throughout time.

Submitted 18 February, 2021; originally announced February 2021.

arXiv:2102.07758 [pdf, other]

doi 10.1080/10556788.2023.2280062

Decentralized Distributed Optimization for Saddle Point Problems

Authors: Alexander Rogozin, Aleksandr Beznosikov, Darina Dvinskikh, Dmitry Kovalev, Pavel Dvurechensky, Alexander Gasnikov

Abstract: We consider distributed convex-concave saddle point problems over arbitrary connected undirected networks and propose a decentralized distributed algorithm for their solution. The local functions distributed across the nodes are assumed to have global and local groups of variables. For the proposed algorithm we prove non-asymptotic convergence rate estimates with explicit dependence on the network characteristics. To supplement the convergence rate analysis, we propose lower bounds for strongly-convex-strongly-concave and convex-concave saddle-point problems over arbitrary connected undirected networks. We illustrate the considered problem setting by a particular application to distributed calculation of non-regularized Wasserstein barycenters.

Submitted 9 April, 2024; v1 submitted 15 February, 2021; originally announced February 2021.

arXiv:2102.06780 [pdf, ps, other]

Newton Method over Networks is Fast up to the Statistical Precision

Authors: Amir Daneshmand, Gesualdo Scutari, Pavel Dvurechensky, Alexander Gasnikov

Abstract: We propose a distributed cubic regularization of the Newton method for solving (constrained) empirical risk minimization problems over a network of agents, modeled as an undirected graph. The algorithm employs an inexact, preconditioned Newton step at each agent's side: the gradient of the centralized loss is iteratively estimated via a gradient-tracking consensus mechanism and the Hessian is subsampled over the local data sets. No Hessian matrices are thus exchanged over the network. We derive global complexity bounds for convex and strongly convex losses. Our analysis reveals an interesting interplay between sample and iteration/communication complexity: statistically accurate solutions are achievable in roughly the same number of iterations of the centralized cubic Newton method, with a communication cost per iteration of the order of $\widetilde{\mathcal{O}}\big(1/\sqrt{1-ρ}\big)$, where $ρ$ characterizes the connectivity of the network. This demonstrates a significant communication saving with respect to that of existing, statistically oblivious, distributed Newton-based methods over networks.

Submitted 16 June, 2021; v1 submitted 12 February, 2021; originally announced February 2021.

Comments: In proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021

arXiv:2012.06188 [pdf, ps, other]

Recent Theoretical Advances in Non-Convex Optimization

Authors: Marina Danilova, Pavel Dvurechensky, Alexander Gasnikov, Eduard Gorbunov, Sergey Guminov, Dmitry Kamzolov, Innokentiy Shibaev

Abstract: Motivated by recent increased interest in optimization algorithms for non-convex optimization in application to training deep neural networks and other optimization problems in data analysis, we give an overview of recent theoretical results on global performance guarantees of optimization algorithms for non-convex optimization. We start with classical arguments showing that general non-convex problems could not be solved efficiently in a reasonable time. Then we give a list of problems that can be solved efficiently to find the global minimizer by exploiting the structure of the problem as much as it is possible. Another way to deal with non-convexity is to relax the goal from finding the global minimum to finding a stationary point or a local minimum. For this setting, we first present known results for the convergence rates of deterministic first-order methods, which are then followed by a general theoretical analysis of optimal stochastic and randomized gradient schemes, and an overview of the stochastic first-order methods. After that, we discuss quite general classes of non-convex problems, such as minimization of $α$-weakly-quasi-convex functions and functions that satisfy the Polyak--Lojasiewicz condition, which still allow obtaining theoretical convergence guarantees of first-order methods. Then we consider higher-order and zeroth-order/derivative-free methods and their convergence rates for non-convex optimization problems.

Submitted 26 November, 2021; v1 submitted 11 December, 2020; originally announced December 2020.

Comments: 81 pages

arXiv:2010.13112 [pdf, other]

Distributed Saddle-Point Problems: Lower Bounds, Near-Optimal and Robust Algorithms

Authors: Aleksandr Beznosikov, Valentin Samokhin, Alexander Gasnikov

Abstract: This paper focuses on the distributed optimization of stochastic saddle point problems. The first part of the paper is devoted to lower bounds for the centralized and decentralized distributed methods for smooth (strongly) convex-(strongly) concave saddle-point problems as well as the near-optimal algorithms by which these bounds are achieved. Next, we present a new federated algorithm for centralized distributed saddle point problems - Extra Step Local SGD. Theoretical analysis of the new method is carried out for strongly convex-strongly concave and non-convex-non-concave problems. In the experimental part of the paper, we show the effectiveness of our method in practice. In particular, we train GANs in a distributed manner.

Submitted 8 July, 2022; v1 submitted 25 October, 2020; originally announced October 2020.

Comments: 52 pages, 9 figures, 1 table, 4 algorithms (3 new)

arXiv:2009.09908 [pdf, other]

doi 10.1007/978-3-030-86433-0_5

Zeroth-Order Algorithms for Smooth Saddle-Point Problems

Authors: Abdurakhmon Sadiev, Aleksandr Beznosikov, Pavel Dvurechensky, Alexander Gasnikov

Abstract: Saddle-point problems have recently gained increased attention from the machine learning community, mainly due to applications in training Generative Adversarial Networks using stochastic gradients. At the same time, in some applications only a zeroth-order oracle is available. In this paper, we propose several algorithms to solve stochastic smooth (strongly) convex-concave saddle-point problems using zeroth-order oracles and estimate their convergence rate and its dependence on the dimension $n$ of the variable. In particular, our analysis shows that in the case when the feasible set is a direct product of two simplices, our convergence rate for the stochastic term is only by a $\log n$ factor worse than for the first-order methods. We also consider a mixed setup and develop 1/2th-order methods that use zeroth-order oracle for the minimization part and first-order oracle for the maximization part. Finally, we demonstrate the practical performance of our zeroth-order and 1/2th-order methods on practical problems.

Submitted 27 February, 2021; v1 submitted 21 September, 2020; originally announced September 2020.

arXiv:2006.06763 [pdf, other]

Stochastic Saddle-Point Optimization for Wasserstein Barycenters

Authors: Daniil Tiapkin, Alexander Gasnikov, Pavel Dvurechensky

Abstract: We consider the population Wasserstein barycenter problem for random probability measures supported on a finite set of points and generated by an online stream of data. This leads to a complicated stochastic optimization problem where the objective is given as an expectation of a function given as a solution to a random optimization problem. We employ the structure of the problem and obtain a convex-concave stochastic saddle-point reformulation of this problem. In the setting when the distribution of random probability measures is discrete, we propose a stochastic optimization algorithm and estimate its complexity. The second result, based on kernel methods, extends the previous one to the arbitrary distribution of random probability measures. Moreover, this new algorithm has a total complexity better than the Stochastic Approximation approach combined with the Sinkhorn algorithm in many cases. We also illustrate our developments by a series of numerical experiments.

Submitted 2 December, 2021; v1 submitted 11 June, 2020; originally announced June 2020.

arXiv:2005.10785 [pdf, other]

Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping

Authors: Eduard Gorbunov, Marina Danilova, Alexander Gasnikov

Abstract: In this paper, we propose a new accelerated stochastic first-order method called clipped-SSTM for smooth convex stochastic optimization with heavy-tailed distributed noise in stochastic gradients and derive the first high-probability complexity bounds for this method closing the gap in the theory of stochastic optimization with heavy-tailed noise. Our method is based on a special variant of accelerated Stochastic Gradient Descent (SGD) and clipping of stochastic gradients. We extend our method to the strongly convex case and prove new complexity bounds that outperform state-of-the-art results in this case. Finally, we extend our proof technique and derive the first non-trivial high-probability complexity bounds for SGD with clipping without light-tails assumption on the noise.
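Illustrative note (not part of the arXiv listing): the abstract above relies on clipping of stochastic gradients. A minimal sketch of the standard norm-clipping operator and one plain clipped-SGD step is given below. It is not the accelerated clipped-SSTM method itself, and the names `clip_gradient`, `clipped_sgd_step`, `step_size`, and `clip_level` are hypothetical choices made for this example only.

```python
import math


def clip_gradient(grad, clip_level):
    """Rescale grad so that its Euclidean norm does not exceed clip_level.

    Assumed form (standard norm clipping): min(1, clip_level / ||g||) * g.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_level:
        # Small gradients (including the zero vector) pass through unchanged.
        return list(grad)
    scale = clip_level / norm
    return [scale * g for g in grad]


def clipped_sgd_step(x, stochastic_grad, step_size, clip_level):
    """One plain SGD step that uses the clipped stochastic gradient."""
    g = clip_gradient(stochastic_grad, clip_level)
    return [xi - step_size * gi for xi, gi in zip(x, g)]
```

Clipping bounds the length of every update, so a single heavy-tailed gradient sample cannot move the iterate arbitrarily far; the direction of the sample is preserved.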

Submitted 23 October, 2020; v1 submitted 21 May, 2020; originally announced May 2020.

Comments: NeurIPS 2020; 60 pages, 15 figures

arXiv:2005.05913 [pdf, other]

doi 10.1007/978-3-030-58657-7_11

Gradient-Free Methods for Saddle-Point Problem

Authors: Aleksandr Beznosikov, Abdurakhmon Sadiev, Alexander Gasnikov

Abstract: In the paper, we generalize the approach of Gasnikov et al., 2017, which allows solving (stochastic) convex optimization problems with an inexact gradient-free oracle, to the convex-concave saddle-point problem. The proposed approach performs at least as well as the best existing approaches. But for a special set-up (simplex type constraints and closeness of Lipschitz constants in 1 and 2 norms) our approach reduces the required number of oracle calls (function calculations) by a factor of $\frac{n}{\log n}$. Our method uses a stochastic approximation of the gradient via finite differences. In this case, the function must be specified not only on the optimization set itself, but in a certain neighbourhood of it. In the second part of the paper, we analyze the case when such an assumption cannot be made, we propose a general approach on how to modify the method to solve this problem, and also we apply this approach to particular cases of some classical sets.
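Illustrative note (not part of the arXiv listing): the abstract above mentions a stochastic approximation of the gradient via finite differences. One common form of such an estimator, a two-point difference along a random unit direction, is sketched below. Whether the paper uses exactly this form is an assumption; the names `two_point_gradient_estimate`, `tau`, and `direction` are hypothetical.

```python
import math
import random


def random_unit_vector(n, rng):
    """Draw a direction uniformly from the unit Euclidean sphere in R^n."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:
            return [c / norm for c in v]


def two_point_gradient_estimate(f, x, tau, direction=None, rng=None):
    """Estimate the gradient of f at x from two function values.

    Assumed form: n * (f(x + tau*e) - f(x - tau*e)) / (2*tau) * e,
    where e is a unit direction (random unless passed explicitly).
    """
    n = len(x)
    if direction is None:
        direction = random_unit_vector(n, rng or random.Random())
    # Both query points lie at distance tau from x, so f must be defined
    # in a tau-neighbourhood of the feasible set.
    x_plus = [xi + tau * ei for xi, ei in zip(x, direction)]
    x_minus = [xi - tau * ei for xi, ei in zip(x, direction)]
    slope = (f(x_plus) - f(x_minus)) / (2.0 * tau)
    return [n * slope * ei for ei in direction]
```

The two query points `x_plus` and `x_minus` can fall outside the optimization set, which is why the abstract requires the function to be specified in a neighbourhood of that set.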

Submitted 11 September, 2022; v1 submitted 12 May, 2020; originally announced May 2020.

Comments: Appears in: Communications in Computer and Information Science book series (CCIS,volume 1275). Minor modifications (typos) with respect to the CCIS version. 26 pages, 1 algorithm, 5 figures, 3 tables

arXiv:1906.03622 [pdf, other]

On a Combination of Alternating Minimization and Nesterov's Momentum

Authors: Sergey Guminov, Pavel Dvurechensky, Nazarii Tupitsa, Alexander Gasnikov

Abstract: Alternating minimization (AM) procedures are practically efficient in many applications for solving convex and non-convex optimization problems. On the other hand, Nesterov's accelerated gradient is a theoretically optimal first-order method for convex optimization. In this paper we combine AM and Nesterov's acceleration to propose an accelerated alternating minimization algorithm. We prove $1/k^2$ convergence rate in terms of the objective for convex problems and $1/k$ in terms of the squared gradient norm for non-convex problems, where $k$ is the iteration counter. Our method does not require knowledge of either the convexity of the problem or function parameters such as the Lipschitz constant of the gradient, i.e. it is adaptive to convexity and smoothness and is uniformly optimal for smooth convex and non-convex problems. Further, we develop its primal-dual modification for strongly convex problems with linear constraints and prove the same $1/k^2$ rate for the primal objective residual and constraints feasibility.

Submitted 15 September, 2021; v1 submitted 9 June, 2019; originally announced June 2019.

Comments: Compared to previous versions: dual WB problem and complexity analysis for WB problem corrected, updated and extended experiments

arXiv:1901.08686 [pdf, ps, other]

On the Complexity of Approximating Wasserstein Barycenter

Authors: Alexey Kroshnin, Darina Dvinskikh, Pavel Dvurechensky, Alexander Gasnikov, Nazarii Tupitsa, Cesar Uribe

Abstract: We study the complexity of approximating Wasserstein barycenter of $m$ discrete measures, or histograms of size $n$ by contrasting two alternative approaches, both using entropic regularization. The first approach is based on the Iterative Bregman Projections (IBP) algorithm for which our novel analysis gives a complexity bound proportional to $\frac{mn^2}{\varepsilon^2}$ to approximate the original non-regularized barycenter. Using an alternative accelerated-gradient-descent-based approach, we obtain a complexity proportional to $\frac{mn^{2.5}}{\varepsilon}$. As a byproduct, we show that the regularization parameter in both approaches has to be proportional to $\varepsilon$, which causes instability of both algorithms when the desired accuracy is high. To overcome this issue, we propose a novel proximal-IBP algorithm, which can be seen as a proximal gradient method, which uses IBP on each iteration to make a proximal step. We also consider the question of scalability of these algorithms using approaches from distributed optimization and show that the first algorithm can be implemented in a centralized distributed setting (master/slave), while the second one is amenable to a more general decentralized distributed setting with an arbitrary network topology.

Submitted 20 February, 2020; v1 submitted 24 January, 2019; originally announced January 2019.

Comments: Corrected misprints. Added a reference to accelerated Iterative Bregman Projections introduced in arXiv:1906.03622

MSC Class: 90C25; 90C30; 90C06; 90C90

Journal ref: ICML 2019, in PMLR 97:3530-3540. http://proceedings.mlr.press/v97/kroshnin19a.html

arXiv:1809.00710 [pdf, other]

A Dual Approach for Optimal Algorithms in Distributed Optimization over Networks

Authors: César A. Uribe, Soomin Lee, Alexander Gasnikov, Angelia Nedić

Abstract: We study dual-based algorithms for distributed convex optimization problems over networks, where the objective is to minimize a sum $\sum_{i=1}^{m}f_i(z)$ of functions over a network. We provide complexity bounds for four different cases, namely: each function $f_i$ is strongly convex and smooth, each function is either strongly convex or smooth, and when it is convex but neither strongly convex nor smooth. Our approach is based on the dual of an appropriately formulated primal problem, which includes a graph that models the communication restrictions. We propose distributed algorithms that achieve the same optimal rates as their centralized counterparts (up to constant and logarithmic factors), with an additional optimal cost related to the spectral properties of the network. Initially, we focus on functions for which we can explicitly minimize its Legendre-Fenchel conjugate, i.e., admissible or dual friendly functions. Then, we study distributed optimization algorithms for non-dual friendly functions, as well as a method to improve the dependency on the parameters of the functions involved. Numerical analysis of the proposed algorithms is also provided.

Submitted 16 March, 2020; v1 submitted 3 September, 2018; originally announced September 2018.

Comments: This work is an extended version of the manuscript: Optimal Algorithms for Distributed Optimization arXiv:1712.00232

arXiv:1806.03915 [pdf, other]

Decentralize and Randomize: Faster Algorithm for Wasserstein Barycenters

Authors: Pavel Dvurechensky, Darina Dvinskikh, Alexander Gasnikov, César A. Uribe, Angelia Nedić

Abstract: We study the decentralized distributed computation of discrete approximations for the regularized Wasserstein barycenter of a finite set of continuous probability measures distributedly stored over a network. We assume there is a network of agents/machines/computers, and each agent holds a private continuous probability measure and seeks to compute the barycenter of all the measures in the network by getting samples from its local measure and exchanging information with its neighbors. Motivated by this problem, we develop, and analyze, a novel accelerated primal-dual stochastic gradient method for general stochastic convex optimization problems with linear equality constraints. Then, we apply this method to the decentralized distributed optimization setting to obtain a new algorithm for the distributed semi-discrete regularized Wasserstein barycenter problem. Moreover, we show explicit non-asymptotic complexity for the proposed algorithm.

Submitted 19 February, 2020; v1 submitted 11 June, 2018; originally announced June 2018.

MSC Class: 90C25; 90C30; 90C06; 90C90; 68Q25; 65K05; 65Y20; 68W40 ACM Class: G.1.6

arXiv:1804.02394 [pdf, other]

doi 10.1016/j.ejor.2020.08.027

An Accelerated Directional Derivative Method for Smooth Stochastic Convex Optimization

Authors: Pavel Dvurechensky, Eduard Gorbunov, Alexander Gasnikov

Abstract: We consider smooth stochastic convex optimization problems in the context of algorithms which are based on directional derivatives of the objective function. This context can be considered as an intermediate one between derivative-free optimization and gradient-based optimization. We assume that at any given point and for any given direction, a stochastic approximation for the directional derivative of the objective function at this point and in this direction is available with some additive noise. The noise is assumed to be of an unknown nature, but bounded in the absolute value. We underline that we consider directional derivatives in any direction, as opposed to coordinate descent methods which use only derivatives in coordinate directions. For this setting, we propose a non-accelerated and an accelerated directional derivative method and provide their complexity bounds. Our non-accelerated algorithm has a complexity bound which is similar to the gradient-based algorithm, that is, without any dimension-dependent factor. Our accelerated algorithm has a complexity bound which coincides with the complexity bound of the accelerated gradient-based algorithm up to a factor of square root of the problem dimension. We extend these results to strongly convex problems.

Submitted 21 September, 2020; v1 submitted 8 April, 2018; originally announced April 2018.

Comments: arXiv admin note: text overlap with arXiv:1802.09022

MSC Class: 90C25; 90C15; 90C56; 90C30; 90C06; 68Q25; 65K05; 68W20; 65Y20; 68W40 ACM Class: G.1.6

Showing 1–50 of 56 results for author: Gasnikov, A