Search | arXiv e-print repository

The Road Less Scheduled

Authors: Aaron Defazio, Xingyu Alice Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, Ashok Cutkosky

Abstract: Existing learning rate schedules that do not require specification of the optimization stopping step T are greatly out-performed by learning rate schedules that depend on T. We propose an approach that avoids the need for this stopping time by eschewing the use of schedules entirely, while exhibiting state-of-the-art performance compared to schedules across a wide family of problems ranging from convex problems to large-scale deep learning problems. Our Schedule-Free approach introduces no additional hyper-parameters over standard optimizers with momentum. Our method is a direct consequence of a new theory we develop that unifies scheduling and iterate averaging. An open source implementation of our method is available (https://github.com/facebookresearch/schedule_free).

Submitted 7 August, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

arXiv:2403.04081 [pdf, other]

Directional Smoothness and Gradient Methods: Convergence and Adaptivity

Authors: Aaron Mishkin, Ahmed Khaled, Yuanhao Wang, Aaron Defazio, Robert M. Gower

Abstract: We develop new sub-optimality bounds for gradient descent (GD) that depend on the conditioning of the objective along the path of optimization, rather than on global, worst-case constants. Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper-bounds on the objective. Minimizing these upper-bounds requires solving implicit equations to obtain a sequence of strongly adapted step-sizes; we show that these equations are straightforward to solve for convex quadratics and lead to new guarantees for two classical step-sizes. For general functions, we prove that the Polyak step-size and normalized GD obtain fast, path-dependent rates despite using no knowledge of the directional smoothness. Experiments on logistic regression show our convergence guarantees are tighter than the classical theory based on L-smoothness.

Submitted 6 March, 2024; originally announced March 2024.

Comments: Twenty-four pages
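
The two classical methods named in this abstract can be sketched as follows. This is a minimal illustration of the textbook update rules, not code from the paper: the Polyak step-size is taken as (f(x) - f*) / ||grad f(x)||^2, which assumes the optimal value f* is known, and normalized GD moves a fixed distance along the negative gradient direction. The function names `polyak_gd` and `normalized_gd` are hypothetical.

```python
import numpy as np

def polyak_gd(f, grad, x0, f_star, steps):
    """Gradient descent with the classical Polyak step-size.

    Assumes the optimal value f_star is known (an assumption of this sketch).
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        g_sq = float(g @ g)
        if g_sq == 0.0:  # stationary point reached, nothing left to do
            break
        x = x - (f(x) - f_star) / g_sq * g
    return x

def normalized_gd(grad, x0, stepsize, steps):
    """Gradient descent that moves exactly `stepsize` in distance per step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        g_norm = float(np.linalg.norm(g))
        if g_norm == 0.0:
            break
        x = x - stepsize * g / g_norm
    return x

# Toy objective f(x) = 0.5 * ||x||^2 with minimum value 0 at the origin.
f = lambda x: 0.5 * float(x @ x)
grad = lambda x: x
x_polyak = polyak_gd(f, grad, [3.0, 4.0], f_star=0.0, steps=10)
x_normalized = normalized_gd(grad, [3.0, 4.0], stepsize=1.0, steps=3)
```

On this toy quadratic the Polyak step-size is 0.5 at every iterate, so each step halves the iterate, while normalized GD shrinks the distance to the origin by exactly `stepsize` per step. Neither routine uses a smoothness constant.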

arXiv:2402.07793 [pdf, ps, other]

Tuning-Free Stochastic Optimization

Authors: Ahmed Khaled, Chi Jin

Abstract: Large-scale machine learning problems make the cost of hyperparameter tuning ever more prohibitive. This creates a need for algorithms that can tune themselves on-the-fly. We formalize the notion of "tuning-free" algorithms that can match the performance of optimally-tuned optimization algorithms up to polylogarithmic factors given only loose hints on the relevant problem parameters. We consider in particular algorithms that can match optimally-tuned Stochastic Gradient Descent (SGD). When the domain of optimization is bounded, we show tuning-free matching of SGD is possible and achieved by several existing algorithms. We prove that for the task of minimizing a convex and smooth or Lipschitz function over an unbounded domain, tuning-free optimization is impossible. We discuss conditions under which tuning-free optimization is possible even over unbounded domains. In particular, we show that the recently proposed DoG and DoWG algorithms are tuning-free when the noise distribution is sufficiently well-behaved. For the task of finding a stationary point of a smooth and potentially nonconvex function, we give a variant of SGD that matches the best-known high-probability convergence rate for tuned SGD at only an additional polylogarithmic cost. However, we also give an impossibility result that shows no algorithm can hope to match the optimal expected convergence rate for tuned SGD with high probability.

Submitted 18 March, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

arXiv:2401.16515 [pdf, other]

Dynamic Electro-Optic Analog Memory for Neuromorphic Photonic Computing

Authors: Sean Lam, Ahmed Khaled, Simon Bilodeau, Bicky A. Marquez, Paul R. Prucnal, Lukas Chrostowski, Bhavin J. Shastri, Sudip Shekhar

Abstract: Artificial intelligence (AI) has seen remarkable advancements across various domains, including natural language processing, computer vision, autonomous vehicles, and biology. However, the rapid expansion of AI technologies has escalated the demand for more powerful computing resources. As digital computing approaches fundamental limits, neuromorphic photonics emerges as a promising platform to complement existing digital systems. In neuromorphic photonic computing, photonic devices are controlled using analog signals. This necessitates the use of digital-to-analog converters (DAC) and analog-to-digital converters (ADC) for interfacing with these devices during inference and training. However, data movement between memory and these converters in conventional von Neumann computing architectures consumes energy. To address this, analog memory co-located with photonic computing devices is proposed. This approach aims to reduce the reliance on DACs and ADCs and minimize data movement to enhance compute efficiency. This paper demonstrates a monolithically integrated neuromorphic photonic circuit with co-located capacitive analog memory and compares various analog memory technologies for neuromorphic photonic computing using the MNIST dataset as a benchmark.

Submitted 29 January, 2024; originally announced January 2024.

Comments: 22 pages, 10 figures

arXiv:2312.00596 [pdf, other]

BCN: Batch Channel Normalization for Image Classification

Authors: Afifa Khaled, Chao Li, Jia Ning, Kun He

Abstract: Normalization techniques have been widely used in the field of deep learning because they enable higher learning rates and reduce sensitivity to initialization. However, the effectiveness of popular normalization technologies is typically limited to specific areas. Unlike the standard Batch Normalization (BN) and Layer Normalization (LN), where BN computes the mean and variance along the (N, H, W) dimensions and LN computes the mean and variance along the (C, H, W) dimensions (N, C, H and W are the batch, channel, spatial height and width dimensions, respectively), this paper presents a novel normalization technique called Batch Channel Normalization (BCN). To exploit both the channel and batch dependence and adaptively combine the advantages of BN and LN based on specific datasets or tasks, BCN separately normalizes inputs along the (N, H, W) and (C, H, W) axes, then combines the normalized outputs based on adaptive parameters. As a basic block, BCN can be easily integrated into existing models for various applications in the field of computer vision. Empirical results show that the proposed technique can be seamlessly applied to various versions of CNN or Vision Transformer architecture. The code is publicly available at https://github.com/AfifaKhaled/BatchChannel-Normalization

Submitted 1 December, 2023; originally announced December 2023.
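
The normalization described in this abstract can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract says only that the two normalized outputs are combined "based on adaptive parameters", so the single scalar `mix` and the convex combination below are assumptions, as is the function name `batch_channel_norm`. In a real layer the mixing weight would presumably be learned, and affine scale and shift parameters are omitted here.

```python
import numpy as np

def batch_channel_norm(x, mix=0.5, eps=1e-5):
    """Mix BN-style and LN-style normalization of x with shape (N, C, H, W).

    `mix` is a hypothetical scalar weight standing in for the adaptive
    parameters mentioned in the abstract.
    """
    # BN-style statistics: one mean/variance per channel, taken over (N, H, W).
    mu_bn = x.mean(axis=(0, 2, 3), keepdims=True)
    var_bn = x.var(axis=(0, 2, 3), keepdims=True)
    x_bn = (x - mu_bn) / np.sqrt(var_bn + eps)

    # LN-style statistics: one mean/variance per sample, taken over (C, H, W).
    mu_ln = x.mean(axis=(1, 2, 3), keepdims=True)
    var_ln = x.var(axis=(1, 2, 3), keepdims=True)
    x_ln = (x - mu_ln) / np.sqrt(var_ln + eps)

    # Assumed combination rule: convex mix of the two normalized tensors.
    return mix * x_bn + (1.0 - mix) * x_ln

x = np.random.default_rng(0).normal(size=(4, 3, 5, 5))
y = batch_channel_norm(x, mix=0.5)
```

Setting `mix=1.0` recovers the BN-style branch alone and `mix=0.0` the LN-style branch alone, which makes the two axis choices easy to compare on the same input.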

arXiv:2306.05745 [pdf, other]

Two Independent Teachers are Better Role Model

Authors: Afifa Khaled, Ahmed A. Mubarak, Kun He

Abstract: Recent deep learning models have attracted substantial attention in infant brain analysis. These models, such as semi-supervised techniques (e.g., Temporal Ensembling, mean teacher), have achieved state-of-the-art performance. However, these models depend on an encoder-decoder structure with stacked local operators to gather long-range information, and the local operators limit the efficiency and effectiveness. Besides, the MRI data contain different tissue properties (TPs) such as T1 and T2. One major limitation of these models is that they use both types of data as inputs to the segmentation process, i.e., the models are trained on the dataset once, which incurs high computational and memory costs during inference. In this work, we address the above limitations by designing a new deep-learning model, called 3D-DenseUNet, which works as adaptable global aggregation blocks in down-sampling to solve the issue of spatial information loss. The self-attention module connects the down-sampling blocks to up-sampling blocks, and integrates the feature maps in three dimensions of spatial and channel, effectively improving the representation potential and discriminating ability of the model. Additionally, we propose a new method called Two Independent Teachers (2IT), that summarizes the model weights instead of label predictions. Each teacher model is trained on different types of brain data, T1 and T2, respectively. Then, a fuse model is added to improve test accuracy and enable training with fewer parameters and labels compared to the Temporal Ensembling method without modifying the network architecture. Empirical results demonstrate the effectiveness of the proposed method. The code is available at https://github.com/AfifaKhaled/Two-Independent-Teachers-are-Better-Role-Model.

Submitted 21 December, 2023; v1 submitted 9 June, 2023; originally announced June 2023.

Comments: This manuscript contains 14 pages, 7 figures

arXiv:2305.16284 [pdf, other]

DoWG Unleashed: An Efficient Universal Parameter-Free Gradient Descent Method

Authors: Ahmed Khaled, Konstantin Mishchenko, Chi Jin

Abstract: This paper proposes a new easy-to-implement parameter-free gradient-based optimizer: DoWG (Distance over Weighted Gradients). We prove that DoWG is efficient -- matching the convergence rate of optimally tuned gradient descent in convex optimization up to a logarithmic factor without tuning any parameters, and universal -- automatically adapting to both smooth and nonsmooth problems. While popular algorithms following the AdaGrad framework compute a running average of the squared gradients to use for normalization, DoWG maintains a new distance-based weighted version of the running average, which is crucial to achieve the desired properties. To complement our theory, we also show empirically that DoWG trains at the edge of stability, and validate its effectiveness on practical machine learning tasks.

Submitted 29 January, 2024; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: 22 pages, 1 table, 4 figures

arXiv:2209.02257 [pdf, other]

Faster federated optimization under second-order similarity

Authors: Ahmed Khaled, Chi Jin

Abstract: Federated learning (FL) is a subfield of machine learning where multiple clients try to collaboratively learn a model over a network under communication constraints. We consider finite-sum federated optimization under a second-order function similarity condition and strong convexity, and propose two new algorithms: SVRP and Catalyzed SVRP. This second-order similarity condition has grown popular recently, and is satisfied in many applications including distributed statistical learning and differentially private empirical risk minimization. The first algorithm, SVRP, combines approximate stochastic proximal point evaluations, client sampling, and variance reduction. We show that SVRP is communication efficient and achieves superior performance to many existing algorithms when function similarity is high enough. Our second algorithm, Catalyzed SVRP, is a Catalyst-accelerated variant of SVRP that achieves even better performance and uniformly improves upon existing algorithms for federated optimization under second-order similarity and strong convexity. In the course of analyzing these algorithms, we provide a new analysis of the Stochastic Proximal Point Method (SPPM) that might be of independent interest. Our analysis of SPPM is simple, allows for approximate proximal point evaluations, does not require any smoothness assumptions, and shows a clear benefit in communication complexity over ordinary distributed stochastic gradient descent.

Submitted 22 May, 2023; v1 submitted 6 September, 2022; originally announced September 2022.

Comments: Published at ICLR 2023

arXiv:2206.07021 [pdf, other]

Federated Optimization Algorithms with Random Reshuffling and Gradient Compression

Authors: Abdurakhmon Sadiev, Grigory Malinovsky, Eduard Gorbunov, Igor Sokolov, Ahmed Khaled, Konstantin Burlachenko, Peter Richtárik

Abstract: Gradient compression is a popular technique for improving communication complexity of stochastic first-order methods in distributed training of machine learning models. However, the existing works consider only with-replacement sampling of stochastic gradients. In contrast, it is well-known in practice and recently confirmed in theory that stochastic methods based on without-replacement sampling, e.g., the Random Reshuffling (RR) method, perform better than ones that sample the gradients with-replacement. In this work, we close this gap in the literature and provide the first analysis of methods with gradient compression and without-replacement sampling. We first develop a naïve combination of random reshuffling with gradient compression (Q-RR). Perhaps surprisingly, the theoretical analysis of Q-RR does not show any benefits of using RR. Our extensive numerical experiments confirm this phenomenon. This happens due to the additional compression variance. To reveal the true advantages of RR in distributed learning with compression, we propose a new method called DIANA-RR that reduces the compression variance and has provably better convergence rates than existing counterparts with with-replacement sampling of stochastic gradients. Next, to have a better fit to Federated Learning applications, we incorporate local computation, i.e., we propose and analyze the variants of Q-RR and DIANA-RR -- Q-NASTYA and DIANA-NASTYA that use local gradient steps and different local and global stepsizes. Finally, we conducted several numerical experiments to illustrate our theoretical results.

Submitted 3 November, 2022; v1 submitted 14 June, 2022; originally announced June 2022.

Comments: 66 pages, 6 figures. Changes in V2: the presentation of the results was changed, extra experiments were added. Code: https://github.com/IgorSokoloff/rr_with_compression_experiments_source_code

arXiv:2204.04189 [pdf]

doi 10.1109/ACIT53391.2021.9677225

Internet of Things Protection and Encryption: A Survey

Authors: Ghassan Samara, Ruzayn Quaddoura, Mooad Imad Al-Shalout, AL-Qawasmi Khaled, Ghadeer Al Besani

Abstract: The Internet of Things (IoT) has enabled a wide range of sectors to interact effectively with their consumers in order to deliver seamless services and products. Despite the widespread availability of IoT devices and their Internet connectivity, they have a low level of information security integrity. A number of security methods were proposed and evaluated in our research, and comparisons were made in terms of energy and time in the encryption and decryption processes. A ratification procedure is also performed on the devices in the main manager, which is regarded as a full firewall for IoT devices. The suggested algorithm's success has been shown utilizing low-cost Arduino Uno and Raspberry Pi devices. Arduino Uno has been used to demonstrate the encryption process in low-energy devices using a variety of algorithms, including Enhanced Algorithm for Data Integrity and Authentication (EDAI) and Raspberry Pi, which serves as a safety manager in low-energy device molecules. A variety of enhanced algorithms used in conjunction with Blockchain software have also assured the security and integrity of the information. These findings and discussions are presented at the conclusion of the paper.

Submitted 30 March, 2022; originally announced April 2022.

Comments: 7 pages

Journal ref: 2021 22nd International Arab Conference on Information Technology (ACIT)

arXiv:2111.11556 [pdf, other]

FLIX: A Simple and Communication-Efficient Alternative to Local Methods in Federated Learning

Authors: Elnur Gasanov, Ahmed Khaled, Samuel Horváth, Peter Richtárik

Abstract: Federated Learning (FL) is an increasingly popular machine learning paradigm in which multiple nodes try to collaboratively learn under privacy, communication and multiple heterogeneity constraints. A persistent problem in federated learning is that it is not clear what the optimization objective should be: the standard average risk minimization of supervised learning is inadequate in handling several major constraints specific to federated learning, such as communication adaptivity and personalization control. We identify several key desiderata in frameworks for federated learning and introduce a new framework, FLIX, that takes into account the unique challenges brought by federated learning. FLIX has a standard finite-sum form, which enables practitioners to tap into the immense wealth of existing (potentially non-local) methods for distributed optimization. Through a smart initialization that does not require any communication, FLIX does not require the use of local steps but is still provably capable of performing dissimilarity regularization on par with local methods. We give several algorithms for solving the FLIX formulation efficiently under communication constraints. Finally, we corroborate our theoretical results with extensive experimentation.

Submitted 23 February, 2022; v1 submitted 22 November, 2021; originally announced November 2021.

Comments: V2: includes non-convex analysis as well as new large-scale experiments with neural networks. To appear in AISTATS 2022

arXiv:2102.06704 [pdf, other]

Proximal and Federated Random Reshuffling

Authors: Konstantin Mishchenko, Ahmed Khaled, Peter Richtárik

Abstract: Random Reshuffling (RR), also known as Stochastic Gradient Descent (SGD) without replacement, is a popular and theoretically grounded method for finite-sum minimization. We propose two new algorithms: Proximal and Federated Random Reshuffling (ProxRR and FedRR). The first algorithm, ProxRR, solves composite convex finite-sum minimization problems in which the objective is the sum of a (potentially non-smooth) convex regularizer and an average of $n$ smooth objectives. We obtain the second algorithm, FedRR, as a special case of ProxRR applied to a reformulation of distributed problems with either homogeneous or heterogeneous data. We study the algorithms' convergence properties with constant and decreasing stepsizes, and show that they have considerable advantages over Proximal and Local SGD. In particular, our methods have superior complexities and ProxRR evaluates the proximal operator once per epoch only. When the proximal operator is expensive to compute, this small difference makes ProxRR up to $n$ times faster than algorithms that evaluate the proximal operator in every iteration. We give examples of practical optimization tasks where the proximal operator is difficult to compute and ProxRR has a clear advantage. Finally, we corroborate our results with experiments on real data sets.

Submitted 12 February, 2021; originally announced February 2021.

Comments: 21 pages, 2 figures, 3 algorithms
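
The "proximal operator once per epoch" structure described in this abstract can be sketched as below. This is a hedged illustration rather than the paper's algorithm statement: the least-squares losses, the L1 regularizer with its soft-thresholding prox, the scaling of the prox parameter by `stepsize * n`, and the function names are all assumptions introduced here for a runnable example.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_rr_sketch(A, b, lam, stepsize, epochs, seed=0):
    """Reshuffled gradient passes with a single prox step after each epoch.

    Objective (assumed for illustration):
        (1/n) * sum_i 0.5 * (a_i @ x - b_i)**2 + lam * ||x||_1
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    prox_calls = 0
    for _ in range(epochs):
        # One pass over the data in a freshly sampled order (no replacement).
        for i in rng.permutation(n):
            x = x - stepsize * (A[i] @ x - b[i]) * A[i]
        # Single prox evaluation per epoch; the stepsize * n scaling is an
        # assumption of this sketch, chosen to account for n gradient steps.
        x = soft_threshold(x, stepsize * n * lam)
        prox_calls += 1
    return x, prox_calls

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 3))
b = rng.normal(size=20)
x_hat, calls = prox_rr_sketch(A, b, lam=0.01, stepsize=0.01, epochs=30)
```

Here `calls` equals the number of epochs, whereas a method that applies the prox after every gradient step would evaluate it `n` times as often. That gap is the source of the "up to $n$ times faster" comparison in the abstract when the prox is the expensive part.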

arXiv:2007.13476 [pdf]

doi 10.35444/IJANA.2020.11063

Benchmarking Meta-heuristic Optimization

Authors: Mona Nasr, Omar Farouk, Ahmed Mohamedeen, Ali Elrafie, Marwan Bedeir, Ali Khaled

Abstract: Solving an optimization task in any domain is a very challenging problem, especially when dealing with nonlinear problems and non-convex functions. Many meta-heuristic algorithms are very efficient when solving nonlinear functions. A meta-heuristic algorithm is a problem-independent technique that can be applied to a broad range of problems. In this experiment, some of the evolutionary algorithms will be tested, evaluated, and compared with each other. We will go through the Genetic Algorithm, Differential Evolution, Particle Swarm Optimization Algorithm, Grey Wolf Optimizer, and Simulated Annealing. They will be evaluated on their performance from several points of view, such as how the algorithm performs throughout generations and how close the algorithm's result is to the optimal result. Other points of evaluation are discussed in depth in later sections.

Submitted 27 July, 2020; originally announced July 2020.

Comments: International Journal of Advanced Networking and Applications - IJANA

arXiv:2006.11573 [pdf, other]

Unified Analysis of Stochastic Gradient Methods for Composite Convex and Smooth Optimization

Authors: Ahmed Khaled, Othmane Sebbouh, Nicolas Loizou, Robert M. Gower, Peter Richtárik

Abstract: We present a unified theorem for the convergence analysis of stochastic gradient algorithms for minimizing a smooth and convex loss plus a convex regularizer. We do this by extending the unified analysis of Gorbunov, Hanzely & Richtárik (2020) and dropping the requirement that the loss function be strongly convex. Instead, we only rely on convexity of the loss function. Our unified analysis applies to a host of existing algorithms such as proximal SGD, variance reduced methods, quantization and some coordinate descent type methods. For the variance reduced methods, we recover the best known convergence rates as special cases. For proximal SGD, the quantization and coordinate type methods, we uncover new state-of-the-art convergence rates. Our analysis also includes any form of sampling and minibatching. As such, we are able to determine the minibatch size that optimizes the total complexity of variance reduced methods. We showcase this by obtaining a simple formula for the optimal minibatch size of two variance reduced methods (L-SVRG and SAGA). This optimal minibatch size not only improves the theoretical total complexity of the methods but also improves their convergence in practice, as we show in several experiments.

Submitted 20 June, 2020; originally announced June 2020.

arXiv:2006.05988 [pdf, other]

Random Reshuffling: Simple Analysis with Vast Improvements

Authors: Konstantin Mishchenko, Ahmed Khaled, Peter Richtárik

Abstract: Random Reshuffling (RR) is an algorithm for minimizing finite-sum functions that utilizes iterative gradient descent steps in conjunction with data reshuffling. Often contrasted with its sibling Stochastic Gradient Descent (SGD), RR is usually faster in practice and enjoys significant popularity in convex and non-convex optimization. The convergence rate of RR has attracted substantial attention recently and, for strongly convex and smooth functions, it was shown to converge faster than SGD if 1) the stepsize is small, 2) the gradients are bounded, and 3) the number of epochs is large. We remove these 3 assumptions, improve the dependence on the condition number from $κ^2$ to $κ$ (resp. from $κ$ to $\sqrt{κ}$) and, in addition, show that RR has a different type of variance. We argue through theory and experiments that the new variance type gives an additional justification of the superior performance of RR. To go beyond strong convexity, we present several results for non-strongly convex and non-convex objectives. We show that in all cases, our theory improves upon existing literature. Finally, we prove fast convergence of the Shuffle-Once (SO) algorithm, which shuffles the data only once, at the beginning of the optimization process. Our theory for strongly-convex objectives tightly matches the known lower bounds for both RR and SO and substantiates the common practical heuristic of shuffling once or only a few times. As a byproduct of our analysis, we also get new results for the Incremental Gradient algorithm (IG), which does not shuffle the data at all.

Submitted 5 April, 2021; v1 submitted 10 June, 2020; originally announced June 2020.

Comments: v3 updates: Theorem 4 includes a new result for Polyak-Lojasiewicz functions. NeurIPS 2020. 35 pages, 2 figures, 2 tables, 3 algorithms
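
The three data orderings contrasted in this abstract (RR, SO, and IG) differ only in how the per-epoch visiting order is chosen, which the following sketch makes explicit. It is an illustrative toy, not code from the paper: the function names, the scalar quadratic components, and the constant stepsize are assumptions.

```python
import numpy as np

def epoch_orders(n, epochs, mode, seed=0):
    """Return one index order per epoch for the RR, SO, or IG scheme."""
    rng = np.random.default_rng(seed)
    if mode == "IG":   # Incremental Gradient: fixed order, never shuffled
        return [np.arange(n) for _ in range(epochs)]
    if mode == "SO":   # Shuffle-Once: one permutation, reused every epoch
        perm = rng.permutation(n)
        return [perm.copy() for _ in range(epochs)]
    if mode == "RR":   # Random Reshuffling: fresh permutation each epoch
        return [rng.permutation(n) for _ in range(epochs)]
    raise ValueError(f"unknown mode: {mode}")

def run_finite_sum(grads, x0, stepsize, epochs, mode, seed=0):
    """Apply one gradient step per component, visiting each once per epoch."""
    x = float(x0)
    for order in epoch_orders(len(grads), epochs, mode, seed):
        for i in order:
            x = x - stepsize * grads[i](x)
    return x

# Toy problem: f_i(x) = 0.5 * (x - a_i)^2, so the average is minimized at mean(a).
targets = [1.0, 2.0, 3.0]
grads = [lambda x, a=a: x - a for a in targets]
results = {m: run_finite_sum(grads, 10.0, 0.1, 50, m) for m in ("RR", "SO", "IG")}
```

In every mode each data point is used exactly once per epoch, in contrast to with-replacement SGD, which draws an independent index at every step.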

arXiv:2002.03329 [pdf, other]

Better Theory for SGD in the Nonconvex World

Authors: Ahmed Khaled, Peter Richtárik

Abstract: Large-scale nonconvex optimization problems are ubiquitous in modern machine learning, and among practitioners interested in solving them, Stochastic Gradient Descent (SGD) reigns supreme. We revisit the analysis of SGD in the nonconvex setting and propose a new variant of the recently introduced expected smoothness assumption which governs the behaviour of the second moment of the stochastic gradient. We show that our assumption is both more general and more reasonable than assumptions made in all prior work. Moreover, our results yield the optimal $\mathcal{O}(\varepsilon^{-4})$ rate for finding a stationary point of nonconvex smooth functions, and recover the optimal $\mathcal{O}(\varepsilon^{-1})$ rate for finding a global solution if the Polyak-Łojasiewicz condition is satisfied. We compare against convergence rates under convexity and prove a theorem on the convergence of SGD under Quadratic Functional Growth and convexity, which might be of independent interest. Moreover, we perform our analysis in a framework which allows for a detailed study of the effects of a wide array of sampling strategies and minibatch sizes for finite-sum optimization problems. We corroborate our theoretical results with experiments on real and synthetic data.

Submitted 24 July, 2020; v1 submitted 9 February, 2020; originally announced February 2020.

Comments: 33 pages, 3 figures, 4 theorems, and 4 propositions. V3 updates: added several references on error conditions (Tseng, Solodov, Bottou and Tsitsiklis, Grimmer), added a full proof of Corollary 1, cleaned up several proofs, and made minor adjustments to text for clarity

arXiv:1912.09925 [pdf, other]

Distributed Fixed Point Methods with Compressed Iterates

Authors: Sélim Chraibi, Ahmed Khaled, Dmitry Kovalev, Peter Richtárik, Adil Salim, Martin Takáč

Abstract: We propose basic and natural assumptions under which iterative optimization methods with compressed iterates can be analyzed. This problem is motivated by the practice of federated learning, where a large model stored in the cloud is compressed before it is sent to a mobile device, which then proceeds with training based on local data. We develop standard and variance reduced methods, and establish communication complexity bounds. Our algorithms are the first distributed methods with compressed iterates, and the first fixed point methods with compressed iterates.

Submitted 20 December, 2019; originally announced December 2019.

Comments: 15 pages, 4 algorithms, 4 Theorems

arXiv:1909.04746 [pdf, other]

Tighter Theory for Local SGD on Identical and Heterogeneous Data

Authors: Ahmed Khaled, Konstantin Mishchenko, Peter Richtárik

Abstract: We provide a new analysis of local SGD, removing unnecessary assumptions and elaborating on the difference between two data regimes: identical and heterogeneous. In both cases, we improve the existing theory and provide values of the optimal stepsize and optimal number of local iterations. Our bounds are based on a new notion of variance that is specific to local SGD methods with different data. The tightness of our results is guaranteed by recovering known statements when we plug in $H=1$, where $H$ is the number of local steps. The empirical evidence further validates the severe impact of data heterogeneity on the performance of local SGD.

Submitted 14 April, 2022; v1 submitted 10 September, 2019; originally announced September 2019.

Comments: AISTATS 2020. 31 pages, 1 algorithm, 5 theorems, 6 figures

arXiv:1909.04716 [pdf, other]

Gradient Descent with Compressed Iterates

Authors: Ahmed Khaled, Peter Richtárik

Abstract: We propose and analyze a new type of stochastic first order method: gradient descent with compressed iterates (GDCI). GDCI in each iteration first compresses the current iterate using a lossy randomized compression technique, and subsequently takes a gradient step. This method is a distillation of a key ingredient in the current practice of federated learning, where a model needs to be compressed by a mobile device before it is sent back to a server for aggregation. Our analysis provides a step towards closing the gap between the theory and practice of federated learning, and opens the possibility for many extensions.

Submitted 18 March, 2020; v1 submitted 10 September, 2019; originally announced September 2019.

Comments: NeurIPS 2019 Workshop on Federated Learning for Data Privacy and Confidentiality. 10 pages, 1 algorithm, 1 theorem, 5 lemmas

arXiv:1909.04715 [pdf, other]

First Analysis of Local GD on Heterogeneous Data

Authors: Ahmed Khaled, Konstantin Mishchenko, Peter Richtárik

Abstract: We provide the first convergence analysis of local gradient descent for minimizing the average of smooth and convex but otherwise arbitrary functions. Problems of this form and local gradient descent as a solution method are of importance in federated learning, where each function is based on private data stored by a user on a mobile device, and the data of different users can be arbitrarily heterogeneous. We show that in a low accuracy regime, the method has the same communication complexity as gradient descent.

Submitted 18 March, 2020; v1 submitted 10 September, 2019; originally announced September 2019.

Comments: NeurIPS 2019 Workshop on Federated Learning for Data Privacy and Confidentiality. 11 pages, 4 lemmas, 1 theorem

arXiv:1902.05391 [pdf]

Deep Learning for Bridge Load Capacity Estimation in Post-Disaster and -Conflict Zones

Authors: Arya Pamuncak, Weisi Guo, Ahmed Soliman Khaled, Irwanda Laory

Abstract: Many post-disaster and -conflict regions do not have sufficient data on their transportation infrastructure assets, hindering both mobility and reconstruction. In particular, as the number of aging and deteriorating bridges increases, it is necessary to quantify their load characteristics in order to inform maintenance and prevent failure. The load carrying capacity and the design load are considered the main aspects of any civil structure. Human examination can be costly and slow when expertise is lacking in challenging scenarios. In this paper, we propose to employ deep learning as a method to estimate the load carrying capacity from crowdsourced images. A new convolutional neural network architecture is trained on data from over 6000 bridges, which will benefit future research and applications. We tackle significant variations in the dataset (e.g. class interval, image completion, image colour) and quantify their impact on the prediction accuracy, precision, recall and F1 score. Finally, practical optimisation is performed by converting multiclass classification into binary classification to achieve promising field-use performance.

Submitted 5 February, 2019; originally announced February 2019.

arXiv:1708.02664 [pdf]

Internet of Tangible Things (IoTT): Challenges and Opportunities for Tangible Interaction with IoT

Authors: Leonardo Angelini, Nadine Couture, Omar Abou Khaled, Elena Mugellini

Abstract: In the Internet of Things era, an increasing number of household devices and everyday objects are able to send to and retrieve information from the Internet, offering innovative services to the user. However, most of these devices provide only smartphone or web interfaces to control the IoT object properties and functions. As a result, generally, the interaction is disconnected from the physical world, decreasing the user experience and increasing the risk of isolating the user in digital bubbles. We argue that tangible interaction can counteract this trend and this paper discusses the potential benefits and the still open challenges of tangible interaction applied to the Internet of Things. To underline this need, we introduce the term Internet of Tangible Things. In the article, after an analysis of current open challenges for Human-Computer Interaction in IoT, we summarize current trends in tangible interaction and extrapolate eight tangible interaction properties that could be exploited for designing novel interactions with IoT objects. Through a systematic literature review of tangible interaction applied to IoT, we show what has been already explored in the systems that pioneered the field and the future explorations that still have to be conducted.

Submitted 8 August, 2017; originally announced August 2017.

Comments: Submitted to MDPI Informatics, Special Issue on Tangible and Embodied Interaction

arXiv:1309.6839 [pdf]

Solving Limited-Memory Influence Diagrams Using Branch-and-Bound Search

Authors: Arindam Khaled, Eric A. Hansen, Changhe Yuan

Abstract: A limited-memory influence diagram (LIMID) generalizes a traditional influence diagram by relaxing the assumptions of regularity and no-forgetting, allowing a wider range of decision problems to be modeled. Algorithms for solving traditional influence diagrams are not easily generalized to solve LIMIDs, however, and only recently have exact algorithms for solving LIMIDs been developed. In this paper, we introduce an exact algorithm for solving LIMIDs that is based on branch-and-bound search. Our approach is related to the approach of solving an influence diagram by converting it to an equivalent decision tree, with the difference that the LIMID is converted to a much smaller decision graph that can be searched more efficiently.

Submitted 26 September, 2013; originally announced September 2013.

Comments: Appears in Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI2013)

Report number: UAI-P-2013-PG-331-340

arXiv:1211.2445 [pdf]

A Semi-Structured Tailoring-Driven Approach for ERP Selection

Authors: Abdelilah Khaled, Mohammed Abdou Janati Idrissi

Abstract: It has been widely reported that selecting an inappropriate system is a major reason for ERP implementation failures. The selection of an ERP system is therefore critical. While the number of papers related to ERP implementation is substantial, ERP evaluation and selection approaches have received little attention. Motivated by the adaptation concept of ERP systems, we propose in this paper a semi-structured approach for ERP system selection that differs from existing models in that it has a more holistic focus by simultaneously 1) considering the anticipated fitness of ERP solutions after the optimal resolution, within limited resources, of a set of the identified mismatches and 2) evaluating candidate products according to both functional and non-functional requirements. The approach consists of an iterative selection process model and an evaluation methodology based on 0-1 linear programming and the MACBETH technique to elaborate multi-criteria performance expressions.

Submitted 11 November, 2012; originally announced November 2012.

Comments: 10 pages, 7 figures; IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 5, No 2, September 2012

ACM Class: H.3.4; C.4; D.2.1; D.2.2; D.2.8; D.2.9; D.2.10; G.1.1; G.1.3; G.1.6

Journal ref: IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 5, No 2, September 2012

Showing 1–24 of 24 results for author: Khaled, A