-
The Road Less Scheduled
Authors:
Aaron Defazio,
Xingyu Alice Yang,
Harsh Mehta,
Konstantin Mishchenko,
Ahmed Khaled,
Ashok Cutkosky
Abstract:
Existing learning rate schedules that do not require specification of the optimization stopping step T are greatly out-performed by learning rate schedules that depend on T. We propose an approach that avoids the need for this stopping time by eschewing the use of schedules entirely, while exhibiting state-of-the-art performance compared to schedules across a wide family of problems ranging from convex problems to large-scale deep learning problems. Our Schedule-Free approach introduces no additional hyper-parameters over standard optimizers with momentum. Our method is a direct consequence of a new theory we develop that unifies scheduling and iterate averaging. An open source implementation of our method is available (https://github.com/facebookresearch/schedule_free).
Submitted 7 August, 2024; v1 submitted 24 May, 2024;
originally announced May 2024.
-
Directional Smoothness and Gradient Methods: Convergence and Adaptivity
Authors:
Aaron Mishkin,
Ahmed Khaled,
Yuanhao Wang,
Aaron Defazio,
Robert M. Gower
Abstract:
We develop new sub-optimality bounds for gradient descent (GD) that depend on the conditioning of the objective along the path of optimization, rather than on global, worst-case constants. Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper-bounds on the objective. Minimizing these upper-bounds requires solving implicit equations to obtain a sequence of strongly adapted step-sizes; we show that these equations are straightforward to solve for convex quadratics and lead to new guarantees for two classical step-sizes. For general functions, we prove that the Polyak step-size and normalized GD obtain fast, path-dependent rates despite using no knowledge of the directional smoothness. Experiments on logistic regression show our convergence guarantees are tighter than the classical theory based on L-smoothness.
Submitted 6 March, 2024;
originally announced March 2024.
-
Tuning-Free Stochastic Optimization
Authors:
Ahmed Khaled,
Chi Jin
Abstract:
Large-scale machine learning problems make the cost of hyperparameter tuning ever more prohibitive. This creates a need for algorithms that can tune themselves on-the-fly. We formalize the notion of "tuning-free" algorithms that can match the performance of optimally-tuned optimization algorithms up to polylogarithmic factors given only loose hints on the relevant problem parameters. We consider in particular algorithms that can match optimally-tuned Stochastic Gradient Descent (SGD). When the domain of optimization is bounded, we show tuning-free matching of SGD is possible and achieved by several existing algorithms. We prove that for the task of minimizing a convex and smooth or Lipschitz function over an unbounded domain, tuning-free optimization is impossible. We discuss conditions under which tuning-free optimization is possible even over unbounded domains. In particular, we show that the recently proposed DoG and DoWG algorithms are tuning-free when the noise distribution is sufficiently well-behaved. For the task of finding a stationary point of a smooth and potentially nonconvex function, we give a variant of SGD that matches the best-known high-probability convergence rate for tuned SGD at only an additional polylogarithmic cost. However, we also give an impossibility result that shows no algorithm can hope to match the optimal expected convergence rate for tuned SGD with high probability.
Submitted 18 March, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
-
Dynamic Electro-Optic Analog Memory for Neuromorphic Photonic Computing
Authors:
Sean Lam,
Ahmed Khaled,
Simon Bilodeau,
Bicky A. Marquez,
Paul R. Prucnal,
Lukas Chrostowski,
Bhavin J. Shastri,
Sudip Shekhar
Abstract:
Artificial intelligence (AI) has seen remarkable advancements across various domains, including natural language processing, computer vision, autonomous vehicles, and biology. However, the rapid expansion of AI technologies has escalated the demand for more powerful computing resources. As digital computing approaches fundamental limits, neuromorphic photonics emerges as a promising platform to complement existing digital systems. In neuromorphic photonic computing, photonic devices are controlled using analog signals. This necessitates the use of digital-to-analog converters (DAC) and analog-to-digital converters (ADC) for interfacing with these devices during inference and training. However, data movement between memory and these converters in conventional von Neumann computing architectures consumes energy. To address this, analog memory co-located with photonic computing devices is proposed. This approach aims to reduce the reliance on DACs and ADCs and minimize data movement to enhance compute efficiency. This paper demonstrates a monolithically integrated neuromorphic photonic circuit with co-located capacitive analog memory and compares various analog memory technologies for neuromorphic photonic computing using the MNIST dataset as a benchmark.
Submitted 29 January, 2024;
originally announced January 2024.
-
BCN: Batch Channel Normalization for Image Classification
Authors:
Afifa Khaled,
Chao Li,
Jia Ning,
Kun He
Abstract:
Normalization techniques have been widely used in the field of deep learning because they enable higher learning rates and make training less sensitive to initialization. However, the effectiveness of popular normalization techniques is typically limited to specific areas. Standard Batch Normalization (BN) computes the mean and variance along the (N, H, W) dimensions, and Layer Normalization (LN) computes them along the (C, H, W) dimensions, where N, C, H and W are the batch, channel, spatial height and width dimensions, respectively. This paper presents a novel normalization technique called Batch Channel Normalization (BCN). To exploit both the channel and batch dependence and adaptively combine the advantages of BN and LN based on specific datasets or tasks, BCN separately normalizes inputs along the (N, H, W) and (C, H, W) axes, then combines the normalized outputs based on adaptive parameters. As a basic block, BCN can be easily integrated into existing models for various applications in the field of computer vision. Empirical results show that the proposed technique can be seamlessly applied to various versions of CNN or Vision Transformer architecture. The code is publicly available at https://github.com/AfifaKhaled/BatchChannel-Normalization
Submitted 1 December, 2023;
originally announced December 2023.
-
Two Independent Teachers are Better Role Model
Authors:
Afifa Khaled,
Ahmed A. Mubarak,
Kun He
Abstract:
Recent deep learning models have attracted substantial attention in infant brain analysis. These models, such as semi-supervised techniques (e.g., Temporal Ensembling, mean teacher), have achieved state-of-the-art performance. However, these models depend on an encoder-decoder structure with stacked local operators to gather long-range information, and the local operators limit the efficiency and effectiveness. Besides, the MRI data contain different tissue properties (TPs) such as T1 and T2. One major limitation of these models is that they use both types of data as inputs to the segmentation process, i.e., the models are trained on the dataset once, which imposes high computational and memory requirements during inference. In this work, we address the above limitations by designing a new deep-learning model, called 3D-DenseUNet, which works as adaptable global aggregation blocks in down-sampling to solve the issue of spatial information loss. The self-attention module connects the down-sampling blocks to up-sampling blocks, and integrates the feature maps in three dimensions of spatial and channel, effectively improving the representation potential and discriminating ability of the model. Additionally, we propose a new method called Two Independent Teachers (2IT), which summarizes the model weights instead of label predictions. Each teacher model is trained on a different type of brain data, T1 and T2, respectively. Then, a fuse model is added to improve test accuracy and enable training with fewer parameters and labels compared to the Temporal Ensembling method without modifying the network architecture. Empirical results demonstrate the effectiveness of the proposed method. The code is available at https://github.com/AfifaKhaled/Two-Independent-Teachers-are-Better-Role-Model.
Submitted 21 December, 2023; v1 submitted 9 June, 2023;
originally announced June 2023.
-
DoWG Unleashed: An Efficient Universal Parameter-Free Gradient Descent Method
Authors:
Ahmed Khaled,
Konstantin Mishchenko,
Chi Jin
Abstract:
This paper proposes a new easy-to-implement parameter-free gradient-based optimizer: DoWG (Distance over Weighted Gradients). We prove that DoWG is efficient -- matching the convergence rate of optimally tuned gradient descent in convex optimization up to a logarithmic factor without tuning any parameters, and universal -- automatically adapting to both smooth and nonsmooth problems. While popular algorithms following the AdaGrad framework compute a running average of the squared gradients to use for normalization, DoWG maintains a new distance-based weighted version of the running average, which is crucial to achieve the desired properties. To complement our theory, we also show empirically that DoWG trains at the edge of stability, and validate its effectiveness on practical machine learning tasks.
Submitted 29 January, 2024; v1 submitted 25 May, 2023;
originally announced May 2023.
-
Faster federated optimization under second-order similarity
Authors:
Ahmed Khaled,
Chi Jin
Abstract:
Federated learning (FL) is a subfield of machine learning where multiple clients try to collaboratively learn a model over a network under communication constraints. We consider finite-sum federated optimization under a second-order function similarity condition and strong convexity, and propose two new algorithms: SVRP and Catalyzed SVRP. This second-order similarity condition has grown popular recently, and is satisfied in many applications including distributed statistical learning and differentially private empirical risk minimization. The first algorithm, SVRP, combines approximate stochastic proximal point evaluations, client sampling, and variance reduction. We show that SVRP is communication efficient and achieves superior performance to many existing algorithms when function similarity is high enough. Our second algorithm, Catalyzed SVRP, is a Catalyst-accelerated variant of SVRP that achieves even better performance and uniformly improves upon existing algorithms for federated optimization under second-order similarity and strong convexity. In the course of analyzing these algorithms, we provide a new analysis of the Stochastic Proximal Point Method (SPPM) that might be of independent interest. Our analysis of SPPM is simple, allows for approximate proximal point evaluations, does not require any smoothness assumptions, and shows a clear benefit in communication complexity over ordinary distributed stochastic gradient descent.
Submitted 22 May, 2023; v1 submitted 6 September, 2022;
originally announced September 2022.
-
Federated Optimization Algorithms with Random Reshuffling and Gradient Compression
Authors:
Abdurakhmon Sadiev,
Grigory Malinovsky,
Eduard Gorbunov,
Igor Sokolov,
Ahmed Khaled,
Konstantin Burlachenko,
Peter Richtárik
Abstract:
Gradient compression is a popular technique for improving communication complexity of stochastic first-order methods in distributed training of machine learning models. However, the existing works consider only with-replacement sampling of stochastic gradients. In contrast, it is well-known in practice and recently confirmed in theory that stochastic methods based on without-replacement sampling, e.g., the Random Reshuffling (RR) method, perform better than ones that sample the gradients with replacement. In this work, we close this gap in the literature and provide the first analysis of methods with gradient compression and without-replacement sampling. We first develop a naïve combination of random reshuffling with gradient compression (Q-RR). Perhaps surprisingly, the theoretical analysis of Q-RR does not show any benefits of using RR. Our extensive numerical experiments confirm this phenomenon. This happens due to the additional compression variance. To reveal the true advantages of RR in distributed learning with compression, we propose a new method called DIANA-RR that reduces the compression variance and has provably better convergence rates than existing counterparts with with-replacement sampling of stochastic gradients. Next, to have a better fit to Federated Learning applications, we incorporate local computation, i.e., we propose and analyze the variants of Q-RR and DIANA-RR -- Q-NASTYA and DIANA-NASTYA that use local gradient steps and different local and global stepsizes. Finally, we conduct several numerical experiments to illustrate our theoretical results.
Submitted 3 November, 2022; v1 submitted 14 June, 2022;
originally announced June 2022.
-
Internet of Things Protection and Encryption: A Survey
Authors:
Ghassan Samara,
Ruzayn Quaddoura,
Mooad Imad Al-Shalout,
AL-Qawasmi Khaled,
Ghadeer Al Besani
Abstract:
The Internet of Things (IoT) has enabled a wide range of sectors to interact effectively with their consumers in order to deliver seamless services and products. Despite the widespread availability of IoT devices and their Internet connectivity, they have a low level of information security integrity. A number of security methods were proposed and evaluated in our research, and comparisons were made in terms of energy and time in the encryption and decryption processes. A ratification procedure is also performed on the devices in the main manager, which is regarded as a full firewall for IoT devices. The suggested algorithm's success has been shown utilizing low-cost Arduino Uno and Raspberry Pi devices. Arduino Uno has been used to demonstrate the encryption process in low-energy devices using a variety of algorithms, including Enhanced Algorithm for Data Integrity and Authentication (EDAI) and Raspberry Pi, which serves as a safety manager in low-energy device molecules. A variety of enhanced algorithms used in conjunction with Blockchain software have also assured the security and integrity of the information. These findings and discussions are presented at the conclusion of the paper.
Submitted 30 March, 2022;
originally announced April 2022.
-
FLIX: A Simple and Communication-Efficient Alternative to Local Methods in Federated Learning
Authors:
Elnur Gasanov,
Ahmed Khaled,
Samuel Horváth,
Peter Richtárik
Abstract:
Federated Learning (FL) is an increasingly popular machine learning paradigm in which multiple nodes try to collaboratively learn under privacy, communication and multiple heterogeneity constraints. A persistent problem in federated learning is that it is not clear what the optimization objective should be: the standard average risk minimization of supervised learning is inadequate in handling several major constraints specific to federated learning, such as communication adaptivity and personalization control. We identify several key desiderata in frameworks for federated learning and introduce a new framework, FLIX, that takes into account the unique challenges brought by federated learning. FLIX has a standard finite-sum form, which enables practitioners to tap into the immense wealth of existing (potentially non-local) methods for distributed optimization. Through a smart initialization that does not require any communication, FLIX does not require the use of local steps but is still provably capable of performing dissimilarity regularization on par with local methods. We give several algorithms for solving the FLIX formulation efficiently under communication constraints. Finally, we corroborate our theoretical results with extensive experimentation.
Submitted 23 February, 2022; v1 submitted 22 November, 2021;
originally announced November 2021.
-
Proximal and Federated Random Reshuffling
Authors:
Konstantin Mishchenko,
Ahmed Khaled,
Peter Richtárik
Abstract:
Random Reshuffling (RR), also known as Stochastic Gradient Descent (SGD) without replacement, is a popular and theoretically grounded method for finite-sum minimization. We propose two new algorithms: Proximal and Federated Random Reshuffling (ProxRR and FedRR). The first algorithm, ProxRR, solves composite convex finite-sum minimization problems in which the objective is the sum of a (potentially non-smooth) convex regularizer and an average of $n$ smooth objectives. We obtain the second algorithm, FedRR, as a special case of ProxRR applied to a reformulation of distributed problems with either homogeneous or heterogeneous data. We study the algorithms' convergence properties with constant and decreasing stepsizes, and show that they have considerable advantages over Proximal and Local SGD. In particular, our methods have superior complexities and ProxRR evaluates the proximal operator once per epoch only. When the proximal operator is expensive to compute, this small difference makes ProxRR up to $n$ times faster than algorithms that evaluate the proximal operator in every iteration. We give examples of practical optimization tasks where the proximal operator is difficult to compute and ProxRR has a clear advantage. Finally, we corroborate our results with experiments on real data sets.
Submitted 12 February, 2021;
originally announced February 2021.
-
Benchmarking Meta-heuristic Optimization
Authors:
Mona Nasr,
Omar Farouk,
Ahmed Mohamedeen,
Ali Elrafie,
Marwan Bedeir,
Ali Khaled
Abstract:
Solving an optimization task in any domain is a very challenging problem, especially when dealing with nonlinear problems and non-convex functions. Many meta-heuristic algorithms are very efficient when solving nonlinear functions. A meta-heuristic algorithm is a problem-independent technique that can be applied to a broad range of problems. In this experiment, some of the evolutionary algorithms will be tested, evaluated, and compared with each other. We will go through the Genetic Algorithm, Differential Evolution, Particle Swarm Optimization Algorithm, Grey Wolf Optimizer, and Simulated Annealing. Their performance will be evaluated from many points of view, such as how the algorithm performs throughout generations and how close the algorithm's result is to the optimal result. Other points of evaluation are discussed in depth in later sections.
Submitted 27 July, 2020;
originally announced July 2020.
-
Unified Analysis of Stochastic Gradient Methods for Composite Convex and Smooth Optimization
Authors:
Ahmed Khaled,
Othmane Sebbouh,
Nicolas Loizou,
Robert M. Gower,
Peter Richtárik
Abstract:
We present a unified theorem for the convergence analysis of stochastic gradient algorithms for minimizing a smooth and convex loss plus a convex regularizer. We do this by extending the unified analysis of Gorbunov, Hanzely & Richtárik (2020) and dropping the requirement that the loss function be strongly convex. Instead, we only rely on convexity of the loss function. Our unified analysis applies to a host of existing algorithms such as proximal SGD, variance reduced methods, quantization and some coordinate descent type methods. For the variance reduced methods, we recover the best known convergence rates as special cases. For proximal SGD, the quantization and coordinate type methods, we uncover new state-of-the-art convergence rates. Our analysis also includes any form of sampling and minibatching. As such, we are able to determine the minibatch size that optimizes the total complexity of variance reduced methods. We showcase this by obtaining a simple formula for the optimal minibatch size of two variance reduced methods (L-SVRG and SAGA). This optimal minibatch size not only improves the theoretical total complexity of the methods but also improves their convergence in practice, as we show in several experiments.
Submitted 20 June, 2020;
originally announced June 2020.
-
Random Reshuffling: Simple Analysis with Vast Improvements
Authors:
Konstantin Mishchenko,
Ahmed Khaled,
Peter Richtárik
Abstract:
Random Reshuffling (RR) is an algorithm for minimizing finite-sum functions that utilizes iterative gradient descent steps in conjunction with data reshuffling. Often contrasted with its sibling Stochastic Gradient Descent (SGD), RR is usually faster in practice and enjoys significant popularity in convex and non-convex optimization. The convergence rate of RR has attracted substantial attention recently and, for strongly convex and smooth functions, it was shown to converge faster than SGD if 1) the stepsize is small, 2) the gradients are bounded, and 3) the number of epochs is large. We remove these 3 assumptions, improve the dependence on the condition number from $κ^2$ to $κ$ (resp. from $κ$ to $\sqrt{κ}$) and, in addition, show that RR has a different type of variance. We argue through theory and experiments that the new variance type gives an additional justification of the superior performance of RR. To go beyond strong convexity, we present several results for non-strongly convex and non-convex objectives. We show that in all cases, our theory improves upon existing literature. Finally, we prove fast convergence of the Shuffle-Once (SO) algorithm, which shuffles the data only once, at the beginning of the optimization process. Our theory for strongly-convex objectives tightly matches the known lower bounds for both RR and SO and substantiates the common practical heuristic of shuffling once or only a few times. As a byproduct of our analysis, we also get new results for the Incremental Gradient algorithm (IG), which does not shuffle the data at all.
Submitted 5 April, 2021; v1 submitted 10 June, 2020;
originally announced June 2020.
-
Better Theory for SGD in the Nonconvex World
Authors:
Ahmed Khaled,
Peter Richtárik
Abstract:
Large-scale nonconvex optimization problems are ubiquitous in modern machine learning, and among practitioners interested in solving them, Stochastic Gradient Descent (SGD) reigns supreme. We revisit the analysis of SGD in the nonconvex setting and propose a new variant of the recently introduced expected smoothness assumption which governs the behaviour of the second moment of the stochastic gradient. We show that our assumption is both more general and more reasonable than assumptions made in all prior work. Moreover, our results yield the optimal $\mathcal{O}(\varepsilon^{-4})$ rate for finding a stationary point of nonconvex smooth functions, and recover the optimal $\mathcal{O}(\varepsilon^{-1})$ rate for finding a global solution if the Polyak-Łojasiewicz condition is satisfied. We compare against convergence rates under convexity and prove a theorem on the convergence of SGD under Quadratic Functional Growth and convexity, which might be of independent interest. Moreover, we perform our analysis in a framework which allows for a detailed study of the effects of a wide array of sampling strategies and minibatch sizes for finite-sum optimization problems. We corroborate our theoretical results with experiments on real and synthetic data.
Submitted 24 July, 2020; v1 submitted 9 February, 2020;
originally announced February 2020.
-
Distributed Fixed Point Methods with Compressed Iterates
Authors:
Sélim Chraibi,
Ahmed Khaled,
Dmitry Kovalev,
Peter Richtárik,
Adil Salim,
Martin Takáč
Abstract:
We propose basic and natural assumptions under which iterative optimization methods with compressed iterates can be analyzed. This problem is motivated by the practice of federated learning, where a large model stored in the cloud is compressed before it is sent to a mobile device, which then proceeds with training based on local data. We develop standard and variance reduced methods, and establish communication complexity bounds. Our algorithms are the first distributed methods with compressed iterates, and the first fixed point methods with compressed iterates.
Submitted 20 December, 2019;
originally announced December 2019.
-
Tighter Theory for Local SGD on Identical and Heterogeneous Data
Authors:
Ahmed Khaled,
Konstantin Mishchenko,
Peter Richtárik
Abstract:
We provide a new analysis of local SGD, removing unnecessary assumptions and elaborating on the difference between two data regimes: identical and heterogeneous. In both cases, we improve the existing theory and provide values of the optimal stepsize and optimal number of local iterations. Our bounds are based on a new notion of variance that is specific to local SGD methods with different data. The tightness of our results is guaranteed by recovering known statements when we plug in $H=1$, where $H$ is the number of local steps. The empirical evidence further validates the severe impact of data heterogeneity on the performance of local SGD.
Submitted 14 April, 2022; v1 submitted 10 September, 2019;
originally announced September 2019.
-
Gradient Descent with Compressed Iterates
Authors:
Ahmed Khaled,
Peter Richtárik
Abstract:
We propose and analyze a new type of stochastic first order method: gradient descent with compressed iterates (GDCI). GDCI in each iteration first compresses the current iterate using a lossy randomized compression technique, and subsequently takes a gradient step. This method is a distillation of a key ingredient in the current practice of federated learning, where a model needs to be compressed by a mobile device before it is sent back to a server for aggregation. Our analysis provides a step towards closing the gap between the theory and practice of federated learning, and opens the possibility for many extensions.
Submitted 18 March, 2020; v1 submitted 10 September, 2019;
originally announced September 2019.
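A minimal sketch of the compress-then-step iteration described in this abstract, under stated assumptions: the update is read as $x^{k+1} = \mathcal{C}(x^k) - \gamma \nabla f(\mathcal{C}(x^k))$, and the compressor is unbiased random sparsification. The names `random_sparsify` and `gdci`, the toy quadratic, and all parameter values are illustrative choices, not taken from the paper.

```python
import numpy as np


def random_sparsify(x, keep_prob, rng):
    """Unbiased lossy compressor (assumed): keep each coordinate with
    probability keep_prob and rescale survivors by 1 / keep_prob."""
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)


def gdci(grad, x0, step_size, num_steps, keep_prob=0.9, seed=0):
    """Compress the current iterate, then take a gradient step from the
    compressed point (one reading of the iteration in the abstract)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x_compressed = random_sparsify(x, keep_prob, rng)
        x = x_compressed - step_size * grad(x_compressed)
    return x


# Toy objective f(x) = 0.5 * ||x - b||^2, whose gradient is x - b.
b = np.array([1.0, -2.0, 3.0])
x_final = gdci(lambda x: x - b, np.zeros(3), step_size=0.5, num_steps=200)
```

With `keep_prob=1.0` the compressor is the identity and the loop reduces to plain gradient descent; with lossy compression the iterates only hover around the minimizer `b`.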
-
First Analysis of Local GD on Heterogeneous Data
Authors:
Ahmed Khaled,
Konstantin Mishchenko,
Peter Richtárik
Abstract:
We provide the first convergence analysis of local gradient descent for minimizing the average of smooth and convex but otherwise arbitrary functions. Problems of this form and local gradient descent as a solution method are of importance in federated learning, where each function is based on private data stored by a user on a mobile device, and the data of different users can be arbitrarily heterogeneous. We show that in a low accuracy regime, the method has the same communication complexity as gradient descent.
Submitted 18 March, 2020; v1 submitted 10 September, 2019;
originally announced September 2019.
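Local gradient descent, as summarized in this abstract, can be sketched as follows. This is an illustrative toy, not the authors' code: each "device" holds its own function, runs a few gradient steps from the shared point, and the server averages the results. The function name `local_gd`, the quadratic objectives, and the parameter values are assumptions made for the example.

```python
import numpy as np


def local_gd(grads, x0, step_size, local_steps, rounds):
    """Each device runs `local_steps` gradient steps on its own function
    from the shared iterate; the server then averages the local iterates."""
    x = np.asarray(x0, dtype=float)
    for _ in range(rounds):
        local_iterates = []
        for grad in grads:
            y = x.copy()
            for _ in range(local_steps):
                y = y - step_size * grad(y)
            local_iterates.append(y)
        x = np.mean(local_iterates, axis=0)  # one communication round
    return x


# Two heterogeneous devices with f_i(x) = 0.5 * ||x - t_i||^2.
targets = [np.array([0.0, 0.0]), np.array([2.0, 4.0])]
grads = [lambda x, t=t: x - t for t in targets]
x_avg = local_gd(grads, np.zeros(2), step_size=0.1, local_steps=5, rounds=100)
```

In this toy case all local functions share the same curvature, so the averaged iterate reaches the minimizer of the average objective, the mean of the targets. With genuinely heterogeneous curvature, a fixed step size and several local steps generally leave a residual error.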
-
Deep Learning for Bridge Load Capacity Estimation in Post-Disaster and -Conflict Zones
Authors:
Arya Pamuncak,
Weisi Guo,
Ahmed Soliman Khaled,
Irwanda Laory
Abstract:
Many post-disaster and -conflict regions do not have sufficient data on their transportation infrastructure assets, hindering both mobility and reconstruction. In particular, as the number of aging and deteriorating bridges increases, it is necessary to quantify their load characteristics in order to inform maintenance and prevent failure. The load carrying capacity and the design load are considered the main aspects of any civil structure. Human examination can be costly and slow when expertise is lacking in challenging scenarios. In this paper, we propose to employ deep learning as a method to estimate the load carrying capacity from crowdsourced images. A new convolutional neural network architecture is trained on data from over 6000 bridges, which will benefit future research and applications. We tackle significant variations in the dataset (e.g. class interval, image completion, image colour) and quantify their impact on the prediction accuracy, precision, recall and F1 score. Finally, practical optimisation is performed by converting multiclass classification into binary classification to achieve promising field-use performance.
Submitted 5 February, 2019;
originally announced February 2019.
-
Internet of Tangible Things (IoTT): Challenges and Opportunities for Tangible Interaction with IoT
Authors:
Leonardo Angelini,
Nadine Couture,
Omar Abou Khaled,
Elena Mugellini
Abstract:
In the Internet of Things era, an increasing number of household devices and everyday objects are able to send information to and retrieve information from the Internet, offering innovative services to the user. However, most of these devices provide only smartphone or web interfaces to control the IoT object properties and functions. As a result, the interaction is generally disconnected from the physical world, degrading the user experience and increasing the risk of isolating the user in digital bubbles. We argue that tangible interaction can counteract this trend, and this paper discusses the potential benefits and the still open challenges of tangible interaction applied to the Internet of Things. To underline this need, we introduce the term Internet of Tangible Things. In the article, after an analysis of current open challenges for Human-Computer Interaction in IoT, we summarize current trends in tangible interaction and extrapolate eight tangible interaction properties that could be exploited for designing novel interactions with IoT objects. Through a systematic literature review of tangible interaction applied to IoT, we show what has already been explored in the systems that pioneered the field and the future explorations that still have to be conducted.
Submitted 8 August, 2017;
originally announced August 2017.
-
Solving Limited-Memory Influence Diagrams Using Branch-and-Bound Search
Authors:
Arindam Khaled,
Eric A. Hansen,
Changhe Yuan
Abstract:
A limited-memory influence diagram (LIMID) generalizes a traditional influence diagram by relaxing the assumptions of regularity and no-forgetting, allowing a wider range of decision problems to be modeled. Algorithms for solving traditional influence diagrams are not easily generalized to solve LIMIDs, however, and only recently have exact algorithms for solving LIMIDs been developed. In this paper, we introduce an exact algorithm for solving LIMIDs that is based on branch-and-bound search. Our approach is related to the approach of solving an influence diagram by converting it to an equivalent decision tree, with the difference that the LIMID is converted to a much smaller decision graph that can be searched more efficiently.
Submitted 26 September, 2013;
originally announced September 2013.
-
A Semi-Structured Tailoring-Driven Approach for ERP Selection
Authors:
Abdelilah Khaled,
Mohammed Abdou Janati Idrissi
Abstract:
It has been widely reported that selecting an inappropriate system is a major reason for ERP implementation failures. The selection of an ERP system is therefore critical. While the number of papers related to ERP implementation is substantial, ERP evaluation and selection approaches have received little attention. Motivated by the adaptation concept of ERP systems, we propose in this paper a semi-structured approach for ERP system selection that differs from existing models in that it has a more holistic focus by simultaneously 1) considering the anticipated fitness of ERP solutions after the optimal resolution, within limited resources, of a set of identified mismatches and 2) evaluating candidate products according to both functional and non-functional requirements. The approach consists of an iterative selection process model and an evaluation methodology based on 0-1 linear programming and the MACBETH technique to elaborate multi-criteria performance expressions.
Submitted 11 November, 2012;
originally announced November 2012.