Search | arXiv e-print repository

Beyond Model Collapse: Scaling Up with Synthesized Data Requires Reinforcement

Authors: Yunzhen Feng, Elvis Dohmatob, Pu Yang, Francois Charton, Julia Kempe

Abstract: Synthesized data from generative models is increasingly considered as an alternative to human-annotated data for fine-tuning Large Language Models. This raises concerns about model collapse: a drop in performance of models fine-tuned on generated data. Considering that it is easier for both humans and machines to tell between good and bad examples than to generate high-quality samples, we investig… ▽ More Synthesized data from generative models is increasingly considered as an alternative to human-annotated data for fine-tuning Large Language Models. This raises concerns about model collapse: a drop in performance of models fine-tuned on generated data. Considering that it is easier for both humans and machines to tell between good and bad examples than to generate high-quality samples, we investigate the use of feedback on synthesized data to prevent model collapse. We derive theoretical conditions under which a Gaussian mixture classification model can achieve asymptotically optimal performance when trained on feedback-augmented synthesized data, and provide supporting simulations for finite regimes. We illustrate our theoretical predictions on two practical problems: computing matrix eigenvalues with transformers and news summarization with large language models, which both undergo model collapse when trained on model-generated data. We show that training from feedback-augmented synthesized data, either by pruning incorrect predictions or by selecting the best of several guesses, can prevent model collapse, validating popular approaches like RLHF. △ Less

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2402.07712 [pdf, other]

Model Collapse Demystified: The Case of Regression

Authors: Elvis Dohmatob, Yunzhen Feng, Julia Kempe

Abstract: In the era of proliferation of large language and image generation models, the phenomenon of "model collapse" refers to the situation whereby as a model is trained recursively on data generated from previous generations of itself over time, its performance degrades until the model eventually becomes completely useless, i.e the model collapses. In this work, we study this phenomenon in the setting… ▽ More In the era of proliferation of large language and image generation models, the phenomenon of "model collapse" refers to the situation whereby as a model is trained recursively on data generated from previous generations of itself over time, its performance degrades until the model eventually becomes completely useless, i.e the model collapses. In this work, we study this phenomenon in the setting of high-dimensional regression and obtain analytic formulae which quantitatively outline this phenomenon in a broad range of regimes. In the special case of polynomial decaying spectral and source conditions, we obtain modified scaling laws which exhibit new crossover phenomena from fast to slow rates. We also propose a simple strategy based on adaptive regularization to mitigate model collapse. Our theoretical results are validated with experiments. △ Less

Submitted 30 April, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

arXiv:2402.07043 [pdf, other]

A Tale of Tails: Model Collapse as a Change of Scaling Laws

Authors: Elvis Dohmatob, Yunzhen Feng, Pu Yang, Francois Charton, Julia Kempe

Abstract: As AI model size grows, neural scaling laws have become a crucial tool to predict the improvements of large models when increasing capacity and the size of original (human or natural) training data. Yet, the widespread use of popular models means that the ecosystem of online data and text will co-evolve to progressively contain increased amounts of synthesized data. In this paper we ask: How will… ▽ More As AI model size grows, neural scaling laws have become a crucial tool to predict the improvements of large models when increasing capacity and the size of original (human or natural) training data. Yet, the widespread use of popular models means that the ecosystem of online data and text will co-evolve to progressively contain increased amounts of synthesized data. In this paper we ask: How will the scaling laws change in the inevitable regime where synthetic data makes its way into the training corpus? Will future models, still improve, or be doomed to degenerate up to total (model) collapse? We develop a theoretical framework of model collapse through the lens of scaling laws. We discover a wide range of decay phenomena, analyzing loss of scaling, shifted scaling with number of generations, the ''un-learning" of skills, and grokking when mixing human and synthesized data. Our theory is validated by large-scale experiments with a transformer on an arithmetic task and text generation using the large language model Llama2. △ Less

Submitted 31 May, 2024; v1 submitted 10 February, 2024; originally announced February 2024.

Journal ref: ICML 2024

arXiv:2310.02984 [pdf, other]

Scaling Laws for Associative Memories

Authors: Vivien Cabannes, Elvis Dohmatob, Alberto Bietti

Abstract: Learning arguably involves the discovery and memorization of abstract rules. The aim of this paper is to study associative memory mechanisms. Our model is based on high-dimensional matrices consisting of outer products of embeddings, which relates to the inner layers of transformer language models. We derive precise scaling laws with respect to sample size and parameter size, and discuss the stati… ▽ More Learning arguably involves the discovery and memorization of abstract rules. The aim of this paper is to study associative memory mechanisms. Our model is based on high-dimensional matrices consisting of outer products of embeddings, which relates to the inner layers of transformer language models. We derive precise scaling laws with respect to sample size and parameter size, and discuss the statistical efficiency of different estimators, including optimization-based algorithms. We provide extensive numerical experiments to validate and interpret theoretical results, including fine-grained visualizations of the stored memory associations. △ Less

Submitted 20 February, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

ACM Class: I.2.6; G.1.6

arXiv:2308.00556 [pdf, other]

Robust Linear Regression: Phase-Transitions and Precise Tradeoffs for General Norms

Authors: Elvis Dohmatob, Meyer Scetbon

Abstract: In this paper, we investigate the impact of test-time adversarial attacks on linear regression models and determine the optimal level of robustness that any model can reach while maintaining a given level of standard predictive performance (accuracy). Through quantitative estimates, we uncover fundamental tradeoffs between adversarial robustness and accuracy in different regimes. We obtain a preci… ▽ More In this paper, we investigate the impact of test-time adversarial attacks on linear regression models and determine the optimal level of robustness that any model can reach while maintaining a given level of standard predictive performance (accuracy). Through quantitative estimates, we uncover fundamental tradeoffs between adversarial robustness and accuracy in different regimes. We obtain a precise characterization which distinguishes between regimes where robustness is achievable without hurting standard accuracy and regimes where a tradeoff might be unavoidable. Our findings are empirically confirmed with simple experiments that represent a variety of settings. This work applies to feature covariance matrices and attack norms of any nature, and extends beyond previous works in this area. △ Less

Submitted 1 August, 2023; originally announced August 2023.

arXiv:2301.13486 [pdf, other]

Robust Linear Regression: Gradient-descent, Early-stopping, and Beyond

Authors: Meyer Scetbon, Elvis Dohmatob

Abstract: In this work we study the robustness to adversarial attacks, of early-stopping strategies on gradient-descent (GD) methods for linear regression. More precisely, we show that early-stopped GD is optimally robust (up to an absolute constant) against Euclidean-norm adversarial attacks. However, we show that this strategy can be arbitrarily sub-optimal in the case of general Mahalanobis attacks. This… ▽ More In this work we study the robustness to adversarial attacks, of early-stopping strategies on gradient-descent (GD) methods for linear regression. More precisely, we show that early-stopped GD is optimally robust (up to an absolute constant) against Euclidean-norm adversarial attacks. However, we show that this strategy can be arbitrarily sub-optimal in the case of general Mahalanobis attacks. This observation is compatible with recent findings in the case of classification~\cite{Vardi2022GradientMP} that show that GD provably converges to non-robust models. To alleviate this issue, we propose to apply instead a GD scheme on a transformation of the data adapted to the attack. This data transformation amounts to apply feature-depending learning rates and we show that this modified GD is able to handle any Mahalanobis attack, as well as more general attacks under some conditions. Unfortunately, choosing such adapted transformations can be hard for general attacks. To the rescue, we design a simple and tractable estimator whose adversarial risk is optimal up to within a multiplicative constant of 1.1124 in the population regime, and works for any norm. △ Less

Submitted 31 January, 2023; originally announced January 2023.

arXiv:2211.02675 [pdf, other]

An Adversarial Robustness Perspective on the Topology of Neural Networks

Authors: Morgane Goibert, Thomas Ricatte, Elvis Dohmatob

Abstract: In this paper, we investigate the impact of neural networks (NNs) topology on adversarial robustness. Specifically, we study the graph produced when an input traverses all the layers of a NN, and show that such graphs are different for clean and adversarial inputs. We find that graphs from clean inputs are more centralized around highway edges, whereas those from adversaries are more diffuse, leve… ▽ More In this paper, we investigate the impact of neural networks (NNs) topology on adversarial robustness. Specifically, we study the graph produced when an input traverses all the layers of a NN, and show that such graphs are different for clean and adversarial inputs. We find that graphs from clean inputs are more centralized around highway edges, whereas those from adversaries are more diffuse, leveraging under-optimized edges. Through experiments on a variety of datasets and architectures, we show that these under-optimized edges are a source of adversarial vulnerability and that they can be used to detect adversarial inputs. △ Less

Submitted 4 November, 2022; originally announced November 2022.

arXiv:2210.09957 [pdf, other]

Contextual bandits with concave rewards, and an application to fair ranking

Authors: Virginie Do, Elvis Dohmatob, Matteo Pirotta, Alessandro Lazaric, Nicolas Usunier

Abstract: We consider Contextual Bandits with Concave Rewards (CBCR), a multi-objective bandit problem where the desired trade-off between the rewards is defined by a known concave objective function, and the reward vector depends on an observed stochastic context. We present the first algorithm with provably vanishing regret for CBCR without restrictions on the policy space, whereas prior works were restri… ▽ More We consider Contextual Bandits with Concave Rewards (CBCR), a multi-objective bandit problem where the desired trade-off between the rewards is defined by a known concave objective function, and the reward vector depends on an observed stochastic context. We present the first algorithm with provably vanishing regret for CBCR without restrictions on the policy space, whereas prior works were restricted to finite policy spaces or tabular representations. Our solution is based on a geometric interpretation of CBCR algorithms as optimization algorithms over the convex set of expected rewards spanned by all stochastic policies. Building on Frank-Wolfe analyses in constrained convex optimization, we derive a novel reduction from the CBCR regret to the regret of a scalar-reward bandit problem. We illustrate how to apply the reduction off-the-shelf to obtain algorithms for CBCR with both linear and general reward functions, in the case of non-combinatorial actions. Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives, leading to the first algorithm with regret guarantees for contextual combinatorial bandits with fairness of exposure. △ Less

Submitted 28 February, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

Comments: ICLR 2023

arXiv:2209.13019 [pdf, other]

Fast online ranking with fairness of exposure

Authors: Nicolas Usunier, Virginie Do, Elvis Dohmatob

Abstract: As recommender systems become increasingly central for sorting and prioritizing the content available online, they have a growing impact on the opportunities or revenue of their items producers. For instance, they influence which recruiter a resume is recommended to, or to whom and how much a music track, video or news article is being exposed. This calls for recommendation approaches that not onl… ▽ More As recommender systems become increasingly central for sorting and prioritizing the content available online, they have a growing impact on the opportunities or revenue of their items producers. For instance, they influence which recruiter a resume is recommended to, or to whom and how much a music track, video or news article is being exposed. This calls for recommendation approaches that not only maximize (a proxy of) user satisfaction, but also consider some notion of fairness in the exposure of items or groups of items. Formally, such recommendations are usually obtained by maximizing a concave objective function in the space of randomized rankings. When the total exposure of an item is defined as the sum of its exposure over users, the optimal rankings of every users become coupled, which makes the optimization process challenging. Existing approaches to find these rankings either solve the global optimization problem in a batch setting, i.e., for all users at once, which makes them inapplicable at scale, or are based on heuristics that have weak theoretical guarantees. In this paper, we propose the first efficient online algorithm to optimize concave objective functions in the space of rankings which applies to every concave and smooth objective function, such as the ones found for fairness of exposure. Based on online variants of the Frank-Wolfe algorithm, we show that our algorithm is computationally fast, generating rankings on-the-fly with computation cost dominated by the sort operation, memory efficient, and has strong theoretical guarantees. Compared to baseline policies that only maximize user-side performance, our algorithm allows to incorporate complex fairness of exposure criteria in the recommendations with negligible computational overhead. △ Less

Submitted 13 September, 2022; originally announced September 2022.

Comments: FAccT 2022

arXiv:2207.00486 [pdf, other]

Scalable MCMC Sampling for Nonsymmetric Determinantal Point Processes

Authors: Insu Han, Mike Gartrell, Elvis Dohmatob, Amin Karbasi

Abstract: A determinantal point process (DPP) is an elegant model that assigns a probability to every subset of a collection of $n$ items. While conventionally a DPP is parameterized by a symmetric kernel matrix, removing this symmetry constraint, resulting in nonsymmetric DPPs (NDPPs), leads to significant improvements in modeling power and predictive performance. Recent work has studied an approximate Mar… ▽ More A determinantal point process (DPP) is an elegant model that assigns a probability to every subset of a collection of $n$ items. While conventionally a DPP is parameterized by a symmetric kernel matrix, removing this symmetry constraint, resulting in nonsymmetric DPPs (NDPPs), leads to significant improvements in modeling power and predictive performance. Recent work has studied an approximate Markov chain Monte Carlo (MCMC) sampling algorithm for NDPPs restricted to size-$k$ subsets (called $k$-NDPPs). However, the runtime of this approach is quadratic in $n$, making it infeasible for large-scale settings. In this work, we develop a scalable MCMC sampling algorithm for $k$-NDPPs with low-rank kernels, thus enabling runtime that is sublinear in $n$. Our method is based on a state-of-the-art NDPP rejection sampling algorithm, which we enhance with a novel approach for efficiently constructing the proposal distribution. Furthermore, we extend our scalable $k$-NDPP sampling algorithm to NDPPs without size constraints. Our resulting sampling method has polynomial time complexity in the rank of the kernel, while the existing approach has runtime that is exponential in the rank. With both a theoretical analysis and experiments on real-world datasets, we verify that our scalable approximate sampling algorithms are orders of magnitude faster than existing sampling approaches for $k$-NDPPs and NDPPs. △ Less

Submitted 1 July, 2022; originally announced July 2022.

Comments: ICML 2022

arXiv:2203.13779 [pdf, other]

Origins of Low-dimensional Adversarial Perturbations

Authors: Elvis Dohmatob, Chuan Guo, Morgane Goibert

Abstract: In this paper, we initiate a rigorous study of the phenomenon of low-dimensional adversarial perturbations (LDAPs) in classification. Unlike the classical setting, these perturbations are limited to a subspace of dimension $k$ which is much smaller than the dimension $d$ of the feature space. The case $k=1$ corresponds to so-called universal adversarial perturbations (UAPs; Moosavi-Dezfooli et al.… ▽ More In this paper, we initiate a rigorous study of the phenomenon of low-dimensional adversarial perturbations (LDAPs) in classification. Unlike the classical setting, these perturbations are limited to a subspace of dimension $k$ which is much smaller than the dimension $d$ of the feature space. The case $k=1$ corresponds to so-called universal adversarial perturbations (UAPs; Moosavi-Dezfooli et al., 2017). First, we consider binary classifiers under generic regularity conditions (including ReLU networks) and compute analytical lower-bounds for the fooling rate of any subspace. These bounds explicitly highlight the dependence of the fooling rate on the pointwise margin of the model (i.e., the ratio of the output to its $L_2$ norm of its gradient at a test point), and on the alignment of the given subspace with the gradients of the model w.r.t. inputs. Our results provide a rigorous explanation for the recent success of heuristic methods for efficiently generating low-dimensional adversarial perturbations. Finally, we show that if a decision-region is compact, then it admits a universal adversarial perturbation with $L_2$ norm which is $\sqrt{d}$ times smaller than the typical $L_2$ norm of a data point. Our theoretical results are confirmed by experiments on both synthetic and real data. △ Less

Submitted 4 July, 2022; v1 submitted 25 March, 2022; originally announced March 2022.

arXiv:2203.11864 [pdf, other]

On the (Non-)Robustness of Two-Layer Neural Networks in Different Learning Regimes

Authors: Elvis Dohmatob, Alberto Bietti

Abstract: Neural networks are known to be highly sensitive to adversarial examples. These may arise due to different factors, such as random initialization, or spurious correlations in the learning problem. To better understand these factors, we provide a precise study of the adversarial robustness in different scenarios, from initialization to the end of training in different regimes, as well as intermedia… ▽ More Neural networks are known to be highly sensitive to adversarial examples. These may arise due to different factors, such as random initialization, or spurious correlations in the learning problem. To better understand these factors, we provide a precise study of the adversarial robustness in different scenarios, from initialization to the end of training in different regimes, as well as intermediate scenarios, where initialization still plays a role due to "lazy" training. We consider over-parameterized networks in high dimensions with quadratic targets and infinite samples. Our analysis allows us to identify new tradeoffs between approximation (as measured via test error) and robustness, whereby robustness can only get worse when test error improves, and vice versa. We also show how linearized lazy training regimes can worsen robustness, due to improperly scaled random initialization. Our theoretical results are illustrated with numerical experiments. △ Less

Submitted 4 July, 2022; v1 submitted 22 March, 2022; originally announced March 2022.

arXiv:2201.08417 [pdf, other]

Scalable Sampling for Nonsymmetric Determinantal Point Processes

Authors: Insu Han, Mike Gartrell, Jennifer Gillenwater, Elvis Dohmatob, Amin Karbasi

Abstract: A determinantal point process (DPP) on a collection of $M$ items is a model, parameterized by a symmetric kernel matrix, that assigns a probability to every subset of those items. Recent work shows that removing the kernel symmetry constraint, yielding nonsymmetric DPPs (NDPPs), can lead to significant predictive performance gains for machine learning applications. However, existing work leaves op… ▽ More A determinantal point process (DPP) on a collection of $M$ items is a model, parameterized by a symmetric kernel matrix, that assigns a probability to every subset of those items. Recent work shows that removing the kernel symmetry constraint, yielding nonsymmetric DPPs (NDPPs), can lead to significant predictive performance gains for machine learning applications. However, existing work leaves open the question of scalable NDPP sampling. There is only one known DPP sampling algorithm, based on Cholesky decomposition, that can directly apply to NDPPs as well. Unfortunately, its runtime is cubic in $M$, and thus does not scale to large item collections. In this work, we first note that this algorithm can be transformed into a linear-time one for kernels with low-rank structure. Furthermore, we develop a scalable sublinear-time rejection sampling algorithm by constructing a novel proposal distribution. Additionally, we show that imposing certain structural constraints on the NDPP kernel enables us to bound the rejection rate in a way that depends only on the kernel rank. In our experiments we compare the speed of all of these samplers for a variety of real-world tasks. △ Less

Submitted 19 April, 2022; v1 submitted 20 January, 2022; originally announced January 2022.

Comments: ICLR 2022

arXiv:2106.02630 [pdf, other]

Fundamental tradeoffs between memorization and robustness in random features and neural tangent regimes

Authors: Elvis Dohmatob

Abstract: This work studies the (non)robustness of two-layer neural networks in various high-dimensional linearized regimes. We establish fundamental trade-offs between memorization and robustness, as measured by the Sobolev-seminorm of the model w.r.t the data distribution, i.e the square root of the average squared $L_2$-norm of the gradients of the model w.r.t the its input. More precisely, if $n$ is the… ▽ More This work studies the (non)robustness of two-layer neural networks in various high-dimensional linearized regimes. We establish fundamental trade-offs between memorization and robustness, as measured by the Sobolev-seminorm of the model w.r.t the data distribution, i.e the square root of the average squared $L_2$-norm of the gradients of the model w.r.t the its input. More precisely, if $n$ is the number of training examples, $d$ is the input dimension, and $k$ is the number of hidden neurons in a two-layer neural network, we prove for a large class of activation functions that, if the model memorizes even a fraction of the training, then its Sobolev-seminorm is lower-bounded by (i) $\sqrt{n}$ in case of infinite-width random features (RF) or neural tangent kernel (NTK) with $d \gtrsim n$; (ii) $\sqrt{n}$ in case of finite-width RF with proportionate scaling of $d$ and $k$; and (iii) $\sqrt{n/k}$ in case of finite-width NTK with proportionate scaling of $d$ and $k$. Moreover, all of these lower-bounds are tight: they are attained by the min-norm / least-squares interpolator (when $n$, $d$, and $k$ are in the appropriate interpolating regime). All our results hold as soon as data is log-concave isotropic, and there is label-noise, i.e the target variable is not a deterministic function of the data / features. We empirically validate our theoretical results with experiments. Accidentally, these experiments also reveal for the first time, (iv) a multiple-descent phenomenon in the robustness of the min-norm interpolator. △ Less

Submitted 4 June, 2021; originally announced June 2021.

arXiv:2011.06550 [pdf, other]

Implicit bias of any algorithm: bounding bias via margin

Authors: Elvis Dohmatob

Abstract: Consider $n$ points $x_1,\ldots,x_n$ in finite-dimensional euclidean space, each having one of two colors. Suppose there exists a separating hyperplane (identified with its unit normal vector $w)$ for the points, i.e a hyperplane such that points of same color lie on the same side of the hyperplane. We measure the quality of such a hyperplane by its margin $γ(w)$, defined as minimum distance betwe… ▽ More Consider $n$ points $x_1,\ldots,x_n$ in finite-dimensional euclidean space, each having one of two colors. Suppose there exists a separating hyperplane (identified with its unit normal vector $w)$ for the points, i.e a hyperplane such that points of same color lie on the same side of the hyperplane. We measure the quality of such a hyperplane by its margin $γ(w)$, defined as minimum distance between any of the points $x_i$ and the hyperplane. In this paper, we prove that the margin function $γ$ satisfies a nonsmooth Kurdyka-Lojasiewicz inequality with exponent $1/2$. This result has far-reaching consequences. For example, let $γ^{opt}$ be the maximum possible margin for the problem and let $w^{opt}$ be the parameter for the hyperplane which attains this value. Given any other separating hyperplane with parameter $w$, let $d(w):=\|w-w^{opt}\|$ be the euclidean distance between $w$ and $w^{opt}$, also called the bias of $w$. From the previous KL-inequality, we deduce that $(γ^{opt}-γ(w)) / R \le d(w) \le 2\sqrt{(γ^{opt}-γ(w))/γ^{opt}}$, where $R:=\max_i \|x_i\|$ is the maximum distance of the points $x_i$ from the origin. Consequently, for any optimization algorithm (gradient-descent or not), the bias of the iterates converges at least as fast as the square-root of the rate of their convergence of the margin. Thus, our work provides a generic tool for analyzing the implicit bias of any algorithm in terms of its margin, in situations where a specialized analysis might not be available: it is sufficient to establish a good rate for converge of the margin, a task which is usually much easier. △ Less

Submitted 23 November, 2020; v1 submitted 12 November, 2020; originally announced November 2020.

arXiv:2006.09989 [pdf, other]

Classifier-independent Lower-Bounds for Adversarial Robustness

Authors: Elvis Dohmatob

Abstract: We theoretically analyse the limits of robustness to test-time adversarial and noisy examples in classification. Our work focuses on deriving bounds which uniformly apply to all classifiers (i.e all measurable functions from features to labels) for a given problem. Our contributions are two-fold. (1) We use optimal transport theory to derive variational formulae for the Bayes-optimal error a class… ▽ More We theoretically analyse the limits of robustness to test-time adversarial and noisy examples in classification. Our work focuses on deriving bounds which uniformly apply to all classifiers (i.e all measurable functions from features to labels) for a given problem. Our contributions are two-fold. (1) We use optimal transport theory to derive variational formulae for the Bayes-optimal error a classifier can make on a given classification problem, subject to adversarial attacks. The optimal adversarial attack is then an optimal transport plan for a certain binary cost-function induced by the specific attack model, and can be computed via a simple algorithm based on maximal matching on bipartite graphs. (2) We derive explicit lower-bounds on the Bayes-optimal error in the case of the popular distance-based attacks. These bounds are universal in the sense that they depend on the geometry of the class-conditional distributions of the data, but not on a particular classifier. Our results are in sharp contrast with the existing literature, wherein adversarial vulnerability of classifiers is derived as a consequence of nonzero ordinary test error. △ Less

Submitted 9 November, 2020; v1 submitted 17 June, 2020; originally announced June 2020.

arXiv:2006.09862 [pdf, other]

Scalable Learning and MAP Inference for Nonsymmetric Determinantal Point Processes

Authors: Mike Gartrell, Insu Han, Elvis Dohmatob, Jennifer Gillenwater, Victor-Emmanuel Brunel

Abstract: Determinantal point processes (DPPs) have attracted significant attention in machine learning for their ability to model subsets drawn from a large item collection. Recent work shows that nonsymmetric DPP (NDPP) kernels have significant advantages over symmetric kernels in terms of modeling power and predictive performance. However, for an item collection of size $M$, existing NDPP learning and in… ▽ More Determinantal point processes (DPPs) have attracted significant attention in machine learning for their ability to model subsets drawn from a large item collection. Recent work shows that nonsymmetric DPP (NDPP) kernels have significant advantages over symmetric kernels in terms of modeling power and predictive performance. However, for an item collection of size $M$, existing NDPP learning and inference algorithms require memory quadratic in $M$ and runtime cubic (for learning) or quadratic (for inference) in $M$, making them impractical for many typical subset selection tasks. In this work, we develop a learning algorithm with space and time requirements linear in $M$ by introducing a new NDPP kernel decomposition. We also derive a linear-complexity NDPP maximum a posteriori (MAP) inference algorithm that applies not only to our new kernel but also to that of prior work. Through evaluation on real-world datasets, we show that our algorithms scale significantly better, and can match the predictive performance of prior work. △ Less

Submitted 13 April, 2021; v1 submitted 17 June, 2020; originally announced June 2020.

Comments: ICLR 2021

arXiv:2006.04596 [pdf, other]

Learning disconnected manifolds: a no GANs land

Authors: Ugo Tanielian, Thibaut Issenhuth, Elvis Dohmatob, Jeremie Mary

Abstract: Typical architectures of Generative AdversarialNetworks make use of a unimodal latent distribution transformed by a continuous generator. Consequently, the modeled distribution always has connected support which is cumbersome when learning a disconnected set of manifolds. We formalize this problem by establishing a no free lunch theorem for the disconnected manifold learning stating an upper bound… ▽ More Typical architectures of Generative AdversarialNetworks make use of a unimodal latent distribution transformed by a continuous generator. Consequently, the modeled distribution always has connected support which is cumbersome when learning a disconnected set of manifolds. We formalize this problem by establishing a no free lunch theorem for the disconnected manifold learning stating an upper bound on the precision of the targeted distribution. This is done by building on the necessary existence of a low-quality region where the generator continuously samples data between two disconnected modes. Finally, we derive a rejection sampling method based on the norm of generators Jacobian and show its efficiency on several generators including BigGAN. △ Less

Submitted 10 December, 2020; v1 submitted 8 June, 2020; originally announced June 2020.

Comments: 24 pages

Journal ref: PMLR 119:9418-9427, 2020

arXiv:1909.09621 [pdf, other]

On the Convergence of Approximate and Regularized Policy Iteration Schemes

Authors: Elena Smirnova, Elvis Dohmatob

Abstract: Entropy regularized algorithms such as Soft Q-learning and Soft Actor-Critic, recently showed state-of-the-art performance on a number of challenging reinforcement learning (RL) tasks. The regularized formulation modifies the standard RL objective and thus generally converges to a policy different from the optimal greedy policy of the original RL problem. Practically, it is important to control th… ▽ More Entropy regularized algorithms such as Soft Q-learning and Soft Actor-Critic, recently showed state-of-the-art performance on a number of challenging reinforcement learning (RL) tasks. The regularized formulation modifies the standard RL objective and thus generally converges to a policy different from the optimal greedy policy of the original RL problem. Practically, it is important to control the sub-optimality of the regularized optimal policy. In this paper, we establish sufficient conditions for convergence of a large class of regularized dynamic programming algorithms, unified under regularized modified policy iteration (MPI) and conservative value iteration (VI) schemes. We provide explicit convergence rates to the optimality depending on the decrease rate of the regularization parameter. Our experiments show that the empirical error closely follows the established theoretical convergence rates. In addition to optimality, we demonstrate two desirable behaviours of the regularized algorithms even in the absence of approximations: robustness to stochasticity of environment and safety of trajectories induced by the policy iterates. △ Less

Submitted 14 October, 2019; v1 submitted 20 September, 2019; originally announced September 2019.

arXiv:1906.11567 [pdf, other]

Adversarial Robustness via Label-Smoothing

Authors: Morgane Goibert, Elvis Dohmatob

Abstract: We study Label-Smoothing as a means for improving adversarial robustness of supervised deep-learning models. After establishing a thorough and unified framework, we propose several variations to this general method: adversarial, Boltzmann and second-best Label-Smoothing methods, and we explain how to construct your own one. On various datasets (MNIST, CIFAR10, SVHN) and models (linear models, MLPs… ▽ More We study Label-Smoothing as a means for improving adversarial robustness of supervised deep-learning models. After establishing a thorough and unified framework, we propose several variations to this general method: adversarial, Boltzmann and second-best Label-Smoothing methods, and we explain how to construct your own one. On various datasets (MNIST, CIFAR10, SVHN) and models (linear models, MLPs, LeNet, ResNet), we show that Label-Smoothing in general improves adversarial robustness against a variety of attacks (FGSM, BIM, DeepFool, Carlini-Wagner) by better taking account of the dataset geometry. The proposed Label-Smoothing methods have two main advantages: they can be implemented as a modified cross-entropy loss, thus do not require any modifications of the network architecture nor do they lead to increased training times, and they improve both standard and adversarial accuracy. △ Less

Submitted 15 October, 2019; v1 submitted 27 June, 2019; originally announced June 2019.

arXiv:1906.06211 [pdf, ps, other]

Distributionally Robust Counterfactual Risk Minimization

Authors: Louis Faury, Ugo Tanielian, Flavian Vasile, Elena Smirnova, Elvis Dohmatob

Abstract: This manuscript introduces the idea of using Distributionally Robust Optimization (DRO) for the Counterfactual Risk Minimization (CRM) problem. Tapping into a rich existing literature, we show that DRO is a principled tool for counterfactual decision making. We also show that well-established solutions to the CRM problem like sample variance penalization schemes are special instances of a more gen… ▽ More This manuscript introduces the idea of using Distributionally Robust Optimization (DRO) for the Counterfactual Risk Minimization (CRM) problem. Tapping into a rich existing literature, we show that DRO is a principled tool for counterfactual decision making. We also show that well-established solutions to the CRM problem like sample variance penalization schemes are special instances of a more general DRO problem. In this unifying framework, a variety of distributionally robust counterfactual risk estimators can be constructed using various probability distances and divergences as uncertainty measures. We propose the use of Kullback-Leibler divergence as an alternative way to model uncertainty in CRM and derive a new robust counterfactual objective. In our experiments, we show that this approach outperforms the state-of-the-art on four benchmark datasets, validating the relevance of using other uncertainty measures in practical applications. △ Less

Submitted 14 December, 2019; v1 submitted 14 June, 2019; originally announced June 2019.

Comments: Accepted at AAAI20

arXiv:1905.12962 [pdf, other]

Learning Nonsymmetric Determinantal Point Processes

Authors: Mike Gartrell, Victor-Emmanuel Brunel, Elvis Dohmatob, Syrine Krichene

Abstract: Determinantal point processes (DPPs) have attracted substantial attention as an elegant probabilistic model that captures the balance between quality and diversity within sets. DPPs are conventionally parameterized by a positive semi-definite kernel matrix, and this symmetric kernel encodes only repulsive interactions between items. These so-called symmetric DPPs have significant expressive power,… ▽ More Determinantal point processes (DPPs) have attracted substantial attention as an elegant probabilistic model that captures the balance between quality and diversity within sets. DPPs are conventionally parameterized by a positive semi-definite kernel matrix, and this symmetric kernel encodes only repulsive interactions between items. These so-called symmetric DPPs have significant expressive power, and have been successfully applied to a variety of machine learning tasks, including recommendation systems, information retrieval, and automatic summarization, among many others. Efficient algorithms for learning symmetric DPPs and sampling from these models have been reasonably well studied. However, relatively little attention has been given to nonsymmetric DPPs, which relax the symmetric constraint on the kernel. Nonsymmetric DPPs allow for both repulsive and attractive item interactions, which can significantly improve modeling power, resulting in a model that may better fit for some applications. We present a method that enables a tractable algorithm, based on maximum likelihood estimation, for learning nonsymmetric DPPs from data composed of observed subsets. Our method imposes a particular decomposition of the nonsymmetric kernel that enables such tractable learning algorithms, which we analyze both theoretically and experimentally. We evaluate our model on synthetic and real-world datasets, demonstrating improved predictive performance compared to symmetric DPPs, which have previously shown strong performance on modeling tasks associated with these datasets. △ Less

Submitted 12 November, 2020; v1 submitted 30 May, 2019; originally announced May 2019.

Comments: NeurIPS 2019

arXiv:1902.08708 [pdf, other]

Distributionally Robust Reinforcement Learning

Authors: Elena Smirnova, Elvis Dohmatob, Jérémie Mary

Abstract: Real-world applications require RL algorithms to act safely. During learning process, it is likely that the agent executes sub-optimal actions that may lead to unsafe/poor states of the system. Exploration is particularly brittle in high-dimensional state/action space due to increased number of low-performing actions. In this work, we consider risk-averse exploration in approximate RL setting. To… ▽ More Real-world applications require RL algorithms to act safely. During learning process, it is likely that the agent executes sub-optimal actions that may lead to unsafe/poor states of the system. Exploration is particularly brittle in high-dimensional state/action space due to increased number of low-performing actions. In this work, we consider risk-averse exploration in approximate RL setting. To ensure safety during learning, we propose the distributionally robust policy iteration scheme that provides lower bound guarantee on state-values. Our approach induces a dynamic level of risk to prevent poor decisions and yet preserves the convergence to the optimal policy. Our formulation results in a efficient algorithm that accounts for a simple re-weighting of policy actions in the standard policy iteration scheme. We extend our approach to continuous state/action space and present a practical algorithm, distributionally robust soft actor-critic, that implements a different exploration strategy: it acts conservatively at short-term and it explores optimistically in a long-run. We provide promising experimental results on continuous control tasks. △ Less

Submitted 14 June, 2019; v1 submitted 22 February, 2019; originally announced February 2019.

arXiv:1811.07245 [pdf, other]

Deep Determinantal Point Processes

Authors: Mike Gartrell, Elvis Dohmatob, Jon Alberdi

Abstract: Determinantal point processes (DPPs) have attracted significant attention as an elegant model that is able to capture the balance between quality and diversity within sets. DPPs are parameterized by a positive semi-definite kernel matrix. While DPPs have substantial expressive power, they are fundamentally limited by the parameterization of the kernel matrix and their inability to capture nonlinea… ▽ More Determinantal point processes (DPPs) have attracted significant attention as an elegant model that is able to capture the balance between quality and diversity within sets. DPPs are parameterized by a positive semi-definite kernel matrix. While DPPs have substantial expressive power, they are fundamentally limited by the parameterization of the kernel matrix and their inability to capture nonlinear interactions between items within sets. We present the deep DPP model as way to address these limitations, by using a deep feed-forward neural network to learn the kernel matrix. In addition to allowing us to capture nonlinear item interactions, the deep DPP also allows easy incorporation of item metadata into DPP learning. Since the learning target is the DPP kernel matrix, the deep DPP allows us to use existing DPP algorithms for efficient learning, sampling, and prediction. Through an evaluation on several real-world datasets, we show experimentally that the deep DPP can provide a considerable improvement in the predictive performance of DPPs, while also outperforming strong baseline models in many cases. △ Less

Submitted 29 May, 2019; v1 submitted 17 November, 2018; originally announced November 2018.

arXiv:1810.04065 [pdf, other]

Generalized No Free Lunch Theorem for Adversarial Robustness

Authors: Elvis Dohmatob

Abstract: This manuscript presents some new impossibility results on adversarial robustness in machine learning, a very important yet largely open problem. We show that if conditioned on a class label the data distribution satisfies the $W_2$ Talagrand transportation-cost inequality (for example, this condition is satisfied if the conditional distribution has density which is log-concave; is the uniform mea… ▽ More This manuscript presents some new impossibility results on adversarial robustness in machine learning, a very important yet largely open problem. We show that if conditioned on a class label the data distribution satisfies the $W_2$ Talagrand transportation-cost inequality (for example, this condition is satisfied if the conditional distribution has density which is log-concave; is the uniform measure on a compact Riemannian manifold with positive Ricci curvature, any classifier can be adversarially fooled with high probability once the perturbations are slightly greater than the natural noise level in the problem. We call this result The Strong "No Free Lunch" Theorem as some recent results (Tsipras et al. 2018, Fawzi et al. 2018, etc.) on the subject can be immediately recovered as very particular cases. Our theoretical bounds are demonstrated on both simulated and real data (MNIST). We conclude the manuscript with some speculation on possible future research directions. △ Less

Submitted 4 June, 2019; v1 submitted 8 October, 2018; originally announced October 2018.

arXiv:1512.06999 [pdf, ps, other]

FAASTA: A fast solver for total-variation regularization of ill-conditioned problems with application to brain imaging

Authors: Gaël Varoquaux, Michael Eickenberg, Elvis Dohmatob, Bertand Thirion

Abstract: The total variation (TV) penalty, as many other analysis-sparsity problems, does not lead to separable factors or a proximal operatorwith a closed-form expression, such as soft thresholding for the $\ell\_1$ penalty. As a result, in a variational formulation of an inverse problem or statisticallearning estimation, it leads to challenging non-smooth optimization problemsthat are often solved with e… ▽ More The total variation (TV) penalty, as many other analysis-sparsity problems, does not lead to separable factors or a proximal operatorwith a closed-form expression, such as soft thresholding for the $\ell\_1$ penalty. As a result, in a variational formulation of an inverse problem or statisticallearning estimation, it leads to challenging non-smooth optimization problemsthat are often solved with elaborate single-step first-order methods. When thedata-fit term arises from empirical measurements, as in brain imaging, it isoften very ill-conditioned and without simple structure. In this situation, in proximal splitting methods, the computation cost of thegradient step can easily dominate each iteration. Thus it is beneficialto minimize the number of gradient steps.We present fAASTA, a variant of FISTA, that relies on an internal solver forthe TV proximal operator, and refines its tolerance to balance computationalcost of the gradient and the proximal steps. We give benchmarks andillustrations on "brain decoding": recovering brain maps from noisymeasurements to predict observed behavior. The algorithm as well as theempirical study of convergence speed are valuable for any non-exact proximaloperator, in particular analysis-sparsity problems. △ Less

Submitted 22 December, 2015; originally announced December 2015.

Journal ref: Colloque GRETSI, Sep 2015, Lyon, France. Gretsi, 2015, http://www.gretsi.fr/colloque2015/myGretsi/programme.php

arXiv:1507.07901 [pdf, other]

A simple and numerically stable primal-dual algorithm for computing Nash-equilibria in sequential games with incomplete information

Authors: Elvis Dohmatob

Abstract: We present a simple primal-dual algorithm for computing approximate Nash-equilibria in two-person zero-sum sequential games with incomplete information and perfect recall (like Texas Hold'em Poker). Our algorithm is numerically stable, performs only basic iterations (i.e matvec multiplications, clipping, etc., and no calls to external first-order oracles, no matrix inversions, etc.), and is applic… ▽ More We present a simple primal-dual algorithm for computing approximate Nash-equilibria in two-person zero-sum sequential games with incomplete information and perfect recall (like Texas Hold'em Poker). Our algorithm is numerically stable, performs only basic iterations (i.e matvec multiplications, clipping, etc., and no calls to external first-order oracles, no matrix inversions, etc.), and is applicable to a broad class of two-person zero-sum games including simultaneous games and sequential games with incomplete information and perfect recall. The applicability to the latter kind of games is thanks to the sequence-form representation which allows us to encode any such game as a matrix game with convex polytopial strategy profiles. We prove that the number of iterations needed to produce a Nash-equilibrium with a given precision is inversely proportional to the precision. As proof-of-concept, we present experimental results on matrix games on simplexes and Kuhn Poker. △ Less

Submitted 23 December, 2015; v1 submitted 28 July, 2015; originally announced July 2015.

arXiv:1412.3925 [pdf, other]

Region segmentation for sparse decompositions: better brain parcellations from rest fMRI

Authors: Alexandre Abraham, Elvis Dohmatob, Bertrand Thirion, Dimitris Samaras, Gael Varoquaux

Abstract: Functional Magnetic Resonance Images acquired during resting-state provide information about the functional organization of the brain through measuring correlations between brain areas. Independent components analysis is the reference approach to estimate spatial components from weakly structured data such as brain signal time courses; each of these components may be referred to as a brain network… ▽ More Functional Magnetic Resonance Images acquired during resting-state provide information about the functional organization of the brain through measuring correlations between brain areas. Independent components analysis is the reference approach to estimate spatial components from weakly structured data such as brain signal time courses; each of these components may be referred to as a brain network and the whole set of components can be conceptualized as a brain functional atlas. Recently, new methods using a sparsity prior have emerged to deal with low signal-to-noise ratio data. However, even when using sophisticated priors, the results may not be very sparse and most often do not separate the spatial components into brain regions. This work presents post-processing techniques that automatically sparsify brain maps and separate regions properly using geometric operations, and compares these techniques according to faithfulness to data and stability metrics. In particular, among threshold-based approaches, hysteresis thresholding and random walker segmentation, the latter improves significantly the stability of both dense and sparse models. △ Less

Submitted 12 December, 2014; originally announced December 2014.

Journal ref: Sparsity Techniques in Medical Imaging, Sep 2014, Boston, United States. pp.8

Showing 1–28 of 28 results for author: Dohmatob, E