Liu Et Al. - 2021 - Random Features For Kernel Approximation A Survey On Algorithms, Theory, and Beyond
Liu Et Al. - 2021 - Random Features For Kernel Approximation A Survey On Algorithms, Theory, and Beyond
Abstract—The class of random features is one of the most popular techniques to speed up kernel methods in large-scale problems.
Related works have been recognized by the NeurIPS Test-of-Time award in 2017 and the ICML Best Paper Finalist in 2019. The body of
work on random features has grown rapidly, and hence it is desirable to have a comprehensive overview on this topic explaining the
connections among various algorithms and theoretical results. In this survey, we systematically review the work on random features
from the past ten years. First, the motivations, characteristics and contributions of representative random features based algorithms are
summarized according to their sampling schemes, learning procedures, variance reduction properties and how they exploit training data.
arXiv:2004.11154v5 [stat.ML] 11 Jul 2021
Second, we review theoretical results that center around the following key question: how many random features are needed to ensure a
high approximation quality or no loss in the empirical/expected risks of the learned estimator. Third, we provide a comprehensive evaluation
of popular random features based algorithms on several large-scale benchmark datasets and discuss their approximation quality and
prediction performance for classification. Last, we discuss the relationship between random features and modern over-parameterized
deep neural networks (DNNs), including the use of high dimensional random features in the analysis of DNNs as well as the gaps between
current theoretical and empirical results. This survey may serve as a gentle introduction to this topic, and as a users’ guide for practitioners
interested in applying the representative algorithms and understanding theoretical results under various technical assumptions. We hope
that this survey will facilitate discussion on the open problems in this topic, and more importantly, shed light on future research directions.
F
1 I NTRODUCTION
K ERNEL methods [1], [2], [3] are one of the most powerful
techniques for nonlinear statistical learning problems with
a wide range of successful applications. Let x, x0 ∈ X ⊆ Rd
methods provide a data dependent vector representation of the
kernel. Random Fourier features (RFF) [9], on the other hand, is
a typical data-independent technique to approximate the kernel
be two samples and φ : X → H be a nonlinear feature map function using an explicit feature mapping. This survey focuses
transforming each element in X into a reproducing kernel Hilbert on RFF and its variants for kernel approximation. RFF applies in
space (RKHS) H, in which the inner product between φ(x) and particular to shift-invariant (also called “stationary”) kernels that
φ(x0 ) endowed by H can be computed using a kernel function satisfy k(x, x0 ) = k(x − x0 ). By virtue of the correspondence
k(·, ·) : Rd × Rd → R as hφ(x), φ(x0 )iH = k(x, x0 ). In practice, between a shift-invariant kernel and its Fourier spectral density, the
the kernel function k is directly given to obtain the inner product kernel can be approximated by k(x, x0 ) ≈ hϕ(x), ϕ(x0 )i, where
hφ(x), φ(x0 )iH instead of finding the explicit expression of φ, the explicit mapping ϕ : Rd → Rs is obtained by sampling from
which is known as the kernel trick. Benefiting from this scheme, a distribution defined by the inverse Fourier transform of k . To
kernel methods are effective for learning nonlinear structures but scale kernel methods in the large sample case (e.g., n d), the
often suffer from scalability issues in large-scale problems due to number of random features s is often taken to be larger than the
high space and time complexities. For instance, given n samples in original sample dimension d but much smaller than the sample size
the original d-dimensional space X , kernel ridge regression (KRR) n to achieve computational efficiency in practice.1 Accordingly, the
requires O(n3 ) training time and O(n2 ) space to store the kernel random features model is a powerful tool for scaling up traditional
matrix, which is often computationally infeasible when n is large. kernel methods [10], [11], neural tangent kernel [12], [13], [14],
To overcome the poor scalability of kernel methods, kernel graph neural networks [15], [16], and attention in Transformers
approximation is an effective technique by constructing an explicit [17], [18]. Interestingly, the random features model can be viewed
mapping Ψ : Rd → Rs such that k(x, y) ≈ Ψ(x)> Ψ(y). By as a class of two-layer neural networks with fixed weights in the
doing so, an efficient linear model can be well learned in the first layer. This connection has important theoretical implications.
transformed space with O(ns2 ) time and O(ns) memory while It has been observed that deep neural networks (DNNs) exhibit
retaining the expressive power of nonlinear methods. A series of certain intriguing phenomena such as the ability to fit random labels
kernel approximation algorithms have been developed in the past [19] and double descent [20] in the over-parameterized regime.
years, including divide-and-conquer approaches [4], [5], [6], greedy Theoretical results [13], [21], [22], [23] for random features can be
basis selection techniques [7] and Nyström methods [8]. These leveraged to explain these phenomena and provide an analysis of
two-layer over-parameterized neural networks. Partly due to its far-
F. Liu and J.A.K. Suykens are with the Department of Electrical
Engineering (ESAT-STADIUS), KU Leuven, B-3001 Leuven, Belgium (email: reaching repercussions, the seminal work by Rahimi and Recht on
{fanghui.liu;johan.suykens}@esat.kuleuven.be). RFF [9] won the Test-of-Time Award in the Thirty-first Advances
X. Huang is with Institute of Image Processing and Pattern Recognition, and also in Neural Information Processing Systems (NeurIPS 2017).
with Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai
200240, P.R. China (e-mail: [email protected]).
Y. Chen is with School of Operations Research and Information Engineering, 1. Random features model can be regarded as an over-parameterized model
Cornell University, Ithaca, NY 14850 USA (e-mail: [email protected]). allowing for s n, refer to Section 7 for details.
2
2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 future
Figure 1. Timeline of representative work on the algorithms and theory of random features.
of the true target function fρ , which can be quantified by the excess as a finite-dimensional empirical risk minimization problem
risk E(fz,λ ) − E(fρ ), or the estimation error kfz,λ − fρ k2 in an n
1X
appropriate norm k · k. βλ := argmin ` yi , β> ϕ(xi ) + λkβk22 . (3)
Using an explicit randomized feature mapping ϕ : Rd → Rs , β∈Rs n i=1
one may approximate the kernel function k(x, x0 ) by k̃(x, x0 ) =
For example, in least squares regression where ` is the squared
hϕ(x), ϕ(x0 )i. In this case, the approximate kernel k̃(·, ·) defines 2
loss, the first term in problem (3) is equivalent to ky − Zβk2 ,
an RKHS H e (not necessarily contained in the RKHS H associated
>
where y = [y1 , y2 , · · · , yn ] is the label vector and Z =
with the original kernel function k ). With the above approximation,
[ϕ(x1 ), · · · , ϕ(xn )]> ∈ Rn×s is the random feature matrix.
one solves the following approximate version of problem (1):
( n ) This is a linear ridge regression problem in the space spanned
1X by the random features, with the optimal prediction given by
˜ 2
fz,λ := argmin ` yi , f (xi ) + λkf kHe . (2) f˜z,λ (x0 ) = β> 0 0
n i=1 λ ϕ(x ) for a new data point x , where βλ has the
f ∈H > −1 >
explicit expression βλ = (Z Z+nλI) Z y . For classification,
e
By the representer theorem [1], the above problem can be rewritten one may take the sign to output the binary classification labels. Note
4
that problem (3) also corresponds to fixed-size kernel methods with where Λi ≥ 0 are the Fourier coefficients,
Yi,j is
the spherical
feature map approximation (related to Nyström approximation) and i + d − 3
harmonics, and N (d, i) = 2i+d−2 .
estimation in the primal [2]. i d−2
Note that, dot product kernels defined in Rd do not belong
2.2 Theoretical Foundation of Random Features
to the rotation-invariant class. Nevertheless, by virtue of the
The theoretical foundation of RFF builds on Bochner’s celebrated neural network structure under Gaussian initialization, some dot
characterization of positive definite functions. product kernels defined on Rd are able to benefit from the sampling
Theorem 1 (Bochner’s Theorem [55]). A continuous and shift- framework q behind RFF. Given a two-layer network of the form
f (x; θ) = 2s sj=1 aj σ(ω>
P
invariant function k : Rd × Rd → R is positive definite if and only j x) with s neurons (notation chosen
if it can be represented as to be consistent with the number of random features), for some
Z activation function σ and x ∈ Rd , when ω ∼ N (0, Id ) are fixed
0
k(x − x ) = exp iω> (x − x0 ) µk (dω) , and only the second layer (parameters a) are optimized3 , this
Rd
actually corresponds to random features approximation
where µk is a positive finite measure on the frequencies ω .
According to Bochner’s theorem, the spectral distribution µk k (x, x0 ) = Eω∼N (0,Id ) [σ(ω > x)σ(ω> x0 )] , (6)
of a stationary kernel k is the finite measure induced by a Fourier
transform. By setting k(0) = 1, we may normalize µk to a where the nonlinear activation function σ(·) depends on the
probability density p (the Fourier transform associated with k ), kernel type such that ϕ(xi ) := σ(W xi ) in Eq. (5), by denoting
hence the transformation matrix W := [ω1 , ω2 , · · · , ωs ]> ∈ Rs×d .
Z The formulation in (6) is quite general to cover a series of
k(x − x0 ) = exp iω> (x − x0 ) µ(dω)
kernels by various activation functions. For example, if we
Rd (4)
take σ(x) = [cos(x), sin(x)]> , Eq. (6) corresponds to the
= Eω∼p(·) exp(iω> x) exp(iω> x0 )∗ ,
Gaussian kernel, which is the standard RFF model [9] for
where the symbol z ∗ denotes the complex conjugate of z . The Gaussian kernel approximation. If we consider the commonly
kernels used in practice are typically real-valued and thus the used ReLU activation σ(x) = max{0, x} in neural networks,
imaginary part in Eq. (4) can be discarded. According to Eq. (4), Eq. (6) corresponds to the first order arc-cosine √ kernel, termed as
0 1
RFF makes use of the standard Monte Carlo sampling scheme to k(x, x ) ≡ κ1 (u) = π (u(π −arccos(u))+ 1 − u2 ) by setting
approximate k(x, x0 ). In particular, one uses the approximation u := hx, x0 i/(kxkkx0 k). If the Heaviside step function σ(x) =
1
2 (1 + sign(x)) is used, Eq. (6) corresponds to the zeroth order
k(x, x0 ) = Eω∼p [ϕp (x)> ϕp (x0 )] ≈ k̃p (x, x0 ) := ϕp (x)> ϕp (x0 ) arc-cosine kernel, termed as k(x, x0 ) ≡ κ0 (u) = 1− 1 arccos(u)
π
with the explicit feature mapping2 by setting u := hx, x0 i/(kxkkx0 k), refer to arc-cosine kernels
[60] for details. If we take other activation functions used in neural
1 > > >
ϕp (x) := √ exp(−iω1 x), · · · , exp(−iωs x)] , (5) networks, e.g., erf activations [61], GELU [62] in Eq. (6), such
s two-layer neural network also corresponds to a kernel. In this case,
where {ωi }i=1 are sampled from p(·) independently of the the standard RFF model is still valid (via Monte Carlo sampling
s
training set. Consequently, the original kernel matrix K = from a Gaussian distribution) for these non-stationary kernels.
[k(xi , xj )]n×n can be approximated by K ≈ K fp = Zp Z> Further, for a fully-connected deep neural network (more than
p
>
with Zp = [ϕp (x1 ), · · · , ϕp (xn )] ∈ R n×s
. It is convenient two layers) and fixed random weights before the output layer,
to introduce the shorthand >
zp (ωi , xj ) := exp(−iωi xj ) such if the hidden layers are wide enough, one can still approach a
√ > kernel obtained by letting the widths tend to infinity [63], [64].
that ϕp (x) = 1/ s[zp (ω1 , x), · · · , zp (ωs , x)] . With this
0
notation, the approximate kernel k̃p (x, x ) can be rewritten as If both intermediate layers and the output layer are trained by
0 1 Ps
k̃p (x, x ) = s i=1 zp (ωi , x)zp (ωi , x ).0 (stochastic) gradient descent, for the network f (x; θ) with large
A similar characterization in Eq. (4) is available for rotation- enough s, the model remains close to its linearization around its
invariant kernels, where the Fourier basis functions are spherical random initialization throughout training, known as lazy training
harmonics [56], [57]. Here rotation-invariant kernels are dot- regime [65]. Learning is then equivalent to a kernel method with
product kernels defined on the unit sphere X = S d−1
:= {x ∈ another architecture-specific kernel, known as neural tangent kernel
d
R : kxk2 = 1}, and can be represented as a non-negative (NTK, [12]). Interestingly, NTK for two-layer ReLU networks
expansion with spherical harmonics, refer to the book [58] for [66] can be constructed by arc-cosine kernels, i.e., k (x, x0 ) =
0
details. kxkkx k[uκ0 (u) + κ1 (u)]. In fact, there is an interesting line of
work showing insightful connections between kernel methods and
Theorem 2 ([56]). A rotation-invariant continuous function k : (over-parameterized) neural networks, but this is out of scope of
Sd−1 × Sd−1 → R is positive definite if and only if it has a this survey on random features. We suggest the readers refer to
d
symmetric non-negative expansion into spherical harmonics Y`,m , some recent literature [13], [67], [68] for details.
that is Further, if we consider the general non-stationary kernels [69],
X∞ N (d,i)
X [70], the spectral representation can be generalized by introducing
k(x, x0 ) ≡ k(hx, x0 i) = Λi Yi,j (x)Yi,j (x0 ) , two random variables ω and ω 0 .
i=0 j=1
2. The subscript in ϕp , Zp , kp (and other symbols) emphasizes the 3. Extreme learning machine [59] is another structure in a two-layer
dependence on the distribution p(·) but can be omitted for notational simplicity. feedforward neural network by randomly hidden nodes.
5
Theorem 3. ( [70], [71], [72]) A non-stationary kernel k is 2.4 Taxonomy of random features based algorithms
positive definite if and only if it admits The key step in random features based algorithms is constructing
Z
>
the following random feature mapping
k(x, x0 ) = exp i ω> x − ω 0 x0 µΨk (dω, dω 0 ) ,
Rd ×Rd 1
ϕ(x) := √ a1 exp(−iω> >
1 x), · · · , as exp(−iωs x)]
>
(7)
s
where µΨk is the Lebesgue-Stieltjes measure on the product space
Rd × Rd associated to some positive definite function Ψk (ω, ω 0 ) so as to approximate the integral (4). Random features {ωi }si=1 can
with bounded variations. be formulated as the feature matrix W = [ω1 , · · · , ωs ]> ∈ Rs×d
in a compact form. Existing algorithms differ in how they select
the points ωi (the transformation matrix W ) and weights ai .
2.3 Commonly used kernels in Random Features Figure 2 presents a taxonomy of some representative random
Random features based algorithms often consider the following features based algorithms. They can be grouped into two categories,
kernels: data-independent algorithms and data-dependent algorithms, based
i) Gaussian kernel: Arguably the most important member of on whether or not the selection of ωi and ai is independent of the
shift-invariant kernels, the Gaussian kernel is given by training data.
Data-independent random features based algorithms can be
kx − x0 k22
further categorized into three classes according to their sampling
k(x, x0 ) = exp − ,
2ς 2 strategy:
i) Monte Carlo sampling: The points {ωi }si=1 are sampled
where ς > 0 is the kernel width. The density (see Theorem 1 from p(·) in Eq. (4) (see the red box in Figure 2). In particular,
or Eq. (6)) associated with the Gaussian kernel is Gaussian ω ∼ to approximate the Gaussian kernel by RFF [9], these points are
N (0, ς −2 Id ). sampled from the Gaussian distribution p = N (0, ς −2 Id ), with
ii) arc-cosine kernels: This class admits Eq. (6) by sampling the weights being equal, i.e., ai ≡ 1 in Eq. (7). To reduce the
from the Gaussian distribution N (0, Id ), that can be connected storage and time complexity, one may replace the dense Gaussian
to a two-layer neural networks with various activation functions. matrix in RFF by structural matrices; see, e.g., Fastfood [36]
Following [60], we define the b-order arc-cosine kernel by using Hadamard matrices as well as its general version P -model
1 [41]. An alternative approach is using circulant matrices; see,
k(x, x0 ) = kxkb2 kx0 kb2 Jb (θ) , e.g., Signed Circulant Random Features (SCRF) [40]. To improve
π
the approximation quality, a simple and effective approach is to
> 0
where θ = cos−1 kxkx2 kx
x
0k and use an `2 -normalization scheme, which leads to Normalized RFF
2
(NRFF) [79]. Another powerful technique for variance reduction
b is orthogonalization to decrease the randomness in Monte Carlo
π−θ
1 ∂
Jb (θ) = (−1)b (sin θ)2b+1 . sampling. Typical algorithms include Orthogonal Random Features
sin θ ∂θ sin θ (ORF) [24] by employing an orthogonality constraint to the random
Most common in practice are the zeroth order (b = 0) and first Gaussian matrix, Structural ORF (SORF) [24], [91], and Random
order (b = 1) arc-cosine kernels. The zeroth order kernel is given Orthogonal Embeddings (ROM) [80].
explicitly by ii) Quasi-Monte Carlo sampling: This is a typical sampling
θ scheme in sampling theory [92] to reduce the randomness in
k(x, x0 ) = 1 − , Monte Carlo sampling for variance reduction. It can significantly
π
improve the convergence of Monte Carlo sampling by virtue of
and the first order kernel is a low-discrepancy sequence t1 , t2 , · · · , ts ∈ [0, 1]d instead of
1 a uniform sampling sequence over the unit cube to construct
k(x, x0 ) = kxk2 kx0 k2 (sin θ + (π − θ) cos θ) . the sample points; see the integral representation in the green
π
box in Figure 2. Based on this representation, it can be used
iii) Polynomial kernel: This is a widely used family of non- for kernel approximation, as conducted by [25]. Subsequently,
stationary kernels given by Lyu [43] proposes Spherical Structural Features (SSF), which
generates asymptotically uniformly distributed points on Sd−1
k(x, x0 ) = (1 + hx, x0 i)b , to achieve better convergence rate and approximation quality.
The Moment Matching (MM) scheme [42] is based on the same
where b is the order of the polynomial.
integral representation but uses a d-dimensional refined uniform
Note that, dot-product kernel defined in Rd admit neither
sampling sequence {ti }si=1 instead of a low discrepancy sequence.
spherical harmonics nor Eq. (6). As a result, random features
Strictly speaking, SSF and MM go beyond the QMC framework.
for polynomial kernels work in different theoretical foundations
Nevertheless, these methods share the same integration formulation
and settings, and have been studied in a smaller number of
with QMC over the unit cube and thus we include them here for a
papers, including Maclaurin expansion [34], the tensor sketch
streamlined presentation.
technique [73], [74], and oblivious subspace embedding [75], [76].
iii) Quadrature based methods: Numerical integration tech-
Interestingly, if the data are `2 normalized, dot product kernels
niques can be also used to approximate the integral representation
defined in Rd can be transformed as stationary but indefinite (real,
in Eq. (4). These techniques may involve deterministic selection of
symmetric, but not positive definite) on the unit sphere4 . The related
random features based algorithms under this setting provide biased 4. This setting cannot ensure the data are i.i.d on the unit sphere, which is
estimators [39], [77], or unbiased estimation [78]. different from the setting of previously discussed rotation invariant kernels.
6
(
structural: Fastfood [36], P -model [41], SORF [24]
acceleration circulant: SCRF [40]
i) Monte Carlo sampling
(
`2 normalization: NRFF [79]
variance reduction
orthogonal constraint: ORF [24], ROM [80]
data-independent
QMC [37]
ii) Quasi-Monte Carlo sampling
structural spherical feature: SSF [43]
moment matching: MM [42]
(
deterministic quadrature rules: GQ, SGQ [26]
iii) Quadrature rules
stochastic spherical-radial rule: SSR [27]
leverage score sampling: LSS-RFF
[31], fast leverage score approximation [47], [48], [81]
weighted random features: [32], [82] for RFF, [25] for QMC, [26] for GQ
re-weighted random features kernel alignment: KA-RFF [83] and KP-RFF [44]
compressed low-rank approximation: CLR-RFF [46]
data-dependent
one-stage: ( [84] via generative models
kernel learning by random features joint optimization: [85], [86]
two-stage
spectral learning in mixture models: [87], [88], [89], [90]
others: quantization [45]; doubly stochastic [38]
ii) Quasi-Monte Carlo sampling
i) Monte Carlo sampling • QMC
k(x − x0 ) = p(ω) exp iω> (x − x0 ) dω • SSF
R
• variance reduction
Rd
• MM
• acceleration
k(x − x0 ) = [0,1]d exp i(x−x0 )>Φ−1 (t) dt
R
Qd R ∞ (j)
k(x−x0 ) = (j) (j) (j) 0(j)
j=1 −∞ pj ω exp iω (x − x ) dω
R∞ r2
k(x − x0 ) = (2π)−d/2 e− 2 |r|d−1 g(ru)drdu
R
data-dependent Ud 0
iii) Quadrature rules
• random features
• GQ, SGQ
k(x, x0 ) = Rd q(ω) p(ω) > 0
R
selection/learning q(ω) exp iω (x − x ) dw • SSR
• leverage score
the points and weights, e.g., by using Gaussian Quadrature (GQ) RFF [32], [82], weighted QMC [25], and weighted GQ [26]. Note
[26] or Sparse Grids Quadrature (SGQ) [26] over each dimension that these algorithms directly learn the weights of pre-given random
(their integration formulation can be found in the first blue box in features. Another line of methods re-weight the random features
Figure 2). The selection can also be randomized. For example, in the using a two-step procedure: i) “up-projection”: first generate a
work [27], the d-dimensional integration in Eq. (4) is transformed large set of random features {ωi }Ji=1 ; ii) “compression”: then
to a double integral, and then approximated by using the Stochastic reduce these features to a small number (e.g., 102 ∼ 103 ) in a
Spherical-Radial (SSR) rule (see the second blue box in Figure 2). data-dependent manner, e.g., by using kernel alignment [83], kernel
Data-dependent algorithms use the training data to guide the polarization [44], or compressed low-rank approximation [46].
selection of points and weights in the random features for better
iii) Kernel learning by random features: This class of methods
approximation quality and/or generalization performance. These
aim to learn the spectral distribution of kernel from the data so as
algorithms can be grouped into three classes according to how the
to achieve better similarity representation and prediction. Note that
random features are generated.
these methods learn both the weights and the distribution of the
i) Leverage score sampling: Built upon the importance sampling features, and hence differ from the other random features selection
framework, this class of algorithm replaces the original distribution methods mentioned above, which assume that the candidate features
p(ω) by a carefully chosen distribution q(ω) constructed using are generated from a pre-given distribution and only learn the
leverage scores [51], [52] (see the yellow box in Figure 2). The weights of these features. Representative approaches for kernel
representative approach in this class is Leverage Score based RFF learning involve a one-stage [84] or two-stage procedure [85], [86],
(LS-RFF) [31], and its accelerated version [47], [81]. [87], [88], [89], [90]. From a more general point of view, the
ii) Re-weighted random feature selection: Here the basic idea aforementioned re-weighted random features selection methods
is to re-weight the random features by solving a constrained can also be classified into this class. Since these methods belong to
optimization problem. Examples of this approach include weighted the broad area of kernel learning instead of kernel approximation,
7
we do not detail them in this survey. RFF, especially in high dimensions (e.g., d ≥ 1000). In particular,
Besides the above three main categories, other data-dependent W used in Eq. (8) is substituted by
approaches include the following. i) Quantization random features
1
[45]: Given a memory budget, this method quantizes RFF for WFastfood = B1 HGΓHB2 , (9)
Gaussian kernel approximation. A key observation from this work ς
is that random features achieve better generalization performance where H is the Walsh-Hadamard matrix admitting fast multiplica-
than Nyström approximation [93] under the same memory space. tion in O(d log d) time, and Γ ∈ {0, 1}d×d is a permutation
ii) Doubly stochastic random features [38]: This method uses matrix that decorrelates the eigen-systems of two Hadamard
two sources of stochasticity, one from sampling data points by matrices. The three diagonal random matrices G, B1 and B2
stochastic gradient descent (SGD), and the other from using RFF are specified as follows: G has independent Gaussian entries
to approximate the kernel. This scheme has been used for Kernel drawn from N (0, 1); B1 is a random scaling matrix with
PCA approximation [94], and can be further extended to triply (B1 )ii = kωi k2 /kGkF , which encodes the spectral properties
stochastic scheme for multiple kernel approximation [95]. of the associated kernel; B2 is a binary decorrelation matrix
with independent random {±1} entries. FastFood is an unbiased
estimator, but may have a larger variance than RFF:
3 DATA - INDEPENDENT A LGORITHMS
6τ 4 τ2
−τ 2
In this section, we discuss data-independent algorithms in a V [Fastfood ] − V [RFF ] ≤ e + ,
s 3
unified framework based on the transformation matrix W , that
plays an important role in constructing the mapping ϕ(·) in which converges at an O(1/s) rate.
Eq. (7) and determining how well the estimated kernel converges P -model [41]: A general version of Fastfood, the P -model
to the actual kernel. Table 2 reports various random features constructs the transformation matrix as
based algorithms in terms of the class of kernels they apply
to (in theory) as well as their space and time complexities for WP = [g> P1 , g> P2 , · · · , g> Ps ]> ∈ Rs×d ,
computing the feature mapping W x for a given x ∈ X . In
where g is a Gaussian random vector of length a and P = {Pi }si=1
Table 2, we also summarize the variance reduction properties of
is a sequence of a-by-d matrices each with unit `2 norm columns.
these algorithms, i.e., whether the variance of the resulting kernel
Fastfood can viewed as a special case of the P -model: the matrix
estimator is smaller than the standard RFF. Before proceeding,
HG in Eq. (9) can be constructed by using a fixed budget of
we introduce some notations and definitions. When discussing
randomness in g and letting each Pi be a random diagonal matrix
a stationary kernel function k(x, x0 ) = k(x − x0 ), we use
with diagonal entries of the form Hi1 , Hi2 , . . . , Hid . The P -model
the convenient shorthands τ := x − x0 and τ := kτ k2 .
is unbiased and its variance is close to that of RFF with an O(1/d)
For a random features algorithm A with frequencies {ωi }si=1
convergence rate
sampled from a distribution µ(·)P,s we define>
its
expectation
E(A) := E[k(τ )] = Eω∼µ Ps 1/s i=1 cos(ω
i τ ) and variance V[P -model] − V[RFF] = O (1/d) .
V[A] := V[k(τ )] = V 1s i=1 cos(ω> τ ) .
SCRF [40]: It accelerates the construction of random features
3.1 Monte Carlo sampling based approaches by using circulant matrices. The transformation matrix is
We describe several representative data-independent algorithms WSCRF = [ν ⊗ C(ω1 ), ν ⊗ C(ω2 ), · · · , ν ⊗ C(ωt )]> ∈ Rtd×d ,
based on Monte Carlo sampling, using the Gaussian kernel
k(x, x0 ) = k(τ ) = exp(−kτ k22 /2ς 2 ) as an example. Note that where ⊗ denotes the tensor product, ν = [ν1 , ν2 , . . . , νd ] is a
these algorithms often apply to more general classes of kernels, as Rademacher vector with P(νi = ±1) = 1/2, and C(wi ) ∈ Rd×d
summarized in Table 2. is a circulant matrix generated by the vector ωi ∼ N (0, ς −2 Id ).
RFF [9]: For Gaussian kernels, RFF directly samples the Thanks to the circulant structure, we only need O(s) space to store
random features from a Gaussian distribution (corresponds to the feature mapping matrix WSCRF with s = td. Note that C(wi )
the inverse Fourier transform): {ω}si=1 ∼ p(ω). In particular, the can be diagonalized using the Discrete Fourier Transform for ωi .
corresponding transformation matrix is given by SCRF is unbiased and has the same variance as RFF.
The above three approaches are designed to accelerate the
1 computation of RFF. We next overview representative methods that
WRFF = G, (8)
ς aim for better approximation performance than RFF.
NRFF [79]: It normalizes the input data to have unit `2 norm
where G ∈ Rs×d is a (dense) Gaussian matrix with Gij ∼
before constructing the random Fourier features. With normalized
N (0, 1). For other stationary kernels, the associated p(·) corre-
data, the Gaussian kernel can be computed as
sponds to the specific distribution given by the Bochner’s Theorem.
For example, the Laplacian kernel k(τ ) = exp(−kτ k1 /ς) is 1
x> x0
!
0
associated with a Cauchy distribution. RFF is unbiased, i.e., k(x, x ) = exp − 2 1 − ,
ς kxk2 kx0 k2
E[RFF] = exp(−kτ k22 /2ς 2 ), and the corresponding variance is
2
V[RFF] = (1 − e−τ )2 /2s. which is related to the normalized linear kernel [39], [79]. Albeit
Fastfood [36]: By observing the similarity between the dense simple, NRFF is effective in variance reduction and in particular
Gaussian matrix and Hadamard matrices with diagonal Gaussian satisfies
matrices, Le et al. [36] firstly introduce Hadamard and diagonal 1 −τ 2 2
matrices to speed up the construction of dense Gaussian matrices in V[NRFF] = V[RFF] − e (3 − e−2τ ) .
4s
8
Table 2
Comparison of different kernel approximation methods on space and time complexities to obtain W x.
Method Kernels (in theory) Extra Memory Time Lower variance than RFF
Random Fourier Features (RFF) [9] shift-invariant kernels O(sd) O(sd) -
Quasi-Monte Carlo (QMC) [37] shift-invariant kernels O(sd) O(sd) Yes
Normalized RFF (NRFF) [79] Gaussian kernel O(sd) O(sd) Yes
Moment matching (MM) [42] shift-invariant kernels O(sd) O(sd) Yes
Orthogonal Random Feature (ORF) [24] Gaussian kernel O(sd) O(sd) Yes
Fastfood [36] Gaussian kernel O(s) O(s log d) No
Spherical Structured Features (SSF) [43] shift and rotation-invariant kernels O(s) O(s log d) Yes
Structured ORF (SORF) [24], [91] shift and rotation-invariant kernels O(s) O(s log d) Unknown
Signed Circulant (SCRF) [40] shift-invariant kernels O(s) O(s log d) The same
P-model [41] shift and rotation-invariant kernels O(s) O(s log d) No
Random Orthogonal Embeddings (ROM) [80] rotation-invariant kernels O(d) O(d log d) Yes
Gaussian Quadrature (GQ), Sparse Grids Quadrature (SGQ) [26] shift invariant kernels O(d) O(d log d) Yes
Stochastic Spherical-Radial rules (SSR) [27] shift and rotation-invariant kernels O(d) O(d log d) Yes
ORF [24]: It imposes orthogonality on random features for the SORF [24], [91]: It replaces the random orthogonal matrices
Gaussian kernel and has the transformation matrix used in ORF by a class of structured matrices akin to those in
1 Fastfood. The transformation matrix of SORF is given by
WORF = SQ , √
ς d
WSORF = HD1 HD2 HD3 , (12)
where Q is a uniformly distributed random orthogonal matrix, and ς
S is a diagonal matrix with diagonal entries sampled i.i.d from where H is the normalized Walsh-Hadamard matrix and Di ∈
the χ-distribution with d degrees of freedom. This orthogonality Rd×d , i = 1, 2, 3 are diagonal “sign-flipping” matrices, of
constraint is useful in reducing the approximation error in random which each diagonal entry is sampled from the Rademacher
features. It is also considered in [96] for unifying orthogonal Monte distribution. Bojarski et al. [91] consider more general structures
Carlo methods. ORF is unbiased and with variance bounded by for the three blocks of matrices HDi in Eq. (12). Note that
2
first block HD1 satisfies
!
1 g(τ ) (d − 1)e−τ τ 4 each
h block plays a different
i role. The 2
V[ORF] − V[RFF] ≤ − , Pr kHD1 xk∞ > log √d ≤ 2de− log8 d
for any x ∈ Rd with
s d 2d d
log2 d
2 kxk2 = 1, termed as (log d, 2de− 8 )-balanced, hence no
where we have g(τ ) = eτ τ 8 + 6τ 6 + 7τ 4 + τ 2 /4
2 dimension carries too much of the `2 norm of the vector x. The
+eτ τ 4 τ 6 + 2τ 4 /2d. It can be seen that the variance
second block HD2 ensures that vectors are close to orthogonal.
reduction property Var[ORF] < Var[RFF] holds under some The third block HD3 controls the capacity of the entire structured
conditions, e.g., when d is large and τ is small. For a large d, the transform by providing a vector of parameters. SORF is not an
ratio of the variances of ORF and RFF can be approximated by unbiased estimator of the Gaussian kernel, but it satisfies an
2
V[ORF] (s − 1)e−τ τ 4 asymptotic unbiased property
≈1− 2 . (10)
6τ
V[RFF] d 1 − e−τ 2 2
E [SORF] − e−τ /2 ≤ √ .
d
Choromanski et al. [97] further improve the variance bound to
ROM [80]: It generalizes SORF to the form
V[RFF]−V[ORF] = √ t
p dY
J d −1 ( R12 + R22 τ )Γ(d/2)
" #
s−1 WROM = HDi ,
ER1 ,R2 2
p d
ς i=1
s ( R12 + R22 τ /2) 2 −1 (11)
" #2 where H can be the normalized Hadamard matrix or the Walsh
s−1 J d −1 (R1 τ ) Γ(d/2) matrix, and Di is the Rademacher matrix as defined in SORF.
− ER1 2
d
−1
,
s (R1 τ /2) 2 Theoretical results in [80] show that the ROM estimator achieves
variance reduction compared to RFF. Interestingly, odd values of t
where Jd is the Bessel function of the first kind of degree d, and yield better results than even t. This provides an explanation for
R1 and R2 are two independent scalar random variables satisfying why SORF chooses t = 3.
ω1 = R1 v and ω2 = R2 v with ω1 , ω2 ∼ N (0, ς −2 Id ) and LP-RFF [45]: It attempts to quantize RFF with the Gaussian kernel
v ∼ Unif(S d−1 ). According to Eq. (11), the property V[ORF] < under a memory budget, i.e., mapping each s-dimensional
V[RFF] holds asymptotically in cases: i) a fixed d and a small p p p random
1 feature zp (x) = 2/s cos(WRFF x) ∈ [− 2/s, 2/s] to an
enough τ with E[kωk42 ] ≤ ∞; ii) a fixed τ < 4√ with some s-dimensional low precision vector with b bits
constant c and a large d, in which case we have
c
p via apstochastic
rounding scheme. They divide the interval [− 2/s, 2/s] into
s − 1 1 τ 4 − τ22 2pb − 1 equal-sized sub-intervals and randomly round each value
1
V[RFF] − V[ORF] = e ς +O . 2/s cos(ωi x) to either the top or bottom of the corresponding
s 2d ς 2 d
9
sub-interval. Strictly speaking, this method does not belong to data- SSF [43]: It improves the space and time complexities of
independent algorithms. But we put it here for ease of description QMC for approximating shift- and rotation-invariant kernels.
as this approach directly quantizes RFF. More importantly, a SSF generates points {v1 , v2 , · · · , vs } asymptotically uniformly
new insight demonstrated by this method is that, under the same distributed on the sphere Sd−1 , and construct the transformation
memory budget, random features based algorithms achieve better matrix as
generalization performance than Nyström approximation [93].
Apart from the stochastic quantization scheme used in [45], the WSSF = [Φ−1 (t)v1 , Φ−1 (t)v2 , · · · , Φ−1 (t)vs ]> ∈ Rs×d ,
authors of [98] employ Lloyd-Max quantization with a smaller where Φ−1 (t) uses the one-dimensional QMC point. The structure
number of bits. matrix V := [v1 , v2 , · · · , vs ] ∈ S(d−1)×s has the form
From the above description, one can find that orthogonalization
is a typical operation for variance reduction, e.g., ORF/SORF/ROM. 1 Re FΛ − Im FΛ
V = ∈ Rd×s ,
d/2 Im FΛ Re FΛ
p
Here we take the Gaussian kernel as an example to illustrate
insights of such scheme. By sampling {ωi }si=1 ∼ N (0, ς −2 Id ), d s
the used Gaussian distribution is isotropic and only depends on the where FΛ ∈ C 2 × 2 consists of a subset of the rows of the discrete
s s
norm kωk2 instead of ω . The used orthogonal operator makes the Fourier matrix F ∈ C 2 × 2 . The selection of d2 rows from F is
direction of ωi orthogonal to each other (that means more uniform) done by minimizing the discrete Riesz 0-energy [99] such that the
while retaining its norm unchanged5 , which leads to decrease the points spread as evenly as possible on the sphere.
randomness in Monte Carlo sampling, and thus achieve variance MM [42]: It also uses the transformation matrix in Eq. (14),
reduction effect. If we attempt to directly decrease the randomness but generates a d-dimensional uniform sampling sequence {ti }si=1
in Monte Carlo sampling, QMC is a powerful way to achieve this by a moment matching scheme instead of using a low discrepancy
goal and can then be used to kernel approximation. This is another sequence as in QMC. In particular, the transformation matrix is
line of random features with variance reduction illustrated as below. e −1 (t1 ), Φ
e −1 (t2 ), · · · , Φ
e −1 (ts )]> ∈ Rs×d , (15)
WMM = [Φ
3.2 Quasi-Monte Carlo Sampling where one uses moment matching to construct the vectors
e −1 (ti ) = Ã−1 (Φ−1 (ti ) − µ̃) with the sample mean µ̃ =
Φ
Here we briefly review methods based on quasi-Monte Carlo 1 Ps −1
sampling (QMC) [37], spherical structured feature (SSF) [43], s i=1 Φ (ti ) and the square root of the sample covariance
and moment matching (MM) [42]. These three methods achieve a matrix à satisfying ÃÃ> = Cov(Φ−1 (ti ) − µ̃).
lower variance or approximation error than RFF. Strictly speaking, To achieve the target of variance reduction, both orthogonaliza-
the later two algorithms do not belong to the quasi-Monte Carlo tion in Monte Carlo sampling and QMC based algorithms share
sampling framework. However, SSF and MM share the same the similar principle, namely, generating random features that
integration formulation with QMC and thus we introduce them are as independent/uniform as possible. To be specific, QMC
here for simplicity. and MM are able to generate more uniform data points to avoid
Classical Monte Carlo sampling generates a sequence of undesirable clustering effect, see Figure 1 in [37]. Likewise, SSF
samples randomly and independently, which may lead to an aims to generate asymptotically uniformly distributed points on
undesired clustering effect and empty spaces between the samples the sphere Sd−1 , which attempts to encode more information
[92]. Instead of fully random samples, QMC [37] outputs low- with fewer random features, and thus allows for variance
discrepancy sequences. A typical QMC sequence has a hierarchical reduction. In sampling theory, QMC can be further improved
structure: the initial points are sampled on a coarse scale whereas by an sub-grouped based rank-one lattice construction [100] for
the subsequent points are sampled more finely. For approximating computational efficiency, which can be used for the subsequent
a high-dimensional integral, QMC achieves an asymptotic error kernel approximation.
convergence rate of = O((log s)d /s), which is faster than the
O(s−1/2 ) rate of Monte Carlo. Note however that QMC often 3.3 Quadrature based Methods
requires s to be exponential in d for the improvement to manifest. Quadrature based methods build on a long line of work on
QMC [37]: It assumes that p(·) factorizes with respect to the numerical quadrature for estimating integrals. In quadrature
Qd
dimensions, i.e., p(x) = j=1 pj (xj ), where each pj (·) is a methods, the weights are often non-uniform, and the points are
univariate density function. QMC generally transforms an integral usually selected using deterministic rules including Gaussian
on Rd in Eq. (4) to one on the unit cube [0, 1]d as quadrature (GQ) [26], [101] and sparse grids quadrature (SGQ) [26].
Deterministic rules can be extended to their stochastic versions. For
Z
k(x − x0 ) = exp i(x − x0 )> Φ−1 (t) dt ,
(13) example, Munkhoeva et al. [27] explore the stochastic spherical-
[0,1]d
radial (SSR) rule [102], [103] in kernel approximation. Below we
where Φ−1 (t) = Φ−1 −1 d
1 (t1 ) , · · · , Φd (td ) ∈ R with Φj being briefly review these methods.
the cumulative distribution function (CDF) of pj . Accordingly, by GQ [26]: It assumes that the kernel function k factorizes
generating a low discrepancy sequence t1 , t2 , · · · , ts ∈ [0, 1]d , with respect to the dimensions and the corresponding distribution
the random frequencies can be constructed by ωi = Φ−1 (ti ). The p(ω) = p([ω (1) , ω (2) , . . . , ω (d) ]> ) in Eq. (4) is sub-Gaussian.
corresponding transformation matrix for QMC is Therefore, the d-dimenionsal integral in Eq. (4) can be factorized
as
WQMC = [Φ−1 (t1 ), Φ−1 (t2 ), · · · , Φ−1 (ts )]> ∈ Rs×d . (14)
Yd Z ∞
0 (j) (j) (j) 0(j)
(j)
5. In fact, while orthogonalization only makes the direction of {ωi }si=1 k(x−x ) = pj ω exp iω (x − x ) dω .
more uniform, one can make the length kωi k2 uniform by sampling from the j=1 −∞
cumulative distribution function of kωk2 . (16)
10
Since each of the factors is a one-dimensional integral, we can the above two rules, we have the SSR rule. Accordingly, the
approximate them using a one-dimensional quadrature rule. For transformation matrix of SSR is
example, one may use Gaussian quadrature [101] with orthogonal
(QV )>
polynomials: WSSR = ϑ ⊗ ∈ R2(d+1)×d ,
−(QV )>
∞ L
with ϑ = [ϑ1 , ϑ2 , · · · , ϑs ] and V = [v1 , v2 , · · · , vd+1 ], where
Z X
p(ω) exp(iω(x − x0 ))dω ≈ aj exp iγ> 0
j (x − x ) , ϑ ∼ χ(d + 2) and {vi }d+1
−∞ j=1 i=1 are the vertices of a unit regular
(17) d-simplex, which is randomly rotated by Q. To get s features, one
where L is the accuracy level and each γj is a univariate point may stack s/(2d + 3) independent copies of W as suggested by
associated with the weight aj . For a third-point rule with the [27]. Finally, the feature mapping by SSR is given by
points {−p̂1 , 0, p̂1 } and their associated weights (â1 , â0 , â1 ), the ϕ(x) = [a0 g(0), a1 g(w> >
1 x), · · · , as g(ws x)] ,
transformation matrix WGQ ∈ Rs×d has entries Wij following the r q
distribution Pd+1
where a0 = 1 − j=1 ρd2 , aj = ρ1j 2(d+1) d
for j ∈ [s], and
j
Pr (Wij = ±p̂1 ) = â1 , Pr (Wij = 0) = â0 , ∀i ∈ [s], j ∈ [d] . wj is the j -th element of the stacked W .
In general, according to Eq. (6), kernel approximation
In general, the univariate Gaussian quadrature with L quadrature by random features is actually a d-dimensional integration
points is exact for polynomials up to (2L − 1) degrees. The approximation problem in mathematics. Sampling methods and
multivariate Gaussian quadrature is exact for all polynomials of quadrature based rules are two typical classes of approaches for
the form ω1i1 ω2i2 · · · ωdid with 1 ≤ ij ≤ 2L − 1; however the total high-dimensional integration approximation. Efforts on quadrature
number of points s = Ld scales exponentially with the dimension based methods focus on developing a high-accuracy, mesh-free,
d and thus this method suffers from the curse of dimensionality. efficiency rule, e.g., [105], [106]. Note that, if the integrand
SGQ [26]: To alleviate the curse of dimensionality, SGQ g(ω) := σ(ω > x)σ(ω> x0 ) in the integration representation (6)
uses the Smolyak rule [104] to decrease the needed number belongs to a RKHS, the above quadrature rules can be termed as
of points. Here we consider the third-degree SGQ using the kernel-based quadrature, e.g., Bayesian quadrature [107], [108]
symmetric univariate quadrature points {−p̂1 , 0, p̂1 } with weights and leverage-score quadrature [52]. This approach is in essence
(â1 , â0 , â1 ): different from the previously studied quadrature rules in functional
d
spaces, model formulation, and scope of application.
X
k(x, x0 ) ≈ (1−d+dâ0 ) g(0) + â1
g (p̂1 ej )+g (−p̂1 ej ) , 4 DATA - DEPENDENT ALGORITHMS
j=1
Data-dependent approaches aim to design/learn the random features
where the function g(ω) := σ(ω > x)σ(ω> x0 ) is given by Eq. (6), using the training data so as to achieve better approximation quality
and ei is the d-dimensional standard basis vector with the i-th or generalization performance. Based on how the random features
element being 1. The corresponding transformation matrix is are generated, we can group these algorithms into three classes:
leverage score sampling, random features selection, and kernel
WSGQ = [0d , p̂1 e1 , · · · , p̂1 ed , −p̂1 e1 , · · · , −p̂1 ed ]> ∈ R(2d+1)×d, learning by random features.
The quantity dλK n determines the number of independent random features, so that the kernel matrix matches the target kernel
parameters in a learning problem and hence is referred to as the yy> . Problem (24) can be efficiently solved via bisection over a
number of effective degrees of freedom [110], [111]. With the above scalar dual variable, and an -suboptimal solution can be found in
notation, the distribution q designed in [51] is given by O(J log(1/)) time.
KP-RFF (Kernel Polarization-RFF) [44]: It first generates a
lλ (ω) lλ (ω)
q(ω) := R = λ . (22) large number of random features by RFF and then selects a subset
lλ (ω)dω dK from them using an energy-based scheme
Compared to standard Monte Carlo sampling for RFF, leverage n
1X
score sampling requires fewer Fourier features and enjoys nice S̃(ω) = yi zp (xi , ω) .
theoretical guarantees [31], [51] (see the next section for details). n i=1
Note that q(ω) can be also defined by the integral operator [52], PJ
Further, the quantity (1/J) i=1 S̃ 2 (ωj ) can be associated with
[112] rather than the Gram matrix used above, but we do not
kernel polarization for {wi }Ji=1 sampled from p(ω). Accordingly,
strictly distinguish these two cases. The typical leverage score
the top s random features with the top |S̃(·)| values are selected as
based sampling algorithm for RFF is illustrated in [31] as below.
the refined random features. This algorithm can in fact be regarded
LS-RFF (Leverage Score-RFF) [31]: It uses a subset of data to
as a version of the kernel alignment method for generating random
approximate the matrix K in Eq. (21) so as to compute dλK . LS-
features.
RFF needs O(ns2 + s3 ) time to generate refined random features,
CLR-RFF (Compression Low Rank-RFF) [46]: It first gener-
which can be used in KRR [31] and SVM [11] for prediction.
ates a large number of random features and then selects a subset
SLS-RFF (Surrogate Leverage Score-RFF) [47]: To avoid
from them by approximately solving the optimization problem
inverting an s × s matrix in LS-RFF, SLS-RFF designs a simple
but effective surrogate leverage function 1 2
min 2
ZJ Z> J −Z eJ (a)Z eJ (a)> =
a∈RJ :kak0 ≤s n F
(25)
1 >
Lλ (w) = p(w)z> (X) yy + nI zp,w (X) ,
h i
p,w > >
n2 λ E i.i.d.
i,j ∼ [J]
ϕp (x i ) ϕ p (x j ) − ϕ
e p (x i ) ϕ
ep (xj ) ,
(23)
where the additional term nI and the coefficient 1/(n2 λ) in where ϕp (x) ∈ RJ uses J random features, and ϕ
ep (x) is
Eq. (23) ensure that Lλ is a surrogate function that upper bounds 1 >
the function lλ in Eq. (20). One then samples random features ep (x) := √ a1 exp(−iω>
ϕ 1 x), · · · , aJ exp(−iωJ x)
>
,
L (ω) J
from the surrogate distribution Q(w) = R Lλλ(ω)dω , which has
the same time complexity O(ns2 ) as RFF. SLS-RFF and can be which leads to Z eJ (a) = [ϕ ep (x1 ), ϕep (x2 ), · · · , ϕ
ep (xn )] ∈
applied to KRR [47] and Canonical Correlation Analysis [109]. Rn×J . We can construct a Monte-Carlo estimate of the opti-
Note that leverage scores sampling is a powerful tool used in mization objective function in Eq. (25) by sampling some pairs
i.i.d.
sub-sampling algorithms for approximating large kernel matrices i, j ∼ [J]. Therefore, this scheme focuses on a subset of pairs,
with theoretical guarantees, in particular in Nyström approximation. instead of the all data pairs, by seeking a sparse weight vector a
Research on this topic mainly focuses on obtaining fast leverage with only s nonzero elements. The problem of building a small,
score approximation due to inversion of an n-by-n kernel matrix, weighted subset of the data that approximates the full dataset,
e.g., two-pass sampling [113] (LS-RFF belongs to this), online is known as the Hilbert coreset construction problem. It can be
setting [114], path-following algorithm [81], or developing various approximately solved by greedy iterative geodesic ascent [116]
surrogate leverage score sampling based algorithms [47], [48], or Frank-Wolfe based methods [117]. Another way to obtain the
[109]. compact random features is using Johnson-Lindenstrauss random
projection [118] instead of the above data-dependent optimization
4.2 Re-weighted random features scheme.
Here we briefly review three re-weighted methods: KA-RFF [83]
4.3 Kernel learning by random features
by kernel alignment, KP-RFF [44] by kernel polarization, and
CLR-RFF [46] by compressed low-rank approximation. This class of approaches construct random features using sophisti-
KA-RFF (Kernel Alignment-RFF) [83]: It pre-computes a large cated learning techniques, e.g., by learning the spectral distribution
number of random features that are generated by RFF, and then of kernel from the data.
select a subset of them by solving a simple optimization problem Representative approaches in this class often involve a one-
based on kernel alignment [115]. In particular, the optimization stage or two-stage process. The two-stage scheme is common when
problem is using random features. It first learns the random features, and then
incorporates them into kernel methods for prediction. Actually, the
n J
X X above-mentioned leverage sampling and random features selection
max yi yj at zp (xi , ωt ) zp (xj , ωt ) , (24)
a∈PJ based algorithms employ this scheme. The algorithm proposed in
i,j=1 t=1
[84] is a typical method for kernel learning by random features.
where J > s is the number of the candidate random features by This method first learns a spectral distribution of a kernel via an
RFF, and a is the weight vector. Here the maximization is over the implicit generative model, and then trains a linear model by these
set of distributions PJ := {a : Df (ak1/J) ≤R c}, where c > 0 learned features.
dP
is a pre-specified constant and Df (P kQ) := f ( dQ )dQ with One-stage algorithms aim to simultaneously learn the spectral
2 2
f (t) = t − 1 is the χ -divergence between the distributions distribution of a kernel and the prediction model by solving a
P and Q (a special case of the f -divergence). Solving the single joint optimization problem or using a spectral inference
problem (24) learns a (sparse) weight vector a of the candidate scheme. For example, Yu et al. [85] propose to jointly optimize
12
improved in [11], [31], [52] under various settings. Note that some
kk − k̃k∞ : [9], [30], [49], [50]
results above do not directly apply to the squared loss in KRR,
kk − k̃kLr : [30], [49]
whose Lipschitz parameter
√ is unbounded. For squared losses, Rudi
approximation error
∆-spectral approximation: [51], [97]
et al. [53] show that Ω( n log n) random features by√ RFF suffice
(∆ , ∆ )-spectral approximation: [45]
1 2 to achieve a minimax optimal learning rate O(1/ n). A more
empirical risk:
[45], [51] ( refined analysis is given in [31] under the p(ω)-sampling and
ω ∼ p(ω): [31], [53] q(ω)-sampling settings.
squared loss ω ∼ q(ω): [31]
Below we discuss the above theoretical work in more details.
expected risk
(
ω ∼ p(ω): [31], [119]
Lipschitz continuous ω ∼ q(ω): [11], [31], [52]
5.1 Approximation error
Figure 3. Taxonomy of theoretical results on random features. Table 3 summarizes representative theoretical results on the
convergence rates, the upper bound of the growing diameter, and the
the nonlinear feature mapping matrix W and the linear model resulting sample complexity under different metrics. Here sample
with the hinge loss. The associated optimization problem can be complexity means the number of random features sufficient for
solved in an alternating fashion with SGD. In [86], the kernel achieving a maximum approximation error at most .
alignment approach in the Fourier domain and SVM are combined The first result of this kind is given by Rahimi and Recht
into a unified framework, which can be also solved using an [9], who use a covering number argument to derive a uniform
alternating scheme by Langevin dynamics and projection gradient convergence guarantee as follows. For a compact subset S of Rd ,
descent. Wilson and Adams [87] construct stationary kernels as the let |S| := supx,x0 ∈S kx − x0 k2 be its diameter and consider the
Fourier transform of a Gaussian mixture based on Gaussian process L∞ error kk − k̃k∞ := supx,x0 ∈S |k(x, x0 ) − k̃(x, x0 )|.
frequency functions. This approach can be extended to learning
Theorem 4. [Uniform convergence of RFF [9], [30]] Let S be
with Fastfood [88], non-stationary spectral kernel generalization
a compact subset of Rd with diameter |S|. Then, for a stationary
[70], [71], and the harmonizable mixture kernel [89]. Moreover,
kernel k and its approximated kernel k̃ obtained by RFF, we have
Oliva et al. [90] propose a nonparametric Bayesian model, in
which p(ω) is modeled as a mixture of Gaussians with a Dirichlet 2d
ςp |S| s2
h i d+2
process prior. The parameters of the Gaussian mixture and the Pr kk − k̃k∞ ≥ ≤ Cd exp − ,
classifier/regressor model are inferred using MCMC. 4(d + 2)
Table 3
Comparison of convergence rates and required random features for kernel approximation error.
Metric Results Convergence rate Upper bound of |S| Required random features s
q q
|S|
Theorem 4 ( [9], [30]) Op |S| log s
s
|S| ≤ Ω s
log s s ≥ Ω d−2 log
q
log |S|
kk − k̃k∞ |S| ≤ Ω(sc )1 s ≥ Ω d−2 log |S|
Theorem 1 in [49] Op s
q
log |S|
|S| ≤ Ω(sc ) s ≥ Ω −2 log |S|
Theorem 1 in [50] (Gaussian kernels) Op s
q
2d log |S|
r
s
s ≥ Ω d−2 log |S|
kk − k̃kLr (1 ≤ r < ∞) Corollary 2 in [49] Op |S| r
s |S| ≤ Ω ( log s)
4d
q
2d r
1
s ≥ Ω d−2 log |S|
kk − k̃kLr (2 ≤ r < ∞) Theorem 3 in [49] Op |S| r
s |S| ≤ Ω s 4d
q
nλ
Theorem 7 in [51] Op s - s ≥ Ω(nλ log dλ
K)
∆-spectral approximation
Theorem 5.4 in [97] (Gaussian kernels) ORFF/ORF 1
sλ2
- s ≥ Ω(n2α )
r !
dλ
Lemma 6 in [51] Oq K
s - s ≥ Ω(dλ λ
K log dK )
q
nλ
(∆1 , ∆2 )-spectral approximation Theorem 2 in [45] OLP s
2
- s ≥ Ω(nλ log dλ
K)
1
c is some constant satisfying 0 < c < 1.
2
LP denotes that {ωi }si=1 are obtained by RFF and then are quantized to a Low-Precision b-bit representation; see [45].
when one quantizes each random Fourier feature ω_i to a low-precision b-bit representation, which allows more features to be stored in the same amount of space.

Theorem 10 (Theorem 2 in [45]). Let K̃ be an s-feature, b-bit LP-RFF approximation of a kernel matrix K and δ ∈ (0, 1). Assume that ‖K‖_2 ≥ λ ≥ δ_b² = 2/(2^b − 1)² and define a := 8 Tr[(K + λI_n)^{-1}(K + δ_b² I_n)]. For ∆_1 ≤ 3/2 and ∆_2 ∈ [δ_b²/λ, 3/2], if the total number of random features satisfies
  s ≥ (8/3) n_λ max{ 2/∆_1², 2/(∆_2 − δ_b²/λ)² } log(a/δ),
then
  Pr[ (1 − ∆_1)(K + λI_n) ⪯ K̃ + λI_n ⪯ (1 + ∆_2)(K + λI_n) ]
    ≥ 1 − a [ exp( −3s∆_1² / (4n_λ(1 + 2∆_1/3)) ) + exp( −3s(∆_2 − δ_b²/λ)² / (4n_λ(1 + 2(∆_2 − δ_b²/λ)/3)) ) ].

Theorem 10 shows that when the quantization noise is small relative to the regularization parameter, using low precision has minimal impact on the number of features required for the (∆_1, ∆_2)-spectral approximation. In particular, as s → ∞, ∆_1 converges to zero for any precision b, whereas ∆_2 converges to a value upper bounded by δ_b²/λ. If δ_b²/λ ≪ ∆_2, using b-bit precision has a negligible effect on the number of features required to attain this ∆_2; see Table 3 for a summary.

5.2 Risk and generalization property

The above results on approximation error are a means to an end. More directly related to the learning performance is understanding the generalization properties of random features based algorithms. To this end, a series of works study the generalization properties of algorithms based on p(ω)-sampling and q(ω)-sampling. Under different assumptions, theoretical results have been obtained for loss functions with/without Lipschitz continuity and for learning tasks including KRR [31], [53] and SVM [11], [32], [52]. Apart from supervised learning with random features, we refer to [10] for randomized nonlinear component analysis, [120] for random features with matrix sketching, [94] for the doubly stochastic gradients scheme, and [121], [122] for statistical consistency.

5.2.1 Assumptions

Before we detail these theoretical results, we summarize the standard assumptions imposed in existing work. Some assumptions are technical, and thus familiarity with statistical learning theory (see Section 2.1) would be helpful. We organize these assumptions in four categories as shown in Figure 4, including i) the existence of f_ρ (Assumption 1) and its stronger version (Assumption 8); ii) quality of random features (Assumptions 2, 6, 7); iii) noise conditions (Assumptions 3, 9, 10); iv) eigenvalue decay (Assumptions 4, 5).

We first state three basic assumptions, which are needed in all of the (regression) results to be presented.

Assumption 1 (Existence [53], [123]). In the regression task, we assume f_ρ ∈ H.

Note that since we consider a potentially infinite dimensional RKHS H, possibly universal [124], the existence of the target function f_ρ is not automatic. However, if we restrict to a bounded subspace of H, i.e., H_R = {f ∈ H : ‖f‖ ≤ R} with R < ∞ fixed a priori, then a minimizer of the risk E(f) always exists as long as H_R is not universal. If f_ρ exists, then it must lie in a ball of some radius R_{ρ,H}. The results in this section do not require prior knowledge of R_{ρ,H} and they hold for any finite radius.

Assumption 2 (Random features are bounded and continuous [53]). For the shift-invariant kernel k, we assume that ϕ(ωᵀx) in Eq. (6) is continuous in both variables and bounded, i.e., there exists κ ≥ 1 such that |ϕ(ωᵀx)| < κ for all x ∈ X and ω ∈ R^d.
i) existence: regression: f_ρ ∈ H (Ass. 1) ⇐ source condition (Ass. 8)
ii) quality of random features: bounded and continuous (Ass. 2); q(ω)-sampling: compatibility condition (Ass. 6) ⇐ optimized distribution (Ass. 7)
iii) noise condition: regression: boundedness on y (Ass. 3); classification: Massart's low noise condition (Ass. 9) ⇐ separation condition (Ass. 10)
iv) eigenvalue decays assumption (Ass. 4): exponential decay; polynomial decay and harmonic decay ⇔ capacity condition (Ass. 5)

Figure 4. Relationship between the needed assumptions. The notation A ⇐ B means that B is a stronger assumption than A.
Figure 5. Maps between various spaces (the input space X, the RKHS H, the random features space H̃, and L²_{ρX}, connected by the maps k(x, ·), ϕ(·), I, and A).

Assumption 3 (Bernstein's condition [124], [125]). For any x ∈ X, we assume E[|y|^b | x] ≤ (1/2) b! ς² B^{b−2} when b ≥ 2.

This noise condition is weaker than the boundedness on y. It is satisfied when y is bounded, sub-Gaussian, or sub-exponential. In particular, if y ∈ [−b/2, b/2] almost surely with b > 0, then Assumption 3 is satisfied with ς = B = b.

The above three assumptions are needed in all theoretical results for regression presented in this section, so we omit them when stating these results. We next introduce several additional assumptions, which are needed in some of the theoretical results.

Eigenvalue Decay Assumptions: The following assumption, which characterizes the "size" of the RKHS H of interest, is often discussed in learning theory.

Assumption 4 (Eigenvalue decays [111]). A kernel matrix K admits the following three types of eigenvalue decays: 1) Geometric/exponential decay: λ_i(K) ∝ n exp(−i/c), which leads to d_K^λ ≲ log(R_0/λ); 2) Polynomial decay: λ_i(K) ∝ n i^{−2a}, which implies d_K^λ ≲ (1/λ)^{1/(2a)}; 3) Harmonic decay: λ_i(K) ∝ n/i, which results in d_K^λ ≲ 1/λ.

We give some remarks on the above assumption. For shift-invariant kernels, if the RKHS is small, the eigenvalues of the kernel matrix K often admit a fast decay. Consequently, functions in the RKHS are smooth enough that a good prediction performance can be achieved. On the other hand, if the RKHS is large and the eigenvalues decay slowly, then functions in the RKHS are not smooth, which would lead to a sub-optimal error rate for prediction. This can be linked to the integral operator [123], [124] characterizing the hypothesis space, defined as Σ : L²_{ρX} → L²_{ρX} such that
  (Σg)(x) = ∫_X k(x, x′) g(x′) dρ_X(x′), ∀g ∈ L²_{ρX}.

It is clear that the operator Σ is self-adjoint, positive definite, and trace-class when k(·, ·) is continuous. This operator can be represented as Σ = II* in terms of the inclusion operator I : H → L²_{ρX}, (If) = f. Here I* is the adjoint of I and is given by
  I* : L²_{ρX} → H, (I*f)(·) = ∫_X k(x, ·) f(x) dρ_X,
due to the self-adjoint property of the Hilbert spaces L²_{ρX} and H [122]. With s random features, the inclusion operator I can be approximated by the operator A : H̃ → L²_{ρX}, (Aβ) = ⟨ϕ(·), β⟩_{H̃}, ∀β ∈ R^s. Figure 5 presents the relationship between various spaces under different operators.

The integral operator Σ plays a significant role in characterizing the hypothesis space. In particular, the decay rate of the spectrum of Σ quantifies the capacity of the hypothesis space in which we search for the solution. This capacity in turn determines the number of random features required for accurate learning. Rudi and Rosasco [53] consider the following assumption on Σ.

Assumption 5 (Capacity condition [123], [126]). There exist Q > 0 and γ ∈ [0, 1] such that for any λ > 0, we have
  N(λ) := tr[(Σ + λI)^{−1} Σ] ≤ Q² λ^{−γ}.   (26)

The effective dimension N(λ) [110] measures the "size" of the RKHS, and is in fact the operator form of d_K^λ in Eq. (21). Assumption 5 holds if the eigenvalues λ_i of Σ decay as i^{−1/γ}, which corresponds to the eigenvalue decay of K in Assumption 4 with γ := 1/(2a) [127]. The case γ = 0 is the more benign situation, whereas γ = 1 is the worst case.

Quality of Random Features: Here we introduce several technical assumptions on the quality of random features. The leverage score in Eq. (20) admits the operator form
  F_∞(λ) := sup_ω ‖(Σ + λI)^{−1/2} ϕ(x)‖²_{L²_{ρX}}, ∀λ > 0,
which is also called the maximum random features dimension [53]. By definition we always have N(λ) ≤ F_∞(λ). Roughly speaking, when the random features are "good", it is easy to control their leverage scores in terms of the decay of the spectrum of Σ. Further, fast learning rates using fewer random features can be achieved if the features are compatible with the data distribution in the following sense.

Assumption 6 (Compatibility condition [53]). With the above definition of F_∞(λ), assume that there exist ϱ ∈ [0, 1] and F > 0 such that F_∞(λ) ≤ F λ^{−ϱ}, ∀λ > 0.

It always holds that F_∞(λ) ≤ κ² λ^{−1} when z is uniformly bounded by κ. So the worst case is ϱ = 1, which means that the random features are sampled in a problem independent way. The favorable case is ϱ = γ, which means that N(λ) ≤ F_∞(λ) ≤ O(n^{−αγ}). In [11], the authors consider the following assumption.

Assumption 7 (Optimized distribution [11]). The feature mapping z(ω, x) is called optimized if there is a small constant λ_0 such that for any λ ≤ λ_0, F_∞(λ) ≤ N(λ) = Σ_{i=1}^∞ λ_i(Σ)/(λ_i(Σ) + λ).
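To make the effective dimension and the leverage function more tangible, the following minimal sketch (not from the original paper; the data, the Gaussian kernel width, and all variable names are illustrative assumptions) computes the matrix analogue d_K^λ = Tr[K(K + nλI)^{-1}] of N(λ) and a Monte Carlo estimate of ridge leverage scores of random Fourier features, up to normalization constants that differ across papers.

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    # Exact Gaussian kernel matrix k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
    sq = (X ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :] - 2 * X @ X.T
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n, d, lam = 500, 10, 1e-3
X = rng.standard_normal((n, d))
K = gaussian_kernel(X)

# Empirical effective dimension d_K^lambda = Tr[K (K + n*lam*I)^{-1}],
# the finite-sample analogue of N(lambda) = tr[(Sigma + lambda I)^{-1} Sigma].
Kinv = np.linalg.inv(K + n * lam * np.eye(n))
d_eff = np.trace(K @ Kinv)
print(f"effective dimension d_K^lambda = {d_eff:.2f} (n = {n})")

# Monte Carlo ridge leverage scores of random Fourier features:
# for z_omega = [cos(omega^T x_1 + b), ..., cos(omega^T x_n + b)],
# the (unnormalized) leverage score is z_omega^T (K + n*lam*I)^{-1} z_omega.
s = 200
W = rng.standard_normal((s, d))                 # omega_i ~ N(0, I_d) for sigma = 1
b = rng.uniform(0, 2 * np.pi, size=s)
Z = np.sqrt(2.0) * np.cos(X @ W.T + b)          # n x s, one column per feature
lev = np.einsum('ns,nm,ms->s', Z, Kinv, Z)      # z_w^T (K + n*lam*I)^{-1} z_w for each w
print("mean leverage score:", lev.mean(), "max leverage score:", lev.max())
```

Features with large leverage scores are exactly the ones that a q(ω)-sampling scheme would draw more often, which is the intuition behind Assumptions 6 and 7.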
Note that Assumption 7 is stronger than the compatibility condition in Assumption 6, and Assumption 7 is satisfied when sampling from q(ω).

Source condition on f_ρ: The following assumption states that f_ρ has some desirable regularity properties.

Assumption 8 (Source condition [53], [128]). There exist 1/2 ≤ r ≤ 1 and g ∈ L²_{ρX} such that f_ρ(x) = (Σ^r g)(x) almost surely.

Since Σ is a compact positive operator on L²_{ρX}, its r-th power Σ^r is well defined for any r > 0.⁶ Assumption 8 imposes a form of regularity/sparsity of f_ρ, which requires the expansion of f_ρ on the basis given by the integral operator Σ. Note that this assumption is more stringent than the existence of f_ρ in H. The latter is equivalent to Assumption 8 with r = 1/2 (the worst case), in which case f_ρ ∈ H need not have much regularity/sparsity.

Noise Condition: The following two assumptions on noise are considered in random features for classification.

Assumption 9 (Massart's low noise condition [11]). There exists V ≥ 2 such that
  |E_{(x,y)∼ρ}[y|x]| ≥ 2/V.

Assumption 10 (Separation condition [11]). The points in X can be collected into two sets according to their labels as follows
  X_1 := {x ∈ X : E[y|x] > 0},
  X_{−1} := {x ∈ X : E[y|x] < 0}.

For i ∈ {±1}, the distance of a point x ∈ X_i to the set X_{−i} is denoted by ∆(x). We say that the data distribution satisfies a separation condition if there exists ∆ > 0 such that ρ_X(∆(x) < c) = 0.

The above two assumptions, both controlling the noise level in the labels, can be cast into a unified framework [131] as follows. Define the regression function η(x) = E[y|X = x] in binary classification problems. The Massart's low noise condition means that there exists h ∈ (0, 1] such that |η(x)| ≥ h for all x ∈ X. Here h characterizes the level of noise in classification problems. If h is small, then η(x) is close to zero, in which case correct classification is difficult. Massart's condition can be extended to the following more flexible condition known as Tsybakov's low noise assumption [131]. This assumption stipulates that there exists a constant C > 0 such that for all sufficiently small t > 0, we have
  Pr{x ∈ X : |2η(x) − 1| ≤ t} ≤ C · t^q,
for some q > 0. The separation condition in Assumption 10 is an extreme case of the Tsybakov noise assumption with q = ∞. It is clear that noise-free distributions satisfy this separation assumption, since the conditional probability η is bounded away from 1/2.

5.2.2 Squared loss in KRR

In this section, we review theoretical results on the generalization properties of KRR with squared loss and random features, for both the p(ω)-sampling (data-independent) and q(ω)-sampling (data-dependent) settings. Table 4 summarizes these results for the excess risk in terms of the key assumptions imposed, the learning rates, and the required number of random features.

We begin with the remarkable result by Rudi and Rosasco [53]. They are among the first to show that under some mild assumptions and appropriately chosen parameters, Ω(√n log n) random features suffice for KRR to achieve minimax optimal rates.

Theorem 11 (Generalization bound; Theorem 3 in [53]). Suppose that Assumption 8 (source condition) holds with r ∈ [1/2, 1], Assumption 6 (compatibility) holds with ϱ ∈ [0, 1], and Assumption 5 (capacity) holds with γ ∈ [0, 1]. Assume that n ≥ n_0 and choose λ := n^{−1/(2r+γ)}. If the number of random features satisfies
  s ≥ c_0 n^{(α+(2r−1)(1+γ−α))/(2r+γ)} log(108κ²/(λδ)),
then the excess risk of f̃_{z,λ} can be upper bounded as
  E(f̃_{z,λ}) − E(f_ρ) = ‖f̃_{z,λ} − f_ρ‖²_{L²_{ρX}} ≤ c_1 log²(18/δ) n^{−2r/(2r+γ)},
where c_0, c_1 are constants independent of (n, λ, δ), and n_0 does not depend on n, λ, f_ρ, or ρ.

Theorem 11 unifies several results in [53] that impose different assumptions. The simplest result is Theorem 1 in [53], which only requires the three basic Assumptions 1–3 on existence, boundedness and continuity, corresponding to the worst case of Theorem 11 with ϱ = γ = 1 and r = 1/2. In this case, by choosing λ = n^{−1/2}, we require Ω(√n log n) random features to achieve the minimax convergence rate O(n^{−1/2}); also see Table 4.

A more refined result is given in Theorem 2 in [53], which accounts for the capacity of the RKHS and the regularity of f_ρ, as quantified by the parameters γ ∈ [0, 1] (Assumption 5) and r ∈ [1/2, 1] (Assumption 8), respectively. Under these conditions and choosing λ := n^{−1/(2r+γ)}, we require Ω(n^{(1+γ(2r−1))/(2r+γ)} log n) random features to achieve the convergence rate O(n^{−2r/(2r+γ)}). Note that γ = 1 is the worst case, where the eigenvalues of K have the slowest decay, and γ = 1/(2a) ∈ (0, 1) means that the eigenvalues follow a polynomial decay λ_i ∝ n i^{−2a}. Table 4 presents this result with γ := 1/(2a) for better comparison with the other results.

The above two results apply to the standard RFF setting with data-independent sampling. When {ω_i}_{i=1}^s are sampled from a data-dependent distribution satisfying the compatibility condition in Assumption 6 with ϱ ∈ [0, 1], then Theorem 3 in [53] provides an improved result. In this case, by choosing λ := n^{−1/(2r+γ)}, we require Ω(n^{(ϱ+(1+γ−ϱ)(2r−1))/(2r+γ)} log n) random features to achieve the convergence rate O(n^{−2r/(2r+γ)}).

If the compatibility condition is replaced by the stronger Assumption 7 (optimized distribution), satisfied by q(ω)-sampling, the work [31] derives an improved bound that is the sharpest to date. Below we state a general result from [31] that covers both p(ω)- and q(ω)-sampling.

Theorem 12 (Theorem 1 in [31]). Suppose that the regularization parameter λ satisfies 0 ≤ nλ ≤ λ_1. We consider two sampling schemes.
• {ω_i}_{i=1}^s ∼ p(ω): if s ≥ (5z_0²/λ) log(16 d_K^λ/δ) and |z(ω, x)| ≤ z_0,
• {ω_i}_{i=1}^s ∼ q(ω): if s ≥ 5 d_K^λ log(16 d_K^λ/δ),
then for 0 < δ < 1, with probability 1 − δ, the excess risk of f̃_{z,λ} can be upper bounded as
Table 4
Comparison of learning rates and required random features for expected risk with the squared loss function.

| sampling scheme | Results | key assumptions | eigenvalue decays | λ | learning rates | required s |
| {ω_i}_{i=1}^s ∼ p(ω) | [53, Theorem 1] | - | - | n^{−1/2} | O_p(n^{−1/2}) | s ≥ Ω(√n log n) |
| {ω_i}_{i=1}^s ∼ p(ω) | [53, Theorem 2] | source condition | i^{−2t} | n^{−2t/(1+4rt)} | O_p(n^{−4rt/(1+4rt)}) | s ≥ Ω(n^{(2t+2r−1)/(1+4rt)} log n) |
| {ω_i}_{i=1}^s ∼ p(ω) | [53, Theorem 2] | source condition | 1/i | n^{−1/(2r+1)} | O_p(n^{−2r/(2r+1)}) | s ≥ Ω(n^{2r/(2r+1)} log n) |
| {ω_i}_{i=1}^s ∼ p(ω) | [31, Corollary 2] | - | e^{−i/c} | n^{−1/2} | O_p(n^{−1/2}) | s ≥ Ω(√n log log n) |
| {ω_i}_{i=1}^s ∼ p(ω) | [31, Corollary 2] | - | i^{−2t} | n^{−1/2} | O_p(n^{−1/2}) | s ≥ Ω(√n log n) |
| {ω_i}_{i=1}^s ∼ p(ω) | [31, Corollary 2] | - | 1/i | n^{−1/2} | O_p(n^{−1/2}) | s ≥ Ω(√n log n) |
| {ω_i}_{i=1}^s ∼ q(ω) | [53, Theorem 3] | source condition; compatibility condition | i^{−2t} | n^{−2t/(1+4rt)} | O_q(n^{−4rt/(1+4rt)}) | s ≥ Ω(n^{(ϱ+(2r−1)(2t+1−2tϱ))/(1+4rt)} log n) |
| {ω_i}_{i=1}^s ∼ q(ω) | [53, Theorem 3] | source condition; compatibility condition | 1/i | n^{−1/(2r+1)} | O_q(n^{−2r/(2r+1)}) | s ≥ Ω(n^{2r/(2r+1)} log n) |
| {ω_i}_{i=1}^s ∼ q(ω) | [31, Corollary 1] | optimized distribution | e^{−i/c} | n^{−1/2} | O_q(n^{−1/2}) | s ≥ Ω(log² n) |
| {ω_i}_{i=1}^s ∼ q(ω) | [31, Corollary 1] | optimized distribution | i^{−2t} | n^{−1/2} | O_q(n^{−1/2}) | s ≥ Ω(n^{1/(2t)} log n) |
| {ω_i}_{i=1}^s ∼ q(ω) | [31, Corollary 1] | optimized distribution | 1/i | n^{−1/2} | O_q(n^{−1/2}) | s ≥ Ω(√n log n) |
where we recall that E(f_{z,λ}) − E(f_ρ) is the excess risk of standard KRR with an exact kernel (see Section 2).

Remark: A sharper convergence rate can be achieved if the Rademacher complexity used in [31] is substituted by the local Rademacher complexity [132]; see [133] for details.

For p(ω)-sampling, Theorem 12 improves on the results of [53] under the exponential and polynomial decays. Specifically, if {ω_i}_{i=1}^s ∼ p(ω), Theorem 12 requires s ∝ (1/λ) log d_K^λ. Specialized to the exponential decay case, this result requires Ω(√n log log n) random features to achieve an O(n^{−1/2}) learning rate, which is an improvement compared to [53] with Ω(√n log n) random features.

For q(ω)-sampling, Theorem 12 shows that if λ = n^{−1/2}, then s ∝ d_K^λ log d_K^λ random features are sufficient to incur no loss in the expected risk of KRR, with a minimax learning rate O(n^{−1/2}). Corollaries of this result under three different regimes of eigenvalue decay are summarized in Table 4.

Carratino et al. [134] extend the result of [53] to the setting where KRR is solved by stochastic gradient descent (SGD). They show that under the basic Assumptions 1–3 and some mild conditions for SGD, Ω(√n) random features suffice to achieve the minimax learning rate O(n^{−1/2}). This result matches those for standard KRR with an exact kernel [135]. The above results can be improved if in addition the source condition in Assumption 8 holds, in which case Ω(n^{(1+α(2r−1))/(2r+α)}) random features suffice to achieve an O(n^{−2r/(2r+α)}) learning rate.

The work in [136] shows that if the randomized feature map is bounded (which is weaker than Assumption 2), then we have the following out-of-sample bound
  E(f̃_{z,λ}) − E(f_{z,λ}) ≤ O(1/(sλ)).
If we choose λ := n^{−1/2}, then Ω(n) random features are sufficient to ensure an O(n^{−1/2}) rate in the out-of-sample bound.

5.2.3 Lipschitz continuous loss function

In this section, we consider loss functions ℓ that are Lipschitz continuous. Examples include the hinge loss in SVM and the cross-entropy loss in kernel logistic regression. Table 5 summarizes several existing results for such loss functions in terms of the learning rate and the required number of random features. We briefly discuss these results below and refer the readers to the cited work for the precise theorem statements.

If {ω_i}_{i=1}^s ∼ p(ω), i.e., under the standard RFF setting with data-independent sampling, we have the following results.
• Theorem 1 in [32] shows that the excess risk converges at a certain O(n^{−1/2}) rate with Ω(n log n) random features.
• Corollary 4 in [31] shows that with λ ∈ O(1/n) and Ω((1/λ) log d_K^λ) random features, the excess risk of f̃_{z,λ} can be upper bounded by
    E(f̃_{z,λ}) − E(f_ρ) ≤ O(1/√n) + O(√λ).
  The above bound scales with √λ, which is different from the bound in Eq. (27) for the squared loss. Therefore, for Lipschitz continuous loss functions, we need to choose a smaller regularization parameter λ ∈ O(1/n) to achieve the same O(n^{−1/2}) convergence rate. Also note that, as before, we can bound d_K^λ under the three types of eigenvalue decay.

If {ω_i}_{i=1}^s ∼ q(ω), i.e., under the data-dependent sampling setting, we have the following results.
• For SVM with random features, under the optimized distribution in Assumption 7 and the low noise condition in Assumption 9, Theorem 1 in [11] provides bounds on the learning rates and the required number of random features. This result is improved in [11, Theorem 2] if we consider the stronger separation condition in Assumption 10. Details can be found in Table 5.
• In Section 4.5 in [52] and Corollary 3 in [31], it is shown that if Assumption 7 holds, then the excess risk of f̃_{z,λ} converges at an O(n^{−1/2}) rate with Ω(d_K^λ log d_K^λ) random features, if we choose λ ∈ O(1/n).
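For readers who want a concrete picture of the estimator f̃_{z,λ} that all of the above bounds refer to: for the squared loss it is simply linear ridge regression in the random feature space. The sketch below is only illustrative (synthetic data; all sizes, the kernel width, and variable names are assumptions, not the authors' implementation); it solves the s × s system (ZᵀZ + nλI)β = Zᵀy instead of the n × n kernel system.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s, lam, sigma = 1000, 5, 200, 1e-3, 1.0

# Synthetic regression data y = f(x) + noise.
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Random Fourier features for the Gaussian kernel with bandwidth sigma:
# z(x) = sqrt(2/s) * cos(W x + b), with W_ij ~ N(0, 1/sigma^2), b ~ Unif[0, 2*pi].
W = rng.standard_normal((s, d)) / sigma
b = rng.uniform(0, 2 * np.pi, size=s)
Z = np.sqrt(2.0 / s) * np.cos(X @ W.T + b)           # n x s feature matrix

# Random features KRR: f_tilde(x) = z(x)^T beta with
# beta = (Z^T Z + n*lam*I)^{-1} Z^T y, i.e., ridge regression on the features.
beta = np.linalg.solve(Z.T @ Z + n * lam * np.eye(s), Z.T @ y)

X_test = rng.standard_normal((200, d))
Z_test = np.sqrt(2.0 / s) * np.cos(X_test @ W.T + b)
y_pred = Z_test @ beta
print("test MSE:", np.mean((y_pred - np.sin(X_test[:, 0])) ** 2))
```

The cost of fitting drops from O(n³) for exact KRR to O(ns² + s³), which is precisely why the question "how large must s be so that nothing is lost statistically" is the central one in this section.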
Table 5
Comparison of learning rates and required random features for expected risk with a Lipschitz continuous loss function.

| sampling scheme | Results | key assumptions | eigenvalue decay | λ | learning rates | required s |
| {ω_i}_{i=1}^s ∼ p(ω) | [32, Theorem 1] | - | - | - | O_p(n^{−1/2}) | s ≥ Ω(n log n) |
| {ω_i}_{i=1}^s ∼ p(ω) | [31, Corollary 4] | - | e^{−i/c} | 1/n | O_p(n^{−1/2}) | s ≥ Ω(n log log n) |
| {ω_i}_{i=1}^s ∼ p(ω) | [31, Corollary 4] | - | i^{−2t} | 1/n | O_p(n^{−1/2}) | s ≥ Ω(n log n) |
| {ω_i}_{i=1}^s ∼ p(ω) | [31, Corollary 4] | - | 1/i | 1/n | O_p(n^{−1/2}) | s ≥ Ω(n log n) |
| {ω_i}_{i=1}^s ∼ q(ω) | [11, Theorem 1] | low noise condition; optimized distribution | e^{−i/c} | 1/n | O_q(log^{c+2} n / n) | s ≥ Ω(log^c n log log^c n) |
| {ω_i}_{i=1}^s ∼ q(ω) | [11, Theorem 1] | low noise condition; optimized distribution | i^{−2t} | n^{−t/(1+t)} | O_q(n^{−t/(1+t)} log n) | s ≥ Ω(n^{1/(1+t)} log n) |
| {ω_i}_{i=1}^s ∼ q(ω) | [11, Theorem 1] | low noise condition; optimized distribution | 1/i | 1/n | O_q(n^{−1/2}) | s ≥ Ω(n log n) |
| {ω_i}_{i=1}^s ∼ q(ω) | [11, Theorem 2] | separation condition; optimized distribution | e^{−i/c} | n^{−2c} | O_q(log^{2c+1} n log log n / n) | s ≥ Ω(log^{2c} n log log n) |
| {ω_i}_{i=1}^s ∼ q(ω) | [52, Section 4.5]; [31, Corollary 3] | optimized distribution | e^{−i/c} | 1/n | O_q(n^{−1/2}) | s ≥ Ω(log² n) |
| {ω_i}_{i=1}^s ∼ q(ω) | [52, Section 4.5]; [31, Corollary 3] | optimized distribution | i^{−2t} | 1/n | O_q(n^{−1/2}) | s ≥ Ω(n^{1/(2t)} log n) |
| {ω_i}_{i=1}^s ∼ q(ω) | [52, Section 4.5]; [31, Corollary 3] | optimized distribution | 1/i | 1/n | O_q(n^{−1/2}) | s ≥ Ω(n log n) |
6.1 Experimental settings

Kernel: We choose the popular Gaussian kernel, zero/first-order arc-cosine kernels, and polynomial kernels for the experiments.

i) Gaussian kernel:
  k(x, x′) = exp( −‖x − x′‖²₂ / (2ς²) ),   (28)
where the kernel width parameter ς is tuned via 5-fold inner cross validation over a grid of {0.01, 0.1, 1, 10, 100}.

To evaluate the Gaussian kernel, we compare the following representative algorithms: RFF [9], ORF [24], SORF [24], ROM [80], Fastfood [36], QMC [37], SSF [43], GQ [26], and LS-RFF [31]. These algorithms include both data-independent and data-dependent approaches and involve a variety of techniques including Monte Carlo and quasi-Monte Carlo sampling, quadrature rules, variance reduction, and computational speedup using structural/circulant matrices.

ii) arc-cosine kernels: Different from Gaussian kernels and polynomial kernels, the designed arc-cosine kernels [60] can be closely connected to neural networks, as they include feature spaces that mimic the sparse, nonnegative, distributed representations of single-layer threshold networks. The used zeroth-order kernel is given explicitly by
  k(x, x′) = 1 − θ/π,
which corresponds to the Heaviside step function σ(ωᵀx) = (1 + sign(ωᵀx))/2 in Eq. (6). The first-order kernel is
  k(x, x′) = (1/π) ‖x‖₂ ‖x′‖₂ (sin θ + (π − θ) cos θ),
which corresponds to the ReLU activation function σ(ωᵀx) = max{0, ωᵀx} in Eq. (6).

Here we consider the zero/first-order arc-cosine kernels and compare these ten algorithms (used for Gaussian kernel approximation) as well. Note that the theoretical foundation behind random features, Bochner's theorem, does not apply to arc-cosine kernels. Thankfully, since arc-cosine kernels admit the formulation in Eq. (6), Monte Carlo sampling (e.g., RFF) can still be used for arc-cosine kernel approximation. In this case, the remaining algorithms, e.g., ORF, QMC, and Fastfood, based on various sampling strategies, are still applicable to arc-cosine kernels, at least in the algorithmic aspect.

iii) Polynomial kernel: This is a widely used family of dot product kernels given by
  k(x, x′) = (1 + ⟨x, x′⟩)^b,
where b is the order. In our experiments, the order is set to b = 2. Note that, different from Gaussian kernels and arc-cosine kernels, polynomial kernels admit neither Bochner's theorem nor the sampling formulation in Eq. (6), so classical random features based algorithms are applicable to arc-cosine kernels but not to polynomial kernels, even though both of them are dot-product kernels. As a result, algorithms for polynomial kernel approximation are often totally different. In this survey, we include three representative approaches for evaluation: Random Maclaurin (RM) [34], Tensor Sketch (TS) [73], and Tensorized Random Projection (TRP) [74].

Datasets: We consider eight non-image benchmark datasets, two representative image datasets, and an ultra-large scale dataset for evaluation. Table 6 gives an overview of these datasets including the number of feature dimensions, training samples, test data, training/test split, and the normalization scheme. These eight non-image benchmark datasets can be downloaded from https://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/datasets/ or the UCI Machine Learning Repository⁷. Some datasets include a training/test partition, denoted as "no" in the random split column. For the other datasets, we randomly pick half of the data for training and the rest for testing, denoted as "yes" in the random split column. There are two typical normalization schemes used in these datasets: "mapstd" and "minmax". The "mapstd" scheme sets each sample's mean to 0 and deviation to 1, while the "minmax" scheme is a standard min-max scaling operation mapping the samples to the bounded set [0, 1]^d. Two representative image datasets are the MNIST handwritten digits dataset [137] and the CIFAR10 natural image classification dataset [138], summarized in the last two rows of Table 6. The MNIST dataset contains 60,000 training samples and 10,000 test samples, each of which is a 28 × 28 gray-scale image of a handwritten digit from 0 to 9. Here the "minmax" normalization scheme means that each pixel value is divided by 255. The CIFAR10 dataset consists of 60,000 color images of size 32 × 32 × 3 in 10 categories, with 50,000 for training and 10,000 for testing. Besides, apart from the medium/large scale datasets in our experiments, we also evaluate the compared approaches on an ultra-large scale dataset, MNIST 8M [139], which is derived from the MNIST dataset by random deformations and translations. It shares the same feature dimension and test data with the MNIST dataset, but has 8,100,000 training samples.

Evaluation metrics: We evaluate the performance of all the compared algorithms in terms of approximation error, time cost, and test accuracy. We use ‖K − K̃‖_F / ‖K‖_F as the error metric for kernel approximation. A small error indicates a high approximation quality. To compute the approximation error, we randomly sample 1,000 data points to construct the sub-feature matrix and the sub-kernel matrix. We record the time cost of each algorithm for generating feature mappings. The kernel width ς in the Gaussian kernel is tuned by five-fold cross validation over the grid {0.01, 0.1, 1, 10, 100}. The regularization parameter λ in ridge linear regression and the balance parameter in liblinear are tuned via 5-fold inner cross validation on a grid of {10^{-8}, 10^{-6}, 10^{-4}, 10^{-3}, 10^{-2}, 0.05, 0.1, 0.5, 1, 5, 10} and {0.01, 0.1, 1, 10, 100}, respectively. For the sake of computational efficiency, we conduct a relatively coarse hyper-parameter tuning. Nevertheless, a refined hyper-parameter search might result in better classification performance. The random features dimension s in our experiments takes values in {2d, 4d, 8d, 16d, 32d}. All experiments are repeated 10 times and we report the average approximation error and average classification accuracy with their respective standard deviations, as well as the time cost for generating random features.

6.2 Results for the Gaussian Kernel

6.2.1 Results on non-image benchmark datasets

Here we test various random features based algorithms, including RFF [9], ORF [24], SORF [24], ROM [80], Fastfood [36], QMC [37], SSF [43], GQ [26], LS-RFF [31], for kernel approximation and then combine these algorithms with lr/liblinear for classification on eight non-image benchmark datasets; refer to Appendix B.1 for details. Here we summarize the best performing algorithm on each dataset in terms of the approximation quality and classification accuracy in Table 7, where we distinguish the small s case (i.e.,

7. https://archive.ics.uci.edu/ml/datasets.html.
Table 7
Results statistics on several classification datasets. The best algorithm on each dataset is given in two cases: low dimensional (i.e., s = 2d, 4d) and high dimensional (i.e., s = 16d, 32d) according to approximation quality, test accuracy in linear regression or liblinear. The notation "-" means that there is no statistically significant difference in the performance of most algorithms.

| datasets | approximation (small s) | approximation (large s) | lr (small s) | lr (large s) | liblinear (small s) | liblinear (large s) |
| ijcnn1 | SSF | SORF, QMC, ORF | - | - | Fastfood | - |
| EEG | SSF | ORF | - | - | - | - |
| cod-RNA | SSF | - | - | - | - | - |
| covtype | ORF | - | - | - | - | - |
| magic04 | SSF | SSF, ORF, QMC, ROM | - | - | - | - |
| letter | SSF | SSF, ORF | - | - | - | - |
| skin | SSF, ROM | QMC | - | - | - | - |
| a8a | - | - | - | - | SSF | - |
Table 8
Comparison of problem settings on analysis of high dimensional random features on double descent.

| studies | metric | data {x_i}_{i=1}^n | f_ρ | activation function | W | asymptotic? | result |
| [143, Theorem 7] | population risk | N(0, I_d) | ⟨x, ζ⟩ | normalized | N(0, 1/d) | yes | variance ↗↘ |
| [155, Theorem 4] | population risk | N(0, I_d) | ⟨x, ζ⟩ | bounded | N(0, 1/d) | yes | variance ↗↘ |
| [144, Theorem 2] | expected excess risk | S^{d−1}(√d) | ⟨x, ζ⟩ + nonlinear | bounded | Unif(S^{d−1}(√d)) | yes | variance, bias ↗↘ |
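The settings compared in Table 8 can be probed empirically with a few lines of code. The sketch below is purely illustrative (the linear target, the noise level, the ReLU random features, and the nearly ridgeless pseudo-inverse solver are assumptions, not any of the cited analyses); it sweeps the number of random features s across the interpolation threshold s = n, where the test error of the minimum-norm interpolator typically peaks before decreasing again, i.e., the double descent curve discussed in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_test, d = 200, 1000, 20
X = rng.standard_normal((n, d))
X_te = rng.standard_normal((n_test, d))
zeta = rng.standard_normal(d) / np.sqrt(d)
y = X @ zeta + 0.1 * rng.standard_normal(n)        # linear target <x, zeta> plus noise
y_te = X_te @ zeta

for s in [20, 50, 100, 150, 200, 250, 400, 800, 1600]:
    W = rng.standard_normal((d, s)) / np.sqrt(d)   # random first-layer weights
    Z = np.maximum(X @ W, 0)                       # ReLU random features (train)
    Z_te = np.maximum(X_te @ W, 0)                 # ReLU random features (test)
    # (Nearly) ridgeless least squares via pseudo-inverse: the minimum-norm
    # interpolating solution once s > n.
    beta = np.linalg.pinv(Z) @ y
    err = np.mean((Z_te @ beta - y_te) ** 2)
    print(f"s = {s:5d}  (s/n = {s / n:4.1f})  test MSE = {err:.4f}")
```

Adding an explicit ridge penalty λ > 0 smooths out the peak, which mirrors the role of regularization in the asymptotic analyses discussed next.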
σ(XWᵀ/√d)/√d by considering the Stieltjes transform of a related random block matrix, and show that, under the least squares regression setting in an asymptotic viewpoint, both the bias and the variance have a peak at the interpolation threshold ψ₂ = 1 and diverge there when λ → 0. Under this framework, according to the randomness stemming from label noise, initialization, and training features, a refined bias-variance decomposition is conducted by [156], [163] and further improved by [158], [164] using the analysis of variance. Apart from refined error decomposition schemes, the authors of [155], [157], [159] consider a general setting on convex loss functions, transformation matrices, and activation functions for regression and classification. Here the techniques used for analysis are not limited to RMT. Instead, the replica method [165] (a non-rigorous heuristic method from statistical physics) used in [156], [157], [163] and the convex Gaussian min-max (CGMM) theorem [166] used in [159] are two alternative ways to derive the desired results. Note that CGMM requires the data to be Gaussian, which might restrict the application scope of these results, but it is still a commonly used technical tool for max-margin linear classifiers [167], boosting classifiers [168], and adversarial training for linear regression [169] in over-parameterized regimes. Admittedly, the applied replica method in statistical physics is quite different from [144] for tackling inverse random matrices in RMT. However, most of the above methods admit the equivalence between the considered data model and the Gaussian covariate model. That means, problem (3) with Gaussian data can be asymptotically equivalent to
  min_{β∈R^s} (1/n) Σ_{i=1}^n ℓ( y_i, βᵀ(µ₀ 1_k + µ₁ W x_i + µ⋆ t_i) ) + λ‖β‖²₂,
where {t_i}_{i=1}^n ∼ N(0, I_d), µ₀ = E[σ(t)], µ₁ = E[tσ(t)] and µ⋆² = E[σ(t)²] − µ₀² − µ₁² for a standard Gaussian variable t. This equivalence on the generalization error in an asymptotic viewpoint is proved in [160].

Different from the above results in an asymptotic view, Jacot et al. [161] present a non-asymptotic result by taking the finite-size Stieltjes transform of a generalized Wishart matrix, and further argue that random feature models can be close to KRR with an additional regularization. The used technical tool is related to the "calculus of deterministic equivalents" for random matrices [170]. This technique is also used in [154] to derive the exact asymptotic deterministic equivalent of E_W[(ZZᵀ + nλI)^{−1}], which captures the asymptotic behavior on double descent. Note that this work makes no data assumption to match real-world data, which is different from previous work relying on a specific data distribution.

7.2 Discussion on Random Features and DNNs

As mentioned, random features models have been fruitfully used to analyze the double descent phenomenon. However, it is non-trivial to transfer results for these models to practical neural networks, which are typically deep but not too wide. There is still a substantial gap between existing theory based on random features and the modern practice of DNNs in approximation ability. For example, under the spherical data setting, Ghorbani et al. [67] (a more general version in [171] on data distribution) point out that as n → ∞, a random features regression model can only fit the projection of the target function onto the space of degree-ℓ polynomials when s = Ω(d^{ℓ+1−δ}) random features are used for some δ > 0. More importantly, if s, d are taken as large with s = Ω(d), then the function space of random features can only capture linear functions. Even if we consider the NTK model, it can just capture quadratic functions. That means, both random features and NTK have limited approximation power in the lazy training scheme [65]. In addition, Yehudai and Shamir [172] show that the random features model cannot efficiently approximate a single ReLU neuron, as it requires the number of random features to be exponentially large in the feature dimension d. This is consistent with the classical result for kernel approximation in the under-parameterized regime: the random features model, QMC, and quadrature based methods require s = Ω(exp(d)) to achieve an ε approximation error [26].

Admittedly, the above results may appear pessimistic due to the simple architecture. Nevertheless, random features is still an effective tool, at least as a first step, for analyzing and understanding DNNs in certain regimes, and we believe its potential has yet to be fully exploited. Note that the random features model is still a strong and universal approximator [173] in the sense that the RKHSs induced by a broad class of random features are dense in the space of continuous functions. While the aforementioned results
show that the number of required features may be exponential in the worst case, a more refined analysis can still provide useful insights for DNNs. One potential way forward in deep learning theory is to use the random features model to analyze DNNs with pruning. For example, the best paper [174] in the Seventh International Conference on Learning Representations (ICLR 2019) put forward the following Lottery Ticket Hypothesis: a deep neural network with random initialization contains a small sub-network which, when trained in isolation, can compete with the performance of the original one. Malach et al. [54] provide a stronger claim that a randomly-initialized and sufficiently over-parameterized neural network contains a sub-network with nearly the same accuracy as the original one, without any further training. Their analysis points to the equivalence between random features and the sub-network model. As such, the random features model is potentially useful for network pruning [175] in terms of, e.g., guiding the design of neuron pruning for accelerating computations, and understanding network pruning and the full DNNs.

8 CONCLUSION

In this survey, we systematically review random features based algorithms and their associated theoretical results. We also give an overview of the generalization properties of high dimensional random features in over-parameterized regimes on double descent, and discuss the limitations and potential of random features in the theory development for neural networks. Below we provide additional remarks and discuss several open problems that are of interest for future research.
• As a typical data-independent method, random features are simple to implement, easy to parallelize, and naturally apply to streaming or dynamic data. Current efforts on Nyström approximation by a preconditioned gradient solver parallelized with multiple GPUs [176] and quantum algorithms [112] can guide us in designing powerful implementations of random features for handling datasets with millions or billions of samples.
• Experimental comparisons show that better kernel approximation does not directly translate to lower generalization errors. We partly answer this question in the current survey, but our answer may not be sufficient to fully explain this phenomenon. We believe this issue deserves further in-depth study.
• Kernel learning via the spectral density is a popular direction [87], [89], which can be naturally combined with Generative Adversarial Networks (GANs); see [84] for details. In this setting, one may associate the learned model with an implicit probability density that is flexible enough to characterize the relationships and similarities in the data. This is an interesting area for further research.
• The double descent phenomenon has been observed and studied in the random features model by various technical tools under different settings. Current theoretical results, such as those in [144], [154], may be extended to a more general setting with less restricted assumptions on data generation, model formulation, and the target function. Besides, more refined analyses and delicate phenomena beyond double descent have been investigated for the linear model, e.g., multiple descent phenomena [177] and optimal (negative) regularization [178], [179]. Understanding these more delicate phenomena for random features requires further investigation and refined analysis.
• There exist significant gaps between the random features model and practical neural networks, both in theory and empirically. Even for fitting simple quadratic or mixture models, the random features model cannot achieve a zero error with a finite number of neurons in general, while NTK and fully trained networks can [180]. Numerical experiments indicate that the prediction performance of NTK and CNTK may significantly degrade if the random features are generated from practically sized nets [13].
• Despite the limitations of existing theory, random features models are still useful for understanding and improving DNNs. For example, understanding the equivalence between the random features model and weight pruning in the Lottery Ticket Hypothesis [54] may be a promising future direction.
We hope that this survey will stimulate further research on the above open problems.

ACKNOWLEDGEMENTS

The research leading to these results has received funding from the European Research Council under the European Union's Horizon 2020 research and innovation program / ERC Advanced Grant E-DUALITY (787960). This paper reflects only the authors' views and the Union is not liable for any use that may be made of the contained information. This work was supported in part by Research Council KU Leuven: Optimization frameworks for deep kernel machines C14/18/068; Flemish Government: FWO projects: GOA4917N (Deep Restricted Kernel Machines: Methods and Foundations), PhD/Postdoc grant. This research received funding from the Flemish Government (AI Research Program). This work was supported in part by Ford KU Leuven Research Alliance Project KUL0076 (Stability analysis and performance improvement of deep reinforcement learning algorithms), EU H2020 ICT-48 Network TAILOR (Foundations of Trustworthy AI - Integrating Reasoning, Learning and Optimization), Leuven.AI Institute; and in part by the National Natural Science Foundation of China 61977046, in part by National Science Foundation grants CCF-1657420 and CCF-1704828, and in part by SJTU Global Strategic Partnership Fund (2020 SJTU-CORNELL) and Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).
APPENDIX A
PROOF OF PROPOSITION 1
Proof. It is clear that an exact KRR estimator is f_{z,λ}(x) = k(x, X)(K + nλI)^{-1} y and its random features based version is f̃_{z,λ}(x) = k̃(x, X)(K̃ + nλI)^{-1} y, where K̃ = ZZᵀ with Z ∈ R^{n×s}. The definition of the excess risk for least squares implies
  E(f̃_{z,λ}) − E(f_ρ) = [E(f̃_{z,λ}) − E(f_{z,λ})] + [E(f_{z,λ}) − E(f_ρ)] = ‖f̃_{z,λ} − f_{z,λ}‖² + ‖f_{z,λ} − f_ρ‖²,
where the first term on the right-hand side is the expected error difference between the original KRR and its random features approximation, and the second term on the right-hand side is the excess risk of KRR, which is independent of the quality of kernel approximation. Specifically, the first term can be further expressed by the representer theorem as
  ‖f̃_{z,λ} − f_{z,λ}‖² = E_x [f̃_{z,λ}(x) − f_{z,λ}(x)]² = E_x ( Σ_{i=1}^n [ α̃_i k̃(x_i, x) − α_i k(x_i, x) ] )².   (29)
Intuitively speaking, kernel approximation aims to preserve the inner product in two Hilbert spaces, i.e., ⟨k(x, ·), k(x′, ·)⟩_H ≈ ⟨k̃(x, ·), k̃(x′, ·)⟩_{H̃}. Nevertheless, the preservation of the inner product does not immediately guarantee a small value of α̃_i ⟨k̃(x, ·), k̃(x′, ·)⟩_{H̃} − α_i ⟨k(x, ·), k(x′, ·)⟩_H in Eq. (29).
Informally, the proof idea is the following: define K̃ = K + E and k̃(x, X) = k(x, X) + ε̃ with the residual error matrix E ∈ R^{n×n} and the residual error vector ε̃ ∈ R^{1×n} such that k̃(x, X) ∈ R^{1×n}. Generally, the residual errors E and ε̃ are consistent, that is, a small kernel approximation error ‖E‖ implies a small ‖ε̃‖. Consider two random features based algorithms A1 and A2 yielding two approximated kernel matrices K̃₁ and K̃₂, and their respective KRR estimators f̃^{(A1)}_{z,λ} and f̃^{(A2)}_{z,λ}. The corresponding residual error matrices/vectors are defined as (E₁, ε̃₁) and (E₂, ε̃₂) such that K̃₁ := K + E₁ and K̃₂ := K + E₂. Without loss of generality, we assume ‖E₁‖ ≤ ‖E₂‖ and ‖ε̃₁‖ ≤ ‖ε̃₂‖. In this case, our target is to prove that there exists one case such that |f̃^{(A1)}_{z,λ}(x) − f_{z,λ}(x)| ≥ |f̃^{(A2)}_{z,λ}(x) − f_{z,λ}(x)|. For notational simplicity, denote T(E, ε̃) := ⟨yᵀ, k(x, X)(K + nλI)^{-1}E − ε̃⟩, T₁(E₁, ε̃₁) := ⟨yᵀ, k(x, X)(K + nλI)^{-1}E₁ − ε̃₁⟩, and T₂(E₂, ε̃₂) := ⟨yᵀ, k(x, X)(K + nλI)^{-1}E₂ − ε̃₂⟩. To prove our result, we make the following three assumptions:
• I. the residual matrix E is positive semi-definite and K̃₁, K̃₂ are non-singular.
• II. nλ ≤ λ₁(K̃₁) ≤ λ₁(K̃₂), and K̃₂ admits (at least) polynomial eigenvalue decay.
• III. the inner product ⟨yᵀ, k(x, X)(K + nλI)^{-1}E − ε̃⟩ =: T(E, ε̃) is non-negative.
The above three assumptions are mild, commonly used, and attainable; see [31]. Specifically, we only need to prove the existence of our claim: there exists one case such that |f̃^{(A1)}_{z,λ}(x) − f_{z,λ}(x)| ≥ |f̃^{(A2)}_{z,λ}(x) − f_{z,λ}(x)| under ‖E₁‖ ≤ ‖E₂‖ and ‖ε̃₁‖ ≤ ‖ε̃₂‖. Therefore, the above assumptions could be further relaxed.
According to Eq. (29), for a new sample x, we in turn focus on |f̃_{z,λ}(x) − f_{z,λ}(x)|, which can be upper bounded by
  |f̃_{z,λ}(x) − f_{z,λ}(x)| = |k(x, X)(K + nλI)^{-1}y − [k(x, X) + ε̃](K + E + nλI)^{-1}y|
   = |k(x, X)[(K + nλI)^{-1} − (K + nλI + E)^{-1}]y − ε̃(K + nλI + E)^{-1}y|
   = |[k(x, X)(K + nλI)^{-1}E − ε̃](K + nλI + E)^{-1}y|
   ≤ Σ_{i=1}^n 1/(λ_i(K + E) + nλ) · [k(x, X)(K + nλI)^{-1}E − ε̃] y   (30)
   = Σ_{i=1}^n 1/(λ_i(K + E) + nλ) · ⟨yᵀ, k(x, X)(K + nλI)^{-1}E − ε̃⟩ =: Σ_{i=1}^n T(E, ε̃)/(λ_i(K + E) + nλ),
where the third equality holds by A^{-1} − B^{-1} = A^{-1}(B − A)B^{-1}. The first inequality derives from aᵀAb ≤ aᵀb tr(A), which holds for a positive semi-definite matrix A and bᵀa ≥ 0 (and can be derived from the used assumptions). Further, by virtue of aᵀAb ≥ λ_n(A) tr(aᵀb), the error |f̃_{z,λ}(x) − f_{z,λ}(x)| can be lower bounded by
  |f̃_{z,λ}(x) − f_{z,λ}(x)| = |[k(x, X)(K + nλI)^{-1}E − ε̃](K + nλI + E)^{-1}y|
   ≥ ⟨yᵀ, [k(x, X)(K + nλI)^{-1}E − ε̃]⟩ / (λ₁(K + E) + nλ) =: T(E, ε̃)/(λ₁(K + E) + nλ).   (31)
Combining Eqs. (30) and (31), we have
  0 ≤ T(E, ε̃)/(λ₁(K + E) + nλ) ≤ |f̃_{z,λ}(x) − f_{z,λ}(x)| ≤ Σ_{i=1}^n T(E, ε̃)/(λ_i(K + E) + nλ).   (32)
Considering such two algorithms A1 and A2, under the condition of ‖E₁‖ ≤ ‖E₂‖ and ‖ε̃₁‖ ≤ ‖ε̃₂‖, there exists one case such that T₁(E₁, ε̃₁) ≥ T₂(E₂, ε̃₂), i.e.,
  ⟨yᵀ, k(x, X)(K + nλI)^{-1}E₁ − ε̃₁⟩ ≥ ⟨yᵀ, k(x, X)(K + nλI)^{-1}E₂ − ε̃₂⟩,   (33)
Figure 8. Illustration of the geometric proof for one case such that T₁(E₁, ε̃₁) ≥ cT₂(E₂, ε̃₂) under the condition of ‖E₁‖ ≤ ‖E₂‖ and ‖ε̃₁‖ ≤ ‖ε̃₂‖, where c is some constant.
which can be achieved by the geometric explanation in Figure 8. By virtue of Eq. (33) and Assumption II, we have
  T₁(E₁, ε̃₁)/(λ₁(K + E₁) + nλ) − T₂(E₂, ε̃₂)/(λ₁(K + E₂) + nλ) =: C̃ ≥ 0.
The above inequality implies
  C̃ − Σ_{i=2}^n T₂(E₂, ε̃₂)/(λ_i(K + E₂) + nλ) ≤ |f̃^{(A1)}_{z,λ}(x) − f_{z,λ}(x)| − |f̃^{(A2)}_{z,λ}(x) − f_{z,λ}(x)| ≤ C̃ + Σ_{i=2}^n T₁(E₁, ε̃₁)/(λ_i(K + E₁) + nλ).
The left-hand side of the above inequality can be further lower bounded as
  |f̃^{(A1)}_{z,λ}(x) − f_{z,λ}(x)| − |f̃^{(A2)}_{z,λ}(x) − f_{z,λ}(x)| ≥ C̃ − Σ_{i=2}^n T₂(E₂, ε̃₂)/(λ_i(K + E₂) + nλ)
   ≥ T₁(E₁, ε̃₁)/(λ₁(K + E₂) + nλ) − Σ_{i=1}^n T₂(E₂, ε̃₂)/(λ_i(K + E₂) + nλ)
   ≥ T₁(E₁, ε̃₁)/(λ₁(K + E₂) + nλ) − [T₂(E₂, ε̃₂)/λ_n(K + E₂)] · Σ_{i=1}^n λ_i(K + E₂)/(λ_i(K + E₂) + nλ)
   ≥ T₁(E₁, ε̃₁)/(2λ₁(K̃₂)) − [T₂(E₂, ε̃₂)/λ_n(K̃₂)] · d^λ_{K̃₂},
where d^λ_{K̃₂} is the "effective dimension" of K̃₂ defined in Eq. (21) and the last inequality follows from Assumption II.
According to the above result, |f̃^{(A1)}_{z,λ}(x) − f_{z,λ}(x)| − |f̃^{(A2)}_{z,λ}(x) − f_{z,λ}(x)| ≥ 0 holds under the following condition
  T₁(E₁, ε̃₁) ≥ 2 [ λ₁(K̃₂)/λ_n(K̃₂) · d^λ_{K̃₂} ] T₂(E₂, ε̃₂),
where the factor in brackets is of order O(1).
APPENDIX B
EXPERIMENTS
In this section, we detail the experimental settings and present comparison results of the compared approaches on several benchmark datasets across various kernels. This part is organized as follows.
• In Section B.1, we present experimental results for the Gaussian kernel on eight non-image datasets in terms of approximation error, the time cost (sec.) for generating random feature mappings, and classification accuracy by linear regression and liblinear.
• Results on approximation error and test accuracy (by linear regression) for arc-cosine kernels and polynomial kernels are presented in Sections B.2 and B.3, respectively.
• In Section B.4, an ultra-large scale dataset is used to further validate the related algorithms.

Figure 9. Results of various algorithms across the Gaussian kernel on the letter, ijcnn1, covtype, cod-RNA datasets (panels (a)-(d)).

Figure 10. Results of various algorithms across the Gaussian kernel on the EEG, magic04, skin, a8a datasets (panels (a)-(d)).
data-dependent algorithm that needs to calculate the approximated ridge leverage scores. Nevertheless, Fastfood/SORF/ROM do not achieve the expected reduction in time cost, which appears contradictory to the underlying theoretical results on time complexity. This might be because, on one hand, the feature dimension of the used datasets in our experiments often ranges from 10 to 100, except for the image datasets. In this case, it is difficult to observe the computational saving from O(sd) to O(s log d) or O(d log d). On the other hand, in our experiments, due to the relatively inefficient Matlab implementation of the Fast Discrete Walsh-Hadamard Transform, such algorithms (e.g., Fastfood/SORF/ROM) do not show a significant reduction in computation time compared to RFF.
Figure 11. Results on eight datasets across the zero-order arc-cosine kernel.
that for the Gaussian kernel. This is because, according to Eq. (6), we actually conduct a d-dimensional integration approximation, and the smoothness of the integrand σ(ωᵀx)σ(ωᵀx′) significantly affects the approximation performance, as indicated by sampling theory. In terms of classification performance, the difference in test accuracy among most algorithms is relatively small, which shows a similar tendency to that of the Gaussian kernel.
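For completeness, here is a minimal sketch of the Monte Carlo approximation of arc-cosine kernels used in these experiments (not the survey's Matlab code; the sample sizes and names are illustrative assumptions). With ω ∼ N(0, I_d), the rescaled ReLU features √(2/s) max(0, ωᵀx) have inner products whose expectation is the first-order kernel k₁(x, x′) = (1/π)‖x‖₂‖x′‖₂(sin θ + (π − θ) cos θ); replacing the ReLU by the Heaviside step gives the zero-order kernel 1 − θ/π.

```python
import numpy as np

def arccos1(X):
    # Closed-form first-order arc-cosine kernel (Cho & Saul):
    # k1(x, y) = (1/pi) * ||x|| * ||y|| * (sin(theta) + (pi - theta) * cos(theta)).
    nrm = np.linalg.norm(X, axis=1, keepdims=True)
    cos_t = np.clip((X @ X.T) / (nrm * nrm.T), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return (nrm * nrm.T) * (np.sin(theta) + (np.pi - theta) * cos_t) / np.pi

rng = np.random.default_rng(0)
n, d, s = 300, 10, 4096
X = rng.standard_normal((n, d))

W = rng.standard_normal((s, d))                  # omega_i ~ N(0, I_d)
Z1 = np.sqrt(2.0 / s) * np.maximum(X @ W.T, 0)   # ReLU features (first order)
Z0 = np.sqrt(2.0 / s) * (X @ W.T > 0)            # Heaviside features (zero order)

K1 = arccos1(X)
K1_approx = Z1 @ Z1.T                            # Monte Carlo estimate of k1
K0_approx = Z0 @ Z0.T                            # Monte Carlo estimate of 1 - theta/pi
print("relative error (first-order kernel):",
      np.linalg.norm(K1 - K1_approx) / np.linalg.norm(K1))
```

Since the integrand for the zero-order kernel is a discontinuous step function, its Monte Carlo estimate converges more slowly than the smooth Gaussian case, which is consistent with the observation above.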
Figure 12. Results on eight datasets across the first-order arc-cosine kernel.
kernel with the kernel bandwidth ς equal to four times the median pairwise distance; logistic regression with the regularization parameter λ = 0.0005 for this multi-class classification task; the batch size is set to 2^20 and the feature block to 2^15. Besides, we report the total time cost of each algorithm on generating the feature mapping, the training process, and the test process for evaluation.
Table 9 reports the approximation error, training error, test error, and the total time cost of each algorithm across the Gaussian kernel and the zero/first-order arc-cosine kernels with s = 4096. It can be found that ORF/SORF and SSF achieve the best approximation performance on the Gaussian kernel, but ORF fails to significantly improve the approximation quality on arc-cosine kernels. This is consistent with the previous discussion on medium-scale datasets in Section B.2.
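As a pointer for reproducing the approximation-error metric reported throughout this appendix, the following sketch is illustrative only (a random Gaussian sample stands in for the 1,000-point subsets used in the experiments, and the bandwidth is an assumption): it generates plain RFF for the Gaussian kernel and evaluates the relative error ‖K − K̃‖_F / ‖K‖_F for the feature sizes used in Section 6.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 1000, 20, 1.0
X = rng.standard_normal((n, d))                   # stand-in for a 1,000-point subsample

# Exact Gaussian kernel matrix with bandwidth sigma.
x_sq = (X ** 2).sum(1)
sq = x_sq[:, None] + x_sq[None, :] - 2 * X @ X.T
K = np.exp(-sq / (2 * sigma ** 2))

for s in [2 * d, 4 * d, 8 * d, 16 * d, 32 * d]:   # feature dimensions as in Section 6
    W = rng.standard_normal((s, d)) / sigma       # omega_i ~ N(0, sigma^{-2} I_d)
    b = rng.uniform(0, 2 * np.pi, size=s)
    Z = np.sqrt(2.0 / s) * np.cos(X @ W.T + b)    # RFF map, so K_tilde = Z Z^T
    err = np.linalg.norm(K - Z @ Z.T) / np.linalg.norm(K)
    print(f"s = {s:4d}: relative approximation error = {err:.4f}")
```

Swapping the Monte Carlo matrix W for a structured or data-dependent construction (ORF, QMC, SSF, LS-RFF, etc.) changes only the way W is generated, which is exactly the axis along which the algorithms in Table 9 differ.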
REFERENCES
[1] Bernhard Schölkopf and Alexander J. Smola, Learning with kernels: support vector machines, regularization, optimization, and beyond, MIT Press, 2003.
[2] Johan A.K. Suykens, Tony Van Gestel, Jos De Brabanter, Bart De Moor, and Joos Vandewalle, Least Squares Support Vector Machines, World Scientific,
2002.
[3] Mehran Kafai and Kave Eshghi, “CROification: accurate kernel classification with the efficiency of sparse linear SVM,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 41, no. 1, pp. 34–48, 2019.
[4] Cho-Jui Hsieh, Si Si, and Inderjit Dhillon, “A divide-and-conquer solver for kernel support vector machines,” in International Conference on Machine
Learning, 2014, pp. 566–574.
[5] Yuchen Zhang, John Duchi, and Martin Wainwright, “Divide and conquer kernel ridge regression,” in Conference on Learning Theory, 2013, pp. 592–617.
[6] Fanghui Liu, Xiaolin Huang, Chen Gong, Jie Yang, and Li Li, “Learning data-adaptive non-parametric kernels,” Journal of Machine Learning Research,
vol. 21, no. 208, pp. 1–39, 2020.
[7] Alex J. Smola and Bernhard Schölkopf, “Sparse greedy matrix approximation for machine learning,” in International Conference on Machine Learning,
2000, pp. 911–918.
[8] Christopher K.I. Williams and Matthias Seeger, “Using the Nyström method to speed up kernel machines,” in Advances in Neural Information Processing
Systems, 2001, pp. 682–688.
[9] Ali Rahimi and Benjamin Recht, “Random features for large-scale kernel machines,” in Advances in Neural Information Processing Systems, 2007, pp.
1177–1184.
[10] David Lopez-Paz, Suvrit Sra, Alex J. Smola, Zoubin Ghahramani, and Bernhard Schölkopf, “Randomized nonlinear component analysis,” in International
Conference on Machine Learning, 2014, pp. 1359–1367.
[11] Yitong Sun, Anna Gilbert, and Ambuj Tewari, “But how does it work in theory? Linear SVM with random features,” in Advances in Neural Information
Processing Systems, 2018, pp. 3383–3392.
[12] Arthur Jacot, Franck Gabriel, and Clément Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” in Advances in Neural
Information Processing Systems, 2018, pp. 8571–8580.
[13] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Russ R. Salakhutdinov, and Ruosong Wang, “On exact computation with an infinitely wide neural net,”
in Advances in Neural Information Processing Systems, 2019, pp. 8139–8148.
[14] Amir Zandieh, Insu Han, Haim Avron, Neta Shoham, Chaewon Kim, and Jinwoo Shin, “Scaling neural tangent kernels via sketching and random features,”
arXiv preprint arXiv:2106.07880, 2021.
[15] Simon S Du, Kangcheng Hou, Barnabás Póczos, Ruslan Salakhutdinov, Ruosong Wang, and Keyulu Xu, “Graph neural tangent kernel: Fusing graph neural
networks with graph kernels,” in Advances in Neural Information Processing Systems, 2019, pp. 1–11.
[16] Daniele Zambon, Cesare Alippi, and Lorenzo Livi, “Graph random neural features for distance-preserving graph representations,” in International
Conference on Machine Learning. PMLR, 2020, pp. 10968–10977.
[17] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, and Adrian Weller, “Rethinking attention with performers,” in International Conference on Learning Representations, 2021.
[18] Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong, “Random feature attention,” in International Conference on
Learning Representations, 2021, pp. 1–19.
[19] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, “Understanding deep learning requires rethinking generalization,” arXiv
preprint arXiv:1611.03530, 2016.
[20] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal, “Reconciling modern machine-learning practice and the classical bias–variance trade-off,” Proceedings of the National Academy of Sciences, vol. 116, no. 32, pp. 15849–15854, 2019.
[21] Yuan Cao and Quanquan Gu, “Generalization bounds of stochastic gradient descent for wide and deep neural networks,” in Advances in Neural Information
Processing Systems, 2019, pp. 10835–10845.
Figure 14. Comparison of various algorithms on the covtype dataset in terms of time cost for generating random features mappings (panels: (a) zero-order arc-cosine kernel, (b) first-order arc-cosine kernel, (c) polynomial kernel).
Table 9
Comparison results of various algorithms across three kernels in terms of training error (%), classification error (%) and total time cost (sec.) on the ultra-large MNIST 8M dataset.

| kernel | metric | RFF | QMC | ORF | SORF | ROM | Fastfood | SSF | GQ | LS-RFF |
| Gaussian | approximation error | 0.0126 | 0.0065 | 0.0041 | 0.0041 | 0.0046 | 0.0159 | 0.0078 | 0.0121 | 0.0147 |
| Gaussian | training error | 0.22% | 0.21% | 0.19% | 0.22% | 0.19% | 0.21% | 0.20% | 0.21% | 0.22% |
| Gaussian | test error | 0.99% | 1.07% | 1.11% | 1.13% | 0.99% | 1.16% | 1.09% | 1.16% | 0.97% |
| Gaussian | time cost (sec.) | 13669 | 13999 | 14296 | 14526 | 13497 | 14343 | 14872 | 14322 | 15725 |
| arccos0 | approximation error | 0.0209 | 0.0124 | 0.0224 | 0.0231 | 0.0199 | 0.0246 | 0.0448 | 0.0383 | 0.0612 |
| arccos0 | training error | 2.71% | 2.70% | 2.70% | 2.70% | 2.70% | 2.60% | 2.70% | 3.02% | 2.64% |
| arccos0 | test error | 2.76% | 2.91% | 2.75% | 2.86% | 2.73% | 2.94% | 2.89% | 3.00% | 2.72% |
| arccos0 | time cost (sec.) | 10577 | 10266 | 10501 | 10558 | 10595 | 10807 | 11235 | 10330 | 12231 |
| arccos1 | approximation error | 0.0394 | 0.0104 | 0.0310 | 0.0316 | 0.0259 | 0.0458 | 0.0198 | 0.0369 | 0.0357 |
| arccos1 | training error | 0.93% | 0.96% | 0.94% | 1.00% | 0.94% | 0.98% | 0.95% | 0.96% | 0.93% |
| arccos1 | test error | 1.64% | 1.59% | 1.52% | 1.57% | 1.62% | 1.27% | 1.34% | 1.51% | 1.62% |
| arccos1 | time cost (sec.) | 9243.7 | 9170.3 | 9187.4 | 8861.6 | 8870.8 | 8824.1 | 9455.3 | 9188.1 | 9742.3 |
[22] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang, “Fine-grained analysis of optimization and generalization for overparameterized
two-layer neural networks,” in International Conference on Machine Learning, 2019, pp. 322–332.
[23] Ziwei Ji and Matus Telgarsky, “Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks,” in
International Conference on Learning Representations, 2020, pp. 1–8.
[24] Felix Xinnan Yu, Ananda Theertha Suresh, Krzysztof Choromanski, Daniel Holtmannrice, and Sanjiv Kumar, “Orthogonal random features,” in Advances
in Neural Information Processing Systems, 2016, pp. 1975–1983.
[25] Haim Avron, Vikas Sindhwani, Jiyan Yang, and Michael W. Mahoney, “Quasi-Monte Carlo feature maps for shift-invariant kernels,” Journal of Machine
Learning Research, vol. 17, no. 1, pp. 4096–4133, 2016.
[26] Tri Dao, Christopher M. De Sa, and Christopher Ré, “Gaussian quadrature for kernel features,” in Advances in neural information processing systems,
2017, pp. 6107–6117.
[27] Marina Munkhoeva, Yermek Kapushev, Evgeny Burnaev, and Ivan Oseledets, “Quadrature-based features for kernel approximation,” in Advances in Neural
Information Processing Systems, 2018, pp. 9147–9156.
[28] Alaa Saade, Francesco Caltagirone, Igor Carron, Laurent Daudet, Angélique Drémeau, Sylvain Gigan, and Florent Krzakala, “Random projections through multiple optical scattering: Approximating kernels at the speed of light,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2016, pp. 6215–6219.
[29] Ruben Ohana, Jonas Wacker, Jonathan Dong, Sébastien Marmin, Florent Krzakala, Maurizio Filippone, and Laurent Daudet, “Kernel computations from
large-scale random features obtained by optical processing units,” arXiv preprint arXiv:1910.09880, 2019.
[30] Danica J. Sutherland and Jeff Schneider, “On the error of random Fourier features,” in Conference on Uncertainty in Artificial Intelligence, 2015, pp.
862–871.
[31] Zhu Li, Jean-Francois Ton, Dino Oglic, and Dino Sejdinovic, “Towards a unified analysis of random Fourier features,” in the 36th International Conference
on Machine Learning, 2019, pp. 3905–3914.
[32] Ali Rahimi and Benjamin Recht, “Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning,” in Advances in neural
information processing systems, 2009, pp. 1313–1320.
[33] Fuxin Li, Catalin Ionescu, and Cristian Sminchisescu, “Random Fourier approximations for skewed multiplicative histogram kernels,” in Joint Pattern
Recognition Symposium. Springer, 2010, pp. 262–271.
[34] Purushottam Kar and Harish Karnick, “Random feature maps for dot product kernels,” in International Conference on Artificial Intelligence and Statistics,
2012, pp. 583–591.
[35] Andrea Vedaldi and Andrew Zisserman, “Efficient additive kernels via explicit feature maps,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 34, no. 3, pp. 480–492, 2012.
[36] Quoc Le, Tamás Sarlós, and Alex J. Smola, “FastFood—approximating kernel expansions in loglinear time,” in International Conference on Machine
Learning, 2013, pp. 244–252.
[37] Jiyan Yang, Vikas Sindhwani, Haim Avron, and Michael Mahoney, “Quasi-Monte Carlo feature maps for shift-invariant kernels,” in International
Conference on Machine Learning, 2014, pp. 485–493.
[38] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F Balcan, and Le Song, “Scalable kernel methods via doubly stochastic gradients,” in
Advances in Neural Information Processing Systems, 2014, pp. 3041–3049.
[39] Jeffrey Pennington, Felix Xinnan X. Yu, and Sanjiv Kumar, “Spherical random features for polynomial kernels,” in Advances in Neural Information
Processing Systems, 2015, pp. 1846–1854.
[40] Chang Feng, Qinghua Hu, and Shizhong Liao, “Random feature mapping with signed circulant matrix projection,” in Twenty-Fourth International Joint
Conference on Artificial Intelligence, 2015.
[41] Krzysztof Choromanski and Vikas Sindhwani, “Recycling randomness with structure for sublinear time kernel expansions,” in International Conference on
Machine Learning, 2016, pp. 2502–2510.
[42] Weiwei Shen, Zhihui Yang, and Jun Wang, “Random features for shift-invariant kernels with moment matching,” in Thirty-First AAAI Conference on
Artificial Intelligence, 2017, pp. 2520–2526.
[43] Yueming Lyu, “Spherical structured feature maps for kernel approximation,” in 34th International Conference on Machine Learning. JMLR.org, 2017, pp.
2256–2264.
[44] Shahin Shahrampour, Ahmad Beirami, and Vahid Tarokh, “On data-dependent random features for improved generalization in supervised learning,” in
Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 4026–4033.
[45] Jian Zhang, Avner May, Tri Dao, and Christopher Re, “Low-precision random Fourier features for memory-constrained kernel approximation,” in 22nd
International Conference on Artificial Intelligence and Statistics, 2019, pp. 1264–1274.
[46] Raj Agrawal, Trevor Campbell, Jonathan Huggins, and Tamara Broderick, “Data-dependent compression of random features for large-scale kernel
approximation,” in 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 1822–1831.
[47] Fanghui Liu, Xiaolin Huang, Yudong Chen, Jie Yang, and Johan A.K. Suykens, “Random Fourier features via fast surrogate leverage weighted sampling,”
in Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020, pp. 4844–4851.
[48] Tamás Erdélyi, Cameron Musco, and Christopher Musco, “Fourier sparse leverage scores and approximate kernel learning,” in Advances in Neural
Information Processing Systems, 2020.
[49] Bharath K. Sriperumbudur and Zoltán Szabó, “Optimal rates for random Fourier features,” in Advances in Neural Information Processing Systems, 2015,
pp. 1144–1152.
[50] Jean Honorio and Yu-Jun Li, “The error probability of random Fourier features is dimensionality independent,” arXiv preprint arXiv:1710.09953, 2017.
[51] Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh, “Random Fourier features for kernel ridge
regression: Approximation bounds and statistical guarantees,” in International Conference on Machine Learning, 2017, pp. 253–262.
[52] Francis Bach, “On the equivalence between kernel quadrature rules and random feature expansions,” Journal of Machine Learning Research, vol. 18, no. 1,
pp. 714–751, 2017.
[53] Alessandro Rudi and Lorenzo Rosasco, “Generalization properties of learning with random features,” in Advances in Neural Information Processing
Systems, 2017, pp. 3215–3225.
[54] Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, and Ohad Shamir, “Proving the lottery ticket hypothesis: Pruning is all you need,” arXiv preprint
arXiv:2002.00585, 2020.
[55] Salomon Bochner, Harmonic Analysis and the Theory of Probability, Courier Corporation, 2005.
[56] I. J. Schoenberg, “Positive definite functions on spheres,” Duke Mathematical Journal, vol. 9, no. 1, pp. 96–108, 1942.
[57] Alex J. Smola, Zoltan L. Ovari, and Robert C. Williamson, “Regularization with dot-product kernels,” in Advances in Neural Information Processing
Systems, 2001, pp. 308–314.
[58] Claus Müller, Spherical harmonics, vol. 17, Springer, 2006.
[59] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew, “Extreme learning machine: theory and applications,” Neurocomputing, vol. 70, no. 1-3, pp.
489–501, 2006.
[60] Youngmin Cho and Lawrence K Saul, “Kernel methods for deep learning,” in Advances in Neural Information Processing Systems, 2009, pp. 342–350.
[61] Christopher K.I. Williams, “Computing with infinite networks,” in Advances in Neural Information Processing Systems, 1997, pp. 295–301.
[62] Dan Hendrycks and Kevin Gimpel, “Gaussian error linear units (GELUs),” arXiv preprint arXiv:1606.08415, 2016.
[63] Amit Daniely, Roy Frostig, and Yoram Singer, “Toward deeper understanding of neural networks: The power of initialization and a dual view on
expressivity,” in Advances In Neural Information Processing Systems, 2016, pp. 2253–2261.
[64] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein, “Deep neural networks as Gaussian
Processes,” in International Conference on Learning Representations, 2018.
[65] Lenaic Chizat, Edouard Oyallon, and Francis Bach, “On lazy training in differentiable programming,” in Advances in Neural Information Processing
Systems, 2019, pp. 2933–2943.
[66] Alberto Bietti and Julien Mairal, “On the inductive bias of neural tangent kernels,” in Advances in Neural Information Processing Systems, 2019, pp.
12873–12884.
[67] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari, “Linearized two-layers neural networks in high dimension,” Annals of
Statistics, 2019.
[68] Alberto Bietti and Francis Bach, “Deep equals shallow for ReLU networks in kernel regimes,” in International Conference on Learning Representations,
2021.
[69] Marc G. Genton, “Classes of kernels for machine learning: a statistics perspective,” Journal of Machine Learning Research, vol. 2, pp. 299–312, 2001.
[70] Sami Remes, Markus Heinonen, and Samuel Kaski, “Non-stationary spectral kernels,” in Advances in Neural Information Processing Systems, 2017, pp.
4642–4651.
[71] Jean-Francois Ton, Seth Flaxman, Dino Sejdinovic, and Samir Bhatt, “Spatial mapping with Gaussian processes and nonstationary Fourier features,”
Spatial Statistics, vol. 28, pp. 59–78, 2018.
[72] Akiva M Yaglom, Correlation Theory of Stationary and Related Random Functions, Springer-Verlag, 1987.
[73] Ninh Pham and Rasmus Pagh, “Fast and scalable polynomial kernels via explicit feature maps,” in ACM International Conference on Knowledge Discovery
and Data Mining, 2013, pp. 239–247.
[74] Michela Meister, Tamas Sarlos, and David Woodruff, “Tight dimensionality reduction for sketching low degree polynomial kernels,” in Advances in Neural
Information Processing Systems, 2019, pp. 9475–9486.
[75] Haim Avron, Huy Nguyen, and David Woodruff, “Subspace embeddings for the polynomial kernel,” in Advances in Neural Information Processing Systems, 2014, pp. 2258–2266.
[76] David P Woodruff and Amir Zandieh, “Near input sparsity time kernel embeddings via adaptive sampling,” in International Conference on Machine
Learning, 2020, pp. 10324–10333.
[77] Fanghui Liu, Lei Shi, Xiaolin Huang, Jie Yang, and Johan A.K. Suykens, “A double-variational Bayesian framework in random Fourier features for
indefinite kernels,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 8, pp. 2965–2979, 2020.
[78] Fanghui Liu, Xiaolin Huang, Yingyi Chen, and Johan A.K. Suykens, “Fast learning in reproducing kernel Kreĭn spaces via signed measures,” in International Conference on Artificial Intelligence and Statistics, 2021, pp. 1–11.
[79] Ping Li, “Linearized GMM kernels and normalized random Fourier features,” in 23rd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, 2017, pp. 315–324.
[80] Krzysztof M. Choromanski, Mark Rowland, and Adrian Weller, “The unreasonable effectiveness of structured random orthogonal embeddings,” in
Advances in Neural Information Processing Systems, 2017, pp. 219–228.
[81] Alessandro Rudi, Daniele Calandriello, Luigi Carratino, and Lorenzo Rosasco, “On fast leverage score sampling and optimal learning,” in Advances in
Neural Information Processing Systems, 2018, pp. 5672–5682.
[82] Wei-Cheng Chang, Chun-Liang Li, Yiming Yang, and Barnabás Póczos, “Data-driven random Fourier features using Stein effect,” in 26th International
Joint Conference on Artificial Intelligence, 2017, pp. 1497–1503.
[83] Aman Sinha and John C. Duchi, “Learning kernels with random features,” in Advances in Neural Information Processing Systems, 2016, pp. 1298–1306.
[84] Chun-Liang Li, Wei-Cheng Chang, Youssef Mroueh, Yiming Yang, and Barnabas Poczos, “Implicit kernel learning,” in International Conference on
Artificial Intelligence and Statistics, 2019, pp. 2007–2016.
[85] Felix X. Yu, Sanjiv Kumar, Henry Rowley, and Shih-Fu Chang, “Compact nonlinear maps and circulant extensions,” arXiv preprint arXiv:1503.03893, 2015.
[86] Brian Bullins, Cyril Zhang, and Yi Zhang, “Not-so-random features,” in International Conference on Learning Representations, 2018.
[87] Andrew Gordon Wilson and Ryan Prescott Adams, “Gaussian process kernels for pattern discovery and extrapolation,” in International Conference on
Machine Learning, 2013, pp. 1067–1075.
[88] Zichao Yang, Andrew Wilson, Alex J. Smola, and Le Song, “À la carte–learning fast kernels,” in Artificial Intelligence and Statistics, 2015, pp. 1098–1106.
[89] Zheyang Shen, Markus Heinonen, and Samuel Kaski, “Harmonizable mixture kernels with variational Fourier features,” in International Conference on
Artificial Intelligence and Statistics. PMLR, 2019.
[90] Junier B. Oliva, Avinava Dubey, Andrew G. Wilson, Barnabás Póczos, Jeff Schneider, and Eric P. Xing, “Bayesian nonparametric kernel learning,” in
International Conference on Artificial Intelligence and Statistics, 2016, pp. 1078–1086.
[91] Mariusz Bojarski, Anna Choromanska, Krzysztof Choromanski, Francois Fagan, Cedric Gouy-Pailler, Anne Morvan, Nouri Sakr, Tamas Sarlos, and Jamal
Atif, “Structured adaptive and random spinners for fast machine learning computations,” in Artificial Intelligence and Statistics, 2017, pp. 1020–1029.
[92] Harald Niederreiter, Random number generation and quasi-Monte Carlo methods, vol. 63, SIAM, 1992.
[93] Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou, “Nyström method vs random Fourier features: a theoretical and empirical comparison,” in Advances in Neural Information Processing Systems, 2012, pp. 476–484.
[94] Bo Xie, Yingyu Liang, and Le Song, “Scale up nonlinear component analysis with doubly stochastic gradients,” in Advances in Neural Information
Processing Systems, 2015, pp. 2341–2349.
[95] Xiang Li, Bin Gu, Shuang Ao, Huaimin Wang, and Charles X. Ling, “Triply stochastic gradients on multiple kernel learning,” in Conference on Uncertainty in Artificial Intelligence, 2017, pp. 1–9.
[96] Krzysztof Choromanski, Mark Rowland, Wenyu Chen, and Adrian Weller, “Unifying orthogonal Monte Carlo methods,” in International Conference on
Machine Learning, 2019, pp. 1203–1212.
[97] Krzysztof Choromanski, Mark Rowland, Tamás Sarlós, Vikas Sindhwani, Richard Turner, and Adrian Weller, “The geometry of random features,” in
International Conference on Artificial Intelligence and Statistics, 2018, pp. 1–9.
[98] Xiaoyun Li and Ping Li, “Quantization algorithms for random Fourier features,” in International Conference on Machine Learning, 2021, pp. 6369–6380.
[99] Johann S. Brauchart and Peter J. Grabner, “Distributing many points on spheres: minimal energy and designs,” Journal of Complexity, vol. 31, no. 3, pp.
293–326, 2015.
[100] Yueming Lyu, Yuan Yuan, and Ivor Tsang, “Subgroup-based rank-1 lattice quasi-Monte Carlo,” in Advances in Neural Information Processing Systems, 2020.
[101] Gwynne Evans, Practical numerical integration, Wiley New York, 1993.
[102] Alan Genz and John Monahan, “Stochastic integration rules for infinite regions,” SIAM Journal on Scientific Computing, vol. 19, no. 2, pp. 426–439, 1998.
[103] Alan Genz and John Monahan, “A stochastic algorithm for high-dimensional integrals over unbounded regions with Gaussian weight,” Journal of Computational and Applied Mathematics, vol. 112, no. 1-2, pp. 71–81, 1999.
[104] Florian Heiss and Viktor Winschel, “Likelihood approximation by numerical integration on sparse grids,” Journal of Econometrics, vol. 144, no. 1, pp.
62–80, 2008.
[105] Ayoub Belhadji, Rémi Bardenet, and Pierre Chainais, “Kernel quadrature with DPPs,” in Advances in Neural Information Processing Systems, 2019, pp. 1–11.
[106] Fanghui Liu, Xiaolin Huang, Yudong Chen, and Johan A.K. Suykens, “Towards a unified quadrature framework for large-scale kernel machines,” arXiv
preprint arXiv:2011.01668, 2020.
[107] François-Xavier Briol, Chris J Oates, Jon Cockayne, Wilson Ye Chen, and Mark Girolami, “On the sampling problem for kernel quadrature,” in
International Conference on Machine Learning, 2017, pp. 586–595.
[108] Bertrand Gauthier and Johan A.K. Suykens, “Optimal quadrature-sparsification for integral operator approximation,” SIAM Journal on Scientific Computing,
vol. 40, no. 5, pp. A3636–A3674, 2018.
[109] Yinsong Wang and Shahin Shahrampour, “A general scoring rule for randomized kernel approximation with application to canonical correlation analysis,”
arXiv preprint arXiv:1910.05384, 2019.
[110] Tong Zhang, “Learning bounds for kernel regression using effective data dimensionality,” Neural Computation, vol. 17, no. 9, pp. 2077–2098, 2005.
[111] Francis Bach, “Sharp analysis of low-rank kernel matrix approximations,” in Conference on Learning Theory, 2013, pp. 185–209.
[112] Hayata Yamasaki, Sathyawageeswar Subramanian, Sho Sonoda, and Masato Koashi, “Fast quantum algorithm for learning with optimized random features,”
in Advances in Neural Information Processing Systems, 2020, pp. 1–10.
[113] Ahmed Alaoui and Michael W Mahoney, “Fast randomized kernel ridge regression with statistical guarantees,” in Advances in Neural Information
Processing Systems, 2015, pp. 775–783.
[114] Daniele Calandriello, Alessandro Lazaric, and Michal Valko, “Distributed adaptive sampling for kernel matrix approximation,” in Artificial Intelligence and
Statistics, 2017, pp. 1421–1429.
[115] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh, “Two-stage learning kernel algorithms,” in International Conference on Machine Learning,
2010, pp. 239–246.
[116] Trevor Campbell and Tamara Broderick, “Bayesian coreset construction via greedy iterative geodesic ascent,” in International Conference on Machine
Learning, 2018, pp. 698–706.
[117] Trevor Campbell and Tamara Broderick, “Automated scalable Bayesian inference via Hilbert coresets,” Journal of Machine Learning Research, vol. 20, no.
1, pp. 551–588, 2019.
[118] Raffay Hamid, Ying Xiao, Alex Gittens, and Dennis Decoste, “Compact random feature maps,” in International Conference on Machine Learning, 2014,
pp. 19–27.
[119] Ali Rahimi and Benjamin Recht, “Uniform approximation of functions with random bases,” in Annual Allerton Conference on Communication, Control,
and Computing. IEEE, 2008, pp. 555–561.
[120] Mina Ghashami, Daniel J. Perry, and Jeff Phillips, “Streaming kernel principal component analysis,” in Artificial Intelligence and Statistics, 2016, pp. 1365–1374.
[121] Bharath Sriperumbudur and Nicholas Sterge, “Statistical consistency of kernel PCA with random features,” arXiv preprint arXiv:1706.06296, 2017.
[122] Enayat Ullah, Poorya Mianjy, Teodor Vanislavov Marinov, and Raman Arora, “Streaming kernel PCA with Õ(√n) random features,” in Advances in Neural Information Processing Systems, 2018, pp. 7311–7321.
[123] Felipe Cucker and Dingxuan Zhou, Learning theory: an approximation theory viewpoint, vol. 24, Cambridge University Press, 2007.
[124] Ingo Steinwart and Andreas Christmann, Support Vector Machines, Springer Science and Business Media, 2008.
[125] Gilles Blanchard and Nicole Krämer, “Optimal learning rates for kernel conjugate gradient regression,” in Advances in Neural Information Processing
Systems, 2010, pp. 226–234.
[126] Andrea Caponnetto and Ernesto De Vito, “Optimal rates for the regularized least-squares algorithm,” Foundations of Computational Mathematics, vol. 7,
no. 3, pp. 331–368, 2007.
[127] John Shawe-Taylor, Chris Williams, Nello Cristianini, and Jaz Kandola, “On the eigenspectrum of the Gram matrix and its relationship to the operator eigenspectrum,” in International Conference on Algorithmic Learning Theory. Springer, 2002, pp. 23–40.
[128] Steve Smale and Ding-Xuan Zhou, “Learning theory estimates via integral operators and their approximations,” Constructive Approximation, vol. 26, no. 2,
pp. 153–172, 2007.
[129] Zheng-Chu Guo and Lei Shi, “Optimal rates for coefficient-based regularized regression,” Applied and Computational Harmonic Analysis, vol. 47, no. 3,
pp. 662–701, 2019.
[130] Shao-Bo Lin, Xin Guo, and Ding-Xuan Zhou, “Distributed learning with regularized least squares,” Journal of Machine Learning Research, vol. 18, no. 1,
pp. 3202–3232, 2017.
[131] Vladimir Koltchinskii, Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems, vol. 2033, Springer Science & Business Media,
2011.
[132] Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson, “Local Rademacher complexities,” Annals of Statistics, vol. 33, no. 4, pp. 1497–1537, 2005.
[133] Zhu Li, Jean-Francois Ton, Dino Oglic, and Dino Sejdinovic, “Towards a unified analysis of random Fourier features,” Journal of Machine Learning Research, vol. 22, no. 108, pp. 1–51, 2021.
[134] Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco, “Learning with SGD and random features,” in Advances in Neural Information Processing
Systems, 2018, pp. 10212–10223.
[135] Ingo Steinwart and Clint Scovel, “Fast rates for support vector machines using Gaussian kernels,” Annals of Statistics, vol. 35, no. 2, pp. 575–607, 2007.
[136] Shusen Wang, “Simple and almost assumption-free out-of-sample bound for random feature mapping,” arXiv preprint arXiv:1909.11207, 2019.
[137] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[138] Alex Krizhevsky and Geoffrey Hinton, “Learning multiple layers of features from tiny images,” Technical report, University of Toronto, 2009.
[139] Gaëlle Loosli, Stéphane Canu, and Léon Bottou, “Training invariant support vector machines using selective sampling,” Large scale kernel machines, vol.
2, 2007.
[140] Sanjeev Arora, Simon S Du, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang, and Dingli Yu, “Harnessing the power of infinitely wide deep nets on
small-data tasks,” in International Conference on Learning Representations, 2020.
[141] Sergey Ioffe and Christian Szegedy, “Batch normalization: accelerating deep network training by reducing internal covariate shift,” in International
Conference on Machine Learning, 2015, pp. 448–456.
[142] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li, “ImageNet: A large-scale hierarchical image database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[143] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani, “Surprises in high-dimensional ridgeless least squares interpolation,” arXiv
preprint arXiv:1903.08560, 2019.
[144] Song Mei and Andrea Montanari, “The generalization error of random features regression: Precise asymptotics and double descent curve,” arXiv preprint
arXiv:1908.05355, 2019.
[145] Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai, “On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels,” in
Annual Conference on Learning Theory, 2019, pp. 1–32.
[146] Fanghui Liu, Zhenyu Liao, and Johan A.K. Suykens, “Kernel regression in high dimensions: Refined analysis beyond double descent,” in International
Conference on Artificial Intelligence and Statistics, 2021, pp. 1–11.
[147] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever, “Deep double descent: Where bigger models and more data
hurt,” in International Conference on Learning Representations, 2019.
[148] Mikhail Belkin, Siyuan Ma, and Soumik Mandal, “To understand deep learning we need to understand kernel learning,” in International Conference on
Machine Learning, 2018, pp. 541–549.
[149] Felipe Cucker and Steve Smale, “On the mathematical foundations of learning,” Bulletin of the American Mathematical Society, vol. 39, no. 1, pp. 1–49, 2002.
[150] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang, “Learning and generalization in overparameterized neural networks, going beyond two layers,” in Advances in Neural Information Processing Systems, 2019, pp. 6158–6169.
[151] Terence Tao, Topics in random matrix theory, American Mathematical Society, 2012.
[152] Jeffrey Pennington and Pratik Worah, “Nonlinear random matrix theory for deep learning,” in Advances in Neural Information Processing Systems, 2017,
pp. 2634–2643.
[153] Zhenyu Liao and Romain Couillet, “On the spectrum of random features maps of high dimensional data,” in International Conference on Machine
Learning, 2018, pp. 3063–3071.
[154] Zhenyu Liao, Romain Couillet, and Michael Mahoney, “A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent,” in Advances in Neural Information Processing Systems, 2020.
[155] Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Denny Wu, and Tianzong Zhang, “Generalization of two-layer neural networks: an asymptotic viewpoint,” in
International Conference on Learning Representations, 2020, pp. 1–8.
[156] Stéphane d’Ascoli, Maria Refinetti, Giulio Biroli, and Florent Krzakala, “Double trouble in double descent: Bias and variance(s) in the lazy regime,” arXiv
preprint arXiv:2003.01054, 2020.
[157] Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, and Lenka Zdeborová, “Generalisation error in learning with random features and the
hidden manifold model,” in International Conference on Machine Learning, 2020, pp. 3452–3462.
[158] Ben Adlam and Jeffrey Pennington, “Understanding double descent requires a fine-grained bias-variance decomposition,” in Advances in Neural Information Processing Systems, 2020.
[159] Oussama Dhifallah and Yue M Lu, “A precise performance analysis of learning with random features,” arXiv preprint arXiv:2008.11904, 2020.
[160] Hong Hu and Yue M Lu, “Universality laws for high-dimensional learning with random features,” arXiv preprint arXiv:2009.07669, 2020.
[161] Arthur Jacot, Berfin Şimşek, Francesco Spadaro, Clément Hongler, and Franck Gabriel, “Implicit regularization of random feature models,” in International
Conference on Machine Learning, 2020, pp. 4631–4640.
[162] Mikhail Belkin, Daniel Hsu, and Ji Xu, “Two models of double descent for weak features,” SIAM Journal on Mathematics of Data Science, vol. 2, no. 4,
pp. 1167–1180, 2020.
[163] Jason W Rocks and Pankaj Mehta, “Memorizing without overfitting: Bias, variance, and interpolation in over-parameterized models,” arXiv preprint
arXiv:2010.13933, 2020.
[164] Licong Lin and Edgar Dobriban, “What causes the test error? Going beyond bias-variance via ANOVA,” arXiv preprint arXiv:2010.05170, 2020.
[165] Marc Mézard, Giorgio Parisi, and Miguel Angel Virasoro, Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications,
vol. 9, World Scientific Publishing Company, 1987.
[166] Christos Thrampoulidis, Samet Oymak, and Babak Hassibi, “Regularized linear regression: A precise analysis of the estimation error,” in Conference on
Learning Theory, 2015, pp. 1683–1709.
[167] Andrea Montanari, Feng Ruan, Youngtak Sohn, and Jun Yan, “The generalization error of max-margin linear classifiers: High-dimensional asymptotics in
the overparametrized regime,” arXiv preprint arXiv:1911.01544, 2019.
[168] Tengyuan Liang and Pragya Sur, “A precise high-dimensional asymptotic theory for boosting and min-ℓ1-norm interpolated classifiers,” arXiv preprint arXiv:2002.01586, 2020.
[169] Adel Javanmard, Mahdi Soltanolkotabi, and Hamed Hassani, “Precise tradeoffs in adversarial training for linear regression,” in Conference on Learning
Theory, 2020, pp. 2034–2078.
[170] Cosme Louart, Zhenyu Liao, and Romain Couillet, “A random matrix approach to neural networks,” The Annals of Applied Probability, vol. 28, no. 2, pp.
1190–1248, 2018.
[171] Song Mei, Theodor Misiakiewicz, and Andrea Montanari, “Generalization error of random features and kernel methods: hypercontractivity and kernel
matrix concentration,” arXiv preprint arXiv:2101.10588, 2021.
[172] Gilad Yehudai and Ohad Shamir, “On the power and limitations of random features for understanding neural networks,” in Advances in Neural Information
Processing Systems, 2019, pp. 6594–6604.
[173] Yitong Sun, Anna Gilbert, and Ambuj Tewari, “On the approximation properties of random ReLU features,” arXiv preprint arXiv:1810.04374, 2018.
[174] Jonathan Frankle and Michael Carbin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” in International Conference on Learning
Representations, 2019.
[175] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang, “Network trimming: A data-driven neuron pruning approach towards efficient deep
architectures,” arXiv preprint arXiv:1607.03250, 2016.
[176] Giacomo Meanti, Luigi Carratino, Lorenzo Rosasco, and Alessandro Rudi, “Kernel methods through the roof: Handling billions of points efficiently,” in
Advances in Neural Information Processing Systems, 2020.
[177] Lin Chen, Yifei Min, Mikhail Belkin, and Amin Karbasi, “Multiple descent: Design your own generalization curve,” arXiv preprint arXiv:2008.01036,
2020.
[178] Denny Wu and Ji Xu, “On the optimal weighted ℓ2 regularization in overparameterized linear regression,” in Advances in Neural Information Processing Systems, 2020, pp. 1–11.
[179] Dmitry Kobak, Jonathan Lomond, and Benoit Sanchez, “The optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the
implicit ridge regularization,” Journal of Machine Learning Research, vol. 21, no. 169, pp. 1–16, 2020.
[180] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari, “Limitations of lazy training of two-layers neural network,” in Advances in
Neural Information Processing Systems, 2019, pp. 9108–9118.