
Journal of Machine Learning Research 25 (2024) 1-117 Submitted 1/22; Revised 7/23; Published 4/24

Classification with Deep Neural Networks and Logistic Loss

Zihan Zhang [email protected]


Shanghai Center for Mathematical Sciences
Fudan University, Shanghai 200433, China
School of Data Science
City University of Hong Kong, Kowloon, Hong Kong
Lei Shi∗ [email protected]
School of Mathematical Sciences and Shanghai
Key Laboratory for Contemporary Applied Mathematics
Fudan University, Shanghai 200433, China
Shanghai Artificial Intelligence Laboratory
701 Yunjin Road, Shanghai 200232, China
Ding-Xuan Zhou [email protected]
School of Mathematics and Statistics
University of Sydney, Sydney NSW 2006, Australia

Editor: Maxim Raginsky

Abstract
Deep neural networks (DNNs) trained with the logistic loss (also known as the cross entropy
loss) have made impressive advancements in various binary classification tasks. Despite the
considerable success in practice, generalization analysis for binary classification with deep
neural networks and the logistic loss remains scarce. The unboundedness of the target
function for the logistic loss in binary classification is the main obstacle to deriving sat-
isfactory generalization bounds. In this paper, we aim to fill this gap by developing a
novel theoretical analysis and using it to establish tight generalization bounds for training
fully connected ReLU DNNs with logistic loss in binary classification. Our generalization
analysis is based on an elegant oracle-type inequality which enables us to deal with the
boundedness restriction of the target function. Using this oracle-type inequality, we es-
tablish generalization bounds for fully connected ReLU DNN classifiers fˆnFNN trained by
empirical logistic risk minimization with respect to i.i.d. samples of size n, which lead to
sharp rates of convergence as n → ∞. In particular, we obtain optimal convergence rates
for fˆnFNN (up to some logarithmic factor) only requiring the Hölder smoothness of the con-
ditional class probability η of data. Moreover, we consider a compositional assumption that
requires η to be the composition of several vector-valued multivariate functions of which
each component function is either a maximum value function or a Hölder smooth function
only depending on a small number of its input variables. Under this assumption, we can
even derive optimal convergence rates for fˆnFNN (up to some logarithmic factor) which are
independent of the input dimension of data. This result explains why in practice DNN
classifiers can overcome the curse of dimensionality and perform well in high-dimensional
classification problems. Furthermore, we establish dimension-free rates of convergence un-
der other circumstances such as when the decision boundary is piecewise smooth and the
input data are bounded away from it. Besides the novel oracle-type inequality, the sharp

∗. Corresponding author

© 2024 Zihan Zhang, Lei Shi and Ding-Xuan Zhou.


License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided
at http://jmlr.org/papers/v25/22-0049.html.

convergence rates presented in our paper also owe to a tight error bound for approximating
the natural logarithm function near zero (where it is unbounded) by ReLU DNNs. In ad-
dition, we justify our claims for the optimality of rates by proving corresponding minimax
lower bounds. All these results are new in the literature and will deepen our theoretical
understanding of classification with deep neural networks.
Keywords: deep learning; deep neural networks; binary classification; logistic loss; gen-
eralization analysis

1. Introduction
In this paper, we study the binary classification problem using deep neural networks (DNNs)
with the rectified linear unit (ReLU) activation function. Deep learning based on DNNs
has recently achieved remarkable success in a wide range of classification tasks including
text categorization (Iyyer et al., 2015), image classification (Krizhevsky et al., 2012), and
speech recognition (Hinton et al., 2012), which has become a cutting-edge learning method.
ReLU is one of the most popular activation functions, as scalable computing and stochastic
optimization techniques can facilitate the training of ReLU DNNs (Han et al., 2015; Kingma
and Ba, 2015). Given a positive integer d, consider the binary classification problem where
we regard [0, 1]d as the input space and {−1, 1} as the output space representing the two
labels of input data. Let P be a Borel probability measure on [0, 1]d × {−1, 1}, regarded as
the data distribution (i.e., the joint distribution of the input and output data). The goal
of classification is to learn a real-valued function from a hypothesis space F (i.e., a set of
candidate functions) based on the sample of the distribution P . The predictive performance
of any (deterministic) real-valued function f which has a Borel measurable restriction to
[0, 1]^d (i.e., the domain of f contains [0, 1]^d, and [0, 1]^d ∋ x ↦ f(x) ∈ R is Borel measurable)
is measured by the misclassification error of f with respect to P , given by
R_P(f) := P({ (x, y) ∈ [0, 1]^d × {−1, 1} | y ≠ sgn(f(x)) }),   (1.1)

or equivalently, the excess misclassification error

E_P(f) := R_P(f) − inf{ R_P(g) | g : [0, 1]^d → R is Borel measurable }.   (1.2)

Here sgn(·) denotes the sign function which is defined as sgn(t) = 1 if t ≥ 0 and sgn(t) = −1
otherwise. The misclassification error RP (f ) characterizes the probability that the binary
classifier sgn ◦ f makes a wrong prediction, where ◦ means function composition, and by a
binary classifier (or classifier for short) we mean a {−1, 1}-valued function whose domain
contains the input space [0, 1]d . Since any real-valued function f with its domain containing
[0, 1]d determines a classifier sgn ◦ f , we in this paper may call such a function f a classifier
as well.
Note that the function we learn in a classification problem is based on the sample,
meaning that it is not deterministic but a random function. Thus we take the expectation
to measure its efficiency using the (excess) misclassification error. More specifically, let
{(Xi , Yi )}ni=1 be an independent and identically distributed (i.i.d.) sample of the distribution
P and the hypothesis space F be a set of real-valued functions which have a Borel measurable
restriction to [0, 1]d . We desire to construct an F-valued statistic fˆn from the sample


{(X_i, Y_i)}_{i=1}^n and the classification performance of f̂_n can be characterized by upper bounds
for the expectation of the excess misclassification error E[E_P(f̂_n)]. One possible way to
produce f̂_n is the empirical risk minimization with some loss function φ : R → [0, ∞),
which is given by

f̂_n ∈ arg min_{f ∈ F} (1/n) Σ_{i=1}^n φ(Y_i f(X_i)).   (1.3)

If fˆn satisfies (1.3), then we will call fˆn an empirical φ-risk minimizer (ERM with respect
to φ, or φ-ERM) over F. For any real-valued function f which has a Borel measurable
restriction to [0, 1]^d, the φ-risk and excess φ-risk of f with respect to P, denoted by R^φ_P(f)
and E^φ_P(f) respectively, are defined as

R^φ_P(f) := ∫_{[0,1]^d × {−1,1}} φ(y f(x)) dP(x, y)   (1.4)

and

E^φ_P(f) := R^φ_P(f) − inf{ R^φ_P(g) | g : [0, 1]^d → R is Borel measurable }.   (1.5)

To derive upper bounds for E[E_P(f̂_n)], we can first establish upper bounds for E[E^φ_P(f̂_n)],
which are typically controlled by two parts, namely the sample error and the approximation
error (e.g., cf. Chapter 2 of Cucker and Zhou (2007)). Then we are able to bound E[E_P(f̂_n)]
by E[E^φ_P(f̂_n)] through the so-called calibration inequality (also known as Comparison Theorem,
see, e.g., Theorem 10.5 of Cucker and Zhou (2007) and Theorem 3.22 of Steinwart
and Christmann (2008)). In this paper, we will call any upper bound for E[E_P(f̂_n)] or
E[E^φ_P(f̂_n)] a generalization bound.
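To make these quantities concrete, the following minimal Python sketch (the function names and the toy finite hypothesis class are our own illustrative choices, not part of the paper) computes the empirical logistic risk (1/n) Σ_{i=1}^n φ(Y_i f(X_i)) and selects an empirical φ-risk minimizer over a small finite class of candidates, mirroring (1.3):

```python
import numpy as np

def logistic_loss(t):
    # phi(t) = log(1 + exp(-t)), computed stably via logaddexp
    return np.logaddexp(0.0, -t)

def empirical_phi_risk(f, X, Y):
    # (1/n) * sum_i phi(Y_i * f(X_i))
    return float(np.mean(logistic_loss(Y * f(X))))

def erm_over_finite_class(candidates, X, Y):
    # Empirical phi-risk minimizer over a finite hypothesis class (illustrative stand-in for F)
    risks = [empirical_phi_risk(f, X, Y) for f in candidates]
    best = int(np.argmin(risks))
    return candidates[best], risks[best]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 500, 2
    X = rng.uniform(size=(n, d))                          # inputs in [0,1]^d
    eta = 1.0 / (1.0 + np.exp(-4.0 * (X[:, 0] - 0.5)))    # a smooth conditional class probability
    Y = np.where(rng.uniform(size=n) < eta, 1.0, -1.0)    # labels in {-1, +1}

    # A toy finite hypothesis class: a few rescaled linear scores
    candidates = [lambda Z, a=a: 4.0 * (Z[:, 0] - 0.5) * a for a in (0.25, 0.5, 1.0, 2.0)]
    f_hat, risk = erm_over_finite_class(candidates, X, Y)
    print("empirical logistic risk of the ERM:", risk)
```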
Note that lim_{n→∞} (1/n) Σ_{i=1}^n φ(Y_i f(X_i)) = R^φ_P(f) almost surely for all measurable f. Therefore,
the empirical φ-risk minimizer f̂_n defined in (1.3) can be regarded as an estimation
of the so-called target function which minimizes the φ-risk R^φ_P over all Borel measurable
functions f. The target function can be defined pointwise. Rigorously, we say a measurable
function f* : [0, 1]^d → [−∞, ∞] is a target function of the φ-risk under the distribution P
if for P_X-almost all x ∈ [0, 1]^d the value of f* at x minimizes ∫_{{−1,1}} φ(yz) dP(y|x) over all
z ∈ [−∞, ∞], i.e.,

f*(x) ∈ arg min_{z ∈ [−∞,∞]} ∫_{{−1,1}} φ(yz) dP(y|x)   for P_X-almost all x ∈ [0, 1]^d,   (1.6)

where φ(yz) := lim_{t→yz} φ(t) if z ∈ {−∞, ∞}, P_X is the marginal distribution of P on [0, 1]^d,
and P(·|x) is the regular conditional distribution of P on {−1, 1} given x (cf. Lemma A.3.16
in Steinwart and Christmann (2008)). In this paper, we will use f*_{φ,P} to denote the target
function of the φ-risk under P. Note that f*_{φ,P} may take values in {−∞, ∞}, and f*_{φ,P}

minimizes R^φ_P in the sense that

R^φ_P(f*_{φ,P}) := ∫_{[0,1]^d × {−1,1}} φ(y f*_{φ,P}(x)) dP(x, y)
               = inf{ R^φ_P(g) | g : [0, 1]^d → R is Borel measurable },   (1.7)

where φ(y f*_{φ,P}(x)) := lim_{t → y f*_{φ,P}(x)} φ(t) if y f*_{φ,P}(x) ∈ {−∞, ∞} (cf. Lemma C.1).
In practice, the choice of the loss function φ varies, depending on the classification
method used. For neural network classification, although other loss functions have been
investigated, the logistic loss φ(t) = log(1 + e−t ), also known as the cross entropy loss,
is most commonly used (see, e.g., Janocha and Czarnecki (2016); Hui and Belkin (2021);
Hu et al. (2022b)). We now explain why the logistic loss is related to cross entropy. Let
X be an arbitrary nonempty countable set equipped with the sigma algebra consisting of
all its subsets. For any two probability measures Q_0 and Q on X, the cross entropy of Q
relative to Q_0 is defined as H(Q_0, Q) := − Σ_{z ∈ X} Q_0({z}) · log Q({z}), where log 0 := −∞ and
0 · (−∞) := 0 (cf. (2.112) of Murphy (2012)). One can show that H(Q_0, Q) ≥ H(Q_0, Q_0) ≥ 0
and

{Q_0} = arg min_Q H(Q_0, Q)   if H(Q_0, Q_0) < ∞.

Therefore, roughly speaking, the cross entropy H(Q0 , Q) characterizes how close Q is to Q0 .
For any a ∈ [0, 1], let Ma denote the probability measure on {−1, 1} with Ma ({1}) = a and
Ma ({−1}) = 1 − a. Recall that any real-valued Borel measurable function f defined on the
input space [0, 1]d can induce a classifier sgn ◦ f . We can interpret the construction of the
classifier sgn ◦ f from f as follows. Consider the logistic function

l̄ : R → (0, 1),  z ↦ 1/(1 + e^{−z}),   (1.8)

which is strictly increasing. For each x ∈ [0, 1]d , f induces a probability measure Ml̄(f (x)) on
{−1, 1} via ¯l, which we regard as a prediction made by f of the distribution of the output
data (i.e., the two labels +1 and −1) given the input data x. Observe that the larger f (x)
is, the closer the number ¯l(f (x)) gets to 1, and the more likely the event {1} occurs under
the distribution Ml̄(f (x)) . If Ml̄(f (x)) ({+1}) ≥ Ml̄(f (x)) ({−1}), then +1 is more likely to
appear given the input data x and we thereby think of f as classifying the input x as class
+1. Otherwise, when Ml̄(f (x)) ({+1}) < Ml̄(f (x)) ({−1}), x is classified as −1. In this way,
f induces a classifier given by
x ↦ { +1, if M_{l̄(f(x))}({1}) ≥ M_{l̄(f(x))}({−1}),
      −1, if M_{l̄(f(x))}({1}) < M_{l̄(f(x))}({−1}).     (1.9)

Indeed, the classifier in (1.9) is exactly sgn ◦ f . Thus we can also measure the predic-
tive performance of f in terms of Ml̄(f (·)) (instead of sgn ◦ f ). To this end, one natural
way is to compute the average “extent” of how close Ml̄(f (x)) is to the true conditional
distribution of the output given the input x. If we use the cross entropy to characterize
this “extent”, then its average, which measures the classification performance of f , will be


∫_{[0,1]^d} H(Y_x, M_{l̄(f(x))}) dX(x), where X is the distribution of the input data, and Y_x is the
conditional distribution of the output data given the input x. However, one can show that
this quantity is just the logistic risk of f. Indeed,

∫_{[0,1]^d} H(Y_x, M_{l̄(f(x))}) dX(x)
  = ∫_{[0,1]^d} ( −Y_x({1}) · log(M_{l̄(f(x))}({1})) − Y_x({−1}) log(M_{l̄(f(x))}({−1})) ) dX(x)
  = ∫_{[0,1]^d} ( −Y_x({1}) · log(l̄(f(x))) − Y_x({−1}) log(1 − l̄(f(x))) ) dX(x)
  = ∫_{[0,1]^d} ( Y_x({1}) · log(1 + e^{−f(x)}) + Y_x({−1}) log(1 + e^{f(x)}) ) dX(x)
  = ∫_{[0,1]^d} ( Y_x({1}) · φ(f(x)) + Y_x({−1}) φ(−f(x)) ) dX(x)
  = ∫_{[0,1]^d} ∫_{{−1,1}} φ(y f(x)) dY_x(y) dX(x) = ∫_{[0,1]^d × {−1,1}} φ(y f(x)) dP(x, y) = R^φ_P(f),

where φ is the logistic loss and P is the joint distribution of the input and output data, i.e.,
dP (x, y) = dYx (y)dX (x). Therefore, the average cross entropy of the distribution Ml̄(f (x))
induced by f to the true conditional distribution of the output data given the input data x
is equal to the logistic risk of f with respect to the joint distribution of the input and output
data, which explains why the logistic loss is also called the cross entropy loss. Compared
with the misclassification error RP (f ) which measures the performance of the classifier f (x)
in correctly generating the class label sgn(f (x)) that equals the most probable class label
of the input data x (i.e., the label yx ∈ {−1, +1} such that Yx ({yx }) ≥ Yx ({−yx })), the
logistic risk RφP (f ) measures how close the induced distribution Ml̄(f (x)) is to the true con-
ditional distribution Yx . Consequently, in comparison with the (excess) misclassification
error, the (excess) logistic risk is also a reasonable quantity for characterizing the perfor-
mance of classifiers but from a different angle. When classifying with the logistic loss, we
are essentially learning the conditional distribution Yx through the cross entropy and the
logistic function ¯l. Moreover, for any classifier fˆn : [0, 1]d → R trained with logistic loss,
the composite function ¯l ◦ fˆn (x) = Ml̄◦fˆn (x) ({1}) yields an estimation of the conditional
class probability function η(x) := P ({1} |x) = Yx ({1}). Therefore, classifiers trained with
logistic loss essentially capture more information about the exact value of the conditional
class probability function η(x) than we actually need to minimize the misclassification error
RP (·), since the knowledge of the sign of 2η(x) − 1 is already sufficient for minimizing RP (·)
(see (2.49)). In addition, we point out that the excess logistic risk EPφ (f ) is actually the
average Kullback-Leibler divergence (KL divergence) from M_{l̄(f(x))} to Y_x. Here for any two
probability measures Q_0 and Q on some countable set X, the KL divergence from Q to Q_0
is defined as KL(Q_0 ‖ Q) := Σ_{z ∈ X} Q_0({z}) · log( Q_0({z}) / Q({z}) ), where Q_0({z}) · log( Q_0({z}) / Q({z}) ) := 0 if
Q_0({z}) = 0 and Q_0({z}) · log( Q_0({z}) / Q({z}) ) := ∞ if Q_0({z}) > 0 = Q({z}) (cf. (2.111) of Murphy
(2012) or Definition 2.5 of Tsybakov (2009)).
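These identities can be checked numerically at a single point x. The short sketch below (function names are ours and purely illustrative) verifies that the pointwise cross entropy H(Y_x, M_{l̄(z)}) equals Y_x({1})φ(z) + Y_x({−1})φ(−z), and that the pointwise excess logistic risk equals the KL divergence from M_{l̄(z)} to Y_x:

```python
import numpy as np

def phi(t):
    # logistic loss phi(t) = log(1 + exp(-t))
    return np.logaddexp(0.0, -t)

def lbar(z):
    # logistic function l(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(eta, p):
    # H(Y_x, M_p) with Y_x({1}) = eta and M_p({1}) = p
    return -eta * np.log(p) - (1.0 - eta) * np.log(1.0 - p)

def kl(eta, p):
    # KL(Y_x || M_p), using the convention 0 * log(0/q) := 0
    out = 0.0
    if eta > 0:
        out += eta * np.log(eta / p)
    if eta < 1:
        out += (1.0 - eta) * np.log((1.0 - eta) / (1.0 - p))
    return out

eta, z = 0.7, 1.3                                     # arbitrary conditional probability and score f(x)
lhs = cross_entropy(eta, lbar(z))
rhs = eta * phi(z) + (1.0 - eta) * phi(-z)            # pointwise logistic risk
print(np.isclose(lhs, rhs))                           # True: cross entropy = logistic risk

z_star = np.log(eta / (1.0 - eta))                    # pointwise minimizer log(eta/(1-eta))
excess = rhs - (eta * phi(z_star) + (1.0 - eta) * phi(-z_star))
print(np.isclose(excess, kl(eta, lbar(z))))           # True: excess logistic risk = KL divergence
```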
In this work, we focus on the generalization analysis of binary classification with em-
pirical risk minimization over ReLU DNNs. That is, the classifiers under consideration


are produced by algorithm (1.3) in which the hypothesis space F is generated by deep
ReLU networks. Based on recent studies in complexity and approximation theory of DNNs
(e.g., Bartlett et al. (2019); Petersen and Voigtlaender (2018); Yarotsky (2017)), several re-
searchers have derived generalization bounds for φ-ERMs over DNNs in binary classification
problems (Farrell et al., 2021; Kim et al., 2021; Shen et al., 2021). However, to the best of
our knowledge, the existing literature fails to establish satisfactory generalization analysis

if the target function fφ,P is unbounded. In particular, take φ to be the logistic loss, i.e.,
P -a.s. η
φ(t) = log(1 + e−t ). The target function is then explicitly given by fφ,P ∗ ==X==== log 1−η
with η(x) := P ({1} |x) (x ∈ [0, 1]d ) being the conditional class probability function of P
(cf. Lemma C.2), where recall that P (·|x) denotes the conditional probability of P on
{−1, 1} given x. Hence fφ,P ∗ is unbounded if η can be arbitrarily close to 0 or 1, which
happens in many practical problems (see Section 3 for more details). For instance, we have
η(x) = 0 or η(x) = 1 for a noise-free distribution P , implying fφ,P ∗ (x) = ∞ for P -almost
X
all x ∈ [0, 1] , where PX is the marginal distribution of P on [0, 1]d . DNNs trained with
d

the logistic loss perform efficiently in various image recognition applications as the smooth-
ness of the loss function can further simplify the optimization procedure (Goodfellow et al.,
2016; Krizhevsky et al., 2012; Simonyan and Zisserman, 2015). However, due to the un-
boundedness of fφ,P ∗ , the existing generalization analysis for classification with DNNs and

the logistic loss either results in slow rates of convergence (e.g., the logarithmic rate in Shen
et al. (2021)) or can only be conducted under very restrictive conditions (e.g., Kim et al.
(2021); Farrell et al. (2021)) (cf. the discussions in Section 3). The unboundedness of the
target function brings several technical difficulties to the generalization analysis. Indeed, if

f*_{φ,P} is unbounded, it cannot be approximated uniformly by continuous functions on [0, 1]^d,
which poses extra challenges for bounding the approximation error. Besides, previous sam-
ple error estimates based on concentration techniques are no longer valid because these
estimates usually require involved random variables to be bounded or to satisfy strong tail
conditions (cf. Chapter 2 of Wainwright (2019)). Therefore, in contrast to empirical stud-
ies, the previous strategies for generalization analysis could not demonstrate the efficiency
of classification with DNNs and the logistic loss.
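As a quick numerical illustration of this obstacle (a sketch with our own variable names, not taken from the paper), the target function log(η/(1 − η)) already exceeds any fixed bound once η gets close enough to 0 or 1:

```python
import numpy as np

eta_values = np.array([0.5, 0.9, 0.99, 0.999, 1e-6])    # conditional class probabilities
f_star = np.log(eta_values / (1.0 - eta_values))         # target function of the logistic risk
for eta, t in zip(eta_values, f_star):
    print(f"eta = {eta:>8}:  f*_{{phi,P}} = {t:+.3f}")
# As eta -> 0 or 1, |f*| -> infinity, so no uniform bound (and no uniform approximation
# by bounded continuous functions) is possible on regions where eta is close to 0 or 1.
```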
To fill this gap, in this paper we develop a novel theoretical analysis to establish tight
generalization bounds for training DNNs with ReLU activation function and logistic loss in
binary classification. Our main contributions are summarized as follows.
• For φ being the logistic loss, we establish an oracle-type inequality to bound the
excess φ-risk without using the explicit form of the target function f*_{φ,P}. Through
constructing a suitable bivariate function ψ : [0, 1]d × {−1, 1} → R, generalization
analysis based on this oracle-type inequality can remove the boundedness restriction
of the target function. Similar results hold even for the more general case when φ is
merely Lipschitz continuous (see Theorem 2.1 and related discussions in Section 2.1).
• By using our oracle-type inequality, we establish tight generalization bounds for fully
connected ReLU DNN classifiers fˆnFNN trained by empirical logistic risk minimization
(see (2.14)) and obtain sharp convergence rates in various settings:
◦ We establish optimal convergence rates for the excess logistic risk of fˆnFNN only
requiring the Hölder smoothness of the conditional probability function η of


the data distribution. Specifically, for Hölder-β smooth η, we show that the
convergence rate of the excess logistic risk of f̂_n^FNN can achieve O(((log n)^5/n)^{β/(β+d)}),
which is optimal up to the logarithmic term (log n)^{5β/(β+d)}. From this we obtain the
convergence rate O(((log n)^5/n)^{β/(2β+2d)}) of the excess misclassification error of f̂_n^FNN,
which is very close to the optimal rate, by using the calibration inequality (see
Theorem 2.2). As a by-product, we also derive a new tight error bound for the
approximation of the natural logarithm function (which is unbounded near zero)
by ReLU DNNs (see Theorem 2.4). This bound plays a key role in establishing
the aforementioned optimal rates of convergence.
◦ We consider a compositional assumption which requires the conditional probabil-
ity function η to be the composition hq ◦hq−1 ◦· · ·◦h1 ◦h0 of several vector-valued
multivariate functions hi , satisfying that each component function of hi is either
a Hölder-β smooth function only depending on (a small number) d∗ of its input
variables or the maximum value function among some of its input variables. We
show that under this compositional assumption the convergence rate of the excess
logistic risk of f̂_n^FNN can achieve O(((log n)^5/n)^{β·(1∧β)^q/(d_∗ + β·(1∧β)^q)}), which is optimal up to
the logarithmic term (log n)^{5β·(1∧β)^q/(d_∗ + β·(1∧β)^q)}. We then use the calibration inequality to
obtain the convergence rate O(((log n)^5/n)^{β·(1∧β)^q/(2d_∗ + 2β·(1∧β)^q)}) of the excess misclassifica-
tion error of f̂_n^FNN (see Theorem 2.3). Note that the derived convergence rates
O(((log n)^5/n)^{β·(1∧β)^q/(d_∗ + β·(1∧β)^q)}) and O(((log n)^5/n)^{β·(1∧β)^q/(2d_∗ + 2β·(1∧β)^q)}) are independent of the in-
put dimension d, thereby circumventing the well-known curse of dimensionality.
It can be shown that the above compositional assumption is likely to be satis-
fied in practice (see comments before Theorem 2.3). Thus this result helps to
explain the huge success of DNNs in practical classification problems, especially
high-dimensional ones.
◦ We derive convergence rates of the excess misclassification error of fˆnFNN under
the piecewise smooth decision boundary condition combined with the noise and
margin conditions (see Theorem 2.5). As a special case of this result, we show
that when the input data are bounded away from the decision boundary almost
surely, the derived rates can also be dimension-free.
• We demonstrate the optimality of the convergence rates stated above by presenting
corresponding minimax lower bounds (see Theorem 2.6 and Corollary 2.1).
The rest of this paper is organized as follows. In the remainder of this section, we first
introduce some conventions and notations that will be used in this paper. Then we describe
the mathematical modeling of fully connected ReLU neural networks which defines the hy-
pothesis spaces in our setting. At the end of this section, we provide a symbol glossary for
the convenience of readers. In Section 2, we present our main results in this paper, including
the oracle-type inequality, several generalization bounds for classifiers obtained from em-
pirical logistic risk minimization over fully connected ReLU DNNs, and two minimax lower


bounds. Section 3 provides discussions and comparisons with related works and Section 4
concludes the paper. In Appendix A and Appendix B, we present covering number bounds
and some approximation bounds for the space of fully connected ReLU DNNs respectively.
Finally, in Appendix C, we give detailed proofs of results in the main body of this paper.

1.1 Conventions and Notations

Throughout this paper, we follow the conventions that 0^0 := 1, 1^∞ := 1, z/0 := ∞ =: ∞^c,
log(∞) := ∞, log 0 := −∞, 0 · w := 0 =: w · 0, and a/∞ := 0 =: b^∞ for any a ∈ R, b ∈
[0, 1), c ∈ (0, ∞), z ∈ [0, ∞], w ∈ [−∞, ∞], where we denote by log the natural logarithm
function (i.e. the base-e logarithm function). The terminology “measurable” means “Borel
measurable” unless otherwise specified. Any Borel subset of some Euclidean space Rm is
equipped with the Borel sigma algebra by default. Let G be an arbitrary measurable space
and n be a positive integer. We call any sequence of G-valued random variables {Zi }ni=1 a
sample in G of size n. Furthermore, for any measurable space F and any sample {Zi }ni=1
in G, an F-valued statistic on G n from the sample {Zi }ni=1 is a random variable θ̂ together
with a measurable map T : G n → F such that θ̂ = T (Z1 , . . . , Zn ), where T is called the
map associated with the statistic θ̂. Let θ̂ be an arbitrary F-valued statistic from some
sample {Z_i}_{i=1}^n and let T be the map associated with θ̂. Then for any measurable space D and
any measurable map T0 : F → D, T0 (θ̂) = T0 (T (Z1 , . . . , Zn )) is a D-valued statistic from
the sample {Zi }ni=1 , and T0 ◦ T is the map associated with T0 (θ̂).
Next we will introduce some notations used in this paper. We denote by N the set of
all positive integers {1, 2, 3, 4, . . .}. For d ∈ N, we use Fd to denote the set of all Borel
measurable functions from [0, 1]d to (−∞, ∞), and use H0d to denote the set of all Borel
probability measures on [0, 1]d × {−1, 1}. For any set A, the indicator function of A is given
by

1_A(x) := { 0, if x ∉ A,
            1, if x ∈ A,      (1.10)

and the number of elements of A is denoted by #(A). For any finite dimensional vector v
and any positive integer l less than or equal to the dimension of v, we denote by (v)l the l-th
component of v. More generally, for any nonempty subset I = {i_1, i_2, . . . , i_m} of N with 1 ≤
i_1 < i_2 < · · · < i_m ≤ the dimension of v, we denote (v)_I := ((v)_{i_1}, (v)_{i_2}, . . . , (v)_{i_m}), which
is a #(I)-dimensional vector. For any function f, we use dom(f) to denote the domain of
f, and use ran(f) to denote the range of f, that is, ran(f) := { f(x) | x ∈ dom(f) }. If f is
a [−∞, ∞]^m-valued function for some m ∈ N with dom(f) containing a nonempty set Ω,
then the uniform norm of f on Ω is given by

‖f‖_Ω := sup{ |(f(x))_i| : x ∈ Ω, i ∈ {1, 2, . . . , m} }.   (1.11)

For integer m ≥ 2 and real numbers a_1, · · · , a_m, define a_1 ∨ a_2 ∨ · · · ∨ a_m = max{a_1, a_2, · · · , a_m}
and a_1 ∧ a_2 ∧ · · · ∧ a_m = min{a_1, a_2, · · · , a_m}. Given a real matrix A = (a_{i,j})_{i=1,...,m, j=1,...,l}


and t ∈ [0, ∞], the ℓ_t-norm of A is defined by

‖A‖_t := Σ_{i=1}^m Σ_{j=1}^l 1_{(0,∞)}(|a_{i,j}|),                      if t = 0,
‖A‖_t := ( Σ_{i=1}^m Σ_{j=1}^l |a_{i,j}|^t )^{1/t},                     if 0 < t < ∞,       (1.12)
‖A‖_t := sup{ |a_{i,j}| : i ∈ {1, · · · , m}, j ∈ {1, · · · , l} },      if t = ∞.

Note that a vector is exactly a matrix with only one column or one row. Consequently,
(1.12) with l = 1 or m = 1 actually defines the `t -norm of a real vector A. Let G be a
measurable space, {Zi }ni=1 be a sample in G of size n, Pn be a probability measure on G n ,
and θ̂ be a [−∞, ∞]-valued statistic on G n from the sample {Zi }ni=1 . Then we denote
E_{P_n}[θ̂] := ∫ T dP_n   (1.13)

provided that the integral ∫ T dP_n exists, where T is the map associated with θ̂. Therefore,

EPn [θ̂] = E [T (Z1 , . . . , Zn )] = E[θ̂]

if the joint distribution of (Z1 , . . . , Zn ) is exactly Pn . Let P be a Borel probability mea-


sure on [0, 1]d × {−1, 1} and x ∈ [0, 1]d . We use P (·|x) to denote the regular conditional
distribution of P on {−1, 1} given x, and PX to denote the marginal distribution of P on
[0, 1]^d. For short, we will call the function [0, 1]^d ∋ x ↦ P({1}|x) ∈ [0, 1] the conditional
probability function (instead of the conditional class probability function) of P. For any
probability measure Q defined on some measurable space (Ω, F) and any n ∈ N, we use
Q^{⊗n} to denote the n-fold product measure Q × Q × · · · × Q defined on the product measurable
space (Ω × Ω × · · · × Ω, F ⊗ F ⊗ · · · ⊗ F).

1.2 Spaces of Fully Connected Neural Networks


In this paper, we restrict ourselves to neural networks with the ReLU activation function.
Consequently, hereinafter, for simplicity, we sometimes omit the word “ReLU” and the
terminology “neural networks” will always refer to “ReLU neural networks”.
The ReLU function is given by σ : R → [0, ∞), t 7→ max {t, 0}. For any vector v ∈ Rm
with m being some positive integer, the v-shifted ReLU function is defined as σv : Rm →
[0, ∞)m , x 7→ σ(x − v), where the function σ is applied componentwise.
Neural networks considered in this paper can be expressed as a family of real-valued
functions which take the form

f : Rd → R, x 7→ WL σvL WL−1 σvL−1 · · · W1 σv1 W0 x, (1.14)

where the depth L denotes the number of hidden layers, m_k is the width of the k-th layer, W_k
is an m_{k+1} × m_k weight matrix with m_0 = d and m_{L+1} = 1, and the shift vector v_k ∈ R^{m_k}


is called a bias. The architecture of a neural network is parameterized by the weight matrices
{W_k}_{k=0}^L and biases {v_k}_{k=1}^L, which will be estimated from data. Throughout the paper,
whenever we talk about a neural network, we will explicitly associate it with a function f
of the form (1.14) generated by {W_k}_{k=0}^L and {v_k}_{k=1}^L.
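For concreteness, here is a minimal NumPy sketch of the parameterization (1.14); the function names and the particular random weights are our own illustrative choices. It evaluates x ↦ W_L σ_{v_L} W_{L−1} σ_{v_{L−1}} · · · W_1 σ_{v_1} W_0 x for given weight matrices {W_k} and biases {v_k}:

```python
import numpy as np

def shifted_relu(x, v):
    # v-shifted ReLU: sigma_v(x) = max(x - v, 0), applied componentwise
    return np.maximum(x - v, 0.0)

def fnn_forward(x, weights, biases):
    """Evaluate a network of form (1.14).

    weights: [W_0, W_1, ..., W_L], with W_k of shape (m_{k+1}, m_k), m_0 = d, m_{L+1} = 1.
    biases:  [v_1, ..., v_L], with v_k of shape (m_k,).
    """
    z = weights[0] @ x                        # W_0 x
    for W, v in zip(weights[1:], biases):     # apply W_k sigma_{v_k}(.) for k = 1, ..., L
        z = W @ shifted_relu(z, v)
    return z.item()                           # scalar output since m_{L+1} = 1

# A tiny example with depth L = 2, widths m_1 = m_2 = 3, and input dimension d = 2.
rng = np.random.default_rng(0)
d, m1, m2 = 2, 3, 3
weights = [rng.normal(size=(m1, d)), rng.normal(size=(m2, m1)), rng.normal(size=(1, m2))]
biases = [rng.normal(size=m1), rng.normal(size=m2)]
print(fnn_forward(np.array([0.3, 0.8]), weights, biases))
```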
The space of fully connected neural networks is characterized by their depth and width,
as well as the number of nonzero parameters in weight matrices and bias vectors. In
addition, the complexity of this space is also determined by the ‖·‖_∞-bounds of neural
network parameters and ‖·‖_{[0,1]^d}-bounds of associated functions in form (1.14). Concretely,
let (G, N) ∈ [0, ∞)^2 and (S, B, F) ∈ [0, ∞]^3, the space of fully connected neural networks
is defined as

F^FNN_d(G, N, S, B, F) := { f : R^d → R | f is defined in (1.14) satisfying that
    L ≤ G,  m_1 ∨ m_2 ∨ · · · ∨ m_L ≤ N,
    Σ_{k=0}^L ‖W_k‖_0 + Σ_{k=1}^L ‖v_k‖_0 ≤ S,
    sup_{k=0,1,··· ,L} ‖W_k‖_∞ ∨ sup_{k=1,··· ,L} ‖v_k‖_∞ ≤ B,
    and ‖f‖_{[0,1]^d} ≤ F }.       (1.15)

In this definition, the freedom in choosing the position of nonzero entries of Wk reflects the
fully connected nature between consecutive layers of the neural network f . It should be
noticed that B and F in the definition (1.15) above can be ∞, meaning that there is no re-
striction on the upper bounds of ‖W_k‖_∞ and ‖v_k‖_∞, or ‖f‖_{[0,1]^d}. The parameter S in (1.15)
can also be ∞, leading to a structure without sparsity. The space FdFNN (G, N, S, B, F ) in-
corporates all the essential features of fully connected neural network architectures and
has been adopted to study the generalization properties of fully connected neural network
models in regression and classification (Kim et al., 2021; Schmidt-Hieber, 2020).
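The complexity parameters in (1.15) can be read off directly from the weight matrices and biases. The sketch below (our own helpers, reusing the list-of-matrices representation from the previous snippet) computes the depth, maximum width, number of nonzero parameters, and sup-norm of the parameters, and checks them against prescribed (G, N, S, B); the bound ‖f‖_{[0,1]^d} ≤ F would additionally have to be checked on the function values themselves:

```python
import numpy as np

def network_complexity(weights, biases):
    # weights = [W_0, ..., W_L], biases = [v_1, ..., v_L], as in (1.14)
    depth = len(biases)                                           # L = number of hidden layers
    widths = [W.shape[0] for W in weights[:-1]]                   # m_1, ..., m_L
    nonzeros = sum(int(np.count_nonzero(W)) for W in weights) \
             + sum(int(np.count_nonzero(v)) for v in biases)      # sum of ||W_k||_0 and ||v_k||_0
    param_sup = max(max(np.max(np.abs(W)) for W in weights),
                    max(np.max(np.abs(v)) for v in biases))       # sup over ||W_k||_inf, ||v_k||_inf
    return depth, max(widths), nonzeros, param_sup

def in_fnn_space(weights, biases, G, N, S, B):
    # Checks the architecture/parameter constraints of (1.15); the sup-norm bound F on
    # the function values over [0,1]^d is not checked here.
    L, width, S0, B0 = network_complexity(weights, biases)
    return L <= G and width <= N and S0 <= S and B0 <= B
```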

1.3 Glossary
At the end of this section, we provide a glossary of frequently used symbols in this paper
for the convenience of readers.

Symbol | Meaning | Definition
Z | The set of integers. |
N | The set of positive integers. |
R | The set of real numbers. |
∨ | Taking the maximum, e.g., a_1 ∨ a_2 ∨ a_3 ∨ a_4 is equal to the maximum of a_1, . . . , a_4. |
∧ | Taking the minimum, e.g., a_1 ∧ a_2 ∧ a_3 ∧ a_4 is equal to the minimum of a_1, . . . , a_4. |
◦ | Function composition, e.g., for f : R → R and g : R → R, g◦f denotes the map R ∋ x ↦ g(f(x)) ∈ R. |
dom(f) | The domain of a function f. | Below Eq. (1.10)
ran(f) | The range of a function f. | Below Eq. (1.10)
A^⊤ | The transpose of a matrix A. |
#(A) | The number of elements of a set A. |
⌊·⌋ | The floor function, which is defined as ⌊x⌋ := sup{ z ∈ Z | z ≤ x }. |
⌈·⌉ | The ceiling function, which is defined as ⌈x⌉ := inf{ z ∈ Z | z ≥ x }. |
1_A | The indicator function of a set A. | Eq. (1.10)
(v)_l | The l-th component of a vector v. | Below Eq. (1.10)
(v)_I | The #(I)-dimensional vector whose components are exactly {(v)_i}_{i∈I}. | Below Eq. (1.10)
‖·‖_Ω | The uniform norm on a set Ω. | Eq. (1.11)
‖·‖_t | The ℓ_t-norm. | Eq. (1.12)
‖·‖_{C^{k,λ}(Ω)} | The Hölder norm. | Eq. (2.12)
sgn | The sign function. | Below Eq. (1.2)
σ | The ReLU function, that is, R ∋ t ↦ max{0, t} ∈ [0, ∞). | Above Eq. (1.14)
σ_v | The v-shifted ReLU function. | Above Eq. (1.14)
M_a | The probability measure on {−1, 1} with M_a({1}) = a. | Above Eq. (1.8)
P_X | The marginal distribution of P on [0, 1]^d. | Below Eq. (1.6)
P(·|x) | The regular conditional distribution of P on {−1, 1} given x ∈ [0, 1]^d. | Below Eq. (1.6)
P_{η,Q} | The probability on [0, 1]^d × {−1, 1} of which the marginal distribution on [0, 1]^d is Q and the conditional probability function is η. | Eq. (2.57)
P_η | The probability on [0, 1]^d × {−1, 1} of which the marginal distribution on [0, 1]^d is the Lebesgue measure and the conditional probability function is η. | Below Eq. (2.57)
E_{P_n}[θ̂] | The expectation of a statistic θ̂ when the joint distribution of the sample on which θ̂ depends is P_n. | Eq. (1.13)
Q^{⊗n} | The n-fold product measure Q × Q × · · · × Q. | Below Eq. (1.13)
R_P(f) | The misclassification error of f with respect to P. | Eq. (1.1)
E_P(f) | The excess misclassification error of f with respect to P. | Eq. (1.2)
R^φ_P(f) | The φ-risk of f with respect to P. | Eq. (1.4)
E^φ_P(f) | The excess φ-risk of f with respect to P. | Eq. (1.5)
f*_{φ,P} | The target function of the φ-risk under some distribution P. | Eq. (1.6)
N(F, γ) | The covering number of a class of real-valued functions F with radius γ in the uniform norm. | Eq. (2.1)
B^β_r(Ω) | The closed ball of radius r centered at the origin in the Hölder space of order β on Ω. | Eq. (2.13)
G^M_d(d_⋆) | The set of all functions from [0, 1]^d to R which compute the maximum value of up to d_⋆ components of their input vectors. | Eq. (2.27)
G^H_d(d_∗, β, r) | The set of all functions in B^β_r([0, 1]^d) whose output values depend on exactly d_∗ components of their input vectors. | Eq. (2.28)
G^M_∞(d_⋆) | G^M_∞(d_⋆) := ∪_{d=1}^∞ G^M_d(d_⋆) | Above Eq. (2.30)
G^H_∞(d_∗, β, r) | G^H_∞(d_∗, β, r) := ∪_{d=1}^∞ G^H_d(d_∗, β, r) | Above Eq. (2.30)
G^CH_d(· · · ) | G^CH_d(q, K, d_∗, β, r) consists of compositional functions h_q ◦ · · · ◦ h_0 satisfying that each component function of h_i belongs to G^H_∞(d_∗, β, r). | Eq. (2.31)
G^CHOM_d(· · · ) | G^CHOM_d(q, K, d_⋆, d_∗, β, r) consists of compositional functions h_q ◦ · · · ◦ h_0 satisfying that each component function of h_i belongs to G^H_∞(d_∗, β, r) ∪ G^M_∞(d_⋆). | Eq. (2.32)
C^{d,β,r,I,Θ} | The set of binary classifiers C : [0, 1]^d → {−1, +1} such that { x ∈ [0, 1]^d | C(x) = +1 } is the union of some disjoint closed regions with piecewise Hölder smooth boundary. | Eq. (2.46)
∆_C(x) | The distance from some point x ∈ [0, 1]^d to the decision boundary of some classifier C ∈ C^{d,β,r,I,Θ}. | Eq. (2.48)
F_d | The set of all Borel measurable functions from [0, 1]^d to (−∞, ∞). | Above Eq. (1.10)
F^FNN_d(· · · ) | The class of ReLU neural networks defined on R^d. | Eq. (1.15)
H^d_0 | The set of all Borel probability measures on [0, 1]^d × {−1, 1}. | Above Eq. (1.10)
H^{d,β,r}_1 | The set of all probability measures P ∈ H^d_0 whose conditional probability function coincides with some function in B^β_r([0, 1]^d) P_X-a.s.. | Eq. (2.15)
H^{d,β,r}_{2,s_1,c_1,t_1} | The set of all probability measures P in H^{d,β,r}_1 satisfying the noise condition (2.24). | Eq. (2.26)
H^{d,β,r}_{3,A} | The set of all probability measures P ∈ H^d_0 whose marginal distribution on [0, 1]^d is the Lebesgue measure and whose conditional probability function is in B^β_r([0, 1]^d) and bounded away from 1/2 almost surely. | Eq. (2.58)
H^{d,β,r}_{4,q,K,d_⋆,d_∗} | The set of all probability measures P ∈ H^d_0 whose conditional probability function coincides with some function in G^CHOM_d(q, K, d_⋆, d_∗, β, r) P_X-a.s.. | Eq. (2.34)
H^{d,β,r}_{5,A,q,K,d_∗} | The set of all probability measures P ∈ H^d_0 whose marginal distribution on [0, 1]^d is the Lebesgue measure and whose conditional probability function is in G^CH_d(q, K, d_∗, β, r) and bounded away from 1/2 almost surely. | Eq. (2.58)
H^{d,β,r,I,Θ,s_1,s_2}_{6,t_1,c_1,t_2,c_2} | The set of all probability measures P ∈ H^d_0 which satisfy the piecewise smooth decision boundary condition (2.50), the noise condition (2.24) and the margin condition (2.51) for some C ∈ C^{d,β,r,I,Θ}. | Eq. (2.52)
H^{d,β}_7 | The set of all probability measures P ∈ H^d_0 such that the target function of the logistic risk under P belongs to B^β_1([0, 1]^d). | Above Eq. (3.4)
f̂^FNN_n | The DNN estimator obtained from empirical logistic risk minimization over the space of fully connected ReLU DNNs. | Eq. (2.14)

Table 1: Glossary of frequently used symbols in this paper

2. Main Results
In this section, we give our main results, consisting of upper bounds presented in Subsection
2.1 and lower bounds presented in Subsection 2.2.

2.1 Main Upper Bounds


In this subsection, we state our main results about upper bounds for the (excess) logistic risk
or (excess) misclassification error of empirical logistic risk minimizers. The first result, given
in Theorem 2.1, is an oracle-type inequality which provides upper bounds for the logistic risk
of empirical logistic risk minimizers. Oracle-type inequalities have been extensively studied
in the literature of nonparametric statistics (see Johnstone (1998) and references therein).
As one of the main contributions in this paper, this inequality deserves special attention in
its own right, allowing us to establish a novel strategy for generalization analysis. Before
we state Theorem 2.1, we introduce some notations. For any pseudometric space (F, ρ) (cf.
Section 10.5 of Anthony and Bartlett (2009)) and γ ∈ (0, ∞), the covering number of (F, ρ)
with radius γ is defined as

N((F, ρ), γ) := inf{ #(A) | A ⊂ F, and for any f ∈ F there exists g ∈ A such that ρ(f, g) ≤ γ },

where we recall that #(A) denotes the number of elements of the set A. When the pseu-
dometric ρ on F is clear and no confusion arises, we write N(F, γ) instead of N((F, ρ), γ)
for simplicity. In particular, if F consists of real-valued functions which are bounded on
[0, 1]^d, we will use N(F, γ) to denote

N( F, ρ : (f, g) ↦ sup_{x ∈ [0,1]^d} |f(x) − g(x)|, γ )   (2.1)


unless otherwise specified. Recall that the φ-risk of a measurable function f : [0, 1]d → R
with respect to a distribution P on [0, 1]d × {−1, 1} is denoted by RφP (f ) and defined in
(1.4).

Theorem 2.1 Let {(Xi , Yi )}ni=1 be an i.i.d. sample of a probability distribution P on


[0, 1]d × {−1, 1}, F be a nonempty class of uniformly bounded real-valued functions de-
fined on [0, 1]d , and fˆn be an ERM with respect to the logistic loss φ(t) = log(1 + e−t ) over
F, i.e.,
f̂_n ∈ arg min_{f ∈ F} (1/n) Σ_{i=1}^n φ(Y_i f(X_i)).   (2.2)

If there exists a measurable function ψ : [0, 1]d × {−1, 1} → R and a constant triple
(M, Γ, γ) ∈ (0, ∞)3 such that
∫_{[0,1]^d × {−1,1}} ψ(x, y) dP(x, y) ≤ inf_{f ∈ F} ∫_{[0,1]^d × {−1,1}} φ(y f(x)) dP(x, y),   (2.3)

sup{ φ(t) | |t| ≤ sup_{f ∈ F} ‖f‖_{[0,1]^d} } ∨ sup{ |ψ(x, y)| | (x, y) ∈ [0, 1]^d × {−1, 1} } ≤ M,   (2.4)

∫_{[0,1]^d × {−1,1}} (φ(y f(x)) − ψ(x, y))^2 dP(x, y)
  ≤ Γ · ∫_{[0,1]^d × {−1,1}} (φ(y f(x)) − ψ(x, y)) dP(x, y),  ∀ f ∈ F,   (2.5)

and
W := max {3, N (F, γ)} < ∞.
Then for any ε ∈ (0, 1), there holds

E[ R^φ_P(f̂_n) ] − ∫_{[0,1]^d × {−1,1}} ψ(x, y) dP(x, y)
  ≤ 80 · ((1 + ε)^2/ε) · (Γ log W)/n + (20 + 20ε) · (M log W)/n + (20 + 20ε) · γ · √((Γ log W)/n)   (2.6)
    + 4γ + (1 + ε) · ( inf_{f ∈ F} R^φ_P(f) − ∫_{[0,1]^d × {−1,1}} ψ(x, y) dP(x, y) ).

According to its proof in Appendix C.2, Theorem 2.1 remains true when the logistic loss
is replaced by any nonnegative function φ satisfying

|φ(t) − φ(t′)| ≤ |t − t′|,  ∀ t, t′ ∈ [ −sup_{f ∈ F} ‖f‖_{[0,1]^d}, sup_{f ∈ F} ‖f‖_{[0,1]^d} ].

Then by rescaling, Theorem 2.1 can be further generalized to the case when φ is any
nonnegative locally Lipschitz continuous loss function such as the exponential loss or the


LUM (large-margin unified machine) loss (cf. Liu et al. (2011)). Generalization analysis for
classification with these loss functions based on oracle-type inequalities similar to Theorem
2.1 has been studied in our coming work Zhang et al. (2024).
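To get a feel for how the right-hand side of (2.6) scales, the following sketch (an illustration under made-up constants of ours, not values from the paper) evaluates the bound for given (M, Γ, γ, W, n, ε):

```python
import math

def oracle_bound_rhs(M, Gamma, gamma, W, n, eps, approx_gap):
    # Right-hand side of (2.6); approx_gap stands for inf_{f in F} R_P^phi(f) - integral of psi dP.
    logW = math.log(W)
    return (80.0 * (1.0 + eps) ** 2 / eps * Gamma * logW / n
            + (20.0 + 20.0 * eps) * M * logW / n
            + (20.0 + 20.0 * eps) * gamma * math.sqrt(Gamma * logW / n)
            + 4.0 * gamma
            + (1.0 + eps) * approx_gap)

# Illustrative numbers only: the bound decays roughly like log(W)/n plus the approximation gap.
for n in (10**3, 10**4, 10**5):
    print(n, oracle_bound_rhs(M=5.0, Gamma=10.0, gamma=1.0 / n, W=n**3, n=n, eps=0.5, approx_gap=0.01))
```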
Let us give some comments on conditions (2.3) and (2.5) of Theorem 2.1. To our
knowledge, these two conditions are introduced for the first time in this paper, and will play
pivotal roles in our estimates. Let φ be the logistic loss and P be a probability measure on

[0, 1]^d × {−1, 1}. Recall that f*_{φ,P} denotes the target function of the logistic risk. If

∫_{[0,1]^d × {−1,1}} ψ(x, y) dP(x, y) = inf{ R^φ_P(f) | f : [0, 1]^d → R is measurable },   (2.7)

then condition (2.3) is satisfied and the left hand side of (2.6) is exactly E[E^φ_P(f̂_n)].
Therefore, Theorem 2.1 can be used to establish excess φ-risk bounds for the φ-ERM f̂_n.
In particular, one can take ψ(x, y) to be φ(y f*_{φ,P}(x)) to ensure the equality (2.7) (recalling
(1.7)). It should be pointed out that if ψ(x, y) = φ(y f*_{φ,P}(x)), inequality (2.5) is of the same
form as the following inequality with τ = 1, which asserts that there exist τ ∈ [0, 1] and
Γ > 0 such that

∫_{[0,1]^d × {−1,1}} ( φ(y f(x)) − φ(y f*_{φ,P}(x)) )^2 dP(x, y) ≤ Γ · ( E^φ_P(f) )^τ,  ∀ f ∈ F.   (2.8)

This inequality appears naturally when bounding the sample error by using concentration
inequalities, which is of great importance in previous generalization analysis for binary
classification (cf. condition (A4) in Kim et al. (2021) and Definition 10.15 in Cucker and
Zhou (2007)). In Farrell et al. (2021), the authors actually prove that if the target function
f*_{φ,P} is bounded and the functions in F are uniformly bounded by some F > 0, the inequality
(2.5) holds with ψ(x, y) = φ(y f*_{φ,P}(x)) and

Γ = 2 / inf{ φ″(t) | t ∈ R, |t| ≤ max{ F, ‖f*_{φ,P}‖_{[0,1]^d} } }.

Here φ″(t) denotes the second order derivative of φ(t) = log(1 + e^{−t}), which is given by
φ″(t) = e^t/(1 + e^t)^2. The boundedness of f*_{φ,P} is a key ingredient leading to the main results

in Farrell et al. (2021) (see Section 3 for more details). However, f*_{φ,P} is explicitly given
by log(η/(1 − η)) with η(x) = P({1}|x), which tends to infinity when η approaches 0 or 1. In
some cases, the uniform boundedness assumption on f*_{φ,P} is too restrictive. When f*_{φ,P}
is unbounded, i.e., ‖f*_{φ,P}‖_{[0,1]^d} = ∞, condition (2.5) will not be satisfied by simply taking
ψ(x, y) = φ(y f*_{φ,P}(x)). Since in this case we have inf_{t∈(−∞,+∞)} φ″(t) = 0, one cannot find a
finite constant Γ to guarantee the validity of (2.5), i.e., the inequality (2.8) cannot hold for
τ = 1, which means the previous strategy for generalization analysis in Farrell et al. (2021)
fails to work. In Theorem 2.1, the requirement on ψ(x, y) is much more flexible: we don't
require ψ(x, y) to be φ(y f*_{φ,P}(x)) or even to satisfy (2.7). In this paper, by resorting to
Theorem 2.1, we carefully construct ψ to avoid using f*_{φ,P} directly in the following estimates.

Based on this strategy, under some mild regularity conditions on η, we can develop a more
general analysis to demonstrate the performance of neural network classifiers trained with


the logistic loss regardless of the unboundedness of f*_{φ,P}. The derived generalization bounds
and rates of convergence are stated in Theorem 2.2, Theorem 2.3, and Theorem 2.5, which
are new in the literature and constitute the main contributions of this paper. It is worth
noticing that in Theorem 2.2 and Theorem 2.3, we use Theorem 2.1 to obtain optimal rates
of convergence (up to some logarithmic factor), which demonstrates the tightness and power
of the inequality (2.6) in Theorem 2.1. To obtain these optimal rates from Theorem 2.1,
a delicate construction of ψ which allows small constants M and Γ in (2.4) and (2.5) is
necessary. One frequently used form of ψ in this paper is

ψ : [0, 1]^d × {−1, 1} → R,

(x, y) ↦ { φ(y log(η(x)/(1 − η(x)))),                          if η(x) ∈ [δ_1, 1 − δ_1],
           0,                                                  if η(x) ∈ {0, 1},              (2.9)
           η(x) log(1/η(x)) + (1 − η(x)) log(1/(1 − η(x))),    if η(x) ∈ (0, δ_1) ∪ (1 − δ_1, 1),

which can be regarded as a truncated version of φ(y f*_{φ,P}(x)) = φ(y log(η(x)/(1 − η(x)))), where δ_1 is
some suitable constant in (0, 1/2]. However, in Theorem 2.5 we use a different form of ψ,
which will be specified later.
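A direct implementation of the truncated construction (2.9) is straightforward; the sketch below (our own function names; eta and delta1 are supplied by the caller) mirrors the three cases:

```python
import numpy as np

def phi(t):
    # logistic loss
    return np.logaddexp(0.0, -t)

def psi(x, y, eta, delta1):
    """Truncated surrogate psi(x, y) from (2.9).

    eta: callable returning the conditional probability P({1}|x); delta1 lies in (0, 1/2].
    """
    e = eta(x)
    if delta1 <= e <= 1.0 - delta1:
        return float(phi(y * np.log(e / (1.0 - e))))
    if e in (0.0, 1.0):
        return 0.0
    # e in (0, delta1) or (1 - delta1, 1): the conditional entropy of the label at x
    return float(e * np.log(1.0 / e) + (1.0 - e) * np.log(1.0 / (1.0 - e)))

# Example: eta close to 1 falls into the third branch, keeping psi bounded
eta = lambda x: 0.999
print(psi(np.array([0.2, 0.7]), +1, eta, delta1=0.05))
```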
The proof of Theorem 2.1 is based on the following error decomposition
E[ R^φ_P(f̂_n) ] − Ψ ≤ T_{ε,ψ,n} + (1 + ε) · ( inf_{g ∈ F} R^φ_P(g) − Ψ ),  ∀ ε ∈ [0, 1),   (2.10)

where T_{ε,ψ,n} := E[ R^φ_P(f̂_n) − Ψ − (1 + ε) · (1/n) Σ_{i=1}^n ( φ(Y_i f̂_n(X_i)) − ψ(X_i, Y_i) ) ] and Ψ =
∫_{[0,1]^d × {−1,1}} ψ(x, y) dP(x, y) (see (C.13)). Although (2.10) is true for ε = 0, it’s better to
take ε > 0 in (2.10) to obtain sharp rates of convergence. This is because bounding the term
Tε,ψ,n with ε ∈ (0, 1) is easier than bounding T0,ψ,n . To see this, note that for ε ∈ (0, 1) we
have
T_{ε,ψ,n} = (1 + ε) · T_{0,ψ,n} − ε · E[ R^φ_P(f̂_n) − Ψ ] ≤ (1 + ε) · T_{0,ψ,n},

meaning that we can always establish tighter upper bounds for Tε,ψ,n than for T0,ψ,n (up
to the constant factor 1 + ε < 2). Indeed, ε > 0 is necessary in establishing Theorem 2.1,
as indicated in its proof in Appendix C.2. We also point out that, setting ε = 0 and ψ ≡ 0
(hence Ψ = 0) in (2.10), and subtracting inf{ R^φ_P(g) | g : [0, 1]^d → R measurable } from both
sides, we will obtain a simpler error decomposition
E[ E^φ_P(f̂_n) ] ≤ E[ R^φ_P(f̂_n) − (1/n) Σ_{i=1}^n φ(Y_i f̂_n(X_i)) ] + inf_{g ∈ F} E^φ_P(g)
             ≤ E[ sup_{g ∈ F} ( R^φ_P(g) − (1/n) Σ_{i=1}^n φ(Y_i g(X_i)) ) ] + inf_{g ∈ F} E^φ_P(g),   (2.11)

which is frequently used in the literature (see e.g., Lemma 2 in Kohler and Langer (2020) and
the proof of Proposition 4.1 in Mohri et al. (2018)). Note that (2.11) does not require the


explicit form of f*_{φ,P}, which means that we can also use this error decomposition to establish
rates of convergence for E[E^φ_P(f̂_n)] regardless of the unboundedness of f*_{φ,P}. However, in
comparison with Theorem 2.1, using (2.11) may result in slow rates of convergence because
of the absence of the positive parameter ε and a carefully constructed function ψ.
We now state Theorem 2.2 which establishes generalization bounds for empirical logistic
risk minimizers over DNNs. In order to present this result, we need the definition of Hölder
spaces (Evans, 2010). The Hölder space C k,λ (Ω), where Ω ⊂ Rd is a closed domain, k ∈
N ∪ {0} and λ ∈ (0, 1], consists of all those functions from Ω to R which have continuous
derivatives up to order k and whose k-th partial derivatives are Hölder-λ continuous on Ω.
Here we say a function g : Ω → R is Hölder-λ continuous on Ω, if

|g|_{C^{0,λ}(Ω)} := sup_{Ω ∋ x ≠ z ∈ Ω} |g(x) − g(z)| / ‖x − z‖_2^λ < ∞.

Then the Hölder spaces C k,λ (Ω) can be assigned the norm

‖f‖_{C^{k,λ}(Ω)} := max_{‖m‖_1 ≤ k} ‖D^m f‖_Ω + max_{‖m‖_1 = k} |D^m f|_{C^{0,λ}(Ω)},   (2.12)

where m = (m_1, · · · , m_d) ∈ (N ∪ {0})^d ranges over multi-indices (hence ‖m‖_1 = Σ_{i=1}^d m_i)
and D^m f(x_1, . . . , x_d) = (∂^{m_1}/∂x_1^{m_1}) · · · (∂^{m_d}/∂x_d^{m_d}) f(x_1, . . . , x_d). Given β ∈ (0, ∞), we say a function
f : Ω → R is Hölder-β smooth if f ∈ C^{k,λ}(Ω) with k = ⌈β⌉ − 1 and λ = β − ⌈β⌉ + 1, where
⌈β⌉ denotes the smallest integer larger than or equal to β. For any β ∈ (0, ∞) and any
r ∈ (0, ∞), let
 
B^β_r(Ω) := { f : Ω → R | f ∈ C^{k,λ}(Ω) and ‖f‖_{C^{k,λ}(Ω)} ≤ r for k = −1 + ⌈β⌉ and λ = β − ⌈β⌉ + 1 }   (2.13)

denote the closed ball of radius r centered at the origin in the Hölder space of order β on Ω.
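As a rough numerical illustration (our own helper, and only a finite-grid approximation of the supremum in the definition), one can estimate the Hölder-λ seminorm |g|_{C^{0,λ}(Ω)} of a univariate function on a grid:

```python
import numpy as np

def holder_seminorm_estimate(g, grid, lam):
    # Finite approximation of sup_{x != z} |g(x) - g(z)| / |x - z|^lam over grid points
    vals = g(grid)
    num = np.abs(vals[:, None] - vals[None, :])
    den = np.abs(grid[:, None] - grid[None, :]) ** lam
    mask = ~np.eye(len(grid), dtype=bool)
    return float(np.max(num[mask] / den[mask]))

grid = np.linspace(0.0, 1.0, 201)
print(holder_seminorm_estimate(np.sin, grid, lam=1.0))    # about 1 (Lipschitz constant of sin)
print(holder_seminorm_estimate(np.sqrt, grid, lam=0.5))   # about 1: sqrt is Holder-1/2 on [0,1]
```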
Recall that the space FdFNN (G, N, S, B, F ) generated by fully connected neural networks is
given in (1.15), which is parameterized by the depth and width of neural networks (bounded
by G and N ), the number of nonzero entries in weight matrices and bias vectors (bounded
by S), and the upper bounds of neural network parameters and associated functions of form
(1.14) (denoted by B and F ). In the following theorem, we show that to ensure the rate
of convergence as the sample size n becomes large, all these parameters should be taken
within certain ranges scaling with n. For two positive sequences {λ_n}_{n≥1} and {ν_n}_{n≥1},
we say λ_n ≲ ν_n holds if there exist n_0 ∈ N and a positive constant c independent of n
such that λ_n ≤ c ν_n, ∀ n ≥ n_0. In addition, we write λ_n ≍ ν_n if and only if λ_n ≲ ν_n and
ν_n ≲ λ_n. Recall that the excess misclassification error of f : R^d → R with respect to some
distribution P on [0, 1]^d × {−1, 1} is defined as

E_P(f) = R_P(f) − inf{ R_P(g) | g : [0, 1]^d → R is Borel measurable },

where R_P(f) denotes the misclassification error of f given by

R_P(f) = P({ (x, y) ∈ [0, 1]^d × {−1, 1} | y ≠ sgn(f(x)) }).


Theorem 2.2 Let d ∈ N, (β, r) ∈ (0, ∞)2 , n ∈ N, ν ∈ [0, ∞), {(Xi , Yi )}ni=1 be an i.i.d.
sample in [0, 1]d × {−1, 1} and fˆnFNN be an ERM with respect to the logistic loss φ(t) =
log(1 + e^{−t}) over F^FNN_d(G, N, S, B, F), i.e.,

f̂_n^FNN ∈ arg min_{f ∈ F^FNN_d(G,N,S,B,F)} (1/n) Σ_{i=1}^n φ(Y_i f(X_i)).   (2.14)

Define

H^{d,β,r}_1 := { P ∈ H^d_0 | P_X({ z ∈ [0, 1]^d | P({1}|z) = η̂(z) }) = 1 for some η̂ ∈ B^β_r([0, 1]^d) }.   (2.15)

Then there exists a constant c ∈ (0, ∞) only depending on (d, β, r), such that the estimator
f̂_n^FNN defined by (2.14) with

c log n ≤ G ≲ log n,  N ≍ ((log n)^5/n)^{−d/(d+β)},  S ≍ ((log n)^5/n)^{−d/(d+β)} · log n,
1 ≤ B ≲ n^ν,  and  (β/(d+β)) · log n ≤ F ≲ log n     (2.16)

satisfies

sup_{P ∈ H^{d,β,r}_1} E_{P^{⊗n}}[ E^φ_P(f̂_n^FNN) ] ≲ ((log n)^5/n)^{β/(β+d)}   (2.17)

and

sup_{P ∈ H^{d,β,r}_1} E_{P^{⊗n}}[ E_P(f̂_n^FNN) ] ≲ ((log n)^5/n)^{β/(2β+2d)}.   (2.18)
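The prescription (2.16) fixes the network hyperparameters only up to constants and logarithmic factors. Purely as an illustration of how they scale with the sample size (the constants below are arbitrary choices of ours, not values from the paper), one could set:

```python
import math

def theorem22_hyperparams(n, d, beta, c_depth=1.0, nu=1.0):
    # Scalings from (2.16): G ~ log n, N ~ S/log n ~ ((log n)^5 / n)^(-d/(d+beta)),
    # B <= n^nu, F ~ log n (all up to unspecified constants).
    logn = math.log(n)
    base = (logn ** 5 / n) ** (-d / (d + beta))
    return {
        "depth G": c_depth * logn,
        "width N": base,
        "nonzero params S": base * logn,
        "weight bound B": n ** nu,
        "sup-norm bound F": beta / (d + beta) * logn,
    }

# The asymptotic regime only kicks in once (log n)^5 / n is small, so use a large n here.
print(theorem22_hyperparams(n=10**8, d=8, beta=2.0))
```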

Theorem 2.2 will be proved in Appendix C.4. As far as we know, for classification
with neural networks and the logistic loss φ, generalization bounds presented in (2.17) and
(2.18) establish fastest rates of convergence among the existing literature under the Hölder
smoothness condition on the conditional probability function η of the data distribution P .
Note that to obtain such generalization bounds in (2.17) and (2.18) we do not require any
assumption on the marginal distribution P_X of the distribution P. For example, we do
not require that P_X is absolutely continuous with respect to the Lebesgue measure. The
rate O(((log n)^5/n)^{β/(β+d)}) in (2.17) for the convergence of the excess φ-risk is indeed optimal (up to
some logarithmic factor) in the minimax sense (see Corollary 2.1 and comments therein).
However, the rate O(((log n)^5/n)^{β/(2β+2d)}) in (2.18) for the convergence of the excess misclassification
error is not optimal. According to Theorem 4.1, Theorem 4.2, Theorem 4.3 and their proofs
in Audibert and Tsybakov (2007), there holds

inf_{f̂_n} sup_{P ∈ H^{d,β,r}_1} E_{P^{⊗n}}[ E_P(f̂_n) ] ≍ n^{−β/(2β+d)},   (2.19)

where the infimum is taken over all F_d-valued statistics from the sample {(X_i, Y_i)}_{i=1}^n.
Therefore, the rate O(((log n)^5/n)^{β/(2β+2d)}) in (2.18) does not match the minimax optimal rate


O(n^{−β/(2β+d)}). Despite suboptimality, the rate O(((log n)^5/n)^{β/(2β+2d)}) in (2.18) is fairly close to
the optimal rate O((1/n)^{β/(2β+d)}), especially when β >> d, because the exponents satisfy

lim_{β→+∞} β/(2β + 2d) = 1/2 = lim_{β→+∞} β/(2β + d).

In our proof of Theorem 2.2, the rate O(((log n)^5/n)^{β/(2β+2d)}) in (2.18) is derived directly
from the rate O(((log n)^5/n)^{β/(β+d)}) in (2.17) via the so-called calibration inequality which takes the
form

E_P(f) ≤ c · ( E^φ_P(f) )^{1/2}   for any f ∈ F_d and any P ∈ H^d_0   (2.20)

with c being a constant independent of P and f (see (C.98)). Indeed, it follows from
Theorem 8.29 of Steinwart and Christmann (2008) that

E_P(f) ≤ 2√2 · ( E^φ_P(f) )^{1/2}   for any f ∈ F_d and any P ∈ H^d_0.   (2.21)

In other words, (2.20) holds when c = 2√2. Interestingly, we can use Theorem 2.2 to obtain
that the inequality (2.20) is optimal in the sense that the exponent 1/2 cannot be replaced
by a larger one. Specifically, by using (2.17) of our Theorem 2.2 together with (2.19), we
can prove that 1/2 is the largest number s such that there holds

E_P(f) ≤ c · ( E^φ_P(f) )^s   for any f ∈ F_d and any P ∈ H^d_0   (2.22)

for some constant c independent of P or f . We now demonstrate this by contradiction. Fix


d ∈ N. Suppose there exists an s ∈ (1/2, ∞) and a c ∈ (0, ∞) such that (2.22) holds. Since

lim_{β→+∞} ((2/3 ∧ s) · β)/(d + β) = (2/3) ∧ s > 1/2 = lim_{β→+∞} β/(2β + d),

we can choose β large enough such that ((2/3 ∧ s) · β)/(d + β) > β/(2β + d). Besides, it follows from E_P(f) ≤ 1
and (2.22) that

E_P(f) ≤ |E_P(f)|^{((2/3) ∧ s)/s} ≤ ( c · ( E^φ_P(f) )^s )^{((2/3) ∧ s)/s} ≤ (1 + c) · ( E^φ_P(f) )^{(2/3) ∧ s}   (2.23)

for any f ∈ Fd and any P ∈ H0d . Let r = 3 and fˆnFNN be the estimator in Theorem 2.2.
Then it follows from (2.17), (2.19), (2.23) and Hölder’s inequality that

n^{−β/(2β+d)} ≍ inf_{f̂_n} sup_{P ∈ H^{d,β,r}_1} E_{P^{⊗n}}[ E_P(f̂_n) ] ≤ sup_{P ∈ H^{d,β,r}_1} E_{P^{⊗n}}[ E_P(f̂^FNN_n) ]
  ≤ sup_{P ∈ H^{d,β,r}_1} E_{P^{⊗n}}[ (1 + c) · ( E^φ_P(f̂^FNN_n) )^{(2/3) ∧ s} ]
  ≤ (1 + c) · sup_{P ∈ H^{d,β,r}_1} ( E_{P^{⊗n}}[ E^φ_P(f̂^FNN_n) ] )^{(2/3) ∧ s}
  ≤ (1 + c) · ( sup_{P ∈ H^{d,β,r}_1} E_{P^{⊗n}}[ E^φ_P(f̂^FNN_n) ] )^{(2/3) ∧ s}
  ≲ ( ((log n)^5/n)^{β/(β+d)} )^{(2/3) ∧ s} = ((log n)^5/n)^{((2/3) ∧ s)·β/(β+d)}.

Hence n^{−β/(2β+d)} ≲ ((log n)^5/n)^{((2/3) ∧ s)·β/(β+d)}, which contradicts the fact that ((2/3) ∧ s)·β/(d + β) > β/(2β + d). This
proves the desired result. Due to the optimality of (2.20) and the minimax lower bound
O(n^{−β/(d+β)}) for rates of convergence of the excess φ-risk stated in Corollary 2.1, we deduce
that rates of convergence of the excess misclassification error obtained directly from those
of the excess φ-risk and the calibration inequality which takes the form of (2.22) can never
be faster than O(n^{−β/(2d+2β)}). Therefore, the convergence rate O(((log n)^5/n)^{β/(2β+2d)}) of the excess
misclassification error in (2.18) is the fastest one (up to the logarithmic term (log n)^{5β/(2β+2d)})
among all those that are derived directly from the convergence rates of the excess φ-risk
and the calibration inequality of the form (2.22), which justifies the tightness of (2.18).
It should be pointed out that the rate O(((log n)^5/n)^{β/(2β+2d)}) in (2.18) can be further im-
proved if we assume the following noise condition (cf. Tsybakov (2004)) on P: there exist
c_1 > 0, t_1 > 0 and s_1 ∈ [0, ∞] such that

P_X({ x ∈ [0, 1]^d | |2 · P({1}|x) − 1| ≤ t }) ≤ c_1 t^{s_1},  ∀ 0 < t ≤ t_1.   (2.24)

This condition measures the size of the set of high-noise points and reflects the noise level through the
exponent s1 ∈ [0, ∞]. Obviously, every distribution satisfies condition (2.24) with s1 = 0
and c1 = 1, whereas s1 = ∞ implies that we have a low amount of noise in labeling x, i.e.,
the conditional probability function P ({1} |x) is bounded away from 1/2 for PX -almost all
x ∈ [0, 1]d . Under the noise condition (2.24), the calibration inequality for logistic loss φ
can be refined as
E_P(f) ≤ c̄ · ( E^φ_P(f) )^{(s_1+1)/(s_1+2)}   for all f ∈ F_d,   (2.25)

where c̄ ∈ (0, ∞) is a constant only depending on (s_1, c_1, t_1), and (s_1 + 1)/(s_1 + 2) := 1 if s_1 = ∞
(cf. Theorem 8.29 in Steinwart and Christmann (2008) and Theorem 1.1 in Xiang (2011)).
Combining this calibration inequality (2.25) and (2.17), we can obtain an improved gener-
alization bound given by

sup_{P ∈ H^{d,β,r}_{2,s_1,c_1,t_1}} E_{P^{⊗n}}[ E_P(f̂^FNN_n) ] ≲ ((log n)^5/n)^{(s_1+1)β/((s_1+2)(β+d))},

where

H^{d,β,r}_{2,s_1,c_1,t_1} := { P ∈ H^{d,β,r}_1 | P satisfies (2.24) }.   (2.26)

One can refer to Section 3 for more discussions about comparisons between Theorem
2.2 and other related results.
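The effect of the noise exponent s_1 on the refined rate is easy to tabulate. The following sketch (our own illustration) compares the baseline exponent β/(2β+2d) from (2.18) with the refined exponent (s_1+1)β/((s_1+2)(β+d)) obtained from (2.25):

```python
def misclassification_rate_exponent(beta, d, s1=None):
    # Exponent of 1/n (ignoring log factors) in the excess misclassification error bound:
    # beta/(2*beta + 2*d) without a noise condition, (s1+1)*beta/((s1+2)*(beta+d)) under (2.24).
    if s1 is None:
        return beta / (2 * beta + 2 * d)
    if s1 == float("inf"):
        return beta / (beta + d)   # (s1+1)/(s1+2) -> 1 as s1 -> infinity
    return (s1 + 1) * beta / ((s1 + 2) * (beta + d))

beta, d = 2.0, 8
print("no noise condition :", misclassification_rate_exponent(beta, d))
for s1 in (0.0, 1.0, 4.0, float("inf")):
    print(f"s1 = {s1:>4}          :", misclassification_rate_exponent(beta, d, s1))
```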
In our Theorem 2.2, the rates ((log n)^5/n)^{β/(β+d)} and ((log n)^5/n)^{β/(2β+2d)} become slow when the
dimension d is large. This phenomenon, known as the curse of dimensionality, arises in our
Theorem 2.2 because our assumption on the data distribution P is very mild and general.
Except for the Hölder smoothness condition on the conditional probability function η of P ,
we do not require any other assumptions in our Theorem 2.2. The curse of dimensionality
cannot be circumvented under such general assumption on P , as shown in Corollary 2.1
and (2.19). Therefore, to overcome the curse of dimensionality, we need other assumptions.
In our next theorem, we assume that η is the composition of several multivariate vector-
valued functions hq ◦ · · · ◦ h1 ◦ h0 such that each component function of hi is either a Hölder
smooth function whose output values only depend on a small number of its input variables,
or the function computing the maximum value of some of its input variables (see (2.32)
and (2.34)). Under this assumption, the curse of dimensionality is circumvented because
each component function of hi is either essentially defined on a low-dimensional space or
a very simple maximum value function. Our hierarchical composition assumption on the
conditional probability function is convincing and likely to be met in practice because
many phenomena in natural sciences can be “described well by processes that take place
at a sequence of increasing scales and are local at each scale, in the sense that they can be
described well by neighbor-to-neighbor interactions” (Appendix 2 of Poggio et al. (2016)).
Similar compositional assumptions have been adopted in many works such as Schmidt-
Hieber (2020); Kohler and Langer (2020); Kohler et al. (2022). One may refer to Poggio
et al. (2015, 2016, 2017); Kohler et al. (2022) for more discussions about the reasonableness
of such compositional assumptions.
In our compositional assumption mentioned above, we allow the component function
of hi to be the maximum value function, which is not Hölder-β smooth when β > 1. The
maximum value function is incorporated because taking the maximum value is an important
operation to pass key information from lower scale levels to higher ones, which appears
naturally in the compositional structure of the conditional probability function η in practical
classification problems. To see this, let us consider the following example. Suppose the
classification problem is to determine whether an input image contains a cat. We assume the
data is perfectly classified, in the sense that the conditional probability function η is equal
to zero or one almost surely. It should be noted that the assumption “η = 0 or 1 almost
surely” does not conflict with the continuity of η because the support of the distribution of
the input data may be unconnected. This classification task can be done by human beings
through considering each subpart of the input image and determining whether each subpart
contains a cat. Mathematically, let $V$ be a family of subsets of $\{1, 2, \ldots, d\}$ which consists
of all the index sets of those (considered) subparts of the input image $x \in [0,1]^d$. $V$ should
satisfy
$$\bigcup_{J \in V} J = \{1, 2, \ldots, d\}$$


because the union of all the subparts should cover the input image itself. For each J ∈ V,
let

$$\eta_J((x)_J) = \begin{cases} 1, & \text{if the subpart } (x)_J \text{ of the input image } x \text{ contains a cat,}\\ 0, & \text{if the subpart } (x)_J \text{ of the input image } x \text{ does not contain a cat.}\end{cases}$$

Then we will have $\eta(x) = \max_{J\in V}\{\eta_J((x)_J)\}$ a.s. because
$$\eta(x) = 1 \overset{\text{a.s.}}{\Longleftrightarrow} x \text{ contains a cat} \Leftrightarrow \text{at least one of the subparts } (x)_J \text{ contains a cat} \Leftrightarrow \eta_J((x)_J) = 1 \text{ for at least one } J \in V \Leftrightarrow \max_{J\in V}\{\eta_J((x)_J)\} = 1.$$

Hence the maximum value function emerges naturally in the expression of η.
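As an illustration only (the detectors $\eta_J$ below are hypothetical stand-ins, not from the paper), the compositional structure $\eta(x)=\max_{J\in V}\eta_J((x)_J)$ can be sketched in a few lines of code:

```python
# Illustration only: the compositional structure eta(x) = max_{J in V} eta_J(x_J).
# Each hypothetical detector eta_J only reads the coordinates indexed by J,
# mirroring the "local" component functions discussed in the text.
from typing import Callable, Dict, Sequence, Tuple

def make_eta(local_models: Dict[Tuple[int, ...], Callable[[Sequence[float]], float]]):
    """local_models maps an index set J (tuple of coordinates) to a function eta_J."""
    def eta(x: Sequence[float]) -> float:
        # One subpart "containing a cat" is enough, hence the maximum over J in V.
        return max(eta_J([x[i] for i in J]) for J, eta_J in local_models.items())
    return eta

# Toy example on [0,1]^4 with two subparts whose index sets cover {0,1,2,3}.
local_models = {
    (0, 1): lambda z: float(z[0] * z[1] > 0.25),  # hypothetical detector on coordinates 1, 2
    (1, 2, 3): lambda z: float(max(z) > 0.9),     # hypothetical detector on coordinates 2, 3, 4
}
eta = make_eta(local_models)
print(eta([0.6, 0.6, 0.1, 0.2]))  # 1.0: the first subpart fires
print(eta([0.1, 0.2, 0.3, 0.4]))  # 0.0: no subpart fires
```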


We now give the specific mathematical definition of our compositional model. For any
(d, d? , d∗ , β, r) ∈ N × N × N × (0, ∞) × (0, ∞), define

 
$$\mathcal{G}^{\mathrm{M}}_d(d_\star) := \left\{ f: [0,1]^d \to \mathbb{R} \,\middle|\, \exists\, I \subset \{1,2,\ldots,d\} \text{ such that } 1 \le \#(I) \le d_\star \text{ and } f(x) = \max\{(x)_i \,|\, i \in I\},\ \forall\, x \in [0,1]^d \right\}, \tag{2.27}$$
and
$$\mathcal{G}^{\mathrm{H}}_d(d_*, \beta, r) := \left\{ f: [0,1]^d \to \mathbb{R} \,\middle|\, \exists\, I \subset \{1,2,\ldots,d\} \text{ and } g \in \mathcal{B}^{\beta}_r\big([0,1]^{d_*}\big) \text{ such that } \#(I) = d_* \text{ and } f(x) = g((x)_I) \text{ for all } x \in [0,1]^d \right\}. \tag{2.28}$$

Thus GdM (d? ) consists of all functions from [0, 1]d to R which compute the maximum value
of at most d? components of their input vectors, and GdH (d∗ , β, r) consists of all functions
from [0, 1]d to R which only depend on d∗ components of the input vector and are Hölder-β
smooth with corresponding Hölder-β norm less than or equal to r. Obviously,

GdH (d∗ , β, r) = ∅, ∀ (d, d∗ , β, r) ∈ N × N × (0, ∞) × (0, ∞) with d < d∗ . (2.29)

Next, for any $(d_\star, d_*, \beta, r) \in \mathbb{N}\times\mathbb{N}\times(0,\infty)\times(0,\infty)$, define $\mathcal{G}^{\mathrm{H}}_\infty(d_*,\beta,r) := \bigcup_{d=1}^{\infty}\mathcal{G}^{\mathrm{H}}_d(d_*,\beta,r)$ and $\mathcal{G}^{\mathrm{M}}_\infty(d_\star) := \bigcup_{d=1}^{\infty}\mathcal{G}^{\mathrm{M}}_d(d_\star)$. Finally, for any $q \in \mathbb{N}\cup\{0\}$, any $(\beta,r)\in(0,\infty)^2$ and any $(d,d_\star,d_*,K)\in\mathbb{N}^4$ with
$$d_* \le \min\left\{d,\ K + 1_{\{0\}}(q)\cdot(d-K)\right\}, \tag{2.30}$$


define
$$\mathcal{G}^{\mathrm{CH}}_d(q, K, d_*, \beta, r) := \left\{ h_q \circ \cdots \circ h_1 \circ h_0 \,\middle|\,
\begin{array}{l}
h_0, h_1, \ldots, h_{q-1}, h_q \text{ are functions satisfying the following conditions:}\\[2pt]
\text{(i) } \mathrm{dom}(h_i) = [0,1]^K \text{ for } 0 < i \le q \text{ and } \mathrm{dom}(h_0) = [0,1]^d;\\[2pt]
\text{(ii) } \mathrm{ran}(h_i) \subset [0,1]^K \text{ for } 0 \le i < q \text{ and } \mathrm{ran}(h_q) \subset \mathbb{R};\\[2pt]
\text{(iii) } h_q \in \mathcal{G}^{\mathrm{H}}_\infty(d_*, \beta, r);\\[2pt]
\text{(iv) for } 0 \le i < q \text{ and } 1 \le j \le K, \text{ the } j\text{-th coordinate function of } h_i \text{ given by}\\
\qquad \mathrm{dom}(h_i) \ni x \mapsto (h_i(x))_j \in \mathbb{R} \text{ belongs to } \mathcal{G}^{\mathrm{H}}_\infty(d_*, \beta, r)
\end{array}
\right\} \tag{2.31}$$
and
$$\mathcal{G}^{\mathrm{CHOM}}_d(q, K, d_\star, d_*, \beta, r) := \left\{ h_q \circ \cdots \circ h_1 \circ h_0 \,\middle|\,
\begin{array}{l}
h_0, h_1, \ldots, h_{q-1}, h_q \text{ are functions satisfying the following conditions:}\\[2pt]
\text{(i) } \mathrm{dom}(h_i) = [0,1]^K \text{ for } 0 < i \le q \text{ and } \mathrm{dom}(h_0) = [0,1]^d;\\[2pt]
\text{(ii) } \mathrm{ran}(h_i) \subset [0,1]^K \text{ for } 0 \le i < q \text{ and } \mathrm{ran}(h_q) \subset \mathbb{R};\\[2pt]
\text{(iii) } h_q \in \mathcal{G}^{\mathrm{H}}_\infty(d_*, \beta, r) \cup \mathcal{G}^{\mathrm{M}}_\infty(d_\star);\\[2pt]
\text{(iv) for } 0 \le i < q \text{ and } 1 \le j \le K, \text{ the } j\text{-th coordinate function of } h_i \text{ given by}\\
\qquad \mathrm{dom}(h_i) \ni x \mapsto (h_i(x))_j \in \mathbb{R} \text{ belongs to } \mathcal{G}^{\mathrm{H}}_\infty(d_*, \beta, r) \cup \mathcal{G}^{\mathrm{M}}_\infty(d_\star)
\end{array}
\right\}. \tag{2.32}$$

Obviously, we always have that GdCH (q, K, d∗ , β, r) ⊂ GdCHOM (q, K, d? , d∗ , β, r). The condi-
tion (2.30), which is equivalent to
$$d_* \le \begin{cases} d, & \text{if } q = 0,\\ d \wedge K, & \text{if } q > 0,\end{cases}$$
is required in the above definitions because it follows from (2.29) that
$$\mathcal{G}^{\mathrm{CH}}_d(q, K, d_*, \beta, r) = \emptyset \quad \text{if } d_* > \min\left\{d,\ K + 1_{\{0\}}(q)\cdot(d-K)\right\}.$$




Thus we impose the condition (2.30) simply to avoid the trivial empty set. The space
$\mathcal{G}^{\mathrm{CH}}_d(q, K, d_*, \beta, r)$ consists of composite functions $h_q \circ \cdots \circ h_1 \circ h_0$ such that each com-
ponent function of hi only depends on d∗ components of its input vector and is Hölder-β
smooth with corresponding Hölder-β norm less than or equal to r. For example, the function

$[0,1]^4 \ni x \mapsto \sum_{1\le i<j\le 4}(x)_i\,(x)_j \in \mathbb{R}$ belongs to $\mathcal{G}^{\mathrm{CH}}_4(2,4,2,2,8)$ (cf. Figure 2.1).

[Figure 2.1: An illustration of the function $[0,1]^4 \ni x \mapsto \sum_{1\le i<j\le 4}(x)_i\,(x)_j \in \mathbb{R}$, written as a composition $h_2 \circ h_1 \circ h_0$, which belongs to $\mathcal{G}^{\mathrm{CH}}_4(2,4,2,2,8)$.]

The defini-
tion of GdCHOM (q, K, d? , d∗ , β, r) is similar to that of GdCH (q, K, d∗ , β, r). The only difference
is that, in comparison to GdCH (q, K, d∗ , β, r), we in the definition of GdCHOM (q, K, d? , d∗ , β, r)
additionally allow the component function of hi to be the function which computes the
maximum value of at most d? components of its input vector. For example, the function
$[0,1]^4 \ni x \mapsto \max_{1\le i<j\le 4}(x)_i \cdot (x)_j \in \mathbb{R}$ belongs to $\mathcal{G}^{\mathrm{CHOM}}_4(2, 6, 3, 2, 2, 2)$ (cf. Figure 2.2). From
the above description of the spaces GdCH (q, K, d∗ , β, r) and GdCHOM (q, K, d? , d∗ , β, r), we see
that the condition (2.30) is very natural because it merely requires the essential input di-
mension d∗ of the Hölder-β smooth component function of hi to be less than or equal to
its actual input dimension, which is d (if i = 0) or K (if i > 0). At last, we point out that
the space GdCH (q, K, d∗ , β, r) reduces to the Hölder ball Brβ ([0, 1]d ) when q = 0 and d∗ = d.
Indeed, we have that

$$\mathcal{B}^{\beta}_r([0,1]^d) = \mathcal{G}^{\mathrm{H}}_d(d,\beta,r) = \mathcal{G}^{\mathrm{CH}}_d(0,K,d,\beta,r) \subset \mathcal{G}^{\mathrm{CHOM}}_d(0,K,d_\star,d,\beta,r), \quad \forall\, K \in \mathbb{N},\ d \in \mathbb{N},\ d_\star \in \mathbb{N},\ \beta \in (0,\infty),\ r \in (0,\infty). \tag{2.33}$$

 
[Figure 2.2: An illustration of the function $[0,1]^4 \ni x \mapsto \max_{1\le i<j\le 4}(x)_i\,(x)_j \in \mathbb{R}$, written as a composition $h_2 \circ h_1 \circ h_0$, which belongs to $\mathcal{G}^{\mathrm{CHOM}}_4(2,6,3,2,2,2)$.]
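For concreteness, here is a small code sketch of the composition behind Figure 2.2 (slightly simplified: the zero-padding coordinates that keep every intermediate layer of width $K=6$ are dropped); it only shows how the pairwise products, the partial maxima, and the final maximum fit together:

```python
# Code sketch of the composition in Figure 2.2 (zero-padding coordinates omitted):
# h0 forms all pairwise products, h1 takes partial maxima, h2 the final maximum,
# so that h2(h1(h0(x))) = max_{1 <= i < j <= 4} x_i * x_j.
from itertools import combinations

def h0(x):  # six coordinate functions, each a product of two inputs (Hölder smooth, d_* = 2)
    return [x[i] * x[j] for i, j in combinations(range(4), 2)]

def h1(u):  # coordinate functions are maxima of at most three inputs (d_star = 3)
    # u = (x1*x2, x1*x3, x1*x4, x2*x3, x2*x4, x3*x4)
    return [max(u[0], u[1], u[2]), max(u[3], u[4], u[5])]

def h2(v):  # maximum of the remaining two values
    return max(v)

x = [0.2, 0.9, 0.5, 0.7]
assert abs(h2(h1(h0(x))) - max(a * b for a, b in combinations(x, 2))) < 1e-12
```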


Now we are in a position to state our Theorem 2.3, where we establish sharp convergence
rates, which are free from the input dimension d, for fully connected DNN classifiers trained
with the logistic loss under the assumption that the conditional probability function η of
the data distribution belongs to $\mathcal{G}^{\mathrm{CHOM}}_d(q,K,d_\star,d_*,\beta,r)$. In particular, it can be shown that
the convergence rate of the excess logistic risk stated in (2.36) in Theorem 2.3 is optimal
(up to some logarithmic term). Since $\mathcal{G}^{\mathrm{CH}}_d(q,K,d_*,\beta,r) \subset \mathcal{G}^{\mathrm{CHOM}}_d(q,K,d_\star,d_*,\beta,r)$, the
same convergence rates as in Theorem 2.3 can also be achieved under the slightly narrower
assumption that η belongs to GdCH (q, K, d∗ , β, r). The results of Theorem 2.3 break the
curse of dimensionality and help explain why deep neural networks perform well, especially
in high-dimensional problems.

Theorem 2.3 Let $q \in \mathbb{N}\cup\{0\}$, $(d, d_\star, d_*, K)\in\mathbb{N}^4$ with $d_* \le \min\{d,\ K + 1_{\{0\}}(q)\cdot(d-K)\}$,
$(\beta, r)\in(0,\infty)^2$, $n\in\mathbb{N}$, $\nu\in[0,\infty)$, $\{(X_i,Y_i)\}_{i=1}^n$ be an i.i.d. sample in $[0,1]^d\times\{-1,1\}$
and $\hat{f}_n^{\mathrm{FNN}}$ be an ERM with respect to the logistic loss $\phi(t)=\log(1+e^{-t})$ over the space
$\mathcal{F}^{\mathrm{FNN}}_d(G,N,S,B,F)$, which is given by (2.14). Define
$$\mathcal{H}^{d,\beta,r}_{4,q,K,d_\star,d_*} := \left\{ P \in \mathcal{H}^{d}_0 \,\middle|\, P_X\big(\big\{z\in[0,1]^d \,\big|\, P(\{1\}|z)=\hat\eta(z)\big\}\big)=1 \text{ for some } \hat\eta\in\mathcal{G}^{\mathrm{CHOM}}_d(q,K,d_\star,d_*,\beta,r)\right\}. \tag{2.34}$$
Then there exists a constant $c\in(0,\infty)$ only depending on $(d_\star,d_*,\beta,r,q)$, such that the
estimator $\hat{f}_n^{\mathrm{FNN}}$ defined by (2.14) with
$$c\log n \le G \lesssim \log n,\quad N \asymp \left(\frac{(\log n)^5}{n}\right)^{\frac{-d_*}{d_*+\beta\cdot(1\wedge\beta)^q}},\quad S \asymp \left(\frac{(\log n)^5}{n}\right)^{\frac{-d_*}{d_*+\beta\cdot(1\wedge\beta)^q}}\cdot\log n,$$
$$1 \le B \lesssim n^{\nu},\quad \text{and}\quad \frac{\beta\cdot(1\wedge\beta)^q}{d_*+\beta\cdot(1\wedge\beta)^q}\cdot\log n \le F \lesssim \log n \tag{2.35}$$
satisfies
$$\sup_{P\in\mathcal{H}^{d,\beta,r}_{4,q,K,d_\star,d_*}} \mathbb{E}_{P^{\otimes n}}\left[\mathcal{E}^{\phi}_P\left(\hat{f}_n^{\mathrm{FNN}}\right)\right] \lesssim \left(\frac{(\log n)^5}{n}\right)^{\frac{\beta\cdot(1\wedge\beta)^q}{d_*+\beta\cdot(1\wedge\beta)^q}} \tag{2.36}$$
and
$$\sup_{P\in\mathcal{H}^{d,\beta,r}_{4,q,K,d_\star,d_*}} \mathbb{E}_{P^{\otimes n}}\left[\mathcal{E}_P\left(\hat{f}_n^{\mathrm{FNN}}\right)\right] \lesssim \left(\frac{(\log n)^5}{n}\right)^{\frac{\beta\cdot(1\wedge\beta)^q}{2d_*+2\beta\cdot(1\wedge\beta)^q}}. \tag{2.37}$$

The proof of Theorem 2.3 is given in Appendix C.4. Note that Theorem 2.3 directly
leads to Theorem 2.2 because it follows from (2.33) that

$$\mathcal{H}^{d,\beta,r}_1 \subset \mathcal{H}^{d,\beta,r}_{4,q,K,d_\star,d_*} \quad \text{if } q=0,\ d_*=d \text{ and } d_\star = K = 1.$$

Consequently, Theorem 2.3 can be regarded as a generalization of Theorem 2.2. Note that
both the rates $\mathcal{O}\!\left(\left(\frac{(\log n)^5}{n}\right)^{\frac{\beta\cdot(1\wedge\beta)^q}{d_*+\beta\cdot(1\wedge\beta)^q}}\right)$ and $\mathcal{O}\!\left(\left(\frac{(\log n)^5}{n}\right)^{\frac{\beta\cdot(1\wedge\beta)^q}{2d_*+2\beta\cdot(1\wedge\beta)^q}}\right)$ in (2.36) and (2.37) are
independent of the input dimension d, thereby overcoming the curse of dimensionality.
Moreover, according to Theorem 2.6 and the comments therein, the rate $\left(\frac{(\log n)^5}{n}\right)^{\frac{\beta\cdot(1\wedge\beta)^q}{d_*+\beta\cdot(1\wedge\beta)^q}}$


in (2.36) for the convergence of the excess logistic risk is even optimal (up to some logarith-
mic factor). This justifies the sharpness of Theorem 2.3.
Next, we would like to demonstrate the main idea of the proof of Theorem 2.3. The
strategy we adopted is to apply Theorem 2.1 with a suitable ψ satisfying (2.7). Let P be
d,β,r
an arbitrary probability in H4,q,K,d ? ,d∗
and denote by η the conditional probability function
$P(\{1\}|\cdot)$ of $P$. According to the previous discussions, we cannot simply take $\psi(x,y) = \phi\big(y f^*_{\phi,P}(x)\big)$ as the target function $f^*_{\phi,P} = \log\frac{\eta}{1-\eta}$ is unbounded. Instead, we define $\psi$ by
(2.9) for some carefully selected δ1 ∈ (0, 1/2]. For such ψ, we prove
$$\int_{[0,1]^d\times\{-1,1\}} \psi(x,y)\,dP(x,y) = \inf\left\{\mathcal{R}^{\phi}_P(f) \,\middle|\, f:[0,1]^d\to\mathbb{R} \text{ is measurable}\right\} \tag{2.38}$$

in Lemma C.3, and establish a tight inequality of form (2.5) with $\Gamma = \mathcal{O}\big((\log\frac{1}{\delta_1})^2\big)$ in Lemma
C.10. We then calculate the covering numbers of F := FdFNN (G, N, S, B, F ) by Corollary
A.1 and use Lemma C.15 to estimate the approximation error
$$\left(\inf_{f\in\mathcal{F}} \mathcal{R}^{\phi}_P(f) - \int_{[0,1]^d\times\{-1,1\}} \psi(x,y)\,dP(x,y)\right)$$

which is essentially $\inf_{f\in\mathcal{F}} \mathcal{E}^{\phi}_P(f)$. Substituting the above estimates into the right hand side of (2.6) and taking the supremum over $P \in \mathcal{H}^{d,\beta,r}_{4,q,K,d_\star,d_*}$, we obtain (2.36). We then derive
(2.37) from (2.36) through the calibration inequality (2.21).
We would like to point out that the above scheme for obtaining generalization bounds,
which is built on our novel oracle-type inequality in Theorem 2.1 with a carefully con-
structed ψ, is very general. This scheme can be used to establish generalization bounds for
classification in other settings, provided that the estimation for the corresponding approx-
imation error is given. For example, one can expect to establish generalization bounds for
convolutional neural network (CNN) classification with the logistic loss by using Theorem
2.1 together with recent results about CNN approximation. CNNs perform convolutions
instead of matrix multiplications in at least one of their layers (cf. Chapter 9 of Goodfellow
et al. (2016)). Approximation properties of various CNN architectures have been inten-
sively studied recently. For instance, 1D CNN approximation is studied in Zhou (2020b,a);
Mao et al. (2021); Fang et al. (2020), and 2D CNN approximation is investigated in Kohler
et al. (2022); He et al. (2022). With the help of these CNN approximation results and
classical concentration techniques, generalization bounds for CNN classification have been
established in many works such as Kohler and Langer (2020); Kohler et al. (2022); Shen
et al. (2021); Feng et al. (2023). In our coming work Zhang et al. (2024), we will derive
generalization bounds for CNN classification with logistic loss on spheres under the Sobolev
smooth conditional probability assumption through the novel framework developed in our
paper.
In our proof of Theorem 2.2 and Theorem 2.3, a tight error bound for neural network
approximation of the logarithm function log(·) arises as a by-product. Indeed, for a given
data distribution P on [0, 1]d × {−1, 1}, to estimate the approximation error of FdFNN , we
need to construct neural networks to approximate the target function $f^*_{\phi,P} = \log\frac{\eta}{1-\eta}$, where
η denotes the conditional probability function of P . Due to the unboundedness of fφ,P ∗ , one


cannot approximate fφ,P ∗ directly. To overcome this difficulty, we consider truncating fφ,P∗

to obtain an efficient approximation. We design neural networks η̃ and ˜l to approximate η


on [0, 1]d and log(·) on [δn , 1 − δn ] respectively, where δn ∈ (0, 1/4] is a carefully selected
number which depends on the sample size n and tends to zero as n → ∞. Let Πδn denote
the clipping function given by Πδn : R → [δn , 1 − δn ], t 7→ arg mint0 ∈[δn ,1−δn ] |t0 − t|. Then
L̃ : t 7→ ˜l(Πδn (t)) − ˜l(1 − Πδn (t)) is a neural network which approximates the function

$$L_{\delta_n}: t \mapsto \begin{cases} \log\frac{t}{1-t}, & \text{if } t \in [\delta_n, 1-\delta_n],\\[2pt] \log\frac{1-\delta_n}{\delta_n}, & \text{if } t > 1-\delta_n,\\[2pt] \log\frac{\delta_n}{1-\delta_n}, & \text{if } t < \delta_n, \end{cases} \tag{2.39}$$

meaning that the function L̃(η̃(x)) = ˜l (Πδn (η̃(x))) − ˜l (Πδn (1 − η̃(x))) is a neural network
which approximates the truncated $f^*_{\phi,P}$ given by
$$L_{\delta_n}\circ\eta: x \mapsto L_{\delta_n}(\eta(x)) = \begin{cases} f^*_{\phi,P}(x), & \text{if } \big|f^*_{\phi,P}(x)\big| \le \log\frac{1-\delta_n}{\delta_n},\\[2pt] \mathrm{sgn}\big(f^*_{\phi,P}(x)\big)\log\frac{1-\delta_n}{\delta_n}, & \text{otherwise.} \end{cases}$$
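The truncation pipeline above is easy to mimic numerically. The following sketch (plain floating-point code, not the paper's network construction) implements the clipping map $\Pi_{\delta_n}$ and the truncated link $L_{\delta_n}$ from (2.39):

```python
# Numerical sketch of the clipping map Pi_delta and the truncated link L_delta of (2.39);
# L_delta(eta(x)) coincides with the target f*_{phi,P}(x) = log(eta(x)/(1-eta(x)))
# whenever |f*_{phi,P}(x)| <= log((1-delta)/delta).
import math

def clip(t: float, delta: float) -> float:
    """Pi_delta: project t onto [delta, 1 - delta]."""
    return min(max(t, delta), 1.0 - delta)

def L_delta(t: float, delta: float) -> float:
    """Truncated logit: log(t/(1-t)) capped at +/- log((1-delta)/delta)."""
    c = clip(t, delta)
    return math.log(c / (1.0 - c))

delta = 1e-3
for eta_x in (0.0, 1e-6, 0.3, 0.999999, 1.0):
    print(eta_x, L_delta(eta_x, delta))  # values of eta near 0 or 1 are capped
```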
One can build η̃ by applying some existing results on approximation theory of neural net-
works (see Appendix B). However, the construction of ˜l requires more effort. Since the
logarithm function log(·) is unbounded near 0, which leads to the blow-up of its Hölder
norm on $[\delta_n, 1-\delta_n]$ when $\delta_n$ becomes small, existing conclusions, e.g., the results in Ap-
pendix B, cannot yield satisfactory error bounds for neural network approximation of log(·)
on [δn , 1 − δn ]. To see this, let us consider using Theorem B.1 to estimate the approximation
error directly. Note that approximating log(·) on [δn , 1 − δn ] is equivalent to approximating
lδn (t) := log((1 − 2δn )t + δn ) on [0, 1]. For β1 > 0 with k = dβ1 − 1e and λ = β1 − dβ1 − 1e,
(k)
denote by lδn the k-th derivative of lδn . Then there holds
(k) (k)
lδn (t) − lδn (t0 )
klδn kC k,λ ([0,1]) ≥ sup
0≤t<t0 ≤1 |t − t0 |λ
 
(k) (k) δn (k+1) δn
lδn (0) − lδn 1−2δn lδn (t) · 0 − 1−2δn
≥ λ
≥ h inf i λ
δn δn δn
0− t∈ 0, 1−2δ 0−
1−2δn n 1−2δn
k! δn
((1−2δn )t+δn )k+1
· (1 − 2δn )k+1 · 0 − 1−2δn k! 1
= h inf λ
= · (1 − 2δn )k+λ · .
δn
t∈ 0, 1−2δ
i
δn 2k+1 δnk+λ
n 0− 1−2δn

Hence it follows from δn ∈ (0, 1/4] that


dβ1 − 1e! 1 dβ1 − 1e! 1 3 1
klδn kC k,λ ([0,1]) ≥ dβ e
· (1 − 2δn )β1 · β1 ≥ dβ e
· β1 ≥ · β1 .
2 1
δn 4 1
δn 128 δn
By Theorem B.1, for any positive integers m and M 0 with
n   o 3 1
M 0 ≥ max (β1 + 1), klδn kC k,λ ([0,1]) dβ1 e + 1 · e ≥ klδn kC k,λ ([0,1]) ≥ · β1 , (2.40)
128 δn


there exists a neural network
$$\tilde{f} \in \mathcal{F}^{\mathrm{FNN}}_1\Big(14m(2+\log_2(1\vee\beta_1)),\ 6(1+\lceil\beta_1\rceil)M',\ 987(2+\beta_1)^4 M'm,\ 1,\ \infty\Big) \tag{2.41}$$
such that
$$\sup_{x\in[0,1]}\big|l_{\delta_n}(x)-\tilde{f}(x)\big| \le \|l_{\delta_n}\|_{C^{k,\lambda}([0,1])}\cdot\lceil\beta_1\rceil\cdot 3^{\beta_1} M'^{-\beta_1} + \Big(1+2\|l_{\delta_n}\|_{C^{k,\lambda}([0,1])}\cdot\lceil\beta_1\rceil\Big)\cdot 6\cdot(2+\beta_1^2)\cdot M'\cdot 2^{-m}.$$

To make this error less than or equal to a given error threshold εn (depending on n), there
must hold
$$\varepsilon_n \ge \|l_{\delta_n}\|_{C^{k,\lambda}([0,1])}\cdot\lceil\beta_1\rceil\cdot 3^{\beta_1} M'^{-\beta_1} \ge \|l_{\delta_n}\|_{C^{k,\lambda}([0,1])}\cdot M'^{-\beta_1} \ge M'^{-\beta_1}\cdot\frac{3}{128}\cdot\frac{1}{\delta_n^{\beta_1}}.$$

This together with (2.40) gives


$$M' \ge \max\left\{\frac{3}{128}\cdot\frac{1}{\delta_n^{\beta_1}},\ \varepsilon_n^{-1/\beta_1}\cdot\left(\frac{3}{128}\right)^{1/\beta_1}\cdot\frac{1}{\delta_n}\right\}. \tag{2.42}$$

Consequently, the width and the number of nonzero parameters of f˜ are greater than or
equal to the right hand side of (2.42), which may be too large when δn is small (recall that
δn → 0 as n → ∞). In this paper, we establish a new sharp error bound for approximating
the natural logarithm function log(·) on [δn , 1 − δn ], which indicates that one can achieve
the same approximation error by using a much smaller network. This refined error bound is
given in Theorem 2.4 which is critical in our proof of Theorem 2.2 and also deserves special
attention in its own right.

Theorem 2.4 Given $a \in (0,1/2]$, $b \in (a,1]$, $\alpha \in (0,\infty)$ and $\varepsilon \in (0,1/2]$, there exists
$$\tilde{f} \in \mathcal{F}^{\mathrm{FNN}}_1\left(A_1\log\frac1\varepsilon + 139\log\frac1a,\ A_2\left(\frac1\varepsilon\right)^{\frac{1}{\alpha}}\cdot\log\frac1a,\ A_3\left(\frac1\varepsilon\right)^{\frac{1}{\alpha}}\cdot\log\frac1\varepsilon\cdot\log\frac1a + 65440\left(\log\frac1a\right)^{2},\ 1,\ \infty\right)$$
such that
$$\sup_{z\in[a,b]}\big|\log z - \tilde{f}(z)\big| \le \varepsilon \quad \text{and} \quad \log a \le \tilde{f}(t) \le \log b,\ \forall\, t\in\mathbb{R},$$

where (A1 , A2 , A3 ) ∈ (0, ∞)3 are constants depending only on α.

In Theorem 2.4, we show that for each fixed α ∈ (0, ∞) one can construct a neural
network to approximate the natural logarithm function log(·) on [a, b] with error ε, where the
depth, width and number of nonzero parameters of this neural network are in the same order
of magnitude as $\log\frac1\varepsilon+\log\frac1a$, $\left(\frac1\varepsilon\right)^{\frac1\alpha}\log\frac1a$ and $\left(\frac1\varepsilon\right)^{\frac1\alpha}\log\frac1\varepsilon\log\frac1a+\left(\log\frac1a\right)^2$, respectively.

Recall that in our generalization analysis we need to approximate log on [δn , 1 − δn ], which


is equivalent to approximating lδn (t) = log((1−2δn )t+δn ) on [0, 1]. Let εn ∈ (0, 1/2] denote
the desired accuracy of the approximation of lδn on [0, 1], which depends on the sample size
n and converges to zero as n → ∞. Using Theorem 2.4 with α = 2β1 , we deduce that for
any β1 > 0 one can approximate lδn on [0, 1] with error εn by a network of which the width
and the number of nonzero parameters are less than $C_{\beta_1}\,\varepsilon_n^{-\frac{1}{2\beta_1}}|\log\varepsilon_n|\cdot|\log\delta_n|^2$ with some
constant $C_{\beta_1}>0$ (depending only on $\beta_1$). The complexity of this neural network is much
smaller than that of $\tilde{f}$ defined in (2.41) with (2.42) as $n\to\infty$ since $|\log\delta_n|^2 = o(1/\delta_n)$ and
$\varepsilon_n^{-\frac{1}{2\beta_1}}|\log\varepsilon_n| = o\big(\varepsilon_n^{-1/\beta_1}\big)$ as $n\to\infty$. In particular, when

$$\frac{1}{n^{\theta_2}} \lesssim \varepsilon_n\wedge\delta_n \le \varepsilon_n\vee\delta_n \lesssim \frac{1}{n^{\theta_1}} \quad \text{for some } \theta_2\ge\theta_1>0 \text{ independent of } n \text{ or } \beta_1, \tag{2.43}$$
which occurs in our generalization analysis (e.g., in our proof of Theorem 2.3, we essentially
take $\varepsilon_n = \delta_n \asymp \left(\frac{(\log n)^5}{n}\right)^{\frac{\beta\cdot(1\wedge\beta)^q}{d_*+\beta\cdot(1\wedge\beta)^q}}$, meaning that $n^{\frac{-\beta\cdot(1\wedge\beta)^q}{d_*+\beta\cdot(1\wedge\beta)^q}} \lesssim \varepsilon_n = \delta_n \lesssim n^{\frac{-\beta\cdot(1\wedge\beta)^q}{2d_*+\beta\cdot(1\wedge\beta)^q}}$ (cf.
(C.74), (C.78), (C.87) and (C.92)), we will have that the right hand side of (2.42) grows no
slower than nθ1 +θ1 /β1 . Hence, in this case, no matter what β1 is, the width and the number
of nonzero parameters of the network f˜, which approximates lδn on [0, 1] with error εn and
is obtained by using Theorem B.1 directly (cf. (2.41)), will grow faster than nθ1 as n → ∞.
However, it follows from Theorem 2.4 that there exists a network f of which the width and
the number of nonzero parameters are less than $C_{\beta_1}\,\varepsilon_n^{-\frac{1}{2\beta_1}}|\log\varepsilon_n|\cdot|\log\delta_n|^2 \lesssim n^{\frac{\theta_2}{2\beta_1}}|\log n|^3$
such that it achieves the same approximation error as that of f˜. By taking β1 large enough
we can make the growth (as n → ∞) of the width and the number of nonzero parameters of
f slower than nθ for arbitrary θ ∈ (0, θ1 ]. Therefore, in the usual case when the complexity
of η̃ is not too small in the sense that the width and the number of nonzero parameters
of η̃ grow faster than nθ3 as n → ∞ for some θ3 ∈ (0, ∞) independent of n or β1 , we
can use Theorem 2.4 with a large enough α = 2β1 to construct the desired network ˜l of
which the complexity is insignificant in comparison to that of L̃ ◦ η̃. In other words, the
neural network approximation of logarithmic function based on Theorem 2.4 brings little
complexity in approximating the target function fφ,P ∗ . The above discussion demonstrates

the tightness of the inequality in Theorem 2.4 and the advantage of Theorem 2.4 over those
general results on approximation theory of neural networks such as Theorem B.1.
It is worth mentioning that an alternative way to approximate the function Lδn defined
in (2.39) is by simply using its piecewise linear interpolation. For example, in Kohler and
Langer (2020), the authors express the piecewise linear interpolation of Lδn at equidistant
points by a neural network L̃, and construct a CNN η̃ to approximate η, leading to an
approximation of the truncated target function of the logistic risk L̃ ◦ η̃. It follows from
Proposition 3.2.4 of Atkinson and Han (2009) that

$$h_n^2 \lesssim \left\|\tilde{L} - L_{\delta_n}\right\|_{[\delta_n, 1-\delta_n]} \lesssim \frac{h_n^2}{\delta_n^2}, \tag{2.44}$$

where hn denotes the step size of the interpolation. Therefore, to ensure the error bound

$\varepsilon_n$ for the approximation of $L_{\delta_n}$ by $\tilde{L}$, we must have $h_n \lesssim \sqrt{\varepsilon_n}$, implying that the number
of nonzero parameters of $\tilde{L}$ will grow no slower than $\frac{1}{h_n} \gtrsim \frac{1}{\sqrt{\varepsilon_n}}$ as $n \to \infty$. Consequently,


in the case (2.43), we will have that the number of nonzero parameters of L̃ will grow no
slower than nθ1 /2 . Therefore, in contrast to using Theorem 2.4, we cannot make the number
of nonzero parameters of the network L̃ obtained from piecewise linear interpolation grow
slower than nθ for arbitrarily small θ > 0. As a result, using piecewise linear interpolation
to approximate Lδn may bring extra complexity in establishing the approximation of the
target function. However, the advantage of using piecewise linear interpolation is that one
can make the depth or width of the network L̃ which expresses the desired interpolation
bounded as n → ∞ (cf. Lemma 7 in Kohler and Langer (2020) and its proof therein).
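The interplay between the step size $h_n$ and $\delta_n$ in (2.44) can be checked numerically. The following sketch (illustrative only; the grid sizes are arbitrary choices of ours) interpolates $L_{\delta_n}$ at equidistant points with numpy and measures the sup-error on $[\delta_n, 1-\delta_n]$:

```python
# Illustrative numerical check of (2.44): piecewise linear interpolation of L_delta at
# equidistant points; the sup-error on [delta, 1-delta] sits between ~h^2 and ~(h/delta)^2.
import numpy as np

def L_delta(t, delta):
    t = np.clip(t, delta, 1.0 - delta)
    return np.log(t / (1.0 - t))

delta, h = 1e-2, 1e-3
knots = np.arange(0.0, 1.0 + h, h)               # equidistant interpolation points
grid = np.linspace(delta, 1.0 - delta, 200_001)  # fine evaluation grid
interp = np.interp(grid, knots, L_delta(knots, delta))
err = np.max(np.abs(interp - L_delta(grid, delta)))
print(err, h**2, (h / delta) ** 2)               # err lies between the two reference values
```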
The proof of Theorem 2.4 is in Appendix C.3. The key observation in our proof is the
fact that for all k ∈ N, the following holds true:

$$\log x = \log(2^k \cdot x) - k\log 2, \quad \forall\, x \in (0,\infty). \tag{2.45}$$

Then we can use the values of $\log(\cdot)$ which are taken far away from zero (i.e., $\log(2^k\cdot x)$ in
the right hand side of (2.45)) to determine its values taken near zero, while approximating
the former is more efficient as the Hölder norm of the natural logarithm function on domains
far away from zero can be well controlled.
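The following sketch illustrates the identity (2.45) in code (it is not the network construction of Theorem 2.4: the base approximator on $[1/2,1)$ is replaced here by a Taylor polynomial, and the selection of $k$ is done with a loop rather than with ReLU units):

```python
# Sketch of the identity (2.45): log(x) = log(2^k * x) - k*log(2).  The point is that
# the approximation error does not degrade as x -> 0, because the base approximator is
# only ever evaluated on [1/2, 1), where log is well behaved.
import math

def approx_log_on_half_one(z: float) -> float:
    # Stand-in for a small network approximating log on [1/2, 1): degree-4 Taylor at 3/4.
    u = z - 0.75
    return (math.log(0.75) + u / 0.75 - u**2 / (2 * 0.75**2)
            + u**3 / (3 * 0.75**3) - u**4 / (4 * 0.75**4))

def approx_log(x: float) -> float:
    k = 0
    while x * (2 ** k) < 0.5:  # rescale x into [1/2, 1)
        k += 1
    return approx_log_on_half_one(x * (2 ** k)) - k * math.log(2)

for x in (1e-6, 1e-3, 0.2, 0.9):
    print(x, approx_log(x) - math.log(x))  # errors stay of comparable size for all x
```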
In the next theorem, we show that if the data distribution has a piecewise smooth
decision boundary, then DNN classifiers trained by empirical logistic risk minimization can
also achieve dimension-free rates of convergence under the noise condition (2.24) and a
margin condition (see (2.51) below). Before stating this result, we need to introduce this
margin condition and relevant concepts.
We first define the set of (binary) classifiers which have a piecewise Hölder smooth
decision boundary. We will adopt similar notations from Kim et al. (2021) to describe
this set. Specifically, let β, r ∈ (0, ∞) and I, Θ ∈ N. For g ∈ Brβ [0, 1]d−1 and j =


1, 2, · · · , d, we define horizon function Ψg,j : [0, 1]d → {0, 1} as Ψg,j (x) := 1{(x)j ≥g(x−j )} ,
where x−j := ((x)1 , · · · , (x)j−1 , (x)j+1 , · · · , (x)d ) ∈ [0,1]d−1 . For each horizon function, the
corresponding basis piece Λg,j is defined as Λg,j := x ∈ [0, 1]d Ψg,j (x) = 1 . Note that
Λg,j =  x ∈ [0, 1]d (x)j ≥ max {0, g(x−j )} . Thus Λg,j is enclosed by the hypersurface


Sg,j := x ∈ [0, 1]d (x)j = max {0, g(x−j )} and (part of) the boundary of [0, 1]d . We then
define the set of pieces which are the intersection of I basis pieces as
I
( )
\  
Ad,β,r,I := A A = Λgk ,jk for some jk ∈ {1, 2, · · · , d} and gk ∈ Brβ [0, 1]d−1 ,
k=1

and define C d,β,r,I,Θ to be a set of binary classifiers as

$$\mathcal{C}^{d,\beta,r,I,\Theta} := \left\{ C(x) = 2\sum_{i=1}^{\Theta} 1_{A_i}(x) - 1 : [0,1]^d\to\{-1,1\} \,\middle|\, A_1, A_2, A_3, \cdots, A_\Theta \text{ are disjoint sets in } \mathcal{A}^{d,\beta,r,I} \right\}. \tag{2.46}$$

Thus C d,β,r,I,Θ consists of all binary classifiers which are equal to +1 on some disjoint
sets $A_1,\ldots,A_\Theta$ in $\mathcal{A}^{d,\beta,r,I}$ and $-1$ otherwise. Let $A_t = \bigcap_{k=1}^{I}\Lambda_{g_{t,k},j_{t,k}}$ ($t=1,2,\ldots,\Theta$) be
arbitrary disjoint sets in $\mathcal{A}^{d,\beta,r,I}$, where $j_{t,k}\in\{1,2,\ldots,d\}$ and $g_{t,k}\in\mathcal{B}^{\beta}_r\big([0,1]^{d-1}\big)$. Then
$C:[0,1]^d\to\{-1,1\},\ x\mapsto 2\sum_{i=1}^{\Theta}1_{A_i}(x)-1$ is a classifier in $\mathcal{C}^{d,\beta,r,I,\Theta}$. Recall that $\Lambda_{g_{t,k},j_{t,k}}$ is


enclosed by Sgt,k ,jt,k and (part of) the boundary of [0, 1]d for each t, k. Hence for each t, the
region At is enclosed by hypersurfaces Sgt,k ,jt,k (k = 1, . . . , I) and (part of) the boundary
of [0, 1]d . We say the piecewise Hölder smooth hypersurface

$$D^*_C := \bigcup_{t=1}^{\Theta}\bigcup_{k=1}^{I}\left(S_{g_{t,k},j_{t,k}} \cap A_t\right) \tag{2.47}$$

is the decision boundary of the classifier C because intuitively, points on different sides of
DC∗ are classified into different categories (i.e. +1 and −1) by C (cf. Figure 2.3). Denote by
∆C (x) the distance from x ∈ [0, 1]d to the decision boundary DC∗ , i.e.,

$$\Delta_C(x) := \inf\left\{\left\|x - x'\right\|_2 \,\middle|\, x' \in D^*_C\right\}. \tag{2.48}$$

[Figure 2.3: Illustration of the sets $A_1,\dots,A_\Theta$ when $d=2$, $\Theta=2$, $I=3$, $j_{2,1}=j_{2,2}=j_{1,1}=j_{1,2}=2$ and $j_{1,3}=j_{2,3}=1$. The classifier $C(x)=2\sum_{t=1}^{\Theta}1_{A_t}(x)-1$ is equal to $+1$ on $A_1\cup A_2$ and $-1$ otherwise. The decision boundary $D^*_C$ of $C$ is marked red.]
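To make the construction of $\mathcal{C}^{d,\beta,r,I,\Theta}$ concrete, here is a small code sketch (the boundary functions $g$ below are hypothetical examples of our own) of horizon functions, basis pieces, and the resulting classifier $C(x)=2\sum_t 1_{A_t}(x)-1$:

```python
# Code sketch (hypothetical boundary functions g): a basis piece is {x : x_j >= max(0, g(x_{-j}))},
# a piece A is an intersection of I basis pieces, and C(x) = 2 * sum_t 1_{A_t}(x) - 1.
import numpy as np

def horizon(x, g, j):
    """Psi_{g,j}(x) = 1{ x_j >= max(0, g(x_{-j})) }  (the max with 0 is automatic on [0,1]^d)."""
    x_minus_j = np.delete(x, j)
    return float(x[j] >= max(0.0, g(x_minus_j)))

def in_piece(x, boundaries):
    """Membership in A = intersection of basis pieces, given as (g, j) pairs."""
    return all(horizon(x, g, j) == 1.0 for g, j in boundaries)

def classifier(x, pieces):
    """C(x) = +1 if x lies in one of the (disjoint) pieces A_1, ..., A_Theta, else -1."""
    return 1 if any(in_piece(x, A) for A in pieces) else -1

# One hypothetical piece in [0,1]^2 cut out by a smooth curve and a vertical line
# (Theta = 1 here; in general the pieces A_t are required to be disjoint).
A1 = [(lambda u: 0.3 + 0.1 * np.sin(6 * u[0]), 1),  # x_2 >= 0.3 + 0.1*sin(6*x_1)
      (lambda u: 0.2, 0)]                            # x_1 >= 0.2
print(classifier(np.array([0.5, 0.6]), [A1]))  # +1
print(classifier(np.array([0.1, 0.1]), [A1]))  # -1
```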

We then describe the margin condition mentioned above. Let P be a probability measure
on [0, 1]d × {−1, 1}, which we regard as the joint distribution of the input and output data,
and η(·) = P ({1} |·) is the conditional probability function of P . The corresponding Bayes
classifier is the sign of 2η − 1 which minimizes the misclassification error over all measurable
functions, i.e.,
$$\mathcal{R}_P(\operatorname{sgn}(2\eta-1)) = \mathcal{R}_P(2\eta-1) = \inf\left\{\mathcal{R}_P(f) \,\middle|\, f:[0,1]^d\to\mathbb{R} \text{ is measurable}\right\}. \tag{2.49}$$

We say the distribution P has a piecewise smooth decision boundary if

$$\exists\, C \in \mathcal{C}^{d,\beta,r,I,\Theta} \ \text{ s.t. } \ \operatorname{sgn}(2\eta-1) \overset{P_X\text{-a.s.}}{=} C,$$


that is,
$$P_X\left(\left\{x\in[0,1]^d \,\middle|\, \operatorname{sgn}(2\cdot P(\{1\}|x)-1) = C(x)\right\}\right) = 1 \tag{2.50}$$
for some $C\in\mathcal{C}^{d,\beta,r,I,\Theta}$. Suppose $C\in\mathcal{C}^{d,\beta,r,I,\Theta}$ and (2.50) holds. We call $D^*_C$ the decision
boundary of $P$, and for $c_2\in(0,\infty)$, $t_2\in(0,\infty)$, $s_2\in[0,\infty]$, we use the following condition
$$P_X\left(\left\{x\in[0,1]^d \,\middle|\, \Delta_C(x)\le t\right\}\right) \le c_2 t^{s_2}, \quad \forall\, 0<t\le t_2, \tag{2.51}$$

which we call the margin condition, to measure the concentration of the input distribution
PX near the decision boundary DC∗ of P . In particular, when the input data are bounded
away from the decision boundary DC∗ of P (PX -a.s.), (2.51) will hold for s2 = ∞.
Now we are ready to give our next main theorem.

Theorem 2.5 Let d ∈ N ∩ [2, ∞), (n, I, Θ) ∈ N3 , (β, r, t1 , t2 , c1 , c2 ) ∈ (0, ∞)6 , (s1 , s2 ) ∈
[0, ∞]2 , {(Xi , Yi )}ni=1 be a sample in [0, 1]d × {−1, 1} and fˆnFNN be an ERM with respect
to the logistic loss φ(t) = log 1 + e−t over FdFNN (G, N, S, B, F ) which is given by (2.14).
Define
$$\mathcal{H}^{d,\beta,r,I,\Theta,s_1,s_2}_{6,t_1,c_1,t_2,c_2} := \left\{ P \in \mathcal{H}^d_0 \,\middle|\, \text{(2.24), (2.50) and (2.51) hold for some } C\in\mathcal{C}^{d,\beta,r,I,\Theta} \right\}. \tag{2.52}$$
Then the following statements hold true:

(1) For s1 ∈ [0, ∞] and s2 = ∞, the φ-ERM fˆnFNN with


$$G = G_0\log\frac{1}{t_2\wedge\frac12},\quad N = N_0\left(\frac{1}{t_2\wedge\frac12}\right)^{\frac{d-1}{\beta}},\quad S = S_0\left(\frac{1}{t_2\wedge\frac12}\right)^{\frac{d-1}{\beta}}\log\frac{1}{t_2\wedge\frac12},\quad B = B_0\cdot\frac{1}{t_2\wedge\frac12},\quad\text{and}\quad F \asymp \left(\frac{\log n}{n}\right)^{\frac{1}{s_1+2}}$$
satisfies
$$\sup_{P\in\mathcal{H}^{d,\beta,r,I,\Theta,s_1,s_2}_{6,t_1,c_1,t_2,c_2}}\mathbb{E}_{P^{\otimes n}}\left[\mathcal{E}_P\left(\hat{f}_n^{\mathrm{FNN}}\right)\right] \lesssim \left(\frac{\log n}{n}\right)^{\frac{s_1}{s_1+2}}, \tag{2.53}$$

where G0 , N0 , S0 , B0 are positive constants only depending on d, β, r, I, Θ;

(2) For s1 = ∞ and s2 ∈ [0, ∞), the φ-ERM fˆnFNN with


$$G \asymp \log n,\quad N \asymp \left(\frac{n}{(\log n)^3}\right)^{\frac{d-1}{s_2\beta+d-1}},\quad S \asymp \left(\frac{n}{(\log n)^3}\right)^{\frac{d-1}{s_2\beta+d-1}}\log n,\quad B \asymp \left(\frac{n}{(\log n)^3}\right)^{\frac{1}{s_2+\frac{d-1}{\beta}}},\quad\text{and}\quad F = t_1\wedge\frac12$$
satisfies
$$\sup_{P\in\mathcal{H}^{d,\beta,r,I,\Theta,s_1,s_2}_{6,t_1,c_1,t_2,c_2}}\mathbb{E}_{P^{\otimes n}}\left[\mathcal{E}_P\left(\hat{f}_n^{\mathrm{FNN}}\right)\right] \lesssim \left(\frac{(\log n)^3}{n}\right)^{\frac{1}{1+\frac{d-1}{\beta s_2}}}; \tag{2.54}$$


(3) For s1 ∈ [0, ∞) and s2 ∈ [0, ∞), the φ-ERM fˆnFNN with
$$G \asymp \log n,\quad N \asymp \left(\frac{n}{(\log n)^3}\right)^{\frac{(d-1)(s_1+1)}{s_2\beta+(s_1+1)(s_2\beta+d-1)}},\quad S \asymp \left(\frac{n}{(\log n)^3}\right)^{\frac{(d-1)(s_1+1)}{s_2\beta+(s_1+1)(s_2\beta+d-1)}}\log n,$$
$$B \asymp \left(\frac{n}{(\log n)^3}\right)^{\frac{s_1+1}{s_2+(s_1+1)\left(s_2+\frac{d-1}{\beta}\right)}},\quad\text{and}\quad F \asymp \left(\frac{(\log n)^3}{n}\right)^{\frac{s_2}{s_2+(s_1+1)\left(s_2+\frac{d-1}{\beta}\right)}}$$
satisfies
$$\sup_{P\in\mathcal{H}^{d,\beta,r,I,\Theta,s_1,s_2}_{6,t_1,c_1,t_2,c_2}}\mathbb{E}_{P^{\otimes n}}\left[\mathcal{E}_P\left(\hat{f}_n^{\mathrm{FNN}}\right)\right] \lesssim \left(\frac{(\log n)^3}{n}\right)^{\frac{s_1}{1+(s_1+1)\left(1+\frac{d-1}{\beta s_2}\right)}}. \tag{2.55}$$

It is worth noting that the rate $\mathcal{O}\!\left(\left(\frac{\log n}{n}\right)^{\frac{s_1}{s_1+2}}\right)$ established in (2.53) does not depend on
the dimension d, and dependency of the rates in (2.54) and (2.55) on the dimension d
diminishes as s2 increases, which demonstrates that the condition (2.51) with s2 = ∞ helps
circumvent the curse of dimensionality. In particular, (2.53) will give a fast dimension-free
rate of convergence $\mathcal{O}\!\left(\frac{\log n}{n}\right)$ if $s_1 = s_2 = \infty$. One may refer to Section 3 for more discussions
about the result of Theorem 2.5.
The proof of Theorem 2.5 is in Appendix C.5. Our proof relies on Theorem 2.1 and
the fact that the ReLU networks are good at approximating indicator functions of bounded
regions with piecewise smooth boundary (Imaizumi and Fukumizu, 2019; Petersen and
Voigtlaender, 2018). Let $P$ be an arbitrary probability in $\mathcal{H}^{d,\beta,r,I,\Theta,s_1,s_2}_{6,t_1,c_1,t_2,c_2}$ and denote by $\eta$
the conditional probability function $P(\{1\}|\cdot)$ of $P$. To apply Theorem 2.1 and make good use
of the noise condition (2.24) and the margin condition (2.51), we define another ψ (which
is different from that in (2.9)) as

$$\psi: [0,1]^d\times\{-1,1\}\to\mathbb{R},\quad (x,y)\mapsto \begin{cases} \phi\big(yF_0\operatorname{sgn}(2\eta(x)-1)\big), & \text{if } |2\eta(x)-1| > \eta_0,\\[2pt] \phi\Big(y\log\frac{\eta(x)}{1-\eta(x)}\Big), & \text{if } |2\eta(x)-1| \le \eta_0 \end{cases}$$
for some suitable $\eta_0\in(0,1)$ and $F_0\in\big(0,\log\frac{1+\eta_0}{1-\eta_0}\big]$. For such $\psi$, Lemma C.17 guarantees
that inequality (2.3) holds as
$$\int_{[0,1]^d\times\{-1,1\}}\psi(x,y)\,dP(x,y) \le \inf\left\{\mathcal{R}^{\phi}_P(f) \,\middle|\, f:[0,1]^d\to\mathbb{R} \text{ is measurable}\right\},$$
and (2.4), (2.5) of Theorem 2.1 are satisfied with $M = \frac{2}{1-\eta_0}$ and $\Gamma = \frac{8}{1-\eta_0^2}$. Moreover, we
0
use the noise condition (2.24) and the margin condition (2.51) to bound the approximation
error
$$\left(\inf_{f\in\mathcal{F}^{\mathrm{FNN}}_d(G,N,S,B,F)}\mathcal{R}^{\phi}_P(f) - \int_{[0,1]^d\times\{-1,1\}}\psi(x,y)\,dP(x,y)\right) \tag{2.56}$$

(see (C.111), (C.112), (C.113)). Then, as in the proof of Theorem 2.2, we combine Theorem
2.1 with estimates for the covering number of FdFNN (G, N, S, B, F ) and the approximation


error (2.56) to obtain an upper bound for $\mathbb{E}_{P^{\otimes n}}\big[\mathcal{R}^{\phi}_P\big(\hat{f}_n^{\mathrm{FNN}}\big) - \int\psi\,dP\big]$, which, together
with the noise condition (2.24), yields an upper bound for $\mathbb{E}_{P^{\otimes n}}\big[\mathcal{E}_P(\hat{f}_n^{\mathrm{FNN}})\big]$ (see (C.109)).
Finally, taking the supremum over all $P\in\mathcal{H}^{d,\beta,r,I,\Theta,s_1,s_2}_{6,t_1,c_1,t_2,c_2}$
gives the desired result. The proof
of Theorem 2.5 along with that of Theorem 2.2 and Theorem 2.3 indicates that Theorem
2.1 is very flexible in the sense that it can be used in various settings with different choices
of ψ.

2.2 Main Lower Bounds


In this subsection, we will give our main results on lower bounds for convergence rates of
the logistic risk, which will justify the optimality of our upper bounds established in the
last subsection. To state these results, we need some notations.
Recall that for any a ∈ [0, 1], Ma denotes the probability measure on {−1, 1} with
Ma ({1}) = a and Ma ({−1}) = 1 − a. For any measurable η : [0, 1]d → [0, 1] and any Borel
probability measure Q on [0, 1]d , we denote
$$P_{\eta,Q}: \left\{\text{Borel subsets of } [0,1]^d\times\{-1,1\}\right\} \to [0,1],\quad S \mapsto \int_{[0,1]^d}\int_{\{-1,1\}} 1_S(x,y)\,dM_{\eta(x)}(y)\,dQ(x). \tag{2.57}$$

Therefore, Pη,Q is the (unique) probability measure on [0, 1]d ×{−1, 1} of which the marginal
distribution on [0, 1]d is Q and the conditional probability function is η. If Q is the Lebesgue
measure on [0, 1]d , we will write Pη for Pη,Q .
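Sampling from $P_{\eta,Q}$ as defined in (2.57) is straightforward; the following sketch (with a hypothetical Hölder smooth $\eta$ and $Q$ taken to be the Lebesgue measure on $[0,1]^d$, i.e. the case written as $P_\eta$) draws an i.i.d. sample:

```python
# Sketch: drawing an i.i.d. sample from P_{eta,Q} as in (2.57), with Q uniform on [0,1]^d.
import numpy as np

def sample_P_eta(eta, d, n, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, size=(n, d))              # X ~ Q
    Y = np.where(rng.uniform(size=n) < eta(X), 1, -1)   # Y | X = x  ~  M_{eta(x)}
    return X, Y

# Hypothetical Hölder smooth conditional probability function eta.
eta = lambda X: 0.5 + 0.4 * np.sin(np.pi * X[:, 0]) * np.cos(np.pi * X[:, 1])
X, Y = sample_P_eta(eta, d=2, n=5)
print(X.shape, Y)
```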
For any $\beta\in(0,\infty)$, $r\in(0,\infty)$, $A\in[0,1)$, $q\in\mathbb{N}\cup\{0\}$, and $(d,d_*,K)\in\mathbb{N}^3$ with
$d_* \le \min\left\{d,\ K+1_{\{0\}}(q)\cdot(d-K)\right\}$, define
$$\mathcal{H}^{d,\beta,r}_{3,A} := \left\{P_\eta \,\middle|\, \eta\in\mathcal{B}^{\beta}_r([0,1]^d),\ \operatorname{ran}(\eta)\subset[0,1],\ \text{and } \int_{[0,1]^d}1_{[0,A]}(|2\eta(x)-1|)\,dx = 0\right\},$$
$$\mathcal{H}^{d,\beta,r}_{5,A,q,K,d_*} := \left\{P_\eta \,\middle|\, \eta\in\mathcal{G}^{\mathrm{CH}}_d(q,K,d_*,\beta,r),\ \operatorname{ran}(\eta)\subset[0,1],\ \text{and } \int_{[0,1]^d}1_{[0,A]}(|2\eta(x)-1|)\,dx = 0\right\}. \tag{2.58}$$

Now we can state our Theorem 2.6. Recall that Fd is the set of all measurable real-valued
functions defined on [0, 1]d .

Theorem 2.6 Let φ be the logistic loss, n ∈ N, β ∈ (0, ∞), r ∈ (0, ∞), A ∈ [0, 1), q ∈ N ∪
{0}, and (d, d∗ , K) ∈ N3 with d∗ ≤ min d, K + 1{0} (q) · (d − K) . Suppose {(Xi , Yi )}ni=1
is a sample in [0, 1]d × {−1, 1} of size n. Then there exists a constant c0 ∈ (0, ∞) only
depending on (d∗ , β, r, q), such that
$$\inf_{\hat{f}_n}\ \sup_{P\in\mathcal{H}^{d,\beta,r}_{5,A,q,K,d_*}}\mathbb{E}_{P^{\otimes n}}\left[\mathcal{E}^{\phi}_P(\hat{f}_n)\right] \ge c_0\, n^{\frac{-\beta\cdot(1\wedge\beta)^q}{d_*+\beta\cdot(1\wedge\beta)^q}} \quad \text{provided that } n > \left(\frac{7}{1-A}\right)^{\frac{d_*+\beta\cdot(1\wedge\beta)^q}{\beta\cdot(1\wedge\beta)^q}},$$

where the infimum is taken over all Fd -valued statistics on ([0, 1]d × {−1, 1})n from the
sample {(Xi , Yi )}ni=1 .


Taking q = 0, K = 1, and d∗ = d in Theorem 2.6, we immediately obtain the following


corollary:

Corollary 2.1 Let φ be the logistic loss, d ∈ N, β ∈ (0, ∞), r ∈ (0, ∞), A ∈ [0, 1), and
n ∈ N. Suppose {(Xi , Yi )}ni=1 is a sample in [0, 1]d × {−1, 1} of size n. Then there exists a
constant c0 ∈ (0, ∞) only depending on (d, β, r), such that
$$\inf_{\hat{f}_n}\ \sup_{P\in\mathcal{H}^{d,\beta,r}_{3,A}}\mathbb{E}_{P^{\otimes n}}\left[\mathcal{E}^{\phi}_P(\hat{f}_n)\right] \ge c_0\, n^{-\frac{\beta}{d+\beta}} \quad \text{provided that } n > \left(\frac{7}{1-A}\right)^{\frac{d+\beta}{\beta}},$$

where the infimum is taken over all Fd -valued statistics on ([0, 1]d × {−1, 1})n from the
sample {(Xi , Yi )}ni=1 .

Theorem 2.6, together with Corollary 2.1, is proved in Appendix C.6.


d,β,r d,β,r
Obviously, H5,A,q,K,d∗
⊂ H4,q,K,d? ,d∗
. Therefore, it follows from Theorem 2.6 that
h i h i β·(1∧β)q

inf sup EP ⊗n EPφ (fˆn ) ≥ inf sup EP ⊗n EPφ (fˆn ) & n d∗ +β·(1∧β)q .
fˆn P ∈Hd,β,r fˆn P ∈Hd,β,r
4,q,K,d? ,d∗ 5,A,q,K,d∗

This justifies that the rate $\mathcal{O}\!\left(\left(\frac{(\log n)^5}{n}\right)^{\frac{\beta\cdot(1\wedge\beta)^q}{d_*+\beta\cdot(1\wedge\beta)^q}}\right)$ in (2.36) is optimal (up to the logarithmic
factor $(\log n)^{\frac{5\beta\cdot(1\wedge\beta)^q}{d_*+\beta\cdot(1\wedge\beta)^q}}$). Similarly, it follows from $\mathcal{H}^{d,\beta,r}_{3,A}\subset\mathcal{H}^{d,\beta,r}_1$ and Corollary 2.1 that

$$\inf_{\hat{f}_n}\ \sup_{P\in\mathcal{H}^{d,\beta,r}_1}\mathbb{E}_{P^{\otimes n}}\left[\mathcal{E}^{\phi}_P(\hat{f}_n)\right] \ge \inf_{\hat{f}_n}\ \sup_{P\in\mathcal{H}^{d,\beta,r}_{3,A}}\mathbb{E}_{P^{\otimes n}}\left[\mathcal{E}^{\phi}_P(\hat{f}_n)\right] \gtrsim n^{-\frac{\beta}{d+\beta}},$$

which justifies that the rate $\mathcal{O}\!\left(\left(\frac{(\log n)^5}{n}\right)^{\frac{\beta}{\beta+d}}\right)$ in (2.17) is optimal (up to the logarithmic
factor $(\log n)^{\frac{5\beta}{\beta+d}}$). Moreover, note that any probability $P$ in $\mathcal{H}^{d,\beta,r}_{3,A}$ must satisfy the noise
condition (2.24) provided that $s_1\in[0,\infty]$, $t_1\in(0,A]$, and $c_1\in(0,\infty)$. In other words, for
any $s_1\in[0,\infty]$, $t_1\in(0,A]$, and $c_1\in(0,\infty)$, there holds $\mathcal{H}^{d,\beta,r}_{3,A}\subset\mathcal{H}^{d,\beta,r}_{2,s_1,c_1,t_1}$, meaning that
$$n^{-\frac{\beta}{d+\beta}} \lesssim \inf_{\hat{f}_n}\ \sup_{P\in\mathcal{H}^{d,\beta,r}_{3,A}}\mathbb{E}_{P^{\otimes n}}\left[\mathcal{E}^{\phi}_P(\hat{f}_n)\right] \le \inf_{\hat{f}_n}\ \sup_{P\in\mathcal{H}^{d,\beta,r}_{2,s_1,c_1,t_1}}\mathbb{E}_{P^{\otimes n}}\left[\mathcal{E}^{\phi}_P(\hat{f}_n)\right] \le \inf_{\hat{f}_n}\ \sup_{P\in\mathcal{H}^{d,\beta,r}_1}\mathbb{E}_{P^{\otimes n}}\left[\mathcal{E}^{\phi}_P(\hat{f}_n)\right] \le \sup_{P\in\mathcal{H}^{d,\beta,r}_1}\mathbb{E}_{P^{\otimes n}}\left[\mathcal{E}^{\phi}_P\left(\hat{f}_n^{\mathrm{FNN}}\right)\right] \lesssim \left(\frac{(\log n)^5}{n}\right)^{\frac{\beta}{\beta+d}},$$

where $\hat{f}_n^{\mathrm{FNN}}$ is the estimator defined in Theorem 2.2. From the above inequalities we see that
the noise condition (2.24) does little to help improve the convergence rate of the excess
φ-risk in classification.
The proof of Theorem 2.6 and Corollary 2.1 is based on a general scheme for obtaining
lower bounds, which is given in Section 2 of Tsybakov (2009). However, the scheme in
Tsybakov (2009) is stated for a class of probabilities H that takes the form H = {Qθ |θ ∈ Θ}


with Θ being some pseudometric space. In our setting, we do not have such pseudometric
space. Instead, we introduce another quantity

$$\inf_{f\in\mathcal{F}_d}\left(\mathcal{E}^{\phi}_P(f) + \mathcal{E}^{\phi}_Q(f)\right) \tag{2.59}$$

to characterize the difference between any two probability measures P and Q (see (C.126)).
Estimating lower bounds for the quantity defined in (2.59) plays a key role in our proof of
Theorem 2.6 and Corollary 2.1.

3. Discussions on Related Work


In this section, we compare our results with some existing ones in the literature. We first
compare Theorem 2.2 and Theorem 2.5 with related results about binary classification
using fully connected DNNs and logistic loss in Kim et al. (2021) and Farrell et al. (2021)
respectively. Then we compare our work with Ji et al. (2021), in which the authors carry
out generalization analysis for estimators obtained from gradient descent algorithms.
Throughout this section, we will use φ to denote the logistic loss (i.e., φ(t) = log(1+e−t ))
and {(Xi , Yi )}ni=1 to denote an i.i.d. sample in [0, 1]d × {−1, 1}. The symbols d, β, r, I,
Θ, t1 , c1 , t2 , c2 and c will denote arbitrary numbers in N, (0, ∞), (0, ∞), N, N, (0, ∞),
(0, ∞), (0, ∞), (0, ∞) and [0, ∞), respectively. The symbol P will always denote some
probability measure on [0, 1]d ×{−1, 1}, regarded as the data distribution, and η will denote
the corresponding conditional probability function P ({1} |·) of P .
Recall that C d,β,r,I,Θ , defined in (2.46), is the space consisting of classifiers which are
equal to +1 on the union of some disjoint regions with piecewise Hölder smooth boundary
and −1 otherwise. In Theorem 4.1 of Kim et al. (2021), the authors conduct generalization
analysis when the data distribution P satisfies the piecewise smooth decision boundary
condition (2.50), the noise condition (2.24), and the margin condition (2.51) with s1 =
s2 = ∞ for some C ∈ C d,β,r,I,Θ . They show that there exist constants G0 , N0 , S0 , B0 , F0 not
depending on the sample size n such that the φ-ERM
$$\hat{f}_n^{\mathrm{FNN}} \in \operatorname*{arg\,min}_{f\in\mathcal{F}^{\mathrm{FNN}}_d(G_0,N_0,S_0,B_0,F_0)} \frac1n\sum_{i=1}^{n}\phi\big(Y_i f(X_i)\big)$$

satisfies
$$\sup_{P\in\mathcal{H}^{d,\beta,r,I,\Theta,\infty,\infty}_{6,t_1,c_1,t_2,c_2}}\mathbb{E}_{P^{\otimes n}}\left[\mathcal{E}_P\left(\hat{f}_n^{\mathrm{FNN}}\right)\right] \lesssim \frac{(\log n)^{1+\epsilon}}{n} \tag{3.1}$$

for any $\epsilon > 0$. Indeed, the noise condition (2.24) and the margin condition (2.51) with
$s_1 = s_2 = \infty$ are equivalent to the following two conditions: there exist $\eta_0\in(0,1)$ and
$\Delta > 0$ such that
$$P_X\left(\left\{x\in[0,1]^d \,\middle|\, |2\eta(x)-1| \le \eta_0\right\}\right) = 0$$
and
$$P_X\left(\left\{x\in[0,1]^d \,\middle|\, \Delta_C(x) \le \Delta\right\}\right) = 0$$

(cf. conditions (N0 ) and (M0 ) in Kim et al. (2021)). Under the two conditions above,
combining with the assumption $\operatorname{sgn}(2\eta-1) \overset{P_X\text{-a.s.}}{=} C \in \mathcal{C}^{d,\beta,r,I,\Theta}$, Lemma A.7 of Kim et al.


(2021) asserts that there exists f0∗ ∈ FdFNN (G0 , N0 , S0 , B0 , F0 ) such that

$$f_0^* \in \operatorname*{arg\,min}_{f\in\mathcal{F}^{\mathrm{FNN}}_d(G_0,N_0,S_0,B_0,F_0)} \mathcal{R}^{\phi}_P(f)$$
and
$$\mathcal{R}_P(f_0^*) = \mathcal{R}_P(2\eta-1) = \inf\left\{\mathcal{R}_P(f) \,\middle|\, f:[0,1]^d\to\mathbb{R} \text{ is measurable}\right\}.$$

The excess misclassification error of f : [0, 1]d → R is then given by EP (f ) = RP (f ) −


RP (f0∗ ). Since f0∗ is bounded by F0 , the authors in Kim et al. (2021) can apply classical
concentration techniques developed for bounded random variables (cf. Appendix A.2 of
Kim et al. (2021)) to deal with f0∗ (instead of the target function fφ,P∗ ), leading to the

generalization bound (3.1). In this paper, employing Theorem 2.1, we extend Theorem
4.1 of Kim et al. (2021) to much less restrictive cases in which the noise exponent s1 and
the margin exponent s2 are allowed to be taken from [0, ∞]. The derived generalization
bounds are presented in Theorem 2.5. In particular, when s1 = s2 = ∞ (i.e., let s1 = ∞
in statement (1) of Theorem 2.5), we obtain a refined generalization bound under the same
conditions as those of Theorem 4.1 in Kim et al. (2021), which asserts that the φ-ERM
fˆnFNN over FdFNN (G0 , N0 , S0 , B0 , F0 ) satisfies
$$\sup_{P\in\mathcal{H}^{d,\beta,r,I,\Theta,\infty,\infty}_{6,t_1,c_1,t_2,c_2}}\mathbb{E}_{P^{\otimes n}}\left[\mathcal{E}_P\left(\hat{f}_n^{\mathrm{FNN}}\right)\right] \lesssim \frac{\log n}{n}, \tag{3.2}$$

removing the $\epsilon$ in their bound (3.1). The above discussion indicates that Theorem 2.1
can lead to sharper estimates in comparison with classical concentration techniques, and
can be applied in very general settings. However, we would like to point out that if
$s_1 < \infty$ and $s_2 < \infty$, then the convergence rate obtained in Theorem 2.5 (that is, the
rate $\mathcal{O}\!\left(\left(\frac{(\log n)^3}{n}\right)^{\frac{s_1}{1+(s_1+1)\left(1+\frac{d-1}{\beta s_2}\right)}}\right)$ in (2.55)) is suboptimal. Indeed, Theorem 3.1 and The-
orem 3.4 of Kim et al. (2021) show that the DNN classifier $\hat{f}_n^{\mathrm{FNN}}$ trained with empirical hinge
risk minimization can achieve a convergence rate
$$\sup_{P\in\mathcal{H}^{d,\beta,r,I,\Theta,s_1,s_2}_{6,t_1,c_1,t_2,c_2}}\mathbb{E}_{P^{\otimes n}}\left[\mathcal{E}_P\left(\hat{f}_n^{\mathrm{FNN}}\right)\right] \lesssim \left(\frac{(\log n)^3}{n}\right)^{\frac{s_1+1}{1+(s_1+1)\left(1+\frac{d-1}{\beta\cdot(1\vee s_2)}\right)}}, \tag{3.3}$$

which is strictly faster than the rate $\mathcal{O}\!\left(\left(\frac{(\log n)^3}{n}\right)^{\frac{s_1}{1+(s_1+1)\left(1+\frac{d-1}{\beta s_2}\right)}}\right)$ in (2.55). Moreover, as

mentioned below Theorem 3.1 in Kim et al. (2021), even the rate in (3.3) is suboptimal
in general. In Hu et al. (2022a), the authors propose a new DNN classifier which are
constructed in a divide-and-conquer manner: DNN classifiers are trained with empirical 0-1
risk minimization on each local region and then “aggregated to a global one”. Hu et al.
(2022a) provides minimax optimal convergence rates for this new DNN classifier under the
assumption that the data distribution $P\in\mathcal{H}^{d,\beta,r,1,1,0,0}_{6,t_1,1,t_2,1}$ (that is, the decision boundary of
(that is, the decision boundary of
P is assumed to be Hölder-β smooth (rather than just piecewise smooth), but the noise
condition (2.24) and the margin condition (2.51) are not required) along with a “localized


version” of the noise condition (2.24) (see assumptions (M1) and (M2) in Hu et al. (2022a)).
It is interesting to further study whether we can apply Theorem 2.1 to establish optimal
convergence rates for the new DNN classifiers proposed in Hu et al. (2022a) which are locally
trained with some surrogate loss (as we have already pointed out, Theorem 2.1 remains true
for any locally Lipschitz continuous loss function φ, see the discussion on page 14) such as
logistic loss instead of 0-1 loss.
The recent work Farrell et al. (2021) considers estimation and inference using fully
connected DNNs and the logistic loss in which their setting can cover both regression and
classification. For any probability measure P on [0, 1]d × {−1, 1} and any measurable
function $f:[0,1]^d\to[-\infty,\infty]$, define $\|f\|_{L^2_{P_X}} := \Big(\int_{[0,1]^d}|f(x)|^2\,dP_X(x)\Big)^{\frac12}$. Recall that
$\mathcal{B}^{\beta}_r(\Omega)$ is defined in (2.13). Let $\mathcal{H}^{d,\beta}_7$ be the set of all probability measures $P$ on $[0,1]^d\times\{-1,1\}$ such that the target function $f^*_{\phi,P}$ belongs to $\mathcal{B}^{\beta}_1\big([0,1]^d\big)$. In Corollary 1 of Farrell
et al. (2021), the authors claimed that if P ∈ H7d,β and β ∈ N, then with probability at
least 1 − e−υ there holds


$$\left\|\hat{f}_n^{\mathrm{FNN}} - f^*_{\phi,P}\right\|^2_{L^2_{P_X}} \lesssim n^{-\frac{2\beta}{2\beta+d}}\log^4 n + \frac{\log\log n + \upsilon}{n}, \tag{3.4}$$

where the estimator fˆnFNN ∈ FdFNN (G, N, S, ∞, F ) is defined by (2.14) with


$$G \asymp \log n,\quad N \asymp n^{\frac{d}{d+2\beta}},\quad S \asymp n^{\frac{d}{d+2\beta}}\log n,\quad\text{and}\quad F = 2. \tag{3.5}$$

Note that $f^*_{\phi,P}\in\mathcal{B}^{\beta}_1\big([0,1]^d\big)$ implies $\|f^*_{\phi,P}\|_\infty \le 1$. From Lemma 8 of Farrell et al. (2021),
bounding the quantity $\big\|\hat{f}_n^{\mathrm{FNN}} - f^*_{\phi,P}\big\|^2_{L^2_{P_X}}$ on the left hand side of (3.4) is equivalent to
bounding $\mathcal{E}^{\phi}_P(\hat{f}_n^{\mathrm{FNN}})$, since
$$\frac{1}{2(e+e^{-1}+2)}\left\|\hat{f}_n^{\mathrm{FNN}} - f^*_{\phi,P}\right\|^2_{L^2_{P_X}} \le \mathcal{E}^{\phi}_P(\hat{f}_n^{\mathrm{FNN}}) \le \frac14\left\|\hat{f}_n^{\mathrm{FNN}} - f^*_{\phi,P}\right\|^2_{L^2_{P_X}}. \tag{3.6}$$

Hence (3.4) actually establishes the same upper bound (up to a constant independent of n
and P ) for the excess φ-risk of fˆnFNN , leading to upper bounds for the excess misclassification
error EP (fˆnFNN ) through the calibration inequality. The authors in Farrell et al. (2021) apply
concentration techniques based on (empirical) Rademacher complexity (cf. Section A.2 of
Farrell et al. (2021) or Bartlett et al. (2005); Koltchinskii (2006)) to derive the bound (3.4),
which allows for removing the restriction of uniformly boundedness on the weights and biases
in the neural network models, i.e., the hypothesis space generated by neural networks in their
analysis can be of the form FdFNN (G, N, S, ∞, F ). In our paper, we employ the covering
number to measure the complexity of hypothesis space. Due to the lack of compactness, the
covering numbers of FdFNN (G, N, S, ∞, F ) are in general equal to infinity. Consequently,
in our convergence analysis, we require the neural networks to possess bounded weights
and biases. The assumption of bounded parameters may lead to additional optimization
constraints in the training process. However, it has been found that the weights and biases
of a trained neural network are typically around their initial values (cf. Goodfellow et al.
(2016)). Thus the boundedness assumption matches what is observed in practice and has


been adopted by most of the literature (see, e.g., Kim et al. (2021); Schmidt-Hieber (2020)).
In particular, the work Schmidt-Hieber (2020) considers nonparametric regression using
neural networks with all parameters bounded by one (i.e., B = 1). This assumption can
be realized by projecting the parameters of the neural network onto [−1, 1] after each
updating. Though the framework developed in this paper would not deliver generalization
bounds without restriction of uniformly bounded parameters, we weaken this constraint in
Theorem 2.2 by allowing the upper bound B to grow polynomially with the sample size
n, which simply requires 1 ≤ B . nν for any ν > 0. It is worth mentioning that in our
coming work Zhang et al. (2024), we actually establish oracle-type inequalities analogous
to Theorem 2.1, with the covering number N (F, γ) replaced by the supremum of some
empirical L1 -covering numbers. These enable us to derive generalization bounds for the
empirical φ-risk minimizer fˆnFNN over FdFNN (G, N, S, ∞, F ) because empirical L1 -covering
numbers of FdFNN (G, N, S, ∞, F ) can be well-controlled, as indicated by Lemma 4 and
Lemma 6 of Farrell et al. (2021) (see also Theorem 9.4 of Györfi et al. (2002) and Theorem
7 of Bartlett et al. (2019)). In addition, note that (3.4) can lead to probability bounds
(i.e., confidence bounds) for the excess φ-risk and misclassification error of fˆnFNN , while
the generalization bounds presented in this paper are only in expectation. Nonetheless, in
Zhang et al. (2024), we obtain both probability bounds and expectation bounds for the
empirical φ-risk minimizer.

As discussed in Section 1, the boundedness assumptions on the target function $f^*_{\phi,P}$
and its derivatives, i.e., $f^*_{\phi,P}\in\mathcal{B}^{\beta}_1\big([0,1]^d\big)$, are too restrictive. This assumption actually
requires that there exists some δ ∈ (0, 1/2) such that the conditional class probability
η(x) = P ({1}|x) satisfies δ < η(x) < 1 − δ for PX -almost all x ∈ [0, 1]d , which rules out the
case when η takes values in 0 or 1 with positive probabilities. However, it is believed that the
conditional class probability should be determined by the patterns that make the two classes
mutually exclusive, implying that η(x) should be closed to either 0 or 1. This is also observed
in many benchmark datasets for image recognition. For example, it is reported in Kim et al.
(2021), the conditional class probabilities of CIFAR10 data set estimated by neural networks
with the logistic loss almost solely concentrate on 0 or 1 and very few are around 0.5 (see
Fig.2 in Kim et al. (2021)). Overall, the boundedness restriction on fφ,P ∗ is not expected to
hold in binary classification as it would exclude the well classified data. We further point out
that the techniques used in Farrell et al. (2021) cannot deal with the case when fφ,P ∗ is un-
bounded, or equivalently, when η can take values close to 0 or 1. Indeed, the authors apply
approximation theory of neural networks developed in Yarotsky (2017) to construct uniform
approximations of $f^*_{\phi,P}$, which requires $f^*_{\phi,P}\in\mathcal{B}^{\beta}_1\big([0,1]^d\big)$ with $\beta\in\mathbb{N}$. However, if $f^*_{\phi,P}$ is
unbounded, uniformly approximating $f^*_{\phi,P}$ by neural networks on $[0,1]^d$ is impossible, which
brings the essential difficulty in estimating the approximation
error. Besides, the authors
use Bernstein's inequality to bound the quantity $\frac1n\sum_{i=1}^{n}\big(\phi(Y_i f_1^*(X_i)) - \phi(Y_i f^*_{\phi,P}(X_i))\big)$ appearing
in the error decomposition for $\big\|\hat{f}_n^{\mathrm{FNN}} - f^*_{\phi,P}\big\|^2_{L^2_{P_X}}$ (see (A.1) in Farrell et al. (2021)),
where $f_1^*\in\operatorname*{arg\,min}_{f\in\mathcal{F}^{\mathrm{FNN}}_d(G,N,S,\infty,2)}\|f - f^*_{\phi,P}\|_{[0,1]^d}$. We can see that the unboundedness of
$f^*_{\phi,P}$ will lead to the unboundedness of the random variable $\phi(Y f_1^*(X)) - \phi(Y f^*_{\phi,P}(X))$,

which makes Bernstein’s inequality invalid to bound its empirical mean by the expectation.



In addition, the boundedness assumption on $f^*_{\phi,P}$ ensures the inequality (3.6) on which the
entire framework of convergence estimates in Farrell et al. (2021) is built (cf. Appendix
A.1 and A.2 of Farrell et al. (2021)). Without this assumption, most of the theoretical
arguments in Farrell et al. (2021) are not feasible. In contrast, we require $\eta \overset{P_X\text{-a.s.}}{=} \hat\eta$ for
some $\hat\eta\in\mathcal{B}^{\beta}_r\big([0,1]^d\big)$ and $r\in(0,\infty)$ in Theorem 2.2. This Hölder smoothness condition
on $\eta$ is widely adopted in the study of binary classifiers (see Audibert and Tsybakov (2007)
and references therein). Note that $f^*_{\phi,P}\in\mathcal{B}^{\beta}_1\big([0,1]^d\big)$ indeed implies $\eta \overset{P_X\text{-a.s.}}{=} \hat\eta$ for some
$\hat\eta\in\mathcal{B}^{\beta}_r\big([0,1]^d\big)$ and $r\in(0,\infty)$ which only depends on $(d,\beta)$. Therefore, the setting considered
in Theorem 2.2 is more general than that of Farrell et al. (2021). Moreover, the
condition $\eta \overset{P_X\text{-a.s.}}{=} \hat\eta\in\mathcal{B}^{\beta}_r\big([0,1]^d\big)$ is more natural, allowing $\eta$ to take values close to 0 and
1 with positive probabilities. We finally point out that, under the same assumption (i.e.,
1 with positive probabilities. We finally point out that, under the same assumption (i.e.,
P ∈ H7d,β ), one can use Theorem 2.1 to establish a convergence rate which is slightly im-
proved compared with (3.4). Actually, we can show that there exists a constant c ∈ (0, ∞)
only depending on (d, β), such that for any µ ∈ [1, ∞), and ν ∈ [0, ∞), there holds
" # ! 2β


2 (log n)3 2β+d

sup EP ⊗n fˆnFNN − fφ,P . , (3.7)


P ∈H7d,β
L2P
X
n

where the estimator fˆnFNN ∈ FdFNN (G, N, S, B, F ) is defined by (2.14) with


$$c\log n \le G \lesssim \log n,\quad N \asymp \left(\frac{n}{\log^3 n}\right)^{\frac{d}{d+2\beta}},\quad S \asymp \left(\frac{n}{\log^3 n}\right)^{\frac{d}{d+2\beta}}\cdot\log n,\quad 1\le B\lesssim n^{\nu},\quad\text{and}\quad 1\le F\le\mu. \tag{3.8}$$

Though we restrict the weights and biases to be bounded by B, both the convergence rate
and the network complexities in the result above refine the previous estimates established

in (3.4) and (3.5). In particular, since $\frac{6\beta}{2\beta+d} < 3 < 4$, the convergence rate in (3.7) is
indeed faster than that in (3.4) due to a smaller power exponent of the term log n. The
proof of this claim is in Appendix C.7. We also remark that the convergence rate in (3.7)
achieves the minimax optimal rate established in Stone (1982) up to log factors (so does the
rate in (3.4)), which confirms that generalization analysis developed in this paper is also
rate-optimal for bounded fφ,P ∗ .

In our work, we have established generalization bounds for ERMs over hypothesis spaces
consisting of neural networks. However, such ERMs cannot be obtained in practice because
the corresponding optimization problems (e.g., (2.2)) cannot be solved explicitly. Instead,
practical neural network estimators are obtained from algorithms which numerically solve
the empirical risk minimization problem. Therefore, it is better to conduct generalization
analysis for estimators obtained from such algorithms. One typical work in this direction
is Ji et al. (2021).
In Ji et al. (2021), for classification tasks, the authors establish excess φ-risk bounds
to show that classifiers obtained from solving empirical risk minimization with respect to
the logistic loss over shallow neural networks using gradient descent with (or without) early
stopping are consistent. Note that the setting of Ji et al. (2021) is quite different from ours:
We consider deep neural network models in our work, while Ji et al. (2021) considers shallow


ones. Besides, we use the smoothness of the conditional probability function η(·) = P ({1} |·)
to characterize the regularity (or complexity) of the data distribution P . Instead, in Ji et al.
(2021), for each U ∞ : Rd → Rd , the authors construct a function

$$f(\,\cdot\,; U^{\infty}): \mathbb{R}^d\to\mathbb{R},\quad x \mapsto \int_{\mathbb{R}^d} x^{\top}U^{\infty}(v)\cdot 1_{[0,\infty)}(v^{\top}x)\cdot\frac{1}{(2\pi)^{n/2}}\exp\!\left(-\frac{\|v\|_2^2}{2}\right)dv$$

called infinite-width random feature model. Then they use the norm of U ∞ which makes
EPφ (f ( · ; U ∞ )) small to characterize the regularity of data: the data distribution is regarded
as simple if there is a U ∞ with EPφ (f ( · ; U ∞ )) ≈ 0 and moreover has a low norm. More
rigorously, the slower the quantity
$$\inf\left\{\left\|U^{\infty}\right\|_{\mathbb{R}^d} \,\middle|\, \mathcal{E}^{\phi}_P\big(f(\,\cdot\,;U^{\infty})\big) \le \varepsilon\right\} \tag{3.9}$$

grows as ε → 0, the more regular (simpler) the data distribution P is. In Ji et al. (2021),
the established excess $\phi$-risk bounds depend on the quantity $\mathcal{E}^{\phi}_P\big(f(\,\cdot\,;U^{\infty})\big)$ and the norm
$\left\|U^{\infty}\right\|_{\mathbb{R}^d}$. Hence by assuming certain growth rates of the quantity in (3.9) as $\varepsilon\to0$, we can
obtain specific rates of convergence from the excess φ-risk bounds in Ji et al. (2021). It is
natural to ask whether there is any relation between these two characterizations of data regularity,
that is, the smoothness of the conditional probability function, and the rate of growth of the
quantity in (3.9) as ε → 0. For example, will Hölder smoothness of the conditional proba-
bility function imply certain growth rates of the quantity in (3.9) as ε → 0? This question
is worth considering because once we prove the equivalence of these two characterizations,
then the generalization analysis in Ji et al. (2021) will be able to be used in other settings
requiring smoothness of the conditional probability function and vice versa. In addition, it
is also interesting to study how we can use the new techniques developed in this paper to
establish generalization bounds for deep neural network estimators obtained from learning
algorithms (e.g., gradient descent) within the settings in this paper.
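As a side remark, the infinite-width random feature model $f(\,\cdot\,;U^{\infty})$ discussed above admits a simple Monte Carlo approximation; the sketch below is our own illustration (with a hypothetical $U^{\infty}$ applied row-wise, not the construction of Ji et al. (2021)) and replaces the Gaussian integral by an average over $m$ random features:

```python
# Our own illustration: Monte Carlo approximation of the infinite-width random feature
# model, replacing the integral over v ~ N(0, I_d) by an average over m random features.
import numpy as np

def random_feature_model(U_infty, d, m=10_000, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    V = rng.standard_normal((m, d))                   # v_1, ..., v_m ~ N(0, I_d)
    UV = U_infty(V)                                   # shape (m, d): U_infty(v_i) row-wise
    def f(x):
        active = (V @ x >= 0.0).astype(float)         # indicator 1_{[0, infty)}(v_i^T x)
        return float(np.mean(active * (UV @ x)))      # average of x^T U_infty(v_i)
    return f

U_infty = lambda V: np.tanh(V)                        # hypothetical U_infty: R^d -> R^d
f = random_feature_model(U_infty, d=3)
print(f(np.array([0.2, -0.5, 0.1])))
```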

4. Conclusion
In this paper, we develop a novel generalization analysis for binary classification with
DNNs and logistic loss. The unboundedness of the target function in logistic classification
poses challenges for the estimates of sample error and approximation error when deriving
generalization bounds. To overcome these difficulties, we introduce a bivariate function
ψ : [0, 1]d × {−1, 1} → R to establish an elegant oracle-type inequality, aiming to bound
the excess risk with respect to the logistic loss. This inequality incorporates the estimation
of sample error and enables us to propose a framework for generalization analysis, which
avoids using the explicit form of the target function. By properly choosing ψ under this
framework, we can eliminate the boundedness restriction of the target function and estab-
lish sharp rates of convergence. In particular, for fully connected DNN classifiers trained by
minimizing the empirical logistic risk, we obtain an optimal (up to some logarithmic factor)
rate of convergence of the excess logistic risk (which further yields a rate of convergence
of the excess misclassification error via the calibration inequality) merely under the Hölder
smoothness assumption on the conditional probability function. If we instead assume that
the conditional probability function is the composition of several vector-valued multivariate


functions of which each component function is either a maximum value function of some of
its input variables or a Hölder smooth function only depending on a small number of its
input variables, we can even establish dimension-free optimal (up to some logarithmic fac-
tor) convergence rates for the excess logistic risk of fully connected DNN classifiers, further
leading to dimension-free rates of convergence of their excess misclassification error through
the calibration inequality. This result serves to elucidate the remarkable achievements of
DNNs in high-dimensional real-world classification tasks. In other circumstances such as
when the data distribution has a piecewise smooth decision boundary and the input data
are bounded away from it (i.e., s2 = ∞ in (2.51)), dimension-free rates of convergence can
also be derived. Besides the novel oracle-type inequality, the sharp estimates presented in
our paper also owe to a tight error bound for approximating the natural logarithm function
(which is unbounded near zero) by fully connected DNNs. All the claims for the optimal-
ity of rates in our paper are justified by corresponding minimax lower bounds. As far as
we know, all these results are new to the literature, which further enrich the theoretical
understanding of classification using deep neural networks. At last, we would like to em-
phasize that our framework of generalization analysis is very general and can be extended
to many other settings (e.g., when the loss function, the hypothesis space, or the assump-
tion on the data distribution is different from that in this current paper). In particular,
in our forthcoming research Zhang et al. (2024), we have investigated generalization anal-
ysis for CNN classifiers trained with the logistic loss, exponential loss, or LUM loss on
spheres under the Sobolev smooth conditional probability assumption. Motivated by recent
work Guo et al. (2020, 2017); Lin and Zhou (2018); Zhou (2018), we will also study more
efficient implementations of deep logistic classification for dealing with big data.

Acknowledgments

The work described in this paper is supported partially by InnoHK initiative, the Gov-
ernment of the HKSAR, Laboratory for AI-Powered Financial Technologies, the Research
Grants Council of Hong Kong (Projects No. CityU 11308121, 11306220, 11308020), the Ger-
many/Hong Kong Joint Research Scheme (Project No. G-CityU101/20), the NSFC/RGC
Joint Research Scheme (Project No. 12061160462 and N CityU102/20). Lei Shi is also
supported by Shanghai Science and Technology Program (Project No. 21JC1400600). The
first version of the paper was written when Ding-Xuan Zhou worked at City University of
Hong Kong.

Appendix A. Covering Numbers of Spaces of Fully Connected DNNs


In this appendix, we provide upper bounds for the covering numbers of spaces of fully con-
nected DNNs. Recall that if F consists of bounded real-valued functions defined on a domain
containing [0, 1]d , the covering number of F with respect to the radius γ and the metric
F × F 3 (f, g) 7→ supx∈[0,1]d |f (x) − g(x)| ∈ [0, ∞) is denoted by N (F, γ). For the space
FdFNN (G, N, S, B, F ) defined by (1.15), the covering number N FdFNN (G, N, S, B, F ) , γ


can be bounded from above in terms of G, N, S, B, and the radius of covering γ. The related
results are stated below.


Theorem A.1 For G ∈ [1, ∞), (N, S, B) ∈ [0, ∞)3 , and γ ∈ (0, 1), there holds
 
$$\log\mathcal{N}\left(\mathcal{F}^{\mathrm{FNN}}_d(G,N,S,B,\infty),\gamma\right) \le (S+Gd+1)(2G+5)\cdot\log\frac{(\max\{N,d\}+1)(B\vee1)(G+1)}{\gamma}.$$
Theorem A.1 can be proved in the same manner as in the proof of Lemma 5 in Schmidt-
Hieber (2020). Therefore, we omit the proof here. Similar results are also presented in
Proposition A.1 of Kim et al. (2021) and Lemma 3 of Suzuki (2019). Corollary A.1 follows
immediately from Theorem A.1 and Lemma 10.6 of Anthony and Bartlett (2009).

Corollary A.1 For $G\in[1,\infty)$, $(N,S,B)\in[0,\infty)^3$, $F\in[0,\infty]$ and $\gamma\in(0,1)$, there holds
$$
\log\mathcal{N}\Big(\mathcal{F}_d^{\mathrm{FNN}}(G,N,S,B,F),\gamma\Big)
\le(S+Gd+1)(2G+5)\cdot\log\frac{(\max\{N,d\}+1)(B\vee1)(2G+2)}{\gamma}.
$$
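For a concrete sense of scale, the right-hand side of Corollary A.1 can be evaluated directly. The short Python sketch below is our own illustration (the function name and the example architecture are ours) and simply implements the displayed bound.

```python
import math

def log_covering_bound(G, N, S, B, d, gamma):
    """Right-hand side of Corollary A.1: an upper bound on
    log N(F_d^FNN(G, N, S, B, F), gamma) for G >= 1 and gamma in (0, 1)."""
    return (S + G * d + 1) * (2 * G + 5) * math.log(
        (max(N, d) + 1) * max(B, 1.0) * (2 * G + 2) / gamma
    )

# Example: depth G = 10, width N = 500, S = 10_000 nonzero parameters bounded by B = 1,
# input dimension d = 8, covering radius gamma = 0.01.
print(log_covering_bound(G=10, N=500, S=10_000, B=1.0, d=8, gamma=0.01))
```

The bound grows roughly like $S\cdot G\cdot\log(N/\gamma)$, which is what drives the $\log W$ terms appearing in the oracle-type inequality of Theorem 2.1.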

Appendix B. Approximation Theory of Fully Connected DNNs


Theorem B.1 below gives error bounds for approximating Hölder continuous functions by
fully connected DNNs. Since it can be derived straightforwardly from Theorem 5 of Schmidt-
Hieber (2020), we omit its proof.

Theorem B.1 Suppose that $f\in\mathcal{B}_r^\beta\big([0,1]^d\big)$ with some $(\beta,r)\in(0,\infty)^2$. Then for any positive integers $m$ and $M'$ with
$$
M'\ge\max\Big\{(\beta+1)^d,\ \big(r\sqrt{d}\,\lceil\beta\rceil^d+1\big)\mathrm{e}^d\Big\},
$$
there exists
$$
\tilde{f}\in\mathcal{F}_d^{\mathrm{FNN}}\Big(14m\big(2+\log_2(d\vee\beta)\big),\ 6\big(d+\lceil\beta\rceil\big)M',\ 987(2d+\beta)^{4d}M'm,\ 1,\ \infty\Big)
$$
such that
$$
\sup_{x\in[0,1]^d}\big|f(x)-\tilde{f}(x)\big|
\le r\sqrt{d}\,\lceil\beta\rceil^d\cdot3^\beta\cdot M'^{-\beta/d}
+\big(1+2r\sqrt{d}\,\lceil\beta\rceil^d\big)\cdot6^d\cdot(1+d^2+\beta^2)\cdot M'\cdot2^{-m}.
$$
Corollary B.1 follows directly from Theorem B.1.

Corollary B.1 Suppose that $f\in\mathcal{B}_r^\beta\big([0,1]^d\big)$ with some $(\beta,r)\in(0,\infty)^2$. Then for any $\varepsilon\in(0,1/2]$, there exists
$$
\tilde{f}\in\mathcal{F}_d^{\mathrm{FNN}}\Big(D_1\log\tfrac1\varepsilon,\ D_2\,\varepsilon^{-\frac d\beta},\ D_3\,\varepsilon^{-\frac d\beta}\log\tfrac1\varepsilon,\ 1,\ \infty\Big)
$$
such that
$$
\sup_{x\in[0,1]^d}\big|f(x)-\tilde{f}(x)\big|\le\varepsilon,
$$
where $(D_1,D_2,D_3)\in(0,\infty)^3$ are constants depending only on $d$, $\beta$ and $r$.

Proof Let
$$
E_1=\max\left\{(\beta+1)^d,\ \big(r\sqrt{d}\,\lceil\beta\rceil^d+1\big)\mathrm{e}^d,\ \left(\frac1{2r\sqrt{d}\,\lceil\beta\rceil^d}\cdot3^{-\beta}\right)^{-d/\beta}\right\},
$$
$$
E_2=3\max\left\{1+\frac d\beta,\ \frac{\log\Big(4E_1\cdot\big(1+2r\sqrt{d}\,\lceil\beta\rceil^d\big)\cdot(1+d^2+\beta^2)\cdot6^d\Big)}{\log2}\right\},
$$
and
$$
D_1=14\cdot\big(2+\log_2(d\vee\beta)\big)\cdot(E_2+2),\quad
D_2=6\cdot\big(d+\lceil\beta\rceil\big)\cdot(E_1+1),\quad
D_3=987\cdot(2d+\beta)^{4d}\cdot(E_1+1)\cdot(E_2+2).
$$
Then $D_1,D_2,D_3$ are constants only depending on $d,\beta,r$.
For $f\in\mathcal{B}_r^\beta\big([0,1]^d\big)$ and $\varepsilon\in(0,1/2]$, choose $M'=\big\lceil E_1\cdot\varepsilon^{-d/\beta}\big\rceil$ and $m=\big\lceil E_2\log(1/\varepsilon)\big\rceil$. Then $m$ and $M'$ are positive integers satisfying
$$
1\le\max\Big\{(\beta+1)^d,\ \big(r\sqrt{d}\,\lceil\beta\rceil^d+1\big)\mathrm{e}^d\Big\}\le E_1\le E_1\cdot\varepsilon^{-d/\beta}\le M'\le1+E_1\cdot\varepsilon^{-d/\beta}\le(E_1+1)\cdot\varepsilon^{-d/\beta},\tag{B.1}
$$
$$
M'^{-\beta/d}\le\big(E_1\cdot\varepsilon^{-d/\beta}\big)^{-\beta/d}\le\varepsilon\cdot3^{-\beta}\cdot\frac1{2r\sqrt{d}\,\lceil\beta\rceil^d},\tag{B.2}
$$
and
$$
m\le E_2\log(1/\varepsilon)+2\log2\le E_2\log(1/\varepsilon)+2\log(1/\varepsilon)=(2+E_2)\cdot\log(1/\varepsilon).\tag{B.3}
$$
Moreover, we have that
$$
\begin{aligned}
2\cdot\big(1+2r\sqrt{d}\,\lceil\beta\rceil^d\big)\cdot6^d\cdot(1+d^2+\beta^2)\cdot M'\cdot\frac1\varepsilon
&\le2\cdot\big(1+2r\sqrt{d}\,\lceil\beta\rceil^d\big)\cdot6^d\cdot(1+d^2+\beta^2)\cdot(E_1+1)\cdot\varepsilon^{-1-d/\beta}\\
&\le2\cdot\big(1+2r\sqrt{d}\,\lceil\beta\rceil^d\big)\cdot6^d\cdot(1+d^2+\beta^2)\cdot2E_1\cdot\varepsilon^{-1-d/\beta}\\
&\le2^{\frac13E_2}\cdot\varepsilon^{-1-d/\beta}\le2^{\frac13E_2}\cdot\varepsilon^{-\frac13E_2}\le\varepsilon^{-\frac13E_2}\cdot\varepsilon^{-\frac13E_2}\\
&\le\varepsilon^{-E_2\cdot\log2}=2^{E_2\log(1/\varepsilon)}\le2^m.
\end{aligned}\tag{B.4}
$$
Therefore, from (B.1), (B.2), (B.3), (B.4), and Theorem B.1, we conclude that there exists
$$
\begin{aligned}
\tilde{f}&\in\mathcal{F}_d^{\mathrm{FNN}}\Big(14m\big(2+\log_2(d\vee\beta)\big),\ 6\big(d+\lceil\beta\rceil\big)M',\ 987(2d+\beta)^{4d}M'm,\ 1,\ \infty\Big)\\
&=\mathcal{F}_d^{\mathrm{FNN}}\Big(\tfrac{D_1}{E_2+2}\cdot m,\ \tfrac{D_2}{E_1+1}\cdot M',\ \tfrac{D_3}{(E_1+1)\cdot(E_2+2)}\cdot M'm,\ 1,\ \infty\Big)\\
&\subset\mathcal{F}_d^{\mathrm{FNN}}\Big(D_1\log\tfrac1\varepsilon,\ D_2\,\varepsilon^{-\frac d\beta},\ D_3\,\varepsilon^{-\frac d\beta}\log\tfrac1\varepsilon,\ 1,\ \infty\Big)
\end{aligned}
$$
such that
$$
\sup_{x\in[0,1]^d}\big|f(x)-\tilde{f}(x)\big|
\le r\sqrt{d}\,\lceil\beta\rceil^d\cdot3^\beta\cdot M'^{-\beta/d}
+\big(1+2r\sqrt{d}\,\lceil\beta\rceil^d\big)\cdot6^d\cdot(1+d^2+\beta^2)\cdot M'\cdot2^{-m}
\le\frac\varepsilon2+\frac\varepsilon2=\varepsilon.
$$
Thus we complete the proof.
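The proof above is fully constructive once $E_1$ and $E_2$ are fixed. As an illustration (ours, with $E_1$ and $E_2$ supplied as hypothetical inputs, since they depend only on $d$, $\beta$ and $r$), the following Python sketch reproduces the choices of $M'$ and $m$ and the resulting network sizes from Theorem B.1.

```python
import math

def architecture_from_proof(eps, d, beta, E1, E2):
    """Choices made in the proof of Corollary B.1:
    M' = ceil(E1 * eps^(-d/beta)), m = ceil(E2 * log(1/eps)),
    and the depth/width/sparsity of the approximating network from Theorem B.1."""
    M_prime = math.ceil(E1 * eps ** (-d / beta))
    m = math.ceil(E2 * math.log(1.0 / eps))
    depth = 14 * m * (2 + math.log2(max(d, beta)))
    width = 6 * (d + math.ceil(beta)) * M_prime
    nonzero = 987 * (2 * d + beta) ** (4 * d) * M_prime * m
    return depth, width, nonzero

# Hypothetical constants E1, E2 for the sake of illustration:
print(architecture_from_proof(eps=0.1, d=2, beta=1.5, E1=50.0, E2=30.0))
```

The sketch makes visible how the width scales like $\varepsilon^{-d/\beta}$ while the depth only grows logarithmically in $1/\varepsilon$.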

Appendix C. Proofs of Results in the Main Body


The proofs in this appendix will be organized in logical order in the sense that each result
in this appendix is proved without relying on results that are presented after it.
Throughout this appendix, we use

CParameter1 ,Parameter2 ,··· ,Parameterm

to denote a positive constant only depending on Parameter1 , Parameter2 , · · · , Parameterm .


For example, we may use Cd,β to denote a positive constant only depending on (d, β). The
values of such constants appearing in the proofs may be different from line to line or even
in the same line. Besides, we may use the same symbol with different meanings in different
proofs. For example, the symbol I may denote a number in one proof, and denote a set in
another proof. To avoid confusion, we will explicitly redefine these symbols in each proof.

C.1 Proofs of Some Properties of the Target Function


The following lemma justifies our claim in (1.7).

Lemma C.1 Let $d\in\mathbb{N}$, $P$ be a probability measure on $[0,1]^d\times\{-1,1\}$, and $\phi:\mathbb{R}\to[0,\infty)$ be a measurable function. Define
$$
\phi:[-\infty,\infty]\to[0,\infty],\quad z\mapsto
\begin{cases}
\lim_{t\to+\infty}\phi(t),&\text{if }z=\infty,\\
\phi(z),&\text{if }z\in\mathbb{R},\\
\lim_{t\to-\infty}\phi(t),&\text{if }z=-\infty,
\end{cases}
$$
which is an extension of $\phi$ to $[-\infty,\infty]$. Suppose $f^*:[0,1]^d\to[-\infty,\infty]$ is a measurable function satisfying that
$$
f^*(x)\in\mathop{\arg\min}_{z\in[-\infty,\infty]}\int_{\{-1,1\}}\phi(yz)\,\mathrm{d}P(y|x)\quad\text{for }P_X\text{-almost all }x\in[0,1]^d.\tag{C.1}
$$
Then there holds
$$
\int_{[0,1]^d\times\{-1,1\}}\phi(yf^*(x))\,\mathrm{d}P(x,y)=\inf\Big\{\mathcal{R}_P^\phi(g)\,\Big|\,g:[0,1]^d\to\mathbb{R}\text{ is measurable}\Big\}.
$$

Proof Let Ω0 := x ∈ [0, 1]d f ∗ (x) ∈ R × {−1, 1}. Then for any m ∈ N and any (i, j) ∈


{−1, 1}2 , define



m,
 if f ∗ (x) = ∞,
d
fm : [0, 1] → R, x 7→ f ∗ (x), if f ∗ (x) ∈ R,

−m, if f ∗ (x) = −∞,

and Ωi,j = x ∈ [0, 1]d f ∗ (x) = i · ∞ × {j}. Obviously, yf ∗ (x) = ij · ∞ and yfm (x) = ijm


for any (i, j) ∈ {−1, 1}2 , any m ∈ N, and any (x, y) ∈ Ωi,j . Therefore,
Z
lim φ(yfm (x))dP (x, y)
m→+∞ Ω
i,j
Z
= lim φ(ijm)dP (x, y) = P (Ωi,j ) · lim φ(ijm)
m→+∞ Ω m→+∞
i,j
Z (C.2)
≤ P (Ωi,j ) · lim φ(t) = P (Ωi,j ) · φ(ij · ∞) = φ(ij · ∞)dP (x, y)
t→ij·∞ Ωi,j
Z
= φ(yf ∗ (x))dP (x, y), ∀ (i, j) ∈ {−1, 1}2 .
Ωi,j

Besides, it is easy to verify that yfm (x) = yf ∗ (x) ∈ R for any (x, y) ∈ Ω0 and any m ∈ N,
which means that
Z Z
φ(yfm (x))dP (x, y) = φ(yf ∗ (x))dP (x, y), ∀ m ∈ N. (C.3)
Ω0 Ω0

Combining (C.2) and (C.3), we obtain


n o
inf RφP (g) g : [0, 1]d → R is measurable
Z
φ
≤ lim RP (fm ) = lim φ(yfm (x))dP (x, y)
m→+∞ m→+∞ [0,1]d ×{−1,1}
 
Z X X Z
= lim  φ(yfm (x))dP (x, y) + φ(yfm (x))dP (x, y)
m→+∞ Ω0 i∈{−1,1} j∈{−1,1} Ωi,j
Z X X Z (C.4)
≤ lim φ(yfm (x))dP (x, y) + lim φ(yfm (x))dP (x, y)
m→+∞ Ω m→+∞ Ω
0 i∈{−1,1} j∈{−1,1} i,j
Z X X Z

≤ φ(yf (x))dP (x, y) + φ(yf ∗ (x))dP (x, y)
Ω0 i∈{−1,1} j∈{−1,1} Ωi,j
Z
= φ(yf ∗ (x))dP (x, y).
[0,1]d ×{−1,1}

On the other hand, for any measurable g : [0, 1]d → R, it follows from (C.1) that
Z Z Z

φ(yf (x))dP (y|x) = inf φ(yz)dP (y|x) ≤ φ(yg(x))dP (y|x)
{−1,1} z∈[−∞,∞] {−1,1} {−1,1}

Z
= φ(yg(x))dP (y|x) for PX -almost all x ∈ [0, 1]d .
{−1,1}

Integrating both sides, we obtain


Z Z Z

φ(yf (x))dP (x, y) = φ(yf ∗ (x))dP (y|x)dPX (x)
[0,1]d ×{−1,1} [0,1]d {−1,1}
Z Z Z
≤ φ(yg(x))dP (y|x)PX (x) = φ(yg(x))dP (x, y) = RφP (g).
[0,1]d {−1,1} [0,1]d ×{−1,1}

Since g is arbitrary, we deduce that


Z n o
φ(yf ∗ (x))dP (x, y) ≤ inf RφP (g) g : [0, 1]d → R is measurable ,
[0,1]d ×{−1,1}

which, together with (C.4), proves the desired result.

The next lemma gives the explicit form of the target function of the logistic risk.

Lemma C.2 Let $\phi(t)=\log(1+\mathrm{e}^{-t})$ be the logistic loss, $d\in\mathbb{N}$, $P$ be a probability measure on $[0,1]^d\times\{-1,1\}$, and $\eta$ be the conditional probability function $P(\{1\}|\cdot)$ of $P$. Define
$$
f^*:[0,1]^d\to[-\infty,\infty],\quad x\mapsto
\begin{cases}
\infty,&\text{if }\eta(x)=1,\\
\log\frac{\eta(x)}{1-\eta(x)},&\text{if }\eta(x)\in(0,1),\\
-\infty,&\text{if }\eta(x)=0,
\end{cases}\tag{C.5}
$$
which is a natural extension of the map
$$
\Big\{z\in[0,1]^d\,\Big|\,\eta(z)\in(0,1)\Big\}\ni x\mapsto\log\frac{\eta(x)}{1-\eta(x)}\in\mathbb{R}
$$
to all of $[0,1]^d$. Then $f^*$ is a target function of the $\phi$-risk under $P$, i.e., (1.6) holds. In addition, the target function of the $\phi$-risk under $P$ is unique up to a $P_X$-null set. In other words, for any target function $f^\star$ of the $\phi$-risk under $P$, we must have
$$
P_X\Big(\Big\{x\in[0,1]^d\,\Big|\,f^*(x)\ne f^\star(x)\Big\}\Big)=0.
$$

Proof Define 
0,
 if z = ∞,
φ : [−∞, ∞] → [0, ∞], z 7→ φ(z), if z ∈ R, (C.6)

∞, if z = −∞,

which is a natural extension of the logistic loss φ to [−∞, ∞], and define

Va : [−∞, ∞] → [0, ∞], z 7→ aφ(z) + (1 − a)φ(−z)

for any a ∈ [0, 1]. Then we have that


Z
φ(yz)dP (y|x) = η(x)φ(z) + (1 − η(x))φ(−z)
{−1,1} (C.7)
d
= Vη(x) (z), ∀ x ∈ [0, 1] , z ∈ [−∞, ∞].

For any a ∈ [0, 1], we have that Va is smooth on R, and an elementary calculation gives
1
Va00 (t) = > 0, ∀ t ∈ R.
2 + et + e−t
Therefore, Va is strictly convex on R and
arg min Va (z) = z ∈ R Va0 (z) = 0 = z ∈ R aφ0 (z) − (1 − a)φ0 (−z) = 0
 
z∈R
 (n a
o
(C.8)
ez , if a ∈ (0, 1),

log 1−a
= z ∈ R −a + z
=0 =
1+e ∅, if a ∈ {0, 1} .

Besides, it is easy to verify that

Va (z) = ∞, ∀ a ∈ (0, 1), ∀ z ∈ {∞, −∞} ,

which, together with (C.8), yields


 
a
arg min Va (z) = arg min Va (z) = log , ∀ a ∈ (0, 1). (C.9)
z∈[−∞,∞] z∈R 1−a

In addition, it follows from

φ(z) > 0 = φ(∞), ∀ z ∈ [−∞, ∞)

that
arg min V1 (z) = arg min φ(z) = {∞} (C.10)
z∈[−∞,∞] z∈[−∞,∞]

and
arg min V0 (z) = arg min φ(−z) = {−∞} . (C.11)
z∈[−∞,∞] z∈[−∞,∞]

Combining (C.7), (C.10) and (C.11), we obtain



{+∞} , o if η(x) = 1,
Z 
n
η(x)
arg min φ(yz)dP (y|x) = arg min Vη(x) (z) = log 1−η(x) , if η(x) ∈ (0, 1),
z∈[−∞,∞] {−1,1} z∈[−∞,∞] 

{−∞} , if η(x) = 0
= {f ∗ (x)} , ∀ x ∈ [0, 1]d ,

which implies (1.6). Therefore, f ∗ is a target function of the φ-risk under the distribution P .
Moreover, the uniqueness of the target function of the φ-risk under P follows immediately
from the fact that for all x ∈ [0, 1]d the set
Z
arg min φ(yz)dP (y|x) = {f ∗ (x)}
z∈[−∞,∞] {−1,1}

contains exactly one point and the uniqueness (up to some PX -null set) of the conditional
distribution P (·|·) of P . This completes the proof.
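For concreteness, the target function (C.5) can be evaluated pointwise as in the following Python sketch; this is our own illustration of the formula (with the conventions at $\eta(x)\in\{0,1\}$ as in the definition above), not part of the proof.

```python
import math

def target_function(eta_x):
    """Target function f* of the logistic risk at a point x with
    conditional class probability eta_x = P(Y = 1 | X = x), as in (C.5)."""
    if eta_x == 1.0:
        return math.inf
    if eta_x == 0.0:
        return -math.inf
    return math.log(eta_x / (1.0 - eta_x))   # the log-odds of class +1

for p in (0.0, 0.25, 0.5, 0.9, 1.0):
    print(p, target_function(p))
```

The unboundedness of $f^*$ as $\eta(x)$ approaches $0$ or $1$ is exactly the obstacle that the truncation arguments later in this appendix are designed to handle.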

Lemma C.3 below provides a formula for computing the infimum of the logistic risk over all real-valued measurable functions.

Lemma C.3 Let $\phi(t)=\log(1+\mathrm{e}^{-t})$ be the logistic loss, $\delta\in(0,1/2]$, $d\in\mathbb{N}$, $P$ be a probability measure on $[0,1]^d\times\{-1,1\}$, $\eta$ be the conditional probability function $P(\{1\}|\cdot)$ of $P$, $f^*$ be defined by (C.5), $\phi$ be defined by (C.6), $H$ be defined by
$$
H:[0,1]\to[0,\infty),\quad t\mapsto
\begin{cases}
t\log\frac1t+(1-t)\log\frac1{1-t},&\text{if }t\in(0,1),\\
0,&\text{if }t\in\{0,1\},
\end{cases}
$$
and $\psi$ be defined by
$$
\psi:[0,1]^d\times\{-1,1\}\to[0,\infty),\quad(x,y)\mapsto
\begin{cases}
\phi\Big(y\log\frac{\eta(x)}{1-\eta(x)}\Big),&\text{if }\eta(x)\in[\delta,1-\delta],\\
0,&\text{if }\eta(x)\in\{0,1\},\\
\eta(x)\log\frac1{\eta(x)}+(1-\eta(x))\log\frac1{1-\eta(x)},&\text{if }\eta(x)\in(0,\delta)\cup(1-\delta,1).
\end{cases}
$$
Then there holds
$$
\begin{aligned}
\inf\Big\{\mathcal{R}_P^\phi(g)\,\Big|\,g:[0,1]^d\to\mathbb{R}\text{ is measurable}\Big\}
&=\int_{[0,1]^d\times\{-1,1\}}\phi(yf^*(x))\,\mathrm{d}P(x,y)\\
&=\int_{[0,1]^d}H(\eta(x))\,\mathrm{d}P_X(x)
=\int_{[0,1]^d\times\{-1,1\}}\psi(x,y)\,\mathrm{d}P(x,y).
\end{aligned}
$$

Proof According to Lemma C.2, f ∗ is a target function of the φ-risk under the distribution
P , meaning that
Z

f (x) ∈ arg min φ(yz)dP (y|x) for PX -almost all x ∈ [0, 1]d .
z∈[−∞,∞] {−1,1}

Then it follows from Lemma C.1 that


n o Z
φ d
inf RP (g) g : [0, 1] → R is measurable = φ(yf ∗ (x))dP (x, y)
d
[0,1] ×{−1,1}
Z Z
= φ(yf ∗ (x))dP (y|x)dPX (x) (C.12)
[0,1]d {−1,1}
Z  
= η(x)φ(f ∗ (x)) + (1 − η(x))φ(−f ∗ (x)) dPX (x).
[0,1]d

For any x ∈ [0, 1]d , if η(x) = 1, then we have

η(x)φ(f ∗ (x)) + (1 − η(x))φ(−f ∗ (x)) = φ(f ∗ (x)) = φ(+∞) = 0 = H(η(x)) = 0

Z
= 1 · 0 + (1 − 1) · 0 = η(x)ψ(x, 1) + (1 − η(x))ψ(x, −1) = ψ(x, y)dP (y|x);
{−1,1}

If η(x) = 0, then we have

η(x)φ(f ∗ (x)) + (1 − η(x))φ(−f ∗ (x)) = φ(−f ∗ (x)) = φ(+∞) = 0 = H(η(x)) = 0


Z
= 0 · 0 + (1 − 0) · 0 = η(x)ψ(x, 1) + (1 − η(x))ψ(x, −1) = ψ(x, y)dP (y|x);
{−1,1}

If η(x) ∈ (0, δ) ∪ (1 − δ, 1), then we have

η(x)φ (f ∗ (x)) + (1 − η(x))φ(−f ∗ (x))


   
η(x) η(x)
= η(x)φ log + (1 − η(x))φ − log
1 − η(x) 1 − η(x)
   
1 − η(x) η(x)
= η(x) log 1 + + (1 − η(x)) log 1 +
η(x) 1 − η(x)
1 1
= η(x) log + (1 − η(x)) log
η(x) 1 − η(x)
Z  
1 1
= H(η(x)) = η(x) log + (1 − η(x)) log dP (y|x)
{−1,1} η(x) 1 − η(x)
Z
= ψ(x, y)dP (y|x);
{−1,1}

If η(x) ∈ [δ, 1 − δ], then we have that

η(x)φ (f ∗ (x)) + (1 − η(x))φ(−f ∗ (x))


   
η(x) η(x)
= η(x)φ log + (1 − η(x))φ − log
1 − η(x) 1 − η(x)
   
1 − η(x) η(x)
= η(x) log 1 + + (1 − η(x)) log 1 +
η(x) 1 − η(x)
1 1
= η(x) log + (1 − η(x)) log
η(x) 1 − η(x)
   
η(x) η(x)
= H(η(x)) = η(x)φ log + (1 − η(x))φ − log
1 − η(x) 1 − η(x)
Z
= η(x)ψ(x, 1) + (1 − η(x))ψ(x, −1) = ψ(x, y)dP (y|x).
{−1,1}

In conclusion, we always have that


Z
∗ ∗
η(x)φ(f (x)) + (1 − η(x))φ(−f (x)) = H(η(x)) = ψ(x, y)dP (y|x).
{−1,1}

Since x is arbitrary, we deduce that


Z   Z
∗ ∗
η(x)φ(f (x)) + (1 − η(x))φ(−f (x)) dPX (x) = H(η(x))dPX (x)
[0,1]d [0,1]d

Z Z Z
= ψ(x, y)dP (y|x)dPX (x) = ψ(x, y)dP (x, y),
[0,1]d {−1,1} [0,1]d ×{−1,1}

which, together with (C.12), proves the desired result.
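As a quick numerical sanity check of Lemma C.3 (our illustration, with a hypothetical conditional probability function $\eta$ and $P_X$ uniform on $[0,1]^2$), the pointwise identity $\eta\phi(f^*)+(1-\eta)\phi(-f^*)=H(\eta)$ can be verified and the infimum of the logistic risk estimated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(100_000, 2))                  # P_X = uniform on [0,1]^2
eta = 1.0 / (1.0 + np.exp(-(X[:, 0] - X[:, 1])))    # hypothetical conditional probability

phi = lambda t: np.log1p(np.exp(-t))                # logistic loss
f_star = np.log(eta / (1.0 - eta))                  # target function (C.5) on {0 < eta < 1}

# Pointwise identity behind Lemma C.3: eta*phi(f*) + (1-eta)*phi(-f*) = H(eta)
H = eta * np.log(1.0 / eta) + (1.0 - eta) * np.log(1.0 / (1.0 - eta))
assert np.allclose(eta * phi(f_star) + (1.0 - eta) * phi(-f_star), H)

# Monte Carlo estimate of the infimum of the logistic risk, i.e. of \int H(eta) dP_X:
print(H.mean())
```

This makes explicit that the Bayes logistic risk is the expected binary entropy of $\eta$, which is the quantity $\Psi$ subtracted throughout the proof of Theorem 2.1.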

C.2 Proof of Theorem 2.1


Appendix C.2 is devoted to the proof of Theorem 2.1.
Proof [Proof of Theorem 2.1] Throughout this proof, we denote
Z
Ψ := ψ(x, y)dP (x, y).
[0,1]d ×{−1,1}
 
Then it follows from (2.3) and (2.4) that 0 ≤ RφP fˆn − Ψ ≤ 2M < ∞. Let {(Xk0 , Yk0 )}nk=1
be an i.i.d. sample from distribution P which is independent of {(Xk , Yk )}nk=1 . By inde-
pendence, we have
h   i 1X n h  
φ i
ˆ
E RP fn − Ψ = E φ Yi0 fˆn (Xi0 ) − ψ Xi0 , Yi0
n
i=1

with its empirical counterpart given by


n
1X h  ˆ  i
R̂ := E φ Yi fn (Xi ) − ψ(Xi , Yi ) .
n
i=1

Then we have
  1X n h   i
R̂ − RφP (g) −Ψ = E φ Yi fˆn (Xi ) − φ(Yi g(Xi ))
n
i=1
" n n
#
1X  ˆ  1X
=E φ Yi fn (Xi ) − φ (Yi g(Xi )) ≤ 0, ∀ g ∈ F,
n n
i=1 i=1

where the last inequality follows from the fact that fˆn is an empirical
 φ-risk minimizer
 which
1 Pn φ
minimizes n i=1 φ (Yi g(Xi )) over all g ∈ F. Hence R̂ ≤ inf g∈F RP (g) − Ψ , which means
that
h   i  h   i 
E RφP fˆn − Ψ = E RφP fˆn − Ψ − (1 + ε) · R̂ + (1 + ε) · R̂
 h   i    (C.13)
≤ E RφP fˆn − Ψ − (1 + ε) · R̂ + (1 + ε) · inf RφP (g) − Ψ , ∀ ε ∈ [0, 1).
g∈F
h   i
We then establish an upper bound for E RφP fˆn − Ψ − (1 + ε) · R̂ by using a similar
argument to that in the proof of Lemma 4 of Schmidt-Hieber (2020). The desired inequality
(2.6) will follow from this bound and (C.13). Recall that W = max {3, N (F, γ)}. From
the definition of W , there exist f1 , · · · , fW ∈ F such that for any f ∈ F, there exists some

j ∈ {1, · · · , W }, such that kf − fj k∞ ≤ γ. Therefore, there holds fˆn − fj ∗ ≤ γ where


[0,1]d
j ∗ is a {1, · · · , W }-valued statistic from the sample {(Xi , Yi )}ni=1 . Denote

r
log W (C.14)
A := M · .
Γn

And for j = 1, 2, · · · , W , let

hj,1 := RφP (fj ) − Ψ,


Z
hj,2 := (φ(yfj (x)) − ψ(x, y))2 dP (x, y),
[0,1]d ×{−1,1}
n
X (C.15)
Yi0 fj (Xi0 ) Xi0 , Yi0
 
Vj := φ (Yi fj (Xi )) − ψ (Xi , Yi ) − φ +ψ ,
i=1
p
rj := A ∨ hj,1 .

Then define
Vj
T := max .
j=1,··· ,W rj

Denote by E [ ·| (Xi , Yi )ni=1 ] the conditional expectation with respect to {(Xi , Yi )}ni=1 . Then
we have that

p
rj ∗ = A ∨ hj ∗ ,1
p
≤ A + hj ∗ ,1
q
= A + E [φ (Y 0 fj ∗ (X 0 )) − ψ(X 0 , Y 0 )| (Xi , Yi )ni=1 ]
r h   i
≤ A + γ + E φ Y 0 fˆn (X 0 ) − ψ(X 0 , Y 0 ) (Xi , Yi )ni=1
r  
= A + γ + Rφ fˆn − Ψ P
r
√  
≤A+ γ+ RφP fˆn − Ψ,

where (X 0 , Y 0 ) is an i.i.d. copy of (Xi , Yi ) (1 ≤ i ≤ n) and the second inequality follows


from

|φ(t1 ) − φ(t2 )| ≤ |t1 − t2 | , ∀ t1 , t2 ∈ R (C.16)

and fj ∗ − fˆ ≤ γ. Consequently,
[0,1]d
h   i h   i
E RφP fˆn − Ψ − R̂ ≤ R̂ − E RφP fˆn − Ψ
" n #
1 X   
0 0

0 0

= E φ Yi fˆn (Xi ) − ψ(Xi , Yi ) − φ Yi fˆn (Xi ) + ψ(Xi , Yi )
n
i=1
" n #
1 X
φ (Yi fj ∗ (Xi )) − ψ(Xi , Yi ) − φ Yi0 fj ∗ (Xi0 ) + ψ(Xi0 , Yi0 )
 
≤ E + 2γ
n
i=1
1 1
= E [Vj ∗ ] + 2γ ≤ E [T · rj ∗ ] + 2γ (C.17)
n " n
r # √
1 φ
 
ˆ A+ γ
≤ E T · RP fn − Ψ + · E [T ] + 2γ
n n
1p
r h   i A + √γ
≤ E [T 2 ] · E RφP fˆn − Ψ + · E [T ] + 2γ
n n
h   i
εE RφP fˆn − Ψ √
(1 + ε)E T 2
 
A+ γ
≤ + + E [T ] + 2γ, ∀ ε ∈ (0, 1),
2 + 2ε 2ε · n2 n
√ 
where the last inequality follows from 2 ab ≤ 1+ a + 1+ b, ∀a > 0, b > 0. We then bound
 2
E [T ] and E T by Bernstein’s inequality (see e.g., Chapter 3.1 of Cucker and Zhou (2007)
and Chapter 6.2 of Steinwart and Christmann (2008)). Indeed, it follows from (2.5) and
(C.15) that
hj,2 ≤ Γ · hj,1 ≤ Γ · (rj )2 , ∀ j ∈ {1, · · · , W }.
For any j ∈ {1, · · · , W } and t ≥ 0, we apply Bernstein’s inequality to the zero mean i.i.d.
random variables
n
φ (Yi fj (Xi )) − ψ(Xi , Yi ) − φ Yi0 fj (Xi0 ) + ψ(Xi0 , Yi0 ) i=1
 

and obtain

P(Vj ≥ t)
n
!
X
φ (Yi fj (Xi )) − ψ(Xi , Yi ) − φ Yi0 fj (Xi0 ) + ψ(Xi0 , Yi0 )
 
=P ≥t
i=1
 
−t2 /2
≤ 2 exp  Pn h i
Mt + E (φ (Y f (X )) − ψ(X , Y ) − φ (Y 0 f (X 0 )) + ψ(X 0 , Y 0 ))2
i=1 i j i i i i j i i i
 
−t2 /2
≤ 2 exp  Pnh i
2 0 f (X 0 )) − ψ(X 0 , Y 0 ))2
Mt + 2i=1 E (φ (Y f
i j (Xi )) − ψ(X ,
i iY )) + (φ (Y i j i i i
!
−t2 /2 −t2 t2
   
= 2 exp = 2 exp ≤ 2 exp − .
M t + 4 ni=1 hj,2 2M t + 8nΓ · (rj )2
P
2M t + 8nhj,2

Hence
W
X W
X
P(T ≥ t) ≤ P(Vj /rj ≥ t) = P(Vj ≥ trj )
j=1 j=1
W W
!
(trj )2 t2
X X  
≤2 exp − =2 exp −
j=1
2M trj + 8nΓ · rj2 j=1
2M t/rj + 8nΓ
W
t2 t2
X    
≤2 exp − = 2W exp − , ∀ t ∈ [0, ∞).
2M t/A + 8nΓ 2M t/A + 8nΓ
j=1

Therefore, for any θ ∈ {1, 2}, by taking


 s θ
2
M M
B :=  · log W + · log W + 8nΓ log W  = 4θ · (nΓ log W )θ/2 ,
A A

we derive
h i Z ∞   Z ∞  
E Tθ = P T ≥t 1/θ
dt ≤ B + P T ≥ t1/θ dt
0 B
!!

t2/θ
Z
≤B+ 2W exp − dt
B 2M t1/θ /A + 8nΓ
!!

B 1/θ · t1/θ
Z
≤B+ 2W exp − dt
B 2M B 1/θ /A + 8nΓ
Z ∞
−θ
= B + 2W Bθ · (log W ) e−u uθ−1 du
log W
−θ
≤ B + 2W Bθ · (log W ) ·θ·e − log W
(log W )θ−1
≤ 5θB ≤ 5θ · 4θ · (nΓ log W )θ/2 .
Plugging the inequality above and (C.14) into (C.17), we obtain
h   i h   i
E RφP fˆn − Ψ − R̂ ≤ R̂ − E RφP fˆn − Ψ
h   i
εE RφP fˆn − Ψ √
(1 + ε)E T 2
 
A+ γ
≤ + + E [T ] + 2γ
2 + 2ε 2ε · n2 n
r
ε h   i √ Γ log W
≤ E RφP fˆn − Ψ + 20 · γ ·
1+ε n
log W Γ log W 1 + ε
+ 20M · + 80 · · + 2γ, ∀ ε ∈ (0, 1).
n n ε
Multiplying the above inequality by (1 + ε) and then rearranging, we obtain that
r
h
φ
  i √ Γ log W
E RP fˆn − Ψ − (1 + ε) · R̂ ≤ 20 · (1 + ε) · γ ·
n (C.18)
log W Γ log W (1 + ε)2
+ 20 · (1 + ε) · M · + 80 · · + (2 + 2ε) · γ, ∀ ε ∈ (0, 1).
n n ε

Combining (C.18) and (C.13), we deduce that


r
h   i   √ Γ log W
E RφP
ˆ φ
fn − Ψ ≤ (1 + ε) · inf RP (g) − Ψ + 20 · (1 + ε) · γ ·
g∈F n
log W Γ log W (1 + ε) 2
+ 20 · (1 + ε) · M · + 80 · · + (2 + 2ε) · γ, ∀ ε ∈ (0, 1).
n n ε
This proves the desired inequality (2.6) and completes the proof of Theorem 2.1.

C.3 Proof of Theorem 2.4


To prove Theorem 2.4, we need the following Lemma C.4 and Lemma C.5.
Lemma C.4, which describes neural networks that approximate the multiplication op-
erator, can be derived directly from Lemma A.2 of Schmidt-Hieber (2020). Thus we omit
its proof. One can also find a similar result to Lemma C.4 in the earlier paper Yarotsky
(2017) (cf. Proposition 3 therein).

Lemma C.4 For any $\varepsilon\in(0,1/2]$, there exists a neural network
$$
\mathrm{M}\in\mathcal{F}_2^{\mathrm{FNN}}\Big(15\log\tfrac1\varepsilon,\ 6,\ 900\log\tfrac1\varepsilon,\ 1,\ 1\Big)
$$
such that for any $t,t'\in[0,1]$, there hold $\mathrm{M}(t,t')\in[0,1]$, $\mathrm{M}(t,0)=\mathrm{M}(0,t')=0$ and
$$
\big|\mathrm{M}(t,t')-t\cdot t'\big|\le\varepsilon.
$$

In Lemma C.5, we construct a neural network which performs the operation of multiplying the input by $2^k$.

Lemma C.5 Let $k$ be a positive integer and $f$ be the univariate function given by $f(x)=2^k\cdot\max\{x,0\}$. Then
$$
f\in\mathcal{F}_1^{\mathrm{FNN}}(k,2,4k,1,\infty).
$$
Proof For any $1\le i\le k-1$, let $v_i=(0,0)^\top$ and
$$
W_i=\begin{pmatrix}1&1\\1&1\end{pmatrix}.
$$
In addition, take
$$
W_0=(1,1)^\top,\quad W_k=(1,1),\quad\text{and}\quad v_k=(0,0)^\top.
$$
Then we have
$$
f=\Big(x\mapsto W_k\,\sigma_{v_k}W_{k-1}\,\sigma_{v_{k-1}}\cdots W_1\,\sigma_{v_1}W_0\,x\Big)\in\mathcal{F}_1^{\mathrm{FNN}}(k,2,4k,1,\infty),
$$
which proves this lemma.
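The construction is easy to check numerically. The following numpy sketch (our own illustration) assembles the weight matrices above and verifies that the resulting depth-$k$, width-$2$ ReLU network with unit weights computes $x\mapsto 2^k\max\{x,0\}$.

```python
import numpy as np

def scale_by_power_of_two(x, k):
    """ReLU network of Lemma C.5: depth k, width 2, all weights equal to 1,
    computing x -> 2^k * max(x, 0)."""
    relu = lambda z: np.maximum(z, 0.0)
    W0 = np.array([[1.0], [1.0]])              # input layer: R -> R^2
    W = np.array([[1.0, 1.0], [1.0, 1.0]])     # each hidden layer doubles the value
    Wk = np.array([[1.0, 1.0]])                # output layer: R^2 -> R
    h = relu(W0 @ np.array([[x]]))
    for _ in range(k - 1):
        h = relu(W @ h)
    return (Wk @ h).item()

for x in (-1.0, 0.3, 2.0):
    assert abs(scale_by_power_of_two(x, 5) - 2**5 * max(x, 0.0)) < 1e-9
```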


Now we are in the position to prove Theorem 2.4.

Proof [Proof of Theorem 2.4] Given $a\in(0,1/2]$, let $I:=\lceil-\log_2a\rceil$ and $J_k:=\big[\frac1{3\cdot2^k},\frac1{2^k}\big]$ for $k=0,1,2,\cdots$. Then $1\le I\le1-\log_2a\le4\log\frac1a$. The idea of the proof is to construct neural networks $\{\tilde h_k\}_k$ which satisfy $0\le\tilde h_k(t)\le1$ and such that $(8\log a)\cdot\tilde h_k$ approximates the natural logarithm function on $J_k$. Then the function
$$
x\mapsto(8\log a)\cdot\sum_k\mathrm{M}\big(\tilde h_k(x),\tilde f_k(x)\big)
$$
is the desired neural network in Theorem 2.4, where $\mathrm{M}$ is the neural network approximating the multiplication operator given in Lemma C.4 and $\{\tilde f_k\}_k$ are neural networks representing piecewise linear functions supported on $J_k$ which constitute a partition of unity.
Specifically, given α ∈ (0, ∞), there exists some rα > 0 only depending on α such that
 
2x 1
x 7→ log + ∈ Brαα ([0, 1]) .
3 3

Hence it follows from Corollary B.1 that there exists


 1/α  1/α !
2 2 2 2
g̃1 ∈ F1FNN Cα log , Cα , Cα log , 1, ∞
ε ε ε ε
 1/α  1/α !
1 1 1 1
⊂ F1FNN Cα log , Cα , Cα log , 1, ∞
ε ε ε ε

such that  
2x 1
sup g̃1 (x) − log + ≤ ε/2.
x∈[0,1] 3 3

Recall that the ReLU function is given by σ(t) = max {t, 0}. Let

g̃2 : R → R, x 7→ −σ (−σ (g̃1 (x) + log 3) + log 3) .

Then  1/α  1/α !


1 1 1 1
g̃2 ∈ F1FNN Cα log , Cα , Cα log , 1, ∞ , (C.19)
ε ε ε ε

and for x ∈ R, there holds



 − log 3,
 if g̃1 (x) < − log 3,
− log 3 ≤ g̃2 (x) = g̃1 (x), if − log 3 ≤ g̃1 (x) ≤ 0,

0, if g̃1 (x) > 0.

2x 1

Moreover, since − log 3 ≤ log 3 + 3 ≤ 0 whenever x ∈ [0, 1], we have
   
2x 1 2x 1
sup g̃2 (x) − log + ≤ sup g̃1 (x) − log + ≤ ε/2.
x∈[0,1] 3 3 x∈[0,1] 3 3

3·2k ·t−1
Let x = 2 in the above inequality, we obtain
3 · 2k · t − 1
 
sup g̃2 − k log 2 − log t ≤ ε/2, ∀ k = 0, 1, 2, · · · . (C.20)
t∈Jk 2
For any 0 ≤ k ≤ I, define
!
3 1
· 2I+1 · σ(t) −

σ −g̃2 σ 4·2I−k 2 k log 2
h̃k : R → R, t 7→ σ + .
8 log a1 8 log a1
Then we have
3 1
· 2I+1 · σ(t) −

σ −g̃2 σ 4·2I−k 2 k log 2
0 ≤ h̃k (t) ≤ +
8 log a1 8 log a1
3
· 2I+1 · σ(t) − 12

−g̃2 σ 4·2I−k k log 2
≤ + (C.21)
1
8 log a 8 log a1
supx∈R |g̃2 (x)| I log 3 + 4 log a1
≤ + ≤ ≤ 1, ∀ t ∈ R.
8 log a1 8 log a1 8 log a1

Therefore, it follows from (C.19), the definition of h̃k , and Lemma C.5 that (cf. Figure C.1)

 1  1 !
1 1 α 1 α 1
h̃k ∈ F1FNN Cα log + I, Cα , Cα log + 4I, 1, 1
ε ε ε ε
 1  1 ! (C.22)
1 1 1 α 1 α 1 1
⊂ F1FNN Cα log + 4 log , Cα , Cα log + 16 log , 1, 1
ε a ε ε ε a

for all 0 ≤ k ≤ I. Besides, according to (C.20), it is easy to verify that for 0 ≤ k ≤ I, there
holds
 
3 k
(8 log a) · h̃k (t) − log t = g̃2 · 2 · t − 1/2 − k log 2 − log t ≤ ε/2, ∀ t ∈ Jk .
2
Define
$$
\tilde f_0:\mathbb{R}\to[0,1],\quad x\mapsto
\begin{cases}
0,&\text{if }x\in(-\infty,1/3),\\
6\cdot\big(x-\tfrac13\big),&\text{if }x\in[1/3,1/2],\\
1,&\text{if }x\in(1/2,\infty),
\end{cases}
$$
and for $k\in\mathbb{N}$,
$$
\tilde f_k:\mathbb{R}\to[0,1],\quad x\mapsto
\begin{cases}
0,&\text{if }x\in\mathbb{R}\setminus J_k,\\
6\cdot2^k\cdot\Big(x-\dfrac1{3\cdot2^k}\Big),&\text{if }x\in\Big[\dfrac1{3\cdot2^k},\dfrac1{2^{k+1}}\Big],\\
1,&\text{if }x\in\Big[\dfrac1{2^{k+1}},\dfrac1{3\cdot2^{k-1}}\Big],\\
-3\cdot2^k\cdot\Big(x-\dfrac1{2^k}\Big),&\text{if }x\in\Big[\dfrac1{3\cdot2^{k-1}},\dfrac1{2^k}\Big].
\end{cases}
$$
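Before moving on, one can check numerically that the functions $\tilde f_k$ defined above form a partition of unity on $[a,1]$ (this is also visible in Figure C.2). The short Python sketch below is our own illustration of this fact.

```python
import numpy as np

def f_tilde(k, x):
    """Piecewise linear bump functions from the proof of Theorem 2.4."""
    x = np.asarray(x, dtype=float)
    if k == 0:
        return np.clip(6.0 * (x - 1.0 / 3.0), 0.0, 1.0)
    up = 6.0 * 2**k * (x - 1.0 / (3.0 * 2**k))      # increasing ramp on J_k
    down = -3.0 * 2**k * (x - 1.0 / 2**k)           # decreasing ramp on J_k
    return np.clip(np.minimum(up, down), 0.0, 1.0)

a = 1e-3
I = int(np.ceil(-np.log2(a)))
t = np.linspace(a, 1.0, 10_000)
total = sum(f_tilde(k, t) for k in range(I + 1))
assert np.allclose(total, 1.0)                      # partition of unity on [a, 1]
```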

[Figure C.1: Networks representing the functions $\tilde h_k$. The input $t\in\mathbb{R}$ passes through an $(I+1)$-layer sub-network with the architecture of Lemma C.5 representing $t\mapsto2^{I+1}\sigma(t)$, followed by the network $\tilde g_2$ and the final rescaling in the definition of $\tilde h_k$.]

[Figure C.2: Graphs of the functions $\tilde f_0,\tilde f_1,\tilde f_2,\tilde f_3,\tilde f_4$ on the interval $[\tfrac1{24},\tfrac23]$.]

Then it is easy to show that for any x ∈ R and k ∈ N, there hold

   
6 1 6 1
f˜k (x) = · 2I+3
· σ x − − · 2 I+3
· σ x −
2I−k+3 3 · 2k 2I−k+3 2k+1
   
6 I+3 1 6 I+3 1
+ I−k+4 · 2 · σ x − k − I−k+3 · 2 ·σ x− ,
2 2 2 3 · 2k−1

and
6 6
f˜0 (x) = · 2I+3 · σ(x − 1/3) − · 2I+3 · σ(x − 1/2).
2I+3 2I+3
Hence it follows from Lemma C.5 that (cf. Figure C.3)

f˜k ∈ F1FNN (I + 5, 8, 16I + 60, 1, ∞)


(C.23)
 
FNN 1 1
⊂ F1 12 log , 8, 152 log , 1, ∞ , ∀ 0 ≤ k ≤ I.
a a

[Figure C.3: Networks representing the functions $\tilde f_k$. Four parallel $(I+3)$-layer sub-networks with the architecture of Lemma C.5, each representing $t\mapsto2^{I+3}\sigma(t)$, are applied to $\sigma\big(x-\tfrac1{3\cdot2^k}\big)$, $\sigma\big(x-\tfrac1{2^{k+1}}\big)$, $\sigma\big(x-\tfrac1{2^k}\big)$, $\sigma\big(x-\tfrac2{3\cdot2^k}\big)$ and combined linearly.]

Next, we show that

I
 X
1
sup log(t) + 8 log h̃k (t)f˜k (t) ≤ ε/2. (C.24)
t∈[a,1] a
k=0

Indeed, we have the following inequalities:


 XI  
1 1
log(t) + 8 log h̃k (t)f˜k (t) = log t + 8 log h̃0 (t)f˜0 (t)
a a
k=0
  (C.25)
1
= log t + 8 log h̃0 (t) ≤ ε/2, ∀ t ∈ [1/2, 1];
a

 XI  
1 ˜ 1
log(t) + 8 log h̃k (t)fk (t) = log(t) + 8 log h̃m−1 (t) ≤ ε/2,
a a
k=0
  (C.26)
1 1
∀t ∈ m , ∩ [a, 1] with 2 ≤ m ≤ I;
2 3 · 2m−2

and

I
 X
1
log(t) + 8 log h̃k (t)f˜k (t)
a
k=0
 
= log (t) (f˜m (t) + f˜m−1 (t)) − 8 log (a) h̃m (t)f˜m (t) + h̃m−1 (t)f˜m−1 (t)
(C.27)
≤ f˜m (t) log(t) − 8 log (a) h̃m (t) + f˜m−1 (t) log(t) − 8 log (a) h̃m−1 (t)
 
˜ ε ˜ ε ε 1 1
≤ fm (t) · + fm−1 (t) · = , ∀ t ∈ , ∩ [a, 1] with 1 ≤ m ≤ I.
2 2 2 3 · 2m−1 2m

Note that

I  ! I  !
[ 1 1 [ 1 1
[a, 1] ⊂ [1/2, 1] ∪ , ∪ , .
3 · 2m−1 2m 2m 3 · 2m−2
m=1 m=2

Consequently, (C.24) follows immediately from (C.25), (C.26) and (C.27).


From Lemma C.4 we know that there exists
!
FNN 96 (log a)2 96 (log a)2
M ∈ F2 15 log , 6, 900 log , 1, 1 (C.28)
ε ε

such that for any t, t0 ∈ [0, 1], there hold M(t, t0 ) ∈ [0, 1], M(t, 0) = M(0, t0 ) = 0 and

ε
M(t, t0 ) − t · t0 ≤ . (C.29)
96 (log a)2

Define
I
X  
g̃3 : R → R, x 7→ M h̃k (x), f˜k (x) ,
k=0

and

f˜ : R → R,
8I      
X log(a) log b log b 1
x 7→ ·σ + σ σ (g̃3 (x)) − − σ σ (g̃3 (x)) − .
I 8 log a 8 log a 8
k=1

[Figure C.4: The network representing the function $\tilde g_3$. The parallel sub-networks $\tilde h_k$ and $\tilde f_k$ ($k=0,\ldots,I$) feed the multiplication networks $\mathrm{M}$, whose outputs are summed.]

Then it follows from (C.21),(C.29), (C.24), the definitions of f˜k and g̃3 that

|log t − 8 log(a) · g̃3 (t)|


  I  XI
1 X
˜ 1
≤ 8 log · g̃3 (t) − h̃k (t)fk (t) + log t + 8 log h̃k (t)f˜k (t)
a a
k=0 k=0
  I
1 X
≤ 8 log · g̃3 (t) − h̃k (t)f˜k (t) + ε/2 (C.30)
a
k=0
I
X  
≤ ε/2 + |8 log a| · M h̃k (t), f˜k (t) − h̃k (t)f˜k (t)
k=0
ε
≤ ε/2 + |8 log a| · (I + 1) · ≤ ε, ∀ t ∈ [a, 1].
96 (log a)2

However, for any t ∈ R, by the definition of f˜, we have



8 log(a) · g̃3 (t), if 8 log(a) · g̃3 (t) ∈ [log a, log b],

˜
f (t) = log a, if 8 log(a) · g̃3 (t) < log a,
 (C.31)
log b, if 8 log(a) · g̃3 (t) > log b,

satisfying log a ≤ f˜(t) ≤ log b ≤ 0.

Then by (C.30), (C.31) and the fact that log t ∈ [log a, log b], ∀ t ∈ [a, b], we obtain

log t − f˜(t) ≤ |log t − 8 log(a) · g̃3 (t)| ≤ ε, ∀ t ∈ [a, b].

That is,
sup log t − f˜(t) ≤ ε. (C.32)
t∈[a,b]

[Figure C.5: The network representing the function $\tilde f$, built on top of $\tilde g_3$ via the thresholding construction in the definition of $\tilde f$.]

On the other hand, it follows from (C.22), (C.23), (C.28), the definition of g̃3 , and
1 ≤ I ≤ 4 log a1 that
 1
1 
2
 1 α
g̃3 ∈F1FNN Cα log + I + 15 log 96 (log a) , Cα I,
ε ε
 1 ! !
1 α 1 
2

(I + 1) · 20I + Cα · log + 900 log 96 (log a) , 1, ∞
ε ε
 1
FNN 1 1 1 α 1
⊂ F1 Cα log + 139 log , Cα log ,
ε a ε a
 1     !
1 α 1 1 2
Cα · log · log + 65440 (log a) , 1, ∞ .
ε ε a

Then by the definition of f˜ we obtain (cf. Figure C.5)

 1
1 1 1 α 1
f˜ ∈ F1FNN Cα log + 139 log , Cα log ,
ε a ε a
 1     !
1 α 1 1 2
Cα · log · log + 65440 (log a) , 1, ∞ .
ε ε a

This, together with (C.31) and (C.32), completes the proof of Theorem 2.4.

C.4 Proof of Theorem 2.2 and Theorem 2.3

Appendix C.4 is devoted to the proof of Theorem 2.2 and Theorem 2.3. We will first
establish several lemmas. We then use these lemmas to prove Theorem 2.3. Finally, we
derive Theorem 2.2 by applying Theorem 2.3 with q = 0, d∗ = d and d? = K = 1.

Lemma C.6 Let $\phi(t)=\log(1+\mathrm{e}^{-t})$ be the logistic loss. Suppose real numbers $a,f,A,B$ satisfy $0<a<1$ and $A\le\min\big\{f,\log\frac a{1-a}\big\}\le\max\big\{f,\log\frac a{1-a}\big\}\le B$. Then there holds
$$
\begin{aligned}
&\min\left\{\frac1{4+2\mathrm{e}^A+2\mathrm{e}^{-A}},\ \frac1{4+2\mathrm{e}^B+2\mathrm{e}^{-B}}\right\}\cdot\Big|f-\log\frac a{1-a}\Big|^2\\
&\le a\phi(f)+(1-a)\phi(-f)-a\log\frac1a-(1-a)\log\frac1{1-a}\\
&\le\sup\left\{\frac1{4+2\mathrm{e}^z+2\mathrm{e}^{-z}}\,\middle|\,z\in[A,B]\right\}\cdot\Big|f-\log\frac a{1-a}\Big|^2
\le\frac18\cdot\Big|f-\log\frac a{1-a}\Big|^2.
\end{aligned}
$$

Proof Consider the map G : R → [0, ∞), z 7→ aφ(z)


 + (1 − a)φ(−z). Obviously G is twice
continuously differentiable on R with G log 1−a = 0 and G00 (z) = 2+ez1+e−z for any real
0 a

number z. Then it follows from Taylor’s theorem that there exists a real number ξ between
a
log 1−a and f , such that

 
1 1 a
aφ(f ) + (1 − a)φ(−f ) − a log − (1 − a) log = G(f ) − G log
a 1−a 1−a
    00 2
a a G (ξ) a
= f − log · G0 log + · f − log (C.33)
1−a 1−a 2 1−a
2
a
G00 (ξ) a 2 f − log 1−a
= · f − log = .
2 1−a 4 + 2eξ + 2e−ξ

n o n o
a a
Since A ≤ min f, log 1−a ≤ max f, log 1−a ≤ B, we must have ξ ∈ [A, B], which,
together with (C.33), yields
a 2
 
1 1
min , · f − log
4 + 2eA + 2e−A 4 + 2eB + 2e−B 1−a
2
a

1

a 2 f − log 1−a
= inf · f − log ≤
t∈[A,B] 4 + 2et + e−t 1−a 4 + 2eξ + 2e−ξ
(C.34)
2
a
1 1 f − log 1−a
= aφ(f ) + (1 − a)φ(−f ) − a log − (1 − a) log =
a 1−a 4 + 2eξ + 2e−ξ
a 2 1 2
 
1 a
≤ sup z −z
z ∈ [A, B] · f − log ≤ · f − log .
4 + 2e + 2e 1−a 8 1−a
This completes the proof.

Lemma C.7 Let $\phi(t)=\log(1+\mathrm{e}^{-t})$ be the logistic loss, $f$ be a real number, $d\in\mathbb{N}$, and $P$ be a Borel probability measure on $[0,1]^d\times\{-1,1\}$ of which the conditional probability function $[0,1]^d\ni z\mapsto P(\{1\}|z)\in[0,1]$ is denoted by $\eta$. Then for $x\in[0,1]^d$ such that $\eta(x)\notin\{0,1\}$, there holds
$$
\begin{aligned}
&\inf_{t\in\big[f\wedge\log\frac{\eta(x)}{1-\eta(x)},\,f\vee\log\frac{\eta(x)}{1-\eta(x)}\big]}\frac1{2(2+\mathrm{e}^t+\mathrm{e}^{-t})}\cdot\Big|f-\log\frac{\eta(x)}{1-\eta(x)}\Big|^2\\
&\le\int_{\{-1,1\}}\bigg(\phi(yf)-\phi\Big(y\log\frac{\eta(x)}{1-\eta(x)}\Big)\bigg)\,\mathrm{d}P(y|x)\\
&\le\sup_{t\in\big[f\wedge\log\frac{\eta(x)}{1-\eta(x)},\,f\vee\log\frac{\eta(x)}{1-\eta(x)}\big]}\frac1{2(2+\mathrm{e}^t+\mathrm{e}^{-t})}\cdot\Big|f-\log\frac{\eta(x)}{1-\eta(x)}\Big|^2
\le\frac14\cdot\Big|f-\log\frac{\eta(x)}{1-\eta(x)}\Big|^2.
\end{aligned}
$$

Proof Given x ∈ [0, 1]d such that η(x) ∈


/ {0, 1}, define
Vx : R → (0, ∞), t 7→ η(x)φ(t) + (1 − η(x))φ(−t).
Then it is easy to verify that
Z
φ (yt) dP (y|x) = φ(t)P (Y = 1|X = x) + φ(−t)P (Y = −1|X = x) = Vx (t)
{−1,1}

for all t ∈ R. Consequently,


Z     
η(x) η(x)
φ (yf ) − φ y log dP (y|x) = Vx (f ) − Vx log
{−1,1} 1 − η(x) 1 − η(x)
1 1
= η(x)φ(f ) + (1 − η(x))φ(−f ) − η(x) log − (1 − η(x)) log .
η(x) 1 − η(x)
The desired inequalities then follow immediately by applying Lemma C.6.
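As an illustration, the three quantities appearing in Lemma C.7 can be compared numerically at a single point; the Python sketch below is ours and uses a fine grid to evaluate the infimum and supremum of the curvature factor over the relevant interval.

```python
import numpy as np

phi = lambda t: np.log1p(np.exp(-t))                         # logistic loss

def lemma_c7_quantities(eta, f):
    """Middle term and the two quadratic bounds of Lemma C.7 at one point x."""
    f_star = np.log(eta / (1.0 - eta))                       # log-odds
    excess = eta * (phi(f) - phi(f_star)) + (1.0 - eta) * (phi(-f) - phi(-f_star))
    lo, hi = min(f, f_star), max(f, f_star)
    t = np.linspace(lo, hi, 10_001)
    weight = 1.0 / (2.0 * (2.0 + np.exp(t) + np.exp(-t)))    # curvature factor
    return weight.min() * (f - f_star) ** 2, excess, weight.max() * (f - f_star) ** 2

low, mid, up = lemma_c7_quantities(eta=0.7, f=-0.5)
assert low <= mid <= up <= 0.25 * (-0.5 - np.log(0.7 / 0.3)) ** 2
```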

Lemma C.8 Let φ(t) = log 1 + e−t be the logistic loss, d ∈ N, f : [0, 1]d → R be a


measurable function, and P be a Borel probability measure on [0, 1]d × {−1, 1} of which the
conditional probability function [0, 1]d 3 z 7→ P ({1} |z) ∈ [0, 1] is denoted by η. Assume that
there exist constants (a, b) ∈ R2 , δ ∈ (0, 1/2), and a measurable function η̂ : [0, 1]d → R,
such that η̂ = η, PX -a.s.,
δ
log ≤ f (x) ≤ −a, ∀ x ∈ [0, 1]d satisfying 0 ≤ η̂(x) = η(x) < δ,
1−δ
and
1−δ
b ≤ f (x) ≤ log , ∀ x ∈ [0, 1]d satisfying 1 − δ < η̂(x) = η(x) ≤ 1.
δ
Then

EPφ (f ) − φ(a)PX (Ω2 ) − φ(b)PX (Ω3 )


2
 
η(x)
Z  f (x) − log 1−η(x)
 
η(x) η(x) 
 
≤ sup t −t
t ∈ f (x) ∧ log , f (x) ∨ log dPX (x)
Ω1  2(2 + e + e )
 1 − η(x) 1 − η(x) 
Z 2
η(x)
≤ f (x) − log dPX (x),
Ω1 1 − η(x)

where n o
Ω1 := x ∈ [0, 1]d δ ≤ η̂(x) = η(x) ≤ 1 − δ ,
n o
Ω2 := x ∈ [0, 1]d 0 ≤ η̂(x) = η(x) < δ , (C.35)
n o
Ω3 := x ∈ [0, 1]d 1 − δ < η̂(x) = η(x) ≤ 1 .

Proof Define
ψ : [0, 1]d × {−1, 1} → [0, ∞),
  
η(x)
φ y log , if η(x) ∈ [δ, 1 − δ],


1 − η(x)




(x, y) 7→ 0, if η(x) ∈ {0, 1},

1 1


η(x) log + (1 − η(x)) log , if η(x) ∈ (0, δ) ∪ (1 − δ, 1).


η(x) 1 − η(x)

Since η̂ = η ∈ [0, 1], PX -a.s., we have that PX ([0, 1]d \ (Ω1 ∪ Ω2 ∪ Ω3 )) = 0. Then it follows
from lemma C.3 that
n o
φ φ φ d
EP (f ) = RP (f ) − inf RP (g) g : [0, 1] → R is measurable
Z Z (C.36)
= φ(yf (x))dP (x, y) − ψ(x, y)dP (x, y) = I1 + I2 + I3 ,
[0,1]d ×{−1,1} [0,1]d ×{−1,1}

where Z
Ii := (φ (yf (x)) − ψ(x, y)) dP (x, y), i = 1, 2, 3.
Ωi ×{−1,1}

According to Lemma C.7, we have


Z Z   
η(x)
I1 = φ (yf (x)) − φ y log dP (y|x)dPX (x)
Ω1 {−1,1} 1 − η(x)
 
η(x)
 
2
η(x)
 f (x) − log 1−η(x)

 t ∈ f (x) ∧ log , ∞ and  (C.37)
1 − η(x)
Z 
≤ sup dPX (x).
2(2 + et + e−t )
 
Ω1  η(x) 

 t ∈ −∞, f (x) ∨ log 

1 − η(x)
Then it remains to bound I2 and I3 .
Indeed, for any x ∈ Ω2 , if η(x) = 0, then
Z
(φ(yf (x)) − ψ(x, y)) dP (y|x) = φ(−f (x)) ≤ φ(a).
{−1,1}

Otherwise, we have
Z
(φ(yf (x)) − ψ(x, y)) dP (y|x)
{−1,1}
   
1 1
= φ(f (x)) − log η(x) + φ(−f (x)) − log (1 − η(x))
η(x) 1 − η(x)
     
η(x) η(x)
= φ (f (x)) − φ log η(x) + φ (−f (x)) − φ − log (1 − η(x))
1 − η(x) 1 − η(x)
    
δ η(x)
≤ φ log − φ log η(x) + φ(−f (x))(1 − η(x))
1−δ 1 − η(x)
≤ φ(−f (x))(1 − η(x)) ≤ φ(−f (x)) ≤ φ(a).

Therefore, no matter whether η(x) = 0 or η(x) 6= 0, there always holds


Z
(φ(yf (x)) − ψ(x, y)) dP (y|x) ≤ φ(a),
{−1,1}

which means that


Z Z
I2 = (φ(yf (x)) − ψ(x, y)) dP (y|x)dPX (x)
Ω2 {−1,1}
Z (C.38)
≤ φ(a)dPX (x) = φ(a)PX (Ω2 ).
Ω2

Similarly, for any x ∈ Ω3 , if η(x) = 1, then


Z
(φ(yf (x)) − ψ(x, y)) dP (y|x) = φ(f (x)) ≤ φ(b).
{−1,1}

Otherwise, we have
Z
(φ(yf (x)) − ψ(x, y)) dP (y|x)
{−1,1}

   
1 1
= φ(f (x)) − log η(x) + φ(−f (x)) − log (1 − η(x))
η(x) 1 − η(x)
     
η(x) η(x)
= φ (f (x)) − φ log η(x) + φ (−f (x)) − φ − log (1 − η(x))
1 − η(x) 1 − η(x)
    
δ 1 − η(x)
≤ φ(f (x))η(x) + φ log − φ log (1 − η(x))
1−δ η(x)
≤ φ(f (x))η(x) ≤ φ(f (x)) ≤ φ(b).
Therefore, no matter whether η(x) = 1 or η(x) 6= 1, we have
Z
(φ(yf (x)) − ψ(x, y)) dP (y|x) ≤ φ(b),
{−1,1}

which means that


Z Z
I3 = (φ(yf (x)) − ψ(x, y)) dP (y|x)dPX (x)
Ω3 {−1,1}
Z (C.39)
≤ φ(b)dPX (x) = φ(b)PX (Ω3 ).
Ω3

The desired inequality then follows immediately from (C.37), (C.38), (C.39) and (C.36).
Thus we complete the proof.

Lemma C.9 Let δ ∈ (0, 1/2), a ∈ [δ, 1−δ], f ∈ − log 1−δ 1−δ −t
 
δ , log δ , and φ(t) = log(1+e )
be the logistic loss. Then there hold
H(a, f ) ≤ Γ · G(a, f )

with Γ = 5000 |log δ|2 ,


  2   2
a a
H(a, f ) := a · φ(f ) − φ log + (1 − a) · φ(−f ) − φ − log ,
1−a 1−a
and
   
a a
G(a, f ) := aφ(f ) + (1 − a)φ(−f ) − aφ log − (1 − a)φ − log
1−a 1−a
1 1
= aφ(f ) + (1 − a)φ(−f ) − a log − (1 − a) log .
a 1−a
Proof In this proof, we will frequently use elementary inequalities
 
1 1
x log ≤ min 1 − x, (1 − x) · log , ∀ x ∈ [1/2, 1), (C.40)
x 1−x
and    
1 3 − 3x 1
− log − 2 < − log 7 ≤ − log exp log −1
1−x x 1−x
(C.41)
x 1
< log < 2 + log , ∀ x ∈ [1/2, 1).
1−x 1−x

We first show that


aφ(f )
G(a, f ) ≥
3     (C.42)
1 3 − 3a 1
provided ≤ a ≤ 1 − δ and f ≤ − log exp log −1 .
2 a 1−a
   
Indeed, if 1/2 ≤ a ≤ 1 − δ and f ≤ − log exp 3−3a
a log 1
1−a − 1 , then
    
2 2 3 − 3a 1 1
· aφ(f ) ≥ · aφ − log exp log −1 = (2 − 2a) · log
3 3 a 1−a 1−a
1 1
≥ a log + (1 − a) log ,
a 1−a
which means that
1 1 aφ(f )
G(a, f ) ≥ aφ(f ) − a log − (1 − a) log ≥ .
a 1−a 3
This proves (C.42).
We next show that
1−a a 2
G(a, f ) ≥ f − log
18 1−a (C.43)
1 1 1
provided ≤ a ≤ 1 − δ and −2 − log ≤ f ≤ 2 + log .
2 1−a 1−a
1 1
Indeed, if 1/2 ≤ a ≤ 1 − δ and −2 − log 1−a ≤ f ≤ 2 + log 1−a , then it follows from Lemma
C.6 that
2
a
f − log 1−a
G(a, f ) ≥    
1 1
4 + 2 exp 2 + log 1−a + 2 exp −2 − log 1−a
2 2 2
a a a
f − log 1−a (1 − a) · f − log 1−a (1 − a) · f − log 1−a
≥ 1 ≥ ≥ ,
5 + 15 · 1−a
5 − 5a + 15 18

which proves (C.43).


We then show
1−δ 1−δ
H(a, f ) ≤ Γ · G(a, f ) provided 1/2 ≤ a ≤ 1 − δ and − log ≤ f ≤ log (C.44)
δ δ
by considering the following four cases.
1
Case I. 1/2 ≤ a ≤ 1 − δ and 2 + log 1−a ≤ f ≤ log 1−δ
δ . In this case we have
 
1 δ 1
log = φ log ≥ φ(−f ) = log(1 + ef ) ≥ f ≥ 2 + log
δ 1−δ 1−a
  (C.45)
a 1 1
> φ − log = log ≥ log > 0,
1−a 1−a a

which, together with (C.40), yields


1
1 + log 1−a
 
1 1 1
a log + (1 − a) log ≤ (1 − a) · 1 + log ≤ (1 − a) · 1 · φ(−f ).
a 1−a 1−a 2 + log 1−a

Consequently,
1 1
G(a, f ) ≥ (1 − a) · φ(−f ) − a log − (1 − a) log
a 1−a
1
1 + log 1−a
≥ (1 − a) · φ(−f ) − (1 − a) · 1 · φ(−f ) (C.46)
2 + log 1−a
(1 − a) · φ(−f ) (1 − a) · φ(−f )
= 1 ≥ .
2 + log 1−a 4 log 1δ
1 a
On the other hand, it follows from f ≥ 2 + log 1−a > log 1−a that
   
a a
0 ≤ φ log − φ(f ) < φ log ,
1−a 1−a
which, together with (C.40) and (C.45), yields
 2   2
a a
a · φ(f ) − φ log ≤ a · φ log
1−a 1−a
2
(C.47)
1 1
= a · log ≤ (1 − a) · log ≤ (1 − a) · φ(−f ).
a a
 
a
Besides, it follows from (C.46) that 0 ≤ φ(−f ) − φ − log 1−a ≤ φ(−f ). Consequently,
  2
a 1
(1 − a) · φ(−f ) − φ − log ≤ (1 − a) · φ(−f )2 ≤ (1 − a) · φ(−f ) · log . (C.48)
1−a δ
Combining (C.46), (C.47) and (C.48), we deduce that
1 Γ
H(a, f ) ≤ (1 − a) · φ(−f ) · 1 + log ≤ (1 − a) · φ(−f ) · ≤ Γ · G(a, f ),
δ 4 log 1δ

which proves the desired inequality.    


Case II. 1/2 ≤ a ≤ 1 − δ and − log exp 3−3a
a log 1
1−a − 1 1
≤ f < 2 + log 1−a . In
1 1
this case, we have −2 − log 1−a ≤ f ≤ 2 + log 1−a , where we have used (C.41). Therefore,
2
it follows from (C.43) that G(a, f ) ≥ 1−a a
18 f − log 1−a . On the other hand, it follow from
(C.41) and Taylor’s Theorem that there exists
   
3 − 3a 1
− log 7 ≤ − log exp log −1
a 1−a
a a 1
≤ f ∧ log ≤ ξ ≤ f ∨ log ≤ 2 + log ,
1−a 1−a 1−a

such that
  2
a
a · φ(f ) − φ log
1−a
2 a 2 a 2
= a · φ0 (ξ) · f − log ≤ a · e−2ξ · f − log
1−a 1−a
     2
3 − 3a 1 a
≤ a · exp(log 7) · exp log exp log −1 · f − log
a 1−a 1−a
Z 3−3a log 1 2
a 1−a a
= 7a · et dt · f − log
0 1 − a
(C.49)
a 2
 
3 − 3a 1 3 − 3a 1
≤ 7a · log · exp log · f − log
a 1−a a 1−a 1−a
2
3 − 3a 1 a
≤ 7a · log · (1 + exp (log 7)) · f − log
a 1−a 1−a
2
1 a
≤ 168 · (1 − a) · log · f − log
1−a 1−a
2
1 a
≤ 168 · (1 − a) · log · f − log .
δ 1−a

Besides, we have

  2
a
(1 − a) · φ(−f ) − φ − log
1−a
2 2
(C.50)
a a
≤ |1 − a| · φ0 R · f − log ≤ |1 − a| · f − log .
1−a 1−a

2
1−a a
Combining (C.49), (C.50) and the fact that G(a, f ) ≥ 18 f − log 1−a , we deduce that

2 2
1 a a
H(a, f ) ≤ 168 · (1 − a) · log · f − log + |1 − a| · f − log
δ 1−a 1−a
2 2
1 a 1−a a
≤ 170 · (1 − a) · log · f − log ≤Γ· · f − log ≤ Γ · G(a, f ),
δ 1−a 18 1−a

which proves the desired inequality.


   
a
Case III. 1/2 ≤ a ≤ 1 − δ and − log 1−a ≤ f < − log exp 3−3a 1
a log 1−a − 1 . In this
case, we still have (C.50). Besides, it follows from (C.42) that G(a, f ) ≥ aφ(f )
3 . Moreover,
1 1
by (C.41) we obtain −2 − log 1−a < f < 2 + log 1−a , which, together with (C.43), yields
2    
G(a, f ) ≥ 1−a
18 f − log a
1−a . In addition, since f < − log exp 3−3a
a log 1
1−a − 1 ≤

 
a a
log 1−a , we have that 0 < φ(f ) − φ log 1−a < φ(f ), which means that
 2
a
a · φ(f ) − φ log ≤ a · |φ(f )|2
1−a
  (C.51)
a 1 1
≤ aφ(f )φ − log = aφ(f ) log ≤ aφ(f ) log .
1−a 1−a δ
Combining all these inequalities, we obtain
2
1 a
H(a, f ) ≤ aφ(f ) · log + |1 − a| · f − log
δ 1−a
Γaφ(f ) 1−a a 2
≤ +Γ· · f − log
6 36 1−a
Γ · G(a, f ) Γ · G(a, f )
≤ + = Γ · G(a, f ),
2 2
which proves the desired inequality.n    o
Case IV. − log 1−δδ ≤ f < min − log a
1−a , − log exp 3−3a
a log 1
1−a − 1 and 1/2 ≤
a ≤ 1 − δ. In this case, we still have G(a, f ) ≥ aφ(f
3
)
according to (C.42). Besides, it follows
from
    
a 3 − 3a 1 a a
f < min − log , − log exp log −1 ≤ − log ≤ log
1−a a 1−a 1−a 1−a
that     
a a
0 ≤ min φ − log − φ(−f ), φ(f ) − φ log
1−a 1−a
    
a a
≤ max φ − log − φ(−f ), φ(f ) − φ log (C.52)
1−a 1−a
   
a
≤ max φ − log , φ(f ) = φ(f ).
1−a
aφ(f )
Combining (C.52) and the fact that G(a, f ) ≥ 3 , we deduce that
 
2 2 1−δ
H(a, f ) ≤ a · |φ(f )| + (1 − a) · |φ(f )| ≤ φ(f )φ − log
δ
1 Γaφ(f )
= φ(f ) log ≤ ≤ Γ · G(a, f ),
δ 3
which proves the desired inequality.
Combining all these four cases, we conclude that (C.44) has been proved. Furthermore,
(C.44) yields that
H(a, f ) = H(1 − a, −f ) ≤ Γ · G(1 − a, −f ) = Γ · G(a, f )
provided δ ≤ a ≤ 1/2 and − log 1−δ 1−δ
δ ≤ f ≤ log δ , which, together with (C.44), proves this
lemma.

Lemma C.10 Let φ(t) = log 1 + e−t be the logistic loss, δ0 ∈ (0, 1/3), d ∈ N and P be a


Borel probability measure on [0, 1]d × {−1, 1} of which the conditional probability function
[0, 1]d 3 z 7→ P ({1} |z) ∈ [0, 1] is denoted by η. Then there exists a measurable function
 
d 10 log(1/δ0 )
ψ : [0, 1] × {−1, 1} → 0, log
δ0
such that
Z n o
ψ (x, y)dP (x, y) = inf RφP (g) g : [0, 1]d → R is measurable (C.53)
[0,1]d ×{−1,1}

and Z
(φ (yf (x)) − ψ(x, y))2 dP (x, y)
[0,1]d ×{−1,1}
Z (C.54)
2
≤ 125000 |log δ0 | · (φ (yf (x)) − ψ(x, y)) dP (x, y)
[0,1]d ×{−1,1}
h i
δ0 1−δ0
for any measurable f : [0, 1]d → log 1−δ 0
, log δ0 .

Proof Let
    
t log 1 + (1 − t) log
 1
, if ∈ (0, 1),
H : [0, 1] → [0, ∞), t 7→ t 1−t

0, if t ∈ {0, 1}.
     
δ0 4 1 δ0
Then it is easy to show that H 10 log(1/δ 0)
≤ 5 log 1−δ0 ≤ H log(1/δ0 ) . Thus there
1

exists δ1 ∈ 0, 3 such that  
4 1
H(δ1 ) ≤ log
5 1 − δ0
and
δ0 δ0
0< ≤ δ1 ≤ ≤ δ0 < 1/3.
10 log (1/δ0 ) log(1/δ0 )
Take
  
η(x)
φ y log , if η(x) ∈ [δ1 , 1 − δ1 ],

ψ : [0, 1]d × {−1, 1} → R, (x, y) 7→ 1 − η(x)

H(η(x)), if η(x) ∈
/ [δ1 , 1 − δ1 ],

which can be further expressed as

ψ : [0, 1]d × {−1, 1} → R,


  
η(x)
φ y log , if η(x) ∈ [δ1 , 1 − δ1 ],


1 − η(x)




(x, y) 7→ 0, if η(x) ∈ {0, 1},

1 1


η(x) log + (1 − η(x)) log , if η(x) ∈ (0, δ1 ) ∪ (1 − δ1 , 1).


η(x) 1 − η(x)

Obviously, ψ is a measurable function such that


1 10 log(1/δ0 )
0 ≤ ψ(x, y) ≤ log ≤ log , ∀ (x, y) ∈ [0, 1]d × {−1, 1},
δ1 δ0
and it follows immediately from Lemma C.3 that
h (C.53) holds. Wei next show (C.54).
d δ0 1−δ0
For any measurable function f : [0, 1] → log 1−δ0 , log δ0 and any x ∈ [0, 1]d , if
η(x) ∈
/ [δ1 , 1 − δ1 ], then we have
4 1
0 ≤ ψ(x, y) = H(η(x)) ≤ H(δ1 ) ≤ log
5 1 − δ0
 
4 1 − δ0 4
= φ log ≤ φ(yf (x)) ≤ φ(yf (x)), ∀ y ∈ {−1, 1}.
5 δ0 5

Hence 0 ≤ 15 φ(yf (x)) ≤ φ(yf (x)) − ψ(x, y) ≤ φ(yf (x)), ∀ y ∈ {−1, 1}, which means that
 
2 2 1 − δ0
(φ(yf (x)) − ψ(x, y)) ≤ φ(yf (x)) ≤ φ(yf (x))φ − log
δ0
1 1
= φ(yf (x)) · 5 log ≤ (φ(yf (x)) − ψ(x, y)) · 5000 |log δ1 |2 , ∀ y ∈ {−1, 1}.
5 δ0
Integrating both sides with respect to y, we obtain
Z
(φ(yf (x)) − ψ(x, y))2 dP (y|x)
{−1,1}
Z (C.55)
2
≤ 5000 |log δ1 | · (φ(yf (x)) − ψ(x, y)) dP (y|x).
{−1,1}

If η(x) ∈ [δ1 , 1 − δ1 ], then it follows from Lemma C.9 that


Z
(φ(yf (x)) − ψ(x, y))2 dP (y|x)
{−1,1}
  2   2
η(x) η(x)
= η(x) φ(f (x)) − φ log + (1 − η(x)) φ(−f (x)) − φ − log
1 − η(x) 1 − η(x)

≤ 5000 |log δ1 |2 · η(x)φ(f (x)) + (1 − η(x))φ(−f (x))


!
 η(x)   η(x) 
− η(x)φ log − (1 − η(x))φ − log
1 − η(x) 1 − η(x)
Z
2
= 5000 |log δ1 | (φ(yf (x)) − ψ(x, y)) dP (y|x),
{−1,1}

which means that (C.55) still holds. Therefore, (C.55) holds for all x ∈ [0, 1]d . We then
integrate both sides of (C.55) with respect to x and obtain
Z
(φ(yf (x)) − ψ(x, y))2 dP (x, y)
[0,1]d ×{−1,1}

Z
2
≤ 5000 |log δ1 | (φ(yf (x)) − ψ(x, y)) dP (x, y)
[0,1]d ×{−1,1}
Z
2
≤ 125000 |log δ0 | (φ(yf (x)) − ψ(x, y)) dP (x, y),
[0,1]d ×{−1,1}

which yields (C.54). In conclusion, the function ψ defined above has all the desired prop-
erties. Thus we complete the proof.

The following Lemma C.11 is similar to Lemma 3 of Schmidt-Hieber (2020).

Lemma C.11 Let (d, d? , d∗ , K) ∈ N4 , β ∈ (0, ∞), r ∈ [1, ∞), and q ∈ N ∪ {0}. Suppose
h0 , h1 , . . . , hq , h̃0 , h̃1 , . . . , h̃q are functions satisfying that

(i) dom(hi ) = dom(h̃i ) = [0, 1]K for 0 < i ≤ q and


dom(h0 ) = dom(h̃0 ) = [0, 1]d ;

(ii) ran(hi ) ∪ ran(h̃i ) ⊂ [0, 1]K for 0 ≤ i < q and


ran(hq ) ∪ ran(h̃q ) ⊂ R;
H (d , β, r) ∪ G M (d );
(iii) hq ∈ G∞ ∗ ∞ ?

(iv) For 0 ≤ i < q and 1 ≤ j ≤ K, the j-th coordinate


function of hi given by dom(hi ) 3 x 7→ (hi (x))j ∈ R
H (d , β, r) ∪ G M (d ).
belongs to G∞ ∗ ∞ ?

Then there holds

hq ◦ hq−1 ◦ · · · ◦ h1 ◦ h0 − h̃q ◦ h̃q−1 ◦ · · · ◦ h̃1 ◦ h̃0


[0,1]d
Pq−1 k q
(1∧β)q−k (C.56)
k=0 (1∧β)
X
≤ r · d1∧β
∗ · h̃k − hk .
dom(hk )
k=0

Proof We will prove this lemma by induction on q. The case q = 0 is trivial. Now assume
that q > 0 and that the desired result holds for q − 1. Consider the case q. For each
0 ≤ i < q and 1 ≤ j ≤ K, denote

h̃i,j : dom(h̃i ) → R, x 7→ h̃i (x) j ,

and 
hi,j : dom(hi ) → R, x 7→ hi (x) j .

Obviously, ran(h̃i,j ) ∪ ran(hi,j ) ⊂ [0, 1]. By induction hypothesis (that is, the case q − 1 of
this lemma), we have that

hq−1,j ◦ hq−2 ◦ hq−3 ◦ · · · ◦ h0 − h̃q−1,j ◦ h̃q−2 ◦ h̃q−3 ◦ · · · ◦ h̃0


[0,1]d
q−2
Pq−2 !
k (1∧β)q−1−k
k=0 (1∧β)
X
≤ r · d1∧β
∗ · h̃q−1,j − hq−1,j + h̃k − hk
dom(hq−1,j ) dom(hk )
k=0

Pq−2 k q−1
k=0 (1∧β)
X (1∧β)q−1−k
≤ r· d1∧β
∗ · h̃k − hk , ∀ j ∈ Z ∩ (0, K].
dom(hk )
k=0

Therefore,

hq−1 ◦ hq−2 ◦ hq−3 ◦ · · · ◦ h0 − h̃q−1 ◦ h̃q−2 ◦ h̃q−3 ◦ · · · ◦ h̃0


[0,1]d

= sup hq−1,j ◦ hq−2 ◦ hq−3 ◦ · · · ◦ h0 − h̃q−1,j ◦ h̃q−2 ◦ h̃q−3 ◦ · · · ◦ h̃0


j∈Z∩(0,K] [0,1]d (C.57)
Pq−2 k q−1
k=0 (1∧β)
X (1∧β)q−1−k
≤ r · d1∧β
∗ · h̃k − hk .
dom(hk )
k=0

We next show that


1∧β
hq (x) − hq (x0 ) ≤ r · d1∧β
∗ · x − x0 ∞
, ∀ x, x0 ∈ [0, 1]K (C.58)

by considering three cases.


Case I: hq ∈ G∞H (d , β, r) and β > 1. In this case, we must have that h ∈ G H (d , β, r)
∗ q K ∗
since dom(hq ) = [0, 1]K . Therefore, there exist I ⊂ {1, 2, . . . , K} and g ∈ Brβ [0, 1]d∗ such
that #(I) = d∗ and hq (x) = g((x)I ) for all x ∈ [0, 1]K . Denote λ := β + 1 − dβe. We then
use Taylor’s formula to deduce that
∃ ξ∈[0,1]d∗
hq (x) − hq (x0 ) = g((x)I ) − g((x0 )I ) ======== ∇g(ξ) · (x)I − (x0 )I


≤ k∇g(ξ)k∞ · (x)I − (x0 )I 1


≤ k∇gk[0,1]d · d∗ · (x)I − (x0 )I ∞
0 0
≤ kgkC β−λ,λ ([0,1]d ) · d∗ · (x)I − (x )I ∞
≤ r · d∗ · (x)I − (x )I ∞
1∧β
≤r· d1∧β
∗ · x− x0 ∞ 0
, ∀ x, x ∈ [0, 1] , K

which yields (C.58).


Case II: hq ∈ G∞ H (d , β, r) and β ≤ 1. In this case, we still have that h ∈ G H (d , β, r).
∗ q K ∗
Therefore, there exist I ⊂ {1, 2, . . . , K} and g ∈ Brβ [0, 1]d∗ such that #(I) = d∗ and


hq (x) = g((x)I ) for all x ∈ [0, 1]K . Consequently,

β |g(z) − g(z 0 )|
hq (x) − hq (x0 ) = g((x)I ) − g((x0 )I ) ≤ (x)I − (x0 )I 2
· sup
[0,1]d∗ 3z6=z 0 ∈[0,1]d∗ kz − z 0 kβ2
β β p β
≤ (x)I − (x0 )I 2
· kgkC 0,β ([0,1]d ) ≤ (x)I − (x0 )I 2
·r ≤r· d∗ · x − x0 ∞
1∧β
≤ r · d1∧β
∗ · x − x0 ∞
, ∀ x, x0 ∈ [0, 1]K ,

which yields (C.58).


Case III: hq ∈ G∞ M (d ). In this case, we have that there exists I ⊂ {1, 2, . . . , K} such
?
that 1 ≤ #(I) ≤ d? and hq (x) = max (x)i i ∈ I for all x ∈ [0, 1]K . Consequently,


hq (x) − hq (x0 ) = max (x)i i ∈ I − max (x0 )i i ∈ I ≤ (x)I − (x0 )I ∞


 

1∧β
≤ r · d1∧β
∗ · x − x0 ∞
≤ r · d1∧β
∗ · x − x0 ∞
, ∀ x, x0 ∈ [0, 1]K ,

which yields (C.58).


Combining the above three cases, we deduce that (C.58) always holds true. From (C.58)
and (C.57) we obtain that

hq ◦ hq−1 ◦ · · · ◦ h0 (x) − h̃q ◦ h̃q−1 ◦ · · · ◦ h̃0 (x)

≤ hq ◦ hq−1 ◦ · · · ◦ h0 (x) − hq ◦ h̃q−1 ◦ · · · ◦ h̃0 (x)

+ hq ◦ h̃q−1 ◦ · · · ◦ h̃0 (x) − h̃q ◦ h̃q−1 ◦ · · · ◦ h̃0 (x)


1∧β
≤ r · d1∧β
∗ · hq−1 ◦ · · · ◦ h0 (x) − h̃q−1 ◦ · · · ◦ h̃0 (x) + hq − h̃q
∞ dom(hq )
1∧β
≤ r · d1∧β
∗ · hq−1 ◦ · · · ◦ h0 − h̃q−1 ◦ · · · ◦ h̃0 + hq − h̃q
[0,1]d dom(hq )
Pq−2 q−1 1∧β
k (1∧β)q−1−k
k=0 (1∧β)
X
≤ r · d1∧β
∗ · r · d1∧β
∗ · h̃k − hk + hq − h̃q
dom(hk ) dom(hq )
k=0
Pq−2 q−1 1∧β
k+1 (1∧β)q−1−k
k=0 (1∧β)
X
= r · d1∧β
∗ · r · d1∧β
∗ · h̃k − hk + hq − h̃q
dom(hk ) dom(hq )
k=0
Pq−2 q−1 1∧β
k+1 (1∧β)q−1−k
k=0 (1∧β)
X
≤r· d1∧β
∗ · r· d1∧β
∗ · h̃k − hk + hq − h̃q
dom(hk ) dom(hq )
k=0
Pq−2 q−1 1∧β
k+1 (1∧β)q−1−k
k=0 (1∧β)
X
≤r· d1∧β
∗ · r· d1∧β
∗ · h̃k − hk + hq − h̃q
dom(hk ) dom(hq )
k=0
Pq−1 k q
k=0 (1∧β)
X (1∧β)q−k
= r· d1∧β
∗ · h̃k − hk , ∀ x ∈ [0, 1]d .
dom(hk )
k=0

Therefore,

hq ◦ hq−1 ◦ · · · ◦ h1 ◦ h0 − h̃q ◦ h̃q−1 ◦ · · · ◦ h̃1 ◦ h̃0


[0,1]d

= sup hq ◦ hq−1 ◦ · · · ◦ h1 ◦ h0 (x) − h̃q ◦ h̃q−1 ◦ · · · ◦ h̃1 ◦ h̃0 (x)


x∈[0,1]d
Pq−1 k q
k=0 (1∧β)
X (1∧β)q−k
≤ r· d1∧β
∗ · h̃k − hk ,
dom(hk )
k=0

meaning that the desired result holds for q.


In conclusion, according to mathematical induction, we have that the desired result
holds for all q ∈ N ∪ {0}. This completes the proof.

Lemma C.12 Let $k$ be a positive integer. Then there exists a neural network
$$
\tilde f\in\mathcal{F}_k^{\mathrm{FNN}}\bigg(1+2\cdot\Big\lceil\frac{\log k}{\log2}\Big\rceil,\ 2k,\ 26\cdot2^{\lceil\frac{\log k}{\log2}\rceil}-20-2\cdot\Big\lceil\frac{\log k}{\log2}\Big\rceil,\ 1,\ 1\bigg)
$$
such that
$$
\tilde f(x)=\|x\|_\infty,\quad\forall\,x\in\mathbb{R}^k.
$$

Proof We argue by induction.


Firstly, consider the case k = 1. Define

f˜1 : R → R, x 7→ σ(x) + σ(−x).

Obviously,

f˜1 ∈ F1FNN (1, 2, 6, 1, 1)


log 1
d log 2e
     
FNN log 1 log 1
⊂ F1 1+2· , 2 · 1, 26 · 2 − 20 − 2 · , 1, 1
log 2 log 2

and f˜(x) = σ(x) + σ(−x) = |x| = kxk∞ for all x ∈ R = R1 . This proves the k = 1 case.
Now assume that the desired result holds for k = 1, 2, 3, . . . , m−1 (m ≥ 2), and consider
the case k = m. Define
$$
\tilde g_1:\mathbb{R}^m\to\mathbb{R}^{\lfloor\frac m2\rfloor},\quad x\mapsto\Big((x)_1,(x)_2,\cdots,(x)_{\lfloor\frac m2\rfloor-1},(x)_{\lfloor\frac m2\rfloor}\Big),
$$
$$
\tilde g_2:\mathbb{R}^m\to\mathbb{R}^{\lceil\frac m2\rceil},\quad x\mapsto\Big((x)_{\lfloor\frac m2\rfloor+1},(x)_{\lfloor\frac m2\rfloor+2},\cdots,(x)_{m-1},(x)_m\Big),
$$
and
$$
\begin{aligned}
\tilde f_m:\mathbb{R}^m\to\mathbb{R},\quad x\mapsto\ &\sigma\Big(\tfrac12\cdot\sigma\big(\tilde f_{\lfloor\frac m2\rfloor}(\tilde g_1(x))\big)-\tfrac12\cdot\sigma\big(\tilde f_{\lceil\frac m2\rceil}(\tilde g_2(x))\big)\Big)\\
&+\sigma\Big(\tfrac12\cdot\sigma\big(\tilde f_{\lceil\frac m2\rceil}(\tilde g_2(x))\big)-\tfrac12\cdot\sigma\big(\tilde f_{\lfloor\frac m2\rfloor}(\tilde g_1(x))\big)\Big)\\
&+\tfrac12\cdot\sigma\big(\tilde f_{\lfloor\frac m2\rfloor}(\tilde g_1(x))\big)+\tfrac12\cdot\sigma\big(\tilde f_{\lceil\frac m2\rceil}(\tilde g_2(x))\big).
\end{aligned}\tag{C.59}
$$
It follows from the induction hypothesis that


logd m
& '
2 e
 & m' l m & m' 
log 2 m log 2 log 2
f˜d m e ◦ g̃2 ∈ FmFNN 
1+2 ,2 , 26 · 2 − 20 − 2 , 1, 1
2 log 2 2 log 2

d log m
log 2 e
   l m   
FNN log m m log m
= Fm −1 + 2 ,2 , 13 · 2 − 18 − 2 , 1, 1
log 2 2 log 2
and
logb m
& '
2 c
 &
 ' j k &  ' 
log m m log 2 log m
f˜b m c FNN 
◦ g̃1 ∈ Fm 1+2 2
,2 , 26 · 2 − 20 − 2 2
, 1, 1
2 log 2 2 log 2

d log m
log 2 e
   j k   
FNN log m m log m
⊂ Fm −1 + 2 ,2 , 13 · 2 − 18 − 2 , 1, 1 ,
log 2 2 log 2

which, together with (C.59), yield

   l m
˜ FNN log m m jmk
fm ∈ Fm 2−1+2 ,2 +2 ,
log 2 2 2
d log m
log 2 e
    
log m log m
2 · 13 · 2 − 18 − 2 +2 + 16, 1, ∞ (C.60)
log 2 log 2
d log m
log 2 e
     
FNN log m log m
= Fm 1+ , 2m, 26 · 2 − 20 − 2 , 1, ∞
log 2 log 2

(cf. Figure C.6). Besides, it is easy to verify that

[Figure C.6: The network $\tilde f_m$, combining the sub-networks $\tilde f_{\lfloor m/2\rfloor}\circ\tilde g_1$ and $\tilde f_{\lceil m/2\rceil}\circ\tilde g_2$ as in (C.59), with input $x=(x',x'')$, $x'\in\mathbb{R}^{\lfloor m/2\rfloor}$ and $x''\in\mathbb{R}^{\lceil m/2\rceil}$.]

$$
\begin{aligned}
\tilde f_m(x)&=\max\Big\{\sigma\big(\tilde f_{\lfloor\frac m2\rfloor}(\tilde g_1(x))\big),\ \sigma\big(\tilde f_{\lceil\frac m2\rceil}(\tilde g_2(x))\big)\Big\}\\
&=\max\Big\{\sigma\big(\big\|\big((x)_1,\ldots,(x)_{\lfloor\frac m2\rfloor}\big)\big\|_\infty\big),\ \sigma\big(\big\|\big((x)_{\lfloor\frac m2\rfloor+1},\ldots,(x)_m\big)\big\|_\infty\big)\Big\}\\
&=\max\Big\{\big\|\big((x)_1,\ldots,(x)_{\lfloor\frac m2\rfloor}\big)\big\|_\infty,\ \big\|\big((x)_{\lfloor\frac m2\rfloor+1},\ldots,(x)_m\big)\big\|_\infty\Big\}\\
&=\max\Big\{\max_{1\le i\le\lfloor\frac m2\rfloor}|(x)_i|,\ \max_{\lfloor\frac m2\rfloor+1\le i\le m}|(x)_i|\Big\}
=\max_{1\le i\le m}|(x)_i|=\|x\|_\infty,\quad\forall\,x\in\mathbb{R}^m.
\end{aligned}\tag{C.61}
$$
Combining (C.60) and (C.61), we deduce that the desired result holds for k = m. Therefore,
according to mathematical induction, the desired result holds for all positive integers k. This completes the proof.
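The recursion (C.59) rests only on the elementary identity $\max\{a,b\}=\sigma\big(\tfrac a2-\tfrac b2\big)+\sigma\big(\tfrac b2-\tfrac a2\big)+\tfrac a2+\tfrac b2$ for $a,b\ge0$. The following Python sketch is our own illustration of this divide-and-conquer construction (ignoring the exact layer-by-layer bookkeeping of the network sizes) and checks that it reproduces the sup-norm.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def max_norm_net(x):
    """Recursive ReLU construction of Lemma C.12: computes the sup-norm of x."""
    x = np.asarray(x, dtype=float)
    if x.size == 1:
        return relu(x[0]) + relu(-x[0])            # base case: |x| = sigma(x) + sigma(-x)
    a = max_norm_net(x[: x.size // 2])             # left half  (g~_1 in the proof)
    b = max_norm_net(x[x.size // 2 :])             # right half (g~_2 in the proof)
    # max{a, b} for a, b >= 0, written with ReLUs only, as in (C.59):
    return (relu(0.5 * relu(a) - 0.5 * relu(b))
            + relu(0.5 * relu(b) - 0.5 * relu(a))
            + 0.5 * relu(a) + 0.5 * relu(b))

rng = np.random.default_rng(1)
for _ in range(5):
    v = rng.normal(size=rng.integers(1, 12))
    assert np.isclose(max_norm_net(v), np.linalg.norm(v, ord=np.inf))
```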

Lemma C.13 Let (ε, d, d? , d∗ , β, r) ∈ (0, 1/2] × N × N × N × (0, ∞) × (0, ∞) and f be a


function from [0, 1]d to R. Suppose f ∈ G∞H M
(d∗ , β, r ∨1)∪G∞ (d? ). Then there exist constants
E1 , E2 , E3 ∈ (0, ∞) only depending on (d∗ , β, r) and a neural network
 
1 d∗ d∗ 1
f˜ ∈ FdFNN 3 log d? + E1 log , 2d? + E2 ε− β , 52d? + E3 ε− β log , 1, ∞
ε ε
such that
sup f˜(x) − f (x) < 2ε.
x∈[0,1]d

Proof According to Corollary B.1, there exist constants E1 , E2 , E3 ∈ (6, ∞) only depending
on (d∗ , β, r), such that
(  )
FNN 1 − dβ∗ − dβ∗ 1
inf sup |g(x) − g̃(x)| g̃ ∈ Fd∗ E1 log , E2 t , E3 t log , 1, ∞
x∈[0,1]d∗ t t (C.62)
β d∗

≤ t, ∀ g ∈ Br∨1 [0, 1] , ∀ t ∈ (0, 1/2].
We next consider two cases.
M
Case I: f ∈ G∞ (d? ). In this case, we must have f ∈ GdM (d? ), since dom(f ) = [0, 1]d .
Therefore, there exists I ⊂ {1, 2, . . . , d}, such that 1 ≤ #(I) ≤ d? and
f (x) = max (x)i i ∈ I , ∀ x ∈ [0, 1]d .


According to Lemma C.12, there exists


  !
  log #(I)  
FNN log #(I) log 2 log #(I)
g̃ ∈ F#(I) 1+2· , 2 · #(I), 26 · 2 − 20 − 2 · , 1, 1
log 2 log 2
d log d?
log 2 e
   
FNN log d?
⊂ F#(I) 1 + 2 · , 2d? , 26 · 2 , 1, 1
log 2
FNN
⊂ F#(I) (3 + 3 log d? , 2d? , 52d? , 1, 1)
(C.63)

such that
g̃(x) = kxk∞ , ∀ x ∈ R#(I) .
Define f˜ : Rd → R, x 7→ g̃((x)I ). Then it follows from (C.63) that

f˜ ∈ FdFNN (3 + 3 log d? , 2d? , 52d? , 1, 1)


 
FNN 1 − dβ∗ − dβ∗ 1
⊂ Fd 3 log d? + E1 log , 2d? + E2 ε , 52d? + E3 ε log , 1, ∞
ε ε
and

f (x) − f˜(x) = sup



sup max (x)i i ∈ I − g̃((x)I )
x∈[0,1]d x∈[0,1]d

= sup max |(x)i | i ∈ I − k(x)I k∞ = 0 < 2ε,
x∈[0,1]d

which yield the desired result.


H
Case II: f ∈ G∞ (d∗ , β, r ∨ 1). In this case, we must have f ∈ GdH (d∗ , β, r ∨ 1), since
d β
[0, 1]d∗ such

dom(f ) = [0, 1] . By definition, there exist I ⊂ {1, 2, . . . , d} and g ∈ Br∨1
d
that #(I) = d∗ and f (x) = g ((x)I ) for all x ∈ [0, 1] . Then
d∗
 it follows from (C.62) that
d∗
there exists g̃ ∈ FdFNN

E1 log 1ε , E2 ε− β , E3 ε− β log 1ε , 1, ∞ such that

sup |g(x) − g̃(x)| < 2ε.


x∈[0,1]d∗

Define f˜ : Rd → R, x 7→ g̃((x)I ). Then we have that


 
˜ FNN 1 − dβ∗ − dβ∗ 1
f ∈ Fd E1 log , E2 ε , E3 ε log , 1, ∞
ε ε
 
FNN 1 − dβ∗ − dβ∗ 1
⊂ Fd 3 log d? + E1 log , 2d? + E2 ε , 52d? + E3 ε log , 1, ∞
ε ε
and

sup f (x) − f˜(x) = sup |g((x)I ) − g̃((x)I )| = sup |g(x) − g̃(x)| < 2ε.
x∈[0,1]d x∈[0,1]d x∈[0,1]d∗

These yield the desired result again.


In conclusion, the desired result always holds. This completes the proof of this lemma.

Lemma C.14 Let β ∈ (0, ∞), r ∈ (0, ∞), q ∈ N ∪ {0}, and (d, d? , d∗ , K) ∈ N4 with d∗ ≤
min d, K + 1{0} (q) · (d − K) . Suppose f ∈ GdCHOM (q, K, d? , d∗ , β, r) and ε ∈ (0, 1/2].


Then there exist E7 ∈ (0, ∞) only depending on (d∗ , β, r, q) and



1 − d∗
f˜ ∈ FdFNN (q + 1) · 3 log d? + E7 log , 2Kd? + KE7 ε β·(1∧β)q ,
ε
 (C.64)
d∗
− β·(1∧β)q 1
(Kq + 1) · 63d? + E7 ε log , 1, ∞
ε

such that
ε
sup f (x) − f˜(x) ≤ . (C.65)
x∈[0,1]d 8

Proof By the definition of GdCHOM (q, K, d? , d∗ , β, r), there exist functions h0 , h1 , . . . , hq


such that

(i) dom(hi ) = [0, 1]K for 0 < i ≤ q and dom(h0 ) = [0, 1]d ;

(ii) ran(hi ) ⊂ [0, 1]K for 0 ≤ i < q and ran(hq ) ⊂ R;


H M
(iii) hq ∈ G∞ (d∗ , β, r ∨ 1) ∪ G∞ (d? );

(iv) For 0 ≤ i < q and 1 ≤ j ≤ K , the j -th coordinate function of hi given


H M
by dom(hi ) 3 x 7→ (hi (x))j ∈ R belongs to G∞ (d∗ , β, r ∨1)∪G∞ (d? );

(v) f = hq ◦ hq−1 ◦ · · · ◦ h2 ◦ h1 ◦ h0 .

Define Ω := (i, j) ∈ Z2 0 ≤ i ≤ q, 1 ≤ j ≤ K, 1{q} (i) ≤ 1{1} (j) . For each (i, j) ∈ Ω, de-


note di,j := K + 1{0} (i) · (d − K) and



hi,j : dom(hi ) → R, x 7→ hi (x) j .

Then it is easy to verify that,

dom(hi,j ) = [0, 1]di,j and hi,j ∈ G∞


H M
(d∗ , β, r ∨ 1) ∪ G∞ (d? ), ∀ (i, j) ∈ Ω, (C.66)

and
ran (hi,j ) ⊂ [0, 1], ∀ (i, j) ∈ Ω \ {(q, 1)} . (C.67)
Fix ε ∈ (0, 1/2]. Take
1
1 ε (1∧β)q ε/2 ε 1
δ := · ≤ ≤ ≤ .
2 8 · |(1 ∨ r) · d∗ |q · (q + 1) 8 · |(1 ∨ r) · d∗ |q · (q + 1) 8 16

According to (C.66) and Lemma C.13,  there exists a constant E1 ∈ (6, ∞) only depending
di,j
on (d∗ , β, r) and a set of functions g̃i,j : R → R (i,j)∈Ω , such that

1 d∗
g̃i,j ∈ FdFNN
i,j
3 log d? + E1 log , 2d? + E1 δ − β ,
δ
 (C.68)
− dβ∗ 1
52d? + E1 δ log , 1, ∞ , ∀(i, j) ∈ Ω
δ

and
sup |g̃i,j (x) − hi,j (x)| x ∈ [0, 1]di,j

≤ 2δ, ∀ (i, j) ∈ Ω. (C.69)
Define

E4 := 8 · |(1 ∨ r) · d∗ |q · (q + 1),

d∗
d∗ q
E5 := 2 β · E4β·(1∧β) ,
1 2 log E4
E6 := q
+ + 2 log 2,
(1 ∧ β) (1 ∧ β)q
E7 := E1 E6 + E1 E5 + 2E1 E5 E6 + 6,

Obviously, E4 , E5 , E6 , E7 are constants only depending on (d∗ , β, r, q). Next, define

h̃i,j : Rdi,j → R, x 7→ σ σ (g̃i,j (x)) − σ σ (g̃i,j (x)) − 1


 

for each (i, j) ∈ Ω \ {(q, 1)}, and define h̃q,1 := g̃q,1 . It follows from the fact
 
σ σ (z) − σ σ (z) − 1 ∈ [0, 1], ∀ z ∈ R

and (C.68) that


ran(h̃i,j ) ⊂ [0, 1], ∀ (i, j) ∈ Ω \ (q, 1) (C.70)
and

1 d∗
h̃i,j ∈ FdFNN
i,j
2 + 3 log d? + E1 log , 2d? + E1 δ − β ,
δ
 (C.71)
− dβ∗ 1
58d? + E1 δ log , 1, ∞ , ∀(i, j) ∈ Ω.
δ

Besides, it follows from the fact that


 
σ σ (z) − σ σ (z) − 1 − w ≤ |w − z| , ∀ z ∈ R, ∀ w ∈ [0, 1]

and (C.69) that n o


sup h̃i,j (x) − hi,j (x) x ∈ [0, 1]di,j
(C.72)
≤ sup |g̃i,j (x) − hi,j (x)| x ∈ [0, 1]di,j

≤ 2δ.
We then define
 >
h̃i : Rdi,1 → RK , x 7→ h̃i,1 (x), h̃i,2 (x), . . . , h̃i,K (x)

for each i ∈ {0, 1, . . . , q − 1}, and h̃q := h̃q,1 . From (C.70) we obtain

ran(h̃i ) ⊂ [0, 1]K ⊂ dom(h̃i+1 ), ∀ i ∈ {0, 1, . . . , q − 1} . (C.73)

Thus we can well define the function f˜ := h̃q ◦ h̃q−1 ◦ · · · ◦ h̃1 ◦ h̃0 , which is from Rd to R.
Since all the functions h̃i,j ((i, j) ∈ Ω) are neural networks satisfying (C.71), we deduce that
f˜ is also a neural network, which is comprised of all those networks h̃i,j through series and
parallel connection. Obviously, the depth of f˜ is less than or equal to
X  
1 + max the depth of h̃i,j ,
j
i

the width of f˜ is less than or equal to


X 
max the width of h̃i,j ,
i
j

the number of nonzero parameters of f˜ is less than or equal to


X    
the number of nonzero parameters h̃i,j + max the depth of h̃i,k ,
k
i,j

and the parameters of f˜ is bounded by 1 in absolute value. Thus we have that



1 d∗
˜
f ∈ FdFNN
(q + 1) · 3 + 3 log d? + E1 log , 2Kd? + KE1 δ − β ,
δ

− dβ∗ 1
(Kq + 1) · 63d? + 2E1 δ log , 1, ∞
δ
 log Eε4  − d∗
= FdFNN (q + 1) · 3 + 3 log d? + E1 · log 2 + q
, 2Kd? + KE1 E5 ε β·(1∧β)q ,
(1 ∧ β)
!
E4 
− d∗  log ε
(Kq + 1) · 63d? + 2E1 E5 ε β·(1∧β)q · log 2 + , 1, ∞
(1 ∧ β)q

1 − d∗
⊂ FdFNN (q + 1) · 3 + 3 log d? + E1 E6 log , 2Kd? + KE1 E5 ε β·(1∧β)q ,
ε

d∗
− β·(1∧β) q 1
(Kq + 1) · 63d? + 2E1 E5 ε E6 log , 1, ∞
ε

1 − d∗
⊂ FdFNN (q + 1) · 3 log d? + E7 log , 2Kd? + KE7 ε β·(1∧β)q ,
ε

d∗
− β·(1∧β)q 1
(Kq + 1) · 63d? + E7 ε log , 1, ∞ ,
ε
leading to (C.64). Moreover, it follows from (C.72) and Lemma C.11 that

sup f˜(x) − f (x) = sup h̃q ◦ · · · ◦ h̃0 (x) − hq ◦ · · · ◦ h0 (x)


x∈[0,1]d x∈[0,1]d

Pq−1 q (1∧β)q−i
i
i=0 (1∧β)
X
≤ (1 ∨ r) · d1∧β
∗ · sup h̃i (x) − hi (x)
di,1 ∞
i=0 x∈[0,1]
q q
X q−i X q ε
≤ |(1 ∨ r) · d∗ |q · |2δ|(1∧β) ≤ |(1 ∨ r) · d∗ |q · |2δ|(1∧β) = ,
i=0 i=0
8

which yields (C.65).


In conclusion, the constant E7 and the neural network f˜ have all the desired properties.
The proof of this lemma is then completed.

The next lemma aims to estimate the approximation error.

83
Zhang, Shi and Zhou

Lemma C.15 Let φ(t) = log (1 + e−t ) be the logistic loss, q ∈ N ∪ {0}, (β, r) ∈ (0, ∞)2 ,
(d, d? , d∗ , K) ∈ N4 with d∗ ≤ min d, K + 1{0} (q) · (d − K) , and P be a Borel probability
d
measure on  [0, 1] ×{−1, 1}. Suppose that there exists an η̂ ∈ GdCHOM (q, K, d? , d∗ , β, r) such
d
that PX ( x ∈ [0, 1] η̂(x) = P ({1} |x)) = 1. Then there exist constants D1 , D2 , D3 only
depending on (d? , d∗ , β, r, q) such that for any δ ∈ (0, 1/3),
  
φ FNN 1 −d∗ /β
q
−d∗ /β
q 1 1−δ
inf EP (f ) f ∈ Fd D1 log , KD2 δ (1∧β) , KD3 δ (1∧β) · log , 1, log
δ δ δ (C.74)
≤ 8δ.

Proof Denote by η the conditional probability function [0, 1]d 3 x 7→ P ({1} |x) ∈ [0, 1].
Fix δ ∈ (0, 1/3). Then it follows from Lemma C.14 that there exists

1 − d∗
η̃ ∈ FdFNN Cd? ,d∗ ,β,r,q log , KCd? ,d∗ ,β,r,q δ β·(1∧β)q ,
δ
 (C.75)
d∗
− β·(1∧β)q 1
KCd? ,d∗ ,β,r,q δ log , 1, ∞
δ

such that
sup |η̃(x) − η̂(x)| ≤ δ/8. (C.76)
x∈[0,1]d


Also, by Theorem 2.4 with a = ε = δ , b = 1 − δ and α = d∗ , there exists

  1
˜l ∈ F FNN 1 1 1 2β/d∗ 1
1 Cd∗ ,β log + 139 log , Cd∗ ,β · log ,
δ δ δ δ
  1     !
1 2β/d∗ 1 1 2
Cd∗ ,β · · log · log + 65440 (log δ) , 1, ∞
δ δ δ (C.77)
 
FNN 1 − dβ∗ − dβ∗ 1
⊂ F1 Cd∗ ,β log , Cd∗ ,β δ , Cd∗ ,β δ log , 1, ∞
δ δ
 
1 − d ∗ − d∗ 1
⊂ F1FNN Cd∗ ,β log , Cd∗ ,β δ β·(1∧β)q , Cd∗ ,β δ β·(1∧β)q log , 1, ∞
δ δ

such that
sup ˜l(t) − log t ≤ δ
(C.78)
t∈[δ,1−δ]

and
log δ ≤ ˜l(t) ≤ log (1 − δ) < 0, ∀ t ∈ R. (C.79)

Recall that the clipping function Πδ is given by



1 − δ,
 if t > 1 − δ,
Πδ : R → [δ, 1 − δ], t 7→ δ, if t < δ,

t, otherwise.

84
Classification with Deep Neural Networks and Logistic Loss

Define f˜ : R → R, x 7→ ˜l (Πδ (η̃(x)))− ˜l (1 − Πδ (η̃(x))). Consequently, we know from (C.75),


(C.77) and (C.79) that (cf. Figure C.7)

˜ FNN 1 − d∗
f ∈ Fd Cd? ,d∗ ,β,r,q log , KCd? ,d∗ ,β,r,q δ β·(1∧β)q ,
δ

d∗
− β·(1∧β)q 1 1−δ
KCd? ,d∗ ,β,r,q δ log , 1, log .
δ δ

Let Ω1 , Ω2 , Ω3 be defined in (C.35). Then it follows from (C.76) that

|Πδ (η̃(x)) − η(x)| = |Πδ (η̃(x)) − Πδ (η̂(x))| ≤ |η̃(x) − η̂(x)|


δ min {η(x), 1 − η(x)}
≤ ≤ , ∀ x ∈ Ω1 ,
8 8

which means that


 
Πδ (η̃(x)) 1 − Πδ (η̃(x))
min , ≥ 7/8, ∀ x ∈ Ω1 . (C.80)
η(x) 1 − η(x)

Combining (C.78) and (C.80), we obtain that

η(x)
f˜(x) − log
1 − η(x)
≤ ˜l (Πδ (η̃(x))) − log (η(x)) + ˜l (1 − Πδ (η̃(x))) − log (1 − η(x))

≤ ˜l (Πδ (η̃(x))) − log (Πδ (η̃(x))) + |log (Πδ (η̃(x))) − log (η(x))|

+ ˜l (1 − Πδ (η̃(x))) − log (1 − Πδ (η̃(x))) + |log (1 − Πδ (η̃(x))) − log (1 − η(x))|


≤δ+ sup log0 (t) · |Πδ (η̃(x)) − η(x)|
t∈[Πδ (η̃(x))∧η(x),∞)

+δ+ sup log0 (t) · |Πδ (η̃(x)) − η(x)|


t∈[min{1−Πδ (η̃(x)),1−η(x)},∞)

≤δ+ sup log0 (t) · |Πδ (η̃(x)) − η(x)|


t∈[7η(x)/8,∞)

+δ+ h
sup 
log0 (t) · |Πδ (η̃(x)) − η(x)|
7−7η(x)
t∈ 8
,∞

8 δ 8 δ
≤ 2δ + · + · , ∀ x ∈ Ω1 ,
7η(x) 8 7 − 7η(x) 8

meaning that

η(x) 8 δ 8 δ
f˜(x) − log ≤ 2δ + · + ·
1 − η(x) 7η(x) 8 7 − 7η(x) 8
(C.81)
δ 2 2
= 2δ + ≤ + < 1, ∀ x ∈ Ω1 .
7η(x)(1 − η(x)) 3 7

85
Zhang, Shi and Zhou

Output
˜l (Πδ (η̃(x))) − ˜l (1 − Πδ (η̃(x))) = f˜(x)

˜l ˜l

σ (δ + σ (η̃(x) − δ) − σ (η̃(x) − 1 + δ)) = Πδ (η̃(x))

σ (1 − δ − σ (η̃(x) − δ) + σ (η̃(x) − 1 + δ)) = 1 − Πδ (η̃(x))

σ (0 · η̃(x) + δ) = δ σ (η̃(x) − δ)
σ (η̃(x) − 1 + δ) σ (0 · η̃(x) + 1 − δ) = 1 − δ

η̃
··· ··· ···
······ x ∈ Rd
Input

Figure C.7: The network representing the function f˜.

Besides, note that

x ∈ Ω2 ⇒ η̃(x) ∈ [−ξ1 , δ + ξ1 ] ⇒ Πδ (η̃(x)) ∈ [δ, δ + ξ1 ]


⇒ ˜l (Πδ (η̃(x))) ∈ [log δ, δ + log (δ + ξ1 )]
as well as ˜l (1 − Πδ (η̃(x))) ∈ [−δ + log(1 − δ − ξ1 ), log(1 − δ)]
ξ1 + δ 2δ 4δ
⇒ f˜(x) ≤ 2δ + log ≤ log 2 + log = log .
1 − ξ1 − δ 1 − 2δ 1 − 2δ

Therefore, by (C.79) and the definition of f˜, we have


δ 4δ 1 − 2δ
log ≤ f˜(x) ≤ log = − log , ∀ x ∈ Ω2 . (C.82)
1−δ 1 − 2δ 4δ
Similarly, we can show that
1 − 2δ 1−δ
log ≤ f˜(x) ≤ log , ∀ x ∈ Ω3 . (C.83)
4δ δ
Then it follows from (C.81), (C.82), (C.83) and Lemma C.8 that

1
 d∗



 f ∈ FdFNN Cd? ,d∗ ,β,r,q log , KCd? ,d∗ ,β,r,q δ β·(1∧β)q , 

 δ 
inf EPφ (f ) 
− d∗ 1 1−δ
KCd? ,d∗ ,β,r,q δ β·(1∧β)q log , 1, log

 

 
δ δ

86
Classification with Deep Neural Networks and Logistic Loss

   
  1 − 2δ 1 − 2δ
≤ EPφ˜
f ≤ φ log PX (Ω2 ) + φ log PX (Ω3 )
4δ 4δ
2
 
η(x)
Z  f˜(x) − log 1−η(x)
  
˜(x) ∧ log η(x) , f˜(x) ∨ log η(x)

+ sup t −t
t ∈ f dPX (x)
Ω1  2(2 + e + e )
 1 − η(x) 1 − η(x) 
2
 
η(x)
Z  f˜(x) − log 1−η(x)
 
η(x) η(x)


≤ sup t −t
t ∈ −1 + log , 1 + log dPX (x)
Ω1  2(2 + e + e )
 1 − η(x) 1 − η(x) 

1 + 2δ
+ PX (Ω2 ∪ Ω3 ) · log
1 − 2δ
Z 2
η(x)
≤ f˜ (x) − log · 2 · (1 − η(x))η(x)dPX (x) + 6δ
Ω1 1 − η(x)
Z 2
δ
≤ 2δ + · 2 · (1 − η(x))η(x)dPX (x) + 6δ
Ω1 7η(x)(1 − η(x))
δ2 δ2
Z
≤ dPX (x) + 6δ ≤ + 6δ < 8δ,
Ω1 (1 − η(x))η(x) δ(1 − δ)

which proves this lemma.


Now we are in the position to prove Theorem 2.2 and Theorem 2.3.
Proof [Proof of Theorem 2.2 and Theorem 2.3] We first prove Theorem 2.3. According to
Lemma C.15, there exist (D1 , D2 , D3 ) ∈ (0, ∞)3 only depending on (d? , d∗ , β, r, q) such that
d,β,r
(C.74) holds for any δ ∈ (0, 1/3) and any P ∈ H4,q,K,d ? ,d∗
. Take E1 = 1 + D1 , then E1 > 0
only depends on (d? , d∗ , β, r, q). We next show that for any constants a := (a2 , a3 ) ∈ (0, ∞)2
and b := (b1 , b2 , b3 , b4 , b5 ) ∈ (0, ∞)5 , there exist constants E2 ∈ (3, ∞) only depends on
(a, d? , d∗ , β, r, q, K) and E3 ∈ (0, ∞) only depending on (a, b, ν, d, d? , d∗ , β, r, q, K) such
that when n ≥ E2 , the φ-ERM fˆnFNN defined by (2.14) with
E1 · log n ≤ G ≤ b1 · log n,
 −d∗  −d∗
(log n)5 d∗ +β·(1∧β)q (log n)5 d∗ +β·(1∧β)q
 
a2 · ≤ N ≤ b2 · ,
n n
 −d∗  −d∗ (C.84)
(log n)5 d∗ +β·(1∧β)q (log n)5 d∗ +β·(1∧β)q
 
a3 · · log n ≤ S ≤ b3 · · log n,
n n
β · (1 ∧ β)q
· log n ≤ F ≤ b4 log n, and 1 ≤ B ≤ b5 · nν
d∗ + β · (1 ∧ β)q
must satisfy
q
 d β·(1∧β)
(log n)5 +β·(1∧β)q

h  i ∗
sup EP ⊗n EPφ fˆnFNN ≤ E3 ·
d,β,r
P ∈H4,q,K,d
n
? ,d∗
q (C.85)
 2d β·(1∧β)
(log n)5 +2β·(1∧β)q

h  i ∗
and sup EP ⊗n EP fˆnFNN ≤ E3 · ,
d,β,r
P ∈H4,q,K,d
n
? ,d∗

87
Zhang, Shi and Zhou

which will lead to the results of Theorem 2.3.


Let a := (a2 , a3 ) ∈ (0, ∞)2 and b := (b1 , b2 , b3 , b4 , b5 ) ∈ (0, ∞)5 be arbitrary and fixed.
Take
  β·(1∧β)q   β·(1∧β)q
D2 K d∗ D 3 E1 K d∗
D4 = 1 ∨ ∨ ,
a2 D1 a3

then D4 > 0 only depends on (a, d? , d∗ , β, r, q, K). Hence there exists E2 ∈ (3, ∞) only
depending on (a, d? , d∗ , β, r, q, K) such that

 β·(1∧β)q
(log t)5 (log t)5 d∗ +β·(1∧β)q

0< < D4 · < 1/4
t t (C.86)
< 1 < log t, ∀ t ∈ [E2 , ∞).

From now on we assume that n ≥ E2 , and (C.84) holds. We have to show that there exists
E3 ∈ (0, ∞) only depending on (a, b, ν, d, d? , d∗ , β, r, q, K) such that (C.85) holds.
d,β,r
Let P be an arbitrary probability in H4,q,K,d ? ,d∗
. Denote by η the conditional probability
function x 7→ P ({1} |x) of P . Then there exists an η̂ ∈ GdCHOM (q, K, d? , d∗ , β, r) such that
η̂ = η , PX -a.s.. Define
q
 d β·(1∧β)
(log n)5 +β·(1∧β)q


ζ := D4 · . (C.87)
n

−β·(1∧β)q
1
By (C.86), 0 < n d∗ +β·(1∧β)q ≤ ζ < 4 and there hold inequalities

β·(1∧β)q
 
1−ζ 1 q
log 2 < log ≤ log ≤ log n ∗
d +β·(1∧β) ≤ F, (C.88)
ζ ζ

β·(1∧β)q
 
1 q
D1 log ≤ D1 log n ∗d +β·(1∧β)
ζ (C.89)
≤ D1 log n ≤ max {1, D1 log n} ≤ E1 log n ≤ G,

and
−d∗
(log n)5 ∗ +β·(1∧β)q
 d
−d∗ /β −d∗ /β
KD2 ζ (1∧β)q = KD2 · D4 (1∧β)q ·
n
−d∗ /β
 β·(1∧β)q (1∧β)q −d∗
(log n)5 ∗ +β·(1∧β)q
  d
D2 K d∗
(C.90)
≤ KD2 · ·
a2 n
−d∗
(log n)5 ∗ +β·(1∧β)q
 d
= a2 · ≤ N.
n

88
Classification with Deep Neural Networks and Logistic Loss

Consequently,
−d∗
(log n)5 ∗ +β·(1∧β)q
 d
−d∗ /β
(1∧β)q
1 −d∗ /β 1
KD3 ζ · log = KD3 · D4 (1∧β)q · · log
ζ n ζ
−d∗ /β
 β·(1∧β)q (1∧β)q −d∗
(log n)5 ∗ +β·(1∧β)q
  d
D3 E1 K d∗ 1 (C.91)
≤ KD3 · · · log
D1 a3 n ζ
−d∗ −d∗
(log n)5 ∗ +β·(1∧β)q D1 · log ζ1 (log n)5 ∗ +β·(1∧β)q
 d  d
= a3 · · ≤ a3 · · log n ≤ S.
n E1 n
Then it follows from (C.74), (C.87), (C.89), (C.88), (C.90), and (C.91) that
n o
inf EPφ (f ) f ∈ FdFNN (G, N, S, B, F )
( !)
φ FNN 1 KD2 KD3 1 1−ζ
≤ inf EP (f ) f ∈ Fd D1 log , d∗ /β , d∗ /β · log , 1, log
ζ (1∧β)q ζ ζ (C.92)
ζ ζ (1∧β)q
 β·(1∧β)q
(log n)5 d∗ +β·(1∧β)q

≤ 8ζ = 8D4 · .
n

Besides, from (C.88) we know eF > 2. Hence by taking δ0 = eF1+1 in Lemma C.10, we
obtain immediately that there exists

ψ : [0, 1]d × {−1, 1} → 0, log (10eF + 10) · log eF + 1 ,


 
(C.93)

such that
Z n o
ψ (x, y)dP (x, y) = inf RφP (f ) | f : [0, 1]d → R is measurable , (C.94)
[0,1]d ×{−1,1}

and for any measurable f : [0, 1]d → [−F, F ],


Z
(φ (yf (x)) − ψ(x, y))2 dP (x, y)
[0,1]d ×{−1,1}
Z
F 2

≤ 125000 log 1 + e · (φ (yf (x)) − ψ(x, y)) dP (x, y) (C.95)
[0,1]d ×{−1,1}
Z
≤ 500000F 2 · (φ (yf (x)) − ψ(x, y)) dP (x, y).
[0,1]d ×{−1,1}

1
Moreover, it follows from Corollary A.1 with γ = n that

log W ≤ (S + Gd + 1)(2G + 5) log ((max {N, d} + 1)(2nG + 2n)B)


 −d∗
(log n)5 d∗ +β·(1∧β)q

2
≤ Cb,d · (log n) · · log ((max {N, d} + 1)(2nG + 2n)b5 nν )
n (C.96)
 −d∗  −d∗
(log n)5 d∗ +β·(1∧β)q (log n)5 d∗ +β·(1∧β)q
 
3 3
≤ Cb,d,ν · (log n) · = E4 · (log n) ·
n n

89
Zhang, Shi and Zhou

for some constant E4 ∈ (0, ∞) only depending on(b, d, ν), where


 
 FNN 1
W =3∨N f |[0,1]d f ∈ Fd (G, N, S, B, F ) , .
n
Also, note that

sup φ(t) = log 1 + eF ≤ log (10eF + 10) · log eF + 1 ≤ 7F.


 
(C.97)
t∈[−F,F ]

Therefore, by taking  = 21 , γ = n1 , Γ = 500000F 2 , M = 7F , and

F = f |[0,1]d f ∈ FdFNN (G, N, S, B, F )




in Theorem 2.1 and combining (C.93), (C.94), (C.95), (C.96), (C.92), we obtain
h  i    Z 
φ ˆFNN φ ˆFNN
EP ⊗n EP fn = EP ⊗n RP fn − ψ(x, y)dP (x, y)
[0,1]d ×{−1,1}
r  Z 
Γ log W 4 30M log W Γ log W φ
≤ 360 · + + + 30 · + 2 inf RP (f ) − ψdP
n n n n2 f ∈F
360Γ log W Γ log W Γ log W Γ log W
≤ + + + + 2 inf EPφ (f )
n n n n f ∈F
β·(1∧β)q

2 · 108 · F 2 · log W (log n)5 d∗ +β·(1∧β)q


 
≤ + 16D4 ·
n n
  −d∗ q
109 · |b4 log n|2 · E4 · (log n)3 · (lognn)
5 d∗ +β·(1∧β)  β·(1∧β)q
(log n)5 d∗ +β·(1∧β)q

≤ + 16D4 ·
n n
q
β·(1∧β)
  (log n)5  d∗ +β·(1∧β)q   β·(1∧β)q
5 d∗ +β·(1∧β)q
 (log n)
= 16D4 + 109 · |b4 |2 · E4 · ≤ E3 ·
n n
with  
E3 := 4 · 16D4 + 109 · |b4 |2 · E4 + 4

only depending on (a, b, ν, d, d? , d∗ , β, r, q, K). We then apply the calibration inequality


(2.21) and conclude that
"r #

h  i   r h  i
φ
EP ⊗n EP f ˆFNN
≤ 2 2 · EP ⊗n
n E f ˆFNN
P ≤ 4 · EP ⊗n E φ fˆFNN
n P n

v
q
  (log n)5  d∗β·(1∧β)
u
+β·(1∧β)q
u
≤4·
t 2
16D4 + 109 · |b4 | · E4 · (C.98)
n
q
 2d β·(1∧β)
(log n)5 +2β·(1∧β)q


≤ E3 · .
n
Since P is arbitrary, the desired bound (C.85) follows. Setting c = E1 completes the proof
of Theorem 2.3.

90
Classification with Deep Neural Networks and Logistic Loss

Now it remains to show Theorem 2.2. Indeed, it follows from (2.33) that

H1d,β,r ⊂ H4,0,1,1,d
d,β,r
.

Then by taking q = 0, d∗ = d and d? = K = 1 in Theorem 2.3, we obtain that there exists


a constant c ∈ (0, ∞) only depending on (d, β, r) such that the estimator fˆnFNN defined by
(2.14) with

 −d  −d
(log n)5 d+β·(1∧β)0 (log n)5 d+β
 
c log n ≤ G . log n, N  = ,
n n
 −d  −d
(log n)5 d+β·(1∧β)0 (log n)5 d+β
 
S · log n = · log n,
n n
β β · (1 ∧ β)0
1 ≤ B . nν , and · log n = · log n ≤ F . log n
d+β d + β · (1 ∧ β)0

must satisfy
h  i h  i
sup EP ⊗n EPφ fˆnFNN ≤ sup EP ⊗n EPφ fˆnFNN
P ∈H1d,β,r d,β,r
P ∈H4,0,1,1,d

β·(1∧β)0 β
(log n)5 (log n)5
    d+β
d+β·(1∧β)0
. =
n n

and
h  i h  i
sup EP ⊗n EP fˆnFNN ≤ sup EP ⊗n EP fˆnFNN
P ∈H1d,β,r d,β,r
P ∈H4,0,1,1,d

β·(1∧β)0 β
(log n)5 (log n)5
    2d+2β
2d+2β·(1∧β)0
. = .
n n

This completes the proof of Theorem 2.2.

C.5 Proof of Theorem 2.5


Appendix C.5 is devoted to the proof of Theorem 2.5. To this end, we need the following
lemmas. Note that the logistic loss is given by φ(t) = log(1 + e−t ) with φ0 (t) = − 1+e
1
t ∈
t
00 e 1 1
(−1, 0) and φ (t) = (1+et )2 = et +e−t +2 ∈ (0, 4 ] for all t ∈ R.

 
1+η0
Lemma C.16 Let η0 ∈ (0, 1), F0 ∈ 0, log 1−η0
, a ∈ [−F0 , F0 ], φ(t) = log(1 + e−t ) be the
logistic loss, d ∈ N, and P be a Borel probability measure on [0, 1]d × {−1, 1} of which the
conditional probability function [0, 1]d 3 z 7→ P ({1} |z) ∈ [0, 1] is denoted by η . Then for

91
Zhang, Shi and Zhou

any x ∈ [0, 1]d such that |2η(x) − 1| > η0 , there holds


 
1 − η0 0 η0 + 1 0
0 ≤ |a − F0 sgn(2η(x) − 1)| · φ (−F0 ) − φ (F0 )
2 2
 
1 − η0 0 η0 + 1 0
≤ |a − F0 sgn(2η(x) − 1)| · φ (−F0 ) − φ (F0 )
2 2
1 (C.99)
+ −F F
|a − F0 sgn(2η(x) − 1)|2
2 (e 0 + e + 2)
0
Z
≤ (φ(ya) − φ(yF0 sgn(2η(x) − 1))) dP (y|x)
{−1,1}

≤ |a − F0 sgn(2η(x) − 1)| + F02 .

Proof Given x ∈ [0, 1]d , recall the function Vx defined in the proof of Lemma C.7. By
Taylor expansion, there exists ξ between a and F0 sgn(2η(x) − 1) such that
Z
(φ(ya) − φ(yF0 sgn(2η(x) − 1))) dP (y|x)
{−1,1}

= Vx (a) − Vx (F0 sgn(2η(x) − 1))


1
= (a − F0 sgn(2η(x) − 1)) · Vx0 (F0 sgn(2η(x) − 1)) + |a − F0 sgn(2η(x) − 1)|2 · Vx00 (ξ).
2
(C.100)
Since ξ ∈ [−F0 , F0 ], we have

1
0≤ = inf {φ00 (t) | t ∈ [−F0 , F0 ]}
e−F0
+ eF0 + 2
1
≤ Vx00 (ξ) = η(x)φ00 (ξ) + (1 − η(x))φ00 (−ξ) ≤
4
and then
1 1
0≤ |a − F0 sgn(2η(x) − 1)|2 · −F
2 e 0 + eF0 + 2
1
≤ |a − F0 sgn(2η(x) − 1)|2 · Vx00 (ξ) (C.101)
2
1 1 1
≤ (|a| + F0 )2 · ≤ F02 .
2 4 2
On the other hand, if 2η(x) − 1 > η0 , then

(a − F0 sgn(2η(x) − 1)) · Vx0 (F0 sgn(2η(x) − 1))


= (a − F0 ) (η(x)φ0 (F0 ) − (1 − η(x))φ0 (−F0 ))
= |a − F0 | ((1 − η(x))φ0 (−F0 ) − η(x)φ0 (F0 )))
  
1 + η0 0 1 + η0 0
≥ |a − F0 | 1− φ (−F0 ) − φ (F0 )
2 2
 
1 − η0 0 1 + η0 0
= |a − F0 sgn(2η(x) − 1)| · φ (−F0 ) − φ (F0 ) .
2 2

92
Classification with Deep Neural Networks and Logistic Loss

Similarly, if 2η(x) − 1 < −η0 , then

(a − F0 sgn(2η(x) − 1)) · Vx0 (F0 sgn(2η(x) − 1))


= (a + F0 ) (η(x)φ0 (−F0 ) − (1 − η(x))φ0 (F0 ))
= |a + F0 | (η(x)φ0 (−F0 ) − (1 − η(x))φ0 (F0 )))
   
1 − η0 0 1 − η0 0
≥ |a + F0 | φ (−F0 ) − 1 − φ (F0 )
2 2
 
1 − η0 0 1 + η0 0
= |a − F0 sgn(2η(x) − 1)| · φ (−F0 ) − φ (F0 ) .
2 2

Therefore, for given x ∈ [0, 1]d satisfying |2η(x) − 1| > η0 , there always holds

(a − F0 sgn(2η(x) − 1)) · Vx0 (F0 sgn(2η(x) − 1))


(C.102)
 
1 − η0 0 1 + η0 0
≥ |a − F0 sgn(2η(x) − 1)| · φ (−F0 ) − φ (F0 ) .
2 2

We next show that 1−η0 0


2 φ (−F0 ) −
1+η0 0
2 φ (F0 ) > 0. Indeed, let g(t) =
1−η0 0
2 φ (−t) −
1+η0 0 0 1−η0 00 1+η0 00
Then g (t) = − 2 φ (−t) − 2 φ (t) < 0, i.e., g is strictly decreasing, and thus
2 φ (t).
 
1 − η0 0 1 + η0 0 1 + η0
φ (−F0 ) − φ (F0 ) = g(F0 ) > g log = 0. (C.103)
2 2 1 − η0

Moreover, we also have

(a − F0 sgn(2η(x) − 1)) · Vx0 (F0 sgn(2η(x) − 1))


≤ |a − F0 sgn(2η(x) − 1)| · |Vx0 (F0 sgn(2η(x) − 1))|
= |a − F0 sgn(2η(x) − 1)| · |η(x)φ0 (F0 sgn(2η(x) − 1)) − (1 − η(x))φ0 (−F0 sgn(2η(x) − 1))|
≤ |a − F0 sgn(2η(x) − 1)| |η(x) + (1 − η(x))| = |a − F0 sgn(2η(x) − 1)| .
(C.104)
Then the first inequality of (C.99) is from (C.103), the third inequality of (C.99) is due
to (C.100), (C.101) and (C.102), and the last inequality of (C.99) is from (C.100), (C.101)
and (C.104). Thus we complete the proof.

 
Lemma C.17 Let η0 ∈ (0, 1), F0 ∈ 0, log 1+η
1−η0 , d ∈ N, and P be a Borel probability
0

measure on [0, 1]d × {−1, 1} of which the conditional probability function [0, 1]d 3 z 7→
P ({1} |z) ∈ [0, 1] is denoted by η . Define

ψ : [0, 1]d × {−1, 1} → R,



φ (yF0 sgn(2η(x) − 1)) ,
 if |2η(x) − 1| > η0 ,
(C.105)
(x, y) 7→
 
η(x)
φ y log
 , if |2η(x) − 1| ≤ η0 .
1 − η(x)

93
Zhang, Shi and Zhou

Then there hold


Z
(φ (yf (x)) − ψ(x, y))2 dP (x, y)
[0,1]d ×{−1,1}
Z (C.106)
8
≤ · (φ (yf (x)) − ψ(x, y)) dP (x, y)
1 − η02 [0,1]d ×{−1,1}

for any measurable f : [0, 1]d → [−F0 , F0 ] , and

2
0 ≤ ψ(x, y) ≤ log , ∀(x, y) ∈ [0, 1]d × {−1, 1} . (C.107)
1 − η0

Proof Recall that given x ∈ [0, 1]d , Vx (t) = η(x)φ(t) + (1 − η(x))φ(−t), ∀t ∈ R. Due to
inequality (C.99) and Lemma C.7, for any measurable f : [0, 1]d → [−F0 , F0 ], we have
Z
(φ (yf (x)) − ψ(x, y)) dP (x, y)
[0,1]d ×{−1,1}
Z Z
= φ (yf (x)) − φ (yF0 sgn(2η(x) − 1)) dP (y|x)dPX (x)
|2η(x)−1|>η0 {−1,1}
Z Z  
η(x)
+ φ (yf (x)) − φ y log dP (y|x)dPX (x)
|2η(x)−1|≤η0 {−1,1} 1 − η(x)
Z
1
≥ |f (x) − F0 sgn(2η(x) − 1)|2 dPX (x)
2 (e F0 + e−F0 + 2)
|2η(x)−1|>η0
 
Z 2
1  f (x) − log η(x)
+  h inf t −t + 2)
dPX (x)
|2η(x)−1|≤η0 t∈ log 1−η0 ,log 1+η0 2(e + e
i
1 − η(x)
1+η0 1−η0
Z
1 1
≥ 1+η0 1−η0 |φ (yf (x)) − φ (yF0 sgn(2η(x) − 1))|2 dP (x, y)
2 1−η + 1+η + 2 {|2η(x)−1|>η0 }×{−1,1}
0 0
Z  2
1 1 η(x)
+ 1+η0 1−η0 φ (yf (x)) − φ y log dP (x, y)
2 1−η + 1+η + 2 {|2η(x)−1|≤η0 }×{−1,1} 1 − η(x)
0 0

1 − η02
Z
= · (φ (yf (x)) − ψ(x, y))2 dP (x, y),
8 d
[0,1] ×{−1,1}
 
where the second inequality is from (C.16) and the fact that F0 ∈ 0, log 1+η 0
1−η0 . Thus we
have proved the inequality (C.106).  
1+η0
On the other hand, from the definition of ψ as well as F0 ∈ 0, log 1−η0 , we also have
    
1 + η0 1 + η0 2
0 ≤ ψ(x, y) ≤ max φ(−F0 ), φ − log ≤ φ − log = log ,
1 − η0 1 − η0 1 − η0

which gives the inequality (C.107). The proof is completed.

Now we are in the position to prove Theorem 2.5.

94
Classification with Deep Neural Networks and Logistic Loss

Proof [Proof of Theorem 2.5] Let η0 ∈ (0, 1)∩[0, t1 ], F0 ∈ (0, log 1+η 1
1−η0 )∩[0, 1] , ξ ∈ (0, 2 ∧t2 ]
0

d,β,r,I,Θ,s1 ,s2
and P ∈ H6,t 1 ,c1 ,t2 ,c2
be arbitrary. Denote by η the conditional probability function
P ({1} |·) of P . By definition, there exists a classifier C ∈ C d,β,r,I,Θ such that (2.24), (2.50)
and (2.51) hold. According to Proposition A.4 and the proof of Theorem 3.4 in Kim
et al. (2021), there exist positive constants G0 , N0 , S0 , B0 only depending on d, β, r, I, Θ
and f˜0 ∈ FdFNN (Gξ , Nξ , Sξ , Bξ , 1) such that f˜0 (x) = C(x) for x ∈ [0, 1]d with ∆C (x) > ξ ,
where

  d−1   d−1    
1 1 β 1 β 1 B0 (C.108)
Gξ = G0 log , Nξ = N0 , S ξ = S0 log , Bξ = .
ξ ξ ξ ξ ξ

Define ψ : [0, 1]d × {−1, 1} → R by (C.105). Then for any measurable function f :
[0, 1]d → [−F0 , F0 ], there holds

  Z
f f (x)
EP (f ) = EP ≤ − sgn(2η(x) − 1) |2η(x) − 1| dPX (x)
F0 [0,1]d F0
Z
f (x)
≤ 2P (|2η(x) − 1| ≤ η0 ) + − sgn(2η(x) − 1) dPX (x)
|2η(x)−1|>η0 F0
Z
1
≤ 2c1 η0s1 + |f (x) − F0 sgn(2η(x) − 1)| dPX (x) (C.109)
F0 |2η(x)−1|>η0
R
(φ(yf (x)) − φ(yF0 sgn(2η(x) − 1))) dP (y|x)
Z
s1
≤ 2c1 η0 + dPX (x)
F0 · 1−η η0 +1 0
0 0

|2η(x)−1|>η0 2 φ (−F 0 ) − 2 φ (F 0 )
R
d
[0,1] ×{−1,1} (φ(yf (x)) − ψ(x, y)) dP (x, y)
≤ 2c1 η0s1 + ,
F0 · 1−η η0 +1 0
0 0

2 φ (−F0 ) − 2 φ (F0 )

where the first inequality is from Theorem 2.31 of Steinwart and Christmann (2008), the
third inequality is due to the noise condition (2.24), and the fourth inequality is from (C.99)
in Lemma C.16.
8
Let F = FdFNN (Gξ , Nξ , Sξ , Bξ , F0 ) with (Gξ , Nξ , Sξ , Bξ ) given by (C.108), Γ = 1−η02
and
2
M= 1−η0 in Theorem 2.1. Then we will use this theorem to derive the desired generalization
bounds for the φ-ERM fˆn := fˆnFNN over FdFNN (Gξ , Nξ , Sξ , Bξ , F0 ). Indeed, Lemma C.17
guarantees that the conditions (2.3), (2.4) and (2.5) of Theorem 2.1 are satisfied. Moreover,
take γ = n1 . Then W = max {3, N (F, γ)} satisfies

1 2
   
− d−1 1
log W ≤ Cd,β,r,I,Θ ξ β log log + log n .
ξ ξ

95
Zhang, Shi and Zhou

 
Thus the expectation of [0,1]d ×{−1,1} φ(y fˆn (x)) − ψ(x, y) dP (x, y) can be bounded by
R

inequality (2.6) in Theorem 2.1 as


Z   
EP ⊗n ˆ
φ(y fn (x)) − ψ(x, y) dP (x, y)
[0,1]d ×{−1,1}
d−1
 2  
4000Cd,β,r,I,Θ ξ − β log 1ξ log 1ξ + log n
≤ (C.110)
n(1 − η02 )
 Z 
φ
+ 2 inf RP (f ) − ψ(x, y)dP (x, y) .
f ∈F [0,1]d ×{−1,1}

We next estimate the approximation error, i.e., the second term on the right hand side
of (C.110). Take f0 = F0 f˜0 ∈ F where f˜0 ∈ FdFNN (Gξ , Nξ , Sξ , Bξ , 1) satisfying f˜0 (x) = C(x)
for x ∈ [0, 1]d with ∆C (x) > ξ . Then one can bound the approximation error as
 Z 
inf RφP (f ) − ψ(x, y)dP (x, y)
f ∈F [0,1]d ×{−1,1}
Z (C.111)
≤ RφP (f0 ) − ψ(x, y)dP (x, y) = I1 + I2 + I3 ,
[0,1]d ×{−1,1}

where
Z
I1 := φ(yf0 (x)) − φ(yF0 sgn(2η(x) − 1))dP (x, y),
{|2η(x)−1|>η0 ,∆C (x)>ξ}×{−1,1}
Z  
η(x)
I2 := φ(yf0 (x)) − φ y log dP (x, y),
{|2η(x)−1|≤η0 }×{−1,1} 1 − η(x)
Z
I3 := φ(yf0 (x)) − φ(yF0 sgn(2η(x) − 1))dP (x, y).
{|2η(x)−1|>η0 ,∆C (x)≤ξ}×{−1,1}

Note that f0 (x) = F0 f˜0 (x) = F0 C(x) = F0 sgn(2η(x) − 1) for PX -almost all x ∈ [0, 1]d with
∆C (x) > ξ . Thus it follows that I1 = 0. On the other hand, from Lemma C.7 and the noise
condition (2.24), we see that

η(x) 2
Z
I2 ≤ f0 (x) − log dP (x, y)
{|2η(x)−1|≤η0 }×{−1,1} 1 − η(x)
(C.112)
1 + η0 2 1 + η0 2
Z    
s1
≤ F0 + log dP (x, y) ≤ 4 log c1 · η0 .
{|2η( x)−1|≤η0 }×{−1,1} 1 − η0 1 − η0

Moreover, due to Lemma C.16 and the margin condition (2.51), we have
Z
2F0 + F02 dPX (x)

I3 ≤
{|2η(x)−1|>η0 ,∆C (x)≤ξ} (C.113)
3F0 · PX ( x ∈ [0, 1]d s2

≤ ∆C (x) ≤ ξ ) ≤ 3F0 · c2 · ξ .

96
Classification with Deep Neural Networks and Logistic Loss

The estimates above together with (C.109) and (C.110) give


h i
EP ⊗n EP (fˆn )
hR   i
E ⊗n φ(y ˆ
f (x)) − ψ(x, y) dP (x, y)
1 P d
[0,1] ×{−1,1} n
≤ 2c1 η0s1 + · 1−η0 0 η0 +1 0
F0 2 φ (−F0 ) − 2 φ (F0 ) (C.114)
− d−1
 2  
2 4000Cd,β,r,I,Θ ξ β log 1ξ log 1ξ +log n
1+η0
8 log 1−η0
c1 η0s1 + 6F0 c2 ξ s2 + n(1−η02 )
≤ 2c1 η0s1 + 1−η0 0 η0 +1 0
 .
F0 · 2 φ (−F0 ) − 2 φ (F0 )
d,β,r,I,Θ,s1 ,s2
Since P is arbitrary, we can take the supremum over all P ∈ H6,t 1 ,c1 ,t2 ,c2
to obtain from
(C.114) that
h i
sup EP ⊗n EP (fˆnFNN )
d,β,r,I,Θ,s1 ,s2
P ∈H6,t
1 ,c1 ,t2 ,c2

− d−1 (C.115)
 2  
1
2 4000Cd,β,r,I,Θ ξ β log log 1ξ +log n
1+η0
c1 η0s1 s2 ξ
8 log 1−η0 + 6F0 c2 ξ + n(1−η02 )
≤ 2c1 η0s1 + 1−η0 0 η0 +1 0
 .
F0 · 2 φ (−F0 ) − 2 φ (F0 )

(C.115) holds for all η0 ∈ (0, 1) ∩ [0, t1 ], F0 ∈ (0, log 1+η 1


1−η0 ) ∩ [0, 1] , ξ ∈ (0, 2 ∧ t2 ]. We then
0

take suitable η0 , F0 , and ξ in (C.115) to derive the convergence rates stated in Theorem
2.5.

h i
sup EP ⊗n EP (fˆnFNN )
d,β,r,I,Θ,s1 ,s2
P ∈H6,t
1 ,c1 ,t2 ,c2
d−1 
2 4000Cd,β,r,I,Θ ξ β log 1ξ
2 
log 1ξ +log n

(C.116)
8 log 1+η0
1−η0 c1 · η0s1 + 6F0 c2 ξ s2
+ n(1−η02 )
≤ 2c1 η0s1 + 1−η0 0 η0 +1 0
 .
F0 · 2 φ (−F0 ) − 2 φ (F0 )

Case I. When s1 = s2 = ∞, taking η0 = F0 = t1 ∧ 12 and ξ = t2 ∧ 1


2 in (C.115) yields
h  i log n
sup EP ⊗n EP fˆnFNN . .
P ∈H
d,β,r,I,Θ,s1 ,s2 n
6,t1 ,c1 ,t2 ,c2

  1
1 (log n)3 s2 + d−1
Case II. When s1 = ∞ and s2 < ∞, taking η0 = F0 = t1 ∧ 2 and ξ  n
β

in (C.115) yields
! 1
h  i (log n)3 1+ d−1
βs2
sup EP ⊗n EP fˆnFNN . .
d,β,r,I,Θ,s1 ,s2
P ∈H6,t
n
1 ,c1 ,t2 ,c2

  1
log n s1 +2 1
Case III. When s1 < ∞ and s2 = ∞, take η0 = F0  n and ξ = t2 ∧ 2 in
η0 1−η0 0 η0 +1 0
(C.115). From the fact that 4 ≤ 2 φ (−η0 ) − 2 φ (η0 ) ≤ η0 , ∀0 ≤ η0 ≤ 1, the item in

97
Zhang, Shi and Zhou

the denominator of the second term on the right hand side of (C.115) is larger than 41 η02 .
Then we have
h  i  log n  s1s+2
1

sup EP ⊗n EP fˆn
FNN
. .
P ∈H
d,β,r,I,Θ,s1 ,s2 n
6,t1 ,c1 ,t2 ,c2

Case IV. When s1 < ∞ and s2 < ∞, taking


s2 s1 +1
(log n)3 s2 +(s1 +1)(s2 + d−1 3 s +(s +1) s + d−1
   
β ) (log n) 2 1 (2 β )
η0 = F0  and ξ 
n n
in (C.115) yields
! s1
3
 
(log n) 1+(s1 +1) 1+ d−1
h  i
EP ⊗n EP fˆnFNN .
βs2
sup .
P ∈H6,t
d,β,r,I,Θ,s1 ,s2 n
1 ,c1 ,t2 ,c2

Combining above cases, we obtain the desired results. The proof of Theorem 2.5 is com-
pleted.

C.6 Proof of Theorem 2.6 and Corollary 2.1


In Appendix C.6, we provide the proof ofTheorem 2.6 and Corollary 2.1. Hereinafter, for
a ∈ Rd and R ∈ R, we define B(a, R) := x ∈ Rd kx − ak2 ≤ R .
Lemma C.18 Let d ∈ N, β ∈ (0, ∞), r ∈ (0, ∞), Q ∈ N ∩ (1, ∞),
 
k1 kd >
GQ,d := ( ,..., ) k1 , . . . , kd are odd integers ∩ [0, 1]d ,
2Q 2Q
1
and T : GQ,d → {−1, 1} be a map. Then there exist a constant c1 ∈ (0, 9999 ) only depending
on (d, β, r), and an f ∈ Br [0, 1] depending on (d, β, r, Q, T ), such that kf k[0,1]d = Qc1β ,
β d


and
c1 1
f (x) = kf k[0,1]d · T (a) = β · T (a), ∀ a ∈ GQ,d , x ∈ B(a, ) ∩ [0, 1]d .
Q 5Q
Proof Let
exp (−1/(x − 1/9)) · exp (−1/(1/8 − x)) · 1(1/9,1/8) (x)dx
R∞
κ : R → [0, 1], t 7→ t R 1/8
1/9 exp (−1/(x − 1/9)) · exp (−1/(1/8 − x)) dx

be a well defined infinitely differentiable decreasing function on R with κ(t) = 1 for t ≤ 1/9
and κ(t) = 0 for t ≥ 1/8. Then define b := dβe − 1, λ := β − b,
u : Rd → [0, 1], x 7→ κ(kxk22 ),
and c2 := u|[−2,2]d . Obviously, u only depends on d, and c2 only depends on
C b,λ ([−2,2]d ) q
(d, β). Since u is infinitely differentiable and supported in B(0, 18 ), we have 0 < c2 < ∞.
r 1 1
Take c1 := 4c2 ∧ 10000 . Then c1 only depends on (d, β, r), and 0 < c1 < 9999 . Define
X c1
f : [0, 1]d → R, x 7→ T (a) · β · u(Q · (x − a)).
Q
a∈GQ,d

98
Classification with Deep Neural Networks and Logistic Loss

We then show that these c1 and f defined above have the desired properties.
For any m ∈ (N ∪ {0})d , we write um for Dm u, i.e., the partial derivative of u with
respect to the multi-index m. An elementary calculation yields
X c1
Dm f (x) = T (a) · β−kmk · um (Q · (x − a)), ∀ m ∈ (N ∪ {0})d , x ∈ [0, 1]d .
a∈G
Q 1
Q,d
(C.117)
c1
Note that the supports of the functions T (a) · Qβ−kmk1
· um (Q · (x − a)) ( a ∈ GQ,d ) in (C.117)
are disjoint. Indeed, we have
 
d c1
x ∈ R T (a) · β−kmk · um (Q · (x − a)) 6= 0
Q 1
p  
1/8 −1 1 d
⊂ B(a, )⊂ a+v v ∈( , )
Q 2Q 2Q
 
−1 1 d
⊂ [0, 1]d \ z + v v ∈ [ , ] , ∀ m ∈ (N ∪ {0})d , a ∈ GQ,d , z ∈ GQ,d \ {a} ,
2Q 2Q
√ (C.118)
1/8
and sets B(a, Q ) (a ∈ GQ,d ) are disjoint. Therefore,

c1
kDm f k[0,1]d = sup sup T (a) · β−kmk
· um (Q · (x − a))
a∈GQ,d x∈[0,1]d Q 1

c1
= sup sup√ T (a) · β−kmk
· um (Q · (x − a))
a∈GQ,d 1/8 Q 1
x∈B(a, )
Q (C.119)
c1 c1
= sup sup
√ · um (x) ≤ sup · um (x)
a∈GQ,d x∈B(0, 1/8)
Qβ−kmk1 x∈[−2,2]d Qβ−kmk1

≤ sup |c1 · um (x)| ≤ c1 c2 , ∀ m ∈ (N ∪ {0})d with kmk1 ≤ b.


x∈[−2,2]d

In particular, we have that


c1 c1
kf k[0,1]d = sup sup
√ · u(x) = β . (C.120)
a∈GQ,d x∈B(0, 1/8)
Qβ Q

Besides, for any a ∈ GQ,d , any x ∈ B(a, 5Q


1
) ∩ [0, 1]d , and any z ∈ GQ,d \ {a}, we have
1 p p
kQ · (x − z)k2 ≥ Q ka − zk2 − Q kx − ak2 ≥ 1 − > 1/8 > 1/9 > kQ · (x − a)k2 ,
5
which means that u(Q · (x − z)) = 0 and u(Q · (x − a)) = 1 . Thus
c1 X c1
f (x) = T (a) · β · u(Q · (x − a)) + T (z) · β · u(Q · (x − z))
Q Q
z∈GQ,d \{a}
c1 X c1
= T (a) · β · 1 + T (z) · β · 0 (C.121)
Q Q
z∈GQ,d \{a}
c1 1
= T (a) · β
, ∀ a ∈ GQ,d , x ∈ B(a, ) ∩ [0, 1]d .
Q 5Q

99
Zhang, Shi and Zhou

Now it remains to show that f ∈ Brβ [0, 1]d . Let m ∈ (N ∪ {0}) d



n be an arbitrary multi-
o
1 1 d
S
index with kmk1 = b, and x, y be arbitrary points in a∈GQ,d a + v v ∈ (− 2Q , 2Q ) .
1 1 d 1 1 d
Then there exist ax , ay ∈ GQ,d , such that x − ax ∈ (− 2Q , 2Q ) and y − ay ∈ (− 2Q , 2Q ) . If
ax = ay , then it follows from (C.118) that

um (Q · (x − z)) = um (Q · (y − z)) = 0, ∀ z ∈ GQ,d \ {ax } ,

which, together with the fact that {Q · (x − ax ), Q · (y − ay )} ⊂ (− 21 , 12 )d , yields

|Dm f (x) − Dm f (y)|


c1 c1
= T (ax ) · β−kmk
· um (Q · (x − ax )) − T (ay ) · β−kmk
· um (Q · (y − ay ))
Q 1 Q 1

um (Q · (x − ax )) − um (Q · (y − ay ))
= c1 ·

c1 λ um (z) − um (z 0 )
≤ · kQ · (x − ax ) − Q · (y − a y )k2 · sup
Qλ z,z 0 ∈(− 21 , 12 )d ,z6=z 0 , kz − z 0 kλ2
c1
≤ λ · kQ · (x − ax ) − Q · (y − ay )kλ2 · c2 = c1 c2 · kx − ykλ2 .
Q

If, otherwise, ax 6= ay , then it is easy to show that

 
1 1 d 1 1 d
{t · x + (1 − t) · y|t ∈ [0, 1]} ∩ ax + v v ∈ [− , ] \ (− , ) 6= ∅,
2Q 2Q 2Q 2Q
 
1 1 d 1 1 d
{t · x + (1 − t) · y|t ∈ [0, 1]} ∩ ay + v v ∈ [− , ] \ (− , ) 6= ∅.
2Q 2Q 2Q 2Q

In
n other words, the line o n joining points x and yointersects boundaries of rectangles
segment
1 1 d 1 1 d
ax + v v ∈ (− 2Q , 2Q ) and ay + v v ∈ (− 2Q , 2Q ) . Take

 
0 1 1 d 1 1 d
x ∈ {t · x + (1 − t) · y|t ∈ [0, 1]} ∩ ax + v v ∈ [− , ] \ (− , )
2Q 2Q 2Q 2Q

and

 
0 1 1 d 1 1 d
y ∈ {t · x + (1 − t) · y|t ∈ [0, 1]} ∩ ay + v v ∈ [− , ] \ (− , )
2Q 2Q 2Q 2Q

(cf. Figure C.8).

100
Classification with Deep Neural Networks and Logistic Loss

(0, 1) (1, 1)

ay
y
y0

x0
x
ax

(0, 0) (1, 0)

Figure C.8: Illustration of the points x, y, ax , ay , x0 , y 0 when Q = 3 and d = 2.

Obviously, we have that

1 1
{Q · (x − ax ), Q · (x0 − ax ), Q · (y − ay ), Q · (y 0 − ay )} ⊂ [− , ]d .
2 2
By (C.118), we have that

um (Q · (x − z)) · (1 − 1{ax } (z)) = um (Q · (x0 − z))


= um (Q · (y 0 − z)) = um (Q · (y − z)) · (1 − 1{ay } (z)) = 0, ∀ z ∈ GQ,d .

Consequently,

|Dm f (x) − Dm f (y)| ≤ |Dm f (x)| + |Dm f (y)|


c1 c1
= T (ax ) · β−kmk
· um (Q · (x − ax )) + T (ay ) · β−kmk
· um (Q · (y − ay ))
Q 1 Q 1

c1 c1
= · |um (Q · (x − ax ))| + λ · |um (Q · (y − ay ))|
Qλ Q
c1
= λ · |um (Q · (x − ax )) − um (Q · (x0 − ax ))|
Q
c1
+ λ · |um (Q · (y − ay )) − um (Q · (y 0 − ay ))|
Q
c1 0 λ um (z) − um (z 0 )
≤ · kQ · (x − a x ) − Q · (x − ax )k 2 · sup
Qλ z,z 0 ∈[− 21 , 12 ]d ,z6=z 0 , kz − z 0 kλ2
c1 0 λ um (z) − um (z 0 )
+ · kQ · (y − a y ) − Q · (y − a y )k2 · sup
Qλ z,z 0 ∈[− 21 , 12 ]d ,z6=z 0 , kz − z 0 kλ2
c1 λ λ
≤ λ
· kQ · (x − ax ) − Q · (x0 − ax )k2 + kQ · (y − ay ) − Q · (y 0 − ay )k2 · c2
Q
λ λ
= c1 c2 · kx − x0 k2 + ky − y 0 k2 ≤ 2c1 c2 · kx − ykλ2 .

101
Zhang, Shi and Zhou

Therefore, no matter whether ax = ay or not, we always have that

|Dm f (x) − Dm f (y)| ≤ 2c1 c2 · kx − ykλ2 .

Since m, x, y are arbitrary, we deduce that

|Dm f (x) − Dm f (y)| ≤ 2c1 c2 · kx − ykλ2


n o
1 1 d
for any m ∈ (N ∪ {0})d with kmk1 = b and any x, y ∈ a∈GQ,d a + v v ∈ (− 2Q
S
, 2Q ) .
n o
1 1 d
is dense in [0, 1]d . Hence, by taking limit, we
S
Note that a∈GQ,d a + v v ∈ (− 2Q , 2Q )
obtain
|Dm f (x) − Dm f (y)|
(C.122)
≤ 2c1 c2 · kx − ykλ2 , ∀ m ∈ (N ∪ {0})d with kmk1 = b, ∀ x, y ∈ [0, 1]d .

Combining (C.119) and (C.122), we conclude that kf kC b,λ ([0,1]d ) ≤ c1 c2 + 2c1 c2 < r. Thus
f ∈ Brβ [0, 1]d . Then the proof of this lemma is completed.


Let P and Q be two arbitrary probability measures which have the same domain. We
write P << Q if P is absolutely continuous with respect to Q . The Kullback-Leibler
divergence (KL divergence) from Q to P is given by
(R
dQ dP, if P << Q,
log dP

KL(P||Q) :=
+∞, otherwise,

dQ is the Radon-Nikodym derivative of P with respect to Q (cf. Definition 2.5 of


where dP
Tsybakov (2009)).

Lemma C.19 Suppose η1 : [0, 1]d → [0, 1] and η2 : [0, 1]d → (0, 1) are two Borel measurable
functions, and Q is a Borel probability measure on [0, 1]d . Then Pη1 ,Q << Pη2 ,Q , and
( η (x)
1
dPη1 ,Q , if y = 1,
(x, y) = η1−η
2 (x)
(x)
dPη2 ,Q 1
1−η2 (x) , if y = −1.

1 (x)
η2 (x) , if y = 1,
Proof Let f : [0, 1]d × {−1, 1} → [0, ∞), (x, y) 7→ 1−η1 (x) Then we have
1−η2 (x) ,if y = −1.
that f is well defined and measurable. For any Borel subset S of [0, 1]d × {−1, 1}, let
S1 := x ∈ [0, 1]d (x, 1) ∈ S , and S2 := x ∈ [0, 1]d (x, −1) ∈ S . Obvioulsy, S1 × {1}


and S2 × {−1} are measurable and disjoint. Besides, it is easy to verify that S = (S1 ×
{1}) ∪ (S2 × {−1}). Therefore,
Z
f (x, y)dPη2 ,Q (x, y)
S
Z Z Z Z
= f (x, y)dMη2 (x) (y)dQ(x) + f (x, y)dMη2 (x) (y)dQ(x)
S1 {1} S2 {−1}

102
Classification with Deep Neural Networks and Logistic Loss

Z Z
= η2 (x)f (x, 1)dQ(x) + (1 − η2 (x))f (x, −1)dQ(x)
S1 S2
Z Z
= η1 (x)dQ(x) + (1 − η1 (x))dQ(x)
S1 S2
Z Z Z Z
= dMη1 (x) (y)dQ(x) + dMη1 (x) (y)dQ(x)
S1 {1} S2 {−1}

= Pη1 ,Q (S1 × {1}) + Pη1 ,Q (S2 × {−1}) = Pη1 ,Q (S).


dPη1 ,Q
Since S is arbitrary, we deduce that Pη1 ,Q << Pη2 ,Q , and dPη2 ,Q = f . This completes the
proof.

Lemma C.20 Let ε ∈ (0, 51 ], Q be a Borel probability on [0, 1]d , and η1 : [0, 1]d → [ε, 3ε],
η2 : [0, 1]d → [ε, 3ε] be two measurable functions. Then

KL(Pη1 ,Q ||Pη2 ,Q ) ≤ 9ε.

Proof By Lemma C.19,

KL(Pη1 ,Q ||Pη2 ,Q )
 
1 − η1 (x)
Z
η1 (x)
= log · 1{1} (y) + · 1{−1} (y) dPη1 ,Q (x, y)
[0,1]d ×{−1,1} η2 (x) 1 − η2 (x)
    
1 − η1 (x)
Z
η1 (x)
= η1 (x) log + (1 − η1 (x)) log dQ(x)
[0,1]d η2 (x) 1 − η2 (x)
    
1 − η1 (x)
Z
η1 (x)
≤ 3ε · log + log dQ(x)
[0,1]d η2 (x) 1 − η2 (x)
    
1−ε
Z

≤ 3ε · log + log dQ(x)
[0,1]d ε 1 − 3ε
 
2ε 2ε
= log 1 + + 3ε · log 3 ≤ + 4ε ≤ 9ε.
1 − 3ε 1 − 3ε

Lemma C.21 Let m ∈ N ∩ (1, ∞), Ω be a set with #(Ω) = m, and {0, 1}Ω be the set of
all functions mapping from Ω to {0, 1}. Then there exists a subset E of {0, 1}Ω , such that
#(E) ≥ 1 + 2m/8 , and
m
# ({ x ∈ Ω| f (x) 6= g(x)}) ≥ , ∀ f ∈ E, ∀ g ∈ E \ {f } .
8

Proof If m ≤ 8, then E = {0, 1}Ω have the desired properties. The proof for the case
m > 8 can be found in Lemma 2.9 of Tsybakov (2009).

103
Zhang, Shi and Zhou

Lemma C.22 Let φ be the logistic loss,

J : (0, 1)2 → R
2 2
(x, y) 7→ (x + y) log + (2 − x − y) log
x+y 2−x−y (C.123)
 
1 1 1 1
− x log + (1 − x) log + y log + (1 − y) log ,
x 1−x y 1−y

Q be a Borel probability measure on [0, 1]d , and η1 : [0, 1]d → (0, 1), η2 : [0, 1]d → (0, 1) be
two measurable functions. Then there hold

J (x, y) = J (y, x) ≥ 0, ∀ x ∈ (0, 1), y ∈ (0, 1), (C.124)

ε 1
< J (ε, 3ε) = J (3ε, ε) < ε, ∀ ε ∈ (0, ], (C.125)
4 6
and Z
J (η1 (x), η2 (x))dQ(x) ≤ inf EPφη (f ) + EPφη (f ) . (C.126)
f ∈Fd 1 ,Q 2 ,Q
[0,1]d

Proof Let g : (0, 1) → (0, ∞), x 7→ x log x1 + (1 − x) log 1−x


1
. Then it is easy to verify that
g is concave (i.e., −g is convex), and
x+y
J (x, y) = 2g( ) − g(x) − g(y), ∀ x ∈ (0, 1), y ∈ (0, 1).
2
This yields (C.124).
An elementary calculation gives

J (ε, 3ε) = J (3ε, ε)


(1 − 2ε)2
 
27
= ε log − log + 4ε log(1 − 2ε) − ε log(1 − ε) − 3ε log(1 − 3ε)
16 (1 − ε)(1 − 3ε)

Taylor expansion 27 X 3k + 1 − 2 · 2k k
=========== ε log + · ε , ∀ ε ∈ (0, 1/3).
16 k · (k − 1)
k=2

Therefore,
 
3 k
− 2 · 2k

∞ 1+ ∞
ε 27 27 X 2 27 X 3k + 1 − 2 · 2k k
< ε log ≤ ε log + · εk = ε log + ·ε
4 16 16 k · (k − 1) 16 k · (k − 1)
k=2 k=2
∞ ∞
27 X 3k + 1 − 2 · 2k 27 X 3k − 7
= J (ε, 3ε) = J (3ε, ε) = ε log + · εk ≤ ε log
+ · εk
16 k · (k − 1) 16 k · (k − 1)
k=2 k=2
∞ k ∞  k−1
27 2
X 3 −7
k−1 27 X 3k 1
= ε log +ε +ε· ·ε ≤ ε log + ε/6 + ε · ·
16 k · (k − 1) 16 3 · (3 − 1) 6
k=3 k=3
 
1 1 27
= + + log · ε < ε, ∀ ε ∈ (0, 1/6],
6 4 16

104
Classification with Deep Neural Networks and Logistic Loss

which proves (C.125).


η1 (x) η2 (x)
Define f1 : [0, 1]d → R, x 7→ log 1−η 1 (x)
and f2 : [0, 1]d → R, x 7→ log 1−η 2 (x)
. Then it is
easy to verify that
Z
φ
RPη ,Q (fi ) = g(ηi (x))dQ(x) ∈ (0, ∞), ∀ i ∈ {1, 2} ,
i
[0,1]d

and 
inf aφ(t) + (1 − a)φ(−t) t ∈ R = g(a), ∀ a ∈ (0, 1).
Consequently, for any measurable function f : [0, 1]d → R, there holds
EPφη ,Q (f ) + EPφη ,Q (f ) ≥ RφPη ,Q (f ) − RφPη ,Q (f1 ) + RφPη ,Q (f ) − RφPη ,Q (f2 )
1 2 1 1 2 2
Z
= ((η1 (x) + η2 (x))φ(f (x)) + (2 − η1 (x) − η2 (x))φ(−f (x)) dQ(x)
[0,1]d

− RφPη ,Q (f1 ) − RφPη ,Q (f2 )


Z  1 2

η1 (x) + η2 (x) η1 (x) + η2 (x)
≥ 2 · inf φ(t) + (1 − )φ(−t) t ∈ R dQ(x)
[0,1]d 2 2
− RφPη (f1 ) − RφPη (f2 )
1 ,Q 2 ,Q
Z
η1 (x) + η2 (x)
= 2g( )dQ(x) − RφPη ,Q (f1 ) − RφPη ,Q (f2 )
[0,1]d 2 1 2
Z  
η1 (x) + η2 (x)
= 2g( ) − g(η1 (x)) − g(η2 (x)) dQ(x)
[0,1]d 2
Z
= J (η1 (x), η2 (x))dQ(x).
[0,1]d

This proves (C.126).

Proof [Proof of Theorem 2.6 and Corollary 2.1] We first prove Theorem 2.6. Let n be
d∗ +β·(1∧β)q
7 β·(1∧β)q
an arbitrary integer greater than 1−A . Take b := dβe − 1, λ := β + 1 − dβe,
1
j k l d m

Q := n d∗ +β·(1∧β)q + 1, M := 2Q /8 ,
 
k1 kd∗ >
GQ,d∗ := ( ,..., ) k1 , . . . , kd∗ are odd integers ∩ [0, 1]d∗ ,
2Q 2Q
and J to be the function defined in (C.123). Note that # (GQ,d∗ ) = Qd∗ . Thus it follows
from Lemma C.21 that there exist functions Tj : GQ,d∗ → {−1, 1}, j = 0, 1, 2, . . . , M, such
that
Qd∗
# ({ a ∈ GQ,d∗ | Ti (a) 6= Tj (a)}) ≥ , ∀ 0 ≤ i < j ≤ M. (C.127)
8
According to Lemma C.18, for each j ∈ {0, 1, . . . , M}, there exists an fj ∈ B βr∧1 [0, 1]d∗ ,

777
such that
c1 1∧r
β
= kfj k[0,1]d∗ ≤ kfj kC b,λ ([0,1]d∗ ) ≤ , (C.128)
Q 777

105
Zhang, Shi and Zhou

and
c1 1
fj (x) = β
· Tj (a), ∀ a ∈ GQ,d∗ , x ∈ B(a, ) ∩ [0, 1]d∗ , (C.129)
Q 5Q
1
where c1 ∈ (0, 9999 ) only depends on (d∗ , β, r). Define
c1
gj : [0, 1]d∗ → R, x 7→ + fj (x).

It follows from (C.128) that
   
2c1 1∧r
ran(gj ) ⊂ 0, β ⊂ 0, 2 · ⊂ [0, 1] (C.130)
Q 777
and
c1 2c1 1∧r 1∧r
β
+ kgj kC b,λ ([0,1]d∗ ) ≤ β + kfj kC b,λ ([0,1]d∗ ) ≤ 2 · + < r, (C.131)
Q Q 777 777
meaning that
c1
gj ∈ Brβ [0, 1]d∗ ∈ Brβ [0, 1]d∗ .
 
and gj + β (C.132)
Q
We then define

h0,j : [0, 1]d → [0, 1], (x1 , . . . , xd )> 7→ gj (x1 , . . . , xd∗ )

if q = 0, and define

h0,j : [0, 1]d → [0, 1]K , (x1 , . . . , xd )> 7→ (gj (x1 , . . . , xd∗ ), 0, 0, . . . , 0)>

if q > 0. Note that h0,j is well defined because d∗ ≤ d and ran(gj ) ⊂ [0, 1]. Take
Pq−1 k
k=0 (1∧β) (1∧β)q
1 1∧r 2c1
ε= · · β .
2 777 Q

From (C.128) we see that


1∧r
0<ε≤ . (C.133)
777
For all real number t, define the function
1∧r
ut : [0, 1]d∗ → R, (x1 , . . . , xd∗ )> 7→ t + · |x1 |(1∧β) .
777
Then it follows from (C.133) and the elementary inequality

||z1 |w − |z2 |w | ≤ |z1 − z2 |w , ∀ z1 ∈ R, z2 ∈ R, w ∈ (0, 1]

that
n o n o
max kuε k[0,1]d∗ , ku0 k[0,1]d∗ ≤ max kuε kC b,λ ([0,1]d∗ ) , ku0 kC b,λ ([0,1]d∗ )
1∧r 1∧r 1∧r (C.134)
≤ ku0 kC b,λ ([0,1]d∗ ) + ε ≤ ·2+ε≤ ·2+ < r ∧ 1,
777 777 777

106
Classification with Deep Neural Networks and Logistic Loss

which means that


ran(u0 ) ∪ ran(uε ) ⊂ [0, 1], (C.135)
and
{u0 , uε } ⊂ Brβ [0, 1]d∗ .

(C.136)
Next, for each i ∈ N, define

hi : [0, 1]K → R,
(x1 , . . . , xK )> 7→ u0 (x1 , . . . , xd∗ )

if i = q > 0, and define

hi : [0, 1]K → RK , (x1 , . . . , xK )> 7→ (u0 (x1 , . . . , xd∗ ), 0, 0, . . . , 0)>

otherwise. It follows from (C.135) that ran(hi ) ⊂ [0, 1] if i = q > 0, and ran(hi ) ⊂ [0, 1]K
otherwise. Thus, for each j ∈ {0, 1, . . . , M}, we can well define

ηj : [0, 1]d → R, x 7→ ε + hq ◦ hq−1 ◦ · · · ◦ h3 ◦ h2 ◦ h1 ◦ h0,j (x).

We then deduce from (C.132) and (C.136) that

ηj ∈ GdCH (q, K, d∗ , β, r), ∀ j ∈ {0, 1, . . . , M} . (C.137)

Moreover, an elementary calculation gives


Pq−1 k
1∧r k=0 (1∧β) q
· |gj (x1 , . . . , xd∗ )|(1∧β) + ε
777 (C.138)
d
= ηj (x1 , . . . , xd ), ∀ (x1 , . . . , xd ) ∈ [0, 1] , ∀ j ∈ {0, 1, . . . , M} ,

which, together with (C.130), yields


Pq−1 k
k=0 (1∧β) (1∧β)q
1∧r 2c1
0 < ε ≤ ηj (x1 , . . . , xd ) ≤ · β + ε = 2ε + ε
777 Q
(1∧β)q
3c1 1 1 1−A 1−A
= 3ε ≤ < ≤ β·(1∧β)q
≤ <
Qβ Qβ·(1∧β)q n d∗ +β·(1∧β)q
7 2
< 1, ∀ (x1 , . . . , xd ) ∈ [0, 1]d , ∀ j ∈ {0, 1, . . . , M} .

Consequently,
ran(ηj ) ⊂ [ε, 3ε] ⊂ (0, 1), ∀ j ∈ {0, 1, . . . , M} , (C.139)
and
x ∈ [0, 1]d |2ηj (x) − 1| ≤ A = ∅, ∀ j ∈ {0, 1, . . . , M} .

(C.140)
Combining (C.137), (C.139), and (C.140), we obtain
d,β,r
Pj := Pηj ∈ H5,A,q,K,d∗
, ∀ j ∈ {0, 1, 2, . . . , M} . (C.141)

107
Zhang, Shi and Zhou

By (C.129) and (C.138), for any 0 ≤ i < j ≤ M, any a ∈ GQ,d∗ with Ti (a) 6= Tj (a), and any
x ∈ [0, 1]d with (x){1,2,...,d∗ } ∈ B(a, 5Q
1
), there holds

J (ηi (x), ηj (x))


Pq−1 k
k=0 (1∧β) (1∧β)q
1∧r c1 c1
=J · β + Ti (a) · β + ε,
777 Q Q
Pq−1 k (1∧β)q
!
1∧r k=0 (1∧β) c1 c1
· β + Tj (a) · β +ε
777 Q Q
Pq−1 k
Pq−1
(1∧β)q k
!
1∧r k=0 (1∧β) 2c1 1∧r k=0 (1∧β)
(1∧β)q
=J · β + ε, · |0| +ε
777 Q 777
= J (2ε + ε, ε) = J (ε, 3ε).

Thus it follows from Lemma C.22 and (C.127) that


  Z
φ φ
inf EPj (f ) + EPi (f ) ≥ J (ηi (x), ηj (x))dx
f ∈Fd [0,1]d
Z
J (ηi (x), ηj (x)) · 1B(a, 1
X 
≥ ) (x){1,...,d∗ } dx
5Q
a∈GQ,d∗ : Tj (a)6=Ti (a) [0,1]d
Z
J (ε, 3ε) · 1B(a,
X 
= 1
) (x){1,...,d∗ } dx
5Q
a∈GQ,d∗ : Tj (a)6=Ti (a) [0,1]d
Z
ε
· 1B(a, 1 ) (x){1,...,d∗ } dx
X 
≥ (C.142)
d 4 5Q
a∈GQ,d∗ : Tj (a)6=Ti (a) [0,1]
  Z
# a ∈ GQ,d∗ Tj (a) 6= Ti (a) ε
= · dx1 dx2 · · · dxd∗
Qd∗ 1 4
B(0, 5 )
Z Z
1 ε 1 ε
≥ · dx1 dx2 · · · dxd∗ ≥ · dx1 dx2 · · · dxd∗
8 B(0, 15 ) 4 8 [− √ 1 , √ 1 ]d∗ 4
25d∗ 25d∗
d∗
2 ε
≥ √ · =: s, ∀ 0 ≤ i < j ≤ M.
25d∗ 32

Let fˆn be an arbitrary Fd -valued statistic on ([0, 1]d × {−1, 1})n from the sample
{(Xi , Yi )}ni=1 , and let T : ([0, 1]d × {−1, 1})n → Fd be the map associated with fˆn , i.e.,
fˆn = T (X1 , Y1 , . . . , Xn , Yn ). Take

T0 : Fd → {0, 1, . . . , M} , f 7→ inf arg min EPφj (f ),


j∈{0,1,...,M}

i.e., T0 (f ) is the smallest integer j ∈ {0, . . . , M} such that EPφj (f ) ≤ EPφi (f ) for any i ∈
{0, . . . , M}. Define g∗ = T0 ◦ T . Note that, for any j ∈ {0, 1, . . . , M} and any f ∈ Fd there
holds
(C.142) s
T0 (f ) 6= j ⇒ EPφT (f ) + EPφj (f ) ≥ s ⇒ EPφj (f ) + EPφj (f ) ≥ s ⇒ EPφj (f ) ≥ ,
0 (f ) 2

108
Classification with Deep Neural Networks and Logistic Loss

which, together with the fact that the range of T is contained in Fd , yields

1R\{j} (g∗ (z)) = 1R\{j} (T0 (T ((z))))


(C.143)
≤ 1[ 2s ,∞] (EPφj (T (z))), ∀ z ∈ ([0, 1]d × {−1, 1})n , ∀ j ∈ {0, 1, . . . , M} .

Consequently,
h i h i
sup EP ⊗n EPφ (fˆn ) ≥ sup EP ⊗n EPφj (fˆn )
d,β,r j
P ∈H5,A,q,K,d j∈{0,1,...,M}

Z Z 1[ 2s ,∞] (EPφj (T (z)))
= sup EPφj (T (z))dPj⊗n (z) ≥ sup dPj⊗n (z)
j∈{0,1,...,M} j∈{0,1,...,M} 2/s
(C.144)
1R\{j} (g∗ (z)) Pj⊗n (g∗ 6= j)
Z
≥ sup dPj⊗n (z) = sup
j∈{0,1,...,M} 2/s j∈{0,1,...,M} 2/s
( )
s g is a measurable function from
≥ · inf sup Pj⊗n 6 j)
(g = ,
2 j∈{0,1,...,M} ([0, 1]d × {−1, 1})n to {0, 1, . . . , M}

where the first inequality follows from (C.141) and the third inequality follows from (C.143).
We then use Proposition 2.3 of Tsybakov (2009) to bound the right hand side of (C.144).
By Lemma C.20, we have that
M M M
1 X n X n X
· KL(Pj⊗n ||P0⊗n ) = · KL(Pj ||P0 ) ≤ · 9ε = 9nε,
M j=1 M j=1 M j=1

which, together with Proposition 2.3 of Tsybakov (2009), yields


( )
g is a measurable function from
inf sup Pj⊗n (g 6= j)
j∈{0,1,...,M} ([0, 1]d × {−1, 1})n to {0, 1, . . . , M}

  q   q 
9nε 9nε
τ M 9nε + 2 M 9nε + 2
≥ sup  · 1 +  ≥ √ · 1 + 
τ ∈(0,1) 1 + τM log τ 1+ M log √1M
√ √
 q 
9nε !
M 9nε + 2 M 9nε + 10 1
+ 12nε
≥ √ · 1 − ≥ √ · 1− √
1+ M log √1M 1+ M log M
√ !
M 21nε 1/10
≥ √ · 1− 1 Qd∗ /8
 − √
1+ M 2 log 2 log 2
√ Pq−1 k q
!
M 336n 1 1 ∧ r k=0 (1∧β) 2c1 (1∧β) 1/10
= √ · 1− d · · · β − √
1+ M Q ∗ log 2 2 777 Q log 2
√ q
!
M 336n 1 1 1 (1∧β) 1/10
≥ √ · 1− d · · · − √
1+ M Q ∗ log 2 2 777 Qβ log 2
√   √
M 336 1 1 1/10 M 1 1
≥ √ · 1− · · − √ ≥ √ · ≥ .
1+ M log 2 2 777 log 2 1+ M 3 6

109
Zhang, Shi and Zhou

Combining this with (C.144), we obtain that


h i s 1
sup EP ⊗n EPφ (fˆn ) ≥ ·
P ∈Hd,β,r
2 6
5,A,q,K,d∗
Pq−1 k
d∗ q
k=0 (1∧β) (1∧β)q
2 |2c1 |(1∧β) 1∧r 1
= √ · · · β
25d∗ 768 777 Q
q
Pq−1 k
d∗ k=0 (1∧β)
2 |2c1 |(1∧β) 1∧r 1 1
≥ √ · · · · β·(1∧β)q
.
25d∗ 768 777 2β·(1∧β)q n d∗ +β·(1∧β)q

Since fˆn is arbitrary, we deduce that


h i β·(1∧β)q

inf sup EP ⊗n EPφ (fˆn ) ≥ c0 n d∗ +β·(1∧β)q
fˆn P ∈Hd,β,r
5,A,q,K,d ∗

d∗ (1∧β)q
Pq−1
(1∧β)k
2
with c0 := √25d · |2c1768
|
· 1∧r
777
k=0 1
· 2β·(1∧β)q only depending on (d∗ , β, r, q). Thus

we complete the proof of Theorem 2.6.
Now it remains to prove Corollary 2.1. Indeed, it follows from (2.33) that
d,β,r d,β,r
H3,A = H5,A,0,1,d .

Taking q = 0, K = 1 and d∗ = d in Theorem 2.6, we obtain that there exists an constant


c0 ∈ (0, ∞) only depending on (d, β, r), such that
β·(1∧β)0

h i h i
inf sup EP ⊗n EPφ (fˆn ) = inf sup EP ⊗n EPφ (fˆn ) ≥ c0 n d+β·(1∧β)0
fˆn P ∈Hd,β,r fˆn P ∈Hd,β,r
3,A 5,A,0,1,d

d+β·(1∧β)0 d+β
β
− d+β 7 β·(1∧β)0 7 β
= c0 n provided that n > = .
1−A 1−A
This proves Corollary 2.1.

C.7 Proof of (3.7)


Appendix C.7 is devoted to the proof of the bound (3.7).
Proof [Proof of (3.7)]Fix ν ∈ [0, ∞) and µ ∈ [1, ∞). Let P be an arbitrary probability in
H7d,β . Denote by η the conditional probability function P ({1} |·) of P . According to Lemma
C.2 and the definition of H7d,β , there exists a function f ∗ ∈ B1β [0, 1]d such that

∗ P -a.s. η P -a.s.
fφ,P ==X==== log ==X==== f ∗ . (C.145)
1−η

Thus there exists a measurable set Ω contained in [0, 1]d such that PX (Ω) = 1 and
η(x)
log = f ∗ (x), ∀ x ∈ Ω. (C.146)
1 − η(x)

110
Classification with Deep Neural Networks and Logistic Loss

Let δ be an arbitrary number in (0, 1/3). Then it follows from Corollary B.1 that there
exists  
FNN 1 −d/β −d/β 1
g̃ ∈ Fd Cd,β log , Cd,β δ , Cd,β δ log , 1, ∞ (C.147)
δ δ
such that supx∈[0,1]d |f ∗ (x) − g̃(x)| ≤ δ . Let T : R → [−1, 1], z 7→ min {max {z, −1} , 1} and

 − 1, if g̃(x) < −1,

˜
f : R → [−1, 1], x 7→ T (g̃(x)) = g̃(x), if − 1 ≤ g̃(x) ≤ 1,

1, if g̃(x) > 1.

Obviously, |T (z) − T (w)| ≤ |z − w| for any real numbers z and w, and


∵kf ∗ k
[0,1] d ≤1
sup f ∗ (x) − f˜(x) ========== sup |T (f ∗ (x)) − T (g̃(x))|
x∈[0,1]d x∈[0,1]d
(C.148)

≤ sup |f (x) − g̃(x)| ≤ δ.
x∈[0,1]d

Besides, it is easy to verify that


f˜(x) = σ(g̃(x) + 1) − σ(g̃(x) − 1) − 1, ∀ x ∈ Rd ,
which, together with (C.147), yields
 
˜ FNN 1 −d/β −d/β 1
f ∈ Fd 1 + Cd,β log , 1 + Cd,β δ , 4 + Cd,β δ log , 1, 1
δ δ
 
1 1
⊂ FdFNN Cd,β log , Cd,β δ −d/β , Cd,β δ −d/β log , 1, 1 .
δ δ
In addition, it follows from Lemma C.7 that
Z
1 ∗ 2
|f (x) − f (x)| ≤ (φ(yf (x)) − φ(yf ∗ (x))) dP (y|x)
2(eµ + e−µ + 2) {−1,1}
(C.149)
1 ∗ 2 d
≤ |f (x) − f (x)| , ∀ measurable f : [0, 1] → [−µ, µ], ∀ x ∈ Ω.
4
Take Ce := 2(eµ + e−µ + 2). Integrating both side with respect to x in (C.149) and using
(C.148), we obtain
Z
(φ(yf (x)) − φ(yf ∗ (x)))2 dP (x, y)
d
[0,1] ×{−1,1}
Z Z
2
≤ ∗
(f (x) − f (x)) dP (x, y) = |f (x) − f ∗ (x)|2 dPX (x)
[0,1]d ×{−1,1} [0,1]d
Z
∵PX (Ω)=1 C
e
======== |f (x) − f ∗ (x)|2 dPX (x)
Ω 2(eµ + e−µ + 2) (C.150)
Z Z
≤C e (φ(yf (x)) − φ(yf ∗ (x))) dP (y|x)dPX (x)
Ω {−1,1}
Z
∵PX (Ω)=1
======== C
e (φ(yf (x)) − φ(yf ∗ (x))) dP (x, y)
[0,1]d ×{−1,1}
by Lemma C.3
========== CE e φ (f ), ∀ measurale f : [0, 1]d → [−µ, µ] ,
P

111
Zhang, Shi and Zhou

and
  
1 −βd
−β d 1
inf EPφ (f )f∈ FdFNN
Cd,β log , Cd,β δ , Cd,β δ log , 1, 1
δ δ
Z Z  
by Lemma C.3
≤ EPφ (f˜) ========== φ(y f˜(x)) − φ(yf ∗ (x)) dP (y|x)dPX (x)
[0,1]d {−1,1}
Z Z (C.151)
∵PX (Ω)=1
 
======== φ(y f˜(x)) − φ(yf ∗ (x)) dP (y|x)dPX (x)
Ω {−1,1}
Z Z
1 ˜ 2 2
≤ f (x) − f ∗ (x) dPX (x) ≤ f˜(x) − f ∗ (x) dPX (x) ≤ δ 2 .
Ω 4 [0,1]d

Take c to be the maximum of the three constants Cd,β in (C.151). Hence c ∈ (0, ∞) only
depends on (d, β). Now suppose (3.8) holds. Then it follows that there exists l ∈ (0, ∞)
 3  d  3  d
d+2β d+2β
not depending on n and P such that N · logn n > l and logS n · logn n > l for
1 1
β  3
  3

any n > 1/l. We then take δ = δn := cl d · (lognn)  (lognn)
2+d/β 2+d/β
. Thus
limn→∞ n·δ1 n = 0 = limn→∞ δn , which means that n1 ≤ δn < 1/3 for n > Cl,c,d,β . We then
deduce from (C.151) that
n o
inf EPφ (f ) f ∈ FdFNN (G, N, S, B, F )
( −d −d !)
φ FNN (log n)3 2β+d (log n)3 2β+d
≤ inf EP (f ) f ∈ Fd c log n, l ,l log n, B, F
n n
  d d

φ FNN −β −β
= inf EP (f ) f ∈ Fd c log n, cδn , cδn log n, B, F
  
1 −d −d 1
≤ inf EPφ (f ) f ∈ FdFNN c log , cδn β , cδn β log , B, F
δn δn
  d d

φ FNN 1 −β −β 1
≤ inf EP (f ) f ∈ Fd Cd,β log , Cd,β δn , Cd,β δn log , 1, 1
δn δn
2
≤ δn , ∀ n > Cl,c,d,β ,
(C.152)
where we use the fact the infimum taken over a larger  set is smaller. Define W = 3 ∨
N FdFNN (G, N, S, B, F ) , n1 . Then by taking F = f |[0,1]d f ∈ FdFNN (G, N, S, B, F ) ,


ψ(x, y) = φ(yf ∗ (x)), Γ = C e , M = 2, γ = 1 in Theorem 2.1, and using (C.145), (C.150),


n
(C.152), we deduce that
" # " #
2 2 h i
EP ⊗n fˆFNN − f ∗ = EP ⊗n fˆFNN − f ∗ e P ⊗n E φ (fˆFNN )
≤ CE
n φ,P n P n
L2P L2P
X X
   Z 
by Lemma C.3
========== CE e P ⊗n Rφ fˆFNN − ψ(x, y)dP (x, y)
P n
[0,1]d ×{−1,1}
2
500 · C
e · log W  Z 
≤ + 2C
e inf RφP (f ) − ψ(x, y)dP (x, y)
n f ∈F [0,1]d ×{−1,1}

112
Classification with Deep Neural Networks and Logistic Loss

2 2
by Lemma C.3
500 · C
e · log W 500 · C
e · log W
========== e inf E φ (f ) ≤
+ 2C e 2
+ 2Cδ
P n
n f ∈F n

for n > Cl,c,d,β . Taking the supremum, we obtain,


2
500 · C · log W
" #
2
e
sup EP ⊗n fˆnFNN − fφ,P

≤ e 2 , ∀ n > Cl,c,d,β .
+ 2Cδn
(C.153)
P ∈H7d,β
L2P
X
n

Moreover, it follows from (3.8) and Corollary A.1 that

log W ≤ (S + Gd + 1)(2G + 5) log ((max {N, d} + 1) · B · (2nG + 2n)) . (G + S)G log n


 d !  2β
(log n)3 d+2β
 
n d+2β
. log n + log n · (log n) · (log n) . n · .
log3 n n

Plugging this into (C.153), we obtain


" #
2 log W
sup EP ⊗n fˆFNN − f ∗ n φ,P . + δn2
P ∈H7d,β
L2P
X
n
2β 1 2 2β
(log n)3 (log n)3 (log n)3
d+2β
  2+d/β d+2β
. + . ,
n n n

which proves the desired result.

References
Martin Anthony and Peter L Bartlett. Neural Network Learning: Theoretical Foundations.
Cambridge University Press, New York, NY, 2009.

Kendall Atkinson and Weimin Han. Theoretical Numerical Analysis: A Functional Analysis
Framework. Springer, New York, NY, third edition, 2009.

Jean-Yves Audibert and Alexandre B Tsybakov. Fast learning rates for plug-in classifiers.
The Annals of Statistics, 35(2):608–633, 2007.

Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexi-
ties. The Annals of Statistics, 33(4):1497–1537, 2005.

Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-
dimension and pseudodimension bounds for piecewise linear neural networks. The Journal
of Machine Learning Research, 20(1):2285–2301, 2019.

Felipe Cucker and Ding Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint.
Cambridge University Press, New York, NY, 2007.

113
Zhang, Shi and Zhou

Lawrence C Evans. Partial Differential Equations. American Mathematical Society, Provi-


dence, RI, second edition, 2010.

Zhiying Fang, Han Feng, Shuo Huang, and Ding-Xuan Zhou. Theory of deep convolutional
neural networks II: Spherical analysis. Neural Networks, 131:154–162, 2020.

Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation
and inference. Econometrica, 89(1):181–213, 2021.

Han Feng, Shuo Huang, and Ding-Xuan Zhou. Generalization analysis of CNNs for classifi-
cation on spheres. IEEE Transactions on Neural Networks and Learning Systems, 34(9):
6200–6213, 2023.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT press, Cam-
bridge, MA, 2016.

Xin Guo, Lexin Li, and Qiang Wu. Modeling interactive components by coordinate kernel
polynomial models. Mathematical Foundations of Computing, 3(4):263–277, 2020.

Zheng-Chu Guo, Dao-Hong Xiang, Xin Guo, and Ding-Xuan Zhou. Thresholded spectral
algorithms for sparse approximations. Analysis and Applications, 15(3):433–455, 2017.

László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk. A Distribution-free Theory
of Nonparametric Regression. Springer, New York, NY, 2002.

Song Han, Jeff Pool, John Tran, and William J Dally. Learning both weights and connec-
tions for efficient neural networks. In Proceedings of the Advances in Neural Information
Processing Systems 28, pages 1135–1143, Montreal, Canada, 2015.

Juncai He, Lin Li, and Jinchao Xu. Approximation properties of deep ReLU CNNs. Research
in the Mathematical Sciences, 9(3):1–24, 2022.

Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep
Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep
neural networks for acoustic modeling in speech recognition: The shared views of four
research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

Tianyang Hu, Ruiqi Liu, Zuofeng Shang, and Guang Cheng. Minimax optimal deep neural
network classifiers under smooth decision boundary. arXiv preprint arXiv:2207.01602,
2022a.

Tianyang Hu, Jun Wang, Wenjia Wang, and Zhenguo Li. Understanding square loss in
training overparametrized neural network classifiers. In Proceedings of the Advances in
Neural Information Processing Systems 35, pages 16495–16508, New Orleans, LA, United
States, 2022b.

Like Hui and Mikhail Belkin. Evaluation of neural architectures trained with square loss vs
cross-entropy in classification tasks. In Proceedings of the Ninth International Conference
on Learning Representations, pages 1–17, Virtual Event, 2021.

114
Classification with Deep Neural Networks and Logistic Loss

Masaaki Imaizumi and Kenji Fukumizu. Deep neural networks learn non-smooth functions
effectively. In Proceedings of the Twenty-second International Conference on Artificial
Intelligence and Statistics, pages 869–878, Naha, Okinawa, Japan, 2019.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. Deep un-
ordered composition rivals syntactic methods for text classification. In Proceedings of
the Fifty-third Annual Meeting of the Association for Computational Linguistics and the
Seventh International Joint Conference on Natural Language Processing (Volume 1: Long
Papers), pages 1681–1691, Beijing, China, 2015.

Katarzyna Janocha and Wojciech Czarnecki. On loss functions for deep neural networks in
classification. Schedae Informaticae, 25:49–59, 2016.

Ziwei Ji, Justin Li, and Matus Telgarsky. Early-stopped neural networks are consistent. In
Proceedings of the Advances in Neural Information Processing Systems 34, pages 1805–
1817, Virtual Event, 2021.

Iain M. Johnstone. Oracle inequalities and nonparametric function estimation. In Proceed-


ings of the Twenty-third International Congress of Mathematicians (Volume III), pages
267–278, Berlin, Germany, 1998.

Yongdai Kim, Ilsang Ohn, and Dongha Kim. Fast convergence rates of deep neural networks
for classification. Neural Networks, 138:179–197, 2021.

Diederik P Kingma and Jimmy Ba. ADAM: A method for stochastic optimization. In
Proceedings of the Third International Conference on Learning Representations, pages
1–15, San Diego, CA, United States, 2015.

Michael Kohler and Sophie Langer. Statistical theory for image classification using deep
convolutional neural networks with cross-entropy loss. arXiv preprint arXiv:2011.13602,
2020.

Michael Kohler, Adam Krzyżak, and Benjamin Walter. On the rate of convergence of image
classifiers based on convolutional neural networks. Annals of the Institute of Statistical
Mathematics, 74:1085–1108, 2022.

Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk mini-
mization. The Annals of Statistics, 34(6):2593–2656, 2006.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. In Proceedings of the Advances in Neural Information
Processing Systems 25, pages 1097–1105, Lake Tahoe, NV, United States, 2012.

Shao-Bo Lin and Ding-Xuan Zhou. Distributed kernel-based gradient descent algorithms.
Constructive Approximation, 47(2):249–276, 2018.

Yufeng Liu, Hao Helen Zhang, and Yichao Wu. Hard or soft classification? Large-margin
unified machines. Journal of the American Statistical Association, 106(493):166–177,
2011.

115
Zhang, Shi and Zhou

Tong Mao, Zhongjie Shi, and Ding-Xuan Zhou. Theory of deep convolutional neural net-
works III: Approximating radial functions. Neural Networks, 144:778–790, 2021.
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine
Learning. MIT press, Cambridge, MA, second edition, 2018.
Kevin P Murphy. Machine Learning: A Probabilistic Perspective. MIT press, Cambridge,
MA, 2012.
Philipp Petersen and Felix Voigtlaender. Optimal approximation of piecewise smooth func-
tions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
Tomaso Poggio, Fabio Anselmi, and Lorenzo Rosasco. I-theory on depth vs width: Hierar-
chical function composition. Technical report, Center for Brains, Minds and Machines,
MIT, Cambridge, MA, 2015.
Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao.
Why and when can deep–but not shallow–networks avoid the curse of dimensionality: A
review. arXiv preprint arXiv:1611.00740, 2016.
Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao.
Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A
review. International Journal of Automation and Computing, 14(5):503–519, 2017.
Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU
activation function. The Annals of Statistics, 48(4):1875–1897, 2020.
Guohao Shen, Yuling Jiao, Yuanyuan Lin, and Jian Huang. Non-asymptotic excess
risk bounds for classification with deep convolutional neural networks. arXiv preprint
arXiv:2105.00292, 2021.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. In Proceedings of the Third International Conference on Learning
Representations, pages 1–14, San Diego, CA, United States, 2015.
Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, New York,
NY, 2008.
Charles J. Stone. Optimal global rates of convergence for nonparametric regression. The
Annals of Statistics, 10(4):1040–1053, 1982.
Taiji Suzuki. Adaptivity of deep ReLU network for learning in Besov and mixed smooth
Besov spaces: Optimal rate and curse of dimensionality. In Proceedings of the Seventh
International Conference on Learning Representations, pages 1–25, New Orleans, LA,
United States, 2019.
Alexander B Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals
of Statistics, 32(1):135–166, 2004.
Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer, New York,
NY, 2009.

116
Classification with Deep Neural Networks and Logistic Loss

Martin J Wainwright. High-dimensional Statistics: A Non-asymptotic Viewpoint. Cam-


bridge University Press, Cambridge, 2019.

Dao-Hong Xiang. Classification with Gaussians and convex loss II: Improving error bounds
by noise conditions. Science China Mathematics, 54(1):165–171, 2011.

Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural
Networks, 94:103–114, 2017.

Zihan Zhang, Lei Shi, and Ding-Xuan Zhou. Convolutional neural networks for spherical
data classification. preprint, 2024.

Ding-Xuan Zhou. Deep distributed convolutional neural networks: Universality. Analysis


and Applications, 16(6):895–919, 2018.

Ding-Xuan Zhou. Theory of deep convolutional neural networks: Downsampling. Neural


Networks, 124:319–327, 2020a.

Ding-Xuan Zhou. Universality of deep convolutional neural networks. Applied and Compu-
tational Harmonic Analysis, 48(2):787–794, 2020b.

117

You might also like