Yan Sun Dissertation


SPARSE DEEP LEARNING AND STOCHASTIC NEURAL NETWORK
by
Yan Sun

A Dissertation
Submitted to the Faculty of Purdue University
In Partial Fulfillment of the Requirements for the degree of

Doctor of Philosophy

Department of Statistics
West Lafayette, Indiana
August 2022
THE PURDUE UNIVERSITY GRADUATE SCHOOL
STATEMENT OF COMMITTEE APPROVAL

Dr. Faming Liang, Chair


Distinguished Professor of Statistics

Dr. Xiao Wang


Professor of Statistics

Dr. Chuanhai Liu


Professor of Statistics

Dr. Vinayak Rao


Associate Professor of Statistics

Approved by:
Dr. Jun Xie

To my beloved family.

ACKNOWLEDGMENTS

First and foremost, I want to express my sincere gratitude to my advisor, Dr. Faming
Liang, for his invaluable guidance and support over the years. Dr. Liang’s guidance has not
only contributed to our research project, but also helped me grow professionally. His passion
for our research has always inspired me to work hard and achieve more. This dissertation
would not have been possible without the enormous time and energy he spent on me. I would
also like to thank Dr. Qifan Song for his support and contributions to the theoretical
development of our research. I would further like to thank Dr. Chuanhai Liu, Dr. Xiao Wang,
and Dr. Vinayak Rao, who generously served as my advisory committee members. Their
insightful suggestions and comments provided different perspectives and inspired me to think
more deeply.
I want to express sincere appreciation to my friends at Purdue, Mao Ye, Xinlin Tao,
Chuan Zuo, Wei Deng, Peiyi Zhang, Siqi Liang, Sehwan Kim, Xinyi Pei, Zhanyu Wang,
and many others. Thank you for the fun times together and for the discussions, collaboration,
and exchange of thoughts in both study and life.
To my parents and beloved family, thank you for your unconditional love. I would not
have achieved anything without your support.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

1 INTRODUCTION
1.1 Sparse Deep Learning
1.2 Kernel-Expanded Stochastic Neural Network

2 SPARSE DEEP LEARNING
2.1 Bayesian Sparse DNNs with mixture Gaussian Prior
2.2 Posterior Consistency
2.3 Consistency of DNN Structure Selection
    2.3.1 Marginal Posterior Inclusion Probability Approach
    2.3.2 Laplace Approximation of Marginal Posterior Inclusion Probabilities
2.4 Asymptotic Normality of Connection Weights
    2.4.1 Asymptotic Normality of Prediction
2.5 Asymptotically Optimal Generalization Bound
2.6 Computation
    2.6.1 Bayesian Evidence Approach
    2.6.2 Prior Annealing: Frequentist Computation
    2.6.3 Prior Annealing: Bayesian Computation
    2.6.4 Construct Confidence Interval
2.7 Numerical Experiments
    2.7.1 Synthetic Example
    2.7.2 Real Data Example
    2.7.3 Experimental Setups
        Synthetic Example
        Real Data Examples
2.8 Discussion
2.9 Technical Proofs
    2.9.1 Proofs on Posterior Consistency
        Basic Formulas of Bayesian Neural Networks
            Normal Regression
            Logistic Regression
        Posterior Consistency of General Statistical Models
        General Shrinkage Prior Settings for Deep Neural Networks
        Proof of Theorem 2.2.1
    2.9.2 Proofs on Structure Selection Consistency
        Proof of Theorem 2.3.1
        Proof of Theorem 2.3.2
        Proof of Lemma 2.3.1
        Verification of the Bounded Gradient Condition in Theorem 2.3.2
        Approximation of Bayesian Evidence
    2.9.3 Proofs of Asymptotic Normality
        Proof of Theorem 2.4.1
        Proof of Theorem 2.4.2
    2.9.4 Proofs on Generalization Bounds
        Proof of Theorem 2.5.1
        Proof of Theorem 2.5.2
        Proof of Theorems 2.5.3 and 2.5.4
    2.9.5 Mathematical facts of sparse DNN
    2.9.6 Proof of Theorem 2.6.1

3 A KERNEL-EXPANDED STOCHASTIC NEURAL NETWORK
3.1 A Kernel-Expanded Stochastic Neural Network
    3.1.1 A Kernel-Expanded Neural Network
    3.1.2 A Kernel-Expanded StoNet as an Approximator to KNN
    3.1.3 The Imputation-Regularized Optimization Algorithm
    3.1.4 Hyperparameter Setting
3.2 Illustrative Examples
    3.2.1 A full row rank example
    3.2.2 A measurement error example
3.3 Real Data Examples
    3.3.1 QSAR Androgen Receptor
    3.3.2 MNIST Data
    3.3.3 CoverType Data
    3.3.4 More UCI Datasets
3.4 Prediction Uncertainty Quantification with K-StoNet
    3.4.1 A Recursive Formula for Uncertainty Quantification
    3.4.2 A Numerical Example
3.5 Parameter Settings for K-StoNet
3.6 Discussion
3.7 Technical Proofs
    3.7.1 Proof of Theorem 3.1.2
    3.7.2 Proof of Lemma 3.1.2

REFERENCES

LIST OF TABLES

2.1 Simulation Result: MSFE and MSPE were calculated by averaging over 10
datasets, and their standard deviations were given in the parentheses.
2.2 ResNet network pruning results for CIFAR-10 data, which were calculated by
averaging over 3 independent runs with the standard deviation reported in the
parentheses.
2.3 ResNet network pruning results for CIFAR-10 data, which were calculated by
averaging over 3 independent runs with the standard deviation reported in the
parentheses.
3.1 Performance of the K-StoNet model with different values of ε, where the model
was evaluated at the last iteration, #SV represents the average number of support
vectors selected by the SVRs at the first hidden layer, and the number in the
parentheses represents the standard deviation of the average.
3.2 Training and prediction accuracy (%) for QSAR androgen receptor data, where
“T” and “P” denote the training and prediction accuracy, respectively.
3.3 Training and prediction accuracy (%) for CoverType, where “T” and “P” denote
the training and prediction.
3.4 Average test RMSE (and its standard error) by variational inference (VI, [108]),
probabilistic back-propagation (PBP, [107]), dropout (Dropout, [106]), SGD via
back-propagation (BP), and KNN, where N denotes the dataset size and p denotes
the input dimension. For each dataset, the boldfaced values are the best result or
the second best result if it is insignificantly different from the best one according
to a t-test with a significance level of 0.05.

LIST OF FIGURES

2.1 Negative logarithm of the mixture Gaussian prior.
2.2 Negative log-prior and $\pi(\gamma = 1 \mid \beta)$ for different choices of $\sigma_{0,n}^2$ and $\lambda_n$.
2.3 Prediction intervals of 20 testing points, where the y-axis is the response value,
the x-axis is the index, and the blue point represents the true observation.
3.1 An illustrative plot of K-StoNet.
3.2 Upper Panel: paths of the mean squared error (MSE) produced by K-StoNet and
an unregularized DNN for one simulated dataset; and Lower Panel: best MSE
(by the current epoch) produced by SGD for a regularized DNN and K-StoNet
over 10 runs.
3.3 MSE paths produced by two K-StoNets, one KNN, and two DNNs for the data
generated from a KNN model: the left plot is for training and the right plot is
for testing.
3.4 MSE paths produced by K-StoNets and DNNs: (upper) one-hidden-layer
networks; (lower) three-hidden-layer networks.
3.5 Training and prediction accuracy paths (along with epochs) produced by
K-StoNet, KNN and DNN in one fold of the cross-validation experiment for the
QSAR androgen receptor data.
3.6 Training and prediction accuracy paths (along with computational time)
produced by K-StoNets, KNN and DNNs in one fold of the cross-validation
experiment for the QSAR androgen receptor data.
3.7 Training and test accuracy versus epochs produced by K-StoNet and DNN
(LeNet-300-100) for the MNIST data, where K-StoNet achieved a prediction accuracy of
98.87%, and LeNet-300-100 achieved a prediction accuracy of 98.38%.
3.8 95% prediction intervals produced by K-StoNet for 20 test points, where the
x-axis indexes the test points, the y-axis represents the response value, and the blue
star represents the true observation.

ABSTRACT

Deep learning has achieved state-of-the-art performance on many machine learning tasks,
but the deep neural network (DNN) model still suffers from a few issues. An over-parametrized
neural network generally has a better optimization landscape, but it is computationally
expensive and hard to interpret, and it usually cannot correctly quantify the prediction
uncertainty. On the other hand, a small DNN model can suffer from local traps and be hard
to optimize. In this dissertation, we tackle these issues from two directions: sparse deep
learning and stochastic neural networks.
For sparse deep learning, we propose a Bayesian neural network (BNN) model with a
mixture of normal prior. Theoretically, we establish posterior consistency and structure
selection consistency, which ensure that the sparse DNN model can be consistently identified.
We also demonstrate the asymptotic normality of the prediction, which ensures that the
prediction uncertainty is correctly quantified. Computationally, we propose a prior annealing
approach to optimize the posterior of the BNN. The proposed methods have a computational
complexity similar to that of the standard stochastic gradient descent method for training
DNNs. Experimental results show that our model performs well on high-dimensional variable
selection as well as neural network pruning.
For stochastic neural networks, we propose a kernel-expanded stochastic neural network
model, or K-StoNet for short. We reformulate the DNN as a latent variable model
and incorporate support vector regression (SVR) as the first hidden layer. The latent
variable formulation breaks the training into a series of convex optimization problems, and
the model can be easily trained using the imputation-regularized optimization (IRO) algorithm.
We provide theoretical guarantees for the convergence of the algorithm and for the prediction
uncertainty quantification. Experimental results show that the proposed model can achieve
good prediction performance and provide correct confidence regions for prediction.

1. INTRODUCTION

During the past decade, the deep neural network (DNN) has achieved great success in
solving many complex machine learning tasks such as pattern recognition and natural
language processing. However, the DNN model still suffers from a few issues. The DNNs used in
practice may consist of hundreds of layers and millions of parameters, see e.g. [1] on image
classification. Most of those DNNs are severely over-parametrized. For example, [2] showed
that in some networks, only 5% of the parameters are enough to achieve acceptable models.
Training and operation of DNNs of this scale entail formidable computational challenges.
Over-parameterization also makes the DNN model less interpretable and miscalibrated [3],
which can cause serious issues in human-machine trust and thus hinder applications of arti-
ficial intelligence (AI) in human life.
On the other hand, a small DNN model trained from scratch often performs worse
than an over-parametrized DNN model [4]. A line of research has been devoted to
understanding the optimization landscape and training process of DNN models. For example,
[5] and [6] studied the training loss surface of over-parameterized DNNs. They showed that
for a fully connected DNN, almost all local minima are globally optimal if the width of
one layer of the DNN is no smaller than the training sample size and the network structure
from this layer on is pyramidal. Recently, [7]–[9] and [10] explored the convergence theory of
gradient-based algorithms in training over-parameterized DNNs. They showed that
gradient-based algorithms with random initialization can converge to global minima provided
that the width of the DNN is polynomial in the training sample size. A small DNN model does
not enjoy these good properties of the optimization landscape or training process. It can
suffer from local traps and be hard to optimize.
In this dissertation, we tackle these issues from two directions. First, we consider sparse
deep learning. Sparse deep learning starts with an over-parametrized DNN model and then
identifies a sparse model in which most parameters are zero and which can perform as well as
a dense one. Starting with over-parametrization allows the model to enjoy good optimization
properties, while the sparse model can be easier to interpret and better calibrated. Using the
sparse model for future prediction can also save computational cost. From another direction,
we propose a so-called kernel-expanded stochastic neural network (K-StoNet) model, which
incorporates support vector regression (SVR) as the first hidden layer and reformulates the
neural network as a latent variable model. The former maps the input vector into an
infinite-dimensional feature space via a radial basis function (RBF) kernel, ensuring the
absence of local minima on its training loss surface. The latter breaks the high-dimensional
nonconvex neural network training problem into a series of low-dimensional convex
optimization problems, and enables its prediction uncertainty to be easily assessed. For both
directions, we provide theoretical guarantees for the proposed model and demonstrate its
performance on synthetic and real data sets. The remaining part of this dissertation is
organized as follows. In Section 1.1, we introduce the background of sparse deep learning and
our proposed approach. In Section 1.2, we review stochastic neural network models and
introduce our K-StoNet model. The subsequent two chapters, Chapters 2 and 3, contain the
detailed formulation of the models, their theoretical properties, and experimental results.
Discussion and technical proofs are given at the end of each chapter.

1.1 Sparse Deep Learning

The desire to identify a sparse model naturally leads to two questions: (i) Is a sparsely
connected DNN able to approximate the target mapping with a desired accuracy? and (ii)
How can one train and determine the structure of a sparse DNN? There has been some work in
the literature trying to address these questions.
The approximation power of sparse DNNs has been studied in the literature from both
frequentist and Bayesian perspectives. From the frequentist perspective, [11] quantifies the
minimum network connectivity that guarantees uniform approximation rates for a class of
affine functions; and [12] and [13] characterize the approximation error of a sparsely
connected neural network for Hölder smooth functions. From the Bayesian perspective, [14]
established posterior consistency for Bayesian shallow neural networks under mild conditions;
and [15] established posterior consistency for Bayesian DNNs but under some restrictive
conditions, such as requiring that a spike-and-slab prior be used for the connection weights,
that the activation function be ReLU, and that the number of input variables remain of order
$O(1)$ while the sample size grows to infinity.
The existing methods for learning sparse DNNs are usually developed separately from
the approximation theory. For example, [16], [17] and [18] developed some regularization
methods for learning sparse DNNs; [19] showed that dropout training is approximately
equivalent to an $L_2$-regularization; [20] introduced a deep compression pipeline, where pruning,
trained quantization and Huffman coding work together to reduce the storage requirement
of DNNs; [21] proposed a sparse decomposition method to sparsify convolutional neural net-
works (CNNs); [22] considered a lottery ticket hypothesis for selecting a sparse subnetwork;
and [23] proposed to learn Bayesian sparse neural networks via node selection with a horse-
shoe prior under the framework of variational inference. For these methods, it is generally
unclear if the resulting sparse DNN is able to provide a desired approximation accuracy to
the true mapping and how close in structure the sparse DNN is to the underlying true DNN.
In this dissertation, we propose a Bayesian neural network (BNN) with a mixture of
normal prior. Theoretically, we first establish posterior consistency for the BNN and
consistency of structure selection based on the marginal posterior inclusion probabilities,
which ensures that the posterior concentrates around the true model. Then we establish
consistency of the sparsified DNN via a Laplace approximation to the marginal posterior
inclusion probabilities, which ensures that the sparse structure can be consistently identified
by finding the maximum of the posterior distribution. To quantify the prediction uncertainty
of the model, we establish the Bernstein-von Mises (BvM) theorem for network prediction.
Computationally, we provide a prior annealing algorithm to learn the sparse neural network
model. Our proposed learning algorithm has a computational cost similar to that of the
standard stochastic gradient descent (SGD) method. Our numerical results indicate that the
proposed models can work very well for large-scale DNN compression and high-dimensional
nonlinear variable selection. In addition to the mixture of normal prior, we will also discuss
other possible choices of priors and their properties from both theoretical and computational
perspectives.

1.2 Kernel-Expanded Stochastic Neural Network

Introducing noise in neural network training, or using a stochastic neural network model,
has also been a promising approach to improving the performance of neural networks. Famous
examples include deep belief networks [24] and deep Boltzmann machines [25], which have
advanced the development of machine learning. Recently, some researchers have proposed to
add noise to the DNN to improve its performance. For example, [26] proposed the dropout
method to prevent the DNN from over-fitting by randomly dropping some hidden and visible
units during training; [27] proposed to add gradient noise to improve training; and [28]–[30]
proposed to use stochastic activations through adding noise to improve generalization and
adversarial robustness. However, these methods are usually not systematic, and theoretical
guarantees are hard to provide.
In this dissertation, we propose a new neural network model, the so-called kernel-expanded
stochastic neural network (K-StoNet). The new model incorporates support vector regres-
sion (SVR) [31], [32] as the first hidden layer and reformulates the neural network as a
latent variable model. The former maps the input vector from its original space into an
infinite dimensional feature space, ensuring all local minima on the loss surface are globally
optimal. The latter resolves the parameter optimization and statistical inference issues as-
sociated with the neural network: it breaks the high-dimensional nonconvex neural network
training problem into a series of low-dimensional convex optimization problems, and enables
the prediction uncertainty to be easily assessed. The new model can be easily trained using the
imputation-regularized optimization (IRO) algorithm [33], which converges very fast, usu-
ally within a small number of epochs. Moreover, the introduction of the SVR layer with a
universal kernel [34], [35] enables K-StoNet to work with a smaller network, while ensuring
the universal approximation capability.
Compared to existing stochastic neural network models, K-StoNet is developed under a
rigorous statistical framework, whose convergence to the global optimum is asymptotically
guaranteed and whose prediction uncertainty can be easily assessed.

2. SPARSE DEEP LEARNING
2.1 Bayesian Sparse DNNs with mixture Gaussian Prior

Let $D_n = (x^{(i)}, y^{(i)})_{i=1,\dots,n}$ denote a training dataset of $n$ i.i.d. observations, where
$x^{(i)} \in \mathbb{R}^{p_n}$, $y^{(i)} \in \mathbb{R}$, and $p_n$ denotes the dimension of input variables and is assumed to grow with
the training sample size $n$. We first study the posterior approximation theory of Bayesian
sparse DNNs under the framework of generalized linear models, for which the distribution
of $y$ given $x$ is given by

\[ f(y \mid \mu^*(x)) = \exp\{A(\mu^*(x))\,y + B(\mu^*(x)) + C(y)\}, \]

where $\mu^*(x)$ denotes a nonlinear function of $x$, and $A(\cdot)$, $B(\cdot)$ and $C(\cdot)$ are appropriately
defined functions. The theoretical results presented in this work mainly focus on logistic
regression models and normal linear regression models. For logistic regression, we have
$A(\mu^*) = \mu^*$, $B(\mu^*) = -\log(1 + e^{\mu^*})$, and $C(y) = 0$. For normal regression, by introducing
an extra dispersion parameter $\sigma^2$, we have $A(\mu^*) = \mu^*/\sigma^2$, $B(\mu^*) = -{\mu^*}^2/(2\sigma^2)$ and
$C(y) = -y^2/(2\sigma^2) - \log(2\pi\sigma^2)/2$. For simplicity, $\sigma^2 = 1$ is assumed to be known. How to extend our
results to the case that $\sigma^2$ is unknown will be discussed in Remark 2.2.3.
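
As a quick numerical check of the exponential-family form above, the following minimal Python sketch evaluates $f(y \mid \mu)$ for the two models. The function names are illustrative and do not come from this work, and $C(y)$ is taken to be 0 in the logistic case, an assumption made here so that the density coincides with the Bernoulli probability mass function.

```python
import numpy as np

def glm_density(y, mu, A, B, C):
    """Evaluate f(y | mu) = exp{A(mu) * y + B(mu) + C(y)}."""
    return np.exp(A(mu) * y + B(mu) + C(y))

def logistic_density(y, mu):
    # Logistic regression: A(mu) = mu, B(mu) = -log(1 + e^mu).
    # C(y) = 0 is an assumption of this sketch (it makes f a Bernoulli pmf).
    return glm_density(y, mu,
                       A=lambda m: m,
                       B=lambda m: -np.log1p(np.exp(m)),
                       C=lambda t: 0.0)

def normal_density(y, mu, sigma2=1.0):
    # Normal regression with dispersion sigma^2 (sigma^2 = 1 by default).
    return glm_density(y, mu,
                       A=lambda m: m / sigma2,
                       B=lambda m: -m**2 / (2.0 * sigma2),
                       C=lambda t: -t**2 / (2.0 * sigma2)
                                   - np.log(2.0 * np.pi * sigma2) / 2.0)

mu = 0.7                                   # hypothetical value of mu*(x)
p_success = 1.0 / (1.0 + np.exp(-mu))      # sigmoid(mu)
# The GLM form reproduces the Bernoulli pmf and the N(mu, 1) density.
assert np.isclose(logistic_density(1, mu), p_success)
assert np.isclose(normal_density(0.2, mu),
                  np.exp(-(0.2 - mu)**2 / 2.0) / np.sqrt(2.0 * np.pi))
```

With these choices, $\mu^*(x)$ plays the role of the logit in the classification case and of the conditional mean in the regression case.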
We approximate $\mu^*(x)$ using a DNN. Consider a DNN with $H_n - 1$ hidden layers and
$L_h$ hidden units at layer $h$, where $L_{H_n} = 1$ for the output layer and $L_0 = p_n$ for the input
layer. Let $w^h \in \mathbb{R}^{L_h \times L_{h-1}}$ and $b^h \in \mathbb{R}^{L_h \times 1}$, $h \in \{1, 2, \ldots, H_n\}$, denote the weights and bias of
layer $h$, and let $\psi^h : \mathbb{R}^{L_h \times 1} \to \mathbb{R}^{L_h \times 1}$ denote a coordinate-wise and piecewise differentiable
activation function of layer $h$. The DNN forms a nonlinear mapping

\[ \mu(\beta, x) = w^{H_n} \psi^{H_n - 1}\left[ \cdots \psi^{1}\left[ w^{1} x + b^{1} \right] \cdots \right] + b^{H_n}, \tag{2.1} \]

where $\beta = (w, b) = \{ w_{ij}^h, b_k^h : h \in \{1, 2, \ldots, H_n\},\ i, k \in \{1, \ldots, L_h\},\ j \in \{1, \ldots, L_{h-1}\} \}$ denotes
the collection of all weights and biases, consisting of $K_n = \sum_{h=1}^{H_n} (L_{h-1} \times L_h + L_h)$ elements
in total. To facilitate representation of the sparse DNN, we introduce an indicator
variable for each weight and bias of the DNN, which indicates the existence of the
connection in the network. Let $\gamma^{w^h}$ and $\gamma^{b^h}$ denote the matrix and vector of the indicator
variables associated with $w^h$ and $b^h$, respectively. Further, we let
$\gamma = \{ \gamma_{ij}^{w^h}, \gamma_k^{b^h} : h \in \{1, 2, \ldots, H_n\},\ i, k \in \{1, \ldots, L_h\},\ j \in \{1, \ldots, L_{h-1}\} \}$ and
$\beta_\gamma = \{ w_{ij}^h, b_k^h : \gamma_{ij}^{w^h} = 1,\ \gamma_k^{b^h} = 1,\ h \in \{1, 2, \ldots, H_n\},\ i, k \in \{1, \ldots, L_h\},\ j \in \{1, \ldots, L_{h-1}\} \}$,
which specify, respectively, the structure and associated parameters for a sparse DNN.
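
The mapping (2.1) and the role of the indicator variables can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the layer widths, the tanh activation, the random masks and all function names are hypothetical choices made for this example, not settings used in this work.

```python
import numpy as np

def count_parameters(widths):
    """K_n = sum_h (L_{h-1} * L_h + L_h) for widths = [L_0, L_1, ..., L_{H_n}]."""
    return sum(l_in * l_out + l_out
               for l_in, l_out in zip(widths[:-1], widths[1:]))

def sparse_forward(x, weights, biases, w_masks, b_masks, psi=np.tanh):
    """Evaluate mu(beta, gamma, x) as in (2.1) with masked weights and biases."""
    h = x
    last = len(weights) - 1
    for idx, (w, b, gw, gb) in enumerate(zip(weights, biases, w_masks, b_masks)):
        h = (w * gw) @ h + b * gb      # entries with gamma = 0 are switched off
        if idx < last:                 # the output layer in (2.1) is linear
            h = psi(h)
    return h

# Hypothetical toy network: L_0 = p_n = 3, one hidden layer of width 4, L_{H_n} = 1.
widths = [3, 4, 1]
rng = np.random.default_rng(0)
weights = [rng.normal(size=(l_out, l_in))
           for l_in, l_out in zip(widths[:-1], widths[1:])]
biases = [rng.normal(size=l_out) for l_out in widths[1:]]
w_masks = [rng.random(w.shape) < 0.5 for w in weights]   # gamma^{w^h}
b_masks = [rng.random(b.shape) < 0.5 for b in biases]    # gamma^{b^h}

x = rng.uniform(-1.0, 1.0, size=widths[0])
mu = sparse_forward(x, weights, biases, w_masks, b_masks)
K_n = count_parameters(widths)                                # 3*4 + 4 + 4*1 + 1
connectivity = sum(int(m.sum()) for m in w_masks + b_masks)   # |gamma|
```

Here `connectivity` counts the active connections, i.e. the size $|\gamma|$ of the sparse structure, while `K_n` counts all weights and biases of the dense network.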
To conduct Bayesian analysis for the sparse DNN, we consider a mixture Gaussian prior
specified as follows:

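\[
\begin{aligned}
\gamma_{ij}^{w^h} &\sim \mathrm{Bernoulli}(\lambda_n), \qquad \gamma_k^{b^h} \sim \mathrm{Bernoulli}(\lambda_n), \\
w_{ij}^h \mid \gamma_{ij}^{w^h} &\sim \gamma_{ij}^{w^h} N(0, \sigma_{1,n}^2) + (1 - \gamma_{ij}^{w^h}) N(0, \sigma_{0,n}^2), \\
b_k^h \mid \gamma_k^{b^h} &\sim \gamma_k^{b^h} N(0, \sigma_{1,n}^2) + (1 - \gamma_k^{b^h}) N(0, \sigma_{0,n}^2),
\end{aligned}
\tag{2.2}
\]

where $h \in \{1, 2, \ldots, H_n\}$, $i, k \in \{1, \ldots, L_h\}$, $j \in \{1, \ldots, L_{h-1}\}$, and $\sigma_{0,n}^2 < \sigma_{1,n}^2$ are prespecified
constants. Marginally, we have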

\[ w_{ij}^h \sim \lambda_n N(0, \sigma_{1,n}^2) + (1 - \lambda_n) N(0, \sigma_{0,n}^2), \qquad b_k^h \sim \lambda_n N(0, \sigma_{1,n}^2) + (1 - \lambda_n) N(0, \sigma_{0,n}^2). \tag{2.3} \]

Typically, we set $\sigma_{0,n}^2$ to be a very small value and $\sigma_{1,n}^2$ to be relatively large. When
$\sigma_{0,n}^2 \to 0$, the prior reduces to the spike-and-slab prior [36]. Therefore, this prior can be
viewed as a continuous relaxation of the spike-and-slab prior. Such a prior has been used by
many authors in Bayesian variable selection, see e.g., [37] and [38].
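
To make the prior concrete, the sketch below evaluates the negative log of the marginal prior (2.3), computes the conditional inclusion probability $\pi(\gamma = 1 \mid \beta)$ by Bayes' rule, and draws samples following the hierarchy (2.2). The hyperparameter values and function names are hypothetical choices for illustration only; the densities are evaluated directly, so a log-sum-exp form would be preferable for extreme values of $\beta$.

```python
import numpy as np

def normal_pdf(beta, var):
    """Density of N(0, var) at beta."""
    return np.exp(-beta**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def neg_log_prior(beta, lam, var0, var1):
    """Negative log of the marginal mixture Gaussian prior (2.3)."""
    return -np.log(lam * normal_pdf(beta, var1)
                   + (1.0 - lam) * normal_pdf(beta, var0))

def inclusion_prob(beta, lam, var0, var1):
    """pi(gamma = 1 | beta), obtained from (2.2) by Bayes' rule."""
    slab = lam * normal_pdf(beta, var1)
    spike = (1.0 - lam) * normal_pdf(beta, var0)
    return slab / (slab + spike)

def sample_prior(size, lam, var0, var1, rng):
    """Draw (gamma, beta) pairs following the hierarchy in (2.2)."""
    gamma = rng.random(size) < lam
    sd = np.where(gamma, np.sqrt(var1), np.sqrt(var0))
    return gamma, rng.normal(0.0, sd)

# Hypothetical hyperparameters: lambda_n, sigma_{0,n}^2, sigma_{1,n}^2.
lam, var0, var1 = 0.01, 1e-4, 1.0
rng = np.random.default_rng(1)
gamma, beta = sample_prior(10_000, lam, var0, var1, rng)

p_small = inclusion_prob(0.0, lam, var0, var1)   # weight at zero: spike dominates
p_large = inclusion_prob(1.0, lam, var0, var1)   # weight far from zero: slab dominates
```

A weight near zero is attributed almost entirely to the narrow component, while a weight of moderate size is attributed to the wide component, which is the behavior that makes the prior act as a continuous relaxation of the spike-and-slab prior.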

2.2 Posterior Consistency

Posterior consistency plays a major role in validating Bayesian methods especially for
high-dimensional models, see e.g. [39] and [40]. For DNNs, since the total number of
parameters Kn is often much larger than the sample size n, posterior consistency provides
a general guideline in prior setting or choosing prior hyperparameters for a class of prior
distributions. Otherwise, the prior information may dominate data information, rendering
a biased inference for the underlying true model. In what follows, we prove the posterior
consistency of the DNN model with the mixture Gaussian prior (2.3).

With slight abuse of notation, we rewrite $\mu(\beta, x)$ in (2.1) as $\mu(\beta, \gamma, x)$ for a sparse
network by including its network structure information. We assume $\mu^*(x)$ can be well
approximated by a sparse DNN with relevant variables, and call this sparse DNN the true
DNN. More precisely, we define the true DNN as

\[ (\beta^*, \gamma^*) = \operatorname*{arg\,min}_{(\beta, \gamma) \in \mathcal{G}_n,\ \|\mu(\beta, \gamma, x) - \mu^*(x)\|_{L^2(\Omega)} \le \varpi_n} |\gamma|, \tag{2.4} \]

where $\mathcal{G}_n := \mathcal{G}(C_0, C_1, \varepsilon, p_n, H_n, L_1, L_2, \ldots, L_{H_n})$ denotes the space of valid sparse networks
satisfying condition A.2 (given below) for the given values of $H_n$, $p_n$, and $L_h$’s, and $\varpi_n$ is some
sequence converging to 0 as $n \to \infty$. For any given DNN $(\beta, \gamma)$, the error $\mu(\beta, \gamma, x) - \mu^*(x)$
can be generally decomposed as the network approximation error $\mu(\beta^*, \gamma^*, x) - \mu^*(x)$ and
the network estimation error $\mu(\beta, \gamma, x) - \mu(\beta^*, \gamma^*, x)$. The $L^2$ norm of the former one is
bounded by $\varpi_n$, and the order of the latter will be given in Theorem 2.2.1. In what follows,
we will treat $\varpi_n$ as the network approximation error. In addition, we make the following
assumptions:

A.1 The input $x$ is bounded by 1 entrywise, i.e. $x \in \Omega = [-1, 1]^{p_n}$, and the density of
$x$ is bounded in its support $\Omega$ uniformly with respect to $n$.

A.2 The true sparse DNN model satisfies the following conditions:

A.2.1 The network structure satisfies: $r_n H_n \log n + r_n \log \bar{L} + s_n \log p_n \le C_0 n^{1-\varepsilon}$, where
$0 < \varepsilon < 1$ is a small constant, $r_n = |\gamma^*|$ denotes the connectivity of $\gamma^*$,
$\bar{L} = \max_{1 \le j \le H_n - 1} L_j$ denotes the maximum hidden layer width, and $s_n$ denotes the input
dimension of $\gamma^*$.

A.2.2 The network weights are polynomially bounded: $\|\beta^*\|_\infty \le E_n$, where $E_n = n^{C_1}$
for some constant $C_1 > 0$.

A.3 The activation function ψ is Lipschitz continuous with a Lipschitz constant of 1.
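
To give a feel for condition A.2.1, the following small sketch compares the structural complexity term $r_n H_n \log n + r_n \log \bar{L} + s_n \log p_n$ with the budget $C_0 n^{1-\varepsilon}$. The function names and all numerical values ($C_0$, $\varepsilon$, the network sizes) are hypothetical and chosen only to show how the condition ties the admissible network size to the sample size.

```python
import math

def structure_complexity(r_n, H_n, L_bar, s_n, p_n, n):
    """Left-hand side of A.2.1: r_n*H_n*log(n) + r_n*log(L_bar) + s_n*log(p_n)."""
    return r_n * H_n * math.log(n) + r_n * math.log(L_bar) + s_n * math.log(p_n)

def satisfies_a21(r_n, H_n, L_bar, s_n, p_n, n, C0=1.0, eps=0.1):
    """Check the inequality in A.2.1 for given (hypothetical) constants C0 and eps."""
    return structure_complexity(r_n, H_n, L_bar, s_n, p_n, n) <= C0 * n ** (1.0 - eps)

# Hypothetical sparse structure: 20 connections, depth 3, maximum width 50,
# 5 relevant inputs out of 1000 candidate inputs.
config = dict(r_n=20, H_n=3, L_bar=50, s_n=5, p_n=1000)

ok_large_n = satisfies_a21(n=10_000, **config)   # budget n^0.9 is about 3981
ok_small_n = satisfies_a21(n=100, **config)      # budget n^0.9 is about 63
```

In this toy setting the same sparse structure is admissible for $n = 10{,}000$ but not for $n = 100$, reflecting that the allowed connectivity grows with the training sample size.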

Assumption A.1 is a typical assumption for posterior consistency, see e.g., [15] and [39].
In practice, all bounded data can be normalized to satisfy this assumption, e.g. image data
are bounded and usually normalized before training. Assumption A.3 is satisfied by many
conventional activation functions such as sigmoid, tanh and ReLU.
Assumption A.2 specifies the class of DNN models that we are considering. They are
sparse, while still being able to approximate many types of functions arbitrarily well as the
training sample size becomes large, i.e., $\lim_{n \to \infty} \varpi_n = 0$. The approximation power of sparse
DNNs has been studied in several existing works. For example, for the functions that can
be represented by an affine system, [11] proved that if the network parameters are bounded
in absolute value by some polynomial $g(r_n)$, i.e. $\|\beta^*\|_\infty \le g(r_n)$, then the approximation
error $\varpi_n = O(r_n^{-\alpha^*})$ for some constant $\alpha^*$. To fit this result into our framework, we
can let $r_n \asymp n^{(1-\epsilon)/2}$ for some $0 < \epsilon < 1$, $p_n = d$ for some constant $d$, $H_n < r_n + d$
and $\bar{L} < r_n$ (i.e. the setting given in Proposition 3.6 of [11]). Suppose that the degree
of $g(\cdot)$ is $c_2$, i.e. $g(r_n) \prec r_n^{c_2}$, then $\|\beta^*\|_\infty \prec n^{c_2(1-\epsilon)/2} \prec n^{C_1} = E_n$ for some constant
$C_1 > c_2(1-\epsilon)/2$. Therefore, Assumption A.2 is satisfied with the approximation error
$\varpi_n = O(r_n^{-\alpha^*}) = O(n^{-\alpha^*(1-\epsilon)/2}) \overset{\Delta}{=} O(n^{-\varsigma})$ (by defining $\varsigma = \alpha^*(1-\epsilon)/2$), which goes to 0 as $n \to \infty$.
In summary, the minimax rate in $\sup_{\mu^*(x) \in \mathcal{C}} \inf_{(\beta, \gamma) \in \mathcal{G}_n} \|\mu(\beta, \gamma, x) - \mu^*(x)\|_{L^2(\Omega)} \in O(n^{-\varsigma})$ can
be achieved by sparse DNNs under our assumptions, where $\mathcal{C}$ denotes the class of functions
represented by an affine system.
Other than affine functions, our setup for the sparse DNN also matches the approximation
theory for many other types of functions. For example, Corollary 3.7 of [41] showed that
for a wide class of piecewise smooth functions with a fixed input dimension, a fixed depth
ReLU network can achieve an $\varpi_n$-approximation with $\log(r_n) = O(-\log \varpi_n)$ and
$\log E_n = O(-\log \varpi_n)$. This result satisfies condition A.2 by setting $\varpi_n = O(n^{-\varsigma})$ for some constant
$\varsigma > 0$. As another example, Theorem 3 of [12] (see also Lemma 5.1 of [15]) proved that any
bounded $\alpha$-Hölder smooth function $\mu^*(x)$ can be approximated by a sparse ReLU DNN with
the network approximation error $\varpi_n = O(\log(n)^{\alpha/p_n} n^{-\alpha/(2\alpha + p_n)})$ for some $H_n \asymp \log n \log p_n$,
$L_j \asymp p_n n^{p_n/(2\alpha + p_n)} / \log n$, $r_n = O(p_n^2 \alpha^{2 p_n} n^{p_n/(2\alpha + p_n)} \log p_n)$, and $E_n = C$ for some fixed
constant $C > 0$. This result also satisfies condition A.2.2 as long as $p_n^2 \ll \log n$.
It is important to note that there is a fundamental difference between the existing neural
network approximation theory and ours. In the existing neural network approximation
theory, no data is involved and a small network can potentially achieve an arbitrarily small
approximation error by allowing the connection weights to take values in an unbounded
space. In contrast, in our theory, the network approximation error, the network size, and
the bound of connection weights are all linked to the training sample size. A small network
approximation error is required only when the training sample size is large; otherwise, over-
fitting might be a concern from the point of view of statistical modeling. In the practice of
modern neural networks, the depth and width have been increased without much scruple.
These increases reduce the training error, improve the generalization performance under
certain regimes [42], but negatively affect model calibration [3]. We expect that our theory
can tame the powerful neural networks into the framework of statistical modeling; that is,
by selecting an appropriate network size according to the training sample size, the proposed
method can generally improve the generalization and calibration of the DNN model while
controlling the training error to a reasonable level.
Let $P^*$ and $E^*$ denote the respective probability measure and expectation for data $D_n$.
Let
\[ d(p_1, p_2) = \left( \int\!\!\int \left[ p_1(x,y)^{1/2} - p_2(x,y)^{1/2} \right]^2 dy\,dx \right)^{1/2} \]
denote the Hellinger distance between two density functions $p_1(x,y)$ and $p_2(x,y)$. Let $\pi(A \mid D_n)$ be the posterior probability of an
event $A$. The following theorem establishes posterior consistency for sparse DNNs under the
mixture Gaussian prior (2.3).

Theorem 2.2.1. Suppose Assumptions A.1-A.3 hold. If the mixture Gaussian prior (2.3)
satisfies the conditions: $\lambda_n = O(1/\{K_n [n^{H_n} (L p_n)]^{\tau}\})$ for some constant $\tau > 0$,
$E_n/\{H_n \log n + \log L\}^{1/2} \lesssim \sigma_{1,n} \lesssim n^{\alpha}$ for some constant $\alpha > 0$, and
$\sigma_{0,n} \lesssim \min\left\{ 1/\{\sqrt{n} K_n (n^{3/2} \sigma_{1,0}/H_n)^{H_n}\},\ 1/\{\sqrt{n} K_n (n E_n/H_n)^{H_n}\} \right\}$,
then there exists an error sequence $\epsilon_n^2 = O(\varpi_n^2) + O(\zeta_n^2)$ such that
$\lim_{n\to\infty} \epsilon_n = 0$ and $\lim_{n\to\infty} n\epsilon_n^2 = \infty$, and the posterior distribution satisfies

\[
\begin{aligned}
P^*\left\{ \pi[d(p_\beta, p_{\mu^*}) > 4\epsilon_n \mid D_n] \ge 2e^{-cn\epsilon_n^2} \right\} &\le 2e^{-cn\epsilon_n^2}, \\
E^*_{D_n}\, \pi[d(p_\beta, p_{\mu^*}) > 4\epsilon_n \mid D_n] &\le 4e^{-2cn\epsilon_n^2},
\end{aligned}
\tag{2.5}
\]

for sufficiently large $n$, where $c$ denotes a constant, $\zeta_n^2 = [r_n H_n \log n + r_n \log L + s_n \log p_n]/n$,
$p_{\mu^*}$ denotes the underlying true data distribution, and $p_\beta$ denotes the data distribution reconstructed by the Bayesian DNN based on its posterior samples.

The proof of Theorem 2.2.1 can be found in Section 2.9. Regarding this theorem, we
have a few remarks:

Remark 2.2.1. Theorem 2.2.1 provides a posterior contraction rate ϵn for the sparse BNN.
The contraction rate contains two components, ϖn and ζn , where ϖn , as defined previously,
represents the network approximation error, and ζn represents the network estimation error
measured in Hellinger distance. Since the estimation error ζn grows with the network con-
nectivity rn , there is a trade-off between the network approximation error and the network
estimation error. A larger network has a lower approximation error and a higher estimation
error, and vice versa.

Remark 2.2.2. Theorem 2.2.1 implies that given a training sample size n, the proposed
method can learn a sparse neural network with at most O(n/ log(n)) connections. Com-
pared to the fully connected DNN, the sparsity of the proposed BNN enables some theoretical
guarantees for its performance. The sparse BNN has nice theoretical properties, such as pos-
terior consistency, variable selection consistency, and asymptotically optimal generalization
bounds, which are beyond the ability of general neural networks. The latter two properties
will be established in Section 2.3 and Section 2.5, respectively.

Remark 2.2.3. Although Theorem 2.2.1 is proved by assuming $\sigma^2$ is known, it can be easily
extended to the case that $\sigma^2$ is unknown by assuming an inverse gamma prior $\sigma^2 \sim IG(a_0, b_0)$
for some constants $a_0, b_0 > 0$. If a relatively uninformative prior is desired, one can choose
$a_0 \in (0, 1)$ such that the inverse gamma prior is very diffuse with a non-existing mean value.
However, if $a_0 = b_0 = 0$, i.e., the Jeffreys prior $\pi(\sigma^2) \propto 1/\sigma^2$, the posterior consistency
theory established in Theorem 2.2.1 might no longer hold. In general, to achieve posterior
consistency, the prior is required, at least in our framework, to satisfy two conditions [39],
[43]: (i) a not too small prior probability is placed over the neighborhood of the true density,
and (ii) a very small prior probability is placed outside of a region that is not too complex.
The Jeffreys prior, and thus the joint prior of $\sigma^2$ and the regression coefficients,
satisfies neither of the two conditions. We note that the inverse gamma prior $\sigma^2 \sim IG(a_0, b_0)$
has long been used in Bayesian inference for many different statistical models,
such as linear regression [44], nonparametric regression [45], and Gaussian graphical models
[46].

2.3 Consistency of DNN Structure Selection

This section establishes consistency of DNN structure selection under posterior consis-
tency. It is known that the DNN model is generally nonidentifiable due to the symmetry of
the network structure. For example, the approximation µ(β, γ, x) can be invariant if one
permutes the orders of certain hidden nodes, simultaneously changes the signs of certain
weights and biases if tanh is used as the activation function, or re-scales certain weights and
biases if ReLU is used as the activation function. However, by introducing appropriate constraints, see e.g., [47] and [14], we can define a set of neural networks such that any possible
neural network can be represented by one and only one neural network in the set via node
permutation, sign changes, weight rescaling, etc. Let Θ denote such a set of DNNs, where each
element in Θ can be viewed as an equivalence class of DNN models. Let ν(γ, β) ∈ Θ be an
operator that maps any neural network to Θ via appropriate transformations such as nodes
permutation, sign changes, weight rescaling, etc. To serve the purpose of structure selection
in the space Θ, we consider the marginal posterior inclusion probability approach proposed
in [40] for high-dimensional variable selection.
For a better description of this approach, we reparameterize $\beta$ and $\gamma$ as

\[ \beta = (\beta_1, \beta_2, \ldots, \beta_{K_n}), \qquad \gamma = (\gamma_1, \gamma_2, \ldots, \gamma_{K_n}), \]

respectively, according to their elements. Where no confusion arises, we will often use the
indicator vector $\gamma$ and the active set $\{i : \gamma_i = 1, i = 1, 2, \ldots, K_n\}$ interchangeably; that is,
$i \in \gamma$ and $\gamma_i = 1$ are equivalent. In addition, we will treat the connection weights $w$ and
the hidden unit biases $b$ equally; that is, they will not be distinguished in $\beta$ and $\gamma$. For
convenience, we will call each element of $\beta$ and $\gamma$ a ‘connection’ in what follows.

2.3.1 Marginal Posterior Inclusion Probability Approach

For each connection $c_i$, we define its marginal posterior inclusion probability by

\[ q_i = \int \sum_{\gamma} e_{i|\nu(\gamma,\beta)}\, \pi(\gamma \mid \beta, D_n)\, \pi(\beta \mid D_n)\, d\beta, \quad i = 1, 2, \ldots, K_n, \tag{2.6} \]

where $e_{i|\nu(\gamma,\beta)}$ is the indicator for the existence of connection $c_i$ in the network $\nu(\gamma, \beta)$. Similarly, we define $e_{i|\nu(\gamma^*,\beta^*)}$ as the indicator for the existence of connection $c_i$ in the true model
$\nu(\gamma^*, \beta^*)$. The proposed approach is to choose the connections whose marginal posterior
inclusion probabilities are greater than a threshold value $\hat{q}$; that is, setting $\hat{\gamma}_{\hat{q}} = \{i : q_i > \hat{q}, i = 1, 2, \ldots, K_n\}$
as an estimator of $\gamma_* = \{i : e_{i|\nu(\gamma^*,\beta^*)} = 1, i = 1, \ldots, K_n\}$, where $\gamma_*$ can be
viewed as the uniquenized true model. To establish the consistency of $\hat{\gamma}_{\hat{q}}$, an identifiability
condition for the true model is needed. Let $A(\epsilon_n) = \{\beta : d(p_\beta, p_{\mu^*}) \ge \epsilon_n\}$. Define

\[ \rho(\epsilon_n) = \max_{1 \le i \le K_n} \int_{A(\epsilon_n)^c} \sum_{\gamma} |e_{i|\nu(\gamma,\beta)} - e_{i|\nu(\gamma^*,\beta^*)}|\, \pi(\gamma \mid \beta, D_n)\, \pi(\beta \mid D_n)\, d\beta, \]

which measures the structure difference between the true model and the sampled models on
the set $A(\epsilon_n)^c$. Then the identifiability condition can be stated as follows:

B.1 ρ(ϵn ) → 0, as n → ∞ and ϵn → 0.

That is, when n is sufficiently large, if a DNN has approximately the same probability distri-
bution as the true DNN, then the structure of the DNN, after mapping into the parameter
space Θ, must coincide with that of the true DNN. Note that this identifiability is different
from the one mentioned at the beginning of the section. The earlier one is only with respect
to structure and parameter rearrangement of the DNN. Theorem 2.3.1 concerns consistency
of γ̂ q̂ and its sure screening property, whose proof is given in Section 2.9.
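As a rough illustration of the thresholding rule built on (2.6), the sketch below estimates each $q_i$ by the fraction of posterior draws in which connection $c_i$ is present and keeps the connections with $q_i > \hat{q}$. The function name `select_connections` and the toy draws are hypothetical, and the draws are assumed to have already been mapped into Θ by ν(γ, β); this is a minimal sketch rather than the implementation used in the experiments.

```python
import numpy as np

def select_connections(gamma_draws, q_hat=0.5):
    """Estimate marginal inclusion probabilities and threshold them.

    gamma_draws: array of shape (S, K) with 0/1 entries; row s is the
    connection indicator vector of the s-th posterior draw (assumed to be
    already mapped into the constrained space Theta).
    """
    gamma_draws = np.asarray(gamma_draws, dtype=float)
    q = gamma_draws.mean(axis=0)          # Monte Carlo estimate of q_i
    selected = np.flatnonzero(q > q_hat)  # indices i with q_i > q_hat
    return q, selected

# Toy example: 4 draws over 3 connections (hypothetical values).
draws = [[1, 0, 1],
         [1, 0, 0],
         [1, 1, 0],
         [1, 0, 1]]
q, selected = select_connections(draws, q_hat=0.5)
```

With the strict inequality used in the text, a connection whose estimated probability equals the threshold exactly is not selected.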

Theorem 2.3.1. Assume that the conditions of Theorem 2.2.1 and the identifiability condition B.1 hold. Then

(i) $\max_{1 \le i \le K_n} \{|q_i - e_{i|\nu(\gamma^*,\beta^*)}|\} \overset{p}{\to} 0$, where $\overset{p}{\to}$ denotes convergence in probability;

(ii) (Sure screening) $P(\gamma_* \subset \hat{\gamma}_{\hat{q}}) \overset{p}{\to} 1$ for any pre-specified $\hat{q} \in (0, 1)$;

(iii) (Consistency) $P(\gamma_* = \hat{\gamma}_{0.5}) \overset{p}{\to} 1$.

For a network $\gamma$, it is easy to identify the relevant variables. Recall that $\gamma^{w^h} \in \mathbb{R}^{L_h \times L_{h-1}}$
denotes the connection indicator matrix of layer $h$. Let

\[ \gamma^x = \gamma^{w^{H_n}} \gamma^{w^{H_n-1}} \cdots \gamma^{w^1} \in \mathbb{R}^{1 \times p_n}, \tag{2.7} \]

and let $\gamma^x_i$ denote the $i$-th element of $\gamma^x$. It is easy to see that if $\gamma^x_i > 0$ then the variable
$x_i$ is effective in the network $\gamma$, and $\gamma^x_i = 0$ otherwise. Let $e_{x_i|\nu(\gamma^*,\beta^*)}$ be the indicator for
the effectiveness of variable $x_i$ in the network $\nu(\gamma^*, \beta^*)$, and let $\gamma^x_* = \{i : e_{x_i|\nu(\gamma^*,\beta^*)} = 1, i = 1, \ldots, p_n\}$
denote the set of true variables. Similar to (2.6), we can define the marginal
inclusion probability for each variable:

\[ q^x_i = \int \sum_{\gamma} e_{x_i|\nu(\gamma,\beta)}\, \pi(\gamma \mid \beta, D_n)\, \pi(\beta \mid D_n)\, d\beta, \quad i = 1, 2, \ldots, p_n. \tag{2.8} \]

Then we can select the variables whose marginal posterior inclusion probabilities are greater
than a threshold $\hat{q}^x$, e.g., setting $\hat{q}^x = 0.5$. As implied by (2.7), the consistency of structure
selection implies consistency of variable selection.
It is worth noting that the above variable selection consistency result is with respect
to the relevant variables defined by the true network γ ∗ . To achieve the variable selection
consistency with respect to the relevant variables of µ∗ (x), some extra assumptions are
needed in defining (β ∗ , γ ∗ ). How to specify these assumptions is an open problem, which we
leave to future research. However, as shown by our simulation example, the sparse model
(β ∗ , γ ∗ ) defined in (2.4) works well, which correctly identifies all the relevant variables of the
underlying nonlinear system.

2.3.2 Laplace Approximation of Marginal Posterior Inclusion Probabilities

Theorem 2.3.1 establishes the consistency of DNN structure selection based on the
marginal posterior inclusion probabilities. To obtain Bayesian estimates of the marginal
posterior inclusion probabilities, intensive Markov chain Monte Carlo (MCMC) simulations
are usually required. Instead of performing MCMC simulations, we propose to approximate
the marginal posterior inclusion probabilities using the Laplace method based on the DNN
model trained by an optimization method such as SGD. Traditionally, such an approximation is
required to be performed at the maximum a posteriori (MAP) estimate of the DNN. However,
finding the MAP estimate for a large DNN is not computationally guaranteed, as there can be many
local minima on its energy landscape. To tackle this issue, we propose a Bayesian evidence
method for eliciting sparse DNN models learned by an optimization method in multiple runs
with different initializations. Alternatively, we provide a prior annealing approach, which
incorporates the properties of the energy landscape of over-parameterized DNNs into model
training, to find the optimum of the posterior. See Section 2.6 for details. Since conventional
optimization methods such as SGD can be used to train the DNN here, the proposed method
is generally more computationally efficient than the standard Bayesian method. More importantly, as explained in Section 2.6, consistent estimates of the marginal posterior inclusion
probabilities might be obtained at a local maximizer of the log-posterior instead of the MAP
estimate. In what follows, we justify the validity of the Laplace approximation for marginal
posterior inclusion probabilities.
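Based on the marginal posterior distribution $\pi(\beta \mid D_n)$, the marginal posterior inclusion
probability $q_i$ of connection $c_i$ can be re-expressed as

\[ q_i = \int \pi(\gamma_i = 1 \mid \beta)\, \pi(\beta \mid D_n)\, d\beta, \quad i = 1, 2, \ldots, K_n. \]

Under the mixture Gaussian prior, it is easy to derive that

\[ \pi(\gamma_i = 1 \mid \beta) = \tilde{b}_i/(\tilde{a}_i + \tilde{b}_i), \tag{2.9} \]

where
\[ \tilde{a}_i = \frac{1-\lambda_n}{\sigma_{0,n}} \exp\left\{ -\frac{\beta_i^2}{2\sigma_{0,n}^2} \right\}, \qquad \tilde{b}_i = \frac{\lambda_n}{\sigma_{1,n}} \exp\left\{ -\frac{\beta_i^2}{2\sigma_{1,n}^2} \right\}. \]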
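For concreteness, the conditional inclusion probability in (2.9) can be evaluated as in the following sketch. The function name `inclusion_prob` and the hyperparameter values are illustrative assumptions, not settings taken from the experiments; the computation is carried out on the log scale because $\tilde{a}_i$ and $\tilde{b}_i$ underflow easily when $\sigma_{0,n}$ is small.

```python
import numpy as np

def inclusion_prob(beta, lam, sigma0, sigma1):
    """pi(gamma_i = 1 | beta) = b_i / (a_i + b_i) from (2.9), elementwise."""
    beta = np.asarray(beta, dtype=float)
    # log a_i: spike component, weight (1 - lam), standard deviation sigma0
    log_a = np.log1p(-lam) - np.log(sigma0) - beta**2 / (2.0 * sigma0**2)
    # log b_i: slab component, weight lam, standard deviation sigma1
    log_b = np.log(lam) - np.log(sigma1) - beta**2 / (2.0 * sigma1**2)
    # b/(a+b) = 1 / (1 + exp(log_a - log_b)), computed stably
    return np.exp(-np.logaddexp(0.0, log_a - log_b))

# Hypothetical hyperparameters: a narrow spike and a wider slab.
probs = inclusion_prob([0.0, 0.01, 0.1], lam=1e-3, sigma0=1e-3, sigma1=0.1)
```

Weights near zero receive an inclusion probability close to 0, while weights that are large relative to $\sigma_{0,n}$ receive a probability close to 1.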
Define
\[ h_n(\beta) = \frac{1}{n} \sum_{i=1}^{n} \log(p(y_i, x_i \mid \beta)) + \frac{1}{n} \log(\pi(\beta)), \tag{2.10} \]
where $p(y_i, x_i \mid \beta)$ denotes the likelihood function of the observation $(y_i, x_i)$ and $\pi(\beta)$ denotes
the prior as specified in (2.3). Then $\pi(\beta \mid D_n) = e^{nh_n(\beta)} / \int e^{nh_n(\beta)} d\beta$ and, for a function $b(\beta)$, the
posterior expectation is given by $\int b(\beta) e^{nh_n(\beta)} d\beta / \int e^{nh_n(\beta)} d\beta$. Let $\hat{\beta}$ denote a strict local maximum of
$\pi(\beta \mid D_n)$. Then $\hat{\beta}$ is also a local maximum of $h_n(\beta)$. Let $B_\delta(\beta)$ denote a Euclidean ball of
radius $\delta$ centered at $\beta$. Let $h_{i_1,i_2,\ldots,i_d}(\beta)$ denote the $d$-th order partial derivative
$\partial^d h_n(\beta) / \partial\beta_{i_1} \partial\beta_{i_2} \cdots \partial\beta_{i_d}$,
let $H_n(\beta)$ denote the Hessian matrix of $h_n(\beta)$, let $h_{ij}$ denote the $(i, j)$-th component of the
Hessian matrix, and let $h^{ij}$ denote the $(i, j)$-th component of the inverse of the Hessian matrix.
Recall that $\gamma^*$ denotes the set of indicators for the connections of the true sparse DNN, $r_n$
denotes the size of the true sparse DNN, and $K_n$ denotes the size of the fully connected
DNN. The following theorem justifies the Laplace approximation of the posterior mean for
a bounded function $b(\beta)$.

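Theorem 2.3.2. Assume that there exist positive numbers $\epsilon$, $M$, $\eta$, and $n_0$ such that for
any $n > n_0$, the function $h_n(\beta)$ in (2.10) satisfies the following conditions:

C.1 $|h_{i_1,\ldots,i_d}(\beta)| < M$ holds for any $\beta \in B_\epsilon(\hat{\beta})$ and any $1 \le i_1, \ldots, i_d \le K_n$, where $3 \le d \le 4$.

C.2 $|h^{ij}(\hat{\beta})| < M$ if $\gamma^*_i = \gamma^*_j = 1$ and $|h^{ij}(\hat{\beta})| = O(1/K_n^2)$ otherwise.

C.3 $\det\left(-\frac{n}{2\pi} H_n(\hat{\beta})\right)^{1/2} \int_{\mathbb{R}^{K_n} \setminus B_\delta(\hat{\beta})} e^{n(h_n(\beta) - h_n(\hat{\beta}))} d\beta = O(r_n^4/n) = o(1)$ for any $0 < \delta < \epsilon$.

For any bounded function $b(\beta)$, if $|b_{i_1,\ldots,i_d}(\beta)| = \left| \partial^d b(\beta) / \partial\beta_{i_1} \partial\beta_{i_2} \cdots \partial\beta_{i_d} \right| < M$ holds for any $1 \le d \le 2$
and any $1 \le i_1, \ldots, i_d \le K_n$, then for the posterior mean of $b(\beta)$, we have

\[ \frac{\int b(\beta) e^{nh_n(\beta)} d\beta}{\int e^{nh_n(\beta)} d\beta} = b(\hat{\beta}) + O\left( \frac{r_n^4}{n} \right). \]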

Conditions C.1 and C.3 are typical conditions for Laplace approximation, see e.g., [48].
Condition C.2 requires the inverse Hessian to have very small values for the elements corre-
sponding to the false connections. To justify condition C.2, we note that for a multivariate
normal distribution, the inverse Hessian is its covariance matrix. Thus, we expect that for
the weights with small variance, their corresponding elements in the inverse Hessian matrix
would be small as well. The following lemma quantifies the variance of the weights for the
false connections.

Lemma 2.3.1. Assume that $\sup_n \int |\beta_i|^{2+\delta} \pi(\beta_i \mid D_n) d\beta_i \le C < \infty$ a.s. for some constants
$\delta > 0$ and $C > 0$ and $\rho(\epsilon_n) \asymp \pi(d(p_\beta, p_{\mu^*}) \ge \epsilon_n \mid D_n)$, where $\rho(\epsilon_n)$ is defined in Condition
B.1. Then with an appropriate choice of prior hyperparameters and $\epsilon_n$,
$P^*\left\{ E(\beta_i^2 \mid D_n) \prec 1/K_n^{2H_n - 1} \right\} \ge 1 - 2e^{-n\epsilon_n^2/4}$ holds for any false connection $c_i$ in $\gamma^*$ (i.e., $\gamma^*_i = 0$).

In addition, with an appropriate choice of prior hyperparameters, we can also show that
$\pi(\gamma_i = 1 \mid \beta)$ satisfies all the requirements of $b(\beta)$ in Theorem 2.3.2 with a probability tending
to 1 as $n \to \infty$. Then, by Theorem 2.3.2, $q_i$ and $\pi(\gamma_i = 1 \mid \hat{\beta})$ are approximately the same as
$n \to \infty$, where $\pi(\gamma_i = 1 \mid \hat{\beta})$ is as defined in (2.9) but with $\beta$ replaced by $\hat{\beta}$. Combining with
Theorem 2.3.1, we have that $\pi(\gamma_i = 1 \mid \hat{\beta})$ is a consistent estimator of $e_{i|\nu(\gamma^*,\beta^*)}$.

2.4 Asymptotic Normality of Connection Weights

In this section, we establish the asymptotic normality of the network parameters and
predictions. Let $nl_n(\beta) = \sum_{i=1}^{n} \log(p_\beta(x_i, y_i))$ denote the log-likelihood function, and let
$\pi(\beta)$ denote the density of the mixture Gaussian prior (2.3). Let $h_{i_1,i_2,\ldots,i_d}(\beta)$ denote the
$d$-th order partial derivative $\partial^d l_n(\beta) / \partial\beta_{i_1} \partial\beta_{i_2} \cdots \partial\beta_{i_d}$. Let $H_n(\beta)$ denote the Hessian matrix of $l_n(\beta)$.
Let $h_{ij}(\beta)$ and $h^{ij}(\beta)$ denote the $(i, j)$-th component of $H_n(\beta)$ and $H_n^{-1}(\beta)$, respectively.
Let $\bar{\lambda}_n(\beta)$ and $\underline{\lambda}_n(\beta)$ denote the maximum and minimum eigenvalues of the Hessian matrix
$H_n(\beta)$, respectively. Let $B_{\lambda,n} = \bar{\lambda}_n^{1/2}(\beta^*) / \underline{\lambda}_n(\beta^*)$ and $b_{\lambda,n} = \sqrt{r_n/n}\, B_{\lambda,n}$, where $r_n$ is the
connectivity of $\gamma^*$. For a DNN parameterized by $\beta$, we define the weight truncation at the
true model $\gamma^*$: $(\beta_{\gamma^*})_i = \beta_i$ for $i \in \gamma^*$ and $(\beta_{\gamma^*})_i = 0$ otherwise. For the mixture Gaussian
prior (2.3), let
\[ B_{\delta_n}(\beta^*) = \left\{ \beta : |\beta_i - \beta^*_i| < \delta_n,\ \forall i \in \gamma^*;\ |\beta_i - \beta^*_i| < 2\sigma_{0,n} \log\left( \frac{\sigma_{1,n}}{\lambda_n \sigma_{0,n}} \right),\ \forall i \notin \gamma^* \right\}. \]
We follow the definition of asymptotic normality in [49] and [50]:

Definition 2.4.1. Denote by $d_\beta$ the bounded Lipschitz metric for weak convergence and
by $\phi_n$ the mapping $\phi_n : \beta \to \sqrt{n}(g(\beta) - g_*)$. We say that the posterior distribution of
the functional $g(\beta)$ is asymptotically normal with the center $g_*$ and variance $G$ if
$d_\beta(\pi[\cdot \mid D_n] \circ \phi_n^{-1}, N(0, G)) \to 0$ in $P^*$-probability as $n \to \infty$. We will write this more compactly as
$\pi[\cdot \mid D_n] \circ \phi_n^{-1} \rightsquigarrow N(0, G)$.

Theorem 2.4.1 establishes the asymptotic normality of $\tilde{\nu}(\beta)$, where $\tilde{\nu}(\beta)$ denotes a transformation of $\beta$ which is invariant with respect to $\mu(\beta, \gamma, x)$ while minimizing $\|\tilde{\nu}(\beta) - \beta^*\|_\infty$.

Theorem 2.4.1. Assume the conditions of Lemma 2.3.1 hold with $\rho(\epsilon_n) = o(1/K_n)$ and $C_1 > 2/3$
in Condition A.2.2. For some $\delta_n$ such that $\frac{r_n}{\sqrt{n}} \lesssim \delta_n \lesssim \frac{1}{\sqrt[3]{n}\, r_n}$,
let $A(\epsilon_n, \delta_n) = \{\beta : \max_{i \in \gamma^*} |\beta_i - \beta^*_i| > \delta_n,\ d(p_\beta, p_{\mu^*}) \le \epsilon_n\}$, where $\epsilon_n$ is the posterior contraction rate as defined in Lemma
2.2.1. Assume there exist some constants $C > 2$ and $M > 0$ such that

D.1 $\beta^* = (\beta^*_1, \beta^*_2, \ldots, \beta^*_{K_n})$ is generic [51], [52], $\min_{i \in \gamma^*} |\beta^*_i| > C\delta_n$ and $\pi(A(\epsilon_n, \delta_n) \mid D_n) \to 0$ as $n \to \infty$.

D.2 $|h_i(\beta^*)| < M$, $|h_{j,k}(\beta^*)| < M$, $|h^{j,k}(\beta^*)| < M$, $|h_{i,j,k}(\beta)| < M$, $|h_l(\beta)| < M$ hold for
any $i, j, k \in \gamma^*$, $l \notin \gamma^*$ and $\beta \in B_{2\delta_n}(\beta^*)$.

D.3 $\sup\left\{ |E_\beta(a^T U)^3| : \|\beta_{\gamma^*} - \beta^*\| \le 1.2 b_{\lambda,n},\ \|a\| = 1 \right\} \le 0.1 \sqrt{n/r_n}\, \underline{\lambda}_n^2(\beta^*) / \bar{\lambda}_n^{1/2}(\beta^*)$ and
$B_{\lambda,n} = O(1)$, where $U = Z - E_{\beta_{\gamma^*}}(Z)$, $Z$ denotes a random variable drawn from a
neural network model parameterized by $\beta_{\gamma^*}$, and $E_{\beta_{\gamma^*}}(Z)$ denotes the mean of $Z$.

Then $\pi[\sqrt{n}(\tilde{\nu}(\beta) - \beta^*) \mid D_n] \rightsquigarrow N(0, V)$ in $P^*$-probability as $n \to \infty$, where $V = (v_{ij})$, and
$v_{i,j} = E(h^{i,j}(\beta^*))$ if $i, j \in \gamma^*$ and 0 otherwise.

Condition D.1 is essentially an identifiability condition, i.e., when n is sufficiently large,
the DNN weights cannot be too far away from the true weights if the DNN produces approximately the same distribution as the true data. Condition D.2 gives typical conditions on
derivatives of the DNN. Condition D.3 ensures consistency of the MLE of β ∗ for the given
structure γ ∗ [53].

2.4.1 Asymptotic Normality of Prediction

Theorem 2.4.2 establishes asymptotic normality of the prediction $\mu(\beta, x_0)$ for a test
data point $x_0$, which implies that a faithful prediction interval can be constructed for the
learnt sparse neural network. Refer to Section 2.6.4 for how to construct the prediction
interval based on the theorem. Let $\mu_{i_1,i_2,\ldots,i_d}(\beta, x_0)$ denote the $d$-th order partial derivative
$\partial^d \mu(\beta, x_0) / \partial\beta_{i_1} \partial\beta_{i_2} \cdots \partial\beta_{i_d}$.

Theorem 2.4.2. Assume the conditions of Theorem 2.4.1 and the following condition hold:
$|\mu_i(\beta^*, x_0)| < M$, $|\mu_{i,j}(\beta, x_0)| < M$, $|\mu_k(\beta, x_0)| < M$ hold for any $i, j \in \gamma^*$, $k \notin \gamma^*$ and
$\beta \in B_{2\delta_n}(\beta^*)$, where $M$ is as defined in Theorem 2.4.1. Then $\pi[\sqrt{n}(\mu(\beta, x_0) - \mu(\beta^*, x_0)) \mid D_n] \rightsquigarrow N(0, \Sigma)$,
where $\Sigma = \nabla_{\gamma^*} \mu(\beta^*, x_0)^T H^{-1} \nabla_{\gamma^*} \mu(\beta^*, x_0)$ and $H = E(-\nabla^2_{\gamma^*} l_n(\beta^*))$ is the
Fisher information matrix.

The asymptotic normality for general smooth functionals has been established in [49].
For linear and quadratic functionals of deep ReLU networks with a spike-and-slab prior, the
asymptotic normality has been established in [50]. The DNN prediction $\mu(\beta, x_0)$ can be
viewed as a point evaluation functional over the neural network function space. However, in
general, this functional is not smooth with respect to the locally asymptotic normal (LAN)
norm. Hence, the results of [49] and [50] are not directly applicable to the asymptotic normality
of $\mu(\beta, x_0)$.

2.5 Asymptotically Optimal Generalization Bound

This section shows that the sparse BNN has an asymptotically optimal generalization bound.
First, we introduce a PAC Bayesian bound due to [54], [55], where the acronym PAC stands
for Probably Approximately Correct. It states that with an arbitrarily high probability, the
performance (as provided by a loss function) of a learning algorithm is upper-bounded by a
term decaying to an optimal value as more data is collected (hence “approximately correct”).
PAC-Bayes has proven over the past two decades to be a powerful tool to derive theoretical
guarantees for many machine learning algorithms.

Lemma 2.5.1 (PAC Bayesian bound). Let $P$ be any data-independent distribution on the
machine parameters $\beta$, and $Q$ be any distribution that is potentially data-dependent and
absolutely continuous with respect to $P$. If the loss function $l(\beta, x, y) \in [0, 1]$, then the
following inequality holds with probability $1 - \delta$,

\[ \int E_{x,y}\, l(\beta, x, y)\, dQ \le \int \frac{1}{n} \sum_{i=1}^{n} l(\beta, x^{(i)}, y^{(i)})\, dQ + \sqrt{\frac{d_0(Q, P) + \log\frac{2\sqrt{n}}{\delta}}{2n}}, \]

where $d_0(Q, P)$ denotes the Kullback-Leibler divergence between $Q$ and $P$, and $(x^{(i)}, y^{(i)})$
denotes the $i$-th observation of the dataset.
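To give a feel for the size of the complexity term in Lemma 2.5.1, the sketch below evaluates the right-hand side of the bound for given values of the empirical risk, the Kullback-Leibler divergence, the sample size, and δ. The function name `pac_bayes_bound` and the numerical inputs are hypothetical and chosen only for illustration.

```python
import math

def pac_bayes_bound(emp_risk, kl, n, delta):
    """Right-hand side of the PAC-Bayes bound in Lemma 2.5.1.

    emp_risk: posterior-averaged empirical loss, assumed to lie in [0, 1]
    kl:       Kullback-Leibler divergence d_0(Q, P), nonnegative
    n:        sample size
    delta:    the bound holds with probability 1 - delta
    """
    if n <= 0 or kl < 0 or not (0.0 < delta < 1.0):
        raise ValueError("need n > 0, kl >= 0 and 0 < delta < 1")
    complexity = math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))
    return emp_risk + complexity

# Illustrative values only.
bound_small_kl = pac_bayes_bound(emp_risk=0.1, kl=0.0, n=10_000, delta=0.05)
bound_large_kl = pac_bayes_bound(emp_risk=0.1, kl=500.0, n=10_000, delta=0.05)
```

The bound is informative only when the divergence $d_0(Q, P)$ is small relative to $n$, which is one way to see why a sparse posterior is helpful here.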

For the binary classification problem, the DNN model fits a predictive distribution
as $\hat{p}_1(x; \beta) := \widehat{Pr}(y = 1 \mid x) = \mathrm{logit}^{-1}(\mu(\beta, x))$ and $\hat{p}_0(x; \beta) := \widehat{Pr}(y = 0 \mid x) = 1 - \mathrm{logit}^{-1}(\mu(\beta, x))$.
Given an observation $(x, y)$, we define the loss with margin $\nu > 0$ as

\[ l_\nu(\beta, x, y) = 1(\hat{p}_y(x; \beta) - \hat{p}_{1-y}(x; \beta) < \nu). \]

Therefore, the empirical loss for the whole data set $\{x^{(i)}, y^{(i)}\}_{i=1}^{n}$ is defined as
$L_{emp,\nu}(\beta) = \sum_{i=1}^{n} l_\nu(\beta, x^{(i)}, y^{(i)})/n$, and the population loss is defined as $L_\nu(\beta) = E_{x,y}\, l_\nu(\beta, x, y)$.
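The margin loss and its empirical average can be sketched as follows. The function names and the toy logits are hypothetical; the network output $\mu(\beta, x)$ is assumed to be supplied as an array of logits, so the sketch does not depend on any particular network implementation.

```python
import numpy as np

def margin_loss(mu, y, nu=0.0):
    """Indicator loss 1(p_y - p_{1-y} < nu) for binary labels y in {0, 1}.

    mu: array of network outputs (logits) mu(beta, x)
    """
    mu = np.asarray(mu, dtype=float)
    y = np.asarray(y, dtype=int)
    p1 = 1.0 / (1.0 + np.exp(-mu))        # inverse logit: estimated P(y = 1 | x)
    p_y = np.where(y == 1, p1, 1.0 - p1)  # probability assigned to the true label
    return ((p_y - (1.0 - p_y)) < nu).astype(float)

def empirical_margin_loss(mu, y, nu=0.0):
    """L_emp,nu: average of the margin loss over the data set."""
    return float(margin_loss(mu, y, nu).mean())

# Toy logits for three observations, all with true label 1.
logits = [2.0, -2.0, 0.1]
labels = [1, 1, 1]
loss_nu0 = margin_loss(logits, labels, nu=0.0)
loss_nu01 = margin_loss(logits, labels, nu=0.1)
```

Setting `nu=0.0` gives the usual misclassification indicator, while a positive margin also penalizes correct predictions made with low confidence, such as the third observation above.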

Theorem 2.5.1 (Bayesian generalization error for classification). Suppose the conditions
of Theorem 2.2.1 hold. For any $\nu > 0$, when $n$ is sufficiently large, the following inequality
holds with probability greater than $1 - \exp\{-c_0 n\epsilon_n^2\}$,

\[ \int L_0(\beta)\, d\pi(\beta \mid D_n) \le \frac{1}{1 - 2\exp\{-c_1 n\epsilon_n^2\}} \int L_{emp,\nu}(\beta)\, d\pi(\beta \mid D_n) + O\left( \epsilon_n + \sqrt{\log n/n} + \exp\{-c_1 n\epsilon_n^2\} \right), \]

for some $c_0, c_1 > 0$, where $\epsilon_n$ is as defined in Theorem 2.2.1.

Theorem 2.5.1 characterizes the relationship between the Bayesian population risk
$\int L_0(\beta)\, d\pi(\beta \mid D_n)$ and the Bayesian empirical risk $\int L_{emp,\nu}(\beta)\, d\pi(\beta \mid D_n)$, and implies that the
difference between them is $O(\epsilon_n)$. Furthermore, this generalization performance extends to
any point estimator $\hat{\beta}$, as long as $\hat{\beta}$ belongs to the dominating posterior mode.

Theorem 2.5.2. Suppose that the conditions of Theorem 2.2.1 hold and the estimate $\hat{\beta}$ belongs
to the dominating posterior mode under Theorem 2.2.1. Then for any $\nu > 0$, the following
inequality holds with probability greater than $1 - \exp\{-c_0 n\epsilon_n^2\}$,

\[ L_0(\hat{\beta}) \le L_{emp,\nu}(\hat{\beta}) + O(\epsilon_n), \]

for some $c_0 > 0$.

It is worth clarifying that the statement “$\hat{\beta}$ belongs to the dominating posterior mode”
means $\hat{\beta} \in B_n$, where $B_n$ is defined in the proof of Theorem 2.2.1 and its posterior probability is greater
than $1 - \exp\{-cn\epsilon_n^2\}$ for some $c > 0$. Therefore, if $\hat{\beta} \sim \pi(\beta \mid D_n)$, i.e., $\hat{\beta}$ is one valid posterior
sample, then with high probability, it belongs to the dominating posterior mode. The proofs
of the above two theorems can be found in Section 2.9.
Now we consider the generalization error for regression models. We make the following
additional assumptions:

E.1 The activation function $\psi \in [-1, 1]$.

E.2 The last layer weights and bias in $\beta^*$ are restricted to the interval $[-F_n, F_n]$ for some
$F_n \le E_n$, while $F_n \to \infty$ is still allowed as $n \to \infty$.

E.3 $\max_{x \in \Omega} |\mu^*(x)| \le F$ for some constant $F$.

Correspondingly, the priors of the last layer weights and bias are truncated on $[-F_n, F_n]$,
i.e., the two-component normal mixture prior (2.3) is truncated on $[-F_n, F_n]$. By the same argument as for
Theorem 2.9.1 (in Section 2.9), Theorem 2.2.1 still holds.
Note that the Hellinger distance for the regression problem is defined as

\[ d^2(p_\beta, p_{\mu^*}) = E_x\left( 1 - \exp\left\{ -\frac{[\mu(\beta, x) - \mu^*(x)]^2}{8\sigma^2} \right\} \right). \]

By our assumption, for any $\beta$ on the prior support, $|\mu(\beta, x) - \mu^*(x)|^2 \le (F + LF_n)^2 := \bar{F}^2$;
thus,
\[ d^2(p_\beta, p_{\mu^*}) \ge C_F\, E_x|\mu(\beta, x) - \mu^*(x)|^2, \tag{2.11} \]
where $C_F = [1 - \exp(-4\bar{F}^2/8\sigma^2)]/4\bar{F}^2$. Furthermore, (2.5) implies that with probability at
least $1 - 2\exp\{-cn\epsilon_n^2\}$,

\[ \int d^2(p_\beta, p_{\mu^*})\, d\pi(\beta \mid D_n) \le 16\epsilon_n^2 + 2e^{-cn\epsilon_n^2}. \tag{2.12} \]

Combining (2.11) and (2.12), we obtain the following Bayesian generalization error
result:

Theorem 2.5.3 (Bayesian generalization error for regression). Suppose the conditions of
Theorem 2.2.1 hold. When $n$ is sufficiently large, the following inequality holds with probability at least $1 - 2\exp\{-cn\epsilon_n^2\}$,

\[ \int E_x|\mu(\beta, x) - \mu^*(x)|^2\, d\pi(\beta \mid D_n) \le [16\epsilon_n^2 + 2e^{-cn\epsilon_n^2}]/C_F \asymp [\epsilon_n^2 + e^{-cn\epsilon_n^2}] L^2 F_n^2. \tag{2.13} \]

Similarly, if an estimator $\hat{\beta}$ belongs to the dominating posterior mode (refer to the discussion of Theorem 2.5.2 for more details), then $\hat{\beta} \in \{\beta : d(p_\beta, p_{\mu^*}) \le 4\epsilon_n\}$ and the following
result holds:

Theorem 2.5.4. Suppose the conditions of Theorem 2.2.1 hold. Then

\[ E_x|\mu(\hat{\beta}, x) - \mu^*(x)|^2 \le [16\epsilon_n^2]/C_F \asymp \epsilon_n^2 L^2 F_n^2. \tag{2.14} \]

2.6 Computation

2.6.1 Bayesian Evidence Approach

The theoretical results established in previous sections show that the Bayesian sparse
DNN can be learned with a mixture Gaussian prior and, more importantly, that posterior
inference need not be based directly on posterior samples, which avoids the
convergence issue of the MCMC implementation for large complex models. As shown in
Theorems 2.3.2, 2.5.2 and 2.5.4, for the sparse BNN, a good local maximizer of the log-posterior
distribution also guarantees consistency of the network structure selection and
asymptotic optimality of the network generalization performance. This local maximizer, in
the spirit of condition C.3 and the conditions of Theorems 2.5.2 and 2.5.4, is not necessarily
a MAP estimate, as the factor $\det\left(-\frac{n}{2\pi} H_n(\hat{\beta})\right)^{1/2}$ can play an important role. In other words,
an estimate of $\beta$ that lies in a wide valley of the energy landscape is generally preferred. This
is consistent with the view of many other authors, see e.g., [56] and [57], where different
techniques have been developed to enhance convergence of SGD to a wide valley of the
energy landscape.

Algorithm 1 Sparse DNN Elicitation with Bayesian Evidence
Input: $T$, the number of independent tries in training the DNN, and the prior hyperparameters $\sigma_{0,n}$, $\sigma_{1,n}$, and $\lambda_n$.
for $t = 1, 2, \ldots, T$ do
(i) Initialization: Randomly initialize the weights and biases, and set $\gamma_i = 1$ for $i = 1, 2, \ldots, K_n$.
(ii) Optimization: Run SGD to maximize $h_n(\beta)$ as defined in (2.10). Denote the estimate
of $\beta$ by $\hat{\beta}$.
(iii) Connection sparsification: For each $i \in \{1, 2, \ldots, K_n\}$, set $\gamma_i = 1$ if
\[ |\hat{\beta}_i| > \frac{\sqrt{2}\,\sigma_{0,n}\sigma_{1,n}}{\sqrt{\sigma_{1,n}^2 - \sigma_{0,n}^2}} \sqrt{\log\left( \frac{1-\lambda_n}{\lambda_n} \frac{\sigma_{1,n}}{\sigma_{0,n}} \right)} \]
and 0 otherwise. Denote the yielded sparse DNN structure
by $\gamma^t$, and set $\hat{\beta}_{\gamma^t} = \hat{\beta} \circ \gamma^t$, where $\circ$ denotes the element-wise product.
(iv) Nonzero-weights refining: Refine the nonzero weights of the sparsified DNN by
maximizing
\[ h_n(\beta_{\gamma^t}) = \frac{1}{n} \sum_{i=1}^{n} \log(p(y_i, x_i \mid \beta_{\gamma^t})) + \frac{1}{n} \log(\pi(\beta_{\gamma^t})), \tag{2.15} \]
which can be accomplished by running SGD for a few epochs with the initial value $\hat{\beta}_{\gamma^t}$.
Denote the resulting DNN model by $\tilde{\beta}_{\gamma^t}$.
(v) Model evaluation: Calculate the Bayesian evidence
$\mathrm{Evidence}_t = \det\left(-\frac{n}{2\pi} H_n(\tilde{\beta}_{\gamma^t})\right)^{-1/2} e^{nh_n(\tilde{\beta}_{\gamma^t})}$,
where $H_n(\beta_\gamma) = \frac{\partial^2 h_n(\beta_\gamma)}{\partial\beta_\gamma\, \partial^T \beta_\gamma}$ is the Hessian matrix.
end for
Output $\tilde{\beta}_{\gamma^t}$ with the largest Bayesian evidence.
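Step (iii) of Algorithm 1 is a simple magnitude threshold, which can be sketched as below. The function names and the hyperparameter values are illustrative assumptions; the threshold is the value of $|\beta_i|$ at which the two mixture components in (2.9) balance, so that $\pi(\gamma_i = 1 \mid \beta_i) = 0.5$.

```python
import numpy as np

def sparsify_threshold(lam, sigma0, sigma1):
    """Magnitude threshold above which pi(gamma_i = 1 | beta_i) > 0.5."""
    if not (0.0 < sigma0 < sigma1 and 0.0 < lam < 1.0):
        raise ValueError("need 0 < sigma0 < sigma1 and 0 < lam < 1")
    log_term = np.log((1.0 - lam) / lam * sigma1 / sigma0)
    if log_term <= 0.0:
        raise ValueError("threshold undefined: log term must be positive")
    scale = np.sqrt(2.0) * sigma0 * sigma1 / np.sqrt(sigma1**2 - sigma0**2)
    return scale * np.sqrt(log_term)

def sparsify(beta_hat, lam, sigma0, sigma1):
    """Return the 0/1 structure gamma and the masked weights beta_hat * gamma."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    thr = sparsify_threshold(lam, sigma0, sigma1)
    gamma = (np.abs(beta_hat) > thr).astype(int)
    return gamma, beta_hat * gamma

# Hypothetical trained weights and hyperparameters.
thr = sparsify_threshold(lam=1e-3, sigma0=1e-3, sigma1=0.1)
gamma, beta_sparse = sparsify([0.0, 0.004, 0.005, -0.2], 1e-3, 1e-3, 0.1)
```

With these illustrative hyperparameters the threshold is a little under 0.005, so only the last two weights survive sparsification.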

Condition C.3 can be re-expressed as
$\int_{\mathbb{R}^{K_n} \setminus B_\delta(\hat{\beta})} e^{nh_n(\beta)} d\beta = o\left( \det\left(-\frac{n}{2\pi} H_n(\hat{\beta})\right)^{-1/2} e^{nh_n(\hat{\beta})} \right)$,
which requires that $\hat{\beta}$ is a dominating mode of the posterior. Based on this observation,
we suggest using the Bayesian evidence [58], [59] as the criterion for eliciting estimates of
$\beta$ produced by an optimization method in multiple runs with different initializations. The
Bayesian evidence is calculated as $\det\left(-\frac{n}{2\pi} H_n(\hat{\beta})\right)^{-1/2} e^{nh_n(\hat{\beta})}$. Since Theorem 2.3.1 ensures
only consistency of structure selection but not consistency of parameter estimation, we suggest
refining the nonzero weights by a short optimization process after structure selection.
The complete algorithm is summarized in Algorithm 1.
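For a small network whose Hessian can be formed explicitly, the logarithm of the Bayesian evidence can be computed as in the sketch below. The function name `laplace_log_evidence` and the toy Hessian are hypothetical; working on the log scale with `numpy.linalg.slogdet` avoids overflow in both the determinant and the factor $e^{nh_n(\hat{\beta})}$.

```python
import numpy as np

def laplace_log_evidence(n, hn_value, hessian):
    """log of det(-(n / 2pi) H_n)^(-1/2) * exp(n * h_n) at a local maximizer.

    n:        sample size
    hn_value: h_n evaluated at the (refined) estimate
    hessian:  Hessian matrix of h_n at that estimate, restricted to the
              nonzero weights (should be negative definite at a strict maximum)
    """
    a = -(n / (2.0 * np.pi)) * np.asarray(hessian, dtype=float)
    sign, logdet = np.linalg.slogdet(a)
    if sign <= 0:
        # A positive determinant is necessary (not sufficient) for
        # -(n / 2pi) H_n to be positive definite.
        raise ValueError("Hessian is not negative definite at this point")
    return n * hn_value - 0.5 * logdet

# Toy example: two nonzero weights with Hessian -I (hypothetical values).
log_ev = laplace_log_evidence(n=100, hn_value=-1.0, hessian=-np.eye(2))
```

Runs are then compared by their log-evidence values; the run with the largest value is retained.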
For a large-scale neural network, even if it is sparse, the number of nonzero elements
can easily reach thousands or even millions, see e.g. the networks considered in Section 2.7.2. In this case, evaluation of the determinant of the Hessian matrix can be very
time consuming. For this reason, we suggest approximating the log(Bayesian evidence) by
$nh_n(\hat{\beta}_\gamma) - \frac{1}{2}|\gamma| \log(n)$, with the detailed arguments given in Section 2.9. As explained there, if
the prior information imposed on the sparse DNNs is further ignored, then the sparse DNNs
can be elicited by BIC.
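The approximation $nh_n(\hat{\beta}_\gamma) - \frac{1}{2}|\gamma| \log(n)$ makes model elicitation across runs cheap, as the following sketch suggests. The function names and the numbers describing the two runs are hypothetical.

```python
import math

def approx_log_evidence(n_hn, num_connections, n):
    """BIC-like approximation: n * h_n(beta_hat) - 0.5 * |gamma| * log(n)."""
    return n_hn - 0.5 * num_connections * math.log(n)

def select_run(runs, n):
    """Pick the run with the largest approximate log-evidence.

    runs: list of (n * h_n value, number of nonzero connections) pairs,
          one pair per independent training run.
    """
    scores = [approx_log_evidence(n_hn, k, n) for n_hn, k in runs]
    best = max(range(len(runs)), key=scores.__getitem__)
    return best, scores

# Two hypothetical runs: the second fits slightly better but is much denser.
best, scores = select_run([(-1000.0, 50), (-990.0, 200)], n=1000)
```

In this toy comparison the sparser run is preferred, because the small gain in fit of the denser run does not offset its larger complexity penalty.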
The main parameters for Algorithm 1 are the prior hyperparameters σ0,n , σ1,n , and λn .
Theorem 2.2.1 provides theoretical suggestions for the choice of the prior-hyperparameters,
see also the proof of Lemma 2.3.1 for a specific setting for them. Our theory allows σ1,n to
grow with n from the perspective of data fitting, but in our experience, large weight
magnitudes tend to adversely affect the generalization ability of the network. For this reason, we
usually set σ1,n to a relatively small number such as 0.01 or 0.02, and then tune the values
of σ0,n and λn for the network sparsity as well as the network approximation error. As a
trade-off, the resulting network might be a little denser than the ideal one. If it is too dense
to satisfy the sparse constraint given in Assumption A.2.2, one might increase the value of
σ0,n and/or decrease the value of λn , and rerun the algorithm to get a sparser structure.
This process can be repeated until the constraint is satisfied.
Algorithm 1 employs SGD to optimize the log-posterior of the BNN. Since SGD generally
converges to a local optimal solution, the multiple initialization method is used in order to
find a local optimum close to the global one. It is interesting to note that SGD has some nice
properties in non-convex optimization: It works on the convolved (thus smoothed) version
of the loss function [60] and tends to converge to flat local minimizers which are with very
high probability also global minimizers [61]. In all of our experiments, we set the number
of initializations to T = 10 as default unless otherwise stated. We note that Algorithm 1 is
not very sensitive to the value of T , although a large value of T can generally improve its
performance.
For network weight initialization, we adopted the standard method, see [62] for tanh
activation and [63] for ReLU activation, which ensures that the variance of the gradient of
each layer is of the same order at the beginning of the training process.

2.6.2 Prior Annealing: Frequentist Computation

To avoid multiple runs of the algorithm, we suggest using a prior annealing
approach, which incorporates the study of the optimization landscape of over-parameterized
DNNs. It has been shown in [5], [6] that the loss of an over-parameterized DNN exhibits good
properties:

(S ∗ ) For a fully connected DNN with an analytic activation function and a convex loss
function at the output layer, if the number of hidden units of one layer is larger than
the number of training points and the network structure from this layer on is pyramidal,
then almost all local minima are globally optimal.

Motivated by this result, we propose the following approach:

Algorithm 2 Prior annealing: Frequentist

(i) (Initial training) Train a DNN satisfying condition (S*) such that a global optimal
solution β 0 = arg maxβ ln (β) is reached, which can be accomplished using SGD or
Adam [64].

(ii) (Prior annealing) Initialize $\beta$ at $\beta_0$ and simulate from a sequence of distributions
\[
\pi(\beta \mid D_n, \tau, \eta^{(k)}, \sigma_{0,n}^{(k)}) \propto e^{n l_n(\beta)/\tau}\, \pi_k^{\eta^{(k)}/\tau}(\beta), \quad k = 1, 2, \ldots, m,
\]
where $0 < \eta^{(1)} \le \eta^{(2)} \le \cdots \le \eta^{(m)} = 1$, $\pi_k = \lambda_n N(0, \sigma_{1,n}^2) + (1-\lambda_n) N(0, (\sigma_{0,n}^{(k)})^2)$, and $\sigma_{0,n}^{init} = \sigma_{0,n}^{(1)} \ge \sigma_{0,n}^{(2)} \ge \cdots \ge \sigma_{0,n}^{(m)} = \sigma_{0,n}^{end}$. The simulation can be done in an annealing manner using a stochastic gradient MCMC algorithm [65]–[68]. After the stage $m$ has been reached, continue to run the simulated annealing algorithm by gradually decreasing the temperature $\tau$ to a very small value. Denote the resulting DNN by $\hat\beta = (\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_{K_n})$.

(iii) (Structure sparsification) For each connection $i \in \{1, 2, \ldots, K_n\}$, set $\tilde\gamma_i = 1$ if
\[
|\hat\beta_i| > \frac{\sqrt{2}\,\sigma_{0,n}\sigma_{1,n}}{\sqrt{\sigma_{1,n}^2 - \sigma_{0,n}^2}} \sqrt{\log\left(\frac{1-\lambda_n}{\lambda_n}\,\frac{\sigma_{1,n}}{\sigma_{0,n}}\right)}
\]
and 0 otherwise, where the threshold value of $|\hat\beta_i|$ is obtained by solving $\pi(\gamma_i = 1 \mid \beta_i) > 0.5$ based on the mixture Gaussian prior as in [69]. Denote the yielded sparse DNN structure by $\tilde\gamma$.

(iv) (Nonzero-weights refining) Refine the nonzero weights of the sparsified DNN by max-
imizing ln (β). Denote the resulting estimate by β̃ γ̃ , which represents the MLE of
β∗.
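
The threshold used in step (iii) follows from a short calculation, sketched here for convenience. The symbol $\phi(\cdot;\sigma^2)$ for the $N(0,\sigma^2)$ density is introduced only for this illustration, and the sketch assumes $\sigma_{1,n} > \sigma_{0,n}$ and that the logarithm on the right-hand side is positive (which holds when $\lambda_n$ is small).

```latex
\begin{align*}
\pi(\gamma_i = 1 \mid \beta_i)
  &= \frac{\lambda_n \phi(\beta_i;\sigma_{1,n}^2)}
          {\lambda_n \phi(\beta_i;\sigma_{1,n}^2)
           + (1-\lambda_n)\phi(\beta_i;\sigma_{0,n}^2)} > 0.5 \\
\iff\quad
  & \frac{\lambda_n}{\sigma_{1,n}}\, e^{-\beta_i^2/(2\sigma_{1,n}^2)}
    > \frac{1-\lambda_n}{\sigma_{0,n}}\, e^{-\beta_i^2/(2\sigma_{0,n}^2)} \\
\iff\quad
  & \frac{\beta_i^2}{2}\left(\frac{1}{\sigma_{0,n}^2} - \frac{1}{\sigma_{1,n}^2}\right)
    > \log\left(\frac{1-\lambda_n}{\lambda_n}\,\frac{\sigma_{1,n}}{\sigma_{0,n}}\right) \\
\iff\quad
  & |\beta_i| > \frac{\sqrt{2}\,\sigma_{0,n}\sigma_{1,n}}
                     {\sqrt{\sigma_{1,n}^2-\sigma_{0,n}^2}}
    \sqrt{\log\left(\frac{1-\lambda_n}{\lambda_n}\,
                    \frac{\sigma_{1,n}}{\sigma_{0,n}}\right)} .
\end{align*}
```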

[Figure: negative log prior (vertical axis) plotted against $\beta \in [-1, 1]$ (horizontal axis).]

Figure 2.1. Negative logarithm of the mixture Gaussian prior.

For Algorithm 2, the consistency of $(\tilde\gamma, \tilde\beta_{\tilde\gamma})$ as an estimator of $(\gamma^*, \beta^*)$ can be proved based on Theorem 3.4 of [6] for global convergence of $\beta_0$, the property of simulated annealing (by choosing an appropriate sequence of $\eta^{(k)}$ and a cooling schedule of $\tau$), Theorem 2.3.2 for consistency of structure selection, and Theorem 2.1 of [53] for consistency of the MLE under the scenario of dimension diverging.
Intuitively, the initial training phase can reach the global optimum of the likelihood
function. In the prior annealing phase, as we slowly add the effect of the prior, the landscape
of the target distribution is gradually changed and the MCMC algorithm is likely to hit the
region around the optimum of the target distribution. In practice, let $t$ denote the step index. A simple implementation of the initial training and prior annealing phases of Algorithm 2 can be given as follows: (i) for $0 < t < T_1$, run initial training; (ii) for $T_1 \le t \le T_2$, fix $\sigma_{0,n}^{(t)} = \sigma_{0,n}^{init}$ and linearly increase $\eta^{(t)}$ by setting $\eta^{(t)} = \frac{t - T_1}{T_2 - T_1}$; (iii) for $T_2 \le t \le T_3$, fix $\eta^{(t)} = 1$ and linearly decrease $(\sigma_{0,n}^{(t)})^2$ by setting $(\sigma_{0,n}^{(t)})^2 = \frac{T_3 - t}{T_3 - T_2}(\sigma_{0,n}^{init})^2 + \frac{t - T_2}{T_3 - T_2}(\sigma_{0,n}^{end})^2$; (iv) for $t > T_3$, fix $\eta^{(t)} = 1$ and $\sigma_{0,n}^{(t)} = \sigma_{0,n}^{end}$, and gradually decrease the temperature $\tau$, e.g., by setting $\tau_t = \frac{c}{t - T_3}$ for some constant $c$.
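
The four phases can be written as a small schedule function. The following is a minimal sketch, not the authors' implementation: the function name `annealing_schedule`, the argument `tau0` (the temperature held fixed up to $T_3$) and the choice $\eta = 0$ during initial training are assumptions made for illustration.

```python
def annealing_schedule(t, T1, T2, T3, sigma0_init_sq, sigma0_end_sq, tau0, c):
    """Return (eta, sigma0_sq, tau) at step t for the four-phase schedule.

    Hypothetical helper: tau0 is the temperature used up to T3, and c is the
    constant in tau_t = c / (t - T3) used afterwards.
    """
    if t < T1:
        # Phase (i): initial training. Setting eta = 0 (prior switched off)
        # is an assumption of this sketch.
        return 0.0, sigma0_init_sq, tau0
    if t <= T2:
        # Phase (ii): keep the spike variance at its initial value and
        # linearly increase eta from 0 to 1.
        eta = (t - T1) / (T2 - T1)
        return eta, sigma0_init_sq, tau0
    if t <= T3:
        # Phase (iii): eta = 1, linearly interpolate the spike variance
        # from its initial value to its final value.
        w = (T3 - t) / (T3 - T2)
        return 1.0, w * sigma0_init_sq + (1.0 - w) * sigma0_end_sq, tau0
    # Phase (iv): eta = 1, final spike variance, decreasing temperature.
    return 1.0, sigma0_end_sq, c / (t - T3)
```

Called once per iteration, the function gives the values of $\eta^{(t)}$, $(\sigma_{0,n}^{(t)})^2$ and $\tau_t$ to plug into the target distribution of step (ii) of Algorithm 2.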
To better understand the prior annealing procedure, we provide some graphical illustrations. In practice, the negative log-prior puts a penalty on parameter weights. The mixture Gaussian prior behaves like a piecewise $L_2$ penalty with different weights on different regions. Figure 2.1 shows the shape of a negative log-mixture Gaussian prior. In step (iii) of Algorithm 2, the condition $\pi(\gamma_i = 1 \mid \beta_i) > 0.5$ splits the parameters into two parts. For the $\beta_i$'s with large magnitudes, the slab component $N(0, \sigma_{1,n}^2)$ plays the major role in the prior, imposing a small penalty on the parameter. For the $\beta_i$'s with smaller magnitudes, the spike component $N(0, \sigma_{0,n}^2)$ plays the major role in the prior, imposing a large penalty on the parameters to push them toward zero in training.
Figure 2.2 shows the shape of the negative log-prior and $\pi(\gamma_i = 1 \mid \beta_i)$ for different choices of $\sigma_{0,n}^2$ and $\lambda_n$. As we can see from the plot, $\sigma_{0,n}^2$ plays the major role in determining the effect of the prior. Let $\alpha$ be the threshold in step (iii) of Algorithm 2, i.e., the positive solution to $\pi(\gamma_i = 1 \mid \beta_i) = 0.5$. In general, a smaller $\sigma_{0,n}^2$ will result in a smaller $\alpha$. If a very small $\sigma_{0,n}^2$ is used in the prior from the beginning, then most of the $\beta_i$'s at initialization will have a magnitude larger than $\alpha$ and the slab component $N(0, \sigma_{1,n}^2)$ of the prior will dominate most parameters. As a result, it will be difficult to find the desired sparse structure. Following the proposed prior annealing procedure, we can start with a larger $\sigma_{0,n}^2$, i.e., a larger threshold $\alpha$ and a relatively smaller penalty for those $|\beta_i| < \alpha$. As we gradually decrease the value of $\sigma_{0,n}^2$, $\alpha$ decreases, and the penalty imposed on the small weights increases and drives them toward zero. The prior annealing allows us to gradually sparsify the DNN and impose more and more penalties on the parameters close to 0.
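
The dependence of the threshold $\alpha$ on $\sigma_{0,n}^2$ can be checked numerically. The sketch below is illustrative only: the function names `inclusion_prob` and `threshold` are hypothetical, and the hyperparameter values in the usage lines are taken from the legend of Figure 2.2.

```python
import math


def inclusion_prob(beta, sigma0_sq, sigma1_sq, lam):
    """pi(gamma = 1 | beta) under the two-component mixture Gaussian prior."""
    # Log of each weighted component density; the common 1/sqrt(2*pi)
    # factor cancels in the ratio and is omitted.
    log_slab = math.log(lam) - 0.5 * math.log(sigma1_sq) - beta ** 2 / (2 * sigma1_sq)
    log_spike = math.log(1 - lam) - 0.5 * math.log(sigma0_sq) - beta ** 2 / (2 * sigma0_sq)
    d = log_spike - log_slab  # log-odds in favour of the spike
    if d > 700:  # avoid overflow in exp; the probability is numerically 0
        return 0.0
    return 1.0 / (1.0 + math.exp(d))


def threshold(sigma0_sq, sigma1_sq, lam):
    """Positive solution alpha of pi(gamma = 1 | beta) = 0.5."""
    log_term = math.log((1 - lam) / lam * math.sqrt(sigma1_sq / sigma0_sq))
    return math.sqrt(2 * sigma0_sq * sigma1_sq / (sigma1_sq - sigma0_sq) * log_term)


# A larger spike variance gives a larger threshold alpha.
alpha_small = threshold(1.5e-5, 0.04, 1e-8)
alpha_large = threshold(1.5e-4, 0.04, 1e-8)
print(alpha_small, alpha_large)
```

In the annealing procedure, $\sigma_{0,n}^2$ moves from the larger to the smaller value, so the threshold shrinks as training proceeds.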

2.6.3 Prior Annealing: Bayesian Computation

For certain problems, the size (i.e., the number of nonzero elements) of $\gamma^*$ is large, and calculation of the Fisher information matrix is difficult. In this case, the prediction uncertainty can be quantified via posterior simulations. The simulation can be started with a DNN satisfying condition (S*) and performed using an SGMCMC algorithm [67], [68] with an annealed prior as defined in step (ii) of Algorithm 2 (for the Bayesian approach, we may fix the temperature $\tau = 1$). The over-parameterized structure and annealed prior make the simulations immune to local traps.
[Figure: two panels plotting the negative log prior (top) and $P(\gamma = 1 \mid \beta)$ (bottom) against $\beta \in [-0.1, 0.1]$ for several settings of $\sigma_{0,n}^2$ and $\lambda_n$, with $\sigma_{1,n}^2 = 0.04$.]

Figure 2.2. Negative log-prior and $\pi(\gamma = 1 \mid \beta)$ for different choices of $\sigma_{0,n}^2$ and $\lambda_n$.

To justify the Bayesian estimator for the prediction mean and variance, for some test function $\phi(\beta)$, we study the deviation of the path averaging estimator $\frac{1}{T}\sum_{t=1}^{T}\phi(\beta^{(t)})$ and the posterior mean $\int \phi(\beta)\pi(\beta \mid D_n, \eta^*, \sigma_{0,n}^*)\,d\beta$. For simplicity, we will focus on SGLD with prior annealing. Our analysis can be easily generalized to other SGMCMC algorithms [70].
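For a test function $\phi(\cdot)$, the difference between $\phi(\beta)$ and $\int \phi(\beta)\pi(\beta \mid D_n, \eta^*, \sigma_{0,n}^*)\,d\beta$ can be characterized by the Poisson equation:
\[
\mathcal{L}\psi(\beta) = \phi(\beta) - \int \phi(\beta)\pi(\beta \mid D_n, \eta^*, \sigma_{0,n}^*)\,d\beta,
\]
where $\psi(\cdot)$ is the solution of the Poisson equation and $\mathcal{L}$ is the infinitesimal generator of the Langevin diffusion, i.e., for the following Langevin diffusion
\[
d\beta^{(t)} = \nabla \log(\pi(\beta \mid D_n, \eta^*, \sigma_{0,n}^*))\,dt + \sqrt{2}\, I\, dW_t,
\]
where $I$ is the identity matrix and $W_t$ is Brownian motion, we have
\[
\mathcal{L}\psi(\beta) := \langle \nabla\psi(\beta), \nabla \log(\pi(\beta \mid D_n, \eta^*, \sigma_{0,n}^*)) \rangle + \mathrm{tr}(\nabla^2\psi(\beta)).
\]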

Let $D^k\psi$ denote the $k$th-order derivatives of $\psi$. To control the perturbation of $\phi(\beta)$, we need the following assumption about the function $\psi(\beta)$:

Assumption 2.6.1. For $k \in \{0, 1, 2, 3\}$, $D^k\psi$ exists and there exists a function $V$ such that $\|D^k\psi\| \lesssim V^{p_k}$ for some constant $p_k > 0$. In addition, $V$ is smooth and the expectation of $V^p$ on $\beta^{(t)}$ is bounded for some $p \le 2\max_k\{p_k\}$, i.e., $\sup_t E(V^p(\beta^{(t)})) < \infty$ and $\sup_{s \in (0,1)} V^p(s\beta_1 + (1-s)\beta_2) \lesssim V^p(\beta_1) + V^p(\beta_2)$.
In step $t$ of the SGLD algorithm, the drift term is replaced by $\nabla_\beta \log \pi(\beta^{(t)} \mid D_{m,n}^{(t)}, \eta^{(t)}, \sigma_{0,n}^{(t)})$, where $D_{m,n}^{(t)}$ is used to represent the mini-batch data used in step $t$. Let $\mathcal{L}_t$ be the corresponding infinitesimal generator. Let $\delta_t = \mathcal{L}_t - \mathcal{L}$. To quantify the effect of $\delta_t$, we introduce the following assumption:

Assumption 2.6.2. $\beta^{(t)}$ has bounded expectation and the expectation of the log-prior is Lipschitz continuous with respect to $\sigma_{0,n}$, i.e., there exists some constant $M$ such that $\sup_t E(|\beta^{(t)}|) \le M < \infty$, and for all $t$, $|E \log(\pi(\beta^{(t)} \mid \lambda_n, \sigma_{0,n}^{(t_1)}, \sigma_{1,n})) - E \log(\pi(\beta^{(t)} \mid \lambda_n, \sigma_{0,n}^{(t_2)}, \sigma_{1,n}))| \le M |\sigma_{0,n}^{(t_1)} - \sigma_{0,n}^{(t_2)}|$.

Then we have the following theorem:

Theorem 2.6.1. Suppose the model satisfies Assumption 2.6.2, and a constant learning rate of $\epsilon$ is used. For a test function $\phi(\cdot)$, if the solution of the Poisson equation $\psi(\cdot)$ satisfies Assumption 2.6.1, then
\[
E\left(\frac{1}{T}\sum_{t=1}^{T-1}\phi(\beta^{(t)}) - \int \phi(\beta)\pi(\beta \mid D_n, \eta^*, \sigma_{0,n}^*)\,d\beta\right)
= O\left(\frac{1}{T\epsilon} + \frac{\sum_{t=0}^{T-1}\left(|\eta^{(t)} - \eta^*| + |\sigma_{0,n}^{(t)} - \sigma_{0,n}^*|\right)}{T} + \epsilon\right), \tag{2.16}
\]
where $\sigma_{0,n}^*$ is treated as a fixed constant.

Theorem 2.6.1 shows that with prior annealing, the path averaging estimator can still be used for estimating the mean and variance of the prediction and constructing the confidence interval. The detailed procedure is given in the next section. For the case that a decaying learning rate is used, a similar theorem can be developed as in [70].
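
To make the SGLD update with an annealed prior concrete, the following is a minimal sketch of one iteration for $\tau = 1$. It is not the authors' code: the function names are hypothetical, the parameter vector is a plain list of floats, and `grad_loglik` is assumed to be a mini-batch estimate of the gradient of $n l_n(\beta)$ supplied by the caller.

```python
import math
import random


def _log_components(beta, sigma0_sq, sigma1_sq, lam):
    """Log of the weighted slab and spike densities at a scalar beta."""
    log_slab = (math.log(lam) - 0.5 * math.log(2 * math.pi * sigma1_sq)
                - beta * beta / (2 * sigma1_sq))
    log_spike = (math.log(1 - lam) - 0.5 * math.log(2 * math.pi * sigma0_sq)
                 - beta * beta / (2 * sigma0_sq))
    return log_slab, log_spike


def log_prior(beta, sigma0_sq, sigma1_sq, lam):
    """Log mixture Gaussian prior density, computed with log-sum-exp."""
    a, b = _log_components(beta, sigma0_sq, sigma1_sq, lam)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def grad_log_prior(beta, sigma0_sq, sigma1_sq, lam):
    """Derivative of the log prior: a weighted L2-type shrinkage."""
    a, b = _log_components(beta, sigma0_sq, sigma1_sq, lam)
    m = max(a, b)
    w_slab = math.exp(a - m) / (math.exp(a - m) + math.exp(b - m))
    return -beta * (w_slab / sigma1_sq + (1 - w_slab) / sigma0_sq)


def sgld_step(beta, grad_loglik, eta, sigma0_sq, sigma1_sq, lam, eps, rng):
    """One SGLD update: beta + eps * drift + sqrt(2 * eps) * N(0, I)."""
    new_beta = []
    for b, g in zip(beta, grad_loglik):
        # Drift of the annealed target: likelihood gradient plus eta times
        # the prior gradient (temperature tau fixed at 1 in this sketch).
        drift = g + eta * grad_log_prior(b, sigma0_sq, sigma1_sq, lam)
        new_beta.append(b + eps * drift + math.sqrt(2 * eps) * rng.gauss(0, 1))
    return new_beta
```

Averaging $\phi(\beta^{(t)})$ over the iterates produced by repeated calls to `sgld_step` gives the path averaging estimator discussed in Theorem 2.6.1.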

2.6.4 Construct Confidence Interval

Theorem 2.4.2 implies that a faithful prediction interval can be constructed for the sparse
neural network learned by the proposed algorithms. In practice, for a normal regression
problem with noise N (0, σ 2 ), to construct the prediction interval for a test point x0 , the
terms σ 2 and Σ = ∇γ ∗ µ(β ∗ , x0 )T H −1 ∇γ ∗ µ(β ∗ , x0 ) in Theorem 2.4.2 need to be estimated
from data. Let Dn = (x(i) , y (i) )i=1,...,n be the training set and µ(β, ·) be the predictor of the
network model with parameter β. We can follow the following procedure to construct the
prediction interval for the test point x0 :

• Run Algorithm 1 or 2, and let $\hat\beta$ be the estimate of the network parameters at the end of the algorithm and $\hat\gamma$ be the corresponding network structure.

• Estimate $\sigma^2$ by
\[
\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\mu(\hat\beta, x^{(i)}) - y^{(i)}\right)^2.
\]

• Estimate $\Sigma$ by
\[
\hat\Sigma = \nabla_{\hat\gamma}\mu(\hat\beta, x_0)^T \left(-\nabla^2_{\hat\gamma} l_n(\hat\beta)\right)^{-1} \nabla_{\hat\gamma}\mu(\hat\beta, x_0).
\]

• Construct the prediction interval as
\[
\left(\mu(\hat\beta, x_0) - 1.96\sqrt{\frac{1}{n}\hat\Sigma + \hat\sigma^2},\; \mu(\hat\beta, x_0) + 1.96\sqrt{\frac{1}{n}\hat\Sigma + \hat\sigma^2}\right).
\]

Here, by the structure selection consistency (Theorem 2.3.2) and the consistency of the MLE for the learnt structure [53], we replace $\beta^*$ and $\gamma^*$ in Theorem 2.4.2 by $\hat\beta$ and $\hat\gamma$.
If the dimension of the sparse network is still too high and the computation of $\hat\Sigma$ becomes prohibitive, the following Bayesian approach can be used to construct confidence intervals.

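• Run an SGMCMC algorithm to get a sequence of posterior samples: $\beta^{(1)}, \ldots, \beta^{(m)}$.

• Estimate $\sigma^2$ by $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(y^{(i)} - \mu^{(i)})^2$, where
\[
\mu^{(i)} = \frac{1}{m}\sum_{j=1}^{m}\mu(\beta^{(j)}, x^{(i)}), \quad i = 1, \ldots, n.
\]

• Estimate the prediction mean by
\[
\hat\mu = \frac{1}{m}\sum_{i=1}^{m}\mu(\beta^{(i)}, x_0).
\]

• Estimate the prediction variance by
\[
\hat V = \frac{1}{m}\sum_{i=1}^{m}\left(\mu(\beta^{(i)}, x_0) - \hat\mu\right)^2 + \hat\sigma^2.
\]

• Construct the prediction interval as
\[
\left(\hat\mu - 1.96\sqrt{\hat V},\; \hat\mu + 1.96\sqrt{\hat V}\right).
\]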
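
The Bayesian procedure above reduces to a few averages over the posterior samples. A minimal sketch follows, assuming the network outputs have already been evaluated for each posterior sample; the function name and the nested-list layout of `train_preds` are choices made for this illustration.

```python
import math


def bayes_prediction_interval(train_preds, y_train, test_preds, z=1.96):
    """Prediction interval at a test point from posterior samples.

    train_preds[j][i] holds mu(beta^(j), x^(i)) for posterior sample j and
    training point i; test_preds[j] holds mu(beta^(j), x0).
    """
    m = len(train_preds)
    n = len(y_train)
    # Posterior-averaged fitted value for each training point.
    mu_bar = [sum(train_preds[j][i] for j in range(m)) / m for i in range(n)]
    # Noise variance estimate from the averaged fit.
    sigma2_hat = sum((y - mu) ** 2 for y, mu in zip(y_train, mu_bar)) / n
    # Prediction mean and variance at the test point.
    mu_hat = sum(test_preds) / m
    v_hat = sum((p - mu_hat) ** 2 for p in test_preds) / m + sigma2_hat
    half_width = z * math.sqrt(v_hat)
    return mu_hat - half_width, mu_hat + half_width
```

The default `z=1.96` corresponds to the 95% interval used in the text.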

2.7 Numerical Experiments

This section illustrates the performance of the proposed method on synthetic and real
data examples.1 For the synthetic example, the frequentist algorithm is employed to con-
struct prediction intervals. The real data example involves a large network, so both the
frequentist and Bayesian algorithms are employed along with comparisons with some exist-
ing network pruning methods.

2.7.1 Synthetic Example

We consider a high-dimensional nonlinear regression problem, which shows that our method can identify the sparse network structure and relevant features as well as produce prediction intervals with correct coverage rates. The explanatory variables $x_1, \ldots, x_{p_n}$ were simulated by independently generating $e, z_1, \ldots, z_{p_n}$ from $N(0, 1)$ and setting $x_i = \frac{e + z_i}{\sqrt{2}}$. The response variable was generated from a nonlinear regression model:
\[
y = \frac{5x_2}{1 + x_1^2} + 5\sin(x_3 x_4) + 2x_5 + 0x_6 + \cdots + 0x_{2000} + \epsilon,
\]
where $\epsilon \sim N(0, 1)$. Ten datasets were generated, each consisting of 10000 samples for training and 1000 samples for testing.
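
The data-generating process can be reproduced with a few lines. This is a sketch under stated assumptions rather than the experiment code: the function names are hypothetical, and the shared factor $e$ is assumed to be drawn once per observation, which makes every pair of features equally correlated.

```python
import math
import random


def mean_function(x):
    """Noise-free response; x is 0-indexed, so x[0] is x_1 in the text."""
    return 5 * x[1] / (1 + x[0] ** 2) + 5 * math.sin(x[2] * x[3]) + 2 * x[4]


def simulate(n, p=2000, seed=0):
    """Draw n observations (X, y) from the synthetic regression model."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        e = rng.gauss(0, 1)  # shared factor (assumed: one draw per observation)
        row = [(e + rng.gauss(0, 1)) / math.sqrt(2) for _ in range(p)]
        X.append(row)
        y.append(mean_function(row) + rng.gauss(0, 1))
    return X, y


# Small example; the sizes used in the text (n = 10000, p = 2000) are slow
# in pure Python.
X_small, y_small = simulate(n=5, p=10, seed=1)
```

Only the first five features enter `mean_function`; the remaining features are pure noise variables that a good selection method should discard.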
We apply both the Bayesian evidence approach (Algorithm 1) and the prior annealing approach (Algorithm 2) to this data set. To demonstrate the difference between the algorithms, for the Bayesian evidence approach we use a small DNN of structure 2000-6-4-3-1. For the prior annealing approach, to satisfy condition (S*), we use a DNN of structure 2000-10000-100-10-1. We use tanh as the activation function. The variable selection performance was measured using the false selection rate $FSR = \frac{\sum_{i=1}^{10}|\hat S_i \setminus S|}{\sum_{i=1}^{10}|\hat S_i|}$ and the negative selection rate $NSR = \frac{\sum_{i=1}^{10}|S \setminus \hat S_i|}{\sum_{i=1}^{10}|S|}$, where $S$ is the set of true variables, $\hat S_i$ is the set of selected variables from dataset $i$ and $|\hat S_i|$ is the size of $\hat S_i$. The predictive performance is measured by mean square prediction error (MSPE) and mean square fitting error (MSFE). We compare our method with the other existing variable selection methods
1 The code for running these experiments can be found at https://github.com/sylydya/Sparse-Deep-Learning-A-New-Framework-Immuneto-Local-Traps-and-Miscalibration and https://github.com/sylydya/Consistent-Sparse-Deep-Learning-Theory-and-Computation
Table 2.1. Simulation Result: MSFE and MSPE were calculated by averaging over 10 datasets, and their standard deviations were given in the parentheses.

Method          |Ŝ|              FSR      NSR     MSFE                    MSPE
BNN anneal      5(0)             0        0       2.353(0.296)            2.428(0.297)
BNN Evidence    5(0)             0        0       2.372(0.093)            2.439(0.132)
Spinn           10.7(3.874)      0.462    0       4.157(0.219)            4.488(0.350)
DNN             -                -        -       1.1701e-5(1.1542e-6)    16.9226(0.3230)
Dropout         -                -        -       1.104(0.068)            13.183(0.716)
BART50          16.5(1.222)      0.727    0.1     11.182(0.334)           12.097(0.366)
LASSO           566.8(4.844)     0.993    0.26    8.542(0.022)            9.496(0.148)
SIS             467.2(11.776)    0.991    0.2     7.083(0.023)            10.114(0.161)

including Sparse input neural network (Spinn) [51], Bayesian adaptive regression tree (BART) [71], linear model with lasso penalty (LASSO) [72], and sure independence screening with SCAD penalty (SIS) [73]. To demonstrate the importance of selecting correct variables, we also compare our method with two dense models with the same network structure: a DNN trained with dropout (Dropout) and a DNN trained with no regularization (DNN). Detailed experiment setups are given in Section 2.7.3. The results are summarized in Table 2.1. With a single run, the prior annealing approach achieves results similar to those of the multiple-run method, which trained the model 10 times and selected the best one using Bayesian evidence. In contrast, Spinn (with the LASSO penalty) performs worse than the sparse BNN model even with the over-parameterized structure.
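
The FSR and NSR reported in Table 2.1 pool counts over the datasets before dividing. A minimal sketch of that computation follows; the function name is hypothetical and variables are assumed to be identified by integer indices.

```python
def selection_rates(selected_sets, true_set):
    """Return (FSR, NSR) pooled over the selected sets of several datasets."""
    S = set(true_set)
    # False selections: selected variables that are not true variables.
    false_selected = sum(len(set(s) - S) for s in selected_sets)
    total_selected = sum(len(set(s)) for s in selected_sets)
    # Missed variables: true variables that were not selected.
    missed = sum(len(S - set(s)) for s in selected_sets)
    total_true = len(S) * len(selected_sets)
    fsr = false_selected / total_selected if total_selected else 0.0
    nsr = missed / total_true
    return fsr, nsr


# Two hypothetical datasets: the second misses variable 5 and adds 6 and 7.
fsr, nsr = selection_rates([{1, 2, 3, 4, 5}, {1, 2, 3, 4, 6, 7}], {1, 2, 3, 4, 5})
```

Returning an FSR of 0 when nothing is selected is a convention chosen for this sketch; the text does not specify that case.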
To quantify the uncertainty of the prediction, we conducted 100 experiments over different training sets generated as described previously. We constructed 95% prediction intervals over 1000 test points. Over the 1000 test points, the average coverage rate of the prediction intervals is 94.72% (0.61%), where the number in parentheses denotes the standard deviation. Figure 2.3 shows the prediction intervals constructed for 20 of the testing points.

2.7.2 Real Data Example

As a different type of application of the proposed method, we conducted unstructured network pruning experiments on the CIFAR10 dataset [74]. Following the setup in [75], we train the residual network [76] with different network sizes and prune the network to different sparsity levels. The detailed experimental setup can be found in Section 2.7.3.

[Figure: confidence intervals, true values and predicted values of the response Y plotted against the test-point index (0 to 19).]

Figure 2.3. Prediction intervals of 20 testing points, where the y-axis is the response value, the x-axis is the index, and the blue point represents the true observation.
We compared the proposed methods, BNN anneal (Algorithm 2), BNN average (averaged over the last 75 networks simulated by the Bayesian version of the prior annealing algorithm) and BNN BIC (multiple runs, with the best model selected by BIC), with several state-of-the-art unstructured pruning methods, including Dynamic pruning with feedback (DPF) [75], Dynamic Sparse Reparameterization (DSR) [77] and Sparse Momentum (SM) [78]. The results of the baseline methods were taken from [75]. The prediction accuracies for different models and target sparsity levels are summarized in Table 2.2. Due to the threshold used in step (iii) of Algorithm 2, it is hard for our method to make the pruning ratio exactly the same as the targeted one. We intentionally make the pruning ratio smaller than the target ratio, and our method still achieves better test accuracy. To further demonstrate that the proposed method results in better model calibration, we followed the setup of [79] and compared the proposed method with DPF on several metrics designed for model calibration, including negative log likelihood (NLL), the symmetrized, discretized KL distance between in-sample and out-of-sample entropy distributions (JS-Distance), and expected calibration error (ECE). For JS-Distance, we used the test data of the SVHN data set (see footnote 2) as out-of-distribution samples. The results are summarized in Table 2.3. As discussed in [3], [79], a well calibrated model tends to have smaller NLL, larger JS-Distance and smaller ECE. The comparison shows that the proposed method outperforms DPF in most cases. In addition to the network pruning methods, we also trained a dense model with the standard training setup. Compared to the dense model, the sparse network has worse accuracy, but it tends to outperform the dense network in terms of ECE and JS-Distance, which indicates that sparsification is also a useful way of improving the calibration of the DNN.
2 The Street View House Numbers (SVHN) Dataset: http://ufldl.stanford.edu/housenumbers/
Table 2.2. ResNet network pruning results for CIFAR-10 data, which were calculated by averaging over 3 independent runs with the standard deviation reported in the parentheses.

               ResNet-20                         ResNet-32
Method         Pruning Ratio    Test Accuracy    Pruning Ratio    Test Accuracy
DNN dense      100%             92.93(0.04)      100%             93.76(0.02)

BNN average    19.85%(0.18%)    92.53(0.08)      9.99%(0.08%)     93.12(0.09)
BNN anneal     19.80%(0.01%)    92.30(0.16)      9.97%(0.03%)     92.63(0.09)
BNN BIC        19.67%(0.05%)    92.27(0.03)      9.53%(0.04%)     92.74(0.07)
SM             20%              91.54(0.16)      10%              91.54(0.18)
DSR            20%              91.78(0.28)      10%              91.41(0.23)
DPF            20%              92.17(0.21)      10%              92.42(0.18)

BNN average    9.88%(0.02%)     91.65(0.08)      4.77%(0.08%)     91.30(0.16)
BNN anneal     9.95%(0.03%)     91.28(0.11)      4.88%(0.02%)     91.17(0.08)
BNN BIC        9.55%(0.03%)     91.27(0.05)      4.78%(0.01%)     91.21(0.01)
SM             10%              89.76(0.40)      5%               88.68(0.22)
DSR            10%              87.88(0.04)      5%               84.12(0.32)
DPF            10%              90.88(0.07)      5%               90.94(0.35)
Table 2.3. ResNet network pruning results for CIFAR-10 data, which were calculated by averaging over 3 independent runs with the standard deviation reported in the parentheses.

Method         Model       Pruning Ratio    NLL               JS-Distance        ECE
DNN dense      ResNet20    100%             0.2276(0.0021)    7.9118(0.9316)     0.02627(0.0005)
BNN average    ResNet20    9.88%(0.02%)     0.2528(0.0029)    9.9641(0.3069)     0.0113(0.0010)
BNN anneal     ResNet20    9.95%(0.03%)     0.2618(0.0037)    10.1251(0.1797)    0.0175(0.0011)
DPF            ResNet20    10%              0.2833(0.0004)    7.5712(0.4466)     0.0294(0.0009)
BNN average    ResNet20    19.85%(0.18%)    0.2323(0.0033)    7.7007(0.5374)     0.0173(0.0014)
BNN anneal     ResNet20    19.80%(0.01%)    0.2441(0.0042)    6.4435(0.2029)     0.0233(0.0020)
DPF            ResNet20    20%              0.2874(0.0029)    7.7329(0.1400)     0.0391(0.0001)
DNN dense      ResNet32    100%             0.2042(0.0017)    6.7699(0.5253)     0.02613(0.00029)
BNN average    ResNet32    9.99%(0.08%)     0.2116(0.0012)    9.4549(0.5456)     0.0132(0.0001)
BNN anneal     ResNet32    9.97%(0.03%)     0.2218(0.0013)    8.5447(0.1393)     0.0192(0.0009)
DPF            ResNet32    10%              0.2677(0.0041)    7.8693(0.1840)     0.0364(0.0015)
BNN average    ResNet32    4.77%(0.08%)     0.2587(0.0022)    7.0117(0.2222)     0.0100(0.0002)
BNN anneal     ResNet32    4.88%(0.02%)     0.2676(0.0014)    6.8440(0.4850)     0.0149(0.0006)
DPF            ResNet32    5%               0.2921(0.0067)    6.3990(0.8384)     0.0276(0.0019)

2.7.3 Experimental Setups

Synthetic Example

For prior annealing, we follow the simple implementation of Algorithm 2 given in Section 2.6.2. We run SGHMC for $T = 80000$ iterations with constant learning rate $\epsilon_t = 0.001$, momentum $1 - \alpha = 0.9$ and subsample size $m = 500$. We set $\lambda_n = 1e-7$, $\sigma_{1,n}^2 = 1e-2$, $(\sigma_{0,n}^{init})^2 = 5e-5$, $(\sigma_{0,n}^{end})^2 = 1e-6$ and $T_1 = 5000$, $T_2 = 20000$, $T_3 = 60000$. We set the temperature $\tau = 0.1$ for $t < T_3$, and for $t > T_3$ we gradually decrease the temperature by setting $\tau = \frac{0.1}{t - T_3}$. After structure selection, the model is fine-tuned for 40000 iterations.
For the Bayesian evidence approach (Algorithm 1), we ran SGD for 80,000 iterations to train the neural network with a learning rate of $\epsilon_t = 0.005$. The subsample size was set to 500. For the mixture Gaussian prior, we set $\sigma_{1,n} = 0.01$, $\sigma_{0,n} = 0.0001$, and $\lambda_n = 0.00001$. The number of independent tries was set to $T = 10$. After structure selection, the DNN was retrained using SGD for 40,000 iterations.

Spinn, Dropout and DNN are trained with the same network structure as the prior annealing method using SGD with momentum. As in our method, we use a constant learning rate of 0.001, momentum 0.9 and subsample size 500, and train the model for 80000 iterations. For Spinn, we use the LASSO penalty, and the regularization parameter is selected from {0.05, 0.06, . . . , 0.15} according to the performance on a validation data set. For Dropout, the dropout rate is set to 0.2 for the first layer and 0.5 for the other layers. The other baseline methods, BART50, LASSO and SIS, are implemented using the R packages randomForest, glmnet, BART and SIS, respectively, with default parameters.

Real Data Examples

We follow the standard training procedure as in [75], i.e., we train the model with SGHMC for $T = 300$ epochs, with initial learning rate $\epsilon_0 = 0.1$, momentum $1 - \alpha = 0.9$, temperature $\tau = 0.001$ and mini-batch size $m = 128$. The learning rate is divided by 10 at the 150th and 225th epochs.
For prior annealing, we follow the implementation given in Section 2.6.2 and use $T_1 = 150$, $T_2 = 200$, $T_3 = 225$, where the $T_i$'s are numbers of epochs. We set the temperature $\tau = 0.01$ for $t < T_3$ and gradually decrease $\tau$ by setting $\tau = \frac{0.01}{t - T_3}$ for $t > T_3$. We set $\sigma_{1,n}^2 = 0.04$ and $(\sigma_{0,n}^{init})^2 = 10 \times (\sigma_{0,n}^{end})^2$, and use different $\sigma_{0,n}^{end}$, $\lambda_n$ for different network sizes and target sparsity levels. The detailed settings are given below:

• ResNet20 with target sparsity level 20%: $(\sigma_{0,n}^{end})^2 = 1.5e-5$, $\lambda_n = 1e-8$

• ResNet20 with target sparsity level 10%: $(\sigma_{0,n}^{end})^2 = 6e-5$, $\lambda_n = 1e-9$

• ResNet32 with target sparsity level 10%: $(\sigma_{0,n}^{end})^2 = 3e-5$, $\lambda_n = 2e-9$

• ResNet32 with target sparsity level 5%: $(\sigma_{0,n}^{end})^2 = 1e-4$, $\lambda_n = 2e-8$

For the BIC approach, the only setting that differs from the prior annealing approach is the prior. We set $\sigma_{1,n}^2 = 0.02$ and tried different values for $\sigma_{0,n}$ and $\lambda_n$ to achieve different sparsity levels. For ResNet-20, to achieve 10% target sparsity, we set $\sigma_{0,n}^2 = 4e-5$ and $\lambda_n = 1e-6$; to achieve 20% target sparsity, we set $\sigma_{0,n}^2 = 6e-6$ and $\lambda_n = 1e-7$. For ResNet-32, to achieve 5% target sparsity, we set $\sigma_{0,n}^2 = 6e-5$ and $\lambda_n = 1e-7$; to achieve 10% target sparsity, we set $\sigma_{0,n}^2 = 2e-5$ and $\lambda_n = 1e-5$.

2.8 Discussion

In this dissertation, we provide a complete treatment for sparse DNNs in both theory
and computation. The sparse DNN can be simply viewed as a nonlinear statistical model
which, like a traditional statistical model, possesses many nice properties such as posterior
consistency, variable selection consistency, asymptotic normality, and asymptotically optimal
generalization bound.
In computation, we proposed to use Bayesian evidence or BIC for eliciting sparse DNN models learned by an optimization method in multiple runs with different initializations, and a prior annealing approach for over-parameterized DNN models. The computational complexity of the proposed method is of the same order as that of the standard SGD method for DNN training. Our
numerical results show that the proposed method can perform very well in large-scale network
compression and high-dimensional nonlinear variable selection. The networks learned by
the proposed method tend to predict better than the existing methods and have better
calibration.
In this chapter, we choose the two-mixture Gaussian prior for the weights and biases of the DNN, mainly for the sake of computational convenience. Other choices, such as the two-mixture Laplace prior [80], will lead to the same posterior contraction with an appropriate choice of the prior hyperparameters. To be more specific, Theorem 2.9.1 establishes sufficient conditions that guarantee the posterior consistency, and any prior distribution satisfying the sufficient conditions can yield consistent posterior inferences for the DNN.
Beyond the absolutely continuous prior, the hierarchical prior used in [14] and [15] can be adopted for DNNs. To be more precise, one can assume that
\[
\beta_\gamma \mid \gamma \sim N(0, \sigma_{1,n}^2 I_{\|\gamma\| \times \|\gamma\|}), \quad \beta_{\gamma^c} = 0; \tag{2.17}
\]
\[
\pi(\gamma) \propto \lambda_n^{\|\gamma\|}(1 - \lambda_n)^{K_n - \|\gamma\|}\, 1\{1 \le \|\gamma\| \le \bar r_n, \gamma \in \mathcal{G}\}, \tag{2.18}
\]
where $\beta_{\gamma^c}$ is the complement of $\beta_\gamma$, $\|\gamma\|$ is the number of nonzero elements of $\gamma$, $I_{\|\gamma\| \times \|\gamma\|}$ is a $\|\gamma\| \times \|\gamma\|$ identity matrix, $\bar r_n$ is the maximally allowed size of candidate networks, $\mathcal{G}$ is the set of valid DNNs, and the hyperparameter $\lambda_n$, as in (2.3), can be read as an approximate prior probability for each connection or bias to be included in the DNN. Under this prior, the product of the weight or bias and its indicator follows a discrete spike-and-slab prior distribution, i.e.,
\[
w_{ij}^h \gamma_{ij}^{w^h} \mid \gamma_{ij}^{w^h} \sim \gamma_{ij}^{w^h} N(0, \sigma_{1,n}^2) + (1 - \gamma_{ij}^{w^h})\delta_0, \quad
b_k^h \gamma_k^{b^h} \mid \gamma_k^{b^h} \sim \gamma_k^{b^h} N(0, \sigma_{1,n}^2) + (1 - \gamma_k^{b^h})\delta_0,
\]
where $\delta_0$ denotes the Dirac delta function. Under this hierarchical prior, it is not difficult to show that the posterior consistency and structure selection consistency theory developed in this chapter still hold. However, from the computational perspective, the hierarchical prior might be inferior to the mixture Gaussian prior adopted in the chapter, as the posterior $\pi(\beta_\gamma, \gamma \mid D_n)$ is hard to optimize or simulate from. It is known that directly simulating from $\pi(\beta_\gamma, \gamma \mid D_n)$ using an acceptance-rejection based MCMC algorithm can be time consuming. A feasible way is to formulate the prior of $\beta_\gamma$ as $\beta_\gamma = \theta \otimes \gamma$, where $\theta \sim N(0, \sigma_{1,n}^2 I_{H_n \times H_n})$ can be viewed as a latent variable and $\otimes$ denotes the entry-wise product. Then one can first simulate from the marginal posterior $\pi(\theta \mid D_n)$ using a stochastic gradient MCMC algorithm and then make inference of the network structure based on the conditional posterior $\pi(\gamma \mid \theta, D_n)$. We note that the gradient $\nabla_\theta \log \pi(\theta \mid D_n)$ can be approximated based on the following identity developed in [81],
\[
\nabla_\theta \log \pi(\theta \mid D_n) = \sum_{\gamma} \pi(\gamma \mid \theta, D_n) \nabla_\theta \log \pi(\theta \mid \gamma, D_n),
\]
where $D_n$ can be replaced by a dataset duplicated with mini-batch samples if the subsampling strategy is used to accelerate the simulation. This identity greatly facilitates the simulations for dimension jumping problems, as it requires only some samples to be drawn from the conditional posterior $\pi(\gamma \mid \theta, D_n)$ for approximating the gradient $\nabla_\theta \log \pi(\theta \mid D_n)$ at each iteration. A further exploration of this discrete prior for its use in deep learning is of great interest, although there are some computational difficulties that need to be addressed.

2.9 Technical Proofs

This section is organized as follows. Section 2.9.1 gives the proofs on posterior consistency,
Section 2.9.2 gives the proofs on structure selection consistency, Section 2.9.3 gives the proofs
on BvM theorem for weights and predictions. Section 2.9.4 gives the proofs on generalization
bounds, and Section 2.9.5 gives some mathematical facts of the sparse DNN.

2.9.1 Proofs on Posterior Consistency

Basic Formulas of Bayesian Neural Networks

Normal Regression.

Let $p_\mu$ denote the density of $N(\mu, \sigma^2)$ where $\sigma^2$ is a known constant, and let $p_\beta$ denote the density of $N(\mu(\beta, x), \sigma^2)$. Extension to the case where $\sigma^2$ is unknown is simple by following the arguments given in [39]. In this case, an inverse gamma prior can be assumed for $\sigma^2$ as suggested by [39]. Define the Kullback-Leibler divergence as $d_0(p, p^*) = \int p^* \log(p^*/p)$ for two densities $p$ and $p^*$. Define a distance $d_t(p, p^*) = t^{-1}(\int p^* (p^*/p)^t - 1)$ for any $t > 0$, which decreases to $d_0$ as $t$ decreases toward 0. A straightforward calculation shows
\[
d_1(p_{\mu_1}, p_{\mu_2}) = \int p_{\mu_2}(p_{\mu_2}/p_{\mu_1}) - 1 = \exp\left\{\frac{1}{\sigma^2}(\mu_2 - \mu_1)^2\right\} - 1 = \frac{1}{\sigma^2}(\mu_2 - \mu_1)^2 + o((\mu_2 - \mu_1)^3), \tag{2.19}
\]
\[
d_0(p_{\mu_1}, p_{\mu_2}) = \frac{1}{2\sigma^2}(\mu_1 - \mu_2)^2. \tag{2.20}
\]
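
For readers who want the intermediate steps behind these two Gaussian formulas, the following is a sketch of the calculation by completing the square, with $y$ denoting the integration variable (a symbol introduced here only for illustration).

```latex
\begin{align*}
\int \frac{p_{\mu_2}(y)^2}{p_{\mu_1}(y)}\,dy
  &= \int \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\left\{-\frac{2(y-\mu_2)^2 - (y-\mu_1)^2}{2\sigma^2}\right\} dy, \\
2(y-\mu_2)^2 - (y-\mu_1)^2
  &= \bigl(y - (2\mu_2 - \mu_1)\bigr)^2 - 2(\mu_2 - \mu_1)^2, \\
\text{hence}\quad
\int \frac{p_{\mu_2}(y)^2}{p_{\mu_1}(y)}\,dy
  &= \exp\left\{\frac{(\mu_2 - \mu_1)^2}{\sigma^2}\right\}, \\
\int p_{\mu_2}(y) \log\frac{p_{\mu_2}(y)}{p_{\mu_1}(y)}\,dy
  &= \frac{1}{2\sigma^2}\, E_{y \sim N(\mu_2,\sigma^2)}
     \bigl[(y-\mu_1)^2 - (y-\mu_2)^2\bigr]
   = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}.
\end{align*}
```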

Logistic Regression.

Let $p_\mu$ denote the probability mass function with the success probability given by $1/(1 + e^{-\mu})$. Similarly, we define $p_\beta$ as the logistic regression density for a binary classification DNN with parameter $\beta$. For logistic regression, we have
\[
d_1(p_{\mu_1}, p_{\mu_2}) = \int p_{\mu_2}(p_{\mu_2}/p_{\mu_1}) - 1 = \frac{e^{2\mu_2 - \mu_1} + e^{\mu_1} - 2e^{\mu_2}}{(1 + e^{\mu_2})^2},
\]
which, by the mean value theorem, can be written as
\[
d_1(p_{\mu_1}, p_{\mu_2}) = \frac{e^{\mu'} - e^{2\mu_2 - \mu'}}{(1 + e^{\mu_2})^2}(\mu_1 - \mu_2) = \frac{e^{\mu'}(1 - e^{2\mu_2 - 2\mu'})}{(1 + e^{\mu_2})^2}(\mu_1 - \mu_2),
\]
where $\mu'$ denotes an intermediate point between $\mu_1$ and $\mu_2$, and thus $|\mu' - \mu_2| \le |\mu_1 - \mu_2|$. Further, by Taylor expansion, we have
\[
e^{\mu'} = e^{\mu_2}[1 + (\mu' - \mu_2) + O((\mu' - \mu_2)^2)], \quad e^{2\mu_2 - 2\mu'} = 1 + 2(\mu_2 - \mu') + O((\mu_2 - \mu')^2).
\]
Therefore,
\[
d_1(p_{\mu_1}, p_{\mu_2}) \le \frac{e^{\mu_2}}{(1 + e^{\mu_2})^2}\left[2|\mu_2 - \mu'| + O((\mu_2 - \mu')^2)\right]|\mu_1 - \mu_2| \le \frac{1}{2}(\mu_1 - \mu_2)^2 + O((\mu_1 - \mu_2)^3), \tag{2.21}
\]
and
\[
d_0(p_u, p_v) = \int p_v(\log p_v - \log p_u)\, v_y(dy) = \log(1 + e^{\mu_u}) - \log(1 + e^{\mu_v}) + \frac{e^{\mu_v}}{1 + e^{\mu_v}}(\mu_v - \mu_u).
\]
By the mean value theorem, we have
\[
d_0(p_u, p_v) = \frac{e^{\mu'}}{1 + e^{\mu'}}(\mu_u - \mu_v) + \frac{e^{\mu_v}}{1 + e^{\mu_v}}(\mu_v - \mu_u) = \left[\frac{e^{\mu_v}}{1 + e^{\mu_v}} - \frac{e^{\mu'}}{1 + e^{\mu'}}\right](\mu_v - \mu_u), \tag{2.22}
\]
where $\mu'$ denotes an intermediate point between $\mu_u$ and $\mu_v$.

Posterior Consistency of General Statistical Models

We first introduce a lemma concerning posterior consistency of general statistical models. This lemma has been proved in [39]. Let $\mathbb{P}_n$ denote a sequence of sets of probability densities, let $\mathbb{P}_n^c$ denote the complement of $\mathbb{P}_n$, and let $\epsilon_n$ denote a sequence of positive numbers. Let $N(\epsilon_n, \mathbb{P}_n)$ be the minimum number of Hellinger balls of radius $\epsilon_n$ that are needed to cover $\mathbb{P}_n$, i.e., $N(\epsilon_n, \mathbb{P}_n)$ is the minimum of all $k$'s such that there exist sets $S_j = \{p : d(p, p_j) \le \epsilon_n\}$, $j = 1, \ldots, k$, with $\mathbb{P}_n \subset \cup_{j=1}^{k} S_j$ holding, where $d(p, q) = \sqrt{\int (\sqrt{p} - \sqrt{q})^2}$ denotes the Hellinger distance between the two densities $p$ and $q$.

Let $D_n = (z^{(1)}, \ldots, z^{(n)})$ denote the dataset, where the observations $z^{(1)}, \ldots, z^{(n)}$ are iid with the true density $p^*$. The dimension of $z^{(1)}$ and $p^*$ can depend on $n$. Define $\pi(\cdot)$ as the prior density, and $\pi(\cdot \mid D_n)$ as the posterior. Define $\hat\pi(\epsilon) = \pi[d(p, p^*) > \epsilon \mid D_n]$ for each $\epsilon > 0$. Define the KL divergence as $d_0(p, p^*) = \int p^* \log(p^*/p)$. Define $d_t(p, p^*) = t^{-1}(\int p^*(p^*/p)^t - 1)$ for any $t > 0$, which decreases to $d_0$ as $t$ decreases toward 0. Let $P^*$ and $E^*$ denote the probability measure and expectation for the data $D_n$, respectively. Define the conditions:

(a) $\log N(\epsilon_n, \mathbb{P}_n) \le n\epsilon_n^2$ for all sufficiently large $n$;

(b) $\pi(\mathbb{P}_n^c) \le e^{-bn\epsilon_n^2}$ for all sufficiently large $n$;

(c) $\pi[p : d_t(p, p^*) \le b'\epsilon_n^2] \ge e^{-b'n\epsilon_n^2}$ for all sufficiently large $n$ and some $t > 0$,

where $2 > b > 2b' > 0$ are positive constants. The following lemma is due to the same argument as [39, Proposition 1].

Lemma 2.9.1. Under the conditions (a), (b) and (c) (for some $t > 0$), given sufficiently large $n$, we have

(i) $P^*\left[\hat\pi(4\epsilon_n) \ge 2e^{-0.5n\epsilon_n^2 \min\{1, 2-x, b-x, t(x-2b')\}}\right] \le 2e^{-0.5n\epsilon_n^2 \min\{1, 2-x, b-x, t(x-2b')\}}$,

(ii) $E^*\hat\pi(4\epsilon_n) \le 4e^{-n\epsilon_n^2 \min\{1, 2-x, b-x, t(x-2b')\}}$,

for any $2b' < x < b$.

General Shrinkage Prior Settings for Deep Neural Networks

Let $\beta$ denote the vector of parameters, including the weights of connections and the biases of the hidden and output units, of a deep neural network. Consider a general prior setting in which all entries of $\beta$ are subject to an independent continuous prior $\pi_b$, i.e., $\pi(\beta) = \prod_{j=1}^{K_n} \pi_b(\beta_j)$. Theorem 2.9.1 provides a sufficient condition for posterior consistency.
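Theorem 2.9.1 (Posterior consistency). Assume the conditions A.1, A.2 and A.3 hold. If the prior $\pi(\beta)$ satisfies that
\[
\log(1/\underline{\pi}_b) = O(H_n \log n + \log L), \tag{2.23}
\]
\[
\pi_b\{[-\eta_n, \eta_n]\} \ge 1 - \frac{1}{K_n}\exp\{-\tau[H_n \log n + \log L + \log p_n]\} \quad \text{and} \quad \pi_b\{[-\eta_n', \eta_n']\} \ge 1 - \frac{1}{K_n}, \tag{2.24}
\]
\[
-\log[K_n \pi_b(|\beta_j| > M_n)] \succ n\epsilon_n^2, \tag{2.25}
\]
for some $\tau > 0$, where $\eta_n < 1/\{\sqrt{n} K_n (n/H_n)^{H_n}(c_0 M_n)^{H_n}\}$, $\eta_n' < 1/\{\sqrt{n} K_n (r_n/H_n)^{H_n}(c_0 E_n)^{H_n}\}$ with some $c_0 > 1$, $\underline{\pi}_b$ is the minimal density value of $\pi_b$ within the interval $[-E_n - 1, E_n + 1]$, and $M_n$ is some sequence satisfying $\log(M_n) = O(\log(n))$, then there exists a sequence $\epsilon_n$, satisfying $n\epsilon_n^2 \asymp r_n H_n \log n + r_n \log L + s_n \log p_n + n\varpi_n^2$ and $\epsilon_n \prec 1$, such that
\[
\begin{aligned}
& P^*\left\{\pi[d(p_\beta, p_{\mu^*}) > 4\epsilon_n \mid D_n] \ge 2e^{-nc\epsilon_n^2}\right\} \le 2e^{-cn\epsilon_n^2}, \\
& E_{D_n}\pi[d(p_\beta, p_{\mu^*}) > 4\epsilon_n \mid D_n] \le 4e^{-2cn\epsilon_n^2},
\end{aligned} \tag{2.26}
\]
for some $c > 0$.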

To prove Theorem 2.9.1, we first introduce a useful Lemma:

Lemma 2.9.2 (Theorem 1 of [82]). Let X ∼ B(n, v) be a Binomial random variable. For
any 1 < k < n − 1,

P r(X ≥ k + 1) ≤ 1 − Φ(sign(k − nv){2nH(v, k/n)}1/2 ),

where Φ is the cumulative distribution function (CDF) of the standard Gaussian distribution
and H(v, k/n) = (k/n) log(k/nv) + (1 − k/n) log [(1 − k/n)/(1 − v)].
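
The bound of Lemma 2.9.2 can be compared numerically with the exact binomial tail. The sketch below is illustrative only: the function names are hypothetical, and the values of $n$ and $v$ in the usage line are arbitrary.

```python
import math


def binom_upper_tail(n, v, k):
    """Exact Pr(X >= k + 1) for X ~ Binomial(n, v)."""
    return sum(math.comb(n, j) * v ** j * (1 - v) ** (n - j)
               for j in range(k + 1, n + 1))


def std_normal_cdf(z):
    """CDF of the standard Gaussian distribution via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tail_bound(n, v, k):
    """Right-hand side of the lemma: 1 - Phi(sign(k - nv) sqrt(2 n H))."""
    q = k / n
    H = q * math.log(q / v) + (1 - q) * math.log((1 - q) / (1 - v))
    H = max(H, 0.0)  # guard against tiny negative rounding error
    sign = (k > n * v) - (k < n * v)
    return 1.0 - std_normal_cdf(sign * math.sqrt(2 * n * H))


# Exact tail and bound for each admissible k with n = 10, v = 0.3.
pairs = [(binom_upper_tail(10, 0.3, k), tail_bound(10, 0.3, k)) for k in range(2, 9)]
```

Each pair lists the exact tail probability first and the bound second, so the gap between them shows how tight the lemma is for a given $k$.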

Proof of Theorem 2.9.1


Theorem 2.9.1 can be proved using Lemma 2.9.1, so it suffices to verify conditions (a)-(c)
given in Section 2.9.1.
Checking condition (c) for t = 1:

Consider the set $A = \{\beta : \max_{j\in\gamma^*}\|\beta_j - \beta_j^*\|_\infty \le \omega_n,\ \max_{j\notin\gamma^*}\|\beta_j - \beta_j^*\|_\infty \le \omega_n'\}$, where
$\omega_n = c_1\epsilon_n/[H_n(r_n/H_n)^{H_n}(c_0E_n)^{H_n}]$ and $\omega_n' = c_1\epsilon_n/[K_n(r_n/H_n)^{H_n}(c_0E_n)^{H_n}]$ for some constants
$c_1 > 0$ and $c_0 > 1$. If β ∈ A, then by Lemma 2.9.5, we have $|\mu(\beta,x) - \mu(\beta^*,x)| \le 3c_1\epsilon_n$. By
condition A.2.1, $|\mu(\beta,x) - \mu^*(x)| \le 3c_1\epsilon_n + \varpi_n$. Combining this with (2.19)–(2.22), for both
normal and logistic models, we have

$$d_1(p_\beta,p_{\mu^*}) \le C(1+o(1))E_x(\mu(\beta,x)-\mu^*(x))^2 \le C(1+o(1))(3c_1\epsilon_n+\varpi_n)^2, \quad\text{if } \beta\in A,$$

for some constant C. Thus for any small b′ > 0, condition (c) holds as long as $c_1$ is
sufficiently small, $n\epsilon_n^2 \ge M_0n\varpi_n^2$ for large $M_0$, and the prior satisfies $-\log\pi(A) \le b'n\epsilon_n^2$.
Since $\pi(A) \ge (2\underline{\pi}_b\omega_n)^{r_n}\times\pi(\{\max_{j\notin\gamma^*}\|\beta_j\| \le \omega_n'\})$, where $\underline{\pi}_b$ is the minimal density value of $\pi_b$ from Theorem 2.9.1,
$\pi_b([-\omega_n',\omega_n']) \ge 1 - 1/K_n$ (due to the fact $\omega_n' \gg \eta_n'$), and
$\log(1/\omega_n) \asymp \log(1/\epsilon_n) + H_n\log E_n + H_n\log(r_n/H_n) + \text{constant} = O(H_n\log n)$
(note that $\log(1/\epsilon_n) = O(\log n)$), the above requirement holds when $n\epsilon_n^2 \ge M_0r_nH_n\log n$
for some sufficiently large constant $M_0$.
Checking condition (a):
Let $P_n$ denote the set of all DNN models whose weight parameter β satisfies

$$\beta \in B_n = \{|\beta_j| \le M_n,\ \gamma^\beta = \{i : |\beta_i| \ge \delta_n'\} \text{ satisfies } |\gamma^\beta| \le k_nr_n \text{ and } |\gamma^\beta|_{in} \le k_n's_n\}, \tag{2.27}$$

where $|\gamma|_{in}$ denotes the input dimension of the sparse network γ, $k_n$ ($\le n/r_n$) and $k_n'$ ($\le n/s_n$)
will be specified later, and $\delta_n = c_1\epsilon_n/[H_n(k_nr_n/H_n)^{H_n}(c_0M_n)^{H_n}]$ and
$\delta_n' = c_1\epsilon_n/[K_n(k_nr_n/H_n)^{H_n}(c_0M_n)^{H_n}]$ for some constants $c_1 > 0$ and $c_0 > 1$. Consider two
parameter vectors $\beta^u$ and $\beta^v$ in the set $B_n$ such that there exists a model γ with $|\gamma| \le k_nr_n$
and $|\gamma|_{in} \le k_n's_n$, and $|\beta_j^u - \beta_j^v| \le \delta_n$ for all j ∈ γ, $\max(|\beta_j^u|,|\beta_j^v|) \le \delta_n'$ for all j ∉ γ.
Hence, by Lemma 2.9.5, we have $|\mu(\beta^u,x) - \mu(\beta^v,x)|^2 \le 9c_1^2\epsilon_n^2$, and furthermore, due
to (2.19)–(2.22), we can easily derive that

$$d(p_{\beta^u},p_{\beta^v}) \le \sqrt{d_0(p_{\beta^u},p_{\beta^v})} \le \sqrt{(9+o(1))c_1^2C\epsilon_n^2} \le \epsilon_n,$$

for some C, given a sufficiently small $c_1$. On the other hand, if for some $\beta^u \in B_n$ the
connections whose magnitudes are larger than $\delta_n'$ do not form a valid network, then by Lemma
2.9.5 and (2.19)–(2.22), we also have $d(p_{\beta^u},p_{\beta^o}) \le \epsilon_n$, where $\beta^o = 0$ denotes an empty
output network.
Given the above results, one can bound the packing number $N(P_n,\epsilon_n)$ by
$\sum_{j=1}^{k_nr_n} X_{H_n}^j\left(\frac{2M_n}{\delta_n}\right)^j$,
where $X_{H_n}^j$ denotes the number of all valid networks that have exactly j connections and no
more than $k_n's_n$ inputs. Since $\log X_{H_n}^j \le k_n's_n\log p_n + j\log(k_n's_nL_1 + H_nL^2)$,

$$\begin{aligned}
\log N(P_n,\epsilon_n) &\le \log k_nr_n + k_nr_n\log H_n + 2k_nr_n\log(L + k_n's_n) + k_n's_n\log p_n \\
&\quad + k_nr_n\log\frac{2M_nH_n(k_nr_n/H_n)^{H_n}M_n^{H_n}}{c_1\epsilon_n} \\
&= k_nr_n \cdot O\{H_n\log n + \log L + \text{constant}\} + k_n's_n\log p_n,
\end{aligned}$$

where the last equality is due to $\log M_n = O(\log n)$, $k_nr_n \le n$ and $k_n's_n \le n$. We can
choose $k_n$ and $k_n'$ such that $k_nr_n\{H_n\log n + \log L\} \asymp k_n's_n \asymp n\epsilon_n^2$ and $\log N(P_n,\epsilon_n) \le n\epsilon_n^2$.
Checking condition (b):
$\pi(P_n^c) \le \Pr(\mathrm{Binomial}(K_n,v_n) > k_nr_n) + K_n\pi_b(|\beta_j| > M_n) + \Pr(|\gamma^\beta|_{in} \ge k_n's_n)$, where
$v_n = 1 - \pi_b([-\delta_n',\delta_n'])$. By the condition on $\pi_b$ and the fact that $\delta_n' \gg \eta_n$,
$v_n \le \exp\{-\tau[H_n\log n + \log L + \log p_n] - \log K_n\}$ for some positive constant τ.
Hence, by Lemma 2.9.2, $-\log\Pr(\mathrm{Binomial}(K_n,v_n) > k_nr_n) \approx \tau k_nr_n[H_n\log n + \log L + \log p_n] \gtrsim n\epsilon_n^2$
due to the choice of $k_n$, and $-\log\Pr(|\gamma^\beta|_{in} \ge k_n's_n) \approx k_n's_n[\tau(H_n\log n + \log L + \log p_n) + \log(K_n/L_1p_n)] \gtrsim n\epsilon_n^2$
due to the choice of $k_n'$. Thus, condition (b) holds as well.

Proof of Theorem 2.2.1



Proof. It suffices to verify the conditions listed in Theorem 2.9.1. Let $M_n = \max(\sqrt{2n}\sigma_{1,n}, E_n)$.
Condition (2.23) is due to $E_n^2/2\sigma_{1,n}^2 + \log\sigma_{1,n}^2 = O[H_n\log n + \log L]$; condition (2.24) can be
verified by $\lambda_n = 1/\{K_n[n^{H_n}(Lp_n)]^{\tau'}\}$ and $\sigma_{0,n} \prec 1/\{\sqrt{n}K_n(n/H_n)^{H_n}(c_0M_n)^{H_n}\}$; condition
(2.25) can be verified by $M_n \ge 2n\sigma_{0,n}^2$ and $\tau[H_n\log n + \log L + \log p_n] + M_n^2/2\sigma_{1,n}^2 \ge n$.

2.9.2 Proofs on Structure Selection Consistency

Proof of Theorem 2.3.1

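Proof.

$$\begin{aligned}
\max_i|q_i - e_{i|\nu(\gamma^*,\beta^*)}| &\le \max_i\int\sum_\gamma|e_{i|\nu(\gamma,\beta)} - e_{i|\nu(\gamma^*,\beta^*)}|\,\pi(\gamma|\beta,D_n)\pi(\beta|D_n)\,d\beta \\
&= \max_i\int_{A(4\epsilon_n)}\sum_\gamma|e_{i|\nu(\gamma,\beta)} - e_{i|\nu(\gamma^*,\beta^*)}|\,\pi(\gamma|\beta,D_n)\pi(\beta|D_n)\,d\beta + \rho(4\epsilon_n) \\
&\le \hat\pi(4\epsilon_n) + \rho(4\epsilon_n) \xrightarrow{p} 0,
\end{aligned} \tag{2.28}$$

where $\xrightarrow{p}$ denotes convergence in probability, and $\hat\pi(c)$ denotes the posterior probability of the
set $A(c) = \{\beta : d(p_\beta,p_{\mu^*}) \ge c\}$. The last convergence is due to the identifiability condition
B.1 and the posterior consistency result. This completes the proof of part (i).
Parts (ii) and (iii): They are directly implied by part (i).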

Proof of Theorem 2.3.2


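Proof. Let $u = \sqrt{n}(\beta - \hat\beta) = (u_1,\dots,u_{K_n})^T$, and let
$g(\beta) = nh(\beta) - nh(\hat\beta) - \frac{n}{2}\sum h_{i,j}(\hat\beta)(\beta_i-\hat\beta_i)(\beta_j-\hat\beta_j)$.
It is easy to see that for all $1 \le i_1,\dots,i_d \le K_n$, $g_{i_1,\dots,i_d}(\hat\beta) = 0$ if $1 \le d \le 2$, and
$g_{i_1,\dots,i_d}(\beta) = nh_{i_1,\dots,i_d}(\beta)$ if $d \ge 3$.
Considering the Taylor expansions of b(β) and exp(g(β)) at $\hat\beta$, we have

$$\begin{aligned}
b(\beta) &= b(\hat\beta) + \sum b_i(\hat\beta)(\beta_i-\hat\beta_i) + \frac12\sum b_{i,j}(\tilde\beta)(\beta_i-\hat\beta_i)(\beta_j-\hat\beta_j) \\
&= b(\hat\beta) + \frac{1}{\sqrt n}\sum b_i(\hat\beta)u_i + \frac{1}{2n}\sum b_{i,j}(\tilde\beta)u_iu_j,
\end{aligned}$$

$$\begin{aligned}
e^{g(\beta)} &= 1 + \frac{n}{3!}\sum_{i,j,k}h_{i,j,k}(\hat\beta)(\beta_i-\hat\beta_i)(\beta_j-\hat\beta_j)(\beta_k-\hat\beta_k) \\
&\quad + \frac{n}{4!}e^{g(\check\beta)}\sum_{i,j,k,l}h_{i,j,k,l}(\check\beta)(\beta_i-\hat\beta_i)(\beta_j-\hat\beta_j)(\beta_k-\hat\beta_k)(\beta_l-\hat\beta_l) \\
&= 1 + \frac{1}{6\sqrt n}\sum h_{i,j,k}(\hat\beta)u_iu_ju_k + \frac{1}{24n}e^{g(\check\beta)}\sum h_{i,j,k,l}(\check\beta)u_iu_ju_ku_l,
\end{aligned}$$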

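where $\tilde\beta$ and $\check\beta$ are two points between β and $\hat\beta$. In what follows, we also treat $\tilde\beta$ and $\check\beta$ as
functions of u, while treating $\hat\beta$ as a constant. Let
$\phi(u) = \det\left(-\frac{1}{2\pi}H_n(\hat\beta)\right)^{1/2}e^{\frac12\sum h_{i,j}(\hat\beta)u_iu_j}$
be the centered normal density with covariance matrix $-H_n^{-1}$. Then

$$\begin{aligned}
\int_{B_\delta(\hat\beta)}b(\beta)e^{nh(\beta)}\,d\beta &= e^{nh(\hat\beta)}\int_{B_\delta(\hat\beta)}e^{\frac n2\sum h_{i,j}(\hat\beta)(\beta_i-\hat\beta_i)(\beta_j-\hat\beta_j)}\,b(\beta)e^{g(\beta)}\,d\beta \\
&= e^{nh(\hat\beta)}\det\left(-\frac{n}{2\pi}H_n(\hat\beta)\right)^{-\frac12}\int_{B_{\sqrt n\delta}(0)}\phi(u)\left(b(\hat\beta) + \frac{1}{\sqrt n}\sum b_i(\hat\beta)u_i + \frac12\frac1n\sum b_{i,j}(\tilde\beta(u))u_iu_j\right) \\
&\qquad\times\left(1 + \frac16\frac{1}{\sqrt n}\sum h_{i,j,k}(\hat\beta)u_iu_ju_k + \frac{1}{24}\frac1n e^{g(\check\beta(u))}\sum h_{i,j,k,l}(\check\beta(u))u_iu_ju_ku_l\right)du \\
&= e^{nh(\hat\beta)}\det\left(-\frac{n}{2\pi}H_n(\hat\beta)\right)^{-\frac12}\int_{B_{\sqrt n\delta}(0)}\phi(u)(I_1+I_2)\,du,
\end{aligned}$$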

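where

$$I_1 = b(\hat\beta) + \frac{1}{n^{1/2}}\left(\sum b_i(\hat\beta)u_i + \frac{b(\hat\beta)}{6}\sum h_{i,j,k}(\hat\beta)u_iu_ju_k\right),$$

$$\begin{aligned}
I_2 &= \frac1n\left(\sum b_i(\hat\beta)u_i\right)\left(\frac16\sum h_{i,j,k}(\hat\beta)u_iu_ju_k\right) \\
&\quad + \frac12\frac1n\sum b_{i,j}(\tilde\beta(u))u_iu_j\left(1 + \frac16\frac{1}{\sqrt n}\sum h_{i,j,k}(\hat\beta)u_iu_ju_k + \frac{1}{24}\frac1n e^{g(\check\beta(u))}\sum h_{i,j,k,l}(\check\beta(u))u_iu_ju_ku_l\right) \\
&\quad + \frac{1}{24}\frac1n e^{g(\check\beta(u))}\sum h_{i,j,k,l}(\check\beta(u))u_iu_ju_ku_l\left(b(\hat\beta) + \frac{1}{\sqrt n}\sum b_i(\hat\beta)u_i\right),
\end{aligned}$$

and we will study the two terms $\int_{B_{\sqrt n\delta}(0)}\phi(u)I_1\,du$ and $\int_{B_{\sqrt n\delta}(0)}\phi(u)I_2\,du$ separately.

To quantify the term $\int_{B_{\sqrt n\delta}(0)}\phi(u)I_1\,du$, we first bound $\int_{\mathbb{R}^{K_n}\setminus B_{\sqrt n\delta}(0)}\phi(u)I_1\,du$. By
assumption C.2 and the Markov inequality,

$$\int_{\mathbb{R}^{K_n}\setminus B_{\sqrt n\delta}(0)}\phi(u)\,du = P\left(\sum_{i=1}^{K_n}U_i^2 > n\delta^2\right) \le \frac{\sum_{i=1}^{K_n}E(U_i^2)}{n\delta^2} \le \frac{r_nM + \frac{C(K_n-r_n)}{K_n^2}}{n\delta^2} = O\left(\frac{r_n}{n}\right), \tag{2.29}$$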

where $(U_1,\dots,U_{K_n})^T$ denotes a multivariate normal random vector following density φ(u).
Now we consider the term $\int_{\mathbb{R}^{K_n}\setminus B_{\sqrt n\delta}(0)}\frac{1}{\sqrt n}\frac{b(\hat\beta)}{6}h_{i,j,k}(\hat\beta)u_iu_ju_k\phi(u)\,du$. By the Cauchy-Schwarz
inequality and assumption C.1, we have

$$\begin{aligned}
\left|\int_{\mathbb{R}^{K_n}\setminus B_{\sqrt n\delta}(0)}\frac{1}{\sqrt n}\frac{b(\hat\beta)}{6}h_{i,j,k}(\hat\beta)u_iu_ju_k\phi(u)\,du\right|
&= \left|E\left(\frac{1}{\sqrt n}\frac{b(\hat\beta)}{6}h_{i,j,k}(\hat\beta)U_iU_jU_k\,1\Big(\sum_{t=1}^{K_n}U_t^2 > n\delta^2\Big)\right)\right| \\
&\le \sqrt{\frac{M_1}{n}E(U_i^2U_j^2U_k^2)\,P\Big(\sum_{t=1}^{K_n}U_t^2 > n\delta^2\Big)} \\
&= O\left(\frac{\sqrt{r_n}}{n}\right)\sqrt{E(U_i^2U_j^2U_k^2)},
\end{aligned}$$

where $M_1$ is some constant. To bound $E(U_i^2U_j^2U_k^2)$, we refer to Theorem 1 of [83], which
proved that for $1 \le i_1,\dots,i_6 \le K_n$,

$$E(|U_{i_1}\cdots U_{i_6}|) \le \sqrt{\sum_\pi\prod_{j=1}^6 h^{i_j,i_{\pi(j)}}},$$

where the sum is taken over all permutations π = (π(1), . . . , π(6)) of the set {1, . . . , 6} and $h^{i,j}$ is
the (i, j)-th element of the covariance matrix $H^{-1}$. Let $m := m(i_1,\dots,i_d) = |\{j : i_j \in \gamma^*, j \in \{1,2,\dots,d\}\}|$
count the number of indexes belonging to the true connection set. Then, by
condition C.2, we have

$$E(|U_{i_1}\cdots U_{i_6}|) \le \sqrt{C_0M^m\left(\frac{1}{K_n^2}\right)^{6-m}} = O\left(\frac{1}{K_n^{6-m}}\right).$$

The above inequality implies that $E(U_i^2U_j^2U_k^2) = O\left(\frac{1}{K_n^{6-2m_0}}\right)$, where $m_0 = |\{i,j,k\}\cap\gamma^*|$.
Thus, we have

$$\left|\int_{\mathbb{R}^{K_n}\setminus B_{\sqrt n\delta}(0)}\frac{1}{n^{1/2}}\sum\frac{b(\hat\beta)}{6}h_{i,j,k}(\hat\beta)u_iu_ju_k\phi(u)\,du\right| \le \sum_{m_0=0}^{3}\binom{3}{m_0}r_n^{m_0}(K_n-r_n)^{3-m_0}\,O\left(\frac{\sqrt{r_n}}{nK_n^{3-m_0}}\right) = O\left(\frac{r_n^{3.5}}{n}\right). \tag{2.30}$$

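By similar arguments, we can get the upper bound of the term

$$\left|\int_{\mathbb{R}^{K_n}\setminus B_{\sqrt n\delta}(0)}\sum_i\big(b_i(\hat\beta)u_i\big)/\sqrt n\,\phi(u)\,du\right|.$$

Thus, we obtain that $\left|\int_{\mathbb{R}^{K_n}\setminus B_{\sqrt n\delta}(0)}\phi(u)I_1\,du\right| \le O\left(\frac{r_n^{3.5}}{n}\right)$. Due to the fact that
$\int_{\mathbb{R}^{K_n}}\phi(u)I_1\,du = b(\hat\beta)$, we have $\int_{B_{\sqrt n\delta}(0)}\phi(u)I_1\,du = b(\hat\beta) + O\left(\frac{r_n^{3.5}}{n}\right)$.

Due to assumption C.1 and the fact that $b_{i,j} \le M$, within $B_{\sqrt n\delta}(0)$, each term in $I_2$ is
trivially bounded by a polynomial of |u|, such as

$$\left|\frac12\frac1n\sum b_{i,j}(\tilde\beta(u))u_iu_j\,\frac{1}{24}\frac1n e^{g(\check\beta(u))}\sum h_{i,j,k,l}(\check\beta(u))u_iu_ju_ku_l\right| \le \frac{1}{48}\frac{1}{n^2}M^2e^M\sum|u_iu_j|\sum|u_iu_ju_ku_l|.$$

Therefore, there exists a constant $M_0$ such that within $B_{\sqrt n\delta}(0)$,

$$|I_2| \le M_0\left(\frac1n\sum|u_iu_j| + \frac1n\sum|u_iu_ju_ku_l| + \frac{1}{n^{3/2}}\sum|u_iu_ju_ku_lu_s| + \frac{1}{n^2}\sum|u_iu_ju_ku_lu_su_t|\right) := I_3.$$

Then we have

$$\left|\int_{B_{\sqrt n\delta}(0)}\phi(u)I_2\,du\right| \le \int_{B_{\sqrt n\delta}(0)}\phi(u)I_3\,du \le \int_{\mathbb{R}^{K_n}}\phi(u)I_3\,du.$$

By the same arguments as used to bound $E(U_i^2U_j^2U_k^2)$, we can show that

$$\int_{\mathbb{R}^{K_n}}\phi(u)\frac{1}{n^2}\sum|u_iu_ju_ku_lu_su_t|\,du = O\left(\frac{r_n^6}{n^2}\right)$$

holds. The remaining terms in $\int_{\mathbb{R}^{K_n}}\phi(u)I_3\,du$ can be bounded in the same manner, and in the
end we have $\left|\int_{B_{\sqrt n\delta}(0)}\phi(u)I_2\,du\right| \le \int_{\mathbb{R}^{K_n}}\phi(u)I_3\,du = O\left(\frac{r_n^4}{n}\right)$.
Then $\int_{B_\delta(\hat\beta)}b(\beta)e^{nh(\beta)}\,d\beta = e^{nh(\hat\beta)}\det\left(-\frac{n}{2\pi}H_n(\hat\beta)\right)^{-\frac12}\left(b(\hat\beta) + O\left(\frac{r_n^4}{n}\right)\right)$ holds. Combining it
with condition C.3 and the boundedness of b, we get

$$\int b(\beta)e^{nh_n(\beta)}\,d\beta = e^{nh(\hat\beta)}\det\left(-\frac{n}{2\pi}H_n(\hat\beta)\right)^{-\frac12}\left(b(\hat\beta) + O\left(\frac{r_n^4}{n}\right)\right).$$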

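With similar calculations, we can get $\int e^{nh_n(\beta)}\,d\beta = e^{nh(\hat\beta)}\det\left(-\frac{n}{2\pi}H_n(\hat\beta)\right)^{-\frac12}\left(1 + O\left(\frac{r_n^4}{n}\right)\right)$.
Therefore,

$$\frac{\int b(\beta)e^{nh_n(\beta)}\,d\beta}{\int e^{nh_n(\beta)}\,d\beta} = \frac{e^{nh(\hat\beta)}\det\left(-\frac{n}{2\pi}H_n(\hat\beta)\right)^{-\frac12}\left(b(\hat\beta) + O\left(\frac{r_n^4}{n}\right)\right)}{e^{nh(\hat\beta)}\det\left(-\frac{n}{2\pi}H_n(\hat\beta)\right)^{-\frac12}\left(1 + O\left(\frac{r_n^4}{n}\right)\right)} = b(\hat\beta) + O\left(\frac{r_n^4}{n}\right).$$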

Proof of Lemma 2.3.1

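Proof. Consider the following prior setting: (i) $\lambda_n = K_n^{-(1+\tau')H_n}$, (ii) $\log\frac{1}{\sigma_{0,n}} = H_n\log(K_n)$,
(iii) $\sigma_{1,n} = 1$, and (iv) $\epsilon_n \ge \sqrt{\frac{(\frac{16}{\delta}+16)r_nH_n\log K_n}{n}}$. Note that this setting satisfies all conditions
of the previous theorems. Recall that the marginal posterior inclusion probability is given by

$$q_i = \int\sum_\gamma e_{i|\nu(\gamma,\beta)}\pi(\gamma|\beta,D_n)\pi(\beta|D_n)\,d\beta := \int\pi(\nu(\gamma_i) = 1|\beta_i)\pi(\beta|D_n)\,d\beta.$$

For any false connection $c_i \notin \gamma^*$, we have $e_{i|\nu(\gamma^*,\beta^*)} = 0$ and

$$|q_i| = \int\sum_\gamma|e_{i|\nu(\gamma,\beta)} - e_{i|\nu(\gamma^*,\beta^*)}|\pi(\gamma|\beta,D_n)\pi(\beta|D_n)\,d\beta \le \hat\pi(4\epsilon_n) + \rho(4\epsilon_n).$$

A straightforward calculation shows

$$\begin{aligned}
\pi(\nu(\gamma_i) = 1|\beta_i) &= \frac{1}{1 + \frac{(1-\lambda_n)\sigma_{1,n}}{\lambda_n\sigma_{0,n}}\exp\left\{-\frac12\left(\frac{1}{\sigma_{0,n}^2} - \frac{1}{\sigma_{1,n}^2}\right)\beta_i^2\right\}} \\
&= \frac{1}{1 + \exp\left\{-\frac12(K_n^{2H_n}-1)\left(\beta_i^2 - \frac{(4+2\tau')H_n\log(K_n) + 2\log(1-\lambda_n)}{K_n^{2H_n}-1}\right)\right\}}.
\end{aligned}$$

Let $M_n = \frac{(4+2\tau')H_n\log(K_n) + 2\log(1-\lambda_n)}{K_n^{2H_n}-1}$. Then, by the Markov inequality,

$$P(\beta_i^2 > M_n|D_n) = P(\pi(\nu(\gamma_i) = 1|\beta_i) > 1/2|D_n) \le 2|q_i| \le 2(\hat\pi(4\epsilon_n) + \rho(4\epsilon_n)).$$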
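The role of the threshold $M_n$ can be illustrated numerically: under the two-component Gaussian mixture prior, the conditional inclusion probability equals 1/2 exactly at $\beta_i^2 = \log(C_2)/C_1$, with $C_1 = \frac12(\sigma_{0,n}^{-2} - \sigma_{1,n}^{-2})$ and $C_2 = (1-\lambda_n)\sigma_{1,n}/(\lambda_n\sigma_{0,n})$, and it increases with $|\beta_i|$. The sketch below is a minimal illustration; the function names and hyperparameter values are assumptions chosen for readability, not the settings of the proof.

```python
import math


def inclusion_probability(beta, lam, sigma0, sigma1):
    """pi(gamma_i = 1 | beta_i) under the mixture Gaussian prior (hypothetical helper)."""
    c1 = 0.5 * (1.0 / sigma0**2 - 1.0 / sigma1**2)
    c2 = (1.0 - lam) * sigma1 / (lam * sigma0)
    return 1.0 / (1.0 + c2 * math.exp(-c1 * beta**2))


def half_threshold_sq(lam, sigma0, sigma1):
    """Value of beta^2 at which the inclusion probability equals 1/2: log(C2)/C1."""
    c1 = 0.5 * (1.0 / sigma0**2 - 1.0 / sigma1**2)
    c2 = (1.0 - lam) * sigma1 / (lam * sigma0)
    return math.log(c2) / c1


if __name__ == "__main__":
    lam, sigma0, sigma1 = 0.01, 0.05, 1.0  # illustrative values only
    m = half_threshold_sq(lam, sigma0, sigma1)
    print("threshold on beta^2:", m)
    print("probability at threshold:", inclusion_probability(math.sqrt(m), lam, sigma0, sigma1))
```

This is why the event $\{\beta_i^2 > M_n\}$ coincides with $\{\pi(\nu(\gamma_i)=1|\beta_i) > 1/2\}$ in the display above.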

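Therefore,

$$\begin{aligned}
E(\beta_i^2|D_n) &\le M_n + \int_{\beta_i^2 > M_n}\beta_i^2\pi(\beta|D_n)\,d\beta \\
&\le M_n + \int_{M_n < \beta_i^2 < M_n^{-2/\delta}}\beta_i^2\pi(\beta|D_n)\,d\beta + \int_{\beta_i^2 > M_n^{-2/\delta}}M_n|\beta_i|^{2+\delta}\pi(\beta|D_n)\,d\beta \\
&\le M_n + M_n^{-2/\delta}P(\beta_i^2 > M_n) + CM_n.
\end{aligned}$$

Since $\frac{1}{K_n^{2H_n}} \prec M_n \prec \frac{1}{K_n^{2H_n-1}}$ and $\epsilon_n \ge \sqrt{\frac{(\frac{16}{\delta}+16)r_nH_n\log K_n}{n}}$, we have $M_n^{-2/\delta}e^{-n\epsilon_n^2/4} \prec \frac{1}{K_n^{2H_n-1}}$. Thus

$$P^*\left\{E(\beta_i^2|D_n) \prec \frac{1}{K_n^{2H_n-1}}\right\} \ge P^*\left(\hat\pi(4\epsilon_n) < 2e^{-n\epsilon_n^2/4}\right) \ge 1 - 2e^{-n\epsilon_n^2/4}.$$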

Verification of the Bounded Gradient Condition in Theorem 2.3.2

This section shows that with an appropriate choice of prior hyperparameters, the first
and second order derivatives of π(ν(γi ) = 1|βi ) (i.e. the function b(β) in Theorem 2.3.2) and
the third and fourth order derivatives of log π(β) are all bounded with a high probability.
Therefore, the assumption C.1 in Theorem 2.3.2 is reasonable.
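Under the same setting of the prior as that used in the proof of Lemma 2.3.1, we can
show that the derivative of $\pi(\nu(\gamma_i) = 1|\beta_i)$ is bounded with a high probability. For notational
simplicity, we suppress the subscript i in what follows and let

$$f(\beta) = \pi(\nu(\gamma) = 1|\beta) = \frac{1}{1 + \frac{(1-\lambda_n)\sigma_{1,n}}{\lambda_n\sigma_{0,n}}\exp\left\{-\frac12\left(\frac{1}{\sigma_{0,n}^2} - \frac{1}{\sigma_{1,n}^2}\right)\beta^2\right\}} := \frac{1}{1 + C_2\exp\{-C_1\beta^2\}},$$

where $C_1 = \frac12\left(\frac{1}{\sigma_{0,n}^2} - \frac{1}{\sigma_{1,n}^2}\right) = \frac12(K_n^{2H_n}-1)$ and $C_2 = \frac{(1-\lambda_n)\sigma_{1,n}}{\lambda_n\sigma_{0,n}} = (1-\lambda_n)K_n^{(2+\tau')H_n}$. Then we
have $C_1\beta^2 = \log(C_2) + \log(f(\beta)) - \log(1-f(\beta))$. With some algebra, we can show that

$$\left|\frac{df(\beta)}{d\beta}\right| = 2\sqrt{C_1}\,f(\beta)(1-f(\beta))\sqrt{\log(C_2) + \log(f(\beta)) - \log(1-f(\beta))},$$

$$\frac{d^2f(\beta)}{d\beta^2} = f(\beta)(1-f(\beta))\Big(2C_1 + 4C_1\big(\log(C_2) + \log(f(\beta)) - \log(1-f(\beta))\big)(1-2f(\beta))\Big).$$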
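The two closed-form derivatives of $f(\beta) = 1/(1 + C_2e^{-C_1\beta^2})$ can be sanity-checked against central finite differences. The sketch below is only a numerical illustration; the helper names and the constants $C_1 = 2$, $C_2 = 5$ are arbitrary assumptions and do not correspond to the prior setting of Lemma 2.3.1.

```python
import math


def f(beta, c1, c2):
    """f(beta) = 1 / (1 + C2 * exp(-C1 * beta^2))."""
    return 1.0 / (1.0 + c2 * math.exp(-c1 * beta**2))


def abs_first_derivative(beta, c1, c2):
    """Closed form: 2*sqrt(C1)*f*(1-f)*sqrt(log C2 + log f - log(1-f))."""
    fv = f(beta, c1, c2)
    inner = math.log(c2) + math.log(fv) - math.log(1.0 - fv)  # equals C1 * beta^2
    return 2.0 * math.sqrt(c1) * fv * (1.0 - fv) * math.sqrt(max(inner, 0.0))


def second_derivative(beta, c1, c2):
    """Closed form: f*(1-f)*(2*C1 + 4*C1*(log C2 + log f - log(1-f))*(1-2f))."""
    fv = f(beta, c1, c2)
    inner = math.log(c2) + math.log(fv) - math.log(1.0 - fv)
    return fv * (1.0 - fv) * (2.0 * c1 + 4.0 * c1 * inner * (1.0 - 2.0 * fv))


def numeric_derivatives(beta, c1, c2, h=1e-4):
    """Central finite-difference approximations of f' and f''."""
    fp = (f(beta + h, c1, c2) - f(beta - h, c1, c2)) / (2.0 * h)
    fpp = (f(beta + h, c1, c2) - 2.0 * f(beta, c1, c2) + f(beta - h, c1, c2)) / h**2
    return fp, fpp


if __name__ == "__main__":
    c1, c2, beta = 2.0, 5.0, 0.7  # illustrative constants only
    print(abs_first_derivative(beta, c1, c2), second_derivative(beta, c1, c2))
    print(numeric_derivatives(beta, c1, c2))
```

Both expressions carry the factor $f(\beta)(1-f(\beta))$, which is the quantity controlled in the next step of the argument.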

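By the Markov inequality and Theorem 2.3.1, for the false connections,

$$P\left\{f(\beta)(1-f(\beta)) > \frac{1}{C_1^2}\,\Big|\,D_n\right\} \le P\left(f(\beta) > \frac{1}{C_1^2}\,\Big|\,D_n\right) \le C_1^2E(f(\beta)|D_n) \le C_1^2(\hat\pi(4\epsilon_n) + \rho(4\epsilon_n))$$

holds, and for the true connections,

$$P\left\{f(\beta)(1-f(\beta)) > \frac{1}{C_1^2}\,\Big|\,D_n\right\} \le P\left(1-f(\beta) > \frac{1}{C_1^2}\,\Big|\,D_n\right) \le C_1^2E(1-f(\beta)|D_n) \le C_1^2(\hat\pi(4\epsilon_n) + \rho(4\epsilon_n))$$

holds. Under the setting of Lemma 2.3.1, by Theorem 2.2.1, it is easy to see that
$C_1^2(\hat\pi(4\epsilon_n) + \rho(4\epsilon_n)) \to 0$ as n → ∞, and thus $f(\beta)(1-f(\beta)) < \frac{1}{C_1^2}$ with high probability.
Note that $\frac{\log(C_2)}{C_1} \to 0$ and $|f(\beta)\log(f(\beta))| < \frac1e$. Thus, when $f(\beta)(1-f(\beta)) < \frac{1}{C_1^2}$ holds,

$$\left|\frac{df(\beta)}{d\beta}\right| \le \sqrt{\frac{\log(C_2)}{C_1} + f(\beta)(1-f(\beta))\log(f(\beta)) - f(\beta)(1-f(\beta))\log(1-f(\beta))}$$

is bounded. Similarly we can show that $\frac{d^2f(\beta)}{d\beta^2}$ is also bounded. In conclusion, $\frac{df(\beta)}{d\beta}$
and $\frac{d^2f(\beta)}{d\beta^2}$ are bounded with probability $P\left\{f(\beta)(1-f(\beta)) \le \frac{1}{C_1^2}\,\big|\,D_n\right\}$, which tends to 1 as n → ∞.
Recall that $\pi(\beta) = \frac{1-\lambda_n}{\sqrt{2\pi}\sigma_{0,n}}\exp\left\{-\frac{\beta^2}{2\sigma_{0,n}^2}\right\} + \frac{\lambda_n}{\sqrt{2\pi}\sigma_{1,n}}\exp\left\{-\frac{\beta^2}{2\sigma_{1,n}^2}\right\}$. With some algebra, we can
show

$$\begin{aligned}
\frac{d^3\log(\pi(\beta))}{d\beta^3} &= 2\left(\frac{1}{\sigma_{0,n}^2} - \frac{1}{\sigma_{1,n}^2}\right)\frac{df(\beta)}{d\beta} + \left(\frac{\beta}{\sigma_{0,n}^2} - \frac{\beta}{\sigma_{1,n}^2}\right)\frac{d^2f(\beta)}{d\beta^2} \\
&= 4C_1\frac{df(\beta)}{d\beta} + 2\sqrt{C_1}\frac{d^2f(\beta)}{d\beta^2}\sqrt{\log(C_2) + \log(f(\beta)) - \log(1-f(\beta))}.
\end{aligned}$$

With arguments similar to those used for $\frac{df(\beta)}{d\beta}$ and $\frac{d^2f(\beta)}{d\beta^2}$, we can make the term $f(\beta)(1-f(\beta))$
very small with a probability tending to 1. Therefore, $\frac{d^3\log(\pi(\beta))}{d\beta^3}$ is bounded with a
probability tending to 1. Similarly we can bound $\frac{d^4\log(\pi(\beta))}{d\beta^4}$ with a high probability.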

Approximation of Bayesian Evidence

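In Algorithm 1, each sparse model is evaluated by its Bayesian evidence:

$$\text{Evidence} = \det\left(-\frac{n}{2\pi}H_n(\beta_\gamma)\right)^{-\frac12}e^{nh_n(\beta_\gamma)},$$

where $H_n(\beta_\gamma) = \frac{\partial^2h_n(\beta_\gamma)}{\partial\beta_\gamma\partial^T\beta_\gamma}$ is the Hessian matrix, $\beta_\gamma$ denotes the vector of connection weights
selected by the model γ, i.e., $H_n(\beta_\gamma)$ is a |γ| × |γ| matrix, and the prior $\pi(\beta_\gamma)$ in $h_n(\beta)$ is
only for the connection weights selected by the model γ. Therefore,

$$\log(\text{Evidence}) = nh_n(\beta_\gamma) - \frac12|\gamma|\log(n) + \frac12|\gamma|\log(2\pi) - \frac12\log(\det(-H_n(\beta_\gamma))), \tag{2.31}$$

and $-H_n(\beta_\gamma) = -\frac1n\sum_{i=1}^n\frac{\partial^2\log(p(y_i,x_i|\beta_\gamma))}{\partial\beta_\gamma\partial^T\beta_\gamma} - \frac1n\frac{\partial^2\log(\pi(\beta_\gamma))}{\partial\beta_\gamma\partial^T\beta_\gamma}$. For the selected connection
weights, the prior $\pi(\beta_\gamma)$ behaves like $N(0,\sigma_{1,n}^2)$, and then $-\frac{\partial^2\log(\pi(\beta_\gamma))}{\partial\beta_\gamma\partial^T\beta_\gamma}$ is a diagonal matrix
with the diagonal elements approximately equal to $\frac{1}{\sigma_{1,n}^2}$, and $\frac{1}{n\sigma_{1,n}^2} \to 0$.
If the $(y_i,x_i)$'s are viewed as i.i.d. samples drawn from $p(y,x|\beta_\gamma)$, then $-\frac1n\sum_{i=1}^n\frac{\partial^2\log(p(y_i,x_i|\beta_\gamma))}{\partial\beta_\gamma\partial^T\beta_\gamma}$
will converge to the Fisher information matrix $I(\beta_\gamma) = E\left(-\frac{\partial^2\log(p(y,x|\beta_\gamma))}{\partial\beta_\gamma\partial^T\beta_\gamma}\right)$. If we further
assume that $I(\beta_\gamma)$ has bounded eigenvalues, i.e., $C_{\min} \le \lambda_{\min}(I(\beta_\gamma)) \le \lambda_{\max}(I(\beta_\gamma)) \le C_{\max}$
for some constants $C_{\min}$ and $C_{\max}$, then $\log(\det(-H_n(\beta_\gamma))) \asymp |\gamma|$. This further implies

$$\frac12|\gamma|\log(2\pi) - \frac12\log(\det(-H_n(\beta_\gamma))) \prec |\gamma|\log(n).$$

By keeping only the dominating terms in (2.31), we have

$$\log(\text{Evidence}) \approx nh_n(\beta_\gamma) - \frac12|\gamma|\log(n) = -\frac12\text{BIC} + \log(\pi(\beta_\gamma)).$$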

Since we are comparing a few low-dimensional models (with the model size |γ| ≺ n), it is
intuitive to further ignore the prior term log(π(β γ )). As a result, we can elicit the low-
dimensional sparse neural networks by BIC.
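To see how small the dropped terms are, the following sketch compares the Laplace-type log evidence with its BIC-style truncation when the eigenvalues of $-H_n(\beta_\gamma)$ are bounded. It is a toy illustration under stated assumptions: the function names, the eigenvalues, and the value of $h_n$ are hypothetical inputs supplied by hand, not quantities computed from a fitted network.

```python
import math


def log_evidence(n, h_n, neg_hessian_eigs):
    """n*h_n - (k/2) log n + (k/2) log(2*pi) - (1/2) log det(-H_n),
    with det(-H_n) given through its (positive) eigenvalues."""
    k = len(neg_hessian_eigs)
    log_det = sum(math.log(e) for e in neg_hessian_eigs)
    return n * h_n - 0.5 * k * math.log(n) + 0.5 * k * math.log(2.0 * math.pi) - 0.5 * log_det


def log_evidence_truncated(n, h_n, k):
    """Dominating terms only: n*h_n - (k/2) log n."""
    return n * h_n - 0.5 * k * math.log(n)


if __name__ == "__main__":
    n, h_n = 10**6, -0.35           # illustrative sample size and objective value
    eigs = [0.5, 1.0, 2.0]          # assumed eigenvalues of -H_n within [C_min, C_max]
    full = log_evidence(n, h_n, eigs)
    approx = log_evidence_truncated(n, h_n, len(eigs))
    print("dropped terms:", full - approx)
    print("relative to |gamma| log n:", (full - approx) / (len(eigs) * math.log(n)))
```

The dropped terms grow only linearly in |γ|, while the retained penalty grows like |γ| log n, which is the sense of the relation ≺ used above.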

2.9.3 Proofs of Asymptotic Normality

Proof of Theorem 2.4.1

Proof. We first define the equivalent class of neural network parameters. Given a parameter
vector β and the corresponding structure parameter vector γ, its equivalent class is given
by
QE (β, γ) = {(β̃, γ̃) : νg (β̃, γ̃) = (β, γ), µ(β̃, γ̃, x) = µ(β, γ, x), ∀x},

where νg (·) denotes a generic mapping that contains only the transformations of node per-
mutation and weight sign flipping. Specifically, we define

Q∗E = QE (β ∗ , γ ∗ ),

which represents the equivalent class of true DNN model.


Let $B_{\delta_n}(\beta^*) = \left\{\beta : |\beta_i - \beta_i^*| < \delta_n, \forall i \in \gamma^*,\ |\beta_i - \beta_i^*| < 2\sigma_{0,n}\log\left(\frac{\sigma_{1,n}}{\lambda_n\sigma_{0,n}}\right), \forall i \notin \gamma^*\right\}$. By
assumption D.1, β∗ is generic (i.e., $Q_E(\beta^*)$ contains only reparameterizations of weight sign-flipping
or node permutations as defined in [51] and [52]) and $\min_{i\in\gamma^*}|\beta_i^*| - \delta_n > (C-1)\delta_n > \delta_n$;
then for any $\beta^{*(1)}, \beta^{*(2)} \in Q_E^*$, $B_{\delta_n}(\beta^{*(1)}) \cap B_{\delta_n}(\beta^{*(2)}) = \emptyset$, and thus
$\{\beta : \tilde\nu(\beta) \in B_{\delta_n}(\beta^*)\} = \cup_{\beta\in Q_E^*}B_{\delta_n}(\beta)$. In what follows, we will first show $\pi(\cup_{\beta\in Q_E^*}B_{\delta_n}(\beta) \mid D_n) \to 1$ as
n → ∞, which means that most of the posterior mass falls in the neighbourhood of the true parameter.
Remark on the notation: ν̃(·) is similar to ν(·) defined in Section 2.3. They both map the
set QE (β, γ) to a unique network. The difference between them is that ∥ν(β) − β ∗ ∥∞ may
be arbitrary, but ∥ν̃(β) − β ∗ ∥∞ is minimized. In other words, ν(β, γ) is to find an arbitrary
network in QE (β, γ) as the representative of the equivalent class, while ν̃(β, γ) is to find
a representative in QE (β, γ) such that the distance between β ∗ and the representative is
minimized. In what follows, we will use ν̃(β) and ν̃(γ) to denote the connection weight and
network structure of ν̃(β, γ), respectively. With a slight abuse of notation, we will use ν̃(β)i
to denote the ith component of ν̃(β), and use ν̃(γ)i to denote the ith component of ν̃(γ).

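Recall that the marginal posterior inclusion probability is given by

$$q_i = \int\sum_\gamma e_{i|\tilde\nu(\beta,\gamma)}\pi(\gamma|\beta,D_n)\pi(\beta|D_n)\,d\beta = \int\pi(\tilde\nu(\gamma)_i = 1|\beta)\pi(\beta|D_n)\,d\beta.$$

For the mixture Gaussian prior,

$$\pi(\gamma_i = 1|\beta) = \frac{1}{1 + \frac{\sigma_{1,n}(1-\lambda_n)}{\sigma_{0,n}\lambda_n}e^{-\left(\frac{1}{2\sigma_{0,n}^2} - \frac{1}{2\sigma_{1,n}^2}\right)\beta_i^2}},$$

which increases with respect to $|\beta_i|$. In particular, if $|\beta_i| > 2\sigma_{0,n}\log\left(\frac{\sigma_{1,n}}{\lambda_n\sigma_{0,n}}\right)$, then
$\pi(\gamma_i = 1|\beta) > \frac12$.
For the mixture Gaussian prior,

$$\pi\big(\beta \notin \cup_{\beta\in Q_E^*}B_{\delta_n}(\beta) \mid D_n\big) \le \pi\Big(\exists i \notin \gamma^*, |\tilde\nu(\beta)_i| > 2\sigma_{0,n}\log\Big(\frac{\sigma_{1,n}}{\lambda_n\sigma_{0,n}}\Big) \,\Big|\, D_n\Big) + \pi\big(\exists i \in \gamma^*, |\tilde\nu(\beta)_i - \beta_i^*| > \delta_n \mid D_n\big).$$

For the first term, note that for a given $i \notin \gamma^*$,

$$\begin{aligned}
\pi\Big(|\tilde\nu(\beta)_i| > 2\sigma_{0,n}\log\Big(\frac{\sigma_{1,n}}{\lambda_n\sigma_{0,n}}\Big) \,\Big|\, D_n\Big) &\le \pi\Big(\pi(\tilde\nu(\gamma)_i = 1|\beta) > \frac12 \,\Big|\, D_n\Big) \\
&\le 2\int\pi(\tilde\nu(\gamma)_i = 1|\beta)\pi(\beta|D_n)\,d\beta \\
&\le 2\rho(\epsilon_n) + 2\pi(d(p_\beta,p_{\mu^*}) \ge \epsilon_n \mid D_n) \to 0.
\end{aligned}$$

Then we have

$$\begin{aligned}
\pi\Big(\exists i \notin \gamma^*, |\tilde\nu(\beta)_i| > 2\sigma_{0,n}\log\Big(\frac{\sigma_{1,n}}{\lambda_n\sigma_{0,n}}\Big) \,\Big|\, D_n\Big) &= \pi\Big(\max_{i\notin\gamma^*}|\tilde\nu(\beta)_i| > 2\sigma_{0,n}\log\Big(\frac{\sigma_{1,n}}{\lambda_n\sigma_{0,n}}\Big) \,\Big|\, D_n\Big) \\
&\le \pi\Big(\max_{i\notin\gamma^*}\pi(\tilde\nu(\gamma)_i = 1|\beta) > \frac12 \,\Big|\, D_n\Big) \\
&\le \sum_{i\notin\gamma^*}\pi\Big(\pi(\tilde\nu(\gamma)_i = 1|\beta) > \frac12 \,\Big|\, D_n\Big) \\
&\le 2K_n\rho(\epsilon_n) + 2K_n\pi(d(p_\beta,p_{\mu^*}) \ge \epsilon_n \mid D_n) \to 0.
\end{aligned}$$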

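For the second term, by condition D.1 and Theorem 2.3.1,

$$\begin{aligned}
\pi(\exists i \in \gamma^*, |\tilde\nu(\beta)_i - \beta_i^*| > \delta_n \mid D_n) &= \pi\Big(\max_{i\in\gamma^*}|\tilde\nu(\beta)_i - \beta_i^*| > \delta_n \,\Big|\, D_n\Big) \\
&= \pi\Big(\max_{i\in\gamma^*}|\tilde\nu(\beta)_i - \beta_i^*| > \delta_n,\ d(p_\beta,p_{\mu^*}) \le \epsilon_n \,\Big|\, D_n\Big) \\
&\quad + \pi\Big(\max_{i\in\gamma^*}|\tilde\nu(\beta)_i - \beta_i^*| > \delta_n,\ d(p_\beta,p_{\mu^*}) \ge \epsilon_n \,\Big|\, D_n\Big) \\
&\le \pi(A(\epsilon_n,\delta_n) \mid D_n) + \pi(d(p_\beta,p_{\mu^*}) \ge \epsilon_n \mid D_n) \to 0.
\end{aligned}$$

Summarizing the above two terms, we have $\pi(\cup_{\beta\in Q_E^*}B_{\delta_n}(\beta) \mid D_n) \to 1$.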


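Let $Q_n = |Q_E^*|$ be the number of equivalent true DNN models. By the generic assumption
on β∗, for any $\beta^{*(1)}, \beta^{*(2)} \in Q_E^*$, $B_{\delta_n}(\beta^{*(1)}) \cap B_{\delta_n}(\beta^{*(2)}) = \emptyset$. Then in $B_{\delta_n}(\beta^*)$, the posterior
density of ν̃(β) is $Q_n$ times the posterior density of β. Then for a function f(·) of ν̃(β), by
a change of variable,

$$\int_{\tilde\nu(\beta)\in B_{\delta_n}(\beta^*)}f(\tilde\nu(\beta))\pi(\tilde\nu(\beta)|D_n)\,d\tilde\nu(\beta) = Q_n\int_{B_{\delta_n}(\beta^*)}f(\beta)\pi(\beta|D_n)\,d\beta.$$

Thus, we only need to consider the integration on $B_{\delta_n}(\beta^*)$. Define

$$\hat\beta_i = \begin{cases}\beta_i^* - \sum_{j\in\gamma^*}h^{i,j}(\beta^*)h_j(\beta^*), & \forall i \in \gamma^*, \\ 0, & \forall i \notin \gamma^*.\end{cases}$$

We will first prove that for any real vector t,

$$\begin{aligned}
E\big(e^{\sqrt nt^T(\tilde\nu(\beta)-\hat\beta)} \mid D_n, B_{\delta_n}(\beta^*)\big) &:= \frac{\int_{B_{\delta_n}(\beta^*)}e^{\sqrt nt^T(\tilde\nu(\beta)-\hat\beta)}\pi(\tilde\nu(\beta)|D_n)\,d\tilde\nu(\beta)}{\int_{B_{\delta_n}(\beta^*)}\pi(\tilde\nu(\beta)|D_n)\,d\tilde\nu(\beta)} \\
&= \frac{\int_{B_{\delta_n}(\beta^*)}e^{\sqrt nt^T(\beta-\hat\beta)}e^{nl_n(\beta)}\pi(\beta)\,d\beta}{\int_{B_{\delta_n}(\beta^*)}e^{nl_n(\beta)}\pi(\beta)\,d\beta} \\
&= e^{\frac12t^TVt + o_{P^*}(1)}.
\end{aligned} \tag{2.32}$$

For any $\beta \in B_{\delta_n}(\beta^*)$, we have

$$|\sqrt n(t^T(\beta - \beta_{\gamma^*}))| \le \sqrt nK_n\|t\|_\infty 2\sigma_{0,n}\log\Big(\frac{\sigma_{1,n}}{\lambda_n\sigma_{0,n}}\Big) = o(1),$$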

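$$|n(l_n(\beta) - l_n(\beta_{\gamma^*}))| = \Big|n\sum_{i\notin\gamma^*}\beta_i(h_i(\tilde\beta))\Big| \le nK_nM2\sigma_{0,n}\log\Big(\frac{\sigma_{1,n}}{\lambda_n\sigma_{0,n}}\Big) = o(1).$$

Then, we have

$$\begin{aligned}
\sqrt nt^T(\beta - \hat\beta) &= \sqrt nt^T(\beta - \beta_{\gamma^*} + \beta_{\gamma^*} - \beta^*) + \sqrt n\sum_{i,j\in\gamma^*}h^{i,j}(\beta^*)t_jh_i(\beta^*) \\
&= o(1) + \sqrt n\sum_{i\in\gamma^*}(\beta_i - \beta_i^*)t_i + \sqrt n\sum_{i,j\in\gamma^*}h^{i,j}(\beta^*)t_jh_i(\beta^*),
\end{aligned} \tag{2.33}$$

$$\begin{aligned}
nl_n(\beta) - nl_n(\beta^*) &= n\big(l_n(\beta) - l_n(\beta_{\gamma^*}) + l_n(\beta_{\gamma^*}) - l_n(\beta^*)\big) \\
&= o(1) + n\sum_{i\in\gamma^*}(\beta_i - \beta_i^*)h_i(\beta^*) + \frac n2\sum_{i,j\in\gamma^*}h_{i,j}(\beta^*)(\beta_i - \beta_i^*)(\beta_j - \beta_j^*) \\
&\quad + \frac n6\sum_{i,j,k\in\gamma^*}h_{i,j,k}(\check\beta)(\beta_i - \beta_i^*)(\beta_j - \beta_j^*)(\beta_k - \beta_k^*),
\end{aligned} \tag{2.34}$$

where $\check\beta$ is a point between $\beta_{\gamma^*}$ and β∗. Note that for $\beta \in B_{\delta_n}(\beta^*)$, $|\beta_i - \beta_i^*| \le \delta_n \lesssim \frac{1}{\sqrt[3]{n}r_n}$,
so we have $\frac n6\sum_{i,j,k\in\gamma^*}h_{i,j,k}(\check\beta)(\beta_i - \beta_i^*)(\beta_j - \beta_j^*)(\beta_k - \beta_k^*) = o(1)$.
Let $\beta^{(t)}$ be network parameters satisfying $\beta_i^{(t)} = \beta_i + \frac{1}{\sqrt n}\sum_{j\in\gamma^*}h^{i,j}(\beta^*)t_j$, ∀i ∈ γ∗, and
$\beta_i^{(t)} = \beta_i$, ∀i ∉ γ∗. Note that $\frac{1}{\sqrt n}\sum_{j\in\gamma^*}h^{i,j}(\beta^*)t_j \le \frac{r_n\|t\|_\infty M}{\sqrt n} \lesssim \delta_n$; hence, for large enough n,
$|\beta_i^{(t)} - \beta_i^*| < 2\delta_n$ ∀i ∈ γ∗. Thus, we have

$$\begin{aligned}
nl_n(\beta^{(t)}) - nl_n(\beta^*) &= n\big(l_n(\beta^{(t)}) - l_n(\beta_{\gamma^*}^{(t)}) + l_n(\beta_{\gamma^*}^{(t)}) - l_n(\beta^*)\big) \\
&= o(1) + n\sum_{i\in\gamma^*}(\beta_i^{(t)} - \beta_i^*)h_i(\beta^*) + \frac n2\sum_{i,j\in\gamma^*}h_{i,j}(\beta^*)(\beta_i^{(t)} - \beta_i^*)(\beta_j^{(t)} - \beta_j^*) \\
&= o(1) + n\sum_{i\in\gamma^*}(\beta_i - \beta_i^*)h_i(\beta^*) + \frac n2\sum_{i,j\in\gamma^*}h_{i,j}(\beta^*)(\beta_i - \beta_i^*)(\beta_j - \beta_j^*) \\
&\quad + \sqrt n\sum_{i,j\in\gamma^*}h^{i,j}(\beta^*)t_jh_i(\beta^*) + \sqrt n\sum_{i\in\gamma^*}(\beta_i - \beta_i^*)t_i + \frac12\sum_{i,j\in\gamma^*}h^{i,j}(\beta^*)t_it_j \\
&= o(1) + \sqrt nt^T(\beta - \hat\beta) + nl_n(\beta) - nl_n(\beta^*) + \frac12\sum_{i,j\in\gamma^*}h^{i,j}(\beta^*)t_it_j,
\end{aligned} \tag{2.35}$$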

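where the last equality is derived by replacing appropriate terms by $\sqrt nt^T(\beta - \hat\beta)$ and
$nl_n(\beta) - nl_n(\beta^*)$ based on (2.33) and (2.34), respectively; and the third equality is derived
based on the following calculation:

$$\begin{aligned}
&\frac n2\sum_{i,j\in\gamma^*}h_{i,j}(\beta^*)(\beta_i^{(t)} - \beta_i^*)(\beta_j^{(t)} - \beta_j^*) \\
&= \frac n2\sum_{i,j\in\gamma^*}h_{i,j}(\beta^*)\Big(\beta_i - \beta_i^* + \frac{1}{\sqrt n}\sum_{k\in\gamma^*}h^{i,k}(\beta^*)t_k\Big)\Big(\beta_j - \beta_j^* + \frac{1}{\sqrt n}\sum_{k\in\gamma^*}h^{j,k}(\beta^*)t_k\Big) \\
&= \frac n2\sum_{i,j\in\gamma^*}h_{i,j}(\beta^*)(\beta_i - \beta_i^*)(\beta_j - \beta_j^*) + 2\times\frac n2\sum_{i,j\in\gamma^*}h_{i,j}(\beta^*)\frac{1}{\sqrt n}\sum_{k\in\gamma^*}h^{i,k}(\beta^*)t_k(\beta_j - \beta_j^*) \\
&\quad + \frac n2\sum_{i,j\in\gamma^*}h_{i,j}(\beta^*)\Big(\frac{1}{\sqrt n}\sum_{k\in\gamma^*}h^{i,k}(\beta^*)t_k\Big)\Big(\frac{1}{\sqrt n}\sum_{k\in\gamma^*}h^{j,k}(\beta^*)t_k\Big) \\
&= \frac n2\sum_{i,j\in\gamma^*}h_{i,j}(\beta^*)(\beta_i - \beta_i^*)(\beta_j - \beta_j^*) + \sqrt n\sum_{i\in\gamma^*}(\beta_i - \beta_i^*)t_i + \frac12\sum_{i,j\in\gamma^*}h^{i,j}(\beta^*)t_it_j,
\end{aligned} \tag{2.36}$$

where the second and third terms in the last equality are derived based on the relation
$\sum_{i\in\gamma^*}h_{i,j}(\beta^*)h^{i,k}(\beta^*) = \delta_{j,k}$, where $\delta_{j,k} = 1$ if j = k, and $\delta_{j,k} = 0$ if j ≠ k.

By rearranging the terms in (2.35), we have

$$\int_{B_{\delta_n}(\beta^*)}\exp\{\sqrt nt^T(\beta - \hat\beta) + nl_n(\beta)\}\pi(\beta)\,d\beta = \exp\Big\{-\frac12\sum_{i,j\in\gamma^*}h^{i,j}(\beta^*)t_it_j + o(1)\Big\}\int_{B_{\delta_n}(\beta^*)}e^{nl_n(\beta^{(t)})}\pi(\beta)\,d\beta.$$

For $\beta \in B_{\delta_n}(\beta^*)$, i ∈ γ∗, by Assumption D.1, there exists a constant C > 2 such that

$$|\beta_i^{(t)}| \ge |\beta_i| - \frac{r_n\|t\|_\infty M}{\sqrt n} \ge |\beta_i^*| - 2\delta_n \ge (C-2)\delta_n \gtrsim \frac{r_n}{\sqrt n} \gtrsim \sqrt{\Big(\frac{1}{2\sigma_{0,n}^2} - \frac{1}{2\sigma_{1,n}^2}\Big)^{-1}\log\Big(\frac{r_n(1-\lambda_n)\sigma_{1,n}}{\sigma_{0,n}\lambda_n}\Big)}.$$

Then we have

$$\frac{\sigma_{1,n}(1-\lambda_n)}{\sigma_{0,n}\lambda_n}e^{-\left(\frac{1}{2\sigma_{0,n}^2} - \frac{1}{2\sigma_{1,n}^2}\right)(\beta_i^{(t)})^2} \lesssim \frac{1}{r_n}.$$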

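It is easy to see that the above formula also holds if we replace $\beta_i^{(t)}$ by $\beta_i$. Note that the
mixture Gaussian prior of $\beta_i$ can be written as

$$\pi(\beta_i) = \frac{\lambda_n}{\sqrt{2\pi}\sigma_{1,n}}e^{-\frac{\beta_i^2}{2\sigma_{1,n}^2}}\left(1 + \frac{\sigma_{1,n}(1-\lambda_n)}{\sigma_{0,n}\lambda_n}e^{-\left(\frac{1}{2\sigma_{0,n}^2} - \frac{1}{2\sigma_{1,n}^2}\right)\beta_i^2}\right).$$

Since $|\beta_i - \beta_i^{(t)}| \lesssim \delta_n \lesssim \frac{1}{\sqrt[3]{n}r_n}$, $|\beta_i + \beta_i^{(t)}| < 2E_n + 3\delta_n \lesssim E_n$, and $\frac{1}{\sigma_{1,n}^2} \lesssim \frac{H_n\log(n) + \log(\bar L)}{E_n^2}$, we
have

$$\frac{r_n}{\sigma_{1,n}^2}(\beta_i - \beta_i^{(t)})(\beta_i + \beta_i^{(t)}) = \frac{H_n\log(n) + \log(\bar L)}{n^{C_1+1/3}} = o(1),$$

by the condition $C_1 > 2/3$ and $H_n\log(n) + \log(\bar L) \prec n^{1-\epsilon}$. Thus, $\frac{\pi(\beta)}{\pi(\beta^{(t)})} = \prod_{i\in\gamma^*}\frac{\pi(\beta_i)}{\pi(\beta_i^{(t)})} = 1 + o(1)$, and

$$\begin{aligned}
\int_{B_{\delta_n}(\beta^*)}e^{nl_n(\beta^{(t)})}\pi(\beta)\,d\beta &= (1+o(1))\int_{\beta^{(t)}\in B_{\delta_n}(\beta^*)}e^{nl_n(\beta^{(t)})}\pi(\beta^{(t)})\,d\beta^{(t)} \\
&= (1+o(1))C_N\pi(\beta^{(t)} \in B_{\delta_n}(\beta^*) \mid D_n),
\end{aligned} \tag{2.37}$$

where $C_N$ is the normalizing constant of the posterior. Since $\|\beta^{(t)} - \beta\|_\infty \lesssim \delta_n$, we have
$\pi(\beta^{(t)} \in B_{\delta_n}(\beta^*) \mid D_n) \to \pi(\beta \in B_{\delta_n}(\beta^*) \mid D_n)$. Moreover, since $-\frac12\sum_{i,j\in\gamma^*}h^{i,j}(\beta^*)t_it_j \to \frac12t^TVt$, we have

$$E\big(e^{\sqrt nt^T(\tilde\nu(\beta)-\hat\beta)} \mid D_n, B_{\delta_n}(\beta^*)\big) = \frac{\int_{B_{\delta_n}(\beta^*)}e^{\sqrt nt^T(\beta-\hat\beta)}e^{nl_n(\beta)}\pi(\beta)\,d\beta}{\int_{B_{\delta_n}(\beta^*)}e^{nl_n(\beta)}\pi(\beta)\,d\beta} = e^{\frac{t^TVt}{2} + o_{P^*}(1)}.$$

Combining the above result with the fact that $\pi(\tilde\nu(\beta) \in B_{\delta_n}(\beta^*) \mid D_n) \to 1$, by section 1 of
[84], we have

$$\pi[\sqrt n(\tilde\nu(\beta) - \hat\beta) \mid D_n] \rightsquigarrow N(0, V).$$

We will then show that $\hat\beta$ converges to β∗, so that we can replace $\hat\beta$ by β∗ in the
above result. Let $\Theta_{\gamma^*} = \{\beta : \beta_i = 0, \forall i \notin \gamma^*\}$ be the parameter space given the model γ∗,
and let $\hat\beta_{\gamma^*}$ be the maximum likelihood estimator given the model γ∗, i.e.,

$$\hat\beta_{\gamma^*} = \arg\max_{\beta\in\Theta_{\gamma^*}}l_n(\beta).$$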

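Given condition D.3 and by Theorem 2.1 of [53], we have $\|\hat\beta_{\gamma^*} - \beta^*\| = O\big(\sqrt{\frac{r_n}{n}}\big) = o(1)$.
Note that $h_i(\hat\beta_{\gamma^*}) = 0$ as $\hat\beta_{\gamma^*}$ is the maximum likelihood estimator. Then for any i ∈ γ∗,
$|h_i(\beta^*)| = |h_i(\hat\beta_{\gamma^*}) - h_i(\beta^*)| = \big|\sum_{j\in\gamma^*}h_{ij}(\tilde\beta)((\hat\beta_{\gamma^*})_j - \beta_j^*)\big| \le M\|\hat\beta_{\gamma^*} - \beta^*\|_1 = O\big(\sqrt{\frac{r_n}{n}}\big)$.
Then for any i ∈ γ∗, we have $\sum_{j\in\gamma^*}h^{i,j}(\beta^*)h_j(\beta^*) = O\big(\sqrt{\frac{r_n^3}{n}}\big) = o(1)$. By the definition of
$\hat\beta$, we have $\hat\beta - \beta^* = o(1)$. Therefore, we have

$$\pi[\sqrt n(\tilde\nu(\beta) - \beta^*) \mid D_n] \rightsquigarrow N(0, V).$$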

Proof of Theorem 2.4.2

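Proof. The proof of Theorem 2.4.2 can be done using the same strategy as that used in
proving Theorem 2.4.1. Here we provide a simpler proof using the result of Theorem 2.4.1.
The notation used in this proof is the same as in the proof of Theorem 2.4.1. In
the proof of Theorem 2.4.1, we have shown that $\pi(\tilde\nu(\beta) \in B_{\delta_n}(\beta^*) \mid D_n) \to 1$. Note that
$\mu(\beta, x_0) = \mu(\tilde\nu(\beta), x_0)$. We only need to consider $\beta \in B_{\delta_n}(\beta^*)$. For $\beta \in B_{\delta_n}(\beta^*)$, we have

$$\sqrt n(\mu(\beta,x_0) - \mu(\beta^*,x_0)) = \sqrt n\big(\mu(\beta,x_0) - \mu(\beta_{\gamma^*},x_0) + \mu(\tilde\nu(\beta_{\gamma^*}),x_0) - \mu(\beta^*,x_0)\big).$$

Since $\beta \in B_{\delta_n}(\beta^*)$, for i ∉ γ∗, $|\beta_i| < 2\sigma_{0,n}\log\big(\frac{\sigma_{1,n}}{\lambda_n\sigma_{0,n}}\big)$; and for i ∈ γ∗, $|\tilde\nu(\beta)_i - \beta_i^*| < \delta \lesssim \frac{1}{\sqrt[3]{n}r_n}$.
Therefore,

$$|\sqrt n(\mu(\beta,x_0) - \mu(\beta_{\gamma^*},x_0))| = \Big|\sqrt n\sum_{i\notin\gamma^*}\beta_i(\mu_i(\tilde\beta,x_0))\Big| \le \sqrt nK_nM2\sigma_{0,n}\log\Big(\frac{\sigma_{1,n}}{\lambda_n\sigma_{0,n}}\Big) = o(1),$$

where $\mu_i(\beta,x_0)$ denotes the first derivative of $\mu(\beta,x_0)$ with respect to the ith component of
β, and $\tilde\beta$ denotes a point between β and $\beta_{\gamma^*}$. Further,

$$\begin{aligned}
\sqrt n\big(\mu(\tilde\nu(\beta_{\gamma^*}),x_0) - \mu(\beta^*,x_0)\big) &= \sqrt n\sum_{i\in\gamma^*}(\tilde\nu(\beta)_i - \beta_i^*)\mu_i(\beta^*,x_0) + \sqrt n\sum_{i\in\gamma^*}\sum_{j\in\gamma^*}(\tilde\nu(\beta)_i - \beta_i^*)\mu_{i,j}(\check\beta,x_0)(\tilde\nu(\beta)_j - \beta_j^*) \\
&= \sqrt n\sum_{i\in\gamma^*}(\tilde\nu(\beta)_i - \beta_i^*)\mu_i(\beta^*,x_0) + o(1),
\end{aligned}$$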

where $\mu_{i,j}(\beta,x_0)$ denotes the second derivative of $\mu(\beta,x_0)$ with respect to the ith and jth
components of β and $\check\beta$ is a point between ν̃(β) and β∗. Summarizing the above two
equations, we have

$$\sqrt n(\mu(\beta,x_0) - \mu(\beta^*,x_0)) = \sqrt n\sum_{i\in\gamma^*}(\tilde\nu(\beta)_i - \beta_i^*)\mu_i(\beta^*,x_0) + o(1).$$

By Theorem 2.4.1, $\pi[\sqrt n(\tilde\nu(\beta) - \beta^*) \mid D_n] \rightsquigarrow N(0,V)$, where $V = (v_{ij})$, and $v_{i,j} = E(h^{i,j}(\beta^*))$
if i, j ∈ γ∗ and 0 otherwise. Then we have $\pi[\sqrt n(\mu(\beta,x_0) - \mu(\beta^*,x_0)) \mid D_n] \rightsquigarrow N(0,\Sigma)$,
where $\Sigma = \nabla_{\gamma^*}\mu(\beta^*,x_0)^TH^{-1}\nabla_{\gamma^*}\mu(\beta^*,x_0)$ and $H = E(-\nabla_{\gamma^*}^2l_n(\beta^*))$.

2.9.4 Proofs on Generalization Bounds

Proof of Theorem 2.5.1

Proof. Consider the set $B_n$ defined in (2.27). By the argument used in the proof of Theorem
2.9.1, there exists a class $\tilde B_n = \{\beta^{(l)} : 1 \le l < L\}$ for some $L < \exp\{cn\epsilon_n^2\}$ with a constant
c such that for any $\beta \in B_n$, there exists some $\beta^{(l)}$ satisfying $|\mu(\beta,x) - \mu(\beta^{(l)},x)| \le c'\epsilon_n$.
Let $\tilde\pi$ be the truncated distribution of $\pi(\beta|D_n)$ on $B_n$, and let $\check\pi$ be a discrete distribution
on $\tilde B_n$ defined as $\check\pi(\beta^{(l)}) = \tilde\pi(B_l)$, where $B_l = \{\beta \in B_n : \|\mu(\beta,x) - \mu(\beta^{(l)},x)\|_\infty < \min_{j\ne l}\|\mu(\beta,x) - \mu(\beta^{(j)},x)\|_\infty\}$,
with the norm defined as $\|f\|_\infty = \max_{x\in\Omega}f(x)$. Note that for
any $\beta \in B_l$, $\|\mu(\beta,x) - \mu(\beta^{(l)},x)\| \le c'\epsilon_n$; thus,

$$l_0(\beta,x,y) \le l_{c'\epsilon_n/2}(\beta^{(l)},x,y) \le l_{c'\epsilon_n}(\beta,x,y). \tag{2.38}$$

The above inequality implies that

$$\begin{aligned}
\int E_{x,y}l_0(\beta,x,y)\,d\tilde\pi &\le \int E_{x,y}l_{c'\epsilon_n/2}(\beta^{(l)},x,y)\,d\check\pi, \\
\int\frac1n\sum_{i=1}^nl_{c'\epsilon_n/2}(\beta^{(l)},x^{(i)},y^{(i)})\,d\check\pi &\le \int\frac1n\sum_{i=1}^nl_{c'\epsilon_n}(\beta,x^{(i)},y^{(i)})\,d\tilde\pi.
\end{aligned} \tag{2.39}$$

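Let P be a uniform prior on $\tilde B_n$. By Theorem 2.3.1, with probability 1 − δ,

$$\begin{aligned}
\int E_{x,y}l_\nu(\beta^{(l)},x,y)\,d\check\pi &\le \int\frac1n\sum_{i=1}^nl_\nu(\beta^{(l)},x^{(i)},y^{(i)})\,d\check\pi + \sqrt{\frac{d_0(\check\pi,P) + \log\frac{2\sqrt n}{\delta}}{2n}} \\
&\le \int\frac1n\sum_{i=1}^nl_\nu(\beta^{(l)},x^{(i)},y^{(i)})\,d\check\pi + \sqrt{\frac{cn\epsilon_n^2 + \log\frac{2\sqrt n}{\delta}}{2n}},
\end{aligned} \tag{2.40}$$

for any ν ≥ 0 and δ > 0, where the second inequality is due to the fact that $d_0(\mathcal{L},P) \le \log L$
for any discrete distribution $\mathcal{L}$ over $\{\beta^{(l)}\}_{l=1}^L$.
Combining inequalities (2.39) and (2.40), we have that, with probability 1 − δ,

$$\int E_{x,y}l_0(\beta,x,y)\,d\tilde\pi \le \sqrt{\frac{cn\epsilon_n^2 + \log\frac{2\sqrt n}{\delta}}{2n}} + \int\frac1n\sum_{i=1}^nl_{c'\epsilon_n}(\beta,x^{(i)},y^{(i)})\,d\tilde\pi. \tag{2.41}$$

Due to the boundedness of $l_\nu$, we have

$$\begin{aligned}
\int E_{x,y}l_0(\beta,x,y)\,d\pi(\beta|D_n) &\le \int E_{x,y}l_0(\beta,x,y)\,d\tilde\pi + \pi(B_n^c|D_n), \\
\int\frac1n\sum_{i=1}^nl_{c'\epsilon_n}(\beta,x^{(i)},y^{(i)})\,d\tilde\pi &\le \frac{1}{1-\pi(B_n^c|D_n)}\int\frac1n\sum_{i=1}^nl_{c'\epsilon_n}(\beta,x^{(i)},y^{(i)})\,d\pi(\beta|D_n).
\end{aligned} \tag{2.42}$$

Note that the result of Theorem A.1 implies that, with probability at least $1 - \exp\{-c''n\epsilon_n^2\}$,
$\pi(B_n^c|D_n) \le 2\exp\{-c''n\epsilon_n^2\}$. Therefore, with probability greater than $1 - \delta - \exp\{-c''n\epsilon_n^2\}$,

$$\begin{aligned}
\int E_{x,y}l_0(\beta,x,y)\,d\pi(\beta|D_n) &\le \frac{1}{1-2\exp\{-c''n\epsilon_n^2\}}\int\frac1n\sum_{i=1}^nl_{c'\epsilon_n}(\beta,x^{(i)},y^{(i)})\,d\pi(\beta|D_n) \\
&\quad + \sqrt{\frac{cn\epsilon_n^2 + \log\frac{2\sqrt n}{\delta}}{2n}} + 2\exp\{-c''n\epsilon_n^2\}.
\end{aligned}$$

Thus, the result holds if we choose $\delta = \exp\{-c'''n\epsilon_n^2\}$ for some c′′′.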

Proof of Theorem 2.5.2

Proof. To prove the theorem, we first introduce a lemma on the generalization error of finite
classifiers, which can be easily derived based on Hoeffding’s inequality:

Lemma 2.9.3 (Generalization error for finite classifier). Given a set B which contains H
elements, if the estimator $\hat\beta$ belongs to B and the loss function l ∈ [0, 1], then with probability
1 − δ,

$$E_{x,y}l(\hat\beta,x,y) \le \frac1n\sum_{i=1}^nl(\hat\beta,x^{(i)},y^{(i)}) + \sqrt{\frac{\log H + \log(1/\delta)}{2n}}.$$

Next, let us consider the same sets $B_n$ and $\tilde B_n$ as defined in the proof of Theorem 2.5.1. Due
to the posterior contraction result, with probability at least $1 - \exp\{-c''n\epsilon_n^2\}$, the estimator
$\hat\beta \in B_n$. Therefore, there must exist some $\beta^{(l)} \in \tilde B_n$ such that (2.38) holds, which implies
that with probability at least $1 - \exp\{-c''n\epsilon_n^2\} - \delta$,

$$\begin{aligned}
L_0(\hat\beta) \le L_{c'\epsilon_n/2}(\beta^{(l)}) &\le L_{emp,c'\epsilon_n/2}(\beta^{(l)}) + \sqrt{\frac{\log H + \log(1/\delta)}{2n}} \\
&\le L_{emp,c'\epsilon_n}(\hat\beta) + \sqrt{\frac{\log H + \log(1/\delta)}{2n}},
\end{aligned}$$

where the second inequality is due to Lemma 2.9.3, and $H \le \exp\{cn\epsilon_n^2\}$. The result then
holds if we set $\delta = \exp\{-cn\epsilon_n^2\}$.

Proof of Theorems 2.5.3 and 2.5.4

The proofs are straightforward and thus omitted.

2.9.5 Mathematical facts of sparse DNN

Consider a sparse DNN model with $H_n - 1$ hidden layers. Let $L_1,\dots,L_{H_n-1}$ denote the
number of nodes in each hidden layer, and let $r_i$ be the number of active connections that connect
to the ith hidden layer (including the bias for the ith hidden layer and the weight connections
between the (i − 1)th and ith layers). Besides, we let $O_{i,j}(\beta,x)$ denote the output value of the jth
node in the ith hidden layer.

Lemma 2.9.4. Under assumption A.1, if a sparse DNN has at most $r_n$ connectivity (i.e.,
$\sum r_i = r_n$), and all the weight and bias parameters are bounded by $E_n$ (i.e., $\|\beta\|_\infty \le E_n$),
then the summation of the outputs of the ith hidden layer for $1 \le i \le H_n$ is bounded by

$$\sum_{j=1}^{L_i}O_{i,j}(\beta,x) \le E_n^i\prod_{k=1}^ir_k,$$

where the $H_n$-th hidden layer means the output layer.

Proof. For simplicity of presentation, we rewrite $O_{i,j}(\beta,x)$ as $O_{i,j}$ when causing no
confusion. The lemma follows from the facts that

$$\sum_{j=1}^{L_1}|O_{1,j}| \le r_1E_n, \quad\text{and}\quad \sum_{j=1}^{L_i}|O_{i,j}| \le \sum_{j=1}^{L_{i-1}}|O_{i-1,j}|E_nr_i.$$

Consider two neural networks, $\mu(\beta,x)$ and $\mu(\tilde\beta,x)$, where the former is a sparse
network satisfying $\|\beta\|_0 = r_n$ and $\|\beta\|_\infty = E_n$, and its model vector is γ. If $|\beta_i - \tilde\beta_i| < \delta_1$
for all i ∈ γ and $|\beta_i - \tilde\beta_i| < \delta_2$ for all i ∉ γ, then

Lemma 2.9.5.

$$\max_{\|x\|_\infty\le1}|\mu(\beta,x) - \mu(\tilde\beta,x)| \le \delta_1H_n(E_n+\delta_1)^{H_n-1}\prod_{i=1}^{H_n}r_i + \delta_2\Big(p_nL_1 + \sum_{i=1}^{H_n}L_i\Big)\prod_{i=1}^{H_n}[(E_n+\delta_1)r_i + \delta_2L_i].$$

Proof. Define $\check\beta$ such that $\check\beta_i = \tilde\beta_i$ for all i ∈ γ and $\check\beta_i = 0$ for all i ∉ γ. Let $\check O_{i,j}$ denote
$O_{i,j}(\check\beta,x)$. Then,

$$\begin{aligned}
|\check O_{i,j} - O_{i,j}| &\le \delta_1\sum_{j=1}^{L_{i-1}}|O_{i-1,j}| + E_n\sum_{j=1}^{L_{i-1}}|\check O_{i-1,j} - O_{i-1,j}| + \delta_1\sum_{j=1}^{L_{i-1}}|\check O_{i-1,j} - O_{i-1,j}| \\
&\le \delta_1\sum_{j=1}^{L_{i-1}}|O_{i-1,j}| + (E_n+\delta_1)\sum_{j=1}^{L_{i-1}}|\check O_{i-1,j} - O_{i-1,j}|.
\end{aligned}$$

This implies a recursive result

Li Li−1 Li−1
|Ǒi,j − Oi,j | ≤ ri (En + δ1 ) |Ǒi−1,j − Oi−1,j | + ri δ1
X X X
|Oi−1,j |.
j=1 j=1 j=1

PLi−1
Due to Lemma 2.9.4, |Oi−1,j | ≤ Eni−1 r1 · · · ri−1 . Combined with the fact that
PL1
j=1 j=1 |Ǒ1,j −
O1,j | ≤ δ1 r1 , one have that

Hn
|µ(β, x) − µ(β̌, x)| = |ǑHn ,j − OHn ,j | ≤ δ1 Hn (En + δ1 )Hn −1
X Y
ri .
j i=1

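Now we compare $\tilde{O}_{i,j} := \mu(\tilde{\beta}, x)$ and $\check{O}_{i,j}$. We have that

$$\sum_{j=1}^{L_i} |\tilde{O}_{i,j} - \check{O}_{i,j}| \le \delta_2 L_i \sum_{j=1}^{L_{i-1}} |\tilde{O}_{i-1,j} - \check{O}_{i-1,j}| + \delta_2 L_i \sum_{j=1}^{L_{i-1}} |\check{O}_{i-1,j}| + r_i (E_n + \delta_1) \sum_{j=1}^{L_{i-1}} |\tilde{O}_{i-1,j} - \check{O}_{i-1,j}|,$$

and

$$\sum_{j=1}^{L_1} |\tilde{O}_{1,j} - \check{O}_{1,j}| \le \delta_2 p_n L_1.$$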

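Due to Lemma 2.9.4, we also have that $\sum_{j=1}^{L_{i-1}} |\check{O}_{i-1,j}| \le (E_n + \delta_1)^{i-1} r_1 \cdots r_{i-1}$. Together, we
have that

$$|\mu(\tilde{\beta}, x) - \mu(\check{\beta}, x)| = \sum_{j} |\tilde{O}_{H_n,j} - \check{O}_{H_n,j}| \le \delta_2 \Big(p_n L_1 + \sum_{i=1}^{H_n} L_i\Big) \prod_{i=1}^{H_n} [(E_n + \delta_1) r_i + \delta_2 L_i].$$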

The proof is concluded by summing the bounds for $|\mu(\beta, x) - \mu(\check{\beta}, x)|$ and $|\mu(\tilde{\beta}, x) - \mu(\check{\beta}, x)|$.

2.9.6 Proof of Theorem 2.6.1

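Our proof follows the proof of Theorem 2 in [70]. SGLD uses the first-order integrator
(see Lemma 12 of [70] for the details). Then we have

$$\begin{aligned}
E(\psi(\beta^{(t+1)})) &= \psi(\beta^{(t)}) + \epsilon_t \mathcal{L}_t \psi(\beta^{(t)}) + O(\epsilon_t^2) \\
&= \psi(\beta^{(t)}) + \epsilon_t (\mathcal{L}_t - \mathcal{L}) \psi(\beta^{(t)}) + \epsilon_t \mathcal{L} \psi(\beta^{(t)}) + O(\epsilon_t^2).
\end{aligned}$$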

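Note that by the Poisson equation, $\mathcal{L}\psi(\beta) = \phi(\beta) - \int \phi(\beta)\pi(\beta|D_n, \eta^*, \sigma^*_{0,n})\,d\beta$. Taking the
expectation on both sides of the equation, summing over $t = 0, 1, \dots, T-1$, and dividing
both sides by $\epsilon T$, we have

$$E\left(\frac{1}{T}\sum_{t=1}^{T-1} \phi(\beta^{(t)}) - \int \phi(\beta)\pi(\beta|D_n, \eta^*, \sigma^*_{0,n})\,d\beta\right)
= \frac{1}{T\epsilon}\left(E(\psi(\beta^{(T)})) - \psi(\beta^{(0)})\right) - \frac{1}{T}\sum_{t=0}^{T-1} E(\delta_t \psi(\beta^{(t)})) + O(\epsilon).$$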

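To characterize the order of $\delta_t = \mathcal{L}_t - \mathcal{L}$, we first study the difference of the drift term

$$\begin{aligned}
&\nabla \log(\pi(\beta^{(t)}|D^{(t)}_{m,n}, \eta^{(t)}, \sigma^{(t)}_{0,n})) - \nabla \log(\pi(\beta^{(t)}|D_n, \eta^*, \sigma^*_{0,n})) \\
&\quad = \sum_{i=1}^{n} \nabla \log(p_{\beta^{(t)}}(x_i, y_i)) - \frac{n}{m}\sum_{j=1}^{m} \nabla \log(p_{\beta^{(t)}}(x_{i_j}, y_{i_j})) \\
&\qquad + \eta^{(t)} \nabla \log(\pi(\beta^{(t)}|\lambda_n, \sigma^{(t)}_{0,n}, \sigma_{1,n})) - \eta^* \nabla \log(\pi(\beta^{(t)}|\lambda_n, \sigma^*_{0,n}, \sigma_{1,n})).
\end{aligned}$$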

Use of the mini-batch data gives an unbiased estimator of the full gradient, i.e.,

$$E\left(\sum_{i=1}^{n} \nabla \log(p_{\beta^{(t)}}(x_i, y_i)) - \frac{n}{m}\sum_{j=1}^{m} \nabla \log(p_{\beta^{(t)}}(x_{i_j}, y_{i_j}))\right) = 0.$$
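The unbiasedness claim can be checked exactly on a toy problem. The sketch below is illustrative only: the per-sample gradient values are made up, and it assumes the mini-batch is drawn uniformly without replacement, which the text does not specify. It averages the scaled mini-batch gradient over every possible size-m subset and compares the result with the full-data gradient.

```python
from itertools import combinations


def full_gradient(grads):
    """Sum of per-sample gradients (the full-data term)."""
    return sum(grads)


def minibatch_gradient(grads, batch):
    """(n/m) times the sum of per-sample gradients over the mini-batch indices."""
    n, m = len(grads), len(batch)
    return n / m * sum(grads[j] for j in batch)


# Hypothetical per-sample gradients of log p_beta(x_i, y_i) for a scalar beta.
grads = [0.5, -1.25, 2.0, 0.75, -0.5]
m = 2

# Enumerate every size-m subset, i.e. take the exact expectation of the
# estimator under uniform sampling without replacement (an assumption here).
batches = list(combinations(range(len(grads)), m))
expected = sum(minibatch_gradient(grads, b) for b in batches) / len(batches)
difference = full_gradient(grads) - expected
```

Here `difference` is the quantity inside the expectation above, averaged over all mini-batches; it vanishes up to floating-point rounding.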

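For the prior part, let $p(\sigma)$ denote the density function of $N(0, \sigma)$. Then we have

$$\nabla \log(\pi(\beta^{(t)}|\lambda_n, \sigma^{(t)}_{0,n}, \sigma_{1,n}))
= -\frac{(1-\lambda_n)p(\sigma^{(t)}_{0,n})}{(1-\lambda_n)p(\sigma^{(t)}_{0,n}) + \lambda_n p(\sigma_{1,n})} \frac{\beta^{(t)}}{(\sigma^{(t)}_{0,n})^2}
- \frac{\lambda_n p(\sigma_{1,n})}{(1-\lambda_n)p(\sigma^{(t)}_{0,n}) + \lambda_n p(\sigma_{1,n})} \frac{\beta^{(t)}}{\sigma_{1,n}^2},$$

and thus $E|\nabla \log(\pi(\beta^{(t)}|\lambda_n, \sigma^{(t)}_{0,n}, \sigma_{1,n}))| \le \frac{2E|\beta^{(t)}|}{(\sigma^*_{0,n})^2}$.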
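As a sanity check on the closed-form gradient of the mixture Gaussian log-prior, the following sketch compares it with a central finite difference for a scalar coefficient. It assumes that $p(\sigma)$ is the $N(0, \sigma^2)$ density evaluated at $\beta$ (that is, $\sigma$ is treated as a standard deviation); the function names and the numeric values are illustrative and not taken from the source.

```python
import math


def normal_pdf(b, sigma):
    """Density of N(0, sigma^2) at b (sigma treated as a standard deviation)."""
    return math.exp(-0.5 * (b / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def log_prior(b, lam, s0, s1):
    """Log of the two-component mixture (1 - lam) N(0, s0^2) + lam N(0, s1^2)."""
    return math.log((1.0 - lam) * normal_pdf(b, s0) + lam * normal_pdf(b, s1))


def grad_log_prior(b, lam, s0, s1):
    """Closed form: weighted sum of -b / s0^2 and -b / s1^2."""
    a0 = (1.0 - lam) * normal_pdf(b, s0)
    a1 = lam * normal_pdf(b, s1)
    return -(a0 / (a0 + a1)) * b / s0**2 - (a1 / (a0 + a1)) * b / s1**2


def numeric_grad(b, lam, s0, s1, h=1e-6):
    """Central finite difference of the log-prior."""
    return (log_prior(b + h, lam, s0, s1) - log_prior(b - h, lam, s0, s1)) / (2.0 * h)


# Illustrative values only.
b, lam, s0, s1 = 0.3, 0.1, 0.5, 2.0
analytic = grad_log_prior(b, lam, s0, s1)
numeric = numeric_grad(b, lam, s0, s1)
```

Because the two mixture weights sum to one, the magnitude of `analytic` is at most $|b| / \min(s_0, s_1)^2$, which is consistent with the bound stated above.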
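By Assumption 5.2 and the triangle inequality, we have

$$\begin{aligned}
&E(|\eta^{(t)} \nabla \log(\pi(\beta^{(t)}|\lambda_n, \sigma^{(t)}_{0,n}, \sigma_{1,n})) - \eta^* \nabla \log(\pi(\beta^{(t)}|\lambda_n, \sigma^*_{0,n}, \sigma_{1,n}))|) \\
&\quad \le E(|\eta^{(t)} \nabla \log(\pi(\beta^{(t)}|\lambda_n, \sigma^{(t)}_{0,n}, \sigma_{1,n})) - \eta^* \nabla \log(\pi(\beta^{(t)}|\lambda_n, \sigma^{(t)}_{0,n}, \sigma_{1,n}))|) \\
&\qquad + E(|\eta^* \nabla \log(\pi(\beta^{(t)}|\lambda_n, \sigma^{(t)}_{0,n}, \sigma_{1,n})) - \eta^* \nabla \log(\pi(\beta^{(t)}|\lambda_n, \sigma^*_{0,n}, \sigma_{1,n}))|) \\
&\quad \le \frac{2M}{(\sigma^*_{0,n})^2} |\eta^{(t)} - \eta^*| + \eta^* M |\sigma^{(t)}_{0,n} - \sigma^*_{0,n}|.
\end{aligned}$$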

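By Assumption 5.1, $E(\psi(\beta^{(t)})) < \infty$. Then

$$\frac{1}{T}\sum_{t=0}^{T-1} E(\delta_t \psi(\beta^{(t)})) = O\left(\frac{1}{T}\sum_{t=0}^{T-1} \left(|\eta^{(t)} - \eta^*| + |\sigma^{(t)}_{0,n} - \sigma^*_{0,n}|\right)\right).$$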

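Note that by Assumption 5.1, $|E(\psi(\beta^{(T)})) - \psi(\beta^{(0)})|$ is bounded. Then

$$E\left(\frac{1}{T}\sum_{t=1}^{T-1} \phi(\beta^{(t)}) - \int \phi(\beta)\pi(\beta|D_n, \eta^*, \sigma^*_{0,n})\,d\beta\right)
= O\left(\frac{1}{T\epsilon} + \frac{\sum_{t=0}^{T-1} \left(|\eta^{(t)} - \eta^*| + |\sigma^{(t)}_{0,n} - \sigma^*_{0,n}|\right)}{T} + \epsilon\right).$$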

3. A KERNEL-EXPANDED STOCHASTIC NEURAL NETWORK
3.1 A Kernel-Expanded Stochastic Neural Network

3.1.1 A Kernel-Expanded Neural Network

We start with a brief review of the theory developed in [5] and [6]. Consider a neural
network model with h hidden layers. Let $Z_0 = X \in \mathbb{R}^{m_0}$ denote an input vector, let
$Z_i \in \mathbb{R}^{m_i}$ denote the output vector at layer i for $i = 1, 2, \dots, h, h+1$, and let $Y \in \mathbb{R}^{m_{h+1}}$
denote the target output. At each layer i, the neural network calculates its output:

$$Z_i = \Psi(w_i Z_{i-1} + b_i), \quad i = 1, 2, \dots, h, h+1, \tag{3.1}$$

where $w_i \in \mathbb{R}^{m_i \times m_{i-1}}$ and $b_i \in \mathbb{R}^{m_i}$ denote the weights and bias of layer i, respectively,
$\Psi(s) = (\psi(s_1), \dots, \psi(s_{m_i}))^T$, and $\psi(\cdot)$ is the activation function used in the network. For
convenience, let $p = m_0$ denote the dimension of the input vector, let $\tilde{w}_i = [w_i, b_i]$ denote the
matrix of all parameters of layer i for $i = 1, 2, \dots, h+1$, and let $\theta = (\tilde{w}_1, \tilde{w}_2, \dots, \tilde{w}_{h+1}) \in \Theta$.
Further, we assume that the network structure is pyramidal with $m_0 \ge m_1 \ge \cdots \ge m_h \ge m_{h+1}$
and, for simplicity, the same activation function $\psi(\cdot)$ is used for all hidden units. Let
$U : \Theta \to \mathbb{R}$ be the loss function of the neural network, which is given by

$$U(\theta) = -\frac{1}{n}\sum_{i=1}^{n} \log \pi(Y^{(i)}|\theta, X^{(i)}) \triangleq \frac{1}{n}\sum_{i=1}^{n} l(Z^{(i)}_{h+1}), \tag{3.2}$$

where $\pi(\cdot)$ denotes the density/mass function of each observation under the neural network
model, n denotes the training sample size, i indexes the training sample, $Z^{(i)}_{h+1}$
is the output vector of layer h + 1, and $l : \mathbb{R}^{m_{h+1}} \to \mathbb{R}$ is assumed to be a twice continuously
differentiable loss function, i.e., $l \in C^2(\mathbb{R}^{m_{h+1}})$. In order to study the property of the loss
function, [6] made the following assumption:

Assumption 3.1.1. (i) All training samples are distinct, i.e., $X^{(i)} \ne X^{(j)}$ for all $i \ne j$;

(ii) $\psi(\cdot)$ is real analytic, strictly monotonically increasing and (a) $\psi(\cdot)$ is bounded or (b)
there are positive $\rho_1$, $\rho_2$, $\rho_3$ and $\rho_4$ such that $|\psi(t)| \le \rho_1 e^{\rho_2 t}$ for $t < 0$ and $|\psi(t)| \le \rho_3 t + \rho_4$
for $t \ge 0$;

(iii) $l \in C^2(\mathbb{R}^{m_{h+1}})$ and if $l'(a) = 0$ then a is a global optimum.

Here a function $\psi : \mathbb{R} \to \mathbb{R}$ is called real analytic if the corresponding Taylor series
converges to $\psi(s)$ on an open subset of $\mathbb{R}$. It is easy to see that many activation
functions, such as tanh, sigmoid and softplus, satisfy Assumption 3.1.1-(ii). The softplus
function can be viewed as a differentiable approximation to ReLU. Assumption 3.1.1-(iii) can be
satisfied by any twice continuously differentiable convex loss function, e.g., negative log-Gaussian and
log-binomial density/mass functions. The following lemma is a restatement of Theorem 3.4
of [6]. A similar result has also been established in [5].
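To make condition (ii)(b) of Assumption 3.1.1 concrete, the sketch below checks numerically, on a grid, that softplus is strictly increasing, stays within $\log 2$ of ReLU, and satisfies the two growth bounds with the particular constants $\rho_1 = \rho_2 = \rho_3 = 1$ and $\rho_4 = \log 2$. These constants are our own choice for illustration; the source does not state them.

```python
import math


def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def relu(t):
    return max(t, 0.0)


grid = [k / 10.0 for k in range(-100, 101)]
values = [softplus(t) for t in grid]

# Strict monotonicity on the grid.
increasing = all(a < b for a, b in zip(values, values[1:]))

# Gap to ReLU: softplus(t) - relu(t) = log(1 + exp(-|t|)), which lies in (0, log 2].
gaps = [softplus(t) - relu(t) for t in grid]

# Growth bounds with rho1 = rho2 = rho3 = 1 and rho4 = log 2 (illustrative constants).
left_ok = all(softplus(t) <= math.exp(t) + 1e-15 for t in grid if t < 0)
right_ok = all(softplus(t) <= t + math.log(2.0) + 1e-15 for t in grid if t >= 0)
```

The bound for $t < 0$ follows from $\log(1 + u) \le u$, and the bound for $t \ge 0$ from $\log(1 + e^{-t}) \le \log 2$.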

Lemma 3.1.1. (Theorem 3.4 of [6]) Suppose Assumption 3.1.1 holds. If (i) the training
samples are linearly independent, i.e., $\mathrm{rank}([X, 1_n]) = n$; and (ii) the weight matrices $(\tilde{w}_l)_{l=2}^{h+1}$
have full row rank, i.e., $\mathrm{rank}(\tilde{w}_l) = m_l$ for $l = 2, 3, \dots, h+1$, then every critical point of the
loss function $U(\theta)$ is a global minimum.

Among the conditions of Lemma 3.1.1, Assumption 3.1.1 is regular as discussed above,
and condition (ii) can be almost surely satisfied by restricting the network structure to be
pyramidal. However, condition (i) is not satisfied by many machine learning problems for
which the training sample size is much larger than the dimension of the input. To have this
condition satisfied, we propose a kernel-expanded neural network (or KNN in short), where
each input vector x is mapped into an infinite dimensional feature space by a radial basis
function (RBF) kernel ϕ(x). More precisely, the KNN can be expressed as

Ỹ 1 = b1 + βϕ(X),

Ỹ i = bi + wi Ψ(Ỹ i−1 ), i = 2, 3, . . . , h, (3.3)

Y = bh+1 + wh+1 Ψ(Ỹ h ) + eh+1 ,

where $e_{h+1} \sim N(0, \sigma^2_{h+1} I_{m_{h+1}})$ is Gaussian random error; $\tilde{Y}_i, b_i \in \mathbb{R}^{m_i}$ for $i = 1, 2, \dots, h$;
$Y_{h+1}, b_{h+1} \in \mathbb{R}^{m_{h+1}}$; $\Psi(\tilde{Y}_{i-1}) = (\psi(\tilde{Y}_{i-1,1}), \psi(\tilde{Y}_{i-1,2}), \dots, \psi(\tilde{Y}_{i-1,m_{i-1}}))^T$ for $i = 2, 3, \dots, h+1$,
and $\tilde{Y}_{i-1,j}$ is the jth element of $\tilde{Y}_{i-1}$; $w_i \in \mathbb{R}^{m_i \times m_{i-1}}$ for $i = 2, 3, \dots, h+1$; $\beta \in \mathbb{R}^{m_1 \times d_\phi}$ and $d_\phi$
denotes the dimension of the feature space of the kernel $\phi(\cdot)$. For the RBF kernel, $d_\phi = \infty$.
Note that different kernels can be used for different hidden units of the first hidden layer.
For notational simplicity, we consider only the case that the same kernel is used for all
the hidden units and Y follows a normal regression model. Replacing the third equation
of (3.3) by a logit model will lead to the classification case. In general, we consider only
the distribution $\pi(Y|\theta, X)$ such that Assumption 3.1.1-(iii) is satisfied, where, with a slight
abuse of notation, we use $\theta = (b_1, \beta; b_2, w_2; \dots; b_{h+1}, w_{h+1})$ to denote the collection of all
weights of the KNN.
Compared to formula (3.1), formula (3.3) gives a new representation for neural networks,
where the feeding operator (used for calculating $w_i Z_{i-1} + b_i$) and the activation
operator $\psi(\cdot)$ are separated into two equations. As shown later, such a representation
facilitates parameter estimation for the neural network when auxiliary noise is introduced into
the model.
For KNN, since the input vector has been mapped into an infinite dimensional feature
space, the Gram matrix K = (kij ), where kij = ϕT (xi )ϕ(xj ), can be of full rank, i.e.,
rank(K) = n. This means the transformed samples ϕ(X (1) ), ϕ(X (2) ), . . . , ϕ(X (n) ) are lin-
early independent. In addition, we can restrict the structure of the KNN to be pyramidal,
and choose the activation and loss function such that Assumption 3.1.1 is satisfied. There-
fore, by Lemma 3.1.1, every critical point of the KNN model is a global minimum. In
summary, we have the following theorem with the proof as argued above.

Theorem 3.1.1. For a KNN model given in (3.3), if Assumption 3.1.1 holds, an RBF kernel
is used in the input layer, and the weight matrices $(\tilde{w}_l)_{l=2}^{h+1}$ are of full row rank, then every
critical point of its loss function is a global minimum.
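The full-rank argument behind Theorem 3.1.1 can be illustrated on a tiny example. The sketch below is a standard-library illustration under our own choices (eight equally spaced scalar inputs and an RBF bandwidth parameter `gamma = 1.0`, neither of which comes from the source). With $p = 1$, the raw design matrix $[X, 1_n]$ is $8 \times 2$ and so has rank at most 2, whereas the RBF Gram matrix of the same distinct points is positive definite and hence of rank $n$.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - y||^2) for equal-length sequences."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def gram_matrix(xs, gamma=1.0):
    return [[rbf_kernel(xi, xj, gamma) for xj in xs] for xi in xs]


def is_positive_definite(a):
    """Attempt a Cholesky factorization of a symmetric matrix.

    In exact arithmetic the factorization succeeds if and only if the
    matrix is positive definite, which implies full rank.
    """
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                low[i][i] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True


# n = 8 distinct samples with p = 1: rank([X, 1_n]) <= 2 < n.
xs = [(float(i),) for i in range(8)]
distinct_full_rank = is_positive_definite(gram_matrix(xs))

# Duplicated samples violate Assumption 3.1.1-(i) and give a singular Gram matrix.
duplicated_full_rank = is_positive_definite(gram_matrix([(0.0,), (0.0,), (1.0,)]))
```

The second call shows why distinctness of the training samples matters: two identical inputs produce two identical rows in the Gram matrix.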

Other than the RBF kernel, the polynomial kernel might also satisfy Theorem 3.1.1 for
certain problems. For an input vector $x \in \mathbb{R}^p$, the dimension of its feature space is $\binom{p+q}{q}$,
where q denotes the degree of the polynomial kernel. Therefore, if the resulting
Gram matrix is of full rank, then the transformed samples $\phi(X^{(1)}), \dots, \phi(X^{(n)})$ are also
linearly independent. However, as stated in Assumption 3.1.4, the K-StoNet requires the
kernel to be universal, so the polynomial kernel is not used.
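As a small worked example of the dimension formula $\binom{p+q}{q}$, the sketch below computes the feature-space dimension and searches for the smallest degree whose dimension reaches a given sample size. The helper names and the values $p = 10$, $n = 1000$ are hypothetical. Note that a feature dimension of at least $n$ is only necessary, not sufficient, for the Gram matrix to have rank $n$.

```python
from math import comb


def poly_feature_dim(p, q):
    """Number of monomials of degree at most q in p variables: C(p + q, q)."""
    return comb(p + q, q)


def smallest_degree(p, n):
    """Smallest degree q whose feature dimension is at least n."""
    q = 1
    while poly_feature_dim(p, q) < n:
        q += 1
    return q


# For p = 2 and q = 2 the monomials are 1, x1, x2, x1^2, x1*x2, x2^2.
dim_small = poly_feature_dim(2, 2)

# Hypothetical setting: p = 10 inputs and n = 1000 training samples.
q_needed = smallest_degree(10, 1000)
```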

3.1.2 A Kernel-Expanded StoNet as an Approximator to KNN

As shown in Theorem 3.1.1, the KNN has a nice loss surface, where every critical point is a
global minimum. However, training the KNN using a gradient-based algorithm is infeasible,
as the transformed features are not explicitly available. Based on the kernel representer
theorem [85], [86], one might consider replacing the first equation of (3.3) by

$$\tilde{Y}_1 = b_1 + \sum_{i=1}^{n} w_1^{(i)} K(X^{(i)}, X), \tag{3.4}$$

where $w_1^{(i)} \in \mathbb{R}^{m_1}$ and $K(X^{(i)}, X) = \phi^T(X^{(i)})\phi(X)$ is explicitly available, and then train
such an over-parameterized neural network model using a regularization method. However,
the global optimality property established in Theorem 3.1.1 might not hold for the regularized
KNN any more, because the proof of Theorem 3.1.1 relies on the back propagation formula
of the neural network (see the proof of Theorem 3.4 in [6] for the detail), while that formula
cannot be easily generalized to regularized loss functions. Moreover, for a nonlinear kernel
regression Y = g(Ỹ 1 ) + e = g(b1 + βϕ(X)) + e, where g(·) represents a nonlinear mapping
from Ỹ 1 to the output layer, the kernel representer theorem does not hold for g(·) in general
and, therefore, (3.4) and the first equation of (3.3) might not be equivalent for the KNN.
Recall that SVR is a special case of the kernel regression with the identity mapping g(Ỹ 1 ) =
Ỹ 1 .
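The kernel representer form (3.4) of the first layer can be sketched in a few lines. This is an illustration rather than the authors' implementation: the RBF bandwidth `gamma`, the nested-list layout of `w1` (one row per training sample, one column per hidden unit), and all numeric values are assumptions made for the example.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def first_layer(x, train_x, w1, b1, gamma=1.0):
    """Evaluate b_1 + sum_i w_1^(i) K(X^(i), x), one value per hidden unit.

    w1[i][j] links training sample i to hidden unit j, so the layer has
    n * m_1 weights plus m_1 biases.
    """
    k = [rbf_kernel(xi, x, gamma) for xi in train_x]
    return [
        b1[j] + sum(w1[i][j] * k[i] for i in range(len(train_x)))
        for j in range(len(b1))
    ]


# Two training points and two hidden units (hypothetical values).
train_x = [(0.0,), (1.0,)]
w1 = [[1.0, 0.0], [0.0, 2.0]]
b1 = [0.5, -0.5]
out = first_layer((0.0,), train_x, w1, b1)
```

The weight count grows linearly with the sample size, which is why the text describes this model as over-parameterized.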
To tackle this issue, we introduce a K-StoNet model (depicted by Figure 3.1) by adding
auxiliary noise to Ỹ i ’s, i = 1, 2, . . . , h, in (3.3). The resulting model is given by

Y 1 = b1 + βϕ(X) + e1 ,

Y i = bi + wi Ψ(Y i−1 ) + ei , i = 2, 3, . . . , h, (3.5)

Y = bh+1 + wh+1 Ψ(Y h ) + eh+1 ,

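[Figure 3.1. An illustrative plot of K-StoNet. From bottom to top: the input X feeds a support vector regression layer that produces $Y_1$; the activation $\psi$ gives $\psi(Y_1)$; a linear regression layer produces $Y_2$; the activation $\psi$ gives $\psi(Y_2)$; a final linear/logistic regression layer produces the output Y.]
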
where $Y_1, Y_2, \dots, Y_h$ are latent variables. To complete the model specification, we assume
that $e_i \sim N(0, \sigma_i^2 I_{m_i})$ for $i = 2, 3, \dots, h, h+1$, and each component of $e_1$ is independent and
identically distributed with the density function given by

$$f(x) = \frac{C}{2(1 + C\varepsilon)} e^{-C|x|_\varepsilon}, \tag{3.6}$$

where $|x|_\varepsilon = \max(0, |x| - \varepsilon)$ is an $\varepsilon$-insensitive loss function, and C is a scale parameter. It
is known that this distribution has mean 0 and variance $\frac{2}{C^2} + \frac{\varepsilon^2(\varepsilon C + 3)}{3(\varepsilon C + 1)}$. For classification
networks, the last equation of (3.5) is replaced by a generalized linear model (GLM), for
which the parameter $\sigma^2_{h+1}$ plays the role of temperature for the binomial or multinomial
distribution formed at the output layer. In summary, $\{C, \varepsilon, \sigma_2^2, \dots, \sigma_h^2, \sigma_{h+1}^2\}$ work together
to control the variation of the latent variables $\{Y_1, \dots, Y_h\}$ as discussed in Section 3.1.4.
As shown later, such specifications for the auxiliary noise enable the K-StoNet parameters
to be estimated by solving a series of convex optimization problems and the prediction
uncertainty to be easily assessed via a recursive formula.
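The normalization and variance of the density (3.6) can be verified numerically. The sketch below uses a plain composite trapezoid rule with illustrative values $C = 2$ and $\varepsilon = 0.3$ (our choice, not from the source) and compares the numerical second moment with the closed-form variance quoted above.

```python
import math


def eps_insensitive_pdf(x, C, eps):
    """Density (3.6): C / (2 (1 + C eps)) * exp(-C * max(0, |x| - eps))."""
    return C / (2.0 * (1.0 + C * eps)) * math.exp(-C * max(0.0, abs(x) - eps))


def variance_formula(C, eps):
    """Closed-form variance: 2 / C^2 + eps^2 (eps C + 3) / (3 (eps C + 1))."""
    return 2.0 / C**2 + eps**2 * (eps * C + 3.0) / (3.0 * (eps * C + 1.0))


def trapezoid(f, lo, hi, steps):
    """Composite trapezoid rule on [lo, hi] with equally spaced nodes."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, steps))
    return total * h


C, eps = 2.0, 0.3  # illustrative values
mass = trapezoid(lambda x: eps_insensitive_pdf(x, C, eps), -20.0, 20.0, 200_000)
second_moment = trapezoid(
    lambda x: x * x * eps_insensitive_pdf(x, C, eps), -20.0, 20.0, 200_000
)
```

Since the density is symmetric about zero, its mean is 0 and the second moment equals the variance. Setting $\varepsilon = 0$ recovers the Laplace distribution with variance $2/C^2$.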
To establish that K-StoNet is a valid approximator to KNN, i.e., asymptotically they have
the same loss function, some assumptions need to be imposed on the model. To indicate
their dependence on the training sample size n, we redenote C by Cn , ε by εn , and σi by σn,i
for i = 2, 3, . . . , h + 1. For (3.6), we assume εn ≤ 1/Cn holds as n → ∞. As in KNN, we
let θ denote the parameter vector of K-StoNet, and let dθ denote the dimension of θ. Since,
for the KNN, any local minimum is also a global minimum, we can restrict Θ to a compact
set which is large enough such that one local minimum is contained. This is essentially a
technical condition. In practice, if a local convergence algorithm is used for training the
KNN, it is then equivalent to set $\Theta = \mathbb{R}^{d_\theta}$, as the regions beyond a neighborhood of the
starting point will never be visited by the algorithm.

Assumption 3.1.2. (i) $\Theta$ is compact, which can be contained in a $d_\theta$-ball centered at
the origin and of radius r; (ii) $E(\log \pi(Y|X, \theta))^2 < \infty$ for any $\theta \in \Theta$; (iii) the activation
function $\psi(\cdot)$ is $c'$-Lipschitz continuous for some constant $c'$; (iv) the network's
depth h and widths $m_i$'s are all allowed to increase with n; and (v) $\sigma_{n,h+1} = O(1)$, and
$m_{h+1}\big(\prod_{i=k+1}^{h} m_i^2\big) m_k \sigma_{n,k}^2 \prec \frac{1}{h}$ for $k \in \{1, 2, \dots, h\}$, where $\sigma_{n,1} = 1/C_n$.

Assumption 3.1.2-(ii) is the regularity condition for the distribution of Y. Assumption
3.1.2-(iii) can be satisfied by many activation functions such as tanh, sigmoid and softplus.
Assumption 3.1.2-(v) constrains the size of the noise added to each hidden layer such that the
K-StoNet has asymptotically the same loss function as the KNN when the training sample
size becomes large, where the factor $m_{h+1}\big(\prod_{i=k+1}^{h} m_i^2\big) m_k$ is derived in the proof of Theorem
3.1.2 and its square root can be interpreted as the amplification factor of the noise $e_k$ at the
output layer.
As stated in Assumption 3.1.4, the SVR in K-StoNet is required to work with a universal
kernel such as RBF. By [34], [35], [87], such an SVR possesses the universal approximation
capability, and so does K-StoNet. Therefore, K-StoNet need not be very deep or wide
to approximate any continuous function arbitrarily well as the training sample
size $n \to \infty$. For this reason, we may restrict the depth $h = O(1)$, and restrict $m_1 = o(\sqrt{n})$
and thus $m_i = o(\sqrt{n})$ for all $i = 2, 3, \dots, h$ due to the pyramidal structure of K-StoNet. The
universal approximation property of SVR is quite different from that of neural networks.
The former depends on the training sample size, while the latter depends on the network
size. The K-StoNet lies in between.
Theorem 3.1.2 shows that the K-StoNet and KNN have asymptotically the same training
loss function; its proof is given in Section 3.7.1.

Theorem 3.1.2. Suppose Assumption 3.1.2 holds. Then the K-StoNet (3.5) and the KNN
(3.3) have asymptotically the same loss function, i.e., as $n \to \infty$,

$$\sup_{\theta \in \Theta} \left| \frac{1}{n}\sum_{i=1}^{n} \log \pi(Y^{(i)}, Y^{(i)}_{mis}|X^{(i)}, \theta) - \frac{1}{n}\sum_{i=1}^{n} \log \pi(Y^{(i)}|X^{(i)}, \theta) \right| \overset{p}{\to} 0, \tag{3.7}$$

where $Y^{(i)}_{mis} = (Y_1, Y_2, \dots, Y_h)$ denotes the collection of latent variables in (3.5), and $\overset{p}{\to}$
denotes convergence in probability.

Let $Q^*(\theta) = E(\log \pi(Y|X, \theta))$, where the expectation is taken with respect to the joint
distribution $\pi(X, Y)$. By Assumption 3.1.2-(i) & (ii), and the law of large numbers,

$$\frac{1}{n}\sum_{i=1}^{n} \log \pi(Y^{(i)}|X^{(i)}, \theta) - Q^*(\theta) \overset{p}{\to} 0, \tag{3.8}$$

holds uniformly over $\Theta$. Further, we make the following assumptions for $Q^*(\theta)$:

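Assumption 3.1.3. (i) $Q^*(\theta)$ is continuous in $\theta$ and uniquely maximized at $\theta^*$; (ii) for
any $\epsilon > 0$, $\sup_{\theta \in \Theta \setminus B(\epsilon)} Q^*(\theta)$ exists, where $B(\epsilon) = \{\theta : \|\theta - \theta^*\| < \epsilon\}$, and
$\delta = Q^*(\theta^*) - \sup_{\theta \in \Theta \setminus B(\epsilon)} Q^*(\theta) > 0$.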

Assumption 3.1.3 restricts the shape of Q∗ (θ) around the global maximizer, which cannot
be discontinuous or too flat. Given nonidentifiability of the neural network model (see e.g.
[69]), we here have implicitly assumed that each θ in the KNN and K-StoNet is unique up
to loss-invariant transformations, such as reordering some hidden units and simultaneously
changing the signs of some weights and biases.

Lemma 3.1.2. Suppose Assumptions 3.1.2-3.1.3 hold, and $\pi(Y, Y_{mis}|X, \theta)$ is continuous in
$\theta$. Let $\hat{\theta}_n = \arg\max_{\theta \in \Theta} \left\{ \frac{1}{n}\sum_{i=1}^{n} \log \pi(Y^{(i)}, Y^{(i)}_{mis}|X^{(i)}, \theta) \right\}$. Then $\|\hat{\theta}_n - \theta^*\| \overset{p}{\to} 0$ as $n \to \infty$.

The proof of Lemma 3.1.2 is given in Section 3.7.1. It implies that the KNN can be
trained by training K-StoNet as the sample size n becomes large.

3.1.3 The Imputation-Regularized Optimization Algorithm

To train the K-StoNet, we propose to use the imputation-regularized optimization (IRO)
algorithm [33]. Consider a missing data problem, where $Z_{obs}$ denotes observed data, $Z_{mis}$
denotes missing data, and $\vartheta$ denotes the parameter. The IRO algorithm aims to find a
consistent estimate of $\vartheta$ by maximizing $E \log \pi(Z_{obs}, Z_{mis}|\vartheta)$, where the expectation is taken
with respect to the joint distribution of $(Z_{obs}, Z_{mis})$. Conceptually, this is a little different
from the expectation-maximization (EM) algorithm [88] and stochastic EM algorithm
[89], which aim to estimate $\vartheta$ by maximizing the marginal likelihood function $\pi(Z_{obs}|\vartheta)$.
Practically, the IRO algorithm works in a similar way to stochastic EM by iterating between
an imputation step and an optimization step, except that a regularization term can be
included in the loss function at each optimization step to ensure the convergence of the
estimate in the high-dimensional scenario.
For K-StoNet, the IRO algorithm is to estimate $\theta$ by maximizing $E \log \pi(Y, Y_{mis}|X, \theta)$,
which is equivalent to maximizing $Q^*(\theta) = E \log \pi(Y|X, \theta)$ as implied by (3.7) and (3.8).
This coincides with the goal of KNN training if the stochastic gradient descent (SGD)
algorithm is used. Let $\hat{\theta}_n^{(t)}$ denote the estimate of $\theta$ obtained by the IRO algorithm at iteration t.
The IRO algorithm starts with an initial guess $\hat{\theta}_n^{(0)}$ and then iterates between the following
two steps:

• I-step: For each sample $(X^{(i)}, Y^{(i)})$, draw $Y^{(i)}_{mis}$ from the predictive distribution
$g(Y_{mis}|Y^{(i)}, X^{(i)}, \hat{\theta}_n^{(t)})$.

• RO-step: Based on the pseudo-complete data, find an updated estimate $\hat{\theta}_n^{(t+1)}$ by
minimizing the penalized loss function, i.e.,

$$\hat{\theta}_n^{(t+1)} = \arg\min_{\theta} \left\{ -\frac{1}{n}\sum_{i=1}^{n} \log \pi(Y^{(i)}, Y^{(i)}_{mis}|X^{(i)}, \theta) + P_{\lambda_n}(\theta) \right\}, \tag{3.9}$$

where the penalty function $P_{\lambda_n}(\theta)$ is chosen such that $\hat{\theta}_n^{(t+1)}$ forms a consistent estimate
of

$$\begin{aligned}
\theta_*^{(t)} &= \arg\max_{\theta} E_{\theta_n^{(t)}} \log \pi(Y, Y_{mis}|X, \theta) \\
&= \arg\max_{\theta} \int \log \pi(Y_{mis}, Y|X, \theta)\, g(Y_{mis}|Y, X, \theta_n^{(t)})\, \pi(Y|\theta^*, X)\, \pi(X)\, dY_{mis}\, dY\, dX,
\end{aligned}$$

$\theta^*$ denotes the true parameter of the model, and $\pi(X)$ denotes the density function of
X.

For the K-StoNet, the joint distribution $\pi(Y, Y_{mis}|X, \theta)$ can be factored as

$$\pi(Y, Y_{mis}|X, \theta) = \pi(Y_1|X, \tilde{w}_1) \Big[\prod_{i=2}^{h} \pi(Y_i|Y_{i-1}, \tilde{w}_i)\Big] \pi(Y|Y_h, \tilde{w}_{h+1}). \tag{3.10}$$

Therefore, the optimization in (3.9) can be executed separately for each of the hidden and
output layers with an appropriately specified penalty function. That is, K-StoNet can be
trained by solving a series of lower dimensional optimization problems.

For the first hidden layer, the RO-step is reduced to solving an SVR for each hidden unit.
As described in [90], the parameter $\beta$ in (3.5) can be estimated by solving a regularized
optimization problem:

$$\arg\min_{\beta, b_1} \frac{1}{2}\|\beta\|_2^2 + \frac{\tilde{C}_n}{n}\sum_{i=1}^{n} |Y_1^{(i)} - \beta\phi(X^{(i)}) - b_1|_\varepsilon, \tag{3.11}$$

where the first term represents a penalty function, and $\tilde{C}_n$ represents the regularization
parameter. We can set $\tilde{C}_n = C_n$ given in (3.6), but this is not necessary. In general, their values
should make Assumptions 3.1.2-(v) and 3.1.4-(i) hold. The consistency of the SVR estimator
$\hat{\beta}^T \phi(x) + \hat{b}_1$, which is the basic requirement of the IRO algorithm, has been established in
[91] by assuming that a universal kernel [34], [92] such as RBF is used in (3.11). Equivalently,
this is to reparameterize the SVR layer by kernel-based regression. By the kernel representer
theorem [85], [86], the solution to the regularized optimization problem (3.11) leads to the
representer of the first equation of (3.5) as

$$Y_1 = \hat{b}_1 + \sum_{i=1}^{n} \hat{w}_1^{(i)} K(X^{(i)}, X) + e_1. \tag{3.12}$$

In what follows, we will use $\check{w}_1 = (\hat{w}_1^{(1)}, \dots, \hat{w}_1^{(n)}, \hat{b}_1)$ to denote the estimator for the
parameters of the SVR layer.
For other hidden layers, the RO-step is reduced to solving a linear regression for each
hidden unit using a regularization method. To ensure convexity of the resulting objective
function, a Lasso penalty [72] can be used. Alternatively, some nonconvex amenable penalties
with vanishing derivatives away from the origin, such as the SCAD [93] and MCP [94], can
also be used. As shown in [95], for such nonconvex amenable penalties, any stationary
point in a compact region around the true regression coefficients can be used to consistently
estimate the parameters and recover the support of the underlying true regression.
For the output layer, the RO-step is reduced to solving a multinomial logistic or multi-
variate linear regression, depending on the problem under consideration. The Lasso, SCAD
and MCP penalties can again be used for them by the theory of [95]. In practice, this step
can also be simplified to solving a linear or logistic regression for each output unit by ignoring
the correlation between different components of Y.
In summary, we have the pseudo-code given in Algorithm 3 for training K-StoNet, where
$\check{w}_i^{(t)}$ denotes the estimate of the parameters for layer i at iteration t, $(Y_0^{(s)}, Y_{h+1}^{(s)}) =
(X^{(s)}, Y^{(s)})$ denotes a training sample, and $(Y_1^{(s,t)}, \dots, Y_h^{(s,t)})$ denotes the latent variables
imputed for training sample s at iteration t. For convenience, we occasionally use the notation
$Y_0^{(s,t)} = Y_0^{(s)}$ and $Y_{h+1}^{(s,t)} = Y_{h+1}^{(s)}$.
For Algorithm 3, we have a few remarks:

• The Hamiltonian Monte Carlo (HMC) algorithm [96]–[98] is employed in the backward
imputation step. Other MCMC algorithms such as Langevin Monte Carlo [99] and the
Gibbs sampler [100] can also be employed there.

• In the parameter update step, a Lasso penalty [72] is used in (3.15) to induce the
sparsity of StoNet, while ensuring convexity of the minimization problems. Therefore,
K-StoNet is trained by solving a series of convex optimization problems. Note that the
minimization in (3.14) is known as a convex quadratic programming problem [101],
[102]. Although solving the convex optimization problems is more expensive than a
single gradient update, the IRO algorithm converges very fast, usually within tens of
iterations.

• The major computational cost of K-StoNet comes from the SVR step when the sample
size is large. The computational complexity for solving an SVR is $O(n^2 p + n^3)$, and that
for solving a linear/logistic regression is bounded by $O(n m_1^2 + m_1^3)$, while $m_1 \prec n^{1/2}$
is usually recommended. A scalable SVR solver will accelerate the computation of
K-StoNet substantially. This issue will be further discussed at the end of the chapter.

• If m1 ≺ n holds, then the penalty in (3.15) can be simply removed for computational
simplicity, while ensuring asymptotic normality of the resulting regression coefficient
estimates by [53].

Algorithm 3 The IRO Algorithm for K-StoNet Training
Input: the total iteration number $T$, the Monte Carlo step number $t_{HMC}$, and the learning
rate sequences $\{\epsilon_{t,i} : t = 1, 2, \dots, T;\ i = 1, 2, \dots, h+1\}$.
Initialization: Randomly initialize the network parameters $\hat{\theta}_n^{(0)} = (\check{w}_1^{(0)}, \dots, \check{w}_{h+1}^{(0)})$.
for t = 1, 2, ..., T do
    STEP 1. Backward Imputation: For each observation s, impute the latent variables
    in the order from layer h to layer 1. More explicitly, impute $Y_i^{(s,t)}$ from the distribution
    $\pi(Y_i^{(s,t)}|Y_{i+1}^{(s,t)}, Y_{i-1}^{(s,t)}, \check{w}_i^{(t-1)}, \check{w}_{i+1}^{(t-1)}) \propto \pi(Y_i^{(s,t)}|Y_{i-1}^{(s,t)}, \check{w}_i^{(t-1)})\, \pi(Y_{i+1}^{(s,t)}|Y_i^{(s,t)}, \check{w}_{i+1}^{(t-1)})$
    by running HMC in $t_{HMC}$ steps, where $\pi(Y_1^{(s,t)}|X^{(s)}, \check{w}_1^{(t-1)})$ can be expressed based on
    (3.12).
    (1.1) Initialization: Initialize $v_i^{(s,0)} = 0$, and initialize $Y_i^{(s,t,0)}$ by KNN, i.e., calculating
    $Y_i^{(s,t,0)}$ for $i = 1, 2, \dots, h$ in (3.5) by setting the random errors to zero.
    (1.2) Imputation:
    for k = 1, 2, ..., $t_{HMC}$ do
        for i = h, h-1, ..., 1 do

            $$\begin{aligned}
            v_i^{(s,k)} &= (1 - \alpha) v_i^{(s,k-1)} + \epsilon_{t,i} \nabla_{Y_i^{(s,t,k-1)}} \log \pi(Y_i^{(s,t,k-1)}|Y_{i-1}^{(s,t,k-1)}, \check{w}_i^{(t-1)}) \\
            &\quad + \epsilon_{t,i} \nabla_{Y_i^{(s,t,k-1)}} \log \pi(Y_{i+1}^{(s,t,k)}|Y_i^{(s,t,k-1)}, \check{w}_{i+1}^{(t-1)}) + \sqrt{2\alpha\epsilon_{t,i}}\, z^{(s,t,k)}, \\
            Y_i^{(s,t,k)} &= Y_i^{(s,t,k-1)} + v_i^{(s,k)},
            \end{aligned} \tag{3.13}$$

            where $z^{(s,t,k)} \sim N(0, I_{m_i})$, $\epsilon_{t,i}$ is the learning rate, and $1 - \alpha$ is the momentum
            decay factor ($\alpha = 1$ corresponds to Langevin Monte Carlo).
        end for
    end for
    (1.3) Output: Set $Y_i^{(s,t)} = Y_i^{(s,t,t_{HMC})}$ for $i = 1, 2, \dots, h$.
    STEP 2. Parameter Updating: Update the estimates $(\check{w}_1^{(t-1)}, \check{w}_2^{(t-1)}, \dots, \check{w}_{h+1}^{(t-1)})$
    by solving h + 1 penalized multivariate regressions separately.
    (2.1) SVR layer:

        $$\check{w}_1^{(t)} = \arg\min_{\beta, b_1} \left\{ \frac{\tilde{C}_{n,1}^{(t)}}{n}\sum_{s=1}^{n} \big\| |Y_1^{(s,t)} - \beta^T \phi(Y_0^{(s,t)}) - b_1|_\varepsilon \big\|_1 + \frac{1}{2}\|\beta\|_2^2 \right\}, \tag{3.14}$$

    where $\tilde{C}_{n,1}^{(t)}$ is the regularization parameter used at iteration t.
    (2.2) Regression layers:
    for i = 2, 3, ..., h+1 do

        $$\check{w}_i^{(t)} = \arg\min_{w_i, b_i} \left\{ \frac{1}{n}\sum_{s=1}^{n} \|Y_i^{(s,t)} - w_i \psi_i(Y_{i-1}^{(s,t)}) - b_i\|_2^2 + P_{\lambda_{n,i}^{(t)}}(\tilde{w}_i) \right\}, \tag{3.15}$$

        where $\lambda_{n,i}^{(t)}$ is the regularization parameter used for layer i at iteration t.
    end for
    (2.3) Output: Denote the updated estimate by $\hat{\theta}^{(t)} = (\check{w}_1^{(t)}, \dots, \check{w}_{h+1}^{(t)})$.
end for
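The imputation update (3.13) can be sketched for a single scalar latent variable. The code below is a simplified illustration, not the authors' implementation: it assumes a Gaussian conditional for the latent variable given the previous layer, a linear Gaussian link to the next layer (the activation is omitted so the gradient stays in closed form), and hypothetical values for every constant. The names `imputation_step`, `impute` and `grad_log_post` are ours.

```python
import math
import random


def imputation_step(y, v, grad_log_post, lr, alpha, z):
    """One update of (3.13) for a scalar latent variable.

    grad_log_post(y) returns the sum of the two gradient terms in (3.13);
    z is the standard normal draw, passed in so the step is reproducible.
    """
    v_new = (1.0 - alpha) * v + lr * grad_log_post(y) + math.sqrt(2.0 * alpha * lr) * z
    return y + v_new, v_new


def impute(y0, grad_log_post, lr, alpha, n_steps, rng):
    """Run n_steps updates starting from y0 with zero initial velocity."""
    y, v = y0, 0.0
    for _ in range(n_steps):
        y, v = imputation_step(y, v, grad_log_post, lr, alpha, rng.gauss(0.0, 1.0))
    return y


# Toy conditionals (assumed for illustration):
#   Y_i | previous layer ~ N(mu_prev, s1^2),  Y_{i+1} | Y_i ~ N(w_next * Y_i, s2^2).
mu_prev, s1, w_next, y_next, s2 = 0.5, 1.0, 2.0, 1.5, 0.5


def grad_log_post(y):
    return -(y - mu_prev) / s1**2 + w_next * (y_next - w_next * y) / s2**2


# Start from the noise-free value mu_prev, mirroring step (1.1) of Algorithm 3.
draw = impute(mu_prev, grad_log_post, lr=1e-3, alpha=0.1, n_steps=50, rng=random.Random(0))
```

With `alpha = 1` the previous velocity is discarded and each step reduces to a Langevin update, in line with the remark in Algorithm 3.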
Like the stochastic EM algorithm, the IRO algorithm generates two interleaved Markov
chains:

$$\hat{\theta}_n^{(0)} \to (Y_1^{(1)}, \dots, Y_h^{(1)}) \to \hat{\theta}_n^{(1)} \to (Y_1^{(2)}, \dots, Y_h^{(2)}) \to \hat{\theta}_n^{(2)} \to \cdots,$$

whose convergence theory has been studied in [33]. To ensure the convergence of the Markov
chains in K-StoNet training, we make the following assumptions for the regularization parameters
used in (3.14) and (3.15):

Assumption 3.1.4. (i) A universal kernel such as RBF is used in the SVR layer, and for
each $t \in \{1, 2, \dots, T\}$, $1 \prec \tilde{C}_{n,1}^{(t)} \prec \sqrt{n}$ holds; and (ii) for each $t \in \{1, 2, \dots, T\}$ and each
$i \in \{2, 3, \dots, h+1\}$, $\sup_{\tilde{w}_i \in \Theta_i} P_{\lambda_{n,i}^{(t)}}(\tilde{w}_i) \to 0$ holds as $n \to \infty$, where $\Theta_i$ denotes the sample
space of $\tilde{w}_i$.

Assumption 3.1.4-(i) ensures consistency of the regression function estimator in the SVR
step by Theorem 12 of [91]. Assumption 3.1.4-(ii) ensures consistency of the weight
estimators in the output and other hidden layers. For the Lasso penalty, we can set
$P_{\lambda_{n,i}^{(t)}}(\tilde{w}_i) = \lambda_{n,i}^{(t)} \|\tilde{w}_i\|_1$ and $\lambda_{n,i}^{(t)} = O(\sqrt{\log(m_{i-1})/n})$ for any $t \in \{1, 2, \dots, T\}$ and $i \in \{2, 3, \dots, h+1\}$.
Since $\Theta$ is bounded as assumed in Assumption 3.1.2-(i), Assumption 3.1.4-(ii) is satisfied. In
summary, we have the following theorem, which is essentially a restatement of Theorem 4 and
Corollary 3 of [33] and whose proof is therefore omitted.

Theorem 3.1.3. (Consistency) Suppose that Assumptions 3.1.1-3.1.4 hold and, further, the
general regularity conditions on missing data (given in [33]) hold. Then for sufficiently large
n, sufficiently large T, and almost every (X, Y)-sequence, $\|\hat{\theta}_n^{(T)} - \theta^*\| \overset{p}{\to} 0$ and
$\|\frac{1}{T}\sum_{t=1}^{T} \hat{\theta}_n^{(t)} - \theta^*\| \overset{p}{\to} 0$. In addition, for any Lipschitz continuous function $\zeta(\cdot)$ on $\Theta$,
$\|\frac{1}{T}\sum_{t=1}^{T} \zeta(\hat{\theta}_n^{(t)}) - \zeta(\theta^*)\| \overset{p}{\to} 0$.
As implied by Theorems 3.1.1-3.1.3 and Lemma 3.1.2, $\hat{\theta}_n^{(t)}$ asymptotically converges to
a global optimum of the KNN. For each $\hat{\theta}_n^{(t)}$, when making predictions, one can simply
calculate the output in (3.5) by ignoring the auxiliary noise, i.e., treating $\hat{\theta}_n^{(t)}$ as the weights
of a KNN. In this way, K-StoNet can be viewed as a tool for training the KNN, although it
means more than that.

3.1.4 Hyperparameter Setting

As mentioned previously, a DNN is often over-parameterized to avoid getting trapped in
a poor local minimum. In contrast, as implied by Theorems 3.1.1-3.1.3, the local minimum
trap is no longer an issue for K-StoNet. This, together with the universal approximation
property of K-StoNet and the parsimony principle of statistical modeling, suggests that a
small K-StoNet might work well for complex problems. As shown in Section 3.3, the K-
StoNet with a single hidden layer and a small number of hidden units works well for many
complex datasets.
Other than the network structure, the performance of K-StoNet also depends on the
network hyperparameters as well as the hyperparameters introduced by the IRO algorithm.
The former include Cn , εn and σn,k ’s for k = 2, . . . , h+1. The latter include the learning rates
and iteration number used in HMC backward imputation and the regularization parameters
used in solving the optimizations (3.14) and (3.15). The hyperparameters $C_n$, $\varepsilon_n$ and $\sigma_{n,k}$'s
control the variation of the latent variables and thus the variation of $\theta_n^{(T)}$ by the theory
developed in [33] and [103]. In general, setting the latent variables to have slightly large
variations can facilitate the convergence of the training process. On the other hand, as
required by Assumption 3.1.2-(v), we need to keep the variations of the latent variables
sufficiently small to ensure the convergence of K-StoNet to a global minimum of the
corresponding KNN, given the stochastic optimization nature of the IRO algorithm.
Assumption 3.1.2-(v) provides a clue for setting the network hyperparameters. Here we
would like to note that when $1/C_n$ and $\sigma_{n,i}^2$'s are set to be very small, to ensure the stability
of the algorithm, we typically need to adjust the learning rates $\epsilon_{t,i}$ to be very small as
well such that their effects on the drift term of (3.13) can be canceled or partially canceled.
Meanwhile, to compensate for the negative effect of the reduced learning rate on the mobility
of the Markov chain, we need to lengthen the MCMC iterations, i.e., increase the value of
$t_{HMC}$, appropriately. Finally, we note that setting $\sigma_{n,i}$'s in the monotonic pattern $\sigma_{n,h+1} \ge
\sigma_{n,h} \ge \cdots \ge \sigma_{n,2} \ge 1/C_n$ is generally unnecessary, as long as their values are in a
reasonable range.

In our experience, the performance of K-StoNet is not very sensitive to these hyperpa-
rameters as long as they are set in an appropriate range. As shown in Section 3.5, which
collects all parameter settings of K-StoNet used in this chapter, many examples share the
same parameter setting.

3.2 Illustrative Examples

This section contains two examples. The first example demonstrates that K-StoNet in-
deed avoids local traps in training, and the second example demonstrates the performance
of K-StoNet in the large-n-small-p scenario that DNN typically works in. K-StoNet is com-
pared with DNN and KNN. For the KNN, the kernel representer given by equation (3.4) is
used as the first hidden layer, and the kernel is set to be the same as that used by K-StoNet.

3.2.1 A full row rank example

The dataset was generated from a two-hidden-layer neural network with structure 1000-
5-5-1. The input variables $x_1, \dots, x_{1000}$ were generated by independently simulating the
variables $e, z_1, \dots, z_{1000}$ from the standard Gaussian distribution and then setting $x_i = \frac{e + z_i}{\sqrt{2}}$.
In this way, all the input variables are mutually correlated with a correlation coefficient of
0.5. The response variable was generated by setting

y = w3 tanh(w2 tanh(w1 x)) + ϵ, (3.16)

where $w_1 \in \mathbb{R}^{5 \times 1000}$, $w_2 \in \mathbb{R}^{5 \times 5}$ and $w_3 \in \mathbb{R}^{1 \times 5}$ represent the weights at different layers
of the neural network, $\tanh(\cdot)$ is the hyperbolic tangent function, and the random error
$\epsilon \sim N(0, 1)$. Each element of the $w_i$'s was randomly sampled from the set $\{-2, -1, 1, 2\}$. The
full dataset consisted of 1000 training samples and 1000 test samples.
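The input construction $x_i = (e + z_i)/\sqrt{2}$ gives each input unit variance and a pairwise covariance of $\mathrm{Var}(e)/2 = 0.5$. The sketch below checks this empirically with the standard library only. It uses a small number of inputs and a large number of draws purely for illustration (the experiment itself uses 1000 inputs), and the seed and function names are our own.

```python
import math
import random


def simulate_inputs(n, p, rng):
    """Draw n samples of x_i = (e + z_i) / sqrt(2), i = 1..p, with e, z_i iid N(0, 1)."""
    rows = []
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)  # shared component, one draw per sample
        rows.append([(e + rng.gauss(0.0, 1.0)) / math.sqrt(2.0) for _ in range(p)])
    return rows


def sample_corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


rng = random.Random(0)
rows = simulate_inputs(20000, 3, rng)  # p = 3 here; the text uses p = 1000
cols = list(zip(*rows))
corr_01 = sample_corr(cols[0], cols[1])
corr_12 = sample_corr(cols[1], cols[2])
```

Both sample correlations should be close to 0.5, up to Monte Carlo error of order $1/\sqrt{n}$.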
We first refit the model (3.16) using SGD. Since the training samples form a full row rank
matrix of size n = 1000 by p = 1001 (including the bias term), SGD will not get trapped
into a local minimum by Lemma 3.1.1. SGD was run for 2000 epochs with a mini-batch size
of 100 and a constant learning rate of 0.005. Figure 3.2 (upper panel) indicates that SGD
indeed converges to a global optimum.
For K-StoNet, we tried a model with one hidden layer and 5 hidden units. The model
was trained by IRO for 40 epochs. Since all training samples were used at each iteration,
an iteration is equivalent to an epoch for K-StoNet. We also tried a KNN model for this
example, which has the same structure as K-StoNet. The KNN was trained using SGD with
a constant learning rate of 0.005 for 2000 epochs. Figure 3.2 (upper panel) compares the
training and testing MSE paths of the three models. It shows that K-StoNet converges to
the global optimum in a few epochs, while DNN needs over 100 epochs and KNN needs
even more. More importantly, K-StoNet is less affected by over-fitting: its prediction
performance remains stable after convergence has been reached. In contrast, the DNN tends
to over-fit, and its prediction error keeps increasing as training goes on. The KNN is more
stable than DNN in prediction, but worse than K-StoNet. As discussed in Section 3.1.2, the
KNN with the kernel representer for the first hidden layer is not equivalent to K-StoNet in
general. This experiment further demonstrates the importance of the stochastic structure
introduced in K-StoNet.
To explore the loss surface of the regularized DNN, we re-trained the true DNN model
(3.16) using SGD with a Lasso penalty (λ = 0.1) imposed on all the weights. SGD was
run for 500 epochs with a mini-batch size of 100 and a constant learning rate of 0.005. The
run was repeated 10 times. For comparison, K-StoNet was also retrained 10 times.
Their convergence paths are shown in the lower panel of Figure 3.2. The comparison shows
that the regularized DNN might suffer from local traps (different runs converged to different
MSE values), while K-StoNet does not although its RO step also involves penalty terms.
According to the theory developed in [33], the convergence of the IRO algorithm requires a
consistent estimate of $\theta_*^{(t)}$ to be obtained at each RO step, and an appropriate penalty
term is allowed for obtaining the consistent estimate. For K-StoNet, to ensure that a consistent
estimate is obtained at each parameter updating step, we impose an L2 penalty on the SVR layer
and a Lasso penalty on the regression layers. For both of them, the resulting loss functions
are convex, and the corresponding consistent (optimal) estimates are uniquely determined.

[Figure 3.2 plots. Upper panel, "MSE Path": MSE versus number of epochs (0 to 200) for K-StoNet, DNN, and KNN, training and testing. Lower panel, "Lowest Ever Training MSE Path Over 10 Runs": MSE versus epoch (0 to 500) for K-StoNet and DNN.]
Figure 3.2. Upper Panel: paths of the mean squared error (MSE) produced
by K-StoNet and an unregularized DNN for one simulated dataset; and Lower
Panel: best MSE (by the current epoch) produced by SGD for a regularized
DNN and K-StoNet over 10 runs.

Then, by Theorems 3.1.1–3.1.3 and Lemma 3.1.2, the convergence of K-StoNet to the global
optimum is asymptotically guaranteed.
In the previous example, the data was generated from a DNN model. Even working
with the true structure, DNN is still inferior to K-StoNet in training and prediction. To
further demonstrate the advantage of K-StoNet, we generated a dataset from a KNN model.
The dataset consisted of 5000 training samples. The input vector x ∈ R5 had standard
Gaussian components with a mutual correlation coefficient of 0.5. Let
$k = (K(x^{(1)}, x), \ldots, K(x^{(5000)}, x))^T \in \mathbb{R}^{5000}$, where K(·, ·) is the RBF kernel and
$x^{(i)}$ denotes the ith training sample. The response variable was generated by

y = w2 tanh(w1 k + b1 ) + b2 + ϵ,

where w2 ∈ R1×5 , w1 ∈ R5×5000 , b1 ∈ R5 , b2 ∈ R, and ϵ ∼ N (0, 1). The components of
w2 and b2 were randomly generated from N (0, 1), and w1 and b1 were the dual parameters
of five SVR models with the above training samples as input and some vectors randomly
generated from N (0, I5 ) as response. We set C = 5 and ε = 0.01 for the SVR model. We
also generated another 5000 samples from the same model as test data. Then we modeled
the data by K-StoNet with one hidden layer, for which we tried the cases with 5 hidden
units and 10 hidden units and set C = 5 and ε = 0.01 for each SVR. For comparison, we
tried a KNN with 5 hidden units, a DNN with one hidden layer and 50 hidden units, and a
DNN with 3 hidden layers and 50 hidden units on each layer. Figure 3.3 shows the training
and testing paths of the five models. For this example, K-StoNet achieved a training MSE
of about 1.0 and significantly outperformed the DNN and KNN models in prediction.

3.2.2 A measurement error example

This example mimics the typical scenario under which DNN works. We generated 500
training samples and 500 test samples from a nonlinear regression model. Each sample (Y, X)
consists of a response Y ∈ R and explanatory variables X = (X1 , . . . , X5 ) ∈ R5 . The
explanatory variables X1 , . . . , X5 were generated such that each follows the standard
Gaussian distribution, while they are mutually

[Figure 3.3 plots. Left, "Training MSE Path"; right, "Testing MSE Path": MSE versus number of epochs (0 to 1000) for K-StoNet with 5 units, K-StoNet with 10 units, KNN with 5 units, DNN with 1 hidden layer, and DNN with 3 hidden layers.]

Figure 3.3. MSE paths produced by two K-StoNets, one KNN, and two
DNNs for the data generated from a KNN model: the left plot is for training
and the right plot is for testing.

correlated with a correlation coefficient of 0.5. The response variable was generated from
the nonlinear regression

\[ Y = \frac{5X_2}{1 + X_1^2} + 5\sin(X_3 X_4) + 2X_5 + \epsilon, \]

where ϵ ∼ N (0, 1). Then each explanatory variable was perturbed by adding a random
measurement error independently drawn from N (0, 0.5).
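
The simulation design above can be sketched as follows. This is a hedged illustration rather than the original experiment code: the function name `simulate_measurement_error_data` is hypothetical, the shared-factor construction of the correlated inputs is an assumed way of obtaining a mutual correlation of 0.5, and the second argument of N(0, 0.5) is read as a variance, which the source does not state explicitly.

```python
import numpy as np

def simulate_measurement_error_data(n, error_var=0.5, seed=0):
    """Nonlinear regression data with covariates observed with error (sketch)."""
    rng = np.random.default_rng(seed)
    # Assumed construction: shared factor gives unit variance and correlation 0.5.
    e = rng.standard_normal((n, 1))
    z = rng.standard_normal((n, 5))
    x = (e + z) / np.sqrt(2.0)
    x1, x2, x3, x4, x5 = x.T
    y = (5.0 * x2 / (1.0 + x1 ** 2) + 5.0 * np.sin(x3 * x4) + 2.0 * x5
         + rng.standard_normal(n))
    # Only the perturbed covariates are made available to the model.
    x_obs = x + rng.normal(0.0, np.sqrt(error_var), size=x.shape)
    return x_obs, y

x_train, y_train = simulate_measurement_error_data(500, seed=1)
x_test, y_test = simulate_measurement_error_data(500, seed=2)
```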
We modeled the data using two different K-StoNets, one with 1-hidden layer and 5 hidden
units, and the other with 3-hidden layers and 20 hidden units on each hidden layer. Both
models were trained by IRO for 1000 epochs. For comparison, we also modeled the data by
KNNs and DNNs with the same structures as the K-StoNets. The KNNs and DNNs were
trained by SGD with momentum for 1000 epochs with a minibatch size of 100, a constant
learning rate of 0.005, and a momentum decay factor of 0.9. As shown in Figure 3.4, the 1-
hidden layer DNN and KNN perform stably in both training and testing, while the 3-hidden
layer DNN and KNN are obviously over-fitted. Compared to the DNN and KNN, K-StoNet
is resistant to over-fitting, even when an overly large model is employed.
Finally, we explored the sparsity of the SVR layer by varying the value of ε defined in
(3.6). Table 3.1 shows the number of support vectors selected by the two K-StoNets, together
with their training and test errors, at different values of ε. It shows that the sparsity of
K-StoNet can be controlled by ε, with a larger ε leading to fewer support vectors. However,
the two K-StoNet models show different sensitivities to ε. The 3-hidden layer K-StoNet has a
higher representation power and is more flexible; it can achieve a relatively low training
error with a large number of connections, and is more sensitive to ε. When ε increases from
0.01 to 0.09, it changes from an overfitted model to an underfitted model. Correspondingly,
the training MSE increases, while the test MSE first decreases and then starts to increase. In
contrast, the 1-hidden layer K-StoNet has limited representation power, and it is less
flexible and thus less sensitive to ε. It led to about the same model for different choices
of ε (with similar training and test errors). The training and test errors varied slightly with
ε, as different sets of support vectors were used for different choices of ε. In general, the set

[Figure 3.4 plots. Two "MSE Path" panels: MSE versus number of epochs (0 to 1000) for K-StoNet, DNN, and KNN, training and testing.]

Figure 3.4. MSE paths produced by K-StoNets and DNNs: (upper) one-
hidden-layer networks; (lower) three-hidden-layer networks.

of support vectors used for a large ε is not nested within that for a small ε. Therefore, a
smaller ε does not necessarily lead to a smaller training error.

Table 3.1. Performance of the K-StoNet model with different values of ε,
where the model was evaluated at the last iteration, #SV represents the average
number of support vectors selected by the SVRs at the first hidden layer,
and the number in the parentheses represents the standard deviation of the
average.

        1-Hidden Layer                           3-Hidden Layer
ε       #SV            Train MSE   Test MSE      #SV            Train MSE   Test MSE
0.01    473.6(9.22)    6.6786      9.0852        409.0(6.419)   4.6552      10.1387
0.02    428.6(18.73)   6.7382      9.0387        320.2(13.85)   4.7257      9.9422
0.03    416.2(16.59)   6.6917      9.0531        235.8(7.94)    4.9850      9.8192
0.04    381.0(29.75)   6.7301      9.0742        165.4(8.96)    5.2578      9.3949
0.05    361.4(45.85)   6.5158      9.3358        113.2(7.83)    5.6594      9.0195
0.06    342.6(28.35)   6.5041      9.1498        69.6(4.50)     6.1869      8.7551
0.07    347.2(13.39)   6.4878      9.1893        37.8(3.60)     6.8884      9.0565
0.08    313.6(38.52)   6.5083      9.0639        17.6(3.61)     7.8616      9.5882
0.09    317.0(14.59)   6.4130      9.0330        16.2(16.45)    9.1655      9.9940
0.1     290.2(40.45)   6.4676      9.0308        8.2(13.47)     10.5946     10.7611

3.3 Real Data Examples

This section shows that a small K-StoNet can work well for a variety of problems. The
example in Section 3.3.1 has a high dimension, which represents typical problems that sup-
port vector machine/regression works on. The examples in Sections 3.3.2 and 3.3.3 have
large training sample sizes, which represent typical problems that the DNN works on. The
examples in Section 3.3.4 represent more real world problems, with which we explore the
prediction performance of K-StoNet.

3.3.1 QSAR Androgen Receptor

The QSAR androgen receptor dataset is available at the UCI machine learning repository. It
consists of 1024 binary attributes (molecular fingerprints) used to classify 1687 chemicals
into 2 classes (binder to androgen receptor/positive, non-binder to androgen receptor/negative),
i.e., n = 1687 and p = 1024. The experiment was conducted with 5-fold cross-validation.

In each fold of the experiment, we modeled the data by a K-StoNet with one hidden
layer and 5 hidden units, and trained the model by IRO for 40 epochs. The prediction was
computed by averaging over the models generated in the last 10 epochs. For comparison,
support vector machine (SVM), KNN and DNN were applied to this example. For SVM,
we employed the RBF kernel with C = 1. For KNN, we used the same structure as K-
StoNet. For DNN, we tried two network structures, 1024-5-1 and 1024-10-5-1, which are
called DNN one layer and DNN two layer, respectively. Each of the KNN and DNN models
was trained by SGD for 1000 epochs with a mini-batch size of 32 and a constant learning
rate of 0.001. The weights of the DNNs were subject to the Lasso penalty with the
regularization parameter λ = 1e-4.

[Figure 3.5 plots. Left, "Training Accuracy Path"; right, "Test Accuracy Path": accuracy versus number of epochs (0 to 1000) for K-StoNet, KNN, DNN one layer, and DNN two layer.]

Figure 3.5. Training and prediction accuracy paths (along with epochs) pro-
duced by K-StoNet, KNN and DNN in one fold of the cross-validation experi-
ment for the QSAR androgen receptor data.

Figure 3.5 compares the training and prediction accuracy paths produced by K-StoNet,
KNN and DNN in one fold of the experiment. Table 3.2 summarizes the training and
prediction accuracy produced by K-StoNet, SVM, KNN and DNN over the five folds. In
summary, K-StoNet converges very fast to the global optimum with the training accuracy
close to 1, and is less bothered by the over-fitting issue. In contrast, the KNN and DNN

[Figure 3.6 plots. Left, "Training Accuracy Path"; right, "Test Accuracy Path": accuracy versus training time in seconds (0 to 25) for K-StoNet, KNN, DNN one layer, and DNN two layer.]

Figure 3.6. Training and prediction accuracy paths (along with computa-
tional time) produced by K-StoNets, KNN and DNNs in one fold of the cross-
validation experiment for the QSAR androgen receptor data.

models took more epochs to converge and predicted less accurately than K-StoNet. SVM is
inferior to K-StoNet in both training and prediction.
Since each iteration of the IRO algorithm involves imputing latent variables and solving
a series of SVR/linear regressions, it is more expensive than a single gradient update step
used in DNN training. To compare their computational efficiency, we include an accuracy
versus time plot in Figure 3.6, which indicates that K-StoNet took less computational time
than KNN and DNN to achieve the same training/prediction accuracy.

3.3.2 MNIST Data

MNIST [104] is a benchmark dataset in machine learning. It consists of 60,000
images for training and 10,000 images for testing. We modeled the data by a K-StoNet with
one hidden layer, 20 hidden units, and the softplus activation function. We trained the model
by IRO for 6 epochs. For comparison, we trained a standard LeNet-300-100 model [104] by
Adam [105] with default parameters for 300 epochs, a constant learning rate of 0.001, and
a mini-batch size of 128. Figure 3.7 shows the training paths of the two models. Both models

Table 3.2. Training and prediction accuracy (%) for QSAR androgen receptor
data, where “T” and “P” denote the training and prediction accuracy, respectively.

Method               Split 1   Split 2   Split 3   Split 4   Split 5   Average
K-StoNet        T    99.93     99.85     99.85     100       100       99.926
                P    88.43     93.47     90.50     90.50     91.99     90.978
SVM             T    97.11     96.89     97.11     97.11     97.19     97.082
                P    89.32     90.80     87.83     89.02     92.28     89.850
KNN             T    99.93     100       98.44     99.93     99.93     99.646
                P    88.72     93.17     89.32     90.50     91.10     90.562
DNN one layer   T    99.70     99.63     99.85     99.78     99.85     99.762
                P    85.16     88.72     89.32     86.35     88.13     87.536
DNN two layer   T    99.93     99.93     99.93     100       100       99.958
                P    86.94     88.13     88.13     86.05     88.72     87.596

can achieve 100% training accuracy. LeNet-300-100 achieved 98.38% test accuracy, while
K-StoNet achieved 98.87% test accuracy (at the 3rd iteration) without data augmentation
being used in training!

[Figure 3.7 plot. "Accuracy Path": accuracy versus number of epochs (0 to 300) for K-StoNet and DNN, training and testing.]

Figure 3.7. Training and test accuracy versus epochs produced by K-StoNet
and DNN (LeNet-300-100) for the MNIST data, where K-StoNet achieved a
prediction accuracy of 98.87%, and LeNet-300-100 achieved a prediction accu-
racy of 98.38%.

3.3.3 CoverType Data

The CoverType data is available at the UCI machine learning repository. It consists of
n = 581,012 samples with p = 54 attributes, which were collected for classification of forest
cover types from cartographic variables. This dataset has an extremely large sample size,
which represents a typical problem that the DNN works on. We used half of the samples for
training and the other half for testing. The experiments were repeated three times.
We modeled the data by a K-StoNet with one hidden layer and 50 hidden units, and
trained the model by the IRO algorithm for 2 epochs. For comparison, we also modeled
the data by a 2-hidden-layer DNN with 1000 nodes on the first hidden layer and 50 nodes
on the second hidden layer. We trained the DNN model by SGD with momentum, where
a mini-batch size of 500, a constant learning rate of 0.01 and a momentum decay factor of

0.9 were used. The numerical results are summarized in Table 3.3, which indicates that
K-StoNet outperforms DNN in both training and prediction for this example.

Table 3.3. Training and prediction accuracy (%) for CoverType, where “T”
and “P” denote the training and prediction accuracy, respectively.

Method          Run 1   Run 2   Run 3   Average
K-StoNet   T    99.22   99.32   99.33   99.29
           P    94.21   94.23   94.26   94.23
DNN        T    98.20   98.11   98.07   98.13
           P    94.11   94.06   94.04   94.07

3.3.4 More UCI Datasets

As shown by the above examples, K-StoNet can converge in only a few epochs and is
less affected by overfitting, and its prediction accuracy is typically similar to or better than
the best one that DNN achieved. In order to achieve the best prediction accuracy, DNN
often needs to be trained with tricks such as early stopping or Dropout [26], which lack
a theoretical guarantee for the down-stream statistical inference. In contrast, K-StoNet
possesses a theoretical guarantee to asymptotically converge to the global optimum and
enables the prediction uncertainty to be easily assessed (see Section 3.4). This subsection
compares K-StoNet with Dropout in prediction on more real-world examples, namely 10 datasets
taken from the UCI machine learning repository.
Following the setting of [106], we randomly split each dataset into training and test sets
with 90% and 10% of its samples, respectively. The random split was repeated 20 times.
The average prediction accuracy and its standard deviation were reported. As in [106], for
the largest two datasets Protein Structure and Year Prediction MSD, the random splitting
was done five times and one time, respectively. The baseline results were taken from [106]
and [107]. The neural network model used there had one hidden layer and 50 hidden units
for all datasets except for the largest two. For the largest two datasets, the neural network
model had one hidden layer and 100 hidden units. The K-StoNet model we used had one
hidden layer with 5 hidden units and the softplus activation function for all datasets except

for the largest one. For the largest dataset, K-StoNet had one hidden layer with 50 hidden
units. Other parameter settings were given in Section 3.5. The results were summarized in
Table 3.4, which indicates that K-StoNet generally outperforms Dropout in prediction.
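
The repeated random-splitting protocol described above can be sketched with the Python standard library. This is a minimal sketch under assumptions: the function name `repeated_split_scores` and the callback `fit_and_score` (which is expected to train a model on the training indices and return a test score such as the RMSE) are hypothetical and stand in for whatever training routine is used.

```python
import random
from statistics import fmean, stdev

def repeated_split_scores(n_samples, fit_and_score, n_repeats=20, test_frac=0.1, seed=0):
    """Repeat a random train/test split and summarize the resulting scores."""
    rng = random.Random(seed)
    n_test = round(test_frac * n_samples)
    scores = []
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)                          # new random permutation per repeat
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        scores.append(fit_and_score(train_idx, test_idx))
    mean = fmean(scores)
    std_error = stdev(scores) / len(scores) ** 0.5   # standard error of the mean
    return mean, std_error

# Toy usage: the "score" is simply the fraction of samples held out.
mean, se = repeated_split_scores(100, lambda tr, te: len(te) / (len(tr) + len(te)))
```

Setting `n_repeats` to 5 or 1 would correspond to the reduced number of splits used for the two largest datasets; with a single split no standard error is available, so `stdev` would need to be skipped in that case.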
For a thorough comparison, the KNN has also been implemented for the UCI datasets
except for two large ones, “Protein Structure” and “Year Prediction MSD”. For these two
datasets, the training sample size n is too large, making the gram matrix hard to handle.
For the same reason, it is not included in the comparisons for the MNIST and CoverType
data examples, either. For the KNN, the use of minibatch data is not very helpful when
n is large, as there are still n kernels (K(x(1) , x∗ ), K(x(2) , x∗ ), . . . , K(x(n) , x∗ )) we need to
evaluate for each sample x∗ in the minibatch. For the other datasets, the KNN was run with
the same network structure as for the K-StoNet. The detailed parameter settings were given
in Section 3.5. The results are summarized in Table 3.4, which indicates that the KNN is
generally inferior to K-StoNet in prediction.

Table 3.4. Average test RMSE (and its standard error) by variational
inference (VI, [108]), probabilistic back-propagation (PBP, [107]), dropout
(Dropout, [106]), SGD via back-propagation (BP), and KNN, where N denotes
the dataset size and p denotes the input dimension. For each dataset,
the boldfaced values are the best result or the second best result if it is
insignificantly different from the best one according to a t-test with a
significance level of 0.05.

Dataset               N         p    VI            PBP             Dropout       BP              KNN               K-StoNet
Boston Housing        506       13   4.32 ±0.29    3.014 ±0.1800   2.97 ±0.19    3.228 ±0.1951   4.196 ±0.069      2.987 ±0.0227
Concrete Strength     1,030     8    7.19 ±0.12    5.667 ±0.0933   5.23 ±0.12    5.977 ±0.2207   6.962 ±0.062      5.261 ±0.0265
Energy Efficiency     768       8    2.65 ±0.08    1.804 ±0.0481   1.66 ±0.04    1.098 ±0.0738   1.942 ±0.030      1.301 ±0.015
Kin8nm                8,192     8    0.10 ±0.00    0.098 ±0.0007   0.10 ±0.00    0.091 ±0.0015   0.0917 ±0.0002    0.0747 ±0.0003
Naval Propulsion      11,934    16   0.01 ±0.00    0.006 ±0.0000   0.01 ±0.00    0.001 ±0.0001   0.0151 ±0.0001    0.00098 ±0.0001
Power Plant           9,568     4    4.33 ±0.04    4.124 ±0.0345   4.02 ±0.04    4.182 ±0.0402   4.033 ±0.010      3.952 ±0.003
Protein Structure     45,730    9    4.84 ±0.03    4.732 ±0.0130   4.36 ±0.01    4.539 ±0.0288   na                3.856 ±0.005
Wine Quality Red      1,599     11   0.65 ±0.01    0.635 ±0.0079   0.62 ±0.01    0.645 ±0.0098   0.675 ±0.004      0.6214 ±0.0008
Yacht Hydrodynamics   308       6    6.89 ±0.67    1.015 ±0.0542   1.11 ±0.09    1.182 ±0.1645   7.5334 ±0.0893    0.8560 ±0.0795
Year Prediction MSD   515,345   90   9.034 ±na     8.879 ±na       8.849 ±na     8.932 ±na       na                8.881 ±na

3.4 Prediction Uncertainty Quantification with K-StoNet

3.4.1 A Recursive Formula for Uncertainty Quantification

The prediction uncertainty of the K-StoNet can be easily assessed with the variance
decomposition formula (also known as Eve’s law) based on the asymptotic normality theory.
More precisely, we can first calculate the variance for the output of the first hidden layer
based on the existing theory of SVR [109], then calculate the variance for the output of the
second hidden layer based on Eve’s law and the theory of linear models, and continue this
process until the output layer is reached. For the case that normal regression was done at
layer h + 1 and no penalty was used in solving the optimization (3.15), the calculation is
detailed as follows.
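Let $Y_i^{(t)} \in \mathbb{R}^{n \times m_i}$, $i = 1, 2, \ldots, h$, denote the matrices of latent variables imputed at
iteration $t$, which leads to the updated parameter estimate $\theta_n^{(t)}$. Let $z$ denote a test sample.
For each layer $i \in \{1, 2, \ldots, h+1\}$, let $Z_i^{(t)} \in \mathbb{R}^{m_i}$ denote the output of the KNN (with the
parameter $\theta_n^{(t)}$) at layer $i$; and let $\mu_i^{(t)}$ and $\Sigma_i^{(t)}$ denote the mean and covariance matrix of
$Z_i^{(t)}$, respectively. Assume that the $Z_i^{(t)}$’s are all multivariate Gaussian, which will be justified
below. Then, for any layer $i \in \{2, \ldots, h+1\}$, by Eve’s law, we have

\[
\begin{aligned}
\Sigma_i^{(t)} &= E\big(\mathrm{Var}(Z_i^{(t)} \mid Z_{i-1}^{(t)})\big) + \mathrm{Var}\big(E(Z_i^{(t)} \mid Z_{i-1}^{(t)})\big) \\
&= E\Big[ (\psi(Z_{i-1}^{(t)}))^T \big[(\psi(Y_{i-1}^{(t)}))^T \psi(Y_{i-1}^{(t)})\big]^{-1} \psi(Z_{i-1}^{(t)}) \Big] \mathrm{diag}\{\sigma_{i,1}^{2(t)}, \ldots, \sigma_{i,m_i}^{2(t)}\} + \mathrm{Var}\big(\tilde{w}_{i-1}^{*} \psi(Z_{i-1}^{(t)})\big) \\
&= \Big[ \mathrm{tr}\Big( \big[(\psi(Y_{i-1}^{(t)}))^T \psi(Y_{i-1}^{(t)})\big]^{-1} \mathrm{Var}(\psi(Z_{i-1}^{(t)})) \Big) + \big(E(\psi(Z_{i-1}^{(t)}))\big)^T \big[(\psi(Y_{i-1}^{(t)}))^T \psi(Y_{i-1}^{(t)})\big]^{-1} E(\psi(Z_{i-1}^{(t)})) \Big] \\
&\quad \times \mathrm{diag}\{\sigma_{i,1}^{2(t)}, \ldots, \sigma_{i,m_i}^{2(t)}\} + \tilde{w}_{i-1}^{*} \mathrm{Var}(\psi(Z_{i-1}^{(t)})) (\tilde{w}_{i-1}^{*})^T,
\end{aligned}
\]

where $\tilde{w}_{i-1}^{*} = E(\tilde{w}_{i-1}^{(t)})$ and the $\sigma_{i,j}^{2(t)}$’s are unknown. Let $\mu_{i-1}^{(t)} = (\mu_{i-1,1}^{(t)}, \ldots, \mu_{i-1,m_{i-1}}^{(t)})^T$ and
$D_{\psi'}(\mu_{i-1}^{(t)}) = \mathrm{diag}\{\psi'(\mu_{i-1,1}^{(t)}), \ldots, \psi'(\mu_{i-1,m_{i-1}}^{(t)})\}$. By the first order Taylor expansion, it is
easy to derive that

\[
\begin{aligned}
E(\psi(Z_{i-1}^{(t)})) &\approx \psi(\mu_{i-1}^{(t)}), \\
\mathrm{Var}(\psi(Z_{i-1}^{(t)})) &\approx D_{\psi'}(\mu_{i-1}^{(t)}) \Sigma_{i-1}^{(t)} D_{\psi'}(\mu_{i-1}^{(t)}).
\end{aligned}
\]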
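
The first-order Taylor (delta-method) approximations above can be checked numerically. The sketch below is illustrative only: it assumes the softplus activation for ψ, uses an arbitrary small covariance matrix, and compares the approximation with a Monte Carlo estimate; the helper name `delta_method_moments` is hypothetical.

```python
import numpy as np

def softplus(u):
    return np.logaddexp(0.0, u)               # log(1 + exp(u)), numerically stable

def softplus_grad(u):
    return 1.0 / (1.0 + np.exp(-u))           # derivative of softplus

def delta_method_moments(mu, sigma, psi, psi_grad):
    """Approximate E(psi(Z)) and Var(psi(Z)) for Z ~ N(mu, sigma)."""
    d = np.diag(psi_grad(mu))                 # D_{psi'}(mu)
    return psi(mu), d @ sigma @ d

mu = np.array([0.5, -1.0])
sigma = 0.01 * np.array([[1.0, 0.5], [0.5, 1.0]])
mean_approx, cov_approx = delta_method_moments(mu, sigma, softplus, softplus_grad)

# Monte Carlo reference for comparison.
rng = np.random.default_rng(0)
draws = softplus(rng.multivariate_normal(mu, sigma, size=200_000))
mean_mc, cov_mc = draws.mean(axis=0), np.cov(draws, rowvar=False)
```

The approximation is accurate when the spread of Z is small relative to the curvature of ψ; for larger variances the neglected second-order terms become noticeable.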

We suggest estimating $\tilde{w}_{i-1}^{*}$ by $\tilde{w}_{i-1}^{(t)}$, $\mu_{i-1}^{(t)}$ by $Z_{i-1}^{(t)}$, and $\sigma_{i,j}^{2(t)}$ by its OLS
estimator from the corresponding multiple regression. This leads to the following recursive
formula for covariance estimation:

\[
\begin{aligned}
\hat{\Sigma}_i^{(t)} \approx {} & \Big[ \mathrm{tr}\Big( \big[(\psi(Y_{i-1}^{(t)}))^T \psi(Y_{i-1}^{(t)})\big]^{-1} D_{\psi'}(Z_{i-1}^{(t)}) \hat{\Sigma}_{i-1}^{(t)} D_{\psi'}(Z_{i-1}^{(t)}) \Big) \\
& + (\psi(Z_{i-1}^{(t)}))^T \big[(\psi(Y_{i-1}^{(t)}))^T \psi(Y_{i-1}^{(t)})\big]^{-1} \psi(Z_{i-1}^{(t)}) \Big] \mathrm{diag}\{\hat{\sigma}_{i,1}^{2(t)}, \ldots, \hat{\sigma}_{i,m_i}^{2(t)}\} \\
& + \tilde{w}_{i-1}^{(t)} D_{\psi'}(Z_{i-1}^{(t)}) \hat{\Sigma}_{i-1}^{(t)} D_{\psi'}(Z_{i-1}^{(t)}) (\tilde{w}_{i-1}^{(t)})^T,
\end{aligned}
\tag{3.17}
\]

for $i = 2, 3, \ldots, h+1$. For the SVR layer, the asymptotic normality of $Z_1^{(t)}$ can be
justified based on [109], which gives a Bayesian interpretation to the classical SVR with
an ε-insensitive loss function. Let $f(x) = \beta^T \phi(x) + b$ denote an SVR function, and let
$\{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$ denote training samples.
In [109], the authors treated $(f(x^{(1)}), f(x^{(2)}), \ldots, f(x^{(n)}))$ as a random vector subject to a
functional Gaussian process prior, and showed that the posterior of $f(z)$ can be approximated
by a Gaussian distribution with mean $f^*(z)$ and variance

\[ \sigma_f^2(z) = K_{z,z} - K_{X_M,z}^T K_{X_M,X_M}^{-1} K_{X_M,z}, \tag{3.18} \]

where $f^*(\cdot)$ denotes the optimal regression function fitted by SVR, $X_M = \{x^{(s)} : |y^{(s)} - f^{(t)}(x^{(s)})| = \varepsilon\}$
denotes the set of marginal vectors, and $K_{A,B}$ denotes a kernel matrix with
elements formed by variables in $A$ versus variables in $B$. By this result, conditioned on
the training samples, $Z_1^{(t)}$ is approximately Gaussian with the covariance matrix given by
$\mathrm{diag}\{\sigma_{f_1}^{2(t)}(z), \ldots, \sigma_{f_{m_1}}^{2(t)}(z)\}$.
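
As an illustration of (3.18), the variance at a test point can be computed from the kernel evaluations on the marginal vectors. The sketch below makes several assumptions that are not in the source: the RBF kernel with a user-supplied γ, a small diagonal jitter for numerical stability, and hypothetical function names; identifying the marginal vectors from a fitted SVR is left to the caller.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel exp(-gamma * ||a_i - b_j||^2) for rows of a and b."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def svr_posterior_variance(z, x_marginal, gamma, jitter=1e-10):
    """Evaluate K_zz - K_Mz^T K_MM^{-1} K_Mz for a single test point z."""
    z = np.atleast_2d(z)
    k_zz = rbf_kernel(z, z, gamma)[0, 0]              # equals 1 for the RBF kernel
    k_mz = rbf_kernel(x_marginal, z, gamma)[:, 0]
    k_mm = rbf_kernel(x_marginal, x_marginal, gamma)
    k_mm = k_mm + jitter * np.eye(len(x_marginal))    # assumed stabilizer
    return k_zz - k_mz @ np.linalg.solve(k_mm, k_mz)

rng = np.random.default_rng(0)
x_marginal = rng.standard_normal((5, 2))              # stand-in marginal vectors
var_near = svr_posterior_variance(x_marginal[0], x_marginal, gamma=0.5)
var_far = svr_posterior_variance(np.array([100.0, 100.0]), x_marginal, gamma=0.5)
```

In this toy example the variance is close to zero at a marginal vector and close to $K_{z,z} = 1$ far away from all of them, which matches the usual Gaussian-process behaviour of the formula.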

For the output and other hidden layers, we can restrict $m_i \prec n$ for each layer
$i \in \{1, 2, \ldots, h\}$. Then, by the theory of [53], which allows the dimension of the parameters
to diverge with the training sample size, $Z_{i+1}^{(t)}$ is asymptotically Gaussian.
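
One step of the recursion (3.17) can be written compactly in NumPy. This is a sketch under assumptions rather than the implementation used in the experiments: the function and argument names are hypothetical, no intercept column is handled separately, and the activation ψ and its derivative are passed in as callables.

```python
import numpy as np

def propagate_covariance(y_prev, z_prev, sigma_prev, w, noise_var, psi, psi_grad):
    """One step of (3.17): covariance at layer i from the covariance at layer i-1.

    y_prev:     (n, m_prev) imputed latent variables Y_{i-1}
    z_prev:     (m_prev,)   layer i-1 output for the test sample
    sigma_prev: (m_prev, m_prev) estimated covariance of the layer i-1 output
    w:          (m_i, m_prev) estimated regression weights
    noise_var:  (m_i,) OLS residual variance estimates
    """
    a = psi(y_prev)                                   # psi(Y_{i-1})
    gram_inv = np.linalg.inv(a.T @ a)                 # [(psi(Y))^T psi(Y)]^{-1}
    d = np.diag(psi_grad(z_prev))                     # D_{psi'}(Z_{i-1})
    var_psi = d @ sigma_prev @ d                      # approx. Var(psi(Z_{i-1}))
    psi_z = psi(z_prev)
    scale = np.trace(gram_inv @ var_psi) + psi_z @ gram_inv @ psi_z
    return scale * np.diag(noise_var) + w @ var_psi @ w.T

# Toy usage with the tanh activation and random inputs.
rng = np.random.default_rng(0)
y_prev = rng.standard_normal((50, 3))
z_prev = rng.standard_normal(3)
w = rng.standard_normal((2, 3))
noise_var = np.array([0.5, 2.0])
tanh_grad = lambda u: 1.0 - np.tanh(u) ** 2
sigma_next = propagate_covariance(y_prev, z_prev, 0.1 * np.eye(3), w, noise_var,
                                  np.tanh, tanh_grad)
```

Starting from the diagonal covariance of the SVR layer, the function would be applied once per regression layer until layer h + 1 is reached.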
Let $Y_z$ denote the unknown true observation at the test point $z$, and let $\hat{\xi}^{(t)}(z) = Z_{h+1}^{(t)}$
denote its K-StoNet prediction with the parameters $\theta_n^{(t)}$. Then the variance of $Y_z - \hat{\xi}(z)$ can
be approximated by

\[ \widehat{\mathrm{Var}}(Y_z - \hat{\xi}(z)) = \frac{1}{n} \sum_{i=1}^{n} (y^{(i)} - \hat{\xi}^{(t)}(x^{(i)}))(y^{(i)} - \hat{\xi}^{(t)}(x^{(i)}))^T + \hat{\Sigma}_{h+1}^{(t)}, \tag{3.19} \]

based on which the prediction interval for Yz can be constructed at a desired confidence
level. Further, by Theorem 3.1.3, a more accurate confidence interval can be obtained by
averaging over those obtained at different iterations.
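
For a scalar response, the interval construction based on (3.19) can be sketched as follows. The sketch assumes a Gaussian approximation for the prediction error, and the function name `prediction_interval` and its arguments are hypothetical; `sigma_hat` stands for the estimate of $\hat{\Sigma}_{h+1}^{(t)}$ at the test point.

```python
import math
from statistics import NormalDist

def prediction_interval(pred, train_residuals, sigma_hat, level=0.95):
    """Normal-theory prediction interval from (3.19) for a scalar response."""
    n = len(train_residuals)
    mse = sum(r * r for r in train_residuals) / n     # first term of (3.19)
    variance = mse + sigma_hat                        # add the model uncertainty term
    q = NormalDist().inv_cdf(0.5 + level / 2.0)       # about 1.96 for level 0.95
    half_width = q * math.sqrt(variance)
    return pred - half_width, pred + half_width

# Toy usage: residual mean square 1, model variance 3, prediction 2.
lower, upper = prediction_interval(2.0, [1.0, -1.0], 3.0)
```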
The above procedure can be easily extended to the probit/logistic regression (via Wald/end-
point transformation) and the case that an amenable penalty is used in solving (3.15). Refer
to [110] for asymptotic normality of the regularized estimators.

3.4.2 A Numerical Example

To illustrate the above procedure, we generated 100 training datasets and one test dataset
as in Section 3.2.2. Each dataset consisted of 500 samples. Again, we modeled the data using
a K-StoNet with one hidden layer and 5 hidden units. For each training dataset, we trained
the model by IRO for 50 epochs. For each test point z, a 95% prediction interval was
constructed based on each training dataset according to the prediction variance calculated
in (3.19) at the last epoch of the run, and the coverage rate was calculated by averaging
over the coverage status (0 or 1) of the 100 prediction intervals. Further, we averaged the
coverage rates over 500 test points, which produced a mean coverage rate of 93.812% with
standard deviation 0.781%. It is very close to the nominal level 95%! Figure 3.8 shows the
prediction intervals at 20 test points, which were obtained at the last epoch of an IRO run
for a training dataset.
For each test point, we have also constructed a prediction interval based on each training
dataset by averaging those obtained at the last 25 epochs. As a result, the mean coverage
rate over the 500 test points was improved to 94.026% with standard deviation 0.771%.
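
The coverage computation described above can be sketched with plain Python. The function names are hypothetical, and averaging interval endpoints across epochs is an assumed reading of "averaging those obtained at the last 25 epochs"; the source does not spell out the exact averaging rule.

```python
from statistics import fmean

def coverage_rates(lower, upper, truth):
    """Per-test-point coverage; lower[d][j] and upper[d][j] are the interval
    endpoints obtained from training dataset d at test point j."""
    n_datasets = len(lower)
    rates = []
    for j, y in enumerate(truth):
        hits = sum(lower[d][j] <= y <= upper[d][j] for d in range(n_datasets))
        rates.append(hits / n_datasets)
    return rates

def average_interval(lowers, uppers):
    """Average the endpoints of intervals obtained at several epochs (assumed rule)."""
    return fmean(lowers), fmean(uppers)

# Toy usage: two training datasets, two test points.
rates = coverage_rates([[0.0, 0.0], [1.0, 0.0]], [[2.0, 2.0], [3.0, 1.0]], [0.5, 1.5])
mean_rate = fmean(rates)
```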

3.5 Parameter Settings for K-StoNet

In all computations of this chapter except for the CoverType experiments, the RBF
kernel $k(x, x') = \exp(-\gamma \|x - x'\|_2^2)$ is used, where $\gamma$ is set to the default value
$1/(p\,\mathrm{Var}(x))$, $p$ is the dimension of $x$, and $\mathrm{Var}(x)$ is the variance of $x$. We have the
following default values for the parameters: one hidden layer, 5 hidden units,
$C_n = \tilde{C}_{n,1}^{(t)} = 10$ for all $t$, $\varepsilon = 0.01$, $t_{HMC} = 25$, $\alpha = 0.1$, $\sigma_{n,2}^2 = 0.01$, and the
learning rate $\epsilon_{t,i}$ = 5e-4 for all $t$ and $i$. The

[Figure 3.8 plot. Response value Y versus test point index (0 to 19), with legend entries "Confidence Interval", "True Value", and "Predicted Value".]

Figure 3.8. 95% prediction intervals produced by K-StoNet for 20 test points,
where the x-axis indexes the test points, the y-axis represents the response
value, and the blue star represents the true observation.

parameter settings may vary around the default values to achieve better performance for the
K-StoNet model.
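
The default kernel setting can be illustrated as follows. This is a sketch under assumptions: Var(x) is taken to be the variance over all entries of the training input matrix, which coincides with the `gamma="scale"` convention of scikit-learn's SVR; the source does not state which software was used, so this is mentioned only as a possible point of reference. The function names are hypothetical.

```python
import numpy as np

def default_gamma(x):
    """gamma = 1 / (p * Var(x)) for an (n, p) input matrix x (assumed reading)."""
    return 1.0 / (x.shape[1] * x.var())

def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel exp(-gamma * ||a_i - b_j||^2) for rows of a and b."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

x = np.array([[0.0, 1.0], [2.0, 3.0]])
gamma = default_gamma(x)            # p = 2, Var(x) = 1.25
k = rbf_kernel(x, x, gamma)         # 2 x 2 kernel matrix with unit diagonal
```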
Section 3.2.1: Simulated DNN Data. Network: $C_n = 1$ for the SVR layer, $\sigma_{n,2}^2 = 0.001$
for the output layer; HMC imputation: $t_{HMC} = 25$, $\alpha = 0.1$, and $\epsilon_{t,i}$ = 5e-7 for all t and i;
parameter updating: for all t and i, (i) SVR with $\tilde{C}_{n,1}^{(t)} = 1$ and $\varepsilon = 0.1$, (ii) linear regression
with a Lasso penalty and the regularization parameter $\lambda_{n,i}^{(t)}$ = 1e-4.
Section 3.2.1: Simulated KNN Data. Network: $C_n = 5$ for the SVR layer, $\sigma_{n,2}^2 = 0.001$
for the output layer; HMC imputation: $t_{HMC} = 25$, $\alpha = 0.1$, and $\epsilon_{t,i}$ = 5e-4 for all t and i;
parameter updating: for all t and i, (i) SVR with $\tilde{C}_{n,1}^{(t)} = 5$ and $\varepsilon = 0.01$, (ii) linear regression
with a Lasso penalty and the regularization parameter $\lambda_{n,i}^{(t)}$ = 1e-4.
Section 3.2.2: Measurement error data. For both the one-hidden layer and three-hidden
layer K-StoNets, the parameters were set as follows: Network: $C_n = 1$ for the SVR layer,
$\sigma_{n,i}^2 = 0.001$ for layers $i = 2, \ldots, h$, and $\sigma_{n,h+1}^2 = 0.01$; HMC imputation: $t_{HMC} = 25$, $\alpha = 1$,
and $\epsilon_{t,i}$ = 5e-5 for all t and i; parameter updating: for all t and i, (i) SVR with $\tilde{C}_{n,1}^{(t)} = 1$ and
$\varepsilon \in \{0.01, 0.02, \ldots, 0.1\}$, (ii) linear regression with a Lasso penalty and the regularization
parameter $\lambda_{n,i}^{(t)}$ = 1e-4.
Section 3.3.1: QSAR Androgen Receptor. Network: $C_n = 1$ for the SVR layer, $\sigma_{n,2}^2 = 0.001$
for the output layer; HMC imputation: $t_{HMC} = 25$, $\alpha = 0.1$, and $\epsilon_{t,i}$ = 5e-5 for all
t and i; parameter updating: for all t and i, (i) SVR with $\tilde{C}_{n,1}^{(t)} = 1$ and $\varepsilon = 0.1$, (ii) logistic
regression with a Lasso penalty and the regularization parameter $\lambda_{n,i}^{(t)}$ = 1e-4.
Section 3.3.2: MNIST Data. Network: $C_n = 10$ for the SVR layer, $\sigma_{n,2}^2$ = 1e-9 for
the output layer; HMC imputation: $t_{HMC} = 25$, $\alpha = 0.1$, and $\epsilon_{t,i}$ = 5e-13 for all t and i;
parameter updating: for all t and i, (i) SVR with $\tilde{C}_{n,1}^{(t)} = 10$ and $\varepsilon = 0.0001$, (ii) multinomial
logistic regression with a Lasso penalty and the regularization parameter $\lambda_{n,i}^{(t)}$ = 1e-4.
Section 3.3.3: CoverType Data. Network: $C_n = 10$ for the SVR layer, $\sigma_{n,2}^2 = 0.005$ for
the output layer; HMC imputation: $t_{HMC} = 25$, $\alpha = 0.1$, and $\epsilon_{t,i}$ = 5e-5 for all t and i;
parameter updating: for all t and i, (i) SVR with $\tilde{C}_{n,1}^{(t)} = 10$ and $\varepsilon = 0.01$, (ii) multinomial
logistic regression with a Lasso penalty and the regularization parameter $\lambda_{n,i}^{(t)}$ = 1e-4.
This dataset contains 44 binary features. When applying the RBF kernel
$k(x, x') = \exp(-\gamma \|x - x'\|_2^2)$, the default choice $\gamma = 1/(p\,\mathrm{Var}(x))$ does not work well. Different values of

γ were used for different SVRs in the K-StoNet model. Let γi denote the γ-value used for
the SVR corresponding to the i-th hidden unit. We set γi = 0.5 for 1 ≤ i < 30, γi = 1 for
30 ≤ i < 40, γi = 2 for 40 ≤ i < 45, and γi = 5 for 45 ≤ i ≤ 50.
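
The assignment of kernel widths to the 50 hidden units in the CoverType experiment can be written out explicitly. The helper name `covertype_gammas` is hypothetical; the thresholds are those stated above, with hidden units indexed from 1.

```python
def covertype_gammas(n_units=50):
    """Return the gamma value used by the SVR of each hidden unit (1-based index)."""
    gammas = []
    for i in range(1, n_units + 1):
        if i < 30:
            gamma = 0.5
        elif i < 40:
            gamma = 1.0
        elif i < 45:
            gamma = 2.0
        else:
            gamma = 5.0
        gammas.append(gamma)
    return gammas

gammas = covertype_gammas()
```

Under this indexing, 29 units use γ = 0.5, 10 use γ = 1, 5 use γ = 2, and 6 use γ = 5, so the hidden layer mixes several kernel bandwidths.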
Section 3.3.4: For all 10 datasets except for Yacht Hydrodynamics and Year Prediction
MSD, we set $\sigma_{n,2}^2 = 0.01$, $t_{HMC} = 25$, $\alpha = 0.1$, and $\epsilon_{t,i}$ = 5e-4 for all t and i. For the
SVRs in the first layer, we set $\varepsilon = 0.01$. We used 1/9 of the training data as the validation set
and chose $\tilde{C}_{n,1}^{(t)} \in \{1, 2, 5, 10, 20\}$ with the smallest MSE on the validation set. For the dataset
Yacht Hydrodynamics, we set $\sigma_{n,2}^2 = 0.0001$, $\alpha = 0.1$, $\epsilon_{t,i}$ = 5e-6 and $\tilde{C}_{n,1}^{(t)} = 200$. For the
dataset Year Prediction MSD, we set $\sigma_{n,2}^2 = 0.02$, $\alpha = 0.1$, $\epsilon_{t,i}$ = 1e-3 and $\tilde{C}_{n,1}^{(t)} = 1$. Similar
to the CoverType dataset, when some categorical features exist in the dataset, the default
choice $\gamma = 1/(p\,\mathrm{Var}(x))$ in the RBF kernel does not work very well. Among the 10 datasets, we
set γ = 3 for Yacht Hydrodynamics, γ = 1 for Protein Structure, and employed the default
setting for the others.
The KNN model was trained in a similar setting as used for the probabilistic back-
propagation method in [107]: we used a one-hidden layer model with 50 hidden units, and
trained the model using SGD with a constant learning rate of 0.0001 and a momentum decay
factor of 0.9. As in [107], we ran SGD for 40 epochs with a mini-batch size of 1.
Section 3.4.2: Prediction Interval. Network: $C_n = 10$ for the SVR layer, $\sigma_{n,2}^2 = 0.001$
for the output layer; HMC imputation: $t_{HMC} = 25$, $\alpha = 0.1$, and $\epsilon_{t,i}$ = 5e-6 for all t
and i; parameter updating: for all t and i, (i) SVR with $\tilde{C}_{n,1}^{(t)} = 10$ and $\varepsilon = 0.05$, (ii) linear
regression, OLS estimation.

3.6 Discussion

We have proposed K-StoNet as a new neural network model for machine learning. The
K-StoNet incorporates SVR as the first hidden layer and reformulates the neural network
as a latent variable model. The former maps the input variable into an infinite dimensional
feature space via the RBF kernel, ensuring absence of local minima on the loss surface
of the resulting neural network. The latter breaks the high-dimensional nonconvex neural
network training problem into a series of lower-dimensional convex optimization problems. In
addition, the use of the kernel partially addresses the over-parameterization issue suffered by the
DNN; it enables a smaller network to be used, while ensuring the universal approximation
capability. The K-StoNet can be easily trained using the IRO algorithm. Compared to
DNN, K-StoNet avoids local traps and allows the prediction uncertainty to be easily assessed.
Compared to SVR, K-StoNet has better approximation capability due to the added hidden
units. Our numerical results indicate its superiority over SVR and DNN in both training
and prediction.
As an important ingredient of K-StoNet, StoNet is itself of interest. Under the framework
of StoNet, the existing statistical theory for SVR and high-dimensional generalized linear
models can be easily incorporated into the development of deep learning.
As another important ingredient of K-StoNet, the kernel has long been studied in machine
learning as a function approximation tool. It is known that, with the “kernel trick”, kernel
methods enable a classifier or regression model to learn a complex decision boundary with only a
small number of parameters. To enhance their flexibility, some researchers proposed the so-called
deep kernel learning methods, where one kernel function is repeatedly concatenated
with another kernel or nonlinear function, see e.g. [111]–[117]. Under some conditions,
[113] showed that the upper bound of generalization error for deep multiple kernels can be
significantly lower than that for the DNNs. However, unlike shallow kernel methods such
as SVM and SVR [31], [32], [118], coefficient estimation for deep kernels is no longer
convex. Estimating coefficients of the inner layer kernel can be highly nonlinear and becomes
more complicated for a larger number of layers [117]. By introducing latent variables, this
work provides an effective way to resolve the computational challenge suffered by deep kernel
learning. K-StoNet is essentially a deep kernel learning method.
The K-StoNet can be further extended in many different ways. Instead of relying on an
SVR solver, K-StoNet can be implemented as a StoNet with the Gram matrix being treated
as input data. In this case, although a large-scale Gram matrix needs to be handled when
the training sample size is large, different kernels can be adopted for different tasks. For
example, one might employ the convolutional kernel developed in [114] for computer vision
problems. As discussed in Section 3.1.3, the regression in the output and other hidden
layers can also be regularized by different amenable penalties [95], some of which might lead
to better selection properties than Lasso.
The K-StoNet has an embarrassingly parallel structure; that is, solving K-StoNet can be
broken into many parallel tasks that can be solved with little or no need for communication.
More precisely, the imputation step can be done in parallel for each observation;
and the parameter updating step can be done in parallel for each of the regression tasks,
including both SVR and Lasso regression. If both steps are implemented in parallel, the
computation can be greatly accelerated. Currently, parallelization has not yet been completed:
the imputation step was implemented in PyTorch, where the imputation for each
observation was done in a serial manner; the parameter updating step was implemented
using the package sklearn [119] for the normal or logistic regression and ThunderSVM [120]
for SVR, where the regressions were solved one by one. On a machine with Intel(R) Xeon(R)
Silver 4110 CPU @ 2.10GHz and Nvidia Tesla P100 GPU, for a dataset with n = 10000
and p = 54 and a 1-hidden layer K-StoNet with 5 hidden units, one iteration/epoch cost
less than half a minute in the current serial implementation. Therefore, all the examples
presented in this chapter can be done reasonably fast on the machine. We expect that the
computation can be much accelerated with a parallel implementation of K-StoNet.
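The parallel structure of the parameter updating step can be sketched as follows. This is only a minimal illustration under stated assumptions: ordinary least squares (via `numpy.linalg.lstsq`) stands in for the SVR and Lasso solvers, a thread pool stands in for whatever parallel backend an actual implementation would use, and the names `fit_one_unit` and `update_layer_parallel` as well as the toy data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def fit_one_unit(task):
    """Least-squares fit for one unit: returns (intercept, weights)."""
    features, target = task
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[0], coef[1:]


def update_layer_parallel(features, targets, max_workers=4):
    """Fit every column of `targets` on `features` as an independent task."""
    tasks = [(features, targets[:, j]) for j in range(targets.shape[1])]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fit_one_unit, tasks))


rng = np.random.default_rng(1)
H = rng.normal(size=(500, 5))            # imputed outputs of the previous layer (toy data)
W_true = rng.normal(size=(5, 3))
Y_next = 0.5 + H @ W_true + 0.01 * rng.normal(size=(500, 3))

fits = update_layer_parallel(H, Y_next)
W_hat = np.column_stack([w for _, w in fits])
print(np.abs(W_hat - W_true).max())
```

Because each regression only reads the shared imputed data and writes its own coefficients, the tasks need no communication; the same pattern would apply to the per-observation imputation step.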
As mentioned in Section 3.1.3, a scalable SVR solver will accelerate the computation of
K-StoNet substantially. The scalable SVR can be developed in different ways. For example,
[121] developed ParitoSVR, a parallel iterated optimizer for SVR, where each machine
iteratively solves a small (sub-)problem based only on a subset of the data and these solutions are
then combined to form the solution to the global problem. ParitoSVR is provably convergent
to the results obtained from the centralized algorithm, where the optimization has access to
the entire data set. Alternatively, one can implement SVR in an incremental manner [122]–
[125], where an SVR is first learned with a subset of the data and then sequentially updated based
on the remaining set of samples. By the property that the decision function of SVR depends
on support vectors only, [124] proposed to use only the boundary vectors in the remaining
set of samples. By the same property, [125] proposed a sample selection method for SVR
to maximize its validation set accuracy with the minimum number of training examples. As
shown in [124], [125], both methods can accelerate the computation of SVR substantially.

3.7 Technical Proofs

3.7.1 Proof of Theorem 3.1.2

Proof. Since Θ is compact, it suffices to prove that the consistency holds for any θ ∈ Θ. For
notational simplicity, we write $\sigma_{n,i}$ as $\sigma_i$ in the remaining part of the proof.
Let $Y_{mis} = (Y_1, Y_2, \ldots, Y_h)$, where the $Y_i$'s are latent variables as defined in (3.5). Let
$\tilde{Y} = (\tilde{Y}_1, \ldots, \tilde{Y}_h)$, where the $\tilde{Y}_i$'s are calculated by KNN in (3.3). By Taylor expansion, we
have

$$
\log \pi(Y, Y_{mis} \mid X, \theta) = \log \pi(Y, \tilde{Y} \mid X, \theta) + \epsilon^T \nabla_{Y_{mis}} \log \pi(Y, \tilde{Y} \mid X, \theta) + O(\|\epsilon\|^2), \tag{3.20}
$$

where $\epsilon = Y_{mis} - \tilde{Y} = (\epsilon_1, \epsilon_2, \ldots, \epsilon_h)$, $\nabla_{Y_{mis}} \log \pi(Y, \tilde{Y} \mid X, \theta)$ is evaluated according to
the joint distribution (3.10), and $\log \pi(Y, \tilde{Y} \mid X, \theta) = \log \pi(Y \mid X, \theta)$ is the log-likelihood
function of the KNN.
Consider the partial derivative $\nabla_{Y_i} \log \pi(Y, Y_{mis} \mid X, \theta)$, for whose single component, say
$Y_i^{(k)}$, the output of neuron $k$ at hidden layer $i \in \{2, 3, \ldots, h\}$, we have

$$
\nabla_{Y_i^{(k)}} \log \pi(Y, Y_{mis} \mid X, \theta)
= \frac{1}{\sigma_{i+1}^2} \sum_{j=1}^{m_{i+1}} \left( Y_{i+1}^{(j)} - b_{i+1}^{(j)} - w_{i+1}^{(j)} \psi(Y_i) \right) w_{i+1}^{(j,k)} \psi'(Y_i^{(k)})
- \frac{1}{\sigma_i^2} \left( Y_i^{(k)} - b_i^{(k)} - w_i^{(k)} \psi(Y_{i-1}) \right),
\tag{3.21}
$$

where $w_{i+1}^{(j)}$ denotes the vector of the weights from neuron $j$ at layer $i + 1$ to the neurons at
layer $i$, and $w_{i+1}^{(j,k)}$ denotes the weight from neuron $j$ at layer $i + 1$ to neuron $k$ at layer $i$. For
layer $i = 1$, the second term of (3.21) will disappear, since the $\epsilon$-insensitive loss is a constant
around zero.
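For reference, the standard form of the ε-insensitive loss used in SVR is written out below; the symbols $L_\varepsilon$ and $u$ are introduced here only for illustration and do not appear in the original derivation. The loss is flat on $|u| < \varepsilon$, so its derivative vanishes there, which is the sense in which the first-layer term drops out for residuals close to zero.

```latex
L_\varepsilon(u) = \max\bigl(0,\; |u| - \varepsilon\bigr),
\qquad
\frac{d}{du} L_\varepsilon(u) = 0 \quad \text{for } |u| < \varepsilon .
```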
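Since $Y_{i+1}^{(j)} = b_{i+1}^{(j)} + w_{i+1}^{(j)} \psi(Y_i) + e_{i+1}^{(j)}$ and $\tilde{Y}_i^{(k)} = b_i^{(k)} + w_i^{(k)} \psi(\tilde{Y}_{i-1})$, we have, for any
$k \in \{1, 2, \ldots, m_i\}$,

$$
\nabla_{Y_i^{(k)}} \log \pi(Y, \tilde{Y} \mid X, \theta)
= \frac{1}{\sigma_{i+1}^2} \sum_{j=1}^{m_{i+1}} \left( e_{i+1}^{(j)} + w_{i+1}^{(j)} \bigl( \psi(Y_i) - \psi(\tilde{Y}_i) \bigr) \right) w_{i+1}^{(j,k)} \psi'(\tilde{Y}_i^{(k)})
$$

if $i = h$, and 0 otherwise. Then, by Assumption 3.1.2-(i)&(iii), for any $k \in \{1, 2, \ldots, m_i\}$,

$$
\left| \nabla_{Y_i^{(k)}} \log \pi(Y, \tilde{Y} \mid X, \theta) \right| \le
\begin{cases}
\dfrac{1}{\sigma_{i+1}^2} \left\{ \sum_{j=1}^{m_{i+1}} e_{i+1}^{(j)} w_{i+1}^{(j,k)} \psi'(\tilde{Y}_i^{(k)}) + (c' r)^2 m_{i+1} \|\epsilon_i\| \right\}, & i = h, \\[2ex]
0, & i < h,
\end{cases}
\tag{3.22}
$$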

where $\epsilon_i = Y_i - \tilde{Y}_i$, $e_{i+1}^{(j)}$ is the $j$th component of $e_{i+1}$, $r$ is the upper bound of the weights,
and $c'$ is the Lipschitz constant of $\psi(\cdot)$ as well as the upper bound of $\psi'(\cdot)$.
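Next, we determine the order of $\|\epsilon_i\|$. The $k$th component of $\epsilon_i$ is given by

$$
Y_i^{(k)} - \tilde{Y}_i^{(k)} =
\begin{cases}
e_i^{(k)} + w_i^{(k)} \bigl( \psi(Y_{i-1}) - \psi(\tilde{Y}_{i-1}) \bigr), & i > 1, \\
e_i^{(k)}, & i = 1.
\end{cases}
$$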

Therefore, $\|\epsilon_1\| = \|e_1\|$; and for $i = 2, 3, \ldots, h$, the following inequalities hold:

$$
\|\epsilon_i\| \le \|e_i\| + c' r m_i \|\epsilon_{i-1}\|, \qquad \|\epsilon_i\|^2 \le 2\|e_i\|^2 + 2 (c' r)^2 m_i^2 \|\epsilon_{i-1}\|^2. \tag{3.23}
$$
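The second inequality in (3.23) follows from the first by squaring and applying the elementary bound $(a + b)^2 \le 2a^2 + 2b^2$; the symbols $a$ and $b$ are introduced here only for illustration, with $a = \|e_i\|$ and $b = c' r m_i \|\epsilon_{i-1}\|$:

```latex
\|\epsilon_i\|^2
\le \bigl( \|e_i\| + c' r m_i \|\epsilon_{i-1}\| \bigr)^2
\le 2 \|e_i\|^2 + 2 (c' r)^2 m_i^2 \|\epsilon_{i-1}\|^2 .
```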

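Since $e_i$ and $e_{i-1}$ are independent, by summarizing (3.22) and (3.23), we have

$$
\int \epsilon^T \nabla_{Y_{mis}} \log \pi(Y, \tilde{Y} \mid X, \theta)\, \pi(Y_{mis} \mid X, \theta, Y)\, dY_{mis}
\le O\!\left( \sum_{k=2}^{h+1} \frac{\sigma_{k-1}^2}{\sigma_{h+1}^2}\, m_{h+1} \Bigl( \prod_{i=k}^{h} m_i^2 \Bigr) m_{k-1} \right) = o(1),
\tag{3.24}
$$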

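where the last equality follows from Assumption 3.1.2-(v). Then, by (3.20), we have the mean value

$$
E\left[ \log \pi(Y, Y_{mis} \mid X, \theta) - \log \pi(Y \mid X, \theta) \right] \to 0, \quad \forall \theta \in \Theta.
$$

Further, it is easy to verify

$$
\int \left| \epsilon^T \nabla_{Y_{mis}} \log \pi(Y, \tilde{Y} \mid X, \theta) \right|^2 \pi(Y_{mis} \mid X, \theta, Y)\, dY_{mis} < \infty,
$$

which, together with (3.20) and (3.23), implies

$$
E\left| \log \pi(Y, Y_{mis} \mid X, \theta) - \log \pi(Y \mid X, \theta) \right|^2 < \infty. \tag{3.25}
$$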

Therefore, the weak law of large numbers (WLLN) applies, and the proof can be concluded.

3.7.2 Proof of Lemma 3.1.2

Lemma 3.1.2 is a direct application of Lemma 3.7.1 given below.

Lemma 3.7.1. Consider a function Q(θ, X n ). Suppose that the following conditions are
satisfied: (B1) Q(θ, X n ) is continuous in θ and there exists a function Q∗ (θ), which is
continuous in θ and uniquely maximized at θ ∗ . (B2) For any ϵ > 0, supθ∈Θ\B(ϵ) Q∗ (θ)
exists, where B(ϵ) = {θ : ∥θ − θ ∗ ∥ < ϵ}; let δ = Q∗ (θ ∗ ) − supθ∈Θ\B(ϵ) Q∗ (θ), then δ > 0.
(B3) supθ∈Θ |Q(θ, X n ) − Q∗ (θ)| → 0 in probability as n → ∞. Let θ̂ n = arg maxθ∈Θ Q(θ, X n ). Then
∥θ̂ n − θ ∗ ∥ → 0 in probability.

Proof. Consider two events (i) supθ∈Θ\B(ϵ) |Q(θ, X n ) − Q∗ (θ)| < δ/2, and
(ii) supθ∈B(ϵ) |Q(θ, X n ) − Q∗ (θ)| < δ/2. From event (i), we can deduce that for any θ ∈
Θ \ B(ϵ), Q(θ, X n ) < Q∗ (θ) + δ/2 ≤ Q∗ (θ ∗ ) − δ + δ/2 = Q∗ (θ ∗ ) − δ/2.
From event (ii), we can deduce that for any θ ∈ B(ϵ), Q(θ, X n ) > Q∗ (θ) − δ/2 and thus
Q(θ ∗ , X n ) > Q∗ (θ ∗ ) − δ/2.
If both events hold simultaneously, then Q(θ, X n ) < Q∗ (θ ∗ ) − δ/2 < Q(θ ∗ , X n ) for every
θ ∈ Θ \ B(ϵ), so the maximizer θ̂ n cannot lie outside B(ϵ); that is, θ̂ n ∈ B(ϵ). By condition
(B3), the probability that both events hold tends to 1 as n → ∞. Therefore, P (θ̂ n ∈ B(ϵ)) → 1, which
concludes the lemma.
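The statement of Lemma 3.7.1 can be checked numerically on a toy problem. The sketch below is an illustration only and is not part of the original proof: it assumes the hypothetical objective Q(θ, X n ) = −mean((x − θ)²) for Gaussian data with mean θ∗ and unit variance, whose limit is Q∗ (θ) = −(θ − θ∗ )² − 1, and it replaces the compact set Θ by a finite grid on [−5, 5].

```python
import numpy as np

theta_star = 1.5
grid = np.linspace(-5.0, 5.0, 2001)   # finite grid standing in for the compact set Theta
rng = np.random.default_rng(2)


def Q(theta, x):
    """Sample objective Q(theta, X_n) = -mean((x - theta)^2), via sample moments."""
    return -(np.mean(x ** 2) - 2.0 * theta * np.mean(x) + theta ** 2)


def Q_star(theta):
    """Limit objective Q*(theta) = -(theta - theta*)^2 - 1 (unit noise variance)."""
    return -((theta - theta_star) ** 2) - 1.0


errors, gaps = [], []
for n in (100, 100_000):
    x = theta_star + rng.normal(size=n)
    q_n = Q(grid, x)
    theta_hat = grid[np.argmax(q_n)]                      # maximizer over the grid
    errors.append(abs(theta_hat - theta_star))            # distance to theta*
    gaps.append(np.max(np.abs(q_n - Q_star(grid))))       # sup-norm gap, condition (B3)
print(errors, gaps)
```

For the larger sample size, both the uniform gap between the sample and limit objectives and the distance between the maximizer and θ∗ are small, in line with condition (B3) and the conclusion of the lemma.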

REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778,
2016.

[2] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas, “Predicting parameters
in deep learning,” in NIPS, 2013.

[3] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural net-
works,” in Proceedings of the 34th International Conference on Machine Learning - Volume
70, ser. ICML’17, Sydney, NSW, Australia: JMLR.org, 2017, pp. 1321–1330.

[4] J. Frankle and M. Carbin, “The lottery ticket hypothesis: Finding sparse, trainable
neural networks,” in International Conference on Learning Representations, 2019. [Online].
Available: https://openreview.net/forum?id=rJl-b3RcF7.

[5] M. Gori and A. Tesi, “On the problem of local minima in backpropagation,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 1, pp. 76–86, 1992.

[6] Q. Nguyen and M. Hein, “The loss surface of deep and wide neural networks,” in ICML,
2017.

[7] Z. Allen-Zhu, Y. Li, and Z. Song, “A convergence theory for deep learning via over-
parameterization,” in ICML, 2019.

[8] S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai, “Gradient descent finds global minima
of deep neural networks,” in ICML, 2019.

[9] D. Zou, Y. Cao, D. Zhou, and Q. Gu, “Gradient descent optimizes over-parameterized
deep relu networks,” Machine Learning, vol. 109, pp. 467–492, 2020.

[10] D. Zou and Q. Gu, “An improved analysis of training over-parameterized deep neural
networks,” in NeurIPS, 2019.

[11] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen, “Optimal approximation with
sparsely connected deep neural networks,” CoRR, vol. abs/1705.01714, 2019.

[12] J. Schmidt-Hieber, “Nonparametric regression using deep neural networks with relu
activation function,” arXiv:1708.06633, 2017.

[13] B. Bauer and M. Kohler, “On deep learning as a remedy for the curse of dimensionality
in nonparametric regression,” The Annals of Statistics, vol. 47, no. 4, pp. 2261–2285, 2019.

[14] F. Liang, Q. Li, and L. Zhou, “Bayesian neural networks for selection of drug sensitive
genes,” Journal of the American Statistical Association, vol. 113, no. 523, pp. 955–972, 2018.

[15] N. G. Polson and V. Ročková, “Posterior concentration for sparse deep learning,” in
Proceedings of the 32nd International Conference on Neural Information Processing Systems,
ser. NIPS’18, Montréal, Canada: Curran Associates Inc., 2018, pp. 938–949.

[16] J. M. Alvarez and M. Salzmann, “Learning the number of neurons in deep networks,”
in Advances in Neural Information Processing Systems, 2016, pp. 2270–2278.

[17] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini, “Group sparse regularization
for deep neural networks,” Neurocomputing, vol. 241, pp. 81–89, 2017.

[18] R. Ma, J. Miao, L. Niu, and P. Zhang, “Transformed l1 regularization for learning sparse
deep neural networks,” ArXiv:1901.01021v1, 2019.

[19] S. Wager, S. Wang, and P. S. Liang, “Dropout training as adaptive regularization,”
Advances in neural information processing systems, vol. 26, 2013.

[20] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks
with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149,
2015.

[21] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, “Sparse convolutional neural
networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition,
2015, pp. 806–814.

[22] J. Frankle and M. Carbin, “The lottery ticket hypothesis: Finding sparse, trainable
neural networks,” arXiv preprint arXiv:1803.03635, 2018.

[23] S. Ghosh and F. Doshi-Velez, “Model selection in bayesian neural networks via horseshoe
priors,” arXiv preprint arXiv:1705.10388, 2017.

[24] G. E. Hinton, “Learning multiple layers of representation,” Trends in cognitive sciences,
vol. 11, no. 10, pp. 428–434, 2007.

[25] R. Salakhutdinov and G. Hinton, “Deep boltzmann machines,” in Proceedings of the
International Conference on Artificial Intelligence and Statistics, 2009, pp. 448–455.

[26] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout:
A simple way to prevent neural networks from overfitting,” The journal of machine learning
research, vol. 15, no. 1, pp. 1929–1958, 2014.

[27] A. Neelakantan, L. Vilnis, Q. V. Le, et al., “Adding gradient noise improves learning
for very deep networks,” arXiv preprint arXiv:1511.06807, 2015.

[28] Z. You, J. Ye, K. Li, Z. Xu, and P. Wang, “Adversarial noise layer: Regularize neural
network by adding noise,” in 2019 IEEE International Conference on Image Processing
(ICIP), IEEE, 2019, pp. 909–913.

[29] H. Noh, T. You, J. Mun, and B. Han, “Regularizing deep neural networks by noise:
Its interpretation and optimization,” Advances in Neural Information Processing Systems,
vol. 30, 2017.

[30] C. Gulcehre, M. Moczulski, M. Denil, and Y. Bengio, “Noisy activation functions,” in
International conference on machine learning, PMLR, 2016, pp. 3059–3068.

[31] V. Vapnik, “Pattern recognition using generalized portrait method,” Automation and
remote control, vol. 24, pp. 774–780, 1963.

[32] V. Vapnik, “A note on one class of perceptrons,” Automation and remote control, 1964.

[33] F. Liang, B. Jia, J. Xue, Q. Li, and Y. Luo, “An imputation–regularized optimization
algorithm for high dimensional missing data problems and beyond,” Journal of the Royal
Statistical Society: Series B (Statistical Methodology), vol. 80, no. 5, pp. 899–926, 2018.

[34] C. A. Micchelli, Y. Xu, and H. Zhang, “Universal kernels.,” Journal of Machine Learning
Research, vol. 7, no. 12, 2006.

[35] B. Hammer and K. Gersmann, “A note on the universal approximation capability of
support vector machines,” Neural Processing Letters, vol. 17, no. 1, pp. 43–53, 2003.

[36] H. Ishwaran, J. S. Rao, et al., “Spike and slab variable selection: Frequentist and bayesian
strategies,” The Annals of Statistics, vol. 33, no. 2, pp. 730–773, 2005.

[37] E. I. George and R. E. McCulloch, “Variable selection via gibbs sampling,” Journal of
the American Statistical Association, vol. 88, no. 423, pp. 881–889, 1993.

[38] Q. Song and F. Liang, “Nearly optimal bayesian shrinkage for high dimensional regres-
sion,” arXiv:1712.08964, 2017.

[39] W. Jiang, “Bayesian variable selection for high dimensional generalized linear models:
Convergence rate of the fitted densities,” The Annals of Statistics, vol. 35, pp. 1487–1511,
2007.

[40] F. Liang, Q. Song, and K. Yu, “Bayesian subset modeling for high dimensional general-
ized linear models,” Journal of the American Statistical Association, vol. 108, pp. 589–606,
2013.

[41] P. Petersen and F. Voigtlaender, “Optimal approximation of piecewise smooth functions
using deep relu neural networks,” Neural Networks, vol. 108, pp. 296–330, 2018.

[42] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever, “Deep double
descent: Where bigger models and more data hurt,” in International Conference on Learning
Representations, 2020. [Online]. Available: https://openreview.net/forum?id=B1g5sA4twr.

[43] S. Ghosal, J. K. Ghosh, and A. W. Van Der Vaart, “Convergence rates of posterior
distributions,” Annals of Statistics, vol. 28, no. 2, pp. 500–531, 2000.

[44] E. I. George and R. E. McCulloch, “Approaches for bayesian variable selection,” Statis-
tica sinica, vol. 7, pp. 339–373, 1997.

[45] R. Kohn, M. Smith, and D. Chan, “Nonparametric regression using linear combinations
of basis functions,” Statistics and Computing, vol. 11, no. 4, pp. 313–322, 2001.

[46] A. Dobra, C. Hans, B. Jones, J. R. Nevins, G. Yao, and M. West, “Sparse graphical
models for exploring gene expression data,” Journal of Multivariate Analysis, vol. 90, no. 1,
pp. 196–212, 2004.

[47] A. A. Pourzanjani, R. M. Jiang, and L. R. Petzold, “Improving the identifiability of
neural networks for bayesian inference,” in NIPS Workshop on Bayesian Deep Learning,
2017.

[48] S. Geisser, J. Hodges, S. Press, and A. Zellner, “The validity of posterior expansions
based on laplace’s method,” Bayesian and likelihood methods in statistics and econometrics,
vol. 7, p. 473, 1990.

[49] I. Castillo, J. Rousseau, et al., “A bernstein–von mises theorem for smooth functionals
in semiparametric models,” The Annals of Statistics, vol. 43, no. 6, pp. 2353–2383, 2015.

[50] Y. Wang and V. Ročková, “Uncertainty quantification for sparse deep learning,” in
AISTATS, 2020.

[51] J. Feng and N. Simon, “Sparse-input neural networks for high-dimensional nonparamet-
ric regression and classification,” arXiv preprint arXiv:1711.07592, 2017.

[52] C. Fefferman, “Reconstructing a neural net from its output,” Revista Matemática
Iberoamericana, vol. 10, no. 3, pp. 507–555, 1994.

[53] S. Portnoy, “Asymptotic behavior of likelihood methods for exponential families when
the number of parameters tend to infinity,” The Annals of Statistics, vol. 16, no. 1, pp. 356–
366, 1988.

[54] D. A. McAllester, “Pac-bayesian model averaging,” in Proceedings of the twelfth annual
conference on Computational learning theory, 1999, pp. 164–170.

[55] D. A. McAllester, “Some pac-bayesian theorems,” Machine Learning, vol. 37, no. 3,
pp. 355–363, 1999.

[56] P. Chaudhari, A. Choromanska, S. Soatto, et al., “Entropy-sgd: Biasing gradient descent
into wide valleys,” arXiv:1611.01838, 2016. eprint: 1611.01838 (cs.LG).

[57] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson, “Averaging
weights leads to wider optima and better generalization,” arXiv:1803.05407, 2018. eprint:
1803.05407 (cs.LG).

[58] F. Liang, “Evidence evaluation for bayesian neural networks using contour monte carlo,”
Neural Computation, vol. 17, no. 6, pp. 1385–1410, 2005.

[59] D. J. MacKay, “The evidence framework applied to classification networks,” Neural
computation, vol. 4, no. 5, pp. 720–736, 1992.

[60] B. Kleinberg, Y. Li, and Y. Yuan, “An alternative view: When does sgd escape local
minima?” In International Conference on Machine Learning, PMLR, 2018, pp. 2698–2707.

[61] C. Zhang, Q. Liao, A. Rakhlin, B. Miranda, N. Golowich, and T. Poggio, “Theory of
deep learning iib: Optimization properties of sgd,” arXiv preprint, arXiv:1801.02254, 2018.

[62] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward
neural networks,” in Proceedings of the thirteenth international conference on artificial in-
telligence and statistics, 2010, pp. 249–256.

[63] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-
level performance on imagenet classification,” in Proceedings of the IEEE international con-
ference on computer vision, 2015, pp. 1026–1034.

[64] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International
Conference on Learning Representations, 2015.

[65] M. Welling and Y. W. Teh, “Bayesian learning via stochastic gradient langevin dynam-
ics,” in Proceedings of the 28th international conference on machine learning (ICML-11),
2011, pp. 681–688.

[66] T. Chen, E. Fox, and C. Guestrin, “Stochastic gradient hamiltonian monte carlo,” in
International conference on machine learning, 2014, pp. 1683–1691.

[67] Y.-A. Ma, T. Chen, and E. Fox, “A complete recipe for stochastic gradient mcmc,” in
Advances in Neural Information Processing Systems, 2015, pp. 2917–2925.

[68] C. Nemeth and P. Fearnhead, “Stochastic gradient markov chain monte carlo,” arXiv
preprint arXiv:1907.06986, 2019.

[69] Y. Sun, Q. Song, and F. Liang, “Consistent sparse deep learning: Theory and compu-
tation,” Journal of the American Statistical Association, in press, 2021.

[70] C. Chen, N. Ding, and L. Carin, “On the convergence of stochastic gradient mcmc
algorithms with high-order integrators,” in Proceedings of the 28th International Conference
on Neural Information Processing Systems-Volume 2, 2015, pp. 2278–2286.

[71] J. Bleich, A. Kapelner, E. I. George, and S. T. Jensen, “Variable selection for bart: An
application to gene regulation,” The Annals of Applied Statistics, pp. 1750–1781, 2014.

[72] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal
Statistical Society. Series B (Methodological), pp. 267–288, 1996.

[73] J. Fan and J. Lv, “Sure independence screening for ultrahigh dimensional feature space,”
Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 70, no. 5,
pp. 849–911, 2008.

[74] A. Krizhevsky, G. Hinton, et al., “Learning multiple layers of features from tiny images,”
Citeseer, Tech. Rep., 2009.

[75] T. Lin, S. U. Stich, L. Barba, D. Dmitriev, and M. Jaggi, “Dynamic model pruning
with feedback,” in International Conference on Learning Representations, 2020. [Online].
Available: https://openreview.net/forum?id=SJem8lSFwB.

[76] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016,
pp. 770–778.

[77] H. Mostafa and X. Wang, “Parameter efficient training of deep convolutional neural
networks by dynamic sparse reparameterization,” in International Conference on Machine
Learning, 2019, pp. 4646–4655.

[78] T. Dettmers and L. Zettlemoyer, “Sparse networks from scratch: Faster training without
losing performance,” arXiv preprint arXiv:1907.04840, 2019.

[79] W. J. Maddox, P. Izmailov, T. Garipov, D. P. Vetrov, and A. G. Wilson, “A simple base-
line for bayesian uncertainty in deep learning,” in Advances in Neural Information Processing
Systems, 2019, pp. 13 153–13 164.

[80] V. Ročková, “Bayesian estimation of sparse signals with a continuous spike-and-slab
prior,” The Annals of Statistics, vol. 46, no. 1, pp. 401–437, 2018.

[81] Authors, “Extended stochastic gradient mcmc algorithms for large-scale bayesian com-
puting,” Submitted Manuscript, 2019.

[82] A. Zubkov and A. Serov, “A complete proof of universal inequalities for the distribution
function of the binomial law,” Theory Probab. Appl., vol. 57, no. 3, pp. 539–544, 2013.

[83] W. V. Li and A. Wei, “A gaussian inequality for expected absolute products,” Journal
of Theoretical Probability, vol. 25, no. 1, pp. 92–99, 2012.

[84] I. Castillo and J. Rousseau, “Supplement to “a bernstein–von mises theorem for smooth
functionals in semiparametric models”,” Annals of Statistics, vol. 43, no. 6, pp. 2353–2383,
2015.

[85] G. Wahba, Spline models for observational data. SIAM, 1990.

[86] B. Schölkopf, R. Herbrich, and A. J. Smola, “A generalized representer theorem,” in
International conference on computational learning theory, Springer, 2001, pp. 416–426.

[87] J. Park and I. W. Sandberg, “Universal approximation using radial-basis-function networks,”
Neural computation, vol. 3, no. 2, pp. 246–257, 1991.

[88] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete
data via the em algorithm,” Journal of the Royal Statistical Society: Series B (Methodological),
vol. 39, no. 1, pp. 1–22, 1977.

[89] G. Celeux, “The sem algorithm: A probabilistic teacher algorithm derived from the em
algorithm for the mixture problem,” Computational statistics quarterly, vol. 2, pp. 73–82,
1985.

[90] A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,” Statistics and
computing, vol. 14, no. 3, pp. 199–222, 2004.

[91] A. Christmann and I. Steinwart, “Consistency and robustness of kernel-based regression
in convex risk minimization,” Bernoulli, vol. 13, no. 3, pp. 799–819, 2007.

[92] I. Steinwart, “On the influence of the kernel on the consistency of support vector ma-
chines,” Journal of machine learning research, vol. 2, no. Nov, pp. 67–93, 2001.

[93] J. Fan and R. Li, “Variable selection via nonconcave penalized likelihood and its oracle
properties,” Journal of the American statistical Association, vol. 96, no. 456, pp. 1348–1360,
2001.

[94] C.-H. Zhang, “Nearly unbiased variable selection under minimax concave penalty,” The
Annals of statistics, vol. 38, no. 2, pp. 894–942, 2010.

[95] P.-L. Loh, M. J. Wainwright, et al., “Support recovery without incoherence: A case for
nonconvex regularization,” The Annals of Statistics, vol. 45, no. 6, pp. 2455–2482, 2017.

[96] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, “Hybrid monte carlo,”
Physics letters B, vol. 195, no. 2, pp. 216–222, 1987.

[97] R. M. Neal et al., “Mcmc using hamiltonian dynamics,” Handbook of markov chain
monte carlo, vol. 2, no. 11, p. 2, 2011.

[98] X. Cheng, N. S. Chatterji, P. L. Bartlett, and M. I. Jordan, “Underdamped langevin
mcmc: A non-asymptotic analysis,” in Conference on learning theory, PMLR, 2018, pp. 300–
323.

[99] P. J. Rossky, J. D. Doll, and H. L. Friedman, “Brownian dynamics as smart monte carlo
simulation,” The Journal of Chemical Physics, vol. 69, no. 10, pp. 4628–4633, 1978.

[100] S. Geman and D. Geman, “Stochastic relaxation, gibbs distributions, and the bayesian
restoration of images,” IEEE Transactions on pattern analysis and machine intelligence,
no. 6, pp. 721–741, 1984.

[101] V. Vapnik, The Nature of Statistical Learning Theory (2nd ed.) New York: Springer,
2000.

[102] S. Balasundaram, D. D. Gupta, and K. Gupta, “Lagrangian support vector regression
via unconstrained convex minimization,” Neural networks, vol. 51C, pp. 67–79, Dec. 2013.

[103] S. F. Nielsen, “The stochastic em algorithm: Estimation and asymptotic results,”
Bernoulli, pp. 457–489, 2000.

[104] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., “Gradient-based learning applied to
document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[105] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint
arXiv:1412.6980, 2014.

[106] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model
uncertainty in deep learning,” in Proceedings of the 33rd International Conference on Inter-
national Conference on Machine Learning - Volume 48, ser. ICML’16, New York, NY, USA:
JMLR.org, 2016, pp. 1050–1059.

[107] J. M. Hernández-Lobato and R. Adams, “Probabilistic backpropagation for scalable
learning of Bayesian neural networks,” in International Conference on Machine Learning,
PMLR, 2015, pp. 1861–1869.

[108] A. Graves, “Practical variational inference for neural networks,” in Advances in neural
information processing systems, Citeseer, 2011, pp. 2348–2356.

[109] J. B. Gao, S. R. Gunn, C. J. Harris, and M. Brown, “A probabilistic framework for svm
regression and error bar estimation,” Machine Learning, vol. 46, no. 1, pp. 71–89, 2002.

[110] P.-L. Loh, “Statistical consistency and asymptotic normality for high-dimensional robust
M -estimators,” The Annals of Statistics, vol. 45, no. 2, pp. 866–896, 2017.

[111] Y. Cho and L. Saul, “Kernel methods for deep learning,” Advances in neural information
processing systems, vol. 22, 2009.

[112] J. Zhuang, I. W. Tsang, and S. C. Hoi, “Two-layer multiple kernel learning,” in Proceed-
ings of the fourteenth international conference on artificial intelligence and statistics, JMLR
Workshop and Conference Proceedings, 2011, pp. 909–917.

[113] E. V. Strobl and S. Visweswaran, “Deep multiple kernel learning,” in 2013 12th Inter-
national Conference on Machine Learning and Applications, IEEE, vol. 1, 2013, pp. 414–
417.

[114] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid, “Convolutional kernel networks,”
arXiv preprint arXiv:1406.3332, 2014.

[115] I. Rebai, Y. BenAyed, and W. Mahdi, “Deep multilayer multiple kernel learning,” Neural
Computing and Applications, vol. 27, no. 8, pp. 2305–2314, 2016.

[116] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing, “Deep kernel learning,” in
Artificial intelligence and statistics, PMLR, 2016, pp. 370–378.

[117] B. Bohn, C. Rieger, and M. Griebel, “A representer theorem for deep kernel learning,”
The Journal of Machine Learning Research, vol. 20, no. 1, pp. 2302–2333, 2019.

[118] V. Vapnik, The nature of statistical learning theory. Springer science & business media,
1999.

[119] F. Pedregosa, G. Varoquaux, A. Gramfort, et al., “Scikit-learn: Machine learning in
python,” Journal of Machine Learning Research, vol. 12, no. 85, pp. 2825–2830, 2011. [Online].
Available: http://jmlr.org/papers/v12/pedregosa11a.html.

[120] Z. Wen, J. Shi, Q. Li, B. He, and J. Chen, “ThunderSVM: A fast SVM library on GPUs
and CPUs,” Journal of Machine Learning Research, vol. 19, pp. 797–801, 2018.

[121] K. Das, K. Bhaduri, B. L. Matthews, and N. C. Oza, “Large scale support vector
regression for aviation safety,” in 2015 IEEE International Conference on Big Data (Big
Data), IEEE, 2015, pp. 999–1006.

[122] J. Ma, J. Theiler, and S. Perkins, “Accurate on-line support vector regression,” Neural
computation, vol. 15, no. 11, pp. 2683–2703, 2003.

[123] P. Laskov, C. Gehl, S. Krüger, K.-R. Müller, K. P. Bennett, and E. Parrado-Hernández,
“Incremental support vector learning: Analysis, implementation and applications.,” Journal
of machine learning research, vol. 7, no. 9, 2006.

[124] H. Xu, R. Wang, and K. Wang, “A new svr incremental algorithm based on bound-
ary vector,” in 2010 International Conference on Computational Intelligence and Software
Engineering, IEEE, 2010, pp. 1–4.

[125] D. Ruta, L. Cen, and Q. H. Vu, “Greedy incremental support vector regression,” in
2019 Federated Conference on Computer Science and Information Systems (FedCSIS), IEEE,
2019, pp. 7–9.
