
Lecture Notes for Introduction to Nonparametric Statistics (STATS 205)

Instructor: Tengyu Ma

August 4, 2023
Contents

1 Overview of nonparametric statistics 6


1.1 What is Nonparametric Statistics? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 The Nonparametric Regression Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.2 Deterministic design and mean squared error . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.3 Random Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.4 Motivation for nonparametric regression . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Regressogram and kernel methods 10


2.1 Nonparametric regression methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Regressogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 Local averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.3 Nadaraya-Watson kernel estimator (soft-weight averaging) . . . . . . . . . . . . . . . . 11
2.1.4 Choosing the bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.5 Another example: the optimal bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . 14

3 Bias-variance trade-off 17
3.1 Motivation for using MSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Bias-variance decomposition for MSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Bias-variance trade-off in the nonparametric setting . . . . . . . . . . . . . . . . . . . . . . . 18
3.4 Case studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5 Effect of dataset size on bias-variance trade-off . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.6 Formal theorem (bias-variance characterization) . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4 Local linear and polynomial regression 23


4.1 Linear smoothers: a unified view . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 Local linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3 Local polynomial regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.4 Cross validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.4.1 Holdout dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.4.2 Leave-one-out estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

5 Splines 30
5.1 Penalized Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.2 Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.3 Background: subspaces and bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.4 Cubic spline and penalized regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.5 Interlude: A brief review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.6 Matrix notation for splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.7 Minimizing the regularized objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.8 Choosing the basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5.9 Interpretation of splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.9.1 Splines as linear smoothers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.9.2 Splines approximated by kernel estimation . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.9.3 Advanced spline methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.9.4 Confidence bands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

6 High-dimensional nonparametric methods 39


6.1 Nonparametrics in high dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.1.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.1.2 k-Nearest neighbors algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.2 Kernel method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.2.1 Motivating examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.2.2 Kernel regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.2.3 Kernel efficiencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.2.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.2.5 Existence of φ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.3 More about kernel methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.3.1 Recap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.3.2 Another approach to kernel methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.3.3 Connection to splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.3.4 Connection to nearest neighbor methods . . . . . . . . . . . . . . . . . . . . . . . . . . 47

7 Fully-connected two layer neural networks 48


7.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.2 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.2.1 A glimpse into deep learning theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.2.2 Fully-connected two layer neural networks . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.2.3 Deep neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.2.4 Equivalence to linear splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.2.5 Simplification: m goes to infinity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.2.6 Outline of the proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.3 Showing that R̃(f) = R̄(f) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
7.3.1 Recap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
7.3.2 Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.3.3 Discretization to rewrite h_θ̃(x) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.3.4 Reformulation of the objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.3.5 Simplification of the constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

8 Optimization, feature methods and transfer 57


8.1 Review and overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
8.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
8.2.1 Basic Premise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
8.2.2 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
8.2.3 Stochastic gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
8.2.4 Computing the gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
8.3 Learning Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
8.4 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
8.5 Few-shot learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
8.5.1 Nearest neighbor algorithms using features . . . . . . . . . . . . . . . . . . . . . . . . 60

9 Density estimation 62
9.1 Unsupervised learning: estimating the CDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
9.1.1 Setup of CDF estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
9.1.2 Empirical estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
9.1.3 Estimating functionals of the CDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
9.2 Unsupervised learning: density estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
9.2.1 Measuring performance of density estimators . . . . . . . . . . . . . . . . . . . . . . . 65
9.2.2 Mean squared error in high-dimensional spaces . . . . . . . . . . . . . . . . . . . . . . 66
9.2.3 Mean squared error and other errors in low-dimensional spaces . . . . . . . . . . . . . 67
9.2.4 Bias-variance tradeoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
9.2.5 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
9.2.6 Bias-variance of histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
9.2.7 Finding the optimal h∗ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
9.2.8 Proof sketch of Theorem 9.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

10 Kernel density estimation and Bayesian linear regression 72


10.1 Review and overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
10.2 Kernel density estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
10.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
10.2.2 Integrated risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
10.2.3 Choosing h empirically . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
10.3 Mixture models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
10.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
10.3.2 Gaussian mixtures and model fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
10.4 Bayesian nonparametric statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
10.4.1 Review of the Bayesian approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
10.4.2 Bayesian linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

11 Parametric/nonparametric Bayesian methods and Gaussian process 82


11.1 Review and overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
11.2 Proof of the Eq. (11.2) in Theorem 11.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
11.2.1 Proof using Lemma 11.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
11.2.2 Proof of Eq. (11.5) using singular value decomposition . . . . . . . . . . . . . . . . . . 84
11.3 Nonparametric Bayesian regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
11.3.1 Approach 1: frequentist approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
11.3.2 Approach 2: Gaussian process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
11.3.3 Bayesian prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

12 Dirichlet process 93
12.1 Review and overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
12.2 Parametric mixture models & extension to Bayesian setting . . . . . . . . . . . . . . . . . . . 93
12.2.1 Review: Gaussian mixture model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
12.2.2 Dirichlet distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
12.2.3 Bayesian Gaussian mixture model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
12.2.4 Dirichlet topic model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
12.3 Dirichlet process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
12.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
12.3.2 Exchangeability & de Finetti's theorem . . . . . . . . . . . . . . . . . . . . . . . . . 96
12.3.3 The Chinese restaurant process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
12.3.4 Explicitly constructing G (informal) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
12.3.5 Explicitly constructing G (formal) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
12.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

Acknowledgments

Chapter 1

Overview of nonparametric statistics

In this chapter, we begin our exploration of nonparametric statistics. We first describe the underlying
motivation for the field of nonparametric statistics and its general principles.

1.1 What is Nonparametric Statistics?


The overarching principle of nonparametric statistics is that it does not rely on standard parameterizations. Oftentimes, the assumption that the underlying data-generating process follows some known family of functions or distributions is constraining: the truth might be more complicated, or the data may violate the assumptions of parametric modeling. While there is no precise definition of the field of nonparametric statistics, there are a few core tenets to know.
• Make as few assumptions as possible. For example, do not assume that the data comes from a linear or quadratic model. In general, we do not assume the data belongs to any particular parametric family of probability distributions.

• No fixed set of parameters exists. For example, in nonparametric statistics we will often encounter infinite-dimensional models, infinitely many parameters, or circumstances where the dimension → ∞ as the number of data points n → ∞. In other words, the model grows in size to accommodate the complexity of the data.

Such principles are widely applicable to many areas of statistics and machine learning, such as in nonpara-
metric testing, density estimation, supervised learning, and unsupervised learning.
Particularly in this course, we consider low-dimensional data¹, with exceptions such as neural networks and some kernel methods. This restriction matters because, without strong parametric assumptions, estimating anything (a density, a CDF, etc.) in high dimensions fundamentally and statistically needs a number of samples exponential in the dimension², suffering from the "curse of dimensionality". With realistic sample sizes and no strong parametric assumptions, high-dimensional estimation errors remain of zeroth or first order.
¹In this course, low dimensions generally refers to the case when the data dimension is d = 1, 2, 3.
²The number of samples needed is at least an exponential function of the dimension of the data, since the rate of estimation error is approximately n^{−1/d}, where d is the dimension of the data and n is the number of samples needed. For more information about this concept, see All of Nonparametric Statistics, Chapter 4.5.

1.2 The Nonparametric Regression Problem
1.2.1 Setup
We start with one of the simplest estimation problems: one-dimensional regression. In such a problem, we
have n pairs of observations
(x1 , Y1 ), . . . , (xn , Yn ), xi , Yi ∈ R. (1.1)
In these notes we will often refer to each xi as an input and Yi as an output; however, other names that are
often used include
xi Yi
covariate label
features target
explanatory variable response variable
independent variable dependent variable
regressors regressand
design outcome
exogenous endogenous
Furthermore, we generally assume that Yi can be written as
Yi = r(xi ) + ξi , (1.2)
where r(·) is an unknown function we would like to approximate and ξᵢ is noise (independent of xᵢ) on the
observed output. We shall assume that E[ξᵢ] = 0 and thus, since E[f(X) | X] = f(X),
E[Yi | xi ] = E[r(xi ) + ξi | xi ] (1.3)
= r(xi ). (1.4)
Equivalently, we can define r(xi ) as
r(xi ) := E[Yi |xi ] (1.5)
and then ξi = Yi − r(xi ), which implies E[ξi ] = 0 by the law of total expectation3 .
With these definitions, we can view the regression problem under two frameworks: deterministic inputs
or random inputs, and we will explore both possibilities in the subsequent sections.

1.2.2 Deterministic design and mean squared error


In this view, we treat x1 , . . . , xn as fixed, deterministic inputs with Y1 , . . . , Yn being random variables. Our
goal then is to estimate or recover r(x1 ), . . . , r(xn ) as accurately as possible. Pictorially, we can see this
in Figure 1.1, where the red circular open dots are the "noisy" observations and the blue curve is r(x), whose values r(xᵢ) (black circular open dots) are what we aim to recover.
Our estimator r̂ is specified by its values r̂(x₁), …, r̂(xₙ). One common and simple way to evaluate the strength of r̂ is its mean squared error (MSE),

MSE(r̂) = (1/n) Σ_{i=1}^n (r̂(xᵢ) − r(xᵢ))². (1.6)

However, because our estimate r̂ is a function of Y₁, …, Yₙ, which are random variables, r̂ is also a random variable. We then might instead want to consider the expectation over the Yᵢ:

MSE = E_{Yᵢ}[MSE(r̂)] = E_{Yᵢ}[(1/n) Σ_{i=1}^n (r̂(xᵢ) − r(xᵢ))²]. (1.7)

3 E[X] = E[E[X | Y ]]


Figure 1.1: Graphical representation of regression problem

1.2.3 Random Design


Alternatively, we could view our inputs as independent and identically distributed random variables X₁, …, Xₙ ~iid P (note the upper-case Xᵢ now), where Yᵢ = r(Xᵢ) + ξᵢ. Our interpretation of this perspective while modeling the problem remains very similar, and our estimator is still r̂;
however, in this case we can define MSE as

MSE := EX∼P [(r̂(X) − r(X))2 ]. (1.8)

For the rest of these notes, though, we will remain within the deterministic design paradigm described above.

1.2.4 Motivation for nonparametric regression


With our current understanding of the regression problem at hand, one may claim we can solve such problems parametrically by assuming that each Yᵢ is a linear combination of the input (as in linear regression) or some polynomial combination of the input (as in polynomial regression). However, the assumptions underlying these models are often violated. For example, consider a regression where r(x) is neither linear nor polynomial, as in Figure 1.2, where r(x) cannot be fitted well by any polynomial.
To see why polynomial regression would fail, suppose we fit r(x) with a degree-k polynomial f(x), and suppose there are distinct points x₁, …, x_{k+1} with f(x₁) = ⋯ = f(x_{k+1}) = c (they lie on the right side of Figure 1.2). Since a degree-k polynomial is uniquely determined by its values at k + 1 distinct points, the only possible solution is f(x) = c, the constant function. Therefore it is not possible to fit the points on both the right and left hand sides of this curve. We now see that we cannot simply make an assumption about the behavior of r(x) in this case, and must turn to methods that generalize to functions with arbitrary behavior. We turn now to some methodologies of nonparametric regression.


Figure 1.2: Nonparametric regression problem on which polynomial regression fails.

Chapter 2

Regressogram and kernel methods

2.1 Nonparametric regression methods


2.1.1 Regressogram
Our first methodology for the modeling problem is known as the regressogram. The algorithm is quite rudimentary: divide the domain of x into some number of bins (assume that the bins are of equal size), as shown in Figure 2.1. For each bin Bⱼ, the estimator at any x ∈ Bⱼ is defined as

r̂(x) = (1/|Bⱼ|) Σ_{i∈Bⱼ} Yᵢ, (2.1)

which is, in other words, the average of all Yᵢ with xᵢ ∈ Bⱼ. Because each point falls into some Bⱼ (and for bins containing no observations, the prediction is undefined), we recover a piecewise constant function: every point within a particular Bⱼ receives the same value of r̂. While this approach is simple, it is not often used in practice, as choosing the binning method and bin size is quite tricky. Additionally, some bins may capture few observations, while a single bin may span several regions where the function varies substantially.
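To make the method concrete, here is a minimal Python sketch of a regressogram (my own illustration, not from the course materials; the sinusoidal toy data and the choice of 10 equal-width bins are arbitrary assumptions):

import numpy as np

def regressogram(x, y, n_bins, lo, hi):
    """Fit a regressogram: split [lo, hi] into n_bins equal bins and
    predict, in each bin, the mean of the Y_i whose x_i fall in it."""
    edges = np.linspace(lo, hi, n_bins + 1)
    # Bin index of each training point (clip so x = hi lands in the last bin).
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    means = np.full(n_bins, np.nan)          # NaN marks empty bins (no prediction)
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            means[b] = y[mask].mean()
    def r_hat(x_new):
        j = np.clip(np.digitize(x_new, edges) - 1, 0, n_bins - 1)
        return means[j]
    return r_hat

# Toy usage: noisy observations of r(x) = sin(2*pi*x) on [0, 1].
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=200)
r_hat = regressogram(x, y, n_bins=10, lo=0.0, hi=1.0)
print(r_hat(np.array([0.05, 0.55])))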

2.1.2 Local averaging


As the original regressogram can easily fail to capture, or over-capture, a set of observations, we move to a modified version of binning known as local averaging. Here, we instead define bins dynamically: for each observation (xᵢ, Yᵢ), we define a bin of size h in each direction around that value. Each bin is therefore defined in terms of a particular observation, B_{xᵢ} = {j : |xⱼ − xᵢ| ≤ h}, and our estimator is now

r̂(xᵢ) = (1/|B_{xᵢ}|) Σ_{j∈B_{xᵢ}} Yⱼ, (2.2)

which is, similar to before, an average, but now with the nuance that each bin is defined with respect to each observation. One such bin with locally averaged r̂(xᵢ) can be seen in Figure 2.2. Note that we are making the assumption that within each B_{xᵢ}, r is a constant, denoted by a. Following this assumption, we can then derive r̂(xᵢ) as the minimizer of the MSE inside each local bin, namely

r̂(xᵢ) = argmin_a (1/|B_{xᵢ}|) Σ_{j∈B_{xᵢ}} (Yⱼ − a)². (2.3)


Figure 2.1: Regressogram binning with piecewise constant functions

Note that the objective is convex in a, so we can solve for this minimum explicitly. We have that

(d/da) [(1/|B_{xᵢ}|) Σ_{j∈B_{xᵢ}} (Yⱼ − a)²] = −(2/|B_{xᵢ}|) Σ_{j∈B_{xᵢ}} (Yⱼ − a). (2.4)

Thus, setting the derivative equal to 0 and rearranging yields

0 = (2/|B_{xᵢ}|) Σ_{j∈B_{xᵢ}} (Yⱼ − a), (2.5)

r̂(xᵢ) = a = (1/|B_{xᵢ}|) Σ_{j∈B_{xᵢ}} Yⱼ. (2.6)

As with the regressogram, we make some assumptions in local averaging that may not be appropriate for our regression. Namely, we assume that we can safely ignore all points outside the boundary of each B_{xᵢ}, even if such points lie just outside the bin determined by h. This problem motivates our next methodology: soft-weight averaging.

2.1.3 Nadaraya-Watson kernel estimator (soft-weight averaging)


As we saw in the previous section on local averaging, we make a potentially detrimental assumption that all points outside the boundary of a given bin should not be considered for our estimator r̂(x). To alleviate this problem, we introduce soft-weight averaging, which assigns each observation a weight that can depend on its distance from x. More specifically, we define our new constant estimate a at each point by weighting all n observations as follows:

argmin_{a∈ℝ} Σ_{i=1}^n wᵢ (Yᵢ − a)². (2.7)

We can solve this in the same way as (2.3), finding the minimizer by taking the derivative with respect to a:

(d/da) (Σ_{i=1}^n wᵢ (Yᵢ − a)²) = −2 Σ_{i=1}^n wᵢ (Yᵢ − a). (2.8)

Figure 2.2: Computing the local average around a single xᵢ. Our estimate r̂(xᵢ) is equal to the average of the r(xᵢ)'s (red line) within the local bin B_{xᵢ} (black dotted lines).

Solving for when the derivative is zero yields

0 = Σ_{i=1}^n wᵢ (Yᵢ − a), (2.9)

r̂(x) = a = Σ_{i=1}^n wᵢ Yᵢ / Σ_{i=1}^n wᵢ. (2.10)

Notice that if wi = 1 {|xi − x| ≤ h}, we recover the same local averaging we have previously described. In
general, we desire wi to be smaller as |xi − x| increases and for any |xi − x| > |xj − x| it follows that wi < wj .

Kernel functions
We now need to define a weighting scheme that satisfies our weighting desires, and we will do so using a
kernel estimator. Namely, because we desire to have a weighting dependent on the distance a particular
observation xᵢ is from x, we will define

wᵢ = f(xᵢ − x) = K((xᵢ − x)/h), (2.11)

where K(·) is a kernel function and h is the bandwidth.


Definition 2.1 (Kernel function). K : ℝ → ℝ is called a kernel function if it is non-negative and satisfies the following properties.
1. ∫_ℝ K(t)dt = 1 (K is normalized).
2. ∫_ℝ tK(t)dt = 0 (K is symmetric about zero).
3. σ_t² := ∫_ℝ t²K(t)dt > 0 (K is non-degenerate).

Now, we will discuss four variants of kernels. We start with the boxcar kernel

K(t) = (1/2) 1{|t| ≤ 1}. (2.12)


Figure 2.3: Several common kernels with bandwidth 1.

The boxcar is named for its box-like shape and can be seen in red in Figure 2.3, overlaid with the Gaussian and other kernels. One then sees that it corresponds exactly to local averaging for some h. Next we have the Gaussian kernel

K(t) = (1/√(2π)) exp(−t²/2). (2.13)

Unlike the boxcar, here we have some non-zero weight for xᵢ even when |xᵢ − x| > h, which tapers off as that difference increases. Other choices of kernels include the Epanechnikov kernel

K(t) = (3/4)(1 − t²) 1{|t| ≤ 1} (2.14)

and the tricube kernel

K(t) = (70/81)(1 − |t|³)³ 1{|t| ≤ 1}. (2.15)

In practice, however, the explicit choice among these kernels (excluding the boxcar) is empirically not that important.
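As a concrete illustration, the following minimal NumPy sketch (my own, not from the notes) implements the Nadaraya-Watson estimator (2.10) with the Gaussian kernel (2.13); the bandwidth and the toy data are arbitrary assumptions:

import numpy as np

def gaussian_kernel(t):
    # K(t) = (1 / sqrt(2*pi)) * exp(-t^2 / 2), as in (2.13).
    return np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

def nadaraya_watson(x_train, y_train, x_query, h, K=gaussian_kernel):
    """Soft-weight average: r_hat(x) = sum_i w_i Y_i / sum_i w_i,
    with weights w_i = K((x_i - x) / h)."""
    # Shape (n_query, n_train): weight of each training point for each query.
    w = K((x_train[None, :] - x_query[:, None]) / h)
    return (w * y_train[None, :]).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=100)
xq = np.linspace(0, 1, 5)
print(nadaraya_watson(x, y, xq, h=0.1))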

2.1.4 Choosing the bandwidth


Finally, we approach the discussion of choosing our bin width h. Larger or smaller values of h can dramatically
change our estimates r̂(xi ), so understanding how to do so is critical for our regression. To see why this is
true, consider any choice of kernel. Here we see that if we increase h, we are simply increasing the bin widths
(imagining a larger width in, say, Figure 2.2) and thereby averaging more observations for any particular xi .
In other words, a large h allows for greater weightings wi for farther observations.
In practice, we see that this choice of h is highly dependent on the data. For example, if observations
are truly close to one another and have a similar true value of r(x), we will achieve better results using a
large h. To see this, consider the example set of observations in Figure 2.4, where we see small fluctuations
across observations but on average not very many differences and assume that the true value is somewhere
along the average of these observations. Here, an h = 0 would yield separate constants for each observation
and no “denoising” for each observation. On the other hand, choosing h = ∞ here would yield the same
constant prediction r̂(xᵢ) for each xᵢ, which we can see as follows (writing r(xᵢ) = c for all i):

(1/n) Σ_{i=1}^n Yᵢ = (1/n) Σ_{i=1}^n (r(xᵢ) + ξᵢ) = (1/n) Σ_{i=1}^n (c + ξᵢ) = c + (1/n) Σ_{i=1}^n ξᵢ. (2.16)


Figure 2.4: Example where choosing a large h would yield a better estimate

Figure 2.5: Example where choosing a small h would yield a better estimate

Suppose for simplicity that ξ₁, …, ξₙ are i.i.d. N(0, 1) (for large n, the central limit theorem justifies a similar approximation). Then (1/n) Σ_{i=1}^n ξᵢ ∼ N(0, 1/n), whose standard deviation is 1/√n. Therefore, we can simplify the last term as follows:

r̂(xᵢ) = c + (1/n) Σ_{i=1}^n ξᵢ ≈ c ± 1/√n. (2.17)

If instead we had chosen a moderate h, our estimate at each xᵢ would not include all observations, and a similar derivation gives

r̂(xᵢ) = (1/|B_{xᵢ}|) Σ_{j∈B_{xᵢ}} Yⱼ = c + (1/|B_{xᵢ}|) Σ_{j∈B_{xᵢ}} ξⱼ ≈ c ± 1/√|B_{xᵢ}|, (2.18)

where we see that we could be obtaining a noisier estimate compared to a larger h.
Finally, consider another extreme case, where the true r(x) fluctuates a lot but the noise ξᵢ is very small, as in Figure 2.5. Here, a large h would yield a poorer result than choosing h = 0: even though the locally constant assumption may be inaccurate, with h = 0 we still obtain the reasonable estimate r̂(xᵢ) = Yᵢ.
In practice, choosing the optimal bandwidth is hard. Oftentimes we try many different values of h and see what works best, using methods such as cross-validation, which we will explore in Section 4.4.

2.1.5 Another example: the optimal bandwidth


Consider the function r(x), along with the given data points, shown in the plot below:

[Plot: the true function r(x) with the observed data points.]

We would like to use local averaging to estimate r(xi ) at each data point xi . We consider three choices for
the bandwidth h.
1. h = ∞: when h is infinite, we simply average over all data points in the dataset when making a
prediction for a given point x. Thus, the resulting estimator r̂ is simply a constant function, as shown
in blue below:

[Plot: the constant estimate r̂(x) (blue) overlaid on r(x).]

h = ∞ is clearly undesirable in this example, because r̂(x) carries no information about the underlying
function r(x).
2. h = 0: the only data point xi that satisfies |x − xi | ≤ h is x itself. Thus, for a given xi in the training
set, the predicted value of r(xi ) will be precisely equal to the observed outcome Yi (i.e. r̂(xi ) = Yi ), so
that we fail to denoise the dataset at all.

[Plot: with h = 0, the estimate r̂ interpolates the noisy observations.]

3. h = 1: This is the best choice for this example. h = 1 has the effect of averaging only the nearby
points when making a prediction on an example xi . This is desirable because points substantially far
from xi exhibit quite different behavior and thus should not influence the estimate of r(xi ). h = 1 has
the effect of adequately denoising the data without smoothing the estimator r̂ too severely, as the plot
below illustrates.

[Plot: with h = 1, the estimate r̂ denoises the data while tracking r(x).]

Chapter 3

Bias-variance trade-off

In this chapter, we discuss the bias-variance trade-off and its implications in the nonparametric setting. We
start with some case studies in the setting of kernel estimators, and finish with the general Theorem 3.1.

3.1 Motivation for using MSE


We begin with a motivation for why we use MSE to measure performance of an estimator. We claim that
MSE is very closely related to predictive risk, which measures how well r̂(x) performs on a new observation
(with fresh randomness). In particular, suppose we have a new observation Z = r(x) + ξ, where ξ is a
random variable with mean 0 and variance σ 2 (i.e. ξ is the noise). Then we define the predictive risk of the
estimator r̂ at a point x to be

predictive risk(r̂, x) = E_Z[(r̂(x) − Z)²]. (3.1)

In words, the predictive risk is our expectation of the squared error of our estimator, with the expectation
being taken over the noise in Z.
Then we may consider the average predictive risk of r̂ on our inputs xᵢ:

predictive risk(r̂) ≜ E_{Zᵢ}[(1/n) Σ_{i=1}^n (r̂(xᵢ) − Zᵢ)²]
  = (1/n) Σ_{i=1}^n E_{Zᵢ}[(r̂(xᵢ) − Zᵢ)²]
  = (1/n) Σ_{i=1}^n E[(r̂(xᵢ) − r(xᵢ) − ξᵢ)²]
  = (1/n) Σ_{i=1}^n E[(r̂(xᵢ) − r(xᵢ))² − 2(r̂(xᵢ) − r(xᵢ))ξᵢ + ξᵢ²]
  = (1/n) Σ_{i=1}^n ((r̂(xᵢ) − r(xᵢ))² − 2(r̂(xᵢ) − r(xᵢ))E[ξᵢ] + E[ξᵢ²])
  = (1/n) Σ_{i=1}^n ((r̂(xᵢ) − r(xᵢ))² + E[ξᵢ²])
  = (1/n) Σ_{i=1}^n (r̂(xᵢ) − r(xᵢ))² + σ²
  = MSE(r̂) + constant. (3.2)

Here, the MSE (mean-squared error) of r̂ is just the average of the squared errors between our predicted
values r̂(xᵢ) and the true r(xᵢ):

MSE(r̂) = (1/n) Σ_{i=1}^n (r̂(xᵢ) − r(xᵢ))². (3.3)
We thus have that MSE and predictive risk are identical up to a constant. Note that in this derivation,
we have used the fact that the expectation is taken with respect to Zi to pull the constants with respect to
Zi (in particular, r̂(xi ) − r(xi )) out of the expectation.

3.2 Bias-variance decomposition for MSE


Recall that our estimator r̂ is random because it is a function of the observed outcomes Yi , which are random
due to the random noise in the observations. In contrast, the true underlying function r is fixed. Thus,

MSE := E_{Yᵢ}[MSE(r̂)]
  = E[(1/n) Σ_{i=1}^n (r̂(xᵢ) − r(xᵢ))²]
  = (1/n) Σ_{i=1}^n E[(r̂(xᵢ) − r(xᵢ))²]. (3.4)

For each term of the sum,

E[(r̂(xᵢ) − r(xᵢ))²] = (E[r̂(xᵢ) − r(xᵢ)])² + Var(r̂(xᵢ) − r(xᵢ))
  = (E[r̂(xᵢ)] − r(xᵢ))² + Var(r̂(xᵢ)), (3.5)

where we used the fact that for any random variable X, E[X²] = E[X]² + Var(X), and the fact that shifting a random variable by a constant does not change the variance. Combining (3.4) and (3.5) and using the fact that r(x) is fixed, we see that

MSE = (1/n) Σ_{i=1}^n (E[r̂(xᵢ)] − r(xᵢ))² + (1/n) Σ_{i=1}^n Var(r̂(xᵢ)), (3.6)

where the first sum is called the (squared) bias and the second sum is the variance.
Note that all results we have derived thus far hold for all estimators, parametric and nonparametric. We
will next discuss results related to the bias-variance trade-off in the nonparametric setting.
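The decomposition (3.6) can be verified numerically. The sketch below is my own illustration (it assumes the nadaraya_watson helper and NumPy import from the sketch in Section 2.1.3 are in scope): it regenerates the noise many times with the design held fixed, estimates E[r̂(xᵢ)] by averaging over replications, and checks that squared bias plus variance matches the MSE.

rng = np.random.default_rng(0)
n, sigma, h, reps = 100, 0.3, 0.1, 500
x = np.sort(rng.uniform(0, 1, n))        # fixed (deterministic) design
r = np.sin(2 * np.pi * x)                # true values r(x_i)

# r_hat_all[k, i] = estimate at x_i on the k-th resampled dataset.
r_hat_all = np.empty((reps, n))
for k in range(reps):
    y = r + sigma * rng.normal(size=n)   # fresh noise, same design
    r_hat_all[k] = nadaraya_watson(x, y, x, h)

mean_rhat = r_hat_all.mean(axis=0)               # estimates E[r_hat(x_i)]
bias_sq = np.mean((mean_rhat - r) ** 2)          # first term of (3.6)
variance = np.mean(r_hat_all.var(axis=0))        # second term of (3.6)
mse = np.mean((r_hat_all - r[None, :]) ** 2)     # should match bias_sq + variance
print(bias_sq, variance, bias_sq + variance, mse)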

3.3 Bias-variance trade-off in the nonparametric setting


Let us look at the bias and variance of kernel estimators. Note that the bias is closely related to E[r̂(x)]. Thus, we compute E[r̂(x)] to investigate the bias:

r̂(x) = Σ_{j=1}^n wⱼ Yⱼ / Σ_{j=1}^n wⱼ = (1/Σ_{j=1}^n wⱼ) Σ_{j=1}^n wⱼ (r(xⱼ) + ξⱼ).

By linearity of expectation,

E[r̂(x)] = (1/Σ_{j=1}^n wⱼ) Σ_{j=1}^n wⱼ (r(xⱼ) + E[ξⱼ]) = Σ_{j=1}^n wⱼ r(xⱼ) / Σ_{j=1}^n wⱼ,

which gives us that

Bias = (1/n) Σ_{i=1}^n (Σ_{j=1}^n wⱼ r(xⱼ) / Σ_{j=1}^n wⱼ − r(xᵢ))², (3.7)

where wⱼ = K(|xⱼ − x|/h) (with x = xᵢ in the i-th summand).

Thus, we see that E[r̂(x)] is equivalent to applying the estimator to the “clean data” (i.e. data with
no noise). E[r̂(x)] is a “smoother” version of r(x), as demonstrated in the plot below. The bias provides
a measure of how much information is lost in the process of smoothing the initial function. Note that as
bandwidth h increases, r̂(x) becomes smoother because the higher the value of h, the more weight we give
to points further from x. Thus, as h increases, bias increases.
[Plot: E[r̂(x)] as a smoothed version of r(x).]

To compute the variance of r̂(x), assume the ξᵢ are independent with mean 0 and variance σ². Then

Var(r̂(x)) = Var(Σ_{j=1}^n wⱼ Yⱼ / Σ_{j=1}^n wⱼ) = (1/(Σ_{j=1}^n wⱼ)²) Σ_{j=1}^n wⱼ² Var(Yⱼ) = (Σ_{j=1}^n wⱼ² / (Σ_{j=1}^n wⱼ)²) σ². (3.8)

Let us compute the value of the variance in the case of local averaging. Recall that the kernel used for local
averaging is the Boxcar kernel: wj = 1{|xj − x| ≤ h}, so that (w1 , . . . , wn ) = (0, . . . , 0, 1, . . . , 1, 0, . . . , 0),
where the number of 1’s is equal to the number of elements that are at most a distance of h from x.
Suppose we take x = xᵢ. Let B_{xᵢ} = {xⱼ : |xⱼ − xᵢ| ≤ h} and let n_{xᵢ} = |B_{xᵢ}|. Then

Var(r̂(xᵢ)) = (Σ_{j=1}^n wⱼ² / (Σ_{j=1}^n wⱼ)²) σ² = (Σ_{xⱼ∈B_{xᵢ}} 1 / (Σ_{xⱼ∈B_{xᵢ}} 1)²) σ² = (n_{xᵢ}/n_{xᵢ}²) σ² = σ²/n_{xᵢ}. (3.9)

We thus see that, in the case of local averaging, the variance of r̂(xᵢ) decreases as a function of the number of points in the neighborhood of xᵢ. As h → ∞, the number of points being averaged over increases, so h = ∞ gives the smallest possible variance, σ²/n, and h = 0 gives the largest variance, σ².

3.4 Case studies


Consider the constant function r(x) = c shown below.

Because r(x) is a constant function, the bias equals 0 for any choice of h: E[r̂(x)] = Σ_{i=1}^n wᵢ r(xᵢ) / Σ_{i=1}^n wᵢ = Σ_{i=1}^n wᵢ c / Σ_{i=1}^n wᵢ = c, so that E[r̂(x)] = c = r(x) for any choice of kernel and h. However, the variance is sensitive to the bandwidth. In this case, we should pick h = ∞ to minimize variance and thus minimize MSE.

Next, let r(x), and the observed data points, be as shown below. Note that r(x) fluctuates quite sub-
stantially but the observed points have no noise (σ = 0).

In this case, the variance (Σ_{j=1}^n wⱼ² / (Σ_{j=1}^n wⱼ)²) σ² = 0 for all h. However, because the function fluctuates, the choice of bandwidth has a large effect on the bias, as it affects how "smooth" our estimates will be. Thus, in this case, the bandwidth should be chosen to minimize bias, i.e. h = 0.

Finally, consider the function from Section 2.1.5, shown again below:

As discussed in Section 2.1.5, it is desirable to pick a value of h between 0 and ∞, because we would like
to average over the noise in the dataset (which decreases variance) without smoothing our estimate too
substantially (which would increase bias).

3.5 Effect of dataset size on bias-variance trade-off


In general, when we have more data, we can assume that there will be more observations within a given neighborhood of x. Because the variance depends on the number of data points being averaged over, increasing the number of observations will decrease the variance. However, note that Bias(r̂) = (1/n) Σ_{i=1}^n (E[r̂(xᵢ)] − r(xᵢ))² depends only on the estimator's performance on the clean data, not on the noise in the dataset. Thus, increasing the number of observations does not change the bias.
Suppose you were at a "sweet spot" of the bias-variance trade-off, and then you observe more data points. This has the effect of decreasing variance while keeping bias constant. Thus, to "re-balance" the bias and variance, you should decrease h slightly, so as to decrease the bias and increase the variance.

3.6 Formal theorem (bias-variance characterization)
We now make the previous discussions of the bias-variance trade-off in the context of kernel estimators more
precise with a formal theorem. While the MSE of an estimator does not depend on the realized noise in the dataset, it does depend on the (arbitrary) positions of x₁, …, xₙ. For theoretical feasibility, we assume x₁, …, xₙ are drawn i.i.d. from P, where P has density f(x). Additionally, the theorem has the following setup:
• Assume we have an estimator r̂ₙ with n samples and that n tends to ∞.
• For a fixed n, the bandwidth is hₙ.
• We use the integrated risk to measure the quality of an estimator: R(r̂ₙ, r) ≜ ∫ (r̂ₙ(x) − r(x))² dx.
Theorem 3.1. The risk of a Nadaraya-Watson kernel estimator is

R(r̂ₙ, r) = (hₙ⁴/4) (∫ x²K(x)dx)² ∫ (r''(x) + 2r'(x) f'(x)/f(x))² dx (3.10)
  + (σ² ∫ K(x)²dx / (n hₙ)) ∫ (1/f(x)) dx (3.11)
  + o((n hₙ)^{−1}) + o(hₙ⁴). (3.12)

The term (3.10) is the bias of the estimator and the term (3.11) is the variance. The terms in (3.12) are higher order terms that go to 0 as n hₙ → ∞ and hₙ → 0.
We will next try to decompose this theorem to understand each of the individual parts.

Bias: The theorem tells us that the bias depends on the following quantities:
• Bandwidth hₙ: the smaller the bandwidth, the lower the bias.
• ∫ x²K(x)dx: this quantity is a measure of how flat the kernel is. The flatter the kernel, the larger the value of ∫ x²K(x)dx and thus the higher the bias. This aligns with our intuition: the flatter the kernel, the more we weight points further away, and thus the smoother our estimator.
• r''(x): this is a measure of the curvature of r(x). The smoother the function, the lower the value of r''(x) and thus the lower the bias.
• r'(x) f'(x)/f(x): note that this quantity is equal to r'(x)(log f(x))'. This term is called the design bias because it depends on the "design" (i.e., the distribution of the xᵢ's). The design bias is small when (log f(x))' is small (i.e., when the density of X doesn't change too quickly, or X is "close to uniform").

Variance: According to the theorem, the variance depends on the following quantities:

• σ 2 : The larger the value of σ 2 (i.e. the variance of the random noise), the higher the variance.
• hn : The higher the bandwidth, the lower the variance.
• n: The larger the value of n, the smaller the variance.

Implications for the bandwidth hₙ: Let us treat K, f, and r as constants. We are interested in seeing how the optimal bandwidth hₙ* changes as a function of σ² and n. Holding K, f, and r constant, we can express the risk as

R(r̂ₙ, r) = hₙ⁴ c₁ + c₂ σ²/(n hₙ) + higher order terms, (3.13)

where c₁ and c₂ are constants. We are thus interested in the choice of hₙ that minimizes this risk. Taking the derivative of the risk and setting it equal to zero, 4hₙ³c₁ − c₂σ²/(n hₙ²) = 0, we see that

hₙ* = c₃ (σ²/n)^{1/5}. (3.14)

This tells us that the optimal bandwidth decreases at a rate proportional to n^{−1/5}.
Plugging hₙ* into the risk equation, we see that

min_{hₙ} (hₙ⁴ c₁ + σ² c₂/(n hₙ)) = c₄ (σ²/n)^{4/5}, (3.15)

so that the lowest risk is on the order of n^{−4/5}. Note that the risk for most parametric models is on the order of n^{−1}, a slight improvement over the risk for the nonparametric models we have discussed.
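These rates are easy to check numerically. The short sketch below is my own illustration (the constants c₁ = c₂ = σ² = 1 are arbitrary assumptions): it minimizes hₙ⁴c₁ + c₂σ²/(nhₙ) over a grid of bandwidths and compares the minimizer with the closed form hₙ* = (c₂σ²/(4c₁n))^{1/5} obtained from setting the derivative to zero.

import numpy as np

c1 = c2 = sigma2 = 1.0
hs = np.logspace(-3, 0, 2000)              # grid of candidate bandwidths

for n in [100, 1000, 10000]:
    risk = c1 * hs**4 + c2 * sigma2 / (n * hs)
    h_star = hs[np.argmin(risk)]           # grid minimizer
    h_theory = (c2 * sigma2 / (4 * c1 * n)) ** 0.2   # closed form ~ n^(-1/5)
    # The minimal risk should scale like n^(-4/5).
    print(n, h_star, h_theory, risk.min(), n ** (-0.8))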

Chapter 4

Local linear and polynomial regression

We introduce the concept of linear smoothers as a way to unify many common nonparametric regression methods, and continue with an introduction to another approach to nonparametric regression: local linear regression. We discuss how local linear regression overcomes some of the challenges faced by kernel estimators. We then proceed further into classical nonparametric methods: first, we explore local polynomial regression as an extension of local linear regression; second, we evaluate different methods of tuning a regression, including cross validation (where one selects various hyperparameters) and holdout methods (where the dataset is split into training and validation sets).

4.1 Linear smoothers: a unified view


We now take a brief detour to introduce the concept of linear smoothers, which serve as a way to unify many
of the aforementioned nonparametric regression methods.

Definition 4.1. r̂ is a linear smoother if there exists a vector-valued function x ↦ (l₁(x), …, lₙ(x)) such that

r̂(x) = Σ_{i=1}^n lᵢ(x) Yᵢ. (4.1)

Note that the lᵢ's can depend on x₁, …, xₙ but not on Y₁, …, Yₙ. Additionally, we must have that Σ_{i=1}^n lᵢ(x) = 1.

Theorem 4.2. The regressogram and kernel estimator are both instances of linear smoothers.

• Regressogram:

r̂(x) = (1/|Bₓ|) Σ_{i∈Bₓ} Yᵢ = Σ_{i=1}^n (1{xᵢ ∈ Bₓ}/|Bₓ|) Yᵢ = Σ_{i=1}^n lᵢ(x) Yᵢ, (4.2)

where Bₓ is the bin containing x and lᵢ(x) = 1{xᵢ ∈ Bₓ}/|Bₓ|.

• Kernel estimator:

r̂(x) = Σ_{i=1}^n wᵢ Yᵢ / Σ_{i=1}^n wᵢ = Σ_{i=1}^n (wᵢ / Σ_{j=1}^n wⱼ) Yᵢ = Σ_{i=1}^n lᵢ(x) Yᵢ, (4.3)

where lᵢ(x) = wᵢ / Σ_{j=1}^n wⱼ.

Thus, linear smoothers provide a "unified view" in that they provide a category into which many different types of estimators fall. In the next section, we will introduce the method of local linear regression, which we show is yet another instance of a linear smoother. We now conclude this section with a few facts about linear smoothers (a code sketch after this list makes the first fact concrete).
• We can write any linear smoother in the matrix multiplication form r̂ = LY, where r̂ = (r̂(x₁), …, r̂(xₙ))ᵀ, Y = (Y₁, …, Yₙ)ᵀ, and L is the n × n matrix with entries Lᵢⱼ = lⱼ(xᵢ).
• For all linear smoothers,

E[r̂(x)] = E[Σ_{i=1}^n lᵢ(x) Yᵢ] = Σ_{i=1}^n lᵢ(x) E[Yᵢ] = Σ_{i=1}^n lᵢ(x) r(xᵢ), (4.4)

so E[r̂(x)] is equal to the estimate when the estimator is run on clean data. As in the particular case of kernel estimators, the bias is an indicator of how much we damage the clean data by smoothing.
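To make the matrix form concrete, the sketch below (my own illustration) builds the smoother matrix L for the Nadaraya-Watson estimator with a Gaussian kernel, so that the fitted values are r̂ = LY and each row of L sums to 1:

import numpy as np

def nw_smoother_matrix(x, h):
    """Smoother matrix L with L[i, j] = l_j(x_i) for the
    Nadaraya-Watson estimator with a Gaussian kernel."""
    t = (x[None, :] - x[:, None]) / h          # t[i, j] = (x_j - x_i) / h
    w = np.exp(-t**2 / 2)                      # unnormalized Gaussian weights
    return w / w.sum(axis=1, keepdims=True)    # normalize so rows sum to 1

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 50))
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=50)
L = nw_smoother_matrix(x, h=0.1)
r_hat = L @ y                                  # fitted values, r_hat = L Y
print(np.allclose(L.sum(axis=1), 1.0), r_hat[:3])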

4.2 Local linear regression


Local linear regression provides an alternative to kernel estimators. To motivate why we might want such
an alternative, let us revisit the idea of design bias. Consider the following function r(x), along with the
observed data points, shown below.

[Plot: r(x) with observed data points; three query points a, b, c marked on the x-axis.]

If we were to use a kernel estimator to attempt to predict r(a), we would overestimate the true value.
This is because all observed points that are close to a are to the right of a, so we would average primarily
over points whose y-values are larger than a’s. The predicted value for r(a) using local averaging is marked
in green in the plot below.

[Plot: r(x), the observed data, and the local-averaging prediction r̂(a) (green); the averaging window of width 2h around a contains only points to the right of a.]

Similarly, using a kernel estimator to predict r(c) would result in an underestimate. In contrast, be-
cause b has nearby points on either side, the prediction for r(b) would be reasonable. This problem of
over/underestimating r(x) for points that have no observations on one side (i.e. points on the boundary of
a given bin) is closely related to the design bias. It occurs because, when using kernel estimators, we make
the assumption that r(x) is locally constant.
Local linear regression provides a solution to this problem. Rather than assuming r(x) is locally constant,
we assume that r(x) is locally linear. For a given data point x, we would like to approximate r(x) by locally
fitting a line based at x to our data. Let r̃x (u) = a1 (u−x)+a0 . Then the algorithm for local linear regression
is as follows.
For a given data point x, let

(â₀, â₁) = argmin_{a₀,a₁} Σ_{i=1}^n wᵢ (Yᵢ − r̃ₓ(xᵢ))² = argmin_{a₀,a₁} Σ_{i=1}^n wᵢ (Yᵢ − (a₁(xᵢ − x) + a₀))², (4.5)

where wᵢ = K((xᵢ − x)/h) for some kernel function K. Then, we let our estimate r̂(x) equal the intercept term:

r̂(x) = â₀. (4.6)

Theorem 4.3. The integrated risk of local linear regression consists of a bias term and a variance term. The variance is the same as that from Theorem 3.1. The bias is

(hₙ⁴/4) (∫ x²K(x)dx)² ∫ r''(x)² dx. (4.7)

If we compare this bias to the bias of the kernel estimators, (hₙ⁴/4)(∫ x²K(x)dx)² ∫ (r''(x) + 2r'(x) f'(x)/f(x))² dx, we see that the bias for local linear regression does not have the 2r'(x) f'(x)/f(x) term (i.e. no design bias!). Local linear regression thus mitigates the problem of design bias that we encounter when using kernel estimators. The diagram below provides some intuition as to why this is the case.

[Plot: the locally fitted line r̃ₐ(x) at the boundary point a, whose intercept r̂(a) is close to r(a).]

We see that the local linear assumption allows us to better approximate r(x) for values of x that lie on
the boundary. We conclude by proving that local linear regression is another instance of a linear smoother.
(â₀, â₁) = argmin_{a₀,a₁} Σ_{i=1}^n wᵢ (Yᵢ − (a₁(xᵢ − x) + a₀))² = argmin_{a₀,a₁} g(a₀, a₁),

∂g/∂a₀ = 2 Σ_{i=1}^n wᵢ (a₁(xᵢ − x) + a₀ − Yᵢ) = 0,

∂g/∂a₁ = 2 Σ_{i=1}^n wᵢ (a₁(xᵢ − x) + a₀ − Yᵢ)(xᵢ − x) = 0.

By solving this system of equations for a₀ and a₁, we get that

r̂(x) = â₀ = Σ_{i=1}^n (wᵢ C Yᵢ − wᵢ(xᵢ − x) B Yᵢ) / (AC − B²) = Σ_{i=1}^n [(wᵢ C − wᵢ(xᵢ − x) B) / (AC − B²)] Yᵢ, (4.8)

where A = Σ_{i=1}^n wᵢ, B = Σ_{i=1}^n wᵢ(xᵢ − x), and C = Σ_{i=1}^n wᵢ(xᵢ − x)². We thus see that r̂(x) is a linear combination of the Yᵢ's, so local linear regression is indeed a linear smoother.
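The closed form (4.8) translates directly into code. Below is a minimal sketch (my own; the Gaussian kernel and toy data are arbitrary assumptions) that evaluates r̂(x₀) = â₀ via the quantities A, B, and C:

import numpy as np

def local_linear(x_train, y_train, x0, h):
    """Local linear estimate r_hat(x0) = a0_hat via the closed form (4.8)."""
    d = x_train - x0
    w = np.exp(-(d / h) ** 2 / 2)              # Gaussian kernel weights
    A = w.sum()
    B = (w * d).sum()
    C = (w * d**2).sum()
    l = (w * C - w * d * B) / (A * C - B**2)   # l_i(x0): a linear smoother
    return (l * y_train).sum()

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=100)
# Boundary point: local linear is less biased here than local averaging.
print(local_linear(x, y, x0=0.0, h=0.1))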

4.3 Local polynomial regression


As seen in the last section, local linear regression (Fig. 4.1) extends local averaging: local averaging fits a locally constant function, while local linear regression fits a locally linear one, removing the design bias. Continuing this progression, we now turn to local polynomial regression: just as local linear regression extends local averaging, local polynomial regression extends local linear regression.
Essentially, in local polynomial regression we fit polynomial functions locally. We first fix x. Once x is fixed, we approximate the function r(·) in the neighborhood of x by the function Pₓ(u; a) with a = (a₀, ⋯, a_p), where¹

Pₓ(u; a) = a₀ + a₁(u − x) + (a₂/2)(u − x)² + ⋯ + (a_p/p!)(u − x)^p. (4.9)
With this new approximation Pₓ, we now want to fit this degree-p polynomial to the data around the point x. Much like local linear regression, we no longer assume that r(x) is locally constant or locally linear; rather, we assume that r(x) is locally a polynomial.
¹The factor 1/k! is a convenient choice for taking the k-th order derivative of Pₓ(u; a) at u = x: (dᵏ/duᵏ) Pₓ(u; a)|_{u=x} = a_k for all k = 0, ⋯, p.


Figure 4.1: Graphical representation of local linear regression (LLR).

Therefore, we minimize over a, giving us

â = (â₀, …, â_p) = argmin_{(a₀,…,a_p)∈ℝ^{p+1}} Σ_{j=1}^n wⱼ (Yⱼ − Pₓ(xⱼ; a))²
  = argmin_{(a₀,…,a_p)∈ℝ^{p+1}} Σ_{j=1}^n wⱼ (Yⱼ − (a₀ + a₁(xⱼ − x) + ⋯ + (a_p/p!)(xⱼ − x)^p))², (4.10)
 
where wⱼ = K((x − xⱼ)/h) for some kernel function K. Notice that we can rewrite this equation as

argmin_{a∈ℝ^{p+1}} Σ_{j=1}^n wⱼ (Yⱼ − aᵀzⱼ)²,

where

a = (a₀, …, a_p)ᵀ and zⱼ = (1, xⱼ − x, (xⱼ − x)²/2!, …, (xⱼ − x)^p/p!)ᵀ.
As we can see, this closely resembles local linear regression, with the exception that Pₓ represents a degree-p polynomial. We can therefore view it as weighted linear regression with the vector a as the parameter and the zⱼ's as the inputs, giving a final estimator very similar to that of local linear regression:

r̂(x) = Pₓ(x; â) = â₀.

When using local polynomial regression, the convention is to use polynomials of degree at most 3, as higher-degree polynomials are not much help on complex data sets. Note that local polynomial regression is also a linear smoother.
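Since (4.10) is a weighted least squares problem in a with inputs zⱼ, it can be solved with a single weighted normal-equations solve. A minimal sketch (my own; the degree p, bandwidth, and toy data are arbitrary assumptions):

import math
import numpy as np

def local_poly(x_train, y_train, x0, h, p=3):
    """Local polynomial estimate: solve the weighted least squares (4.10),
    then return r_hat(x0) = a0_hat."""
    d = x_train - x0
    w = np.exp(-(d / h) ** 2 / 2)                       # Gaussian kernel weights
    # z_j = (1, d_j, d_j^2/2!, ..., d_j^p/p!), matching P_x(u; a).
    Z = np.column_stack([d**k / math.factorial(k) for k in range(p + 1)])
    WZ = Z * w[:, None]
    a_hat = np.linalg.solve(Z.T @ WZ, WZ.T @ y_train)   # weighted normal equations
    return a_hat[0]

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=100)
print(local_poly(x, y, x0=0.5, h=0.15, p=3))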

4.4 Cross validation


There are different forms of cross validation that can be conducted on an algorithm. The main reason to use cross validation is to reduce the chance of over-fitting when selecting hyperparameters. This technique is widely used in machine learning (specifically neural nets) when trying to create models that best fit specific datasets. In our case, we will be looking at cross validation for local polynomial regression, which concerns selecting the optimal bandwidth h, the polynomial degree p, and which method (splines, regressogram, etc.) to use. We care about cross validation in order to optimize our model, and therefore our results. Recall we want to evaluate and minimize

MSE = E_{Yᵢ}[MSE(r̂)] = E[(1/n) Σ_{i=1}^n (r̂(xᵢ) − r(xᵢ))²],

where

MSE(r̂) = (1/n) Σ_{i=1}^n (r̂(xᵢ) − r(xᵢ))².

As we have seen before, this issue cannot simply be solved by minimizing the training error

(1/n) Σ_{i=1}^n (Yᵢ − r̂(xᵢ))²,

as setting h = 0 gives zero training error, since r̂(xᵢ) = Yᵢ.

4.4.1 Holdout dataset


One method of cross validation that is widely used in machine learning is a holdout set. However, this is only useful when you have a large dataset (e.g. n ≫ 10000), something rarely available in nonparametric statistics. To construct the holdout dataset from an original dataset of n data points, pick a random permutation {i₁, …, iₙ} of {1, …, n}. Then we train on the dataset {(X_{i₁}, Y_{i₁}), …, (X_{iₘ}, Y_{iₘ})}, typically with m = 0.9n, and test on the validation set {(X_{i_{m+1}}, Y_{i_{m+1}}), …, (X_{iₙ}, Y_{iₙ})}.
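A minimal sketch of this holdout construction (my own illustration; it assumes the nadaraya_watson helper and toy data x, y from the earlier sketches are in scope, and the candidate bandwidth grid is arbitrary):

import numpy as np

def holdout_split(X, Y, frac_train=0.9, seed=0):
    """Randomly permute the indices and split into train/validation sets."""
    perm = np.random.default_rng(seed).permutation(len(X))
    m = int(frac_train * len(X))
    tr, va = perm[:m], perm[m:]
    return X[tr], Y[tr], X[va], Y[va]

# Pick the bandwidth with the smallest validation error.
Xtr, Ytr, Xva, Yva = holdout_split(x, y)
best_h = min([0.02, 0.05, 0.1, 0.2, 0.5],
             key=lambda h: np.mean((nadaraya_watson(Xtr, Ytr, Xva, h) - Yva) ** 2))
print(best_h)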

4.4.2 Leave-one-out estimator


Another technique for cross validation is the leave-one-out estimate, where we estimate the risk by

R̂(h) = (1/n) Σ_{i=1}^n (Yᵢ − r̂₋ᵢ(xᵢ))²,

where r̂₋ᵢ(·) is the estimator applied to the dataset excluding (xᵢ, Yᵢ). You remove xᵢ from the dataset, apply the estimator, and finally use the estimator at xᵢ to see if it can reproduce Yᵢ with relatively small error. To implement this, recall that a general linear smoother can be written as

r̂(x) = Σ_{j=1}^n lⱼ(x) Yⱼ, with Σ_{j=1}^n lⱼ(x) = 1.

Therefore, for the leave-one-out estimator, we can obtain

r̂₋ᵢ(x) = Σ_{j≠i} (lⱼ(x) / Σ_{j'≠i} l_{j'}(x)) Yⱼ.

For the kernel estimator, r̂₋ᵢ(x) defined in the equation above is indeed the estimator applied on (X₁, Y₁), …, (Xₙ, Yₙ) excluding (Xᵢ, Yᵢ). Here R̂ is almost an unbiased estimator of the predictive risk. The question now is how to compute R̂ efficiently. We will answer it in the form of a theorem.

Theorem 4.4. If r̂ is a linear smoother

r̂(x) = Σ_{j=1}^n lⱼ(x) Yⱼ,

then

R̂(h) = (1/n) Σ_{i=1}^n ((Yᵢ − r̂(xᵢ)) / (1 − Lᵢᵢ))², (4.11)

where Lᵢᵢ = lᵢ(xᵢ).

Proof. Consider the estimator

r̂₋ᵢ(xᵢ) = Σ_{j≠i} lⱼ(xᵢ) Yⱼ / Σ_{j≠i} lⱼ(xᵢ).

Recall that the sum of the weights for any data point is always 1, so Σ_{j≠i} lⱼ(xᵢ) = 1 − lᵢ(xᵢ), and we can rewrite the estimator as

r̂₋ᵢ(xᵢ) = (Σ_{j=1}^n lⱼ(xᵢ) Yⱼ − lᵢ(xᵢ) Yᵢ) / (1 − lᵢ(xᵢ)) = (r̂(xᵢ) − lᵢ(xᵢ) Yᵢ) / (1 − Lᵢᵢ).

Therefore,

R̂(h) = (1/n) Σ_{i=1}^n (Yᵢ − r̂₋ᵢ(xᵢ))²
  = (1/n) Σ_{i=1}^n (Yᵢ − (r̂(xᵢ) − lᵢ(xᵢ) Yᵢ)/(1 − Lᵢᵢ))²
  = (1/n) Σ_{i=1}^n ((Yᵢ − r̂(xᵢ))/(1 − Lᵢᵢ))²,

as desired.
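Theorem 4.4 makes leave-one-out cross validation cheap for linear smoothers: one smoother fit per bandwidth instead of n refits. A sketch (my own illustration; it assumes the nw_smoother_matrix helper and toy data x, y from the earlier sketches are in scope):

import numpy as np

def loocv_risk(x, y, h):
    """R_hat(h) from Theorem 4.4: one smoother fit, no n refits."""
    L = nw_smoother_matrix(x, h)          # from the linear-smoother sketch
    resid = y - L @ y                     # Y_i - r_hat(x_i)
    return np.mean((resid / (1 - np.diag(L))) ** 2)

hs = [0.02, 0.05, 0.1, 0.2, 0.5]
best_h = min(hs, key=lambda h: loocv_risk(x, y, h))
print(best_h)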

Chapter 5

Splines

This chapter covers splines as a new framework for a nonparametric algorithm.

5.1 Penalized Regression


To motivate the use of splines, first recall from Theorem 4.3 that the bias of local linear regression involves ∫_S r''(x)² dx, where S is some interval that contains x₁, …, xₙ: the smoother the function, the smaller this quantity. The approach in which we explicitly seek the smoothest function that fits the data is known as penalized regression. We write the objective as

argmin_{r̂} Σ_{i=1}^n (Yᵢ − r̂(xᵢ))² + λJ(r̂) ≜ L_λ(r̂),

where J(r̂) = ∫_S r''(x)² dx. Let us consider one of the extreme cases, when λ = ∞. In this case, J(r̂) can only be zero, so r̂''(x) = 0 and r̂ can only be linear (Figure 5.1).

Figure 5.1: Example of the extreme case λ = ∞: the fit is forced to be linear.

Here, splines offer a middle ground: the fit does not have to be strictly linear, while the second-order derivative is still controlled. We can now move on to defining splines.

5.2 Splines
Splines are a family of functions f defined relative to a set of points ξ₁ < ξ₂ < ⋯ < ξₖ (known as knots) contained in some interval [a, b]. Generally, an M-th order spline is a piecewise degree-(M − 1) polynomial with continuous (M − 2)-th order derivatives at the knots. More specifically, a cubic spline (4-th order spline) q is a continuous function such that
• q is a cubic polynomial on (a, ξ₁], [ξ₁, ξ₂], …, [ξ_{k−1}, ξₖ], [ξₖ, b), with a fixed cubic polynomial qᵢ between ξᵢ and ξ_{i+1};
• q has continuous first and second derivatives at the knots.
There is another type of spline, known as a natural spline, which extrapolates linearly beyond the boundary knots. Mathematically:

f(x) = a_{0,1}x + a_{0,0} for x ≤ ξ₁,
f(x) = a_{i,3}x³ + a_{i,2}x² + a_{i,1}x + a_{i,0} for ξᵢ ≤ x ≤ ξ_{i+1},
f(x) = a_{k,1}x + a_{k,0} for x ≥ ξₖ.

Having defined these notions, we can intuitively see that piecewise polynomials have favorable smoothness properties and will allow us to arrive at a solution to the penalized objective. We will now prove a theorem that demonstrates this.

5.3 Background: subspaces and bases


A set of functions F forms a subspace if for all f₁, f₂ ∈ F and all λ₁, λ₂ ∈ ℝ, λ₁f₁ + λ₂f₂ ∈ F.
A set of functions F = {f₁, f₂, …, fₖ} is said to be a basis of G if:
i. The functions f₁, f₂, …, fₖ span G; that is, for any g ∈ G, g = λ₁f₁ + λ₂f₂ + ⋯ + λₖfₖ for some λ₁, ⋯, λₖ ∈ ℝ.
ii. λ₁f₁ + λ₂f₂ + ⋯ + λₖfₖ = 0 only if λ₁ = λ₂ = ⋯ = λₖ = 0 (the fᵢ's are linearly independent).
Furthermore, a subspace of functions F has dimension at most k if there exist f₁, …, fₖ ∈ F such that every f ∈ F can be represented as

f = Σ_{i=1}^k λᵢ fᵢ

for some λ₁, ⋯, λₖ ∈ ℝ.

5.4 Cubic spline and penalized regression


In this section, we explore the connection between cubic spline and penalized regression with the following
theorem.
Theorem 5.1. The function that minimizes Lλ (r̂) is a natural cubic spline with knots at the data points.
In particular, the minimizer has to be a cubic spline.
Therefore, the minimizer of the penalized objective is an estimator called a smoothing spline. This identification gives a much smaller search space for f in the loss function L_λ(r̂), for two reasons:

1. Because we know that the best estimator lies in the spline class, we can limit the candidate functions to splines, which can be parametrized as finite combinations of cubic polynomials. This transforms the original uncountable, infinite-dimensional function space into a finite-dimensional parameter space.
2. Because the theorem also shows that the minimizing spline is cubic, the highest polynomial degree is 3. This places an upper bound on the number of parameters to be searched.

To prove this let us consider the following lemma:


Lemma 5.2. All cubic splines with knots ξ1, ..., ξk form a (k + 4)-dimensional subspace of functions. Specifically, there exist some h1, ..., h_{k+4} such that every cubic spline f can be represented as
$$f = \sum_{i=1}^{k+4} \lambda_i h_i,$$
where the λi's serve as parameters.


Proof. First, if f, g are cubic splines with the same fixed knots ξ1, ..., ξk, then f + g is also a cubic spline with those knots, so the set forms a subspace. We now show why this subspace is (k + 4)-dimensional. Notice that a cubic spline can be represented as
$$q(x) = a_{3,i}x^3 + a_{2,i}x^2 + a_{1,i}x + a_{0,i}$$
for all x ∈ [ξi, ξ_{i+1}], for each i = 0, ..., k, with the convention ξ0 = a, ξ_{k+1} = b. However, note that the a_{t,i}'s have to satisfy the continuity-of-derivatives constraints explored earlier. To prove the lemma, consider the following functions

h1 (x) = 1,
h2 (x) = x,
h3 (x) = x2 ,
h4 (x) = x3 ,
hi+4 (x) = (x − ξi )3+ , i = 1, · · · , k.

where t₊ = max{t, 0} is the ReLU function used throughout machine learning: it returns the input t if it is positive, and 0 otherwise. We prove that these are the desired basis by induction on the number of knots. When there are no knots, a degree-3 polynomial can be represented by combinations of h1, h2, h3, h4.
Suppose the inductive hypothesis holds for k − 1 knots. Given a cubic spline q with knots ξ1, ..., ξk, take the cubic spline q̃ with knots ξ1, ..., ξ_{k−1} defined by q̃(x) = q(x) for all x ≤ ξk, extending the polynomial piece from [ξ_{k−1}, ξk] past ξk. Concretely, suppose
$$q(x) = a_{3,k-1}x^3 + a_{2,k-1}x^2 + a_{1,k-1}x + a_{0,k-1}$$
on [ξ_{k−1}, ξk]; then on [ξk, b) we set
$$\tilde{q}(x) = a_{3,k-1}x^3 + a_{2,k-1}x^2 + a_{1,k-1}x + a_{0,k-1}.$$

Since q̃ is a cubic spline with k − 1 knots, by the inductive hypothesis
$$\tilde{q}(x) = \sum_{i=1}^{k+3} \lambda_i h_i(x).$$
By construction, q(x) − q̃(x) is zero for all x ≤ ξk, while on [ξk, b) it is a degree-3 polynomial, which we may write as q(x) − q̃(x) = b3(x − ξk)³ + b2(x − ξk)² + b1(x − ξk) + b0. Furthermore, recall that q and q̃ have continuous first and second derivatives, so
$$q(\xi_k) - \tilde{q}(\xi_k) = 0 = b_0,$$
$$q'(\xi_k) - \tilde{q}'(\xi_k) = 0 = b_1,$$
$$q''(\xi_k) - \tilde{q}''(\xi_k) = 0 = 2b_2.$$
Therefore q(x) − q̃(x) = b3(x − ξk)³ for all x ≥ ξk, and combining the two regimes, q(x) − q̃(x) = b3(x − ξk)³₊ for all x. Hence
$$q(x) = \tilde{q}(x) + b_3(x - \xi_k)^3_+ = \sum_{i=1}^{k+4} \lambda_i h_i(x),$$
where λ_{k+4} = b3.
Given this lemma, we will now show that the minimizer of Lλ(r̂) is a natural cubic spline g̃. To do this, start with any twice-differentiable function g over [a, b]. We construct a natural cubic spline g̃ that matches g on x1, ..., xn, namely g̃(x) = g(x) for all x ∈ {x1, ..., xn}. This immediately implies
$$\sum_{i=1}^{n} (Y_i - g(x_i))^2 = \sum_{i=1}^{n} (Y_i - \tilde{g}(x_i))^2.$$
Next we will show that $\int \tilde{g}''(x)^2\,dx \le \int g''(x)^2\,dx$, which together with the display above gives $L_\lambda(\tilde{g}) \le L_\lambda(g)$, with equality attained when g = g̃. Let h = g − g̃, and note that h(xi) = 0 for every data point xi, a fact we will use shortly. Now
$$\int_S g''(x)^2\,dx = \int_S \big(\tilde{g}''(x) + h''(x)\big)^2\,dx = \int_S \tilde{g}''(x)^2\,dx + 2\int_S \tilde{g}''(x)h''(x)\,dx + \int_S h''(x)^2\,dx \;\ge\; \int_S \tilde{g}''(x)^2\,dx + 2\int_S \tilde{g}''(x)h''(x)\,dx.$$

Here we want to show that $2\int_S \tilde{g}''(x)h''(x)\,dx = 0$. Recall that a natural spline g̃ is linear outside the boundary knots; by continuity of the second derivative, g̃''(ξ1) = 0 and g̃''(ξn) = 0, where ξ1 is the left boundary knot and ξn is the right boundary knot. By integration by parts,
$$\int_{\xi_1}^{\xi_n} \tilde{g}''(x)h''(x)\,dx = \tilde{g}''(x)h'(x)\Big|_{\xi_1}^{\xi_n} - \int_{\xi_1}^{\xi_n} h'(x)\tilde{g}'''(x)\,dx = -\int_{\xi_1}^{\xi_n} h'(x)\tilde{g}'''(x)\,dx.$$

Figure 5.2: Illustration of q(·), q̃(·), and q(·) − q̃(·), with ξ_{k−1} = 12 and ξ_k = 16.

Here, g̃'''(x) equals a constant ci on each [ξi, ξ_{i+1}], since g̃ is a degree-3 polynomial on that interval. This allows us to further expand
$$\int_{\xi_1}^{\xi_n} \tilde{g}''(x)h''(x)\,dx = -\int_{\xi_1}^{\xi_n} h'(x)\tilde{g}'''(x)\,dx = -\sum_{i=1}^{n-1}\int_{\xi_i}^{\xi_{i+1}} h'(x)\tilde{g}'''(x)\,dx = -\sum_{i=1}^{n-1} c_i\int_{\xi_i}^{\xi_{i+1}} h'(x)\,dx = -\sum_{i=1}^{n-1} c_i\big(h(\xi_{i+1}) - h(\xi_i)\big) = 0,$$
where the last equality follows because the knots are at the data points and g = g̃ there, so h(ξi) and h(ξ_{i+1}) are zero.

5.5 Interlude: A brief review


Let’s quickly review the key definition and results for splines we have established. The basic principle of a
spline is to minimize a regularized objective function over function r̂ to some data {(xi , Yi ) : 1 ≤ i ≤ n}
n
X Z
2
Lλ (r̂) , argmin (Yi − r̂(xi )) + λ r̂00 (x)2 dx, (5.1)
r̂ i=1 S

where the regularization term λ S r̂00 (x)2 dx encourages r̂ to be as smooth as possible. Here, “smooth” means
R

the second order derivative is as small as possible (i.e. we want a very small regularization penalty).
Let’s also recall a few important theorems and lemmas. From the last section, we know that the minimizer
r̂ of this objective function is a natural cubic spline.
Theorem 5.3. The minimizer Lλ (r̂) is a natural cubic spline with n knots at data points {xi : 1 ≤ i ≤ n}.
That is, r̂ must be a cubic spline.
A natural cubic spline is a piecewise polynomial function that linearly extrapolates near ±∞. Let’s also
recall an important lemma from last section.
Lemma 5.4. The cubic splines with knots ξ1, ..., ξn form an (n + 4)-dimensional subspace of functions. That is, there exist some {hj : 1 ≤ j ≤ n + 4} such that any cubic spline r̂ can be written as
$$\hat{r}(x) = \sum_{j=1}^{n+4} \beta_j h_j(x),$$
where βj ∈ R for j = 1, ..., n + 4.
Now we have a strong structural form for the minimizer, and we only have to search over the finite-dimensional subspace specified above. Plugging this functional form of r̂ into our penalized regression gives
$$L_\lambda(\hat{r}) = L_\lambda(\beta) = \sum_{i=1}^{n}\Big(Y_i - \sum_{j=1}^{n+4}\beta_j h_j(x_i)\Big)^2 + \lambda\int_S\Big(\sum_{j=1}^{n+4}\beta_j h_j''(x)\Big)^2 dx.$$
Although this looks like a complex objective function, we have a finite number of parameters {βj : 1 ≤ j ≤ n + 4} to optimize over. Further, Lλ(β) is a convex quadratic function of β; we can see this by expanding out the squared terms. As β is not a function of x, the βj's in the regularization term are unaffected by the derivatives. This makes the problem much more feasible and allows us to write it in matrix form.

5.6 Matrix notation for splines
Although we have a convex quadratic functional form, the problem remains notationally burdensome. Let's continue by translating our optimization problem into matrix notation. First, define a few matrices that will be useful later:
$$F = \begin{pmatrix} h_1(x_1) & \cdots & h_{n+4}(x_1)\\ \vdots & & \vdots\\ h_1(x_n) & \cdots & h_{n+4}(x_n)\end{pmatrix}\in\mathbb{R}^{n\times(n+4)}, \quad \beta = \begin{pmatrix}\beta_1\\ \vdots\\ \beta_{n+4}\end{pmatrix}\in\mathbb{R}^{n+4}, \quad Y = \begin{pmatrix}Y_1\\ \vdots\\ Y_n\end{pmatrix}\in\mathbb{R}^{n}. \tag{5.2}$$
Applying matrix multiplication in lieu of our summations before, we find
$$Y - F\beta = \begin{pmatrix}Y_1\\ \vdots\\ Y_n\end{pmatrix} - \begin{pmatrix}\beta_1 h_1(x_1) + \cdots + \beta_{n+4}h_{n+4}(x_1)\\ \vdots\\ \beta_1 h_1(x_n) + \cdots + \beta_{n+4}h_{n+4}(x_n)\end{pmatrix},$$
and therefore
$$\sum_{i=1}^{n}\Big(Y_i - \sum_{j=1}^{n+4}\beta_j h_j(x_i)\Big)^2 = \|Y - F\beta\|_2^2.$$

Then we look at the regularization term. Since
$$\int_S\Big(\sum_{j=1}^{n+4}\beta_j h_j''(x)\Big)^2 dx = \int_S\Big(\sum_{j=1}^{n+4}\sum_{k=1}^{n+4}\beta_j\beta_k\,h_j''(x)h_k''(x)\Big)dx = \sum_{j=1}^{n+4}\sum_{k=1}^{n+4}\beta_j\beta_k\Big(\int_S h_j''(x)h_k''(x)\,dx\Big),$$
by defining the term Ω_{jk} as
$$\Omega_{jk} \triangleq \int_S h_j''(x)h_k''(x)\,dx,$$
we have Ω = [Ω_{jk}] ∈ R^{(n+4)×(n+4)} and
$$\sum_{j=1}^{n+4}\sum_{k=1}^{n+4}\beta_j\beta_k\Big(\int_S h_j''(x)h_k''(x)\,dx\Big) = \sum_{j=1}^{n+4}\sum_{k=1}^{n+4}\beta_j\beta_k\,\Omega_{jk} = \beta^\top\Omega\beta.$$

Now that we have translated each part of the objective function, we can finally write it as
$$L_\lambda(\beta) \triangleq \|Y - F\beta\|_2^2 + \lambda\,\beta^\top\Omega\beta.$$
This is a remarkably familiar functional form, reminiscent of simple linear regression and ridge regression, except that the regularization is weighted by the matrix Ω. In the special case Ω = I, we have
$$\beta^\top\Omega\beta = \beta^\top\beta = \|\beta\|_2^2,$$
which is exactly the ridge penalty.

5.7 Minimizing the regularized objective


Given its similarity to linear regression, we can minimize this objective analytically in the same fashion. That is, we compute the gradient
$$\nabla L_\lambda(\beta) = -2F^\top(Y - F\beta) + 2\lambda\Omega\beta,$$
and set it equal to zero. Solving the resulting linear equation, we find
$$\hat{\beta} = (F^\top F + \lambda\Omega)^{-1}F^\top Y.$$
The minimizing natural cubic spline is then
$$\hat{r}(x) = \sum_{j=1}^{n+4}\hat{\beta}_j h_j(x).$$
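To make this concrete, here is a minimal numerical sketch of this closed-form solve, using the truncated power basis from Lemma 5.2. The function names are our own, and Ω is approximated by trapezoidal quadrature rather than computed in closed form — a sketch, not the canonical implementation.

```python
import numpy as np

def basis(x, knots):
    """Truncated power basis h_1,...,h_{n+4} from Lemma 5.2, evaluated at x."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.maximum(x - xi, 0.0) ** 3 for xi in knots]
    return np.stack(cols, axis=1)                    # shape (len(x), n + 4)

def basis_pp(x, knots):
    """Second derivatives: h1''=h2''=0, h3''=2, h4''=6x, then 6(x - xi)_+."""
    cols = [np.zeros_like(x), np.zeros_like(x), 2 * np.ones_like(x), 6 * x]
    cols += [6 * np.maximum(x - xi, 0.0) for xi in knots]
    return np.stack(cols, axis=1)

def fit_smoothing_spline(x, y, lam, n_grid=2000):
    knots = np.sort(x)
    F = basis(x, knots)                              # n x (n+4) design matrix
    grid = np.linspace(knots[0], knots[-1], n_grid)
    Hpp = basis_pp(grid, knots)
    # Omega_jk = int h_j'' h_k'' dx, approximated by quadrature on the grid
    Omega = np.trapz(Hpp[:, :, None] * Hpp[:, None, :], grid, axis=0)
    beta = np.linalg.solve(F.T @ F + lam * Omega, F.T @ y)
    return lambda x_new: basis(np.asarray(x_new, float), knots) @ beta

# toy usage
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 50)); y = np.sin(4 * x) + 0.3 * rng.standard_normal(50)
r_hat = fit_smoothing_spline(x, y, lam=1e-4)
print(r_hat([0.1, 0.5, 0.9]))
```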

5.8 Choosing the basis


Now that we have analytically solved for r̂, we might wonder how to choose h1 , ..., hn+4 . In the last lecture,
we saw an example of a basis given by hi+4 = (x − ξi )3+ where ξi is a knot in our spline. When choosing a
basis, we should always remember that we must derive Ω which requires integration. Therefore, our choice of
basis should be relatively easy to integrate. Of course, this can be done numerically but it’s a much simpler
problem when we are able to integrate and know properties of the basis (i.e. to speed up computation of
F > F in our estimate of β).
The textbook refers to B-splines as a good basis for computational reasons and because they have nice
properties. We will not delve into this here.

5.9 Interpretation of splines


5.9.1 Splines as linear smoothers
First, let’s recall that a linear smoother is a family of functions that takes an average of the response variable
(i.e. our Y variable). Splines fall within this family of linear smoothers. We will show why this is true. Let’s
begin by recalling our definition of β̂ and r̂.
n+4
X
β̂ = (F > F + λΩ)−1 F > Y, r̂(x) = β̂j hj (x).
j=1

 >
By taking h(x) = h1 (x) · · · hn+1 (x) ∈ Rn+4 , we can write

r̂(x) = h(x)> (F > F + λΩ)−1 F > Y := LY,

where L = h(x)> (F > F + λΩ)−1 F > , meaning our spline is indeed a linear smoother, as required. It is
important to show that this falls into the family of linear smoothers because it means that we can apply our
methods of cross validation (which were defined for linear smoothers) to splines as well.

5.9.2 Splines approximated by kernel estimation


In general, splines and local linear regression perform approximately equivalently. It will be hard to find consistent cases where one outperforms the other; it depends on the noise in the data. Let's quickly recall the kernel estimator:
$$\hat{r}(x) = \frac{\sum_{j=1}^n w_j Y_j}{\sum_{j=1}^n w_j}, \qquad w_j = K\Big(\frac{x_j - x}{h}\Big).$$
Note that K is our kernel function (e.g., Gaussian, boxcar, etc.). Splines approximately correspond to a similar functional form:
$$\hat{r}(x) \approx \frac{\sum_{j=1}^n w_j Y_j}{\sum_{j=1}^n w_j}, \qquad w_j = \frac{n^{1/4}}{\lambda^{1/4}\, f(x_j)^{3/4}}\, K\left(\frac{x_j - x}{\big(\lambda/(n f(x_j))\big)^{1/4}}\right).$$

This is clearly much more complicated than our standard kernel estimator, for a number of reasons. First, notice that the weights depend on the density f(x) of the xj's. Next, the kernel K here is typically not the Gaussian kernel (although it is usually something relatively reasonable). Finally, the bandwidth itself depends on f(xj).
Therefore, splines can be considered a special form of local averaging. They have the advantage in that
they tend to better fit the global structure of the data due to the “global” penalization term that encourages
smoothness.

5.9.3 Advanced spline methods


There exist more advanced spline methods, although we won’t dive too deeply into them here. For example,
there is a method that has knots that aren’t necessarily on data points. In our penalized regression above,
we mandated that the knots fell onto a data point in order to properly penalize the objective function. Other
methods do not have this requirement. Other methods also allow us to fix knots q1 , ..., qm in advance while
still others put knots in the optimization problem (i.e. the function also optimizes over the placement of
knots and chooses their locations).

5.9.4 Confidence bands


Confidence intervals (or bands) are always important in statistics. We want to know how close we are to the ground-truth functional form of the data, and where our estimates might differ more from the ground truth (i.e., we want a more interpretable estimate). Our confidence band will be of the typical form
$$[\hat{r}(x) - \hat{s}(x),\; \hat{r}(x) + \hat{s}(x)].$$

Figure 5.3: An example of a spline along with its confidence band.

In the case of splines, we have an estimate r̂(x) and wish to find the standard deviation of r̂(x), which defines ŝ(x). That is, we want to find a confidence band around our estimate that contains the true function r(x). In practice, this is difficult since we do not know the difference between E[r̂(x)] and r(x). Statisticians thus find the confidence band for E[r̂(x)] instead. For this purpose, we take ŝ(x) ≈ SD(r̂(x)).

We can do this for the general family of linear smoothers, of the form
$$\hat{r}(x) = \sum_{i=1}^n l_i(x)\,Y_i.$$
From here, we can find the variance of r̂(x):
$$\operatorname{Var}(\hat{r}(x)) = \operatorname{Var}\Big(\sum_{i=1}^n l_i(x)Y_i\Big) = \sum_{i=1}^n l_i(x)^2\operatorname{Var}(Y_i) = \sum_{i=1}^n l_i(x)^2\sigma^2 = \sigma^2\|l(x)\|_2^2.$$
Notice that we assume Var(Yi) = σ² for each i ∈ {1, ..., n} and define l(x) = (l1(x), ..., ln(x))^⊤. Thus, our confidence band becomes
$$[\hat{r}(x) - c\,\sigma\,\|l(x)\|_2,\; \hat{r}(x) + c\,\sigma\,\|l(x)\|_2].$$
Notice the constant c, which is chosen based on the desired level of confidence: a larger value of c gives higher confidence (a wider band), whereas a smaller value gives lower confidence.
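A quick sketch of this band for the spline smoother, reusing the hypothetical `basis` and `basis_pp` helpers from the earlier snippet and assuming a known noise level σ:

```python
import numpy as np

def spline_confidence_band(x, y, lam, sigma, x_eval, c=2.0, n_grid=2000):
    """Band [r_hat - c*sigma*||l(x)||, r_hat + c*sigma*||l(x)||] for the spline
    linear smoother; basis/basis_pp are the helpers sketched in Section 5.7."""
    knots = np.sort(x)
    F = basis(x, knots)
    grid = np.linspace(knots[0], knots[-1], n_grid)
    Hpp = basis_pp(grid, knots)
    Omega = np.trapz(Hpp[:, :, None] * Hpp[:, None, :], grid, axis=0)
    A = np.linalg.solve(F.T @ F + lam * Omega, F.T)   # (n+4) x n
    L = basis(np.asarray(x_eval, float), knots) @ A   # rows are l(x)^T
    r_hat = L @ y
    s_hat = c * sigma * np.linalg.norm(L, axis=1)     # c * sigma * ||l(x)||_2
    return r_hat - s_hat, r_hat + s_hat
```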

Chapter 6

High-dimensional nonparametric
methods

We introduce nearest-neighbor methods, challenges in high-dimensional nonparametrics, and the kernel method.

6.1 Nonparametrics in high dimension


To motivate this discussion of high-dimensional statistics, we begin by examining our toolbox of nonparametric techniques in the context of higher-dimensional data. Let's consider the case where X ∈ R^d with d > 1.
There are some immediate challenges to local averaging, including the construction of neighborhoods. A natural first thought would be to construct a "ball" Bx = {x' : ||x' − x||₂ ≤ h} around some fixed point x. The problem is that Euclidean distance (or any other distance) becomes nearly meaningless on finite-size (polynomial-size) datasets in high dimensions: the distances between points often carry little information.

6.1.1 Examples
Example 6.1. Suppose we generate $x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \mathrm{Unif}(S^{d-1})$, where the superscripts denote the index of the data points and $S^{d-1}$ is the unit sphere in $\mathbb{R}^d$, defined as $\{x \in \mathbb{R}^d : \|x\|_2 = 1\}$.
In this case, with probability at least $1 - n^2\exp(-\Omega(d))$,¹ we have
$$\langle x^{(i)}, x^{(j)}\rangle \approx 0 \;\Rightarrow\; \forall i, j \in [n],\ \|x^{(i)} - x^{(j)}\|_2 \approx \sqrt{2}.$$
The reasoning is as follows. Let $u = (u_1, \ldots, u_d),\ v = (v_1, \ldots, v_d) \in \mathbb{R}^d$ with $u, v \overset{\text{iid}}{\sim} \mathrm{Unif}(S^{d-1})$. Due to symmetry,
$$\mathbb{E}\langle u, v\rangle = \mathbb{E}[u_1v_1 + \cdots + u_dv_d] = 0, \qquad \sum_{i=1}^d u_i^2 = 1 \;\Rightarrow\; u_i^2 \approx \frac{1}{d},\ i = 1, \ldots, d.$$
Let us make the convenient approximation that $u_i, v_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \tfrac{1}{d})$ for all $i$. Then
$$\operatorname{Var}\Big(\sum_{i=1}^d u_iv_i\Big) = \sum_{i=1}^d \operatorname{Var}(u_iv_i) = \sum_{i=1}^d \mathbb{E}(u_iv_i)^2 = \sum_{i=1}^d \mathbb{E}(u_i^2)\,\mathbb{E}(v_i^2) = d\times\frac{1}{d^2} = \frac{1}{d}.$$

¹ T(d) = Ω(g(d)) ⇔ ∃c, d0 > 0 s.t. ∀d ≥ d0, c · g(d) ≤ T(d)
Figure 6.1: Example of image comparison.

Figure 6.2: Examples of k-NN with k = 10 (Left), k = 50 (Middle) and k = 400 (Right).

When the dimension d is high, this variance is small and we have ⟨u, v⟩ ≈ 0 with high probability; thus u and v are nearly orthogonal, and
$$\|u - v\|_2 = \sqrt{\|u\|_2^2 + \|v\|_2^2 - 2\langle u, v\rangle} \approx \sqrt{2}.$$
This is problematic because each x⁽ⁱ⁾ is nearly orthogonal to all the others, so all pairwise distances concentrate around the same value and the distance isn't informative. That is, if we take $h = \sqrt{2} + \epsilon$ for some small positive number $\epsilon = O\big(\sqrt{\log(n)/d}\big)$,² then all data points fall in our neighborhood; however, if we take $h = \sqrt{2} - \epsilon$, then no points do.

Example 6.2. Let’s consider the example of comparing images as in Fig. 6.1.
The distance between these two images will appear very large if we just compare pixels, even though the images are closely related.

6.1.2 k-Nearest neighbors algorithm


One of the most prominent algorithms for high-dimensional data is the k-nearest neighbors method. Before diving into this algorithm, note that it doesn't always solve the problem, but it can make it easier to handle. Let {(xi, Yi)}_{i=1}^n be our training dataset and x a test point. The algorithm is defined as follows.

1. ∀x, let Bx = {i : xi is among the k closest neighbors of x};

2. $\hat{r}(x) = \frac{1}{k}\sum_{i\in B_x} Y_i$.

A benefit of this algorithm is that each bin Bx always contains exactly k points, so we don't need to worry about a bandwidth. Of course, this doesn't always alleviate the problem that the neighbors might not be meaningful (i.e., they could be neighbors just by chance). The plots in Fig. 6.2 explore a few different values of k on a toy dataset (k ∈ {10, 50, 400}).
2 T (n) = O(g(n)) ⇔ ∃c, n0 > 0 s.t. ∀n ≥ n0 , T (n) ≤ c · g(n)
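A minimal brute-force sketch of k-NN regression as defined above (the function and variable names are our own):

```python
import numpy as np

def knn_regress(X_train, y_train, x, k):
    """Average the responses of the k nearest training points to x."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices forming B_x
    return y_train[nearest].mean()

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)
print(knn_regress(X, y, np.array([0.2, -0.4]), k=10))
```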

However, there is a fundamental limitation: the curse of dimensionality. That is, nonparametric methods require a sample size at least exponential in the dimension d. More formally, if we only assume Lipschitzness or smoothness conditions (e.g., |r(x) − r(x')| ≤ ||x − x'|| or bounded derivatives of r), then any estimator r̂ will have error on the order of $n^{-1/\Omega(d)}$. That is, to achieve error ε we need
$$\epsilon = n^{-\frac{1}{\Omega(d)}} \;\Rightarrow\; n \ge \Big(\frac{1}{\epsilon}\Big)^{\Omega(d)}.$$

6.2 Kernel method


We will now explore the kernel method (not to be confused with the kernel estimators mentioned in Section 2.1.3). To begin, let's introduce some notation. Our training dataset is
$$\big(x^{(1)}, y^{(1)}\big), \ldots, \big(x^{(n)}, y^{(n)}\big), \qquad x^{(i)} \in \mathbb{R}^d,\ y^{(i)} \in \mathbb{R}.$$
The general idea of the kernel method is to use a feature map
$$\phi : x \in \mathbb{R}^d \mapsto \phi(x) \in \mathbb{R}^m,$$
where x ∈ R^d is our input and φ(x) ∈ R^m are our features. Note that m can be very big or even infinite. More specifically, we want to transform our standard dataset into a different feature space:
$$\{(x^{(i)}, y^{(i)})\}_{i=1}^n \;\to\; \{(\phi(x^{(i)}), y^{(i)})\}_{i=1}^n.$$
After converting to this higher-dimensional representation, we can run a standard parameterized method on the transformed dataset (e.g., a linear regression).

6.2.1 Motivating examples


Example 6.3. Let’s consider nearest neighbors with `2 distance in the feature space. That is, we have
d(x, z) = ||φ(x) − φ(z)||22 . If we design φ(·) in the right way, our distance metric d(·, ·) will be more
informative than the typical ||x − z||22 . For example, applying φ could make transformed values φ(x) and
φ(z) closer than their untransformed counterparts.
Example 6.4. We can apply the same logic to traditional linear models. That is, we can apply φ(x) and fit a linear model to the transformed data to extract more signal. Suppose we have x, y ∈ R with linear model y = θ0 + θ1x. Now, let's say φ(x) = (1, x, x², x³) ∈ R⁴. We can then fit a linear regression on top of this transformed data:
$$y = \theta^\top\phi(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3.$$
This is a more flexible polynomial model, and it can be expanded beyond the third degree to represent any polynomial. Transforming our input into a higher dimension lets us rely less on the assumptions of linear regression.
Example 6.5. We can apply the same logic to splines. Taking h(x) = (h1(x), ..., h_{n+4}(x))^⊤ and β̂ = (β̂1, ..., β̂_{n+4})^⊤ ∈ R^{n+4}, recall from Lemma 5.4 that the minimizing natural cubic spline is
$$\hat{r}(x) = \sum_{j=1}^{n+4}\hat{\beta}_j h_j(x) = \langle\hat{\beta}, h(x)\rangle.$$
It is a linear function of the feature φ(x) = h(x), where φ(x) is a transformation of the data x.

6.2.2 Kernel regression
In a linear regression on the transformed feature space, the estimator takes the form
$$\hat{y} = \phi(x)^\top\theta, \qquad \phi(x) \in \mathbb{R}^m.$$
Let's focus on the case m > n. Our least squares objective remains the same:
$$\hat{L}(\theta) \triangleq \frac{1}{2n}\sum_{i=1}^n\big(y^{(i)} - \phi(x^{(i)})^\top\theta\big)^2.$$
Converting to matrix notation, define
$$\Phi = \begin{pmatrix}\phi(x^{(1)})^\top\\ \vdots\\ \phi(x^{(n)})^\top\end{pmatrix}\in\mathbb{R}^{n\times m}, \qquad y = \begin{pmatrix}y^{(1)}\\ \vdots\\ y^{(n)}\end{pmatrix}\in\mathbb{R}^{n}.$$
Then our objective function becomes
$$\hat{L}(\theta) = \frac{1}{2n}\|y - \Phi\theta\|_2^2.$$
Noticing that this is convex in θ, we can compute the gradient and set it equal to zero just as in a typical optimization problem:
$$\nabla\hat{L}(\theta) = -\frac{1}{n}\sum_{i=1}^n\big(y^{(i)} - \phi(x^{(i)})^\top\theta\big)\phi(x^{(i)}) = 0$$
$$\Leftrightarrow\ \sum_{i=1}^n\phi(x^{(i)})\phi(x^{(i)})^\top\theta = \sum_{i=1}^n y^{(i)}\phi(x^{(i)}) \ \Leftrightarrow\ \Phi^\top\Phi\,\theta = \Phi^\top y.$$

It is important to notice that when m > n, Φ^⊤Φ is not invertible, as would be required to solve the minimization problem uniquely. This is because Φ^⊤Φ ∈ R^{m×m} has rank at most n (i.e., the rank is smaller than the dimension). This means we have a family of solutions rather than a unique one. We claim that the family of solutions is given by
$$\theta = \Phi^\top(\Phi\Phi^\top)^{-1}y + \beta,$$
where β ⊥ φ(x⁽¹⁾), ..., φ(x⁽ⁿ⁾) (i.e., β is in the null space: Φβ = 0). Note that this is only feasible if ΦΦ^⊤ is invertible. We can verify that this is indeed a family of solutions:
$$\Phi^\top\Phi\theta = (\Phi^\top\Phi)\big(\Phi^\top(\Phi\Phi^\top)^{-1}y + \beta\big) = \Phi^\top\Phi\Phi^\top(\Phi\Phi^\top)^{-1}y = \Phi^\top y.$$
Note that the β term disappears because it is orthogonal to the rows of Φ. We can also conduct a sanity check by confirming that there is zero training error:
$$y - \Phi\theta = y - \Phi\big(\Phi^\top(\Phi\Phi^\top)^{-1}y + \beta\big) = y - \Phi\Phi^\top(\Phi\Phi^\top)^{-1}y = y - y = 0.$$

Again, the β term disappears because it is orthogonal to Φ. Typically, we take the minimum-norm member of the family of solutions as "the" solution to the problem, since it is seen as the simplest model and can likely generalize the best. The minimum-norm solution is given by
$$\hat{\theta} = \Phi^\top(\Phi\Phi^\top)^{-1}y. \tag{6.1}$$
The issue remains that, when m is large, this is computationally inefficient, and when m is infinite it is an impossible computation: it takes time on the order of O(m · n²).

6.2.3 Kernel efficiencies


The trick of the kernel is to remove the explicit dependency on m. We know that θ̂ is difficult to compute; instead, we can compute the prediction θ̂^⊤φ(x) directly:
$$\hat{\theta}^\top\phi(x) = y^\top(\Phi\Phi^\top)^{-1}\Phi\,\phi(x).$$
We can rewrite this in a more typical kernel fashion by defining K:
$$K \triangleq \Phi\Phi^\top = \big[\phi(x^{(i)})^\top\phi(x^{(j)})\big]_{\forall i,j\in[n]}.$$
Using this definition, we can plug into the expression for θ̂^⊤φ(x):
$$\hat{\theta}^\top\phi(x) = y^\top K^{-1}\begin{pmatrix}\phi(x^{(1)})^\top\phi(x)\\ \vdots\\ \phi(x^{(n)})^\top\phi(x)\end{pmatrix}.$$
Thus, the kernel trick only requires (i) φ(x⁽ⁱ⁾)^⊤φ(x⁽ʲ⁾) and (ii) φ(x⁽ⁱ⁾)^⊤φ(x) for all i, j ∈ {1, ..., n}. If we can easily compute φ(x)^⊤φ(z) for any x, z, then we have no dependency on m. That is, we can construct feature maps such that φ(x)^⊤φ(z) can be computed more quickly than O(m) time.
That is, we can construct feature maps such that φ(x)> φ(z) can be computed more quickly than O(m) time.
Now we will briefly discuss the time it takes to compute an estimate in this fashion. Say it takes T units of time to compute φ(x)^⊤φ(z). Then it takes us roughly
1. n²T units of time to compute the matrix K;
2. n³ units of time to compute the inverse K⁻¹;
3. nT units of time to compute the vector (φ(x⁽¹⁾)^⊤φ(x), ..., φ(x⁽ⁿ⁾)^⊤φ(x))^⊤;
4. n² units of time to compute the product of the vector y^⊤ and the matrix K⁻¹;
5. n units of time to compute the inner product of the vector y^⊤K⁻¹ with the vector from step 3.
All in all, it takes n²T + n³ + nT + n² + n units of time to compute θ̂^⊤φ(x). Overloading our notation for K, we can define the kernel as the inner product
$$K(x, z) \triangleq \phi(x)^\top\phi(z) = \langle\phi(x), \phi(z)\rangle.$$
We wish to construct a φ such that K(·, ·) is easy to compute. There are lots of ways to do this. In fact, we can even ignore φ and work directly with our kernel K instead (as long as we know there exists some φ where K(x, z) = φ(x)^⊤φ(z)).
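A minimal sketch of this kernelized prediction, assuming a kernel given as a Python callable (names are our own; the ridgeless formula requires K to be invertible, which holds in the m > n setting for generic data):

```python
import numpy as np

def kernel_predict(K, X_train, y_train, x):
    """Prediction theta_hat^T phi(x) = y^T K^{-1} k(x), never forming phi."""
    n = len(X_train)
    Kmat = np.array([[K(X_train[i], X_train[j]) for j in range(n)]
                     for i in range(n)])               # K = Phi Phi^T
    kvec = np.array([K(X_train[i], x) for i in range(n)])
    return y_train @ np.linalg.solve(Kmat, kvec)       # y^T K^{-1} k(x)

# toy usage with the quadratic kernel derived in Example 6.6 below
quad = lambda u, v: 1 + u @ v + (u @ v) ** 2
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3)); y = rng.standard_normal(10)
print(kernel_predict(quad, X, y, rng.standard_normal(3)))
```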

6.2.4 Examples
Example 6.6. Consider the case where x = (x1, ..., xd) ∈ R^d. Construct the feature map consisting of a constant, all coordinates, and all pairwise products:
$$\phi(x) = \big(1,\ x_1, \ldots, x_d,\ x_1x_1,\ x_1x_2,\ \ldots,\ x_dx_d\big)^\top \in \mathbb{R}^{1+d+d^2}.$$
Writing out the inner product with φ(z) = (1, z1, ..., zd, z1z1, z1z2, ..., zdzd)^⊤, we find
$$\phi(x)^\top\phi(z) = 1 + \sum_{i=1}^d x_iz_i + \sum_{i,j=1}^d x_ix_jz_iz_j = 1 + x^\top z + \sum_{i=1}^d x_iz_i\sum_{j=1}^d x_jz_j = 1 + x^\top z + (x^\top z)^2.$$
Since it takes O(d) time to compute x^⊤z, it takes O(d) time to compute φ(x)^⊤φ(z). There is no reliance on m here, so our kernel trick worked.
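A quick numerical check of this identity (a sketch; the explicit feature map is built only to verify the algebra):

```python
import numpy as np

def phi_quadratic(x):
    """Explicit feature map (1, the x_i's, all pairwise products x_i x_j)."""
    return np.concatenate(([1.0], x, np.outer(x, x).ravel()))

rng = np.random.default_rng(0)
x, z = rng.standard_normal(5), rng.standard_normal(5)
lhs = phi_quadratic(x) @ phi_quadratic(z)   # O(d^2) feature inner product
rhs = 1 + x @ z + (x @ z) ** 2              # O(d) kernel evaluation
print(np.isclose(lhs, rhs))                 # True
```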

Example 6.7. Again, let's consider x = (x1, ..., xd) ∈ R^d, this time with a degree-3 construction of φ that also appends all triple products:
$$\phi(x) = \big(1,\ x_1, \ldots, x_d,\ x_1x_1, \ldots, x_dx_d,\ x_1x_1x_1,\ x_1x_1x_2,\ \ldots,\ x_dx_dx_d\big)^\top \in \mathbb{R}^{1+d+d^2+d^3}.$$
Similar to the argument above, we find φ(x)^⊤φ(z) = 1 + x^⊤z + (x^⊤z)² + (x^⊤z)³, meaning we have O(d) time again.
Example 6.8. A Gaussian kernel (also called the RBF kernel) also works here. That is, we have
$$K(x, z) = \exp\Big(-\frac{\|x - z\|^2}{2\sigma^2}\Big).$$
It turns out there exists an infinite-dimensional feature map φ such that K(x, z) = φ(x)^⊤φ(z). Here, the parameter σ controls how strong the locality is. When σ is extremely large, K(x, z) is close to 1 for essentially all pairs, so the kernel is insensitive to the choice of x and z. When σ is very close to 0, points in a small local neighborhood matter much more than faraway points.

Example 6.9. Let’s consider applying kernel functions to k-Nearest neighbors algorithm instead of linear
 n
regression in the feature space φ x(i) , y (i) i=1 . We can re-write the distance between x and z as

d(x, z) = ||φ(x) − φ(z)||22 = (φ(x) − φ(z))>(φ(x) − φ(z))


= φ(x)>φ(x) − 2φ(x)>φ(z) + φ(z)>φ(z)
= K(x, x) − 2K(x, z) + K(z, z)
We can see that kernel method allows non-linear data representations.
There are many, many more examples of valid kernels. Here are just a few, listed without many details:
1. K(x, z) = (x^⊤z)²;
2. K(x, z) = (x^⊤z)^k;
3. K(x, z) = (c + x^⊤z)^k, where c is a constant;
4. K(x, z) = exp(−‖x − z‖²/(2σ²));
5. random features kernels, where we use a randomized feature map to approximate the kernel function;
6. kernels given by infinite-dimensional feature maps, e.g., the RBF kernel mentioned in Example 6.8.

6.2.5 Existence of φ
For a kernel function to be valid, there must exist some φ such that K(x, z) = φ(x)^⊤φ(z). Let's show what this requires of K.
Theorem 6.10. If K(x, z) = φ(x)^⊤φ(z), then for any x⁽¹⁾, ..., x⁽ⁿ⁾ we have [K(x⁽ⁱ⁾, x⁽ʲ⁾)]_{i,j∈[n]} ⪰ 0. That is, the kernel matrix K must be positive semidefinite.
Proof. We know that K ⪰ 0 if and only if v^⊤Kv ≥ 0 for all v. Let's show that this holds for an arbitrary v:
$$v^\top Kv = \sum_{i,j=1}^n v_iK_{ij}v_j = \sum_{i,j=1}^n v_i\langle\phi(x^{(i)}), \phi(x^{(j)})\rangle v_j = \sum_{i,j=1}^n v_i\Big(\sum_{k=1}^m\phi(x^{(i)})_k\,\phi(x^{(j)})_k\Big)v_j$$
$$= \sum_{k=1}^m\Big(\sum_{i=1}^n v_i\,\phi(x^{(i)})_k\Big)\Big(\sum_{j=1}^n v_j\,\phi(x^{(j)})_k\Big) = \sum_{k=1}^m\Big(\sum_{i=1}^n v_i\,\phi(x^{(i)})_k\Big)^2 \ge 0.$$
Therefore, K ⪰ 0 for any x⁽¹⁾, ..., x⁽ⁿ⁾ is a necessary condition for φ to exist. It is in fact also sufficient; if you're interested, you can find more about this result, known as Mercer's theorem, here [Wikipedia contributors, 2023].

6.3 More about kernel methods
6.3.1 Recap
Let us quickly review the kernel method from Section 6.2. The basic principle of the kernel method is that, given a set of data points
$$\big\{(x^{(1)}, y^{(1)}), \cdots, (x^{(n)}, y^{(n)})\big\}, \qquad x^{(i)} \in \mathbb{R}^d,\ y^{(i)} \in \mathbb{R},$$
we look for a suitable feature map φ such that
$$\phi : x \mapsto \phi(x) \in \mathbb{R}^m.$$
Our interpretation of this feature map is that it transforms the inputs in our dataset. If we run a linear regression or a logistic regression on the transformed dataset (φ(x⁽ⁱ⁾), y⁽ⁱ⁾), then the algorithm only depends on inner products (i.e., we don't need to know φ(x) or φ(z) explicitly); we only need to compute ⟨φ(x), φ(z)⟩. This is called the kernel function
$$K(x, z) := \langle\phi(x), \phi(z)\rangle. \tag{6.2}$$

If we can compute the kernel function directly, then we don’t need to pay the computational overhead of
computing the φ function/map explicitly. When the number of features is large, computing the feature map
explicitly can be quite costly.

6.3.2 Another approach to kernel methods


An alternate way of understanding the kernel method is to view each feature as a function of x, that is,
$$\phi(\cdot)_k : \mathbb{R}^d \to \mathbb{R}, \qquad \phi(x) = \begin{pmatrix}\phi(x)_1\\ \vdots\\ \phi(x)_m\end{pmatrix}.$$
An example is the second-degree polynomial kernel, where φ(x)_{(ij)} = xi xj. We can then view the linear prediction function as a linear combination of these feature functions:
$$\theta^\top\phi(x) = \sum_{i=1}^m\theta_i\,\phi(x)_i \in \operatorname{span}\{\phi(\cdot)_1, \ldots, \phi(\cdot)_m\}.$$
The kernel method can thus be thought of as looking for a function in a linear span of functions.

6.3.3 Connection to splines


A cubic spline is a function in the span of a family of cubic-spline basis functions; that is, our model r(x) satisfies
$$r(x) \in \operatorname{span}\{h_1(x), \ldots, h_{n+4}(x)\}.$$
Equivalently, we can write
$$r(x) = \sum_{i=1}^{n+4}\beta_i h_i(x) = \beta^\top\phi(x),$$

where φ(x)i = hi(x), i = 1, ..., n + 4, and thus
$$\phi : x \mapsto \phi(x) = \big(h_1(x), \ldots, h_{n+4}(x)\big)^\top$$
is a feature map. Consequently, in our connection between kernels and splines, we can write out the kernel function for cubic splines as
$$K(x, z) = \langle\phi(x), \phi(z)\rangle = \sum_{i=1}^{n+4}h_i(x)h_i(z).$$
Empirically, our main design choice centers on choosing a basis hi(x) such that K(x, z) is efficiently computable. The bases we previously used for splines are convenient mathematically but are not necessarily the best choice when thinking about computability.
The kernel method with feature map φ is equivalent to a cubic spline r̂ with no regularization, i.e., a ridgeless kernel regression, since
$$\operatorname*{argmin}_\beta\sum_{i=1}^n\big(y^{(i)} - \beta^\top\phi(x^{(i)})\big)^2 \;\Leftrightarrow\; \operatorname*{argmin}_{\hat{r}}\sum_{i=1}^n\big(y^{(i)} - \hat{r}(x^{(i)})\big)^2.$$
However, this is underspecified because the number of parameters (n + 4) exceeds the number of data points n. Therefore, we look for the minimum-norm solution or add regularization. An example of a regularized solution is kernel ridge regression,
$$\min_\beta\ \frac{1}{2}\sum_{i=1}^n\big(y^{(i)} - \beta^\top\phi(x^{(i)})\big)^2 + \frac{\lambda}{2}\|\beta\|_2^2,$$
whose minimizer takes a simple form (cf. Homework 3),
$$\hat{\beta} = \Phi^\top(\Phi\Phi^\top + \lambda I)^{-1}y,$$
where
$$\Phi = \begin{pmatrix}\phi(x^{(1)})^\top\\ \vdots\\ \phi(x^{(n)})^\top\end{pmatrix}\in\mathbb{R}^{n\times m}.$$
To connect with splines, recall that in the previous lecture, for natural cubic splines, we used a similar but different regularizer, β^⊤Ωβ.
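A minimal kernel ridge regression sketch using only kernel evaluations — the prediction β̂^⊤φ(x) = y^⊤(K + λI)^{-1}k(x) follows from the formula above with K = ΦΦ^⊤ (names are our own):

```python
import numpy as np

def kernel_ridge_predict(K, X_train, y_train, x, lam):
    """beta_hat^T phi(x) = y^T (K + lam*I)^{-1} k(x), via kernel calls only."""
    n = len(X_train)
    Kmat = np.array([[K(X_train[i], X_train[j]) for j in range(n)]
                     for i in range(n)])
    kvec = np.array([K(X_train[i], x) for i in range(n)])
    return y_train @ np.linalg.solve(Kmat + lam * np.eye(n), kvec)

# usage with an RBF kernel; lam > 0 also keeps the linear solve well conditioned
rbf = lambda u, v, s=1.0: np.exp(-np.sum((u - v) ** 2) / (2 * s ** 2))
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2)); y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
print(kernel_ridge_predict(rbf, X, y, np.array([0.3, -0.2]), lam=1e-2))
```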

6.3.4 Connection to nearest neighbor methods


Another application of the kernel method is to perform a nearest-neighbor strategy in feature space. We run nearest neighbors on {(φ(x⁽¹⁾), y⁽¹⁾), ..., (φ(x⁽ⁿ⁾), y⁽ⁿ⁾)}, and the squared ℓ2 distance metric is then
$$d(x, z) = \|\phi(x) - \phi(z)\|_2^2 = (\phi(x) - \phi(z))^\top(\phi(x) - \phi(z)) = \phi(x)^\top\phi(x) - 2\phi(x)^\top\phi(z) + \phi(z)^\top\phi(z) = K(x, x) - 2K(x, z) + K(z, z).$$
We see again that we don't need to compute the features φ explicitly.

Chapter 7

Fully-connected two layer neural networks

7.1 Overview
In this chapter we will talk about neural networks. We will explore their connection to the kernel method
and their practical implementation.

7.2 Neural networks


7.2.1 A glimpse into deep learning theory
While neural networks are not typically studied in most classic statistics classes, in recent years they have revolutionized the field of machine learning and have thus become an increasingly interesting topic in statistics. The result we will show in this chapter is that a wide two-layer neural network with one-dimensional input is a fully nonparametric method, closely related to the cubic splines we have already discussed. We will primarily discuss the following two things.

1. We will use a neural net to represent features and then learn those features; this allows for more dynamic and better features than the fixed ones used in the kernel method.
2. We will also show that a one-dimensional, two-layer, wide neural network is equivalent to a linear spline.

7.2.2 Fully-connected two layer neural networks


A neural network can be thought of as a method to learn the features φ in a non-parametric model. If we
think about neural networks in the specific case where the input is only 1-dimensional, and we have two
layers, then we can see that it is really a type of linear spline. To begin, we introduce some basic notations.

Definition 7.1 (Transformation of a fully-connected neural network in each layer). We denote the input of the i-th layer of a neural network by h_{i−1} ∈ R^d and its output by hi ∈ R^m. The weight matrix parameters are denoted by W ∈ R^{m×d}. Let σ : R → R be the non-linear activation function. Examples of activation functions include
$$\mathrm{ReLU}(x) := \max\{x, 0\},$$
$$\mathrm{Sigmoid}(x) := \frac{1}{1 + e^{-x}},$$
$$\mathrm{Softplus}(x) := \log(1 + e^{x}).$$

Then the output vector can be written as hi = σ(W h_{i−1}), where σ is understood to be applied elementwise. Empirical wisdom is that activation functions should not be flat on both sides, so Sigmoid tends not to be used in modern networks: when the input to the activation function gets large, the output becomes highly insensitive to parameter changes due to near-zero gradients.
For a fully-connected two-layer neural network, we can thus write
$$\hat{y} = a^\top\sigma(Wx),$$
where x := h0 is the input of the network and a ∈ R^m. If we view σ(Wx) as a feature map φ(x) that depends on W (so we might more accurately write this feature map as φ_W(x)), then this is similar to a kernel method. If we fix W, then this is exactly the kernel method with K(x, z) = ⟨φ_W(x), φ_W(z)⟩. In neural networks, the difference is that we train both a and W; if we don't train W, we essentially have a kernel method.

7.2.3 Deep neural networks


With the above notation, we can formally define a fully-connected deep neural network with r layers, parameters W1, ..., Wr, a, and input vector x := h0 as
First layer: h1 = σ(W1 x)
Second layer: h2 = σ(W2 h1)
...
Output: ŷ = a^⊤ hr
Often, hr is called "the features": hr = σ(Wr σ(W_{r−1} ···)) =: φ_{W1,...,Wr}(x), where φ_{W1,...,Wr}(x) is referred to as the feature extractor or the feature map. The key difference from the kernel method is that φ_{W1,...,Wr}(x) is learned. More broadly, any sequence of parameterized computations is called a neural network. For example, the residual neural network is
First layer: h1 = σ(W1 x)
Second layer: h2 = h1 + σ(W2 h1)
...
r-th layer: hr = h_{r−1} + σ(Wr h_{r−1})
Output: ŷ = a^⊤ hr
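A minimal numpy sketch of these two forward passes (shapes and initialization scales are illustrative assumptions):

```python
import numpy as np

def forward(x, Ws, a, residual=False):
    """h_i = sigma(W_i h_{i-1}); with residual=True, h_i = h_{i-1} + sigma(W_i h_{i-1})
    for layers past the first (where dimensions match). Output is a^T h_r."""
    h = x
    for i, W in enumerate(Ws):
        z = np.maximum(W @ h, 0.0)               # ReLU activation
        h = h + z if residual and i > 0 else z   # first layer changes dimension
    return a @ h

# toy usage: W1 maps d -> m, later layers map m -> m
rng = np.random.default_rng(0)
d, m = 3, 16
Ws = [0.1 * rng.standard_normal((m, d))] + \
     [0.1 * rng.standard_normal((m, m)) for _ in range(2)]
a = 0.1 * rng.standard_normal(m)
print(forward(rng.standard_normal(d), Ws, a, residual=True))
```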

7.2.4 Equivalence to linear splines


We will work with infinitely wide neural networks to make the connection to linear splines. (We don't truly need infinite width, but we do need a very large width.) First, we introduce some notation. The input is x ∈ R with associated output y ∈ R. Our model is
$$h_\theta(x) = \sum_{i=1}^m a_i[w_i x + b_i]_+ + c,$$
where ai, wi, bi, c ∈ R. The activation function in this neural network is simply ReLU(t) = max{t, 0}, which we denote by [·]₊.
As an aside, note that
$$a^\top\sigma(Wx) = \sum_{i=1}^m a_i(\sigma(Wx))_i = \sum_{i=1}^m a_i\,\sigma((Wx)_i),$$
and with
$$W = \begin{pmatrix}w_1^\top\\ \vdots\\ w_m^\top\end{pmatrix}, \qquad Wx = \begin{pmatrix}w_1^\top x\\ \vdots\\ w_m^\top x\end{pmatrix}, \qquad (Wx)_i = w_i^\top x,$$
we get
$$a^\top\sigma(Wx) = \sum_{i=1}^m a_i\,\sigma(w_i^\top x).$$

We will denote our parameters by θ = (m, a, w, b, c) where m ∈ N, a ∈ Rm , w ∈ Rm , b ∈ Rm and c ∈ R.


Our regularizer will be the squared ℓ2 norm of the weight parameters:
$$C(\theta) = \frac{1}{2}\big(\|a\|_2^2 + \|w\|_2^2\big) = \frac{1}{2}\sum_{i=1}^m(a_i^2 + w_i^2). \tag{7.1}$$
This regularizer prevents overfitting by essentially controlling the slope of the model. Note that the bi and c terms in our model function hθ(x) are not included in our regularizer, since these terms control the translations of our model and thus should not be penalized for being too large. We now define our regularized training objective:
$$\inf_\theta\,\big[L(h_\theta) + \lambda C(\theta)\big], \tag{7.2}$$
where L(hθ) can be any loss function that is continuous in θ, and C(θ) is our regularizer. An example of a loss function is
$$L(\theta) = \frac{1}{n}\sum_{i=1}^n\big(y^{(i)} - h_\theta(x^{(i)})\big)^2. \tag{7.3}$$

7.2.5 Simplification: m goes to infinity


With some abuse of notation, we will work with the following infinitely wide neural network:
$$h_\theta(x) = \sum_{i=1}^{\infty} a_i[w_i x + b_i]_+,$$
$$a = (a_1, \cdots) \in \mathbb{R}^\infty, \quad w = (w_1, \cdots) \in \mathbb{R}^\infty, \quad b = (b_1, \cdots) \in \mathbb{R}^\infty,$$
$$C(\theta) = \frac{1}{2}\Big(\sum_{i=1}^{\infty} a_i^2 + \sum_{i=1}^{\infty} w_i^2\Big).$$

Theorem 7.2. Define the nonparametric complexity measure
$$\bar{R}(f) = \max\Big\{\int_{-\infty}^{+\infty}|f''(x)|\,dx,\ \ |f'(-\infty) + f'(+\infty)|\Big\}. \tag{7.4}$$
The first term measures the total curvature of the function, and the second term pertains to its asymptotic slopes. Recall that for cubic splines, the penalty was $\int_S r''(x)^2\,dx$. For nonparametric penalized regression, our goal is to find the minimizer of
$$\min_f\ L(f) + \lambda\bar{R}(f). \tag{7.5}$$
For a parameterized neural network, we are trying to find the minimizer of
$$\min_\theta\ L(h_\theta) + \lambda C(\theta). \tag{7.6}$$
We claim that these two methods are doing the same thing. Specifically, we claim that
$$\min_f\ L(f) + \lambda\bar{R}(f) = \min_\theta\ L(h_\theta) + \lambda C(\theta), \tag{7.7}$$
and
$$f^* = h_{\theta^*},$$
where f* and θ* are the minimizers of equations (7.5) and (7.6), respectively.
In other words, on the one hand we have a nonparametric approach, i.e., a penalized regression with complexity measure R̄(f); on the other hand we have a parameterized regression which comes from neural networks. We claim that these are doing the exact same thing.
How do we interpret this? What does minimizing L(f) + λR̄(f) really do? First, let's consider the following:
$$\text{minimize } \bar{R}(f) \quad\text{s.t.}\quad L(f) = \sum_{i=1}^n\big(y^{(i)} - f(x^{(i)})\big)^2 = 0,$$
which corresponds to the case λ → 0. The above is minimized when f is a linear spline that fits the data exactly, so that L(f) = 0.
From equation (7.4), we see that R̄(f) consists of two terms. For a piecewise linear f, f''(x) blows up at the data points (where the slope instantaneously changes from one line to another), and f''(x) = 0 otherwise. Hence, we can model f''(x) as a sum of Dirac delta functions {δ(t) : t a data point}. The Dirac delta function takes the value zero everywhere except a single point where it takes the value infinity, and by definition $\int_{-\infty}^{+\infty}\delta(t)\,dt = 1$; hence $\int|f''(x)|\,dx$ in equation (7.4) is actually quite small. We won't go into the formality of proving this, but take it as true that minimizing R̄ gives us a linear spline, since the penalization from the second-order derivatives is quite small.
To represent a linear spline with n knots, we only need n + 1 pieces. We can therefore represent such a linear spline with a neural net of at most n + 1 terms. Analogously, in penalized regression we started with all possible solutions r(x), but after realizing that the solution has a cubic-spline structure, we reduced our infinitely large solution space to an (n + 4)-dimensional space (n + 4 neurons / width of the neural net); this simplification makes the optimization problem a lot easier.

7.2.6 Outline of the proof


The proof follows mainly from the following two steps.
• Step 1: Show there exists R̃(f) such that min L(f) + λR̃(f) = min L(hθ) + λC(θ). (We use min in place of inf here so that we can discuss the proof using minimizers. Ultimately, this doesn't change the conclusions of the theorem, but makes the proof cleaner.)
• Step 2: Show that the formula for R̃ coincides with that of R̄(f).
We begin by looking for a representation of f(x) by a neural network of minimum complexity. Consider the following optimization:
$$\min\ C(\theta) \quad\text{s.t.}\quad f(x) = h_\theta(x). \tag{7.8}$$
Why do we know there exists θ such that f(x) = hθ(x)? For any piecewise linear function with a finite number of pieces, there exists an hθ(x) that represents f(x), since finite-width ReLU networks are exactly piecewise linear. A uniformly continuous function f(x) can be approximated by a two-layer neural network with finite width, and it can be exactly represented by a two-layer neural network with infinite width: take finer and finer approximations of the function, and pass to the limit as the width of the network goes to infinity.
We wish to prove that
$$\min_f\ L(f) + \lambda\tilde{R}(f) = \min_\theta\ L(h_\theta) + \lambda C(\theta),$$
where R̃(f) = min C(θ) s.t. f = hθ. Let θ* be the minimizer of min L(hθ) + λC(θ), so that
$$L(h_{\theta^*}) + \lambda C(\theta^*) = \min_\theta\ L(h_\theta) + \lambda C(\theta).$$
Let f = h_{θ*}. Then we have L(f) = L(h_{θ*}) and
$$\tilde{R}(f) = \min_{\theta:\,h_\theta = h_{\theta^*}} C(\theta) \le C(\theta^*).$$
Combining these statements implies that L(f) + λR̃(f) ≤ L(h_{θ*}) + λC(θ*) = min L(hθ) + λC(θ), which gives
$$\min_f\ L(f) + \lambda\tilde{R}(f) \le \min_\theta\ L(h_\theta) + \lambda C(\theta). \tag{7.9}$$
In the other direction, let f* be the minimizer of min L(f) + λR̃(f). By the argument above, we can construct θ such that hθ = f*; take θ to be the minimizer of min C(θ) s.t. f* = hθ, i.e., the minimum-complexity network that can represent f*. This means that C(θ) = R̃(f*) and
$$\min_f\ L(f) + \lambda\tilde{R}(f) = L(f^*) + \lambda\tilde{R}(f^*) = L(h_\theta) + \lambda C(\theta) \ge \min_\theta\ L(h_\theta) + \lambda C(\theta). \tag{7.10}$$
Taken collectively, we conclude that min L(f) + λR̃(f) = min L(hθ) + λC(θ).

7.3 Showing that R̃(f ) = R̄(f )


7.3.1 Recap
Recall that we introduced the following quantities for an infinitely-wide two-layer neural net:
$$a = (a_1, a_2, a_3, \cdots) \in \mathbb{R}^\infty, \quad b = (b_1, b_2, b_3, \cdots) \in \mathbb{R}^\infty, \quad w = (w_1, w_2, w_3, \cdots) \in \mathbb{R}^\infty,$$
$$h_\theta(x) = \sum_{i=1}^{\infty} a_i[w_i x + b_i]_+, \qquad C(\theta) = \frac{1}{2}\Big(\sum_{i=1}^{\infty} a_i^2 + \sum_{i=1}^{\infty} w_i^2\Big).$$
Denote by θ the set of parameters {(ai, bi, wi)}_{i=1}^∞. We previously showed that there exists
$$\tilde{R}(f) = \min\ C(\theta) \quad\text{s.t.}\quad f(x) = h_\theta(x), \tag{7.11}$$
such that Eq. (7.7) holds with this R̃ in place of R̄. It remains to show that the complexity measure defined in Eq. (7.11) coincides with the definition in Eq. (7.4). Let us take a detour and prove a related lemma.
7.3.2 Preparation
Lemma 7.3. The minimizer θ* of Eq. (7.11) satisfies |ai| = |wi| for all i = 1, 2, ....
Recall that the wi's are the weights of the first layer and the ai's are the weights of the second layer. Therefore, this lemma implies that the weights are balanced between the two layers in order to minimize the complexity.
Proof. We can write ai[wi x + bi]₊ = (ai/γ)[γwi x + γbi]₊ for any γ > 0. This is allowed because [γt]₊ = γ[t]₊. Now suppose that (ai, wi, bi) is optimal for each i. Then the complexity should not decrease after rescaling by γ, as we have already found the minimum. Hence we have
$$\frac{1}{2}(a_i^2 + w_i^2) \le \frac{1}{2}\Big(\frac{a_i^2}{\gamma^2} + \gamma^2w_i^2\Big). \tag{7.12}$$
Now we minimize with respect to γ:
$$\min_\gamma\ \frac{1}{2}\Big(\frac{a_i^2}{\gamma^2} + \gamma^2w_i^2\Big) = \frac{1}{2}(a_i^2 + w_i^2). \tag{7.13}$$
Let $g(\gamma) = \frac{1}{2}\big(\frac{a_i^2}{\gamma^2} + \gamma^2w_i^2\big)$; by the above, min g(γ) = g(1), so
$$g'(1) = -a_i^2 + w_i^2 = 0 \;\Rightarrow\; a_i^2 = w_i^2 \;\Rightarrow\; |a_i| = |w_i|.$$

We now proceed to finish our proof of Theorem 7.2. We can rewrite our neural net hθ(x) as
$$h_\theta(x) = \sum_{i=1}^{\infty} a_i[w_i x + b_i]_+ = \sum_{i=1}^{\infty} a_i|a_i|\Big[\frac{w_i}{|a_i|}x + \frac{b_i}{|a_i|}\Big]_+ = \sum_{i=1}^{\infty}\alpha_i[\tilde{w}_i x + \beta_i]_+,$$
where αi = ai|ai|, w̃i = wi/|ai|, and βi = bi/|ai|. We can pull |ai| out of the bracket because ReLU is positively homogeneous. Since θ attains the minimum in Eq. (7.11), by the results of Lemma 7.3 we know that αi ∈ {−ai², ai²} and w̃i ∈ {−1, 1}. Furthermore, we can rewrite C(θ):
$$C(\theta) = \frac{1}{2}\Big(\sum_{i=1}^{\infty} a_i^2 + \sum_{i=1}^{\infty} w_i^2\Big) = \frac{1}{2}\Big(\sum_{i=1}^{\infty} a_i^2 + \sum_{i=1}^{\infty} a_i^2\Big) = \sum_{i=1}^{\infty} a_i^2 = \sum_{i=1}^{\infty}|\alpha_i| = \|\alpha\|_1.$$

ei )}∞
Define a new neural net by the set of parameters θe = {(αi , βi , w i=1 . Then the objective function in
Eq. (7.11) becomes

R̃(f ) = min kαk1 s.t. f (x) = hθe(x). (7.14)

7.3.3 Discretization to rewrite hθe(x)


The neural net hθe(x) can be grouped by whether w ei = −1:
ei = 1 or w
X X
hθe(x) = αi [x + βi ]+ + αi [−x + βi ]+
i:w
ei =1 i:w
ei =−1

53
Figure 7.1: Example of discretization across R

Let Z = {z1 , · · · , zN } be a discretization of R. Then [x + βi ]+ can be approximated by [x + zj ]+ , where the


bin around zj is the “bucket” that βi falls into. Since [x + zj ]+ is constant over all i where βi falls into zj ,
we can isolate α such that X X
αi [x + βi ]+ ≈ [x + zj ]+ αi .
i:βi ∈zj i:βi ∈zj

Similar logic holds for [−x + zj ]+ . See Fig. 7.1 for an example of discretization.
Define u⁺(z) and u⁻(z) as
$$u^+(z) = \sum_{\substack{i:\,\tilde{w}_i = 1,\\ \beta_i = z}}\alpha_i, \qquad u^-(z) = \sum_{\substack{i:\,\tilde{w}_i = -1,\\ \beta_i = z}}\alpha_i.$$
Then
$$\sum_{i:\,\tilde{w}_i=1}\alpha_i[x + \beta_i]_+ = \sum_{z\in Z}[x + z]_+\Big(\sum_{\substack{i:\,\tilde{w}_i=1,\\ \beta_i = z}}\alpha_i\Big) = \sum_{z\in Z}[x + z]_+\,u^+(z),$$
and thus we have
$$h_{\tilde{\theta}}(x) = \sum_{z\in Z}[x + z]_+\,u^+(z) + \sum_{z\in Z}[-x + z]_+\,u^-(z).$$
Taking the limit N → ∞ makes the discretization fine-grained and results in the integral
$$h_{\tilde{\theta}}(x) = \int_{-\infty}^{\infty}[x + z]_+\,u^+(z)\,dz + \int_{-\infty}^{\infty}[-x + z]_+\,u^-(z)\,dz.$$

We can view hθe(x) as a linear combination of features [x + z]+ and [−x + z]+ for z ∈ R.

7.3.4 Reformulation of the objective
We know that
$$\sum_{i=1}^{\infty}|\alpha_i| \ge \int_{-\infty}^{\infty}|u^+(z)|\,dz + \int_{-\infty}^{\infty}|u^-(z)|\,dz$$
by the triangle inequality: for every bucket z, $\sum_{i:\,\beta_i = z}|\alpha_i| \ge \big|\sum_{i:\,\beta_i = z}\alpha_i\big| = |u^+(z)|$, and a similar bound holds for u⁻(z). Equality occurs at the minimum, as the optimal θ minimizes complexity regardless of how we rewrite the expression. Thus, we can update our objective in Eq. (7.14):
$$\min\ \int_{-\infty}^{\infty}|u^+(z)|\,dz + \int_{-\infty}^{\infty}|u^-(z)|\,dz \quad\text{s.t.}\quad f(x) = h_{\tilde{\theta}}(x). \tag{7.15}$$

We can write the first derivative of [x + z]₊ (with respect to x) as
$$\frac{d}{dx}[x + z]_+ = \mathbb{I}(x + z \ge 0),$$
and the second derivative of [x + z]₊ as
$$\frac{d}{dx}\mathbb{I}(x + z \ge 0) = \delta_{-x}(z) = \begin{cases}\infty & \text{if } z = -x\\ 0 & \text{otherwise}\end{cases}.$$
Likewise, the first derivative of [−x + z]₊ is
$$\frac{d}{dx}[-x + z]_+ = -\mathbb{I}(-x + z \ge 0),$$
and the second derivative of [−x + z]₊ is
$$\frac{d}{dx}\big[-\mathbb{I}(-x + z \ge 0)\big] = \delta_x(z) = \begin{cases}\infty & \text{if } z = x\\ 0 & \text{otherwise}\end{cases}.$$

The derivative of f(x) is
$$f'(x) = h'_{\tilde{\theta}}(x) = \int_{-\infty}^{\infty}\mathbb{I}(x + z \ge 0)\,u^+(z)\,dz - \int_{-\infty}^{\infty}\mathbb{I}(-x + z \ge 0)\,u^-(z)\,dz,$$
and the derivative of f'(x) is
$$f''(x) = \int_{-\infty}^{\infty}\delta_{-x}(z)\,u^+(z)\,dz + \int_{-\infty}^{\infty}\delta_x(z)\,u^-(z)\,dz = u^+(-x) + u^-(x).$$
Note that the last equality holds because δ₋ₓ(z) and δₓ(z) can be treated as "degenerate" probability distributions (with total mass 1 concentrated at −x and x, respectively). Our choice of x was arbitrary, so this holds for all x. Thus, the objective from Eq. (7.15) becomes
$$\min\ \int_{-\infty}^{\infty}|u^+(z)|\,dz + \int_{-\infty}^{\infty}|u^-(z)|\,dz \quad\text{s.t.}\quad \forall x,\ f''(x) = u^+(-x) + u^-(x). \tag{7.16}$$

7.3.5 Simplification of the constraints


We can further remove redundancies by parameterizing u+ (−x) and u− (x) in terms of a function q and using
the constraint f 00 (x) = u+ (−x) + u− (x):
1 00
u+ (−x) = (f (x) − q(x)),
2
1
u− (x) = (f 00 (x) + q(x)).
2

55
Then the objective function in Eq. (7.16) can be written as
$$\int_{-\infty}^{\infty}|u^+(z)|\,dz + \int_{-\infty}^{\infty}|u^-(z)|\,dz = \int_{-\infty}^{\infty}|u^+(-z)|\,dz + \int_{-\infty}^{\infty}|u^-(z)|\,dz$$
$$= \int_{-\infty}^{\infty}\Big|\frac{1}{2}\big(f''(z) - q(z)\big)\Big|\,dz + \int_{-\infty}^{\infty}\Big|\frac{1}{2}\big(f''(z) + q(z)\big)\Big|\,dz = \frac{1}{2}\int_{-\infty}^{\infty}\big(|f''(z) - q(z)| + |f''(z) + q(z)|\big)\,dz.$$
Since |f''(z) − q(z)| + |f''(z) + q(z)| equals 2|f''(z)| if |f''(z)| ≥ |q(z)| and 2|q(z)| if |f''(z)| < |q(z)|, we have a simple expression for the objective function:
$$\int_{-\infty}^{\infty}|u^+(z)|\,dz + \int_{-\infty}^{\infty}|u^-(z)|\,dz = \int_{-\infty}^{\infty}\max\{|f''(z)|, |q(z)|\}\,dz.$$

We can find a constraint on q using
$$f'(x) = \int_{-\infty}^{\infty}\mathbb{I}(x + z \ge 0)\,u^+(z)\,dz - \int_{-\infty}^{\infty}\mathbb{I}(-x + z \ge 0)\,u^-(z)\,dz,$$
specifically the values of f'(−∞) and f'(∞):
$$f'(-\infty) = -\int_{-\infty}^{\infty}u^-(z)\,dz = -\int_{-\infty}^{\infty}\frac{1}{2}\big(f''(z) + q(z)\big)\,dz,$$
$$f'(\infty) = \int_{-\infty}^{\infty}u^+(z)\,dz = \int_{-\infty}^{\infty}u^+(-z)\,dz = \int_{-\infty}^{\infty}\frac{1}{2}\big(f''(z) - q(z)\big)\,dz.$$
Thus, the sum
$$f'(-\infty) + f'(\infty) = -\int_{-\infty}^{\infty}q(z)\,dz \tag{7.17}$$
gives a constraint for q, and we can update the objective in Eq. (7.16) in terms of q:
$$\min\ \int_{-\infty}^{\infty}\max\{|f''(z)|, |q(z)|\}\,dz \quad\text{s.t.}\quad f'(-\infty) + f'(\infty) = -\int_{-\infty}^{\infty}q(z)\,dz. \tag{7.18}$$

Previous formulations of the objective were taken with respect to many (infinitely many) variables, but we have now found an equivalent objective with respect to q only. Consider the following discrete objective:
$$\min\ \sum_{i=1}^k\max\{a_i, |x_i|\} \quad\text{s.t.}\quad \sum_{i=1}^k x_i = B.$$
The minimum value of the objective function above is $\max\{\sum_{i=1}^k a_i, |B|\}$. Connecting this idea to our objective in Eq. (7.18), the minimum value of the objective function is
$$\max\Big\{\int_{-\infty}^{\infty}|f''(x)|\,dx,\ \ |f'(-\infty) + f'(\infty)|\Big\},$$
which is exactly R̄(f), so R̃(f) = R̄(f) for the initial objective in Eq. (7.11) and we are done.

Chapter 8

Optimization, feature methods and transfer

8.1 Review and overview


In the previous chapter, we began our discussion of neural nets and claimed that, for one-dimensional inputs, neural nets are equivalent to nonparametric penalized regression; specifically, we showed that there exists a complexity measure R̄(f) for nonparametric penalized regression such that the nonparametric penalized regression problem is the same as the parametrized neural network one, and we derived the formula for R̄(f).
In this chapter, we will take a look at algorithms that solve this minimization problem and discuss the intuition behind them. In particular, we will discuss gradient descent methods and feature learning.

8.2 Optimization
8.2.1 Basic Premise
In a general sense, our neural network function is of the form
$$h_\theta(x) = a^\top\phi_w(x),$$
where φw(x) contains many layers. We established the conventional viewpoint that the earlier layers of φw(x) produce features and the last layer produces a linear prediction from them. The typical objective function for regression is then
$$L(\theta) = \frac{1}{2n}\sum_{i=1}^n\big(y^{(i)} - h_\theta(x^{(i)})\big)^2 + \frac{\lambda}{2}\|\theta\|_2^2. \tag{8.1}$$
We want to find algorithms that will help us solve this optimization problem.

8.2.2 Gradient Descent


Let us start from some initialization θ0. This initialization is often random, but the exact initialization — or rather, its scale — matters. After initializing, we repeatedly take the gradient and update:
$$\theta_{t+1} = \theta_t - \eta\nabla L(\theta_t).$$
Why does this work? In essence, we are finding the direction of steepest descent at the point θt and moving in that direction. If we look at a Taylor expansion of L(θ) at the point θt, we get
$$L(\theta) = L(\theta_t) + \langle\nabla L(\theta_t), \theta - \theta_t\rangle + \text{higher order terms}.$$
Notice that the second term is linear in θ. If we ignore the higher order terms and minimize over a Euclidean ball around θt (the ball is required to maintain the accuracy of the Taylor expansion), we get
$$\operatorname*{argmin}_\theta\ L(\theta_t) + \langle\nabla L(\theta_t), \theta - \theta_t\rangle \quad\text{s.t.}\quad \|\theta - \theta_t\|_2 \le \varepsilon.$$
L(θt) is a constant in this case, so this simplifies to
$$\min_\theta\ \langle\nabla L(\theta_t), \theta - \theta_t\rangle \quad\text{s.t.}\quad \|\theta - \theta_t\|_2 \le \varepsilon.$$
This is equivalent to finding, for the fixed vector v = ∇L(θt), the vector x = θ − θt of norm at most ε with minimum inner product with v. The optimal solution is x = −cv, where c is a scalar constant greater than 0: the two vectors point in opposite directions (minimizing the inner product), and c scales x to lie within our previously established ball. Therefore, the optimal step is
$$\theta - \theta_t = -c\cdot\nabla L(\theta_t).$$
Thus the steepest-descent direction is locally optimal, which is exactly the direction the gradient descent method takes at every step.
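A minimal gradient descent loop (a sketch with an illustrative quadratic objective; in practice ∇L comes from the model):

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, n_steps=100):
    """theta_{t+1} = theta_t - eta * grad(theta_t)."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - eta * grad(theta)
    return theta

# toy usage: minimize L(theta) = 0.5 * ||A theta - b||^2
A = np.array([[2.0, 0.0], [0.0, 1.0]]); b = np.array([1.0, -1.0])
grad = lambda th: A.T @ (A @ th - b)
print(gradient_descent(grad, np.zeros(2)))   # approaches A^{-1} b = [0.5, -1]
```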

8.2.3 Stochastic gradient descent


For many machine learning problems, computing the gradient is a computationally expensive task. Consider the gradient for the loss function in Eq. (8.1):
$$\nabla L(\theta_t) = \frac{1}{2n}\sum_{i=1}^n\nabla_\theta\big(y^{(i)} - h_{\theta_t}(x^{(i)})\big)^2 + \lambda\theta_t.$$
Calculating the summed gradient of ∇θ(y⁽ⁱ⁾ − h_{θt}(x⁽ⁱ⁾))² over the entire dataset is expensive for complex neural nets (with many parameters) and/or large sample sizes. Stochastic gradient descent relies on using a small subset of the samples to estimate the gradient, which is effective, especially during the initial stages of training, because gradients of the individual data points will often point in somewhat similar directions. We can write the loss as an average of per-example losses:
$$L(\theta) = \frac{1}{n}\sum_{i=1}^n\ell_i(\theta), \qquad \ell_i(\theta) = \frac{1}{2}\big(y^{(i)} - h_\theta(x^{(i)})\big)^2 + \frac{\lambda}{2}\|\theta\|_2^2.$$

The SGD algorithm can be described as follows:

1. Sample a subset S = {i1, ..., iB} ⊆ {1, ..., n}.
2. Form the gradient estimate
$$g_S(\theta) = \frac{1}{B}\sum_{k=1}^B\nabla\ell_{i_k}(\theta).$$
Note that gS(θ) is unbiased because
$$\mathbb{E}_S[g_S(\theta)] = \frac{1}{B}\sum_{k=1}^B\mathbb{E}[\nabla\ell_{i_k}(\theta)] = \frac{1}{B}\sum_{k=1}^B\nabla L(\theta) = \nabla L(\theta).$$
3. Resample S at each step and update θ_{t+1} = θt − η gS(θt) for t = 0 to t = T, where T is the number of iterations.
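A sketch of this loop in code (the per-example gradient `grad_i(theta, i)` is assumed given; names are illustrative):

```python
import numpy as np

def sgd(grad_i, theta0, n, batch_size=32, eta=0.01, n_steps=1000, seed=0):
    """theta_{t+1} = theta_t - eta * (1/B) sum_{i in S} grad_i(theta_t, i)."""
    rng = np.random.default_rng(seed)
    theta = theta0
    for _ in range(n_steps):
        S = rng.choice(n, size=batch_size, replace=False)   # minibatch indices
        g = np.mean([grad_i(theta, i) for i in S], axis=0)  # unbiased estimate
        theta = theta - eta * g
    return theta
```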

8.2.4 Computing the gradient
The gradient of a single data point is
$$\nabla_\theta\big(y^{(i)} - h_\theta(x^{(i)})\big)^2 = -2\big(y^{(i)} - h_\theta(x^{(i)})\big)\nabla_\theta h_\theta(x^{(i)}).$$
Hence, it suffices to find an evaluable expression for ∇θ hθ(x⁽ⁱ⁾). Recall that hθ(x) = a^⊤σ(Wx). Then the partial derivatives are:
$$\frac{\partial}{\partial a_j}h_\theta(x) = \sigma(w_j^\top x), \qquad \nabla_W h_\theta(x) = \big(a\odot\sigma'(Wx)\big)x^\top,$$
where ⊙ is the element-wise product.
We can also present an informal statement about computation time. Suppose ℓ(θ1, ..., θp) : R^p → R can be evaluated by a differentiable circuit (a sequence of elementary operations) of size N. Then the gradient ∇ℓ(θ) can be computed in time O(N + p) using a circuit of size O(N + p). This means that the time to compute the gradient is similar to the time to compute the function value. The only requirement is that the operations of the circuit are differentiable.
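As a sanity check on these formulas, one can compare them against finite differences (a sketch; tanh is used as a smooth activation so that σ' is well defined everywhere):

```python
import numpy as np

def h(a, W, x):
    return a @ np.tanh(W @ x)

def grads(a, W, x):
    s = np.tanh(W @ x)
    grad_a = s                              # dh/da_j = sigma(w_j^T x)
    grad_W = np.outer(a * (1 - s**2), x)    # (a ⊙ sigma'(Wx)) x^T
    return grad_a, grad_W

rng = np.random.default_rng(0)
a, W, x = rng.standard_normal(4), rng.standard_normal((4, 3)), rng.standard_normal(3)
ga, gW = grads(a, W, x)
eps = 1e-6
E = np.zeros_like(W); E[1, 2] = eps
print(np.isclose(gW[1, 2], (h(a, W + E, x) - h(a, W - E, x)) / (2 * eps)))  # True
```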

8.3 Learning Features


Neural nets learn better features than those designed by hand in the kernel method. Suppose we have a simple two-layer neural net hθ(x) = a^⊤σ(Wx) with objective
$$\min\ \|a\|_2^2 + \|w\|_2^2 \quad\text{s.t.}\quad y^{(i)} = a^\top\sigma(Wx^{(i)}). \tag{8.2}$$
For d-dimensional inputs, W ∈ R^{m×d}, and we assume m is sufficiently large. For the kernel method with features σ(Wx) for a random W, the objective is
$$\min\ \|a\|_2^2 \quad\text{s.t.}\quad y^{(i)} = a^\top\sigma(Wx^{(i)}).$$
The objective for the neural net is equivalent to
$$\min\ \|a\|_1 \quad\text{s.t.}\quad y^{(i)} = a^\top\sigma(Wx^{(i)}),$$
where W is random. With the ℓ1 norm, the neural net prefers sparse solutions, similar to lasso regression. Thus, unlike the kernel method, the neural net actively selects features [Wei et al., 2020].

8.4 Transfer Learning


Transfer learning aims to "transfer" features trained on a large dataset to a small, yet different, dataset. Consider the big dataset $(x^{(1)}, y^{(1)}), \cdots, (x^{(n)}, y^{(n)}) \overset{\text{iid}}{\sim} P_{\text{transfer}}$ and the small target dataset $(\tilde{x}^{(1)}, \tilde{y}^{(1)}), \cdots, (\tilde{x}^{(m)}, \tilde{y}^{(m)}) \overset{\text{iid}}{\sim} P_{\text{target}}$, where n ≫ m. Our objective is to model P_target. A simple approach to transfer learning can be outlined as follows:
1. Train a (deep) neural net hθ(x) = a^⊤φw(x) on (x⁽¹⁾, y⁽¹⁾), ..., (x⁽ⁿ⁾, y⁽ⁿ⁾). Often we can find and download a model previously trained on our big dataset (especially for famous datasets such as ImageNet). This neural net gives us values for â and ŵ.
2. Train a linear model gb(x) = b^⊤φ_w̃(x) on (x̃⁽¹⁾, ỹ⁽¹⁾), ..., (x̃⁽ᵐ⁾, ỹ⁽ᵐ⁾), discarding â and fixing w̃ = ŵ from hθ. Thus, our objective function is
$$\min_b\ \frac{1}{2m}\sum_{i=1}^m\big(g_b(\tilde{x}^{(i)}) - \tilde{y}^{(i)}\big)^2 + \frac{\lambda}{2}\|b\|_2^2.$$

We also present an improved method that fine-tunes w:
1. Train a (deep) neural net hθ(x) = a^⊤φw(x) on (x⁽¹⁾, y⁽¹⁾), ..., (x⁽ⁿ⁾, y⁽ⁿ⁾).
2. Train the model g_{b,w}(x) = b^⊤φw(x) on (x̃⁽¹⁾, ỹ⁽¹⁾), ..., (x̃⁽ᵐ⁾, ỹ⁽ᵐ⁾), still discarding â but no longer fixing w = ŵ. Thus, our objective function is
$$\min_{b,w}\ \frac{1}{2m}\sum_{i=1}^m\big(g_{b,w}(\tilde{x}^{(i)}) - \tilde{y}^{(i)}\big)^2.$$
The improved method can be implemented using SGD initialized at w = ŵ. We want to keep w close to its initialization (tactics like early stopping can be used). This is useful for tasks where both datasets share similar goals but have slightly different contexts.

8.5 Few-shot learning


Few-shot learning addresses an even more extreme case than transfer learning, where the target dataset is even smaller. The learning setting involves training data of, say, N examples and l classes, where both N and l are very big (ImageNet [Deng et al., 2009], a standard example, has N = 1.2M and l = 10³).
At test time, however, we are only given a small number of examples (x̃⁽¹⁾, ỹ⁽¹⁾), ..., (x̃⁽ⁿᵏ⁾, ỹ⁽ⁿᵏ⁾) drawn i.i.d. from a distribution P_test. There are k new labels or classes and n images per new class. In this scenario, n is very small (such as n = 5), and k could be bigger. Such a setting is called a "k-way n-shot setting." With this limited data for each new class, the goal is to classify the examples from P_test with one of the k labels. In few-shot learning settings, the feature dimension is typically large (on the order of 10³), which makes it difficult to fine-tune a model to the small test dataset, as was done in transfer learning, because overfitting becomes likely.

8.5.1 Nearest neighbor algorithms using features


A simple but competitive algorithm in few-shot learning settings is the nearest-neighbor method on learned features, with steps as follows:
1. Pretrain a neural network on the large pretraining dataset, which results in an output of a^⊤φW(x).
(a) In this case, we enforce φW(x) to have norm 1 during training. (This is helpful in order to not have dramatically different norms for different examples, and is likely applied by research teams, such as Google, that release pretrained neural networks.) This can be done by changing the parameterization to
$$\phi_W(x) = \mathrm{normalize}(NN_W(x)) = \frac{NN_W(x)}{\|NN_W(x)\|_2},$$
where NN_W(x) is the standard feed-forward neural network. This normalization is a sequence of elementary operations, which can be done efficiently (as can computing ‖NN_W(x)‖₂) and allows for efficient gradient calculations with auto-differentiation in backpropagation. Performing this operation then implies that ‖φW(x)‖₂ = 1.
2. At test time, we utilize a one-nearest-neighbor algorithm. (Here, we predict based on the single nearest neighbor rather than a combination of the k nearest as used in k-nearest neighbors.) Generally, given an example x, we wish to predict the output label y. The steps are as follows:

(a) Compute φw(x).

(b) Find the nearest neighbor in {φw(x̃⁽¹⁾), ..., φw(x̃⁽ⁿᵏ⁾)}. The "nearness" is quantified according to ℓ2 distance or cosine distance. The ℓ2 distance is d(a, b) = ‖a − b‖₂; squaring this calculation results in
$$\|a - b\|_2^2 = \|a\|_2^2 + \|b\|_2^2 - 2\langle a, b\rangle, \tag{8.3}$$
which, given the unit norms enforced on the features φw(x̃⁽ⁱ⁾), can be simplified to
$$\|a - b\|_2^2 = 2 - 2\langle a, b\rangle, \tag{8.4}$$
a constant shift of the (negated) cosine similarity ⟨a, b⟩, the cosine of the angle between the unit vectors a and b. Let's suppose that the nearest neighbor is φw(x̃⁽ʲ⁾).

(c) Assign the output label ỹ⁽ʲ⁾, the label of this "nearest neighbor," to the example x.
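A sketch of this test-time procedure (assuming a pretrained, normalized feature function `phi` is given; names are illustrative):

```python
import numpy as np

def one_nn_predict(phi, support_x, support_y, x):
    """1-NN on unit-norm features: maximize cosine similarity <phi(x), phi(x_i)>."""
    feats = np.stack([phi(s) for s in support_x])   # nk x p, each row unit norm
    sims = feats @ phi(x)                           # cosine similarities
    return support_y[int(np.argmax(sims))]          # label of nearest neighbor

# toy usage with a hypothetical normalized feature map
phi = lambda v: v / np.linalg.norm(v)
X_support = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
y_support = np.array([0, 1])
print(one_nn_predict(phi, X_support, y_support, np.array([0.9, 0.2])))  # 0
```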

Chapter 9

Density estimation

9.1 Unsupervised learning: estimating the CDF


Now we return to classical methods, with one-dimensional problems that can be described by cumulative distribution functions (CDFs) and probability density functions (PDFs) (rather than the high-dimensional feature sets of machine learning).

9.1.1 Setup of CDF estimation


Let F be a distribution over ℝ. Additionally, let us observe n examples drawn i.i.d. from this distribution: X₁, ..., Xₙ ∼ F.

Our goal is to estimate the underlying CDF function, F (x).


From earlier chapters, we can recall that a property of the CDF is that if X ∼ F, then F(x) = Pr[X ≤ x] ∈ [0, 1] (the probability that we observe a value less than or equal to x). F(x) is monotonically increasing, and an example of a CDF is shown in Fig. 9.1.

9.1.2 Empirical estimators


Given the empirical examples X₁, ..., Xₙ ∼ F (i.i.d.), we can estimate the underlying CDF, F(x), by evaluating how often Xᵢ ≤ x for a given x and 1 ≤ i ≤ n (that is, how often our examples are less than or equal to the input value). Thus, we can consider the empirical estimator F̂ₙ(x) defined as

   F̂ₙ(x) = (1/n) Σ_{i=1}^n 1(Xᵢ ≤ x). (9.1)

We can make the following observations on the function F̂n (·):


• F̂n (·) as a function is called an empirical distribution function.

• F̂ₙ(·) is a step function which only takes values in {0, 1/n, 2/n, ..., 1} (though, for a given sample, not all numbers in this range may be attained). This is because F̂ₙ(·) multiplies 1/n by a sum of n terms with values in {0, 1}, which implies that 0 ≤ F̂ₙ(x) ≤ 1.

• F̂ₙ(·) is a CDF itself, and is in fact the CDF of the uniform distribution over {X₁, ..., Xₙ} (each observed point receiving mass 1/n).

Figure 9.1: CDF of the standard normal distribution.

Figure 9.2: Empirical CDF for 100 randomly generated N (0, 1) points against standard normal CDF.

Using the previous example of F (x) (the standard normal distribution), we can illustrate what form the
estimator will take for an example with n = 100 data points in Fig. 9.2.
Although this class will not overview the in-depth theory, it is possible to show for a given x that as the
number of examples n → ∞, this estimator F̂n (x) converges to the underlying distribution function F (x).
Indeed, we see in Fig. 9.2 that at n = 100 we achieve a reasonably good estimate.
In the extreme lower and/or upper range of inputs x, the density of data points is close to 0 (since F(x) is flat there), and the estimator rarely transitions to the next step, given the relative lack of examples in these regions. In the opposite scenario, in a region where F(x) increases sharply, there are more examples, and the empirical CDF will step upward more quickly.
The following section involves analysis of simple theorems related to the estimator F̂n (x).
Theorem 9.1. For any fixed value of x, the expectation of the empirical estimator, E[F̂ₙ(x)], satisfies

   E[F̂ₙ(x)] = F(x), (9.2)

with randomness over the choice of X₁, ..., Xₙ. This means that F̂ₙ(x) is an unbiased estimator of F(x).
Proof This result can be seen by evaluating E[F̂ₙ(x)], since

   E[F̂ₙ(x)] = E[(1/n) Σ_{i=1}^n 1(Xᵢ ≤ x)]
            = (1/n) Σ_{i=1}^n E[1(Xᵢ ≤ x)]
            = (1/n) Σ_{i=1}^n Pr(Xᵢ ≤ x)
            = F(x). (9.3)

We can also evaluate the variance of F̂ₙ(x) as

   Var[F̂ₙ(x)] = F(x)(1 − F(x)) / n, (9.4)

and observe that the numerator is bounded (it is at most 1/4) while the denominator grows without bound as we collect more examples. This means that, with more examples, the variance becomes smaller, and the (unbiased) estimate becomes more accurate. Therefore, we see that

   F̂ₙ(x) →ᵖ F(x), (9.5)

or that our estimator converges in probability to the true underlying function (with more and more examples).
Theorem 9.2 (Glivenko-Cantelli).

   sup_x |F̂ₙ(x) − F(x)| → 0 almost surely. (9.6)

This means that the supremum of |F̂ₙ(x) − F(x)| almost surely converges to 0.
Expressed in words, the Glivenko-Cantelli theorem ensures that the estimator converges to the true
underlying distribution over the entire function.
Remark 9.3. While smoothing may produce an estimator that looks more similar to a true CDF, the step-wise estimator described is optimal for ensuring convergence to the underlying CDF, and smoothing is therefore not necessary. However, this estimator has zero derivative almost everywhere and is non-differentiable at the data points, which makes it inapplicable when trying to estimate the density of the data (which is the derivative of the CDF).
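The empirical CDF is straightforward to implement. The sketch below (in numpy, with the true standard normal CDF written via the error function) also illustrates the Glivenko-Cantelli theorem numerically by printing the sup-distance for growing n; the grid-based supremum is an approximation.

    import numpy as np
    from math import erf, sqrt

    def ecdf(samples):
        """Empirical CDF F_hat_n of Eq. (9.1), returned as a callable."""
        xs = np.sort(samples)
        return lambda t: np.searchsorted(xs, t, side="right") / len(xs)

    def std_normal_cdf(t):
        return 0.5 * (1.0 + erf(t / sqrt(2.0)))

    rng = np.random.default_rng(0)
    grid = np.linspace(-4, 4, 2001)
    for n in [100, 1000, 10000]:
        F_hat = ecdf(rng.standard_normal(n))
        sup_err = max(abs(F_hat(t) - std_normal_cdf(t)) for t in grid)
        print(n, round(sup_err, 4))   # sup-distance shrinks, per Glivenko-Cantelli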

9.1.3 Estimating functionals of the CDF
Consider T (F ), which is a function of the CDF F (x). For example, T (F ) could be any of the following
functions of F that represent a property of the CDF:
• Mean of the distribution F .
• Variance of F .
• Skewness of F (measuring the asymmetry of the CDF about the mean).
• Quantile of F .
A plug-in estimator uses T(F̂ₙ) as the estimator (directly plugging the estimator for F into the functional). Under certain conditions (satisfied by the functionals listed above), T(F̂ₙ) → T(F). Note that more "abnormal" functionals of F (such as the derivative) do not satisfy these conditions.
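For instance, here is a small sketch of plug-in estimation: each quantity below is T(F̂ₙ), i.e., the corresponding functional applied to the uniform distribution over the observed sample.

    import numpy as np

    def plugin_estimates(samples):
        """Plug-in estimators T(F_hat_n): functionals of the empirical CDF."""
        x = np.asarray(samples)
        mean = x.mean()                    # T(F) = mean
        var = ((x - mean) ** 2).mean()     # T(F) = variance (note: 1/n, not 1/(n-1))
        median = np.quantile(x, 0.5)       # T(F) = 0.5-quantile
        return mean, var, median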

9.2 Unsupervised learning: density estimation


As before, we assume that we have data points X₁, ..., Xₙ ∼ F (i.i.d.), with F(x) as the CDF of the underlying data distribution F.


The PDF, or density, is f(x) = F′(x); in other words, it is the derivative of the CDF. In density estimation, the goal is to estimate the underlying density function f(x) from the data X₁, ..., Xₙ. In order to do so, we utilize technical ideas similar to regression problems: instead of predicting an output y, we aim to predict f(x). However, we do not directly observe f(x) at any of the data points. This problem is also separate from the problem of empirical CDF estimation, since we cannot simply take f̂ = F̂ₙ′(x) for an empirical CDF estimator F̂ₙ(x): that estimator is a step function whose derivative is 0 at most inputs and undefined at the data points.

9.2.1 Measuring performance of density estimators


There are several different ways in which to measure and evaluate the performance of density estimators.
The most common way to measure the performance of a density estimator f̂ in accurately estimating the true density f is by calculating the integrated mean squared error:

   R(f̂, f) = ∫ (f̂(x) − f(x))² dx.

Often the integration is over a restricted range (e.g., the interval from 0 to 1). For a one-dimensional problem, this can be seen as a natural extension of the mean squared error.
Another way to calculate the risk is the ℓ₁ integrated risk, also known (up to a constant factor) as the total variation (TV) distance between the two distributions f and f̂:

   R_{ℓ₁}(f̂, f) = ∫ |f̂(x) − f(x)| dx. (9.7)

Throughout the rest of the chapter, we will utilize the mean squared error as the metric to evaluate density estimator performance. Part of the reason is that it is much easier to work with mathematically, and it lets us understand, for example, the bias-variance tradeoff more clearly.
Remark 9.4. The mean squared error is not very useful in high dimensions. (If f = fˆ, then the mean squared
error will evaluate to 0, but this error generally does not scale well in higher dimensions.) This problem is
elaborated on in the section below.

9.2.2 Mean squared error in high-dimensional spaces
Consider the d-dimensional problem as follows. We assume that f is a spherical Gaussian, N(0, I). It follows that

   f(x) = (1/(√(2π))^d) · exp(−‖x‖₂²/2). (9.8)

Some key observations we can make about this density function are:

• f is a density and therefore

   f(x) ≥ 0. (9.9)

• We can evaluate the point with the largest density as

   sup_x f(x) = f(0) = 1/(√(2π))^d, (9.10)

which is exponentially small in d. This means that, in high-dimensional spaces, we are aiming to predict very small values, an issue that is exacerbated in the integrated mean squared error calculation.

Now, consider some f̂ that approximates f reasonably well. Because we have shown the output of f(x) to be at most the exponentially small quantity 1/(√(2π))^d, we can reasonably expect that f̂(x) ≤ 1/(√(2π))^d for most x.
We can evaluate the integrated mean squared error between the described f and f̂ as follows:

   R(f̂, f) = ∫ (f̂(x) − f(x))² dx
           ≤ ∫ |f̂(x) − f(x)| · (|f̂(x)| + |f(x)|) dx
           ≲ (2/(√(2π))^d) ∫ |f̂(x) − f(x)| dx
           ≲ (2/(√(2π))^d) ∫ (f̂(x) + f(x)) dx    (f and f̂ are non-negative)
           ≤ 4/(√(2π))^d.

In conclusion, if we have an estimator f̂ such that f̂(x) ≤ 1/(√(2π))^d for all x, then

   R(f̂, f) ≤ 4/(√(2π))^d, (9.11)

and thus f and f̂ need not be close by any means for the error to be very small (inverse exponential in d). Note that the TV distance is also not very meaningful in high dimensions: generally, measuring the distance between two distributions in high-dimensional space is non-trivial. There are, however, alternatives that offer slightly better ways of measuring the performance of density estimators in high dimensions. One such alternative is the KL divergence. However, the KL divergence can still be large for very similar distributions: take, for example, two distributions P₁ = N(0, I) and P₂ = N(μ, I), where μ is a small vector; the KL divergence between them can still become very large. Wasserstein distance is another alternative that incorporates geometry into the calculation, and it performs better for examples such as two point-mass distributions that are very close to one another.

9.2.3 Mean squared error and other errors in low-dimensional spaces
Suppose that d = 1, so the situation is low-dimensional. In this case, use of the mean squared error is acceptable (as are the other distance metrics discussed). Going forward in the chapter, we will primarily focus on one-dimensional scenarios.

9.2.4 Bias-variance tradeoff


Just as we evaluated the bias-variance tradeoff in regression problems, we can decompose the expectation of the integrated mean squared error risk over the randomness of X₁, ..., Xₙ:

   E[∫ (f̂(x) − f(x))² dx] = ∫ E[(f(x) − f̂(x))²] dx.

Because, for a random variable Z, E[Z²] = (E[Z])² + Var(Z), we can decompose the integrand as:

   E[(f(x) − f̂(x))²] = (E[f(x) − f̂(x)])² + Var(f(x) − f̂(x)) (9.12)
                     = (f(x) − E[f̂(x)])² + Var(−f̂(x))    (f(x) is a constant)
                     = (E[f̂(x)] − f(x))² + Var(f̂(x)).

Thus, we can continue to evaluate E[∫ (f(x) − f̂(x))² dx] as

   E[∫ (f(x) − f̂(x))² dx] = ∫ (E[f̂(x)] − f(x))² dx + ∫ Var(f̂(x)) dx, (9.13)

where the first term is the bias term and the second term is the variance (although sometimes each term including the integral is regarded as the bias and variance, respectively). This clean separation of bias and variance is a property of the integrated mean squared error loss.

9.2.5 Histograms
The first algorithm that we will discuss is the histogram algorithm, which is an analog of regressograms.
Recalling from previous chapters, we remember that the process of solving a regressogram problem involves:
1. binning the input domain.
2. fitting constant density functions across each bin.

These two steps are also used in the histogram algorithm.


If we assume X ∈ [0, 1], then we can create bins B₁, ..., Bₘ within the input range, and we fit a constant function over each bin, with values z₁, ..., zₘ. To set up notation, we first define the length of each bin,

   h = 1/m.

Furthermore, let Yᵢ equal the number of observations (data points) in bin Bᵢ. Then, we define

   p̂ᵢ = Yᵢ/n = the fraction of data points in bin Bᵢ.

The value of zᵢ ∝ p̂ᵢ, but we need to normalize in order to obtain a proper density function. For z₁, ..., zₘ to form a proper density, we require that

   ∫ f̂(x)dx = Σ_{i=1}^m ∫_{Bᵢ} f̂(x)dx = Σ_{i=1}^m h · zᵢ = 1. (9.14)

Suppose that zᵢ = c · p̂ᵢ. Then, using the property that Σ_{i=1}^m p̂ᵢ = 1, we see that

   ∫ f̂(x)dx = h · c Σ_{i=1}^m p̂ᵢ = 1  ⟹  c = 1/(h · Σ_{i=1}^m p̂ᵢ) = 1/h  ⟹  zᵢ = p̂ᵢ/h, (9.15)

which tells us that each zᵢ is computed as the fraction of the points in bin Bᵢ normalized by the size of the bin. More succinctly, we can write

   f̂(x) = Σ_{j=1}^m zⱼ 1(x ∈ Bⱼ) = Σ_{j=1}^m (p̂ⱼ/h) 1(x ∈ Bⱼ). (9.16)
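A short numpy sketch of this estimator, assuming data supported on [0, 1]: the bin probabilities p̂ⱼ come from counts, and the density values are zⱼ = p̂ⱼ/h.

    import numpy as np

    def histogram_density(samples, m):
        """Histogram density estimator on [0, 1] with m equal bins (h = 1/m).
        Returns f_hat as a callable, with z_j = p_hat_j / h as in Eq. (9.15)."""
        h = 1.0 / m
        counts = np.histogram(samples, bins=m, range=(0.0, 1.0))[0]
        z = counts / (len(samples) * h)    # p_hat_j / h
        def f_hat(x):
            j = np.clip((np.asarray(x) / h).astype(int), 0, m - 1)
            return z[j]
        return f_hat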

9.2.6 Bias-variance of histogram


In the case of the histogram algorithm, we can explicitly compute the bias and variance. First, for use in our calculation of the bias and variance for x ∈ Bⱼ, we can evaluate the expectation of the estimator at x as

   E[f̂(x)] = E[p̂ⱼ/h]
            = (1/h) · E[p̂ⱼ]
            = (1/h) · Pr[X ∈ Bⱼ]
            = (1/h) · ∫_{Bⱼ} f(u)du
            = pⱼ/h,

with pⱼ ≜ Pr[X ∈ Bⱼ] defined as the probability of a random sample falling in bin Bⱼ.

Bias

Thus, we can evaluate the (squared) bias as

   Bias = (f(x) − E[f̂(x)])² = (f(x) − pⱼ/h)². (9.17)

When h is infinitesimally small, each bin becomes a very small window. Knowing that ∫_{Bⱼ} f(u)du can then be approximated as h · f(x) for any x ∈ Bⱼ allows us to evaluate

   pⱼ/h = (1/h) ∫_{Bⱼ} f(u)du ≈ (1/h) · h · f(x) = f(x), (9.18)

and thus the bias goes to 0 as h → 0.

Variance

When evaluating the variance of the estimator for x ∈ Bⱼ,

   Var(f̂(x)) = (1/h²) Var(p̂ⱼ). (9.19)

We can note that the number of points falling into a given bin Bⱼ is

   n p̂ⱼ = Yⱼ ∼ Binomial(n, pⱼ) (9.20)
        = Σ_{i=1}^n 1(Xᵢ ∈ Bⱼ),

where each 1(Xᵢ ∈ Bⱼ) follows the Bernoulli distribution with parameter pⱼ.

Thus, we can calculate the variance of n p̂ⱼ as

   Var(n p̂ⱼ) = Σ_{i=1}^n Var(1(Xᵢ ∈ Bⱼ)) = n · pⱼ(1 − pⱼ), (9.21)

and we can therefore evaluate Var(p̂ⱼ) as

   Var(p̂ⱼ) = (1/n²) Var(n p̂ⱼ) = pⱼ(1 − pⱼ)/n, (9.22)

allowing us to evaluate Var(f̂(x)) as

   Var(f̂(x)) = (1/h²) Var(p̂ⱼ) = pⱼ(1 − pⱼ)/(h² · n). (9.23)

By analyzing the above result, we see that when h → 0, then Var(f̂(x)) → ∞, and when n → ∞, then Var(f̂(x)) → 0. (This is consistent with the results we saw for regression problems.) As a note, we can clearly see the tradeoff between bias and variance in this scenario: the bias goes to 0 as h → 0, while the variance goes to ∞ as h → 0. This, once again, is the central tradeoff.

Theorem 9.5. Suppose f′ is absolutely continuous and that ∫ f′(u)² du < ∞. Then, for the histogram estimator f̂,

   R(f̂, f) = (h²/12) ∫ (f′(u))² du + 1/(nh) + O(h²) + O(1/n), (9.24)

where the first term is the bias term and the second term is the variance. (The bias depends on the Lipschitzness of f, meaning that f must be smooth.)

9.2.7 Finding the optimal h∗


The best value for h is the minimizer of R(f̂, f) over h, ignoring higher-order terms. Using the results from Theorem 9.5, we see that

   h* = argmin_h [ (h²/12) ∫ (f′(u))² du + 1/(nh) ] (9.25)
      = (1/n^{1/3}) · (6 / ∫ (f′(u))² du)^{1/3},

which, most importantly, informs us that h* ∝ n^{−1/3}. Plugging in this choice of h*, we get

   R(f̂, f) ∼ c n^{−2/3}, (9.26)

which shows the convergence rate of the error as n → ∞.

9.2.8 Proof sketch of Theorem 9.5


When working through linear regression problems in previous chapters, we never derived such theorems. However, we can prove Theorem 9.5 as shown in this proof sketch. We have shown that f̂(x) = p̂ⱼ/h if x ∈ Bⱼ. We also saw that we can evaluate the expectation of the estimator f̂(x) at a point x ∈ Bⱼ as

   E[f̂(x)] = pⱼ/h = (1/h) · ∫_{Bⱼ} f(u)du. (9.27)

Before, we had roughly approximated f(u) ≈ f(x). However, we can more explicitly obtain an expression for f(u) using a first-order Taylor expansion:

   f(u) = f(x) + (u − x)f′(x) + O(h²), (9.28)

since |u − x| ≤ h, so the higher-order terms scale as O(h²).

Therefore, we can further simplify our calculation of E[f̂(x)] = (1/h) ∫_{Bⱼ} f(u)du by evaluating

   ∫_{Bⱼ} f(u)du = ∫_{Bⱼ} (f(x) + (u − x)f′(x) + O(h²)) du
                = h · f(x) + f′(x) ∫_{Bⱼ} (u − x)du + O(h²) · h
                = h · f(x) + f′(x) · O(h²) + O(h³),

given that |u − x| ≤ h and the size of Bⱼ is likewise bounded by h. Thus, we can evaluate the expectation of the estimator as

   E[f̂(x)] = (1/h) · ∫_{Bⱼ} f(u)du = f(x) + f′(x) · O(h) + O(h²). (9.29)

Bias

Given our previous calculation of E[f̂(x)], we can evaluate the bias as

   (f(x) − E[f̂(x)])² = (f′(x) · O(h) + O(h²))² (9.30)
                     = h² (f′(x) + O(h))²
                     = O(h²) f′(x)² + O(h³).

And thus, the integrated bias can be calculated as

   ∫ (f(x) − E[f̂(x)])² dx = O(h²) · ∫ f′(x)² dx + O(h³). (9.31)

Variance

Given our previous calculation of E[f̂(x)], we can evaluate the integrated variance as

   ∫ Var(f̂(x))dx = Σ_{j=1}^m ∫_{Bⱼ} Var(f̂(x))dx (9.32)
                 = Σ_{j=1}^m (pⱼ(1 − pⱼ)/(h² · n)) · h
                 ≤ Σ_{j=1}^m (pⱼ/(h² · n)) · h
                 = (1/(nh)) Σ_{j=1}^m pⱼ
                 = 1/(nh).

Notice that the variance does not depend on f.

Chapter 10

Kernel density estimation and Bayesian linear regression

10.1 Review and overview


In the last lecture, we began our treatment of nonparametric density estimation with a discussion of the histogram algorithm. This intuitive approach involves constructing a histogram from our observations and normalizing it so that it constitutes a valid density function. In particular, we discussed the bias-variance trade-off: as the bandwidth increases, the bias increases while the variance decreases; and as we consider more examples, the variance decreases but the bias, given its dependence only on the expectation of the algorithm's predictions, does not change with the number of examples.

In this lecture, we move to kernel density estimation, a more sophisticated technique for this problem. Then, we briefly touch on parametric and nonparametric mixture models and begin our discussion of Bayesian nonparametric methods.

10.2 Kernel density estimation


10.2.1 Introduction
We define the kernel density estimator

   f̂(x) = (1/(nh)) Σ_{i=1}^n K((x − xᵢ)/h), (10.1)

where h is the bandwidth and K is the kernel function. Recall from Lecture 1 that we have defined the kernel function to be any smooth, non-negative function K such that

   ∫_ℝ K(x)dx = 1,  ∫_ℝ xK(x)dx = 0,  and  ∫_ℝ x²K(x)dx > 0.

Two kernel functions we have seen are the boxcar and Gaussian kernels. For the former, we now show that kernel density estimation is very similar to the histogram approach. Recall the boxcar kernel K(x) = (1/2) 1{|x| ≤ 1}. Thus, using the boxcar kernel, our kernel density estimator is

   f̂(x) = (1/(nh)) Σ_{i=1}^n (1/2) 1{|x − xᵢ|/h ≤ 1}. (10.2)

Define Bₓ = {i : |xᵢ − x| ≤ h} and let |Bₓ| be the cardinality of this set, i.e., the number of points in Bₓ. Then we can write

   f̂(x) = (1/(nh)) Σ_{i∈Bₓ} (1/2) 1{|x − xᵢ|/h ≤ 1}
        = (1/(nh)) Σ_{i∈Bₓ} 1/2
        = |Bₓ|/(2nh).

To see the similarity with the histogram algorithm, recall that for the histogram,

   f̂(x) = p̂ⱼ/h = Yⱼ/(nh), (10.3)

for x ∈ Bⱼ. Moreover, note that for the histogram, h corresponds to the bin width, whereas for our boxcar density estimator, h is half of the bin width. The characteristic difference between the approaches is that for kernel density estimators, the bins are not fixed but move with, and are centered at, x.
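The estimator (10.1) is easy to implement directly. The following numpy sketch supports both kernels discussed; the vectorization details are implementation choices, not part of the definition.

    import numpy as np

    def kde(samples, h, kernel="gaussian"):
        """Kernel density estimator of Eq. (10.1), vectorized over queries."""
        x_i = np.asarray(samples)
        if kernel == "gaussian":
            K = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
        else:  # boxcar
            K = lambda u: 0.5 * (np.abs(u) <= 1)
        def f_hat(x):
            u = (np.asarray(x)[..., None] - x_i) / h   # pairwise (x - x_i)/h
            return K(u).mean(axis=-1) / h              # (1/(nh)) sum_i K(.)
        return f_hat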
Of course, we require that our kernel density estimator constitutes a valid density. There are two approaches for verifying that (10.1) coheres with the definition of a probability density function. The first is to check directly that (10.1) integrates to 1:

   ∫_ℝ f̂(x)dx = ∫_ℝ (1/(nh)) Σ_{i=1}^n K((x − xᵢ)/h) dx
              = (1/(nh)) Σ_{i=1}^n ∫_ℝ K((x − xᵢ)/h) dx
              = (1/(nh)) Σ_{i=1}^n ∫_ℝ K(x/h) dx (10.4)
              = (1/(nh)) Σ_{i=1}^n h ∫_ℝ K(z) dz (10.5)
              = (1/(nh)) Σ_{i=1}^n h
              = 1.

To reach (10.4), we used that shifting a function being integrated over the continuum by a constant has no effect on the value of the integral. In (10.5) we made a change of variables.¹
The second approach for sanity checking our definition in (10.1) is to view K(x) as a probability density
function. Examining (10.2) confirms that this is a valid move. Then, we find that fˆ is identical to a density
function as desired. The following theorem formalizes this.
Theorem 10.1. Let ξ ∼ K(x), Z ∼ Unif{x1 , . . . , xn }. Further, define W = Z + ξh. Then, the density
function of W is fˆ.
¹ Let R[a, b] denote the set of functions that are Riemann integrable on [a, b]. Then, let f ∈ R[a, b] and let g be a strictly increasing function from [c, d] onto [a, b] such that g is differentiable on [c, d] and g′ ∈ R[c, d]. Then (f ∘ g) · g′ ∈ R[c, d] and ∫_a^b f(x)dx = ∫_c^d f(g(t))g′(t)dt [Johnsonbaugh and Pfaffenberger, 2010].

Figure 10.1: Example of a Gaussian mixture. In the figure, W is an equally weighted mixture of W₁, ..., W₅, where Wᵢ ∼ N(xᵢ, h²), so f_W = (1/5) Σ_{i=1}^5 f_{Wᵢ}. Per Theorem 10.1, the Gaussian kernel density estimator assumes that the density f̂ is an equally weighted mixture of Gaussians centered on the observations {xᵢ}_{i=1}^n with variance h².

Proof. Let Wᵢ = xᵢ + ξh for each i ∈ {1, . . . , n}. Finding the density of Wᵢ is straightforward. To do this, we first find the distribution function of Wᵢ:

   F_{Wᵢ}(x) = P{Wᵢ ≤ x}
             = P{xᵢ + ξh ≤ x}
             = P{ξ ≤ (x − xᵢ)/h}
             = F_ξ((x − xᵢ)/h).

Then, we differentiate to find the density:

   f_{Wᵢ}(x) = (d/dx) F_{Wᵢ}(x)
             = (d/dx) F_ξ((x − xᵢ)/h)
             = (1/h) f_ξ((x − xᵢ)/h)
             = (1/h) K((x − xᵢ)/h).
Now, since W = Wᵢ with probability 1/n, we can easily find the density of W. We again appeal to distribution functions to show this:

   F_W(x) = P{W ≤ x}
          = (1/n) P{W₁ ≤ x} + · · · + (1/n) P{Wₙ ≤ x}
          = (1/n) F_{W₁}(x) + · · · + (1/n) F_{Wₙ}(x). (10.6)

Differentiating (10.6) gives that

   f_W(x) = (1/n) Σ_{i=1}^n f_{Wᵢ}(x)
          = (1/(nh)) Σ_{i=1}^n K((x − xᵢ)/h)
          = f̂(x).

(If W were not a mixture of random variables but a sum of them, computing its density would be far more
complicated.)
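Theorem 10.1 also gives a practical way to sample from f̂, which the following sketch illustrates (the data and bandwidth are arbitrary stand-ins); a histogram of the draws W should closely match f̂.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.standard_normal(200)        # the observations x_1, ..., x_n
    h = 0.3

    # W = Z + xi*h with Z uniform over the data and xi ~ K (Gaussian kernel)
    n_draws = 10_000
    Z = rng.choice(data, size=n_draws)     # pick a data point uniformly
    xi = rng.standard_normal(n_draws)      # xi ~ K = N(0, 1)
    W = Z + xi * h                         # W has density f_hat by Theorem 10.1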

10.2.2 Integrated risk


Now, we turn to the risk associated with the Gaussian kernel density estimator.

Theorem 10.2. For the Gaussian kernel density estimator, the risk is

   R(f, f̂) = (1/4) σ_k⁴ h⁴ ∫ f″(x)² dx + β_k²/(nh) + O(h⁶) + O(n⁻¹), (10.7)

where σ_k² = ∫ x²K(x)dx and β_k² = ∫ K(x)²dx. The first term in (10.7) is the bias and the second is the variance.

Recall that for the histogram density estimator,

   R(f, f̂) = (h²/12) ∫ f′(x)² dx + 1/(nh) + O(h²) + O(n⁻¹), (10.8)

where again the first term is the bias and the second is the variance.

Comparing (10.7) and (10.8), we see that for the Gaussian kernel density estimator, we want f 00 (x) to be
small rather than f 0 (x). More importantly, for values of h < 1, the bias of the Gaussian kernel density
estimator will be lower than for the histogram estimator. We encounter the usual bias-variance tradeoff
here: increasing h results in more smoothing which boosts bias and depresses variance whereas decreasing h
results in less smoothing which depresses bias and boosts variance.
By minimizing the risk with respect to h, we find the optimal bandwidth

   h* = (β_k² / (σ_k⁴ A(f) n))^{1/5},  where A(f) = ∫ f″(x)² dx. (10.9)

As usual, the optimal bandwidth is inversely dependent on n. Plugging h* into (10.7), we find that as a function of n, R(f, f̂) ∝ O(n^{−4/5}). We observe that this is an improvement over the histogram estimator, where as a function of n, R(f, f̂) ∝ O(n^{−2/3}).
Now, we show that for the boxcar kernel density estimator, the bias scales as claimed in (10.7). From above, f̂(x) = |Bₓ|/(2nh), so

   E[f̂(x)] = (1/(2nh)) E[|Bₓ|]
            = (n/(2nh)) ∫_{x−h}^{x+h} f(u)du
            = (1/(2h)) ∫_{x−h}^{x+h} f(u)du
            = (1/(2h)) ∫_{x−h}^{x+h} (f(x) + (u − x)f′(x) + (1/2)(u − x)²f″(x)) du + higher-order terms (10.10)
            = (1/(2h)) (2hf(x) + f′(x) ∫_{x−h}^{x+h} (u − x)du + f″(x) ∫_{x−h}^{x+h} (1/2)(u − x)² du) + higher-order terms
            = f(x) + O(h²)f″(x).

In (10.10), we have carried out a degree-2 Taylor expansion of f at x.² (Note the middle term vanishes, since ∫_{x−h}^{x+h} (u − x)du = 0.) Thus, we find that the bias at x is

   (f(x) − E[f̂(x)])² = O(h²)² f″(x)² = O(h⁴) f″(x)².

So, the total bias is O(h⁴) ∫ f″(x)², which agrees with (10.7) as desired.

10.2.3 Choosing h empirically


There are two approaches for selecting the optimal bandwidth parameter h∗ for kernel density estimators:
normal references and cross-validation.
Unfortunately, we cannot use (10.9) to directly pick h*. Though we will know β_k², σ_k², and n, we will not know A(f), which depends on the very object we are estimating. The normal reference rule assumes, for the purpose of finding h*, that f ∼ N(μ, τ²). Then, when K is a Gaussian kernel, h* = 1.06 τ n^{−1/5}. In practice, τ is commonly chosen to be min{τ̂, Q/1.34}, where τ̂ is the sample standard deviation and Q is the interquartile range.³ Note that we only make this normality assumption to choose h. If we believed that f were actually normally distributed, we would be better suited with a parametric density estimation approach.
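The normal reference rule is a one-liner in practice. A sketch, using the sample standard deviation τ̂ and interquartile range Q as described above:

    import numpy as np

    def normal_reference_bandwidth(samples):
        """Normal reference rule for a Gaussian kernel:
        h* = 1.06 * tau * n^(-1/5), with tau = min(sample std, IQR / 1.34)."""
        x = np.asarray(samples)
        q75, q25 = np.percentile(x, [75, 25])
        tau = min(x.std(ddof=1), (q75 - q25) / 1.34)
        return 1.06 * tau * len(x) ** (-0.2)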
Next, we examine how to use cross-validation to choose h. Cross-validation is somewhat trickier in the unsupervised setting, since we do not have labels to evaluate our performance against on held-out data. To address this challenge, we rewrite our integrated risk:

   R(f, f̂) = ∫ (f(x) − f̂(x))² dx
            = ∫ f(x)² dx − 2 ∫ f(x)f̂(x) dx + ∫ f̂(x)² dx. (10.11)

In (10.11), the first term is constant with respect to f̂, so we are not concerned about it when choosing f̂. The second term can be rewritten as −2 E_{X∼f}[f̂(X)], and with held-out data x′₁, . . . , x′ₘ ∼ f, we can compute a Monte Carlo estimator of the expectation:

   E_{X∼f}[f̂(X)] ≈ (1/m) Σ_{i=1}^m f̂(x′ᵢ).

If we have insufficient data for a hold-out set, we can use cross-validation. Under leave-one-out and Monte Carlo, we have

   E_{X∼f}[f̂(X)] ≈ (1/n) Σ_{i=1}^n f̂₋ᵢ(xᵢ),

where f̂₋ᵢ denotes the estimator obtained using {x₁, . . . , xᵢ₋₁, xᵢ₊₁, . . . , xₙ}. Finally, the third term of (10.11) can be directly computed. Thus, the leave-one-out cross-validation score is defined as

   Ĵ(f̂) = ∫ f̂(x)² dx − (2/n) Σ_{i=1}^n f̂₋ᵢ(xᵢ).

We would like an efficient way to find the leave-one-out loss. A naive approach to computing Jˆ could be
quite expensive since it would require that we fit fˆ n times. Fortunately, we can do better.
² A real-valued function f is said to be of class Cⁿ on (a, b) if f⁽ⁿ⁾(x) exists and is continuous for all x ∈ (a, b). Define Pₙ(x) = f(c) + f⁽¹⁾(c)(x − c) + · · · + (f⁽ⁿ⁾(c)/n!)(x − c)ⁿ. Let f ∈ Cⁿ⁺¹ on (a, b), and let c and d be any points in (a, b). Then Taylor's Theorem says that there exists a point t between c and d such that f(d) = Pₙ(d) + (f⁽ⁿ⁺¹⁾(t)/(n + 1)!)(d − c)ⁿ⁺¹ [Johnsonbaugh and Pfaffenberger, 2010].
³ For some data, the interquartile range is the data's 75th percentile minus its 25th percentile.

Theorem 10.3. We can compute the leave-one-out cross-validation score for a kernel density estimator f̂ as

   Ĵ(f̂) = (1/(hn²)) Σ_{i=1}^n Σ_{j=1}^n K*((xᵢ − xⱼ)/h) + (2/(nh)) K(0) + O(n⁻²), (10.12)

where K*(x) = ∫ K(x − y)K(y)dy − 2K(x).
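In practice one can also evaluate Ĵ(h) numerically on a grid of bandwidths. The sketch below does this for the Gaussian kernel, computing ∫f̂² in closed form (two Gaussian densities convolve to a Gaussian of doubled variance, a standard fact) and the leave-one-out term directly from its definition, rather than via the K* shortcut of Theorem 10.3; the grid and sample sizes are arbitrary.

    import numpy as np

    def loo_cv_score(samples, h):
        """Leave-one-out CV score J_hat(h) for a Gaussian-kernel KDE,
        computed directly from its definition."""
        x = np.asarray(samples)
        n = len(x)
        D = x[:, None] - x[None, :]                  # pairwise differences
        # integral of f_hat^2: pairwise N(x_i - x_j; 0, 2h^2) terms
        int_f2 = np.exp(-D**2 / (4 * h**2)).sum() / (n**2 * h * 2 * np.sqrt(np.pi))
        # leave-one-out density at each x_i: drop the i = j term K(0)
        Kmat = np.exp(-D**2 / (2 * h**2)) / np.sqrt(2 * np.pi)
        loo = (Kmat.sum(axis=1) - Kmat[0, 0]) / ((n - 1) * h)
        return int_f2 - 2 * loo.mean()

    # pick h minimizing the score over a grid
    x = np.random.default_rng(0).standard_normal(300)
    grid = np.linspace(0.05, 1.0, 40)
    h_star = grid[np.argmin([loo_cv_score(x, h) for h in grid])]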

10.3 Mixture models


10.3.1 Introduction
From Theorem 10.1, we see that kernel density estimators recover the density of a random variable that is a mixture of n distributions of the same form as the kernel, centered on the data points. In this section, we present a more parametric approach by examining mixture models where the number of distributions in the mixture is k < n. We assume that our density is

   f(x) = (1/k) Σ_{i=1}^k fᵢ(x) (10.13)

for some distributions {fᵢ | i ∈ {1, . . . , k}}. There is an equivalent generative specification of this model:

1. Draw i from some distribution over {1, . . . , k}. A simple choice, which we have used in (10.13) and will use going forward, is i ∼ Unif{1, . . . , k}.

2. Draw x ∼ fᵢ.

10.3.2 Gaussian mixtures and model fitting


A popular mixture model is the Gaussian mixture. Under this model, fᵢ = N(μᵢ, Σᵢ) for μᵢ ∈ ℝᵈ, Σᵢ ∈ ℝᵈˣᵈ:

   f(x; μ₁, . . . , μ_k, Σ₁, . . . , Σ_k) = (1/k) Σ_{i=1}^k (1/((2π)^{d/2} |Σᵢ|^{1/2})) exp(−(1/2)(x − μᵢ)^⊤ Σᵢ⁻¹ (x − μᵢ)). (10.14)

This Gaussian mixture model is a fully parametric approach since k is fixed. We can make it less parametric by letting k grow with n in some way. There are three algorithms commonly used for fitting mixtures: maximum likelihood estimation (MLE), the Expectation-Maximization (EM) algorithm, and the method of moments.
Let θ = (μ₁, . . . , μ_k, Σ₁, . . . , Σ_k, z₁, . . . , zₙ), where the zᵢ's are in {1, . . . , k} and denote the Gaussians to which the observations are assigned. MLE amounts to solving the optimization problem

   max_θ (1/n) Σ_{j=1}^n log f(xⱼ; μ_{zⱼ}, Σ_{zⱼ}). (10.15)

This is often impossible to do analytically, so numerical methods are frequently required. EM is beyond the
scope of the class, but it can be applied to fit mixtures under the MLE approach or even the more general
Bayesian framework. The method of moments involves relating model parameters to the moments of random
variables. Recall that for random variables xi , i ∈ {1, . . . , d}, the first moments are

E[xi ] for each i,

the second moments are E[xᵢxⱼ] for each i, j, and the third moments are E[xᵢxⱼx_k] for each i, j, k. We can estimate these using empirical moments. For example, for observations x^(1), . . . , x^(n) in ℝᵈ, the empirical first moment for the i-th dimension of x is

   (1/n) Σ_{j=1}^n xᵢ^(j) ≈ E[xᵢ].

If the moments are functions of the model parameters, we can exploit this to fit our model. For example, in the Gaussian mixture case, E[xᵢ] = (1/k) Σ_{j=1}^k (μⱼ)ᵢ. In general, suppose

   E[xᵢ] = qᵢ(μ, Σ)
   E[xᵢxⱼ] = qᵢⱼ(μ, Σ)
   ...

Then we can construct loss functions to minimize with respect to θ:

   ((1/n) Σ_{j=1}^n xᵢ^(j) − qᵢ(μ, Σ))² (10.16)

   ((1/n) Σ_{l=1}^n xᵢ^(l) xⱼ^(l) − qᵢⱼ(μ, Σ))² (10.17)

   ...
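As a concrete (and simplified) instance of these losses, consider a uniform mixture of k Gaussians with identity covariance, for which qᵢ(μ) = (1/k) Σ_l (μ_l)ᵢ and qᵢⱼ(μ) = Iᵢⱼ + (1/k) Σ_l (μ_l)ᵢ(μ_l)ⱼ. A sketch evaluating the first two moment-matching losses follows; the unit-covariance restriction is an assumption made only to keep the example short.

    import numpy as np

    def moment_losses(X, mus):
        """First- and second-moment losses (Eqs. 10.16-10.17) for a uniform
        mixture of k Gaussians N(mu_l, I). X is (n, d); mus is (k, d)."""
        emp1 = X.mean(axis=0)                     # empirical E[x]
        emp2 = X.T @ X / X.shape[0]               # empirical E[x x^T]
        q1 = mus.mean(axis=0)                     # model first moment
        q2 = np.eye(X.shape[1]) + mus.T @ mus / mus.shape[0]  # model E[x x^T]
        return ((emp1 - q1) ** 2).sum(), ((emp2 - q2) ** 2).sum()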

10.4 Bayesian nonparametric statistics


10.4.1 Review of the Bayesian approach
Under the Bayesian take on statistics, we treat our model parameter θ as a random variable, and we express
our beliefs regarding θ prior to our statistical analysis through a distribution over θ called a prior. Recall that
there are two types of Bayesian statistics: supervised and unsupervised. In supervised statistics, data points
are accompanied by explicit annotations or labels that indicate the correct or desired output associated with
each input. The goal of supervised learning is to generate a model to capture the relationship between input
and output variables. In unsupervised statistics, data points are not accompanied by labels, so the goal is
to discover hidden relationships within the data itself.
In the unsupervised setting, the Bayesian approach assumes the following hierarchical model:

1. Draw θ ∼ p(θ).

2. Draw data x^(1), . . . , x^(n) i.i.d. ∼ p(x | θ).⁴

⁴ Note that in this section, we drop density function subscripts for notational elegance. For example, to be as clear as possible, we would write f_θ(θ) rather than p(θ) and f_{x|θ}(x | θ) rather than p(x | θ); p(θ) and p(x | θ) are not the same function p. "Think of them as living things that look inside their own parentheses before deciding what function to be" [Owen, 2018].

Our goal is to infer the posterior distribution p(θ | x^(1), . . . , x^(n)). To accomplish this, we use Bayes' rule and expand using the chain rule of probability:

   p(θ | x^(1), . . . , x^(n)) = p(θ, x^(1), . . . , x^(n)) / p(x^(1), . . . , x^(n))
                             = p(x^(1), . . . , x^(n) | θ) p(θ) / p(x^(1), . . . , x^(n))
                             = Π_{i=1}^n p(x^(i) | θ) p(θ) / ∫ Π_{i=1}^n p(x^(i) | θ) p(θ) dθ.
  
Now, in the supervised setting, suppose we have a dataset S = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}, where the x^(i)'s are fixed. Then, our generative story takes the following form:

1. Draw θ ∼ p(θ).

2. Draw a label y^(i) ∼ p(y^(i) | x^(i), θ) for each i ∈ {1, . . . , n}.
Given a test example x*, we want to find p(y* | x*, S), where y* denotes the (unknown) label associated with x*. We can do this if we can first infer the posterior p(θ | S). Why? Observe that

   p(y* | x*, S) = ∫ p(y* | x*, θ, S) p(θ | x*, S) dθ
                = ∫ p(y* | θ, x*) p(θ | S) dθ,

where we've used that y* is independent of y^(1), . . . , y^(n) conditional on θ. As in the unsupervised setting, we can use Bayes' rule to find an expression for the posterior that we can work with:

   p(θ | S) = p(θ, S) / p(S)
            = p(S | θ) p(θ) / ∫ p(S | θ) p(θ) dθ
            = p(y^(1), . . . , y^(n) | θ) p(θ) / ∫ p(y^(1), . . . , y^(n) | θ) p(θ) dθ
            = Π_{i=1}^n p(y^(i) | θ, x^(i)) p(θ) / ∫ Π_{i=1}^n p(y^(i) | θ, x^(i)) p(θ) dθ.

In summary, in the unsupervised setting, the generative process draws the parameter θ from its prior distribution and then draws the observed data points from a distribution depending on θ; to infer the posterior, we apply the formula derived from Bayes' rule above. In the supervised setting, we also draw θ from its prior, but in the second step we instead generate a label for each data point from the conditional distribution y^(i) ∼ p(y^(i) | x^(i), θ). Then, for a given test example, we predict the label using the formula involving the posterior p(θ | S).

10.4.2 Bayesian linear regression


Let x^(i) ∈ ℝᵈ, y^(i) ∈ ℝ, θ ∈ ℝᵈ with θ ∼ N(0, τ²I_d). We can write the density of θ explicitly:

   p(θ) = (1/(2πτ²)^{d/2}) exp(−‖θ‖₂²/(2τ²)). (10.18)

We assume that y^(i) = x^(i)⊤θ + ε^(i) where ε^(i) ∼ N(0, σ²). Our generative model then is:

1. Draw θ ∼ N(0, τ²I_d).

2. Draw y^(i) ∼ N(x^(i)⊤θ, σ²) for each i ∈ {1, . . . , n}.
Theorem 10.4. Define the design matrix X ∈ ℝⁿˣᵈ with rows x^(1)⊤, . . . , x^(n)⊤, the label vector ~y = (y^(1), . . . , y^(n))⊤ ∈ ℝⁿ, and

   A = (1/σ²) X⊤X + (1/τ²) I_d.

Then θ | S ∼ N((1/σ²) A⁻¹X⊤~y, A⁻¹), and y* | x*, S ∼ N((1/σ²) x*⊤A⁻¹X⊤~y, x*⊤A⁻¹x* + σ²).
Note that an interesting connection can be made between Bayesian and frequentist approaches here. Recall that in frequentist ridge regression, the goal is to estimate the regression coefficients by minimizing the sum of squared errors plus a regularization term that penalizes large coefficients. The Bayesian approach instead estimates the posterior distribution of the parameters given the observed data and a prior distribution. We can show that, for a particular regularization parameter λ, the mean of the posterior distribution in Bayesian linear regression coincides with the frequentist ridge regression estimate, as follows.

Consider the expression for the mean of the posterior distribution, as given by Theorem 10.4:

   (1/σ²) A⁻¹X⊤~y = (σ²A)⁻¹X⊤~y.

Substituting the definition of A into the previous expression,

   = (X⊤X + (σ²/τ²) I)⁻¹ X⊤~y.

This is precisely the ridge regression estimate when the regularization parameter λ is equal to σ²/τ².
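A numerical sanity check of Theorem 10.4 and of the ridge connection, in numpy (the data-generating choices below are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, sigma, tau = 50, 3, 0.5, 2.0
    X = rng.standard_normal((n, d))
    theta_true = tau * rng.standard_normal(d)
    y = X @ theta_true + sigma * rng.standard_normal(n)

    A = X.T @ X / sigma**2 + np.eye(d) / tau**2
    post_mean = np.linalg.solve(A, X.T @ y) / sigma**2   # (1/sigma^2) A^{-1} X^T y
    post_cov = np.linalg.inv(A)                          # posterior covariance A^{-1}

    # equals ridge regression with lambda = sigma^2 / tau^2
    lam = sigma**2 / tau**2
    ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    assert np.allclose(post_mean, ridge)

    # posterior predictive at a test point x_star
    x_star = rng.standard_normal(d)
    pred_mean = x_star @ post_mean
    pred_var = x_star @ post_cov @ x_star + sigma**2     # at least sigma^2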
Now, let’s perform a sanity check Theorem 10.4. We can rewrite A as
n
1 X (i) (i) > 1
x x + Id . (10.19)
σ 2 i=1 τ2
| {z } | {z }
influence of data influence of prior

First, as n → ∞, the first term in (10.19) dominates the second term. As we would hope, as the size of our
dataset grows the influence of the prior on the posterior of θ diminishes and, at the limit, vanishes. For this
reason, Bayesian methods are less useful under a large data regime. Second, as τ → ∞, our Gaussian prior
becomes increasingly flat and uninformative; see Figure 8.2. Accordingly, in (10.19), τ is inversely related
to the influence of the prior on the posterior. Third, the variance of our posterior predictive distribution
y ∗ | x∗ , S is at least σ 2 .
Proof. First, we show that A is positive definite. For non-zero z ∈ ℝᵈ,

   z⊤ (1/σ²) X⊤X z = (1/σ²) (Xz)⊤(Xz) = (1/σ²) ⟨Xz, Xz⟩.

Since X is full rank (our predictors cannot be linear combinations of each other), X's null space is trivial and Xz ≠ 0. Because σ² is positive and the squared norm is positive for all non-zero vectors, this quantity is positive,
Figure 10.2: Densities of N(0, τ²) for τ = 1, 2, 3, 5.
and A is positive definite. Then, A⁻¹ is also positive definite since its eigenvalues are the reciprocals of A's eigenvalues.⁵ Thus,

   x*⊤A⁻¹x* + σ² ≥ σ².

As expected, the lower bound on the uncertainty of our predictions is σ², the uncertainty intrinsic to the problem. As our dataset grows, the number of observations n tends toward infinity, causing A to grow as well. Consequently, when n → ∞, x*⊤A⁻¹x* → 0, and we converge to the lower bound of uncertainty, σ².

⁵ For any invertible matrix M, the eigenvalues of M⁻¹ are the reciprocals of the eigenvalues of M. A matrix is positive definite if and only if all of its eigenvalues are positive [Axler, 2014].

Chapter 11

Parametric/nonparametric Bayesian
methods and Gaussian process

11.1 Review and overview


In the last lecture, we introduced the parametric Bayesian linear regression model, with a theorem on the distributions of θ|S and y*|x*, S. The model assumes that θ follows the normal distribution N(0, τ²I) and that the relationship between the observed data points and the labels is given by y^(i) = x^(i)⊤θ + ε^(i), where ε^(i) ∼ N(0, σ²). Our goal is to predict y*|x*, S, where S represents the set of all data points {(x^(1), y^(1)), ..., (x^(n), y^(n))}. In this set, the x^(i)'s are deterministic variables and the y^(i)'s are random. It is important to note that we are always conditioning on x*, even if y*|x*, S is sometimes written as y*|S. Thus, we make the following claim:
Theorem 11.1. Based on our assumptions, it follows that

   θ|S ∼ N((1/σ²) A⁻¹X⊤→y, A⁻¹), (11.1)

   y*|x*, S ∼ N((1/σ²) x*⊤A⁻¹X⊤→y, x*⊤A⁻¹x* + σ²), (11.2)

where X ∈ ℝⁿˣᵈ is the matrix whose rows x^(1)⊤, . . . , x^(n)⊤ collect all data points, →y = (y^(1), . . . , y^(n))⊤ ∈ ℝⁿ collects all labels, A = (1/σ²) X⊤X + (1/τ²) I, and →ε = (ε^(1), . . . , ε^(n))⊤ collects the noise.
In this class, we will prove the second claim and discuss nonparametric Bayesian methods, such as the
Gaussian process.

11.2 Proof of the Eq. (11.2) in Theorem 11.1


Extending our previous discussion of interpretations of the theorem, we will prove it using an approach that is also useful in the nonparametric scenario. Although our main focus is on proving the parametric case, it will provide valuable insights for the nonparametric case as well. We note that we can directly prove the second claim without having to prove the first; this is advantageous since proving the first claim becomes more challenging in the nonparametric context.
In our approach, we make the assumption that the posterior, conditional, and prior distributions are all
Gaussian. By considering a set of jointly Gaussian-distributed random variables, we can leverage Lemma

11.2 to determine the posterior distribution of one variable conditioned on the other. This lemma provides a
valuable analytical formula for computing the desired posterior distribution when we have a joint Gaussian
distribution. We will present a proof of this lemma at a later stage to establish its validity.
Lemma 11.2. Suppose

   [x_A; x_B] ∼ N([μ_A; μ_B], [Σ_AA Σ_AB; Σ_BA Σ_BB]),

where Σ_AA is the covariance matrix of x_A and Σ_AB is the covariance between x_A and x_B. Then,

   x_B | x_A ∼ N(μ_B + Σ_BA Σ_AA⁻¹ (x_A − μ_A), Σ_BB − Σ_BA Σ_AA⁻¹ Σ_AB). (11.3)

We see that x_B | x_A is a function of x_A, and thus the conditional mean depends on x_A (the conditional covariance does not). By symmetry, we know that

   x_A | x_B ∼ N(μ_A + Σ_AB Σ_BB⁻¹ (x_B − μ_B), Σ_AA − Σ_AB Σ_BB⁻¹ Σ_BA). (11.4)

11.2.1 Proof using Lemma 11.2


First, we will prove the second claim (11.2) assuming the lemma holds. Since x* is fixed globally, S contains the x^(i)'s and y^(i)'s, and all the x^(i)'s are deterministic, the distribution y*|x*, S is equivalent to y*|y^(1), ..., y^(n) (with each y^(i) ∈ ℝ). Each of these quantities is a Gaussian random variable, so their joint distribution is also Gaussian, satisfying the conditions required to apply the lemma.

The first Gaussian random variable is the vector x_A = →y = (y^(1), . . . , y^(n))⊤ ∈ ℝⁿ; the second is the scalar x_B = y* ∈ ℝ. It is easier to compute the joint distribution than the conditional distribution directly, so we first compute the joint distribution

   [→y; y*] = [x_A; x_B] ∼ N([μ_A; μ_B], [Σ_AA Σ_AB; Σ_BA Σ_BB]).

Using the definition of mean and covariance and the fact that θ is random with θ ∼ N(0, τ²I), we have

   μ_A = E[→y] = E[Xθ + →ε] = X E[θ] + E[→ε] = 0 + 0 = 0 ∈ ℝⁿ.

Similarly, we can obtain μ_B = E[y*] = E[x*⊤θ + ε*] = 0 ∈ ℝ (ε* is a scalar). Since θ ∼ N(0, τ²I), the covariance of →y is

   Σ_AA = E[(→y − μ_A)(→y − μ_A)⊤]
        = E[→y →y⊤]
        = E[(Xθ + →ε)(Xθ + →ε)⊤]
        = E[Xθθ⊤X⊤ + Xθ→ε⊤ + →ε θ⊤X⊤ + →ε →ε⊤]
        = X E[θθ⊤] X⊤ + X E[θ] E[→ε⊤] + E[→ε] E[θ⊤] X⊤ + E[→ε →ε⊤]    (θ and →ε are independent)
        = X τ²I X⊤ + 0 + 0 + σ²I
        = τ²XX⊤ + σ²I.

Noting that x*⊤θ is a scalar (and thus equals its own transpose) and that ε* is a scalar, we have

   Σ_AB = E[(→y − μ_A)(y* − μ_B)]
        = E[→y y*]
        = E[(Xθ + →ε)(x*⊤θ + ε*)]
        = E[Xθ(x*⊤θ)] + X E[θ] E[ε*] + E[→ε θ⊤] x* + E[→ε] E[ε*]
        = E[Xθθ⊤x*] + 0 + 0 + 0
        = X E[θθ⊤] x*
        = X τ²I x*
        = τ²Xx*,

and thus

   Σ_BA = Σ_AB⊤ = τ²x*⊤X⊤.

Finally, we can compute

   Σ_BB = E[(y* − μ_B)²]
        = E[(y*)²]
        = E[(x*⊤θ + ε*)(x*⊤θ + ε*)]
        = E[(x*⊤θ)(x*⊤θ)] + E[(ε*)²]
        = E[x*⊤θθ⊤x*] + σ²
        = τ² x*⊤ I x* + σ²
        = τ²‖x*‖₂² + σ².

Putting it all together and invoking Lemma 11.2,

   x_B | x_A ∼ N(μ_B + Σ_BA Σ_AA⁻¹ (x_A − μ_A), Σ_BB − Σ_BA Σ_AA⁻¹ Σ_AB),

yields the posterior mean

   μ_B + Σ_BA Σ_AA⁻¹ (x_A − μ_A) = 0 + τ²x*⊤X⊤(τ²XX⊤ + σ²I)⁻¹(x_A − 0) = τ²x*⊤X⊤(τ²XX⊤ + σ²I)⁻¹ →y,

and our goal is then to show

   τ²x*⊤X⊤(τ²XX⊤ + σ²I)⁻¹ →y = (1/σ²) x*⊤A⁻¹X⊤ →y. (11.5)

11.2.2 Proof of Eq. (11.5) using singular value decomposition


With A = (1/σ²) X⊤X + (1/τ²) I, we will show that

   τ² x*⊤X⊤(τ²XX⊤ + σ²I)⁻¹ = (1/σ²) x*⊤A⁻¹X⊤,

using the singular value decomposition (SVD). First, we briefly summarize why the SVD is useful in our case.

Consider a matrix M ∈ ℝⁿˣᵐ where n ≥ m. It can be decomposed as M = UΣV⊤, where U ∈ ℝⁿˣᵐ, Σ ∈ ℝᵐˣᵐ, V ∈ ℝᵐˣᵐ. Note that:

• Σ is diagonal.

• U is column-wise orthonormal, meaning that every column of U has norm 1 and the columns are orthogonal to each other. This means that U⊤U = I_{m×m}, where entry (i, j) is the inner product of the i-th and j-th columns of U.

• V has orthonormal columns; similar to U, V⊤V = I.

• The columns of U are a basis of the column span of M; similarly, the rows of V⊤ are a basis of the row span of M.
What we want to show is that

   τ² x*⊤X⊤(τ²XX⊤ + σ²I)⁻¹ →y = x*⊤X⊤(XX⊤ + (σ²/τ²)I)⁻¹ →y
                              = (1/σ²) x*⊤X⊤(XX⊤/σ² + I/τ²)⁻¹ →y.

Thus, we would like to prove

   X⊤(XX⊤/σ² + I/τ²)⁻¹ = (X⊤X/σ² + I/τ²)⁻¹ X⊤,

which is not obviously true with X being a matrix. Since the SVD allows us to extend U and V to bigger orthonormal matrices, the claim can be proved as follows. Take the full SVD

   X = UΣV⊤,  with Σ = diag(r₁, . . . , r_d) padded with n − d rows of zeros,
where X ∈ ℝⁿˣᵈ, n ≥ d, U ∈ ℝⁿˣⁿ is orthogonal, and V ∈ ℝᵈˣᵈ is orthogonal. Given that U is orthogonal, we have

   XX⊤/σ² + I/τ² = UΣV⊤VΣ⊤U⊤/σ² + UU⊤/τ²
                 = U diag(r₁²/σ² + 1/τ², . . . , r_d²/σ² + 1/τ², 1/τ², . . . , 1/τ²) U⊤,

and therefore

   (XX⊤/σ² + I/τ²)⁻¹ = U diag((r₁²/σ² + 1/τ²)⁻¹, . . . , (r_d²/σ² + 1/τ²)⁻¹, τ², . . . , τ²) U⊤,

since (UDU⊤)⁻¹ = UD⁻¹U⊤ for a diagonal D, as

   UDU⊤UD⁻¹U⊤ = UDD⁻¹U⊤ = UU⊤ = I. (11.6)
We can then further show that

   X⊤(XX⊤/σ² + I/τ²)⁻¹
   = VΣ⊤U⊤ · U diag((r₁²/σ² + 1/τ²)⁻¹, . . . , (r_d²/σ² + 1/τ²)⁻¹, τ², . . . , τ²) U⊤
   = V [ diag(r₁/(r₁²/σ² + 1/τ²), . . . , r_d/(r_d²/σ² + 1/τ²)) | 0 ] U⊤,

where Σ⊤ picks out only the first d diagonal entries. Expanding the right-hand side (X⊤X/σ² + I/τ²)⁻¹X⊤ in the same way, we arrive at the same quantity. As we have shown, the SVD reduces everything to diagonal matrices, which makes it a very useful trick in these problem settings.
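The identity this SVD argument establishes, X⊤(XX⊤/σ² + I/τ²)⁻¹ = (X⊤X/σ² + I/τ²)⁻¹X⊤, is also easy to sanity-check numerically (the dimensions and constants below are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, sigma, tau = 8, 3, 0.7, 1.5
    X = rng.standard_normal((n, d))

    lhs = X.T @ np.linalg.inv(X @ X.T / sigma**2 + np.eye(n) / tau**2)
    rhs = np.linalg.inv(X.T @ X / sigma**2 + np.eye(d) / tau**2) @ X.T
    assert np.allclose(lhs, rhs)   # the push-through identity proved above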

Back to our original proof, we can first show that

   μ_B + Σ_BA Σ_AA⁻¹ (x_A − μ_A) = (1/σ²) x*⊤X⊤(XX⊤/σ² + I/τ²)⁻¹ →y
                                 = (1/σ²) x*⊤(X⊤X/σ² + I/τ²)⁻¹ X⊤→y (11.7)
                                 = (1/σ²) x*⊤A⁻¹X⊤→y.

Then, for the covariance:

   Σ_BB − Σ_BA Σ_AA⁻¹ Σ_AB
   = τ²‖x*‖₂² + σ² − τ²x*⊤X⊤(τ²XX⊤ + σ²I)⁻¹ τ²Xx*
   = τ²‖x*‖₂² + σ² − (τ²/σ²) x*⊤X⊤(XX⊤/σ² + I/τ²)⁻¹ Xx*
   = τ²‖x*‖₂² + σ² − (τ²/σ²) x*⊤(X⊤X/σ² + I/τ²)⁻¹ X⊤Xx* (11.8)
   = τ²‖x*‖₂² + σ² − τ² x*⊤A⁻¹(A − I/τ²)x*    (since X⊤X/σ² = A − I/τ²)
   = τ²‖x*‖₂² + σ² − τ²x*⊤x* + x*⊤A⁻¹x*
   = σ² + x*⊤A⁻¹x*.

11.3 Nonparametric Bayesian regression


The setup of nonparametric Bayesian regression is the following. Suppose that we have S = {(x^(i), y^(i))}_{i=1}^n, where high-dimensional data is allowed. We assume that y^(i) = f(x^(i)) + ε^(i), where f can be non-linear and the noise satisfies ε^(i) ∼ N(0, σ²) for all i.

First, we revisit the frequentist approach, which is the kernel method. Assume that f(x^(i)) = θ⊤φ(x^(i)), where φ : ℝᵈ → ℝᵐ is a fixed feature map (m could possibly be infinite). As we have discussed in previous lectures, the computational efficiency here depends on the inner product of the features.

Second, if we extend the above to the Bayesian approach, we place a prior on f. Assume that f ∼ P(f), a distribution over non-linear functions. Given a test example x*, compute P(y*|x*, S) = ∫ P(y*|f, x*) P(f|S) df, where P(f|S) is the posterior of f given S.

Here, the challenge is to define the prior. There are two approaches; while we will emphasize the second one, which is more convenient, the two methods are actually equivalent.

11.3.1 Approach 1: frequentist approach


The first approach defines the prior as follows:

   f ∼ P(f)  ⟺  f(x) = θ⊤φ(x) = Σ_{i=1}^m θᵢφᵢ(x),  with θᵢ ∼ N(0, τ²) i.i.d. for i = 1, ..., m.

This reduces to a Bayesian linear regression in the feature space: our input is now φ(x) rather than x. Moreover, φ(x) needs to be powerful enough that θ⊤φ(x) can cover a large family of functions f (or even all functions).

Then, with θ possibly infinite-dimensional, we need the kernel trick for computational efficiency: choose φ such that ⟨φ(x), φ(z)⟩ = K(x, z) is computationally efficient. The Bayesian linear regression algorithm then has to be rewritten to use only calls to the kernel function K(x, z). Thus, this method is somewhat complicated.

11.3.2 Approach 2: Gaussian process
The second approach, the Gaussian process, takes a cleaner and more fundamental viewpoint, though conceptually it requires more work.

As a warm-up, assume that the input space is finite, and our goal is to design a prior over functions with this finite input space. After this, it takes only a slight leap of faith to extend the construction to the infinite case.

Consider F = {all functions mapping X → ℝ} where X = {t₁, ..., tₘ}. We want to design a prior over F. To describe a function f ∈ F, we only need to specify its values on the finite set of inputs. In other words, we can represent f by the vector

   →f = (f(t₁), . . . , f(tₘ))⊤ ∈ ℝᵐ.

In this case, designing a prior over the function space is the same as designing a prior over the m-
dimensional vector; the latter is much easier.


Consider a Gaussian prior on f (or, equivalently, on →f):

   →f ∼ N(μ, Σ),  μ ∈ ℝᵐ, Σ ∈ ℝᵐˣᵐ.

The prior, i.e., the density of the function, is

   P(f) = P(→f) = (1/((√(2π))ᵐ √det(Σ))) exp(−(1/2)(→f − μ)⊤Σ⁻¹(→f − μ)).


The key takeaway is that the distribution of f is equivalent to the distribution of the vector →f. Then, what if X is infinite (m = ∞)? A straightforward extension is the following:

   μ ∈ ℝᵐ  →  μ(·),
   Σ ∈ ℝᵐˣᵐ  →  k(·, ·),

where μ(·) is a function over X and k(·, ·) is a function over X × X.


Now, we’re tempted to say that f ∼ N (µ, k). To formalize the method, we define a stochastic process
as a collection of random variables {f (x) : x ∈ X } indexed by elements in X . These random variable have
correlation with each other, just like different entries of a vector.
A Gaussian process is a stochastic process such that for any finite number of variables t1 , ..., tm ∈ X ,
(f (t1 ), ..., f (tm )) has Gaussian distribution. We will design a prior that has this property. Returning to
definition, suppose {f (x) : x ∈ X } is a Gaussian process, then let

µ(x) = E[f (x)]

be the mean function and


k(x, z) = E[(f (x) − µ(x))(f (z) − µ(z))]
be the covariance function. Formally,
f ∼ GP (µ(·), k(·, ·)). (11.9)
Note that the Gaussian process has the following interesting properties.

1. μ and k uniquely describe a Gaussian process f ∼ GP(μ(·), k(·, ·)). For the Gaussian random vector W = (f(x₁), . . . , f(xₙ))⊤, we have

   W ∼ N((μ(x₁), . . . , μ(xₙ))⊤, [k(xᵢ, xⱼ)]_{i,j=1,...,n}).

2. If a Gaussian process has mean µ and covariance function k(·, ·), then k(·, ·) is a valid kernel function.
In other words, there exists φ such that k(x, z) = φ(x)⊤φ(z) = ⟨φ(x), φ(z)⟩.

Proof. In a Gaussian process, for all x₁, ..., xₙ, the matrix K = [k(xᵢ, xⱼ)]_{i,j=1,...,n} is the covariance of (f(x₁), ..., f(xₙ))⊤, so K ⪰ 0 for all x₁, ..., xₙ. By Mercer's Theorem, k(·, ·) is then a valid kernel function. (Please refer to Definition 2.1, p. 12, for the properties of a valid kernel function.)

3. Vice versa, if k(·, ·) is a valid kernel function, then there exists φ such that k(x, z) = φ(x)⊤φ(z). For simplicity, assume φ(x) ∈ ℝᵐ. Let f(x) = θ⊤φ(x) where θᵢ ∼ N(0, 1); then f ∼ GP(0, k), since

   Cov(f(xᵢ), f(xⱼ)) = E[f(xᵢ)f(xⱼ)]
                     = E[θ⊤φ(xᵢ) θ⊤φ(xⱼ)]
                     = E[φ(xᵢ)⊤θθ⊤φ(xⱼ)]
                     = φ(xᵢ)⊤E[θθ⊤]φ(xⱼ)
                     = φ(xᵢ)⊤Iφ(xⱼ)
                     = φ(xᵢ)⊤φ(xⱼ)
                     = k(xᵢ, xⱼ).
Thus, GP(μ, k) is a properly-defined Gaussian process if and only if k(·, ·) is a valid kernel function.

What we have done is go from any choice of kernel function k(·, ·), to defining GP(μ, k), to obtaining the prior f ∼ GP(μ(·), k(·, ·)). Typically, μ(·) is chosen to be the zero function. k(·, ·) can be any common kernel function; the most popular is the squared exponential kernel, also known as the Gaussian or RBF kernel:

   k_SE(x, z) = exp(−‖x − z‖₂²/(2τ²)).

Figure 11.1: k_SE(x, z) as a function of ‖x − z‖.

Qualitatively, what does f ∼ GP(μ, k_SE) look like?

• f(x) and f(z) have high correlation if x is close to z, because exp(−‖x − z‖₂²/(2τ²)) ≈ exp(0) ≈ 1.

• f(x) and f(z) have low correlation if they are far apart, because when ‖x − z‖ is big, exp(−‖x − z‖₂²/(2τ²)) ≈ 0.

• The parameter τ controls smoothing. If τ is very big, then even faraway points have strong correlations, meaning that there is strong smoothing and a flatter curve. If τ is very small, there is weak smoothing, leading to a higher tendency for fluctuations.
To summarize, GP(μ(·), k(·, ·)) is the distribution over functions satisfying the following properties:

1. f(x) is Gaussian for every x;

2. (f(x₁), ..., f(xₙ)) is jointly Gaussian;

3. The covariance between f(x) and f(z) is k(x, z).

11.3.3 Bayesian prediction


Our next challenge is to compute y*|x*, S. Following the parametric case, P(y*|x*, S) should be defined as ∫ P(y*|x*, f) P(f|S) df, which is not really defined when f is a function instead of a vector.

Our plan is to directly compute P(y*|x*, S). In practice we will test on multiple test points x*^(1), ..., x*^(m), and we will compute the posterior of y*^(1), ..., y*^(m) | x*^(1), ..., x*^(m), S; more symmetry makes the math cleaner. As S represents the set of all data points {(x^(i), y^(i))}_{i=1,...,n} and the x^(i)'s and x*^(i)'s are deterministic, the posterior we need to compute becomes y*^(1), ..., y*^(m) | y^(1), ..., y^(n).
We will reuse the lemma about the conditional distribution of Gaussians (Lemma 11.2): if

   [x_A; x_B] ∼ N([μ_A; μ_B], [Σ_AA Σ_AB; Σ_BA Σ_BB]),

then

   x_B | x_A ∼ N(μ_B + Σ_BA Σ_AA⁻¹ (x_A − μ_A), Σ_BB − Σ_BA Σ_AA⁻¹ Σ_AB).

Consider x_A → (y^(1), ..., y^(n)), the training observations, and x_B → (y*^(1), ..., y*^(m)), the labels to predict. Then our target y*^(1), ..., y*^(m) | y^(1), ..., y^(n) is equivalent to x_B | x_A. To apply the lemma, we need the joint distribution of

   →f = (f(x^(1)), . . . , f(x^(n)))⊤ ∈ ℝⁿ,
   →f* = (f(x*^(1)), . . . , f(x*^(m)))⊤ ∈ ℝᵐ,
   →y = →f + →ε ∈ ℝⁿ,  with →ε = (ε^(1), . . . , ε^(n))⊤,
   →y* = →f* + →ε* ∈ ℝᵐ,  with →ε* = (ε*^(1), . . . , ε*^(m))⊤.

By the GP prior,

   →f ∼ N(0, K(X, X)),
where K(X, X) = [K(x^(i), x^(j))]_{i,j=1,...,n}. By defining

   K(X, X*) := [K(x^(i), x*^(j))]_{i=1,...,n; j=1,...,m},
   K(X*, X) := [K(x*^(i), x^(j))]_{i=1,...,m; j=1,...,n},
   K(X*, X*) := [K(x*^(i), x*^(j))]_{i,j=1,...,m},

we have

   →f ∼ N(0, K(X, X)),
   →f* ∼ N(0, K(X*, X*)),
   [→f; →f*] ∼ N(0, [K(X, X) K(X, X*); K(X*, X) K(X*, X*)]),

and, adding the independent noise,

   →y = →f + →ε ∼ N(0, K(X, X) + σ²I),
   →y* = →f* + →ε* ∼ N(0, K(X*, X*) + σ²I).
We can also compute the cross-covariance:

   E[→y →y*⊤] = E[(→f + →ε)(→f* + →ε*)⊤]
             = E[→f →f*⊤] + E[→ε →f*⊤] + E[→f →ε*⊤] + E[→ε →ε*⊤]
             = K(X, X*),

since the noise terms are independent of the function values and of each other.
In summary, we have

   [→y; →y*] ∼ N(0, [K(X, X) + σ²I, K(X, X*); K(X*, X), K(X*, X*) + σ²I]) = N([μ_A; μ_B], [Σ_AA Σ_AB; Σ_BA Σ_BB]).

Applying the conditional Gaussian distribution lemma by matching the corresponding blocks (for example, K(X, X*) = Σ_AB) in the equation above,

   μ_B + Σ_BA Σ_AA⁻¹ (x_A − μ_A) = K(X*, X)(K(X, X) + σ²I)⁻¹ →y,
   Σ_BB − Σ_BA Σ_AA⁻¹ Σ_AB = K(X*, X*) + σ²I − K(X*, X)(K(X, X) + σ²I)⁻¹ K(X, X*),

and thus

   →y* | →y ∼ N(μ*, Σ*),
   μ* = K(X*, X)(K(X, X) + σ²I)⁻¹ →y, (11.10)
   Σ* = K(X*, X*) + σ²I − K(X*, X)(K(X, X) + σ²I)⁻¹ K(X, X*).

Interestingly, not only do we have the prediction μ*, but we also have Σ*, which gives uncertainty quantification. Where we have few data points, the confidence interval is large, whereas the confidence interval is smaller where there are multiple data points. A simple example of using the Gaussian process for uncertainty quantification based on only five points is shown below.

Figure 11.2: A simple example of uncertainty quantification with a Gaussian process.
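Equation (11.10) translates directly into code. Here is a minimal numpy sketch with the squared exponential kernel; the five observed points and the sine labels are arbitrary stand-ins chosen to mimic Figure 11.2, and the diagonal of Σ* supplies the error bars.

    import numpy as np

    def gp_posterior(X, y, X_star, tau=1.0, sigma=0.1):
        """GP regression posterior, per Eq. (11.10), with the SE kernel."""
        def k(A, B):
            sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-sq / (2 * tau**2))
        Kxx = k(X, X) + sigma**2 * np.eye(len(X))       # K(X, X) + sigma^2 I
        Ksx = k(X_star, X)                              # K(X*, X)
        mu_star = Ksx @ np.linalg.solve(Kxx, y)
        Sigma_star = (k(X_star, X_star) + sigma**2 * np.eye(len(X_star))
                      - Ksx @ np.linalg.solve(Kxx, Ksx.T))
        return mu_star, Sigma_star   # diag of Sigma_star gives the error bars

    # five observed points, as in Figure 11.2
    X = np.array([[-2.0], [-1.0], [0.0], [1.5], [2.5]])
    y = np.sin(X[:, 0])
    mu, Sig = gp_posterior(X, y, np.linspace(-3, 3, 100)[:, None])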

Chapter 12

Dirichlet process

12.1 Review and overview


Last lecture, we discussed the Gaussian process, which is a Bayesian approach to supervised, nonparametric
problems. It can be thought of as a generalization of the mixture of Gaussians model, with an infinite number
of Gaussian distributions. Today, we will discuss the Dirichlet process. As with the Gaussian process, our
discussion of the Dirichlet process will begin with simpler, parametric mixture models, which we will then
build on to understand the more complex Dirichlet process.
Unlike the Gaussian process, the Dirichlet process is used in an unsupervised setting, to model a distri-
bution over some variable X, rather than modeling a conditional distribution of Y | X. We’ll first review
parametric mixture models, which are one way to model a probability distribution. Then, we’ll discuss
how to extend these models to a Bayesian setting, by establishing a prior over the parameters (this will
require a tangent to define the Dirichlet distribution). Then, we’ll discuss topic modeling, a popular type of
unsupervised machine learning model, as an entry point into the Dirichlet process. Finally, we will define
the Dirichlet process itself, which can be thought of as a topic model with infinite topics.

12.2 Parametric mixture models & extension to Bayesian setting


12.2.1 Review: Gaussian mixture model
The “mixture of k Gaussians” is a probability distribution with the following generative “story” for how a
sample Xi is generated:
1. First, one of k “sources”, zi , is chosen from a discrete (or categorical) distribution π (which can be
thought of as simply a non-negative, k-dimensional vector whose components sum to 1).
2. Then, Xi is sampled from a Gaussian distribution whose mean and covariance are conditional on the
choice of zi , i.e. Xi | zi ∼ N (µzi , Σzi )
The result is called a mixture of Gaussians because it is as if you combined k Gaussian distributions
(each with some weight πk ) into one distribution. You can read more about this model in the previous set
of lecture notes.

12.2.2 Dirichlet distribution


In order to extend the mixture of Gaussians to a Bayesian setting, we have to establish priors over the
parameters. There are many ways to get priors over µzi and Σzi , and we won’t go into detail on that (a
unit normal distribution and a chi-squared distribution, respectively, is one example). However, the choice of parameters for π is special, as it is constrained: we must have Σ_{i=1}^k πᵢ = 1, so not just any choice of π₁, . . . , π_k is a valid probability distribution. Our distribution over π should only have probability mass on
valid choices of π. The Dirichlet distribution (denoted Dir for short in mathematical equations), is a natural
choice.
In the Bayesian setting, we are interested in studying distributions over the parameters of another dis-
tribution (which are themselves random variables). A simpler example of this idea is the Beta distribution,
which is a probability distribution over the parameter p for a binomial random variable (i.e. a coin flip). The
Beta distribution represents the probability distribution over the “true” probability of the coin coming up
heads. A Dirichlet distribution is simply a generalization of the Beta distribution to an experiment with more
than two outcomes, e.g. a dice roll. So, the Dirichlet distribution could be used to model the distribution
over the parameters π1 , π2 , π3 , π4 , π5 , π6 representing the probability each side of the die has of being rolled.
The Dirichlet distribution is parametrized by α1 , . . . , αk , which, as with the Beta distribution, can be
interpreted as “pseudocounts”, i.e. a larger αi will result in a distribution where larger values of πi have
more density; and moreover, if their relative magnitudes are all held fixed, larger parameters denote more
“confidence”, resulting in a less uniform distribution. (As a simple example, if you've rolled each number on a die one time, you'd guess that all sides are equally likely, but with low confidence. If you've rolled each number on a die 1000 times, you'd guess they're all equally likely, with high confidence.)
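This confidence intuition is easy to check numerically. Below is a minimal sketch (with made-up α values) comparing a Dirichlet prior with one pseudo-roll per face of a die to one with 1000 pseudo-rolls per face; both expect a fair die, but the latter is far more concentrated.

import numpy as np

rng = np.random.default_rng(0)
weak = rng.dirichlet(np.ones(6), size=10000)             # alpha = (1, ..., 1)
strong = rng.dirichlet(1000.0 * np.ones(6), size=10000)  # alpha = (1000, ..., 1000)

print(weak.mean(axis=0), strong.mean(axis=0))  # both are about 1/6 per face
print(weak.std(axis=0), strong.std(axis=0))    # but the second spread is far smaller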
The Dirichlet distribution has a few important properties and related intuitions, some of which will be
important for our later discussion of the Dirichlet process:

1. The PDF of the Dirichlet distribution over a K-dimensional vector $\vec{\pi}$ is

   $$p(\vec{\pi}) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} \pi_i^{\alpha_i - 1},$$

   ...where Γ is the Gamma function (too complicated to explain here).


2. $\mathbb{E}[\pi_i] = \frac{\alpha_i}{\sum_{j=1}^{K} \alpha_j}$, which means that the relative magnitudes of the α's determine the expected relative magnitudes of the components of π. (The pseudocounts interpretation helps here: a larger αi is a larger pseudocount, as if you've already observed that event more often, so that should make the parameter for the probability of that event larger.)
3. $\sum_{i=1}^{K} \alpha_i$ controls how “sharp” the distribution is. Again, by the pseudocounts logic, having “seen” (or hallucinated) more data will make us more confident in what we think the distribution is, so this makes sense intuitively.
4. Relationship to the Gamma distribution: If $\eta_k \sim \text{Gamma}(\alpha_k, 1)$ independently for $k \in \{1, \dots, K\}$, and $\pi_i = \eta_i / \sum_{j=1}^{K} \eta_j$, then $(\pi_1, \dots, \pi_K) \sim \text{Dir}(\alpha_1, \dots, \alpha_K)$.

5. Merging rule: If (π1 , . . . , πK ) ∼ Dir(α1 , . . . , αK ), then we can “merge” π’s by summing them. Doing
so will create a new Dirichlet distribution with fewer components, parametrized by new α’s obtained
by summing the αj ’s corresponding to the πj ’s that were combined. For example:

(π1 + π2 , π3 + π4 , . . .) ∼ Dir(α1 + α2 , α3 + α4 , . . .)

6. Expanding rule: the reverse of the merging rule; you can also obtain a new Dirichlet distribution from an existing one by “splitting” components; for example:

(π1 θ, π1 (1 − θ), π2 , . . . , πK ) ∼ Dir(α1 b, α1 (1 − b), α2 , . . . , αK )

...where θ ∼ Beta(α1 b, α1 (1 − b)) for 0 < b < 1.


7. Renormalizing property: Given (π1 , . . . , πK ) ∼ Dir(α1 , . . . , αK ), then if we discard one πi and its
associated αi , and renormalize, we get another Dirichlet distribution with parameters α1 , . . . , αi−1 , αi+1 , . . . , αK .
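Several of these properties are easy to verify empirically. Here is a short sketch that samples a made-up Dir(2, 3, 1, 4) via the Gamma construction (property 4), then checks the expectation (property 2) and the merging rule (property 5).

import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 1.0, 4.0])
n = 100000

# Property 4: normalize independent Gamma(alpha_k, 1) draws.
eta = rng.gamma(shape=alpha, scale=1.0, size=(n, 4))
pi = eta / eta.sum(axis=1, keepdims=True)   # each row ~ Dir(2, 3, 1, 4)

# Property 2: E[pi_i] = alpha_i / sum_j alpha_j = (0.2, 0.3, 0.1, 0.4).
print(pi.mean(axis=0))

# Property 5 (merging): (pi_1 + pi_2, pi_3 + pi_4) should be ~ Dir(5, 5).
merged = np.column_stack([pi[:, 0] + pi[:, 1], pi[:, 2] + pi[:, 3]])
direct = rng.dirichlet([5.0, 5.0], size=n)
print(merged.mean(axis=0), direct.mean(axis=0))  # both about (0.5, 0.5)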

12.2.3 Bayesian Gaussian mixture model


We now have the tools to extend the mixture of Gaussians to the Bayesian setting. The only change from the frequentist version of the generative story is that the parameters themselves (the µ's, Σ's, and π) are first drawn from a prior distribution; then the latent variables zi are drawn from zi | π ∼ Categorical(π1, . . . , πk); and finally Xi | µ, Σ, zi ∼ N(µzi, Σzi).
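As a sketch of this Bayesian generative story (with illustrative priors: Dir(1, . . . , 1) for π, a unit normal for the means, and variances fixed at 1 to keep it short):

import numpy as np

rng = np.random.default_rng(1)
k, n = 3, 500

pi = rng.dirichlet(np.ones(k))       # pi ~ Dir(1, ..., 1)
mu = rng.normal(0.0, 1.0, size=k)    # mu_j ~ N(0, 1); variances fixed at 1 here

# Given the sampled parameters, the story is the same as the frequentist one.
z = rng.choice(k, size=n, p=pi)      # z_i | pi ~ Categorical(pi)
x = rng.normal(mu[z], 1.0)           # X_i | mu, z_i ~ N(mu_{z_i}, 1)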

12.2.4 Dirichlet topic model


Topic modeling is a common unsupervised technique in natural language processing, which models a distribution over documents (collections of words) by grouping them into clusters (topics). This is a mixture model just like the mixture of Gaussians: each topic is a “source”, and then, conditional on the document being associated with that source (topic), there is a set of probabilities associated with each word appearing in the document.
More formally, we start with a vocabulary V of W words. Each document is represented as a vector in RW, with the i-th component equal to the number of times word i appears in it. (This is called a “bag of words” representation.) For simplicity, we assume each document has length n. Then, we aim to model the distribution over these document vectors with a mixture model. Unlike the Gaussian mixture, these document vectors can only take on non-negative integer values in each entry, so rather than a Gaussian conditioned on the source, we model a document as a multinomial distribution conditioned on its topic. (A multinomial distribution is just n trials of a categorical distribution.) So, we have parameters π for the choice of topic, and then a parameter vector θk giving the multinomial distribution over words for each topic k.
Then, the generative story is:

1. First, select a topic, zi ∼ Categorical(π), where as before, the parameter π is a vector whose components sum to 1.
2. Then, generate n words from Multinomial(n, θzi); this produces the document Xi.
In order to make this Bayesian, which gets us to the Dirichlet topic model, we only need priors over the parameters: π and the θk's, one per topic (the number of words n is fixed). Both can be Dirichlet priors: one for π, with the number of parameters equal to the number of topics; and one for each θk, with the number of parameters equal to the number of words in the vocabulary. Then, to generate a document, we first sample π and all the θ's, then follow the generative process above.
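Here is a minimal sketch of that generative process; the sizes and the symmetric Dirichlet parameters below are made-up illustrative choices.

import numpy as np

rng = np.random.default_rng(2)
K, W, n_words, n_docs = 5, 1000, 50, 200  # topics, vocabulary size, words per document, documents

pi = rng.dirichlet(np.ones(K))                    # prior over topic weights: K parameters
theta = rng.dirichlet(0.1 * np.ones(W), size=K)   # one word distribution per topic: W parameters each

docs = np.empty((n_docs, W), dtype=int)
for i in range(n_docs):
    z = rng.choice(K, p=pi)                       # 1. select a topic z_i
    docs[i] = rng.multinomial(n_words, theta[z])  # 2. draw n words; X_i is a bag-of-words vector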

12.3 Dirichlet process
12.3.1 Overview
One way to think about the Dirichlet process is as a topic model whose number of topics is not fixed, but rather can grow as the number of data points grows. Rather than fixing the number of topics in advance, we allow choosing the number of topics to be “part of the model”, in a sense. To do this, we need a prior that can generate probability vectors of any dimension, not just a fixed K; in other words, a distribution over $\bigcup_{K=1}^{\infty} \Delta_K$. We can think of the Dirichlet process as doing exactly this. Let's take some abstractions from the parametric model to generalize to the Dirichlet process setting.
In the parametric mixture models we've been discussing, you sample some parameters $\theta_k^*$ for each source (or topic) from some distribution H; sample π from a Dirichlet distribution; sample the latent z from Categorical(π); and finally sample X from some distribution parametrized by θz. The important thing to take away here is that, for a given sample Xi, once you've fixed π and all the $\theta_k^*$'s, your choice of zi completely determines θzi, i.e. the set of parameters you'll use to sample Xi.
Let's say we are modeling n examples, X1, . . . , Xn. For each, we can think of its corresponding θzi itself as a random variable, drawn from a distribution G. G is fixed given a choice of π and all the $\theta_k^*$'s, which means that the prior for G is determined by the choice of α and H. A draw from G is a choice of θi (i.e. the θ used to sample Xi), which is one of the K possible values $\theta_1^*, \dots, \theta_K^*$. G is basically a discrete distribution with point masses at all the locations defined by $\theta_1^*, \dots, \theta_K^*$, with the caveat that K is not fixed. The goal is to construct a prior over G, which in turn gives a distribution over θi, which in turn parametrizes a distribution over Xi.
There are two approaches for designing a prior for G. One is to directly construct it. (We’ll do that
later.) The other is to model the joint distribution over θ1 , . . . , θn (i.e. the choices of parameters for each of
the n examples), which then implicitly defines G. We will start with this approach. This will require a few
theoretical building blocks, which will occupy the next few sections.

12.3.2 Exchangeability & de Finetti's theorem


Exchangeability is a fundamental concept in Bayesian statistics. Given a sequence of random variables
X1 , . . . , Xn , we say they are exchangeable if their joint distribution p is permutation-invariant. That is,
if p(X1 = k1 , . . . , Xn = kn ) = c, then if we scramble up all the k’s, the joint probability would still be
c, no matter what order the k’s are in. Furthermore, we say a sequence of random variables is infinitely
exchangeable if any length-n prefix of the sequence is exchangeable for all n ≥ 1.
Theorem 12.1 (de Finetti's theorem). If θ1, . . . , θn are infinitely exchangeable, then there exists a random variable G such that
$$p(\theta_1, \dots, \theta_n) = \int p(G) \prod_{i=1}^{n} p(\theta_i \mid G) \, dG.$$

In other words, there exists some G such that the joint distribution over all n θ's “factors” and is equivalent
to the distribution obtained by first sampling G from p(G), then sampling θi from the distribution defined
by G. The implication is that we don’t have to define G directly; we can instead describe θ1 , . . . , θn (the
“effect” of G) and this is sufficient (since by this theorem, G is guaranteed to exist, and we can do inference
tasks using just the θi ’s).

12.3.3 The Chinese restaurant process


To define the joint distribution over θ1 , . . . , θn , we first have to explain something called the Chinese restau-
rant process, which provides intuition for this distribution. Imagine a restaurant with infinitely many tables,
and n customers. Customers enter one at a time, and sit at a table according to the following rules:
1. Customer 1 sits at table 1 with probability 1.
2. For i > 1, customer i sits at an occupied table k with probability $\frac{n_k}{\alpha + i - 1}$, where nk is the number of previous customers at that table, or else sits down and starts a new table with probability $\frac{\alpha}{\alpha + i - 1}$.

Because all the nk's add up to i − 1 (the number of previous customers), we can quite easily confirm that this setup makes sense (i.e. the probabilities of the customer's choices sum to 1). What does this thought experiment have to do with Dirichlet processes? Well, let the latent variable zi be the table number of the i-th customer. Then, if each “table” is assigned some $\theta_k^* \sim H$, this gives us a way of picking the θi's: simply let θi be the value assigned to the table where the i-th customer sits. This is also known as the Blackwell-MacQueen urn scheme.
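The seating process takes only a few lines to simulate; here is a minimal sketch.

import numpy as np

def chinese_restaurant_process(n, alpha, rng):
    # Seat n customers; return each customer's table index.
    tables = [1]           # n_k: number of customers at each occupied table
    z = [0]                # customer 1 sits at the first table with probability 1
    for i in range(1, n):  # the (i+1)-th customer has seen i previous customers
        # occupied table k with prob n_k / (alpha + i); new table with prob alpha / (alpha + i)
        probs = np.array(tables + [alpha]) / (alpha + i)
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)   # start a new table
        else:
            tables[k] += 1
        z.append(k)
    return np.array(z)

rng = np.random.default_rng(3)
z = chinese_restaurant_process(500, alpha=2.0, rng=rng)
print(z.max() + 1)  # the number of occupied tables grows slowly, on the order of alpha * log(n)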
This provides a joint distribution over θ1 , . . . , θn . Moreover, it is exchangeable (possible to verify, but we
won’t do it here). Intuitively, it will result in some outcomes that “could be” IID draws from some discrete
distribution G (informally speaking). Formally, applying de Finetti’s theorem, because exchangeability holds,
we know that there exists a G such that θ1, . . . , θn chosen according to this scheme are equivalent to first sampling G ∼ DP(α, H), then sampling each $\theta_i \mid G \overset{\text{iid}}{\sim} G$. We don't know what G is, just that it exists; and we can do all the interesting probabilistic inference without it (using just the θi's).

12.3.4 Explicitly constructing G (informal)


We don't have to specify G indirectly in this way; it can also be directly defined and constructed. First, a slightly-incorrect, informal treatment. We basically want an infinite-dimensional Dirichlet distribution, $\lim_{k \to \infty} \text{Dir}(\alpha/k, \dots, \alpha/k)$. Then, we would just select some $\theta_k^* \sim H$ for each of these components, and have an infinite mixture. G could then be defined as a set of “point masses” with some density on each of the $\theta_k^*$'s, i.e.:

$$G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k^*},$$

...where δ denotes the Dirac measure (a point mass).


This is all slightly imprecise and incorrect, but it gets at the basic idea. To formalize it, we use a variant of the merging rule (property 5 in Section 12.2.2). Rather than summing pairs of π's, as discussed there, imagine partitioning the π's into groups. By the same rule, we get a new Dirichlet distribution with a component for each group of the partition, whose α parameter is the sum of the αk's in that group. The Dirichlet process is a bit different, because we will partition an infinite space, not a finite list of π's, but this is exactly the idea.
Recall that G is a distribution over the space of Θ, the set of all possible θk∗ . (A discrete distribution,
with point masses on certain possible values θk∗ .) Consider the partition of Θ into A1 , . . . , Am . Then, G(Ai )
is basically the total mass of G that’s contained in the Ai segment of the partition; i.e. G(Ai ) = Pr[θ ∈ Ai ].
This is deterministic for fixed G (but of course random otherwise, since G is a random variable itself). The
claim is that G(A1 ), . . . , G(Am ) ∼ Dir(αH(A1 ), . . . , αH(Am )), i.e. that a partition of Θ into some finite
number of segments results in a Dirichlet distribution.
We have $G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k^*}$, but we can write G(Ai) as the sum over only the mass in segment Ai, i.e. $G(A_i) = \sum_{k=1}^{\infty} \pi_k \mathbf{1}\{\theta_k^* \in A_i\}$. Or, letting Ij be the set of k such that $\theta_k^* \in A_j$, we can also write $G(A_j) = \sum_{k \in I_j} \pi_k$.
We can then write:
$$(G(A_1), \dots, G(A_m)) = \left( \sum_{k \in I_1} \pi_k, \ \dots, \ \sum_{k \in I_m} \pi_k \right) \sim \text{Dir}\left( \sum_{k \in I_1} \alpha_k, \ \dots, \ \sum_{k \in I_m} \alpha_k \right)$$

This is close to showing the claim, but it's still not quite right, because the Ij's are not fixed; they are random variables. But we can intuitively explain why the Ij's aren't that important. Remember, our goal is to get something like an infinite-dimensional Dirichlet distribution with parameters α/K. Because of this uniform prior, we can say that $\sum_{k \in I_j} \alpha_k = \sum_{k \in I_j} \alpha/K = |I_j| \, \alpha/K$, i.e. α multiplied by the fraction of the probability mass in the segment defined by Ij; and since each $\theta_k^*$ is drawn from H, the fraction $|I_j|/K$ concentrates around $H(A_j)$, so this is just αH(Aj), which is what we wanted to show.

12.3.5 Explicitly constructing G (formal)
The formal definition of a Dirichlet process: it is the unique distribution over distributions on Θ such that, for any partition A1, . . . , Am of Θ, when G ∼ DP(α, H) we have (G(A1), . . . , G(Am)) ∼ Dir(αH(A1), . . . , αH(Am)). We can explicitly construct such a distribution with the “stick-breaking construction”:
1. Sample $\theta_k^* \overset{\text{iid}}{\sim} H$ for k = 1, 2, . . . .
2. Sample $\beta_k \overset{\text{iid}}{\sim} \text{Beta}(1, \alpha)$ for k = 1, 2, . . . .
3. Set $\pi_k = \beta_k \prod_{i=1}^{k-1} (1 - \beta_i)$.
4. Then, $G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k^*}$.
It's called the “stick-breaking construction” because the intuition is that you begin with a stick of length 1, and at the k-th step you break off a fraction βk of what's left and take that as πk. So first, the stick has length 1; you break off β1, so π1 = β1. Then, (1 − β1) is left; you break off a fraction β2 of that, so π2 = (1 − β1)β2. And so on. This gives a formal construction of G for the Dirichlet process.
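Here is a minimal sketch of a stick-breaking draw of G, truncated at a finite K (an approximation, since the true construction is infinite) and using a standard normal as the base measure H:

import numpy as np

rng = np.random.default_rng(4)
alpha, K = 2.0, 1000                 # concentration parameter; truncation level

theta = rng.normal(0.0, 1.0, size=K)  # step 1: theta_k ~ H (here H = N(0, 1))
beta = rng.beta(1.0, alpha, size=K)   # step 2: beta_k ~ Beta(1, alpha)
stick_left = np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])
pi = beta * stick_left                # step 3: pi_k = beta_k * prod_{i<k} (1 - beta_i)

# Step 4: G is the discrete distribution putting mass pi_k at location theta_k.
print(pi.sum())               # close to 1 for a large enough truncation level
print(np.sort(pi)[::-1][:5])  # a few atoms carry most of the mass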
Then, all that remains is to do inference, which is typically done with Markov chain Monte Carlo (e.g. Gibbs sampling). This is tractable, since the conditional distributions of the θi's have nice properties.

12.4 Summary
Since this is the last class, let’s look back at what we’ve learned.
1. Non-parametric regression, including the kernel estimator, local polynomial/linear regression, splines,
and using cross-validation to select a model and tune hyperparameters.

2. The kernel method, and its connection to splines and wide two-layer neural networks.
3. Neural networks, transfer learning, and few-shot learning.
4. Density estimation, for CDF and PDF.

5. Bayesian nonparametric models (Gaussian and Dirichlet processes).

Bibliography

Sheldon Axler. Linear algebra done right. Springer, New York, 2014. ISBN 9783319110790.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

Richard Johnsonbaugh and W. E. Pfaffenberger. Foundations of mathematical analysis. Dover Books on Mathematics. Dover Publications, Mineola, N.Y., Dover ed edition, 2010. ISBN 9780486477664. OCLC: ocn463454165.

Art Owen. Lecture 6: Bayesian estimation, October 2018. Unpublished lecture notes from STATS 200.

Colin Wei, Jason D. Lee, Qiang Liu, and Tengyu Ma. Regularization matters: Generalization and optimization of neural nets v.s. their induced kernel, 2020.

Wikipedia contributors. Mercer's theorem — Wikipedia, the free encyclopedia, 2023. URL https://en.wikipedia.org/w/index.php?title=Mercer%27s_theorem&oldid=1143423242. [Online; accessed 9-May-2023].
