Stats 205 Notes
Stats 205 Notes
205)
Instructor: Tengyu Ma
August 4, 2023
Contents
3 Bias-variance trade-off 17
3.1 Motivation for using MSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Bias-variance decomposition for MSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Bias-variance trade-off in the nonparametric setting . . . . . . . . . . . . . . . . . . . . . . . 18
3.4 Case studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5 Effect of dataset size on bias-variance trade-off . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.6 Formal theorem (bias-variance characterization) . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5 Splines 30
5.1 Penalized Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.2 Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.3 Background: subspaces and bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.4 Cubic spline and penalized regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.5 Interlude: A brief review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.6 Matrix notation for splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.7 Minimizing the regularized objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.8 Choosing the basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
i
5.9 Interpretation of splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.9.1 Splines as linear smoothers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.9.2 Splines approximated by kernel estimation . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.9.3 Advanced spline methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.9.4 Confidence bands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
ii
9 Density estimation 62
9.1 Unsupervised learning: estimating the CDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
9.1.1 Setup of CDF estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
9.1.2 Empirical estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
9.1.3 Estimating functionals of the CDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
9.2 Unsupervised learning: density estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
9.2.1 Measuring performance of density estimators . . . . . . . . . . . . . . . . . . . . . . . 65
9.2.2 Mean squared error in high-dimensional spaces . . . . . . . . . . . . . . . . . . . . . . 66
9.2.3 Mean squared error and other errors in low-dimensional spaces . . . . . . . . . . . . . 67
9.2.4 Bias-variance tradeoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
9.2.5 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
9.2.6 Bias-variance of histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
9.2.7 Finding the optimal h∗ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
9.2.8 Proof sketch of Theorem 9.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
12 Dirichlet process 93
12.1 Review and overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
12.2 Parametric mixture models & extension to Bayesian setting . . . . . . . . . . . . . . . . . . . 93
12.2.1 Review: Gaussian mixture model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
12.2.2 Dirichlet distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
12.2.3 Bayesian Gaussian mixture model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
12.2.4 Dirichlet topic model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
12.3 Dirichlet process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
12.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
12.3.2 Exchangeability & de Finett’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
12.3.3 The Chinese restaurant process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
12.3.4 Explicitly constructing G (informal) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
12.3.5 Explicitly constructing G (formal) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
12.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
iii
Acknowledgments
5
Chapter 1
In this chapter, we begin our exploration of nonparametric statistics. We first describe the underlying
motivation for the field of nonparametric statistics and its general principles.
• No fixed set of parameters exists. For example, in nonparametric statistics we will often see across
infinite dimensional models, infinite parameters, or circumstances where the dimension → ∞ as the
number of data point n → ∞. In other words, the model grows in size to accommodate the complexity
of the data
Such principles are widely applicable to many areas of statistics and machine learning, such as in nonpara-
metric testing, density estimation, supervised learning, and unsupervised learning.
Particularly in this course, we consider low-dimensional data1 , with exceptions like neural networks
and some kernel methods. This is important since high dimensional data without many strong parametric
assumptions will fundamentally and statistically need many samples (i.e. the data will need exponential
dimensions 2 ) to estimate anything (density, CDF, etc.), suffering from the “curse of dimensionality”. The
lack of high dimensional data without many strong parametric assumptions results in estimate errors at the
zero-th or first order.
1 In this course, low dimensions generally refers to the case when data dimension d = 1, 2, 3
2 The number of samples needed is at least an exponential function of the dimension of the data, since the rate or estimation
error is approximately n−1/d where d is the dimension of the data and n is the number of samples needed. For more information
about this concept see All of Nonparametric Statistics chapter 4.5.
6
1.2 The Nonparametric Regression Problem
1.2.1 Setup
We start with one of the simplest estimation problems: 1 dimentional regression. In such a problem, we
have n pairs of observations
(x1 , Y1 ), . . . , (xn , Yn ), xi , Yi ∈ R. (1.1)
In these notes we will often refer to each xi as an input and Yi as an output; however, other names that are
often used include
xi Yi
covariate label
features target
explanatory variable response variable
independent variable dependent variable
regressors regressand
design outcome
exogenous endogenous
Furthermore, we generally assume that Yi can be written as
Yi = r(xi ) + ξi , (1.2)
where r(·) is an unknown function we would like to approximate and ξ is noise (independent of xi ) on the
observed output. We shall assume that E[ξi ] = 0 and thus since E[f (X) | X] = f (X)
E[Yi | xi ] = E[r(xi ) + ξi | xi ] (1.3)
= r(xi ). (1.4)
Equivalently, we can define r(xi ) as
r(xi ) := E[Yi |xi ] (1.5)
and then ξi = Yi − r(xi ), which implies E[ξi ] = 0 by the law of total expectation3 .
With these definitions, we can view the regression problem under two frameworks: deterministic inputs
or random inputs, and we will explore both possibilities in the subsequent sections.
However, because our estimate r̂ is a function of Y1 , . . . , Yn which are random variables, r̂ is also a random
variable. We then might instead want to consider the expectation over Yi
" n #
1X
MSE = EYi [MSE(r̂)] = EYi (r̂(xi ) − r(xi ))2 . (1.7)
n i=1
3 E[X] = E[E[X | Y ]]
7
y
r(x)
(xi , Yi )’s
For the rest of this note, though, we will maintain within the deterministic design paradigm described above.
8
y
r(x)
9
Chapter 2
which is, in other words, the average of all Yi where xi ∈ Bj . Because each point will fall into some Bj
(and for all bins where there is no observation, we won’t have a defined prediction), we recover a piece-wise
constant function for that Bj . We then see that each point within a particular Bj will recover the same
r̂. While this approach is simple, it is not often used in practice as choosing the binning method and size
is quite tricky. Additionally, some bins may not capture many observations while others may not capture
enough of a set of observations if there are too many variable regions within a bin.
1 X
r̂(xi ) = Yi , (2.2)
|Bxi |
i∈Bxi
which is, similar to before, an average, but now with the nuance that each bin is defined with respect to each
observation. One such bin with locally averaged r̂(xi ) can be seen in Figure 2.2. Note that we are making
the assumption that within each Bxi , r̂(xi ) is a constant denoted by a. Following this assumption, we can
then derive r̂(xi ) as minimizing the MSE inside each local bin, namely
1 X
r̂(xi ) = argmin (Yi − a)2 . (2.3)
a |Bxi |
i∈Bxi
10
y
r(x)
Note that as a convex function, we can solve for this minimum explicitly. We have that
d 1 X 2 X
(Yi − a)2 = (Yi − a) (2.4)
da |Bxi | |Bxi |
i∈Bxi i∈Bxi
1 X
r̂(xi ) = a = Yi (2.6)
|Bxi |
i∈Bxi
As with the regressogram, we make some assumptions in local averaging that may not be sufficient for
our regression. Namely, we are assuming that we can safely ignore all points outside the boundary of each
Bxi , even if such points are borderline to the bin determined by some h. Such a problem motivates our next
methodology: soft-weight averaging.
We can solve this is the same way as we did (2.3) where the minimizer can be found by taking derivative
w.r.t. a. !
n n
d X 2
X
wi (Yi − a) =2 wi (Yi − a) (2.8)
da i=1 i=1
11
y
r(xi )’s
Figure 2.2: Comuting local averaging around a single xi . Our estimate r̂(xi ) is equal to the average of the
r(xi )’s (red line) within the local bin Bxi (black dotted lines).
Notice that if wi = 1 {|xi − x| ≤ h}, we recover the same local averaging we have previously described. In
general, we desire wi to be smaller as |xi − x| increases and for any |xi − x| > |xj − x| it follows that wi < wj .
Kernel functions
We now need to define a weighting scheme that satisfies our weighting desires, and we will do so using a
kernel estimator. Namely, because we desire to have a weighting dependent on the distance a particular
observation xi is from x, we will define
xi − x
wi = f (xi − x) = K , (2.11)
h
Now, we will discuss four variants of kernels. We start by the boxcar kernel
1
K(t) = 1 {|t| ≤ 1} . (2.12)
2
12
y
Boxcar
Gaussian
Epanechnikov
Tricube
x
−1 1
The boxcar is named for its box-like shape and can be seen in red in Figure 2.3 overlayed with the Gaussian
kernel as well. One then sees that this corresponds exactly to local averaging for some h. Next we have the
Gaussian kernel 2
1 −t
K(t) = √ exp , (2.13)
2π 2
Unlike the boxcar, here we have some non-zero weight for some xi when |xi − x| > h, which tapers off as
that difference increases. Other choices of kernels include the Epanechnikov kernel
3
K(t) = (1 − t2 )1 {|t| ≤ 1} (2.14)
4
and the tricube kernel
70
K(t) = (1 − |t|3 )3 1 {|t| ≤ 1} . (2.15)
81
In practice, however, the explicit choice of kernel is not that important empirically between the these (ex-
cluding boxcar).
13
y
Figure 2.4: Example where choosing a large h would yield a better estimate
Figure 2.5: Example where choosing a small h would yield a better estimate
Pn
Using the central limit theorem, we have ξ1 , ..., ξn ∼ N (0, 1). Thus, we have n1 i=1 ξi ∼ N (0, n1 ). Note that
the standard deviation of this expression is √1n . Therefore, we can simplify the last term as follows:
n
1X 1
r̂(xi ) = c + ξi ≈ c ± √ (2.17)
n i=1 n
If instead we had chosen a moderate h, our estimate for each xi may have not included all observations
and then would be represented using a similar derivation as follows:
1 X 1 X 1 X 1
r̂(xi ) = Yj = c + ξi = c + ξi ≈ c ± p (2.18)
|Bxi | |Bxi | |Bxi | |Bxi |
j∈Bx i
j∈Bx i
j∈Bxi
where we can see here that we could be obtaining a noisier value for our estimate compared to a larger h.
Finally, considering another extreme case, where the true r(x) fluctuates a lot but the noise ξi is very
small such as in Figure 2.5, utilizing a large h would yield a poorer result than choosing h = 0. In such a
case, although we may not be correct in our assumption, we may still obtain a reasonable estimate r̂(xi ) = Yi
(when h = 0).
In practice, choosing the optimal bandwidth is hard. Often times we will try many different values of h
and see what works best by using methods such as cross-validation which we will explore in section 4.4.
14
r(x)
Observed data
We would like to use local averaging to estimate r(xi ) at each data point xi . We consider three choices for
the bandwidth h.
1. h = ∞: when h is infinite, we simply average over all data points in the dataset when making a
prediction for a given point x. Thus, the resulting estimator r̂ is simply a constant function, as shown
in blue below:
r(x)
r̂(x)
h = ∞ is clearly undesirable in this example, because r̂(x) carries no information about the underlying
function r(x).
2. h = 0: the only data point xi that satisfies |x − xi | ≤ h is x itself. Thus, for a given xi in the training
set, the predicted value of r(xi ) will be precisely equal to the observed outcome Yi (i.e. r̂(xi ) = Yi ), so
that we fail to denoise the dataset at all.
r(x)
r̂(x)
15
3. h = 1: This is the best choice for this example. h = 1 has the effect of averaging only the nearby
points when making a prediction on an example xi . This is desirable because points substantially far
from xi exhibit quite different behavior and thus should not influence the estimate of r(xi ). h = 1 has
the effect of adequately denoising the data without smoothing the estimator r̂ too severely, as the plot
below illustrates.
r(x)
observed
r̂(x)
16
Chapter 3
Bias-variance trade-off
In this chapter, we discuss the bias-variance trade-off and its implications in the nonparametric setting. We
start with some case studies in the setting of kernel estimators, and finish with the general Theorem 3.1.
In words, the predictive risk is our expectation of the squared error of our estimator, with the expectation
being taken over the noise in Z.
Then we may consider the average predictive risk of r̂ on our inputs xi :
n
∆ 1X
predictive risk(r̂) = EZi (r̂(xi ) − Zi )2
n i=1
n
1X
EZi (r̂(xi ) − Zi )2
=
n i=1
n
1X
E (r̂(xi ) − r(xi ) − ξi )2
=
n i=1
n
1X
E (r̂(xi ) − r(xi ))2 − 2(r̂(xi ) − r(xi ))ξi + ξi2
=
n i=1
n
1X
(r̂(xi ) − r(xi ))2 − 2(r̂(xi ) − r(xi ))E[ξi ] + E[ξi2 ]
=
n i=1
n
1X
(r̂(xi ) − r(xi ))2 + E[ξi2 ]
=
n i=1
n
1X
= (r̂(xi ) − r(xi ))2 + σ 2
n i=1
= MSE(r̂) + constant. (3.2)
17
Here, the MSE (mean-squared error) of r̂ is just the average of the squared errors between our predicted
values r̂(xi ) and the true r(xi ):
n
1X
MSE(r̂) = (r̂(xi ) − r(xi ))2 . (3.3)
n i=1
We thus have that MSE and predictive risk are identical up to a constant. Note that in this derivation,
we have used the fact that the expectation is taken with respect to Zi to pull the constants with respect to
Zi (in particular, r̂(xi ) − r(xi )) out of the expectation.
where we used the fact that for any random variable X, E[X 2 ] = E[X]2 + Var(X) and the fact that shifting
a random variable by constant does not change the variance. Combining (3.4) and (3.5) and using the fact
that r(x) is fixed, we see that
n n
1X 2 1X
MSE = (E[r̂(xi )] − r(xi )) + Var(r̂(xi )) , (3.6)
n i=1 n i=1
where the expression in the blue box is called the bias and the expression in the red box is the variance.
Note that all results we have derived thus far hold for all estimators, parametric and nonparametric. We
will next discuss results related to the bias-variance trade-off in the nonparametric setting.
By linearity of expectation,
n n Pn
j=1 wj r(xj )
X X
1 1
E[r̂(x)] = Pn wj (r(xj ) + E[ξj ]) = Pn wj (r(xj )) = Pn ,
j=1 wj j=1 j=1 wj j=1 j=1 wj
18
|xj −x|
where wj = K h .
Thus, we see that E[r̂(x)] is equivalent to applying the estimator to the “clean data” (i.e. data with
no noise). E[r̂(x)] is a “smoother” version of r(x), as demonstrated in the plot below. The bias provides
a measure of how much information is lost in the process of smoothing the initial function. Note that as
bandwidth h increases, r̂(x) becomes smoother because the higher the value of h, the more weight we give
to points further from x. Thus, as h increases, bias increases.
r(x)
E[r̂(x)]
To compute the variance of r̂(x), assume the ξi are independent with mean 0 and variance σ 2 . Then
Pn n Pn 2
j=1 wj
i=1 wj Yj 1 X
wj2 Var(Yj ) = P
2
Var(r̂(x)) = Var P = P 2 σ . (3.8)
n 2
i=1 w j n n
j=1 jw j=1 w
j=1 j
Let us compute the value of the variance in the case of local averaging. Recall that the kernel used for local
averaging is the Boxcar kernel: wj = 1{|xj − x| ≤ h}, so that (w1 , . . . , wn ) = (0, . . . , 0, 1, . . . , 1, 0, . . . , 0),
where the number of 1’s is equal to the number of elements that are at most a distance of h from x.
Suppose we take x = xi . Let Bxi = {xj : |xj − xi | ≤ h} and let nxi = |Bxi |. Then
Pn 2
P 2
j=1 wj 2 xj ∈Bxi 1 2
nxi
σ2
2
Var(r̂(xi )) = P 2 σ = P 2 σ = σ = . (3.9)
n n2xi nxi
j=1 w j xj ∈Bx 1
i
We thus see that, in the case of local averaging, the variance of r̂(xi ) decreases as a function of the number
of points in the neighborhood of xi . As h → ∞, the number of points being averaged over increases. So
2
h = ∞ gives the smallest possible variance of σn and h = 0 gives the largest variance of σ 2 .
19
Pn
w r(xi )
Because r(x) is a constant function, bias will be equal to 0 for any choice of h (E[r̂(x)] = Pn i
i=1
=
Pn i=1 wi
wi c
Pi=1
n = c, so that E[r̂(x)] = c = r(x) for any choice of kernel and h). However, the variance is sensitive
i=1 wi
to the bandwidth. In this case, we should pick h = ∞ to minimize variance and thus minimize MSE.
Next, let r(x), and the observed data points, be as shown below. Note that r(x) fluctuates quite sub-
stantially but the observed points have no noise (σ = 0).
Pn
j=1 wj22
In this case, variance = Pn 2σ = 0 for all h. However, because the function fluctuates, the choice of
( j=1 wj )
bandwidth has a large effect on the bias, as it effects how “smooth” our estimates will be. Thus, in this case,
the bandwidth should be chosen to minimize bias, i.e. h = 0.
Finally, consider the function from Section 2.1.5, shown again below:
As discussed in section 2, it is desirable to pick a value of h between 0 and ∞, because we would like
to average over the noise in the dataset (which decreases variance) without smoothing our estimate too
substantially (which would increase bias).
20
3.6 Formal theorem (bias-variance characterization)
We now make the previous discussions of the bias-variance trade-off in the context of kernel estimators more
precise with a formal theorem. While the MSE of an estimator does not depend on the noise in the dataset, it
iid
does depend on the (arbitrary) positions of x1 , . . . , xn . For theoretical feasibility, we assume x1 , . . . , xn ∼ P ,
where P has density f (x). Additionally, the theorem has the following setup:
• Assume we have an estimator r̂n with n samples and that n tends to ∞.
The term in blue is the bias of the estimator and the term in red is the variance. The terms in green are
higher order terms that go to 0 as nhn → ∞ and hn → 0.
We will next try to decompose this theorem to understand each of the individual parts.
Bias: The theorem tells us that the bias depends on the following quantities:
• bandwidth hn : The smaller the bandwidth, the lower the bias.
• x2 K(x)dx:R This quantity is a measure of how flat the kernel is. The flatter the kernel, the larger
R
the value of x2 K(x)dx and thus the higher the bias. This aligns with our intuition, as the flatter the
kernel, the more we are weighing points further away and thus, the smoother our estimator.
• r00 (x): This is a measure of the curvature of r(x). The smoother the function, the lower the value of
r00 (x) and thus, the lower the bias.
0
• r0 (x) ff (x)
(x)
: Note that this quantity is equal to r0 (x)(log f (x))0 . This term is called the design bias
because it depends on the “design” (i.e., the distribution of the xi ’s). The design bias is small when
(log f (x))0 is small (i.e., when the density of X doesn’t change too quickly, or X is“close to uniform”).
Variance: According to the theorem, the variance depends on the following quantities:
• σ 2 : The larger the value of σ 2 (i.e. the variance of the random noise), the higher the variance.
• hn : The higher the bandwidth, the lower the variance.
• n: The larger the value of n, the smaller the variance.
21
Implications on bandwidth hn : Let us treat K, f, and r as constants. We are interested in seeing how
the optimal bandwidth h∗n changes as a function of σ 2 and n. Holding K, f, and r constant, we can express
the risk as
σ2
R(r̂(x), r(x)) = h4n c1 + c2 + higher order terms, (3.13)
nhn
where c1 and c2 are constants. We are thus interested in the choice of hn that minimizes this risk. By taking
the derivative of the risk and setting it equal to zero, we see that
1/5
σ2
h∗n = c3 . (3.14)
n
This tells us that the optimal bandwidth decreases at a rate proportional to n−1/5 .
Plugging in h∗n to the risk equation, we see that
2 4/5
σ 2 c2
σ
min h4n c1 + = c4 , (3.15)
hn nhn n
so that the lowest risk is on the order of n−4/5 . Note that the risk for most parametric models is on the
order of n−1 , a slight improvement over the risk for the nonparametric models we have discussed.
22
Chapter 4
https://www.overleaf.com/project/641d0430d367c0e91730e057
We introduce the concept of linear smoothers as a way to unify many common nonparametric regression
methods and conclude with an introduction to another approach to nonparametric regression: local linear
regression. We discuss how local linear regression overcomes some of the challenges faced by kernel estimators.
We proceed further to explore classical non-parametric methods. Firstly, by exploring local polynomial
regression as an extension of local linear regression. Secondly, different methods of optimizing regression
are evaluated, including cross validation (where one selects various hyperparameters) and dropout methods
(where the data set is split into validation and training sets.)
Definition 4.1. r̂ is a linear smoother if there exists a vector valued function x −→ (l1 (x), . . . , ln (x))
such that
m
X
r̂(x) = li (x)Yi . (4.1)
i=1
PnNote that the li ’s can depend on x1 , . . . , xn but not on Y1 , . . . , Yn . Additionally, we must have that
i=1 li (x) = 1.
Theorem 4.2. The regressogram and kernel estimator are both instances of linear smoothers.
• Regressogram:
n
1{xi ∈ Bx } m
1 X X X
r̂(x) = Yi = Yi = li (x)Yi , (4.2)
|Bx | i=1
|Bx | i=1
i∈Bx
• Kernel estimator:
Pn n
! m
wi Yi X wi X
r̂(x) = Pi=1
n = Pn Yi = li (x)Yi , (4.3)
i=1 wi i=1 j=1 wj i=1
23
Thus, linear smoothers provide a “unified view” in that they provide a category into which many different
types of estimators fall. In the next section, we will introduce the method of local linear regression, which
we show is yet another instance of a linear smoother. We now conclude this section with a few facts about
linear smoothers.
r̂1 (x)
• We can write any linear smoother in the matrix multiplication form r̂ = LY , where r̂ = ... ,
r̂n (x)
l1 (x1 ) ··· ln (x1 ) Y1
.. .. .. , and Y = .. .
L= . . . .
l1 (xn ) · · · ln (xn ), Yn
• For all linear smoothers,
n n n
X X X
E[r̂(x)] = E li (x)Yi = li (x)E[Yi ] = li (x)r(x), (4.4)
i=1 i=1 i=1
so E[r̂(x)] is equal to the estimate when the estimator is run on clean data. Like in the particular case
of kernel estimators, the bias is an indicator of how much we damage the clean data by smoothing.
a b c
If we were to use a kernel estimator to attempt to predict r(a), we would overestimate the true value.
This is because all observed points that are close to a are to the right of a, so we would average primarily
over points whose y-values are larger than a’s. The predicted value for r(a) using local averaging is marked
in green in the plot below.
24
r(x)
observed
r̂(a)
h h
a b c
Similarly, using a kernel estimator to predict r(c) would result in an underestimate. In contrast, be-
cause b has nearby points on either side, the prediction for r(b) would be reasonable. This problem of
over/underestimating r(x) for points that have no observations on one side (i.e. points on the boundary of
a given bin) is closely related to the design bias. It occurs because, when using kernel estimators, we make
the assumption that r(x) is locally constant.
Local linear regression provides a solution to this problem. Rather than assuming r(x) is locally constant,
we assume that r(x) is locally linear. For a given data point x, we would like to approximate r(x) by locally
fitting a line based at x to our data. Let r̃x (u) = a1 (u−x)+a0 . Then the algorithm for local linear regression
is as follows.
For a given data point x, let
n
X n
X
â0 , â1 = argmin wi (Yi − r̃x (xi ))2 = argmin wi (Yi − (a1 (xi − x) + a0 ))2 , (4.5)
a0 ,a1 a0 ,a1
i=1 i=1
(xi −x)
where wi = K h for some kernel function K. Then, we let our estimate r̂(x) equal the intercept term:
∆
r̂(x) = â0 . (4.6)
Theorem 4.3. The integrated risk of local linear regression consists of a bias term and variance term. The
variance is the same as that from Theorem 3.1. The bias is
2 Z
h4n
Z
2
x k(x)dx r00 (x)2 dx. (4.7)
4
h4n R 2 2
R 00 0 f 0 (x) 2
If we compare this bias to the bias from the kernel estimators ( 4 ( x k(x)dx) · (r (x)+2r (x) f (x) ) dx),
0
we see that the bias for local linear regression does not have the 2r0 (x) ff (x)
(x)
term (i.e. no design bias!). Local
linear regression thus mitigates the problem of design bias that we encounter when using kernel estimators.
The diagram below provides some intuition as to why this is the case.
25
observed
r(x)
r̃a (x)
r̂(a)
We see that the local linear assumption allows us to better approximate r(x) for values of x that lie on
the boundary. We conclude by proving that local linear regression is another instance of a linear smoother.
n
X
â0 , â1 = argmin wi (Yi − (a1 (xj − x) + a0 ))2 = argmin g(a0 , a1 ).
a0 ,a1 a0 ,a1
i=1
n
∂g X
=2 wi (a1 (xi − x) + a0 − Yi ) = 0
∂a0 i=1
n
∂g X
=2 wi (a1 (xi − x) + a0 − Yi )(xi − x) = 0.
∂a1 i=1
26
y
r(x)
(xi , Yi )’s
linear, but rather we assume r(x) is a polynomial function locally. Therefore, we minimize over a, giving us
the minimized equation:
n
X
â = (â0 , ..., âp ) = argmin wj (Yj − Px (uj ; a))2
(a0 ,...,ap )∈Rp+1 j=1
n 2
X ap p
= argmin wj Yj − (a0 + a1 (u − x) + ... + (u − x) ) , (4.10)
(a0 ,...,ap )∈Rp+1 j=1 p!
x−xj
where wj = K h for some kernel function K. Notice that we can rewrite this equation into
n
X
argmin wj (Yj − a> zj )2 ,
a∈Rp+1 j=1
27
cross-validation technique is widely used in machine learning (specifically neural nets) when trying to create
models that best fit specific datasets. In our case, we will be looking at cross validation for local polynomial
regression which concerns selecting the optimal bandwidth h, degree polynomial p, and which method (like
splines, regressogram, etc.) is used. We care about cross validation to optimize our model, and therefore our
results. Recall we want to evaluate and minimize
where
n
1X
MSE(r̂) = (r̂(xi ) − r(xi ))2 .
n i=1
As we have seen before, this issue cannot simply be solved with minimizing the training error
n
1X
(Yi − r̂(xi ))2 ,
n i=1
where r̂−i (·) is the estimator applied to the dataset excluding xi . You remove xi from the dataset, apply
the estimator, and finally use the estimator on xi to see if it can produce Yi with relative small error. To
implement this, recall a general linear smoother can be written as
n
X n
X
r̂(x) = lj (x)Yj , lj (x) = 1.
j=1 j=1
For the kernel estimator, r̂−1 (x) defined in the equation above is indeed the estimator applied on (X1 , Y1 ), ..., (Xn , Yn )
excluding (Xi , Yi ). Here R̂ is almost an unbiased estimator for the predictive risk. Now follows the question
on how to compute R̂ efficiently. We will do this in the form of a theorem.
28
Theorem 4.4. If r̂ is a linear smoother
n
X
r̂(x) = lj (x)Yj ,
j=1
then
n 2
1X Yi − r̂(xi )
R̂(h) = , (4.11)
n i=1 1 − Lii
Recall the sum of the weights for data point x is always 1, therefore we can rewrite the estimator as
Pn
j=1 lj (xi )Yj − li (xi )
r̂−1 (xi ) = P
j6=i lj
r̂(xi ) − li (xi )Yi
= .
1 − Lii
Therefore,
n
1X
R̂(h) = (Yi − r̂−i (xi ))2
n i=1
n 2
1X r̂(xi ) − li (xi )Yi
= Yi −
n i=1 1 − Lii
n 2
1 X Yi − r̂(xi )
= ,
n i=1 1 − Lii
as desired.
29
Chapter 5
Splines
r00 (x)2 dx. Let us consider one of the extreme cases when λ = ∞. In this case, J(r̂) can
R
where J(r̂) = S
r(x)
(xi , Yi )’s
λ=∞
only be zero since r00 (x) = 0 and r̂ can only be linear. Here, what spline is doing is that you do not have
to be so strict with being linear, in addition to controlling second order derivatives. We can now move onto
defining splines.
30
5.2 Splines
Splines themselves are family of functions f , where we have the set points ξ1 < ξ2 < ... < ξk (also known
as knots) contained in some interval [a, b]. Generally, M -th order splines are piecewise (M − 1)-degree
polynomial with continuous (M − 2)-th order derivatives at the knots. More specifically, a cubic spline (4-th
order spline) q is a continuous function such that
• q is a cubic polynomial on (a, ξ1 ],[ξ1 , ξ2 ], ..., [ξk − 1, ξk ], [ξk , b), where we have a fixed cubic polynomial
qi between ξi and ξi+1 .
• q has continuous first and second derivatives at the knots.
There is another type of spline, known as a natural spline. This spline is one that extrapolates linearly
beyond the boundary knots. Or mathematiclaly:
a0,1 x + a0,0
x ≤ ξ1
f (x) = a3i,3 x3 + a2i,2 x2 + ai,1 x + ai,0 ξi ≤ x ≤ ξi+1
ak,1 x + ak,0 x ≥ ξk
After defining a few of these notions, we can intuitively see piecewise polynomials has relative smoothness
properties and will allow us to arrive to a solution for the penalized objective. We will now prove a theorem
that demonstrates this.
for some λ1 , · · · , λk ∈ R.
1. Because we know that the best estimator is a spline class, it limits candidate functions to only functions
of the spline classes, which can be parametrized as addition of finite number cublic polynomials. This
transformed the original uncountable and infinite dimensional function space to only finite dimensional
parameter spaces.
31
2. Because the theorem also showed that the order of spline is cubic spline, it limits the highest order to
be 3rd order. This also places an upper-bound on the number of parameters to be searched.
for all x ∈ [ξi , ξi+1 ] where ∀i = 0, ..., k. The convention for knot assignment is: ξ0 = a, ξk+1 = b. However,
note that at,i ’s have to satisfy constraints pertaining to the (M −2)-th order derivation requirements explored
earlier. To prove the lemma, consider the following functions
h1 (x) = 1,
h2 (x) = x,
h3 (x) = x2 ,
h4 (x) = x3 ,
hi+4 (x) = (x − ξi )3+ , i = 1, · · · , k.
where t+ is the ReLU function utilized throughout machine learning max{t, 0}, which returns the input t
directly if it is positive, and return 0 otherwise. We prove that they are the desired basis by induction.
When there is no knots, a degree 3 polynomial can be represented by combinations of h0 , h1 , h2 , h3 .
Suppose the inductive hypothesis holds true for k − 1 knots. We can take the spline qe(x) which spans
over ξ1 , ..., ξk−1 where we define qe(x) = q(x) for all x ∈ [ξk−1 , ξk ]. Suppose
Since we also know that qe is a cubic spline with k − 1 knots, by the inductive hypothesis
k+3
X
qe(x) = λi hi (x).
i=1
Therefore, we can deduce that for all x ≤ ξk q(x) − qe(x) is zero, while for [ξk , b) is a degree 3 polynomial.
Furthermore, recall that we know that q(x) and qe(x) have continuous derivatives. Notice that
q(ξk ) − qe(ξk ) = 0 = b0 ,
32
And given we can rewrite q(x) − qe(x) = b3 (x − ξk )3 + b2 (x − ξk )2 + b1 (x − ξk ) + b0 . Therefore, q(x) − qe(x) =
b3 (x − ξk )3 for all x ≥ ξk . Rewriting it shows q(x) − qe(x) = b3 (x − ξk )3 and hence
k+4
X
q(x) = qe(x) + b3 (x − ξk )3+ = λi hi (x).
i=1
where λk+4 = b3
Given this lemma, we will now show that the minimizer of Lλ (r̂) is a degree 3 spline ge. To do this, let
us start with a twice differentiable function g over [a, b]. We will construct a natural spline ge that matches
g on x1 , ..., xn , namely ge(x) = g(x) for all x ∈ {x1 , · · · , xn }. This implies
k
X k
X
(Yi − g(xi ))2 = (Yi − ge(xi ))2
i=1 i=1
Next we will show that ge00 (x)2 dx ≤ g 00 (x)2 dx which will then indicate that Lλ (e
R R
g ) ≤ Lλ (g). Notice that
in these two inequalities, equality is attained at g = ge. Let us consider h where h = g − ge. Given this, we
only need to deduce h(x) = 0. Note that
Z Z
ge00 (x)2 dx = (eg 00 (x) + h00 (x))2 dx
S S
Z Z Z
= (eg (x)) dx + 2 ge (x)h (x)dx + (h00 (x))2 dx
00 2 00 00
S
Z ZS S
00 2 00 00
≥ (eg (x)) dx + 2 ge (x)h (x)dx
S S
Here we want to show that 2 S ge00 (x)h00 (x)dx = 0. Recall that in natural spline, ge is linear outside the
R
boundary knots. By continuity of the second derivative, ge00 (ξ1 ) = 0 and ge00 (ξn ) = 0, where ξ1 is the left
boundary node and ξn is the right boundary node. By integration by parts, we can show:
Z ξn Z ξn Z ξn
00 00 00 0 ξ 0 000
ge (x)h (x)dx = ge(x) h (x)|ξn1 − g (x)dx = −
h (x)e h0 (x)e
g 000 (x)dx.
ξ1 ξ1 ξ1
y
qe(·)
q(·)
q(·) − qe(·)
ξ
12 16
33
Here, we can identify ge000 (x) as constant ci on [ξi , ξi+1 ] given that ge is a degree-3 polynomial on the
interval. This allows us to further expand
Z ξn Z ξn
ge00 (x)h00 (x)dx = − h0 (x)e
g 000 (x)dx
ξ1 ξ1
n−1
X Z ξi+1
=− h0 (x)e
g 000 (x)dx
i=1 ξi
n−1
X Z ξi+1
=− ci h0 (x)dx
i=1 ξi
n−1
X
=− ci (h(xi+1 ) − h(xi ))dx
i=1
= 0,
where the last line follows since g = ge on knots, h(xi+1 ) and h(xi ) are zero.
where the regularization term λ S r̂00 (x)2 dx encourages r̂ to be as smooth as possible. Here, “smooth” means
R
the second order derivative is as small as possible (i.e. we want a very small regularization penalty).
Let’s also recall a few important theorems and lemmas. From the last section, we know that the minimizer
r̂ of this objective function is a natural cubic spline.
Theorem 5.3. The minimizer Lλ (r̂) is a natural cubic spline with n knots at data points {xi : 1 ≤ i ≤ n}.
That is, r̂ must be a cubic spline.
A natural cubic spline is a piecewise polynomial function that linearly extrapolates near ±∞. Let’s also
recall an important lemma from last section.
Lemma 5.4. A cubic spline with knots ξ1 , ..., ξn forms a (n + 4)-dimensional subspace of functions. That
is, there exist some {hj : 1 ≤ j ≤ n + 4} such that the cubic spline r̂ can be written as
n+4
X
r̂ = βj hj (x),
j=1
where βj ∈ R for j = 1, · · · , n + 4.
Now, we have a strong structural form in the cubic spline and only have to search over a finite dimensional
subspace in the functional form specified above. We can use this functional form of r̂ in our penalized
regression from above.
2 2
n
X n+4
X Z n+4
X
Lλ (r̂) = Lλ (β) = Yi − βj hj (xi ) + λ βj h00j (xi ) dx.
i=1 j=1 S j=1
Although this looks like a complex objective function, we have a finite number of parameters {βj : 1 ≤ j ≤
n + 4} to optimize over. Further, Lλ (β) is a convex quadratic function in β. We can see this by expanding
out the squared terms. As β is not a function of x, the βj ’s in the regularization term will be unaffected by
the derivatives. This makes it a much more feasible problem and allows us to write it in matrix form.
34
5.6 Matrix notation for splines
Although we have a convex quadratic functional form, the problem remains notationally burdensome. Let’s
continue by translating our optimization problem into matrix notation. First, let’s define a few matricies
that will be useful later.
h1 (x1 )
... hn+4 (x1 ) β1 Y1
F = ... ∈ Rn×(n+4) , β = ... ∈ Rn+4 , Y = ... ∈ Rn . (5.2)
h1 (xn ) ... hn+4 (xn ) βn+4 Yn
Applying matrix multiplication in lieu of our summations before, we find
Y1 β1 h1 (x1 ) + ... + βn+4 hn+4 (x1 )
Y − F β = ... − ... ,
Yn β1 h1 (xn ) + ... + βn+4 hn+4 (xn )
and therefore 2
n
X n+4
X
Yi − βj hj (Xi ) = ||Y − F β||22 .
i=1 j=1
Now that we have translated each part of the objective function, we can finally write it as
Lλ (β) , ||Y − F β||22 + λβ > Ωβ.
This is a remarkably familiar functional form which reminds us of a simple linear regression and ridge
regression. The regularization is weighted by matrix Ω. In the case that we have Ω = I, then we have
35
and set it equal to zero. Solving the above linear equation, we find
β̂ = (F > F + λΩ)−1 F > Y.
And therefore the minimizing natural cubic spline is then
n+4
X
r̂(x) = β̂j hj (x).
j=1
>
By taking h(x) = h1 (x) · · · hn+1 (x) ∈ Rn+4 , we can write
where L = h(x)> (F > F + λΩ)−1 F > , meaning our spline is indeed a linear smoother, as required. It is
important to show that this falls into the family of linear smoothers because it means that we can apply our
methods of cross validation (which were defined for linear smoothers) to splines as well.
36
This is clearly much more complicated that our standard kernel estimator for a number of reasons. First,
we can notice that we have a dependency on the density f (x) of x. Next, our kernel K is typically not the
Gaussian kernel (although it is usually something relatively reasonable). Finally, we have a bandwidth that
depends on f (x).
Therefore, splines can be considered a special form of local averaging. They have the advantage in that
they tend to better fit the global structure of the data due to the “global” penalization term that encourages
smoothness.
typical form
[r̂ − ŝ(x), r̂ + ŝ].
In the case of splines, we have an estimate of r̂(x) and wish to find the standard deviation of r̂(x) defined as
ŝ(x). That is, we want to find a confidence band around our estimates that contains the true function r(x).
In practice, this is difficult since we do not know the difference between E[r̂(x)] and r(x). Statisticians thus
find the confidence band for E[r̂(x)]. For this purpose, we find ŝ(x) ≈ SD(r̂(x)).
37
We can do this for the general family of linear smoothers with the general form
n
X
r̂(x) = li (x)Yi .
i=1
38
Chapter 6
High-dimensional nonparametric
methods
6.1.1 Examples
iid
Example 6.1. Suppose we generate x(1) , ..., x(n) ∼ Unif(sd−1 ) where the superscripts denote the index of
the data points and sd−1 is a d-dimensional unit sphere defined as {x : ||x||2 = 1, x ∈ Rd }.
In this case, with p ≥ 1 − n2 exp(−Ω(d)) 1 , we have
√
hx(i) , x(j) i ≈ 0 ⇒ ∀i, j ∈ [n], ||x(i) − x(j) ||2 ≈ 2.
iid
The reasoning is as follows. Let u = (u1 , ..., ud ) ∈ Rd , v = (v1 , ..., vd ) ∈ Rd and u, v ∼ Unif(sd−1 ). Due to
symmetry,
Ehu, vi = E[u1 v1 + ... + ud vd ] = 0
d
X 1
u2i = 1 ⇒ u2i ≈ , i = 1, ..., d
i=1
d
iid
Let us make a convenient approximation that ui , vi ∼ N (0, d1 ) for all i. Then
d d d d
X X X X 1 1
Var( ui v i ) = Var(ui vi ) = E(ui vi )2 = E(u2i )E(vi2 ) = d × 2 =
i=1 i=1 i=1 i=1
d d
1 T (d) = Ω(g(d)) ⇔ ∃c, d0 > 0 s.t. ∀d ≥ d0 , c · g(d) ≤ T (d)
39
Figure 6.1: Example of image comparison.
Figure 6.2: Examples of k-NN with k = 10 (Left), k = 50 (Middle) and k = 400 (Right).
When dimension d is high, the variance is small and we have hu, vi ≈ 0 with high probability. Thus u and v
are nearly orthogonal. q √
||u − v||2 = ||u||22 + ||v||22 − 2hu, vi ≈ 2
This is problematic because the distance between points isn’t that big since each x is √
orthogonal to all other
values of x. The distance
q isn’t informative in this case. That is, if we take h = 2 + for some small
positive number = O( log(n) 2
d ) , then we have all data points in our neighborhood. However, if we take
√
h = 2 − , then we have no points in our neighborhood.
Example 6.2. Let’s consider the example of comparing images as in Fig. 6.1.
The distances between these two images will appear as very large just comparing the pixels, even if they
are closely related.
A benefit of this algorithm is that each bin Bx will always contain k points, so we don’t need to worry about
bandwidth. Of course, this doesn’t always alleviate the problem that the neighbors might not be meaningful
(i.e. they could be neighbors just by chance). The plots in Fig. 6.2 explore a few different values for k on a
toy dataset (k ∈ {10, 50, 400}).
2 T (n) = O(g(n)) ⇔ ∃c, n0 > 0 s.t. ∀n ≥ n0 , T (n) ≤ c · g(n)
40
However, there is the fundamental limitation in the curse of dimensionality. That is, nonparametric
methods require a sample size at least exponential in dimension d. More formally, if we only assume
Lipschitzness or smoothness conditions (e.g. ||r(x) − r(x0 )|| ≤ ||x − x0 || or bounded derivation of r(x)), then
1
any estimator of r̂ will have errors on the order of n− Ω(d) . That is, we have
Ω(d)
1
− Ω(d) 1
=n ⇒n≥ .
The general idea of the kernel method is to generate a feature map defined as
φ : x ∈ Rd → φ(x) ∈ Rm ,
where x ∈ Rd are our input and φ(x) ∈ Rm are our features. Note that m can be very big or even infinite.
More specifically, we want to transform our standard database into a different feature-space.
n on n on
x(i) , y (i) → φ x(i) , y (i) .
i=1 i=1
After converting this to a higher dimension, we can run a standard parameterized method on the transformed
dataset (e.g. a linear regression).
θ3
This is a more flexible polynomial model that can be expanded even beyond the third degree to represent
any polynomial. This allows us to rely less on the assumptions of linear regression by transforming our input
into a higher dimension.
>
Example 6.5. We can apply the same logic to splines. By taking h(x) = h1 (x) · · · hn+4 (x) and
> n+4
β̂ = β̂1 · · · β̂n+4 ∈ R , recall lemma 5.4, minimizing natural cubic spline leads to
n+4
X
r̂(x) = β̂j hj (x) = hβ̂, h(x)i
j=1
41
6.2.2 Kernel regression
In a linear regression on transformed feature space, we find the estimator as follows
ŷ = φ(x)> θ, φ(x) ∈ Rm .
Let’s focus on the case when m > n. Our least squares objective remains the same.
n > 2
1 X (i)
(i)
L̂(θ) , y −φ x θ .
2n i=1
φ(x(1) )>
(1)
y
Φ = ... ∈ Rn×m , y = ... ∈ Rn .
φ(x(n) )> y (n)
It is important to notice that when m > n, Φ> Φ is not invertible as is required to solve the minimization
problem. This is because Φ> Φ ∈ Rm×m with rank n (i.e. the rank is smaller than the dimension). This
means we will have a family of solutions rather than a unique one. We claim that the family of solutions is
given by
θ = Φ> (ΦΦ> )−1 y + β,
where β ⊥ φ(x(1) ), ..., φ(x(n) ) (i.e. β is in the null space or βΦ = 0). Note that this is only feasible if ΦΦ>
is invertible. We can verify that this indeed is a family of solutions.
Note that the β term disappears because it is orthogonal to Φ. We can also conduct a sanity check by
confirming that there is zero training error.
42
Again, note that the β term disappears because it is orthogonal to Φ. Typically, we will take the minimum
of the family of solutions as “the” solution to the problem. This is because it is seen as the simplest model
and can likely generalize the best. The minimum in this case is given by:
The issue remains that, when m is large, it is computationally inefficient. Further, when m is infinite, this
is an impossible problem. In fact, computation is on the order of O(m · n2 ).
Using this definition, we can plug into the expression for θ̂φ(x).
Thus, we see that this kernel trick only requires (i) φ(x(i) )> φ(x(j) ) and (ii) φ(x(i) )> φ(x) for all i, j ∈ {1, ..., n}.
In the case that we can easily compute φ(x)> φ(z) for some x, z, then we don’t have any dependency on m.
That is, we can construct feature maps such that φ(x)> φ(z) can be computed more quickly than O(m) time.
Now we will briefly discuss the time it takes to compute an estimate in this fashion. Let’s say it takes us
T units of time to compute φ(x)> φ(z). Then, it takes us roughly
1. n2 T units of time to compute the matrix K;
2. n3 units of time to compute the inverse K −1 ;
φ(x(1) )> φ(x)
5. n units of time to compute the product of vector y > K −1 and vector ... .
φ(x(n) )> φ(x)
All in all, it takes n2 T + n3 + nT + n2 + n units of time to compute θ̂> φ(x). Overloading our notation for
K, we can define the kernel as the inner product
We wish to construct a φ such that K(·, ·) is easy to compute. There are lots of ways to do this. In fact,
we can even ignore φ and directly work with our kernel K instead (as long as we know there exists some φ
where K(x, z) = φ(x)> φ(z)).
43
6.2.4 Examples
Example 6.6. Consider the case when we have x = (x1 , ..., xd ) ∈ Rd . Let’s construct a feature map as
follows. >
1 1 1
x1 x1 z1
... ... ...
xd 1+d+d2 >
xd zd
φ(x) = ∈R , φ(x) φ(z) =
.
x1 x1 x1 x1 z1 z1
x1 x2 x1 x2 z1 z2
... ... ...
xd xd xd xd zd zd
Re-writing the inner product, we find the following.
d
X d
X
T
φ(x) φ(z) = 1 + xi zi + xi xj zi zj
i=1 i,j=1
d
X d
X
= 1 + x> z + xi zi xj zj
i=1 j=1
Since it takes O(d) time to compute x> z, it takes O(d) time to compute φ(x)T φ(z). There is no reliance on
m here, so our kernel trick worked.
Example 6.7. Again, let’s consider x = (x1 , ..., xd ) ∈ Rd . Let’s consider a degree-3 construction of φ this
time. >
1 1 1
x1 x1 z1
... ... ...
xd xd zd
x1 x1 x1 x1 z1 z1
x1 x2 1+d+d2 +d3 >
x1 x2 z1 z2
φ(x) = ∈R , φ(x) φ(z) =
.
... ... ...
xd xd xd xd zd zd
x1 x1 x1 x1 x1 x1 z1 z1 z1
x1 x1 x2 x1 x1 x2 z1 z1 z2
... ... ...
xd xd xd xd xd xd zd zd zd
Similar to the argument above, we find φ(x)> φ(z) = 1 + x> z + (x> z)2 + (x> z)3 , meaning we have O(d) time
again.
Example 6.8. A Gaussian kernel (also called RBF kernel) also works here. That is, we have
||x − z||2
K(x, z) = exp − .
2σ 2
It turns out there exists two infinite dimensional features φ(x) and φ(z) such that K(x, z) = φ(x)> φ(z).
Here, parameter σ controls how strong the locality is. When σ is extremely large, then we are not sensitive
to the choice of x and z since the quantity K(x, z) is always very small. When σ is very close to 0, you care
about points in local neighborhood much more than points faraway.
44
Example 6.9. Let’s consider applying kernel functions to k-Nearest neighbors algorithm instead of linear
n
regression in the feature space φ x(i) , y (i) i=1 . We can re-write the distance between x and z as
5. K(x, z) = random features kernel, where we use a randomized feature map to approximate the kernel function;
6. K(x, z) = function of infinite dimension features, e.g. RBF kernel mentioned in example 6.8..
6.2.5 Existence of φ
For a kernel function to be valid there must exist some φ such that K(x, z) = φ(x)> φ(z). Let’s show how
we know that φ exists.
Theorem 6.10. If K(x, z) = φ(x)> φ(z) then for any x(1) , ..., x(n) we have [K(x(i) , x(j) )]i,j∈[n] 0. That
is, the matrix K must be semi-positive definite.
Proof. We know that K 0 if and only if v > Kv 0 for all v. Let’s show that this holds true for some
arbitrary v.
n
X
v | Kv = vi Kij vj
i,j=1
Xn
= vi hφ(x(i) ), φ(x(j) )ivj
i,j=1
n n
!
(i)
X X
= vi φ(xk φ(x(j) )k vj
i,j=1 k=1
m
X n
X
= vi φ(x(i) )k vj φ(x(j) )k
k=1 i,j=1
m n
! n
X X X
= vi φ(x(i) )k vj φ(x(j) )k
k=1 i=1 j=1
m n
!2
X X
= vi φ(x(i) )k ≥0
k=1 i=1
Therefore, K 0 for any x(1) , · · · , x(n) is a necessary condition for φ to exist. This is in fact also
sufficient. If you’re interested, you can find more about it here [Wikipedia contributors, 2023] known as
Mercer’s theorem.
45
6.3 More about kernel methods
6.3.1 Recap
Let us quickly review the kernel method from section 6.2. The basic principle of the kernel method is that
given a set of data points
n o
(x(1) , y (1) ), · · · , (x(n) , y (n) ) , x(i) ∈ Rd , y (i) ∈ R,
φ : x 7→ φ(x) ∈ Rm .
Our interpretation of this feature map is that it is transforming the feature pairs in our dataset. If we
run a linear regression or a logistic regression on the transformed dataset (φ(x(i) ), y (i) ), then the algorithm
only depends on the inner product (i.e. we don’t need to know φ(x) or φ(z) explicitly). We only need to
compute hφ(x), φ(z)i. This is called the kernel function
If we can compute the kernel function directly, then we don’t need to pay the computational overhead of
computing the φ function/map explicitly. When the number of features is large, computing the feature map
explicitly can be quite costly.
φ(x)k : x → R,
and
φ(x)1
φ(x) = ... .
φ(x)m
An example could be the second degree polynomial kernel such that φ(x)(ij) = xi xj . We can also view the
linear prediction function of our features as a linear combination of these functions
m
X
θT φ(x) = θi φ(x)i ∈ span{φ(x)1 , ..., φ(x)m }.
i=1
The kernel method can be thought of as looking for a function in a linear span of functions.
46
where φ(x)i = hi (x), i = 1, · · · , n + 4, and thus
h1 (x)
φ : x 7→ φ(x) =
..
.
hn+4 (x)
is a feature map. Consequently, in our connection between kernels and splines, we can write out the kernel
function for cubic splines as
n+4
X
K(x, z) = hφ(x), φ(z)i = hi (x)hi (z).
i=1
Empirically, our main design choice centers around our choosing a basis hi (x) such that K(x, z) is efficiently
computable. Previous bases we have used for splines have been good mathematically but are not necessarily
the best choice when thinking about computability.
The kernel method with feature map φ is equivalent to a cubic spline r̂ with no regularization, or a
ridgeless kernel regression since
n
X n
X
argmin (y (i) − β T φ(x(i) ))2 ⇔ argmin (y (i) − r̂(x(i) ))2 .
β i=1 r̂ i=1
However, this is underspecified because the number of parameters (n + 4) > number of data points n.
Therefore, we look for the minimum norm solution or a regularization. An example of a regularized solution
is the kernel ridge regression
n
1 X (i) λ
min (y − β T φ(x(i) ))2 + kβk22 ,
β 2 2
i=1
where
φ(x(1) )>
.. n×m
Φ= ∈R .
.
φ(x(n) )>
To connect with splines, we see in in our previous lecture that in natural cubic splines we’re using a similar
but different regularizer β > Ωβ.
47
Chapter 7
7.1 Overview
In this chapter we will talk about neural networks. We will explore their connection to the kernel method
and their practical implementation.
1. We will use a neural net to represent features and then find those features, this allows for more dynamic
and better features than those computed in the kernel method.
2. We will also show that for one dimensional two layer wide neural network, it is equivalent to a linear
spline.
Definition 7.1 (Transformation of fully-connected neural network in each layer). We denote by the input
of i-th layer of a neural network by hi−1 ∈ Rd and its output by hi ∈ Rm . The weighted matrix parameters
are denoted by W ∈ Rm×d . Let σ be the non-linear activation function R → R. Examples of activation
functions include
48
1
Softplus(x) := log .
1 + ex
Then the output vector can be written as hi = σ(W hi−1 ), where σ here is understood to be applied
elementwise. Empirical wisdom is that activation functions should not be flat on both sides, so Sigmoid
tends not to be used in modern networks. Otherwise when the input to the activation function gets large,
output will be highly insensitive to parameter changes due to zero gradients.
For a fully-connected two layer neural networks, we thus can write
where x := h0 is the input of the network and a ∈ Rm . If we view σ(W x) as a feature map φ(x) which
depends on W (hence we might more accurately write this feature map as φW (x)), then this is similar to a
kernel method. If we fix W , then this is exactly a the kernel method with K(x, z) = hφW (x), φW (z)i. In
neural networks, the difference is that we train both a and W . If we don’t train W , then we essentially have
a kernel method.
Often times, hr is called “the features”. hr = σ(Wr σ(Wr−1 . . . )) → φW1 ...Wr (x) where φW1 ...Wr (x) is referred
to as the feature extractor or the feature map. The key difference is that φW1 ...Wr (x) is learned. More broadly,
any sequence of parameterized computations is called a neural network. For example, the residual neural
network is
49
w1T
T
w1 x
W = ... → W x = ... ,
T T
wm wm x
(W x)i = wiT x,
Xm
aT σ(W x) = ai σ(wiT x).
i=1
This regularizer prevents overfitting by essentially controlling the slope of the model. Note that the bi and c
terms in our model function hθ (x) are not included in our regularizer since these terms refer to the vertical
and horizontal translation of our model, and thus should not be penalized for being too large. And we now
define our regularized training objective:
Where L(hθ ) can be any loss function that is continuous in θ, and C(θ) is our regularizer. An example of a
loss function is
n
1 X (i)
L(θ) = (y − hθ (x(i) ))2 . (7.3)
n i=1
∞
X
hθ (x) = ai [wi x + bi ]+ ,
i=1
a = (a1 , · · · ) ∈ R∞ , w = (w1 , · · · ) ∈ R∞ ,
b = (b1 , · · · ) ∈ R∞ ,
∞ ∞
1 X 2 X 2
C(θ) = ai + wi .
2 i=1 i=1
50
For a parameterized neural network, we are trying to find the minimizer to
We claim that these methods are doing the same thing. Specifically we claim that
and
f ∗ = h∗θ ,
where f ∗ and θ∗ are the minimizers of equations 7.6 and 7.7, respectively.
In other words, on the one hand we have a non-parametric approach, i.e. a penalized regression with a
complexity measure R̄(f ). On the other hand we have a parameterized regression which comes from neural
networks. We claim that these are doing the exact same thing.
How do we interpret this? What does minimizing L(f ) + λR̄(f ) really do? First, let’s consider the
following:
n
X
minimize R̄(f ) s.t L(f ) = (y (i) − f (x(i) ))2 = 0.
i=1
As this corresponds to the case where λ → 0. The above is minimized when f is a linear spline that fits the
data exactly and L(f ) = 0.
00
From equation (7.4), we see that R̄(f ) consists of two terms. We know that f (x) = ∞ at the data
00
points (since this is where the slope instantaneously changes from one linear line to another), and f (x) = 0
00
otherwise. Hence, we can model f (x) as the sum of the dirac delta functions {δ(t)|t = datapoint}. The dirac
delta function is defined such that it takes the value zero everywhere except the a single point where
R +∞ R 00it take
the value infinity. By definition for dirac delta functions, we know that −∞ δ(t)dt = 1, hence |f (x)|dx
in equation (7.4) is actually quite small. We won’t go into the formality of proving this, but take it as true
that minimizing R̄ gives us a linear spline since the penalization from the second order derivatives is actually
quite small.
To represent a linear spline with n knots, we only need n + 1 pieces. We can therefore represent a linear
spline with a neural net of at most n + 1 terms. Analogously, in penalized regression we started with all
possible solutions r(x), but after we realized that the solution has a structure like a cubic spline, we then
reduced our infinitely large solution space to an n + 4 dimensional space (n + 4 neurons/width of the neural
net); this simplification makes the optimization of the problem a lot easier.
Why do we know there exists θ such that f (x) = hθ (x)? For any piecewise linear function with a finite
number of pieces, we know that there exists a hθ (x) that represents f (x), since neural networks are piecewise
linear for a finite number of neurons. A uniformly continuous function f (x) can be approximated by a
51
2-layer neural network with finite width, and it can be exactly represented by a two-layer neural network
with infinite width. This is done by taking finer and finer approximations of our function, and then taking
the limit as the number of approximations (width of our neural network) →.
We wish to prove that
min L(f ) + λR̃(f ) = min L(hθ ) + λC(θ),
∆
where R̃(f ) = min C(θ) s.t. f = hθ . Let θ∗ be the minimizer of the min L(hθ ) + λC(θ). We have
Combining these statements implies that L(f ) + λR̃(f ) ≤ L(hθ∗ ) + λC(θ∗ ) = min L(hθ ) + λC(θ), which
suggests
In the other direction, let f ∗ be the minimizer of min L(f )+λR̄(f ). By the argument above, we can construct
θ such that hθ = f ∗ . Take θ to be the minimizer of min C(θ) s.t. f ∗ = hθ . This is the minimum complexity
network that can represent f ∗ . This means that C(θ) = R̃(f ∗ ) and
Taken collectively, we conclude that min L(f ) + λR̃(f ) = min L(hθ ) + λC(θ).
a = (a1 , a2 , a3 , · · · ) ∈ R∞ ,
b = (b1 , b2 , b3 , · · · ) ∈ R∞ ,
w = (w1 , w2 , w3 , · · · ) ∈ R∞ ,
∞
X
hθ (x) = ai [wi x + bi ]+ ,
i=1
∞
X ∞
1 X
C(θ) = a2i + wi2 .
2 i=1 i=1
∆
R̃(f ) = min C(θ) s.t. f (x) = hθ (x), (7.11)
and Eq. (7.7) holds for this definition of R̄(f ). It remains to show the definition of complexity measure in
Eq. (7.4) holds for (7.11). Let us take a detour and prove a related lemma.
52
7.3.2 Preparation
Lemma 7.3. The minimizer θ∗ of Eq. (7.11) satisfies |ai | = |wi |, for all i = 1, 2, · · · .
Recall that the wi ’s are the weights of the first layer and the ai ’s are the weights of the second layer.
Therefore, this lemma implies that the weights are balanced between the two levels in order to minimize the
complexity.
Proof. We can write a_i [w_i x + b_i]₊ = (a_i/γ) [γ w_i x + γ b_i]₊ for any γ > 0. This is allowed because [γt]₊ = γ[t]₊. Now suppose that (a_i, w_i, b_i) is optimal for each i. Then the complexity should not decrease after scaling by γ, since we have already found the minimum. Hence

(1/2)(a_i² + w_i²) ≤ (1/2)(a_i²/γ² + γ² w_i²).            (7.12)

Minimizing the right-hand side with respect to γ therefore gives

min_γ (1/2)(a_i²/γ² + γ² w_i²) = (1/2)(a_i² + w_i²).            (7.13)

Let g(γ) = (1/2)(a_i²/γ² + γ² w_i²), so that min_γ g(γ) = g(1). Setting the derivative to zero at γ = 1,

g′(1) = −a_i² + w_i² = 0
⇒ a_i² = w_i²
⇒ |a_i| = |w_i|.
We now proceed to finish our proof of Theorem 7.2. We can rewrite our neural net hθ(x):

hθ(x) = ∑_{i=1}^∞ a_i [w_i x + b_i]₊ = ∑_{i=1}^∞ a_i |a_i| [ (w_i/|a_i|) x + b_i/|a_i| ]₊ = ∑_{i=1}^∞ α_i [w̃_i x + β_i]₊,

where α_i = a_i|a_i|, w̃_i = w_i/|a_i|, and β_i = b_i/|a_i|. The absolute value of a_i is taken because the ReLU is positively homogeneous. Since θ satisfies Eq. (7.11), by the results of Lemma 7.3 we know that α_i ∈ {−a_i², a_i²} and w̃_i ∈ {−1, 1}. Furthermore, we can rewrite C(θ):

C(θ) = (1/2)( ∑_{i=1}^∞ a_i² + ∑_{i=1}^∞ w_i² ) = (1/2)( ∑_{i=1}^∞ a_i² + ∑_{i=1}^∞ a_i² ) = ∑_{i=1}^∞ a_i² = ∑_{i=1}^∞ |α_i| = ‖α‖₁.

Define a new neural net by the set of parameters θ̃ = {(α_i, β_i, w̃_i)}_{i=1}^∞. Then the objective function in Eq. (7.11) becomes

R̃(f) ≜ min ‖α‖₁   s.t.   f(x) = h_θ̃(x).            (7.14)
Figure 7.1: Example of discretization across R
Similar logic holds for [−x + zj ]+ . See Fig. 7.1 for an example of discretization.
Define u⁺(z) and u⁻(z) as

u⁺(z) = ∑_{i: w̃_i = 1, β_i = z} α_i,        u⁻(z) = ∑_{i: w̃_i = −1, β_i = z} α_i.

Then

∑_{i: w̃_i = 1} α_i [x + β_i]₊ = ∑_{z ∈ Z} [x + z]₊ ∑_{i: w̃_i = 1, β_i = z} α_i = ∑_{z ∈ Z} [x + z]₊ u⁺(z),

and similarly for the w̃_i = −1 terms. Taking the limit N → ∞ makes the discretization fine-grained and results in the integral

h_θ̃(x) = ∫_{−∞}^{∞} [x + z]₊ u⁺(z) dz + ∫_{−∞}^{∞} [−x + z]₊ u⁻(z) dz.
We can view hθe(x) as a linear combination of features [x + z]+ and [−x + z]+ for z ∈ R.
7.3.4 Reformulation of the objective
We know that

∑_{i=1}^∞ |α_i| ≥ ∫_{−∞}^{∞} |u⁺(z)| dz + ∫_{−∞}^{∞} |u⁻(z)| dz

by the triangle inequality: for every bucket z, ∑_{i: w̃_i = 1, β_i = z} |α_i| ≥ |∑_{i: w̃_i = 1, β_i = z} α_i| = |u⁺(z)|, and a similar bound holds for u⁻(z). Equality occurs at the minimum, since the optimal θ minimizes the complexity regardless of how we rewrite the expression. Thus, we can update our objective in Eq. (7.14):

min ∫_{−∞}^{∞} |u⁺(z)| dz + ∫_{−∞}^{∞} |u⁻(z)| dz   s.t.   f(x) = h_θ̃(x).            (7.15)
Note that the last equality holds because δ₋ₓ(z) and δₓ(z) can be treated as "degenerate" probability distributions (with total probability 1 placed at −x and x, respectively). Our choice of x was arbitrary, so this holds for all x. Thus, the objective from Eq. (7.15) becomes

min ∫_{−∞}^{∞} |u⁺(z)| dz + ∫_{−∞}^{∞} |u⁻(z)| dz   s.t.   ∀x, f″(x) = u⁺(−x) + u⁻(x).            (7.16)
Then the objective function in Eq. (7.16) can be written as

∫_{−∞}^{∞} |u⁺(z)| dz + ∫_{−∞}^{∞} |u⁻(z)| dz = ∫_{−∞}^{∞} |u⁺(−z)| dz + ∫_{−∞}^{∞} |u⁻(z)| dz
= ∫_{−∞}^{∞} (1/2)|f″(z) − q(z)| dz + ∫_{−∞}^{∞} (1/2)|f″(z) + q(z)| dz
= (1/2) ∫_{−∞}^{∞} ( |f″(z) − q(z)| + |f″(z) + q(z)| ) dz.

Since |f″(z) − q(z)| + |f″(z) + q(z)| equals 2|f″(z)| if |f″(z)| ≥ |q(z)| and 2|q(z)| if |f″(z)| < |q(z)|, we have a simple expression for the objective function:

∫_{−∞}^{∞} |u⁺(z)| dz + ∫_{−∞}^{∞} |u⁻(z)| dz = ∫_{−∞}^{∞} max{ |f″(z)|, |q(z)| } dz.
This gives a constraint for q, and we update the objective in Equation (7.16) in terms of q:

min ∫_{−∞}^{∞} max{ |f″(z)|, |q(z)| } dz   s.t.   f′(−∞) + f′(∞) = −∫_{−∞}^{∞} q(z) dz.            (7.18)

Previous formulations of the objective were over many (indeed infinitely many) variables, but we have now found an equivalent objective over q only. Consider the following discrete analogue:

min ∑_{i=1}^k max{ a_i, |x_i| }   s.t.   ∑_{i=1}^k x_i = B.
The minimum value of the objective function above is max{ ∑_{i=1}^k a_i, |B| }. Connecting this to our objective in Equation (7.18), the minimum value of the objective function is

max{ ∫_{−∞}^{∞} |f″(x)| dx, |f′(−∞) + f′(∞)| },

which is R̄(f) for the initial objective in Eq. (7.11), so we are done.
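As a sanity check on the discrete claim (our own illustration, not part of the notes; requires scipy), one can solve min ∑_i max{a_i, |x_i|} s.t. ∑_i x_i = B as a small linear program with auxiliary variables t_i and compare the optimum against max{∑_i a_i, |B|}:

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
k = 5
a = rng.uniform(0.1, 1.0, size=k)        # illustrative a_i > 0
B = 3.0                                   # illustrative budget

# Variables (x_1..x_k, t_1..t_k); minimize sum(t) with t_i >= a_i, t_i >= |x_i|, sum(x) = B.
c = np.concatenate([np.zeros(k), np.ones(k)])
A_ub = np.block([[ np.eye(k), -np.eye(k)],            #  x_i - t_i <= 0
                 [-np.eye(k), -np.eye(k)],            # -x_i - t_i <= 0
                 [ np.zeros((k, k)), -np.eye(k)]])    # -t_i <= -a_i
b_ub = np.concatenate([np.zeros(2 * k), -a])
A_eq = np.concatenate([np.ones(k), np.zeros(k)]).reshape(1, -1)
b_eq = np.array([B])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(None, None)] * (2 * k))
print(res.fun, max(a.sum(), abs(B)))      # the two values agree up to solver tolerance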
Chapter 8
8.2 Optimization
8.2.1 Basic Premise
In a general sense, our neural network function is of the form
hθ (x) = aT φw (x).
where φw (x) has a lot of layers in it. We established the conventional viewpoint that the earlier layers of
φw (x) are producing features and the last layer is producing some linear classification of all of them. The
typical objective function for regression is

L(θ) = (1/(2n)) ∑_{i=1}^n (y^(i) − hθ(x^(i)))² + (λ/2)‖θ‖₂².            (8.1)
We want to find some algorithms that will help us solve this optimization problem.
Why does this work? In essence, we are finding the direction of steepest descent at a point θ_t and moving in that direction. If we look at a Taylor expansion of L(θ) at the point θ_t, we get

L(θ) ≈ L(θ_t) + ⟨∇L(θ_t), θ − θ_t⟩ + higher order terms.

Notice that the second term is linear in θ. If we ignore the higher order terms and minimize over a Euclidean ball around θ_t (the ball is required to maintain the accuracy of the Taylor expansion), we get

min ⟨∇L(θ_t), θ − θ_t⟩   s.t.   ‖θ − θ_t‖₂ ≤ ε.
This is equivalent to finding, for the fixed vector v = ∇L(θ_t), the vector x = θ − θ_t with minimum inner product ⟨v, x⟩ such that x has norm at most ε. The optimal solution is x = −cv, where c is a scalar constant greater than 0: the inner product is minimized when the two vectors point in opposite directions, and c scales x to lie within the ball. Therefore the optimal choice of θ − θ_t is

θ − θ_t = −c · ∇L(θ_t).

Thus the negative gradient is the locally steepest descent direction, and this is exactly the step that gradient descent takes toward a minimum.
Calculating the full gradient ∇_θ ∑_i (y^(i) − hθ(x^(i)))² over the entire dataset is expensive for complex neural nets (with many parameters) and/or large sample sizes. Stochastic gradient descent relies on using a small subset of the samples to estimate the gradient, which is effective, especially during the initial stages of training, because gradients of the individual data points often point in somewhat similar directions. We can write the objective as an average of individual losses:

L(θ) = (1/n) ∑_{i=1}^n ℓ_i(θ),        ℓ_i(θ) = (1/2)(y^(i) − hθ(x^(i)))² + (λ/2)‖θ‖₂².
3. Sample a mini-batch S ⊂ {1, . . . , n} and update θ_{t+1} = θ_t − η g_S(θ_t) for t = 0 to t = T, where T is the number of iterations and g_S(θ) = (1/|S|) ∑_{i∈S} ∇ℓ_i(θ) is the mini-batch estimate of the gradient. A sketch of this loop is given below.
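The following minimal numpy sketch (ours, with a plain linear model hθ(x) = θᵀx and illustrative hyperparameters) runs this mini-batch SGD loop on the regularized objective (8.1) and compares the result with the closed-form minimizer:

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)

lam, eta, batch, T = 0.01, 0.05, 20, 500

def grad_batch(theta, idx):
    # Gradient of (1/(2|S|)) * sum_{i in S} (y_i - x_i^T theta)^2 + (lam/2)*||theta||^2
    resid = y[idx] - X[idx] @ theta
    return -X[idx].T @ resid / len(idx) + lam * theta

theta = np.zeros(d)
for t in range(T):
    idx = rng.choice(n, size=batch, replace=False)
    theta = theta - eta * grad_batch(theta, idx)     # theta_{t+1} = theta_t - eta * g_S(theta_t)

print(np.round(theta, 3))
# Closed-form minimizer of (8.1) for the linear model, for comparison.
print(np.round(np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n), 3))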
8.2.4 Computing the gradient
The gradient of a single data point is

∇_θ (y^(i) − hθ(x^(i)))² = −2(y^(i) − hθ(x^(i))) ∇_θ hθ(x^(i)).

Hence, it suffices to find an evaluable expression for ∇_θ hθ(x^(i)). Recall that hθ(x) = aᵀσ(Wx). Then the partial derivatives are

∂hθ(x)/∂a_j = σ(w_jᵀ x),
∂hθ(x)/∂w_j = a_j σ′(w_jᵀ x) xᵀ,

where w_j is the j-th row of W; in matrix form, ∇_W hθ(x) = (a ⊙ σ′(Wx)) xᵀ, where ⊙ is the element-wise product.
We can also state an informal result about computation time. Suppose ℓ(θ₁, · · · , θ_p) : R^p → R can be evaluated by a differentiable circuit (or sequence of operations) of size N. Then the gradient ∇ℓ(θ) can be computed in time O(N + p) using a circuit of size O(N + p). This means that the time to compute the gradient is comparable to the time to compute the function value. The only requirement is that the operations of the circuit are differentiable.
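The partial derivatives above are easy to check numerically. The sketch below (ours, using tanh for σ so that σ′ has a closed form) compares the analytic gradients of hθ(x) = aᵀσ(Wx) with central finite differences:

import numpy as np

rng = np.random.default_rng(1)
m, d = 4, 3
a, W, x = rng.normal(size=m), rng.normal(size=(m, d)), rng.normal(size=d)

sigma = np.tanh
dsigma = lambda t: 1.0 - np.tanh(t) ** 2

def h(a, W):
    return a @ sigma(W @ x)

# Analytic gradients: dh/da = sigma(Wx); dh/dW = (a * sigma'(Wx)) x^T (outer product).
grad_a = sigma(W @ x)
grad_W = np.outer(a * dsigma(W @ x), x)

# Finite-difference check.
eps = 1e-6
fd_a = np.array([(h(a + eps * e, W) - h(a - eps * e, W)) / (2 * eps) for e in np.eye(m)])
fd_W = np.zeros_like(W)
for i in range(m):
    for j in range(d):
        E = np.zeros_like(W); E[i, j] = eps
        fd_W[i, j] = (h(a, W + E) - h(a, W - E)) / (2 * eps)

print(np.allclose(grad_a, fd_a, atol=1e-6), np.allclose(grad_W, fd_W, atol=1e-6))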
For a d-dimensional input, W ∈ R^{m×d}, and we assume the width m is sufficiently large. For the kernel method with feature map σ(Wx) for random W, the objective is

min ‖a‖₂²   s.t.   y^(i) = aᵀσ(Wx^(i)) for all i.

The objective for the neural net is equivalent to

min ‖a‖₁   s.t.   y^(i) = aᵀσ(Wx^(i)) for all i,

where W is random. With the L1 norm, the neural net prefers sparse solutions, similar to lasso regression. Thus, unlike the kernel method, the neural net actively selects features [Wei et al., 2020].
We also present an improved method that fine-tunes W :
1. Train a (deep) neural net hθ (x) = a> φw (x) on (x(1) , y (1) ), · · · , (x(n) , y (n) ).
2. Train a linear model g_{b,w}(x) = bᵀφ_w(x) on (x̃^(1), ỹ^(1)), · · · , (x̃^(m), ỹ^(m)), still discarding â but this time not fixing ŵ from hθ. Thus, our objective function is

min_{b,w} (1/(2m)) ∑_{i=1}^m ( g_{b,w}(x̃^(i)) − ỹ^(i) )².
The improved method can be implemented using SGD with initialization w = ŵ. We want to keep w close to its initialization (techniques like early stopping can be used). This is useful for tasks where both datasets share similar goals but have slightly different contexts.
φ_W(x) = normalize(NN_W(x)) = NN_W(x) / ‖NN_W(x)‖₂,

where NN_W(x) is a standard feed-forward neural network. This normalization is a sequence of elementary operations, which can be done efficiently (such as computing ‖NN_W(x)‖₂) and allows for efficient gradient calculations with auto-differentiation in backpropagation. Performing this operation ensures that ‖φ_W(x)‖₂ = 1.
2. At test time, we use a one-nearest-neighbor algorithm. (Here, we predict based on the single nearest neighbor rather than a combination of the k nearest, as in k-nearest neighbors.) Generally, given an example x, we wish to predict the output label y. The steps are as follows:
(b) Find the nearest neighbor of φ_w(x) in {φ_w(x̃^(1)), ..., φ_w(x̃^(nk))}, i.e. minimize ‖φ_w(x) − φ_w(x̃^(i))‖₂². Given the unit norm enforced on the outputs φ_w(x̃^(i)), this simplifies to ‖φ_w(x) − φ_w(x̃^(i))‖₂² = 2 − 2⟨φ_w(x), φ_w(x̃^(i))⟩, which is a constant shift of −2⟨a, b⟩, the cosine similarity between two unit vectors a, b. So the nearest neighbor is simply the support example with the most similar embedding.

(c) Assign the output label ỹ^(j) of the nearest neighbor x̃^(j) to the example x. A minimal sketch of this test-time procedure is given below.
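The following numpy sketch is our own illustration; φ here is just a stand-in random embedding (followed by L2 normalization) rather than a trained network:

import numpy as np

rng = np.random.default_rng(2)

def phi(X, W):
    """Stand-in embedding: a random linear map followed by L2 normalization."""
    Z = X @ W
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

d, k = 10, 6
W = rng.normal(size=(d, k))
support_X = rng.normal(size=(8, d))        # the x-tilde examples
support_y = rng.integers(0, 3, size=8)     # their labels
query = rng.normal(size=(1, d))            # the test example x

S = phi(support_X, W)
q = phi(query, W)
# With unit-norm embeddings, maximizing cosine similarity <q, s> is the same as
# minimizing the Euclidean distance ||q - s||^2 = 2 - 2<q, s>.
j = np.argmax(S @ q.ravel())
print("predicted label:", support_y[j])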
Chapter 9
Density estimation
• F̂_n(·) is a step function which only takes values in {0, 1/n, 2/n, ..., 1} (though, for a given sample, not all of these values need be attained). This is because F̂_n(·) multiplies 1/n by a sum of n indicator values in {0, 1}, which also implies that 0 ≤ F̂_n(x) ≤ 1.
• F̂_n(·) is a CDF itself; in fact it is the CDF of the uniform distribution over the observed points {X₁, ..., Xₙ}, i.e. the distribution placing mass 1/n on each data point.
Figure 9.1: CDF of the standard normal distribution.
Figure 9.2: Empirical CDF for 100 randomly generated N (0, 1) points against standard normal CDF.
Using the previous example of F (x) (the standard normal distribution), we can illustrate what form the
estimator will take for an example with n = 100 data points in Fig. 9.2.
Although this class will not overview the in-depth theory, it is possible to show for a given x that as the
number of examples n → ∞, this estimator F̂n (x) converges to the underlying distribution function F (x).
Indeed, we see in Fig. 9.2 that at n = 100 we achieve a reasonably good estimate.
In the extreme lower and/or upper range of inputs x, the density of data points is close to 0 (since F(x) is flat there), and the estimator rarely transitions to the next "increased step", given the relative lack of examples in these regions. In the opposite scenario, in a region where F(x) increases sharply, there are more examples, and the empirical CDF transitions to the next step more quickly.
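For concreteness, the empirical CDF behind a figure like Fig. 9.2 can be computed in a few lines; the sketch below is our own (it uses scipy only for the true normal CDF as a comparison):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(size=100)                    # 100 draws from N(0, 1)

def F_hat(x, data):
    """Empirical CDF: fraction of data points <= x."""
    return np.mean(data[None, :] <= np.asarray(x)[:, None], axis=1)

grid = np.linspace(-3, 3, 7)
print(np.round(F_hat(grid, X), 2))          # step-function estimate
print(np.round(norm.cdf(grid), 2))          # true standard normal CDF for comparison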
The following section involves analysis of simple theorems related to the estimator F̂n (x).
Theorem 9.1. For any fixed value of x, the expectation of the empirical estimator satisfies

E[F̂_n(x)] = F(x),            (9.2)

with randomness over the choice of X₁, ..., Xₙ. This means that F̂_n(x) is an unbiased estimator of F(x).

Proof. This result can be seen by evaluating E[F̂_n(x)]:

E[F̂_n(x)] = E[ (1/n) ∑_{i=1}^n 1(X_i ≤ x) ]
= (1/n) ∑_{i=1}^n E[ 1(X_i ≤ x) ]
= (1/n) ∑_{i=1}^n Pr(X_i ≤ x)
= F(x).            (9.3)
This means that the supremum of |F̂n (x) − F (x)| almost surely converges to 0.
Expressed in words, the Glivenko-Cantelli theorem ensures that the estimator converges to the true
underlying distribution over the entire function.
Remark 9.3. While smoothing may produce an estimator that looks more similar to a true CDF, the step-
wise estimator described is optimal to ensure the convergence of the estimator to the underlying CDF, and
smoothing is therefore not necessary. However, this estimator has a zero derivative at most places and no
derivative in others, which makes it inapplicable when trying to estimate the density of the data (which is
the derivative of the CDF).
9.1.3 Estimating functionals of the CDF
Consider T (F ), which is a function of the CDF F (x). For example, T (F ) could be any of the following
functions of F that represent a property of the CDF:
• Mean of the distribution F .
• Variance of F .
• Skewness of F (measuring the asymmetry of the CDF about the mean).
• Quantile of F .
A plug-in estimator uses T (F̂n ) as the estimator (directly plugging in the estimator for F into the func-
tional). Under certain conditions (satisfied with the functionals listed above), T (F̂n ) → T (F ). Note that
more “abnormal” functions of F (such as the derivative) do not satisfy these conditions.
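Concretely, plugging F̂_n into these functionals just yields the corresponding sample quantities; a minimal sketch of ours:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=1000)

# Plug-in estimators T(F_hat_n): integrals against F_hat_n become averages over the sample.
mean_hat = np.mean(X)                                       # mean of F
var_hat = np.mean((X - mean_hat) ** 2)                      # variance of F
skew_hat = np.mean((X - mean_hat) ** 3) / var_hat ** 1.5    # skewness of F
q90_hat = np.quantile(X, 0.9)                               # 0.9-quantile of F

print(round(mean_hat, 2), round(var_hat, 2), round(skew_hat, 2), round(q90_hat, 2))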
Often the integration is over a restricted range (e.g., the interval from 0 to 1). For a one-dimensional problem, this can be seen as a natural extension of the mean squared error.
Another way to calculate the risk in order to measure performance is the ℓ₁ integrated risk, also known as the total variation (TV) distance between the two distributions f and f̂:

R_{ℓ₁}(f̂, f) = ∫ |f̂(x) − f(x)| dx.            (9.7)

Throughout the rest of the chapter, we will use the mean squared error as the metric to evaluate density estimator performance. Part of the reason is that it is much easier to work with mathematically and it lets us understand, for example, the bias-variance tradeoff more clearly.
Remark 9.4. The mean squared error is not very useful in high dimensions. (If f = fˆ, then the mean squared
error will evaluate to 0, but this error generally does not scale well in higher dimensions.) This problem is
elaborated on in the section below.
9.2.2 Mean squared error in high-dimensional spaces
Consider the d-dimensional problem as follows. We assume that f is a spherical Gaussian, N(0, I). It follows that

f(x) = (1/(√(2π))^d) · exp( −(1/2)‖x‖₂² ).            (9.8)

Some key observations we can make about this density function are:

• f is a density and therefore f(x) ≥ 0.            (9.9)
• f(x) ≤ 1/(√(2π))^d for all x, since exp( −(1/2)‖x‖₂² ) ≤ 1.

This means that, in high-dimensional spaces, we are aiming to predict very small values, which becomes an issue that is exacerbated in the integrated mean squared error calculation.
Now, consider some f̂ that approximates f reasonably well. Because we have shown that f(x) is at most the inverse exponential 1/(√(2π))^d, we can reasonably expect that f̂(x) ≤ 1/(√(2π))^d for most x as well. We can evaluate the integrated mean squared error between the described f and f̂ as follows:

R(f̂, f) = ∫ ( f̂(x) − f(x) )² dx
≤ ∫ | f̂(x) − f(x) | · ( |f̂(x)| + |f(x)| ) dx
≲ (2/(√(2π))^d) ∫ | f̂(x) − f(x) | dx
≲ (2/(√(2π))^d) ∫ ( f̂(x) + f(x) ) dx        (f and f̂ are nonnegative)
≤ 4/(√(2π))^d,

so that

R(f̂, f) ≤ 4/(√(2π))^d,            (9.11)
and thus f and f̂ need not be close by any means for the error to be very small, on the order of an inverse exponential. Note that the TV distance can also fail to be very meaningful in high dimensions; generally, measuring the distance between two distributions in high-dimensional space is non-trivial. There are, however, alternatives that offer a slightly better way of measuring the performance of density estimators in high dimensions. One such alternative is the KL divergence. However, the KL divergence can still be large for very similar distributions. Take, for example, two distributions P₁ = N(0, I) and P₂ = N(µ, I), where µ is a small vector; the KL divergence can still become very large. The Wasserstein distance is another alternative, which incorporates the geometry of the space into the calculation and performs better for examples such as two point masses that are very close to one another.
9.2.3 Mean squared error and other errors in low-dimensional spaces
Suppose that d = 1, so the situation is low-dimensional. In this case, use of the mean squared error is acceptable (as are the other distance metrics discussed). Going forward in this chapter, we will primarily focus on one-dimensional scenarios.
Because, for a random variable Z, E[Z²] = (E[Z])² + Var(Z), we can decompose the pointwise error as

E[ ( f̂(x) − f(x) )² ] = ( E[ f(x) − f̂(x) ] )² + Var( f(x) − f̂(x) )            (9.12)
= ( f(x) − E[f̂(x)] )² + Var( −f̂(x) )        (f(x) is a constant)
= ( E[f̂(x)] − f(x) )² + Var( f̂(x) ).

Thus, we can evaluate E[ ∫ ( f(x) − f̂(x) )² dx ] as

E[ ∫ ( f(x) − f̂(x) )² dx ] = ∫ ( E[f̂(x)] − f(x) )² dx + ∫ Var( f̂(x) ) dx,            (9.13)

where the first term is the (integrated squared) bias and the second term is the (integrated) variance; sometimes each term including the integral is simply called the bias and the variance, respectively. This clean separation between bias and variance is a property of the integrated mean squared error loss.
9.2.5 Histograms
The first algorithm that we will discuss is the histogram, which is the density-estimation analog of the regressogram. Recall from previous chapters that the regressogram involves:
1. binning the input domain, and
2. fitting a constant function within each bin.
For density estimation we do the same: partition the domain into m bins B₁, ..., B_m of width h, and let Y_i equal the number of observations (data points) in bin B_i. Then, we define

p̂_i = Y_i / n = the fraction of data points in bin B_i,

and take f̂ to be constant, equal to some z_i, on each bin B_i.
The value of z_i is proportional to p̂_i, but we need to normalize in order to obtain a proper density function. In order for z₁, ..., z_m to define a proper density, we require that

∫ f̂(x) dx = ∑_{i=1}^m ∫_{B_i} f̂(x) dx = ∑_{i=1}^m h · z_i = 1.            (9.14)

Suppose that z_i = c · p̂_i. Then, using the property that ∑_{i=1}^m p̂_i = 1, we see that

∫ f̂(x) dx = h · c ∑_{i=1}^m p̂_i = 1   ⟹   c = 1/(h ∑_{i=1}^m p̂_i) = 1/h   ⟹   z_i = p̂_i / h,            (9.15)

which tells us that each z_i is the fraction of the points in bin B_i normalized by the size of the bin. More succinctly, we can write

f̂(x) = ∑_{j=1}^m z_j 1(x ∈ B_j) = ∑_{j=1}^m (p̂_j / h) 1(x ∈ B_j).            (9.16)
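A direct implementation of (9.16) is short; the sketch below is our own, and the bin width h and data are illustrative:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=500)            # data from the unknown density f
h = 0.4                             # bin width
lo = np.floor(X.min() / h) * h      # left edge of the first bin

def f_hat(x):
    """Histogram density estimate: p_hat_j / h on the bin B_j containing x."""
    j = np.floor((np.asarray(x) - lo) / h).astype(int)       # bin index of each query point
    jX = np.floor((X - lo) / h).astype(int)                   # bin index of each data point
    p_hat = np.array([np.mean(jX == jj) for jj in j])          # fraction of data in that bin
    return p_hat / h

grid = np.linspace(-3, 3, 5)
print(np.round(f_hat(grid), 3))
# Sanity check: the estimate integrates to approximately 1 over the data range.
fine = np.linspace(X.min(), X.max(), 2000)
dx = fine[1] - fine[0]
print(round(float(np.sum(f_hat(fine)) * dx), 2))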
Bias
Thus, for x ∈ B_j we can evaluate the squared bias as

Bias² = ( f(x) − E[f̂(x)] )² = ( f(x) − p_j/h )²,            (9.17)

where p_j = ∫_{B_j} f(u) du is the true probability mass of bin B_j (so that E[p̂_j] = p_j). When h is infinitesimally small, each bin becomes a very small window. Knowing that ∫_{B_j} f(u) du can then be approximated as h · f(x) for any x ∈ B_j allows us to evaluate

p_j / h = (1/h) ∫_{B_j} f(u) du ≈ (1/h) · h · f(x) = f(x),            (9.18)

so the bias vanishes as h → 0.
Variance
When evaluating the variance of the estimator for x ∈ B_j,

Var( f̂(x) ) = (1/h²) Var( p̂_j ).            (9.19)

Noting that the number of points that fall into bin B_j is n p̂_j ∼ Binomial(n, p_j), we have

Var( p̂_j ) = (1/n²) Var( n p̂_j ) = p_j (1 − p_j) / n.            (9.22)

By analyzing the above result, we see that when h → 0, then Var(f̂(x)) → ∞, and when n → ∞, then Var(f̂(x)) → 0. (This is consistent with the results we saw for regression problems.) We can clearly see the tradeoff between bias and variance in this scenario: the bias goes to 0 as h → 0, while the variance goes to ∞ as h → 0. This, once again, is the central tradeoff.
Theorem 9.5. Suppose f′ is absolutely continuous and that ∫ f′(u)² du < ∞. Then, for the histogram estimator f̂,

R(f̂, f) = (h²/12) ∫ (f′(u))² du + 1/(nh) + O(h²) + O(1/n),            (9.24)

where the first term is the bias term and the second term is the variance. (The bias depends on the smoothness of f through ∫(f′)², so f must be reasonably smooth.)
Minimizing the risk in Theorem 9.5 over h gives the optimal bin width

h* = (1/n^{1/3}) · ( 6 / ∫ (f′(u))² du )^{1/3},

which, most importantly, tells us that h* ∝ n^{−1/3}. Plugging in this choice of h*, we get R(f̂, f) = O(n^{−2/3}).
Before, we roughly approximated f(u) ≈ f(x). We can more explicitly expand f(u) using a first-order Taylor expansion:

f(u) = f(x) + f′(x)(u − x) + O((u − x)²) = f(x) + f′(x)·O(h) + O(h²),

given that |u − x| ≤ h because the size of B_j is bounded by h. Thus, we can evaluate the expectation of the estimator as

E[ f̂(x) ] = (1/h) ∫_{B_j} f(u) du = f(x) + f′(x)·O(h) + O(h²).            (9.29)

Bias

Given our previous calculation of E[f̂(x)], we can evaluate the squared bias as

( f(x) − E[f̂(x)] )² = ( f′(x)·O(h) + O(h²) )²            (9.30)
= h² ( f′(x) + O(h) )²
= O(h²) f′(x)² + O(h³).
Variance
Given our previous calculation, we can evaluate the integrated variance as

∫ Var( f̂(x) ) dx = ∑_{j=1}^m ∫_{B_j} Var( f̂(x) ) dx            (9.32)
= ∑_{j=1}^m ( p_j (1 − p_j) / (h² n) ) · h
≤ ∑_{j=1}^m ( p_j / (h² n) ) · h
= (1/(nh)) ∑_{j=1}^m p_j
= 1/(nh).
Notice that the variance does not depend on f .
Chapter 10
where h is the bandwidth and K is the kernel function. Recall from Lecture 1 that we defined a kernel function to be any smooth, non-negative function K such that

∫_R K(x) dx = 1,        ∫_R x K(x) dx = 0,        and        ∫_R x² K(x) dx > 0.

Two kernel functions we have seen are the boxcar and Gaussian kernels. For the former, we now show that kernel density estimation is very similar to the histogram approach. Recall that the boxcar kernel is K(x) = (1/2) 1{|x| ≤ 1}. Thus, using the boxcar kernel, our kernel density estimator is

f̂(x) = (1/(nh)) ∑_{i=1}^n (1/2) 1{ |x − x_i|/h ≤ 1 }.            (10.2)
Define B_x = {i : |x_i − x| ≤ h} and let |B_x| be the cardinality of this set, i.e., the number of points within distance h of x. Then we can write

f̂(x) = (1/(nh)) ∑_{i ∈ B_x} (1/2) 1{ |x − x_i|/h ≤ 1 } = (1/(nh)) ∑_{i ∈ B_x} (1/2) = |B_x| / (2nh).

To see the similarity with the histogram algorithm, recall that for the histogram,

f̂(x) = p̂_j / h = |B_j| / (nh),            (10.3)
for x ∈ Bj . Moreover, note that for the histogram, h corresponds to the bin width, whereas for our boxcar
density estimator, h is half of the bin width. The characteristic difference between the approaches is that
for kernel density estimators, our bins are not fixed but moving with and centered at x.
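Equation (10.1) translates directly into a one-liner; the sketch below is our own and implements the kernel density estimator for both the boxcar and Gaussian kernels (the bandwidth is an illustrative choice):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=300)                        # observations x_1, ..., x_n
h = 0.3                                         # bandwidth

boxcar = lambda u: 0.5 * (np.abs(u) <= 1)
gauss = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)

def f_hat(x, K):
    """Kernel density estimate (1/(n h)) * sum_i K((x - x_i)/h)."""
    x = np.asarray(x)[:, None]
    return np.mean(K((x - X[None, :]) / h), axis=1) / h

grid = np.linspace(-3, 3, 7)
print(np.round(f_hat(grid, boxcar), 3))
print(np.round(f_hat(grid, gauss), 3))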
Of course, we require that our kernel density estimator constitutes a valid density. There are two ap-
proaches for verifying that (10.1) coheres with the definition of the probability density function. The first is
to check directly that (10.1) integrates to 1:
∫_R f̂(x) dx = ∫_R (1/(nh)) ∑_{i=1}^n K( (x − x_i)/h ) dx
= (1/(nh)) ∑_{i=1}^n ∫_R K( (x − x_i)/h ) dx
= (1/(nh)) ∑_{i=1}^n ∫_R K( x/h ) dx            (10.4)
= (1/(nh)) ∑_{i=1}^n h ∫_R K(z) dz            (10.5)
= (1/(nh)) ∑_{i=1}^n h
= 1.

To reach (10.4), we used that shifting the argument of a function integrated over the whole real line has no effect on the value of the integral. In (10.5) we made a change of variables.¹
The second approach for sanity-checking our definition in (10.1) is to view K(x) as a probability density function; since K is non-negative and integrates to 1, this is a valid move. Then, we find that f̂ is itself a density function, as desired. The following theorem formalizes this.
Theorem 10.1. Let ξ ∼ K(x), Z ∼ Unif{x1 , . . . , xn }. Further, define W = Z + ξh. Then, the density
function of W is fˆ.
Proof. Let Wi = xi + ξh for each i ∈ {1, . . . , n}. Finding the density of Wi is straightforward. To do this,
we first find the distribution function of Wi :
¹Change of variables: suppose f ∈ R[a, b] and g is an increasing function from [c, d] onto [a, b] such that g is differentiable on [c, d] and g′ ∈ R[c, d]. Then (f ◦ g) · g′ ∈ R[c, d] and ∫_a^b f(x) dx = ∫_c^d f(g(t)) g′(t) dt [Johnsonbaugh and Pfaffenberger, 2010].
Figure 10.1: Example of a Gaussian mixture. The figure shows the densities of W₁, . . . , W₅ and of W = (1/5) ∑_{i=1}^5 W_i, where W_i ∼ N(x_i, h²), over the points x₁, . . . , x₅. Per Theorem 10.1, the Gaussian kernel density estimator assumes that the density f̂ is an equally weighted mixture of Gaussians centered on the observations {x_i}_{i=1}^n with variance h².
F_{W_i}(x) = P{ W_i ≤ x }
= P{ x_i + ξh ≤ x }
= P{ ξ ≤ (x − x_i)/h }
= F_ξ( (x − x_i)/h ).

Then, we differentiate to find the density:

f_{W_i}(x) = (d/dx) F_{W_i}(x)
= (d/dx) F_ξ( (x − x_i)/h )
= f_ξ( (x − x_i)/h ) · (1/h)
= (1/h) K( (x − x_i)/h ).

Now, since W = W_i with probability 1/n, we can easily find the density of W. We again appeal to distribution functions to show this:

F_W(x) = P{ W ≤ x }
= (1/n) P{ W₁ ≤ x } + · · · + (1/n) P{ W_n ≤ x }
= (1/n) F_{W₁}(x) + · · · + (1/n) F_{W_n}(x).            (10.6)

Differentiating (10.6),

f_W(x) = (1/n) f_{W₁}(x) + · · · + (1/n) f_{W_n}(x)
= (1/(nh)) ∑_{i=1}^n K( (x − x_i)/h )
= f̂(x).
(If W were not a mixture of random variables but a sum of them, computing its density would be far more
complicated.)
where σ_k² = ∫ x² K(x) dx and β_k² = ∫ K(x)² dx. The first term in (10.7) is the bias and the second is the variance. Recall that for the histogram density estimator,

R( f̂, f ) = (h²/12) ∫ f′(x)² dx + 1/(nh) + O(h²) + O(n^{−1}),            (10.8)

where the first term is the bias and 1/(nh) is the variance.
Comparing (10.7) and (10.8), we see that for the Gaussian kernel density estimator, we want f 00 (x) to be
small rather than f 0 (x). More importantly, for values of h < 1, the bias of the Gaussian kernel density
estimator will be lower than for the histogram estimator. We encounter the usual bias-variance tradeoff
here: increasing h results in more smoothing which boosts bias and depresses variance whereas decreasing h
results in less smoothing which depresses bias and boosts variance.
By minimizing the risk with respect to h, we find the optimal bandwidth

h* = ( β_k² / (σ_k² A(f) n) )^{1/5},   where   A(f) = ∫ f″(x)² dx.            (10.9)
As usual, the optimal bandwidth is inversely dependent on n. Plugging h∗ into (10.7), we find that as a
function of n, R(f, fˆ) ∝ O(n−4/5 ). We observe that this is an improvement over the histogram estimator
where as a function of n, R(f, fˆ) ∝ O(n−2/3 ).
Now, we show that for the boxcar kernel density estimator, the bias is as claimed in (10.7). From (10.3), f̂(x) = |B_x|/(2nh), so

E[ f̂(x) ] = (1/(2nh)) E[ |B_x| ]
= (n/(2nh)) ∫_{x−h}^{x+h} f(u) du
= (1/(2h)) ∫_{x−h}^{x+h} f(u) du
= (1/(2h)) ∫_{x−h}^{x+h} ( f(x) + (u − x) f′(x) + (1/2)(u − x)² f″(x) ) du + higher order terms            (10.10)
= (1/(2h)) ( 2h f(x) + f′(x) ∫_{x−h}^{x+h} (u − x) du + (1/2) f″(x) ∫_{x−h}^{x+h} (u − x)² du ) + higher order terms
= f(x) + O(h²) f″(x).

In (10.10), we have carried out a degree-2 Taylor expansion of f at x.² Thus, we find that the squared bias at x is

( f(x) − E[f̂(x)] )² = O(h²)² f″(x)² = O(h⁴) f″(x)².

So, the total (integrated squared) bias is O(h⁴) ∫ f″(x)² dx, which agrees with (10.7) as desired.
In (10.11), the first term is constant with respect to fˆ, so we are not concerned about it when choosing fˆ.
The second term can be rewritten as −2 E_{X∼f}[f̂(X)], and with held-out data x′₁, . . . , x′_m ∼ f we can compute a Monte Carlo estimate of the expectation:

E_{X∼f}[ f̂(X) ] ≈ (1/m) ∑_{i=1}^m f̂(x′_i).

If we have insufficient data for a hold-out set, we can use cross-validation. Under leave-one-out cross-validation and Monte Carlo estimation we have

E_{X∼f}[ f̂(X) ] ≈ (1/n) ∑_{i=1}^n f̂_{−i}(x_i),

where f̂_{−i} denotes the estimator obtained using {x₁, . . . , x_{i−1}, x_{i+1}, . . . , x_n}. Finally, the third term can be computed directly. Thus, the leave-one-out cross-validation score is defined as

Ĵ( f̂ ) = ∫ f̂(x)² dx − (2/n) ∑_{i=1}^n f̂_{−i}(x_i).
We would like an efficient way to find the leave-one-out loss. A naive approach to computing Jˆ could be
quite expensive since it would require that we fit fˆ n times. Fortunately, we can do better.
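For a small dataset, the naive computation is still feasible; the sketch below (ours, with a Gaussian kernel and a numerical approximation of ∫f̂²) evaluates Ĵ on a grid of bandwidths and picks the minimizer. This is exactly the brute-force version that the next theorem lets us avoid.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
gauss = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)

def kde(x, data, h):
    x = np.asarray(x)[:, None]
    return np.mean(gauss((x - data[None, :]) / h), axis=1) / h

def J_hat(h):
    # First term: integral of f_hat^2, approximated on a fine grid.
    grid = np.linspace(X.min() - 3 * h, X.max() + 3 * h, 2000)
    dx = grid[1] - grid[0]
    term1 = np.sum(kde(grid, X, h) ** 2) * dx
    # Second term: leave-one-out estimate of E[f_hat(X)].
    loo = [kde([X[i]], np.delete(X, i), h)[0] for i in range(len(X))]
    return term1 - 2 * np.mean(loo)

hs = np.linspace(0.05, 1.5, 30)
scores = [J_hat(h) for h in hs]
print("selected bandwidth:", round(float(hs[int(np.argmin(scores))]), 2))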
²A real-valued function f is said to be of class C^n on (a, b) if f^(n)(x) exists and is continuous for all x ∈ (a, b). Define P_n(x) = f(c) + f^(1)(c)(x − c) + · · · + (f^(n)(c)/n!)(x − c)^n. Let f ∈ C^{n+1} on (a, b), and let c and d be any points in (a, b). Then Taylor's Theorem says that there exists a point t between c and d such that f(d) = P_n(d) + (f^{(n+1)}(t)/((n+1)!))(d − c)^{n+1} [Johnsonbaugh and Pfaffenberger, 2010].
3 For some data, the interquartile range is the data’s 75th percentile minus its 25th percentile.
Theorem 10.3. We can compute the leave-one-out cross validation score for a kernel density estimator fˆ
as
Ĵ( f̂ ) = (1/(h n²)) ∑_{i=1}^n ∑_{j=1}^n K*( (x_i − x_j)/h ) + (2/(nh)) K(0) + O(n^{−2}),            (10.12)

where K*(x) = ∫ K(x − y) K(y) dy − 2K(x).
for some distributions {fi | i ∈ {1, . . . , k}}. There is an equivalent generative specification of this model:
1. Draw i from some distribution over {1, . . . , k}. A simple choice that we have used in (10.13) and that
we will use going forward is i ∼ unif{1, . . . , k}.
2. Draw x ∼ fi .
This Gaussian mixture model is a fully parametric approach since k is fixed. We can make it less parametric
by letting k grow with n in some way. There are three algorithms commonly used for fitting mixtures:
maximum likelihood estimation (MLE), the Expectation Maximization algorithm (EM), and the method of
moments.
Let θ = (µ₁, . . . , µ_k, Σ₁, . . . , Σ_k, z₁, . . . , z_n), where the z_i's are in {1, . . . , k} and denote the Gaussians to which the observations are assigned. MLE amounts to solving the optimization problem

max_θ (1/n) ∑_{j=1}^n log f( x_j ; µ_{z_j}, Σ_{z_j} ).            (10.15)
This is often impossible to do analytically, so numerical methods are frequently required. EM is beyond the
scope of the class, but it can be applied to fit mixtures under the MLE approach or even the more general
Bayesian framework. The method of moments involves relating model parameters to the moments of random
variables. Recall that for random variables x_i, i ∈ {1, . . . , d}, the first moments are E[x_i] and the second moments are E[x_i x_j]. We can estimate these using empirical moments. For example, for observations x^(1), . . . , x^(n) in R^d, the empirical first moment for the i-th dimension of x is

(1/n) ∑_{j=1}^n x_i^(j) ≈ E[x_i].

For the mixture model, each moment is a function of the parameters,

E[x_i] = q_i(µ, Σ),
E[x_i x_j] = q_{ij}(µ, Σ),
⋮

and the method of moments estimates the parameters by solving this system of equations.
In more careful notation, we would write f_θ(θ) rather than p(θ) and f_{x|θ}(x | θ) rather than p(x | θ): p(θ) and p(x | θ) are not the same function p. "Think of them as living things that look inside their own parentheses before deciding what function to be" [Owen, 2018].
Our goal is to infer the posterior distribution p(θ | x^(1), . . . , x^(n)). To accomplish this, we use Bayes' rule and expand using the chain rule of probability:

p(θ | x^(1), . . . , x^(n)) = p(θ, x^(1), . . . , x^(n)) / p(x^(1), . . . , x^(n))
= p(x^(1), . . . , x^(n) | θ) p(θ) / p(x^(1), . . . , x^(n))
= ∏_{i=1}^n p(x^(i) | θ) p(θ) / ∫ ∏_{i=1}^n p(x^(i) | θ) p(θ) dθ.
Now, in the supervised setting, suppose we have a dataset S = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}, where the x^(i)'s are fixed. Then, our generative story takes the following form:
1. Draw θ ∼ p(θ).
2. Draw a label y (i) ∼ p y (i) | x(i) , θ for each i ∈ {1, . . . , n}.
Given a test example x∗ , we want to find p(y ∗ | x∗ , S), where y ∗ denotes the (unknown) label associated
with x∗ . We can do this if we can first infer the posterior p(θ | S). Why? Observe that
p(y* | x*, S) = ∫ p(y* | x*, θ, S) p(θ | x*, S) dθ
= ∫ p(y* | θ, x*) p(θ | S) dθ,

where we've used that y* is independent of y^(1), . . . , y^(n) conditional on θ (and that θ is independent of the fixed input x*). As in the unsupervised setting, we can use Bayes' rule to find an expression for the posterior that we can work with:
p(θ | S) = p(θ, S) / p(S)
= p(S | θ) p(θ) / ∫ p(S | θ) p(θ) dθ
= p(y^(1), . . . , y^(n) | θ) p(θ) / ∫ p(y^(1), . . . , y^(n) | θ) p(θ) dθ
= ∏_{i=1}^n p(y^(i) | θ, x^(i)) p(θ) / ∫ ∏_{i=1}^n p(y^(i) | θ, x^(i)) p(θ) dθ.
In summary, in the unsupervised setting the generative process draws the parameter θ from its prior distribution and then draws the observed datapoints from a distribution that depends on θ; to infer the posterior distribution, we apply a formula derived from Bayes' rule. In the supervised setting, we also draw θ from its prior, but in the second step we instead generate a label for each (fixed) datapoint from the conditional distribution y^(i) ∼ p(y^(i) | x^(i), θ). Then, for a given test example, we compute the predictive distribution of its label using the posterior p(θ | S).
1. Draw θ ∼ N(0, τ² I_d).
2. Draw y^(i) ∼ N(x^(i)ᵀ θ, σ²) for each i ∈ {1, . . . , n}.

Theorem 10.4. Define the design matrix, response vector, and precision-like matrix

X = ( x^(1)ᵀ ; . . . ; x^(n)ᵀ ) ∈ R^{n×d},        y⃗ = ( y^(1), . . . , y^(n) )ᵀ ∈ R^n,        A = (1/σ²) XᵀX + (1/τ²) I_d.

Then θ | S ∼ N( (1/σ²) A^{−1} Xᵀ y⃗, A^{−1} ), and y* | x*, S ∼ N( (1/σ²) x*ᵀ A^{−1} Xᵀ y⃗, x*ᵀ A^{−1} x* + σ² ).
Note that an interesting connection can be made between the Bayesian and frequentist approaches here. Recall that in frequentist ridge regression, the goal is to estimate the regression coefficients by minimizing the sum of squared errors plus a regularization term that penalizes large coefficients. The Bayesian approach instead estimates the posterior distribution of the parameters given the observed data and a prior distribution. We can show that, for a particular regularization parameter λ, the mean of the posterior distribution in Bayesian linear regression coincides with the frequentist ridge regression estimate, as follows. Consider the expression for the posterior mean given by Theorem 10.4:

(1/σ²) A^{−1} Xᵀ y⃗ = (σ² A)^{−1} Xᵀ y⃗ = ( XᵀX + (σ²/τ²) I )^{−1} Xᵀ y⃗.

This is precisely the ridge regression estimate when the regularization parameter λ is equal to σ²/τ². A quick numerical check of Theorem 10.4 and of this correspondence is sketched below.
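The following numpy sketch is our own illustration (synthetic data, arbitrary σ and τ); it computes the posterior from Theorem 10.4 and verifies that its mean equals the ridge solution with λ = σ²/τ²:

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
sigma, tau = 0.5, 2.0
X = rng.normal(size=(n, d))
theta = rng.normal(scale=tau, size=d)
y = X @ theta + sigma * rng.normal(size=n)

A = X.T @ X / sigma ** 2 + np.eye(d) / tau ** 2
post_mean = np.linalg.solve(A, X.T @ y) / sigma ** 2           # (1/sigma^2) A^{-1} X^T y
post_cov = np.linalg.inv(A)                                     # A^{-1}

lam = sigma ** 2 / tau ** 2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)     # ridge estimate, lambda = sigma^2/tau^2
print(np.allclose(post_mean, ridge))                            # True: posterior mean = ridge solution

x_star = rng.normal(size=d)
pred_mean = x_star @ post_mean
pred_var = x_star @ post_cov @ x_star + sigma ** 2              # at least sigma^2, as noted below
print(round(float(pred_mean), 3), round(float(pred_var), 3))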
Now, let's perform a sanity check of Theorem 10.4. We can rewrite A as

A = (1/σ²) ∑_{i=1}^n x^(i) x^(i)ᵀ  +  (1/τ²) I_d,            (10.19)

where the first term captures the influence of the data and the second the influence of the prior.
First, as n → ∞, the first term in (10.19) dominates the second. As we would hope, as the size of our dataset grows, the influence of the prior on the posterior of θ diminishes and, in the limit, vanishes. For this reason, Bayesian methods are most useful outside the large-data regime. Second, as τ → ∞, our Gaussian prior becomes increasingly flat and uninformative; see Figure 10.2. Accordingly, in (10.19), τ is inversely related to the influence of the prior on the posterior. Third, the variance of our posterior predictive distribution y* | x*, S is at least σ².
Proof. First, we show that A is positive definite. For non-zero z ∈ R^d,

zᵀ (1/σ²) XᵀX z = (1/σ²) (Xz)ᵀ (Xz) = (1/σ²) ⟨Xz, Xz⟩.

Since X is full rank (our predictors cannot be linear combinations of each other), X's null space is trivial and Xz ≠ 0. Because σ² is positive and the squared norm is positive for all non-zero vectors, the expression above is positive, and A is positive definite (the (1/τ²)I_d term only adds a positive quantity). Then, A^{−1} is also positive definite since its eigenvalues are the reciprocals of A's eigenvalues.⁵ Thus,

x*ᵀ A^{−1} x* + σ² ≥ σ².

Figure 10.2: Densities of N(0, τ²) for τ = 1, 2, 3, 5.
As expected, the lower bound on the uncertainty of our predictions is σ², which is the uncertainty intrinsic to the problem. As our dataset grows, the number of observations n tends toward infinity, causing A to grow as well. Consequently, when n → ∞, x*ᵀ A^{−1} x* → 0, and we converge to the lower bound of uncertainty, σ².
⁵For any invertible matrix M, the eigenvalues of M^{−1} are the reciprocals of the eigenvalues of M. A matrix is positive definite if and only if all of its eigenvalues are positive.
Chapter 11
Parametric/nonparametric Bayesian
methods and Gaussian process
We can use Lemma 11.2 to determine the posterior distribution of one variable conditioned on the other. This lemma provides a valuable analytical formula for computing the desired posterior distribution when we have a joint Gaussian distribution. We will present a proof of this lemma at a later stage.

Lemma 11.2. Suppose

(x_A ; x_B) ∼ N( (µ_A ; µ_B), [ Σ_AA, Σ_AB ; Σ_BA, Σ_BB ] ),

where Σ_AA is the covariance matrix of x_A and Σ_AB is the cross-covariance between x_A and x_B. Then

x_B | x_A ∼ N( µ_B + Σ_BA Σ_AA^{−1} (x_A − µ_A), Σ_BB − Σ_BA Σ_AA^{−1} Σ_AB ).

We see that the conditional mean of x_B | x_A is a function of x_A. By symmetry, we also know that

x_A | x_B ∼ N( µ_A + Σ_AB Σ_BB^{−1} (x_B − µ_B), Σ_AA − Σ_AB Σ_BB^{−1} Σ_BA ).            (11.4)
Using the definition of mean and covariance and the fact that θ is random with θ ∼ N(0, τ²I), we have

µ_A = E[y⃗] = E[Xθ + ε⃗] = X E[θ] + E[ε⃗] = 0 + 0 = 0 ∈ R^n.

Similarly, µ_B = E[y*] = E[x*ᵀθ + ε*] = 0 ∈ R (ε* is a scalar). Since θ ∼ N(0, τ²I), the covariance of y⃗ is

Σ_AA = E[ (y⃗ − µ_A)(y⃗ − µ_A)ᵀ ]
= E[ y⃗ y⃗ᵀ ]
= E[ (Xθ + ε⃗)(Xθ + ε⃗)ᵀ ]
= X E[θθᵀ] Xᵀ + X E[θ] E[ε⃗ᵀ] + E[ε⃗] E[θᵀ] Xᵀ + E[ε⃗ ε⃗ᵀ]
= X (τ²I) Xᵀ + 0 + 0 + σ²I
= τ² XXᵀ + σ²I.
Since x*ᵀθ is a scalar (and thus equals its transpose) and ε* is a scalar, we have

Σ_AB = E[ (y⃗ − µ_A)(y* − µ_B) ]
= E[ y⃗ y* ]
= E[ (Xθ + ε⃗)(x*ᵀθ + ε*) ]
= E[ Xθ (x*ᵀθ) ] + 0 + 0 + 0
= E[ X θ θᵀ x* ]
= X E[θθᵀ] x*
= τ² X x*,

and thus

Σ_BA = Σ_ABᵀ = τ² x*ᵀ Xᵀ.

Finally,

Σ_BB = E[ (y* − µ_B)² ]
= E[ (y*)² ]
= E[ (x*ᵀθ + ε*)(x*ᵀθ + ε*) ]
= E[ (x*ᵀθ)² ] + E[ (ε*)² ]
= τ² x*ᵀ I x* + σ²
= τ² ‖x*‖₂² + σ².
Applying Lemma 11.2 to this joint Gaussian yields the conditional distribution of y* given y⃗; we now verify that it matches Theorem 10.4. To simplify the resulting expressions, write the SVD X = UΣVᵀ, where:

• Σ is diagonal (with singular values r₁, . . . , r_d).
• U is column-wise orthogonal, meaning that every column of U has norm 1 and the columns are orthogonal to each other; equivalently UᵀU = I, where entry (i, j) of UᵀU is the inner product of the i-th and j-th columns of U.
With this notation,

XXᵀ/σ² + I/τ² = U diag( r₁²/σ² + 1/τ², . . . , r_d²/σ² + 1/τ², 1/τ², . . . , 1/τ² ) Uᵀ,

and therefore

( XXᵀ/σ² + I/τ² )^{−1} = U diag( (r₁²/σ² + 1/τ²)^{−1}, . . . , (r_d²/σ² + 1/τ²)^{−1}, τ², . . . , τ² ) Uᵀ.
Here we used that

(U Σ Uᵀ)^{−1} = U Σ^{−1} Uᵀ,   since   U Σ Uᵀ · U Σ^{−1} Uᵀ = U Σ Σ^{−1} Uᵀ = U Uᵀ = I.            (11.6)
We can further show that

Xᵀ ( XXᵀ/σ² + I/τ² )^{−1}
= V Σᵀ Uᵀ · U diag( (r₁²/σ² + 1/τ²)^{−1}, . . . , (r_d²/σ² + 1/τ²)^{−1}, τ², . . . , τ² ) Uᵀ
= V Σᵀ diag( (r₁²/σ² + 1/τ²)^{−1}, . . . , (r_d²/σ² + 1/τ²)^{−1}, τ², . . . , τ² ) Uᵀ
= V [ diag( r₁/(r₁²/σ² + 1/τ²), . . . , r_d/(r_d²/σ² + 1/τ²) )   0 ] Uᵀ.

Expanding ( XᵀX/σ² + I/τ² )^{−1} Xᵀ in the same way, we arrive at the same quantity. As we've shown, the SVD reduces everything to diagonal matrices, so it is a very useful trick in these problem settings.
Back to our original proof, we can first show that

µ_B + Σ_BA Σ_AA^{−1} (x_A − µ_A) = (1/σ²) x*ᵀ Xᵀ ( XXᵀ/σ² + I/τ² )^{−1} y⃗
= (1/σ²) x*ᵀ ( XᵀX/σ² + I/τ² )^{−1} Xᵀ y⃗            (11.7)
= (1/σ²) x*ᵀ A^{−1} Xᵀ y⃗.

Then, for the covariance:

Σ_BB − Σ_BA Σ_AA^{−1} Σ_AB
= τ² ‖x*‖₂² + σ² − τ² x*ᵀ Xᵀ ( τ² XXᵀ + σ² I )^{−1} τ² X x*
= τ² ‖x*‖₂² + σ² − (τ²/σ²) x*ᵀ Xᵀ ( XXᵀ/σ² + I/τ² )^{−1} X x*
= τ² ‖x*‖₂² + σ² − (τ²/σ²) x*ᵀ ( XᵀX/σ² + I/τ² )^{−1} XᵀX x*            (11.8)
= τ² ‖x*‖₂² + σ² − τ² x*ᵀ ( XᵀX/σ² + I/τ² )^{−1} ( XᵀX/σ² + I/τ² ) x* + τ² x*ᵀ ( XᵀX/σ² + I/τ² )^{−1} (I/τ²) x*
= τ² ‖x*‖₂² + σ² − τ² x*ᵀ x* + x*ᵀ A^{−1} x*
= σ² + x*ᵀ A^{−1} x*.

These match the posterior predictive mean and variance stated in Theorem 10.4.
11.3.2 Approach 2: Gaussian process
The second approach, the Gaussian process, takes a cleaner and more fundamental viewpoint, though con-
ceptually, it requires more work.
As a warm-up, assume that the input space is finite and our goal is to design a prior over functions with
finite input space. After this, it takes a slight leap of faith to extend it to the infinite case.
Consider F = {all functions that map X → R} where X = {t₁, . . . , t_m}. We want to design a prior over F.

To describe a function f ∈ F, we only need to specify its values on the finite set of inputs. In other words, we can represent f by a vector

f⃗ = ( f(t₁), . . . , f(t_m) )ᵀ ∈ R^m.

In this case, designing a prior over the function space is the same as designing a prior over the m-dimensional vector; the latter is much easier.

Consider a Gaussian prior on f (or f⃗):

f⃗ ∼ N(µ, Σ),   µ ∈ R^m,   Σ ∈ R^{m×m}.
To extend this to an infinite input space, we replace the mean vector and the covariance matrix by functions:

µ ∈ R^m   →   a mean function µ(·),
Σ ∈ R^{m×m}   →   a covariance (kernel) function k(·, ·).
1. µ and k uniquely describe a Gaussian process f ∼ GP(µ(·), k(·, ·)): for any x₁, . . . , x_n, the Gaussian random vector W = ( f(x₁), . . . , f(x_n) )ᵀ satisfies

W ∼ N( ( µ(x₁), . . . , µ(x_n) )ᵀ, [ k(x_i, x_j) ]_{i,j=1,...,n} ).

2. If a Gaussian process has mean µ and covariance function k(·, ·), then k(·, ·) is a valid kernel function. In other words, there exists φ such that k(x, z) = φ(x)ᵀφ(z) = ⟨φ(x), φ(z)⟩.

Proof. In a Gaussian process, for all x₁, . . . , x_n, K = [ k(x_i, x_j) ]_{i,j=1,...,n} is the covariance matrix of ( f(x₁), . . . , f(x_n) )ᵀ; a covariance matrix is positive semidefinite, so K ⪰ 0 for all x₁, . . . , x_n. By Mercer's Theorem, k(·, ·) is a valid kernel function. (Please refer to Definition 2.1 for the properties of a valid kernel function.)

3. Vice versa, if k(·, ·) is a valid kernel function, then there exists φ such that k(x, z) = φ(x)ᵀφ(z). For simplicity, assume φ(x) ∈ R^m. Let f(x) = θᵀφ(x) where θ_i ∼ N(0, 1); then f ∼ GP(0, k), since

cov( f(x_i), f(x_j) ) = E[ f(x_i) f(x_j) ]
= E[ θᵀφ(x_i) · θᵀφ(x_j) ]
= E[ φ(x_i)ᵀ θ θᵀ φ(x_j) ]
= φ(x_i)ᵀ E[θθᵀ] φ(x_j)
= φ(x_i)ᵀ I φ(x_j)
= φ(x_i)ᵀ φ(x_j)
= k(x_i, x_j).

Thus, GP(µ, k) is a properly defined Gaussian process if and only if k(·, ·) is a valid kernel function.
What we have been doing is going from any choice of kernel function k(·, ·) to defining GP(µ, k), and then to the prior f ∼ GP(µ(·), k(·, ·)). Typically, µ(·) is chosen to be the zero function. k(·, ·) can be any common kernel function; the most popular one is the squared exponential kernel, also known as the Gaussian kernel or RBF kernel:

k_SE(x, z) = exp( −‖x − z‖₂² / (2τ²) ).
Figure 11.1: k_SE(x, z) vs. ‖x − z‖
• f(x) and f(z) have high correlation if x is close to z, because exp( −‖x − z‖₂²/(2τ²) ) ≈ exp(0) ≈ 1.
• f(x) and f(z) have low correlation if they are far apart, because for large ‖x − z‖, exp( −‖x − z‖₂²/(2τ²) ) ≈ 0.
• The parameter τ controls smoothing. If τ is very big, then even faraway points have strong correlations, meaning that there is strong smoothing and a flatter curve. If τ is very small, there is weak smoothing, leading to a higher tendency for fluctuations.
To summarize, GP (µ(·), k(·, ·)) is the distribution of functions that satisfies the following properties:
1. f (x) is Gaussian, ∀x;
then

x_B | x_A ∼ N( µ_B + Σ_BA Σ_AA^{−1} (x_A − µ_A), Σ_BB − Σ_BA Σ_AA^{−1} Σ_AB ).
Consider xA → y (1) , ...y (n) to be training observation and xB → y ∗(1) , ...y ∗(m) to be labels to predict.
Then our target y ∗(1) , ...y ∗(m) |y (1) , ...y (n) mentioned above is equivalent to xB |xA . To apply the lemma, we
need the joint distribution of

f⃗ = ( f(x^(1)), . . . , f(x^(n)) )ᵀ ∈ R^n,
f⃗* = ( f(x^{*(1)}), . . . , f(x^{*(m)}) )ᵀ ∈ R^m,
y⃗ = ( y^(1), . . . , y^(n) )ᵀ = f⃗ + ε⃗ ∈ R^n,
y⃗* = ( y^{*(1)}, . . . , y^{*(m)} )ᵀ = f⃗* + ε⃗* ∈ R^m,

where ε⃗ and ε⃗* are the observation noise vectors. Under the GP prior,

f⃗ ∼ N( 0, K(X, X) ),

and applying Lemma 11.2 to the joint Gaussian distribution of (y⃗, y⃗*) gives

y⃗* | y⃗ ∼ N( µ*, Σ* ),
µ* = K(X*, X) ( K(X, X) + σ²I )^{−1} y⃗,            (11.10)
Σ* = K(X*, X*) + σ²I − K(X*, X) ( K(X, X) + σ²I )^{−1} K(X, X*).
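Equation (11.10) translates directly into code; the sketch below is our own, fitting a GP with the squared-exponential kernel to a handful of noisy observations (the data and hyperparameters are illustrative) and reporting the posterior mean and pointwise standard deviation:

import numpy as np

def k_se(A, B, tau=0.5):
    """Squared-exponential kernel matrix K(A, B) for 1-d inputs."""
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * tau ** 2))

rng = np.random.default_rng(0)
sigma = 0.1
X = np.array([-2.0, -1.0, 0.0, 1.5, 2.5])           # five training inputs
y = np.sin(X) + sigma * rng.normal(size=len(X))      # noisy observations
X_star = np.linspace(-3, 3, 7)                       # test inputs

Kxx = k_se(X, X) + sigma ** 2 * np.eye(len(X))
Ksx = k_se(X_star, X)
mu_star = Ksx @ np.linalg.solve(Kxx, y)
Sigma_star = k_se(X_star, X_star) + sigma ** 2 * np.eye(len(X_star)) \
             - Ksx @ np.linalg.solve(Kxx, Ksx.T)

print(np.round(mu_star, 2))
print(np.round(np.sqrt(np.diag(Sigma_star)), 2))     # predictive std: larger away from the data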
Interestingly, not only do we have the prediction µ*, but we also have Σ*, which gives uncertainty quantification. Where there are few nearby data points, the confidence band is wide, whereas it is narrower in regions with many data points. A simple example of using a Gaussian process for uncertainty quantification based on only five points is shown below.
Figure 11.2: One simple example of uncertainty quantification with a Gaussian process.
Chapter 12
Dirichlet process
π1 , . . . , πk is a valid probability distribution. Our distribution over π should only have probability mass on
valid choices of π. The Dirichlet distribution (denoted Dir for short in mathematical equations), is a natural
choice.
In the Bayesian setting, we are interested in studying distributions over the parameters of another dis-
tribution (which are themselves random variables). A simpler example of this idea is the Beta distribution,
which is a probability distribution over the parameter p for a binomial random variable (i.e. a coin flip). The
Beta distribution represents the probability distribution over the “true” probability of the coin coming up
heads. A Dirichlet distribution is simply a generalization of the Beta distribution to an experiment with more
than two outcomes, e.g. a dice roll. So, the Dirichlet distribution could be used to model the distribution
over the parameters π1 , π2 , π3 , π4 , π5 , π6 representing the probability each side of the die has of being rolled.
The Dirichlet distribution is parametrized by α1 , . . . , αk , which, as with the Beta distribution, can be
interpreted as “pseudocounts”, i.e. a larger αi will result in a distribution where larger values of πi have
more density; and moreover, if their relative magnitudes are all held fixed, larger parameters denote more
“confidence”, resulting in a less uniform distribution. (As a simple example, if you’ve rolled each number
on a die one time, you’d guess that all sides equally likely, but with low confidence. If you’ve rolled each
number on a die 1000 times, you’d guess they’re all equally likely, with high confidence.)
The Dirichlet distribution has a few important properties and related intuitions, some of which will be
important for our later discussion of the Dirichlet process:
The density of the Dirichlet distribution Dir(α₁, . . . , α_K) on the probability simplex is

p(π⃗) = ( Γ(∑_{i=1}^K α_i) / ∏_{i=1}^K Γ(α_i) ) ∏_{i=1}^K π_i^{α_i − 1}.
5. Merging rule: If (π1 , . . . , πK ) ∼ Dir(α1 , . . . , αK ), then we can “merge” π’s by summing them. Doing
so will create a new Dirichlet distribution with fewer components, parametrized by new α’s obtained
by summing the αj ’s corresponding to the πj ’s that were combined. For example:
(π1 + π2 , π3 + π4 , . . .) ∼ Dir(α1 + α2 , α3 + α4 , . . .)
6. Expanding rule: Reverse merging rule; you can also obtain a new Dirichlet distribution from an
existing one by “splitting” components; for example:
1. First, select a topic, zi ∼ Categorical(π), where as before, the parameter π is a vector whose compo-
nents sum to 1.
2. Generate n words with M ultinomial(n, θzi ), this produces the document X.
In order to make this Bayesian, which gets us to the Dirichlet Topic Model, we only need a prior over the
parameters, π and θzi for i ∈ {1, . . . , W }, since the number of words n is fixed. Both can be Dirichlet priors:
one for π with the number of parameters equal to the number of topics; and the other for each θzi with the
number of parameters equal to the number of words in the vocabulary. Then, to generate a document, we
first sample π and all θ’s, then follow the generative process above.
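The generative process above is easy to simulate; the sketch below is our own, and the topic count, vocabulary size, document length, and concentration parameters are illustrative:

import numpy as np

rng = np.random.default_rng(0)
K, V, n_words = 3, 8, 20             # topics, vocabulary size, words per document

alpha_pi = np.ones(K)                # Dirichlet prior over topic proportions pi
alpha_theta = 0.5 * np.ones(V)       # Dirichlet prior over each topic's word distribution

pi = rng.dirichlet(alpha_pi)                          # sample pi
theta = rng.dirichlet(alpha_theta, size=K)            # sample theta_1, ..., theta_K

def sample_document():
    z = rng.choice(K, p=pi)                           # pick a topic
    counts = rng.multinomial(n_words, theta[z])       # generate n words from that topic
    return z, counts

for _ in range(3):
    z, counts = sample_document()
    print("topic", z, "word counts", counts)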
12.3 Dirichlet process
12.3.1 Overview
One way to think about the Dirichlet process is as a topic model whose number of topics is not fixed, but rather can grow as the number of data points grows. Rather than fixing the number of topics in advance, we allow choosing the number of topics to be "part of the model", in a sense. To do this, we need a prior that can generate probability vectors of any dimension, not just a fixed K; in other words, a distribution over ∪_{K=1}^∞ ∆_K. We can think of the Dirichlet process as doing exactly this. Let's take some abstractions from the parametric model and generalize them to the Dirichlet process setting.
In the parametric mixture models we've been discussing, you sample some parameters θ*_k for each source (or topic) from some distribution H; sample π from a Dirichlet distribution; sample the latent z_i from Categorical(π); and finally sample X_i from some distribution parametrized by θ_{z_i}. The important thing to take away here is that, for a given sample X_i, once you've fixed π and all the θ*_k's, your choice of z_i completely determines θ_{z_i}, i.e. the set of parameters you'll use to generate X_i.

Let's say we are modeling n examples, X₁, . . . , X_n. For each, we can think of its corresponding θ_{z_i} as itself a random variable, drawn from a distribution G. G is fixed given a choice of π and all the θ*_k's, which means that the prior for G is determined by the choice of α and H. A realization of G is a choice of θ_i (i.e. the θ used to sample X_i), which is one of the K possible choices θ*₁, . . . , θ*_K. G is basically a discrete distribution with point masses at the locations defined by θ*₁, . . . , θ*_K, with the caveat that K is not fixed. The goal is to construct a prior over G, which in turn gives a distribution over θ_i, which in turn parametrizes a distribution over X_i.
There are two approaches for designing a prior for G. One is to directly construct it. (We’ll do that
later.) The other is to model the joint distribution over θ1 , . . . , θn (i.e. the choices of parameters for each of
the n examples), which then implicitly defines G. We will start with this approach. This will require a few
theoretical building blocks, which will occupy the next few sections.
In other words, there exists some G such that joint distribution over all n θ’s “factors” and is equivalent
to the distribution obtained by first sampling G from p(G), then sampling θi from the distribution defined
by G. The implication is that we don’t have to define G directly; we can instead describe θ1 , . . . , θn (the
“effect” of G) and this is sufficient (since by this theorem, G is guaranteed to exist, and we can do inference
tasks using just the θi ’s).
Because all the n_k's add up to i − 1 (the number of previous customers), we can easily confirm that this setup makes sense (i.e. the probabilities of the customer's choices sum to 1). What does this thought experiment have to do with Dirichlet processes, though? Well, let the latent variable z_i be the table number of the i-th customer. Then, if each "table" is assigned some θ*_k ∼ H, this gives us a way of picking the θ_i's: simply let θ_i be the value assigned to the table where the i-th customer sits. This is also known as the Blackwell-MacQueen urn scheme; a small simulation of it is sketched below.
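The following numpy sketch is our own; it seats n customers according to the Chinese restaurant process and then assigns each table a parameter θ*_k drawn from a base measure H, here taken to be N(0, 1) purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
alpha, n = 1.0, 30
counts = []                           # counts[k] = number of customers at table k
tables = []                           # table assignment z_i of each customer

for i in range(n):
    probs = np.array(counts + [alpha], dtype=float)
    probs /= i + alpha                # P(existing table k) = n_k/(i+alpha); P(new table) = alpha/(i+alpha)
    k = rng.choice(len(probs), p=probs)
    if k == len(counts):
        counts.append(1)              # open a new table
    else:
        counts[k] += 1
    tables.append(k)

theta_star = rng.normal(size=len(counts))         # theta*_k ~ H = N(0, 1) for each table
theta = theta_star[np.array(tables)]              # theta_i for each customer (Blackwell-MacQueen draws)
print("table sizes:", counts)
print("number of distinct theta values:", len(counts))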
This provides a joint distribution over θ1 , . . . , θn . Moreover, it is exchangeable (possible to verify, but we
won’t do it here). Intuitively, it will result in some outcomes that “could be” IID draws from some discrete
distribution G (informally speaking). Formally, applying de Finetti’s theorem, because exchangeability holds,
we know that there exists a G such that θ1 , . . . , θn chosen according to this scheme are equivalent to first
iid
sampling G ∼ DP (α, H), then sampling each θi | G ∼ G. We don’t know what G is, just that it exists; and
we can do all the interesting probabilistic inference without it (using just the θi ’s).
This is close to showing the claim, but it's still not quite right because I is not fixed; it is a random variable. But we can intuitively explain why the I's aren't that important. Remember, our goal is to get something like an infinite-dimensional Dirichlet distribution with parameters α/K. Because of this uniform prior, we can say that ∑_{k ∈ I_j} α_k = ∑_{k ∈ I_j} α/K = |I_j| · α/K, i.e. α multiplied by the fraction of the probability mass in the partition cell defined by I_j, which is just αH(A_j), which is what we wanted to show.
12.3.5 Explicitly constructing G (formal)
The formal definition of a Dirichlet process: A unique distribution over distributions on Θ such that for any
partition A1 , . . . , Am of Θ, we have that when G ∼ DP (α, H), then G(A1 ), . . . , G(Am ) ∼ Dir(αH(A1 ), . . . , αH(Am )).
We can explicitly construct such a distribution with the “stick-breaking construction.”
1. Sample θ*_k ∼ H i.i.d. for k = 1, 2, . . . , ∞.
2. Sample β_k ∼ Beta(1, α) i.i.d. for k = 1, 2, . . . , ∞.
3. Set π_k = β_k ∏_{i=1}^{k−1} (1 − β_i).
4. Then, G = ∑_{k=1}^∞ π_k δ_{θ*_k}.
It’s called the ”stick-breaking construction” because the intuition is that you begin with a stick of length
1, and then at the k-th step, break off the fraction βk of what’s left, and choose that value for πk . So first,
the stick is length 1, you break off β1 , and so π1 = β1 . Then, there’s (1 − β1 ) left; you break off β2 of that,
so then π2 = (1 − β1 )β2 . So on and so forth. This gives a formal construction of G for the Dirichlet process.
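Truncating the construction at a large K gives a practical way to draw (approximately) from DP(α, H); a minimal sketch of ours, with H = N(0, 1) and arbitrary α:

import numpy as np

rng = np.random.default_rng(0)
alpha, K = 2.0, 1000                  # concentration and truncation level (illustrative)

beta = rng.beta(1.0, alpha, size=K)                        # beta_k ~ Beta(1, alpha)
remaining = np.concatenate([[1.0], np.cumprod(1 - beta)[:-1]])
pi = beta * remaining                                      # pi_k = beta_k * prod_{i<k} (1 - beta_i)
theta_star = rng.normal(size=K)                            # atom locations theta*_k ~ H = N(0, 1)

print("total stick used:", round(float(pi.sum()), 4))      # close to 1 for large K
# G is the discrete distribution placing mass pi_k at theta*_k; sampling from it:
draws = rng.choice(theta_star, size=5, p=pi / pi.sum())
print(np.round(draws, 3))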
Then, all that remains is to do inference, which is typically done with Markov-Chain Monte Carlo (e.g.
Gibbs Sampling). This is tractable, since the conditional distributions of the θi ’s have nice properties.
12.4 Summary
Since this is the last class, let’s look back at what we’ve learned.
1. Non-parametric regression, including the kernel estimator, local polynomial/linear regression, splines,
and using cross-validation to select a model and tune hyperparameters.
2. The kernel method, and its connection to splines and wide two-layer neural networks.
3. Neural networks, transfer learning, and few-shot learning.
4. Density estimation, for CDF and PDF.
Bibliography
Sheldon Axler. Linear algebra done right. Springer, New York, 2014. ISBN 9783319110790.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255.
Ieee, 2009.
Richard Johnsonbaugh and W. E. Pfaffenberger. Foundations of mathematical analysis. Dover books on
mathematics. Dover Publications, Mineola, N.Y, dover ed edition, 2010. ISBN 9780486477664. OCLC:
ocn463454165.
Art Owen. Lecture 6: Bayesian estimation, October 2018. Unpublished lecture notes from STATS 200.
Colin Wei, Jason D. Lee, Qiang Liu, and Tengyu Ma. Regularization matters: Generalization and optimiza-
tion of neural nets v.s. their induced kernel, 2020.
Wikipedia contributors. Mercer’s theorem — Wikipedia, the free encyclopedia, 2023. URL https:
//en.wikipedia.org/w/index.php?title=Mercer%27s_theorem&oldid=1143423242. [Online; accessed
9-May-2023].