Statistical Methods for ML

Exam questions
1. Write the formulas for the square loss, the zero-one loss, and the
logarithmic loss

• square loss: ℓ(y, ŷ) = (y − ŷ)²

• zero-one loss: ℓ(y, ŷ) = I{ŷ ≠ y}, i.e. 1 if ŷ ≠ y and 0 if ŷ = y

• logarithmic loss: ℓ(y, ŷ) = log(1/ŷ) if y = 1, and log(1/(1 − ŷ)) if y = 0

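The three losses translate directly into code. A minimal Python sketch (the function names are mine; the log loss expects ŷ ∈ (0, 1), interpreted as the predicted probability that y = 1):

```python
import numpy as np

def square_loss(y, y_hat):
    """Square loss: (y - y_hat)^2."""
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    """Zero-one loss: 1 if the prediction differs from the label, 0 otherwise."""
    return float(y_hat != y)

def log_loss(y, y_hat):
    """Logarithmic loss for binary labels y in {0, 1}, with y_hat in (0, 1)."""
    return np.log(1.0 / y_hat) if y == 1 else np.log(1.0 / (1.0 - y_hat))
```
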
2. What does a learning algorithm receive as input? What does it produce as output?
A learning algorithm receives as input a training set S = {(x_1, y_1), …, (x_m, y_m)} of size m, and produces a predictor h : X → Y, where X is the data domain and Y is the label space.
3. Write the mathematical formula defining the training error of a predictor h.
Given a training set S = {(x_1, y_1), …, (x_m, y_m)}, the training error of h : X → Y on S is defined as:

ℓ_S(h) = (1/m) Σ_{t=1}^m ℓ(y_t, h(x_t))
4. Write the mathematical formula defining the ERM algorithm over a class H of predictors. Define the main quantities occurring in the formula.
The ERM learning algorithm outputs a predictor ĥ ∈ argmin_{h∈H} ℓ_S(h), where:

• S = {(x_1, y_1), …, (x_m, y_m)} is the training set given as input to the learning algorithm

• ℓ_S(h) is the training error on S of a predictor h in the class H

• since there may be multiple predictors in H that minimize the training error on S, ĥ is one element of the set of possible minimizers
5. Explain in words how overfitting and underfitting are defined in terms of the behaviour of an algorithm on training and test set.

• underfitting occurs when the training error of the predictor produced by the algorithm is high

• overfitting occurs when the training error of the predictor produced by the algorithm is low, but the test error of the same predictor is high
6. Name and describe three reasons why labels may be noisy.

• “human-in-the-loop” → since humans are tasked with assigning labels to the datapoints, there can be errors in the dataset labels and there can be room for interpretation.

• epistemic uncertainty → the set of features describing the datapoints may not be sufficient to uniquely determine a label. In other words, the same values for a collection of attributes could correspond to different (but legitimate) labels.

• aleatoric uncertainty → the feature vector is obtained with some error or imprecision in the measurement. This means that two different elements could collapse onto the same datapoint, so the choice of the label to assign to the original element becomes aleatoric.

7. Is k-NN more likely to overfit when k is large or small?

It is more likely to overfit with small k, because the prediction then depends strongly on individual points of the training set: the predictor assigns to a new point the (majority) label of the few training points nearest to it.
8. Write a short pseudo-code for building a tree classifier based on a training set S.

(assume that the built tree is binary)

• input: training set S; tree T formed by a single root leaf ℓ

• initialization:

  ◦ S_ℓ = S

  ◦ set y_ℓ = +1 if N_ℓ⁺ ≥ N_ℓ⁻, else y_ℓ = −1 (where N_ℓ⁺ and N_ℓ⁻ are the numbers of positive and negative examples in S_ℓ)

• while !(stopping criterion):

  ◦ pick a leaf ℓ to split, obtaining a new internal node v and leaves ℓ′ and ℓ″

  ◦ pick an attribute i

  ◦ pick a test f : X_i → {1, 2}

  ◦ associate f with node v and partition S_ℓ into:

    ▪ S_ℓ′ = {(x_t, y_t) ∈ S_ℓ | f(x_{t,i}) = 1}, and set y_ℓ′ by majority vote on S_ℓ′

    ▪ S_ℓ″ = {(x_t, y_t) ∈ S_ℓ | f(x_{t,i}) = 2}, and set y_ℓ″ by majority vote on S_ℓ″

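The split-selection step of the pseudo-code above can be illustrated with a small Python sketch (my own illustration, not the course's reference implementation; it assumes numerical features and binary labels y ∈ {0, 1}, and uses the Gini criterion from question 10):

```python
import numpy as np

def gini(a):
    """Gini splitting criterion psi(a) = 2a(1 - a), where a is the fraction of positives."""
    return 2 * a * (1 - a)

def best_split(X, y):
    """Scan every attribute i and threshold, and return the test minimizing the
    size-weighted impurity of the two leaves it would create."""
    m, d = X.shape
    best = (None, None, np.inf)          # (attribute index, threshold, weighted impurity)
    for i in range(d):
        for thr in np.unique(X[:, i]):
            left = X[:, i] <= thr        # test outcome f(x_{t,i}) = 1
            right = ~left                # test outcome f(x_{t,i}) = 2
            if left.all() or right.all():
                continue                 # skip degenerate splits
            imp = (left.sum() * gini(y[left].mean()) +
                   right.sum() * gini(y[right].mean())) / m
            if imp < best[2]:
                best = (i, thr, imp)
    return best
```

Each of the two resulting leaves is then labelled by majority vote, exactly as in the initialization step.
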
9. What is the property of a splitting criterion ψ ensuring that the training error of a tree classifier does not increase after a split?

ψ must be concave, so that Jensen's inequality holds. Formally speaking, ψ must be such that ψ(αa + (1 − α)b) ≥ αψ(a) + (1 − α)ψ(b) for all a, b ∈ [0, 1] and α ∈ [0, 1].
10. Write the formulas for at least two splitting criteria ψ used in practice to build tree classifiers.

• scaled entropy: ψ(a) = −(a/2) log₂(a) − ((1 − a)/2) log₂(1 − a)

• Gini function: ψ(a) = 2a(1 − a)

• ψ(a) = √(a(1 − a))
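For reference, the three criteria as Python functions (a small sketch; the clipping constant used to avoid log(0) at the endpoints is an arbitrary choice of mine):

```python
import numpy as np

def scaled_entropy(a):
    """Scaled entropy: -(a/2)log2(a) - ((1-a)/2)log2(1-a)."""
    a = np.clip(a, 1e-12, 1 - 1e-12)
    return -(a / 2) * np.log2(a) - ((1 - a) / 2) * np.log2(1 - a)

def gini(a):
    """Gini function: 2a(1-a)."""
    return 2 * a * (1 - a)

def sqrt_criterion(a):
    """sqrt(a(1-a)) criterion."""
    return np.sqrt(a * (1 - a))
```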

11. Write the formula for the statistical risk of a predictor h with respect to a generic loss function and data distribution.

ℓ_D(h) = E[ℓ(h(X), Y)]

where ℓ is a loss function and (X, Y) ∼ D, with D the data distribution.

12. Write the formula for the Bayes optimal predictor for a generic loss function and data distribution.

f*(x) = argmin_{ŷ∈Y} E[ℓ(Y, ŷ) | X = x]

where ℓ is a loss function and (X, Y) ∼ D, with D the data distribution.
13. Write the formula for Bayes optimal predictor and Bayes risk for the zero-one loss.

Let η(x) := P(Y = +1 | X = x).

• Bayes optimal predictor

f*(x) = +1 if η(x) ≥ 1/2, and −1 if η(x) < 1/2

• Bayes risk

ℓ_D(f*) = E[min{η(X), 1 − η(X)}]

14. Can the Bayes risk for the zero-one loss be zero? If yes, then explain how.

Yes: it happens when the distribution D is degenerate in the sense that the label is (almost surely) a deterministic function of the datapoint, i.e. η(X) ∈ {0, 1} with probability 1. In that case min{η(X), 1 − η(X)} = 0, so ℓ_D(f*) = E[min{η(X), 1 − η(X)}] = 0.
15. Write the formula for Bayes optimal predictor and Bayes risk for the square loss.

• Bayes optimal predictor

f*(x) = E[Y | X = x]

• Bayes risk

ℓ_D(f*) = E[Var[Y | X]]

16. Explain in mathematical terms the relationship between test error and statistical risk.

With probability at least 1 − δ with respect to the random draw of the test set S′:

|ℓ_D(h) − ℓ_{S′}(h)| ≤ √((1/(2n)) ln(2/δ))

equivalently, P(|ℓ_D(h) − ℓ_{S′}(h)| > ε) ≤ 2e^{−2ε²n}, where n is the size of the test set S′ obtained through independent random draws from the distribution D.

The discrepancy between the (ideal) statistical risk and the test error measured in practice shrinks as the number of independent draws from D grows, and for n → ∞ the two coincide. In other words, the test error is a good proxy for the statistical risk.


17. State the Chernoff-Hoeffding bounds.

Let Z_1, Z_2, …, Z_n be i.i.d. random variables such that Z_i ∈ [0, 1] and E[Z_i] = μ for all i ∈ {1, …, n}. Then, for all ε > 0:

P((1/n) Σ_{t=1}^n Z_t > μ + ε) ≤ e^{−2ε²n}

P((1/n) Σ_{t=1}^n Z_t < μ − ε) ≤ e^{−2ε²n}
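A quick Monte Carlo sanity check of the first inequality (my own example, with Bernoulli(1/2) variables and arbitrarily chosen n = 100, ε = 0.1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, mu, trials = 100, 0.1, 0.5, 100_000

# empirical frequency of the event {sample mean > mu + eps}
means = rng.binomial(1, mu, size=(trials, n)).mean(axis=1)
empirical = (means > mu + eps).mean()

# Chernoff-Hoeffding upper bound e^{-2 eps^2 n}
bound = np.exp(-2 * eps**2 * n)
print(f"empirical: {empirical:.4f}  <=  bound: {bound:.4f}")
```

The empirical frequency (roughly 0.02) stays well below the bound e^{−2} ≈ 0.135, as expected, since the bound holds for every distribution on [0, 1].
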
18. Write the bias-variance decomposition for a generic learning algorithm A and associate the resulting components to overfitting and underfitting.

Let ℓ_D(h_S) be the statistical risk of the predictor produced by algorithm A on training set S, and let h* be the best predictor that A can output for the distribution (D, ℓ). Then:

ℓ_D(h_S) = [ℓ_D(h_S) − ℓ_D(h*)]  (variance / estimation error)
         + [ℓ_D(h*) − ℓ_D(f*)]   (bias / approximation error)
         + ℓ_D(f*)               (Bayes error)

Underfitting occurs when the approximation error is large. This is due to the fact that the algorithm A cannot find a suitable predictor h_S because the class of predictors H_A is too small.

Overfitting occurs when the estimation error is large, because h_S does not describe the distribution at hand well. This is an indicator that the class H_A may be too big, so the algorithm adapts too much to the specific data points, since it is not possible to measure the quality of the prediction on every possible dataset.

19. Write the upper bound on the estimation error of ERM run on a finite class H of predictors.

ℓ_D(h_S) − ℓ_D(h*) ≤ √((2/m) ln(2|H|/δ))

with probability at least 1 − δ with respect to the independent random draws of the training set S of size m.
20. Write the upper bound on the estimation error of ERM run on the complete binary tree predictors with at most N nodes and d binary features.

ℓ_D(h_S) − ℓ_D(h*) ≤ √((2/m) (ln((1 − (2ed)^{N+1}) / (1 − 2ed)) + ln(2/δ)))

with probability at least 1 − δ with respect to the random independent draws of the training set S of size m.


21. Write the bound on the difference between risk and training error for an arbitrary complete binary tree classifier h on d binary features in terms of its number N_h of nodes. Bonus points if you provide a short explanation on how this bound is obtained.

Let w : Y^X → [0, 1] be a function assigning a weight to each predictor, with Σ_{h∈H} w(h) ≤ 1 (a predictor is weighed less if it is more complex). Choosing w(h) = 2^{−|σ(h)|}, where σ is an instantaneous code such that |σ(h)| = O(N_h log d), gives

ℓ_D(h) ≤ ℓ_S(h) + √((2/m) (O(N_h log d) + ln(2/δ)))

with probability at least 1 − δ with respect to the random draw of a training set S of size m.

Sketch of the derivation: set ε_h = √((2/m)(ln(1/w(h)) + ln(2/δ))). Then

P(∃h ∈ H : |ℓ_D(h) − ℓ_S(h)| > ε_h)
  ≤ Σ_{h∈H} P(|ℓ_D(h) − ℓ_S(h)| > ε_h)   (union bound)
  ≤ Σ_{h∈H} 2e^{−2ε_h²m}                  (Chernoff-Hoeffding)
  ≤ Σ_{h∈H} w(h) δ ≤ δ                    (by the choice of ε_h and the definition of w)

It follows that, with probability at least 1 − δ, every h ∈ H satisfies

ℓ_D(h) ≤ ℓ_S(h) + √((2/m) (ln(1/w(h)) + ln(2/δ)))

It is possible to encode each predictor h with an instantaneous code σ : H → {0, 1}* such that |σ(h)| = O(N_h log d). By the Kraft inequality, Σ_{h∈H} 2^{−|σ(h)|} ≤ 1, so setting w(h) = 2^{−|σ(h)|} the bound becomes the one stated above.

22. Write the formula for the K-fold cross validation estimate. Explain the main quantities occurring in the formula.

ℓ_S^CV(A) = (1/K) Σ_{i=1}^K ℓ_{S_i}(h_i)

with ℓ_{S_i}(h_i) = (K/m) Σ_{(x,y)∈S_i} ℓ(y, h_i(x))

• S_i is the testing part of the i-th fold

• h_i = A(S_{−i}) is the predictor output by the algorithm A on input S_{−i} ≡ S ∖ S_i, the training part of the i-th fold

This quantity estimates E[ℓ_D(A(S))], in other words the quality of the predictor produced by A on a generic training set S.
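A minimal sketch of the K-fold CV estimate in Python (my own illustration; `algo` stands for any learning algorithm mapping a training set to a predictor and `loss` for any of the loss functions above, both names being assumptions, and X, y are NumPy arrays):

```python
import numpy as np

def cv_estimate(algo, loss, X, y, K=5, seed=0):
    """K-fold cross validation estimate of E[l_D(A(S))]."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    errors = []
    for i in range(K):
        test = folds[i]                                                  # S_i
        train = np.concatenate([folds[j] for j in range(K) if j != i])   # S_{-i}
        h_i = algo(X[train], y[train])                                   # h_i = A(S_{-i})
        errors.append(np.mean([loss(yt, h_i(xt)) for xt, yt in zip(X[test], y[test])]))
    return np.mean(errors)                                               # (1/K) sum_i l_{S_i}(h_i)
```
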
23. Write the pseudo-code for computing the nested cross validation estimate.

• input: dataset S

• split S into K folds S_1, …, S_K

• for i = 1, …, K do

  ◦ S_{−i} ≡ S ∖ S_i

  ◦ for each θ ∈ Θ_0: run CV on S_{−i} using A_θ

  ◦ θ_i = argmin_{θ∈Θ_0} ℓ^CV_{S_{−i}}(A_θ)

  ◦ h_i = A_{θ_i}(S_{−i})

• output: (1/K) Σ_{i=1}^K ℓ_{S_i}(h_i)

24. Write the mathematical definition of consistency for an algorithm A.

A is consistent with respect to a loss ℓ and for each distribution D if

lim_{m→+∞} E[ℓ_D(A(S_m))] = ℓ_D(f*)

where S_m is a training set of size m drawn i.i.d. from D.

25. Write the statement of the no-free-lunch theorem.

For every sequence a_1, a_2, … of positive reals such that lim_{i→+∞} a_i = 0 and 1/16 ≥ a_1 ≥ a_2 ≥ …, and for every binary classification algorithm A with zero-one loss, there exists a distribution D such that ℓ_D(f*) = 0 and E[ℓ_D(A(S_m))] ≥ a_m for every m ≥ 1.
26. Write the mathematical definition of nonparametric learning algorithm. Define the main quantities occurring in the formula.

An algorithm A is nonparametric if lim_{m→∞} min_{h∈H_m} ℓ_D(h) = ℓ_D(f*), where

• H_m = {h | ∃S_m : h = A(S_m)} is the set of predictors the algorithm can output on training sets of size m

• ℓ_D(h) = E[ℓ(Y, h(X))] is the statistical risk

• f* is the Bayes optimal predictor, defined as f*(x) = argmin_{ŷ∈Y} E[ℓ(Y, ŷ) | X = x]

27. Name one nonparametric learning algorithm and one parametric learning algorithm.

• nonparametric: k-NN

• parametric: linear classification or linear regression
28. Write the mathematical conditions on k ensuring consistency for the k-NN algorithm.

Let k_m denote the number of neighbours used by k-NN as a function of the training set size m. To ensure consistency, k_m must satisfy:

• lim_{m→+∞} k_m = +∞ (no overfitting)

• k_m = o(m) (no underfitting)
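For instance (an example of mine, not from the notes), the choice k_m = ⌈√m⌉ satisfies both conditions: √m → +∞ while √m / m = 1/√m → 0.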

29. Write the formula for the Lipschitz condition in a binary classification problem. Define the main quantities occurring in the formula.

The Lipschitz condition holds for a binary classification problem with data distribution D and η(x) = P(Y = +1 | X = x) if there exists 0 < c < ∞ such that

|η(x) − η(x′)| ≤ c ||x − x′||   for all x, x′ ∈ X

in other words, η is c-Lipschitz.


30. Write the rate at which the risk of a consistent learning algorithm for binary classification vanishes as a function of the training set size m and the dimension d under Lipschitz assumptions.

The typical convergence rate is of order m^(−1/(d+1)).

31. Explain the curse of dimensionality.

Since the convergence rate of a consistent algorithm for binary classification under Lipschitz assumptions is of order m^(−1/(d+1)), requiring m^(−1/(d+1)) ≤ ε gives m ≥ ε^(−(d+1)): the training set size m must grow exponentially in the dimension d of the data domain. This is called the curse of dimensionality, because it makes nonparametric learning in high-dimensional spaces difficult: to be consistent, the algorithm must be able to approximate the Bayes optimal predictor along all possible dimensions/directions of the domain.
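To get a feel for the numbers (my own illustration): with ε = 0.1, dimension d = 10 requires m ≥ 0.1^(−11) = 10^11 examples, whereas d = 2 only requires m ≥ 10^3.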
32. Write the bound on the risk of the 1-NN binary classifier under Lipschitz assumptions.

E[ℓ_D(A(S))] ≤ 2ℓ_D(f*) + 4c√d · m^(−1/(d+1))


33. Can the ERM over linear classifiers be computed efficiently? Can it be approximated efficiently? Motivate your answers.

Given a training set S = {(x_1, y_1), …, (x_m, y_m)} ⊆ R^d × {−1, 1}, the ERM algorithm for the zero-one loss outputs

h_S = argmin_{h∈H_d} (1/m) Σ_{t=1}^m I{h(x_t) ≠ y_t} = argmin_{w∈R^d: ||w||=1} (1/m) Σ_{t=1}^m I{y_t w^T x_t ≤ 0}

with H_d = {h(x) = sgn(w^T x) | w ∈ R^d : ||w|| = 1}.

ERM cannot be computed efficiently because even the much simpler associated decision problem, MinDisagreement, is NP-complete. MinDisagreement is defined as follows:

instance: (x_1, y_1), …, (x_m, y_m) ∈ {0, 1}^d × {−1, 1}, k ∈ N

question: does there exist w ∈ R^d such that y_t w^T x_t ≤ 0 for at most k indices t ∈ {1, …, m}?

For ERM this amounts to asking whether there exists a predictor with ℓ_S(h) ≤ k/m. It is provable that MinDisagreement is NP-complete in the length of the instance description, which is O(md).

ERM cannot be approximated efficiently either, because it is equivalent to the optimization problem MinDisOpt, defined as

instance: (x_1, y_1), …, (x_m, y_m) ∈ {0, 1}^d × {−1, 1}

solution: w ∈ R^d that minimizes the number of indices t ∈ {1, …, m} such that y_t w^T x_t ≤ 0

For ERM, min_{h∈H_d} ℓ_S(h) = Opt(S)/m, where Opt(S) denotes the minimal number of misclassified examples. It is provable that, if P ≠ NP, then for every constant c > 0 there is no algorithm running in time polynomial in the length of the instance description that solves MinDisOpt on every S with ℓ_S(h) ≤ c·Opt(S)/m (i.e., at most c·Opt(S) misclassified examples).

34. Write the system of linear inequalities stating the condition of linear separability for a training set in binary classification.

A training set S = {(x_1, y_1), …, (x_m, y_m)} ⊆ R^d × {−1, 1} is linearly separable if there exists w ∈ R^d with ||w|| = 1 such that

y_t w^T x_t > 0   for all t ∈ {1, …, m}

35. Write the pseudo-code for the Perceptron algorithm.

• input: S = {(x_1, y_1), …, (x_m, y_m)}

• initialization: w = (0, …, 0)

• while true do

  ◦ for t = 1, …, m do

    ▪ if y_t w^T x_t ≤ 0 then

      • w ← w + y_t x_t

  ◦ if no update was made in the last epoch, break

• output: w
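A runnable sketch of the pseudo-code above (my own implementation; it assumes NumPy arrays X, y with labels in {−1, +1}, and the max_epochs safeguard is mine, since on non-separable data the algorithm would never stop):

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Perceptron: cycle over the data, update w on every mistake,
    stop after a full epoch with no mistakes (or after max_epochs)."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for t in range(len(y)):
            if y[t] * (w @ X[t]) <= 0:   # mistake: y_t w^T x_t <= 0
                w += y[t] * X[t]         # update: w <- w + y_t x_t
                mistakes += 1
        if mistakes == 0:                # no mistakes in the last epoch
            break
    return w
```
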
36. Write the statement of the Perceptron convergence theorem.

Given a linearly separable training set S = {(x_1, y_1), …, (x_m, y_m)}, the Perceptron algorithm terminates after at most

(min_{u: γ(u)≥1} ||u||²) (max_{t∈{1,…,m}} ||x_t||²)

updates, where γ(u) = min_t y_t u^T x_t is the margin of u. (Both quantities depend only on the training set S.)
37. Write the closed-form formula (i.e., not the argmin definition) for the Ridge Regression predictor. Define the main quantities occurring in the formula.

w_{S,α} = (S^T S + αI)^{−1} S^T y

where:

• S ∈ R^{m×d} is the design matrix, whose t-th row is x_t^T, so it contains all the datapoints of the training set

• α is the regularization parameter, which determines how stable the predictor is with respect to perturbations of the dataset: with small α the estimation error is large and there can be overfitting, while growing α causes the approximation error to grow and the estimation error to shrink

• y^T = (y_1, y_2, …, y_m) is the vector of the labels of the training points
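The closed form translates directly into code (a sketch; using np.linalg.solve instead of an explicit matrix inverse is a numerical choice of mine, not part of the formula):

```python
import numpy as np

def ridge_regression(S, y, alpha):
    """Closed-form Ridge Regression: w = (S^T S + alpha I)^{-1} S^T y."""
    d = S.shape[1]
    return np.linalg.solve(S.T @ S + alpha * np.eye(d), S.T @ y)

# usage: predictions on new points are X_new @ w
```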


38. Write the pseudo-code for the projected online gradient descent algorithm.

• parameters: η > 0, U > 0

• initialization: w_1 = 0

• for t = 1, 2, … do

  ◦ w′_{t+1} = w_t − η_t ∇ℓ_t(w_t)

  ◦ w_{t+1} = argmin_{w: ||w||≤U} ||w − w′_{t+1}||

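A minimal sketch in Python (my own illustration; it assumes a function grad(t, w) returning ∇ℓ_t(w), and uses the decreasing step size η_t = η/√t, a common choice that the pseudo-code leaves unspecified):

```python
import numpy as np

def projected_ogd(grad, d, T, eta=1.0, U=1.0):
    """Projected online gradient descent on the ball {w : ||w|| <= U}."""
    w = np.zeros(d)
    iterates = []
    for t in range(1, T + 1):
        iterates.append(w.copy())
        w = w - (eta / np.sqrt(t)) * grad(t, w)   # gradient step with eta_t = eta / sqrt(t)
        norm = np.linalg.norm(w)
        if norm > U:                              # Euclidean projection onto the ball
            w *= U / norm
    return iterates
```
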
39. Write the upper bound on the regret of projected online gradient descent on convex functions. Define the main quantities occurring in the bound.

(1/T) Σ_{t=1}^T ℓ_t(w_t) − min_{u: ||u||≤U} (1/T) Σ_{t=1}^T ℓ_t(u) ≤ U G √(8/T)

where:

• T is the time horizon over which the regret is measured

• ℓ_t is a convex and differentiable loss function (e.g. the square loss)

• U is the radius of the ball containing all the vectors in the class of predictors considered

• G := max_{t=1,2,…} ||∇ℓ_t(w_t)|| is the maximum norm of the gradients observed over time


40. Write the upper bound on the regret of online gradient descent on σ-strongly convex functions. Define the main quantities occurring in the bound.

(1/T) Σ_{t=1}^T ℓ_t(w_t) − min_{u: ||u||≤U} (1/T) Σ_{t=1}^T ℓ_t(u) ≤ (G²/(2σT)) (ln T + 1)

where:

• T is the time horizon over which the regret is measured

• ℓ_t is a σ-strongly convex and differentiable loss function

• U is the radius of the ball containing all the vectors in the class of predictors considered

• G := max_{t=1,2,…} ||∇ℓ_t(w_t)|| is the maximum norm of the gradients observed over time
41. Write the formula for the hinge loss.

h_t(w) = [1 − y_t w^T x_t]_+
42. Write the mistake bound for the Perceptron run on an arbitrary data stream for binary classification. Define the main quantities occurring in the bound.

For every u ∈ R^d:

M_T ≤ Σ_{t=1}^T h_t(u) + (||u|| X)² + ||u|| X √(Σ_{t=1}^T h_t(u))

where:

• T is the time horizon considered, and M_T is the number of mistakes made by the Perceptron in the first T steps

• h_t is the hinge loss, h_t(u) = [1 − y_t u^T x_t]_+, where (x_t, y_t) is the t-th element of the stream

• X = max_{t=1,2,…} ||x_t|| is the maximum norm of a datapoint
43. Write the formula for the polynomial kernel of degree n.

For x, x′ ∈ R^d, K : R^d × R^d → R,

K(x, x′) = (1 + x^T x′)^n

44. Write the formula for the Gaussian kernel with parameter γ.

For γ > 0 and x, x′ ∈ R^d, K_γ : R^d × R^d → R,

K_γ(x, x′) = exp(−(1/(2γ)) ||x − x′||²)
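Both kernels as Python functions (a small sketch; the function names are mine):

```python
import numpy as np

def polynomial_kernel(x, xp, n):
    """Polynomial kernel of degree n: (1 + x^T x')^n."""
    return (1.0 + x @ xp) ** n

def gaussian_kernel(x, xp, gamma):
    """Gaussian kernel: exp(-||x - x'||^2 / (2 gamma))."""
    diff = x - xp
    return np.exp(-(diff @ diff) / (2.0 * gamma))
```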


45. Write the pseudo-code for the kernel Perceptron algorithm.

• initialization: S = ∅

• for t = 1, 2, … do

  ◦ ŷ_t = sgn(Σ_{s∈S} y_s K(x_s, x_t))

  ◦ if ŷ_t ≠ y_t then S ← S ∪ {t}
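A runnable sketch of the kernel Perceptron over a finite stream (my own illustration; it assumes labels in {−1, +1} and any kernel function of two points, e.g. the Gaussian kernel defined above; the epochs parameter is mine):

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=1):
    """Kernel Perceptron: store the indices of the mistaken examples and
    predict with the sign of the kernel expansion over that set."""
    support = []                                   # the set S of mistake indices
    for _ in range(epochs):
        for t in range(len(y)):
            score = sum(y[s] * kernel(X[s], X[t]) for s in support)
            if np.sign(score) * y[t] <= 0:         # prediction sgn(score) differs from y_t
                support.append(t)
    return support

# usage: support = kernel_perceptron(X, y, lambda a, b: gaussian_kernel(a, b, 1.0))
```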
46. Write the mathematical definition of the linear space H_K of functions induced by a kernel K.

H_K = { Σ_{i=1}^N α_i K(x_i, ·) | α_1, …, α_N ∈ R, x_1, …, x_N ∈ X, N ∈ N }

47. Let f be an element of the linear space H_K induced by a kernel K. Write f(x) in terms of K.

f(x) = Σ_{i=1}^N α_i K(x_i, x)   for some N ∈ N, α_1, …, α_N ∈ R and x_1, …, x_N ∈ X.
