Statistical Methods for ML
Exam questions
1. Write the formulas for the square loss, the zero-one loss, and the
logarithmic loss
• square loss: l(y, ŷ) = (y − ŷ)²
• zero-one loss: l(y, ŷ) = 1 if ŷ ≠ y, 0 if ŷ = y
• logarithmic loss: l(y, ŷ) = log(1/ŷ) if y = 1, log(1/(1 − ŷ)) if y = 0
Training error of a predictor h on a sample S of size m:
lS(h) = (1/m) ∑_{t=1}^m l(yt, h(xt))
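As a concrete (unofficial) illustration, here is a minimal Python sketch of the three losses and of the training error lS(h); the toy predictor and data are made up for the example:

```python
import numpy as np

def square_loss(y, y_hat):
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    return 1.0 if y != y_hat else 0.0

def log_loss(y, y_hat):
    # y in {0, 1}, y_hat in (0, 1) interpreted as the predicted P(Y = 1)
    return -np.log(y_hat) if y == 1 else -np.log(1.0 - y_hat)

def training_error(loss, h, xs, ys):
    # lS(h) = (1/m) * sum_t loss(y_t, h(x_t))
    return np.mean([loss(y, h(x)) for x, y in zip(xs, ys)])

# toy usage: a constant predictor on three labelled points
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([0.0, 1.0, 1.0])
h = lambda x: 1.0
print(training_error(square_loss, h, xs, ys))  # 1/3
```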
4. Write the mathematical formula defining the ERM algorithm over a class H of predictors. Define the main quantities occurring in the formula.
The ERM learning algorithm outputs the predictor ĥ ∈ arg min_{h∈H} lS(h), where lS(h) is the training error of h on the training set S and H is the class of predictors.
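A hedged sketch of ERM over a small finite class (threshold classifiers on the real line); the class H and the sample below are invented for the example:

```python
import numpy as np

# finite class H: threshold classifiers x -> sign(x - theta) for a few thresholds
H = [(theta, (lambda x, th=theta: 1 if x >= th else -1)) for theta in (-1.0, 0.0, 1.0, 2.0)]

xs = np.array([-2.0, -0.5, 0.5, 1.5, 2.5])
ys = np.array([-1, -1, 1, 1, 1])

def training_error(h, xs, ys):
    # zero-one training error lS(h)
    return np.mean([1.0 if h(x) != y else 0.0 for x, y in zip(xs, ys)])

# ERM: pick any h in H minimizing the training error
theta_hat, h_hat = min(H, key=lambda pair: training_error(pair[1], xs, ys))
print(theta_hat, training_error(h_hat, xs, ys))
```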
• Initialization:
  ◦ Sℓ = S
  ◦ set yℓ = +1 if Nℓ+ ≥ Nℓ−, −1 otherwise
• while !(stopping criterion):
11. Write the formula for the statistical risk of a predictor h with respect to a generic loss function and data distribution.
lD(h) = E[l(Y, h(X))], where (X, Y) is drawn from the data distribution D.
12. Write the formula for the Bayes optimal predictor for a generic
loss function and data distribution.
f*(x) = arg min_{ŷ∈Y} E[l(Y, ŷ) ∣ X = x]
f*(x) = −1 if η(x) < 1/2, +1 if η(x) ≥ 1/2
• Bayes risk
lD(f*) = E[min{η(X), 1 − η(X)}]
14. Can the Bayes risk for the zero-one loss be zero? If yes, then explain how.
Yes: since lD(f*) = E[min{η(X), 1 − η(X)}], the Bayes risk is zero whenever η(x) ∈ {0, 1} for almost every x, i.e. when the label is a deterministic function of the instance.
For the square loss:
f*(x) = E[Y ∣ X = x]
• Bayes risk
lD(f*) = E[Var[Y ∣ X]]
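A small simulation, with a toy η of my choosing, checking the zero-one-loss formulas above: predicting the sign of η(x) − 1/2 attains a risk close to E[min{η(X), 1 − η(X)}]:

```python
import numpy as np

rng = np.random.default_rng(0)
eta = lambda x: 1.0 / (1.0 + np.exp(-3.0 * x))   # toy conditional probability P(Y=+1 | X=x)

# draw (X, Y) from the toy distribution: X uniform on [-1, 1], Y = +1 with probability eta(X)
X = rng.uniform(-1.0, 1.0, size=200_000)
Y = np.where(rng.uniform(size=X.size) < eta(X), 1, -1)

f_star = np.where(eta(X) >= 0.5, 1, -1)                  # Bayes optimal predictions
risk_of_f_star = np.mean(f_star != Y)                    # zero-one risk of f*
bayes_risk = np.mean(np.minimum(eta(X), 1.0 - eta(X)))   # E[min{eta(X), 1 - eta(X)}]
print(risk_of_f_star, bayes_risk)                        # the two numbers should be close
```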
With probability at least 1 − δ,
∣lD(h) − lS′(h)∣ ≤ √( (1/(2n)) ln(2/δ) )
which follows from P(∣lD(h) − lS′(h)∣ > ϵ) ≤ 2e^(−2ϵ²n), where n is the size of the test set S′.
P( (1/n) ∑_{t=1}^n Zt > μ + ϵ ) ≤ e^(−2ϵ²n)
∧
P( (1/n) ∑_{t=1}^n Zt < μ − ϵ ) ≤ e^(−2ϵ²n)
where Z1, …, Zn are independent random variables taking values in [0, 1] with expectation μ.
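A quick numeric sketch of the resulting test-set confidence interval ϵ = √(ln(2/δ)/(2n)); the values of n and δ below are arbitrary:

```python
import math

def test_error_margin(n, delta):
    # with probability >= 1 - delta, |lD(h) - lS'(h)| <= sqrt(ln(2/delta) / (2n))
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

print(test_error_margin(n=1000, delta=0.05))  # about 0.043
```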
18. Write the bias-variance decomposition for a generic learning algorithm A and associate the resulting components to overfitting and underfitting.
Let lD(hS) be the statistical risk of the predictor produced by algorithm A on training set S, and let h* be the best predictor that A can output for the distribution (D, l). Then
lD(hS) = (lD(hS) − lD(h*)) + (lD(h*) − lD(f*)) + lD(f*)
where lD(hS) − lD(h*) is the variance / estimation error (large when the algorithm overfits), lD(h*) − lD(f*) is the bias / approximation error (large when the algorithm underfits), and lD(f*) is the Bayes risk.
19. Write the upper bound on the estimation error of ERM run on a finite class H of predictors.
With probability at least 1 − δ,
lD(hS) − lD(h*) ≤ √( (2/m) ( ln( (1 − (2ed)^(N+1)) / (1 − 2ed) ) + ln(2/δ) ) )
lD(h) ≤ lS(h) + √( (2/m) ( O(Nh log d) + ln(2/δ) ) )
For weights w(h) ≥ 0 such that ∑_{h∈H} w(h) ≤ 1, with probability at least 1 − δ, for every h ∈ H
lD(h) ≤ lS(h) + √( (2/m) ln( 2/(w(h)δ) ) )
Choosing w(h) = 2^(−∣σ(h)∣), where σ(h) is a binary encoding of h, the bound becomes
lD(h) ≤ lS(h) + √( (2/m) ( ∣σ(h)∣ ln 2 + ln(2/δ) ) )
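A numeric illustration of the shape of these bounds, with ln∣H∣ standing in for the ln(…) / O(Nh log d) term; the values of m, ln∣H∣ and δ are arbitrary:

```python
import math

def risk_bound_term(m, ln_card_H, delta):
    # sqrt((2/m) * (ln|H| + ln(2/delta))), the form of the bounds above
    return math.sqrt((2.0 / m) * (ln_card_H + math.log(2.0 / delta)))

# e.g. m = 10_000 examples, a class with ln|H| = 50, confidence 1 - delta = 0.95
print(risk_bound_term(10_000, 50.0, 0.05))  # about 0.10
```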
The K-fold cross-validation estimate is
lS^CV(A) = (1/K) ∑_{i=1}^K lSi(hi)
with lSi(hi) = (K/m) ∑_{(x,y)∈Si} l(y, hi(x))
where hi is the predictor obtained by running A on S ∖ Si.
• input: dataset S, set Θ0 of hyperparameter values
• split S into K folds S1, …, SK
• for i = 1 … K do
  ◦ S−i ≡ S ∖ Si
  ◦ for each θ ∈ Θ0
    ▪ run CV on S−i to compute lS−i^CV(Aθ)
  ◦ θi = arg min_{θ∈Θ0} lS−i^CV(Aθ)
  ◦ hi = Aθi(S−i)
• output: (1/K) ∑_{i=1}^K lSi(hi)
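A compact Python sketch of the K-fold CV estimate and of the outer loop above; the learning-algorithm interface (a function mapping a training set to a predictor), the 1-NN placeholder and the dataset are assumptions of the example, not part of the notes:

```python
import numpy as np

def kfold_cv_error(algo, X, y, K, loss):
    # lS^CV(A) = (1/K) * sum_i lSi(h_i), with h_i trained on S \ S_i
    folds = np.array_split(np.arange(len(y)), K)
    errs = []
    for i in range(K):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(K) if j != i])
        h = algo(X[train], y[train])
        errs.append(np.mean([loss(yt, h(xt)) for xt, yt in zip(X[test], y[test])]))
    return float(np.mean(errs))

def nested_cv(algo_family, thetas, X, y, K, loss):
    # outer loop of the pseudo-code: tune theta on S_{-i} by CV, then test on S_i
    folds = np.array_split(np.arange(len(y)), K)
    outer_errs = []
    for i in range(K):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(K) if j != i])
        best_theta = min(thetas, key=lambda th: kfold_cv_error(algo_family(th), X[train], y[train], K, loss))
        h = algo_family(best_theta)(X[train], y[train])
        outer_errs.append(np.mean([loss(yt, h(xt)) for xt, yt in zip(X[test], y[test])]))
    return float(np.mean(outer_errs))

# tiny usage: a 1-NN "algorithm family" whose hyperparameter is a dummy placeholder
def one_nn(Xtr, ytr):
    return lambda x: ytr[np.argmin(np.linalg.norm(Xtr - x, axis=1))]

zero_one = lambda y, p: float(p != y)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2)); y = np.where(X[:, 0] > 0, 1, -1)
print(kfold_cv_error(one_nn, X, y, K=5, loss=zero_one))
print(nested_cv(lambda theta: one_nn, [None], X, y, K=5, loss=zero_one))
```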
For every sequence of numbers such that 1/16 ≥ a1 ≥ a2 ≥ … and for all binary classification algorithms A, there exists a data distribution D such that lD(f*) = 0 and E[lD(A(Sm))] ≥ am for every training set size m, where Sm is a training set of m examples drawn i.i.d. from D.
For the km-NN algorithm, where the parameter km depends on the size of the training set m, to ensure consistency km must be such that:
• km → ∞ (no overfitting)
• km = o(m) (no underfitting)
29. Write the formula for the Lipschitz condition in a binary
classification problem. Define the main quantities occurring in the
formula.
The Lipschitz condition holds for a binary classification problem with a data distribution D such that η(x) = P(Y = +1 ∣ X = x) if
∃ 0 < c < ∞ : ∣η(x) − η(x′)∣ ≤ c ∣∣x − x′∣∣ for all x, x′ ∈ X
E[lD(A(S))] ≤ 2 lD(f*) + 4c√d · m^(−1/(d+1))
To make the rate m^(−1/(d+1)) ≤ ϵ it must hold that m ≥ ϵ^(−(d+1)), which means that m grows exponentially with the dimension d.
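For example, with d = 10 the condition m ≥ ϵ^(−(d+1)) means that guaranteeing ϵ = 0.1 already requires m ≥ 0.1^(−11) = 10^11 training examples.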
hS = arg min_{h∈Hd} (1/m) ∑_{t=1}^m I{h(xt) ≠ yt} = arg min_{w∈R^d} (1/m) ∑_{t=1}^m I{yt w^T xt ≤ 0}
In the case of the ERM algorithm this problem is equivalent to asking whether it is possible to find a predictor h such that lS(h) ≤ k/m.
• input: training set (x1, y1), …, (xm, ym) ∈ R^d × {−1, 1}
• initialization: w = (0, …, 0) ∈ R^d
• repeat
  ◦ for t = 1, …, m do
    ▪ if yt w^T xt ≤ 0 then
      • w ← w + yt xt
  ◦ if no update was made in the last pass, break
• output: w
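A runnable Python version of the Perceptron pseudo-code above; the linearly separable toy dataset (with a constant feature acting as bias) is my own:

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    # X: m x d matrix of examples, y: labels in {-1, +1}
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        updated = False
        for xt, yt in zip(X, y):
            if yt * np.dot(w, xt) <= 0:   # mistake (or zero margin): update
                w = w + yt * xt
                updated = True
        if not updated:                    # a full pass with no updates: converged
            break
    return w

X = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(w, np.sign(X @ w))  # all signs should match y
```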
36. Write the statement of the Perceptron convergence theorem.
Given a linearly separable training set S = {(x1, y1), …, (xm, ym)}, the Perceptron algorithm terminates after a number of updates at most
( min_{u: γ(u)≥1} ∣∣u∣∣² ) ( max_{t∈{1,…,m}} ∣∣xt∣∣² )
where γ(u) = min_t yt u^T xt is the margin of u on S.
wS,α = (S^T S + αI)^(−1) S^T y
where:
• S is the design matrix, S ∈ R^(m×d), whose rows are x1^T, x2^T, …, xm^T
• y = (y1, …, ym) is the vector of training labels
• α > 0 is the regularization parameter
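A direct translation of the closed form into numpy; the design matrix and labels below are random placeholders:

```python
import numpy as np

def ridge_regression(S, y, alpha):
    # w = (S^T S + alpha I)^{-1} S^T y, computed via a linear solve for stability
    d = S.shape[1]
    return np.linalg.solve(S.T @ S + alpha * np.eye(d), S.T @ y)

rng = np.random.default_rng(0)
S = rng.normal(size=(100, 5))              # design matrix: one row x_t^T per example
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = S @ w_true + 0.1 * rng.normal(size=100)
print(ridge_regression(S, y, alpha=1.0))   # close to w_true for small alpha
```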
• for t = 1, 2, … do
  ◦ w′t+1 = wt − ηt ∇lt(wt)
  ◦ wt+1 = arg min_{w: ∣∣w∣∣ ≤ U} ∣∣w − w′t+1∣∣
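A minimal projected-OGD sketch; the square losses on a synthetic stream and the constant learning rate are choices made only for the example:

```python
import numpy as np

def project_to_ball(w, U):
    # arg min over {||w'|| <= U} of ||w' - w||: rescale w if it lies outside the ball
    norm = np.linalg.norm(w)
    return w if norm <= U else (U / norm) * w

def projected_ogd(grads, d, U, eta):
    # grads: list of functions w -> gradient of l_t at w
    w = np.zeros(d)
    iterates = [w]
    for grad in grads:
        w = project_to_ball(w - eta * grad(w), U)
        iterates.append(w)
    return iterates

# toy stream: l_t(w) = (w^T x_t - y_t)^2 with gradient 2 (w^T x_t - y_t) x_t
rng = np.random.default_rng(0)
xs = rng.normal(size=(50, 3))
ys = xs @ np.array([0.5, -0.5, 1.0])
grads = [lambda w, x=x, y=y: 2.0 * (w @ x - y) * x for x, y in zip(xs, ys)]
ws = projected_ogd(grads, d=3, U=2.0, eta=0.05)
print(ws[-1])
```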
39. Write the upper bound on the regret of projected online gradient
descent on convex functions. Define the main quantities occurring in
the bound.
(1/T) ∑_{t=1}^T lt(wt) − min_{u: ∣∣u∣∣ ≤ U} (1/T) ∑_{t=1}^T lt(u) ≤ U G √(8/T)
Where:
• G is an upper bound on the norm of the gradients of the (convex, Lipschitz) losses
• U is the radius of the sphere that contains all the vectors that are included in the class of predictors that is considered
For σ-strongly convex losses the bound becomes
(1/T) ∑_{t=1}^T lt(wt) − min_{u: ∣∣u∣∣ ≤ U} (1/T) ∑_{t=1}^T lt(u) ≤ G²(ln T + 1) / (2σT)
Where:
• σ is the strong convexity parameter: each loss lt is σ-strongly convex and differentiable
• U is the radius of the sphere that contains all the vectors that are included in the class of predictors that is considered
• G is the largest norm of the gradient obtainable
41. Write the formula for the hinge loss.
ht (w) = [1 − yt wT xt ]+
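For example, a correctly classified example with margin yt w^T xt = 0.3 still pays ht(w) = 0.7, while any margin ≥ 1 gives zero hinge loss.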
42. Write the mistake bound for the Perceptron run on an arbitrary
data stream for binary classification. Define the main quantities
occurring in the bound.
∀u ∈ R^d:
MT ≤ ∑_{t=1}^T ht(u) + (∣∣u∣∣X)² + ∣∣u∣∣X √( ∑_{t=1}^T ht(u) )
where MT is the number of mistakes made by the Perceptron in the first T rounds, ht(u) = [1 − yt u^T xt]+ is the hinge loss of u on (xt, yt), and X = max_t ∣∣xt∣∣ is the norm of the largest datapoint.
43. Write the formula for the polynomial kernel of degree n.
For x, x′ ∈ Rd , K : Rd × Rd → R
K(x, x′ ) = (1 + xT x′ )n
44. Write the formula for the Gaussian kernel with parameter γ .
Kγ(x, x′) = exp( −(1/(2γ)) ∣∣x − x′∣∣² )
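Both kernels in a few lines of Python; the sample vectors are arbitrary:

```python
import numpy as np

def polynomial_kernel(x, x2, n):
    # K(x, x') = (1 + x^T x')^n
    return (1.0 + np.dot(x, x2)) ** n

def gaussian_kernel(x, x2, gamma):
    # K_gamma(x, x') = exp(-||x - x'||^2 / (2 * gamma))
    return np.exp(-np.dot(x - x2, x - x2) / (2.0 * gamma))

x, x2 = np.array([1.0, 0.0]), np.array([0.5, 0.5])
print(polynomial_kernel(x, x2, n=3), gaussian_kernel(x, x2, gamma=1.0))
```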
45. Write the pseudo-code for the kernel Perceptron algorithm.
• initialization: S = ∅
• for t = 1, 2, …do
◦ ŷt = sgn( ∑_{s∈S} ys K(xs, xt) )
◦ if ŷt ≠ yt then
  ▪ S ← S ∪ {t}
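A Python sketch of the kernel Perceptron above, run on a single pass over a toy stream; the XOR-like data and the use of a Gaussian kernel are my own choices:

```python
import numpy as np

def kernel_perceptron(X, y, kernel):
    # S holds the indices of the examples on which a mistake was made
    S = []
    for t in range(len(y)):
        score = sum(y[s] * kernel(X[s], X[t]) for s in S)
        y_hat = 1 if score >= 0 else -1      # sgn with the convention sgn(0) := 1
        if y_hat != y[t]:
            S.append(t)
    return S

gaussian = lambda a, b: np.exp(-np.dot(a - b, a - b) / 2.0)
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])
print(kernel_perceptron(X, y, gaussian))     # indices stored by the algorithm
```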
46. Write the mathematical definition of the linear space HK of functions induced by a kernel K.
HK = { ∑_{i=1}^N αi K(xi, ⋅) ∣ α1, …, αN ∈ R, x1, …, xN ∈ X, N ∈ N }