
Plug-in Methods & under/over-fitting

Vianney Perchet
February 5th 2024

Lecture 2/12
Last Lecture Take Home Message

• “Attributes/Features” space X ⊂ R^d & “label” space Y ⊂ R
• Training data-set: Dn = {(X1, Y1), . . . , (Xn, Yn)}
• Risk w.r.t. a loss ℓ : Y × Y → R+

  R(f) = E_{(X,Y)∼P}[ℓ(f(X), Y)]

• Optimal risk and Bayes predictor

  f∗ = arg min_f R(f)   and   R∗ = R(f∗)

• Binary Classification
  • 0/1-loss: ℓ(y, y′) = 1{y ≠ y′}
  • Bayes classifier: f∗(x) = 1{η(x) ≥ 1/2}
• Linear Regression
  • quadratic loss: ℓ(y, y′) = ∥y − y′∥^2
  • Bayes regressor: f∗(x) = η(x)
Focus on Binary Classification

f∗(x) = 1{η(x) ≥ 1/2}

• Simple strategy
  • Estimate η(x) by η̂(x) (using Dn)
  • Plug it into the formula, i.e., f̂(x) = 1{η̂(x) ≥ 1/2} (see the sketch after this slide)
  • Pray for the best
• ✓ It works!
  • Simple methods, already implemented, intuitive and interpretable
  • Many variants (k-NN, regressograms, kernels)
  ✗ ... provided you choose the correct parameters!
• ✗ Convergence can be slow
  • Even arbitrarily slow: “No Free-Lunch Theorem”
  • E[R(f̂)] − R∗ ≥ 1/(log log log(n)), which is essentially “constant”
• ✓ But these are pathological counter-examples!
  • In practice, data are “regular” (Lipschitz, Hölder, ...)
  • Explicit rates can then be computed
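
A minimal sketch of the plug-in recipe in Python, assuming some estimator eta_hat of η is already available (the function names and the toy logistic estimate are illustrative, not from the lecture):

import numpy as np

def plug_in_classifier(eta_hat):
    """Threshold an estimate of eta(x) = P(Y = 1 | X = x) at 1/2,
    i.e. f_hat(x) = 1{eta_hat(x) >= 1/2}."""
    def f_hat(x):
        return (eta_hat(x) >= 0.5).astype(int)
    return f_hat

# Toy usage with a hand-made eta_hat; in practice eta_hat is fitted on Dn
# by a regressogram, k-NN or kernel estimator (see the following slides).
eta_hat = lambda x: 1.0 / (1.0 + np.exp(-x))    # illustrative estimate only
f_hat = plug_in_classifier(eta_hat)
print(f_hat(np.array([-2.0, 0.0, 3.0])))        # -> [0 1 1]
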
Regressograms. The model

• Partition of X = R^d into “bins” (hypercubes) of size hn
  • Volume of one bin: hn^d
• Independently on each bin B:
  • η̂(x) = ♯{i : Xi ∈ B and Yi = 1} / ♯{i : Xi ∈ B}   [piece-wise constant; see the sketch below]

Th. If hn → 0 and n·hn^d → ∞, “consistency”, i.e., R(f̂n) → R∗ in probability

• Proof ideas:
  1. Lemma: R(f̂) − R∗ ≤ 2 E[|η̂(X) − η(X)| | Dn]
  2. Approximation error: hn → 0
  3. Estimation error: n·hn^d → ∞
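
A minimal sketch of the regressogram plug-in classifier, under the assumption that the bins are the axis-aligned hypercubes indexed by ⌊x/h⌋ (helper names and toy data are mine, not the slides’):

import numpy as np

def regressogram_eta_hat(X_train, y_train, h):
    """Regressogram estimate of eta: partition R^d into hypercubes of side h
    and, on each bin B, return the fraction of Yi = 1 among the Xi in B."""
    bins_train = np.floor(X_train / h).astype(int)            # bin index of each Xi

    def eta_hat(X_query):
        bins_query = np.floor(np.atleast_2d(X_query) / h).astype(int)
        out = np.empty(len(bins_query))
        for k, b in enumerate(bins_query):
            in_bin = np.all(bins_train == b, axis=1)           # Xi sharing the bin of x
            out[k] = y_train[in_bin].mean() if in_bin.any() else 0.5  # empty bin: arbitrary default
        return out
    return eta_hat

# Plug-in classifier f_hat(x) = 1{eta_hat(x) >= 1/2} on toy 2-d data
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] > 0).astype(int)
eta_hat = regressogram_eta_hat(X, y, h=0.25)
print((eta_hat(np.array([[0.6, 0.0], [-0.6, 0.0]])) >= 0.5).astype(int))  # typically [1 0]
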
Regressograms. Pros/cons

✓ Pros
  • Simple, intuitive & interpretable
  • Low computational complexity
✗ Cons
  • Finding the correct value of h
  • Partition is not data-dependent (why bins?)
  • Lots of empty bins
  • Space complexity: ♯ bins is huge
K-Nearest Neighbors. The model

• Adaptive partition of X = R^d
• 1 parameter: kn
• Neighborhood: Nkn(x) = {the kn closest Xj to x}
• η̂(x) = ♯{i : Xi ∈ Nkn(x) and Yi = 1} / ♯{i : Xi ∈ Nkn(x)}   [see the sketch below]
• Piece-wise (polytopial) constant

Th. If kn/n → 0 and kn → ∞, “consistency”

• Proof ideas:
  1. Approximation error: kn/n → 0
  2. Estimation error: kn → ∞
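
A minimal sketch of the k-NN estimate of η and the corresponding plug-in classifier (the Euclidean distance and the toy data are assumptions, not from the slides):

import numpy as np

def knn_eta_hat(X_train, y_train, k):
    """k-NN estimate of eta(x): the fraction of Yi = 1 among the k training
    points closest to x (Euclidean distance)."""
    def eta_hat(X_query):
        X_query = np.atleast_2d(X_query)
        # squared distances between every query point and every training point
        d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
        neighbors = np.argsort(d2, axis=1)[:, :k]       # indices of the k closest Xi
        return y_train[neighbors].mean(axis=1)          # average of their labels
    return eta_hat

# Plug-in classifier on toy data: f_hat(x) = 1{eta_hat(x) >= 1/2}
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 2))
y = (X[:, 0] > 0).astype(int)
eta_hat = knn_eta_hat(X, y, k=15)
print((eta_hat(np.array([[0.5, 0.2], [-0.5, -0.2]])) >= 0.5).astype(int))  # -> [1 0]
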
K-Nearest Neighbors. Pros/cons

✓ Pros
  • Intuitive & (somewhat) interpretable
  • Data-dependent partition
  • No empty bins (& no arbitrary choice of partition)
  • Low space complexity (only the data set is stored)
✗ Cons
  • Finding the correct value of k
  • Weirdly shaped partition
  • Computational complexity: finding the partition (nearest-neighbor search)
Kernel Methods (Nadaraya-Watson)

• Adaptive partition of X = R^d
• 2 parameters: a kernel Kn(·) : X → R+ and a window h ∈ R+
• η̂(x) = Σ_i Kn((x − Xi)/h) Yi / Σ_j Kn((x − Xj)/h)   [see the sketch below]

Th. If hn → 0 and n·hn → ∞, “consistency”

• Proof ideas:
  1. Approximation error: hn → 0
  2. Estimation error: n·hn → ∞
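
A minimal sketch of the Nadaraya-Watson estimate; the Gaussian-type default kernel and the toy data are my assumptions, not the slides’:

import numpy as np

def nadaraya_watson_eta_hat(X_train, y_train, h, kernel=None):
    """Nadaraya-Watson estimate:
    eta_hat(x) = sum_i K((x - Xi)/h) * Yi  /  sum_j K((x - Xj)/h)."""
    if kernel is None:
        # Gaussian-type multivariate kernel (an assumption; any kernel K works)
        kernel = lambda u: np.exp(-0.5 * (u ** 2).sum(axis=-1))

    def eta_hat(X_query):
        X_query = np.atleast_2d(X_query)
        U = (X_query[:, None, :] - X_train[None, :, :]) / h   # (x - Xi)/h for all pairs
        W = kernel(U)                                          # kernel weights, shape (q, n)
        return (W * y_train).sum(axis=1) / W.sum(axis=1)       # weighted average of labels
    return eta_hat

# Plug-in classifier on toy data
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(300, 2))
y = (X[:, 0] > 0).astype(int)
eta_hat = nadaraya_watson_eta_hat(X, y, h=0.2)
print((eta_hat(np.array([[0.5, 0.0], [-0.5, 0.0]])) >= 0.5).astype(int))  # -> [1 0]
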
Typical Kernels

• Usual properties
  • Normalized: ∫_X K(u) du = 1
  • Symmetric: K(−u) = K(u)
  • Bounded variance: ∫_X ∥u∥^2 K(u) du < ∞ and ∫_X K^2(u) du < ∞
• Typical kernels
  • uniform: K(x) = (1/2)·1{x ∈ [−1, 1]}
  • triangular: K(x) = (1 − |x|)·1{x ∈ [−1, 1]}
  • Gaussian: K(x) = (1/√(2π))·exp(−x^2/2)
  • sigmoid: K(x) = (2/π)·1/(e^x + e^−x)
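
The four kernels above, written out as a hedged Python sketch (the function names and the numerical normalization check are mine):

import numpy as np

# The four kernels listed above, as plain functions of a real (or array) x.
def uniform(x):
    return 0.5 * (np.abs(x) <= 1)

def triangular(x):
    return np.maximum(1 - np.abs(x), 0)                 # (1 - |x|) on [-1, 1], 0 outside

def gaussian(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

def sigmoid(x):
    return (2 / np.pi) / (np.exp(x) + np.exp(-x))

# Quick numerical check that each kernel integrates to (roughly) 1
grid = np.linspace(-10, 10, 200_001)
for K in (uniform, triangular, gaussian, sigmoid):
    print(K.__name__, np.trapz(K(grid), grid))          # each value should be close to 1.0
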
Kernels. Pros/cons

✓ Pros
  • Intuitive & (somewhat) interpretable
  • Use all/many points to estimate
  • Data-dependent
  • No empty bins (& no arbitrary choice of partition)
  • Smooth/regular approximation
✗ Cons
  • Finding the correct kernel K(·) and window h
Over/Under-fitting

All learning algorithms have data-fitting parameter(s)

• Choose it too small = under-fitting
  ✗ Big empirical error on the training set: (1/n) Σ_{i=1}^n ℓ(f(Xi), Yi)
  ✗ Medium (generalization) error: E[ℓ(f(X), Y)]
• Choose it too big = over-fitting
  ✓ Small empirical error (even 0): (1/n) Σ_{i=1}^n ℓ(f(Xi), Yi)
  ✗ Huge (generalization) error: E[ℓ(f(X), Y)]
• How to choose it? (see the illustration below)
  • Do not focus too much on the empirical error (around 1/√n?)
  • Find several candidates & pick the smallest one (Occam’s razor)
  • Cross-validate (following lecture!)
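
A hedged illustration of the trade-off with the k-NN plug-in classifier (here the amount of data-fitting grows as k decreases, so k = 1 is the over-fitting end and k = n the under-fitting end; the toy data-generating process is mine):

import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    """Noisy toy labels: eta(x) = 0.9 if x > 0 else 0.1, so the Bayes error is 0.1."""
    X = rng.uniform(-1, 1, size=(n, 1))
    y = (rng.uniform(size=n) < np.where(X[:, 0] > 0, 0.9, 0.1)).astype(int)
    return X, y

def knn_predict(X_train, y_train, X_query, k):
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    neigh = np.argsort(d2, axis=1)[:, :k]
    return (y_train[neigh].mean(axis=1) >= 0.5).astype(int)

X_tr, y_tr = make_data(200)
X_te, y_te = make_data(2000)

for k in (1, 15, 200):   # over-fitting, reasonable, under-fitting
    train_err = np.mean(knn_predict(X_tr, y_tr, X_tr, k) != y_tr)  # empirical (0/1) error
    test_err = np.mean(knn_predict(X_tr, y_tr, X_te, k) != y_te)   # proxy for E[l(f(X), Y)]
    print(f"k = {k:3d}   train error = {train_err:.2f}   test error = {test_err:.2f}")

On this toy data, k = 1 should give (near) zero training error but a clearly larger test error, while k = 200 averages essentially all the labels and both errors become large.
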
Take home message - Local/Plug-in Methods

Lemma: R(f̂) − R∗ ≤ 2 E[|η̂(X) − η(X)| | Dn]

• Estimate η(·) by η̂(·)
• Plug it into the formula f∗(x) = 1{η(x) ≥ 1/2}
• Local Methods
  • General form: η̂(x) = Σ_{i=1}^n ω(x, Xi; (X1, X2, . . . , Xn)) Yi
    with convex weights for every x (in [0, 1] and summing to 1); see the sketch below
  • Typical examples
    • Regressogram
    • k-Nearest neighbors
    • Kernel methods
• Avoid Under/Over-fitting
  • Many points around x should get positive weight (avoids over-fitting)
  • Points far from x should get small/zero weight (avoids under-fitting)
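
A hedged sketch of this general weighted form, with k-NN and Gaussian-kernel weights as two instances (the helper names and default parameters are mine, not the course’s):

import numpy as np

def local_method_eta_hat(weight_fn, X_train, y_train):
    """General local method: eta_hat(x) = sum_i w(x, Xi; X1..Xn) * Yi,
    where for each x the weights lie in [0, 1] and sum to 1."""
    def eta_hat(x):
        w = weight_fn(x, X_train)                       # convex weights, shape (n,)
        assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
        return float(w @ y_train)
    return eta_hat

# Two weight functions fitting this template (assumed defaults)
def knn_weights(x, X_train, k=10):
    d2 = ((X_train - x) ** 2).sum(axis=1)
    w = np.zeros(len(X_train))
    w[np.argsort(d2)[:k]] = 1.0 / k                     # weight 1/k on the k nearest Xi
    return w

def kernel_weights(x, X_train, h=0.3):
    K = np.exp(-0.5 * ((X_train - x) ** 2).sum(axis=1) / h ** 2)
    return K / K.sum()                                  # normalized kernel weights

X = np.random.default_rng(4).uniform(-1, 1, size=(100, 2))
y = (X[:, 0] > 0).astype(int)
print(local_method_eta_hat(knn_weights, X, y)(np.array([0.5, 0.0])))      # close to 1
print(local_method_eta_hat(kernel_weights, X, y)(np.array([-0.5, 0.0])))  # close to 0
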
