ML Lecture 2
Vianney Perchet
February 5th 2024
Lecture 2/12
Last Lecture Take Home Message
Focus on Binary Classification
$f^*(x) = \mathbf{1}\{\eta(x) \ge 1/2\}$
• Simple strategy
• Estimate η(x) by η̂(x) (using $D_n$)
• Plug it into the formula, i.e., $\hat f(x) = \mathbf{1}\{\hat\eta(x) \ge 1/2\}$ (see the sketch after this list)
• Pray for the best
• ✓ It works!
• Simple methods, already implemented, intuitive and interpretable
• Many variants (K-NN, regressograms, Kernels)
• ✗ ... but only if you choose the correct parameters!
• ✗ Convergence can be slow
• Even arbitrarily slow: "No Free-Lunch Theorem"
• $\mathbb{E}\,R(\hat f) - R^* \ge \frac{1}{\log\log\log(n)}$, which is "constant"
• Proof ideas:
1. Lemma: $R(\hat f) - R^* \le 2\,\mathbb{E}\big[\,|\hat\eta(X) - \eta(X)|\,\big|\,D_n\big]$
2. Approximation error: $h_n \to 0$
3. Estimation error: $n h_n^d \to \infty$
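As a minimal illustration of the plug-in strategy (illustrative code, not from the slides; `plug_in_classifier` and `eta_hat` are placeholder names for any estimator of η, e.g. the regressogram, K-NN, or kernel estimators below):

```python
import numpy as np

def plug_in_classifier(eta_hat):
    """Turn an estimate eta_hat of eta(x) = P(Y = 1 | X = x), built from the
    sample D_n, into a classifier by thresholding at 1/2."""
    def f_hat(X):
        return (np.asarray(eta_hat(X)) >= 0.5).astype(int)
    return f_hat

# Hypothetical usage: f_hat = plug_in_classifier(eta_hat); y_pred = f_hat(X_test)
```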
Regressograms. Pros/cons
✓ Pros
• Simple, intuitive & interpretable
• Computational complexity
✗ Cons
• Find the correct value of h
• Partition not data-dependent (why bins ?)
• Lots of empty bins
• Space complexity: the number of bins is huge (see the sketch below)
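A minimal sketch of a regressogram-based plug-in classifier, assuming data in $\mathbb{R}^d$ and cubic bins of side h (names and implementation details are illustrative, not from the slides); the per-bin dictionary and the empty-bin default make the two drawbacks above concrete:

```python
import numpy as np

def regressogram_classifier(X_train, y_train, h):
    """Plug-in classifier from a regressogram estimate of eta:
    partition R^d into cubes of side h and average the labels within each cube.
    X_train has shape (n, d); y_train has shape (n,) with values in {0, 1}."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    sums, counts = {}, {}                        # per-bin label sum and sample count
    for x, y in zip(X_train, y_train):
        key = tuple(np.floor(x / h).astype(int))
        sums[key] = sums.get(key, 0.0) + y
        counts[key] = counts.get(key, 0) + 1

    def predict(X):
        X = np.asarray(X, dtype=float)
        preds = []
        for x in X:
            key = tuple(np.floor(x / h).astype(int))
            c = counts.get(key, 0)
            eta_hat = sums[key] / c if c > 0 else 0.0   # empty bin: arbitrary default
            preds.append(int(eta_hat >= 0.5))
        return np.array(preds)

    return predict
```

A dense grid over a bounded domain would need on the order of $(1/h)^d$ cells, which is why the number of bins blows up with the dimension.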
K-Nearest Neighbors. The model
• Adaptive partition of $\mathcal{X} = \mathbb{R}^d$
• 1 parameter: $k_n$
• Neighborhood $N_{k_n}(x) = \{k_n \text{ closest } X_j \text{ to } x\}$
• $\hat\eta(x) = \dfrac{\#\{i : X_i \in N_{k_n}(x) \text{ and } Y_i = 1\}}{\#\{i : X_i \in N_{k_n}(x)\}} = \dfrac{\#\{i : X_i \in N_{k_n}(x) \text{ and } Y_i = 1\}}{k_n}$ (see the sketch below)
Th. If $k_n/n \to 0$ and $k_n \to \infty$, "consistency"
• Proof ideas:
1. Approximation error: $k_n/n \to 0$
2. Estimation error: $k_n \to \infty$
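A minimal sketch of the corresponding k-NN plug-in classifier (brute-force neighbor search, illustrative names; not from the slides):

```python
import numpy as np

def knn_classifier(X_train, y_train, k):
    """Plug-in classifier from the k-NN estimate of eta:
    eta_hat(x) = (# of Y_i = 1 among the k nearest X_i to x) / k.
    X_train has shape (n, d); y_train has shape (n,) with values in {0, 1}."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)

    def predict(X):
        X = np.asarray(X, dtype=float)
        preds = []
        for x in X:
            dists = np.linalg.norm(X_train - x, axis=1)   # distances to every training point
            neighbors = np.argsort(dists)[:k]             # indices of the k closest X_i
            eta_hat = y_train[neighbors].mean()           # local average of the labels
            preds.append(int(eta_hat >= 0.5))
        return np.array(preds)

    return predict
```

The brute-force search costs O(n) distance computations per query, which illustrates the computational cost of finding the neighborhood.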
K-Nearest Neighbors. Pros/cons
✓ Pros
• Intuitive & (somewhat) interpretable
• Data Dependent partition
• No empty bins (& no arbitrary choice)
• Space complexity
✗ Cons
• Find the correct value of k
• Weirdly shaped partition
• Computational complexity: finding the partition
Kernel-Methods (Nadaraya-Watson)
• Adaptive partition of $\mathcal{X} = \mathbb{R}^d$
• 2 parameters: a kernel $K_n(\cdot) : \mathcal{X} \to \mathbb{R}_+$ and a window $h \in \mathbb{R}_+$
• $\hat\eta(x) = \dfrac{\sum_i K_n\!\left(\frac{x - X_i}{h}\right) Y_i}{\sum_j K_n\!\left(\frac{x - X_j}{h}\right)}$ (see the sketch below)
• Proof ideas:
1. Approximation error: $h_n \to 0$
2. Estimation error: $n h_n \to \infty$
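A minimal sketch of the Nadaraya-Watson plug-in classifier, with a Gaussian kernel assumed as default (names are illustrative, not from the slides):

```python
import numpy as np

def nadaraya_watson_classifier(X_train, y_train, h, kernel=None):
    """Plug-in classifier from the Nadaraya-Watson estimate
    eta_hat(x) = sum_i K((x - X_i)/h) Y_i / sum_j K((x - X_j)/h).
    X_train has shape (n, d); y_train has shape (n,) with values in {0, 1}."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    if kernel is None:
        # assumed default: Gaussian kernel applied to the rescaled difference vector
        kernel = lambda u: np.exp(-0.5 * np.sum(u ** 2, axis=-1))

    def predict(X):
        X = np.asarray(X, dtype=float)
        preds = []
        for x in X:
            w = kernel((x - X_train) / h)                 # weights K((x - X_i)/h), shape (n,)
            denom = w.sum()
            eta_hat = float(w @ y_train) / denom if denom > 0 else 0.0
            preds.append(int(eta_hat >= 0.5))
        return np.array(preds)

    return predict
```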
Typical Kernel
• Usual properties
• Normalized: $\int_{\mathcal{X}} K(u)\,du = 1$
• Symmetry: $K(-u) = K(u)$
• Bounded variance: $\int_{\mathcal{X}} \|u\|^2 K(u)\,du < \infty$ and $\int_{\mathcal{X}} K^2(u)\,du < \infty$
• Typical Kernels (see the sketch below)
• Uniform: $K(x) = \frac{1}{2}\,\mathbf{1}\{x \in [-1, 1]\}$
• Gaussian: $K(x) = \frac{1}{\sqrt{2\pi}}\,\exp\!\left(-\frac{1}{2}x^2\right)$
• Sigmoid: $K(x) = \frac{2}{\pi}\,\frac{1}{e^{x} + e^{-x}}$
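For concreteness, a sketch of these three kernels as Python functions (one-dimensional versions; in higher dimension they are typically applied coordinate-wise or to $\|u\|$):

```python
import numpy as np

# The three typical kernels from the slide, in one-dimensional form.
def uniform_kernel(u):
    """K(u) = 1/2 on [-1, 1] and 0 elsewhere."""
    return 0.5 * (np.abs(u) <= 1)

def gaussian_kernel(u):
    """K(u) = exp(-u^2 / 2) / sqrt(2 pi)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def sigmoid_kernel(u):
    """K(u) = (2 / pi) / (exp(u) + exp(-u))."""
    return (2.0 / np.pi) / (np.exp(u) + np.exp(-u))
```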
Kernels. Pros/cons
✓ Pros
• Intuitive & (somewhat) interpretable
• Use all/many points to estimate
• Data Dependent
• No empty bins (& no arbitrary choice)
• Smooth/regular approximation
✗ Cons
• Find the correct kernel K(·) and window h
Over/Under-fitting
[Figures: under-fitting vs. over-fitting illustrations]
Take home message - Local/Plug-in Methods
Lemma: $R(\hat f) - R^* \le 2\,\mathbb{E}\big[\,|\hat\eta(X) - \eta(X)|\,\big|\,D_n\big]$
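A short proof sketch of this lemma (the standard plug-in argument; the derivation below is reconstructed, not spelled out on the slide):

```latex
% Proof sketch, conditionally on the sample D_n:
\begin{align*}
R(\hat f) - R^*
  &= \mathbb{E}\Big[\,|2\eta(X) - 1|\,\mathbf{1}\{\hat f(X) \neq f^*(X)\}\,\Big|\,D_n\Big] \\
  &\le \mathbb{E}\Big[\,2\,|\hat\eta(X) - \eta(X)|\,\mathbf{1}\{\hat f(X) \neq f^*(X)\}\,\Big|\,D_n\Big] \\
  &\le 2\,\mathbb{E}\Big[\,|\hat\eta(X) - \eta(X)|\,\Big|\,D_n\Big].
\end{align*}
% The middle inequality uses that, on the event \{\hat f(X) \neq f^*(X)\},
% \eta(X) and \hat\eta(X) lie on opposite sides of 1/2, hence
% |2\eta(X) - 1| = 2|\eta(X) - 1/2| \le 2|\hat\eta(X) - \eta(X)|.
```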