
Machine Learning

References
Kilian Weinberger, Machine Learning for Intelligent Systems (CS4780/CS5780), Cornell University. https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote01_MLsetup.html
Supervised learning
$D = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\} \subseteq X \times Y$

where $(\mathbf{x}_i, y_i) \sim P(\mathbf{x}, y)$.

Learn a function $h \in H$ such that for a new instance $(\mathbf{x}, y) \sim P$,

$h(\mathbf{x}) \approx y$
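A minimal sketch of this setup in Python. The distribution $P$ (uniform inputs, linear targets plus noise), the one-dimensional features, and the linear hypothesis class are all illustrative assumptions, not part of the formal definition:

    import numpy as np

    rng = np.random.default_rng(0)

    # Sample a dataset D = {(x_i, y_i)} from an assumed distribution P:
    # here x ~ Uniform(-1, 1) and y = 2x + Gaussian noise, purely for illustration.
    n = 100
    X = rng.uniform(-1, 1, size=(n, 1))
    y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=n)

    # Choose h from a (linear) hypothesis space H via least squares.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    def h(x):
        return x @ w

    # For a fresh instance (x, y) ~ P we want h(x) ≈ y.
    print(h(np.array([0.5])))  # close to 2 * 0.5 = 1.0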
Hypothesis Space
$h \in H$
• $H$ can be thought of as containing classes of hypotheses that share sets of assumptions, such as:
  • Decision trees
  • Perceptrons
  • Neural networks
  • Support Vector Machines
How to choose $h$?
• Randomly
  • May not work well
  • Like using a random program to solve your sorting problem
  • May work if $H$ is constrained enough

• Exhaustively
  • Would be very slow
  • The space $H$ is usually very large (if not infinite)

• $H$ is usually chosen by data scientists (you!) based on their experience!
• $h \in H$ is estimated efficiently using various optimization techniques
How to evaluate $h$?
Loss functions
• Calculate the average error of $h$ in predicting $y$
• Smaller is better. For example, with the 0/1 loss in binary classification:
  • 0 loss: no error
  • 100% loss: could not get even one instance right
  • 50% loss: your $h$ is as informative as a coin toss
Loss functions
0/1 Loss
$L_{0/1}(h) = \frac{1}{n} \sum_{i=1}^{n} \delta_{h(\mathbf{x}_i) \neq y_i}, \quad \text{where } \delta_{h(\mathbf{x}_i) \neq y_i} = \begin{cases} 1, & \text{if } h(\mathbf{x}_i) \neq y_i \\ 0, & \text{otherwise} \end{cases}$
• Counts the fraction of mistakes made in predicting $y$
• Evaluated on the training set, this is the training error rate
• Non-continuous and non-differentiable
• Difficult to utilize in optimization
• Used to evaluate classifiers in binary/multiclass settings
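A minimal sketch of the 0/1 loss, assuming predictions and labels are held in NumPy-compatible arrays:

    import numpy as np

    def zero_one_loss(y_pred, y_true):
        # (1/n) * sum of indicators [h(x_i) != y_i]
        return np.mean(np.asarray(y_pred) != np.asarray(y_true))

    # One mistake in four predictions -> loss 0.25
    print(zero_one_loss([1, 0, 1, 1], [1, 0, 0, 1]))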
Squared loss
$L_{sq}(h) = \frac{1}{n} \sum_{i=1}^{n} \left( h(\mathbf{x}_i) - y_i \right)^2$
• Typically used in regression settings
• The loss is always non-negative
• The loss grows quadratically with the absolute magnitude of mis-prediction
• Strongly discourages predictions that are very far off
• If a prediction is very close to correct, its squared error is tiny, so little attention is given to that example when driving the error toward zero
Absolute loss
$L_{abs}(h) = \frac{1}{n} \sum_{i=1}^{n} \left| h(\mathbf{x}_i) - y_i \right|$
• Typically used in regression settings
• The loss is always non-negative
• The loss grows linearly with the absolute magnitude of mis-prediction
• Better suited for noisy data, since it is less sensitive to outliers
Comparison
y        h(x)       Square loss   Abs loss
100.00   101.00            1.00       1.00
90.00    90.01           0.0001       0.01
100.00   200.00       10,000.00     100.00
100.00   1,000.00    810,000.00     900.00
Overall (mean)       205,000.25     250.25

Predictions off by 1 everywhere, except one severe outlier:

y        h(x)       Square loss   Abs loss
100.00   101.00            1.00       1.00
90.00    91.00             1.00       1.00
100.00   101.00            1.00       1.00
20.00    21.00             1.00       1.00
30.00    29.00             1.00       1.00
40.00    41.00             1.00       1.00
30.00    31.00             1.00       1.00
10.00    11.00             1.00       1.00
12.00    13.00             1.00       1.00
16.00    17.00             1.00       1.00
100.00   1,000.00    810,000.00     900.00
Overall (mean)        73,637.27      82.73

Predicting 0 everywhere, with only the largest target predicted exactly:

y          h(x)       Square loss   Abs loss
100.00     0.00         10,000.00     100.00
90.00      0.00          8,100.00      90.00
100.00     0.00         10,000.00     100.00
20.00      0.00            400.00      20.00
30.00      0.00            900.00      30.00
40.00      0.00          1,600.00      40.00
30.00      0.00            900.00      30.00
10.00      0.00            100.00      10.00
12.00      0.00            144.00      12.00
16.00      0.00            256.00      16.00
1,000.00   1,000.00          0.00       0.00
Overall (mean)           2,945.45      40.73

• Takeaway: a single large outlier dominates the squared loss, while the absolute loss weighs all errors in proportion to their size
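Both regression losses are short enough to sketch directly. The example below (assumed NumPy representation) reproduces the "Overall" row of the first four-row table:

    import numpy as np

    def squared_loss(y_pred, y_true):
        # (1/n) * sum of (h(x_i) - y_i)^2
        return np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)

    def absolute_loss(y_pred, y_true):
        # (1/n) * sum of |h(x_i) - y_i|
        return np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true)))

    y_true = [100.00, 90.00, 100.00, 100.00]
    y_pred = [101.00, 90.01, 200.00, 1000.00]
    print(squared_loss(y_pred, y_true))   # 205000.250025
    print(absolute_loss(y_pred, y_true))  # 250.2525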
The elusive $h$
$h = \operatorname{argmin}_{h \in H} L(h)$

So we need an $h$ with a low loss on $D$?


How not to reduce the loss?

$h(\mathbf{x}) = \begin{cases} y_i, & \text{if } \exists\, (\mathbf{x}_i, y_i) \in D \text{ s.t. } \mathbf{x} = \mathbf{x}_i \\ 0, & \text{otherwise} \end{cases}$

• What would be the loss of this $h$ on the training set?
• What would be the loss of this $h$ on an unseen test set?

The memorizer!

• Why is it bad?
• How to prevent this from happening?
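A sketch of the memorizer (a hypothetical construction, for illustration only): it achieves zero training loss but gives near-arbitrary answers everywhere else.

    def make_memorizer(D):
        # h(x) = y_i if some (x_i, y_i) in D has x == x_i, else 0
        table = {x: y for x, y in D}
        return lambda x: table.get(x, 0)

    D = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
    h = make_memorizer(D)
    print(all(h(x) == y for x, y in D))  # True: zero training loss
    print(h(2.5))                        # 0 -- unseen inputs get a default answer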
Generalization
$\epsilon = \mathbb{E}_{(\mathbf{x},y) \sim P}[\ell(\mathbf{x}, y) \mid h]$
• The expected loss is calculated over any data point sampled from the distribution $P$, not only those present in $D$
• How to get a new datapoint 𝑥, 𝑦 ∼ 𝑃?
• All we have are the 𝑛 data points!
• We estimate $\epsilon$ by splitting $D$:

Training set, $D_{TR}$ | Test set, $D_{TE}$

• We train on $D_{TR}$ and test on $D_{TE}$ only once!
• Don't train on the test set inadvertently (e.g., through repeated testing)!
Training and Test Data
• Usually 80:20 or 70:30 splits
• Making sure the splits make sense
• Time series: Split by time
• i.i.d: Uniformly at random
  o Make sure you don't split the same data point between $D_{TR}$ and $D_{TE}$
  o Make sure the same data does not get repeated on both sides (e.g., near-duplicate spam emails)
• We never look at the test data
• We train only on $D_{TR}$ and use $D_{TE}$ only once
• Then the test error approximates the generalization loss
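A minimal uniformly-random split, as a sketch. It assumes i.i.d. data; the 80:20 default mirrors the slide:

    import numpy as np

    def train_test_split(X, y, test_frac=0.2, seed=0):
        # Uniformly random split -- only appropriate for i.i.d. data;
        # for time series, split by time instead.
        idx = np.random.default_rng(seed).permutation(len(X))
        n_test = int(len(X) * test_frac)
        te, tr = idx[:n_test], idx[n_test:]
        return X[tr], y[tr], X[te], y[te]

    X, y = np.arange(20).reshape(20, 1), np.arange(20)
    X_tr, y_tr, X_te, y_te = train_test_split(X, y)
    # Train only on (X_tr, y_tr); evaluate on (X_te, y_te) exactly once.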

• How do we evaluate the model if we do not have access to the test data while training?
Validation sets
Training set, $D_{TR}$ | Validation set, $D_{VA}$ | Test set, $D_{TE}$

• E.g., split 80:10:10
• Train on $D_{TR}$, tune parameters or calculate error on $D_{VA}$, and finally test once on $D_{TE}$
• Cross-validation: rotate the validation fold across the training data and average the validation errors (see the sketch below)
• Finally, train on the whole data once, before shipping out
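A sketch of k-fold cross-validation; `fit_and_score` is a hypothetical callback standing in for whatever training-plus-validation routine you use (e.g., train a regressor on the first pair of arrays and return its loss on the second):

    import numpy as np

    def k_fold_cv(X, y, fit_and_score, k=5, seed=0):
        # Rotate the validation fold across k splits and average the scores.
        idx = np.random.default_rng(seed).permutation(len(X))
        folds = np.array_split(idx, k)
        scores = []
        for i in range(k):
            va = folds[i]
            tr = np.concatenate([folds[j] for j in range(k) if j != i])
            scores.append(fit_and_score(X[tr], y[tr], X[va], y[va]))
        return float(np.mean(scores))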


Summary
Learning:

$h^* = \operatorname{argmin}_{h \in H} \frac{1}{|D_{TR}|} \sum_{(\mathbf{x},y) \in D_{TR}} \ell(\mathbf{x}, y \mid h)$
Evaluation:
$\epsilon_{TE} = \frac{1}{|D_{TE}|} \sum_{(\mathbf{x},y) \in D_{TE}} \ell(\mathbf{x}, y \mid h^*)$
• If the samples are drawn i.i.d. from the same distribution $P$, then the test loss is an unbiased estimator of the true generalization loss:
Generalization:
$\epsilon = \mathbb{E}_{(\mathbf{x},y) \sim P}[\ell(\mathbf{x}, y) \mid h^*]$
• $\epsilon_{TE} \to \epsilon$ as $|D_{TE}| \to +\infty$
• This is due to the weak law of large numbers, which says that the empirical average of data drawn from a distribution converges to its mean.
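A quick empirical check of this convergence. The distribution $P$ and the fixed $h^*$ below are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    h_star = lambda x: 2.0 * x  # a fixed, already-trained hypothesis

    # With y = 2x + N(0, 0.1), the true generalization squared loss of h_star
    # is E[noise^2] = 0.01; the test loss approaches it as |D_TE| grows.
    for n_te in (10, 1_000, 100_000):
        x = rng.uniform(-1, 1, size=n_te)
        y = 2.0 * x + rng.normal(scale=0.1, size=n_te)
        print(n_te, np.mean((h_star(x) - y) ** 2))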
