Lecture 15 - Recap and Midterm Review
Review
CVPR 2014
• Performs similarly to humans on the LFW (Labeled Faces in the Wild) dataset
• Can be used to organize photo albums, identify celebrities, or alert users when someone posts an image of them
• If this is used in a commercial deployment, what might be some unintended consequences?
• This algorithm is used by Facebook (though with expanded training data)
Example application of SVM: Dalal-Triggs 2005
https://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf
• Best performing face/car detector in 2000-2005
• Model probabilities of small groups of features (wavelet coefficients)
• Search for groupings, discretize features, estimate parameters
https://www.cs.cmu.edu/afs/cs.cmu.edu/user/hws/www/CVPR00.pdf
Human pose estimation with random forest
• Lots of data
• Random Forest
Training (Parameter Learning)

[Diagram: Raw Features → Encoder → Decoder → Prediction, trained against Target Labels]

• Raw features: discrete/continuous values; text; images; audio; structured/unstructured; few/many features; clean/noisy labels
• Encoder: feature selection, clustering, kernels, density estimation, manual feature design, deep networks
• Model: trees, linear regressor, logistic regressor, nearest neighbor, probabilistic model, SVM
• Prediction: category, continuous value, clusters, low dimensional embedding, pixel labels, generated text/image/audio, positions
Learning a model
y_t = f(x_t ; θ)
• Given some new set of input features x_t, the model predicts y_t
– Regression: output y_t directly, possibly with some variance estimate
– Classification
• Output the most likely y_t directly, as in nearest neighbor
• Output P(y_t | x_t), as in logistic regression
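The classification case that outputs P(y_t | x_t) can be sketched as a logistic predictor. This is a minimal illustration of the prediction step only; the parameter vector theta below is hypothetical, not learned from any data in these slides:

```python
import math

def predict_proba(x, theta):
    """Logistic-regression-style prediction: P(y_t = 1 | x_t) = sigmoid(theta . x)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned parameters theta, for illustration only.
theta = [2.0, -1.0]
p = predict_proba([1.5, 0.5], theta)  # P(y_t = 1 | x_t), here sigmoid(2.5)
label = 1 if p > 0.5 else 0           # most likely y_t
```

A nearest-neighbor classifier would instead output the most likely y_t directly, with no probability attached.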
Model evaluation process
1. Collect/define training, validation, and test sets
2. Decide on some candidate models and hyperparameters
3. For each candidate:
a. Learn parameters with training set
b. Evaluate trained model on the validation set
4. Select best model
5. Evaluate best model’s performance on the test set
– Cross-validation can be used as an alternative
– Common measures include error or accuracy, root mean squared
error, precision-recall
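The five steps above can be sketched end to end. Everything here is assumed for illustration: a toy 1-D dataset, a nearest-neighbor model, and the neighborhood size k as the hyperparameter selected on the validation set:

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Classify query by majority vote among the k nearest training points."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def accuracy(train, eval_set, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in eval_set)
    return hits / len(eval_set)

# Step 1: toy train/validation/test sets (class 0 near 0, class 1 near 5).
train = [([0.0], 0), ([0.5], 0), ([1.0], 0), ([4.5], 1), ([5.0], 1), ([5.5], 1)]
val   = [([0.2], 0), ([4.8], 1)]
test  = [([0.9], 0), ([5.2], 1)]

# Steps 2-3: evaluate each candidate hyperparameter on the validation set.
scores = {k: accuracy(train, val, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)      # step 4: select the best model
test_acc = accuracy(train, test, best_k)  # step 5: final test-set estimate
```

The test set is touched exactly once, after model selection, so `test_acc` remains an unbiased estimate of generalization.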
How to think about ML algorithms
• What is the model?
– What kinds of functions can it represent?
– What functions does it prefer? (regularization/prior)
• What is the objective function?
– What “values” are implied?
– The objective function does not always match the final evaluation metric
– Objectives are designed to be optimizable and improve generalization
• How do I optimize the model?
– How long does it take to train, and how does it depend on the amount of training
data or number of features?
– Can I reach a global optimum?
• How does the prediction work?
– How accurate is the prediction?
– How fast can I make a prediction for a new sample?
– Does my algorithm provide a confidence on its prediction?
Bias-Variance Trade-off
[Figure: expected error vs. model complexity, with underfitting (high bias) at left and overfitting (high variance) at right; the original slide links figure sources and a derivation]
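The underfitting/overfitting ends of the trade-off can be shown numerically. The toy data below is assumed: a constant model ignores the input (high bias), while a model that memorizes the training set gets zero training error but its error on new points reveals the train/test gap:

```python
# Hypothetical 1-D data: y roughly equals x, split into train and test.
train = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8)]
test  = [(0.5, 0.5), (1.5, 1.6), (2.5, 2.4)]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Underfitting: a constant prediction ignores x entirely (high bias).
mean_y = sum(y for _, y in train) / len(train)
constant = lambda x: mean_y

# Overfitting: memorize the training set and predict the y of the
# nearest training x (zero training error, but sensitive to the sample).
memorize = lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

train_gap_constant = abs(mse(constant, train) - mse(constant, test))
train_gap_memorize = abs(mse(memorize, train) - mse(memorize, test))
```

The memorizing model's training error is exactly zero, so any nonzero test error it has is pure generalization gap.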
Strengths and limitations by method:

• Nearest neighbor. Strengths: low bias; no training time; widely applicable; simple. Limitations: relies on good input features; slow prediction (in basic implementation).
• Naïve Bayes. Strengths: estimate from limited data; simple; fast training/prediction. Limitations: limited modeling power.
• Logistic regression / SVM. Strengths: powerful in high dimensions; widely applicable; good confidence estimates; fast prediction. Limitations: relies on good input features.
• Decision trees. Strengths: explainable decision function; widely applicable; does not require feature scaling. Limitations: one tree tends to either generalize poorly or underfit the data.
Classification methods (extended), assuming x ∈ {0, 1}

Naïve Bayes
– Objective: maximize ∑_i [ ∑_j log P(x_ij | y_i ; θ_j) + log P(y_i ; θ_0) ]
– Training (counting): θ_kj = ( ∑_i δ(x_ij = 1 ∧ y_i = k) + r ) / ( ∑_i δ(y_i = k) + Kr )
– Classification: θ_1ᵀ x̂ + θ_0ᵀ (1 − x̂) > t, where θ_1j = log [ P(x_j = 1 | y = 1) / P(x_j = 1 | y = 0) ] and θ_0j = log [ P(x_j = 0 | y = 1) / P(x_j = 0 | y = 0) ]

Linear SVM
– Objective: minimize λ ∑_i ξ_i + ½ ‖θ‖² such that y_i θᵀ x_i ≥ 1 − ξ_i ∀i, ξ_i ≥ 0
– Optimization: quadratic programming or subgradient opt.
– Classification: θᵀ x̂ > t

Kernelized SVM
– Objective: complicated to write
– Optimization: quadratic programming
– Classification: ∑_i y_i α_i K(x̂_i, x̂) > 0

Nearest Neighbor
– Training: record the data (most similar features → same label)
– Classification: ŷ = y_i where i = argmin_i K(x̂_i, x̂)
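The Naïve Bayes training step is just counting with smoothing: θ_kj = (∑_i δ(x_ij = 1 ∧ y_i = k) + r) / (∑_i δ(y_i = k) + Kr). A minimal sketch for binary features; the function name and toy data below are assumed:

```python
def train_naive_bayes(X, y, num_classes, r=1.0):
    """Estimate theta[k][j] = P(x_j = 1 | y = k) by smoothed counting:
    (count of x_ij = 1 with y_i = k, plus r) / (count of y_i = k, plus K*r)."""
    K = num_classes
    num_features = len(X[0])
    theta = []
    for k in range(K):
        rows = [x for x, label in zip(X, y) if label == k]
        denom = len(rows) + K * r
        theta.append([(sum(x[j] for x in rows) + r) / denom
                      for j in range(num_features)])
    return theta

# Toy binary feature vectors with two classes, assumed for illustration.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
theta = train_naive_bayes(X, y, num_classes=2)
# theta[1][0] = (2 + 1) / (2 + 2) = 0.75
```

The smoothing constant r keeps every estimated probability strictly between 0 and 1, so the log terms in the objective stay finite.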
i
Strengths and limitations by method:

• Nearest neighbor. Strengths: low bias; no training time; widely applicable; simple. Limitations: relies on good input features; slow prediction (in basic implementation).
• Naïve Bayes. Strengths: estimate from limited data; simple; fast training/prediction. Limitations: limited modeling power.
• Logistic regression / SVM. Strengths: powerful in high dimensions; widely applicable; fast prediction; coefficients may be interpretable. Limitations: relies on good input features.
• Decision trees. Strengths: explainable decision function; widely applicable; does not require feature scaling. Limitations: one tree tends to either generalize poorly or underfit the data.
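One of the optimizers named for the linear SVM is subgradient descent on λ∑_i ξ_i + ½‖θ‖², where ξ_i = max(0, 1 − y_i θᵀx_i). A minimal sketch; λ, the learning rate, and the toy data are all assumed:

```python
def svm_subgradient(X, y, lam=1.0, lr=0.01, epochs=200):
    """Minimize lam * sum_i max(0, 1 - y_i * theta.x_i) + 0.5 * ||theta||^2
    by batch subgradient descent. Labels y_i must be +1 or -1."""
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = list(theta)  # gradient of the 0.5 * ||theta||^2 term
        for x, yi in zip(X, y):
            margin = yi * sum(t * xj for t, xj in zip(theta, x))
            if margin < 1:  # hinge active: subgradient adds -lam * y_i * x_i
                grad = [g - lam * yi * xj for g, xj in zip(grad, x)]
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

# Linearly separable toy data (assumed); classify with the sign of theta . x.
X = [[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]]
y = [1, 1, -1, -1]
theta = svm_subgradient(X, y)
```

Unlike quadratic programming, this only touches subgradients, so it scales easily to large training sets.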
Ensembles

Good when:
– the model is able to approximately fit the distribution
– the distribution is low dimensional or smooth
– the data is 1-D
• Entropy is a measure of
uncertainty
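Entropy quantifies that uncertainty as H = −∑ p log₂ p (the standard Shannon definition, stated here for completeness); a quick sketch:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum p * log2(p), using 0 * log(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_coin = entropy([0.5, 0.5])   # maximal uncertainty for two outcomes: 1 bit
h_sure = entropy([1.0, 0.0])   # a certain outcome carries no uncertainty
```

This is the quantity a decision tree reduces at each split when it chooses the feature with the largest information gain.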