
Lecture 15: Recap and Midterm Review
Applied Machine Learning
Derek Hoiem
Midterm Exam Logistics
• Thurs, Mar 7 (start exam between 9:30 AM and 10:30 PM)
• Exam will be 75 minutes long (or longer for those with DRES
accommodations)
• Mainly multiple choice / multiple select
– No coding or complex calculations; mainly tests conceptual understanding
• You take it at home (open book) on PrairieLearn
• Not cheating
– Consult notes, practice questions/answers, slides, internet, etc.
• Cheating
– Talking to a classmate about the exam after one (but not both) of you has taken it
– Getting help from another person during the exam
– Obtaining past exam questions/answers
• You will not have time to look up all the answers, so do prepare by reviewing
slides, lectures, AML book, and practice questions
Midterm Exam Central Topics
• How does train/test error depend on
– Number of training samples
– Complexity of model
• Bias-variance trade-off, including meaning of “bias” and “variance”
for ML models and “overfitting”
• Basic function/form/assumptions of classification/regression
models (KNN, NB, linear/logistic regression, trees, SVMs, boosted
trees, random forests, ensembles)
• Entropy/Information gain
• Data organization and transformation: clustering, PCA
• Latent variables and robustness: EM, density estimation, robust estimation and fitting
• Gradient descent, SGD
KNN Usage Example: DeepFace (CVPR 2014)

1. Detect facial features
2. Align faces to be frontal
3. Extract features using a deep network while training a classifier to label each image with a person's identity (dataset based on employee faces)
4. At test time, extract features from the deep network and use a nearest neighbor classifier to assign identity

• Performs similarly to humans on the LFW dataset (Labeled Faces in the Wild)
• Can be used to organize photo albums, identify celebrities, or alert users when someone posts an image of them
• If this is used in a commercial deployment, what might be some unintended consequences?
• This algorithm is used by Facebook (though with expanded training data)
Example application of SVM: Dalal-Triggs 2005

• Detection by scanning window
  – Resize image to multiple scales and extract overlapping windows
  – Classify each window as positive or negative
• Very highly cited (40,000+) paper, mainly for HOG
• One of the best pedestrian detectors for several years

https://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf
“Semi-naïve Bayes” object detection

• Best performing face/car detector in 2000-2005
• Model probabilities of small groups of features (wavelet coefficients)
• Search for groupings, discretize features, estimate parameters

https://www.cs.cmu.edu/afs/cs.cmu.edu/user/hws/www/CVPR00.pdf
Human pose estimation with random forest

• Very simple features

• Lots of data

• Random Forest
Training (Parameter Learning)

[Diagram: Raw Features → Encoder → Decoder → Prediction, with Target Labels used to fit the Model]

• Raw features: discrete/continuous values, text, images, audio, structured/unstructured, few/many features, clean/noisy labels
• Encoder: trees, feature selection, clustering, kernels, density estimation, manual feature design, deep networks
• Decoder: linear regressor, logistic regressor, nearest neighbor, probabilistic model, SVM
• Prediction: category, continuous value, clusters, low dimensional embedding, pixel labels, generated text/image/audio, positions
Learning a model

𝜃* = argmin_𝜃 Loss(f(X; 𝜃), y)

• f(X; 𝜃): the model, e.g. y = wᵀx
• 𝜃: parameters of the model (e.g. w)
• (X, y): pairs of training samples
• Loss(): defines what makes a good model
  – Good predictions, e.g. minimize −Σₙ log P(yₙ | xₙ)
  – Likely parameters, e.g. minimize wᵀw
• Regularization and priors indicate preference for particular solutions, which tends to improve generalization (for well chosen parameters) and can be necessary to obtain a unique solution
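Below is a minimal sketch of this recipe (not from the slides): a linear model with squared-error loss plus an L2 penalty, fit by gradient descent. The data, step size, and regularization strength are made-up illustration values.

```python
# theta* = argmin_theta Loss(f(X; theta), y) for a linear model f(X; theta) = X @ theta,
# with squared-error loss and an L2 "likely parameters" penalty. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

lam, lr = 0.1, 0.05           # regularization strength and step size
theta = np.zeros(3)
for _ in range(500):
    residual = X @ theta - y                       # prediction error
    grad = X.T @ residual / len(y) + lam * theta   # d(Loss)/d(theta)
    theta -= lr * grad                             # gradient descent step
print(theta)  # close to true_w, shrunk slightly toward zero by the penalty
```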
Prediction using a model

y_t = f(x_t; 𝜃)

• Given some new set of input features x_t, the model predicts y_t
  – Regression: output y_t directly, possibly with some variance estimate
  – Classification
    • Output the most likely y_t directly, as in nearest neighbor
    • Output P(y_t | x_t), as in logistic regression
Model evaluation process
1. Collect/define training, validation, and test sets
2. Decide on some candidate models and hyperparameters
3. For each candidate:
a. Learn parameters with training set
b. Evaluate trained model on the validation set
4. Select best model
5. Evaluate best model’s performance on the test set
– Cross-validation can be used as an alternative
– Common measures include error or accuracy, root mean squared error, precision-recall
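A hedged sketch of this evaluation loop, using a placeholder dataset and candidate hyperparameters chosen purely for illustration: tune on a validation split, pick the best candidate, then evaluate once on the test set.

```python
# Train/validation/test model selection for a KNN classifier (illustrative only).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

best_k, best_acc = None, -1.0
for k in [1, 3, 5, 9]:                       # candidate models/hyperparameters
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = model.score(X_val, y_val)          # evaluate on the validation set
    if acc > best_acc:
        best_k, best_acc = k, acc

final = KNeighborsClassifier(n_neighbors=best_k).fit(X_trainval, y_trainval)
print("chosen k:", best_k, "test accuracy:", final.score(X_test, y_test))
```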
How to think about ML algorithms
• What is the model?
– What kinds of functions can it represent?
– What functions does it prefer? (regularization/prior)
• What is the objective function?
– What “values” are implied?
– The objective function does not always match the final evaluation metric
– Objectives are designed to be optimizable and improve generalization
• How do I optimize the model?
– How long does it take to train, and how does it depend on the amount of training
data or number of features?
– Can I reach a global optimum?
• How does the prediction work?
– How accurate is the prediction?
– How fast can I make a prediction for a new sample?
– Does my algorithm provide a confidence on its prediction?
Bias-Variance Trade-off

• Variance: due to limited data – different training samples will give different models that vary in predictions for the same test sample
• "Noise": irreducible error due to the data/problem
• Bias: error when the optimal model is learned from infinite data

The decomposition above is for regression, but the same error = variance + noise + bias² holds for classification error and logistic regression.
Underfitting vs. Overfitting

• How to detect high variance: test error is much higher than training error
• How to detect high bias or noise: the training error is high
• As you increase model complexity:
  – Training error will decrease
  – Test error may decrease (if you are currently "underfitting") or increase (if you are "overfitting")
What does “model complexity” mean?
• More parameters in the same structure, e.g. a deeper tree is
more complex than a shallow tree
• Less regularization, e.g. smaller regularization penalty
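An illustrative sketch of the complexity behavior described above, using decision tree depth as the complexity knob (dataset and depths are arbitrary choices, not from the slides):

```python
# Sweep "model complexity" via tree depth: training error keeps falling,
# test error typically follows a U shape.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [1, 2, 4, 8, 16, None]:          # None = grow until pure (most complex)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth, 1 - tree.score(X_train, y_train), 1 - tree.score(X_test, y_test))
```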
Performance vs training size

As we get more training data (for a fixed model):
1. The same model has more difficulty fitting the training data
2. But the test error becomes closer to the training error (reduced generalization error)
3. Overall test performance improves

[Figure: error vs. number of training examples, with the test curve above the training curve. The gap between the test curve and the test error with infinite training examples is due to limited training data (model variance) and distribution shift; the gap between the test and train errors with infinite training examples is due to differences in P(y|x) between training and test (function shift); the remaining training error is due to the limited power of the model (model bias) and unavoidable intrinsic error (Bayes optimal error).]
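A hedged sketch of this learning-curve behavior using sklearn's learning_curve helper on a placeholder dataset: training error tends to rise and validation error to fall as the training set grows, and the gap shrinks.

```python
# Learning curve: error vs. number of training examples (illustrative only).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train error={1-tr:.3f}  val error={1-va:.3f}")
```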
Classification methods

Nearest Neighbor
• Type: instance-based
• Decision boundary: partition by example distance
• Model / prediction: i* = argmin_i dist(X_train[i], x);  y* = y_train[i*]
• Strengths: low bias; no training time; widely applicable; simple
• Limitations: relies on good input features; slow prediction (in basic implementation)

Naïve Bayes
• Type: probabilistic
• Decision boundary: usually linear
• Model / prediction: y* = argmax_y Π_i P(x_i | y) P(y)
• Strengths: can be estimated from limited data; simple; fast training/prediction
• Limitations: limited modeling power

Logistic Regression
• Type: probabilistic
• Decision boundary: usually linear
• Model / prediction: y* = argmax_y P(y | x)
• Strengths: powerful in high dimensions; widely applicable; good confidence estimates; fast prediction
• Limitations: relies on good input features

Decision Tree
• Type: probabilistic
• Decision boundary: partition by selected boundaries
• Model / prediction: conjunctive rules; y* = leaf(x)
• Strengths: explainable decision function; widely applicable; does not require feature scaling
• Limitations: one tree tends to either generalize poorly or underfit the data
Classification methods (extended), assuming x in {0, 1}

Naïve Bayes
• Learning objective: maximize Σ_i [ Σ_j log P(x_ij | y_i; θ_j) + log P(y_i; θ_0) ]
• Training: θ_kj = [ Σ_i δ(x_ij = 1 ∧ y_i = k) + r ] / [ Σ_i δ(y_i = k) + K r ]
• Inference: θ_1ᵀ x + θ_0ᵀ (1 − x) > 0,
  where θ_1j = log [ P(x_j = 1 | y = 1) / P(x_j = 1 | y = 0) ]
  and θ_0j = log [ P(x_j = 0 | y = 1) / P(x_j = 0 | y = 0) ]

Logistic Regression
• Learning objective: minimize Σ_i −log P(y_i | x_i, θ) + λ‖θ‖,
  where P(y_i | x_i, θ) = 1 / (1 + exp(−y_i θᵀ x_i))
• Training: gradient descent
• Inference: θᵀ x > t

Linear SVM
• Learning objective: minimize λ Σ_i ξ_i + ½‖θ‖² such that y_i θᵀ x_i ≥ 1 − ξ_i for all i, ξ_i ≥ 0
• Training: quadratic programming or subgradient optimization
• Inference: θᵀ x > t

Kernelized SVM
• Learning objective: complicated to write
• Training: quadratic programming
• Inference: Σ_i y_i α_i K(x̂_i, x) > 0

Nearest Neighbor
• Learning objective: most similar features → same label
• Training: record the data
• Inference: y_i, where i = argmin_i K(x̂_i, x)

* Notation may differ from the previous slide
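As a concrete illustration of the Naïve Bayes training and inference rows above, here is a hedged from-scratch sketch for binary features. It uses the smoothed count estimates for θ_kj and computes the log-posterior directly (equivalent to the log-ratio decision rule in the table); all names and the toy data are my own.

```python
# Bernoulli Naive Bayes with add-r smoothing, matching the count formulas above.
import numpy as np

def train_nb(X, y, r=1.0, K=2):
    # theta[k, j] = (count(x_j = 1 and y = k) + r) / (count(y = k) + K*r)
    theta = np.zeros((K, X.shape[1]))
    prior = np.zeros(K)
    for k in range(K):
        Xk = X[y == k]
        theta[k] = (Xk.sum(axis=0) + r) / (len(Xk) + K * r)
        prior[k] = len(Xk) / len(X)
    return theta, prior

def predict_nb(x, theta, prior):
    # log P(y=k) + sum_j [ x_j log theta_kj + (1 - x_j) log(1 - theta_kj) ]
    logpost = np.log(prior) + x @ np.log(theta).T + (1 - x) @ np.log(1 - theta).T
    return logpost.argmax(axis=1)

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = (X[:, 0] | X[:, 1]).astype(int)        # toy labels correlated with two features
theta, prior = train_nb(X, y)
print((predict_nb(X, theta, prior) == y).mean())
```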


Regression methods

Nearest Neighbor
• Type: instance-based
• Decision boundary: partition by example distance
• Model / prediction: i* = argmin_i dist(X_train[i], x);  y* = y_train[i*]
• Strengths: low bias; no training time; widely applicable; simple
• Limitations: relies on good input features; slow prediction (in basic implementation)

Naïve Bayes
• Type: probabilistic
• Decision boundary: usually linear
• Model / prediction: y* = argmax_y Π_i P(x_i | y) P(y)
• Strengths: can be estimated from limited data; simple; fast training/prediction
• Limitations: limited modeling power

Linear Regression
• Type: data fit
• Decision boundary: linear
• Model / prediction: y* = wᵀ x
• Strengths: powerful in high dimensions; widely applicable; fast prediction; coefficients may be interpretable
• Limitations: relies on good input features

Decision Tree
• Type: probabilistic
• Decision boundary: partition by selected boundaries
• Model / prediction: conjunctive rules; y* = leaf(x)
• Strengths: explainable decision function; widely applicable; does not require feature scaling
• Limitations: one tree tends to either generalize poorly or underfit the data
Ensembles

• Ensembles improve accuracy by reducing bias and/or variance
• Boosting minimizes bias by fixing previous mistakes, e.g. a boosted decision tree classifier
• Averaging over predictions from multiple models minimizes variance, e.g. random forests
• Random forests and boosted trees are powerful classifiers and useful for a wide variety of problems
Questions
https://tinyurl.com/cs441midtermreview
Summaries
Working with Data (L2)

• Machine learning is fitting parameters of a model so that you can accurately predict one set of numbers from another set of numbers: f(x; 𝜃) → y
• Something can take a lot of data storage but provide little information, or vice versa
• The predictiveness or information gain of the features depends on how they are modeled
Clustering (L3)

• Similarity is foundational to machine learning
• Use highly optimized libraries like FAISS for search/retrieval
• Approximate search methods like LSH can be used to find similar points quickly
• TF-IDF is used for similarity of tokenized documents and used with an index for fast search
• Clustering groups similar data points
• K-means is the must-know method, but there are many others
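A minimal k-means sketch (assumed example, not from the lecture): cluster synthetic 2-D points and inspect the centers and assignments.

```python
# K-means clustering on three synthetic blobs.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                    for c in [(0, 0), (3, 3), (0, 3)]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)      # roughly (0,0), (3,3), (0,3)
print(kmeans.labels_[:10])          # cluster index for the first few points
```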
KNN (L4)

• KNN is a simple but effective classifier/regressor that predicts the label of the most similar training example(s)
• Larger K gives a smoother prediction function
• Test error is composed of bias (model too simple/smooth to fit the data) and variance (model too complex to learn from the training data)
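An illustrative sketch of the effect of K (dataset and K values are arbitrary choices): small K overfits, very large K over-smooths.

```python
# Effect of K on KNN: k=1 fits training data perfectly (high variance),
# very large k gives a smoother, higher-bias prediction function.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in [1, 5, 25, 101]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))
```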
PCA/Embedding (L5)

• PCA reduces dimensions by linear projection
  – Preserves variance to reproduce the data as well as possible, according to mean squared error
  – May not preserve local connectivity structure or discriminative information
• Other methods try to preserve relationships between points
  – MDS: preserve pairwise distances
  – IsoMap: MDS but using a graph-based distance
  – t-SNE: preserve a probabilistic distribution of neighbors for each point (also focusing on the closest points)
  – UMAP: incorporates k-NN structure, spectral embedding, and more to achieve good embeddings relatively quickly
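A hedged PCA sketch on a placeholder dataset: project 64-dimensional digit features to 2 dimensions and check how much variance the projection preserves.

```python
# PCA: linear projection onto the top principal components.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)                       # linear projection to 2-D
print(X_2d.shape)                             # (1797, 2)
print(pca.explained_variance_ratio_.sum())    # fraction of variance kept by 2 components
```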
Linear Regression (L6)

• Linear regression fits a linear model to a set of feature points to predict a continuous value
  – Explain relationships
  – Predict values
  – Extrapolate observations
• Regularization prevents overfitting by restricting the magnitude of the feature weights
  – L1: prefers to assign a lot of weight to the most useful features
  – L2: prefers to assign smaller weights to everything
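An illustrative L1 vs. L2 sketch on synthetic data (penalty strengths are arbitrary): Lasso (L1) drives the weights of uninformative features to exactly zero, Ridge (L2) shrinks all weights.

```python
# Compare L1 (Lasso) and L2 (Ridge) regularization on data where only 2 of 10 features matter.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

print(Ridge(alpha=1.0).fit(X, y).coef_.round(2))   # small weights spread everywhere
print(Lasso(alpha=0.1).fit(X, y).coef_.round(2))   # most weights driven exactly to 0
```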
Linear Classifiers (L7)

• Linear logistic regression and linear SVM are classification techniques that aim to split features between two classes with a linear model
  – Predict categorical values with confidence
• Logistic regression maximizes confidence in the correct label, while SVM just tries to be confident enough
• Non-linear versions of SVMs can also work well and were once popular (but have been almost entirely replaced by deep networks)
• Nearest neighbor and linear models are the final predictors of most ML algorithms – the complexity lies in finding features that work well with NN or linear models
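A hedged sketch comparing the two linear classifiers on synthetic data: logistic regression outputs P(y|x), while a linear SVM outputs only a signed margin score.

```python
# Logistic regression gives probabilities; LinearSVC gives a signed decision value.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X, y)
svm = LinearSVC().fit(X, y)

print(logreg.predict_proba(X[:3]))       # P(y|x) for the first few samples
print(svm.decision_function(X[:3]))      # signed distance to the separating hyperplane
```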
Probability / Naïve Bayes (L8)

• Probabilistic models are a large class of machine learning methods
• Naïve Bayes assumes that features are independent given the label: P(x, y) = Π_i P(x_i | y) P(y)
  – Easy/fast to estimate parameters
  – Less risk of overfitting when data is limited
• You can look up how to estimate parameters for most common probability models
  – Or take the partial derivative of the total data/label likelihood with respect to each parameter
• Prediction involves finding the y that maximizes P(x, y), either by trying all y or by solving a partial derivative
• Maximizing log P(x, y) is equivalent to maximizing P(x, y) and often much easier
EM (L9)

• EM is a widely applicable algorithm for solving for latent variables and parameters that make the observed data likely
  – E-step: compute the likelihoods of the values of the latent variables
  – M-step: solve for the most likely model parameters, using the likelihoods from the E-step as weights
• While the derivation is long and somewhat complicated, the application is simple
• EM is used, for example, in mixture of Gaussians and topic models

[Figure: estimated annotator scores; green = true, red = prediction; good annotators: 0, 1, 3]
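A hedged mixture-of-Gaussians sketch on synthetic 1-D data: sklearn's GaussianMixture runs the E and M steps internally, and predict_proba exposes the E-step's soft assignments (responsibilities).

```python
# Fit a 2-component Gaussian mixture with EM and inspect the responsibilities.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 1.0, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gmm.means_.ravel())            # roughly -2 and 2
print(gmm.predict_proba(data[:3]))   # per-point component responsibilities (E-step output)
```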
PDF Estimation (L10)

Parametric models
• Description: assumes a fixed form for the density with limited parameters
• Examples: Gaussian, exponential
• Good when: the model is able to approximately fit the distribution
• Not good when: the model cannot approximate the distribution

Semi-parametric models
• Description: can fit a broad range of functions
• Examples: mixture of Gaussians
• Good when: the distribution is low dimensional or smooth
• Not good when: the data is high dimensional

Non-parametric models
• Description: can fit any distribution
• Examples: discretization, kernel density estimation
• Good when: data is 1-D
• Not good when: the distribution is not smooth; challenging in high dimensions
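A sketch (assumed example) contrasting a parametric fit with a non-parametric one on bimodal 1-D data: a single Gaussian cannot approximate the distribution, while a kernel density estimate can.

```python
# Parametric (single Gaussian) vs. non-parametric (kernel density estimate) fit.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])

mu, sigma = data.mean(), data.std()          # parametric: fit one Gaussian
kde = stats.gaussian_kde(data)               # non-parametric: kernel density estimate

x = 0.0                                      # a point between the two modes
print(stats.norm(mu, sigma).pdf(x))          # single Gaussian wrongly puts high density here
print(kde(x)[0])                             # KDE correctly assigns low density
```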
Robust Estimation (L11)

• Median and quantiles are robust to outliers, while mean/min/max aren't
• Outliers can be detected as low probability points, low density points, poorly compressible points, or through 2D visualizations
• Least squares is not robust to outliers. Use RANSAC, IRLS, or a robust loss function instead.
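A hedged sketch on synthetic data: ordinary least squares is pulled away from the true slope by a few large outliers, while RANSAC fits on inlier subsets and recovers it.

```python
# Least squares vs. RANSAC in the presence of outliers.
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(0, 0.5, 100)
y[:10] += 40                                  # corrupt 10% of points with large outliers

print(LinearRegression().fit(X, y).coef_)                           # slope biased away from 2
print(RANSACRegressor(random_state=0).fit(X, y).estimator_.coef_)   # close to 2
```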
Trees (L12)

• Decision/regression trees learn to split up the feature space into partitions with similar values
• Entropy is a measure of uncertainty
• Information gain measures how much particular knowledge reduces prediction uncertainty
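A worked sketch of entropy and information gain for one candidate binary split (the counts are made up): gain = H(parent) − weighted average of H(children).

```python
# Entropy and information gain for a candidate split.
import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -(p * np.log2(p)).sum()

parent = [10, 10]                     # 10 positives, 10 negatives
left, right = [9, 1], [1, 9]          # split sends most positives left

h_parent = entropy(parent)
h_split = (10 / 20) * entropy(left) + (10 / 20) * entropy(right)
print(h_parent, h_split, h_parent - h_split)   # 1.0, ~0.469, gain ~0.531
```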
Ensembles (L13)

• Ensembles improve accuracy and confidence estimates by reducing bias and/or variance
• Boosted trees minimize bias by fixing previous mistakes
• Random forests minimize variance by averaging over multiple different trees
• Random forests and boosted trees are powerful classifiers and useful for a wide variety of problems
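A hedged sketch comparing a single tree with the two ensemble strategies on the same synthetic data (dataset parameters are arbitrary); both ensembles usually beat the single tree.

```python
# Single tree vs. random forest (averaging) vs. gradient boosting.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in [DecisionTreeClassifier(random_state=0),           # single tree baseline
              RandomForestClassifier(random_state=0),           # averaging reduces variance
              GradientBoostingClassifier(random_state=0)]:      # boosting reduces bias
    print(type(model).__name__, model.fit(X_train, y_train).score(X_test, y_test))
```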
SGD (L14)

• Gradient descent iteratively takes a step in the negative gradient direction of the full objective function, to minimize a loss function
• Stochastic gradient descent (SGD) estimates the gradient using a subset of examples
  – Smaller batches require much less compute to evaluate but give a noisier estimate of the gradient
  – Faster than GD
  – Can escape local minima
• Learning rate (step size) and schedule are important factors in the speed and stability of the optimization
• Optimization problems for linear models are convex, and have a single local optimum
• MLPs and deep networks have many local optima, so they are harder to optimize well
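A minimal minibatch SGD sketch for a linear least-squares objective (batch size, learning rate, and data are illustrative assumptions): each step uses a noisy gradient computed on a small random batch rather than the full dataset.

```python
# Minibatch SGD on a linear model with squared-error loss.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

theta, lr, batch_size = np.zeros(3), 0.05, 32
for step in range(2000):
    idx = rng.integers(0, len(y), size=batch_size)      # sample a minibatch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ theta - yb) / batch_size        # noisy gradient estimate
    theta -= lr * grad
print(theta)   # close to [1.5, -2.0, 0.5]
```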
