
CS 229 Midterm Review

Course Staff Fall 2018

11/2/2018
Outline
Today:

SVMs

Kernels

Tree Ensembles

EM Algorithm / Mixture Models

[ Focus on building intuition, less so on solving specific problems. Ask questions! ]


SVMs
Optimal margin classifier
Two classes, separable by a linear decision boundary.

But first … what is a hyperplane?

○ In d-dimensional space, a (d−1)-dimensional affine subspace

○ Examples: a line in 2D, a plane in 3D

Hyperplane in d-dimensional space: the set of points x satisfying w^T x + b = 0.

[Separates the space into two half-spaces.]

Hyperplanes
Optimal margin classifier
Idea:

Use a separating hyperplane for binary classification.

Key assumption:

Classes can be separated by a linear decision boundary.
Optimal margin classifier
To classify new data points:

Assign the class according to the location of the new point with respect to the hyperplane: the sign of w^T x + b.
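A minimal numpy sketch of this prediction rule, with a made-up weight vector w and intercept b (hypothetical values, not fit to any data):

```python
import numpy as np

# Hypothetical hyperplane parameters (illustrative, not learned from data).
w = np.array([2.0, -1.0])   # normal vector to the hyperplane
b = -0.5                    # intercept

def classify(x):
    """Predict +1 or -1 by which half-space x falls in."""
    return 1 if w @ x + b >= 0 else -1

print(classify(np.array([1.0, 0.0])))   # w.x + b = 1.5  -> +1
print(classify(np.array([0.0, 2.0])))   # w.x + b = -2.5 -> -1
```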
Optimal margin classifier
Problem:

Many possible separating hyperplanes!
Optimal margin classifier
Which linear decision boundary?

The separating hyperplane “farthest” from the training data.

➔ “Optimal margin”
Optimal margin classifier
Which linear decision boundary?

The separating hyperplane “farthest” from the training data.

Margin: the smallest distance between any training observation and the hyperplane.

Support vectors: the training observations that lie exactly at this smallest distance (equidistant from the hyperplane).
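A hedged sketch of the optimal margin classifier, assuming scikit-learn is available: a linear SVC with a very large C approximates the hard-margin solution, exposes its support vectors, and has geometric margin 1/||w||. The toy data below is made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -3.0], [-2.5, -3.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# Very large C ~ hard margin: essentially no violations allowed.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
margin = 1.0 / np.linalg.norm(w)        # geometric margin
print("support vectors:\n", clf.support_vectors_)
print("margin:", margin)
```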
Regularization and the non-separable case
Disadvantage

Can be sensitive to individual observations.

May overfit the training data.
Regularization and the non-separable case
So far we’ve assumed that classes can be separated by a linear decision boundary.

What if there’s no separating hyperplane?
Regularization and the non-separable case
What if there’s no separating hyperplane?

Support Vector Classifier:

Allows training samples on the “wrong side” of the margin or hyperplane.
Regularization and the non-separable case
Penalty parameter C

“Budget” for violations: the total slack is bounded by C, so at most C training observations can be misclassified.

Support vectors

Observations on the margin or violating the margin.
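A sketch of the effect of the penalty parameter, assuming scikit-learn's SVC. Note that sklearn's C is a penalty weight on violations (larger C means fewer violations tolerated), which is roughly the inverse of the “budget” convention above.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes: no perfect separating hyperplane exists.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=3.0, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Smaller C tolerates more margin violations -> more support vectors.
    print(f"C={C:>6}: {len(clf.support_)} support vectors")
```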
Quiz
Non-linear decision boundary
Disadvantage

What if we need a non-linear decision boundary?
Expanding feature space
Some data sets are not linearly separable...

But they become linearly separable when transformed into a higher-dimensional space.
Expanding feature space

[Left panel: variables X1, X2. Right panel: variables X1, X2, X1X2.]


Non-linear decision boundary
Suppose our original data has d features: X1, …, Xd.

Expand the feature space to include quadratic terms: X1, …, Xd, X1^2, …, Xd^2, and cross terms such as X1X2.

The decision boundary will be non-linear in the original feature space (e.g. an ellipse), but linear in the expanded feature space.
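A minimal sketch of this expansion using scikit-learn's PolynomialFeatures (degree 2); a linear boundary in the expanded space corresponds to a quadratic boundary, e.g. an ellipse, in the original space.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])           # d = 2 original features

poly = PolynomialFeatures(degree=2, include_bias=False)
X_expanded = poly.fit_transform(X)

print(poly.get_feature_names_out())  # ['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
print(X_expanded)                    # quadratic terms added per example
```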
Non-linear decision boundary
A large number of features becomes computationally challenging.

We need an efficient way to work with a large number of features.


Kernels
Kernels
Kernel: a generalization of the inner product.

Kernels (implicitly) map data into a higher-dimensional space.

Why use kernels instead of explicitly constructing the larger feature space?

● Computational advantage when n << d [see the following slide]


Kernels
Consider two equivalent ways to represent a linear classifier.

Left: store the parameter vector θ ∈ R^d (one weight per feature).

Right: store coefficients α ∈ R^n with θ = Σ_i α_i x^(i), so predictions only need inner products ⟨x^(i), x⟩, i.e. kernel values.

If d >> n, it is much more space-efficient to use the kernelized representation.
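A small sketch of the kernel idea for the degree-2 polynomial kernel K(x, z) = (x^T z + 1)^2: it equals the inner product of explicit quadratic feature maps but never constructs those features. (The feature map phi below is one standard choice, written out for 2-D inputs.)

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input (6 features)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    """Polynomial kernel: computed in the original 2-D space."""
    return (x @ z + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(phi(x) @ phi(z))      # inner product in the expanded space
print(poly_kernel(x, z))    # same value, without ever building phi
```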


Tree Ensembles
Tree Ensembles
Decision Tree: recursively partition the feature space to make predictions.

Prediction: simply predict the average of the labels in the leaf (the majority class for classification).


Tree Ensembles
Recursive partitioning: split on thresholds of individual features.

How to choose a split? Pick the one that minimizes the (weighted) average loss of the children it produces.

Classification: cross-entropy loss

Regression: mean squared-error loss
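A minimal numpy sketch (regression case, one feature) of choosing a split threshold by the weighted average squared-error loss of the two children it produces; the data at the bottom is made up for illustration.

```python
import numpy as np

def child_mse(y):
    """Squared-error loss of a leaf that predicts the mean label."""
    return np.mean((y - y.mean()) ** 2) if len(y) else 0.0

def best_split(x, y):
    """Scan thresholds on one feature; return the one with the
    lowest weighted average child loss."""
    best_t, best_loss = None, np.inf
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        loss = (len(left) * child_mse(left) + len(right) * child_mse(right)) / len(y)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t, best_loss

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.2, 4.8])
print(best_split(x, y))   # threshold 3.0 separates the two label groups
```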


Tree Ensembles
Decision tree tuning:

1. Minimum leaf size
2. Maximum depth
3. Maximum number of nodes
4. Minimum decrease in loss
5. Pruning with a validation set

Advantage: easy to interpret!

Disadvantage: easy to overfit.


Tree Ensembles
Random forests: take the average prediction of many decision trees,

1. each constructed on a bagged version of the original dataset,
2. using only a random subset (typically √p) of the features per split.

Each individual tree has low bias but high variance; averaging many (decorrelated) trees together yields low variance.

Bagging: a resample of the same size as the original dataset, drawn with replacement.
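A hedged sketch, assuming scikit-learn: a random forest with bootstrap resampling and sqrt(p) features per split, compared against a single unpruned tree via cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)          # single deep tree
forest = RandomForestClassifier(n_estimators=200,
                                max_features="sqrt",   # sqrt(p) features per split
                                bootstrap=True,        # bagged datasets
                                random_state=0)

print("tree  :", cross_val_score(tree, X, y, cv=5).mean())
print("forest:", cross_val_score(forest, X, y, cv=5).mean())
```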
EM Algorithm / Mixtures
Mixture Models
Gaussian Mixture Model

We have n data points x^(1), …, x^(n), which we suppose come from k Gaussians.

Here, z^(i) ∈ {1, …, k} is a latent variable indicating which Gaussian generated x^(i), with z^(i) ~ Multinomial(φ).

We hypothesize x^(i) | z^(i) = j ~ N(μ_j, Σ_j).
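A sketch of this generative story with made-up parameters (k = 2 components in 2-D): draw z from a multinomial with weights φ, then draw x from the corresponding Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical GMM parameters (k = 2 components in 2-D).
phi = np.array([0.3, 0.7])                        # mixing weights p(z = j)
mu = np.array([[0.0, 0.0], [5.0, 5.0]])           # component means
Sigma = np.array([np.eye(2), 2.0 * np.eye(2)])    # component covariances

def sample(n):
    z = rng.choice(len(phi), size=n, p=phi)       # latent component labels
    x = np.array([rng.multivariate_normal(mu[j], Sigma[j]) for j in z])
    return x, z

X, Z = sample(1000)
print(X.shape, np.bincount(Z) / len(Z))           # label frequencies roughly match phi
```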
Mixture Models
GMMs can be extremely effective at modeling distributions!

[Richardson and Weiss 2018: On GANs and GMMs]


EM Algorithm
Problem: how do we estimate parameters when there are latent variables?

Idea: maximize the marginal likelihood ℓ(θ) = Σ_i log p(x^(i); θ) = Σ_i log Σ_z p(x^(i), z; θ).

But this is difficult to maximize directly: the sum over z sits inside the log!

Example: Mixture of Gaussians.


EM Algorithm
Algorithm

1. Begin with an initial guess for the parameters θ.

2. Alternate between:
a. [E-step] Hallucinate the missing values by computing, for all possible values z, the posterior p(z^(i) = z | x^(i); θ).
b. [M-step] Use the hallucinated dataset to maximize a lower bound on the log-likelihood, updating θ.

EM Algorithm
[E-step] Hallucinate the missing values by computing, for all possible values z, the posterior p(z^(i) = z | x^(i); θ).

In the GMM example: for each data point x^(i), compute the responsibilities w_j^(i) = p(z^(i) = j | x^(i); φ, μ, Σ).

Now repeat for all data points.

This creates an “augmented” dataset, with each copy weighted by its posterior probability.


EM Algorithm
[M-step] Use the hallucinated dataset to maximize a lower bound on the log-likelihood.

We now simply fit the parameters using the augmented dataset, with the responsibilities as weights.
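A compact numpy/scipy sketch of one EM pass for a GMM, following the E-step/M-step description above: compute responsibilities, then perform weighted maximum-likelihood updates. The function em_step and its argument names are illustrative conveniences, not course starter code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, phi, mu, Sigma):
    n, k = X.shape[0], len(phi)

    # E-step: responsibilities w[i, j] = p(z_i = j | x_i; current params).
    w = np.zeros((n, k))
    for j in range(k):
        w[:, j] = phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
    w /= w.sum(axis=1, keepdims=True)

    # M-step: weighted MLE updates using the "augmented" dataset.
    Nj = w.sum(axis=0)
    phi = Nj / n
    mu = (w.T @ X) / Nj[:, None]
    Sigma = np.array([
        ((X - mu[j]).T * w[:, j]) @ (X - mu[j]) / Nj[j]
        for j in range(k)
    ])
    return phi, mu, Sigma
```

Iterating em_step until the marginal log-likelihood stops increasing gives the usual EM fit.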


EM Algorithm
Theory

It turns out that we’re maximizing a lower bound on the true log-likelihood.

Here we use Jensen’s inequality: for any distribution Q over z,
log Σ_z p(x, z; θ) = log Σ_z Q(z) p(x, z; θ)/Q(z) ≥ Σ_z Q(z) log [p(x, z; θ)/Q(z)].


EM Algorithm
Intuition
EM Algorithm
Sanity Check

EM is guaranteed to converge to a local optimum.

Why?

Runtime per iteration?


EM Algorithm
Sanity Check

EM is guaranteed to converge to a local optimum.

Why?

Our estimated log-likelihood only ever increases.

Runtime per iteration?

In general, we need to hallucinate one new data point per possible value of z.

For GMMs, we only need to hallucinate n·k weighted data points (k per example), thanks to independence across examples.


K Means and More GMMs
K Means - Motivation
K Means - Algorithm(*)
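A minimal numpy sketch of the standard K-means (Lloyd's) iteration referenced on the algorithm slide: alternate hard assignment of each point to the nearest centroid with recomputing each centroid as the mean of its cluster.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # init from data

    for _ in range(n_iters):
        # Assignment step: nearest centroid for each point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```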
K Means
Real-World Example: Mixture of Gaussians

Note: data is unlabeled


How do we fit a GMM? The EM algorithm
Observations
Only the distribution of X matters. You can change things like the ordering of the components without affecting the distribution, and hence without affecting the algorithm.

Mixing two distributions from a parametric family might give us a third distribution from the same family: a mixture of 2 Bernoullis is another Bernoulli.

Probabilistic clustering: putting similar data points together into “clusters”, where the clusters are represented by the component distributions.
Sample Question

How do constraints on the covariance matrix change the Gaussian that is being fit?

Tied: all components share a single covariance matrix.
Diag: each covariance matrix is diagonal (axis-aligned ellipsoids).
Spherical: each covariance is a scalar multiple of the identity (in the limit, K-Means!).
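A sketch of this sample question using scikit-learn's GaussianMixture, whose covariance_type option corresponds to these constraints ('tied', 'diag', 'spherical', plus the unconstrained 'full'):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

for cov_type in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                          random_state=0).fit(X)
    # Fewer free covariance parameters -> more constrained ellipsoids.
    print(f"{cov_type:>9}: covariances shape {gmm.covariances_.shape}")
```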
