
CS 229 Midterm Review

Course Staff Fall 2018

11/2/2018
Outline
Today:

SVMs

Kernels

Tree Ensembles

EM Algorithm / Mixture Models

[ Focus on building intuition, less so on solving specific problems. Ask questions! ]


SVMs
Optimal margin classifier
Two classes, separable by a linear decision boundary.

But first … what is a hyperplane?

○ In d-dimensional space, a (d−1)-dimensional affine subspace

○ Examples: a line in 2D, a plane in 3D

Hyperplane in d-dimensional space: the set of points x satisfying w^T x + b = 0.

[Separates the space into two half-spaces.]

Hyperplanes
Optimal margin classifier
Idea:

Use a separating hyperplane for binary classification.

Key assumption:

Classes can be separated by a linear decision boundary.
Optimal margin classifier
To classify new data points:

Assign the class according to the location of the new point with respect to the hyperplane: the sign of w^T x + b.
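A minimal numpy sketch of this prediction rule, with a made-up weight vector w and intercept b (hypothetical values, not fit to any data):

```python
import numpy as np

# Hypothetical hyperplane parameters (illustrative, not learned from data).
w = np.array([2.0, -1.0])   # normal vector to the hyperplane
b = -0.5                    # intercept

def classify(x):
    """Predict +1 or -1 by which half-space x falls in."""
    return 1 if w @ x + b >= 0 else -1

print(classify(np.array([1.0, 0.0])))   # w.x + b = 1.5  -> +1
print(classify(np.array([0.0, 2.0])))   # w.x + b = -2.5 -> -1
```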
Optimal margin classifier
Problem:

Many possible separating hyperplanes!
Optimal margin classifier
Which linear decision boundary?

The separating hyperplane “farthest” from the training data.

➔ “Optimal margin”
Optimal margin classifier
Which linear decision boundary?

The separating hyperplane “farthest” from the training data.

Margin: the smallest distance between any training observation and the hyperplane.

Support vectors: the training observations that lie exactly at this smallest distance (equidistant from the hyperplane).
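A hedged sketch of the optimal margin classifier, assuming scikit-learn is available: a linear SVC with a very large C approximates the hard-margin solution, exposes its support vectors, and has geometric margin 1/||w||. The toy data below is made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -3.0], [-2.5, -3.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# Very large C ~ hard margin: essentially no violations allowed.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
margin = 1.0 / np.linalg.norm(w)        # geometric margin
print("support vectors:\n", clf.support_vectors_)
print("margin:", margin)
```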
Regularization and the non-separable case
Disadvantage

Can be sensitive to individual observations.

May overfit the training data.
Regularization and the non-separable case
So far we’ve assumed that classes can be separated by a linear decision boundary.

What if there’s no separating hyperplane?
Regularization and the non-separable case
What if there’s no separating hyperplane?

Support Vector Classifier:

Allows training samples on the “wrong side” of the margin or hyperplane.
Regularization and the non-separable case
Penalty parameter C

“Budget” for violations: the total slack is bounded by C, so at most C training observations can be misclassified.

Support vectors

Observations on the margin or violating the margin.
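A sketch of the effect of the penalty parameter, assuming scikit-learn's SVC. Note that sklearn's C is a penalty weight on violations (larger C means fewer violations tolerated), which is roughly the inverse of the “budget” convention above.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes: no perfect separating hyperplane exists.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=3.0, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Smaller C tolerates more margin violations -> more support vectors.
    print(f"C={C:>6}: {len(clf.support_)} support vectors")
```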
Quiz
Non-linear decision boundary
Disadvantage

What if we need a non-linear decision boundary?
Expanding feature space
Some data sets are not linearly separable...

But they become linearly separable when transformed into a higher-dimensional space.
Expanding feature space

[Left panel: variables X1, X2. Right panel: variables X1, X2, X1X2.]


Non-linear decision boundary
Suppose our original data has d features: X1, …, Xd.

Expand the feature space to include quadratic terms: X1, …, Xd, X1^2, …, Xd^2, and cross terms such as X1X2.

The decision boundary will be non-linear in the original feature space (e.g. an ellipse), but linear in the expanded feature space.
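A minimal sketch of this expansion using scikit-learn's PolynomialFeatures (degree 2); a linear boundary in the expanded space corresponds to a quadratic boundary, e.g. an ellipse, in the original space.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])           # d = 2 original features

poly = PolynomialFeatures(degree=2, include_bias=False)
X_expanded = poly.fit_transform(X)

print(poly.get_feature_names_out())  # ['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
print(X_expanded)                    # quadratic terms added per example
```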
Non-linear decision boundary
A large number of features becomes computationally challenging.

We need an efficient way to work with a large number of features.


Kernels
Kernels
Kernel: a generalization of the inner product.

Kernels (implicitly) map data into a higher-dimensional space.

Why use kernels instead of explicitly constructing the larger feature space?

● Computational advantage when n << d [see the following slide]


Kernels
Consider two equivalent ways to represent a linear classifier.

Left: store the parameter vector θ ∈ R^d (one weight per feature).

Right: store coefficients α ∈ R^n with θ = Σ_i α_i x^(i), so predictions only need inner products ⟨x^(i), x⟩, i.e. kernel values.

If d >> n, it is much more space-efficient to use the kernelized representation.
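A small sketch of the kernel idea for the degree-2 polynomial kernel K(x, z) = (x^T z + 1)^2: it equals the inner product of explicit quadratic feature maps but never constructs those features. (The feature map phi below is one standard choice, written out for 2-D inputs.)

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input (6 features)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    """Polynomial kernel: computed in the original 2-D space."""
    return (x @ z + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(phi(x) @ phi(z))      # inner product in the expanded space
print(poly_kernel(x, z))    # same value, without ever building phi
```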


Tree Ensembles
Tree Ensembles
Decision Tree: recursively partition the feature space to make predictions.

Prediction: simply predict the average of the labels in the leaf (the majority class for classification).


Tree Ensembles
Recursive partitioning: split on thresholds of individual features.

How to choose a split? Pick the one that minimizes the (weighted) average loss of the children it produces.

Classification: cross-entropy loss

Regression: mean squared-error loss
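A minimal numpy sketch (regression case, one feature) of choosing a split threshold by the weighted average squared-error loss of the two children it produces; the data at the bottom is made up for illustration.

```python
import numpy as np

def child_mse(y):
    """Squared-error loss of a leaf that predicts the mean label."""
    return np.mean((y - y.mean()) ** 2) if len(y) else 0.0

def best_split(x, y):
    """Scan thresholds on one feature; return the one with the
    lowest weighted average child loss."""
    best_t, best_loss = None, np.inf
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        loss = (len(left) * child_mse(left) + len(right) * child_mse(right)) / len(y)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t, best_loss

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.2, 4.8])
print(best_split(x, y))   # threshold 3.0 separates the two label groups
```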


Tree Ensembles
Decision tree tuning:

1. Minimum leaf size
2. Maximum depth
3. Maximum number of nodes
4. Minimum decrease in loss
5. Pruning with a validation set

Advantage: easy to interpret!

Disadvantage: easy to overfit.


Tree Ensembles
Random forests: take the average prediction of many decision trees,

1. each constructed on a bagged version of the original dataset,
2. using only a random subset (typically √p) of the features per split.

Each individual tree has low bias but high variance; averaging many (decorrelated) trees together yields low variance.

Bagging: a resample of the same size as the original dataset, drawn with replacement.
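A hedged sketch, assuming scikit-learn: a random forest with bootstrap resampling and sqrt(p) features per split, compared against a single unpruned tree via cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)          # single deep tree
forest = RandomForestClassifier(n_estimators=200,
                                max_features="sqrt",   # sqrt(p) features per split
                                bootstrap=True,        # bagged datasets
                                random_state=0)

print("tree  :", cross_val_score(tree, X, y, cv=5).mean())
print("forest:", cross_val_score(forest, X, y, cv=5).mean())
```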
EM Algorithm / Mixtures
Mixture Models
Gaussian Mixture Model

We have n data points x^(1), …, x^(n), which we suppose come from k Gaussians.

Here, z^(i) ∈ {1, …, k} is a latent variable indicating which Gaussian generated x^(i), with z^(i) ~ Multinomial(φ).

We hypothesize x^(i) | z^(i) = j ~ N(μ_j, Σ_j).
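A sketch of this generative story with made-up parameters (k = 2 components in 2-D): draw z from a multinomial with weights φ, then draw x from the corresponding Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical GMM parameters (k = 2 components in 2-D).
phi = np.array([0.3, 0.7])                        # mixing weights p(z = j)
mu = np.array([[0.0, 0.0], [5.0, 5.0]])           # component means
Sigma = np.array([np.eye(2), 2.0 * np.eye(2)])    # component covariances

def sample(n):
    z = rng.choice(len(phi), size=n, p=phi)       # latent component labels
    x = np.array([rng.multivariate_normal(mu[j], Sigma[j]) for j in z])
    return x, z

X, Z = sample(1000)
print(X.shape, np.bincount(Z) / len(Z))           # label frequencies roughly match phi
```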
Mixture Models
GMMs can be extremely effective at modeling distributions!

[Richardson and Weiss 2018: On GANs and GMMs]


EM Algorithm
Problem: how do we estimate parameters when there are latent variables?

Idea: maximize the marginal likelihood ℓ(θ) = Σ_i log p(x^(i); θ) = Σ_i log Σ_z p(x^(i), z; θ).

But this is difficult to maximize directly: the sum over z sits inside the log!

Example: Mixture of Gaussians.


EM Algorithm
Algorithm

1. Begin with an initial guess for the parameters θ.

2. Alternate between:
a. [E-step] Hallucinate the missing values by computing, for all possible values z, the posterior p(z^(i) = z | x^(i); θ).
b. [M-step] Use the hallucinated dataset to maximize a lower bound on the log-likelihood, updating θ.

EM Algorithm
[E-step] Hallucinate the missing values by computing, for all possible values z, the posterior p(z^(i) = z | x^(i); θ).

In the GMM example: for each data point x^(i), compute the responsibilities w_j^(i) = p(z^(i) = j | x^(i); φ, μ, Σ).

Now repeat for all data points.

This creates an “augmented” dataset, with each copy weighted by its posterior probability.


EM Algorithm
[M-step] Use the hallucinated dataset to maximize a lower bound on the log-likelihood.

We now simply fit the parameters using the augmented dataset, with the responsibilities as weights.
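A compact numpy/scipy sketch of one EM pass for a GMM, following the E-step/M-step description above: compute responsibilities, then perform weighted maximum-likelihood updates. The function em_step and its argument names are illustrative conveniences, not course starter code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, phi, mu, Sigma):
    n, k = X.shape[0], len(phi)

    # E-step: responsibilities w[i, j] = p(z_i = j | x_i; current params).
    w = np.zeros((n, k))
    for j in range(k):
        w[:, j] = phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
    w /= w.sum(axis=1, keepdims=True)

    # M-step: weighted MLE updates using the "augmented" dataset.
    Nj = w.sum(axis=0)
    phi = Nj / n
    mu = (w.T @ X) / Nj[:, None]
    Sigma = np.array([
        ((X - mu[j]).T * w[:, j]) @ (X - mu[j]) / Nj[j]
        for j in range(k)
    ])
    return phi, mu, Sigma
```

Iterating em_step until the marginal log-likelihood stops increasing gives the usual EM fit.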


EM Algorithm
Theory

It turns out that we’re maximizing a lower bound on the true log-likelihood.

Here we use Jensen’s inequality: for any distribution Q over z,
log Σ_z p(x, z; θ) = log Σ_z Q(z) p(x, z; θ)/Q(z) ≥ Σ_z Q(z) log [p(x, z; θ)/Q(z)].


EM Algorithm
Intuition
EM Algorithm
Sanity Check

EM is guaranteed to converge to a local optimum.

Why?

Runtime per iteration?


EM Algorithm
Sanity Check

EM is guaranteed to converge to a local optimum.

Why?

Our estimated log-likelihood only ever increases.

Runtime per iteration?

In general, we need to hallucinate one new data point per possible value of z.

For GMMs, we only need to hallucinate n·k weighted data points (k per example), thanks to independence across examples.


K Means and More GMMs
K Means - Motivation
K Means - Algorithm(*)
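A minimal numpy sketch of the standard K-means (Lloyd's) iteration referenced on the algorithm slide: alternate hard assignment of each point to the nearest centroid with recomputing each centroid as the mean of its cluster.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # init from data

    for _ in range(n_iters):
        # Assignment step: nearest centroid for each point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```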
K Means
Real-World Example: Mixture of Gaussians

Note: data is unlabeled


How do we fit a GMM? The EM algorithm
Observations
Only the distribution of X matters. You can change things like the ordering of the components without affecting the distribution, and hence without affecting the algorithm.

Mixing two distributions from a parametric family might give us a third distribution from the same family: a mixture of 2 Bernoullis is another Bernoulli.

Probabilistic clustering: putting similar data points together into “clusters”, where the clusters are represented by the component distributions.
Sample Question

How do constraints on the covariance matrix change the Gaussian that is being fit?

Tied: all components share a single covariance matrix.
Diag: each covariance matrix is diagonal (axis-aligned ellipsoids).
Spherical: each covariance is a scalar multiple of the identity (in the limit, K-Means!).
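A sketch of this sample question using scikit-learn's GaussianMixture, whose covariance_type option corresponds to these constraints ('tied', 'diag', 'spherical', plus the unconstrained 'full'):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

for cov_type in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                          random_state=0).fit(X)
    # Fewer free covariance parameters -> more constrained ellipsoids.
    print(f"{cov_type:>9}: covariances shape {gmm.covariances_.shape}")
```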
