
CS195-5 : Introduction to Machine Learning

Lecture 5
Greg Shakhnarovich
September 15, 2006
Revised October 24th, 2006
Announcements
Collaboration policy on Psets
Projects
Clarifications for Problem Set 1
The correlation question
N values in each of two samples:
e_i = y_i − ŵ^T x_i : the prediction error
z_i = a^T x_i : a linear function evaluated on the training examples.
Show that cor({e_i}, {z_i}) = 0.
Develop an intuition before you attack the derivation: play with these in Matlab!
Generate a random w*, random X
Compute Xw*, generate and add Gaussian noise ε
Fit ŵ, calculate {e_i}
Generate a random a, calculate {z_i}; plot them!
Calculate correlation.
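A minimal Matlab sketch of this experiment (the dimensions, noise level, and variable names are illustrative choices, not part of the problem set):

% Sketch of the suggested experiment.
N = 200; d = 5; sigma = 0.3;
X = [ones(N,1) randn(N,d)];          % design matrix with a bias column
w_true = randn(d+1, 1);              % a random "true" w*
y = X*w_true + sigma*randn(N,1);     % Xw* plus Gaussian noise
w_hat = X \ y;                       % fit w by least squares
e = y - X*w_hat;                     % prediction errors {e_i}
a = randn(d+1, 1);                   % a random linear function
z = X*a;                             % {z_i} = a' * x_i
scatter(z, e); xlabel('z'); ylabel('e');
C = corrcoef(z, e);                  % sample correlation in the off-diagonal entry
disp(C(1,2));                        % essentially zero (up to numerical precision)

The scatter plot should show no linear trend, and the printed correlation should be essentially zero.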
More notation
A ≜ B means A is defined by B (first time A is introduced)
A ≡ B for varying A and/or B means they are always equal.
E.g., f(x) ≡ 1 means f returns 1 regardless of the input x.
a ∼ p(a): random variable a is drawn from density p(a)
Review
Uncertainty in ŵ as an estimate of w*:
ŵ ∼ N( ŵ; w*, σ^2 (X^T X)^{-1} )
Generalized linear regression:
f(x; w) = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + ... + w_m φ_m(x)
Multivariate Gaussians
Today
More on Gaussians
Introduction to classification
Projections
Linear discriminant analysis
Refresher on probability
Variance of a r.v. a: σ_a^2 = E[ (a − μ_a)^2 ], where μ_a = E[a].
Standard deviation: σ_a = sqrt(σ_a^2). Measures the spread around the mean.
Generalization to two variables: covariance
Cov_{a,b} ≜ E_{p(a,b)}[ (a − μ_a)(b − μ_b) ]
Measures how the two variables deviate together from their means (co-vary).
Correlation and covariance
Correlation:
cor(a, b) ≜ Cov_{a,b} / (σ_a σ_b).
[Figure: three scatter plots of samples (a, b) showing different degrees of linear relationship.]
cor(a, b) measures the linear relationship between a and b.
−1 ≤ cor(a, b) ≤ +1; +1 or −1 means a is a linear function of b.
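A small Matlab illustration of these definitions (the data-generating numbers below are arbitrary):

% Correlation of two linearly related samples.
N = 1000;
a = randn(N, 1);
b = 0.8*a + 0.3*randn(N, 1);        % b depends linearly on a, plus noise
C = cov(a, b);                      % 2x2 covariance matrix of the pair
r = C(1,2) / (std(a) * std(b));     % cor(a,b) = Cov_{a,b} / (sigma_a * sigma_b)
R = corrcoef(a, b);                 % same value appears in R(1,2)

As the noise term shrinks toward zero, r approaches +1; flipping the sign of the slope drives it toward −1.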
Covariance matrix
For a random vector x = [x_1, ..., x_d]^T,

Cov_x ≜ [ σ_{x_1}^2        Cov_{x_1,x_2}   ...   Cov_{x_1,x_d} ]
        [ Cov_{x_2,x_1}    σ_{x_2}^2       ...   Cov_{x_2,x_d} ]
        [ ...              ...             ...   ...           ]
        [ Cov_{x_d,x_1}    Cov_{x_d,x_2}   ...   σ_{x_d}^2     ]

Square, symmetric, non-negative main diagonal (variances ≥ 0).
Under that definition, one can show:
Cov_x = E[ (x − μ_x)(x − μ_x)^T ],
i.e. the expectation of the outer product of x − μ_x with itself.
Note: so far nothing Gaussian-specific!
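A quick Matlab check of the outer-product form against the built-in estimator (the data here are arbitrary):

% Sample covariance as the average outer product of centered vectors.
N = 500; d = 3;
X  = randn(N, d) * [1 0 0; 0.5 2 0; 0 0.3 1];   % correlated data; rows are x_i'
mu = mean(X, 1);
Xc = X - repmat(mu, N, 1);                      % center each component
Sigma_outer   = (Xc' * Xc) / (N - 1);           % sum_i (x_i - mu)(x_i - mu)' / (N-1)
Sigma_builtin = cov(X);                         % agrees up to floating-point error
disp(max(abs(Sigma_outer(:) - Sigma_builtin(:))));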
Covariance matrix decomposition
Any covariance matrix can be decomposed:
Σ = R diag(λ_1, ..., λ_d) R^T
where R is a rotation matrix, and λ_j ≥ 0 for all j = 1, ..., d.
Rotation in 2D:
R = [ cos θ   −sin θ ]
    [ sin θ    cos θ ]
Rotation matrices
Σ = R diag(λ_1, ..., λ_d) R^T
Rotation matrix R:
orthonormal: if columns are r_1, ..., r_d, then r_i^T r_i = 1 and r_i^T r_j = 0 for i ≠ j.
From here follows R^T = R^{-1} (R^T reverses the rotation produced by R).
Columns r_i specify the basis for the new (rotated) coordinate system.
R determines the orientation of the ellipse (the so-called principal directions).
The inner diag(λ_1, ..., λ_d) specifies the scaling along each of the principal directions.
Interpretation of the whole product: rotate, scale, and rotate back.
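In Matlab, this decomposition of a given covariance matrix can be obtained with an eigendecomposition (the matrix below is an arbitrary example):

% Recover R and the lambdas of Sigma = R * diag(lambda) * R'.
Sigma = [2.0 0.8; 0.8 1.0];         % a 2x2 covariance matrix
[R, L] = eig(Sigma);                % columns of R: principal directions; diag(L): lambdas
lambda = diag(L);
disp(R * diag(lambda) * R');        % reconstructs Sigma
disp(R' * R);                       % identity, since R is orthonormal (R' = inv(R))

(eig may return the columns in either order or with flipped signs; if a proper rotation with det(R) = +1 is needed, negate one column.)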
Covariance and correlation for Gaussians
Suppose (for simplicity) μ = 0. What happens if we rotate the data by R^T?
The new covariance matrix is just

[ σ_{x_1}^2        Cov_{x_1,x_2}   ...   Cov_{x_1,x_d} ]
[ Cov_{x_2,x_1}    σ_{x_2}^2       ...   Cov_{x_2,x_d} ]   =   diag(λ_1, ..., λ_d)
[ ...              ...             ...   ...           ]
[ Cov_{x_d,x_1}    Cov_{x_d,x_2}   ...   σ_{x_d}^2     ]

The components of x are now uncorrelated (covariances are zero). This is known as the whitening transformation.
For Gaussians, this also means they are independent.
Not true for all distributions!
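A short Matlab demonstration of this decorrelation (the choice of Sigma, the sample size, and the way the samples are generated are my own):

% Rotating zero-mean Gaussian data by R' makes the components uncorrelated.
N = 5000;
Sigma = [2.0 0.8; 0.8 1.0];
[R, L] = eig(Sigma);
A = R * sqrt(L) * R';               % symmetric square root, so A*A' = Sigma
X = randn(N, 2) * A';               % rows x_i' with covariance approx. Sigma, mean 0
disp(cov(X));                       % approx. Sigma
Z = X * R;                          % each row is (R' * x_i)'
disp(cov(Z));                       % approx. diag(lambda_1, lambda_2): uncorrelated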
Classification versus regression
Formally: just like in regression, we want to learn a mapping from X to Y, but Y is discrete and finite.
One approach is to (naively) ignore that Y is such.
Regression on the indicator matrix:
Code the possible values of the label as 1, ..., C.
Define the matrix Y:
Y_ic = 1 if y_i = c, and 0 otherwise.
This defines C independent regression problems; solving them with least squares yields
Ŷ_0 = X_0 (X^T X)^{-1} X^T Y.
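A Matlab sketch of this construction (the labels here are random placeholders; taking the largest fitted score as the predicted class is a common convention, not something stated on this slide):

% Regression on the indicator matrix for a C-class problem.
N = 300; d = 2; C = 3;
X = [ones(N,1) randn(N, d)];                 % data matrix with a bias column
y = randi(C, N, 1);                          % labels coded as 1, ..., C
Y = zeros(N, C);
Y(sub2ind([N C], (1:N)', y)) = 1;            % Y(i,c) = 1 iff y_i = c
W = (X' * X) \ (X' * Y);                     % C least-squares problems at once
Yhat = X * W;                                % fitted indicator scores
[~, yhat] = max(Yhat, [], 2);                % predict the class with the largest score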
Classification as regression
Suppose we have a binary problem, y ∈ {−1, +1}.
Assuming the standard model y = f(x; w) + ε, and solving with least squares, we get ŵ.
This corresponds to squared loss as a measure of classification performance!
Does this make sense?
How do we decide on the label based on f(x; w)?
Classification as regression: example
A 1D example:
[Figure, built up over several slides: data points x on the horizontal axis with labels y = +1 and y = −1 on the vertical axis; the fitted line w_0 + w^T x; the resulting split of the axis into a region labeled ŷ = −1 and a region labeled ŷ = +1.]
Classification as regression
f(x; w) = w_0 + w^T x
Can't just take ŷ = f(x; w), since it won't be a valid label.
A reasonable decision rule:
decide on ŷ = +1 if f(x; w) ≥ 0, otherwise ŷ = −1;
ŷ = sign( w_0 + w^T x ).
This specifies a linear classifier:
the linear decision boundary (hyperplane) given by the equation w_0 + w^T x = 0 separates the space into two half-spaces.
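A minimal Matlab sketch of fitting and applying this rule (the data are synthetic stand-ins):

% Least-squares fit of a binary problem and the sign decision rule.
N = 200; d = 2;
X = [ones(N,1) randn(N, d)];             % bias column plus features
y = sign(randn(N, 1));                   % labels in {-1, +1}, random here
w = X \ y;                               % least-squares weights [w_0; w_1; ...; w_d]
yhat = sign(X * w);                      % decision rule: yhat = sign(w_0 + w'*x)
err = mean(yhat ~= y)                    % training error rate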
Classification as regression
Seems to work well here, but not so well here? [Two example figures, not shown.]
Geometry of projections
[Figure, built up over several slides, in the (x_1, x_2) plane: the line w_0 + w^T x = 0; the normal vector w; the distance w_0/||w|| from the origin to the line; a point x_0, its signed distance (w_0 + w^T x_0)/||w|| from the line, and its projection x_0^⊥ onto w.]
w^T x = 0: a line passing through the origin and orthogonal to w.
w^T x + w_0 = 0 shifts the line along w.
x^⊥ is the projection of x on w.
Set up a new 1D coordinate system: x → (w_0 + w^T x)/||w||.
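In Matlab, the signed distance and the projection can be computed directly (w, w_0, and x below are arbitrary values for illustration):

% Signed distance of a point to the hyperplane w_0 + w'*x = 0, and its projection onto w.
w  = [2; 1];  w0 = -1;
x  = [1.5; 0.5];
dist   = (w0 + w' * x) / norm(w);        % the new 1D coordinate of x
x_perp = (w' * x / (w' * w)) * w;        % projection of x onto the direction w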
Distribution in 1D projection
Consider a projection given by w^T x = 0 (i.e., w is the normal).
Each training point x_i is projected to a scalar z_i = w^T x_i.
We can study how well the projected values corresponding to different classes are separated.
This is a function of w; some projections may be better than others.
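A Matlab sketch of this idea (two made-up Gaussian classes and a hand-picked w):

% Project two classes onto a direction w and compare the projected values.
N = 200;
X1 = randn(N, 2) + repmat([ 2 0], N, 1);    % class 1
X2 = randn(N, 2) + repmat([-2 0], N, 1);    % class 2
w  = [1; 0.2];                              % a candidate direction
z1 = X1 * w;                                % z_i = w' * x_i for class 1
z2 = X2 * w;                                % ... and for class 2
hist(z1, 30); hold on; hist(z2, 30); hold off;   % compare the two 1D distributions

Trying a different w (say, [0; 1]) shows how strongly the separation of the projected classes depends on the chosen direction.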
Linear discriminant and dimensionality reduction
The discriminant function f(x; w) = w_0 + w^T x reduces the dimension of the examples from d to 1:
[Figure: the level sets f(x; w) = −1, f(x; w) = 0, f(x; w) = +1, with the direction w orthogonal to them.]
Projections and classification
What objective are we optimizing the 1D projection for?
1D projections of a Gaussian
Let p(x) = N(x; μ, Σ).
For any A, p(Ax) = N( Ax; Aμ, A Σ A^T ).
To get a marginal of the 1D projection on the direction defined by a unit vector v:
Make R a rotation such that R [1, 0, ..., 0]^T = v.
Compute σ_v^2 = v^T Σ v; that's the variance of the marginal.
Let's assume for now μ = 0 (but think what happens if it's not!)
Matlab demo: margGausDemo.m
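A small check in the spirit of that demo (this is my own stand-in, not the actual margGausDemo.m):

% The 1D marginal of a zero-mean Gaussian along a unit vector v has variance v' * Sigma * v.
Sigma = [2.0 0.8; 0.8 1.0];
v = [1; 1] / norm([1; 1]);               % a unit direction
var_marginal = v' * Sigma * v;           % predicted variance of the projection
N = 100000;
X = randn(N, 2) * chol(Sigma);           % samples with covariance approx. Sigma (chol(Sigma)'*chol(Sigma) = Sigma)
z = X * v;                               % projections onto v
disp([var_marginal, var(z)]);            % the two numbers should be close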
Objective: class separation
We want to minimize overlap between projections of the two classes.
One way to approach that: make the class projections a) compact, b) far apart.
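One simple (and entirely illustrative) way to turn "compact and far apart" into a number for a given direction w, in the spirit of the criterion to be developed next lecture:

% Save as separationScore.m: ratio of between-class spread to within-class spread
% of the 1D projections. Larger is better under the "compact, far apart" goal.
function J = separationScore(X1, X2, w)
    z1 = X1 * w;                                         % projections of class 1
    z2 = X2 * w;                                         % projections of class 2
    J  = (mean(z1) - mean(z2))^2 / (var(z1) + var(z2));  % far apart vs. compact
end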
Next time
Continue with linear discriminant analysis, and talk about the optimal way to place the decision boundary.