Linear Regression With Multiple Variables
Objective:
I hope everyone has been enjoying the course and learning a lot! This week we're
covering linear regression with multiple variables. We'll show how linear regression
can be extended to accommodate multiple input features, and we'll also discuss best
practices for implementing linear regression.
We’re also going to go over how to use Octave. You’ll work on programming
assignments designed to help you understand how to implement the learning
algorithms in practice. To complete the programming assignments, you will need to
use Octave or MATLAB.
Multiple Features
In this video, we will start to talk about a new version of linear regression that is more powerful:
one that works with multiple variables, or multiple features.
In the original version of linear regression that we developed, we had a single feature x, the size of the house,
and we wanted to use that to predict y, the price of the house; that was the form of our hypothesis.
But now imagine that we had not only the size of the house as a feature with which to try to
predict the price, but that we also knew the number of bedrooms, the number of floors, and the age of the home
in years.
It seems like this would give us a lot more information with which to predict the price.
Now that we have multiple features, let's talk about what the form of our hypothesis should be.
Previously this was the form of our hypothesis, where x was our single feature, but now that we have multiple
features, we aren't going to use that simple representation any more.
And if we have n features, then rather than summing over our four features, we would have a sum over our n
features.
This would be one example of a hypothesis. Remember, the hypothesis is trying to predict the price of the
house in thousands of dollars, saying, for example, that the base price of a house is maybe 80,000.
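The course's programming environment is Octave/MATLAB, but as a quick sketch, the multi-feature hypothesis can be written as a dot product in NumPy. The parameter and feature values below are made-up numbers for illustration only:

```python
import numpy as np

# Hypothetical parameters: theta[0] is the base price (in $1000s).
theta = np.array([80.0, 0.1, 10.0, 3.0, -2.0])
# One example: [x0=1 (intercept), size, bedrooms, floors, age in years].
x = np.array([1.0, 2104.0, 5.0, 1.0, 45.0])

# h(x) = theta_0*x_0 + theta_1*x_1 + ... + theta_n*x_n = theta . x
h = theta @ x
```

With these made-up values, the predicted price comes out to a few hundred thousand dollars, driven mostly by the base price and the size term.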
Gradient Descent for Multiple Variables
In the previous video, we talked about the form of the hypothesis for linear regression with multiple features or
with multiple variables. In this video, let's talk about how to fit the parameters of that hypothesis.
In particular let's talk about how to use gradient descent for linear regression with multiple features.
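As a sketch of what fitting those parameters looks like, here is one vectorized gradient descent step in NumPy (the tiny data set is made up; the course itself uses Octave):

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha):
    """One simultaneous update of all parameters:
    theta_j := theta_j - alpha * (1/m) * sum_i (h(x_i) - y_i) * x_i_j."""
    m = len(y)
    predictions = X @ theta              # h(x) for every training example
    gradient = (X.T @ (predictions - y)) / m
    return theta - alpha * gradient

# Tiny made-up data set; first column is the intercept feature x0 = 1.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])            # perfectly fit by theta = [0, 1]
theta = np.zeros(2)
for _ in range(2000):
    theta = gradient_descent_step(theta, X, y, alpha=0.1)
```

Because every parameter is updated from the same `predictions` vector, this reproduces the "simultaneous update" requirement from the single-variable case.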
Gradient Descent in Practice I - Feature Scaling
In this video and in the video after this one, I wanna tell you about some of the practical tricks for making gradient
descent work well.
In this video, I want to tell you about an idea called feature scaling.
If you have a problem with multiple features, and you make sure that the features are on a similar scale,
by which I mean that the different features take on similar ranges of values, then gradient descent
can converge more quickly.
Concretely let's say you have a problem with two features where X1 is the size of house and takes on values
between zero to two thousand and X2 is the number of bedrooms, and maybe that takes on values between one
and five.
If you plot the contours of the cost function J of theta, then the contours may look like this, where, let's see, J of
theta is a function of parameters theta zero, theta one and theta two.
I'm going to ignore theta zero for now. If x1 takes on a much larger range of values than x2, it turns out that the contours
of the cost function J of theta can take on a very skewed elliptical shape.
So these very tall, skinny ellipses, or very tall, skinny ovals, can form the contours of the cost
function J of theta.
And if you run gradient descent on this cost function, your gradient descent may end up oscillating
back and forth and taking a long time before it finally finds its way to the global minimum.
In these settings, a useful thing to do is to scale the features.
Concretely, if you instead define the feature x1 to be the size of the house divided by two thousand, and define
x2 to be the number of bedrooms divided by five, then the contours of the cost function J can become
much less skewed, so the contours may look more like circles.
And if you run gradient descent on a cost function like this, then you can find a much more direct path to the
global minimum rather than taking a much more convoluted path where you're sort of trying to follow a much more
complicated trajectory to get to the global minimum.
In this example, we end up with both features, X one and X two, between zero and one.
You can wind up with an implementation of gradient descent that can converge much faster.
More generally, when we're performing feature scaling, what we often want to do is get every feature into
approximately a -1 to +1 range and concretely, your feature x0 is always equal to 1.
So, that's already in that range, but you may end up dividing other features by different numbers to get them to
this range.
So, if you have a feature x1 that winds up being between zero and three, that's not a problem.
If you end up having a different feature that winds up being between -2 and +0.5, again, this is close enough to
minus one and plus one that that's fine.
It's only if you have a different feature, say x3, that ranges from -100 to +100, that you have values very
different from minus 1 and plus 1; that would be a poorly scaled feature.
Similarly, if your features take on a very, very small range of values, so if x4 takes on values between
minus 0.0001 and plus 0.0001, then again this takes on a much smaller range of values than the minus one
to plus one range.
Don't worry if your features are not exactly on the same scale or exactly in the same range of values.
So long as they're all close enough to this range, gradient descent should work okay.
So, now you know about feature scaling, and if you apply this simple trick, it can make gradient descent run much
faster and converge in a lot fewer iterations.
The way to prevent this is to modify the ranges of our input variables so that
they are all roughly the same. Ideally:
−1 ≤ x_i ≤ 1
or
−0.5 ≤ x_i ≤ 0.5
These aren't exact requirements; we are only trying to speed things up. The
goal is to get all input variables into roughly one of these ranges, give or take a
few.
x_i := (x_i − μ_i) / s_i

Where μ_i is the average of all the values for feature (i) and s_i is the range
of values (max − min), or s_i is the standard deviation.
Note that dividing by the range, or dividing by the standard deviation, give
different results. The quizzes in this course use range - the programming
exercises use standard deviation.
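The mean-normalization formula above can be sketched in NumPy as follows (the feature values are made up; dividing by the range is shown, with standard deviation as the noted alternative):

```python
import numpy as np

# Hypothetical raw features: size in sq ft and number of bedrooms.
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])

mu = X.mean(axis=0)                  # mu_i: average value of each feature
s = X.max(axis=0) - X.min(axis=0)    # s_i: range (max - min); X.std(axis=0) also works
X_scaled = (X - mu) / s              # x_i := (x_i - mu_i) / s_i
```

After this transformation each column has mean zero and values roughly within [-1, 1], which is exactly the condition that speeds up gradient descent.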
Gradient Descent in Practice II - Learning Rate
In this video, I want to give you more practical tips for getting gradient descent to work. The ideas in
this video will center on the learning rate alpha.
Concretely, here's the gradient descent update rule. And what I want to do in this video is tell
you about what I think of as debugging, and some tips for making sure that gradient descent is
working correctly. And second, I wanna tell you how to choose the learning rate alpha or at least how I
go about choosing it. Here's something that I often do to make sure that gradient descent is working
correctly. The job of gradient descent is to find the value of theta for you that hopefully minimizes the
cost function J(theta).
What I often do is therefore plot the cost function J(theta) as gradient descent runs.
So the x axis here is a number of iterations of gradient descent and as gradient descent runs you
hopefully get a plot that maybe looks like this.
So what this plot is showing is the value of your cost function after each iteration of
gradient descent. And if gradient descent is working properly, then J(theta) should decrease after every iteration.
And one useful thing that this sort of plot can tell you also is that if you look at the specific figure that
I've drawn, it looks like by the time you've gotten out to maybe 300 iterations, between 300 and 400
iterations, in this segment it looks like J(theta) hasn't gone down much more. So by the time you get to
400 iterations, it looks like this curve has flattened out here. And so way out here 400 iterations, it
looks like gradient descent has more or less converged because your cost function isn't going down
much more. So looking at this figure can also help you judge whether or
not gradient descent has converged.
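A minimal sketch of this debugging plot, in NumPy: record J(theta) after each iteration and check that it keeps decreasing (the data and learning rate are made up; in practice you would plot `J_history` against the iteration number):

```python
import numpy as np

def cost(theta, X, y):
    """Squared-error cost J(theta) = (1/2m) * sum_i (h(x_i) - y_i)^2."""
    m = len(y)
    err = X @ theta - y
    return (err @ err) / (2 * m)

# Tiny made-up data set; first column is x0 = 1.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta, alpha, m = np.zeros(2), 0.1, len(y)

J_history = []
for _ in range(100):
    theta -= alpha * (X.T @ (X @ theta - y)) / m
    J_history.append(cost(theta, X, y))
# With a well-chosen alpha, J decreases on every iteration and flattens out.
```

If the recorded values ever go up, that is the warning sign discussed below that the learning rate is too large.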
By the way, the number of iterations that gradient descent takes to converge for a particular application
can vary a lot, so maybe for
one application, gradient descent may converge after just thirty iterations.
For a different application, gradient descent may take 3,000 iterations, for another learning algorithm, it
may take 3 million iterations.
It turns out to be very difficult to tell in advance how many iterations gradient descent needs to
converge.
It's usually by plotting this sort of plot, plotting the cost function as the number of iterations
increases, and looking at these plots, that I try to tell whether gradient descent has converged.
It's also possible to come up with an automatic convergence test, namely to have an algorithm try to tell
you if gradient descent has converged.
And here's maybe a pretty typical example of an automatic convergence test. And such a test may
declare convergence if your cost function J(theta)
decreases by less than some small value epsilon, say 10 to the minus 3, in one iteration.
But I find that usually choosing what this threshold is is pretty difficult. And so, in order to check whether
gradient descent has converged,
I actually tend to look at plots like these, like this figure on the left, rather than rely on an automatic
convergence test.
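The automatic convergence test described above can be sketched in a few lines of Python (the threshold and the sample cost values are illustrative, not from the course):

```python
def has_converged(J_history, epsilon=1e-3):
    """Declare convergence when J(theta) decreases by less than
    epsilon in one iteration -- the automatic test described above."""
    if len(J_history) < 2:
        return False
    return J_history[-2] - J_history[-1] < epsilon

# Hypothetical cost values from two points in a run:
early = [10.0, 5.0, 2.5]          # still dropping fast -> not converged
late = [2.5, 2.4001, 2.4000]      # last drop is 1e-4 < 1e-3 -> converged
```

As the transcript notes, picking `epsilon` well is hard, which is why inspecting the plot directly is usually preferred.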
Looking at this sort of figure can also tell you, or give you an advance warning, if maybe gradient
descent is not working correctly.
Concretely, if you plot J(theta) as a function of the number of iterations.
Then if you see a figure like this where J(theta) is actually increasing, then that gives you a clear sign
that gradient descent is not working.
And a plot like this usually means that you should be using a smaller learning rate alpha.
You now know about linear regression with multiple variables.
In this video, I wanna tell you a bit about the choice of features that you have, and
how you can get different learning algorithms, sometimes very powerful ones, by choosing appropriate features.
In particular, I also want to tell you about how polynomial regression allows
you to use the machinery of linear regression to fit very complicated, even very non-linear, functions.
Let's take the example of predicting the price of the house.
Suppose you have two features: the frontage of the house and the depth of the house.
So, here's the picture of the house we're trying to sell.
The frontage is defined as the width of the lot you own, and the depth
is how deep your property is. So there's a frontage and there's a depth,
and those two features are called frontage and depth.
You might build a linear regression model like this, where frontage
is your first feature x1 and depth is your second feature x2.
But when you're applying linear regression, you don't necessarily have to use
just the features x1 and x2 that you're given.
What you can do is actually create new features by yourself.
So, if I want to predict the price of a house, what I might do instead is decide
that what really determines the size of the house is the area of the land I own.
So, I might create a new feature, which I'm just going to call x, equal to frontage times depth,
because that's the land area that I own.
I might then select my hypothesis using just that one feature, my land area,
since the area of a rectangle is the product of its width and its depth.
So, depending on what insight you might have into a particular problem, rather than
just taking the features x1 and x2 that we happened to start off with, sometimes by defining
new features you might actually get a better model.
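Defining the composite "land area" feature is a one-liner; here's a small NumPy sketch with made-up lot dimensions:

```python
import numpy as np

# Hypothetical lot dimensions in feet.
frontage = np.array([50.0, 40.0, 60.0])
depth = np.array([100.0, 80.0, 120.0])

# Define a single new feature: land area = frontage * depth.
area = frontage * depth

# Design matrix for a one-feature hypothesis h(x) = theta_0 + theta_1 * area.
X = np.column_stack([np.ones_like(area), area])
```

The rest of the linear regression machinery is unchanged; only the feature column fed into it is different.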
Closely related to the idea of choosing your features is this idea called polynomial regression.
Let's say you have a housing price data set that looks like this.
Then there are a few different models you might fit to this.
It doesn't look like a straight line fits this data very well, so maybe you want to fit
a quadratic model like this, where you think the price is a quadratic function of the size,
and maybe that'll give you, you know, a fit to the data that looks like that.
But then you may decide that your quadratic model doesn't make sense, because a quadratic function
eventually comes back down, and we don't think housing prices should go down when the size gets too large.
So then maybe we might choose a different polynomial model and instead use a
cubic function, where we now have a third-order term. If we fit that, maybe
we get this sort of model, and maybe the green line is a somewhat better fit
to the data because it doesn't eventually come back down.
So how do we actually fit a model like this to our data?
Using the machinery of multivariate linear regression, we can
do this with a pretty simple modification to our algorithm.
The form of the hypothesis, as we know how to fit it, looks like this:
h(x) = theta 0 plus theta 1 x1 plus theta 2 x2 plus theta 3 x3.
And if we want to fit the cubic model that I have boxed in green,
what we're saying is that to predict the price of a house, it's theta 0 plus theta 1
times the size of the house, plus theta 2 times the square of the size of the house,
so this term is equal to that term, and then plus theta 3 times the cube of the
size of the house, which is that third term.
In order to map these two definitions to each other, the natural way
to do that is to set the first feature x1 to be the size of the house,
set the second feature x2 to be the square of the size of the house,
and set the third feature x3 to be the cube of the size of the house.
And, just by choosing my three features this way and
applying the machinery of linear regression, I can fit this model and end up with
a cubic fit to my data.
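Building those three polynomial features is just column construction; a NumPy sketch with made-up house sizes:

```python
import numpy as np

size = np.array([1000.0, 1500.0, 2000.0])  # hypothetical house sizes in sq ft

# Map the single input into three features:
# x1 = size, x2 = size^2, x3 = size^3 (plus the intercept column x0 = 1).
# Ordinary linear regression over these columns yields a cubic fit in `size`.
X = np.column_stack([np.ones_like(size), size, size**2, size**3])
```

Note the huge spread between the columns (10^3 vs 10^6 vs 10^9), which is exactly why the next paragraph insists on feature scaling.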
I just want to point out one more thing: if you choose your features
like this, then feature scaling becomes increasingly important.
So if the size of the house ranges from one to a thousand square feet, say,
then the size squared will range from one to one million, the square of a thousand,
and your third feature x3, which is the size cubed, will range
from one to ten to the ninth. These three features take on very
different ranges of values, and it's important to apply feature
scaling if you're using gradient descent to get them into
comparable ranges of values.
Finally, here's one last example of how you really have
broad choices in the features you use.
Earlier we talked about how a quadratic model like this might
not be ideal: maybe a quadratic model fits the data okay, but the quadratic
function eventually goes back down, and we really don't want to predict
housing prices that go down as the size of the house increases.
But rather than going to a cubic model, you have, maybe, other choices of
features, and there are many possible choices.
But just to give you another example, another reasonable choice
might be to say that the price of a house is theta zero plus theta one times
the size, plus theta two times the square root of the size.
The square root function is this sort of function, and maybe
there will be some values of theta zero, theta one, theta two, that
will let you take this model and get a curve that
goes up, but sort of flattens out a bit and doesn't ever
come back down.
And so, by having insight into, in this case, the shape of the
square root function and the shape of the data, and by choosing
different features, you can sometimes get better models.
In this video, we talked about polynomial regression: that is, how to fit a
polynomial, like a quadratic function or a cubic function, to your data.
We also threw out the idea that you have a choice in what
features to use, such that instead of using the frontage and the depth
of the house, maybe you can multiply them together to get
a feature that captures the land area of the house.
In case this seems a little bit bewildering, with all
these different feature choices, how do I decide what features to use?
Later in this class, we'll talk about some algorithms for automatically
choosing what features to use, so you can have an
algorithm look at the data and automatically choose for you
whether to fit a quadratic function, or a cubic function, or something else.
But until we get to those algorithms, I just
want you to be aware that you have a choice in
what features to use, and by designing different features
you can fit more complex functions to your data than just a
straight line. In particular, you can fit polynomial
functions as well, and sometimes, with appropriate insight into the
features, you can get a much better model for your data.
Features and Polynomial Regression
We can improve our features and the form of our hypothesis function in a
couple different ways.
Polynomial Regression
Our hypothesis function need not be linear (a straight line) if that does not fit
the data well.
One important thing to keep in mind is, if you choose your features this way
then feature scaling becomes very important.
Normal Equation
In this video, we'll talk about the normal equation, which for some linear regression problems, will give
us a much better way to solve for the optimal value of the parameters theta.
Concretely, so far the algorithm that we've been using for linear regression is gradient descent where
in order to minimize the cost function J of Theta, we would take this iterative algorithm that takes many
steps, multiple iterations of gradient descent to converge to the global minimum.
In contrast, the normal equation would give us a method to solve for theta analytically, so that rather
than needing to run this iterative algorithm, we can
instead just solve for the optimal value for theta all at one go, so that in
basically one step you get to the optimal value right there.
It turns out the normal equation has some advantages and some disadvantages, but before we
get to that and talk about when you should use it, let's get some intuition about what this method
does. As a motivating example, let's take a very simplified cost function
J of Theta that's just a function of a real number Theta.
So, for now, imagine that Theta is just a scalar, a single real value
rather than a vector. Imagine that we have a cost function J that's a quadratic
function of this real-valued parameter Theta, so J of Theta looks like that.
Well, how do you minimize a quadratic function?
For those of you that know a little bit of calculus, you may know that the way to minimize a function is
to take derivatives and to set derivatives equal to zero.
So, you take the derivative of J with respect to the parameter of Theta.
You get some formula which I am not going to derive,
you set that derivative equal to zero, and this allows you to solve for
the value of Theta that minimizes J of Theta. That was the simpler case,
when Theta was just a real number.
In the problem that we are interested in, Theta is no longer just a real number
but instead is this (n+1)-dimensional parameter vector, and the cost function J is
a function of this vector value, Theta 0 through Theta n. The cost
function looks like this, the usual squared cost function on the right.
How do we minimize this cost function J?
Calculus actually tells us that one way to do so is
to take the partial derivative of J with respect to every parameter Theta j in turn, and then to set
all of these to 0. If you do that, and you
solve for the values of Theta 0, Theta 1, up to Theta n, then
this would give you the values of Theta that minimize the cost
function J. If you actually work through the calculus and work through
the solution for the parameters Theta 0 through Theta n, the
derivation ends up being somewhat involved.
And what I am going to do in this video is not go
through that derivation, which is kind of long and kind of involved; instead,
I just want to tell you what you need to know
in order to implement this process, so you can solve for the
values of the thetas where the partial derivatives are equal to zero,
or equivalently, the values of Theta that minimize the cost function J of Theta.
I realize that some of the comments I made make
sense only to those of you who are more familiar with calculus.
But if you're less familiar with calculus, don't worry about it;
I'm just going to tell you what you need to know in order to
implement this algorithm and get it to work.
For the example that I want to use as a running example, let's say that
I have m = 4 training examples.
In order to implement this normal equation, what I'm going to do is the following.
I'm going to take my data set, so here are my four training examples;
in this case, let's assume that these four examples are all the data I have.
What I am going to do is take my data set and add
an extra column that corresponds to my extra feature, x0,
which always takes on the value of 1.
What I'm going to do is then construct a matrix called X, a matrix that basically contains all
of the features from my training data. So here are all my features, and we're
going to take all those numbers and put them into this matrix X,
just copying the data over one column at a time.
Then I am going to do something similar for the y's:
I am going to take the values that I'm trying to
predict and construct a vector, like so, and call that the vector y.
So X is going to be an m by (n+1)-dimensional matrix, and
y is going to be an m-dimensional vector,
where m is the number of training examples and n is the number of features;
it's n+1 because of this extra feature x0 that I added.
Finally, if you take your matrix X and your vector y, and you
just compute theta as X transpose X inverse times X transpose y,
this would give you the value of theta that minimizes your cost function.
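The transcript writes this step in Octave; as a sketch, the same one-step solve in NumPy looks like this (the four training examples are made-up numbers in the style of the housing data):

```python
import numpy as np

# Four hypothetical training examples; first column is x0 = 1,
# then size in sq ft and number of bedrooms.
X = np.array([[1.0, 2104.0, 5.0],
              [1.0, 1416.0, 3.0],
              [1.0, 1534.0, 3.0],
              [1.0,  852.0, 2.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])   # prices in $1000s

# theta = (X^T X)^(-1) X^T y, solved in one step: no iterations, no alpha.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
predictions = X @ theta
```

`np.linalg.pinv` plays the same role as Octave's `pinv`, returning a usable answer even when X transpose X is nearly singular.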
There was a lot that happened on that slide, and
I worked through it using one specific example of one dataset.
Let me write this out in a slightly more general form, and later on in
this video I'll explain this equation a little bit more.
It is not yet entirely clear how to do this.
In the general case, let us say we have m training examples,
(x(1), y(1)) up to (x(m), y(m)), and n features.
So, each training example x(i) looks like a vector
like this, an (n+1)-dimensional feature vector.
The way I'm going to construct the matrix X, which is
also called the design matrix, is as follows.
Each training example gives me an (n+1)-dimensional feature vector like this.
What I'm going to do is take the first training example, which is
a vector, take its transpose so it ends up being this
long flat row, and make x(1) transpose the first row of my design matrix.
Then I am going to take my second training example, x(2), take
the transpose of that, and put that as the second row
of X, and so on, down until my last training example,
whose transpose becomes the last row of my matrix X. And so,
that makes my matrix X an m by (n+1)-dimensional matrix.
As a concrete example, let's say I have only one
feature, really only one feature other than x0, which is always equal to 1.
So if my feature vectors x(i) are equal to this
1, which is x0, and then some real feature, like maybe the
size of the house, then my design matrix X would be equal to this.
For the first row, I'm going to basically take this and take its transpose,
so I'm going to end up with 1 and then x1(1).
For the second row, we end up with 1 and then x1(2), and so
on, down to 1 and then x1(m).
And thus, this will be an m by 2-dimensional matrix.
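This single-feature design matrix construction is a two-liner in NumPy (the house sizes below are made-up numbers; the course does the same thing in Octave):

```python
import numpy as np

# One real feature per example (house size), m = 4 training examples.
sizes = np.array([2104.0, 1416.0, 1534.0, 852.0])

# Each row of the design matrix is one training example's feature
# vector transposed, with the extra x0 = 1 column prepended.
X = np.column_stack([np.ones_like(sizes), sizes])
```

Each row of `X` is `[1, size]`, matching the m by 2 matrix described above.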
So, that's how to construct the matrix X. And the vector y
(sometimes I might write an arrow on top to denote that it is a vector,
but very often I'll just write it as y, either way)
is obtained by taking all the labels, all the correct prices of
houses in my training set, and just stacking them up into
an m-dimensional vector; that's y. Finally, having
constructed the matrix X and the vector y, we then
just compute theta as (X'X)^-1 X'y.
I just want to make sure that this equation makes sense to you
and that you know how to implement it.
So, concretely, what is (X'X)^-1?
Well, (X'X)^-1 is the inverse of the matrix X'X.
Concretely, if you were to set A to be equal to X' times X,
so X' is a matrix, and X' times X gives you another matrix, we
call that matrix A. Then (X'X)^-1 is just this matrix A inverted,
let's say A^-1.
And so that's how you compute this thing.
You compute X'X and then you compute its inverse.
We haven't yet talked about Octave; we'll do so in a later
set of videos. But in the Octave programming language (and the very
similar MATLAB programming language), the command to compute this quantity,
X transpose X inverse times X transpose y, is as follows.
In Octave, X prime is the notation used to denote X transpose.
And so this expression that's boxed in red computes
X transpose times X.
pinv is a function for computing the inverse of
a matrix, so this computes X transpose X inverse;
then you multiply that by X transpose, and you multiply
that by y. So you end up computing pinv(X'*X)*X'*y,
the formula which I didn't prove,
but it is possible to show mathematically, even though I'm
not going to do so here, that this formula gives you
the optimal value of theta, in the sense that if you set theta equal
to this, that's the value of theta that minimizes the
cost function J of theta for linear regression.
One last detail: in an earlier video, I talked about feature
scaling and the idea of getting features to be
on similar ranges of values.
If you are using this normal equation method, then feature
scaling isn't actually necessary. It is actually okay if,
say, some feature x1 is between zero and one,
some feature x2 ranges from zero to one thousand, and some feature
x3 ranges from zero to ten to the minus five. If
you are using the normal equation method,
this is okay and there is no need to do feature
scaling, although of course if you are using gradient descent,
then feature scaling is still important.
Finally, when should you use gradient descent
and when should you use the normal equation method?
Here are some of their advantages and disadvantages.
Let's say you have m training examples and n features.
One disadvantage of gradient descent is that you need to choose the learning rate alpha.
And often this means running it a few times with different learning
rates and then seeing what works best.
So that is sort of extra work and extra hassle.
Another disadvantage of gradient descent is that it needs many more iterations.
So, depending on the details, that could make it slower, although
there's more to the story, as we'll see in a second.
As for the normal equation, you don't need to choose any learning rate alpha.
So that makes it really convenient and simple to implement.
You just run it and it usually just works.
And you don't need to iterate, so you don't need
to plot J of Theta or check for convergence or take all those extra steps.
So far, the balance seems to favor the normal equation.
Here are some disadvantages of the normal equation, and some advantages of gradient descent.
Gradient descent works pretty well even when you have a very large number of features.
So, even if you have millions of features, you
can run gradient descent and it will be reasonably efficient;
it will do something reasonable.
In contrast, with the normal equation, in order to solve for the parameters
theta, we need to compute this term, X transpose X inverse.
The matrix X transpose X is an n by n matrix, if you have n features.
Because, if you look at the dimension of X transpose and the dimension of
X, and figure out what the dimension of the product is,
the matrix X transpose X is an n by n matrix, where
n is the number of features. And for most computed implementations,
the cost of inverting a matrix grows roughly as
the cube of the dimension of the matrix.
So, computing this inverse costs roughly on the order of n cubed time.
Sometimes it's slightly faster than n cubed, but it's close enough for our purposes.
So if n, the number of features, is very large,
then computing this quantity can be slow, and
the normal equation method can actually be much slower.
So if n is large, then I might usually use gradient descent, because
we don't want to pay this order n cubed cost.
But if n is relatively small,
then the normal equation might give you a better way to solve for the parameters.
What does small and large mean?
Well, if n is on the order of a hundred, then
inverting a hundred-by-hundred matrix is no problem by modern computing standards.
If n is a thousand, I would still use the normal equation method;
inverting a thousand-by-thousand matrix is actually really fast on a modern computer.
If n is ten thousand, then I might start to wonder.
Inverting a ten-thousand-by-ten-thousand matrix starts to get kind of slow,
and I might then start to lean in the direction of gradient descent, but maybe not quite:
at n equals ten thousand, you can still invert a ten-thousand-by-ten-thousand matrix.
But if n gets much bigger than that, then I would probably use gradient descent.
So, if n equals ten to the sixth, with a million
features, then inverting a million-by-million matrix is going
to be very expensive, and I would definitely favor gradient descent if you have that many features.
Exactly how large the set of features has to be
before you switch to gradient descent is hard to pin down with a strict number.
But for me, it is usually around ten thousand that I might
start to consider switching over to gradient descent, or maybe
some other algorithms that we'll talk about later in this class.
To summarize: so long as the number of features is
not too large, the normal equation gives us a great alternative
method to solve for the parameter theta.
Concretely, so long as the number of features is less
than a thousand, I would usually use the normal equation method
rather than gradient descent.
To preview some ideas that
we'll talk about later in this
course, as we get
to the more complex learning algorithm, for
example, when we talk about
classification algorithm, like a logistic regression algorithm,
We'll see that for those algorithms,
the normal equation method actually does not work,
and we will have to resort to
gradient descent for those more sophisticated
learning algorithms.
So, gradient descent is a very useful algorithm to know.
That's true both for linear regression with
a large number of features and
for some of the other algorithms
that we'll see in
this course, because, for them, the normal
equation method just doesn't apply and doesn't work.
But for this specific model of
linear regression, the normal equation
can give you an alternative
that can be much faster than gradient descent.
So, depending on the details of your algorithm,
the details of the problem, and
how many features you have,
both of these algorithms are well worth knowing about.
Normal Equation Noninvertibility
In this video I want to talk about the Normal equation and non-invertibility.
This is a somewhat more advanced concept, but
it's something that I've often been asked about,
so I want to address it here.
Feel free to consider this optional material.
And there's a phenomenon that you may run into that may be somewhat useful
to understand, but even if you don't fully understand it, the normal equation
for linear regression should still work okay.
Here's the issue.
For those of you that are maybe more familiar with linear algebra,
what some students have asked me is,
when computing this Theta equals X transpose X inverse X transpose Y.
What if the matrix X transpose X is non-invertible?
So for those of you that know a bit more linear algebra,
you may know that only some matrices are invertible, and
some matrices do not have an inverse; we call those non-invertible,
singular, or degenerate matrices.
The problem of X transpose X being non-invertible should happen pretty rarely.
And in Octave if you implement this to compute theta,
it turns out that this will actually do the right thing.
I'm getting a little technical now, and I don't want to go into the details,
but Octave has two functions for inverting matrices.
One is called pinv, and the other is called inv.
And the differences between these two are somewhat technical.
One's called the pseudo-inverse, one's called the inverse.
But you can show mathematically that so
long as you use the pinv function, this will actually compute
the value of theta that you want even if X transpose X is non-invertible.
The specific difference between pinv and inv is
a somewhat advanced numerical computing concept
that I don't really want to get into.
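The same pinv-versus-inv behavior can be seen in NumPy, which mirrors Octave's two functions; this is a minimal illustration of my own, not lecture code:

```python
import numpy as np

# x2 is an exact multiple of x1, so X'X is singular (non-invertible).
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

XtX = X.T @ X

# The plain inverse fails on a singular matrix.
inv_failed = False
try:
    np.linalg.inv(XtX)
except np.linalg.LinAlgError:
    inv_failed = True

# The pseudo-inverse still produces a usable theta.
theta = np.linalg.pinv(XtX) @ X.T @ y
pred = X @ theta  # predictions still fit y despite the singularity
```

Here `pinv` returns the minimum-norm solution among the infinitely many parameter vectors that fit equally well, which is why the computed theta still does "the right thing".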
But I thought in this optional video, I'll try to give you a little bit of intuition
about what it means for X transpose X to be non-invertible.
For those of you that know a bit more linear Algebra might be interested.
I'm not gonna prove this mathematically, but if X transpose X is non-invertible,
there are usually two common causes.
The first cause is if somehow in your learning problem you have redundant
features.
Concretely, if you're trying to predict housing prices, and x1 is the size of
the house in square feet and x2 is the size of the house in square
meters,
then 1 meter is equal to 3.28 feet, rounded to two decimals.
And so your two features will always satisfy the constraint x1
equals 3.28 squared times x2.
And for those of you that are somewhat advanced in linear
algebra, you can actually show that if your two features
are related by a linear equation like this,
then the matrix X transpose X will be non-invertible.
The second thing that can cause X transpose X to be non-invertible is if you
are trying to run the learning algorithm with a lot of
features.
Concretely, if m is less than or equal to n.
For example, imagine that you have m = 10 training examples
and n = 100 features; then you're trying to fit
a parameter vector theta which is n plus one dimensional.
So this is 101 dimensional,
you're trying to fit 101 parameters from just 10 training examples.
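A quick rank check in NumPy (a sketch of mine, not lecture code, using a smaller m and n than the example above) shows why m ≤ n forces X transpose X to be singular:

```python
import numpy as np

# m = 2 training examples, n = 3 features: with the intercept column added,
# theta has 4 entries, but X'X can have rank at most m = 2.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
m = X.shape[0]
Xb = np.hstack([np.ones((m, 1)), X])  # add the intercept column x0 = 1
XtX = Xb.T @ Xb                       # a 4x4 matrix
rank = np.linalg.matrix_rank(XtX)     # at most m = 2, so XtX is singular
```

Since rank(X'X) equals rank(X), and a matrix with more rows/columns than its rank has no inverse, any case with fewer examples than parameters produces a non-invertible X'X.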
This turns out to sometimes work, but it is not always a good idea.
Because as we'll see later, you might not have enough data if you only
have 10 examples to fit 100 or 101 parameters.
We'll see later in this course why this might be too little data to fit
this many parameters.
But commonly, what we do if m is less than or equal to n
is to see if we can either delete some features or
use a technique called regularization, which is something that we'll talk about
later in this class as well, and which will let you fit a lot of parameters,
using a lot of features, even if you have a relatively small training set.
But this regularization will be a later topic in this course.
But to summarize, if you ever find that X transpose X is singular or,
equivalently, non-invertible, what I would recommend you do is
first look at your features and see if you have redundant features like the x1 and
x2 above, which are linearly dependent, that is, a linear function of each other.
If you do have redundant features, you really don't need both of them,
and just deleting one of them
would solve your non-invertibility problem.
And so I would first think through my features and check if any are redundant.
And if so, then keep deleting redundant features until none are
redundant.
And if your features are not redundant,
I would check if I may have too many features.
And if that's the case, I would either delete some features, if I can bear to
use fewer features, or else I would consider using regularization,
which is the topic that we'll talk about later.
So that's it for the normal equation and
what it means if the matrix X transpose X is non-invertible.
This is a problem that you should hopefully run into pretty rarely.
And if you just implement it in Octave using the pinv function,
which computes the pseudo-inverse,
that implementation should just do the right thing,
even if X transpose X is non-invertible, which should happen pretty rarely
anyway,
so this should not be a problem for most implementations of linear regression.
Causes of X transpose X being non-invertible:
- Redundant features, where two features are very closely related (i.e.
they are linearly dependent).
- Too many features (e.g. m ≤ n). In this case, delete some features or
use "regularization" (to be explained in a later lesson).
Solutions to the above problems include deleting a feature that is linearly
dependent with another, or deleting one or more features when there are too
many features.