Yousef ML Washington Classification
WEEK
1
1 hour to complete
Welcome!
In Module 5, we're going to see that overfitting is not just a bad, bad problem with logistic regression, but it's also a bad, bad problem with decision trees. Here, as you make those trees deeper and deeper and deeper, those decision boundaries can get very, very complicated, and really overfit. So, we're going to have to do something about it. What we're going to do is use a fundamental concept called Occam's Razor, where you try to find the simplest explanation for your data. And this concept goes back way before Occam, who was around in the 13th century. It goes back to Pythagoras and Aristotle, and those folks said the simplest explanation is often the best one. So we're going to take these really complex deep trees and find simple ones that give you better performance and are less prone to overfitting. [MUSIC]
2 hours to complete
Linear Classifiers & Logistic Regression
Linear classifiers are amongst the most practical classification methods. For example,
in our sentiment analysis case-study, a linear classifier associates a coefficient with
the counts of each word in the sentence. In this module, you will become proficient in
this type of representation. You will focus on a particularly useful type of linear
classifier called logistic regression, which, in addition to allowing you to predict a
class, provides a probability associated with the prediction. These probabilities are
extremely useful, since they provide a degree of confidence in the predictions. In this
module, you will also be able to construct features from categorical inputs, and to
tackle classification problems with more than two classes (multiclass problems). You will
examine the results of these techniques on a real-world product sentiment analysis
task.
One is the sum of the squares, also called the L2 norm, or rather the square of the L2 norm. It's denoted by ||w||_2 squared, and it's just very simple: it's the square of the first coefficient plus the square of the second coefficient plus the square of the third coefficient, and so on, plus the square of the last coefficient, w_D squared. That's if you use the L2 norm. We can also use the sum of the absolute values, also called the L1 norm, denoted by ||w||_1. Instead of the squares, it's the absolute value of w_0, plus the absolute value of w_1, plus the absolute value of w_2, all the way to the absolute value of the last coefficient. Now, in the regression course we explored these notions quite a bit, but the main reason we take the square or the absolute value is that we want to make sure to penalize highly positive and highly negative coefficients in the same way: squaring a value, or taking its absolute value, makes its contribution positive, and then we push these norms to be as low as possible. So both of these approaches penalize large coefficients. However, as we saw in the regression course, by using the L1 norm I'm also going to get what's called a sparse solution. Sparsity doesn't only play a role in regression; it also plays a role in classification, for example. And in this module we're going to explore a little bit of both of these concepts. We're going to start with the L2 norm, the sum of the squares. So now that we've reviewed these concepts, we can formalize the problem, the quality metric that we're trying to maximize. I want to maximize, over my choice of parameters w, the trade-off between two things: the likelihood of my data, or actually the log of it, so the log of the data likelihood, and some notion of penalty for the magnitude of the coefficients. We'll start with this L2 penalty notion. [MUSIC]
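As a rough illustration of the quantity being maximized here, the sketch below writes out the log of the data likelihood and the L2-penalized objective in Python. It assumes a feature matrix H (one row per data point), labels y in {+1, -1}, and NumPy; it's just a minimal rendering of the formula above, not code from the course.

```python
import numpy as np

def log_likelihood(w, H, y):
    """Log of the data likelihood for labels y in {+1, -1} under a logistic model.

    H is the feature matrix (one row per data point), so the scores are H @ w,
    and log P(y_i | x_i, w) = -log(1 + exp(-y_i * score_i)).
    """
    scores = H @ w
    return -np.sum(np.log1p(np.exp(-y * scores)))

def l2_penalized_objective(w, H, y, lam):
    """The quality metric to maximize: log-likelihood minus lambda * ||w||_2^2."""
    return log_likelihood(w, H, y) - lam * np.sum(w ** 2)
```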
we would call lambda, or the tuning parameter, or the magic parameter, or the magic constant. And so, if you think about it, there are three regimes here for us to explore. Where lambda is equal to zero, let's see what happens. When lambda is equal to zero, this problem reduces to just maximizing over w the likelihood only, so only the likelihood term, which means that we get the standard maximum likelihood solution, an unpenalized MLE solution. That's probably not a good idea, because then I have those really bad overfitting problems; I'm not preventing the overfitting. Now, if I set lambda to be too large, for example infinity, what happens? Well, the optimization becomes the maximum over w of l(w) minus infinity times the norm of the parameters, which means the l(w) gets drowned out. All I care about is that infinity term, and so that pushes me to only care about penalizing the parameters, penalizing the coefficients, penalizing w, penalizing any large coefficient, which will lead to just setting all of the w's equal to zero. Everything will be zero. That's also not a good idea, because I'm not fitting the data at all; if I set all the parameters to zero, it's not doing anything good, it's ignoring the data. So the regime that we care about is somewhere in between: a lambda between zero and infinity, which balances the data fit against the magnitude of the coefficients. Very good. So we're going to try to find a lambda between zero and infinity that fits our data well. And this process, where we're trying to find a lambda and we're trying to fit the data with this L2 penalty, is called L2 regularized logistic regression. In the regression case, we called this ridge regression; here it doesn't have a fancy name, it's just L2 regularized logistic regression. Now, you might ask at this point, how do I pick lambda? Well, if you took the regression course, you should know the answer already. You can't use your training data, because as lambda goes to zero, you're going to fit the training data better, so you're not going to be able to pick lambda that way. And never ever use your test data, ever. So you either use a validation set, if you have lots of data, or use cross-validation for smaller data sets. In the regression course, we covered picking the parameter lambda for the regression setting, and it's the same kind of idea here. Use a validation set or use cross-validation, always.
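A minimal sketch of that recipe: pick lambda on a validation set, never on the training or test data. The helpers train_logistic and classification_error below are hypothetical stand-ins for your own fitting and evaluation code, and the data splits are assumed to already exist.

```python
import numpy as np

lambda_grid = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]

best_lambda, best_error = None, np.inf
for lam in lambda_grid:
    # Fit on the training set only; the validation set is held out for selection.
    w = train_logistic(H_train, y_train, l2_penalty=lam)    # hypothetical helper
    err = classification_error(w, H_valid, y_valid)         # hypothetical helper
    if err < best_error:
        best_lambda, best_error = lam, err
```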
Lambda can be viewed as a parameter that helps us go between a high variance model and a high bias model, and to find a way to balance the two in terms of the bias-variance tradeoff. When lambda is very large, the w's go to zero, so we have large bias; we know we're not fitting the data very well, but we have low variance, because no matter what your data set is, you get the same kind of parameters. In the extreme, when lambda is extremely large, you get all zeros no matter what data set you have. If lambda is very small, you get a very good fit to the training data, so you have low bias, but you can have very high variance: if the data changes a little bit, you get a completely different decision boundary. And so in that sense, lambda controls the bias-variance trade-off for this regularization setting in logistic regression, or in classification, just like it did in regular regression. [MUSIC]
as we increase the penalty lambda. In the beginning, when we have an unregularized problem, these coefficients tend to be large, but as we increase lambda, they tend to become smaller and smaller and smaller and go to zero. I've used the product review data set here; I picked a few words and fit a logistic regression model using just those words, with different levels of regularization. The words that have positive coefficients tend to be associated with positive aspects of reviews, while the ones with negative coefficients tend to be associated with negative aspects of reviews. What is the word, in quotes, that has the most positive weight? Well, if you look at the key here, you'll see the word that has the most positive weight is actually the smiley face emoticon, while the word that has the most negative weight is another emoticon, the sad face. In the beginning all these words have pretty large coefficients, except the words near zero, which are words like "this" and "review", which are not associated with either positive things or negative things. Although, if the word "review" shows up, it's slightly correlated with a negative review, in general its coefficient is much smaller than the others. And as I increase the regularization lambda, you see the coefficients become smaller and smaller, and if I were to keep drawing this, they would eventually go to zero. Now, if I were to use cross validation to pick the best lambda, I'd get a result around here, and I'm going to call that lambda star. That's what you do with cross validation: find the point where you're fitting the data pretty well, but not overfitting it too much. And as a last point, I'm going to show you something that is pretty exciting, something really beautiful about regularization with logistic regression. Regularization doesn't only address the crazy wiggly decision boundaries; it also addresses those over-confidence problems that we saw with overfitting. So I'm taking the same coefficients, the same models that I've learned. As lambda increases, the range of the coefficients decreases, they get smaller. But at the bottom here, I'm plotting the actual decision boundaries that we learned and the notion of uncertainty in the data. If lambda is equal to zero, we have these highly over-confident predictions. If lambda is one, not only do I get a more natural, parabola-like decision boundary, even though I'm using degree-20 polynomial features, I also get a very natural uncertainty region. The region where I don't know whether it's positive or negative is really those points near the boundary, which sit between the cluster of positive points and the cluster of negative points, and you get this kind of beautiful smooth transition. So by introducing regularization, we've now addressed those two fundamental problems where overfitting comes in in logistic regression. [MUSIC]
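If you want to reproduce coefficient paths like these, one way (a sketch, not the course's own code) is to refit an L2-regularized logistic regression over a grid of lambda values and record the coefficients. This assumes scikit-learn is available, word-count features X and sentiment labels y already exist, and it uses scikit-learn's C parameter, which acts roughly as an inverse of lambda.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

lambdas = np.logspace(-2, 4, 13)        # range of penalties to sweep
paths = []
for lam in lambdas:
    # C is an inverse regularization strength, so C is roughly 1 / lambda.
    model = LogisticRegression(penalty='l2', C=1.0 / lam, solver='lbfgs', max_iter=1000)
    model.fit(X, y)
    paths.append(model.coef_.ravel())

paths = np.array(paths)   # one row per lambda; plot each column to see a word's path
```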
This is the thing we need to be able to walk in that hill-climbing direction. The derivative of a sum is the sum of the derivatives, so the total derivative is the derivative of the first term, the derivative of the log-likelihood, which, thankfully, we saw in the previous module, minus lambda times the derivative of the quadratic term. The derivative of the quadratic term we already covered in the regression course, but we're going to do a quick review here. As you can see, it's just a small change to your code from before: we just have to add this lambda times the derivative of the quadratic term. So, to review, the derivative of the log-likelihood is the sum over my data points of the difference between the indicator of whether it's a positive example and the probability of it being positive, weighted by the value of the feature. We talked about this last module and interpreted this piece in quite a bit of detail, so I'm not going to go over it again; we're going to focus on the second part, which is the derivative of the L2 penalty. In other words, what's the partial derivative with respect to some parameter wj of w0 squared plus w1 squared plus w2 squared, plus dot, dot, dot, plus wj squared, plus dot, dot, dot, plus wD squared? Now, if you look at all of these terms, w0 squared, w1 squared, and so on, they don't play any role in the derivative. The only thing that plays a role is wj squared, and what's the derivative of wj squared? It's just 2wj. So that's all that's going to change in our code; it's exactly 2wj. In fact, our total derivative is going to be the same derivative that we've implemented in the past, minus 2 lambda times wj: 2 times the regularization penalty, the parameter lambda, times the value of that coefficient. Let's interpret what this extra term does for us. What does the minus 2 lambda wj do to the derivative? If wj is positive, this minus 2 lambda wj is a negative term, a negative contribution to the derivative, which means that it decreases wj, because you're going to add some negative quantity to it. It was positive and we're going to decrease it, so wj becomes closer to 0. If wj is positive, you add a negative number and it becomes less positive, closer to 0. And in fact, if lambda is bigger, that term becomes more negative and wj goes to 0 faster. And if wj is very positive, the decrement is also larger, so again it goes towards 0 even faster. Now, if wj is negative, then -2 lambda wj is going to be greater than 0, because lambda is also greater than 0. And what impact does that have? You're adding something positive, so you're increasing wj, which implies that wj becomes, again, closer to 0. It was negative, you added a positive number to it, and it goes a little closer to 0. So this is extremely intuitive: the regularization takes positive coefficients and decreases them a little bit, and takes negative coefficients and increases them a little bit. It tries to push coefficients to 0; that's the effect it has on the gradient, exactly what you'd expect. Finally, this is exactly the code that we described in the last module to learn the coefficients of a logistic regression model. You start with some coefficients equal to 0, or some other randomly initialized or smartly initialized parameters. Then, for each iteration, you go coefficient by coefficient and compute the partial derivative, which is this really long term here: the sum over data points of the feature value times the difference between whether it's a positive data point and the predicted probability of being positive, the so-called partial j. And you have the same update: wj(t+1) is wj(t) plus the step size times the partial derivative, just as before, which is the derivative of the likelihood function with respect to wj. All you need to change in your code is one little thing: this little term here, which is our only change. In other words, take all the code you had before, put -2 lambda wj in the computation of the derivative, and now you have a solver for L2 regularized logistic regression. And this is going to help you a tremendous amount in practice. [MUSIC]
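Here is a minimal sketch of that solver in Python, assuming NumPy, a feature matrix H, and labels y in {+1, -1}. For simplicity it penalizes every coefficient, including any intercept column, and uses a fixed step size; the only difference from the unregularized version is the extra -2 * lambda * w term.

```python
import numpy as np

def predict_probability(H, w):
    """P(y = +1 | x, w) for every row of the feature matrix H."""
    return 1.0 / (1.0 + np.exp(-(H @ w)))

def fit_l2_logistic(H, y, lam, step_size=1e-5, n_iters=500):
    """Gradient ascent for L2-regularized logistic regression (sketch)."""
    w = np.zeros(H.shape[1])
    indicator = (y == +1).astype(float)          # 1 if positive example, else 0
    for _ in range(n_iters):
        errors = indicator - predict_probability(H, w)
        derivative = H.T @ errors                # derivative of the log-likelihood
        derivative -= 2.0 * lam * w              # derivative of -lambda * ||w||_2^2
        w += step_size * derivative
    return w
```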
and that can help us with both efficiency and interpretability of the models, as we saw in regression. So, for example, let's say that we have a lot of data and a lot of features, so the number of w's that you have can be 100 billion; 100 billion possible coefficients. This can happen in practice in all sorts of settings. For example, many of the spam filters out there have hundreds of billions of parameters, or coefficients, that they learn from data. This has a couple of problems. It can be expensive to make a prediction: you have to go through 100 billion values. However, if I have a sparse solution, where many of these w's are actually equal to zero, then when I'm trying to make a prediction, so when I'm computing y hat i as the sign of the sum of wj times the feature hj of xi, I only have to look at the nonzero coefficients wj; everything else can be ignored. So if I have 100 billion coefficients but only, say, 100,000 of those are nonzero, then it's going to be much faster to make a prediction. This makes a huge difference in practice. The other impact that sparsity has, having many coefficients be zero, is that it can help you interpret the nonzero coefficients. You can look at the small number of nonzero coefficients and try to make an interpretation of why a prediction gets made. Such interpretations can be used in practice in many ways. So how do you learn a logistic regression classifier with a sparsity-inducing penalty? What you do is take the same log-likelihood function l(w), but add an extra L1 penalty, which is the sum of the absolute value of w0, the absolute value of w1, all the way to the absolute value of wD. So by just changing the sum of squares to be the sum of absolute values, we go to what's called L1 regularized logistic regression, which gives you sparse solutions. That small change leads to sparse solutions. Just like we did with L2 regularization, here we're also going to have a parameter lambda which controls how much regularization, how much penalty, we introduce. The objective becomes the log-likelihood of the data minus lambda times the sum of these absolute values, the L1 penalty. When lambda equals 0, we have no regularization, which leads us to the standard MLE solution, just like we had in the case of L2 regularization. When lambda is equal to infinity, we have only penalty, so all weight is on regularization, and that's going to lead to w hat being all zeros, all zero coefficients. The case that we really care about is a lambda somewhere between 0 and infinity, which leads to what are called sparse solutions, where some of the wj hats are nonzero, but hopefully many other wj hats are going to be exactly 0. That's what we're going to try to aim for. So let's revisit those coefficient paths. Here I'm showing you the coefficient paths for the L2 penalty. You see that when the lambda parameter is low, you learn large coefficients, and when the lambda parameter gets larger, you get smaller coefficients. So they go from large to small, but they're never exactly 0; the coefficients never become exactly 0. If you look, however, at the coefficient paths when the regularization is L1, you'll see it's much more interesting. For example, in the beginning, the coefficient of the smiley face has a large positive value, but eventually it becomes exactly zero from here on. And similarly, the coefficient for the frowny face is a large negative value, but eventually, over here, the frowny face has a coefficient that becomes 0. So it goes from large all the way to exactly zero. And we see that for many of the other words. For example, in the beginning the coefficient of the word "hate" is pretty large, and that's a pretty important word, but around here "hate" becomes irrelevant. Just as a quick reminder, these are product reviews and we're trying to figure out whether it's a positive or negative review for the product. We can look at which coefficient stays nonzero for the longest time, and that's exactly this line over here, the one that never hits 0: it's the coefficient of the word "disappointed". So, you might be disappointed to learn that the frowny face is not the one that stays nonzero the longest. In the beginning, the coefficient of "disappointed" is not as large, not as significant, as the frowny face, but it's the one that stays negative for the longest. You might be disappointed to know that the frowny face is not as important as "disappointed". [LAUGH] That's probably because "disappointed" is prevalent in more reviews, and when you say "disappointed" you're really in a negative review, so that coefficient stays nonzero for a long time. So you see these transitions: the coefficients of those smaller words, like "review", go to zero early on; the smiley face lasts for a while and then becomes zero; the frowny face lasts longer and then becomes exactly zero. But for sufficiently large lambdas, all of them are zero except for the coefficient of "disappointed" at this point. [MUSIC]
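As a quick illustration (not the course's implementation), here's one way to see the sparsity that an L1 penalty induces, assuming scikit-learn and the same word-count features X and labels y; the C parameter again acts roughly like 1 / lambda.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

lam = 10.0
model = LogisticRegression(penalty='l1', C=1.0 / lam, solver='liblinear')
model.fit(X, y)

coef = model.coef_.ravel()
nonzero = np.flatnonzero(coef)
print(f"{nonzero.size} of {coef.size} coefficients are nonzero")
# Predictions only need the nonzero coefficients:
#   y_hat = sign( sum over j in nonzero of coef[j] * x[j] )
```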
WEEK
3
2 hours to complete
Decision Trees
Along with linear classifiers, decision trees are amongst the most widely used
classification techniques in the real world. This method is extremely intuitive, simple to
implement and provides interpretable predictions. In this module, you will become
familiar with the core decision tree representation. You will then design a simple,
recursive greedy algorithm to learn decision trees from data. Finally, you will extend this
approach to deal with continuous inputs, a fundamental requirement for practical
problems. In this module, you will investigate a brand new case-study in the financial
sector: predicting the risk associated with a bank loan. You will implement your own
decision tree learning algorithm on real loan data.
Now let's revisit the AdaBoost algorithm that we've been talking about. In this part of the module, we're going to explore how to compute the coefficient w hat t, and we saw that it can be computed by this really simple formula: we compute the weighted error of f_t, and we set w hat t to one-half of the log of (1 minus the weighted error) divided by the weighted error. With that, we have w hat t, and we can focus on figuring out how to come up with the alpha i's. And we want the alpha i's to be high where f_t makes mistakes or does poorly. [MUSIC]
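That formula is one line of code. The sketch below just renders it in NumPy and replays the lecture's sanity checks: a great classifier (weighted error 0.01) gets a large positive coefficient, about 2.3, and a random classifier (weighted error 0.5) gets coefficient 0.

```python
import numpy as np

def adaboost_coefficient(weighted_error):
    """w_hat_t = 1/2 * ln((1 - weighted_error) / weighted_error)."""
    return 0.5 * np.log((1.0 - weighted_error) / weighted_error)

print(adaboost_coefficient(0.01))   # great classifier  -> about 2.3
print(adaboost_coefficient(0.5))    # random classifier -> 0.0
```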
[SOUND] We started with the alpha i's being uniform, the same for all data points, one over N, and now we want to change them to focus more on those difficult data points where we're making mistakes. So the question is: where did f_t make mistakes, and which data points did f_t get right? If f_t got a particular data point xi right, we want to decrease alpha i, because we got it right. But if we got xi wrong, then we want to increase alpha i, so the next decision stump we learn focuses on it and does better on this particular input.
Again, the AdaBoost theorem provides us with a slightly intimidating formula for how to update the weights alpha i. But if you take a moment to interpret it, you'll see it's extremely intuitive, and there's something quite nice about it. So let's take a quick look. It says that alpha i gets an update depending on whether f_t gets the data point right, so whether the prediction is correct, or whether f_t makes a mistake. We'll see that we're going to increase the weight of data points where we made mistakes and decrease the weight of data points we got right. So let's take a look. Take one xi and suppose that we got it correct, so we're in the top line here. Notice that this equation depends on the coefficient that was assigned to this classifier: if the classifier was good, we change the weight more, but if the classifier was terrible, we change the weight less. So, let's say the classifier was good and we gave it weight 2.3. What we're doing here, looking at the formula, is multiplying alpha i by e to the -w hat t, which is e to the -2.3. And if you take your calculator out, you'll see that this is about 0.1. So we're taking the data points we got right, and we multiply the weight of those data points by 0.1, so dividing by 10. What effect does that have? We're going to decrease the importance of this data point xi, yi, this particular data point. Now let's look at a case where we got the data point correct, but the classifier that we learned is random. Its weighted error was 0.5, so its weight w hat t is 0, just like we discussed a few slides ago. In this case we're multiplying alpha i by e to the minus 0, which is equal to 1. What does that mean? It means that we keep the importance of this data point the same, and that also makes a ton of sense. This was a classifier that was terrible; we gave it a weight of 0, so we're going to ignore it, and since we're ignoring it, we're not changing anything about how we weigh the data points. We just keep going as if nothing's happened, because nothing has changed in our overall ensemble. Now let's look at the opposite case, when we actually made a mistake: say we got xi incorrect. In this case, we're in the second line here. If it was a good classifier with a w hat t of 2.3, then we're going to multiply the weight by e to the power 2.3, which, if you do the math, is about 9.98, so approximately 10. So the weight becomes 10 times bigger, and what we're doing is increasing the importance of this mistake significantly. The next classifier is going to pay much more attention to this particular data point, because it was a mistaken one. Finally, just very quickly, what happens if we make a mistake, but we have that random classifier with weight 0 that we didn't care about? The multiplier here is e to the 0, which is again equal to 1, which means we keep the importance of this data point the same. Very good. We've now seen this cool update from AdaBoost, which makes a ton of sense: increase the weights of data points where we made mistakes, and decrease the weights of the ones we got right. And we're going to use it in our AdaBoost algorithm. So, updating our algorithm: we started with uniform weights, we learned a classifier f_t, we computed its coefficient w hat t, and now we can update the weights of the data points, alpha i, using the simple formula from the previous slide, which increases the weight of mistakes and decreases the weights of the correct classifications. [MUSIC]
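A minimal sketch of that update, assuming NumPy arrays of labels and predictions in {+1, -1}: correct points are scaled by e^(-w_hat_t) and mistakes by e^(+w_hat_t), which with w_hat_t = 2.3 gives roughly the 0.1 and 10 from the example above.

```python
import numpy as np

def update_alphas(alphas, y, predictions, w_hat_t):
    """Multiply each weight by exp(-w_hat_t) if f_t got the point right,
    and by exp(+w_hat_t) if f_t got it wrong."""
    correct = (predictions == y)
    return alphas * np.where(correct, np.exp(-w_hat_t), np.exp(w_hat_t))
```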
[MUSIC] >> Finally, we need to address a technical issue that we hinted at when we initialized the weights of the data points to 1 over N, so with uniform weights. Which is: we should be normalizing the weights of the data points throughout the iterations. For example, take a data point xi that is often a mistake; we keep multiplying its weight by a number greater than one, again and again and again. Say 2 times 2 times 2 times 2, and that weight alpha i can get extremely large. On the other hand, if you take a data point xi that's often correct, you keep multiplying its weight by some number less than one, say a half, so you keep going a half, a half, a half, and that weight can get really, really small. And this problem can lead to numerical instabilities in the approach.
So let's summarize the AdaBoost algorithm. We start with even weights, uniform weights, equal weights for all data points. We learn a classifier f_t. We find its coefficient depending on how good it is in terms of weighted error. And then we update the weights to weigh mistakes more than the things we got correct. And finally, we normalize the weights by dividing each value by the total sum of the weights, and this normalization is of practical importance. So this is the whole AdaBoost algorithm; it's beautiful, works extremely well, and is really easy to use. >> [MUSIC]
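Putting the pieces together, here is a sketch of the whole loop, assuming a hypothetical learn_weak_classifier(X, y, alphas) helper that returns an object with a .predict(X) method producing +1/-1 labels; the normalization at the end of each iteration is the step just discussed.

```python
import numpy as np

def adaboost(X, y, learn_weak_classifier, T):
    n = len(y)
    alphas = np.full(n, 1.0 / n)               # start with uniform weights
    classifiers, coefficients = [], []
    for t in range(T):
        f_t = learn_weak_classifier(X, y, alphas)   # weak learner on weighted data
        pred = f_t.predict(X)
        # Weighted error of f_t (in real code, guard against an error of exactly 0).
        weighted_error = alphas[pred != y].sum() / alphas.sum()
        w_hat_t = 0.5 * np.log((1 - weighted_error) / weighted_error)
        # Increase weights of mistakes, decrease weights of correct points.
        alphas = alphas * np.where(pred == y, np.exp(-w_hat_t), np.exp(w_hat_t))
        alphas /= alphas.sum()                  # normalize for numerical stability
        classifiers.append(f_t)
        coefficients.append(w_hat_t)
    return classifiers, coefficients
```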
[MUSIC] Now let's take that same example we've used to illustrate different machine learning algorithms in this module and explore it in the context of AdaBoost. It's going to give us a lot of insight into how boosting works in practice. For the first classifier f1, we work directly off the original data; all points have the same weight. And so the learning process is just standard learning: nothing changes in your learning algorithm, since every data point has the same weight. In this case we're learning a decision stump, and here is the decision boundary that does its best to try to separate positive examples from negative examples. It splits right around 0; it's actually minus 0.07, if you remember from the decision tree classifier. So this is the first decision stump, f1. Now, to learn the second decision stump, f2, we have to reweigh our data based on how well f1 did. We're going to look at our decision boundary and weigh the data points that were mistakes higher, and here in the picture I'm denoting them by bigger minus signs and plus signs. If you look at the data points here on the left, the minuses on this side were mistakes, and these pluses over here were also mistakes, so we increased their weights and decreased the weights of everybody else. You can see that the pluses here became bigger and the minuses in this region became larger.
Now we learn the classifier f2 in the second iteration based on this weighted data. Using the weighted data, we learn the following decision stump, and you see that instead of a vertical split, we now have a horizontal split, which is a better split for this weighted data. And it's kind of cool: in the first iteration we decided to split on x1, and in the second one we split on x2, asking whether x2 is greater than or less than 1.3 or so. You'll see that it gets all the minuses correct on top, makes some mistakes on the minuses in the bottom, but gets the pluses correct in the bottom. So as opposed to the vertical split we had before, we now have a horizontal split.
So now we've learned the decision stumps f1 and f2, and the question is: how do we combine them? If you go through the AdaBoost formula, you'll see that w hat 1, the weight of the first decision stump, is going to be 0.61, and w hat 2 is going to be 0.53. So we trust the first decision stump a little bit more than we trust the second one, which makes sense; the second one doesn't seem as good. But when you add them together, you start getting a very interesting decision boundary. The points in the top left here are ones where we definitely think that y hat is minus 1, so definitely negatives. On the bottom right here are some definite positives, y hat equals plus 1. And for the other two regions, we can think of them as regions of higher uncertainty. These are uncertain right now, which makes sense, but as you add more decision stumps, we're going to become more sure about which way those regions should go. Now, if you keep the iterations going for 30 rounds, the first thing we notice is that we get all the data points right, so our training error is 0. The second thing you'll notice, and here I'm going to use a technical term for this, is that the decision boundary is crazy. That's our technical term. And if you combine these two insights, we figure out: okay, we don't really trust this classifier, we're probably overfitting the data. It fits the training data perfectly, but it maybe doesn't do as well in terms of true error. So overfitting is something that can happen in boosting; we'll talk about that a little bit later. So let's take a deep breath and summarize what we've done so far. We described simple classifiers, and we said that we're going to learn those simple classifiers and take a weighted vote between them to make predictions. And then we described this AdaBoost algorithm, which is a pretty simple approach to learning a non-simple classifier using this technique of boosting, where you boost up the weight of data points where we're making mistakes. And it's simple to implement in practice. [MUSIC]
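For completeness, the ensemble's prediction is just the sign of the coefficient-weighted vote of the stumps. A small sketch, assuming each classifier exposes a .predict(X) method returning +1/-1; with the two stumps above (coefficients 0.61 and 0.53), points where the stumps agree are confident predictions and points where they disagree fall into the uncertain regions.

```python
import numpy as np

def ensemble_predict(X, classifiers, coefficients):
    """y_hat = sign( sum_t w_hat_t * f_t(x) ), the weighted vote of the stumps."""
    total = np.zeros(len(X))
    for f_t, w_hat_t in zip(classifiers, coefficients):
        total += w_hat_t * f_t.predict(X)
    return np.where(total >= 0.0, +1, -1)
```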
[MUSIC] We'll now take a few minutes to instantiate this abstract algorithm we described and see what it looks like in the context of learning decision stumps. Boosted decision stumps are a really nice, simple default way of training on your data, so we'll go through this a little further, and it will help us ground some of the concepts that we've looked at so far. Here I've outlined the AdaBoost algorithm that we discussed in the previous slides, but just to be clear, we're going to talk about learning a decision stump for f_t and figuring out its coefficient, w hat t. Our first step is figuring out how to learn the next decision stump, which will be f_t, and this is just standard decision stump learning. We're going to try splitting on each feature: income, credit history, savings, market conditions. And we figure out how well each of the resulting decision stumps fits the weighted data. Notice that in this process we might split on income multiple times; in multiple iterations we might revisit the same feature. So we're going to try each of those features, and for each one of them, measure the weighted error on the training data. Say, for splitting on income, the weighted error might be 0.2; for splitting on credit, it might be 0.35; for splitting on savings, it might be 0.3. And finally, if you split on market conditions, it might be the worst of these four decision stumps: on this weighted data, it might have a weighted error of 0.4. So we pick the best feature, the one that has the lowest weighted error, and that's the first one. We're going to split on income, with the threshold at 100,000. And so f_t is going to be the decision stump that asks: is income greater than 100,000? If yes, the loan is safe; if not, it's risky. Now, the final question is, what coefficient do we give to this particular classifier? All we have to do is plug 0.2 into the formula, and if we plug it in and do the math, 0.69 is the result. So the coefficient of this first decision stump is just going to be 0.69.
Going back to the algorithm: we discussed how we're going to learn this decision stump from data and how we figure out its coefficient. Let's next talk about how to update the weight alpha i of each data point. Here's the intuitive process of what happens. We have our data points, and I'm highlighting them here depending on their income, just like we did before, and I'm going to make a prediction using this decision stump. The question is how good this decision stump, income greater than 100,000, is. If you look at it, it makes mistakes on some of the data points and gets others right, so I've marked the correct ones in bright green and the mistakes in bright red. And if we take the previous weights, alpha, for each one of these data points, which I'm highlighting right there, we need to compute the new weights based on the formula above, which is the standard formula. So we're going to plug the w hat t that we computed, 0.69, into the formula to figure out what to multiply each one of those weights by. Plug it in, and you'll see that e to the -0.69 is about a half, so for every correct data point we halve its weight, and e to the 0.69 is about two, so for every incorrect data point we're going to double its weight. Going row by row: for the ones in green that I got correct, I'm going to halve the weights. For the first row there, the weight before was 0.5 and it now becomes 0.25; the next one was 1.5 and becomes 0.75, because they're correct. For the third row I made a mistake; its weight before was 1.5, and now I'm going to double it and make it 3. So you can go data point by data point and multiply the weight by two, or divide it by two, depending on whether we got that data point right or not. It's extremely simple to boost a decision stump classifier, and these tend to do extremely well on a wide range of data sets. [MUSIC]
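The arithmetic in this example is easy to check. The snippet below simply replays the lecture's numbers (the weighted error per candidate split, the 0.69 coefficient, and the halve-or-double weight update), so the values are taken from the slides rather than computed from real data.

```python
import numpy as np

weighted_errors = {'income': 0.2, 'credit': 0.35, 'savings': 0.3, 'market': 0.4}
best_feature = min(weighted_errors, key=weighted_errors.get)        # 'income'

w_hat = 0.5 * np.log((1 - weighted_errors[best_feature]) / weighted_errors[best_feature])
print(best_feature, round(w_hat, 2))          # income 0.69

# e^{-0.69} is about 0.5 and e^{+0.69} is about 2, so correct points are halved
# and mistakes are doubled: 0.5 -> 0.25, 1.5 -> 0.75, and a mistaken 1.5 -> 3.
print(round(np.exp(-w_hat), 2), round(np.exp(w_hat), 2))
```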
[MUSIC] Next, we're going to take a few minutes to talk about the famous AdaBoost theorem and what its practical implications are, which are extremely interesting. If you remember, the way that boosting came about is that Kearns and Valiant posed this famous question: can you combine weak classifiers to make a stronger one? Valiant, by the way, is a Turing Award winner, a really famous professor from Harvard. And Schapire, a couple of years later, came up with the idea of boosting, which really changed the machine learning field. So if you look at the iterations of boosting on the x axis here, and the training error for our dataset, we observe a very practical effect that we see with a lot of boosting data. We start with a really bad training error: the first decision stump has a training error of 22.5%, so not good at all. After thirty iterations, as we saw a little earlier, we get all the data points right in our training data, and I showed you that crazy decision boundary. So we see a smooth transition where the classification error tends to go down, tends to go down, and the value goes to zero and actually stays at zero. And that's a key insight of the boosting theorem.
So here is the famous AdaBoost theorem, which underlies all the choices we made in the algorithm and really has had a lot of impact on machine learning. Basically, the theorem says that under some technical conditions (all theorems are like this; it's like saying "some restrictions apply, see store for details"), the training error of the classifier goes to zero as the number of iterations, the number of models considered, capital T, goes to infinity. In other words, pictorially, we expect the training error, which is this thing here on the y axis, to eventually go to zero. Now, that's eventually: it might oscillate a little bit in the middle. What the theorem says is that it tends to generally decrease, eventually become zero, usually, and then stick at zero. So we're going to see it high in the beginning, wiggle, wiggle, wiggle, but it tends to go down, then hits a certain value, usually zero, and sticks at zero. Now, let me just take a minute to talk about those technical conditions, the "see store for details" part.
It turns out the technical condition is something quite important. It says that at every iteration t, we can find a weak learner, so a decision stump, that has weighted error at least a little bit lower than 0.5, so it is at least better than random. That's what the theorem requires. Intuitively, it seems like we should always be able to find a classifier that's better than random on the weighted data, but it turns out that this is not always possible. Here is a counterexample, which is a really extreme one, but there are other examples where it's not possible: if you have a negative point here, and you have a positive point right on top of it, there's never going to be a decision stump that can separate that positive point from the negative point underneath it. So the conditions of the AdaBoost theorem might not be satisfied. Nevertheless, boosting usually takes your training error to zero, or somewhere quite low, as the number of iterations goes to infinity. We do observe that decrease in practice, although because of the technical conditions in the theorem it might not go exactly to zero; but it can get really, really low. [MUSIC]
[MUSIC] So, let's compare boosted decision stumps to a decision tree. This is the decision tree plot that we saw in the decision tree module, on a real dataset based on loan applications. We see that the training error, as we make the decision tree deeper, tends to go down, down, and down, as we saw. But the test error, which is kind of related to the true error, goes down for a while, and you have here the best depth, which is maybe seven, but eventually it goes back up. So at depth 18 over here, the training error is down to 8%, which is a really low training error, but the test error has gone up to 39%, and we observe overfitting; in fact, we observe a huge gap here. By the way, the best decision tree has a classification error on the test set of about 34% or so. Now let's see what happens with the decision stumps that we boosted on the same data set. You get a plot that looks kind of like this: the training error keeps decreasing per iteration, so this is the training error decreasing as the theorem predicts, and the test error in this case is actually going down with iterations too. So we're not observing overfitting, at least yet. After 18 rounds of boosting, so 18 decision stumps, we have 32% test error, which is better than the decision tree, in fact better than the best decision tree, and the overfitting is much smaller. This gap here is much smaller, and the gap between your training error and the test error is kind of related to that overfitting quantity. Now, we said we're not observing overfitting yet, so let's run boosted stumps for more iterations and see what happens.
So let's see what happens when we keep on boosting, adding more decision stumps. On the x-axis, this is adding more and more decision stumps, more and more trees instead of only one, and we watch what happens to the classification error. We see here that the training error keeps decreasing, just like we expected from the theorem, but the test error has stabilized: the test performance tends to stay constant, and in fact it stays constant for many iterations. So if I pick T anywhere in this range, we'll do about the same; any of these values for T would be fine. So now we've seen how boosting seems to stabilize, but the question is: do we observe overfitting in boosting?
But as we know, we need to be very careful about how we pick the parameters of our algorithm. So how do we pick capital T, the point at which we stop boosting? Do we use five decision stumps? 5,000 decision stumps? This is just like selecting the magic parameters of all sorts of algorithms: almost every model out there has a parameter that trades off complexity against the quality of the fit, whether it's the depth of a decision tree, the number of features, the magnitude of the weights in logistic regression, or, here, the number of rounds of boosting. So, it's just like lambda in regularization. We can't use the training data, because the training error tends to go down with more iterations of boosting, so you'd conclude that T should be infinite, or at least really big. And you should never, never, never, ever, ever, ever, ever use the test data, so that's bad; I was just showing you illustrative examples, but you should never do that. So what should we do? You should either use a validation set, if you have a lot of data; if you have a big dataset, you select a sub-part of it just to pick the magic parameters. And if your dataset is not that large, you should use cross-validation. In the regression course, we talked about how to use validation sets and how to use cross-validation to pick magic parameters like lambda, the depth of the decision tree, or the number of rounds of boosting, capital T. [MUSIC]
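One convenient property of boosting is that the first t rounds don't depend on later rounds, so a single long run can be truncated and evaluated at every candidate T on a validation set. A sketch, reusing the adaboost and ensemble_predict sketches above and assuming the train/validation splits and the learn_weak_classifier helper already exist:

```python
import numpy as np

classifiers, coefficients = adaboost(X_train, y_train, learn_weak_classifier, T=500)

validation_errors = []
for T in range(1, len(classifiers) + 1):
    pred = ensemble_predict(X_valid, classifiers[:T], coefficients[:T])
    validation_errors.append(np.mean(pred != y_valid))

best_T = int(np.argmin(validation_errors)) + 1   # number of boosting rounds to keep
```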
[MUSIC] I'd like to take a moment now to summarize what we've seen about boosting and what impact it's had in the real world. Let's talk about AdaBoost. AdaBoost is one of the early types of boosting algorithms, extremely useful, but there are other algorithms out there. In particular, there's one called gradient boosting, which is slightly more complicated but extremely similar. It's like AdaBoost, but it can be used not just for basic classification, but for other types of loss functions and other types of problems, and it's what most people use. You can think of gradient boosting as a kind of generalization of AdaBoost. Then there are other related ways to learn ensembles; the most popular one is called random forests. A random forest is a lot like boosting in the sense that it tries to learn an ensemble of classifiers, in this case decision trees, but it could be other types of classifiers. Instead of using boosting, it uses an approach called bagging. Very briefly, what you do with bagging is take your data set and sample different subsets of the data, which is kind of like learning on different sub-datasets, and learn a decision tree on each one of them; then you just average the outputs. So you're not optimizing the coefficients that we had, and you're learning from different subsets of data. It's easier to parallelize, but it tends not to perform as well as boosting for a fixed number of trees: for 100 trees, or 100 decision stumps, boosting tends to perform better than random forests. Now let's take a moment to discuss the impact boosting has had in the machine learning world, and, hint, hint, it's been huge. It's amongst the most useful machine learning approaches out there, useful in a wide range of fields. For example, in computer vision, a lot of the default algorithms use boosting; face detection algorithms, where you point your camera at something and it tries to detect your face, often use boosting, and it's very useful there. If you look at machine learning competitions, which have become very popular over the last two or three years in places like Kaggle or the KDD Cup, most winners, more than half, I think about 70% of winners, actually use boosting to win their competitions. In fact, they use boosted trees, across a wide range of tasks like malware detection, fraud detection, ranking web searches, and even interesting physics tasks like detecting the Higgs boson. All those problems and all those challenges have been won by boosted decision trees. And this is perhaps one of the most widely deployed advanced machine learning methods out there, particularly the notion of ensembles. For example, Netflix, which is an online place where you can watch movies, recommends what movie you might want to watch next; that system uses an ensemble of classifiers. More interestingly, they had a competition a few years ago where people tried to provide better recommendations, and the winner was one that created an ensemble of many, many, many classifiers in order to make better recommendations. So ensembles: you'll see them everywhere. Sometimes they're optimized with boosting, sometimes with different techniques like bagging, and sometimes people just hand-tune the weights and say, okay, I'm going to give weight one to this one, a half to that one. I don't recommend the last approach; I recommend boosting as the one to use. Great. So in this module we've explored the notion of an ensemble classifier, and we formalized ensembles as the weighted combination of the votes of different classifiers. We discussed the general boosting algorithm, where the next classifier focuses on the mistakes that we've made so far, as well as AdaBoost, which is a special case for classification where we showed you how to come up with the coefficients of each classifier and the weights on the data. We discussed how to implement this with decision stumps, which is extremely easy to do. And then we talked a little bit about the convergence property, how the AdaBoost training error tends to go to 0, but also that you have to be a little bit concerned about overfitting, although boosting tends to be robust to overfitting in practice. [MUSIC]
WEEK
6
2 hours to complete
Precision-Recall
In many real-world settings, accuracy or error are not the best quality metrics for
classification. You will explore a case-study that significantly highlights this issue:
using sentiment analysis to display positive reviews on a restaurant website. Instead
of accuracy, you will define two metrics: precision and recall, which are widely used in
real-world applications to measure the quality of classifiers. You will explore how the
probabilities output by your classifier can be used to trade off precision with recall,
and dive into this spectrum, using precision-recall curves. In your hands-on
implementation, you will compute these metrics with your learned classifier on real-
world sentiment analysis data.
so it's a false positive. I find it very helpful to ground these ideas of false positives and false negatives in the context of an example, to really feel it and really understand what the impact of those mistakes can be. So let's look at this matrix here again. If you look at the top left, we have a truly positive sentence, so it was a +1 sentence, and we got it right, we made a +1 prediction. So that's no mistake, that's great. Similarly for the bottom right, we didn't make a mistake: we had a -1 sentence, so a negative sentence, and we made a negative prediction. Now, the problematic quadrants are the other two. Let's look first at the top one. What happened here was that I had a positive sentence, but made a -1 prediction. What does this actually mean? It was a positive sentence in the world, but we labeled it negative, so I missed a sentence I could have shown on my website. Maybe this is not too bad, you know; there might be other positive things being said, so maybe missing one is not that bad, but it's still a problem. But let's look at the other quadrant, where we have a -1 sentence but I made a +1 prediction. In other words, it was a negative sentence in the world, in a review, and I thought it was positive. That means I showed a bad thing, a bad review, on my website. So this is quite problematic: I showed a bad review on the website, maybe it said the sushi sucked, everybody read it, and nobody comes to my restaurant anymore. Big, big, big, big trouble.
So y hat is also +1 here, but there's a subset which we don't capture, where we think y hat is -1, so y hat does not agree with y. And that's the part that we missed. So the recall is the fraction of the positive ones that we actually capture; we want everything to be in the blue box here. More formally, we can define recall as the number of true positives, the data points that were positive and that we got right, divided by the number of true positives plus the number of false negatives, the data points that were truly positive but that we labeled as negative, so falsely labeled as negative. This is going to have value one if the false negatives are zero, which means we captured everything, we captured all the true positives; and zero if we did not capture any true positives, so all the positive data went into the false negative bin. If we go back to the example that we've been looking at, I want to show positive sentences on my website. I've got four of them with y hat i equal to +1, but I missed out on two sentences. For example, I missed out on the sentence that said my wife tried the ramen and it was delicious, so maybe somebody who's interested in ramen doesn't see that sentence on my website, and I missed out on something really good. So high recall means that you discover everything positive that's being said about the restaurant, all of the positive data points. [MUSIC]
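As a small sketch of these two definitions, assuming NumPy arrays of true labels and predictions in {+1, -1}:

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for labels in {+1, -1}."""
    true_positives  = np.sum((y_pred == +1) & (y_true == +1))
    false_positives = np.sum((y_pred == +1) & (y_true == -1))
    false_negatives = np.sum((y_pred == -1) & (y_true == +1))
    precision = true_positives / (true_positives + false_positives)
    recall    = true_positives / (true_positives + false_negatives)
    return precision, recall
```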
[MUSIC] When there's a trade-off between precision and recall, it's important for us to look at the two extremes. What does it mean to have a classifier that's extremely precise? What does it mean to have a classifier with extremely high recall? And how can the two go against each other sometimes? First, let's think about what I call an optimistic classifier. You might know some of these optimists in your life; they think everything is good. How's it going? Good. Even if bad stuff is happening, they say good. Those folks say that all possible experiences are good, so they're optimists. That means that pretty much every input, every sentence, is labeled as positive, and very few get labeled as negative. It's extremely likely that all the truly positive data points get labeled as good. What does that mean? It means that I have perfect recall, because I recall all those positive data points. Good. But I might not get perfect precision, because I put a bunch of negatives into that bin as well. How can we address that? We can have the pessimistic classifier. You might have some friends like that, where you try really hard, you do everything for them, you go out of your way, and everything sucks; every single experience is really bad. There are very, very, very, very few things that they say are good, and when they do, those things are very likely to be good, but everything else they say is bad, so everything else in the world gets y hat equals -1. Being a pessimist means that you're going to miss out on many good things in life. The pessimists have high precision, because the few things they say are good tend to be good, but very, very, very low recall: they don't discover the great things in life. It turns out there is a spectrum between a high precision, low recall model and a low precision, high recall model, the pessimist and the optimist. What we'd like to do is somehow balance between the two perspectives on the world to find something that's just right for us. So, balance between the pessimistic model and the optimistic model. In particular, we want to find as many positive reviews or sentences as possible, with as few incorrect predictions as we can. That's the balance we're trying to strike in the case of our restaurant. [MUSIC]
[MUSIC] Thus far we've talked about precision, recall, optimism, pessimism, all sorts of different aspects. But one of the most surprising things about this whole story is that it's quite easy to navigate from a low precision model to a high precision model, and from a high recall model to a low recall model, so to explore that spectrum. We can have a low precision, high recall model that's very optimistic, we can have a high precision, low recall model that's very pessimistic, and it turns out that it's easy to find a path in between. And the question is, how do we do that?
If you recall from earlier in this course, we assign not just a label, +1 or -1, to every data point, but a probability: say, 0.99 of being positive for "the sushi and everything else were awesome", and say 0.55 of being positive for "the sushi was good, the service was okay". These probabilities, as I mentioned earlier in the course, are going to be fundamentally useful, and now you're going to see a place where they are amazingly useful: the probabilities can be used to trade off precision with recall. So let's figure that out. Earlier in the course, we just had a fixed threshold to decide whether an input sentence xi was going to be positive or negative. We said it's going to be positive if the probability is greater than 0.5, and negative if the probability is less than or equal to 0.5. Now, how can we create an optimistic or pessimistic model by just changing that 0.5 threshold? Let's explore that idea. Think about what would happen if we set the threshold, instead of 0.5, to 0.999, so a data point is only +1 if its probability is greater than 0.999. Well, here's what happens: very few data points would satisfy this, so very few data points will be labeled +1, and the vast majority will be labeled -1. We call this classifier the pessimistic classifier.
Now alternatively, if we change the threshold to be 0.001, then almost any experience is going to be labeled as positive. Almost all of the data points are going to satisfy this condition, so we're going to say that everything is +1, and this is going to be the optimistic classifier: it says yeah, everything is +1, everything's good. So by varying that threshold from 0.5 to something close to 0 or something close to 1, we're going to move between optimism and pessimism. If you go back to this picture of logistic regression, for example, as a concrete case: we have this input, the score of x, and the output here is the probability that y is equal to +1 given x and w. This should bring back some memories, maybe some sad, sad memories. The threshold here is going to be a cut where we set y hat equal to +1 if the probability is greater than or equal to this threshold t. So everything above the line will be labeled +1 and everything below the line will be labeled -1.
Or, concretely, let's see what happens if we set the threshold to be some very, very high number, so t here is close to one. If t is some number close to one, then everything below that line will be labeled -1, and very, very few things above the line will be labeled +1. That's why we end up with the pessimistic classifier. On the flip side, if we set the threshold t to be something very, very small, then everything's going to be above the line, so everything's going to be labeled +1 and very few data points are going to be labeled -1, and we end up with the optimistic classifier. So ranging t from 0 to 1 takes us from optimism to pessimism. In other words, that spectrum that we said we wanted to navigate can now be navigated with a single parameter, t, that goes between 0 and 1. [MUSIC]
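To make the thresholding concrete, here is a minimal Python sketch (not from the course materials) that labels points from their predicted probabilities; the function name and the toy probabilities are illustrative assumptions.

```python
import numpy as np

def classify_with_threshold(probabilities, t=0.5):
    """Label a point +1 if P(y = +1 | x, w) >= t, else -1.

    `probabilities` is assumed to come from an already-trained
    logistic regression model; this helper is purely illustrative.
    """
    return np.where(np.asarray(probabilities) >= t, 1, -1)

probs = np.array([0.99, 0.55, 0.30, 0.80])
print(classify_with_threshold(probs, t=0.5))    # balanced:    [ 1  1 -1  1]
print(classify_with_threshold(probs, t=0.999))  # pessimistic: everything -1 here
print(classify_with_threshold(probs, t=0.001))  # optimistic:  everything +1
```

Sliding t toward 1 recovers the pessimistic classifier, and sliding it toward 0 recovers the optimistic one.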
[MUSIC] We saw how we could change the threshold between zero and one for deciding what's a positive prediction, and so navigate between the optimistic classifier and the pessimistic classifier. There's actually a really intuitive visualization of this, called the precision-recall curve. Precision-recall curves are extremely useful for understanding how a classifier is performing. In this case, you can imagine plotting two points on that curve. What happens to the precision when the threshold is very close to one? Well, the precision is going to be one, because we predict positive on very, very few things and we're very sure those are correct. But the recall is going to be zero, because we're saying everything else is bad; that's the pessimistic point. On the other extreme of our precision-recall curve, the point at the bottom there is the optimistic point, where you have very high recall, because you're going to find all the positive data points, but very low precision, because you're going to find all sorts of other stuff and say that's still good. That happens when t is very small, close to zero. Now if you keep varying t, you have a spectrum of tradeoffs between precision and recall.
So if you want a model that has a little bit more recall but is still highly precise, maybe you set t = 0.8; but if you really want really, really high recall, while still trying to improve precision a little bit, maybe you set t to 0.2. You can navigate that spectrum to explore the tradeoff between precision and
recall. Now there doesn't always have to be a tradeoff: if you have a really perfect classifier, you might have a curve that looks like this. This is kind of the world's ideal, where you have perfect precision no matter what your recall level. This line basically never happens, but that's the ideal, that flat line at the top, and that's where you're trying to get to. So the closer your algorithm is to the flat line at the top, the better it is. And so precision-recall curves can be used to compare algorithms in addition to understanding a single one.
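As a rough illustration of how such a curve is traced out, here is a small Python sketch (our own, not the course's plotting code) that sweeps the threshold t and records a (precision, recall) pair at each value; the toy probabilities and labels are made up.

```python
import numpy as np

def precision_recall_curve(probabilities, labels, thresholds):
    """Trace out (precision, recall) pairs as the threshold t varies.

    `labels` holds the true +1/-1 sentiments; this is a from-scratch sketch.
    """
    points = []
    for t in thresholds:
        predictions = np.where(probabilities >= t, 1, -1)
        true_pos = np.sum((predictions == 1) & (labels == 1))
        pred_pos = np.sum(predictions == 1)
        actual_pos = np.sum(labels == 1)
        precision = true_pos / pred_pos if pred_pos > 0 else 1.0
        recall = true_pos / actual_pos if actual_pos > 0 else 0.0
        points.append((t, precision, recall))
    return points

probs = np.array([0.95, 0.80, 0.60, 0.40, 0.10])
labels = np.array([1, 1, -1, 1, -1])
for t, p, r in precision_recall_curve(probs, labels, [0.2, 0.5, 0.9]):
    print(f"t={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, t = 0.9 gives perfect precision but low recall (the pessimistic end), while t = 0.2 gives perfect recall but lower precision (the optimistic end).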
So for example, let's say you have two classifiers, classifier A and classifier B. And you see that
for every single point, classifier B is higher than classifier A. In that case we always prefer
classifier B. No matter what the threshold is, classifier B always gives you a better precision for
the same recall, better precision for same recall. So B is always better. However, life is not always
this simple.
If there's one thing you should have learned thus far, it's that life in practice tends to be a bit messy. And so often what you observe is not classifier A and B like we just saw, but classifier A and C like we're seeing over here, where there might be one or more cutoff points, so classifier A does better in some regions of the precision-recall curve while classifier C does better in others. So, for example, if you're interested in very high precision but are okay with lower recall, then you should pick classifier C, because it does better in that region; it's higher up, closer to that flat line. But if you care about getting high recall, then you should choose classifier A, because in the high recall regime, when you pick t's that are smaller, classifier A tends to do better; you see its curve is higher over here. So that's the kind of complexity of dealing with machine learning in the real world. Now if you just had to pick one classifier, the question is, how do you decide? How do you choose between A and C in this case?
And often the single number you use to decide, as I was hinting at, depends on where you want to be on the precision-recall tradeoff curve. There are many metrics out there that try to give you a single number; some are called F1 measures, some are called area-under-the-curve. For a lot of applications I'm less fond of those measures myself than I am of one that's much simpler, called precision at k. Let me talk about that, because it's a really simple measure and really useful. Let's say that there are five slots on my website to show sentences. That's all I care about: I want to show five great sentences on my website. I don't have room for ten million or five million, just for five. And say I show five sentences there, four were great and one sucked. I want all five to be great, so I want my precision for the top five sentences to be as good as possible. In this case, our precision was four out of five, 0.8. I ended up putting a sentence in there that said, "my wife tried the ramen and it was pretty forgettable." That's kind of a disappointing thing to put in. So for many applications, like recommender systems, for example, where you go to a web page and somebody shows you some products you might want to buy, precision at k is a really good metric to be thinking about. [MUSIC]
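A minimal sketch of precision at k, assuming we rank sentences by their predicted probability of being positive and show the top k; the function and the toy numbers are ours, not the course's.

```python
import numpy as np

def precision_at_k(probabilities, labels, k=5):
    """Of the k items we'd actually show (highest predicted probability of
    being positive), what fraction are truly positive?"""
    top_k = np.argsort(-np.asarray(probabilities))[:k]
    return np.mean(np.asarray(labels)[top_k] == 1)

# 4 of the top-5 scored sentences are genuinely positive -> precision at 5 = 0.8
probs  = [0.99, 0.97, 0.95, 0.90, 0.88, 0.40, 0.20]
labels = [   1,    1,   -1,    1,    1,   -1,    1]
print(precision_at_k(probs, labels, k=5))  # 0.8
```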
[MUSIC] In this module, we discussed a very important fundamental concept, which is evaluating classifiers. In particular, we talked about precision and recall, concepts that are widely used, way beyond the classifiers we talked about here, in basically any classification problem you're going to see in industry. We saw that straight accuracy or error metrics may not be the right things for your application, and that you need to look at something else; precision and recall are among the first things you might want to look at. Precision captures the fraction of your positive predictions that are actually positive, and recall captures, of all the positive sentences out there, which ones you found, which ones you labeled as positive. We talked about the tradeoff between precision and recall, and how you can navigate that tradeoff with the threshold parameter t, in terms of probability, and really get these beautiful precision-recall tradeoff curves. And finally, we talked about comparing models with the precision at k metric, which is one that I particularly like for a lot of applications.
WEEK
7
2 hours to complete
Scaling to Huge Datasets & Online Learning
With the advent of the internet, the growth of social media, and the embedding of
sensors in the world, the magnitudes of data that our machine learning algorithms
must handle have grown tremendously over the last decade. This effect is sometimes
called "Big Data". Thus, our learning algorithms must scale to bigger and bigger
datasets. In this module, you will develop a small modification of gradient ascent
called stochastic gradient, which provides significant speedups in the running time of
our algorithms. This simple change can drastically improve scaling, but makes the
algorithm less stable and harder to use in practice. In this module, you will investigate
the practical techniques needed to make stochastic gradient viable, and thus to
obtain learning algorithms that scale to huge datasets. You will also address a new
kind of machine learning problem, online learning, where the data streams in over
time, and we must learn the coefficients as the data arrives. This task can also be
solved with stochastic gradient. You will implement your very own stochastic gradient
ascent algorithm for logistic regression from scratch, and evaluate it on sentiment
analysis data.
Let's now dig in and understand the change we hinted at, where you use a little bit of data to compute the gradient instead of using the entire dataset. In fact, we're just going to use one data point to compute the gradient instead of everything. So this is the gradient ascent algorithm for logistic regression, the one that we've seen earlier in the course, and I'm showing the gradient explicitly over here. Now, it requires a sum over data points, which is the thing that we're trying to avoid; we're not going to do a sum over data points at every update, at every iteration. So let's throw out that sum. But each time we're going to pick a different data point. So we're going to introduce an outer loop here where we loop over the data points, 1 through N, and then we compute the gradient with respect to that data point and update the parameters. And we do that one at a time.
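Here is a minimal Python sketch of that idea for logistic regression, under the usual convention that the gradient contribution of point i is (1[y_i = +1] - P(y = +1 | x_i, w)) times its features; the function names and defaults are assumptions for illustration.

```python
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

def stochastic_gradient_ascent(X, y, eta=0.1, n_passes=10):
    """Stochastic gradient ascent for logistic regression (a minimal sketch).

    X is an (N, D) feature matrix, y holds labels in {+1, -1}.
    Instead of summing the gradient contribution over all N points,
    we update the coefficients one data point at a time.
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_passes):
        for i in range(N):                         # outer loop over data points
            indicator = 1.0 if y[i] == 1 else 0.0
            error = indicator - sigmoid(X[i] @ w)  # contribution of point i only
            w += eta * error * X[i]                # update from this single point
    return w
```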
[MUSIC] Let's take a moment to compare gradient to stochastic gradient to really understand how
the two approaches relate to each other. And we see that this very, very small change in your
algorithm and your implementation is going to make a big difference. So we're going to build out
the table here, comparing the two approaches, gradient in blue, stochastic gradient in green. And
we saw already that gradient is slow at computing an update for large data, while stochastic
gradient is fast. It doesn't depend on the dataset size, it's always fast. But the question is, what's
the total time? Each iteration is cheaper for stochastic gradient, but is it cheaper overall? And
there's two answers to this question. In theory, stochastic gradient for large datasets is always
faster, always. In practice, it's a little bit more nuanced, but it's often faster, so it's often a good
thing to do. However, stochastic gradient has a significant problem, it's much more sensitive to the
choice of parameters, like the choice of step size, and it has lots of practical problems. And so a lot of the focus of today's module is to talk about those practical challenges of stochastic gradient, how to overcome them, and how to get the most benefit out of this small change to your algorithm.
We'll see a lot of pictures like this, so I'm going to take a few minutes to explain it. This picture compares gradient to stochastic gradient. The red line here is the behavior of gradient as you iterate through the data, as you make passes over the data. On the y axis we see the data likelihood, so higher is better: we fit the data better. The blue line here is stochastic gradient. To be able to compare the two approaches, on the x axis I'm not showing exact running time, but how many data points you need to touch. Stochastic gradient makes an update every time it sees a data point; gradient makes an update every time it makes a full pass over the data. So we're showing how many passes we're making over the data. The full x axis here is ten passes over the data, and you see that after ten passes over the data, gradient gets to a likelihood that's much lower than that of stochastic gradient; stochastic gradient gets to a point that's much higher. And even if you look at it on longer scales, you'll see that stochastic gradient converges faster. However, it doesn't converge smoothly to the optimal solution: it oscillates around the optimal solution, and we will understand today why that happens. That oscillation is one of the challenges introduced by stochastic gradient. Now I've extended the plot from 10 passes over the data to 100 passes over the data, and we see that gradient is getting to solutions very close to those of stochastic gradient. But again, you see a lot of noise and a lot of oscillation from stochastic gradient: sometimes it's good, sometimes it's bad, sometimes it's good, sometimes it's bad. So that's the challenge there. So
here's a summary of what we've learned. We make a tiny change to our algorithm: instead of using the whole dataset to compute the gradient, we use a single data point, and we call that stochastic gradient. We're going to get better quality faster. However, it's going to be tricky, there are going to be some oscillations, and you have to learn some of the practical issues that you need to address in order to make this really effective. But this change is going to allow you to scale to billions of data points. Even on your desktop you'll be able to deal with a ton of data, which is really super exciting. [MUSIC]
[MUSIC] So we've made a small change to the gradient algorithm, but it has a big impact: instead of looking at all the data points, we look at them one at a time. Why should this even work? Why is this a good idea at all? Let's spend a little bit of time getting intuition for why it works. This intuition is going to help us understand the behavior of the algorithm in practice. In this picture, I'm showing you the landscape, the contour plot, for our sentiment analysis problem, which is the data that we'll be using throughout our module today, but just for a subset of the data where we're only looking at two possible features: the coefficient for "awful" and the coefficient for "awesome". If we were to start here at this point, this would be the exact gradient that we'd compute, and that exact gradient gives you the best possible improvement in going from w_t to w_t+1. That's the best thing you could possibly do. Now, there are many other directions that also improve the value, improve the likelihood, the quality. So for example, if we were to take this other direction over here and reach some parameter w prime, that's still okay, because we're still going uphill, we're increasing the likelihood. In other words, the likelihood of w prime is going to be greater than the likelihood of w_t. So in fact, any direction we take uphill will be good; the gradient is just the best direction.
The gradient direction is the sum of the contributions from each one of the data points. So if I look at that gradient direction, it is the sum over my data points of the contributions from each one of these x_i's. Each one of these red lines here is the contribution from a data point, from each (x_i, y_i), which is this part of the equation here. Interestingly, most contributions point upwards. All of these over here are pointing in an upward direction, so if I were to pick any of those, I would make some progress. If I picked any of the other ones, like the ones back here, I would not make progress; those would be bad directions. But on average, most of them are good directions. And this is why stochastic gradient works. In stochastic gradient, we're just going to pick one of these directions, and most of them are okay. So most of the time we're going to make progress. Sometimes, when we take a bad direction, we won't make progress, we'll make negative progress. But on average, we're going to be making positive progress. And that's exactly why stochastic gradient works.
So if you think about the stochastic gradient algorithm, we're going one data point at a time, picking the direction associated with that data point, and taking a little step. At the first iteration we might pick this data point and make some progress, at the second one pick this one, number two here, and make negative progress. But at the third one, I pick another one that makes positive progress, and I pick a fourth one that makes positive progress, a fifth one that doesn't, a sixth one that does, and so on. And so most of the time we're making positive progress, and overall the likelihood is improving. [MUSIC]
[MUSIC] Next, let's visualize the path that gradient takes, as opposed to stochastic gradient, what I call the convergence paths. As you will see, stochastic gradient oscillates a bit more, but gets you close to that optimal solution. So, as before, in the black line I'm showing the path of gradient, and you see that that path is very smooth and does very nicely. In the red line I show you the path of stochastic gradient. You see that this is a noisier path. It does get us to the right solution, but one thing to note is that it doesn't converge and stop like gradient does; it oscillates around the optimum. This is going to be one of the practical issues that we're going to address when we talk about how to get stochastic gradient to work in practice, and it's a significant issue. Another view of stochastic gradient oscillating around the optimum can be seen in this plot, the one we've been using for quite a while. You see that gradient is smoothly making progress, but stochastic gradient is this noisy curve as it makes progress and as it converges: it oscillates around the optimum.
Let's summarize. Gradient ascent looks for the direction of greatest improvement, the steepest ascent direction, and does that by summing over all the data points. Stochastic gradient, on the other hand, tries to find a direction that usually makes progress, for example by picking one data point to estimate the gradient. On average it makes a ton of progress, and because of that it tends to converge much faster, but it's noisier around the optimum. So even in the simple example we've been using today, it's over a hundred times faster than gradient, much, much faster to converge, but it gets noisy in the end.
[MUSIC] You've heard me hint about this a lot today: the practical issues of stochastic gradient are pretty significant. So for most of the remainder of the module, we'll talk about how to address some of those practical issues. Let's take a moment to review the stochastic gradient algorithm. We initialize our parameters, our coefficients, to some value, let's say all 0. And then, until convergence, we go one data point at a time, and one feature at a time, and we update the coefficient of that feature by computing the gradient at that single data point. So we need the contribution of each data point, one at a time: we're scanning down the data, updating the parameters one data point at a time. For example, I see my first data point here, 0 "awesome", 2 "awful", sentiment -1. I make an update which pushes me towards predicting -1 for this data point. Now if the next data point is also negative, I'm going to make another negative push. If the third one is negative, I make another negative push, and so on. And so one worry, one bad thing that can happen with stochastic gradient, is if your data is implicitly sorted in a certain way, for example, all the negative data points coming before all the positives. That can introduce some bad behavior in the algorithm, bad practical performance. And so we worry about this a lot; it's really significant.
So, because of that, before you start running stochastic gradient you should always shuffle the rows of the data, mix them up, so that you don't have these long regions of, say, negatives before positives, or young people before older people, or people who live in one country before people who live in another country. You want to mix it all up. What that means, in the context of the stochastic gradient algorithm that we just saw, is just adding a line at the beginning where you shuffle the data. So before doing anything, you should start by shuffling the data.
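For instance, a shuffling step in Python might look like the following sketch (the toy arrays and the fixed seed are just for illustration).

```python
import numpy as np

# A minimal sketch: shuffle rows before running stochastic gradient, so the
# algorithm never sees long sorted runs (e.g. all negatives before positives).
X = np.array([[0., 2.], [1., 3.], [3., 0.], [2., 1.]])   # toy feature matrix
y = np.array([-1, -1, 1, 1])                             # sorted: negatives first!

rng = np.random.default_rng(seed=0)      # seed is arbitrary, just for this sketch
permutation = rng.permutation(len(y))
X_shuffled, y_shuffled = X[permutation], y[permutation]
print(y_shuffled)                        # labels are now mixed up
```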
[MUSIC] The second question with stochastic gradient is how you pick the step size eta. This is a significant issue, just as it is with gradient. For both of them it's kind of annoying, a pain, to figure out how to pick that coefficient eta. But it turns out that because of the oscillations of stochastic gradient, picking eta can be even more annoying, much more annoying. So if we go back to the dataset we've been using, I've shown you this blue curve many times. This was the best eta that I could find, the best step size. Now, if I were to pick smaller step sizes, smaller etas, it will be slower to converge and you'll see fewer oscillations; it will eventually get there, but much more slowly. So we worry about that a bit. On the other hand, if instead of using the best step size we try a larger step size, thinking we could make more progress more quickly, you'll see these crazy oscillations, and the oscillations are much worse than what you observe with gradient. So you have to be really careful: if you pick an eta that's too large, things can behave extremely erratically.
And in fact, if you pick a step size that's very, very large, you can end up with behaviors like this. This black line here was an eta that was "way too large", that's the technical term we like to use here. In this case, the solution is not even close to anything we got, even with the oscillating etas I showed you on the previous slide. It's a huge gap, and so an eta that's too large leads to really bad behavior in stochastic gradient. The rule of thumb that we described for gradient, for picking eta, is basically the same as the one for picking the step size for stochastic gradient. It's the same as for gradient, but unfortunately it requires much more trial and error. So it's even more annoying: you might spend a lot of time in that trial and error, and even though stochastic gradient is a hundred times faster to converge, it's possible to spend a hundred times more effort trying to find the right step size, so just be prepared. But we just try several values exponentially spaced from each other, find an eta that's too small and an eta that's too big, and then find one in between that's just right. And, I mentioned this in the gradient section, but for stochastic gradient it's even more important: for those who end up exploring this further, there's an advanced step where you make the step size decrease over iterations. So, for example, you might have an eta that depends on what iteration you're in, and often you set it to something like some constant eta_0 divided by the iteration number t. This approach tends to reduce noise and make things behave quite a bit better. [MUSIC]
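A small sketch of both ideas, candidate step sizes spaced exponentially and a decreasing schedule eta_t = eta_0 / t; the specific values here are illustrative, not recommendations from the course.

```python
# Candidate step sizes, exponentially spaced; try each and keep the one that
# gives the best (e.g. moving-average) likelihood. Values are illustrative.
candidate_etas = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]

# Advanced option: a decreasing schedule, eta_t = eta_0 / t, which damps the
# oscillations of stochastic gradient as the iterations go on.
def step_size(eta_0, t):
    return eta_0 / t      # t is the iteration number, starting at 1

print([round(step_size(0.1, t), 4) for t in range(1, 6)])  # 0.1, 0.05, 0.0333, ...
```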
[MUSIC] As we saw in our plot, stochastic gradient tends to oscillate around the optimum, and so unfortunately you should never trust the last parameter value it finds. Gradient will eventually stabilize on the optimal solution. So even though it takes a hundred times longer or more, like was shown in this example (if you look at the x-axis, a hundred times more time to converge), you get there, and you feel really good when you get there. With stochastic gradient, when you think it has converged, it's really just oscillating around the optimum, and that can lead to bad practical behavior. So for example here, just to give you some numbers, say w at iteration 1000 might look really, really bad, but maybe w at iteration 1005 looks really, really good, and we need some kind of approach to minimize the risk of picking a really bad one rather than a really good one. There is a very simple technique which works really well in practice, and theoretically it's what you should do; all the theorems require something like this. What it says is: when you output w hat, your final set of coefficients, you don't use the last value, w(T), capital T; you use the average of all the values you've computed, all the coefficients you computed along the way. So what I'm showing here is what your algorithm should output as its fitted solution, to make predictions in the real world.
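A minimal sketch of that averaging step, assuming we have kept a history of the coefficient vectors visited along the way; the variable names are ours.

```python
import numpy as np

def average_coefficients(w_history):
    """Instead of returning the last iterate w(T), output the average of all
    coefficient vectors visited along the way: w_hat = (1/T) * sum_t w(t).
    `w_history` is a list of coefficient vectors saved at each iteration."""
    return np.mean(np.asarray(w_history), axis=0)

# e.g. the last few iterates oscillate around the optimum (1.0, -2.0)
history = [np.array([1.2, -1.8]), np.array([0.8, -2.2]), np.array([1.0, -2.0])]
print(average_coefficients(history))   # ~[ 1.0, -2.0]
```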
[MUSIC]
[MUSIC] The next few sections of this module are going to talk about more practical issues with stochastic gradient, which you need to address when implementing these algorithms. I've made the next few sections optional, because it can get really tedious to go through all these practical details; you've already gotten a sense of how finicky it can be. There are many other practical details, so I'm going to make those sections optional. The first one is that learning from one data point is just way too noisy, so usually you use a few data points, and this is called mini-batches.
So far we've really illustrated two extremes. We illustrated gradient, where you make a full pass through the data and use N data points for every update of your coefficients. And then we talked about stochastic gradient, where you look at just one data point when you're making an update to the coefficients. And the question is, is there something in between, where you look at B data points, say 100? That's called mini-batches; it reduces noise, increases stability, and it's the right thing to do. And 100 is a really good number to use, by the way. Here I'm showing the convergence paths on the same problem we've been looking at, but comparing a batch size of 1 with a batch size of 25. And here you observe two things. First, the batch size of 25 makes the convergence path smoother, which is a good thing. But the second thing to observe, which is even more interesting, is that when you get near the optimum, a batch size of one really oscillates around the optimum, while a batch size of 25 oscillates less, so it has better behavior around the optimum. And by better behavior, I mean it's going to be much easier to use this approach in practice. So mini-batches are a great thing to do.
So now we've introduced one more parameter to be tuned in this stochastic algorithm, the batch size B. If it's too large, then it behaves just like gradient; for example, if you use a batch size of N, it's exactly the gradient ascent algorithm, so in this case the red line here is a batch size that's too large. If the batch size is too small, you get bad oscillations, bad behavior; so with B too small, in this case, it doesn't converge very well. But if you pick the best batch size B, you get very nice behavior: you quickly get to a great solution, and you stay around it. So picking the right batch size makes a big difference.
So let's go back to the simple stochastic gradient algorithm, modify it, and introduce the notion of batch sizes. Instead of looking at one data point at a time, we're going to look at one batch at a time. And if the batch size is B, we have N over B mini-batches for a dataset of size N. So if we have 1 billion data points and batch size 100, that's 1 billion over 100 of those. And now we go batch by batch, but instead of considering one data point at a time in the computation of the gradient, we consider the B data points in that mini-batch. The equation here just shows you the math behind the obvious thing of taking 100 data points and using just those to estimate the gradient, instead of 1 or instead of 1 billion. [MUSIC]
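A minimal Python sketch of the mini-batch variant, assuming the same logistic regression setup as before; the batch size, step size, and shuffling-per-pass choices here are illustrative.

```python
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

def minibatch_gradient_ascent(X, y, eta=0.1, batch_size=25, n_passes=10, seed=0):
    """Mini-batch gradient ascent (a sketch): estimate the gradient from B data
    points at a time instead of 1 (stochastic) or N (full gradient)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_passes):
        order = rng.permutation(N)               # shuffle before each pass
        for start in range(0, N, batch_size):    # roughly N/B mini-batches per pass
            batch = order[start:start + batch_size]
            indicator = (y[batch] == 1).astype(float)
            errors = indicator - sigmoid(X[batch] @ w)
            w += eta * X[batch].T @ errors        # sum of the B contributions
    return w
```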
[MUSIC] The second practical issue we're going to talk about is how you measure convergence for stochastic gradient. If you have to look at all the data points to figure out whether you have converged, then the whole process is going to be pointless and meaningless. So we need to think about new techniques for measuring convergence. This, again, is going to be an optional section. It's very practical for those who actually want to implement and use a really practical stochastic gradient algorithm. One way to think about it is: how did we actually make this plot? Here's a plot where stochastic gradient gets to the optimum before a single pass over the data, while gradient is taking 100 or more passes over the data. If, to get one point in this plot, I had to compute the whole likelihood over the entire dataset, that would require me, for every little blue dot, to make a pass over the entire dataset, which would make the whole thing really slow. If I had to make a pass over the dataset to compute the likelihood, I might as well just use the full gradient and not have these noise problems. And so we need to rethink how we're going to compute convergence, how we're going to plot that we're making progress.
And here there's a really, really simple trick, really easy, really beautiful, that we can do. So I'm showing here the stochastic gradient ascent algorithm for logistic regression, the one that we've been using so far. I go data point by data point, and I compute the contribution to the gradient, which is this part, this thing I'm calling partial j. Now, if I wanted to compute the log likelihood of the data to see what my quality was, how well I'm doing, I'd have to compute the following equation for data point i. If the data point is a positive data point, I take the log of the probability that y = +1 given x_i and the current parameters, the current coefficients. And if the data point is negative, y_i = -1, then I need to take the log of the probability that y_i = -1, which turns out to be log of 1 - P(y = +1). And here's the beautiful thing: the quantity I need to compute the likelihood for a data point is exactly the same as the quantity I needed to compute the gradient. So I've already computed it; I can compute the contribution to the likelihood of this one data point. Which is great. I can't do it for everybody, but I can do it for one data point.
So at every iteration t, I can compute the likelihood of a particular data point. I can't use that by itself to measure convergence, because I might do well on one data point, classify it perfectly, but not on others, so it would be a very noisy measure. But if I want to see how well I'm doing, say, after 75 iterations, what I can do is look at the last few data points and how well I did on them. I take the likelihood for those data points, average it out, and create a smoother curve. And so, for every timestamp, I keep an average, called a moving average, of the last few likelihoods in order to measure convergence.
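A sketch of that bookkeeping in Python: compute the log likelihood contribution of the current data point (reusing the probability already needed for the gradient) and report a moving average over the last few points; the window of 30 and the helper names are assumptions for illustration.

```python
import numpy as np

def point_log_likelihood(x_i, y_i, w):
    """Log likelihood contribution of a single data point: log P(y_i | x_i, w)."""
    p_positive = 1.0 / (1.0 + np.exp(-(x_i @ w)))
    return np.log(p_positive) if y_i == 1 else np.log(1.0 - p_positive)

def smoothed_likelihood(recent_log_likelihoods, window=30):
    """Moving average of the last `window` per-data-point log likelihoods,
    used to monitor convergence without touching the whole dataset."""
    recent = recent_log_likelihoods[-window:]
    return float(np.mean(recent))
```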
And so in the plot here, the plot that I've been showing you, now I can tell you, truth in advertising, what it actually was. The blue line here was not straight-up stochastic gradient; it was mini-batches of size 100, which is a great number to use. It still converges much faster than gradient. And to draw that blue line, I averaged the likelihood over the last 30 data points. So that's how I built the plot, and this is how you would have to build the plot if you're going to go through this whole process with stochastic gradient. [MUSIC]
[MUSIC] We're now down to our final practical issue with stochastic gradient, which again is going to be an optional section. The question is, how do we introduce regularization, and what impact does it have on the updates? This is going to be pretty significant if you want to implement it yourself in a general way. Again, it's an optional section because it's pretty detailed. So let's remind ourselves of the regularized likelihood. We have some total quality metric defined by the fit to our data, which is the log likelihood, and some measure of the complexity of the parameters, which in our case is the norm of w squared, the L2 norm of the parameters, and we want to compute the gradient of this regularized likelihood to make progress and avoid overfitting. So the total derivative is the derivative of the first term, the quality, which is the sum over the data points of the contribution of each data point, the thing that got really expensive, plus the contribution of the second term; once we introduce that parameter lambda to trade off the two things, which we've talked about a lot, that contribution is -2 lambda w_j. And we derived this update during an earlier module on regularized logistic regression.
Now, this is how we do straight-up gradient updates with regularization. The question is, what do we do when we want stochastic gradient with regularization? If you remember, with stochastic gradient we just take the contribution from a single data point, and if we added up those contributions we'd get exactly the gradient. That's why it worked: the sum of the stochastic gradients equals the gradient. And so, to mimic that, we need to think about how to set up the regularization term such that the sum of the stochastic gradients is also equal to the full gradient. One way to think about this is that we take the regularization term and divide it by N. So in practice that's what you want to do: you want to say that the total derivative for stochastic gradient is the contribution of the data point minus 2 over N times lambda w_j. If you were to add this up over all N data points, you'd get back the full gradient. So with regularization, the algorithm stays the same, but the contribution of a data point is its gradient minus 2 over N lambda w_j; that's the contribution from the regularization term. If you are using mini-batches, you adapt that accordingly: you take the sum of the contributions from each data point in the batch, and then the regularization contribution becomes minus 2 B over N lambda w_j. So this is how we take care of regularization. Again, a very small change, and it's going to behave way better.
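Putting that together, a single regularized stochastic gradient step might look like this sketch, where the L2 penalty's contribution is scaled by 1/N so that the per-point updates sum back to the full regularized gradient; the function signature is ours, not the course's.

```python
import numpy as np

def regularized_sgd_update(w, x_i, y_i, eta, lam, N):
    """One regularized stochastic gradient ascent step (a sketch).

    The data contribution comes from a single point; the L2 penalty's
    contribution, -2 * lambda * w, is divided by N so that summing the N
    per-point updates recovers the full regularized gradient.
    """
    p_positive = 1.0 / (1.0 + np.exp(-(x_i @ w)))
    indicator = 1.0 if y_i == 1 else 0.0
    data_grad = (indicator - p_positive) * x_i   # contribution of point i
    reg_grad = -(2.0 / N) * lam * w              # 1/N share of the L2 penalty
    return w + eta * (data_grad + reg_grad)
```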
[MUSIC] Now you've seen stochastic gradient, which is really exciting: a simple algorithm, a simple modification to gradient, which really speeds things up in practice. It has many practical challenges, and we talked about several of those and how to address them. But now I would like to step back and think about a broader question, what's called online learning: how do we learn from streaming data? And we'll see that stochastic gradient is one way to learn from data that arrives over time, streaming data. Let's define the idea of online learning. But first, let's look at what we've been doing so far. What we've been doing so far in this course, and in the regression course, is what's called batch learning: I'm given the full dataset, I run some machine learning algorithm over this dataset, maybe gradient, making many passes over the data, and finally I output my best guess, my best estimate, for the coefficients, which we call w hat, and we're done. That's batch learning. Online learning is something different. Actually, what you are doing right now is online learning, but that's a different kind of online learning; what we're talking about here is online machine learning. In online machine learning, data arrives over time, one data point at a time. So, for example, as we'll see next, serving ads on web pages is a setting where things arrive one data point at a time. That's how the data comes in, and your machine learning algorithm sees a little chunk of that data, one little bit. Let's say, at timestamp one, it takes it in and makes an estimate of the coefficients, say w hat 1. At timestamp two, it sees another little bit of the data and makes another estimate of the coefficients, w hat 2. At timestamp three, it makes another estimate, w hat 3. At timestamp four, a little more data arrives and it makes an estimate w hat 4. So at every timestamp it makes a new estimate, so it can make new predictions.
To make the ideas concrete, let's look at a really practical, real-world example of where online learning makes a huge difference: ad targeting. So let's say you're navigating the web and you hit a particular website. What's happening behind the scenes when you're shown ads? Well, some information about you, like your age or the websites you've visited, and some information about the website, like its text, are fed into a machine learning algorithm, which is going to use some set of coefficients, w hat t, to figure out what the best ads to show you are. We'll call that y hat, the suggested ads; it might show you ad 1, ad 2, ad 3, and so on. And then you look at the website, you think, cool, that's a really interesting ad, and you go and click on ad two. Well, when you click on ad two, the machine learning algorithm figures out that you clicked on ad two and assigns that as the true label for this website: ad two, that's the one you clicked on. And then the machine learning algorithm takes that feedback and updates its coefficients from w hat t to w hat t plus one. What we've described so far is really how a lot of ad systems work in practice. So this is a simplification, but it's really something that makes a big difference in the real world. [MUSIC]
[MUSIC] And this is an example of an online learning problem. Data is arriving over time. You see an input x_t and you need to make a prediction, y hat t. So the input might be the text of the web page and information about you, and y hat t might be a prediction about what ads you might click on. And then you're given what happens in the real world: whether you click an ad, in which case y_t might be "ad two", or you don't click on anything, in which case y_t would be "none of the above", no ad was good for you. Whatever that is gets fed into a machine learning algorithm that improves its coefficients, so it can improve its performance over time. The question is, how do we design a machine learning algorithm that behaves like this? What's a good example of a machine learning algorithm that can improve its performance over time in an online fashion like this? It turns out that we've seen one: stochastic gradient. Stochastic gradient is a learning algorithm that can be used for online learning. So let's review it. You give me some initial set of coefficients, say everything equal to zero. At every timestep, you get some input x_t, and you can make a prediction y hat t based on your current estimate of the coefficients. Then you're given the true label, y_t, and you want to feed those into the algorithm. Well, stochastic gradient will take those inputs and use them to compute the gradient, and then just update the coefficients: w_j(t+1) is going to be w_j(t) plus eta times the gradient, which is computed from these observed quantities in the real world.
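As a rough sketch of that loop in Python: predict from the current coefficients, observe the true label (for example, which ad was clicked, encoded as +1/-1 for a single ad), and update immediately; the helper name and encoding are illustrative assumptions, not how any real ad system is implemented.

```python
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

def online_logistic_update(w, x_t, y_t, eta=0.1):
    """One online-learning step with stochastic gradient (a sketch):
    predict from the current coefficients, observe the true label y_t,
    then update the coefficients immediately."""
    y_hat = 1 if sigmoid(x_t @ w) >= 0.5 else -1         # prediction served now
    indicator = 1.0 if y_t == 1 else 0.0
    w = w + eta * (indicator - sigmoid(x_t @ w)) * x_t   # update from the feedback
    return w, y_hat
```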
So, online learning is a different kind of learning that we
haven't talked about at all in the specialization, but it's really important in practice. Data arrives over time and you need to make a decision right away about what to do with it; based on that decision, you get some feedback, and you update the parameters immediately and keep going. This online learning approach, where you update the parameters immediately as you see information from the real world, can be extremely useful. For example, your model is always up to date: it's always based on the latest data, the latest information in the world. It can have lower computational cost, because you can use techniques like stochastic gradient that don't have to look at all the data; in fact, you don't even have to store all the data if it's too massive. However, most people do store the data because they might want to use it later, so that's a side note, but you don't have to. However, it has some really difficult practical properties. The system that you have to build, the actual design of how the data interacts with the world, where the data gets stored, where the coefficients get stored and all that, is really complex and complicated, and it's hard to maintain. If you have oscillations in your machine learning algorithm, it can do really stupid things, and nobody wants their website to do stupid things. And you don't necessarily trust those noisy stochastic gradient updates; sometimes they can give you bad predictions. And so, in practice, most companies don't do exactly this. What they do is save their data for a little while and update their models with the data from the last hour, the last day, or the last week. That's very common. It's very common, for example, for a large retailer to change its recommender system every night, running a big job every night to do that. And you can think of that as an extreme version of the mini-batches that we talked about earlier in this module, but now the batch is the whole data from the whole day; for a big website, that might be those 5 billion page views. [MUSIC]