
Yousef ML Washin Classification

This document provides an overview of the content covered in Week 1 of a course on machine learning. It includes: 1) An introduction explaining that decision trees can overfit data similarly to logistic regression, and the concept of Occam's Razor will be used to find simpler trees. 2) An overview of Module 2 which covers linear classifiers and logistic regression, focusing on using logistic regression for classification and generating probabilities. 3) Details of the 18 videos, 2 readings and 2 quizzes included in Module 2.


WEEK 1
1 hour to complete
Welcome!
In Module 5, we're going to see that overfitting is not just a bad problem with logistic regression, it's also a bad problem with decision trees. Here, as you make those trees deeper and deeper, the decision boundaries can get very complicated and really overfit. So we're going to have to do something about it. What we're going to do is use a fundamental concept called Occam's Razor, where you try to find the simplest explanation for your data. And this concept goes back way before Occam, who was around in the 13th century; it goes back to Pythagoras and Aristotle, and those folks said the simplest explanation is often the best one. So we're going to take these really complex deep trees and find simple ones that give you better performance and are less prone to overfitting. [MUSIC]
2 hours to complete
Linear Classifiers & Logistic Regression
Linear classifiers are amongst the most practical classification methods. For example,
in our sentiment analysis case-study, a linear classifier associates a coefficient with
the counts of each word in the sentence. In this module, you will become proficient in
this type of representation. You will focus on a particularly useful type of linear
classifier called logistic regression, which, in addition to allowing you to predict a
class, provides a probability associated with the prediction. These probabilities are
extremely useful, since they provide a degree of confidence in the predictions. In this
module, you will also be able to construct features from categorical inputs, and to
tackle classification problems with more than two classes (multiclass problems). You will
examine the results of these techniques on a real-world product sentiment analysis
task.

18 videos (Total 78 min), 2 readings, 2 quizzes


And this is called a linear classifier because the output is the weighted sum of the inputs. So that's kind of what a linear classifier is; we'll see in a little bit more detail what that really means. So more generally, a simple linear classifier takes as input the coefficient associated with each word, and it computes a score for the input. If the score is greater than zero, we say that the output, the prediction y hat, is +1. And if the score is less than zero, we say that the prediction is -1. Now, what we need to do is train the weights of these linear classifiers from data. So given some input training data that includes sentences of reviews labeled with either plus one or minus one, positive or negative, we're going to split those into a training set and a validation set. Then we're going to feed that training set to some learning algorithm which is going to learn the weights associated with each word, so 1.0 for good, 1.7 for awesome and so on. And after we learn this classifier, we're going to go back and evaluate its accuracy on that validation set. So our goal for today is to explore that learning box: how do we learn this classifier from data, and understand a little more deeply what a linear classifier is really about, in particular in the context of logistic regression? [MUSIC]
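As a rough sketch of the scoring computation just described (the coefficient values, including the made-up weight for "awful", are purely illustrative, not learned from any data):

```python
# Minimal sketch of a linear classifier for sentiment; coefficient values are hypothetical.
coefficients = {"good": 1.0, "awesome": 1.7, "awful": -3.3}

def predict_sentiment(word_counts, coefficients, intercept=0.0):
    """Score = intercept + sum of coefficient * word count; predict +1 if score > 0, else -1."""
    score = intercept + sum(coefficients.get(word, 0.0) * count
                            for word, count in word_counts.items())
    return +1 if score > 0 else -1

print(predict_sentiment({"awesome": 2, "awful": 1}, coefficients))  # +1, since 2*1.7 - 3.3 = 0.1 > 0
```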
And the hyperplane is associated with the score function. The score function is a weighted combination of the coefficients multiplied by the features that we have: w0, plus w1 times the number of awesomes, which in our case was 5, plus w2 times the number of awfuls, which in our case is 3, and finally w3 times the number of greats, which in our case was 2. So for this data point over here, Score(xi) = w0 + 5w1 + 3w2 + 2w3, and depending on the coefficients, that score may be positive or negative.
Now suppose w0 is 1.0 instead of 0. What does that mean? It means our score function now has this extra term: 1.0 + 1.0 times #awesome - 1.5 times #awful. So what happens to that decision boundary? Well, the line gets shifted up. And so if you look at that point on the lower left, which is close to (0, 0) and which before we predicted as being a negative review, after we make that change its score becomes positive, so it gets predicted as a positive review.
So we can predict probabilities for every class. For logistic regression, the link function we'll use is called the logistic function, sometimes called the sigmoid.
It predicts the output: what's the probability of a positive sentiment given the input x and the parameters w. So if we take this point over here, it's close to the boundary but it's on the positive side, so it should output a probability that is greater than 0.5, but not that much greater than 0.5.
So now what happens is I need the number of #awesomes to be two more than the number of #awfuls for that prediction to be 0.5, that is, for the probability that y = +1 given xi and w to be 50/50. So in this case, #awfuls count a lot more negatively than the #awesomes, or at least there's that extra constant. Now, if I keep w0 at 0 but I increase the magnitude of the parameters, I get the curve on the right, which is similar to the curve in the middle: if the difference between the two is 0, I still predict 0.5. However, it grows much, much more steeply. So, in other words, if you just have one more #awesome than you have #awfuls, you're going to say that the probability of y = +1 given xi and w is almost 1. So the bigger you make the parameters in magnitude, the more sure you get, more quickly. And as you change the constant, you kind of shift that curve to the left or to the right. Now we can state our logistic regression learning problem. I have some training data, which has some features. We have this ML model that says the probability that a review is positive is given by the sigmoid of the score w transpose h(x), which is 1 / (1 + e^(-w transpose h(x))). We're going to learn a w hat that fits the data well. So next, we'll discuss the algorithmic foundations of how we fit the w hats from data. [MUSIC]
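A minimal sketch of that model, assuming the features h(x) and coefficients w are simple equal-length lists:

```python
import math

def sigmoid(score):
    """Logistic (sigmoid) function: maps any real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def predict_probability(features, w):
    """P(y = +1 | x, w) = sigmoid(w^T h(x))."""
    score = sum(wj * hj for wj, hj in zip(w, features))
    return sigmoid(score)

# A point whose score is exactly 0 sits on the decision boundary and gets probability 0.5.
print(predict_probability([1.0, 1.0, 1.0], [0.0, 1.0, -1.0]))  # 0.5
```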
As we said, we're going to split that into a training set and a validation set. And from the training set we're going to run a learning algorithm that will output the parameter estimates w hat. And those w hats are going to be plugged into the model to estimate the probability that an input sentence is either positive or negative. And, of course, we can use the learned model to take the validation set and estimate how good it is, what the quality metrics are, what the error is. Now, to find the best classifier we are going to define a quality metric. In this case the quality metric is going to be called the likelihood function. So for every possible set of parameters, or coefficients w, for example w0, w1, w2, we will be able to score it according to l(w) to figure out how good it is. So for example, if I take this data set of plusses and minuses and learn the line shown in green, we might get a particular likelihood. So, for example, if the parameter w0 is 0, w1 is 1 and w2 is -1.5, the likelihood might be 10 to the -6, pretty small. These numbers actually tend to be pretty small. For this alternative line, where w0 is now 1, w1 is still 1, w2 is -1.5, the likelihood function is a little better, 10 to the -5 instead of 10 to the -6. But perhaps for this best line over here, where w0 is 1, w1 is 0.5 and w2 is -1.5, you get the best likelihood, 10 to the -4. So we'd like an approach that would actually search over the possible values of w to find the best line. And, as we will see in the next module, we'll use a gradient ascent algorithm to find the set of parameters w that has the highest likelihood, the best quality. [MUSIC]
An encoding takes an input which is categorical, for example country of birth, and tries to encode it using numerical values that can naturally be multiplied by coefficients. So for example, for country of birth there might be 196 possible countries, or categories, that the value comes from. And so one way to encode this is using what's called 1-hot encoding, where you create one feature for every possible country. So for example there might be a feature for Argentina, a feature for Brazil, and so on, all the way to a feature for Zimbabwe. And so if somebody is born in Brazil, then the feature for Argentina has value 0, the feature for Brazil has value 1, and all the other features have value 0. So only one of these features has value 1 at a time, everything else is 0; that's why it's called 1-hot. The term comes from electrical engineering and means only one bit is on, or active. Similarly, if somebody is born in Zimbabwe, we're going to get 0, 0, 0, 0, and just a 1 in the feature h196, which corresponds to a Zimbabwe birth. So that's one kind of encoding. And implicitly in this module we've actually explored a different kind of encoding for text data, which we discussed in the first course: the Bag of Words encoding. So a review is defined by text, and text can have, say, 10,000 different words that it comes from, or many more, millions. And what Bag of Words does is take that text and encode it as counts. So, for example, I might associate h1 with the number of awesomes, h2 with the number of awfuls, and so on, all the way to, say, h10,000, which might be the number of sushis, the number of times the word sushi appears. And a particular data point might have 2 awesomes, 0 awfuls, 0 for a bunch of other words, and maybe 3 sushis. And so it becomes a really, really sparse 10,000-dimensional vector. In both of these cases, we've taken a categorical input and defined a set of features, one for each possible category or word, containing either a single on/off indicator or a count. And we can feed this directly into the logistic regression model that we've discussed so far. This type of encoding is really fundamental in practice, and you should really familiarize yourself with it.
[MUSIC]
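A small sketch of the two encodings, with a tiny made-up category list and vocabulary:

```python
# Hypothetical categories and vocabulary, just to illustrate the two encodings.
countries = ["Argentina", "Brazil", "Zimbabwe"]          # in practice ~196 categories

def one_hot(country, categories):
    """One feature per category; exactly one entry is 1 ('hot'), the rest are 0."""
    return [1 if country == c else 0 for c in categories]

def bag_of_words(text, vocabulary):
    """One feature per vocabulary word, holding the count of that word in the text."""
    words = text.lower().split()
    return [words.count(v) for v in vocabulary]

print(one_hot("Brazil", countries))                                           # [0, 1, 0]
print(bag_of_words("awesome awesome sushi", ["awesome", "awful", "sushi"]))   # [2, 0, 1]
```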
Recap
WEEK 2
2 hours to complete
Learning Linear Classifiers
Once familiar with linear classifiers and logistic regression, you can now dive in and
write your first learning algorithm for classification. In particular, you will use gradient
ascent to learn the coefficients of your classifier from data. You first will need to define
the quality metric for these tasks using an approach called maximum likelihood
estimation (MLE). You will also become familiar with a simple technique for selecting
the step size for gradient ascent. An optional, advanced part of this module will cover
the derivation of the gradient for logistic regression. You will implement your own
learning algorithm for logistic regression from scratch, and use it to learn a sentiment
analysis classifier.

18 videos (Total 83 min), 2 readings, 2 quizzes


So, in other words, if we take the negative examples and the positive examples, there might not be a w hat that achieves a probability of exactly zero for all the negatives and exactly one for all the positives. So the likelihood function, our quality metric, measures the quality kind of on average, throughout all the data points: with respect to the coefficients w, how likely we make the observed labels. Now, if I have the likelihood function, I can evaluate multiple lines, or multiple classifiers. So for example, for the green line here, the likelihood function may have a certain value, let's say 10 to the minus 6. For this other line, where instead of having w0 be 0, now w0 is 1, but the w1 and w2 coefficients are the same, the likelihood is slightly higher, 10 to the minus 5. But for the best line, which maybe sets w0 to be 1, w1 to be 0.5, and w2 to be -1.5, the likelihood is biggest, which in this case is 10 to the minus 4. Now, you see these numbers, they're kind of weird, 10 to the minus something. But this is what likelihoods come out to. They're going to be very, very small numbers, less than one. But the higher you get, the closer you get to one, the better. And so the question is, how do we find the best w's, the best classifiers? We'll find the ones that make this likelihood function, which we're going to talk about, as big as possible. So we're going to define this function l(w) and then we're going to use gradient ascent to find w hat. And you should have some fond memories, maybe some sad memories, from the regression course, where we talked about gradient descent and we explored the idea of using the gradient to find the best possible parameters to optimize the quality metric. And we're going to go through that in this case again. [MUSIC]
And finally, so that the line doesn't get really long, because if we have a million or two examples we would have a million of these entries, we use the product notation. So this is the little notation over here, and it just says I'm going to write the same function: l(w) is equal to the product, ranging over the data points i = 1 to N, where N is the number of data points, of the probability of whatever label yi has, +1 or -1, given the input xi, which is the sentence of that review, and the parameters w. In other words, l(w) is the product over i of P(yi | xi, w). This is the likelihood function that we're trying to optimize. So, our goal here is to pick w to make this function as large as possible. That's our goal. [MUSIC]
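A minimal sketch of that likelihood for a logistic regression model over toy data (the feature vectors and candidate coefficient values below are made up for illustration):

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def likelihood(data, w):
    """Product over data points of P(y_i | x_i, w); each data point is (features, label in {+1, -1})."""
    total = 1.0
    for features, label in data:
        p_positive = sigmoid(sum(wj * hj for wj, hj in zip(w, features)))
        total *= p_positive if label == +1 else (1.0 - p_positive)
    return total

# Toy data: feature vectors are [intercept, #awesome, #awful].
data = [([1, 2, 0], +1), ([1, 0, 3], -1), ([1, 1, 1], +1)]
for w in ([0.0, 1.0, -1.5], [1.0, 1.0, -1.5], [1.0, 0.5, -1.5]):
    print(w, likelihood(data, w))   # a larger likelihood means coefficients that fit this data better
```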
As a quick review, we have our likelihood function, and we want to find the parameter values that maximize it. So this is a function of w0, w1, w2, three parameters in this little example over here. We're trying to find the maximum over all possible parameters w0, w1, and w2, and there are infinitely many of those, so if you try to enumerate them it will be impossible to try them all. But gradient ascent is this magically simple but wonderful algorithm where you start from some point over here in the parameter space, which might be the weight for awful is 0 and the weight for awesome is -6, and you slowly climb up the hill in order to find the optimum, the top of the hill, which is going to be our famous w hat. That might say that the weight for awesome is probably going to be a positive number, so maybe somewhere around 0.5, and the weight for awful is maybe -1. Now in this plot I've only shown two of the coordinates, w1 and w2. I didn't show w0 because it's really hard to plot in four-dimensional space, so I'm just showing you three out of those four dimensions. Now, let's discuss the gradient ascent algorithm to go ahead and do that. [MUSIC]
So there's a difference between that and whatever my model predicts, that is, how much my model thus far predicts that xi is positive. In other words, this is the difference between the truth, whether this is a positive example, and the probability that my model assigns to it being a positive example. And we're going to weigh it by the feature value hj(xi). So this, for example, is the number of awesomes. So if a data point has many awesomes, then it has a bigger contribution to the derivative, because that coefficient is going to be multiplied by, say, 20 awesomes. If you have zero awesomes, then there's no contribution here, because whether the coefficient is high or low doesn't make any difference for that data point. So that's why we weigh it that way. So this is exactly the derivative, very simple, that we're going to implement. [MUSIC]
We are getting this training example totally wrong. In this case delta i is the difference between the indicator, which has value 1, and the probability, which here is approximately 0, so delta i is approximately 1. So delta i is really big. And what does this imply? It implies that this data point wants us to increase the coefficient wj: it says push the coefficient up, add this little delta that makes it more positive. And remember that if hj(xi) is a positive number and it's getting multiplied by a bigger coefficient, because we just made it a little bigger, that implies that the score of xi becomes larger, which in turn implies that the probability that y is equal to +1 given xi and w increases. And that is extremely intuitive. So, if we're getting the data point really wrong, we get this positive delta i, which is going to increase my parameter, which is going to increase my score, which makes the probability that this data point is positive higher in the next iteration. So we're going in the right direction. Now, let's go through this a little quicker: we can look at the case where yi is negative, and you just get the same thing with flipped sign. So, for example, if yi is negative but I'm getting the prediction right, so this is a negative data point, we'll see that delta i is approximately equal to 0, which implies don't change anything. That makes sense: I've got everything right, why would I change anything? However, in the case where the data point is negative but my prediction was that it's a positive data point, then delta i would be the difference between the indicator, which in this case is 0, and the probability, which is approximately 1. So delta i would be negative, approximately -1. That leads wj to decrease, which implies that the score of xi decreases, which leads the probability of y = +1 given xi and w to decrease. It's a negative data point and we got it wrong, so we decrease the coefficient to make the probability of it being positive smaller and the probability of it being negative larger. So cool. We've gone through a little bit of interpretation of how this gradient helps us push the coefficients bigger for positive training examples, and more negative, smaller, for negative training examples, which is exactly what we wanted for that score function.
[MUSIC] And now we have the complete gradient ascent algorithm for logistic regression. Extremely simple, extremely useful, and very cool. We're going to start from some point, say w0, and we're just going to follow the gradient until we get to the optimum, and we're going to stop when the norm of the gradient is sufficiently small with respect to a tolerance parameter epsilon, like we discussed in the regression case. And in every iteration, we go feature by feature, or coefficient by coefficient, and compute the partial derivative with respect to coefficient j. Then we just update the coefficient: wj(t+1) is wj(t) plus the step size times the derivative that we just computed. So this is notation for the derivative of l at w(t) with respect to the parameter wj. And at the core of the derivative, the only little computation that we have to do is this one over here, which is computing the probability that the data point has value plus one under the parameters w(t). So, what's that equal to? Well, this is just exactly the logistic regression model. So, I'm going to do a change of colors here and just do this in blue so it stands out. This last little bit is just 1 / (1 + e^(-w(t) transpose h(xi))). So I just compute that prediction, whatever my model thinks, I subtract it from the indicator that this is a positive example, multiply by the feature value, and there you go. Sum over data points. Simple algorithm, really useful, really, really useful. [MUSIC]
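A minimal sketch of the algorithm just described, not the course's own implementation; it assumes plain Python lists for the feature vectors, and the default step size and iteration budget are arbitrary:

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def logistic_gradient_ascent(features, labels, step_size=0.1, tolerance=1e-3, max_iter=1000):
    """Gradient ascent on the logistic regression log-likelihood.
    features: list of feature vectors h(x_i); labels: +1 / -1."""
    w = [0.0] * len(features[0])                         # start from all-zero coefficients
    for _ in range(max_iter):
        gradient = [0.0] * len(w)
        for h, y in zip(features, labels):
            p_positive = sigmoid(sum(wj * hj for wj, hj in zip(w, h)))
            error = (1 if y == +1 else 0) - p_positive   # indicator minus predicted probability
            for j, hj in enumerate(h):
                gradient[j] += hj * error                # weigh the error by the feature value
        w = [wj + step_size * gj for wj, gj in zip(w, gradient)]
        if math.sqrt(sum(g * g for g in gradient)) < tolerance:   # stop when the gradient is small
            break
    return w

w_hat = logistic_gradient_ascent([[1, 2, 0], [1, 0, 3], [1, 1, 1]], [+1, -1, +1])
```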
So now let's observe two curves. The top one is the one we just observed, where eta has value 10 to the -5, and the green curve is where eta is ten times smaller, 10 to the -6. And what we observe is two smooth curves, but the 10 to the -5 one goes up much quicker and gets to a higher likelihood value. So here again we're plotting l(w(t)). It gets to a higher likelihood value than the 10 to the minus 6 one, but that 10 to the minus 6 line is still going; it's just a really small step size, so it's climbing up the hill in really, really small steps, and it's going to take a long time to get to the top. Eventually it will, but it's going to take a long time. So if you were to try this plot, you'd see that 10 to the minus 5 is probably a better parameter than 10 to the minus 6. So that's a good thing: 10 to the minus 5 is looking pretty good, so maybe you want to try something a little bit larger, because if you try something larger you're taking bigger steps towards the top of the hill. So here I'm comparing what happens if I try eta, the step size, at 10 to the minus 5, as opposed to this other cyan curve, where eta is 1.5 times 10 to the minus 5. So here we're picking a parameter which is 50% larger. And this is really interesting behavior, behavior we've observed a lot in practice. In this early phase here you get some oscillation, so in the beginning it doesn't do so well. But once you start getting towards the top of the hill, you see that towards the end the larger parameter value tends to make more progress. So the cyan curve has smooth, faster progress at the end. And you see that it's starting to get even higher values of likelihood, so higher quality (remember that higher here is better), as we get more iterations. So it's kind of interesting: if the step size is bigger, but not too big, you might see some oscillations in the beginning, but it tends to smooth out and do well. So then you might say, okay, 1.5 times 10 to the minus 5 seems pretty good, let's try something even a bit larger. [MUSIC]
The first thing we observe is: why all the oscillations? This parameter is too big, things are going everywhere, it doesn't seem to be converging, and the likelihood value is much, much lower than for the other parameters. The other way to look at it is: we thought the gap between the red parameter, which we thought was too big, and the cyan one, which was better, was a big gap, but look at the gaps now; all those curves look really close together, because we had to make the scale much bigger to show this wildly oscillating one. So 10 to the minus 4 was too big. So here's what we've learned: 10 to the minus 6 is too small, 10 to the minus 4 is too big. Now we have a range between the too big and the too small, and we can search in that range. [MUSIC]
maximize the likelihood function, which is the product over our data points of the probability of yi given xi and w.
(VERY OPTIONAL) Expressing the log-likelihood (3 min)
(VERY OPTIONAL) Deriving probability y=-1 given x (2 min)
(VERY OPTIONAL) Rewriting the log likelihood into a simpler form (8 min)
(VERY OPTIONAL) Deriving gradient of log likelihood (8 min)
Recap of learning logistic regression classifiers (1 min)
2 hours to complete
Overfitting & Regularization in Logistic Regression
As we saw in the regression course, overfitting is perhaps the most significant
challenge you will face as you apply machine learning approaches in practice. This
challenge can be particularly significant for logistic regression, as you will discover in
this module, since we not only risk getting an overly complex decision boundary, but
your classifier can also become overly confident about the probabilities it predicts. In
this module, you will investigate overfitting in classification in significant detail, and
obtain broad practical insights from some interesting visualizations of the classifiers'
outputs. You will then add a regularization term to your optimization to mitigate
overfitting. You will investigate both L2 regularization to penalize large coefficient
values, and L1 regularization to obtain additional sparsity in the coefficients. Finally,
you will modify your gradient ascent algorithm to learn regularized logistic regression
classifiers. You will implement your own regularized logistic regression classifier from
scratch, and investigate the impact of the L2 penalty on real-world sentiment analysis
data.
Now the curve becomes steeper. And so if I look at the same point, where the difference between awesomes and awfuls is one, and I look at my predicted probability, I've now increased it tremendously. Now I see the probability of a positive review is about 0.88. I'm even more confident that the same exact review is positive. So that doesn't seem so good, an 88% chance that it's positive. But let's push the coefficients up more: let's say that the coefficient of awesome is plus six and the coefficient for awful is minus six. Now if I look at the same point, the same input, the same difference between awesomes and awfuls, and I push that up, I get this pretty scary result: it says that the probability of it being a positive review is 0.997. I can't trust that. Is it really the case that a review with two awesomes and one awful is positive with probability 0.997? That doesn't make sense. So as you can see, we have the same decision boundary, still crossing at 0; the coefficients are just getting a bit bigger every time. But my estimated probability for the review becomes steeper and steeper, more and more extreme. This is another type of overfitting that we observe in logistic regression: not only do the decision boundaries get weird and wiggly, but the estimated probabilities become close to zero and close to one. So let's look at our data set and see how we observe the same effect right there. [MUSIC]
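A tiny sketch of the effect being described: the same review and the same decision boundary, but the sigmoid probability gets pushed to an extreme as the coefficient magnitudes grow (the coefficient values mirror the ones in the example):

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

# Same review (2 awesomes, 1 awful); coefficients of +w for awesome and -w for awful.
n_awesome, n_awful = 2, 1
for w in (1.0, 2.0, 6.0):
    score = w * n_awesome - w * n_awful        # the decision boundary (score = 0) never moves
    print(w, round(sigmoid(score), 3))         # roughly 0.731, 0.881, 0.998: ever more confident
```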
And so uncertainty is something that's very important in classifiers, and by looking down here we have another interpretation of overfitting, another way that overfitting gets expressed in classification: by creating these really narrow uncertainty bands. And so we want to avoid that. We'll do everything we can to avoid it. [MUSIC]
And so it is going to push the coefficients to be infinite for linearly separable data, because it just can. So it's going to keep pushing them larger and larger and larger until, basically, they go to infinity. So that's a really bad overfitting problem that happens in logistic regression. So just as a summary of this optional section, we'll see that overfitting in logistic regression can be, what I call, twice as bad. We have the same kind of bad situation that we had looking at decision boundaries, and in regression, where we had this really complicated function that you learned: you get really complex decision boundaries that overfit the data and don't generalize well. But you also have a second effect: if the data is linearly separable, or if you have lots of features so that the data becomes linearly separable or close to it, then the coefficients can get really big, and eventually they can go to infinity. And so you get these massive coefficients and massive confidence about your answers. And so you will see these two kinds of effects of overfitting with logistic regression. [MUSIC]
[MUSIC] Now we've seen multiple ways that overfitting can be bad for classification, especially for logistic regression, and how very massive parameters can be a really bad thing. So, what we're going to do next is introduce a notion of regularization, just like we did in regression, to penalize these really large parameters in order to get a more reasonable outcome. So we're still talking about the same logistic regression model, where we take data, do some feature extraction to it, and fit this model, 1 / (1 + e^(-w transpose h(x))). But the quality metric for this machine learning algorithm is going to change to push us away from really large coefficients. So in particular, we're going to balance how well we fit the data against the magnitude of the coefficients, so as to avoid these massive coefficients. In the context of logistic regression, we're balancing two things to measure total quality: the measure of fit, which is the data likelihood, the thing where bigger is better, how well I fit the data; and the magnitude of the coefficients, where coefficients that are too big are problematic. So we have one thing that we want to be big, the likelihood, and another thing we want to be small, the magnitude of the coefficients, and we're going to optimize the measure of fit minus this measure of complexity. And so we want to balance between the two. So what do those mean? Let's make that more concrete in the context of logistic regression.

One is the sum of the squares, also called the L2 norm, or more precisely the square of the L2 norm. It's denoted by ||w||_2 squared and it's very simple: the square of the first coefficient, plus the square of the second coefficient, plus the square of the third coefficient, and so on, plus the square of the last coefficient, wD squared. That's if you use the L2 norm. We can also use the sum of the absolute values, also called the L1 norm, denoted by ||w||_1. And instead of the squares, it's the absolute value of w0, plus the absolute value of w1, plus the absolute value of w2, all the way to the absolute value of the last coefficient. Now, in the regression course we explored these notions quite a bit, but the main reason we take the square or the absolute value is that we want to make sure we penalize highly positive and highly negative coefficients in the same way: by squaring or taking the absolute value, each term's contribution is positive, and we want to make these norms as low as possible. So both of these approaches penalize large weights; actually, I should say penalize large coefficients. However, as we saw in the regression course, by using the L1 norm I'm also going to get what's called a sparse solution. Sparsity doesn't only play a role in regression; it also plays a role in classification. And in this module we're going to explore a little bit of both of these concepts, starting with the L2 norm, the sum of the squares. So now that we've reviewed these concepts, we can formalize the problem, the quality that we're trying to maximize. I want to maximize, over my choice of parameters w, the trade-off between two things: the likelihood of my data, actually the log of it, so the log of the data likelihood, and some notion of penalty for the magnitude of the coefficients, which will start with this L2 penalty notion. [MUSIC]
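A minimal sketch of the two penalties just described; the coefficient vector below is arbitrary:

```python
def l2_penalty(w):
    """Squared L2 norm: w_0^2 + w_1^2 + ... + w_D^2."""
    return sum(wj ** 2 for wj in w)

def l1_penalty(w):
    """L1 norm: |w_0| + |w_1| + ... + |w_D|."""
    return sum(abs(wj) for wj in w)

w = [1.0, 0.5, -1.5]
print(l2_penalty(w), l1_penalty(w))   # 3.5 and 3.0
# The regularized objective then trades the terms off: log_likelihood(w) - lambda * penalty(w).
```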
we would call lambda, or the tuning parameter, or the magic parameter, or the magic constant. And so, if you think about it, there are three regimes here for us to explore. When lambda is equal to zero, let's see what happens. So when lambda is equal to zero, this problem reduces to just maximizing over w the likelihood only, so only the likelihood term, which means that we get the standard maximum likelihood solution, an unpenalized MLE solution. So that's probably not a good idea, because I have these really bad overfitting problems, and setting lambda to zero does nothing to prevent the overfitting. Now, if I set lambda to be too large, for example, if I set it to infinity, what happens? Well, the optimization becomes the maximum over w of l(w) minus infinity times the norm of the parameters, which means the l(w) gets drowned out. All I care about is that infinity term, and so that pushes me to only care about penalizing the parameters, penalizing the coefficients, penalizing those large coefficients, which will lead to just setting all of the w's equal to zero. Everything will be zero. That's also not a good idea, because I'm not fitting the data at all; if I set all the parameters to zero, it's not doing anything good, it's ignoring the data. So the regime that we care about is somewhere in between: a lambda between zero and infinity, which balances the data fit against the magnitude of the coefficients. Very good. So we're going to try to find the lambda, between zero and infinity, that fits our data well. And this process, where we're trying to find a lambda and we're trying to fit the data with this L2 penalty, is called L2-regularized logistic regression. In the regression case we called this ridge regression; here it doesn't have a fancy name, it's just L2-regularized logistic regression. Now, you might ask at this point, how do I pick lambda? Well, if you took the regression course, you should know the answer already. Don't use your training data, because as lambda goes to zero you're going to fit the training data better, so you're not going to be able to pick lambda that way. And never, ever use your test data. So you either use a validation set, if you have lots of data, or use cross-validation for smaller data sets. In the regression course we covered picking the parameter lambda in the regression setting, and it's the same kind of idea here: use a validation set or use cross-validation, always. Lambda can be viewed as a parameter that helps us go between a high-variance model and a high-bias model, and to find a way to balance the bias and the variance in terms of the bias-variance tradeoff. So when lambda is very large, the w's go to zero, and we have large bias (we know we're not fitting the data very well) but low variance: no matter what your data set is, you get the same kind of parameters. In the extreme, when lambda is extremely large, you get all zeros no matter what data set you have. If lambda is very small, you get a very good fit to the training data, so you have low bias, but you can have very high variance: if the data changes a little bit, you get a completely different decision boundary. And so, in that sense, lambda controls the bias-variance tradeoff for this regularization setting in logistic regression, or in classification, just like it did in regular regression. [MUSIC]
As we increase the penalty lambda: in the beginning, when we have an unregularized problem, these coefficients tend to be large, but as we increase lambda they become smaller and smaller and go towards zero. I've used the product review data set here; I picked a few words and fit a logistic regression model using just those words, with different levels of regularization. So for example, the words that have positive coefficients tend to be associated with positive aspects of reviews, while the ones with negative coefficients tend to be associated with negative aspects of reviews. What is the word, in quotes, that has the most positive weight? Well, if you look at the key here, you'll see the word that has the most positive weight is actually the emoticon smiley face, while the word that has the most negative weight is another emoticon, the sad face. In the beginning all these words have pretty large coefficients, except the words near zero, which are words like "this" and "review", which are not associated with either positive things or negative things; although if the word "review" shows up it's slightly correlated with a negative review, in general these coefficients are much smaller than the others. And as I increase the regularization lambda, you see the coefficients become smaller and smaller, and if I were to keep drawing this, they would eventually go to zero. And now, if I were to use cross-validation to pick the best lambda, I'd get a result around here, and I'm going to call that lambda star. So that's what you do with cross-validation: find the point where the model fits the data pretty well but doesn't overfit it too much. And as a last point, I'm going to show you something that is pretty exciting; it's really beautiful about regularization with logistic regression. Regularization doesn't only address the crazy wiggly decision boundaries, it also addresses those over-confidence problems that we saw with overfitting. So I'm taking the same coefficients, the same thing that I've learned; as lambda increases, the range of coefficients decreases, they're getting smaller. But I'm plotting at the bottom here the actual decision boundaries that we learned and the notion of uncertainty on the data. So if lambda is equal to zero, we have these highly over-confident predictions. If lambda is one, not only do I get a more natural, kind of parabola-like decision boundary, even though I'm using degree-20 polynomial features, I also get a very natural uncertainty region. So the region where I don't know if it's positive or negative is really those points near the boundary, which are kind of between the clusters of positive points and the clusters of negative points. And you get this kind of beautiful, smooth transition. So by introducing regularization, we've now addressed those two fundamental problems where overfitting comes in in logistic regression. [MUSIC]
This is the thing we need to be able to walk in that hill-climbing direction. So the derivative of a sum is the sum of the derivatives: the total derivative is the derivative of the first term, the derivative of the log-likelihood, which, thankfully, we've seen in the previous module, minus lambda times the derivative of the quadratic term. And the derivative of the quadratic term we already covered in the regression course, but we're going to do a quick review here. As you can see, it's just a small change to your code from before: we just have to add this lambda times the derivative of the quadratic term. So to review, the derivative of the log-likelihood is the sum over my data points of the difference between the indicator of whether it's a positive example and the probability of it being positive, weighted by the value of the feature. We talked about that last module and interpreted it in quite a bit of detail, so I'm not going to go over it again; we're going to focus on the second part, the derivative of the L2 penalty. So in other words, what's the partial derivative, with respect to some parameter wj, of w0 squared plus w1 squared plus w2 squared, plus dot, dot, dot, plus wj squared, plus dot, dot, dot, plus wD squared? Now if you look at all of these terms, w0 squared, w1 squared, and so on, they don't play any role in the derivative. The only thing that plays a role is wj squared. And what's the derivative of wj squared? It's just 2wj. So that's all that's going to change in our code: it's just 2wj. So in fact, our total derivative is going to be the same derivative that we've implemented in the past, minus 2 lambda times wj; 2 times the regularization penalty, the parameter lambda, times the value of that coefficient. So let's interpret what this extra term does for us. What does the minus 2 lambda wj do to the derivative? If wj is positive, this minus 2 lambda wj is a negative term, a negative contribution to the derivative, which means it decreases wj, because you're going to add some negative quantity to it. It was positive and we're going to decrease it, so wj becomes closer to 0. So if the coefficient is positive, you add a negative number and it becomes less positive, closer to 0. And in fact, if lambda is bigger, that term becomes more negative and wj goes to 0 faster. And if wj is very positive, the decrement is also larger, so again it goes towards 0 even faster. Now if wj is negative, then -2 lambda wj is greater than 0, because lambda is also greater than 0. And what impact does that have? You're adding something positive, so you're increasing wj, which implies that wj again becomes closer to 0. It was negative, and I added a positive number to it, so it goes a little closer to 0. So this is extremely intuitive: regularization takes positive coefficients and decreases them a little bit, and takes negative coefficients and increases them a little bit. It tries to push coefficients to 0; that's the effect it has on the gradient, exactly what you'd expect. Finally, this is exactly the code that we described in the last module to learn the coefficients of a logistic regression model. You start with some w that is equal to 0, or some other randomly initialized or smartly initialized parameters. Then, in each iteration, you go coefficient by coefficient and compute the partial derivative, which is this long term here: the sum over data points of the feature value times the difference between the indicator that it's a positive data point and the predicted probability of it being positive; call it partial j. And you have the same update: wj(t+1) is wj(t) plus the step size times the partial derivative, just as before, which is the derivative of the likelihood function with respect to wj. And all you need to change in your code, the only little thing, is this term here, our only change. In other words, take all the code you had before, put -2 lambda wj into the computation of the derivative, and now you have a solver for L2-regularized logistic regression. And this is going to help you a tremendous amount in practice. [MUSIC]
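A hedged sketch of one such update step, with the only change from the unregularized version being the extra -2 * lambda * wj term in each partial derivative (feature vectors are plain lists, and the values in the usage line are arbitrary):

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def regularized_gradient_step(w, features, labels, step_size, l2_lambda):
    """One gradient-ascent step for L2-regularized logistic regression."""
    new_w = list(w)
    for j in range(len(w)):
        partial = 0.0
        for h, y in zip(features, labels):
            p_positive = sigmoid(sum(wk * hk for wk, hk in zip(w, h)))
            partial += h[j] * ((1 if y == +1 else 0) - p_positive)
        partial -= 2.0 * l2_lambda * w[j]          # the L2 penalty's contribution to the derivative
        new_w[j] = w[j] + step_size * partial
    return new_w

w = regularized_gradient_step([0.0, 0.0, 0.0], [[1, 2, 0], [1, 0, 3]], [+1, -1], 0.1, 1.0)
```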
And that can help us with both efficiency and interpretability of the models, as we saw in regression. So for example, let's say that we have a lot of data and a lot of features, so the number of w's that you have can be 100 billion, 100 billion possible coefficients. This can happen in practice in all sorts of settings. For example, many of the spam filters out there have hundreds of billions of parameters, or coefficients, that they learn from data. So this has a couple of problems. It can be expensive to make a prediction: you have to go through 100 billion values. However, if I have a sparse solution, where many of these w's are actually equal to zero, then when I'm trying to make a prediction, computing the sign of the sum of wj times the feature hj(xi), I only have to look at the nonzero coefficients wj; everything else can be ignored. So if I have 100 billion coefficients but only, say, 100,000 of those are nonzero, then it's going to be much faster to make a prediction. This makes a huge difference in practice. The other impact of sparsity, of having many coefficients be zero, is that it can help you interpret the nonzero coefficients. You can look at the small number of nonzero coefficients and try to make an interpretation of why a prediction gets made, and such interpretations can be useful in practice in many ways. So how do you learn a logistic regression classifier with a sparsity-inducing penalty? What you do is take the same log-likelihood function l(w), but add an extra L1 penalty, which is the sum of the absolute value of w0, the absolute value of w1, all the way to the absolute value of wD. So by just changing the sum of squares to the sum of absolute values, we get what's called L1-regularized logistic regression, which gives you sparse solutions. So that small change leads to sparse solutions. Just like we did with L2 regularization, here we also have a parameter lambda, which controls how much regularization, how much penalty, we introduce. And the objective becomes the log-likelihood of the data minus lambda times the sum of those absolute values, the L1 penalty. When lambda equals 0, we have no regularization, which leads us to the standard MLE solution, just like in the case of L2 regularization. When lambda is equal to infinity, we have only penalty, all the weight is on regularization, and that leads to w hat being all zeros, all-zero coefficients. Now, the case that we really care about is a lambda somewhere between 0 and infinity, which leads to what are called sparse solutions, where some of the wj hats, and hopefully many of them, are going to be exactly 0. So that's what we're going to aim for. So let's revisit those coefficient paths; here I'm showing you the coefficient paths for the L2 penalty. You see that when the lambda parameter is low you learn large coefficients, and when lambda gets larger you get smaller coefficients. So they go from large to small, but they're never exactly 0; the coefficients never become exactly 0. If you look, however, at the coefficient paths when the regularization is L1, things are much more interesting. So, for example, in the beginning the coefficient of the smiley face has a large positive value, but eventually it becomes exactly zero from here on. And similarly, the coefficient for the frowny face is a large negative value, but eventually, over here, the frowny face has a coefficient that becomes 0. So it goes from large all the way to exactly zero. And we see that for many of the other words. For example, in the beginning the coefficient of the word "hate" is pretty high, and that's a pretty important word, but around here "hate" becomes irrelevant. And as just a quick reminder, these are product reviews, and we're trying to figure out whether a review is positive or negative for the product. And we can look at which coefficient stays nonzero for the longest time. That is exactly this line over here, which never hits 0, and it's the coefficient of the word "disappointed". So, you might be disappointed to learn that the frowny face is not the one that lasts the longest. [LAUGH] In the beginning, the coefficient of "disappointed" is not as large, not as significant, as the frowny face, but it's the one that stays negative for the longest, probably because "disappointed" is prevalent in more reviews, and when you say "disappointed" you're really writing a negative review; that coefficient stays nonzero for a long time. So you see these transitions: the coefficients of small words like "review" go to zero early on, the smiley face lasts for a while and then becomes zero, the frowny face lasts longer and then becomes exactly zero, but for really large lambdas all of them are zero except for the coefficient of "disappointed". [MUSIC]
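Going back to the efficiency point made earlier in this passage: with a sparse solution you only store and touch the nonzero coefficients when scoring. A small sketch of that idea (the coefficient values are invented):

```python
# Hypothetical nonzero coefficients of an L1-regularized sentiment model.
sparse_w = {"awesome": 1.2, "disappointed": -2.1, ":-)": 1.5}

def sparse_score(word_counts, sparse_w, intercept=0.0):
    """Sum only over the nonzero coefficients; every other feature contributes nothing."""
    return intercept + sum(coef * word_counts.get(word, 0)
                           for word, coef in sparse_w.items())

print(sparse_score({"awesome": 2, "sushi": 3}, sparse_w))   # 2.4: 'sushi' has a zero coefficient
```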
WEEK 3
2 hours to complete
Decision Trees
Along with linear classifiers, decision trees are amongst the most widely used
classification techniques in the real world. This method is extremely intuitive, simple to
implement and provides interpretable predictions. In this module, you will become
familiar with the core decision tree representation. You will then design a simple,
recursive greedy algorithm to learn decision trees from data. Finally, you will extend this
approach to deal with continuous inputs, a fundamental requirement for practical
problems. In this module, you will investigate a brand new case-study in the financial
sector: predicting the risk associated with a bank loan. You will implement your own
decision tree learning algorithm on real loan data.

13 videos (Total 47 min), 3 readings, 3 quizzes


I go back to step two, but only look at the subset of the data that has credit fair, and then only look at the subset of the data that has credit poor. Now, if you look at this algorithm so far, it sounds a little abstract, but there are a few points that we need to make more concrete. We have to decide how to pick the feature to split on; we split on credit in our example, but we could have split on something else, like the term of the loan or income. And then, since we have recursion here at the end, we have to figure out when to stop recursing, when not to go and expand another node in the tree. So, we'll discuss these two important tasks in the rest of this module. [MUSIC]
 But before we split further, we're going to discuss why we picked credit to do the first split as
opposed to say, for example, the term of the loan or income. [MUSIC]
We've gone down now from 0.45 to 0.2, so splitting on credit seems like a pretty good idea. Now let's see what happens when we split on the term of the loan. If the term is three years, maybe there are 16 safe loans and 4 risky ones, so we predict safe and make four mistakes. And for five years we predict risky where there were six safe loans, so there we're making six mistakes. So if you look at the overall error here, it's (4 + 6) / 40; that's 10 divided by 40, which is 0.25. So overall, if we look at our data: not splitting on anything, the root node, has 0.45 error, splitting on credit has 0.2 error, and splitting on term has 0.25 error. So we can go back and ask, what is the best choice? Should we split on credit, or should we split on term? The answer now becomes obvious: splitting on credit gives you lower classification error, so this is what our greedy algorithm will do first. This is the first feature to split on, and that would be the winner of our selection process. So in general, the decision tree splitting process says: given the subset of data at node M, which so far is the root node, try out every feature xi, which in our case was credit, term, and income; split the data according to the possible values of each one of these features; and compute the classification error of the resulting split, just like we did manually over here. Then pick the feature, which in our case was credit, with the lowest classification error. So, if we go back to our decision tree learning algorithm, that first challenge we had, figuring out what feature to split on, we can now handle using this feature split selection algorithm that minimizes classification error. Next, we'll explore the other parts of this decision tree learning algorithm. [MUSIC]
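A minimal sketch of this feature split selection rule, assuming categorical features stored in dicts and a majority-class prediction inside each branch (the toy loans below are made up):

```python
from collections import Counter

def node_error(labels):
    """Classification error if we predict the majority class for these labels."""
    majority_count = Counter(labels).most_common(1)[0][1]
    return (len(labels) - majority_count) / len(labels)

def split_error(data, feature):
    """Overall classification error after splitting on one feature; data points are (features, label)."""
    groups = {}
    for x, y in data:
        groups.setdefault(x[feature], []).append(y)
    return sum(len(ys) / len(data) * node_error(ys) for ys in groups.values())

def best_split(data, features):
    """Greedy choice: the feature whose split gives the lowest classification error."""
    return min(features, key=lambda f: split_error(data, f))

loans = [({"credit": "excellent", "term": "3 yrs"}, "safe"),
         ({"credit": "excellent", "term": "5 yrs"}, "safe"),
         ({"credit": "fair", "term": "3 yrs"}, "risky"),
         ({"credit": "poor", "term": "5 yrs"}, "risky")]
print(best_split(loans, ["credit", "term"]))   # 'credit' separates this toy data perfectly
```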
For the nodes I've selected here, including that first node where credit was excellent, every single node is associated with data points of just one category, one class, the same output. So, for excellent credit everything was safe, and for the case where the credit was fair but the term was 3 years, everything was risky. So, as we can see, for those there's no point in continuing to split. So the first stopping condition is: stop splitting when all the data agrees on the value of y; there's nothing to do there. And there's a second criterion, which only happened over here, where we stopped splitting and we still had some data points with safe and risky loans inside that node. However, we had used up all of the features in our dataset. We only had three features here, credit, income, and term, and on that branch of the decision tree we had used all of them up. There's nothing left to split on; we'd get the same thing if we kept splitting forever. And so the two stopping criteria are actually very simple: stop if every data point agrees, or stop if you run out of features. So if we go back to our greedy algorithm for learning decision trees, we see that in step two you just pick the feature that minimizes the classification error, as we discussed. Then we have these two stopping conditions that we just described, two extremely simple ones, and then we just recurse and keep going until those stopping conditions are reached: either we use up all the features, or all the data points agree on the value of y. [MUSIC]
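Putting the greedy split selection and the two stopping conditions together, here is a compact, self-contained sketch of the recursive learning algorithm, under the same assumptions as the previous sketch (categorical features in dicts, majority-class leaves):

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def split_error(data, feature):
    groups = {}
    for x, y in data:
        groups.setdefault(x[feature], []).append(y)
    mistakes = sum(len(ys) - Counter(ys).most_common(1)[0][1] for ys in groups.values())
    return mistakes / len(data)

def build_tree(data, features):
    labels = [y for _, y in data]
    # Stopping conditions: all labels agree, or no features are left to split on.
    if len(set(labels)) == 1 or not features:
        return {"leaf": True, "prediction": majority(labels)}
    feature = min(features, key=lambda f: split_error(data, f))   # greedy: lowest classification error
    children = {}
    for value in {x[feature] for x, _ in data}:
        subset = [(x, y) for x, y in data if x[feature] == value]
        children[value] = build_tree(subset, [f for f in features if f != feature])
    return {"leaf": False, "feature": feature, "children": children}
```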
[MUSIC] We've now learned decision trees from data. Let's look at how we can make predictions with them. So given the model that we've learned from the data, we're going to predict y hat. As we discussed at the beginning of the module, it's a pretty simple process of traversing down the tree in order to make that prediction, given a particular input xi. So going back to that first example, where credit was poor, income was high, and the term was 5 years, we go down the branch of the decision tree that is associated with those particular inputs, and we make the prediction that y hat i is safe. To be explicit, even that prediction algorithm is what's called a recursive algorithm. To make a prediction with a particular decision tree, I start with the top node and my input. Then I say: if the current node is a leaf, return the prediction stored at that leaf. If it's not a leaf, pick the child that agrees with the input and recurse on that child, returning the prediction the child makes. And that process unrolls the path down the tree in order to make the prediction. [MUSIC]
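A minimal sketch of that recursive prediction procedure, using a hypothetical dict representation of tree nodes (the toy tree below is invented for illustration):

```python
def classify(node, x):
    """At a leaf, return its stored prediction; otherwise recurse into the child
    matching this data point's value for the node's splitting feature."""
    if node["leaf"]:
        return node["prediction"]
    return classify(node["children"][x[node["feature"]]], x)

tree = {"leaf": False, "feature": "credit", "children": {
    "excellent": {"leaf": True, "prediction": "safe"},
    "fair": {"leaf": True, "prediction": "risky"},
    "poor": {"leaf": False, "feature": "income", "children": {
        "high": {"leaf": True, "prediction": "safe"},
        "low": {"leaf": True, "prediction": "risky"}}}}}

print(classify(tree, {"credit": "poor", "income": "high", "term": "5 years"}))   # safe
```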
And this can be really, really bad. When you have very few data points in an intermediate node of the decision tree, you're very prone to overfitting, very prone to making predictions you cannot trust. So, for example, if you look here, you'd predict that if your income is $30,000, this is definitely a risky loan; but if your income is $31,400, it's definitely a safe loan; however, if your income is $39,500, you're back to risky. So [LAUGH] it's risky, safe, risky, which doesn't make any sense. Do you trust it? I wouldn't. And so the question is, how do we deal with these real-valued features? A very natural alternative is to work with threshold splits. These simply pick a threshold on the value of that continuous-valued feature, let's say $60,000. On the left side of that split we put all the data points with income lower than $60,000, and on the right we put all the data points with incomes higher than or equal to $60,000. And as we can see, we have a subset of the data here, income higher than $60,000, and there are many data points there, so there's a lot less risk of overfitting. We see that 14 of them are safe loans, so we'd probably predict safe there, while 13 are risky below the $60,000 threshold, so maybe we'd predict those as risky. So this is a very natural kind of split that we might want to do with continuous-valued data.
Let's now take a moment to visualize what happens when we do this kind of threshold split. So for
example, I've laid out my income data on this line here that ranges from $10,000 to $120,000, and if we
pick a threshold split of 60,000 and we say everything on the left of the split has income less that
$60,000 we're going to predict to be risky loans. Everything to the right has income higher than
$60,000 we're going to predict those as being safe loans. Now let's suppose that we have two continuous-valued features. We have income on the y axis and we have age on the x axis, and
let's see what happens here. And you'll see there are some positive and negative examples laid
out in 2D. Another thing that's interesting is that you see that older people with higher incomes
tend to be safe loans, but also younger people that may have lower incomes, those might also be
safe loans because those people may make money over time, let's say. So we might look at this
data and decide to split on age first. And if we split on age, let's say age equals 38, we'll see that for the folks that are younger than 38, on average, more of them have risky loans, so you might
predict risky. But for the folks that have age greater than 38, we have more safe loans than risky.
So we might predict safe. Now to the next split in our decision tree. We might choose to split for
the folks that have age greater than 38 we might split on the income and ask whether this income
greater than $60,000 or not. And if it is, we put a split there. And we'll see that the points with income below $60,000, even at the higher ages, might be negative, so they might be predicted negative. So
let's take a moment to visualize the decision tree we've learned so far. So we start from the root
node over here and we made our first split. And for our first split, we decide to split on age. And
the two possibilities we looked at were, is the age smaller than 38 or is the age greater than or
equal to 38. So that was our first threshold split. And for those with age smaller than 38, let's say
that we stopped right here, we'd see that there's five risky and three safe. So we'd predict risky.
So that might be our leaf here. And for age greater than 38 we took another split, which was on
income. And we just ask ourselves: is the income less than 60,000 or is it greater than or equal
to 60,000? Now for the ones that have income greater than or equal to 60,000 in age greater than
38 we predicted those were safe loans. While the ones that had age greater than 38 and income
less than $60,000, we predicted those to be risky loans. And this is an example for the tree where
we're making these binary splits on the data for the continuous variables. [MUSIC]
So let's say that I'm considering splitting, on the feature here, hj, which might be in our case,
income, and, what I can do is go through all my data, so the column of values of the income might
take, and sort them. Such that V1 is the lowest income, V2 is the next lowest and VN is the
highest income. And all I need to consider is the splitting points right in between V1 and V2, V2
and V3 and so on. So I walk from i = 1 though N-1 and then consider splitting point ti, which is the
midpoint between Vi and Vi+1, and I ask what is the classification error if I were to build a decision
tree, a decision stump in this case, that splits xj on ti, on greater than ti and less than ti? So
greater than 60,000 and lower than 60,000. And then we'll pick t star to be the split that leads to the decision stump with the lowest classification error. And that's it. Pretty simple algorithm, pretty
easy to take from here. [MUSIC]
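A small sketch of that threshold search (function and variable names are made up; the classification error of each candidate stump is computed the same way as for categorical splits):

```python
def best_threshold_split(values, labels):
    # Sort the observed values v1 <= v2 <= ... <= vN and consider only the
    # midpoints between consecutive distinct values as candidate thresholds.
    pairs = sorted(zip(values, labels))
    best_t, best_error = None, float('inf')
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # no split point between equal values
        t = (pairs[i][0] + pairs[i + 1][0]) / 2.0
        left  = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        # Classification error of the stump: mistakes of the majority prediction on each side.
        mistakes = (min(left.count(+1), left.count(-1)) +
                    min(right.count(+1), right.count(-1)))
        error = mistakes / float(len(pairs))
        if error < best_error:
            best_t, best_error = t, error
    return best_t, best_error

# Hypothetical incomes and safe/risky labels.
incomes = [30000, 31400, 39500, 60000, 75000, 120000]
labels  = [-1, -1, -1, +1, +1, +1]
print(best_threshold_split(incomes, labels))  # best split around $49,750 with error 0
```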
And we can imagine continuing this process, splitting again and again and again as we'll see next, but one important thing to know, and this is a really important point, is that when you have continuous variables, unlike discrete or categorical ones, we can split on the same variable multiple times. So we can split on a variable, then later split on that same variable again at a different threshold, and we'll see that. So in this example that we talked about, we can keep the decision tree learning process growing. So at Depth 1 we just get a decision stump, which corresponds to the vertical line that we drew in the beginning, at -0.07. If we go to Depth 2, we get this little box that contains most of the positive examples, but there are still some misclassifications over here, and then if you kept going, splitting, splitting, splitting,
splitting, splitting, splitting all the way to Depth 10, we get this really crazy decision boundary. And
if you look at it carefully, what you'll see is that, and I'm going to draw over the decision boundary
here, but if you look at it a little bit carefully, you'll see that it basically makes, no mistakes, so it
has 0 training error. And we can compare what we saw with Logistic Regression with what we're
seeing with Decision Trees, and understand again, in preview for what we will see next module,
kind of the notion of overfitting. So, in Logistic Regression we started with degree 1 polynomial features, and we saw that degree 2 had a really nice fit of the data, a nice parabola. It didn't get everything right, but it did pretty well. And the degree 6 polynomial had a really crazy decision boundary. It got zero training error, but we didn't really trust those predictions. With the decision tree, what you control is the depth of the tree, and so Depth 1 was just a decision stump. It didn't do so well. If you go to Depth 3, it looks like a little bit of a jagged line, but it looks like a pretty nice decision boundary. It makes a few mistakes, but it looks pretty good. If you look at Depth 10, you get this crazy decision boundary; it has zero training error, but it is likely to be overfitting. [MUSIC]
[MUSIC] And we have now seen how to build a decision tree from data. It's pretty cool, simple,
really recursive algorithm, main choice to be made is what feature to split on and when to stop
splitting. In the next module we're going to talk about ways to address overfitting in decision trees.
But you should be ready now to understand how to build a decision tree from data, really
implement that algorithm, and be able to make predictions from the decision trees that you learn.
As well as to explore decision boundaries of decision trees and how they relate to decision
boundaries of say logistic regression. And let me close by thanking my colleague Krishna who
was instrumental in making the slides that you saw in this module. I really appreciate all the effort
that Krishna put into this. [MUSIC]
WEEK
4
2 hours to complete
Preventing Overfitting in Decision Trees
Out of all machine learning techniques, decision trees are amongst the most prone to
overfitting. No practical implementation is possible without including approaches that
mitigate this challenge. In this module, through various visualizations and
investigations, you will investigate why decision trees suffer from significant overfitting
problems. Using the principle of Occam's razor, you will mitigate overfitting by
learning simpler trees. At first, you will design algorithms that stop the learning
process before the decision trees become overly complex. In an optional segment,
you will design a very practical approach that learns an overly-complex tree, and then
simplifies it with pruning. Your implementation will investigate the effect of these
techniques on mitigating overfitting on our real-world loan data set.
 Why was that the first feature we chose? And the reason we chose to split on credit first is
because it improved the training error the most. Improved the training error from 0.45 to 0.20. So
that was a good first split to make. Now, if we go back and review the algorithm for choosing what
the best decision split is, the best feature to split on is, we'd try every single possible feature and
pick the one that decreases the training error the most. And so at every step of the way, what are
the features that decrease the training error? And we're adding features that decrease the training
error. And we're adding features and decreasing the error. And eventually we'll drive the training
error to zero. Unless of course we get to some points where we can't drive the training error down
because we've run out of features to split on and we have positive points on top of negative
points, but that's a side note. Most importantly is to remember the training error test go down,
down down, down. And so that going down as we increase the depth is what leads to this low
training error, which often leads to these very complex trees, which are very prone to overfitting.
And here's a real world dataset from loan data. Where we actually observed that big bad
overfitting problem. So if we take the depth of the tree, and we push it all the way to depth 18,
we'll see that the training error has gone down a lot, the blue line, which gets you down to about eight percent training error, which is extremely low. However, if you look at the validation set error, man, that was not so good. Maybe that's around 39%, and so there's a big gap between the two, which we'll characterize as a form of overfitting. If somehow you were able to pick the best depth for the decision tree, which in this case was depth seven, then you'd notice that the validation error is just under 35%. In other words, if I could pick the right depth I would get just under 35% validation error. But if I let it go until very low training error, I get a validation error of 39%. So going all the way, bad idea. Gotta stop a little earlier. [MUSIC]
The error will be 5 out of 21 which is 0.24. Now if I were to try to split on the other features, term
and income, suppose that neither of them helped. Suppose that neither of them decreased the
training error. So the training error would still be 0.24 for both of them, what should I do? Now I could try to split on one and then keep going, but that would make the tree more and more complex. So a second early stopping condition we might use is to say stop there. If the tree error is not decreasing, don't make the tree more complex. Use Occam's Razor and stop the process right there and just say everything where credit equals poor is just a risky loan. And then just focus on what to do with the node where credit is fair. So the second type of early stopping criterion we're exploring is: keep
recursing, but stop if the tree error does not decrease, don't make the tree more complex. Now
there's a few practical caveats about this. First, typically you don't just stop when the error doesn't decrease at all. You put in a magic parameter: you say stop if the error doesn't decrease by more than epsilon. Let's say if it doesn't decrease by more than 1%, we just stop. Then there's a few
pitfalls in practice for this rule and we talk about those a little bit in the context of pruning. So this
is a rule that you should be super careful about, but it's very useful in practice. So something to
definitely keep in mind. The third type of early stopping rule is extremely important, and you
should always include it. And it just says, if there is only a few data points left in a node, don't split
any further, because then you're really risking overfitting. So an example would be looking at, we
had that case where credit was equal to fair. Well, there is only one safe loan there and two risky loans. So a total of only three data points fall inside of that node. In that case, if we try to split further, we are going to find lots of patterns in the data that don't make much sense, because there is very little data to support those patterns. With three data points, I don't know if I really trust a decision tree algorithm to make tons of decisions. So I'll just stop recursing right there. In general, we should always set a parameter, say Nmin, which says if a node has Nmin data points or fewer, then just stop recursing right there, and call that early stopping condition
three. This one you should always implement no matter what. For example, here I'm suggesting
setting Nmin to 10. 10 is quite a nice number for Nmin if you have a relatively small dataset size. If
you have a huge dataset size you might set it to 100. But somewhere between 10 and 100 is a
good value for Nmin. So let's summarize the early stopping conditions that we described in this
approach to try to find decision trees that are not too complex. There are three. The first one is
limiting the depth of the tree. That's a relatively good condition to set. You should set that one to
be something you choose pretty carefully using cross validation. The second one is to stop recursing if the classification error does not decrease by enough, either doesn't decrease at all or doesn't decrease by more than epsilon. It is a little risky but can be very useful. The third one, extremely useful, we should always do it: stop recursing if there are not that many data points left inside of a node. So these are some of the key early stopping conditions a decision tree algorithm should be thinking about. If we go back to the greedy learning algorithm for decision trees that we discussed in the
previous module, we start with the empty tree. We selected a feature to split on, then we said stop
if there was nothing left to do and just make the prediction. This is the majority class. And
otherwise we want to recurse and go back to step two. Now, we considered two stopping
conditions in the previous module. And to those we are adding the three early stopping conditions
that we just described. And so small modifications to the code we've already built gives you this
extra factor to help prevent some overfitting. [MUSIC]
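A rough sketch of how those three early stopping conditions slot into the recursive learning loop (the tree-dictionary format and default parameter values are assumptions for illustration, not the graded implementation):

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def classification_error(labels):
    # Error of predicting the majority class for this set of labels.
    return 1.0 - Counter(labels).most_common(1)[0][1] / float(len(labels))

def best_splitting_feature(data, features, target):
    # Feature whose split gives the lowest classification error (as sketched earlier).
    def split_error(feature):
        groups = {}
        for row in data:
            groups.setdefault(row[feature], []).append(row[target])
        mistakes = sum(len(g) - Counter(g).most_common(1)[0][1] for g in groups.values())
        return mistakes / float(len(data))
    return min(((f, split_error(f)) for f in features), key=lambda pair: pair[1])

def build_tree(data, features, target, depth=0,
               max_depth=10, min_error_reduction=0.01, min_node_size=10):
    labels = [row[target] for row in data]
    leaf = {'is_leaf': True, 'prediction': majority(labels)}
    if len(set(labels)) == 1 or not features:   # original stopping conditions
        return leaf
    if depth >= max_depth:                      # early stopping 1: depth limit
        return leaf
    if len(data) <= min_node_size:              # early stopping 3: Nmin data points or fewer
        return leaf
    feature, split_err = best_splitting_feature(data, features, target)
    if classification_error(labels) - split_err < min_error_reduction:
        return leaf                             # early stopping 2: not enough improvement
    children = {}
    remaining = [f for f in features if f != feature]
    for value in set(row[feature] for row in data):
        subset = [row for row in data if row[feature] == value]
        children[value] = build_tree(subset, remaining, target, depth + 1,
                                     max_depth, min_error_reduction, min_node_size)
    return {'is_leaf': False, 'splitting_feature': feature, 'children': children}
```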
And so, if I look at it, all in all, neither splitting on x1 nor on x2 helps me reduce the error. Both splits give you the same error as the root. So the question is, should I stop now? If neither feature improves the error and you're using early stopping condition two, then you just say no, don't do it, don't take either of the splits. You just stop splitting the decision tree at the very root, output the majority class, say true, make two mistakes, and say that the error that you get in the end is 0.5. So, [LAUGH] this is why xor is a counterexample to everything. What would happen if you didn't take early stopping condition two and you just kept going? If you just kept going, you'd get the tree shown here. And if you see, for each one of these leaves, there are zero mistakes. So the leaf on the left outputs false, or minus one, and makes zero mistakes, the next one makes zero mistakes, the third one makes zero mistakes, and the last one makes zero mistakes. So, if you kept going, if you didn't use early stopping condition two, you would get a tree with an error of zero. And so, in this case, there is a huge gap, a huge gap, between a training error of 0.5, which is as good as random, and a training error of zero, which is perfect training error. So in this case, early stopping condition two created this arbitrarily large gap between the best you could have done and what you end up outputting. So we saw that early stopping condition two, stop recursing if the training error does not decrease enough, seems like a reasonable heuristic, but it has some dangerous pitfalls, because it's too short-sighted: it doesn't explore branches far enough. So it's generally a good principle, but if you apply it top down you might get yourself into trouble. So what can you do about this? This is where pruning comes in: it uses something like this criterion, but applies it bottom up. [MUSIC]
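To see the pitfall concretely, here is a tiny sketch with the four xor points (a made-up minimal example):

```python
# y = x1 XOR x2: no single-feature split reduces the error below the root's.
points = [((0, 0), -1), ((0, 1), +1), ((1, 0), +1), ((1, 1), -1)]

def stump_error(feature_index):
    # Split on one feature, predict the majority in each branch, count mistakes.
    mistakes = 0
    for value in (0, 1):
        branch = [y for x, y in points if x[feature_index] == value]
        mistakes += min(branch.count(+1), branch.count(-1))
    return mistakes / float(len(points))

print(stump_error(0), stump_error(1))  # 0.5 and 0.5: no better than the root,
                                       # so early stopping condition 2 stops here,
                                       # even though the depth-2 tree has zero error.
```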
The two concepts we want to trade off to the two quantities are how well my tree fits the data and
what the complexity of the tree is. So the total cost is going to be a measure of fit and a measure
of complexity. We saw this in regression. We saw this in logistic regression. We're seeing this
again in decision tree. Here, the measure of fit is just classification error in the training data. So
training error and the measure of complexity is going to be number of leaves, and we're going to
balance the two. So more specifically, we're going to use Error(T) to denote the classification error on the training data, and L(T) to denote the number of leaves. And as we will see, we'll have a magic parameter to trade off between those two quantities. In particular, we're going to use lambda to denote that magic parameter, which we can fit with a validation set or with cross validation.
When lambda equals 0, we're back to the standard decision tree learning process. When lambda
is infinite, we have infinite penalty. So we're going to have infinite penalty, which means we're
going to just learn a tree that has no decisions in it, so just the root. So, we're going to learn
something like this. Root and then a class. And in this case, we're just going to say that y hat for
all data points is the majority class. In other words, when lambda is equal to infinity, we're just going to output the majority class; when lambda is equal to 0, we get a crazy deep decision tree. When lambda is in between, that's the interesting case. So, this is going to balance the fit and the complexity of
the tree. More specifically, we're going to go from that complex tree to a simple tree by using a
pruning approach where the pruning is based on this function, which balances the Error(T) plus
lambda times the number of leaves in T. [MUSIC]
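In code, the quantity being traded off is just a one-liner (the example numbers below are made up):

```python
def total_cost(training_error, num_leaves, lam):
    # Total cost(T) = Error(T) + lambda * L(T): measure of fit plus
    # lambda times the measure of complexity (number of leaves).
    return training_error + lam * num_leaves

print(total_cost(0.25, 6, 0.0))    # lambda = 0: only fit matters
print(total_cost(0.25, 6, 0.03))   # in between: fit and complexity are balanced
print(total_cost(0.25, 6, 1e9))    # huge lambda: any extra leaf is prohibitively expensive
```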
Now let's see what happens if we were to do the pruning. So if we were to do the pruning, let's say that the smaller tree, where this last split has been pruned, has slightly higher training error. So the training error went up to 0.26 and you're like, should I really do it? Well, let's look at the number of leaves. It's one, two, three, four, five leaves, so we have five leaves instead of six, one less leaf. The training error went up a little bit, but if you do five times lambda, which here is 0.03, you get 0.15. You add that to 0.26, you get a grand total of 0.41, and you say this new tree Tsmaller is looking promising. It has slightly worse training error but lower overall cost. And so since it has lower overall cost, we're going to end up pruning, good idea. So that's the kind of simplification that looks okay in a simple example; when you have a massive tree on tons of data, it's hard to visualize how these prunings happen. But they must happen, otherwise you're going to have tremendous amounts of overfitting in decision trees. We're not just going to do the pruning procedure for the
very bottom of the tree, we're going to keep going up the tree and revisit every decision that we've
made in the decision tree and say is it worth pruning, is this worth pruning it? Is it worth pruning
income after credit? Is it worth pruning term after credit? Is it worth pruning credit itself and just
have the root node? So we're going to check all those out and then find the best tree after this
pruning procedure and output that as our solution. For completeness, I've included here the full
algorithm for building up a decision tree using pruning. It's a little bit more contrived, I'll leave you
this reference for those who want to implement it themselves, but it's relatively simple to
implement and not that different than what you might have done in the regular case. This kind of
idea is fundamental though and it's used in every decision tree learning approach out there with
small caveats and changes in the parameters, but overall, same idea. [MUSIC]
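A rough sketch of that bottom-up pruning check at a single node (this reuses the tree-dictionary layout from the earlier sketches; the recursive structure and helper names are assumptions for illustration, not the full reference algorithm):

```python
def classify(node, x):
    if node['is_leaf']:
        return node['prediction']
    return classify(node['children'][x[node['splitting_feature']]], x)

def count_leaves(node):
    if node['is_leaf']:
        return 1
    return sum(count_leaves(child) for child in node['children'].values())

def prune(node, data, target, lam):
    # Bottom-up: prune the children first, then ask whether collapsing this whole
    # subtree into a single majority-class leaf lowers Error + lambda * (#leaves)
    # on the training data that reaches this node.
    if node['is_leaf'] or not data:
        return node
    feature = node['splitting_feature']
    node['children'] = {
        value: prune(child, [r for r in data if r[feature] == value], target, lam)
        for value, child in node['children'].items()
    }
    labels = [r[target] for r in data]
    majority_label = max(set(labels), key=labels.count)
    leaf = {'is_leaf': True, 'prediction': majority_label}
    error_subtree = sum(classify(node, r) != r[target] for r in data) / float(len(data))
    error_leaf = sum(majority_label != r[target] for r in data) / float(len(data))
    keep_cost = error_subtree + lam * count_leaves(node)
    prune_cost = error_leaf + lam * 1
    return leaf if prune_cost <= keep_cost else node
```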
1 hour to complete
Handling Missing Data
Real-world machine learning problems are fraught with missing data. That is, very
often, some of the inputs are not observed for all data points. This challenge is very
significant, happens in most cases, and needs to be addressed carefully to obtain
great performance. And, this issue is rarely discussed in machine learning courses. In
this module, you will tackle the missing data challenge head on. You will start with the
two most basic techniques to convert a dataset with missing data into a clean dataset,
namely skipping missing values and imputing missing values. In an advanced section,
you will also design a modification of the decision tree learning algorithm that builds
decisions about missing data right into the model. You will also explore these
techniques in your real-data implementation.

6 videos (Total 25 min), 1 reading, 1 quiz


 So there are two basic kinds of skipping that you might want to do when you have missing data.
You can either skip data points that have missing data or skip features that have missing data.
And somehow you have to make a decision of whether to skip a data point, skip features, or skip
some data points and some features and that's a kind of complicated decision to make. In general
this idea of skipping is good because it's easy, it kind of just takes your data set and simplifies it a
bunch. It can be applied to any algorithm because you just simplify the data and you just feed it to
any algorithm, but it has some challenges. Now, removing data or removing features is always kind of a painful thing; data is important. You don't want to do that, and it's often unclear whether you should remove features or remove data points, and what impact doing so will have on your answer. Most fundamentally, even if you can get away with skipping at training time, what do you do at prediction time when you see a question mark? This approach does not address missing data at prediction time.
And so people do this approach all the time. And I'm okay with it if you just have kind of one case
here or case there. But it's a pretty dangerous approach to take. I don't fully recommend skipping
as an approach to dealing with missing data. [MUSIC]
[MUSIC] At the core of the decision tree algorithm is choosing the feature to split on and that's the
place we're going to modify it to also choose what to do with that missing data. Arguably the most important question in decision tree learning is figuring out what feature to split on. And so now when we look at what feature to split on, we're not just going to decide what feature to split on, we're also going to decide which branch to take when we see a missing value. And so conceptually, it's just a very simple modification of the algorithm we described in the previous module. But now, we're just going to add this extra search to figure out where to put that missing data. Just like with greedy learning of decision trees, we can just focus on learning a decision stump to understand
what's going on here. So, we chose to split on credit. Credit can be excellent, fair or poor and the
branch here on the right is the one where all data points have credit equals poor. And now, we
might decide that the missing data is going to go into that branch. So now we're going to take all
the data that was missing and the data where credit equals poor, and feed it into that node. That's
what's going to happen with categorical data. Now, when we have numeric data, we use those threshold splits that we discussed in the previous module. And in this case, we might threshold on income less than $50,000 or greater than $50,000, and decide that missing data, data where the income is unknown, gets fed to the same node as those where the income was less than $50,000. Those are the decisions that we are going to try to make inside of our learning
algorithm. Now, we have to decide where this data is going to go. So let's say we choose to split on credit, or are considering splitting on credit. If we consider splitting on credit, what should we do with the missing data? Should we put it with the people with excellent credit, with the people with fair credit, or with the people with poor credit? I don't know what's best, but we have a principle for doing this. The principle says: choose the branch that leads to the lowest classification error, just like we did before. Let's say we're considering splitting on credit, and we'll consider assigning the missing data to the branch where the credit was excellent. Let's see what happens. So in this example, we
have three data points with missing data, these three here. And we're considering assigning those
data points to the branch where the credit was excellent. So, we had two kinds of data. We had the observed data, and there, in the excellent branch, we had nine safe loans and zero risky. In the fair branch, we had eight safe loans and four risky. And in the poor branch, we had 4 safe loans and 12 risky. Now, we have the three extra data points. And in these three extra data points, as you can see, there are two risky and one safe. So, we're thinking about adding these two risky
and one safe into the excellent branch. So, what are the grand totals here? The grand totals are
ten safe and two risky inside the excellent branch and then the same for the other two branches,
because we're not in excellent data. So what's the overall total error of the resulting decision tree?
Well, for the excellent branch, we're going to predict safe. So, we're going to make two mistakes.
For the fair branch, we're going to predict safe. So we're going to make four mistakes and for the
poor branch, we're going to predict risky and we're going to make four mistakes. So overall, we're
going to make 10 mistakes and our total was 40 data points, so the total error here is 0.25, one quarter. Now we do the same for every single option, every single branch. So, we tried putting the unknown data with the excellent branch and got error 0.25. We tried putting it with the fair branch, and we get a 0.25 error as well. But if we try to put it with the last branch here, the poor branch, we get an error of 0.225. And so that's a lower error, so this is the winner. So the best thing to do would be to assign the unknowns to the poor branch, to credit equals poor, which is exactly what we did in the
earlier tree. This is the modification of the tree splitting algorithm to take into account missing data. It gets a little complicated, but it's essentially the simple idea we just described. I'm leaving it here for reference, if you choose to implement it. But overall, the fundamental concept is that you
can use this idea of embedding, handling of missing data inside of the learning algorithm to get
better overall performance. [MUSIC]
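A rough sketch of that extra search (the toy per-branch counts below mirror the lecture's example; the function names are made up, and real implementations fold this into the split-selection code):

```python
def branch_error(counts):
    # counts: {branch_value: (num_safe, num_risky)}; majority prediction per branch.
    mistakes = sum(min(safe, risky) for safe, risky in counts.values())
    total = sum(safe + risky for safe, risky in counts.values())
    return mistakes / float(total)

def best_branch_for_missing(counts, missing_safe, missing_risky):
    # Try sending all rows with a missing value down each branch in turn,
    # and keep the assignment with the lowest classification error.
    best, best_error = None, float('inf')
    for branch in counts:
        trial = dict(counts)
        safe, risky = trial[branch]
        trial[branch] = (safe + missing_safe, risky + missing_risky)
        err = branch_error(trial)
        if err < best_error:
            best, best_error = branch, err
    return best, best_error

# (safe, risky) counts per credit branch, plus 1 safe and 2 risky rows where credit is missing.
counts = {'excellent': (9, 0), 'fair': (8, 4), 'poor': (4, 12)}
print(best_branch_for_missing(counts, missing_safe=1, missing_risky=2))  # ('poor', 0.225)
```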
[MUSIC] In this module we address a very fundamental concept, a concept of having missing
data. And missing data can impact us both at training time and at prediction time. For both
cases, we explored fundamental ideas that are useful for a wide range of algorithms, not just
decision trees. We explore the idea of just skipping data points, which has its benefits and pitfalls,
the idea of trying to impute or guess what those missing values are. And the idea of modifying the
actual learning algorithm, in particular with decision trees, in order to better deal with missing data.
Now, in practice, you will often see missing data and you should be always on the lookout for
missing data. And sometimes our data comes in such a way that a value is missing but not explicitly marked as missing. So
for example, sometimes people put in zero, when it's unknown. And you might think it's zero. But
it's really unknown. So you should always be on the lookout for missing data. And you should always treat it very carefully, because it really impacts the answers of your algorithm. Today we've seen some basic approaches for dealing with it. Of course, there are more advanced ones that you
can get into. But this is a fundamental area we should always be on the lookout for. And let me
close, again, by thanking my colleague, Krishna Sridhar, who's really been instrumental in the
creation of the slides and helping with the overall vision of this module.
WEEK
5
2 hours to complete
Boosting
One of the most exciting theoretical questions that have been asked about machine
learning is whether simple classifiers can be combined into a highly accurate
ensemble. This question led to the development of boosting, one of the most important
and practical techniques in machine learning today. This simple approach can boost
the accuracy of any classifier, and is widely used in practice, e.g., it's used by more
than half of the teams who win the Kaggle machine learning competitions. In this
module, you will first define the ensemble classifier, where multiple models vote on
the best prediction. You will then explore a boosting algorithm called AdaBoost, which
provides a great approach for boosting classifiers. Through visualizations, you will
become familiar with many of the practical aspects of this technique. You will create
your very own implementation of AdaBoost, from scratch, and use it to boost the
performance of your loan risk predictor on real data.

13 videos (Total 58 min), 3 readings, 3 quizzes


 So even though there were two positive votes and two negative votes, and the most important
classifier, the first one, was a positive vote, when you add it up and you average the results, you
get a risky loan as an output. Now this is a simple example of what's called an ensemble classifier
or the combination multiple classifiers. As we'll see, this kind of ensemble model, this kind of
combination is what everybody uses in industry to be able to solve complex decision problems,
complex classification problems. Just to make sure we're all on the same page, I'm going to just
formally define the ensemble learning problem. So we're given some data, where y is either +1 or
-1, there's also multiclass version of this, we're just going to talk about +1 or -1 in today's module,
and we have some input, x. We have some data that allows us to learn f1, f2, all the way to fT,
which are the T weak classifiers, or just classifiers, that we learn from data. And then some
coefficients that we learn from data, w-hat1, w-hat2, all the way to w-hatT. And once we learn
them, making a prediction is very similar to what you do for logistic regression or a linear classifier.
It's just the sign of the weighted sum of the votes from each classifier. And so if you look at this, it
looks a lot like a linear classification model, logistic regression and all of those, exactly the same.
However, not only are we learning the ws from data, we're actually learning the features. In those
models, we had hs to represent our features. Here, the features are these fs, which are the weak classifiers that we're going to learn from data. So we can think about boosting as an approach to
learn features from data. And it's really super exciting. [MUSIC]
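A minimal sketch of that prediction rule (the weak classifiers and coefficients below are made up; each f is just a function returning +1 or -1):

```python
def ensemble_predict(x, classifiers, coefficients):
    # y_hat = sign( sum_t  w_hat_t * f_t(x) )
    score = sum(w * f(x) for w, f in zip(coefficients, classifiers))
    return +1 if score > 0 else -1

# Hypothetical weak classifiers on a loan application x (a dict of features).
f1 = lambda x: +1 if x['income'] > 100000 else -1
f2 = lambda x: +1 if x['credit'] == 'excellent' else -1
f3 = lambda x: +1 if x['term'] == '3yr' else -1

x = {'income': 85000, 'credit': 'excellent', 'term': '5yr'}
print(ensemble_predict(x, [f1, f2, f3], [0.61, 0.53, 0.30]))  # weighted vote: -1 (risky)
```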
[MUSIC] So now that we've talked about several models, let's dive in into boosting more generally,
and then a specific example of a boosting algorithm. So think about a learning problem where we
take some data, we learn a classifier which gives us some output, f(x), and we use it to predict on
some data so we say that y hat is a sign of f(x). So if we take some data and we try to make a
prediction, we might learn a decision stump. So let's say we try to split on income, we look at
folks with income greater than 100,000, folks with income less than 100,000, and how do we learn
the actual predictions that we make? Well, we look at the rows of our data, the rows where the
income was greater than 100,000, and we see that for those, 3 were safe loans and 1 was a risky
loan, so we're going to predict y hat = safe. Now, if we look at incomes of less than 100,000, we'll
see that we have 4 safe and 3 risky, so again we predict safe, so both sides we're going to predict
safe. As it turns out, this first decision stump seems okay, but it doesn't seem great on the data. And so the decision stump wasn't enough to capture the information in the limited data. So what boosting will do is take that decision stump, evaluate it, look at how well it's doing on our data, and learn a next decision stump, a next weak classifier, and that next classifier is going to focus on the data points where the first one was bad. So in other words, this is a very important
point. In other words, we're going to look at where we're making mistakes so far and increase the proportion, the impact, how much we care about the points where we've made mistakes. And we learn another classifier that takes care of the points where we made mistakes, and then we learn another one and another one. And eventually, as we'll see, we'll actually converge to a great classifier.
What does it mean to focus on the data points where we made mistakes? What it means is that we're going to assign a weight, alpha i, a positive number, to every data point in our dataset. And that
weight, when it's higher, it means that data point is more important. So you can imagine a learning
problem where we have data, not just like we've done so far in this course, but we have data with
weights associated with it. And now we're going to learn from weighted data. The way to think about learning from weighted data is that those alpha i's correspond to a data point counting more than one time if the weight is greater than one. So for example, if alpha i is 2, you can think about that data point as counting twice. If alpha i is a half, you can think about that data point as counting as half of a data point. But everything in the learning algorithm stays exactly the same. It's just that instead of counting data points, you count the weights of data points.
What happens in our decision stump approach? So we had that first decision stump, it was not great, especially for folks with lower income, and so we learn new weights, which are higher for the places where we've made mistakes, and now we learn a new decision stump. Let's say again we split on income, greater than 100,000 or less than 100,000. When we look at the classification decisions that we make, for greater than 100,000 what we do is sum the weights of the data points that were safe and had incomes greater than 100,000. So in this case we're summing 0.5, 0.8, 0.7, which adds up to 2, and then for the risky ones it's 1.2, and so we're going to make a prediction of y hat is safe. So it's the weighted sum of the data points. For income less than 100,000, same kind of idea. We'll go through the data points, look at the ones that were risky and the ones that were safe, and sum up the weights of those. And we see the total weight of risky is 6.5 and the total weight of safe loans is 3, so we're going to predict risky. So this decision stump is now going to be a bit better. So we're going to combine this one with the previous one and others and others until we converge to a great ensemble.
 Now, this idea of learning from weighted data is not just about decision stumps. It turns out that
most machine learning algorithms accept weighted data. So I'm going to show you very briefly
what happens if you are doing logistic regression and you have weighted data. So if you look at
the equation on the bottom here, that's exactly the equation of logistic regressions derivative, or
the update function that we do. This is the thing that you would implement if you run a logistic
regression model. Now, you say, I have this weighted data, I have to reimplement everything from
scratch. My god, my god, my god. And it turns out that it's very simple. You should look at the
middle of the equation, we have the sum over data points over here. And now, where we sum over data points, we're just going to weigh the contribution of each data point. So we just add that
weight alpha i to each term in the sum and we're done. We now have logistic regression for
weighted data. So we showed you two examples, decision stumps and logistics regression but in
general it's quite easy to learn with weighted data. So boosting can be viewed as a greedy algorithm for learning an ensemble from data. We train the first classifier, let's say f1(x); if you just have f1(x), just predict the sign of f1 to be your output y hat. Then you re-weight your data by weighting more the data points where f1 makes mistakes. And now we learn another classifier, f2, and we learn the coefficients for these classifiers, and now our prediction, if we just do 2 steps of this, is w hat 1 times f1 + w hat 2 times f2, and the sign of that is y hat. So the idea is to keep adding new classifiers, re-weighting the data to focus on the more difficult data points, and learning the coefficients for the different classifiers. [MUSIC]
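To make the "just add alpha i to each term of the sum" point concrete, here is a rough sketch of one gradient ascent step for logistic regression on weighted data (the matrix layout, step size, and toy numbers are assumptions for illustration):

```python
import numpy as np

def weighted_logistic_gradient_step(H, y, w, alpha, step_size):
    # H: N x D matrix of features h_j(x_i); y: labels in {+1, -1};
    # alpha: per-data-point weights; w: current coefficients.
    scores = H.dot(w)
    prob_positive = 1.0 / (1.0 + np.exp(-scores))
    indicator = (y == +1).astype(float)
    # Same derivative as before, except each data point's term is weighted by alpha_i:
    # partial_j = sum_i alpha_i * h_j(x_i) * (1[y_i = +1] - P(y = +1 | x_i, w))
    gradient = H.T.dot(alpha * (indicator - prob_positive))
    return w + step_size * gradient

# Tiny made-up example: 3 data points, 2 features (including a constant feature).
H = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([+1, -1, +1])
alpha = np.array([2.0, 0.5, 1.0])   # weighted data: the first point counts twice
w = np.zeros(2)
print(weighted_logistic_gradient_step(H, y, w, alpha, step_size=0.1))
```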
[MUSIC] In this module we're going to talk about the specific boosting algorithm called AdaBoost.
AdaBoost is one of the early machine learning algorithms for boosting. It's extremely useful. It's
very simple to implement. And it's quite effective. There are others. I'll mention another interesting
one towards the end of the module but let's start with AdaBoost. This is the famous AdaBoost
algorithm, which was created by Freund and Schapire in 1999, amazing useful algorithm. So you
start by seeing every data point the same way. You don't know which ones are hard, which ones
are easy. They all have the same weight. And so you can start with them all at the same weight; we'll start with weight 1 over N, because it makes everything kind of work out a little better, and we'll talk about why in a few slides. But we start with all data points having the same, what's called uniform, weight. So in this case alpha i is one over N. And then for each iteration of AdaBoost, as you go to learn the first decision stump, or the first simple classifier, the first weak classifier, then the second one, the third one, all the way to capital T, what we do is learn ft on data weighted by alpha i. So the weights start at 1 over N, but they change over time. Then we compute the coefficient w hat t for this new classifier ft that we learned. And then we recompute the
weights alpha i. And we keep going, and finally once we're done we say that the prediction y hat is
the sign of the weighted combination of f1, f2, f3, f4, weighted by these coefficients that we learned from data. So there are two fundamental problems we need to address when we're thinking about
AdaBoost. One is, how do you compute the coefficient w hat t, let's call that problem 1 in our
module for today. So problem 1 is how much do I trust ft in this case? So if I trust ft a lot, I should
give it a very high weight. If I trust ft very little, I should give it a very low weight, or a very low
coefficient, I should say. And then problem 2 is how do you recompute this weight on the data
points? Let's call that problem 2. And so, problem 2 here is how do we weigh mistakes more? So
we want to increase the weights of mistakes. So in the main part of this module, we're going to
talk about how do you compute w hat t and how we can update alpha i's, and it's going to be
pretty simple, relatively intuitive, extremely useful. [MUSIC]
[MUSIC] >> Let's start talking about how you compute w hat t. This quantity is intuitive: it denotes how good, or how much we trust, ft, the classifier we learn at this iteration. So, specifically, if ft is good, we like it, it's doing well on our data, we want w hat t to be
large. In fact, if ft has really, really great accuracy, very low error, we want wt to be really big.
However if ft is really bad, if it really is terrible at making predictions, we should down weight it. We
should not trust that particular vote. So in other words, how do we measure whether a classifier's
good or not? As we said, ft is good if it has low training error. However, you have to remember that we have weighted data. So what we really care about is how well it's doing on the weighted data. For example, if we're weighing certain data points more because they're really hard and we've been making lots of mistakes on those, we want to make sure that the classifier has low error on those really hard examples. And so let's look at measuring error on weighted data.
Measuring error on weighted data is very similar to measuring error on regular data. You have a
data point. For example, the sushi was great and is labeled as positive, but now we have a weight.
In this case, alpha, which might be 1.2. So this is a data point which is say above average
importance. So we want to measure the weighted total of the correct examples and the weighted
total of the mistakes. So we take our learned classifier, f of t, and we feed that review, in this case,
the sushi was great, but we hide the label, which in this case was positive. And now we compare
the prediction. For example, let's say that y hat was plus one for this input. It's the same as the true label, so it's correct, and we add the weight 1.2 to the total weight of the correct examples we've seen.
So that's awesome. But let's say we have another data point. The food was OK, which is labeled
truly as negative, and we talked about this example before. We feed food was OK to a classifier.
We hide the label. Minus 1. But our classifier gets confused. It doesn't know the cultural reference,
the food was OK, and thinks it is a positive example, y hat is plus 1, and it's a mistake. So we take
the weight of this data point 0.5 and add it to the total weight of the mistakes. So keep adding the
weight of the mistakes versus the weight of the correct classifications. And use that to measure
the error. Now that we have seen and intuitive notion of what a weighted error is, let's write the
down the equations for the weighted error, so we can be sure if we need to implement it. So, the
first thing we need to measure is the total weight of all the mistakes, so the sum of our mistakes of
the weight of those data points. So this is the sum over the datapoint, so i equals 1 through N, of
an indicator that says, was this a mistake? So is y hat i different from yi? So this just measures whether it was a mistake, and if it was a mistake we don't just count it as one mistake, we count it with whatever weight that data point has. So we're going to weigh that contribution by alpha i. And now,
to compute the error, we're going to normalize it so it's a number between zero and one, so we
have to divide it by the total weight of all the data points. So it's the sum over i equals 1 through N
of the weight of all the data points of i. And these are the two quantities that we care about, and
the weighted error can be denoted by the total weight of the mistakes divided by the total weight of
all data points. Extremely simple. The best possible error you could hope for is 0.0. Now, the worst error is 1.0, which means that we're making mistakes everywhere. But notice that if we're making mistakes everywhere, if we invert the classifier, we're going to get everything right. So the way to think about it is that the worst case, in some sense, is what a random classifier does. So a random
classifier will get error of 0.5, and we discussed this in the first course of how a random classifier
gets error 0.5 on a binary classification problem like this. So now that we've seen the weighted
error, let's look at how we can update the coefficient w hat t of the function that we learn. >>
[MUSIC]
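A short sketch of that weighted error computation (the names and numbers are made up):

```python
import numpy as np

def weighted_error(alpha, y, y_hat):
    # Total weight of the mistakes divided by the total weight of all data points.
    alpha, y, y_hat = map(np.asarray, (alpha, y, y_hat))
    return np.sum(alpha * (y != y_hat)) / np.sum(alpha)

alpha = np.array([1.2, 0.5, 1.0, 0.8])
y     = np.array([+1, -1, +1, -1])
y_hat = np.array([+1, +1, +1, -1])      # one mistake, on the 0.5-weight point
print(weighted_error(alpha, y, y_hat))  # 0.5 / 3.5, about 0.14
```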
[MUSIC] AdaBoost uses this slightly intimidating formula to figure out what w hat t should be, but it
turns out to be pretty intuitive if you look at it in a bit more detail. This formula is derived from a famous
theorem, AdaBoost theorem, which I want to mention very briefly towards the end of the module.
But it's the formula that lets you find classifiers that keep getting better and better, and help
boosting to get to the optimal solution. So, lets look at this one in a little bit more detail by
exploring a few possible cases. So, the question is: is ft good? If ft is really good, it has really low error on the training data, that is to say, low weighted error. So for example, if that weighted error is 0.01, then it's a really good classifier. First, let's see what happens to this famous middle term here when the weighted error is 0.01. So the middle term is (1 - 0.01)/0.01, which is equal to 99.
And next, to compute w hat t, we're going to take a half times the log of this number 99. And so if
you do one-half times the log of 99, you're going to get 2.3. So this was an excellent classifier and
we gave it a weight of 2.3, which is high. Now, let's see what happens if we output a random
classifier. So as we said, a random classifier has weighted error of 0.5 is not something to be
trusted. So if you plug this in, (1 - 0.5)/0.5 yields the magic number 1. And if you look at a half of log
of one, what's log of one? It's 0, so w hat t is 0. So what we learn is that if a classifier is just random, it's not doing anything meaningful, and we weight it by zero. We say, you're terrible, we're going to ignore you. You might have friends who are kind of like this: they say random stuff, you never trust what they say, you put zero weight on their opinions. So that's what AdaBoost does too. Now we're
going to get to a really, really, really interesting case. Let's suppose that your classify is terrible, it
gets 0.99 weighted error. So its getting almost everything wrong, it's worse than random. Let's see
what happens to the term in the middle here of our equation. You get (1 - 0.99)/0.99, which is equal
to 0.01 and guess what happens when you take a half log of 0.01? You get -2.3. And when I first
saw this, I thought, wow, this AdaBoost theorem is beautiful, but take a moment to kind of
internalize what just happened. We had this terrible classifier. But yet, we gave it pretty high
weight, 2.3, but with a negative sign. And why is that? Because a classifier might be terrible, but if we flip the sign of f of t, so if we do exactly the opposite of what it says, it's an awesome classifier. In other words, if we invert a terrible classifier, we're going to do awesomely. And AdaBoost automatically does that for you. And so this is, again, kind of using the friend analogy. You
might have a friend who always has really good opinions, but they're all always like wrong. And
so, we do exactly the opposite of what that person says. Maybe this is how you hear your parents
or something, or some friends. You say, okay, you say, I should do A, I'm going to do the opposite
of that. And by doing that, I might do great things in the world. And so AdaBoost automatically
figures that out for you, which is awesome

Now let's revisit the AdaBoost algorithm that we've been talking about. In this part of the
module, we're going to be exploring how do we compute the coefficient w hat t and we saw that
can be computed by this really simple formula. We compute the weighted error of f of t and we just
say, w hat t is one-half of the log of 1 minus weighted error divided by the weighted error. And with
that, we have w hat t and we can focus on figuring out how to come up with the alpha i's. And we want alpha i to be high where ft makes mistakes or does poorly. [MUSIC]
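Here is that coefficient formula as a tiny sketch, evaluated at the three weighted errors discussed above:

```python
import math

def compute_coefficient(weighted_err):
    # w_hat_t = 1/2 * ln( (1 - weighted_error) / weighted_error )
    return 0.5 * math.log((1.0 - weighted_err) / weighted_err)

for err in (0.01, 0.5, 0.99):
    print(err, round(compute_coefficient(err), 2))
# 0.01 -> about +2.3 (great classifier, big positive vote)
# 0.5  ->        0.0 (random classifier, ignored)
# 0.99 -> about -2.3 (worse than random, so its vote is effectively flipped)
```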
[SOUND] We started with alpha i's being uniform, the same for all data points, one over n, and
now we want to change them to focus more on those difficult data points where we're making
mistakes. So the question is where did f t make mistakes or what data points f t got right. If f t got
a particular data point, xi right, we want to decrease alpha i because we got it right. But if we got xi
wrong, then we want to increase alpha i, so the next weak classifier we learn hones in on, and does better on, this particular input.
Again, the AdaBoost theorem provides us with a slightly intimidating formula for how to update the
weights alpha i. But if you take a moment to interpret it, we'll see this one is extremely intuitive,
and there's something quite nice. So let's take a quick look at it. So it says that alpha i gets an
update depending on whether ft gets the data point right, so it's correct, or whether ft makes a mistake. And as we'll see, we're going to increase the weight of data points where we
made mistakes and we're going to decrease the weight of data points we got right. So let's take a
look at this. So let's take one xi and let's suppose that we got it correct. So we're in the top line here, and notice that this equation depends on the coefficient that was assigned to this classifier. If the classifier was good, we change the weight more, but if the classifier was terrible, we're going to change the weights less. So, let's say the classifier was good and we gave
it weight 2.3. So what we're doing here, we're looking at the formula, we're multiplying alpha i by e
to the -w hat t, which is 2.3. And if you take your calculator out, you see that this is 0.1. So, we're
taking the data points to our right, and we multiply the weight of those data points, by 0.1, so
dividing by 10. So what effect does that do? We're going to decrease the importance of this data
point ff xi, yi, so this particular data point. Now let's look at a case where we got the data point
correct but the cost that we learn is random. So, it had to wait zero just like we discussed a few
slides ago. So it's overall weight 0.5 is weight 0. In this case we're multiplying the coefficient L5,
but e to the minus 0. Which is = 1, What does that mean? That means that when I keep the
importance of this data point the same, that also makes a ton of sense. So this was a classified
that was terrible and we gave it a weight of 0, we are going to ignore it, and so since we are
ignoring it we are not changing anything about how we rate all data points we are just going to
keep going as if nothing's happened because nothing's changed on their overall assemble. Now
let's look at the opposite case when we actually made mistakes so let's say that we got xi incorrect
so we made a mistake. In this case, we're in the second line here. So if it was a good classifier it
had the w hat t of 2.3, then we're going to multiply the weight by each of the power 2.3, so this is e
to the 2.3, which if you do the math is 9.98, so approximately 10. So it's 10 times bigger. And so,
what we doing is increasing the importance of this mistake significantly. So the next classify is
going to pay much more attention to this particular data point because it was a mistaken one. Now
finally, just very quickly, what happens if we make a mistake, but we have that random classifier
that had coefficient 0, we didn't care about it. So the multiplication here is e to the 0, which is again equal to 1, which means we keep the importance of this data point the same. So very good, we've now seen this cool update from AdaBoost, which makes a ton of sense: increase the weights of data points where we made mistakes and decrease the weights of the ones we got right, and we're going to use it in our AdaBoost algorithm. So we update our algorithm: we started with uniform weights, we learned classifier f of t, we computed its coefficient w hat t. Now we
can update the weights of the data points, alpha i. Using the simple formula from the previous
slide which increases the weight of mistakes, decreases the weights of the correct classifications.
[MUSIC]
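A small sketch of that weight update for a single data point (illustrative numbers only):

```python
import math

def update_weight(alpha_i, w_hat_t, correct):
    # Multiply by e^(-w_hat_t) when f_t got the point right (down-weight it),
    # and by e^(+w_hat_t) when f_t made a mistake (up-weight it).
    return alpha_i * math.exp(-w_hat_t if correct else +w_hat_t)

print(update_weight(1.0, 2.3, correct=True))    # ~0.10: importance divided by about 10
print(update_weight(1.0, 2.3, correct=False))   # ~9.97: importance multiplied by about 10
print(update_weight(1.0, 0.0, correct=False))   # 1.0: a random classifier changes nothing
```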

[MUSIC] >> Finally, we need to address a technical issue that we hinted at when we normalized
the weights of data points to start at 1 over N, when we had uniform weights. Which is that we should keep normalizing the weights of the data points throughout the iterations. So, for example, if you take a data point xi and suppose that it is often a mistake, we're multiplying its weight by a number greater than one again, and again, and again. Let's say 2 times 2 times 2 times 2, and that weight can get extremely large. On the other hand, if you take a data point xi that's often correct, you multiply by some number less than one, say a half. So you keep going by a half, a half, a half, and that weight can get really, really small. And so this problem can lead to numerical instabilities in the approach.
 So let's summarize the AdaBoost algorithm. We start with even weights, uniform weights, equal weights for all data points. We learn a classifier f of t. We find its coefficient depending on how
good it is in terms of weighted error. And then we update the weights to weigh mistakes more than
things we got correct. And finally, we normalize the weights by dividing each value by this total
sum of the weights. And this normalization is of practical importance. So this is the whole
AdaBoost algorithm, it's beautiful, works extremely well, really easy to use. >> [MUSIC]
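Putting all the pieces together, here is a compact sketch of the whole loop, using scikit-learn depth-1 decision trees as the weak classifiers via their sample_weight argument (scikit-learn and the toy dataset are assumptions for illustration; any learner that accepts weighted data would do):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, num_rounds=30):
    N = len(y)
    alpha = np.full(N, 1.0 / N)                 # start with uniform weights 1/N
    stumps, coefficients = [], []
    for _ in range(num_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=alpha)    # learn f_t on weighted data
        mistakes = stump.predict(X) != y
        weighted_error = np.sum(alpha * mistakes) / np.sum(alpha)
        weighted_error = np.clip(weighted_error, 1e-10, 1 - 1e-10)   # avoid log(0)
        w_hat = 0.5 * np.log((1 - weighted_error) / weighted_error)  # coefficient
        alpha = alpha * np.exp(np.where(mistakes, w_hat, -w_hat))    # reweight data points
        alpha = alpha / np.sum(alpha)                                # normalize
        stumps.append(stump)
        coefficients.append(w_hat)
    return stumps, coefficients

def predict(stumps, coefficients, X):
    score = sum(w * s.predict(X) for w, s in zip(coefficients, stumps))
    return np.where(score > 0, +1, -1)

# Tiny made-up dataset: label is +1 only when both features are positive.
X = np.array([[2., 2.], [1., 2.], [2., 1.], [-1., -1.], [-1., 2.], [2., -1.]])
y = np.array([+1, +1, +1, -1, -1, -1])
stumps, coefs = adaboost(X, y, num_rounds=10)
print(predict(stumps, coefs, X))   # compare against y
```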
[MUSIC] Now let's take that third example we've used to illustrate different machine learning
algorithms in this module and explore it in the context of AdaBoost. And it's going to give us a lot
of insight as to how boosting works in practice. So for the first classifier f1, we work directly off the
original data, all points have the same weight. That's right. And so the learning process we have is
going to be standard learning. So nothing changes in your learning algorithm since every data
point has the same weight. And in this case we're learning a decision stump and so here is the
decision boundary that does it's best to try to separate positive examples from negative examples.
It splits right around 0. It's actually minus 0.07, if you remember from the decision tree classifier.
So this is the first decision stump, f1. Now to learn the second decision stump, f2, we have to
reweigh our data based on how much f1 did, so how well f1 did. So we're going to look at our
decision boundary and we're going to weigh data points that were mistakes higher. And here in
the picture I'm denoting them by bigger minus signs and plus signs. So if you look at the data
points here on the left, they were mistakes, the minuses on this side, and these pluses over here were also mistakes, so we increased their weight and we decreased the weight of everybody else, and we see that the pluses here became bigger and the minuses in this region became larger.
 Learning the classifier f2 in the second iteration is based on this weighted data. Using the weighted data, we'll learn the following decision stump. And you see that now, instead of having a vertical split, we have a horizontal split, and it's a better split for the weighted data, for those weights on the left, and it's kind of cool. So in the first iteration we decided to split on x1. In the second one
we split on x2, and this is x2 greater than or less than 1.3 or so. And you'll see that it gets all the
minuses correct on top but it makes some mistakes on the minuses in the bottom, but it gets the
pluses correct in the bottom. So as opposed to the vertical split here, we now have a horizontal
split
So now we've learned the two decision stumps f1 and f2, and the question here is how do we combine them? If you go through the AdaBoost formula, you'll see that w hat 1, the weight of the first decision stump, is going to be 0.61, and w hat 2 is going to be 0.53. So we trust the first decision stump a little bit more than we trust the second one, which makes sense; the second one doesn't seem as good. But when you add them together, you start getting a very interesting decision boundary. The points in the top left here are ones where we definitely think that y hat is minus 1, so definitely negative. On the bottom right here are some definite positives, y hat equals plus 1. And then for the other two regions, we can think about these as being regions of higher uncertainty. So these are uncertain right now, which makes sense, but as you add more decision stumps, we're going to become more sure about how to label the points in those regions. Now, if we keep boosting for 30 iterations, the first thing that we notice is that we get all the data points right, so our training error is 0. The second thing you'll notice, and here I'm going to use a technical term for this, is that the decision boundary is crazy. This is our technical term. And if you combine these two insights, we figure out, okay, we don't really trust this classifier, we're probably overfitting the data. So it fits the training data perfectly, but it maybe doesn't do as well in terms of true error. So overfitting is something that can happen in boosting; we'll talk about it a little bit later. So let's take a deep breath and summarize what we've done so far. We described simple classifiers, and we said that we're going to learn these simple classifiers and take a vote between them to make predictions. And then we described the AdaBoost algorithm, which is a pretty simple approach to learning a non-simple classifier using this technique of boosting, where you're boosting up the weight of data points where we're making mistakes. And it's simple to implement in practice. [MUSIC]
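As a small illustration of the weighted vote described above, here is a hedged sketch of combining the two decision stumps; f1 and f2 stand in for the learned stumps (each returning +1 or -1), and the coefficients 0.61 and 0.53 are the ones from this example.

    def ensemble_predict(x, stumps, coefficients):
        # weighted vote: sign of the sum of coefficient * stump prediction (+1/-1)
        score = sum(w * f(x) for w, f in zip(coefficients, stumps))
        return +1 if score > 0 else -1

    # with the two stumps above: y_hat = sign(0.61 * f1(x) + 0.53 * f2(x))
    # ensemble_predict(x, [f1, f2], [0.61, 0.53])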
[MUSIC] We'll now take a few minutes to instantiate this abstract algorithm we described and see what it looks like in the context of learning decision stumps. This is called boosted decision stumps, and it's a really nice, simple default way of training on your data. So it's one that we'll go through a little further, and it'll help us ground some of the concepts that we've looked at so far. Here I've outlined the AdaBoost algorithm that we discussed in the previous slides, but just to be clear, we're going to talk about learning a decision stump for f of t, and figuring out how to update it, figure out its coefficient, w hat t. Our first step is figuring out how to learn the next decision stump, which is going to be f of t. And this is just going to be like standard decision stump learning. We're going to try splitting on each feature: income, credit history, savings, market conditions. And we figure out how well each of the resulting decision stumps does on the weighted data. And notice that in this process we might split on income multiple times; in multiple iterations we might revisit the same feature. So we're going to try each of those features, and for each one of them, measure the weighted error on the training data. So for, say, splitting on income, the weighted error might be 0.2. For splitting on credit, it might be 0.35. For splitting on savings, it might be 0.3. And finally, if you split on market conditions, it might be the worst of these four decision stumps: on this weighted data, it might have a weighted error of 0.4. So we're picking the best feature, the one that has the lowest weighted error, and so we're going to pick that first one to split on. We're going to split on income, at $100,000. And so f of t is going to be that decision stump that says: is income greater than $100,000? If yes, predict safe; if not, predict risky loan. Now, the final question is, what coefficient do we give to this particular classifier? All we have to do is plug 0.2 into the formula, and if we plug it in and do the math, 0.69 is the result. So the coefficient of this first decision stump is just going to be 0.69.
Going back to the algorithm, we discussed how we're going to learn this new stump from data and how we figure out its coefficient. Let's next talk about how to update the weight alpha i of each data point. So here's intuitively what happens. We have our data points, and I'm highlighting them here depending on their income, just like we did before. But I'm going to make a prediction using this decision stump. The question is how good is this decision stump, income greater than $100,000, and if you look at it, it makes mistakes on some of the data points and gets others right. So I marked the correct ones in bright green and the mistakes in bright red. And if we take the previous weights, alpha, for each one of these data points, I'm going to highlight where those weights were, right there. We need to compute the new weight based on the formula above, which is the standard formula. So we're going to plug in the w hat that we computed, 0.69, into the formula to figure out what to multiply each one of those weights by. So, plug it in. You'll see that e to the -0.69 is a half, so for every correct data point we halve its weight, and e to the 0.69 is two, so for every incorrect data point we're going to double its weight. So I'm going to go row by row. For the ones in green I got correct, I'm going to halve the weights. So for the first row there, the weight before was 0.5 and now becomes 0.25. The next one was 1.5 and becomes 0.75, because they're correct. For the third row I made a mistake; its weight before was 1.5, and now I'm going to double it, I'm going to make it 3. So we can go data point by data point and multiply or divide the weight by two, depending on whether we got that data point right or not. It's extremely simple to boost the decision stump classifier, and these tend to do extremely well on a wide range of datasets. [MUSIC]
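To ground those numbers, here is a minimal sketch of the coefficient formula and the resulting weight multipliers; it simply reproduces the arithmetic of this example, w hat = 1/2 ln((1 - weighted error) / weighted error).

    import numpy as np

    def adaboost_coefficient(weighted_error):
        # w_hat = 1/2 * ln((1 - weighted_error) / weighted_error)
        return 0.5 * np.log((1 - weighted_error) / weighted_error)

    w_hat = adaboost_coefficient(0.2)    # about 0.69 for weighted error 0.2
    correct_multiplier = np.exp(-w_hat)  # about 0.5: halve weights of correct points
    mistake_multiplier = np.exp(w_hat)   # about 2.0: double weights of mistakes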

[MUSIC] Next, we're going to take a few minutes to talk about the famous AdaBoost theorem and what its practical implications are, which are extremely interesting. So, if you remember, the way that boosting came about is that Kearns and Valiant had this famous question: can you combine weak classifiers to make a stronger one? Actually, Valiant is a Turing Award winner, so a really famous professor from Harvard. And Schapire, a couple of years later, came up with the idea of boosting, which really changed the machine learning field. And so if you look at the iterations of boosting on the x axis here, and the training error for our dataset on the y axis, we observe a very practical effect that we see with boosting on a lot of datasets. We're going to start with a really bad training error: the first decision stump has a training error of 22.5%, so not good at all. After thirty iterations, as we saw a little earlier, we're going to get all the data points right in our training data, and I showed you that crazy decision boundary for this. So we see a smooth transition where the classification error tends to go down, tends to go down, and the value goes to zero and actually stays at zero. And that's a key insight of the boosting theorem.
So here is the famous AdaBoost theorem, which underlies all the choices we made in the algorithm and has really had a lot of impact on machine learning. Basically, the theorem says: under some technical conditions, and all theorems are like this, it's like when they say some restrictions apply, see store for details. So, under some technical conditions, the training error of the classifier goes to zero as the number of iterations, the number of models considered, capital T, goes to infinity. So in other words, pictorially, we expect the training error, which is this thing here on the y axis, to eventually go to zero. Now, that's eventually. It might oscillate a little bit in the middle; however, what the theorem says is that it tends to generally decrease, eventually become zero, usually, and then stick at zero. So we're going to see it high in the beginning, wiggle, wiggle, wiggle, but tending to go down, and then hit a certain value, usually zero, and stick at zero. Now, let me just take a minute to talk about those technical conditions, the "see store for details" part.
It turns out the technical condition is something quite important. It says that at every iteration t, we can find a weak learner, so a decision stump, that has weighted error at least a little bit lower than 0.5. So it is at least better than random; that's what the theorem requires. It seems like, intuitively, it should always be possible to find a classifier that's better than random, even on the weighted data. But it turns out that this is not always possible. So here is a counter example, which is a really extreme counter example, but there are other examples where it's not possible. For example, if you have a negative point here, and you have a positive point on top of it, there's never going to be a decision stump that can separate the positive point on top of the negative point. So the conditions of the AdaBoost theorem might not be satisfied. However, nevertheless, boosting usually takes your training error to zero, or somewhere quite low, as the number of iterations goes to infinity. So we do observe that decrease in practice, although because of those technical conditions in the theorem it might not go exactly to zero, but it can get really, really low. [MUSIC]
[MUSIC] So, let's compare boosted decision stumps to a decision tree. This is the decision tree plot that we saw in the decision tree module, on a real dataset based on loan applications. We see that the training error, as we make the decision tree deeper, tends to go down, down and down, as we saw. But the test error, which is kind of related to the true error, goes down for a while, and you have here the best depth, which is maybe seven, but eventually it goes back up. So let's say at depth 18 over here, the training error is down to 8%, so it's a really low training error, but the test error has gone up to 39%, and we observe overfitting. In fact, we observe a huge gap here. One thing to note, by the way, is that the best decision tree has classification error on the test set of about 34% or so. Now, let's see what happens with the decision stumps that we boosted on the same dataset. You get a plot that looks kind of like this: you see that the training error keeps decreasing per iteration. So, this is the training error, and as the theorem predicts, it decreases. And the test error in this case is actually going down with iterations, so we're not observing overfitting, at least yet. And after 18 rounds of boosting, so 18 decision stumps, we have 32% test error, which is better than even the best decision tree, and the overfitting is much less. So this gap here is much smaller, and the gap between your training error and the test error is kind of related to that overfitting quantity that we have. Now, we said we're not observing overfitting yet. So let's run boosted stumps for more iterations and see what happens.
 So let's see what happens when we keep on boosting and adding more decision stumps. On the x-axis, this is adding more and more decision stumps, more trees instead of only one tree, and we watch what happens to the classification error. We see here that the training error keeps decreasing, just like we expected from the theorem, but the test error has stabilized. So the test performance tends to stay constant, and in fact it stays constant for many iterations. So if I pick T anywhere in this range, we will do about the same; any of these values for T would be fine. So now we've seen how boosting seems to be stabilizing, but the question is: do we observe overfitting in boosting?
 But as we know, we need to be very careful about how we pick the parameters of our algorithm. So how do we pick capital T, which is when we stop boosting? Do we use five decision stumps, or 5,000 decision stumps? This is just like selecting the magic parameters of all sorts of algorithms: almost every model out there has a parameter that trades off complexity against the quality of the fit, such as the depth of a decision tree, the number of features, or the magnitude of weights in logistic regression, and here it's the number of rounds of boosting, just like lambda in regularization. We can't use the training data, because training error tends to go down with iterations of boosting, so it would say that T should be infinite, or at least really big. And we should never, never, never, ever use the test data, so that's bad; I was just showing you an illustrative example, but you should never do that. So, what should we do? Well, you should either use a validation set if you have a lot of data: if you have a big dataset, you select a subpart of it just to pick the magic parameters. And if your dataset is not that large, you should use cross-validation. In the regression course, we talked about how to use validation sets and how to use cross-validation to pick magic parameters like lambda, the depth of a decision tree, or the number of rounds of boosting, capital T. [MUSIC]
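As a hedged sketch of that model selection step, here is one way to pick the number of boosting rounds T with a validation set, assuming scikit-learn is available (its AdaBoostClassifier uses depth-1 trees, i.e. decision stumps, by default); the data X, y and the candidate values of T are made up for illustration.

    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

    best_T, best_error = None, float('inf')
    for T in [5, 10, 50, 100, 500]:                     # candidate numbers of boosting rounds
        model = AdaBoostClassifier(n_estimators=T).fit(X_train, y_train)
        error = 1 - accuracy_score(y_valid, model.predict(X_valid))
        if error < best_error:
            best_T, best_error = T, error
    # retrain with best_T on the full training data; the test set is never touched here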

 [MUSIC] I'd like to take a moment now to summarize what we've seen about boosting and what impact it's had in the real world. Let's talk about AdaBoost. AdaBoost is one of the early types of boosting algorithms, extremely useful, but there are other algorithms out there. In particular, there's one called gradient boosting, which is slightly more complicated but extremely similar. It's like AdaBoost, but it can be used not just for basic classification but for other types of loss functions and other types of problems, and it's what most people use. You can think of gradient boosting as a kind of generalization of AdaBoost. Then there are other related ways to learn ensembles; the most popular one is called random forests. Random forests are a lot like boosting in the sense that they try to learn an ensemble of classifiers, in this case decision trees, but it could be other types of classifiers. And instead of using boosting, they use an approach called bagging. Just very briefly, what you do with bagging is, you take your dataset and you sample different subsets of the data, which is kind of like learning on different sub-datasets, and you learn a decision tree on each one of them. You just average the outputs. So you're not optimizing the coefficients like we did, and you're learning from different subsets of data. It's easier to parallelize, but it tends not to perform as well as boosting for a fixed number of trees. So for 100 trees, or 100 decision stumps,
boosting tends to perform better than random forests. Now let's take a moment to discuss the impact boosting has had in the machine learning world. And hint, hint, it's been huge. It's amongst the most useful machine learning approaches out there, and it's useful in a wide range of fields. For example, in computer vision, a lot of the default algorithms use boosting, like the face detection algorithms where you take your camera, point it at something, and it tries to detect your face; those often use boosting and are very useful. If you look at machine learning competitions, which have become very popular in the last two or three years at places like Kaggle or the KDD Cup, most winners, so more than half, I think it's about 70% of winners, actually use boosting to win their competition. In fact, they use boosted trees, and this covers a wide range of tasks like malware detection, fraud detection, ranking web searches, and even interesting physics tasks like detecting the Higgs boson. All those problems and all those challenges have been won by boosted decision trees. And this is perhaps one of the most deployed advanced machine learning methods out there, particularly the notion of ensembles. So for example, think about Netflix, which is an online place where you can watch movies. This kind of company recommends what movie you might want to watch next. That system uses an ensemble, actually an ensemble of classifiers. But more interestingly, they had a competition a few years ago where people tried to provide better recommendations, and the winner was one that created an ensemble of many, many, many classifiers in order to create better recommendations. So ensembles, you'll see them everywhere. Sometimes they're optimized with boosting, sometimes with different techniques like bagging. And sometimes people just hand-tune the weights, saying okay, I'm going to give weight one to this, a half to that. I don't recommend that last approach; I recommend boosting as the
one to use. Great, so in this module we've explored the notion of an ensemble classifier, and we formalized ensembles as the weighted combination of the votes of different classifiers. We discussed the general boosting algorithm, where the next classifier focuses on the mistakes we've made so far, as well as AdaBoost, which is a special case for classification, where we showed you how to come up with the coefficients of each classifier and the weights on the data. We discussed how to implement boosted decision stumps, which is extremely easy to do. And then we talked a little bit about the convergence property of how AdaBoost's training error tends to go to 0, but that you have to be concerned a little bit about overfitting, although boosting tends to be robust to overfitting in practice. [MUSIC]
WEEK
6
2 hours to complete
Precision-Recall
In many real-world settings, accuracy or error are not the best quality metrics for
classification. You will explore a case-study that significantly highlights this issue:
using sentiment analysis to display positive reviews on a restaurant website. Instead
of accuracy, you will define two metrics: precision and recall, which are widely used in
real-world applications to measure the quality of classifiers. You will explore how the
probabilities output by your classifier can be used to trade-off precision with recall,
and dive into this spectrum, using precision-recall curves. In your hands-on
implementation, you will compute these metrics with your learned classifier on real-
world sentiment analysis data.

8 videos (Total 31 min), 2 readings, 2 quizzes


SEE LESS
[MUSIC] Throughout this course we have evaluated classifiers in one key way. We measured
error or the accuracy of that classifier. But it turns out that for many real-world applications, error or accuracy is not a great measure for understanding whether a classifier is doing the right thing for you. And in this module, we're going to talk about precision and recall, which is a really cool, very
simple way to evaluate classifiers that captures something that's needed for a wide range of
applications. And we'll use a cool, fun application as a kind of running example throughout the
module. So here's the idea. Let's say I have a restaurant and I have a goal. I want to increase the
number of guests, the number of people, coming to my restaurant by 30%. And I say, I'm going to
do a cool advertising campaign to do that. But nobody wants to just get those ads in the mail or
spam email as their advertising campaign. So I want to be innovative, I want to be authentic about
my advertising campaign. And the way that I want to be authentic is to use the voice of my customers to talk about how great the restaurant is. So I'm going to look at customer reviews, and find great things in there, great nuggets, to be able to tell everyone about how great my
restaurant is. So I want to find great quotes, key positive sentences that describe amazing things
about my restaurant. And may even find some spokespeople that are really eloquent, they explain
really well what they love about my restaurant. So that's my goal.
So I only care about the positive ones, and I'm going to do my best to take those positive sentences and show them in a way that people really feel, man, my restaurant is awesome, I'm going to go there for sushi. So how do I find those positive sentences? I'm going to use a sentiment classifier. How do I know the sentiment classifier is really good, that I can trust it, that I can put those
sentences on my website without having to check every time a sentence goes up? This is the key
point. We are talking about automating machine learning. You have to really trust the machine
learning model. So if I give you a particular, say, accuracy, is that enough trust for me to just
automatically feed reviews into something that shows up in my website? [MUSIC]
 I'm desperately trying to put into my website and so this is also bad. And so now let's step back
and think about the task that we have, this automated marketing campaign task. And ask what is
good performance for me? Is it accuracy, is it 90%? It's not, it's about two things. First, if I show
something on my website, it better be good, it better be positive. Man, if I show a negative review
on my website, it's a double whammy. So, first people are going to my website, they're reading
about me and they're reading bad things there that people are saying. Nobody's going to want to
come to my website. So what I want to make sure is that I'm very precise, whenever I show
something, it's good, so that means high precision. The other thing that I have to worry about is
finding all those positive reviews. Maybe my restaurant is not that good, and those positive
reviews are not that common. So I want to make sure I find all of the positive reviews, so I can
have a chance of showing all the positive reviews on my website. And that's called recall. So
precision is how precise I am at showing good stuff on my website, and recall is how good I am at finding all the positive reviews amongst all the reviews, all the sentences out there. So I want to be good on those two metrics, not the single accuracy number that we talked about
before. Precision-recall is generally an extremely important type of metric for evaluating
classifiers. People use it in practice all the time. We're going to discuss them in quite a bit of detail
today. And it's something that you should be extremely familiar with if you're starting to use
machine learning practice. [MUSIC]
[MUSIC] We're going to start by defining this idea of precision a little bit more formally. So precision is: of the sentences I showed on my website, which fraction were actually positive? In general, it's the fraction of positive predictions that are actually positive. So let's say that my algorithm predicted that six sentences were positive, so it predicted y hat = +1 for those six sentences. But in reality, only four out of those six were truly positive. So we got four truly positive ones and two false positives in the mix, so its precision was 4 out of 6. So in general, we have a set of data points we're calling
positive, the ones we're predicting to be positive, y hat = +1. Some of them are truly positive, where yi is +1, but some of them were actually not positive, so yi was actually -1. And the question is, how big a fraction of those are actually truly positive? So here is where we can review the notion of true positives and true negatives. We can look at this table where, in the rows, we have the true label, yi, while in the columns we have the predicted label, y hat. And so if the truth is positive and the prediction is positive, we call that a true positive: it was positive and we predicted positive. If the true label is -1 and the prediction is -1, we call that a true negative: it was negative, and we predicted negative, -1. So both of those are correct. But there are two types of mistakes you can make. The first type are called false negatives: it was truly a positive review, but we predicted it to be negative, so yi was +1 and y hat was -1. And finally, a false positive is one where the true label was -1, but the prediction was +1. So the truth was negative, but we predicted it was going to be positive, so it's a false positive.

I find it very helpful to ground these ideas of false positive and false
negative in the context of an example, to really feel it and really understand what the impact of
those mistakes can be. So let's look at this matrix here again. If you look at the top left, we have a
truly positive sentence, so it was a plus 1 sentence. And we got it right, we had a + 1 prediction.
So that's no mistake, that's great. Similarly for the bottom right, we didn't make a mistake. We had
a -1 sentence, so a negative sentence, and we made a negative prediction. Now the problematic cases are the other two quadrants. So let's look first at the top one. What happened here was that I had a positive sentence, but made a -1 prediction. So what does this actually mean? There was a positive sentence out in the world, but I labeled it as negative, so I missed a sentence I could have shown on my website. Maybe this is not too bad, you know, there might be other positive things being said, so maybe missing one is not that bad, but it's still a problem. But let's look at the other quadrant here. The other quadrant is when we have a -1 sentence but I made a +1 prediction. In other words, it was a negative sentence in the world, in a review, and I thought it was positive. So that means I showed a
bad thing. I showed a bad review on my website. So this is quite problematic. I showed a bad
review on the website, maybe said the sushi sucked, everybody read it, nobody comes to my
restaurant anymore. Big, big, big, big trouble.
 So we have the truly positive data points, and for some of them y hat is also +1, but there's a subset which we don't capture, where we predict y hat is -1, so y hat does not agree with y. And that's the part that we missed. So recall is the fraction of the ones that we actually get; we want everything to be in the blue box here. More formally, we can define recall as the fraction of true positives, the data points that were positive and that we got right, divided by the true positives plus the false negatives, the data points that were truly positive but that we falsely labeled as negative. And so,
this is going to have value one if the false negatives are zero. Which means we captured
everything, we captured all the true positives. And zero if we did not capture any true positives so
all the positive data went to the false negative bin. So if we go back to the example that we have
been looking at, I want to show positive sentences on my website. I've got four of them, in y hat i
equals +1 but I missed out on two sentences. So for example, I missed out on the sentences that
said my wife tried the ramen and it was delicious, and so maybe somebody's interested in ramen
they don't see that sentence, they don't go to my website, and so I missed out on something really
good. So high recall means that you discover everything positive that's being said about the
restaurant or all of the positive data points. [MUSIC]
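As a small sketch of these two definitions, here is how precision and recall could be computed from arrays of true labels and predictions; the NumPy arrays are assumed to hold +1/-1 labels, matching the running example.

    import numpy as np

    def precision_recall(y_true, y_pred):
        true_positives  = np.sum((y_pred == +1) & (y_true == +1))
        false_positives = np.sum((y_pred == +1) & (y_true == -1))
        false_negatives = np.sum((y_pred == -1) & (y_true == +1))
        precision = true_positives / (true_positives + false_positives)
        recall = true_positives / (true_positives + false_negatives)
        return precision, recall

    # in the running example: 6 predicted positive, 4 of them truly positive,
    # and 2 truly positive sentences missed, so precision = 4/6 and recall = 4/6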
[MUSIC] When there's a trade off between precision and recall, it's important for us to look at the
two extremes. What does it mean to have a classifier that's extremely precise? And what does it
mean to have a classifier that's extremely high recall? And how the two can go against each other
sometimes. First, let's think about what I call an optimistic classifier. You might know some of
these optimists in your life. They think everything is good. How's it going? Good. Even if bad stuff
is happening, they say good. Those folks say that all possible experiences are good, so they're
optimists. That means that pretty much every input, every sentence, is labeled as positive, very
few get labeled as negative. It's extremely likely that all the truly positive data points get labeled as
good. What does that mean? That means that I have perfect recall, because I recall all those
positive data points. Good. But I might not get perfect precision, because I put a bunch of negatives into that bin. How can we address that? We can have the pessimistic classifier. You might have some friends like that, where you try really hard, you do everything for them, you go out of your way, and everything sucks. Every single experience that you have is really bad. There are very, very, very few things that they say are good, and when they do, those things are very likely to be good, but everything else they say is bad, so everything else in the world gets y hat equals -1. Being a pessimist means that you're going to miss out on many good things in life. The pessimists have high precision, because the few things they say are good tend to be good, but very, very low recall; they miss out on many great things in life. It turns out there is a spectrum between a high
precision low recall model and low precision high recall model, the pessimist and the optimist.
What we'd like to do is somehow balance between the two perspectives in the world to find
something that's just right for us. So, balance between a pessimistic model and the optimistic
model. In particular, we want to find as many positive reviews or sentences as possible, as many
of those as possible, with as few incorrect predictions as we can. So, that's the balance we're
trying to strike in the case of our restaurant. [MUSIC]
[MUSIC] Thus far we've talked about precision, recall, optimism, pessimism. All sorts of different
aspects. But one of the most surprising things about this whole story is that it's quite easy to
navigate from a low precision model to a high precision model from a high recall model to a low
recall model, so kind of investigate that spectrum. So we can have a low precision, high recall
model that's very optimistic, you can have a high precision, low recall model that's very
pessimistic, and then it turns out that it's easy to find a path in between. And the question is, how
do we do that? 
 If you recall from earlier in this course, we assign not just a label, +1 or -1, to every data point, but a probability, let's say 0.99 of being positive for "the sushi and everything else were awesome", or say 0.55 of being positive for "the sushi was good, the service was okay". I mentioned earlier in the course that these probabilities were going to be fundamentally useful.
Now you're going to see a place where they are amazingly useful. So the probabilities can be
used to tradeoff precision with recall. And so let's figure that out. So earlier in the course, we just
had a fixed threshold to decide if an input sentence, x-i, was going to be positive or negative. We
said, it's going to be positive if the probability is greater than 0.5, and is going to be negative if the
probability is less than 0.5, or less than or equal to it. Now, how can we create an optimistic and
pessimistic model by just changing the 0.5 threshold? Let's explore that idea. Think about what
would happen if we set the threshold, instead of being 0.5 to being 0.999. So a data point is only
+1 if its probability is greater than 0.999. Well, here's what happens. Very few data points would
satisfy this. So if very few data points satisfy this, then very few data points will be +1. And the
vast majority will be labeled as -1. And we call this classifier the pessimistic classifier. 
Now alternatively, if we change the threshold to be 0.001, then that means that any experience is
going to be labeled as positive. So, almost all of the data points are going to satisfy this condition. So
we're going to say that everything is +1, and so this is going to be the optimistic classifier. It's going to say, yeah, everything is +1, everything's good. So by varying that threshold from 0.5 to something
close to 0 to something close to 1, we're going to change between optimism and pessimism. So if you
go back to this picture of logistic regression, for example, as a concrete case. We have this input, the score of x. And the output here was the probability that y is equal to +1, given x and w. This should bring back some memories, maybe some sad, sad memories. The threshold here is going to be a cut where
we say, set y hat to be equal to +1 if it's greater than or equal to this threshold t. So everything above
the line will be +1 and everything below the line will be labeled -1.
Or concretely, let's see what happens if we set the threshold to be some very, very high number.
So, t here is close to one. So if t is some number close to one, then everything below that line will
be labeled as -1. And very, very few things there above the line here can be labeled as +1. So,
that's why we end up with a pessimistic classifier, on the flip side if we set the t threshold to be
something very, very small, so this is small t then everything's going to be above the line. So
everything's going to be labeled as +1, and very few data points are going to be labeled as -1. So
we'll end up with the optimistic classifier. So ranging t from 0 to 1 takes us from optimism to pessimism. In other words, that spectrum that we said we wanted to navigate can now be navigated with a single parameter, t, that goes between 0 and 1. [MUSIC]
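Here is a minimal sketch of turning predicted probabilities into labels with that threshold t; the probability values are made up for illustration.

    import numpy as np

    def classify_with_threshold(probabilities, t):
        # predict +1 when P(y = +1 | x) >= t, otherwise -1
        return np.where(probabilities >= t, +1, -1)

    probabilities = np.array([0.99, 0.55, 0.30, 0.01])
    classify_with_threshold(probabilities, 0.999)  # pessimistic: almost everything labeled -1
    classify_with_threshold(probabilities, 0.001)  # optimistic: almost everything labeled +1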

[MUSIC] We saw how we could change the threshold from zero to one for deciding what's a
positive value to navigate between the optimistic classifier and the pessimistic classifier. There's
an actually really intuitive visualization that does this. It's called a precision-recall curve. Precision-
recall curves are extremely useful to understanding how a classifier is performing. So in this case,
you can imagine setting two points on that curve. What happens to the precision when the threshold is very close to one? Well, the precision is going to be one, because we're only calling positive the very, very few things we're very sure about, and those are going to be correct. But the recall is going to be zero, because you're going to say everything else is bad, so that's the pessimistic extreme. On the other extreme of the precision-recall curve, the point at the bottom there is the optimistic point, where you have very high recall, because you're going to find all the positive data points, but very low precision, because you're going to find all sorts of other stuff and say that's still good. And that happens when t is very small, close to zero. Now if you keep varying t, you have a spectrum of tradeoffs between precision and recall.
So if you want a model that has a little bit more recall but still highly precise, maybe you set t =
0.8, but if you really want really, really high recall, but trying to improve precision a little bit, maybe
set t to 0.2. And you can navigate that spectrum to explore the tradeoff between precision and
recall. Now there doesn't always have to be a tradeoff, if you have a really perfect classifier, you
might have a curve that looks like this. This is kind of the world's ideal where you have perfect
precision no matter what your recall level. This line basically never happens. But that's kind of the
ideal. That's where you're trying to get to, is that kind of flat line at the top. So the more your
algorithm is closer to the flat line at the top, the better it is. And so precision-recall curves can be
used to compare algorithms in addition to understanding one.
 So for example, let's say you have two classifiers, classifier A and classifier B. And you see that
for every single point, classifier B is higher than classifier A. In that case we always prefer
classifier B. No matter what the threshold is, classifier B always gives you a better precision for
the same recall, better precision for same recall. So B is always better. However, life is not always
this simple. 
 If there's one thing you should have learned thus far, it's that life in practice tends to be a bit messy. And so, often what you observe is not classifiers A and B like we saw, but classifiers A and C like we're seeing over here, where there might be one or more crossover points, where classifier A does better in some regions of the precision-recall curve and classifier C does better in others. So for example, if you're interested in very high precision but okay with lower recall, then you should pick classifier C, because it does better in that region; it's higher up, closer to that flat line. But if you care about getting high recall, then you should choose classifier A, because in the high recall regime, when you pick thresholds t that are smaller, classifier A tends to do better. So you see, its curve is higher over here. So that's the
kind of complexity of dealing with machine learning in the real world. Now if you just had to pick
one classifier, the question is how do you decide? How do you choose between A and C in this
case? 

 And often the single number you use to decide, as I was hinting at, depends on where you want to be on the precision-recall tradeoff curve. There are many metrics out there that try to give you a single number; some are called F1 measures, some area-under-the-curve. For a lot of applications, I'm less fond of those measures myself than I am of one that's much simpler, which is called precision at k. Let me talk about that, because it's a really simple measure, really useful. So,
let's say that there's five slots on my website to show sentences. That's all I care about. I want to
show five great sentences on my website. I don't have room for ten million, for five million, just for
five. And I show five sentences there. Four were great and one sucked. I want all five to be great.
So I want my precision for the five top sentences, for the top five, to be as good as possible. In
this case, our precision was four out of five, 0.8. So I ended up putting a sentence in there that
said, my wife tried the ramen and it was pretty forgettable. That's kind of a disappointing thing to
put in. So for many applications, like recommender systems for example, where you go to a web
page and somebody is showing you some products you might want to buy, precision at k is a really
good metric to be thinking about. [MUSIC]
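Here is a hedged sketch of precision at k: rank the sentences by the classifier's probability of being positive and check how many of the top k are truly positive; the inputs are illustrative NumPy arrays.

    import numpy as np

    def precision_at_k(y_true, probabilities, k):
        # indices of the k sentences the classifier is most confident are positive
        top_k = np.argsort(-probabilities)[:k]
        # fraction of those top k that are actually positive
        return np.mean(y_true[top_k] == +1)

    # e.g. with five website slots: precision_at_k(y_true, probabilities, 5)
    # returns 0.8 when 4 of the 5 top-ranked sentences are truly positive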
[MUSIC] In this module, we discussed a very important fundamental concept, which is evaluating classifiers. In particular, we talked about precision and recall, which is a concept that's widely used, way beyond some of the classifiers we talked about here, in basically any classification problem you're going to see in industry. We saw that straight accuracy or error metrics may not be the right things for your application, and that you need to look at something else; precision and recall are among the first things you might want to look at. Precision captures the fraction of your positive predictions that are actually positive, and recall captures, of all the positive sentences out there, which ones did you find, which ones did you label as positive. We talked about the trade-off between precision and recall, and how you can navigate that trade-off with the threshold parameter t on the probability, and really get these beautiful precision-recall curves. And finally, we talked about comparing models with the precision at k metric, which is one that I particularly like for a lot of applications.
WEEK
7
2 hours to complete
Scaling to Huge Datasets & Online Learning
With the advent of the internet, the growth of social media, and the embedding of
sensors in the world, the magnitudes of data that our machine learning algorithms
must handle have grown tremendously over the last decade. This effect is sometimes
called "Big Data". Thus, our learning algorithms must scale to bigger and bigger
datasets. In this module, you will develop a small modification of gradient ascent
called stochastic gradient, which provides significant speedups in the running time of
our algorithms. This simple change can drastically improve scaling, but makes the
algorithm less stable and harder to use in practice. In this module, you will investigate
the practical techniques needed to make stochastic gradient viable, and thus to
obtain learning algorithms that scale to huge datasets. You will also address a new
kind of machine learning problem, online learning, where the data streams in over
time, and we must learn the coefficients as the data arrives. This task can also be
solved with stochastic gradient. You will implement your very own stochastic gradient
ascent algorithm for logistic regression from scratch, and evaluate it on sentiment
analysis data.

16 videos (Total 52 min), 2 readings, 2 quizzes


SEE LESS
[MUSIC] In this module we're going to address a really important problem with machine learning
today. How to scale the algorithms we discussed to really large data sets. The ideas we discuss
today are going to be broadly applicable. They're going to be applicable for all of classification.
They're also going to be applicable even for the regression course, or the second course of the
specialization. We're going to talk about a technique called stochastic gradient, and how it relates to something called online learning.
This modification is called stochastic gradient, and it does something extremely simple. It takes your massive dataset and your current parameters, w(t), and when computing the gradient, instead of looking at all the data, because you don't need to look at everything, it just looks at a small subset of the data. So it just looks at a bit of data and then updates the coefficients to w(t+1). Then it looks at a little bit more data and updates the coefficients to w(t+2). Then it looks at a little bit more data and updates to w(t+3). And then it looks at a little bit more data and updates the coefficients to w(t+4). And so instead of making a massive pass over the dataset before making a coefficient update, here we're just looking at little bits of data and updating coefficients in an interleaved fashion. This small change is going to really change
everything for us, and allow us to scale to much bigger datasets, as we'll see today. [MUSIC]
[MUSIC] What we're going to do about it is introduce an algorithm called stochastic gradient
instead of the standard gradient. So the standard gradient says, update the parameter by
computing the contribution of every data point and sum it up. Stochastic gradient in its simple
possible way says: you don't have to use every single data point, just use one data point. That's not the exact gradient; it's kind of an approximation to the gradient, as we will see. And then, instead of always using the same single data point, every time we do an update we'll use a different data point. So we're just going to use one data point to compute the gradient, but we use a different one every time. And this simple change, as we will see, is extremely powerful.

Let's now dig in and understand that change we hinted at where you use a little bit of data to
compute the gradient instead of using an entire data set. And in fact we're just going to use one
data point to compute the gradient instead of everything. So this is the gradient ascent algorithm
for logistic regression, the one that we've seen earlier in the course. And I'm showing the gradient explicitly over here. Now, it requires a sum over data points, which is the thing that we're trying to avoid. We're not going to do a sum over data points at every update, at every iteration. So let's
throw out that sum. But each time we're going to pick a different data point. So, we're going to
introduce an outer loop here where we loop over the data points, 1 through N and then we
compute a gradient with respect to that data point and then we update the parameters. And we do
that one at a time.
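Here is a minimal sketch of that loop for logistic regression, one data point per coefficient update; the feature matrix X, label vector y (+1/-1), and step size eta are assumed inputs, and this is a simplified version of what the assignment builds.

    import numpy as np

    def stochastic_gradient_ascent(X, y, eta=0.1, n_passes=10):
        w = np.zeros(X.shape[1])                 # initialize coefficients to zero
        for _ in range(n_passes):
            for i in range(X.shape[0]):          # one data point at a time
                prob = 1.0 / (1.0 + np.exp(-X[i].dot(w)))   # P(y=+1 | x_i, w)
                indicator = 1.0 if y[i] == +1 else 0.0
                # gradient contribution of data point i only, no sum over the dataset
                w += eta * X[i] * (indicator - prob)
        return w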
[MUSIC] Let's take a moment to compare gradient to stochastic gradient to really understand how
the two approaches relate to each other. And we see that this very, very small change in your
algorithm and your implementation is going to make a big difference. So we're going to build out
the table here, comparing the two approaches, gradient in blue, stochastic gradient in green. And
we saw already that gradient is slow at computing an update for large data, while stochastic
gradient is fast. It doesn't depend on the dataset size, it's always fast. But the question is, what's
the total time? Each iteration is cheaper for stochastic gradient, but is it cheaper overall? And
there's two answers to this question. In theory, stochastic gradient for large datasets is always
faster, always. In practice, it's a little bit more nuanced, but it's often faster, so it's often a good
thing to do. However, stochastic gradient has a significant problem, it's much more sensitive to the
choice of parameters like the choice of step size, and it has lots of practical problems. And so a lot
of the focus of today's module is to talk about those practical challenges to stochastic gradient and
how to overcome them. And how to get the most benefit of the small change to your algorithm. 
 We'll see a lot of pictures like this, so I'm going to take a few minutes to explain it. This picture
compares gradient to stochastic gradient. So the red line here is the behavior of gradient as you iterate through the data, so as you make passes over the data. And on the y axis we see the data likelihood, so higher is better, we fit the data better. The blue line here is stochastic gradient.
And to be able to compare the two approaches, on the x axis I'm not showing exactly running
time, but I'm showing you how many data points you need to touch. So stochastic gradient is
going to make an update every time it sees a data point. Gradient is going to make an update every time it makes a full pass over the data. And so, we're showing how many passes we're making over
the data. So the full x axis here is ten passes over the data, and you see that after ten passes
over the data, gradient is going to a likelihood that's much lower than that of stochastic gradient.
Stochastic gradient goes to a point that's much higher. And even if you look at it over longer time scales, you'll see that stochastic gradient is going to converge faster. However, it doesn't converge smoothly to the optimal solution; it oscillates around the optimal solution, and we will understand today why that happens. That oscillation is one of the challenges introduced by stochastic gradient. So now I've extended it from 10 passes over the data to 100 passes over the data, and now we see that gradient is getting to solutions very close to those of stochastic gradient. But again,
you see a lot of noise and a lot of oscillation from stochastic gradient. Sometimes it's good,
sometimes it's bad, sometimes it's good, sometimes it's bad. So that's the challenge there. So
here's a summary of what we've learned. We make a tiny change to our algorithm: instead of using the whole dataset to compute the gradient, we use a single data point, and we call that stochastic gradient. It's going to get better quality faster. However, it's going to be tricky, there's going to be some
oscillations, and you have to learn some of the practical issues that you need to address in order
to make this really effective. But this change is going to allow you to scale to billions of data
points. Even on your desktop you'll be able to deal with a ton of data, which is really super
exciting. [MUSIC]
[MUSIC] So we've made a small change to the gradient algorithm, but its impact is big: we now look at the data points one at a time. Why should this even work? Why is this a good idea at all? Let's spend a little bit of time getting intuition for why it works. And this intuition is going to help us understand the behavior of the algorithm in practice. In this picture, I'm showing you the landscape, the contour plot, for our sentiment analysis problem, which is the data that we'll be
using throughout our module today. But just for a subset of the data where we're only looking at
two possible features, coefficient for awful and the coefficient for awesome. And if we were to start
here at this point, this would be the exact gradient that we would compute. And that exact gradient gives you the best possible improvement going from wt to wt plus one. That's the best thing you could possibly do. Now, there are many other directions that also improve the value, improve the likelihood, the quality. So for example, if we were to take this other direction over here and we reached some parameter w prime, that's still okay, because we're still going uphill, we're increasing the likelihood. In other words, the likelihood of w prime is going to be greater than the likelihood of wt. So in fact, any direction that takes us uphill is good; the gradient is just the best direction.
The gradient direction is the sum of the contributions for each one of the data points. So, if I look
at that gradient direction, it is the sum over my data points of the contributions from each one of these xi's. So each one of these red lines here is the contribution from a data point, from each xi, yi, which is this part of the equation here. So interestingly, most contributions point
upwards. So, all of these over here are pointing in an upward direction. So if I were to pick any of
those, I would make some progress. If I picked any of the other ones, like the ones back here, this
would not make progress, this would be bad directions. But on average, most of them are good
directions. And this is why stochastic gradient works. In stochastic gradient, we're just going to
pick one of these directions. And most of them are okay. So most of the time, we're going to make
progress. Sometimes when we take a bad direction, we won't make progress. We'll make negative
progress. But on average, we're going to be making positive progress. And so, that's exactly why
stochastic gradient works. 
So if you think about the stochastic gradient algorithm, we're going one data point at a time, picking the direction associated with that data point and taking a little step. So, at the first iteration we might pick this data point and make some progress, and at the second one we might pick this one here and make negative progress. But in the third one, I pick another one that makes
positive progress and I pick a fourth one that makes positive progress. And I pick a fifth one that
doesn't and I pick a sixth one that does and so on. And so, most of the time we're making positive
progress and overall the likelihood is improving. [MUSIC]
[MUSIC] Next, let's visualize the path the gradient takes, as opposed to stochastic gradient, what I
call the convergence paths. And as you will see, the stochastic gradient oscillates a bit more, but
gets you close to that optimal solution. So just before in the black line, I'm showing the path of
gradient, and you see that that path is very smooth and it does very nicely. In the red line, I show
you the path of stochastic gradient. You see that this is a noisier path. It does get us to the right
solution, but one thing to note is that it doesn't converge and stop like gradient does, it
oscillates around the optimal. And this is going to be one of the practical issues that we're going to
address when we talk about how to get stochastic gradient to work in practice but it's a significant
issue.
Another view of stochastic gradient oscillating around the optimum can be seen in this plot, the one we've been using for quite a while. You see that gradient is smoothly making progress, but stochastic gradient is this noisy curve as it makes progress, and as it converges, it oscillates around the optimum.

Let's summarize. Gradient ascent looks for the direction of greatest improvement, the steepest ascent direction, and does that by summing over all the data points. Stochastic gradient, on the other hand, tries to find some direction which usually makes progress, for example by picking one data point to estimate the gradient. On average it makes a ton of progress, and because of that it tends to converge much faster, but it's noisier around the optimum. So, even in the simple example we've been using today, it's over a hundred times faster than gradient to converge. But it gets noisy in the end.
[MUSIC] You've heard me hint about this a lot today, and the practical issues of stochastic
gradient are pretty significant. So for most of the remainder of the module, we'll talk about how to
address some of those practical issues. Let's take a moment to review the stochastic gradient
algorithm. We initialize our parameters, our coefficients to some value, let's say all 0. And then
until convergence, we go 1 data point at a time, and 1 feature at a time. And we update the
coefficient of that feature by just computing the gradient at that single data point. So I only need the contribution of one data point at a time; we're scanning through the data, updating the parameters one data point at a time. For example, I see my first data point here: 0 awesome, 2 awful, sentiment -1. I make an update which pushes me towards predicting -1 for this data point. Now if the next
data point is also negative, I'm going to make another kind of negative push. If the third one is
negative, make another negative push, and so on. And so, one worry, one bad thing that can
happen with stochastic gradient is if your data is implicitly sorted in a certain way. So, for example,
all the negative data points come before all the positives. That can introduce some bad behaviors in the algorithm, bad practical performance. And so we worry about this a lot; it's really significant.
So, because of that, before you start running stochastic gradient you should always shuffle the
rows of the data, mix them up, so that you don't have these long regions of, say, negatives before
positives, or young people before older people. Or people who live in one country versus people
who live in another country. You want to mix it all up. So what that means, from the context of the
stochastic gradient algorithm that we just saw, is just adding a line at the beginning where you
shuffle the data. So before doing anything, you should start by shuffling the data.
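A minimal sketch of that shuffling step, assuming the data is held in NumPy arrays X and y:

    import numpy as np

    def shuffle_data(X, y, seed=0):
        # permute the rows so sorted labels don't arrive in long runs
        permutation = np.random.RandomState(seed).permutation(len(y))
        return X[permutation], y[permutation]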
[MUSIC] The second question on stochastic gradient is how do you pick the step size eta. And
this is a significant issue, just like it is with gradient. For both of them, it's kind of annoying, it's a pain to figure out how to pick that coefficient eta. But it turns out that because of the oscillations of stochastic gradient, picking eta can be even more annoying, much more annoying. So, if we go back to the dataset we've been using, I've shown you this blue curve many times. This was the best eta that I could find, the best step size. Now, if I were to pick smaller step sizes, smaller etas, it will behave kind of like regular gradient: it will be slower to converge and you'll see fewer oscillations, but it will eventually get there, just much more slowly. So, we worry about that a bit. On the other hand, if instead of using the best step size we try to use a larger step size, because we think we could make more progress more quickly, we'll see these crazy oscillations, and the oscillations are much worse than what you observe with gradient. So, you have to be really careful: if you pick eta too large, things can behave extremely erratically.
 And in fact, if you pick the step size very, very large, you can end up with behaviors like this. So, this black line here was an eta that was way too large, that's a technical term here that we'd like to use. And in this case, the solution is not even close to anything we got, even with the oscillating etas that I showed you in the previous slide. It's a huge gap, and so etas that are too large lead to really bad behavior in stochastic gradient. The rule of thumb that we described for
gradient, for picking eta is basically the same as the one for picking step size for stochastic
gradient. The same as for gradient, but unfortunately it requires much more trial and error. So, it's
even more annoying, so you might spent a lot of time in the trial and error even though it's a
hundred times faster than converge, it's possible to spend a hundred times more effort trying to
find the right step size, just be prepared. But, we just try several values exponentially spaced from
each other. And try to find somewhere between an eta that's too small and an eta that is too big,
and then find one that's just right. And, I mentioned this in the gradient section, but for stochastic
gradients, even more important, for those who end up exploring this further, there's an advanced
step where you would make the step size decrease over iterations. And so, for example, you
might have an eta that depends on what iteration you're in and often you set it to something like
some constant here eta zero divided by the iteration number t. Iteration number, and this
approach tends to reduce noise and make things behave quite a bit better. [MUSIC]
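As a rough sketch of both ideas (the helper name run_sgd_and_score and these particular values are made up for illustration): try several exponentially spaced step sizes, and, as the more advanced option, let the step size decay over iterations.

```python
import numpy as np

# Candidate step sizes, exponentially spaced between "too small" and "too big".
candidate_etas = np.logspace(-4, 0, num=5)   # 1e-4, 1e-3, 1e-2, 1e-1, 1.0

# Hypothetical helper: pick the eta whose run scores best, e.g. by smoothed log likelihood.
# best_eta = max(candidate_etas, key=run_sgd_and_score)

def step_size(eta0, t):
    """Decreasing step size eta_t = eta0 / t, for t = 1, 2, 3, ...
    This damps the oscillations of stochastic gradient over time."""
    return eta0 / t
```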
[MUSIC] As we saw in our plot, stochastic gradient tends to oscillate around the optimum, and so,
unfortunately, you should never trust the last parameter it finds. Gradient will eventually stabilize
on the optimal solution.
So even though it takes a hundred times longer or more, as was shown in this example (if you
look at the x-axis, a hundred times more time to converge), you get there, and you feel really good
when you get there. With stochastic gradient, when you think it has converged, it's really just oscillating
around the optimum, and that can lead to bad practical behavior. So for example here, just to
give you some numbers, w at iteration 1000 might look really, really bad, but maybe w at
iteration 1005 looks really, really good, and we need some kind of approach to minimize the risk of
picking a really bad one. And there is a very simple technique that works
really well in practice, and theoretically is what you should do; all the theorems require
something like this. What it says is: when you output w hat, your final set of coefficients,
you don't use the last value, w(T), capital T. You use the average of all the values that you've
computed, all the coefficients you computed along the way. So, what I'm showing here is what
your algorithm should output at the end of fitting, to make predictions in the real world.
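Here's a minimal sketch of that averaging, w hat = (1/T) times the sum of w(1) through w(T); keeping every w(t) in a list is just for clarity, and a running sum works just as well.

```python
import numpy as np

def averaged_output(w_history):
    """Return w_hat as the average of all coefficient vectors w(1), ..., w(T)
    visited by stochastic gradient, rather than the last, noisy one."""
    return np.mean(np.asarray(w_history), axis=0)

# Memory-friendly equivalent: keep running_sum += w_t inside the loop
# and output running_sum / T at the end.
```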
>> [MUSIC]

[MUSIC] The next few sections of this module are going to talk about more practical issues with
stochastic gradient, which you need to address to implement these algorithms. I made the
next few sections optional because it can get really tedious to go through all these practical
details; you've already gotten a sense of how finicky it can be, and there are many more details
like it. So the first issue is that learning from one data point at a time is
just way too noisy; usually you use a few data points, and this is called mini-batches.
So we've really illustrated two extremes so far. We illustrated gradient, where you make a full pass
through the data and use all N data points for every update of your coefficients. And then we
talked about stochastic gradient, where you look at just one data point when you're making an
update to the coefficients. And the question is, is there something in between, where you look at
B data points, say 100? That's called mini-batches; it reduces noise, increases stability, and
it's the right thing to do. And 100 is a really good number to use, by the way. Here I'm showing the
convergence paths of the same problem we've been looking at, but comparing a batch size of 1 with
a batch size of 25. And here you observe two things. First, the batch size of 25 makes the
convergence path smoother, which is a good thing. The second thing to observe, which is even
more interesting, is what happens near the optimum: batch size 1 really oscillates around the
optimum, while batch size 25 oscillates less, so it has better behavior around the optimum.
And that better behavior is going to make it much easier to use this approach in practice. So mini-
batches are a great thing to do.

So now we've introduced one more parameter to be tuned in this stochastic algorithm: the
batch size B. If it's too large, then it behaves just like gradient; for example, if you use a batch size of
N, it's exactly the gradient ascent algorithm, so in this case the red line here is a batch size that's
too large. If the batch size is too small, you get bad oscillations, bad behavior; so with B too
small, in this case, it doesn't converge very well. But if you pick the best batch size B, you get very nice
behavior: you quickly get to a great solution, and you stay around it. So picking the right batch
size makes a big difference.
So let's go back to the simple stochastic gradient algorithm and modify it to introduce the notion of
batch sizes. Instead of looking at one data point at a time, we're going to look at one batch at a
time. And if the batch size is B, we have N over B mini-batches for a data set of size N. So if
we have 1 billion data points and batch size 100, that's 1 billion over 100 of those batches. And now we go
batch by batch, but instead of considering one data point at a time in the computation of the
gradient, we consider the B data points in that mini-batch. And the equation here just shows
you the math behind the obvious thing of taking 100 data points and using them
to estimate the gradient, instead of 1 or instead of 1 billion. [MUSIC]
[MUSIC] The second practical issue we're going to talk about is how you measure
convergence for stochastic gradient. If you have to look at all the data points to figure out whether you
have converged, then the whole process is going to be pointless and meaningless. So we need to
think about new techniques for measuring convergence. This, again, is going to be an optional
section; it's very practical for those who actually want to implement a really
practical stochastic gradient algorithm. One way to think about it is, how did I actually make this
plot? Here's a plot where stochastic gradient gets to the optimum before a single pass over the
data, while gradient takes 100 or more passes over the data. If, to get one point in this plot, I had
to compute the whole likelihood over the entire data set, that would require me, for every little blue
dot, to make a pass over the entire data set, which would make the whole thing really slow. If I had to
make a pass over the data set to compute the likelihood, I might as well just use full gradient and not
have these noise problems. And so we need to rethink how we're going to compute convergence,
how we're going to plot that we're making progress.
And here there's a really, really simple trick, really easy, really beautiful, that we can do. I'm
showing here the stochastic gradient ascent algorithm for logistic regression, the one that we've
been using so far. I go data point by data point, and I compute the contribution to the gradient,
which is this part, the thing I'm calling partial j. Now, if I wanted to compute the log likelihood of the
data to see what my quality was, how well I'm doing, I'd have to compute the following for
data point i. If the data point were positive, I take the log of the probability that y = +1 given xi and the
current coefficients. And if the data point were negative, yi = -1, then I need
to take the log of the probability that yi = -1, which turns out to be log of 1 - P(y = +1 | xi). And here's the
beautiful thing: the quantity I need to compute the likelihood for a data point is exactly the same
one I needed to compute the gradient. So I've already computed it, and I can compute
the contribution of this one data point to the likelihood. Which is great. I can't do it for every data point, but I
can do it for one.
So at every iteration t, I can compute the likelihood of a particular data point. I can't use that by itself
to measure convergence, because I might do well on one data point, classifying it perfectly, but not on others,
so it would be a very noisy measure. But if I want to know how well I'm doing after, let's say, 75
iterations, what I can do is look at how well I did on the last few data points: take the likelihoods for
those data points, average them out, and create a smoother curve. So for every
timestamp, I keep an average, called a moving average, of the last few likelihoods in order
to measure convergence.
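A minimal sketch of that bookkeeping, reusing the probability already computed for the gradient step; the window of 30 recent points matches the number mentioned below, but any small window works.

```python
import numpy as np
from collections import deque

def data_point_log_likelihood(prob_positive, y_i):
    """Log likelihood of one data point, using P(y = +1 | x_i, w),
    the same quantity already computed for the gradient step."""
    if y_i == +1:
        return np.log(prob_positive)
    return np.log(1.0 - prob_positive)

# Inside the SGD loop, keep a moving average of the last 30 contributions:
recent = deque(maxlen=30)
# recent.append(data_point_log_likelihood(prob_i, y_i))
# smoothed = np.mean(recent)   # plot this value to monitor convergence
```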
And so for the plot here, the plot that I've been showing you, now I can tell you, truth in advertising,
what it actually was. The blue line here was not plain stochastic gradient, one data point at a time; it was
mini-batches of size 100, which is a great number to use. It still converges much faster than gradient. And to draw that
blue line, I averaged the likelihood over the last 30 data points. So that's how I built the plot, and this
is how you would have to build a plot if you're going to go through this whole process with stochastic
gradient. [MUSIC]
[MUSIC] We're now down to our final practical issue with stochastic gradient, which again is going to
be an optional section. The question is, how do we introduce regularization, and what impact does it
have on the updates? This is going to be pretty significant if you want to implement it yourself in
a general way. Again, it's an optional section because it's pretty detailed. So let's remind ourselves of the
regularized likelihood. We have some total quality metric defined by the fit to our data, which is
the log likelihood, and some measure of the complexity of the parameters, which in our case is
the squared L2 norm of the coefficients, ||w||^2. And we want to compute the
gradient of this regularized likelihood to make progress and avoid overfitting. The total
derivative is the derivative of the first term, the quality, which is the sum over the data points of the
contribution of each data point (the thing that got really expensive), plus the derivative of the
second term which, once we introduce the parameter lambda to trade off the two things (we talked about
this a lot), contributes -2 lambda wj. And we derived this update during an earlier module on
regularized logistic regression.
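Written out, with ℓ_i(w) denoting the log-likelihood contribution of data point i, that total derivative is:

```latex
\frac{\partial}{\partial w_j}\left[\,\ell(\mathbf{w}) - \lambda\,\|\mathbf{w}\|_2^2\,\right]
  \;=\; \sum_{i=1}^{N} \frac{\partial \ell_i(\mathbf{w})}{\partial w_j} \;-\; 2\lambda w_j
```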

Now, this is how we do straight-up gradient updates with regularization. The question is, what do we
do when we want stochastic gradient with regularization? If you remember stochastic gradient, we
just said we take the contribution of a single data point, and if we add up those contributions we get
exactly the gradient. That's why it worked: the sum of these stochastic gradients equals the
gradient. And so, to mimic that, we need to think about how to set up the regularization term
such that the sum of the stochastic gradients is also equal to the gradient. One way to do this
is to take the regularization and divide it by N. So in practice, this is what you want to do: the total
derivative for stochastic gradient is the contribution of the data point minus 2/N times lambda wj.
And if you were to add this up over all N data points, you'd get back the full gradient.
So with regularization, the algorithm stays the same, but the contribution of each data point is its
gradient minus 2/N lambda wj; that last piece is the contribution from the regularization term. If you
are using mini-batches, you adapt it accordingly: you take the sum of the contributions from each
data point in the batch, and then the regularization contribution becomes 2B/N lambda wj. So this is
how we take care of regularization. Again, it's a very small change, and it's going to behave way better.
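As a small sketch of this change (names invented for illustration), a single regularized update, for one data point or a mini-batch of size B, might look like:

```python
def regularized_sgd_step(w, data_gradient, eta, lam, N, B=1):
    """One stochastic gradient ascent step with L2 regularization.
    data_gradient: gradient summed over the B points in the mini-batch.
    The regularization term is scaled by B/N so that summing the
    stochastic updates over all N/B batches recovers the full gradient."""
    return w + eta * (data_gradient - (2.0 * B / N) * lam * w)
```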
[MUSIC] Now we've seen stochastic gradient, which is really exciting: a simple algorithm, a simple
modification to gradient, which really speeds things up in practice. It has many practical challenges, and
we talked about several of those and how to address them. But now I'd like to step back
and think about a broader question, what's called online learning: how do we learn from
streaming data? And we'll see that stochastic gradient is one way to learn from data that arrives over time,
streaming data. Let's define the idea of online learning. But first, let's look at what we've been doing so far.
What we've been doing so far in this course, and in the regression course, is what's called batch
learning. I'm given the full data set, I run some machine learning algorithm over this data
set, maybe gradient, making many passes over the data, and finally I output my best guess, my best
estimate, for the coefficients. We call that w hat, and we're done. That's batch
learning. Online learning is something different. Actually, what you're doing right now is online
learning, but that's a different kind of online learning; what we're talking about here is online
machine learning. In online machine learning, data arrives over time, one data point at a time.
So, for example, as we'll see next, serving ads on web pages is a setting where
things arrive one data point at a time. That's where the data is coming in. Your
machine learning algorithm sees a little chunk of that data, one little bit, say at
timestamp one, takes it in, and makes an estimate of the coefficients, say w hat 1. At timestamp two, it sees
another little bit of the data and makes another estimate of the coefficients, w hat 2. At
timestamp three, it makes another estimate, w hat 3. At timestamp four, a little more data and another
estimate, w hat 4. So at every timestamp it's making a new estimate, so it can make new
predictions.

To make these ideas concrete, let's look at a really practical, real-world example where online learning makes
a huge difference: ad targeting. So let's say you're navigating the web and you hit a
particular website; what's happening behind the scenes when you're shown ads? Well, some
information about you, like your age or the websites you've visited, and some information
about the website, like its text, are fed into a machine learning algorithm that's
going to use some set of coefficients, w hat t, to figure out the best ads to show you. We're
going to call those y hat, the suggested ads: it might show you ad 1, ad 2, ad 3, and so on. Then
you look at the website, you think, cool, that's a really interesting ad, and you go and click
on ad two. Well, when you click on ad two, the machine learning algorithm figures out that you
clicked on ad two and assigns the true label for that impression: ad two, the one you clicked on.
And then the machine learning algorithm takes that feedback and updates its coefficients from w hat t to w hat t
plus one. What we've described so far is really how a lot of ad systems work in
practice. So this is something that makes a big difference in the real world.
[MUSIC]
[MUSIC] And this is an example of an online learning problem. Data arrives over time: you see
an input xt and you need to make a prediction, y hat t. So the input may be the text of the web page
and information about you, and y hat t might be a prediction about what ads you might click on. Then
you observe what happens in the real world: either you click an ad, in which case yt might
be ad two, or you don't click on anything, in which case yt would be none of the above, no ad was
good for you. Whatever that is gets fed into a machine learning algorithm that improves its
coefficients, so it can improve its performance over time. The question is, how do we design a
machine learning algorithm that behaves like this? What's a good example of a machine learning
algorithm that can improve its performance over time in an online fashion like this?
It turns out that we've seen one: stochastic gradient. Stochastic gradient is a learning algorithm
that can be used for online learning. So let's review it. You start with some initial set of coefficients, say
everything equal to zero. At every time step, you get some input xt, and you make a prediction y
hat t based on your current estimate of the coefficients. Then you're given the true label yt, and
you feed those into the algorithm. Stochastic gradient takes those inputs, uses them
to compute the gradient, and then just updates the coefficients: wj(t+1) is wj(t) plus eta
times the gradient, which is computed from these observed quantities in the real world.
So, online learning is a different kind of learning that we
haven't talked about at all in this specialization, but it's really important in practice. Data
arrives over time and you need to make a decision right away about what to do with it; based on
that decision, you get some feedback, you update the parameters immediately, and you keep going.
This online learning approach, where you update the parameters immediately as you see new
information from the real world, can be extremely useful. For example, your model is always up to
date; it's always based on the latest data, the latest information in the world. It can have lower
computational cost, because you can use techniques like stochastic gradient that don't have to look
at all the data. In fact, you don't even have to store all the data if it's too massive. (Most people do
store the data because they might want to use it later, so that's a side note, but you don't have to.)
However, it has some really difficult practical properties. The system you have to build, the actual
design of how the data interacts with the world, where the data and coefficients get stored, and all
of that, is really complex and complicated, and it's hard to maintain. If you have oscillations in your
machine learning algorithm, it can do really stupid things, and nobody wants their website to do
stupid things. And you don't necessarily trust those noisy stochastic gradient updates; sometimes
they can give you bad predictions. And so, in practice, most companies don't do something like this.
What they do is save their data for a little while and update their models with the data from the last
hour, or the last day, or the last week. That's very common. It's very common, for example, for a
large retailer to update its recommender system every night, running a big service every night to do
that. You can think of that as an extreme version of the mini-batches we talked about earlier in this
module, but now the batch is all the data from the whole day. For you, it would be
those 5 billion page views. [MUSIC]
