Lesson 1 Notes
C: Today we are going to talk about supervised learning. But, in particular what
we're going to talk about are two kinds of supervised learning, and one particular way to
do supervised learning. Okay, so the two types of supervised learning that we typically think
about are classification and regression. And we're going to spend most of the time today talking
about classification and more time next time talking about regression. So the difference between
classification and regression is fairly simple for the purposes of this discussion. Classification is
simply the process of taking some kind of input, let's call it x. And I'm going to define these terms in a couple of minutes. And mapping it to some discrete label. Usually, for what we're talking about, something like true or false. So, what's a good example of that? Imagine that I have a nice little picture of Michael. So what do you think, Michael? Do you think this is a male or a female?
M: So you're, you're classifying me as male or female based on the picture of me and I would
think you know, based on how I look I'm clearly male.
C: Yes. In fact, manly male. So, this would be a classification from pictures to male. And this
is where we're going to spend most of our time talking about it first as a classification task. So
taking some kind of input, in this case pictures, and mapping it to some discrete number of
labels: true or false, male or female, car versus cougar, anything that you might imagine thinking of.
M: Car versus cougar?
C: Yes. Okay, so that's classification. We'll return to regression a little bit later during this conversation. But, just as a preview, regression is more about continuous-valued functions. So, something like: given a bunch of points, and then given a new point, I want to map it to some real value. So we may pretend that these are examples of a line, and given a point here, I might say the output is right there. Okay, so that's regression, but we'll talk about that in a
moment. Right now, what I want to talk about is classification.
M: Would an example of regression also be, for example, mapping the pictures
of me to the length of my hair? Like a number that represents the length of my hair?
C: You have three questions here, and we've divided the world up into some input to some learning algorithm, whatever that learning algorithm is; the output that we're expecting; and then a box for you to tell us whether it's classification or regression. So, the first question: the input is
credit history, whatever that means, the number of loans that you have, how much money you
make, how many times you've defaulted, how many times you've been late, the sort of things
that make up credit history, and the output of the learning algorithm is whether you should lend
money or not. So you're a bank, and you're trying to determine whether given a credit history, I
should lend this person money, that's question one. Is that classification, or is that regression?
Question two you take as input a picture like the examples that we've given before. And the
output of your learning system is going to be whether this person is of high school age, college
age, or grad student age. The third question is very similar. The input is again a picture, and the output is, I guess, the actual age of the person: 17, 24, 23 and a half, whatever. So take a
moment and try to decide whether these are classification tasks or regression tasks.
M: Alright, so, let's see what happened here. So, what you're saying is in some cases the
inputs are discrete or continuous or complicated. In other cases the outputs could be discrete
or continuous or complicated. But I think what you were saying is what matters to determine
if something is classification or regression is whether the output is from a discrete small set or
whether it's some continuous quantity. Is that right?
C: Right, that's exactly right. The difference between a classification task or a regression task is
not about the input, it's about the output. If the output is continuous, then it's regression, and if the output is from a discrete set (usually a small one), then it is a classification task. So, with that in mind,
what do you think is the answer to the first one?
M: So, lend money. If it was something like predicting a credit score, that seems like a more
continuous quantity. But this is just a yes no, so that's a discrete class, so I'm going to go with
classification.
C: That is correct. It is classification, and the short answer is: because it's a binary task. True, false. Yes, no. Lend money or don't lend money. So it's a simple classification task. Okay,
with that in mind, what about number two?
M: Alright, so number two. It's trying to judge something about where they fall on a scale: high school, college, or grad student. But all the system is being asked to do is put
them into one of those three categories, and these categories are like classes, so it's
classification.
C: That is also exactly right. Classification. We moved from binary to trinary in this case, but
the important thing is that it's discrete. So it doesn't matter if it's high school, college, grad, professor, elementary school, or any number of other categories we might use to decide your status of matriculation; it is a small discrete set. So, with that in mind, what about number three?
C: Before we get into the details of that, I want to define a couple of terms. Because we're going
to be throwing these terms out all throughout the lessons. And I want to make certain that we're
all on the same page and we mean the same thing, because we're returning to this again and again and again. So, the first term I want to introduce is the notion of instances. So, instances, or another way of thinking about them, inputs. Those are vectors of values, of attributes, that
define whatever your input space is. So they can be pictures and all the pixels that make up
pictures like we've been doing so far before. They can be credit score examples like how much
money I make, or how much money Michael makes. So whatever your input value is, whatever
it is you're using to describe the input, whether it's pixels or its discrete values, those are your
instances, the set of things that you're looking at. So you have instances, that's the set of inputs
that you have. And then what we're trying to find is some kind of function and that is the concept
that we care about. And this function actually maps inputs to outputs, so it takes the instances,
it maps them, in this case, to some kind of output such as true or false. These are the categories of things that we're worried about. And for most of the conversation that we're going to be having, those outputs will be binary: true or false.
C: So with that, with that notion of a concept as functions or as mappings from objects to
membership in a set we have a particular target concept. And the only difference between a
target concept and the general notion of concept is that the target concept is the thing we're
trying to find. It's the actual answer. So, a function that determines whether something is a car
or not, or male or not, is the target concept. Now this is actually important, right, because we
could say that we have this notion in our head of things that are cars, or things that are males, or people who are credit worthy, but unless we actually have it written down somewhere
we don't know whether it's right or wrong. So there is some target concept we're trying to get at, out of all the concepts that might map pictures or people to true and false.
M: Okay, so if you're trying to teach me what tallness is, so you have this kind of concept in mind
of these, these things are tall and these things are not tall. So you're going to have to somehow
convey that to me. So how are you going to teach me?
C: So what's a training set? Well a training set is a set of all of our inputs, like pictures of
people, paired with a label, which is the correct output. So in this case, yes, this person is credit
worthy.
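To make that concrete, here is a tiny sketch of a training set as instance-label pairs in Python; the feature names and numbers are made up for illustration, not taken from the lecture:

```python
# A toy training set: each instance is a vector of attribute values (here a
# dict), paired with the correct label (credit worthy or not).
training_set = [
    ({"defaults": 0, "late_payments": 1, "income": 50_000}, True),
    ({"defaults": 3, "late_payments": 7, "income": 12_000}, False),
    ({"defaults": 1, "late_payments": 0, "income": 80_000}, True),
]
```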
C: All right, so we've defined our terms, we know what it takes to do at least supervised
learning. So now I want to do a specific algorithm and a specific representation that allows us to solve the problem of going from instances to actual concepts. So what we're going to talk about next
are decision trees. And I think the best way to introduce decision trees is through an example.
So, here's the very simple example I want you to think about for a while. You're on a date with
someone. And you come to a restaurant. And you're going to try to decide whether to enter the
restaurant or not. So your input, your instances are going to be features about the restaurant.
And we'll talk a little bit about what those features might be. And the output is whether you
should enter or not. Okay, so it's a very simple, straightforward problem but there are a lot of
details here that we have to figure out.
M: It's a classification problem.
C: It's clearly a classification problem because the output is yes, we should enter or no, we
should move on to the next restaurant. So in fact, it's not just a classification problem, it's one of those binary classification problems that I said we'd almost exclusively be thinking about for the
next few minutes. Okay. So, you understand the problem set up?
M: Yes, though I'm not sure exactly what the pieces of the input are.
C: Right, so that's actually the right next question to ask. We have to actually be specific now about a representation. Before, I was drawing squiggly little lines and you could imagine what they were, but now, since we're really going to go through an example, we need to be clear about what it means to be standing in front of the restaurant. So, let's try to figure out how we would represent that, how we would define that. We're talking about you standing in front of a restaurant, or an eatery, because I can't reliably spell restaurant. And we're going to
try to figure out whether we're going to go in or not. But, what do we have to describe our
eatery? What do we have? What are our attributes? What are the instances actually made of?
So what are the features that we need to pay attention to that are going to help us to determine
whether we should, yes, enter into the restaurant, or no, move on to the next restaurant. So, any ideas, Michael?
M: Sure. I guess there's like the type of restaurant.
C: Okay, let's call that the type. So it could be Italian, it could be French, it could be Thai, it
could be American, there are American restaurants, right?
M: Sure.
C: Greek, it can be Armenian. It can be any kind of restaurant you want. Okay, good. So that's
something that probably matters because maybe you don't like Italian food or maybe you're
really in the mood for Thai. Sounds perfect. Okay anything else?
M: Sure. How about whether it smells good?
C: You know, I like cleanliness. Let's be nice to our eateries and let's say atmosphere. So is it
fancy? Is it a hole-in-the-wall? Those sorts of things. You could imagine lots of things like
that, but these things might matter to you and your date. Okay, so, we know the type of the restaurant.
Representation
Okay, so we've now seen an abstract example of decision trees. So let's make certain that we
understand it with a concrete example. So, to your right is a specific decision tree that looks
at some of the features that we talked about: whether the restaurant is occupied or not, what type of restaurant it is (your choices are pizza, Thai, or other), and whether the people inside look happy or not. The possible outputs are again binary: either you don't go or you do go into the restaurant. On your left is a table which has six of the features that we've talked about: whether the restaurant is occupied or not, the type of restaurant, whether it's raining outside or not, whether you're hungry or not, whether you're on a hot date, and whether the people inside look happy, and some values for each of those. And
what we would like for you to do is to tell us what the output of this decision tree would be in
each case. Each row of this is a different time that we're stopping at a restaurant, and the little values there summarize what is true about this particular situation.
M: And, and you're saying we need to then trace through this decision tree and figure out what the class is.
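To make the tracing mechanical, here is a minimal Python sketch of a decision tree like the one described, plus a function that traces an instance through it; the exact tree shape and feature values are illustrative, not the ones drawn in the video.

```python
# A decision tree as nested dicts: internal nodes ask about one feature,
# leaves hold the final decision (True = go in, False = don't).
tree = {
    "feature": "occupied",
    "branches": {
        False: False,  # restaurant is empty: don't go in
        True: {
            "feature": "type",
            "branches": {
                "pizza": True,
                "thai": {"feature": "happy",
                         "branches": {True: True, False: False}},
                "other": False,
            },
        },
    },
}

def classify(node, instance):
    """Trace one instance (a dict of feature values) down to a leaf."""
    while isinstance(node, dict):
        node = node["branches"][instance[node["feature"]]]
    return node

# One row of the table, as a dict of feature values (values made up):
print(classify(tree, {"occupied": True, "type": "thai", "happy": True}))  # True
```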
Answer
C: Alright, so let's take a moment to have a quiz where we really delve deeply into what it
means to have a best attribute. So, Michael and I have been throwing around that term; let's see if we can define it a little bit more crisply. So, what you have in front of you are three different attributes that we might apply in our decision tree for sorting out our instances. So, at the top of the screen, what you have is a cloud of instances. They are labelled either with a red x or a green o, and that represents the label, so that means they are part of our training set; this is what we're using to build and to learn our decision tree. So, in the first case, you have the set of instances being sorted into two piles: there are some xs and some os on the left, and some xs and some os on the right. In the second case, you have that same set of data being sorted so that all of it goes to the left and none of it goes to the right. And in the third case, you have that same set of data sorted so that a bunch of the xs end up on the left and a bunch of the os end up on the right.
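For intuition about why the third split is the good one, here is the arithmetic with made-up counts, say 5 xs and 5 os:

$$\mathrm{Entropy}(S) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1\ \text{bit}.$$

Sending everything down one branch leaves that mix untouched, so the reduction in entropy is 1 − 1 = 0. The perfect split (all xs left, all os right) leaves zero entropy on each side, for a reduction of 1 − 0 = 1 bit, the best possible. The first, partially mixed split lands somewhere in between.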
Answer
C: So, we saw before when we looked at AND and OR versus XOR that in the case of AND
and OR we only needed two nodes but in the case of XOR we needed three. The difference
between two and three is not that big, but it actually does turn out to be big if you start thinking
about having more than simply two attributes. So, let's look at generalized versions of OR and
generalized versions of XOR and see if we can see how the size of the decision tree grows
differently. So in the case of an n version of OR, that is, where we have n attributes as opposed to just two, we might call that the any function: a function where, if any of the variables are true, then the output is true. We can see that the decision tree for that has a very particular and kind
of interesting form. Any ideas, Michael, about what that decision tree looks like?
M: So, well. So going off of the way you described OR in the two-variable case, we can start with that. You pick one of the variables, and if it's true, then any of them is true, so the leaf is true.
C: What happens if it's false?
M: Well, then we have to check everything that's left. So then we move on to one of the other attributes, like A2, and again, if it's true, it's true, and if it's false then we don't know. This could take some time.
C: Oh, that was actually an interesting point. Let's say if there were only three, we would be done. But wait, what if there were five?
M: Then we need one more node.
C: What if there were n?
M: Then we need n minus 4 more nodes.
C: Right, so what you end up with in this case is a nice little structure around the
decision tree. And how many nodes do we need?
M: Looks like one for each attribute, so that would be n.
C: n nodes, exactly right. So we have a term for this sort of thing: the size of the decision tree is, in fact, linear in n. And that's for the any function. Now what about an n version of XOR?
M: So XOR is: if one is true but the other one's not true, then it's true. And if they're both true... yeah, I don't know. It's not clear how to generalize that.
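To see the gap they are about to discuss, here is a small Python sketch contrasting the any function with the natural n-bit generalization of XOR, parity (our framing; the lecture names it in a moment). For any, a chain of n nodes suffices; for parity, flipping any single bit flips the answer, so no attribute can ever be skipped and the tree must be a full tree with 2^n − 1 internal nodes:

```python
from itertools import product

def any_fn(bits):      # generalized OR: true if any attribute is true
    return any(bits)

def parity_fn(bits):   # generalized XOR: true if an odd number are true
    return sum(bits) % 2 == 1

# For any_fn a chain works: test A1; if true, output true; else test A2; ...
# That's n internal nodes. For parity_fn, flipping any one bit flips the
# answer, so every path must test all n attributes: 2**n - 1 internal nodes.
n = 4
for bits in product([False, True], repeat=n):
    flipped = (not bits[0],) + bits[1:]
    assert parity_fn(bits) != parity_fn(flipped)  # bit 0 always matters
```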
C: All right, so what that last little exercise showed is that XOR, the n-attribute version of XOR, which is parity, is hard. It's
exponential. But that kind of provides us a little bit of a hint, right? We know that XOR is hard
and we know that OR is easy. At least in terms of the number of nodes you need, right? But,
we don't know, going in, what a particular function is. So we never know whether the decision
tree that we're going to have to build is going to be an easy one. That is something linear, say.
Or a hard one, something that's exponential. So this brings me to a key question that I want to
ask, which is, exactly how expressive is a decision tree. And this is what I really mean by this.
Not just what kinds of functions it can represent. But, if we're going to be searching over all
possible decision trees in order to find the right one, how many decision trees do we have to
worry about to look at? So, let's go back and look at, take the XOR case again and just speak
more generally. Let's imagine that we once again, we have n attributes. Here's my question to
you, Michael. How many decision trees are there? And look, I'm going to make it easy for you,
Michael. They're not just attributes, they're Boolean attributes. And they're not just Boolean
attributes, but the output is also Boolean. Got it?
Answer
M: Alright. So again, a lot feels like a good answer; it's already written down on the left. But also, wait, wait, maybe we can quantify this. Maybe one way to think about this is: each of those empty boxes there is either true or false. It's kind of like a bit. And we're asking how many different bit patterns can we make? And in general, it's two to the number of positions, but here the number of positions is 2 to the n. So it ought to be 2 to the 2 to the n. Is that the same as 4 to the n?
C: No.
M: Okay.
C: But you're right. It's 2 to the 2 to the n. So it's a double exponential and it's not the same thing
as 4 to the nth. It's actually 2 to the 2 to the nth. Now how big of a number do you think that is, Michael?
M: I'm going to just say a lot.
C: It is, in fact, a lot. And I'm going to look over here, and I'm going to tell you that for even a small value of n, this gets to be a really big number.
M: So for, for one, it's 2 to the 2 to the 1, which is 4. That's not a big number. For two, it's 2 to
the 2 to the 2. So 2 to the 2 is 4, so it's 2 to the 4, which is 16.
C: What about three?
M: Alright, so that's two to the 8th, which is 256?
C: So that's growing pretty fast, don't you think? What if I told you that for n equals 6, it's 18,446,744,073,709,551,616?
M: Holy monkeys.
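For the record, the count they just computed works out like this: a truth table over n Boolean attributes has 2^n rows, and each row's output is one bit, so the number of distinct Boolean functions (and hence distinct decision-tree behaviors) is

$$2^{2^n}:\qquad n{=}1 \Rightarrow 4,\quad n{=}2 \Rightarrow 16,\quad n{=}3 \Rightarrow 256,\quad n{=}6 \Rightarrow 2^{64} = 18{,}446{,}744{,}073{,}709{,}551{,}616.$$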
ID3
C: So now, we have an intuition of best and how we want to split. We've, we've looked over the high-level algorithm Michael proposed for how we would build a decision tree. And I think we have enough information now that we can actually do a real, specific algorithm. So, let's
write that down. And the particular algorithm that Michael proposed is a kind of generic version
of something that's called ID3. So let me write down what that algorithm is, and we can talk
about it. Okay, so here's the ID3 algorithm. You're simply going to keep looping forever until
you've solved the problem. At each step, you're going to pick the best attribute, and we're going
to define what we mean by best. There are a couple of different ways we might, we might define
best in a moment. And then, given the best attribute that splits the data the way that we want, it
does all those things that we talked about, assign that as a decision attribute for node. And then
for each value that the attribute A can take on, create a descendent of node. Sort the training
examples to those leaves based upon exactly what values they take on, and if you've perfectly
classified your training set, then you stop. Otherwise, you iterate over each of those leaves,
picking the best attribute in turn for the training examples that were sorted into that leaf, and you
keep doing that. Building up the tree until you're done. So that's the ID3 algorithm. And the key
bit that we have to expand upon in this case, is exactly what it means to have a best attribute.
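Here is a minimal Python sketch of the loop just described, assuming discrete attributes and a pluggable best_attribute scorer; the names are ours, for illustration, and information gain (defined next) is the usual scorer:

```python
from collections import Counter

def id3(examples, attributes, best_attribute):
    """examples: list of (features_dict, label) pairs.
    Returns a nested-dict tree, or a bare label at a leaf."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:      # training examples perfectly classified: stop
        return labels[0]
    if not attributes:             # nothing left to split on: majority label
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(examples, attributes)   # pick the best attribute A
    node = {"feature": a, "branches": {}}
    for v in {feats[a] for feats, _ in examples}:  # one descendant per value
        subset = [(f, l) for f, l in examples if f[a] == v]
        remaining = [x for x in attributes if x != a]
        node["branches"][v] = id3(subset, remaining, best_attribute)
    return node
```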
All right, what exactly is it that we mean by best attribute? So, there are lots of possibilities that you can come up with. The one that is most common, and the one I want you to think about the most, is what's called information gain. So information gain is simply a mathematical way to capture the amount of information that I would gain by picking a particular attribute. But what it really talks about is the reduction in the randomness over the labels that you have with a set of data, based upon knowing the value of a particular attribute. So the formula's simply this: the information gain over S and A, where S is the collection of training examples that you're looking at and A is a particular attribute, is defined as the entropy, with respect to the labels, of the set of training examples S, minus the expected or average entropy that you would have over each subset of examples that takes on a particular value of A.
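Written out, the standard formula being described is

$$\mathrm{Gain}(S, A) \;=\; \mathrm{Entropy}(S) \;-\; \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v), \qquad \mathrm{Entropy}(S) \;=\; -\sum_{c} p_c \log_2 p_c,$$

where $S_v$ is the subset of $S$ for which attribute $A$ takes value $v$, and $p_c$ is the fraction of $S$ carrying label $c$.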
M: So what we're doing, we're picking an attribute and that attribute could have a bunch of
different values, like true or false, or short, medium, tall?
ID3 Bias
C: So, we've got a whole bunch of trees we have to look at, Michael, and we're going to have to come up with some clever way to look through them. And this gets us back to something that we've talked about before, which is the notion of bias, and in particular, the notion of inductive bias. Now, just as a quick refresher, I want to remind you that there are two kinds of bias we worry about when we think about algorithms that are searching through a space. One is what's called a restriction bias. The other is called a preference bias. So a restriction bias is nothing more than the hypothesis set that you actually care about. So in this case, with decision trees, the hypothesis set is all possible decision trees. Okay? That means we're not considering y equals 2x, plus non-Boolean functions of a certain type. We're only considering decision trees, and all that they can represent, and nothing else. Okay? So that's already a restriction bias, and it's important, because instead of looking at the uncountably infinite number of functions that are out there that we might consider, we're only going to consider those that can be represented by a decision tree over, in all the cases we've given so far, discrete variables. But a preference bias is something that's just as important. And it tells us what sort of hypotheses from this hypothesis set we prefer, and that is really at the heart of inductive bias.
So Michael, given that, what would you say is the inductive bias of the ID3 algorithm? That is,
given a whole bunch of decision trees, which decision trees would ID3 prefer, over others?
M: So, it definitely tries, since it's, since it's making its decisions top down, it's going to be more likely to produce a tree that has basically good splits near the top than a tree that has bad splits at the top, even if the two trees can represent the same function.
C: Good point. So good splits near the top. Alright. And you said something very important there, Michael. Given two decision trees that are both correct, that both represent the function we might care about, it would prefer the one that had the better splits near the top. Okay, so any other preferences? Any other inductive bias in the ID3 algorithm?
M: It prefers ones that model the data better to ones that model the data worse.
C: Right. So this is one that people often forget: it prefers correct ones to incorrect ones. So, given a tree that has very good splits at the top but produces the wrong answer, it will not take it.
C: Alright. So, we've actually done pretty well. So through all of this, we finally figured out what
decision trees actually are. We know what they represent. We know how expressive they are.
We have an algorithm that lets us build the decision trees in an effective way. We've done just
about everything there is to do with decision trees, but there are still a couple of open questions that I want to think about. So, here are a couple of them, and I want you to think about them, and then
we'll discuss them. So, so far, in all of our examples that we've used, all the things we've been thinking about, for good pedagogical reasons we had not only discrete outputs but we also had discrete inputs. So one question we might ask ourselves is: what happens if we have
continuous attributes? So Michael, let me ask you this. Let's say we had some continuous attributes. We weren't just asking whether someone's an animal, or whether they're human, or whether it's raining outside, but we really cared about age or weight or distance or anything else that might be a continuous attribute. How are we going to make that work in a decision tree?
M: Well, I guess the literal way to do it would be for something like age to have a branching
factor that's equal to the number of possible ages.
C: Okay, so that's one possibility. So we stick in age, and then we have one branch for 1.0, one for 1.1, one for 1.11, one for 1.111, and so on.
M: Ahh, I see. Alright. Well, at the very least, okay. What if, what if we only included ages that were in the training set? Presumably there's at least a finite number of those.
C: Oh, we could do that. We could just do that, except what are we going to do then when we come up with something in the future that wasn't in the training set?
M: Oh, right. Can we look at the testing set?
C: No, we're not allowed to look at the testing set. That is cheating, and not the kind of good
cheating that we do when we pick a good representation.
M: Okay, fair enough. Well we could, we could do ranges. What about ranges? Isn't that the
way we cover more than just individual values?
C: Give me an example. Say ages you know, in the 20s.
M: Okay, so, huh. How would we represent that with a decision tree? You could do, like, age, element sign, bracket, 20 comma 30, right paren; so, age in the half-open interval [20, 30).
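The usual trick, sketched here in Python as our own illustration (the lecture doesn't spell it out at this point), is to turn a continuous attribute into a Boolean question by choosing a threshold, scored by the same information gain as any other split:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Score Boolean questions of the form 'value <= t?' by information gain."""
    base = entropy(labels)
    best = None
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        if not left or not right:
            continue  # a split that sends everything one way tells us nothing
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(labels)
        if best is None or base - remainder > best[0]:
            best = (base - remainder, t)
    return best  # (information gain, threshold)

# Made-up ages and labels: the best cut lands at age <= 24.
print(best_threshold([17, 21, 24, 35, 52], [False, True, True, False, False]))
```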
C: So, here's the next question I want to ask you, a simple true or false question. Does it make sense to repeat an attribute along any given path in the tree? So, if we pick some attribute like A, should we ever ask a question about A again? Now, I mean something very specific by that. I mean, down a particular path of the tree, not just anywhere else in the tree. So, in particular, I mean this: I ask a question about A, then I ask a question about B, and then I ask a question about A again. That's the question I'm asking, not whether A might appear more than once in the tree. So, for example, it might be the case that A shows up more than once in the tree, but not along the same path. So, in the second case over here, A shows up more than once, but those nodes really don't have anything to do with one another, because once you've answered B, you will only ever ask the question about A once. So, my question to you is: does it make sense to repeat A more than once along a particular path in the tree? Yes or no?
Answer
C: So, we've answered the thing about continuous attributes. Now, here's another thing. When
do we really stop?
M: When we get all the answers right. When all the training examples are in the right class.
C: Right, so the answer in the algorithm is when everything is classified correctly.
That's a pretty good answer, Michael. But what if we have noise in our data? What if it's the
case that we have two examples of the same object, the same instance, but they have two
different labels? Then this will never be the case.
M: Oh. So, then our algorithm goes into an infinite loop.
C: Which seems like a bad idea.
M: So we could just say, or we've run out of attributes.
C: Or we've run out of attributes. That's one way of doing it. In fact, that's what's going to have to happen at some point, right? That's probably a slightly better answer, although that doesn't help us in the case where we have continuous attributes and we might ask an infinite number of questions. So we probably need a slightly better stopping criterion, don't you think?
C: So another consideration we might want to think about with decision trees, though we're not going to go into a lot of detail, but I think it might be worth at least mentioning, is the problem of regression. So, so far we've only been doing classification, where the outputs are discrete, but
what if we were trying to solve something that looked more like x squared or two x plus 17 or
some other continuous function. In other words, a regression problem. How would we have to
adapt decision trees to do that? Any ideas Michael?
M: So these are now continuous outputs, not just continuous inputs.
C: Right, maybe the outputs are all continuous, maybe the outputs are discrete, maybe they're a
mix of both.
M: Well, it certainly seems like our rule of using information gain is going to run into trouble, because it's not really clear how you measure information on these continuous values. So, I guess you could measure error some other way. Well, it's not error, it's trying to measure how mixed up things are? Oh, so maybe something like variance? Because in a continuous space you could talk about whether there's a big spread in the values, and that would be measured by the variance.
C: Oh, good. So what you really have now is a question about splitting. What's the splitting criterion?
M: I guess there's also an issue of what you do in the leaves.
C: Right. So, what might you do in the leaves?
M: I guess you could do some sort of more standard kind of fitting algorithm. So, like, report the average, or do some kind of a linear fit.
C: There's any number of things you can do. By the way, that's worth pointing out on the output side: if we do pruning like we did before and we have errors, we never actually said, when we talked about that, how you would report an output. Right? If you don't have a clear answer, where everything is labeled true or everything is labeled false, how do you pick? So something like an average would work there.
M: I don't know, I mean, it seems like it depends on what we're trying to measure with the tree. If, with the tree, we're trying to get as many right answers as we can, then you probably want to do, like, a vote in the leaves.
C: Right, which, at least if the only answer is true or false, would look more like an average, I guess. Right, so you pick, you do a vote. So we do a vote, so we do pruning, we do have to deal with this issue of the output somehow, with something like a vote or an average. And here, when you have regression, then I guess average is a lot like voting.
M: Yeah, in a continuous space.
C: Yeah. So either way we're doing a kind of voting. I like that.
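To make the variance idea concrete, here is a minimal sketch (ours, not from the lecture) of a variance-based splitting score and a mean-valued leaf for regression trees:

```python
def variance(ys):
    """Spread of the continuous outputs in a node."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def variance_reduction(parent_ys, left_ys, right_ys):
    """Analogue of information gain: how much a split shrinks the spread."""
    n = len(parent_ys)
    weighted = (len(left_ys) * variance(left_ys)
                + len(right_ys) * variance(right_ys)) / n
    return variance(parent_ys) - weighted

def leaf_prediction(ys):
    """In a leaf, report the average: the regression analogue of a vote."""
    return sum(ys) / len(ys)

ys = [1.0, 1.2, 0.9, 4.8, 5.1, 5.0]
print(variance_reduction(ys, ys[:3], ys[3:]))  # large reduction: a good split
print(leaf_prediction(ys[3:]))                 # ~4.97
```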