6.867 Lecture Notes: Section 1: Introduction
MIT 6.867, Fall 2014

Reading
Lecture 1: Murphy: Chapters 1 (Introduction), 2 (good probability review; pay attention to the multinomial, Gaussian, Beta, and Dirichlet distributions)

Contents
1 Intro
2 Problem class
  2.1 Supervised learning
    2.1.1 Classification
    2.1.2 Regression
  2.2 Unsupervised learning
    2.2.1 Density estimation
    2.2.2 Clustering
    2.2.3 Dimensionality reduction
  2.3 Reinforcement learning
  2.4 Other settings
3 Assumptions
4 Evaluation criteria
5 Model type
  5.1 No model
  5.2 Prediction rule
  5.3 Probabilistic model
    5.3.1 Fitting a probabilistic model
    5.3.2 Decision theoretic prediction
    5.3.3 Benefits of using a probabilistic model
  5.4 Distribution over models
Algorithm
Recap
1 Intro
The main focus of machine learning is making decisions or predictions based on data. There are a number of other fields with significant overlap in technique, but differences in focus: in economics and psychology, the goal is to discover underlying causal processes, and in statistics it is to find a model that fits a data set well. In those fields, the end product is a model. In machine learning, we often fit models, but as a means to the end of making good predictions.
Generally, this is done in two stages:
1. Learn or estimate a model from the data
2. Apply the model to make predictions or answer queries
Problem of induction: Why do we think that previously seen data will help us predict
the future? This is a serious philosophical problem of long standing. We will operationalize
it by making assumptions, such as that all training data are IID (independent and identically distributed) and that queries will be drawn from the same distribution as the training
data, or that the answer comes from a set of possible answers known in advance.
Two important issues that come up are:

- Statistical inference: How do we deal with the fact that, for example, the same treatment may end up with different results on different trials? How can we predict how well an estimate may compare to future results?
- Generalization: How can we predict results of a situation or experiment that we have never encountered before in our data set?
We can characterize problems and their solutions using six characteristics, three of
which characterize the problem and three of which characterize the solution:
1. Problem class: What is the nature of the training data and what kinds of queries will
be made at testing time?
2. Assumptions: What do we know about the source of the data or the form of the
solution?
3. Evaluation criteria: What is the goal of the system? How will the answers to individual queries be evaluated? How will the overall performance of the system be
measured?
4. Model type: Will an intermediate model be made? What aspects of the data will be
modeled? How will the model be used to make predictions?
5. Model class: What particular parametric class of models will be used? What criterion
will we use to pick a particular model from the model class?
6. Algorithm: What computational process will be used to fit the model to the data
and/or to make predictions?
Without making some assumptions about the nature of the process generating the data, we
cannot perform generalization.
The introductory section of the course will lay out the spaces of these characteristics
and illustrate them with simple (but not always easy!) examples. Then, in later sections,
we will refer back to these spaces to characterize new problems and approaches that we
introduce.
2 Problem class
There are many different problem classes in machine learning. They vary according to what
kind of data is provided and what kind of conclusions are to be drawn from it. We will
go through several of the standard problem classes here, and establish some notation and
terminology.
2.1 Supervised learning
2.1.1 Classification
Training data D is in the form of a set of pairs {(x(1) , y(1) ), . . . , (x(n) , y(n) )} where x(i) represents an object to be classified, most typically a D-dimensional vector of real and/or discrete values, and y(i) is an element of a discrete set of values. The y values are sometimes
called target values.
A classification problem is binary or two-class if y(i) is drawn from a set of two possible
values; otherwise, it is called multi-class.
The goal in a classification problem is ultimately, given a new input value x(n+1) , to
predict the value of y(n+1) .
Classification problems are a kind of supervised learning, because the desired output (or
class) y(i) is specified for each of the training examples x(i) .
We typically assume that the elements of D are independent and identically distributed
according to some distribution Pr(X, Y).
2.1.2 Regression

Regression is like classification, except that the target value y(i) is a real value (or a vector of real values), rather than an element of a discrete set.
2.2 Unsupervised learning
2.2.1 Density estimation
Given samples y(1), . . . , y(n) ∈ R^D drawn IID from some distribution Pr(Y), the goal is to predict the probability Pr(y(n+1)) of an element drawn from the same distribution. Density estimation often plays a role as a subroutine in the overall learning method for supervised learning, as well.
This is a type of unsupervised learning, because it doesn't involve learning a function from inputs to outputs based on a set of input-output pairs.
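As a concrete illustration, here is a minimal sketch of density estimation under an assumed model class of one-dimensional Gaussians, fit by maximum likelihood; the function names and data are our own, not from any particular library.

```python
import math

def fit_gaussian(ys):
    """Maximum likelihood fit of a 1-D Gaussian: sample mean and (biased) variance."""
    n = len(ys)
    mu = sum(ys) / n
    var = sum((y - mu) ** 2 for y in ys) / n
    return mu, var

def density(y, mu, var):
    """Estimated probability density Pr(y) under the fitted Gaussian."""
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

samples = [2.1, 1.9, 2.0, 2.2, 1.8]
mu, var = fit_gaussian(samples)
# The fitted density is highest near the sample mean.
assert density(mu, mu, var) > density(mu + 1.0, mu, var)
```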
2.2.2 Clustering
Given samples x(1), . . . , x(n) ∈ R^D, the goal is to find a partition of the samples that groups
together samples that are similar. There are many different objectives, depending on the
definition of the similarity between samples and exactly what criterion is to be used (e.g.,
minimize the average distance between elements inside a cluster and maximize the average distance between elements across clusters). Other methods perform a soft clustering,
in which samples may be assigned 0.9 membership in one cluster and 0.1 in another. Clustering is sometimes used as a step in density estimation, and sometimes to find useful
structure in data. This is also unsupervised learning.
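To make the objective concrete, the following sketch scores a hard partition of one-dimensional points by the two criteria mentioned above, average within-cluster distance and average across-cluster distance; the helper names and data are illustrative, not standard.

```python
from itertools import combinations

def avg_within(clusters):
    """Average distance between pairs of samples inside the same cluster."""
    pairs = [(a, b) for c in clusters for a, b in combinations(c, 2)]
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

def avg_across(clusters):
    """Average distance between samples assigned to different clusters."""
    dists = [abs(a - b)
             for c1, c2 in combinations(clusters, 2)
             for a in c1 for b in c2]
    return sum(dists) / len(dists)

# A "good" partition of 1-D points: tight groups, far apart.
good = [[0.0, 0.1, 0.2], [5.0, 5.1, 5.2]]
# A "bad" partition mixes the two groups together.
bad = [[0.0, 5.0, 0.2], [0.1, 5.1, 5.2]]
assert avg_within(good) < avg_within(bad)
assert avg_across(good) > avg_across(bad)
```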
2.2.3 Dimensionality reduction
Given samples x(1), . . . , x(n) ∈ R^D, the problem is to re-represent them as points in a d-dimensional space, where d < D. The goal is typically to retain information in the data set
that will, e.g., allow elements of one class to be discriminated from another.
Standard dimensionality reduction is particularly useful for visualizing or understanding high-dimensional data. If the goal is ultimately to perform regression or classification
on the data after the dimensionality is reduced, it is usually best to articulate an objective
for the overall prediction problem rather than to first do dimensionality reduction without
knowing which dimensions will be important for the prediction task.
2.3 Reinforcement learning
In reinforcement learning, the goal is to learn a mapping from input values x to output
values y, but without a direct supervision signal to specify which output values y are
best for a particular input. There is no training set specified a priori. Instead, the learning
problem is framed as an agent interacting with an environment, in the following setting:
- The agent observes the current state, x(0).
- It selects an action, y(0).
- It receives a reward, r(0), which depends on x(0) and possibly y(0).
- The environment transitions probabilistically to a new state, x(1), with a distribution that depends only on x(0) and y(0).
- The agent observes the current state, x(1).
- ...
The goal is to find a policy π, mapping x to y (that is, states to actions), such that some long-term sum or average of rewards r is maximized.
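The interaction loop above can be sketched generically in code; the loop structure follows the steps listed, but the toy chain environment is our own invention for illustration.

```python
import random

def run_episode(policy, transition, reward, x0, steps=10, rng=None):
    """Generic agent-environment loop: observe the state, select an action,
    receive a reward, then the environment transitions to a new state."""
    rng = rng or random.Random(0)
    x, total = x0, 0.0
    for _ in range(steps):
        y = policy(x)              # agent selects an action based on the state
        total += reward(x, y)      # reward depends on the state (and action)
        x = transition(x, y, rng)  # stochastic transition from (x, y)
    return total

# Toy chain (assumed, for illustration): states are integers, the action is
# +1 or -1, moves succeed with probability 0.9, and the agent is rewarded
# whenever it is at state 0.
policy = lambda x: -1 if x > 0 else 1
reward = lambda x, y: 1.0 if x == 0 else 0.0
transition = lambda x, y, rng: x + y if rng.random() < 0.9 else x
```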
This setting is very different from either supervised learning or unsupervised learning, because the agent's action choices affect both its reward and its ability to observe the environment. It requires careful consideration of the long-term effects of actions, as well as all of the other issues that pertain to supervised learning.
2.4 Other settings
3 Assumptions
The kinds of assumptions that we can make about the data source or the solution include:
- The data are independent and identically distributed.
- The data are generated by a Markov chain.
- The process generating the data might be adversarial.
- The "true" model that is generating the data can be perfectly described by one of some particular set of hypotheses.
4 Evaluation criteria
Once we have specified a problem class, we need to say what makes an output or the
answer to a query good, given the training data. We specify evaluation criteria at two
levels: how an individual prediction is scored, and how the overall behavior of the system
is scored.
The quality of predictions from a learned prediction rule is often expressed in terms of
a loss function. A loss function L(g, a) tells you how much you will be penalized for making
a guess g when the answer is actually a. There are many possible loss functions. Here are
some:
0-1 loss applies to predictions drawn from finite domains.

    L(g, a) = 0 if g = a, and 1 otherwise.

Squared loss

    L(g, a) = (g - a)^2

Linear loss

    L(g, a) = |g - a|
Asymmetric loss Consider a situation in which you are trying to predict whether someone is having a heart attack. It might be much worse to predict no when the answer is really yes, than the other way around.

    L(g, a) = 1 if g = 1 and a = 0; 10 if g = 0 and a = 1; 0 otherwise.
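These loss functions are simple enough to transcribe directly; the code below is just a restatement of the definitions above.

```python
def zero_one_loss(g, a):
    """0-1 loss: every wrong guess is penalized equally."""
    return 0 if g == a else 1

def squared_loss(g, a):
    """Squared loss: penalty grows quadratically with the error."""
    return (g - a) ** 2

def linear_loss(g, a):
    """Linear loss: penalty grows with the absolute error."""
    return abs(g - a)

def asymmetric_loss(g, a):
    """Missing a real heart attack (g=0, a=1) is 10x worse than a false alarm."""
    if g == 1 and a == 0:
        return 1
    if g == 0 and a == 1:
        return 10
    return 0

assert zero_one_loss("cat", "dog") == 1
assert squared_loss(3.0, 1.0) == 4.0
assert asymmetric_loss(0, 1) == 10 * asymmetric_loss(1, 0)
```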
Any given prediction rule will usually be evaluated based on multiple predictions and the loss of each one. At this level, we might be interested in:

- Minimizing expected loss over all the predictions (also known as risk)
- Minimizing maximum loss: the loss of the worst prediction
- Minimizing or bounding regret: how much worse this predictor performs than the best one drawn from some class
- Characterizing asymptotic behavior: how well the predictor will perform in the limit of infinite training data
- Finding algorithms that are probably approximately correct: they probably generate a hypothesis that is right most of the time.
There is a theory of rational agency that argues that you should always select the action that minimizes the expected loss. This strategy will, for example, make you the most money in the long run, in a gambling setting. Expected loss is also sometimes called risk in the machine-learning literature, but that term means other things in economics and other parts of decision theory, so be careful: it's risky to use it. We will, most of the time, concentrate on this criterion.
5 Model type
5.1 No model
In some simple cases, we can generate answers to queries directly from the training data, without the construction of any intermediate model.
In our simple prediction problem, we might just decide to predict the mean or median of the y(1), . . . , y(n) because it seems like it might be a good idea. In regression or classification, we might generate an answer to a new query by using the answers associated with nearby training examples, as in the nearest neighbor method.
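For instance, a nearest neighbor predictor for one-dimensional inputs can answer a query directly from the training pairs, with no fitted model in between; this is a minimal sketch with made-up data, not a library routine.

```python
def nearest_neighbor_predict(train, x_new):
    """Predict y for x_new by copying the label of the closest training input.
    train is a list of (x, y) pairs with 1-D inputs; distance is |x - x'|."""
    x_best, y_best = min(train, key=lambda xy: abs(xy[0] - x_new))
    return y_best

# Hypothetical labeled training data.
train = [(0.0, "blue"), (1.0, "blue"), (5.0, "red")]
assert nearest_neighbor_predict(train, 0.4) == "blue"
assert nearest_neighbor_predict(train, 4.2) == "red"
```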
5.2 Prediction rule

In the prediction rule approach, we use the training data to select a hypothesis h(·; θ) from some parametric class, and then apply h directly to new queries. A standard criterion is to choose θ to maximize the negated average training loss (that is, to minimize the empirical risk):

    score(θ) = -(1/n) ∑_{i=1}^n L(h(x(i); θ), y(i)) .
In our simple estimation problem, the hypothesis would be a single real value, so the distinction between this case and the no model case is not as clear as it will be in the case of real regression or classification problems. However, we might still formulate the problem as one of fitting a model by seeking the value θ_erm that minimizes empirical risk. In the case of squared loss, we would select θ_erm to maximize

    score(θ) = -(1/n) ∑_{i=1}^n (y(i) - θ)^2 .
The maximizing value is the sample mean:

    θ_erm = (1/n) ∑_{i=1}^n y(i) .
Having selected the hypothesis h(·; θ_erm) that minimizes the empirical risk on our data, θ_erm will be our predicted value. That is, h(x; θ_erm) = θ_erm.
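We can check numerically that the sample mean maximizes the score, by comparing it against a grid of candidate values of θ; the data and grid here are our own illustration.

```python
def score(theta, ys):
    """Negated empirical risk under squared loss for the constant hypothesis theta."""
    return -sum((y - theta) ** 2 for y in ys) / len(ys)

ys = [1.0, 2.0, 6.0]
theta_erm = sum(ys) / len(ys)   # the sample mean, 3.0

# The mean scores at least as well as any candidate on a fine grid over [0, 8].
candidates = [i / 100 for i in range(0, 801)]
best = max(candidates, key=lambda t: score(t, ys))
assert abs(best - theta_erm) < 1e-9
```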
We will find that minimizing empirical risk is often not a good choice: it is possible to
emphasize fitting the current data too strongly and end up with a hypothesis that does not
generalize well when presented with new x values.
The prediction-rule approach is also sometimes called a discriminative approach, because, when applied to a classification problem, it attempts to directly learn a function that
discriminates between the classes.
5.3 Probabilistic model
In the prediction rule approach, we learn a model that is used directly to select a predicted
value. In the probabilistic model approach, we:
1. Fit a probabilistic model to the training data; then
2. Use decision-theoretic reasoning to combine the probabilistic model with a loss function to select the best prediction.
5.3.1 Fitting a probabilistic model
For a regression or classification problem, we might choose to fit a joint distribution Pr(X, Y; θ) or a conditional distribution Pr(Y | X; θ). The problem of fitting probabilistic models to data is treated at length in the statistics literature. There are many different criteria that one can use to fit a probabilistic model to data.
Probably the most common objective is to find the model that maximizes the likelihood of the data from among some parametric class of models; this is known as the maximum likelihood model. It is found by maximizing the data likelihood,
    score(θ) = ∏_{i=1}^n Pr(x(i), y(i); θ) .
This is called a generative model, because the model describes how the entire data set is
generated. For classification problems, when y is an element of a discrete set of possible
values, it is often easiest to learn a factored model, of the form
    Pr(X, Y; θ) = Pr(X | Y; θ1) Pr(Y; θ2) ,

where θ = (θ1, θ2).
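As a small worked example (ours, not from the notes): for IID coin flips modeled as Bernoulli(θ), maximizing the log of the data likelihood over a grid of candidate parameters recovers the empirical frequency of heads.

```python
import math

def log_likelihood(theta, ys):
    """Log of the data likelihood for IID Bernoulli samples (ys are 0/1)."""
    return sum(math.log(theta if y == 1 else 1 - theta) for y in ys)

ys = [1, 1, 1, 0, 1, 0, 1, 1]                   # 6 heads out of 8 flips
candidates = [i / 1000 for i in range(1, 1000)]  # grid over (0, 1)
theta_ml = max(candidates, key=lambda t: log_likelihood(t, ys))
# The maximum likelihood estimate is the empirical frequency, 6/8 = 0.75.
assert abs(theta_ml - 0.75) < 1e-9
```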
Another approach is to fit a discriminative model by maximizing the conditional likelihood,
    score(θ) = ∏_{i=1}^n Pr(y(i) | x(i); θ) .
We will explore the trade-offs between using generative and discriminative models in more
detail in later sections. In general, we find that generative models can be learned somewhat
more efficiently and are highly effective if their modeling assumptions match the process
that generated the data. Discriminative models generally give the best performance on
the prediction task in the limit of a large amount of training data, because they focus their
modeling effort based on the problem of making the prediction and not on modeling the
distribution over possible query points x(n+1) .
We can't take this approach any further until we consider the particular class of models we are going to fit. We'll do some examples in section 6.
5.3.2 Decision theoretic prediction
Now, imagine that you have found some parameters θ, so that you have a probability distribution Pr(Y | X; θ) or Pr(X, Y; θ). How can you combine this with a loss function to make predictions?
The optimal prediction, in terms of minimizing expected loss, g*, for our simple prediction problem, is

    g* = arg min_g ∫ Pr(y; θ) L(g, y) dy .
In general, we have to know both the form of the probabilistic model Pr(y | x; θ) and the loss function L(g, a). But we can get some insight into the optimal decision problem in the simple but very prevalent special case of squared error loss. In this case, for the simple prediction problem, we have:

    g* = arg min_g ∫ Pr(y; θ) (g - y)^2 dy .
We can optimize this by taking the derivative of the risk with respect to g, setting it to zero, and solving for g. In this case, we have

    d/dg ∫ Pr(y; θ) (g - y)^2 dy = 0
    ∫ 2 Pr(y; θ) (g - y) dy = 0
    ∫ Pr(y; θ) g dy = ∫ Pr(y; θ) y dy
    g = E[y; θ] ,

where the last step uses ∫ Pr(y; θ) dy = 1.
Of course, for different choices of loss function, the risk-minimizing prediction will have a
different form.
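The conclusion g = E[y; θ] is easy to verify numerically, using a small discrete distribution of our own invention standing in for Pr(y; θ).

```python
def risk(g, dist):
    """Expected squared loss of guess g under a discrete distribution {y: Pr(y)}."""
    return sum(p * (g - y) ** 2 for y, p in dist.items())

# A made-up distribution over three possible y values.
dist = {0.0: 0.2, 1.0: 0.5, 4.0: 0.3}
mean = sum(p * y for y, p in dist.items())   # E[y] = 1.7

# Minimize the risk over a fine grid of guesses: the minimizer is the mean.
candidates = [i / 100 for i in range(0, 401)]
g_star = min(candidates, key=lambda g: risk(g, dist))
assert abs(g_star - mean) < 1e-9
```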
5.3.3 Benefits of using a probabilistic model
This approach allows us to separate the probability model from the loss function; so we
could learn one model and use it in different decision-making contexts. Models that make
probabilistic predictions can be easily combined using probabilistic inference: it is much
harder to know how to combine multiple direct prediction rules.
5.4 Distribution over models
In the distribution over models approach, which we'll often call the Bayesian approach, we treat the model parameters θ as random variables, and use probabilistic inference to update a distribution over θ based on the observed data D. We:

1. Begin with a prior distribution over possible probabilistic models of the data, Pr(θ);
2. Perform Bayesian inference to combine the data with the prior distribution to find a posterior distribution over models of the data, Pr(θ | D);
3. (Typically) integrate out the posterior distribution over models in the context of decision-theoretic reasoning, using a loss function to select the best prediction.
In the Bayesian approach, we do not have a single model that we will use to make predictions: instead, we will integrate over possible models to select the minimum expected
loss prediction. For the simple prediction problem this is:
    g* = arg min_g ∫∫ Pr(y | θ) Pr(θ | D) L(g, y) dy dθ .
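As a sketch of these steps, assuming a Bernoulli model for y and squared loss, the Bayesian computation can be carried out on a discrete grid of θ values; the grid, data, and helper names here are our own.

```python
def posterior(thetas, prior, data):
    """Grid Bayes rule: Pr(theta | D) is proportional to Pr(D | theta) Pr(theta)."""
    def likelihood(t):
        p = 1.0
        for y in data:
            p *= t if y == 1 else 1 - t
        return p
    w = [likelihood(t) * pr for t, pr in zip(thetas, prior)]
    z = sum(w)
    return [wi / z for wi in w]

def bayes_predict(thetas, post):
    """Under squared loss, the minimum expected loss guess is the posterior
    predictive mean: the sum over theta of theta * Pr(theta | D)."""
    return sum(t * p for t, p in zip(thetas, post))

thetas = [i / 100 for i in range(1, 100)]       # grid over (0, 1)
prior = [1 / len(thetas)] * len(thetas)         # uniform prior over the grid
post = posterior(thetas, prior, [1, 1, 1, 0])   # observed three 1s and one 0
g = bayes_predict(thetas, post)
# With a uniform prior, this approximates Laplace's rule: (3 + 1) / (4 + 2).
```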