
6.867 Lecture Notes: Section 1: Introduction

Reading
Lecture 1: Murphy: Chapters 1 (Introduction), 2 (good probability review; pay attention to the multinomial, Gaussian, Beta, and Dirichlet distributions)

Contents

1 Intro
2 Problem class
   2.1 Supervised learning
       2.1.1 Classification
       2.1.2 Regression
   2.2 Unsupervised learning
       2.2.1 Density estimation
       2.2.2 Clustering
       2.2.3 Dimensionality reduction
   2.3 Reinforcement learning
   2.4 Other settings
3 Assumptions
4 Evaluation criteria
5 Model type
   5.1 No model
   5.2 Prediction rule
   5.3 Probabilistic model
       5.3.1 Fitting a probabilistic model
       5.3.2 Decision theoretic prediction
       5.3.3 Benefits of using a probabilistic model
   5.4 Distribution over models
6 Model class and parameter fitting
   6.1 Fitting maximum-likelihood probabilistic models
       6.1.1 Gaussian with fixed variance
       6.1.2 Gaussian
       6.1.3 Uniform over a continuous interval
       6.1.4 Uniform over a finite set of points
       6.1.5 Mixture of Gaussians
       6.1.6 Maximum likelihood and binary data
   6.2 Model selection over classes of probabilistic models
       6.2.1 Bias-Variance decomposition
       6.2.2 Regularization and model selection
   6.3 Bayesian inference from prior to posterior over models
       6.3.1 Bernoulli/Bernoulli
       6.3.2 Beta/Binomial
       6.3.3 Gaussian with fixed variance
       6.3.4 Finding conjugate families
   6.4 Bayesian model selection
7 Algorithm
8 Recap


1 Intro

The main focus of machine learning is making decisions or predictions based on data. There
are a number of other fields with significant overlap in technique, but difference in focus:
in economics and psychology, the goal is to discover underlying causal processes, and in
statistics it is to find a model that fits a data set well. In those fields, the end product is a
model. In machine learning, we often fit models, but as a means to the end of making good
predictions.
Generally, this is done in two stages:
1. Learn or estimate a model from the data
2. Apply the model to make predictions or answer queries
Problem of induction: Why do we think that previously seen data will help us predict
the future? This is a serious philosophical problem of long standing. We will operationalize
it by making assumptions, such as that all training data are IID (independent and identically distributed) and that queries will be drawn from the same distribution as the training
data, or that the answer comes from a set of possible answers known in advance.
Two important issues that come up are:
statistical inference: How do we deal with the fact that, for example, the same treatment may end up with different results on different trials? How can we predict how
well an estimate may compare to future results?
generalization: How can we predict results of a situation or experiment that we have
never encountered before in our data set?
We can characterize problems and their solutions using six characteristics, three of
which characterize the problem and three of which characterize the solution:
1. Problem class: What is the nature of the training data and what kinds of queries will
be made at testing time?
2. Assumptions: What do we know about the source of the data or the form of the
solution?
3. Evaluation criteria: What is the goal of the system? How will the answers to individual queries be evaluated? How will the overall performance of the system be
measured?
4. Model type: Will an intermediate model be made? What aspects of the data will be
modeled? How will the model be used to make predictions?
5. Model class: What particular parametric class of models will be used? What criterion
will we use to pick a particular model from the model class?
6. Algorithm: What computational process will be used to fit the model to the data
and/or to make predictions?
Without making some assumptions about the nature of the process generating the data, we
cannot perform generalization.
The introductory section of the course will lay out the spaces of these characteristics
and illustrate them with simple (but not always easy!) examples. Then, in later sections,
we will refer back to these spaces to characterize new problems and approaches that we
introduce.

(This story is paraphrased from a 9/4/12 post at andrewgelman.com.)


2 Problem class

There are many different problem classes in machine learning. They vary according to what
kind of data is provided and what kind of conclusions are to be drawn from it. We will
go through several of the standard problem classes here, and establish some notation and
terminology.

2.1 Supervised learning

2.1.1 Classification

Training data D is in the form of a set of pairs {(x^{(1)}, y^{(1)}), ..., (x^{(n)}, y^{(n)})} where x^{(i)} represents an object to be classified, most typically a D-dimensional vector of real and/or discrete values, and y^{(i)} is an element of a discrete set of values. The y values are sometimes called target values.
A classification problem is binary or two-class if y^{(i)} is drawn from a set of two possible values; otherwise, it is called multi-class.
The goal in a classification problem is ultimately, given a new input value x^{(n+1)}, to predict the value of y^{(n+1)}.
Classification problems are a kind of supervised learning, because the desired output (or class) y^{(i)} is specified for each of the training examples x^{(i)}.
We typically assume that the elements of D are independent and identically distributed according to some distribution Pr(X, Y).
2.1.2 Regression

Regression is like classification, except that y^{(i)} ∈ R^k.

2.2 Unsupervised learning

2.2.1 Density estimation

Given samples y^{(1)}, ..., y^{(n)} ∈ R^D drawn IID from some distribution Pr(Y), the goal is to predict the probability Pr(y^{(n+1)}) of an element drawn from the same distribution. Density estimation often plays a role as a subroutine in the overall learning method for supervised learning, as well.
This is a type of unsupervised learning, because it doesn't involve learning a function from inputs to outputs based on a set of input-output pairs.
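As a minimal illustration (with made-up samples, and assuming a univariate Gaussian model class of the sort treated in section 6.1.2), the following Python sketch fits a density and evaluates it at a new point:

    import numpy as np

    # Samples y^(1), ..., y^(n), assumed drawn IID from some unknown Pr(Y).
    y = np.array([2.1, 1.9, 2.4, 2.0, 2.2, 1.8])

    # Maximum-likelihood Gaussian fit (see section 6.1.2): sample mean and
    # variance (np.var divides by n by default, which is the ML estimate).
    mu, var = y.mean(), y.var()

    def density(y_new):
        # estimated Pr(y^(n+1)) under the fitted Gaussian
        return np.exp(-(y_new - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    print(density(2.05))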
2.2.2 Clustering

Given samples x^{(1)}, ..., x^{(n)} ∈ R^D, the goal is to find a partition of the samples that groups together samples that are similar. There are many different objectives, depending on the definition of the similarity between samples and exactly what criterion is to be used (e.g., minimize the average distance between elements inside a cluster and maximize the average distance between elements across clusters). Other methods perform a soft clustering, in which samples may be assigned 0.9 membership in one cluster and 0.1 in another. Clustering is sometimes used as a step in density estimation, and sometimes to find useful structure in data. This is also unsupervised learning.
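As one concrete instance of these objectives, here is a minimal k-means-style sketch in Python (made-up data; k-means is only one of many clustering methods and optimizes one particular within-cluster distance criterion):

    import numpy as np

    def kmeans(X, k, iters=20, seed=0):
        # Alternate between assigning each sample to its nearest center and
        # recomputing each center as the mean of its assigned samples.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            z = dists.argmin(axis=1)          # hard cluster assignments
            for j in range(k):
                if np.any(z == j):
                    centers[j] = X[z == j].mean(axis=0)
        return z, centers

    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
    z, centers = kmeans(X, k=2)
    print(z)  # two groups, e.g. [0 0 1 1] (cluster labels are arbitrary)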

(Our textbook uses x_i and y_i instead of x^{(i)} and y^{(i)}. We find that notation somewhat difficult to manage when x^{(i)} is itself a vector and we need to talk about its elements. The notation we are using is standard in some other parts of the machine-learning literature.)

2.2.3 Dimensionality reduction

Given samples x^{(1)}, ..., x^{(n)} ∈ R^D, the problem is to re-represent them as points in a d-dimensional space, where d < D. The goal is typically to retain information in the data set that will, e.g., allow elements of one class to be discriminated from another.
Standard dimensionality reduction is particularly useful for visualizing or understanding high-dimensional data. If the goal is ultimately to perform regression or classification on the data after the dimensionality is reduced, it is usually best to articulate an objective for the overall prediction problem rather than to first do dimensionality reduction without knowing which dimensions will be important for the prediction task.
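As a sketch of one standard method (principal components analysis, which keeps the d directions of maximum variance; the notes do not commit to a particular technique here):

    import numpy as np

    def pca_project(X, d):
        # Center the data, then project onto the top-d right singular
        # vectors (the d directions of maximum variance).
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:d].T

    X = np.random.default_rng(0).normal(size=(100, 10))  # n = 100, D = 10
    Z = pca_project(X, d=2)
    print(Z.shape)  # (100, 2)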

2.3 Reinforcement learning

In reinforcement learning, the goal is to learn a mapping from input values x to output values y, but without a direct supervision signal to specify which output values y are best for a particular input. There is no training set specified a priori. Instead, the learning problem is framed as an agent interacting with an environment, in the following setting:

- The agent observes the current state, x^{(0)}.
- It selects an action, y^{(0)}.
- It receives a reward, r^{(0)}, which depends on x^{(0)} and possibly y^{(0)}.
- The environment transitions probabilistically to a new state, x^{(1)}, with a distribution that depends only on x^{(0)} and y^{(0)}.
- The agent observes the current state, x^{(1)}.
- ...

The goal is to find a policy π, mapping x to y (that is, states to actions), such that some long-term sum or average of rewards r is maximized.
This setting is very different from either supervised learning or unsupervised learning, because the agent's action choices affect both its reward and its ability to observe the environment. It requires careful consideration of the long-term effects of actions, as well as all of the other issues that pertain to supervised learning.
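The interaction protocol above can be written as a simple loop. In this Python skeleton, env and policy are hypothetical stand-ins (not part of the notes): env.reset() returns an initial state, env.step(y) returns a reward and a next state, and policy maps states to actions.

    def run_episode(env, policy, horizon=100):
        total_reward = 0.0
        x = env.reset()              # observe the initial state x^(0)
        for t in range(horizon):
            y = policy(x)            # select an action y^(t)
            r, x = env.step(y)       # receive reward r^(t), observe x^(t+1)
            total_reward += r
        return total_reward          # the quantity a good policy makes large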

2.4 Other settings

There are many other problem settings. Here are a few.


In semi-supervised learning, we have a supervised-learning training set, but there may be an additional set of x^{(i)} values with no known y^{(i)}. These values can still be used to improve learning performance if they are drawn from the Pr(X) that is the marginal of the Pr(X, Y) that governs the rest of the data set.
In active learning, it is assumed to be expensive to acquire a label y^{(i)} (imagine asking a human to read an x-ray image), so the learning algorithm can sequentially ask for particular inputs x^{(i)} to be labeled, and must carefully select queries in order to learn as effectively as possible while minimizing the cost of labeling.
In transfer learning, there are multiple tasks, with data drawn from different, but related, distributions. The goal is for experience with previous tasks to apply to learning a current task in a way that requires decreased experience with the new task.


3 Assumptions

The kinds of assumptions that we can make about the data source or the solution include:

- The data are independent and identically distributed.
- The data are generated by a Markov chain.
- The process generating the data might be adversarial.
- The true model that is generating the data can be perfectly described by one of some particular set of hypotheses.

4 Evaluation criteria

Once we have specified a problem class, we need to say what makes an output or the
answer to a query good, given the training data. We specify evaluation criteria at two
levels: how an individual prediction is scored, and how the overall behavior of the system
is scored.
The quality of predictions from a learned prediction rule is often expressed in terms of
a loss function. A loss function L(g, a) tells you how much you will be penalized for making
a guess g when the answer is actually a. There are many possible loss functions. Here are
some:
0-1 loss applies to predictions drawn from finite domains.

L(g, a) = \begin{cases} 0 & \text{if } g = a \\ 1 & \text{otherwise} \end{cases}

(If the actual values are drawn from a continuous distribution, the probability they would ever be equal to some predicted g is 0, except for some weird cases.)

Squared loss

L(g, a) = (g - a)^2

Linear loss

L(g, a) = |g - a|

Asymmetric loss Consider a situation in which you are trying to predict whether someone is having a heart attack. It might be much worse to predict no when the answer is really yes, than the other way around.

L(g, a) = \begin{cases} 1 & \text{if } g = 1 \text{ and } a = 0 \\ 10 & \text{if } g = 0 \text{ and } a = 1 \\ 0 & \text{otherwise} \end{cases}
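These four losses are direct to transcribe into code; here is a Python version (the function names are ours):

    def zero_one_loss(g, a):
        return 0 if g == a else 1

    def squared_loss(g, a):
        return (g - a) ** 2

    def linear_loss(g, a):
        return abs(g - a)

    def asymmetric_loss(g, a):
        # missing a heart attack (g = 0, a = 1) is 10x worse than a false alarm
        if g == 1 and a == 0:
            return 1
        if g == 0 and a == 1:
            return 10
        return 0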

Any given prediction rule will usually be evaluated based on multiple predictions and the loss of each one. At this level, we might be interested in:

- Minimizing expected loss over all the predictions (also known as risk)
- Minimizing maximum loss: the loss of the worst prediction
- Minimizing or bounding regret: how much worse this predictor performs than the best one drawn from some class
- Characterizing asymptotic behavior: how well the predictor will perform in the limit of infinite training data



- Finding algorithms that are probably approximately correct: they probably generate a hypothesis that is right most of the time.
There is a theory of rational agency that argues that you should always select the action that minimizes the expected loss. This strategy will, for example, make you the most money in the long run, in a gambling setting. Expected loss is also sometimes called risk in the machine-learning literature, but that term means other things in economics or other parts of decision theory, so be careful... it's risky to use it. We will, most of the time, concentrate on this criterion.

(Of course, there are other models for action selection, and it's clear that people do not always, or maybe even often, select actions that follow this rule.)

5 Model type

In this section, we'll examine the role of model-making in machine learning.
We'll frequently use a simple estimation problem, which we can think of as regression, but without any input value. Assume that there is a single numeric value y to be predicted. We are given training examples D = {y^{(1)}, ..., y^{(n)}} and need to predict y^{(n+1)}.

5.1 No model

In some simple cases, we can generate answers to queries directly from the training data, without the construction of any intermediate model.
In our simple prediction problem, we might just decide to predict the mean or median of the y^{(1)}, ..., y^{(n)} because it seems like it might be a good idea. In regression or classification, we might generate an answer to a new query by averaging answers to nearby queries, as in the nearest neighbor method.
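For the simple estimation problem, the no-model approach is one line of code per predictor. With made-up data containing an outlier, the two choices can differ substantially:

    import numpy as np

    y = np.array([3.0, 4.0, 100.0, 5.0, 4.5])  # training values y^(1), ..., y^(n)

    print(np.mean(y))    # 23.3 -- pulled far toward the outlier
    print(np.median(y))  # 4.5  -- robust to it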

5.2 Prediction rule

It is more typical to use a two-step process:

1. Fit a model to the training data
2. Use the model directly to make predictions

In the prediction rule setting of regression or classification, the model will be some hypothesis or prediction rule y = h(x; θ) for some functional form h. The idea is that θ is a vector of one or more parameter values that will be determined by fitting the model to the training data and then be held fixed. Given a new x^{(n+1)}, we would then make the prediction h(x^{(n+1)}; θ).
The fitting process is usually articulated as an optimization problem: find a value of θ that maximizes score(θ; D). As we will explore further in the next section, one optimal strategy, if we knew the actual underlying distribution Pr(X, Y), would be to predict the value of y that minimizes the expected loss, which is also known as the risk. If we don't have that actual underlying distribution, or even an estimate of it, we can take the approach of minimizing the empirical risk: that is, finding the prediction rule h that minimizes the average loss on our training data set. So, we would seek θ that maximizes

score(θ) = -\frac{1}{n} \sum_{i=1}^{n} L(h(x^{(i)}; θ), y^{(i)}) .

(We write f(a; b) to describe a function that is usually applied to a single argument a, but is a member of a parametric family of functions, with the particular function determined by parameter value b. So, for example, we might write h(x; p) = x^p to describe a function of a single argument that is parameterized by p.)

In our simple estimation problem, the hypothesis would be a single real value, so the distinction between this case and the no model case is not as clear as it will be in the case of real regression or classification problems. However, we might still formulate the problem as one of fitting a model by seeking the value θ_erm that minimizes empirical risk. In the case of squared loss, we would select θ_erm to maximize

score(θ) = -\frac{1}{n} \sum_{i=1}^{n} (y^{(i)} - θ)^2 .

The value of θ that maximizes this criterion is

θ_erm = \frac{1}{n} \sum_{i=1}^{n} y^{(i)} .

Having selected the hypothesis h(·; θ_erm) to minimize the empirical risk on our data, θ_erm will be our predicted value. That is, h(·; θ_erm) = θ_erm.
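We can check this numerically: with made-up data, a brute-force search over θ for the squared-loss score lands on the sample mean (a sketch, with a grid standing in for a real optimizer):

    import numpy as np

    y = np.array([3.0, 4.0, 100.0, 5.0, 4.5])

    def score(theta):
        return -np.mean((y - theta) ** 2)   # negated empirical risk

    thetas = np.linspace(y.min(), y.max(), 10001)
    best = thetas[np.argmax([score(t) for t in thetas])]
    print(best, y.mean())   # both ~23.3: the grid search finds the sample mean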
We will find that minimizing empirical risk is often not a good choice: it is possible to
emphasize fitting the current data too strongly and end up with a hypothesis that does not
generalize well when presented with new x values.
The prediction-rule approach is also sometimes called a discriminative approach, because, when applied to a classification problem, it attempts to directly learn a function that
discriminates between the classes.

5.3 Probabilistic model

In the prediction rule approach, we learn a model that is used directly to select a predicted
value. In the probabilistic model approach, we:
1. Fit a probabilistic model to the training data; then
2. Use decision-theoretic reasoning to combine the probabilistic model with a loss function to select the best prediction.
5.3.1 Fitting a probabilistic model

For a regression or classification problem we might choose to fit a joint distribution Pr(X, Y; θ) or a conditional distribution Pr(Y | X; θ). The problem of fitting probabilistic models to data is treated at length in the statistics literature. There are many different criteria that one can use to fit a probabilistic model to data.
Probably the most common objective is to find the model that maximizes the likelihood of the data from among some parametric class of models; this is known as the maximum likelihood model. It is found by maximizing the data likelihood,

score(θ) = \prod_{i=1}^{n} Pr(x^{(i)}, y^{(i)}; θ) .

This is called a generative model, because the model describes how the entire data set is generated. For classification problems, when y is an element of a discrete set of possible values, it is often easiest to learn a factored model, of the form

Pr(X, Y; θ) = Pr(X | Y; θ_1) Pr(Y; θ_2) ,

where θ = (θ_1, θ_2).
Another approach is to fit a discriminative model by maximizing the conditional likelihood,

score(θ) = \prod_{i=1}^{n} Pr(y^{(i)} | x^{(i)}; θ) .


We will explore the trade-offs between using generative and discriminative models in more detail in later sections. In general, we find that generative models can be learned somewhat more efficiently and are highly effective if their modeling assumptions match the process that generated the data. Discriminative models generally give the best performance on the prediction task in the limit of a large amount of training data, because they focus their modeling effort on the problem of making the prediction and not on modeling the distribution over possible query points x^{(n+1)}.
We can't take this approach any further until we consider the particular class of models we are going to fit. We'll do some examples in section 6.
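As a small preview of section 6, here is a sketch that fits the factored generative model above by maximum likelihood, with made-up one-dimensional data, Gaussian class-conditionals standing in for Pr(X | Y; θ_1), and a Bernoulli class prior for Pr(Y; θ_2); Bayes' rule then turns the fitted model into a prediction:

    import numpy as np

    # Made-up one-dimensional inputs with binary labels.
    x = np.array([1.0, 1.2, 0.8, 3.1, 2.9, 3.3])
    y = np.array([0, 0, 0, 1, 1, 1])

    # ML estimates: theta_2 is the empirical Pr(Y = 1); theta_1 is a
    # class-conditional Gaussian (mean, variance) for each class.
    p1 = y.mean()
    mu = {c: x[y == c].mean() for c in (0, 1)}
    var = {c: x[y == c].var() for c in (0, 1)}

    def gauss(xq, m, v):
        return np.exp(-(xq - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

    def pr_y1_given_x(xq):
        # Bayes' rule applied to the fitted generative model
        a = gauss(xq, mu[1], var[1]) * p1
        b = gauss(xq, mu[0], var[0]) * (1 - p1)
        return a / (a + b)

    print(pr_y1_given_x(2.5))  # near 1: 2.5 lies close to the class-1 samples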
5.3.2 Decision theoretic prediction

Now, imagine that you have found some parameters θ, so that you have a probability distribution Pr(Y | X; θ) or Pr(X, Y; θ). How can you combine this with a loss function to make predictions?
The optimal prediction, in terms of minimizing expected loss, g^*, for our simple prediction problem, is

g^* = \arg\min_{g} \int Pr(y; θ) L(g, y) \, dy .

For a regression or classification problem with a conditional probabilistic model, it would be a function of a new input x:

g^*(x) = \arg\min_{g} \int Pr(y | x; θ) L(g, y) \, dy .

In general, we have to know both the form of the probabilistic model Pr(y | x; θ) and the loss function L(g, a). But we can get some insight into the optimal decision problem in the simple but very prevalent special case of squared error loss. In this case, for the simple prediction problem, we have:

g^* = \arg\min_{g} \int Pr(y; θ) (g - y)^2 \, dy .

We can optimize this by taking the derivative of the risk with respect to g, setting it to zero, and solving for g. In this case, we have

\frac{d}{dg} \int_{y} Pr(y; θ) (g - y)^2 \, dy = 0

\int_{y} Pr(y; θ) \, 2(g - y) \, dy = 0

\int_{y} Pr(y; θ) \, g \, dy = \int_{y} Pr(y; θ) \, y \, dy

g = E[y; θ] .

Of course, for different choices of loss function, the risk-minimizing prediction will have a
different form.
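For a discrete y the integral becomes a sum, and the computation is easy to carry out directly. Here is a sketch using the asymmetric heart-attack loss from section 4, assuming some fitted model has supplied p = Pr(y = 1 | x; θ); the optimal guess switches to 1 as soon as p > 1/11.

    def asymmetric_loss(g, a):
        if g == 1 and a == 0:
            return 1
        if g == 0 and a == 1:
            return 10
        return 0

    def best_guess(p):
        # expected loss of each guess under Pr(y = 1 | x; theta) = p
        expected = {g: asymmetric_loss(g, 0) * (1 - p) + asymmetric_loss(g, 1) * p
                    for g in (0, 1)}
        return min(expected, key=expected.get)

    print(best_guess(0.05))  # -> 0
    print(best_guess(0.15))  # -> 1: guessing 0 risks the expensive error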
5.3.3 Benefits of using a probabilistic model

This approach allows us to separate the probability model from the loss function; so we
could learn one model and use it in different decision-making contexts. Models that make
probabilistic predictions can be easily combined using probabilistic inference: it is much
harder to know how to combine multiple direct prediction rules.

5.4 Distribution over models

In the distribution over models approach, which we'll often call the Bayesian approach, we treat the model parameters θ as random variables, and use probabilistic inference to update a distribution over θ based on the observed data D. We:

1. Begin with a prior distribution over possible probabilistic models of the data, Pr(θ);
2. Perform Bayesian inference to combine the data with the prior distribution to find a posterior distribution over models of the data, Pr(θ | D);
3. (Typically) integrate out the posterior distribution over models in the context of decision-theoretic reasoning, using a loss function to select the best prediction.

In the Bayesian approach, we do not have a single model that we will use to make predictions: instead, we will integrate over possible models to select the minimum expected loss prediction. For the simple prediction problem this is:

g^* = \arg\min_{g} \int \int Pr(y | θ) Pr(θ | D) L(g, y) \, dy \, dθ .

This integral can sometimes be difficult to evaluate. We can evaluate it approximately, using sampling techniques.
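Here is a sketch of such a sampling approximation, assuming (hypothetically) that θ is a Bernoulli parameter whose posterior is Beta(3, 9), a case treated in section 6.3.2, and that we use squared loss, so the exact minimizer is the posterior predictive mean 3/12 = 0.25:

    import numpy as np

    rng = np.random.default_rng(0)

    # theta ~ Pr(theta | D): 100,000 posterior samples from Beta(3, 9).
    thetas = rng.beta(3.0, 9.0, size=100_000)
    # y ~ Pr(y | theta): one Bernoulli draw per posterior sample.
    ys = rng.binomial(1, thetas)

    # Approximate the expected loss of each candidate guess g and minimize.
    guesses = np.linspace(0.0, 1.0, 101)
    risks = [np.mean((g - ys) ** 2) for g in guesses]
    print(guesses[np.argmin(risks)])  # ~0.25, the posterior predictive mean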
The Bayesian approach lets us maintain an explicit representation of our uncertainty about the model and take it into account when making predictions. Imagine a situation in which the most likely model would make a prediction g_1, but all of the other models, which are not that much less likely, would predict g_2: in such situations g_2 is the choice that is more likely to be correct. Or, it might be that there is a prediction that is not the best, but has relatively small loss in all the models: that might be the choice that minimizes risk, even though it is not the best choice in any of the models.
The Bayesian approach can also provide elegant solutions to the problem of model
selection.

(There are a lot of different methods and quantities within machine learning that are called Bayesian foo. Sometimes people just use it to mean that there is an application of Bayes' rule going on somewhere. We will try to use it to refer to keeping distributions over models or hypotheses.)
