6.867 Lecture Notes: Section 1: Introduction
MIT 6.867, Fall 2014

Reading
Lecture 1: Murphy: Chapters 1 (Introduction), 2 (good probability review; pay attention to the multinomial, Gaussian, Beta, and Dirichlet distributions)

Contents
1 Intro
2 Problem class
  2.1 Supervised learning
    2.1.1 Classification
    2.1.2 Regression
  2.2 Unsupervised learning
    2.2.1 Density estimation
    2.2.2 Clustering
    2.2.3 Dimensionality reduction
  2.3 Reinforcement learning
  2.4 Other settings
3 Assumptions
4 Evaluation criteria
5 Model type
  5.1 No model
  5.2 Prediction rule
  5.3 Probabilistic model
    5.3.1 Fitting a probabilistic model
    5.3.2 Decision theoretic prediction
    5.3.3 Benefits of using a probabilistic model
  5.4 Distribution over models
Algorithm
Recap
1 Intro
The main focus of machine learning is making decisions or predictions based on data. There are a number of other fields with significant overlap in technique, but differences in focus: in economics and psychology, the goal is to discover underlying causal processes, and in statistics it is to find a model that fits a data set well. In those fields, the end product is a model. In machine learning, we often fit models, but as a means to the end of making good predictions.
Generally, this is done in two stages:
1. Learn or estimate a model from the data
2. Apply the model to make predictions or answer queries
Problem of induction: Why do we think that previously seen data will help us predict
the future? This is a serious philosophical problem of long standing. We will operationalize
it by making assumptions, such as that all training data are IID (independent and identically distributed) and that queries will be drawn from the same distribution as the training
data, or that the answer comes from a set of possible answers known in advance.
Two important issues that come up are:

- Statistical inference: How do we deal with the fact that, for example, the same treatment may end up with different results on different trials? How can we predict how well an estimate may compare to future results?
- Generalization: How can we predict results of a situation or experiment that we have never encountered before in our data set?
We can characterize problems and their solutions using six characteristics, three of
which characterize the problem and three of which characterize the solution:
1. Problem class: What is the nature of the training data and what kinds of queries will
be made at testing time?
2. Assumptions: What do we know about the source of the data or the form of the
solution?
3. Evaluation criteria: What is the goal of the system? How will the answers to individual queries be evaluated? How will the overall performance of the system be
measured?
4. Model type: Will an intermediate model be made? What aspects of the data will be
modeled? How will the model be used to make predictions?
5. Model class: What particular parametric class of models will be used? What criterion
will we use to pick a particular model from the model class?
6. Algorithm: What computational process will be used to fit the model to the data
and/or to make predictions?
Without making some assumptions about the nature of the process generating the data, we
cannot perform generalization.
The introductory section of the course will lay out the spaces of these characteristics
and illustrate them with simple (but not always easy!) examples. Then, in later sections,
we will refer back to these spaces to characterize new problems and approaches that we
introduce.
2 Problem class
There are many different problem classes in machine learning. They vary according to what
kind of data is provided and what kind of conclusions are to be drawn from it. We will
go through several of the standard problem classes here, and establish some notation and
terminology.
2.1 Supervised learning
2.1.1 Classification
Training data D is in the form of a set of pairs {(x(1) , y(1) ), . . . , (x(n) , y(n) )} where x(i) represents an object to be classified, most typically a D-dimensional vector of real and/or discrete values, and y(i) is an element of a discrete set of values. The y values are sometimes
called target values.
A classification problem is binary or two-class if y(i) is drawn from a set of two possible
values; otherwise, it is called multi-class.
The goal in a classification problem is ultimately, given a new input value x(n+1) , to
predict the value of y(n+1) .
Classification problems are a kind of supervised learning, because the desired output (or
class) y(i) is specified for each of the training examples x(i) .
We typically assume that the elements of D are independent and identically distributed
according to some distribution Pr(X, Y).
2.1.2 Regression

Regression is like classification, except that the target value y(i) is a real value (or a vector of real values), rather than an element of a discrete set.
2.2 Unsupervised learning
2.2.1 Density estimation
Given samples y(1), . . . , y(n) ∈ R^D drawn IID from some distribution Pr(Y), the goal is to predict the probability Pr(y(n+1)) of an element drawn from the same distribution. Density estimation often plays a role as a subroutine in the overall learning method for supervised learning, as well.
This is a type of unsupervised learning, because it doesn't involve learning a function from inputs to outputs based on a set of input-output pairs.
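As a concrete illustration, here is a minimal sketch of density estimation under an assumed model class of one-dimensional Gaussians, fit by maximum likelihood; the function names and data are our own, not from any particular library.

```python
import math

def fit_gaussian(ys):
    """Maximum likelihood fit of a 1-D Gaussian: sample mean and (biased) variance."""
    n = len(ys)
    mu = sum(ys) / n
    var = sum((y - mu) ** 2 for y in ys) / n
    return mu, var

def density(y, mu, var):
    """Estimated probability density Pr(y) under the fitted Gaussian."""
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

samples = [2.1, 1.9, 2.0, 2.2, 1.8]
mu, var = fit_gaussian(samples)
# The fitted density is highest near the sample mean.
assert density(mu, mu, var) > density(mu + 1.0, mu, var)
```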
2.2.2 Clustering
Given samples x(1), . . . , x(n) ∈ R^D, the goal is to find a partition of the samples that groups
together samples that are similar. There are many different objectives, depending on the
definition of the similarity between samples and exactly what criterion is to be used (e.g.,
minimize the average distance between elements inside a cluster and maximize the average distance between elements across clusters). Other methods perform a soft clustering,
in which samples may be assigned 0.9 membership in one cluster and 0.1 in another. Clustering is sometimes used as a step in density estimation, and sometimes to find useful
structure in data. This is also unsupervised learning.
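To make the objective concrete, the following sketch scores a hard partition of one-dimensional points by the two criteria mentioned above, average within-cluster distance and average across-cluster distance; the helper names and data are illustrative, not standard.

```python
from itertools import combinations

def avg_within(clusters):
    """Average distance between pairs of samples inside the same cluster."""
    pairs = [(a, b) for c in clusters for a, b in combinations(c, 2)]
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

def avg_across(clusters):
    """Average distance between samples assigned to different clusters."""
    dists = [abs(a - b)
             for c1, c2 in combinations(clusters, 2)
             for a in c1 for b in c2]
    return sum(dists) / len(dists)

# A "good" partition of 1-D points: tight groups, far apart.
good = [[0.0, 0.1, 0.2], [5.0, 5.1, 5.2]]
# A "bad" partition mixes the two groups together.
bad = [[0.0, 5.0, 0.2], [0.1, 5.1, 5.2]]
assert avg_within(good) < avg_within(bad)
assert avg_across(good) > avg_across(bad)
```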
2.2.3 Dimensionality reduction
Given samples x(1), . . . , x(n) ∈ R^D, the problem is to re-represent them as points in a d-dimensional space, where d < D. The goal is typically to retain information in the data set
that will, e.g., allow elements of one class to be discriminated from another.
Standard dimensionality reduction is particularly useful for visualizing or understanding high-dimensional data. If the goal is ultimately to perform regression or classification
on the data after the dimensionality is reduced, it is usually best to articulate an objective
for the overall prediction problem rather than to first do dimensionality reduction without
knowing which dimensions will be important for the prediction task.
2.3 Reinforcement learning
In reinforcement learning, the goal is to learn a mapping from input values x to output
values y, but without a direct supervision signal to specify which output values y are
best for a particular input. There is no training set specified a priori. Instead, the learning
problem is framed as an agent interacting with an environment, in the following setting:
- The agent observes the current state, x(0).
- It selects an action, y(0).
- It receives a reward, r(0), which depends on x(0) and possibly y(0).
- The environment transitions probabilistically to a new state, x(1), with a distribution that depends only on x(0) and y(0).
- The agent observes the current state, x(1).
- ...
The goal is to find a policy π, mapping x to y (that is, states to actions), such that some long-term sum or average of rewards r is maximized.
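The interaction loop above can be sketched generically in code; the loop structure follows the steps listed, but the toy chain environment is our own invention for illustration.

```python
import random

def run_episode(policy, transition, reward, x0, steps=10, rng=None):
    """Generic agent-environment loop: observe the state, select an action,
    receive a reward, then the environment transitions to a new state."""
    rng = rng or random.Random(0)
    x, total = x0, 0.0
    for _ in range(steps):
        y = policy(x)              # agent selects an action based on the state
        total += reward(x, y)      # reward depends on the state (and action)
        x = transition(x, y, rng)  # stochastic transition from (x, y)
    return total

# Toy chain (assumed, for illustration): states are integers, the action is
# +1 or -1, moves succeed with probability 0.9, and the agent is rewarded
# whenever it is at state 0.
policy = lambda x: -1 if x > 0 else 1
reward = lambda x, y: 1.0 if x == 0 else 0.0
transition = lambda x, y, rng: x + y if rng.random() < 0.9 else x
```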
This setting is very different from either supervised learning or unsupervised learning, because the agent's action choices affect both its reward and its ability to observe the environment. It requires careful consideration of the long-term effects of actions, as well as all of the other issues that pertain to supervised learning.
2.4 Other settings
3 Assumptions
The kinds of assumptions that we can make about the data source or the solution include:
- The data are independent and identically distributed.
- The data are generated by a Markov chain.
- The process generating the data might be adversarial.
- The "true" model that is generating the data can be perfectly described by one of some particular set of hypotheses.
4 Evaluation criteria
Once we have specified a problem class, we need to say what makes an output or the
answer to a query good, given the training data. We specify evaluation criteria at two
levels: how an individual prediction is scored, and how the overall behavior of the system
is scored.
The quality of predictions from a learned prediction rule is often expressed in terms of
a loss function. A loss function L(g, a) tells you how much you will be penalized for making
a guess g when the answer is actually a. There are many possible loss functions. Here are
some:
0-1 loss applies to predictions drawn from finite domains.

    L(g, a) = 0 if g = a, and 1 otherwise.

Squared loss

    L(g, a) = (g - a)^2

Linear loss

    L(g, a) = |g - a|
Asymmetric loss Consider a situation in which you are trying to predict whether someone is having a heart attack. It might be much worse to predict no when the answer is really yes, than the other way around.

    L(g, a) = 1 if g = 1 and a = 0; 10 if g = 0 and a = 1; 0 otherwise.
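These loss functions are simple enough to transcribe directly; the code below is just a restatement of the definitions above.

```python
def zero_one_loss(g, a):
    """0-1 loss: every wrong guess is penalized equally."""
    return 0 if g == a else 1

def squared_loss(g, a):
    """Squared loss: penalty grows quadratically with the error."""
    return (g - a) ** 2

def linear_loss(g, a):
    """Linear loss: penalty grows with the absolute error."""
    return abs(g - a)

def asymmetric_loss(g, a):
    """Missing a real heart attack (g=0, a=1) is 10x worse than a false alarm."""
    if g == 1 and a == 0:
        return 1
    if g == 0 and a == 1:
        return 10
    return 0

assert zero_one_loss("cat", "dog") == 1
assert squared_loss(3.0, 1.0) == 4.0
assert asymmetric_loss(0, 1) == 10 * asymmetric_loss(1, 0)
```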
Any given prediction rule will usually be evaluated based on multiple predictions and the loss of each one. At this level, we might be interested in:

- Minimizing expected loss over all the predictions (also known as risk)
- Minimizing maximum loss: the loss of the worst prediction
- Minimizing or bounding regret: how much worse this predictor performs than the best one drawn from some class
- Characterizing asymptotic behavior: how well the predictor will perform in the limit of infinite training data
- Finding algorithms that are probably approximately correct: they probably generate a hypothesis that is right most of the time.
There is a theory of rational agency that argues that you should always select the action that minimizes the expected loss. This strategy will, for example, make you the most money in the long run, in a gambling setting. Expected loss is also sometimes called risk in the machine-learning literature, but that term means other things in economics and other parts of decision theory, so be careful: it's risky to use it. We will, most of the time, concentrate on this criterion.
5 Model type
5.1 No model
In some simple cases, we can generate answers to queries directly from the training data, without the construction of any intermediate model.
In our simple prediction problem, we might just decide to predict the mean or median of the y(1), . . . , y(n) because it seems like it might be a good idea. In regression or classification, we might generate an answer to a new query by using the answers associated with nearby training examples, as in the nearest neighbor method.
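For instance, a nearest neighbor predictor for one-dimensional inputs can answer a query directly from the training pairs, with no fitted model in between; this is a minimal sketch with made-up data, not a library routine.

```python
def nearest_neighbor_predict(train, x_new):
    """Predict y for x_new by copying the label of the closest training input.
    train is a list of (x, y) pairs with 1-D inputs; distance is |x - x'|."""
    x_best, y_best = min(train, key=lambda xy: abs(xy[0] - x_new))
    return y_best

# Hypothetical labeled training data.
train = [(0.0, "blue"), (1.0, "blue"), (5.0, "red")]
assert nearest_neighbor_predict(train, 0.4) == "blue"
assert nearest_neighbor_predict(train, 4.2) == "red"
```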
5.2 Prediction rule

In the prediction rule approach, we use the training data to select a hypothesis h(·; θ) from some parametric class, and then apply h directly to new queries. A standard criterion is to choose θ to maximize the negated average training loss (that is, to minimize the empirical risk):

    score(θ) = -(1/n) ∑_{i=1}^n L(h(x(i); θ), y(i)) .
In our simple estimation problem, the hypothesis would be a single real value, so the distinction between this case and the no model case is not as clear as it will be in the case of real regression or classification problems. However, we might still formulate the problem as one of fitting a model by seeking the value θ_erm that minimizes empirical risk. In the case of squared loss, we would select θ_erm to maximize

    score(θ) = -(1/n) ∑_{i=1}^n (y(i) - θ)^2 .
The maximizing value is the sample mean:

    θ_erm = (1/n) ∑_{i=1}^n y(i) .
Having selected the hypothesis h(·; θ_erm) that minimizes the empirical risk on our data, θ_erm will be our predicted value. That is, h(x; θ_erm) = θ_erm.
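We can check numerically that the sample mean maximizes the score, by comparing it against a grid of candidate values of θ; the data and grid here are our own illustration.

```python
def score(theta, ys):
    """Negated empirical risk under squared loss for the constant hypothesis theta."""
    return -sum((y - theta) ** 2 for y in ys) / len(ys)

ys = [1.0, 2.0, 6.0]
theta_erm = sum(ys) / len(ys)   # the sample mean, 3.0

# The mean scores at least as well as any candidate on a fine grid over [0, 8].
candidates = [i / 100 for i in range(0, 801)]
best = max(candidates, key=lambda t: score(t, ys))
assert abs(best - theta_erm) < 1e-9
```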
We will find that minimizing empirical risk is often not a good choice: it is possible to
emphasize fitting the current data too strongly and end up with a hypothesis that does not
generalize well when presented with new x values.
The prediction-rule approach is also sometimes called a discriminative approach, because, when applied to a classification problem, it attempts to directly learn a function that
discriminates between the classes.
5.3 Probabilistic model
In the prediction rule approach, we learn a model that is used directly to select a predicted
value. In the probabilistic model approach, we:
1. Fit a probabilistic model to the training data; then
2. Use decision-theoretic reasoning to combine the probabilistic model with a loss function to select the best prediction.
5.3.1 Fitting a probabilistic model
For a regression or classification problem, we might choose to fit a joint distribution Pr(X, Y; θ) or a conditional distribution Pr(Y | X; θ). The problem of fitting probabilistic models to data is treated at length in the statistics literature. There are many different criteria that one can use to fit a probabilistic model to data.
Probably the most common objective is to find the model that maximizes the likelihood of the data from among some parametric class of models; this is known as the maximum likelihood model. It is found by maximizing the data likelihood,
    score(θ) = ∏_{i=1}^n Pr(x(i), y(i); θ) .
This is called a generative model, because the model describes how the entire data set is
generated. For classification problems, when y is an element of a discrete set of possible
values, it is often easiest to learn a factored model, of the form
    Pr(X, Y; θ) = Pr(X | Y; θ1) Pr(Y; θ2) ,

where θ = (θ1, θ2).
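As a small worked example (ours, not from the notes): for IID coin flips modeled as Bernoulli(θ), maximizing the log of the data likelihood over a grid of candidate parameters recovers the empirical frequency of heads.

```python
import math

def log_likelihood(theta, ys):
    """Log of the data likelihood for IID Bernoulli samples (ys are 0/1)."""
    return sum(math.log(theta if y == 1 else 1 - theta) for y in ys)

ys = [1, 1, 1, 0, 1, 0, 1, 1]                   # 6 heads out of 8 flips
candidates = [i / 1000 for i in range(1, 1000)]  # grid over (0, 1)
theta_ml = max(candidates, key=lambda t: log_likelihood(t, ys))
# The maximum likelihood estimate is the empirical frequency, 6/8 = 0.75.
assert abs(theta_ml - 0.75) < 1e-9
```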
Another approach is to fit a discriminative model by maximizing the conditional likelihood,
    score(θ) = ∏_{i=1}^n Pr(y(i) | x(i); θ) .
We will explore the trade-offs between using generative and discriminative models in more
detail in later sections. In general, we find that generative models can be learned somewhat
more efficiently and are highly effective if their modeling assumptions match the process
that generated the data. Discriminative models generally give the best performance on
the prediction task in the limit of a large amount of training data, because they focus their
modeling effort based on the problem of making the prediction and not on modeling the
distribution over possible query points x(n+1) .
We can't take this approach any further until we consider the particular class of models we are going to fit. We'll do some examples in section 6.
5.3.2 Decision theoretic prediction
Now, imagine that you have found some parameters θ, so that you have a probability distribution Pr(Y | X; θ) or Pr(X, Y; θ). How can you combine this with a loss function to make predictions?
The optimal prediction, in terms of minimizing expected loss, g*, for our simple prediction problem, is

    g* = arg min_g ∫ Pr(y; θ) L(g, y) dy .
In general, we have to know both the form of the probabilistic model Pr(y | x; θ) and the loss function L(g, a). But we can get some insight into the optimal decision problem in the simple but very prevalent special case of squared error loss. In this case, for the simple prediction problem, we have:

    g* = arg min_g ∫ Pr(y; θ) (g - y)^2 dy .
We can optimize this by taking the derivative of the risk with respect to g, setting it to zero, and solving for g. In this case, we have

    d/dg ∫ Pr(y; θ) (g - y)^2 dy = 0
    ∫ 2 Pr(y; θ) (g - y) dy = 0
    ∫ Pr(y; θ) g dy = ∫ Pr(y; θ) y dy
    g = E[y; θ] ,

where the last step uses ∫ Pr(y; θ) dy = 1.
Of course, for different choices of loss function, the risk-minimizing prediction will have a
different form.
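The conclusion g = E[y; θ] is easy to verify numerically, using a small discrete distribution of our own invention standing in for Pr(y; θ).

```python
def risk(g, dist):
    """Expected squared loss of guess g under a discrete distribution {y: Pr(y)}."""
    return sum(p * (g - y) ** 2 for y, p in dist.items())

# A made-up distribution over three possible y values.
dist = {0.0: 0.2, 1.0: 0.5, 4.0: 0.3}
mean = sum(p * y for y, p in dist.items())   # E[y] = 1.7

# Minimize the risk over a fine grid of guesses: the minimizer is the mean.
candidates = [i / 100 for i in range(0, 401)]
g_star = min(candidates, key=lambda g: risk(g, dist))
assert abs(g_star - mean) < 1e-9
```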
5.3.3 Benefits of using a probabilistic model
This approach allows us to separate the probability model from the loss function; so we
could learn one model and use it in different decision-making contexts. Models that make
probabilistic predictions can be easily combined using probabilistic inference: it is much
harder to know how to combine multiple direct prediction rules.
5.4 Distribution over models
In the distribution over models approach, which we'll often call the Bayesian approach, we treat the model parameters θ as random variables, and use probabilistic inference to update a distribution over θ based on the observed data D. We:

1. Begin with a prior distribution over possible probabilistic models of the data, Pr(θ);
2. Perform Bayesian inference to combine the data with the prior distribution to find a posterior distribution over models of the data, Pr(θ | D);
3. (Typically) integrate out the posterior distribution over models in the context of decision-theoretic reasoning, using a loss function to select the best prediction.
In the Bayesian approach, we do not have a single model that we will use to make predictions: instead, we will integrate over possible models to select the minimum expected
loss prediction. For the simple prediction problem this is:
    g* = arg min_g ∫∫ Pr(y | θ) Pr(θ | D) L(g, y) dy dθ .
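As a sketch of these steps, assuming a Bernoulli model for y and squared loss, the Bayesian computation can be carried out on a discrete grid of θ values; the grid, data, and helper names here are our own.

```python
def posterior(thetas, prior, data):
    """Grid Bayes rule: Pr(theta | D) is proportional to Pr(D | theta) Pr(theta)."""
    def likelihood(t):
        p = 1.0
        for y in data:
            p *= t if y == 1 else 1 - t
        return p
    w = [likelihood(t) * pr for t, pr in zip(thetas, prior)]
    z = sum(w)
    return [wi / z for wi in w]

def bayes_predict(thetas, post):
    """Under squared loss, the minimum expected loss guess is the posterior
    predictive mean: the sum over theta of theta * Pr(theta | D)."""
    return sum(t * p for t, p in zip(thetas, post))

thetas = [i / 100 for i in range(1, 100)]       # grid over (0, 1)
prior = [1 / len(thetas)] * len(thetas)         # uniform prior over the grid
post = posterior(thetas, prior, [1, 1, 1, 0])   # observed three 1s and one 0
g = bayes_predict(thetas, post)
# With a uniform prior, this approximates Laplace's rule: (3 + 1) / (4 + 2).
```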