Lecture1 2015

This document provides information about the STA 414/2104 Machine Learning course taught by Russ Salakhutdinov at the University of Toronto. It outlines the evaluation criteria including assignments, midterm, and final worth a total of 100%. It lists recommended textbooks and additional books. The document describes statistical machine learning as developing algorithms to learn from data by constructing stochastic models for prediction and decision making. It provides examples of machine learning successes and discusses finding structure in large datasets through methods like matrix factorization, topic modeling, and collaborative filtering. Finally, it outlines tentative course topics and different types of machine learning.


STA 414/2104: Machine Learning

Russ Salakhutdinov
Department of Computer Science
Department of Statistics
[email protected]
http://www.cs.toronto.edu/~rsalakhu/

Lecture 1
Evaluation
• 3 assignments worth 40%.
• Midterm worth 20%.
• Undergrads: final worth 40%.
• Graduates: 10% oral presentation, 30% final.
Text Books
• Christopher M. Bishop (2006), Pattern Recognition and Machine Learning, Springer.

Additional Books
• Kevin Murphy, Machine Learning: A Probabilistic Perspective.
• Trevor Hastie, Robert Tibshirani, Jerome Friedman (2009), The Elements of Statistical Learning.
• David MacKay (2003), Information Theory, Inference, and Learning Algorithms.

• Most of the figures and material will come from these books.
Statistical Machine Learning
Statistical machine learning is a very dynamic field that lies at the intersection of Statistics and the computational sciences.

The goal of statistical machine learning is to develop algorithms that can learn from data by constructing stochastic models that can be used for making predictions and decisions.
Machine Learning's Successes
• Biostatistics / Computational Biology.
• Neuroscience.
• Medical Imaging:
- computer-aided diagnosis, image-guided therapy.
- image registration, image fusion.
• Information Retrieval / Natural Language Processing:
- Text, audio, and image retrieval.
- Parsing, machine translation, text analysis.
• Speech Processing:
- Speech recognition, voice identification.
• Robotics:
- Autonomous car driving, planning, control.
Mining for Structure
Massive increase in both computational power and the amount of data available from the web, video cameras, and laboratory measurements.

Domains: Images & Video, Text & Language, Speech & Audio, Gene Expression, Relational Data / Product Recommendation, Climate Change, Social Network, Geological Data.

Develop statistical models that can discover underlying structure, causes, or statistical correlations from data.
Example: Boltzmann Machine
Model parameters; latent (hidden) variables.

Input data (e.g. pixel intensities of an image, words from webpages, a speech signal); target variables (response) (e.g. class labels, categories, phonemes).

Markov Random Fields, Undirected Graphical Models.


Finding Structure in Data
Vector of word counts on a webpage → latent variables: hidden topics.

Figure: topics discovered from 804,414 newswire stories, e.g. European Community Monetary/Economic, Interbank Markets, Energy Markets, Disasters and Accidents, Leading Economic Indicators, Legal/Judicial, Government Borrowings, Accounts/Earnings.
Matrix Factorization
Collaborative Filtering / Matrix Factorization / Hierarchical Bayesian Model

The rating value of user i for item j is modeled through a latent user feature (preference) vector and a latent item feature vector: latent variables that we infer from observed ratings.

Prediction: predict a rating r*_ij for user i and query movie j.

Posterior over Latent Variables: infer latent variables and make predictions using Markov chain Monte Carlo.
Finding  Structure  in  Data  
Collabora;ve  Filtering/  
Matrix  Factoriza;on/  
Product  Recommenda;on  

Learned    ``genre’’  
Neflix  dataset:     Fahrenheit  9/11   Independence  Day  
Bowling  for  Columbine   The  Day  Aher  Tomorrow  
480,189  users     The  People  vs.  Larry  Flynt   Con  Air  
17,770  movies     Canadian  Bacon   Men  in  Black  II  
La  Dolce  Vita   Men  in  Black  
Over  100  million  ra;ngs.  
Friday  the  13th  
The  Texas  Chainsaw  Massacre  
Children  of  the  Corn  
Child's  Play  
The  Return  of  Michael  Myers  

•  Part  of  the  wining  solu;on  in  the  Neflix  contest  (1  million  dollar  prize).  
Tentative List of Topics
• Linear methods for regression, Bayesian linear regression
• Linear models for classification
• Probabilistic generative and discriminative models
• Regularization methods
• Model comparison and BIC
• Neural networks
• Radial basis function networks
• Kernel methods, Gaussian processes, Support Vector Machines
• Mixture models and the EM algorithm
• Graphical models and Bayesian networks
Types of Learning
Consider observing a series of input vectors: x1, x2, …

• Supervised Learning: We are also given target outputs (labels, responses): y1, y2, …, and the goal is to predict the correct output given a new input.

• Unsupervised Learning: The goal is to build a statistical model of x, which can be used for making predictions and decisions.

• Reinforcement Learning: The model (agent) produces a set of actions a1, a2, … that affect the state of the world, and receives rewards r1, r2, … The goal is to learn actions that maximize the reward (we will not cover this topic in this course).

• Semi-supervised Learning: We are given only a limited amount of labels, but lots of unlabeled data.
Supervised Learning
Classification: target outputs yi are discrete class labels. The goal is to correctly classify new inputs.

Regression: target outputs yi are continuous. The goal is to predict the output given new inputs.

Example: Handwritten Digit Classification.
Unsupervised Learning
The goal is to construct a statistical model that finds a useful representation of the data:
• Clustering
• Dimensionality reduction
• Modeling the data density
• Finding hidden causes (useful explanations) of the data

Unsupervised learning can be used for:
• Structure discovery
• Anomaly detection / outlier detection
• Data compression, data visualization
• Aiding classification/regression tasks
DNA Microarray Data
Expression matrix of 6830 genes (rows) and 64 samples (columns) for the human tumor data.

The display is a heat map ranging from bright green (under-expressed) to bright red (over-expressed).

Questions we may ask:
• Which samples are similar to other samples in terms of their expression levels across genes?
• Which genes are similar to each other in terms of their expression levels across samples?
Linear Least Squares
• Given a vector of d-dimensional inputs x = (x1, …, xd)^T, we want to predict the target (response) using the linear model:

  y(x, w) = w0 + w1 x1 + … + wd xd

• The term w0 is the intercept, often called the bias term. It will be convenient to include the constant variable 1 in x and write:

  y(x, w) = w^T x

• Observe a training set consisting of N observations (x1, …, xN), together with corresponding target values t1, …, tN.

• Note that X is an N × (d+1) matrix.

Linear Least Squares
One option is to minimize the sum of the squares of the errors between the predictions y(xn, w) for each data point xn and the corresponding real-valued targets tn.

Loss function: sum-of-squared error function:

  E(w) = (1/2) Σn ( y(xn, w) − tn )^2

(Figure source: Wikipedia.)
Linear Least Squares
If X^T X is nonsingular, then the unique solution is given by:

  w* = (X^T X)^{-1} X^T t

where w* is the optimal vector of weights, t is the vector of target values, and the design matrix X has one input vector per row.

(Figure source: Wikipedia.)

• At an arbitrary input x*, the prediction is y(x*, w*) = (w*)^T x*.
• The entire model is characterized by d+1 parameters w*.
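The closed-form solution above is easy to sketch in NumPy (an illustrative toy example, not course code; the small dataset is assumed):

```python
# Minimal sketch: solving linear least squares via the normal equations
# w* = (X^T X)^{-1} X^T t, using NumPy.
import numpy as np

# Toy training set: targets generated by t = 1 + 2*x with no noise,
# so least squares should recover the weights exactly.
x = np.array([0.0, 1.0, 2.0, 3.0])
t = 1.0 + 2.0 * x

# Design matrix: prepend the constant 1 to each input (absorbs the bias w0).
X = np.column_stack([np.ones_like(x), x])

# Normal equations; np.linalg.solve is preferred over an explicit inverse.
w = np.linalg.solve(X.T @ X, X.T @ t)
print(np.round(w, 6))  # → [1. 2.]
```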
Example: Polynomial Curve Fitting
Consider observing a training set consisting of N 1-dimensional observations x1, …, xN, together with corresponding real-valued targets t1, …, tN.

• The green plot is the true function sin(2πx).
• The training data was generated by taking xn spaced uniformly between [0, 1].
• The target set (blue circles) was obtained by first computing the corresponding values of the sin function, and then adding a small Gaussian noise.

Goal: Fit the data using a polynomial function of the form:

  y(x, w) = w0 + w1 x + w2 x^2 + … + wM x^M

Note: the polynomial function is a nonlinear function of x, but it is a linear function of the coefficients w → linear models.
Example:  Polynomial  Curve  Firng  
•  As  for  the  least  squares  example:    we  can  minimize  the  sum  of  the  
squares  of  the  errors  between  the  predic;ons                                    for  each  data  
point  xn  and  the  corresponding  target  values  tn.      

Loss  func;on:  sum-­‐of-­‐squared  


error  func;on:  

•  Similar  to  the  linear  least  squares:  Minimizing  sum-­‐of-­‐squared  error  


func;on  has  a  unique  solu;on  w*.    
•  The  model  is  characterized  by  M+1  parameters  w*.  
•  How  do  we  choose  M?  !  Model  SelecBon.  
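Because the polynomial is linear in w, fitting it is ordinary least squares on a design matrix whose columns are the powers of x. A minimal sketch (the quadratic toy data is assumed, not from the slides):

```python
# Polynomial curve fitting as linear least squares on the columns 1, x, ..., x^M.
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
t = 3.0 - x + 0.5 * x**2                   # targets from a quadratic, no noise

M = 2                                      # polynomial order (model selection!)
X = np.vander(x, M + 1, increasing=True)   # columns: 1, x, x^2

w = np.linalg.solve(X.T @ X, X.T @ t)      # unique solution w*
print(np.round(w, 6))                      # recovers the coefficients [3, -1, 0.5]
```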
Some Fits to the Data
For M=9, we have fitted the training data perfectly.

Overfitting
• Consider a separate test set containing 100 new data points generated using the same procedure that was used to generate the training data.
• For M=9, the training error is zero → the polynomial contains 10 degrees of freedom corresponding to 10 parameters w, and so can be fitted exactly to the 10 data points.
• However, the test error has become very large. Why?
Overfitting
• As M increases, the magnitude of the coefficients gets larger.
• For M=9, the coefficients have become finely tuned to the data.
• Between data points, the function exhibits large oscillations.

More flexible polynomials with larger M tune to the random noise on the target values.
Varying the Size of the Data
9th-order polynomial

• For a given model complexity, the overfitting problem becomes less severe as the size of the dataset increases.
• However, the number of parameters is not necessarily the most appropriate measure of model complexity.
Generalization
• The goal is to achieve good generalization by making accurate predictions for new test data that is not known during learning.
• Choosing the values of parameters that minimize the loss function on the training data may not be the best option.
• We would like to model the true regularities in the data and ignore the noise in the data:
- It is hard to know which regularities are real and which are accidental due to the particular training examples we happen to pick.
• Intuition: We expect the model to generalize if it explains the data well given the complexity of the model.
• If the model has as many degrees of freedom as the data, it can fit the data perfectly. But this is not very informative.
• There is some theory on how to control model complexity to optimize generalization.
A Simple Way to Penalize Complexity
One technique for controlling the over-fitting phenomenon is regularization, which amounts to adding a penalty term to the error function:

  Ẽ(w) = (1/2) Σn ( y(xn, w) − tn )^2 + (λ/2) ||w||^2

where ||w||^2 = w^T w and λ is called the regularization parameter. Note that we do not penalize the bias term w0.

• The idea is to "shrink" estimated parameters towards zero (or towards the mean of some other weights).
• Shrinking to zero: penalize coefficients based on their size.
• For a penalty function which is the sum of the squares of the parameters, this is known as "weight decay", or "ridge regression".
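The penalized loss also has a closed-form minimizer, w* = (X^T X + λI)^{-1} X^T t. A minimal sketch of the shrinkage effect (illustrative only; for simplicity the bias weight is penalized here too, unlike in the slides, and the toy data are assumed):

```python
# Ridge regression: adding λI to the normal equations shrinks the weights.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

X = np.vander(x, 10, increasing=True)      # M = 9 polynomial: 10 columns

def ridge(X, t, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

w_small = ridge(X, t, 1e-3)                # light regularization
w_large = ridge(X, t, 1.0)                 # heavy regularization

# The penalty shrinks coefficients towards zero as λ grows.
print(np.linalg.norm(w_large) < np.linalg.norm(w_small))  # → True
```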
Regularization
Graph of the root-mean-squared training and test errors vs. ln λ for the M=9 polynomial.

How do we choose λ?
Cross Validation
If the data is plentiful, we can divide the dataset into three subsets:
• Training Data: used for fitting/learning the parameters of the model.
• Validation Data: not used for learning but for selecting the model, or choosing the amount of regularization that works best.
• Test Data: used to get the performance of the final model.

For many applications, the supply of data for training and testing is limited. To build good models, we may want to use as much training data as possible. If the validation set is small, we get a noisy estimate of the predictive performance.

S-fold cross-validation:
• The data is partitioned into S groups.
• Then S−1 of the groups are used for training the model, which is evaluated on the remaining group.
• Repeat the procedure for all S possible choices of the held-out group.
• Performance scores from the S runs are averaged.
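The S-fold procedure above can be sketched generically (an illustrative helper, not course code; the `fit`/`error` callables and the toy constant-mean predictor are assumptions):

```python
# S-fold cross-validation: average the held-out error over all S folds.
import numpy as np

def s_fold_cv(x, t, S, fit, error):
    """Partition the data into S groups; train on S-1 groups, evaluate on the rest."""
    folds = np.array_split(np.arange(len(x)), S)
    scores = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        model = fit(x[train], t[train])
        scores.append(error(model, x[held_out], t[held_out]))
    return np.mean(scores)          # average performance over the S runs

# Toy usage: score a constant-mean predictor with squared error.
x = np.arange(8, dtype=float)
t = 2.0 * x
fit = lambda xs, ts: np.mean(ts)
err = lambda m, xs, ts: np.mean((ts - m) ** 2)
result = s_fold_cv(x, t, S=4, fit=fit, error=err)
print(result)
```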
Basics of Probability Theory
• Consider two random variables X and Y:
- X takes any of the values xi, where i=1,…,M.
- Y takes any of the values yj, j=1,…,L.
• Consider a total of N trials, and let the number of trials in which X = xi and Y = yj be nij.

• Joint probability:

  p(X = xi, Y = yj) = nij / N

• Marginal probability:

  p(X = xi) = ci / N,

where ci = Σj nij is the number of trials in which X = xi, irrespective of the value of Y.
Basics of Probability Theory
• The marginal probability can be written as:

  p(X = xi) = Σj p(X = xi, Y = yj)

• It is called the marginal probability because it is obtained by marginalizing, or summing out, the other variables.
Basics of Probability Theory
• Conditional probability:

  p(Y = yj | X = xi) = nij / ci

• We can derive the following relationship:

  p(X = xi, Y = yj) = p(Y = yj | X = xi) p(X = xi)

which is the product rule of probability.
The Rules of Probability

Sum Rule:     p(X) = ΣY p(X, Y)

Product Rule: p(X, Y) = p(Y | X) p(X)
Bayes' Rule
• From the product rule, together with the symmetry property p(X, Y) = p(Y, X):

  p(Y | X) = p(X | Y) p(Y) / p(X)

• Remember the sum rule:

  p(X) = ΣY p(X | Y) p(Y)

• We will revisit Bayes' Rule later in class.
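A small numeric sketch of how the sum and product rules combine into Bayes' rule (all the probabilities below are assumed for illustration):

```python
# Bayes' rule for a binary Y and an observed event X = "hi".
# Assumed numbers: p(Y=1) = 0.3, p(X=hi | Y=1) = 0.8, p(X=hi | Y=0) = 0.2.
p_y1 = 0.3
p_x_given_y1 = 0.8
p_x_given_y0 = 0.2

# Sum rule: p(X=hi) = sum over Y of p(X=hi | Y) p(Y)
p_x = p_x_given_y1 * p_y1 + p_x_given_y0 * (1 - p_y1)

# Bayes' rule: p(Y=1 | X=hi) = p(X=hi | Y=1) p(Y=1) / p(X=hi)
p_y1_given_x = p_x_given_y1 * p_y1 / p_x
print(round(p_y1_given_x, 4))  # → 0.6316
```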


Illustrative Example
• Distribution over two variables: X takes on 9 possible values, and Y takes on 2 possible values.
Probability Density
• If the probability of a real-valued variable x falling in the interval (x, x + δx) is given by p(x) δx as δx → 0, then p(x) is called the probability density over x.

• The probability density must satisfy the following two conditions:

  p(x) ≥ 0  and  ∫ p(x) dx = 1.
Probability Density
• The cumulative distribution function is defined as:

  P(z) = ∫_{−∞}^{z} p(x) dx

which also satisfies P'(x) = p(x).

• The sum and product rules take similar forms:

  p(x) = ∫ p(x, y) dy,   p(x, y) = p(y | x) p(x)
Expectations
• The average value of some function f(x) under a probability distribution (density) p(x) is called the expectation of f(x):

  E[f] = Σx p(x) f(x)   (or  E[f] = ∫ p(x) f(x) dx  for densities)

• If we are given a finite number N of points drawn from the probability distribution (density), then the expectation can be approximated as:

  E[f] ≈ (1/N) Σn f(xn)

• Conditional expectation with respect to the conditional distribution:

  Ex[f | y] = Σx p(x | y) f(x)
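The sample-average approximation can be checked directly (a minimal Monte Carlo sketch with an assumed toy distribution):

```python
# Approximating E[f] by (1/N) sum f(x_n) over draws x_n from p(x).
import random

random.seed(0)
N = 100_000
# Draws from the uniform density on [0, 1]; with f(x) = x^2, E[f] = 1/3.
samples = [random.random() for _ in range(N)]
estimate = sum(x * x for x in samples) / N
print(abs(estimate - 1.0 / 3.0) < 0.01)   # → True
```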


Variances and Covariances
• The variance of f(x) is defined as:

  var[f] = E[ (f(x) − E[f(x)])^2 ] = E[f(x)^2] − E[f(x)]^2

which measures how much variability there is in f(x) around its mean value E[f(x)].

• Note that if f(x) = x, then var[x] = E[x^2] − E[x]^2.


Variances and Covariances
• For two random variables x and y, the covariance is defined as:

  cov[x, y] = E[ (x − E[x]) (y − E[y]) ] = E[xy] − E[x] E[y]

which measures the extent to which x and y vary together. If x and y are independent, then their covariance vanishes.

• For two vectors of random variables x and y, the covariance is a matrix:

  cov[x, y] = E[ (x − E[x]) (y^T − E[y^T]) ]
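The scalar formula is easy to verify on a toy sample (illustrative numbers, not from the slides):

```python
# Sample covariance via cov[x, y] = E[xy] - E[x] E[y].
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])     # y = 2x: x and y vary together

cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)
print(cov_xy)                          # positive, since y grows with x

# Here y = 2x, so cov[x, y] = 2 * var[x].
var_x = np.mean(x**2) - np.mean(x)**2
print(np.isclose(cov_xy, 2 * var_x))   # → True
```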
The Gaussian Distribution
• For the case of a single real-valued variable x, the Gaussian distribution is defined as:

  N(x | μ, σ^2) = (1 / (2πσ^2)^{1/2}) exp{ −(x − μ)^2 / (2σ^2) }

which is governed by two parameters:
- μ (mean)
- σ^2 (variance)

β = 1/σ^2 is called the precision.

• Next class, we will look at various distributions as well as the multivariate extension of the Gaussian distribution.
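The density formula above can be evaluated directly, and its normalization checked numerically (a minimal sketch; the grid and interval are assumptions):

```python
# Evaluating the univariate Gaussian density N(x | mu, sigma^2).
import math

def gaussian_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

mu, sigma2 = 0.0, 1.0
# The density peaks at the mean, with height 1/sqrt(2*pi*sigma^2).
print(round(gaussian_pdf(mu, mu, sigma2), 4))   # → 0.3989

# Crude Riemann-sum check that the density integrates to 1 over [-8, 8].
dx = 0.001
total = sum(gaussian_pdf(-8 + i * dx, mu, sigma2) * dx for i in range(16000))
print(abs(total - 1.0) < 1e-3)                  # → True
```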
The Gaussian Distribution
• For the case of a single real-valued variable x, the Gaussian distribution is defined as:

  N(x | μ, σ^2) = (1 / (2πσ^2)^{1/2}) exp{ −(x − μ)^2 / (2σ^2) }

• The Gaussian distribution satisfies:

  N(x | μ, σ^2) > 0  and  ∫ N(x | μ, σ^2) dx = 1,

which are the two requirements for a valid probability density.
Mean and Variance
• The expected value of x takes the following form:

  E[x] = ∫ N(x | μ, σ^2) x dx = μ

Because the parameter μ represents the average value of x under the distribution, it is referred to as the mean.

• Similarly, the second-order moment takes the form:

  E[x^2] = ∫ N(x | μ, σ^2) x^2 dx = μ^2 + σ^2

• It then follows that the variance of x is given by:

  var[x] = E[x^2] − E[x]^2 = σ^2
Sampling Assumptions
• Suppose we have a dataset of observations x = (x1,…,xN)^T, representing N 1-dimensional observations.
• Assume that the training examples are drawn independently from the set of all possible examples, or from the same underlying distribution.
• We also assume that the training examples are identically distributed → the i.i.d. assumption.
• Assume that the test samples are drawn in exactly the same way -- i.i.d. from the same distribution as the training data.
• These assumptions make it unlikely that some strong regularity in the training data will be absent in the test data.
Gaussian Parameter Estimation
• Suppose we have a dataset of i.i.d. observations x = (x1,…,xN)^T, representing N 1-dimensional observations.
• Because our dataset x is i.i.d., we can write down the joint probability of all the data points given μ and σ^2:

  p(x | μ, σ^2) = Πn N(xn | μ, σ^2)

• When viewed as a function of μ and σ^2, this is called the likelihood function for the Gaussian.
Maximum (Log) Likelihood
• The log-likelihood can be written as:

  ln p(x | μ, σ^2) = −(1/(2σ^2)) Σn (xn − μ)^2 − (N/2) ln σ^2 − (N/2) ln(2π)

• Maximizing w.r.t. μ gives the sample mean:

  μ_ML = (1/N) Σn xn

• Maximizing w.r.t. σ^2 gives the sample variance:

  σ^2_ML = (1/N) Σn (xn − μ_ML)^2
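The two ML estimators are one line each (a minimal sketch on assumed toy data):

```python
# ML estimates for a Gaussian: the sample mean and the (biased, 1/N) sample variance.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 6.0])
N = len(x)

mu_ml = np.sum(x) / N
var_ml = np.sum((x - mu_ml) ** 2) / N     # note: divides by N, not N-1

print(mu_ml, var_ml)   # → 3.0 3.5
```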
Properties of the ML Estimation
• The ML approach systematically underestimates the variance of the distribution.
• This is an example of a phenomenon called bias.
• It is straightforward to show that:

  E[μ_ML] = μ,   E[σ^2_ML] = ((N−1)/N) σ^2

• It follows that the following estimate is unbiased:

  σ̃^2 = (N/(N−1)) σ^2_ML = (1/(N−1)) Σn (xn − μ_ML)^2
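The (N−1)/N bias can be seen empirically (an illustrative simulation with assumed parameters, not from the slides):

```python
# Averaged over many datasets of size N, the ML variance estimate comes out
# near ((N-1)/N) * sigma^2, illustrating the bias.
import random

random.seed(1)
N, trials = 2, 200_000        # true distribution: standard Gaussian, sigma^2 = 1
total_ml = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(N)]
    mu_ml = sum(xs) / N
    total_ml += sum((x - mu_ml) ** 2 for x in xs) / N

avg_ml = total_ml / trials
print(round(avg_ml, 2))       # close to (N-1)/N * sigma^2 = 0.5, not 1.0
```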


Properties of the ML Estimation
• Example of how bias arises in using ML to determine the variance of a Gaussian distribution:

• The green curve shows the true Gaussian distribution.
• Fit three datasets, each consisting of two blue points.
• Averaged across the 3 datasets, the mean is correct.
• But the variance is underestimated, because it is measured relative to the sample (and not the true) mean.
Probabilistic Perspective
• So far we saw that polynomial curve fitting can be expressed in terms of error minimization. We now view it from a probabilistic perspective.
• Suppose that our model arose from a statistical model:

  t = y(x, w) + ε

where ε is a random error having a Gaussian distribution with zero mean, and is independent of x.

Thus we have:

  p(t | x, w, β) = N(t | y(x, w), β^{-1})

where β is a precision parameter, corresponding to the inverse variance.

Note: I will use probability distribution and probability density interchangeably. It should be obvious from the context.
Maximum Likelihood
If the data are assumed to be independently and identically distributed (i.i.d. assumption), the likelihood function takes the form:

  p(t | X, w, β) = Πn N(tn | y(xn, w), β^{-1})

It is often convenient to maximize the log of the likelihood function:

  ln p(t | X, w, β) = −(β/2) Σn ( y(xn, w) − tn )^2 + (N/2) ln β − (N/2) ln(2π)

• Maximizing the log-likelihood with respect to w (under the assumption of Gaussian noise) is equivalent to minimizing the sum-of-squared error function.
• Determine w_ML by maximizing the log-likelihood. Then maximizing w.r.t. β gives:

  1/β_ML = (1/N) Σn ( y(xn, w_ML) − tn )^2
Predictive Distribution
Once we have determined the parameters w_ML and β_ML, we can make predictions for new values of x:

  p(t | x, w_ML, β_ML) = N(t | y(x, w_ML), β_ML^{-1})

Later we will consider Bayesian linear regression.
Statistical Decision Theory
• We now develop a small amount of theory that provides a framework for developing many of the models we consider.
• Suppose we have a real-valued input vector x and a corresponding target (output) value t with joint probability distribution p(x, t).
• Our goal is to predict the target t given a new value for x:
- for regression: t is a real-valued continuous target.
- for classification: t is a categorical variable representing class labels.

The joint probability distribution p(x, t) provides a complete summary of the uncertainties associated with these random variables. Determining p(x, t) from training data is known as the inference problem.
