
INTRODUCTION TO MACHINE LEARNING VIA LINEAR REGRESSION

Goal: study the key concepts of supervised learning based on a linear regression example.

3.1 Supervised learning

Problem formulation: given a training set D containing N training points (x_n, t_n), n = 1, ..., N, where
- x_n are the independent variables (covariates, domain points, explanatory variables)
- t_n are the dependent variables (labels, responses)

[Figure: example training set of N = 10 points (x_n, t_n).]

Goal: predict t for a new, unobserved domain point x.
- We need to make assumptions about the mechanism generating the data (inductive bias).
- This is the difference between memorizing and learning.
3.2 Statistical inference

Task: predict t given an observation x and a known joint distribution p(x, t).

Introduce a non-negative loss function ℓ(t, t̂): the cost (risk) incurred if the correct value is t and the estimate is t̂.

Examples:
- ℓ_q loss: ℓ_q(t, t̂) = |t − t̂|^q
- quadratic loss: ℓ_2(t, t̂) = (t − t̂)^2
- 0-1 loss: ℓ_0(t, t̂) = 1(t ≠ t̂) ∈ {0, 1}

Generalization loss (risk): average loss of a predictor t̂(x) over p(x, t):

  L_p(t̂) = E_{(x,t) ~ p(x,t)}[ ℓ(t, t̂(x)) ]

The optimal prediction t̂*(x) is obtained by minimizing the conditional risk:

  t̂*(x) = argmin_t̂ E_{t ~ p(t|x)}[ ℓ(t, t̂) | x ]        (1)

Only the posterior distribution p(t | x) needs to be known; the joint distribution p(x, t) is not required.
Under the ℓ_2 loss, the optimal predictor is the posterior mean, t̂*(x) = E[t | x], since

  E_{t|x}[ (t − t̂)^2 ] = E[t^2 | x] − 2 t̂ E[t | x] + t̂^2

  d/dt̂ E_{t|x}[ (t − t̂)^2 ] = −2 E[t | x] + 2 t̂ = 0  ⟹  t̂* = E[t | x]

Ex.: p(t | x) = 0.8 δ(t − 1) + 0.2 δ(t + 1)  ⟹  t̂*(x) = 0.8 · 1 + 0.2 · (−1) = 0.6
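As a quick numerical check of this example, here is a minimal NumPy sketch (the grid search over candidate estimates is my own addition): the expected quadratic loss is indeed minimized at the posterior mean 0.6.

```python
import numpy as np

# Two-point posterior from the example: p(t = 1 | x) = 0.8, p(t = -1 | x) = 0.2
values, probs = np.array([1.0, -1.0]), np.array([0.8, 0.2])

candidates = np.linspace(-1, 1, 2001)                            # candidate estimates t_hat
risk = [(probs * (values - c) ** 2).sum() for c in candidates]   # E[(t - t_hat)^2 | x]
print(candidates[np.argmin(risk)])                               # ~0.6 = E[t | x]
```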

The performance of a predictor t̂(x) is measured by the difference between L_p(t̂) and the minimum generalization loss L_p(t̂*) achieved by the optimal predictor.

In the following, three learning approaches for obtaining t̂(x) are discussed: the frequentist approach, the Bayesian approach, and the MDL approach.

3.3 Frequentist approach

Assumption: the training data points (x_n, t_n) ∈ D are drawn i.i.d. from a true but unknown distribution p(x, t):

  (x_n, t_n) ~ i.i.d. p(x, t),  n = 1, ..., N

Since the distribution is unknown, the strategy (1) from above cannot be applied directly. Two ways to approach the problem of the unknown distribution:

1. Separate learning and inference: learn an approximation p̂(t | x) of the distribution based on the data and use

   t̂(x) = argmin_t̂ E_{t ~ p̂(t|x)}[ ℓ(t, t̂) | x ]

2. Direct inference via empirical risk minimization (ERM): learn an approximation t̂_D(·) of the optimal decision as

   t̂_D(·) = argmin_{t̂(·)} L_D(t̂(·))

   with the empirical risk (loss)

   L_D(t̂) = (1/N) Σ_{n=1}^{N} ℓ(t_n, t̂(x_n))

Remark: In contrast to the generalization loss, where the expectation is taken over the true distribution, here we take the average over the available data.

We first look at a linear regression example. Assume

  p(x, t) = p(x) p(t | x),  x_n ~ Unif(0, 1),  and  p(t | x) = N(t | sin(2πx), 0.1).
[Figure: training set of N = 10 points drawn from this distribution.]

If the distribution is known, the optimal predictor under the ℓ_2 loss is t̂*(x) = E[t | x] = sin(2πx).

Minimum generalization loss:

  L_p(t̂*) = ∫∫ (t − sin(2πx))^2 p(x, t) dx dt
           = E[t^2 | x] − 2 sin(2πx) E[t | x] + sin^2(2πx) = Var(t | x) = 0.1
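As a sanity check of this example, here is a minimal NumPy sketch (variable and function names are my own) that samples from the assumed distribution and estimates the generalization loss of the optimal predictor sin(2πx) by Monte Carlo; the result is close to the noise variance 0.1.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_data(n):
    """Draw n points from the assumed model: x ~ Unif(0,1), t | x ~ N(sin(2*pi*x), 0.1)."""
    x = rng.uniform(0.0, 1.0, size=n)
    t = rng.normal(np.sin(2 * np.pi * x), np.sqrt(0.1))  # 0.1 is the noise variance
    return x, t

# Monte Carlo estimate of L_p(t*) for the optimal predictor t*(x) = sin(2*pi*x):
x, t = sample_data(100_000)
print(np.mean((t - np.sin(2 * np.pi * x)) ** 2))  # ~0.1 = Var(t | x)
```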

3.3.1 Discriminative vs. generative models

Formulate a hypothesis class, i.e., a family of parametric probabilistic models, and learn the parameters of the model which best fits the data.

Model the label t as a polynomial of the domain point x plus Gaussian noise, with mean

  μ(x, w) = Σ_{j=0}^{M} w_j x^j = w^T φ(x)

with weight vector w = [w_0, ..., w_M]^T and feature vector φ(x) = [1, x, x^2, ..., x^M]^T; M is the model order.

Now define the parametric probabilistic model

  p(t | x, θ) = N(t | μ(x, w), β^{-1}),  model parameters θ = (w, β),

where β is the precision (inverse variance).

Discriminative model:
- learn the conditional distribution p(t | x, θ) by learning the parameter vector θ directly from the data; the estimator in (1) can then be calculated directly;
- the model discriminates between values of t based on the inputs x;
- main focus in this section.

Generative probabilistic model:
- learn the joint distribution p(x, t), parameterized by θ, i.e., p(x, t | θ).
- Remark: this also models the distribution p(x) of the covariates, so the model can produce realizations of x.
- Use Bayes' theorem (computing the marginal distribution of x) to obtain p(t | x, θ) and then the estimate t̂(x) as in (1).
3.3.2 ML learning

Assume the model order M is fixed; we want to learn the model parameters θ by maximum likelihood (ML) from the data points.

Discriminative model:

  p(t_D | x_D, w, β) = Π_{n=1}^{N} p(t_n | x_n, w, β) = Π_{n=1}^{N} N(t_n | μ(x_n, w), β^{-1})

Taking the log on both sides gives the log-likelihood (LL) function

  ln p(t_D | x_D, w, β) = Σ_{n=1}^{N} ln p(t_n | x_n, w, β) = Σ_{n=1}^{N} ln N(t_n | μ(x_n, w), β^{-1})

The ML learning problem is defined as the minimization of the negative LL (NLL) over the model parameters,

  min_{w, β}  −(1/N) Σ_{n=1}^{N} ln p(t_n | x_n, w, β),

i.e., the log-loss (cross-entropy) criterion.

Why? By the strong law of large numbers,

  −(1/N) Σ_{n=1}^{N} ln p(t_n | x_n, w, β)  →  E_{(x,t) ~ p(x,t)}[ −ln p(t | x, w, β) ]   as N → ∞,

which is the expected cross-entropy: the ML problem attempts to make the model-based p(t | x, w, β) close to the actual p(t | x).

Since the optimal predictor under the ℓ_2 loss requires only the posterior mean μ(x, w), the β term can be ignored when learning w:

  min_w  L_D(w) = (1/N) Σ_{n=1}^{N} (t_n − μ(x_n, w))^2        (2)

where L_D is the training loss.
(2) can be solved in closed form. Write

  L_D(w) = (1/N) ‖t_D − Φ_D w‖^2        (3)

with t_D = [t_1, ..., t_N]^T and the N × (M + 1) feature matrix

  Φ_D = [φ(x_1), ..., φ(x_N)]^T.

Minimization of (3) is a least-squares (LS) problem with the solution

  w_ML = (Φ_D^T Φ_D)^{-1} Φ_D^T t_D

(overdetermined case N ≥ M + 1). The matrix (Φ_D^T Φ_D)^{-1} Φ_D^T is also known as the pseudo-inverse of Φ_D.

Differentiating the NLL with respect to β yields 1/β_ML = L_D(w_ML).
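A minimal NumPy sketch of this ML/LS fit (the helper names poly_features and fit_ml are my own; np.linalg.pinv computes the pseudo-inverse):

```python
import numpy as np

def poly_features(x, M):
    """Feature matrix Phi_D with rows phi(x_n) = [1, x_n, ..., x_n^M]."""
    return np.vander(np.asarray(x), M + 1, increasing=True)

def fit_ml(x, t, M):
    """ML estimate: w_ML via the pseudo-inverse, beta_ML as 1 / training loss."""
    Phi = poly_features(x, M)
    w_ml = np.linalg.pinv(Phi) @ t               # = (Phi^T Phi)^{-1} Phi^T t for N >= M+1
    train_loss = np.mean((t - Phi @ w_ml) ** 2)  # L_D(w_ML)
    return w_ml, 1.0 / train_loss

# Example usage on a toy training set of N = 10 points:
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 10)
t = rng.normal(np.sin(2 * np.pi * x), np.sqrt(0.1))
w_ml, beta_ml = fit_ml(x, t, M=3)
```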
Overfitting and underfitting

Assumption: ℓ_2 loss.

Going back to the example with p(t | x) = N(t | sin(2πx), 0.1) and the optimal but unknown predictor t̂*(x) = sin(2πx): how does it compare with the ML predictor t̂(x) = μ(x, w_ML)?
[Figure: ML fits μ(x, w_ML) for N = 10 training points and model orders M = 1, M = 3, M = 9.]
- M = 1: the predictor underfits the data: large training loss.
- M = 9: the predictor overfits the data: small training loss but large generalization loss
    L_p(w_ML) = E_{(x,t) ~ p(x,t)}[ (t − μ(x, w_ML))^2 ];
  the model merely memorizes the training set.
- M = 3: good choice.
[Figure: training loss L_D and generalization loss L_p vs. model order M, showing the underfitting (small M) and overfitting (large M) regimes.]

What happens with a large training set?

[Figure: square-root training loss and test (generalization) loss vs. the number of training points N.]
Remark: If N is large enough compared to the number of parameters in w, the weight vector w_ML that minimizes L_D also approximately minimizes L_p:

  L_D(w_ML) ≈ L_p(w_ML) ≈ L_p(w*)   for large N

(we make this precise later). L_D(w_ML) increases with N, as it becomes more difficult to fit all the data.

Error analysis

Two error types: bias and estimation error. Write the generalization loss as

  L_p(w_ML) = L_p(t̂*) + ( L_p(w*) − L_p(t̂*) ) + ( L_p(w_ML) − L_p(w*) )

where L_p(w*) = min_w L_p(w) is the best generalization loss achievable within the given hypothesis class (model):
- L_p(w*) − L_p(t̂*) is the bias, due to the choice of the hypothesis class;
- L_p(w_ML) − L_p(w*) is the estimation error, due to w_ML being estimated from a finite data set.
[Figure: square-root generalization loss vs. N: the loss of the ML estimate for finite N decreases towards L_p(w*), the best within the hypothesis class, which remains above L_p(t̂*).]

Validation and testing

Problem: how to select the model order M? The distribution p(x, t) is unknown, so L_p(w) cannot be evaluated.

Solution: divide the available data into two parts, a training set and a hold-out (validation) set. The validation set is used to obtain the empirical average

  L_p(w) ≈ (1/N_v) Σ_{n=1}^{N_v} ℓ(t_n, μ(x_n, w))

over the validation set of size N_v, where w has been obtained from the training set. This estimate of L_p(w) is used for model order selection.

A test set is additionally needed to compute an estimate of L_p for the final choice of M and θ.
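A minimal sketch of hold-out model-order selection under the same assumed data model (helper names and the candidate range of M are my own):

```python
import numpy as np

rng = np.random.default_rng(2)

def poly_features(x, M):
    return np.vander(np.asarray(x), M + 1, increasing=True)

def sample_data(n):
    x = rng.uniform(0, 1, n)
    return x, rng.normal(np.sin(2 * np.pi * x), np.sqrt(0.1))

x_tr, t_tr = sample_data(10)    # training set
x_va, t_va = sample_data(100)   # hold-out / validation set

for M in range(10):
    w = np.linalg.pinv(poly_features(x_tr, M)) @ t_tr              # w_ML from the training set
    val_loss = np.mean((t_va - poly_features(x_va, M) @ w) ** 2)   # empirical estimate of L_p(w)
    print(M, val_loss)
# Select the M with the smallest validation loss; a separate test set would then
# be used to estimate L_p for that final choice.
```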
3.3.3 MAP learning

A priori information on the vector of weights can be used to reduce the effects of overfitting: ‖w_ML‖ explodes for increasing M (see Section 2.3.4).

Remedy: apply an a priori distribution which puts less weight on large values, i.e.,

  w ~ N(0, α^{-1} I),   i.i.d. entries with variance α^{-1}.

ML maximizes the likelihood p(t_D | x_D, w, β); MAP instead maximizes

  p(t_D | x_D, w, β) p(w) = [ Π_{n=1}^{N} p(t_n | x_n, w, β) ] p(w).

MAP learning criterion:

  min_{w, β}  −Σ_{n=1}^{N} ln p(t_n | x_n, w, β) − ln p(w)

With the training loss L_D(w) = (1/N) ‖t_D − Φ_D w‖^2 and λ = α/β we obtain

  min_w  L_D(w) + (λ/N) ‖w‖^2,

i.e., the ML criterion plus a regularization term R(w). The solution follows from standard LS analysis:

  w_MAP = (λ I + Φ_D^T Φ_D)^{-1} Φ_D^T t_D

As N → ∞, the contribution of the regularization term becomes negligible and w_MAP approaches the ML estimate.


[Figure: generalization loss L_p and training loss L_D vs. ln λ.]

Increasing λ has the same effect as reducing the model order M.
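A minimal NumPy sketch of this MAP (ridge-regularized LS) solution, reusing the polynomial feature matrix from before (helper names are my own):

```python
import numpy as np

def poly_features(x, M):
    return np.vander(np.asarray(x), M + 1, increasing=True)

def fit_map(x, t, M, lam):
    """MAP estimate w_MAP = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    Phi = poly_features(x, M)
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

# lam -> 0 recovers the ML/LS solution; a larger lam shrinks the weights,
# acting like a smaller effective model order.
```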

Another example of an a priori distribution is the Laplace pdf, which leads to the regularizer

  R(w) = ‖w‖_1 = Σ_{j=0}^{M} |w_j|.

The resulting MAP problem,

  min_w  L_D(w) + (λ/N) ‖w‖_1,

is known as the LASSO (least absolute shrinkage and selection operator).
3.4 Bayesian approach

Frequentist approach: the existence of a true (but unknown) distribution p(x, t) is assumed, and the ML/MAP problem tries to find a parameter vector θ such that the model distribution p(t | x, w, β) is close to the true distribution.

Bayesian approach:
(i) the data points are jointly distributed according to a known distribution;
(ii) the model parameters w are jointly distributed with the data.

(We ignore β in the following and treat only w as a random parameter.)

The approach is characterized by the joint distribution p(t_D, w, t | x_D, x), where x is a new domain point and t the new label to be predicted.

The Bayesian model evaluates the a posteriori (posterior) distribution p(t | x, x_D, t_D) = p(t | x, D), given x and D, to predict the new label. The posterior distribution is obtained by manipulating the joint distribution.

Using the chain rule of conditional probability, p(a, b | c) = p(a | b, c) p(b | c), we obtain

  p(t_D, w, t | x_D, x) = p(w) · p(t_D | x_D, w) · p(t | x, w),

i.e., a priori distribution × likelihood × distribution of the new label, with

  likelihood term:       p(t_D | x_D, w) = Π_{n=1}^{N} N(t_n | μ(x_n, w), β^{-1})
  pdf of the new label:  p(t | x, w) = N(t | μ(x, w), β^{-1})

This factorization can be graphically represented by a Bayesian network. We first drop the dependency on the domain points:

  p(t_D, w, t) = p(w) p(t_D | w) p(t | w)

Draw a vertex for each involved r.v. and a directed edge from each conditioning r.v. to the main r.v. of each distribution.

[Figure: Bayesian network with node w pointing to the nodes t_1, ..., t_N and t.]
The Bayesian approach is inference-based: the learning standpoint is hidden, since all quantities in the model are r.v.s. The posterior distribution of the new label is obtained by manipulating the joint distribution:

  p(t | D, x) = p(t, t_D | x_D, x) / p(t_D | x_D)
              = ∫ p(w | D) p(t | x, w) dw        (4)

i.e., the posterior distribution of the weights times the predictive distribution of the new label, integrated over w. Using Bayes' theorem,

  p(w | D) = p(w) p(t_D | x_D, w) / p(t_D | x_D)

is the a posteriori (posterior) belief over the weights.

The computation is in general difficult, but in our example p(t_D | x_D, w) and p(t | x, w) are Gaussian, so (4) amounts to a convolution of Gaussian distributions. Assuming the ℓ_2 loss, we obtain

  p(t | x, D) = N(t | μ(x, w_MAP), s^2(x))   with
  s^2(x) = β^{-1} ( 1 + φ(x)^T (λ I + Φ_D^T Φ_D)^{-1} φ(x) )

(cf. Bishop, eqns. (3.58), (3.59), (2.115)). Hence the optimal predictor under the ℓ_2 loss is the MAP predictor μ(x, w_MAP); this is not true in general.
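A minimal sketch of this Gaussian predictive distribution for the polynomial model (helper names are my own):

```python
import numpy as np

def poly_features(x, M):
    return np.vander(np.asarray(x), M + 1, increasing=True)

def bayes_predict(x_new, x, t, M, alpha, beta):
    """Predictive mean mu(x, w_MAP) and variance s^2(x) of p(t | x, D); x_new is a 1-D array."""
    lam = alpha / beta
    Phi = poly_features(x, M)
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    w_map = np.linalg.solve(A, Phi.T @ t)
    Phi_new = poly_features(x_new, M)
    mean = Phi_new @ w_map
    # s^2(x) = (1/beta) * (1 + phi(x)^T (lam*I + Phi^T Phi)^{-1} phi(x))
    var = (1.0 + np.einsum('ij,ji->i', Phi_new, np.linalg.solve(A, Phi_new.T))) / beta
    return mean, var
```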
3.4.1 Comparison with ML and MAP

The Bayesian posterior distribution p(t | x, D) allows for a more refined prediction of the labels t.

Compared to ML: the predictive distribution p(t | x, w_ML) = N(t | μ(x, w_ML), β^{-1}) (and similarly for MAP) has the same variance for all data points.

Bayesian: the accuracy of the prediction depends on the value of x and is set by the uneven distribution of the data points.

[Figure: Bayesian predictive mean with uncertainty bands over x, together with the training points; the uncertainty adapts to where data is available.]

For N → ∞: w_MAP → w_ML and s^2(x) → β^{-1}, so the Bayesian approach approaches ML.
3.4.2 Marginal likelihood and model selection

Advantage: Bayesian model selection is possible without validation, via the marginal likelihood

  p(t_D | x_D) = ∫ p(w) Π_{n=1}^{N} p(t_n | x_n, w) dw

A larger M can result in a smaller marginal likelihood, in contrast to the ML approach, where p(t_D | x_D, w_ML) can only increase with increasing M. This allows the model order to be selected.

[Figure: marginal likelihood p(t_D | x_D) vs. model order M, peaking at the selected model order.]
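For the linear-Gaussian model above, the marginal likelihood is available in closed form: with w ~ N(0, α^{-1} I), the labels are jointly Gaussian, t_D ~ N(0, β^{-1} I + α^{-1} Φ_D Φ_D^T). A minimal sketch (helper names and the hyperparameter values are my own):

```python
import numpy as np
from scipy.stats import multivariate_normal

def poly_features(x, M):
    return np.vander(np.asarray(x), M + 1, increasing=True)

def log_marginal_likelihood(x, t, M, alpha, beta):
    """log p(t_D | x_D) for t_D ~ N(0, beta^{-1} I + alpha^{-1} Phi Phi^T)."""
    Phi = poly_features(x, M)
    cov = np.eye(len(x)) / beta + Phi @ Phi.T / alpha
    return multivariate_normal(mean=np.zeros(len(x)), cov=cov).logpdf(t)

# Evaluate over candidate model orders and pick the maximizer, e.g.:
# best_M = max(range(10), key=lambda M: log_marginal_likelihood(x, t, M, alpha=1.0, beta=10.0))
```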

3.5 Minimum description length (MDL)

Compression: for an r.v. x ∈ X with distribution p(x), a lossless compression scheme can be designed which needs ⌈−log p(x)⌉ bits to represent x.

MDL selects the model which losslessly compresses D to the shortest possible description. In our setting, the model order M is selected which minimizes the description length

  −Σ_{n=1}^{N} log p(t_n | x_n, w_ML, β_ML) + C(M),

where C(M) is the (smallest) number of bits needed to quantize and describe the parameters w_ML, β_ML. This term acts as a regularizer.
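A minimal sketch of this two-part description length as a model-selection score, assuming (purely for illustration) a fixed per-parameter bit budget for C(M):

```python
import numpy as np

def poly_features(x, M):
    return np.vander(np.asarray(x), M + 1, increasing=True)

def description_length(x, t, M, bits_per_param=16):
    """-sum_n log2 p(t_n | x_n, w_ML, beta_ML) + C(M), with an assumed C(M)."""
    Phi = poly_features(x, M)
    w_ml = np.linalg.pinv(Phi) @ t
    var_ml = np.mean((t - Phi @ w_ml) ** 2)                       # 1 / beta_ML
    nll_nats = 0.5 * len(x) * (np.log(2 * np.pi * var_ml) + 1.0)  # Gaussian NLL at the ML fit
    return nll_nats / np.log(2) + bits_per_param * (M + 2)        # C(M): M+1 weights plus beta
```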

Not discussed in class; see Chapter 2 of the text, and read about the pitfalls associated with learned distributions (Chapter 2.7).
