
INF 8245E - Machine Learning
Fall 2021
Sarath Chandar

1. Introduction to Machine Learning
Study of algorithms that improve their performance P at some task T with experience E.

Learning task: <P, T, E>

Examples:

1) T: spam classification
   P: accuracy
   E: set of example mails labelled as spam or not spam.

2) T: credit card fraudulent transaction detection
   P: accuracy
   E: set of historical transactions marked as legit or fraud.

3) T: playing the game of chess
   P: number of games won
   E: set of game trajectories from expert players.
Prediction:

The most common ML application.

    input x (observation)  →  Agent  →  output y (prediction)

Examples:

① Given the size of the house, predict the selling price.
   x: size of the house
   y: selling price

② Given the current stock price of a company, predict the same after 10 minutes.
   x: stock price at time t
   y: stock price at time t+10

③ Given an image, predict the object in the image.
   x: image as pixels
   y: object name from a predefined set, e.g. {dog, cat, birds, flight, ball}

④ Given an image of a handwritten digit, predict the digit.
   x: image of the digit
   y: one of {0, 1, 2, ..., 9}

⑤ Given the temperature, humidity, and wind, predict whether it will rain or not.
   x: temperature, humidity, wind
   y: yes / no

Note:
① In examples ① and ②, y is a real number:  y ∈ R.
② In examples ③, ④, ⑤, y is categorical:
   y ∈ {dog, cat, birds, flight, ball} in ③
   y ∈ {0, 1, ..., 9} in ④
   y ∈ {yes, no} in ⑤

    y ∈ R              →  regression problem.
    y ∈ {c1, ..., cK}  →  classification problem.

In ⑤, y ∈ {yes, no}  →  binary classification problem.

Prediction task: Given some input x, make a good prediction of the output y, denoted by ŷ.

x can be a vector  x = <x1, x2, ..., xp>  →  p features.
y can be a scalar or a vector.

Note:  y  →  target
       ŷ  →  model's prediction

Supervised Learning:

Given a training set {x^(i), y^(i)}_{i=1}^{N} of N data points, learn a prediction
function f: x → y, such that given a new x, f can accurately predict the
corresponding y.

Note: The prediction function f is useful only if it can make accurate predictions
for unseen x. We will call this capability generalization to unseen instances.
Generalization is a key requirement for any ML algorithm.

Simple Example: House price prediction in Portland, Oregon

John, who lives in Portland, Oregon, wants to sell his house and wants to know what
a good market price would be. One way to do this is to first collect information on
recently sold houses and make a model of housing prices. This is an example of
regression.

Let us consider the following dataset:

    x                        y
    Size of the house        Price of the house
    (in square feet)
    ------------------       ------------------
    2104                     399,900
    1600                     329,900
    2400                     369,000
    ...                      ...
Let us say we have 47 data points. The first step would be to split the data into
training data and test data.

① We will use the training data to train the model.
② We will use the test data only to test the generalization of the model.

Note: The test set is a proxy for the true performance of the model when it sees a
new x after training.

Let us divide the dataset into 30 training instances and 17 test instances.

[Figure: visualization of the training data, house size vs. price.]
Now consider this simple model:

    ŷ = w0 + w1 x

What kind of curves can this model fit? This model can fit only lines.

[Figure: example lines for different parameter values, e.g. (w0, w1) = (1.5, 0),
(0, 0.5), (1, 0.5).]

→ Note the significance of adding w0.
→ If you don't add w0, you can only cover lines passing through the origin!
  w0 = bias in ML literature.

    ŷ = w0 + w1 x

Parameters of the model:  w = (w0, w1)

    ŷ(x; w) = w0 + w1 x

Given x, we want to learn the parameters w0, w1.

The first step is to define an objective function that the model should achieve.

Objective: We want ŷ(x^(i); w) to be as close as possible to y^(i).

This can be done by minimizing the following error function over the N training
instances:

    E(w) = (1/2) Σ_{i=1}^{N} ( ŷ(x^(i); w) - y^(i) )²
         = (1/2) Σ_{i=1}^{N} ( w0 + w1 x^(i) - y^(i) )²

This is the least squares error function: ŷ(x^(i); w) is the model's prediction and
y^(i) is the target.

Objective: minimize E(w) with respect to w.
How to find the minimum?

→ This error function is a quadratic function of the parameters, so there is only
  one minimum.
→ At the minimum, the derivative of the error function with respect to the
  parameters w will be zero, i.e.  dE/dw = 0.
→ The derivatives of E(w) will be linear in the elements of w.
→ min_w E(w) has a unique solution w*, which can be found in closed form.

The fitted model:  ŷ(x; w*) = w0* + w1* x

Note: This is an example of a linear model: the model is linear in terms of the
parameters. Linear models with quadratic error functions have a unique closed-form
solution!

→ More on this later.

Switching to vector notation:

    ŷ = w0 + w1 x
      = w0 · 1 + w1 · x

Let  w = (w0, w1)ᵀ  and  x = (1, x)ᵀ.   Then  ŷ = wᵀx.

Note: We will often prepend 1 to x to avoid treating the bias separately.

Now consider the entire training data:

    X = data matrix       (rows in X = examples, columns in X = features)
    Y = target vector     (y^(1), ..., y^(N))ᵀ

    X : N x p matrix
    Y : N x 1 vector
    w : p x 1 vector

    ŷ = X w
        (N x p)(p x 1) → (N x 1)

A single matrix-vector multiplication predicts y for all N examples.
Solution to least squares:

    E(w) = (1/2) (Y - Xw)ᵀ(Y - Xw)

Differentiate with respect to w and set it to zero to find the minimum:

    dE/dw = 0
    Xᵀ(Y - Xw) = 0
    XᵀY - XᵀXw = 0
    XᵀXw = XᵀY

    w* = (XᵀX)⁻¹ XᵀY

The quantity (XᵀX)⁻¹Xᵀ is the Moore-Penrose pseudo-inverse of X.

Now, given a new x, we can find the target y as follows:

    ŷ = w*ᵀ x
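A minimal sketch of this closed-form fit in NumPy (the house sizes, prices, and
variable names below are illustrative placeholders, not the actual dataset from
these notes):

    import numpy as np

    def fit_least_squares(X, y):
        # Closed-form solution w* = (X^T X)^{-1} X^T y, computed here via the
        # Moore-Penrose pseudo-inverse of X for numerical stability.
        return np.linalg.pinv(X) @ y

    # Illustrative data (house size in sq. ft. -> price); not the real 47-point dataset.
    sizes  = np.array([2104.0, 1600.0, 2400.0])
    prices = np.array([399900.0, 329900.0, 369000.0])

    # Prepend a column of ones so the bias w0 is handled like any other weight.
    X = np.column_stack([np.ones_like(sizes), sizes])
    w_star = fit_least_squares(X, prices)

    # Prediction for a new house: y_hat = w*^T x
    x_new = np.array([1.0, 1800.0])
    y_hat = x_new @ w_star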
Back to the example:

How good is the learnt linear regression model?

Least squares error (LSE):

    E(w) = (1/2) Σ_{i=1}^{N} ( y^(i) - wᵀx^(i) )²

    → not on the same scale as the prediction, and it depends on N.

A better metric for performance: Root Mean Square Error (RMSE)

    E_RMS = sqrt( 2 E(w*) / N )

Note: RMSE is on the same scale as y.

    Train RMSE = 68727.04
    Test RMSE  = 57976.80

The model is off by approx. 50k$. Still a decent performance for a simple model.
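A short sketch of computing this metric, reusing the hypothetical w_star from the
previous snippet (X_train, X_test, etc. are placeholder names for pre-made splits):

    import numpy as np

    def rmse(w, X, y):
        # E_RMS = sqrt(2 E(w) / N), i.e. the root of the mean squared residual.
        residuals = X @ w - y
        return np.sqrt(np.mean(residuals ** 2))

    # Hypothetical usage, assuming train/test design matrices built as above:
    # train_rmse = rmse(w_star, X_train, y_train)
    # test_rmse  = rmse(w_star, X_test, y_test)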

Summary: Linear regression

    Model:         ŷ = wᵀx   (linear model)
    Parameters:    w
    Objective fn:  E(w) = (1/2)(Y - Xw)ᵀ(Y - Xw)
    Solution:      w* = (XᵀX)⁻¹ XᵀY
    Metric:        RMSE = sqrt( 2 E(w*) / N )
Synthetic example:

Consider the following synthetic function:

    y = sin(2πx) + ε

where we have added a small Gaussian noise ε to the output of the sin function:

    ε ~ N(0, 0.3)      (mean 0, std deviation 0.3)

The training data consists of 10 data points sampled from this function.

Note 1: In real-life problems we will not have access to the true function y.

Note 2: Adding Gaussian noise to the sin function is reasonable, because even in
real life the observed y will have some noise, due to observation errors or the
inherent stochasticity of the process that generates y.
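A small sketch of how such a synthetic dataset could be generated (the seed, the
sampling range, and the variable names are arbitrary choices, not from the notes):

    import numpy as np

    rng = np.random.default_rng(0)        # arbitrary seed for reproducibility
    N = 10                                # number of training points

    x = rng.uniform(0.0, 1.0, size=N)     # inputs sampled in [0, 1] (a choice)
    noise = rng.normal(0.0, 0.3, size=N)  # Gaussian noise, mean 0, std 0.3
    y = np.sin(2 * np.pi * x) + noise     # noisy observations of sin(2*pi*x)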

Linear regression solution:

[Figure: the learnt line over the training points.]

Looks like the learnt function (a line) does not fit the data very well.

We can consider higher-order polynomials:

    ŷ(x; w) = w0 + w1 x + w2 x² + ... + wM x^M

Note on notation:
    x²    : x raised to the power of 2
    x^(2) : x of the second example
    x_2   : 2nd feature of x

Using the fact that x⁰ = 1, we can write

    ŷ(x; w) = w0 x⁰ + w1 x¹ + ... + wM x^M
            = Σ_{j=0}^{M} wj x^j

Let  w = (w0, ..., wM)ᵀ  and  x = (x⁰, x¹, ..., x^M)ᵀ.   Then

    ŷ(x; w) = wᵀx.

Note: ŷ(x; w) is a non-linear function of x, but ŷ(x; w) is still a linear function
of w! So this is still a linear model.
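A sketch of building the feature vector (1, x, x², ..., x^M) for every training
point and reusing the same closed-form fit; the degree M and the function names are
illustrative choices:

    import numpy as np

    def polynomial_features(x, M):
        # Map each scalar input x_i to the row (x_i^0, x_i^1, ..., x_i^M).
        return np.vander(x, N=M + 1, increasing=True)

    M = 3                                 # polynomial degree (an illustrative choice)
    X_poly = polynomial_features(x, M)    # shape (N, M + 1)
    w_star = np.linalg.pinv(X_poly) @ y   # same least-squares closed form as before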

    E(w) = (1/2) Σ_{i=1}^{N} ( ŷ(x^(i); w) - y^(i) )²

is still a quadratic function of w. Hence, a unique solution exists.

x is now a vector instead of being a scalar. However, we can still use the same
least squares solution.

M = 2 is still not good.

Solutions for M = 3, 4 look good.

The solution for M = 5 is slightly deviating from the sin curve.

What happens as we keep increasing M?

The solution for M = 9 achieves zero training error!

Is this a good solution? NO. The fitted curve oscillates wildly and gives a poor
representation of sin(2πx).

→ This is known as overfitting.

In this example, we know the true function. So we can tell that M = 9 is not a good
approximation of it.

How do we tell if a model is overfitting when we don't know the true function?

If a model is overfitting, its generalization performance should be bad.

[Figure: training and test error vs. M. M = 9 gives zero training error but high
test error; the good region is where both are low.]
Wait! A 9-degree polynomial contains the 3-degree polynomial as a special case.
Then why does M = 9 perform badly?

Let us look at the learnt w*:

As M increases, the magnitudes of the coefficients get larger. For M = 9, the 10
weights are heavily tuned to the given 10 data points!

The 9-degree polynomial contains the 3-degree polynomial, so a 9-degree model should
be able to recover the 3-degree solution by setting the remaining weights to zero.
In other words, the 9-degree polynomial model is expressive enough to model the
given data. But we are not able to learn that solution.

How to fix overfitting?

Solution 1: Add more data points / examples so that the model cannot overfit.

[Figure: fits with 15 examples and with 100 examples.]

With 100 data points, the M = 9 model approximates the true function very well!
What if you cannot obtain more data?

Solution 2: Add a penalty term to the error function in order to discourage the
coefficients from reaching large values.

    E(w) = (1/2) Σ_{i=1}^{N} ( ŷ(x^(i); w) - y^(i) )²  +  (λ/2) ||w||²

where  ||w||² = wᵀw = w0² + w1² + ... + wM².

This is known as regularization; (λ/2)||w||² is the regularization term.

λ controls the relative importance of the regularization term.

Note: E(w) is still a quadratic function of w. This error function can be minimized
exactly in closed form. We will derive this solution later.
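The derivation comes later in the notes; as a hedged preview, the standard closed
form for this regularized objective is w* = (XᵀX + λI)⁻¹XᵀY, sketched below with
illustrative names:

    import numpy as np

    def fit_ridge(X, y, lam):
        # Regularized least squares: w* = (X^T X + lam * I)^{-1} X^T y.
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    # Hypothetical usage with the polynomial features from the earlier sketch:
    # w_reg = fit_ridge(X_poly, y, lam=1e-3)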


How to choose λ?

It is easier to work with ln λ and to compute λ as e^(ln λ).

[Figure: fits with no regularization and with too much regularization.]

When λ is too small, there is no regularization. When λ is too high, there is too
much regularization. It is very crucial to choose the right λ.

    λ = hyperparameter of the model.

Normally we will fix the hyperparameters and learn the parameters from the data.

But how to choose λ? Can we use the test set to choose λ?

NO! The test set is supposed to be a proxy for a completely new x. If you choose λ
based on the test set, then the model has seen the test set, so the test set will no
longer be a proxy for the real performance of the model.

What else can we do? Create a separate hold-out set.

① For different values of λ:
   - train the model on the training part.
   - compute the performance on the validation set.
② Pick the λ with the best validation performance.
③ Compute the test performance for the good region of λ.

Given a training set, we split the data into a training part and a validation part,
and use the validation set for model selection. This technique is known as hold-out
validation.
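A minimal sketch of this hold-out selection loop, assuming the hypothetical
fit_ridge and rmse helpers from the earlier sketches and pre-made training and
validation splits (X_train, X_valid, etc. are placeholder names):

    import numpy as np

    lambdas = [0.0, 1e-4, 1e-2, 1.0]      # candidate values (illustrative)
    best_lam, best_val = None, np.inf

    for lam in lambdas:
        w = fit_ridge(X_train, y_train, lam)    # learn parameters on the training part
        val_err = rmse(w, X_valid, y_valid)     # evaluate on the held-out validation part
        if val_err < best_val:
            best_lam, best_val = lam, val_err

    # Only after fixing best_lam do we touch the test set once, to report generalization.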

What if the chosen train/valid split is not representative? What if it was a bad
validation split?

We can do k-fold cross-validation:

    fold 1   fold 2   fold 3   fold 4   fold 5
    valid    train    train    train    train
    train    valid    train    train    train
    train    train    valid    train    train
    train    train    train    valid    train
    train    train    train    train    valid

- Divide the data into K disjoint folds.
- Use K-1 folds for training and the last fold for validation.
- Repeat the previous step so that every fold is used for validation once.
- Average the performance over all K folds.

→ Very costly!

Note: Leave-one-out cross-validation: when the dataset is too small, with n points,
we will do n-fold cross-validation.
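A compact sketch of k-fold cross-validation for a single hyperparameter value, again
reusing the hypothetical fit_ridge and rmse helpers from the earlier sketches:

    import numpy as np

    def kfold_rmse(X, y, lam, K=5):
        # Split the indices into K disjoint folds; train on K-1 folds,
        # validate on the held-out fold, and average the K errors.
        folds = np.array_split(np.random.permutation(len(y)), K)
        errors = []
        for k in range(K):
            valid_idx = folds[k]
            train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
            w = fit_ridge(X[train_idx], y[train_idx], lam)
            errors.append(rmse(w, X[valid_idx], y[valid_idx]))
        return np.mean(errors)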
-
In this example, we knew that the true function is a 3-degree polynomial, and hence
we can obtain it from the 9-degree polynomial by regularization.

What if the true function was a 15-degree polynomial and we don't know that? In
other words, how can we choose the value of M?

M is also a hyperparameter!

Solution 3: Try different values of M and select the value of M based on the
performance on the validation set. This is also model selection!

Machine Learning Pipeline:

① Define the input and output of the task.
② Collect examples for the task.
③ Divide the examples into train / valid / test sets.
④ Do data preprocessing.
⑤ Define your model. The model will consist of parameters and hyperparameters.
⑥ Define the error function you want to minimize.
⑦ For different values of the hyperparameters:
   - learn the model parameters by minimizing the error function.
   - compute the validation performance.
⑧ Pick the best model based on the validation performance.
⑨ Test the model with the test set.


Classification:

Consider the following binary classification problem. We are interested in
classifying the data points as either "blue" class or "orange" class.

There are 2 features: x1 and x2. We have 100 examples per class.

    class 1: blue
    class 2: orange

We will always convert the targets to numbers:

    Blue = 0, Orange = 1      (or)      Blue = -1, Orange = +1

Solution 1: Linear model

We can use the same idea from regression.

For {0, 1} classification:

    ŷ = 1  if  wᵀx > 0.5
    ŷ = 0  if  wᵀx ≤ 0.5

    wᵀx = 0.5 is the decision boundary.

For {-1, 1} classification:

    ŷ = -1  if  wᵀx < 0
    ŷ = +1  if  wᵀx ≥ 0

    wᵀx = 0 is the decision boundary.

[Figure: orange and blue regions separated by the decision boundary, for the linear
model without a bias term and with a bias term.]

Note: Some blues are wrongly classified as orange and vice versa. A line is clearly
not a good decision boundary for this classification problem.
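A tiny sketch of this thresholding rule on top of a least-squares fit, for the
{0, 1} encoding (variable names are placeholders):

    import numpy as np

    def predict_class(w, X, threshold=0.5):
        # Regress onto the {0, 1} targets, then threshold the raw output at 0.5.
        scores = X @ w
        return (scores > threshold).astype(int)

    # Hypothetical usage: w = fit_least_squares(X_train, y_train)
    # labels = predict_class(w, X_test)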

Solution 2: Nearest-neighbor methods

Use those observations in the training set T closest in input space to x to form ŷ.
If more neighbors are class 0, then predict class 0, and vice versa.

    ŷ(x) = (1/k) Σ_{x^(i) ∈ N_k(x)} y^(i)

    N_k(x) = neighborhood of x
           = the k closest points x^(i) in T.

Closest with respect to what metric? Euclidean distance.

    if ŷ(x) > 0.5, predict class 1
    else           predict class 0.

This is like majority voting from the neighbors. This model assumes that the class
distribution is locally smooth.

k = 10: the decision boundary is more irregular and responds to local clusters where
one class dominates. There are still some misclassifications.

k = 1: the decision boundary is even more irregular than before. There are no
misclassifications.
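A minimal k-NN classifier sketch under these definitions (Euclidean metric,
prediction by averaging the {0, 1} labels of the k nearest training points and
thresholding at 0.5); the function and variable names are illustrative:

    import numpy as np

    def knn_predict(X_train, y_train, x, k=10):
        # Euclidean distances from the query point x to every training point.
        dists = np.linalg.norm(X_train - x, axis=1)
        # Indices of the k nearest neighbours: N_k(x).
        nearest = np.argsort(dists)[:k]
        # y_hat(x) = average of the neighbours' {0, 1} labels, then threshold at 0.5.
        y_hat = np.mean(y_train[nearest])
        return int(y_hat > 0.5)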

k = hyperparameter of the nearest neighbor algorithm.

Let us do model selection using a separate validation set.

k = 1 overfits the training data!

k-NN performs better than the linear model here. It is a non-linear model.

Are there any other hyperparameters for the k-NN algorithm?
→ The metric used to compute the neighborhood.
Least squares vs. nearest neighbors:

Least squares:
- Decision boundary is very smooth.
- More stable.
- Assumes that the decision boundary is linear.
- Strong assumption: high bias, low variance.

Nearest neighbors:
- Decision boundary depends on a handful of input points and their positions.
- Less stable.
- Assumes that the class distribution is locally smooth.
- Low bias, high variance.

-
You should know!

① Prediction problem
② Regression / classification
③ Supervised learning / generalization
④ Linear regression
⑤ Bias in a linear model
⑥ Parameter / hyperparameter of a model
⑦ Overfitting
⑧ Solutions to overfitting: add more data / regularization
⑨ Model selection
⑩ Cross-validation, k-fold cross-validation
⑪ ML pipeline
⑫ Nearest neighbor classifier
