Chapter 01
Machine Learning - Fall 2021
Sarath Chandar

Introduction to Machine Learning
Machine Learning:
Study of algorithms that improve their performance P at some task T with experience E.
Examples:
1) T: spam classification
   P: accuracy
   E: set of example mails labelled as spam or not spam.
2) T: fraud detection
   P: accuracy
   E: set of historical transactions marked as legit or fraud.
3) T: playing the game of chess
   E: games played against players.
Prediction:
   x (observation / input) → Agent → y (prediction / output)
Examples:
① Given the size of the house, predict the selling price.
   x: size of the house
   y: selling price
② Given the stock price at time t, predict the price at the next time step.
   x: stock price at time t
   y: stock price at time t+1
③ Given an image, predict the object in the image.
   x: image as pixels
   y: object name from a predefined set, e.g. {dog, cat, bird, flight, ball}
④ Given an image of a handwritten digit, predict the digit.
   x: image of the handwritten digit (pixels)
   y: 7 [one of {0, 1, ..., 9}]
⑤ Given the temperature, humidity, and wind, predict whether it will rain.
   x: temperature, humidity, wind
   y: yes / no
In examples ① and ②, y ∈ ℝ.
In examples ③, ④, ⑤, y is categorical:
   y ∈ {dog, cat, bird, flight, ball} in ③
   y ∈ {0, 1, ..., 9} in ④
   y ∈ {yes, no} in ⑤

y ∈ ℝ → regression problem.
y ∈ {c1, ..., cn} (categorical) → classification problem.

x ∈ ℝ^p → p features.
y can be a scalar or a vector .
Note: y → target
      ŷ → model's prediction
Supervised Learning:
Given a training set {x^(i), y^(i)}_{i=1}^N of N data points, learn a prediction function f: x → y. Given a new x, predict the corresponding y.

Note: The prediction function f is useful only if it has the capability of generalizing to unseen instances. Generalization is a key requirement for any ML algorithm.
Simple example: House price prediction in Portland, Oregon

One good way to estimate the market price would be to first collect information on recently sold houses and build a model of housing prices. This is an example of regression.
Let us consider the following dataset:

   x: size of the house        y: price of the house
      (in square feet)
   2104                        399,900
   1600                        329,900
   2400                        369,000
   ...                         ...
① We will use the training data to fit the model.
② The model has to predict ŷ for a new x it sees after training.

Let us divide the dataset into training and test data.

Visualization of the training data: [scatter plot of price vs. size]
Now consider this simple model:

   \hat{y} = w_0 + w_1 x

What kind of curves can this model fit?
This model can only fit lines.

[Three example lines:]
   w0 = 1.5, w1 = 0      w0 = 0, w1 = 0.5      w0 = 1, w1 = 0.5

With w0 = 0, the lines pass through the origin!
w0 = bias in ML literature.

   \hat{y} = w_0 + w_1 x
   w = (w_0, w_1)  ←  parameters of the model
   \hat{y}(x; w) = w_0 + w_1 x

We want to learn the parameters w0, w1.
Objective: we want ŷ(x^(i); w) to be as close as possible to y^(i).

This can be done by minimizing the following error function:

   E(w) = \frac{1}{2} \sum_{i=1}^{N} \left( \hat{y}(x^{(i)}; w) - y^{(i)} \right)^2

where N = number of training instances, {x^(i), y^(i)} = i-th training example, and \hat{y}(x^{(i)}; w) - y^{(i)} = error on the i-th example (y^(i) is the target).

For the simple model above:

   E(w) = \frac{1}{2} \sum_{i=1}^{N} \left( w_0 + w_1 x^{(i)} - y^{(i)} \right)^2

Objective: minimize E(w).
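As a concrete illustration, here is a minimal sketch (my own, not from the notes) of evaluating this error function for the simple model on the three housing rows from the table above:

```python
import numpy as np

# Three rows of the housing data from the table above.
x = np.array([2104.0, 1600.0, 2400.0])             # size in square feet
y = np.array([399_900.0, 329_900.0, 369_000.0])    # selling price

def error(w0, w1, x, y):
    """Sum-of-squares error E(w) = 1/2 * sum_i (w0 + w1*x_i - y_i)^2."""
    y_hat = w0 + w1 * x
    return 0.5 * np.sum((y_hat - y) ** 2)

# Two arbitrary parameter settings give different errors; learning means finding the minimizer.
print(error(w0=0.0, w1=150.0, x=x, y=y))
print(error(w0=50_000.0, w1=150.0, x=x, y=y))
```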
How to find the minimum?
→ This error fn. is a quadratic fn. of the parameters, so there is only one minimum.
→ The minimum will be where the derivative of the error fn. w.r.t. the parameters w is zero, i.e. dE/dw = 0 (one equation per element of w).

Linear model: the model is linear in terms of the parameters.
Linear models with quadratic error functions have a unique closed-form solution!
Switching to matrix notation:

   \hat{y} = w_0 + w_1 x
   \hat{y} = w_0 \cdot 1 + w_1 \cdot x

Let w = (w_0, w_1), x = (1, x). Then \hat{y} = w^T x.

Note: We will often prepend a 1 to x to avoid treating the bias separately.

   X = data matrix whose i-th row is x^(i) = (x_1^(i), ..., x_P^(i)),   Y = target vector.

Rows in X = examples. Columns in X = features.
X = N×P matrix.
Y = N×1 vector.
w = P×1 vector.

   \hat{Y} = X w      ← a single matrix-vector multiplication predicts ŷ for all N examples.
   (N×P)(P×1) = N×1
Solution to least squares:

   E(w) = \frac{1}{2} (Y - Xw)^T (Y - Xw)

At the minimum, \frac{\partial E(w)}{\partial w} = 0:

   X^T (Y - Xw) = 0
   X^T Y - X^T X w = 0
   X^T X w = X^T Y
   w^* = (X^T X)^{-1} X^T Y

((X^T X)^{-1} X^T is the Moore-Penrose pseudo-inverse of the matrix X.)

Now, given a new x, we can find the target y as follows:

   \hat{y} = w^{*T} x
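A minimal NumPy sketch of this closed-form fit, assuming the notes' setup (using np.linalg.pinv instead of an explicit inverse is my choice for numerical stability; the 3000 sq. ft. query house is made up for illustration):

```python
import numpy as np

def add_bias(x):
    """Prepend a column of ones so the bias w0 is handled like any other weight."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    return np.hstack([np.ones((len(x), 1)), x])

def fit_least_squares(X, Y):
    """w* = (X^T X)^{-1} X^T Y, computed via the Moore-Penrose pseudo-inverse of X."""
    return np.linalg.pinv(X) @ Y

# Housing rows from the table above.
X = add_bias([2104.0, 1600.0, 2400.0])            # N x P data matrix (P = 2: bias + size)
Y = np.array([399_900.0, 329_900.0, 369_000.0])   # N x 1 target vector

w_star = fit_least_squares(X, Y)
y_hat = add_bias([3000.0]) @ w_star               # prediction for a new 3000 sq. ft. house
print(w_star, y_hat)
```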
Back to the example: how good is the model?

Least squares error (LSE):

   E(w) = \frac{1}{2} \sum_{i=1}^{N} \left( y^{(i)} - w^T x^{(i)} \right)^2

→ Not on the same scale as y, and the value depends on N.

Better metric for performance:

   RMSE = \sqrt{\frac{2 E(w^*)}{N}}

Note: RMSE is on the same scale as y.

Test RMSE = 57976.80.
Summary: Linear regression
- Model: \hat{y} = w^T x (linear model)
- Parameters: w
- Solution: w^* = (X^T X)^{-1} X^T Y
- Metric: RMSE = \sqrt{2 E(w^*) / N}
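A small sketch of the RMSE metric as defined above (equivalent to the root of the mean squared error):

```python
import numpy as np

def rmse(y_hat, y):
    """Root-mean-square error: sqrt(2 E(w)/N), i.e. on the same scale as y."""
    return np.sqrt(np.mean((np.asarray(y_hat) - np.asarray(y)) ** 2))

# Usage on held-out data, e.g.:  rmse(X_test @ w_star, Y_test)
```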
Synthetic example:

Consider the following synthetic function:

   y = \sin(2\pi x) + \epsilon,   \epsilon \sim N(0, 0.3)
                                  (mean 0, std deviation 0.3)

The training data consists of 10 data points sampled from this fn.

Note 1: In real life problems, we will not know the true function.
Note 2: Adding Gaussian noise to the sin fn. is reasonable because, even in real life, the observed y will have inherent stochasticity from the process that generates y.
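A sketch of generating such a dataset (the input range [0, 1] is an assumption on my part; the notes only specify the function and the noise):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Sample n points from y = sin(2*pi*x) + eps, with eps ~ N(0, 0.3) (0.3 = std dev)."""
    x = rng.uniform(0.0, 1.0, size=n)      # assumed input range
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=n)
    return x, y

x_train, y_train = make_data(10)           # the 10 training points
```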
Linear regression solution: [plot of the fitted line over the training points]
The learnt fn. (a line) does not fit the data very well.
Let us try a polynomial model instead:

   \hat{y}(x; w) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M

Notation: x^2    : x raised to the power of 2
          x^(2)  : x of the second example
          x_2    : 2nd feature of x

Using the fact that x^0 = 1, we can write

   \hat{y}(x; w) = w_0 x^0 + w_1 x^1 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j

Let w = (w_0, ..., w_M) and x = (x^0, x^1, ..., x^M). Then

   \hat{y}(x; w) = w^T x.
Note: \hat{y}(x; w) is a non-linear fn. of x, but the error E(w) is still a quadratic fn. of w. Hence, the same closed-form solution can be used. (Here x is a scalar; we can still use the matrix formulation by treating the powers of x as the features.)
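A sketch of this polynomial fit: stack the powers x^0, ..., x^M as columns of the data matrix and reuse the least-squares solution (the synthetic data is regenerated here so the snippet stands on its own):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=10)                            # 10 synthetic inputs
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, 10)    # noisy targets

def poly_features(x, M):
    """N x (M+1) design matrix with columns x^0, x^1, ..., x^M."""
    return np.vander(np.asarray(x, dtype=float), M + 1, increasing=True)

def fit_poly(x, y, M):
    """Least-squares fit of an M-degree polynomial (still linear in w)."""
    return np.linalg.pinv(poly_features(x, M)) @ y

w3 = fit_poly(x_train, y_train, M=3)   # smooth fit
w9 = fit_poly(x_train, y_train, M=9)   # interpolates all 10 points: zero training error
```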
Solutions for M = 3, 4 look good.

What happens as we keep increasing M?
The solution for M = 9 achieves zero training error!

Is this a good solution?
NO. The fitted curve oscillates wildly and gives a poor representation of sin(2πx).
↳ This is known as overfitting.
In this example, we know the true function. So we can tell that M = 9 is not a good approximation of it. How do we tell if a model is overfitting when we don't know the true function?

→ If a model is overfitting, its generalization performance should be bad.

[Plot of training and test error vs. M: the training error goes to zero while the test error stays high; in between there is a good region.]
Wait! A 9-degree polynomial contains the 3-degree polynomial as a special case. Then why does M = 9 perform badly?

As M increases, the magnitude of the coefficients gets larger. For M = 9, 10 weights are heavily tuned to the given 10 data points! A 9-degree polynomial with small higher-order coefficients would behave like the 3-degree one, but we are not able to learn that solution.
How to fix overfitting?

Solution 1: Add more data.
[Plots of the M = 9 fit with 15 examples and with 100 examples.]
With 100 data points, the M = 9 model approximates the true function very well!

What if you cannot obtain more data?
Solution 2: Add a penalty term to the error function in order to discourage the coefficients from reaching large values.

   E(w) = \frac{1}{2} \sum_{i=1}^{N} \left( \hat{y}(x^{(i)}; w) - y^{(i)} \right)^2 + \frac{\lambda}{2} \|w\|^2

where \|w\|^2 = w^T w = w_0^2 + w_1^2 + \dots + w_M^2.

This is known as regularization; (\lambda/2) \|w\|^2 is the regularization term.
λ controls the relative importance of the regularization term.
This error fn. can be minimized exactly in closed form.
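The notes do not write out that closed form; for reference it is w^* = (X^T X + \lambda I)^{-1} X^T Y, sketched below (regularizing w0 as well, matching the \|w\|^2 above):

```python
import numpy as np

def fit_ridge(X, Y, lam):
    """Regularized least squares: w* = (X^T X + lam * I)^{-1} X^T Y."""
    P = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ Y)

# Usage with the polynomial design matrix from the previous sketch:
#   w = fit_ridge(poly_features(x_train, 9), y_train, lam=1e-3)
```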
[Plot of error vs. ln λ, ranging from no regularization (small λ) to too much regularization (large λ).]

When λ is small, there is no regularization. When λ is too high, there is too much regularization. It is very crucial to choose the right λ.
λ = hyper-parameter of the model.

Normally we will fix the hyper-parameter using a validation set:
[Figure: the data is split into training / validation / test sets.]

The test set is supposed to be a proxy for completely new x; it measures the real performance of the model. Given a training set, we split the data into train and validation sets.
✓ Compute the performance on the validation set for each candidate λ.
[Plot of validation performance vs. λ → pick λ from the good region.]
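A sketch of this selection procedure on the synthetic problem, with the earlier helpers repeated inline (the λ grid and the 20/10 split are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=30)
x_tr, y_tr, x_va, y_va = x[:20], y[:20], x[20:], y[20:]   # train / validation split

def design(x, M=9):
    """Polynomial design matrix with columns x^0, ..., x^M."""
    return np.vander(np.asarray(x, dtype=float), M + 1, increasing=True)

def fit_ridge(X, Y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

def rmse(y_hat, y):
    return np.sqrt(np.mean((y_hat - y) ** 2))

# Fit on the training split for each candidate lambda; keep the best validation RMSE.
scores = [(rmse(design(x_va) @ fit_ridge(design(x_tr), y_tr, lam), y_va), lam)
          for lam in [1e-8, 1e-6, 1e-4, 1e-2, 1.0]]
best_rmse, best_lam = min(scores)
print(best_rmse, best_lam)
```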
What if it was a bad validation split? What if the split is not representative?

We can do k-fold cross-validation:
[Figure: the data divided into folds; in each round one fold is valid and the rest are train.]

- Divide the data into K folds (disjoint).
- Use K-1 folds for training and the last fold for testing.
- Repeat the previous step to test with all K folds.
- Average the performance over all K folds.
→ Very costly!
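A sketch of k-fold cross-validation written against the generic fit/predict/metric pattern of the earlier sketches (the function names are mine):

```python
import numpy as np

def k_fold_score(X, y, k, fit, predict, metric):
    """Average metric over k disjoint folds: train on k-1 folds, evaluate on the held-out one."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        valid = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        scores.append(metric(predict(model, X[valid]), y[valid]))
    return float(np.mean(scores))

# Usage, e.g. with ridge regression and RMSE from the earlier sketches:
#   k_fold_score(X, y, k=5,
#                fit=lambda X, y: fit_ridge(X, y, lam=1e-3),
#                predict=lambda w, X: X @ w,
#                metric=rmse)
```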
Most extreme: leave-one-out cross-validation, i.e. do N-fold cross-validation with a single validation point per fold.

In this example, we knew the true function, so we fixed the degree of the polynomial and controlled overfitting by regularization. What if the true function is a polynomial of some other degree and we don't know that? What value of M should we use?
M is also a hyper-parameter!
Machine Learning Pipeline:
① Define the input and output (x, y).
② Do data preprocessing.
③ Define your model. The model will consist of parameters and hyper-parameters.
④ Define the error fn. you want to minimize.
⑤ Learn the model params by minimizing the error fn.
⑥ Compute validation performance. Tune the hyper-parameters based on validation perf.
Consider the following binary classification problem. We are interested in classifying the data points as "blue" or "orange" class. There are 2 features: x1 & x2.

class 1: blue
class 2: orange

We will always convert the targets to numbers:
   Blue = 0, Orange = 1    (or)    Blue = -1, Orange = +1
Solution 1: Linear model

   \hat{y} = w^T x

With 0/1 targets:  predict 1 if \hat{y} > 0.5, else 0.   → \hat{y} = 0.5 is the decision boundary.
With ±1 targets:   predict 1 if \hat{y} ≥ 0, else -1.    → \hat{y} = 0 is the decision boundary.

[Figure: orange region vs. blue region.
 Linear model without bias term: decision boundary w^T x = 0.
 Linear model with bias term.]
Note: some blue points are wrongly classified as orange & vice-versa. A line is clearly not a good decision boundary.
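A sketch of this "regress then threshold" classifier on made-up 2D data (the two Gaussian clusters stand in for the blue/orange scatter in the notes' figure; they are not the actual dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative 2D clusters standing in for the "blue" (0) and "orange" (1) classes.
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

Xb = np.hstack([np.ones((len(X), 1)), X])      # prepend the bias column
w = np.linalg.pinv(Xb) @ y                     # least-squares fit to the 0/1 targets

y_hat = (Xb @ w > 0.5).astype(int)             # decision boundary at w^T x = 0.5
print("training accuracy:", np.mean(y_hat == y))
```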
Solution 2: Nearest-neighbor methods

Use those observations in the training set closest in input space to x to form \hat{y}:

   \hat{y}(x) = \frac{1}{k} \sum_{x^{(i)} \in N_k(x)} y^{(i)}

N_k(x) = neighborhood of x, i.e. the k closest training points.
Closest with respect to what metric? Euclidean distance.

If more neighbors are class 1 than class 0, then predict class 1, else class 0. This is like majority voting from the neighbors. It assumes that the class distribution is locally smooth.
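A sketch of this rule: average the labels of the k nearest training points under Euclidean distance and threshold at 0.5 (the function name and the query point are mine):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """y_hat(x) = (1/k) * sum of y over the k nearest training points, thresholded at 0.5."""
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points: N_k(x)
    y_bar = np.mean(y_train[nearest])                # fraction of neighbors in class 1
    return 1 if y_bar > 0.5 else 0                   # majority vote

# Usage on the toy blue/orange data from the previous sketch:
#   knn_predict(X, y, np.array([1.0, 1.0]), k=10)
```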
k = 10: the decision boundary is more irregular and responds to local clusters where one class dominates.
k = 1: the decision boundary is even more irregular than before.
k = hyper-parameter of the nearest neighbor algorithm.
↳ a non-linear model.

Are there any other hyper-parameters for the k-NN algorithm?
→ The metric used to compute the neighbors!
Least squares vs. nearest neighbors:

   Least squares                               Nearest neighbors
   - Decision boundary is very smooth.         - Decision boundary depends on a handful of
                                                 input points & their positions.
   - More stable.                              - Less stable.
   - Assumes that the decision boundary        - Assumes that the class distribution
     is linear.                                  is locally smooth.
     ↳ strong assumption
   - High bias, low variance.                  - Low bias, high variance.
You should know!
① Prediction problem
② Regression / classification
③ Supervised learning / generalization
④ Linear regression
⑤ Overfitting
⑥ Solutions to overfitting: add more data / regularization
⑦ Model selection
⑧ Cross-validation, k-fold cross-validation
⑨ ML pipeline
⑩ Nearest neighbor classifier