16 - The Key To The Most Powerful ML Models

𝑦

𝑦 = sign ( 𝐰 𝐱 + 𝑏) 𝑦 =𝐰 𝐱 +𝑏
⊤ ⊤

Linear Models
It does not seem any linear model can You are right; to solve such
ever do well on these tasks, no matter learning problems, an entirely
how carefully we learn that model new class of models is needed
𝑦

?
?
𝑥

𝑦 = sign ( 𝐰 𝐱 + 𝑏) 𝑦 =𝐰 𝐱 +𝑏
⊤ ⊤

Non-linear Models
Stepping out of line seems to be the key to ML's success!
Linear Functions
A function 𝑓: ℝᵈ → ℝ is called linear if it satisfies two properties
(Additivity) For any two vectors 𝐱, 𝐲 ∈ ℝᵈ, 𝑓(𝐱 + 𝐲) = 𝑓(𝐱) + 𝑓(𝐲)
(Homogeneity) For any vector 𝐱 ∈ ℝᵈ and scalar 𝑐 ∈ ℝ, 𝑓(𝑐 ⋅ 𝐱) = 𝑐 ⋅ 𝑓(𝐱)
Claim: Every linear function is of the form 𝑓(𝐱) = 𝐰⊤𝐱 for some fixed vector 𝐰 ∈ ℝᵈ

Proof: Consider the standard basis vectors 𝐞¹, …, 𝐞ᵈ (where 𝐞ⁱ has 1 in its 𝑖-th coordinate and 0 elsewhere) and consider 𝐰 where 𝑤ᵢ = 𝑓(𝐞ⁱ). Using the two properties of linear functions, for any 𝐱 = Σᵢ 𝑥ᵢ𝐞ⁱ we get 𝑓(𝐱) = Σᵢ 𝑥ᵢ 𝑓(𝐞ⁱ) = 𝐰⊤𝐱
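The proof is constructive, so it can be checked numerically. Below is a minimal sketch (the example function 𝑓 is my own, not from the slides): probe any linear function on the standard basis vectors to recover 𝐰, then verify that 𝑓(𝐱) = 𝐰⊤𝐱.

```python
# A minimal sketch: any linear f: R^d -> R is recovered as
# f(x) = w^T x, where w_i = f(e_i) for the standard basis e_1..e_d.
import numpy as np

def f(x):
    # An example linear function; any function satisfying additivity
    # and homogeneity would work here.
    return 3.0 * x[0] - 2.0 * x[1] + 0.5 * x[2]

d = 3
basis = np.eye(d)                      # rows are the standard basis e_1..e_d
w = np.array([f(e) for e in basis])    # w_i = f(e_i)

x = np.array([1.0, -4.0, 2.5])
assert np.isclose(f(x), w @ x)         # f(x) == w^T x, as the claim states
```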
Exercise
 Show that every linear function satisfies 𝑓(𝟎) = 0
 Consider polynomials of the form 𝑓(𝑥) = 𝑐 ⋅ 𝑥ᵏ where 𝑐 ≠ 0. For what values of 𝑘 can 𝑓 be a linear function?
 Show that every function that is linear is convex too!
 Is the converse true? Is every convex function linear too?
 Recall the norms of the form ‖𝐱‖ₚ = (Σᵢ |𝑥ᵢ|ᵖ)^(1/𝑝). Are norms linear for any value of 𝑝? If not, which property of linear functions (additivity or homogeneity) do they violate?
Affine Functions
A function 𝑓: ℝᵈ → ℝ is called affine if we can write it as 𝑓(𝐱) = 𝑔(𝐱) + 𝑏 for some linear function 𝑔 and offset 𝑏 ∈ ℝ, i.e., 𝑓(𝐱) = 𝐰⊤𝐱 + 𝑏

What is the difference between affine and linear functions? They both seem to be quite similar!
They are indeed closely related. The biggest difference is that linear functions must satisfy 𝑓(𝟎) = 0, affine functions need not.
Yup! If we use a linear function as a binary classifier, its decision boundary will always pass through the origin.
When ML people use the term "linear model" they are usually talking about an affine function 
Exactly. When someone says they are using a non-linear model, they usually mean a model that is both non-linear and non-affine.
Linear and Affine Maps
Linear maps: linear functions of the form 𝑓: ℝᵈ → ℝᵏ, i.e., 𝑓(𝐱 + 𝐲) = 𝑓(𝐱) + 𝑓(𝐲) and 𝑓(𝑐 ⋅ 𝐱) = 𝑐 ⋅ 𝑓(𝐱) for all 𝐱, 𝐲 ∈ ℝᵈ, 𝑐 ∈ ℝ
Affine maps: affine functions of the form 𝑓: ℝᵈ → ℝᵏ
Claim: Linear maps must be of the form 𝑓(𝐱) = 𝐴𝐱 for some fixed matrix 𝐴 ∈ ℝᵏ×ᵈ
Proof: Left as an exercise 
Hint: Look at how 𝑓 acts on the standard basis vectors of the input space, i.e., 𝐞¹, …, 𝐞ᵈ, and use the additivity and homogeneity properties as before
Corollary: Affine maps must be of the form 𝑓(𝐱) = 𝐴𝐱 + 𝐛 for some fixed matrix 𝐴 ∈ ℝᵏ×ᵈ and fixed vector 𝐛 ∈ ℝᵏ
The terms function and map are used interchangeably. The term map is often used to emphasize the fact that a function's outputs are vectors and not scalars.
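The corollary can also be checked numerically with the same basis-vector trick from the hint. A small illustration (the matrix 𝐴 and vector 𝐛 below are made-up examples): for an affine map 𝑔(𝐱) = 𝐴𝐱 + 𝐛, we get 𝐛 = 𝑔(𝟎) and the columns of 𝐴 as 𝑔(𝐞ⁱ) − 𝐛.

```python
# Recover the (A, b) of an affine map from its action on 0 and the basis.
import numpy as np

A = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]])   # fixed 3x2 matrix
b = np.array([1.0, 1.0, -2.0])                        # fixed offset vector

def g(x):
    return A @ x + b                                  # an affine map R^2 -> R^3

b_rec = g(np.zeros(2))                                # b = g(0)
A_rec = np.stack([g(e) - b_rec for e in np.eye(2)], axis=1)  # columns A e_i

assert np.allclose(A_rec, A) and np.allclose(b_rec, b)
```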
Exercise
 Are constant functions affine? Are they always linear?
 Show that a function/map 𝑓 is affine iff (if-and-only-if) there exists a linear map 𝑔 such that 𝑓(𝐱) − 𝑓(𝟎) = 𝑔(𝐱) for all 𝐱
 Affine functions preserve convexity. Specifically, if 𝑓 is an affine function and 𝑆 is a convex set, then show that the set 𝑓(𝑆) = {𝑓(𝐱) : 𝐱 ∈ 𝑆} is convex too!
 Is the converse true? If a function preserves convexity (it maps all convex sets to convex sets), must it be affine?
Opening the door to non-linearity
Two ways to reduce the problem to linear models:
Use a combination of multiple linear models instead of a single linear model. Examples: decision trees, learning with prototypes, nearest neighbors
Modify the features and learn a single linear model over the new features. Examples: neural networks, kernel methods
State-of-the-art, commercial, web-scale models often combine the two techniques, e.g., use nearest neighbors on top of new features learnt using neural networks.
Decision Trees
[Figure: a 2D dataset (axes 𝑥, 𝑦, ticks −2…2) partitioned by a decision tree: the root tests 𝑥 > 0?; its two children test 𝑦 > 0? and 𝑦 < 0? respectively, each branch ending in a leaf]
Could we not have gotten the same result by creating two regions by splitting vertically?
Indeed, you could have. Several possible decision trees could exist for the same task.
Decision Trees
We will study algorithms to learn DTs later. However, note that DT learning is an intractable (NP-hard) problem.
Usually we learn layer-by-layer: first the root model, then the models for its children, and so on.
[Figure: the same 2D dataset, now classified by a tree of linear models: the root applies sign(𝐮⊤𝐱); its children apply sign(𝐯⊤𝐱) and sign(𝐰⊤𝐱); the leaves output +𝟏 or −𝟏]
How does one learn all these multiple linear models? Do we learn them together or one after another? How do we decide how many layers to have?
The number of layers needs to be tuned as a hyperparameter.
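To make the structure concrete, here is a minimal sketch of such a tree of linear models. The parameter vectors and the YES/NO routing below are my own illustrative assumptions, not values taken from the slides.

```python
# A tiny decision tree whose internal nodes are linear models sign(u^T x)
# and whose leaves output a fixed label.
import numpy as np

u = np.array([1.0, 0.0])    # root:        tests x > 0?
v = np.array([0.0, 1.0])    # left child:  tests y > 0?
w = np.array([0.0, -1.0])   # right child: tests y < 0?

def predict(x):
    if u @ x > 0:                        # root split
        return +1 if w @ x > 0 else -1   # right subtree's leaves
    else:
        return +1 if v @ x > 0 else -1   # left subtree's leaves

print(predict(np.array([1.0, -0.5])))    # x > 0 and y < 0, so outputs +1
```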
Decision trees for regression
DTs solving regression problems are often called regression trees.
[Figure: a 1-D regression task (axes 𝑥, 𝑦): the root applies the classifier sign(𝑢𝑥 + 𝑎) with 𝑢 = 1, 𝑎 = −1 (i.e., "𝑥 > 1?"); each leaf contains its own linear regression model, 𝑣𝑥 + 𝑏 on one side and 𝑤𝑥 + 𝑐 on the other]
Notice that since this decision tree is solving a regression problem, the leaves each contain a regression model.
Notice that this regression tree cleverly used linear models for classification as well as regression to solve a non-linear regression problem!
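A sketch of this idea in code (the training data and leaf slopes below are made up for illustration): route points with the root's sign test, then fit a separate 1-D least-squares line in each leaf.

```python
# A regression tree with one linear classifier at the root and one
# linear regression model per leaf.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 5, 200)
y = np.where(x > 1, 1.5 * x - 1.0, -x + 0.5) + 0.1 * rng.standard_normal(200)

def fit_line(xs, ys):
    # least-squares fit of y ~ slope * x + intercept on one leaf's data
    return np.polyfit(xs, ys, deg=1)

left = fit_line(x[x <= 1], y[x <= 1])    # leaf model v*x + b
right = fit_line(x[x > 1], y[x > 1])     # leaf model w*x + c

def predict(x0):
    coeffs = right if x0 > 1 else left   # root: sign(u*x + a) with u=1, a=-1
    return np.polyval(coeffs, x0)
```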
Neural Networks
When an NN fails, it could either be that its parameter values (in this case 𝐴, 𝐰, 𝑏) were not good, or there may be a deeper (pun intended) problem with the architecture of the NN itself.
sign(𝐰⊤𝜙(𝐱) + 𝑏) = sign(𝐰⊤max(𝐴𝐱, 0) + 𝑏)
[Figure: the same 2D task, with the NN's (failing) decision regions for 𝐴 = [[1, 0], [0, 1]]; network diagram: 𝐱 → 𝐴 → ReLU 𝜙 → (𝐰, 𝑏) → sign]
ReLU activation: 𝜙(𝐱) = max(𝐴𝐱, 0)
"Didn't work, did it?"
-- Maggie Smith, 2002 (as recounted by Ian McKellen)
Neural Networks
The network here can discover two new features, each of which looks like max(𝑥 − 𝑦, 0) or max(𝑦 − 𝑥, 0). The parameter values are learnt using (S)GD.
The design of the neural network (number of layers, number of nodes in each layer, activation functions) decides what sort of features the NN can discover.
These features were able to solve this task but may not work for some other task. Learning the optimal neural network is also an NP-hard problem 
sign(𝐰⊤max(𝐴𝐱, 0) + 𝑏)
[Figure: the same task, now solved with 𝐴 = [[+1, −1], [−1, +1]]; network diagram: 𝐱 → 𝐴 → ReLU 𝜙 → (𝐰, 𝑏) → sign]
ReLU activation: 𝜙(𝐱) = max(𝐴𝐱, 0)
I have so many questions. How did we choose these features that did well? How are they learnt?
Neural Networks
Notice that the NN learnt a very different decision boundary than the DT.
Since max(𝑡, 0) + max(−𝑡, 0) = |𝑡|, we get
sign(𝐰⊤max(𝐴𝐱, 0) + 𝑏) = sign(max(𝑥 − 𝑦, 0) + max(𝑦 − 𝑥, 0) − 1) = sign(|𝑥 − 𝑦| − 1)
[Figure: the resulting decision regions, with 𝐴 = [[+1, −1], [−1, +1]], 𝐰 = (+1, +1), 𝑏 = −1; network diagram: 𝐱 → 𝐴 → ReLU 𝜙 → (𝐰, 𝑏) → sign]
ReLU activation: 𝜙(𝐱) = max(𝐴𝐱, 0)
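A quick numeric check of this identity, assuming the parameters as reconstructed above (𝐴 = [[1, −1], [−1, 1]], 𝐰 = (1, 1), 𝑏 = −1): the two ReLU features max(𝑥 − 𝑦, 0) and max(𝑦 − 𝑥, 0) sum to |𝑥 − 𝑦|, so the network computes sign(|𝑥 − 𝑦| − 1).

```python
# The slide's two-layer ReLU network as a function, checked against
# the closed form sign(|x - y| - 1).
import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])
w = np.array([1.0, 1.0])
b = -1.0

def nn(x):
    phi = np.maximum(A @ x, 0.0)        # ReLU features max(Ax, 0)
    return np.sign(w @ phi + b)

for pt in [np.array([2.0, 0.0]), np.array([0.3, 0.0]), np.array([-1.0, 1.5])]:
    assert nn(pt) == np.sign(abs(pt[0] - pt[1]) - 1.0)
```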
Exercises
 Note that the classifier shown in the figure will also solve this problem. Can you find model parameter values for the previous neural network that will yield this classifier?
[Figure: the same 2D task (axes 𝑥, 𝑦, ticks −2…2) with an alternative decision boundary]
Neural Networks
Note that this neural network creates two new features even though the data point had only one feature to begin with.
𝐰⊤𝜙(𝐱) + 𝑏 (no final activation)
[Figure: a 1-D regression task (axes 𝑥, 𝑦); network diagram: 𝑥 → (𝐚, 𝐜) → ReLU 𝜙 → (𝐰, 𝑏) → output]
ReLU activation: 𝜙(𝑥) = max(𝑥 ⋅ 𝐚 + 𝐜, 0), with 𝐚 = (1, −1), 𝐜 = (−1, 1)
Neural Networks
The new axis values are calculated using the original axis values, not the new ones.
[Figure: the same 1-D data re-plotted in the new feature space, whose two new axes are max(𝑥 − 1, 0) and max(1 − 𝑥, 0); network diagram: 𝑥 → (𝐚, 𝐜) → ReLU 𝜙 → (𝐰, 𝑏) → output, no final activation]
𝜙(𝑥) = max(𝑥 ⋅ 𝐚 + 𝐜, 0) = (max(𝑥 − 1, 0), max(1 − 𝑥, 0))
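A sketch of this 1-D feature map in code, with a linear model fitted on top by least squares (the V-shaped target data below is made up for illustration; the slides do not specify a dataset):

```python
# phi(x) = (max(x-1, 0), max(1-x, 0)), then fit w^T phi(x) + b.
import numpy as np

def phi(x):
    return np.stack([np.maximum(x - 1, 0), np.maximum(1 - x, 0)], axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 5, 100)
y = np.abs(x - 1) + 0.05 * rng.standard_normal(100)   # a V-shaped target

features = np.hstack([phi(x), np.ones((100, 1))])     # append a bias column
w0, w1, b = np.linalg.lstsq(features, y, rcond=None)[0]
# w0, w1 come out close to 1, 1 since |x-1| = max(x-1,0) + max(1-x,0)
```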
Non-linearity in Neural Networks
The simplest features learnt by NNs look like 𝜙(𝐱) = 𝑓(𝐴𝐱) where 𝑓 is an activation function. Examples: ReLU, GeLU, sigmoid, tanh
Activation functions are often applied coordinate-wise.
NNs often stack such feature learners. Such deep stacking of layers allows NNs to learn very powerful features on top of which a linear model can do well.
Learning techniques such as (S)GD are used to learn the parameters of the NN, i.e., 𝐴, 𝐰 and 𝑏.
Crucial: 𝑓 must be a non-linear function.
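A quick numeric demonstration of why the activation must be non-linear (shapes and values below are made up): with a linear "activation" 𝑓(𝑡) = 𝑐 ⋅ 𝑡, two stacked layers collapse into a single affine map, so nothing beyond a linear model is learnt.

```python
# Stacked layers with a linear activation are equivalent to one affine map.
import numpy as np

rng = np.random.default_rng(0)
A, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
w, b2 = rng.standard_normal(3), rng.standard_normal()
c = 2.0                                    # linear "activation" f(t) = c*t

x = rng.standard_normal(2)
two_layer = w @ (c * (A @ x + b1)) + b2    # the stacked "network"

v = c * (w @ A)                            # equivalent single affine map
a = c * (w @ b1) + b2
assert np.isclose(two_layer, v @ x + a)    # collapses to v^T x + a
```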
Exercises
Recall that ML folks often use the terms linear and affine interchangeably.
 Consider a neural network that uses a linear activation function 𝑓(𝑡) = 𝑐 ⋅ 𝑡 for some 𝑐 ∈ ℝ. Show that this NN can only learn linear classifiers, i.e., there exist 𝐯, 𝑎 (the values of 𝐯, 𝑎 depend only on the values of 𝐴, 𝐰, 𝑏, 𝑐) s.t. the NN computes sign(𝐯⊤𝐱 + 𝑎)
 Consider a NN that uses the quadratic activation function 𝑓(𝑡) = 𝑡². What types of functions can this NN learn?

Exercises
 Create a feature map 𝜙: ℝ² → ℝᵈ for some 𝑑 so that for any (𝑥, 𝑦), 𝐰⊤𝜙(𝑥, 𝑦) takes value +1 if (𝑥, 𝑦) is in the blue region and −1 if it is in the yellow region, where 𝐰 is the 𝑑-dimensional all-ones vector. The dashed lines in the figure are 𝑥 = −𝑦 and 𝑥 = 𝑦.
Exercises
 For a circle with centre (𝑝, 𝑞) and radius 𝑟, let's build a classifier that gives label +1 if a point is inside the circle, i.e., in the yellow region, and −1 otherwise. Give a feature map 𝜙: ℝ² → ℝᵈ for some 𝑑 and a corresponding linear classifier such that for any point, the classifier applied to its features is the correct output. The map must not depend on (𝑝, 𝑞, 𝑟) but the classifier may depend on (𝑝, 𝑞, 𝑟).
Exercises
 For a 2D rectangular hyperbola with equation 𝑥𝑦 = 𝑐² for some 𝑐 > 0 (the points (𝑐, 𝑐) and (−𝑐, −𝑐) are marked in the figure), give a feature map 𝜙: ℝ² → ℝᵈ for some 𝑑 and a corresponding linear classifier so that for any point, the classifier takes value +1 in the yellow region and −1 in the blue region. The map must not depend on 𝑐 but the classifier may depend on 𝑐.
Summary
Linear and affine functions have a simple, easy-to-understand structure that makes linear models (e.g., SVMs) easy to learn
Several real-life applications require non-linear models to be learnt
ML uses clever tricks to reduce the problem of learning non-linear models to the problem of learning linear models
Learning a combination of several linear models (decision/regression trees)
Learning modified features over which a linear model does well (NNs)
Learning non-linear models often turns out to be an NP-hard task
Learning an optimal DT or an optimal NN are both NP-hard problems
Nevertheless, several heuristics exist that allow us to learn reasonably accurate DTs and NNs for real-world problems in reasonable time
Stay Classy!
Catch up with you next time 
