16 - The Key To The Most Powerful ML Models
Linear Models
Linear classification: $y = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b)$; linear regression: $y = \mathbf{w}^\top \mathbf{x} + b$.
It does not seem that any linear model can ever do well on these tasks, no matter how carefully we learn that model.
You are right; to solve such learning problems, an entirely new class of models is needed.
[Figure: a 2D dataset (axes $x$ and $y$) on which no linear decision boundary does well]
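As a minimal sketch (not from the slides; the parameter values below are placeholders), the two linear models above can be written directly in NumPy:

```python
import numpy as np

def linear_regression_predict(w, b, x):
    """Linear regression: y = w^T x + b."""
    return w @ x + b

def linear_classification_predict(w, b, x):
    """Linear binary classification: y = sign(w^T x + b)."""
    return np.sign(w @ x + b)

# Placeholder parameter values, just to show the call pattern
w = np.array([1.0, -1.0])
b = 0.5
x = np.array([2.0, 0.5])
print(linear_regression_predict(w, b, x))      # 2.0
print(linear_classification_predict(w, b, x))  # 1.0
```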
Non-linear Models
Moving beyond the linear forms $y = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b)$ and $y = \mathbf{w}^\top \mathbf{x} + b$ to non-linear models, i.e., stepping out of line, seems to be the key to ML's success.
Linear Functions
A function $f: \mathbb{R}^d \to \mathbb{R}$ is called linear if it satisfies two properties:
(Additivity) For any two vectors $\mathbf{x}, \mathbf{z}$, we have $f(\mathbf{x} + \mathbf{z}) = f(\mathbf{x}) + f(\mathbf{z})$.
(Homogeneity) For any vector $\mathbf{x}$ and scalar $c$, we have $f(c\,\mathbf{x}) = c\,f(\mathbf{x})$.
Claim: Every linear function is of the form $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$ for some fixed vector $\mathbf{w}$ whose $i$-th coordinate is $w_i = f(\mathbf{e}_i)$.
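A short derivation of the claim (a standard argument, written out here for completeness, assuming $\mathbf{e}_1, \dots, \mathbf{e}_d$ are the standard basis vectors):

```latex
\begin{align*}
f(\mathbf{x})
  &= f\!\Big(\textstyle\sum_{i=1}^{d} x_i \,\mathbf{e}_i\Big) \\
  &= \textstyle\sum_{i=1}^{d} f(x_i \,\mathbf{e}_i)  && \text{(additivity, applied repeatedly)} \\
  &= \textstyle\sum_{i=1}^{d} x_i \, f(\mathbf{e}_i) && \text{(homogeneity)} \\
  &= \mathbf{w}^\top \mathbf{x}, \quad \text{where } w_i := f(\mathbf{e}_i).
\end{align*}
```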
[Figure: a decision tree with YES/NO branches and three leaf nodes, shown alongside the regions it carves out of the plot]
Could we not have gotten the same result by creating two regions by splitting vertically, say using a test on $x$ alone?
Decision Trees
We will study algorithms to learn DTs later. However, note that DT learning is an intractable (NP-hard) problem. Usually we learn layer-by-layer: first the root model, then the models for its children, and so on.
[Figure: a decision tree whose internal nodes are linear classifiers $\operatorname{sign}(\mathbf{u}^\top \mathbf{x})$, $\operatorname{sign}(\mathbf{v}^\top \mathbf{x})$ and $\operatorname{sign}(\mathbf{w}^\top \mathbf{x})$, and whose leaves predict $+1$ or $-1$, shown alongside the piecewise-linear regions it creates in the 2D plot]
How does one learn all these multiple linear models? Do we learn them together or one after another? How do we decide how many layers to have?
The number of layers needs to be tuned as a hyperparameter.
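A minimal sketch (the parameter values and tree structure are illustrative, not taken from the figure) of how such a tree routes a point through linear classifier node models:

```python
import numpy as np

def linear_sign(w, x):
    """Node model: a linear classifier sign(w^T x)."""
    return 1 if w @ x >= 0 else -1

def tree_predict(x, u, v, w):
    """Two-level decision tree whose internal nodes are linear classifiers.

    The root tests sign(u^T x); each child applies its own linear model
    (v or w), and the leaves output +1 / -1 labels.
    """
    if linear_sign(u, x) == 1:
        return +1 if linear_sign(v, x) == 1 else -1
    return +1 if linear_sign(w, x) == 1 else -1

# Illustrative (placeholder) node parameters
u = np.array([1.0, 0.0])    # root splits on the first coordinate
v = np.array([0.0, 1.0])    # one child splits on the second coordinate
w = np.array([0.0, -1.0])   # the other child splits the opposite way
print(tree_predict(np.array([0.5, -0.5]), u, v, w))  # -1
```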
Regression Trees
Decision trees for regression problems are often called regression trees.
[Figure: a regression tree whose root is the linear classifier $\operatorname{sign}(u x + a)$ with $u = 1$, $a = -1$ (i.e., the test $x > 1$), and whose two leaves contain linear regression models of the form $v x + b$ and $w x + c$; the resulting fit is a piecewise-linear curve]
Notice that since this decision tree is solving a regression problem, the leaves each contain a regression model.
Notice that this regression tree cleverly used linear models for classification as well as regression to solve a non-linear regression problem!
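A small sketch of this idea (the root test matches the figure's $u = 1$, $a = -1$; the leaf coefficients below are illustrative placeholders):

```python
def regression_tree_predict(x, u=1.0, a=-1.0, v=0.5, b=1.5, w=-1.0, c=0.5):
    """Regression tree: a linear classifier at the root, linear regressors at the leaves.

    Root test: sign(u*x + a), i.e. whether x > -a/u.
    Leaves:    y = v*x + b on the YES branch, y = w*x + c on the NO branch.
    """
    if u * x + a >= 0:
        return v * x + b
    return w * x + c

# The overall prediction is piecewise linear, hence non-linear in x
for x in [-2.0, 0.0, 2.0, 4.0]:
    print(x, regression_tree_predict(x))
```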
Neural Networks
When an NN fails, it could either be that its parameter values (in this case $A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$) were not good, or there may be a deeper (pun intended) problem with the architecture of the NN itself.
The network computes $\operatorname{sign}(\mathbf{w}^\top \max(A\mathbf{x}, 0) + b)$, i.e., $\operatorname{sign}(\mathbf{w}^\top \phi(\mathbf{x}) + b)$, where $\phi(\mathbf{x}) = \max(A\mathbf{x}, 0)$ is the ReLU activation.
[Figure: a one-hidden-layer network $\mathbf{x} \to A \to \text{ReLU } \phi \to \mathbf{w}, b \to \operatorname{sign}$, alongside the 2D dataset it fails to classify]
"Didn't work, did it?" -- Maggie Smith, 2002 (as recounted by Ian McKellen)
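A minimal sketch of this one-hidden-layer forward pass (the weights $\mathbf{w}$ and bias $b$ below are placeholders; the point of the slide is that a poor choice of $A$ leaves the model unable to solve the task):

```python
import numpy as np

def nn_predict(x, A, w, b):
    """One-hidden-layer network: sign(w^T max(Ax, 0) + b)."""
    phi = np.maximum(A @ x, 0.0)   # ReLU feature map phi(x) = max(Ax, 0)
    return np.sign(w @ phi + b)

# With A = identity the "features" are just ReLU'd copies of the inputs,
# so the model on top is still essentially a linear classifier.
A = np.eye(2)
w = np.array([1.0, 1.0])   # placeholder weights
b = -1.0                   # placeholder bias
print(nn_predict(np.array([0.5, -1.5]), A, w, b))
```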
Neural Networks
I have so many questions: how did we choose these features that did well, and how are they learnt?
The network here can discover two new features, each of which looks like $\max(\mathbf{a}^\top \mathbf{x}, 0)$. With parameter matrix $A = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$, the network computes $\operatorname{sign}(\mathbf{w}^\top \max(A\mathbf{x}, 0) + b)$, where $\phi(\mathbf{x}) = \max(A\mathbf{x}, 0)$ is the ReLU activation. The parameter values are learnt using (S)GD.
The design of the neural network (number of layers, number of nodes in each layer, activation functions) decides what sort of features the NN can discover.
These features were able to solve this task but may not work for some other task. Learning the optimal neural network is also an NP-hard problem.
[Figure: the same one-hidden-layer network, now with decision regions that classify the 2D dataset correctly]
Neural Networks
With $A = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$, $\mathbf{w} = (1, 1)^\top$ and $b = -1$, and using the identity $\max(t, 0) + \max(-t, 0) = |t|$, the network's prediction simplifies to
$\operatorname{sign}(\mathbf{w}^\top \max(A\mathbf{x}, 0) + b) = \operatorname{sign}(\max(x - y, 0) + \max(y - x, 0) - 1) = \operatorname{sign}(|x - y| - 1)$.
Notice that the NN learnt a very different decision boundary than the DT.
[Figure: the same network with ReLU activation $\phi(\mathbf{x}) = \max(A\mathbf{x}, 0)$, and the resulting decision regions, whose boundary is the pair of lines $|x - y| = 1$]
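A quick numerical check of this simplification (a sketch using the parameter values from the derivation above):

```python
import numpy as np

A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
w = np.array([1.0, 1.0])
b = -1.0

def nn_predict(x):
    """sign(w^T max(Ax, 0) + b) for the parameters above."""
    return np.sign(w @ np.maximum(A @ x, 0.0) + b)

def closed_form(x):
    """The closed form the derivation arrives at: sign(|x1 - x2| - 1)."""
    return np.sign(abs(x[0] - x[1]) - 1.0)

# The two agree on any input
rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.uniform(-2, 2, size=2)
    assert nn_predict(x) == closed_form(x)
print("NN output matches sign(|x - y| - 1)")
```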
Exercises
Note that the classifier shown in the figure will also solve this problem. Can you find model parameter values for the previous neural network that will yield this classifier?
[Figure: an alternative decision boundary on the same 2D dataset]
Neural Networks
Note that this neural network creates two new features even though the data point had only one feature to begin with.
The output is $\mathbf{w}^\top \phi(x) + b$ with no final activation (this is a regression problem), where the ReLU activation gives $\phi(x) = \max(x \cdot \mathbf{a} + \mathbf{c}, 0)$ with $\mathbf{a} = (1, -1)$ and $\mathbf{c} = (-1, 1)$.
[Figure: a one-hidden-layer network $x \to \mathbf{a}, \mathbf{c} \to \text{ReLU } \phi \to \mathbf{w}, b$ fitting a non-linear 1D regression curve]
Neural Networks
The two new features define new axes: $\phi(x) = \max(x \cdot \mathbf{a} + \mathbf{c}, 0) = (\max(x - 1, 0), \max(1 - x, 0))$. The new axis values are calculated using the original axis values, not the new ones. The output is again $\mathbf{w}^\top \phi(x) + b$ with no final activation.
[Figure: the same data replotted in the new feature space defined by the two features, where a linear model fits it well]
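A sketch of this 1D feature construction (the feature map uses the slide's values $\mathbf{a} = (1, -1)$, $\mathbf{c} = (-1, 1)$; the output-layer parameters $\mathbf{w}$, $b$ below are placeholders):

```python
import numpy as np

a = np.array([1.0, -1.0])
c = np.array([-1.0, 1.0])

def phi(x):
    """Two ReLU features from a single input: (max(x-1, 0), max(1-x, 0))."""
    return np.maximum(x * a + c, 0.0)

def nn_regress(x, w, b):
    """Regression network: w^T phi(x) + b, with no final activation."""
    return w @ phi(x) + b

# Placeholder output-layer parameters; with w = (1, 1) and b = 0 the network
# outputs |x - 1|, a piecewise-linear (hence non-linear) function of x.
w = np.array([1.0, 1.0])
b = 0.0
for x in [-1.0, 0.0, 1.0, 2.0, 3.0]:
    print(x, nn_regress(x, w, b))
```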
Non-linearity in Neural Networks
The simplest features learnt by NNs look like $\phi(\mathbf{x}) = \sigma(A\mathbf{x} + \mathbf{c})$, where $\sigma$ is an activation function. Examples: ReLU, GeLU, sigmoid, tanh. Activation functions are often applied coordinate-wise.
NNs often stack such feature learners. Such deep stacking of layers allows NNs to learn very powerful features, on top of which a linear model can do well (see the sketch below).
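A minimal sketch of such stacking (the layer sizes, the random weights, and the choice of ReLU here are illustrative assumptions):

```python
import numpy as np

def relu(t):
    """Coordinate-wise ReLU activation."""
    return np.maximum(t, 0.0)

def stacked_features(x, layers):
    """Apply a stack of feature learners phi(h) = sigma(A h + c), layer by layer."""
    h = x
    for A, c in layers:
        h = relu(A @ h + c)
    return h

rng = np.random.default_rng(0)
# Two illustrative layers mapping 2 -> 4 -> 3 features
layers = [(rng.normal(size=(4, 2)), rng.normal(size=4)),
          (rng.normal(size=(3, 4)), rng.normal(size=3))]
w, b = rng.normal(size=3), 0.0           # a final linear model on top
x = np.array([0.5, -1.0])
print(w @ stacked_features(x, layers) + b)
```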
Exercises
Note that we will often use the terms linear and affine interchangeably: strictly, $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$ is an affine function, and is linear (in the sense defined earlier) only when $b = 0$.