Kolmogorov-Arnold Networks
Umar Jamil
Downloaded from: https://github.com/hkproj/kan-notes
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0):
https://creativecommons.org/licenses/by-nc/4.0/legalcode
[Diagram: a single linear layer mapping a batch of 10 input items, each with 3 features, to scores over 5 classes.]
$\mathbf{O} = \mathbf{X}\mathbf{W}^T + \mathbf{b}$
Shapes: $\mathbf{X}$: (10, 3), $\mathbf{W}^T$: (3, 5), $\mathbf{X}\mathbf{W}^T$: (10, 5), $\mathbf{O}$: (10, 5)
$\mathbf{O}_1 = \mathbf{x}\mathbf{W}_1^T + \mathbf{b}_1$
$\mathbf{O}_2 = \mathbf{O}_1\mathbf{W}_2^T + \mathbf{b}_2$
$\mathbf{O}_2 = (\mathbf{x}\mathbf{W}_1^T + \mathbf{b}_1)\mathbf{W}_2^T + \mathbf{b}_2$
As you can see, if we do not apply any activation function, the output is just a linear combination of the inputs. This means our MLP cannot learn any non-linear mapping between input and output, and non-linear relationships are what most real-world data exhibits.
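As a quick numerical check (a minimal NumPy sketch; the shapes and values are illustrative, not taken from the slides), two stacked linear layers without an activation collapse into a single equivalent linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 10 items, 3 input features, 5 hidden features, 4 output features.
X = rng.normal(size=(10, 3))
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(4, 5)), rng.normal(size=4)

# Two linear layers applied one after the other (no activation in between).
O2 = (X @ W1.T + b1) @ W2.T + b2

# The same mapping written as a single linear layer.
W_eq = W2 @ W1           # equivalent weight matrix
b_eq = b1 @ W2.T + b2    # equivalent bias
O_single = X @ W_eq.T + b_eq

print(np.allclose(O2, O_single))  # True: no extra expressive power without activations
```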
$y = ax^3 + bx^2 + cx + d$
We can write our system of equations as follows and solve to find the equation of the curve:
$5 = a(0)^3 + b(0)^2 + c(0) + d$
$1 = a(1)^3 + b(1)^2 + c(1) + d$
$3 = a(2)^3 + b(2)^2 + c(2) + d$
$2 = a(5)^3 + b(5)^2 + c(5) + d$
Source: https://arachnoid.com/polysolve/
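The same system can also be solved numerically. A small NumPy sketch using the four points from the example above:

```python
import numpy as np

# The four (x, y) points from the example: (0, 5), (1, 1), (2, 3), (5, 2).
xs = np.array([0.0, 1.0, 2.0, 5.0])
ys = np.array([5.0, 1.0, 3.0, 2.0])

# Each row encodes one equation y = a*x^3 + b*x^2 + c*x + d.
A = np.stack([xs**3, xs**2, xs, np.ones_like(xs)], axis=1)

# Solve for the coefficients (a, b, c, d) of the cubic passing through all four points.
a, b, c, d = np.linalg.solve(A, ys)
print(a, b, c, d)
```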
$\mathbf{B}(t) = \mathbf{P}_0 + t(\mathbf{P}_1 - \mathbf{P}_0) = (1 - t)\mathbf{P}_0 + t\mathbf{P}_1$
Given three control points, we can construct the quadratic Bézier curve they define.
Source: Wikipedia
$\mathbf{Q}_0(t) = (1 - t)\mathbf{P}_0 + t\mathbf{P}_1$
$\mathbf{Q}_1(t) = (1 - t)\mathbf{P}_1 + t\mathbf{P}_2$
$\mathbf{B}(t) = (1 - t)\mathbf{Q}_0 + t\mathbf{Q}_1$
$= (1 - t)\left[(1 - t)\mathbf{P}_0 + t\mathbf{P}_1\right] + t\left[(1 - t)\mathbf{P}_1 + t\mathbf{P}_2\right]$
$= (1 - t)^2\mathbf{P}_0 + 2(1 - t)t\,\mathbf{P}_1 + t^2\mathbf{P}_2$
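As a quick numerical check of this expansion (a minimal NumPy sketch; the three control points are illustrative), the nested two-step construction and the expanded Bernstein form give the same curve:

```python
import numpy as np

# Three arbitrary 2D control points (illustrative values).
P0, P1, P2 = np.array([0.0, 0.0]), np.array([1.0, 2.0]), np.array([3.0, 1.0])

t = np.linspace(0.0, 1.0, 101)[:, None]

# Two-step construction: interpolate P0-P1 and P1-P2, then interpolate the results.
Q0 = (1 - t) * P0 + t * P1
Q1 = (1 - t) * P1 + t * P2
B_nested = (1 - t) * Q0 + t * Q1

# Expanded Bernstein form.
B_expanded = (1 - t)**2 * P0 + 2 * (1 - t) * t * P1 + t**2 * P2

print(np.allclose(B_nested, B_expanded))  # True
```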
$\mathbf{B}(t) = \sum_{i=0}^{n} \binom{n}{i}(1 - t)^{n - i}\, t^i\, \mathbf{P}_i = \sum_{i=0}^{n} b_{i,n}(t)\, \mathbf{P}_i$
[Plot of the cubic Bernstein basis polynomials: blue $b_{0,3}(t)$, green $b_{1,3}(t)$, red $b_{2,3}(t)$, cyan $b_{3,3}(t)$.]
Binomial coefficients:
$\binom{n}{i} = \frac{n!}{i!\,(n - i)!}$
Source: Wikipedia
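Putting the general Bézier formula into code (a minimal NumPy sketch; the function names and control points are illustrative, not from the slides):

```python
import numpy as np
from math import comb

def bernstein(i: int, n: int, t: np.ndarray) -> np.ndarray:
    """Bernstein basis polynomial b_{i,n}(t) = C(n, i) * (1 - t)^(n - i) * t^i."""
    return comb(n, i) * (1 - t) ** (n - i) * t ** i

def bezier(points: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Evaluate the degree-n Bézier curve B(t) = sum_i b_{i,n}(t) * P_i."""
    n = len(points) - 1
    return sum(bernstein(i, n, t)[:, None] * points[i] for i in range(n + 1))

# Example: a cubic Bézier curve defined by four illustrative control points.
control = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 3.0], [4.0, 0.0]])
curve = bezier(control, np.linspace(0.0, 1.0, 100))
```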
Source: MIT
[Plot of the quadratic B-spline basis functions $N_{0,2}$, $N_{1,2}$, $N_{2,2}$, $N_{3,2}$, $N_{4,2}$, $N_{5,2}$.]
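The basis functions $N_{i,k}$ shown above can be evaluated with the Cox–de Boor recursion (a minimal NumPy sketch; the knot vector below is only an illustrative example, not the one used in the figure):

```python
import numpy as np

def bspline_basis(i: int, k: int, knots: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Cox-de Boor recursion for the B-spline basis function N_{i,k}(t)."""
    if k == 0:
        # Degree-0 basis: 1 inside the knot span [t_i, t_{i+1}), 0 elsewhere.
        return np.where((knots[i] <= t) & (t < knots[i + 1]), 1.0, 0.0)
    left_den = knots[i + k] - knots[i]
    right_den = knots[i + k + 1] - knots[i + 1]
    left = 0.0 if left_den == 0 else (t - knots[i]) / left_den * bspline_basis(i, k - 1, knots, t)
    right = 0.0 if right_den == 0 else (knots[i + k + 1] - t) / right_den * bspline_basis(i + 1, k - 1, knots, t)
    return left + right

# Example: quadratic (k = 2) basis functions on a uniform knot vector.
knots = np.arange(9, dtype=float)      # illustrative knot vector
t = np.linspace(0.0, 8.0, 200)
N02 = bspline_basis(0, 2, knots, t)    # N_{0,2}(t)
```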
I want to emphasize what it means to be a universal approximator: it means that, given an ideal function (or a family of functions) that models the training data, the network can learn to approximate it as well as we want; that is, given an error $\epsilon$, we can always find an approximating function that is within $\epsilon$ of the ideal function.
This is, however, a theoretical result; it doesn't tell us how to achieve it in practice. On a practical level, we face many problems:
• Achieving good approximations may take enormous amounts of computational power
• We may need a very large quantity of training data
• Our hardware may not be able to represent certain weights in 32 bits
• Our optimizer may get stuck in a local minimum
So, as you can see, just because a neural network can learn anything doesn't mean we are able to learn it in practice. But at least we know that the limits are practical ones.
[Diagram: a network with inputs $x_1, x_2$ ($n = 2$) and $2n + 1 = 5$ intermediate nodes/functions $\varphi_1, \dots, \varphi_5$.]
Layer 1: 2 input features, 5 output features, for a total of 10 functions to "learn"
Layer 2: 5 input features, 1 output feature, for a total of 5 functions to "learn"
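This structure follows the Kolmogorov–Arnold representation theorem, which states that any continuous multivariate function on a bounded domain can be written as a composition of univariate functions and addition:

$f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\left( \sum_{p=1}^{n} \varphi_{q,p}(x_p) \right)$

For $n = 2$, this gives $n(2n + 1) = 10$ inner functions $\varphi_{q,p}$ (Layer 1) and $2n + 1 = 5$ outer functions $\Phi_q$ (Layer 2), matching the counts above.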
Compared to an MLP, we also have $(G + k)$ parameters for each activation, because we need to learn where to place the control points of the B-splines.
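To make this concrete, here is a minimal sketch (my own illustration using SciPy, not the reference KAN implementation) of a single learnable edge activation built from a B-spline with $G$ grid intervals and spline order $k$, giving $G + k$ learnable coefficients:

```python
import numpy as np
from scipy.interpolate import BSpline

# Illustrative values: G grid intervals and spline order k give (G + k) coefficients.
G, k = 5, 3
grid = np.linspace(-1.0, 1.0, G + 1)
h = grid[1] - grid[0]
# Extend the grid by k knots on each side so the spline is defined on the whole interval.
knots = np.concatenate([grid[0] - h * np.arange(k, 0, -1), grid, grid[-1] + h * np.arange(1, k + 1)])

coeffs = np.random.randn(G + k)   # the (G + k) learnable parameters of this one activation
phi = BSpline(knots, coeffs, k)   # phi(x) = sum_i coeffs[i] * N_{i,k}(x)

x = np.linspace(-1.0, 1.0, 100)
y = phi(x)                        # the learnable activation applied to its input
```

In a KAN layer, one such spline sits on every input-output edge, which is where the extra $(G + k)$ parameters per activation come from.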