
Data Analysis

Chan Tsz Pang

December 17, 2024



0.1 Data Analysis


0.1.1 Statistics for Data Analysis
Definition 1. Bayes Rule
P(A|B) = P(B|A) P(A) / P(B)

Definition 2. Gaussian Distribution


Normal Distribution or Gaussian Distribution is a type of continuous probability distribution for a real-valued random variable. Here is the probability density formula:

f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))

where µ is the mean (expectation) and σ² the variance of the distribution.


Definition 3. Central Limit Theorem
The sampling distribution of the mean will be approximately normally distributed, as long as the sample size is large enough.
Let X₁, X₂, . . . , X_N be i.i.d. random variables with mean µ and variance σ². If N → ∞, then

(1/N) Σ_{n=1}^{N} X_n ∼ N(µ, σ²/N) approximately.

0.1.2 Linear Algebra for Data Analysis


Definition 4. Linear Independence
Definition 5. Dimension
Definition 6. Basis
Definition 7. Rank

Definition 8. Null Space


Definition 9. Eigenvalues and eigenvectors
Definition 10. Rotation and Projection

Theorem 1. Singular Value Decomposition (S.V.D)


Every m × n matrix A can be decomposed as A = UΣV∗, where Σ is a diagonal matrix and U, V are unitary matrices whose columns are orthonormal (U^H = U^{−1}, where U^H is the conjugate transpose). Note that A∗ is the adjoint of A.
(a) Check the rank of matrix A.
(b) Find the eigenvalues λ of A∗A (equivalently of AA∗; they share the same nonzero eigenvalues).
(c) The singular values are σ = √λ.
(d) Find the corresponding orthonormal eigenvectors vᵢ of A∗A for the different eigenvalues, and normalize them.
(e) Σ = diag{σᵢ}, with the larger singular values placed in the upper rows.
(f) V∗ is the matrix of the corresponding eigenvectors; remember that V∗ has the vᵢᵀ as its rows:
V∗ = (v₁ᵀ; v₂ᵀ; v₃ᵀ)
(g) Then find U through the calculation uᵢ = Avᵢ/σᵢ.
Example 1. Find the S.V.D. of the matrix A = [[3, 0], [4, 5]]. Note that the rank of A is two, as there are 2 linearly independent rows in A, so there are 2 singular values, and we label σ₁ as the bigger one, corresponding to λ₁. Since A is triangular, its eigenvalues are 3 and 5; we will see that σ₁ is larger than 5 while σ₂ is less than 3, so singular values are not eigenvalues of A. Let us find σ₁ and σ₂:
AᵀA = [[25, 20], [20, 25]] and AAᵀ = [[9, 12], [12, 41]]; their characteristic equations are the same: (λ − 45)(λ − 5) = 0,
so σ₁ = √λ₁ = √45 and σ₂ = √λ₂ = √5.
Then we find
[[25, 20], [20, 25]] (1, 1)ᵀ = 45 (1, 1)ᵀ and [[25, 20], [20, 25]] (−1, 1)ᵀ = 5 (−1, 1)ᵀ.
Since their lengths are √2, the normalized eigenvectors are v₁ = (1/√2)(1, 1)ᵀ and v₂ = (1/√2)(−1, 1)ᵀ. Then we compute
Av₁ = (1/√2)(3, 9)ᵀ = √45 · (1/√10)(1, 3)ᵀ = σ₁u₁ and Av₂ = (1/√2)(−3, 1)ᵀ = √5 · (1/√10)(−3, 1)ᵀ = σ₂u₂.
Hence U = (1/√10)[[1, −3], [3, 1]] and Σ = [[√45, 0], [0, √5]], as well as V = (1/√2)[[1, −1], [1, 1]].
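A quick numerical check of this example, sketched with NumPy (numpy.linalg.svd orders singular values largest first):

import numpy as np

A = np.array([[3., 0.], [4., 5.]])
U, s, Vt = np.linalg.svd(A)
print(s**2)                                  # [45, 5]: squared singular values
print(np.allclose(U @ np.diag(s) @ Vt, A))   # True: the factorization reproduces A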

0.1.3 Regression.1
Definition 11. Regression
Regression is a statistical model to determine the relationship between a dependent variable and one or more independent
variables.
Definition 12. Linear Regression

Linear Regression is a statistical model that estimates the linear relationship between a dependent variable and one or more independent variables.
Definition 13. Data Set
A data set is a collection of data defined as D := {(xᵢ, yᵢ)}_{i=1}^{n}, where the inputs xᵢ ∈ R^d and the target variable yᵢ ∈ R.

Definition 14. True Function


The True Function is the true relationship between inputs x and output y, defined as f(x) = y, which is normally unknown. If f were known we could go to sleep now, as we would not need to do any regression analysis.
Definition 15. Hypothesis Function

The Hypothesis Function aims to approximate the true function f; we define it as h_w(x) = g(x, w) = p, where h ∈ H := {h_w | w ∈ R^d} and H is a set of hypothesis functions.
Ideas 1. Linear Regression aims to find a cost function L and the best parameter w*.
Given any data set D := {(xᵢ, yᵢ)}_{i=1}^{n}, with inputs x ∈ R^d and outputs y ∈ R, we want to find the true function f(x) = y using a hypothesis function h(x) = p to approximate f(x) = y. We define the hypothesis class H = {h_w | w ∈ R^d} of hypothesis functions. We also employ a parametric model, so there is some finite-dimensional vector w ∈ R^d; the vector w holds the parameters or weights that control the hypothesis function and thereby approximate the true function. Hence h_w(x) = g(x, w).

Linear Regression aims to design a cost function L to find the best parameter w* such that w* = arg min_w L(w).

Ideas 2. Ordinary Least Squares


We want to find the best parameter w* to approximate the true function, so we use ordinary least squares, which aims to minimize the sum of the squares of the residuals. In other words, it minimizes the sum of the squared offsets of the points from the curve. To do so, given data X, y, we need X to be full rank: if it is not, then the optimal w*_OLS is not unique, since X being full rank means the columns of X give a one-to-one correspondence between w and Xw.

Theorem 2. Ordinary Least Squares (O.L.S)



If X is full rank then the optimal parameter of ordinary least squares is w*_OLS = (XᵀX)⁻¹Xᵀy.
Proof. Ordinary Least Squares (O.L.S) is the simplest regression method; we take h_w(x) = xᵀw such that
yᵢ ≈ ŷᵢ = h_w(xᵢ) = xᵢᵀw ⟺ ŷ = Xw,
where the design matrix X ∈ R^{n×d} contains the datapoint xᵢᵀ as its ith row. Usually n ≥ d, meaning there are more datapoints than parameters. Therefore there is generally no exact solution to the equation y = Xw, but we can find an approximate solution by minimizing the sum of the squared errors:
L(w) = Σ_{i=1}^{n} (xᵢᵀw − yᵢ)² = ||Xw − y||₂², solved by min_w ||Xw − y||₂².
Now we will see that the particular structure of OLS allows us to compute a closed-form expression for a globally optimal solution, which we denote w*_OLS.

We will use calculus as the first approach to find the optimal parameter w*_OLS.
Let the cost function L : R^d → R be continuously differentiable; then any local optimum w* satisfies ∇L(w*) = 0. Expanding the cost,
L(w) = ||Xw − y||₂² = (Xw − y)ᵀ(Xw − y)
= (Xw)ᵀ(Xw) − (Xw)ᵀy − yᵀ(Xw) + yᵀy
= wᵀXᵀXw − 2wᵀXᵀy + yᵀy.
Using the following results from matrix calculus:
∇_x(aᵀx) = a
∇_x(xᵀAx) = (A + Aᵀ)x
we now do the differentiation as follows:
∇L(w) = ∇_w(wᵀXᵀXw − 2wᵀXᵀy + yᵀy)
= ∇_w(wᵀXᵀXw) − 2∇_w(wᵀXᵀy)
= 2XᵀXw − 2Xᵀy
= 0.

Since X is full rank, (XᵀX) is invertible, so w*_OLS = (XᵀX)⁻¹Xᵀy.


Now we will use orthogonal projection as the second approach to find the optimal parameter w*_OLS.
Let V be an inner product space and S a subspace of V; then any v ∈ V can be decomposed uniquely in the form v = v_S + v_⊥, where v_S ∈ S and v_⊥ ∈ S^⊥. Here S^⊥ is the orthogonal complement of S: the vectors of S^⊥ are all perpendicular to every vector in S.
The orthogonal projection onto S, denoted P_S, is the linear operator that maps v to v_S in the decomposition above.
Note that ||v − P_S v|| ≤ ||v − s|| for all s ∈ S, with equality if and only if s = P_S v. That is, P_S v = arg min_{s∈S} ||v − s||.
Now let us consider the case of O.L.S: w*_OLS = arg min_w ||Xw − y||₂².
Note that the set of vectors Xw for w ∈ R^d is precisely the range of X, which is a subspace of R^n, so
min_{z∈range(X)} ||z − y||₂² = min_{w∈R^d} ||Xw − y||₂². Note that P_{range(X)} y = Xw*_OLS, where w*_OLS is any optimum of the right-hand side.
The projected point Xw*_OLS is always unique, and w*_OLS is unique too if X is full rank. To solve for w*_OLS we need the following fact: null(Xᵀ) = range(X)^⊥.
Since we are projecting onto range(X), the orthogonality condition for optimality is that y − Xw*_OLS ⊥ range(X), i.e. y − Xw*_OLS ∈ null(Xᵀ). Therefore Xᵀ(y − Xw*_OLS) = 0, so w*_OLS = (XᵀX)⁻¹Xᵀy. ■

Remark 1. The calculus approach sucks if we do not know matrix calculus.
Remark 2. ∇L(w) = 0 only means we found a critical point; it may be a local maximum, a local minimum, or a saddle point. So why can we confirm it is the global minimum?

Because the cost function L is convex. To show it is convex we compute the Hessian of L, the matrix of second-order partial derivatives: ∇²L(w) = 2XᵀX.
This is positive semi-definite, as it is a symmetric matrix whose eigenvalues are all nonnegative:
wᵀ(2XᵀX)w = 2(Xw)ᵀ(Xw) = 2||Xw||₂² ≥ 0 for all w.
Remark 3. Prove P_S v = arg min_{s∈S} ||v − s||. By the Pythagorean Theorem,
||v − s||² = ||v − P_S v + P_S v − s||² = ||v − P_S v||² + ||P_S v − s||² ≥ ||v − P_S v||².
Example 2. For x₁ = (1, 0, 0)ᵀ, x₂ = (0, 2, 0)ᵀ, x₃ = (0, 0, 3)ᵀ and y = (1, 1, 1)ᵀ we have
X = (x₁ᵀ; x₂ᵀ; x₃ᵀ) = [[1, 0, 0], [0, 2, 0], [0, 0, 3]].
By O.L.S we find w*_OLS = (XᵀX)⁻¹Xᵀy = (1, 1/2, 1/3)ᵀ, since XᵀX = diag(1, 4, 9) and Xᵀy = (1, 2, 3)ᵀ. Here X is square and invertible, so Xw*_OLS = y exactly.

Remark 4. I was too lazy to calculate it and asked an AI to do it; three times it still got it wrong, with Xw ≠ y. (With the w above, Xw = y, as it must be, since X is invertible.)
Code 1.
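A minimal NumPy sketch of the closed-form O.L.S solution, checked on Example 2 (the variable names are my own):

import numpy as np

# Design matrix and targets from Example 2
X = np.array([[1., 0., 0.],
              [0., 2., 0.],
              [0., 0., 3.]])
y = np.array([1., 1., 1.])

# Closed form: w* = (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(w_ols)                       # [1.  0.5  0.33333333]
print(np.allclose(X @ w_ols, y))   # True: X is square and invertible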

Ideas 3. Disadvantages of O.L.S

There are a lot of problems with O.L.S: it only solves linear least squares problems, and it falls short due to numerical instability and generalization issues (X not full rank, sensitivity to outliers, collinearity making it unstable). It suffers from overfitting and multicollinearity. Overfitting means the model fits the dataset closely but is bad at approximating new things. Multicollinearity causes the matrix X to lose rank, and full rank of X is really important to O.L.S. What about singular values very close to 0? Very small singular values cause the squared inverses of the singular values of X to be extremely large, a huge problem for numerical stability that prevents OLS from generalizing to unseen data. Take the Singular Value Decomposition (S.V.D) of X: X = UΣVᵀ; then (XᵀX)⁻¹ = (VΣ²Vᵀ)⁻¹ = VΣ⁻²Vᵀ has singular values that are the squared inverses of the singular values of X. These become extremely large when X's singular values are close to 0, a huge numerical stability problem. We can fix it with Ridge Regression.

Ideas 4. Ridge Regression


Ridge Regression adds a penalty on the size of the coefficients to O.L.S. It helps us mitigate the issues of multicollinearity and overfitting.
Theorem 3. Ridge Regression

Ridge Regression penalizes the entries of w from becoming too large. We do this by adding a penalty term constraining the norm of w. For a fixed, small scalar λ > 0, we now have: min_w ||Xw − y||₂² + λ||w||₂², whose solution is w*_RIDGE = (XᵀX + λI)⁻¹Xᵀy.
We choose a large enough hyperparameter λ to prevent the singular values of (XᵀX + λI)⁻¹ from blowing up when singular values of X are close to zero.

Proof. Setting the gradient of the objective to zero gives 2XᵀXw − 2Xᵀy + 2λw = 0, i.e. (XᵀX + λI)w = Xᵀy; since XᵀX is positive semi-definite, XᵀX + λI has eigenvalues ≥ λ > 0 and is invertible. ■

Example 3.
Code 2.
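A sketch of ridge regression in NumPy (λ and the toy data are placeholders of my choosing; note the nearly collinear columns that would break plain O.L.S):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 1] + 1e-6 * rng.normal(size=100)   # nearly collinear columns
y = X @ np.array([1.0, 2.0, 0.0]) + 0.1 * rng.normal(size=100)

lam = 0.1                                         # hyperparameter lambda
d = X.shape[1]
# w* = (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_ridge)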

Ideas 5. Feature Engineering

The true function y = f(x) may well be non-linear, but we can still turn the problem into linear least squares by adding new features ϕ(x).
Theorem 4. Feature Engineering

We can devise a new function ϕ : R^l → R^d, called a feature map, that maps each raw data point x ∈ R^l into a vector of features ϕ(x). The hypothesis function is h_w(x) = Σ_{j=1}^{d} w_j ϕ_j(x) = wᵀϕ(x), so the model is still linear with respect to the features. The component functions ϕ_j are called basis functions, and we can still use least squares to estimate the weights w. Therefore we replace the original data matrix X ∈ R^{n×l} by Φ ∈ R^{n×d}, which has ϕ(xᵢ)ᵀ as its ith row:
min_w ||Φw − y||₂².

Example 4.
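For instance (my own toy setup), a polynomial feature map for scalar inputs, fit by least squares on the features:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 50)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=50)   # nonlinear true function

# Feature map phi(x) = (1, x, x^2, x^3): the model stays linear in w
Phi = np.vander(x, N=4, increasing=True)
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
print(w)   # weights of the cubic fit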
Ideas 6. Hyperparameter Finetuning
We do a lot of trial-and-error to find the best hyperparameter λ. During validation training, we try many hyperparameters λ, train, and return the parameters w*. If the validation is bad and gets a lot of errors, we do it again; if not, we return the best hyperparameter and parameters to do the test. Therefore we need to do a lot of validation.
Definition 16. K-Fold Cross Validation
We need to separate the validation data and the training data. Cross-validation is an alternative to having a dedicated validation set (a runnable sketch follows this list).
(a) Shuffle the data and partition it into k equally-sized blocks.
(b) For i = 1, ..., k, train the model on all data except block i and evaluate the model using block i.
(c) Average the k validation errors; this is our final estimate of the true error.
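A minimal k-fold cross-validation loop for the ridge estimator (plain NumPy; all names are mine):

import numpy as np

def k_fold_cv_ridge(X, y, lam, k=5, seed=0):
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)   # (a) shuffle
    blocks = np.array_split(idx, k)                    # (a) partition
    errors = []
    for i in range(k):                                 # (b) hold out block i
        val = blocks[i]
        tr = np.concatenate([b for j, b in enumerate(blocks) if j != i])
        d = X.shape[1]
        w = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(d), X[tr].T @ y[tr])
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errors)                             # (c) average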

0.1.4 Regression.2
Ideas 7. Is the l₂ norm a good choice to estimate the model?
We use the l₂ norm to measure the error of our predictions and to penalize the model parameters, as it is a great design choice given the statistical interpretations of regression via Gaussians, M.L.E and M.A.P.
Definition 17. Maximum Likelihood Estimation (M.L.E)

M.L.E aims to find the hypothesis model that maximizes the probability of the data.
Theorem 5. O.L.S satisfies M.L.E
Maximum Likelihood Estimation with Gaussian noise is exactly the same as Ordinary Least Squares.
Proof. If we parameterize the set of hypothesis models with θ, we can express the problem as θ_MLE = arg max_θ L(θ; D) = p(data = D | true model = h_θ). The quantity L(θ) that we are maximizing is also known as the likelihood. Taking the log, we are still working on the same problem, because logarithms are monotonic functions: P(A) < P(B) ⟺ log P(A) < log P(B). Let us decompose the log likelihood:
ℓ(θ; X, y) = log p(y₁, ..., yₙ | x₁, ..., xₙ, θ) = log( Π_{i=1}^{n} p(yᵢ | xᵢ, θ) ) = Σ_{i=1}^{n} log p(yᵢ | xᵢ, θ).
Let p(yᵢ | xᵢ, θ) come from a Gaussian, Yᵢ | θ ∼ N(h_θ(xᵢ), σ²). Then we will have
θ_MLE = arg max_θ ℓ(θ; X, y) = arg max_θ [ −Σ_{i=1}^{n} (yᵢ − h_θ(xᵢ))²/(2σ²) − n log(√(2π)σ) ] = arg min_θ Σ_{i=1}^{n} (yᵢ − xᵢᵀθ)².
Therefore the M.L.E is just the O.L.S problem! ■
Example 5.
Code 3.
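A small numerical check, on toy data of my own, that the Gaussian negative log-likelihood and the OLS objective share the same minimizer:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -0.7]) + 0.3 * rng.normal(size=200)
sigma = 0.3

def neg_log_lik(theta):
    r = y - X @ theta
    return np.sum(r**2) / (2 * sigma**2) + len(y) * np.log(np.sqrt(2*np.pi) * sigma)

theta_ols = np.linalg.solve(X.T @ X, X.T @ y)
# perturbing theta_ols in any direction increases the negative log-likelihood
for d in [np.array([1e-2, 0]), np.array([0, 1e-2])]:
    assert neg_log_lik(theta_ols) < neg_log_lik(theta_ols + d)
print(theta_ols)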

Definition 18. Maximum a Posteriori

Maximum a Posteriori Estimation aims to find the model that maximizes the posterior probability of the model:
θ_MAP = arg max_θ p(true model = h_θ | data = D), which is a posterior.
Theorem 6. Ridge regression satisfies M.A.P

Proof. Using Bayes Rule, we have
θ_MAP = arg max_θ P(true model = h_θ | data = D)
= arg max_θ P(data = D | true model = h_θ) P(true model = h_θ) / P(data = D)
= arg max_θ [ log P(data = D | true model = h_θ) + log P(true model = h_θ) ]
= arg min_θ [ −log P(data = D | true model = h_θ) − log P(true model = h_θ) ].
Taking a Gaussian prior θ_j ∼ N(0, σ_h²) together with the Gaussian noise model gives
θ_MAP = arg min_θ [ Σ_{i=1}^{n} (yᵢ − h_θ(xᵢ))² + (σ²/σ_h²) Σ_{j=1}^{d} θ_j² ].

This is just Ridge Regression with λ = σ²/σ_h²! ■
Code 4.
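A similar check (same toy setup as before) that the MAP objective with a Gaussian prior is minimized by the ridge solution with λ = σ²/σ_h²:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -0.7]) + 0.3 * rng.normal(size=200)
sigma, sigma_h = 0.3, 1.0
lam = sigma**2 / sigma_h**2

def map_objective(theta):
    return np.sum((y - X @ theta)**2) + lam * np.sum(theta**2)

theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
for d in [np.array([1e-2, 0]), np.array([0, 1e-2])]:
    assert map_objective(theta_ridge) < map_objective(theta_ridge + d)
print(theta_ridge)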
Example 6.

Ideas 8. M.L.E vs M.A.P

Note that M.A.P is just like M.L.E except we add the term P(true model = h_θ), a prior over our true model. So what parameters will M.L.E and M.A.P select given a data set D?

M.L.E chooses the parameter θ with maximum likelihood, but it might have high variance: a slightly different data set will alter the predicted model.

M.A.P aims to balance variance and likelihood. If the model weights should be small, then the prior is centered at zero with small variance, and M.A.P chooses the parameter that maximizes the posterior probability.

Definition 19. Bias-Variance Tradeoff

ε(x, h) = (E[h(x; D)] − f(x))² + Var(h(x; D)) + Var(Z) is the bias-variance decomposition used to measure the effectiveness of a hypothesis.
Proof. Let Y = f(x) + Z be a noisy measurement of the true function f(x); we might try M.A.P and M.L.E to recover it, but how do we measure a hypothesis' effectiveness? We form a theoretical metric to measure the effectiveness of the hypothesis function h(x): the expected error between the hypothesis and the observation Y = f(x) + Z, such that:
ε(x, h) = E[(h(x; D) − Y)²] = ... = (E[h(x; D)] − f(x))² + Var(h(x; D)) + Var(Z),
which separates into the bias² of the method, the variance of the method, and the irreducible error. ■
Code 5.
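A Monte-Carlo sketch of the decomposition: refit a deliberately-too-simple model on many fresh datasets and compare bias², variance, and irreducible error at a point x₀ (all settings are mine):

import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * x)           # true function
sigma_z = 0.2                          # noise std, Var(Z) = sigma_z**2
x0 = 0.5                               # evaluation point

preds = []
for _ in range(2000):                  # fresh dataset D each round
    x = rng.uniform(-1, 1, 30)
    y = f(x) + sigma_z * rng.normal(size=30)
    w = np.polyfit(x, y, deg=1)        # a deliberately-too-simple model
    preds.append(np.polyval(w, x0))
preds = np.array(preds)

bias2 = (preds.mean() - f(x0))**2
var = preds.var()
print(bias2, var, bias2 + var + sigma_z**2)   # last value ~ expected squared error vs Y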
Theorem 7. Matching Pursuit Algorithm
Initialize the weight w⁰ = 0 and the residual r⁰ = y − Xw⁰ = y.
While ||w||₀ < k do
{
Find the feature i for which the length of the residual projected onto xᵢ is maximized:
i = arg min_j ( min_v ||r^{t−1} − v·x_j|| ) = arg max_j |⟨r^{t−1}, x_j⟩| / ||x_j||₂.
Update the ith entry of the weight vector:
wᵢᵗ = wᵢ^{t−1} + ⟨r^{t−1}, xᵢ⟩ / ||xᵢ||₂².
Update the residual: rᵗ = y − Xwᵗ.
}
Remark 5. I know no one understands it, so I will explain it in detail in the ideas and example parts:
Ideas 9. Matching Pursuit Algorithm
Matching Pursuit is an iterative algorithm used for solving sparse representation problems in regression. In other words, it helps us solve regression when the weight vector should contain a lot of zeros.
Matching Pursuit aims to find a sparse weight vector w such that the representation of y using the features from X is as accurate as possible, while limiting the number of non-zero entries in w.

Working principle:
(a) Initialization: start with w⁰ = 0 and compute the initial residual r⁰ = y.
(b) At each iteration, select the feature that best reduces the residual error.
Update the weight vector for that feature.
Update the residual to reflect this change.
(c) Stopping criterion: continue until the number of non-zero elements in w reaches the specified limit k.
Example 7. Let us consider a dataset with three features (columns) and a target variable y. Your goal is to approximate y using a linear combination of these features while keeping the model sparse (using at most one feature).
Let X = [[1, 2, 0], [0, 1, 1], [2, 0, 1]] and y = (3, 2, 2)ᵀ. In the first iteration, w⁰ = 0 and r⁰ = y.
Note that ⟨r⁰, x₁⟩ = 7 and ⟨r⁰, x₂⟩ = 8 as well as ⟨r⁰, x₃⟩ = 4. Since 8 is the maximum projection, we select i = 2, and dividing by ||x₂||₂² = 2² + 1² + 0² = 5 gives ⟨r⁰, x₂⟩/||x₂||₂² = 8/5. The weight is updated as w₂¹ = w₂⁰ + 8/5 = 8/5.
Then we update the residual: r¹ = r⁰ − w₂¹x₂ = r⁰ − (⟨r⁰, x₂⟩/||x₂||₂²)x₂ = (−1/5, 2/5, 2)ᵀ. Now we have w = (0, 8/5, 0) and r¹ = (−1/5, 2/5, 2)ᵀ only.
You would repeat the process:
Compute the new inner products with the updated residual r¹.
Select the feature that maximizes the (normalized) inner product.
Update the corresponding weight and the residual again.
Continue this process until the sparsity limit is reached (i.e., you have selected k features). A runnable sketch follows.
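A short NumPy sketch of matching pursuit as described above, run on the data of Example 7:

import numpy as np

def matching_pursuit(X, y, k):
    n, d = X.shape
    w = np.zeros(d)
    r = y.copy()                                # r0 = y
    col_norms = np.linalg.norm(X, axis=0)
    while np.count_nonzero(w) < k:
        scores = np.abs(X.T @ r) / col_norms    # normalized projections
        i = np.argmax(scores)                   # best feature
        w[i] += (X[:, i] @ r) / col_norms[i]**2 # update weight i
        r = y - X @ w                           # update residual
    return w, r

X = np.array([[1., 2., 0.], [0., 1., 1.], [2., 0., 1.]])
y = np.array([3., 2., 2.])
w, r = matching_pursuit(X, y, k=1)
print(w, r)   # w = (0, 8/5, 0), r = (-1/5, 2/5, 2)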

0.1.5 Dimensionality Reduction


Ideas 10. A lot of data sets are high-dimensional, but we want to map them into low dimensions so we can visualize the data and understand it. It also reduces the computational load and reduces variance in estimation — it regularizes the problem. How can we do so?
Ideas 11. Principal Component Analysis (P.C.A)
Given high-dimensional data points, P.C.A. aims to extract orthogonal directions that capture the largest amount of variance in the unlabeled data. It is the simplest way to do unsupervised dimensionality reduction.
We use projection to do P.C.A. We also center the data set X to mean-zero data points so that an offset does not influence the directions heavily; we have faith in the origin! We will show that max_{||v||=1} vᵀXᵀXv = λ_max(XᵀX) and that the optimal directions satisfy XᵀXvᵢ = λᵢvᵢ.
Theorem 8. Principal Component Analysis (P.C.A)

Let X ∈ R^{n×d} be a (centered) high-dimensional data set and v a unit vector onto which we project; the projection of xᵢ onto v is P_v(xᵢ) = (xᵢᵀv)v. Minimizing the distance between X and its projection:
min_v Σ_{i=1}^{n} ||xᵢ − P_v(xᵢ)||²
= min_v Σ_{i=1}^{n} (||xᵢ||² − (xᵢᵀv)²)
= max_v Σ_{i=1}^{n} (xᵢᵀv)²
= max_v ||Xv||²
= max_v vᵀXᵀXv
= λ_max(XᵀX), attained at the top eigenvector.
Doing this inductively on the orthogonal complement gives XᵀXv_{k+1} = λ_{k+1}v_{k+1} for the subsequent components.
Example 8. Principal Component Analysis (P.C.A)
Given X = [[3, 2, 2], [2, 3, −2], [1, 1, 0]] with x₁ᵀ, x₂ᵀ, x₃ᵀ as its three rows (the transposed data vectors).
(a) Center the data to mean zero:
x̄ = (x₁ + x₂ + x₃)/3 = (2, 2, 0), and then X̃ = X − 1x̄ᵀ = [[1, 0, 2], [0, 1, −2], [−1, −1, 0]].
(b) X̃ᵀX̃ = [[2, 1, 2], [1, 2, −2], [2, −2, 8]] and det(λI − X̃ᵀX̃) = λ(λ − 3)(λ − 9) ⟹ λ₁ = 0, λ₂ = 3, λ₃ = 9.
Now let us find the vᵢ through (X̃ᵀX̃ − λᵢI)vᵢ = 0:
(X̃ᵀX̃ − 0I)v₁ = 0 ⟹ v₁ = (−2, 2, 1)ᵀ/3;
(X̃ᵀX̃ − 3I)v₂ = 0 ⟹ v₂ = (1, 1, 0)ᵀ/√2;
(X̃ᵀX̃ − 9I)v₃ = 0 ⟹ v₃ = (1, −1, 4)ᵀ/(3√2).
(c) Now we have the S.V.D of X̃ with Σ = [[3, 0], [0, √3]] and V = (v₃, v₂), as we pick the maximum singular value first.
Calculating uᵢ = X̃vᵢ/σᵢ gives
U = [[1/√2, 1/√6], [−1/√2, 1/√6], [0, −2/√6]].
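The same recipe in NumPy (center, eigendecompose X̃ᵀX̃, project onto the top directions):

import numpy as np

X = np.array([[3., 2., 2.], [2., 3., -2.], [1., 1., 0.]])
Xc = X - X.mean(axis=0)                   # (a) center the data

C = Xc.T @ Xc                             # (b) X~^T X~
eigvals, eigvecs = np.linalg.eigh(C)      # ascending eigenvalues
order = np.argsort(eigvals)[::-1]         # largest first
V = eigvecs[:, order[:2]]                 # top-2 principal directions

Z = Xc @ V                                # (c) low-dimensional coordinates
print(eigvals[order])                     # [9, 3, 0]
print(Z)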

Ideas 12. Multidimensional Scaling (M.D.S)

M.D.S assumes that the distances between the high-dimensional data points and the distances between the low-dimensional data points are the same:
D_ij = ||pᵢ − pⱼ|| = ||xᵢ − xⱼ|| = d_ij for any high-dimensional point p and low-dimensional point x.
Let B = ZZᵀ with entries b_ij = ⟨zᵢ, zⱼ⟩; then d_ij² = b_ii + b_jj − 2b_ij ⟹ Σᵢ Σⱼ d_ij² = 2nM, where M = Σᵢ b_ii (assuming the zᵢ are centered).

As the distances d_ij are provided, we can solve for M and then recover b_ij, i.e. the Gram matrix ZZᵀ.

Theorem 9. Multidimensional Scaling (M.D.S)

(a) Given the distance matrix D between the high-dimensional points,
(b) square each entry of D to get D².
(c) Assume each embedded point has unit norm, ||zᵢ|| = 1, so b_ii = 1;
(d) then d_ij² = 2 − 2b_ij, so (ZZᵀ)_ij = (2 − d_ij²)/2 for each entry of D².
(e) Do the eigenvalue decomposition ZZᵀ = VΛVᵀ and read off Z = VΛ^{1/2} (keeping the top components).

Proof. ■

Code 6.
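A sketch of this recipe (under the unit-norm assumption of steps (c)-(d); the distance matrix is made up):

import numpy as np

# A made-up symmetric distance matrix between 3 unit-norm points
D = np.array([[0.0, 1.0, 1.2],
              [1.0, 0.0, 0.8],
              [1.2, 0.8, 0.0]])

B = (2.0 - D**2) / 2.0                   # (d) Gram matrix ZZ^T under ||z_i|| = 1
eigvals, eigvecs = np.linalg.eigh(B)     # (e) eigendecomposition
order = np.argsort(eigvals)[::-1]
k = 2                                    # target dimension
L = np.clip(eigvals[order[:k]], 0, None)
Z = eigvecs[:, order[:k]] * np.sqrt(L)   # low-dimensional embedding
print(Z)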
Example 9.

0.1.6 Background of Neural Network


Information 1. Introduction to "MATH3320"
Prof Fan Fenglei focuses on neural networks, so in the 2024 MATH 3320 Data Analysis lectures he changed the syllabus a little bit. Since our midterm performance was quite good, he made the final "a little harder": the final will mainly focus on the lectures, some 2017-2022 research papers, and a little bit of the textbook. HOPE I CAN PASS THE FINAL.
The lectures focus on the mathematics behind Neural Networks, Hopfield Networks, Boltzmann Machines...
You have to know that a neural network is a mathematical model inspired by the structure and function of biological neural networks in animal brains.
In my point of view, the 1st-version neural network (Hopfield network) is a machine that "remembers" things, that has "memory". The 2nd version (Boltzmann machine) is a machine that not only remembers stuff but is also able to guess patterns and respond to new stuff, a kind of creation. In general, a neural network is a universal function approximator.

Definition 20. Neuron


A Neuron is a computation unit (computation node). It is a simple mathematical function that models the functioning of a nonlinear biological neuron. Typically, a neuron computes a weighted sum of its inputs, and this sum is passed through a nonlinear function (activation function) such as sigmoid or ReLU. The following is the neuron's linear part (pre-activation):

z : R^d → R, z(x) = wᵀx + b
(where w ∈ R^d are the parameters we want to find, x ∈ R^d is the data input, and b ∈ R is the bias)
Definition 21. Activation Function
An activation function is a non-linear mathematical function applied to the output of a neuron; it calculates the output of the node based on its individual inputs and their weights. There are different kinds of activation functions, such as sigmoid, ReLU, ... The following is the general form of a neuron with an activation function:

f : R^d → R, f(x) = σ(wᵀx + b)

Figure 1: a single neuron.

Remark 6. First Principles in Neural Networks

Prof Fan believes that neuroscience is the "First Principles Thinking" here: our brain is the solution to the science of neural networks. What an interesting point of view.
Information 2. Topology and Neural Networks

Consider the classification example below:

We want to classify samples into 4 classes. In Graph 1, we can observe that the classification is linear, using two straight lines; in other words, we use 2 neurons to classify the pattern.

Prof Fan asks a question: if we connect more neurons to become a network, it will be more powerful — but how should we arrange those neurons into a network? Among G2, G3, G4, which is the best graph to connect the network?

It is actually an OPEN QUESTION. Selecting the network topology that makes the best network is an open question; maybe you will want to study Graph Theory and Topology.

Figure 2: candidate network topologies G1-G4.

Geoffrey Hinton believes that DEEP is better, so he would choose G4. However, we need to consider the width too. If you need to train your model, how do you make an equation that calculates effectiveness in training? You should do some research on "Network Topology".
(Actually Prof Fan invited one of the researchers of Network Topology to give a talk, but I didn't attend...)

0.1.7 Nonlinear Least Squares


Information 3. Nonlinear Least Squares
y = wᵀx is a linear function of the input x. This also holds for least-squares polynomial regression — the input is the augmented polynomial feature ϕ(x) rather than x. Then we have y = wᵀϕ(x); we use an arbitrary nonlinear function f(x; w) to denote such models, so y = f(x; w).
For example, let w ∈ R² and let Yᵢ ∈ R be a noisy distance estimate from n sensors whose positions xᵢ ∈ R² are known; we predict the distance f(x; w) = ||x − w||₂. We want to find a universal function approximator f(x; w) which can approximate any function f(x) with appropriate parameters w. This will be the basis of neural networks.
For the purpose of the following discussion, we assume a given model f, an arbitrary differentiable function parameterized by w, so we can find the parameters w_MLE that maximize the likelihood of the data:

Yᵢ = f(xᵢ; w) + Zᵢ, where Zᵢ ∼ N(0, σ²) for all i.

Note that Yᵢ | xᵢ ∼ N(f(xᵢ; w), σ²).

w_MLE = arg max_w ℓ(w; X, y)
= arg max_w Σ_{i=1}^{n} log p(yᵢ | xᵢ, w)
= arg max_w Σ_{i=1}^{n} log( (1/√(2πσ²)) exp(−(yᵢ − f(xᵢ; w))²/(2σ²)) )
= arg max_w Σ_{i=1}^{n} [ log(1/√(2πσ²)) − (yᵢ − f(xᵢ; w))²/(2σ²) ]
= arg min_w Σ_{i=1}^{n} (yᵢ − f(xᵢ; w))²

Observe that the objective function is a sum of squared residuals, the cost function L(w) we have seen before, except that f is nonlinear. Therefore this method is called nonlinear least squares. Hence we need to solve the optimization problem below:
min_w L(w) = min_w Σ_{i=1}^{n} (yᵢ − f(xᵢ; w))²/2.

We can solve it by finding all critical points and choosing the minimum. From the first-order optimality condition, the gradient of the objective function at any minimum must be zero, such that:

∇_w L(w) = −Σ_{i=1}^{n} ∇_w f(xᵢ; w)(yᵢ − f(xᵢ; w))
∇_w L(w) = −J(w)ᵀ(y − F(w)) = 0,

where J is the Jacobian of F(w) = (f(x₁; w), ..., f(xₙ; w))ᵀ. If f is linear in w, say f(x; w) = wᵀx, then ∇_w f(xᵢ; w) = xᵢ and:

∇_w L(w) = −Σ_{i=1}^{n} (yᵢ − wᵀxᵢ)∇_w(wᵀxᵢ)
∇_w L(w) = −Σ_{i=1}^{n} (yᵢ − wᵀxᵢ)xᵢ
∇_w L(w) = −Xᵀ(y − Xw) = 0

Finally we obtain the closed-form OLS solution for w:

Xᵀ(y − Xw) = 0
Xᵀy − XᵀXw = 0
w = (XᵀX)⁻¹Xᵀy

In general, however, f is nonlinear in w, and we may not be able to derive a closed-form solution for w:
f might not be convex, and we cannot place additional assumptions on f;
a critical point w might not be the global minimum — it might be a saddle point, local minimum, or local maximum.
We need to approach w with no closed-form solution! Therefore we have an optimization method: Gradient Descent.
Theorem 10. Critical Points and Gradients
If w* is a local optimum of f and f is continuously differentiable in a neighborhood of w*, then ∇f(w*) = 0.

Figure 3: gradient descent illustration.

Information 4. Gradient and Optimization


Let us consider the realm of neural networks and beyond: we will be solving arbitrary problems of the form min_{w∈X} f(w) over an arbitrary continuously differentiable function f : R^d → R and arbitrary domain X, which is more general than the closed-form case. We need to understand optimization methods.

In optimization, we want to find the global minimum of a function. We might find local minima, local maxima, or saddle points in the process. We also need to consider boundary points and non-differentiable points to find the global minimum. Therefore we have to consider all critical points in our analysis: the set of points at which the gradient is zero,

∇f = (∂f/∂w₁, ∂f/∂w₂, ..., ∂f/∂w_d)ᵀ = (0, 0, ..., 0)ᵀ.

Rather than finding the closed-form solution, we want a method that creeps toward a local minimum. Therefore we have the algorithm Gradient Descent, which is a key algorithm of Data Analysis and Machine Learning.
Theorem 11. Gradient Descent
Gradient Descent is an algorithm that iteratively takes small steps in the direction of steepest descent of the objective.
Imagine there is a computer or robot called CTP who wants to walk to the bottom of a mountain. What should he do? He needs to find the steepest descent direction of the mountain so he can arrive at the bottom faster. Let the scaling αₜ depend on f so we can determine an adaptive stepsize.

How do we determine the direction of steepest descent of a multivariate function f, and what should we do with the gradient to do GRADIENT DESCENT?

1. Given any point wₜ in the domain of the function,
2. −∇f(wₜ) is the direction of steepest descent.
3. While f(wₜ) has not converged (∇f(wₜ) ≠ 0) do
{ w_{t+1} = wₜ − αₜ∇f(wₜ) }
Remark 7. Why −∇f(wₜ) is steepest
The directional derivative in a unit direction u at wₜ is defined as the inner product of the gradient and the direction:
D_u f(wₜ) = ⟨∇f(wₜ), u⟩ = ||∇f(wₜ)|| ||u|| cos(θ).
Finding the steepest descent entails finding the direction that minimizes the directional derivative. We minimize the directional derivative when θ = π, so that u and ∇f(wₜ) are opposite to each other. Thus the steepest descent direction is −∇f(wₜ).
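A tiny sketch of gradient descent on the OLS objective (step size and stopping rule are my choices):

import numpy as np

def grad_descent(grad, w0, alpha=0.1, tol=1e-8, max_iter=10000):
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < tol:      # gradient ~ 0: stop
            break
        w = w - alpha * g                # step in the steepest-descent direction
    return w

# minimize f(w) = ||Xw - y||^2 via its gradient 2 X^T (Xw - y)
X = np.array([[1., 0.], [0., 2.], [1., 1.]])
y = np.array([1., 2., 2.])
w = grad_descent(lambda w: 2 * X.T @ (X @ w - y), w0=[0., 0.])
print(w)   # (1, 1): matches the closed-form OLS solution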

Information 5. Prof Fan BONUS Lecture Content


Prof Fan taught:

1. Backpropagation algorithm,
2. Feedforward neural networks,
3. Constructing a 2-layer Neural Network,
4. Gaussian Processes,
5. a 2018 research paper,
6. Neural Network Compression,
7. Sigmoid,
8. ReLU networks expressing any C^n function,
9. Hopfield Networks and the Energy Function.

Also there are some extra bonus challenges:

11. "Barron Space = Fourier Transform",
12. "Improve the memory complexity of Hopfield Networks".

My CPU is going to burn out.



0.1.8 Neural Networks


Information 6. Visual look of Neural Network

Figure 4: a feedforward neural network.

Definition 22. Neuron


A Neuron is a computation unit (computation node). It is a simple mathematical function that models the functioning of a nonlinear biological neuron. Typically, a neuron computes a weighted sum of its inputs, and this sum is passed through a nonlinear function (activation function) such as sigmoid or ReLU. The following is the neuron's linear part (pre-activation):

z : R^d → R, z(x) = wᵀx + b
(where w ∈ R^d are the parameters we want to find, x ∈ R^d is the data input, and b ∈ R is the bias)
Definition 23. Activation Function

An activation function is a non-linear mathematical function applied to the output of a neuron; it calculates the output of the node based on its individual inputs and their weights. There are different kinds of activation functions, such as sigmoid, ReLU, ... The following is the general form:

f : R^d → R, f(x) = σ(wᵀx + b)

Information 7. Neural Network


Neural networks are a class of compositional function approximators. They come in a variety of shapes and sizes. We only discuss feedforward neural networks, those networks whose computation can be modeled by a directed acyclic graph. The most basic and commonly used class of feedforward neural networks is the multilayer perceptron.

Computation flows left to right; circles are nodes, a.k.a. neurons or units, loosely based on actual neurons in our brains.

Note that neurons are organized into layers: the first is called the input layer, the middle ones hidden layers, and the last the output layer. Obviously, there can be tremendously many layers in the hidden part, as the hidden layers are hyperparameters to be chosen by the network designer. The dimensionality of the input and output layers is determined by the function we want the network to compute.

Every node in a layer is connected to every node in the previous layer: the layers are fully connected!

Each edge in the graph has an associated weight, which is the strength of the connection from the node in one layer to the node in the next layer. Each node computes a weighted sum of its inputs, with these connection strengths being the weights, and then applies a nonlinear function which is variously referred to as the activation function or the nonlinearity.

Let wᵢ denote the weights and σᵢ the activation function of node i; it computes the function x ↦ σᵢ(wᵢᵀx).
Since the output of each layer is passed as input to the next layer, the function represented by the entire network can be written as

x ↦ σ_L(W_L(σ_{L−1}(...σ₁(W₁x)...)))


That is why we claim neural networks are compositional.

Note that in most layers the activation function is the same for each node within that layer, so it is "the" scalar function σ_ℓ : R → R, the nonlinearity for layer ℓ, applied elementwise:
σ_ℓ(x) = (σ_ℓ(x₁), ..., σ_ℓ(x_{n_ℓ}))ᵀ
The principal exception here is the softmax function σ : R^k → R^k defined by

σ(x)ᵢ = e^{xᵢ} / Σ_{j=1}^{k} e^{xⱼ},

which is often used to produce a discrete probability distribution over k classes.
Note that every entry of the softmax output depends on every entry of the input, and a more positive xᵢ leads to a larger σ(x)ᵢ. The softmax is the most commonly used output nonlinearity for classification.
Information 8. Expressive Power
It is the repeated composition of nonlinearities that gives deep neural networks their remarkable expressive power. If we remove the activation functions then
x ↦ W_L W_{L−1}...W₂W₁x = Wx,
which is linear in its input! Moreover, the size of the smallest layer restricts the rank of W, as
rank(W) ≤ min_{ℓ∈{1,...,L}} rank(W_ℓ) ≤ min_{ℓ∈{0,...,L}} n_ℓ.

We would like to produce universal function approximators. This means that given any continuous function, we can choose a network in this class such that the output of the circuit can be made arbitrarily close to the output of the given function for all given inputs.

Piecewise-constant functions are universal function approximators; here the activation is the step function:
σ(x) = 1 if x ≥ 0,
σ(x) = 0 if x < 0.

It turns out that only one hidden layer h(x) is needed for universal approximation:
h(x) = Σ_{j=1}^{k} c_j σ(a_j + b_j x),
where k is the number of hidden units.

Theorem 12. Feedforward Neural Network

In the first (input) layer, we have inputs xᵢ ∈ D(x) from the data set D(x). Imagine we have n inputs, so D(x) := {x₁, ..., xₙ}, and we put them in the first layer of the neural network.

Now we design the nodes or neurons in the hidden layer; in other words we simulate the nonlinearity (activation function) and denote it as σ(x) (sometimes we use h(x) instead of σ(x) to point out that it is in the hidden layer).

We connect the 1st layer's nodes and the 2nd layer's nodes with edges. The connections carry the weights (strengths) from every node in the previous layer. The combination is a pre-activation function ε, which represents the linear part of the neuron — take the sum of the inputs multiplied by the weights and add the bias:

ε²_k = Σ_{i=1}^{K} W²_{ki} x¹_i + b²_k

Note that the upper index represents the number of the layer, the lower index the number of the element.
Also, all layers are fully connected.

Then we can calculate the activation function or nonlinearity σ(x). We denote the activated output as α(x) to avoid ambiguity and be more precise. That is:

α²_k = σ(ε²_k) = σ( Σ_{i=1}^{K} (W²_{ki} x¹_i + b²_k) )

Similarly, we will do the same process at layer 3:

σ(ε³_h) = σ( Σ_{k=1}^{H} (W³_{hk} α²_k + b³_h) )

By induction, we can observe that a general layer's activation function is

σ(ε^l_j) = σ( Σ_{m=1}^{M} (W^l_{jm} α^{l−1}_m + b^l_j) )

I know I did not specify the activation function σ(x), as there are many choices of activation function, such as the ReLU function ReLU(x) = max(0, x) or the sigmoid function sigmoid(x) = 1/(1 + e^{−x})...

Below is the graph and process of the "Feedforward Neural Network".



Example 10. Forward Propagation

Figure 5: forward propagation.
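A direct NumPy transcription of this forward pass (the layer sizes and random weights are placeholders of mine):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, layers):
    # layers = [(W2, b2), (W3, b3), ...]; alpha^l = sigma(W^l alpha^{l-1} + b^l)
    a = x
    for W, b in layers:
        eps = W @ a + b        # pre-activation epsilon^l
        a = relu(eps)          # activation alpha^l
    return a

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),   # layer 2: 3 -> 4
          (rng.normal(size=(2, 4)), np.zeros(2))]   # layer 3: 4 -> 2
print(forward(np.array([1.0, -0.5, 2.0]), layers))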

Information 9. Picture of Neural Network

Figure 6: a deep neural network.

Theorem 13. Differentiable Network


A neural network is a universal function approximator which is fully connected and differentiable, so a neural network is a differentiable circuit or differentiable network. A Differentiable Network is a composition of a sequence of differentiable arithmetic operations and elementary differentiable functions.
Theorem 14. Auto-differentiation in networks
Suppose a differentiable circuit of size N computes a real-valued function f : R^ℓ → R. Then the gradient ∇f can be computed in time O(N) by a circuit of size O(N).
Theorem 15. Chain Rule
Let f be a differentiable function of the variables w₁, ..., w_p through differentiable intermediate variables g₁, ..., g_k. Then
∂f/∂wᵢ = Σ_{j=1}^{k} (∂f/∂gⱼ)(∂gⱼ/∂wᵢ).

Information 10. Simplifying Notation for Back Propagation

Let x be the input, and let h_w(x) = o stand for the output of Forward Propagation. Define the loss function L = ½(y − o)², where y is the actual value of the output. a = σ(z) stands for the activation (sometimes it can be replaced by the output o), z is the pre-activation, w is the weight, and b stands for the bias.
Theorem 16. Gradient Descent (recalled)
Gradient Descent is an algorithm that iteratively takes small steps in the direction of steepest descent of the objective. Let the scaling αₜ depend on f so we can determine an adaptive stepsize. What should we do with the gradient, and how do we do GRADIENT DESCENT?

1. Given any point wₜ in the domain of the function,
2. −∇f(wₜ) is the direction of steepest descent.
3. While f(wₜ) has not converged (∇f(wₜ) ≠ 0) do
{ w_{t+1} = wₜ − αₜ∇f(wₜ) }
Theorem 18. Algorithm of Back Propagation
1. Forward pass: compute and store the values a^[k] and z^[k] for k = 1, ..., r − 1, and the output o = W^[r]a^[r−1] + b^[r].

2. Compute δ^[r] = ∂L/∂z^[r] = (o − y).

3. For k = r − 1 down to 1 do
{
4. Compute δ^[k] = ∂L/∂z^[k] = (W^[k+1]ᵀ δ^[k+1]) ⊙ ReLU′(z^[k]).

5. Compute ∂L/∂W^[k+1] = δ^[k+1]a^[k]ᵀ and ∂L/∂b^[k+1] = δ^[k+1].
}

6. Weight and bias update:

W^[k] = W^[k] − α ∂L/∂W^[k]
b^[k] = b^[k] − α ∂L/∂b^[k]

where α is the learning rate you desire.
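A compact NumPy sketch of this algorithm for a ReLU network with a linear output layer (shapes follow the theorem; the data is random):

import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 4, 1]                        # input, hidden, output
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
x, y = rng.normal(size=3), np.array([1.0])

# 1. forward pass, storing a[k] and z[k]
a, zs, acts = x, [], [x]
for k, (W, b) in enumerate(zip(Ws, bs)):
    z = W @ a + b
    zs.append(z)
    a = z if k == len(Ws) - 1 else np.maximum(0.0, z)   # linear output layer
    acts.append(a)

# 2. delta at the output, L = (1/2)(y - o)^2
delta = a - y
# 3.-6. backward pass with updates
alpha = 0.1
for k in reversed(range(len(Ws))):
    gW, gb = np.outer(delta, acts[k]), delta            # 5. gradients
    if k > 0:
        delta = (Ws[k].T @ delta) * (zs[k - 1] > 0)     # 4. ReLU'(z) mask
    Ws[k] -= alpha * gW                                 # 6. update
    bs[k] -= alpha * gb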

Theorem 19. One-neuron neural network case

We want to compute ∂L/∂w in the neural network of the one-neuron case. We compute the loss function via

z = wᵀx + b
o = h_w(x) = ReLU(z)
L = ½(y − o)²
By the chain rule we have ∂L/∂wᵢ = (∂L/∂o)(∂o/∂wᵢ) = (∂L/∂o)(∂o/∂z)(∂z/∂wᵢ) = (o − y) ReLU′(z) xᵢ.

The key observation is that we reduce ∂L/∂w into ∂L/∂o, ∂o/∂z, ∂z/∂w.

Theorem 20. Two-layer neural network case

Similarly, we have, for all j ∈ {1, ..., m}:

z_j = w_j^[1]ᵀx + b_j^[1], where w_j^[1] ∈ R^d
a_j = ReLU(z_j)
a = (a₁, ..., a_m)ᵀ
o = w^[2]ᵀa + b^[2]
L = ½(y − o)²

By the chain rule, for the first-layer weights,
∂L/∂(w_j^[1])_ℓ = (∂L/∂z_j)x_ℓ = (∂L/∂a_j)ReLU′(z_j)x_ℓ = (∂L/∂o)(∂o/∂a_j)ReLU′(z_j)x_ℓ = (o − y)(w^[2])_j ReLU′(z_j)x_ℓ,
and for the second-layer weights,
∂L/∂(w^[2])_j = (o − y)a_j.

Theorem 21. Back Propagation

Let W^[k] be the weight matrices; everything turns into matrix form:

a^[1] = ReLU(W^[1]x + b^[1]), ...
a^[r−1] = ReLU(W^[r−1]a^[r−2] + b^[r−1])
o = a^[r] = W^[r]a^[r−1] + b^[r]
L = ½(a^[r] − y)²

By the matrix chain rule:

δ^[k] = ∂L/∂z^[k] = (W^[k+1]ᵀ δ^[k+1]) ⊙ ReLU′(z^[k])
∂L/∂W^[k+1] = δ^[k+1]a^[k]ᵀ
∂L/∂b^[k+1] = δ^[k+1]

Theorem 22. Construct a 2-Layer Network

The structure of the neural network consists of: Input Layer: x (2 inputs); Hidden Layer: 2 neurons; Output Layer: 1 neuron.
Weights and biases are defined as follows:
W^(1) = [[w11, w12], [w21, w22]], b^(1) = (b1, b2)ᵀ
W^(2) = [w1, w2], b^(2)
The activation function used is the sigmoid function:
σ(z) = 1/(1 + e^{−z})
Forward propagation steps are as follows. Input: x = (x₁, x₂)ᵀ. Hidden layer:
z^(1) = W^(1)x + b^(1)
a^(1) = σ(z^(1))
Output layer:
z^(2) = W^(2)a^(1) + b^(2)
ŷ = σ(z^(2))
Example values for forward propagation are:
W^(1) = [[0.1, 0.2], [0.3, 0.4]], b^(1) = (0.1, 0.1)ᵀ
W^(2) = [0.5, 0.6], b^(2) = 0.2
x = (0.6, 0.9)ᵀ
Calculating forward propagation:
1. Hidden layer:
z^(1) = (0.1·0.6 + 0.2·0.9 + 0.1, 0.3·0.6 + 0.4·0.9 + 0.1)ᵀ = (0.34, 0.64)ᵀ
a^(1) = (σ(0.34), σ(0.64))ᵀ ≈ (0.584, 0.655)ᵀ
2. Output layer:
z^(2) = 0.5·0.584 + 0.6·0.655 + 0.2 ≈ 0.885
ŷ = σ(0.885) ≈ 0.708
The loss function is defined as:
L = ½(ŷ − y)²
Assuming the true label y = 1:
L = ½(0.708 − 1)² = ½(−0.292)² ≈ 0.0427
Calculating the gradients for backpropagation. For the output layer, note that ŷ = σ(z^(2)), so the sigmoid derivative σ′(z) = σ(z)(1 − σ(z)) appears here:
∂L/∂z^(2) = (ŷ − y)·σ′(z^(2)) = (0.708 − 1)·0.708·(1 − 0.708) ≈ −0.0604
∂L/∂W^(2) = (∂L/∂z^(2))·a^(1)ᵀ ≈ −0.0604·(0.584, 0.655) ≈ (−0.0353, −0.0396)
∂L/∂b^(2) = ∂L/∂z^(2) ≈ −0.0604
For the hidden layer:
∂L/∂a^(1) = W^(2)ᵀ·(∂L/∂z^(2)) ≈ (0.5, 0.6)ᵀ·(−0.0604) ≈ (−0.0302, −0.0363)ᵀ
σ′(z^(1)) = a^(1) ⊙ (1 − a^(1)) ≈ (0.584·0.416, 0.655·0.345)ᵀ ≈ (0.243, 0.226)ᵀ
∂L/∂z^(1) = ∂L/∂a^(1) ⊙ σ′(z^(1)) ≈ (−0.00734, −0.00820)ᵀ
Calculating gradients for weights and biases in the hidden layer:
∂L/∂W^(1) = (∂L/∂z^(1))·xᵀ ≈ [[−0.00441, −0.00661], [−0.00492, −0.00738]]
∂L/∂b^(1) = ∂L/∂z^(1) ≈ (−0.00734, −0.00820)ᵀ
Now we perform weight updates using a learning rate η = 0.1. For the output layer:
W^(2) ← W^(2) − η·∂L/∂W^(2) ≈ [0.5, 0.6] − 0.1·[−0.0353, −0.0396] ≈ [0.5035, 0.6040]
b^(2) ← b^(2) − η·∂L/∂b^(2) = 0.2 − 0.1·(−0.0604) ≈ 0.2060
For the hidden layer:
W^(1) ← W^(1) − η·∂L/∂W^(1) ≈ [[0.1004, 0.2007], [0.3005, 0.4007]]
b^(1) ← b^(1) − η·∂L/∂b^(1) ≈ (0.1007, 0.1008)ᵀ
The updated weights and biases are: Output layer: W^(2) ≈ [0.5035, 0.6040], b^(2) ≈ 0.2060; Hidden layer: W^(1) ≈ [[0.1004, 0.2007], [0.3005, 0.4007]], b^(1) ≈ (0.1007, 0.1008)ᵀ.
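A few lines of NumPy to reproduce the numbers above:

import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))
W1, b1 = np.array([[0.1, 0.2], [0.3, 0.4]]), np.array([0.1, 0.1])
W2, b2 = np.array([0.5, 0.6]), 0.2
x, y = np.array([0.6, 0.9]), 1.0

z1 = W1 @ x + b1; a1 = sig(z1)          # (0.34, 0.64) -> (0.584, 0.655)
z2 = W2 @ a1 + b2; yhat = sig(z2)       # 0.885 -> 0.708

d2 = (yhat - y) * yhat * (1 - yhat)     # delta at the output
gW2, gb2 = d2 * a1, d2
d1 = (W2 * d2) * a1 * (1 - a1)          # delta at the hidden layer
gW1, gb1 = np.outer(d1, x), d1
print(z1, yhat, gW2, gW1)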

Definition 24. Gaussian Distribution (recalled)

Normal Distribution or Gaussian Distribution is a continuous probability distribution for a real-valued random variable, with probability density
f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)),
where µ is the mean (expectation) and σ² the variance of the distribution.

Definition 25. Central Limit Theorem (recalled)

The sampling distribution of the mean will be approximately normally distributed, as long as the sample size is large enough.
Let X₁, X₂, . . . , X_N be i.i.d. random variables with mean µ and variance σ². If N → ∞, then
(1/N) Σ_{n=1}^{N} X_n ∼ N(µ, σ²/N) approximately.

Information 11. Point of View from Functional Analysis on Neural Networks

Consider a single Gaussian distribution N(µ, σ²) and a multivariate Gaussian distribution N(µ, Σ).
We need a new point of view on functions: in Functional Analysis, a function is a vector with an infinite number of dimensions. Therefore, a neural network is a function taking input x with parameters w that somehow constructs an output according to a Gaussian distribution N(f(x), Σ_{xx′}).
Information 12. DEEP NEURAL NETWORKS AS GAUSSIAN PROCESSES
Our final covers this crazy paper. Neural networks sufficiently express huge classes of functions and generalize well from training in a very wide range of tasks. However, among the many functions a neural network could express that all fit the training data, it is a paradox that it needs to choose the one that generalizes to the real thing.

Why should we trust the output of a neural network?

We can answer it instead by interpreting the neural network as a Gaussian process.

If the layers of the neural network are in the limit of infinite width, then the Central Limit Theorem implies that the function computed by the neural network (NN) is a function drawn from a Gaussian process (GP).
Theorem 23. Universal Approximation Theorem
A two-layer network with linear output can uniformly approximate any continuous function given sufficiently many hidden units:
y(x) = lim_{m→∞} (1/m) Σ_{i=1}^{m} σ(⟨wᵢ, x⟩),
where m is the number of hidden units.

Remark 8. CLT and UAT

The Central Limit Theorem and the Universal Approximation Theorem have a similar shape: both are normalized sums of i.i.d.-like terms. We want to show that as the width m → ∞, the neural network's output actually follows a normal distribution.

Definition 26. Tangent Kernel

The tangent kernel is derived from the idea of representing the output of a neural network as a Gaussian process. It captures the relationships between inputs based on the derivatives of the network's output; H_ij below is the Tangent Kernel.

Let M be a manifold embedded in a Euclidean space Rn . For a point x ∈ M, the tangent space T x M at x consists of all
possible directions in which one can tangentially pass through x on the manifold M.

The tangent kernel HT (x, y) is defined as a kernel function that measures the similarity between points x and y based on the
geometry of the manifold. Formally, it can be expressed as:

HT (x, y) = ⟨ϕ(x), ϕ(y)⟩


where ϕ : M → T x M is a mapping that projects points from the manifold to the tangent space at point x, and ⟨·, ·⟩ denotes
the inner product in the tangent space.

The neural tangent kernel is a mathematical construct that describes how the outputs of a neural network change with respect to small changes in its parameters (weights). It is derived from the Taylor expansion of the network output around the initialization of the weights.

When the weights of a neural network are initialized randomly, the network can be approximated as a linear function of its
parameters. The NTK captures this linear behavior in the vicinity of the initialization.

During training, especially in the regime where the network is wide (i.e., has a large number of neurons), the dynamics of
the training process can be understood in terms of the NTK. This means that the optimization of the neural network can be
analyzed as a linear regression problem in the space defined by the NTK.

The NTK framework provides insights into the effectiveness of gradient descent in training deep neural networks, helping
to explain why certain architectures perform well and how their training dynamics behave.

Researchers use NTKs to study generalization, convergence rates, and the behavior of various neural network architectures
during training.

H(t)_ij = ⟨ ∂f(w(t), xᵢ)/∂w , ∂f(w(t), xⱼ)/∂w ⟩
Information 13. Papers 1, 2
We will talk about 2 papers:

"DEEP NEURAL NETWORKS AS GAUSSIAN PROCESSES" by Jaehoon Lee:
it points out that a deep, infinitely wide network is a Gaussian process.

"ON EXACT COMPUTATION WITH AN INFINITELY WIDE NEURAL NET" by Sanjeev Arora:
it points out that the tangent kernel of an infinitely wide network converges to a fixed kernel of inner products of gradients with respect to the weights.

Maybe my notes are wrong; I hope a genius can tell me what I need to know.

Theorem 24. Gaussian Processes and the Universal Approximation Theorem

Consider the Universal Approximation Theorem:
y(x) = lim_{m→∞} (1/m) Σ_{i=1}^{m} σ(⟨wᵢ, x⟩).
Recall the Central Limit Theorem formula:
(1/N) Σ_{n=1}^{N} Xₙ ∼ N(µ, σ²/N) approximately as N → ∞.
Suppose x = 1; then Sᵢ = σ(wᵢ) ∼ D, which is some distribution. Taking weights and biases as i.i.d., by the Central Limit Theorem:
y(1) = lim_{m→∞} (1/m) Σ_{i=1}^{m} σ(⟨wᵢ, 1⟩) = lim_{m→∞} (1/m) Σ_{i=1}^{m} Sᵢ ∼ N(µ₁, σ₁²).
Suppose x = 2; then Xᵢ = σ(2wᵢ) ∼ D′, which is some distribution. Taking weights and biases as i.i.d., by the multidimensional Central Limit Theorem:
y(2) = lim_{m→∞} (1/m) Σ_{i=1}^{m} σ(⟨wᵢ, 2⟩) = lim_{m→∞} (1/m) Σ_{i=1}^{m} Xᵢ ∼ N(µ₂, σ₂²).
Inductively, if x = n then y(x) = y(n) is Gaussian for any fixed value x, and any finite collection of outputs is jointly Gaussian.

As the width m → ∞, the neural network degenerates to a Gaussian distribution at each input.

Deep, infinitely wide neural networks correspond directly to Gaussian Processes.
Lemma 1. Minimizing the cost by gradient descent with infinitesimal learning rate follows the NTK relation.
To prove this idea, we minimize L(w) by gradient flow:
∂w(t)/∂t = −∇L(w(t)).

Let a(t) = (f(w(t), x₁), f(w(t), x₂), ..., f(w(t), xₙ)) be the network outputs (you can view it as a vector of arbitrary continuous functions) and y = (y₁, ..., yₙ) the actual outputs for x₁, ..., xₙ.

We want to show that a(t) follows the evolution

∂a(t)/∂t = −H(t)(a(t) − y),

where H_ij is the Tangent Kernel,
H(t)_ij = ⟨ ∂f(w(t), xᵢ)/∂w , ∂f(w(t), xⱼ)/∂w ⟩.
We want to study the optimization dynamics over time, so consider an infinitely small time step of
w^[n+1] = w^[n] − λ ∂L(w^[n])/∂w.
With L(w) = ½ Σᵢ (f(w, xᵢ) − yᵢ)², Backward Propagation and the total derivative give:
∂w(t)/∂t = −∇L(w(t)) = −Σ_{j=1}^{n} (f(w(t), xⱼ) − yⱼ) ∂f(w(t), xⱼ)/∂w(t),
and
∂f(w(t), xᵢ)/∂t = ⟨ ∂f(w(t), xᵢ)/∂w(t), ∂w(t)/∂t ⟩ = −Σ_{j=1}^{n} (f(w(t), xⱼ) − yⱼ) ⟨ ∂f(w(t), xᵢ)/∂w(t), ∂f(w(t), xⱼ)/∂w(t) ⟩.
Note that ∂xᵢ/∂t = 0. Finally we have:
∂f(w(t), xᵢ)/∂t = (∂a(t)/∂t)ᵢ = −Σⱼ H(t)_ij (a(t) − y)ⱼ, i.e. ∂a(t)/∂t = −H(t)(a(t) − y).

Theorem 25. Equivalence between a wide neural net and kernel regression with the NTK
We want to show that if the widths d_h → ∞, i.e. the network becomes infinitely wide, then

H(t)_ij = ⟨ ∂f(w(t), xᵢ)/∂w , ∂f(w(t), xⱼ)/∂w ⟩ → k(xᵢ, xⱼ),

which means the tangent kernel converges to a fixed kernel of inner products of the gradients.

Consider an L-hidden-layer network:

f^[h](x) = W^[h] g^[h−1](x) ∈ R^{d_h}
g^[h](x) = σ(f^[h](x)) ∈ R^{d_h}
f(w, x) = f^[L+1](x) = W^[L+1] σ(...σ(W^[1]x))

By Back Propagation (matrix version), we have:

∂f(w, x)/∂W^[h] = b^[h](x) (g^[h−1](x))ᵀ.

If h = L + 1 then b^[h](x) = 1; otherwise b^[h](x) = D^[h](x)(W^[h+1])ᵀ b^[h+1](x), where D^[h](x) = diag(σ′(f^[h](x))). Then:

H(t)_ij = Σ_h ⟨ b^[h](xᵢ)(g^[h−1](xᵢ))ᵀ , b^[h](xⱼ)(g^[h−1](xⱼ))ᵀ ⟩ = Σ_h ⟨ g^[h−1](xᵢ), g^[h−1](xⱼ) ⟩ ⟨ b^[h](xᵢ), b^[h](xⱼ) ⟩,

grouping the gradient of the loss with respect to the pre-activations b and the output g of each layer. By the Central Limit Theorem, each factor concentrates in the infinite-width limit.

Proceeding inductively: fix ε > 0 and δ ∈ (0, 1), let the activation be Φ(z) = max(0, z) (ReLU), and suppose the widths satisfy min_h d_h ≥ Ω((L⁶/ε⁴) log(L/δ)) (this is the form of the bound in the Arora et al. paper). Then for any inputs x, x′ ∈ R^{d₀} such that

||x|| ≤ 1, ||x′|| ≤ 1,

with probability at least 1 − δ we have:

| ⟨ ∂f(w, x)/∂w , ∂f(w, x′)/∂w ⟩ − k(x, x′) | ≤ ε.

Finally,
H(t)_ij = ⟨ ∂f(w(t), xᵢ)/∂w , ∂f(w(t), xⱼ)/∂w ⟩ → k(xᵢ, xⱼ);
hence H(t) degenerates to the fixed kernel k(x, x′).



Information 14. Why Neural Network Compression

We know that with increasing depth and width of a neural network, GPT becomes much more powerful, so do we just need to increase the size of the neural network? ChatGPT contains on the order of 140B parameters and 1 trillion neural connections, while the human brain has about 100 trillion neural connections. However, Moore's Law is kind of not working now — we cannot keep scaling chips due to physical limitations — so we might not get arbitrarily large neural networks for GPT. Hence we need to COMPRESS NEURAL NETWORKS.
Information 15. Idea of Neural Network Compression
We observe that in neural network training, a lot of weights are close to zero, w → 0. This implies that some weights are not that important to the final result, so we can remove their connections to compress the network.
How do we measure the importance of a connection?
We have to design some "importance measure": connections with small values might still be really important (small weight with large derivative!).
Inspect the derivatives, and then remove connections or remove entire layers.
Theorem 26. Low Rank Decomposition

At each layer we have a weight matrix W ∈ R^{n×n}, which requires O(n²) parameters. We apply a decomposition technique with low rank via eigenvalue decomposition (as in P.C.A.), selecting the most important eigenvalues:

W ≈ Σ_{i=1}^{k} λᵢ vᵢvᵢᵀ

for some k < n (and it is possible to generalize to W ∈ R^{d₁×d₂×...×dₙ}). Therefore the O(n²) parameters become O(kn) parameters, which is the decomposition.
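A quick sketch of compressing a symmetric weight matrix by keeping its top-k eigenpairs (sizes and data are placeholders of mine):

import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(64, 64))
W = (A + A.T) / 2                             # symmetric weight matrix

k = 8
eigvals, eigvecs = np.linalg.eigh(W)
top = np.argsort(np.abs(eigvals))[::-1][:k]   # k most important eigenvalues
W_low = (eigvecs[:, top] * eigvals[top]) @ eigvecs[:, top].T

# 2*k*n numbers stored instead of n*n
print(np.linalg.norm(W - W_low) / np.linalg.norm(W))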
Theorem 27. Compression Techniques

1. Pruning:

Pruning is a data compression technique in machine learning and search algorithms that reduces the size of decision trees (and networks) by removing sections that are non-critical and redundant for classifying instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting.
Zeroing out very small weights (not recommended by itself);
L1 regularization.

2. Distillation:
Knowledge distillation or model distillation is the process of transferring knowledge from a large model to a smaller one. While large models (such as very deep neural networks or ensembles of many models) have more knowledge capacity than small models, this capacity might not be fully utilized. It can be just as computationally expensive to evaluate a model even if it utilizes little of its knowledge capacity. Knowledge distillation transfers knowledge from a large model to a smaller one without loss of validity. As smaller models are less expensive to evaluate, they can be deployed on less powerful hardware (such as a mobile device).
Given an output one-hot label vector x = (0, 1, ..., 0)ᵀ ∈ Rⁿ from the Teacher, it is sent to the Student as a target, e.g. v = (0, 0, 1, ..., 0)ᵀ ∈ R¹⁰; this is the training process, and the Student responds to the Teacher with a smaller representation. Inductively, the best student sends the smallest-size vector.

3. Decomposition:
PCA (Principal Component Analysis) to reduce the number of features to fewer components.

4. Quantization:
Reduce the size of the number type: not Float32 but Int8 — storing integers is much cheaper than storing floating-point numbers. A uniform quantizer looks like
W_q = W_min + Δ · floor((W − W_min)/Δ).

Definition 27. Structured Low Rank Approximation

Given a structure specification S : R^{n_p} → R^{m×n}, a vector of structure parameters p ∈ R^{n_p}, a norm ||·||, and a desired rank r:
minimize over p̂ the quantity ||p − p̂|| subject to rank(S(p̂)) ≤ r.

Theorem 28. Taylor Series

The Taylor series of a real- or complex-valued function f (x) that is infinitely differentiable at a real or complex number a is the power series

f(a) + (f′(a)/1!)(x − a) + (f′′(a)/2!)(x − a)² + (f′′′(a)/3!)(x − a)³ + · · · = Σ_{n=0}^{∞} (f^(n)(a)/n!)(x − a)^n.
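As a quick numerical check (taking f = exp and a = 0 as illustrative choices), the partial sums converge rapidly:

import math

def taylor_exp(x, N):
    # partial sum of the Taylor series of e^x at a = 0
    return sum(x**n / math.factorial(n) for n in range(N + 1))

for N in (2, 5, 10):
    print(N, abs(taylor_exp(1.0, N) - math.e))   # error shrinks like 1/(N+1)!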

Theorem 29. ReLU networks express any n-times differentiable function

For any n-times continuously differentiable function f ∈ C^{n,d} and any ε > 0, there exists a ReLU network R such that sup_{x∈[0,1]^d} |R(x) − f(x)| < ε.

Information 16. ReLU and Taylor

We want to use a ReLU network R to approximate any f ∈ C^{n,d}.

First, we use a Taylor expansion T → f.
Second, we construct R → T.
Finally, we apply the triangle inequality |R − f| ≤ |R − T| + |T − f|.
Then any f is approximated by R. In the following part we prove the theorem.

Theorem 30. Taylor series with neural networks

A Taylor series with larger n is more accurate to f, but the approximation is only accurate in a local region: a Taylor expansion is a very good local approximation, not a global one.
How do we turn Taylor series into a global approximator? We use divide-and-conquer: partition the domain into small regions and Taylor-expand f on each region, which keeps the error as low as required.
However, the regional Taylor polynomials T_1 , . . . , T_n overlap with the polynomials of neighbouring regions. Can we confine each approximation to its own region? We can confine P_1 to the first interval by multiplying by a bump function ϕ_1; therefore R = Σ_{i=1}^n P_i ϕ_i, where ideally ϕ = 1 if x ∈ [a, b] and ϕ = 0 if x ∉ [a, b]. Such a ϕ is a rectangular (indicator) function, which is not suited to a ReLU network, since ϕ is not continuous; we will instead use a continuous trapezoidal version.
In this way we can control R = Σ_{i=1}^n P_i ϕ_i. (A numerical sketch follows.)
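A 1-D NumPy sketch of this divide-and-conquer construction (the target f = sin, the grid size, and first-order local polynomials are illustrative assumptions): local Taylor polynomials P_i are glued with hat functions ϕ_i that form a partition of unity on [0, 1].

import numpy as np

f, df = np.sin, np.cos            # illustrative target and its derivative
N = 10                            # number of subintervals of [0, 1]
centers = np.linspace(0.0, 1.0, N + 1)

def phi(x, c):
    # piecewise-linear hat at c; over the whole grid these sum to 1 on [0, 1]
    return np.maximum(0.0, 1.0 - N * np.abs(x - c))

x = np.linspace(0.0, 1.0, 1001)
R = np.zeros_like(x)
for c in centers:
    P = f(c) + df(c) * (x - c)    # local first-order Taylor polynomial P_i
    R += P * phi(x, c)            # R = sum_i P_i * phi_i

print("sup |R - f| =", np.abs(R - f(x)).max())   # shrinks as N grows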

Theorem 31. ReLU Universal Approximation Theorem

We define ϕ rigorously:

ϕ_m(x) = Π_{k=1}^d ψ(3N(x_k − m_k/N)),

where

ψ(x) = 1 if |x| < 1,  ψ(x) = 2 − |x| if 1 ≤ |x| ≤ 2,  ψ(x) = 0 if |x| > 2,

so that Σ_m ϕ_m(x) = 1 (a partition of unity).

Note that

sup |f − T| = sup |Σ_m f ϕ_m − Σ_m P_m ϕ_m| = sup |Σ_m (f − P_m) ϕ_m| ≤ 2^d max_m sup_x |f − P_m| ≤ 2^d (d^n/n!) N^{−n} ≤ ε.

Hence T → f; step 1 is done.

Next we show that a ReLU network R can approximate T, which is built from the Taylor polynomials P(x).


Recall that

ϕ_m(x) = Π_{k=1}^d ψ(3N(x_k − m_k/N)),

ψ(x) = 1 if |x| < 1,  ψ(x) = 2 − |x| if 1 ≤ |x| ≤ 2,  ψ(x) = 0 if |x| > 2,

Σ_m ϕ_m(x) = 1,

and the local Taylor polynomials are

P_m(x) = Σ_{n: |n| < n} (D^n f / n!)|_{x = m/N} (x − m/N)^n.

We need a ReLU network to approximate ϕ_m and a ReLU network to approximate P_m, then combine them. Expanding the multi-index power,

P_m(x) = Σ_{n: |n| < n} (D^n f / n!)|_{x = m/N} (x_1 − m_1/N)^{n_1} (x_2 − m_2/N)^{n_2} · · · (x_d − m_d/N)^{n_d}.

We use 4 ReLU neurons to realize the trapezoid ψ (and hence, via products, ϕ_m):

ψ(x) = σ(x + 2) − σ(x + 1) − σ(x − 1) + σ(x − 2).

For the linear factors, consider

x_i − m_i/N = σ(x_i − m_i/N) − σ(−(x_i − m_i/N)).

Products then reduce to squares, since

Prod(x, y) = xy = ((x + y)/2)² − ((x − y)/2)².

AS PROVED IN ANOTHER PAPER, x² is approximated by

h_m(x) = x − Σ_{s=1}^m g_s(x)/2^{2s},

where g is the sawtooth (hat) function and g_s = g ∘ · · · ∘ g is its s-fold composition (g_2 = g ∘ g, and so on); g itself is expressed with 3 ReLU neurons.
Therefore we can use a ReLU network to approximate ϕ_m and P_m, hence R → T. (A numerical sketch of the x² construction follows.)
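A NumPy sketch of the x² construction cited above (the depth m is an illustrative choice): the hat function g is built from 3 ReLU neurons, g_s is its s-fold composition, and h_m approximates x² on [0, 1].

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def g(x):
    # hat function on [0, 1] from 3 ReLU neurons: peak 1 at x = 1/2
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def h(x, m):
    # h_m(x) = x - sum_{s=1}^m g_s(x) / 2^(2s), with g_s the s-fold composition
    out, gs = x.copy(), x.copy()
    for s in range(1, m + 1):
        gs = g(gs)                 # g_s = g(g_{s-1})
        out -= gs / 2 ** (2 * s)
    return out

x = np.linspace(0.0, 1.0, 1001)
for m in (1, 3, 6):
    print(m, np.abs(h(x, m) - x**2).max())   # error roughly 2^(-2m-2)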

Finally we apply the triangle inequality

|R − f| ≤ |R − T| + |T − f|.

Hence R → f.

Information 17. Hopfield Network = Energy Function = "Memory"

A Hopfield network is a complete, fully connected network that optimizes the states of the neurons rather than the weights.

It implements an associative memory via an energy function: a mathematical model of "memory".

The Hopfield neural network is a recurrent neural network invented by John Hopfield in 1982. It combines a storage system with binary units. It guarantees convergence to a local minimum, but it may converge to a wrong local minimum rather than the global minimum. Hopfield networks also provide a model for simulating human memory.
Theorem 32. Hopfield Network

Let x ∈ R^{n×1} and W ∈ R^{n×n}. The energy function is

E(x) = −x^T W x,

which the network minimizes (updating x to decrease −x^T W x acts like gradient descent). Let W = x* x*^T be the outer product of a stored pattern x* with ∥x*∥ = 1. Then for any x with ∥x∥ = 1,

argmin_x −x^T W x = argmin_x −x^T x* x*^T x = argmin_x −(x*^T x)² = ±x*,

and at the minimum E = −1. The dynamics thus associate an initial point x′ with the stored x* (the memory). It is like storing an image inside the energy function; this image-recovery mechanism is called associative memory.

Now let us consider multiple images or memories:

W = Σ_{i=1}^m x_i x_i^T,

where x_i represents the i-th image or signal, and we assume all images x_i, x_j are far from each other. The dynamics are:

h_i(t) = Σ_j w_ij x_j(t),

x_i(t + 1) = sgn(h_i(t)) = 1 if h_i(t) > 0, and −1 if h_i(t) < 0,

E = −Σ_i Σ_j w_ij x_i x_j,

where h_i(t) is the preactivation of the i-th neuron, x_i is its output, and sgn(·) is the activation function.
Note that E does not increase as long as w_ij = w_ji, i.e. W is symmetric. Moreover, x_i(t + 1) = sgn(h_i(t)) is a generalized gradient descent.

Suppose at time t the unit x_k is updated (flipped). The energy change is

E(t + 1) − E(t) = −Σ_i w_ik x_i(t)[x_k(t + 1) − x_k(t)] − Σ_j w_kj x_j(t)[x_k(t + 1) − x_k(t)]
= −2 Σ_i w_ik x_i(t) x_k(t + 1) − 2 Σ_j w_kj x_j(t) x_k(t + 1) = −4 x_k(t + 1) Σ_i w_ik x_i(t) = −4 sgn(h_k(t)) h_k(t) < 0,

using the symmetry w_ij = w_ji and the fact that a flip means x_k(t + 1) − x_k(t) = 2x_k(t + 1).

Therefore, for symmetric W, the energy E always decreases under the update.
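A small NumPy simulation of these dynamics (pattern count, network size, and noise level are illustrative assumptions): store a few ±1 patterns in W = Σ_i x_i x_i^T, corrupt one pattern, and run asynchronous sign updates, under which E never increases.

import numpy as np

rng = np.random.default_rng(2)
n, m = 100, 5                            # neurons and stored patterns
patterns = rng.choice([-1, 1], size=(m, n))

W = sum(np.outer(p, p) for p in patterns).astype(float)
np.fill_diagonal(W, 0.0)                 # drop self-coupling; W stays symmetric

def energy(x):
    return -x @ W @ x

x = patterns[0].copy()
flip = rng.choice(n, size=10, replace=False)
x[flip] *= -1                            # corrupt 10 bits of the first memory

for _ in range(5):                       # a few asynchronous sweeps
    for i in rng.permutation(n):
        h = W[i] @ x                     # preactivation h_i = sum_j w_ij x_j
        if h != 0:
            x[i] = 1 if h > 0 else -1    # each flip strictly decreases E

print("recovered:", np.array_equal(x, patterns[0]), " E =", energy(x))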



Theorem 33. Capacity of the Hopfield Network

Suppose the patterns are random vectors; each stored pattern should still have its own basin in the energy landscape.
Let W = Σ_m P^m (P^m)^T, where W ∈ R^{N×N} and each P^m ∈ R^{N×1} stands for a pattern or image, so that entrywise

w_ij = Σ_m P_i^m P_j^m.

The Hopfield network is really about the states. Let each neuron take 2 states, S_i ∈ {−1, 1}; the minimizers of the energy should be the patterns (perfect images, memories), i.e. S* = P when S* is stable.
With the normalization W = (1/N) P^T P, i.e. w_ij = (1/N) Σ_m P_i^m P_j^m, the matrix W is the memory of an N-neuron Hopfield network. We ask: how many patterns m can be stored in an N-neuron Hopfield network? The dynamics are

S_i(t + 1) = sgn( Σ_j w_ij S_j(t) ).

In the optimization setting, stability of pattern P^m at bit i requires

P_i^m = sgn( Σ_j w_ij P_j^m )
      = sgn( (1/N) Σ_j Σ_{m′} P_i^{m′} P_j^{m′} P_j^m )
      = sgn( P_i^m + (1/N) Σ_{m′≠m} Σ_j P_i^{m′} P_j^{m′} P_j^m )
      = P_i^m sgn( 1 + a_im ),  where a_im := (1/N) Σ_{m′≠m} Σ_j P_i^m P_i^{m′} P_j^{m′} P_j^m

is the crosstalk term and each entry P_i^m is ±1 with probability 0.5 each. Stability requires 1 + a_im > 0. By the Central Limit Theorem, a_im ∼ N(0, (m − 1)/N), so with σ² = (m − 1)/N we have

P(1 + a_im > 0) = ∫_{−1}^{∞} (1/√(2πσ²)) exp(−x²/(2σ²)) dx.

If m = 0.105N (with N large), then P(1 + a_im > 0) ≈ 0.999.

Since the crosstalk is normally distributed, roughly 100 neurons can reliably store about 10 patterns under this m, N relationship (a numerical check follows). How can we improve the memory capacity while keeping reliable recall? The next theorem gives one answer.
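A quick Monte Carlo check of this 0.105N estimate (the network size and pattern counts are illustrative assumptions): build W from random patterns and measure the fraction of stored bits that are immediately stable under one update.

import numpy as np

rng = np.random.default_rng(3)
N = 500
for m in (10, 52, 150):                  # 52 is roughly 0.105 * N
    P = rng.choice([-1, 1], size=(m, N))
    W = (P.T @ P) / N                    # w_ij = (1/N) sum_m P_i^m P_j^m
    H = P @ W                            # H[m, i] = sum_j w_ij P_j^m (W symmetric)
    stable = np.mean(np.sign(H) == P)    # fraction of bits that do not flip
    print(f"m = {m:3d}: stable bit fraction = {stable:.4f}")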

Theorem 34. **Solution to the Capacity (Memory Complexity) Limit of the Hopfield Network

The standard model of associative memory [?] uses a system of N binary neurons, with values ±1. A configuration of all the neurons is denoted by a vector of states σ_i. The model stores K memories, denoted by ξ_i^µ, which for the moment are also assumed to be binary. The model is defined by an energy function, which is given by

E = −(1/2) Σ_{i,j=1}^{N} σ_i T_ij σ_j,   T_ij = Σ_{µ=1}^{K} ξ_i^µ ξ_j^µ,   (1)

and a dynamical update rule that decreases the energy at every update. The basic problem is the following: when presented
with a new pattern, the network should respond with a stored memory which most closely resembles the input.
There has been a large amount of work in the community of statistical physicists investigating the capacity of this model,
which is the maximal number of memories that the network can store and reliably retrieve. It has been demonstrated [?, ?, ?]
that in case of random memories this maximal value is of the order of Kmax ≈ 0.14N. If one tries to store more patterns, several
neighboring memories in the configuration space will merge together, producing a ground state of the Hamiltonian (1), which
has nothing to do with any of the stored memories. By modifying the Hamiltonian (1) in a way that removes second-order
correlations between the stored memories, it is possible [?] to improve the capacity to Kmax = N.
The mathematical reason why the model (1) gets confused when many memories are stored is that several memories pro-
duce contributions to the energy which are of the same order. In other words, the energy decreases too slowly as the pattern
approaches a memory in the configuration space. In order to take care of this problem, consider a modification of the standard
energy

E = −Σ_{µ=1}^{K} F( ξ_i^µ σ_i )   (2)

In this formula, F(x) is some smooth function (summation over index i is assumed). The computational capabilities of the
model will be illustrated for two cases. First, when F(x) = xn (where n is an integer), which is referred to as a polynomial
energy function. Second, when F(x) is a rectified polynomial energy function

F(x) = x^n if x ≥ 0, and F(x) = 0 if x < 0.   (3)
In the case of the polynomial function with n = 2, the network reduces to the standard model of associative memory [?].
If n > 2, each term in (2) becomes sharper compared to the n = 2 case, thus more memories can be packed into the same
configuration space before cross-talk intervenes.
Having defined the energy function, one can derive an iterative update rule that leads to a decrease of the energy. We use asynchronous updates, flipping one unit at a time. The update rule is

σ_i^{(t+1)} = sgn[ Σ_{µ=1}^{K} ( F( ξ_i^µ + Σ_{j≠i} ξ_j^µ σ_j^{(t)} ) − F( −ξ_i^µ + Σ_{j≠i} ξ_j^µ σ_j^{(t)} ) ) ].   (4)
The argument of the sign function is the difference of two energies: one for the configuration with all but the i-th unit clamped to their current states and the i-th unit in the "off" state, the other for a similar configuration but with the i-th
unit in the “on” state. This rule means that the system updates a unit, given the states of the rest of the network, in such a way
that the energy of the entire configuration decreases. For the case of the polynomial energy function, a very similar family of
models was considered in [?, ?, ?, ?, ?, ?]. The update rule in those models was based on the induced magnetic fields, however,
and not on the difference of energies. The two are slightly different due to the presence of self-coupling terms. Throughout this
paper, we use energy-based update rules.
How many memories can model (4) store and reliably retrieve? Consider the case of random patterns, so that each element
of the memories is equal to ±1 with equal probability. Imagine that the system is initialized in a state equal to one of the
memories (pattern number µ). One can derive a stability criterion, i.e. the upper bound on the number of memories such that
the network stays in that initial state. Define the energy difference between the initial state and the state with spin i flipped:

∆E = Σ_{ν=1}^{K} ( ξ_i^ν ξ_i^µ + Σ_{j≠i} ξ_j^ν ξ_j^µ )^n − Σ_{ν=1}^{K} ( −ξ_i^ν ξ_i^µ + Σ_{j≠i} ξ_j^ν ξ_j^µ )^n,

where the polynomial energy function is used. This quantity has a mean ⟨∆E⟩ = N^n − (N − 2)^n ≈ 2nN^{n−1}, which comes from the term with ν = µ, and a variance (in the limit of large N)

σ² = Ω_n (K − 1) N^{n−1},   Ω_n = 4n²(2n − 3)!!


The i-th bit becomes unstable when the magnitude of the fluctuation exceeds the energy gap ⟨∆E⟩ and the sign of the fluctuation is opposite to the sign of the energy gap. Thus, the probability that the state of a single neuron is unstable (in the limit when both N and K are large, so that the noise is effectively Gaussian) is equal to

P_error = ∫_{⟨∆E⟩}^{∞} dx (1/√(2πσ²)) e^{−x²/(2σ²)} ≈ √( (2n − 3)!! K / (2π N^{n−1}) ) · e^{−N^{n−1}/(2K(2n−3)!!)}.
Requiring that this probability is less than a small value, say 0.5%, one can find the upper limit on the number of patterns
that the network can store:

K_max = α_n N^{n−1},   (5)


where α_n is a numerical constant which depends on the (arbitrary) threshold 0.5%. The case n = 2 corresponds to the standard model of associative memory and gives the well-known result K_max ≈ 0.14N. For the perfect recovery of a memory (P_error < 1/N) one obtains

K_max^{no errors} ≈ (1/(2(2n − 3)!!)) · N^{n−1}/ln(N).   (6)
For higher powers n, the capacity rapidly grows with N in a non-linear way, allowing the network to store and reliably
retrieve many more patterns than the number of neurons that it has, in accordance with [?, ?, ?, ?]. This non-linear scaling
relationship between the capacity and the size of the network is the phenomenon that we exploit.
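A NumPy sketch of the energy-difference update rule (4) with the polynomial energy F(x) = x^n (N, K, n = 3, and the number of corrupted bits are illustrative assumptions; note K > N is feasible here because n > 2):

import numpy as np

rng = np.random.default_rng(4)
N, K, n = 50, 100, 3
xi = rng.choice([-1, 1], size=(K, N))        # stored memories xi[mu]

def F(x):
    return x ** n                            # polynomial energy function

sigma = xi[0].copy()
flip = rng.choice(N, size=5, replace=False)
sigma[flip] *= -1                            # corrupt 5 bits of memory 0

for _ in range(3):                           # asynchronous energy-difference updates
    for i in rng.permutation(N):
        s = xi @ sigma - xi[:, i] * sigma[i]         # sum_{j != i} xi[mu, j] sigma_j
        d = np.sum(F(xi[:, i] + s) - F(-xi[:, i] + s))
        if d != 0:
            sigma[i] = 1 if d > 0 else -1    # rule (4): sign of the energy difference

print("recovered:", np.array_equal(sigma, xi[0]))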

Theorem 35. ***Fourier Transform

The Fourier transform is a tool that converts a function from the time domain to the frequency domain. For a function f (x), its Fourier transform is

f̂(w) = ∫_{−∞}^{∞} f(x) e^{−iwx} dx,

where w is the frequency and f̂(w) is the frequency spectrum of f.
Theorem 36. ***Inverse Fourier Transform

The Fourier transform can be inverted, recovering f from its spectrum:

f(x) = (1/2π) ∫_{−∞}^{∞} f̂(w) e^{iwx} dw,

where w is the frequency and f̂(w) is the frequency spectrum of f.

Theorem 37. Smoothness and the Fourier Transform

Define the following measure on the Fourier transform:

C_f = ∫_R |w| |f̂(w)| dw.

If C_f is small then f is dominated by low frequencies: it is smooth.
If C_f is large then f has strong high-frequency content: it is oscillatory.
Theorem 38. Universal Approximation Theorem

If ∫_R |f̂(w)| dw < ∞ then f(x) is continuous and sufficiently smooth.

The smoothness measure C_f determines how well a neural network can approximate the function:
Fourier analysis provides a precise way to measure the smoothness of a function via its frequency spectrum (a numerical sketch follows).
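A rough numerical illustration of this measure (the test functions, grid, and FFT discretization are assumptions of the sketch): approximating f̂ by an FFT, the oscillatory function gets a much larger C_f than the smooth one.

import numpy as np

L, n = 2 * np.pi, 4096
x = np.linspace(0.0, L, n, endpoint=False)

def smoothness_measure(f_vals):
    # discrete stand-in for C_f = integral of |w| |f_hat(w)| dw
    f_hat = np.fft.fft(f_vals) / n                 # normalized spectrum
    w = 2 * np.pi * np.fft.fftfreq(n, d=L / n)     # angular frequencies
    return np.sum(np.abs(w) * np.abs(f_hat))

print("C_f for sin(x)   :", smoothness_measure(np.sin(x)))       # about 1
print("C_f for sin(20x) :", smoothness_measure(np.sin(20 * x)))  # about 20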
Theorem 39. Universal Approximation Bound for Superpositions of Sigmoids
For f with C_f < ∞, a superposition of k sigmoidal units f_k achieves squared L² approximation error of order O(C_f²/k) on a bounded domain (Barron, 1993).

0.2 Reference
0.2.1 Prof. Fenglei Fan, MATH3320 2024 Lectures 9-12, CUHK
Best professor on MATH3320!!!
