
Non-Linear Models

Jonathan May
September 6, 2024

1 Why Nonlinear Models?


The linear models we introduced appear to be very flexible; however, they are limited in what
they can capture. Specifically, because the equation θ · f (x, y) is linear, classification cannot
be successful if the data points, when plotted in their feature space, cannot be divided by
a line (or, more generally, a hyperplane). The classic example of this is the xor problem.
Consider this data:
f1  f2  y
 1   1  a
 1   0  b
 0   1  b
 0   0  a

[Figure: the four data points plotted in (f1, f2) space.]
This 2-label data set is class a iff binary features f1 and f2 are both on or both off and
is class b otherwise. Try to draw a line that separates the data. It of course can't be
done. You could of course introduce a new feature XOR(f1 , f2 ) that explicitly captures this
relationship and then the data would be linearly separable. But in general you don’t know
which combinations of features yield separability.
You could try a transformation that forms weighted combinations of the features. Define weights
w11 , w21 , b1 to map from the old feature space to a new feature g1 and w12 , w22 , b2 to map
from the old feature space to a new feature g2 , such that

g1 = w11 f1 + w21 f2 + b1
g2 = w12 f1 + w22 f2 + b2

Let’s use these as the weights:
   
[ w11  w12 ]   [ 1  −1 ]
[ w21  w22 ] = [ 1  −1 ]

and

[ b1  b2 ] = [ −1  1 ]

(It's no accident I set these up as a matrix.)
That yields:
g1  g2  y
 1  −1  a
 0   0  b
 0   0  b
−1   1  a

[Figure: the transformed points plotted in (g1, g2) space; both class-b points now sit at the origin.]
It's still non-separable! This should be no surprise; all a linear transformation can do is
scale, translate, and rotate the points; it can't distort them in a way that allows separability.
Consider (cf. 7.2.1 in JM Aug 2024 ed.):

f = W(0) x + b(0)    (let's say the features were from a linear transform too)
g = W(1) f + b(1)    (the above transformation in compact form)
  = W(1) (W(0) x + b(0)) + b(1)
  = W(1) W(0) x + W(1) b(0) + b(1)
  = W′ x + b′    where W′ = W(1) W(0) and b′ = W(1) b(0) + b(1)

Still linear! So we'll apply a non-linear step function:


[Figure: a step function, 0 for inputs below a threshold and 1 above it.]
g1  g2  y
 1   0  a
 0   0  b
 0   0  b
 0   1  a
Separable!
[Figure: the points after the step nonlinearity, plotted in (g1, g2) space; a line now separates class a from class b.]
The point of nonlinear transformations is to enable recombinations of features. We can
make a linear combination of the new features and apply a nonlinearity to get yet another
recombination, and this can be repeated as many times as needed. What's nice about this is that
we no longer need to specify complicated features by hand: if we choose weights properly and
use enough layers, we can capture any combinations of the input data.
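To make the walkthrough concrete, here is a minimal sketch (assuming numpy, and a step function thresholded at zero, which is one choice consistent with the tables above):

```python
import numpy as np

# The XOR data in the original (f1, f2) feature space.
F = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
y = np.array(["a", "b", "b", "a"])

# The linear transform from the text: g = f W + b.
W = np.array([[1, -1],
              [1, -1]])
b = np.array([-1, 1])

G = F @ W + b            # linear transform alone: still not separable
S = (G > 0).astype(int)  # step nonlinearity: 1 if positive, else 0

# G rows: class a maps to (1, -1) and (-1, 1); class b maps to (0, 0)
# S rows: class a maps to (1, 0) and (0, 1); class b maps to (0, 0) -- separable
print(G)
print(S)
```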

1.1 Obtaining the weights


In logistic regression and perceptron, we used gradient descent of the loss on training data
to set weights. We can use the same approach here, though the step function, being non-
differentiable, isn’t an appropriate nonlinear activation function, so we’ll use a similarly
shaped function that is differentiable at every point. First, let’s define the model and the
loss. Let f, y be the input feature vector of the input¹ (1 × d) and its label (a string) from
a finite set of m labels. Let H of dim (d × v) be the weights matrix, and bH be the bias
vector² of dim (1 × v). The elementwise nonlinear activation function is g(). Thus to get
the transformed vector (or ‘hidden’ features...or even ‘hidden vector’) h:

k = fH + bH
h = g(k)
What are d and v? That's up to you to some degree. f, in particular, can be any features,
such as the features we used in perceptron and logistic regression, but usually they're simply
an arbitrary number of uninterpretable features tied to the vocabulary of the input x.³ For
now assume they are given. If it helps, assume they are the features from the linear model
walkthrough, i.e. the three-feature vector (2, 1, 0).
¹ For those who are ready to jump to RNN, Transformer, etc., note that we're still using a fixed set of arbitrarily defined features. Even word embeddings will be introduced at the end.
² When to use a bias term? I don't know, and I see different formulations do different things. For example, E ch. 3.1 uses bias in both hidden and output layers, though he hides the bias term in the hidden layer. JM (Jan 2022) p.140 say "some models don't include a bias...in the output layer" and follow suit. I will use it in both places, explicitly.
³ In other words, they are word embeddings, but we'll get to that shortly.
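As a sketch of the two equations above (with hypothetical weights, a toy value of v to keep things small, and a placeholder step activation; differentiable choices of g are discussed next):

```python
import numpy as np

rng = np.random.default_rng(0)
d, v = 3, 4   # three input features; v = 4 hidden units (your choice)

f  = np.array([[2.0, 1.0, 0.0]])  # the three-feature vector from the linear walkthrough
H  = rng.normal(0, 0.5, (d, v))   # hypothetical weight matrix, (d x v)
bH = np.zeros(v)                  # bias vector, (1 x v)

def g(k):
    # placeholder step activation; a differentiable g is substituted for training
    return (k > 0).astype(float)

k = f @ H + bH   # (1 x v)
h = g(k)         # the hidden vector
print(h.shape)   # (1, 4)
```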

1.2 Nonlinear activation functions
What about g? Before the 90s, the logistic sigmoid (often simply called 'sigmoid') function
was used:

σ(x) = 1 / (1 + e^(−x)) = e^x / (e^x + 1)

In the 90s and 00s, neural network people recommended hyperbolic tangent or 'tanh', another
sigmoid function, with a range from −1 to 1:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
Here they are, with arctan thrown in for good measure:⁴

[Figure: plots of sigmoid, tanh, and arctan.]
These resembled activation functions in neurons, the biological basis of ANNs. But one
problem with these functions, especially as we started exploring deeper networks (i.e. re-
peated alternations of linear transforms and nonlinear activations) is that they saturate, i.e.
the value goes to the extreme, where the slope is near zero, and then very little learning
takes place. A nonlinear function that is a lot like a linear function but is still nonlinear is
desirable. Enter the Rectified Linear Unit or ReLU:
 
ReLU(x) = { 0, if x < 0
          { x, if x ≥ 0
⁴ Pic from https://deepai.org/machine-learning-glossary-and-terms/sigmoid-function

The gradient of ReLU is very easy to work with, and it turns out this function works very
well in practice. Variants are sometimes used (Leaky ReLU: a small non-zero gradient is used
for negative values; GELU: interpolation with a Gaussian distribution (erf), used in GPT-1 through GPT-3;
Swish: a sigmoid with a hyperparameter; SwiGLU: a combination of Swish and GELU, used
in LLaMA).

[Figure: GELU on left, Swish on right.]
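Several of these activations are a line or two of code each (a sketch; the Leaky ReLU slope α and the Swish β shown here are illustrative hyperparameter values, not prescribed by the text):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # small non-zero slope for negative inputs
    return np.where(x >= 0, x, alpha * x)

def swish(x, beta=1.0):
    # x * sigmoid(beta * x); beta is the hyperparameter
    return x * sigmoid(beta * x)

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))  # [0. 0. 2.]
```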

1.3 Getting an output label


We now need to convert into the output space, which should be equal in length to m.
For sentiment, let's assume the dimension is 2: positive and negative. We can just use a
linear transformation for that, getting us the logits. Then, we use softmax again to get the
probability of each output.

z = hU + bU
o = softmax(z)

The loss ℓ is again the cross-entropy loss H,⁵ which is defined for one data item (f, y) as

Hf,y(p, q) = − Σ_{y′∈Y} p(y′|f) log q(y′|f)

where the distribution q(y′|f) may be represented by o and the true distribution p(y′|f) is
taken to be one-hot at y,⁶ reducing H to − log(oy).
Thus, ℓ ends up being

ℓ = − log(oy )
i.e. the negative log of the probability of the correct answer (denoted oy, the member
of o corresponding to choice y).
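For example, with hypothetical logits for a two-way sentiment problem:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for stability; doesn't change the result
    return e / e.sum()

z = np.array([2.0, -1.0])  # hypothetical logits for (positive, negative)
o = softmax(z)
y = 0                      # suppose the gold label is 'positive'
loss = -np.log(o[y])       # cross-entropy against a one-hot truth
print(round(loss, 4))      # 0.0486
```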
Having calculated ℓ, we update each set of parameters (H, bH , U, bU ) by the opposite of
the gradient of ℓ with respect to that variable, i.e:

H ← H − λ∂ℓ/∂H
bH ← bH − λ∂ℓ/∂bH
U ← U − λ∂ℓ/∂U
bU ← bU − λ∂ℓ/∂bU

where λ is a learning rate. Now how are these partials determined? We start at the loss
equation itself and use simple calculus:

ℓ = − log(oy )
∂ℓ/∂oy = −1/oy

Now consider the definition of oy itself; we can use the chain rule and the local derivative
of oy with respect to z, though softmax is a slightly tricky function to take a derivative of:

∂ℓ/∂z = ∂ℓ/∂oy × ∂oy /∂z


oy = exp(zy) / Σi exp(zi)

To calculate ∂oy /∂z we will make use of the derivative rule for quotients:
⁵ Not to be confused with the matrix H.
⁶ Is this realistic? No! But it's convenient. Sometimes 'label smoothing' is used to make this more realistic; I might mention it.

(a(x)/b(x))′ = (b(x) a′(x) − a(x) b′(x)) / b(x)²

It is helpful to consider the application of this rule to ∂oy/∂z in two cases: when i = y
and when i ≠ y. Remember that even though oy is a scalar, z is a vector, so we're calculating
∂oy/∂zi for every member zi of z.

[∂oy/∂z]i≠y = ((Σi′′ exp(zi′′)) × 0 − exp(zy) exp(zi)) / (Σi′ exp(zi′))²
            = −(exp(zy) / Σi′ exp(zi′)) × (exp(zi) / Σi′ exp(zi′))
            = −oy oi

[∂oy/∂z]y = ((Σi′′ exp(zi′′)) exp(zy) − exp(zy)²) / (Σi′ exp(zi′))²
          = (exp(zy) / Σi′ exp(zi′)) × ((Σi exp(zi) − exp(zy)) / Σi exp(zi))
          = oy (1 − oy)

Now we can multiply ∂ℓ/∂oy = −1/oy with ∂oy /∂z to get ∂ℓ/∂z:
[∂ℓ/∂z]i = { oy − 1,  if i = y
           { oi,      otherwise          (1)
Implementation note! Once you have o, if you represent the truth as a one-hot embedding
T , then ∂ℓ/∂z = o − T . Try to see why!
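This identity is easy to check numerically; the sketch below compares o − T against a finite-difference estimate of the gradient of − log(oy):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # stable softmax
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])   # hypothetical logits
y = 1                           # index of the correct label

o = softmax(z)
T = np.zeros_like(z)
T[y] = 1.0                      # one-hot truth

analytic = o - T                # the closed form from Equation 1

# central finite-difference estimate of d(-log o_y)/dz_i
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (-np.log(softmax(zp)[y]) + np.log(softmax(zm)[y])) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```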
We next continue on down to find the gradient of ℓ with respect to U and bU , which are
actual parameters we want to learn. We use the definition of z in terms of these variables
and what we have previously learned:

∂ℓ/∂U = ∂ℓ/∂z × ∂z/∂U


∂ℓ/∂bU = ∂ℓ/∂z × ∂z/∂bU
z = hU + bU
∂z/∂U = h
∂z/∂bU = 1

We can simply multiply ∂ℓ/∂z, which is the complicated value in Equation 1, by either
h or (the vector) 1, as noted above. Here, it’s worth noting that we want to get the shapes
of our gradient matrices right and that we want to deal with batches of training samples
properly.
Imagine that we are updating parameters after seeing one training instance of a two-way
classification problem (m = 2) with three features (d = 3) and 50 hidden units (v = 50).⁷

⁷ Typical values would be in the hundreds for d and v and in the thousands for m.

Then, ∂ℓ/∂z is a (1 x 2) vector, h is a (1 x 50) vector, and we want to update U , which
is a (50 x 2) matrix. Thus we take hᵀ × ∂ℓ/∂z to get the right shape. However, note that
in general, we do not update after a single training instance; rather, there may be some t
items in the minibatch. So in fact ∂ℓ/∂z is a (t x 2) matrix and h is a (t x 50) matrix.
hᵀ × ∂ℓ/∂z still yields a (50 x 2) matrix, but it is actually the sum of t individual loss
calculations. The point of batch updating is to take a per-item average. Thus, the proper
update for U is to subtract (the learning rate times) (hᵀ × ∂ℓ/∂z)/t. Similarly, to update bU , we
multiply ∂ℓ/∂z by a length-t ones vector, which amounts to summing
each dimension of ∂ℓ/∂z along the batch axis, then divide by t.
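A quick shape check of the batched update, assuming numpy and random stand-ins for h and ∂ℓ/∂z:

```python
import numpy as np

t, v, m = 8, 50, 2
dz = np.random.default_rng(0).normal(size=(t, m))  # stand-in for ∂ℓ/∂z
h  = np.random.default_rng(1).normal(size=(t, v))  # stand-in for the hidden vectors

dU  = h.T @ dz / t          # (v x m): per-item average of t outer products
dbU = np.ones(t) @ dz / t   # equivalently dz.sum(axis=0) / t
print(dU.shape, dbU.shape)  # (50, 2) (2,)
```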
If you've gotten this far, the rest should be straightforward. We will need ∂ℓ/∂h, which
is of course ∂ℓ/∂z × ∂z/∂h; the former term is in Equation 1 and is (t x 2), the latter is
simply U , which is (50 x 2). We calculate ∂ℓ/∂z × Uᵀ to get a (t x 50) result for ∂ℓ/∂h.
We can now move on to the hidden layer; let's assume g is ReLU.

∂ℓ/∂k = ∂ℓ/∂h × ∂h/∂k

h = ReLU(k)
∂h/∂k = { 1, k ≥ 0
        { 0, otherwise

∂ℓ/∂H = ∂ℓ/∂k × ∂k/∂H

∂ℓ/∂bH = ∂ℓ/∂k × ∂k/∂bH
k = f H + bH
∂k/∂H = f
∂k/∂bH = 1
We update H, a (3 x 50) matrix, with −∂ℓ/∂H. The dimensions of ∂ℓ/∂k are (t x 50),
the dimensions of ∂k/∂H are (t x 3); thus we form ∂ℓ/∂H = ((∂k/∂H)ᵀ × ∂ℓ/∂k)/t. Similarly, we
update bH , a (1 x 50) vector, with −∂ℓ/∂bH ; we multiply ∂ℓ/∂k by a t-length ones vector,
which sums its values along the t-sized axis, then divide by t.
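Putting the whole section together, here is a sketch of one batched forward/backward/update step using the shapes above (the random data and initial weights are hypothetical stand-ins for a real training set):

```python
import numpy as np

rng = np.random.default_rng(0)
t, d, v, m = 4, 3, 50, 2   # batch, input, hidden, output sizes from the text
lam = 0.1                  # learning rate λ

# parameters
H, bH = rng.normal(0, 0.1, (d, v)), np.zeros(v)
U, bU = rng.normal(0, 0.1, (v, m)), np.zeros(m)

f = rng.normal(size=(t, d))     # a batch of feature vectors
y = rng.integers(0, m, size=t)  # gold labels
T = np.eye(m)[y]                # one-hot truth, (t x m)

# forward pass
k = f @ H + bH                  # (t x v)
h = np.maximum(0, k)            # ReLU
z = h @ U + bU                  # (t x m) logits
e = np.exp(z - z.max(axis=1, keepdims=True))
o = e / e.sum(axis=1, keepdims=True)          # row-wise softmax
loss = -np.log(o[np.arange(t), y]).mean()     # mean cross-entropy over the batch

# backward pass, averaging over the batch as in the text
dz = o - T                      # (t x m), Equation 1
dU, dbU = h.T @ dz / t, dz.sum(axis=0) / t
dh = dz @ U.T                   # (t x v)
dk = dh * (k >= 0)              # ReLU gradient
dH, dbH = f.T @ dk / t, dk.sum(axis=0) / t

# SGD update
U -= lam * dU; bU -= lam * dbU
H -= lam * dH; bH -= lam * dbH
```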

1.4 Word embeddings


Previously, we let f , with dimension d, represent a set of arbitrary features. A more common
approach is to instead use a fixed sequence of some n (let’s say 20) words and represent each
word in the vocabulary by an e-dimensional feature vector. This fits in nicely with our set
of equations. Let E be a |V | × e matrix (often called an embedding table). Informally, we
assign an index for each word in the vocabulary from 1 to |V |. Let the input be j1 , j2 , ...jn
where each ji is a one-hot vector, i.e. if ji represents ‘salamander’ and the index for that
word is 48, then ji = 0, . . . , 0, 1, 0, . . . , 0 consisting of 47 0s, a 1, and then 49,952 0s. Then we
redefine x as Ej1 ; Ej2 ; . . . ; Ejn , an ne-length vector. Eji can be thought of as an 'embedding'
of 'salamander' in e << |V |-space. Backpropagation is extended to update E as well.⁸ We'll
next look at why these embeddings are interesting in their own right.

⁸ There is generally no bias term for the word embeddings.
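A sketch of the lookup (with a scaled-down vocabulary so the one-hot matrix fits in memory; the point is that the one-hot formulation is equivalent to a plain row lookup, which is what implementations actually do):

```python
import numpy as np

rng = np.random.default_rng(0)
V, e, n = 1000, 8, 20            # scaled-down vocab and embedding sizes; n = 20 as in the text
E = rng.normal(0, 0.1, (V, e))   # the |V| x e embedding table

ids = rng.integers(0, V, size=n) # hypothetical word indices for an n-word input
ids[0] = 48                      # suppose index 48 is 'salamander'

# the one-hot formulation ...
J = np.eye(V)[ids]               # (n x V) stack of one-hot rows j_1 ... j_n
x_onehot = (J @ E).reshape(-1)   # concatenated embeddings, an ne-length vector

# ... is equivalent to a plain row lookup
x_lookup = E[ids].reshape(-1)

print(np.allclose(x_onehot, x_lookup))  # True
```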
