
Non-Linear Models

Jonathan May
September 6, 2024

1 Why Nonlinear Models?


The linear models we introduced appear to be very flexible; however, they are limited in what
they can capture. Specifically, because the equation θ · f (x, y) is linear, classification cannot
be successful if the data points, when plotted in their feature space, cannot be divided by
a line (or, more generally, a hyperplane). The classic example of this is the xor problem.
Consider this data:
f1  f2  y
 1   1  a
 1   0  b
 0   1  b
 0   0  a

[Figure: the four data points plotted in (f1, f2) space.]
This 2-label data set is class a iff binary features f1 and f2 are both on or both off and
is class b otherwise. Try to draw a line that separates the data. It of course can't be
done. You could of course introduce a new feature XOR(f1 , f2 ) that explicitly captures this
relationship and then the data would be linearly separable. But in general you don’t know
which combinations of features yield separability.
You could try a transformation that forms weighted combinations of the features. Define weights
w11 , w21 , b1 to map from the old feature space to a new feature g1 and w12 , w22 , b2 to map
from the old feature space to a new feature g2 , such that

g1 = w11 f1 + w21 f2 + b1
g2 = w12 f1 + w22 f2 + b2

Let’s use these as the weights:
   
[ w11  w12 ]   [ 1  −1 ]
[ w21  w22 ] = [ 1  −1 ]

and

[ b1  b2 ] = [ −1  1 ]

(It's no accident I set these up as a matrix.)
That yields:
g1  g2  y
 1  −1  a
 0   0  b
 0   0  b
−1   1  a

[Figure: the transformed points plotted in (g1, g2) space; both class-b points now sit at the origin.]
It's still non-separable! This should be no surprise; all a linear transformation can do is
scale, translate, and rotate the points; it can't distort them in a way that allows separability.
Consider (cf. 7.2.1 in JM Aug 2024 ed.):

f = W(0) x + b(0)    (let's say the features were from a linear transform too)
g = W(1) f + b(1)    (the above transformation in compact form)
  = W(1) (W(0) x + b(0)) + b(1)
  = W(1) W(0) x + W(1) b(0) + b(1)
  = W′ x + b′    where W′ = W(1) W(0) and b′ = W(1) b(0) + b(1)

Still linear! So we'll apply a non-linear step function:


[Figure: a step function, 0 for inputs below a threshold and 1 above it.]
g1  g2  y
 1   0  a
 0   0  b
 0   0  b
 0   1  a
Separable!
[Figure: the points after the step nonlinearity, plotted in (g1, g2) space; a line now separates class a from class b.]
The point of nonlinear transformations is to enable recombinations of features. We can
make a linear combination of the new features and apply a nonlinearity to get yet another
recombination, and this can be repeated as many times as needed. What's nice about this is that
we no longer need to specify complicated features by hand: if we choose weights properly and
use enough layers, we can capture any combinations of the input data.
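To make the walkthrough concrete, here is a minimal sketch (assuming numpy, and a step function thresholded at zero, which is one choice consistent with the tables above):

```python
import numpy as np

# The XOR data in the original (f1, f2) feature space.
F = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
y = np.array(["a", "b", "b", "a"])

# The linear transform from the text: g = f W + b.
W = np.array([[1, -1],
              [1, -1]])
b = np.array([-1, 1])

G = F @ W + b            # linear transform alone: still not separable
S = (G > 0).astype(int)  # step nonlinearity: 1 if positive, else 0

# G rows: class a maps to (1, -1) and (-1, 1); class b maps to (0, 0)
# S rows: class a maps to (1, 0) and (0, 1); class b maps to (0, 0) -- separable
print(G)
print(S)
```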

1.1 Obtaining the weights


In logistic regression and perceptron, we used gradient descent of the loss on training data
to set weights. We can use the same approach here, though the step function, being non-
differentiable, isn’t an appropriate nonlinear activation function, so we’ll use a similarly
shaped function that is differentiable at every point. First, let’s define the model and the
loss. Let f, y be the input feature vector of the input¹ (1 × d) and its label (a string) from
a finite set of m labels. Let H of dim (d × v) be the weights matrix, and bH be the bias
vector² of dim (1 × v). The elementwise nonlinear activation function is g(). Thus to get
the transformed vector (or ‘hidden’ features...or even ‘hidden vector’) h:

k = fH + bH
h = g(k)
What are d and v? That's up to you to some degree. f, in particular, can be any features,
such as the features we used in perceptron and logistic regression, but usually they're simply
an arbitrary number of uninterpretable features tied to the vocabulary of the input x.³ For
now assume they are given. If it helps, assume they are the features from the linear model
walkthrough, i.e. the three-feature vector (2, 1, 0).
¹ For those who are ready to jump to RNN, Transformer, etc., note that we're still using a fixed set of arbitrarily defined features. Even word embeddings will be introduced at the end.
² When to use a bias term? I don't know, and I see different formulations do different things. For example, E ch. 3.1 uses bias in both hidden and output layers, though he hides the bias term in the hidden layer. JM (Jan 2022) p.140 say "some models don't include a bias...in the output layer" and follow suit. I will use it in both places, explicitly.
³ In other words, they are word embeddings, but we'll get to that shortly.
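As a sketch of the two equations above (with hypothetical weights, a toy value of v to keep things small, and a placeholder step activation; differentiable choices of g are discussed next):

```python
import numpy as np

rng = np.random.default_rng(0)
d, v = 3, 4   # three input features; v = 4 hidden units (your choice)

f  = np.array([[2.0, 1.0, 0.0]])  # the three-feature vector from the linear walkthrough
H  = rng.normal(0, 0.5, (d, v))   # hypothetical weight matrix, (d x v)
bH = np.zeros(v)                  # bias vector, (1 x v)

def g(k):
    # placeholder step activation; a differentiable g is substituted for training
    return (k > 0).astype(float)

k = f @ H + bH   # (1 x v)
h = g(k)         # the hidden vector
print(h.shape)   # (1, 4)
```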

1.2 Nonlinear activation functions
What about g? Before the 90s, the logistic sigmoid (often simply called 'sigmoid') function
was used:

σ(x) = 1 / (1 + e^(−x)) = e^x / (e^x + 1)

In the 90s and 00s, neural network people recommended hyperbolic tangent or 'tanh', another
sigmoid function, with a range from −1 to 1:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
Here they are, with arctan thrown in for good measure:⁴

[Figure: plots of sigmoid, tanh, and arctan.]
These resembled activation functions in neurons, the biological basis of ANNs. But one
problem with these functions, especially as we started exploring deeper networks (i.e. re-
peated alternations of linear transforms and nonlinear activations) is that they saturate, i.e.
the value goes to the extreme, where the slope is near zero, and then very little learning
takes place. A nonlinear function that is a lot like a linear function but is still nonlinear is
desirable. Enter the Rectified Linear Unit or ReLU:
 
ReLU(x) = { 0, if x < 0
          { x, if x ≥ 0
⁴ Pic from https://deepai.org/machine-learning-glossary-and-terms/sigmoid-function

The gradient of ReLU is very easy to work with, and it turns out this function works very
well in practice. Variants are sometimes used (Leaky ReLU: a small non-zero gradient is used
for negative values; GELU: interpolation with a Gaussian distribution (erf), used in GPT-1 through GPT-3;
Swish: a sigmoid with a hyperparameter; SwiGLU: a combination of Swish and GELU, used
in LLaMA).

[Figure: GELU on left, Swish on right.]
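Several of these activations are a line or two of code each (a sketch; the Leaky ReLU slope α and the Swish β shown here are illustrative hyperparameter values, not prescribed by the text):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # small non-zero slope for negative inputs
    return np.where(x >= 0, x, alpha * x)

def swish(x, beta=1.0):
    # x * sigmoid(beta * x); beta is the hyperparameter
    return x * sigmoid(beta * x)

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))  # [0. 0. 2.]
```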

1.3 Getting an output label


We now need to convert into the output space, which should be equal in length to m.
For sentiment, let's assume the dimension is 2: positive and negative. We can just use a
linear transformation for that, getting us the logits. Then, we use softmax again to get the
probability of each output.

z = hU + bU
o = softmax(z)

The loss ℓ is again the cross-entropy loss H,⁵ which is defined for one data item (f, y) as

Hf,y(p, q) = − Σ_{y′∈Y} p(y′|f) log q(y′|f)

where the distribution q(y′|f) may be represented by o and the true distribution p(y′|f) is
taken to be one-hot at y,⁶ reducing H to − log(oy).
Thus, ℓ ends up being

ℓ = − log(oy )
i.e. the negative log of the probability of the correct answer (denoted oy, the member
of o corresponding to choice y).
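For example, with hypothetical logits for a two-way sentiment problem:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for stability; doesn't change the result
    return e / e.sum()

z = np.array([2.0, -1.0])  # hypothetical logits for (positive, negative)
o = softmax(z)
y = 0                      # suppose the gold label is 'positive'
loss = -np.log(o[y])       # cross-entropy against a one-hot truth
print(round(loss, 4))      # 0.0486
```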
Having calculated ℓ, we update each set of parameters (H, bH , U, bU ) by the opposite of
the gradient of ℓ with respect to that variable, i.e:

H ← H − λ∂ℓ/∂H
bH ← bH − λ∂ℓ/∂bH
U ← U − λ∂ℓ/∂U
bU ← bU − λ∂ℓ/∂bU

where λ is a learning rate. Now how are these partials determined? We start at the loss
equation itself and use simple calculus:

ℓ = − log(oy )
∂ℓ/∂oy = −1/oy

Now consider the definition of oy itself; we can use the chain rule and the local derivative
of oy with respect to z, though softmax is a slightly tricky function to take a derivative of:

∂ℓ/∂z = ∂ℓ/∂oy × ∂oy /∂z


oy = exp(zy) / Σi exp(zi)

To calculate ∂oy /∂z we will make use of the derivative rule for quotients:
⁵ Not to be confused with the matrix H.
⁶ Is this realistic? No! But it's convenient. Sometimes 'label smoothing' is used to make this more realistic; I might mention it.

(a(x)/b(x))′ = (b(x) a′(x) − a(x) b′(x)) / b(x)²

It is helpful to consider the application of this rule to ∂oy/∂z in two cases: when i = y
and when i ≠ y. Remember that even though oy is a scalar, z is a vector, so we're calculating
∂oy/∂zi for every member zi of z.

[∂oy/∂z]i≠y = ((Σi′′ exp(zi′′)) × 0 − exp(zy) exp(zi)) / (Σi′ exp(zi′))²
            = −(exp(zy) / Σi′ exp(zi′)) × (exp(zi) / Σi′ exp(zi′))
            = −oy oi

[∂oy/∂z]y = ((Σi′′ exp(zi′′)) exp(zy) − exp(zy)²) / (Σi′ exp(zi′))²
          = (exp(zy) / Σi′ exp(zi′)) × ((Σi exp(zi) − exp(zy)) / Σi exp(zi))
          = oy (1 − oy)

Now we can multiply ∂ℓ/∂oy = −1/oy with ∂oy /∂z to get ∂ℓ/∂z:
[∂ℓ/∂z]i = { oy − 1,  if i = y
           { oi,      otherwise          (1)
Implementation note! Once you have o, if you represent the truth as a one-hot embedding
T , then ∂ℓ/∂z = o − T . Try to see why!
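This identity is easy to check numerically; the sketch below compares o − T against a finite-difference estimate of the gradient of − log(oy):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # stable softmax
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])   # hypothetical logits
y = 1                           # index of the correct label

o = softmax(z)
T = np.zeros_like(z)
T[y] = 1.0                      # one-hot truth

analytic = o - T                # the closed form from Equation 1

# central finite-difference estimate of d(-log o_y)/dz_i
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (-np.log(softmax(zp)[y]) + np.log(softmax(zm)[y])) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```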
We next continue on down to find the gradient of ℓ with respect to U and bU , which are
actual parameters we want to learn. We use the definition of z in terms of these variables
and what we have previously learned:

∂ℓ/∂U = ∂ℓ/∂z × ∂z/∂U


∂ℓ/∂bU = ∂ℓ/∂z × ∂z/∂bU
z = hU + bU
∂z/∂U = h
∂z/∂bU = 1

We can simply multiply ∂ℓ/∂z, which is the complicated value in Equation 1, by either
h or (the vector) 1, as noted above. Here, it’s worth noting that we want to get the shapes
of our gradient matrices right and that we want to deal with batches of training samples
properly.
Imagine that we are updating parameters after seeing one training instance of a two-way
classification problem (m = 2) with three features (d = 3) and 50 hidden units (v = 50).⁷

⁷ Typical values would be in the hundreds for d and v and in the thousands for m.

Then, ∂ℓ/∂z is a (1 x 2) vector, h is a (1 x 50) vector, and we want to update U , which
is a (50 x 2) matrix. Thus we take hᵀ × ∂ℓ/∂z to get the right shape. However, note that
in general, we do not update after a single training instance; rather, there may be some t
items in the minibatch. So in fact ∂ℓ/∂z is a (t x 2) matrix and h is a (t x 50) matrix.
hᵀ × ∂ℓ/∂z still yields a (50 x 2) matrix, but it is actually the sum of t individual loss
calculations. The point of batch updating is to take a per-item average. Thus, the proper
update for U is to subtract (the learning rate times) (hᵀ × ∂ℓ/∂z)/t. Similarly, to update bU , we
multiply ∂ℓ/∂z by a length-t ones vector, which amounts to summing
each dimension of ∂ℓ/∂z along the batch axis, then divide by t.
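A quick shape check of the batched update, assuming numpy and random stand-ins for h and ∂ℓ/∂z:

```python
import numpy as np

t, v, m = 8, 50, 2
dz = np.random.default_rng(0).normal(size=(t, m))  # stand-in for ∂ℓ/∂z
h  = np.random.default_rng(1).normal(size=(t, v))  # stand-in for the hidden vectors

dU  = h.T @ dz / t          # (v x m): per-item average of t outer products
dbU = np.ones(t) @ dz / t   # equivalently dz.sum(axis=0) / t
print(dU.shape, dbU.shape)  # (50, 2) (2,)
```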
If you've gotten this far, the rest should be straightforward. We will need ∂ℓ/∂h, which
is of course ∂ℓ/∂z × ∂z/∂h; the former term is in Equation 1 and is (t x 2), the latter is
simply U , which is (50 x 2). We calculate ∂ℓ/∂z × Uᵀ to get a (t x 50) result for ∂ℓ/∂h.
We can now move on to the hidden layer; let's assume g is ReLU.

∂ℓ/∂k = ∂ℓ/∂h × ∂h/∂k

h = ReLU(k)
∂h/∂k = { 1, k ≥ 0
        { 0, otherwise

∂ℓ/∂H = ∂ℓ/∂k × ∂k/∂H

∂ℓ/∂bH = ∂ℓ/∂k × ∂k/∂bH
k = f H + bH
∂k/∂H = f
∂k/∂bH = 1
We update H, a (3 x 50) matrix, with −∂ℓ/∂H. The dimensions of ∂ℓ/∂k are (t x 50),
the dimensions of ∂k/∂H are (t x 3); thus we form ∂ℓ/∂H = ((∂k/∂H)ᵀ × ∂ℓ/∂k)/t. Similarly, we
update bH , a (1 x 50) vector, with −∂ℓ/∂bH ; we multiply ∂ℓ/∂k by a t-length ones vector,
which sums its values along the t-sized axis, then divide by t.
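Putting the whole section together, here is a sketch of one batched forward/backward/update step using the shapes above (the random data and initial weights are hypothetical stand-ins for a real training set):

```python
import numpy as np

rng = np.random.default_rng(0)
t, d, v, m = 4, 3, 50, 2   # batch, input, hidden, output sizes from the text
lam = 0.1                  # learning rate λ

# parameters
H, bH = rng.normal(0, 0.1, (d, v)), np.zeros(v)
U, bU = rng.normal(0, 0.1, (v, m)), np.zeros(m)

f = rng.normal(size=(t, d))     # a batch of feature vectors
y = rng.integers(0, m, size=t)  # gold labels
T = np.eye(m)[y]                # one-hot truth, (t x m)

# forward pass
k = f @ H + bH                  # (t x v)
h = np.maximum(0, k)            # ReLU
z = h @ U + bU                  # (t x m) logits
e = np.exp(z - z.max(axis=1, keepdims=True))
o = e / e.sum(axis=1, keepdims=True)          # row-wise softmax
loss = -np.log(o[np.arange(t), y]).mean()     # mean cross-entropy over the batch

# backward pass, averaging over the batch as in the text
dz = o - T                      # (t x m), Equation 1
dU, dbU = h.T @ dz / t, dz.sum(axis=0) / t
dh = dz @ U.T                   # (t x v)
dk = dh * (k >= 0)              # ReLU gradient
dH, dbH = f.T @ dk / t, dk.sum(axis=0) / t

# SGD update
U -= lam * dU; bU -= lam * dbU
H -= lam * dH; bH -= lam * dbH
```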

1.4 Word embeddings


Previously, we let f , with dimension d, represent a set of arbitrary features. A more common
approach is to instead use a fixed sequence of some n (let’s say 20) words and represent each
word in the vocabulary by an e-dimensional feature vector. This fits in nicely with our set
of equations. Let E be a |V | × e matrix (often called an embedding table). Informally, we
assign an index for each word in the vocabulary from 1 to |V |. Let the input be j1 , j2 , ...jn
where each ji is a one-hot vector, i.e. if ji represents ‘salamander’ and the index for that
word is 48, then ji = 0, . . . , 0, 1, 0, . . . , 0 consisting of 47 0s, a 1, and then 49,952 0s. Then we
redefine x as Ej1 ; Ej2 ; . . . ; Ejn , an ne-length vector. Eji can be thought of as an 'embedding'
of 'salamander' in e << |V |-space. Backpropagation is extended to update E as well.⁸ We'll
next look at why these embeddings are interesting in their own right.

⁸ There is generally no bias term for the word embeddings.
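A sketch of the lookup (with a scaled-down vocabulary so the one-hot matrix fits in memory; the point is that the one-hot formulation is equivalent to a plain row lookup, which is what implementations actually do):

```python
import numpy as np

rng = np.random.default_rng(0)
V, e, n = 1000, 8, 20            # scaled-down vocab and embedding sizes; n = 20 as in the text
E = rng.normal(0, 0.1, (V, e))   # the |V| x e embedding table

ids = rng.integers(0, V, size=n) # hypothetical word indices for an n-word input
ids[0] = 48                      # suppose index 48 is 'salamander'

# the one-hot formulation ...
J = np.eye(V)[ids]               # (n x V) stack of one-hot rows j_1 ... j_n
x_onehot = (J @ E).reshape(-1)   # concatenated embeddings, an ne-length vector

# ... is equivalent to a plain row lookup
x_lookup = E[ids].reshape(-1)

print(np.allclose(x_onehot, x_lookup))  # True
```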
