
Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright © 2024. All rights reserved. Draft of January 12, 2025.

CHAPTER 7

Neural Networks

“[M]achines of this character can behave in a very complicated manner when the number of units is large.”
Alan Turing (1948) “Intelligent Machinery”, page 6

Neural networks are a fundamental computational tool for language processing, and a very old one. They are called neural because their origins lie in the McCulloch-Pitts neuron (McCulloch and Pitts, 1943), a simplified model of the biological neuron as a kind of computing element that could be described in terms of propositional logic. But the modern use in language processing no longer draws on these early biological inspirations.
Instead, a modern neural network is a network of small computing units, each of which takes a vector of input values and produces a single output value. In this chapter we introduce the neural net applied to classification. The architecture we introduce is called a feedforward network because the computation proceeds iteratively from one layer of units to the next. The use of modern neural nets is often called deep learning, because modern networks are often deep (have many layers).
Neural networks share much of the same mathematics as logistic regression. But
neural networks are a more powerful classifier than logistic regression, and indeed a
minimal neural network (technically one with a single ‘hidden layer’) can be shown to approximate any function.
Neural net classifiers are different from logistic regression in another way. With
logistic regression, we applied the regression classifier to many different tasks by
developing many rich kinds of feature templates based on domain knowledge. When
working with neural networks, it is more common to avoid most uses of rich hand-
derived features, instead building neural networks that take raw words as inputs
and learn to induce features as part of the process of learning to classify. We saw
examples of this kind of representation learning for embeddings in Chapter 6. Nets
that are very deep are particularly good at representation learning. For that reason
deep neural nets are the right tool for tasks that offer sufficient data to learn features
automatically.
In this chapter we’ll introduce feedforward networks as classifiers, and also ap-
ply them to the simple task of language modeling: assigning probabilities to word
sequences and predicting upcoming words. In subsequent chapters we’ll introduce
many other aspects of neural models, such as recurrent neural networks (Chap-
ter 8), the Transformer (Chapter 9), and masked language modeling (Chapter 11).

7.1 Units
The building block of a neural network is a single computational unit. A unit takes
a set of real valued numbers as input, performs some computation on them, and
produces an output.
At its heart, a neural unit is taking a weighted sum of its inputs, with one additional term in the sum called a bias term. Given a set of inputs x_1...x_n, a unit has a set of corresponding weights w_1...w_n and a bias b, so the weighted sum z can be represented as:

z = b + \sum_i w_i x_i    (7.1)

Often it’s more convenient to express this weighted sum using vector notation; recall from linear algebra that a vector is, at heart, just a list or array of numbers. Thus we’ll talk about z in terms of a weight vector w, a scalar bias b, and an input vector x, and we’ll replace the sum with the convenient dot product:

z = w · x + b    (7.2)
As defined in Eq. 7.2, z is just a real valued number.
Finally, instead of using z, a linear function of x, as the output, neural units apply a non-linear function f to z. We will refer to the output of this function as the activation value for the unit, a. Since we are just modeling a single unit, the activation for the node is in fact the final output of the network, which we’ll generally call y. So the value y is defined as:

y = a = f(z)

We’ll discuss three popular non-linear functions f below (the sigmoid, the tanh, and the rectified linear unit or ReLU) but it’s pedagogically convenient to start with the sigmoid function since we saw it in Chapter 5:

y = \sigma(z) = \frac{1}{1 + e^{-z}}    (7.3)
The sigmoid (shown in Fig. 7.1) has a number of advantages; it maps the output
into the range (0, 1), which is useful in squashing outliers toward 0 or 1. And it’s
differentiable, which as we saw in Section ?? will be handy for learning.

Figure 7.1 The sigmoid function takes a real value and maps it to the range (0, 1). It is
nearly linear around 0 but outlier values get squashed toward 0 or 1.

Substituting Eq. 7.2 into Eq. 7.3 gives us the output of a neural unit:

y = \sigma(w \cdot x + b) = \frac{1}{1 + \exp(-(w \cdot x + b))}    (7.4)

Fig. 7.2 shows a final schematic of a basic neural unit. In this example the unit
takes 3 input values x1 , x2 , and x3 , and computes a weighted sum, multiplying each
value by a weight (w1 , w2 , and w3 , respectively), adds them to a bias term b, and then
passes the resulting sum through a sigmoid function to result in a number between 0
and 1.


Figure 7.2 A neural unit, taking 3 inputs x1 , x2 , and x3 (and a bias b that we represent as a
weight for an input clamped at +1) and producing an output y. We include some convenient
intermediate variables: the output of the summation, z, and the output of the sigmoid, a. In
this case the output of the unit y is the same as a, but in deeper networks we’ll reserve y to
mean the final output of the entire network, leaving a as the activation of an individual node.

Let’s walk through an example just to get an intuition. Let’s suppose we have a
unit with the following weight vector and bias:

w = [0.2, 0.3, 0.9]


b = 0.5

What would this unit do with the following input vector:

x = [0.5, 0.6, 0.1]

The resulting output y would be:

y = \sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}} = \frac{1}{1 + e^{-(.5 \cdot .2 + .6 \cdot .3 + .1 \cdot .9 + .5)}} = \frac{1}{1 + e^{-0.87}} = .70
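To make the arithmetic concrete, here is a minimal NumPy sketch of this single unit (the code and the name unit_output are ours, not the chapter’s):

import numpy as np

def sigmoid(z):
    # Squash a real value into the range (0, 1); Eq. 7.3
    return 1 / (1 + np.exp(-z))

def unit_output(w, x, b):
    # A single neural unit: weighted sum plus bias, then sigmoid (Eq. 7.4)
    return sigmoid(np.dot(w, x) + b)

w = np.array([0.2, 0.3, 0.9])
x = np.array([0.5, 0.6, 0.1])
b = 0.5
print(unit_output(w, x, b))  # ~0.705, matching the .70 above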
In practice, the sigmoid is not commonly used as an activation function. A function that is very similar but almost always better is the tanh function shown in Fig. 7.3a; tanh is a variant of the sigmoid that ranges from -1 to +1:

y = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}    (7.5)
The simplest activation function, and perhaps the most commonly used, is the rectified linear unit, also called the ReLU, shown in Fig. 7.3b. It’s just the same as z when z is positive, and 0 otherwise:

y = ReLU(z) = max(z, 0)    (7.6)

Figure 7.3 The tanh (a) and ReLU (b) activation functions.

These activation functions have different properties that make them useful for different language applications or network architectures. For example, the tanh function has the nice properties of being smoothly differentiable and mapping outlier values toward the mean. The rectifier function, on the other hand, has nice properties that result from it being very close to linear. In the sigmoid or tanh functions, very high values of z result in values of y that are saturated, i.e., extremely close to 1, and have derivatives very close to 0. Zero derivatives cause problems for learning, because as we’ll see in Section 7.5, we’ll train networks by propagating an error signal backwards, multiplying gradients (partial derivatives) from each layer of the network; gradients that are almost 0 cause the error signal to get smaller and smaller until it is too small to be used for training, a problem called the vanishing gradient problem. Rectifiers don’t have this problem, since the derivative of ReLU for high values of z is 1 rather than very close to 0.
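The saturation effect is easy to see numerically. The following sketch (ours) compares the derivative of tanh, which is 1 − tanh²(z), with the derivative of ReLU for increasingly large z:

import numpy as np

def tanh_grad(z):
    # derivative of tanh(z) is 1 - tanh(z)^2
    return 1 - np.tanh(z) ** 2

def relu_grad(z):
    # derivative of ReLU(z) is 1 for z > 0, and 0 otherwise
    return (z > 0).astype(float)

z = np.array([0.0, 2.0, 5.0, 10.0])
print(tanh_grad(z))  # [1.0, 0.0707, 0.000181, 8.2e-09]: saturating
print(relu_grad(z))  # [0.0, 1.0, 1.0, 1.0]: gradient stays 1 for z > 0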

7.2 The XOR problem


Early in the history of neural networks it was realized that the power of neural net-
works, as with the real neurons that inspired them, comes from combining these
units into larger networks.
One of the most clever demonstrations of the need for multi-layer networks was
the proof by Minsky and Papert (1969) that a single neural unit cannot compute
some very simple functions of its input. Consider the task of computing elementary
logical functions of two inputs, like AND, OR, and XOR. As a reminder, here are
the truth tables for those functions:

x1  x2 | AND | OR | XOR
 0   0 |  0  |  0 |  0
 0   1 |  0  |  1 |  1
 1   0 |  0  |  1 |  1
 1   1 |  1  |  1 |  0

This example was first shown for the perceptron, which is a very simple neural unit that has a binary output and has a very simple step function as its non-linear activation function. The output y of a perceptron is 0 or 1, and is computed as follows (using the same weight w, input x, and bias b as in Eq. 7.2):

y = \begin{cases} 0, & \text{if } w \cdot x + b \leq 0 \\ 1, & \text{if } w \cdot x + b > 0 \end{cases}    (7.7)

It’s very easy to build a perceptron that can compute the logical AND and OR
functions of its binary inputs; Fig. 7.4 shows the necessary weights.

Figure 7.4 The weights w and bias b for perceptrons for computing logical functions. The
inputs are shown as x1 and x2 and the bias as a special node with value +1 which is multiplied
with the bias weight b. (a) logical AND, with weights w1 = 1 and w2 = 1 and bias weight
b = −1. (b) logical OR, with weights w1 = 1 and w2 = 1 and bias weight b = 0. These weights/biases are just one of an infinite number of possible sets of weights and biases that would implement the functions.
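As a quick check, here is a sketch (our code, not the book’s) that runs a step-function unit with the Fig. 7.4 weights over all four binary inputs:

import numpy as np

def perceptron(w, b, x):
    # Step-function unit of Eq. 7.7: output 1 iff w.x + b > 0
    return int(np.dot(w, x) + b > 0)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    and_out = perceptron([1, 1], -1, x)  # weights/bias from Fig. 7.4a
    or_out = perceptron([1, 1], 0, x)    # weights/bias from Fig. 7.4b
    print(x, and_out, or_out)
# AND outputs: 0 0 0 1; OR outputs: 0 1 1 1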

It turns out, however, that it’s not possible to build a perceptron to compute
logical XOR! (It’s worth spending a moment to give it a try!)
The intuition behind this important result relies on understanding that a perceptron is a linear classifier. For a two-dimensional input x1 and x2, the perceptron equation, w1x1 + w2x2 + b = 0, is the equation of a line. (We can see this by putting it in the standard linear format: x2 = (−w1/w2)x1 + (−b/w2).) This line acts as a decision boundary in two-dimensional space in which the output 0 is assigned to all inputs lying on one side of the line, and the output 1 to all input points lying on the other side of the line. If we had more than 2 inputs, the decision boundary becomes a hyperplane instead of a line, but the idea is the same, separating the space into two categories.
Fig. 7.5 shows the possible logical inputs (00, 01, 10, and 11) and the line drawn by one possible set of parameters for an AND and an OR classifier. Notice that there is simply no way to draw a line that separates the positive cases of XOR (01 and 10) from the negative cases (00 and 11). We say that XOR is not a linearly separable function. Of course we could draw a boundary with a curve, or some other function, but not a single line.

7.2.1 The solution: neural networks


While the XOR function cannot be calculated by a single perceptron, it can be cal-
culated by a layered network of perceptron units. Rather than see this with networks
of simple perceptrons, however, let’s see how to compute XOR using two layers of
ReLU-based units following Goodfellow et al. (2016). Fig. 7.6 shows a figure with
the input being processed by two layers of neural units. The middle layer (called
h) has two units, and the output layer (called y) has one unit. A set of weights and
biases are shown that allows the network to correctly compute the XOR function.
Let’s walk through what happens with the input x = [0, 0]. If we multiply each
input value by the appropriate weight, sum, and then add the bias b, we get the vector
[0, -1], and we then apply the rectified linear transformation to give the output of the
h layer as [0, 0]. Now we once again multiply by the weights, sum, and add the
bias (0 in this case) resulting in the value 0. The reader should work through the
computation of the remaining 3 possible input pairs to see that the resulting y values
are 1 for the inputs [0, 1] and [1, 0] and 0 for [0, 0] and [1, 1].
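Here is the same computation as a sketch in NumPy (our code; the weights and biases are read off Fig. 7.6):

import numpy as np

def relu(z):
    return np.maximum(z, 0)

W = np.array([[1.0, 1.0],   # weights into h1
              [1.0, 1.0]])  # weights into h2
b = np.array([0.0, -1.0])   # biases for h1 and h2
u = np.array([1.0, -2.0])   # weights from h1, h2 to y1 (output bias is 0)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = relu(W @ np.array(x) + b)  # hidden layer
    y = u @ h                      # output layer
    print(x, '->', int(y))         # XOR: 0, 1, 1, 0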


Figure 7.5 The functions AND, OR, and XOR, represented with input x1 on the x-axis and input x2 on the
y-axis. Filled circles represent perceptron outputs of 1, and white circles perceptron outputs of 0. There is no
way to draw a line that correctly separates the two categories for XOR. Figure styled after Russell and Norvig
(2002).

[Figure 7.6 diagram: x1 and x2 each connect to h1 and h2 with weight 1; h1 has bias weight 0 and h2 has bias weight −1; h1 connects to y1 with weight 1 and h2 with weight −2; the output bias weight is 0.]
Figure 7.6 XOR solution after Goodfellow et al. (2016). There are three ReLU units, in
two layers; we’ve called them h1 , h2 (h for “hidden layer”) and y1 . As before, the numbers
on the arrows represent the weights w for each unit, and we represent the bias b as a weight
on a unit clamped to +1, with the bias weights/units in gray.

It’s also instructive to look at the intermediate results, the outputs of the two
hidden nodes h1 and h2 . We showed in the previous paragraph that the h vector for
the inputs x = [0, 0] was [0, 0]. Fig. 7.7b shows the values of the h layer for all
4 inputs. Notice that hidden representations of the two input points x = [0, 1] and
x = [1, 0] (the two cases with XOR output = 1) are merged to the single point h =
[1, 0]. The merger makes it easy to linearly separate the positive and negative cases
of XOR. In other words, we can view the hidden layer of the network as forming a
representation of the input.
In this example we just stipulated the weights in Fig. 7.6. But for real examples
the weights for neural networks are learned automatically using the error backprop-
agation algorithm to be introduced in Section 7.5. That means the hidden layers will
learn to form useful representations. This intuition, that neural networks can auto-
matically learn useful representations of the input, is one of their key advantages,
and one that we will return to again and again in later chapters.

[Figure 7.7: (a) the original x space; (b) the new (linearly separable) h space.]


Figure 7.7 The hidden layer forming a new representation of the input. (b) shows the
representation of the hidden layer, h, compared to the original input representation x in (a).
Notice that the input point [0, 1] has been collapsed with the input point [1, 0], making it
possible to linearly separate the positive and negative cases of XOR. After Goodfellow et al.
(2016).

7.3 Feedforward Neural Networks


Let’s now walk through a slightly more formal presentation of the simplest kind of neural network, the feedforward network. A feedforward network is a multilayer network in which the units are connected with no cycles; the outputs from units in each layer are passed to units in the next higher layer, and no outputs are passed back to lower layers. (In Chapter 8 we’ll introduce networks with cycles, called recurrent neural networks.)
For historical reasons multilayer networks, especially feedforward networks, are sometimes called multi-layer perceptrons (or MLPs); this is a technical misnomer, since the units in modern multilayer networks aren’t perceptrons (perceptrons have a simple step-function as their activation function, but modern networks are made up of units with many kinds of non-linearities like ReLUs and sigmoids), but at some point the name stuck.
Simple feedforward networks have three kinds of nodes: input units, hidden
units, and output units.
Fig. 7.8 shows a picture. The input layer x is a vector of simple scalar values just
as we saw in Fig. 7.2.
The core of the neural network is the hidden layer h formed of hidden units h_i, each of which is a neural unit as described in Section 7.1, taking a weighted sum of its inputs and then applying a non-linearity. In the standard architecture, each layer is fully-connected, meaning that each unit in each layer takes as input the outputs from all the units in the previous layer, and there is a link between every pair of units from two adjacent layers. Thus each hidden unit sums over all the input units.
Recall that a single hidden unit has as parameters a weight vector and a bias. We represent the parameters for the entire hidden layer by combining the weight vector and bias for each unit i into a single weight matrix W and a single bias vector b for the whole layer (see Fig. 7.8). Each element W_ji of the weight matrix W represents the weight of the connection from the ith input unit x_i to the jth hidden unit h_j.
Figure 7.8 A simple 2-layer feedforward network, with one hidden layer, one output layer, and one input layer (the input layer is usually not counted when enumerating layers). [Diagram: inputs x1 ... xn0 connect through weight matrix W and bias vector b to hidden units h1 ... hn1, which connect through weight matrix U to outputs y1 ... yn2.]

The advantage of using a single matrix W for the weights of the entire layer is that now the hidden layer computation for a feedforward network can be done very efficiently with simple matrix operations. In fact, the computation only has three steps: multiplying the weight matrix by the input vector x, adding the bias vector b, and applying the activation function g (such as the sigmoid, tanh, or ReLU activation function defined above).
The output of the hidden layer, the vector h, is thus the following (for this exam-
ple we’ll use the sigmoid function σ as our activation function):

h = σ (Wx + b) (7.8)

Notice that we’re applying the σ function here to a vector, while in Eq. 7.3 it was applied to a scalar. We’re thus allowing σ(·), and indeed any activation function g(·), to apply to a vector element-wise, so g([z_1, z_2, z_3]) = [g(z_1), g(z_2), g(z_3)].
Let’s introduce some constants to represent the dimensionalities of these vectors and matrices. We’ll refer to the input layer as layer 0 of the network, and have n_0 represent the number of inputs, so x is a vector of real numbers of dimension n_0, or more formally x ∈ R^{n_0}, a column vector of dimensionality [n_0, 1]. Let’s call the hidden layer layer 1 and the output layer layer 2. The hidden layer has dimensionality n_1, so h ∈ R^{n_1} and also b ∈ R^{n_1} (since each hidden unit can take a different bias value). And the weight matrix W has dimensionality W ∈ R^{n_1 × n_0}, i.e. [n_1, n_0].
Take a moment to convince yourself that the matrix multiplication in Eq. 7.8 will compute the value of each h_j as h_j = \sigma\left(\sum_{i=1}^{n_0} W_{ji} x_i + b_j\right).
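A small sketch (our own code, with made-up dimensions) verifying that the matrix form of Eq. 7.8 computes exactly these per-unit sums:

import numpy as np

rng = np.random.default_rng(0)
n0, n1 = 4, 3                  # input and hidden dimensionalities
W = rng.normal(size=(n1, n0))  # weight matrix, shape [n1, n0]
b = rng.normal(size=n1)        # bias vector, shape [n1]
x = rng.normal(size=n0)        # input vector, shape [n0]

sigmoid = lambda z: 1 / (1 + np.exp(-z))

h_matrix = sigmoid(W @ x + b)  # Eq. 7.8, all units at once
h_perunit = np.array([sigmoid(sum(W[j, i] * x[i] for i in range(n0)) + b[j])
                      for j in range(n1)])
print(np.allclose(h_matrix, h_perunit))  # True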
As we saw in Section 7.2, the resulting value h (for hidden but also for hypoth-
esis) forms a representation of the input. The role of the output layer is to take
this new representation h and compute a final output. This output could be a real-
valued number, but in many cases the goal of the network is to make some sort of
classification decision, and so we will focus on the case of classification.
If we are doing a binary task like sentiment classification, we might have a sin-
gle output node, and its scalar value y is the probability of positive versus negative
sentiment. If we are doing multinomial classification, such as assigning a part-of-
speech tag, we might have one output node for each potential part-of-speech, whose
output value is the probability of that part-of-speech, and the values of all the output
nodes must sum to one. The output layer is thus a vector y that gives a probability
distribution across the output nodes.

Let’s see how this happens. Like the hidden layer, the output layer has a weight
matrix (let’s call it U), but some models don’t include a bias vector b in the output
layer, so we’ll simplify by eliminating the bias vector in this example. The weight
matrix is multiplied by its input vector (h) to produce the intermediate output z:
z = Uh
There are n_2 output nodes, so z ∈ R^{n_2}, the weight matrix U has dimensionality U ∈ R^{n_2 × n_1}, and element U_{ij} is the weight from unit j in the hidden layer to unit i in the output layer.
However, z can’t be the output of the classifier, since it’s a vector of real-valued numbers, while what we need for classification is a vector of probabilities. There is a convenient function for normalizing a vector of real values, by which we mean converting it to a vector that encodes a probability distribution (all the numbers lie between 0 and 1 and sum to 1): the softmax function that we saw on page ?? of Chapter 5. More generally for any vector z of dimensionality d, the softmax is defined as:

softmax(z_i) = \frac{\exp(z_i)}{\sum_{j=1}^{d} \exp(z_j)} \quad 1 \leq i \leq d    (7.9)
Thus for example given a vector
z = [0.6, 1.1, −1.5, 1.2, 3.2, −1.1], (7.10)
the softmax function will normalize it to a probability distribution (shown rounded):
softmax(z) = [0.055, 0.090, 0.0067, 0.10, 0.74, 0.010] (7.11)
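A minimal sketch (ours) of the softmax computation on this example vector; subtracting max(z) before exponentiating is a standard trick for numerical stability, not part of Eq. 7.9, and it leaves the result unchanged:

import numpy as np

def softmax(z):
    # Normalize a vector of reals into a probability distribution (Eq. 7.9)
    exp_z = np.exp(z - np.max(z))  # stability trick; cancels in the ratio
    return exp_z / exp_z.sum()

z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])
print(softmax(z))        # ~[0.055, 0.090, 0.0067, 0.100, 0.738, 0.010]
print(softmax(z).sum())  # 1.0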
You may recall that we used softmax to create a probability distribution from a
vector of real-valued numbers (computed from summing weights times features) in
the multinomial version of logistic regression in Chapter 5.
That means we can think of a neural network classifier with one hidden layer
as building a vector h which is a hidden layer representation of the input, and then
running standard multinomial logistic regression on the features that the network
develops in h. By contrast, in Chapter 5 the features were mainly designed by hand
via feature templates. So a neural network is like multinomial logistic regression,
but (a) with many layers, since a deep neural network is like layer after layer of lo-
gistic regression classifiers; (b) with those intermediate layers having many possible
activation functions (tanh, ReLU, sigmoid) instead of just sigmoid (although we’ll
continue to use σ for convenience to mean any activation function); (c) rather than
forming the features by feature templates, the prior layers of the network induce the
feature representations themselves.
Here are the final equations for a feedforward network with a single hidden layer,
which takes an input vector x, outputs a probability distribution y, and is parameter-
ized by weight matrices W and U and a bias vector b:
h = σ (Wx + b)
z = Uh
y = softmax(z) (7.12)
And just to remember the shapes of all our variables: x ∈ R^{n_0}, h ∈ R^{n_1}, b ∈ R^{n_1}, W ∈ R^{n_1 × n_0}, U ∈ R^{n_2 × n_1}, and the output vector y ∈ R^{n_2}. We’ll call this network a 2-layer network (we traditionally don’t count the input layer when numbering layers, but do count the output layer). So by this terminology logistic regression is a 1-layer network.
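Putting Eq. 7.12 together, here is a sketch of the full forward pass (our own code, with arbitrary made-up dimensionalities):

import numpy as np

def forward(x, W, b, U):
    # 2-layer feedforward network of Eq. 7.12
    h = 1 / (1 + np.exp(-(W @ x + b)))  # hidden layer: h = sigmoid(Wx + b)
    z = U @ h                           # output scores: z = Uh
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()          # y = softmax(z)

rng = np.random.default_rng(1)
n0, n1, n2 = 5, 4, 3  # input, hidden, and output dimensionalities
x = rng.normal(size=n0)
W = rng.normal(size=(n1, n0))
b = rng.normal(size=n1)
U = rng.normal(size=(n2, n1))

y = forward(x, W, b, U)
print(y, y.sum())  # a probability distribution over n2 classes; sums to 1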

7.3.1 More details on feedforward networks


Let’s now set up some notation to make it easier to talk about deeper networks of
depth more than 2. We’ll use superscripts in square brackets to mean layer num-
bers, starting at 0 for the input layer. So W[1] will mean the weight matrix for the
(first) hidden layer, and b[1] will mean the bias vector for the (first) hidden layer. n_j will mean the number of units at layer j. We’ll use g(·) to stand for the activation
function, which will tend to be ReLU or tanh for intermediate layers and softmax
for output layers. We’ll use a[i] to mean the output from layer i, and z[i] to mean the
combination of previous layer output, weights and biases W[i] a[i−1] + b[i] . The 0th
layer is for inputs, so we’ll refer to the inputs x more generally as a[0] .
Thus we can re-represent our 2-layer net from Eq. 7.12 as follows:

z[1] = W[1] a[0] + b[1]
a[1] = g[1] (z[1])
z[2] = W[2] a[1] + b[2]
a[2] = g[2] (z[2])
ŷ = a[2]    (7.13)

Note that with this notation, the equations for the computation done at each layer are
the same. The algorithm for computing the forward step in an n-layer feedforward
network, given the input vector a[0] is thus simply:

for i in 1,...,n
z[i] = W[i] a[i−1] + b[i]
a[i] = g[i] (z[i] )
ŷ = a[n]
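In Python the loop might look like the following sketch (ours; weights, biases, and activations are assumed to be lists holding W[i], b[i], and g[i] for layers 1 through n):

import numpy as np

def forward_n_layer(a0, weights, biases, activations):
    # z[i] = W[i] a[i-1] + b[i];  a[i] = g[i](z[i])
    a = a0
    for W, b, g in zip(weights, biases, activations):
        a = g(W @ a + b)
    return a  # y-hat = a[n], the output of the final layer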

It’s often useful to have a name for the final set of activations right before the final softmax. So however many layers we have, we’ll generally call the unnormalized values in the final vector z[n], the vector of scores right before the final softmax, the logits (see Eq. ??).
The need for non-linear activation functions One of the reasons we use non-
linear activation functions for each layer in a neural network is that if we did not, the
resulting network is exactly equivalent to a single-layer network. Let’s see why this
is true. Imagine the first two layers of such a network of purely linear layers:

z[1] = W[1] x + b[1]
z[2] = W[2] z[1] + b[2]

We can rewrite the function that the network is computing as:

z[2] = W[2] z[1] + b[2]
     = W[2] (W[1] x + b[1]) + b[2]
     = W[2] W[1] x + W[2] b[1] + b[2]
     = W′ x + b′    (7.14)

This generalizes to any number of layers. So without non-linear activation functions, a multilayer network is just a notational variant of a single layer network with a different set of weights, and we lose all the representational power of multilayer networks.
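A quick numeric check of Eq. 7.14 (a sketch with arbitrary random matrices):

import numpy as np

rng = np.random.default_rng(2)
n0, n1, n2 = 4, 3, 2
x = rng.normal(size=n0)
W1 = rng.normal(size=(n1, n0)); b1 = rng.normal(size=n1)
W2 = rng.normal(size=(n2, n1)); b2 = rng.normal(size=n2)

two_layers = W2 @ (W1 @ x + b1) + b2        # two purely linear layers
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)  # W'x + b'
print(np.allclose(two_layers, one_layer))   # True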

Replacing the bias unit In describing networks, we will often use a slightly simplified notation that represents exactly the same function without referring to an explicit bias node b. Instead, we add a dummy node a_0 to each layer whose value will always be 1. Thus layer 0, the input layer, will have a dummy node a_0^{[0]} = 1, layer 1 will have a_0^{[1]} = 1, and so on. This dummy node still has an associated weight, and that weight represents the bias value b. For example instead of an equation like
h = σ (Wx + b) (7.15)

we’ll use:
h = σ (Wx) (7.16)

But now instead of our vector x having n_0 values, x = x_1, . . . , x_{n_0}, it will have n_0 + 1 values, with a new 0th dummy value x_0 = 1: x = x_0, . . . , x_{n_0}. And instead of computing each h_j as follows:

h_j = \sigma\left(\sum_{i=1}^{n_0} W_{ji} x_i + b_j\right)    (7.17)

we’ll instead use:

h_j = \sigma\left(\sum_{i=0}^{n_0} W_{ji} x_i\right)    (7.18)

where the value W_{j0} replaces what had been b_j. Fig. 7.9 shows a visualization.

Figure 7.9 Replacing the bias node (shown in a) with x0 (b).

We’ll continue showing the bias as b when we go over the learning algorithm
in Section 7.5, but then we’ll switch to this simplified notation without explicit bias
terms for the rest of the book.
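As a sketch of this bias-folding trick (our own code): augmenting x with a leading 1 and prepending b as column 0 of W gives the same hidden layer as the explicit-bias version:

import numpy as np

rng = np.random.default_rng(3)
n0, n1 = 4, 3
x = rng.normal(size=n0)
W = rng.normal(size=(n1, n0))
b = rng.normal(size=n1)

sigmoid = lambda z: 1 / (1 + np.exp(-z))

h_bias = sigmoid(W @ x + b)          # Eq. 7.15: explicit bias vector

x_aug = np.concatenate(([1.0], x))   # x = x0, ..., x_{n0} with x0 = 1
W_aug = np.column_stack((b, W))      # W_{j0} takes the role of b_j
h_dummy = sigmoid(W_aug @ x_aug)     # Eq. 7.16/7.18

print(np.allclose(h_bias, h_dummy))  # True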

7.4 Feedforward networks for NLP: Classification


Let’s see how to apply feedforward networks to NLP tasks! In this section we’ll
look at classification tasks like sentiment analysis; in the next section we’ll introduce
neural language modeling.
