CHAPTER 7
Neural Networks
“[M]achines of this character can behave in a very complicated manner when
the number of units is large.”
Alan Turing (1948) “Intelligent Machines”, page 6
7.1 Units
The building block of a neural network is a single computational unit. A unit takes
a set of real valued numbers as input, performs some computation on them, and
produces an output.
At its heart, a neural unit is taking a weighted sum of its inputs, with one additional term in the sum called a bias term. Given a set of inputs x1...xn, a unit has a set of corresponding weights w1...wn and a bias b, so the weighted sum z can be represented as:

z = b + Σ_i w_i x_i    (7.1)
Often it's more convenient to express this weighted sum using vector notation; recall from linear algebra that a vector is, at heart, just a list or array of numbers. Thus we'll talk about z in terms of a weight vector w, a scalar bias b, and an input vector x, and we'll replace the sum with the convenient dot product:

z = w · x + b    (7.2)

As defined in Eq. 7.2, z is just a real valued number.
Finally, instead of using z, a linear function of x, as the output, neural units apply a non-linear function f to z. We will refer to the output of this function as the activation value for the unit, a. Since we are just modeling a single unit, the activation for the node is in fact the final output of the network, which we'll generally call y. So the value y is defined as:

y = a = f(z)
We'll discuss three popular non-linear functions f below (the sigmoid, the tanh, and the rectified linear unit or ReLU) but it's pedagogically convenient to start with the sigmoid function since we saw it in Chapter 5:

y = σ(z) = 1/(1 + e^{−z})    (7.3)
The sigmoid (shown in Fig. 7.1) has a number of advantages; it maps the output
into the range (0, 1), which is useful in squashing outliers toward 0 or 1. And it’s
differentiable, which as we saw in Section ?? will be handy for learning.
Figure 7.1 The sigmoid function takes a real value and maps it to the range (0, 1). It is
nearly linear around 0 but outlier values get squashed toward 0 or 1.
Substituting Eq. 7.2 into Eq. 7.3 gives us the output of a neural unit:

y = σ(w · x + b) = 1/(1 + exp(−(w · x + b)))    (7.4)
Fig. 7.2 shows a final schematic of a basic neural unit. In this example the unit takes 3 input values x1, x2, and x3, multiplies each value by a weight (w1, w2, and w3, respectively), sums them together with a bias term b, and then passes the resulting sum through a sigmoid function to produce a number between 0 and 1.
Figure 7.2 A neural unit, taking 3 inputs x1 , x2 , and x3 (and a bias b that we represent as a
weight for an input clamped at +1) and producing an output y. We include some convenient
intermediate variables: the output of the summation, z, and the output of the sigmoid, a. In
this case the output of the unit y is the same as a, but in deeper networks we’ll reserve y to
mean the final output of the entire network, leaving a as the activation of an individual node.
Let's walk through an example just to get an intuition. Let's suppose we have a unit with the following weight vector and bias:

w = [0.2, 0.3, 0.9]
b = 0.5

What would this unit do given the input vector x = [0.5, 0.6, 0.1]? The resulting output y would be:

y = σ(w · x + b) = 1/(1 + e^{−(w·x+b)}) = 1/(1 + e^{−(0.5·0.2 + 0.6·0.3 + 0.1·0.9 + 0.5)}) = 1/(1 + e^{−0.87}) ≈ 0.70
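A quick way to verify this arithmetic is to compute the unit directly; here is a minimal NumPy sketch (the function and variable names are just illustrative):

import numpy as np

def sigmoid(z):
    """Map a real value into the range (0, 1), per Eq. 7.3."""
    return 1 / (1 + np.exp(-z))

w = np.array([0.2, 0.3, 0.9])   # weight vector
b = 0.5                         # bias
x = np.array([0.5, 0.6, 0.1])   # input vector

z = w @ x + b                   # weighted sum plus bias: 0.87
y = sigmoid(z)                  # activation: approximately 0.70
print(z, y)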
In practice, the sigmoid is not commonly used as an activation function. A function that is very similar but almost always better is the tanh function shown in Fig. 7.3a; tanh is a variant of the sigmoid that ranges from −1 to +1:

y = tanh(z) = (e^z − e^{−z})/(e^z + e^{−z})    (7.5)
The simplest activation function, and perhaps the most commonly used, is the rectified linear unit, also called the ReLU, shown in Fig. 7.3b. It's just the same as z when z is positive, and 0 otherwise:

y = max(z, 0)    (7.6)
Figure 7.3 The tanh and ReLU activation functions: (a) tanh; (b) ReLU.

These activation functions have different properties that make them useful for different language applications or network architectures. For example, the tanh function has the nice properties of being smoothly differentiable and mapping outlier values toward the mean. The rectifier function, on the other hand, has nice properties that result from it being very close to linear. In the sigmoid or tanh functions, very high values of z result in values of y that are saturated, i.e., extremely close to 1, and have derivatives very close to 0. Zero derivatives cause problems for learning, because as we'll see in Section 7.5, we'll train networks by propagating an error signal backwards, multiplying gradients (partial derivatives) from each layer of the network; gradients that are almost 0 cause the error signal to get smaller and smaller until it is too small to be used for training, a problem called the vanishing gradient problem. Rectifiers don't have this problem, since the derivative of ReLU for high values of z is 1 rather than very close to 0.
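To see saturation numerically, here is a minimal sketch comparing the derivatives of the three activation functions at one arbitrarily chosen high value of z:

import numpy as np

z = 10.0                                 # an arbitrary "very high" value of z

s = 1 / (1 + np.exp(-z))                 # sigmoid(z), nearly 1
d_sigmoid = s * (1 - s)                  # about 4.5e-05: saturated
d_tanh = 1 - np.tanh(z) ** 2             # about 8.2e-09: saturated
d_relu = 1.0 if z > 0 else 0.0           # exactly 1: no saturation

print(d_sigmoid, d_tanh, d_relu)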
7.2 The XOR Problem

Early in the history of neural networks it was realized that a single neural unit cannot compute some very simple functions of its input. Consider the task of computing elementary logical functions of two inputs, like AND, OR, and XOR. As a reminder, here are the truth tables for those functions:

AND            OR             XOR
x1 x2 y        x1 x2 y        x1 x2 y
0  0  0        0  0  0        0  0  0
0  1  0        0  1  1        0  1  1
1  0  0        1  0  1        1  0  1
1  1  1        1  1  1        1  1  0
This example was first shown for the perceptron, a very simple neural unit that has a binary output and a simple step function as its non-linear activation function. The output y of a perceptron is 0 or 1, and is computed as follows (using the same weight w, input x, and bias b as in Eq. 7.2):

y = 0, if w · x + b ≤ 0
y = 1, if w · x + b > 0    (7.7)
It’s very easy to build a perceptron that can compute the logical AND and OR
functions of its binary inputs; Fig. 7.4 shows the necessary weights.
Figure 7.4 The weights w and bias b for perceptrons for computing logical functions. The
inputs are shown as x1 and x2 and the bias as a special node with value +1 which is multiplied
with the bias weight b. (a) logical AND, with weights w1 = 1 and w2 = 1 and bias weight
b = −1. (b) logical OR, with weights w1 = 1 and w2 = 1 and bias weight b = 0. These
weights/biases are just one of an infinite number of possible sets of weights and biases that
would implement the functions.
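As a sanity check, here is a minimal sketch implementing Eq. 7.7 with the weights and biases from Fig. 7.4 and verifying both truth tables:

import numpy as np

def perceptron(w, b, x):
    """Binary step unit from Eq. 7.7: output 1 iff w·x + b > 0."""
    return int(np.dot(w, x) + b > 0)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y_and = perceptron([1, 1], -1, x)    # Fig. 7.4a: AND
    y_or = perceptron([1, 1], 0, x)      # Fig. 7.4b: OR
    print(x, "AND:", y_and, "OR:", y_or)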
It turns out, however, that it’s not possible to build a perceptron to compute
logical XOR! (It’s worth spending a moment to give it a try!)
The intuition behind this important result relies on understanding that a perceptron is a linear classifier. For a two-dimensional input x1 and x2, the perceptron equation, w1 x1 + w2 x2 + b = 0, is the equation of a line. (We can see this by putting it in the standard linear format: x2 = (−w1/w2) x1 + (−b/w2).) This line acts as a decision boundary in two-dimensional space: the output 0 is assigned to all inputs lying on one side of the line, and the output 1 to all inputs lying on the other side. If we had more than 2 inputs, the decision boundary becomes a hyperplane instead of a line, but the idea is the same, separating the space into two categories.
Fig. 7.5 shows the possible logical inputs (00, 01, 10, and 11) and the line drawn by one possible set of parameters for an AND and an OR classifier. Notice that there is simply no way to draw a line that separates the positive cases of XOR (01 and 10) from the negative cases (00 and 11). We say that XOR is not a linearly separable function. Of course we could draw a boundary with a curve, or some other function, but not a single line.
Figure 7.5 The functions AND, OR, and XOR, represented with input x1 on the x-axis and input x2 on the
y-axis. Filled circles represent perceptron outputs of 1, and white circles perceptron outputs of 0. There is no
way to draw a line that correctly separates the two categories for XOR. Figure styled after Russell and Norvig
(2002).
While the XOR function cannot be computed by a single perceptron, it can be computed by a layered network of units; Fig. 7.6 shows one such solution, using ReLU as the activation function. Let's walk through what happens with the input x = [0, 0]. Multiplying each input by the appropriate weight and adding the biases gives the pre-activation vector [0, −1]; applying the ReLU then gives the hidden layer output h = [0, 0], and multiplying h by the output weights [1, −2] gives y = 0, the correct value of XOR for this input.
Figure 7.6 XOR solution after Goodfellow et al. (2016). There are three ReLU units, in
two layers; we’ve called them h1 , h2 (h for “hidden layer”) and y1 . As before, the numbers
on the arrows represent the weights w for each unit, and we represent the bias b as a weight
on a unit clamped to +1, with the bias weights/units in gray.
It’s also instructive to look at the intermediate results, the outputs of the two
hidden nodes h1 and h2 . We showed in the previous paragraph that the h vector for
the inputs x = [0, 0] was [0, 0]. Fig. 7.7b shows the values of the h layer for all
4 inputs. Notice that hidden representations of the two input points x = [0, 1] and
x = [1, 0] (the two cases with XOR output = 1) are merged to the single point h =
[1, 0]. The merger makes it easy to linearly separate the positive and negative cases
of XOR. In other words, we can view the hidden layer of the network as forming a
representation of the input.
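Here is a minimal sketch of the forward pass for the network in Fig. 7.6, printing the hidden vector h and the output y for all four inputs; note that the inputs [0, 1] and [1, 0] produce the same h:

import numpy as np

# Weights and biases from Fig. 7.6 (after Goodfellow et al. 2016)
W = np.array([[1, 1],     # weights into h1
              [1, 1]])    # weights into h2
c = np.array([0, -1])     # biases for h1 and h2
u = np.array([1, -2])     # weights into y1 (output bias is 0)

relu = lambda z: np.maximum(z, 0)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = relu(W @ np.array(x) + c)   # hidden representation
    y = relu(u @ h)                 # XOR output
    print("x =", x, " h =", h, " y =", y)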
In this example we just stipulated the weights in Fig. 7.6. But for real examples
the weights for neural networks are learned automatically using the error backprop-
agation algorithm to be introduced in Section 7.5. That means the hidden layers will
learn to form useful representations. This intuition, that neural networks can auto-
matically learn useful representations of the input, is one of their key advantages,
and one that we will return to again and again in later chapters.
Figure 7.7 The hidden layer forming a new representation of the input: (a) the original input space (x1, x2); (b) the hidden layer space (h1, h2), in which the input points [0, 1] and [1, 0] have been collapsed to a single point.
7.3 Feedforward Neural Networks

Let's now walk through a slightly more formal presentation of the simplest kind of neural network, the feedforward network. A feedforward network is a multilayer network in which the units are connected with no cycles: the outputs from units in each layer are passed to units in the next layer, and no outputs are passed back to lower layers. Fig. 7.8 shows a simple example.
Figure 7.8 A simple 2-layer feedforward network, with one hidden layer, one output layer,
and one input layer (the input layer is usually not counted when enumerating layers).
Each hidden unit has a weight vector and a bias. By combining the weight vector and bias for each hidden unit into a single weight matrix W and a single bias vector b for the whole layer, the hidden layer computation can be done very efficiently with simple matrix operations. In fact, the computation has only three steps: multiplying the weight matrix by the input vector x, adding the bias vector b, and applying the activation function g (such as the sigmoid, tanh, or ReLU activation function defined above).
The output of the hidden layer, the vector h, is thus the following (for this exam-
ple we’ll use the sigmoid function σ as our activation function):
h = σ (Wx + b) (7.8)
Notice that we’re applying the σ function here to a vector, while in Eq. 7.3 it was
applied to a scalar. We’re thus allowing σ (·), and indeed any activation function
g(·), to apply to a vector element-wise, so g[z1 , z2 , z3 ] = [g(z1 ), g(z2 ), g(z3 )].
Let’s introduce some constants to represent the dimensionalities of these vectors
and matrices. We’ll refer to the input layer as layer 0 of the network, and have n0
represent the number of inputs, so x is a vector of real numbers of dimension n0 ,
or more formally x ∈ Rn0 , a column vector of dimensionality [n0 , 1]. Let’s call the
hidden layer layer 1 and the output layer layer 2. The hidden layer has dimensional-
ity n1 , so h ∈ Rn1 and also b ∈ Rn1 (since each hidden unit can take a different bias
value). And the weight matrix W has dimensionality W ∈ Rn1 ×n0 , i.e. [n1 , n0 ].
Take a moment to convince yourself that the matrix multiplication in Eq. 7.8 will compute the value of each h_j as σ(Σ_{i=1}^{n0} W_ji x_i + b_j).
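To convince yourself numerically, here is a minimal sketch with arbitrary illustrative sizes (n0 = 3, n1 = 4) checking that the matrix form of Eq. 7.8 matches the per-unit sums:

import numpy as np

rng = np.random.default_rng(0)
n0, n1 = 3, 4                         # illustrative layer sizes
W = rng.normal(size=(n1, n0))         # weight matrix, shape [n1, n0]
b = rng.normal(size=n1)               # bias vector, shape [n1]
x = rng.normal(size=n0)               # input vector, shape [n0]

sigmoid = lambda z: 1 / (1 + np.exp(-z))

h = sigmoid(W @ x + b)                # Eq. 7.8: all hidden units at once

# The same computation, one hidden unit j at a time:
h_loop = np.array([sigmoid(sum(W[j, i] * x[i] for i in range(n0)) + b[j])
                   for j in range(n1)])
assert np.allclose(h, h_loop)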
As we saw in Section 7.2, the resulting value h (for hidden but also for hypoth-
esis) forms a representation of the input. The role of the output layer is to take
this new representation h and compute a final output. This output could be a real-
valued number, but in many cases the goal of the network is to make some sort of
classification decision, and so we will focus on the case of classification.
If we are doing a binary task like sentiment classification, we might have a sin-
gle output node, and its scalar value y is the probability of positive versus negative
sentiment. If we are doing multinomial classification, such as assigning a part-of-
speech tag, we might have one output node for each potential part-of-speech, whose
output value is the probability of that part-of-speech, and the values of all the output
nodes must sum to one. The output layer is thus a vector y that gives a probability
distribution across the output nodes.
Let’s see how this happens. Like the hidden layer, the output layer has a weight
matrix (let’s call it U), but some models don’t include a bias vector b in the output
layer, so we’ll simplify by eliminating the bias vector in this example. The weight
matrix is multiplied by its input vector (h) to produce the intermediate output z:
z = Uh
There are n2 output nodes, so z ∈ Rn2 , weight matrix U has dimensionality U ∈
Rn2 ×n1 , and element Ui j is the weight from unit j in the hidden layer to unit i in the
output layer.
However, z can't be the output of the classifier, since it's a vector of real-valued numbers, while what we need for classification is a vector of probabilities. There is a convenient function for normalizing a vector of real values, by which we mean converting it to a vector that encodes a probability distribution (all the numbers lie between 0 and 1 and sum to 1): the softmax function that we saw on page ?? of Chapter 5. More generally for any vector z of dimensionality d, the softmax is defined as:

softmax(z_i) = exp(z_i) / Σ_{j=1}^{d} exp(z_j)    1 ≤ i ≤ d    (7.9)
Thus for example given a vector
z = [0.6, 1.1, −1.5, 1.2, 3.2, −1.1], (7.10)
the softmax function will normalize it to a probability distribution (shown rounded):
softmax(z) = [0.055, 0.090, 0.0067, 0.10, 0.74, 0.010] (7.11)
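A direct implementation of Eq. 7.9 reproduces these rounded values; subtracting the maximum before exponentiating (a standard trick, not part of the definition) avoids numerical overflow:

import numpy as np

def softmax(z):
    """Normalize a real-valued vector into a probability distribution (Eq. 7.9)."""
    exp_z = np.exp(z - np.max(z))   # shift by max for numerical stability
    return exp_z / exp_z.sum()

z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])
print(softmax(z))        # [0.055 0.090 0.0067 0.100 0.738 0.010] (rounded)
print(softmax(z).sum())  # 1.0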
You may recall that we used softmax to create a probability distribution from a
vector of real-valued numbers (computed from summing weights times features) in
the multinomial version of logistic regression in Chapter 5.
That means we can think of a neural network classifier with one hidden layer
as building a vector h which is a hidden layer representation of the input, and then
running standard multinomial logistic regression on the features that the network
develops in h. By contrast, in Chapter 5 the features were mainly designed by hand
via feature templates. So a neural network is like multinomial logistic regression,
but (a) with many layers, since a deep neural network is like layer after layer of lo-
gistic regression classifiers; (b) with those intermediate layers having many possible
activation functions (tanh, ReLU, sigmoid) instead of just sigmoid (although we’ll
continue to use σ for convenience to mean any activation function); (c) rather than
forming the features by feature templates, the prior layers of the network induce the
feature representations themselves.
Here are the final equations for a feedforward network with a single hidden layer,
which takes an input vector x, outputs a probability distribution y, and is parameter-
ized by weight matrices W and U and a bias vector b:
h = σ (Wx + b)
z = Uh
y = softmax(z) (7.12)
And just to remember the shapes of all our variables, x ∈ Rn0 , h ∈ Rn1 , b ∈ Rn1 ,
W ∈ Rn1 ×n0 , U ∈ Rn2 ×n1 , and the output vector y ∈ Rn2 . We’ll call this network a 2-
layer network (we traditionally don’t count the input layer when numbering layers,
but do count the output layer). So by this terminology logistic regression is a 1-layer
network.
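Putting Eq. 7.12 together, here is a minimal sketch of the full forward pass with arbitrary illustrative sizes n0 = 4, n1 = 3, n2 = 2 and random parameters:

import numpy as np

rng = np.random.default_rng(0)
n0, n1, n2 = 4, 3, 2                  # illustrative layer sizes

W = rng.normal(size=(n1, n0))         # hidden-layer weights
b = rng.normal(size=n1)               # hidden-layer bias
U = rng.normal(size=(n2, n1))         # output-layer weights (no output bias)

sigmoid = lambda z: 1 / (1 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

x = rng.normal(size=n0)               # input vector

h = sigmoid(W @ x + b)                # hidden representation (Eq. 7.8)
z = U @ h                             # intermediate output
y = softmax(z)                        # probability distribution over classes
print(y, y.sum())                     # y sums to 1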
We can generalize this to networks with more layers by writing a[i] for the output of layer i, W[i] and b[i] for the weight matrix and bias vector of layer i, and g[i](·) for the activation function of layer i, with the input x as a[0]. Note that with this notation, the equations for the computation done at each layer are the same. The algorithm for computing the forward step in an n-layer feedforward network, given the input vector a[0], is thus simply:
for i in 1,...,n
z[i] = W[i] a[i−1] + b[i]
a[i] = g[i] (z[i] )
ŷ = a[n]
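This loop translates directly into Python; here is a minimal sketch assuming the per-layer weights, biases, and activation functions are kept in lists (the names Ws, bs, gs are just illustrative):

import numpy as np

def forward(a, Ws, bs, gs):
    """Forward pass of an n-layer feedforward network: a is the input a[0]."""
    for W, b, g in zip(Ws, bs, gs):   # one iteration per layer i = 1..n
        a = g(W @ a + b)
    return a                          # y_hat = a[n]

rng = np.random.default_rng(0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

# A 2-layer instance: sigmoid hidden layer, softmax output, zero output bias
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
bs = [rng.normal(size=3), np.zeros(2)]
gs = [sigmoid, softmax]

y_hat = forward(rng.normal(size=4), Ws, bs, gs)
print(y_hat)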
It's often useful to have a name for the final set of activations right before the final softmax. So however many layers we have, we'll generally call the unnormalized values in the final vector z[n], the vector of scores right before the final softmax, the logits (see Eq. ??).
The need for non-linear activation functions One of the reasons we use non-linear activation functions for each layer in a neural network is that if we did not, the resulting network would be exactly equivalent to a single-layer network. Let's see why this is true. Imagine the first two layers of such a network of purely linear layers:

z[1] = W[1] x + b[1]
z[2] = W[2] z[1] + b[2]    (7.13)

Substituting the first equation into the second, we can rewrite the function that the network is computing as:

z[2] = W[2] (W[1] x + b[1]) + b[2] = (W[2] W[1]) x + (W[2] b[1] + b[2]) = W′ x + b′    (7.14)

This generalizes to any number of layers: without non-linear activation functions, a multilayer network is just a notational variant of a single-layer network with a different set of weights.
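A quick numerical check of this derivation, with arbitrary random layers:

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)   # layer 1 (linear)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)   # layer 2 (linear)
x = rng.normal(size=4)

two_layers = W2 @ (W1 @ x + b1) + b2         # two purely linear layers
W_prime = W2 @ W1                            # collapsed weights W'
b_prime = W2 @ b1 + b2                       # collapsed bias b'
assert np.allclose(two_layers, W_prime @ x + b_prime)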
Replacing the bias unit In describing networks, we will often use a slightly simplified notation that represents exactly the same function without referring to an explicit bias node b. Instead, we add a dummy node a0 to each layer whose value will always be 1. Thus layer 0, the input layer, will have a dummy node a0[0] = 1, layer 1 will have a0[1] = 1, and so on. This dummy node still has an associated weight, and that weight represents the bias value b. For example instead of an equation like
h = σ (Wx + b) (7.15)
we’ll use:
h = σ (Wx) (7.16)
But now instead of our vector x having n0 values: x = x1, ..., xn0, it will have n0 + 1 values, with a new 0th dummy value x0 = 1: x = x0, ..., xn0. And instead of computing each h_j as

h_j = σ(Σ_{i=1}^{n0} W_ji x_i + b_j)    (7.17)

we'll instead compute

h_j = σ(Σ_{i=0}^{n0} W_ji x_i)    (7.18)

where the value W_j0 replaces what had been b_j. Fig. 7.9 shows a visualization.
Figure 7.9 Replacing the bias node (shown in a) with x0 (b).
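Here is a minimal sketch of this trick: prepend a dummy x0 = 1 to the input and a column holding b to W, and the two forms compute the same h:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))            # weights, shape [n1, n0]
b = rng.normal(size=3)                 # bias vector
x = rng.normal(size=4)                 # input with n0 values

sigmoid = lambda z: 1 / (1 + np.exp(-z))

h_explicit = sigmoid(W @ x + b)        # Eq. 7.15: explicit bias

W_aug = np.hstack([b[:, None], W])     # bias weights become column 0 of W
x_aug = np.concatenate([[1.0], x])     # dummy input x0 = 1
h_folded = sigmoid(W_aug @ x_aug)      # Eq. 7.16: no explicit bias term

assert np.allclose(h_explicit, h_folded)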
We’ll continue showing the bias as b when we go over the learning algorithm
in Section 7.5, but then we’ll switch to this simplified notation without explicit bias
terms for the rest of the book.