Part 5


Regularization is incorporated by adding a penalty term to the loss function, which helps prevent overfitting. The popular forms of regularization include L1 or Lasso, L2 or Ridge, and Elastic Net.
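
A minimal Keras sketch of attaching these penalties to layers (the layer sizes and penalty factors below are illustrative choices, not from the text):

# L1 (Lasso), L2 (Ridge), and Elastic Net penalties attached to Dense layers via kernel_regularizer
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(100,),
                 kernel_regularizer=regularizers.l1(0.01)),                  # L1 / Lasso
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(0.01)),                  # L2 / Ridge
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01)),   # Elastic Net
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])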

Dropout is another very commonly used and effective form of regularization that helps prevent overfitting in neural networks.
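
A minimal sketch of dropout placed between dense layers (the 0.5 rate and layer sizes are illustrative):

# Dropout randomly zeroes a fraction of activations during training, which reduces overfitting
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(128, activation='relu', input_shape=(100,)),
    layers.Dropout(0.5),   # drop 50% of the activations of the previous layer while training
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid'),
])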

In addition to L1, L2, Elastic Net, and dropout, another technique that helps prevent overfitting is early stopping.
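
A minimal sketch of early stopping with the Keras EarlyStopping callback (the monitored metric, patience value, and the x_train/y_train arrays are assumed here for illustration):

# Stop training once the validation loss has not improved for a few epochs
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
model.fit(x_train, y_train,          # x_train and y_train are assumed to be defined elsewhere
          validation_split=0.2,
          epochs=100,
          callbacks=[early_stop])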

The evaluate() API in Keras provides us with a mechanism for evaluating the performance of our model on test data, and the predict() API helps make predictions for new data.
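
A minimal sketch of the two calls (x_test, y_test, and x_new are assumed placeholders):

loss, accuracy = model.evaluate(x_test, y_test)   # performance of the trained model on test data
predictions = model.predict(x_new)                # predicted outputs for new, unseen data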

C9: Applying Convolutions to Text

CNNs try to capture the spatial relationships in data.

If we have an image of n x n and a filter of size f x f, then the output would be a matrix of
dimensions, as follows:
(n - f + 1) x (n - f + 1)
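
A quick check of this rule, assuming valid padding and a stride of 1, with an illustrative 6 x 6 input and 3 x 3 filter:

# A 6x6 single-channel image convolved with a 3x3 filter gives a 4x4 output: 6 - 3 + 1 = 4
import numpy as np
from tensorflow.keras import layers

x = np.random.rand(1, 6, 6, 1).astype('float32')   # batch of one 6x6 image with one channel
conv = layers.Conv2D(filters=1, kernel_size=3, padding='valid')
print(conv(x).shape)                               # (1, 4, 4, 1)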

The pooling operation helps in downsampling the data so that only relevant information is
preserved.

Another thing you may have realized is that even if the data shifts somewhat, pooling allows
us to capture the information we need, irrespective of where the feature is located in the data.
This property is referred to as spatial invariance.

Global max pooling comes into effect primarily for temporal data. The global max
pooling operation can replace the flatten option in neural networks.
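
A minimal sketch of GlobalMaxPooling1D standing in for Flatten (the input shape and filter count are illustrative):

# Global max pooling keeps only the single largest value per filter across the time dimension
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv1D(32, kernel_size=3, activation='relu', input_shape=(50, 8)),
    layers.GlobalMaxPooling1D(),          # (batch, 48, 32) -> (batch, 32), replacing Flatten
    layers.Dense(1, activation='sigmoid'),
])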

For text data, we look at one-dimensional spatial relationships and leverage the Conv1D layer
for this purpose. This is similar to going through n-grams, wherein there would be overlaps in
consecutive n-gram windows.
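
A minimal sketch of a Conv1D text model in which kernel_size=3 plays the role of overlapping trigram windows (the vocabulary size, sequence length, and filter counts are illustrative):

# Conv1D slides a window of 3 embedded tokens at a time across the sequence
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=100, input_length=200),  # token id -> vector
    layers.Conv1D(filters=64, kernel_size=3, activation='relu'),          # 3-token (trigram-like) windows
    layers.MaxPooling1D(pool_size=2),                                     # downsample, keep relevant info
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation='sigmoid'),
])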

C10: Capturing Temporal Relationships in Text

With ANNs, we primarily saw that inputs are independent of one another. With
CNNs, we went one step further and tried to capture spatial relationships in the inputs by
trying to extract patterns across a set of tokens together.

Recurrent Neural Networks (RNNs) help us capture context and temporal relationships in
sequences.

Sentences can be thought of as combinations of words, such that words are spoken over
time in a sequential manner. It is essential to capture this temporal relationship in natural
language data.

With CNNs, we only looked at the immediate proximity of a word.


Every recurrent neuron takes in two inputs: one is the current or external input at that time step,
and the other is called the hidden state, which is basically the output from the previous time step.

One thing to be careful about is that we should not think of these as n different neural
networks. Instead, each of them is a snapshot of the same FNN with parameters shared across
the time steps.

Forward propagation is pretty straightforward in an RNN, whereby an input vector along
with a hidden state vector is taken as input at each time step to produce an output that is
further used as the hidden state for the next time step.
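
A minimal numpy sketch of this forward pass (the sizes and random weights are illustrative, not the book's code):

# At each time step, the current input and the previous hidden state produce a new hidden state
import numpy as np

input_dim, hidden_dim, time_steps = 4, 8, 5
W_xh = np.random.randn(hidden_dim, input_dim) * 0.1    # input-to-hidden weights (shared over time)
W_hh = np.random.randn(hidden_dim, hidden_dim) * 0.1   # hidden-to-hidden weights (shared over time)
b_h = np.zeros(hidden_dim)

xs = np.random.randn(time_steps, input_dim)            # one input vector per time step
h = np.zeros(hidden_dim)                               # initial hidden state

for t in range(time_steps):
    h = np.tanh(W_xh @ xs[t] + W_hh @ h + b_h)         # the same weights are reused at every step
    print(f"t={t}, hidden state norm={np.linalg.norm(h):.3f}")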

One of the key concepts to understand in RNNs is the process of backpropagation through
time (BPTT).

We had one output for one input. For RNNs, each token is an input, and figure 3 shows that
we need not have one output per token; instead, we can have a single output for a group of tokens.

As a result, parameters are shared across the time steps.

How do we backpropagate in this scenario?

Since the parameters are shared across the time steps, the gradient calculated at each of the
time steps would not only be dependent on the computations of the present time step but also
on the previous time steps. Essentially, this can be thought of as the same neurons firing
differently across various points in time.

Why did we sum up the weight corrections at each time step and apply them all at once
instead of making the corrections at each time step?

This is because, during the forward pass at each time step for an input, the weight was the
same. If we computed the gradient at time step t and applied the changes to the weights
there and then, the weights at time step t-1 would be different and the error calculation
would be wrong since, during the forward pass, we had the same weights at every time
step. If we had updated the weights at each time step, then while computing the gradient we
would simply have penalized the weights for something they did not do at all.
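
A toy scalar-RNN sketch of this idea: the gradient contribution of each time step is accumulated for the shared weight, and the weight is updated only once after the whole sequence (the numbers and the loss are illustrative):

# Forward pass: h_t = w * (x_t + h_{t-1}) with the same w at every step
w, lr = 0.5, 0.1                        # shared recurrent weight and learning rate
xs, target = [1.0, 0.5, -0.3], 0.2      # toy input sequence and a target for the final output

hs = [0.0]                              # hs[t] is the hidden state before step t
for x in xs:
    hs.append(w * (x + hs[-1]))
loss = 0.5 * (hs[-1] - target) ** 2

# Backward pass (BPTT): sum dL/dw over the time steps instead of updating w mid-sequence
grad_w, grad_h = 0.0, hs[-1] - target
for t in reversed(range(len(xs))):
    grad_w += grad_h * (xs[t] + hs[t])  # contribution of time step t to the shared weight
    grad_h = grad_h * w                 # propagate the error one step back through time

w -= lr * grad_w                        # a single update using the summed corrections
print(f"loss={loss:.4f}, accumulated gradient={grad_w:.4f}, updated w={w:.4f}")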

Sequences need not always be at the word level. Characters can be used as input sequences as
well.
The exploding gradient problem occurs when large error gradients pile up and cause huge
updates to the weights in our network. On the other hand, when the values of these gradients
are too small, they effectively prevent the weights from getting updated in a network. This is
called the vanishing gradient problem.

One technique for preventing the exploding gradient problem is called gradient clipping.
As part of gradient clipping, the gradient is capped at a maximum value.
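
A minimal sketch of gradient clipping in Keras, where optimizers accept clipvalue or clipnorm arguments (the thresholds are illustrative; model is assumed to be defined elsewhere):

from tensorflow.keras.optimizers import Adam

opt_by_value = Adam(learning_rate=1e-3, clipvalue=1.0)  # cap each gradient element to [-1, 1]
opt_by_norm = Adam(learning_rate=1e-3, clipnorm=1.0)    # rescale gradients whose norm exceeds 1
model.compile(optimizer=opt_by_norm, loss='categorical_crossentropy')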

Different flavors of RNN


One-to-One, One-to-Many, Many-to-One, and Many-to-Many (Many-to-Many RNNs can take two
forms, depending on whether or not the size of the input equals the size of the output).

Carrying relationships both ways using bidirectional RNNs:

Let's look at the following two sentences:

The boy named Harry became the greatest wizard.


The boy named Harry became a Duke: the Duke of Sussex.

The first sentence talks about the fictional character Harry Potter created by author J.K.
Rowling, whereas the second sentence talks about Prince Harry from the United Kingdom.
Until we arrive at the word Harry, both the sentences are exactly the same: The boy named
Harry. Using a simple RNN, we cannot infer much about Harry from the words before its
occurrence. Once we see the latter half of the sentence, we know who's being talked about:
the wizard or the prince. It would be good if, using an RNN architecture, we could carry
things from the end as well to infer things at a point in time. Bidirectional neural networks
help us in this situation.

Bidirectional RNNs, as shown in the following figure, are essentially two independent
RNNs such that one of them processes the inputs in the correct time order, whereas the
other processes the inputs in the reverse time order. The outputs for these two networks are
concatenated at every time step. This formation allows a network to have information from
both directions at every time step.
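
A minimal sketch using the Keras Bidirectional wrapper (the sizes are illustrative):

# One RNN reads the sequence forward, the other backward; their outputs are concatenated
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=64, input_length=100),
    layers.Bidirectional(layers.SimpleRNN(32)),   # output has 64 units: 32 forward + 32 backward
    layers.Dense(1, activation='sigmoid'),
])
model.summary()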

RNN helps capture sequential information and temporal relationships by combining previous
outputs with present inputs.

The major problem associated with RNNs is that they suffer in terms of capturing and
making sense of long-term dependencies.

LSTM cells use the concept of state or memory to retain long-term dependencies. At every
stage, it is decided what to keep in memory and what to discard. All of this is done using
gates.

The input to an LSTM cell, as with RNNs, is a concatenation of the input for that time step
and the output of the previous time step. These values are passed on to the gates in the
LSTM cell, which are nothing but FNNs along with some form of activation function.
These gates are referred to as the forget gate, input gate, and output gate. The neural
networks in each of these gates get trained and allow the signal to flow through them into
the memory in different amounts. They decide what information should be remembered,
forgotten, or discarded at each step.

Forget Gate

The forget gate's job is to decide how much of the information should be removed from
memory.

It is as important to understand what should be forgotten as it is to understand what should be
remembered. Think of the following example:

Leonardo is a good actor. He won at the Oscars. Brad is a good actor too.

Initially, our cell should remember that Leonardo is being talked about. However, as soon
as we arrive at the third sentence, it should now remember that Brad is being talked about,
and it should discard the information about Leonardo from its memory. Basically, our network
should have the ability to forget long-term dependencies as soon as new dependencies
worth remembering arrive in our data. Forget gates help us with exactly this by allowing
space for new dependencies.

The values from the forget gate are multiplied with the values in the memory cell in order
to maintain only relevant information from the past.

Input Gate

We should next understand what we need to remember and how much of it should be
remembered. This is exactly what the input gate does for us.

Think of the following example:

Ronaldo is a good football player. Messi is another good player.

As soon as we arrive at the second sentence, the forget gate will help us forget about
Ronaldo, but it is the job of the input gate to ensure that we now remember about Messi.

The input gate has two parts, which simultaneously help in figuring out what is to be
remembered and how much of it needs to be remembered. Let's understand the functioning
of the two parts next.

Part 1 in the input gate uses a sigmoid activation function to pinpoint which part of the input
values needs to be remembered by creating a sort of mask with values between 0 and 1. A
value of 0 would indicate that nothing is worth remembering from the inputs of this state,
whereas a value of 1 would indicate that everything from this input state must be remembered.

Part 2 uses a tanh activation function to help us figure out what is potentially the relevant
information from the present state that the memory cell can get updated with. This part is
also often referred to as the candidate vector since this vector holds the values that the
memory cell might get updated with. The output ranges between -1 and 1 from this FNN.

An element-wise multiplication is performed between the outputs from part 1 and part 2.
Essentially, the values from part 1 tell us how relevant the various components of part 2 are.
The resultant output is added to the memory vector, thus updating the information in the
memory cell.

Output Gate

The job of the output gate is to understand which bits of information in the current step
should be sent across as output from the cell.

There are two things that happen at this stage in the LSTM cell, as follows:
1. First, the output gate receives the input that was received by the LSTM cell initially, and
these inputs are applied to the FNN in the output gate. Thereafter, the sigmoid activation
function is applied to the computed values to bring the output into the range of 0 to 1.

2. Second, the memory at this juncture is already updated based on what should have been
forgotten and what should have been remembered from the computations performed at the
forget gate and input gate stages. This memory state is now passed through a tanh activation
function at this stage to bring the values between -1 and 1.

Finally, the tanh-applied values from memory along with the sigmoid-applied values from
the output gate are multiplied element-wise to get the final output from this LSTM cell in
the network. This value can be taken as output and can also be sent across as the hidden
state for the next LSTM time step.
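
A minimal sketch of an LSTM model in Keras; the forget, input, and output gates described above are handled internally by the LSTM layer, and we only choose the number of units (the sizes are illustrative):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=64, input_length=100),
    layers.LSTM(64),                       # one LSTM cell unrolled over the 100 time steps
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])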

The backpropagation in LSTMs works similarly to RNNs. However, unlike RNNs, LSTMs
largely avoid the problem of vanishing or exploding gradients, wherein the gradients either
become exceedingly small or large. This is primarily because of the memory component we
introduced in LSTMs.

GRUs

LSTMs are huge networks with a lot of parameters. Consequently, we need to update a large
number of parameters, which is highly computationally expensive.

GRUs use only two gates instead of three, as we used in LSTMs. They combine the forget
gate and the candidate-choice part in the input gate into one gate, called the update gate.
The other gate is the reset gate, which decides how the memory should get updated with
the newly computed information. Based on the output of these two gates, it is decided what
to send across as the output from this cell and how the hidden state is to be updated. This is
done via using something called a content state, which holds the new information. As a
result, the number of parameters in the network is drastically reduced.
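
A minimal sketch with the LSTM layer swapped for a GRU; comparing the summaries of the two models shows the GRU's smaller parameter count (the sizes are illustrative):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=64, input_length=100),
    layers.GRU(64),                        # update and reset gates are handled internally
    layers.Dense(1, activation='sigmoid'),
])
model.summary()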

Stacked LSTMs

Stacked LSTMs follow an architecture similar to deep RNNs. During the discussion on deep
RNNs, we mentioned that stacking RNN layers one above the other helps the network
capture highly complex patterns and relationships. The same idea is used when building
stacked LSTMs, which can help us capture highly complex patterns from data. Each LSTM
layer in a stacked LSTM model has its own gates and memory vector. Stacked LSTMs are
very expensive in terms of computational requirements.
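
A minimal sketch of a stacked LSTM; every layer except the last uses return_sequences=True so that the next LSTM layer receives the full sequence of hidden states (the sizes are illustrative):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=64, input_length=100),
    layers.LSTM(64, return_sequences=True),   # pass a hidden state for every time step upward
    layers.LSTM(32),                          # the final LSTM layer returns only the last state
    layers.Dense(1, activation='sigmoid'),
])
model.summary()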

C11: State of the Art in NLP
