DEEP LEARNING
Module 4 Part I
RECURRENT NEURAL NETWORKS
In a text setting, the vector x_t will contain the one-hot encoded word at the t-th time stamp.
In one-hot encoding, we have a vector of length equal to the lexicon size, and the component for the relevant word has a value of 1. All other components are 0.
Successive words are dependent on one another, so the model needs to capture these sequential dependencies.
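As a small illustration (not from the original slides), here is a minimal NumPy sketch of one-hot encoding against a toy lexicon; the lexicon and sentence below are made-up examples.

# Minimal one-hot encoding sketch (illustrative only; toy lexicon assumed).
import numpy as np

lexicon = ["the", "cat", "sat", "on", "mat"]          # lexicon of size 5
word_to_index = {w: i for i, w in enumerate(lexicon)}

def one_hot(word):
    """Return a vector of lexicon size with a 1 at the word's component."""
    v = np.zeros(len(lexicon))
    v[word_to_index[word]] = 1.0
    return v

sentence = ["the", "cat", "sat"]
X = np.stack([one_hot(w) for w in sentence])   # x_t for t = 1..3, shape (3, 5)
print(X)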
The key point is that there is an input x_t at each time stamp, and a hidden state h_t that changes at each time stamp as new data points arrive. Each time stamp also has an output value y_t.
When used in the text setting of predicting the next word, this approach is referred to as language modeling.
The hidden state at time t is given by a function of the input vector at time t and the hidden vector at time (t − 1):

  h_t = f(h_{t−1}, x_t),    y_t = g(h_t)
Note that the functions f(·) and g(·) are the same at each time
stamp.
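A hedged sketch of these two relations follows. The slides do not fix the exact forms of f(·) and g(·), so a tanh update and a softmax output, which are common choices, are assumed here, along with toy dimensions and random weights.

# Sketch of h_t = f(h_{t-1}, x_t) and y_t = g(h_t), assuming tanh / softmax.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 5, 8, 5                  # toy dimensions (assumed)
Wxh = rng.normal(0, 0.1, (d_hid, d_in))       # input-to-hidden weights
Whh = rng.normal(0, 0.1, (d_hid, d_hid))      # hidden-to-hidden weights
Why = rng.normal(0, 0.1, (d_out, d_hid))      # hidden-to-output weights

def f(h_prev, x):
    # hidden state update: the same weights at every time stamp
    return np.tanh(Whh @ h_prev + Wxh @ x)

def g(h):
    # output: softmax over the lexicon
    z = Why @ h
    e = np.exp(z - z.max())
    return e / e.sum()

h_prev = np.zeros(d_hid)                      # h_0 fixed to a constant vector
x_t = np.eye(d_in)[2]                         # a one-hot input at time t
h_t = f(h_prev, x_t)
y_t = g(h_t)                                  # probability over the next word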
A key point here is the presence of the self-loop in Figure 1.17(a), which will cause the hidden state of the neural network to change after the input of each x_t.
In practice, one only works with sequences of finite length, and it makes sense to unfurl the loop into a "time-layered" network that looks more like a feed-forward network. This network is shown in Figure 1.17(b).
The weight matrices of the connections are shared by multiple connections in the time-layered network to ensure that the same function is used at each time stamp. This sharing is the key to the domain-specific insights that are learned by the network.
The backpropagation algorithm takes the sharing and temporal length into account when updating the weights during the learning process. This special type of backpropagation algorithm is referred to as backpropagation through time (BPTT).
Because of this recursive nature, the recurrent network has the ability to compute a function of variable-length inputs.
For example, starting at h_0, which is typically fixed to some constant vector (such as the zero vector), we have h_1 = f(h_0, x_1) and h_2 = f(f(h_0, x_1), x_2). In general, h_t is a function of x_1 . . . x_t, so one can write

  h_t = F_t(x_1, x_2, . . . , x_t)

Note that the function F_t(·) varies with the value of t. Such an approach is particularly useful for variable-length inputs like text sentences.
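The sketch below illustrates this composition: the same update function f (a tanh cell is assumed here, with arbitrary toy weights) is applied repeatedly, so sequences of any length map to a hidden state of the same size.

# h_t = F_t(x_1, ..., x_t) obtained by repeatedly applying the same f.
import numpy as np

rng = np.random.default_rng(1)
Wxh = rng.normal(0, 0.1, (4, 3))
Whh = rng.normal(0, 0.1, (4, 4))

def f(h_prev, x):
    return np.tanh(Whh @ h_prev + Wxh @ x)

def F(xs, h0=np.zeros(4)):
    """Compute h_t for a variable-length input sequence xs = [x_1, ..., x_t]."""
    h = h0
    for x in xs:                 # the same f (same weights) at every step
        h = f(h, x)
    return h

short_seq = [rng.normal(size=3) for _ in range(2)]
long_seq  = [rng.normal(size=3) for _ in range(7)]
print(F(short_seq).shape, F(long_seq).shape)   # same state size for any length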
An interesting theoretical property of recurrent neural networks is that they are Turing complete. What this means is that, given enough data and computational resources, a recurrent neural network can simulate any algorithm.
COMPUTATIONAL GRAPHS
A recurrent neural network is a neural network that is specialized for processing a sequence of values x(1), . . . , x(τ).
A computational graph is a way to formalize the structure of a set of computations, such as those involved in mapping inputs and parameters to outputs and loss.
We explain the idea of unfolding a recursive or recurrent computation into a computational graph that has a repetitive structure.
• For example, consider the classical form of a dynamical system:

  s(t) = f(s(t−1); θ)          (10.1)

  where s(t) is the state of the system. Equation 10.1 is recurrent because the definition of s at time t refers back to the same definition at time t − 1.
• For a finite number of time steps τ, the graph can be unfolded by applying the definition τ − 1 times. For example, if we unfold equation 10.1 for τ = 3 time steps, we obtain

  s(3) = f(s(2); θ) = f(f(s(1); θ); θ)

Such an expression can now be represented by a traditional directed acyclic computational graph.
Figure 10.1: The classical dynamical system described by equation 10.1,
illustrated as an unfolded computational graph. Each node represents the
state at some time t and the function f maps the state at t to the state at t
+ 1. The same parameters (the same value of θ used to parametrize f) are
used for all time steps.
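A tiny sketch of equation 10.1 and its unfolding for τ = 3; the particular f and θ below are arbitrary choices made only for illustration.

# Unfolding s(t) = f(s(t-1); theta) for tau = 3 time steps.
def f(s, theta):
    return theta * s + 1.0              # an arbitrary example of f

theta, s1 = 0.5, 3.0
s3_unrolled = f(f(s1, theta), theta)    # s(3) = f(f(s(1); theta); theta)

s = s1
for _ in range(2):                      # applying the definition tau - 1 times
    s = f(s, theta)
assert s == s3_unrolled                 # the unrolled graph computes the same value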
As another example, let us consider a dynamical system driven by an external signal x(t):

  s(t) = f(s(t−1), x(t); θ)          (10.4)

where the state now contains information about the whole past sequence.
We now rewrite equation 10.4 using the variable h to represent the state:

  h(t) = f(h(t−1), x(t); θ)          (10.5)
One way to draw the RNN is with a diagram containing one node for every component that might exist in a physical implementation of the model, such as a biological neural network.
In this view, the network defines a circuit that operates in real time, with physical parts whose current state can influence their future state, as in the left of figure 10.2.
The other way to draw the RNN is as an unfolded computational graph, in which each component is represented by many different variables, with one variable per time step, representing the state of the component at that point in time.
Each variable for each time step is drawn as a separate node of the computational graph, as in the right of figure 10.2.
The unfolded graph now has a size that depends on the sequence length.
We can represent the unfolded recurrence after t steps with a function g(t):

  h(t) = g(t)(x(t), x(t−1), x(t−2), . . . , x(2), x(1)) = f(h(t−1), x(t); θ)

The function g(t) takes the whole past sequence (x(t), x(t−1), x(t−2), . . . , x(2), x(1)) as input and produces the current state, but the unfolded recurrent structure allows us to factorize g(t) into repeated application of a function f.
The unfolding process thus introduces two major advantages:
1. Regardless of the sequence length, the learned model always has
the same input size, because it is specified in terms of transition
from one state to another state, rather than specified in terms of
a variable-length history of states.
2. It is possible to use the same transition function f with the
same parameters at every time step.
The recurrent graph and the unrolled graph have their uses.
The recurrent graph is succinct.
The unfolded graph provides an explicit description of which
computations to perform.
The unfolded graph also helps to illustrate the idea of
information flow forward in time (computing outputs and
losses) and backward in time (computing gradients) by
explicitly showing the path along which this information
flows.
RNN Design
Figure 10.3: The computational graph to compute the training loss of a recurrent network that maps an input sequence of x values to a corresponding sequence of output o values. A loss L measures how far each o is from the corresponding training target y. When using softmax outputs, we assume o is the unnormalized log probabilities. The loss L internally computes ŷ = softmax(o) and compares this to the target y. The RNN has input-to-hidden connections parametrized by a weight matrix U, hidden-to-hidden recurrent connections parametrized by a weight matrix W, and hidden-to-output connections parametrized by a weight matrix V.
An RNN whose only recurrence is from the output at one time step to the hidden units at the next time step (figure 10.4) is strictly less powerful (can express a smaller set of functions) than those in the family represented by figure 10.3.
The RNN in this figure is trained to put a specific output value into o, and o is the only information it is allowed to send to the future.
There are no direct connections from h going forward. The previous h is connected to the present only indirectly, via the predictions it was used to produce.
Unless o is very high-dimensional and rich, it will usually lack important information from the past.
This makes the RNN in this figure less powerful, but it may be easier to train because each time step can be trained in isolation from the others.
Returning to the RNN of figure 10.3: forward propagation begins with a specification of the initial state h(0). Then, for each time step from t = 1 to t = τ, we apply the following update equations:

  a(t) = b + W h(t−1) + U x(t)
  h(t) = tanh(a(t))
  o(t) = c + V h(t)
  ŷ(t) = softmax(o(t))

where the parameters are the bias vectors b and c along with the weight matrices U, V and W, respectively for input-to-hidden, hidden-to-output and hidden-to-hidden connections.
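The following NumPy sketch runs these update equations over one toy sequence; the dimensions, random parameters and one-hot inputs are assumptions made for the example, not part of the original material.

# Forward pass of the RNN of figure 10.3:
#   a(t) = b + W h(t-1) + U x(t);  h(t) = tanh(a(t));
#   o(t) = c + V h(t);             yhat(t) = softmax(o(t)).
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, tau = 5, 16, 5, 4          # toy sizes (assumed)
U = rng.normal(0, 0.1, (n_hid, n_in))          # input-to-hidden
W = rng.normal(0, 0.1, (n_hid, n_hid))         # hidden-to-hidden
V = rng.normal(0, 0.1, (n_out, n_hid))         # hidden-to-output
b = np.zeros(n_hid)
c = np.zeros(n_out)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = [np.eye(n_in)[rng.integers(n_in)] for _ in range(tau)]   # one-hot inputs
h = np.zeros(n_hid)                                           # h(0)
for x in xs:                                                  # t = 1 .. tau
    a = b + W @ h + U @ x
    h = np.tanh(a)
    o = c + V @ h
    yhat = softmax(o)         # predicted distribution over outputs at time t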
This is an example of a recurrent network that maps an input
sequence to an output sequence of the same length.
The total loss for a given sequence of x values paired with a sequence of y values would then be just the sum of the losses over all the time steps. For example, if L(t) is the negative log-likelihood of y(t) given x(1), . . . , x(t), then

  L( {x(1), . . . , x(τ)}, {y(1), . . . , y(τ)} ) = Σ_t L(t) = − Σ_t log p_model( y(t) | x(1), . . . , x(t) )
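A minimal sketch of this total loss, assuming the predicted distributions ŷ(t) and the target indices y(t) are already available (the numbers below are made up):

# Total loss = sum over time of the negative log-likelihood of y(t).
import numpy as np

yhat = [np.array([0.7, 0.2, 0.1]),      # yhat(1): model's distribution at t = 1
        np.array([0.1, 0.8, 0.1]),      # yhat(2)
        np.array([0.3, 0.3, 0.4])]      # yhat(3)
y = [0, 1, 2]                           # target indices y(t) (made-up example)

L = sum(-np.log(p[t]) for p, t in zip(yhat, y))   # L = sum_t L(t)
print(L)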
Computing the gradient of this loss function with respect to the parameters is an expensive operation. It involves a forward propagation pass moving left to right through the unrolled graph, followed by a backward propagation pass moving right to left through the graph. The runtime is O(τ) and cannot be reduced by parallelization because the forward propagation graph is inherently sequential; each time step may only be computed after the previous one.
States computed in the forward pass must be stored until they are reused during the backward pass, so the memory cost is also O(τ). The back-propagation algorithm applied to the unrolled graph with O(τ) cost is called back-propagation through time or BPTT.
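For illustration, here is a compact, hand-derived NumPy sketch of BPTT for the tanh/softmax update equations given earlier; it stores every h(t) during the forward pass and sweeps backward through the unrolled graph. All sizes and data are toy values.

# Back-propagation through time: store h(t) in the forward pass (O(tau) memory),
# then sweep backward from t = tau to t = 1 accumulating gradients.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, tau = 4, 8, 4, 5
U = rng.normal(0, 0.1, (n_hid, n_in)); W = rng.normal(0, 0.1, (n_hid, n_hid))
V = rng.normal(0, 0.1, (n_out, n_hid)); b = np.zeros(n_hid); c = np.zeros(n_out)
xs = [np.eye(n_in)[rng.integers(n_in)] for _ in range(tau)]   # toy inputs
ys = [rng.integers(n_out) for _ in range(tau)]                # toy target indices

# Forward pass: keep every h(t) and yhat(t) for reuse in the backward pass.
hs, yhats = [np.zeros(n_hid)], []
for x in xs:
    h = np.tanh(b + W @ hs[-1] + U @ x)
    o = c + V @ h
    e = np.exp(o - o.max()); yhats.append(e / e.sum())
    hs.append(h)

# Backward pass (right to left through the unrolled graph).
dU = np.zeros_like(U); dW = np.zeros_like(W); dV = np.zeros_like(V)
db = np.zeros_like(b); dc = np.zeros_like(c)
dh_next = np.zeros(n_hid)                      # gradient flowing back from t + 1
for t in reversed(range(tau)):
    do = yhats[t].copy(); do[ys[t]] -= 1.0     # dL/do for softmax + NLL
    dV += np.outer(do, hs[t + 1]); dc += do
    dh = V.T @ do + dh_next                    # from the output and from the future
    da = (1.0 - hs[t + 1] ** 2) * dh           # back through tanh
    dW += np.outer(da, hs[t]); dU += np.outer(da, xs[t]); db += da
    dh_next = W.T @ da                         # pass gradient on to step t - 1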
TEACHER FORCING
Teacher forcing is a procedure that emerges from the maximum likelihood criterion, in which during training the model receives the ground truth output y(t) as input at time t + 1.
• We see that at time t = 2, the model is trained to maximize the conditional probability of y(2) given both the x sequence so far and the previous y value from the training set.
• Maximum likelihood thus specifies that during training, rather
than feeding the model’s own output back into itself, these
connections should be fed with the target values specifying what
the correct output should be.
We originally motivated teacher forcing as allowing us to avoid back-propagation through time in models that lack hidden-to-hidden connections. However, as soon as the hidden units become a function of earlier time steps, the BPTT algorithm is necessary.
Some models may thus be trained with both teacher forcing and
BPTT.
The disadvantage of strict teacher forcing arises if the network is going to be later used in an open-loop mode, with the network outputs (or samples from the output distribution) fed back as input.
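To make the contrast concrete, the sketch below compares teacher forcing with open-loop decoding; the decode step is a stand-in RNN cell with assumed weights and a made-up start token, not a specific published model.

# Teacher forcing vs. open-loop decoding (illustrative stand-in for an RNN step).
import numpy as np

rng = np.random.default_rng(0)
vocab, n_hid = 6, 8
Wyh = rng.normal(0, 0.1, (n_hid, vocab))   # previous-output-to-hidden weights (assumed)
Whh = rng.normal(0, 0.1, (n_hid, n_hid))
Who = rng.normal(0, 0.1, (vocab, n_hid))

def step(h, prev_token):
    """One decode step: the previous output token (one-hot) feeds the hidden state."""
    h = np.tanh(Whh @ h + Wyh @ np.eye(vocab)[prev_token])
    logits = Who @ h
    return h, int(np.argmax(logits))

targets = [1, 4, 2, 5]                     # ground-truth outputs y(t) (toy data)

# Training with teacher forcing: feed the ground truth y(t) as input at t + 1.
h, prev = np.zeros(n_hid), 0               # 0 used as an assumed start token
for y_true in targets:
    h, y_pred = step(h, prev)
    prev = y_true                          # teacher forcing

# Open-loop (test time): feed the model's own prediction back as input.
h, prev = np.zeros(n_hid), 0
generated = []
for _ in range(len(targets)):
    h, y_pred = step(h, prev)
    generated.append(y_pred)
    prev = y_pred                          # model output fed back as input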
ENCODER-DECODER SEQUENCE TO SEQUENCE ARCHITECTURES
Here we discuss how an RNN can be trained to map an input sequence to an output sequence which is not necessarily of the same length.
This comes up in many applications, such as speech recognition, machine translation or question answering, where the input and output sequences in the training set are generally not of the same length.
We often call the input to the RNN the "context." We want to produce a representation of this context, C. The context C might be a vector or a sequence of vectors that summarize the input sequence.
The architecture is composed of:
(1) an encoder or reader or input RNN that processes the input sequence. The encoder emits the context C, usually as a simple function of its final hidden state.
(2) a decoder or writer or output RNN that is conditioned on that fixed-length vector to generate the output sequence.
In a sequence-to-sequence architecture, the two RNNs are trained jointly to maximize the average of log P(y(1), . . . , y(ny) | x(1), . . . , x(nx)) over all the pairs of x and y sequences in the training set.
One limitation of this architecture arises when the context C output by the encoder RNN has a dimension that is too small to properly summarize a long sequence. This phenomenon was observed by Bahdanau et al. (2015) in the context of machine translation. They proposed to make C a variable-length sequence rather than a fixed-size vector.
They introduced an attention mechanism that learns to associate elements of the sequence C to elements of the output sequence.
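A brief sketch of the core attention computation; the alignment scores are simplified here to dot products, whereas Bahdanau et al. (2015) compute them with a small learned network, and all values below are toy data.

# Attention: a variable-length context built as a weighted sum of encoder states.
import numpy as np

rng = np.random.default_rng(0)
n_hid, src_len = 8, 5
encoder_states = rng.normal(size=(src_len, n_hid))   # h_1 .. h_src_len (toy values)
decoder_state = rng.normal(size=n_hid)               # current decoder hidden state

scores = encoder_states @ decoder_state              # one alignment score per input element
weights = np.exp(scores - scores.max())
weights /= weights.sum()                             # softmax over the input positions

context = weights @ encoder_states                   # context used at this output step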
Encoder
• A stack of several recurrent units where each accepts a
single element of the input sequence, collects information for
that element and propagates it forward.
• In a question-answering problem, the input sequence is a collection of all words from the question. Each word is represented as x_i, where i is the order of that word.
• The hidden states h_i are computed using the formula:

  h_t = f(W^(hh) h_{t−1} + W^(hx) x_t)

Encoder Vector
• This is the final hidden state produced from the encoder part
of the model. It is calculated using the formula above.
• This vector aims to encapsulate the information for all input
elements in order to help the decoder make accurate
predictions.
• It acts as the initial hidden state of the decoder part of the
model.
Decoder
• A stack of several recurrent units where each predicts an output
y_t at a time step t.
• Each recurrent unit accepts a hidden state from the previous unit and produces an output as well as its own hidden state.
• In the question-answering problem, the output sequence is a collection of all words from the answer. Each word is represented as y_i, where i is the order of that word.
• Any hidden state h_i is computed using the formula:

  h_t = f(W^(hh) h_{t−1})

  That is, we are just using the previous hidden state to compute the next one.
• The output y_t at time step t is computed using the formula:

  y_t = softmax(W^S h_t)

  where W^S is an output weight matrix, and softmax produces a probability vector over the possible outputs (e.g. the words of the answer in the question-answering problem).
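Putting the pieces together, a hedged end-to-end sketch of the encoder, the encoder vector and the decoder using the formulas above; the dimensions, the greedy argmax decoding, the fixed answer length and the sharing of the recurrent weight matrix between encoder and decoder are all simplifying assumptions for illustration.

# Minimal encoder-decoder sketch: encoder hidden states, encoder vector, decoder.
import numpy as np

rng = np.random.default_rng(0)
in_vocab, out_vocab, n_hid = 6, 7, 8
W_hx = rng.normal(0, 0.1, (n_hid, in_vocab))   # encoder input weights W^(hx)
W_hh = rng.normal(0, 0.1, (n_hid, n_hid))      # recurrent weights W^(hh) (shared here)
W_s  = rng.normal(0, 0.1, (out_vocab, n_hid))  # output weights W^S

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Encoder: h_t = f(W_hh h_{t-1} + W_hx x_t) over the question words.
question = [0, 3, 2, 5]                        # toy word indices
h = np.zeros(n_hid)
for w in question:
    h = np.tanh(W_hh @ h + W_hx @ np.eye(in_vocab)[w])
encoder_vector = h                             # final hidden state summarizing the input

# Decoder: h_t = f(W_hh h_{t-1}), y_t = softmax(W_s h_t), starting from the encoder vector.
h = encoder_vector
answer = []
for _ in range(3):                             # generate a fixed number of words (toy choice)
    h = np.tanh(W_hh @ h)
    y = softmax(W_s @ h)
    answer.append(int(np.argmax(y)))           # greedy choice of the next answer word
print(answer)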
Applications
The sequence-to-sequence architecture has many applications, such as:
• Google's Machine Translation
• Question-answering chatbots
• Speech recognition
• Time-series applications