Unit 4 - Machine Learning
Department of Computer Science and Engineering
Subject Notes: BOCS 605B - Machine Learning
Lecture Notes, Unit 4
Figure: 4.1
RNNs have a "memory" that retains information about what has been calculated so far. They use the same parameters for every input, since the same operation is performed at each time step and on each hidden state to produce the output. This reduces the number of parameters, unlike other neural networks.
This memory, the hidden state, is part of the network itself. RNNs can take one or more input vectors and produce one or more output vectors, and the output(s) are influenced not just by weights applied to the inputs, as in a regular neural network, but also by a "hidden" state vector representing the context built up from prior input(s)/output(s). So the same input can produce a different output depending on the previous inputs in the series.
Figure: 4.2 A Recurrent Neural Network, with a hidden state that is meant to carry pertinent information from one input item in the series to others.
The formula for the current state can be written as:

h_t = f(h_t-1, x_t)

Here, h_t is the new state, h_t-1 is the previous state and x_t is the current input. We use the state computed from the previous input rather than the previous input itself, because the input neuron has already applied its transformation to that earlier input. Each successive input is called a time step.
Taking the simplest form of a recurrent neural network, let's say that the activation function is tanh, the weight at the recurrent neuron is W_hh and the weight at the input neuron is W_xh. We can then write the equation for the state at time t as:

h_t = tanh(W_hh · h_t-1 + W_xh · x_t)
The recurrent neuron in this case takes only the immediately previous state into consideration. For longer sequences the recursion unrolls over many such states. Once the final state is calculated, we can go on to produce the output.
Now, once the current state is calculated, we can calculate the output at that time step as:

y_t = W_hy · h_t

where W_hy is the weight at the output neuron.
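To make the recurrence concrete, here is a minimal NumPy sketch of the forward pass described above; the dimensions and random weights are illustrative assumptions, not values from the notes.

```python
import numpy as np

# Illustrative dimensions (assumptions, not from the notes)
input_size, hidden_size, output_size = 4, 8, 3

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_size, input_size))   # input  -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden
W_hy = rng.normal(size=(output_size, hidden_size))  # hidden -> output

def rnn_forward(xs):
    """Run the recurrence h_t = tanh(W_hh h_t-1 + W_xh x_t), y_t = W_hy h_t."""
    h = np.zeros(hidden_size)
    ys = []
    for x in xs:                          # one time step per input vector
        h = np.tanh(W_hh @ h + W_xh @ x)  # same weights reused at every step
        ys.append(W_hy @ h)
    return ys, h

sequence = [rng.normal(size=input_size) for _ in range(5)]
outputs, final_state = rnn_forward(sequence)
```

Note how the same three weight matrices are reused at every time step, which is exactly why the parameter count does not grow with the length of the sequence.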
During training, gradients are propagated backwards through these time steps and tend to shrink as they go (the vanishing gradient problem). In recurrent neural networks, the layers that receive only a small gradient update stop learning, and those are usually the earlier time steps. Because these parts stop learning, an RNN can forget what it has seen earlier in long sequences; it effectively has only a short-term memory.
Long Short-Term Memory (LSTM) is a kind of recurrent neural network designed to solve this problem. In an RNN, the output from the previous step is fed as input to the current step. LSTM was designed by Hochreiter & Schmidhuber. It tackles the long-term dependency problem of RNNs, in which the RNN cannot use information stored far back in the sequence and gives accurate predictions only from recent information; as the gap length increases, RNN performance degrades. An LSTM can, by default, retain information for long periods of time. It is used for processing, predicting and classifying on the basis of time-series data.
Structure Of LSTM:
LSTM has a chain structure that contains four neural networks and different memory blocks called cells.
Figure: 4.3
Information is retained by the cells and the memory manipulations are done by the
gates. There are three gates –
1. Forget gate: Information that is no longer useful in the cell state is removed by the forget gate. Two inputs, x_t (the input at the current time step) and h_t-1 (the previous cell output), are fed to the gate, multiplied with weight matrices, and a bias is added. The result is passed through a sigmoid activation function, which gives an output between 0 and 1 for each element of the cell state. If the output for a particular element is (close to) 0, that piece of information is forgotten; if it is (close to) 1, the information is retained for future use.
Figure: 4.4
2. Input gate: The addition of useful information to the cell state is done by the input gate. First, the information is regulated using a sigmoid function, which filters the values to be remembered, similar to the forget gate, using inputs h_t-1 and x_t. Then a vector is created using the tanh function, which gives outputs from -1 to +1 and contains the candidate values derived from h_t-1 and x_t. At last, the values of the vector and the regulated (sigmoid) values are multiplied to obtain the useful information to add.
3. Output gate: The task of extracting useful information from the current cell state to be presented as an output is done by the output gate. First, a vector is generated by applying the tanh function to the cell state. Then the information is regulated using a sigmoid function, which filters the values to be output using inputs h_t-1 and x_t. At last, the values of the vector and the regulated values are multiplied and sent as the output, and also as the input to the next cell (a compact sketch combining all three gates is given below).
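The three gates described above can be written compactly in code. The following is a minimal NumPy sketch of a single LSTM step under the common formulation; the weight names, the concatenation of h_t-1 and x_t, and the sizes are illustrative assumptions.

```python
import numpy as np

hidden_size, input_size = 8, 4          # illustrative sizes
rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, acting on the concatenation [h_t-1, x_t]
W_f, W_i, W_c, W_o = (rng.normal(size=(hidden_size, hidden_size + input_size))
                      for _ in range(4))
b_f = b_i = b_c = b_o = np.zeros(hidden_size)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)        # forget gate: what to discard from the cell state
    i_t = sigmoid(W_i @ z + b_i)        # input gate: what new information to store
    c_tilde = np.tanh(W_c @ z + b_c)    # candidate values in [-1, 1]
    c_t = f_t * c_prev + i_t * c_tilde  # updated cell state (the "memory")
    o_t = sigmoid(W_o @ z + b_o)        # output gate: what to expose from the cell
    h_t = o_t * np.tanh(c_t)            # new hidden state, also the cell's output
    return h_t, c_t

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
h, c = lstm_step(rng.normal(size=input_size), h, c)
```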
Figure: 4.6
Some of the famous applications of LSTM include:
1. Language Modelling
2. Machine Translation
3. Image Captioning
4. Handwriting generation
5. Question Answering Chatbots
Figure: 4.7
These two gates belong to the Gated Recurrent Unit (GRU), a simplified variant of the LSTM cell.
Update Gate
The update gate acts similarly to the forget and input gates of an LSTM. It decides what information to throw away and what new information to add.
Reset Gate
The reset gate is another gate, used to decide how much past information to forget.
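A minimal NumPy sketch of one GRU step under the standard formulation, showing how the update gate z_t and reset gate r_t act on the hidden state; the weight names and sizes are again illustrative assumptions.

```python
import numpy as np

hidden_size, input_size = 8, 4
rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_z, W_r, W_h = (rng.normal(size=(hidden_size, hidden_size + input_size))
                 for _ in range(3))

def gru_step(x_t, h_prev):
    z_t = sigmoid(W_z @ np.concatenate([h_prev, x_t]))            # update gate
    r_t = sigmoid(W_r @ np.concatenate([h_prev, x_t]))            # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                     # blend old and new

h = gru_step(rng.normal(size=input_size), np.zeros(hidden_size))
```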
Translation
One of the earliest goals for computers was the automatic translation of text from one language to another.
Automatic or machine translation is perhaps one of the most challenging artificial intelligence tasks given the
fluidity of human language. Classically, rule-based systems were used for this task, which were replaced in the
1990s with statistical methods. More recently, deep neural network models achieve state-of-the-art results in
a field that is aptly named neural machine translation.
Machine translation is the task of automatically converting source text in one language to text in another
language.
In a machine translation task, the input already consists of a sequence of symbols in some language, and the
computer program must convert this into a sequence of symbols in another language. Given a sequence of
text in a source language, there is no one single best translation of that text to another language. This is
because of the natural ambiguity and flexibility of human language. The fact is that accurate translation
requires background knowledge in order to resolve ambiguity and establish the content of the sentence.
Classical machine translation methods often involve rules for converting text in the source language to the
target language. The rules are often developed by linguists and may operate at the lexical, syntactic, or
semantic level. This focus on rules gives the name to this area of study: Rule-based Machine Translation, or
RBMT.
Statistical Machine Translation-
Statistical machine translation, or SMT for short, is the use of statistical models that learn to translate text
from a source language to a target language.
Given a sentence T in the target language, we seek the sentence S from which the translator produced T. We
know that our chance of error is minimized by choosing that sentence S that is most probable given T. Thus, we
wish to choose S so as to maximize Pr(S|T).
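Applying Bayes' rule makes this objective concrete: since

Pr(S|T) = Pr(S) Pr(T|S) / Pr(T)

and Pr(T) does not depend on S, maximizing Pr(S|T) is equivalent to maximizing the product Pr(S) Pr(T|S) of a language model Pr(S) and a translation model Pr(T|S).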
The approach is data-driven, requiring only a corpus of examples with both source and target language text.
This means linguists are no longer required to specify the rules of translation.
Neural Machine Translation-
Neural machine translation, or NMT for short, is the use of neural network models to learn a statistical model
for machine translation.
The key benefit of the approach is that a single system can be trained directly on source and target text, no longer requiring the pipeline of specialized systems used in statistical machine translation.
Unlike the traditional phrase-based translation system which consists of many small sub-components that are
tuned separately, neural machine translation attempts to build and train a single, large neural network that
reads a sentence and outputs a correct translation.
Encoder-Decoder Model
Multilayer Perceptron neural network models can be used for machine translation, although the models are
limited by a fixed-length input sequence where the output must be the same length.
These early models have been greatly improved upon recently through the use of recurrent neural networks
organized into an encoder-decoder architecture that allow for variable length input and output sequences.
An encoder neural network reads and encodes a source sentence into a fixed-length vector. A decoder then
outputs a translation from the encoded vector. The whole encoder–decoder system, which consists of the
encoder and the decoder for a language pair, is jointly trained to maximize the probability of a correct
translation given a source sentence.
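The following is a minimal PyTorch sketch of such an encoder-decoder, included only to make the two components and their joint training concrete; the use of a GRU, the layer sizes and the toy vocabulary sizes are assumptions for illustration, not details from the notes.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sentence and encodes it into a fixed-length vector."""
    def __init__(self, src_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(src_vocab, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src_tokens):                  # (batch, src_len)
        _, hidden = self.rnn(self.embed(src_tokens))
        return hidden                               # (1, batch, hid_dim) summary vector

class Decoder(nn.Module):
    """Generates the target sentence one token at a time from the encoded vector."""
    def __init__(self, tgt_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(tgt_vocab, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, tgt_tokens, hidden):          # teacher forcing during training
        output, hidden = self.rnn(self.embed(tgt_tokens), hidden)
        return self.out(output), hidden             # per-step vocabulary scores

# Jointly trained to maximize the probability of the correct translation
encoder, decoder = Encoder(src_vocab=1000), Decoder(tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 7))                # toy batch of source sentences
tgt = torch.randint(0, 1200, (2, 9))                # toy batch of target sentences
logits, _ = decoder(tgt[:, :-1], encoder(src))
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 1200), tgt[:, 1:].reshape(-1))
loss.backward()
```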
Encoder-Decoders with Attention
Although effective, the Encoder-Decoder architecture has problems with long sequences of text to be
translated.
The problem stems from the fixed-length internal representation that must be used to decode each word in
the output sequence.
The solution is the use of an attention mechanism that allows the model to learn where to place attention on
the input sequence as each word of the output sequence is decoded.
The encoder-decoder recurrent neural network architecture with attention is currently the state of the art on some benchmark problems for machine translation, and this architecture sits at the heart of the Google Neural Machine Translation system, or GNMT, used in the Google Translate service.
Figure: 4.8
Once such a model is trained, decoding requires searching for the most likely target sentence. One popular heuristic method for this purpose is beam search. A simpler alternative is greedy search, which takes only the single best output into account at each step; this discards other candidate sentences that might ultimately be more likely translations.
Beam search ―
Beam search decoding iteratively creates text candidates (beams) and scores them.
It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest
sentence y given an input x.
Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.
Figure: 4.9
Beam width
The beam width B is the main parameter of beam search. For production purposes its value is generally kept between 10 and 100, while for research purposes values between 1,000 and 3,000 are common. The larger the beam width, the higher the chance of finding a likely sentence, but the computational expense and memory requirements grow significantly. In other words, large values of B yield better results but slower performance and more memory use, while small values of B lead to worse results but are less computationally intensive. A standard value for B is around 10.
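The bookkeeping behind beam search can be sketched as follows. This is a minimal illustration, not a production decoder: next_token_log_probs is a hypothetical stand-in for the trained decoder's per-step distribution, and the toy bigram table at the bottom exists only to exercise the search.

```python
import math
from heapq import nlargest

def beam_search(next_token_log_probs, start_token, end_token, beam_width=10, max_len=20):
    """Keep the B highest-scoring partial sentences (beams) at every step."""
    beams = [([start_token], 0.0)]                 # (tokens so far, total log-prob)
    completed = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, logp in next_token_log_probs(tokens):   # hypothetical model call
                candidates.append((tokens + [tok], score + logp))
        # Prune: keep only the best `beam_width` candidates
        beams = nlargest(beam_width, candidates, key=lambda c: c[1])
        completed += [b for b in beams if b[0][-1] == end_token]
        beams = [b for b in beams if b[0][-1] != end_token]
        if not beams:
            break
    # Pick by average log-probability per token (see length normalization below)
    return max(completed + beams, key=lambda c: c[1] / len(c[0]))

# Toy scorer: a fixed bigram-style distribution, just to exercise the search
toy = {"<s>": [("a", math.log(0.6)), ("b", math.log(0.4))],
       "a":   [("b", math.log(0.5)), ("</s>", math.log(0.5))],
       "b":   [("</s>", math.log(1.0))]}
best = beam_search(lambda toks: toy[toks[-1]], "<s>", "</s>", beam_width=2)
```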
Length normalization ― In order to improve numerical stability, beam search is usually applied to the following normalized objective, often called the normalized log-likelihood objective, defined as:

(1 / T_y^α) Σ (from t = 1 to T_y) log p(y<t> | x, y<1>, ..., y<t-1>)

where T_y is the length of the output sentence.
Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.
Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:

bleu score = exp( (1/n) Σ (from k = 1 to n) log p_k )

where p_k is the precision on k-grams; in practice a brevity penalty is also applied to penalize translations that are shorter than the reference.
Attention weight ― The amount of attention that the output y<t> should pay to the activation a<t′> is given by α<t,t′>, computed with a softmax:

α<t,t′> = exp(e<t,t′>) / Σ (from t″ = 1 to T_x) exp(e<t,t″>)

where e<t,t′> is a relevance score (typically the output of a small neural network) between output position t and input position t′, and T_x is the length of the input.
Remark: attention scores of this kind are commonly used in image captioning as well as machine translation.
Bellman Equation
Using the Bellman equation, the value function can be decomposed into two parts: an immediate reward R_t+1, and the discounted value of the successor state γV(S_t+1):

V(s) = E[ G_t | S_t = s ]
     = E[ R_t+1 + γG_t+1 | S_t = s ]      (substituting the return G_t+1, starting from time step t+1)
     = E[ R_t+1 + γV(S_t+1) | S_t = s ]

The last step uses the fact that the expectation operator is linear, meaning that E(aX + bY) = aE(X) + bE(Y), so the expected discounted return from time step t+1 can be replaced by the value of the successor state.
So, for each state in the state space, the Bellman equation gives us the value of that state:

V(s) = R(s) + γ Σ (over successor states s') P(s' | s) V(s')

The value of the state s is the reward we get upon leaving that state, plus a discounted average over the possible successor states, where the value of each possible successor state is multiplied by the probability that we land in it.
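A small worked example (numbers invented purely for illustration): suppose leaving state s yields R(s) = 1, the discount factor is γ = 0.9, and s has two possible successors with P(s1 | s) = P(s2 | s) = 0.5, V(s1) = 2 and V(s2) = 4. Then V(s) = 1 + 0.9 × (0.5 × 2 + 0.5 × 4) = 1 + 0.9 × 3 = 3.7.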
Value Iteration and Policy Iteration
The value-iteration and policy-iteration algorithms are two fundamental methods for solving MDPs. Both value iteration and policy iteration assume that the agent knows the MDP model of the world (i.e. the agent knows the state-transition and reward probability functions). Therefore, they can be used by the agent to plan its actions offline, given knowledge about the environment, before interacting with it.
Value iteration computes the optimal state-value function by iteratively improving the estimate of V(s). The algorithm initializes V(s) to arbitrary random values and repeatedly updates the Q(s, a) and V(s) values until they converge. Value iteration is guaranteed to converge to the optimal values.
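A minimal NumPy sketch of value iteration on a tabular MDP; the transition tensor P and reward vector R below are toy assumptions, and each sweep applies the Bellman backup described above.

```python
import numpy as np

# Toy MDP (assumed for illustration): 3 states, 2 actions
# P[s, a, s'] = probability of landing in s' after taking action a in state s
P = np.array([[[0.8, 0.2, 0.0], [0.0, 0.9, 0.1]],
              [[0.1, 0.8, 0.1], [0.0, 0.0, 1.0]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
R = np.array([0.0, 1.0, 10.0])         # R[s] = reward collected in state s
gamma, theta = 0.9, 1e-6               # discount factor, convergence threshold

V = np.zeros(3)                        # arbitrary initial values
while True:
    # Bellman backup: Q(s, a) = R(s) + gamma * sum_s' P(s'|s,a) V(s')
    Q = R[:, None] + gamma * (P @ V)
    V_new = Q.max(axis=1)              # V(s) = max_a Q(s, a)
    if np.max(np.abs(V_new - V)) < theta:
        break
    V = V_new

policy = Q.argmax(axis=1)              # greedy policy extracted from the converged values
```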
The value-iteration algorithm keeps improving the value function at each iteration until the value function converges. Since the agent only cares about finding the optimal policy, the optimal policy sometimes converges before the value function does. Therefore, another algorithm called policy iteration, instead of repeatedly improving the value-function estimate, re-defines the policy at each step and computes the value function according to this new policy until the policy converges. Policy iteration is also guaranteed to converge to the optimal policy, and it often takes fewer iterations to converge than the value-iteration algorithm.
Value-Iteration vs Policy-Iteration
Both value-iteration and policy-iteration algorithms can be used for offline planning, where the agent is assumed to have prior knowledge about the effects of its actions on the environment (they assume the MDP model is known). Compared with value iteration, policy iteration is often more computationally efficient overall, as it usually takes considerably fewer iterations to converge, although each of its iterations is more computationally expensive.
Actor-critic model
1. The “Critic” estimates the value function. This could be the action-value (the Q value) or state-value
(the V value).
2. The “Actor” updates the policy distribution in the direction suggested by the Critic (such as with policy
gradients).
And both the Critic and Actor functions are parameterized with neural networks.
Actor-Critic methods aim to take advantage of the strengths of both value-based and policy-based approaches while eliminating their drawbacks. How do they do this?
The principal idea is to split the model in two: one for computing an action based on a state and another one
to produce the Q values of the action.
The actor takes the state as input and outputs the best action. It essentially controls how the agent behaves by learning the optimal policy (policy-based). The critic, on the other hand, evaluates the action by computing the value function (value-based). The two models participate in a game in which they both get better in their own roles as time passes. The result is that the overall architecture learns to play the game more efficiently than either of the two methods separately.
How Actor Critic works
Imagine you are playing a video game with a friend who provides you with some feedback. You're the Actor and your friend is the Critic.
Figure: 4.11
At the beginning, you don’t know how to play, so you try some action randomly. The Critic observes your
action and provides feedback.
Learning from this feedback, you’ll update your policy and be better at playing that game.
On the other hand, your friend (Critic) will also update their own way to provide feedback so it can be better
next time.
The idea of Actor Critic is to have two neural networks. We estimate both:
1. the policy (the Actor), which maps the state to a probability distribution over actions, and
2. the value function (the Critic), which scores how good the action taken from that state is.
Both run in parallel. Because we have two models (Actor and Critic) that must be trained, we have two sets of weights that must be optimized separately.
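A minimal PyTorch sketch of the two networks and a single update step; the architecture, learning rates and the toy transition at the bottom are assumptions for illustration, and real use would wrap this in an environment-interaction loop.

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2                        # toy sizes

actor = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(),
                      nn.Linear(32, n_actions), nn.Softmax(dim=-1))  # policy pi(a|s)
critic = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(),
                       nn.Linear(32, 1))                             # state value V(s)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def update(state, action, reward, next_state, done):
    """One actor-critic update from a single transition (s, a, r, s')."""
    v, v_next = critic(state), critic(next_state).detach()
    td_target = reward + gamma * v_next * (1 - done)
    td_error = td_target - v                        # the Critic's feedback

    critic_loss = td_error.pow(2).mean()            # Critic: move V(s) toward the TD target
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    log_prob = torch.log(actor(state)[action])
    actor_loss = -(log_prob * td_error.detach()).sum()  # Actor: policy gradient weighted by the TD error
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Toy transition, standing in for real environment interaction
s, s2 = torch.rand(state_dim), torch.rand(state_dim)
update(s, action=1, reward=1.0, next_state=s2, done=0.0)
```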
Q-learning
Consider the case where the agent does not know a priori what the effects of its actions on the environment are (the state-transition and reward models are not known). The agent only knows the set of possible states and actions, and can observe the current state of the environment. In this case, the agent has to actively learn through the experience of interacting with the environment. There are two categories of learning algorithms:
Model-based learning: In model-based learning, the agent interacts with the environment and, from the history of its interactions, tries to approximate the environment's state-transition and reward models. Afterwards, given the models it has learnt, the agent can use value iteration or policy iteration to find an optimal policy.
Model-free learning: In model-free learning, the agent does not try to learn explicit models of the environment's state-transition and reward functions. Instead, it derives an optimal policy directly from its interactions with the environment.
Q-learning is an example of a model-free learning algorithm. It does not assume that the agent knows anything about the state-transition and reward models; instead, the agent discovers good and bad actions by trial and error.
The basic idea of Q-learning is to approximate the state-action value function Q(s, a) from the samples of Q(s, a) that we observe during interaction with the environment. This approach is known as Temporal-Difference (TD) learning.
Q-learning is an off policy reinforcement learning algorithm that seeks to find the best action to take given the
current state. It’s considered off-policy because the q-learning function learns from actions that are outside
the current policy, like taking random actions, and therefore a policy isn’t needed. More specifically, q-learning
seeks to learn a policy that maximizes the total reward. The ‘q’ in q-learning stands for quality. Quality in this
case represents how useful a given action is in gaining some future reward.
The Q-learning algorithm Process
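At the core of the process is the update rule Q(s, a) ← Q(s, a) + α[ r + γ max_a' Q(s', a') − Q(s, a) ]. Below is a minimal sketch of the tabular Q-learning loop; the tiny ChainEnv environment and the hyperparameters are toy assumptions, used only to make the loop runnable end to end.

```python
import numpy as np

class ChainEnv:
    """Tiny toy environment (assumed for illustration): walk right to reach state 4."""
    n_states, n_actions = 5, 2                      # actions: 0 = left, 1 = right
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = min(self.s + 1, 4) if a == 1 else max(self.s - 1, 0)
        reward = 1.0 if self.s == 4 else 0.0
        return self.s, reward, self.s == 4          # next state, reward, done

env = ChainEnv()
Q = np.zeros((env.n_states, env.n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

for episode in range(500):
    s, done = env.reset(), False
    while not done:
        # Behaviour policy: epsilon-greedy exploration
        a = rng.integers(env.n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next, r, done = env.step(a)
        # Off-policy update: bootstrap from the best action in the next state
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
```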
SARSA
SARSA stands for State-Action-Reward-State-Action, which symbolizes the tuple (s, a, r, s', a'); it is an on-policy algorithm for TD learning. The major difference between it and Q-learning is that the maximum reward for the next state is not necessarily used for updating the Q-values. Instead, a new action, and therefore a new reward, is selected using the same policy that determined the original action. The name SARSA comes from the fact that the updates are done using the quintuple Q(s, a, r, s', a'), where s, a are the original state and action, r is the reward observed in the following state, and s', a' are the new state-action pair.
SARSA vs Q-learning
The difference between these two algorithms is that SARSA chooses its next action following the same current policy and updates its Q-values with that action, whereas Q-learning updates its Q-values using the greedy action, that is, the action that gives the maximum Q-value for the next state, regardless of which action the behaviour policy actually takes next; in that sense, Q-learning learns about the greedy (optimal) policy while following another.
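The contrast is easiest to see in the two update rules written side by side; a minimal sketch, with the learning rate alpha and discount factor gamma as illustrative defaults.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action a_next actually chosen by the current policy."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the greedy (maximum-value) action in the next state."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

Q = np.zeros((5, 2))                                  # toy table: 5 states, 2 actions
sarsa_update(Q, s=0, a=1, r=0.0, s_next=1, a_next=1)
q_learning_update(Q, s=0, a=1, r=0.0, s_next=1)
```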