21CSE356T-NLP-Unit 4.1
NATURAL LANGUAGE PROCESSING
Instructor:
Ms. S. Rama,
Assistant Professor
Department of Information Technology,
SRM Institute of Science and Technology,
Unit IV- (Language Models)
• Recurrent Neural Network
• Long Short Term Memory
• Attention Mechanism
• Transformer Based Models
• Self attention
• Multihead attention
• BERT
• RoBERTa
• Fine-Tuning for Downstream Tasks
• Text Classification and Generation
Language Model
• What is a Language Model?
• A Language Model (LM) is a computational model designed to
understand, generate, and predict human language. It is a
fundamental concept in Natural Language Processing (NLP) and
serves as the backbone for many AI applications, including chatbots,
machine translation, text summarization, and speech recognition.
Types of Language Models
• Language models can be categorized based on their architecture and
how they process text:
• Statistical Language Models (Before Deep Learning Era)
• These models use probabilities to predict the next word in a sentence
based on previous words.
• Examples:
• N-gram models (bigram, trigram)
• Hidden Markov Models (HMM)
• Latent Dirichlet Allocation (LDA) for topic modeling
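To make the n-gram idea concrete, here is a minimal sketch of a bigram model estimated by maximum likelihood on a toy corpus (the corpus and helper names are illustrative, not from the slides):

```python
from collections import Counter

# Toy corpus; in practice this would be a large text collection.
corpus = [
    "i love natural language processing",
    "i love deep learning",
    "language models predict the next word",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(prev_word, word):
    """P(word | prev_word) by maximum-likelihood estimation."""
    if unigrams[prev_word] == 0:
        return 0.0
    return bigrams[(prev_word, word)] / unigrams[prev_word]

print(bigram_prob("i", "love"))        # 1.0 in this toy corpus
print(bigram_prob("love", "natural"))  # 0.5
```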
Neural Language Models (Deep Learning Era)
• These models use deep learning techniques to learn complex patterns
in text.
• Examples:
• Recurrent Neural Networks (RNNs) – Capture sequential dependencies.
• Long Short-Term Memory (LSTM) – Handles long-range dependencies better
than RNNs.
• Gated Recurrent Units (GRU) – A simplified version of LSTMs.
• Transformers (State-of-the-Art) – Used in modern NLP applications.
Different Ways to Model Text
• A Classic Approach for Text Classification: Bag-of-Words Model
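The bag-of-words model represents a document as an unordered vector of word counts. Below is a minimal sketch using scikit-learn's CountVectorizer for a tiny text classification task; the example texts and labels are illustrative, not from the slides:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative dataset: 1 = positive, 0 = negative.
texts = ["a very nice person", "what an evil plan",
         "nice and helpful", "truly evil behaviour"]
labels = [1, 0, 1, 0]

# Bag-of-words: each document becomes a vector of word counts,
# ignoring word order entirely.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

clf = LogisticRegression().fit(X, labels)
print(clf.predict(vectorizer.transform(["such a nice person"])))  # likely [1]
```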
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks
• Like traditional neural networks, such as feedforward neural networks
and convolutional neural networks (CNNs), recurrent neural networks
use training data to learn. They are distinguished by their “memory” as
they take information from prior inputs to influence the current input
and output.
• A Recurrent Neural Network (RNN) is a type of neural network that processes sequential
data by maintaining a memory of previous inputs.
Overview
• The networks we used previously are also called feedforward neural
networks; they usually want to predict a vector at some time steps.
Different Types of Sequence Modeling Tasks
• Sequential Representation:
In an RNN, the same network cell is applied at every time step, processing
one element of the sequence at a time. However, because each time step's
computation depends on the previous ones (via the hidden state), the
network inherently forms a loop. Unrolling in time means "unfolding" this
loop so that each time step is represented as a separate layer in a deep
feedforward network.
• Temporal Layers: Imagine you have a sequence with T time steps. Unrolling
the RNN creates T copies of the network cell arranged sequentially.
Although these cells share the same weights, they are shown as separate
layers corresponding to time steps t = 1, 2, …, T.
Why Unroll the RNN?
• Visualization: It makes it easier to understand how the hidden state
propagates through time and how each time step contributes to the
final output.
• Training with Backpropagation Through Time (BPTT): Unrolling allows
us to apply a variant of backpropagation known as Backpropagation Through
Time, in which gradients are propagated backward through every unrolled time step.
Forward Pass:
• Time Step 1: The RNN cell processes the first input x1 and computes the first hidden
state h1; each subsequent step t processes xt together with h(t-1).
• Result: You end up with a chain of hidden states h1, h2, …, hT and possibly outputs y1, y2,
…, yT.
• Backward Pass:
• Error Propagation: When training, the error from the output is propagated backward
through each of these unrolled steps. This allows the network to adjust its weights based
on the entire sequence context.
• Shared Weights: Despite the unrolled structure, the weight matrices remain the same
across all time steps. Gradients computed at each step are accumulated for the shared
parameters.
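A minimal NumPy sketch of the unrolled forward pass described above: the same weight matrices are reused at every time step, and the hidden state carries context forward. Dimensions, weight names, and the random inputs are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 5, 3, 4, 2

# Shared parameters, reused at every unrolled time step.
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

xs = rng.normal(size=(T, input_dim))   # input sequence x1 ... xT
h = np.zeros(hidden_dim)               # initial hidden state h0

hs, ys = [], []
for t in range(T):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
    h = np.tanh(W_xh @ xs[t] + W_hh @ h + b_h)
    # y_t = W_hy h_t + b_y
    y = W_hy @ h + b_y
    hs.append(h)
    ys.append(y)

print(len(hs), hs[-1].shape)  # T hidden states h1 ... hT
```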
Backpropagation and Vanishing Gradient
Loss in RNN
• Loss
• Definition:
In machine learning, the loss (or cost) function quantifies the
difference between the model's predicted output and the actual
target values. It measures how "wrong" the predictions are.
• In RNNs:
Since RNNs handle sequential data, the loss is often computed at each
time step and then aggregated (for example, by summing or
averaging) over the entire sequence. This aggregated loss then guides
how the model should adjust its parameters during training.
Backpropagation in RNNs
• Definition:
Backpropagation is the algorithm used to compute the gradient of the loss function with
respect to each weight in the network, allowing for weight updates that minimize the loss.
• Backpropagation Through Time (BPTT):
• Process:
In RNNs, the standard backpropagation algorithm is extended to account for the sequential nature of the
data. The network is "unrolled" over time, creating a copy of the network for each time step. The
gradients are then computed for each of these time steps and aggregated.
• Challenges:
The process can suffer from the vanishing gradient problem because the gradients from later time steps
(which might carry important long-term dependencies) become exponentially smaller as they are
propagated back to earlier time steps.
• Significance:
BPTT is essential for training RNNs effectively, but its challenges have led to the development of
alternative architectures and methods to better capture long-range dependencies in sequences.
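To make BPTT concrete, the following PyTorch sketch (shapes and toy targets are illustrative) sums the per-time-step losses over the whole sequence and lets a single backward pass propagate gradients through every unrolled step:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, batch, input_dim, hidden_dim, num_classes = 6, 2, 3, 8, 4

rnn = nn.RNN(input_dim, hidden_dim, batch_first=True)
head = nn.Linear(hidden_dim, num_classes)
criterion = nn.CrossEntropyLoss()

x = torch.randn(batch, T, input_dim)              # input sequence
targets = torch.randint(0, num_classes, (batch, T))

outputs, _ = rnn(x)                               # hidden states for all T steps
logits = head(outputs)                            # (batch, T, num_classes)

# Aggregate the loss over every time step, then backpropagate through time:
loss = criterion(logits.reshape(-1, num_classes), targets.reshape(-1))
loss.backward()                                   # gradients flow back through all T steps
print(loss.item())
```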
Vanishing Gradient
• It occurs when gradients—the values used to update the network's weights—become
exceedingly small as they are propagated backward through the network during
training. Here's a detailed breakdown:
• What Happens During Backpropagation
• Chain Rule Multiplication:
In deep networks or RNNs, the backpropagation algorithm relies on the chain rule to
compute gradients. This involves multiplying the derivatives of the activation
functions across many layers or time steps.
• Exponential Decay of Gradients:
If these derivatives are less than one, as is common with activation functions like
sigmoid or tanh, repeated multiplication can cause the gradient to shrink
exponentially. This results in very small gradient values for earlier layers or time steps.
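A tiny numeric illustration of this exponential decay, assuming each backward step multiplies the gradient by the maximum sigmoid derivative of 0.25:

```python
# Repeatedly multiplying by a derivative < 1 shrinks the gradient exponentially.
max_sigmoid_derivative = 0.25   # sigma'(x) <= 0.25 for the sigmoid function
gradient = 1.0
for step in range(1, 21):
    gradient *= max_sigmoid_derivative
    if step in (5, 10, 20):
        print(f"after {step} steps: {gradient:.2e}")
# after 5 steps:  ~9.77e-04
# after 10 steps: ~9.54e-07
# after 20 steps: ~9.09e-13
```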
Consequences of Vanishing
Gradients
• Slow Learning:
When gradients become very small, the corresponding weights receive
almost no update during training. This makes it extremely difficult for
the network to learn from the data, especially the information relevant
to earlier layers or earlier parts of the sequence.
• Difficulty Capturing Long-Term Dependencies:
In the context of RNNs, vanishing gradients hinder the network’s ability
to capture long-term dependencies. The influence of an input from
many time steps ago diminishes rapidly, leading the model to "forget"
important contextual information.
RNN
An RNN is basically a black box with an “internal state” that is updated as
a sequence is processed. At every single timestep, we feed an input vector
into the RNN, which modifies that state as a function of what it receives.
When we tune the RNN weights, the RNN will show different behaviors in terms of
how its state evolves as it receives these inputs. We are also interested in
producing an output based on the RNN state, so we can produce these
output vectors on top of the RNN.
If we unroll an RNN model, then there are inputs (e.g. video frames) at
different timesteps x1, x2, …, xt.
At each timestep, the RNN takes in two inputs – an input frame (xi) and the previous
representation of what it has seen so far (i.e. its history) – to generate an output
and update its history, which gets forward-propagated over time. All the
RNN blocks are the same block and share the same parameters, but they have
different inputs and history at each timestep.
Advantages
• An RNN remembers each piece of information through time. This ability to
remember previous inputs is what makes it useful in time series prediction,
and it is extended further in Long Short-Term Memory networks.
• Recurrent neural networks are even used with convolutional layers to extend
the effective pixel neighborhood.
Disadvantages
• Vanishing and exploding gradient problems.
• Training an RNN is a very difficult task.
• It cannot process very long sequences when using tanh or ReLU as the activation
function.
LSTM (Long Short-Term Memory)
Motivation: capturing long-range dependencies (e.g. the context needed to
detect sarcasm) is hard for a plain RNN.
LSTM (Long Short-Term Memory)
• What is?
• LSTM is a recurrent neural network (RNN) architecture widely
used in Deep Learning. It excels at capturing long-term
dependencies, making it ideal for sequence prediction
tasks.
• Unlike traditional neural networks, LSTM incorporates feedback
connections, allowing it to process entire sequences of data, not
just individual data points. This makes it highly effective in
understanding and predicting patterns in sequential data like
time series, text, and speech.
• LSTM has become a powerful tool in artificial intelligence and
deep learning, enabling breakthroughs in various fields by
uncovering valuable insights from sequential data
Selective Read, Selective Write, Selective Forget
• Selective Write – we select what to write
• Selective Read – we select what to read
• Selective Forget – we select what to forget
LSTM Architecture
• At a high level, LSTM works very much like an RNN cell. Here is the
internal functioning of the LSTM network. The LSTM network
architecture consists of three parts, as shown in the image below, and
each part performs an individual function.
• The first part chooses whether the information coming from the previous
timestamp is to be remembered or is irrelevant and can be forgotten.
• In the second part, the cell tries to learn new information from the input to this
cell.
• At last, in the third part, the cell passes the updated information from the current
timestamp to the next timestamp.
LONG SHORT-TERM MEMORY
• These three parts of an LSTM unit are known as gates. They
control the flow of information in and out of the memory
cell or LSTM cell.
• The first gate is called Forget gate, the second gate is
known as the Input gate, and the last one is the Output
gate.
• An LSTM unit, consisting of these three gates and a memory
cell (LSTM cell), can be thought of as a layer of neurons in a
traditional feedforward neural network, with each unit
maintaining a hidden state and a current cell state.
LSTM
• LSTM also has a hidden state, where H(t-1)
represents the hidden state of the previous
timestamp and H(t) is the hidden state of the
current timestamp. In addition to that, LSTM also
has a cell state, represented by C(t-1) and C(t)
for the previous and current timestamps,
respectively.
• Here the hidden state is known as Short
term memory, and the cell state is known
as Long term memory.
The Logic Behind LSTM
Example of LSTM working. Input: "A is a nice person. But B is evil."
• Let's take an example to understand how LSTM works. Here we have
two sentences separated by a full stop. The first sentence is "A is a
nice person," and the second sentence is "But B is evil." It is very
clear that in the first sentence we are talking about A, and as soon as
we encounter the full stop (.), we start talking about B.
As we move from the first sentence to the second sentence, our network
should realize that we are no longer talking about A; our subject is now B.
Here, the Forget gate of the network allows it to forget about A. Let's
understand the roles played by these gates in LSTM.
Forget Gate
• The information that is no longer useful in the cell state is removed with the forget gate. Two
inputs, xt (the input at the current timestep) and ht-1 (the previous hidden state), are fed to the gate and
multiplied with weight matrices, followed by the addition of a bias.
• The resultant is passed through a sigmoid activation function, which gives an output between 0 and 1. If for a
particular cell state the output is close to 0, the piece of information is forgotten; if the output is close to 1,
the information is retained for future use.
The equation for the forget gate is:
ft = σ(Wf ⋅ [ht−1, xt] + bf)
where:
• Wf represents the weight matrix associated with the forget gate.
• [ht-1, xt] denotes the concatenation of the current input and the
previous hidden state.
• bf is the bias term of the forget gate.
The addition of useful information to the cell state is done by the input gate. First, the information
is regulated using the sigmoid function, which filters the values to be remembered (similar to the
forget gate) using the inputs ht-1 and xt. Then, a vector of candidate values is created using the tanh
function, which gives an output from -1 to +1 based on ht-1 and xt. Finally, the regulated values and
the candidate values are multiplied to obtain the useful information. The equations for the input
gate and the candidate values are:
it = σ(Wi ⋅ [ht−1, xt] + bi)
Ĉt = tanh(Wc ⋅ [ht−1, xt] + bc)
We multiply the previous cell state by ft, disregarding the
information we had previously chosen to forget. Next, we
add it ⊙ Ĉt. This represents the candidate values, scaled by
how much we chose to update each state value.
Ct = ft ⊙ Ct−1 + it ⊙ Ĉt
where
• ⊙ denotes element-wise multiplication
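Putting the forget gate, input gate, candidate values, and cell-state update together, here is a minimal NumPy sketch of a single LSTM time step (the output gate is included for completeness; dimensions, initialisation, and the toy sequence are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
concat_dim = hidden_dim + input_dim

# One weight matrix and bias per gate, acting on [h_{t-1}, x_t].
W_f, b_f = rng.normal(scale=0.1, size=(hidden_dim, concat_dim)), np.zeros(hidden_dim)
W_i, b_i = rng.normal(scale=0.1, size=(hidden_dim, concat_dim)), np.zeros(hidden_dim)
W_c, b_c = rng.normal(scale=0.1, size=(hidden_dim, concat_dim)), np.zeros(hidden_dim)
W_o, b_o = rng.normal(scale=0.1, size=(hidden_dim, concat_dim)), np.zeros(hidden_dim)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)            # forget gate
    i_t = sigmoid(W_i @ z + b_i)            # input gate
    c_hat = np.tanh(W_c @ z + b_c)          # candidate values
    c_t = f_t * c_prev + i_t * c_hat        # Ct = ft ⊙ Ct-1 + it ⊙ Ĉt
    o_t = sigmoid(W_o @ z + b_o)            # output gate
    h_t = o_t * np.tanh(c_t)                # new hidden state (short-term memory)
    return h_t, c_t

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # a 5-step toy sequence
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)
```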
LSTM vs. RNN
• Directionality: LSTM can be trained to process sequential data in both forward and
backward directions; an RNN can only be trained to process sequential data in one direction.
• Training: LSTM is more difficult to train than an RNN due to the complexity of the gates
and memory unit; an RNN is easier to train than an LSTM.
• Applications: LSTM – machine translation, speech recognition, text summarization,
natural language processing, time series forecasting; RNN – natural language processing,
machine translation, speech recognition, image processing, video processing.
Attention Mechanism
Motivation
Recurrent Neural Networks (LSTM/GRU) are the model of choice when working
with variable-length inputs and are thus a natural fit for text
processing. But:
• the sequential nature of RNNs prohibits parallelization,
• the context is computed from past only,
• there is no explicit distinction between short- and long-range dependencies
(everything is dealt with via the context),
• training is tricky: how can we efficiently do transfer learning?
On the other hand, Convolution can
• operate on both time-series (1D convolution), and images,
• be massively parallelized,
• exploit local dependencies (within the kernel) and long-range dependencies
(using multiple layers),
but:
• it can’t deal with variable-size inputs,
• the position of these dependencies is fixed (determined by the kernel size and
the number of layers).
Attention
• An attention model is a mechanism used in neural networks that dynamically
focuses on the most relevant parts of the input data when making
predictions or generating outputs.
• Instead of processing all parts of the input equally, the model assigns different
weights to different elements, allowing it to "attend" more to critical features.
This is especially useful in tasks like machine translation, text summarization,
and image recognition.
• Key Points:
• Dynamic Focus: Traditional models might compress input into a fixed-length
representation, but attention mechanisms evaluate and weigh each input
element based on its relevance to the current task.
• Applications: Originally introduced for neural machine translation (e.g.,
Bahdanau attention), attention mechanisms now play a central role in many
state-of-the-art architectures such as Transformers.
• As an example, consider "The animal didn't cross the street because it was too wide" (Sentence 1)
and "The animal didn't cross the street because it was too tired" (Sentence 2).
• In Sentence 1, the attention mechanism assigns more weight to "street" when resolving "it."
• In Sentence 2, the attention mechanism assigns more weight to "animal" when resolving "it."
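A minimal sketch of scaled dot-product attention, one common way to compute the weights described above: each query is scored against every key, the scores are turned into weights with a softmax, and the output is the weighted sum of the values (shapes and the random inputs are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d), K: (n_k, d), V: (n_k, d_v)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V, weights                      # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # e.g. one query per token in a 4-token sentence
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape, attn.shape)   # (4, 8) (4, 4); each row of attn sums to 1
```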
Transformer has no recurrence
• The Transformer is an attention-based model that learns contextual
relations between words. It includes two components:
1. An Encoder that reads the text input
2. A Decoder that produces a prediction for the task
• The model has no recurrence – self-attention is used to represent the
input/output without an RNN
• This allows more parallelism and gives high translation quality: state-of-
the-art results
• Training took 12 hours on eight P100 GPUs
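As a rough illustration of the encoder side, here is a sketch using PyTorch's built-in Transformer layers (hyperparameters are illustrative; this is not the exact architecture from the original paper):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model, nhead, num_layers, seq_len, batch = 1000, 64, 4, 2, 10, 2

embedding = nn.Embedding(vocab_size, d_model)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           dim_feedforward=128, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

tokens = torch.randint(0, vocab_size, (batch, seq_len))   # toy token ids
x = embedding(tokens)                                     # (batch, seq_len, d_model)

# Self-attention lets every position attend to every other position,
# and all positions are processed in parallel (no recurrence).
contextual = encoder(x)
print(contextual.shape)   # torch.Size([2, 10, 64])
```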