
21CSE356T – NATURAL LANGUAGE PROCESSING

Instructor:
Ms. S. Rama,
Assistant Professor
Department of Information Technology,
SRM Institute of Science and Technology
Unit IV – Language Models
• Recurrent Neural Network
• Long Short Term Memory
• Attention Mechanism
• Transformer Based Models
• Self attention
• Multihead attention
• BERT
• RoBERTa
• Fine-Tuning for Downstream Tasks
• Text Classification and Generation
Language Model
• What is a Language Model?
• A Language Model (LM) is a computational model designed to
understand, generate, and predict human language. It is a
fundamental concept in Natural Language Processing (NLP) and
serves as the backbone for many AI applications, including chatbots,
machine translation, text summarization, and speech recognition.
Types of Language Models
• Language models can be categorized based on their architecture and
how they process text:
• Statistical Language Models (Before the Deep Learning Era)
• These models use probabilities to predict the next word in a sentence based on the previous words (a minimal bigram sketch follows this list).
• Examples:
• N-gram models (bigram, trigram)
• Hidden Markov Models (HMM)
• Latent Dirichlet Allocation (LDA) for topic modeling
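
As a rough illustration of the n-gram idea (not taken from the slides), the following minimal Python sketch estimates bigram probabilities by counting. The toy corpus and the start/end markers <s> and </s> are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Toy corpus (illustrative only; not from the course material).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat saw a dog",
]

# Count bigrams and the number of times each word appears as a "previous" word.
bigram_counts = defaultdict(Counter)
prev_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1
        prev_counts[prev] += 1

def bigram_prob(prev, curr):
    """P(curr | prev) estimated by maximum likelihood (counts only)."""
    if prev_counts[prev] == 0:
        return 0.0
    return bigram_counts[prev][curr] / prev_counts[prev]

# Most likely continuation of "the" in this toy corpus.
print(bigram_counts["the"].most_common(1))   # [('cat', 2)]
print(bigram_prob("the", "cat"))             # 2/5 = 0.4
```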
Neural Language Models (Deep Learning Era)
• These models use deep learning techniques to learn complex patterns
in text.
• Examples:
• Recurrent Neural Networks (RNNs) – Captures sequential dependencies.
• Long Short-Term Memory (LSTM) – Handles long-range dependencies better
than RNNs.
• Gated Recurrent Units (GRU) – A simplified version of LSTMs.
• Transformers (State-of-the-Art) – Used in modern NLP applications.
Different Ways to Model Text
• A Classic Approach for Text Classification: Bag-of-Words Model
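
A minimal sketch of the bag-of-words idea in plain Python, assuming a tiny made-up corpus: word order is thrown away and each document becomes a vector of word counts over a shared vocabulary.

```python
from collections import Counter

# Two illustrative documents (not from the slides).
docs = [
    "the movie was great great fun",
    "the movie was boring",
]

# Build a fixed vocabulary from the corpus.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Map a document to a count vector over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

print(vocab)                    # ['boring', 'fun', 'great', 'movie', 'the', 'was']
for doc in docs:
    print(bag_of_words(doc))    # [0, 1, 2, 1, 1, 1] and [1, 0, 0, 1, 1, 1]
```

Such count vectors are what a classic text classifier (e.g. naive Bayes or logistic regression) would consume; the sequence models below keep the word order instead.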
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks
• Like traditional neural networks, such as feedforward neural networks and convolutional neural networks (CNNs), recurrent neural networks use training data to learn. They are distinguished by their "memory": they take information from prior inputs to influence the current input and output.

• While traditional deep learning networks assume that inputs and outputs are independent of each other, the output of a recurrent neural network depends on the prior elements within the sequence. Although future events would also be helpful in determining the output at a given position, unidirectional recurrent neural networks cannot account for these events in their predictions.

• A Recurrent Neural Network (RNN) is a type of neural network that processes sequential
data by maintaining a memory of previous inputs
Overview
• The networks we used previously are also called feedforward neural networks.
• A recurrent network, by contrast, consists of:
• Input Layer – accepts sequential data.
• Hidden Layer – maintains memory across time steps.
• Output Layer – produces predictions.
• We usually want to predict a vector at some (or every) time step.
Different Types of Sequence Modeling Tasks

• Many-to-one: the input data is a sequence, but the output is a fixed-size vector, not a sequence. Ex.: sentiment analysis – the input is some text, and the output is a class label.
• One-to-many: the input data is in a standard format (not a sequence), and the output is a sequence. Ex.: image captioning, where the input is an image and the output is a text description of that image.
Different Types of Sequence Modeling Tasks
• Many-to-many: both the inputs and the outputs are sequences. This can be direct or delayed. Ex.: video captioning, i.e., describing a sequence of images via text (direct); translating one language into another (delayed).
• Likewise, the level of output produced depends on the type of task.
Unfolded RNN
Recurrent Neural Network

We can process a sequence of vectors x by applying a recurrence formula at every time step:

ht = fW(ht−1, xt)

With a tanh non-linearity this is typically written as ht = tanh(Whh⋅ht−1 + Wxh⋅xt), and the output is yt = Why⋅ht.

Notice: the same function and the same set of parameters are used at every time step.
• ht – hidden state
• xt – input at time t
• W – weight matrices
• f – a non-linear activation function (ReLU, tanh)
• Why is the output weight matrix; the output is usually passed through a softmax (for classification) or another activation function suited to the task.
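
The recurrence above can be sketched in a few lines of NumPy. The layer sizes, random weights, and toy input sequence below are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 4-dimensional inputs, 3 hidden units, 2 outputs.
input_size, hidden_size, output_size = 4, 3, 2

# The same parameters are shared across every time step.
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))

def rnn_step(h_prev, x_t):
    """One application of the recurrence h_t = f_W(h_{t-1}, x_t)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)   # new hidden state
    y_t = W_hy @ h_t                            # output (pre-softmax)
    return h_t, y_t

# Process a toy sequence of 5 random input vectors.
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, y = rnn_step(h, x_t)
print(h, y)
```

Note that the same W_xh, W_hh, and W_hy are reused at every step; this weight sharing is exactly what the unrolling discussion below relies on.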
What is Unrolling in Time?

• Sequential Representation:
In an RNN, the same network cell is applied at every time step, processing
one element of the sequence at a time. However, because each time step's
computation depends on the previous ones (via the hidden state), the
network inherently forms a loop. Unrolling in time means "unfolding" this
loop so that each time step is represented as a separate layer in a deep
feedforward network.
• Temporal Layers: Imagine you have a sequence with T time steps. Unrolling the RNN creates T copies of the network cell arranged sequentially. Although these cells share the same weights, they are shown as separate layers corresponding to time steps t = 1, 2, …, T.
Why Unroll the RNN?
• Visualization: It makes it easier to understand how the hidden state
propagates through time and how each time step contributes to the
final output.
• Training with Backpropagation Through Time (BPTT): Unrolling allows us to apply a variant of backpropagation known as Backpropagation Through Time. By viewing the RNN as a deep network with T layers, we can calculate gradients at each time step and propagate errors backward through the sequence. This process is critical for updating the weights shared across all time steps.
How Unrolling Works

• Forward Pass:
• Time Step 1: The RNN cell processes the first input x1 and computes the first hidden state h1.
• Time Step 2: The cell takes x2 and h1 to compute h2, and so on.
• Result: You end up with a chain of hidden states h1, h2, …, hT and possibly outputs y1, y2, …, yT.
• Backward Pass:
• Error Propagation: When training, the error from the output is propagated backward through each of these unrolled steps. This allows the network to adjust its weights based on the entire sequence context.
• Shared Weights: Despite the unrolled structure, the weight matrices remain the same across all time steps. Gradients computed at each step are accumulated for the shared parameters.
Backpropagation and Vanishing Gradient
Loss in RNN
• Loss
• Definition:
In machine learning, the loss (or cost) function quantifies the
difference between the model's predicted output and the actual
target values. It measures how "wrong" the predictions are.
• In RNNs:
Since RNNs handle sequential data, the loss is often computed at each
time step and then aggregated (for example, by summing or
averaging) over the entire sequence. This aggregated loss then guides
how the model should adjust its parameters during training.
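
As a minimal illustration of this per-time-step aggregation (the logits and targets below are made-up toy values), the sequence loss can be formed by summing the cross-entropy at each time step:

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy per-time-step logits (3 time steps, 4 output classes) and integer targets.
logits = np.array([[2.0, 0.5, 0.1, -1.0],
                   [0.2, 1.5, 0.3,  0.0],
                   [0.1, 0.1, 0.1,  2.0]])
targets = [0, 1, 3]

# Cross-entropy at each time step, then aggregated over the whole sequence.
per_step_loss = [-np.log(softmax(z)[t]) for z, t in zip(logits, targets)]
total_loss = np.sum(per_step_loss)      # could equally use np.mean to average
print(per_step_loss, total_loss)
```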
Backpropagation in RNNs

• Definition:
Backpropagation is the algorithm used to compute the gradient of the loss function with
respect to each weight in the network, allowing for weight updates that minimize the loss.
• Backpropagation Through Time (BPTT):
• Process:
In RNNs, the standard backpropagation algorithm is extended to account for the sequential nature of the
data. The network is "unrolled" over time, creating a copy of the network for each time step. The
gradients are then computed for each of these time steps and aggregated.
• Challenges:
The process can suffer from the vanishing gradient problem because the gradients from later time steps
(which might carry important long-term dependencies) become exponentially smaller as they are
propagated back to earlier time steps.
• Significance:
BPTT is essential for training RNNs effectively, but its challenges have led to the development of
alternative architectures and methods to better capture long-range dependencies in sequences.
Vanishing Gradient
• It occurs when gradients—the values used to update the network's weights—become
exceedingly small as they are propagated backward through the network during
training. Here's a detailed breakdown:
• What Happens During Backpropagation
• Chain Rule Multiplication:
In deep networks or RNNs, the backpropagation algorithm relies on the chain rule to
compute gradients. This involves multiplying the derivatives of the activation
functions across many layers or time steps.
• Exponential Decay of Gradients:
If these derivatives are less than one, as is common with activation functions like
sigmoid or tanh, repeated multiplication can cause the gradient to shrink
exponentially. This results in very small gradient values for earlier layers or time steps.
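
A tiny numeric sketch of this effect, assuming an illustrative recurrent weight of 0.9 and a fixed tanh pre-activation of 1.5: repeatedly multiplying by factors smaller than one drives the gradient contribution toward zero.

```python
import numpy as np

def tanh_derivative(x):
    return 1.0 - np.tanh(x) ** 2   # always <= 1, and well below 1 away from x = 0

# Pretend every time step contributes a factor of 0.9 * tanh'(1.5)
# (both numbers are illustrative, not from the slides).
grad = 1.0
for step in range(50):
    grad *= 0.9 * tanh_derivative(1.5)
    if step in (0, 9, 49):
        print(f"after {step + 1:2d} steps: gradient factor = {grad:.2e}")
# The factor collapses toward zero, so early time steps receive almost no update.
```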
Consequences of Vanishing Gradients

• Slow Learning:
When gradients become very small, the corresponding weights receive
almost no update during training. This makes it extremely difficult for
the network to learn from the data, especially the information relevant
to earlier layers or earlier parts of the sequence.
• Difficulty Capturing Long-Term Dependencies:
In the context of RNNs, vanishing gradients hinder the network’s ability
to capture long-term dependencies. The influence of an input from
many time steps ago diminishes rapidly, leading the model to "forget"
important contextual information.
RNN
An RNN is basically a black box with an "internal state" that is updated as a sequence is processed. At every single time step, we feed an input vector into the RNN, which modifies that state as a function of what it receives.

When we tune the RNN weights, the RNN shows different behaviors in terms of how its state evolves as it receives these inputs. We are also interested in producing an output based on the RNN state, so we produce output vectors on top of the RNN.

If we unroll an RNN model, there are inputs (e.g. video frames) at different time steps x1, x2, …, xt. At each time step the RNN takes in two inputs – an input frame (xi) and the previous representation of what it has seen so far (i.e. the history) – to generate an output and update its history, which is propagated forward over time. All the RNN blocks are the same block sharing the same parameters, but they have different inputs and history at each time step.
Advantages
• An RNN carries information forward through time, which is what makes it useful in time-series prediction: previous inputs can influence the current prediction. This ability to retain context is what Long Short-Term Memory networks are designed to strengthen.
• Recurrent neural networks are even used with convolutional layers to extend
the effective pixel neighborhood.
Disadvantages
• Vanishing and exploding gradient problems.
• Training an RNN is a very difficult task.
• It cannot process very long sequences when tanh or ReLU is used as the activation function.
LSTM (Long Short-Term Memory)
(Motivating example: sarcasm and long-range dependency)
LSTM (Long Short-Term Memory)
• What is LSTM?
• LSTM is a recurrent neural network (RNN) architecture widely
used in Deep Learning. It excels at capturing long-term
dependencies, making it ideal for sequence prediction
tasks.
• Unlike traditional neural networks, LSTM incorporates feedback
connections, allowing it to process entire sequences of data, not
just individual data points. This makes it highly effective in
understanding and predicting patterns in sequential data like
time series, text, and speech.
• LSTM has become a powerful tool in artificial intelligence and
deep learning, enabling breakthroughs in various fields by
uncovering valuable insights from sequential data
Selective Read, Selective Write, Selective Forget
• Selective Write - we select what to write
• Selective read – we select what to read
• Selective Forget - we select what to forget
LSTM Architecture
• At a high level, an LSTM cell works very much like an RNN cell. Here is the internal functioning of the LSTM network. The LSTM network architecture consists of three parts, and each part performs an individual function.

• The first part chooses whether the information coming from the previous
timestamp is to be remembered or is irrelevant and can be forgotten.
• In the second part, the cell tries to learn new information from the input to this
cell.
• At last, in the third part, the cell passes the updated information from the current
timestamp to the next timestamp.
LONG SHORT TERM MEMORY
• These three parts of an LSTM unit are known as gates. They
control the flow of information in and out of the memory
cell or LSTM cell.
• The first gate is called Forget gate, the second gate is
known as the Input gate, and the last one is the Output
gate.
• An LSTM unit consisting of these three gates and a memory cell (or LSTM cell) can be thought of as analogous to a layer of neurons in a traditional feedforward neural network, with each unit additionally maintaining a hidden state and a current cell state.
LSTM
• LSTM also has a hidden state where H(t-1)
represents the hidden state of the previous
timestamp and Ht is the hidden state of the
current timestamp. In addition to that, LSTM also
has a cell state represented by C(t-1) and C(t)
for the previous and current timestamps,
respectively.
• Here the hidden state is known as Short
term memory, and the cell state is known
as Long term memory.
The Logic Behind LSTM
Example of LSTM at work. Input: "A is a nice person. B, on the other hand, is evil."
• Let's take an example to understand how LSTM works. Here we have two sentences separated by a full stop. The first sentence is "A is a nice person," and the second sentence is "B, on the other hand, is evil." It is very clear that in the first sentence we are talking about A, and as soon as we encounter the full stop (.), we start talking about B.
As we move from the first sentence to the second sentence, our network should realize that we are no longer talking about A; now our subject is B. Here, the Forget gate of the network allows it to forget about A. Let's understand the roles played by these gates in LSTM.
Forget Gate
• The forget gate removes information that is no longer useful in the cell state. Two inputs, xt (the input at the current time step) and ht-1 (the previous hidden state), are fed to the gate and multiplied with weight matrices, followed by the addition of a bias.
• The result is passed through a sigmoid activation function, which gives an output between 0 and 1. If, for a particular cell-state entry, the output is close to 0, that piece of information is forgotten; if the output is close to 1, the information is retained for future use.
The equation for the forget gate is:
ft = σ(Wf⋅[ht−1, xt] + bf)
where:
• Wf represents the weight matrix associated with the forget gate.
• [ht−1, xt] denotes the concatenation of the previous hidden state and the current input.
• bf is the bias of the forget gate.
• σ is the sigmoid activation function.

Input gate

The input gate adds useful information to the cell state. First, the information is regulated using the sigmoid function, which filters the values to be remembered (similar to the forget gate) using the inputs ht−1 and xt. Then, a vector of candidate values is created using the tanh function, which gives an output from −1 to +1 and contains all the possible values from ht−1 and xt. Finally, the regulated values and the candidate vector are multiplied to obtain the useful information. The equations for the input gate are:
it = σ(Wi⋅[ht−1, xt] + bi)
Ĉt = tanh(Wc⋅[ht−1, xt] + bc)
We multiply the previous cell state by ft, discarding the information we had previously chosen to forget. Next, we add it ⊙ Ĉt, the candidate values scaled by how much we chose to update each state value:
Ct = ft ⊙ Ct−1 + it ⊙ Ĉt
where
• ⊙ denotes element-wise multiplication
• tanh is the tanh activation function
Output gate
The task of extracting useful information from the current cell
state to be presented as output is done by the output gate.
• First, a vector is generated by applying the tanh function to the cell state.
• Then, the information is regulated using the sigmoid function, filtered by the values to be remembered, using the inputs ht−1 and xt.
• Finally, the values of the vector and the regulated values are multiplied and sent as the output and as input to the next cell.
The equations for the output gate are:
ot = σ(Wo⋅[ht−1, xt] + bo)
ht = ot ⊙ tanh(Ct)
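
Putting the three gates together, here is a NumPy sketch of one full LSTM cell step following the equations in this section; all shapes, weights, and inputs are illustrative assumptions, not values from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
hidden_size, input_size = 3, 4
concat = hidden_size + input_size

# One weight matrix per gate, plus one for the candidate cell state.
W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=(hidden_size, concat)) for _ in range(4))
b_f = b_i = b_c = b_o = np.zeros(hidden_size)

def lstm_step(h_prev, c_prev, x_t):
    """One LSTM time step following the gate equations above."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)               # forget gate
    i_t = sigmoid(W_i @ z + b_i)               # input gate
    c_hat = np.tanh(W_c @ z + b_c)             # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat           # new cell state (long-term memory)
    o_t = sigmoid(W_o @ z + b_o)               # output gate
    h_t = o_t * np.tanh(c_t)                   # new hidden state (short-term memory)
    return h_t, c_t

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):   # toy sequence of 5 inputs
    h, c = lstm_step(h, c, x_t)
print(h, c)
```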
Bidirectional LSTM Model
Bidirectional LSTM (Bi-LSTM/BLSTM) is a variation of the normal LSTM which processes sequential data in both the forward and backward directions. This allows a Bi-LSTM to use both past and future context, capturing dependencies that a traditional LSTM, which processes the sequence in only one direction, would miss.
• Bi-LSTMs are made up of two LSTM networks: one that processes the input sequence in the forward direction and one that processes it in the backward direction.
• The outputs of the two LSTM networks are then combined to produce the final output.
LSTM models, including Bi-LSTMs, have demonstrated state-of-the-art performance across various tasks such as machine translation, speech recognition and text summarization.
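
A minimal sketch of a bidirectional LSTM using PyTorch's built-in nn.LSTM with bidirectional=True, assuming PyTorch is available; the layer sizes and random input below are illustrative.

```python
import torch
import torch.nn as nn

# Illustrative sizes: 8-dimensional inputs, 16 hidden units per direction.
bilstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=1,
                 batch_first=True, bidirectional=True)

# A toy batch of 2 sequences, each 5 time steps long.
x = torch.randn(2, 5, 8)

# output concatenates the forward and backward hidden states at every step,
# so its last dimension is 2 * hidden_size.
output, (h_n, c_n) = bilstm(x)
print(output.shape)   # torch.Size([2, 5, 32])
print(h_n.shape)      # torch.Size([2, 2, 16])  (num_layers * num_directions, batch, hidden_size)
```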
Applications of LSTM

Some of the famous applications of LSTM include:
• Language Modeling: Used in tasks like language modeling, machine translation
and text summarization. These networks learn the dependencies between words in
a sentence to generate coherent and grammatically correct sentences.
• Speech Recognition: Used in transcribing speech to text and recognizing spoken
commands. By learning speech patterns they can match spoken words to
corresponding text.
• Time Series Forecasting: Used for predicting stock prices, weather and energy
consumption. They learn patterns in time series data to predict future events.
• Anomaly Detection: Used for detecting fraud or network intrusions. These
networks can identify patterns in data that deviate drastically and flag them as
potential anomalies.
• Recommender Systems: In recommendation tasks like suggesting movies,
music and books. They learn user behavior patterns to provide personalized
suggestions.
LSTM vs RNN
• Memory: LSTM has a special memory unit that allows it to learn long-term dependencies in sequential data; RNN does not have a memory unit.
• Directionality: LSTM can be trained to process sequential data in both forward and backward directions; RNN can only be trained to process sequential data in one direction.
• Training: LSTM is more difficult to train than RNN due to the complexity of the gates and memory unit; RNN is easier to train than LSTM.
• Applications: LSTM – machine translation, speech recognition, text summarization, natural language processing, time series forecasting; RNN – natural language processing, machine translation, speech recognition, image processing, video processing.
Attention Mechanism
Motivation
Recurrent neural networks (LSTM/GRU) are the model of choice when working with variable-length inputs and are thus a natural fit for text processing. But:
• the sequential nature of RNNs prohibits parallelization,
• the context is computed from the past only,
• there is no explicit distinction between short- and long-range dependencies (everything is dealt with via the context),
• training is tricky – how can we do transfer learning efficiently?
On the other hand, convolutions can
• operate on both time series (1D convolution) and images,
• be massively parallelized,
• exploit local dependencies (within the kernel) and long-range dependencies (using multiple layers),
but:
• they cannot deal with variable-size inputs,
• the position of these dependencies is fixed.
Attention
• An attention model is a mechanism used in neural networks that dynamically
focuses on the most relevant parts of the input data when making
predictions or generating outputs.
• Instead of processing all parts of the input equally, the model assigns different
weights to different elements, allowing it to "attend" more to critical features.
This is especially useful in tasks like machine translation, text summarization,
and image recognition.
• Key Points:
• Dynamic Focus: Traditional models might compress input into a fixed-length
representation, but attention mechanisms evaluate and weigh each input
element based on its relevance to the current task.
• Applications: Originally introduced for neural machine translation (e.g.,
Bahdanau attention), attention mechanisms now play a central role in many
state-of-the-art architectures such as Transformers.
A classic illustration uses two sentences such as "The animal didn't cross the street because it was too wide" (Sentence 1) and "The animal didn't cross the street because it was too tired" (Sentence 2):
• In Sentence 1, the attention mechanism assigns more weight to "street" when resolving "it."
• In Sentence 2, the attention mechanism assigns more weight to "animal" when resolving "it."
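
A minimal NumPy sketch of the underlying idea, scaled dot-product attention: each query is compared with every key, the scores are turned into weights with a softmax, and the values are combined according to those weights. The toy matrices below are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # one weight distribution per query
    return weights @ V, weights

rng = np.random.default_rng(3)
# Toy sequence of 4 tokens with 6-dimensional representations.
Q = rng.normal(size=(4, 6))
K = rng.normal(size=(4, 6))
V = rng.normal(size=(4, 6))

context, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(2))   # each row sums to 1: how much one token attends to the others
print(context.shape)      # (4, 6)
```

When Q, K, and V all come from the same sequence, this is self-attention, which is the building block of the Transformer discussed next.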
Transformer has no recurrence
• The Transformer is an attention-based architecture that learns contextual relations between words. It includes two components:
1. An Encoder that reads the text input
2. A Decoder that produces a prediction for the task
• The model has no recurrence – self-attention is used to represent the input/output without an RNN.
• This allows more parallelism while achieving state-of-the-art translation quality.
• The original model was trained in about 12 hours on eight P100 GPUs.
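
As a sketch (assuming PyTorch), the built-in nn.MultiheadAttention layer can be used for self-attention by passing the same sequence as queries, keys, and values; the embedding size, number of heads, and toy input below are illustrative.

```python
import torch
import torch.nn as nn

# Illustrative sizes: embeddings of dimension 16 split across 4 heads.
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)

# Toy batch: 2 sequences of 5 token embeddings (these would normally come
# from an embedding layer plus positional encodings).
x = torch.randn(2, 5, 16)

# Self-attention: queries, keys and values are all the same sequence.
attn_output, attn_weights = mha(x, x, x)
print(attn_output.shape)    # torch.Size([2, 5, 16])
print(attn_weights.shape)   # torch.Size([2, 5, 5]) – averaged over heads by default
```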
