NLP_slides2

The document discusses various approaches in Natural Language Processing (NLP), including word embeddings like Word2Vec, sequential processing methods such as RNN and LSTM, and the use of Transformers for large language models. It highlights the advantages and disadvantages of different architectures, emphasizing the need for sequential modeling in certain applications like chatbots and machine translation. Additionally, it explains the attention mechanism in Transformers, which allows for context-dependent processing of word vectors without relying on sequential data processing.


Latest Approaches for NLP

▪ Word embeddings: Word2Vec
▪ Sequential processing using conventional approaches: RNN, LSTM
▪ Application-specific Chatbot Models
▪ Transformers: The engine behind Large Language Models
▪ Startup Examples
Word2vec for NLP
Example: Personality Vector
Imagine a person scored 38/100 on an introversion/extraversion
test. We can plot that in this way:
Example: Personality Vector

We can represent the two dimensions as a point on the graph, or better yet, as a
vector from the origin to that point. We have incredible tools to deal with vectors that
will come in handy very shortly.
Example: Personality Vector
Example: Word Vector
1. There's a straight red column through all of these different words. They're similar
along that dimension (and we don't know what each dimension codes for).

2. You can see how "woman" and "girl" are similar to each other in a lot of places. The
same with "man" and "boy".

3. "boy" and "girl" also have places where they are similar to each other, but different
from "woman" or "man". Could these be coding for a vague conception of youth?
Possibly.

4. All but the last word are words representing people. I added an object (water) to show
the differences between categories. You can, for example, see that blue column going all
the way down and stopping before the embedding for "water".

5. There are clear places where "king" and "queen" are similar to each other and distinct
from all the others. Could these be coding for a vague concept of royalty?
Example: Word Vector
A famous example that shows an incredible property of embeddings is the concept
of analogies. We can add and subtract word embeddings and arrive at interesting
results. The most famous example is the formula "king" - "man" + "woman":
Example: Word Vector
The resulting vector from "king-man+woman" doesn't exactly equal "queen",
but "queen" is the closest word to it from the 400,000 word embeddings we
have in this collection.
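To make the analogy concrete, here is a minimal sketch using the gensim library with a pre-trained GloVe model of roughly 400,000 words; the specific model name ("glove-wiki-gigaword-100") and the library choice are assumptions for illustration, not something prescribed by these slides:

```python
# Minimal sketch of the king - man + woman analogy (assumes gensim is installed;
# the chosen pre-trained model is one of several options with a ~400k-word vocabulary).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads pre-trained word vectors

# "king" - "man" + "woman": the nearest remaining word is expected to be "queen".
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # e.g. [('queen', 0.78...)]
```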
Application Example: Word Vector - Next Word Prediction
Word2Vec Training
Skipgram Approach
Word2Vec Training - Negative Samples
Skipgram with Negative Sampling (SGNS)
Word2Vec Training Process
At the start of the training process, we initialize these matrices with random values. Then training begins. In each
training step, we take one positive example and its associated negative examples. Let's take our
first group:
Now we have four words: the input word "not" and the output/context words "thou" (the actual neighbor), "aaron", and "taco"
(the negative examples). We proceed to look up their embeddings - for the input word, we look in the Embedding
matrix; for the context words, we look in the Context matrix (even though both matrices have an embedding for
every word in our vocabulary).
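As a rough illustration of one such training step, here is a minimal NumPy sketch of an SGNS update for the group above; the toy vocabulary, embedding size, and learning rate are made up for illustration and are not taken from the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
vocab = {"not": 0, "thou": 1, "aaron": 2, "taco": 3}   # toy vocabulary for this one group
embed_dim = 50

# Both matrices are initialised randomly and hold one row per vocabulary word.
embedding = rng.normal(scale=0.1, size=(len(vocab), embed_dim))  # Embedding matrix (input words)
context   = rng.normal(scale=0.1, size=(len(vocab), embed_dim))  # Context matrix (output words)

def sgns_step(input_word, context_words, labels, lr=0.025):
    """One SGNS update: the true neighbour has label 1, negative samples have label 0."""
    i = vocab[input_word]
    v = embedding[i].copy()                 # look up the input word in the Embedding matrix
    grad_v = np.zeros_like(v)
    for word, label in zip(context_words, labels):
        j = vocab[word]
        c = context[j]                      # look up the context word in the Context matrix
        score = sigmoid(np.dot(v, c))       # predicted probability that the pair are neighbours
        err = label - score                 # error signal
        grad_v += err * c
        context[j] += lr * err * v          # nudge the context-matrix row
    embedding[i] += lr * grad_v             # nudge the embedding-matrix row

# One step for the example group: input "not", true neighbour "thou",
# negative samples "aaron" and "taco".
sgns_step("not", ["thou", "aaron", "taco"], labels=[1, 0, 0])
```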
Word2Vec Training
Use of Neural Networks to Classify Texts using their Embeddings
Applying CNN to Word Vectors
A CNN is applied to the constituent word vectors to extract higher-level features.

The resulting abstract features have been effectively used for sentiment analysis, machine
translation, and question answering, among other tasks.

The goal of this method is to transform words into a vector representation via a look-up table,
which results in a primitive word embedding approach that learns weights during the training of
the network.

https://medium.com/dair-ai/deep-learning-for-nlp-an-overview-of-recent-trends-d0d8f40a776d
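As a minimal sketch of this look-up-table-plus-convolution idea (assuming PyTorch; the vocabulary size, kernel widths, and class count below are illustrative, not taken from the slides):

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Minimal text CNN: look-up table -> 1D convolutions -> max pooling -> classifier."""
    def __init__(self, vocab_size=10000, embed_dim=100, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)            # learned look-up table
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, 64, kernel_size=k) for k in (3, 4, 5)]
        )
        self.fc = nn.Linear(64 * 3, num_classes)

    def forward(self, token_ids):                                   # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)                   # (batch, embed_dim, seq_len)
        feats = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(feats, dim=1))                     # class logits, e.g. sentiment

logits = TextCNN()(torch.randint(0, 10000, (8, 20)))                # 8 sentences of 20 token ids
```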
Applying CNN to Word2Vec Embeddings
Motivation: Need for Sequential Modeling

Why do we need Sequential Modeling?


Motivation: Need for Sequential Modeling

Share features learned across different positions or time steps.

Example:
Sentence 1: "Market falls into bear territory" → Trading/Marketing
Sentence 2: "Bear falls into market territory" → UNK

[Figure: both sentences fed as unordered words into a FF-net / CNN. With no sequential or temporal modeling, i.e., order-less processing, the network treats the two sentences the same, even though they should map to different meanings (Trading vs. UNK).]
Recurrent Neural Networks

Goal
➢ model long term dependencies
➢ connect previous information to the present task
➢ model sequence of events with loops, allowing information to persist
Feed-forward nets cannot take time dependencies into account.
Sequential data needs a feedback mechanism.
[Figure: a network with an internal state loop (feedback mechanism), unfolded in time: inputs x_0, …, x_t-1, x_t, …, x_T produce outputs o_0, …, o_T, with the hidden state carried between steps through the shared weight matrix W_hh.]
Recurrent Neural Network (RNN)
Basic Operation: Recurrent Neural Networks
[Figure: an RNN tagging the input sequence "Pankaj lives in Munich". Each word enters the input layer as a one-hot vector and is projected into the hidden layer through W_xh; the hidden state is passed between time steps through W_hh; the output layer (through W_ho, followed by a softmax) produces a distribution over the labels person / location / other, yielding the output labels person, other, other, location.]
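A minimal NumPy sketch of the forward pass depicted above; the weights here are random (untrained), so the printed label distributions are meaningless, whereas a trained network would assign person / other / other / location:

```python
import numpy as np

def rnn_tagger_forward(x_seq, W_xh, W_hh, W_ho):
    """Run a simple RNN over one-hot inputs and emit a label distribution per word."""
    h = np.zeros(W_hh.shape[0])
    outputs = []
    for x in x_seq:                                    # one time step per word
        h = np.tanh(W_xh @ x + W_hh @ h)               # hidden state carries the past
        scores = W_ho @ h
        outputs.append(np.exp(scores) / np.exp(scores).sum())  # softmax over the labels
    return outputs

rng = np.random.default_rng(0)
vocab_size, hidden_size, num_labels = 4, 3, 3          # labels: person / location / other
W_xh = rng.normal(size=(hidden_size, vocab_size))
W_hh = rng.normal(size=(hidden_size, hidden_size))
W_ho = rng.normal(size=(num_labels, hidden_size))

one_hot_sentence = np.eye(vocab_size)                  # "Pankaj lives in Munich" as one-hot vectors
for word, probs in zip(["Pankaj", "lives", "in", "Munich"],
                       rnn_tagger_forward(one_hot_sentence, W_xh, W_hh, W_ho)):
    print(word, probs.round(2))
```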
Motivation: Need for Sequential Modeling
Share features learned across different positions or time steps.
Example:
Sentence 1: "Market falls into bear territory" → Trading/Marketing
Sentence 2: "Bear falls into market territory" → UNK

[Figure: the same two sentences processed word-by-word by a sequential model (RNN). Because the RNN captures language concepts, word ordering, and syntactic & semantic information, it can distinguish the two sentences.]
Motivation: Need for Sequential Modeling
Machine Translation: different input and output sizes, incurring sequential patterns

[Figure: an encoder-decoder setup. The Encoder encodes the input text "Pankaj lives in Munich"; the Decoder generates the translation, e.g. "pankaj lebt in münchen" (German) or "पंकज मुनिच में रहता है" (Hindi).]
Motivation: Need for Sequential Modeling
Convolutional vs Recurrent Neural Networks

RNN
- performs well when the input data is interdependent in a sequential pattern
- correlation between the previous input and the next input
- introduces a bias based on the previous output

CNN/FF-Nets
- each output depends only on the current input, independent of previous inputs
- feed-forward nets don't remember historic input data at test time, unlike recurrent networks
Long-Term and Short-Term Dependencies
Short Term Dependencies

→ need recent information to perform the present task.


For example, in a language model we predict the next word based on the previous ones:
"the clouds are in the ?" → "sky", completing "the clouds are in the sky".

→ It is easy to predict "sky" given this context, i.e., a short-term dependency.

Long Term Dependencies

→ Consider a longer word sequence: "I grew up in France … I speak fluent French."


→ Recent information suggests that the next word is probably the name of a language, but if we want to
narrow down which language, we need the context of France, from further back.
RNN Advantages:
- Can process any length input
- Computation for step t can (in theory) use information from many
steps back

RNN Disadvantages:
- Recurrent computation is slower
- In practice, it is difficult to access information from many steps back.
For instance, the effect of older and more distant inputs will
eventually fade out - the problem of Vanishing Gradients!
Ref: https://medium.com/metaor-artificial-intelligence/the-exploding-and-vanishing-gradients-problem-in-time-series-6b87d558d22
Reference: https://d2l.ai/chapter_recurrent-modern/lstm.html
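The referenced chapter covers the LSTM gating equations in detail; as a minimal usage sketch (assuming PyTorch, with illustrative sizes), the gated cell state is what lets information and gradients persist over many more steps than a vanilla RNN:

```python
import torch
import torch.nn as nn

# Minimal LSTM usage sketch (sizes are illustrative, not from the slides).
lstm = nn.LSTM(input_size=100, hidden_size=128, batch_first=True)
embeddings = torch.randn(8, 50, 100)     # batch of 8 sequences, 50 steps, 100-dim word vectors
outputs, (h_n, c_n) = lstm(embeddings)   # outputs: (8, 50, 128); final hidden/cell states: (1, 8, 128)
```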
Integrating Sequential Processing in CNN Applications

LSTM
Example: Image to Text
Output Examples
Working
The proposed model is trained with a set of images and their corresponding sentence
descriptions. It is assumed that the sentences written by people refer to a particular but
unknown region of the image.

The first model aligns sentence snippets to the visual image regions. Afterwards, the second,
multimodal RNN is trained with the output of the first and learns how to generate sentences.

The CNN has to learn how to align visual and language data. Therefore the net uses a
method described by Girshick et al. to detect objects in every image with a CNN that is
pre-trained on ImageNet.

This pre-trained network is very similar to VGGNet, with the only difference that the last
two fully connected layers are cut. Karpathy and Fei-Fei propose a BRNN that is used to
represent sentences.

Finally, after aligning the data, the output of the first model is fed to the multimodal Recurrent
Neural Network. This network has a typical hidden layer of 512 neurons. It is shown in figure
6 that the input of the next recurrent layer is always the output of the layer before. The
network is trained to combine a word with the previous context in order to predict the next word of the description.
Integrating Sequential Processing in CNN: Example - Speech to Text
Chatbots

Generic, large models (more natural) vs. application-specific, smaller models
Rudimentary Rule Based Chatbot
Encoder-Decoder using RNN/LSTM

[Figure: an Encoder that encodes the input text into a representation, which is passed to a Decoder that generates the output.]
Encoder-Decoder using RNN/LSTM
The encoder processes the input sequence and encodes it into a fixed-length
representation, also known as a context vector or latent space representation.
Usually, the final hidden state of the network serves as the context vector, which
summarises the input information.

Once the model encodes the input sequence, the decoder takes over and
generates an output sequence based on the encoded representation. The
decoder usually uses a similar structure to the encoder; however, the hidden
state of the decoder is initialized with the context vector from the encoder.

The decoder uses this initial hidden state to generate the first token of the
output sequence. It then generates subsequent tokens, conditioning its
predictions on both the previously generated tokens and the context vector. This
process continues until an end-of-sequence token is generated or a maximum
sequence length is reached.
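A minimal sketch of this encoder-decoder pattern (assuming PyTorch and using a GRU for brevity; an LSTM can be substituted, and all sizes are illustrative). The decoder below is shown with teacher forcing on the target ids; at inference time it would instead generate one token at a time until an end-of-sequence token is produced:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: the encoder's final hidden state is the context vector."""
    def __init__(self, src_vocab=5000, tgt_vocab=5000, embed_dim=128, hidden=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, context = self.encoder(self.src_embed(src_ids))            # final hidden state = context vector
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), context)   # decoder starts from the context
        return self.out(dec_out)                                      # next-token logits at every step

model = Seq2Seq()
logits = model(torch.randint(0, 5000, (4, 12)),   # 4 source sentences of 12 tokens
               torch.randint(0, 5000, (4, 10)))   # 4 target sentences of 10 tokens
```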
NLP Based Chatbot
NLP Based Chatbot – More detailed view

https://bhashkarkunal.medium.com/conversational-ai-chatbot-using-deep-learning-how-bi-directional-lstm-machine-reading-38dc5cf5a5a3
NLP Based Chatbot using GAN

[Figure: architecture diagram combining stacked LSTM units with an RCNN.]
Note: Non-NLP use of LSTM - predictions and forecasting of data based on variables
Going beyond Seq2Seq Model
Despite being a useful model for summarising the input sequence, the sequence-to-sequence model
has an issue when the input sequence is quite long and contains a lot of information. Not every piece
of the input sequence's context is required at every decoding stage for every text-generation task.
For instance, a machine translation model does not need to be aware of the other words in the
sentence when translating "boy" in the phrase "A boy is eating the banana".

Therefore, people have started using the Transformer, which applies a special attention mechanism. The Transformer
is a state-of-the-art model that is widely used in NLP and Computer Vision.
Transformer Architecture
Transformers use positional encoding, which encodes the positional information of the word vectors while the entire
group of word vectors is processed in parallel, as opposed to Seq2Seq models based on RNN/LSTM, which involve
sequential processing.
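A minimal NumPy sketch of one common choice, the sinusoidal positional encoding from the original Transformer paper; the sequence length and model dimension below are illustrative:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: each position gets a distinctive pattern of sines/cosines."""
    positions = np.arange(seq_len)[:, None]                          # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                               # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                            # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                            # odd dimensions
    return pe

# Added to the word vectors, so the whole sequence can be processed in parallel
# without losing word-order information.
word_vectors = np.random.randn(10, 64)                               # 10 words, 64-dim embeddings
encoder_inputs = word_vectors + positional_encoding(10, 64)
```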
The Attention Mechanism in Transformers
The word vectors generated through the Word2Vec method are trained over a large, generic data set. The resulting word
vectors may not capture contexts specific to a particular text, and they remain the same irrespective of a change of
context: the word vector for the word 'bark', once obtained through the Word2Vec method, stays the same across all
of its different usages. The attention mechanism fine-tunes these word vectors to capture the immediate
context and dependencies.

(Word2Vec Outputs)
In the example above, the dot product is computed between the word vector V1 (the query) and all other word vectors in the
input text (the keys), and the results are combined to obtain the weighted word vector Y1. This transformed word vector Y1
captures the context of V1 with respect to the other word vectors among the 'key' vectors. This is done for every word vector
Vi input to the attention unit. This operation produces modified word vectors which capture the similarity with the other vectors in the present context.
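A minimal NumPy sketch of this idea as scaled dot-product self-attention; the query/key/value projection matrices stand in for the trained weight matrices described next, and all sizes and values are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over word vectors X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                   # queries, keys, values (learned projections)
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # dot-product similarity of each query with every key
    weights = softmax(scores, axis=-1)                 # how much each word attends to every other word
    return weights @ V                                 # context-dependent word vectors Y

rng = np.random.default_rng(0)
d_model = 8
X = rng.normal(size=(5, d_model))                      # e.g. Word2Vec vectors for a 5-word sentence
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Y = self_attention(X, Wq, Wk, Wv)                      # same shape as X, but context-aware
```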
For each query and key pair, value vectors are obtained through a previously trained neural network model, which
essentially involves multiplying the input vectors by a weight matrix that is obtained through training. The attention
values are obtained by a weighted combination of all the value vectors (corresponding to the query-key pairs), using
the similarity scores, to yield the modified word embedding for each input word vector. In this manner, the final
attention vector captures the context of several different combinations of the surrounding key words.
[Figure annotations, listed from input to output:]
➢ Similarity calculation between each input word vector and the key word vectors using the dot product.
➢ Weighted combination of the resulting modified word vectors.
➢ Final scaling of the value vectors learnt for each key-query pair with the similarity score, followed by concatenation, to produce the "Attention Vector".
A Transformer may have many such attention blocks in parallel, and also stacked one after another, involving multi-step
attention computation. The concatenated attention vector passes through a feed-forward neural network to produce the
input to the decoder block for generating the expected output vectors, by combining the attention vector with the
previous output context vector.
Matmul = matrix multiplication, used for obtaining the similarity scores and the scaled output vectors.
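Continuing in the same spirit, a small self-contained NumPy illustration of several attention blocks ("heads") run in parallel, concatenated, and passed through a simple feed-forward layer; the number of heads, sizes, and random weights are all made up for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

rng = np.random.default_rng(0)
d_model, n_heads = 8, 4
X = rng.normal(size=(5, d_model))                                  # 5 word vectors

# Several attention blocks run in parallel on the same inputs ...
heads = [tuple(rng.normal(size=(d_model, d_model)) for _ in range(3)) for _ in range(n_heads)]
concatenated = np.concatenate([attention_head(X, *h) for h in heads], axis=-1)  # (5, n_heads * d_model)

# ... and the concatenated attention vector passes through a feed-forward layer;
# in a full Transformer this whole block would be stacked several times.
W_ff = rng.normal(size=(n_heads * d_model, d_model))
ffn_out = np.maximum(concatenated @ W_ff, 0)                       # (5, d_model)
```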
Transformer networks are different from traditional recurrent neural networks (RNNs)
and convolutional neural networks (CNNs) in that they do not use sequential processing
or convolutional filters.

Instead, they use self-attention mechanisms and parallel processing to handle input
sequences.

The attention layers involve matrix multiplication of the input patterns with learnt weights,
which transforms the input vectors into context-dependent vectors, accounting for the
context created by nearby words.

Reference: Computations involved in the Attention Model

https://towardsdatascience.com/all-you-need-to-know-about-attention-and-transformers-in-depth-understanding-part-1-552f0b41d021

https://towardsdatascience.com/attention-and-transformer-models-fe667f958378

https://machinelearningmastery.com/the-transformer-attention-mechanism/

Video Explanation: https://www.youtube.com/watch?v=eMlx5fFNoYc


RNN vs Transformers
4.1. Architecture
RNNs are sequential models that process data one element at a time, maintaining an internal hidden state that is updated at
each step. They operate in a recurrent manner, where the output at each step depends on the previous hidden state and the
current input.
Transformers are non-sequential models that process data in parallel. They rely on self-attention mechanisms to capture
dependencies between different elements in the input sequence. Transformers do not have recurrent connections or hidden
states.

4.2. Handling Sequence Length


RNNs can handle variable-length sequences as they process data sequentially. However, long sequences can lead to vanishing or
exploding gradients, making it challenging for RNNs to capture long-term dependencies.
Transformers can handle both short and long sequences efficiently due to their parallel processing nature. Self-attention allows
them to capture dependencies regardless of the sequence length.

4.3. Dependency Modeling


RNNs are well-suited for modeling sequential dependencies. They can capture contextual information from the past, making
them effective for tasks like language modeling, speech recognition, and sentiment analysis.
Transformers excel at modeling dependencies between elements, irrespective of their positions in the sequence. They are
particularly powerful for tasks involving long-range dependencies, such as machine translation, document classification, and
image captioning.
4.4. Size of the Model
The size of an RNN is primarily determined by the number of recurrent units (e.g., LSTM cells or GRU cells) and the number
of parameters within each unit. RNNs have a compact structure as they mainly rely on recurrent connections and relatively
small hidden state dimensions. The number of parameters in an RNN is directly proportional to the number of recurrent
units and the size of the input and hidden state dimensions.
Transformers tend to have larger model sizes due to their architecture. The main components contributing to the size of a
Transformer model are self-attention layers, feed-forward layers, and positional encodings. Transformers have a more
parallelizable design, allowing for efficient computation on GPUs or TPUs. However, this parallel processing capability comes
at the cost of a larger number of parameters.

4.5. Training and Parallelisation


For RNNs, training is mostly done in a sequential manner, as the hidden state relies on previous steps. This makes parallelization
more challenging, resulting in slower training times.
On the other hand, we train Transformers in parallel since they process data simultaneously. This parallelization capability
speeds up training and enables the use of larger batch sizes, which makes training more efficient.

4.6. Pre-training and Transfer Learning


Pre-training RNNs is more challenging due to their sequential nature. Transfer learning is typically limited to specific tasks
or related domains.
We can pre-train Transformer models on large-scale corpora using unsupervised objectives like language modeling or
masked language modeling. After pre-training, we can fine-tune the model on various downstream tasks, enabling effective
transfer learning.
Transformer vs RNN vs LSTM

Summary
ChatGPT vs BERT:
https://blog.invgate.com/gpt-3-vs-bert

Top 10 LLM-Based Startups in India

https://www.f6s.com/companies/large-language-model-llm/india/co

https://yourstory.com/2023/12/homegrown-startups-developing-llms-that-understand-indic-languages

LLM-Based Foreign Startups

https://www.ventureradar.com/startup/LLM
