NLP_slides2
▪ Word embeddings: Word2Vec
▪ Sequential processing using conventional approaches: RNN, LSTM
▪ Application-specific ChatBot Models
▪ Transformers: The engine behind Large Language Models
▪ Startup Examples
Word2vec for NLP
Example: Personality Vector
Imagine a person scored 38/100 on an introversion/extraversion test. We can plot that in this way:
Example: Personality Vector
We can represent the two dimensions as a point on the graph, or better yet, as a
vector from the origin to that point. We have incredible tools to deal with vectors that
will come in handy very shortly.
Example: Word Vector
1. There’s a straight red column through all of these different words. They’re similar along that dimension (and we don’t know what each dimension codes for).
2. You can see how “woman” and “girl” are similar to each other in a lot of places. The same goes for “man” and “boy”.
3. “boy” and “girl” also have places where they are similar to each other, but different from “woman” or “man”. Could these be coding for a vague conception of youth? Possibly.
4. All but the last word are words representing people. I added an object (water) to show the differences between categories. You can, for example, see that blue column going all the way down and stopping before the embedding for “water”.
5. There are clear places where “king” and “queen” are similar to each other and distinct from all the others. Could these be coding for a vague concept of royalty?
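To make these comparisons concrete, here is a minimal sketch (not from the slides) that loads pretrained GloVe vectors through gensim and checks a few of the similarities above with cosine similarity. The model name "glove-wiki-gigaword-50" and the word pairs are illustrative assumptions.

```python
# A minimal sketch: comparing word vectors via cosine similarity.
# Assumes gensim and the pretrained "glove-wiki-gigaword-50" model
# (roughly 400,000 vectors, 50 dimensions) are available via gensim.downloader.
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")

# "woman"/"girl" and "man"/"boy" should score higher than an unrelated pair.
for a, b in [("woman", "girl"), ("man", "boy"), ("king", "water")]:
    print(a, b, round(float(model.similarity(a, b)), 3))
```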
Example: Word Vector
A famous example that shows an incredible property of embeddings is the concept of analogies. We can add and subtract word embeddings and arrive at interesting results. The most famous example is the formula “king” - “man” + “woman”:
Example: Word Vector
The resulting vector from "king-man+woman" doesn't exactly equal "queen",
but "queen" is the closest word to it from the 400,000 word embeddings we
have in this collection.
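A minimal sketch of this analogy arithmetic, assuming the same pretrained GloVe vectors via gensim as above (the model name is an assumption, not necessarily the exact collection used in the slides):

```python
# "king" - "man" + "woman": the nearest remaining vector is expected to be "queen",
# although the result is only the closest word, not an exact match.
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```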
Application Example: Word Vector - Next Word Prediction
Word2Vec Training
Skipgram Approach
Word2Vec Training
Word2Vec Training - Negative Samples
Skipgram with Negative Sampling (SGNS)
Word2Vec Training Process
At the start of training, we initialize these matrices with random values. Then we begin the training process: in each training step, we take one positive example and its associated negative examples. Let’s take our first group:
Now we have four words: the input word “not” and the output/context words “thou” (the actual neighbour), “aaron”, and “taco” (the negative examples). We proceed to look up their embeddings: for the input word, we look in the Embedding matrix; for the context words, we look in the Context matrix (even though both matrices have an embedding for every word in our vocabulary).
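The following is a minimal numpy sketch of one such training step, not the original word2vec code; the tiny vocabulary, embedding size, and learning rate are illustrative assumptions.

```python
# One SGNS update: pull the positive (input, context) pair together,
# push the negative pairs apart.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"not": 0, "thou": 1, "aaron": 2, "taco": 3}   # toy vocabulary
dim, lr = 8, 0.05

embedding = rng.normal(scale=0.1, size=(len(vocab), dim))  # input-word matrix
context = rng.normal(scale=0.1, size=(len(vocab), dim))    # context-word matrix

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

inp = vocab["not"]                                       # input word
pairs = [("thou", 1.0), ("aaron", 0.0), ("taco", 0.0)]   # (context word, label)

for word, label in pairs:
    c = vocab[word]
    score = sigmoid(embedding[inp] @ context[c])  # predicted "real neighbour?" probability
    err = score - label                           # gradient of the log loss w.r.t. the score
    grad_ctx = err * embedding[inp]
    grad_inp = err * context[c]
    context[c] -= lr * grad_ctx                   # update context embedding
    embedding[inp] -= lr * grad_inp               # update input-word embedding
```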
Use of Neural Networks to Classify Texts using their Embeddings
Applying CNN to Word Vectors
A CNN is applied to the constituent word vectors to extract higher-level features.
https://medium.com/dair-ai/deep-learning-for-nlp-an-overview-of-recent-trends-d0d8f40a776d
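A minimal PyTorch sketch of this idea, assuming a 1-D convolution slides over the embedded token sequence to extract n-gram-like features, followed by max-pooling and a classifier; layer sizes and the vocabulary are illustrative, not taken from the cited article.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, num_classes = 10_000, 100, 2   # illustrative sizes

class TextCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, 64, kernel_size=3, padding=1)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, token_ids):               # (batch, seq_len)
        x = self.embed(token_ids)               # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                   # Conv1d expects (batch, dim, seq_len)
        x = torch.relu(self.conv(x))            # higher-level n-gram features
        x = x.max(dim=2).values                 # max-pool over time
        return self.fc(x)                       # class scores

logits = TextCNN()(torch.randint(0, vocab_size, (4, 12)))  # 4 sentences, 12 tokens each
```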
Applying CNN to Word2Vec Embeddings
Motivation: Need for Sequential Modeling
[Figure: A FF-net / CNN treats the two sentences the same; both sentence 1 and sentence 2 reduce to the same bag of words (falls, bear, market, into, territory) and cannot be distinguished.]
Recurrent Neural Networks
Goal
➢ model long-term dependencies
➢ connect previous information to the present task
➢ model sequences of events with loops, allowing information to persist
Feed-forward neural networks cannot take time dependencies into account.
Sequential data needs a feedback mechanism.
[Figure: A recurrent neural network adds a feedback mechanism (an internal state loop with shared weights Whh) to a FF-net / CNN. Unfolded in time, inputs x0 … xt-1, xt … xT produce outputs o0 … ot-1, ot … oT, with the same Whh connecting successive hidden states.]
Recurrent Neural Network (RNN)
Basic Operation: Recurrent Neural Networks
[Figure: An RNN tagging the input sequence "Pankaj lives in Munich" over time. One-hot input vectors feed the hidden layer through Wxh, hidden states are connected across time steps through Whh, and the output/softmax layer (through Who) yields the output labels person, other, other, location.]
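A minimal numpy sketch of this basic operation, with illustrative sizes (an untrained 3-unit hidden layer, one-hot inputs for the four words, and three labels), showing how the hidden state carries information forward in time:

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["Pankaj", "lives", "in", "Munich"]
X = np.eye(4)                                   # one-hot input vectors, one per word
Wxh = rng.normal(size=(3, 4))                   # input -> hidden
Whh = rng.normal(size=(3, 3))                   # hidden -> hidden (shared over time)
Who = rng.normal(size=(3, 3))                   # hidden -> output
labels = ["person", "location", "other"]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(3)                                 # initial hidden state
for word, x in zip(words, X):
    h = np.tanh(Wxh @ x + Whh @ h)              # hidden state carries the history
    probs = softmax(Who @ h)                    # label distribution at this step
    print(word, labels[int(probs.argmax())])    # untrained weights, so labels are arbitrary
```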
Motivation: Need for Sequential Modeling
Share features learned across different positions or time steps.
Example:
Sentence 1: "Market falls into bear territory" → Trading/Marketing language
Sentence 2: "Bear falls into market territory" → UNK
[Figure: A FF-net / CNN maps both sentences to the same "Trading" output, whereas a sequential model (RNN) exploits trading concepts, word ordering, and syntactic & semantic information to distinguish "market falls into bear territory" from "bear falls into market territory".]
Motivation: Need for Sequential Modeling
Machine Translation: different input and output sizes, involving sequential patterns.
[Figure: Encoder-decoder pairs translating "pankaj lebt in münchen" into the equivalent Hindi sentence ("Pankaj lives in Munich").]
Motivation: Need for Sequential Modeling
Convolutional vs Recurrent Neural Networks
RNN
- performs well when the input data is interdependent in a sequential pattern
- there is correlation between the previous input and the next input
- the current output is biased by previous outputs
CNN/FF-Nets
- all outputs are independent of each other
- feed-forward nets do not remember historic input data at test time, unlike recurrent networks
Long-Term and Short-Term Dependencies
Short-Term Dependencies
→ It is easy to predict "sky" given the nearby context, i.e., a short-term dependency.
RNN Disadvantages:
- Recurrent computation is slow.
- In practice, it is difficult to access information from many steps back. For instance, the effect of older and more distant inputs will eventually fade out: the problem of vanishing gradients!
Ref: https://medium.com/metaor-artificial-intelligence/the-exploding-and-vanishing-gradients-problem-in-time-series-6b87d558d22
Reference: https://d2l.ai/chapter_recurrent-modern/lstm.html
Integrating Sequential Processing in CNN Applications
LSTM
Example: Image to Text
Output Examples
Working
The proposed model is trained with a set of images and their corresponding sentence descriptions. It is assumed that the sentences written by people refer to a particular but unknown region of the image.
The first model aligns sentence snippets to visual image regions. Afterwards, the second, multimodal RNN is trained with the output of the first and learns how to generate sentences.
The CNN has to learn how to align visual and language data. Therefore, the net uses a method described by Girshick et al. to detect objects in every image with a CNN that is pre-trained on ImageNet.
This pre-trained network is very similar to VGGNet, with the only difference that the last two fully connected layers are cut. Karpathy and Fei-Fei propose a BRNN that is used to represent sentences.
Finally, after aligning the data, the output of the first model is fed to the multimodal Recurrent Neural Network. This network has a typical hidden layer of 512 neurons. It is shown in figure 6 that the input of the next recurrent layer is always the output of the layer before. The network is trained to combine a word with the previous hidden state in order to predict the next word.
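A simplified PyTorch sketch of this idea (an assumed illustration, not the paper's exact model): the CNN image features are projected into the hidden space and used to initialise a recurrent decoder that generates the caption word by word.

```python
import torch
import torch.nn as nn

feat_dim, hidden, vocab = 4096, 512, 10_000     # 512 hidden units as mentioned in the text

img_proj = nn.Linear(feat_dim, hidden)          # image features -> initial hidden state
embed = nn.Embedding(vocab, hidden)
rnn = nn.RNNCell(hidden, hidden)
to_vocab = nn.Linear(hidden, vocab)

img_feat = torch.randn(1, feat_dim)             # stand-in for CNN image features
h = torch.tanh(img_proj(img_feat))              # condition the decoder on the image
word = torch.tensor([0])                        # assumed start-of-sentence token id
for _ in range(5):                              # generate a few words greedily
    h = rnn(embed(word), h)
    word = to_vocab(h).argmax(dim=1)            # pick the most likely next word
```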
Integrating Sequential Processing in CNN: Example - Speech to Text
Chatbots
[Figure: Encoder-decoder pipeline for a chatbot; the encoder encodes the input text and the decoder generates the response.]
Encoder-Decoder using RNN/LSTM
The encoder processes the input sequence and encodes it into a fixed-length
representation, also known as a context vector or latent space representation.
Usually, the final hidden state of the network serves as the context vector, which
summarises the input information.
Once the model encodes the input sequence, the decoder takes over and generates an output sequence based on the encoded representation. The decoder usually uses a similar structure to the encoder; however, the hidden state of the decoder is initialized with the context vector from the encoder.
The decoder uses this initial hidden state to generate the first token of the
output sequence. It then generates subsequent tokens, conditioning its
predictions on both the previously generated tokens and the context vector. This
process continues until an end-of-sequence token is generated or a maximum
sequence length is reached.
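A minimal PyTorch sketch of this encoder-decoder loop, using LSTMs for both sides; the sizes, token ids, and greedy decoding are illustrative assumptions.

```python
import torch
import torch.nn as nn

vocab, dim = 1000, 64
embed = nn.Embedding(vocab, dim)
encoder = nn.LSTM(dim, dim, batch_first=True)
decoder = nn.LSTM(dim, dim, batch_first=True)
out = nn.Linear(dim, vocab)

src = torch.randint(0, vocab, (1, 6))              # input token ids
_, (h, c) = encoder(embed(src))                    # final state acts as the context vector

token = torch.tensor([[1]])                        # assumed start-of-sequence id
for _ in range(10):                                # decode up to a maximum length
    _, (h, c) = decoder(embed(token), (h, c))      # condition on context + generated history
    token = out(h[-1]).argmax(dim=-1, keepdim=True)
    if token.item() == 2:                          # assumed end-of-sequence id
        break
```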
NLP Based Chatbot
NLP Based Chatbot – More detailed view
https://bhashkarkunal.medium.com/conversational-ai-chatbot-using-deep-learning-how-bi-directional-lstm-machine-reading-38dc5cf5a5a3
NLP Based Chatbot using GAN
Therefore, people have started using the Transformer, which applies a special attention mechanism. The Transformer is a state-of-the-art model that is widely used in NLP and Computer Vision.
Transformer Architecture
Transformers use positional encoding, which encodes the positional information of the word vectors while processing the entire group of word vectors in parallel, as opposed to Seq2Seq models based on RNN/LSTM, which involve sequential processing.
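A minimal numpy sketch of the sinusoidal positional encoding from the original Transformer paper, which is simply added to the word vectors so that the same word at different positions gets a slightly different input representation (sequence length and model dimension are illustrative):

```python
import numpy as np

seq_len, d_model = 10, 16
pos = np.arange(seq_len)[:, None]                 # positions 0 .. seq_len-1
i = np.arange(0, d_model, 2)[None, :]             # even embedding indices
angles = pos / np.power(10000.0, i / d_model)

pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine

# x = word_vectors + pe   (added to the embeddings before the attention layers)
```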
The Attention Mechanism in Transformers
The word vectors generated through the Word2Vec method are trained over a large, generic data set. The resulting word vectors may not capture contexts that are very specific to a given text, and they remain the same irrespective of any change of context. For example, the word vector for the word ‘bark’ will remain the same across all its different usages once obtained through the Word2Vec method. The attention mechanism tries to fine-tune these word vectors to capture the immediate context and dependencies.
[Figure: Attention applied to the Word2Vec output vectors V1 … Vn, producing context-dependent vectors Y1 … Yn.]
In the above example, the dot product is computed between the word vector V1 (the query) and all other word vectors in the input text (the keys), and the results are combined to obtain the weighted word vector Y1. This transformed word vector Y1 captures the context of V1 with respect to the other word vectors (the keys). This is done for every word vector input to the attention unit. The operation produces modified word vectors which capture the similarity with the other vectors in the present context.
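A minimal numpy sketch of this operation: each word vector acts as a query, is compared with all word vectors (the keys) via dot products, and the softmax-weighted combination gives the context-aware vectors Y (the vectors here are random stand-ins for Word2Vec outputs).

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 8))                     # 5 word vectors of dimension 8

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = V @ V.T                                # query-key dot products
weights = softmax(scores)                       # one weight per word pair
Y = weights @ V                                 # Y[i] is the context-aware version of V[i]
```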
For each query and key pair, value vectors are obtained through a trained neural network model, which essentially involves multiplying the input vectors with a weight matrix obtained through training. The attention values are obtained by a weighted combination of all value vectors (corresponding to the query-key pairs), with the weights given by the similarity scores obtained for each input word vector. In this manner, the final attention vector captures the context of several different combinations of the surrounding key words.
Final scaling of the value vectors learnt for each key-query pair by the similarity score, followed by concatenation, produces the “Attention Vector”.
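A minimal numpy sketch of this fuller mechanism, where queries, keys and values are obtained by multiplying the inputs with weight matrices (random stand-ins here for matrices that would be learned during training):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                               # 5 input word vectors
Wq, Wk, Wv = [rng.normal(size=(8, 8)) for _ in range(3)]  # learned in practice

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Q, K, Vv = X @ Wq, X @ Wk, X @ Wv
weights = softmax(Q @ K.T / np.sqrt(K.shape[1]))   # scaled similarity scores
attention = weights @ Vv                           # context-dependent word vectors
```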
Transformers do not process the input tokens sequentially. Instead, they use self-attention mechanisms and parallel processing to handle input sequences.
The attention layers involve matrix multiplication of the input patterns with learnt weights, which transforms the input vectors into context-dependent vectors, accounting for the context created by nearby words.
https://towardsdatascience.com/all-you-need-to-know-about-attention-and-transformers-in-depth-understanding-part-1-552f0b41d021
https://towardsdatascience.com/attention-and-transformer-models-fe667f958378
https://machinelearningmastery.com/the-transformer-attention-mechanism/
Summary
ChatGPT vs BERT:
https://blog.invgate.com/gpt-3-vs-bert
https://yourstory.com/2023/12/homegrown-startups-developing-llms-that-understand-indic-languages