
Chapter 24. Deep Learning for Natural Language Processing
TABLE OF CONTENTS

1. Word embedding
2. RNN for NLP
3. Sequence-to-sequence Models
4. The Transformer Architecture
5. Pretraining and Transfer Learning
6. Summary
1. Word embedding
• If we want to plug words into a Neural Network, or some other machine learning
algorithm, we need a way to turn the words into numbers
• Word embedding is a technique used in NLP and ML to represent words as numerical
vectors

“She is beautiful”
She 1
Is 2
Beautiful 3



1. Word embedding

“She is beautiful”            “She is pretty”

She        1                  She        1
Is         2                  Is         2
Beautiful  3                  Pretty     4

“Beautiful” and “Pretty” mean similar things, but they are assigned completely different numbers
-> The neural network will need a lot more complexity and training

-> It would be nice if similar words that are used in similar ways could be given similar numbers,
so that learning how to use one word also helps with learning how to use the other



1. Word embedding

• Word embeddings are learned automatically from the data


• The feature space has the property that similar words end up having similar vectors
• These numerical representations capture semantic relationships between words:
words with similar meanings will have similar vector representations
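
A minimal sketch of this idea, using made-up toy vectors (not taken from any real pretrained model):

```python
import numpy as np

# Toy 4-dimensional embedding vectors, invented purely for illustration.
embeddings = {
    "beautiful": np.array([0.81, 0.10, 0.43, 0.55]),
    "pretty":    np.array([0.78, 0.14, 0.40, 0.58]),
    "pizza":     np.array([0.02, 0.91, 0.35, 0.07]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 for similar words."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["beautiful"], embeddings["pretty"]))  # high
print(cosine_similarity(embeddings["beautiful"], embeddings["pizza"]))   # low
```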



2. Recurrent Neural Networks for NLP
2.1. Language model with RNNs
• In an RNN language model, each input word is encoded as a word embedding vector
• RNN training involves predicting the next word given the previous words and updating the weights through
backpropagation
• RNNs can generate text by sampling from the output distribution, or simply by choosing the most likely word

[Figure: an RNN language model unrolled over time, with hidden-state vectors z1, z2, ... at successive timesteps]

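
A minimal sketch of such an RNN language model in PyTorch; the vocabulary size, layer sizes, and class names are illustrative choices:

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)     # word ID -> embedding vector
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)         # hidden state -> next-word scores

    def forward(self, word_ids):
        x = self.embed(word_ids)          # (batch, seq_len, embed_dim)
        z, _ = self.rnn(x)                # hidden states z_t for every timestep
        return self.out(z)                # logits over the vocabulary at each position

# Training step: predict each next word from the previous words.
vocab_size = 10
model = RNNLanguageModel(vocab_size)
tokens = torch.randint(0, vocab_size, (2, 6))             # 2 toy sequences of 6 word IDs
logits = model(tokens[:, :-1])                             # inputs: all but the last word
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))  # targets: the next words
loss.backward()                                            # backpropagation updates the weights
```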


2. Recurrent Neural Networks for NLP
2.2. Classification with recurrent neural networks
• For classification tasks, RNNs require labeled data
• To capture the context on the right, we can use a bidirectional RNN, which concatenates a separate right-to-left
model onto the left-to-right model
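
A sketch of a bidirectional RNN text classifier in PyTorch (here using an LSTM); the pooling choice and layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class BiRNNClassifier(nn.Module):
    """Left-to-right and right-to-left LSTMs run in parallel;
    their hidden states are concatenated at every position."""
    def __init__(self, vocab_size, num_classes, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classify = nn.Linear(2 * hidden_dim, num_classes)  # 2x: both directions

    def forward(self, word_ids):
        x = self.embed(word_ids)
        states, _ = self.rnn(x)             # (batch, seq_len, 2 * hidden_dim)
        sentence_repr = states.mean(dim=1)  # simple pooling over timesteps
        return self.classify(sentence_repr) # class logits

logits = BiRNNClassifier(vocab_size=100, num_classes=3)(torch.randint(0, 100, (4, 7)))
print(logits.shape)  # torch.Size([4, 3])
```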



2. Recurrent Neural Networks for NLP
2.3. LSTMs for NLP tasks
Exploding/vanishing gradient problem

---> Long Short-Term Memory (LSTM)

• An LSTM is a kind of RNN that can choose to remember some parts of the input, copying them over to the next timestep, and
forget other parts.
• Unlike traditional RNNs, LSTMs use gating units to selectively retain or forget information over timesteps, enabling
them to better preserve relevant information for NLP tasks.
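
A small sketch of the two memories an LSTM carries across timesteps, using PyTorch's LSTMCell; the sizes and toy inputs are illustrative:

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=8, hidden_size=16)

h = torch.zeros(1, 16)   # short-term memory (hidden state)
c = torch.zeros(1, 16)   # long-term memory (cell state)

inputs = torch.randn(5, 1, 8)   # 5 timesteps of toy embedding vectors
for x_t in inputs:
    # The gates inside the cell decide what to write into c (remember),
    # what to erase from c (forget), and what to expose in h.
    h, c = cell(x_t, (h, c))

print(h.shape, c.shape)  # both torch.Size([1, 16])
```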



2. Recurrent Neural Networks for NLP
2.3. LSTMs for NLP tasks
[Figure: an LSTM cell, separating long-term memories (the cell state) from short-term memories (the hidden state)]



3. Sequence-to-sequence Models
• One of the most studied tasks in NLP is machine translation (MT)

Source language → Target language

• The model most commonly used for machine translation is called a sequence-to-sequence model
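
A minimal sequence-to-sequence sketch in PyTorch; the choice of GRUs, the layer sizes, and the class names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """The encoder reads the source sentence into a fixed vector; the decoder
    generates the target sentence conditioned on that vector."""
    def __init__(self, src_vocab, tgt_vocab, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, context = self.encoder(self.src_embed(src_ids))    # final encoder state
        dec_states, _ = self.decoder(self.tgt_embed(tgt_ids), context)
        return self.out(dec_states)                            # next-word logits

model = Seq2Seq(src_vocab=50, tgt_vocab=60)
logits = model(torch.randint(0, 50, (2, 9)), torch.randint(0, 60, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 60])
```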



3. Sequence-to-sequence Models
3.1. Attention
“Don’t eat the delicious looking and smelling pizza”

If the model forgets the early word “Don’t” by the time it decodes, the meaning is reversed:

“Eat the delicious looking and smelling pizza”



3. Sequence-to-sequence Models
3.1. Attention
• The main idea of attention is to add a bunch of new paths from the encoder to the decoder, one per input
value, so that each step of the decoder can directly access the input values



3. Sequence-to-sequence Models
3.1. Attention

• First, the attention component itself has no learned weights and supports variable-length
sequences on both the source and target sides
• Second, attention is entirely latent
• Attention can also be combined with multilayer RNNs
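
A small numpy sketch of this kind of attention: no learned weights, and it works for any source length. The decoder state and encoder states are random toy vectors:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Toy values: 4 encoder hidden states and 1 current decoder state (dimension 3).
encoder_states = np.random.randn(4, 3)
decoder_state = np.random.randn(3)

# Dot-product attention: score every input position against the decoder state.
scores = encoder_states @ decoder_state          # one score per input position
weights = softmax(scores)                        # attention distribution over inputs
context = weights @ encoder_states               # weighted sum fed to this decoder step

print(weights.round(2), context.shape)           # weights sum to 1, context is (3,)
```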



3. Sequence-to-sequence Models
3.2. Decoding
• Decoding is the procedure in which we generate the target sentence one word at a time and then feed
the generated word back in as input at the next timestep
• To improve decoding, beam search is often used
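
A sketch of beam search over a hypothetical next_word_logprobs(prefix) scoring function; the function name and the toy probability table are made up for illustration:

```python
def beam_search(next_word_logprobs, start, end, beam_size=3, max_len=10):
    """Keep the `beam_size` highest-scoring partial sequences at every step.

    `next_word_logprobs(prefix)` is assumed to return a dict mapping each
    candidate next word to its log-probability given the prefix.
    """
    beams = [([start], 0.0)]                      # (sequence, total log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:                    # finished hypotheses stay as-is
                candidates.append((seq, score))
                continue
            for word, logp in next_word_logprobs(seq).items():
                candidates.append((seq + [word], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == end for seq, _ in beams):
            break
    return beams[0][0]

# Toy usage with a hard-coded next-word distribution (purely illustrative).
table = {"<s>": {"the": -0.2, "a": -1.8}, "the": {"cat": -0.5, "</s>": -1.0},
         "a": {"cat": -0.7, "</s>": -1.2}, "cat": {"</s>": -0.1}}
print(beam_search(lambda seq: table[seq[-1]], "<s>", "</s>"))
```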



4. The Transformer Architecture
4.1. Self-attention
• Self-attention in sequence-to-sequence models allows each hidden state sequence to attend
to itself, capturing both nearby and long-distance context.
• The basic method of self-attention computes the attention matrix directly from the dot
product of input vectors, leading to a bias towards attending to oneself.
-> To address this, transformers project each input vector x_i into three different representations using
separate weight matrices:
- The query vector: q_i = W_q x_i
- The key vector: k_i = W_k x_i
- The value vector: v_i = W_v x_i
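
A numpy sketch of self-attention with these three projections; the dimensions and the randomly initialized weight matrices are illustrative:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 5, 8                      # 5 input word vectors of dimension 8
X = rng.normal(size=(seq_len, d))

# Three separate projection matrices (random here; learned in a real model).
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v    # query, key, value vectors for every position

# Each position attends to every position; scaling by sqrt(d) keeps scores moderate.
attention_weights = softmax(Q @ K.T / np.sqrt(d))   # (seq_len, seq_len), rows sum to 1
output = attention_weights @ V                       # contextualized representations

print(attention_weights.shape, output.shape)         # (5, 5) (5, 8)
```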



4. The Transformer Architecture
4.2. From self-attention to transformer
• The transformer model comprises multiple layers, each containing sub-layers
• Self-attention is applied first in each layer, followed by feedforward layers with ReLU activation
• To mitigate vanishing gradients, residual connections are employed
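
A sketch of one such layer in PyTorch, using the built-in MultiheadAttention module; real transformer layers usually also apply layer normalization, which this sketch omits for brevity:

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One layer: a self-attention sublayer, then a ReLU feedforward sublayer,
    each wrapped in a residual (skip) connection."""
    def __init__(self, d_model=64, num_heads=4, d_ff=256):
        super().__init__()
        self.self_attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.feedforward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        attended, _ = self.self_attention(x, x, x)   # queries, keys, values all from x
        x = x + attended                             # residual connection
        x = x + self.feedforward(x)                  # residual connection
        return x

x = torch.randn(2, 10, 64)             # batch of 2 sequences, 10 positions, d_model=64
layers = nn.Sequential(*[TransformerLayer() for _ in range(3)])  # stack of layers
print(layers(x).shape)                 # torch.Size([2, 10, 64])
```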



5. Pretraining and Transfer Learning
5.1. Pretrained word embeddings
• Word embedding algorithms: Word2vec, GloVe, FastText, ELMo, BERT

GloVe (Global Vectors) model
• Derives semantic relationships between words using a word-word co-occurrence matrix



5. Pretraining and Transfer Learning
5.1. Pretrained word embeddings
Corpus: “I love cats”, “I love you”        Window = 1
Vocabulary: I, love, cats, you

Word-word co-occurrence matrix:

          I    love   cats   you
I         0    2      0      0
love      2    0      1      1
cats      0    1      0      0
you       0    1      0      0
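
A small sketch that builds this window-1 co-occurrence matrix from the two toy sentences:

```python
import numpy as np

sentences = [["I", "love", "cats"], ["I", "love", "you"]]
vocab = ["I", "love", "cats", "you"]
index = {word: i for i, word in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sentence in sentences:
    for pos, word in enumerate(sentence):
        # Count every neighbor within `window` positions on either side.
        for offset in range(-window, window + 1):
            neighbor_pos = pos + offset
            if offset != 0 and 0 <= neighbor_pos < len(sentence):
                X[index[word], index[sentence[neighbor_pos]]] += 1

print(X)
# [[0 2 0 0]
#  [2 0 1 1]
#  [0 1 0 0]
#  [0 1 0 0]]
```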



5. Pretraining and Transfer Learning
5.1. Pretrained word embeddings

Word-word co-occurrence matrix (window = 1):

          I    love   cats   you
I         0    2      0      0
love      2    0      1      1
cats      0    1      0      0
you       0    1      0      0

The probability that word j appears in the context of word i:

P_ij = X_ij / X_i,   where X_i = Σ_j X_ij

Example: P(I | love) = X_love,I / X_love = 2 / (2 + 0 + 1 + 1) = 0.5
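
Continuing the example, a small sketch that turns the count matrix into the probabilities P_ij by normalizing each row:

```python
import numpy as np

vocab = ["I", "love", "cats", "you"]
X = np.array([[0, 2, 0, 0],    # co-occurrence counts from the matrix above
              [2, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]])

X_i = X.sum(axis=1, keepdims=True)   # X_i = sum_j X_ij (row totals)
P = X / X_i                          # P_ij = X_ij / X_i

i, j = vocab.index("love"), vocab.index("I")
print(P[i, j])   # P(I | love) = 2 / 4 = 0.5
```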



5. Pretraining and Transfer Learning
5.2. Pretrained contextual representation
• A contextual representation maps both a word and its surrounding context of words into a word embedding
vector



5. Pretraining and Transfer Learning
5.3. Masked language models
• A masked language model (MLM) is trained by masking (hiding) individual words in the input and asking
the model to predict the masked words.
• For this task, one can use a deep bidirectional RNN or a transformer on top of the masked sentence.
• For example, given the input sentence “The river rose five feet”, we can mask the middle word to get
“The river [MASK] five feet” and ask the model to fill in the blank

[Figure: stacked transformer layers read “The river [MASK] five feet” and predict “rose” at the masked position]
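
As an aside, a pretrained masked language model can be queried with the fill-mask pipeline of the Hugging Face transformers library (assuming the library is installed and the bert-base-uncased checkpoint can be downloaded):

```python
# Requires: pip install transformers torch
from transformers import pipeline

# Load a pretrained masked language model (BERT) as a fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Ask the model to predict the masked word.
for prediction in fill_mask("The river [MASK] five feet."):
    print(prediction["token_str"], round(prediction["score"], 3))
# Plausible completions such as "rose" should rank highly.
```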



6. Summary

This chapter emphasizes:

1. Word embeddings provide robust, continuous representations of words, pretrained on unlabeled text data.
2. Recurrent neural networks (RNNs) excel at capturing local and long-distance context.
3. Sequence-to-sequence models are valuable for machine translation and text generation.
4. Transformers, with self-attention, effectively model both local and long-range context, and make efficient use of hardware matrix multiplication.
5. Transfer learning, leveraging pretrained contextual word embeddings, enables versatile model development.



Thank you for listening
