This document discusses pre-trained language models, focusing on BERT. It provides an overview of BERT, including its bidirectional Transformer architecture, its pre-training objectives of masked language modeling and next sentence prediction, and its fine-tuning procedure. Experiments show that BERT achieves state-of-the-art results on eleven NLP tasks by simply fine-tuning the pre-trained model. Ablation studies examine the effects of the pre-training tasks, model size, number of training steps, and a feature-based approach.

Deep Learning for Natural Language Processing
Lecture 4: Pre-trained Language Model

Quan Thanh Tho


Faculty of Computer Science and Engineering
Bach Khoa University
Agenda

• GPT-2 and what we have done
• Practical Situation: Huge Data, Few Labels
• Language Model: General Architecture
• Case Studies
• GPT, Transformer and BERT
4
Introduction

Common situation at “big-data-based” companies:

Cheap, large unlabeled datasets vs. expensive labeled data

6
Neural-based Milestones for NLP

10
Pre-trained Neural Language Model

11
Intuition

Sua pediasure giá bnhju vay
(a noisy user query, roughly: “How much does Pediasure milk cost?”; “bnhju” is texting shorthand for “bao nhiêu”, “how much”)

14
Intuition

Sua pediasure giá bnhju vay
(“How much does Pediasure milk cost?”)

--------------------
Cái này giá bao nhiêu vậy (“How much does this cost?”)
Giá nó bao nhiêu (“How much is it?”)
Bao nhiêu một hộp sữa vậy (“How much for a can of milk?”)
Giá hộp sữa bao nhiêu (“What is the price of the milk can?”)
Giá hộp này bnhju (“How much is this can?”)
Bnhju thì bán được (“What price would you sell it at?”)
15
Case Studies

Language model
Intent classification
NER tagging

16
BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding

Devlin et al., 2018 (Google AI Language)

NLP course

17
Outline

• Research context
• Main ideas
• BERT
• Experiments
• Conclusions

18
Research context
• Language model pre-training has been used to improve many NLP tasks
  • ELMo (Peters et al., 2018)
  • OpenAI GPT (Radford et al., 2018)
  • ULMFiT (Howard and Ruder, 2018)
• Two existing strategies for applying pre-trained language representations to downstream tasks
  • Feature-based: include pre-trained representations as additional features (e.g., ELMo)
  • Fine-tuning: introduce task-specific parameters and fine-tune the pre-trained parameters (e.g., OpenAI GPT, ULMFiT)

19
ELMo

20
ULMFiT

24
Limitations of current techniques

• Language models used in pre-training are unidirectional, which restricts the power of the pre-trained representations
  • OpenAI GPT uses a left-to-right architecture
  • ELMo concatenates separately trained forward and backward language models
• Solution: BERT (Bidirectional Encoder Representations from Transformers)

25
BERT: Bidirectional Encoder
Representations from Transformers
• Main ideas
  • Propose a new pre-training objective so that a deep bidirectional Transformer can be trained
  • The “masked language model” (MLM): the objective is to predict the original token of a masked word based only on its context
  • “Next sentence prediction” (NSP): predict whether the second sentence really follows the first
• Merits of BERT
  • Just fine-tune the pre-trained BERT model on a specific task to achieve state-of-the-art performance
  • BERT advances the state of the art for eleven NLP tasks

26
BERT: Bidirectional Encoder
Representations from Transformers

27
Model architecture

• BERT’s model architecture is a multi-layer bidirectional Transformer encoder
  • (Vaswani et al., 2017) “Attention is all you need”
• Two models with different sizes were investigated
  • BERT-BASE: L=12, H=768, A=12, total parameters = 110M
  • BERT-LARGE: L=24, H=1024, A=16, total parameters = 340M
  • (L: number of layers (Transformer blocks), H: hidden size, A: number of self-attention heads)

28
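To make the two sizes concrete, here is a minimal sketch (not from the slides or the paper) that expresses L, H, and A as configuration objects, assuming the Hugging Face transformers package is installed:

from transformers import BertConfig, BertModel

# BERT-BASE: L=12, H=768, A=12 (~110M parameters)
base_cfg = BertConfig(num_hidden_layers=12, hidden_size=768,
                      num_attention_heads=12, intermediate_size=3072)
# BERT-LARGE: L=24, H=1024, A=16 (~340M parameters)
large_cfg = BertConfig(num_hidden_layers=24, hidden_size=1024,
                       num_attention_heads=16, intermediate_size=4096)

model = BertModel(base_cfg)  # randomly initialized encoder with the BASE shape
print(sum(p.numel() for p in model.parameters()))  # roughly 110M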
Differences in pre-training model architectures:
BERT, OpenAI GPT, and ELMo

29
Transformer Encoders
Vaswani et al. (2017) Attention is all you need

• The Transformer is an attention-based architecture for NLP
• The Transformer is composed of two parts: an encoding component and a decoding component
• BERT is a multi-layer bidirectional Transformer encoder
30
Inside an Encoder Block

In the BERT experiments, the number of blocks N was chosen to be 12 and 24.
Blocks do not share weights with each other.

Source: https://medium.com/dissecting-bert/dissecting-bert-part-1-d3c3d495cdb3 33
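As a small illustration of the point that blocks do not share weights (a sketch using plain PyTorch modules, not the actual BERT code), each of the N blocks is an independent copy with its own parameters:

import copy
import torch.nn as nn

# one encoder block with BERT-BASE-like dimensions
block = nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072)
# N = 12 separate deep copies: each block has its own parameters
blocks = nn.ModuleList([copy.deepcopy(block) for _ in range(12)])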
Self-Attention

Image source: https://jalammar.github.io/illustrated-transformer/


37
Self-Attention in Detail
• Attention maps a query and a set of key-value pairs to an output
• The query, keys, values, and output are all vectors

38
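This mapping can be written in a few lines. The sketch below implements the scaled dot-product attention of Vaswani et al. (2017), Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, which is the core operation inside each attention head (the multi-head projections are omitted):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k) tensors produced by learned linear projections
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # similarity of each query to each key
    weights = F.softmax(scores, dim=-1)            # attention distribution per query
    return weights @ V                             # output: weighted sum of the values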
Input Representation

• Token embeddings: use pre-trained WordPiece embeddings
• Position embeddings: use learned position embeddings
  • Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2, 2017.
• Segment (sentence) embeddings: added to every token of each sentence
• Use the [CLS] token for classification tasks
• Separate sentences by using a special token [SEP]

52
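A minimal sketch of how the three embeddings are combined by summation (sizes follow BERT-BASE; the token ids below are only illustrative, not real WordPiece ids):

import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768  # BERT-BASE sizes
tok_emb = nn.Embedding(vocab_size, hidden)     # WordPiece token embeddings
pos_emb = nn.Embedding(max_len, hidden)        # learned position embeddings
seg_emb = nn.Embedding(2, hidden)              # sentence A / sentence B

# "[CLS] my dog is hairy [SEP]" as illustrative ids
token_ids   = torch.tensor([[101, 2026, 3899, 2003, 16218, 102]])
segment_ids = torch.zeros_like(token_ids)                      # all tokens belong to sentence A
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

inputs = tok_emb(token_ids) + pos_emb(positions) + seg_emb(segment_ids)  # (1, 6, 768)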
Task#1: Masked LM

• 15% of the tokens are masked at random, and the task is to predict the masked tokens based on their left and right context
• Not all selected tokens are masked in the same way (example sentence: “My dog is hairy”); see the code sketch after this slide
  • 80% are replaced by the [MASK] token: “My dog is [MASK]”
  • 10% are replaced by a random token: “My dog is apple”
  • 10% are left intact: “My dog is hairy”

53
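The code sketch referenced above: a standalone illustration of the 80/10/10 masking rule (not the authors' implementation; the tiny vocabulary is a placeholder):

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:      # select ~15% of positions
            labels[i] = tok                  # the model must recover this token
            r = random.random()
            if r < 0.8:                      # 80%: replace with [MASK]
                masked[i] = "[MASK]"
            elif r < 0.9:                    # 10%: replace with a random token
                masked[i] = random.choice(vocab)
            # remaining 10%: keep the original token unchanged
    return masked, labels

print(mask_tokens("my dog is hairy".split(), vocab=["apple", "book", "car"]))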
Pre-training procedure

• Training data: BooksCorpus (800M words) + English Wikipedia (2,500M words)
• To generate each training input sequence, sample two spans of text (A and B) from the corpus
  • The combined length is ≤ 512 tokens
  • 50% of the time B is the actual next sentence that follows A, and 50% of the time it is a random sentence from the corpus
• The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood

57
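The combined objective can be written directly as a sum of two cross-entropy terms. The sketch below illustrates that formula (it is not the original training code; tensor shapes are stated in the comments):

import torch.nn.functional as F

def pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels):
    # mlm_logits: (num_masked_positions, vocab_size); mlm_labels: (num_masked_positions,)
    # nsp_logits: (batch, 2);                         nsp_labels: (batch,)
    mlm_loss = F.cross_entropy(mlm_logits, mlm_labels)  # mean over masked positions
    nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)  # mean over sentence pairs
    return mlm_loss + nsp_loss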
Fine-tuning procedure

• For sequence-level classification tasks, the final hidden state of the [CLS] token is fed to a classification layer W (a minimal sketch follows this slide)
• All of the parameters of BERT and W are fine-tuned jointly
• Most model hyperparameters are the same as in pre-training, except the batch size, learning rate, and number of training epochs

59
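The minimal sketch referenced above, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint: the final hidden state C of [CLS] is fed to a single linear layer W, and the returned logits are C W^T (the paper applies a softmax on top):

import torch.nn as nn
from transformers import BertModel

class SequenceClassifier(nn.Module):  # illustrative, not the library's built-in head
    def __init__(self, num_labels):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)  # W

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        cls_state = outputs.last_hidden_state[:, 0]  # C: final hidden state of [CLS]
        return self.classifier(cls_state)            # logits; BERT and W train jointly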
Fine-tuning procedure

Single Sentence Tagging Tasks: CoNLL-2003 NER

61
Outline

• Research context
• Main ideas
• BERT
• Experiments
• Conclusions

65
Experiments
• GLUE (General Language Understanding Evaluation)
benchmark
• GLUE distributes canonical train, dev, and test splits
• Labels for the test set are not provided
• Datasets in GLUE:
• MNLI: Multi-Genre Natural Language Inference
• QQP: Quora Question Pairs
• QNLI: Question Natural Language Inference
• SST-2: Stanford Sentiment Treebank
• CoLA: The Corpus of Linguistic Acceptability
• STS-B: The Semantic Textual Similarity Benchmark
• MRPC: Microsoft Research Paraphrase Corpus
• RTE: Recognizing Textual Entailment
• WNLI: Winograd NLI
66
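As a practical aside (not part of the original slides), the GLUE tasks and their canonical splits can be loaded with the Hugging Face datasets package, assumed to be installed here; the test labels come back withheld, matching the point above:

from datasets import load_dataset

mrpc = load_dataset("glue", "mrpc")  # canonical train / validation / test splits
print(mrpc)                          # split sizes and column names
print(mrpc["test"][0])               # test labels are withheld (only a placeholder value is stored)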
Ablation Studies

• To understand
• Effect of Pre-training Tasks
• Effect of model sizes
• Effect of number of training steps
• Feature-based approach with BERT

72
Conclusions

• Unsupervised pre-training (pre-training a language model) is increasingly adopted in many NLP tasks
• The major contribution of the paper is to propose a deep bidirectional architecture built from the Transformer
• BERT advances the state of the art for many important NLP tasks

75
Links
• TensorFlow code and pre-trained models for BERT: https://github.com/google-research/bert
• PyTorch Pretrained BERT: https://github.com/huggingface/pytorch-pretrained-BERT
• BERT-pytorch: https://github.com/codertimo/BERT-pytorch
• BERT-keras: https://github.com/Separius/BERT-keras

76
Remark: Applying BERT to non-English languages
• Pre-trained BERT models are provided for more than 100 languages (including Vietnamese)
  • https://github.com/google-research/bert/blob/master/multilingual.md
• Be careful with tokenization!
• For Japanese (and Chinese): “spaces were added around every character in the CJK Unicode range before applying WordPiece” => not a good approach
  • Use SentencePiece instead: https://github.com/google/sentencepiece
• We may need to pre-train our own BERT model

77
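A minimal sketch of the SentencePiece route suggested on this slide, assuming the sentencepiece Python package; the corpus file name, model prefix, and vocabulary size are placeholders:

import sentencepiece as spm

# train a subword model on a raw (e.g., Vietnamese) text corpus
spm.SentencePieceTrainer.train(
    input="vi_corpus.txt", model_prefix="vi_sp", vocab_size=32000)

# load the trained model and segment a sentence into subword pieces
sp = spm.SentencePieceProcessor(model_file="vi_sp.model")
print(sp.encode("Sữa Pediasure giá bao nhiêu vậy", out_type=str))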
References

1. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
2. Vaswani et al. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762. https://arxiv.org/abs/1706.03762
3. The Annotated Transformer: http://nlp.seas.harvard.edu/2018/04/03/attention.html, by harvardnlp.
4. The Illustrated Transformer: http://jalammar.github.io/illustrated-transformer/
5. ELMo explained: https://www.mihaileric.com/posts/deep-contextualized-word-representations-elmo/
6. ULMFiT explained: https://yashuseth.blog/2018/06/17/understanding-universal-language-model-fine-tuning-ulmfit/
7. Dissecting BERT: https://medium.com/dissecting-bert
8. The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning): http://jalammar.github.io/illustrated-bert/

78
