This document discusses pre-trained language models, focusing on BERT. It provides an overview of BERT, including its bidirectional Transformer architecture, its pre-training objectives of masked language modeling and next sentence prediction, and its fine-tuning procedure. Experiments show that BERT achieves state-of-the-art results on eleven NLP tasks by simply fine-tuning the pre-trained model. Ablation studies examine the effects of the pre-training tasks, model size, number of training steps, and a feature-based approach.

Deep Learning for Natural Language Processing
Lecture 4: Pre-trained Language Model

Quan Thanh Tho


Faculty of Computer Science and Engineering
Bach Khoa University
Agenda

• GPT-2 and what we have done
• Practical Situation: Huge Data, Few Labels
• Language Model: General Architecture
• Case Studies
• GPT, Transformer and BERT
4
Introduction

Common situation at “big-data-based” companies:

Cheap, large unlabeled datasets vs. expensive labeled data

6
Neural-based Milestones for NLP

10
Pre-trained Neural Language Model

11
Intuition

Sua pediasure giá bnhju vay
(a noisy user query, roughly: “How much does Pediasure milk cost?”; “bnhju” is texting shorthand for “bao nhiêu”, “how much”)

14
Intuition

Sua pediasure giá bnhju vay
(“How much does Pediasure milk cost?”)

--------------------
Cái này giá bao nhiêu vậy (“How much does this cost?”)
Giá nó bao nhiêu (“How much is it?”)
Bao nhiêu một hộp sữa vậy (“How much for a can of milk?”)
Giá hộp sữa bao nhiêu (“What is the price of the milk can?”)
Giá hộp này bnhju (“How much is this can?”)
Bnhju thì bán được (“What price would you sell it at?”)
15
Case Studies

Language model
Intent classification
NER tagging

16
BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding

Devlin et al., 2018 (Google AI Language)

NLP course

17
Outline

• Research context
• Main ideas
• BERT
• Experiments
• Conclusions

18
Research context
• Language model pre-training has been used to improve many NLP tasks
  • ELMo (Peters et al., 2018)
  • OpenAI GPT (Radford et al., 2018)
  • ULMFiT (Howard and Ruder, 2018)
• Two existing strategies for applying pre-trained language representations to downstream tasks
  • Feature-based: include pre-trained representations as additional features (e.g., ELMo)
  • Fine-tuning: introduce task-specific parameters and fine-tune the pre-trained parameters (e.g., OpenAI GPT, ULMFiT)

19
ELMo

20
ULMFiT

24
Limitations of current techniques

• Language models used in pre-training are unidirectional, which restricts the power of the pre-trained representations
  • OpenAI GPT uses a left-to-right architecture
  • ELMo concatenates separately trained forward and backward language models
• Solution: BERT (Bidirectional Encoder Representations from Transformers)

25
BERT: Bidirectional Encoder
Representations from Transformers
• Main ideas
  • Propose a new pre-training objective so that a deep bidirectional Transformer can be trained
  • The “masked language model” (MLM): the objective is to predict the original token of a masked word based only on its context
  • “Next sentence prediction” (NSP): predict whether the second sentence really follows the first
• Merits of BERT
  • Just fine-tune the pre-trained BERT model on a specific task to achieve state-of-the-art performance
  • BERT advances the state of the art for eleven NLP tasks

26
BERT: Bidirectional Encoder
Representations from Transformers

27
Model architecture

• BERT’s model architecture is a multi-layer bidirectional Transformer encoder
  • (Vaswani et al., 2017) “Attention is all you need”
• Two models with different sizes were investigated
  • BERT-BASE: L=12, H=768, A=12, total parameters = 110M
  • BERT-LARGE: L=24, H=1024, A=16, total parameters = 340M
  • (L: number of layers (Transformer blocks), H: hidden size, A: number of self-attention heads)

28
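To make the two sizes concrete, here is a minimal sketch (not from the slides or the paper) that expresses L, H, and A as configuration objects, assuming the Hugging Face transformers package is installed:

from transformers import BertConfig, BertModel

# BERT-BASE: L=12, H=768, A=12 (~110M parameters)
base_cfg = BertConfig(num_hidden_layers=12, hidden_size=768,
                      num_attention_heads=12, intermediate_size=3072)
# BERT-LARGE: L=24, H=1024, A=16 (~340M parameters)
large_cfg = BertConfig(num_hidden_layers=24, hidden_size=1024,
                       num_attention_heads=16, intermediate_size=4096)

model = BertModel(base_cfg)  # randomly initialized encoder with the BASE shape
print(sum(p.numel() for p in model.parameters()))  # roughly 110M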
Differences in pre-training model architectures:
BERT, OpenAI GPT, and ELMo

29
Transformer Encoders
Vaswani et al. (2017) Attention is all you need

• The Transformer is an attention-based architecture for NLP
• The Transformer is composed of two parts: an encoding component and a decoding component
• BERT is a multi-layer bidirectional Transformer encoder
30
Inside an Encoder Block

In the BERT experiments, the number of blocks N was chosen to be 12 and 24.
Blocks do not share weights with each other.

Source: https://medium.com/dissecting-bert/dissecting-bert-part-1-d3c3d495cdb3 33
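As a small illustration of the point that blocks do not share weights (a sketch using plain PyTorch modules, not the actual BERT code), each of the N blocks is an independent copy with its own parameters:

import copy
import torch.nn as nn

# one encoder block with BERT-BASE-like dimensions
block = nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072)
# N = 12 separate deep copies: each block has its own parameters
blocks = nn.ModuleList([copy.deepcopy(block) for _ in range(12)])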
Self-Attention

Image source: https://jalammar.github.io/illustrated-transformer/


37
Self-Attention in Detail
• Attention maps a query and a set of key-value pairs to an output
• The query, keys, values, and output are all vectors

38
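This mapping can be written in a few lines. The sketch below implements the scaled dot-product attention of Vaswani et al. (2017), Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, which is the core operation inside each attention head (the multi-head projections are omitted):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k) tensors produced by learned linear projections
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # similarity of each query to each key
    weights = F.softmax(scores, dim=-1)            # attention distribution per query
    return weights @ V                             # output: weighted sum of the values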
Input Representation

• Token embeddings: use pre-trained WordPiece embeddings
• Position embeddings: use learned position embeddings
  • Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2, 2017.
• Segment (sentence) embeddings: added to every token of each sentence
• Use the [CLS] token for classification tasks
• Separate sentences by using a special token [SEP]

52
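A minimal sketch of how the three embeddings are combined by summation (sizes follow BERT-BASE; the token ids below are only illustrative, not real WordPiece ids):

import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768  # BERT-BASE sizes
tok_emb = nn.Embedding(vocab_size, hidden)     # WordPiece token embeddings
pos_emb = nn.Embedding(max_len, hidden)        # learned position embeddings
seg_emb = nn.Embedding(2, hidden)              # sentence A / sentence B

# "[CLS] my dog is hairy [SEP]" as illustrative ids
token_ids   = torch.tensor([[101, 2026, 3899, 2003, 16218, 102]])
segment_ids = torch.zeros_like(token_ids)                      # all tokens belong to sentence A
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

inputs = tok_emb(token_ids) + pos_emb(positions) + seg_emb(segment_ids)  # (1, 6, 768)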
Task#1: Masked LM

• 15% of the tokens are masked at random, and the task is to predict the masked tokens based on their left and right context
• Not all selected tokens are masked in the same way (example sentence: “My dog is hairy”); see the code sketch after this slide
  • 80% are replaced by the [MASK] token: “My dog is [MASK]”
  • 10% are replaced by a random token: “My dog is apple”
  • 10% are left intact: “My dog is hairy”

53
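The code sketch referenced above: a standalone illustration of the 80/10/10 masking rule (not the authors' implementation; the tiny vocabulary is a placeholder):

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:      # select ~15% of positions
            labels[i] = tok                  # the model must recover this token
            r = random.random()
            if r < 0.8:                      # 80%: replace with [MASK]
                masked[i] = "[MASK]"
            elif r < 0.9:                    # 10%: replace with a random token
                masked[i] = random.choice(vocab)
            # remaining 10%: keep the original token unchanged
    return masked, labels

print(mask_tokens("my dog is hairy".split(), vocab=["apple", "book", "car"]))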
Pre-training procedure

• Training data: BooksCorpus (800M words) + English Wikipedia (2,500M words)
• To generate each training input sequence, sample two spans of text (A and B) from the corpus
  • The combined length is ≤ 512 tokens
  • 50% of the time B is the actual next sentence that follows A, and 50% of the time it is a random sentence from the corpus
• The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood

57
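The combined objective can be written directly as a sum of two cross-entropy terms. The sketch below illustrates that formula (it is not the original training code; tensor shapes are stated in the comments):

import torch.nn.functional as F

def pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels):
    # mlm_logits: (num_masked_positions, vocab_size); mlm_labels: (num_masked_positions,)
    # nsp_logits: (batch, 2);                         nsp_labels: (batch,)
    mlm_loss = F.cross_entropy(mlm_logits, mlm_labels)  # mean over masked positions
    nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)  # mean over sentence pairs
    return mlm_loss + nsp_loss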
Fine-tuning procedure

• For sequence-level classification tasks, the final hidden state of the [CLS] token is fed to a classification layer W (a minimal sketch follows this slide)
• All of the parameters of BERT and W are fine-tuned jointly
• Most model hyperparameters are the same as in pre-training, except the batch size, learning rate, and number of training epochs

59
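The minimal sketch referenced above, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint: the final hidden state C of [CLS] is fed to a single linear layer W, and the returned logits are C W^T (the paper applies a softmax on top):

import torch.nn as nn
from transformers import BertModel

class SequenceClassifier(nn.Module):  # illustrative, not the library's built-in head
    def __init__(self, num_labels):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)  # W

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        cls_state = outputs.last_hidden_state[:, 0]  # C: final hidden state of [CLS]
        return self.classifier(cls_state)            # logits; BERT and W train jointly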
Fine-tuning procedure

Single Sentence Tagging Tasks: CoNLL-2003 NER

61
Outline

• Research context
• Main ideas
• BERT
• Experiments
• Conclusions

65
Experiments
• GLUE (General Language Understanding Evaluation)
benchmark
• GLUE distributes canonical train, dev, and test splits
• Labels for the test set are not provided
• Datasets in GLUE:
• MNLI: Multi-Genre Natural Language Inference
• QQP: Quora Question Pairs
• QNLI: Question Natural Language Inference
• SST-2: Stanford Sentiment Treebank
• CoLA: The Corpus of Linguistic Acceptability
• STS-B: The Semantic Textual Similarity Benchmark
• MRPC: Microsoft Research Paraphrase Corpus
• RTE: Recognizing Textual Entailment
• WNLI: Winograd NLI
66
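As a practical aside (not part of the original slides), the GLUE tasks and their canonical splits can be loaded with the Hugging Face datasets package, assumed to be installed here; the test labels come back withheld, matching the point above:

from datasets import load_dataset

mrpc = load_dataset("glue", "mrpc")  # canonical train / validation / test splits
print(mrpc)                          # split sizes and column names
print(mrpc["test"][0])               # test labels are withheld (only a placeholder value is stored)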
Ablation Studies

• To understand
• Effect of Pre-training Tasks
• Effect of model sizes
• Effect of number of training steps
• Feature-based approach with BERT

72
Conclusions

• Unsupervised pre-training (pre-training a language model) is increasingly adopted in many NLP tasks
• The major contribution of the paper is to propose a deep bidirectional architecture built from the Transformer
• BERT advances the state of the art for many important NLP tasks

75
Links
• TensorFlow code and pre-trained models for BERT: https://github.com/google-research/bert
• PyTorch Pretrained BERT: https://github.com/huggingface/pytorch-pretrained-BERT
• BERT-pytorch: https://github.com/codertimo/BERT-pytorch
• BERT-keras: https://github.com/Separius/BERT-keras

76
Remark: Applying BERT to non-English languages
• Pre-trained BERT models are provided for more than 100 languages (including Vietnamese)
  • https://github.com/google-research/bert/blob/master/multilingual.md
• Be careful with tokenization!
• For Japanese (and Chinese): “spaces were added around every character in the CJK Unicode range before applying WordPiece” => not a good approach
  • Use SentencePiece instead: https://github.com/google/sentencepiece
• We may need to pre-train our own BERT model

77
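A minimal sketch of the SentencePiece route suggested on this slide, assuming the sentencepiece Python package; the corpus file name, model prefix, and vocabulary size are placeholders:

import sentencepiece as spm

# train a subword model on a raw (e.g., Vietnamese) text corpus
spm.SentencePieceTrainer.train(
    input="vi_corpus.txt", model_prefix="vi_sp", vocab_size=32000)

# load the trained model and segment a sentence into subword pieces
sp = spm.SentencePieceProcessor(model_file="vi_sp.model")
print(sp.encode("Sữa Pediasure giá bao nhiêu vậy", out_type=str))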
References

1. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
2. Vaswani et al. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762. https://arxiv.org/abs/1706.03762
3. The Annotated Transformer: http://nlp.seas.harvard.edu/2018/04/03/attention.html, by harvardnlp.
4. The Illustrated Transformer: http://jalammar.github.io/illustrated-transformer/
5. ELMo explained: https://www.mihaileric.com/posts/deep-contextualized-word-representations-elmo/
6. ULMFiT explained: https://yashuseth.blog/2018/06/17/understanding-universal-language-model-fine-tuning-ulmfit/
7. Dissecting BERT: https://medium.com/dissecting-bert
8. The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning): http://jalammar.github.io/illustrated-bert/

78
