Deep Learning for Natural Language Processing
Lecture 4: Pre-trained Language Models
Neural-based Milestones for NLP
Pre-trained Neural Language Model
Intuition
Case Studies
• Language model
• Intent classification
• NER tagging
BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding
NLP course
Outline
• Research context
• Main ideas
• BERT
• Experiments
• Conclusions
Research context
• Language model pre-training has been used to improve many NLP
tasks
• ELMo (Peters et al., 2018)
• OpenAI GPT (Radford et al., 2018)
• ULMFiT (Howard and Ruder, 2018)
• Two existing strategies for applying pre-trained language
representations to downstream tasks
• Feature-based: include pre-trained representations as additional
features (e.g., ELMo)
• Fine-tuning: introduce task-specific parameters and fine-tune
the pre-trained parameters (e.g., OpenAI GPT, ULMFiT)
ELMo
ULMFiT
Limitations of current techniques
BERT: Bidirectional Encoder
Representations from Transformers
• Main ideas
• Propose a new pre-training objective so that a deep bidirectional
Transformer can be trained
• The “masked language model” (MLM): predict the original identity of a
masked word based only on its left and right context (a minimal masking
sketch follows this list)
• “Next sentence prediction” (NSP): predict whether the second sentence
actually follows the first
• Merits of BERT
• Simply fine-tune the pre-trained BERT model for a specific task to
achieve state-of-the-art performance
• BERT advances the state of the art for eleven NLP tasks
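To make the MLM objective concrete, here is a minimal Python sketch of the masking step. It is not the authors' code: it assumes the 15% selection rate and the 80%/10%/10% mask/random/keep split reported in the BERT paper, and uses a toy whitespace-tokenized sentence in place of WordPiece.

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    # BERT's MLM objective: select ~15% of positions; replace the token with
    # [MASK] 80% of the time, a random token 10% of the time, and keep the
    # original token 10% of the time. The model is trained to predict the
    # original token at every selected position.
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            targets.append(tok)                      # position to predict
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")
            elif r < 0.9:
                inputs.append(random.choice(vocab))  # random replacement
            else:
                inputs.append(tok)                   # keep the original token
        else:
            inputs.append(tok)
            targets.append(None)                     # no loss at this position
    return inputs, targets

print(mask_tokens("my dog is hairy".split(), vocab=["cat", "apple", "runs", "blue"]))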
Model architecture
Differences in pre-training model architectures:
BERT, OpenAI GPT, and ELMo
Transformer Encoders
Vaswani et al. (2017) Attention is all you need
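The core operation of the encoder defined by Vaswani et al. (2017) is scaled dot-product attention, given in that paper as

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

where Q, K, and V are the query, key, and value matrices and d_k is the key dimension.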
Inside an Encoder Block
Source: https://medium.com/dissecting-bert/dissecting-bert-part-1-d3c3d495cdb3
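As a companion to the figures referenced above, here is a minimal PyTorch sketch of one encoder block (multi-head self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection and layer normalization). The class name is mine and the default sizes are assumed to match BERT-base; this is an illustration, not the reference implementation.

import torch.nn as nn

class EncoderBlock(nn.Module):
    # One Transformer encoder block: self-attention sub-layer + feed-forward
    # sub-layer, each followed by a residual connection and layer normalization.
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Queries, keys, and values all come from the same input sequence x.
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.drop(attn_out))
        # Position-wise feed-forward network applied to each token independently.
        x = self.norm2(x + self.drop(self.ff(x)))
        return x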
Self-Attention
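A bare-bones NumPy version of single-head self-attention, just to make the computation explicit (the random matrices below stand in for learned projection weights):

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (n_tokens, d_model). Project to queries, keys, and values, score every
    # token pair, normalize with a row-wise softmax, and mix the value vectors.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                        # 5 tokens, 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)          # -> (5, 16)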
Input Representation
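In BERT, the input representation of each token is the element-wise sum of three embeddings: a WordPiece token embedding, a segment (sentence A/B) embedding, and a learned position embedding. A minimal PyTorch sketch, with sizes assumed to match BERT-base and names of my own choosing:

import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    # Token + segment + position embeddings, summed element-wise, then
    # layer-normalized (with dropout, as in common implementations).
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.seg = nn.Embedding(n_segments, hidden)
        self.pos = nn.Embedding(max_len, hidden)
        self.norm = nn.LayerNorm(hidden)
        self.drop = nn.Dropout(0.1)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len) integer tensors.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)
        return self.drop(self.norm(x))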
Pre-training procedure
Fine-tuning procedure
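For sentence-level tasks, fine-tuning adds one small task-specific layer on top of the final hidden state of the [CLS] token and updates all parameters end to end on labelled data. A hedged sketch: it assumes a pre-trained bert_encoder module that returns per-token hidden states, and the wrapper class and its names are illustrative rather than the authors' code.

import torch.nn as nn
import torch.nn.functional as F

class BertClassifier(nn.Module):
    # Pre-trained BERT encoder + a single linear classifier over the [CLS]
    # (first-position) hidden state; everything is trained jointly.
    def __init__(self, bert_encoder, hidden=768, num_labels=2):
        super().__init__()
        self.bert = bert_encoder          # assumed to return (batch, seq, hidden)
        self.drop = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, token_ids, segment_ids, labels=None):
        hidden_states = self.bert(token_ids, segment_ids)
        logits = self.classifier(self.drop(hidden_states[:, 0]))   # [CLS] vector
        if labels is not None:
            return F.cross_entropy(logits, labels)   # fine-tuning loss
        return logits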
Outline
• Research context
• Main ideas
• BERT
• Experiments
• Conclusions
Experiments
• GLUE (General Language Understanding Evaluation)
benchmark
• Distributes canonical train, dev, and test splits
• Labels for the test set are not provided
• Datasets in GLUE:
• MNLI: Multi-Genre Natural Language Inference
• QQP: Quora Question Pairs
• QNLI: Question Natural Language Inference
• SST-2: Stanford Sentiment Treebank
• CoLA: The Corpus of Linguistic Acceptability
• STS-B: The Semantic Textual Similarity Benchmark
• MRPC: Microsoft Research Paraphrase Corpus
• RTE: Recognizing Textual Entailment
• WNLI: Winograd NLI
Ablation Studies
• To understand
• Effect of pre-training tasks
• Effect of model size
• Effect of number of training steps
• Feature-based approach with BERT
Conclusions
Links
• TensorFlow code and pre-trained models for BERT:
https://github.com/google-research/bert
• PyTorch Pretrained BERT (see the usage sketch after this list):
https://github.com/huggingface/pytorch-pretrained-BERT
• BERT-pytorch: https://github.com/codertimo/BERT-pytorch
• BERT-keras: https://github.com/Separius/BERT-keras
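As a quick start with the pytorch-pretrained-BERT package linked above, a feature-extraction sketch (written from memory of that package's API, so treat the exact calls as an assumption and check the repository README):

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

text = "[CLS] who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokens = tokenizer.tokenize(text)
token_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    # Hidden states of every encoder layer plus the pooled [CLS] output.
    encoded_layers, pooled_output = model(token_ids)
print(len(encoded_layers), encoded_layers[-1].shape)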
Remark: Applying BERT to non-English languages
• Pre-trained BERT models are provided for more than 100
languages (including Vietnamese)
• https://github.com/google-research/bert/blob/master/multilingual.md
• Be careful with tokenization!
• For Japanese (and Chinese): “spaces were added around every character
in the CJK Unicode range before applying WordPiece” => not a good
approach for these languages
• Use SentencePiece instead: https://github.com/google/sentencepiece
(a usage sketch follows this list)
• We may need to pre-train a BERT model ourselves
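A minimal SentencePiece sketch (the corpus file, model prefix, vocabulary size, and sample sentence are placeholders of my own, not values from the slides):

import sentencepiece as spm

# Learn a subword vocabulary directly from raw text (one sentence per line),
# with no language-specific pre-tokenization.
spm.SentencePieceTrainer.Train(
    '--input=corpus.txt --model_prefix=my_subwords --vocab_size=32000'
)

sp = spm.SentencePieceProcessor()
sp.Load('my_subwords.model')
print(sp.EncodeAsPieces('xin chào thế giới'))   # subword pieces such as ['▁xin', ...]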
References
1. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
2. Vaswani, A., et al. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762. https://arxiv.org/abs/1706.03762
3. The Annotated Transformer, by harvardnlp: http://nlp.seas.harvard.edu/2018/04/03/attention.html
4. The Illustrated Transformer: http://jalammar.github.io/illustrated-transformer/
5. ELMo explained: https://www.mihaileric.com/posts/deep-contextualized-word-representations-elmo/
6. ULMFiT explained: https://yashuseth.blog/2018/06/17/understanding-universal-language-model-fine-tuning-ulmfit/
7. Dissecting BERT: https://medium.com/dissecting-bert
8. The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning): http://jalammar.github.io/illustrated-bert/