BARTpho: Pre-Trained Sequence-to-Sequence Models For Vietnamese

BARTpho is a pre-trained sequence-to-sequence model for Vietnamese based on BART. It was introduced at the VinAI NLP Workshop in 2021. There are two versions - BARTpho-syllable which operates at the syllable level, and BARTpho-word which operates at the word level. BARTpho helps produce state-of-the-art performance for Vietnamese text summarization tasks by being specifically trained for the Vietnamese language.


MINISTRY OF EDUCATION AND TRAINING

HO CHI MINH CITY UNIVERSITY OF SCIENCE


FACULTY OF INFORMATION TECHNOLOGY

BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese

Report for the Advanced Artificial Intelligence course

Lecturer: Nguyễn Ngọc Thảo


Group 4:
• 21C11029 - Hoàng Minh Thanh
• 21C12005 - Trần Hữu Nghĩa
• 21C11026 - Nguyễn Thành Thái
Paper Introduction
Motivation
• Seq2Seq (sequence-to-sequence) models date back to 2014
• The success of pre-trained seq2seq models has so far been largely limited to the English language
• Multilingual models are not aware of the difference between Vietnamese syllables and word tokens
• Note that 85% of Vietnamese word types are composed of at least two syllables
• From a societal, cultural, linguistic, cognitive and machine-learning perspective, a model dedicated to the Vietnamese language is required
Examples:
"chúng tôi" (we) is a different word from "tôi" (I)
"nghiên cứu" (research) is one word made of the two syllables "nghiên" and "cứu"
Word-segmented text: "chúng_tôi là những_người_nghiên_cứu" (we are researchers)
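A tiny, purely illustrative Python sketch (not from the paper) of how the same sentence looks at the syllable level versus after word segmentation, using the underscore convention shown above:

```python
# Illustration only: syllable-level vs word-level views of the example sentence.
# Underscores join the syllables that together form one Vietnamese word.
raw = "chúng tôi là những người nghiên cứu"
word_segmented = "chúng_tôi là những_người_nghiên_cứu"

syllable_tokens = raw.split()           # ['chúng', 'tôi', 'là', 'những', 'người', 'nghiên', 'cứu']
word_tokens = word_segmented.split()    # ['chúng_tôi', 'là', 'những_người_nghiên_cứu']

print(len(syllable_tokens), "syllable tokens vs", len(word_tokens), "word tokens")
```

BARTpho-syllable operates on the first view; BARTpho-word operates on the second.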
Paper Introduction
• BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese
• VinAI NLP Workshop 2021 (29/10/2021)
• The first public large-scale monolingual sequence-to-sequence models pre-trained for Vietnamese
• Based on the seq2seq denoising autoencoder BART
• Two versions of BARTpho:
• Syllable-level
VinAI công bố các kết quả nghiên cứu khoa học tại hội nghị hàng đầu thế giới về trí tuệ nhân tạo

• Word-level
VinAI công_bố các kết_quả nghiên_cứu khoa_học tại hội_nghị hàng_đầu thế_giới về trí_tuệ nhân_tạo

(VinAI publishes research outputs at world-leading conferences in Artificial Intelligence)


Paper Introduction
• BARTpho in transformers (v4.12+); see the loading sketch below
Model                     #params   Notes
vinai/bartpho-syllable    396M      syllable-level, monolingual
vinai/bartpho-word        420M      word-level, large-scale

• BARTpho in fairseq

• BARTpho is based on the BART model.
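A minimal sketch of loading the syllable-level checkpoint for feature extraction with transformers (v4.12+); the model IDs match the table above, but treat the exact calls as an illustration rather than the official usage:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Syllable-level checkpoint (~396M parameters). "vinai/bartpho-word" is loaded the same
# way, but its input text must first be word-segmented with an external Vietnamese segmenter.
tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
bartpho = AutoModel.from_pretrained("vinai/bartpho-syllable")

sentence = "Chúng tôi là những nghiên cứu viên."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = bartpho(**inputs)          # full seq2seq forward pass (encoder + decoder)

print(outputs.last_hidden_state.shape)   # decoder hidden states: (batch, seq_len, hidden_size)
```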

Problems addressed
• Can be used with the popular libraries fairseq (Facebook, 2019) and transformers (huggingface.co)
• Can serve as a strong baseline for future research and applications on generative Vietnamese NLP tasks

Comparison with the mBART baseline (Facebook, 2020)
• mBART: "Multilingual Denoising Pre-training for Neural Machine Translation"
• Earlier approaches had focused only on the encoder, the decoder, or reconstructing parts of the text
• mBART can be fine-tuned for supervised (both sentence-level and document-level) and unsupervised machine translation
• mBART gains up to 12 BLEU points for low-resource MT and over 5 BLEU points for many document-level and unsupervised settings
                             Data train       Data dev        Data test
Original                     105418 (~70%)    22642 (~15%)    22644 (~15%)
After filtering duplicates   102044 (~70%)    21040 (~15%)    20733 (~15%)

Comparison with the mBART baseline (Facebook, 2020)

Task: abstractive document summarization

Comparison with other models

Task: abstractive document summarization

Architecture
• 12 encoder layers and 12 decoder layers, with the pre-training scheme of BART
• BART pre-training has two stages:
  1. corrupting the input text with an arbitrary noising function
  2. learning to reconstruct the original text

Pre-training data
• Reuses PhoBERT's tokenizer and BPE vocabulary
• Uses the PhoBERT pre-training corpus: a large-scale corpus of 20GB of Vietnamese texts
• The pre-training corpus contains 145M word-segmented sentences (about 4B word tokens)

Architecture
• Transformer architecture ("Attention Is All You Need")
• Fine-tuning settings (see the sketch below):
  • batch size of 512 sequence blocks
  • learning rate of 0.0001
  • etc.
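For orientation only, a hypothetical transformers configuration that shows where the two hyper-parameters above would go; it is not the authors' actual training script:

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical, illustrative configuration: it mirrors the batch size of 512 sequence
# blocks (here as 8 x 64 gradient-accumulation steps) and the learning rate of 0.0001
# from the slide, nothing more.
training_args = Seq2SeqTrainingArguments(
    output_dir="bartpho-finetuned",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=64,   # effective batch size: 8 * 64 = 512
    learning_rate=1e-4,
    predict_with_generate=True,
)
print(training_args.learning_rate)
```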

Transformer evolution

BARTpho

Transformer Model

Attention mechanism

Demo: multiplication

https://www.symbolab.com/graphing-calculator

Attention mechanism

(Figure: attention weights applied to the value vectors v1, v2, v3)
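To make the mechanism concrete, here is a small self-contained sketch of scaled dot-product attention, the core operation of the Transformer layers used by BART and BARTpho; the matrices are toy values, not taken from the slides:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V, weights                                # weighted sum of value vectors

# Toy example: 2 queries attending over 3 key/value vectors of dimension 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.shape, output.shape)   # (2, 3) attention weights, (2, 4) attended outputs
```

Multi-head attention, shown on the following slide, runs several such attention functions in parallel on learned linear projections of Q, K and V and concatenates the results.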
Multi-Head Attention Layer

Demo: https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb
BERT Model

GPT Model

BART Model

The noising strategies used during BART pre-training:

• Token Masking: as in BERT, random tokens are sampled and replaced with [MASK].
• Token Deletion: random tokens are deleted from the input; the model must determine which positions are missing.
• Text Infilling: several text spans are sampled and each span is replaced with a single [MASK] token; a span may also be empty.
• Sentence Permutation: the document is split into sentences, which are then shuffled in random order.
• Document Rotation: a token is chosen at random and the document is rotated to begin with that token; this teaches the model to identify the start of the document.
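A rough, purely illustrative Python sketch of these corruptions on a toy word-segmented sentence; real BART draws span lengths from a Poisson distribution and operates on subword tokens, and none of this is the authors' code:

```python
import random

tokens = "VinAI công_bố các kết_quả nghiên_cứu khoa_học".split()
MASK = "[MASK]"
random.seed(0)

def token_masking(toks, p=0.3):
    # BERT-style: replace each token with [MASK] with probability p.
    return [MASK if random.random() < p else t for t in toks]

def token_deletion(toks, p=0.3):
    # Delete random tokens; the model must work out which positions are missing.
    return [t for t in toks if random.random() >= p]

def text_infilling(toks, start=1, length=2):
    # Replace a whole span with a single [MASK]; length may be 0 (empty span).
    return toks[:start] + [MASK] + toks[start + length:]

def sentence_permutation(sentences):
    # Shuffle the sentences of a document into a random order.
    return random.sample(sentences, k=len(sentences))

def document_rotation(toks):
    # Rotate the document so it starts at a randomly chosen token.
    i = random.randrange(len(toks))
    return toks[i:] + toks[:i]

print(token_masking(tokens))
print(token_deletion(tokens))
print(text_infilling(tokens))
print(document_rotation(tokens))
```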
BARTpho illustration

Demo
• Colab: https://colab.research.google.com/drive/1JRSGghV7oWgRSLHqqyxpfZgUjxSqz1YB?usp=sharing

• Source code: https://github.com/VinAIResearch/BARTpho

• Ours: https://github.com/hmthanh/BARTpho_code
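For completeness, a hypothetical sketch of running the syllable-level model as a seq2seq generator with transformers; it assumes the checkpoint loads through the generic AutoModelForSeq2SeqLM head, and without task-specific fine-tuning the generated text is not a real summary:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "vinai/bartpho-syllable"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = ("VinAI công bố các kết quả nghiên cứu khoa học "
        "tại hội nghị hàng đầu thế giới về trí tuệ nhân tạo.")
inputs = tokenizer(text, return_tensors="pt")

# Beam-search generation; for summarization, fine-tune first (e.g. on the splits shown earlier).
output_ids = model.generate(**inputs, max_length=40, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```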

Conclusion
• BARTpho is based entirely on BART, adapted to the Vietnamese language
• The main contribution of the authors is pre-training the weights and the tokenization for Vietnamese
• The evaluation results show that BARTpho helps produce SOTA performance for the Vietnamese text summarization task
• These SOTA results are a premise for future research
• BARTpho-syllable and BARTpho-word are the first large-scale pre-trained monolingual seq2seq models for Vietnamese
