NLP Project Final Report
MuhammadMahdi Abdurahimov
National Research Nuclear University MEPhI (Moscow Engineering Physics Institute)
3 Method

A typical neural machine translation model consists of two parts: an encoder that forms contextualized word embeddings from the source sentence, and a decoder that generates the target translation from left to right. In our experiment, we first use the Transformer (Vaswani et al., 2017) as our baseline, and then build the BERT-fused model (Zhu et al., 2019) on top of it.

Let H_B denote the output of BERT for the source sentence, and let H_E^{l-1} = (h_1^{l-1}, ..., h_{l_x}^{l-1}) denote the output of the (l-1)-th encoder layer. The l-th encoder layer computes

    \hat{h}_i^l = \frac{1}{2} \left( attn_S(h_i^{l-1}, H_E^{l-1}, H_E^{l-1}) + attn_B(h_i^{l-1}, H_B, H_B) \right), \quad i \in [l_x],    (1)

where attn_S and attn_B are attention models with different parameters, defined in Eqn. (2). Then each \hat{h}_i^l is further processed by FFN(x) defined in Eqn. (3), and we get the output of the l-th layer: H_E^l = (FFN(\hat{h}_1^l), ..., FFN(\hat{h}_{l_x}^l)). The encoder eventually outputs H_E^L from the last layer. In short, the BERT-fused model combines the output of BERT with attention modules to incorporate it into the machine translation model.
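To make Eqn. (1) concrete, below is a minimal PyTorch sketch of one BERT-fused encoder layer. It is an illustration written for this report rather than the authors' fairseq implementation: the class and argument names are ours, and residual connections, layer normalization, dropout and padding masks are omitted.

    import torch
    import torch.nn as nn

    class BertFusedEncoderLayer(nn.Module):
        """One encoder layer following Eqn. (1): average self-attention over the
        previous layer with attention over the (frozen) BERT output H_B."""

        def __init__(self, d_model: int, d_bert: int, n_heads: int = 8, d_ff: int = 2048):
            super().__init__()
            # attn_S: self-attention over H_E^{l-1}.
            self.attn_s = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            # attn_B: attention whose keys/values come from BERT (possibly a different width).
            self.attn_b = nn.MultiheadAttention(d_model, n_heads, kdim=d_bert, vdim=d_bert,
                                                batch_first=True)
            # FFN(x) = W_2 max(W_1 x + b_1, 0) + b_2, as in Eqn. (3).
            self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                     nn.Linear(d_ff, d_model))

        def forward(self, h_prev: torch.Tensor, h_bert: torch.Tensor) -> torch.Tensor:
            # h_prev: (batch, src_len, d_model) = H_E^{l-1}; h_bert: (batch, bert_len, d_bert) = H_B.
            s, _ = self.attn_s(h_prev, h_prev, h_prev)   # attn_S(h_i, H_E^{l-1}, H_E^{l-1})
            b, _ = self.attn_b(h_prev, h_bert, h_bert)   # attn_B(h_i, H_B, H_B)
            h_hat = 0.5 * (s + b)                        # the 1/2 (... + ...) combination of Eqn. (1)
            return self.ffn(h_hat)                       # contribution to H_E^l

Stacking L such layers and feeding the last layer's output H_E^L to the decoder reproduces the structure described above.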
The attention layer is defined as

    attn(q, K, V) = \sum_{i=1}^{|V|} a_i W_v v_i, \quad a_i = \frac{\exp((W_q q)^T (W_k k_i))}{Z}, \quad Z = \sum_{i=1}^{|V|} \exp((W_q q)^T (W_k k_i)),    (2)

where attn(q, K, V) defines the attention layer and q, K and V represent query, key and value respectively. Here q is a d_q-dimensional vector, and K and V are two sets with |K| = |V|. Each k_i ∈ K and v_i ∈ V are also d_k-/d_v-dimensional vectors (d_q, d_k and d_v can be different), i ∈ [|K|], and W_q, W_k and W_v are the parameters to be learned.
We define the non-linear transformation layer as

    FFN(x) = W_2 \max(W_1 x + b_1, 0) + b_2,    (3)

where x is the input and W_1, W_2, b_1, b_2 are the parameters to be learned.
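As a worked example of Eqn. (2) and Eqn. (3), the following PyTorch snippet implements both definitions directly on single vectors (no batching or multi-head splitting); all tensor shapes are illustrative.

    import torch

    def attn(q, K, V, W_q, W_k, W_v):
        """Eqn. (2): a_i is proportional to exp((W_q q)^T (W_k k_i)); output = sum_i a_i W_v v_i.
        Shapes: q (d_q,), K (|K|, d_k), V (|V|, d_v); W_q (d, d_q), W_k (d, d_k), W_v (d, d_v)."""
        scores = (K @ W_k.T) @ (W_q @ q)      # (|K|,): (W_q q)^T (W_k k_i) for every i
        a = torch.softmax(scores, dim=0)      # a_i = exp(score_i) / Z
        return a @ (V @ W_v.T)                # sum_i a_i (W_v v_i), shape (d,)

    def ffn(x, W1, b1, W2, b2):
        """Eqn. (3): FFN(x) = W2 max(W1 x + b1, 0) + b2."""
        return W2 @ torch.clamp(W1 @ x + b1, min=0) + b2

    # Tiny usage example with arbitrary dimensions.
    d_q, d_k, d_v, d = 4, 4, 4, 8
    q, K, V = torch.randn(d_q), torch.randn(5, d_k), torch.randn(5, d_v)
    W_q, W_k, W_v = torch.randn(d, d_q), torch.randn(d, d_k), torch.randn(d, d_v)
    print(attn(q, K, V, W_q, W_k, W_v).shape)   # torch.Size([8])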
Let S_{<t}^l denote the hidden states of the l-th decoder layer preceding time step t, i.e., S_{<t}^l = (s_0^l, ..., s_{t-1}^l). Note that s_0^l is a special token indicating the start of a sequence, and s_t^0 is the embedding of the word predicted at time step t − 1. At the l-th layer, we have

    \hat{s}_t^l = attn_S(s_t^{l-1}, S_{<t+1}^{l-1}, S_{<t+1}^{l-1}), \quad
    \tilde{s}_t^l = \frac{1}{2} \left( attn_B(\hat{s}_t^l, H_B, H_B) + attn_E(\hat{s}_t^l, H_E^L, H_E^L) \right), \quad
    s_t^l = FFN(\tilde{s}_t^l),    (4)

where attn_S, attn_B and attn_E denote self-attention, BERT-decoder attention and encoder-decoder attention respectively.
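The decoder step in Eqn. (4) can be sketched in the same way as the encoder layer above; again this is only an illustration with our own naming (no causal masking, residuals or incremental decoding), not the fairseq implementation used in the experiments.

    import torch
    import torch.nn as nn

    class BertFusedDecoderLayer(nn.Module):
        """One decoder layer following Eqn. (4)."""

        def __init__(self, d_model: int, d_bert: int, n_heads: int = 8, d_ff: int = 2048):
            super().__init__()
            self.attn_s = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.attn_b = nn.MultiheadAttention(d_model, n_heads, kdim=d_bert, vdim=d_bert,
                                                batch_first=True)
            self.attn_e = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                     nn.Linear(d_ff, d_model))

        def forward(self, s_prev: torch.Tensor, h_bert: torch.Tensor, h_enc: torch.Tensor) -> torch.Tensor:
            # s_prev: S_{<t+1}^{l-1} (batch, tgt_len, d_model); h_bert: H_B; h_enc: H_E^L.
            s_hat, _ = self.attn_s(s_prev, s_prev, s_prev)    # \hat{s}_t^l (self-attention)
            b, _ = self.attn_b(s_hat, h_bert, h_bert)         # attn_B(\hat{s}, H_B, H_B)
            e, _ = self.attn_e(s_hat, h_enc, h_enc)           # attn_E(\hat{s}, H_E^L, H_E^L)
            return self.ffn(0.5 * (b + e))                    # s_t^l = FFN(\tilde{s}_t^l)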
4 Experiments

4.1 Dataset

In our experiment, we use the IWSLT2017 En-Ar dataset, which was constructed from transcripts and manual translations of TED talks. As shown in Table 1, this dataset contains 235,527 parallel sentences in the training set and 888 parallel sentences in the validation set. It also contains 6 test sets, collected from TED talks given in the years 2010 to 2015. Figure 3 shows some examples of En→Ar and Ar→En translations produced by the baseline models. For each translation direction, we show a good case, a tricky case and a bad case (Example 1 to Example 3, respectively). As shown in Example 2, the Arabic sentence contains two unknown words. This causes two problems: 1) for En→Ar translation, the BLEU score does not reflect the actual translation quality; 2) for Ar→En translation, the MT model cannot generate correct text due to the missing information. Therefore, it is of great importance to reconsider the tokenization scheme for Arabic text.

    Split             Parallel sentences
    train             235,527
    dev               888
    test   tst2010    1,565
           tst2011    1,427
           tst2012    1,705
           tst2013    1,380
           tst2014    1,301
           tst2015    1,205

Table 1: Overview of the IWSLT2017 En-Ar dataset.
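For readers who want to reproduce the data statistics, a convenient way to obtain the corpus is the Hugging Face datasets hub. This is only a sketch: the report itself does not state how the data were obtained, the configuration name iwslt2017-ar-en is an assumption that should be verified, and the hub version may bundle the yearly tst sets into a single test split.

    from datasets import load_dataset

    # Assumption: the Ar-En pair is published under this configuration name.
    ds = load_dataset("iwslt2017", "iwslt2017-ar-en")

    # Print the number of parallel sentences per split (compare with Table 1).
    print({split: len(ds[split]) for split in ds})

    # Each example is expected to hold a {'ar': ..., 'en': ...} translation pair.
    print(ds["train"][0])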
Training Following the practice of Zhu et al. (2019), we first train the Transformer until convergence and then initialize the encoder and decoder of the BERT-fused model with the obtained model. The BERT-encoder attention and BERT-decoder attention are randomly initialized. During training, all parameters in BERT are frozen. We use fairseq (https://github.com/facebookresearch/fairseq) for training. For each translation direction, we train the BERT-fused model with max_tokens = 4,000 in each batch. It takes roughly 10 hours to train on a single NVIDIA A6000 48GB GPU. We also employ label smoothing with value 0.1 during training.
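The snippet below sketches the main ingredients of this setup in plain PyTorch: freezing all BERT parameters and using label smoothing of 0.1. It is not the fairseq command actually used; the multilingual BERT checkpoint name is an assumption (the report does not say which pretrained BERT was used), and the warm start from the baseline transformer is only indicated in a comment.

    import torch
    from transformers import BertModel

    # Assumed pretrained checkpoint; the report does not name the BERT variant it uses.
    bert = BertModel.from_pretrained("bert-base-multilingual-cased")
    for p in bert.parameters():
        p.requires_grad = False  # all BERT parameters stay frozen during training
    print(sum(p.numel() for p in bert.parameters() if p.requires_grad))  # -> 0

    # Label smoothing of 0.1, as used during training.
    criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

    # Warm start: load the converged baseline transformer's weights into the
    # BERT-fused model with strict=False so that the newly added BERT-encoder /
    # BERT-decoder attention modules keep their random initialization, e.g.:
    #   fused_model.load_state_dict(torch.load("checkpoints/baseline.pt"), strict=False)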
Evaluation During evaluation, we use the model with the best validation score to generate translations of the given source input. For decoding, we use a beam search algorithm with a beam size of 5. The evaluation metric is the BLEU score (Papineni et al., 2002), which automatically measures word and phrase matching between the MT output and the reference translations. Specifically, we use BLEU4, following common practice.
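Scoring can be reproduced with sacrebleu once the hypotheses have been generated (with beam size 5 in our setup); the file names below are placeholders.

    from pathlib import Path
    import sacrebleu

    hyps = Path("tst2015.hyp").read_text(encoding="utf-8").splitlines()
    refs = Path("tst2015.ref").read_text(encoding="utf-8").splitlines()

    # Corpus-level BLEU; sacrebleu uses 4-gram BLEU by default.
    bleu = sacrebleu.corpus_bleu(hyps, [refs])
    print(f"BLEU = {bleu.score:.2f}")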
4.3 Results

Main Results Table 2 shows the performance of the baseline MT models and our BERT-fused models on the 6 separate test sets. For Ar→En translation, the baseline model achieves a 26.71 BLEU score on average across the 6 test sets, with a standard deviation of 1.80. The BERT-fused model achieves a 28.52 BLEU score on average, outperforming the baseline by an absolute improvement of 1.81 BLEU. For En→Ar translation, the baseline model achieves a 12.78 BLEU score on average with a standard deviation of 1.96. The BERT-fused model achieves a 13.81 BLEU score on average, outperforming the baseline by an absolute improvement of 1.03 BLEU. For both translation directions, incorporating BERT into the Transformer results in consistent improvements over the vanilla Transformer across all six test sets. In addition, comparing the absolute performance of Ar→En MT and En→Ar MT, we can see that translating into Arabic is much more difficult than translating into English in terms of BLEU. However, it is also possible that BLEU4 is simply not an appropriate metric for evaluating Arabic translation.

Effect of Tokenization Table 3 shows the average performance of Transformer models trained with different tokenization schemes. From Table 3, we can see that for both translation directions, using BPE results in a much better BLEU score than using whole-word tokenization. This is mainly because BPE addresses out-of-vocabulary issues.
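For illustration, a BPE vocabulary of the kind discussed here can be trained with the Hugging Face tokenizers library; this is a sketch and not necessarily the exact pipeline used for the experiments (which could equally be built with subword-nmt or fairseq's preprocessing). The file names and vocabulary size are assumptions.

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    # Learn a joint BPE vocabulary over the (hypothetical) training files.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=8000, special_tokens=["[UNK]"])
    tokenizer.train(files=["train.ar", "train.en"], trainer=trainer)

    # Rare or unseen words are split into known subword units instead of
    # being mapped to a single unknown token.
    print(tokenizer.encode("unrecognizable").tokens)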
Figure 3: Examples of En→Ar and Ar→En translations produced by the baseline models.
Table 2: Results of baseline models and BERT-fused models. We report BLEU4 in this table. Bold denotes the best
result.
Effect of Preprocessing Table 4 shows the average performance of Transformer models trained on the raw Arabic corpus and on the preprocessed Arabic corpus. When using BPE, the results of models trained on preprocessed Arabic and on raw Arabic are almost the same for Ar→En MT. This result indicates that for Ar→En MT, preprocessing Arabic is not necessary when using BPE, which makes it easier for people who do not understand any Arabic to perform this task. However, for En→Ar MT, preprocessing Arabic is important both with and without BPE.
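The Arabic preprocessing referred to here (Unicode normalization, orthographic normalization and dediacritization, see Section 5) can be implemented with CAMeL Tools (Obeid et al., 2020). The steps below are a plausible reconstruction rather than a verbatim copy of our pipeline.

    # pip install camel-tools
    from camel_tools.utils.dediac import dediac_ar
    from camel_tools.utils.normalize import (
        normalize_alef_ar,
        normalize_alef_maksura_ar,
        normalize_teh_marbuta_ar,
        normalize_unicode,
    )

    def preprocess_ar(line: str) -> str:
        """Clean one Arabic sentence: Unicode and orthographic normalization, then dediacritization."""
        line = normalize_unicode(line)           # unify Unicode presentation forms
        line = normalize_alef_ar(line)           # collapse alef variants
        line = normalize_alef_maksura_ar(line)   # normalize alef maksura
        line = normalize_teh_marbuta_ar(line)    # normalize teh marbuta
        return dediac_ar(line)                   # strip diacritics

    print(preprocess_ar("إِنَّهُ مِثَالٌ"))  # example sentence with diacritics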
5 Conclusion and Future Work

In this work, we have used the Transformer as our baseline model for machine translation from Arabic to English as well as from English to Arabic. During the experiments, we performed Unicode normalization, orthographic normalization, dediacritization and BPE tokenization on the IWSLT2017 dataset. We compared the performance of the baseline MT models on 6 separate test sets and obtained good results on all of them. We not only explored the low-resource translation directions Ar→En and En→Ar, but also leveraged pre-trained BERT and fused it into the Transformer, which consistently improved the results across the six test sets for both directions. What is more, we found that preprocessing Arabic is critical for translating English into Arabic, and we think this could also hold for some other languages. In future work, we will continue to use Arabic-BERT in our models.
    Model             Ar→En    En→Ar
    word, raw Ar      17.95     8.24
    BPE, raw Ar       26.31    10.53
    ∆                 +8.36    +2.29
    word, clean Ar    21.02     9.93
    BPE, clean Ar     26.71    12.78
    ∆                 +5.69    +2.85

Table 3: Results of Transformer models trained with different tokenization schemes. We report the average BLEU score over the six test sets. "word" refers to whole-word tokenization, "BPE" refers to byte-pair encoding, "raw Ar" refers to the raw Arabic corpus, and "clean Ar" refers to the preprocessed Arabic corpus.

References

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Shuoyang Ding, Adithya Renduchintala, and Kevin Duh. 2019. A call for prudent choice of subword merge operations in neural machine translation. arXiv preprint arXiv:1905.10453.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26.

Ossama Obeid, Nasser Zalmout, Salam Khalifa, Dima Taji, Mai Oudah, Bashar Alhafni, Go Inoue, Fadhl Eryani, Alexander Erdmann, and Nizar Habash. 2020. CAMeL Tools: An open source Python toolkit for Arabic natural language processing. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 7022–7032, Marseille, France. European Language Resources Association.

Mai Oudah, Amjad Almahairi, and Nizar Habash. 2019. The impact of preprocessing on Arabic-English statistical and neural machine translation. arXiv preprint arXiv:1906.11751.
Uri Shaham and Omer Levy. 2020. Neural machine translation without embeddings. arXiv preprint arXiv:2008.09396.

Pamela Shapiro and Kevin Duh. 2018. Morphological word embeddings for Arabic neural machine translation in low-resource settings. In Proceedings of the Second Workshop on Subword/Character LEvel Models, pages 1–11.
Abu Bakr Soliman, Kareem Eissa, and Samhaa R. El-Beltagy. 2017. AraVec: A set of Arabic word embedding models for use in Arabic NLP. Procedia Computer Science, 117:256–265.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.

Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tieyan Liu. 2019. Incorporating BERT into neural machine translation. In International Conference on Learning Representations.

Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. arXiv preprint arXiv:1604.02201.