2103.11943v1
Introduction
The search for a universal representation of text is at the heart of the automated processing of natural
languages. The big breakthrough in this area came with the development of pretrained text embeddings such as
word2vec [52] and GloVe [64]. Over the past years, supervised models have shown consistently better results than
unsupervised models [49]. However, in recent years, models based on unsupervised learning have become
much more widespread since they do not require the preparation of a specially labeled dataset, but can use already
existing or automatically generated huge corpora of texts and, as a result, learn on a much larger sample, thus
taking full advantage of deep learning.
The centerpiece of 2019 in the field of natural language processing was the introduction of a new pretrained
text embedding model, BERT, which achieves unprecedented accuracy in many automated text processing
tasks. This model is likely to replace the widely known word2vec model in prevalence, becoming, in fact, the
industry standard. Throughout 2019, almost all scientific articles devoted to natural language text processing
were, in one way or another, a reaction to the release of this new model, whose authors
have become some of the most cited researchers in the field of machine learning.
Natural language processing tasks include a wide range of applications from conversational bots and
machine translation to voice assistants and online speech translation. Over the past few years, this industry has
experienced rapid growth, both quantitatively, in the volume of market applications and products, and qualitatively,
in the effectiveness of the latest models and the proximity to the human level of language understanding.
One of the central themes in natural language processing is the task of text representation. Text
representation is a kind of rule for converting natural language input information into machine-readable data. A
representation can also be considered simply a computer encoding of text, but in the context of applied machine
learning problems, representations that reflect the internal content and conceptual structure of the text are
more useful.
The simplest textual representation is categorical (one-hot) encoding, in which each word is represented as a vector
filled with zeros everywhere except for one position corresponding to the index of this word in the dictionary.
This concept was used in the early stages of the industry. It is conceptually very simple and requires almost no
computational resources to implement. However, such a representation does not take into account
the semantic features of words, and it is rather voluminous and redundant, since a vector with the dimension of the
number of words in the dictionary is used to represent each word.
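The categorical encoding described above can be sketched in a few lines. This is a minimal illustration only; the four-word dictionary and the function name `one_hot` are hypothetical and do not come from any of the cited works.

```python
def one_hot(word, vocabulary):
    """Return a vector of zeros with a single 1 at the word's index."""
    index = {w: i for i, w in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)   # one coordinate per dictionary word
    vector[index[word]] = 1          # mark the position of this word
    return vector


# Hypothetical toy dictionary, used only for illustration.
vocabulary = ["cat", "dog", "sat", "the"]
cat_vector = one_hot("cat", vocabulary)
print(cat_vector)  # [1, 0, 0, 0]
```

The length of every vector equals the dictionary size, which is why the representation becomes voluminous for realistic vocabularies.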
A similar representation is the well-known bag-of-words model. This model represents the entire text as a vector with
the dimension of the vocabulary, in which each component is the number of occurrences of a given word in
the text. This is also a fairly simple model that does not take into account the semantics of words; however, it is
quite successfully applied to tasks such as text categorization.
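A bag-of-words vector can be sketched as follows. The whitespace tokenization, the lowercasing, and the toy vocabulary are simplifying assumptions made for this example; practical systems use more careful tokenization.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())   # naive whitespace tokenization
    return [counts[word] for word in vocabulary]


# Hypothetical toy vocabulary, used only for illustration.
vocabulary = ["cat", "dog", "sat", "the"]
bow = bag_of_words("The cat sat the cat", vocabulary)
print(bow)  # [2, 0, 1, 2]
```

Word order is discarded entirely: any permutation of the same words yields the same vector.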
The most widely used representation in the modern industry is the so-called embedding,
a mechanism for representing each word in the form of a vector, each coordinate of which has a certain semantic
meaning. Most often, embeddings with hundreds of coordinates are used. Embeddings can capture the semantic
meaning of specific words in a text and express it as coordinates in a multidimensional space. One well-known
illustration of this is the ability to perform vector operations in this semantic latent space.
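The vector operations mentioned above are usually illustrated with analogies of the form "king - man + woman". The sketch below uses hand-made three-dimensional vectors that are purely hypothetical (real word vectors are learned and have hundreds of coordinates); it only shows the mechanics of the computation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors with coordinates (royalty, male, female).
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

# king - man + woman, computed coordinate by coordinate.
target = [k - m + w for k, m, w in zip(
    embeddings["king"], embeddings["man"], embeddings["woman"])]

# As is customary, the query words themselves are excluded from the search.
candidates = {word: vec for word, vec in embeddings.items()
              if word not in {"king", "man", "woman"}}
nearest = max(candidates, key=lambda word: cosine(target, candidates[word]))
print(nearest)  # queen
```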
Embeddings are most often learned by unsupervised model training on large text corpora. Examples of such training
tasks are filling in gaps in the text and determining how sentences relate to each other. Learning textual
representations is a very computationally intensive task. For most problems of general-purpose text analysis,
ready-made representations, trained in advance, are used [70].
Source: [93]
In addition to the basic algorithm, the authors of BERTScore use the IDF metric to emphasize rarer
words, as previous studies of text metrics [81] indicate that rare words may be more indicative of the similarity
of two sentences. The IDF values, estimated on the reference texts and smoothed (+1) to account for new words, are
used as the weights of the corresponding cosine similarities when they are averaged.
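The IDF weighting described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the smoothing form log((M + 1) / (df + 1)) is one common way to apply plus-one smoothing, the function names are hypothetical, and the per-token cosine similarities are placeholder numbers rather than values computed from BERT embeddings.

```python
import math


def make_idf(reference_sentences):
    """Build an IDF function from tokenized reference sentences.

    Uses log((M + 1) / (df + 1)); this plus-one form is an assumption.
    """
    num_sentences = len(reference_sentences)
    doc_freq = {}
    for sentence in reference_sentences:
        for token in set(sentence):          # count each token once per sentence
            doc_freq[token] = doc_freq.get(token, 0) + 1

    def idf(token):
        # Unseen tokens have df = 0 and therefore receive the largest weight.
        return math.log((num_sentences + 1) / (doc_freq.get(token, 0) + 1))

    return idf


def idf_weighted_average(tokens, similarities, idf):
    """Average per-token similarities, weighting each by the token's IDF."""
    weights = [idf(token) for token in tokens]
    total = sum(weights)
    if total == 0:                           # every token occurs everywhere
        return sum(similarities) / len(similarities)
    return sum(w * s for w, s in zip(weights, similarities)) / total


# Hypothetical tokenized references and placeholder similarity values.
references = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "bird", "sang"]]
idf = make_idf(references)
best_cosine = [0.9, 0.5, 0.7]   # best match for "the", "cat", "sat"
score = idf_weighted_average(references[0], best_cosine, idf)
```

In this toy example the word "the" occurs in every reference and receives zero weight, so the score is determined by the rarer words "cat" and "sat".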
In a comparative analysis of text similarity metrics, metrics based on BERT show consistently higher
results than classic text metrics. This means that they are statistically significantly closer to human estimates.
In addition, the original article introducing BERTScore addresses performance issues. In this regard,
BERTScore is, of course, often slower than classic metrics. The authors give a comparison with
the popular SacreBLEU implementation, relative to which BERTScore is about three times slower. Since a pretrained
BERT model is used to assess the similarity, the increase in the accuracy of the estimate comes at the expense of
speed, although the slowdown is smaller than might be expected from such a complex model. Given the typical size of
natural language processing datasets, the increase in computation time during model validation should not have a
significant effect on the performance of the machine learning system as a whole.
Publications are already beginning to appear [95] that improve the original BERTScore algorithm by
parallelizing computational operations. Variations of classical metrics [45] that use BERT and show improved
correlation with human estimates have also been presented.
Undoubtedly, the direction of the future development of textual metrics will be the wider use of BERT as a
basis for assessment, a kind of semantic engine. Also promising is the development of specific models that take
into account the peculiarities of specific subject areas and increase the basic level of accuracy through
specialization. Concerning BERTScore, another advantage is its differentiability, which will allow it to be
integrated into the methodology for training text models in the future, which promises to further increase the
performance and quality of machine learning models of natural text processing.
Source: [25]
When analyzing the results, it turned out that a dataset consisting of generated adversarial examples can
reduce the accuracy of text classification from 80-97% to 0-20%. This indicates the success of the attack on the
machine learning model.
Adding the generated adversarial dataset to the training data and additionally training the model on it noticeably
improves text classification models, which show 2-7 percentage points higher accuracy on the adversarial test
dataset than without such additional training.
Source: [25]
Undoubtedly, studies of the robustness of machine learning models are an indispensable condition for the
widespread introduction of such intelligent systems in the decision-making process, which is what makes this
area of research relevant. As mentioned earlier, the development and study of various types of automated attacks on
machine learning models can give us a direction for further improving the development and testing of intelligent
systems in general, not limited to classification models.
Source: [34]
To assess the quality of cross-lingual models, the XNLI dataset [13] is used, which is built from a corpus of texts
in 15 different languages, significantly differing in prevalence, from English to Nepali. Research [34] shows that
the use of unsupervised multilingual learning can force the model to generate reliable cross-language textual
representations. MLM (masked language modeling) training is especially effective. Without the use of parallel
corpora, the model shows an improvement in the quality of machine translation by 1.3% on average relative to the
best previously known models. The TLM (translation language modeling) procedure can further improve machine
translation accuracy by an average of 4.9%.
Source: [36]
Unlike the authors of SciBERT, the authors of this model did not use a specialized vocabulary to tokenize text.
The BioBERT training process, like that of most similar models, consists of two parts: pretraining and
fine-tuning (additional training) for a specific text processing task. The authors of the model used three common
tasks to assess the performance of the model: named entity recognition (NER [89]), relation extraction (REL), and
question answering (QA [67]).
The experiments showed a clear advantage of the new model over conventional BERT in all of these tasks.
In several tasks, the best results to date were achieved.
BioBERT is the first, but not the only, domain-specific model based on BERT. I. Beltagy, K. Lo, and A.
Cohan from the Allen Institute for Artificial Intelligence in Seattle [6] present SciBERT, a model trained on a
corpus of scientific articles.
This model was trained on a random sample of more than 1 million scientific articles from the
Semantic Scholar database [1]. SciBERT also rebuilds the tokenization vocabulary underlying BERT to account for the
different vocabulary of the selected corpus. In the process of building the tokenizer, it turned out that the
overlap between the general-purpose vocabulary and the vocabulary of the scientific corpus was 42%, which indicates
a significant difference in the distribution of tokens between these corpora. Along with an intuitive understanding,
this fact gives reason to believe that the use of domain-specific models can improve performance.
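The overlap figure above can be understood as a simple set computation over two token vocabularies. The sketch below is only illustrative: the toy vocabularies are hypothetical, and dividing by the size of the first vocabulary is an assumption about how the share is normalized.

```python
def vocabulary_overlap(vocab_a, vocab_b):
    """Share of distinct tokens in vocab_a that also occur in vocab_b."""
    tokens_a, tokens_b = set(vocab_a), set(vocab_b)
    return len(tokens_a & tokens_b) / len(tokens_a)


# Hypothetical toy vocabularies of equal size, used only for illustration.
general_vocab = ["the", "of", "run", "##ing", "house", "music"]
science_vocab = ["the", "of", "protein", "##ing", "cell", "enzyme"]

overlap = vocabulary_overlap(general_vocab, science_vocab)
print(f"{overlap:.0%}")  # 50%
```

A low overlap means that many frequent domain terms are absent from the general vocabulary and would be split into several generic sub-word pieces.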
The model was trained following the original BERT training method, but on a different, more compact corpus
of texts and using a specialized vocabulary for tokenization. The training took 7 days on an 8-core TPU (by
comparison, training the full version of the original BERT model took 4 days on a 16-core TPU and is estimated to
take 40-70 days on an 8-GPU machine).
Numerical experiments were carried out on several scientific text processing tasks (extraction of entities,
classification of texts, determination of dependencies, etc.). Datasets from the following subject areas were used:
biomedicine (ChemProt [31], EBM-NLP [57] and others), computer science (SciERC [47], ACL-ARC [28]), and
interdisciplinary (SciCite, described by A. Cohan in [10]). On all datasets tested, SciBERT performed better than
general-purpose BERT, and on most of them it achieved the best results to date.
Source: [6]
It is noteworthy that this model shows comparable and sometimes superior results compared to BioBERT
[36], despite being trained on a significantly smaller body of biomedical texts.
The authors of the SciBERT model separately note that in all cases, fine-tuning the text embeddings on
a specific task gives a greater effect than building special architectures on top of fixed embeddings. Also, the use
of a vocabulary built for the subject area for text tokenization gives a positive effect.
Source: [43]
When evaluating the training procedure, the authors concluded that removing the NSP (next sentence prediction)
objective improves performance on downstream tasks, which directly contradicts the original publication. Also, using
individual sentences instead of longer segments degrades performance, presumably because the model loses the ability
to learn long-range dependencies.
Initially, BERT was trained for 1 million steps with batches of 256 sequences. This is computationally equivalent
to training for about 30 thousand steps with large batches of 8 thousand sequences. There are studies by M. Ott et al.
[59] showing that language models can benefit from an increase in the training batch size, provided the
learning rate is increased accordingly. Research [43] shows that this effect also takes place for BERT, but to a
limited extent; the highest performance is achieved with a batch size of about 2 thousand sequences.
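The computational equivalence stated above follows from keeping the total number of processed sequences constant. The short calculation below is a sketch of that arithmetic; it treats "8 thousand" and "2 thousand" as exactly 8,000 and 2,000, which is an assumption, and the function name is hypothetical.

```python
def equivalent_steps(base_steps, base_batch, new_batch):
    """Steps needed at new_batch to process as many sequences as the baseline."""
    return base_steps * base_batch / new_batch


total_sequences = 1_000_000 * 256                    # 256,000,000 sequences
steps_8k = equivalent_steps(1_000_000, 256, 8_000)   # 32000.0
steps_2k = equivalent_steps(1_000_000, 256, 2_000)   # 128000.0
print(total_sequences, steps_8k, steps_2k)
```

The result of 32 thousand steps for the large batch is consistent with the figure of about 30 thousand steps quoted in the text.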
A scrupulous study of all the intricacies of training a language model, given in [43], led to the creation of a
variation of the model called RoBERTa (Robustly Optimized BERT Pretraining Approach), which combines the best
practices from those analyzed. As a result, it was found that the performance of the original model can be improved
by an average of 3-4 percentage points, reaching the best values to date in all tested problems.
Conclusion
Immediately after its appearance, the BERT model received an intense reaction from the scientific
community and is now used in almost all text processing tasks. Almost immediately, the proposals for
improving the model considered in this work appeared, which led to an improvement in the results of its application
in all downstream tasks. With all that said, we can confidently assert that BERT represented a quantum leap in
the field of intelligent natural language processing and consolidated the superiority of using pre-trained text
representation models on huge data sets as a universal basis for building intelligent algorithms for solving specific
problems.
BERT has also shown the advantage of bidirectional contextual models of text comprehension based on the
Transformer architecture with an attention mechanism.
Undoubtedly, we will see many more new scientific results based on the application and adaptation of the
BERT model to various problems of natural language text processing. Further improvement of the neural
network architecture, coupled with fine-tuning the training procedure and parameters, will inevitably lead to
significant improvements in many NLP algorithms, from text classification and annotation to machine
translation and question-answering systems.
References
[1] Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason
Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo,
Tyler- Murray Han Ooi, Matthew Peters, Joanna Power, Sam Skjonsberg, Lucy Wang, Chris Willhelm,
Zheng Yuan, Madeleine Zuylen, and oren. 2018. Construction of the Literature Graph in Semantic Scholar.
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 3 (Industry Papers). DOI: https://doi.org/10.18653/v1/n18-3011
[2] Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016.
Massively Multilingual Word Embeddings. (2016). Retrieved March 24, 2020 from
https://arxiv.org/pdf/1602.01925.pdf
[3] Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017. Unsupervised Neural Machine
Translation. (2017). Retrieved March 24, 2020 from https://arxiv.org/pdf/1710.11041.pdf
[4] Vidhya Balasubramanian, Doraisamy Gobu Sooryanarayan, and Navaneeth Kumar Kanakarajan. 2015. A
multimodal approach for extracting content descriptive metadata from lecture videos. (2015). Retrieved
March 20, 2020 from https://doi.org/10.1007/s10844-015-0356-5
[5] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with
Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and
Extrinsic Evaluation Measures for Machine Translation and / or Summarization, 65–72. Retrieved March 20,
2020 from https://www.aclweb.org/anthology/W05-0909.pdf
[6] Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text.
Retrieved March 21, 2020 from http://arxiv.org/abs/1903.10676
[7] Nicholas Carlini and David Wagner. 2018. Audio Adversarial Examples: Targeted Attacks on
Speech-to-Text. Retrieved March 20, 2020 from http://arxiv.org/abs/1801.01944
[8] Richard A. Caruana. 1993. Multitask Learning: A Knowledge-Based Source of Inductive Bias. Machine
Learning Proceedings 1993, 41–48. DOI: https://doi.org/10.1016/b978-1-55860-307-3.50012-5
[9] Daniel Cer, Yinfei Yang, Sheng-Yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario
Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018.
Universal Sentence Encoder. Retrieved March 20, 2020 from http://arxiv.org/abs/1803.11175
[10] Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady. 2019. Structural Scaffolds for
Citation Intent Classification in Scientific Publications. Proceedings of the 2019 Conference of the North.
DOI: https://doi.org/10.18653/v1/n19-1361
[11] Alexis Conneau and Douwe Kiela. 2018. SentEval: An Evaluation Toolkit for Universal Sentence
Representations. (2018). Retrieved March 24, 2020 from https://arxiv.org/pdf/1803.05449.pdf
[12] Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word
Translation Without Parallel Data. (2017). Retrieved March 24, 2020 from
https://arxiv.org/pdf/1710.04087.pdf
[13] Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and
Veselin Stoyanov. 2018. XNLI: Evaluating Cross-lingual Sentence Representations. Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing. DOI: https://doi.org/10.18653/v1/d18-1269
[14] Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised Sequence Learning. In Advances in Neural
Information Processing Systems, 3079-3087. Retrieved March 20, 2020 from
http://papers.nips.cc/paper/5949-semi-supervised-sequence-learning.pdf
[15] Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, and Ruslan Salakhutdinov ... 2019.
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. (January 2019). Retrieved
March 24, 2020 from http://dx.doi.org/
[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. Retrieved March 20, 2020 from
http://arxiv.org/abs/1810.04805
[17] Manaal Faruqui and Chris Dyer. 2014. Improving Vector Space Word Representations Using Multilingual
Correlation. (2014). Retrieved March 24, 2020 from
https://pdfs.semanticscholar.org/8e14/bb86d0b6b28e40b6193a2d0fe80e258751ca.pdf
[18] Pierre-Etienne Genest and Guy Lapalme. 2011. Framework for abstractive summarization using text-to-text
generation. In Proceedings of the Workshop on Monolingual Text-To-Text Generation, unknown, 64–73.
Retrieved March 20, 2020 from http://dx.doi.org/
[19] Giuliano Giacaglia. 2019. Transformers. Medium. Retrieved March 19, 2020 from
https://towardsdatascience.com/transformers-141e32e69591
[20] James R. Glass, Timothy J. Hazen, D. Scott Cyphers, Igor Malioutov, and Regina Barzilay. 2007. Recent
progress in the MIT spoken lecture processing project. In INTERSPEECH 2007, 8th Annual Conference of
the International Speech Communication Association, Antwerp, Belgium, August 27-31, 2007, unknown,
2553-2556. Retrieved March 20, 2020 from http://dx.doi.org/
[21] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and Harnessing Adversarial
Examples. (December 2014). Retrieved March 20, 2020 from http://dx.doi.org/
[22] Alex Graves. 2012. Sequence Transduction with Recurrent Neural Networks. Retrieved March 19, 2020 from
http://arxiv.org/abs/1211.3711
[23] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. (1997). Retrieved March 20,
2020 from https://doi.org/10.1162/neco.1997.9.8.1735
[24] Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification.
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers). DOI: https://doi.org/10.18653/v1/p18-1031
[25] Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2019. Is BERT Really Robust? A Strong Baseline
for Natural Language Attack on Text Classification and Entailment. Retrieved March 20, 2020 from
http://arxiv.org/abs/1907.11932
[26] Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat,
Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's
Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the
Association for Computational Linguistics 5, 339-351. DOI: https://doi.org/10.1162/tacl_a_00065
[27] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020.
SpanBERT: Improving Pre-training by Representing and Predicting Spans. Transactions of the Association
for Computational Linguistics 8, 64-77. DOI: https://doi.org/10.1162/tacl_a_00300
[28] David Jurgens, Srijan Kumar, Raine Hoover, Dan McFarland, and Dan Jurafsky. 2018. Measuring the
Evolution of a Scientific Field through Citation Frames. Transactions of the Association for Computational
Linguistics 6, 391-406. DOI: https://doi.org/10.1162/tacl_a_00028
[29] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A Convolutional Neural Network for
Modeling Sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), 655–665. DOI: https://doi.org/10.3115/v1/P14-1062
[30] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014
Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746-1751. DOI: https://doi.org/10.3115/v1/D14-1181
[31] Jens Kringelum, Sonny Kim Kjaerulff, Søren Brunak, Ole Lund, Tudor I. Oprea, and Olivier Taboureau.
2016. ChemProt-3.0: a global chemical biology diseases mapping. Database 2016, (February 2016). DOI:
https://doi.org/10.1093/database/bav123
[32] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. 2016. Adversarial examples in the physical world.
(2016). Retrieved March 20, 2020 from https://arxiv.org/pdf/1607.02533.pdf
[33] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding
Comprehension Dataset From Examinations. Proceedings of the 2017 Conference on Empirical Methods in
Natural Language Processing. DOI: https://doi.org/10.18653/v1/d17-1082
[34] Guillaume Lample and Alexis Conneau. 2019. Cross-lingual Language Model Pretraining. (2019). Retrieved
March 21, 2020 from https://arxiv.org/pdf/1901.07291.pdf
[35] Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Unsupervised Machine Translation
Using Monolingual Corpora Only. (2017). Retrieved March 24, 2020 from
https://arxiv.org/pdf/1711.00043.pdf
[36] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang.
2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. DOI:
https://doi.org/10.1093/bioinformatics/btz682
[37] Gregor Leusch, Nicola Ueffing, and Hermann Ney. 2006. CDER: Efficient MT Evaluation Using Block
Movements. EACL (2006). Retrieved March 20, 2020 from
https://pdfs.semanticscholar.org/aa77/9e4e9f381ad98d66f7f415463542ba74c92d.pdf
[38] Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. 2017. Deep Text
Classification Can be Fooled. DOI: https://doi.org/10.24963/ijcai.2018/585
[39] Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. 2018. TextBugger: Generating Adversarial Text
Against Real-world Applications. (2018). Retrieved March 20, 2020 from
https://arxiv.org/pdf/1812.05271.pdf
[40] Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. Understanding Neural Networks through Representation
Erasure. Retrieved March 20, 2020 from http://arxiv.org/abs/1612.08220
[41] Liyuan Liu, Jingbo Shang, Frank Xu, Xiang Ren, and Jiawei Han. 2017. Empower Sequence Labeling with
Task-Aware Neural Language Model. (September 2017). Retrieved March 20, 2020 from http://dx.doi.org/
[42] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent Neural Network for Text Classification with
Multi-Task Learning. (2016). Retrieved March 20, 2020 from https://arxiv.org/pdf/1605.05101.pdf
[43] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke
Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach.
(2019). Retrieved March 21, 2020 from https://arxiv.org/pdf/1907.11692.pdf
[44] Chi-Kiu Lo. 2017. MEANT 2.0: Accurate semantic MT evaluation for any output language. Proceedings of
the Second Conference on Machine Translation. DOI: https://doi.org/10.18653/v1/w17-4767
[45] Chi-Kiu Lo. 2019. YiSi - a Unified Semantic MT Quality Evaluation and Estimation Metric for Languages
with Different Levels of Available Resources. Proceedings of the Fourth Conference on Machine Translation
(Volume 2: Shared Task Papers, Day 1). DOI: https://doi.org/10.18653/v1/w19-5358
[46] Chi-Kiu Lo, Michel Simard, Darlene Stewart, Samuel Larkin, Cyril Goutte, and Patrick Littell. 2018.
Accurate semantic textual similarity for cleaning noisy parallel corpora using semantic machine translation
evaluation metric: The NRC supervised submissions to the Parallel Corpus Filtering task. Proceedings of the
Third Conference on Machine Translation: Shared Task Papers. DOI: https://doi.org/10.18653/v1/w18-6481
[47] Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. 2018. Multi-Task Identification of Entities,
Relations, and Coreference for Scientific Knowledge Graph Construction. Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing. DOI: https://doi.org/10.18653/v1/d18-1360
[48] Qingsong Ma, Yvette Graham, Shugen Wang, and Qun Liu. 2017. Blend: a Novel Combined MT Metric
Based on Direct Assessment - CASICT-DCU submission to WMT17 Metrics Task. In Proceedings of the
Second Conference on Machine Translation, unknown, 598-603. DOI: https://doi.org/10.18653/v1/W17-4768
[49] Xiaofei Ma, Zhiguo Wang, Patrick Ng, Ramesh Nallapati, and Bing Xiang. 2019. Universal Text
Representation from BERT: An Empirical Study. Retrieved March 20, 2020 from
http://arxiv.org/abs/1910.07973
[50] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in Translation:
Contextualized Word Vectors. In Advances in Neural Information Processing Systems, 6294-6305. Retrieved
March 20, 2020 from
http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf
[51] Michael McCloskey and Neal J. Cohen. 1989. Catastrophic Interference in Connectionist Networks: The
Sequential Learning Problem. Psychology of Learning and Motivation, 109-165. DOI: https://doi.org/10.1016/s0079-7421(08)60536-8
[52] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed
Representations of Words and Phrases and their Compositionality. Adv. Neural Inf. Process. Syst. 26,
(October 2013). Retrieved March 20, 2020 from http://dx.doi.org/
[53] Derek Miller. 2019. Leveraging BERT for Extractive Text Summarization on Lectures. (June 2019).
Retrieved March 20, 2020 from http://dx.doi.org/
[54] Nikola Mrksic, Diarmuid Ó. Séaghdha, Blaise Thomson, Milica Gašić, Lina Maria Rojas-Barahona, Pei-Hao
Su, David Vandyke, Tsung-Hsien Wen, and Steve J. Young. 2016. Counter-fitting Word Vectors to Linguistic
Constraints. (2016). Retrieved March 20, 2020 from https://arxiv.org/pdf/1603.00892.pdf
[55] Gabriel Murray, Steve Renals, and Jean Carletta. 2005. Extractive summarization of meeting recordings. In
INTERSPEECH 2005 - Eurospeech, 9th European Conference on Speech Communication and Technology,
Lisbon, Portugal, September 4-8, 2005, unknown, 593-596. Retrieved March 20, 2020 from http://dx.doi.org/
[56] Timothy Niven and Hung-Yu Kao. 2019. Probing Neural Network Comprehension of Natural Language
Arguments. Retrieved March 20, 2020 from http://arxiv.org/abs/1907.07355
[57] Benjamin Nye, Junyi Jessy Li, Roma Patel, Yinfei Yang, Iain J. Marshall, Ani Nenkova, and Byron C.
Wallace. 2018. A Corpus with Multi-Level Annotations of Patients, Interventions and Outcomes to Support
Language Processing for Medical Literature. Proc Conf Assoc Comput Linguist Meet 2018, (July 2018),
197–207. Retrieved from https://www.ncbi.nlm.nih.gov/pubmed/30305770
[58] OpenWebTextCorpus. Download. OpenWebTextCorpus. Retrieved April 7, 2020 from
https://skylion007.github.io/OpenWebTextCorpus/
[59] Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling Neural Machine Translation.
Proceedings of the Third Conference on Machine Translation: Research Papers. DOI: https://doi.org/10.18653/v1/w18-6301
[60] Bo Pang and Lillian Lee. 2005. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization
with Respect to Rating Scales. In Proceedings of the 43rd Annual Meeting of the Association for
Computational Linguistics (ACL'05), 115-124. DOI: https://doi.org/10.3115/1219840.1219855
[61] Joybrata Panja and Sudip Kumar Naskar. 2018. ITER: Improving Translation Edit Rate through Optimizable
Edit Costs. Proceedings of the Third Conference on Machine Translation: Shared Task Papers. DOI: https://doi.org/10.18653/v1/w18-6455
[62] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami.
2017. Practical black-box attacks against machine learning. In ASIA CCS 2017 - Proceedings of the 2017
ACM Asia Conference on Computer and Communications Security, Association for Computing Machinery,
Inc, 506-519. DOI: https://doi.org/10.1145/3052973.3053009
[63] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic
Evaluation of Machine Translation. (October 2002). DOI: https://doi.org/10.3115/1073083.1073135
[64] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global Vectors for Word
Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP). DOI: https://doi.org/10.3115/v1/d14-1162
[65] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke
Zettlemoyer. 2018. Deep Contextualized Word Representations. Proceedings of the 2018 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long Papers). DOI: https://doi.org/10.18653/v1/n18-1202
[66] Alec Radford. 2018. Improving Language Understanding with Unsupervised Learning. OpenAI. Retrieved
March 20, 2020 from https://openai.com/blog/language-unsupervised/
[67] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for
Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural
Language Processing, 2383-2392. DOI: https://doi.org/10.18653/v1/D16-1264
[68] Prajit Ramachandran, Peter J. Liu, and Quoc V. Le. 2016. Unsupervised Pretraining for Sequence to
Sequence Learning. Retrieved March 24, 2020 from http://arxiv.org/abs/1611.02683
[69] Marek Rei. 2017. Semi-supervised Multitask Learning for Sequence Labeling. Proceedings of the 55th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). DOI: https://doi.org/10.18653/v1/p17-1194
[70] Joao Schapke. 2019. Evolution of word representations in NLP. Medium. Retrieved March 19, 2020 from
https://towardsdatascience.com/evolution-of-word-representations-in-nlp-d4483fe23e93
[71] Samuel L. Smith, David HP Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word
vectors, orthogonal transformations and the inverted softmax. (2017). Retrieved March 24, 2020 from
https://arxiv.org/pdf/1702.03859.pdf
[72] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of
translation edit rate with targeted human annotation. In Proceedings of Association for Machine
Translation in the Americas. Retrieved March 20, 2020 from
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.129.4369
[73] Peter Stanchev, Weiyue Wang, and Hermann Ney. 2019. EED: Extended Edit Distance Measure for Machine
Translation. Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers,
Day 1). DOI: https://doi.org/10.18653/v1/w19-5359
[74] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to Fine-Tune BERT for Text Classification?
Lecture Notes in Computer Science, 194-206. DOI: https://doi.org/10.1007/978-3-030-32381-3_16
[75] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob
Fergus. 2014. Intriguing properties of neural networks. (2014). Retrieved March 20, 2020 from
https://research.google/pubs/pub42503/
[76] Wilson L. Taylor. 1953. “Cloze Procedure”: A New Tool for Measuring Readability. Journalism Quarterly
30, 415-433. DOI: https://doi.org/10.1177/107769905303000401
[77] Christoph Tillmann, Stephan Vogel, Hermann Ney, A. Zubiaga, and Hassan Sawaf. 1997. Accelerated DP
based search for statistical translation. In the Fifth European Conference on Speech Communication and
Technology, EUROSPEECH 1997, Rhodes, Greece, September 22-25, 1997. Retrieved March 20, 2020 from
http://dx.doi.org/
[78] Trieu H. Trinh and Quoc V. Le. 2018. A Simple Method for Commonsense Reasoning. (2018). Retrieved
April 7, 2020 from https://arxiv.org/pdf/1806.02847.pdf
[79] Nicolas Van Labeke, Denise Whitelock, Debora Field, Stephen Pulman, and John TE Richardson. 2013.
What is my essay really saying? Using extractive summarization to motivate reflection and redrafting. In
Artificial Intelligence in Education, unknown. Retrieved March 20, 2020 from http://dx.doi.org/
[80] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz
Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing
Systems, 5998-6008. Retrieved March 20, 2020 from
http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
[81] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image
description evaluation. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). DOI:
https://doi.org/10.1109/cvpr.2015.7299087
[82] Takashi Wada, Tomoharu Iwata, and Yuji Matsumoto. 2019. Unsupervised Multilingual Word Embedding
with Limited Resources using Neural Language Models. Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/p19-1300
[83] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A
Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the
2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 353–355.
DOI: https://doi.org/10.18653/v1/W18-5446
[84] Weiyue Wang, Jan-Thorsten Peter, Hendrik Rosendahl, and Hermann Ney. 2016. CharacTer: Translation Edit
Rate on Character Level. Proceedings of the First Conference on Machine Translation: Volume 2, Shared
Task Papers. DOI: https://doi.org/10.18653/v1/w16-2342
[85] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim
Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu,
Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian,
Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Gregory S.
Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's Neural Machine Translation System: Bridging
the Gap between Human and Machine Translation. (2016). Retrieved March 20, 2020 from
https://arxiv.org/pdf/1609.08144.pdf
[86] Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized Word Embedding and Orthogonal
Transform for Bilingual Word Translation. (2015). Retrieved March 24, 2020 from
https://pdfs.semanticscholar.org/77e5/76c02792d7df5b102bb81d49df4b5382e1cc.pdf
[87] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019.
XLNet: Generalized Autoregressive Pretraining for Language Understanding. (2019). Retrieved March 21,
2020 from https://arxiv.org/pdf/1906.08237.pdf
[88] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical
Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies,
1480-1489. DOI: https://doi.org/10.18653/v1/N16-1174
[89] Wonjin Yoon, Chan Ho So, Jinhyuk Lee, and Jaewoo Kang. 2019. CollaboNet: collaboration of deep neural
networks for biomedical named entity recognition. BMC Bioinformatics 20, Suppl 10 (May 2019), 249. DOI:
https://doi.org/10.1186/s12859-019-2813-6
[90] Shanshan Yu, Jindian Su, and Da Luo. 2019. Improving BERT-Based Text Classification With Auxiliary
Sentence and Domain Knowledge. IEEE Access 7, 176600-176612.
DOI: https://doi.org/10.1109/access.2019.2953990
[91] Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A Large-Scale Adversarial
Dataset for Grounded Commonsense Inference. Retrieved March 20, 2020 from
http://arxiv.org/abs/1808.05326
[92] Justin Jian Zhang, Ricky Ho Yin Chan, and Pascale Fung. 2007. Improving lecture speech summarization using
rhetorical information. (2007). Retrieved March 20, 2020 from
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4430108
[93] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating
Text Generation with BERT. Retrieved March 20, 2020 from http://arxiv.org/abs/1904.09675
[94] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text
classification. In Advances in Neural Information Processing Systems, Neural information processing systems
foundation, 649–657. Retrieved March 20, 2020 from
https://nyuscholars.nyu.edu/en/publications/character-level-convolutional-networks-for-text-classification
[95] Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text
Generation Evaluating with Contextualized Embeddings and Earth Mover Distance. Proceedings of the 2019
Conference on Empirical Methods in Natural Language Processing and the 9th International Joint
Conference on Natural Language Processing (EMNLP-IJCNLP). DOI: https://doi.org/10.18653/v1/d19-1053
[96] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja
Fidler. 2015. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and
Reading Books. 2015 IEEE International Conference on Computer Vision (ICCV).
DOI: https://doi.org/10.1109/iccv.2015.11
[97] Automatic Text Summarization of Video Lectures Using Subtitles. springerprofessional.de. Retrieved
March 20, 2020 from
https://www.springerprofessional.de/automatic-text-summarization-of-video-lectures-using-subtitles/14211748
[98] https://www.kaggle.com/c/fake-news/data
[99] https://datasets.imdbws.com/
[100] https://commoncrawl.org/2016/10/news-dataset-available/