Combination of Abstractive and Extractive Approaches For Summarization of Long Scientific Texts
ITMO University,
49 Kronverksky Pr., St. Petersburg, 197101
1 Introduction
The language modeling task, in the general case, is the process of learning the joint probability function of sequences of words in a natural language. Statistical Language Modeling, or Language Modeling, is the development of probabilistic models that are able to predict the next word in a sequence given the words that precede it, also known as the context. Language models can operate on different sequence levels: smaller models work with sequences of characters or words, while larger models can work with whole sentences, but the most common language models operate on sequences of words.
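Formally, this corresponds to factorizing the joint probability of a word sequence with the chain rule, where each factor is the conditional probability of a word given its preceding context:

\[ P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1}) \]

A language model is trained to estimate each conditional factor \(P(w_t \mid w_1, \dots, w_{t-1})\).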
The first neural language model used a feed-forward architecture. One of the main features of neural language models is that they produce representation vectors for word sequences. These vectors are usually called embeddings [11]: embeddings of similar words lie close to each other in the embedding space and thus have similar representations.
After the success of feed-forward networks, recurrent neural networks [5] achieved better results in language modeling because of their ability to take the position of words in a sentence into account and to produce contextual word embeddings [13]. Long Short-Term Memory networks [6] allow the language model to learn the relevant context of longer sequences than feed-forward networks or plain RNNs thanks to a more sophisticated memory mechanism. Next, the attention mechanism, in combination with the sequence-to-sequence framework [18], brought further improvements in language modeling. Attention improves on the memory mechanism of recurrent neural networks by giving the decoder network the ability to look at the whole context.
The next big step in language modeling was the development of the transformer architecture [19], whose novel self-attention mechanism helps the model use sentence context more efficiently. Such models can take into account both the left and right context of a sequence, as implemented in the BERT model [4], or only the left context, as in the GPT model [14]. Transformer-based language models have their own disadvantages, one of them being a limited receptive field: a transformer can only process sequences of limited length, while recurrent neural networks can, in principle, work with sequences of unlimited length. This issue is partially solved by the Transformer-XL model [3], which can work with continuous streams of text in the manner of recurrent neural networks. Despite this limitation, models like GPT, with a large receptive field and trained on large amounts of data, are capable of capturing long-range dependencies.
In this work, we propose a method that uses both extractive and abstractive approaches to the summarization task. Our work improves on a previous approach [17] by using a pre-trained language model instead of training one from scratch. We used the arXiv dataset, which is a good example of long scientific documents and allows us to compare our work with previous approaches. We split the training process into two steps. First, we train an extractive model as a classifier that selects which sentences should be included in the summary. Second, we use the extracted summary, together with different article sections, as conditioning for generating an abstractive summary. Adding the extractive summary to the conditioning part is crucial for generating the target summary. We also experimented with different conditioning variants and found the best combination: according to our experiments, the extracted summary together with the introduction and conclusion of the paper performs best.
2 Related Work
3 Proposed Model
Our model consists of two components: 1) an extractive model, a classifier that chooses which sentences from the source text should be included in the summary; 2) an abstractive model that uses the conditioning text to produce an abstractive summary.
For both the validation and test parts we took 5 percent of the documents. During the preprocessing step, papers that were too long or too short were removed, as well as papers without an abstract or body text. We also replaced some LaTeX markup with special tokens such as [math], [graph], [table], and [equation] in order to help the model recognize these constructs, and the remaining LaTeX source code was cleared. In addition, we removed all irrelevant characters and excluded all non-Latin letters. This preprocessing pipeline was applied for both the extractive and abstractive tasks.
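As an illustration, the following is a minimal Python sketch of this kind of regex-based cleaning. The exact regular expressions used in our pipeline are not reproduced here, so the patterns below are only an approximation.

import re

# Illustrative approximations of the cleaning rules described above;
# the exact regular expressions used in the pipeline may differ.
LATEX_PATTERNS = [
    (re.compile(r"\\begin\{equation\}.*?\\end\{equation\}", re.DOTALL), " [equation] "),
    (re.compile(r"\\begin\{table\}.*?\\end\{table\}", re.DOTALL), " [table] "),
    (re.compile(r"\\begin\{figure\}.*?\\end\{figure\}", re.DOTALL), " [graph] "),
    (re.compile(r"\$\$.*?\$\$|\$[^$]+\$", re.DOTALL), " [math] "),
]

def preprocess(text):
    # Replace LaTeX environments and inline math with special tokens.
    for pattern, token in LATEX_PATTERNS:
        text = pattern.sub(token, text)
    # Clear remaining LaTeX commands such as \cite{...} or \ref{...}.
    text = re.sub(r"\\[a-zA-Z]+(\{[^}]*\})?", " ", text)
    # Drop non-Latin characters, keeping digits, punctuation and the special tokens.
    text = re.sub(r"[^A-Za-z0-9\s\[\]\.,;:!?()\-]", " ", text)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()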
We used a common approach for creating the dataset for training the extractive model. First, we create a list of sentence pairs, "abstract sentence" -> "sentence from paper": every sentence from the abstract is matched with every sentence from the
paper text. For every pair, we compute the ROUGE metric; in particular, we compute ROUGE-1, ROUGE-2, and ROUGE-L and take the average F-score. This gives a score for each pair of sentences, the higher the better. We choose the two pairs with the highest scores as positive examples. Then we randomly sample two sentences from the paper text and mark such pairs as negative examples. After completing these steps, we obtain a dataset that contains a list of sentence pairs and labels, which we save in a separate file for extractive summarization.
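A minimal sketch of this pairing step is shown below. It assumes the rouge-score package, and the way positives and negatives are sampled is an approximation of the procedure described above, not the exact code we used.

import random
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def pair_score(abstract_sentence, paper_sentence):
    # Average F-score of ROUGE-1, ROUGE-2 and ROUGE-L for one sentence pair.
    scores = scorer.score(abstract_sentence, paper_sentence)
    return sum(s.fmeasure for s in scores.values()) / len(scores)

def build_examples(abstract_sentences, paper_sentences, n_pos=2, n_neg=2):
    # Score every (abstract sentence, paper sentence) pair, keep the highest
    # scoring pairs as positives and pair random paper sentences as negatives.
    ranked = sorted(
        ((a, p, pair_score(a, p)) for a in abstract_sentences for p in paper_sentences),
        key=lambda triple: triple[2],
        reverse=True,
    )
    positives = [(a, p, 1) for a, p, _ in ranked[:n_pos]]
    negatives = [(a, random.choice(paper_sentences), 0) for a, _, _ in ranked[:n_neg]]
    return positives + negatives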
For the abstractive summarization task, we took our best extractive model and ran inference on the dataset; in other words, we generated a summary for each paper in the dataset with our best extractive model. We then applied all the necessary preprocessing and saved the model-generated summaries together with the corresponding papers and abstracts. These are all the steps that were performed on the dataset.
We used the WordPiece tokenizer [15] for both the BERT and ELECTRA models, as in the original papers, with a vocabulary size of 30525. RoBERTa uses a BPE tokenizer [16] in its original implementation and does not have a [CLS] token in its vocabulary, which is why we manually added it before training. The key benefit of using subword tokenizers is a small vocabulary size and fewer out-of-vocabulary cases.
In addition, the special tokens [math], [graph], [table], and [equation], which were introduced by regular expressions during preprocessing, were added to the tokenizer vocabulary.
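A sketch of how these tokens can be registered with a pre-trained tokenizer using the Hugging Face transformers library is shown below; the checkpoint name is illustrative, not necessarily the one used in our experiments.

from transformers import BertForSequenceClassification, BertTokenizerFast

# Illustrative checkpoint; the actual pre-trained weights used may differ.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[math]", "[graph]", "[table]", "[equation]"]}
)

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)
# Resize the embedding matrix so the newly added tokens get trainable vectors.
model.resize_token_embeddings(len(tokenizer))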
The classification head in the figure above denotes a block that contains a stack of linear layers with an output of size 1 and a sigmoid activation. We used ROUGE metrics to evaluate the quality of summaries. To evaluate the summarization model, we first run inference on the test set: we make a list of pairs (sentence from the abstract, sentence from the paper) and score every pair with the model. Then we keep only the candidates with the highest scores. After this process, we obtain the extractive summary, which we compare with the ground-truth abstract to calculate ROUGE scores. All scores are averaged over papers. We used ROUGE-1, ROUGE-2, and ROUGE-L as our main evaluation metrics for summaries.
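The text only fixes the structure of the classification head (a stack of linear layers with a single sigmoid output over the [CLS] representation), so in the following PyTorch sketch the hidden sizes and dropout rate are assumptions.

import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    # A stack of linear layers with a single sigmoid output, applied to the
    # [CLS] representation of an (abstract sentence, paper sentence) pair.
    def __init__(self, hidden_size=768, inner_size=256, dropout=0.1):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(hidden_size, inner_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(inner_size, 1),
        )

    def forward(self, cls_embedding):
        # Returns the probability that the paper sentence belongs in the summary.
        return torch.sigmoid(self.layers(cls_embedding)).squeeze(-1)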
All proposed models used the same set of training hyperparameters, and all other settings were identical. From the table above we can conclude that, with an equal set of hyperparameters, the BERT model achieves the best results among all the models considered.
In the table above, Oracle denotes the ROUGE score between the ground-truth abstract and an extracted summary that includes the most relevant sentences from the text of a paper. The Oracle scores in the table indicate the upper limit an extractive model can reach, according to the ROUGE metric. In order to get more coherent text, we paraphrase the extracted summaries. To obtain paraphrased sentences we apply the back-translation technique: for this, we used a pre-trained transformer model trained to translate sentences from English to German and back. First, we translate the extractive summary into German using the pre-trained translation model and then back into English. These paraphrased summaries are used later in the experiments with the conditioned model.
The results of the extractive models are presented in Table 2.
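One possible implementation of the back-translation step is sketched below; the specific English-German translation checkpoints are assumptions, not necessarily the models used in our experiments.

from transformers import MarianMTModel, MarianTokenizer

# Assumed MarianMT checkpoints for the English -> German -> English round trip.
EN_DE = "Helsinki-NLP/opus-mt-en-de"
DE_EN = "Helsinki-NLP/opus-mt-de-en"

def translate(sentences, model_name):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch, num_beams=4, max_length=256)
    return [tokenizer.decode(g, skip_special_tokens=True) for g in generated]

def back_translate(sentences):
    # English -> German -> English paraphrasing of extracted summary sentences.
    return translate(translate(sentences, EN_DE), DE_EN)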
The experiments with the BART model were performed in a similar way, but there are some differences in the model input. BART consists of two parts: the first is an encoder with a BERT-like architecture, and the second is a decoder built from stacked transformer decoder blocks, an architecture similar to GPT. The two parts are connected so that the encoder output is fed into the decoder, which uses the hidden states produced by the encoder. The encoder and decoder are trained end to end, meaning that during the backward pass we update the weights of both the decoder and the encoder.
In this setup, we feed the conditioning part into BART's encoder and the target text into the decoder during training. We propose different conditioning scenarios: conditioning on the extractive summary alone, on the introduction of the paper, on the introduction concatenated with the conclusion, and on the introduction concatenated with the extractive summary and the conclusion. We assume that the introduction and conclusion concentrate the most valuable information for generating summaries, because in the introduction the authors usually describe the problem itself, some details of the proposed method, and its novelty, while in the conclusion the authors usually recap what was done and summarize the results. We did not apply conditioning on long texts to GPT-2 because of its input size restriction: the conditioning part plus the target text would in most cases not fit into the 1024-token input, which is why for GPT-2 we only conditioned on the summaries extracted by our models. The experimental results with GPT-2 and BART are presented in Table 3 below.
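A minimal sketch of this conditioning setup with the Hugging Face transformers library is shown below; the checkpoint name and sequence lengths are assumptions.

from transformers import BartForConditionalGeneration, BartTokenizer

# Illustrative checkpoint; the actual BART weights used may differ.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def training_step(condition_text, target_abstract):
    # The conditioning part (e.g. extracted summary + introduction + conclusion)
    # goes to the encoder; the ground-truth abstract is the decoder target.
    inputs = tokenizer(condition_text, return_tensors="pt",
                       truncation=True, max_length=1024)
    labels = tokenizer(target_abstract, return_tensors="pt",
                       truncation=True, max_length=256).input_ids
    # With labels provided, the model shifts them internally to form the decoder
    # input and returns the cross-entropy loss for this training example.
    outputs = model(input_ids=inputs.input_ids,
                    attention_mask=inputs.attention_mask,
                    labels=labels)
    return outputs.loss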
From Table 3 we can conclude that the extractive summary plays a crucial role in generating abstractive summaries: removing it from the conditioning part leads to a decrease in ROUGE scores. The assumption that the introduction and conclusion hold the most relevant information is also confirmed. The extractive summary has a bigger impact than the introduction with the conclusion, which is expected, because the extractive summary already holds many sentences that are relevant according to the ROUGE score. We also observe that the ROUGE-2 and ROUGE-L scores of the abstractive model exceed those of the best extractive model and even the Oracle, because during abstractive summarization the model can produce words that are not present in the source document. The best model, which uses BERT as the extractor and BART as the abstractor, is presented in Fig. 2. First, BERT performs extractive summarization of the article; the extracted summary is then concatenated with the introduction and conclusion of the paper. This setup shows the best performance according to the ROUGE metric and outperforms the previous approach applied to the arXiv dataset.
Fig. 2. Pipeline of the proposed model: text for summarization -> extractive model -> abstractive model -> abstractive summary.
4 Conclusion
A novel improvement was proposed that uses both extractive and abstractive approaches: a BERT model was used as the extractive model and a BART model as the abstractive model. During this research, a comparative analysis of different architectures for the extractive and abstractive summarization approaches was carried out.
References
1. Bae, S., Kim, T., Kim, J., Lee, S.g.: Summary level training of sentence rewriting
for abstractive summarization. arXiv preprint arXiv:1909.08752 (2019)
2. Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: Electra: Pre-training text en-
coders as discriminators rather than generators. arXiv preprint arXiv:2003.10555
(2020)
3. Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q.V., Salakhutdinov, R.:
Transformer-xl: Attentive language models beyond a fixed-length context. arXiv
preprint arXiv:1901.02860 (2019)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirec-
tional transformers for language understanding. arXiv preprint arXiv:1810.04805
(2018)
5. Gal, Y., Ghahramani, Z.: A theoretically grounded application of dropout in re-
current neural networks. In: Advances in neural information processing systems.
pp. 1019–1027 (2016)
6. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation
9(8), 1735–1780 (1997)
7. Hsu, W.T., Lin, C.K., Lee, M.Y., Min, K., Tang, J., Sun, M.: A unified model for
extractive and abstractive summarization using inconsistency loss. arXiv preprint
arXiv:1805.06266 (2018)
8. Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O.,
Stoyanov, V., Zettlemoyer, L.: Bart: Denoising sequence-to-sequence pre-training
for natural language generation, translation, and comprehension. arXiv preprint
arXiv:1910.13461 (2019)
9. Liu, Y., Lapata, M.: Text summarization with pretrained encoders. arXiv preprint
arXiv:1908.08345 (2019)
10. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M.,
Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining
approach. arXiv preprint arXiv:1907.11692 (2019)
11. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed repre-
sentations of words and phrases and their compositionality. In: Advances in neural
information processing systems. pp. 3111–3119 (2013)
12. Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural
networks. In: International conference on machine learning. pp. 1310–1318 (2013)
13. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K.,
Zettlemoyer, L.: Deep contextualized word representations. arXiv preprint
arXiv:1802.05365 (2018)
14. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language
models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
15. Schuster, M., Nakajima, K.: Japanese and korean voice search. In: 2012 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP).
pp. 5149–5152. IEEE (2012)
16. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with
subword units. arXiv preprint arXiv:1508.07909 (2015)
17. Subramanian, S., Li, R., Pilault, J., Pal, C.: On extractive and abstractive neu-
ral document summarization with transformer language models. arXiv preprint
arXiv:1909.03186 (2019)
18. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural
networks. In: Advances in neural information processing systems. pp. 3104–3112
(2014)
19. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
L., Polosukhin, I.: Attention is all you need. In: Advances in neural information
processing systems. pp. 5998–6008 (2017)