3.1. Problems
The text summarization task can be described as follows: given a full understanding of the input text, we select the important sentences from the input sequence and then rewrite them into a shorter version that preserves the main meaning. Our model consists of two sub-models: the extraction agent and the abstraction agent. Formally, the input article is regarded as a sequence of sentences $s = (s_1, s_2, \dots, s_M)$, where $m$ is the index into the sentence sequence; each sentence is a sequence of words $s_m = (w_{m,1}, w_{m,2}, \dots, w_{m,N})$, where $n$ is the index into the word sequence. We select the important sentences from the sequence $s$ to make a new sequence $s' = (s'_1, s'_2, \dots, s'_K)$, $K < M$, and then generate the summary $s''$ by rewriting the sequence $s'$. In the setting above, namely by fitting the training data, we first find the extraction function $f: s \mapsto s'$, then find the abstraction function $g: s' \mapsto s''$, so the final objective function to be obtained is $h = g \circ f: s \mapsto s''$. The overall flowchart of this model can be seen in Figure 2.
In our work, we use BERT as the encoder for word tokens and sentences. Our main procedure is as follows: first, we pre-train our two sub-models, the extractor and the abstractor; second, we train the full end-to-end model with reinforcement learning, which bridges the two sub-models. These three training processes correspond to fitting the aforementioned functions $f$, $g$, and $h$, respectively.
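To make the two-stage decomposition concrete, the following minimal Python sketch chains the two sub-models as $h = g \circ f$; the function names `extract` and `abstract` and their signatures are illustrative placeholders rather than the actual interfaces of our implementation.

```python
# Minimal sketch of the extract-then-abstract pipeline h = g(f(s)).
# `extract` and `abstract` stand in for the trained sub-models; their names
# and signatures are illustrative assumptions only.

def summarize(article_sentences, extract, abstract, k):
    # f: pick the k most salient sentences from the article (K < M)
    key_sentences = extract(article_sentences, k)   # s' = f(s)
    # g: rewrite the selected sentences into a short abstractive summary
    summary = abstract(key_sentences)                # s'' = g(s')
    return summary
```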
3.2. Word Embedding
Word embedding is based on the distributional hypothesis of word representation. Word embeddings represent natural language words as low-dimensional vectors that computers can process, and the semantic relatedness of words can be measured by the similarity between their vectors. Word embeddings now commonly used in NLP tasks include Word2Vec, GloVe, BERT, etc.
There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning [12]. Although BERT is mainly used in fine-tuning mode in most NLP tasks, we use it in feature-based mode, solely as our encoder for text representation. As in BERT, the WordPiece tokenizer is used for the input text sequence. Experiments show that the WordPiece [24] tokenizer is more effective than a natural tokenizer (here, ‘natural tokenizer’ refers to word segmentation based on spaces, commas and other punctuation; the CoreNLP toolkit (https://stanfordnlp.github.io/CoreNLP/) is generally used in the experiments). BERT expresses the tokenized words as corresponding word embeddings; likewise, the sentences of the article are fed into the BERT model and a sentence vector representation of each sentence is obtained. The above process is expressed as Formula (1),

$$ x_m = \mathrm{BERT}(s_m), \quad s_m \in S, \quad m = 1, 2, \dots, M, \tag{1}$$

where $M$ is the number of sentences, $m$ is the index of a sentence, $s_m$ denotes the text of the $m$-th sentence, $S$ is the set of sentences, and $x_m$ is the sentence vector. Next, the word embeddings or sentence vectors are used as input in both the extractor and the abstractor.
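A minimal sketch of the feature-based use of BERT described above, using the Hugging Face `transformers` package, is shown below; pooling the final-layer [CLS] state as the sentence vector is an assumption made here for illustration, since Formula (1) only states that BERT maps each sentence to a vector.

```python
import torch
from transformers import BertTokenizer, BertModel

# Load a pre-trained BERT used purely as a frozen feature extractor.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()  # feature-based mode: no fine-tuning

@torch.no_grad()
def sentence_vectors(sentences):
    """Map each sentence s_m to a vector x_m, as in Formula (1)."""
    vectors = []
    for sent in sentences:
        # WordPiece tokenization is applied internally by the tokenizer.
        inputs = tokenizer(sent, return_tensors="pt", truncation=True, max_length=128)
        outputs = bert(**inputs)
        # Assumption: use the [CLS] hidden state as the sentence vector.
        vectors.append(outputs.last_hidden_state[:, 0, :].squeeze(0))
    return torch.stack(vectors)  # shape: (M, hidden_size)
```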
3.3. Extraction Model
Our extraction model is motivated by [
21,
22]. The main difference is that our extractor uses BERT as the sentence encoder and the document encoder adopts a self-attention mechanism. We take a similar computation method and make some changes by adding a unidirectional GRU. Our objective is to visit each sentence of the document sequentially and obtain a shortlist of salient sentences with high recall, which further facilitates the abstractor.
The model consists of three components: a sentence encoder, a document encoder, and a sentence extractor. The sentence encoder adopts BERT, a bidirectional GRU with self-attention is used to encode the document, and a unidirectional GRU is used to compute the summary representation. The summary representation, the document vector and the hidden states of the bi-GRU are then involved in the computation of the sentence scores. The architecture of the model is shown in
Figure 3.
After obtaining the vector representations of the sentences from the BERT encoder, we can summarize the information of the document from both directions. This includes a forward GRU, Equation (2), and a backward GRU, Equation (3),

$$ \overrightarrow{h_t} = \overrightarrow{\mathrm{GRU}}\big(x_t, \overrightarrow{h}_{t-1}\big), \tag{2}$$

$$ \overleftarrow{h_t} = \overleftarrow{\mathrm{GRU}}\big(x_t, \overleftarrow{h}_{t+1}\big), \tag{3}$$

where $x_t$ is the sentence vector of the $t$-th sentence at time step $t$.
Both GRU and LSTM are based on the RNN (Recurrent Neural Network), and there is no evidence to show which one is best [25,26]. However, the GRU is simpler and more efficient, has fewer parameters, and is easier to implement. Therefore, we use a bidirectional GRU to encode the sentences in the document and a unidirectional GRU to obtain the summary representation, which takes account of the decisions made previously. A GRU is a recurrent network with two gates, $z_j$ called the update gate and $r_j$ the reset gate; it can be described by the following equations, Equations (4)–(7),

$$ z_j = \sigma\big(W_{zx} x_j + W_{zh} h_{j-1} + b_z\big), \tag{4}$$

$$ r_j = \sigma\big(W_{rx} x_j + W_{rh} h_{j-1} + b_r\big), \tag{5}$$

$$ \tilde{h}_j = \tanh\big(W_{hx} x_j + W_{hh}(r_j \odot h_{j-1}) + b_h\big), \tag{6}$$

$$ h_j = (1 - z_j) \odot \tilde{h}_j + z_j \odot h_{j-1}, \tag{7}$$

where the $W$'s and $b$'s are learnable parameters, $h_j$ is the real-valued hidden vector at time step $j$, $x_j$ is the corresponding input vector, namely the aforementioned sentence vector, and $\odot$ denotes the Hadamard product.
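For concreteness, the following is a minimal PyTorch sketch of a GRU cell written directly from Equations (4)–(7); in practice one would simply use `torch.nn.GRU`, so this is only meant to mirror the notation above.

```python
import torch
import torch.nn as nn

class GRUCellFromEquations(nn.Module):
    """A GRU cell mirroring Equations (4)-(7): update gate z_j, reset gate r_j,
    candidate state h~_j, and new hidden state h_j."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_zx = nn.Linear(input_size, hidden_size)               # W_zx x_j + b_z
        self.W_zh = nn.Linear(hidden_size, hidden_size, bias=False)  # W_zh h_{j-1}
        self.W_rx = nn.Linear(input_size, hidden_size)               # W_rx x_j + b_r
        self.W_rh = nn.Linear(hidden_size, hidden_size, bias=False)  # W_rh h_{j-1}
        self.W_hx = nn.Linear(input_size, hidden_size)               # W_hx x_j + b_h
        self.W_hh = nn.Linear(hidden_size, hidden_size, bias=False)  # W_hh (r_j * h_{j-1})

    def forward(self, x_j, h_prev):
        z_j = torch.sigmoid(self.W_zx(x_j) + self.W_zh(h_prev))            # Eq. (4)
        r_j = torch.sigmoid(self.W_rx(x_j) + self.W_rh(h_prev))            # Eq. (5)
        h_tilde = torch.tanh(self.W_hx(x_j) + self.W_hh(r_j * h_prev))     # Eq. (6)
        return (1 - z_j) * h_tilde + z_j * h_prev                          # Eq. (7)
```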
We concatenate the forward and the backward GRU hidden states to get the vector $h_j$, which summarizes the information of the $j$-th sentence and its context, as in Equation (8),

$$ h_j = \big[\overrightarrow{h_j}; \overleftarrow{h_j}\big], \qquad h_j \in \mathbb{R}^{2u}, \tag{8}$$

where $u$ denotes the size of the hidden vector and $M$ is the number of sentences in the document, so $H$ denotes the matrix of all GRU hidden states, as in Equation (9),

$$ H = (h_1, h_2, \dots, h_M) \in \mathbb{R}^{M \times 2u}. \tag{9}$$
Readers pay more or less attention to each sentence according to its contribution to the article. The representation of the whole document is therefore modeled as a weighted sum of the concatenated hidden states of the bidirectional GRU through a self-attention mechanism [26]. We take the concatenated hidden states $H$ as input and yield a vector of weights $a$ as output, calculated as shown in Equation (10),

$$ a = \mathrm{softmax}\big(w_{s2}\tanh(W_{s1}H^{\top})\big), \tag{10}$$

where $W_{s1}$ and $w_{s2}$ are learnable parameters, $W_{s1}\in\mathbb{R}^{k\times 2u}$, $w_{s2}\in\mathbb{R}^{1\times k}$, and $k$ is a hyper-parameter that can be set arbitrarily. softmax() is the function used to normalize the attention weights so that they sum to 1. After getting the attention vector $a$, the document vector is obtained as the sum of the GRU hidden states weighted by $a$, as shown in Figure 4 and Equation (11),

$$ d = aH, \tag{11}$$

where $a\in\mathbb{R}^{1\times M}$ and $H\in\mathbb{R}^{M\times 2u}$, so the document representation $d$ is a vector whose dimension is $2u$.
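The document vector computation of Equations (10) and (11) can be sketched as follows; tensor shapes follow the dimensions given above, and the module is a simplified illustration rather than our exact implementation.

```python
import torch
import torch.nn as nn

class SelfAttentiveDocEncoder(nn.Module):
    """Weights the bi-GRU states H (M x 2u) with self-attention (Eq. (10))
    and returns the document vector d = aH (Eq. (11))."""

    def __init__(self, hidden_size_2u, k):
        super().__init__()
        self.W_s1 = nn.Linear(hidden_size_2u, k, bias=False)  # W_s1: k x 2u
        self.w_s2 = nn.Linear(k, 1, bias=False)                # w_s2: 1 x k

    def forward(self, H):
        # H: (M, 2u) concatenated forward/backward GRU hidden states
        scores = self.w_s2(torch.tanh(self.W_s1(H))).squeeze(-1)  # (M,)
        a = torch.softmax(scores, dim=-1)                          # Eq. (10), sums to 1
        d = a @ H                                                   # Eq. (11): (2u,)
        return d, a
```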
For the extractor, each sentence is visited sequentially again, and a logistic layer makes a binary decision as to whether that sentence belongs to the summary, as shown in Equation (12),

$$ P(y_j = 1 \mid h_j, s_j, d) = \sigma\big(W_c h_j + h_j^{\top} W_s d - h_j^{\top} W_r \tanh(s_j) + W_p p_j + b\big), \tag{12}$$

where $y_j$ is a binary variable indicating whether the $j$-th sentence is included in the summary, $h_j$ is the hidden state of the bi-GRU at the $j$-th time step, $s_j$ is the dynamic representation of the summary before the $j$-th time step, $d$ is the document vector, $p_j$ is the positional embedding of the $j$-th sentence, and $W_c$, $W_s$, $W_r$, $W_p$ and $b$ are all learnable parameters. The expression $W_c h_j$ denotes the information content of the $j$-th sentence, $h_j^{\top} W_s d$ represents the salience of the sentence with respect to the article, $h_j^{\top} W_r \tanh(s_j)$ obtains the redundancy of the sentence with respect to the current representation of the summary, and $W_p p_j$ captures the position of the sentence with respect to the article. The summary representation $s_j$ is calculated by the unidirectional GRU using Equation (13),

$$ s_j = \mathrm{GRU}\big(P(y_{j-1} = 1 \mid h_{j-1}, s_{j-1}, d)\, h_{j-1},\ s_{j-1}\big), \tag{13}$$

where $s_0$ is a zero vector.
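A hedged sketch of the sentence scorer of Equations (12) and (13) follows; the exact form of the position feature $p_j$ and of the input fed to the summary GRU are assumptions here, since only the roles of the individual terms are described above.

```python
import torch
import torch.nn as nn

class SentenceScorer(nn.Module):
    """Scores each sentence (Eq. (12)) and updates the running summary
    representation s_j with a unidirectional GRU (Eq. (13))."""

    def __init__(self, hidden_size_2u, doc_size, pos_size):
        super().__init__()
        self.W_c = nn.Linear(hidden_size_2u, 1, bias=False)                            # content
        self.W_s = nn.Parameter(torch.randn(hidden_size_2u, doc_size) * 0.01)          # salience
        self.W_r = nn.Parameter(torch.randn(hidden_size_2u, hidden_size_2u) * 0.01)    # redundancy
        self.W_p = nn.Linear(pos_size, 1, bias=False)                                  # position
        self.bias = nn.Parameter(torch.zeros(1))
        self.summary_gru = nn.GRUCell(hidden_size_2u, hidden_size_2u)

    def forward(self, H, d, pos_emb):
        # H: (M, 2u), d: (doc_size,), pos_emb: (M, pos_size)
        s = torch.zeros(H.size(1))            # s_0 is a zero vector
        probs = []
        for j in range(H.size(0)):
            h_j = H[j]
            score = (self.W_c(h_j)                                   # content term
                     + h_j @ self.W_s @ d                            # salience term
                     - h_j @ self.W_r @ torch.tanh(s)                # redundancy term
                     + self.W_p(pos_emb[j])                          # position term
                     + self.bias)
            p_j = torch.sigmoid(score).squeeze()                     # Eq. (12)
            # Eq. (13): feed the probability-weighted state into the summary GRU
            s = self.summary_gru((p_j * h_j).unsqueeze(0), s.unsqueeze(0)).squeeze(0)
            probs.append(p_j)
        return torch.stack(probs)             # P(y_j = 1) for j = 1..M
```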
We do not follow the loss function of [21,22], where the negative log-likelihood is used; we use the cross-entropy loss as the loss function, as shown in Equation (14),

$$ L_{ext} = -\frac{1}{M}\sum_{m=1}^{M}\Big(y_m \log P(y_m = 1) + (1 - y_m)\log\big(1 - P(y_m = 1)\big)\Big), \tag{14}$$

where $y_m$ is the ground-truth label for the $m$-th sentence and $M$ is the number of sentences. When $y_m = 1$, it suggests that the $m$-th sentence should be attended to help abstractive summarization. The sentence-level attention $\beta_m$ is the attention weight normalized using softmax(), as shown in Equation (15),

$$ \beta_m = \frac{\exp\big(P(y_m = 1)\big)}{\sum_{i=1}^{M}\exp\big(P(y_i = 1)\big)}. \tag{15}$$
In the end-to-end training phase, this sentence-level attention directs the abstractor toward the sentences that matter most for the abstractive summary.
Essentially, our extraction model is a binary classifier, which classifies whether the sentences in the input text sequence are important or not.
3.4. Abstraction Model
Another part of our method is an abstraction model that rewrites the previously selected key sentences and then generates a concise and readable summary. We use the pointer-generator network proposed by [
6]. The pointer-generator network facilitates copying words from the source text via pointing [
16], which improves accuracy and the ability to process OOV (out-of-vocabulary) words, while retaining the ability to generate new words [
6]. The network contains an encoder and a decoder and can be seen as a balance between extractive and abstractive methods. Many similar studies [
6,
7,
11] show that such a model can effectively improve the performance of text summarization. More details of the network can be found in [
6].
Although we used the pointer-generator network, we made some changes to improve the performance and accuracy of the model. Compared with the vanilla network [
6], there are some differences: first, inspired by [
10], the new network introduces an updated word attention that combines sentence-level and word-level attentions in the same way as [10]; second, we replace the LSTM in the network with a GRU, since the GRU is simpler and requires fewer parameters; third, the two models receive different amounts of input data: the input of the vanilla network is truncated when the article reaches 400 tokens, which causes loss of information, whereas the input of the new network is the key sentences from the aforementioned extraction model; fourth, the word embeddings of the two models are different: word2vec is used for the vanilla pointer-generator network, while BERT is used in our abstractive network. In addition, the WordPiece tokenizer helps to process OOV words. The architecture of the updated pointer-generator network is shown in
Figure 5.
There is a lot of evidence that attention mechanisms are very important for NLP tasks (e.g., [5,19,23]). We use the sentence-level attention to modulate the word-level attention so that words in less-attended sentences are less likely to be generated [10]. We take the simple scalar multiplication of the aforementioned sentence attention $\beta_m$ from Section 3.3 and the word attention $\alpha_t^n$ of the $n$-th word (belonging to the $m$-th sentence) at decoder step $t$, and then renormalize the result into the new attention. The updated word attention $\hat{\alpha}_t^n$ is

$$ \hat{\alpha}_t^n = \frac{\alpha_t^n \, \beta_{m(n)}}{\sum_{n'} \alpha_t^{n'} \, \beta_{m(n')}}, \tag{16}$$

where $m(n)$ denotes the index of the sentence containing the $n$-th word. The final probability distribution of word $w$ is related to the updated word attention $\hat{\alpha}_t^n$ as follows,

$$ P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{n:\, w_n = w} \hat{\alpha}_t^n, \tag{17}$$

$$ h_t^{*} = \sum_{n} \hat{\alpha}_t^n h_n, \tag{18}$$

where $p_{gen}$ is the generating probability (see Equation (8) in [6]), $P_{vocab}(w)$ is the probability distribution over word $w$ being decoded, $h_t^{*}$ is the context vector, a function of the updated word attention $\hat{\alpha}_t^n$, and $h_n$ is the encoder hidden state for the $n$-th word.
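The modulation and renormalization of Equation (16) and the copy–generate mixture of Equation (17) can be sketched as below for a single decoder step; the tensor names and the scatter-based copy distribution are illustrative assumptions rather than our exact implementation.

```python
import torch

def updated_word_attention(word_attn, sent_attn, word_to_sent):
    """Eq. (16): modulate each word attention by the attention of its sentence,
    then renormalize so the weights again sum to 1.
    word_attn:    (N,) word-level attention at one decoder step
    sent_attn:    (M,) sentence-level attention beta
    word_to_sent: (N,) index of the sentence containing each source word"""
    modulated = word_attn * sent_attn[word_to_sent]
    return modulated / modulated.sum()

def final_distribution(p_gen, p_vocab, new_attn, src_ids, vocab_size):
    """Eq. (17): mix the vocabulary distribution with the copy distribution
    induced by the updated word attention."""
    p_copy = torch.zeros(vocab_size)
    p_copy.index_add_(0, src_ids, new_attn)   # scatter attention onto source word ids
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy
```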
During pre-training, the loss is the negative log-likelihood, which we minimize as

$$ L_{abs} = -\frac{1}{T}\sum_{t=1}^{T} \log P(w_t^{*}), \tag{19}$$

where $w_t^{*}$ is the target word in the reference abstractive summary at time step $t$. The coverage mechanism [6] is also used to prevent the abstractor from repeatedly putting its focus on the same point. At each decoder step $t$, the coverage vector $c_t$ is calculated as follows, which is the sum of the attention over all previous time steps,

$$ c_t = \sum_{t'=0}^{t-1} \hat{\alpha}_{t'}. \tag{20}$$

Moreover, the coverage loss $L_{cov}$ is calculated as

$$ L_{cov} = \frac{1}{T}\sum_{t=1}^{T}\sum_{n} \min\big(\hat{\alpha}_t^{n}, c_t^{n}\big). \tag{21}$$
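A small sketch of the coverage bookkeeping in Equations (20) and (21), assuming the per-step updated word attentions are stacked into one tensor:

```python
import torch

def coverage_loss(attn_steps):
    """attn_steps: (T, N) updated word attention for each decoder step.
    Returns the coverage loss of Eq. (21), with the coverage vector of
    Eq. (20) built as the running sum of all previous attentions."""
    T, N = attn_steps.shape
    coverage = torch.zeros(N)
    loss = 0.0
    for t in range(T):
        loss = loss + torch.minimum(attn_steps[t], coverage).sum()  # Eq. (21) term
        coverage = coverage + attn_steps[t]                          # Eq. (20)
    return loss / T
```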
We also apply the inconsistency loss in the same way as [10]; the inconsistency loss is calculated by Equation (22),

$$ L_{inc} = -\frac{1}{T}\sum_{t=1}^{T} \log\Big(\frac{1}{|\mathcal{K}_t|}\sum_{n \in \mathcal{K}_t} \alpha_t^{n} \, \beta_{m(n)}\Big), \tag{22}$$

where $\mathcal{K}_t$ is the set of top-$K$ attended words at decoder step $t$ and $T$ is the number of words in the summary. In conclusion, the final loss of the abstraction model is

$$ L = L_{abs} + \lambda_1 L_{cov} + \lambda_2 L_{inc}, \tag{23}$$

where $\lambda_1$ and $\lambda_2$ are hyper-parameters.
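The inconsistency loss of Equation (22) and the combined objective of Equation (23) might be computed as in the following sketch; the value of $K$ and the numerical epsilon are illustrative choices, not our actual settings.

```python
import torch

def inconsistency_loss(word_attn_steps, sent_attn, word_to_sent, k=3):
    """Eq. (22): penalize decoder steps whose most-attended words lie in
    sentences that the extractor considers unimportant.
    word_attn_steps: (T, N) word attention per decoder step
    sent_attn:       (M,) sentence-level attention beta
    word_to_sent:    (N,) sentence index of each source word"""
    T = word_attn_steps.size(0)
    loss = 0.0
    for t in range(T):
        top_vals, top_idx = word_attn_steps[t].topk(k)          # top-K attended words
        product = top_vals * sent_attn[word_to_sent[top_idx]]
        loss = loss - torch.log(product.mean() + 1e-12)
    return loss / T

def total_abstractor_loss(l_abs, l_cov, l_inc, lambda1, lambda2):
    """Eq. (23): weighted combination of the three abstractor losses."""
    return l_abs + lambda1 * l_cov + lambda2 * l_inc
```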
3.5. Training Procedure
The training process of our method is divided into two phases: (1) a pre-training phase and (2) a full training phase. Without a well-trained extractor, irrelevant sentences would often be selected, and without a well-trained abstractor, the extractor would receive a noisy reward. We therefore first pre-train the extractor by minimizing the loss in Equation (14) and the abstractor by minimizing the loss in Equation (23), respectively; then we apply standard policy gradient methods of reinforcement learning to bridge the two networks and train the whole model in an end-to-end fashion.
3.5.1. Pre-Training
The goal of the extractor is sentences with high informativity: the extracted sentences should contain as much information as possible for generating an abstractive summary. In order to train the extractive model, we need ground-truth labels for each document, but our training corpus only contains human-written abstractive summaries, so we need to convert the abstractive summaries into extractive labels. Similar to the extractive model of [22], we compute the ROUGE-L recall score [27] between each sentence and the reference abstractive summary, and measure the informativity of each sentence in the document by this score. We sort the sentences in descending order of score and select them greedily, adding one sentence at a time only if the new sentence increases the score of the set already selected. The selected sentences should be the ones that maximize the ROUGE score with respect to the gold summaries. Finally, we obtain the ground-truth labels and train our extraction model by minimizing Equation (14). We use the ROUGE scores of the selected sentences as the sentence-level attention of the corresponding sentences.
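The greedy conversion of abstractive summaries into extractive labels described above could be implemented roughly as follows; `rouge_l_recall` stands in for any ROUGE-L recall scorer (for example, one built on the `rouge-score` package), and the stopping rule is the simple "keep only if the set score improves" criterion from the text.

```python
def make_extractive_labels(sentences, reference_summary, rouge_l_recall):
    """Greedily pick sentences that keep increasing the ROUGE-L recall of the
    selected set against the reference abstractive summary; return 0/1 labels
    and per-sentence scores later used as sentence-level attention."""
    scores = [rouge_l_recall(s, reference_summary) for s in sentences]
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)

    selected, best = [], 0.0
    for i in order:
        candidate = selected + [i]
        cand_score = rouge_l_recall(" ".join(sentences[j] for j in candidate),
                                    reference_summary)
        if cand_score > best:          # keep the sentence only if the set improves
            selected, best = candidate, cand_score

    labels = [1 if i in selected else 0 for i in range(len(sentences))]
    return labels, scores
```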
When pre-training, the abstractor takes the previously extracted ground-truth sentences as input. The sentence-level attention of these input sentences is viewed as hard attention, which is involved in the calculation of the attention consistency. At the end of the pre-training stage, we obtain a well-trained extractor and a well-trained abstractor.
3.5.2. End-to-End Training
During the full training stage, we employ a hybrid extractive-abstractive architecture, using policy gradient reinforcement learning to bridge the two aforementioned pre-trained networks. We first use the extractor agent to select important sentences and then employ the abstractor to paraphrase each of these extracted sentences. In this stage, the RL training works as follows: if the extractor selects a good sentence, the ROUGE match will be high after the abstractor's paraphrase, and the action is encouraged; if it selects a bad sentence, the generated sentence will not match the reference summary after rewriting, the ROUGE score will be low, and the action is discouraged.
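As a heavily simplified illustration of this reward scheme, the sketch below treats the extractor's sentence probabilities as a policy and uses the ROUGE score of each rewritten sentence as the reward for a REINFORCE update; the baseline handling, batching, and the actual ROUGE implementation are omitted or assumed.

```python
import torch

def reinforce_step(sent_probs, chosen_idx, rewritten_sents, reference_sents,
                   rouge, optimizer, baseline=0.0):
    """One REINFORCE update for the extractor agent.
    sent_probs:      (M,) extraction probabilities P(y_j = 1) (requires grad)
    chosen_idx:      indices of the sentences the agent sampled
    rewritten_sents: the abstractor's rewrites of those sentences
    reference_sents: the reference summary sentences
    rouge:           callable returning a ROUGE score for (hypothesis, reference)"""
    rewards = torch.tensor([rouge(h, r)
                            for h, r in zip(rewritten_sents, reference_sents)])
    advantage = rewards - baseline                     # encourage good picks, discourage bad ones
    log_probs = torch.log(sent_probs[chosen_idx] + 1e-12)
    loss = -(advantage * log_probs).mean()             # policy-gradient objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()
```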