[Figure 1 (graphic): (a) a single-sentence caption versus an informative paragraph for the same image; (b) the topics and script of that paragraph, e.g., "man-walk-with-dog", "man and dog-on-sand", "rock and tree-near-sand"; (c) a paragraph generated by a flat RNN; (d) a paragraph generated by an HRNN without hierarchical constraint; (e) a paragraph generated by an HRNN with HSGED.]
Figure 1: Illustrations of our motivation. (a) Comparisons between single sentence captioning and the informative paragraph.
(b) Topics and script of the paragraph in (a). (c) A paragraph generated by flat RNN. (d) The paragraph generation process of
HRNN without hierarchical constraint where the top and bottom parts denote the higher and lower RNNs, respectively. (e)
The paragraph generation process of our HSGED, where the grey color means this part will not be attended.
The crux of these problems is to find a "script", as in Figure 1 (b), which connects the topics to provide hierarchical constraints for the HRNN. Recent studies show that Scene Graphs [37], which connect local objects into a global graph in terms of their pairwise relations, can serve as such a script and provide hierarchical knowledge for solving complex tasks, e.g., image retrieval [11], image generation [9], image captioning [40], and visual reasoning [29]. Since the neighboring sub-graphs of a scene graph share some concepts while each sub-graph remains distinctive, if we form the topic flows from a scene graph and align each topic to a sub-graph, the generated paragraph will naturally be coherent and distinctive.

Motivated by the above conjecture, we propose the Hierarchical Scene Graph Encoder-Decoder (HSGED) to exploit the scene graph as the topic script and transfer its hierarchical topological knowledge into the text domain for better paragraphs. Specifically, the high-level RNN can follow the scene graph to generate the topics, each of which is represented by a local compact sub-graph. Compared with HRNNs, which generate sub-set topics without any global constraints, the topic flows in our HSGED are naturally coherent and distinctive; e.g., the topic flows in Figure 1 (e) are "Human-With-Dog", "Dog-On-Sand", and "Hill-Near-Sand", showing that neighboring topics are not only closely related but also distinctive. Each sub-graph topic also restricts its sentence to a corresponding region; e.g., the second topic in Figure 1 (e) is about "Human", "Dog", and "Sand", which is more likely to be treated as a compact part by humans than "Human" and "Tree" as in (d).

Since the paragraph dataset is limited in size, and for fair comparisons, we follow the previous studies [5, 13, 17] and construct the decoder of our HSGED from two RNNs: the sentence scene graph RNN (SSG-RNN) and the word scene graph RNN (WSG-RNN). Specifically, the scene graph is transformed into a set of sub-graph level embeddings by a graph neural network [39]. When generating a new sentence, SSG-RNN adaptively attends to a few sub-graph embeddings based on the context knowledge to form the new topic (see Section 3.2.2). We also design an irredundant attention strategy to encourage new topics to be formed from the undescribed sub-graphs (see Section 3.2.3); e.g., as shown in Figure 1 (e), the new topic comes from the colorful undescribed regions. Given the generated sub-graph topic, WSG-RNN composes a new sentence by focusing on the region constrained by the selected compact sub-graphs, which is achieved by a novel inheriting attention strategy (see Section 3.2.5). Furthermore, we design an efficient sentence-level loss to further encourage the topics to follow the scene graph in a human-like order. Extensive experiments on the Stanford image paragraph dataset [13] show that our HSGED generates more coherent and distinctive paragraphs. In particular, we achieve a new state-of-the-art CIDEr-D score of 36.02 [31], an absolute boost of 5.24 points over a strong baseline (see Section 4).

2 RELATED WORK
Single Sentence Captioning and Dense Captioning. Single sentence captioning has been exhaustively studied recently due to its wide practical utility. Many advanced techniques have been proposed to improve its performance, e.g., the encoder-decoder
pipeline, attention mechanisms, reinforcement-learning-based rewards [28], and the exploration of high-level semantic knowledge [21, 40–42]. Though these captioning systems can now accurately summarize one image, the generated sentence is usually too short to detail the rich semantic contents of an image. Researchers therefore propose Dense Captioning to generate more descriptions for all the detected salient regions [10, 43]. However, since the detected regions are usually heavily overlapped and disordered, the corresponding sentences are redundant and incoherent. As a result, their usability is damaged [13].

Figure 2: The illustrations of three sub-graphs around different objects (the sub-graphs around "Dog", "Bike", and "Wall"). The dashed lines connect the shared nodes, which facilitate the system to generate coherent sub-graph topics.
Image Paragraph Captioning. Image paragraph captioning addresses the shortcomings of both single sentence captioning and dense captioning by generating coherent and distinctive paragraphs [13]. Since each sentence of a paragraph is controlled by a topic, researchers propose Hierarchical RNNs (HRNNs) [5, 13, 17, 34, 46], in which higher-level and lower-level RNNs respectively abstract topics and generate sentences based on the abstracted topics. Researchers also propose advanced techniques to refine the prototypical HRNN, e.g., generative models like GAN [17] or VAE [5] for stronger consistency, and a trigram-repetition-penalty-based sampling method for diversity [23]. Besides, dense sentence-level rewards [36] and curiosity-driven reinforcement learning [22] are used for more robust training, all of which could also be applied in our proposed framework, HSGED. However, most of these methods are built without enough hierarchical constraints, so the quality of the generated paragraphs is unsatisfactory. In contrast, our HSGED exploits the scene graph as the script to transfer its hierarchical and semantic knowledge from the vision domain to the text domain for more coherent and distinctive paragraphs.

Exploitation of Scene Graphs. The scene graph is formed by connecting discrete objects with their attributes and with other objects through pairwise relationships, so it contains rich semantic and topological knowledge [37]. Observing such advantages, researchers exploit Graph Neural Networks (GNNs) [4, 16, 32, 39] to embed scene graphs in various computer vision tasks, e.g., image retrieval [11], image generation [9], image captioning [40], and visual reasoning [29]. In this paper, we use an advanced GNN [39] to compute the sub-graph embeddings that facilitate paragraph generation. Importantly, compared with SGAE [40], which exploits the explicit semantic knowledge of a scene graph for captioning, we also exploit the implicit topological knowledge. Different from the visual-relationship method [6], which directly fuses the object and relation features into a flat RNN without any topic guidance, we treat the scene graph as a script to facilitate the topic guidance and to regularize the training. Hence, our framework generates more coherent and distinctive paragraphs.

3 HIERARCHICAL SCENE GRAPH ENCODER-DECODER
Our Hierarchical Scene Graph Encoder-Decoder (HSGED) belongs to the encoder-decoder framework [13, 33, 38]. The hierarchical encoder transforms the scene graph into node-level and sub-graph level embeddings (see Section 3.1.1), which are respectively input into the Sentence Scene Graph RNN (SSG-RNN) to abstract the sub-graph topics and the Word Scene Graph RNN (WSG-RNN) to generate the corresponding sentences (see Section 3.2). Specifically, we design two attention strategies: one that removes redundancy from the SSG-RNN attention for more distinctive sub-graph topics (see Section 3.2.3), and one that enhances the WSG-RNN attention for more grounded sentences based on the sub-graph topics (see Section 3.2.5).

3.1 Hierarchical Scene Graph Encoder
3.1.1 Scene Graphs. The scene graph is constructed by using directed edges to connect three different kinds of nodes: the object node $o_i$, denoting the $i$-th object; the attribute node $a_{il}$, denoting the $l$-th attribute of $o_i$; and the relation node $r_{ij}$, denoting the pairwise relation between $o_i$ and $o_j$. We assign directed edges from object $o_i$ to all of its attributes $a_{il}$, from object $o_i$ to relation $r_{ij}$, and from relation $r_{ij}$ to object $o_j$ to form the scene graph. Figure 2 demonstrates one scene graph, which contains four object nodes, three relation nodes, and two attribute nodes. In this way, the scene graph contains rich semantic knowledge brought by the semantic labels, e.g., "Brown" and "Near", and topological knowledge brought by the connectivity of the graph, e.g., "Dog → Near → Bike" and "Dog → On → Road".

3.1.2 Node Level Embeddings: $X$. For the three types of nodes, we use different representations as their node-level embeddings: the linear transformation of the visual feature $x_o$ (the RoI feature from Faster-RCNN passed through an FC layer) for an object node, and learnable label embeddings $x_a$, $x_r$ of the attribute and relation labels for attribute and relation nodes, respectively. We use visual features as $x_o$ since they contain more visual clues and empirically achieve better performance than label embeddings. All these representations are grouped into one node-level embedding set $X$:

$X = \{x_a, x_r, x_o\}$,   (1)

where each embedding corresponds to one node; e.g., the scene graph in Figure 2 contains 9 embeddings (4, 2, and 3 for object, attribute, and relation nodes). This node-level embedding set will be input into WSG-RNN for generating sentences (see Section 3.2).

3.1.3 Sub-graph Level Embeddings: $U$. Since the sentences of a high-quality paragraph should describe distinctive aspects of an object, e.g., its relationships with other objects, we adopt sub-graph level embeddings, which facilitate our system in achieving this goal. Specifically, we define the sub-graph around the object $o_i$ as a graph connecting the following nodes: $o_i$, which is this object itself; $a_{il}$, which are this object's attributes; $r_{ij}$ and $r_{ki}$, which are the potential relationships between this object and the other objects; and $o_j$ and $o_k$, which are the objects that have potential relationships with $o_i$. For example, the sub-graph around the object "Dog" in Figure 2 connects the following nodes: "Dog", "Brown", "Near", "Bike", "On", and "Road".
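To make the encoder concrete, the following is a minimal NumPy sketch of how the node-level set $X$ (Eq. (1)) and the sub-graph groupings could be assembled. The toy node names, the `W_fc` projection, and the mean-pooling read-out are illustrative assumptions only: the paper's actual sub-graph level embeddings $U$ (Eq. (5)) are produced by the GNN encoder of [39], which this sketch does not reproduce.

```python
import numpy as np

# Toy scene graph loosely mirroring Figure 2; all names, relations, and sizes
# below are illustrative assumptions, not the paper's exact setup.
D = 8  # embedding size (the paper uses 1,000)

objects = ["Dog", "Bike", "Road", "Wall"]
attributes = {"Dog": ["Brown"], "Wall": ["Front"]}            # a_{il}
relations = [("Dog", "Near", "Bike"), ("Dog", "On", "Road"),  # (o_i, r_ij, o_j)
             ("Bike", "Near", "Wall")]

rng = np.random.default_rng(0)
roi_features = {o: rng.normal(size=2 * D) for o in objects}   # stand-in for Faster-RCNN RoI features
W_fc = rng.normal(size=(D, 2 * D)) * 0.1                      # FC layer projecting RoI features
label_emb = {lbl: rng.normal(size=D) for lbl in
             [a for al in attributes.values() for a in al] + [r for _, r, _ in relations]}

# Node-level embedding set X = {x_a, x_r, x_o} (Eq. (1)).
X = {}
for o in objects:
    X[("obj", o)] = W_fc @ roi_features[o]                    # x_o: projected visual feature
for o, attrs in attributes.items():
    for a in attrs:
        X[("attr", o, a)] = label_emb[a]                      # x_a: learnable label embedding
for s, r, t in relations:
    X[("rel", s, r, t)] = label_emb[r]                        # x_r: learnable label embedding

def subgraph_nodes(o_i):
    """Nodes of the sub-graph around object o_i: the object, its attributes,
    its relations, and the objects it is related to (Section 3.1.3)."""
    nodes = [("obj", o_i)] + [("attr", o_i, a) for a in attributes.get(o_i, [])]
    for s, r, t in relations:
        if o_i in (s, t):
            nodes += [("rel", s, r, t), ("obj", t if o_i == s else s)]
    return nodes

# Sub-graph level embeddings U: mean-pooled node embeddings, used here only as a
# placeholder for the GNN read-out of [39] (Eqs. (2)-(5) in the paper).
U = {o: np.mean([X[n] for n in subgraph_nodes(o)], axis=0) for o in objects}
print(sorted(subgraph_nodes("Dog")))
```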
Both SSG-RNN and WSG-RNN are built upon the top-down attention network [2]. Here we first briefly revisit this architecture and then detail how to revise it to get our SSG-RNN (see Section 3.2.2) and WSG-RNN (see Section 3.2.4).

Basically, a top-down attention network contains two LSTM layers and one attention sub-network, as shown in Figure 4 (a). Given the input vector $z_t$ at time step $t$, it can be formalized as:

Context vector: $h_1^t = \mathrm{LSTM}_1(z_t;\, h_1^{t-1})$,
Attention: $\hat{v}^t = \mathrm{ATT}(V,\, h_1^t)$,   (6)
Output: $h_2^t = \mathrm{LSTM}_2(\mathrm{Concat}(h_1^t, \hat{v}^t);\, h_2^{t-1})$,
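As a reference point, here is a minimal NumPy sketch of one step of this two-layer top-down attention cell (Eq. (6)). The gate-free `lstm_step` and the additive scoring inside `att` are simplified stand-ins (the exact ATT sub-network of Eq. (7), with parameters $W_v$ and $W_h$, is assumed rather than reproduced), so treat this as illustrative rather than as the paper's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def lstm_step(x, h, W):
    """Toy stand-in for an LSTM step: a single tanh recurrence (no gates, no cell state)."""
    return np.tanh(W @ np.concatenate([x, h]))

def att(V, h1, W_v, W_h, w):
    """Additive attention over feature set V given query h1 (assumed form of Eq. (7))."""
    scores = np.array([w @ np.tanh(W_v @ v + W_h @ h1) for v in V])
    alpha = softmax(scores)                     # attention weights
    return alpha @ np.stack(V), alpha           # attended vector v_hat and weights

def top_down_step(z_t, h1_prev, h2_prev, V, params):
    """One step of the top-down attention network (Eq. (6))."""
    h1 = lstm_step(z_t, h1_prev, params["W1"])                           # context vector h_1^t
    v_hat, alpha = att(V, h1, params["W_v"], params["W_h"], params["w"])
    h2 = lstm_step(np.concatenate([h1, v_hat]), h2_prev, params["W2"])   # output h_2^t
    return h1, h2, v_hat, alpha

# Tiny smoke test with random parameters.
rng = np.random.default_rng(0)
d, k = 6, 4                                     # hidden size, attention size (illustrative)
V = [rng.normal(size=d) for _ in range(5)]      # feature set (U for SSG-RNN, X for WSG-RNN)
params = {"W1": rng.normal(size=(d, 2 * d)) * 0.1,
          "W2": rng.normal(size=(d, 3 * d)) * 0.1,
          "W_v": rng.normal(size=(k, d)) * 0.1,
          "W_h": rng.normal(size=(k, d)) * 0.1,
          "w": rng.normal(size=k) * 0.1}
h1, h2, v_hat, alpha = top_down_step(rng.normal(size=d), np.zeros(d), np.zeros(d), V, params)
print(alpha.round(3), h2.shape)
```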
Figure 4: In HSGED, SSG-RNN and WSG-RNN are both built upon the top-down attention network, with different attention mechanisms and inputs: the sub-graph level and node level embedding sets $U$ and $X$, respectively. (Panel (a) shows the top-down attention network; panel (b) shows HSGED.)

Since neighboring sub-graphs share nodes, the knowledge attended for one topic is partially carried over to the next, facilitating the system to choose the neighboring sub-graph according to the last attended sub-graphs. For example, in Figure 1 (e), the second sub-graph is about "Human", "Dog", and "Sand". SSG-RNN exploits the node "Sand" as the clue to generate the next sub-graph topic about "Sand", "Tree", and "Hill".

3.2.3 Irredundant Attention. To further encourage each topic vector to focus on rarely described sub-graphs, we substitute the attention sub-network in Eq. (7) with the following irredundant attention strategy [25]. When SSG-RNN generates the $n$-th topic vector $c_n$, it does not directly input the computed attention weights $a$ into the softmax layer (Eq. (7)) for computing the attended vectors. Instead, it distracts the current attention from the previously frequently attended sub-graph level embeddings to get the irredundant attention weights $\beta$:

$b_i^n = \begin{cases} \exp\big(a_i^{\tau_{n-1}}\big) & \text{if } n = 1, \\ \exp\big(a_i^{\tau_{n-1}}\big) \big/ \sum_{m=1}^{n-1} \exp\big(a_i^{\tau_m}\big) & \text{otherwise,} \end{cases}$   (10)

$\beta^n = \mathrm{softmax}(b^n)$,   (11)

where $\tau_m$ is the time step of the full stop symbol of sentence $s_m$. Then we compute the irredundant attention vector $\hat{u}^n$:

$\hat{u}^n = U \beta^n$,   (12)

which will be used to generate the distinctive topic vector $c_n$ as in Section 3.2.2.

If the $i$-th sub-graph level embedding $u_i$ is frequently attended while generating the previous $n-1$ topic vectors, then $\sum_{m=1}^{n-1} \exp(a_i^{\tau_m})$ will be large and $b_i^n$ will be small. So the current topic vector will be less likely to focus on the $i$-th sub-graph, and thus the generated sentences will be less repetitive.

It is noteworthy that this irredundant attention does not prevent our model from attending to the salient objects multiple times when needed, e.g., as in Figure 1 (a), the concept "Human" is described twice in the first two sentences of the ground-truth paragraph. Our model can achieve such multi-visits of the same salient object $o_i$ since its knowledge will not only be incorporated into $u_{r_i}$, but also into $u_{r_k}$ if $o_i$ and $o_k$ own potential relations (Eq. (4)). As a result, the salient object $o_i$ will also be mentioned when the sub-graph of $o_k$ is chosen as the topic, e.g., as in Figure 1 (e), the concept "Human" is described in the first two sentences of our generated paragraph.

3.2.4 WSG-RNN. Given a topic vector $c_n$, WSG-RNN works to complete the corresponding sentence $s_n$. Here the input vector $z_t$ in Eq. (6) is:

$z_t = \mathrm{Concat}(c_n,\; W_\Sigma w_{t-1},\; \bar{x},\; h_2^{t-1})$,   (13)

where $c_n$ is the topic vector of the $n$-th sentence, which regularizes the word generation; $\bar{x}$ is the mean of the node-level embeddings (Eq. (1)); and $h_2^{t-1}$ is the hidden state of the second LSTM layer in WSG-RNN. In WSG-RNN, the feature set $V$ in Eq. (7) is set to the node-level embedding set $X$. In this way, WSG-RNN attends to the nodes and generates the words based on them. After computing the attended node-level embedding $\hat{x}$ by the inheriting attention network (Eq. (16)), the second LSTM layer outputs its hidden state $h_2^t$ in Eq. (6) to predict the word distribution at time step $t$:

$P(w_t \mid w_{1:t-1}) = \mathrm{softmax}(\mathrm{FC}(h_2^t))$.   (14)

3.2.5 Inheriting Attention. To generate sentences that correspond more closely to the selected sub-graphs, inheriting attention is applied in WSG-RNN to constrain its attention to the nodes in the selected sub-graphs. To achieve this, the node-level attention weights $\gamma$ are first inherited from the sub-graph level attention weights $\beta$ (Eq. (12)) as follows:

Attribute node: $\gamma_{a_{il}} = \beta_i$,
Object node: $\gamma_{o_i} = \mathrm{Mean}_{j \in \mathrm{Cover}(o_i)}(\beta_j)$,   (15)
Relation node: $\gamma_{r_{ij}} = \mathrm{Mean}_{j \in \mathrm{Cover}(r_{ij})}(\beta_j)$,

where the attribute node attention $\gamma_{a_{il}}$ is directly inherited from the attention weight $\beta_i$ of the sub-graph around $o_i$. Since the object and relation nodes are covered by many sub-graphs, we average the attention weights of these sub-graphs as the inherited attention weights of both object and relation nodes, where $j \in \mathrm{Cover}(o_i)$ in Eq. (15) means that the object $o_i$ is covered by the sub-graph around the object $o_j$. For example, as in Figure 2, "Bike" is covered by the sub-graphs of "Dog", "Bike", and "Wall", so we average the attention weights of these three sub-graphs as the inherited attention weight of the node $o_{Bike}$.

After inheriting $\gamma$ from $\beta$, we can compute the attended node-level embedding $\hat{x}$ as:

$\hat{x} = X\, \mathrm{Softmax}(\gamma \mathbin{.\!*} \alpha)$,   (16)

where $.\!*$ denotes the element-wise product and $\alpha$ is computed from the WSG-RNN attention network in Eq. (7).

If a sub-graph is not selected by SSG-RNN for generating the current topic vector, the attention weight of the corresponding sub-graph level embedding is small. By inheriting attention, the attention weights of the nodes in this sub-graph will also be small, so these nodes are less likely to be selected to complete the current sentence. In this way, compared with HRNN methods without any hierarchical constraints, our WSG-RNN generates more grounded sentences.
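Below is a minimal NumPy sketch of the irredundant attention (Eqs. (10)-(12)). The recorded score history `a_prev_stops` and the toy dimensions are illustrative assumptions; in the model these scores come from the SSG-RNN attention sub-network at the full-stop step of each previous sentence.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def irredundant_attention(a_curr, a_prev_stops, U):
    """Irredundant attention over sub-graph embeddings U (Eqs. (10)-(12)).

    a_curr:       raw attention scores a_i^{tau_{n-1}} for the current topic, shape [K].
    a_prev_stops: score vectors recorded at the full-stop steps of the previous
                  n-1 sentences (each of shape [K]); empty when n = 1.
    U:            sub-graph level embeddings, shape [K, D].
    """
    if len(a_prev_stops) == 0:                       # n = 1: no history to discount
        b = np.exp(a_curr)                           # Eq. (10), first branch
    else:
        history = np.sum([np.exp(a) for a in a_prev_stops], axis=0)
        b = np.exp(a_curr) / history                 # Eq. (10), second branch
    beta = softmax(b)                                # Eq. (11)
    u_hat = beta @ U                                 # Eq. (12): irredundant attention vector
    return u_hat, beta

# Example: sub-graph 0 was attended strongly for earlier topics, so its weight drops.
rng = np.random.default_rng(0)
K, D = 4, 6
U = rng.normal(size=(K, D))
a_curr = np.array([2.0, 1.0, 0.5, 0.2])
a_prev_stops = [np.array([3.0, 0.1, 0.1, 0.1]), np.array([2.5, 0.2, 0.1, 0.1])]
_, beta_plain = irredundant_attention(a_curr, [], U)
_, beta_irred = irredundant_attention(a_curr, a_prev_stops, U)
print(beta_plain.round(3), beta_irred.round(3))   # the weight on sub-graph 0 shrinks
```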
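In the same spirit, here is a minimal NumPy sketch of the inheriting attention inside one WSG-RNN word step (Eqs. (14)-(16)). The Cover sets, the toy attention scores, and the `fc_vocab` projection are illustrative assumptions, and the word distribution is scored directly from a given hidden state rather than from the full two-layer cell.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def inherit_node_weights(beta, node_kinds, cover):
    """Inherit node-level weights gamma from sub-graph weights beta (Eq. (15)).

    beta:       sub-graph attention weights, indexed by object name (dict).
    node_kinds: list of ("attr", owner) / ("obj", name) / ("rel", name) descriptors.
    cover:      for each object or relation node, the objects whose sub-graphs cover it.
    """
    gamma = []
    for kind, name in node_kinds:
        if kind == "attr":                                   # attribute: copy beta of its object
            gamma.append(beta[name])
        else:                                                # object / relation: average over covers
            gamma.append(np.mean([beta[o] for o in cover[name]]))
    return np.array(gamma)

def wsg_word_step(gamma, alpha, X, h2, fc_vocab):
    """Inheriting attention (Eq. (16)) and word distribution (Eq. (14))."""
    weights = softmax(gamma * alpha)       # gamma .* alpha, then Softmax
    x_hat = weights @ X                    # attended node-level embedding
    # In the model, x_hat feeds the second LSTM layer; here we score the vocabulary from h2.
    return softmax(fc_vocab @ h2), x_hat

# Toy example in the spirit of Figure 2: "Bike" is covered by the Dog, Bike, and Wall sub-graphs.
beta = {"Dog": 0.7, "Bike": 0.2, "Wall": 0.1}
node_kinds = [("obj", "Dog"), ("attr", "Dog"), ("rel", "Near"), ("obj", "Bike")]
cover = {"Dog": ["Dog", "Bike"], "Near": ["Dog", "Bike"], "Bike": ["Dog", "Bike", "Wall"]}
gamma = inherit_node_weights(beta, node_kinds, cover)

rng = np.random.default_rng(0)
X = rng.normal(size=(len(node_kinds), 6))           # node-level embeddings
alpha = softmax(rng.normal(size=len(node_kinds)))   # weights from the WSG-RNN attention (Eq. (7))
p_word, x_hat = wsg_word_step(gamma, alpha, X, h2=rng.normal(size=6),
                              fc_vocab=rng.normal(size=(10, 6)) * 0.1)
print(gamma.round(3), p_word.argmax())
```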
3.3 Training Objectives
Given a ground-truth paragraph $P^* = \{w^*_{1:T}\}$, we can train our HSGED end-to-end by maximizing the likelihood of the ground-truth paragraph $P^*$, as in Regions-Hierarchical [13]. For convenience, we denote the predicted word distribution as $P(w_t)$ (Eq. (14)). Maximizing the likelihood is equivalent to minimizing the cross-entropy loss:

$L_{word} = -\sum_{t=1}^{T} \log P(w_t^*)$.   (17)

This is a word-level loss since it directly encourages the generated words to be the same as the ground-truth words.

Though this word-level loss is simple and efficient, it causes the exposure bias problem [26], which damages the performance due to the mismatch between training and test. To alleviate this mismatch, a reinforcement learning (RL) based reward [23, 28] can be used to train the paragraph generator:

$L_{para} = -\mathbb{E}_{w_t^s \sim P(w)}\big[r(P^s; P^*)\big]$,   (18)

where $r$ is the CIDEr-D [31] metric between the paragraph $P^s = \{w^s_{1:T}\}$ sampled from Eq. (14) and the ground-truth paragraph $P^*$. This is a paragraph-level loss since it encourages the whole generated paragraph to be similar to the ground-truth paragraph.

Though this paragraph-level loss improves the quality of the generated paragraphs, it still has two shortcomings. First, it neglects the sequence of the sentences, while a coherent paragraph requires the generated sentences to be listed in a proper sequence. Second, it does not make sufficient use of the training paragraphs, since it only computes one reward for the whole paragraph.

Therefore, we propose a sentence-level loss, which is the sum of the CIDEr-D scores between each sampled sentence $s_n^s$ and the corresponding ground-truth sentence $s_n^*$:

$L_{sen} = -\mathbb{E}_{w_t^s \sim P(w)}\big[\sum_{n=1}^{N} r(s_n^s; s_n^*)\big]$.   (19)

By enforcing each sampled $s_n^s$ to be similar to the ground-truth $s_n^*$, the narrative logic is encouraged to follow the human-like sequence, which is also a kind of hierarchical constraint. Also, the training paragraphs are utilized more sufficiently because each sentence provides a direct supervision.

In the experiments, to utilize the advantages of the word-level, paragraph-level, and sentence-level losses, we combine them as:

$L_{comb} = w_{word} L_{word} + w_{sen} L_{sen} + w_{para} L_{para}$,   (20)

where $w_{word}$, $w_{sen}$, and $w_{para}$ are the weights of $L_{word}$, $L_{sen}$, and $L_{para}$, respectively.
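To make the interplay of the three losses concrete, here is a minimal NumPy sketch. The toy `cider_d` stand-in (a unigram-overlap score), the single sampled paragraph, and the default weights are illustrative assumptions; a real implementation would use the actual CIDEr-D metric, Monte-Carlo samples from Eq. (14), and a self-critical-style baseline [28].

```python
import numpy as np

def cider_d(candidate, reference):
    """Toy stand-in for CIDEr-D: unigram-overlap ratio (the real metric is tf-idf based)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return len(set(cand) & set(ref)) / max(len(set(ref)), 1)

def word_loss(log_probs_of_gt):
    """Eq. (17): negative log-likelihood of the ground-truth words."""
    return -float(np.sum(log_probs_of_gt))

def paragraph_loss(sampled_paragraph, gt_paragraph):
    """Eq. (18) with a single sample: negative paragraph-level reward."""
    return -cider_d(sampled_paragraph, gt_paragraph)

def sentence_loss(sampled_sentences, gt_sentences):
    """Eq. (19) with a single sample: negative sum of per-sentence rewards."""
    return -sum(cider_d(s, g) for s, g in zip(sampled_sentences, gt_sentences))

def combined_loss(log_probs_of_gt, sampled_sentences, gt_sentences,
                  w_word=0.0, w_sen=0.1, w_para=1.0):
    """Eq. (20): weighted combination. The paper reports w_sen = 0.1 and w_para = 1
    for the RL phase; the w_word default here is an assumption."""
    L_word = word_loss(log_probs_of_gt)
    L_sen = sentence_loss(sampled_sentences, gt_sentences)
    L_para = paragraph_loss(" ".join(sampled_sentences), " ".join(gt_sentences))
    return w_word * L_word + w_sen * L_sen + w_para * L_para

gt = ["a man is walking with dogs", "they are walking in the sand"]
sampled = ["a man walks with his dogs", "there are trees near the sand"]
log_probs = np.log(np.full(12, 0.2))      # pretend per-word probabilities of the ground truth
print(round(combined_loss(log_probs, sampled, gt), 3))
```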
4 EXPERIMENTS
4.1 Datasets, Settings, and Metrics
4.1.1 Datasets. Stanford Image Paragraph Dataset [13] is a mainstream large-scale dataset for image paragraph generation. It contains 19,551 images, and each image is paired with one paragraph. The whole dataset contains 14,575/2,487/2,489 image-paragraph pairs for training/validation/test, respectively. On average, each paragraph contains 5.7 sentences and each sentence contains 11.9 words.
Visual Genome [14] (VG). We use the annotations of this dataset, which include object categories, attributes, and pairwise relations, to train our scene graph generator. Specifically, after removing the objects, attributes, and relations that appear less than 2,000 times in the training set, we use the remaining 305 objects, 103 attributes, and 64 relations to train our object detector, attribute classifier, and relation classifier. It is noteworthy that, for fair testing, we filter out the images of VG that also exist in the test set of the Stanford image paragraph dataset.
4.1.2 Settings. The dimensions of $x_o$/$x_a$/$x_r$/$u_o$/$u_a$/$u_r$ in Sections 3.1.2 and 3.1.3 are all set to 1,000. The dimensions of the hidden states ($h_*$) of all the LSTM layers are set to 1,000. The sizes of $W_v$/$W_h$ (Eq. (7)) are set to 3,000 × 512 / 1,000 × 512 in SSG-RNN and 1,000 × 512 / 1,000 × 512 in WSG-RNN. The size of $W_\Sigma$ in Eq. (8) is 1,000 × 4,962. To parse the scene graph, we use Faster-RCNN as the object detector [27] and MOTIFS as the relation classifier [45]. Our attribute classifier is an FC-ReLU-FC-Sigmoid network that performs multi-label classification of the attributes.
When training the whole pipeline, we use the word-level loss (Eq. (17)) in the first 30 epochs and then the combination loss (Eq. (20)), where $w_{sen}$ is 0.1 and $w_{para}$ is 1, in the next 70 epochs. When the word-level/combination losses are used, the learning rates are initialized as 3e-4/2e-5 and decayed by 0.85 every 5 epochs. We set the batch size to 100 and use the Adam optimizer [12]. At the inference stage, we use both beam search and the trigram repetition penalty [23] to sample paragraphs.
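For reference, the hyper-parameters reported above can be collected into a single configuration; the dictionary keys below are our own illustrative names, while the values are the ones quoted in Section 4.1.2.

```python
# Hyper-parameters quoted in Section 4.1.2; the key names are illustrative, not the authors'.
HSGED_CONFIG = {
    "embedding_dim": 1000,           # x_o / x_a / x_r / u_o / u_a / u_r
    "lstm_hidden_dim": 1000,         # all LSTM hidden states h_*
    "ssg_attention": {"W_v": (3000, 512), "W_h": (1000, 512)},   # Eq. (7) in SSG-RNN
    "wsg_attention": {"W_v": (1000, 512), "W_h": (1000, 512)},   # Eq. (7) in WSG-RNN
    "word_embedding": (1000, 4962),  # W_Sigma in Eq. (8)
    "warmup": {"loss": "word-level (Eq. 17)", "epochs": 30, "lr": 3e-4},
    "rl_phase": {"loss": "combination (Eq. 20)", "epochs": 70, "lr": 2e-5,
                 "w_sen": 0.1, "w_para": 1.0},
    "lr_decay": {"factor": 0.85, "every_epochs": 5},
    "batch_size": 100,
    "optimizer": "Adam",
    "scene_graph_parser": {"detector": "Faster-RCNN", "relations": "MOTIFS",
                           "attributes": "FC-ReLU-FC-Sigmoid"},
    "inference": ["beam search", "trigram repetition penalty"],
}
```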
4.1.3 Metrics. Following the previous methods [5, 13, 23], we use three standard metrics, CIDEr-D [31], BLEU [24], and METEOR [3], to measure the similarities between the generated paragraphs and the ground-truth paragraphs. Because the ground-truth paragraphs are created by humans and are naturally coherent, the generated paragraphs are likely to be more coherent if the similarity scores are higher. Besides, we measure distinctiveness from two aspects: diversity and fine-grain degree. Diversity is measured by a metric [35] derived from the kernelized latent semantic analysis of CIDEr-D. Fine-grain degree is measured by part-of-speech statistics. Moreover, we conduct a human evaluation on the coherence, diversity, and fine-grain degree of the paragraphs generated by different methods.

4.2 Ablative Studies
We carry out extensive ablative studies by gradually constructing our model from the original Up-Down model [2, 23] so as to evaluate the effectiveness of each individual component. In particular, we construct the following baselines.

4.2.1 Comparing Methods. Base: We treat each paragraph as one long sentence and directly use the Up-Down model (Eq. (6)) as a flat RNN to generate it. This baseline is the benchmark for the other ablative baselines. GNN: We use the Graph Neural Network (GNN) to compute the sub-graph level embedding set $U$ (Eq. (5)) and input it into the decoder of a flat RNN, as in Base, for generating paragraphs. HRNN: We use hierarchical RNNs to generate paragraphs, while we only input the node level embedding set $X$ (Eq. (1)) into the two RNNs, without the irredundant and inheriting attentions. HSGED-IRA-INA: We input the node/sub-graph level embedding sets $X$/$U$ into SSG/WSG-RNNs, respectively, while we do not deploy the irredundant and inheriting attentions. HSGED-INA: Compared with HSGED-IRA-INA, we apply the irredundant attention and still do not use the inheriting attention. HSGED: We use the integral architecture sketched in Figure 4 (b). SP: We replace the original semantic scene graph with a spatial scene graph, where each box is connected to its nearest 5 boxes, with their relative spatial positions as labels. SLL: We use the sentence-level loss defined in Eq. (19) to train some of the baselines.
Figure 5: Two qualitative examples. The bottom left and bottom right parts show the sub-graph topics and the generated
paragraph, respectively. The colors highlight the alignments between the sentences and the sub-graph topics.
Table 1: The performances of various baselines on the Stanford image paragraph dataset. The metrics B@N, M, and C denote BLEU@N, METEOR, and CIDEr-D, respectively.

Models            B@1    B@2    B@3    B@4    M      C
Base              43.42  27.56  17.41  10.39  17.22  30.78
Base(SLL)         43.63  27.74  17.58  10.44  17.42  31.90
GNN               43.77  27.83  17.66  10.65  17.51  32.48
GNN(SLL)          44.02  27.98  17.87  10.90  17.78  33.57
HRNN              43.54  27.65  17.53  10.42  17.44  31.36
HSGED-IRA-INA     44.17  28.36  17.85  11.04  17.89  33.67
HSGED-INA         44.21  28.42  18.04  11.09  18.06  34.42
HSGED             44.33  28.47  18.09  11.10  18.11  35.13
HSGED(SP, SLL)    44.20  28.39  18.02  11.07  18.02  34.15
HSGED(SLL)        44.51  28.69  18.28  11.26  18.33  36.02

Table 2: The diversity scores (the larger, the more diverse) of various baselines. GT denotes ground truth.

            Base   HRNN   HSGED-IRA-INA  HSGED-INA  HSGED  HSGED(SLL)  GT
Diversity   0.786  0.791  0.810          0.823      0.836  0.840       0.847

4.2.2 Results on Similarity. Table 1 shows the performances of the different ablative baselines. Here we use the trigram repetition penalty [23] to sample paragraphs. Compared with Base, our integral model, HSGED(SLL), boosts the CIDEr-D by 5.24. By comparing Base(SLL) and GNN with Base, we observe that the performances are boosted, which confirms the utility of the sentence-level loss and the sub-graph level embeddings. More importantly, by comparing HRNN, HSGED-IRA-INA, HSGED-INA, and HSGED in turn, we observe uninterrupted improvements, which substantially validate the superiority of our proposed SSG-RNN and WSG-RNN, especially the irredundant and inheriting attentions. Another interesting observation is that the improvement of HRNN over Base is marginal (0.58 CIDEr-D), while the improvement of our HSGED-IRA-INA is more obvious (2.89 CIDEr-D), which proves that only applying a naive HRNN structure without any hierarchical constraint is not enough for capturing the hierarchical knowledge in both the image and the paragraph, and makes the HRNN degrade to a flat RNN. Figure 5 shows two examples of the paragraphs generated by HSGED(SLL) and HRNN, where HSGED's paragraphs are clearly more coherent and distinctive.

4.2.3 Results on Diversity. Table 2 shows the diversity scores [35] of the paragraphs generated by different models. We also compute the diversity score of the ground-truth paragraphs of the test set, which is denoted as GT. Comparing the results of HSGED-IRA-INA, HSGED-INA, HSGED, and HSGED(SLL) in Table 2, we can see that the diversity scores improve step by step, which demonstrates the effectiveness of the irredundant attention, the inheriting attention, and the sentence-level loss in improving paragraph diversity.

4.2.4 Results on Fine-grain Degree. To test the fine-grain degree, we calculate the ratios of non-repetitive nouns, verbs, prepositions, and adjectives among the paragraphs generated by Base, HRNN, HSGED-IRA-INA, and HSGED(SLL), and draw the two corresponding radar charts in Figure 6, where (a) and (b) use the length of the paragraph and the length of the paragraph without repetition as the denominator, respectively. Furthermore, we calculate the Object and Relation SPICE [1] scores presented in Table 3. Obviously, the paragraphs generated by HSGED(SLL) have the greatest ratios of nouns, verbs, prepositions, and adjectives in both radar charts, and HSGED achieves the highest Object and Relation SPICE scores: 0.41 and 0.23. From these results, we find that HSGED(SLL) generates the most abundant semantic information and HSGED generates richer objects and relations than the baseline HRNN.

4.2.5 Human Evaluation. To further demonstrate that our model could generate more coherent, fine-grained, and diverse paragraphs, we conduct a human evaluation with 20 workers. We sample 50
images from the test set and assign the paragraphs generated by HRNN, HSGED-IRA-INA, and HSGED(SLL) to the workers. The workers are then asked to choose the most coherent, fine-grained, and diverse paragraphs. The results are shown in Figure 7. Compared with HRNN and HSGED-IRA-INA, the paragraphs generated by HSGED(SLL) are considered the most coherent, fine-grained, and diverse, obtaining 41.6%, 45.1%, and 55.0% of the votes, respectively.

Table 3: The Object and Relation SPICE scores of various baselines on the Stanford image paragraph dataset.

          Base  HRNN  HSGED-IRA-INA  HSGED
Object    0.32  0.34  0.38           0.41
Relation  0.15  0.15  0.21           0.23

Figure 6: The radar charts illustrate the ratios of four parts of speech (PoS) in paragraphs generated by four methods. (a) The ratio of PoS with repetition; (b) the ratio of PoS without repetition.

Figure 7: The pie charts show the results of the human evaluation on the quality (coherence, fine grain, and diversity) of the paragraphs generated by HRNN, HSGED-IRA-INA, and HSGED(SLL).

4.3 Comparisons with State-of-The-Arts
4.3.1 Comparing Methods. We compare our HSGED(SLL) with several state-of-the-art models: Regions-Hierarchical [13], RTT-GAN [17], DCPG [5], HCAVP [46], DHPV [36], CAE-LSTM [34], TDPG [23], and CRL [22]. Among these methods, RTT-GAN, DCPG, Regions-Hierarchical, DHPV, HCAVP, and CAE-LSTM use HRNNs with different technical details. RTT-GAN and DCPG-VAE use additional generative models, e.g., GAN and VAE, for improving the coherence and diversity of the paragraphs. For RTT-GAN (Plus), additional image captioning data from the MS-COCO dataset [19] is used to train the model. Compared with them, TDPG uses a flat RNN to generate paragraphs while proposing the trigram repetition penalty in the sampling process to reduce repetition. When comparing with TDPG, we report the results obtained with the trigram repetition penalty sampling strategy, and when comparing with the other methods, we report the results obtained with beam search with a beam size of 5. CRL uses a different reinforcement learning method.

Table 4: The performances of various state-of-the-art methods. The top and middle sections show the results sampled by beam search and by the trigram repetition penalty, respectively. The model in the bottom section uses a different RL reward.

Models                      B@1    B@2    B@3    B@4    M      C
Regions-Hierarchical [13]   41.90  24.11  14.23  8.69   15.95  13.52
RTT-GAN [17]                41.99  24.86  14.89  9.03   17.12  16.87
RTT-GAN (Plus) [17]         42.06  25.35  14.92  9.21   18.39  20.36
DCPG [5]                    42.12  25.18  14.74  9.05   17.81  19.95
DCPG-VAE [5]                42.38  25.52  15.15  9.43   18.62  20.93
HCAVP [46]                  41.38  25.40  14.93  9.00   16.79  20.94
DHPV [36]                   43.35  26.73  16.92  10.99  17.02  22.47
HSGED-IRA-INA               43.76  26.44  16.86  9.79   17.14  23.85
HSGED(SLL)                  44.22  26.93  17.31  10.35  17.45  25.52
CAE-LSTM [34]               −      −      −      9.67   18.82  25.15
TDPG [23]                   43.54  27.44  17.33  10.58  17.86  30.63
HSGED(SLL)                  44.51  28.69  18.28  11.26  18.33  36.02
CRL [22]                    43.12  27.03  16.72  9.95   17.42  31.47

4.3.2 Result Analysis. Table 4 reports the performances of our HSGED(SLL) and the other comparing methods. From the table, we observe that our HSGED(SLL) achieves two new state-of-the-art CIDEr-D scores, 25.52 and 36.02, for the cases of using beam search and the trigram repetition penalty strategy, respectively. In particular, compared with the methods that use HRNNs without applying hierarchical constraints, our method outperforms them in almost all metrics, even though we do not use additional generative models as in RTT-GAN and DCPG-VAE, do not train our model with additional language data as in RTT-GAN (Plus), and do not use a more complex LSTM as in HCAVP. The comparison between our HSGED(SLL) and DHPV, which exploits a more complex sentence-level loss, suggests that when suitable hierarchical constraints are provided, our method can still generate better paragraphs even with a simple sentence-level loss. Although CRL uses a different reinforcement learning method, HSGED(SLL) still performs better.

5 CONCLUSION
In this paper, we proposed the Hierarchical Scene Graph Encoder-Decoder (HSGED), which follows the script constructed by a scene graph to generate the paragraph. In this way, the semantic and hierarchical knowledge of an image can be transferred into the language domain, so that, compared with traditional HRNNs without any hierarchical constraints, more coherent and distinctive paragraphs can be generated. Specifically, our HSGED contains two RNNs: SSG-RNN for generating sub-graph level topic vectors and WSG-RNN for completing the corresponding sentences. To further encourage distinctive sentences, irredundant attention and inheriting attention were respectively deployed into SSG-RNN and WSG-RNN as additional hierarchical regularizations. We also designed a sentence-level loss to regularize the sequence of the generated sentences to be similar to that of the ground-truth paragraphs. Our extensive experiments demonstrated that the proposed model significantly outperforms the state-of-the-art methods and can generate more coherent and distinctive paragraphs.

Acknowledgements. This work is partially supported by NTU-Alibaba Lab, MOE AcRF Tier 1, and a Monash FIT Start-up Grant.
REFERENCES
[1] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision. Springer, 382–398.
[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR. 6.
[3] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 65–72.
[4] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261 (2018).
[5] Moitreya Chatterjee and Alexander G Schwing. 2018. Diverse and coherent paragraph generation from images. In Proceedings of the European Conference on Computer Vision (ECCV). 729–744.
[6] Wenbin Che, Xiaopeng Fan, Ruiqin Xiong, and Debin Zhao. 2018. Paragraph generation network with visual relationship detection. In Proceedings of the 26th ACM International Conference on Multimedia. 1435–1443.
[7] Jiuxiang Gu, Jianfei Cai, Gang Wang, and Tsuhan Chen. 2018. Stack-captioning: Coarse-to-fine learning for image captioning. In Thirty-Second AAAI Conference on Artificial Intelligence.
[8] Jingwen Hou, Sheng Yang, and Weisi Lin. 2020. Object-level Attention for Aesthetic Rating Distribution Prediction. In Proceedings of the 28th ACM International Conference on Multimedia.
[9] Justin Johnson, Agrim Gupta, and Li Fei-Fei. 2018. Image generation from scene graphs. arXiv preprint (2018).
[10] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. 2016. DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4565–4574.
[11] Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. 2015. Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3668–3678.
[12] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[13] Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li Fei-Fei. 2017. A Hierarchical Approach for Generating Descriptive Image Paragraphs. In Computer Vision and Pattern Recognition (CVPR).
[14] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123, 1 (2017), 32–73.
[15] Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057 (2015).
[16] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2015. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493 (2015).
[17] Xiaodan Liang, Zhiting Hu, Hao Zhang, Chuang Gan, and Eric P Xing. 2017. Recurrent topic-transition GAN for visual paragraph generation. In Proceedings of the IEEE International Conference on Computer Vision. 3362–3371.
[18] Rui Lin, Shujie Liu, Muyun Yang, Mu Li, Ming Zhou, and Sheng Li. 2015. Hierarchical recurrent neural network for document modeling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 899–907.
[19] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision. Springer, 740–755.
[20] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 6. 2.
[21] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2018. Neural Baby Talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7219–7228.
[22] Yadan Luo, Zi Huang, Zheng Zhang, Ziwei Wang, Jingjing Li, and Yang Yang. 2019. Curiosity-Driven Reinforcement Learning for Diverse Visual Paragraph Generation. In Proceedings of the 27th ACM International Conference on Multimedia. 2341–2350.
[23] Luke Melas-Kyriazi, Alexander Rush, and George Han. 2018. Training for Diversity in Image Paragraph Captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 757–761.
[24] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 311–318.
[25] Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304 (2017).
[26] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732 (2015).
[27] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems. 91–99.
[28] Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In CVPR, Vol. 1. 3.
[29] Jiaxin Shi, Hanwang Zhang, and Juanzi Li. 2019. Explainable and explicit visual reasoning over scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8376–8384.
[30] Xiangxi Shi, Jianfei Cai, Shafiq Joty, and Jiuxiang Gu. 2019. Watch It Twice: Video Captioning with a Refocused Video Encoder. In Proceedings of the 27th ACM International Conference on Multimedia. 818–826.
[31] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4566–4575.
[32] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
[33] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In CVPR.
[34] Jing Wang, Yingwei Pan, Ting Yao, Jinhui Tang, and Tao Mei. 2019. Convolutional Auto-encoding of Sentence Topics for Image Paragraph Generation. In International Joint Conferences on Artificial Intelligence (IJCAI).
[35] Qingzhong Wang and Antoni B Chan. 2019. Describing like humans: on diversity in image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4195–4203.
[36] Siying Wu, Zheng-Jun Zha, Zilei Wang, Houqiang Li, and Feng Wu. 2019. Densely Supervised Hierarchical Policy-Value Network for Image Paragraph Generation. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19. International Joint Conferences on Artificial Intelligence Organization, 975–981. https://doi.org/10.24963/ijcai.2019/137
[37] Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. 2017. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5410–5419.
[38] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning. 2048–2057.
[39] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks?. In International Conference on Learning Representations. https://openreview.net/forum?id=ryGs6iA5Km
[40] Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai. 2019. Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 10685–10694.
[41] Xu Yang, Hanwang Zhang, and Jianfei Cai. 2019. Learning to Collocate Neural Modules for Image Captioning. arXiv preprint arXiv:1904.08608 (2019).
[42] Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. 2017. Boosting image captioning with attributes. In IEEE International Conference on Computer Vision, ICCV. 22–29.
[43] Guojun Yin, Lu Sheng, Bin Liu, Nenghai Yu, Xiaogang Wang, and Jing Shao. 2019. Context and Attribute Grounded Dense Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6241–6250.
[44] Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. 2016. Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4584–4593.
[45] Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. 2018. Neural motifs: Scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5831–5840.
[46] Zheng-Jun Zha, Daqing Liu, Hanwang Zhang, Yongdong Zhang, and Feng Wu. 2019. Context-aware visual policy network for fine-grained image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019).