

Hierarchical Scene Graph Encoder-Decoder for Image Paragraph Captioning

Xu Yang* (Nanyang Technological University, Singapore)
Chongyang Gao* (Dartmouth College, Hanover, New Hampshire, USA)
Hanwang Zhang (Nanyang Technological University, Singapore)
Jianfei Cai (Monash University, Melbourne, Australia)

* Both authors contributed equally to this research.

ABSTRACT
When we humans tell a long paragraph about an image, we usually first implicitly compose a mental "script" and then comply with it to generate the paragraph. Inspired by this, we equip the modern encoder-decoder based image paragraph captioning model with such an ability by proposing the Hierarchical Scene Graph Encoder-Decoder (HSGED) for generating coherent and distinctive paragraphs. In particular, we use the image scene graph as the "script" to incorporate rich semantic knowledge and, more importantly, the hierarchical constraints into the model. Specifically, we design a sentence scene graph RNN (SSG-RNN) to generate sub-graph level topics, which constrain the word scene graph RNN (WSG-RNN) to generate the corresponding sentences. We propose irredundant attention in SSG-RNN to improve the possibility of abstracting topics from rarely described sub-graphs, and inheriting attention in WSG-RNN to generate more grounded sentences with the abstracted topics, both of which give rise to more distinctive paragraphs. An efficient sentence-level loss is also proposed to encourage the sequence of generated sentences to be similar to that of the ground-truth paragraphs. We validate HSGED on the Stanford image paragraph dataset and show that it not only achieves a new state-of-the-art 36.02 CIDEr-D, but also generates more coherent and distinctive paragraphs under various metrics.

CCS CONCEPTS
• Computing methodologies → Computer vision tasks.

KEYWORDS
Image Paragraph Generation; Scene Graph; Hierarchical Constraint; Hierarchical Scene Graph Encoder-Decoder

ACM Reference Format:
Xu Yang, Chongyang Gao, Hanwang Zhang, and Jianfei Cai. 2020. Hierarchical Scene Graph Encoder-Decoder for Image Paragraph Captioning. In Proceedings of the 28th ACM International Conference on Multimedia (MM '20), October 12–16, 2020, Seattle, WA, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3394171.3413859

1 INTRODUCTION
The traditional image captioning task depicts an image by a single sentence, which is, however, too short to detail the rich visual contents [21, 33, 38, 41]. A paragraph with more descriptive capacity is therefore a better way to detail the distinctiveness of an image [13]. As shown in Figure 1 (a), the paragraph details not only more objects ("Sand", "Tree", and "Rock") but also more attributes of these objects ("Golden Dog" and "Brown Rocks") than a single sentence.

A paragraph is more than a bag of sentences: the sentences should be coherent, i.e., neighboring sentences share some concepts, and distinctive, i.e., they detail different aspects of the image; e.g., Figure 1 (b) shows the coherent and distinctive topics of the paragraph in (a). These requirements of the paragraph, rather than of a single sentence, invalidate flat RNN based captioning models, which lack explicit topic guidance. To illustrate, if we treat the paragraph as a long sentence and directly apply a flat RNN [13, 23], the generated paragraph is usually full of redundant sentences, as in Figure 1 (c), where all the sentences repeat the same topic "Human-Near-Dog".

To achieve topic guidance, researchers propose the Hierarchical RNN (HRNN) [15, 18, 44], which contains multiple RNNs: the higher-level RNNs abstract the topic as the guidance for the lower-level RNNs to generate the corresponding sentence. In this way, a more informative paragraph can be generated; as in Figure 1 (d), the generated sentences cover more topics than those generated by the flat RNN in Figure 1 (c).

However, the multi-level RNNs in HRNN are built without any hierarchical constraints [5, 13, 17, 36, 46], which results in two problems. 1) The topics are not coherent and distinctive, since they are formed by randomly sampling some object sub-sets without any global constraint. 2) The generated sentences do not correspond to the given topics, since the input of the lower-level RNNs is the whole image instead of the sub-regions constrained by the topic. So, such an HRNN tends to degenerate to a flat RNN and generates redundant summary sentences; e.g., in Figure 1 (d), both the first and third sentences describe "Man" and "Dog", even though the third topic is about "Dog" and "Sand".


[Figure 1 panels: (a) Caption vs. Paragraph. Caption: "A man is walking with his dogs." Paragraph: "A man is walking his golden dogs. The man and the dogs are standing in the sand. There are brown rocks and green trees near the sand. ..." (b) Topics and Script. (c) Flat RNN: "A man is walking with dogs. The dogs are near the man. They are walking in the sands." (d) HRNN without Hierarchical Constraint: "A man is walking with dogs. The man is near a tree. The dogs are near a man." (e) HRNN with HSGED: "A man is walking with dogs. They are walking in sands. There are trees and hills near the sands."]
Figure 1: Illustrations of our motivation. (a) Comparisons between single sentence captioning and the informative paragraph.
(b) Topics and script of the paragraph in (a). (c) A paragraph generated by flat RNN. (d) The paragraph generation process of
HRNN without hierarchical constraint where the top and bottom parts denote the higher and lower RNNs, respectively. (e)
The paragraph generation process of our HSGED, where the grey color means this part will not be attended.
The crux of these problems is to find a "script", as in Figure 1 (b), which connects the topics and provides hierarchical constraints for the HRNN. Recent studies show that Scene Graphs [37], which connect local objects into a global graph in terms of the object relations, can serve as such a script and provide hierarchical knowledge for solving complex tasks, e.g., image retrieval [11], image generation [9], image captioning [40], and visual reasoning [29]. Since the neighboring sub-graphs of a scene graph share some concepts and each sub-graph is distinctive, if we form the topic flows from a scene graph and align each topic to a sub-graph, the generated paragraph will naturally be coherent and distinctive.

Motivated by this conjecture, we propose the Hierarchical Scene Graph Encoder-Decoder (HSGED) to exploit the scene graph as the topic script and transfer its hierarchical topological knowledge into the text domain for better paragraphs. Specifically, the high-level RNN follows the scene graph to generate the topics, each of which is represented by a local compact sub-graph. Compared with HRNNs that generate sub-set topics without any global constraints, the topic flows in our HSGED are naturally coherent and distinctive; e.g., the topic flows in Figure 1 (e) are "Human-With-Dog", "Dog-On-Sand", and "Hill-Near-Sand", showing that the neighboring topics are not only closely related but also distinctive. Each sub-graph topic yields a sentence about its corresponding region; e.g., the second topic in Figure 1 (e) is about "Human", "Dog", and "Sand", which is more likely to be treated as a compact part by humans, instead of "Human" and "Tree" as in (d).

Since the paragraph dataset is limited in size, and for fair comparisons, we follow the previous studies [5, 13, 17] to construct the decoder of our HSGED from two RNNs: the sentence scene graph RNN (SSG-RNN) and the word scene graph RNN (WSG-RNN). Specifically, the scene graph is transformed into a set of sub-graph level embeddings by a graph neural network [39]. When generating a new sentence, SSG-RNN adaptively attends to a few sub-graph embeddings based on the context knowledge to form the new topic (see Section 3.2.2). We also design an irredundant attention strategy to encourage the new topics to be formed from the undescribed sub-graphs (see Section 3.2.3); e.g., in Figure 1 (e), the new topic comes from the undescribed (colored) regions. Given the generated sub-graph topic, WSG-RNN composes a new sentence by focusing on the region constrained by the selected compact sub-graphs, which is achieved by a novel inheriting attention strategy (see Section 3.2.5). Furthermore, we design an efficient sentence-level loss to encourage the topics to further follow the scene graph in a human-like order. Extensive experiments on the Stanford image paragraph dataset [13] show that our HSGED generates more coherent and distinctive paragraphs. In particular, we achieve a new state-of-the-art 36.02 CIDEr-D score [31], an absolute 5.24-point boost over a strong baseline (see Section 4).


2 RELATED WORK
Single Sentence Captioning and Dense Captioning. Single sentence captioning has been exhaustively studied recently due to its wide practical utility. Many advanced techniques have been proposed to improve its performance, e.g., the encoder-decoder pipeline [33], multistage reasoning [7, 30], sophisticated attention mechanisms [2, 8, 20, 38], reinforcement learning based rewards [28], and the exploration of high-level semantic knowledge [21, 40–42]. Though these captioning systems can now accurately summarize one image, the generated sentence is usually too short to detail the rich semantic contents of the image. Researchers therefore propose Dense Captioning to generate descriptions for all the detected salient regions [10, 43]. However, since the detected regions are usually heavily overlapped and disordered, the corresponding sentences are redundant and incoherent, which damages their usability [13].

Image Paragraph Captioning. Image paragraph captioning addresses the shortcomings of both single sentence captioning and dense captioning by trying to generate coherent and distinctive paragraphs [13]. Since each sentence of a paragraph is controlled by a topic, researchers propose Hierarchical RNNs (HRNNs) [5, 13, 17, 34, 46], in which higher-level and lower-level RNNs respectively abstract topics and generate sentences based on the abstracted topics. Researchers also propose advanced techniques to refine the prototypical HRNN, e.g., generative models like GAN [17] or VAE [5] for stronger consistency, and a trigram repetition penalty based sampling method for diversity [23]. Besides, dense sentence-level rewards [36] and curiosity-driven reinforcement learning [22] are used for more robust training; all of these could also be applied in our proposed framework, HSGED. However, most of these models are built without enough hierarchical constraints, so the quality of the generated paragraphs is unsatisfactory. In contrast, our HSGED exploits the scene graph as the script to transfer its hierarchical and semantic knowledge from the vision domain to the text domain for more coherent and distinctive paragraphs.

Exploitation of Scene Graphs. A scene graph is formed by connecting discrete objects with their attributes and with other objects through pairwise relationships, so it contains rich semantic and topological knowledge [37]. Observing these advantages, researchers exploit Graph Neural Networks (GNNs) [4, 16, 32, 39] to embed scene graphs in various computer vision tasks, e.g., image retrieval [11], image generation [9], image captioning [40], and visual reasoning [29]. In this paper, we use an advanced GNN [39] to compute the sub-graph embeddings that facilitate paragraph generation. Importantly, compared with SGAE [40], which exploits the explicit semantic knowledge of a scene graph for captioning, we also exploit the implicit topological knowledge. Different from the method of [6], which uses visual relationships by directly fusing the object and relation features into a flat RNN without any topic guidance, we treat the scene graph as a script to facilitate the topic guidance and to regularize the training. Hence, our framework generates more coherent and distinctive paragraphs.

3 HIERARCHICAL SCENE GRAPH ENCODER-DECODER
Our Hierarchical Scene Graph Encoder-Decoder (HSGED) belongs to the encoder-decoder framework [13, 33, 38]. The hierarchical encoder transforms the scene graph into node-level and sub-graph level embeddings (see Section 3.1), which are respectively input into the Word Scene Graph RNN (WSG-RNN) to generate the corresponding sentence and the Sentence Scene Graph RNN (SSG-RNN) to abstract the sub-graph topic (see Section 3.2). Specifically, we design two attention strategies: an irredundant attention in SSG-RNN that removes redundancy for more distinctive sub-graph topics (see Section 3.2.3), and an inheriting attention in WSG-RNN that produces more grounded sentences based on the sub-graph topics (see Section 3.2.5).

3.1 Hierarchical Scene Graph Encoder
3.1.1 Scene Graphs. The scene graph is constructed by using directed edges to connect three types of nodes: the object node o_i, denoting the i-th object; the attribute node a_{il}, denoting the l-th attribute of o_i; and the relation node r_{ij}, denoting the pairwise relation between o_i and o_j. We assign directed edges from object o_i to all of its attributes a_{il}, from object o_i to relation r_{ij}, and from relation r_{ij} to object o_j to form the scene graph. Figure 2 shows one scene graph, which contains four object nodes, three relation nodes, and two attribute nodes. In this way, the scene graph contains rich semantic knowledge brought by the semantic labels, e.g., "Brown" and "Near", and topological knowledge brought by the connectivity of the graph, e.g., "Dog → Near → Bike" and "Dog → On → Road".

Figure 2: The illustrations of three sub-graphs around different objects (the sub-graphs around "Dog", "Bike", and "Wall"). The dashed lines connect the shared nodes, which facilitate the system to generate coherent sub-graph topics.

3.1.2 Node Level Embeddings: X. For the three types of nodes, we use different representations as their node-level embeddings: the linear transformation of the visual feature x_o (the RoI feature from Faster R-CNN passed through an FC layer) for object nodes, and learnable label embeddings x_a, x_r of the attribute and relation labels for attribute and relation nodes, respectively. We use visual features as x_o since they contain more visual clues and they empirically achieve better performance than label embeddings. All these representations are grouped into one node-level embedding set X:

    X = {x_a, x_r, x_o},    (1)

where each embedding corresponds to one node; e.g., the scene graph in Figure 2 contains 9 embeddings (4, 2, and 3 for object, attribute, and relation nodes, respectively). This node-level embedding set is input into WSG-RNN for generating sentences (see Section 3.2).
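To make the encoder concrete, the following is a minimal PyTorch-style sketch of the node-level embeddings in Eq. (1), assuming the parsed scene graph arrives as RoI features for the detected objects plus integer attribute and relation labels. The module names, the RoI dimension of 2048 (a ResNet-based Faster R-CNN backbone), and the vocabulary sizes are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class NodeEmbedder(nn.Module):
    """Sketch of Eq. (1): node-level embedding set X = {x_a, x_r, x_o}."""
    def __init__(self, roi_dim=2048, num_attr=103, num_rel=64, d=1000):
        super().__init__()
        self.obj_fc = nn.Sequential(nn.Linear(roi_dim, d), nn.ReLU())  # visual feature -> x_o
        self.attr_emb = nn.Embedding(num_attr, d)                      # label embedding -> x_a
        self.rel_emb = nn.Embedding(num_rel, d)                        # label embedding -> x_r

    def forward(self, roi_feats, attr_labels, rel_labels):
        # roi_feats: [N_obj, roi_dim]; attr_labels: [N_attr]; rel_labels: [N_rel] (LongTensors)
        x_o = self.obj_fc(roi_feats)      # object nodes
        x_a = self.attr_emb(attr_labels)  # attribute nodes
        x_r = self.rel_emb(rel_labels)    # relation nodes
        return x_o, x_a, x_r              # together they form the set X
```

The embedding dimension of 1,000 follows the settings in Section 4.1.2; the attribute and relation vocabulary sizes match the filtered Visual Genome annotations described there.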


3.1.3 Sub-graph Level Embeddings: U. Since the sentences of a high-quality paragraph should describe distinctive aspects of an object, e.g., its relationships with other objects, we adopt sub-graph level embeddings that facilitate our system to achieve this goal. Specifically, we define the sub-graph around the object o_i as the graph connecting the following nodes: o_i, the object itself; a_{il}, this object's attributes; r_{ij} and r_{ki}, the potential relationships between this object and the other objects; and o_j and o_k, the objects that have potential relationships with o_i. For example, the sub-graph around the object "Dog" in Figure 2 connects the nodes "Dog", "Brown", "On", "Road", "Near", and "Bike". We incorporate such sub-graphs into embeddings by an efficient Graph Neural Network (GNN) [4, 39] and input them into SSG-RNN for abstracting sub-graph topics (see Section 3.2.2).

We denote the i-th object's sub-graph embedding as u_i and group all the objects' sub-graph embeddings into the sub-graph level embedding set U; e.g., for the scene graph in Figure 2, U contains 4 sub-graph embeddings for the 4 objects. Each u_i concatenates three distinctive embeddings, namely the object, attribute, and relation embeddings u_{o_i}, u_{a_i}, and u_{r_i}, which are respectively computed as:

    u_{o_i} = x_{o_i},    (2)

    u_{a_i} = \sum_{a_{il} \in Attr(o_i)} f_A(x_{o_i}, x_{a_{il}}),    (3)

    u_{r_i} = \sum_{r_{ij} \in Sbj(o_i)} f_S(x_{o_i}, x_{r_{ij}}, x_{o_j}) + \sum_{r_{ki} \in Obj(o_i)} f_O(x_{o_k}, x_{r_{ki}}, x_{o_i}),    (4)

    u_i = Concat(u_{o_i}, u_{a_i}, u_{r_i}),    (5)

where x_* are the node-level embeddings in Eq. (1); Sbj(o_i) (or Obj(o_i)) in Eq. (4) denotes the set of relations where o_i acts as the subject (or object) node, e.g., Sbj(o_Dog) = {r_Near, r_On}; Concat in Eq. (5) denotes concatenation; and f_A, f_S, and f_O are independent sub-networks with the same structure: FC-ReLU-FC-ReLU. We use such two-layer perceptrons followed by a summation pooling in the GNN to achieve stronger representation power [39]. Figure 3 sketches the three operations for the sub-graph level embedding; e.g., the relation embedding incorporates the knowledge of "Dog", "Road", "On", "Bike", and "Near".

Figure 3: The sketches of the three sub-graph level embeddings: (a) object embedding, (b) attribute embedding, and (c) relation embedding.

Importantly, sub-graph level embeddings have two significant advantages. 1) They contain rich semantic knowledge brought by the attribute and relation labels. 2) They preserve useful topological knowledge for determining whether two objects are closely related. For example, as in Figure 2, the sub-graphs around "Dog" and "Bike" share the nodes "Dog", "Bike", and "Near", so the sub-graph embedding u_Dog is more closely related to u_Bike than to u_Wall due to the relation embedding (Eq. (4)). Such relatedness encourages SSG-RNN to generate coherent sub-graph topics: after describing "Dog", "Bike" is more likely to be described next.
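A minimal sketch of Eqs. (2)-(5) is given below. It assumes the scene graph is stored as simple index lists (the owner object of each attribute and the subject/object indices of each relation); the two-layer perceptrons f_A, f_S, f_O and the summation pooling follow the description above, while the data layout is an assumption for illustration only.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    # f_A / f_S / f_O share this FC-ReLU-FC-ReLU structure
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(),
                         nn.Linear(out_dim, out_dim), nn.ReLU())

class SubGraphEncoder(nn.Module):
    """Sketch of Eqs. (2)-(5): sub-graph level embedding u_i = [u_o; u_a; u_r]."""
    def __init__(self, d=1000):
        super().__init__()
        self.f_A = mlp(2 * d, d)   # (x_o, x_a)        -> attribute term
        self.f_S = mlp(3 * d, d)   # (x_o, x_r, x_oj)  -> subject-side relation term
        self.f_O = mlp(3 * d, d)   # (x_ok, x_r, x_o)  -> object-side relation term

    def forward(self, x_o, x_a, x_r, attr_of, rel_sub, rel_obj):
        # x_o: [N, d], x_a: [A, d], x_r: [R, d]; attr_of: [A] owner object of each attribute;
        # rel_sub / rel_obj: [R] subject / object index of each relation (LongTensors).
        N, d = x_o.shape
        u_o = x_o                                                    # Eq. (2)
        u_a = torch.zeros(N, d).index_add(                           # Eq. (3): sum over Attr(o_i)
            0, attr_of, self.f_A(torch.cat([x_o[attr_of], x_a], -1)))
        subj = self.f_S(torch.cat([x_o[rel_sub], x_r, x_o[rel_obj]], -1))
        obj = self.f_O(torch.cat([x_o[rel_sub], x_r, x_o[rel_obj]], -1))
        u_r = (torch.zeros(N, d).index_add(0, rel_sub, subj)         # Eq. (4): sum over Sbj(o_i)
               + torch.zeros(N, d).index_add(0, rel_obj, obj))       #          and over Obj(o_i)
        return torch.cat([u_o, u_a, u_r], dim=-1)                    # Eq. (5): U, one row per object
```

Each returned row has dimension 3,000 (three concatenated 1,000-d parts), which is consistent with the 3,000 x 512 attention matrix W_v used for SSG-RNN in Section 4.1.2.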
3.2 Hierarchical Decoders
3.2.1 Top-Down Attention Network. We use the Top-Down attention network [2] as the prototype to design our SSG-RNN and WSG-RNN. Here we first briefly revisit this architecture and then detail how to revise it to obtain our SSG-RNN (see Section 3.2.2) and WSG-RNN (see Section 3.2.4).

Basically, a top-down attention network contains two LSTM layers and one attention sub-network, as shown in Figure 4 (a). Given the input vector z^t at time step t, it can be formalized as:

    Context vector: h_1^t = LSTM_1(z^t; h_1^{t-1}),
    Attention:      \hat{v}^t = ATT(V, h_1^t),    (6)
    Output:         h_2^t = LSTM_2(Concat(h_1^t, \hat{v}^t); h_2^{t-1}),

where the attention sub-network ATT is:

    Input:            V, h_1^t,
    Attention weight: a_i^t = \omega_a^T tanh(W_v v_i + W_h h_1^t),  \alpha^t = softmax(a^t),    (7)
    Output:           \hat{v}^t = V \alpha^t,

where W_v and W_h are trainable matrices and \omega_a is a trainable vector. In the following sections on SSG-RNN and WSG-RNN, we specify different values for z^t and V. We then revise the attention sub-network with different strategies to detail how SSG-RNN abstracts sub-graph topics and how WSG-RNN generates the corresponding sentences.
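The following is a minimal sketch of the two-LSTM top-down attention cell of Eqs. (6)-(7), written with standard PyTorch LSTMCell and linear layers. The feature set V is passed in as an argument so that the same cell can be instantiated with the sub-graph embeddings U (SSG-RNN) or the node embeddings X (WSG-RNN), as described next; sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TopDownAttentionCell(nn.Module):
    """Sketch of Eqs. (6)-(7): two LSTM layers around an additive attention module."""
    def __init__(self, z_dim, feat_dim, hid=1000, att=512):
        super().__init__()
        self.lstm1 = nn.LSTMCell(z_dim, hid)
        self.lstm2 = nn.LSTMCell(hid + feat_dim, hid)
        self.w_v = nn.Linear(feat_dim, att, bias=False)   # W_v
        self.w_h = nn.Linear(hid, att, bias=False)        # W_h
        self.w_a = nn.Linear(att, 1, bias=False)          # omega_a

    def attend(self, V, h1):
        # Eq. (7): additive attention over the rows of V ([B, N, feat_dim])
        a = self.w_a(torch.tanh(self.w_v(V) + self.w_h(h1).unsqueeze(1))).squeeze(-1)
        alpha = torch.softmax(a, dim=-1)
        return (alpha.unsqueeze(-1) * V).sum(1), a        # attended vector and raw logits

    def forward(self, z, V, state1, state2):
        h1, c1 = self.lstm1(z, state1)                          # context LSTM
        v_hat, logits = self.attend(V, h1)                      # attention
        h2, c2 = self.lstm2(torch.cat([h1, v_hat], -1), state2) # output LSTM
        return h2, logits, (h1, c1), (h2, c2)
```

Returning the raw logits alongside the attended vector makes it easy to plug in the irredundant attention of Section 3.2.3, which re-normalizes them across sentences.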


3.2.2 SSG-RNN. To avoid confusion, we use n as the sentence index and t as the word index. SSG-RNN is used to generate a topic vector c_n that covers the content of the n-th sentence. The topic vector has two utilities in the whole system. 1) It passes interdependent knowledge between neighboring sentences for more coherent paragraphs. 2) It constrains the content of the sentence to be generated. To achieve the first goal, we input the previous sentence topic vector c_{n-1} into SSG-RNN to pass the interdependent knowledge; to achieve the second goal, we input the newly generated topic vector c_n into WSG-RNN.

Specifically, in SSG-RNN, the input vector z^t in Eq. (6) is:

    z^t = Concat(c_{n-1}, W_\Sigma w_{t-1}, \bar{u}, h_2^{t-1}),    (8)

where W_\Sigma is a trainable word embedding matrix; w_{t-1} is the (t-1)-th word; \bar{u} is the mean pooling of the sub-graph level embedding set U (Eq. (5)); and h_2^{t-1} is the output of the second LSTM layer in SSG-RNN at time step t-1.

In SSG-RNN, we replace the input feature set V of the attention sub-network (Eq. (7)) with the sub-graph level embedding set U, where each embedding covers the sub-graph of an object. During attention computation, SSG-RNN adaptively focuses on a few sub-graphs and produces a sub-graph level attended embedding \hat{u}^t, which is used to compute the sub-graph topic vector by the second LSTM in Eq. (6). Once the input word w_t is the full stop symbol ".", we compute c_n as:

    c_n = ReLU(FC(h_2^t)).    (9)

By adaptively attending to the sub-graphs, the generated topic vector captures distinctive semantic knowledge of the attended sub-graphs and also preserves topological knowledge that facilitates coherence between neighboring sentences. The coherence is achieved by the relation embedding (Eq. (4)), which incorporates an object with its related objects into one embedding, thus facilitating the system to choose the neighboring sub-graph according to the last attended sub-graphs. For example, in Figure 1 (e), the second sub-graph is about "Human", "Dog", and "Sand"; SSG-RNN exploits the node "Sand" as the clue to generate the next sub-graph topic about "Sand", "Tree", and "Hill".

Figure 4: In HSGED, SSG-RNN and WSG-RNN are both built upon the top-down attention network, with different attention mechanisms and inputs: the sub-graph level and node level embedding sets U and X, respectively. (a) Top-down Attention Network; (b) HSGED.
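As a concrete illustration of Eqs. (8)-(9), the sketch below shows one SSG-RNN step built on the TopDownAttentionCell sketched in Section 3.2.1. The topic head and the convention that c_n is read out when the full stop is fed in are written to mirror the text; all names and sizes remain illustrative assumptions.

```python
import torch
import torch.nn as nn

class SSGRNN(nn.Module):
    """Sketch of Eqs. (8)-(9): abstracting a sub-graph topic vector c_n."""
    def __init__(self, vocab=4962, d=1000, hid=1000):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, d)                               # W_Sigma
        # z^t = [c_{n-1}; W_Sigma w_{t-1}; mean(U); h_2^{t-1}]
        self.cell = TopDownAttentionCell(z_dim=hid + d + 3 * d + hid,
                                         feat_dim=3 * d, hid=hid)
        self.topic_head = nn.Sequential(nn.Linear(hid, hid), nn.ReLU())      # Eq. (9)

    def step(self, c_prev, w_prev, U, state1, state2):
        # U: [B, N, 3d] sub-graph embeddings; w_prev: [B] previous word ids
        z = torch.cat([c_prev, self.word_emb(w_prev), U.mean(1), state2[0]], -1)  # Eq. (8)
        h2, logits, state1, state2 = self.cell(z, U, state1, state2)
        c_n = self.topic_head(h2)     # used as the topic when the input word is "."
        return c_n, logits, state1, state2
```

The attention logits returned at each sentence boundary are exactly the quantities a^{tau_m} that the irredundant attention of the next subsection accumulates.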
3.2.3 Irredundant Attention. To further encourage each topic vector to focus on rarely described sub-graphs, we substitute the attention sub-network in Eq. (7) with the following irredundant attention strategy [25]. When SSG-RNN generates the n-th topic vector c_n, it does not directly feed the computed attention weights a into the softmax layer (Eq. (7)) to compute the attended vector. Instead, it distracts the current attention from the previously and frequently attended sub-graph level embeddings to obtain the irredundant attention weights β:

    b_i^n = exp(a_i^{\tau_{n-1}})                                           if n = 1,
    b_i^n = exp(a_i^{\tau_{n-1}}) / \sum_{m=1}^{n-1} exp(a_i^{\tau_m})      otherwise,    (10)

    β^n = softmax(b^n),    (11)

where \tau_m is the time step of the full stop symbol of sentence s_m. Then we compute the irredundant attention vector \hat{u}^n:

    \hat{u}^n = U β^n,    (12)

which is used to generate the distinctive topic vector c_n as in Section 3.2.2.

If the i-th sub-graph level embedding u_i is frequently attended while generating the previous n-1 topic vectors, then \sum_{m=1}^{n-1} exp(a_i^{\tau_m}) will be large and b_i^n will be small. So, the current topic vector will be less likely to focus on the i-th sub-graph, and thus the generated sentences will be less repetitive.

It is noteworthy that this irredundant attention does not prevent our model from attending to a salient object multiple times when needed; e.g., as in Figure 1 (a), the concept "Human" is described twice in the first two sentences of the ground-truth paragraph. Our model can visit the same salient object o_i multiple times since its knowledge is incorporated not only into u_{r_i}, but also into u_{r_k} if o_i and o_k have potential relations (Eq. (4)). As a result, the salient object o_i will also be mentioned when the sub-graph of o_k is chosen as the topic; e.g., as in Figure 1 (e), the concept "Human" is described in the first two sentences of our generated paragraph.
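A sketch of Eqs. (10)-(12) is given below. It keeps a running list of the attention logits recorded at each sentence boundary and uses them to down-weight sub-graphs that were strongly attended before; the exact bookkeeping of the boundary time steps tau_m is an implementation assumption consistent with their definition above.

```python
import torch

def irredundant_attention(U, logits, past_logits):
    """Sketch of Eqs. (10)-(12).

    U:           [N, 3d]  sub-graph level embeddings
    logits:      [N]      raw attention logits at the current sentence boundary
    past_logits: list of [N] logits recorded at earlier sentence boundaries
    """
    b = torch.exp(logits)
    if past_logits:                                   # "otherwise" branch of Eq. (10)
        b = b / torch.stack(past_logits).exp().sum(0)
    beta = torch.softmax(b, dim=-1)                   # Eq. (11)
    u_hat = beta @ U                                  # Eq. (12): irredundant attended vector
    return u_hat, beta

# Usage sketch: after a sentence's full stop is emitted, append its boundary logits
# (past_logits.append(logits.detach())) so that later topics are distracted from
# sub-graphs that already received strong attention.
```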
3.2.4 WSG-RNN. Given a topic vector c_n, WSG-RNN completes the corresponding sentence s_n. Here the input vector z^t in Eq. (6) is:

    z^t = Concat(c_n, W_\Sigma w_{t-1}, \bar{x}, h_2^{t-1}),    (13)

where c_n is the topic vector of the n-th sentence, which regularizes the word generation; \bar{x} is the mean of the node-level embeddings (Eq. (1)); and h_2^{t-1} is the hidden state of the second LSTM layer in WSG-RNN.

In WSG-RNN, the feature set V in Eq. (7) is set to the node-level embedding set X. In this way, WSG-RNN attends to the nodes and generates the words based on them. After computing the attended node-level embedding \hat{x} by the inheriting attention network (Eq. (16)), the second LSTM layer outputs its hidden state h_2^t in Eq. (6) to predict the word distribution at time step t:

    P(w_t | w_{1:t-1}) = softmax(FC(h_2^t)).    (14)

3.2.5 Inheriting Attention. To generate a sentence that corresponds more closely to the selected sub-graphs, inheriting attention is applied in WSG-RNN to constrain its attention to the nodes in the selected sub-graphs. To achieve this, the node level attention weights γ are first inherited from the sub-graph level attention weights β (Eq. (12)) as follows:

    Attribute node: γ_{a_{il}} = β_i,
    Object node:    γ_{o_i} = Mean_{j ∈ Cover(o_i)}(β_j),    (15)
    Relation node:  γ_{r_{ij}} = Mean_{j ∈ Cover(r_{ij})}(β_j),

where the attribute node attention γ_{a_{il}} is directly inherited from the attention weight β_i of the sub-graph around o_i. Since the object and relation nodes are covered by many sub-graphs, we average the attention weights of these sub-graphs as the inherited attention weights for both object and relation nodes, where j ∈ Cover(o_i) in Eq. (15) means that the object o_i is covered by the sub-graph around the object o_j. For example, as in Figure 2, "Bike" is covered by the sub-graphs of "Dog", "Bike", and "Wall", so we average the attention weights of these three sub-graphs as the inherited attention weight of the node o_Bike.

After inheriting γ from β, we compute the attended node level embedding \hat{x} as:

    \hat{x} = X Softmax(γ .∗ α),    (16)

where .∗ denotes element-wise multiplication and α is computed by the WSG-RNN attention network in Eq. (7).

If a sub-graph is not selected by SSG-RNN when generating the current topic vector, the attention weight of the corresponding sub-graph level embedding is small. By inheriting attention, the attention weights of the nodes in this sub-graph will also be small, so these nodes will be less likely to be selected to complete the current sentence. In this way, compared with HRNN methods without any hierarchical constraints, our WSG-RNN generates more grounded sentences.
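Below is a sketch of Eqs. (15)-(16). It assumes the same index lists as the sub-graph encoder sketch (attribute owners and relation subject/object indices), orders the node set X as [objects; attributes; relations], and treats Cover as the set of sub-graphs whose node lists touch a node; this matches the description above, but the concrete bookkeeping is an assumption.

```python
import torch

def inheriting_attention(X, alpha, beta, attr_of, rel_sub, rel_obj):
    """Sketch of Eqs. (15)-(16).

    X:     [N + A + R, d]  node-level embeddings, ordered as [objects; attributes; relations]
    alpha: [N + A + R]     raw WSG-RNN attention weights over the nodes (Eq. (7))
    beta:  [N]             sub-graph level attention weights from SSG-RNN (Eq. (11))
    """
    N = beta.numel()
    gamma_a = beta[attr_of]                            # attribute nodes inherit beta_i directly
    # o_i is covered by its own sub-graph and by the sub-graphs of its relation partners
    cover = torch.eye(N)
    cover[rel_sub, rel_obj] = 1.0
    cover[rel_obj, rel_sub] = 1.0
    gamma_o = (cover * beta).sum(1) / cover.sum(1)     # mean of beta over covering sub-graphs
    gamma_r = 0.5 * (beta[rel_sub] + beta[rel_obj])    # r_ij lies in the sub-graphs of o_i and o_j
    gamma = torch.cat([gamma_o, gamma_a, gamma_r])
    weights = torch.softmax(gamma * alpha, dim=-1)     # Eq. (16): gamma .* alpha, then softmax
    return weights @ X                                 # attended node-level embedding x_hat
```

The returned vector plays the role of \hat{x} in Eq. (13)'s decoder step, so nodes outside the chosen sub-graphs contribute little to the words of the current sentence.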


3.3 Training Objectives
Given a ground-truth paragraph P* = {w*_{1:T}}, we can train our HSGED end-to-end by maximizing the likelihood of P*, as in Regions-Hierarchical [13]. For convenience, we denote the predicted word distribution as P(w_t) (Eq. (14)). Maximizing the likelihood equals minimizing the cross-entropy loss:

    L_word = - \sum_{t=1}^{T} log P(w*_t).    (17)

This is a word-level loss since it directly encourages the generated words to be the same as the ground-truth words.

Though this word-level loss is simple and efficient, it causes the exposure bias problem [26], which damages performance due to the mismatch between training and test. To alleviate this mismatch, a reinforcement learning (RL) based reward [23, 28] can be used to train the paragraph generator:

    L_para = - E_{w_t^s ~ P(w)} [ r(P^s; P*) ],    (18)

where r is the CIDEr-D [31] metric between the paragraph P^s = {w^s_{1:T}} sampled from Eq. (14) and the ground-truth paragraph P*. This is a paragraph-level loss since it encourages the whole generated paragraph to be similar to the ground-truth paragraph.

Though this paragraph-level loss improves the quality of the generated paragraphs, it still has two shortcomings. First, it neglects the order of the sentences, while a coherent paragraph requires the generated sentences to be listed in a proper sequence. Second, it does not make sufficient use of the training paragraphs, since it only computes one reward for the whole paragraph.

Therefore, we propose a sentence-level loss, which is the sum of the CIDEr-D scores between each sampled sentence s_n^s and the corresponding ground-truth sentence s_n^*:

    L_sen = - E_{w_t^s ~ P(w)} [ \sum_{n=1}^{N} r(s_n^s; s_n^*) ].    (19)

By enforcing each sampled s_n^s to be similar to the ground-truth s_n^*, the narrative logic is encouraged to follow the human-like sequence, which is also a kind of hierarchical constraint. Moreover, the training paragraphs are utilized more sufficiently because each sentence provides a direct supervision.

In the experiments, to utilize the advantages of the word-level, paragraph-level, and sentence-level losses, we combine them as:

    L_comb = w_word L_word + w_sen L_sen + w_para L_para,    (20)

where w_word, w_sen, and w_para are the weights of L_word, L_sen, and L_para, respectively.
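The sketch below illustrates how the three losses of Eqs. (17)-(20) can be combined, using a plain REINFORCE estimate for the paragraph- and sentence-level rewards. The cider_d scorer, the pre-segmented sentence lists, and the default weights are assumptions standing in for whatever reward implementation and data pipeline are actually used.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, logprobs, gt_words, gt_sents, sampled_sents,
                  cider_d, w_word=1.0, w_sen=0.1, w_para=1.0):
    """Sketch of Eqs. (17)-(20) for a single training paragraph.

    logits:        [T, vocab]  predicted word distributions (Eq. (14)), teacher-forced
    gt_words:      [T]         ground-truth word ids
    logprobs:      [T_s]       log-probabilities of the sampled paragraph's words
    gt_sents / sampled_sents:  lists of sentence strings (ground truth / sampled)
    cider_d:       callable(candidate_str, reference_str) -> float, assumed available
    """
    # Eq. (17): word-level cross entropy
    l_word = F.cross_entropy(logits, gt_words)

    # Eq. (18): one CIDEr-D reward for the whole sampled paragraph
    r_para = cider_d(" ".join(sampled_sents), " ".join(gt_sents))
    l_para = -r_para * logprobs.sum()

    # Eq. (19): one CIDEr-D reward per sampled sentence, summed
    r_sen = sum(cider_d(s, g) for s, g in zip(sampled_sents, gt_sents))
    l_sen = -r_sen * logprobs.sum()

    # Eq. (20): weighted combination
    return w_word * l_word + w_sen * l_sen + w_para * l_para
```

The default weights w_sen = 0.1 and w_para = 1 follow the training schedule reported in Section 4.1.2; in practice a baseline (e.g., greedy-decoding reward) is usually subtracted from each reward to reduce the variance of the REINFORCE estimate.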
4 EXPERIMENTS
4.1 Datasets, Settings, and Metrics
4.1.1 Datasets. Stanford Image Paragraph Dataset [13] is the mainstream large-scale dataset for image paragraph generation. It contains 19,551 images, and each image is paired with one paragraph. The whole dataset contains 14,575/2,487/2,489 image-paragraph pairs for training/validation/test, respectively. On average, each paragraph contains 5.7 sentences and each sentence contains 11.9 words.

Visual Genome [14] (VG). We use the annotations of this dataset, which include object categories, attributes, and pairwise relations, to train our scene graph generator. Specifically, after removing the objects, attributes, and relations that appear fewer than 2,000 times in the training set, we use the remaining 305 objects, 103 attributes, and 64 relations to train our object detector, attribute classifier, and relation classifier. It is noteworthy that, for fair testing, we filter out the VG images that also exist in the test set of the Stanford image paragraph dataset.

4.1.2 Settings. The dimensions of x_o/x_a/x_r/u_o/u_a/u_r in Sections 3.1.2 and 3.1.3 are all set to 1,000. The dimensions of the hidden states (h_*) of all the LSTM layers are set to 1,000. The sizes of W_v/W_h (Eq. (7)) are set to 3,000 x 512 / 1,000 x 512 in SSG-RNN and 1,000 x 512 / 1,000 x 512 in WSG-RNN. The size of W_Σ in Eq. (8) is 1,000 x 4,962. To parse the scene graph, we use Faster R-CNN as the object detector [27] and MOTIFS as the relation classifier [45]. Our attribute classifier is an FC-ReLU-FC-Sigmoid network that performs multi-label classification of the attributes.

When training the whole pipeline, we use the word-level loss (Eq. (17)) in the first 30 epochs and then the combination loss (Eq. (20)), where w_sen is 0.1 and w_para is 1, in the next 70 epochs. For the word-level/combination losses, the learning rates are initialized as 3e-4/2e-5 and decayed by 0.85 every 5 epochs. We set the batch size to 100 and use the Adam optimizer [12]. At the inference stage, we use both beam search and the trigram repetition penalty [23] to sample paragraphs.

4.1.3 Metrics. Following the previous methods [5, 13, 23], we use three standard metrics, CIDEr-D [31], BLEU [24], and METEOR [3], to measure the similarities between the generated paragraphs and the ground-truth paragraphs. Because the ground-truth paragraphs are created by humans and are naturally coherent, the generated paragraphs are likely to be more coherent if the similarity scores are higher. Besides, we measure distinctiveness from two aspects: diversity and fine-grain degree. Diversity is measured by a metric [35] derived from the kernelized latent semantic analysis of CIDEr-D. Fine-grain degree is measured by part-of-speech statistics. Moreover, we conduct a human evaluation of the coherence, diversity, and fine-grain degree of the paragraphs generated by different methods.

4.2 Ablative Studies
We carry out extensive ablative studies by gradually constructing our model from the original Up-Down model [2, 23] so as to evaluate the effectiveness of each individual component. In particular, we construct the following baselines.

4.2.1 Comparing Methods. Base: We treat each paragraph as one long sentence and directly use the Up-Down model (Eq. (6)) as a flat RNN to generate it. This baseline is the benchmark for the other ablative baselines. GNN: We use the Graph Neural Network (GNN) to compute the sub-graph level embedding set U (Eq. (5)) and input it into the decoder of a flat RNN, as in Base, to generate paragraphs. HRNN: We use hierarchical RNNs to generate paragraphs, while we only input the node level embedding set X (Eq. (1)) into the two RNNs, without the irredundant and inheriting attentions. HSGED-IRA-INA: We input the sub-graph level embedding set U into SSG-RNN and the node level embedding set X into WSG-RNN, while we do not deploy the irredundant and inheriting attentions. HSGED-INA: Compared with HSGED-IRA-INA, we apply irredundant attention but still do not use inheriting attention. HSGED: We use the integral architecture sketched in Figure 4 (b). SP: We replace the original semantic scene graph with a spatial scene graph, where each box is connected to its nearest 5 boxes, with their relative spatial positions as labels. SLL: We use the sentence-level loss defined in Eq. (19) to train some of the baselines.
[Figure 5, qualitative examples (the sub-graph topic diagrams are omitted; the generated paragraphs are reproduced below):
HSGED(SLL): "A yellow teddy bear is sitting on a bed. The teddy bear is sitting near a wall. The wall is white and blue. There is a light on the wall. The light is on the teddy bear."
HRNN: "A teddy bear is sitting on a bed. The teddy bear is sitting on top of the bed. There is a white on the wall. The wall is white. The wall. The wall is white."
HSGED(SLL): "A man is standing on a field. The field is full of grass. The man is wearing a white shirt. He holds a white frisbee. There are some trees behind him."
HRNN: "A man is standing on a grassy field. The man is wearing a white and white shirt. The man is wearing a white and white shirt. The grass is green and green. The grass is green and green."]
Figure 5: Two qualitative examples. The bottom left and bottom right parts show the sub-graph topics and the generated
paragraph, respectively. The colors highlight the alignments between the sentences and the sub-graph topics.

Table 1: The performances of various baselines on the Stanford image paragraph dataset. The metrics B@N, M, and C denote BLEU@N, METEOR, and CIDEr-D, respectively.

Models           B@1    B@2    B@3    B@4    M      C
Base             43.42  27.56  17.41  10.39  17.22  30.78
Base(SLL)        43.63  27.74  17.58  10.44  17.42  31.90
GNN              43.77  27.83  17.66  10.65  17.51  32.48
GNN(SLL)         44.02  27.98  17.87  10.90  17.78  33.57
HRNN             43.54  27.65  17.53  10.42  17.44  31.36
HSGED-IRA-INA    44.17  28.36  17.85  11.04  17.89  33.67
HSGED-INA        44.21  28.42  18.04  11.09  18.06  34.42
HSGED            44.33  28.47  18.09  11.10  18.11  35.13
HSGED(SP, SLL)   44.20  28.39  18.02  11.07  18.02  34.15
HSGED(SLL)       44.51  28.69  18.28  11.26  18.33  36.02

Table 2: The diversity scores (the larger, the more diverse) of various baselines. GT denotes ground truth.

           Base   HRNN   HSGED-IRA-INA  HSGED-INA  HSGED  HSGED(SLL)  GT
Diversity  0.786  0.791  0.810          0.823      0.836  0.840       0.847

4.2.2 Results on Similarity. Table 1 shows the performances of the different ablative baselines. Here we use the trigram repetition penalty [23] to sample paragraphs. Compared with Base, our integral model, HSGED(SLL), boosts the CIDEr-D by 5.24. By comparing Base(SLL) and GNN with Base, we observe that the performances are boosted, which confirms the utility of the sentence-level loss and the sub-graph level embedding. More importantly, by successively comparing HRNN, HSGED-IRA-INA, HSGED-INA, and HSGED, we observe uninterrupted improvements, which substantially validate the superiority of our proposed SSG-RNN and WSG-RNN, especially the irredundant and inheriting attentions. Another interesting observation is that the improvement of HRNN over Base is marginal (0.58 CIDEr-D), while the improvement of our HSGED-IRA-INA is more obvious (2.89 CIDEr-D), which shows that only applying a naive HRNN structure without any hierarchical constraint is not enough for capturing the hierarchical knowledge in both the image and the paragraph, and makes HRNN degrade to a flat RNN. Figure 5 shows two examples of the paragraphs generated by HSGED(SLL) and HRNN, where HSGED's paragraphs are clearly more coherent and distinctive.

4.2.3 Results on Diversity. Table 2 shows the diversity scores [35] of the paragraphs generated by different models. We also compute the diversity score of the ground-truth paragraphs of the test set, denoted as GT. Comparing the results of HSGED-IRA-INA, HSGED-INA, HSGED, and HSGED(SLL) in Table 2, we can see that the diversity scores improve step by step, which demonstrates the effectiveness of the irredundant attention, inheriting attention, and sentence-level loss in improving paragraph diversity.

4.2.4 Results on Fine-grain Degree. To test the fine-grain degree, we calculate the ratios of non-repetitive nouns, verbs, prepositions, and adjectives among the paragraphs generated by Base, HRNN, HSGED-IRA-INA, and HSGED(SLL), and draw two corresponding radar charts in Figure 6, where (a) and (b) use the length of the paragraph and the length of the paragraph without repetition as the denominator, respectively. Furthermore, we calculate the Object and Relation SPICE [1] scores presented in Table 3. The paragraphs generated by HSGED(SLL) have the greatest ratios of nouns, verbs, prepositions, and adjectives in both radar charts, and HSGED achieves the highest Object and Relation SPICE scores: 0.41 and 0.23. From these results, we find that HSGED(SLL) generates the most abundant semantic information and HSGED generates richer objects and relations than the baseline HRNN.

Table 3: The Object and Relation SPICE scores of various baselines on the Stanford image paragraph dataset.

          Base  HRNN  HSGED-IRA-INA  HSGED
Object    0.32  0.34  0.38           0.41
Relation  0.15  0.15  0.21           0.23

Figure 6: The radar charts illustrate the ratios of four parts of speech (PoS) in paragraphs generated by four methods: (a) the ratio of PoS with repetition, (b) the ratio of PoS without repetition.

Figure 7: The pie charts show the results of the human evaluation on the quality (coherence, fine grain, and diversity) of the paragraphs generated by HRNN, HSGED-IRA-INA, and HSGED(SLL).

4.2.5 Human Evaluation. To further demonstrate that our model generates more coherent, fine-grained, and diverse paragraphs, we conduct a human evaluation with 20 workers. We sample 50 images from the test set and show the workers the paragraphs generated by HRNN, HSGED-IRA-INA, and HSGED(SLL). The workers are asked to choose the most coherent, fine-grained, and diverse paragraphs. The results are shown in Figure 7. Compared with HRNN and HSGED-IRA-INA, the paragraphs generated by HSGED(SLL) are considered the most coherent, fine-grained, and diverse, obtaining 41.6%, 45.1%, and 55.0% of the votes, respectively.

4.3 Comparisons with State-of-The-Arts
4.3.1 Comparing Methods. We compare our HSGED(SLL) with several state-of-the-art models: Regions-Hierarchical [13], RTT-GAN [17], DCPG [5], HCAVP [46], DHPV [36], CAE-LSTM [34], TDPG [23], and CRL [22]. Among these methods, RTT-GAN, DCPG, Regions-Hierarchical, DHPV, HCAVP, and CAE-LSTM use HRNNs with different technical details. RTT-GAN and DCPG-VAE use additional generative models, e.g., GAN and VAE, to improve the coherence and diversity of the paragraphs. RTT-GAN (Plus) uses additional image captioning data from the MS-COCO dataset [19] to train the model. Compared with them, TDPG uses a flat RNN to generate paragraphs while proposing the trigram repetition penalty in the sampling process to reduce repetitions. When comparing with TDPG, we report the results obtained with the trigram repetition penalty sampling strategy; when comparing with the other methods, we report the results obtained with beam search with a beam size of 5. CRL uses a different reinforcement learning method.

Table 4: The performances of various state-of-the-art methods. The top and middle sections show the results sampled by beam search and trigram repetition penalty, respectively. The model in the bottom section uses a different RL reward.

Models                     B@1    B@2    B@3    B@4    M      C
Regions-Hierarchical [13]  41.90  24.11  14.23  8.69   15.95  13.52
RTT-GAN [17]               41.99  24.86  14.89  9.03   17.12  16.87
RTT-GAN (Plus) [17]        42.06  25.35  14.92  9.21   18.39  20.36
DCPG [5]                   42.12  25.18  14.74  9.05   17.81  19.95
DCPG-VAE [5]               42.38  25.52  15.15  9.43   18.62  20.93
HCAVP [46]                 41.38  25.40  14.93  9.00   16.79  20.94
DHPV [36]                  43.35  26.73  16.92  10.99  17.02  22.47
HSGED-IRA-INA              43.76  26.44  16.86  9.79   17.14  23.85
HSGED(SLL)                 44.22  26.93  17.31  10.35  17.45  25.52
CAE-LSTM [34]              -      -      -      9.67   18.82  25.15
TDPG [23]                  43.54  27.44  17.33  10.58  17.86  30.63
HSGED(SLL)                 44.51  28.69  18.28  11.26  18.33  36.02
CRL [22]                   43.12  27.03  16.72  9.95   17.42  31.47

4.3.2 Result Analysis. Table 4 reports the performances of our HSGED(SLL) and the other comparing methods. From the table, we observe that our HSGED(SLL) achieves two new state-of-the-art CIDEr-D scores, 25.52 and 36.02, when using beam search and the trigram repetition penalty strategy, respectively. In particular, compared with the methods that use an HRNN without hierarchical constraints, our method outperforms them in almost all metrics, even though we do not use additional generative models as in RTT-GAN and DCPG-VAE, do not train with additional language data as in RTT-GAN (Plus), and do not use a more complex LSTM as in HCAVP. The comparison between our HSGED(SLL) and DHPV, which exploits a more complex sentence-level loss, suggests that, when suitable hierarchical constraints are provided, our method can still generate better paragraphs even with a simple sentence-level loss. Although CRL uses a different reinforcement learning method, HSGED(SLL) still performs better.

5 CONCLUSION
In this paper, we proposed the Hierarchical Scene Graph Encoder-Decoder (HSGED), which follows the script constructed by a scene graph to generate a paragraph. In this way, the semantic and hierarchical knowledge of an image can be transferred into the language domain, so compared with traditional HRNNs without any hierarchical constraints, more coherent and distinctive paragraphs can be generated. Specifically, our HSGED contains two RNNs: SSG-RNN for generating sub-graph level topic vectors and WSG-RNN for completing the corresponding sentences. To further encourage distinctive sentences, irredundant attention and inheriting attention were respectively deployed in SSG-RNN and WSG-RNN as additional hierarchical regularizations. We also designed a sentence-level loss to encourage the sequence of the generated sentences to be similar to that of the ground-truth paragraphs. Our extensive experiments demonstrated that the proposed model significantly outperforms the state-of-the-art methods and that it generates more coherent and distinctive paragraphs.

Acknowledgements. This work is partially supported by NTU-Alibaba Lab, MOE AcRF Tier 1, and the Monash FIT Start-up Grant.
REFERENCES
[1] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision. Springer, 382–398.
[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR. 6.
[3] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 65–72.
[4] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261 (2018).
[5] Moitreya Chatterjee and Alexander G Schwing. 2018. Diverse and coherent paragraph generation from images. In Proceedings of the European Conference on Computer Vision (ECCV). 729–744.
[6] Wenbin Che, Xiaopeng Fan, Ruiqin Xiong, and Debin Zhao. 2018. Paragraph generation network with visual relationship detection. In Proceedings of the 26th ACM International Conference on Multimedia. 1435–1443.
[7] Jiuxiang Gu, Jianfei Cai, Gang Wang, and Tsuhan Chen. 2018. Stack-captioning: Coarse-to-fine learning for image captioning. In Thirty-Second AAAI Conference on Artificial Intelligence.
[8] Jingwen Hou, Sheng Yang, and Weisi Lin. 2020. Object-level Attention for Aesthetic Rating Distribution Prediction. In Proceedings of the 28th ACM International Conference on Multimedia.
[9] Justin Johnson, Agrim Gupta, and Li Fei-Fei. 2018. Image generation from scene graphs. arXiv preprint (2018).
[10] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. 2016. DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4565–4574.
[11] Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. 2015. Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3668–3678.
[12] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[13] Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li Fei-Fei. 2017. A Hierarchical Approach for Generating Descriptive Image Paragraphs. In Computer Vision and Pattern Recognition (CVPR).
[14] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123, 1 (2017), 32–73.
[15] Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057 (2015).
[16] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2015. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493 (2015).
[17] Xiaodan Liang, Zhiting Hu, Hao Zhang, Chuang Gan, and Eric P Xing. 2017. Recurrent topic-transition GAN for visual paragraph generation. In Proceedings of the IEEE International Conference on Computer Vision. 3362–3371.
[18] Rui Lin, Shujie Liu, Muyun Yang, Mu Li, Ming Zhou, and Sheng Li. 2015. Hierarchical recurrent neural network for document modeling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 899–907.
[19] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision. Springer, 740–755.
[20] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 6. 2.
[21] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2018. Neural Baby Talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7219–7228.
[22] Yadan Luo, Zi Huang, Zheng Zhang, Ziwei Wang, Jingjing Li, and Yang Yang. 2019. Curiosity-Driven Reinforcement Learning for Diverse Visual Paragraph Generation. In Proceedings of the 27th ACM International Conference on Multimedia. 2341–2350.
[23] Luke Melas-Kyriazi, Alexander Rush, and George Han. 2018. Training for Diversity in Image Paragraph Captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 757–761.
[24] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 311–318.
[25] Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304 (2017).
[26] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732 (2015).
[27] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems. 91–99.
[28] Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In CVPR, Vol. 1. 3.
[29] Jiaxin Shi, Hanwang Zhang, and Juanzi Li. 2019. Explainable and explicit visual reasoning over scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8376–8384.
[30] Xiangxi Shi, Jianfei Cai, Shafiq Joty, and Jiuxiang Gu. 2019. Watch It Twice: Video Captioning with a Refocused Video Encoder. In Proceedings of the 27th ACM International Conference on Multimedia. 818–826.
[31] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4566–4575.
[32] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
[33] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In CVPR.
[34] Jing Wang, Yingwei Pan, Ting Yao, Jinhui Tang, and Tao Mei. 2019. Convolutional Auto-encoding of Sentence Topics for Image Paragraph Generation. In International Joint Conferences on Artificial Intelligence (IJCAI).
[35] Qingzhong Wang and Antoni B Chan. 2019. Describing like humans: on diversity in image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4195–4203.
[36] Siying Wu, Zheng-Jun Zha, Zilei Wang, Houqiang Li, and Feng Wu. 2019. Densely Supervised Hierarchical Policy-Value Network for Image Paragraph Generation. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19. International Joint Conferences on Artificial Intelligence Organization, 975–981. https://doi.org/10.24963/ijcai.2019/137
[37] Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. 2017. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5410–5419.
[38] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning. 2048–2057.
[39] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks?. In International Conference on Learning Representations. https://openreview.net/forum?id=ryGs6iA5Km
[40] Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai. 2019. Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 10685–10694.
[41] Xu Yang, Hanwang Zhang, and Jianfei Cai. 2019. Learning to Collocate Neural Modules for Image Captioning. arXiv preprint arXiv:1904.08608 (2019).
[42] Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. 2017. Boosting image captioning with attributes. In IEEE International Conference on Computer Vision, ICCV. 22–29.
[43] Guojun Yin, Lu Sheng, Bin Liu, Nenghai Yu, Xiaogang Wang, and Jing Shao. 2019. Context and Attribute Grounded Dense Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6241–6250.
[44] Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. 2016. Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4584–4593.
[45] Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. 2018. Neural motifs: Scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5831–5840.
[46] Zheng-Jun Zha, Daqing Liu, Hanwang Zhang, Yongdong Zhang, and Feng Wu. 2019. Context-aware visual policy network for fine-grained image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019).
