2201.11460v2 Compressed

The document describes a Relation Transformer model for scene graph generation. The model takes a one-stage approach to directly predict sparse scene graphs using visual features, without combining entities or labeling all possible predicates like most existing two-stage methods. Experiments on Visual Genome and Open Images V6 datasets show the superior performance and fast inference of the proposed model.

RelTR: Relation Transformer for Scene Graph Generation
Yuren Cong, Michael Ying Yang, and Bodo Rosenhahn

Abstract—Different objects in the same scene are more or less related to each other, but only a limited number of these relationships are noteworthy. Inspired by Detection Transformer, which excels in object detection, we view scene graph generation as a set prediction problem. In this paper, we propose an end-to-end scene graph generation model Relation Transformer (RelTR), which has an encoder-decoder architecture. The encoder reasons about the visual feature context while the decoder infers a fixed-size set of triplets subject-predicate-object using different types of attention mechanisms with coupled subject and object queries. We design a set prediction loss performing the matching between the ground truth and predicted triplets for the end-to-end training. In contrast to most existing scene graph generation methods, RelTR is a one-stage method that predicts sparse scene graphs directly only using visual appearance without combining entities and labeling all possible predicates. Extensive experiments on the Visual Genome and Open Images V6 datasets demonstrate the superior performance and fast inference of our model.

arXiv:2201.11460v2 [cs.CV] 11 Aug 2022

Index Terms—Scene Understanding, Scene Graph Generation, One-Stage, Visual Relationship Detection

• Yuren Cong and Bodo Rosenhahn are with Institute of Information Processing, Leibniz University Hannover, Germany. E-mail: {cong, rosenhahn}@tnt.uni-hannover.de.
• Michael Ying Yang is with Scene Understanding Group, University of Twente, The Netherlands. Email: [email protected].

1 INTRODUCTION

In scene understanding, a scene graph is a graph structure whose nodes are the entities that appear in the image and whose edges represent the relationships between entities [1]. Scene graph generation (SGG) is a semantic understanding task that goes beyond object detection and is closely linked to visual relationship detection [2]. At present, scene graphs have shown their potential in different vision-language tasks such as image retrieval [1], image captioning [3], [4], visual question answering (VQA) [5] and image generation [6], [7]. The task of scene graph generation has also received sustained attention in the computer vision community.

Most existing methods for generating scene graphs employ an object detector (e.g. FasterRCNN [8]) and use some specific neural networks to infer the relationships. The object detector generates proposals in the first stage, and the relationship classifier labels the edges between the object proposals in the second stage. Although these two-stage approaches have made incredible progress, they still suffer from the drawback that these models require a large number of trained parameters. If $n$ object proposals are given, the relationship inference network runs the risk of learning based on erroneous features provided by the detection backbone and has to predict $O(n^2)$ relationships (see Fig. 1). This manipulation may lead to the selection of triplets based on the confidence scores of object proposals rather than interest in relationships. Many previous works [9], [10], [11], [12], [13] have integrated semantic knowledge to improve their performance. However, these models face significant biases in relationship inference conditional on subject and object categories. They prefer to predict the predicates that are popular between particular subjects and objects, rather than those based on visual appearance.

Fig. 1: Different from most existing two-stage methods that label the dense relationships between all entity proposals, our one-stage approach can predict the pair proposals directly and generate a sparse scene graph with only visual appearance. [Figure panels: "Two-Stage Method" and "Our Method", each showing a scene graph over the entities woman, pants, umbrella, sidewalk, window and building.]

Recently, one-stage models have emerged in the field of object detection [14], [15], [16], [17]. They are attractive for their fast speed, low costs and simplicity. These are also the properties that are urgently needed for scene graph generation models. Detection Transformer (DETR) [18] views object detection as an end-to-end set prediction task and proposes a set-based loss via bipartite matching. This strategy can be extended to scene graph generation: based on a set of learned subject and object queries, a fixed number of triplets <subject-predicate-object> could be predicted by reasoning about the global image context and co-occurrences of entities. However, it is challenging to implement such an intuitive idea. The model needs to predict both the location and the category of the subject and object, and also consider their semantic connection. Furthermore, the direct bipartite matching is not competent to assign
ground truth information to relationship predictions. This paper aims to address these challenges.

We propose a novel end-to-end framework for scene graph generation, named Relation Transformer (RelTR). As shown in Fig. 1, RelTR can detect the triplet proposals with only visual appearance and predict subjects, objects, and their predicates concurrently. We evaluate RelTR on Visual Genome [19] and large-scale Open Images V6 [20]. The main contributions of this work are summarized as follows:

• In contrast to most existing advanced approaches that classify the dense relationships between all entity proposals from the object detection backbone, our one-stage method can generate a sparse scene graph by decoding the visual appearance with the subject and object queries learned from the data.
• RelTR generates scene graphs based on visual appearance only, which has fewer parameters and faster inference compared to other SGG models while achieving state-of-the-art performance.
• A set prediction loss is designed to perform the matching between the ground truth and predicted triplets with an IoU-based assignment strategy.
• With the decoupled entity attention, the triplet decoder of RelTR can improve the localization and classification of subjects and objects with the entity detection results from the entity decoder.
• Through comprehensive experiments, we explore which components are critical for the performance and analyze the working mechanism of learned subject and object queries.
• RelTR can be simply implemented. The source code and pretrained model are publicly available at https://github.com/yrcong/RelTR.

The remainder of the paper is structured as follows. In Section 2, we review related work in scene graph generation. Section 3 presents our proposed method. Experimental results of the proposed framework are discussed in Section 4. Section 5 concludes this paper.

2 RELATED WORK

2.1 Scene Graph Generation

Scene graphs were proposed in [1] for the task of image retrieval and have attracted increasing attention in the computer vision and natural language processing communities for different scene understanding tasks such as image captioning [21], [22], [23], VQA [24], [25] and image synthesis [26], [27]. The main purpose of scene graph generation (SGG) is to detect the relationships between objects in the scene. Many earlier works were limited to identifying specific types of relationships such as spatial relationships between entities [28], [29]. The universal visual relationship detection is introduced in [2]. Their inference framework, which detects entities in an image first and then determines dense relationships, was widely adopted in subsequent works, including their evaluation settings and metrics as well.

Now many models [30], [31], [32], [33], [34], [35], [36], [37] are available to generate scene graphs from different perspectives, and some works even extend the scene graph generation task from images to videos [38], [39], [40], [41]. Two-stage methods following [2] are currently dominating scene graph generation: several works [9], [30], [42], [43] use residual neural networks with the global context to improve the quality of the generated scene graphs. Xu et al. [42] use standard RNNs to iteratively improve the relationship prediction via message passing while MotifNet [9] stacks LSTMs to reason about the local and global context. Graph-based models [44], [45], [46], [47], [48] perform message passing and demonstrate good results. Factorizable Net [45] decomposes and combines the graphs to infer the relationships. The attention mechanism is integrated into different types of graph-based models such as Graph R-CNN [44], GPI [49] and ARN [50]. With the rise of Transformer [51], there are several attempts using Transformer to detect visual relationships and generate scene graphs in very recent works [34], [52], [53]. To improve the performance, many works are no longer limited to using only visual appearance. Semantic knowledge can be utilized as an additional feature to infer scene graphs [2], [9], [11], [54], [55]. Furthermore, statistical priors and knowledge graphs have been introduced in [11], [56], [57], [58], [59], [60].

Compared to the boom of two-stage approaches, one-stage approaches are still in their infancy and have the advantage of being simple, fast and easy to train. To the best of our knowledge, FCSGG [61] is currently the only one-stage scene graph generation framework that encodes objects as box center points and relationships as 2D vector fields. While the FCSGG model is lightweight and fast, it has a significant performance gap compared to two-stage methods. To fill this gap, we propose the Transformer-based RelTR, which uses only visual appearance and offers fewer parameters, faster inference speed, and higher accuracy. Distinct from the two-stage Transformer-based approaches [34], [52], [53] that utilize the attention mechanism to capture the context of the entity proposals from an object detector, RelTR can decode the global feature maps directly with the subject and object queries learned from the data to generate a sparse scene graph.

2.2 Transformer and Set Prediction

The original Transformer architecture was proposed in [51] for sequence transduction. Its encoder-decoder configuration and attention mechanism are also used to solve various computer vision tasks in different ways, e.g. object detection [18], human-object interaction (HOI) detection [62], and dynamic scene graph generation [39].

DETR [18] is a seminal work based on the Transformer architecture for object detection in recent years. It views detection as a set prediction problem. In the end-to-end training, with the object queries, DETR predicts a fixed-size set of object proposals and performs a bipartite matching between proposals and ground truth objects for the loss function. This concept of query-based set prediction quickly gained popularity in the computer vision community. Many tasks can be reformulated as set prediction problems, e.g. instance segmentation [63], image captioning [64] and multiple-object tracking [65]. Some works [66], [67] attempt to further improve object detection based on DETR.

HOI detection localizes and recognizes the relationships between humans and objects, whose result is a sub-graph
of the scene graph. Several HOI detection frameworks [62], [68] have been developed that use holistic triplet queries to directly infer a set of interactions. However, such a concept is difficult to generalize to the more complex task of scene graph generation. On large-scale datasets, such as Visual Genome [19] and Open Images [20], localization and classification of subjects and objects using only triplet queries may likely result in low accuracy. On the contrary, our proposed RelTR predicts the general relationships using coupled subject and object queries to achieve high accuracy.

Fig. 2: Given a set of learned subject and object queries coupled by subject and object encodings, RelTR captures the dependencies between relationships and reasons about the feature context and entity representations, respectively the output of the feature encoder and entity decoder, to directly compute a set of subject and object representations. A pair of subject and object representations with attention heat maps is decoded into a triplet <subject-predicate-object> by feed forward networks (FFNs). CSA, DVA and DEA stand for Coupled Self-Attention, Decoupled Visual Attention and Decoupled Entity Attention. $E_p$, $E_t$, $E_s$ and $E_o$ are the positional, triplet, subject and object encodings respectively. ⊕ indicates element-wise addition, while ⊗ indicates concatenation or split. [Figure components: CNN, Feature Encoder, Entity Decoder, Triplet Decoder with subject branch and object branch, prediction FFNs.]

3 METHOD

A scene graph $\mathcal{G}$ consists of entity vertices $\mathcal{V}$ and relationship edges $\mathcal{E}$. Different from previous works that detect a set of entity vertices and label the predicates between the vertices, we propose a one-stage model, Relation Transformer (RelTR), to directly predict a fixed-size set of $\langle \mathcal{V}_{sub} - \mathcal{E}_{prd} - \mathcal{V}_{obj} \rangle$ for scene graph generation.

3.1 Preliminaries

3.1.1 Transformer

We provide a brief review on Transformer and its attention mechanism. Transformer [51] has an encoder-decoder structure and consists of stacked attention functions. The input of a single-head attention is formed from queries $Q$, keys $K$ and values $V$ while the output is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V, \tag{1}$$

where $d_k$ is the dimension of $K$. In order to benefit from the information in different representation sub-spaces, multi-head attention is applied in Transformer. A complete attention function is a multi-head attention followed by a normalization layer with residual connection and denoted as $Att(\cdot)$ in this paper for simplicity.

3.1.2 DETR

This entity detection framework [18] is built upon the standard Transformer encoder-decoder architecture. First, a CNN backbone generates a feature map $Z_0 \in \mathbb{R}^{H \times W \times d}$ for an image. With the self-attention mechanism, the encoder computes a new feature context $Z \in \mathbb{R}^{HW \times d}$ using the flattened $Z_0$ and fixed positional encodings $E_p \in \mathbb{R}^{HW \times d}$. The decoder transforms $N_e$ entity queries into the entity representations $Q_e \in \mathbb{R}^{N_e \times d}$. The entity queries interact with each other to capture the entity context and extract visual features from $Z$.

For the end-to-end training, a set prediction loss for entity detection is proposed in DETR by assigning the ground truth entities to predictions. The ground truth set of size $N_e$ is padded with $\phi$ <background>, and a cost function $c_m(\hat{y}, y)$ is applied to compute the matching cost between a prediction $\hat{y}$ and ground truth entity $y = \{c, b\}$ where $c$, $b$ indicate the target class and box coordinates respectively. Given the cost matrix $C_{ent}$, the entity prediction-ground truth assignment is computed with the Hungarian algorithm [69]. The set prediction loss for entity detection can be presented as:

$$L_{entity} = \sum_{i=1}^{N_e} \left[ L_{cls} + \mathbb{1}_{\{c^{i} \neq \phi\}} L_{box} \right]. \tag{2}$$
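The assignment step that precedes the loss in Eq. (2) can be illustrated with a small sketch. This is not the authors' implementation: it replaces the Hungarian algorithm with a brute-force search over permutations (only feasible for tiny sets), and the function name `match_predictions`, the example cost values, and the zero cost given to the padded <background> slot are all assumptions made for illustration.

```python
from itertools import permutations


def match_predictions(cost):
    """Find the one-to-one assignment with minimal total matching cost.

    cost[i][j] is the cost of matching prediction i to ground-truth slot j.
    Brute force over all permutations: a stand-in for the Hungarian
    algorithm that returns the same optimum on small square matrices.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


# Three predictions; two ground-truth entities plus one padded
# <background> slot (last column, cost assumed to be 0 here).
cost = [
    [0.9, 0.2, 0.0],
    [0.1, 0.8, 0.0],
    [0.7, 0.6, 0.0],
]
assignment, total = match_predictions(cost)
# assignment[i] is the ground-truth slot given to prediction i.
```

With these hypothetical costs, prediction 0 receives ground-truth entity 1, prediction 1 receives entity 0, and prediction 2 is left with the padded <background> slot, which is the situation the indicator in Eq. (2) accounts for.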
In Eq. (2), $L_{cls}$ denotes the cross-entropy loss for label classification and $c^{i} \neq \phi$ means that <background> is not assigned to the $i$-th entity prediction. $L_{box}$ consists of $L_1$ loss and generalized IoU loss [70] for box regression.

3.2 RelTR Model

As shown in Fig. 2, our one-stage model RelTR has an encoder-decoder architecture, which directly predicts $N_t$ triplets without inferring the possible predicates between all entity pairs. It consists of the feature encoder extracting the visual feature context, the entity decoder capturing the entity representations from DETR [18] and the triplet decoder with the subject and object branches.

A triplet decoder layer contains three attention functions: coupled self-attention (CSA), decoupled visual attention (DVA) and decoupled entity attention (DEA). Given $N_t$ coupled subject and object queries, the triplet decoder layer reasons about the feature context $Z$ and entity representations $Q_e$ from the entity decoder layer to directly output the information of $N_t$ triplets without inferring the possible predicates between all entity pairs.

3.2.1 Subject and Object Queries

There are two types of learned embeddings, namely subject queries $Q_s \in \mathbb{R}^{N_t \times d}$ and object queries $Q_o \in \mathbb{R}^{N_t \times d}$, for the subject branch and object branch respectively. These $N_t$ pairs of subject and object queries are transformed into $N_t$ pairs of subject and object representations of size $d$. However, the subject query and the object query are not actually linked together in a query pair since the attention layers in the triplet decoder are permutation invariant. In order to distinguish between different triplets, the learnable triplet encodings $E_t \in \mathbb{R}^{N_t \times d}$ are introduced.

3.2.2 Coupled Self-Attention (CSA)

Coupled self-attention captures the context between $N_t$ triplets, and the dependencies between all subjects and objects. Although the triplet encodings $E_t$ are already available, we still need subject encodings $E_s$ and object encodings $E_o$ of the same size as $E_t$ to inject the semantic concepts of <subject> and <object> in coupled self-attention. Both $E_s$ and $E_o$ are randomly initialized and learned in the training. The subject and object queries are encoded and the output of CSA can be formulated as:

$$\begin{aligned} Q = K &= [Q_s + E_s + E_t,\; Q_o + E_o + E_t] \\ [Q_s, Q_o] &= Att_{CSA}(Q, K, [Q_s, Q_o]), \end{aligned} \tag{3}$$

where $[\,,]$ indicates the unordered concatenation operation and the updated embeddings keep the original symbols unchanged for brevity. The output of CSA $[Q_s, Q_o]$ is decoupled into $Q_s$ and $Q_o$ which continue to be used for the subject branch and the object branch, respectively.

3.2.3 Decoupled Visual Attention (DVA)

Decoupled visual attention concentrates on extracting visual features from the feature context $Z$. Decoupled means that the computations of subject and object representations are independent of each other, which is distinct from CSA. In the subject branch, $Q_s \in \mathbb{R}^{N_t \times d}$ are updated through their interaction with the feature context $Z \in \mathbb{R}^{HW \times d}$. The feature context combines with fixed positional encodings $E_p \in \mathbb{R}^{HW \times d}$ again in DVA. The updated subject representations containing visual features are presented as:

$$\begin{aligned} Q &= Q_s + E_t, \quad K = Z + E_p \\ Q_s &= Att^{(sub)}_{DVA}(Q, K, Z). \end{aligned} \tag{4}$$

The same operation is performed in the object branch. In the multi-head attention operation, $N_t$ attention heat maps $M_s \in \mathbb{R}^{N_t \times HW}$ are computed. We also adopt the reshaped heat maps as a spatial feature for predicate classification.

3.2.4 Decoupled Entity Attention (DEA)

Decoupled entity attention is performed as the bridge between entity detection and triplet detection. Entity representations $Q_e \in \mathbb{R}^{N_e \times d}$ can provide localization and classification information with higher quality due to the fact that they do not have semantic restrictions like those between subject and object representations. The motivation for introducing DEA is the expectation that subject and object representations learn more accurate localization and classification information from entity representations through the attention mechanism. $Q_s$ and $Q_o$ are finally updated in a triplet decoder layer as follows:

$$\begin{aligned} Q_s &= Att^{(sub)}_{DEA}(Q_s + E_t, Q_e, Q_e) \\ Q_o &= Att^{(obj)}_{DEA}(Q_o + E_t, Q_e, Q_e), \end{aligned} \tag{5}$$

where $Att^{(sub)}_{DEA}$ and $Att^{(obj)}_{DEA}$ are the decoupled entity attention modules in the subject and object branch. The outputs of DEA are processed by a feed-forward network followed by a normalization layer with residual connection. The feed-forward network (FFN) consists of two linear transformation layers with ReLU activation.

Fig. 3: Left: Architecture of the feed-forward network for subject/object box regression. Right: Architecture of the convolutional mask head.

3.2.5 Final Inference

A complete triplet includes the predicate label and the class labels as well as the bounding box coordinates of the subject and object. The subject representations $Q_s$ and object representations $Q_o$ from the last decoder layer are transformed
by two linear projection layers into entity class distributions. We utilize two independent feed-forward networks with the same structure to predict the height, width and normalized center coordinates of subject and object boxes. The architecture is shown in Fig. 3 (left). A pair of subject attention heat map $M_s$ and object attention heat map $M_o$ from DVA modules in the last decoder layer is concatenated and resized to $2 \times 28 \times 28$. The convolutional mask head shown in Fig. 3 (right) converts the attention heat maps to spatial feature vectors. The final predicate labels are predicted by a two-layer perceptron with the subject representations, object representations and spatial feature vectors.

3.3 Set Prediction Loss for Triplet Detection

We design a set prediction loss for triplet detection by extending the entity detection set prediction loss in Eq. 2. We present a triplet prediction as $\langle \hat{y}_{sub}, \hat{c}_{prd}, \hat{y}_{obj} \rangle$ where $\hat{y}_{sub} = \{\hat{c}_{sub}, \hat{b}_{sub}\}$ and $\hat{y}_{obj} = \{\hat{c}_{obj}, \hat{b}_{obj}\}$ while a ground truth is denoted as $\langle y_{sub}, c_{prd}, y_{obj} \rangle$. The predicted subject, predicate and object labels are respectively denoted as $\hat{c}_{sub}$, $\hat{c}_{prd}$ and $\hat{c}_{obj}$ while the predicted box coordinates of the subject and object are denoted as $\hat{b}_{sub}$ and $\hat{b}_{obj}$.

When $N_t$ relationships are predicted and $N_t$ is larger than the number of triplets in the image, the ground truth set of triplets is padded with $\Phi$ <background-no relation-background>. The pair-wise matching cost $c_{tri}$ between a predicted triplet and a ground truth triplet consists of the subject cost $c_m(\hat{y}_{sub}, y_{sub})$, object cost $c_m(\hat{y}_{obj}, y_{obj})$ and predicate cost $c_m(\hat{c}_{prd}, c_{prd})$. The prediction $\hat{y} = \{\hat{c}, \hat{b}\}$ contains the predicted class $\hat{c}$ including the class probabilities $\hat{p}$ and the predicted box coordinates $\hat{b}$ while the ground truth $y = \{c, b\}$ contains the ground truth class $c$ and the ground truth box $b$. For the predicate, we only have the predicted class $\hat{c}_{prd}$ and ground truth class $c_{prd}$.

The subject/object cost is determined by the predicted entity class probability and the predicted bounding box while the predicate cost is determined only by the predicted predicate class probability. We define the predicted probability of class $c$ as $\hat{p}(c)$. We adopt the class cost function from [66] which can be formulated as:

$$\begin{aligned} c^{+}_{cls}(\hat{c}, c) &= \alpha \cdot (1 - \hat{p}(c))^{\gamma} \cdot (-\log(\hat{p}(c) + \varepsilon)) \\ c^{-}_{cls}(\hat{c}, c) &= (1 - \alpha) \cdot \hat{p}(c)^{\gamma} \cdot (-\log(1 - \hat{p}(c) + \varepsilon)) \\ c_{cls}(\hat{c}, c) &= c^{+}_{cls}(\hat{c}, c) - c^{-}_{cls}(\hat{c}, c), \end{aligned} \tag{6}$$

where $\alpha$, $\gamma$ and $\varepsilon$ are respectively set to 0.25, 2 and $10^{-8}$. The box cost for the subject and object is computed using $L_1$ loss and generalized IoU loss [70]:

$$c_{box}(\hat{b}, b) = 5 L_1(\hat{b}, b) + 2 L_{GIoU}(\hat{b}, b). \tag{7}$$

The cost function $c_m$ can be presented as:

$$c_m(\hat{y}, y) = c_{cls}(\hat{c}, c) + \mathbb{1}_{\{b \in y\}} c_{box}(\hat{b}, b), \tag{8}$$

where $b \in y$ denotes that the ground truth includes the box coordinates (only for the subject/object cost). The cost between a triplet prediction and a ground truth triplet is computed as:

$$c_{tri} = c_m(\hat{y}_{sub}, y_{sub}) + c_m(\hat{y}_{obj}, y_{obj}) + c_m(\hat{c}_{prd}, c_{prd}). \tag{9}$$

Fig. 4: The ground truth is assigned to Proposal A while <background-no relation-background> is assigned to Proposal B. However, <background> should not be assigned to the subject of Proposal C and the subject as well as object of Proposal D. BG denotes <background> while X indicates no assignment. [Figure panels: GT: bus-near-bush; Proposal A: bus-near-bush; Proposal B: BG-no relation-BG; Proposal C: X-no relation-BG; Proposal D: X-no relation-X.]

Given the triplet cost matrix $C_{tri}$, the Hungarian algorithm is executed for the bipartite matching and each ground truth triplet is assigned to a prediction. However, <background-no relation-background> should not be assigned to all predictions that do not match the ground truth triplets. After several iterations of training, RelTR is likely to output the triplet proposals in four possible ways, as demonstrated in Fig. 4. Assigning ground truth to Proposal A and <background-no relation-background> to Proposal B are two clear cases. For Proposal C, <background> should not be assigned to the subject due to the poor object prediction. Furthermore, <background> should not be assigned to the subject and object of Proposal D due to the fact that there is a better candidate, Proposal A. To solve this problem, we integrate an IoU-based assignment strategy in our set prediction loss: for a triplet prediction, if the predicted subject or object label is correct, and the IoU of the predicted box and ground truth box is greater than or equal to the threshold $T$, the loss function does not compute a loss for the subject or object. The set prediction loss for triplet detection is formulated as:

$$\begin{aligned} L_{sub} &= \sum_{i=1}^{N_t} \Theta \left[ L_{cls} + \mathbb{1}_{\{c^{i}_{sub} \neq \phi\}} L_{box} \right] \\ L_{obj} &= \sum_{i=1}^{N_t} \Theta \left[ L_{cls} + \mathbb{1}_{\{c^{i}_{obj} \neq \phi\}} L_{box} \right] \\ L_{triplet} &= L_{sub} + L_{obj} + L^{prd}_{cls}, \end{aligned} \tag{10}$$

where $L^{prd}_{cls}$ is the cross-entropy loss for predicate classification. $\Theta$ is 0 when <background> is assigned to the subject/object but the label is predicted correctly and the box overlaps with the ground truth with IoU $\geq T$; in other cases, $\Theta$ is 1. The total loss function is computed as:

$$L_{total} = L_{entity} + L_{triplet}. \tag{11}$$
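The matching cost of Eqs. (6) to (9) and the gate $\Theta$ of Eq. (10) can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the released RelTR code: the function names, the corner-format boxes `(x1, y1, x2, y2)`, the choice of $L_{GIoU} = 1 - \mathrm{GIoU}$, and the boolean arguments of `theta` are assumptions. The constants 0.25, 2 and $10^{-8}$ and the default threshold 0.7 are the values stated in the text.

```python
import math

ALPHA, GAMMA, EPS = 0.25, 2.0, 1e-8  # values given for Eq. (6)


def class_cost(p):
    """Eq. (6): class cost, with p the predicted probability of the true class."""
    pos = ALPHA * (1.0 - p) ** GAMMA * (-math.log(p + EPS))
    neg = (1.0 - ALPHA) * p ** GAMMA * (-math.log(1.0 - p + EPS))
    return pos - neg


def box_iou_and_giou(a, b):
    """IoU and generalized IoU of two non-degenerate boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    iou = inter / union
    return iou, iou - (hull - union) / hull


def box_cost(pred, gt):
    """Eq. (7): 5 * L1 + 2 * L_GIoU, assuming L_GIoU = 1 - GIoU."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gt))
    _, giou = box_iou_and_giou(pred, gt)
    return 5.0 * l1 + 2.0 * (1.0 - giou)


def triplet_cost(p_sub, p_obj, p_prd, sub_box, obj_box, gt_sub_box, gt_obj_box):
    """Eqs. (8)-(9): subject + object + predicate cost (no box term for the predicate)."""
    c_sub = class_cost(p_sub) + box_cost(sub_box, gt_sub_box)
    c_obj = class_cost(p_obj) + box_cost(obj_box, gt_obj_box)
    return c_sub + c_obj + class_cost(p_prd)


def theta(assigned_background, label_correct, iou, threshold=0.7):
    """Gate of Eq. (10): 0 skips the subject/object loss, 1 keeps it."""
    if assigned_background and label_correct and iou >= threshold:
        return 0
    return 1


# Hypothetical example: one accurate and one poor triplet prediction.
gt_sub, gt_obj = (0.1, 0.1, 0.5, 0.5), (0.4, 0.4, 0.9, 0.9)
good = triplet_cost(0.9, 0.8, 0.7, gt_sub, gt_obj, gt_sub, gt_obj)
bad = triplet_cost(0.1, 0.2, 0.1, (0.6, 0.6, 0.9, 0.9), (0.0, 0.0, 0.2, 0.2), gt_sub, gt_obj)
```

In this sketch an accurate prediction yields a lower `triplet_cost` than a poor one, so it is preferred by the matcher, while `theta` returns 0 only for the case described for Proposals C and D: a <background> assignment whose label is nevertheless correct and whose box overlaps a ground truth box with IoU at or above the threshold.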
TABLE 1: Comparison with state-of-the-art scene graph generation methods on Visual Genome [19] test set. These methods are divided into two-stage and one-stage. The best numbers in two-stage methods are shown in bold, and the best numbers in one-stage methods are shown in italic. Models that use prior knowledge are represented in blue, to distinguish them from visual-based models. The inference speed (FPS) of different models is tested on the same RTX 2080Ti with batch size 1. (↑: higher is better; ↓: lower is better.)

Type | Method | AP50 | PredCLS ↑ R@20 | R@50 | mR@20 | mR@50 | SGCLS ↑ R@20 | R@50 | mR@20 | mR@50 | SGDET ↑ R@20 | R@50 | mR@20 | mR@50 | #params (M) ↓ | FPS ↑
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
two-stage | MOTIFS [9] | 20.0 | 58.5 | 65.2 | 10.8 | 14.0 | 32.9 | 35.8 | 6.3 | 7.7 | 21.4 | 27.2 | 4.2 | 5.7 | 240.7 | 6.6
two-stage | KERN [10] | 20.0 | 59.1 | 65.8 | - | 17.7 | 32.2 | 36.7 | - | 9.4 | 22.3 | 27.1 | - | 6.4 | 405.2 | 4.6
two-stage | GB-Net [12] | - | - | 66.6 | - | 19.3 | - | 38.0 | - | 9.6 | - | 26.4 | - | 6.1 | - | -
two-stage | RelDN [56] | - | 66.9 | 68.4 | - | - | 36.1 | 36.8 | - | - | 21.1 | 28.3 | - | - | 615.6 | 4.7
two-stage | VCTree-TDE [59] | 28.1 | 39.1 | 49.9 | 17.2 | 23.3 | 22.8 | 28.8 | 8.9 | 11.8 | 14.3 | 19.6 | 6.3 | 9.3 | 360.8 | 1.2
two-stage | GPS-Net [48] | - | 67.6 | 69.7 | 17.4 | 21.3 | 41.8 | 42.3 | 10.0 | 11.8 | 22.3 | 28.9 | 6.9 | 8.7 | - | -
two-stage | BGNN [46] | 29.0 | - | 59.2 | - | 30.4 | - | 37.4 | - | 14.3 | 23.3 | 31.0 | 7.5 | 10.7 | 341.9 | 2.3
two-stage | BGT-Net [52] | 28.1 | 60.9 | 67.1 | 16.8 | 20.6 | 41.7 | 45.9 | 10.4 | 12.8 | 25.5 | 32.8 | 5.7 | 7.8 | - | -
two-stage | IMP [42] | - | 58.5 | 65.2 | - | 9.8 | 31.7 | 34.6 | - | 5.8 | 14.6 | 20.7 | - | 3.8 | 203.8 | 10.0
two-stage | CISC [31] | - | 42.1 | 53.2 | - | - | 23.3 | 27.8 | - | - | 7.7 | 11.4 | - | - | - | -
two-stage | G-RCNN [44] | 24.8 | - | 54.2 | - | - | - | 29.6 | - | - | - | 11.4 | - | - | - | -
one-stage | FCSGG [61] | 28.5 | 33.4 | 41.0 | 4.9 | 6.3 | 19.0 | 23.5 | 2.9 | 3.7 | 16.1 | 21.3 | 2.7 | 3.6 | 87.1 | 8.4
one-stage | RelTR (ours) | 26.4 | 63.1 | 64.2 | 20.0 | 21.2 | 29.0 | 36.6 | 7.7 | 11.4 | 21.2 | 27.5 | 6.8 | 10.8 | 63.7 | 16.1
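As a reading aid for the R@k and mR@k columns of Table 1, the following sketch shows how recall-style metrics of this kind are commonly computed. It is a simplified illustration, not the evaluation code used for the table: the function names are hypothetical, matching is done on labels only (box IoU checks are omitted), and the example triplets are made up.

```python
def top_k_triplets(predictions, k):
    """Return the set of the k highest-scoring (subject, predicate, object) triplets."""
    ranked = sorted(predictions, key=lambda item: item[0], reverse=True)
    return {triplet for _, triplet in ranked[:k]}


def recall_at_k(predictions, ground_truth, k):
    """R@k: fraction of ground-truth triplets found among the top-k predictions."""
    if not ground_truth:
        return 0.0
    top_k = top_k_triplets(predictions, k)
    return sum(1 for gt in ground_truth if gt in top_k) / len(ground_truth)


def mean_recall_at_k(predictions, ground_truth, k):
    """mR@k: recall computed per predicate class, then averaged over classes."""
    top_k = top_k_triplets(predictions, k)
    hits, totals = {}, {}
    for gt in ground_truth:
        predicate = gt[1]
        totals[predicate] = totals.get(predicate, 0) + 1
        hits[predicate] = hits.get(predicate, 0) + (1 if gt in top_k else 0)
    if not totals:
        return 0.0
    return sum(hits[p] / totals[p] for p in totals) / len(totals)


# Hypothetical predictions as (score, triplet) pairs.
predictions = [
    (0.9, ("woman", "wearing", "pants")),
    (0.8, ("woman", "on", "sidewalk")),
    (0.3, ("window", "on", "building")),
    (0.2, ("woman", "holding", "umbrella")),
]
ground_truth = [t for _, t in predictions]

r_at_3 = recall_at_k(predictions, ground_truth, 3)
mr_at_3 = mean_recall_at_k(predictions, ground_truth, 3)
```

Because mR@k averages over predicate classes, missing the single "holding" triplet costs a full class here, whereas R@k only counts it as one of four triplets. This is why mR@k is more sensitive to rare predicates.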

3.4 Post-processing

Unlike two-stage methods that organize $N$ entities into $N(N-1)$ subject-object pairs, our method simultaneously detects subjects and objects while predicting a fixed number of triplets. As a result, our approach lacks the constraint that the subject and object cannot be the same entity. It turns out that our model sometimes outputs a kind of triplet where the subject and object are the same entity with an ambiguous predicate (see Fig. 5 for example). In post-processing, if the subject and object are the same entity (determined by the labels and the bounding boxes' IoU), the triplet is removed.

Fig. 5: Triplets in which the subject (blue) and object (orange) are the same entity are removed in post-processing. The predicates are usually ambiguous in such cases. [Figure panel labels: sign-on-sign, building-near-building, window-in-window, sidewalk-on-sidewalk, woman-wearing-woman.]

4 EXPERIMENTS

4.1 Datasets and Evaluation Settings

4.1.1 Visual Genome

We followed the widely used Visual Genome [19] split proposed by [42]. There are a total of 108k images in the dataset with 150 entity categories and 50 predicate categories. 70% of the images are divided into the training dataset and the remaining 30% are used as the test set. 5k images are further drawn from the training set for validation. There are three standard evaluation settings: (1) Predicate classification (PredCLS): predict predicates given ground truth categories and bounding boxes of entities. (2) Scene graph classification (SGCLS): predict predicates and entity categories given ground truth boxes. (3) Scene graph detection (SGDET): predict categories, bounding boxes of entities and predicates. Distinct from two-stage methods, the ground truth bounding boxes and categories of entities cannot be given directly. Therefore, we assign the ground truth information to the matched triplet proposals when evaluating RelTR on PredCLS/SGCLS. Recall@k (R@k), mean Recall@k (mR@k) and zero-shot Recall@k (zsR@k) are adopted to evaluate the algorithm performance [2], [35]. To better demonstrate the model performance on the imbalanced VG dataset, the relationship categories are split into three groups based on the number of instances in training [46]: head (> 10k), body (0.5k to 10k) and tail (< 0.5k).

4.1.2 Open Images V6

We also conduct experiments on the large-scale Open Images V6 [20] consisting of 126k training images, 5.3k test images and 1.8k validation images. It involves 288 entity categories and 30 predicate categories. We adopt the standard evaluation metrics used in the Open Images Challenge. Recall@50, weighted mean average precision (AP) of relationship detection $\mathrm{wmAP}_{rel}$, and phrase detection $\mathrm{wmAP}_{phr}$ are calculated. The final score is computed as: $\mathrm{score}_{wtd} = 0.2 \times \mathrm{R@50} + 0.4 \times \mathrm{wmAP}_{rel} + 0.4 \times \mathrm{wmAP}_{phr}$.

4.2 Implementation Details

We adopt the same hyperparameters in our experiments on Visual Genome and Open Images. We train RelTR end-to-end from scratch for 150 epochs on 8 RTX 2080 Ti GPUs with AdamW [71], setting the batch size to 2 per GPU and weight decay to $10^{-4}$, and clipping gradients whose norm exceeds 0.1. The initial learning rates of the Transformer and ResNet-50 backbone are set to $10^{-4}$ and $10^{-5}$ respectively and the learning rates are multiplied by 0.1 after 100 epochs. In the training we also use auxiliary losses [72] for the triplet decoder as [18], [66] did. By default, RelTR has 6 encoder layers and 6 triplet decoder layers. The number of triplet decoder layers and the number of entity decoder layers are set to be the same. The multi-head attention modules with 8 heads in our model are trained with dropout of 0.1. For all experiments, the model dimension $d$ is set to 256. If not specifically stated, the number of entity queries $N_e$ and coupled queries $N_t$ are respectively set to 100 and 200 while the IoU threshold in the triplet assignment is 0.7. For fair comparison, inference speeds (FPS) of all the reported SGG models are evaluated on a single RTX 2080 Ti with the same hardware configuration. For computing the inference speed (FPS), we average over all the test images, where for each
[Figure 6 plot: per-predicate mRecall@100 bars for RelTR and BGNN (left axis, 0 to 60) with the relationship frequency ratio overlaid (right axis, 0.00 to 0.35); predicates are grouped into Head, Body and Tail.]
Fig. 6: SGDET-R@100 for each relationship category on the VG dataset. Long-tail groups are shown in different colors. RelTR almost always performs better than BGNN [46] from "of" to "in front of". The standard deviations of R@100 are 11.51 (ours) and 14.15 (BGNN) respectively, which indicates that RelTR is less biased.

image, the inference time is measured from the moment the image is given as input until the triplet proposals are output. The time cost for evaluating the whole dataset is not included.

4.3 Quantitative Results and Comparison

4.3.1 Visual Genome
We compare scores of R@K and mR@K, number of parameters and inference speed on SGDET (FPS) with several two-stage models and the one-stage model FCSGG [61] in Table 1. Models that use not only visual appearance but also prior knowledge (e.g. semantic and statistic information) are represented in blue, to distinguish them from visual-based models. Overall, the two-stage models have higher scores of R@K and mR@K than the one-stage models, while they have more parameters and slower inference speed. This phenomenon also occurs between the models using prior information and the visual-based models. Note that the performance of the entity detectors in the two-stage models has a significant impact on the models' scores, especially on SGDET. Our model achieves R@50 = 27.5 and mR@50 = 10.8 on SGDET, which is respectively 5.1 and 6.2 points higher than another one-stage model, FCSGG [61]. In addition, RelTR has fewer parameters and a faster inference speed. Our model is also competitive compared with recent two-stage models, and outperforms state-of-the-art visual-based methods. Although the R@20/R@50 score of RelTR is 2.1/3.5 points lower than that of BGNN [46], the performance of RelTR on mR@50 is state-of-the-art. Furthermore, RelTR is a light-weight model, which has only 63.7M parameters and an inference speed of 16.6 FPS, ca. 7 times faster than BGNN. This allows RelTR to be used in a wide range of practical applications. For PredCLS and SGCLS, the ground truth bounding boxes and labels of entities cannot be given to RelTR directly. Therefore, we replace the predicted boxes and labels of the matched triplet proposals by the ground truth information. However, it is not possible to capture the exact features of the given boxes by RoIAlign as in two-stage methods. RelTR uses the features of detected proposals to predict the labels and achieves R@50 = 64.2 and mR@50 = 21.2 on PredCLS, and R@50 = 36.6 and mR@50 = 11.4 on SGCLS.

Table 2 reports R@K, mR@K and zsR@K on SGDET for state-of-the-art methods. Compared with the models without Total Direct Effect (TDE) [59], RelTR has the best performance on mR@K and zsR@K. With TDE, zsR@K and mR@K of the two-stage methods are improved whereas R@K decreases significantly. Our model performs well on all three recall metrics.

Method            R@20   R@50   mR@20  mR@50  zsR@50  zsR@100  Avg.
Motifs-TDE [59]   12.4   16.9   5.8    8.2    2.3     2.9      8.1
VTransE-TDE [59]  13.5   18.7   6.3    8.6    2.0     2.7      8.6
VCTree-TDE [59]   14.0   19.4   6.9    9.3    2.6     3.2      9.2
Motifs [9]        21.4   27.2   4.2    5.7    0.1     0.2      9.8
VTransE [73]      23.0   29.7   3.7    5.0    0.8     1.5      10.6
VCTree [35]       22.0   27.9   5.2    6.9    0.2     0.7      10.5
FCSGG [61]        16.1   21.3   2.7    3.6    1.0     1.4      7.7
RelTR (ours)      21.2   27.5   6.8    10.8   1.8     2.4      11.8

TABLE 2: R@K, mR@K and zsR@K performance comparison on SGDET. The last column is the average of the first six columns. Although the models with TDE have better performance on zsR@K, their R@K drops significantly. Our visual-based model performs well and in a balanced way on the three metrics.

To further analyze the model performance on the imbalanced Visual Genome dataset, we compute mR@100 for each relationship group on SGDET in Table 3. Our method outperforms the prior works [46], [48], [59] on the body group, while mR@100 on the tail group is similar to that of the best method, BGNN [46]. RelTR achieves the highest mR@100 over all relationship categories. The results for each relationship category are shown in Fig. 6. From "of" to "in front of", RelTR almost always performs better than BGNN [46], while its recall on the three most frequent predicates is lower. This could explain why R@K of RelTR is not very high although the qualitative results are good and the relationships in the generated scene graphs are semantically diverse.
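The group-wise analysis can be sketched as follows. This is a minimal illustration, not the evaluation code used in the paper: the recall values, the training counts and the head/body thresholds (`head_min`, `body_min`) are hypothetical placeholders, since the exact partition is not stated in this section.

```python
from statistics import mean, pstdev


def group_of(train_count: int, head_min: int = 10_000, body_min: int = 500) -> str:
    """Assign a predicate to a long-tail group by its training frequency.

    The two thresholds are assumptions made for this sketch.
    """
    if train_count > head_min:
        return "head"
    if train_count >= body_min:
        return "body"
    return "tail"


def grouped_mean_recall(recall: dict, train_count: dict) -> dict:
    """Average the per-predicate recall inside each long-tail group."""
    buckets = {"head": [], "body": [], "tail": []}
    for predicate, value in recall.items():
        buckets[group_of(train_count[predicate])].append(value)
    return {group: mean(values) for group, values in buckets.items() if values}


# Hypothetical per-predicate recall@100 (in %) and training counts.
recall = {"on": 30.0, "has": 40.0, "riding": 45.0,
          "using": 15.0, "mounted on": 10.0, "says": 0.0}
counts = {"on": 700_000, "has": 270_000, "riding": 8_000,
          "using": 1_900, "mounted on": 300, "says": 50}

groups = grouped_mean_recall(recall, counts)
overall_mr = mean(recall.values())    # mean recall over all predicates
spread = pstdev(recall.values())      # population std; lower suggests less bias
print(groups, round(overall_mr, 2), round(spread, 2))
```

Averaging per predicate rather than per instance is what makes mR@K sensitive to the body and tail groups: a frequent predicate such as "on" counts as much as a rare one such as "says".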
[Figure 7 plots: two bar charts comparing RelTR and BGNN per predicate on Open Images V6, one for the average precision of relationships and one for the average precision of phrases (left axes, 0 to 60), each with the relationship frequency ratio overlaid (right axes, 0.00 to 0.40); predicates are grouped into Head, Body and Tail.]
Fig. 7: Average precision of relationships and phrases for RelTR and BGNN on Open Images V6. The distribution of relationships in the test set is shown with the black dashed line. The average precision of relationships of RelTR is higher than that of BGNN for 7 of the top-10 high-frequency predicates, while BGNN generally performs better than RelTR for the low-frequency predicates ("skateboard" to "ski"). We conjecture that this is due to the prior knowledge used in BGNN. The overall trend of $\text{AP}_{phr}$ is the same as that of $\text{AP}_{rel}$, except for "hang".

Method            mR@100  Head   Body   Tail
GPS-NET [48]      9.8     30.8   8.5    3.9
VCTree-TDE [59]   11.1    24.7   12.2   1.8
BGNN [46]         12.6    34.0   12.9   6.0
RelTR (ours)      12.6    30.6   14.4   5.0

TABLE 3: SGDET-mR@100 for the head, body and tail groups, which are partitioned according to the number of relationship instances in the training set.

4.3.2 Open Images V6
We train RelTR on the Open Images V6 dataset and compare it with two-stage methods, as shown in Table 4. Although R@50 of RelTR is 3.68 points lower than that of the best two-stage method VCTree [35], RelTR has the highest $\text{wmAP}_{rel}$ (0.58 points higher than BGNN [46]) and $\text{wmAP}_{phr}$ (3.15 points higher than VCTree [35]). The final weighted score of RelTR is 1.02 points higher than that of the best two-stage model. The inference speed of RelTR is 16.3 FPS, ca. 6 and 9 times faster than BGNN and VCTree, respectively.

Method         R@50 ↑  wmAP_rel ↑  wmAP_phr ↑  score_wtd ↑  FPS ↑
RelDN [56]     73.08   32.16       33.39       40.84        5.3
VCTree [35]    75.34   33.21       34.31       41.97        1.9
G-RCNN [44]    74.51   33.15       34.21       41.84        -
Motifs [9]     71.63   29.91       31.59       38.93        7.4
GPS-NET [48]   74.81   32.85       33.98       41.69        -
BGNN [46]      74.98   33.51       34.15       41.69        2.9
RelTR (ours)   71.66   34.19       37.46       42.99        16.3

TABLE 4: Comparison with two-stage methods on the Open Images V6 [20] test set. The numbers of these state-of-the-art methods are taken from [46].

To further demonstrate the performance of RelTR, we compare the average precision (AP) of relationships and phrases for RelTR and BGNN [46] on Open Images V6 (see Fig. 7). Although R@50 of RelTR is lower, RelTR outperforms BGNN on the weighted mean AP of relationships and phrases. The distribution of relationships in the Open Images V6 test set is also shown with the black dashed lines. There are 9 predicates ("kiss" to "handshake") that do not appear in the test set. $\text{AP}_{rel}$ and $\text{AP}_{phr}$ of RelTR are higher than those of BGNN for 7 of the top-10 high-frequency predicates. For the low-frequency predicates ("skateboard" to "ski"), BGNN generally performs better than RelTR. We conjecture that this is due to the prior knowledge used in BGNN.

4.4 Ablation Studies
In the ablation studies, we consider how the following aspects influence the final performance. All the ablation studies are performed on the Visual Genome dataset [19].

4.4.1 Number of Layers
The feature encoder layers and triplet decoder layers have different effects on the performance, model size and inference speed. When the number of encoder layers varies, we keep the number of triplet decoder layers fixed at 6, and vice versa. When there is no encoder layer, the decoder reasons about the feature map without context and R@50 drops significantly, by 4.2 points (see Table 5). Adding an encoder layer introduces fewer parameters than adding a triplet decoder layer. Because the decoder is indispensable for scene graph generation, the minimum number of triplet decoder layers in our experiments is set to 3. When the number of triplet decoder layers is increased to 6, the improvements in R@20, R@50 and R@100 are obvious. In contrast, there is a small decrease in performance when the number of triplet decoder layers is increased to 9. We conjecture that this may be caused by overfitting.

4.4.2 Module Effectiveness
To verify the contribution of each module to the overall effect, we deactivate different modules and the results are
shown in Table 6. We first ablate the entire triplet decoder (first row) and combine the 32 entity proposals provided by the entity decoder into 32 × 31 triplet proposals, as a two-stage method would. The feature vectors are concatenated and a 3-layer perceptron is used to predict the relationships. This can also be seen as a simple visual-based baseline with DETR [18] as the detector. Without the triplet decoder, the R@50 score drops to 17.8 due to the simplicity of the model. This indicates that predicting relationships from visual information alone is a challenge even for two-stage methods. We activate the Coupled Self-Attention (CSA) and Decoupled Visual Attention (DVA) simultaneously since they are indispensable to each other (second row). Although the triplet decoder is not yet complete, the main modules CSA and DVA already perform very well. The model has 43% more parameters than the simple baseline, but it still reaches 77% of the baseline inference speed (FPS) thanks to the sparse graph generation method. Then we ablate Decoupled Entity Attention (DEA) and the mask head for the attention heat maps from the framework. Table 6 demonstrates that the DEA modules help the model to predict higher-quality subjects and objects, and increase R@50 by 0.7 (with the mask head). In comparison, the improvement offered by the mask head is very limited. We hypothesize that the spatial features are already implicitly encoded in the visual features generated by the DVA modules.

Encoder layers  Triplet decoder layers  R@20   R@50   R@100  #params (M)  FPS
0               6                       17.6   23.3   27.1   55.8         18.0
3               6                       20.5   26.6   29.5   59.7         17.1
9               6                       21.4   27.7   30.8   67.6         15.5
6               6                       21.2   27.5   30.7   63.7         16.1
6               3                       19.5   25.9   29.8   48.7         19.6
6               9                       21.0   27.1   30.1   78.7         13.8

TABLE 5: Impact of the number of encoder and decoder layers on the performance (SGDET), model size and inference speed.

CSA+DVA  DEA  Mask head  R@20   R@50   R@100  #params (M)  FPS
✗        ✗    ✗          11.8   17.8   22.4   41.5         23.1
✓        ✗    ✗          20.6   26.6   29.7   59.3         17.7
✓        ✗    ✓          20.8   26.8   30.1   60.5         17.3
✓        ✓    ✗          21.0   27.2   30.2   62.5         16.7
✓        ✓    ✓          21.2   27.5   30.7   63.7         16.1

TABLE 6: Coupled Self-Attention (CSA) + Decoupled Visual Attention (DVA), Decoupled Entity Attention (DEA) and the mask head for the attention heat maps are isolated separately from the framework (results on SGDET). The first row indicates that the entire triplet decoder is deactivated and the model can be seen as a simple visual-based baseline with DETR as the detector. ✗ denotes that the module is ablated.

4.4.3 Threshold in Set Prediction Loss
The IoU threshold $T$ of the IoU-based assignment strategy in the set prediction loss for triplet detection is varied from 0.6 to 1. Since it is almost impossible in practice for a predicted box to overlap with the ground truth box with IoU = 1, the strategy can be considered deactivated when $T = 1$. Two curves, namely $T$-R@50 and $T$-mR@50 on SGDET, are shown in Fig. 8. When our assignment strategy is deactivated ($T = 1$), the model performs the worst. As $T$ increases from 0.7 to 1, the overall trend of the two curves is decreasing. This is more evident for the $T$-mR@50 curve.

[Figure 8 plot: SGDET R@50 (axis 26.5 to 28.0) and SGDET mR@50 (axis 9.5 to 11.0) against the IoU threshold $T$ from 0.6 to 1.0; the point at $T = 1.0$ is marked as deactivated.]
Fig. 8: $T$-R@50 and $T$-mR@50 curves on SGDET. × indicates that the IoU-based assignment strategy is deactivated.

4.5 Analysis on Subject and Object Queries
Distinct from the two-stage methods, which output $N$ object proposals after NMS and then label $N(N-1)$ predicates, RelTR predicts $N_t$ triplets directly through $N_t$ subject and object queries interacting with an image. We trained the model on Visual Genome using different values of $N_t$. Fig. 9 shows that as the number of coupled subject and object queries increases linearly, the number of model parameters increases linearly whereas the inference speed decreases linearly. However, the performance varies non-linearly and the best performance is achieved when $N_t = 200$ for the Visual Genome dataset. Too many queries generate many incorrect triplet proposals that take the place of correct proposals in the recall list.

[Figure 9 plot: #params (M), SGDET R@50 and FPS against the triplet query number (100, 150, 200, 250, 300). Annotated values: #params 63.6, 63.64, 63.68, 63.72, 63.76; R@50 25.1, 26.6, 27.5, 27.3, 27.0; FPS 16.34, 16.24, 16.09, 15.91, 15.74.]
Fig. 9: Changes in the parameter number, performance and FPS as the triplet number $N_t$ varies.

To explore how RelTR infers triplets with the coupled subject and object queries, we collect predictions from a random sample of 5000 images from the Visual Genome test set. We visualize the predictions for 10 out of the total of 200 coupled queries. Fig. 10 shows the spatial and class distributions of subjects and objects, as well as the class distribution of the top-5 predicates in the 5000 predictions of the 10 coupled subject and object queries. It demonstrates that different coupled queries learn different patterns from the training data, and attend to different classes of triplets in different regions at inference. We also select five predicates: "has" (from Head), "wears" and "riding" (from Body), and "using" and "mounted on" (from Tail), and count which queries are more inclined to predict these predicates. As shown in Fig. 11, the query

Fig. 10: Predictions on 5000 images from Visual Genome test set are presented for 10 coupled subject and object queries.
The size of all images is normalized to 1 × 1, with each point in the first and second rows representing the box center of
a subject and an object in a prediction respectively. Different point colors denote different entity super-categories: (1) blue
for humans (child, person and woman etc.) (2) plum for things that exist in nature (beach, dog and head etc.) (3) yellow
for man-made objects (cup, jacket and table etc.). The corresponding distributions of the top-5 predicates are shown in the third
row.

distribution of "has" is smooth. This indicates that all queries are able to predict high-frequency relationships. For predicates in the Body and Tail groups, there are some queries that are particularly good at detecting them. For example, 21% of the triplets with the predicate "wears" are predicted by Query 115, while half of the triplets with the predicate "mounted on" are predicted by Queries 107 and 105.

[Figure 11 pie charts, one per predicate: has (Head), wears (Body), riding (Body), using (Tail), mounted on (Tail).]
Fig. 11: Query distribution of the triplets with "has" (from Head), "wears" and "riding" (from Body), and "using" and "mounted on" (from Tail) in the predictions on 5000 images from the Visual Genome test set. Note that the same color in different pie charts does not mean the same query.

4.6 Qualitative Results
Fig. 14 shows the qualitative results for scene graph generation (SGDET) on the Visual Genome dataset. Although some other proposals are also meaningful, due to space limitations we only show the 9 relationships with the highest confidence scores and the generated scene graph in Fig. 14. Blue boxes are the subject boxes while orange boxes are the object boxes. Attention scores are displayed in the same color as the boxes. The overlap of subject and object attention is shown in white. The ground truth annotations of the two images are shown in Fig. 12. For brevity, we only show the bounding boxes of the entities that appear in the annotated triplets.

For the first image (with the car and building), we can assume that the 9 output triplets are all correct. The prediction <car-in front of-building> indicates that RelTR can understand spatial relationships from a 2D image to some extent ("in front of" is not a high-frequency predicate in Visual Genome). However, R@9 of the first image is only 5/12 = 41.7% because of the preferences in the ground truth triplet annotations. This phenomenon is more evident in the second image (with the woman and computer). Note that in the Visual Genome-150 split [42] that we use there is no computer class but only a laptop class. 6 out of 9 predictions from RelTR can be considered valid whereas R@9 is 0 due to the labeling preference. Sometimes RelTR outputs duplicate triplets such as <woman-wearing-jean> and <woman-looking at-laptop> in the second image. Along with the output results, RelTR also shows the regions of interest for the output relationships, making the behavior of the model easier to interpret.

The qualitative results of SGDET for Open Images V6 are shown in Fig. 13. Different from the dense triplets in the annotations of VG, each image from Open Images V6 is labeled with 2.8 triplets on average. Therefore, we only show the most confident triplet from the predictions for each image.

5 CONCLUSION
In this paper, based on the Transformer's encoder-decoder architecture, we propose a novel one-stage end-to-end framework for scene graph generation, RelTR. Given a fixed number of coupled subject and object queries, a fixed-size set of relationships is predicted using different attention mechanisms in the triplet decoder of RelTR. An IoU-based assignment strategy is proposed to optimize the assignment between triplet predictions and ground truth during model training. Compared with other state-of-the-art methods, RelTR is easy to implement and achieves state-of-the-art performance using only visual appearance, with very few model parameters and fast inference.

ACKNOWLEDGMENTS
This work has been supported by the Federal Ministry of Education and Research (BMBF), under the project LeibnizKILabor (grant no. 01DD20003), the Center for Digital
[Figure 12 image labels. First image: window4-on-building0, window2-on-building0, window5-on-building0, door3-on-building0, window7-on-building0, wheel10-on-car1, wheel10-on-car1, window9-on-car1, window8-on-car1, tire6-on-car1, tire6-on-car1, window9-on-car1. Second image: woman1-wearing-leg2, woman1-sitting on-chair0.]

Fig. 12: Ground truth annotations of the two images in Fig. 14 from the Visual Genome dataset. For brevity, only the bounding boxes of the entities that appear in the annotated triplets are shown, in red. All entities are numbered to distinguish between entities of the same class. There are two errors in the ground truth annotations: <window8-on-car1> in the first image and <woman1-wearing-leg2> in the second image. There can also be duplicate triplets in the ground truth (e.g. <wheel10-on-car1> in the first image). For the first image, only the relationships with the predicate "on" are labeled, while for the second image, relationships such as <woman1-wearing-shirt> are omitted. These biases in the ground truth annotations lead to low R@K scores; other SGG models suffer from this problem as well.
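The effect of annotation bias on R@K can be illustrated with a small sketch. It is deliberately simplified: triplets are matched on their labels only, whereas SGDET evaluation also checks the overlap of the predicted and annotated boxes, and all predictions and annotations below are hypothetical.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of annotated triplets recovered by the k most confident predictions.

    predictions: list of (score, (subject, predicate, object)) tuples.
    ground_truth: list of (subject, predicate, object); duplicates are kept,
    as they can occur in the annotations.
    """
    if not ground_truth:
        return 0.0
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    top_k = {triplet for _, triplet in ranked[:k]}
    hits = sum(1 for triplet in ground_truth if triplet in top_k)
    return hits / len(ground_truth)


# Hypothetical annotations: only "on" relationships are labeled, one of them twice.
gt = [("window", "on", "building"), ("door", "on", "building"),
      ("wheel", "on", "car"), ("wheel", "on", "car")]

# Hypothetical predictions: two plausible triplets are absent from the annotations.
preds = [(0.9, ("car", "in front of", "building")),
         (0.8, ("window", "on", "building")),
         (0.7, ("building", "has", "window")),
         (0.6, ("wheel", "on", "car"))]

print(recall_at_k(preds, gt, 3), recall_at_k(preds, gt, 4))
```

In this toy case two of the three most confident predictions are reasonable but unannotated, so they occupy places in the recall list without contributing to the score.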

Fig. 13: Qualitative results for scene graph generation of Open Images V6. Different from the dense triplets in the
annotations of VG, each image from Open Images V6 is labeled with 2.8 triplets on average. Although Open Images
V6 contains more entity classes, the image scenarios are simpler compared to Visual Genome. Therefore, only the top-1
triplets are shown in the second row while the original images are in the first row. Boxes and attention scores of subjects are
also colored with blue while objects with orange. RelTR demonstrates the excellent quality of its confident triplet proposals.
[Figure 14 scene graphs. First image: nodes door, window, building, tire, car and street, connected by the predicates on, has and in front of. Second image: nodes jean, hair, hand, woman, shirt, laptop, screen and desk, connected by the predicates wearing, has, looking at and on.]
Fig. 14: Qualitative results for scene graph generation on the Visual Genome dataset. The top-9 relationships ranked by confidence
and the generated scene graph are shown. Boxes and attention scores of subjects are colored with blue while objects
with orange. The orange vertices in the generated scene graph indicate the predictions are duplicated. The computer
is classified as laptop in the second image since there is no computer class but only laptop class in the used VG-
150 split [42]. Compared with the ground truth annotations in Fig. 12, the predictions of RelTR are diverse. Although
sometimes RelTR cannot label very difficult relationships correctly (e.g. looking at), the results demonstrate that the
generated scene graphs are of high quality.

Innovations (ZDIN) and the Deutsche Forschungsgemeinschaft (DFG) under Germany’s Excellence Strategy within the Cluster of Excellence PhoenixD (EXC 2122).

REFERENCES
[1] J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei, “Image retrieval using scene graphs,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3668–3678. 1, 2
[2] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei, “Visual relationship detection with language priors,” in European conference on computer vision. Springer, 2016, pp. 852–869. 1, 2, 6
[3] K. Nguyen, S. Tripathi, B. Du, T. Guha, and T. Q. Nguyen, “In defense of scene graphs for image captioning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 1407–1416. 1
[4] L. Gao, B. Wang, and W. Wang, “Image captioning with scene-graph based semantic concepts,” in Proceedings of the 2018 10th International Conference on Machine Learning and Computing, 2018, pp. 225–229. 1
[5] J. Johnson, B. Hariharan, L. Van Der Maaten, J. Hoffman, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick, “Inferring and executing programs for visual reasoning,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2989–2998. 1
[6] O. Ashual and L. Wolf, “Specifying object attributes and relations in interactive scene generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4561–4569. 1
[7] J. Johnson, A. Gupta, and L. Fei-Fei, “Image generation from scene graphs,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1219–1228. 1
[8] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: towards real-time object detection with region proposal networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 6, pp. 1137–1149, 2016. 1
[9] R. Zellers, M. Yatskar, S. Thomson, and Y. Choi, “Neural motifs: Scene graph parsing with global context,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5831–5840. 1, 2, 6, 7, 8
[10] T. Chen, W. Yu, R. Chen, and L. Lin, “Knowledge-embedded routing network for scene graph generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6163–6171. 1, 6
[11] R. Yu, A. Li, V. I. Morariu, and L. S. Davis, “Visual relationship detection with internal and external linguistic knowledge distillation,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1974–1982. 1, 2
[12] A. Zareian, S. Karaman, and S.-F. Chang, “Bridging knowledge graphs to generate scene graphs,” in European Conference on Computer Vision, 2020, pp. 606–623. 1, 6
[13] J. Gu, H. Zhao, Z. Lin, S. Li, J. Cai, and M. Ling, “Scene graph generation with external knowledge and image reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1969–1978. 1
[14] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 734–750. 1
[15] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9627–9636. 1
[16] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019. 1
[17] P. Sun, Y. Jiang, E. Xie, W. Shao, Z. Yuan, C. Wang, and P. Luo, “What makes for end-to-end object detection?” in International Conference on Machine Learning. PMLR, 2021, pp. 9934–9944. 1
[18] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European Conference on Computer Vision. Springer, 2020, pp. 213–229. 1, 2, 3, 4, 6, 9
[19] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International journal of computer vision, vol. 123, no. 1, pp. 32–73, 2017. 2, 3, 6, 8
[20] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov et al., “The open images dataset v4,” International Journal of Computer Vision, vol. 128, no. 7, pp. 1956–1981, 2020. 2, 3, 6, 8
[21] X. Yang, K. Tang, H. Zhang, and J. Cai, “Auto-encoding scene graphs for image captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 685–10 694. 2
[22] J. Gu, S. Joty, J. Cai, H. Zhao, X. Yang, and G. Wang, “Unpaired image captioning via scene graph alignments,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 10 323–10 332. 2
[23] K.-H. Lee, H. Palangi, X. Chen, H. Hu, and J. Gao, “Learning visual relation priors for image-text matching and image captioning with neural scene graph generators,” arXiv preprint arXiv:1909.09953, 2019. 2
[24] J. Shi, H. Zhang, and J. Li, “Explainable and explicit visual reasoning over scene graphs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8376–8384. 2
[25] S. Lee, J.-W. Kim, Y. Oh, and J. H. Jeon, “Visual question answering over scene graph,” in International Conference on Graph Computing (GC), 2019, pp. 45–50. 2
[26] Y. Li, T. Ma, Y. Bai, N. Duan, S. Wei, and X. Wang, “Pastegan: A semi-parametric method to generate image from scene graph,” Advances in Neural Information Processing Systems, vol. 32, pp. 3948–3958, 2019. 2
[27] A. Talavera, D. S. Tan, A. Azcarraga, and K.-L. Hua, “Layout and context understanding for image synthesis with scene graphs,” in IEEE International Conference on Image Processing (ICIP), 2019, pp. 1905–1909. 2
[28] C. Galleguillos, A. Rabinovich, and S. Belongie, “Object categorization using co-occurrence, location and appearance,” in 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8. 2
[29] S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller, “Multi-class segmentation with relative location prior,” International journal of computer vision, vol. 80, no. 3, pp. 300–316, 2008. 2
[30] Y. Cong, H. Ackermann, W. Liao, M. Y. Yang, and B. Rosenhahn, “Nodis: Neural ordinary differential scene understanding,” in Proceedings of the European Conference on Computer Vision (ECCV), 2020, pp. 636–653. 2
[31] W. Wang, R. Wang, S. Shan, and X. Chen, “Exploring context and visual pattern of relationship for scene graph generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8188–8197. 2, 6
[32] J. Shi, Y. Zhong, N. Xu, Y. Li, and C. Xu, “A simple baseline for weakly-supervised scene graph generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16 393–16 402. 2
[33] W. Wang, R. Wang, and X. Chen, “Topic scene graph generation by attention distillation from caption,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 900–15 910. 2
[34] Y. Lu, H. Rai, J. Chang, B. Knyazev, G. Yu, S. Shekhar, G. W. Taylor, and M. Volkovs, “Context-aware scene graph generation with seq2seq transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 931–15 941. 2
[35] K. Tang, H. Zhang, B. Wu, W. Luo, and W. Liu, “Learning to compose dynamic tree structures for visual contexts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6619–6628. 2, 6, 7, 8
[36] L. Chen, H. Zhang, J. Xiao, X. He, S. Pu, and S.-F. Chang, “Counterfactual critic multi-agent training for scene graph generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4613–4623. 2
[37] M.-J. Chiou, H. Ding, H. Yan, C. Wang, R. Zimmermann, and J. Feng, “Recovering the unbiased scene graphs from the biased ones,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 1581–1590. 2
[38] J. Ji, R. Krishna, L. Fei-Fei, and J. C. Niebles, “Action genome: Actions as compositions of spatio-temporal scene graphs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 236–10 247. 2
[39] Y. Cong, W. Liao, H. Ackermann, B. Rosenhahn, and M. Y. Yang, “Spatial-temporal transformer for dynamic scene graph generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16 372–16 382. 2
[40] Y. Teng, L. Wang, Z. Li, and G. Wu, “Target adaptive context aggregation for video scene graph generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13 688–13 697. 2
[41] Y. Lu, C. Chang, H. Rai, G. Yu, and M. Volkovs, “Multi-view scene graph generation in videos,” in International Challenge on Activity Recognition (ActivityNet) CVPR 2021 Workshop, vol. 3, 2021. 2
[42] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, “Scene graph generation by iterative message passing,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5410–5419. 2, 6, 10, 12
[43] Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang, “Scene graph generation from objects, phrases and region captions,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1261–1270. 2
[44] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, “Graph r-cnn for scene graph generation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 670–685. 2, 6, 8
[45] Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang, “Factorizable net: an efficient subgraph-based framework for scene graph generation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 335–351. 2
[46] R. Li, S. Zhang, B. Wan, and X. He, “Bipartite graph network with adaptive message passing for unbiased scene graph generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11 109–11 119. 2, 6, 7, 8
[47] T. Chen, W. Yu, R. Chen, and L. Lin, “Knowledge-embedded routing network for scene graph generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 2
[48] X. Lin, C. Ding, J. Zeng, and D. Tao, “Gps-net: Graph property sensing network for scene graph generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3746–3753. 2, 6, 7, 8
[49] R. Herzig, M. Raboh, G. Chechik, J. Berant, and A. Globerson, “Mapping images to scene graphs with permutation-invariant structured prediction,” Advances in Neural Information Processing Systems, vol. 31, pp. 7211–7221, 2018. 2
[50] M. Qi, W. Li, Z. Yang, Y. Wang, and J. Luo, “Attentive relational networks for mapping images to scene graphs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3957–3966. 2
[51] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008. 2, 3
[52] N. Dhingra, F. Ritter, and A. Kunz, “Bgt-net: Bidirectional gru transformer network for scene graph generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2150–2159. 2, 6
[53] R. Koner, P. Sinhamahapatra, and V. Tresp, “Relation transformer network,” arXiv preprint arXiv:2004.06193, 2020. 2
[54] N. Gkanatsios, V. Pitsikalis, P. Koutras, and P. Maragos, “Attention-translation-relation network for scalable scene graph generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0. 2
[55] Z. Cui, C. Xu, W. Zheng, and J. Yang, “Context-dependent diffusion network for visual relationship detection,” in Proceedings of the 26th ACM international conference on Multimedia, 2018, pp. 1475–1482. 2
[56] J. Zhang, K. J. Shih, A. Elgammal, A. Tao, and B. Catanzaro, “Graphical contrastive losses for scene graph parsing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 535–11 543. 2, 6, 8
[57] B. Dai, Y. Zhang, and D. Lin, “Detecting visual relationships with deep relational networks,” in Proceedings of the IEEE conference on computer vision and Pattern recognition, 2017, pp. 3076–3086. 2
[58] M. Suhail, A. Mittal, B. Siddiquie, C. Broaddus, J. Eledath, G. Medioni, and L. Sigal, “Energy-based learning for scene graph generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13 936–13 945. 2
[59] K. Tang, Y. Niu, J. Huang, J. Shi, and H. Zhang, “Unbiased scene graph generation from biased training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3716–3725. 2, 6, 7, 8
[60] S. Yan, C. Shen, Z. Jin, J. Huang, R. Jiang, Y. Chen, and X.-S. Hua, “Pcpl: Predicate-correlation perception learning for unbiased scene graph generation,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 265–273. 2
[61] H. Liu, N. Yan, M. Mortazavi, and B. Bhanu, “Fully convolutional scene graph generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11 546–11 556. 2, 6, 7
[62] B. Kim, J. Lee, J. Kang, E.-S. Kim, and H. J. Kim, “Hotr: End-to-end human-object interaction detection with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 74–83. 2, 3
[63] Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, and H. Xia, “End-to-end video instance segmentation with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8741–8750. 2
[64] W. Liu, S. Chen, L. Guo, X. Zhu, and J. Liu, “Cptr: Full transformer network for image captioning,” arXiv preprint arXiv:2101.10804, 2021. 2
[65] F. Zeng, B. Dong, T. Wang, C. Chen, X. Zhang, and Y. Wei, “Motr: End-to-end multiple-object tracking with transformer,” arXiv preprint arXiv:2105.03247, 2021. 2
[66] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” in International Conference on Learning Representations (ICLR), 2021. 2, 5, 6
[67] Z. Yao, J. Ai, B. Li, and C. Zhang, “Efficient detr: Improving end-to-end object detector with dense prior,” arXiv preprint arXiv:2104.01318, 2021. 2
[68] C. Zou, B. Wang, Y. Hu, J. Liu, Q. Wu, Y. Zhao, B. Li, C. Zhang, C. Zhang, Y. Wei et al., “End-to-end human object interaction detection with hoi transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11 825–11 834. 3
[69] R. Stewart, M. Andriluka, and A. Y. Ng, “End-to-end people detection in crowded scenes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2325–2333. 3
[70] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 658–666. 4, 5
[71] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations (ICLR), 2019. 6
[72] R. Al-Rfou, D. Choe, N. Constant, M. Guo, and L. Jones, “Character-level language modeling with deeper self-attention,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 3159–3166. 6
[73] H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua, “Visual translation embedding network for visual relation detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5532–5540. 7

Yuren Cong received his Bachelor's degree from Hefei University of Technology in 2015. He then studied Electrical Engineering and Information Technology at Leibniz University Hannover and received his Master's degree in 2019. Since 2020 he has been working as a research assistant toward his Ph.D. in the group of Prof. Rosenhahn. His research interests are in the field of computer vision, with a specialization in scene graph generation.

Michael Ying Yang is currently Assistant Professor in the Department of Earth Observation Science at ITC - Faculty of Geo-Information Science and Earth Observation, University of Twente, The Netherlands, heading a group working on scene understanding. He received the PhD degree (summa cum laude) from the University of Bonn (Germany) in 2011 and the venia legendi in Computer Science from Leibniz University Hannover in 2016. His research interests are in the fields of computer vision and photogrammetry, with a specialization in scene understanding and semantic interpretation from imagery. He serves as Associate Editor of ISPRS Journal of Photogrammetry and Remote Sensing, Co-chair of ISPRS working group II/5 Dynamic Scene Analysis, and Program Chair of ISPRS Geospatial Week 2019, and is a recipient of the ISPRS President’s Honorary Citation (2016), the Best Science Paper Award at BMVC (2016), and The Willem Schermerhorn Award (2020).

Bodo Rosenhahn studied Computer Science (minor subject Medicine) at the University of Kiel. He received the Dipl.-Inf. and Dr.-Ing. from the University of Kiel in 1999 and 2003, respectively. From 10/2003 to 10/2005, he worked as a PostDoc at the University of Auckland (New Zealand), funded by a scholarship from the German Research Foundation (DFG). From 11/2005 to 08/2008 he worked as a senior researcher at the Max-Planck Institute for Computer Science. Since 09/2008 he has been Full Professor at the Leibniz-University of Hannover, heading a group on automated image interpretation. He has co-authored over 200 papers, holds 12 patents, and has organized several workshops and conferences in recent years. His work has received several awards, including a DAGM-Prize 2002, the Dr.-Ing. Siegfried Werth Prize 2003, the DAGM-Main Prize 2005, the DAGM-Main Prize 2007, the Olympus-Prize 2007, and the Günter Enderle Award (Eurographics) 2017.
