2201.11460v2 Compressed
arXiv:2201.11460v2 [cs.CV] 11 Aug 2022
Abstract—Different objects in the same scene are more or less related to each other, but only a limited number of these relationships are noteworthy. Inspired by Detection Transformer, which excels in object detection, we view scene graph generation as a set prediction problem. In this paper, we propose an end-to-end scene graph generation model, Relation Transformer (RelTR), which has an encoder-decoder architecture. The encoder reasons about the visual feature context while the decoder infers a fixed-size set of subject-predicate-object triplets using different types of attention mechanisms with coupled subject and object queries. We design a set prediction loss that matches the ground truth and predicted triplets for end-to-end training. In contrast to most existing scene graph generation methods, RelTR is a one-stage method that predicts sparse scene graphs directly, using only visual appearance, without combining entities and labeling all possible predicates. Extensive experiments on the Visual Genome and Open Images V6 datasets demonstrate the superior performance and fast inference of our model.
Index Terms—Scene Understanding, Scene Graph Generation, One-Stage, Visual Relationship Detection
1 INTRODUCTION
[Figure: example scene graphs with entity labels (woman, pants, umbrella, sidewalk, window, building) and predicate labels (walking on, wearing, holding, on); panel title: Two-Stage Method.]
entities [1]. Scene graph generation (SGG) is a semantic understanding task that goes beyond object detection and is closely linked to visual relationship detection [2]. At present, scene graphs have shown their potential in different vision-language tasks such as image retrieval [1], image captioning [3], [4], visual question answering (VQA) [5] and image generation [6], [7]. The task of scene graph generation has also received sustained attention in the computer vision
ground truth information to relationship predictions. This paper aims to address these challenges.
We propose a novel end-to-end framework for scene graph generation, named Relation Transformer (RelTR). As shown in Fig. 1, RelTR can detect the triplet proposals with only visual appearance and predict subjects, objects, and their predicates concurrently. We evaluate RelTR on Visual Genome [19] and large-scale Open Images V6 [20]. The main contributions of this work are summarized as follows:
• In contrast to most existing advanced approaches that classify the dense relationships between all entity proposals from the object detection backbone, our one-stage method can generate a sparse scene graph by decoding the visual appearance with the subject and object queries learned from the data.
• RelTR generates scene graphs based on visual appearance only, which has fewer parameters and faster inference compared to other SGG models while achieving state-of-the-art performance.
• A set prediction loss is designed to perform the matching between the ground truth and predicted triplets with an IoU-based assignment strategy.
• With the decoupled entity attention, the triplet decoder of RelTR can improve the localization and classification of subjects and objects with the entity detection results from the entity decoder.
• Through comprehensive experiments, we explore which components are critical for the performance and analyze the working mechanism of learned subject and object queries.
• RelTR can be simply implemented. The source code and pretrained model are publicly available at https://github.com/yrcong/RelTR.
The remainder of the paper is structured as follows. In Section 2, we review related work in scene graph generation. Section 3 presents our proposed method. Experimental results of the proposed framework are discussed in Section 4. Section 5 concludes this paper.
2 RELATED WORK
2.1 Scene Graph Generation
Scene graphs were proposed in [1] for the task of image retrieval and have attracted increasing attention in the computer vision and natural language processing communities for different scene understanding tasks such as image captioning [21], [22], [23], VQA [24], [25] and image synthesis [26], [27]. The main purpose of scene graph generation (SGG) is to detect the relationships between objects in the scene. Many earlier works were limited to identifying specific types of relationships such as spatial relationships between entities [28], [29]. Universal visual relationship detection was introduced in [2]. Its inference framework, which detects entities in an image first and then determines dense relationships, was widely adopted in subsequent works, along with its evaluation settings and metrics.
Now many models [30], [31], [32], [33], [34], [35], [36], [37] are available to generate scene graphs from different perspectives, and some works even extend the scene graph generation task from images to videos [38], [39], [40], [41].
Two-stage methods following [2] are currently dominating scene graph generation: several works [9], [30], [42], [43] use residual neural networks with the global context to improve the quality of the generated scene graphs. Xu et al. [42] use standard RNNs to iteratively improve the relationship prediction via message passing while MotifNet [9] stacks LSTMs to reason about the local and global context. Graph-based models [44], [45], [46], [47], [48] perform message passing and demonstrate good results. Factorizable Net [45] decomposes and combines the graphs to infer the relationships. The attention mechanism is integrated into different types of graph-based models such as Graph R-CNN [44], GPI [49] and ARN [50]. With the rise of Transformer [51], there are several attempts using Transformer to detect visual relationships and generate scene graphs in very recent works [34], [52], [53]. To improve the performance, many works are no longer limited to using only visual appearance. Semantic knowledge can be utilized as an additional feature to infer scene graphs [2], [9], [11], [54], [55]. Furthermore, statistic priors and knowledge graphs have been introduced in [11], [56], [57], [58], [59], [60].
Compared to the boom of two-stage approaches, one-stage approaches are still in their infancy, although they have the advantage of being simple, fast and easy to train. To the best of our knowledge, FCSGG [61] is currently the only one-stage scene graph generation framework; it encodes objects as box center points and relationships as 2D vector fields. While the FCSGG model is lightweight and fast, it has a significant performance gap compared to two-stage methods. To fill this gap, we propose the Transformer-based RelTR, which uses only visual appearance, with fewer parameters, faster inference speed, and higher accuracy. Distinct from the two-stage Transformer-based approaches [34], [52], [53] that utilize the attention mechanism to capture the context of the entity proposals from an object detector, RelTR can decode the global feature maps directly with the subject and object queries learned from the data to generate a sparse scene graph.
2.2 Transformer and Set Prediction
The original Transformer architecture was proposed in [51] for sequence transduction. Its encoder-decoder configuration and attention mechanism are also used to solve various computer vision tasks in different ways, e.g. object detection [18], human-object interaction (HOI) detection [62], and dynamic scene graph generation [39].
DETR [18] is a seminal recent work based on the Transformer architecture for object detection. It views detection as a set prediction problem. In the end-to-end training, with the object queries, DETR predicts a fixed-size set of object proposals and performs a bipartite matching between proposals and ground truth objects for the loss function. This concept of query-based set prediction quickly gained popularity in the computer vision community. Many tasks can be reformulated as set prediction problems, e.g. instance segmentation [63], image captioning [64] and multiple-object tracking [65]. Some works [66], [67] attempt to further improve object detection based on DETR.
HOI detection localizes and recognizes the relationships between humans and objects, whose result is a sub-graph
Fig. 2: Given a set of learned subject and object queries coupled by subject and object encodings, RelTR captures the dependencies between relationships and reasons about the feature context and entity representations (the outputs of the feature encoder and entity decoder, respectively) to directly compute a set of subject and object representations. A pair of subject and object representations with attention heat maps is decoded into a triplet <subject-predicate-object> by feed-forward networks (FFNs). CSA, DVA and DEA stand for Coupled Self-Attention, Decoupled Visual Attention and Decoupled Entity Attention. $E_p$, $E_t$, $E_s$ and $E_o$ are the positional, triplet, subject and object encodings respectively. ⊕ indicates element-wise addition, while ⊗ indicates concatenation or split.
of the scene graph. Several HOI detection frameworks [62], [68] have been developed that use holistic triplet queries to directly infer a set of interactions. However, such a concept is difficult to generalize to the more complex task of scene graph generation. On large-scale datasets, such as Visual Genome [19] and Open Images [20], localization and classification of subjects and objects using only triplet queries is likely to result in low accuracy. In contrast, our proposed RelTR predicts the general relationships using coupled subject and object queries to achieve high accuracy.
3 METHOD
A scene graph $G$ consists of entity vertices $V$ and relationship edges $E$. Different from previous works that detect a set of entity vertices and label the predicates between the vertices, we propose a one-stage model, Relation Transformer (RelTR), to directly predict a fixed-size set of $\langle V_{sub} - E_{prd} - V_{obj} \rangle$ for scene graph generation.
3.1 Preliminaries
3.1.1 Transformer
We provide a brief review of Transformer and its attention mechanism. Transformer [51] has an encoder-decoder structure and consists of stacked attention functions. The input of a single-head attention is formed from queries $Q$, keys $K$ and values $V$ while the output is computed as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V, \tag{1}$$
where $d_k$ is the dimension of $K$. In order to benefit from the information in different representation sub-spaces, multi-head attention is applied in Transformer. A complete attention function is a multi-head attention followed by a normalization layer with residual connection and is denoted as $Att(\cdot)$ in this paper for simplicity.
3.1.2 DETR
This entity detection framework [18] is built upon the standard Transformer encoder-decoder architecture. First, a CNN backbone generates a feature map $Z_0 \in \mathbb{R}^{H \times W \times d}$ for an image. With the self-attention mechanism, the encoder computes a new feature context $Z \in \mathbb{R}^{HW \times d}$ using the flattened $Z_0$ and fixed positional encodings $E_p \in \mathbb{R}^{HW \times d}$. The decoder transforms $N_e$ entity queries into the entity representations $Q_e \in \mathbb{R}^{N_e \times d}$. The entity queries interact with each other to capture the entity context and extract visual features from $Z$.
For the end-to-end training, a set prediction loss for entity detection is proposed in DETR by assigning the ground truth entities to predictions. The ground truth set of size $N_e$ is padded with $\phi$ <background>, and a cost function $c_m(\hat{y}, y)$ is applied to compute the matching cost between a prediction $\hat{y}$ and ground truth entity $y = \{c, b\}$ where $c$ and $b$ indicate the target class and box coordinates respectively. Given the cost matrix $C_{ent}$, the entity prediction-ground truth assignment is computed with the Hungarian algorithm [69]. The set prediction loss for entity detection can be presented as:
$$L_{entity} = \sum_{i=1}^{N_e} \left[ L_{cls} + \mathbb{1}_{\{c^i \neq \phi\}} L_{box} \right], \tag{2}$$
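The assignment step behind Eq. (2) can be illustrated with a small sketch. It pads the ground truth with <background>, scores every prediction against every padded target and keeps the lowest-cost one-to-one assignment. Two simplifications are assumptions made for illustration only: the search runs over all permutations instead of using the Hungarian algorithm [69] (both find an optimal assignment, but brute force is only feasible for tiny sets), and the cost is a toy class term, 1 minus the predicted probability of the target label, with no box term. The names `class_cost` and `match` are hypothetical.

```python
from itertools import permutations

BACKGROUND = "<background>"

def class_cost(pred_probs, gt_label):
    """Toy matching cost: 1 - predicted probability of the target label."""
    return 1.0 - pred_probs.get(gt_label, 0.0)

def match(predictions, ground_truth):
    """Return a list whose i-th item is the target assigned to prediction i.

    Assumes len(ground_truth) <= len(predictions).
    """
    n = len(predictions)
    # Pad the ground truth set with <background> up to the number of predictions.
    padded = list(ground_truth) + [BACKGROUND] * (n - len(ground_truth))
    best_cost, best_perm = float("inf"), None
    # Brute-force stand-in for the Hungarian algorithm.
    for perm in permutations(range(n)):
        cost = sum(class_cost(predictions[i], padded[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return [padded[j] for j in best_perm]

# Three entity predictions (class probabilities), two ground truth entities.
predictions = [
    {"woman": 0.7, "umbrella": 0.1, BACKGROUND: 0.2},
    {"woman": 0.1, "umbrella": 0.2, BACKGROUND: 0.7},
    {"woman": 0.2, "umbrella": 0.6, BACKGROUND: 0.2},
]
assignment = match(predictions, ["woman", "umbrella"])
print(assignment)
```

After matching, the classification loss is computed for every prediction, while the box loss is only computed for predictions whose assigned target is not <background>, which is the role of the indicator in Eq. (2).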
where $L_{cls}$ denotes the cross-entropy loss for label classification and $c^i \neq \phi$ means that <background> is not assigned to the $i$-th entity prediction. $L_{box}$ consists of the $L_1$ loss and the generalized IoU loss [70] for box regression.
3.2 RelTR Model
As shown in Fig. 2, our one-stage model RelTR has an encoder-decoder architecture, which directly predicts $N_t$ triplets without inferring the possible predicates between all entity pairs. It consists of the feature encoder extracting the visual feature context, the entity decoder capturing the entity representations from DETR [18] and the triplet decoder with the subject and object branches.
A triplet decoder layer contains three attention functions: coupled self-attention (CSA), decoupled visual attention (DVA) and decoupled entity attention (DEA). Given $N_t$ coupled subject and object queries, the triplet decoder layer reasons about the feature context $Z$ and entity representations $Q_e$ from the entity decoder layer to directly output the information of $N_t$ triplets without inferring the possible predicates between all entity pairs.
3.2.1 Subject and Object Queries
There are two types of learned embeddings, namely subject queries $Q_s \in \mathbb{R}^{N_t \times d}$ and object queries $Q_o \in \mathbb{R}^{N_t \times d}$, for the subject branch and object branch respectively. These $N_t$ pairs of subject and object queries are transformed into $N_t$ pairs of subject and object representations of size $d$. However, the subject query and the object query are not actually linked together in a query pair since the attention layers in the triplet decoder are permutation invariant. In order to distinguish between different triplets, the learnable triplet encodings $E_t \in \mathbb{R}^{N_t \times d}$ are introduced.
The feature context is combined with the fixed position encodings $E_p \in \mathbb{R}^{HW \times d}$ again in DVA. The updated subject representations containing visual features are presented as:
$$\begin{aligned}
Q &= Q_s + E_t, \quad K = Z + E_p \\
Q_s &= Att_{DVA}^{(sub)}(Q, K, Z).
\end{aligned} \tag{4}$$
The same operation is performed in the object branch. In the multi-head attention operation, $N_t$ attention heat maps $M_s \in \mathbb{R}^{N_t \times HW}$ are computed. We also adopt the reshaped heat maps as a spatial feature for predicate classification.
3.2.4 Decoupled Entity Attention (DEA)
Decoupled entity attention acts as the bridge between entity detection and triplet detection. Entity representations $Q_e \in \mathbb{R}^{N_e \times d}$ can provide localization and classification information of higher quality because they do not have semantic restrictions like those between subject and object representations. The motivation for introducing DEA is that subject and object representations are expected to learn more accurate localization and classification information from entity representations through the attention mechanism. $Q_s$ and $Q_o$ are finally updated in a triplet decoder layer as follows:
$$\begin{aligned}
Q_s &= Att_{DEA}^{(sub)}(Q_s + E_t, Q_e, Q_e) \\
Q_o &= Att_{DEA}^{(obj)}(Q_o + E_t, Q_e, Q_e),
\end{aligned} \tag{5}$$
where $Att_{DEA}^{(sub)}$ and $Att_{DEA}^{(obj)}$ are the decoupled entity attention modules in the subject and object branch. The outputs of DEA are processed by a feed-forward network followed by a normalization layer with residual connection. The feed-forward network (FFN) consists of two linear transformation layers with ReLU activation.
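The data flow of Eq. (4) and Eq. (5) can be sketched in plain Python for the subject branch. This is a deliberately reduced illustration, not the RelTR implementation: it uses the single-head attention of Eq. (1) with no learned projections, no multi-head split, no residual connection and no normalization, and all tensors are small hand-written lists. The function names `attention`, `dva` and `dea` are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention, Eq. (1).

    Returns the attended output and the attention weights (one row per query).
    """
    d_k = len(K[0])
    weights = [
        softmax([sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in K])
        for q_row in Q
    ]
    out = [
        [sum(w * v_row[c] for w, v_row in zip(w_row, V)) for c in range(len(V[0]))]
        for w_row in weights
    ]
    return out, weights

def add(A, B):
    """Element-wise addition of two equally shaped matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def dva(Q_branch, E_t, Z, E_p):
    """Eq. (4): queries Q_branch + E_t, keys Z + E_p, values Z."""
    return attention(add(Q_branch, E_t), add(Z, E_p), Z)

def dea(Q_branch, E_t, Q_e):
    """Eq. (5): queries Q_branch + E_t attend to entity representations Q_e."""
    return attention(add(Q_branch, E_t), Q_e, Q_e)

# Toy sizes: N_t = 2 queries, d = 2, HW = 3 feature positions, N_e = 2 entities.
Q_s = [[0.1, 0.2], [0.3, 0.4]]
E_t = [[0.5, 0.0], [0.0, 0.5]]
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
E_p = [[0.0, 0.1], [0.1, 0.0], [0.1, 0.1]]
Q_e = [[0.2, 0.8], [0.9, 0.1]]

Q_s_visual, M_s = dva(Q_s, E_t, Z, E_p)   # M_s has shape N_t x HW (heat maps)
Q_s_final, _ = dea(Q_s_visual, E_t, Q_e)  # shape N_t x d
print(M_s)
print(Q_s_final)
```

The object branch would call the same two functions with $Q_o$ in place of $Q_s$; in the model the two branches have their own attention modules, which this sketch does not capture.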
Fig. 4: The ground truth is assigned to Proposal A while <background-no relation-background> is assigned to
Proposal B. However, <background> should not be assigned to the subject of Proposal C and the subject as well as object
of Proposal D. BG denotes <background> while X indicates no assignment.
by two linear projection layers into entity class distributions. We utilize two independent feed-forward networks with the same structure to predict the height, width and normalized center coordinates of subject and object boxes. The architecture is shown in Fig. 3 (left). A pair of subject attention heat map $M_s$ and object attention heat map $M_o$ from the DVA modules in the last decoder layer is concatenated and resized to 2 × 28 × 28. The convolutional mask head shown in Fig. 3 (right) converts the attention heat maps to spatial feature vectors. The final predicate labels are predicted by a two-layer perceptron with the subject representations, object representations and spatial feature vectors.
3.3 Set Prediction Loss for Triplet Detection
We design a set prediction loss for triplet detection by extending the entity detection set prediction loss in Eq. 2. We present a triplet prediction as $\langle \hat{y}_{sub}, \hat{c}_{prd}, \hat{y}_{obj} \rangle$ where $\hat{y}_{sub} = \{\hat{c}_{sub}, \hat{b}_{sub}\}$ and $\hat{y}_{obj} = \{\hat{c}_{obj}, \hat{b}_{obj}\}$ while a ground truth is denoted as $\langle y_{sub}, c_{prd}, y_{obj} \rangle$. The predicted subject, predicate and object labels are respectively denoted as $\hat{c}_{sub}$, $\hat{c}_{prd}$ and $\hat{c}_{obj}$ while the predicted box coordinates of the subject and object are denoted as $\hat{b}_{sub}$ and $\hat{b}_{obj}$.
When $N_t$ relationships are predicted and $N_t$ is larger than the number of triplets in the image, the ground truth set of triplets is padded with $\Phi$ <background-no relation-background>. The pair-wise matching cost $c_{tri}$ between a predicted triplet and a ground truth triplet consists of the subject cost $c_m(\hat{y}_{sub}, y_{sub})$, object cost $c_m(\hat{y}_{obj}, y_{obj})$ and predicate cost $c_m(\hat{c}_{prd}, c_{prd})$. The prediction $\hat{y} = \{\hat{c}, \hat{b}\}$ contains the predicted class $\hat{c}$ including the class probabilities $\hat{p}$ and the predicted box coordinates $\hat{b}$ while the ground truth $y = \{c, b\}$ contains the ground truth class $c$ and the ground truth box $b$. For the predicate, we only have the predicted class $\hat{c}_{prd}$ and ground truth class $c_{prd}$.
The subject/object cost is determined by the predicted entity class probability and the predicted bounding box while the predicate cost is determined only by the predicted predicate class probability. We define the predicted proba-
and generalized IoU loss [70]:
$$c_{box}(\hat{b}, b) = 5 L_1(\hat{b}, b) + 2 L_{GIoU}(\hat{b}, b). \tag{7}$$
The cost function $c_m$ can be presented as:
$$c_m(\hat{y}, y) = c_{cls}(\hat{c}, c) + \mathbb{1}_{\{b \in y\}} c_{box}(\hat{b}, b), \tag{8}$$
where $b \in y$ denotes that the ground truth includes the box coordinates (only for the subject/object cost). The cost between a triplet prediction and a ground truth triplet is computed as:
$$c_{tri} = c_m(\hat{y}_{sub}, y_{sub}) + c_m(\hat{y}_{obj}, y_{obj}) + c_m(\hat{c}_{prd}, c_{prd}), \tag{9}$$
Given the triplet cost matrix $C_{tri}$, the Hungarian algorithm is executed for the bipartite matching and each ground truth triplet is assigned to a prediction. However, <background-no relation-background> should not be assigned to all predictions that do not match the ground truth triplets. After several iterations of training, RelTR is likely to output the triplet proposals in four possible ways, as demonstrated in Fig. 4. Assigning ground truth to Proposal A and <background-no relation-background> to Proposal B are two clear cases. For Proposal C, <background> should not be assigned to the subject due to the poor object prediction. Furthermore, <background> should not be assigned to the subject and object of Proposal D because there is a better candidate, Proposal A. To solve this problem, we integrate an IoU-based assignment strategy in our set prediction loss: for a triplet prediction, if the predicted subject or object label is correct, and the IoU of the predicted box and ground truth box is greater than or equal to the threshold $T$, the loss function does not compute a loss for the subject or object. The set prediction loss for triplet detection is formulated as:
$$\begin{aligned}
L_{sub} &= \sum_{i=1}^{N_t} \Theta \left[ L_{cls} + \mathbb{1}_{\{c^i_{sub} \neq \phi\}} L_{box} \right] \\
L_{obj} &= \sum_{i=1}^{N_t} \Theta \left[ L_{cls} + \mathbb{1}_{\{c^i_{obj} \neq \phi\}} L_{box} \right]
\end{aligned} \tag{10}$$
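A minimal sketch of the box cost in Eq. (7) and of the IoU-based rule that switches the loss off is given below. Several choices are assumptions for illustration: boxes are written as normalized corners (x1, y1, x2, y2) rather than center, width and height; the generalized IoU loss is taken as 1 minus GIoU; boxes are assumed to have positive area; and `theta` encodes one reading of the rule, namely that a subject or object which would otherwise receive <background> is excluded from the loss when some ground truth entity has the same label and an IoU of at least T. The default T = 0.7 is the threshold stated in the implementation details. All function names are hypothetical.

```python
def area(box):
    """Area of a corner-format box; empty or inverted boxes have area 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def iou_and_giou(a, b):
    """Return (IoU, generalized IoU) of two corner-format boxes."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    # Smallest box enclosing both inputs.
    hull = area((min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])))
    iou = inter / union
    return iou, iou - (hull - union) / hull

def box_cost(pred_box, gt_box):
    """Eq. (7): 5 * L1 distance + 2 * GIoU loss (taken as 1 - GIoU)."""
    l1 = sum(abs(p - g) for p, g in zip(pred_box, gt_box))
    _, giou = iou_and_giou(pred_box, gt_box)
    return 5.0 * l1 + 2.0 * (1.0 - giou)

def theta(pred_label, pred_box, gt_entities, threshold=0.7):
    """Return 0 to skip the loss, 1 to keep it (assumed reading of the rule)."""
    for gt_label, gt_box in gt_entities:
        if pred_label == gt_label and iou_and_giou(pred_box, gt_box)[0] >= threshold:
            return 0
    return 1

gt_entities = [("woman", (0.10, 0.10, 0.50, 0.90))]
print(box_cost((0.10, 0.10, 0.50, 0.90), gt_entities[0][1]))       # identical boxes
print(theta("woman", (0.12, 0.10, 0.50, 0.90), gt_entities))       # good box, same label
print(theta("woman", (0.60, 0.10, 0.90, 0.90), gt_entities))       # no overlap
```

With this rule, a proposal such as C in Fig. 4, whose subject is well localized and correctly labeled, is not pushed toward <background> only because the rest of the triplet failed to match.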
TABLE 1: Comparison with state-of-the-art scene graph generation methods on Visual Genome [19] test set. These methods
are divided into two-stage and one-stage. The best numbers in two-stage methods are shown in bold, and the best numbers
in one-stage methods are shown in italic. Models that use prior knowledge are represented in blue, to distinguish them
from visual-based models. The inference speed (FPS) of different models is tested on the same RTX 2080Ti of batch size 1.
Fig. 5: Triplets in which the subject (blue) and object (orange) are the same entity are removed in post-processing. The predicates are usually ambiguous in such cases. [Figure labels: sidewalk-on-sidewalk, sign-on-sign, woman-wearing-woman.]
4 EXPERIMENTS
4.1 Datasets and Evaluation Settings
4.1.1 Visual Genome
We follow the widely used Visual Genome [19] split proposed by [42]. There are a total of 108k images in the dataset with 150 entity categories and 50 predicate categories. 70% of the images are used as the training set and the remaining 30% as the test set. 5k images are further drawn from the training set for validation. There are three standard evaluation settings: (1) Predicate classification (PredCLS): predict predicates given ground truth categories and bounding boxes of entities. (2) Scene graph classification (SGCLS): predict predicates and entity categories given ground truth boxes. (3) Scene graph detection (SGDET): predict categories, bounding boxes of entities and predicates. Distinct from two-stage methods, the ground truth bounding boxes and categories of entities
4.2 Implementation Details
We adopt the same hyperparameters in our experiments on Visual Genome and Open Images. We train RelTR end-to-end from scratch for 150 epochs on 8 RTX 2080 Ti GPUs with AdamW [71], setting the batch size to 2 per GPU, the weight decay to $10^{-4}$ and clipping gradient norms above 0.1. The initial learning rates of the Transformer and ResNet-50 backbone are set to $10^{-4}$ and $10^{-5}$ respectively and the learning rates are dropped by a factor of 0.1 after 100 epochs. In training we also use auxiliary losses [72] for the triplet decoder as in [18], [66]. By default, RelTR has 6 encoder layers and 6 triplet decoder layers. The number of triplet decoder layers and the number of entity decoder layers are set to be the same. The multi-head attention modules with 8 heads in our model are trained with a dropout of 0.1. For all experiments, the model dimension $d$ is set to 256. Unless otherwise stated, the number of entity queries $N_e$ and coupled queries $N_t$ are set to 100 and 200 respectively, while the IoU threshold in the triplet assignment is 0.7. For fair comparison, inference speeds (FPS) of all the reported SGG models are evaluated on a single RTX 2080 Ti with the same hardware configuration. For computing the inference speed (FPS), we average over all the test images, where for each
[Fig. 6 plot: one bar group per Visual Genome predicate category on the x-axis (50 categories, e.g. on, has, wearing, of, in, near); axis ticks run from 0 to 40 and from 0.00 to 0.20.]
Fig. 6: SGDET-R@100 for each relationship category on the VG dataset. Long-tail groups are shown with different colors. RelTR almost always performs better than BGNN [46] from "of" to "in front of". The standard deviations of R@100 are 11.51 (ours) and 14.15 (BGNN) respectively, which indicates that RelTR is less biased.
image, the time between receiving the image as input and outputting the triplet proposals is taken as the inference time. The time cost for evaluating the whole dataset is not included.
4.3 Quantitative Results and Comparison
4.3.1 Visual Genome
We compare scores of R@K and mR@K, number of parameters and inference speed on SGDET (FPS) with several two-stage models and the one-stage model FCSGG [61] in Table 1. Models that use not only visual appearance but also prior knowledge (e.g. semantic and statistic information) are represented in blue, to distinguish them from visual-based models. Overall, the two-stage models have higher scores of R@K and mR@K than the one-stage models while they have more parameters and slower inference speed. This phenomenon also occurs between the models using prior information and visual-based models. Noted that the
proposals to predict the labels and achieves R@50 = 64.2 and mR@50 = 21.2 on PredCLS while R@50 = 36.6 and mR@50 = 11.4 on SGCLS.
Table 2 demonstrates R@K, mR@K and zsR@K on SGDET of state-of-the-art methods. Compared with the models without the Total Direct Effect (TDE) [59], RelTR has the best performance on mR@K and zsR@K. With TDE, zsR@K and mR@K of the two-stage methods are improved whereas R@K decreases significantly. Our model performs well on all three recall metrics.
                     SGDET
Method               R@20   R@50   mR@20   mR@50   zsR@50   zsR@100   Avg.
Motifs-TDE [59]      12.4   16.9    5.8     8.2     2.3      2.9       8.1
VTransE-TDE [59]     13.5   18.7    6.3     8.6     2.0      2.7       8.6
VCTree-TDE [59]      14.0   19.4    6.9     9.3     2.6      3.2       9.2
Motifs [9]           21.4   27.2    4.2     5.7     0.1      0.2       9.8
VTransE [73]         23.0   29.7    3.7     5.0     0.8      1.5      10.6
VCTree [35]          22.0   27.9    5.2     6.9     0.2      0.7      10.5
FCSGG [61]           16.1   21.3    2.7     3.6     1.0      1.4       7.7
RelTR (ours)         21.2   27.5    6.8    10.8     1.8      2.4      11.8
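The "Avg." column of the SGDET comparison is not defined in the surrounding text. It appears to be the arithmetic mean of the six reported metrics (R@20, R@50, mR@20, mR@50, zsR@50, zsR@100), and the short check below recomputes it from the table rows under that assumption. The dictionary layout and the helper name `mean` are illustrative only.

```python
# Rows copied from the SGDET table: six metrics followed by the reported Avg.
rows = {
    "Motifs-TDE": ([12.4, 16.9, 5.8, 8.2, 2.3, 2.9], 8.1),
    "VTransE-TDE": ([13.5, 18.7, 6.3, 8.6, 2.0, 2.7], 8.6),
    "VCTree-TDE": ([14.0, 19.4, 6.9, 9.3, 2.6, 3.2], 9.2),
    "Motifs": ([21.4, 27.2, 4.2, 5.7, 0.1, 0.2], 9.8),
    "VTransE": ([23.0, 29.7, 3.7, 5.0, 0.8, 1.5], 10.6),
    "VCTree": ([22.0, 27.9, 5.2, 6.9, 0.2, 0.7], 10.5),
    "FCSGG": ([16.1, 21.3, 2.7, 3.6, 1.0, 1.4], 7.7),
    "RelTR": ([21.2, 27.5, 6.8, 10.8, 1.8, 2.4], 11.8),
}

def mean(values):
    return sum(values) / len(values)

for name, (metrics, reported_avg) in rows.items():
    # The recomputed mean should agree with the reported value up to rounding.
    print(f"{name:12s} mean={mean(metrics):.2f} reported={reported_avg}")
```

Every recomputed mean agrees with the reported value to within rounding at one decimal place, which supports reading the column as a plain average across the three recall families.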
[Fig. 7 plots: two panels with the Open Images V6 predicate categories on the x-axis (wears, at, contain, holds, ride, on, hang, plays, interacts_with, inside_of, ...); axis ticks run from 0 to 30 and from 0.00 to 0.30.]
Fig. 7: Average precision of relationships and phrases for RelTR and BGNN on Open Images V6. The distribution of relationships in the test set is shown with the black dashed line. The average precision of relationships of RelTR is higher than that of BGNN for 7 of the top-10 high frequency predicates while BGNN generally performs better than RelTR for the low frequency predicates (skateboard to ski). We conjecture that this is due to the prior knowledge used in BGNN. The overall trend of $AP_{phr}$ is the same as that of $AP_{rel}$ except for hang.
Method             SGDET-mR@100   Head   Body   Tail
GPS-NET [48]            9.8       30.8    8.5    3.9
VCTree-TDE [59]        11.1       24.7   12.2    1.8
BGNN [46]              12.6       34.0   12.9    6.0
RelTR (ours)           12.6       30.6   14.4    5.0
TABLE 3: SGDET-mR@100 for the head, body and tail groups, which are partitioned according to the number of relationship instances in the training set.
and phrases. The distribution of relationships in the Open Images V6 test set is also shown with the black dashed lines. There are 9 predicates (kiss to handshake) that do not appear in the test set. The average precisions $AP_{rel}$ and $AP_{phr}$ of RelTR are higher than those of BGNN for 7 of the top-10 high frequency predicates. For the low frequency predicates (skateboard to ski), BGNN generally performs better than RelTR. We conjecture that this is due to the prior knowledge used in BGNN.
Fig. 10: Predictions on 5000 images from the Visual Genome test set are presented for 10 coupled subject and object queries. The size of all images is normalized to 1 × 1, with each point in the first and second rows representing the box center of a subject and an object in a prediction respectively. Different point colors denote different entity super-categories: (1) blue for humans (child, person, woman, etc.); (2) plum for things that exist in nature (beach, dog, head, etc.); (3) yellow for man-made objects (cup, jacket, table, etc.). The corresponding distributions of the top-5 predicates are shown in the third row.
distribution of "has" is smooth. This indicates that all queries are able to predict high frequency relationships. For predicates in the Body and Tail groups, there are some queries that are particularly good at detecting them. For example, 21% of the triplets with the predicate "wears" are predicted by Query 115, while half of the triplets with the predicate "mounted on" are predicted by Queries 107 and 105.
Fig. 11: Query distribution of the triplets with "has" (from Head), "wears" and "riding" (from Body), and "using" and "mounted on" (from Tail) in the predictions on 5000 images from the Visual Genome test set. Note that the same color in different pie charts does not mean the same query.
Visual Genome). However, R@9 of the first image is only 5/12 = 41.7% because of the preferences in the ground truth triplet annotations. This phenomenon is more evident in the second image (with the woman and computer). Note that in the Visual Genome-150 split [42] used here there is no computer class but only a laptop class. 6 out of 9 predictions from RelTR can be considered valid whereas R@9 is 0 due to the labeling preference. Sometimes RelTR outputs duplicate triplets such as <woman-wearing-jean> and <woman-looking at-laptop> in the second image. Along with the output results, RelTR also shows the regions of interest for the output relationships, making the behavior of the model easier to interpret.
The qualitative results of SGDET for Open Images V6 are shown in Fig. 13. Different from the dense triplets in the annotations of VG, each image from Open Images V6 is labeled with 2.8 triplets on average. Therefore, we only show the most confident triplet from predictions for each image.
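As a reading aid for the per-image R@K values quoted above (for example 5/12), the sketch below computes recall at K as the fraction of ground truth triplets that are recovered by the K most confident predictions. It is a simplification under stated assumptions: a prediction counts as a hit only if its (subject, predicate, object) tuple equals the annotated one, whereas the SGDET protocol additionally checks box overlap, and the triplets and scores are invented toy data, not results from the paper. The function name `recall_at_k` is hypothetical.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground truth triplets found among the top-k predictions.

    predictions: list of (confidence, triplet) pairs.
    ground_truth: list of triplets; duplicates are counted individually.
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    top_k = {triplet for _, triplet in ranked[:k]}
    hits = sum(1 for triplet in ground_truth if triplet in top_k)
    return hits / len(ground_truth)

# Toy annotations with one duplicated triplet, as can happen in Visual Genome.
ground_truth = [
    ("cup", "on", "table"),
    ("cup", "on", "table"),
    ("man", "holding", "cup"),
    ("man", "wearing", "hat"),
]
predictions = [
    (0.9, ("cup", "on", "table")),
    (0.8, ("man", "near", "table")),    # plausible, but not annotated
    (0.7, ("man", "wearing", "hat")),
    (0.2, ("man", "holding", "cup")),
]
print(recall_at_k(predictions, ground_truth, k=3))
print(recall_at_k(predictions, ground_truth, k=4))
```

The second prediction illustrates the labeling preference discussed in the text: a reasonable triplet that is missing from the annotations occupies a slot in the top K without contributing to recall.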
[Fig. 12 annotated triplets. First image: window4-on-building0, window2-on-building0, window5-on-building0, door3-on-building0, window7-on-building0, wheel10-on-car1, wheel10-on-car1, window9-on-car1, window8-on-car1, tire6-on-car1, tire6-on-car1, window9-on-car1. Second image: woman1-wearing-leg2, woman1-sitting on-chair0.]
Fig. 12: Ground truth annotations of the two images in Fig. 14 from the Visual Genome dataset. For brevity, only the bounding boxes of the entities that appear in the annotated triplets are shown in red. All entities are numbered to distinguish between entities of the same class. There are two errors in the ground truth annotations: <window8-on-car1> in the first image and <woman1-wearing-leg2> in the second image. There can be duplicate triplets in the ground truth (e.g. <wheel10-on-car1> in the first image). For the first image, only the relationships with the predicate on are labeled while for the second image, relationships such as <woman1-wearing-shirt> are omitted. These biases in the ground truth annotations lead to low R@K scores; other SGG models also suffer from this problem.
Fig. 13: Qualitative results for scene graph generation on Open Images V6. Different from the dense triplets in the annotations of VG, each image from Open Images V6 is labeled with 2.8 triplets on average. Although Open Images V6 contains more entity classes, the image scenarios are simpler compared to Visual Genome. Therefore, only the top-1 triplets are shown in the second row while the original images are in the first row. Boxes and attention scores of subjects are colored in blue and those of objects in orange. RelTR demonstrates the excellent quality of its confident triplet proposals.
[Fig. 14 panels: generated scene graphs for two images. Node labels in the first graph: door, window, building, tire, car, street. Node labels in the second graph: woman, jean, hair, hand, shirt, laptop, screen, desk. Edge labels include on, has, in front of, wearing and looking at.]
Fig. 14: Qualitative results for scene graph generation of Visual Genome dataset. The top-9 relationships with confidence
and the generated scene graph are shown. Boxes and attention scores of subjects are colored with blue while objects
with orange. The orange vertices in the generated scene graph indicate the predictions are duplicated. The computer
is classified as laptop in the second image since there is no computer class but only laptop class in the used VG-
150 split [42]. Compared with the ground truth annotations in Fig. 12, the predictions of RelTR are diverse. Although
sometimes RelTR cannot label very difficult relationships correctly (e.g. looking at), the results demonstrate that the
generated scene graphs are of high quality.
Innovations (ZDIN) and the Deutsche Forschungsgemeinschaft (DFG) under Germany’s Excellence Strategy within the Cluster of Excellence PhoenixD (EXC 2122).

REFERENCES
[1] J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei, “Image retrieval using scene graphs,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3668–3678.
[2] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei, “Visual relationship detection with language priors,” in European conference on computer vision. Springer, 2016, pp. 852–869.
[3] K. Nguyen, S. Tripathi, B. Du, T. Guha, and T. Q. Nguyen, “In defense of scene graphs for image captioning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 1407–1416.
[4] L. Gao, B. Wang, and W. Wang, “Image captioning with scene-graph based semantic concepts,” in Proceedings of the 2018 10th International Conference on Machine Learning and Computing, 2018, pp. 225–229.
[5] J. Johnson, B. Hariharan, L. Van Der Maaten, J. Hoffman, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick, “Inferring and executing programs for visual reasoning,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2989–2998.
[6] O. Ashual and L. Wolf, “Specifying object attributes and relations in interactive scene generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4561–4569.
[7] J. Johnson, A. Gupta, and L. Fei-Fei, “Image generation from scene graphs,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1219–1228.
[8] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: towards real-time object detection with region proposal networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 6, pp. 1137–1149, 2016.
[9] R. Zellers, M. Yatskar, S. Thomson, and Y. Choi, “Neural motifs: Scene graph parsing with global context,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5831–5840.
[10] T. Chen, W. Yu, R. Chen, and L. Lin, “Knowledge-embedded routing network for scene graph generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6163–6171.
[11] R. Yu, A. Li, V. I. Morariu, and L. S. Davis, “Visual relationship detection with internal and external linguistic knowledge distillation,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1974–1982.
[12] A. Zareian, S. Karaman, and S.-F. Chang, “Bridging knowledge graphs to generate scene graphs,” in European Conference on Computer Vision, 2020, pp. 606–623.
[13] J. Gu, H. Zhao, Z. Lin, S. Li, J. Cai, and M. Ling, “Scene graph generation with external knowledge and image reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1969–1978.
[14] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 734–750.
[15] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9627–9636.
[16] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019.
[17] P. Sun, Y. Jiang, E. Xie, W. Shao, Z. Yuan, C. Wang, and P. Luo, “What makes for end-to-end object detection?” in International Conference on Machine Learning. PMLR, 2021, pp. 9934–9944.
[18] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European Conference on Computer Vision. Springer, 2020, pp. 213–229.
[19] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International journal of computer vision, vol. 123, no. 1, pp. 32–73, 2017.
[20] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov et al., “The open images dataset v4,” International Journal of Computer Vision, vol. 128, no. 7, pp. 1956–1981, 2020.
[21] X. Yang, K. Tang, H. Zhang, and J. Cai, “Auto-encoding scene graphs for image captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 685–10 694.
[22] J. Gu, S. Joty, J. Cai, H. Zhao, X. Yang, and G. Wang, “Unpaired image captioning via scene graph alignments,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 10 323–10 332.
[23] K.-H. Lee, H. Palangi, X. Chen, H. Hu, and J. Gao, “Learning visual relation priors for image-text matching and image captioning with neural scene graph generators,” arXiv preprint arXiv:1909.09953, 2019.
[24] J. Shi, H. Zhang, and J. Li, “Explainable and explicit visual reasoning over scene graphs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8376–8384.
[25] S. Lee, J.-W. Kim, Y. Oh, and J. H. Jeon, “Visual question answering over scene graph,” in International Conference on Graph Computing (GC), 2019, pp. 45–50.
[26] Y. Li, T. Ma, Y. Bai, N. Duan, S. Wei, and X. Wang, “Pastegan: A semi-parametric method to generate image from scene graph,” Advances in Neural Information Processing Systems, vol. 32, pp. 3948–3958, 2019.
[27] A. Talavera, D. S. Tan, A. Azcarraga, and K.-L. Hua, “Layout and context understanding for image synthesis with scene graphs,” in IEEE International Conference on Image Processing (ICIP), 2019, pp. 1905–1909.
[28] C. Galleguillos, A. Rabinovich, and S. Belongie, “Object categorization using co-occurrence, location and appearance,” in 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8.
[29] S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller, “Multi-class segmentation with relative location prior,” International journal of computer vision, vol. 80, no. 3, pp. 300–316, 2008.
[30] Y. Cong, H. Ackermann, W. Liao, M. Y. Yang, and B. Rosenhahn, “Nodis: Neural ordinary differential scene understanding,” in Proceedings of the European Conference on Computer Vision (ECCV), 2020, pp. 636–653.
[31] W. Wang, R. Wang, S. Shan, and X. Chen, “Exploring context and visual pattern of relationship for scene graph generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8188–8197.
[32] J. Shi, Y. Zhong, N. Xu, Y. Li, and C. Xu, “A simple baseline for weakly-supervised scene graph generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16 393–16 402.
[33] W. Wang, R. Wang, and X. Chen, “Topic scene graph generation by attention distillation from caption,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 900–15 910.
[34] Y. Lu, H. Rai, J. Chang, B. Knyazev, G. Yu, S. Shekhar, G. W. Taylor, and M. Volkovs, “Context-aware scene graph generation with seq2seq transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 931–15 941.
[35] K. Tang, H. Zhang, B. Wu, W. Luo, and W. Liu, “Learning to compose dynamic tree structures for visual contexts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6619–6628.
[36] L. Chen, H. Zhang, J. Xiao, X. He, S. Pu, and S.-F. Chang, “Counterfactual critic multi-agent training for scene graph generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4613–4623.
[37] M.-J. Chiou, H. Ding, H. Yan, C. Wang, R. Zimmermann, and J. Feng, “Recovering the unbiased scene graphs from the biased ones,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 1581–1590.
[38] J. Ji, R. Krishna, L. Fei-Fei, and J. C. Niebles, “Action genome: Actions as compositions of spatio-temporal scene graphs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 236–10 247.
[39] Y. Cong, W. Liao, H. Ackermann, B. Rosenhahn, and M. Y. Yang, “Spatial-temporal transformer for dynamic scene graph generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16 372–16 382.
[40] Y. Teng, L. Wang, Z. Li, and G. Wu, “Target adaptive context aggregation for video scene graph generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13 688–13 697.
[41] Y. Lu, C. Chang, H. Rai, G. Yu, and M. Volkovs, “Multi-view scene graph generation in videos,” in International Challenge on Activity Recognition (ActivityNet) CVPR 2021 Workshop, vol. 3, 2021.
[42] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, “Scene graph generation by iterative message passing,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5410–5419.
[43] Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang, “Scene graph generation from objects, phrases and region captions,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1261–1270.
[44] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, “Graph r-cnn for scene graph generation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 670–685.
[45] Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang, “Factorizable net: an efficient subgraph-based framework for scene graph generation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 335–351.
[46] R. Li, S. Zhang, B. Wan, and X. He, “Bipartite graph network with adaptive message passing for unbiased scene graph generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11 109–11 119.
[47] T. Chen, W. Yu, R. Chen, and L. Lin, “Knowledge-embedded routing network for scene graph generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[48] X. Lin, C. Ding, J. Zeng, and D. Tao, “Gps-net: Graph property sensing network for scene graph generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3746–3753.
[49] R. Herzig, M. Raboh, G. Chechik, J. Berant, and A. Globerson, “Mapping images to scene graphs with permutation-invariant structured prediction,” Advances in Neural Information Processing Systems, vol. 31, pp. 7211–7221, 2018.
[50] M. Qi, W. Li, Z. Yang, Y. Wang, and J. Luo, “Attentive relational networks for mapping images to scene graphs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3957–3966.
[51] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
[52] N. Dhingra, F. Ritter, and A. Kunz, “Bgt-net: Bidirectional gru transformer network for scene graph generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2150–2159.
[53] R. Koner, P. Sinhamahapatra, and V. Tresp, “Relation transformer network,” arXiv preprint arXiv:2004.06193, 2020.
[54] N. Gkanatsios, V. Pitsikalis, P. Koutras, and P. Maragos, “Attention-translation-relation network for scalable scene graph generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0.
[55] Z. Cui, C. Xu, W. Zheng, and J. Yang, “Context-dependent diffusion network for visual relationship detection,” in Proceedings of the 26th ACM international conference on Multimedia, 2018, pp. 1475–1482.
[56] J. Zhang, K. J. Shih, A. Elgammal, A. Tao, and B. Catanzaro, “Graphical contrastive losses for scene graph parsing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 535–11 543.
[57] B. Dai, Y. Zhang, and D. Lin, “Detecting visual relationships with deep relational networks,” in Proceedings of the IEEE conference on computer vision and Pattern recognition, 2017, pp. 3076–3086.
[58] M. Suhail, A. Mittal, B. Siddiquie, C. Broaddus, J. Eledath, G. Medioni, and L. Sigal, “Energy-based learning for scene graph generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13 936–13 945.
[59] K. Tang, Y. Niu, J. Huang, J. Shi, and H. Zhang, “Unbiased scene graph generation from biased training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3716–3725.
[60] S. Yan, C. Shen, Z. Jin, J. Huang, R. Jiang, Y. Chen, and X.-S. Hua, “Pcpl: Predicate-correlation perception learning for unbiased scene graph generation,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 265–273.
[61] H. Liu, N. Yan, M. Mortazavi, and B. Bhanu, “Fully convolutional scene graph generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11 546–11 556.
[62] B. Kim, J. Lee, J. Kang, E.-S. Kim, and H. J. Kim, “Hotr: End-to-end human-object interaction detection with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 74–83.
[63] Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, and H. Xia, “End-to-end video instance segmentation with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8741–8750.
[64] W. Liu, S. Chen, L. Guo, X. Zhu, and J. Liu, “Cptr: Full transformer network for image captioning,” arXiv preprint arXiv:2101.10804, 2021.
[65] F. Zeng, B. Dong, T. Wang, C. Chen, X. Zhang, and Y. Wei, “Motr: End-to-end multiple-object tracking with transformer,” arXiv preprint arXiv:2105.03247, 2021.
[66] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” in International Conference on Learning Representations (ICLR), 2021.
[67] Z. Yao, J. Ai, B. Li, and C. Zhang, “Efficient detr: Improving end-to-end object detector with dense prior,” arXiv preprint arXiv:2104.01318, 2021.
[68] C. Zou, B. Wang, Y. Hu, J. Liu, Q. Wu, Y. Zhao, B. Li, C. Zhang, C. Zhang, Y. Wei et al., “End-to-end human object interaction detection with hoi transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11 825–11 834.
[69] R. Stewart, M. Andriluka, and A. Y. Ng, “End-to-end people detection in crowded scenes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2325–2333.
[70] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 658–666.
[71] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations (ICLR), 2019.
[72] R. Al-Rfou, D. Choe, N. Constant, M. Guo, and L. Jones, “Character-level language modeling with deeper self-attention,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 3159–3166.
[73] H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua, “Visual translation embedding network for visual relation detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5532–5540.

Yuren Cong received his Bachelor degree at Hefei University of Technology in 2015. Then he studied Electrical Engineering and Information Technology at Leibniz University Hannover and received his Master degree in 2019. Since 2020 he has worked as a research assistant towards his Ph.D in the group of Prof. Rosenhahn. His research interests are in the fields of computer vision with specialization on scene graph generation.

Michael Ying Yang is currently Assistant Professor in the Department of Earth Observation Science at ITC - Faculty of Geo-Information Science and Earth Observation, University of Twente, The Netherlands, heading a group working on scene understanding. He received the PhD degree (summa cum laude) from University of Bonn (Germany) in 2011. He received the venia legendi in Computer Science from Leibniz University Hannover in 2016. His research interests are in the fields of computer vision and photogrammetry with specialization on scene understanding and semantic interpretation from imagery. He serves as Associate Editor of ISPRS Journal of Photogrammetry and Remote Sensing, Co-chair of ISPRS working group II/5 Dynamic Scene Analysis, Program Chair of ISPRS Geospatial Week 2019, and recipient of ISPRS President’s Honorary Citation (2016), Best Science Paper Award at BMVC (2016), and The Willem Schermerhorn Award (2020).

Bodo Rosenhahn studied Computer Science (minor subject Medicine) at the University of Kiel. He received the Dipl.-Inf. and Dr.-Ing. from the University of Kiel in 1999 and 2003, respectively. From 10/2003 till 10/2005, he worked as PostDoc at the University of Auckland (New Zealand), funded with a scholarship from the German Research Foundation (DFG). In 11/2005-08/2008 he worked as senior researcher at the Max-Planck Institute for Computer Science. Since 09/2008 he is Full Professor at the Leibniz-University of Hannover, heading a group on automated image interpretation. He has co-authored over 200 papers, holds 12 patents and organized several workshops and conferences in the last years. His works received several awards, including a DAGM-Prize 2002, the Dr.-Ing. Siegfried Werth Prize 2003, the DAGM-Main Prize 2005, the DAGM-Main Prize 2007, the Olympus-Prize 2007, and the Günter Enderle Award (Eurographics) 2017.