Language-Conditioned Feature Pyramids for Visual Selection Tasks

Taichi Iki 1,2 and Akiko Aizawa 1,2

1 National Institute of Informatics, Chiyoda-ku, Tokyo, Japan
2 Graduate University for Advanced Studies, Hayama, Kanagawa, Japan
{iki,aizawa}@nii.ac.jp

Abstract
Referring expression comprehension, which is the ability to locate language to an object in an image, plays an important role in creating common ground. Many models that fuse visual and linguistic features have been proposed. However, few models consider the fusion of linguistic features with multiple visual features with different sizes of receptive fields, though the proper size of the receptive field of visual features intuitively varies depending on expressions. In this paper, we introduce a neural network architecture that modulates visual features with varying sizes of receptive field by linguistic features. We evaluate our architecture on tasks related to referring expression comprehension in two visual dialogue games. The results show the advantages and broad applicability of our architecture. Source code is available at https://github.com/Alab-NII/lcfp.

Figure 1: Illustration of visual features with different sizes of the receptive fields. Dots represent objects that have color and size as their attributes. Grids in the right three images represent the receptive fields of their visual features. Our architecture fuses linguistic features with each visual feature.

1 Introduction

Referring expressions are a ubiquitous part of human communication (Krahmer and Van Deemter, 2012) that must be studied in order to create machines that work smoothly with humans. Much effort has been taken to improve methods of creating visual common ground between machines, which have limited means of expression and knowledge about the real world, and humans, from the perspectives of both referring expression comprehension and generation (Moratz et al., 2002; Tenbrink and Moratz, 2003; Funakoshi et al., 2004, 2005, 2006; Fang et al., 2013). Even now, researchers are exploring possible methods of designing more realistic scenarios for applications, such as in visual dialogue games (De Vries et al., 2017; Haber et al., 2019; Udagawa and Aizawa, 2019).

Many models have been proposed for referring expression comprehension so far. As image recognition matured, Guadarrama et al. (2014) studied object retrieval methods based on category labels predicted by the recognition models. Hu et al. (2016b) extended this approach to broader natural language expression including categories of objects, their attributes, positional configurations, and interactions. In recent years, models that fuse linguistic features with visual features using deep learning have been studied (Hu et al., 2016b,a; Anderson et al., 2018; Deng et al., 2018; Misra et al., 2018; Li et al., 2018; Yang et al., 2019a,b; Liu et al., 2019; Can et al., 2020).

When fusing the linguistic features of a spatial referring expression with visual features, the size of the receptive field of the visual features [1] is important. Let us take Figure 1 as an example. We can refer to the gray dot in the figure in various ways:

• a gray dot

• a dot next to the small dot

• a dot below and to the right of the large dot
• the rightmost dot in a triangle consisting of three dots

• the third largest dot of four dots

[1] In this paper, we picture the size of the receptive field of visual features as the grid size in the input image. Note that the size of the receptive field in a real model is wider than the grid size in general because of multiple convolutional layers.
As shown in the figure, there is an optimum size of receptive field when fusing the features of these expressions with the visual features. Although the small receptive field (in the second panel to the left) matches the expression a gray dot, it does not capture information about the triangle consisting of three dots to the upper left. Conversely, the largest receptive field (in the panel to the right) includes the triangle, but contains too much information to determine the color of the gray dot. Thus, linguistic and visual features have an optimum size of receptive field for fusion.

Few existing models, however, use fusion of linguistic features with visual features with different receptive field sizes. This is possibly because major datasets for referring expression comprehension, for example, Kazemzadeh et al. (2014); Plummer et al. (2015); Mao et al. (2016); Yu et al. (2016), use photographs and weigh expressions related to object category more often than positional relationships. Tenbrink and Moratz (2003); Tanaka et al. (2004); Liu et al. (2012, 2013) reveal that people often use group-based expressions (relative positional relationships of multiple objects) when there is no clear difference between objects; therefore, these expressions are not so unusual. Further investigation should be done on methods that handle referring expressions based on positional relationships.

For this reason, we focus on the OneCommon corpus (Udagawa and Aizawa, 2019), a recently proposed corpus on a visual dialogue game using composite images of simple figures. It captures various expressions based on positional relationships, such as group-based expressions, as shown in Figure 2.

Figure 2: Example of OneCommon view and dialogue. In the OneCommon framework, two players observe slightly different views due to parallel shift. The game requires them to create common ground about the views through free conversation and identify the same dot. We show part of an utterance and underline some expressions that refer to an object or a group.

In this paper, we introduce a neural network architecture for referring expression comprehension considering visual features with different sizes of the receptive fields, and evaluate it on the OneCommon task. Our structure combines feature pyramid networks (FPN) (Lin et al., 2017) and feature-wise linear modulation (FiLM) (Perez et al., 2018) and modulates visual features with different sizes of the receptive fields with linguistic features of referring expressions. FPN is an architecture that uses each layer of the hierarchical convolutional neural network (CNN) feature extractor for object detection, whereas FiLM is a structure that robustly fuses linguistic features with visual features.

To confirm the broad applicability of our architecture, we further evaluate it on another task, which is expected to require the ability of object category recognition more than OneCommon does because it uses photographs. We find that our architecture achieves better accuracy in these tasks than some existing models, suggesting the advantage of fusion of linguistic features with multiple visual features that have different receptive fields.

The contributions of this paper are as follows:

1. We propose the language-conditioned feature pyramid (LCFP) architecture, which modulates visual features with multiple sizes of receptive fields using language features.

2. We apply LCFP to dialogue history object retrieval; our evaluation demonstrates the advantage of our architecture on referring expression comprehension in visual dialogue.

2 Dialogue History Object Retrieval

The main focus of this paper is the task of predicting the final object selected by the speaker given a dialogue history, a scene image, and candidate objects in the image. A dialogue history consists of a list of speaker and utterance pairs. We consider dialogues where speakers switch every turn. Candidate objects are indicated by bounding boxes in the image. Some task instances provide additional information, such as object categories. Here, we call this task dialogue history object retrieval.
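For concreteness, a single instance of dialogue history object retrieval could be represented roughly as follows; the field names below are ours and purely illustrative, not taken from either corpus.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Utterance:
    speaker: str          # e.g. "you" or "them"; speakers alternate every turn
    text: str


@dataclass
class Candidate:
    box: Tuple[float, float, float, float]    # (x1, y1, x2, y2) in image coordinates
    category: Optional[str] = None            # additional information, when the task provides it


@dataclass
class RetrievalInstance:
    dialogue: List[Utterance]   # the full dialogue history
    image_path: str             # the scene image
    candidates: List[Candidate]
    target_index: int           # index of the object finally selected by the speaker
```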
OneCommon Target Selection Task   OneCommon is a dialogue corpus for common grounding. It contains 6,760 dialogues from a collaborative referring game where two players are given a view that contains 7 dots, as shown in Figure 2. Dots have four attributes: x/y coordinates on a plane, size, and color. Only some dots are seen in common because the centers of the players' views are different. The goal of the game is to select the same dot after talking. Target selection is a subtask of the game, requiring prediction of the dot that a player chose based on a given player's view and dialogue history.

GuessWhat?! Guesser Subtask   GuessWhat?! (De Vries et al., 2017) is a game related to multimodal dialogue. Two players play the roles of oracle and questioner. They are given a photo and the oracle mentally selects an object. Then, the questioner asks the oracle yes-or-no questions to guess the object. The goal of the game is to select the object at the end of a question sequence. A published collection of game records consists of 150,000 games with human players, with a total of 800,000 visual question–answer pairs on 66,000 images extracted from the MS COCO dataset (Lin et al., 2014). The guesser subtask is to predict the correct object from 3–20 candidate objects based on a given photo and set of question–answer pairs. Candidate information includes bounding boxes and object category.

In addition to dialogue history object retrieval, there is an increasing amount of research on task design for visual dialogue games that require unique common understanding. For example, in the PhotoBook dataset (Haber et al., 2019), two participants are presented with multiple images, and they predict whether an image is presented only to them or also to the other person through conversation.

3 Related Work

This section first describes an overview of the models for referring expression comprehension and then gives some details about models related to the OneCommon Corpus and GuessWhat?!.

3.1 Models for Referring Expression Comprehension

Models for extracting objects from an image are often based on object detection (Ren et al., 2015; Liu et al., 2016; Lin et al., 2017; Redmon and Farhadi, 2018) or image segmentation (Ronneberger et al., 2015). Object detection considers only the bounding boxes of the objects. Image segmentation extracts the areas indicated by the outlines of the objects. Referring expression comprehension also includes reference detection (Hu et al., 2016b; Anderson et al., 2018; Deng et al., 2018; Yang et al., 2019a,b) and segmentation (Hu et al., 2016a; Li et al., 2018; Misra et al., 2018; Liu et al., 2019; Can et al., 2020), correspondingly.

The standard reference detection consists of two stages: detecting candidate objects and selecting objects that match the expression from the candidates. Essentially, these models do not fuse visual feature maps with language when detecting candidates. Yang et al. (2019b) proposes a one-stage model that combines the feature map of the object detector with language to directly select the referred object. Whereas their model fuses linguistic and visual features after reducing visual features of the different receptive field sizes, ours fuses them before the reduction. Zhao et al. (2018) also proposes a model with a structure that fuses multiple scales and languages for weakly supervised learning. However, they use concatenation as the method of fusion, whereas we use FiLM.

For reference segmentation, Li et al. (2018) point out a lack of multi-scale semantics and propose a method that recursively fuses feature maps of different scales using a recurrent neural network (RNN). However, this method concatenates linguistic features with only the first input of the RNN; hence, the feature map in each scale and the linguistic features may be poorly fused. U-Net-based models (Misra et al., 2018; Can et al., 2020) have the most similar structure to ours. They produce hierarchical feature maps with CNNs, modulate those maps with language, and unify them into a single map through consecutive deconvolution operations.

The major difference between those U-Net-based models and ours is the fusion architecture. The U-Net-based models generate kernels from linguistic features to convolve visual features. Our model applies an affine transformation to visual features using coefficients made from linguistic features in FiLM blocks.
Suppose the dimensions of the source and modulated visual features are D_s and D_m, respectively. Then, the size of the kernel for convolution is D_s × D_m, whereas the size of the coefficients for the affine transformation is 2 D_m. Because of this independence from D_s, our model has the advantage of being able to handle visual features with large dimensions, such as the last layer of ResNet50 (He et al., 2016), typically with 2048 dimensions.

3.2 Models for Dialogue History Object Retrieval

OneCommon Target Selection   Udagawa and Aizawa (2019) proposed the baseline model TSEL, which creates the features of a candidate taking into account its attributes (size, color and position) and the average of the differences between its attributes and the attributes of the other candidates. This model does not use visual features directly.

Udagawa and Aizawa (2020) extended the baseline model from the perspective of learning tasks and introduced TSEL-REF and TSEL-REF-DIAL. TSEL-REF has a similar structure to TSEL and learns in a multi-task setting. It resolves referring expressions in utterances, as well as the final prediction. Additional data consisting of manual annotations of reference resolution are used for the training. TSEL-REF-DIAL also learns on self-play of dialogue in addition to the TSEL-REF training.

GuessWhat?! Guesser Subtask   The GuessWhat?! paper proposes baseline models that use object category and position to create candidate features. Although the paper reports that the extension of their baseline model to visual features from object recognition does not have any advantages, some models that use visual features, for example, A-ATT (Deng et al., 2018) and HACAN (Yang et al., 2019a), have recently improved the performance on GuessWhat?!. Their approach, based on reference detection and an attention mechanism, fuses linguistic features with visual features that have a single size of the receptive fields.

4 Preliminary

We introduce two prerequisite architectures to describe our proposal.

4.1 Feature-wise Linear Modulation

A feature-wise linear modulation (Perez et al., 2018) block fuses a given language vector and feature map to make a new feature map. Let the output feature map dimension be d_out, the language vector v_lang with dimension d_lang, and the feature map f_in with dimension d_in and shape (h, w).

The trainable parts of the block are two linear transformations B and G, two convolutional layers CNV(1) and CNV(2), and a batch normalization (BN) (Ioffe and Szegedy, 2015) layer.

First, it performs a linear transformation on v_lang to obtain the coefficients of the affine transformation,

β = B v_lang;  B ∈ R^(d_lang × d_out),
γ = G v_lang;  G ∈ R^(d_lang × d_out).

Second, it applies CNV(1) to f_in after concatenating a positional encoding (PE),

f_vis = F(CNV(1)(PE(f_in))),

where F is an activation function, typically a rectified linear unit (ReLU) (Nair and Hinton, 2010), and PE(f_in) denotes the concatenation of the two-dimensional position of each pixel in f_in, normalized to a range of [−1, 1] on each axis.

Last, the second convolutional layer CNV(2) with BN and the affine transformation is applied to f_vis:

f_fuse = F(β ⊙ BN(CNV(2)(f_vis)) + γ),
f_film = f_vis + f_fuse   (1),

where ⊙ denotes the element-wise product. Language and vision are fused in this equation. f_film is the FiLMed feature map. Note that f_film can be divided into a language-independent part f_vis and a language-dependent part f_fuse. We analyze the effect of the terms in Section 6.3.

4.2 Feature Pyramid Networks

Feature Pyramid Networks (FPN) (Lin et al., 2017) use an object recognition model as a backbone and reconstruct semantically rich feature maps from the feature extraction results. Here, we suppose that the backbone is ResNet.

ResNet and Stages of Feature Map   The ResNet family has a common structure for reducing the size of the input images. First, it converts an input image into a feature map with half the resolution of the image with a convolutional layer. Next, it reduces the map by a factor of two with the pooling operation. Subsequently, it applies some residual blocks, gradually reducing the resolution by half. This is repeated until the size becomes 1/32 of the original image. We define the final layer of each resolution as the feature map of the stage; namely, C1 is the final layer of the 1/2 resolution map, C2 is of 1/4, ..., C5 is of 1/32.
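To make Section 4.1 concrete, the following is a minimal PyTorch sketch of a FiLM block following Equation (1). Class and method names are ours, not the authors' released code; the padding, initialization, and exact form of the positional encoding are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FiLMBlock(nn.Module):
    """Sketch of the FiLM block of Section 4.1 / Equation (1)."""

    def __init__(self, d_lang: int, d_in: int, d_out: int):
        super().__init__()
        # B and G: produce the affine coefficients beta and gamma from the language vector.
        self.to_beta = nn.Linear(d_lang, d_out)
        self.to_gamma = nn.Linear(d_lang, d_out)
        # CNV(1): 1x1 conv applied after concatenating a 2-channel positional encoding.
        self.conv1 = nn.Conv2d(d_in + 2, d_out, kernel_size=1)
        # CNV(2): 3x3 conv followed by batch normalization.
        self.conv2 = nn.Conv2d(d_out, d_out, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(d_out)

    @staticmethod
    def positional_encoding(f_in: torch.Tensor) -> torch.Tensor:
        # Concatenate x/y coordinates normalized to [-1, 1] as two extra channels.
        n, _, h, w = f_in.shape
        ys = torch.linspace(-1.0, 1.0, h, device=f_in.device)
        xs = torch.linspace(-1.0, 1.0, w, device=f_in.device)
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
        pos = torch.stack([grid_x, grid_y]).unsqueeze(0).expand(n, -1, -1, -1)
        return torch.cat([f_in, pos], dim=1)

    def forward(self, v_lang: torch.Tensor, f_in: torch.Tensor) -> torch.Tensor:
        beta = self.to_beta(v_lang)[:, :, None, None]    # (N, d_out, 1, 1)
        gamma = self.to_gamma(v_lang)[:, :, None, None]  # (N, d_out, 1, 1)
        f_vis = F.relu(self.conv1(self.positional_encoding(f_in)))
        f_fuse = F.relu(beta * self.bn(self.conv2(f_vis)) + gamma)
        return f_vis + f_fuse  # f_film = f_vis + f_fuse (Equation 1)
```

The 1 × 1 and 3 × 3 kernel sizes follow the setting stated in Section 5.1; everything else about this block is a sketch rather than the reference implementation.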
Figure 3: Overview of our architecture, consisting of a visual feature extractor and a language encoder. The
feature maps (C1, ..., C5) from the extractor are fused in feature-wise linear modulation blocks with the language
embedded and summed recursively. Striped boxes denote language-conditioned feature maps. For dialogue history
object retrieval, the finest map (P1) is fed into the subsequent pooling layer.

Top-down Reconstruction   FPN makes feature pyramids from the stages of a backbone in a top-down manner. Suppose that CNV(2), ..., CNV(5) are trainable convolutional layers and P2, ..., P5 (P stands for pyramid) are the reconstructed feature maps on each stage [2]. Then Pi can be represented as follows:

Pi = CNV(i)(Ci) + Resize2(P(i+1))   (2),

where P6 = 0 and Resize2 denotes the operation to enlarge the image twice. This means that Pi contains information about higher and coarser stages, which hold more complex semantics in general because of their wider receptive fields.

[2] The reason we do not mention P1 is that the original paper does not use C1 and P1 owing to their large memory footprint.

5 Proposed Method

Our architecture consists of language-conditioned feature pyramids (LCFP) for general feature extraction and a feature extractor for specific tasks, as shown in Figure 3. In this section, we describe LCFP and the following structure for dialogue history object retrieval.

5.1 Language-Conditioned Feature Pyramids

Language Encoder   LCFP requires a fixed-length vector of language information to generate input for FiLM blocks. We can use any fixed vector, such as the last hidden layers of RNNs or transformer-based language models such as Devlin et al. (2019). Our proposal adopts a gated recurrent unit (GRU) (Cho et al., 2014) in accordance with the FiLM paper (Perez et al., 2018). Suppose that d_lang is the dimension of the hidden layer,

h_lang = GRU(text) ∈ R^d_lang.

Visual Feature Extractor   We use ResNet as our backbone. In addition to the C2-C5 described in Section 4.2, we use C1 because our goal is to incorporate information in the low stages, i.e., visual features with small receptive fields.

{Ci; i = 1, ..., 5} = ResNet(image).

Fusing Language and Vision   The key idea to combine the aforementioned two architectures is to replace the convolutional layers of FPN in Equation 2 with FiLM blocks. We represent the block as a function FiLM(v_lang, f_in). Then, our feature reconstruction can be expressed as follows:

Pi = FiLM(i)(h_lang, Ci) + Resize2(P(i+1))   (3),

where the weights of the FiLM block in each stage are different from each other. We set the kernel sizes for CNV(1) and CNV(2) in each FiLM block to 1 × 1 and 3 × 3, respectively, according to Perez et al. (2018). {Pi; i = 1, ..., 5} is the output of LCFP.
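To illustrate how Equations (2) and (3) fit together, here is a compact, self-contained sketch of the LCFP forward pass. This is a reimplementation under our own naming, not the authors' released code: it assumes torchvision's resnet50 (whose pretrained-weights argument differs across torchvision versions), a GRU text encoder as in Section 5.1, and a compact copy of the FiLM block sketched after Section 4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class FiLMBlock(nn.Module):
    # Compact version of the FiLM block from Section 4.1 (see the earlier sketch).
    def __init__(self, d_lang, d_in, d_out):
        super().__init__()
        self.to_beta = nn.Linear(d_lang, d_out)
        self.to_gamma = nn.Linear(d_lang, d_out)
        self.conv1 = nn.Conv2d(d_in + 2, d_out, 1)
        self.conv2 = nn.Conv2d(d_out, d_out, 3, padding=1)
        self.bn = nn.BatchNorm2d(d_out)

    def forward(self, v_lang, f_in):
        n, _, h, w = f_in.shape
        ys = torch.linspace(-1, 1, h, device=f_in.device)
        xs = torch.linspace(-1, 1, w, device=f_in.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        pos = torch.stack([gx, gy]).unsqueeze(0).expand(n, -1, -1, -1)
        f_vis = F.relu(self.conv1(torch.cat([f_in, pos], dim=1)))
        beta = self.to_beta(v_lang)[:, :, None, None]
        gamma = self.to_gamma(v_lang)[:, :, None, None]
        f_fuse = F.relu(beta * self.bn(self.conv2(f_vis)) + gamma)
        return f_vis + f_fuse


class LCFP(nn.Module):
    """Language-conditioned feature pyramid: Equation (3) with a ResNet50 backbone."""

    def __init__(self, vocab_size, d_lang=1024, d_map=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 256)
        self.gru = nn.GRU(256, d_lang, batch_first=True)
        backbone = torchvision.models.resnet50(pretrained=True)
        for p in backbone.parameters():
            p.requires_grad = False  # backbone weights are kept fixed, as in Section 6
        # Stages C1..C5 (1/2 .. 1/32 resolution) of the ResNet family.
        self.stage1 = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu)
        self.stage2 = nn.Sequential(backbone.maxpool, backbone.layer1)
        self.stage3 = backbone.layer2
        self.stage4 = backbone.layer3
        self.stage5 = backbone.layer4
        dims = [64, 256, 512, 1024, 2048]  # channel widths of C1..C5 in ResNet50
        self.film = nn.ModuleList([FiLMBlock(d_lang, d, d_map) for d in dims])

    def forward(self, tokens, image):
        # h_lang = GRU(text): last hidden state of the GRU over the serialized dialogue.
        _, h = self.gru(self.embed(tokens))
        h_lang = h[-1]
        # C1..C5 from the backbone.
        c, x = [], image
        for stage in (self.stage1, self.stage2, self.stage3, self.stage4, self.stage5):
            x = stage(x)
            c.append(x)
        # Top-down pass: Pi = FiLM_i(h_lang, Ci) + Resize2(P(i+1)), with P6 = 0.
        p, pyramid = None, [None] * 5
        for i in reversed(range(5)):
            fused = self.film[i](h_lang, c[i])
            if p is not None:
                fused = fused + F.interpolate(p, size=fused.shape[-2:], mode="nearest")
            p = fused
            pyramid[i] = p
        return pyramid  # pyramid[0] is the finest map P1
```

Note that setting requires_grad to False alone does not freeze batch normalization statistics; keeping the backbone in eval() mode during training would match the fixed-statistics setting described in Section 6.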

5.2 LCFP-Based Dialogue History Object Retrieval

We formulate dialogue history object retrieval as a classification that predicts a selected object based on a dialogue history, a scene image, and a set of candidate information. The candidate information consists of a bounding box (x1, y1, x2, y2) in an image and a fixed-length vector v that represents the additional information.

Candidate Features   We extract a region corresponding to the bounding box of each candidate from the feature map P1 obtained via LCFP. For candidate i, the features in the region are averaged to be converted into a fixed-length vector:

f0_i = Σ_{k ∈ region_i} P1_k / Σ_{k ∈ region_i} 1,

where region_i and P1_k indicate the region of candidate i and the vector at position k in feature map P1, respectively. We concatenate f0_i with the additional information vector v_i for candidate i to make a full feature vector:

f_i = [f0_i; v_i].

Probability Calculation   We apply a linear layer with ReLU activation to each feature and another linear layer with a one-dimensional output to obtain a logit for each candidate:

logit_i = W2 ReLU(W1 f_i + b).

We apply softmax over all logits of the candidates when we need the probability of the selected candidate.

6 Experiments

We first validate the advantage of our architecture on two tasks in dialogue history object retrieval described in Section 2. We then investigate the cause of the advantage through ablation studies.

Common Text Processing   We consider a dialogue history as a text that starts with the task name followed by a <text> token, with a sequence of utterances and a <selection> token at the end. Each utterance is interposed between a speaker token, <you> or <them>, and an end-of-sequence token <eos>. Tokenization of utterances is different for each task.

Common Implementation   We implemented our model with the PyTorch framework (Paszke et al., 2019). We used the ResNet50 provided by the PyTorch vision package, which is pretrained on object recognition with the ImageNet dataset (Deng et al., 2009), as a backbone. All weights of the backbone, including the statistics for batch normalization, are fixed. The dimensions of token embeddings, GRU hidden states, feature maps, additional information, and the last linear layer are 256, 1024, 256, 256 and 1024, respectively. For optimization, we used ADAM (Kingma and Ba, 2014) with alpha 5e-4, eps 1e-9, and mini-batch size 32. No regularization was used except for BN. We ran 5 epochs in a trial and chose the weight set with the lowest validation loss.

6.1 OneCommon Target Selection Task

Model Detail   Tokenization was performed by splitting on white spaces; all tokens are uncased. Tokens that appear fewer than five times in the training dataset were replaced with an <unk> token. We drew the game views based on candidate dot data in a 224px square image. The additional information vector is disabled by inputting a vector that denotes that information is not provided.

Results   Table 1 compares accuracy between the existing models and ours. Our model achieves better accuracy than the three models described in Section 3.2, although the accuracy is lower than human performance. In particular, our model outperforms TSEL-REF and TSEL-REF-DIAL, which use additional learning, while learning only from standard training data. This result demonstrates the advantages and the high learning efficiency of our architecture.

Table 1: Accuracy on OneCommon Target Selection. SO indicates successful games only. The average results of 10 trials are shown. The values of TSEL, TSEL-REF, TSEL-REF-DIAL, and Human are from Udagawa and Aizawa (2020).

Model | Valid. | Test (Full) | Test (SO)
TSEL | 67.79 ±1.53 | - | -
TSEL-REF | 69.01 ±1.58 | - | -
TSEL-REF-DIAL | 69.09 ±1.12 | - | -
LCFP | 72.99 ±1.37 | 73.47 ±1.09 | 78.26 ±1.21
Human | - | - | 90.79
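Returning to Section 5.2, the candidate features and probability calculation could be implemented roughly as follows. This is a sketch with our own names; it assumes candidate boxes are already mapped to P1's pixel coordinates and that an additional-information vector is available per candidate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CandidateScorer(nn.Module):
    """Average-pool P1 inside each candidate box, concatenate extra info, and score."""

    def __init__(self, d_map=256, d_add=256, d_hidden=1024):
        super().__init__()
        self.fc1 = nn.Linear(d_map + d_add, d_hidden)
        self.fc2 = nn.Linear(d_hidden, 1)

    def forward(self, p1, boxes, v_add):
        # p1:    (N, d_map, H, W) finest LCFP map
        # boxes: list over the batch of (K_i, 4) boxes (x1, y1, x2, y2) in P1 coordinates
        # v_add: list over the batch of (K_i, d_add) additional-information vectors
        all_log_probs = []
        for n, (bxs, extra) in enumerate(zip(boxes, v_add)):
            feats = []
            for (x1, y1, x2, y2) in bxs.round().long().tolist():
                region = p1[n, :, y1:y2 + 1, x1:x2 + 1]
                feats.append(region.mean(dim=(1, 2)))     # f0_i: averaged region feature
            f = torch.cat([torch.stack(feats), extra], dim=1)   # f_i = [f0_i; v_i]
            logits = self.fc2(F.relu(self.fc1(f))).squeeze(-1)  # one logit per candidate
            # Softmax over candidates gives the probability of each being selected.
            all_log_probs.append(F.log_softmax(logits, dim=0))
        return all_log_probs
```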
Table 2: Error rate on the GuessWhat?! Guesser Subtask. SL: supervised learning, RL: reinforcement learning, HAST: History-Advantaged Sequence Training (Yang et al., 2019a). The average result of 5 trials for LCFP. (1) De Vries et al. (2017), (2) Deng et al. (2018), (3) Yang et al. (2019a), (4) Pang and Wang (2020).

Model | Learning | Train err. | Valid. err. | Test err.
LSTM (1) | SL | 27.9 | 37.9 | 38.7
HRED (1) | SL | 32.6 | 38.2 | 39.0
LSTM+VGG (1) | SL | 26.1 | 38.5 | 39.2
HRED+VGG (1) | SL | 27.4 | 38.4 | 39.6
A-ATT (2) | SL | 26.7 | 33.7 | 34.2
HACAN w/o HAST (3) | SL | 26.9 | 33.6 | 34.1
GST (SL) (4) | SL | 24.7 | 33.7 | 34.3
LCFP (ours) | SL | 20.1 ±1.6 | 32.2 ±0.2 | 33.1 ±0.5
HACAN (3) | HAST | 26.1 | 32.3 | 33.2
GST (RL, Max.Q's=8) (4) | RL | 16.7 | 16.9 | 18.4
Human | - | 9.0 | 9.0 | 9.2

Table 3: Ablation study on the OneCommon Target Selection Task (OC) and GuessWhat?! Guesser Subtask (GW). Validation error is shown. We ablate some of the f_vis and f_fuse terms in the FiLM block at each stage. The f_vis and f_fuse rows for each model show the condition, where X indicates that the model uses the corresponding information.

Setting 1: Stages Ablation
Model | Term | Stage 5 | 4 | 3 | 2 | 1 | OC | GW
A5 | f_vis | X | | | | | 45.8 | 38.4
   | f_fuse | X | | | | | |
A3 | f_vis | X | X | X | | | 28.5 | 33.1
   | f_fuse | X | X | X | | | |
Full | f_vis | X | X | X | X | X | 27.0 | 32.2
     | f_fuse | X | X | X | X | X | |

Setting 2: Language-Conditioned Parts Ablation
Model | Term | Stage 5 | 4 | 3 | 2 | 1 | OC | GW
A5' | f_vis | X | X | X | X | X | 38.8 | 37.8
    | f_fuse | X | | | | | |
A3' | f_vis | X | X | X | X | X | 27.4 | 32.9
    | f_fuse | X | X | X | | | |
Full | f_vis | X | X | X | X | X | 27.0 | 32.2
     | f_fuse | X | X | X | X | X | |

6.2 GuessWhat?! Guesser Subtask

Although it contains many referring expressions related to positional relationships, OneCommon uses a view with simple figures. We next evaluated our architecture on the Guesser subtask of GuessWhat?!, which uses photographs, to verify whether our structure can be applied to more complex visual information.

Model Detail   We tokenized utterances with NLTK's TweetTokenizer under case-insensitive conditions and omitted tokens appearing fewer than five times in the training dataset. We resized the photos to 224px square, regardless of their aspect ratio. As additional information, we input the object categories provided by the dataset, converting them into one-hot embedding vectors.

Results   Table 2 shows the error rate of the task. The table also shows the learning methods of the models. Our model achieves the lowest error rate among models trained with supervised learning, including models that use visual features (LSTM+VGG, HRED+VGG, A-ATT and HACAN w/o HAST). This demonstrates that our architecture can be applied to visual input of natural objects as well as simple figures. Our method alone does not match the results of the method using reinforcement learning; however, our method can be combined with those more sophisticated learning methods. Examining such combinations will be an interesting topic for the future.

6.3 Ablation

To confirm the importance of fusing multiple visual features that have different receptive field sizes with linguistic features, we performed ablation in two settings: stage ablation and language-conditioned parts ablation. The former examines the effect of applying FiLM to small receptive fields by removing FiLM for some stages. The latter examines the effect of language modulation by leaving only the language-independent parts of FiLM.

Stage Ablation   The stage ablation in Table 3 compares the A5, A3 and Full models. A5 uses only the last stage of the image extractor and Full uses all stages; A3 is in the middle. The same trend exists for both OneCommon and GuessWhat?!: the Full model outperforms A5 and achieves a slightly better result than A3. This shows that considering visual features with a small receptive field size improves performance.

Language-Conditioned Parts Ablation   This ablation introduces the A5' and A3' models, which use the language-independent f_vis part in all stages but do not use the language-dependent f_fuse part in some stages (see Equation 1 in Section 4.1 for the definition of f_vis and f_fuse). Comparing A5 with A5' and A3 with A3' shows that the models consistently achieve better results when using the language-dependent part, suggesting that the language fusion has a positive impact.
Although the impacts of the language fusion in stages 2 and 1 were expected to be relatively small owing to the small difference between the Full and A3' models, they still have some impact on the performance.

Combining these, we conclude that the advantage evaluated in the previous subsection is a result of the fusion of linguistic features with multiple visual features with different receptive field sizes.

7 Discussion

Finally, this section focuses on linguistic expressions. We discuss the effect of our architecture on group-based referring expressions and our first intuition regarding the relationship between expression and receptive fields using OneCommon.

7.1 Effect on Group-Based Expression Comprehension

To obtain an insight into the performance on group-based referring expressions, we performed an aggregation over examples in which the dialogue includes tokens related to groups. We took the six tokens shown in Table 4 as markers indicating that the dialogue contains a group-based referring expression. If the model struggles to handle group-based referring expressions, the accuracy should be lower than the overall accuracy.

Table 4 shows the results. The baseline model TSEL yields low accuracy on triangle, group, pair, square, and trapezoid, with large drops ranging from 6% to 24% compared to the overall accuracy. Conversely, our architecture reduces the drop; in the worst case, triangle, accuracy drops by 3%. This supports the idea that our architecture improves the understanding of group-based referring expressions.

Table 4: Accuracy of example sets containing group-related tokens on OneCommon Target Selection. N represents the number of examples that contain group-related tokens in their dialogue. We show the differences between the accuracy of the overall and example sets in parentheses. We merged the validation and test splits for this table. The average results of three trials are shown.

Token | N | TSEL [%] | LCFP [%]
(overall) | 2702 | 66 | 74
triangle | 304 | 60 (-6) | 71 (-3)
group | 100 | 55 (-11) | 72 (-2)
pair | 72 | 56 (-10) | 72 (-2)
square | 10 | 47 (-19) | 80 (+6)
diamond | 6 | 72 (+6) | 100 (+26)
trapezoid | 4 | 42 (-24) | 75 (+1)

Note that dialogue history object retrieval resolves the final reference of the dialogue. The existence of a group-based referring expression does not necessarily mean that it relates to the answer; hence, this is indirect support.

7.2 Expressions and the Size of Receptive Fields

We visualized the activation pattern of the modulated features in our architecture to verify our first intuition that linguistic and visual features have an optimum size of receptive field for fusion.

Figure 4 shows the results. For visualization, we input simple expressions related to single attributes such as select the largest dot (size) or select the darkest dot (color). The stage with the most activated pattern varies depending on the attributes in the expressions. We observed this phenomenon on view inputs different from the view in Figure 4. The model pays the most attention to stage 1, which has the smallest receptive field, when it receives an input expression related to color. Then, it moves to the stages with the larger receptive fields as the input changes to size and position. That is likely to correspond to the typical magnitude of localization.

These results suggest that the model selects visual features by the size of the receptive field according to the referring expression, supporting our first intuition.

Failure Cases   Although the model makes good predictions regarding size and color, it does not handle position well. Thus, there is still room to improve on expressions related to positional relationships, although the model improves this ability.

Through this visualization, we observed that our model tends to set the wrong range. For example, for the four position-related expressions in Figure 4, the model predicts answers only from dots in the salient triangle formed by dots c, d and e.

A possible explanation of this observation is data bias. Because the OneCommon game framework rewards players if they successfully create common ground with each other, players may tend to mention more salient dots to increase the success rate. As a result, the variation of expressions could be restricted. In fact, Udagawa and Aizawa (2019) report these trends for the color and size attributes. This suggests the importance of exploring task design for data collection from the viewpoint of collecting a wide range of general reference expressions.
Figure 4: Single-attribute referring expressions and averaged activation pattern in feature-wise linear modulation
blocks. All patterns are normalized with the same factor. The input view is shown in the top center (characters are
a guide to identify the dots, not inputs). Each band of patterns has five maps corresponding to the stages of the
model. The language-independent parts (fvis ) to the upper left are common to all expressions. The remaining parts
(ffuse ) are responses to the expressions. Black dots under the maps indicate the stage with the largest activation.
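The per-stage activation patterns shown in Figure 4 can be aggregated along the following lines. This is a rough sketch under our own assumptions about how the maps are averaged; the exact normalization used for the figure is not specified in the text.

```python
import torch


def stage_activation_maps(f_vis_list, f_fuse_list):
    """Channel-averaged activation of the language-independent (f_vis) and
    language-dependent (f_fuse) parts for each of the five stages, plus the
    index of the stage whose f_fuse responds most strongly."""
    vis_maps = [f.abs().mean(dim=1) for f in f_vis_list]    # each: (N, H_i, W_i)
    fuse_maps = [f.abs().mean(dim=1) for f in f_fuse_list]
    strengths = torch.stack([m.mean(dim=(1, 2)) for m in fuse_maps], dim=1)  # (N, 5)
    most_activated_stage = strengths.argmax(dim=1)
    return vis_maps, fuse_maps, most_activated_stage
```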

8 Conclusion

To improve referring expression comprehension, this paper proposes a neural network architecture that modulates visual features, which have different sizes of receptive fields in each hierarchy extracted by CNNs, with linguistic features. As our architecture affine-transforms visual features with linguistic features, it requires a lower calculation cost than methods that generate convolution kernels.

Our evaluation on referring expression comprehension tasks in two visual dialogue games demonstrates the model's advantage in the understanding of referring expressions and the broad applicability of our architecture. Ablation studies support the importance of multiple fusion.

We expect that hierarchical visual information is also important for generation. However, our architecture is difficult to apply directly to referring expression generation because it outputs modulated feature maps. Therefore, a future direction is to extend our architecture to language generation.

Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments. This work was supported by NEDO SIP-2 "Big-data and AI-enabled Cyberspace Technologies."

References

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086.

Ozan Arkan Can, İlker Kesen, and Deniz Yuret. 2020. Bilingunet: Image segmentation by modulating top-down and bottom-up visual processing with referring expressions. arXiv preprint arXiv:2003.12739.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar. Association for Computational Linguistics.
Harm De Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. 2017. Guesswhat?! visual object discovery through multi-modal dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5503–5512.

Chaorui Deng, Qi Wu, Qingyao Wu, Fuyuan Hu, Fan Lyu, and Mingkui Tan. 2018. Visual grounding via accumulated attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7746–7755.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Rui Fang, Changsong Liu, Lanbo She, and Joyce Chai. 2013. Towards situated dialogue: Revisiting referring expression generation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 392–402.

Kotaro Funakoshi, Satoru Watanabe, Naoko Kuriyama, and Takenobu Tokunaga. 2004. Generation of relative referring expressions based on perceptual grouping. In Proceedings of the 20th International Conference on Computational Linguistics, pages 666–672. Association for Computational Linguistics.

Kotaro Funakoshi, Satoru Watanabe, and Takenobu Tokunaga. 2006. Group-based generation of referring expressions. In Proceedings of the Fourth International Natural Language Generation Conference, pages 73–80.

Kotaro Funakoshi, Satoru Watanabe, Takenobu Tokunaga, and Naoko Kuriyama. 2005. Understanding referring expressions involving perceptual grouping. In 2005 International Conference on Cyberworlds (CW'05), pages 413–420. IEEE.

Sergio Guadarrama, Erik Rodner, Kate Saenko, Ning Zhang, Ryan Farrell, Jeff Donahue, and Trevor Darrell. 2014. Open-vocabulary object retrieval. In Robotics: Science and Systems, volume 2, page 6.

Janosch Haber, Tim Baumgärtner, Ece Takmaz, Lieke Gelderloos, Elia Bruni, and Raquel Fernández. 2019. The photobook dataset: Building common ground through visually-grounded dialogue. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1895–1910.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. 2016a. Segmentation from natural language expressions. In European Conference on Computer Vision, pages 108–124. Springer.

Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, and Trevor Darrell. 2016b. Natural language object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4555–4564.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456.

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. 2014. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Emiel Krahmer and Kees Van Deemter. 2012. Computational generation of referring expressions: A survey. Computational Linguistics, 38(1):173–218.

Ruiyu Li, Kaican Li, Yi-Chun Kuo, Michelle Shu, Xiaojuan Qi, Xiaoyong Shen, and Jiaya Jia. 2018. Referring image segmentation via recurrent refinement networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5745–5753.

Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.

Changsong Liu, Rui Fang, and Joyce Y Chai. 2012. Towards mediating shared perceptual basis in situated dialogue. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 140–149. Association for Computational Linguistics.

Changsong Liu, Rui Fang, Lanbo She, and Joyce Chai. 2013. Modeling collaborative referring for situated referential grounding. In Proceedings of the SIGDIAL 2013 Conference, pages 78–86.
Runtao Liu, Chenxi Liu, Yutong Bai, and Alan L Yuille. 2019. Clevr-ref+: Diagnosing visual reasoning with referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4185–4194.

Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. 2016. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer.

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20.

Dipendra Misra, Andrew Bennett, Valts Blukis, Eyvind Niklasson, Max Shatkhin, and Yoav Artzi. 2018. Mapping instructions to actions in 3d environments with visual goal prediction. arXiv preprint arXiv:1809.00786.

Reinhard Moratz, Thora Tenbrink, John Bateman, and Kerstin Fischer. 2002. Spatial knowledge representation for human-robot interaction. In International Conference on Spatial Cognition, pages 263–286. Springer.

Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pages 807–814.

Wei Pang and Xiaojie Wang. 2020. Guessing state tracking for visual dialogue. In The European Conference on Computer Vision (ECCV).

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035.

Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. 2018. Film: Visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence, pages 3942–3951.

Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641–2649.

Joseph Redmon and Ali Farhadi. 2018. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer.

Hozumi Tanaka, Takenobu Tokunaga, and Yusuke Shinyama. 2004. Animated agents capable of understanding natural language and performing actions. In Life-Like Characters, pages 429–443. Springer.

Thora Tenbrink and Reinhard Moratz. 2003. Group-based spatial reference in linguistic human-robot interaction. In Proceedings of EuroCogSci, volume 3, pages 325–330.

Takuma Udagawa and Akiko Aizawa. 2019. A natural language corpus of common grounding under continuous and partially-observable context. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7120–7127.

Takuma Udagawa and Akiko Aizawa. 2020. An annotated corpus of reference resolution for interpreting common grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9081–9089.

Tianhao Yang, Zheng-Jun Zha, and Hanwang Zhang. 2019a. Making history matter: History-advantage sequence training for visual dialog. In Proceedings of the IEEE International Conference on Computer Vision, pages 2561–2569.

Zhengyuan Yang, Boqing Gong, Liwei Wang, Wenbing Huang, Dong Yu, and Jiebo Luo. 2019b. A fast and accurate one-stage approach to visual grounding. In Proceedings of the IEEE International Conference on Computer Vision, pages 4683–4693.

Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. 2016. Modeling context in referring expressions. In European Conference on Computer Vision, pages 69–85. Springer.

Fang Zhao, Jianshu Li, Jian Zhao, and Jiashi Feng. 2018. Weakly supervised phrase localization with multi-scale anchored transformer network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5696–5705.