Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning


Jia Cheng Hu (FIM Dept., Univ. of Modena and Reggio Emilia, Modena, Italy)
Roberto Cavicchioli (DCE Dept., Univ. of Modena and Reggio Emilia, Reggio Emilia, Italy)
Alessandro Capotondi (FIM Dept., Univ. of Modena and Reggio Emilia, Modena, Italy)

arXiv:2208.06551v4 [cs.CV] 19 Jan 2024

Abstract—We introduce a method called the Expansion mechanism that processes the input unconstrained by the number of elements in the sequence. By doing so, the model can learn more effectively compared to traditional attention-based approaches. To support this claim, we design a novel architecture, ExpansionNet v2, which achieved strong results on the MS COCO 2014 Image Captioning challenge and the State of the Art in its respective category, with a score of 143.7 CIDEr-D on the offline test split, 140.8 CIDEr-D on the online evaluation server and 72.9 AllCIDEr on the nocaps validation set. Additionally, we introduce an End to End training algorithm up to 2.8 times faster than established alternatives.
Index Terms—Captioning, COCO, Sequence, Expansion

I. INTRODUCTION

Image Captioning consists of the problem of describing images without human intervention. It is a challenging multi-modal task that requires both language comprehension and visual understanding. Early approaches relied on statistical and graph-based methods [1], [2], but since the advent of Neural Networks most Image Captioning systems adopted an encoder and decoder structure [3]–[5]. The first component is responsible for extracting visual features from the image, whereas the latter serves the purpose of generating the description. Early works [3]–[5] relied on Convolutional Neural Network (CNN) backbones [6] combined with Recurrent Neural Networks (RNNs) [7], [8] to further refine the visual inputs and for text generation. In contrast, modern Image Captioning systems adopt Attention-based [9], [10] architectures for the sequence modelling part and, in recent works [11]–[13], also during the image feature extraction.

Currently, fully attentive models are the de facto standard architecture in many NLP and Vision research fields, and their ubiquity led to many refinements and improvements of the formulation across multiple fields [11], [12], [14]–[20]. However, one of the purposes of the development of the Attention mechanism [9], [21] was to spread the input sequence content along the whole collection of the encoder's hidden vectors instead of one single state, overcoming a significant performance bottleneck in RNNs. To do so, as the name suggests, the Attention mechanism enhances the values of a few elements and inhibits the others by means of the Softmax function. Recently, many studies [22]–[27] deepened the understanding of the attention approach and suggested that there is little difference between it and alternative solutions such as Gaussian distributions [25], MLPs [27] and the Fourier Transform [26], and that the effectiveness of these methods depends mainly on their capability to form high-quality compositions out of the input.

Motivated by these observations, our work investigates the possibility that the fixed number of elements provided by the input (the sequence length) represents a performance bottleneck for stateless architectures and limits their potential to form higher-quality compositions, in the particular field of Image Captioning. To this end, we propose the Expansion mechanism, a method that distributes and processes the sequence content using an increased or arbitrary number of elements and retrieves the original length back in the complementary backward operation. We then introduce ExpansionNet v2 (depicted in Fig. 1), which to our knowledge is the first model that learns to exploit arbitrary sequence lengths in Image Captioning and achieves very competitive results without relying on the Attention's characteristic function.

Fig. 1: The expansion mechanism distributes the input data into another one featuring a different sequence length during the forward phase and performs the reverse operation in the backward pass. In this way, the network is enabled to process the sequence unconstrained by the number of elements.

This work has received funding from the European Union's Horizon 2020 programme dAIEDGE (G.A. No 101120726).

The overall contributions of this work are the following:
(i) we introduce a new method called the Expansion Mechanism that distributes the input content over an arbitrary or increased number of elements during the forward step, and retrieves the original length back in the complementary backward operation. To support both bidirectional and auto-regressive processing, we introduce two methods, called Static Expansion and Dynamic Expansion. The efficiency aspect is addressed in their design and, as a result, the computational impact is negligible for small configurations; (ii) with the aforementioned methods, we design a novel architecture called ExpansionNet v2 that achieves strong results on MS-COCO 2014, outperforming similar models trained on the same dataset; (iii) given the positive results of our architecture, we find out that traditional architectures in Image Captioning are indeed penalized by the fixed number of elements provided by the input; (iv) in contrast to the general trend, our model achieves strong results despite the removal of the Attention in most components. Finally, we also propose a fast End-to-End training strategy that significantly lowers the training cost of our model compared to popular approaches.

II. RELATED WORKS

Image Captioning models benefited greatly from Deep Learning methods. From hand-crafted sentences combined with object detection [28], [29], modern systems consist of a neural encoder that extracts meaningful visual representations from the image and a decoder responsible for the description generation. In the early formulations, the decoder consisted of RNNs [7], [8], whereas the encoder consisted of a convolutional backbone [3], [4] that represented the entire image with a single feature vector. It was later replaced by an object detector [5] that extracted a collection of salient regions of the image. This enabled the adoption of sequence modelling architectures in both encoding and decoding [3]–[5], [30] on top of the backbones. Most modern Image Captioning systems are currently based on the Transformer architecture [10] and many works focused on improving its formulation or structure [14], [17], [19], [31]–[34]. For example, the work of [17] introduced geometrical awareness in the Self-Attention formulation. [31] modified the attentive layer with a gate that served the purpose of mitigating the contribution of irrelevant queries. [14] exploited bilinear pooling to enable a higher order of interactions across the input elements. Other works such as [12], [18], [35] focused on structural changes and on exploiting the visual input more effectively. Overall, all these methods follow the main components of the formulas introduced in [9], [10], [21]. Our Expansion mechanism is based on the adoption of embedding vectors. The effectiveness of integrating additional learnable parameters in the sequence was observed first in [33] in Machine Translation. Later, in Image Captioning, the concept was also deployed by [36] and [37]. In contrast to these works, our method is the only one that distributes the input into an arbitrary number of hidden vectors.

Another trend consists of pre-training the model with a huge amount of training data and fine-tuning over the Image Captioning task [16], [38]–[40]. In particular, OFA [38] and GIT [16] currently represent the State-of-the-Art Image Captioning systems and outperform non-generative models by a significant margin. However, their model size poses an obstacle to deployment in memory-limited devices, and their training data are tens and hundreds of times bigger than the popular MS-COCO 2014 [41]. For this reason, these works are considered orthogonal to ours, which can instead be integrated with them to potentially achieve better performances. In general, we only consider works that are trained exclusively on MS-COCO 2014; for this reason, the works of [12], [15], [16], [39], [40] are omitted during evaluation since our model does not leverage additional data.

III. METHOD

A. Static and Dynamic Expansion

The Expansion mechanism is broken down into several steps. First, it distributes the sequence content into an arbitrary or increased number of elements (Section III-A1) using a "Forward Expansion", which is described in Section III-A2 and allows the network to process the sequence unconstrained by the fixed input length. Then, it retrieves the original length using the complementary operation "Backward Expansion", described in Section III-A3. Depending on the operations, we define two implementations of the idea: Static Expansion and Dynamic Expansion. The latter is designed to support both auto-regressive and bidirectional processing, in contrast to the first, which only supports the bidirectional case.

1) Expansion coefficient: In both Static and Dynamic Expansion, the expansion coefficient N_E defines a collection of learnable parameters E_Q, E_B ∈ R^{N_E × d_m}. However, in the Static Expansion, N_E defines exactly the size of the expanded sequences regardless of the input length L. In particular, the expansion queries Q_E and biases B_E equal E_Q and E_B respectively. In contrast, in the Dynamic Expansion, the expanded sequence is of size N_E · L, and the expansion queries Q_E and biases B_E are calculated with the BroadSum operator, defined in the two cases as:

    Q_E = (C^⊤ H_E)^⊤ + (E_Q^⊤ I_E)^⊤
    B_E = (C^⊤ H_E)^⊤ + (E_B^⊤ I_E)^⊤        (1)

where C ∈ R^{L × d_m} denotes a linear projection of the input and H_E ∈ R^{L × (L · N_E)} is the block matrix:

    H_E = | 1  0  …  0 |
          | 0  1  …  0 |
          | …  …  …  … |
          | 0  0  …  1 | ,    1, 0 ∈ R^{1 × N_E}

whereas I_E ∈ R^{N_E × (L · N_E)} is defined by the column-wise concatenation of L identity matrices of size N_E × N_E:

    I_E = [ I_L  I_L  …  I_L ] ,    I_L ∈ R^{N_E × N_E}

An example of the input and output of the BroadSum operation is depicted in the bottom left of Figure 2, where the bias vectors are omitted for simplicity.
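As a concrete illustration of the BroadSum operator of Eq. (1) in the Dynamic Expansion case, note that multiplying by H_E simply repeats each row of C exactly N_E times, while multiplying by I_E tiles the expansion parameters L times, so neither matrix needs to be materialized. The following minimal NumPy sketch reflects this reading of the formula; the function and variable names are ours and not taken from the official implementation:

    import numpy as np

    def broad_sum(C, E, n_e):
        """BroadSum of Eq. (1): combine an L x d_m input projection C with
        N_E x d_m learnable parameters E into an (L * N_E) x d_m sequence.
        (C^T H_E)^T repeats each row of C n_e times; (E^T I_E)^T tiles E."""
        L, d_m = C.shape
        assert E.shape == (n_e, d_m)
        return np.repeat(C, n_e, axis=0) + np.tile(E, (L, 1))

    # Toy example mirroring Fig. 2: L = 3 input vectors and N_E = 3 give a
    # Dynamic Expansion sequence of length L * N_E = 9.
    rng = np.random.default_rng(0)
    C = rng.normal(size=(3, 4))        # linear projection of the input
    E_Q = rng.normal(size=(3, 4))      # learnable expansion queries E_Q
    Q_E = broad_sum(C, E_Q, n_e=3)
    print(Q_E.shape)                   # (9, 4)

In the Static Expansion case no BroadSum is required, since Q_E and B_E coincide with E_Q and E_B.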
Fig. 2: Static Expansion and Auto-regressive Dynamic Expansion scheme and example. Assuming an input length of L = 3.
In the Static Expansion setting, an expansion coefficient of NE = 5 leads to an expanded sequence of length 5. In contrast, in
the Dynamic Expansion, an expansion coefficient of NE = 3 generates an expanded sequence of L · NE = 9. For the sake of
simplicity, the double operation stream, the expansion biases and the gated result combination are omitted in the illustration. The
difference between the Auto-regressive Dynamic Expansion and the bidirectional one lies in the Masked Matrix Multiplication.

2) Forward Expansion: The forward expansion generates the expanded sequences and involves three linear projections of the input, denoted as K, V_1, V_2 ∈ R^{L × d_m}. First of all, the "Length Transformation Matrix", denoted as M, is computed as the dot-product similarity between K and the expansion queries Q_E:

    M = (Q_E K^⊤) / √(d_m)        (2)

The result is fed into the following operations:

    R_i^fw = Ψ(ReLU((−1)^i M), ϵ) ,   i ∈ {1, 2}        (3)

where Ψ : (X, ϵ) → Y, with X, Y ∈ R^{N_1 × N_2} and ϵ ∈ R^+ \ {0}, is the row-wise normalization function defined as:

    Ψ(X, ϵ)_ij = x_ij / ( Σ_{z=1}^{N_2} x_iz + ϵ )        (4)

The coefficient ϵ ensures the feasibility of the operation. Then, the expanded sequences are calculated as follows:

    F_i^fw = R_i^fw V_i + B_E ,   i ∈ {1, 2}        (5)

3) Backward Expansion: In the backward step, the original sequence length is retrieved by transposing the Length Transformation Matrix of Equation 2 and applying the same operations of Equation 3:

    R_i^bw = Ψ(ReLU((−1)^i M^⊤), ϵ) ,   i ∈ {1, 2}        (6)

This time, the matrices R_i^bw are multiplied with the expanded sequences of Equation 5:

    B_i^bw = R_i^bw F_i^fw ,   i ∈ {1, 2}        (7)

Finally, the results B_1^bw and B_2^bw are combined by means of a sigmoid gate:

    out = σ(S) ⊙ B_1^bw + (1 − σ(S)) ⊙ B_2^bw        (8)

where S ∈ R^L is a linear projection of the input.

The backward operation completes the operations performed in the Static and Dynamic Expansion. It can be noted that all operations in the forward (3), (5) and backward (6), (7) expansion are duplicated in two operation streams for i = 1 and i = 2, differing mainly in the sign used in the computation of the Length Transformation Matrix in (2). This decision was made to mitigate the remote possibility of the matrix being populated only by zeros. It does not affect the results compared to a single path, but slightly increases the computational cost.

In the case of the Dynamic Expansion, masking is applied when calculating the results in (5) and (7) to preserve the auto-regressive property. The operating principle of the Static and Dynamic Expansions is illustrated in Fig. 2, which, for simplicity, depicts only a single operation stream and omits the biases and the output sigmoid gates.
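The following sketch shows how Equations (2)-(8) compose in the Static Expansion case, with the learnable projections replaced by plain matrices; it is a minimal single-head illustration under our own naming, not the official implementation, and the value of ϵ is an assumption:

    import numpy as np

    def row_norm(X, eps=1e-9):
        """Psi of Eq. (4): row-wise normalization; the eps value is assumed."""
        return X / (X.sum(axis=-1, keepdims=True) + eps)

    def static_expansion(K, V1, V2, S, Q_E, B_E):
        """Static Expansion following Eqs. (2)-(8).
        K, V1, V2: L x d_m projections of the input; S: length-L gate
        projection; Q_E, B_E: N_E x d_m expansion queries and biases."""
        d_m = K.shape[-1]
        M = Q_E @ K.T / np.sqrt(d_m)                      # Eq. (2), N_E x L
        streams = []
        for i in (1, 2):                                  # two operation streams
            sign = (-1.0) ** i
            R_fw = row_norm(np.maximum(sign * M, 0.0))    # Eq. (3)
            F_fw = R_fw @ (V1 if i == 1 else V2) + B_E    # Eq. (5), N_E x d_m
            R_bw = row_norm(np.maximum(sign * M.T, 0.0))  # Eq. (6), L x N_E
            streams.append(R_bw @ F_fw)                   # Eq. (7), L x d_m
        gate = 1.0 / (1.0 + np.exp(-S))[:, None]          # sigma(S)
        return gate * streams[0] + (1.0 - gate) * streams[1]   # Eq. (8)

In the Dynamic Expansion, Q_E and B_E would instead be produced by the BroadSum operator of Eq. (1), and a mask would be applied in Eqs. (5) and (7) to preserve the auto-regressive property.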
Fig. 3: ExpansionNet v2 architecture.

4) Block Static Expansion: To increase the effectiveness of the Static Expansion, we perform the Forward and Backward operations on a collection of target lengths instead of one. We call this operation Block Static Expansion. From a formulation perspective, all operations are repeated over a group of expansion coefficients G = {N_E^1, N_E^2, …, N_E^{N_G}} and can be implemented in such a way that both forward and backward steps are performed over all targets at the same time. All expansion group queries and biases can be combined into a single one:

    E_Q^G = {(E_Q^1)^⊤, (E_Q^2)^⊤, …, (E_Q^{N_G})^⊤}^⊤
    E_B^G = {(E_B^1)^⊤, (E_B^2)^⊤, …, (E_B^{N_G})^⊤}^⊤        (9)

and the computational efficiency of the previous formulation can be preserved. During the backward stage, the length transformation matrix is scaled by the inverse of the number of elements in the group G.

B. Architecture

Our model consists of the standard encoder-decoder structure implemented on top of the Swin-Transformer, whose details are provided in [13]. The image A is first fed into the backbone:

    X_0 = Swin-Transf(A)        (10)

which generates the initial set of processed visual features X_0 = {x_1^0, x_2^0, …, x_N^0}, x_i^0 ∈ R^{d_m}. The result is fed into the encoder, which is made of N_enc Static Expansion → FeedForward blocks. Here skip connections and pre-layer normalization [42] are adopted, and the following formulas describe each encoder layer for n ∈ {1, …, N_enc}:

    E_n = X_{n−1} + StaticExp_n(Norm_n^SE(X_{n−1}))
    X_n = E_n + FF_n(Norm_n^FF(E_n))        (11)

Similarly, given a generic input sequence Y_0 = {y_1^0, y_2^0, …, y_M^0}, y_i^0 ∈ R^{d_m} (at the training stage, so we can omit the time axis), the decoder is made of N_dec Dynamic Expansion → Cross-Attention → FeedForward blocks, where skip connection and normalization are applied on each component. Each decoder layer is described by the following equations:

    B_n = Y_{n−1} + DynamicExp_n(Norm_n^DE(Y_{n−1}))
    W_n = B_n + Attention_n(Norm_n^CA(B_n), X_{N_enc})        (12)
    Y_n = W_n + FF_n(Norm_n^FF(W_n))

All layers are summed through a linear projection and the final output is fed to the classification layer. Fig. 3 depicts the main structure.

C. Training objectives

The model is first pre-trained using the Cross-Entropy loss L_XE:

    L_XE(θ) = − Σ_{t=1}^{T} log p_θ(y_t^* | y_{1:t−1}^*, I)        (13)

where p_θ(y_t^* | y_{1:t−1}^*, I) is the probability assigned by the model parameters θ to the target word y_t^* given the image I and the previous words y_{1:t−1}^*. Additionally, the CIDEr-D score is optimized using SCST [43], which minimizes the negative expected reward L_R(θ) = −E_{y_{1:T} ∼ p_θ}[r(y_{1:T})], whose gradient can be approximated as follows:

    ∇_θ L_R(θ) ≈ −( r(y_{1:T}^s) − b ) ∇_θ log p_θ(y_{1:T}^s)        (14)

where b is the baseline computed according to [44] and r(y_{1:T}^s) is the CIDEr-D reward assigned to the sampled sequence y_{1:T}^s. Although we optimize the model on two loss functions, for each one of them the training stage is efficiently split into two additional steps to allow a broader range of computational resources to reproduce this work.
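As an illustration of Eq. (14), the sketch below computes the self-critical loss for the captions sampled for one image, using as baseline b the mean reward of the other sampled captions in the spirit of [44]; tensor names and shapes are our own assumptions, not the official code:

    import torch

    def scst_loss(log_probs, rewards):
        """Self-critical loss whose gradient matches Eq. (14).
        log_probs: (n,) summed log-probabilities of n sampled captions of
        one image; rewards: (n,) their CIDEr-D rewards r(y^s_{1:T})."""
        n = rewards.numel()
        baseline = (rewards.sum() - rewards) / max(n - 1, 1)   # b, per sample
        advantage = (rewards - baseline).detach()              # r(y^s) - b
        return -(advantage * log_probs).mean()

    # toy usage with five sampled captions for a single image
    log_probs = torch.randn(5, requires_grad=True)
    rewards = torch.rand(5)
    scst_loss(log_probs, rewards).backward()

Minimizing this quantity with respect to θ follows the approximated policy gradient of Eq. (14), while the Cross-Entropy phase of Eq. (13) reduces to the usual teacher-forced negative log-likelihood.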
TABLE I: Ablation study in the first stage of Cross-Entropy training using beam size 3 over the Karpathy validation split.
B=BLEU. M=METEOR. R=ROUGE. C=CIDEr-D. S=SPICE.
Encoder Decoder B1 B2 B3 B4 M R C S
Baseline Baseline 75.3 59.2 45.4 34.6 28.4 57.0 115.8 21.6
Stc. Exp. G={16} Baseline 76.4 60.6 46.6 35.5 28.6 57.2 117.8 21.9
Stc. Exp. G={32} Baseline 75.9 59.9 46.1 35.2 28.9 57.1 117.9 22.3
Stc. Exp. G={64} Baseline 76.3 60.4 46.4 35.5 28.8 57.1 117.7 22.0
Baseline Dyn. Exp.NE =4 77.2 61.4 47.4 36.2 28.9 57.7 119.7 22.3
Baseline Dyn. Exp.NE =8 76.9 61.5 47.9 37.1 29.1 57.8 120.8 22.3
Baseline Dyn. Exp.NE =16 76.7 61.4 47.8 36.8 29.0 57.7 121.2 22.2
Stc. Exp. G={64} Dyn. Exp.NE =16 77.4 61.9 48.2 37.3 29.2 58.0 122.2 22.3
Stc. Exp. G={128, 128, 128, 128, 128} Dyn. Exp.NE =16 77.8 62.3 48.3 37.2 29.3 58.3 122.8 22.5
Stc. Exp. G={256, 256, 256, 256, 256} Dyn. Exp.NE =16 77.4 62.0 48.2 37.2 29.2 58.0 122.5 22.2
Stc. Exp. G={512, 512, 512, 512, 512} Dyn. Exp.NE =16 77.3 61.7 47.9 37.0 29.3 58.0 122.7 22.4
Stc. Exp. G={32, 64, 128, 256, 512} Dyn. Exp.NE =16 77.6 62.0 48.2 37.2 29.4 58.1 123.5 22.5

IV. RESULTS

A. Experimental Setup

1) Dataset: The training dataset consists of the popular MS-COCO benchmark [41] split according to [45], resulting in 113287 image-description pairs for training, 5000 for the validation set, and 5000 for the test set. Each reference caption is pre-processed by a simple pipeline consisting of lowercasing, removing punctuation, and filtering out words that do not occur at least 5 times (vocabulary of size 10000). Additionally, the final model is evaluated over the Novel Object Captioning at Scale (nocaps) validation set [46], which consists of three classes of images called in-domain, near-domain, and out-domain, according to the familiarity of their classes with respect to those contained in the training set. This dataset is subject to the same pre-processing as MS-COCO and serves the purpose of further challenging the model in unfavourable conditions.

2) Model details: Two models are implemented for the experimental setup: the baseline, which is the Base Transformer, and our main model, referred to as "ExpansionNet v2", implemented with the following configuration: d_m=512, d_ff=2048, N_enc=N_dec=3. In the latter, the Dynamic expansion coefficient is set to 16, and the Static expansion coefficients consist of G={32, 64, 128, 256, 512} (more details in Section IV-B). Each one relies on top of the same backbone, the Swin-Transformer in the Large configuration [13] pre-trained on ImageNet [47]. All images are subject to a minimal pre-processing: first, they are resized into a 3×384×384 tensor, then RGB values are converted into the [0, 1] range and further normalized using mean=(0.485, 0.456, 0.406) and std=(0.229, 0.224, 0.225). The source code of the experiments is available at https://github.com/jchenghu/ExpansionNet_v2.

3) Training algorithm: It can be observed that the Swin-Transformer backbone is the most computationally expensive part of the system. For this reason, inspired by [48], to enable the End to End training step on a broader number of computational architectures, our training is divided into four steps: each phase (in both the cross-entropy training and the reinforcement stage) consists of an initial training in which the backbone's weights are frozen and a fine-tuning step during which gradients flow throughout the whole system:

Step A) Cross-Entropy – Frozen backbone. The model is trained using batch size 48, an initial learning rate of 2e-4 and a warmup of 10000, annealed by 0.8 every 2 epochs for 8 epochs;
Step B) Cross-Entropy – End to End. The whole system is trained for 2 additional epochs, using batch size 48 and an initial learning rate of 3e-5 annealed by 0.55 every epoch;
Step C) CIDEr-D optimization – Frozen backbone. The reinforcement phase adopts a batch size of 48 and an initial learning rate of 1e-4, no warmup, annealed by 0.8 every epoch for 9 epochs;
Step D) CIDEr-D optimization – End to End. The whole system is fine-tuned for a few more iterations, up to one additional epoch, using a batch size of 20 and a fixed learning rate of 2e-6. This step is optional since it only slightly contributes to the final performances and can be skipped if no improvements are observed. All CIDEr-D optimization steps are implemented according to the Standard configuration (SacreEOS signature [49]: STANDARD wInit+Cider-D[n4,s6.0]+average[nspi5]+1.0.0).

Despite its apparent complexity, this schedule is much more computationally friendly than the standard method consisting of a small batch size of 10 for 30 epochs for both optimization steps. As a matter of fact, only a much smaller number of training epochs is dedicated to fine-tuning the whole system. Thus, the time required for the calculation of the backbone's gradient is often avoided and the time required for forward operations can be drastically reduced as well. In particular, in our implementation, during steps A and C the backbone's forward pass is performed only once for each image in the data set; therefore, its cost is replaced by a memory read and copy. All steps are trained using the RAdam optimizer [53] (β1 = 0.9, β2 = 0.98).
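The saving of the frozen-backbone steps A and C comes from the fact that the Swin backbone only needs a single forward pass per image, after which its features can be cached and re-read at every epoch. A minimal sketch of this idea follows; the data-loader format and function names are assumptions of ours rather than the official code:

    import torch

    @torch.no_grad()
    def cache_backbone_features(backbone, dataloader):
        """Run the frozen backbone once per image (X_0 = Swin-Transf(A),
        Eq. (10)) and store the visual features, so that later epochs replace
        the backbone forward pass with a memory read and copy."""
        backbone.eval()
        cache = {}
        for image_ids, images in dataloader:      # assumed loader format
            features = backbone(images)
            for img_id, feats in zip(image_ids, features):
                cache[img_id] = feats.cpu()
        return cache

    # All steps use RAdam [53], e.g. for step A:
    # optimizer = torch.optim.RAdam(model.parameters(), lr=2e-4,
    #                               betas=(0.9, 0.98))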
TABLE II: Offline comparison of State-of-the-Art single models over the Karpathy test split. B=BLEU. M=METEOR.
R=ROUGE. C=CIDEr-D. S=SPICE.
Cross-Entropy CIDEr-D optimization
Model B1 B4 M R C S B1 B4 M R C S
Up-Down [5] 77.2 36.2 27.0 56.4 113.5 20.3 79.8 36.3 27.7 56.9 120.1 21.4
GCN-LSTM [50] 77.3 36.8 27.9 57.0 116.3 20.9 80.5 38.2 28.5 58.3 127.6 22.0
SGAE [51] - - - - - - 80.8 38.4 28.4 58.6 127.8 22.1
AoANet [31] 77.4 37.2 28.4 57.5 119.8 21.3 80.2 38.9 29.2 58.8 129.8 22.4
X-Transformer [14] 77.3 37.0 28.7 57.5 120.0 21.8 80.9 39.7 29.5 59.1 132.8 23.4
GET [35] - - - - - - 81.5 39.5 29.3 58.9 131.6 22.8
DLCT [18] - - - - - - 81.4 39.8 29.5 59.1 133.8 23.0
RSTNet [52] - - - - - - 81.8 40.1 29.8 59.5 135.6 23.3
PureT [11] - - - - - - 82.1 40.9 30.2 60.1 138.2 24.2
ExpansionNet v2 78.1 38.1 30.1 58.9 128.2 23.5 82.8 41.5 30.3 60.5 140.4 24.5

TABLE III: Offline comparison of State-of-the-Art ensemble models over the Karpathy test split. B=BLEU. M=METEOR. R=ROUGE. C=CIDEr-D. S=SPICE.
Cross-Entropy CIDEr-D optimization
Model B1 B4 M R C S B1 B4 M R C S
GCN-LSTM [50] 77.4 37.1 28.1 57.2 117.1 21.1 80.9 38.3 28.6 58.5 128.7 22.1
SGAE [51] - - - - - - 81.0 39.0 28.4 58.9 129.1 22.2
AoANet [31] 78.7 38.1 28.5 58.2 122.7 21.7 81.6 40.2 29.3 59.4 132.0 22.8
X-Transformer [14] 77.8 37.7 29.0 58.0 122.1 21.9 81.7 40.7 29.9 59.7 135.3 23.8
GET [35] - - - - - - 82.1 40.6 29.8 59.6 135.1 23.8
DLCT [18] - - - - - - 82.2 40.8 29.9 59.8 137.5 23.3
PureT [11] - - - - - - 83.4 42.1 30.4 60.8 141.0 24.3
ExpansionNet v2 78.5 38.5 29.9 58.8 128.7 23.6 83.5 42.7 30.6 61.1 143.7 24.7

B. Ablation Study

To study the effectiveness of our method we replace the encoder and decoder in the baseline with our methods and evaluate several settings of expansion coefficients. It can be observed from Table I that the impact of the static expansion layer in the single-group configuration is limited. In fact, it only slightly improves the baseline, regardless of the choice of N_E. Conversely, the dynamic expansion layer showcases a more significant improvement, obtaining the best result for N_E = 16. When the two expansion methods are combined, the model outperforms the baseline across all metrics with a margin of at least 6.0 CIDEr-D, 2.0 BLEU, 0.5 SPICE, 1.0 ROUGE, and 0.8 METEOR. Analyzing several configurations of length groups in the static expansion, it appears that introducing more expansion vectors does not necessarily lead to better performances, since for G = {128, 128, 128, 128, 128}, G = {256, 256, 256, 256, 256} and G = {512, 512, 512, 512, 512} the model yields similar results. However, the model seems to benefit from a diverse selection of coefficients, such as G = {32, 64, 128, 256, 512}, which will be adopted in the remaining experiments. Ultimately, all instances outperform the baseline across all metrics.

C. Performance Comparison

1) COCO Offline Evaluation: Table II and Table III report the score comparison between ExpansionNet v2 and the best-performing models of recent years. Up-Down [5] introduced the idea of extracting a collection of features from the images using an object detector like Faster-RCNN [6] in contrast to a classification backbone [4]. The idea was adopted in most of the following architectures as well, for instance in the case of GCN-LSTM [50] and SGAE [51], which additionally implemented a convolutional graph network on top of it to exploit the information provided by a scene graph. AoANet [31] adopted the Transformer and improved the attentive components with two gates serving the purpose of simulating an additional level of attention over the inputs, and augmented the language modelling part with an LSTM. On the other hand, X-Transformer [14] adopted a fully attentive architecture and further refined the attentive blocks by means of bilinear pooling techniques. The most recent and best-performing architectures focused on finding more effective ways to feed visual information into the sequence modelling network. For instance, RSTNet [52] showcased the effectiveness of grid features over regions, GET [35] processed the images using a global representation in conjunction with the local ones, and DLCT [18] exploited the advantages of both region and grid visual features. Finally, PureT [11] implemented the first end-to-end Transformer architecture, applying the Window / Shifted-Window MHA [13] in both the encoder and decoder. ExpansionNet v2 outperforms PureT by a margin of 0.7 BLEU1, 0.6 BLEU4, 0.1 METEOR, 0.4 ROUGE, 2.2 CIDEr-D and 0.3 SPICE in the single model case and by 0.1 BLEU1, 0.6 BLEU4, 0.3 ROUGE, 2.7 CIDEr-D and 0.4 SPICE in the ensemble configuration.

2) COCO Online Evaluation: We evaluate ExpansionNet v2 using the ensemble configuration and adopting the standard Beam Search (beam size 5) over the official testing set of 40775 images, submitting the predictions to the online testing server. Results are reported in Table IV; c5 and c40 represent the scores computed with 5 and 40 reference captions (unknown to the user), respectively.
TABLE IV: Online server results on the MS-COCO 2014 test set, whose ground truth is unknown. B=BLEU. M=METEOR.
R=ROUGE. C=CIDEr-D. S=SPICE.
B1 B2 B3 B4 METEOR ROUGE-L CIDEr-D
Model c5 c40 c5 c40 c5 c40 c5 c40 c5 c40 c5 c40 c5 c40
Up-Down [5] 80.2 95.2 64.1 88.8 49.1 79.4 36.9 68.5 27.6 36.7 57.1 72.4 117.9 120.5
GCN-LSTM [50] - - 65.5 89.3 50.8 80.3 38.7 69.7 28.5 37.6 58.5 73.4 125.3 126.5
SGAE [51] 81.0 95.3 65.6 89.5 50.7 80.4 38.5 69.7 28.2 37.2 58.6 73.6 123.8 126.5
AoANet [31] 81.0 95.0 65.8 89.6 51.4 81.3 39.4 71.2 29.1 38.5 58.9 74.5 126.9 129.6
X-Transformer [14] 81.9 95.7 66.9 90.5 52.4 82.5 40.3 72.4 29.6 39.2 59.5 75.0 131.1 133.5
RSTNet [52] 82.1 96.4 67.0 91.3 52.2 83.0 40.0 73.1 29.6 39.1 59.5 74.6 131.9 134.0
GET [35] 81.6 96.1 66.5 90.9 51.9 82.8 39.7 72.9 29.4 38.8 59.1 74.4 130.3 132.5
DLCT [18] 82.4 96.6 67.4 91.7 52.8 83.8 40.6 74.0 29.8 39.6 59.8 75.3 133.3 135.4
PureT [11] 82.8 96.5 68.1 91.8 53.6 83.9 41.4 74.1 30.1 39.9 60.4 75.9 136.0 138.3
OFA [15] 84.5 98.1 70.1 94.4 55.9 87.8 43.6 78.7 32.1 42.7 62.5 79.0 147.2 149.6
GIT [16] 84.3 98.1 70.0 94.4 55.7 87.6 43.2 78.3 31.9 42.1 62.0 78.4 146.4 149.8
ExpansionNet v2 83.3 96.9 68.8 92.6 54.4 85.0 42.1 75.3 30.4 40.1 60.8 76.4 138.5 140.8

Our model achieves State-of-the-Art performance (as of 2 July 2022) among non-generative models trained on MS-COCO 2014, outperforming the previous best [11] by a margin of 1.2 BLEU4 (c40), 0.2 METEOR (c40), 0.5 ROUGE-L (c40) and 2.5 CIDEr-D in both the c5 and c40 instances. However, it is ultimately outperformed, by a significant margin, by generative models [15], [16], which we consider orthogonal to our work since they focus more on training method and data quality rather than architecture design.

3) Nocaps Evaluation: We evaluate ExpansionNet v2 over the nocaps validation set. In particular, we adopt a single model trained exclusively on the Cross-Entropy Loss, using no additional pre-training data sets. The predictions are generated by the standard Beam Search algorithm (beam size 3), in contrast to CBS [54]. A limited comparison is reported in Table V, which showcases that our model achieves very competitive results among the architectures trained in similar configurations, with an overall lead of 17.6 CIDEr and 1.4 SPICE over the Up-Down model [5]. It is still ultimately outperformed by recent V+L pre-training-based works such as [39], [40], [56].

TABLE V: Performances on the nocaps validation set. C and S denote the CIDEr-D and SPICE scores respectively.
Domain  Metric  Enc-Dec [55]  Up-Down [5]  Ours
In      C       72.8          78.1         83.8
In      S       11.1          11.6         12.6
Near    C       57.1          57.7         79.2
Near    S       10.2          10.3         12.4
Out     C       34.1          31.3         54.0
Out     S       8.3           8.3          9.3
All     C       54.7          55.3         72.9
All     S       10.0          10.1         11.4

D. Training and Inference Cost

The efficiency aspect was addressed in the design of the Expansion mechanism. For instance, it can be observed from Table VI that doubling the expansion coefficients does not lead to double the FLOPS, which would be the case when actually doubling the input sequence length. In particular, for small parameters, our model is comparable to the Transformer in terms of computational cost. In contrast, ExpansionNet v2 is 1.63× slower than the baseline because of its abundant selection of expansion coefficients.

TABLE VI: Inference cost comparison of ablation models on the MS-COCO 2014 validation set (5000 images).
Encoder                Decoder             FLOPS
Baseline               Baseline            9.28 × 10^12
Stc. Exp. G={16}       Baseline            9.62 × 10^12
Stc. Exp. G={32}       Baseline            9.70 × 10^12
Stc. Exp. G={64}       Baseline            9.88 × 10^12
Baseline               Dyn. Exp. NE=4      9.40 × 10^12
Baseline               Dyn. Exp. NE=8      9.43 × 10^12
Baseline               Dyn. Exp. NE=16     9.48 × 10^12
Stc. Exp. G={64}       Dyn. Exp. NE=16     10.08 × 10^12
Stc. Exp. G={128}×5    Dyn. Exp. NE=16     13.26 × 10^12
Stc. Exp. G={256}×5    Dyn. Exp. NE=16     16.80 × 10^12
Stc. Exp. G={512}×5    Dyn. Exp. NE=16     23.88 × 10^12
ExpansionNet v2        ExpansionNet v2     15.21 × 10^12

In Table VII, we compare our training time with the ones presented by other works. In particular, time entries are estimated assuming all model computational costs are the same as ExpansionNet v2's, which is a generous approximation compared to generative models whose sizes are tens of times larger. Despite such a premise, and the fact that we also perform end-to-end training, it can be seen that our model can be trained up to 2.8× faster than other non-generative models and up to 46.8× faster than generative ones. Recalling the results in Table IV, performance-wise, our model achieves 93.9% of the performance of the State-of-the-Art model GIT [16] but uses 7080× less data and is 129× smaller.

E. Qualitative Analysis

Table VIII provides some examples of captions. Regardless of the image complexity, ExpansionNet v2 is not only able to correctly describe the subjects depicted in the scenes but also showcases a good level of semantic understanding by describing goals and interactions. Unfortunately, our model seems to struggle with out-of-domain objects, as showcased in Table IX where, due to objects and terms unknown to the model, predictions are either imprecise (2nd image) or incorrect (1st image). Nonetheless, it appears to provide a roughly correct description of the image. We showcase an example of attention visualization in Fig. 4, where the scattered focus correctly outlines the main subjects despite the absence of an object detector.
TABLE VII: Training time comparison of State-of-the-Art works against our solution. "Time" represents the estimated time required to train the models on a single NVIDIA A100 using the described strategy. γ, θ and σ denote the number of parameters, the number of training images and the training cost, normalized with respect to our proposal. The "⋆" symbol denotes generative models, typically pre-trained on multiple tasks and images from various sources. We simplify the matter using the cost of Cross-Entropy training on MS-COCO 2014; the downstream task learning cost is ignored since it is negligible compared to the pre-training phase.

Source | Params. (γ) | Datasets → total num. images (θ) | Training Description | Train. time (σ)
Obj. Transf. [17], AoANet [31], PureT [11] | 33M (0.86), 87M (2.28), 34M (0.89) | MS-COCO 2014 → 113K (1.00) | Cross-Entropy: ∼30 epochs and batch size 10. Reinforcement: 30 epochs and batch size 10. | 7 days (2.80)
X-Transformer [14] | 141M (3.71) | MS-COCO 2014 → 113K (1.00) | Cross-Entropy: 70 epochs and batch size 40. Reinforcement: 35 epochs and batch size 32. | 5 days (2.00)
GIT [16] ⋆ | 4.9B (128.94) | MS-COCO, CC3M, CC12M, VG, SBU, ALT200M + 0.6B → 0.8B (7079.64) | 2 epochs; batch size 48 assumed in the estimation. | 117 days (46.80)
OFA [15] ⋆ | 871M (22.92) | MS-COCO, CC3M, CC12M, VG, SBU → 15M (132.74) | 40 epochs; batch size 48 assumed in the estimation. | 44 days (17.60)
ExpansionNet v2 | 38M (1.00) | MS-COCO 2014 → 113K (1.00) | See Section IV-A. | 2.5 days (1.00)

TABLE VIII: Examples of captions.

Image 1:
Baseline: A man holding a tennis ball on a tennis court.
ExpansionNet v2: A man jumping in the air to hit a tennis ball.
Gt: {A tennis player jumps and swats at the ball.; A tennis player hitting a tennis ball on a court.; Professional tennis player immediately after returning a shot.}

Image 2:
Baseline: A little girl brushing her hair with a table.
ExpansionNet v2: A little girl brushing her hair with a pink brush.
Gt: {A young girl tries to comb her own hair.; A young child brushing her hair with a big pink brush.; A young girl is trying to brush her hair with a pink brush.}

TABLE IX: Examples on nocaps out-of-domain images.

Image 1:
ExpansionNet v2: A close-up of a fish in a body of water.
3 gts: {A seahorse in an aquarium full of water with some plants growing in the background.; A blue seahorse is swimming near sea plants on back.; A very small seahorse is in the water along with other pieces.}

Image 2:
ExpansionNet v2: Three pictures of a blender with red liquid in it.
3 gts: {A picture of three blenders with a strawberry looking beverage inside.; A white mixer in the process of making a smoothie.; The steps of making a smoothie in a blender are shown.}

Fig. 4: Attention visualization of a single decoder head in ExpansionNet v2.

V. CONCLUSION

In this work, we addressed the question of whether the fixed number of elements of the inputs represents a performance bottleneck in modern image-captioning systems. To this end, we presented the idea of an Expansion mechanism and provided two concrete implementations, called Static Expansion and Dynamic Expansion, that process the input using sequences featuring a different length compared to the one provided in the input. Upon these layers, we designed a new architecture called ExpansionNet v2 and trained it on the MS-COCO 2014 dataset using a fast End to End training approach. Extensive experiments conducted on the testing set showcase that our method achieves better performances when compared to the baseline. This answers positively the initial research question of whether the input length can represent a bottleneck to sequence processing. Additionally, ExpansionNet v2 achieved strong performances on both the offline (143.7 CIDEr-D) and online (140.8 CIDEr-D) test splits and is outperformed mainly by V+L pre-training models, which we consider orthogonal to our work due to the differences in model size and additional training data. In conclusion, we introduced the Expansion layers and ExpansionNet v2 and found the answer to our research question in the case of the Image Captioning field. Future works will further develop the methods and ideas presented here, motivated by the fact that they can be easily integrated into other solution approaches (such as V+L pre-training) and other research fields.
REFERENCES

[1] M. Mitchell et al., "Midge: Generating image descriptions from computer vision detections," in Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 2012, pp. 747–756.
[2] G. Kulkarni et al., "Babytalk: Understanding and generating simple image descriptions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2891–2903, 2013.
[3] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
[4] K. Xu et al., "Show, attend and tell: Neural image caption generation with visual attention," in International Conference on Machine Learning, 2015, pp. 2048–2057.
[5] P. Anderson et al., "Bottom-up and top-down attention for image captioning and visual question answering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6077–6086.
[6] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," arXiv preprint arXiv:1506.01497, 2015.
[7] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[8] K. Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[9] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[10] A. Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[11] Y. Wang, J. Xu, and Y. Sun, "End-to-end transformer based model for image captioning," arXiv preprint arXiv:2203.15350, 2022.
[12] V.-Q. Nguyen, M. Suganuma, and T. Okatani, "GRIT: Faster and better image captioning transformer using dual visual features," in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI. Springer, 2022, pp. 167–184.
[13] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[14] Y. Pan, T. Yao, Y. Li, and T. Mei, "X-linear attention networks for image captioning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10971–10980.
[15] P. Wang et al., "OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework," in International Conference on Machine Learning. PMLR, 2022, pp. 23318–23340.
[16] J. Wang et al., "GIT: A generative image-to-text transformer for vision and language," arXiv preprint arXiv:2205.14100, 2022.
[17] S. Herdade, A. Kappeler, K. Boakye, and J. Soares, "Image captioning: Transforming objects into words," arXiv preprint arXiv:1906.05963, 2019.
[18] Y. Luo et al., "Dual-level collaborative transformer for image captioning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 2286–2293.
[19] Q. Guo, X. Qiu, P. Liu, Y. Shao, X. Xue, and Z. Zhang, "Star-transformer," arXiv preprint arXiv:1902.09113, 2019.
[20] J. Hao, X. Wang, B. Yang, L. Wang, J. Zhang, and Z. Tu, "Modeling recurrence for transformer," arXiv preprint arXiv:1904.03092, 2019.
[21] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.
[22] A. Raganato, Y. Scherrer, and J. Tiedemann, "Fixed encoder self-attention patterns in transformer-based machine translation," arXiv preprint arXiv:2002.10260, 2020.
[23] Y. Tay, D. Bahri, D. Metzler, D.-C. Juan, Z. Zhao, and C. Zheng, "Synthesizer: Rethinking self-attention for transformer models," in International Conference on Machine Learning. PMLR, 2021, pp. 10183–10192.
[24] H. Ramsauer et al., "Hopfield networks is all you need," arXiv preprint arXiv:2008.02217, 2020.
[25] W. You, S. Sun, and M. Iyyer, "Hard-coded gaussian attention for neural machine translation," arXiv preprint arXiv:2005.00742, 2020.
[26] J. Lee-Thorp, J. Ainslie, I. Eckstein, and S. Ontanon, "FNet: Mixing tokens with fourier transforms," arXiv preprint arXiv:2105.03824, 2021.
[27] I. O. Tolstikhin et al., "MLP-mixer: An all-MLP architecture for vision," Advances in Neural Information Processing Systems, vol. 34, pp. 24261–24272, 2021.
[28] R. Socher and L. Fei-Fei, "Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora," in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2010, pp. 966–973.
[29] B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu, "I2T: Image parsing to text description," Proceedings of the IEEE, vol. 98, no. 8, pp. 1485–1508, 2010.
[30] L. Wang, Z. Bai, Y. Zhang, and H. Lu, "Show, recall, and tell: Image captioning with recall mechanism," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12176–12183.
[31] L. Huang, W. Wang, J. Chen, and X.-Y. Wei, "Attention on attention for image captioning," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 4634–4643.
[32] J.-H. Kim, J. Jun, and B.-T. Zhang, "Bilinear attention networks," arXiv preprint arXiv:1805.07932, 2018.
[33] S. Sukhbaatar, E. Grave, G. Lample, H. Jegou, and A. Joulin, "Augmenting self-attention with persistent memory," arXiv preprint arXiv:1907.01470, 2019.
[34] A. Gulati et al., "Conformer: Convolution-augmented transformer for speech recognition," arXiv preprint arXiv:2005.08100, 2020.
[35] J. Ji et al., "Improving image captioning by leveraging intra- and inter-layer global representation in transformer network," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 2, 2021, pp. 1655–1663.
[36] M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara, "Meshed-memory transformer for image captioning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10578–10587.
[37] P. Zeng, H. Zhang, J. Song, and L. Gao, "S2 transformer for image captioning," in Proceedings of the International Joint Conferences on Artificial Intelligence, vol. 5, 2022.
[38] P. Wang et al., "Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework," arXiv preprint arXiv:2202.03052, 2022.
[39] X. Hu et al., "Scaling up vision-language pre-training for image captioning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17980–17989.
[40] J. Li, D. Li, C. Xiong, and S. Hoi, "BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation," arXiv preprint arXiv:2201.12086, 2022.
[41] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[42] R. Xiong et al., "On layer normalization in the transformer architecture," in International Conference on Machine Learning. PMLR, 2020, pp. 10524–10533.
[43] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, "Self-critical sequence training for image captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7008–7024.
[44] R. Luo, "A better variant of self-critical sequence training," arXiv preprint arXiv:2003.09971, 2020.
[45] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
[46] H. Agrawal et al., "nocaps: Novel object captioning at scale," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8948–8957.
[47] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[48] J. C. Hu, R. Cavicchioli, and A. Capotondi, "Exploring the sequence length bottleneck in the transformer for image captioning," 2022.
[49] J. Hu, R. Cavicchioli, and A. Capotondi, "A request for clarity over the end of sequence token in the self-critical sequence training," Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 14233 LNCS, pp. 39–50, 2023.
[50] T. Yao, Y. Pan, Y. Li, and T. Mei, "Exploring visual relationship for image captioning," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 684–699.
[51] X. Yang, K. Tang, H. Zhang, and J. Cai, "Auto-encoding scene graphs for image captioning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10685–10694.
[52] X. Zhang et al., "RSTNet: Captioning with adaptive attention on visual and non-visual words," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15465–15474.
[53] L. Liu et al., "On the variance of the adaptive learning rate and beyond," arXiv preprint arXiv:1908.03265, 2019.
[54] P. Anderson, B. Fernando, M. Johnson, and S. Gould, "Guided open vocabulary image captioning with constrained beam search," arXiv preprint arXiv:1612.00576, 2016.
[55] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, "Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3558–3568.
[56] P. Zhang et al., "VinVL: Revisiting visual representations in vision-language models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5579–5588.
