
Automatic Creative Selection with Cross-Modal Matching

arXiv:2405.00029v1 [cs.CV] 28 Feb 2024

Alex Kim∗ (University of Southern California, Los Angeles, CA)
Jia Huang† (Apple, Cupertino, CA)
Rob Monarch (Apple, Cupertino, CA)
Jerry Kwac (Apple, Cupertino, CA)
Anikesh Kamath (Apple, Cupertino, CA)
Parmeshwar Khurd (Apple, Cupertino, CA)
Kailash Thiyagarajan (Apple, Cupertino, CA)
Goodman Gu (Apple, Cupertino, CA)

Abstract
We present a novel approach for matching images to text in the specific context
of matching an application image to the search phrases that someone might use to
discover that application. We share a new fine-tuning approach for a pre-trained
cross-modal model, tuned to search-text and application-image data. We evalu-
ate matching images to search phrases in two ways: the application developers’
intuitions about which search phrases are the most relevant to their application
images; and the intuitions of professional human annotators about which search
phrases are the most relevant to a given application. Our approach achieves 0.96
and 0.95 AUC for these two ground truth datasets, which outperforms current
state-of-the-art models by 8%-17%.

1 Introduction
Application (App) developers promote their Apps by creating multiple custom pages, each with
different images. By using different images for different pages, App developers can engage
people with diverse interests in using their Apps. In some cases, developers also advertise their
Apps, and they then suggest the search phrases that they believe are most relevant for their Apps.
This gives us two different, but equally meaningful, subjective perspectives on the relationship
between Apps and search phrases: the intuition of the people who created each App and the
intuition of people in general. Therefore, understanding the relationship between images and
search phrases, and providing a model that recommends to developers the creative images that best
match given search phrases, is an important task for supporting App developers. Recent
state-of-the-art approaches have addressed this problem using cross-modal Transformer or BERT
architectures ([2][3][4][6][7][8][9][14]).

∗ Work completed as an intern at Apple.
† Corresponding author.

Preprint. Under review.


Table 1: Model performance on two evaluation datasets

                      Developer intuitions    Professional annotator intuitions
                      AUC      F1             AUC      F1
Zero-shot CLIP        0.62     0.70           0.74     0.68
Fine-tuned CLIP       0.84     0.81           0.81     0.75
XLM-R + ResNet        0.89     0.76           0.82     0.73
XLM-R + CLIP img      -        -              0.83     0.75
Our Approach          0.96     0.89           0.95     0.87

The existing approaches work well for image captioning and question-answering tasks, but in the
search domain creative images often do not have a description, and search phrases are much
shorter, making description-based matching infeasible. We customize a cross-modal BERT framework
by fine-tuning a pre-trained cross-modal model on an in-house training dataset of
(search phrase, ad image, label) triples, evaluate the model on our test set for the
(image, search phrase) matching task, and compare it with baselines including CLIP [9].
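As an illustration only (the field names and file names below are ours, not the in-house dataset's
schema), one training instance pairs a search phrase with the pre-extracted object features of an
ad image and a binary relevance label:

```python
# Hypothetical shape of a single (search phrase, ad image, label) training instance.
# The real in-house dataset is not public; array shapes follow the description in Section 2.
example = {
    "search_phrase": "meditation app for sleep",   # short query text
    "image_boxes": "ad_1234_boxes.npy",            # [36, 4] Faster R-CNN bounding boxes
    "image_features": "ad_1234_feats.npy",         # [36, 2048] object-level features
    "label": 1,                                    # 1 = relevant to the phrase, 0 = not relevant
}
```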

2 Method

For each App c, a developer promoting that App can define a set of search phrases K relevant to
their App. Our approach provides a way to automatically select the image m from a candidate image
pool M that best matches a given search phrase k. Given a pair (search phrase k, image m) for an
App c, we model relevance prediction R(k, m) as binary classification.
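Concretely, with a binary label y ∈ {0, 1} indicating whether phrase k and image m match, one
natural formalization (the paper specifies a sigmoid output, so the binary cross-entropy objective
below is our assumption) is:

$$
\hat{R}(k, m) = \sigma\!\left(f_{\theta}(k, m)\right), \qquad
\mathcal{L} = -\left[\, y \log \hat{R}(k, m) + (1 - y)\log\!\left(1 - \hat{R}(k, m)\right) \right],
$$

where f_θ denotes the cross-modal encoder and classification head described next, and σ is the
sigmoid function.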
Our model uses a pre-trained cross-modal image-text matching architecture, as in LXMERT [12],
where the input text is split by a WordPiece tokenizer [13] and embedded into word embeddings,
and the input image objects are detected by Faster R-CNN [10] and embedded into object-level
image embeddings. A language encoder and an object-relationship encoder are applied to the word
embeddings and the detected objects, respectively. Finally, the relevance score R(k, m) is
predicted by a Transformer-based cross-modal encoder. The features extracted from each image are
in the format ([36, 4], [36, 2048]), representing the bounding-box coordinates of up to 36
detected objects and a 2048-dimensional feature vector for each of those objects. We then
fine-tune the pre-trained model on our training dataset. Our fine-tuning process is defined as a
sequential model built on top of the LXMERT encoder, consisting of a linear layer, a GELU
activation function [5], layer normalization [1], another linear layer, and a sigmoid activation
as the final layer to output binary classification predictions.
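The following is a minimal sketch of such a fine-tuning head, assuming the Hugging Face
implementation of LXMERT ("unc-nlp/lxmert-base-uncased"); names such as CreativeMatcher and the
placeholder inputs are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
from transformers import LxmertModel, LxmertTokenizer

class CreativeMatcher(nn.Module):
    """Sketch of a binary relevance classifier on top of a pre-trained LXMERT encoder."""

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.encoder = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")
        # Sequential head as described above: linear -> GELU -> LayerNorm -> linear -> sigmoid
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, input_ids, attention_mask, visual_feats, visual_pos):
        # visual_feats: [batch, 36, 2048] object features from Faster R-CNN
        # visual_pos:   [batch, 36, 4] bounding-box coordinates of the detected objects
        out = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            visual_feats=visual_feats,
            visual_pos=visual_pos,
        )
        # Pooled cross-modal representation -> relevance score R(k, m) in [0, 1]
        return self.head(out.pooled_output).squeeze(-1)

# Example usage with a single (search phrase, image) pair and placeholder visual features:
tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
enc = tokenizer("puzzle games for kids", return_tensors="pt")
feats = torch.randn(1, 36, 2048)   # stand-in for real Faster R-CNN object features
boxes = torch.rand(1, 36, 4)       # stand-in for real bounding-box coordinates
model = CreativeMatcher()
score = model(enc.input_ids, enc.attention_mask, feats, boxes)
```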

3 Results

We compare our method to four baselines, as shown in Table 1: (1) XLM-R + ResNet, (2) XLM-R +
CLIP img, (3) Zero-shot CLIP, and (4) Fine-tuned CLIP. One possible reason for the performance
lift lies in where the fusion occurs. While both XLM-R baselines use early fusion, in which text
and image features are concatenated into a single input sequence, our approach uses mid-fusion,
in which independent Transformers are applied to the textual and visual modalities before a
cross-modal encoder fuses them; this shows the effectiveness of the cross-modal encoder in
identifying the relationship between the modalities. CLIP uses a contrastive loss that assumes a
1:1 mapping between text and image, so our lift over CLIP is likely due to the 1:n mapping in our
data. CLIP also follows a "shallow-interaction design" [11] with no cross-modal encoder, so the
interaction it captures between text and images may be too shallow for our use case.

4 Conclusion

We demonstrate that a cross-modal model framework significantly improves prediction accuracy for
image-text matching in the domain of applications and search phrases. Our work opens the door to
automatic image selection and to self-serve capabilities that recommend to developers which image
is best suited to promote their applications.

References
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint
arXiv:1607.06450 (2016).
[2] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing
Liu. 2020. Uniter: Universal image-text representation learning. In Computer Vision–ECCV 2020: 16th
European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX. Springer, 104–120.
[3] Hongliang Fei, Tan Yu, and Ping Li. 2021. Cross-lingual cross-modal pretraining for multimodal retrieval.
In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies. 3644–3650.

[4] Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. 2020. Large-scale adversarial
training for vision-and-language representation learning. Advances in Neural Information Processing
Systems 33 (2020), 6616–6628.
[5] Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint
arXiv:1606.08415 (2016).

[6] Weixiang Hong, Kaixiang Ji, Jiajia Liu, Jian Wang, Jingdong Chen, and Wei Chu. 2021. Gilbert: Gener-
ative vision-language pre-training for image-text retrieval. In Proceedings of the 44th International ACM
SIGIR Conference on Research and Development in Information Retrieval. 1379–1388.
[7] Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. 2020. Unicoder-vl: A universal encoder
for vision and language by cross-modal pre-training. In Proceedings of the AAAI Conference on Artificial
Intelligence, Vol. 34. 11336–11344.

[8] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong
Hu, Li Dong, Furu Wei, et al. 2020. Oscar: Object-semantics aligned pre-training for vision-language
tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020,
Proceedings, Part XXX 16. Springer, 121–137.
[9] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish
Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from
natural language supervision. In International conference on machine learning. PMLR, 8748–8763.

[10] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object
detection with region proposal networks. Advances in neural information processing systems 28 (2015).
[11] Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei
Yao, and Kurt Keutzer. 2021. How much can clip benefit vision-and-language tasks? arXiv preprint
arXiv:2107.06383 (2021).

[12] Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality Encoder Representations from
Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Pro-
cessing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
5100–5111.
[13] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey,
Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine transla-
tion system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144
(2016).

[14] Tan Yu, Yi Yang, Yi Li, Lin Liu, Hongliang Fei, and Ping Li. 2021. Heterogeneous attention network
for effective and efficient cross-modal retrieval. In Proceedings of the 44th international ACM SIGIR
conference on research and development in information retrieval. 1146–1156.
