
Automatic Creative Selection with Cross-Modal Matching

arXiv:2405.00029v1 [cs.CV] 28 Feb 2024

Alex Kim∗ (University of Southern California, Los Angeles, CA)
Jia Huang† (Apple, Cupertino, CA)
Rob Monarch (Apple, Cupertino, CA)
Jerry Kwac (Apple, Cupertino, CA)
Anikesh Kamath (Apple, Cupertino, CA)
Parmeshwar Khurd (Apple, Cupertino, CA)
Kailash Thiyagarajan (Apple, Cupertino, CA)
Goodman Gu (Apple, Cupertino, CA)

Abstract
We present a novel approach for matching images to text in the specific context
of matching an application image to the search phrases that someone might use to
discover that application. We share a new fine-tuning approach for a pre-trained
cross-modal model, tuned to search-text and application-image data. We evalu-
ate matching images to search phrases in two ways: the application developers’
intuitions about which search phrases are the most relevant to their application
images; and the intuitions of professional human annotators about which search
phrases are the most relevant to a given application. Our approach achieves 0.96
and 0.95 AUC for these two ground truth datasets, which outperforms current
state-of-the-art models by 8%-17%.

1 Introduction
Application (App) developers promote their Apps by creating multiple custom pages, each with
different images. By using different images for different pages, App developers can engage
people with diverse interests in using their Apps. In some cases, developers also advertise their
Apps, and they then suggest the search phrases that they believe are most relevant for their Apps.
This gives us two different, but equally meaningful, subjective perspectives on the relationship
between Apps and search phrases: the intuition of the people who created each App and the
intuition of people in general. Therefore, understanding the relationship between images and
search phrases, and providing a model that recommends to developers the creative images that best
match given search phrases, is an important task for supporting App developers. Recent
state-of-the-art approaches have addressed this problem using cross-modal Transformer or BERT
architectures ([2][3][4][6][7][8][9][14]).

∗ Work completed as an intern at Apple.
† Corresponding author.

Preprint. Under review.


Table 1: Model performance on two evaluation datasets

                      Developer intuitions    Professional annotator intuitions
                      AUC      F1             AUC      F1
Zero-shot CLIP        0.62     0.70           0.74     0.68
Fine-tuned CLIP       0.84     0.81           0.81     0.75
XLM-R + ResNet        0.89     0.76           0.82     0.73
XLM-R + CLIP img      -        -              0.83     0.75
Our Approach          0.96     0.89           0.95     0.87

The existing approaches work well for image captioning and question-answering tasks, but in the
search domain creative images often do not have a description, and search phrases are much
shorter, making description-based matching infeasible. We customize a cross-modal BERT framework
by fine-tuning a pre-trained cross-modal model on an in-house training dataset of
(search phrase, ad image, label) triples, evaluate the model on our test set for the
(image, search phrase) matching task, and compare it with baselines including CLIP [9].
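As an illustration only (the field names and file names below are ours, not the in-house dataset's
schema), one training instance pairs a search phrase with the pre-extracted object features of an
ad image and a binary relevance label:

```python
# Hypothetical shape of a single (search phrase, ad image, label) training instance.
# The real in-house dataset is not public; array shapes follow the description in Section 2.
example = {
    "search_phrase": "meditation app for sleep",   # short query text
    "image_boxes": "ad_1234_boxes.npy",            # [36, 4] Faster R-CNN bounding boxes
    "image_features": "ad_1234_feats.npy",         # [36, 2048] object-level features
    "label": 1,                                    # 1 = relevant to the phrase, 0 = not relevant
}
```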

2 Method

For each App c, a developer promoting that App can define a set of search phrases K relevant to
their App. Our approach provides a way to automatically select the image m from a candidate image
pool M that best matches a given search phrase k. Given a pair (search phrase k, image m) for an
App c, we model relevance prediction R(k, m) as binary classification.
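Concretely, with a binary label y ∈ {0, 1} indicating whether phrase k and image m match, one
natural formalization (the paper specifies a sigmoid output, so the binary cross-entropy objective
below is our assumption) is:

$$
\hat{R}(k, m) = \sigma\!\left(f_{\theta}(k, m)\right), \qquad
\mathcal{L} = -\left[\, y \log \hat{R}(k, m) + (1 - y)\log\!\left(1 - \hat{R}(k, m)\right) \right],
$$

where f_θ denotes the cross-modal encoder and classification head described next, and σ is the
sigmoid function.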
Our model uses a pre-trained cross-modal image-text matching architecture, as in LXMERT [12],
where the input text is split by a WordPiece tokenizer [13] and embedded into word embeddings,
and the input image objects are detected by Faster R-CNN [10] and embedded into object-level
image embeddings. A language encoder and an object-relationship encoder are applied to the word
embeddings and the detected objects, respectively. Finally, the relevance score R(k, m) is
predicted by a Transformer-based cross-modal encoder. The features extracted from each image are
in the format ([36, 4], [36, 2048]), representing the bounding-box coordinates of up to 36
detected objects and a 2048-dimensional feature vector for each of those objects. We then
fine-tune the pre-trained model on our training dataset. Our fine-tuning process is defined as a
sequential model built on top of the LXMERT encoder, consisting of a linear layer, a GELU
activation function [5], layer normalization [1], another linear layer, and a sigmoid activation
as the final layer to output binary classification predictions.
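The following is a minimal sketch of such a fine-tuning head, assuming the Hugging Face
implementation of LXMERT ("unc-nlp/lxmert-base-uncased"); names such as CreativeMatcher and the
placeholder inputs are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
from transformers import LxmertModel, LxmertTokenizer

class CreativeMatcher(nn.Module):
    """Sketch of a binary relevance classifier on top of a pre-trained LXMERT encoder."""

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.encoder = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")
        # Sequential head as described above: linear -> GELU -> LayerNorm -> linear -> sigmoid
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, input_ids, attention_mask, visual_feats, visual_pos):
        # visual_feats: [batch, 36, 2048] object features from Faster R-CNN
        # visual_pos:   [batch, 36, 4] bounding-box coordinates of the detected objects
        out = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            visual_feats=visual_feats,
            visual_pos=visual_pos,
        )
        # Pooled cross-modal representation -> relevance score R(k, m) in [0, 1]
        return self.head(out.pooled_output).squeeze(-1)

# Example usage with a single (search phrase, image) pair and placeholder visual features:
tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
enc = tokenizer("puzzle games for kids", return_tensors="pt")
feats = torch.randn(1, 36, 2048)   # stand-in for real Faster R-CNN object features
boxes = torch.rand(1, 36, 4)       # stand-in for real bounding-box coordinates
model = CreativeMatcher()
score = model(enc.input_ids, enc.attention_mask, feats, boxes)
```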

3 Results

We compare our method to four baselines, as shown in Table 1: (1) XLM-R + ResNet, (2) XLM-R +
CLIP img, (3) Zero-shot CLIP, and (4) Fine-tuned CLIP. One possible reason for the performance
lift lies in where the fusion occurs. While both XLM-R baselines use early fusion, in which text
and image features are concatenated into a single input sequence, our approach uses mid-fusion,
in which independent Transformers are applied to the textual and visual modalities before a
cross-modal encoder fuses them; this shows the effectiveness of the cross-modal encoder in
identifying the relationship between the modalities. CLIP uses a contrastive loss that assumes a
1:1 mapping between text and image, so our lift over CLIP is likely due to the 1:n mapping in our
data. CLIP also follows a "shallow-interaction design" [11] with no cross-modal encoder, so the
interaction it captures between text and images may be too shallow for our use case.

4 Conclusion

We demonstrate that a cross-modal model framework significantly improves prediction accuracy for
image-text matching in the domain of applications and search phrases. Our work opens the door to
automatic image selection and to self-serve capabilities that recommend to developers which image
is best suited to promote their applications.

References
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint
arXiv:1607.06450 (2016).
[2] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing
Liu. 2020. Uniter: Universal image-text representation learning. In Computer Vision–ECCV 2020: 16th
European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX. Springer, 104–120.
[3] Hongliang Fei, Tan Yu, and Ping Li. 2021. Cross-lingual cross-modal pretraining for multimodal retrieval.
In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies. 3644–3650.

[4] Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. 2020. Large-scale adversarial
training for vision-and-language representation learning. Advances in Neural Information Processing
Systems 33 (2020), 6616–6628.
[5] Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint
arXiv:1606.08415 (2016).

[6] Weixiang Hong, Kaixiang Ji, Jiajia Liu, Jian Wang, Jingdong Chen, and Wei Chu. 2021. Gilbert: Gener-
ative vision-language pre-training for image-text retrieval. In Proceedings of the 44th International ACM
SIGIR Conference on Research and Development in Information Retrieval. 1379–1388.
[7] Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. 2020. Unicoder-vl: A universal encoder
for vision and language by cross-modal pre-training. In Proceedings of the AAAI Conference on Artificial
Intelligence, Vol. 34. 11336–11344.

[8] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong
Hu, Li Dong, Furu Wei, et al. 2020. Oscar: Object-semantics aligned pre-training for vision-language
tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020,
Proceedings, Part XXX 16. Springer, 121–137.
[9] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish
Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from
natural language supervision. In International conference on machine learning. PMLR, 8748–8763.

[10] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object
detection with region proposal networks. Advances in neural information processing systems 28 (2015).
[11] Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei
Yao, and Kurt Keutzer. 2021. How much can clip benefit vision-and-language tasks? arXiv preprint
arXiv:2107.06383 (2021).

[12] Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality Encoder Representations from
Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Pro-
cessing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
5100–5111.
[13] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey,
Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine transla-
tion system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144
(2016).

[14] Tan Yu, Yi Yang, Yi Li, Lin Liu, Hongliang Fei, and Ping Li. 2021. Heterogeneous attention network
for effective and efficient cross-modal retrieval. In Proceedings of the 44th international ACM SIGIR
conference on research and development in information retrieval. 1146–1156.
