Deep CTR Prediction in Display Advertising
∗ † ‡
Junxuan Chen1,2 , Baigui Sun2 , Hao Li2 , Hongtao Lu1 , Xian-Sheng Hua2
1 Department of Computer Science and Engineering, Shanghai Jiao Tong University
2 Alibaba Group, Hangzhou, China
{chenjunxuan, htlu}@sjtu.edu.cn {baigui.sbg, lihao.lh, xiansheng.hxs}@alibaba-inc.com
ABSTRACT
Click-through rate (CTR) prediction of image ads is the core task of online display advertising systems, and logistic regression (LR) has been frequently applied as the prediction model. However, the LR model lacks the ability to extract complex and intrinsic nonlinear features from handcrafted high-dimensional image features, which limits its effectiveness. To solve this issue, we introduce a novel deep neural network (DNN) based model that directly predicts the CTR of an image ad from raw image pixels and other basic features in one step. The DNN model employs convolution layers to automatically extract representative visual features from images, and nonlinear CTR features are then learned from the visual features and other contextual features by fully-connected layers. Empirical evaluations on a real-world dataset with over 50 million records demonstrate the effectiveness and efficiency of this method.

Keywords
DNN, CNN, Click through rate, Image Ads, Display Advertising

∗The work is done while the author was an intern at Alibaba Group.
†Corresponding author.
‡Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MM '16, October 15-19, 2016, Amsterdam, Netherlands
© 2016 ACM. ISBN 978-1-4503-3603-1/16/10...$15.00
DOI: http://dx.doi.org/10.1145/2964284.2964325

1. INTRODUCTION
Online display advertising generates a significant amount of revenue by showing textual or image ads on various web pages [3]. Ad publishers such as Google and Yahoo sell ad zones on different web pages to advertisers who want to show their ads to users. Publishers then get paid by advertisers every time an ad display leads to some desired action, such as a click or a purchase, according to payment options such as cost-per-click (CPC) or cost-per-conversion (CPA) [15]. The expected revenue for publishers is the product of the bid price and the click-through rate (CTR) or conversion rate (CVR).

Recently, more and more advertisers prefer displaying image ads [17] (Figure 1) because they are more attractive and comprehensible than textual ads. To maximize the revenue of publishers, this has led to a huge demand for approaches that can choose the most suitable image ad to show to a particular user visiting a web page, so as to maximize the CTR or CVR.

Figure 1: Display ads on an e-commerce web page.

Therefore, in most online advertising systems, predicting the CTR or CVR is the core task of ad allocation. In this paper, we focus on CPC and predict the CTR of display ads. Typically an ad system predicts and ranks the CTR of available ads based on contextual information, and then shows the top K ads to the users. In general, prediction models are learned from past click data using machine learning techniques [3, 24, 9, 5, 32, 16].

Features that are used to represent an ad are extremely important in a machine learning model. In recent years, to make the CTR prediction model more accurate, many researchers have used millions of features to describe a user's response record (we call it an ad impression). Typically, an image ad impression has basic features and visual features. The basic features are information about users, products, ad positions in a web page, etc. Visual features describe the visual appearance of an image ad at different levels. For example, color and texture are low-level features, while faces and other contextual objects are high-level features. Low-level and high-level features may both influence the CTR of an image ad (Figure 2). Traditionally, researchers have lacked effective methods to extract high-level visual features, and the importance of visual features is usually underestimated. However, as we can see from Figure 2, ads with the same basic features may have largely different CTRs due to different ad images. As a consequence, how to use the visual features effectively in machine learning models becomes an urgent task.
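To make the ranking step from earlier in this section concrete, the following minimal sketch scores each candidate ad by bid price times predicted CTR and keeps the top K. Ranking by expected revenue (rather than by CTR alone) is an assumption made here for illustration, and the names `Candidate`, `expected_revenue` and `top_k_ads` are hypothetical, not taken from the paper.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    ad_id: str            # hypothetical ad identifier
    bid: float            # advertiser's bid per click (CPC setting)
    predicted_ctr: float  # output of some CTR model, in [0, 1]


def expected_revenue(ad: Candidate) -> float:
    """Expected revenue per impression: bid price times predicted CTR."""
    return ad.bid * ad.predicted_ctr


def top_k_ads(candidates: List[Candidate], k: int) -> List[Candidate]:
    """Return the k candidates with the highest expected revenue."""
    return sorted(candidates, key=expected_revenue, reverse=True)[:k]
```

Under this scoring, an ad with a lower CTR can still outrank one with a higher CTR if its bid is sufficiently larger, which is why an accurate CTR estimate matters for every candidate and not only for the most clicked ones.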
Figure 2: Two groups of image ads, one per row. The two ads in each group have exactly the same ad group id, ad zone and target people. The CTRs of image ads (a) and (b) are 1.27% and 0.83%; (b) clearly suffers from low contrast between product and background. The CTRs of (c) and (d) are 2.40% and 2.23%; we find that too many subjects in a men's clothing ad may have a negative effect. (The number of impressions of each ad is sufficiently high to make the CTR meaningful.)

Among the machine learning models that have been applied to predict ad CTR using the above features, logistic regression (LR) is the most well-known and widely used one due to its simplicity and effectiveness. Also, LR is easy to parallelize on a distributed computing system, so it is not challenging to make it work on billions of samples [3]. Being able to handle big data efficiently is necessary for a typical advertising system, especially when the prediction model needs to be updated frequently to deal with new ads. However, LR is a linear model, which is inferior at extracting complex and effective nonlinear features from handcrafted feature pools. Though one can mitigate this issue by computing second-order conjunctions of the features, LR still cannot extract higher-order nonlinear representative features, and increasing the conjunction order further may cause feature explosion.

To address these problems, other methods such as factorization machines [22], decision trees [9] and neural networks [32] are widely used. Though these methods can extract nonlinear features, they only deal with basic features and handcrafted visual features, which are inferior at describing images. In this paper, we propose a deep neural network (DNN) to directly predict the CTR of an image ad from raw pixels and other basic features. Our DNN model contains convolution layers that extract representative visual features, followed by fully-connected layers that learn the complex and effective nonlinear features among basic and visual features. The main contributions of this work can be summarized as follows:

1. This paper proposes a DNN model which not only directly takes both high-dimensional sparse features and images as input, but can also be trained end to end. To the best of our knowledge, this is the first DNN based CTR model that can do so.

2. Efficient methods are introduced to tackle the challenges of high dimensionality and huge data volume in the model training stage. The proposed methods reduce the training time significantly and make it feasible to train on a normal PC with GPUs even with large-scale real-world training data.

3. We conduct extensive experiments on a real-world dataset with more than 50 million user response records to illustrate the improvement provided by our DNN model. The impacts of several deep learning techniques are also discussed. We further visualize the saliency map of image ads to show that our model can learn effective visual features.

The paper is organized as follows. Section 2 introduces the related work, followed by an overview of our scheme in Section 3. In Section 4, we describe the proposed DNN model in detail, and in Section 5 we show the challenges in the training stage as well as our solutions. Section 6 presents the experimental results and discussion, and Section 7 concludes.

2. RELATED WORK
We consider display advertising CTR prediction and deep neural networks to be the two areas most related to our work.

2.1 Display Advertising CTR prediction
Since display advertising has taken a large share of the online advertising market, many works addressing the CTR prediction problem have been published. In [24, 2], the authors handcraft many features from raw data and use logistic regression (LR) to predict the click-through rate. [3] also uses LR for the CTR problem and scales it to billions of samples and millions of parameters on a distributed learning system. In [20], a Hierarchical Importance-aware Factorization Machine (FM) [23] is introduced, which provides a generic latent factor framework that incorporates importance weights and hierarchical learning. In [5], boosted decision trees are used to build a prediction model. In [9], a model that combines decision trees with logistic regression is proposed, and it outperforms either of the two models alone. [32] combines deep neural networks with FM and also brings an improvement. All of these methods are very effective when dealing with ads without images. However, when it comes to image ads, they can only use pre-extracted image features, which makes them less flexible in accounting for the unique properties of different datasets.

Therefore, the image features in display advertising have received more and more attention. In [1, 4], the impact of visual appearance on users' responses in online display advertising is considered for the first time. The authors extract over 30 handcrafted features from ad images and build a CTR prediction model using image and basic features. The experimental results show that their method achieves better performance than models without visual features. [18] is the work most closely related to ours in the literature; it uses a decapitated convolutional neural network (CNN) to extract image features from ads. However, there are two important differences between their method and ours. First, they do not consider basic features when extracting image features with the CNN. Second, when predicting the CTR they use logistic regression, which lacks the ability to explore the complex relations between image and basic features. Most of the information in their image features is redundant, such as the product category, which is already included in the basic features. As a result, their model achieves only limited improvements when combining both kinds of features. Worse still, when the dataset contains too many categories of products, their model can hardly converge during training. Our model is trained end to end to predict the CTR of image ads from basic features and raw images in one step, so that image features can be seen as supplementary to the basic features.
vector and fed to two fully-connected layers. The output of the last fully-connected layer is a real value z. This part is called Combnet. On top of the whole network, Logloss is computed as described in Section 3.

The design of Convnet is inspired by the networks in [7, 27], as shown in Figure 4. The network consists of 17 convolution layers. The first convolution layer uses 5 × 5 convolution kernels. Following the first layer, there are four groups, each with four layers of 3 × 3 kernels. We do not build a very deep network, such as one with more than 50 layers, in consideration of the trade-off between performance and training time. We pre-train the Convnet on the images in the training dataset with category labels. In the pre-training period, we place two fully-connected layers with 1024 hidden nodes (we call them fc18 and fc19), a fully-connected layer with 96-way outputs (we call it fc20) and a softmax after the Convnet. Since our set of unique images is smaller than ImageNet [6] (to be detailed in Section 6), we use half the number of outputs in each group compared with [7]. After pre-training, a 128-way fully-connected layer is connected behind the last convolution layer. Then we train the whole DeepCTR end to end using Logloss.

5. SPEED UP TRAINING
An online advertising system receives a large number of new user response records every day. It is necessary for ad systems to update as frequently as possible to adapt to new trends. An LR model on a distributed system requires only several hours to train with billions of samples, which makes it popular in industry.

Typically a deep neural network has millions of parameters, which makes it impossible to train quickly. With the development of GPUs, one can train a deep CNN with 1 million training images in two days on a single machine. However, this is not feasible in time for our network, since we have more than 50 million samples. Moreover, the dimensionality of the basic features is nearly 200,000, which leads to many more parameters in our network than in a normal deep neural network. Directly training our network on a single machine may take hundreds of days to converge according to a rough estimation. Even using multiple machines can hardly resolve the training problem. We must greatly speed up the training if we want to deploy our DeepCTR in a real online system.

To make it feasible to train our model in less than one day, we adopt two techniques: a sparse fully-connected layer and a new data sampling scheme. The use of these two techniques makes the training time of our DeepCTR suitable for a real online system.

5.1 Sparse Fully-Connected Layer
In CTR prediction, the basic feature of an ad impression includes user information such as gender, age and purchasing power, and ad information such as ad ID, ad category, ad zone, etc. This information is usually encoded by one-hot encoding or feature hashing [29], which makes the feature dimension very large. For example, it is nearly 200,000 in our dataset. Consequently, in Basicnet, the first fully-connected layer, which takes the basic feature as input, has around 60 million parameters, which is similar to the number of all the parameters in AlexNet [13]. However, the basic feature is extremely sparse due to the one-hot encoding. Using a sparse matrix in the first fully-connected layer can greatly reduce the computing complexity and GPU memory usage.

In our model, we use the compressed sparse row (CSR) format to represent a batch of basic features V. When computing the network forward pass

$$Y_{fc1} = V W, \tag{4}$$

sparse matrix operations can be used in the first fully-connected layer. In the backward pass, we only need to update the weights that link to the small number of nonzero dimensions, according to the gradient

$$\nabla(W) = V^{\top} \nabla(Y_{fc1}). \tag{5}$$

Both the forward pass and the backward pass only need a time complexity of $O(nd')$, where $d'$ is the number of nonzero elements in the basic features and $d' \ll d$. An experimental result that compares the time and GPU memory usage with and without the sparse fully-connected layer can be found in Section 6.2.

5.2 Data Sampling
Another crucial issue in training is that Convnet limits the batch size of Stochastic Gradient Descent (SGD). To train a robust CTR prediction model, we usually need millions of ad samples. However, the Convnet requires a lot of GPU memory, which makes our batch size very small, say, a few hundred. With a small batch size, the parallel computing of the GPU cannot reach its full effect in the multiplication of large matrices, and the number of iterations in each epoch becomes very large, so much more time is needed to run through an epoch. Though the sparse fully-connected
Algorithm 1 Training a DeepCTR network
Input: Network Net with parameter W, unique images set U, basic features set V, labels Y, batch size n, basic feature sample number k.
Output: Network for CTR prediction, Net
1: Initialize Net.
2: Compute the sample probability p(u) of each image u,

   $$p(u) = \frac{\#V_u}{\sum_{u' \in U} \#V_{u'}} \tag{7}$$

3: repeat
4:   Sample a batch of n images from U according to p(u).
5:   For each sampled image u, sample k basic features from V_u, together with their labels from Y_u, uniformly with replacement.
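Steps 2, 4 and 5 of Algorithm 1 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: impressions are grouped in a dictionary keyed by image, images are drawn with replacement (the algorithm does not say), and the names `Impression`, `image_probabilities` and `sample_batch` are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

# One impression: (indices of active basic features, click label).
Impression = Tuple[Sequence[int], int]


def image_probabilities(
    impressions_by_image: Dict[str, List[Impression]]
) -> Dict[str, float]:
    """Equation (7): p(u) is the share of all impressions that show image u."""
    total = sum(len(rows) for rows in impressions_by_image.values())
    return {u: len(rows) / total for u, rows in impressions_by_image.items()}


def sample_batch(
    impressions_by_image: Dict[str, List[Impression]],
    n: int,
    k: int,
    rng: random.Random,
) -> List[Tuple[str, List[Impression]]]:
    """Draw n images according to p(u), then k impressions per drawn image."""
    probs = image_probabilities(impressions_by_image)
    images = list(probs)
    weights = [probs[u] for u in images]
    # Assumption: images are drawn with replacement.
    drawn = rng.choices(images, weights=weights, k=n)
    batch = []
    for u in drawn:
        # Step 5: uniform sampling with replacement of (features, label) pairs.
        rows = rng.choices(impressions_by_image[u], k=k)
        batch.append((u, rows))
    return batch
```

With this layout each image in a batch would be passed through the convolution layers once, while its k sampled basic-feature rows share that result, so a batch of n images yields n·k training samples.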
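The sparse first layer of Section 5.1 can be illustrated with a hand-written CSR product. The sketch below is pure Python for readability, not the GPU implementation used in the paper; the function names and the dictionary returned for the gradient are assumptions. It computes Y = VW as in Equation (4) and the gradient V^T ∇Y, which is the standard gradient of Y = VW with respect to W, touching only rows of W that correspond to nonzero input dimensions.

```python
from typing import Dict, List

Matrix = List[List[float]]


def csr_forward(indptr: List[int], indices: List[int], data: List[float],
                W: Matrix) -> Matrix:
    """Y = V W, where V is a batch of sparse rows stored in CSR form."""
    d_out = len(W[0])
    n = len(indptr) - 1
    Y = [[0.0] * d_out for _ in range(n)]
    for i in range(n):
        # Only the nonzero entries of row i contribute to Y[i].
        for p in range(indptr[i], indptr[i + 1]):
            j, v = indices[p], data[p]
            for c in range(d_out):
                Y[i][c] += v * W[j][c]
    return Y


def csr_weight_grad(indptr: List[int], indices: List[int], data: List[float],
                    grad_Y: Matrix) -> Dict[int, List[float]]:
    """Rows of V^T grad_Y, keyed by input dimension; absent rows are zero."""
    d_out = len(grad_Y[0])
    grad_W: Dict[int, List[float]] = {}
    for i in range(len(indptr) - 1):
        for p in range(indptr[i], indptr[i + 1]):
            j, v = indices[p], data[p]
            row = grad_W.setdefault(j, [0.0] * d_out)
            for c in range(d_out):
                row[c] += v * grad_Y[i][c]
    return grad_W
```

Both loops visit each stored nonzero once, so the work grows with the number of nonzero entries in the batch rather than with the full input dimensionality, and the weight update can skip every row of W that does not appear in the returned dictionary.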
experiment more convincing, we also conduct an experiment on a sub test set that only contains new image data that has never been used in training. The basic feature v is one-hot encoded and has a dimensionality of 153,231. The basic features consist of the following information:

1. Ad zone. The display zone of an ad on the web page. We have around 700 different ad zones in web pages.

2. Ad group. An ad group is a small set of ads. The ads in an ad group share almost the same products but have different ad images (in Figure 2, (a) and (b) belong to one ad group while (c) and (d) belong to another). We have over 150,000 different ad groups in our dataset. Each ad group consists of fewer than 20 different ads.

[Plot: test Logloss vs. iterations, with BN and without BN]
Table 1: Relative AUC and Logloss. All numbers are the best results achieved in three repeated experiments. We omit the dnn prefix for methods that use a deep neural network with pre-extracted features.

method       lr basic  FM basic  basic  sift   conv13  conv17  fc18   fc19   fc20   DeepCTR  3 DeepCTRs
AUC (%)      -         1.47      1.41   1.69   3.92    4.13    4.10   3.48   3.04   5.07     5.92
Logloss (%)  -         -0.40     -0.39  -0.45  -0.79   -0.86   -0.86  -0.74  -0.69  -1.11    -1.30
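The "3 DeepCTRs" column refers to averaging the predictions of three separately trained models. A minimal sketch of that averaging, together with a Logloss computation, is given below. It assumes that Logloss is the usual binary cross-entropy averaged over samples; the function names are hypothetical, and the sketch does not reproduce how the relative numbers in the table were derived.

```python
import math
from typing import List, Sequence


def logloss(labels: Sequence[int], probs: Sequence[float],
            eps: float = 1e-15) -> float:
    """Mean binary cross-entropy between click labels and predicted CTRs."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def average_predictions(per_model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Ensemble by averaging, sample by sample, the CTRs of several models."""
    return [sum(ps) / len(ps) for ps in zip(*per_model_probs)]
```

For example, `logloss(labels, average_predictions([p1, p2, p3]))` would score an ensemble of three prediction lists `p1`, `p2`, `p3` on the same test labels.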
Figure 7: (a) and (b) are the histograms of the outputs of Basicnet and Convnet without batch normalization, while (c) and (d) are the histograms with batch normalization.

Table 2: Relative AUC and Logloss on the sub test set that only contains images never shown in the training set.

method     AUC (%)  Logloss (%)
lr basic   -        -
dnn basic  1.07     -0.21
dnn sift   2.14     -0.48
DeepCTR    5.54     -0.85

converge on our dataset. We think the reason is that our dataset consists of too many categories of products, while in their datasets only 5 different categories are available.

Comparisons with the other baselines are also shown in Table 1. From the results, it can be seen that a deep neural network and image features can both improve the CTR prediction accuracy. FM basic and dnn basic achieve almost the same improvements over lr basic, which indicates that both models have strong power in extracting effective nonlinear basic features. For the image part, compared with handcrafted features such as SIFT, deep features have stronger power in describing the image, which leads to a significant improvement in prediction accuracy. Our DeepCTR model goes one step further by using an end-to-end learning scheme. An ensemble of multiple deep networks usually brings better performance, so we train 3 DeepCTR models and average their predictions, which gives the best AUC and Logloss. Compared with lr basic, the AUC increase will bring a 1 to 2 percent CTR increase in the advertising system (according to online experiments), which will lead to over 1 million in earnings growth per day for a 100 million ads business.

To make our results more convincing, we also conduct an experiment on the sub test set that only contains 3,090 images that are never shown in the training set. The relative AUC and Logloss of three representative methods, dnn basic, dnn sift and a single DeepCTR, are given in Table 2. Clearly, our DeepCTR wins consistently by a large margin. We also notice that while the AUC of dnn basic decreases, dnn sift and our DeepCTR have an even higher relative AUC than on the full test set. The ad images in the 3K sub test set are all new ads added to the ad system, and their ad groups have not appeared in the training set. This makes prediction worse on the 3K sub test set, because the ad group feature does not exist in the training set. However, though we lack some basic features, visual features bring much more supplementary information. This is why dnn sift and DeepCTR show larger improvements over the baseline methods (which only have basic features) on the 3K sub test set than on the full test set. This experiment shows that visual features can be used to identify ads with similar characteristics and thus to predict the CTR of new image ads more accurately. It also verifies that our model indeed has strong generalization ability rather than rigidly memorizing the image id.

For different real-world problems, different techniques may be needed due to the intrinsic characteristics of the problem, the network design and the data distribution. Therefore, solutions based on deep learning typically compare and analyze the impact and effectiveness of those techniques to find the best practices for a particular problem. We therefore further explore the influence of different deep learning techniques in our DeepCTR model empirically.

First, we find that batch normalization in the Combnet can speed up training and greatly improve performance (Figure 6). To investigate the reason, we show the histograms (Figure 7) of the outputs of Convnet and Basicnet. We can see from the histograms that the two outputs differ significantly in scale and variance. Simply concatenating these two different kinds of data streams makes the following fully-connected layer hard to converge.

Dropout [28] is an efficient way to prevent over-fitting in deep neural networks. Most deep convolution networks remove dropout because batch normalization can regularize the models [7, 11]. However, we find that our DeepCTR model still suffers from over-fitting without dropout. We compare the loss curves of the model with and without dropout in the last two fully-connected layers. We can see that the model with dropout achieves a lower test Logloss, though it needs more time to reach the lowest test loss.

We also evaluate the performance of the sparse fully-connected layer and our data sampling method. We report the computing time and memory overhead (Table 3) of the sparse fully-connected layer compared with a dense layer. Loss curves of training and testing are exactly the same since sparse fully-
Table 4: AUC and Logloss of the dnn conv17 model with our data sampling and with a thorough shuffle.

method            AUC (%)  Logloss (%)
data sampling     4.13     -0.86
thorough shuffle  4.12     -0.86

[Plot: Logloss vs. Iters (×10^4), with dropout and without dropout]
Figure 8: Logloss of the DeepCTR net with/without dropout in Combnet. Dashed lines denote training loss, and bold lines denote test loss.

Figure 9: Saliency map of an image ad. We can see that cats, texture, and characters all have an effect on the CTR.
Figure 10: Saliency maps of the image ads. Brighter areas play a more important role in affecting the CTR of ads.
learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. arXiv preprint arXiv:1502.01852, 2015.
[9] X. He, J. Pan, O. Jin, Xu, et al. Practical lessons from predicting clicks on ads at facebook. In Proceedings of 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1–9. ACM, 2014.
[10] G. Hinton, L. Deng, D. Yu, G. E. Dahl, Mohamed, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97, 2012.
[11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[12] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[14] D. G. Lowe. Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on, volume 2, pages 1150–1157. IEEE, 1999.
[15] M. Mahdian and K. Tomak. Pay-per-action model for online advertising. In Proceedings of the 1st international workshop on Data mining and audience intelligence for advertising, pages 1–6. ACM, 2007.
[16] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, et al. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1222–1230. ACM, 2013.
[17] T. Mei, R. Zhang, and X.-S. Hua. Internet multimedia advertising: techniques and technologies. In Proceedings of the 19th ACM international conference on Multimedia, pages 627–628. ACM, 2011.
[18] K. Mo, B. Liu, L. Xiao, Y. Li, and J. Jiang. Image feature learning for cold start problem in display advertising. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 3728–3734. AAAI Press, 2015.
[19] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[20] R. J. Oentaryo, E.-P. Lim, J.-W. Low, D. Lo, and M. Finegold. Predicting response in mobile advertising with hierarchical importance-aware factorization machine. In Proceedings of the 7th ACM international conference on Web search and data mining, pages 123–132. ACM, 2014.
[21] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[22] S. Rendle. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 995–1000. IEEE, 2010.
[23] S. Rendle. Factorization machines with libfm. ACM Transactions on Intelligent Systems and Technology (TIST), 3(3):57, 2012.
[24] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th international conference on World Wide Web, pages 521–530. ACM, 2007.
[25] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[26] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[29] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1113–1120. ACM, 2009.
[30] L. Yan, W.-j. Li, G.-R. Xue, and D. Han. Coupled group lasso for web-scale ctr prediction in display advertising. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 802–810, 2014.
[31] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pages 818–833. Springer, 2014.
[32] W. Zhang, T. Du, and J. Wang. Deep learning over multi-field categorical data: A case study on user response prediction. arXiv preprint arXiv:1601.02376, 2016.