Deep CTR Prediction in Display Advertising

∗ † ‡
Junxuan Chen¹,², Baigui Sun², Hao Li², Hongtao Lu¹, Xian-Sheng Hua²
¹ Department of Computer Science and Engineering, Shanghai Jiao Tong University
² Alibaba Group, Hangzhou, China
{chenjunxuan, htlu}@sjtu.edu.cn {baigui.sbg, lihao.lh, xiansheng.hxs}@alibaba-inc.com

ABSTRACT
Click-through rate (CTR) prediction of image ads is the core task of online display advertising systems, and logistic regression (LR) has frequently been applied as the prediction model. However, the LR model cannot extract complex and intrinsic nonlinear features from handcrafted high-dimensional image features, which limits its effectiveness. To solve this issue, in this paper we introduce a novel deep neural network (DNN) based model that directly predicts the CTR of an image ad from raw image pixels and other basic features in one step. The DNN model employs convolution layers to automatically extract representative visual features from images, and nonlinear CTR features are then learned from the visual features and other contextual features by fully-connected layers. Empirical evaluations on a real-world dataset with over 50 million records demonstrate the effectiveness and efficiency of this method.

Keywords
DNN, CNN, Click through rate, Image Ads, Display Advertising

∗The work is done while the author was an intern at Alibaba Group.
†Corresponding author.
‡Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
MM '16, October 15-19, 2016, Amsterdam, Netherlands
© 2016 ACM. ISBN 978-1-4503-3603-1/16/10...$15.00
DOI: http://dx.doi.org/10.1145/2964284.2964325

1. INTRODUCTION
Online display advertising generates a significant amount of revenue by showing textual or image ads on various web pages [3]. Ad publishers like Google and Yahoo sell ad zones on different web pages to advertisers who want to show their ads to users. Publishers then get paid by advertisers every time the ad display leads to some desired action, such as clicking or purchasing, according to payment options such as cost-per-click (CPC) or cost-per-conversion (CPA) [15]. The expected revenue for publishers is the product of the bid price and the click-through rate (CTR) or conversion rate (CVR).

Recently, more and more advertisers prefer displaying image ads [17] (Figure 1) because they are more attractive and comprehensible than textual ads. To maximize the revenue of publishers, this has led to a huge demand for approaches that can choose the most suitable image ad to show to a particular user when he or she is visiting a web page, so as to maximize the CTR or CVR.

Figure 1: Display ads on an e-commerce web page.

Therefore, in most online advertising systems, predicting the CTR or CVR is the core task of ads allocation. In this paper, we focus on CPC and predict the CTR of display ads. Typically an ads system predicts and ranks the CTR of available ads based on contextual information, and then shows the top K ads to the users. In general, prediction models are learned from past click data using machine learning techniques [3, 24, 9, 5, 32, 16].

Features that are used to represent an ad are extremely important in a machine learning model. In recent years, to make the CTR prediction model more accurate, many researchers have used millions of features to describe a user's response record (we call it an ad impression). Typically, an image ad impression has basic features and visual features. The basic features are information about users, products, ad positions in a web page, etc. Visual features describe the visual appearance of an image ad at different levels. For example, color and texture are low-level features, while faces and other contextual objects are high-level features. Low-level and high-level features may both influence the CTR of an image ad (Figure 2). Traditionally, researchers have lacked effective methods to extract high-level visual features. The importance of visual features is also usually underestimated. However, as we can see from Figure 2, ads with the same basic features may have largely different CTRs due to different ad images. As a consequence, how to use the visual features in machine learning models effectively becomes an urgent task.
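
The allocation step described in this introduction (expected revenue as bid price times CTR, then showing the top K ads) can be illustrated with a minimal Python sketch. The `Candidate` type, its field names and the `top_k` helper are hypothetical names chosen for illustration, and ranking by expected revenue rather than by CTR alone is an assumption of this sketch, not a description of any particular ad system.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """One ad competing for a slot (hypothetical structure for illustration)."""
    ad_id: str
    bid_price: float      # advertiser's cost-per-click bid
    predicted_ctr: float  # CTR predicted by some model, in [0, 1]


def expected_revenue(ad: Candidate) -> float:
    """Expected revenue per impression under CPC: bid price times CTR."""
    return ad.bid_price * ad.predicted_ctr


def top_k(ads, k):
    """Return the k candidates with the highest expected revenue."""
    return sorted(ads, key=expected_revenue, reverse=True)[:k]


if __name__ == "__main__":
    candidates = [
        Candidate("a", bid_price=1.0, predicted_ctr=0.02),
        Candidate("b", bid_price=0.5, predicted_ctr=0.05),
        Candidate("c", bid_price=2.0, predicted_ctr=0.005),
    ]
    for ad in top_k(candidates, 2):
        print(ad.ad_id, expected_revenue(ad))
```

In this toy setting a cheaper bid with a higher predicted CTR can outrank a more expensive bid, which is why the accuracy of the CTR estimate matters for the publisher.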
Among the different machine learning models that have been applied to predict ad CTR using the above features, logistic regression (LR) is the most well-known and widely used one due to its simplicity and effectiveness. Also, LR is easy to parallelize on a distributed computing system, so it is not challenging to make it work on billions of samples [3]. Being able to handle big data efficiently is necessary for a typical advertising system, especially when the prediction model needs to be updated frequently to deal with new ads. However, LR is a linear model, which is inferior in extracting complex and effective nonlinear features from handcrafted feature pools. Though one can mitigate this issue by computing the second-order conjunctions of the features, it still cannot extract higher-order nonlinear representative features and may cause feature explosion if we continue increasing the conjunction order.

To address these problems, other methods such as factorization machines [22], decision trees [9] and neural networks [32] are widely used. Though these methods can extract nonlinear features, they only deal with basic features and handcrafted visual features, which are inferior in describing images. In this paper, we propose a deep neural network (DNN) to directly predict the CTR of an image ad from raw pixels and other basic features. Our DNN model contains convolution layers to extract representative visual features and then fully-connected layers that can learn the complex and effective nonlinear features among basic and visual features. The main contributions of this work can be summarized as follows:

1. This paper proposes a DNN model which not only directly takes both high-dimensional sparse features and images as input, but can also be trained end to end. To the best of our knowledge, this is the first DNN based CTR model which can do so.

2. Efficient methods are introduced to tackle the challenge of high dimensionality and huge data volume in the model training stage. The proposed methods reduce the training time significantly and make it feasible to train on a normal PC with GPUs even with large-scale real-world training data.

3. We conduct extensive experiments on a real-world dataset with more than 50 million user response records to illustrate the improvement provided by our DNN model. The impacts of several deep learning techniques are also discussed. We further visualize the saliency map of image ads to show that our model can learn effective visual features.

The paper is organized as follows. Section 2 introduces the related work, followed by an overview of our scheme in Section 3. In Section 4, we describe the proposed DNN model in detail, and we show the challenges in the training stage as well as our solutions in Section 5. Section 6 presents the experimental results and discussion, and Section 7 is the conclusion.

Figure 2: Two groups of image ads, one in each row. The two ads in each group have exactly the same ad group id, ad zone and target people. The CTRs of image ads (a) and (b) are 1.27% and 0.83%; (b) clearly suffers from low contrast between product and background. The CTRs of (c) and (d) are 2.40% and 2.23%. We find that too many subjects in a men's clothing ad may have a negative effect. (The number of impressions of each ad is sufficiently high to make the CTR meaningful.)

2. RELATED WORK
The two areas most related to our work are display advertising CTR prediction and deep neural networks.

2.1 Display Advertising CTR prediction
Since display advertising has taken a large share of the online advertising market, many works addressing the CTR prediction problem have been published. In [24, 2], the authors handcraft many features from raw data and use logistic regression (LR) to predict the click-through rate. [3] also uses LR to deal with the CTR problem and scales it to billions of samples and millions of parameters on a distributed learning system. In [20], a Hierarchical Importance-aware Factorization Machine (FM) [23] is introduced, which provides a generic latent factor framework that incorporates importance weights and hierarchical learning. In [5], boosted decision trees are used to build a prediction model. In [9], a model which combines decision trees with logistic regression is proposed, and it outperforms either of the above two models. [32] combines deep neural networks with FM and also brings an improvement. All of these methods are very effective when dealing with ads without images. However, when it comes to image ads, they can only use pre-extracted image features, which is less flexible in accounting for the unique properties of different datasets.

Therefore, the image features in display advertisement have received more and more attention. In [1, 4], the impact of visual appearance on users' response in online display advertising is considered for the first time. They extract over 30 handcrafted features from ad images and build a CTR prediction model using image and basic features. The experimental result shows that their method achieves better performance than models without visual features. [18] is the work most closely related to ours, in which a decapitated convolutional neural network (CNN) is used to extract image features from ads. However, there are two important differences between their method and ours. First, they do not consider basic features when extracting image features using the CNN. Second, when predicting the CTR they use logistic regression, which lacks the ability to explore the complex relations between image and basic features. Most of the information in their image features is redundant, such as the product category, which is already included in the basic features. As a result, their model only achieves limited improvements when combining both kinds of features. Worse still, when the dataset contains too many categories of products, it can hardly converge during training. Our model uses an end-to-end model to predict the CTR of image ads using basic features and raw images in one step, in which image features can be seen as supplementary to the basic features.

2.2 Deep Neural Network
In recent years, deep neural networks have achieved big breakthroughs in many fields. In computer vision, the convolutional neural network (CNN) [13] is one of the most efficient tools to extract effective image features from raw image pixels. In speech recognition, the deep belief network (DBN) [10] is used and obtains much better performance than Gaussian mixture models. Compared with traditional models that have a shallow structure, deep learning can model the underlying patterns in massive and complex data. With such learning ability, deep learning can be used as a good feature extractor and applied to many other applications [21, 26].

In the CTR prediction field, besides [32], which is mentioned in Section 2.1, DNNs have also been used in some public CTR prediction competitions¹,² recently. In these two competitions, only basic features are available to participants. An ensemble of four-layer DNNs which use fully-connected layers and different kinds of nonlinear activations achieves better or comparable performance than LR with feature conjunction, factorization machines, decision trees, etc. Compared with this method, our model can extract more powerful features by taking into account the visual features in image ads.

¹ https://www.kaggle.com/c/avazu-ctr-prediction
² https://www.kaggle.com/c/criteo-display-ad-challenge

3. METHOD OVERVIEW
As mentioned above, in this paper each record of a user's behavior on an ad is called an impression, denoted by $x$. Each impression has an image $u$ with a resolution of around 120 × 200. Besides the image, the basic feature vector is denoted by $v \in \mathbb{R}^d$; it covers information such as the user's gender, the product's category and the ad position in the web page, and $d$ can be very large, say, from a few thousand to many million. Our goal is to predict the probability that a user clicks on an image ad given these features. We will still use logistic regression to map our predicted CTR value $\hat{y}$ into the range from 0 to 1, thus the CTR prediction problem can be written as:

$$\hat{y} = \frac{1}{1 + e^{-z}} \tag{1}$$

$$z = f(x) \tag{2}$$

where $f(\cdot)$ is what we are going to learn from training data, that is, the embedding function that maps an impression to a real value $z$. Suppose we have $N$ impressions $X = [x_1, x_2, \ldots, x_N]$, each with a label $y_i \in \{0, 1\}$ that depends on the user's feedback, where 0 means not clicked and 1 means clicked. Then the learning problem is defined as minimizing a Logarithmic Loss (Logloss):

$$L(W) = -\frac{1}{N} \sum_{i} \left( y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right) + \lambda \lVert W \rVert^2 \tag{3}$$

where $W$ denotes the parameters of the embedding function $f(\cdot)$ and $\lambda$ is a regularization parameter that controls the model complexity.

In this model, what we need to learn is the embedding function $f(\cdot)$. Conventional methods extract handcrafted visual features from the raw image $u$ and concatenate them with the basic features $v$, then learn linear or nonlinear transformations to obtain the embedding function. In this paper we learn this function directly from the raw pixels of an image ad and the basic features using one integrated deep neural network.

4. NETWORK ARCHITECTURE
Considering that basic features and raw images come from two different domains, we cannot simply concatenate them together directly in the network. Training two separate networks is also inferior since it cannot take into account the correlations between the two kinds of features. As a result, our network adopts two different sub-networks to deal with basic features and raw images, respectively, and then uses multiple fully-connected layers to capture their correlations.

Figure 3: The overall architecture of the network. The output of each fully-connected layer is then passed through a ReLU nonlinear activation function.

As illustrated in Figure 3, a deep neural network called DeepCTR is designed which contains three parts. One part, Convnet, takes the raw image $u$ as input and is followed by a convolutional network. The output of the Convnet is a feature vector of the raw image. The second part, which is called Basicnet, takes the basic features $v$ as input and applies a fully-connected layer to reduce the dimensionality. Subsequently, the outputs of Convnet and Basicnet are concatenated into one
vector and fed to two fully-connected layers. The output of the last fully-connected layer is a real value $z$. This part is called Combnet. On top of the whole network, Logloss is computed as described in Section 3.

The design of Convnet is inspired by the networks in [7, 27], as shown in Figure 4. The network consists of 17 convolution layers. The first convolution layer uses 5 × 5 convolution kernels. Following the first layer, there are four groups, each with four layers of 3 × 3 kernels. We do not build a very deep network, such as one with more than 50 layers, in consideration of the trade-off between performance and training time. We pre-train the Convnet on the images in the training dataset with category labels. In the pre-training period, we use two fully-connected layers with 1024 hidden nodes (we call them fc18 and fc19), a fully-connected layer with 96-way outputs (we call it fc20) and a softmax after the Convnet. Since our unique image set is smaller than ImageNet [6] (to be detailed in Section 6), we use half the number of outputs in each group compared with [7]. After pre-training, a 128-way fully-connected layer is connected behind the last convolution layer. Then we train the whole DeepCTR using Logloss from end to end.

Figure 4: The architecture of the 17-layer Convnet in our model. (Layers shown in the diagram: Conv 5×5, 32, followed by four groups of 4 conv layers with 3×3 kernels and 32, 64, 128 and 256 outputs; feature map sizes 112×112, 56×56, 56×56, 28×28, 14×14, 7×7.)

5. SPEED UP TRAINING
An online advertising system receives a large number of new user response records every day. It is necessary for ad systems to update as frequently as possible to adapt to new trends. An LR model on a distributed system requires several hours to train with billions of samples, which makes it popular in industry.

Typically a deep neural network has millions of parameters, which makes it impossible to train quickly. With the development of GPUs, one can train a deep CNN with 1 million training images in two days on a single machine. However, this is not feasible in terms of time for our network since we have more than 50 million samples. Moreover, the dimensionality of the basic features is nearly 200,000, which leads to many more parameters in our network than in a normal deep neural network. Directly training our network on a single machine may take hundreds of days to converge according to a rough estimation. Even using multiple machines can hardly resolve the training problem. We must largely speed up the training if we want to deploy our DeepCTR on a real online system.

To make it feasible to train our model in less than one day, we adopt two techniques: a sparse fully-connected layer and a new data sampling scheme. The use of these two techniques makes the training time of our DeepCTR suitable for a real online system.

5.1 Sparse Fully-Connected Layer
In CTR prediction, the basic feature of an ad impression includes user information like gender, age and purchasing power, and ad information like ad ID, ad category, ad zone, etc. This information is usually encoded by one-hot encoding or feature hashing [29], which makes the feature dimension very large. For example, it is nearly 200,000 in our dataset. Consequently, in Basicnet, the first fully-connected layer using the basic feature as input has around 60 million parameters, which is similar to the number of all the parameters in AlexNet [13]. However, the basic feature is extremely sparse due to the one-hot encoding. Using a sparse matrix in the first fully-connected layer can largely reduce the computing complexity and GPU memory usage.

In our model, we use the compressed sparse row (CSR) format to represent a batch of basic features $V$. When computing the network forward pass

$$Y_{fc1} = V W, \tag{4}$$

sparse matrix operations can be used in the first fully-connected layer. In the backward pass, we only need to update the weights that link to a small number of nonzero dimensions according to the gradient

$$\nabla(W) = V^{\top} \nabla(Y_{fc1}). \tag{5}$$

Both the forward pass and the backward pass only need a time complexity of $O(nd')$, where $d'$ is the number of nonzero elements in the basic features and $d' \ll d$. An experimental result that compares the time and GPU memory usage with and without the sparse fully-connected layer can be found in Section 6.2.

5.2 Data Sampling
Another crucial issue in training is that Convnet limits the batch size of Stochastic Gradient Descent (SGD). To train a robust CTR prediction model, we usually need millions of ad samples. However, the Convnet requires a lot of GPU memory, which makes our batch size very small, say, a few hundred. With a small batch size, GPU parallelism cannot be fully exploited in the multiplication of large matrices. The number of iterations in each epoch will also be very large, so we need much more time to run through an epoch. Though the sparse fully-connected
layer can largely reduce the forward-backward time in Basicnet, training the whole net on such a large dataset still requires infeasible time. Also, the gradient of each iteration is unstable with a small batch size, which makes convergence harder. In CTR prediction, this problem is even more serious because the training data is full of noise.

Figure 5: Originally, the image and the basic feature vector are in one-to-one correspondence. In our data sampling method, we group the basic features of an image together, so that we can deal with many more basic features per batch.

Algorithm 1 Training a DeepCTR network
Input: Network Net with parameters W, unique image set $\mathcal{U}$, basic feature set $\mathcal{V}$, labels $\mathcal{Y}$, batch size n, basic feature sample number k.
Output: Network for CTR prediction, Net
1: Initialize Net.
2: Compute the sample probability p(u) of each image u:
   $$p(u) = \frac{\#\mathcal{V}_u}{\sum_{u' \in \mathcal{U}} \#\mathcal{V}_{u'}} \tag{7}$$
3: repeat
4:   Sample n images U according to p(u).
5:   For each u in U, sample k basic features $V_u$ from $\mathcal{V}_u$ with labels $Y_u$ uniformly with replacement.
6:   forward_backward(Net, U, V, Y).
7: until Net converges

Algorithm 2 forward_backward
Input: Network Net with parameters W, which contains a Convnet, Basicnet and Combnet; image samples U; basic features V; labels Y; basic feature sample number k.
1: Compute the feature vector $conv_u$ of each image u: conv = net_forward(Convnet, U)
2: Copy each feature vector k times so we have C.
3: loss = net_forward(Basicnet and Combnet, V, C).
4: ∇(C) = net_backward(Combnet and Basicnet, loss).
5: Compute ∇($conv_u$) of each image u according to Eq. 6.
6: net_backward(Convnet, ∇(conv)).
7: Update network Net.
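
The two speed-ups of Section 5 can be sketched in plain Python under stated assumptions. The sketch below shows grouped sampling in the spirit of Algorithm 1 (steps 2, 4 and 5) and a CSR forward and backward pass for the first Basicnet layer (Eq. 4, with the weight gradient obtained by the chain rule as the transpose of V times the gradient of the layer output). All function names and the dictionary-based record format are hypothetical choices for illustration; the original implementation is a modified Caffe layer and is not reproduced here. As a simplification, images are drawn with replacement, so one image may appear twice in a batch.

```python
import random
from collections import defaultdict


def sample_probabilities(features_by_image):
    """Eq. 7: p(u) is the share of basic-feature records that belong to image u."""
    total = sum(len(rows) for rows in features_by_image.values())
    return {u: len(rows) / total for u, rows in features_by_image.items()}


def sample_batch(features_by_image, n, k, rng):
    """Draw n images by p(u), then k records per image uniformly with replacement."""
    probs = sample_probabilities(features_by_image)
    images = list(probs)
    chosen = rng.choices(images, weights=[probs[u] for u in images], k=n)
    return [(u, rng.choices(features_by_image[u], k=k)) for u in chosen]


def to_csr(sparse_rows):
    """Pack rows given as {column: value} dicts into CSR arrays."""
    indptr, indices, data = [0], [], []
    for row in sparse_rows:
        for col, val in sorted(row.items()):
            indices.append(col)
            data.append(val)
        indptr.append(len(indices))
    return indptr, indices, data


def sparse_fc_forward(csr, weights):
    """Y = V W, visiting only the nonzero entries of V."""
    indptr, indices, data = csr
    width = len(weights[0])
    out = []
    for r in range(len(indptr) - 1):
        y = [0.0] * width
        for p in range(indptr[r], indptr[r + 1]):
            w_row = weights[indices[p]]
            for j in range(width):
                y[j] += data[p] * w_row[j]
        out.append(y)
    return out


def sparse_fc_backward(csr, grad_out):
    """grad W = V^T grad Y, returned only for the rows of W that were used."""
    indptr, indices, data = csr
    width = len(grad_out[0])
    grad_w = defaultdict(lambda: [0.0] * width)
    for r in range(len(indptr) - 1):
        for p in range(indptr[r], indptr[r + 1]):
            g = grad_w[indices[p]]
            for j in range(width):
                g[j] += data[p] * grad_out[r][j]
    return dict(grad_w)


if __name__ == "__main__":
    records = {
        "img_a": [{0: 1.0, 3: 1.0}, {1: 1.0, 3: 1.0}, {0: 1.0, 2: 1.0}],
        "img_b": [{1: 1.0, 2: 1.0}],
    }
    batch = sample_batch(records, n=2, k=3, rng=random.Random(0))
    rows = [row for _, group in batch for row in group]
    weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
    print(sparse_fc_forward(to_csr(rows), weights))
```

Both passes loop over the stored nonzero entries only, so their cost grows with the number of nonzeros rather than with the full feature dimension, and the backward pass returns gradients for the touched rows of W alone.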
In this paper, we propose a simple but effective training method based on an intrinsic property of image ad click records: many impressions share the same image ad. Though the total size of the dataset is very large, the number of unique images is relatively small. Since a large number of basic features can be processed quickly by the sparse fully-connected layer, we can set a larger batch size for Basicnet and a smaller one for Convnet. In this paper we employ a data sampling method that groups the basic features of the same image ad together to achieve that (Figure 5), which is detailed as follows.

Suppose the unique image set in our dataset is $\mathcal{U}$, the set of impressions related to an image $u$ is $\mathcal{X}_u$ and the basic features are $\mathcal{V}_u$. At each iteration, suppose the training batch size is $n$; we sample $n$ different images $U$ from $\mathcal{U}$. Together with each image $u \in U$, we sample $k$ basic features $V_u$ from $\mathcal{V}_u$ with replacement. Thus we have $n$ images and $kn$ basic features in each batch. After Convnet, we have $n$ image features. For each feature vector $conv_u$ we copy it $k$ times to have $C_u$ and send them forward to Combnet along with $V_u$. In the backward pass, the gradient of each image feature vector can be computed as:

$$\nabla(conv_u) = \frac{1}{k} \sum_{c \in C_u} \nabla(c) \tag{6}$$

The training method is summarized in Alg. 1 and Alg. 2. In fact, this strategy makes us able to deal with $kn$ samples in a batch. Since the sparse fully-connected layer requires very little GPU memory, we can set $k$ to a very large value according to the overall average number of basic feature vectors of image ads. This strategy reduces the number of iterations in an epoch to several thousand and largely speeds up training. A larger batch size also makes the gradient of each batch much more stable, which makes the model easier to converge. We also conduct an experiment in Section 6.2 to evaluate whether this sampling method influences the performance of DeepCTR compared with a thorough shuffle strategy.

6. EXPERIMENT
In this section, a series of experiments are conducted to verify the superiority of our DeepCTR model.

6.1 Experimental Setting

6.1.1 Dataset
The experiment data comes from a commercial advertising platform (will expose it in the final version) in an arbitrary week of year 2015. We use the data from the first six days as our training data and the data from the last day (which is a Friday) as testing data. As described in Section 3, each impression consists of an ad $x$ and a label $y$. An impression has an image $u$ (Figure 2) and a basic feature vector $v$. The size of the training data is 50 million while the testing set is 9 million. The ratio of positive samples to negative samples is around 1:30. We do not perform any sub-sampling of negative events on the dataset. We have 101,232 unique images in the training data and 17,728 unique images in the testing data. 3,090 images in the testing set are never shown in the training set. Though the image data of the training set and the test data are highly overlapped, they follow the distribution of the real-world data. To make our
experiment more convincing, we also conduct an experiment on a sub test set that only contains new image data that has never been used in training. The basic feature $v$ is one-hot encoded and has a dimensionality of 153,231. The basic features consist of the following information:

1. Ad zone. The display zone of an ad on the web page. We have around 700 different ad zones in web pages.

2. Ad group. The ad group is a small set of ads. The ads in an ad group share almost the same products but different ad images (in Figure 2, (a) and (b) belong to one ad group while (c) and (d) belong to another group). We have over 150,000 different ad groups in our dataset. Each ad group consists of fewer than 20 different ads.

3. Ad target. The groups of target people of the ad. We have 10 target groups in total.

4. Ad category. The category of the product in ads. We have 96 different categories, like clothing, food, household appliances.

5. User. The user information includes the user's gender, age, purchasing power, etc.

Besides the above basic features, we do not use any handcrafted conjunction features. We hope that our model can learn effective nonlinear features automatically from feature pools.

6.1.2 Baselines
We use LR with only basic features as our first baseline. We call this method lr_basic in the following experiments. To verify that our DNN model has the ability to extract effective high-order features, a Factorization Machine implemented by LibFM [23] with only basic features is our second baseline. We call it FM_basic. We use 8 factors for 2-way interactions and MCMC for parameter learning in FM_basic. Then we evaluate a DNN model with two hidden layers using only basic features. The numbers of outputs of the two hidden layers are 128 and 256 respectively. The model can be seen as our DeepCTR net without the Convnet part. This method is called dnn_basic. We further replace the Convnet in our DeepCTR net with pre-extracted features: SIFT [14] with bag of words, and the outputs of different layers of the pre-trained Convnet. We call these two methods dnn_sift and dnn_layername (for example dnn_conv17).

6.1.3 Evaluation Metric
We use two popular metrics to evaluate the experimental result: Logloss and the area under the receiver operator curve (AUC). Logloss quantifies the accuracy of the predicted click probability. AUC measures the ranking quality of the prediction. Our dataset comes from a real commercial platform, so both of these metrics are reported as relative numbers compared with lr_basic.

Since the AUC value is always larger than 0.5, we remove this constant part (0.5) from the AUC value and then compute the relative numbers as in [30]:

$$\text{relative AUC} = \left( \frac{AUC(\text{method}) - 0.5}{AUC(\text{lr\_basic}) - 0.5} - 1 \right) \times 100\% \tag{8}$$

6.1.4 Network Configuration
In our Convnet, a 112 × 112 random crop and horizontal mirror of the input image are used for data augmentation. Each group has four convolution layers, each followed by a batch normalization [11] and a ReLU [19] activation. The stride of the first convolution layer is 2 if the output size of a group halves. We initialize the layer weights as in [8]. When pre-training the Convnet on our image dataset with category labels, we use SGD with a mini-batch size of 128. The learning rate starts from 0.01 and is divided by 10 when the test loss plateaus. The pre-trained model converges after around 120 epochs. The weight decay of the net is set to 0.0001 and the momentum is 0.9.

After pre-training Convnet, we train our DeepCTR model from end to end. Other parts of our net use the same initialization method as Convnet. We choose the mini-batch size n as 20, and k = 500. That is to say, we deal with 10,000 impressions per batch. We start with a learning rate of 0.1 and divide it by 10 after 6 × 10^4, 1 × 10^5 and 1.4 × 10^5 iterations. The Convnet uses a smaller initial learning rate of 0.001 to avoid destroying the pre-trained model. The weight decay of the whole net is set to 5 × 10^-5. The dnn_basic, dnn_sift and dnn_layername models use the same learning strategy.

We implement our deep network on the C++ Caffe toolbox [12] with some modifications like the sparse fully-connected layer.

Figure 6: Test Logloss of the DeepCTR net with/without batch normalization in Combnet. (Plot of test Logloss, from 0.142 to 0.146, against iterations, from 0 to 14 × 10^4, with two curves: with BN and without BN.)

6.2 Results and Discussion
In this section we compare the results of various methods and the effects of some network structures. First we compare the results of models with deep features at different levels. We list the two metrics of dnn_conv13, dnn_conv17, dnn_fc18, dnn_fc19 and dnn_fc20 in the middle of Table 1. From the results, we find that dnn_conv17 and dnn_fc18 achieve the best performance. Image features in these layers are of relatively high level but not highly group invariant [31]. Compared with the following fully-connected layers, they are more discriminative within the same category. Compared with previous layers, they contain sufficiently high-level features, which are superior in describing the objects in images. Consequently, we connect the conv17 layer in our DeepCTR model. We do not choose fc18 because it needs more computation. We have also tried to compare our DeepCTR with the approach in [18]. However the model in [18] does not
Table 1: relative AUC and Logloss. All the numbers are best resuts achieved in three repeated experiments.
We omit dnn of methods using deep neural network with pre-extracted features.
method lr basic FM basic basic sift conv13 conv17 fc18 fc19 fc20 DeepCTR 3 DeepCTRs
AUC(%) - 1.47 1.41 1.69 3.92 4.13 4.10 3.48 3.04 5.07 5.92
Logloss(%) - -0.40 -0.39 -0.45 -0.79 -0.86 -0.86 -0.74 -0.69 -1.11 -1.30

Figure 7: (a) and (b) are the histograms of the outputs of Basicnet and Convnet without batch normalization, while (c) and (d) are the histograms with batch normalization.

Table 2: Relative AUC and Logloss on the sub test set that only contains images never shown in the training set.

method      AUC (%)  Logloss (%)
lr basic    -        -
dnn basic   1.07     -0.21
dnn sift    2.14     -0.48
DeepCTR     5.54     -0.85

converge on our dataset. We think the reason is that our dataset consists of too many categories of products, while in their datasets only 5 different categories are available.
Comparison with the other baselines is also shown in Table 1. From the results, it can be seen that a deep neural network and image features can both improve the CTR prediction accuracy. FM basic and dnn basic achieve almost the same improvement over lr basic, which indicates that these two models both have strong power in extracting effective non-linear basic features. For the image part, compared with handcrafted features like SIFT, deep features have stronger power in describing the image, which leads to a significant improvement in the prediction accuracy. Our DeepCTR model goes one step further by using an end-to-end learning scheme. An ensemble of multiple deep networks usually brings better performance, so we train 3 DeepCTR models and average their predictions, which gives the best AUC and Logloss. Compared with lr basic, the AUC increase will bring us a 1 to 2 percent CTR increase in the advertising system (according to online experiments), which will lead to over 1 million earnings growth per day for a 100 million ads business.
To make our results more convincing, we also conduct an experiment on the sub test set that only contains 3,090 images that are never shown in the training set. The relative AUC and Logloss of three representative methods, dnn basic, dnn sift and a single DeepCTR, are in Table 2. Clearly, our DeepCTR wins by a large margin consistently. We also notice that while the AUC of dnn basic decreases, dnn sift and our DeepCTR have an even higher relative AUC than the result on the full test set. Ad images in the 3K sub test set are all new ads added into the ad system, and their ad groups have not appeared in the training set. This makes the prediction worse on the 3K sub test set because the ad group feature does not exist in the training set. However, though we lack some basic features, visual features bring much more supplementary information. This is the reason that dnn sift and DeepCTR have larger improvements over the baseline methods (which only have basic features) on the 3K sub test set than on the full test set. This experiment shows that visual features can be used to identify ads with similar characteristics and thus to predict the CTR of new image ads more accurately. It also verifies that our model indeed has strong generalization ability rather than rigidly memorizing the image id.
For different real-world problems, different techniques may be needed due to the intrinsic characteristics of the problem, network design and data distribution. Therefore, solutions based on deep learning typically compare and analyze the impact and effectiveness of those techniques to find the best practices for a particular problem. Accordingly, we further explore the influence of different deep learning techniques in our DeepCTR model empirically.
First we find that the batch normalization in the Combnet can speed up training and largely improve performance (Figure 6). To investigate the reason, we show the histograms (Figure 7) of the outputs of Convnet and Basicnet. We can see from the histograms that the two outputs differ significantly in scale and variance. Simply concatenating these two different kinds of data streams makes the following fully-connected layer hard to converge.
Dropout [28] is an efficient way to prevent the over-fitting problem in deep neural networks. Most deep convolution networks remove the dropout because batch normalization can regularize the models [7, 11]. However, in our DeepCTR model, we find it still suffers from over-fitting without dropout. We compare the loss curves of the model with and without dropout in the last two fully-connected layers. We can see that the model with dropout achieves lower testing Logloss, though we need more time to reach the lowest test loss.
We also evaluate the performance of the sparse fully-connected layer and our data sampling method. We list the computing time and memory overhead (Table 3) of the sparse fully-connected layer compared with the dense layer. Loss curves of training and testing are exactly the same since sparse fully-

connected layer does not change any computing results in the net, so we do not plot them. From this table we can find that the dense layer requires much more computing time and memory than the sparse one. Using the sparse layer allows a larger batch size when training, which speeds up the training and makes the net much easier to converge.
Finally, we investigate whether the performance of our model degrades when using our data sampling method compared with a thorough shuffle. We only evaluate the sampling method on the dnn conv17 model, that is, we conduct experiments on a model where the Convnet is frozen. Ideally, we should use an unfrozen Convnet without data sampling as the contrast experiment. However, as mentioned in Section 5.2, training an unfrozen Convnet limits our batch size to less than 200 because the Convnet needs much more GPU memory, while a model with a frozen Convnet can deal with more than 10000 samples in a batch. It would take too much time to train our model with such a small batch size. Also, the main difference between training with and without sampling is whether the samples were thoroughly shuffled, while freezing the Convnet or not does not influence the order of samples. Therefore, we believe that our DeepCTR model behaves similarly to the dnn conv17 model. From Table 4 we can see that the performance of the model is not influenced by the data sampling method. At the same time, our method costs far less training time than the approach without data sampling. Using this data sampling method, training our DeepCTR model from end to end only takes around 12 hours to converge on an NVIDIA Tesla K20m GPU with 5 GB memory, which is acceptable for an online advertising system requiring daily update.

Table 3: Forward-backward time and GPU memory overhead of the first fully-connected layer with a batch size of 1,000.

              time (ms)  memory (MB)
sparse layer  6.67       397
dense layer   189.56     4667

Table 4: AUC and Logloss of the dnn conv17 model with our data sampling and with a thorough shuffle.

                  AUC (%)  Logloss (%)
data sampling     4.13     -0.86
thorough shuffle  4.12     -0.86

[Figure 8 plot: Logloss vs. Iters (×10^4); curves: with dropout, without dropout]
Figure 8: Logloss of the DeepCTR net with/without dropout in Combnet. Dashed lines denote training loss, and bold lines denote test loss.

Figure 9: Saliency map of an image ad. We can see that cats, texture, and characters all have an effect on the CTR.

6.3 Visualizing the Convnet
Visualizing the CNN can help us better understand exactly what we have learned. In this section, we follow the saliency map visualization method used in [25]. We use a linear score model to approximate our DeepCTR output for clicked or not clicked:

\[ z(U) \approx w^{T} U + b, \tag{9} \]

where image U is in the vectorized (one-dimensional) form, and w and b are the weight and bias of the model. Indeed, Eq. 9 can be seen as the first order Taylor expansion of our DeepCTR model. We use the magnitude of the elements of w to show the importance of the corresponding pixels of U for the click probability, where w is the derivative of z with respect to the image U at the point (image) U_0:

\[ w = \left. \frac{\partial z}{\partial U} \right|_{U_0} \tag{10} \]

In our case, the image U is in RGB format and has three channels at pixel U_{i,j}. To derive a single class saliency value M_{i,j} for each pixel, we take the maximum absolute value of w_{i,j} across the RGB channels c:

\[ M_{i,j} = \max_{c} \left| w_{i,j}(c) \right| \tag{11} \]

Some typical visualization examples are shown as heat maps in Figure 10. In these examples, brighter areas play a more important role in impacting the CTR of the ad. We can see that the main objects in ads are generally more important. However, some low-level features like texture, characters, and even background can have an effect on the CTR of the ad. In another example (Figure 9), it is even clearer that both high-level and low-level visual features are effective. From the visualization of the Convnet, we can find that the task of display ads CTR prediction is quite different from object classification, where high-level features dominate in the top layers. It also explains why end-to-end training can improve the model: apparently, the Convnet can be trained to extract features that are more particularly useful for CTR prediction.
The visualization provides us an intuitive understanding

Figure 10: Saliency maps of the image ads. Brighter areas play a more important role in affecting the CTR of ads.
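
The saliency maps of Figure 10 follow Eqs. 9 to 11: take the derivative of the click score with respect to every input value, then keep the largest absolute derivative across the three colour channels of each pixel. The sketch below illustrates only that reduction. It makes several assumptions that are not part of the paper: `toy_score` and `WEIGHTS` are a hypothetical linear stand-in for the DeepCTR network, and the derivative is estimated with central finite differences purely to keep the example dependency-free, whereas a real network would obtain it with one backward pass.

```python
from typing import Callable, List

Image = List[List[List[float]]]  # height x width x 3 (RGB)

# Hypothetical weights of a toy linear score model (not the DeepCTR network).
WEIGHTS: Image = [
    [[0.5, -2.0, 0.1], [0.0, 0.3, -0.2]],
    [[-1.5, 0.4, 0.0], [0.2, 0.2, 0.9]],
]


def toy_score(img: Image) -> float:
    """Stand-in for the click score z(U): a weighted sum of all pixel values."""
    return sum(
        WEIGHTS[i][j][c] * img[i][j][c]
        for i in range(len(img))
        for j in range(len(img[0]))
        for c in range(3)
    )


def saliency_map(score: Callable[[Image], float], img: Image,
                 eps: float = 1e-3) -> List[List[float]]:
    """Return M[i][j] = max over channels c of |dz/dU[i][j][c]| (Eqs. 10-11)."""
    height, width = len(img), len(img[0])
    saliency = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            for c in range(3):
                original = img[i][j][c]
                img[i][j][c] = original + eps
                score_up = score(img)
                img[i][j][c] = original - eps
                score_down = score(img)
                img[i][j][c] = original  # restore the pixel value
                grad = (score_up - score_down) / (2.0 * eps)
                saliency[i][j] = max(saliency[i][j], abs(grad))
    return saliency


# A tiny 2 x 2 RGB image with arbitrary values, for illustration only.
image: Image = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    [[0.7, 0.8, 0.9], [1.0, 0.0, 0.5]],
]
for row in saliency_map(toy_score, image):
    print([round(value, 3) for value in row])
```

For the linear toy model the derivative equals the weight itself, so each saliency value is simply the largest absolute weight among the three channels of that pixel. This matches the reading of Eq. 9 as a first order approximation of the full model.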

of the impact of visual features in ad images, which may be useful for designers when making design choices. For example, we can decide whether or not to add another model to an ad according to the saliency map of this model.

7. CONCLUSIONS
CTR prediction plays an important role in the online display advertising business. Accurate prediction of the CTR of ads not only increases the revenue of web publishers, but also improves the user experience. In this paper we propose an end-to-end integrated deep network to predict the CTR of image ads. It consists of Convnet, Basicnet and Combnet. Convnet is used to extract image features automatically, while Basicnet is used to reduce the dimensionality of basic features. Combnet can learn complex and effective non-linear features from these two kinds of features. The usage of the sparse fully-connected layer and data sampling techniques speeds up the training process significantly. We evaluate the DeepCTR model on a real-world dataset with 50 million records. The empirical result demonstrates the effectiveness and efficiency of our DeepCTR model.

8. ACKNOWLEDGMENTS
This paper is partially supported by NSFC (No. 61272247, 61533012, 61472075), the 863 National High Technology Research and Development Program of China (SS2015AA020501), the Basic Research Project of Innovation Action Plan (16JC1402800) and the Major Basic Research Program (15JC1400103) of Shanghai Science and Technology Committee.

9. REFERENCES
[1] J. Azimi, R. Zhang, Y. Zhou, V. Navalpakkam, J. Mao, and X. Fern. The impact of visual appearance on user response in online display advertising. In Proceedings of the 21st international conference companion on World Wide Web, pages 457–458. ACM, 2012.
[2] D. Chakrabarti, D. Agarwal, and V. Josifovski. Contextual advertising by combining relevance with click feedback. In Proceedings of the 17th international conference on World Wide Web, pages 417–426. ACM, 2008.
[3] O. Chapelle, E. Manavoglu, and R. Rosales. Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology (TIST), 5(4):61, 2014.
[4] H. Cheng, R. v. Zwol, J. Azimi, E. Manavoglu, R. Zhang, Y. Zhou, and V. Navalpakkam. Multimedia features for click prediction of new ads in display advertising. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 777–785. ACM, 2012.
[5] K. S. Dave and V. Varma. Learning the click-through rate for rare/new ads from similar ads. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 897–898. ACM, 2010.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. arXiv preprint arXiv:1502.01852, 2015.
[9] X. He, J. Pan, O. Jin, Xu, et al. Practical lessons from predicting clicks on ads at facebook. In Proceedings of 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1–9. ACM, 2014.
[10] G. Hinton, L. Deng, D. Yu, G. E. Dahl, Mohamed, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97, 2012.
[11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[12] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[14] D. G. Lowe. Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on, volume 2, pages 1150–1157. IEEE, 1999.
[15] M. Mahdian and K. Tomak. Pay-per-action model for online advertising. In Proceedings of the 1st international workshop on Data mining and audience intelligence for advertising, pages 1–6. ACM, 2007.
[16] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, et al. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1222–1230. ACM, 2013.
[17] T. Mei, R. Zhang, and X.-S. Hua. Internet multimedia advertising: techniques and technologies. In Proceedings of the 19th ACM international conference on Multimedia, pages 627–628. ACM, 2011.
[18] K. Mo, B. Liu, L. Xiao, Y. Li, and J. Jiang. Image feature learning for cold start problem in display advertising. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 3728–3734. AAAI Press, 2015.
[19] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[20] R. J. Oentaryo, E.-P. Lim, J.-W. Low, D. Lo, and M. Finegold. Predicting response in mobile advertising with hierarchical importance-aware factorization machine. In Proceedings of the 7th ACM international conference on Web search and data mining, pages 123–132. ACM, 2014.
[21] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[22] S. Rendle. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 995–1000. IEEE, 2010.
[23] S. Rendle. Factorization machines with libfm. ACM Transactions on Intelligent Systems and Technology (TIST), 3(3):57, 2012.
[24] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th international conference on World Wide Web, pages 521–530. ACM, 2007.
[25] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[26] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[29] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1113–1120. ACM, 2009.
[30] L. Yan, W.-j. Li, G.-R. Xue, and D. Han. Coupled group lasso for web-scale ctr prediction in display advertising. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 802–810, 2014.
[31] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pages 818–833. Springer, 2014.
[32] W. Zhang, T. Du, and J. Wang. Deep learning over multi-field categorical data: A case study on user response prediction. arXiv preprint arXiv:1601.02376, 2016.
