Factorization Machines With Follow-The-Regularized-Leader For CTR Prediction in Display Advertising
Anh-Phuong TA
Zebestof company – CCM Benchmark group
Paris, France
Email: [email protected]
Abstract—Predicting ad click-through rates is the core problem in display advertising, which has received much attention from the machine learning community in recent years. In this paper, we present an online learning algorithm for click-through rate prediction, namely Follow-The-Regularized-Factorized-Leader (FTRFL), which incorporates the Follow-The-Regularized-Leader (FTRL-Proximal) algorithm with per-coordinate learning rates into Factorization Machines. Experiments on a real-world advertising dataset show that the FTRFL method outperforms the baseline with stochastic gradient descent, and has a faster rate of convergence.

Keywords—Online advertising; FTRL-Proximal; Factorization machines

I. INTRODUCTION
Internet advertising is a multi-billion dollar business and is growing rapidly. There are several major channels for online advertising on the web, such as display advertising and search advertising. Display advertising differs from search advertising in that it uses graphical banners placed on publishers’ web pages [1]. In online advertising, advertisers can choose between Cost per Click (CPC), Cost per Action (CPA), or Cost per Impression (CPM) pricing to purchase display ads. Among these, CPC is the most popular option, in which advertisers pay only when a user clicks on the ad. As a consequence, click-through rate (CTR) prediction, defined as the problem of estimating the probability that a user clicks on an ad in a specific context, is crucial to online advertising.

Predicting CTR in display advertising has been widely studied in the literature. Logistic regression (LR) is commonly used in industry due to its ease of implementation and effective performance in large-scale systems [1][2][3][4]. Numerous optimization methods have been applied to train logistic regression models, including stochastic (online) gradient descent (SGD) [5], Newton and quasi-Newton methods (e.g., L-BFGS), and coordinate descent [6]. SGD has proved effective for these kinds of problems, producing good prediction accuracy. Its resulting models, however, are not sparse enough, making them extremely expensive to store in production. Many efforts have been made to produce sparser models, e.g., the FOBOS algorithm [7], the Regularized Dual Averaging (RDA) algorithm [8], and the Follow-The-Regularized-Leader (FTRL-Proximal) algorithm [9]. Among these algorithms, FTRL-Proximal has been shown to produce the best combination of sparsity and predictive performance [9].
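For reference, FTRL-Proximal [9] maintains two accumulators per coordinate $i$: the sum of squared gradients $n_{t,i} = \sum_{s \le t} g_{s,i}^2$, which defines the per-coordinate learning rate $\eta_{t,i} = \alpha / (\beta + \sqrt{n_{t,i}})$, and the adjusted gradient sum $z_{t,i} = \sum_{s \le t} (g_{s,i} - \sigma_{s,i} w_{s,i})$ with $\sigma_{s,i} = (\sqrt{n_{s,i}} - \sqrt{n_{s-1,i}})/\alpha$. The weight then has the closed form

$$
w_{t+1,i} =
\begin{cases}
0 & \text{if } |z_{t,i}| \le \lambda_1, \\[4pt]
-\left( \dfrac{\beta + \sqrt{n_{t,i}}}{\alpha} + \lambda_2 \right)^{-1} \left( z_{t,i} - \operatorname{sgn}(z_{t,i})\,\lambda_1 \right) & \text{otherwise,}
\end{cases}
$$

so any coordinate whose adjusted gradient sum stays below the $\ell_1$ threshold is exactly zero; this is the source of the sparsity noted above.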
Despite their success, logistic-regression-based methods cannot capture higher-order interactions (i.e., non-linear information) between features, which have proved to be important in CTR prediction [1]. One can manually select and construct conjunction features from the original ones as input for LR models; this approach, however, yields a quadratic number of new features, making the model very difficult to learn. A new line of research based on feature engineering and factorized model design, called Factorization Machines (FM), has recently emerged as a very successful model class for CTR prediction [10]. Indeed, FM combines the advantages of Support Vector Machines with factorization models, and is able to model all interactions between variables even under extreme data sparsity. An implementation of FM that supports several optimization algorithms, including SGD, Alternating Least Squares (ALS), and Markov Chain Monte Carlo (MCMC), is available [11].
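For completeness, a factorization machine of degree two [10] models the prediction as

$$
\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j,
$$

where $w_0$ is the global bias, the $w_i$ are linear weights, and each feature $i$ is associated with a $k$-dimensional latent vector $\mathbf{v}_i$, so that every pairwise interaction weight is factorized as the inner product $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$. This factorization is what allows FM to estimate interactions reliably even when feature co-occurrences are extremely sparse.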
In this paper, we attempt to obtain both the sparsity provided by FTRL-Proximal and the ability of FM to estimate higher-order information. To this end, we present the Follow-The-Regularized-Factorized-Leader (FTRFL) algorithm, which incorporates FTRL-Proximal with per-coordinate learning rates into FM.
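To make the combination concrete, the sketch below keeps one FTRL-Proximal state per model coordinate; in FTRFL-style training such a state would be maintained for the FM bias, every linear weight, and every entry of the latent factor matrix. This is a minimal illustration of the per-coordinate update of [9] under assumed hyperparameter names (alpha, beta, l1, l2), not the exact FTRFL update derived in this paper.

    import numpy as np

    class FTRLProximalCoordinate:
        """FTRL-Proximal state for a single model coordinate [9]."""

        def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
            self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
            self.z = 0.0  # adjusted gradient sum
            self.n = 0.0  # sum of squared gradients

        def weight(self):
            # Closed-form lazy weight; exactly 0 while |z| <= l1,
            # which is what makes the stored model sparse.
            if abs(self.z) <= self.l1:
                return 0.0
            return -(self.z - np.sign(self.z) * self.l1) / (
                (self.beta + np.sqrt(self.n)) / self.alpha + self.l2
            )

        def update(self, g):
            # sigma encodes the per-coordinate learning-rate schedule
            # eta_t = alpha / (beta + sqrt(n_t)).
            sigma = (np.sqrt(self.n + g * g) - np.sqrt(self.n)) / self.alpha
            self.z += g - sigma * self.weight()  # uses the pre-update weight
            self.n += g * g

    # Toy usage: small accumulated gradients keep the coordinate at exactly zero.
    coord = FTRLProximalCoordinate()
    for g in [0.2, -0.1, 0.15]:
        coord.update(g)
    print(coord.weight())  # -> 0.0

At serving time, only the coordinates with a non-zero weight need to be stored, which is what makes the resulting model cheap to deploy in production.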
D. Results

We compared our method against the standard FM trained with SGD [11]. For both methods, we first trained on a small part of the data to choose the tuning parameters, i.e., we selected the parameters that gave the smallest error on the validation data. Once the parameters were determined, the model was learned from the entire training set. The number of latent factors was set to k = 20 in all experiments. Table I compares our results on the test set with those of the baseline (FM with SGD).
Table I
Results for the proposed method and the FM model with SGD. The accuracy is measured by the area under the ROC curve.

Test data    Our method (FTRFL)    FM with SGD
Day 1        0.9836                0.9128
Day 2        0.9818                0.9105
Day 3        0.9809                0.9021
It can be seen that the proposed method outperforms the baseline by over 7%. In industry, a gain of this size has a significant impact on overall system performance. In terms of convergence rate, we observed that the FTRFL method typically converges after 5 iterations, while FM with SGD usually converges after 20 iterations.
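The accuracy figures in Table I are areas under the ROC curve (AUC). As a sketch of how each day’s figure could be computed (with hypothetical toy labels and predictions, using scikit-learn), one would score the predicted click probabilities against the observed clicks:

    from sklearn.metrics import roc_auc_score

    # Toy stand-ins for one test day: binary click labels and predicted CTRs.
    y_true = [0, 0, 1, 0, 1, 1, 0, 1]
    y_pred = [0.05, 0.12, 0.83, 0.20, 0.64, 0.91, 0.33, 0.70]

    print(roc_auc_score(y_true, y_pred))  # area under the ROC curve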
IV. CONCLUSION

In this paper, we have applied the FTRL-Proximal algorithm with per-coordinate learning rates to FM. The proposed algorithm produces a sparse model, making it applicable to real-world scenarios (i.e., in production, one need store only the non-zero coefficients of the model). Experimental results show that the FTRFL method outperforms the standard FM with SGD, and has a much faster rate of convergence.
REFERENCES

[1] O. Chapelle, E. Manavoglu, and R. Rosales, “Simple and scalable response prediction for display advertising,” ACM Trans. Intell. Syst. Technol., vol. 5, no. 4, pp. 61:1–61:34, Dec. 2014.

[2] K.-c. Lee, B. Orten, A. Dasdan, and W. Li, “Estimating conversion rate in display advertising from past performance data,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’12. New York, USA: ACM, 2012, pp. 768–776.

[3] A. Agarwal, O. Chapelle, J. Langford, and C. Cortes, “A reliable effective terascale linear learning system,” Tech. Rep., 2011.

[4] T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich, “Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine,” in Proceedings of the 27th International Conference on Machine Learning, Israel, 2010.

[5] L. Bottou, “Online algorithms and stochastic approximations,” in Online Learning and Neural Networks, D. Saad, Ed. Cambridge, UK: Cambridge University Press, 1998, revised Oct. 2012. [Online]. Available: http://leon.bottou.org/papers/bottou-98x

[6] M. Zinkevich, “Online convex programming and generalized infinitesimal gradient ascent,” in Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), Washington, DC, USA, August 21-24, 2003, pp. 928–936.

[7] J. Duchi and Y. Singer, “Efficient online and batch learning using forward backward splitting,” J. Mach. Learn. Res., vol. 10, pp. 2899–2934, Dec. 2009.

[8] L. Xiao, “Dual averaging methods for regularized stochastic learning and online optimization,” Journal of Machine Learning Research, vol. 11, 2010.

[9] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, S. Chikkerur, D. Liu, M. Wattenberg, A. M. Hrafnkelsson, T. Boulos, and J. Kubica, “Ad click prediction: A view from the trenches,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013.

[10] S. Rendle, “Factorization machines,” in Proceedings of the 10th IEEE International Conference on Data Mining, Sydney, Australia, December 14-17, 2010, pp. 995–1000.

[11] S. Rendle, “Factorization machines with libFM,” ACM Trans. Intell. Syst. Technol., vol. 3, no. 3, pp. 57:1–57:22, May 2012.

[12] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. W. Senior, P. A. Tucker, K. Yang, and A. Y. Ng, “Large scale distributed deep networks,” in 26th Annual Conference on Neural Information Processing Systems, 2012.