Factorization Machines With Follow-The-Regularized-Leader For CTR Prediction in Display Advertising
Anh-Phuong TA
Zebestof company – CCM Benchmark group
Paris, France
Email: [email protected]
Abstract—Predicting ad click-through rates is the core problem in display advertising, which has received much attention from the machine learning community in recent years. In this paper, we present an online learning algorithm for click-through rate prediction, namely Follow-The-Regularized-Factorized-Leader (FTRFL), which incorporates the Follow-The-Regularized-Leader (FTRL-Proximal) algorithm with per-coordinate learning rates into Factorization Machines. Experiments on a real-world advertising dataset show that the FTRFL method outperforms the baseline with stochastic gradient descent, and has a faster rate of convergence.

Keywords—Online advertising; FTRL-Proximal; Factorization machines

I. INTRODUCTION
Internet advertising is a multi-billion dollar business and is growing rapidly. There are several major channels for online advertising on the web, such as display advertising and search advertising. Display advertising differs from search advertising in that it uses graphical banners placed on publishers’ web pages [1]. In online advertising, advertisers can choose between Cost per Click (CPC), Cost per Action (CPA), or Cost per Impression (CPM) pricing to purchase display ads. Among these, CPC is the most popular option, in which advertisers pay only when a user clicks on the ad. As a consequence, click-through rate (CTR) prediction, defined as the problem of estimating the probability that a user clicks on an ad in a specific context, is crucial to online advertising.

Predicting CTR in display advertising has been widely studied in the literature. Logistic regression (LR) is commonly used in industry due to its ease of implementation and effective performance in large-scale systems [1][2][3][4]. Numerous optimization methods have been applied to train logistic regression models, including stochastic (online) gradient descent (SGD) [5], Newton and quasi-Newton methods (e.g., L-BFGS), and coordinate descent [6]. SGD has proved effective for these kinds of problems, producing good prediction accuracy. Its resulting models, however, are not sparse enough, making them extremely expensive to store in production. Many efforts have been made to produce sparser models, e.g., the FOBOS algorithm [7], the Regularized Dual Averaging (RDA) algorithm [8], and the Follow-The-Regularized-Leader (FTRL-Proximal) algorithm [9]. Among these algorithms, FTRL-Proximal has been shown to produce the best combination of sparsity and predictive performance [9].
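For reference, FTRL-Proximal [9] maintains two accumulators per coordinate $i$: the sum of squared gradients $n_{t,i} = \sum_{s \le t} g_{s,i}^2$, which defines the per-coordinate learning rate $\eta_{t,i} = \alpha / (\beta + \sqrt{n_{t,i}})$, and the adjusted gradient sum $z_{t,i} = \sum_{s \le t} (g_{s,i} - \sigma_{s,i} w_{s,i})$ with $\sigma_{s,i} = (\sqrt{n_{s,i}} - \sqrt{n_{s-1,i}})/\alpha$. The weight then has the closed form

$$
w_{t+1,i} =
\begin{cases}
0 & \text{if } |z_{t,i}| \le \lambda_1, \\[4pt]
-\left( \dfrac{\beta + \sqrt{n_{t,i}}}{\alpha} + \lambda_2 \right)^{-1} \left( z_{t,i} - \operatorname{sgn}(z_{t,i})\,\lambda_1 \right) & \text{otherwise,}
\end{cases}
$$

so any coordinate whose adjusted gradient sum stays below the $\ell_1$ threshold is exactly zero; this is the source of the sparsity noted above.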
Despite their success, logistic-regression-based methods cannot capture higher-order interactions (i.e., non-linear information) between features, which have proved to be important in CTR prediction [1]. One can manually select and construct conjunction features from the original ones as input for LR models; this approach, however, yields a quadratic number of new features, making the model very difficult to learn. A new line of research based on feature engineering and factorized model design, called Factorization Machines (FM), has recently emerged as a very successful model class for CTR prediction [10]. Indeed, FM combines the advantages of Support Vector Machines with factorization models, and is able to model all interactions between variables even under extreme data sparsity. An implementation of FM that supports several optimization algorithms, including SGD, Alternating Least Squares (ALS), and Markov Chain Monte Carlo (MCMC), is available [11].
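For completeness, a factorization machine of degree two [10] models the prediction as

$$
\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j,
$$

where $w_0$ is the global bias, the $w_i$ are linear weights, and each feature $i$ is associated with a $k$-dimensional latent vector $\mathbf{v}_i$, so that every pairwise interaction weight is factorized as the inner product $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$. This factorization is what allows FM to estimate interactions reliably even when feature co-occurrences are extremely sparse.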
In this paper, we attempt to obtain both the sparsity provided by FTRL-Proximal and the ability of FM to estimate higher-order information. To this end, we present the Follow-The-Regularized-Factorized-Leader (FTRFL) algorithm, which incorporates FTRL-Proximal with per-coordinate learning rates into FM.
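To make the combination concrete, the sketch below keeps one FTRL-Proximal state per model coordinate; in FTRFL-style training such a state would be maintained for the FM bias, every linear weight, and every entry of the latent factor matrix. This is a minimal illustration of the per-coordinate update of [9] under assumed hyperparameter names (alpha, beta, l1, l2), not the exact FTRFL update derived in this paper.

    import numpy as np

    class FTRLProximalCoordinate:
        """FTRL-Proximal state for a single model coordinate [9]."""

        def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
            self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
            self.z = 0.0  # adjusted gradient sum
            self.n = 0.0  # sum of squared gradients

        def weight(self):
            # Closed-form lazy weight; exactly 0 while |z| <= l1,
            # which is what makes the stored model sparse.
            if abs(self.z) <= self.l1:
                return 0.0
            return -(self.z - np.sign(self.z) * self.l1) / (
                (self.beta + np.sqrt(self.n)) / self.alpha + self.l2
            )

        def update(self, g):
            # sigma encodes the per-coordinate learning-rate schedule
            # eta_t = alpha / (beta + sqrt(n_t)).
            sigma = (np.sqrt(self.n + g * g) - np.sqrt(self.n)) / self.alpha
            self.z += g - sigma * self.weight()  # uses the pre-update weight
            self.n += g * g

    # Toy usage: small accumulated gradients keep the coordinate at exactly zero.
    coord = FTRLProximalCoordinate()
    for g in [0.2, -0.1, 0.15]:
        coord.update(g)
    print(coord.weight())  # -> 0.0

At serving time, only the coordinates with a non-zero weight need to be stored, which is what makes the resulting model cheap to deploy in production.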
D. Results

We compared our method against the standard FM trained with SGD [11]. For both methods, we first trained on a small part of the data to choose the tuning parameters, i.e., we selected the parameters that gave the smallest error on the validation data. Once the parameters were determined, the model was learned from the entire training set. The number of latent factors was set to k = 20 in all experiments. Table I compares our results on the test set with those of the baseline (FM with SGD).
Table I
Results for the proposed method and the FM model with SGD. The accuracy is measured by the area under the ROC curve.

Test data    Our method (FTRFL)    FM with SGD
Day 1        0.9836                0.9128
Day 2        0.9818                0.9105
Day 3        0.9809                0.9021
It can be seen that the proposed method outperforms the baseline by over 7%. In industry, a gain of this size has a significant impact on overall system performance. In terms of convergence rate, we observed that the FTRFL method typically converges after 5 iterations, while FM with SGD usually converges after 20 iterations.
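The accuracy figures in Table I are areas under the ROC curve (AUC). As a sketch of how each day’s figure could be computed (with hypothetical toy labels and predictions, using scikit-learn), one would score the predicted click probabilities against the observed clicks:

    from sklearn.metrics import roc_auc_score

    # Toy stand-ins for one test day: binary click labels and predicted CTRs.
    y_true = [0, 0, 1, 0, 1, 1, 0, 1]
    y_pred = [0.05, 0.12, 0.83, 0.20, 0.64, 0.91, 0.33, 0.70]

    print(roc_auc_score(y_true, y_pred))  # area under the ROC curve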
IV. CONCLUSION

In this paper, we have applied the FTRL-Proximal algorithm with per-coordinate learning rates to FM. The proposed algorithm produces a sparse model, making it applicable to real-world scenarios (i.e., in production, one need store only the non-zero coefficients of the model). Experimental results show that the FTRFL method outperforms the standard FM with SGD, and has a much faster rate of convergence.
REFERENCES

[1] O. Chapelle, E. Manavoglu, and R. Rosales, “Simple and scalable response prediction for display advertising,” ACM Trans. Intell. Syst. Technol., vol. 5, no. 4, pp. 61:1–61:34, Dec. 2014.

[2] K.-c. Lee, B. Orten, A. Dasdan, and W. Li, “Estimating conversion rate in display advertising from past performance data,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’12. New York, USA: ACM, 2012, pp. 768–776.

[3] A. Agarwal, O. Chapelle, J. Langford, and C. Cortes, “A reliable effective terascale linear learning system,” Tech. Rep., 2011.

[4] T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich, “Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine,” in Proceedings of the 27th International Conference on Machine Learning, Israel, 2010.

[5] L. Bottou, “Online algorithms and stochastic approximations,” in Online Learning and Neural Networks, D. Saad, Ed. Cambridge, UK: Cambridge University Press, 1998, revised Oct. 2012. [Online]. Available: http://leon.bottou.org/papers/bottou-98x

[6] M. Zinkevich, “Online convex programming and generalized infinitesimal gradient ascent,” in Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), Washington, DC, USA, August 21-24, 2003, pp. 928–936.

[7] J. Duchi and Y. Singer, “Efficient online and batch learning using forward backward splitting,” J. Mach. Learn. Res., vol. 10, pp. 2899–2934, Dec. 2009.

[8] L. Xiao, “Dual averaging methods for regularized stochastic learning and online optimization,” Journal of Machine Learning Research, vol. 11, 2010.

[9] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, S. Chikkerur, D. Liu, M. Wattenberg, A. M. Hrafnkelsson, T. Boulos, and J. Kubica, “Ad click prediction: A view from the trenches,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013.

[10] S. Rendle, “Factorization machines,” in Proceedings of the 10th IEEE International Conference on Data Mining, Sydney, Australia, December 14-17, 2010, pp. 995–1000.

[11] S. Rendle, “Factorization machines with libFM,” ACM Trans. Intell. Syst. Technol., vol. 3, no. 3, pp. 57:1–57:22, May 2012.

[12] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. W. Senior, P. A. Tucker, K. Yang, and A. Y. Ng, “Large scale distributed deep networks,” in 26th Annual Conference on Neural Information Processing Systems, 2012.