
A Multi-stage Framework for Online Bonus Allocation Based on Constrained User Intent Detection

Chao Wang∗† (Meituan Group, Shanghai, China), Xiaowei Shi∗ (Meituan Group, Beijing, China), Shuai Xu∗ (Meituan Group, Shanghai, China), Zhe Wang† (Meituan Group, Beijing, China), Zhiqiang Fan (Meituan Group, Beijing, China), Yan Feng (Meituan Group, Shanghai, China), An You (Meituan Group, Beijing, China), Yu Chen (Meituan Group, Beijing, China)

∗ These authors contributed equally to this research. † Corresponding author.

ABSTRACT

With the explosive development of e-commerce for service, tens of millions of orders are generated every day on the Meituan platform. By allocating bonuses to new customers when they pay, the Meituan platform encourages them to use its own payment service for a better experience in the future. This can be formulated as a multi-choice knapsack problem (MCKP), and the mainstream solution is usually a two-stage method. The first stage is user intent detection, predicting the effect of each bonus treatment. This prediction then serves as the objective of the MCKP, and the problem is solved in the second stage to obtain the optimal allocation strategy. However, this solution usually faces the following challenges: (1) In the user intent detection stage, due to the sparsity of interaction and noise, traditional multi-treatment effect estimation methods lack interpretability and may violate the domain knowledge from economic theory that the marginal gain is non-negative as the bonus amount increases. (2) There is an optimality gap between the two stages, which limits the upper bound of the optimal value obtained in the second stage. (3) Due to changes in the distribution of orders online, the actual cost consumption often violates the given budget limit. To solve the above challenges, we propose a framework that consists of three modules, i.e., User Intent Detection Module, Online Allocation Module, and Feedback Control Module. In the User Intent Detection Module, we implicitly model the treatment increment based on deep representation learning and constrain it to be non-negative to achieve monotonicity constraints. Then, in order to reduce the optimality gap, we further propose a convex constrained model to increase the upper bound of the optimal value. For the third challenge, to cope with the fluctuation of online bonus consumption, we leverage a feedback control strategy in the framework to make the actual cost more accurately approach the given budget limit. Finally, we conduct extensive offline and online experiments, demonstrating the superiority of our proposed framework, which reduced customer acquisition costs by 5.07% and is still running online.

CCS CONCEPTS

• Information systems → Computational advertising; • Applied computing → Electronic commerce.

KEYWORDS

E-commerce, Bonus Allocation, Multi-treatment Effect Estimation, Monotonic Constraint, Convex Constraint

ACM Reference Format:
Chao Wang, Xiaowei Shi, Shuai Xu, Zhe Wang, Zhiqiang Fan, Yan Feng, An You, and Yu Chen. 2023. A Multi-stage Framework for Online Bonus Allocation Based on Constrained User Intent Detection. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '23), August 6–10, 2023, Long Beach, CA, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3580305.3599764

1 INTRODUCTION

With the explosion of e-commerce, customer acquisition has become increasingly important and competitive. Assigning bonuses or coupons to customers is a crucial way to promote user conversion in customer acquisition management. Therefore, the most critical problem is how to convert more new customers within a given budget limit. For this problem, the mainstream solution is a two-stage method [2, 15, 27, 33]: a combination of machine learning (ML) and operations research (OR). The first stage is user intent detection, which predicts the user's conversion probability


under each candidate bonus. The second stage is optimization. In this stage, based on the obtained conversion probability, the platform allocates an appropriate bonus amount online to achieve the overall optimal conversion. For example, Meituan, one of the most popular e-commerce platforms for service, takes tens of millions of orders daily. In its payment platform, bonuses are allocated to new customers through discounts on the order amount. In order to meet the needs of different users as much as possible, there are dozens of optional bonus amounts.

Figure 1: We randomly select three new customers online and obtain the potential outcomes predicted by the intent detection model under each bonus treatment, called the response curve. (The X-axis is the treatment index from 1 to 51; the Y-axis is the predicted conversion probability.)

User intent detection is a multi-treatment effect estimation problem. Methods can be divided into several categories, such as stratification methods, matching models, meta-algorithms, and tree-based methods. Representation learning methods based on deep neural networks have also attracted much attention, including the pioneering method TARNET [21] and the PM model [20] for multi-treatment scenarios. Compared with traditional machine learning approaches, deep representation learning models are capable of automatically searching for correlated features and combining them to enable more effective and accurate counterfactual estimation [29]. In terms of optimization, the bonus allocation problem can be formulated as a multi-choice knapsack problem (MCKP). The bonus allocation is non-trivial because of the following challenges.

• For user intent detection, the existing model lacks interpretability, i.e., it does not conform to domain knowledge, and so cannot make reliable predictions in the real world. In economic theory, the marginal gain of increasing the bonus amount should be non-negative [11]. However, due to the sparsity of the interaction and noise, as shown in Figure 1, the user response curve predicted by traditional treatment effect estimation methods is not entirely monotonic and exhibits many jitters.
• There is an optimality gap between the two stages of user intent detection and optimization. We prove that in the response curve obtained in the first stage, some bonus treatments have no chance of being decided, thereby reducing the upper bound of the optimal value.
• The distribution of online orders and users changes over time, such as between weekdays and weekends. How to adjust the allocation strategy efficiently and dynamically so that the daily customer acquisition cost does not exceed the budget limit is crucial for budget control.

The main contribution of this work is to propose an online bonus allocation framework to address the above challenges, which consists of three modules: User Intent Detection Module, Online Allocation Module, and Feedback Control Module.

For the first challenge, we analyze the typical pattern of multi-treatment effect estimation methods based on representation learning. Based on it, we propose a monotonic constrained user intent detection model. It still deploys a multi-head network structure. The difference is that it implicitly models the effect increment of two adjacent bonus amounts and finally obtains the individual treatment effect through accumulation. The monotonic constraint is implemented by restricting the effect increment to be non-negative. In addition, the accumulation operator is added to the output layer to make the information fusion between the head layers more sufficient and to improve the stability of model training when there are many treatments.

For the second challenge, in the bonus allocation optimization problem, there are millions of new customers daily and dozens of bonus treatments, so the policy space is combinatorial and enormous. Inspired by [27, 32], we leverage the Lagrangian dual theory to calculate the Lagrangian multiplier to compress the policy space. Furthermore, we propose a user intent detection model with convex constraints that can be theoretically proven to increase the upper bound of the optimal value.

For the third challenge, we approach it as a feedback control problem and apply feedback control theory to address the dynamic system's response and control in the presence of external noise. Without loss of generality, the feedback control module can be applied not only in our scenario, where customer acquisition cost (CAC) is the target constraint, but also to other cost-related constraints.

We conduct extensive experiments to evaluate the performance of our proposed framework. Both online and offline experiments demonstrate that the proposed framework achieves better performance compared with other methods. The proposed bonus allocation system has brought significant profits to the platform and still runs online.

The rest of our paper is organized as follows. We describe the bonus allocation system in detail in Section 2. Then we introduce the three modules included in the proposed framework respectively in Section 3. We present the experimental setups and results in Section 4. We briefly review related work in Section 5. The conclusions are given in Section 6.

2 THE BONUS ALLOCATION FRAMEWORK IN MEITUAN PAYMENT

Figure 2: Our proposed framework for multi-stage bonus allocation at the algorithmic layer, including a user intent detection module, an online allocation module that solves the bonus allocation optimization problem, and a feedback control module. (The business layer supplies the optional bonus amounts, allocation rules, and budget limits; samples are collected through randomized control allocation.)

In this section, we introduce a bonus allocation algorithm framework for customer acquisition marketing in Meituan Payment, as illustrated in Figure 2. It comprises a business layer and an algorithm layer. When a new customer places an order and makes a payment on the Meituan platform, the user intent detection module in the algorithm layer evaluates the customer's conversion probability for each optional bonus amount, which is provided by the

business layer and is consistent for all users. In the bonus allocation optimization module, we formulate an optimization problem based on the user conversion probability to determine the optimal allocation strategy that maximizes the total number of converted customers within a given budget limit, while adhering to the allocation rules in the business layer. We then apply the optimal allocation strategy in real time to allocate each order. As the traffic of online orders is non-uniform, the budget consumption fluctuates. To ensure that the actual consumption approaches the budget limit accurately, we deploy a feedback control strategy to adjust the allocation strategy in real time.

Randomized controlled trials (RCTs) are considered the gold standard for estimating treatment effects. In this study, we collected samples through RCTs to assess the potential effects of different bonus treatments. The experiment was conducted as an online multivariate A/B test, which was deployed in the actual traffic of Meituan Payment and lasted for several days.

3 METHODOLOGY

In this section, we first formulate the problem of bonus allocation in Section 3.1 and then propose the online allocation strategy, based on the user's conversion probability detected in the User Intent Detection Module, in Section 3.2. Secondly, in order to meet the business knowledge and increase the upper bound of the optimal value under the Lagrangian relaxation paradigm, we add monotonic and convex constraints to the common multi-treatment effect estimation model in the User Intent Detection Module, detailed in Section 3.3. Finally, we propose a feedback control strategy in Section 3.4 so that online cost consumption does not violate the budget limit.

3.1 Preliminaries

Within a time period of one day, we assume that there are $N$ new customers who visit the Meituan payment platform for transactions, and there are $M$ types of candidate bonus amounts. For a given user $i$, we use $p_{i,j}$ to denote the conversion probability under the bonus amount $j$. The objective of bonus allocation is to identify an optimal allocation policy that converts as many users as possible within a given budget constraint $B$. Therefore, the bonus allocation problem can be formulated as the MCKP:

\max_{x_{i,j}} \sum_{i=1}^{N} \sum_{j=1}^{M} p_{i,j} x_{i,j},  (1)

s.t. \sum_{i=1}^{N} \sum_{j=1}^{M} b_j p_{i,j} x_{i,j} \le B,  (2)

\sum_{j=1}^{M} x_{i,j} = 1, \quad \forall i \in [N],  (3)

x_{i,j} \in \{0, 1\}, \quad \forall i \in [N], \forall j \in [M],  (4)

where $x_{i,j} = 1$ if bonus $j$ is allocated to user $i$, and $X = \{x_{i,j}\}$ denotes the allocation policy. Eq.(1) denotes the objective, that is, the expected total number of converted customers. Eq.(2) represents the budget constraint, where $b_j$ denotes the amount of bonus $j$; only converted users consume the budget, so we use $p_{i,j} b_j$ to represent the expected cost. Eq.(3) holds because we only issue one bonus per user visit.

In order to solve this problem, we propose a framework that (1) predicts the conversion probability in the User Intent Detection Module, (2) solves the optimization problem in the Online Allocation Module to get the optimal policy and performs the online allocation, and (3) adjusts the pace of budget consumption in the Feedback Control Module, since new customers do not arrive uniformly, so that the cost does not violate the budget limit.

3.2 Online Allocation Module

The methods for solving MCKPs can be divided into two categories: exact algorithms, such as dynamic programming and the branch-and-bound method, and inexact algorithms, mainly including the Lagrangian dual algorithm, evolutionary algorithms, etc. Due to the need for real-time allocation, it is difficult to obtain an exact solution. Inspired by [27, 32], we leverage the Lagrangian dual theory, so that we can calculate the Lagrangian multiplier $\lambda$ to compress the policy space $X$.

By introducing a Lagrangian multiplier $\lambda$, we get the Lagrangian relaxation function of the original problem as follows:

L(X, \lambda) = \sum_{i=1}^{N} \sum_{j=1}^{M} p_{i,j} x_{i,j} - \lambda \left( \sum_{i=1}^{N} \sum_{j=1}^{M} b_j x_{i,j} p_{i,j} - B \right),  (5)

Correspondingly, the dual problem is formulated as:

\min_{\lambda} \max_{x_{i,j}} L(X, \lambda),  (6)

s.t. \sum_{j=1}^{M} x_{i,j} = 1, \quad \forall i \in [N],  (7)

x_{i,j} \in \{0, 1\}, \quad \forall i \in [N], \forall j \in [M],  (8)

with optimality conditions:

\lambda \left( \sum_{i=1}^{N} \sum_{j=1}^{M} b_j x_{i,j} p_{i,j} - B \right) = 0,  (9)

\sum_{i=1}^{N} \sum_{j=1}^{M} b_j x_{i,j} p_{i,j} - B \le 0, \quad \lambda \ge 0.  (10)

Thanks to the Lagrangian dual decomposition, we remove Eq.(2) from the constraints, and for a fixed $\lambda$, the above optimization problem turns into a set of sub-problems (i.e., one for each new customer $i^*$):

\max_{x_{i^*,j}} \sum_{j=1}^{M} x_{i^*,j} \left( p_{i^*,j} - \lambda p_{i^*,j} b_j \right),  (11)

s.t. \sum_{j=1}^{M} x_{i^*,j} = 1, \quad x_{i^*,j} \in \{0, 1\}, \forall j \in [M].  (12)

Eq.(11) holds because obtaining the optimal value for each new customer leads to the overall optimal value. When $\lambda$ is given, all sub-problems become independent. Compared to the original problem with $O(MN)$ decision variables, solving the sub-problem only



requires traversing the $M$ bonuses, reducing the computational complexity to $O(M)$. The optimal allocation policy is expressed as follows:

x_{i,j^*} = \mathbb{1}\{ j^* = \arg\max_j \, p_{i,j} - \lambda p_{i,j} b_j \},  (13)

where $\mathbb{1}\{\cdot\}$ is the 0/1 indicator function.

To solve the dual problem in Eq.(6), we can alternate between the following steps: (1) fix $\lambda^*$ and calculate the optimal allocation solution $X$ using Eq.(13), and (2) based on the fixed $X^*$, iterate $\lambda$ to satisfy the optimality conditions described in Eq.(9,10). To improve the efficiency of the iterative process for $\lambda$, we leverage the bisection algorithm [27, 33] for searching. The initial value of $\lambda$ is recommended to be set between 0 and 2 [26].
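As a concrete illustration of this procedure (our sketch, not the production system; the function names and the toy instance are invented for exposition), the greedy policy of Eq.(13) can be evaluated for a fixed $\lambda$ and the multiplier found by bisection, exploiting the fact that the policy's expected spend is non-increasing in $\lambda$:

import numpy as np

def greedy_allocation(p, b, lam):
    """Eq.(13): for each user pick j* = argmax_j p[i,j] - lam * p[i,j] * b[j]."""
    scores = p - lam * p * b              # b broadcasts over the bonus axis
    j_star = scores.argmax(axis=1)        # one bonus per user, Eq.(12)
    rows = np.arange(p.shape[0])
    spend = (p[rows, j_star] * b[j_star]).sum()  # expected cost, left side of Eq.(2)
    value = p[rows, j_star].sum()                # expected conversions, Eq.(1)
    return j_star, spend, value

def bisect_lambda(p, b, budget, lo=0.0, hi=2.0, tol=1e-8, max_iter=100):
    """Bisection on the multiplier; [lo, hi] follows the 0-to-2 initialization
    above and must bracket a budget-feasible lambda."""
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        _, spend, _ = greedy_allocation(p, b, mid)
        if spend > budget:
            lo = mid          # policy too generous, increase the penalty
        else:
            hi = mid          # within budget, try a smaller multiplier
        if hi - lo < tol:
            break
    return hi                 # the budget-feasible side of the bracket

# toy instance: 4 users, 3 candidate bonus amounts
p = np.array([[0.10, 0.15, 0.18],
              [0.02, 0.05, 0.06],
              [0.30, 0.31, 0.32],
              [0.08, 0.20, 0.25]])
b = np.array([1.0, 3.0, 5.0])
lam = bisect_lambda(p, b, budget=1.5)
policy, spend, value = greedy_allocation(p, b, lam)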

3.3 User Intent Detection Module

Due to the explosive development of deep learning, treatment effect estimation based on representation learning has become a hot topic. The architecture of representation learning is usually as shown in Figure 3 (a) [16, 20, 21, 30]. In order to prevent the influence of treatment variables from being lost during training, a multi-head network structure is often deployed. The loss in the network consists of two parts. $\mathcal{L}_{predict}$ represents the error between the predicted outcome $\hat{y}_k$ and the actual outcome $y_k$. In real scenarios, treatments are not always randomly issued, so the sample distribution of each treatment is not the same, which may cause the model to capture not the real causal relationship between treatment and outcome but a spurious one. In deep representation learning, we can transform the covariates from the original space to a latent space. $\mathcal{L}_{balance}$ represents the difference in the latent space between samples under different treatments, where the integral probability metric is widely used. However, since we obtain samples through randomized controlled trials in Meituan payment customer acquisition, this regularization term can be eliminated. In our scenario, the treatment corresponds to the bonus amount $t$, and we need to predict the effect on a customer under different bonus amounts, that is, the conversion probability.

3.3.1 Monotonic constrained.

Assumption 1. With other conditions held unchanged, the user conversion probability and the bonus amount maintain a monotonically increasing relationship.

Obviously, the bonus is a direct discount on the order amount in our scenario and does not include personalized creative content. Therefore, the higher the amount, the more attractive it is for user conversion. However, due to the presence of noise and the sparsity of the interaction data, we observe that even in randomized controlled trials, monotonicity cannot be guaranteed. To ensure monotonicity, as shown in Figure 3 (b), we propose a multi-treatment effect estimation model with monotonic constraints. We implicitly model the effect increment between two adjacent treatments and obtain individual treatment effects through accumulation. Next, we describe the model structure in detail.

A new customer $u$ is represented by a set of characteristics $(x, c, t)$, where $x$ includes the statistical characteristics of the user's historical behavior, such as the number of transaction orders and the average order amount in the past $n$ days, together with user identity descriptions, such as age, membership status, consumption preference level for each business line, etc.; $c$ represents real-time contextual features, such as the amount of the order, which business line it originated from, the order time, etc.; and $t$ represents the exposed bonus treatment. The output of the shared dense layers $L$ is defined as:

\phi = L(x),  (14)

where $\phi$ denotes the common user representation learned from the shared layers; it enables the efficient sharing of information across multiple treatments.

As shown in Figure 3 (b), $\delta_k$ is used as input to the head layers to model incremental effects. It includes two parts: context features $c$ and treatment information $\delta_k^t$. Since the contextual features are more important, drawing on the idea of the wide&deep model [7], we place them close to the output to strengthen the memorization of the model. $\delta_k^t$ includes the embeddings of the $k$-th and $(k-1)$-th treatments.

Figure 3: (a) Multi-treatment effect estimation pattern based on representation learning. A multi-head network structure is adopted, where $T$ represents the treatment. (b) The proposed multi-treatment effect estimation model with monotonic constraint. The outcome is calculated by accumulating the modeled positive effect increments between adjacent sorted treatments (a cumulative sum over the head outputs). (c) The proposed convex constrained model, realized by restricting the increments to decrease monotonically (a reverse cumulative sum followed by a weighted cumulative sum).

Considering the physical meaning of accumulation, $\delta_1^t$ will be used to model $\hat{y}_1$. For convenience, we construct a treatment placeholder $t_0$, and then $\delta_1^t$ contains the embeddings of $t_0$ and the first treatment. We define:

\hat{\Delta}_k = f_k([\phi; \delta_k]),  (15)

where the function $f_k(\cdot)$ is the head and $[\cdot;\cdot]$ denotes the concatenation of two vectors.

Finally, the prediction for user $u$ under the $t$-th bonus treatment is formulated as:

\hat{y}_t = \begin{cases} \hat{y}_1, & \text{if } t = 1 \\ \hat{y}_1 + \sum_{k=2}^{t} \hat{\Delta}_k, & \text{if } t > 1. \end{cases}  (16)

To implement the monotonic constraint, we only need to make the incremental effect $\hat{\Delta}_k$ greater than 0. This is easily achieved by squaring the raw output of the last head layer. The loss function of the proposed model is defined as follows:

\mathcal{L}_{predict} = -\frac{1}{N} \sum_{(u,y) \in \mathcal{D}} \left( y \log \hat{y}_t + (1-y) \log(1-\hat{y}_t) \right),  (17)

where $u = (x, c, t)$ is the feature tuple, $y$ is the label indicating whether the user converts, and $N$ is the number of samples in the entire sample space $\mathcal{D}$.

3.3.2 Convex constrained.

Proposition 3.1. Given $b_l$ and $b_u$ with $b_l \le b_j \le b_u, \forall j \in [M]$, the optimal value in Eq.(1) will increase when new bonus candidates are added.

Proposition 3.2. For the sub-problem (11), let $C_i = \{c_{i,j} \mid c_{i,j} = p_{i,j} \cdot b_j\}$ represent the expected cost set. Sort the $(c_{i,j}, p_{i,j})$ pairs in increasing order of $c_{i,j}$; then only pairs that are monotonically increasing on the convex hull can be decided.

Combining Propositions 3.1 and 3.2, in order to obtain a higher optimal value in the bonus allocation problem, we should add new optional bonuses or reduce the number of non-monotonic and non-convex cases.

Theorem 3.3. If $y = f(x)$ is a monotonically increasing and convex function and $y = g(x f(x))$, then $g(\cdot)$ is also a monotonically increasing and convex function if $x > 0$.

Thanks to Theorem 3.3, we can transform the convex constraint between conversion probability and expected cost into one between conversion probability and bonus amount, which is convenient for modeling. Therefore, we propose a multi-treatment effect estimation model with monotonically increasing and convex constraints.

As shown in Figure 3 (c), the model inference process consists of two steps. First, the increment of the effect slope is formulated as:

\hat{\nu}'_k = f_k([\phi; \delta_k]).  (18)

We square the output of the last head layer to ensure that $\hat{\nu}'_k \ge 0$, and then the effect slope $\hat{\nu}_k$ is obtained through reverse accumulation, so that it decreases monotonically as the bonus amount increases, satisfying the definition of a convex function:

\hat{\nu}_k = \sum_{i=k}^{M} \hat{\nu}'_i.  (19)

Secondly, we define the incremental effect $\hat{\Delta}_k$ between two adjacent treatments as:

\hat{\Delta}_k = \hat{\nu}_k \times \alpha_k,  (20)

where $\alpha_k$ indicates the amount difference between the $(k-1)$-th and the $k$-th bonus treatments. Finally, the predicted outcome is obtained by accumulating the effect increments, as shown in Eq.(16).

The loss function is identical to Eq.(17). Obviously, according to Eq.(16,19), the monotonically increasing and convex constraints between the outcome $\hat{y}$ and the bonus amount are realized.
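To make the two constrained heads concrete, here is a minimal PyTorch sketch (our illustration, not the authors' released code: layer widths follow Section 4.1.4, while the class name, the exact composition of $\delta_k$, and the choice to impose the constraints on pre-sigmoid scores are assumptions; a sigmoid over accumulated non-negative increments preserves monotonicity exactly but only approximates the convex shape):

import torch
import torch.nn as nn

class ConstrainedIntentModel(nn.Module):
    """Sketch of Figure 3(b)/(c): shared layers plus M heads whose outputs
    are accumulated into a shape-constrained response curve."""
    def __init__(self, x_dim, delta_dim, num_treatments, convex=False, hidden=128):
        super().__init__()
        self.convex = convex
        self.shared = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        def head():
            return nn.Sequential(nn.Linear(hidden + delta_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 32), nn.ReLU(),
                                 nn.Linear(32, 1))
        self.heads = nn.ModuleList(head() for _ in range(num_treatments))

    def forward(self, x, deltas, alpha=None):
        # deltas: [batch, M, delta_dim], context features plus the embeddings of
        # adjacent treatments; alpha: [M-1] bonus gaps (convex variant only)
        phi = self.shared(x)                                    # Eq.(14)
        raw = torch.stack(
            [h(torch.cat([phi, deltas[:, k]], dim=-1)).squeeze(-1)
             for k, h in enumerate(self.heads)], dim=1)         # [batch, M]
        base = raw[:, :1]                  # score of the first treatment
        nonneg = raw[:, 1:] ** 2           # squaring makes increments >= 0
        if not self.convex:                # Figure 3(b), Eqs.(15)-(16)
            increments = nonneg
        else:                              # Figure 3(c), Eqs.(18)-(20)
            slopes = nonneg.flip(1).cumsum(1).flip(1)   # reverse accumulation
            increments = slopes * alpha                 # Delta_k = nu_k * alpha_k
        scores = torch.cat([base, base + increments.cumsum(1)], dim=1)  # Eq.(16)
        return torch.sigmoid(scores)       # non-decreasing in the bonus index

Training indexes the predicted curve at the exposed treatment and applies the cross-entropy of Eq.(17), e.g. torch.nn.functional.binary_cross_entropy(model(x, deltas).gather(1, t), y) with t of shape [batch, 1].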
3.4 Feedback Control Module

As shown in Section 3.2, we iterate $\lambda$ to satisfy the budget constraint based on historical data. Unfortunately, obtaining the optimal $\lambda$ on the dynamic payment platform is extremely difficult. First, uncontrollable changes in factors including the customer profile, traffic, and coupon write-off rate make the distribution of the variables non-stationary. Second, the marketing campaign settings, such as budget and target audience, may be changed irregularly.

Feedback control, which deals with dynamic systems subject to feedback and outside noise [3], is widely adopted for its robustness and effectiveness. In our scenario, to cope with the fluctuation of the dynamic bonus allocation environment, we regard the Lagrangian multiplier $\lambda$ as the adjustable system input and use CAC as the performance indicator. Naturally, the constraint problem is transformed into a feedback control problem: a low $\lambda$ leads to a generous bonus distribution, so that CAC might be higher than the target and the budget may run out early. On the contrary, a high $\lambda$ results in a stingy bonus distribution, so that CAC might be lower than the target and the budget may not be fully spent.

Figure 4: Framework of the feedback control module. (The controller turns the error between the reference set by the budget limits and the measured value from the monitor into a control signal $u_\lambda(t)$; the actuator maps it to $\lambda(t+1)$ for online allocation.)

Figure 4 presents the framework of the feedback control module. The system output follows a desired control signal, called the reference, which is preset depending on the specific task. The monitor measures the actual value of the variable under online allocation and transmits it to the controller. The controller is designed to diminish the gap between the actual measured value and the reference and outputs the control signal, which is then transformed by the actuator into the system input signal sent back to the dynamic online allocation system.

The Proportional-Integral-Derivative (PID) controller [5] is the most popular feedback controller, especially in the absence of knowledge of the underlying process [6]. A PID controller continuously calculates the error $e(t)$ between the measured value $h(t)$ and the reference $r(t)$ at every time step $t$, and generates the control signal $u_\lambda(t)$ based on the combination of proportional, integral, and derivative terms of $e(t)$. The control signal $u_\lambda(t)$ is then used to adjust the Lagrangian multiplier $\lambda(t)$ through the actuator model $F_\lambda(\lambda(0), u_\lambda(t))$. It is practical and common to use discrete time steps $(t_1, t_2, \ldots)$ in online advertising scenarios, so the PID process can be formulated as the following equations, where $k_p$, $k_i$, and $k_d$ are the weight parameters of the PID controller:

e(t) = r(t) - h(t),  (21)

u_\lambda(t) = k_p e(t) + k_i \sum_{k=1}^{t} e(k) + k_d \left( e(t) - e(t-1) \right),  (22)

\lambda(t+1) = F_\lambda(\lambda(0), u_\lambda(t)).  (23)

4 EXPERIMENTS

To verify the effectiveness of the proposed framework, we conduct offline experiments and online A/B tests. We first introduce the datasets, baseline methods, and evaluation metrics, and then present the experimental results and analysis.

4.1 Experimental Setup

4.1.1 Experimental Settings. To demonstrate the effectiveness of the proposed framework, our experiments consist of three parts: a user intent detection evaluation part, a budget allocation evaluation part, and a feedback control evaluation part. We mainly conduct experiments on real industrial data, and the sample collection process is shown in Figure 2. Through an online randomized controlled trial, we collected data on 6.6 million orders and 2 million users within 11 days, including 51 optional bonus amounts. We randomly selected 5.6 million orders as the training set in the user intent detection module to train the multi-treatment effect estimation model.

4.1.2 Compared Methods.

• Meta-learner: Meta-learners decompose the causal problem into separate prediction problems that can be solved by standard machine learning algorithms and subsequently combined to estimate the causal parameters of interest [17]. We compare S-learner and X-learner [14] in the experiment.
• TARNET [20, 21]: As shown in Figure 3(a), TARNET is a general base structure for treatment effect estimation, which adopts a two-stage structure. Since we collected samples through randomized controlled trials, the regularization term $\mathcal{L}_{balance}$, which accounts for the difference in representation distribution under different treatments, was removed.
• Semi-black box model [27, 33]: Inspired by the demand curve in economics, the semi-black box model plays an important role in the response prediction of e-commerce. There are many kinds of demand curves, such as linear, log-linear, constant-elasticity, logit, and so on [18, 23]. Among them, the logit response curve is the most popular [19, 24, 27]. We use the acceptance rate prediction model in Meituan's multi-stage bonus allocation as a comparison method.
• Unconstrained: The model proposed in Section 3.3.1 and shown in Figure 3(b), except that we do not enforce the increments to be non-negative, yielding an unconstrained model.
• Monotonic constrained: As shown in Figure 3(b), we square the output of the last layer in the head layers to ensure that the incremental effect is non-negative.
• Convex constrained: As shown in Figure 3(c), the treatment effect estimation model with convex constraint proposed in Section 3.3.2.


4.1.3 Evaluation Protocol. For the user intent detection evaluation part, in addition to adopting the evaluation metrics commonly used for classification models, such as AUC and Logloss, we also evaluate counterfactual predictions, which is more important.

• The error of ATT estimation: $\epsilon_{ATT}$ is a common metric for standard causal effect estimation [8]. $\epsilon_{ATT}$ can be trivially extended to multiple treatments ($\epsilon_{mATT}$) by considering the average ATT between every possible pair of treatments.
• Reverse pair rate (RPR) [22]: For each order in the test set, we get its predicted probability under each bonus treatment and count the proportion of monotonicity violations over all possible treatment pairs.
• Upper left convex hull ratio (Ulchr): Based on Proposition 3.2, we verify the proportion of bonus treatments that meet the monotonic and convex relationship between expected cost and conversion probability (sketches of both checks follow this list).
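Both shape metrics are easy to state in code; the sketch below (ours, for illustration) computes RPR by pairwise counting and recovers, per Proposition 3.2, the decidable treatment set underlying the Ulchr-style check:

import numpy as np

def reverse_pair_rate(curves):
    """RPR: fraction of treatment pairs (j < k) with p_j > p_k, i.e.
    monotonicity violations, averaged over all orders.
    curves: [n_orders, M] predicted probability per bonus treatment."""
    n, m = curves.shape
    violations = 0
    for j in range(m):
        for k in range(j + 1, m):
            violations += int((curves[:, j] > curves[:, k]).sum())
    return violations / (n * m * (m - 1) / 2)

def decidable_bonuses(costs, probs):
    """Proposition 3.2: keep only (expected cost, probability) pairs on the
    monotonically increasing part of the upper convex hull; the greedy rule
    of Eq.(13) never selects the rest."""
    pts = sorted(zip(costs, probs))
    hull = []
    for q in pts:                          # monotone-chain upper hull
        while len(hull) >= 2:
            o, a = hull[-2], hull[-1]
            cross = (a[0] - o[0]) * (q[1] - o[1]) - (a[1] - o[1]) * (q[0] - o[0])
            if cross >= 0:                 # a lies on or below segment o->q
                hull.pop()
            else:
                break
        hull.append(q)
    keep = [hull[0]]                       # increasing prefix of the hull
    for q in hull[1:]:
        if q[1] > keep[-1][1]:
            keep.append(q)
        else:
            break
    return keep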
For the joint evaluation of the user intent detection module and the allocation module, the metric is as follows:

• The average expected response (AER): Following [34], we estimate the expected response over the entire feature space by the law of total expectation, which was proved to be an unbiased estimate of the expected response in the randomized controlled trial setting.

4.1.4 Implementation Details. We apply LightGBM (https://github.com/microsoft/LightGBM) as the base learner in the meta-learner methods; the learning rate, feature fraction, bagging fraction, bagging frequency, max bin, and boosting type are set to 0.1, 0.8, 0.8, 5, 100, and gbdt, respectively.

For the neural network-based models, in order to evaluate the generalization ability and ensure the fairness of the comparison, TARNET and our proposed models adopt the same hyperparameters: the shared layers are a single-layer MLP with dimension 128, and the head layers are three-layer MLPs with dimensions [64, 32, 1]. In the semi-black box model, hidden layer 0 is a single-layer MLP with dimension 128, and hidden layers 1 and 2 are three-layer MLPs with dimensions [64, 32, 1]. In the experiments, we train the neural networks with SGD using the Adam optimizer [13] with the hyperparameters $\epsilon = 0.0001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$. We conduct experiments on an NVIDIA Tesla V100 GPU with 16G memory.

4.2 User Intent Detection Evaluation

The intent detection results are shown in Table 1. From these results, we have the following insightful findings.

Table 1: Results of different intent detection methods.

Method         | AUC↑   | Logloss↓ | 𝜖_mATT↓  | RPR↓   | Ulchr↑
S-learner      | 0.9157 | 0.05908  | 0.000868 | 6.21%  | 14.39%
X-learner      | 0.9111 | 0.06215  | 0.000865 | 15.71% | 14.57%
Semi-black box | 0.9163 | 0.05885  | 0.002054 | 0.0%   | 92.31%
TARNET         | 0.9096 | 0.06010  | 0.002097 | 40.33% | 10.35%
Unconstrained  | 0.9170 | 0.05866  | 0.000976 | 29.46% | 18.90%
Monotonic      | 0.9173 | 0.05863  | 0.000792 | 0.0%   | 16.16%
Convex         | 0.9163 | 0.05871  | 0.000879 | 0.0%   | 100.0%

• In terms of AUC and Logloss, the evaluation metrics for factual predictions, the proposed monotonic constrained model and unconstrained model have nearly identical optimal performance, achieving 0.0010 and 0.0007 AUC gains, respectively, compared to the best baseline, the semi-black box model. The proposed convex constrained model has the same AUC as the semi-black box model, but its Logloss is 0.00014 lower.
• Regarding the effect estimation metric $\epsilon_{mATT}$, the monotonic constrained model outperforms the others; compared with the best baseline, the X-learner model, its $\epsilon_{mATT}$ is relatively reduced by 8.44%. However, the proposed convex constrained model performs worse than the compared methods S-learner and X-learner. Obviously, neither the semi-black box nor the TARNET model is a clear winner.
• Observing the RPR and Ulchr metrics, our proposed models strictly satisfy the constraints. Among the baseline methods, the S-learner violates the monotonic assumption only slightly, and the semi-black box model has more decidable bonus treatments.
• The performance of a method in predicting factual outcomes and in estimating counterfactual effects is not always consistent. For example, the semi-black box model has a high AUC but performs poorly on $\epsilon_{mATT}$.
• Among the meta-algorithms, it is counter-intuitive that X-learner performs worse than S-learner in our experiments. This is because X-learner was proposed to handle the situation where the number of units in one treatment group is much larger than in the other. However, in our RCTs, the number of samples per treatment is almost the same, so simpler models such as S-learner generalize better.
• Our proposed monotonic constrained model and convex constrained model outperform TARNET by a significant margin of about 62.23% and 58.08%, respectively, in terms of $\epsilon_{mATT}$. This result confirms the rationality of the assumptions and the superiority of the proposed methods. Additionally, we believe that the good performance is due to the more complete fusion of information in the head layers through the accumulation of outputs, as evidenced by the comparison between TARNET and the proposed unconstrained model: during the training of the $i$-th treatment sample, the first $i-1$ head layers are also involved in the training process.

In terms of AER, since there were not enough positive samples matched, the metric fluctuated greatly, so we performed regression processing; the results are shown in Figure 5. We have the following observations:

• The proposed monotonic constrained model outperforms the other compared methods; however, the convex constrained model is inferior to the S-learner method.
• The metrics AER and $\epsilon_{mATT}$ are approximately consistent.

Figure 5: The average expected response metrics of the methods under multiple budget constraints. The X-axis represents the relative increase in expected CAC compared to a given baseline, and the Y-axis represents the relative increase in expected CVR.


From the experimental results, we found that the convex constrained model does not perform well in estimating the treatment effect. As shown in Figure 6, we randomly selected 10 users from the test set and observed their normalized conversion probabilities for each bonus treatment. It can be observed that the predictions of the models fully satisfy the respective constraints. However, because the convex constraints are too strict, the predicted user response trends are relatively consistent across users and lack fitting ability. Even though it is proved that the convex constraint can improve the upper bound of the optimal value, it is necessary to consider the damage caused by the strict constraint to the estimation of the treatment effect. Although the convex constraints are too strong in our scenario, we believe that the convex constrained model or a weakened version of it may work in other scenarios with diminishing marginal gains.

Figure 6: For the case study, we randomly pick ten users and observe their conversion probabilities for each bonus treatment predicted by the model. Panel (a) comes from the convex constrained model, panel (b) from the monotonic constrained model. In addition, to facilitate the observation of differences between user response curves, the probabilities are min-max normalized.

4.3 Bonus Allocation Optimization Evaluation

We evaluate the optimality ratio of the Lagrangian dual solution against the LP solution (i.e., the upper bound). In our optimization problem, there are millions of users and dozens of bonuses. Since current LP solvers have difficulty handling problems at such a scale [32], we formulate some modest-sized problems by sampling customers and bonuses for experimentation. We adopt the SCIP solver from the popular Google OR-tools LP toolkit, which is a framework for constraint integer programming and branch-cut-and-price.

We test the optimality ratios for $|user| \in \{5k, 10k, 20k, 30k\}$ across $|bonuses| \in \{5, 15, 30, 51\}$. The results show that the optimality ratio is above 99.99%, and the gap is negligible, with a maximum of $5.4 \times 10^{-16}$.
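For reference, the exact optimum on a subsampled instance can be obtained by formulating Eqs.(1)-(4) directly; the sketch below (our reconstruction of such a check, not the authors' experimental harness) uses the SCIP backend of OR-tools:

from ortools.linear_solver import pywraplp

def solve_mckp_exact(p, b, budget):
    """Solve Eqs.(1)-(4) exactly on a small sampled instance via SCIP."""
    n_users, n_bonuses = len(p), len(b)
    solver = pywraplp.Solver.CreateSolver("SCIP")
    x = {(i, j): solver.BoolVar(f"x_{i}_{j}")
         for i in range(n_users) for j in range(n_bonuses)}
    # budget constraint, Eq.(2): only converted users consume budget
    solver.Add(sum(b[j] * p[i][j] * x[i, j]
                   for i in range(n_users) for j in range(n_bonuses)) <= budget)
    # exactly one bonus per user, Eq.(3)
    for i in range(n_users):
        solver.Add(sum(x[i, j] for j in range(n_bonuses)) == 1)
    # objective, Eq.(1): expected number of conversions
    solver.Maximize(sum(p[i][j] * x[i, j]
                        for i in range(n_users) for j in range(n_bonuses)))
    assert solver.Solve() == pywraplp.Solver.OPTIMAL
    return solver.Objective().Value()

The optimality ratio reported above is then the value achieved by the Lagrangian greedy policy divided by this exact optimum.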

4.4 Feedback Control Evaluation

4.4.1 Implementation Details. We adopt the linear model actuator [31] shown in Eq.(24), where $u_\lambda(t)$ is derived by Eq.(22) in the PID controller and $\lambda$ is restricted between $\lambda_{min}$ and $\lambda_{max}$ to guarantee system robustness. In addition, there are two key points in practical applications. First, the reference is frequently modified for practical reasons. Second, we actually care about the accumulated CAC when the campaign ends, and the real-time CAC of each time step contributes quite differently to the accumulated CAC. Therefore, we modify the error $e(t)$ as shown in Eq.(25,26), where $r(t)$ and $h(t)$ respectively refer to the reference accumulated CAC and the measured accumulated CAC, and $conversion(i)$ represents the number of conversions in time step $i$. With this modification, the PID controller constantly increases its attention to the accumulated CAC and can be adapted to scenarios where the reference is frequently modified.

\lambda(t+1) = \lambda(t) \left( 1 + u_\lambda(t) \right),  (24)

e(t) = \frac{r(t) - h(t)}{r(t)},  (25)

h(t) = \frac{\sum_{i=1}^{t} conversion(i) \cdot CAC_{measured}(i)}{\sum_{i=1}^{t} conversion(i)}.  (26)

To determine the weight parameters of the PID controller, we use the grid-search method to find the best setting on the training dataset and apply it to the test dataset. We regard every hour as a time step, so the maximum number of time steps $T$ equals 24.
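Putting Eqs.(21)-(26) together, one hourly control step might look as follows (our sketch; the clipping bounds are placeholders since the paper does not report $\lambda_{min}$ and $\lambda_{max}$, and because CAC moves inversely with $\lambda$, the signs of the grid-searched gains must absorb that reverse-acting relationship):

class PIDController:
    """Discrete PID, Eqs.(21)-(22): error -> control signal u_lambda(t)."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0          # running sum of errors (I term)
        self.prev_error = 0.0        # last error (D term)

    def signal(self, error):
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def accumulated_cac(conversions, cac_per_step):
    """h(t), Eq.(26): conversion-weighted accumulated CAC so far."""
    total = sum(conversions)
    return sum(n * c for n, c in zip(conversions, cac_per_step)) / total

def control_step(pid, lam, reference, conversions, cac_per_step,
                 lam_min, lam_max):
    """One time step: Eq.(25) error, Eq.(22) signal, Eq.(24) linear actuator."""
    error = (reference - accumulated_cac(conversions, cac_per_step)) / reference
    u = pid.signal(error)
    new_lam = lam * (1.0 + u)                       # linear actuator, Eq.(24)
    return min(max(new_lam, lam_min), lam_max)      # clip for robustness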

4.4.2 Evaluation Protocol. The goal of the feedback control module is to make the actual daily CAC approach the set reference value by adjusting the Lagrangian multiplier.

We verify the effectiveness of the feedback control strategy from the following three aspects.

• Accuracy: As shown in Figure 7, we observe the relative error of the feedback control strategy for each day within one month. The minimum absolute value is 0.026%, and the maximum is 5.35%, on 12-16. This is because the reference value changed at 19:00 on that day, and the feedback control was too late to adjust. The average absolute relative error for the whole month is 0.84%, which satisfies the business requirements well.


Figure 7: The relative error of feedback control per day, defined as $(CAC_{measured} - CAC_{reference})/CAC_{reference}$.

• Efficiency: Usually, it takes three days to adjust the online CAC to the reference value manually. However, the feedback control is in place on the same day, so the efficiency is increased by 300%.
• Optimality: We perform a competitive analysis of the online decision algorithm with the feedback control strategy. As shown in Table 2, aligned with the budget limits of the online solution, we solve the optimal policy offline and then compare the optimal values obtained in the online and offline scenarios. The online solution demonstrates near-optimal performance, with a negligible optimality gap of at most $1.5 \times 10^{-4}$.

Table 2: Competitive analysis of solutions.

        | Lagrangian multiplier 𝜆 | Optimal value
Offline | 3.66 × 10^-5            | 4085.56
Online  | 3.59 × 10^-5 *          | 4084.95
* The mean value regulated by feedback control.

4.5 Performance of Online A/B Tests

Considering the potential negative impact on the business and the offline evaluation results, we did not deploy all the baseline methods online. With the development of our business, we have successively deployed S-learner, the monotonic constrained model, and the convex constrained model to the online system. Each model serves millions of orders every day. For a fair comparison, we control the orders and the new customers involved in the A/B testing to be homogeneous.

Table 3: Online A/B test results.

Method                             | CVR gain                | CAC gain
Monotonic constrained vs S-learner | -0.44% (p-value=0.68)   | -5.07% (p-value=0.0082)
Convex constrained vs S-learner    | -2.59% (p-value=0.0327) | -6.62% (p-value<0.001)

From Table 3, we see that our proposed monotonic constrained model achieves a significant -5.07% reduction in CAC with almost the same CVR (-0.44%) compared with the online mainstream S-learner model. In addition, compared with the S-learner model, the CAC of the proposed convex constrained model is reduced by 6.62%, while the conversion rate is reduced by 2.59%. The results of the online A/B experiment are basically consistent with the offline analysis.

5 RELATED WORK

Bonus allocation (of coupons, budgets, discounts, etc.) is a typical problem in the field of intelligent marketing: incentives are issued to users to promote user conversion in order to achieve business goals. Mainstream solutions often include two stages, a combination of machine learning and operations research. The first stage is user intent detection, and the second stage is to solve the assigned optimization problem.

There are some studies [2, 9, 15, 22, 27, 33] on marketing allocation. [33] propose a unified framework in which the user intent detection part extends the traditional logit response model to a semi-black-box model. [27] further improve the method to a two-tower structure, corresponding to fitting the two parameters of the logit model. [15] selects the best model from a number of uplift modeling techniques, such as two-models, transformed outcome, and X-learner. [9] leverages Retrospective Estimation, a modeling approach that relies exclusively on data from positive outcome examples. In addition, user intent detection is a multi-treatment effect estimation problem, with tree-based [1, 4, 25] and representation learning methods [10, 16, 20, 21, 30]. It has come to our attention that a similar approach with monotonic constraints has been used for user intent detection [22]. However, our method differs in that we utilize an end-to-end training approach and a parallel multi-head network structure.

As one of the most effective methods, the Lagrangian dual method is widely deployed in the second stage for online decision-making [1, 15, 27, 33]. [2] provides a real-time adaptive method and extends the existing literature by addressing cases with negative weights and values. To address the dynamic environment in marketing, [27] apply some simple periodic control strategies, which require additional workload at the expense of efficiency. We take advantage of feedback control theory, which has been proven effective in display advertising by the work of [12, 28, 31], and adapt it to the marketing scenario after delicate improvement.

6 CONCLUSION

In this paper, we formulate the online bonus allocation problem as an MCKP. To overcome its challenges, we propose a framework consisting of three modules: a User Intent Detection module, an Online Bonus Allocation module, and a Feedback Control module. In the User Intent Detection module, in order to overcome the non-monotonicity of the response curve, we propose a multi-treatment effect prediction model with monotonic constraints. To reduce the optimality gap between the user intent detection and optimal allocation stages, a convex constrained detection model is further proposed. Facing the challenge that the distribution of online orders and users changes over time, we leverage the feedback control strategy to make online costs approach the budget limit more accurately. Offline and online experimental results demonstrate significant improvement compared with baseline methods.
5036
KDD ’23, August 6–10, 2023, Long Beach, CA, USA Chao Wang et al.

REFERENCES A Framework for Multi-Stage Bonus Allocation in Meal Delivery Platform. In


[1] Meng Ai, Biao Li, Heyang Gong, Qingwei Yu, Shengjie Xue, Yuan Zhang, Yunzhou Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data
Mining (Washington DC, USA) (KDD ’22). Association for Computing Machinery,
Zhang, and Peng Jiang. 2022. LBCF: A Large-Scale Budget-Constrained Causal
New York, NY, USA, 4195–4203. https://doi.org/10.1145/3534678.3539202
Forest Algorithm. In Proceedings of the ACM Web Conference 2022. 2310–2319.
[28] Xun Yang, Yasong Li, Hao Wang, Di Wu, Qing Tan, Jian Xu, and Kun Gai. 2019.
[2] Javier Albert and Dmitri Goldenberg. 2022. E-Commerce Promotions Personaliza-
Bid optimization by multivariable control in display advertising. In Proceedings
tion via Online Multiple-Choice Knapsack with Uplift Modeling. In Proceedings of
of the 25th ACM SIGKDD international conference on knowledge discovery & data
the 31st ACM International Conference on Information & Knowledge Management.
mining. 1966–1974.
2863–2872.
[29] Liuyi Yao, Zhixuan Chu, Sheng Li, Yaliang Li, Jing Gao, and Aidong Zhang. 2021.
[3] Karl Johan Åström and Panganamala Ramana Kumar. 2014. Control: A perspec-
A survey on causal inference. ACM Transactions on Knowledge Discovery from
tive. Autom. 50, 1 (2014), 3–43.
Data (TKDD) 15, 5 (2021), 1–46.
[4] Susan Athey, Julie Tibshirani, and Stefan Wager. 2019. Generalized random
[30] Liuyi Yao, Sheng Li, Yaliang Li, Mengdi Huai, Jing Gao, and Aidong Zhang. 2018.
forests. (2019).
Representation learning for treatment effect estimation from observational data.
[5] Stuart Bennett. 1993. Development of the PID controller. IEEE Control Systems
Advances in Neural Information Processing Systems 31 (2018).
Magazine 13, 6 (1993), 58–62.
[31] Weinan Zhang, Yifei Rong, Jun Wang, Tianchi Zhu, and Xiaofan Wang. 2016.
[6] Stuart Bennett. 1993. A history of control engineering, 1930-1955. Number 47. IET.
Feedback control of real-time display advertising. In Proceedings of the Ninth
[7] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra,
ACM International Conference on Web Search and Data Mining. 407–416.
Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al.
[32] Xingwen Zhang, Feng Qi, Zhigang Hua, and Shuang Yang. 2020. Solving Billion-
2016. Wide & deep learning for recommender systems. In Proceedings of the 1st
Scale Knapsack Problems. In Proceedings of The Web Conference 2020 (Taipei,
workshop on deep learning for recommender systems. 7–10.
Taiwan) (WWW ’20). Association for Computing Machinery, New York, NY, USA,
[8] Lu Cheng, Ruocheng Guo, Raha Moraffah, Paras Sheth, Kasim Selcuk Candan, and
3105–3111. https://doi.org/10.1145/3366423.3380084
Huan Liu. 2022. Evaluation methods and measures for causal learning algorithms.
[33] Kui Zhao, Junhao Hua, Ling Yan, Qi Zhang, Huan Xu, and Cheng Yang. 2019. A
IEEE Transactions on Artificial Intelligence (2022).
unified framework for marketing budget allocation. In Proceedings of the 25th
[9] Dmitri Goldenberg, Javier Albert, Lucas Bernardi, and Pablo Estevez. 2020. Free
ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
lunch! retrospective uplift modeling for dynamic promotions recommendation
1820–1830.
within roi constraints. In Proceedings of the 14th ACM Conference on Recommender
[34] Yan Zhao, Xiao Fang, and David Simchi-Levi. 2017. Uplift modeling with mul-
Systems. 486–491.
tiple treatments and general response types. In Proceedings of the 2017 SIAM
[10] Fredrik Johansson, Uri Shalit, and David Sontag. 2016. Learning representations
International Conference on Data Mining. SIAM, 588–596.
for counterfactual inference. In International conference on machine learning.
PMLR, 3020–3029.
[11] Daniel Kahneman and Amos Tversky. 2013. Prospect theory: An analysis of A APPENDIX
decision under risk. In Handbook of the fundamentals of financial decision making:
Part I. World Scientific, 99–127. A.1 Proofs
[12] Niklas Karlsson and Jianlong Zhang. 2013. Applications of feedback control in
online advertising. In 2013 American control conference. IEEE, 6008–6013. A.1.1 Proposition 3.1. Given 𝑏𝑙 and 𝑏𝑢 , there exists 𝑏𝑙 ≤ 𝑏 𝑗 ≤
[13] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic opti- 𝑏𝑢 , ∀𝑗 ∈ [𝑀], then the optimal value in Eq.(1) will increase, when
mization. arXiv preprint arXiv:1412.6980 (2014).
[14] Sören R Künzel, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu. 2019. Metalearners for adding new bonus candidates.
estimating heterogeneous treatment effects using machine learning. Proceedings
of the national academy of sciences 116, 10 (2019), 4156–4165. Proof. Under the Lagrangian relaxation paradigm, we prove
[15] Liangwei Li, Liucheng Sun, Chenwei Weng, Chengfu Huo, and Weijun Ren. 2020. that the greedy strategy Eq.(13) is optimal, indicating that for a
given 𝜆, there is no better strategy X ∗ such that when the cost
Spending money wisely: Online electronic coupon allocation based on real-time
user intent detection. In Proceedings of the 29th ACM International Conference on
Information & Knowledge Management. 2597–2604. increase Δ𝐶, the optimal value increment Δ𝑉 is greater than 𝜆 ∗ Δ𝐶.
[16] Belbahri Mouloud, Gandouet Olivier, and Kazma Ghaith. 2020. Adapting neural It can be easily proved by contradiction, assuming that there is at
networks for uplift models. arXiv preprint arXiv:2011.00041 (2020).
[17] Gabriel Okasa. 2022. Meta-Learners for Estimation of Causal Effects: Finite least one customer 𝑖, under the more optimal strategy, the allocation
Sample Cross-Fit Performance. arXiv preprint arXiv:2201.12692 (2022). result changes from 𝑥𝑖,𝑗 = 1 to 𝑥𝑖,𝑗 ∗ = 1. Since Δ𝑉 (𝑖) > 𝜆 ∗ Δ𝐶 (𝑖), it
[18] Robert L Phillips. 2021. Pricing and revenue optimization. In Pricing and Revenue
Optimization. Stanford university press.
yields 𝑝𝑖,𝑗 ∗ − 𝜆𝑝𝑖,𝑗 ∗ 𝑏 𝑗 ∗ > 𝑝𝑖,𝑗 − 𝜆𝑝𝑖,𝑗 𝑏 𝑗 , which conflicts with Eq.(13).
[19] Huashuai Qu, Ilya O Ryzhov, and Michael C Fu. 2013. Learning logistic demand However, when new bonus candidates are added, the solution space
curves in business-to-business pricing. In 2013 Winter Simulations Conference expands. Since 𝑁 is very large, according to the optimal greedy
(WSC). IEEE, 29–40.
[20] Patrick Schwab, Lorenz Linhardt, and Walter Karlen. 2018. Perfect match: A allocation, some customers are assigned to new bonus candidates.
simple method for learning representations for counterfactual inference with This part of the group has Δ𝑉 > 𝜆 ∗ Δ𝐶, leading to an increase
neural networks. arXiv preprint arXiv:1810.00656 (2018). in the optimal value. Figure 8 shows the relationship between the
[21] Uri Shalit, Fredrik D Johansson, and David Sontag. 2017. Estimating individual
treatment effect: generalization bounds and algorithms. In International Confer- number of bonuses and the optimal value in our actual business.
ence on Machine Learning. PMLR, 3076–3085. Thus, we complete the proof. □
[22] Yitao Shen, Yue Wang, Xingyu Lu, Feng Qi, Jia Yan, Yixiang Mu, Yao Yang, YiFan
Peng, and Jinjie Gu. 2021. A framework for massive scale personalized promotion.
arXiv preprint arXiv:2108.12100 (2021). A.1.2 Proposition 3.2. For the sub-problem (11), let 𝐶𝑖 = {𝑐𝑖,𝑗 |𝑐𝑖,𝑗 =
[23] Kalyan T Talluri, Garrett Van Ryzin, and Garrett Van Ryzin. 2004. The theory and 𝑝𝑖,𝑗 ∗ 𝑏 𝑗 } represents the expected cost set. Sort the (𝑐𝑖,𝑗 , 𝑝𝑖,𝑗 ) pairs
practice of revenue management. Vol. 1. Springer.
[24] Ruben van de Geer, Arnoud V den Boer, Christopher Bayliss, Christine SM Currie, in increasing order of 𝑐𝑖,𝑗 , then only pairs that are monotonically
Andria Ellina, Malte Esders, Alwin Haensel, Xiao Lei, Kyle DS Maclean, Antonio increasing on the convex hull can be decided.
Martinez-Sykora, et al. 2019. Dynamic pricing and learning with competition:
insights from the dynamic pricing challenge at the 2017 INFORMS RM & pricing Proof. As shown in Figure 9, we discuss non-monotonic and
conference. Journal of Revenue and Pricing Management 18, 3 (2019), 185–203.
[25] Stefan Wager and Susan Athey. 2018. Estimation and inference of heterogeneous
non-convex case by case. First, for the non-monotonic increasing
treatment effects using random forests. J. Amer. Statist. Assoc. 113, 523 (2018), point (𝑐𝑖,𝑑 , 𝑝𝑖,𝑑 ), it exists 𝑝𝑖,𝑑 − 𝑝𝑖,𝑐 < 0 < 𝑐𝑖,𝑑 − 𝑐𝑖,𝑐 . We assume
1228–1242. that 𝑐𝑖,𝑏 is assigned to user 𝑖 under the greedy allocation strategy
[26] Stephen J Wright. 2015. Coordinate descent algorithms. Mathematical Program-
ming 151, 1 (2015), 3–34. Eq.(13). Then it can yields 𝑝𝑖,𝑑 − 𝜆𝑐𝑖,𝑑 ≥ 𝑝𝑖,𝑐 − 𝜆𝑐𝑖,𝑐 , indicating
[27] Zhuolin Wu, Li Wang, Fangsheng Huang, Linjun Zhou, Yu Song, Chengpeng that 𝑝𝑖,𝑑 − 𝑝𝑖,𝑐 ≥ 𝜆(𝑐𝑖,𝑑 − 𝑐𝑖,𝑐 ), which contradicts the truth of non-
Ye, Pengyu Nie, Hao Ren, Jinghua Hao, Renqing He, and Zhizhao Sun. 2022. monotonic increasing. For the non-convex point (𝑐𝑖,𝑏 , 𝑝𝑖,𝑏 ), we also


assume that it is assigned, indicating that:

p_{i,b} - \lambda c_{i,b} \ge p_{i,a} - \lambda c_{i,a},  (27)

p_{i,b} - \lambda c_{i,b} \ge p_{i,c} - \lambda c_{i,c},  (28)

Further, it yields:

\frac{p_{i,b} - p_{i,a}}{c_{i,b} - c_{i,a}} - \frac{p_{i,c} - p_{i,b}}{c_{i,c} - c_{i,b}} \ge 0,

which contradicts the definition of non-convexity. Thus, we complete the proof. □

Figure 9: A toy example to assist in proving Proposition 3.2. The expected costs $c_{i,a}, \ldots, c_{i,f}$ are sorted in increasing order on the X-axis against the conversion probabilities on the Y-axis; the red dotted line indicates the convex hull, with one non-convex and one non-monotonic point marked.

A.1.3 Theorem 3.3. If $y = f(x)$ is a monotonically increasing and convex function and $y = g(x f(x))$, then $g(\cdot)$ is also a monotonically increasing and convex function if $x > 0$.

Proof. $y = f(x)$ yields $x = f^{-1}(y)$, thus $g^{-1}(y) = y f^{-1}(y)$. The derivatives of $f^{-1}$ are as follows:

\left( f^{-1} \right)' = \frac{1}{f'},  (29)

\left( f^{-1} \right)'' = -\frac{f''}{(f')^{3}},  (30)

where $'$ denotes the first derivative and $''$ the second derivative.

Since $f$ is a convex function, according to the definition of the convex function, it can be deduced that $f'' \le 0$. It also yields $f' > 0$ from the definition of monotonically increasing. Therefore, Eq.(30) further yields $\left( f^{-1} \right)'' \ge 0$.

Let $h(y) = y f^{-1}(y)$; the derivatives are as follows:

h' = y \cdot (f^{-1})' + f^{-1},  (31)

h'' = y \cdot (f^{-1})'' + 2 \cdot (f^{-1})'.  (32)

Since $x = f^{-1}(y) > 0$, it yields $h' > 0$ and $h'' > 0$. Therefore, the function $h$, that is, the function $g^{-1}$, is a monotonically increasing and concave function. It can further be proved that $g(\cdot) = \left( g^{-1} \right)^{-1}(\cdot)$ is a monotonically increasing and convex function by repeating the above steps Eq.(29,30). Thus, we complete the proof. □