ICDM Uplift
Abstract—Uplift modeling has shown very promising results in online marketing. However, most existing works are prone to the robustness challenge in some practical applications. In this paper, we first present a possible explanation for the above phenomenon. We verify that there is a feature sensitivity problem in online marketing using different real-world datasets, where the perturbation of some key features will seriously affect the performance of the uplift model and even cause the opposite trend. To solve the above problem, we propose a novel robustness-enhanced uplift modeling framework with adversarial feature desensitization (RUAD). Specifically, our RUAD can more effectively alleviate the feature sensitivity of the uplift model through two customized modules, including a feature selection module with joint multi-label modeling to identify a key subset from the input features, and an adversarial feature desensitization module using adversarial training and soft interpolation operations to enhance the robustness of the model against this selected subset of features. Finally, we conduct extensive experiments on a public dataset and a real product dataset to verify the effectiveness of our RUAD in online marketing. In addition, we also demonstrate the robustness of our RUAD to the feature sensitivity, as well as its compatibility with different uplift models.

Index Terms—Uplift modeling, Robustness, Adversarial training, Feature desensitization

I. INTRODUCTION

One of the critical tasks in each service platform is to increase user engagement and platform revenue through online marketing, which uses some well-designed incentives and then delivers them to the platform users, such as coupons, discounts and bonuses [1]. Since each incentive usually comes with a cost, successful online marketing needs to accurately find the corresponding sensitive user group for each of them to avoid ineffective delivery. To achieve this goal, an important step is that the marketing model needs to identify the change in the user's response caused by different incentives, and only deliver each incentive to its high-uplift users. This involves a typical causal inference problem, i.e., the estimation of the individual treatment effect (ITE) [2] (also known as the uplift), since we usually only observe one type of user response in practice, which may be for a certain incentive (i.e., treatment group) or for no incentive (i.e., control group). Therefore, uplift modeling has been proposed in previous works and its effectiveness has been verified in online marketing [3]–[5].

According to the research ideas, the existing uplift modeling methods can broadly be categorized into three research lines: 1) Meta-learner-based. The basic idea of this line is to estimate the users' responses by using existing predictive models as the base learner. Two of the most representative methods are S-Learner and T-Learner [6], which adopt a global base learner and two base learners corresponding to the treatment and control groups, respectively. 2) Tree (or forest)-based. The basic idea of this line is to employ a hierarchical tree structure to systematically partition the user population into sub-populations that exhibit sensitivity to specific treatments. An essential step involves modeling the uplift directly by applying diverse splitting criteria, including considerations of distribution divergences [7] and expected responses [8], [9]. 3) Neural network-based. The basic idea of this line is to leverage the power of neural networks to develop estimators that are both intricate and versatile in predicting the user's response. Note that most of them can be seen as variants of meta-learners. In this paper, we focus on the neural network-based line because such models can be more flexibly adapted to modeling the complex feature interactions in many industrial systems. Furthermore, due to the widespread use of various neural network models in these systems, research on this line is also easier to seamlessly integrate than alternative lines.

Although existing works on uplift modeling have shown very promising results, they generally suffer from the robustness challenge in many real-world scenarios [10], [11], and little research has been conducted to reveal how such challenges arise. In this paper, we first identify a feature sensitivity problem in the uplift model as a possible explanation for the above phenomenon using different real-world datasets. Specifically, for each dataset, we randomly select 30% of all continuous-valued features and apply a Gaussian noise with η ∼ N(0, 0.05²) as a perturbation to them. We repeat this process multiple times to obtain a set of copies with different feature subsets. Finally, we train the same uplift models for each copy and compare their performance with that obtained on the original dataset. Due to space limitations, we show only the results of using S-Learner [4] as the uplift model on the Production dataset used in the experiment in Fig. 1; similar results are also found on other datasets and uplift models. We can find that there are some sensitive key features, and a slight perturbation to them will seriously affect the performance of the uplift model, and even an opposite trend appears.

The above empirical findings suggest that the sensitivity of uplift models to these key features may be one of the important reasons for their robustness challenges. Therefore, to alleviate the feature sensitivity problem, we propose a novel robustness-enhanced uplift modeling framework with adversarial feature desensitization, or RUAD for short. Our RUAD contains two new custom modules that match our empirical findings and can be integrated with most existing uplift models to improve their robustness. Specifically, a feature selection module with
joint multi-label modeling will be used to identify the desired set of key sensitive features from the original dataset under the supervision of a trade-off optimization objective. Then, an adversarial feature desensitization module performs an adversarial training operation and a soft interpolation operation based on the selected subset of features, to force the model to reduce its sensitivity to them, thus effectively truncating a key source of robustness challenges. Finally, we experimentally verify the effectiveness of our RUAD on a public dataset and a real product dataset.

We summarize the main contributions of this work in the following:
• We provide the first possible explanation and empirical study for the robustness challenge that uplift models suffer in practice, i.e., the problem of feature sensitivity driven by a set of key sensitive features.
• We design a feature selection module with joint multi-label modeling to select the desired set of key sensitive features from all the input features, which is consistent with our empirical findings.
• We propose an adversarial feature desensitization module that uses an adversarial training operation to remove the sensitivity to these key sensitive features from the uplift model, thereby improving the model's robustness.
• We conduct extensive experiments on a public dataset and a real product dataset, and the experimental results demonstrate the effectiveness, robustness and compatibility of our RUAD.

The structure of this paper is organised as follows: we present some necessary preliminaries in Sec. II, and a detailed description of the method is given in Sec. III; we analyze and discuss extensive experimental results in Sec. IV; we briefly introduce some related works in Sec. V, and present a conclusion and some future works in Sec. VI.

Fig. 1. Bar graphs of predicted uplift with 5 bins, w.r.t. the original dataset (i.e., (a)) and three kinds of varieties (i.e., (b)–(d): perturbation on feature sets 1–3). For each dataset, we randomly select 30% of all continuous-valued features and apply a Gaussian noise with η ∼ N(0, 0.05²) as a perturbation while constraining ∥η∥∞ < 0.1. Note that a good uplift model will usually have a bar graph sorted in descending order.

II. PRELIMINARIES

We consider a binary treatment t ∈ {0, 1} as an indicator variable, i.e., whether to get an incentive delivery. Note that the proposed framework can also be easily extended to other types of uplift modeling problems. For a user i, the change in user response caused by an incentive ti, i.e., the individual treatment effect or uplift, denoted as τi, is defined as the difference between the treatment response and the control response,

τi = yi(1) − yi(0),    (1)

where yi(0) and yi(1) are the user responses of the control and treatment groups, respectively.

In the ideal world, i.e., obtaining the responses of a user in both groups simultaneously, we can easily determine the uplift τi based on Eq.(1). However, in the real world, usually only one of the two responses is observed for any one user. For example, if we have observed the response of a customer who receives the discount, it is impossible for us to observe the response of the same customer when they do not receive a discount, where such responses are often referred to as counterfactual responses. Therefore, the observed response can also be described as,

yi = ti yi(1) + (1 − ti) yi(0).    (2)

For the brevity of notation, we will omit the subscript i in the following if no ambiguity arises.

As mentioned above, the uplift τ is not identifiable since the observed response y is only one of the two necessary terms (i.e., y(1) and y(0)). Fortunately, with some appropriate assumptions [2], [13], we can use the conditional average treatment effect (CATE) as an estimator for the uplift, where CATE is defined as,

τ(x) = E(Y(1) | X = x) − E(Y(0) | X = x)
     = E(Y | T = 1, X = x) − E(Y | T = 0, X = x) = µ1(x) − µ0(x).    (3)

Intuitively, the desired objective τ(x) can be described as the difference between two conditional means τ(x) = µ1(x) − µ0(x).

B. Adversarial Training

Previous works in different research fields have demonstrated the powerful ability of adversarial training to improve the model robustness [14], [15]. From a robust optimization
perspective, a general adversarial training framework can be formulated as [16],

θ∗ ← arg min_{θ∈Θ} E_{(x,y)∼D} [max_{δ∈B(ϵ)} L(x + δ, y; θ)],    (4)

where L(·, ·) is the loss function, D is the distribution of the training set, and θ is the model parameters. B(ϵ) is the allowed perturbation set, which can be expressed as B(ϵ) := {δ ∈ R^m | ∥δ∥p ≤ ϵ}. Intuitively, adversarial training aims to find the perturbation direction that has the greatest impact on the model and adjust the model to adapt to it. In this paper, we will use adversarial training to remove the sensitivity of the uplift model to some key features.

III. METHODOLOGY

According to the empirical findings in Sec. I, a key source of robustness challenges for uplift models is their sensitivity to some key features. Therefore, this motivates us to design an effective framework to constrain the effect of feature sensitivity during the training of the uplift model, thereby improving the robustness of the model.

B. Base Model

Since our RUAD is model-agnostic, it can be integrated with most existing uplift models. For the convenience of the subsequent description, we use T-Learner [2] as the base model for an example, but different uplift models will be integrated in the experiments to verify the compatibility of our RUAD. In T-Learner, the samples in the treatment group and the control group will be trained respectively by two base learners to obtain the two conditional mean functions required in Eq.(3). Specifically, we consider a T-Learner with the following components:
• Treatment Response Network: A multi-layer perceptron (MLP) for estimating the user response in the treatment group, i.e., µ1(x) ← ŷ(1) = ft(x, t = 1).
• Control Response Network: An MLP for estimating the user response in the control group, i.e., µ0(x) ← ŷ(0) = fc(x, t = 0).
ft and fc are the mapping functions of the corresponding MLP layers, and note that for each sample, its predicted response during training will only use one of the two.
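As a minimal illustration of this base model, the sketch below fits two independent base learners on the treatment and control groups and estimates the uplift as the difference of their predictions. For self-containment, we use ordinary least-squares regressors in place of the two MLPs ft and fc; this substitution and the toy data are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares base learner standing in for an MLP (illustrative only)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # add a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict_linear(w, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

def t_learner_uplift(X, t, y, X_new):
    """T-Learner: fit separate models on t=1 and t=0 samples,
    then estimate tau(x) = mu1(x) - mu0(x)."""
    w_t = fit_linear(X[t == 1], y[t == 1])  # treatment response model f_t
    w_c = fit_linear(X[t == 0], y[t == 0])  # control response model f_c
    return predict_linear(w_t, X_new) - predict_linear(w_c, X_new)

# Toy population with a known constant uplift of 2.0.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
t = rng.integers(0, 2, size=1000)
y = X @ np.array([1.0, -0.5, 0.2]) + 2.0 * t + 0.01 * rng.normal(size=1000)
tau_hat = t_learner_uplift(X, t, y, X[:5])
```

With the response generated as y = x·w + 2t plus small noise, the estimated uplift is close to the true value 2.0 for every user.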
Fig. 2. The architecture of our RUAD. The propensity network π(x) is pre-trained to calculate the transformed response y∗. The left is the feature selection module (FS), which leverages a masker to select key sensitive features for jointly modeling the transformed response y∗ and the user response y. The right is the adversarial feature desensitization module (AFD), which reduces the sensitivity of the base uplift model to these key features. Specifically, Lo and Lr are used for FS, while La is used for AFD. The detailed form of the loss function is presented in Eq.(5).
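The masker in the FS module of Fig. 2 can be sketched as a Gumbel-softmax-style relaxed top-k selection, following the form of Eq.(7): draw k Gumbel-perturbed softmax vectors over the feature logits and take the element-wise maximum. The importance logits z and the choice of k below are illustrative assumptions.

```python
import numpy as np

def gumbel_softmax_mask(z, k, zeta=0.5, rng=None):
    """Relaxed top-k feature mask (cf. Eq.(7)): for each of the k selection
    slots, sample a Gumbel-perturbed softmax over the N features, then take
    m_j = max_l softmax_j."""
    rng = rng or np.random.default_rng()
    N = z.shape[0]
    u = rng.uniform(size=(k, N))
    g = -np.log(-np.log(u))                        # Gumbel(0,1) noise xi_j^l
    logits = (np.log(z) + g) / zeta                # one row per slot l
    soft = np.exp(logits - logits.max(axis=1, keepdims=True))
    soft = soft / soft.sum(axis=1, keepdims=True)  # softmax over features j
    return soft.max(axis=0)                        # element-wise max over slots

rng = np.random.default_rng(0)
z = np.array([5.0, 0.1, 0.1, 4.0, 0.1])  # illustrative feature-importance scores
m = gumbel_softmax_mask(z, k=2, rng=rng)
x = rng.normal(size=5)
x_masked = x * m                          # Eq.(8): element-wise masking
```

Each mask entry lies in (0, 1], so the mask is differentiable during training while still concentrating on the k highest-scoring features in expectation.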
the calculation of each feature dimension in the mask vector can be expressed as [19],

m_j = max_{l∈{1,...,k}} [exp((log z_j + ξ_j^l)/ζ) / Σ_{j′=1}^{N} exp((log z_{j′} + ξ_{j′}^l)/ζ)],    (7)

where l ∈ {1, . . . , k} denotes the index of the selected feature, ξ_j^l = −log(−log u_j^l), and u_j^l ∼ Uniform(0, 1) denotes a uniformly distributed sampling. Note that for simplicity, in our experiments we follow the setup of previous work [19], i.e., ζ = 0.5. Finally, we can get the masked samples x_m by multiplying the original samples x with the resulting mask vector m(x),

x_m = x ⊙ m(x),    (8)

where ⊙ denotes the element-wise multiplication.

2) Joint Multi-Label Modeling: The success of the feature selection process largely depends on a reasonable guiding optimization objective. Most of the existing uplift models adopt a traditional optimization objective for response modeling, which directly constrains the model to fit the true response y of each sample,

Lr = L(µt(x), y(t)).    (9)

This can ensure the coherent prediction of the model on the user response. However, we can find that Eq.(9) is not consistent with the objective of the uplift model (as shown in Eq.(3)), and this will make the performance of the uplift model easily uncontrollable.

On the other hand, there is little work focusing on establishing the link between the user responses y and the expected uplift effect τ [20], among which the transformed response is one of the most representative ways. The specific form of the transformed response is shown in Eq.(10),

y∗ = (y / π(x)) · t − (y / (1 − π(x))) · (1 − t),    (10)

where π(x) is the propensity score estimation function and is usually modeled by a neural network in practice. Eq.(10) transforms the observed true response y into y∗, such that the expected uplift prediction τ equals the conditional expectation of the transformed response y∗. Note that the parameter update of this estimation function π(x) is performed by a binary cross-entropy (BCE) loss between its output and t, and π(x) is usually pre-trained to ensure the correctness of the transformed response y∗ [21]. Since the transformed response is a consistent unbiased estimator of the uplift effect τ, we can fit it with the uplift prediction of the base model to improve the base model's ability to capture the uplift effect,

ŷ∗_x = µ1(x) − µ0(x).    (11)

However, using only Eq.(11) as the objective may also cause the predicted response of the model to violate the true response of the user, i.e., damage the coherent prediction.

Therefore, to better guide the training of the above feature selection process to obtain the desired key features that have a greater impact on the performance of the uplift model, we define a joint multi-label modeling as a trade-off optimization objective,

Lo + Lr = αL(ŷ∗_{x_m}, y∗) + (1 − α)L(µt(x_m), y(t)),    (12)

where α is the loss weight, and note that each prediction is based on the masked samples x_m after feature selection.

D. Adversarial Feature Desensitization Module

After obtaining the desired key features, the next key step is how to effectively reduce the sensitivity of the uplift model to these sensitive features during its training process. Based on the empirical findings in Fig. 1, we find that the feature sensitivity is reflected in the inadaptability of the uplift model to the perturbations on these key features. This can be linked to previous work [22], where it was shown that the features used to train deep learning models can be divided into adversarial
robust features and non-robust features, and the sensitive key features in uplift models can be considered as the latter. Given that existing works show that adversarial training with feature importance can effectively address the limitation of adversarial non-robust features [23]–[25], our RUAD formalizes an adversarial feature desensitization module, including an adversarial training operation and a soft interpolation process.

1) Adversarial Training Operation: Our goal is to construct a set of adversarial samples for the uplift model and constrain the model to adapt to them during the adversarial training. These samples can be generated by feature-level perturbation of the masked samples x_m undergoing feature selection. Specifically, for numerical features, the perturbation will act directly on the feature value itself, and for discrete features, the perturbation will act on the corresponding encoding embedding. To improve the effectiveness and efficiency of adversarial training, many existing works have been proposed to show how to obtain ideal adversarial samples that meet the goal of maximizing the disturbance on model performance [14], [16], [26], [27],

max_{x_adv} L(ŷ_adv, ŷ),    (13)

where ŷ and ŷ_adv are the model predictions based on the original sample x and the adversarial example x_adv, respectively, and x_adv = x + δ (as shown in Eq.(4)).

In this paper, we follow the virtual adversarial training framework (VAT) [28] to obtain ideal adversarial samples. Specifically, in order to strengthen the interference of adversarial samples on the model's uplift effect estimation, we first modify the original adversarial loss (i.e., Eq.(13)) to a form based on the transformed response,

max_{x_adv} L(ŷ∗_adv, ŷ∗),    (14)

where ŷ∗_adv is estimated by using x_adv as input in Eq.(11). Then, we perform the search process based on the power iteration method proposed by VAT, where the new adversarial sample obtained at each iteration is calculated as follows,

x_adv^(z+1) = x_adv^(z) + ϵ · (∇_{x_adv} L(ŷ∗_adv, ŷ∗) / ∥∇_{x_adv} L(ŷ∗_adv, ŷ∗)∥₂) ⊙ m(x),    (15)

where ϵ is a hyper-parameter to control the step size of the perturbation, and z is the number of iterations. Note that we will use the masked samples after feature selection as the initialization of this search process, i.e., x_adv^(0) = x_m, and apply the same mask m(x) to the perturbations to ensure that adversarial training is only performed on the selected key features.

2) Soft Interpolation Process: Since the adversarial training operation and the feature selection module are jointly trained, excessively large perturbations on some features in the early stages of model training may damage the effect of feature selection. To control the magnitude of the adversarial perturbation within a more moderate level, we integrate the obtained adversarial examples (i.e., x_adv^(Z)) and the received original samples (i.e., x_adv^(0) or x_m) in a soft interpolation form. Specifically, the final adversarial examples can be obtained as follows,

x̃_adv = γ · x_adv^(0) + (1 − γ) · x_adv^(Z),    (16)

where γ ∼ Uniform(0, 1), and Z is the number of iterations for the power iteration method. After obtaining ideal adversarial examples, we expect the uplift model to adapt to them during training. Therefore, we introduce an adversarial loss to constrain the model not to have large prediction differences between the adversarial examples and the original samples,

La = βL(ỹ∗_adv, ŷ∗),    (17)

where β is a hyper-parameter to control the weight of the adversarial loss. For ease of understanding, we provide the pseudocode of our RUAD in Algorithm 1.

Algorithm 1 The training procedure of RUAD.
Input: The data D = {x_i, t_i, y_i}_{i=1}^n, the model parameters θ (including the base model and the feature selection model), training epochs M, adversarial learning steps Z, hyper-parameters α, β and ϵ.
1: for m = 1 to M do
2:   Sample (x, t, y) ∼ D
3:   Get the propensity score by using π(x).
4:   Get the transformed response y∗ according to Eq.(10).
5:   Get the mask vector m(x) and the selected key features according to Eq.(6) and (8). ▷ Feature Selection
6:   Predict ŷ∗ according to Eq.(11).
7:   x_adv^(0) = x_m ▷ Adversarial Feature Desensitization
8:   for z = 1 to Z do
9:     Search the maximum-allowable adversarial sample x_adv^(z) according to Eq.(15).
10:  end for
11:  Get the soft interpolation adversarial example according to Eq.(16).
12:  Update θ according to Eq.(5), (12) and (17).
13: end for

E. Complexity Analysis

Since the base model contains two MLPs for predicting the user responses in their respective groups, its overall complexity can be expressed as O(MNd|D|), where M is the number of training epochs, N is the number of features, d is the embedding dimension, and |D| is the dataset size. For the feature selection module in our RUAD, its complexity mainly involves a neural network-based mask function, and the complexity is O(Nd|D|) in one training epoch. The adversarial feature desensitization module mainly involves a search process with the power iteration method, and the complexity is O(ZNd|D|) in one training epoch, where Z is the number of steps in the power iteration method. Therefore, the complexity of the base model integrating our RUAD will be O(ZMNd|D|). Since satisfactory model performance can usually be obtained with few adversarial learning steps in practice, i.e., Z is usually a small value, the complexity of our RUAD is competitive.
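Steps 7–11 of Algorithm 1, i.e., the normalized-gradient search of Eq.(15) and the interpolation of Eq.(16), can be sketched as follows. For self-containment we use a fixed linear "uplift head" with a squared adversarial loss and an analytic gradient in place of the network's backward pass; the weights, mask, and sample values are illustrative assumptions.

```python
import numpy as np

def adversarial_search(x_m, mask, grad_fn, eps=0.02, steps=3):
    """Power-iteration-style search (cf. Eq.(15)): repeatedly move along the
    normalized loss gradient, restricted to the selected features by m(x)."""
    x_adv = x_m.copy()                                # x_adv^(0) = x_m
    for _ in range(steps):
        g = grad_fn(x_adv)
        norm = np.linalg.norm(g)
        if norm > 0:
            x_adv = x_adv + eps * (g / norm) * mask   # perturb masked features only
    return x_adv

def soft_interpolation(x_m, x_adv, rng):
    """Eq.(16): blend the original masked sample with the adversarial one."""
    gamma = rng.uniform()
    return gamma * x_m + (1.0 - gamma) * x_adv

# Illustrative stand-in for the uplift estimate: a linear map with a
# squared adversarial loss L = (w.x - y*)^2 (the real model is the MLP pair).
w = np.array([1.0, -2.0, 0.5, 0.0])
y_star = 0.3
grad_fn = lambda x: 2.0 * (w @ x - y_star) * w        # analytic gradient of L
mask = np.array([1.0, 1.0, 0.0, 0.0])                 # only two features selected
x_m = np.array([0.2, -0.1, 0.4, 1.0])

x_adv = adversarial_search(x_m, mask, grad_fn)
x_tilde = soft_interpolation(x_m, x_adv, rng=np.random.default_rng(0))
```

The search only moves the two selected coordinates, increases the adversarial loss at every step, and the interpolated example x̃ always lies between x_m and x_adv coordinate-wise, which is exactly the moderating effect the soft interpolation is meant to provide.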
IV. EXPERIMENTS

In this section, we present the overall performance of RUAD and the other methods to be compared. The following questions are also proposed and investigated:
• RQ1: How does our RUAD help the training of the backbones on both the IHDP and Production datasets?
• RQ2: How does each module of our RUAD help the uplift model training procedure?
• RQ3: Can our RUAD improve the robustness of the learned uplift model?
• RQ4: How is the compatibility of our RUAD?

A. Experiment Setup

1) Datasets: To compare the model performance from an uplift ranking perspective, we use two datasets to show the effectiveness of our training framework; the overall data statistics are presented in Table I.

TABLE I
STATISTICS OF THE IHDP DATASET AND PRODUCTION DATASET. WE RANDOMLY SPLIT THE TWO DATASETS WITH A TRAIN/VALIDATION/TEST PROPORTION OF 70%/20%/10%.

Dataset      Features   Sample number
                        Treated        Control
IHDP         25         139            608
Production   123        1.82 million   1.85 million

• IHDP¹. The IHDP dataset is utilized as a semi-synthetic dataset to assess predicted uplift in [29]. This evaluation involves the synthetic generation of counterfactual outcomes based on the original features, along with the introduction of selection bias. The resulting dataset contains 747 subjects (608 control and 139 treated) with 25 features (6 continuous and 19 binary) that describe both the characteristics of the infants and the characteristics of their mothers. T = 1 represents that the subject is provided with intensive, high-quality childcare and home visits from a trained health-care provider.
• Production. This dataset comes from an industrial production environment, one of the largest short-video platforms in China. For such short-video platforms, clarity is an important user experience indicator, and a decrease in clarity may lead to a decrease in users' playback time. Therefore, through random experiments within a week, we provided high-clarity videos (T = 1) to the treatment group, and low-clarity videos (T = 0) to the control group. We count the total viewing time of users' short videos in a week, and quantify the impact of clarity degradation on user experience. The resulting dataset contains more than 3.6 million users (1.82 million treated and 1.85 million control) with 123 features (108 continuous and 15 categorical) describing user-relative characteristics. We present the visualization of online data collection in Fig. 3.

Fig. 3. The visualization for the online data collection experiment of the Production dataset: the treatment group receives high clarity (1080P), and the control group receives low clarity (720P).

2) Baselines: We compare RUAD with the representative methods in uplift modeling. All methods use all the input continuous features in the two datasets.
• S-Learner (S-NN) [6]: S-NN is a kind of meta-learner method that uses a single estimator to estimate the outcome without giving the treatment a special role. The uplift is estimated as the difference between predictions with a changed treatment and fixed other features.
• T-Learner (T-NN) [6]: T-NN is similar to S-NN, but uses two estimators for the treatment group and control group, respectively.
• Causal Forest [30]: Causal Forest is a non-parametric Random Forest-based tree model that directly estimates the treatment effect, which is another representative of tree-based uplift models.
• Transformed Outcome (TO-NN) [20]: TO-NN transforms the observed outcome Y to Y∗, such that the uplift equals the conditional expectation of the transformed outcome.
• TARNet [31]: TARNet is a commonly used neural network-based uplift model. Compared with T-NN, it omits the additional imputed treatment effect fitting sub-models but introduces shared layers for the treated and control response networks. The shared network parameters can help alleviate the sample imbalance.
• CFR [31]: CFR applies an additional loss to TARNet, which forces the learned treated and control feature distributions to be closer. We report the CFR performance using two distribution distance measurement loss functions, Wasserstein [32] (denoted as CFR_wass) and Maximum Mean Discrepancy (MMD) [33] (denoted as CFR_mmd).
• Dragonnet [34]: Dragonnet exploits the sufficiency of the propensity score for estimation adjustment, and uses a regularization procedure based on non-parametric estimation theory.

¹ https://github.com/AMLab-Amsterdam/CEVAE/tree/master/datasets/IHDP
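To make the transformed outcome used by TO-NN (and by our Eq.(10)) concrete, the sketch below checks numerically that, with a known propensity, the sample mean of y∗ recovers the true uplift. The synthetic population, the constant propensity π(x) = 0.5, and the true effect of 1.5 are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
e = 0.5                                   # known propensity pi(x) (randomized)
t = (rng.uniform(size=n) < e).astype(float)

y0 = x + rng.normal(scale=0.1, size=n)    # control response y(0)
y1 = y0 + 1.5                             # treatment response y(1): true uplift 1.5
y = t * y1 + (1 - t) * y0                 # observed response, Eq.(2)

# Transformed response, Eq.(10): E[y* | x] equals the uplift tau(x).
y_star = y / e * t - y / (1 - e) * (1 - t)
tau_hat = y_star.mean()
```

Because treatment assignment is independent of the potential outcomes, E[y∗] = E[y(1)] − E[y(0)], so tau_hat concentrates around 1.5 as n grows, even though no individual user ever reveals both responses.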
TABLE II
MAIN HYPER-PARAMETERS AND THEIR VALUES TUNED IN THE EXPERIMENTS.

Parameter   Range                      Functionality
d           {2³, 2⁴, 2⁵}               Embedding dimension
lr          {1e−5, 1e−4, 1e−3}         Learning rate
ϵ           {1e−2, 2e−2, 4e−2}         Adversarial step size
α           {0.5, 0.6, 0.7, 0.8}       Regression loss weighting
β           {0.1, 0.2, 0.3, 0.4}       Adversarial loss weighting
λ           {1e−5, 1e−4, 1e−3}         Regularization loss weighting

• CITE [21]: CITE is based on the contrastive task designed for causal inference. It fully exploits the self-supervision information in the data to achieve balanced and predictive representations while appropriately leveraging causal prior knowledge.

3) Evaluation Metrics: Following the setup of previous work [5], we employ two evaluation metrics commonly used in uplift modeling, i.e., the Qini coefficient q̂ and Kendall's uplift rank correlation ρ̂.

4) Implementation Details: We implement all baselines and our RUAD with PyTorch 1.10. Besides, we use the Adam optimizer and set the maximum number of iterations to 30 to search for the best hyper-parameters, and we leverage q̂ as the parameter searching objective. We also adopt an early stopping mechanism with a patience of 5 to avoid over-fitting to the training set. Furthermore, we utilize the hyper-parameter search library Optuna [35] to accelerate the tuning process, and we set the Qini coefficient as the hyper-parameter searching objective. The tuned values of the main hyper-parameters are shown in Table II.

B. Overall Performance (RQ1)

We present the comparison results on the IHDP and Production datasets in Table III. For the evaluation, we use both the Qini coefficient and Kendall's uplift rank correlation with 5 bins and 10 bins, and then we have the following observations:

For the IHDP dataset: 1) Meta-learners (S-NN, T-NN) perform badly among all the methods, because there exists a sample imbalance in the dataset. Causal Forest performs competitively on all the metrics, and the reason may be that Causal Forest can obtain an unbiased estimation of the causal effect in all dimensions due to the feature space being repeatedly partitioned to achieve data homogeneity in the local feature space. TO-NN performs pretty well in all cases, and the transformed label can also solve the sample imbalance problem by leveraging the propensity score. 2) The neural network-based methods also have a good performance. Dragonnet performs pretty well among them, where the target regularization may enhance the model learning. It is worth noting that a more complex architecture may not get higher performance in uplift modeling evaluation. 3) For RUAD, the learned model performs better than the other baselines in most cases. This is because the feature selection module with joint multi-label modeling can select sensitive key features for the uplift modeling task, reduce the computational cost, and improve the prediction accuracy; the adversarial feature desensitization can help to solve the feature sensitivity problem, which can train a more robust uplift model on the sensitive key features and thus achieve higher performance on the evaluation metrics.

For the Production dataset: 1) TO-NN significantly outperforms many neural network-based methods, even though some of them use more complex network architectures. This shows that the task of uplift modeling may be different from traditional ITE estimation, that is, more complex architectures may not get better results. In particular, we can observe that, with high-dimensional features for uplift modeling, most neural network-based methods cannot learn a good feature embedding for the final objective. Thus, they no longer have an advantage over TO-NN. 2) S-NN, T-NN, Causal Forest, TO-NN, and Dragonnet perform better than the other baselines. Especially, S-NN, T-NN, Causal Forest, and TO-NN can be seen as the traditional methods for uplift modeling. The results show that, for an industrial application, these methods can satisfy the basic requirements and get an acceptable performance. Dragonnet performs pretty well among the neural network-based methods. It leverages the propensity score to address the imbalance issue of treatment effect estimation and introduces the biases into the model parameters to form a target regularization. Although the sample numbers of the treated and control groups are almost balanced in the Production dataset, the well-learned propensity score can also facilitate the model training with less variance between different samples. 3) RUAD consistently outperforms the other baselines in most cases. Considering the high-dimensional data of the Production dataset, the feature selection module with joint multi-label modeling can significantly improve the training efficiency and the model performance. Due to the fact that this dataset is collected from the online environment, there may exist some noise in the collected data. Thus, the adversarial feature desensitization module can improve prediction accuracy with the adversarial training operation on the sensitive key features.

C. Ablation Study (RQ2)

Moreover, we conduct ablation studies of our RUAD with S-NN as the uplift model, and we analyze the role played by each module. We sequentially remove the two components of RUAD, i.e., the feature selection module (FS) and the adversarial feature desensitization module (AFD). We construct three variants of RUAD, which are denoted as RUAD (w/o FS-JMM), RUAD (w/o FS) and RUAD (w/o AFD). Specially, RUAD (w/o FS-JMM) represents that we only use the response as the training label of the base uplift model. We present the results in Table IV. From the results, we can see that removing any part may bring performance degradation. This verifies the validity of each part designed in our RUAD. In particular, the feature selection module can select the key sensitive features for uplift modeling; the joint multi-label modeling can help the feature selection to get a more sensitive key feature set for the uplift modeling task; and the adversarial feature desensitization module can help the uplift model to be more robust on the selected feature set. All the
7
TABLE III
OVERALL COMPARISON BETWEEN OUR MODELS AND THE BASELINES ON IHDP AND P RODUCTION DATASETS , WHERE THE BEST AND SECOND BEST
RESULTS ARE MARKED IN BOLD AND UNDERLINED , RESPECTIVELY. N OTE THE REPORTED RESULTS ARE THE MEAN ± STANDARD DEVIATION OVER FIVE
RUNS WITH DIFFERENT SEEDS .
modules are helpful in solving the feature sensitivity problem 0.2 0.2
and improving prediction accuracy. 0.1 0.1
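For reference, the two ranking-based evaluation metrics used above (the Qini coefficient and Kendall's uplift rank correlation over bins) can be sketched in NumPy as follows. This is a minimal illustration under our own function names, with simple equal-frequency binning and an unnormalized Qini coefficient; it is not the paper's exact evaluation code.

```python
import numpy as np

def uplift_by_bins(uplift_pred, treatment, outcome, n_bins=5):
    """Observed uplift (treated mean minus control mean) per bin,
    with bins ordered from highest to lowest predicted uplift."""
    order = np.argsort(-uplift_pred)           # sort users by predicted uplift
    observed = []
    for idx in np.array_split(order, n_bins):  # equal-frequency bins
        t, y = treatment[idx], outcome[idx]
        observed.append(y[t == 1].mean() - y[t == 0].mean())
    return np.array(observed)

def kendall_uplift_rank_correlation(uplift_pred, treatment, outcome, n_bins=5):
    """Kendall's tau between the predicted bin ranking and the ranking of
    observed per-bin uplifts; tau = 1 means a perfectly monotone bar chart."""
    obs = uplift_by_bins(uplift_pred, treatment, outcome, n_bins)
    concordant = discordant = 0
    for i in range(n_bins):
        for j in range(i + 1, n_bins):
            # bin i is predicted to have a higher uplift than bin j
            if obs[i] > obs[j]:
                concordant += 1
            elif obs[i] < obs[j]:
                discordant += 1
    return (concordant - discordant) / (n_bins * (n_bins - 1) / 2)

def qini_curve(uplift_pred, treatment, outcome):
    """Cumulative incremental gains when targeting users in order of
    predicted uplift, rescaling the control group to the treated size."""
    order = np.argsort(-uplift_pred)
    t, y = treatment[order], outcome[order]
    cum_t = np.cumsum(t)                       # treated users so far
    cum_c = np.cumsum(1 - t)                   # control users so far
    gain_t = np.cumsum(y * t)                  # treated responders so far
    gain_c = np.cumsum(y * (1 - t))            # control responders so far
    return gain_t - gain_c * cum_t / np.maximum(cum_c, 1)

def qini_coefficient(uplift_pred, treatment, outcome):
    """Unnormalized area between the Qini curve and the random-targeting
    straight line; larger is better."""
    q = qini_curve(uplift_pred, treatment, outcome)
    n = len(q)
    random_line = q[-1] * np.arange(1, n + 1) / n
    return (q - random_line).mean()
```

With 5 bins, a tau of 1 corresponds to a perfectly monotone uplift bar chart of the kind plotted in Fig. 4.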
D. Robustness Evaluation (RQ3)

Regarding the issue presented in Fig. 1, we also use the same testing data of the Production dataset for the robustness evaluation. Specifically, we first train the model on the normal data, and then add perturbations drawn from the same distribution to the same feature sets after feature normalization. We present the results of RUAD in Fig. 4. The main problem shown in Fig. 1 is that, for the third feature set, the perturbation on it makes the learned model produce an opposite uplift bar chart in the evaluation. After applying our RUAD and obtaining a well-trained deep uplift model, the uplift bars become more stable than before. This observation suggests that our RUAD can improve the model's feature-level robustness. Also, the results on feature set 3 show that the feature selection module with joint multi-label modeling can select the sensitive key feature set, and the adversarial feature desensitization module can then make the learned uplift model more robust to perturbations on these features.

Fig. 4. Bar graphs of predicted uplift (%) over 5 bins of the targeted population, w.r.t. the original dataset ((a) without perturbation) and three perturbed variants ((b) perturbation on feature set 1; (c) perturbation on feature set 2; (d) perturbation on feature set 3). We present the results of our RUAD with S-NN as the base uplift model.
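The perturbation protocol above can be sketched as follows: noise is added to a selected feature subset after normalization, and the per-bin uplift bars are recomputed under the perturbed inputs. The helper names, the noise scale, and the use of Gaussian noise are our own illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def perturb_feature_set(X, feature_idx, sigma=0.1, seed=0):
    """Add noise drawn from one fixed distribution to a selected feature
    subset of normalized features, leaving the other features untouched."""
    rng = np.random.default_rng(seed)
    X_pert = X.copy()
    noise = rng.normal(0.0, sigma, size=(X.shape[0], len(feature_idx)))
    X_pert[:, feature_idx] += noise
    return X_pert

def uplift_bars_under_perturbation(model, X, treatment, outcome,
                                   feature_idx, n_bins=5, sigma=0.1):
    """Compare per-bin observed uplift before and after perturbing the
    selected feature set; a robust model keeps the two bar charts close."""
    def bars(features):
        pred = model(features)                 # predicted uplift scores
        order = np.argsort(-pred)
        return np.array([
            outcome[idx][treatment[idx] == 1].mean()
            - outcome[idx][treatment[idx] == 0].mean()
            for idx in np.array_split(order, n_bins)
        ])
    return bars(X), bars(perturb_feature_set(X, feature_idx, sigma))
```

If the model ignores the perturbed features, the two bar charts coincide; a feature-sensitive model can flip the ordering of its bars, which is exactly the failure mode shown in Fig. 1.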
Fig. 5. Performance of our RUAD with three typical base uplift models on the Production dataset, i.e., S-NN, T-NN and Dragonnet ((a) Qini coefficient; (b) Kendall uplift rank correlation). We evaluate the results by using the Qini coefficient and Kendall uplift rank correlation with 5 bins.

E. Compatibility Evaluation (RQ4)

To verify the compatibility of our RUAD with different base uplift models, besides T-NN, we also combine it with two typical models, i.e., S-NN and Dragonnet, in our experiments. The results on the Production dataset are shown in Fig. 5. According to the results, we have the following observations: our RUAD outperforms the base models on the two evaluation metrics, q̂ (5 bins) and p̂ (5 bins). In particular, S-NN and T-NN are models commonly used for uplift modeling tasks; especially for S-NN, combining it with RUAD significantly improves the Qini coefficient. Dragonnet is an ITE estimation model that performs best among all the neural network-based models in Table III. Although it already has a complex model structure for better uplift prediction, combining it with RUAD still improves the model performance in all cases. For the three combinations of RUAD with these base models, the best-performing one changes with the evaluation metric, so users can select the proper base model to combine with RUAD according to the specific scenario. The above observations suggest that our RUAD can be used as a general framework to improve both the accuracy of uplift prediction and the robustness of the uplift model.

V. RELATED WORKS

A. Uplift Modeling

Uplift modeling has received much attention for online marketing in recent years [2]. Research in this area has focused on various aspects of uplift modeling, including methods for
TABLE IV
Results of the ablation studies on the Production dataset, where the best and second best results are marked in bold and underlined, respectively. Note the reported results are the mean ± standard deviation over five runs with different seeds.
model building, performance evaluation, and real-world applications. For a binary outcome, the intuitive approach to model uplift is to build two classification models [6]. This consists of two separate conditional probability models, one for the treated users and the other for the untreated users. Any off-the-shelf estimator can be used in this two-model approach, such as regression trees [36]. This method is simple and flexible, but it cannot mitigate the influence of the disparity in feature distributions between the treatment and control groups. Künzel et al. [6] propose the X-Learner to address this difference in feature distributions, which leverages the propensity score [37] to obtain a weighted average of two estimators. However, in practice, the X-Learner is often prone to over-fitting and its parameters are difficult to tune. Nie and Wager [38] then propose the R-Learner, which has guaranteed error bounds when using penalized kernel regression as the base learner. To model the uplift directly, a transformed response approach [20] has been proposed, but it relies heavily on the accuracy of the propensity score. For a continuous outcome, Causal Forest [30] is a random forest-like algorithm for uplift modeling. It uses the causal tree [39] as its base learner, which is a general framework with theoretical guarantees. With the development of deep learning in causal inference, many works have been proposed that focus on ITE estimation. TARNet [31] is a two-head structure like the T-Learner, but the information between the two heads is shared through a representation layer. CFRNet leverages distance metrics (MMD [33] and WASS [32]) on top of the TARNet structure to balance the representations between the two heads. To address the sample imbalance between the treatment and control groups, Dragonnet [34] designs a three-head structure, which uses a separate head to learn the propensity score. The propensity score is commonly used in ITE estimation; CITE [21] uses it to distinguish positive and negative samples and then builds a contrastive learning structure. Unlike the above methods, our RUAD models the transformed outcome and the conditional means together, which can leverage deep neural networks to obtain better feature representations for uplift modeling.

B. Robust Optimization

Robust optimization has been widely studied in the machine learning community. In general, the two most popular types are distributional robustness and adversarial robustness. The first [40] aims to solve the distribution shift problem between training and testing data. For the second [41], the key requirement is that even if the input is slightly perturbed, the model can still predict the original output. Our work is more related to the second one, which leverages adversarial training to improve the model's robustness.

Adversarial training was initially proposed by [14], who discover that some neural network architectures are particularly vulnerable to perturbations applied to the input. They also show that training models to be robust against adversarial perturbations is effective in reducing test error. However, the definition of adversarial perturbation in [14] requires a computationally expensive inner loop to evaluate the adversarial direction. To overcome this problem, Goodfellow et al. [26] propose another definition of adversarial perturbation which admits a form of approximation that is free of the expensive inner loop. Bachman et al. [42] study the effect of random perturbations in the setting of semi-supervised learning. The Pseudo-Ensemble Agreement introduced in [43] trains the model so that the output of each layer in the network does not vary too much under random perturbations. To improve the efficiency of adversarial training, we propose adversarial feature desensitization to improve model robustness. It leverages a masker to select the important features for the uplift prediction, which can enhance the feature-level robustness of the model.

VI. CONCLUSION AND FUTURE WORK

In this paper, to address the feature sensitivity problem existing in most uplift modeling methods, we propose a robustness-enhanced uplift modeling framework with adversarial feature desensitization (RUAD). RUAD consists of two customized modules: the feature selection module with joint multi-label modeling selects the key sensitive features for the uplift prediction, which helps obtain a more accurate and robust uplift prediction while reducing the computational cost; the adversarial feature desensitization module then adds perturbations to these key sensitive features to alleviate the feature sensitivity problem. We conduct extensive evaluations to validate the effectiveness of RUAD, and we demonstrate its robustness to the feature sensitivity issue and its compatibility with different uplift models, which shows that RUAD is a general training framework for uplift modeling. For future work, we plan to analyze further reasons for the lack of robustness in uplift modeling and to provide a theoretical analysis of the feature-level sensitivity. In addition, we are also interested in extending uplift models to scenarios with multiple treatments.
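As a concrete reference for the adversarial perturbations discussed in Section V-B, the fast-gradient-style perturbation of Goodfellow et al. [26], restricted to a selected feature subset in the spirit of our masker, can be sketched for a linear-logistic scorer as follows. The function name, the linear scorer, and the masking convention are illustrative assumptions, not the implementation used in RUAD.

```python
import numpy as np

def fgsm_masked(x, w, b, y, mask, eps=0.05):
    """FGSM-style perturbation [26] applied only to selected features.
    For a linear-logistic scorer p = sigmoid(x @ w + b) under log-loss,
    the input gradient is (p - y) * w; the perturbation follows its sign
    but is applied only where mask == 1 (the sensitive features)."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    grad_x = (p - y)[:, None] * w[None, :]   # d(logloss)/dx for each sample
    return x + eps * np.sign(grad_x) * mask  # mask zeroes untouched features
```

Because the single gradient-sign step avoids the expensive inner loop of [14], the perturbed batch can be fed back into training at little extra cost, which is the efficiency property our adversarial feature desensitization module relies on.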
REFERENCES

[1] D. Liu, X. Tang, H. Gao, F. Lyu, and X. He, "Explicit feature interaction-aware uplift network for online marketing," arXiv preprint arXiv:2306.00315, 2023.
[2] W. Zhang, J. Li, and L. Liu, "A unified survey of treatment effect heterogeneity modelling and uplift modelling," ACM Computing Surveys, vol. 54, no. 8, pp. 1–36, 2021.
[3] P. Rzepakowski and S. Jaroszewicz, "Decision trees for uplift modeling," in Proceedings of the 2010 IEEE International Conference on Data Mining, 2010, pp. 441–450.
[4] P. Gutierrez and J.-Y. Gérardy, "Causal inference and uplift modelling: A review of the literature," in Proceedings of the 4th International Conference on Predictive APIs and Apps, 2017, pp. 1–13.
[5] M. Belbahri, A. Murua, O. Gandouet, and V. Partovi Nia, "Qini-based uplift regression," The Annals of Applied Statistics, vol. 15, no. 3, pp. 1247–1272, 2021.
[6] S. R. Künzel, J. S. Sekhon, P. J. Bickel, and B. Yu, "Metalearners for estimating heterogeneous treatment effects using machine learning," Proceedings of the National Academy of Sciences, vol. 116, no. 10, pp. 4156–4165, 2019.
[7] N. J. Radcliffe and P. D. Surry, "Real-world uplift modelling with significance-based uplift trees," White Paper TR-2011-1, Stochastic Solutions, pp. 1–33, 2011.
[8] Y. Zhao, X. Fang, and D. Simchi-Levi, "Uplift modeling with multiple treatments and general response types," in Proceedings of the 2017 SIAM International Conference on Data Mining, 2017, pp. 588–596.
[9] Y. Saito, H. Sakata, and K. Nakata, "Cost-effective and stable policy optimization algorithm for uplift modeling with multiple treatments," in Proceedings of the 2020 SIAM International Conference on Data Mining, 2020, pp. 406–414.
[10] F. Oechsle, "Increasing the robustness of uplift modeling using additional splits and diversified leaf select," Journal of Marketing Analytics, pp. 1–9, 2022.
[11] A. Wu, K. Kuang, R. Xiong, B. Li, and F. Wu, "Stable estimation of heterogeneous treatment effect," in Proceedings of the 40th International Conference on Machine Learning, 2023.
[12] D. B. Rubin, "Causal inference using potential outcomes: Design, modeling, decisions," Journal of the American Statistical Association, vol. 100, no. 469, pp. 322–331, 2005.
[13] J. Abrevaya, Y.-C. Hsu, and R. P. Lieli, "Estimating conditional average treatment effects," Journal of Business & Economic Statistics, vol. 33, no. 4, pp. 485–505, 2015.
[14] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
[15] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel, "Ensemble adversarial training: Attacks and defenses," arXiv preprint arXiv:1705.07204, 2017.
[16] U. Shaham, Y. Yamada, and S. Negahban, "Understanding adversarial training: Increasing local stability of supervised models through robust optimization," Neurocomputing, vol. 307, pp. 195–204, 2018.
[17] F. Lyu, X. Tang, D. Liu, L. Chen, X. He, and X. Liu, "Optimizing feature set for click-through rate prediction," in Proceedings of the ACM Web Conference 2023, 2023, pp. 3386–3395.
[18] E. Jang, S. Gu, and B. Poole, "Categorical reparameterization with gumbel-softmax," arXiv preprint arXiv:1611.01144, 2016.
[19] F. Lv, J. Liang, S. Li, B. Zang, C. H. Liu, Z. Wang, and D. Liu, "Causality inspired representation learning for domain generalization," in Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8046–8056.
[20] S. Athey and G. Imbens, "Recursive partitioning for heterogeneous causal effects," Proceedings of the National Academy of Sciences, vol. 113, no. 27, pp. 7353–7360, 2016.
[21] X. Li and L. Yao, "Contrastive individual treatment effects estimation," in Proceedings of the 2022 IEEE International Conference on Data Mining, 2022, pp. 1053–1058.
[22] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry, "Adversarial examples are not bugs, they are features," in Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019, pp. 125–136.
[23] H. Zhang and J. Wang, "Defense against adversarial attacks using feature scattering-based adversarial training," in Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019, pp. 1831–1841.
[24] Z. Wang, H. Guo, Z. Zhang, W. Liu, Z. Qin, and K. Ren, "Feature importance-aware transferable adversarial attacks," in Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021, pp. 7639–7648.
[25] M. Chapman-Rounds, U. Bhatt, E. Pazos, M.-A. Schulz, and K. Georgatzis, "Fimap: Feature importance by minimal adversarial perturbation," in Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021, pp. 11433–11441.
[26] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
[27] A. Kurakin, I. Goodfellow, and S. Bengio, "Adversarial machine learning at scale," arXiv preprint arXiv:1611.01236, 2016.
[28] T. Miyato, S.-i. Maeda, M. Koyama, and S. Ishii, "Virtual adversarial training: a regularization method for supervised and semi-supervised learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 1979–1993, 2018.
[29] A. Jesson, S. Mindermann, U. Shalit, and Y. Gal, "Identifying causal-effect inference failure with uncertainty-aware models," in Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020, pp. 11637–11649.
[30] J. M. Davis and S. B. Heller, "Using causal forests to predict treatment heterogeneity: An application to summer jobs," American Economic Review, vol. 107, no. 5, pp. 546–550, 2017.
[31] U. Shalit, F. D. Johansson, and D. Sontag, "Estimating individual treatment effect: generalization bounds and algorithms," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 3076–3085.
[32] S. Vallender, "Calculation of the wasserstein distance between probability distributions on the line," Theory of Probability and Its Applications, vol. 18, no. 4, pp. 784–786, 1974.
[33] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola, "Integrating structured biological data by kernel maximum mean discrepancy," Bioinformatics, vol. 22, no. 14, pp. e49–e57, 2006.
[34] C. Shi, D. Blei, and V. Veitch, "Adapting neural networks for the estimation of treatment effects," in Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019, pp. 2507–2517.
[35] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, "Optuna: A next-generation hyperparameter optimization framework," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019, pp. 2623–2631.
[36] W.-Y. Loh, "Classification and regression trees," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 1, pp. 14–23, 2011.
[37] M. Caliendo and S. Kopeinig, "Some practical guidance for the implementation of propensity score matching," Journal of Economic Surveys, vol. 22, no. 1, pp. 31–72, 2008.
[38] X. Nie and S. Wager, "Quasi-oracle estimation of heterogeneous treatment effects," Biometrika, vol. 108, no. 2, pp. 299–319, 2021.
[39] P. Darondeau and P. Degano, "Causal trees," in Proceedings of the 16th International Colloquium on Automata, Languages and Programming, 1989, pp. 234–248.
[40] H. Rahimian and S. Mehrotra, "Distributionally robust optimization: A review," arXiv preprint arXiv:1908.05659, 2019.
[41] T. Bai, J. Luo, J. Zhao, B. Wen, and Q. Wang, "Recent advances in adversarial training for adversarial robustness," arXiv preprint arXiv:2102.01356, 2021.
[42] P. Bachman, O. Alsharif, and D. Precup, "Learning with pseudo-ensembles," in Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014, pp. 3365–3373.
[43] A. Bagnall, R. Bunescu, and G. Stewart, "Training ensembles to detect adversarial examples," arXiv preprint arXiv:1712.04006, 2017.