Uplift vs. predictive modeling: a theoretical analysis
Théo Verhelst¹, Robin Petit², Wouter Verbeke³, Gianluca Bontempi¹
¹ Machine-learning Group, Université Libre de Bruxelles, Belgium
² Algorithms Research Group, Université Libre de Bruxelles, Belgium
³ Information Systems Engineering Research Group, KU Leuven, Belgium
Abstract
Despite the growing popularity of machine-learning techniques in decision-making, the added
value of causal-oriented strategies with respect to pure machine-learning approaches has rarely been
quantified in the literature. These strategies are crucial for practitioners in various domains, such
as marketing, telecommunications, health care and finance. This paper presents a comprehensive
treatment of the subject, starting from firm theoretical foundations and highlighting the parameters
that influence the performance of the uplift and predictive approaches. The focus of the paper is on
a binary outcome case and a binary action, and the paper presents a theoretical analysis of uplift
modeling, comparing it with the classical predictive approach. The main research contributions of
the paper include a new formulation of the measure of profit, a formal proof of the convergence of
the uplift curve to the measure of profit, and an illustration, through simulations, of the conditions
under which predictive approaches still outperform uplift modeling. We show that the mutual
information between the features and the outcome plays a significant role, along with the variance
of the estimators, the distribution of the potential outcomes and the underlying costs and benefits
of the treatment and the outcome.
Keywords: Uplift modeling, Profit measure, Causal inference, Decision-making
1 Introduction
With the growing popularity of machine-learning techniques in decision-making, the need for effective
and accurate models has become increasingly important in various domains. Conventional predictive
approaches have been used with success, for example, in churn prediction, where the models are built
to forecast whether a customer is likely to stop using a service based on historical data (Óskarsdóttir
et al. 2018; Zhu, Baesens, and Broucke 2017; Mitrović et al. 2018; Idris and Khan 2014).
However, traditional predictive models often overlook an essential aspect of decision-making,
the causal nature of interventions. Recently, uplift modeling has been established as an important
approach to take this aspect into account for decision-making (Gutierrez and Gérardy 2016; Devriendt,
Berrevoets, and Verbeke 2021). Uplift modeling differs from conventional predictive models by explicitly
considering the causal effect of an intervention on the outcome variable. Rather than estimating the
conditional expectation of the outcome based on input features alone, uplift modeling focuses on
estimating the difference in outcomes under different treatment scenarios.
arXiv:2309.12036v1 [cs.LG] 21 Sep 2023
Consider a marketing campaign for churn prevention, as an example. The goal is to identify
customers who are less likely to churn in response to a promotional offer. Traditional predictive models
predict the likelihood of customer churning; however, they do not consider the causal effect of the
intervention (sending the offer) on the outcome (customer churn). In this setting, the possible behavior
of a customer can be summarized in terms of counterfactual statements (Devriendt, Berrevoets, and
Verbeke 2021):
• Sure thing: Customer not churning regardless of the action
• Persuadable: Customer churning only if not contacted
• Do-not-disturb: Customer churning only if contacted
• Lost cause: Customer churning regardless of the action
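The four types above correspond to the four possible pairs of potential outcomes. As a minimal sketch (not taken from the paper), the mapping can be written down directly; the names `y0` and `y1` (churn without and with the offer), the helper functions and the toy population are all hypothetical choices made for illustration.

```python
from collections import Counter

# Map a pair of potential churn outcomes (y0, y1) to a customer type.
# y0 = 1 means churn when not contacted, y1 = 1 means churn when contacted.
CUSTOMER_TYPES = {
    (0, 0): "sure thing",
    (1, 0): "persuadable",
    (0, 1): "do-not-disturb",
    (1, 1): "lost cause",
}


def customer_type(y0: int, y1: int) -> str:
    """Return the counterfactual customer type for potential outcomes (y0, y1)."""
    return CUSTOMER_TYPES[(y0, y1)]


def observed_outcome(y0: int, y1: int, t: int) -> int:
    """Only the potential outcome matching the assigned treatment t is observed."""
    return y1 if t == 1 else y0


# Hypothetical toy population of (y0, y1) pairs.
population = [(0, 0), (1, 0), (1, 0), (0, 1), (1, 1)]
counts = Counter(customer_type(y0, y1) for y0, y1 in population)

# A contacted persuadable and a contacted sure thing yield the same observation,
# so the type cannot be read off a single observed outcome.
same_observation = observed_outcome(1, 0, t=1) == observed_outcome(0, 0, t=1)
```

In this toy setting the types are known only because both potential outcomes are simulated; with real data only `observed_outcome` is available.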
Ideally, only persuadable customers should be targeted by marketing actions. However, we observe
only one of the two potential outcomes (this is known as the fundamental problem of causal inference;
Holland 1986), and it is impossible to determine with certainty who the persuadable customers are.
Uplift modeling explicitly aims to estimate the difference in the probability of a positive outcome under
the treatment scenario (customer receives the offer) and the no-treatment scenario (customer does
not receive the offer). Individuals maximizing this difference are the most likely to generate a profit
increase when contacted. The term uplift is used mainly in business settings where large amounts of
experimental data are available, while in other fields, the same quantity is called the conditional average
treatment effect (CATE), or heterogeneous treatment effect (Gutierrez and Gérardy 2016), usually assuming
there is only access to observational data. A large number of models based on machine learning have
been developed in recent years to estimate uplift, such as the S-learner, T-learner and X-learner (Zhang,
J. Li, and Liu 2021; Künzel et al. 2019).
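To make the idea of a T-learner concrete, the sketch below fits one model on the control group and one on the treated group, then takes the difference of their predictions. It is only an illustration under strong assumptions: the base learner is a per-segment churn frequency (a stand-in for any classifier), the feature is a single hypothetical categorical `segment`, the data are simulated with made-up churn probabilities, and the uplift is computed as control rate minus treated rate, as is natural for churn prevention.

```python
import random
from collections import defaultdict


def fit_rate_model(rows):
    """Base learner: empirical churn rate per segment, from (segment, y) pairs."""
    totals, churned = defaultdict(int), defaultdict(int)
    for segment, y in rows:
        totals[segment] += 1
        churned[segment] += y
    return {s: churned[s] / totals[s] for s in totals}


def t_learner_uplift(data):
    """data: list of (segment, y, t) tuples from a randomized experiment.

    Fits one model per treatment arm and returns, per segment, the estimated
    churn rate without treatment minus the estimated churn rate with treatment.
    """
    control = fit_rate_model((x, y) for x, y, t in data if t == 0)
    treated = fit_rate_model((x, y) for x, y, t in data if t == 1)
    return {s: control[s] - treated[s] for s in control if s in treated}


def simulate(n, seed=0):
    """Simulate a randomized campaign with hypothetical churn probabilities."""
    rng = random.Random(seed)
    # (churn probability if not contacted, churn probability if contacted)
    truth = {"A": (0.40, 0.10), "B": (0.20, 0.20)}
    data = []
    for _ in range(n):
        x = rng.choice(["A", "B"])
        t = rng.randint(0, 1)  # treatment assigned at random
        y = 1 if rng.random() < truth[x][t] else 0
        data.append((x, y, t))
    return data


uplift = t_learner_uplift(simulate(20000))
```

With these made-up probabilities, the estimate for segment `"A"` should be close to 0.30 and the estimate for segment `"B"` close to zero, so only segment `"A"` would be worth contacting.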
Despite the intuitive appeal of uplift modeling, the added value of causal-oriented strategies with
respect to pure machine-learning predictive approaches has rarely been quantified in the literature. We
believe that it is important to assess whether the expected benefit of uplift strategies (derived from a bias
reduction in the estimation of causal effect) is still noticeable in settings where the data distribution is
characterized by a large number of dimensions, nonlinearity, class imbalance and low class separability.
The works of Devriendt, Berrevoets, and Verbeke (2021), Fernández-Loria and Provost (2022a),
Fernández-Loria and Provost (2022b), and Ascarza (2018) address this issue. Devriendt, Berrevoets, and
Verbeke (2021) and Ascarza (2018) present the uplift and predictive approaches, provide an empirical
evaluation of both approaches and conclude that uplift models should be preferred over churn models.
Fernández-Loria and Provost (2022a) develop an analytical criterion indicating when an uplift model
leads to a lower causal classification error than a predictive model for a given individual. The same
authors (Fernández-Loria and Provost 2022b) discuss the differences between causal
classification and uplift modeling and provide some qualitative arguments on when the predictive
approach should be preferred. We extend these papers by comprehensively treating the question,
starting from theoretical foundations and studying the influence of different characteristics of the
setting (distribution of the outcome, variance of the estimators, etc.) on the performance of the uplift
and predictive approaches.
A critical aspect of comparing the two approaches is the necessity for a meaningful and sensible
measure of model performance. In this paper, we extend the work of Verbeke, Olaya, Berrevoets, et al.
(2021) by developing a new formulation of the profit generated by a campaign where individuals targeted
by interventions are selected by a machine-learning model. By incorporating the concept of profit, we
go beyond the traditional evaluation metrics and consider the economic impact of decision-making
strategies. Our measure of profit generalizes that of Verbeke, Olaya, Berrevoets, et al. (2021) by accommodating varying costs and benefits
across individuals. This flexibility is beneficial, for example, in churn prediction, where prioritizing
higher-value customers is crucial. By selecting an appropriate measure, we ensure a fair and accurate
comparison between the uplift and predictive models, enabling decision-makers to make informed
choices based on the true effectiveness and suitability of each approach.
Our paper seeks to establish firm theoretical foundations for uplift modeling and to answer the
question “When does uplift modeling outperform predictive modeling?”. While we focus on a customer
churn prediction example, our findings have broad applicability across domains, including marketing,
telecommunications, health care and finance. Our main conclusions are as follows. The variance plays
a critical role in determining the performance of a model, and in most cases, the predictive approach
outperforms the uplift approach when the variance of the uplift estimator exceeds a certain threshold. We
also show the important impact of three other aspects: cost sensitivity, the mutual information between
the features and the outcome, and the distribution of the potential outcomes. While the importance
of cost sensitivity and of the distribution of potential outcomes has been discussed in the literature
by Verbeke, Olaya, Berrevoets, et al. (2021) and Fernández-Loria and Provost (2022a), respectively, to
the best of our knowledge, the impact of mutual information has not been assessed before. We show
that it has an important impact on performance, independent of the other aspects (estimator variance,
cost sensitivity and distribution of potential outcomes).
Note, however, that we do not address the question of how to adapt uplift modeling to account for
cost sensitivity or the other aspects mentioned above. Our contributions pertain to model evaluation
rather than model optimization. Thus, it is left for future work to assess the effectiveness of cost-sensitive
models in terms of the metrics developed in this paper. On that topic, Gubela and Lessmann (2021)
have proposed a value-driven ranking method for targeted marketing campaigns.
The main research contributions of this paper are as follows:
• A new formulation of the measure of profit, intensifying the focus on individual cost sensitivity
  and on the stochastic nature of the machine-learning model used to rank individuals (Section 3.2).
• A proof that the uplift curve (an evaluation curve often used in the uplift literature) is an estimator
  of the measure of profit, highlighting the strict conditions necessary for the validity of the uplift
  curve (Section 3.4).
• An empirical estimator of the measure of profit, which is a cost-sensitive generalization of the
  uplift curve (Section 3.5).
• A demonstration through theoretical analysis and simulations of the conditions under which
  the predictive approach still outperforms uplift modeling, and notably, the important role of the
  mutual information between the input features and the outcome, which has not previously been
  discussed in the literature (Section 4).

Table 1: Mathematical notation.
Notation                                                  Definition
$\mathbf{v}$                                              Random variable
$v$                                                       Realization of $\mathbf{v}$
$y \in \{0, 1\}$                                          Outcome indicator
$t \in \{0, 1\}$                                          Treatment indicator
$x \in \mathcal{X} \subseteq \mathbb{R}^n$                Set of features
$f_{\mathbf{x}}(x)$                                       Probability density function of $\mathbf{x}$
$I[\cdot]$                                                Iverson bracket, equals one when the expression between
                                                          brackets is true, zero otherwise
$do(\mathbf{t} = t)$                                      Causal intervention $\mathbf{t} = t$
$S_t = P(\mathbf{y}_t = y)$                               Probability of the outcome $\mathbf{y} = y$ under $do(\mathbf{t} = t)$
$S_0(x), S_1(x)$                                          $P(\mathbf{y}_0 = 1 \mid \mathbf{x} = x)$, $P(\mathbf{y}_1 = 1 \mid \mathbf{x} = x)$
$U$                                                       Uplift, defined as $U = S_0 - S_1$
$D = \{(x^{(i)}, y^{(i)}, t^{(i)})\}_{i=1}^{N}$           Training set or test set
$M(x, D_{tr})$                                            Prediction for features $x$ of model $M$ trained on set $D_{tr}$
$\tau \in \mathbb{R}$                                     Classification threshold
$\rho \in [0, 1]$                                         Treatment rate
The rest of this paper is organized as follows. Section 2 introduces the notations and notions
used throughout this paper. Our contributions are presented in Sections 3 and 4: we present the
new formulation of the measure of profit in Section 3, and we assess when the predictive approach
outperforms the uplift approach in Section 4. These results are further discussed in Section 5. In
Section 6, we present related work on evaluation measures and on the comparison between the uplift
and predictive approaches. Concluding remarks and recommendations for practitioners are given in
Section 7. Proofs of the theorems are provided in Appendices A and B.
2 Background
In this section, we introduce the notations and present the key concepts used throughout this paper.
2.1 Notation
We use Pearl’s causal framework, which is based on the notion of structural causal models (SCM). A formal
definition of SCMs is given by Pearl (2009, Def. 7.1.1). Here, $\mathbf{t}$ is a random variable denoting the action,
or treatment, $\mathbf{y}$ is the outcome, and $\mathbf{x}$ is a set of features (or covariates) describing the unit/individual.
We denote the realizations of these variables as $t$, $y$ and $x$, respectively. In this paper, we will limit
ourselves to considering the double binary causal classification case, that is, the setting where $y \in \{0, 1\}$
and $t \in \{0, 1\}$. Importantly, we always assume having access to experimental data, in which the
treatment $\mathbf{t}$ is randomized. It is possible to learn the uplift from observational data, for example, with
propensity scores (Künzel et al. 2019; Curth and Schaar 2021) or double machine learning (Jung, Tian,
and Bareinboim 2021); however, this is beyond the scope of this paper. The $do(\mathbf{t} = t)$ operator denotes a
causal intervention in the system. The conditional probability of $\mathbf{y} = y$ given $\mathbf{x} = x$ under intervention
$do(\mathbf{t} = t)$ is written as $P(\mathbf{y} = y \mid do(\mathbf{t} = t), \mathbf{x} = x)$, or $P(\mathbf{y}_t = y \mid \mathbf{x} = x)$. For ease of notation,
we also define

$$S_0(x) = P(\mathbf{y}_0 = 1 \mid \mathbf{x} = x), \qquad S_0 = P(\mathbf{y}_0 = 1), \tag{1}$$
$$S_1(x) = P(\mathbf{y}_1 = 1 \mid \mathbf{x} = x), \qquad S_1 = P(\mathbf{y}_1 = 1), \tag{2}$$
$$U(x) = S_0(x) - S_1(x), \qquad U = S_0 - S_1. \tag{3}$$

In this notation, $U$ is the uplift, or average treatment effect (ATE), and $U(x)$ is the individual uplift, or
conditional average treatment effect (CATE). Note that, for example, in the literature pertaining to retail or
online advertisements, the uplift is defined as $U = S_1 - S_0$, and similarly $U(x) = S_1(x) - S_0(x)$.
This choice depends on whether the probability of the (positive) outcome $\mathbf{y} = 1$ should be minimized
(e.g., in churn prevention) or maximized (e.g., in sales). The uplift is then defined so that a positive uplift
corresponds to a beneficial outcome. Since we apply our results mostly to churn prevention, we use the
convention $U = S_0 - S_1$.
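Under the churn convention, the individual uplift can be related to the four customer types listed in the introduction. The short derivation below is a sketch that does not appear in the original text; it only applies the law of total probability to the joint distribution of the potential outcomes $(\mathbf{y}_0, \mathbf{y}_1)$, with conditioning on $\mathbf{x} = x$ abbreviated as $\mid x$.

$$
\begin{aligned}
U(x) &= P(\mathbf{y}_0 = 1 \mid x) - P(\mathbf{y}_1 = 1 \mid x) \\
     &= \big[ P(\mathbf{y}_0 = 1, \mathbf{y}_1 = 0 \mid x) + P(\mathbf{y}_0 = 1, \mathbf{y}_1 = 1 \mid x) \big]
      - \big[ P(\mathbf{y}_0 = 0, \mathbf{y}_1 = 1 \mid x) + P(\mathbf{y}_0 = 1, \mathbf{y}_1 = 1 \mid x) \big] \\
     &= P(\mathbf{y}_0 = 1, \mathbf{y}_1 = 0 \mid x) - P(\mathbf{y}_0 = 0, \mathbf{y}_1 = 1 \mid x).
\end{aligned}
$$

The first term is the probability of being persuadable (churning only if not contacted) and the second is the probability of being do-not-disturb (churning only if contacted). The lost-cause term cancels, and sure things do not appear at all, so a large positive $U(x)$ indicates that persuadables outweigh do-not-disturbs among individuals with features $x$.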
Let $M$ be a model that is used to rank individuals such that only the individuals with the highest
scores should be targeted by the action. The model $M$ is trained from a data set
$D = \{(x^{(i)}, y^{(i)}, t^{(i)})\}_{i=1}^{N}$ of $N$ iid¹ realizations of $(\mathbf{x}, \mathbf{y}, \mathbf{t})$. We assume that $D$ is the result
of a random process, and we denote the corresponding random variable as $\mathbf{D}$. We consider $M(x, D_{tr})$ as a
learning algorithm, taking a data set $D_{tr}$ and a set of features $x$ as input and returning a score for $x$,
for example, an estimation of $U(x)$.
A threshold $\tau$ is used to determine which individuals should be targeted. The model $M$ prescribes
targeting all individuals with a score $M(x, D_{tr}) \geq \tau$ and not targeting the remaining individuals. The
threshold $\tau$ depends upon the model being used, because different models can provide scores in different
ranges and with different distributions. Therefore, to consistently compare the performance of
different models, we let $\rho \in (0, 1)$ be the proportion of individuals who should be targeted, and the
corresponding threshold $\tau$ can be determined as the largest value that satisfies $\rho = P(M(\mathbf{x}, D_{tr}) > \tau)$.
Note the random variable $\mathbf{x}$ in this expression. Since $M(x, D_{tr})$ is a deterministic function of $x$ (for a
given $D_{tr}$), $M(\mathbf{x}, D_{tr})$ is a random variable for which we can compute the probability
$P(M(\mathbf{x}, D_{tr}) > \tau)$.
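The role of the treatment rate can be illustrated with an empirical counterpart of the relation between $\rho$ and $\tau$. The sketch below is an assumption-laden illustration rather than the paper's procedure: the probability is replaced by a proportion over a finite list of scores, ties are assumed absent, the function names are hypothetical, and the function returns one threshold that leaves the requested share of scores strictly above it.

```python
import math


def threshold_for_rate(scores, rho):
    """Return a threshold tau such that floor(rho * n) scores are strictly above it.

    Empirical stand-in for rho = P(M(x) > tau); exact only when scores have no ties.
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie in (0, 1)")
    ranked = sorted(scores, reverse=True)
    k = math.floor(rho * len(ranked))  # number of individuals to target
    return ranked[k]                   # the (k + 1)-th largest score


def targeted(scores, tau):
    """Boolean mask of the individuals whose score is strictly above tau."""
    return [s > tau for s in scores]


# Two hypothetical models that rank individuals identically but on different scales.
scores_a = [0.9, 0.1, 0.4, 0.7, 0.3, 0.8, 0.2, 0.6]
scores_b = [100 * s for s in scores_a]

rho = 0.25
tau_a = threshold_for_rate(scores_a, rho)
tau_b = threshold_for_rate(scores_b, rho)
mask_a = targeted(scores_a, tau_a)
mask_b = targeted(scores_b, tau_b)
```

The two thresholds differ because the score scales differ, yet both models target the same two individuals. This is the reason for comparing models at a common treatment rate rather than at a common threshold.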
¹ The independence assumption might be violated in applications such as churn with, for example, a word-of-mouth effect
generating a second order of treatment.