Machine Learning Framework
N. Bora Keskin
Fuqua School of Business, Duke University, 100 Fuqua Drive, Durham, NC 27708, United States.
[email protected]
We consider a seller who can dynamically adjust the price of a product at the individual customer level, by utilizing
information about customers’ characteristics encoded as a d-dimensional feature vector. We assume a personalized
demand model, parameters of which depend on s out of the d features. The seller initially does not know the
relationship between the customer features and the product demand, but learns this through sales observations over
a selling horizon of T periods. We prove that the seller's expected regret, i.e., the revenue loss against a clairvoyant
who knows the underlying demand relationship, is at least of order s√T under any admissible policy. We then design
a near-optimal pricing policy for a "semi-clairvoyant" seller (who knows which s of the d features are in the demand
model) that achieves an expected regret of order s√T log T. We extend this policy to a more realistic setting where
the seller does not know the true demand predictors, and show that this policy has an expected regret of order
s√T (log d + log T), which is also near-optimal. Finally, we test our theory on simulated data and on a data set from
an online auto loan company in the United States. On both data sets, our experimentation-based pricing policy is
superior to intuitive and/or widely-practiced customized pricing methods such as myopic pricing and segment-then-
optimize policies. Furthermore, our policy improves upon the loan company’s historical pricing decisions by 47% in
expected revenue over a six-month period.
Key words : dynamic pricing, demand learning, demand uncertainty, regret analysis, lasso, machine learning
History : First version: May 23, 2017. This version: April 20, 2020. Forthcoming in Management Science.
1. Introduction
1.1. Background and Overview
In recent years, the advent of online sales channels has made an abundance of detailed customer
information available to sellers. Examples of such information include customer demographics
(postal code, date of birth, education/income status), past spending patterns, and social media
activities. Using sales data on these customer characteristics, it is now possible for many sellers to
dynamically improve their pricing decisions.
The availability of such information poses unique challenges and opportunities for online sellers.
In particular, how can a seller dynamically learn the impact of customer characteristics on product
demand, and simultaneously employ this information in pricing decisions to maximize revenue over
time?
To address this question, we investigate personalized dynamic pricing with learning; i.e., dynamic
pricing with imperfect information on how customers’ characteristics affect the demand. Specif-
ically, we consider a seller who offers personalized prices to different customers whose unique
characteristics are encoded as different vectors, known as features or the feature vector. The seller
is uncertain about the relationship between the customer features and the demand for its product,
but can learn this relationship through sales observations over time. We emphasize that by per-
sonalization we mean customized pricing at the individual level, not via customer segmentation,
which, to the best of our knowledge, seems to be the prevailing scheme for customized pricing in
practice. We thus reserve the term “customized pricing” for price discrimination on at least one
dimension of customer characteristics, and “personalized pricing” for price discrimination at the
most granular level possible with the available information.
At first glance, pricing identical products differently for different customers may seem a ques-
tionable practice. However, price discrimination based on customer characteristics is an age-old
practice that is legal, bar differentiation that violates antitrust or price-fixing laws (Ramasastry
2005). For instance, in the offline world, price discrimination on a per customer basis is prevalent in
insurance and consumer lending industries, where customers accept that rates are set on a case-by-
case basis. In the online world, customized pricing is already a widespread practice; e-commerce
websites that practice customized pricing include well-known retailers such as Amazon, Walmart,
and Sears, and travel websites such as Cheaptickets, Expedia, Hotels.com, Priceline, and Orbitz.
Furthermore, as tailor-made discount coupons are just a special case of customized pricing, the
question of finding the optimal personalized price is equivalent to finding the optimal discount
coupon from a common base price.
To formally study personalized dynamic pricing and learning, we consider a seller offering a
product for sale to customers who arrive sequentially over a discrete time horizon of T periods. For
each customer, the seller observes a d-dimensional feature vector pertaining to the characteristics
of that customer. The seller initially does not know the joint impact of customer features and prices
on the demand for the product, but can infer their impact based on individual sales observations.
In practice, d, the dimension of the feature vectors, can be quite large, and not all features may
be informative of the demand. As such, we let s ≤ d denote the number of features that are actual
predictors for the demand, and analyze two different settings: (i) a semi-clairvoyant seller who
knows a priori which s of the d features are non-trivial demand predictors, and (ii) a more realistic
seller who does not know which s of the d features are non-trivial demand predictors. In both
settings, we design and analyze policies that achieve near-optimal revenue performance, where
performance is measured by the seller’s T -period expected regret, i.e., the revenue loss relative to
a clairvoyant who knows the underlying demand model.
Motivated by this observation, we construct and study a personalized dynamic pricing model where
individual customer features affect the price sensitivity of demand as well as the potential market
size and customer taste. Thus, our model captures the joint impact of prices and customer features
in the form of feature-dependent price sensitivity (i.e., individual customers may have different
price-sensitivities; see the demand model (1) for details). To the best of our knowledge, this is the
first work to introduce feature-dependent price sensitivity in the design of personalized dynamic
pricing policies.
Characterizing problem complexity and the need for judicious price experiments.
The generality of our problem formulation (with feature-dependent price sensitivity and no addi-
tional assumptions of all prices being informative) affects the problem complexity in two major
ways. First, the best achievable regret performance in general is substantially worse than the best
achievable regret performance under the assumption that all prices are informative. In Theorem 1,
we prove that the seller's T-period regret is at least of order s√T under any admissible policy in
our general setting. In contrast, it has been shown that the seller’s T -period regret can grow loga-
rithmically in T under the assumption that all prices are informative (see, e.g., Qiang and Bayati
2016 and Javanmard and Nazerzadeh 2016). Theorem 1 thus shows that this cannot be achieved
in general. Second, and relatedly, we show that myopic policies can exhibit extremely poor per-
formance in our general problem setting. In §5.1.2 and §5.2.5, we conduct simulation experiments
and analyze a real-life data set from the U.S. auto loan industry to illustrate that not all prices
are necessarily informative in our general setup. As a result, the seller needs to use judicious price
experiments to achieve near-optimal revenue performance. Thus, the generality of our formulation
reveals the practical necessity of experimentation for optimal personalized dynamic pricing and
learning, which stands in stark contrast with the aforementioned related studies that show that
myopic policies are near-optimal under more restrictive assumptions.
Designing near-optimal policies and deriving performance guarantees. Our work
constructs policies that exhibit near-optimal performance in personalized dynamic pricing with
demand model uncertainty. To that end, we first consider the case of a seller who faces a linear
demand model and knows which features are the true predictors of demand. In this case, we design
an iterated estimation-and-pricing policy, and prove that this policy achieves a T-period regret
of order s√T log T (see Theorem 2). After that, we extend our analysis to the case where the
seller faces a generalized linear demand model and does not know the true demand predictors. For
this case, we design another policy that employs maximum quasi-likelihood regression with lasso
regularization, and show that this policy achieves a T-period regret of order s√T (log d + log T)
(see Theorem 3). The performance guarantees in Theorems 2 and 3 indicate that both of our poli-
cies are near-optimal (up to logarithmic terms) in view of the best achievable growth rate of regret
established in Theorem 1. Finally, as alluded to earlier, we validate these theoretical results with
two computational studies, one on simulated data based on a linear demand model, and another
on real-life data from a U.S. auto loan company based on a logit demand model (see §5).
Managerial insights for pricing practice. From an application perspective, the results in our
paper are of immediate relevance to practitioners. Comparing our approach with the actual pricing
decisions of the U.S. auto loan company in the aforementioned real-life data set, we observe that
our policies increase the company's expected revenue by 47% over a six-month period (see
§5 for a more detailed comparison). Furthermore, our results also demonstrate the suboptimality
of policies that are prevalent in practice, which are (i) myopic pricing, and (ii) segment-then-
optimize policies. Our computational results in §5 show that these policies can perform in the
worst possible manner (regret growing linearly with T ), which is tantamount to no learning taking
place. With regard to (i), as mentioned above, the poor performance of myopic pricing is due
to the lack of sufficient price experimentation. With regard to (ii), we consider two versions of
segment-then-optimize policies, one in which the seller performs customer segmentation then sets
the price for the average customer in each segment, and another in which the seller performs
customer segmentation then applies our near-optimal personalized dynamic pricing policy within
each segment. In the former version of the segment-then-optimize policy, the regret grows linearly
because the policy accumulates errors in revenue as every arriving customer deviates from the
average customer (almost surely, if the demand is continuous). In the latter version, the regret
grows sub-linearly but at multiples of our personalized policy applied to the entire non-segmented
data set. This is because the segment-then-optimize policy unnecessarily reduces the number of
observation samples to learn from, and prevents learning from customers across different segments
(e.g., if customers are segmented by location, they can still have similarities in other dimensions).
We note that these findings are on a conservative example where there are two sufficiently separated
customer populations (with a misclassification rate at just 4%), and similar results hold even when
the correct segmentation information is provided to the seller.
Furthermore, in §6 we demonstrate that our approach performs well even when a generative
demand model different from our assumed one is fitted to the aforementioned real-life data set.
Based on this observation, our demand model and accompanying results seem robust to model
misspecification in practice.
Cohen et al. (2016) consider a feature-based dynamic pricing problem in which the market
value of each product is a linear function of the feature vector. They design a pricing policy
based on ellipsoid methods and derive performance guarantees on the worst-case regret of their
policy. Qiang and Bayati (2016) study the performance of a myopic policy called greedy iterated
least squares, and show that this policy can exhibit near-optimal revenue performance under cer-
tain conditions. Javanmard and Nazerzadeh (2016) consider a dynamic pricing problem with a
binary choice model, and construct a near-optimal policy in their setting. Our work is distinct
from the aforementioned studies in major ways. First of all, our model captures the case of feature-
dependent price sensitivity (i.e., the price sensitivity of demand is allowed to depend on individual
customer features), and we do not assume that all prices are informative. As discussed in §1.2 and
shown in detail below, these aspects of our model significantly affect the problem complexity and
shed light on the design of near-optimal policies in personalized pricing and learning. As shown
by Qiang and Bayati (2016), a myopic policy can be near-optimal when the price sensitivity of
demand does not depend on customer features and all prices are assumed to be informative. In
contrast, we demonstrate that such myopic policies can perform poorly in our general problem
formulation (see §5). Secondly, we show that the best achievable revenue performance in the full
generality of our setting is different from those in the dynamic pricing studies above (see §3).
Informed by this result, we construct policies that achieve said performance benchmark, providing
key practical guidelines on how to conduct price experimentation in the presence of customer fea-
ture information. In addition to the above studies, Nambiar et al. (2017) recently studied model
misspecification in the context of dynamic pricing and product features. Our work is differentiated
from theirs since they also analyze the case where the price sensitivity of demand does not depend
on features and their focus is on model misspecification.
More broadly, our paper also contributes to the growing literature in operations management on
the inclusion of feature (a.k.a. attribute/covariate) information in decision making (see, for instance,
Ferreira et al. 2015 for an empirical analytics study, and Ban and Rudin 2018, Ban et al. 2019 for
theoretical studies, in addition to Chen et al. 2015, Cohen et al. 2016, Qiang and Bayati 2016,
Javanmard and Nazerzadeh 2016 discussed in detail above).
2. Problem Formulation
Basic model elements. Consider a firm, hereafter referred to as the seller, that offers a product
for sale to T customers who arrive sequentially. Viewing each sales opportunity as a “period,” the
seller has a discrete time horizon of T periods and can dynamically adjust the product’s price over
the time horizon.
At the beginning of period t = 1, 2, . . . , T, the seller observes a d-dimensional vector of fea-
tures pertaining to the customer arriving in period t. We denote this random vector by Zt =
(Zt1 , Zt2 , . . . , Ztd ) and assume that {Zt , t = 1, 2, . . . , T } are independent and identically distributed
with a compact support Z ⊆ B0 (zmax ) ⊂ Rd , where B0 (zmax ) is the d-dimensional ball of radius
zmax > 0, and E[Zt] is normalized to 0.¹ We denote by ΣZ = E[Zt Ztᵀ] the covariance matrix of {Zt}
and assume that ΣZ is a symmetric and positive definite matrix. We note that the seller need
not know ΣZ . We allow Zt to include individual customer data and macroeconomic factors that
entail categorical features (e.g., postal code and income bracket) as well as continuous features
(e.g., credit score). This means that some components of Zt are continuous random variables, and
others discrete random variables. For the features modeled as continuous random variables, the
only assumption we make is that they have positive measure in the interior of their domains and
zero on the boundary. We also introduce the augmented feature vector Xt := (1, Zt) ∈ Rd+1 for expositional
convenience that will become apparent below. Accordingly, we denote the support of Xt by
X = {1} × Z, and the expectation over the product measure on X × · · · × X by EX{·}.
Upon observing Xt = xt , the seller chooses a price pt ∈ [ℓ, u] to be offered to the customer arriving
in period t, where 0 < ℓ < u < ∞. Then, the seller observes this customer’s demand in response to
pt , which is given by
Dt = g(α · xt + (β · xt) pt) + εt for t = 1, 2, . . . , T, (1)
where α, β ∈ Rd+1 are demand parameter vectors unknown to the seller, g(·) is a known function,
εt is the unobservable and idiosyncratic demand shock of the customer arriving in period t, and
u · v = Σ_{i=1}^{d+1} ui vi denotes the inner product of vectors u and v. Note that the demand model (1)
captures feature-dependent customer taste and potential market size (through α · xt ) as well as
feature-dependent price sensitivity (through β · xt ).
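As a minimal illustration (not part of the original analysis), the following Python sketch simulates demand responses from a model of the form (1); the parameter values, the linear link, and the noise scale are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                                                  # illustrative feature dimension (hypothetical)
alpha = np.array([1.0, 0.2, 0.0, -0.1, 0.3, 0.0])      # alpha in R^{d+1}; first entry multiplies the constant 1
beta = np.array([-0.5, -0.1, 0.0, 0.0, 0.2, 0.0])      # beta in R^{d+1}; beta . x is the price sensitivity

def demand(x, p, g=lambda xi: xi, noise_sd=0.01):
    """Demand model (1): D = g(alpha . x + (beta . x) p) + eps, here with a linear link."""
    eps = rng.normal(0.0, noise_sd)                    # idiosyncratic demand shock
    return g(alpha @ x + (beta @ x) * p) + eps

for t in range(3):
    z = rng.normal(0.0, 0.2, size=d)                   # feature vector Z_t (mean normalized to zero)
    x = np.concatenate(([1.0], z))                     # augmented feature vector X_t = (1, Z_t)
    print(t + 1, round(demand(x, p=1.0), 4))           # demand at an arbitrary feasible price
```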
Let θ := (α, β) be the vector of all unknown demand parameters, and Θ be a compact rectangle
in R2(d+1) from which the value of θ is chosen. We allow for d to be large, possibly larger than the
selling horizon T , but assume that a smaller subset of the d features have a sizable effect in the
demand model. We express this sparsity structure as follows: let Sα := {i = 1, . . . , d + 1 : αi ≠ 0},
Sβ := {i = 1, . . . , d + 1 : βi ≠ 0}, and S := Sα ∪ Sβ. Note that S contains the indices of all non-
zero components of α and β. For notational convenience, we use the set S to express the sparsity
structure in the unknown parameter vector θ = (α, β). (If the non-zero components of α and β are
distinct, one could use Sα and Sβ to express the sparsity structures in α and β separately; our
¹ In many applications, E[Zt] need not be 0, but this is subsumed in our analysis: if E[Zt] equals a non-zero vector
µZ ∈ Rd , then we replace the first components of the vectors α, β ∈ Rd+1 in the demand model (1) with α̃1 =
α1 + (α2 , . . . , αd+1 )T µZ and β̃1 = β1 + (β2 , . . . , βd+1 )T µZ . This normalizes the value of E[Zt ] to 0 without influencing
other distributional assumptions on {Zt }. Moreover, by absorbing µZ into the unknown vectors α and β, we capture
the fact that the seller does not know µZ . We note that our analysis in the following section is valid also when the
seller uses unnormalized feature vectors with an unknown, non-zero mean µZ .
analysis is valid for that case because S already includes all components that influence demand.)
Define αS = (αi )i∈S and βS = (βi)i∈S as the vectors consisting of the components of α and β,
respectively, whose indices are in S , and θS = (αS , βS ). Note that θS is a compressed vector that
contains all non-zero components of θ; hence, we hereafter refer to θS as the compressed version
of θ. Let s ∈ {1, . . . , d + 1} be the cardinality of S , and denote the compressed versions of the
key quantities defined earlier with a subscript S . Thus, the compressed version of Θ is ΘS =
{θS : θ ∈ Θ} ⊂ R2s. For t = 1, . . . , T, the compressed versions of Zt and Xt are ZS,t ∈ ZS ⊂ Rs and
XS,t = (1, ZS,t) ∈ XS ⊂ Rs+1, respectively, where ZS = {(zi)i∈S : z ∈ Z} and XS = {1} × ZS.
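The following short sketch (with hypothetical parameter values) illustrates the sparsity set S and the compressed vectors just defined; indices are 0-based in the code, whereas the text uses 1-based indices.

```python
import numpy as np

# Hypothetical parameter vectors with d + 1 = 5 components each.
alpha = np.array([1.1, -0.1, 0.0, 0.1, 0.0])
beta  = np.array([-0.5, 0.0, 0.2, 0.0, 0.0])

# S = S_alpha U S_beta collects the indices of all non-zero components (0-based here).
S = sorted(set(np.flatnonzero(alpha)) | set(np.flatnonzero(beta)))
alpha_S, beta_S = alpha[S], beta[S]                    # compressed vectors alpha_S and beta_S
theta_S = np.concatenate([alpha_S, beta_S])            # compressed parameter vector theta_S in R^{2s}
print(S, len(S), theta_S)                              # s = |S| relevant indices
```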
The demand function in (1) is known as a generalized linear model (GLM) because, given x ∈ X ,
the function that maps price p to expected demand is the composition of the function g : R → R
and the linear function p 7→ α · x + (β · x) p. In this relationship, the function g(·) is referred to
as the “link” function that captures potential nonlinearities in the demand-price relationship. We
assume that g(·) is differentiable and increasing. This assumption is satisfied for a broad family of
functions including linear, logit, probit, and exponential demand functions. It also implies that the
link function has bounded derivatives over its compact domain.²
We assume that {εt , t = 1, 2, . . .} is a sub-Gaussian martingale difference sequence; that is,
E[εt | Ft−1] = 0, and there exist positive constants σ0 and η0 such that E[εt² | Ft−1] ≤ σ0² and
E[e^{ηεt} | Ft−1] < ∞ for all η satisfying |η| < η0, where Ft = σ(p1, . . . , pt, ε1, . . . , εt, X1, . . . , Xt+1) and the
construction of admissible price sequences {pt , t = 1, 2, . . .} is specified below. (A simple example
of this setting is where {εt } are bounded and have zero mean.) We note that the distribution of εt
can depend on price and feature observations. This implies that the idiosyncratic demand shocks
of customers are allowed to be dependent on prices and customer features in our formulation.
Moreover, the generality of the above demand-shock distribution allows for continuous as well as
discrete demand distributions. A noteworthy example within discrete demand distributions is the
binary customer response model, where {εt } are such that Dt ∈ {0, 1} for all t. In this case, the
event {Dt = 1} corresponds to a sale at the offered price pt , whereas {Dt = 0} corresponds to no
sale.
Given θ = (α, β) ∈ Θ and x = (1, z) ∈ X, the seller's expected single-period revenue is
r(p, θ, x) = p g(α · x + (β · x)p) for p ∈ [ℓ, u]. (2)
Let ϕ(θ, x) = argmaxp {r(p, θ, x)} denote the unconstrained revenue-maximizing price in terms of
θ ∈ Θ and x ∈ X . We assume that ϕ(θ, x) is in the interior of the feasible set [ℓ, u] for all θ ∈ Θ and
x ∈ X.
² That is, there exist ℓ̃, ũ ∈ R satisfying 0 < ℓ̃ ≤ |g′(ξ)| ≤ ũ < ∞ for all ξ = α · x + (β · x) p such that (α, β) ∈ Θ, x ∈ X,
and p ∈ [ℓ, u] (here and later, a prime denotes a derivative).
where Pε {·} is the probability measure governing {εt , t = 1, 2, . . .}. The seller’s conditional expected
revenue loss in T periods relative to a clairvoyant who knows the underlying demand parameter
vector θ is defined as
∆πθ(T; X^T) = Eπθ{ Σ_{t=1}^{T} [ r∗(θ, Xt) − r(pπt, θ, Xt) ] | X^T } (3)
for θ ∈ Θ, π ∈ Π, and X T = (X1 , . . . , XT ) ∈ X T , where: Eπθ {·} is the expectation operator associated
with Pπθ {·}, r ∗ (θ, x) = r(ϕ(θ, x), θ, x) is the maximum single-period revenue function, and pπt is the
price charged in period t under policy π. We call this performance metric the seller’s T -period
conditional regret, which is a random variable that depends on the realization of X T = (X1 , . . . , XT ).
The seller aims to minimize its T -period expected regret, given by
∆πθ(T) = EX{ ∆πθ(T; X^T) } (4)
for θ ∈ Θ and π ∈ Π, where EX {·} is the expectation operator associated with the probability
measure governing {Xt , t = 1, 2, . . .}. Throughout the sequel, we also use the expectation notation
EπX,θ {·} := EX {Eπθ {·}}, and let PπX,θ {·} be the probability measure associated with EπX,θ {·}. Finally,
to describe the complexity of our problem setting, we use the seller’s worst-case expected regret,
which is defined as ∆π (T ) = sup{∆πθ (T ) : θ ∈ Θ}.
Dt = α · xt + (β · xt ) pt + εt for t = 1, 2, . . . , T. (5)
We hereafter refer to (5) as the linear demand model. Note that, for this model, the unconstrained
revenue-maximizing price is ϕ(θ, x) = −(α · x)/(2β · x) for θ = (α, β) ∈ Θ and x ∈ X .
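As a quick numerical check of these definitions (illustrative only; the parameter values are hypothetical), the sketch below computes the unconstrained optimal price ϕ(θ, x) = −(α · x)/(2(β · x)) for the linear model (5) and the single-period revenue loss of charging a suboptimal price.

```python
import numpy as np

# Hypothetical parameters for the linear demand model (5); beta . x < 0 makes revenue concave in p.
alpha = np.array([1.1, 0.2, -0.1])
beta  = np.array([-0.5, -0.1, 0.0])

def revenue(p, x):
    """Expected single-period revenue r(p, theta, x) = p (alpha . x + (beta . x) p)."""
    return p * (alpha @ x + (beta @ x) * p)

def optimal_price(x):
    """Unconstrained revenue maximizer phi(theta, x) = -(alpha . x) / (2 (beta . x))."""
    return -(alpha @ x) / (2.0 * (beta @ x))

x = np.array([1.0, 0.3, -0.2])                          # augmented feature vector (1, z)
p_star = optimal_price(x)
single_period_loss = revenue(p_star, x) - revenue(0.8, x)   # regret of charging 0.8 instead
print(round(p_star, 3), round(single_period_loss, 4))
```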
We consider in the following result the linear demand model (5) in the case where {εt } follow
an independent and identically distributed sequence of normal random variables. In this setting,
we present a lower bound on the seller’s expected regret under any admissible policy.
Theorem 1. (lower bound on regret) Let {Dt} be given by the linear demand model (5),
and let the εt be i.i.d. N(0, σ0²) with σ0 > 0. Then, there exists a finite positive constant c such that
∆π(T) ≥ c s√T for all π ∈ Π and T ≥ 2.
Remark 1. The constant c is independent of s and T (see the proof of Theorem 1 in Appendix A
for an explicit computation of this constant). We consider the case where the εt are i.i.d. N(0, σ0²) for
exposition purposes, but the proof of Theorem 1 is valid for the broader, exponential family of distributions.
Remark 2. Theorem 1 implies that sup_{g∈G} inf_{π∈Π} {∆g,π(T)} ≥ c s√T for all T ≥ 2, where G
denotes the set of all differentiable and increasing functions, and ∆g,π(T) is the T-period expected
regret of policy π, with its dependence on the link function g(·) expressed explicitly. In other words,
the lower bound in Theorem 1 is a worst-case lower bound on the minimum regret for a broad
class of link functions.
Theorem 1 characterizes the complexity of our problem setting, stating that the expected regret
of any admissible policy must grow at least in the order of s√T. Thus, any policy whose expected
regret is of order s√T (up to logarithmic terms) is hereafter called a first-order optimal policy. In
the following section, we design such first-order optimal policies, and prove that the lower bound
in Theorem 1 is tight (up to logarithmic terms).
The proof of Theorem 1 is based on the analysis of the empirical Fisher information matrix,
which is given by
Jt := [ Σ_{k=1}^{t} xk xkᵀ      Σ_{k=1}^{t} pk xk xkᵀ
        Σ_{k=1}^{t} pk xk xkᵀ   Σ_{k=1}^{t} pk² xk xkᵀ ]  =  Σ_{k=1}^{t} (1, pk)ᵀ(1, pk) ⊗ xk xkᵀ, (6)
where ⊗ denotes the Kronecker product of matrices. In this proof, we show that the seller’s expected
revenue loss in period t is inversely proportional to the smallest eigenvalue of Jt , denoted by
µmin (Jt ). Simply put, the seller’s cumulative information in period t, as measured by µmin (Jt ),
characterizes the growth rate of the expected revenue loss. For example, if all feasible prices were
assumed to provide substantial information about the underlying demand model, then µmin (Jt )
would increase by a positive amount in every period, growing linearly over time under any given
policy. In such a setting, there would be no need for active price experimentation. To be more
precise, under a myopic policy that puts no emphasis on price experimentation, the seller’s expected
revenue loss in period t would be proportional to 1/t. By the growth rate of the harmonic series, this
implies that the seller’s expected revenue loss over T periods would be of order log T . As a result,
in a setting where all prices are assumed to be informative, the seller can use a myopic policy
to achieve an expected regret of order log T . Theorem 1 shows that this performance benchmark
cannot be achieved in our general setting. In general, not all prices are necessarily informative,
and hence, the seller needs to implement a judicious amount of price experiments. This qualitative
insight is key in our design of first-order optimal policies.
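The following tiny sketch (illustrative only) checks the harmonic-series argument above numerically: a per-period loss proportional to 1/t accumulates to a total of order log T.

```python
import numpy as np

# If all prices were informative, mu_min(J_t) would grow linearly in t, the per-period revenue
# loss would be proportional to 1/t, and the cumulative loss over T periods would therefore
# grow like the harmonic series, i.e., logarithmically in T.
T = 10_000
cumulative_loss = np.sum(1.0 / np.arange(1, T + 1))
print(round(cumulative_loss, 2), round(np.log(T), 2))   # ~9.79 vs. ~9.21: both of order log T
```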
The above observation is consistent with Broder and Rusmevichientong (2012), who studied
dynamic pricing with learning in the absence of customer feature information. In particular,
Broder and Rusmevichientong (2012) showed that if the problem setup assumes that all prices are
informative (also known as the “well-separated” setting), then even myopic pricing policies can
rapidly accumulate information and achieve near-optimal revenue performance with logarithmically
growing regret. What Qiang and Bayati (2016) and Javanmard and Nazerzadeh (2016) study are
precisely pricing problems in this well-separated setting. As such, these papers show that myopic
pricing is near-optimal with logarithmic regret growth. To reiterate, attaining this benchmark is
not possible with our more general demand model. In contrast to the settings of Qiang and Bayati
(2016) and Javanmard and Nazerzadeh (2016), our setting does not assume that all prices are infor-
mative, which necessitates some price experimentation for optimal learning. Table 2 summarizes
these differences to position our work within the related literature.
Table 2
                           general                               well-separated
with feature information   this paper                            Qiang and Bayati (2016), Javanmard and Nazerzadeh (2016)
no feature information     Broder and Rusmevichientong (2012)    Broder and Rusmevichientong (2012)
Therefore, our setting is differentiated from the settings of Qiang and Bayati (2016) and
Javanmard and Nazerzadeh (2016), allowing us to construct a new family of policies and results
that shed light on how price experiments should be implemented in the presence of customer feature
information.
In §5, we investigate both simulated and real-life data sets to illustrate the above points. This
data analysis also indicates that myopic policies exhibit a more severe complication beyond the
complexity characterized in Theorem 1. With no experimentation, the price paths of myopic policies
can converge to the boundary of the feasible price range. This results in an expected regret of
order T , which is substantially worse than the performance benchmark in Theorem 1. That is,
the convergence of price paths to the feasible price boundary makes myopic policies perform much
more poorly than the general regret bound in Theorem 1. We also note that the boundedness of the
feasible price set plays a minimal role in the derivation of Theorem 1. We use this boundedness for
replacing the expected value of the optimal price with an infimum, and the theorem’s conclusion
holds as long as that expected value is finite and positive. The proof of this theorem, as explained
above, is based on the rate of information accumulation, rather than bounds on feasible prices.
Mi = {t = L² + i − 1 : L = 1, 2, . . .}. (7)
For brevity, we denote by M = M1 ∪ M2 the set of all experimentation periods. The price exper-
imentation scheme in (7) ensures that, for all t ≥ 5, each experimental price is charged at least
(1/4)√t times. This scheme uses two prices for experimentation; as seen below, one needs at least
two distinct experimental prices to ensure that regression estimates are well-defined in all periods.
For expositional purposes, we use exactly two experimental prices in the scheme described above,
noting that our analysis remains valid as long as two or more experimental prices are used.
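A minimal sketch (assuming the square-spaced schedule in (7); the function names are illustrative) that generates the experimentation periods and numerically checks the (1/4)√t counting claim made above.

```python
import math

def experimentation_periods(i, horizon):
    """Periods in M_i = {L^2 + i - 1 : L = 1, 2, ...} up to the horizon, as in scheme (7)."""
    periods = set()
    L = 1
    while L * L + i - 1 <= horizon:
        periods.add(L * L + i - 1)
        L += 1
    return periods

T = 5000
M1 = experimentation_periods(1, T)                      # periods in which the first experimental price is charged
for t in (5, 100, 5000):
    count = sum(1 for k in M1 if k <= t)
    print(t, count, round(0.25 * math.sqrt(t), 2))      # count >= sqrt(t)/4, as claimed in the text
```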
where
J̃S,t := [ Σ_{k=1}^{t} χk xS,k xS,kᵀ      Σ_{k=1}^{t} χk pk xS,k xS,kᵀ
          Σ_{k=1}^{t} χk pk xS,k xS,kᵀ   Σ_{k=1}^{t} χk pk² xS,k xS,kᵀ ]  =  Σ_{k=1}^{t} χk (1, pk)ᵀ(1, pk) ⊗ xS,k xS,kᵀ, (10)
and ⊗ denotes the Kronecker product of matrices. Moreover, letting MS,t := Σ_{k=1}^{t} χk (1, pk)ᵀ ⊗ xS,k εk, we deduce from (5) and (9) that
Let ϑ̂S ,t+1 be the truncated estimate that satisfies ϑ̂S ,t+1 = PΘ {θ̂S ,t+1 }, where PΘ : R2s → ΘS is the
projection mapping from R2s onto the compressed parameter space ΘS . After observing the feature
vector XS ,t = xS ,t in period t, the ILSX policy with parameters m1 , m2 , denoted by ILSX(m1 , m2 ),
charges the price
pt = m1 if t ∈ M1, pt = m2 if t ∈ M2, and pt = ϕ(ϑ̂S,t, xS,t) otherwise. (12)
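The following Python sketch captures the overall shape of the pricing rule (12). It is not the paper's exact implementation: in particular, the projection onto ΘS and the restriction of the regression to experimentation periods are omitted, and all names and defaults are illustrative.

```python
import numpy as np

def ilsx_price(t, x, theta_hat, m1, m2, lo, hi, M1, M2):
    """One pricing decision of an ILSX-style policy (sketch of (12)): charge the experimental
    prices m1/m2 in the experimentation periods, otherwise act greedily on the current estimate."""
    if t in M1:
        return m1
    if t in M2:
        return m2
    k = len(x)
    alpha_hat, beta_hat = theta_hat[:k], theta_hat[k:]
    p = -(alpha_hat @ x) / (2.0 * (beta_hat @ x))        # phi(theta_hat, x) for the linear model
    return float(np.clip(p, lo, hi))                     # keep the price in the feasible range

def least_squares_estimate(X, P, D):
    """Least-squares estimate of (alpha, beta) from observations D_k = alpha.x_k + (beta.x_k) p_k + eps_k."""
    U = np.hstack([X, X * P[:, None]])                   # regressor (x_k, p_k x_k) for each observation
    theta_hat, *_ = np.linalg.lstsq(U, D, rcond=None)
    return theta_hat
```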
Analysis of the estimation error and regret. To study the estimation error under the ILSX
policy, we measure information with the following metric:
Jt = Σ_{k=1}^{t} χk (pk − p̄t)² for all t ≥ 1, (13)
where p̄t = Σ_{k=1}^{t} χk pk / Σ_{k=1}^{t} χk. Note that Jt is the cumulative squared deviation in the price
sequence {p1 , p2 , . . . , pt }, which provides a measure of quadratic variation in prices. In the compu-
tation of Jt , the indicator variables {χk } ensure that only the experimentation periods are counted
for measuring information. As the experimental prices m1 and m2 are selected deterministically
under the policy ILSX(m1 , m2 ), the metric Jt is non-random for all t under this policy. Based on
this, our next result shows how Jt is connected to the smallest eigenvalue of the information matrix
J˜S ,t under ILSX(m1 , m2 ).
Lemma 1 states that, with high probability, the minimum eigenvalue of the information matrix
J˜S ,t grows at least in the order of Jt under the policy ILSX(m1 , m2 ). To characterize the growth
rate of Jt under ILSX(m1, m2), recall that this policy charges each experimental price at least (1/4)√t
times through the end of period t ≥ 5. This implies that, under ILSX(m1, m2), we have
Jt ≥ (1/8)(m1 − m2)²√t for all t ≥ 5. (15)
Given this growth rate of Jt , we derive the following result regarding the estimation error under
ILSX(m1 , m2 ).
According to Lemma 2, with high probability, the estimation error under ILSX(m1, m2) converges
to zero at a rate of (log t)/√t. Using this lemma, we obtain the following performance guarantee
for ILSX(m1 , m2 ).
∆πθS(T) ≤ C s√T log T for all θS ∈ ΘS and T ≥ 2.
where: χk = I{k ∈ M} and uk = (1, pk)ᵀ ⊗ xk for k ∈ {1, 2, . . .}, ν(y) = g′(g⁻¹(y)) for y ∈ R, and
‖θ̃‖₁ = Σ_{i=1}^{2(d+1)} |θ̃i| denotes the ℓ1-norm of θ̃. The estimation objective in (17) is a lasso-regularized
quasi-likelihood function for the seller’s observations in the first t periods (for early references on
maximum quasi-likelihood estimation and lasso regularization, see Nelder and Wedderburn 1972
and Tibshirani 1996, respectively). This estimation objective subsumes as a special case the least-
squares estimation objective studied in the preceding subsection: if the demand model were linear
(i.e., g(ξ) = ξ for all ξ ∈ R) and there were no lasso regularization penalty (i.e., λ̃ = 0) then the
maximizer of Qt(θ̃, λ̃) would equal the least-squares error minimizer of St(θ̃) = Σ_{k=1}^{t} χk (Dk − θ̃ · uk)²
for all θ̃. Moreover, in other special cases, maximizing the estimation objective in (17) is equivalent
to standard maximum likelihood estimation with lasso regularization; see, e.g., §5 for an application
to logit demand models.
It is straightforward to show that, given λ̃ ≥ 0, the mapping θ̃ 7→ Qt (θ̃, λ̃) is strictly concave and
has a unique maximizer, which we use as an estimate of θ̃. We denote this maximum quasi-likelihood
estimate by
θ̂^{(lasso)}_{t+1}(λ̃) = argmax_{θ̃ ∈ R^{2(d+1)}} Qt(θ̃, λ̃) for λ̃ ≥ 0, (18)
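For the logit special case mentioned above, the maximizer of (17) coincides with an ℓ1-penalized logistic regression; a minimal scikit-learn sketch follows. The data layout and the mapping between the paper's penalty weight λ̃ and sklearn's inverse-strength parameter C are assumptions made only for illustration, not an exact equivalence.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lasso_logit_estimate(X, P, D, lam):
    """Sketch of the lasso-regularized estimate (18) for the binary (logit) special case.
    X: (n, d+1) augmented features, P: (n,) prices, D: (n,) binary responses,
    lam: penalty weight (roughly mapped to sklearn's C; the correspondence is approximate)."""
    U = np.hstack([X, X * P[:, None]])                   # regressor u_k = (x_k, p_k x_k)
    model = LogisticRegression(penalty="l1", solver="liblinear",
                               C=1.0 / max(lam, 1e-12), fit_intercept=False)
    model.fit(U, D)
    return model.coef_.ravel()                           # estimate of theta = (alpha, beta)
```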
Using Lemma 3, we obtain the following performance guarantee for ILQX(m1 , m2 , λ).
Theorem 3 shows that the lasso-based ILQX policy achieves the lowest possible growth rate of
regret presented in Theorem 1 (up to logarithmic terms) and is therefore first-order optimal.
As argued earlier, using a policy with no regularization in the case of unknown sparsity structure
would make the T-period regret grow in the order of d√T, which is substantially larger than s√T,
the best achievable rate given in Theorem 1. In contrast, Theorem 3 establishes that ILQX has
the smallest growth rate of regret that can be achieved in the case of a linear demand model with
known sparsity structure (up to logarithmic terms). Thus, a significant consequence of Theorem 3
is that the ILQX policy effectively recovers the added cost of not knowing the sparsity structure
and having a generalized linear demand model instead of a linear one, thereby achieving first-order
optimality in a fairly general setting.
5. Computational Results
In this section, we first run illustrative simulation experiments in the context of the linear demand
model. After that, to shed further light on our general analysis, we study a real-life data set from
the consumer lending industry that involves nonlinear customer response.
5.1. Case Study I: Simulation Experiments for the Linear Demand Model
In the following simulation experiments, we employ the linear demand model (5) to demonstrate the
value of: (a) regularization, (b) price experimentation over greedy pricing, and (c) individualized
pricing over segmentation-based pricing.
5.1.1. Value of regularization. To illustrate the theoretical performance results derived in §4,
we examine the performance of three personalized dynamic pricing policies, namely the policies
of: (i) a semi-clairvoyant seller who employs the unregularized price experimentation policy ILSX
on the relevant s out of d dimensions of features, as described in §4.1; (ii) a seller who employs
ILSX on all d dimensions of features; and (iii) a seller who employs the lasso-regularized price
experimentation policy ILQX described in §4.2. Note that (i) is the case of known sparsity structure
(i.e., the seller knows which customer features are relevant), whereas (ii) and (iii) correspond to
the case of unknown sparsity structure. Moreover, since we consider the linear demand model in
this section, the essential difference between ILSX and ILQX is lasso regularization.
In our simulation experiments, we consider the following problem parameters. The feasible price
interval is [ℓ, u] = [0.2, 2] and the first two prices of all policies are chosen arbitrarily at p1 =
m1 = 1.1 and p2 = m2 = 1.2. The (unnormalized) customer feature vectors {Zt } are multivariate
normal random variables with d = 14 dimensions, with their mean µZ equal to a vector of ones
and covariance matrix ΣZ = 0.2I14 , where I14 is the 14 × 14 identity matrix. The unknown demand
parameter vector is θ = (α, β) ∈ R30 , where
α = [ 1.1, −0.1, 0, 0.1, 0, 0.2, 0, 0.1, −0.1, 0, 0, 0.1, −0.1, 0.2, −0.2 ],
β = (−1) × [ 0.5, 0.1, −0.1, 0, 0, 0, 0, 0.2, 0.1, 0.2, 0, 0.2, −0.1, −0.2, 0 ].
Note that 11 out of 30 components of θ are zero in this model; thus, the presence of zero components
would have a nontrivial effect. The parameter space is Θ = [−1.5, 1.5]30 , and the demand shocks
{εt} are normally distributed with mean zero and standard deviation σ0 = 0.01. For ILQX, the
regularization parameter sequence λ = (λ1, λ2, . . .) is selected such that λt+1 = 0.05 t^{1/4} √(log d + log t)
for all t; this choice is informed by the theory developed in §4.2.
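For reproducibility, a short sketch of the feature distribution and the regularization schedule used in this experiment is given below; the random seed and the printed checkpoints are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 14
mu_Z, Sigma_Z = np.ones(d), 0.2 * np.eye(d)              # feature distribution from this subsection

def ilqx_lambda(t, d=d, c=0.05):
    """Regularization schedule lambda_{t+1} = c * t^{1/4} * sqrt(log d + log t)."""
    return c * t ** 0.25 * np.sqrt(np.log(d) + np.log(t))

Z = rng.multivariate_normal(mu_Z, Sigma_Z, size=3)        # a few sample (unnormalized) feature vectors
print(round(ilqx_lambda(100), 3), round(ilqx_lambda(20_000), 3))
```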
Figure 1 Value of regularization. The solid, dashed, and dotted curves display the T -period regret of (i) a
semi-clairvoyant seller who employs the unregularized price experimentation policy ILSX on the relevant s out of d
dimensions of features; (ii) a seller who employs ILSX on all d dimensions of features; and (iii) a seller who employs
the lasso-regularized price experimentation policy ILQX, respectively. The problem parameters are as given in §5.1.1.
The displayed regret values are computed by averaging the realized regret over 30 sample paths.
The performance of the three pricing policies, as measured by regret, is shown in Figure 1. First
of all, we observe that the seller's regret over T periods, i.e., ∆πθ(T), grows roughly at rate √T
under all three policies. Second, we observe that using ILSX on all d dimensions of features yields
the worst performance, which is expected because ILSX neither knows nor adjusts for the sparsity
present in the model considered. Third, we observe that the lasso-regularized policy, ILQX, performs
slightly better than the first-order optimal semi-clairvoyant policy for short time horizons, whereas
the performance of these two policies are essentially the same for long time horizons. As noted
by Hastie et al. (2009, p. 57), shrinking or setting some coefficients to zero can help reduce the
variance of least-squares estimates; thus, the simple and sparse model of ILQX can be more effective
at handling noise that arises from having a small number of observations in estimation (note that,
for T = 20,000 there are about √20000 ≈ 141 experimental periods used in the estimation step of
the policies). This observation is encouraging from a practical standpoint because it suggests that
a seller (who would in practice not know the sparse components of a model, if present) can do very
well by employing ILQX even when the number of samples is small.
5.1.2. Value of price experimentation. As alluded to in §3, myopic policies, which do not
explicitly experiment with prices, can lead to remarkably poor revenue performance in the context
of personalized dynamic pricing. Specifically, we demonstrate that a myopic pricing policy can
make the seller’s regret grow linearly in T , which is the worst possible growth rate of regret.
First, let us define what is meant by myopic pricing in our formulation. Consider the linear
demand model (5), and suppose that the seller charges two distinct prices p1 , p2 ∈ [ℓ, u] in the
first two periods. Then, at the end of every period t ≥ 2, the seller computes the least-squares
estimate of the unknown parameter vector θ, and in the following period, charges the price that
would have been optimal if θ were equal to its most recent estimate. The seller subsequently
repeats this estimate-and-optimize routine until the end of the time horizon. This pricing policy
is typically referred to as “greedy iterated least squares” (abbreviated greedy ILS, or GILS) (see
Keskin and Zeevi 2014, Qiang and Bayati 2016). Note that this is a myopic policy that puts no
emphasis on price experimentation to accelerate learning, and simply maximizes the single-period
revenue function based on the most recent estimates.
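A minimal sketch of greedy ILS for the linear model is given below (illustrative only; the initial prices, the price bounds, and the guard against a degenerate denominator are assumptions, not part of the original specification).

```python
import numpy as np

def greedy_ils(X_stream, demand_fn, p_init=(1.1, 1.2), lo=0.2, hi=2.0):
    """Greedy iterated least squares (sketch): after two initial prices, always charge the price
    that is optimal for the current least-squares estimate, with no price experimentation."""
    X_hist, P_hist, D_hist, prices = [], [], [], []
    for t, x in enumerate(X_stream):
        if t < 2:
            p = p_init[t]
        else:
            X_arr, P_arr = np.array(X_hist), np.array(P_hist)
            U = np.hstack([X_arr, X_arr * P_arr[:, None]])           # regressor (x_k, p_k x_k)
            theta, *_ = np.linalg.lstsq(U, np.array(D_hist), rcond=None)
            a_hat, b_hat = theta[:len(x)], theta[len(x):]
            denom = 2.0 * (b_hat @ x)
            p = hi if abs(denom) < 1e-12 else float(np.clip(-(a_hat @ x) / denom, lo, hi))
        d_obs = demand_fn(x, p)                                       # observed demand at the charged price
        X_hist.append(x); P_hist.append(p); D_hist.append(d_obs); prices.append(p)
    return prices
```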
To demonstrate the performance of the myopic pricing policy versus one based on price exper-
imentation (as in §4.1), we compare greedy ILS against ILQX in the problem setting in §5.1.1.
Figure 2 shows that the regret of greedy ILS grows at a linear rate, whereas ILQX maintains a
sublinear growth rate—a stark difference in performance.
Figure 2 Value of price experimentation. The solid and dotted curves display the T -period regret of (i)
greedy ILS, which is a myopic policy that puts no emphasis on learning, and (ii) ILQX, which dedicates a judiciously
selected number of periods to price experimentation, respectively. The problem setting is as described in §5.1.2. The
displayed regret values are computed by averaging the realized regret over 30 sample paths.
To delve further into the difference between the two policies, we plot in Figure 3 the price paths
followed by the two policies in 30 independent simulation runs, together with the optimal price
path. Interestingly, in many instances the price path of greedy ILS converges to the lower price
boundary of 0.2 and stays there, despite the optimal prices being far away from that boundary. The
histogram in Figure 3(b) makes this clear, showing that 29 out of 30 prices charged by greedy ILS
in period 5000 are equal to 0.2. We note that, when there is a subset of estimate values that
make greedy ILS charge a price at the boundary of the feasible range, this boundary price would be
an uninformative action for the seller. That is, if greedy ILS keeps charging the boundary price,
there would be very limited cumulative price variation, and thus, the demand parameter estimates
would not change significantly over time. In turn, this would make greedy ILS charge future prices
near the same boundary, resulting in a vicious cycle. This is how the presence of uninformative
prices in our general formulation leads to poor regret performance under greedy ILS. In contrast,
the experimentation-based policy, ILQX, does not suffer from charging prices at the boundary, with
Figure 3(d) showing that most of the prices charged by this policy are between 0.5 and 1 by period
5000, a narrow range within which the optimal price sequence also falls.
[Figure 3 panels: (a) price paths of greedy ILS; (b) price histogram of greedy ILS; (c) price paths of ILQX; (d) price histogram of ILQX.]
Figure 3 Price evolution under greedy ILS and ILQX. Panels (a) and (b) depict sample paths of the price
sequence {pt } and the histogram of the price in period 5000, respectively, generated under greedy ILS. Panels (c)
and (d) show sample paths of the price sequence {pt } and the histogram of the price in period 5000, respectively,
generated under ILQX. All panels are based on the setting considered in §5.1.2, and there are 30 sample paths in total.
In panels (a) and (c), each solid curve displays one of the 30 sample paths of {pt }, and the dotted curve displays the
underlying optimal price sequence. Under greedy ILS, the vast majority of the sample paths result in uninformative
price choices at the boundary of the feasible price range, whereas under ILQX, the sample paths concentrate around
the underlying optimal price instead of the boundaries.
To prevent greedy ILS from getting stuck at the feasible price boundary, it is possible to slightly
modify this myopic policy such that, whenever its price path hits the same boundary in two
consecutive periods, the price path is pulled towards the interior of the feasible price set by a
margin δ > 0. To be precise, if the myopic price equals ℓ (resp. u) in periods t − 1 and t, then
this modified policy, denoted as δ-greedy ILS, charges the price ℓ + δ (resp. u − δ) in period t,
and otherwise, it simply charges the myopic price. This introduces a small amount of variation
in pricing decisions, and as shown in Figure 4, can significantly improve the regret performance.
But, ILQX still outperforms this modification of greedy ILS; hence, the judiciously designed price
experiments of ILQX appear to be a more efficient tool for collecting information.
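The boundary adjustment just described can be expressed in a few lines; the sketch below is an illustration of the rule, not the exact implementation used in the experiments.

```python
def delta_greedy_price(myopic_price, previous_myopic_price, lo, hi, delta):
    """delta-greedy ILS adjustment (sketch): if the myopic price hits the same boundary in two
    consecutive periods, pull the charged price into the interior by a margin delta."""
    if myopic_price == lo and previous_myopic_price == lo:
        return lo + delta
    if myopic_price == hi and previous_myopic_price == hi:
        return hi - delta
    return myopic_price
```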
Figure 4 Experimentation near the boundary of feasible prices. The dashed curves display the T -period
regret of δ-greedy ILS for δ ∈ {0.1, 0.3, 0.5}, whereas the solid and dotted curves show the T -period regret of greedy ILS
and ILQX, respectively. For the three curves displayed for δ-greedy ILS, the regret is the lowest when δ = 0.3, but this
is still higher than the regret of ILQX. The problem setting is as described in §5.1.2. The displayed regret values are
computed by averaging the realized regret over 30 sample paths.
As explained in §3, in the absence of uninformative prices, greedy ILS would collect sufficient
information from demand observations, but when uninformative prices exist, it can perform poorly,
as illustrated above. In such cases, experimentation-based pricing policies offer substantial value
relative to greedy ILS. We further illustrate this point in our analysis of a real-life data set in §5.2.
Zt =d ξ N(µ1, Σ1) + (1 − ξ) N(µ2, Σ2), (21)
where: ξ ∼ Bernoulli(0.5), and (µ1 , Σ1 ) and (µ2 , Σ2 ) are two distinct parameter sets pertaining to
the customer feature distributions in populations 1 and 2, respectively. A cross section of these
feature distributions is illustrated in Figure 5, and a detailed account of the parameter values is
given in Appendix D.
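A short sketch of a sampler for the mixture in (21) follows; the parameter sets (µ1, Σ1) and (µ2, Σ2) would be supplied as in Appendix D, and the function and variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_mixture_features(mu1, Sigma1, mu2, Sigma2, n):
    """Draw n feature vectors from the two-population mixture (21): each customer comes from
    population 1 or population 2 with probability 1/2."""
    xi = rng.integers(0, 2, size=n)                       # Bernoulli(0.5) population labels
    z1 = rng.multivariate_normal(mu1, Sigma1, size=n)
    z2 = rng.multivariate_normal(mu2, Sigma2, size=n)
    return np.where(xi[:, None] == 1, z1, z2), xi
```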
For a conservative comparison of segmentation with ILQX, we consider the following two
“segment-then-optimize” policies. For the first segment-then-optimize policy (denoted by segment),
Figure 5 Distribution of features. The figure shows two contour plots for the feature density functions belonging
to the two customer populations in §5.1.3, with each contour plot displaying the cross section of a higher-dimensional
density function in the first two dimensions. The two populations are sufficiently “separated” that segmentation is
not prone to substantial classification errors—the average misclassification rate using the k-means algorithm is 4%
for T = 5000 customers.
the seller knows which segment each arriving customer belongs to and the demand parameters with
certainty. For each customer, the segment policy chooses the price that maximizes the expected
revenue for an average individual in the current customer’s segment. This policy represents the
segmentation-based pricing used widely in practice. For the second segment-then-optimize policy
(denoted by segment-ILQX), the seller knows which segment each arriving customer belongs to
but has to learn the demand parameters. For each customer, segment-ILQX chooses the price that
maximizes the expected revenue for the individual customer, using the same price experimentation
as ILQX, but using only the data collected from the particular customer’s segment when estimating
the demand model parameters.
For a more realistic comparison, we also consider the policy segment-ILQX′ , where the seller does
not know which segment an arriving customer belongs to and has to learn the segments over time.
For segment-ILQX′ , we assume the seller has access to the features of 100 customers to start with,
so that customer segmentation can operate starting from the first period.
Figure 6 compares ILQX, which treats every customer as an individual, against segment,
segment-ILQX, and segment-ILQX′ . From this figure, we see that ILQX is noticeably better than
these segment-then-optimize policies. The policy segment exhibits a linear regret growth; this poor
performance stems from the fact that targeting the average customer in each population, however
tightly clustered that population may be, leads to biased pricing for every individual customer, and this bias does
not disappear over time. While segment-ILQX′ achieves a regret of order √T, it performs worse
than the plain version of ILQX that does not classify the customers. This observation is due to
the fact that, within each segment, segment-ILQX′ learns from a smaller data set. As such,
segment-ILQX′ would perform worse than ILQX, which learns from the entire data set, regardless of
the customer population model under consideration.
We see in Figure 6 that segment-ILQX performs better than segment-ILQX′ because its misclas-
sification rate is zero, but it still does not beat ILQX that does not know a priori the existence
of two clusters in the customer base. This observation thus rules out the possibility that the performance gap between ILQX and the segment-then-optimize policies is driven solely by classification errors.
Figure 6 Value of personalization at the individual level over customer segmentation. Panel (a) displays
the T -period regret of ILQX (dotted curve) and segment (dashed curve), which knows which population each customer
is coming from and charges the price targeting the average customer in the arriving customer’s population. Panel (b)
depicts the T -period regret of ILQX (dotted curve) and segment-ILQX′ (solid curve), which first classifies the customer
using the k-means classification algorithm and then uses ILQX within each cluster. Panel (b) also shows the T -period
regret of segment-ILQX (dashed curve), which operates similar to segment-ILQX′ except that it knows which population
each customer is coming from, thereby eliminating the classification error. The problem parameters are as described
in §5.1.3 and Appendix D. The displayed regret values are computed by averaging the realized regret over 30 sample
paths.
To gain some intuition for these results, consider a simple example of two clusters of customers
from two very different locations; say Manhattan, New York and Louisville, Colorado. While at
first glance segmenting then targeting the individual customer in each region separately (as in
segment-ILQX′ ) may appear to be the right approach, this unnecessarily reduces the number of
observations to learn from because such a pricing policy would not take advantage of learning from
customers with similar characteristics in dimensions other than location (e.g., income bracket and
credit score). We conclude that, while conceptualizing a heterogeneous customer base by segmenta-
tion may be intuitive and helpful, the optimal way to implement personalized dynamic pricing is
at the individual customer level.
5.2. Case Study II: Analyzing Real-life Data from an Auto Loan Company
In this section, we explore the validity of our policy for computing personalized lending rates for
an online auto loan company in the United States. As mentioned in §1, consumer lending is a
prominent industry in which personalization of prices (i.e., rates) is both socially acceptable and in
current practice, albeit at varying degrees of granularity. One purpose of this section is to compare
our first-order optimal policy with historical decisions made by a company. We also compare our
policy with other dynamic pricing-and-learning policies that have been proposed in the literature.
5.2.1. Description of data. We use the data set CPRM-12-001: On-Line Auto Lending pro-
vided by the Center for Pricing and Revenue Management at Columbia University. This includes
information about all auto loan applications received by a major online lender in the United States
from July 2002 through November 2004. The data set includes the date of an application made by
prospective borrowers, the type of loan they requested (term and amount) as well as some personal
information. The data set also includes whether or not the online lender approved the application,
and for the approved applications, an annual percentage rate (APR) offered, and whether or not a
contract eventuated. In contrast to Case Study I, the customers’ demand responses are binary in
this setting, indicating whether or not a loan was agreed upon.
We use the first 50,000 data entries for the case study. A summary of the data set (with descriptive
statistics on the demand and available features) is shown in Table 3 in Appendix E. Note that
the variable apply is the binary demand indicator for eventual contract, and there are 18 feature
variables, both discrete categorical (e.g., the state of residency for the applicant) and continuous
(e.g., FICO score).
5.2.2. Problem formulation. The pricing problem of the online auto loan company in our data
set is a special case of the problem formulation in §2 such that the demand is a binary variable. In
this case, we compute the price p of a loan as the net present value of future payments minus the
loan amount; i.e.,
p = Monthly Payment × Σ_{τ=1}^{Term} (1 + Rate)^{−τ} − Loan Amount,
where Rate is the monthly London interbank offered rate (LIBOR) for the studied time period.³
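The price computation above translates directly into a few lines of Python; the loan terms in the usage line below are hypothetical.

```python
def loan_price(monthly_payment, term_months, monthly_rate, loan_amount):
    """Price p of a loan as defined above: the net present value of the monthly payments,
    discounted at the monthly rate, minus the amount lent."""
    present_value = sum(monthly_payment * (1.0 + monthly_rate) ** (-tau)
                        for tau in range(1, term_months + 1))
    return present_value - loan_amount

# Illustrative (hypothetical) loan terms:
print(round(loan_price(monthly_payment=520.0, term_months=60, monthly_rate=0.0015, loan_amount=25_000.0), 2))
```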
We use the interval [0, 7500] as the set of feasible prices. To represent the customers’ choices, we
employ a logit demand model. That is, given a price p and a feature vector x, the binary variable
apply that represents the customer choice is given by
apply = 1 with probability e^{α·x+(β·x)p} / (1 + e^{α·x+(β·x)p}), and apply = 0 with probability 1 / (1 + e^{α·x+(β·x)p}), (22)
where α and β are demand parameter vectors.⁴ The model (22) is a special case of (1) such that the
variable apply is the demand realization, the expected demand is g(α · x + (β · x)p) = e^{α·x+(β·x)p} / (1 + e^{α·x+(β·x)p}) ∈
[0, 1], and the demand shock is given by ε = I{U[0,1] ≤ e^{α·x+(β·x)p} / (1 + e^{α·x+(β·x)p})} − e^{α·x+(β·x)p} / (1 + e^{α·x+(β·x)p}), where U[0,1] is
a uniform random variable in [0, 1].
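For concreteness, the choice model (22) can be coded as below (a sketch; the parameter values and the random-number generator are supplied by the caller, and the function names are illustrative).

```python
import numpy as np

def apply_probability(alpha, beta, x, p):
    """P(apply = 1) under the logit model (22)."""
    xi = alpha @ x + (beta @ x) * p
    return 1.0 / (1.0 + np.exp(-xi))

def sample_apply(alpha, beta, x, p, rng):
    """Draw the binary response; the demand shock is the deviation of this indicator from its mean."""
    return int(rng.random() <= apply_probability(alpha, beta, x, p))
```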
³ Because this net present value is calculated from the auto loan company's perspective, the relevant interest rate is the one used for financial exchanges between commercial banks.
⁴ It is possible to include higher powers of the price and continuous customer features in the demand model, which can improve the fit of the model to the data. At the same time, this also raises an overfitting concern because a very good fit to a given data set could result in poor predictions in future data. The primary goal of the case study in this section is to provide a simple yet realistic test bed to show the value of our approach, which is based on price experimentation and lasso regularization. To ensure a fair comparison, all of the studied policies and benchmarks use the same set of explanatory variables.
5.2.3. Pricing policies. We compare the following pricing policies in our data analysis:
1. The company’s actual pricing policy (denoted by company).
2. Iterated lasso logistic regression with price experimentation (denoted by ILQX, as described
in §4.2). Note that, under a logit demand model, the quasi-likelihood function in §4.2 reduces
to the log-likelihood function for logit demand response.
3. Iterated logistic regression with price experimentation (denoted by IQX). This policy is the
same as ILQX but with no lasso regularization; i.e., λt = 0 for all t.
4. Semi-clairvoyant iterated logistic regression with price experimentation (denoted by
semi-clairvoyant IQX). This policy is the same as IQX but with known sparsity structure.
5. Greedy iterated logistic regression (denoted by greedy IL). This policy uses no price experi-
mentation and is an adaptation of the greedy policy in Qiang and Bayati (2016) for the binary
response model (22). Accordingly, there is no model misspecification for this policy.
6. Greedy episodic iterated lasso logistic regression (denoted by greedy EILL) adapted for the
model (22). This is the policy in Javanmard and Nazerzadeh (2016), which uses no price
experimentation and performs parameter estimation only at the beginning of periodic episodes
of geometrically increasing length. The regularization parameter is chosen according to
Javanmard and Nazerzadeh (2016). There is no model misspecification for this policy.
7. Segmented iterated logistic regression with price experimentation (denoted by segment-X).
First, for the 50,000 customers in the data set used for fitting the population model, we use
the k-means clustering algorithm to find their optimal segmentation, as measured by the
average silhouette value.5 We then apply semi-clairvoyant IQX separately within each segment
(a schematic sketch of this segmentation step is given after this list). This represents an
optimistic result for a segmentation-based policy because, in practice, the optimal number and
boundaries of the segments would need to be learned on the fly.
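The segmentation step of segment-X can be sketched as follows, assuming scikit-learn's KMeans and silhouette_score; the candidate range of cluster counts is an illustrative assumption.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def best_kmeans_segmentation(features, k_range=range(2, 11), seed=0):
        """Pick the number of segments that maximizes the average silhouette value."""
        best_k, best_score, best_labels = None, -1.0, None
        for k in k_range:
            labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(features)
            score = silhouette_score(features, labels)   # average silhouette value
            if score > best_score:
                best_k, best_score, best_labels = k, score, labels
        return best_k, best_labels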
We apply each of the policies above to customers who arrived at the company starting from
July 2002, and compute each policy's expected regret based on a population model fitted to the
first 50,000 observations of the data set. The first 550 observations were used for a burn-in stage,
which was sufficient for the feature design matrix to be positive definite, as assumed by the above
policies. We note that specifying a generative model is necessary for backtesting pricing policies
because customer responses depend on the prices selected. This is in contrast to supervised learning
problems such as prediction, where an isolated test data set can be used to evaluate a policy
without specifying a generative model.
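A minimal sketch of such a backtesting loop is given below, reusing the simulate_apply sketch from §5.2.2; policy.choose_price, policy.update, and the price grid used for the clairvoyant benchmark are hypothetical interfaces introduced only for illustration.

    import numpy as np

    def backtest(policy, feature_path, alpha, beta, price_grid, rng):
        """Evaluate a pricing policy on a path of feature vectors, using the fitted
        population model as the generative model; returns cumulative regret."""
        cum_regret = 0.0
        for t, x in enumerate(feature_path, start=1):
            p = policy.choose_price(t, x)                        # policy's price
            def exp_rev(price):
                # expected revenue p * g(alpha.x + (beta.x)p) under the population model
                idx = alpha @ x + (beta @ x) * price
                return price / (1.0 + np.exp(-idx))
            p_star = max(price_grid, key=exp_rev)                # clairvoyant price (grid search)
            cum_regret += exp_rev(p_star) - exp_rev(p)           # per-period regret
            apply, _, _ = simulate_apply(x, p, alpha, beta, rng) # simulated customer response
            policy.update(x, p, apply)                           # learning step
        return cum_regret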
We set the population model to be the best model found by logistic regression using the backward
elimination method. That is, we fit a logistic regression model for the variable apply (the binary
5. The silhouette value is used to evaluate cluster analysis of a data set. The silhouette value for each point is a measure of how similar that point is to points in its own cluster, when compared to points in other clusters. See https://www.mathworks.com/help/stats/silhouette.html for further details on calculating the silhouette value.
dependent variable for customer choices) using all 18 available features described in Table 3, and
then progressively remove any features that are not statistically significant at the 1% significance
level. The model fitted from this backward elimination method is given by
where the feature vectors x reside in R19 such that x1 = 1 (to allow for an intercept) and x2, . . . , x19
correspond to the 18 features described in Table 3, normalized by their empirical means for numerical
stability.6 In the model fitted above, the two key terms in the expected demand function
g(α · x + (β · x)p), namely α · x and (β · x)p, are of comparable order. Note that, because
the prices are expressed in $1000s, the value of β above is of smaller magnitude relative to α, and
this difference in the order of magnitude disappears when β is multiplied by the price p.
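A sketch of this backward-elimination fit is given below, assuming the statsmodels package; the 1% threshold matches the text, and the handling of the intercept column (always retained) is a simplifying assumption.

    import numpy as np
    import statsmodels.api as sm

    def backward_eliminate(X, y, alpha_level=0.01):
        """Fit a logistic regression and drop insignificant features one at a time.

        X : design matrix whose first column is the intercept (all ones)
        y : binary apply variable
        """
        cols = list(range(X.shape[1]))
        while True:
            result = sm.Logit(y, X[:, cols]).fit(disp=0)
            if len(cols) == 1:                       # only the intercept remains
                return cols, result
            pvals = result.pvalues
            # ignore the intercept (position 0) when searching for the worst feature
            worst = int(np.argmax(pvals[1:])) + 1
            if pvals[worst] <= alpha_level:          # all remaining features significant
                return cols, result
            del cols[worst]                          # remove the least significant feature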
5.2.4. Results and discussion. Figure 7 shows the regret performance of ILQX against the
benchmark policies described in §5.2.3. Note that $\lambda_{t+1} = \tilde c\, t^{1/4}\sqrt{\log d + \log t}$ for ILQX, as dictated
by Theorem 3, where c̃ = 0.005 is chosen from a grid search. (We revisit and further discuss the
choice of c̃ below.) The experimental prices are arbitrarily chosen as m1 = $391.1 and m2 = $6960.
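Under the logit model, the lasso quasi-likelihood step of ILQX reduces to an ℓ1-penalized logistic regression on the experimentation-period data. The sketch below is one way to carry out this step, assuming scikit-learn; mapping λ to scikit-learn's C = 1/λ is our own parameterization choice, and the design matrix stacks (x, p·x) so that the fitted coefficients correspond to (α, β).

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def ilqx_lambda(t, d, c_tilde=0.005):
        """Regularization parameter from Theorem 3: c * t^(1/4) * sqrt(log d + log t)."""
        return c_tilde * t ** 0.25 * np.sqrt(np.log(d) + np.log(t))

    def ilqx_estimate(X_exp, p_exp, y_exp, t, d, c_tilde=0.005):
        """Lasso-penalized logit fit on experimentation-period data (features, prices, applies)."""
        U = np.hstack([X_exp, p_exp[:, None] * X_exp])   # u_k = (x_k, p_k * x_k)
        lam = ilqx_lambda(t, d, c_tilde)
        model = LogisticRegression(penalty="l1", C=1.0 / lam,
                                   solver="liblinear", fit_intercept=False)
        model.fit(U, y_exp)
        theta_hat = model.coef_.ravel()                  # concatenated (alpha_hat, beta_hat)
        return theta_hat[: X_exp.shape[1]], theta_hat[X_exp.shape[1]:]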
In Figure 7(a)-(b), we observe that our first-order optimal policy, ILQX, is vastly better than the
company's policy, and its performance is similar to that of semi-clairvoyant IQX, which knows which
features are non-zero. The importance of lasso regularization is also evident from the substantially
larger regret incurred by IQX, which is ILQX without regularization. In sum, we conclude that
ILQX is a fast-learning policy with sublinear growth of regret. (We remark that the relative noise
in the plots compared to Case Study I is due to the fact that Figure 7 shows a single sample path
rather than an average over many simulation paths.)
Figure 7(c) compares ILQX with two greedy policies, greedy IL of Qiang and Bayati (2016) and
greedy EILL of Javanmard and Nazerzadeh (2016). Because our general problem formulation entails
potentially uninformative prices, the greedy policies perform strikingly worse, even worse than the
company policy, demonstrating the importance of price experimentation. We further investigate
the performance of greedy policies in the following subsection.
Finally, Figure 7(d) shows the value of personalization at the individual level and the subopti-
mality of customer segmentation in dynamic pricing, as observed in Case Study I. Note that the
performance of segment-X is an optimistic lower bound on the regret of segmentation policies used
in practice, as segment-X knows the optimal segmentation for the customers from the beginning.
6. This is a standard data processing step to put the different features on a similar scale. Note that the dependent variable apply is also normalized in this way.
Figure 7 Performance of ILQX against various benchmark policies. In all panels, the dotted curves show the T-period regret of ILQX. On top of these curves, panel (a) shows the T-period regret of company (solid curve), panel (b) shows the T-period regret of IQX (dashed curve) and semi-clairvoyant IQX (solid curve), panel (c) shows the T-period regret of greedy IL (solid curve) and greedy EILL (dashed curve), and panel (d) shows the T-period regret of segment-X (dashed curve). Problem parameters are calibrated from a real-life data set as described in §5.2; the displayed regret values are based on the sample path of customer feature vectors observed in the data set.
We summarize all results in Figure 8, which displays the expected cumulative revenues for the
first 20,000 customers (who arrive over approximately six months) under the five policies in this
section. The figure shows that ILQX accumulates the highest expected revenue (about $600,000),
which is 47% higher than that of the company policy.
Figure 8 Expected revenue of ILQX against various benchmark policies. The graph shows the T-period expected revenue (in USD) of ILQX, segment-X, company, greedy IL, and greedy EILL. Problem parameters are calibrated from a real-life data set as described in §5.2; the displayed revenue values are based on the sample path of customer feature vectors observed in the data set.
5.2.5. More on the value of price experimentation. The reason for greedy poli-
cies performing poorly in our setting but near-optimally in Qiang and Bayati (2016) and
Javanmard and Nazerzadeh (2016) is that both papers consider population demand models with
constant price sensitivity that does not depend on customer features Xt and consequently all prices
are informative in their settings. Specifically, in both papers, β is effectively one-dimensional, with
our setting retrieving theirs by letting β = 1 and β2 = · · · = βd = 0. As a consequence of this mod-
eling choice, the settings of Qiang and Bayati (2016) and Javanmard and Nazerzadeh (2016) do
not entail uninformative prices, and accordingly, greedy policies exhibit good performance. As
explained earlier in §3, this does not necessarily happen in our general setting, thereby indicating
that price experimentation is crucial in general.
To see this point, we plot in Figure 9 the evolution of the prices charged by ILQX and the two
greedy policies, greedy IL and greedy EILL, where the population model and the sequence of feature
vectors are as in §§5.2.1-5.2.4. From Figure 9, we observe that the price paths of greedy IL and
greedy EILL eventually converge to u, the upper boundary of the feasible price interval, thereby
limiting the cumulative price variation and the amount of information collected. By contrast, ILQX
employs explicit price experiments to facilitate learning, and consequently, is able to charge prices
closer to the optimal price. The greedy policies’ convergence to the uninformative upper boundary
of feasible prices is similar to the phenomenon we observed earlier in §5.1.2 for the linear demand
model. This indicates that the personalized pricing problem for the binary demand model also
requires a small amount of judicious price experimentation.
Figure 9 Price paths of ILQX and greedy policies. The solid curves show the sequence of prices {pt} charged by ILQX, greedy IL, and greedy EILL, and the dotted curve shows the optimal price sequence. The population model and the sequence of feature vectors are as in §§5.2.1-5.2.4.
To further investigate this point, we generate additional results on 30 new customer feature
paths for 20,000 periods by bootstrapping with replacement. We display in Figure 10 the evolution
of prices charged by greedy IL and greedy EILL in this setting. For greedy IL, 24 out of 30 paths
converge to u by period 20,000, and for greedy EILL, 23 out of 30 paths. These observations highlight
the necessity of price experimentation in personalized pricing practice; greedy policies that do not
experiment with prices periodically (to steer the price sequence away from uninformative prices)
can perform very poorly.7
7. As in §5.1.2, it is possible to consider slight modifications of greedy policies such that prices are chosen myopically
unless the price path hits the same boundary in two consecutive periods, in which case the price path is pulled towards
the interior of the feasible set by a margin δ > 0 (see §5.1.2 for a precise description). For the sample path of customer
feature vectors in our real-life data set, we observe that the regret performance of such slightly modified greedy
policies is very close to the regret performance of plain (unmodified) greedy policies. For details, see Appendix F.
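Returning to the bootstrap experiment above, a minimal sketch of how such resampled feature paths can be generated and checked is given below; run_greedy_policy is a hypothetical interface standing in for greedy IL or greedy EILL, and the tolerance is an assumption.

    import numpy as np

    def bootstrap_boundary_count(features, run_greedy_policy, u, n_paths=30,
                                 horizon=20000, tol=1e-6, seed=0):
        """Resample feature paths with replacement and count paths whose final
        greedy price has converged to the upper boundary u."""
        rng = np.random.default_rng(seed)
        n_at_boundary = 0
        for _ in range(n_paths):
            idx = rng.integers(0, len(features), size=horizon)   # bootstrap indices
            path = features[idx]                                 # resampled feature path
            prices = run_greedy_policy(path)                     # price sequence {p_t}
            if abs(prices[-1] - u) < tol:
                n_at_boundary += 1
        return n_at_boundary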
Figure 10 Price evolution under greedy IL and greedy EILL. Panels (a) and (b) show sample paths of the price sequence {pt} and the histogram of the price in period 20,000, respectively, generated under greedy IL. Panels (c) and (d) depict sample paths of the price sequence {pt} and the histogram of the price in period 20,000, respectively, generated under greedy EILL. Problem parameters are calibrated from a real-life data set as described in §5.2, and the sample paths of customer feature vectors are bootstrapped from the data set with replacement.
Figure 11 The effect of the regularization parameter λt on the performance of ILQX. Panels (a) and (b) show the T-period regret of ILQX for different choices of the regularization parameter sequence {λt}. In panel (a), $\lambda_{t+1} = \tilde c\, t^{1/4}\sqrt{\log d + \log t}$ for all t, and different values of the constant c̃ are considered. In panel (b), different (suboptimal) growth rates of λt are considered; the solid and dashed curves correspond to the following cases: (i) $\lambda_{t+1} = 0.005\log d$ for all t; (ii) $\lambda_{t+1} = 0.005(\log d + \log t)$ for all t; and (iii) $\lambda_{t+1} = 0.005\,t^{1/4}(\log d + \log t)^{1/4}$ for all t. On the other hand, the dotted curve corresponds to the near-optimal growth rate of λt: $\lambda_{t+1} = 0.005\,t^{1/4}\sqrt{\log d + \log t}$ for all t. The rest of the problem parameters are calibrated from a real-life data set as described in §5.2; the displayed regret values are based on the sample path of customer feature vectors observed in the data set.
In Figure 11(a), we consider different values for the constant c̃ in the near-optimal choice of λt in
Theorem 3. We find that, for this case study, c̃ on the order of 10⁻³ yields the best results, and
that, within the same order of magnitude, the expected regret is not too sensitive to the exact
value of the constant. To demonstrate the importance of using the correct growth rate for λt, we
compute the performance of ILQX under three suboptimal choices for λt. In the first choice, we
have $\lambda_{t+1} = 0.005\log d$ for all t; in the second one, $\lambda_{t+1} = 0.005(\log d + \log t)$ for all t; and in the
third one, $\lambda_{t+1} = 0.005\,t^{1/4}(\log d + \log t)^{1/4}$ for all t. In Figure 11(b), we observe that the growth
rate of λt is crucial, with the expected regret being sensitive to this choice.
The surprisingly good performance of ILQX indicates that, for this data set, (i) the three clusters are
not particularly well-separated, and (ii) the demand responses of the different clusters are not too
different from each other, which makes learning and earning with a single, overarching demand model
for the entire customer base sufficient. The slightly better performance of ILQX over segment-ILQX
is likely due to the fact that ILQX works with a greater sample size than segment-ILQX.
Is it then possible to find a generative model where ILQX does not perform well? We are able to
find such a model by trial and error, by adding progressively larger random perturbations to the
coefficients of the demand models of each cluster. With sufficiently large perturbations, we find a
model in which segment-ILQX performs the best with sublinear regret growth, whereas ILQX and
segment perform poorly. The computational results for this model are shown in Figure 12(b), and
the details of this model are given in Appendix G.
Figure 12 Impact of model misspecification. Panel (a) shows the T-period regret of three policies, namely, ILQX (dotted curve), segment (solid curve), and segment-ILQX (dashed curve) in the three-segment model derived from the real-life data set described in §5.2.1. Panel (b) shows the T-period regret of the same three policies in the same model with added random perturbations to demand coefficients. Problem parameters are calibrated from the aforementioned real-life data set as described in §6 and Appendix G. The displayed regret values are computed by averaging the realized regret over 30 sample paths of customer feature vectors that are bootstrapped from the data set with replacement.
We thus conclude that, although demand models that cannot be captured by the generalized
linear model (1) do exist, ILQX is robust to this type of misspecification for the real-life data set
CPRM-12-001: On-Line Auto Lending.
7. Concluding Remarks
In this paper, we investigate personalized dynamic pricing with demand learning and information
about individual customer characteristics. After establishing that any admissible policy must incur
expected regret growing at a rate at least of order $s\sqrt{T}$, we propose price experimentation policies
and prove their near-optimality. Of particular note is the analysis of a seller who observes a possibly
large number of customer features but can nevertheless earn near-optimal profit by employing
the lasso-regularized price experimentation policy, ILQX. Our extensive data analysis validates
the theoretical performance bounds, and firmly establishes the practical value of our personalized
policies over other intuitive and/or widely-practiced policies, such as unregularized pricing policies
that do not account for potential sparsity in the demand model, myopic pricing policies that do
not experiment with prices to facilitate learning, and segmentation policies that do not utilize
across-segment information.
We believe the feature-based modeling approach and techniques used in this paper can be trans-
ferred to analyze personalization in other operational decision problems. These include assortment
optimization, joint pricing-inventory management, and healthcare, which are all rapidly gaining
attention for the potential value of personalization.
Acknowledgments
The authors thank the Department Editor Prof. Noah Gans, the associate editor, and three referees for their
helpful comments that improved the presentation and structuring of the paper.
By elementary analysis, (23) implies that $H_t$ has the following Fisher information matrix under
any given admissible policy π ∈ Π:
$$
I_t^{\pi}(x^T) := \mathbb{E}_\theta^\pi\!\left[\frac{\partial \log \ell_t(H_t,\theta,x^T)}{\partial\theta}\,\frac{\partial \log \ell_t(H_t,\theta,x^T)}{\partial\theta}^{T}\right] = \zeta(\phi)\, \mathbb{E}_\theta^\pi[J_t(x^T)], \tag{24}
$$
where $\zeta(\phi) = \mathbb{E}_\theta^\pi[\phi\cdot\nabla T(\varepsilon_1) + B'(\varepsilon_1)]$, $\nabla T(\xi) = \big(\frac{\partial}{\partial\xi}T_1(\xi), \frac{\partial}{\partial\xi}T_2(\xi), \ldots, \frac{\partial}{\partial\xi}T_n(\xi)\big)$ and $B'(\xi) = \frac{\partial}{\partial\xi}B(\xi)$ for all ξ, and $J_t(x^T)$ is the empirical Fisher information matrix given by
$$
J_t(x^T) = \begin{bmatrix}\sum_{k=1}^{t}x_kx_k^T & \sum_{k=1}^{t}p_kx_kx_k^T\\[0.5ex] \sum_{k=1}^{t}p_kx_kx_k^T & \sum_{k=1}^{t}p_k^2x_kx_k^T\end{bmatrix} = \sum_{k=1}^{t}\begin{bmatrix}1\\p_k\end{bmatrix}\begin{bmatrix}1\\p_k\end{bmatrix}^{T}\otimes x_kx_k^T,
$$
and ⊗ denotes the Kronecker product of matrices. In the remainder of the proof, we consider two
cases:
Case 1: d + 1 ≥ T . In this case, we use the following lemma.
Lemma A.1. (lower bound on cumulative pricing error) There exist finite positive constants $c_0$ and $c_1$ such that
$$
\sup_{\theta\in\Theta}\,\mathbb{E}_X\!\left\{\sum_{t=2}^{T}\mathbb{E}_\theta^\pi\big[(p_t-\varphi(\theta,X_t))^2\big]\right\} \;\ge\; \sum_{t=2}^{T}\frac{c_0}{c_1 + \sup_{\theta\in\Theta}\mathbb{E}_X\!\left\{C_t(\theta,X^t)\,\mathbb{E}_\theta^\pi[J_{t-1}(X^{t-1})]\,C_t(\theta,X^t)^T\right\}}.
$$
Because $0 < \ell \le \varphi(\theta,x)$ for all θ and x, the numerator of the right-hand side of the preceding
inequality is greater than or equal to $\ell^2/[4\beta_{\max}^2(\max\{1,z_{\max}\})^2]$, where $\beta_{\max} = \max_{(\alpha,\beta)\in\Theta}\{\|\beta\|\}$.
Thus, letting $c_0 = \ell^2/[4\zeta(\phi)\beta_{\max}^2(\max\{1,z_{\max}\})^2]$ and $c_1 = \tilde I(\mu)/\zeta(\phi)$, we arrive at the desired result.
Q.E.D.
For each k = 1, . . . , t, the constants $\{\gamma_{k\ell},\ \ell=1,\ldots,t\}$ in Lemma A.1 are found by solving the
following system of linear equations:
$$
X_t^T X_t\,\gamma_k = e_k. \tag{26}
$$
$$
= \sum_{k'=1}^{t}\sum_{k=1}^{t-1}\big\{\varphi(\theta,x_{k'})\varphi(\theta,x_k)\,v_{k'}^Tx_k - 2p_k\,\varphi(\theta,x_{k'})\,v_{k'}^Tx_k + p_k^2\,v_{k'}^Tx_k\big\} \stackrel{(b)}{=} \sum_{k=1}^{t-1}\{p_k-\varphi(\theta,x_k)\}^2,
$$
where: (a) and (b) follow because, by construction, $v_{k'}^Tx_k = 0$ unless $k = k'$, in which case $v_k^Tx_k = 1$. Thus,
$$
C_t(\theta,x^t)\,\mathbb{E}_\theta^\pi[J_{t-1}(x^t)]\,C_t(\theta,x^t)^T = \mathbb{E}_\theta^\pi\big[C_t(\theta,x^t)J_{t-1}(x^t)C_t(\theta,x^t)^T\big] = \sum_{k=1}^{t-1}\mathbb{E}_\theta^\pi\big[\{p_k-\varphi(\theta,x_k)\}^2\big].
$$
Consequently, we have
$$
\begin{aligned}
\Delta^\pi(T) &= \sup_{\theta\in\Theta}\,\mathbb{E}_X\!\left\{\sum_{t=1}^{T}\mathbb{E}_\theta^\pi\big[-(\beta^TX_t)(p_t-\varphi(\theta,X_t))^2\big]\right\}\\
&\stackrel{(c)}{\ge} |\beta_{\min}|\,\sup_{\theta\in\Theta}\,\mathbb{E}_X\!\left\{\sum_{t=1}^{T}\|X_{S,t}\|_1\,\mathbb{E}_\theta^\pi\big[(p_t-\varphi(\theta,X_t))^2\big]\right\}\\
&\ge |\beta_{\min}|\,\sup_{\theta\in\Theta}\,\mathbb{E}_X\!\left\{\sum_{t=1}^{T}X_{\min}\,\mathbb{E}_\theta^\pi\big[(p_t-\varphi(\theta,X_t))^2\big]\right\},
\end{aligned}
$$
where: $\beta_{\min} = \min_{(\alpha,\beta)\in\Theta}\{\|\beta\|\}$; $\|X_{S,t}\|_1 = \sum_{i=1}^{s}|X_{S,t}^i|$ is the ℓ1-norm of the compressed feature
vector $X_{S,t}$; $X_{S,t}^i$ is the i-th component of $X_{S,t}$ for i = 1, . . . , s; $X_{\min} := \min\{\|X_{S,1}\|_1,\ldots,\|X_{S,T}\|_1\}$;
and (c) follows because $\tilde\beta = |\beta_{\min}|\,[\operatorname{sgn}X_{S,t}^1,\ldots,\operatorname{sgn}X_{S,t}^s]$ is a feasible solution to the supremum
problem in the first line. Now, since no component of $X_{S,t}$ is almost surely zero, there is a positive
constant $c_{\min} = \min_{i\in\{1,\ldots,s\}}\{\mathbb{E}|X_{S,t}^i|\}$. Then, $X_{\min} \ge c_{\min}s$, and we get
$$
\Delta^\pi(T) \ge |\beta_{\min}|\,c_{\min}\,s\,\sup_{\theta\in\Theta}\,\mathbb{E}_X\!\left\{\sum_{t=1}^{T}\mathbb{E}_\theta^\pi\big[(p_t-\varphi(\theta,X_t))^2\big]\right\}.
$$
Combining the above with Lemma A.1, we can lower bound the worst-case regret by
$$
\Delta^\pi(T) \ge \beta_{\min}^2\,c_{\min}^2\,s^2\sum_{t=2}^{T}\frac{c_0}{c_1\,|\beta_{\min}|\,c_{\min}\,s + \Delta^\pi(t-1)}.
$$
Letting $K_1 = c_0\beta_{\min}^2$ and $K_2 = c_1|\beta_{\min}|$, we further obtain the following:
$$
\Delta^\pi(T) \stackrel{(d)}{\ge} \frac{K_1c_{\min}^2s^2(T-1)}{K_2c_{\min}s + \Delta^\pi(T)} \stackrel{(e)}{\ge} \frac{K_1c_{\min}^2s^2T}{2\Delta^\pi(T)\big(K_2c_{\min}s/\Delta^\pi(T) + 1\big)},
$$
where: (d) follows because $\Delta^\pi(T) \ge \Delta^\pi(t-1)$ for $t\in\{1,\ldots,T\}$, and (e) follows because $T \ge 2$.
Thus, letting $K_3 = \frac{K_2}{|\beta_{\min}|(u-\ell)^2/4} + 1$, we get
$$
\Delta^\pi(T) \ge \left(\frac{K_1}{2K_3}\right)^{1/2}c_{\min}\,s\,\sqrt{T}.
$$
Case 2: d + 1 < T. In this case, the t systems of linear equations (26) may become inconsistent by
the Rouché-Capelli theorem, because the right-hand side of (26) spans the entire $\mathbb R^t$ space while the
rank of $X_t^TX_t$ may be less than t. To avoid such inconsistencies, we consider instead augmented
feature vectors $\tilde x_k \in \mathbb R^T$, where the first d + 1 elements of $\tilde x_k$ equal $x_k$ and the rest are determined
by the requirement that $\tilde X_t = [\tilde x_1,\ldots,\tilde x_t]$ be of rank t. With this augmentation, the proof in this
case follows by the same arguments as in the preceding case. This concludes the proof when the
components of Xt are continuous random variables.
Finally, if some components of Xt are discrete random variables, we can take the conditional
expectation over all possible realizations of the discrete components first, then apply (25) for each
realization. To illustrate, let D denote the set of all realizations of the discrete components of Xt;
e.g., if $X_t\in\mathbb R^3$, with $X_t^1 = 1$ almost surely, $X_t^2 = \pm 1/2$ with probability 1/2 (half male, half female),
and $X_t^3$ a continuous random variable, then D = {[1, 1/2], [1, −1/2]}. For d ∈ D, let $X_t^C(d)$ denote
the conditioned random variable where the discrete components of Xt are set to the values in d.
Then, we have
$$
\mathbb{E}_{\mu,X}\,\mathbb{E}_\theta^\pi\big[(p_t-\varphi(\theta,X_t))^2\big] = \sum_{d\in D}\mathbb{P}_X\{X_t = X_t^C(d)\}\,\mathbb{E}_{\mu,X^C}\,\mathbb{E}_\theta^\pi\big[(p_t-\varphi(\theta,X_t))^2\mid X_t = X_t^C(d)\big],
$$
where $\mathbb{E}_{\mu,X^C}$ denotes taking expectation over µ and the reduced feature vector that only
contains the continuous components. Applying the multivariate van Trees inequality on
$\mathbb{E}_{\mu,X^C}\{\mathbb{E}_\theta^\pi[(p_t-\varphi(\theta,X_t))^2\mid X_t = X_t^C(d)]\}$ for each d ∈ D, we arrive at the same conclusion
as before by following the same proof arguments for the conditional regret $\Delta^{\pi,C}(T) := \sum_{t=1}^{T}\mathbb{E}_{X^C,\theta}^\pi\big[-(\beta^TX_t)(p_t-\varphi(\theta,X_t))^2\mid X_t = X_t^C(d)\big]$ for each d ∈ D. Q.E.D.
where $\chi_k = \mathbb{I}\{k\in\mathcal M\}$. By Lemma 2 in Keskin and Zeevi (2014), we deduce that there exists a finite
and positive constant $\tilde\gamma_1$ such that the smallest eigenvalue of $\sum_{k=1}^{t}\chi_k\big[\begin{smallmatrix}1 & p_k\\ p_k & p_k^2\end{smallmatrix}\big]$ is greater than or
where: (b) follows by the Rayleigh-Ritz theorem, (c) follows because $\|y\| = 1$, and (d) follows
because $\xi + (1-\xi)\mu(\Sigma_{Z,S})$ is linear in ξ. Thus, (27) implies that
$$
\mu_{\min}\!\left(\sum_{k=1}^{t}\chi_k\,\mathbb{E}_{X,\theta_S}^\pi\big[U_{S,k}U_{S,k}^T\mid\mathcal G_{k-1}\big]\right) \ge \tilde\gamma_1\min\{1,\mu_{\min}(\Sigma_{Z,S})\}\,J_t. \tag{28}
$$
To conclude the proof, we use Theorem 3.1 in Tropp (2011). Note that the maximum eigenvalue
of $\chi_kU_{S,k}U_{S,k}^T$ is bounded above by $\operatorname{tr}(\chi_kU_{S,k}U_{S,k}^T) = \chi_kU_{S,k}^TU_{S,k} = \chi_k(1+p_k^2)(1+Z_{k1}^2+\cdots+Z_{kd}^2) \le (1+u^2)(1+z_{\max}^2)$. Therefore, by Theorem 3.1 in Tropp (2011), (28) implies that
$$
\mathbb{P}_{X,\theta_S}^\pi\!\left\{\mu_{\min}\!\left(\sum_{k=1}^{t}\chi_kU_{S,k}U_{S,k}^T\right) \le \tfrac{1}{2}\,\tilde\gamma_1\min\{1,\mu_{\min}(\Sigma_{Z,S})\}\,J_t\right\} \le 2s\,e^{-\rho_1J_t},
$$
where $\rho_1 = \tfrac{1}{2}(1-\log 2)\,\tilde\gamma_1\min\{1,\mu_{\min}(\Sigma_{Z,S})\}/[(1+u^2)(1+z_{\max}^2)] > 0$. Finally, letting $\gamma_1 = \tfrac{1}{2}\tilde\gamma_1\min\{1,\mu_{\min}(\Sigma_{Z,S})\}$ and $\kappa_1 = 2$, we obtain the desired result. Q.E.D.
Proof of Lemma 2. Fix π = ILSX(m1, m2), and let ρ > 0. We first note that, if s and t satisfy
$\sqrt{t}/\log t < \rho s$, then $(\rho s\log t)/\sqrt{t} > 1$ and (16) trivially holds by choosing $\kappa_2 \ge \rho$. Thus, in the
remainder of the proof, we consider the case where s and t satisfy $\sqrt{t}/\log t \ge \rho s$. Define $\mathcal F_{S,k} = \sigma\big(p_1,\ldots,p_k,\varepsilon_1,\ldots,\varepsilon_k,X_{S,1},\ldots,X_{S,k+1}\big)$ for k = 1, 2, . . . , and $\delta_{s,t} = \sqrt{(\rho s\log t)/\sqrt{t}} \in [0,1]$. Letting
$v_{s,t} = \delta_{s,t}\,\tilde J_{S,t}^{-1}M_{S,t}/\|\tilde J_{S,t}^{-1}M_{S,t}\|$ for all t = 1, 2, . . . , we note that
$$
\mathbb{P}_{X,\theta_S}^\pi\{\|\tilde J_{S,t}^{-1}M_{S,t}\|^2 > \delta_{s,t}^2\} \le \mathbb{P}_{X,\theta_S}^\pi\{v_{s,t}^TM_{S,t} \ge v_{s,t}^T\tilde J_{S,t}v_{s,t}\}. \tag{29}
$$
Recalling that $\mathbb{E}_{X,\theta_S}^\pi[\varepsilon_k\mid\mathcal F_{S,k-1}] = 0$ and $\mathbb{E}_{X,\theta_S}^\pi[\varepsilon_k^2\mid\mathcal F_{S,k-1}] \le \sigma_0^2$ for k ≥ 2, and $\mathbb{E}_{X,\theta_S}^\pi[e^{\eta\varepsilon_k}\mid\mathcal F_{S,k-1}] < \infty$
for all η satisfying |η| ≤ η0 and k ≥ 2, we deduce that $\mathbb{E}_{X,\theta_S}^\pi[e^{\eta\varepsilon_k}\mid\mathcal F_{S,k-1}] = 1 + \eta\,\mathbb{E}_{X,\theta_S}^\pi[\varepsilon_k\mid\mathcal F_{S,k-1}] + \sum_{q=2}^{\infty}\eta^q\,\mathbb{E}_{X,\theta_S}^\pi[\varepsilon_k^q\mid\mathcal F_{S,k-1}]/q! \le 1 + \mu(\eta)\sigma_0^2\eta^2 \le e^{\mu(\eta)\sigma_0^2\eta^2}$ for all η satisfying |η| ≤ η0 and k ≥ 2, where
$\mu(\eta) = \sum_{q=2}^{\infty}\eta^{q-2}\,\mathbb{E}_{X,\theta_S}^\pi[\varepsilon_k^q\mid\mathcal F_{S,k-1}]/(q!\,\sigma_0^2)$. Therefore, $\mathbb{E}_{X,\theta_S}^\pi[e^{\eta\varepsilon_k}\mid\mathcal F_{S,k-1}] \le e^{\mu_0\sigma_0^2\eta^2}$ for all η satisfying
|η| ≤ η0 and k ≥ 2, where $\mu_0 = \max\{\mu(\eta): |\eta|\le\eta_0\}$. Letting $\psi = \min\big\{\tfrac{1}{2\mu_0\sigma_0^2},\ \tfrac{\eta_0}{[(1+u^2)(1+z_{\max}^2)]^{1/2}}\big\}$, we
obtain from (29) that
$$
\mathbb{P}_{X,\theta_S}^\pi\{\|\tilde J_{S,t}^{-1}M_{S,t}\|^2 > \delta_{s,t}^2\} \le \mathbb{P}_{X,\theta_S}^\pi\big\{e^{\psi v_{s,t}^TM_{S,t} - \frac{1}{2}\psi v_{s,t}^T\tilde J_{S,t}v_{s,t}} \ge e^{\frac{1}{2}\psi v_{s,t}^T\tilde J_{S,t}v_{s,t}}\big\}. \tag{30}
$$
Now, let $\mathcal B_{s,t} = \{v\in\mathbb R^{2s}: \|v\| = \delta_{s,t}\}$. For all $v\in\mathcal B_{s,t}$, define $\{M_k^v, k = 1, 2, \ldots\}$ such that $M_1^v = 1$,
and $M_k^v = e^{\psi v^TM_{S,k} - \frac{1}{2}\psi v^T\tilde J_{S,k}v}$ for k ≥ 2. Based on this, we re-express (30) as
$$
\mathbb{P}_{X,\theta_S}^\pi\{\|\tilde J_{S,t}^{-1}M_{S,t}\|^2 > \delta_{s,t}^2\} \le \mathbb{P}_{X,\theta_S}^\pi\big\{M_t^{v_{s,t}} \ge e^{\frac{1}{2}\psi v_{s,t}^T\tilde J_{S,t}v_{s,t}}\big\} \stackrel{(a)}{\le} \mathbb{P}_{X,\theta_S}^\pi\big\{M_t^v \ge e^{\frac{1}{2}\psi v^T\tilde J_{S,t}v} \text{ for some } v\in\mathcal B_{s,t}\big\}, \tag{31}
$$
where (a) follows because $v_{s,t}\in\mathcal B_{s,t}$. Consequently, letting $\mu_1 = \tfrac{1}{8}(m_1-m_2)^2\gamma_1 > 0$ and $A_1 = \{\mu_{\min}(\tilde J_{S,t}) \ge \mu_1\sqrt{t}\}$, we deduce from (31) that
$$
\begin{aligned}
\mathbb{P}_{X,\theta_S}^\pi\{\|\tilde J_{S,t}^{-1}M_{S,t}\|^2 > \delta_{s,t}^2\}
&\stackrel{(b)}{\le} \mathbb{P}_{X,\theta_S}^\pi\big\{M_t^v \ge e^{\frac{1}{2}\psi v^T\tilde J_{S,t}v} \text{ for some } v\in\mathcal B_{s,t},\ A_1\big\} + \mathbb{P}_{X,\theta_S}^\pi\{A_1^c\}\\
&\stackrel{(c)}{\le} \mathbb{P}_{X,\theta_S}^\pi\big\{M_t^v \ge e^{\frac{1}{2}\psi\mu_1\delta_{s,t}^2\sqrt{t}} \text{ for some } v\in\mathcal B_{s,t}\big\} + \mathbb{P}_{X,\theta_S}^\pi\{A_1^c\}\\
&\stackrel{(d)}{=} \mathbb{P}_{X,\theta_S}^\pi\big\{M_t^v \ge e^{\frac{1}{2}\psi\mu_1\rho s\log t} \text{ for some } v\in\mathcal B_{s,t}\big\} + \mathbb{P}_{X,\theta_S}^\pi\{A_1^c\}\\
&\stackrel{(e)}{\le} \mathbb{P}_{X,\theta_S}^\pi\big\{M_t^v \ge e^{\frac{1}{2}\psi\mu_1\rho s\log t} \text{ for some } v\in\mathcal B_{s,t}\big\} + \kappa_1s\,e^{-\tilde\rho_1\sqrt{t}},
\end{aligned}\tag{32}
$$
where: $\tilde\rho_1 = \tfrac{1}{8}(m_1-m_2)^2\rho_1 > 0$; (b) follows by the law of total probability; (c) follows because
$\mu_{\min}(\tilde J_{S,t}) \ge \mu_1\sqrt{t}$ on $A_1$, and $\|v_{s,t}\| = \delta_{s,t}$; (d) follows because $\delta_{s,t}^2\sqrt{t} = \rho s\log t$; and (e) follows by
Lemma 1 and (15). Our next goal is to apply the union bound on the first term of the right-hand
side of (32). For that purpose, define
$$
\tilde{\mathcal B}_{s,t} = \big\{-\delta_{s,t} + i/\sqrt{t}: i = 0, 1, \ldots, 2\delta_{s,t}\sqrt{t}\big\}^{2s} \cap \{v\in\mathbb R^{2s}: \|v\|\le\delta_{s,t}\}.
$$
Note that $\tilde{\mathcal B}_{s,t}$ is a discrete grid over the 2s-dimensional ball of radius $\delta_{s,t}$, where the increments
between adjacent grid points equal $1/\sqrt{t}$. Thus, for all $v\in\mathcal B_{s,t}$, there exists $\tilde v\in\tilde{\mathcal B}_{s,t}$ such that $\|v-\tilde v\| \le 1/\sqrt{t}$. Now, let $\mu_2 = (1+u^2)(1+z_{\max}^2) > 0$, and $A_2 = \{\|M_{S,t}\| \le \mu_2\sqrt{t}\}$. Then, by elementary
algebra, we have $M_t^{\tilde v} \ge \tilde\kappa_1M_t^v$ on $A_2$, where $\tilde\kappa_1 = e^{-3\psi\mu_2}$. Therefore, (32) implies that
$$
\begin{aligned}
\mathbb{P}_{X,\theta_S}^\pi\{\|\tilde J_{S,t}^{-1}M_{S,t}\|^2 > \delta_{s,t}^2\}
&\stackrel{(f)}{\le} \mathbb{P}_{X,\theta_S}^\pi\big\{M_t^{\tilde v} \ge \tilde\kappa_1e^{\frac{1}{2}\psi\mu_1\rho s\log t} \text{ for some } \tilde v\in\tilde{\mathcal B}_{s,t},\ A_2\big\} + \mathbb{P}_{X,\theta_S}^\pi\{A_2^c\} + \kappa_1s\,e^{-\tilde\rho_1\sqrt{t}}\\
&\stackrel{(g)}{\le} \sum_{\tilde v\in\tilde{\mathcal B}_{s,t}}\mathbb{P}_{X,\theta_S}^\pi\big\{M_t^{\tilde v} \ge \tilde\kappa_1e^{\frac{1}{2}\psi\mu_1\rho s\log t}\big\} + \mathbb{P}_{X,\theta_S}^\pi\{A_2^c\} + \kappa_1s\,e^{-\tilde\rho_1\sqrt{t}}\\
&\stackrel{(h)}{\le} \sum_{\tilde v\in\tilde{\mathcal B}_{s,t}}\tilde\kappa_1^{-1}\,e^{-\frac{1}{2}\psi\mu_1\rho s\log t}\,\mathbb{E}_{X,\theta_S}^\pi[M_t^{\tilde v}] + \mathbb{P}_{X,\theta_S}^\pi\{A_2^c\} + \kappa_1s\,e^{-\tilde\rho_1\sqrt{t}},
\end{aligned}\tag{33}
$$
where: (f) follows by the law of total probability and the fact that $M_t^{\tilde v} \ge \tilde\kappa_1M_t^v$ on $A_2$, (g) follows
by the union bound, and (h) follows by Markov's inequality. We now prove that, for every $\tilde v\in\tilde{\mathcal B}_{s,t}$,
$\{M_k^{\tilde v}, k = 1, 2, \ldots\}$ is a supermartingale. Letting $\tilde v\in\tilde{\mathcal B}_{s,t}$, and $U_{S,k} = \big[\begin{smallmatrix}1\\p_k\end{smallmatrix}\big]\otimes X_{S,k} = \big[\begin{smallmatrix}1\\p_k\end{smallmatrix}\big]\otimes\big[\begin{smallmatrix}1\\Z_{S,k}\end{smallmatrix}\big]$ for
k = 1, 2, . . . , we note that
$$
\mathbb{E}_{X,\theta_S}^\pi[M_k^{\tilde v}\mid\mathcal F_{S,k-1}] = e^{\psi\tilde v^TM_{S,k-1} - \frac{1}{2}\psi\tilde v^T\tilde J_{S,k}\tilde v}\,\mathbb{E}_{X,\theta_S}^\pi\big[e^{\psi\chi_k\tilde v^TU_{S,k}\varepsilon_k}\mid\mathcal F_{S,k-1}\big] \stackrel{(i)}{=} M_{k-1}^{\tilde v}\,e^{-\frac{1}{2}\psi\chi_k(\tilde v^TU_{S,k})^2}\,\mathbb{E}_{X,\theta_S}^\pi\big[e^{\psi\chi_k\tilde v^TU_{S,k}\varepsilon_k}\mid U_{S,k}\big] \tag{34}
$$
for k ≥ 2, where (i) follows from the definition of $\{M_k^{\tilde v}\}$ and the independence of $\{\varepsilon_k\}$. Since $\|\tilde v\| \le \delta_{s,t}$ for all $\tilde v\in\tilde{\mathcal B}_{s,t}$, and the maximum eigenvalue of $\chi_kU_{S,k}U_{S,k}^T$ cannot exceed $\operatorname{tr}(\chi_kU_{S,k}U_{S,k}^T) = \chi_kU_{S,k}^TU_{S,k} \le (1+u^2)(1+z_{\max}^2)$, we deduce that $|\psi\chi_k\tilde v^TU_{S,k}| \le \psi\delta_{s,t}[(1+u^2)(1+z_{\max}^2)]^{1/2} \le \eta_0$.
Thus,
$$
\mathbb{E}_{X,\theta_S}^\pi[M_k^{\tilde v}\mid\mathcal F_{S,k-1}] \stackrel{(j)}{\le} M_{k-1}^{\tilde v}\,e^{-\frac{1}{2}\psi\chi_k(\tilde v^TU_{S,k})^2}\,e^{\mu_0\sigma_0^2\psi^2\chi_k(\tilde v^TU_{S,k})^2} \stackrel{(k)}{\le} M_{k-1}^{\tilde v}
$$
for k ≥ 2, where: (j) follows by (34) and the fact that $\mathbb{E}_{X,\theta_S}^\pi[e^{\eta\varepsilon_k}\mid\mathcal F_{S,k-1}] \le e^{\mu_0\sigma_0^2\eta^2}$ for all η satisfying |η| ≤ η0, and (k) follows because $\mu_0\sigma_0^2\psi \le \tfrac{1}{2}$. As a result, $(M_k^{\tilde v},\mathcal F_{S,k})$ is a supermartingale.
Consequently, by (33) and the fact that $M_1^{\tilde v} = 1$, we deduce that
$$
\begin{aligned}
\mathbb{P}_{X,\theta_S}^\pi\{\|\tilde J_{S,t}^{-1}M_{S,t}\|^2 > \delta_{s,t}^2\} &\le \sum_{\tilde v\in\tilde{\mathcal B}_{s,t}}\tilde\kappa_1^{-1}\,e^{-\frac{1}{2}\psi\mu_1\rho s\log t} + \mathbb{P}_{X,\theta_S}^\pi\{A_2^c\} + \kappa_1s\,e^{-\tilde\rho_1\sqrt{t}}\\
&\stackrel{(l)}{\le} \tilde\kappa_1^{-1}\,e^{s\log(4\rho s\sqrt{t}\log t) - \frac{1}{2}\psi\mu_1\rho s\log t} + \mathbb{P}_{X,\theta_S}^\pi\{A_2^c\} + \kappa_1s\,e^{-\tilde\rho_1\sqrt{t}},
\end{aligned}\tag{35}
$$
where (l) follows because the cardinality of $\tilde{\mathcal B}_{s,t}$ is at most $(2\delta_{s,t}\sqrt{t})^{2s} = e^{s\log(4\delta_{s,t}^2t)}$. Moreover,
repeating the above supermartingale argument, we obtain $\mathbb{P}_{X,\theta_S}^\pi\{A_2^c\} = \mathbb{P}_{X,\theta_S}^\pi\{\|M_{S,t}\| > \mu_2\sqrt{t}\} \le \tilde\kappa_2s\,e^{-\tilde\rho_2\sqrt{t}/s}$, where $\tilde\kappa_2 = 4$ and $\tilde\rho_2 = \min\big\{\tfrac{\mu_2}{4\mu_0\sigma_0^2},\ \tfrac{\eta_0\sqrt{\mu_2}}{2\sqrt{2}}\big\}$. Because $\sqrt{t}/\log t \ge \rho s$, this implies that
$\mathbb{P}_{X,\theta_S}^\pi\{A_2^c\} = \mathbb{P}_{X,\theta_S}^\pi\{\|M_{S,t}\| > \mu_2\sqrt{t}\} \le \tilde\kappa_2s\,e^{-\tilde\rho_2\rho\log t}$. As a result, (35) implies that
$$
\mathbb{P}_{X,\theta_S}^\pi\{\|\tilde J_{S,t}^{-1}M_{S,t}\|^2 > \delta_{s,t}^2\} \le \tilde\kappa_1^{-1}\,e^{s\log(4\rho s\sqrt{t}\log t) - \frac{1}{2}\psi\mu_1\rho s\log t} + \tilde\kappa_2s\,e^{-\tilde\rho_2\rho\log t} + \kappa_1s\,e^{-\tilde\rho_1\sqrt{t}}. \tag{36}
$$
for T ≥ 2, where: $d_\Theta = \max\{\|\vartheta-\tilde\vartheta\| : \vartheta,\tilde\vartheta\in\Theta_S\}$, and $\mathbb{P}_{X,\theta_S}^\pi\{\cdot\}$ is the probability measure associated with $\mathbb{E}_{X,\theta_S}^\pi\{\cdot\}$. By Lemma 2, we know that $\mathbb{P}_{X,\theta_S}^\pi\{A_t^c\} \le (\kappa_2s\log t)/\sqrt{t}$ for $t \ge t_0$. Therefore,
$\sum_{t=t_0}^{T-1}\mathbb{P}_{X,\theta_S}^\pi\{A_t^c\} \le 2\kappa_2s\sqrt{T}\log T$. Hence, we deduce from (42) that
$$
\sum_{t=1}^{T}\mathbb{E}_{X,\theta_S}^\pi\Big[(-\beta_S^TX_{S,t})\big(\varphi(\theta_S,X_{S,t})-p_t\big)^2\,\mathbb{I}\{t\notin\mathcal M\}\Big] \le C_0K_0\left(t_0d_\Theta^2 + 2\kappa_2d_\Theta^2s\sqrt{T}\log T + \sum_{t=t_0}^{T-1}\mathbb{E}_{X,\theta_S}^\pi\big[\|\theta_S-\hat\vartheta_{S,t+1}\|^2\,\mathbb{I}\{A_t\}\big]\right) \tag{43}
$$
$$
\stackrel{(e)}{=} \mathbb{E}_{X,\theta_S}^\pi\big[\|\tilde J_{S,t}^{-1}M_{S,t}\|^2\,\mathbb{I}\{A_t\}\big] \stackrel{(f)}{\le} \frac{\rho_2s\log t}{\sqrt{t}} \tag{44}
$$
for all $t \ge t_0$; (d) follows because $\hat\vartheta_{S,t} = P_\Theta\{\hat\theta_{S,t}\}$, (e) follows by (11), and (f) follows by the definition
of $A_t$. Combining (43) and (44), we obtain
$$
\sum_{t=1}^{T}\mathbb{E}_{X,\theta_S}^\pi\Big[(-\beta_S^TX_{S,t})\big(\varphi(\theta_S,X_{S,t})-p_t\big)^2\,\mathbb{I}\{t\notin\mathcal M\}\Big] \le C_0K_0\big(t_0d_\Theta^2 + 2\kappa_2d_\Theta^2s\sqrt{T}\log T + 4\rho_2s\sqrt{T}\log T\big) \le C_2\,s\sqrt{T}\log T \tag{45}
$$
for T ≥ 2, where $C_2 = C_0K_0(t_0d_\Theta^2 + 2\kappa_2d_\Theta^2 + 4\rho_2)$. By (38)-(40) and (45), we conclude that $\Delta_{\theta_S}^\pi(T) \le Cs\sqrt{T}\log T$ for T ≥ 2, where $C = C_1 + C_2$. Q.E.D.
To analyze the right-hand side of (46), we first define $\tilde Q_t(\tilde\theta) := \sum_{k=1}^{t}\chi_k\int_{D_k}^{g(\tilde\theta\cdot U_k)}\frac{D_k-y}{\nu(y)}\,dy$ for all
$\tilde\theta\in\mathbb R^{2(d+1)}$, where $U_k = \big[\begin{smallmatrix}1\\p_k\end{smallmatrix}\big]\otimes X_k$ for all k. Then, we have
$$
Q_t(\theta,\lambda_{t+1}) - Q_t(\tilde\theta_{t+1},\lambda_{t+1}) = \big(\tilde Q_t(\theta) - \lambda_{t+1}\|\theta\|_1\big) - \big(\tilde Q_t(\tilde\theta_{t+1}) - \lambda_{t+1}\|\tilde\theta_{t+1}\|_1\big). \tag{47}
$$
We now examine the Taylor series expansion of the second term on the right-hand side of (47),
namely $\tilde Q_t(\tilde\theta_{t+1})$, around θ. To that end, we deduce from elementary analysis that, for all $\tilde\theta\in\mathbb R^{2(d+1)}$,
$$
\begin{aligned}
\nabla\tilde Q_t(\tilde\theta) &= \Big(\frac{\partial\tilde Q_t(\tilde\theta)}{\partial\tilde\theta_1},\frac{\partial\tilde Q_t(\tilde\theta)}{\partial\tilde\theta_2},\ldots,\frac{\partial\tilde Q_t(\tilde\theta)}{\partial\tilde\theta_{2(d+1)}}\Big)\\
&= \sum_{k=1}^{t}\chi_k\,g'(\tilde\theta\cdot U_k)\,U_k\,\frac{D_k-g(\tilde\theta\cdot U_k)}{\nu\big(g(\tilde\theta\cdot U_k)\big)}\\
&\stackrel{(a)}{=} \sum_{k=1}^{t}\chi_k\,U_k\big(D_k-g(\tilde\theta\cdot U_k)\big),
\end{aligned}\tag{48}
$$
where (a) follows because $\nu(y) = g'\big(g^{-1}(y)\big)$ for $y\in\mathbb R$. In addition, for all $\tilde\theta\in\mathbb R^{2(d+1)}$,
$$
\nabla^2\tilde Q_t(\tilde\theta) = -\sum_{k=1}^{t}\chi_k\,g'(\tilde\theta\cdot U_k)\,U_kU_k^T. \tag{49}
$$
We deduce from (48), (49), and Taylor's theorem that there exists a random vector $\xi_{t+1}\in\mathbb R^{2(d+1)}$
on the line segment connecting θ and $\tilde\theta_{t+1}$ such that
$$
\begin{aligned}
\tilde Q_t(\tilde\theta_{t+1}) &= \tilde Q_t(\theta) + (\tilde\theta_{t+1}-\theta)^T\nabla\tilde Q_t(\theta) + \tfrac{1}{2}(\tilde\theta_{t+1}-\theta)^T\nabla^2\tilde Q_t(\xi_{t+1})(\tilde\theta_{t+1}-\theta)\\
&= \tilde Q_t(\theta) + (\tilde\theta_{t+1}-\theta)^T\sum_{k=1}^{t}\chi_kU_k\big(D_k-g(\theta\cdot U_k)\big) - \tfrac{1}{2}(\tilde\theta_{t+1}-\theta)^T\sum_{k=1}^{t}\chi_k\,g'(\xi_{t+1}\cdot U_k)U_kU_k^T\,(\tilde\theta_{t+1}-\theta)\\
&\stackrel{(b)}{=} \tilde Q_t(\theta) + (\tilde\theta_{t+1}-\theta)^T\sum_{k=1}^{t}\chi_kU_k\varepsilon_k - \tfrac{1}{2}(\tilde\theta_{t+1}-\theta)^T\sum_{k=1}^{t}\chi_k\,g'(\xi_{t+1}\cdot U_k)U_kU_k^T\,(\tilde\theta_{t+1}-\theta),
\end{aligned}\tag{50}
$$
where (b) follows because $D_k = g(\theta\cdot U_k) + \varepsilon_k$ for all k. Consequently, letting $\zeta_{k,t} = g'(\xi_{t+1}\cdot U_k)$ for
all k and t, we deduce from (47) and (50) that
$$
\begin{aligned}
\mathbb{P}_{X,\theta}^\pi\big\{\|\hat\theta^{(\mathrm{lasso})}_{t+1}(\lambda_{t+1})-\theta\|^2 > \delta_{s,t}^2\big\}
&\le \mathbb{P}_{X,\theta}^\pi\Big\{\tfrac{1}{2}(\tilde\theta_{t+1}-\theta)^T\sum_{k=1}^{t}\chi_k\zeta_{k,t}U_kU_k^T(\tilde\theta_{t+1}-\theta) - (\tilde\theta_{t+1}-\theta)^T\sum_{k=1}^{t}\chi_kU_k\varepsilon_k < \lambda_{t+1}\|\theta\|_1 - \lambda_{t+1}\|\tilde\theta_{t+1}\|_1\Big\}\\
&= \mathbb{P}_{X,\theta}^\pi\Big\{\tfrac{1}{2}(\tilde\theta_{t+1}-\theta)^TV_t(\tilde\theta_{t+1}-\theta) < \lambda_{t+1}\|\theta\|_1 - \lambda_{t+1}\|\tilde\theta_{t+1}\|_1 + (\tilde\theta_{t+1}-\theta)^TM_t\Big\},
\end{aligned}\tag{51}
$$
where $V_t = \sum_{k=1}^{t}\chi_k\zeta_{k,t}U_kU_k^T$ and $M_t = \sum_{k=1}^{t}\chi_kU_k\varepsilon_k$ for all t. Thus, letting $B_1 = \{\|M_t\|_\infty \le \lambda_{t+1}\}$,
we deduce from (51) that
$$
\mathbb{P}_{X,\theta}^\pi\big\{\|\hat\theta^{(\mathrm{lasso})}_{t+1}(\lambda_{t+1})-\theta\|^2 > \delta_{s,t}^2\big\} \le \mathbb{P}_{X,\theta}^\pi\Big\{\tfrac{1}{2}(\tilde\theta_{t+1}-\theta)^TV_t(\tilde\theta_{t+1}-\theta) < \lambda_{t+1}\|\theta\|_1 - \lambda_{t+1}\|\tilde\theta_{t+1}\|_1 + (\tilde\theta_{t+1}-\theta)^TM_t,\ B_1\Big\} + \mathbb{P}_{X,\theta}^\pi\{B_1^c\}. \tag{52}
$$
To obtain an upper bound on $\mathbb{P}_{X,\theta}^\pi\{B_1^c\}$, we use the following lemma, the proof of which is deferred
to the end of this section.
Lemma C.1. There exist finite and positive constants $\tilde\kappa_3$ and $\tilde\rho_3$ such that, if $\rho \ge \tilde\rho_3$, then $\mathbb{P}_{X,\theta}^\pi\{\|M_t\|_\infty > \lambda_{t+1}\} \le \tilde\kappa_3\,s(\log d+\log t)/\sqrt{t}$.
By Lemma C.1 and (52), if $\rho \ge \tilde\rho_3$, then
$$
\mathbb{P}_{X,\theta}^\pi\big\{\|\hat\theta^{(\mathrm{lasso})}_{t+1}(\lambda_{t+1})-\theta\|^2 > \delta_{s,t}^2\big\} \le \mathbb{P}_{X,\theta}^\pi\Big\{\tfrac{1}{2}(\tilde\theta_{t+1}-\theta)^TV_t(\tilde\theta_{t+1}-\theta) < \lambda_{t+1}\|\theta\|_1 - \lambda_{t+1}\|\tilde\theta_{t+1}\|_1 + (\tilde\theta_{t+1}-\theta)^TM_t,\ B_1\Big\} + \frac{\tilde\kappa_3s(\log d+\log t)}{\sqrt{t}}. \tag{53}
$$
Moreover, on $B_1$,
$$
(\tilde\theta_{t+1}-\theta)^TM_t \stackrel{(c)}{\le} \|\tilde\theta_{t+1}-\theta\|_1\,\|M_t\|_\infty \le \lambda_{t+1}\|\tilde\theta_{t+1}-\theta\|_1, \tag{54}
$$
where (c) follows from Hölder's inequality. By (53) and (54), we deduce that, if $\rho \ge \tilde\rho_3$, then
$$
\begin{aligned}
\mathbb{P}_{X,\theta}^\pi\big\{\|\hat\theta^{(\mathrm{lasso})}_{t+1}(\lambda_{t+1})-\theta\|^2 > \delta_{s,t}^2\big\}
&\le \mathbb{P}_{X,\theta}^\pi\Big\{\tfrac{1}{2}(\tilde\theta_{t+1}-\theta)^TV_t(\tilde\theta_{t+1}-\theta) < \lambda_{t+1}\|\theta\|_1 - \lambda_{t+1}\|\tilde\theta_{t+1}\|_1 + \lambda_{t+1}\|\tilde\theta_{t+1}-\theta\|_1\Big\} + \frac{\tilde\kappa_3s(\log d+\log t)}{\sqrt{t}}\\
&= \mathbb{P}_{X,\theta}^\pi\Big\{\tfrac{1}{2}(\tilde\theta_{t+1}-\theta)^TV_t(\tilde\theta_{t+1}-\theta) < \lambda_{t+1}\big(\|\theta\|_1 - \|\tilde\theta_{t+1}\|_1 + \|\tilde\theta_{t+1}-\theta\|_1\big)\Big\} + \frac{\tilde\kappa_3s(\log d+\log t)}{\sqrt{t}}.
\end{aligned}\tag{55}
$$
Recall that $\theta_S\in\mathbb R^{2s}$ is the compressed parameter vector containing all non-zero components of θ.
Let $\theta_N\in\mathbb R^{2(d+1-s)}$ be the vector consisting of the components of θ that are not contained in $\theta_S$.
Thus, all components of $\theta_N$ are zero. In accordance with this, we separate $\tilde\theta_{t+1}$ into two vectors in
the same way: let $\tilde\theta_{S,t+1}\in\mathbb R^{2s}$ be the vector consisting of the components of $\tilde\theta_{t+1}$ whose indices are
in S, and $\tilde\theta_{N,t+1}\in\mathbb R^{2(d+1-s)}$ be the vector consisting of the components of $\tilde\theta_{t+1}$ whose indices are
not in S. Then, we have the following:
$$
\|\theta\|_1 = \|\theta_S\|_1, \tag{56a}
$$
where (d) follows from Minkowski's inequality. We deduce from (55) and (57) that, if $\rho \ge \tilde\rho_3$, then
$$
\begin{aligned}
\mathbb{P}_{X,\theta}^\pi\big\{\|\hat\theta^{(\mathrm{lasso})}_{t+1}(\lambda_{t+1})-\theta\|^2 > \delta_{s,t}^2\big\}
&\le \mathbb{P}_{X,\theta}^\pi\Big\{\tfrac{1}{2}(\tilde\theta_{t+1}-\theta)^TV_t(\tilde\theta_{t+1}-\theta) < 2\lambda_{t+1}\|\tilde\theta_{S,t+1}-\theta_S\|_1\Big\} + \frac{\tilde\kappa_3s(\log d+\log t)}{\sqrt{t}}\\
&= \mathbb{P}_{X,\theta}^\pi\Big\{(\tilde\theta_{t+1}-\theta)^TV_t(\tilde\theta_{t+1}-\theta) < 4\lambda_{t+1}\|\tilde\theta_{S,t+1}-\theta_S\|_1\Big\} + \frac{\tilde\kappa_3s(\log d+\log t)}{\sqrt{t}}.
\end{aligned}\tag{58}
$$
Let $\mu_3 = \tfrac{1}{8}(m_1-m_2)^2\gamma_1\tilde\ell > 0$ and $B_2 = \{\mu_{\min}(V_t) \ge \mu_3\sqrt{t}\}$, where $\mu_{\min}(V_t)$ denotes the smallest
eigenvalue of $V_t$. We use the following lemma to derive an upper bound on $\mathbb{P}_{X,\theta}^\pi\{B_2^c\}$; the proof of this lemma is also deferred to the end of this section.
Lemma C.2. There exist finite and positive constants $\tilde\kappa_4$ and $\tilde\rho_4$ such that, if $\rho \ge \tilde\rho_4$, then
$\mathbb{P}_{X,\theta}^\pi\{\mu_{\min}(V_t) < \mu_3\sqrt{t}\} \le \tilde\kappa_4\,s(\log d+\log t)/\sqrt{t}$ for $t \ge 5$.
Lemma C.2 and (58) imply that, if $\rho \ge \max\{\tilde\rho_3,\tilde\rho_4\}$ and $t \ge 5$, then
$$
\mathbb{P}_{X,\theta}^\pi\big\{\|\hat\theta^{(\mathrm{lasso})}_{t+1}(\lambda_{t+1})-\theta\|^2 > \delta_{s,t}^2\big\} \stackrel{(e)}{\le} \mathbb{P}_{X,\theta}^\pi\big\{\mu_3\sqrt{t}\,\|\tilde\theta_{t+1}-\theta\|^2 < 4\lambda_{t+1}\|\tilde\theta_{S,t+1}-\theta_S\|_1\big\} + \frac{\kappa_3s(\log d+\log t)}{\sqrt{t}}, \tag{59}
$$
where $\kappa_3 = \tilde\kappa_3 + \tilde\kappa_4$ and (e) follows from the Rayleigh-Ritz theorem. Since $\|y\|_1 \le \sqrt{2s}\,\|y\|$ for all $y\in\mathbb R^{2s}$,
we have $\|\tilde\theta_{S,t+1}-\theta_S\|_1 \le \sqrt{2s}\,\|\tilde\theta_{S,t+1}-\theta_S\| \le \sqrt{2s}\,\|\tilde\theta_{t+1}-\theta\|$. Thus, we deduce from (59) that, if
$\rho \ge \max\{\tilde\rho_3,\tilde\rho_4\}$ and $t \ge 5$, then
$$
\begin{aligned}
\mathbb{P}_{X,\theta}^\pi\big\{\|\hat\theta^{(\mathrm{lasso})}_{t+1}(\lambda_{t+1})-\theta\|^2 > \delta_{s,t}^2\big\}
&\le \mathbb{P}_{X,\theta}^\pi\big\{\mu_3\sqrt{t}\,\|\tilde\theta_{t+1}-\theta\|^2 < 4\lambda_{t+1}\sqrt{2s}\,\|\tilde\theta_{t+1}-\theta\|\big\} + \frac{\kappa_3s(\log d+\log t)}{\sqrt{t}}\\
&\stackrel{(f)}{=} \mathbb{P}_{X,\theta}^\pi\big\{\mu_3\sqrt{t}\,\|\tilde\theta_{t+1}-\theta\| < 4\lambda_{t+1}\sqrt{2s}\big\} + \frac{\kappa_3s(\log d+\log t)}{\sqrt{t}}\\
&\stackrel{(g)}{=} \mathbb{P}_{X,\theta}^\pi\{\rho < \tilde\rho_5\} + \frac{\kappa_3s(\log d+\log t)}{\sqrt{t}},
\end{aligned}\tag{60}
$$
where $\tilde\rho_5 = 32\tilde c^2/\mu_3^2$, (f) follows because $\lambda_{t+1} = \tilde c\,t^{1/4}\sqrt{\log d+\log t}$, and (g) follows because $\|\tilde\theta_{t+1}-\theta\|^2 = \delta_{s,t}^2 = \rho s(\log d+\log t)/\sqrt{t}$. Note that, if $\rho \ge \tilde\rho_5$, then $\mathbb{P}_{X,\theta}^\pi\{\rho < \tilde\rho_5\} = 0$. Thus, letting $\rho = \rho_3 = \max\{\tilde\rho_3,\tilde\rho_4,\tilde\rho_5\}$
and $t_1 = 5$, we deduce from (60) that
$$
\mathbb{P}_{X,\theta}^\pi\Big\{\|\hat\theta^{(\mathrm{lasso})}_{t+1}(\lambda_{t+1})-\theta\|^2 > \frac{\rho_3s(\log d+\log t)}{\sqrt{t}}\Big\} \le \frac{\kappa_3s(\log d+\log t)}{\sqrt{t}}
$$
for all $t \ge t_1$. Q.E.D.
Proof of Theorem 3. Fix π = ILQX(m1, m2, λ). For p ∈ [ℓ, u], θ ∈ Θ, and x ∈ X, consider the
Taylor series expansion of r(p, θ, x) around the revenue-maximizing price, ϕ(θ, x), noting that there
exists a price p̃ between p and ϕ(θ, x) such that
$$
r(p,\theta,x) = r\big(\varphi(\theta,x),\theta,x\big) + \frac{\partial}{\partial p}r\big(\varphi(\theta,x),\theta,x\big)\big(p-\varphi(\theta,x)\big) + \tfrac{1}{2}\frac{\partial^2}{\partial p^2}r\big(\tilde p,\theta,x\big)\big(p-\varphi(\theta,x)\big)^2. \tag{61}
$$
Because $\frac{\partial}{\partial p}r\big(\varphi(\theta,x),\theta,x\big) = 0$ for all θ ∈ Θ and x ∈ X, (61) implies that
$$
r^*(\theta,x) - r(p,\theta,x) = r\big(\varphi(\theta,x),\theta,x\big) - r(p,\theta,x) \le C_3\big(\varphi(\theta,x)-p\big)^2 \tag{62}
$$
for all θ ∈ Θ and x ∈ X, where $C_3 = \max\big\{\tfrac{1}{2}\big|\tfrac{\partial^2}{\partial p^2}r(p,\theta,x)\big| : p\in[\ell,u],\,\theta\in\Theta,\,x\in\mathcal X\big\}$. We deduce from
for T ≥ 2, where $\mathbb{E}_{X,\theta}^\pi\{\cdot\} = \mathbb{E}_X\{\mathbb{E}_\theta^\pi\{\cdot\mid X^T\}\}$. Using (63) and repeating the arguments used for
deriving (39)-(42) in the proof of Theorem 2, we obtain the following:
$$
\sum_{t=1}^{T}\mathbb{E}_{X,\theta}^\pi\Big[C_3\big(\varphi(\theta,X_t)-p_t\big)^2\,\mathbb{I}\{t\in\mathcal M\}\Big] \le C_4\sqrt{T} \tag{64}
$$
for T ≥ 2, where $C_6 = C_5(t_1d_\Theta^2 + 2\kappa_3d_\Theta^2 + 4\rho_3)$. Combining (63), (64), and (66), we deduce that
$\Delta_\theta^\pi(T) \le \tilde Cs\sqrt{T}(\log d + \log T)$ for T ≥ 2, where $\tilde C = C_4 + C_6$. Q.E.D.
Proof of Lemma C.1. We first note that, if $\sqrt{t}/(\log d + \log t) < \rho s$, then we deduce that
$\mathbb{P}_{X,\theta}^\pi\{\|M_t\|_\infty > \lambda_{t+1}\} \le 1 < \rho s(\log d+\log t)/\sqrt{t}$ and obtain the desired result by choosing $\tilde\kappa_3 \ge \rho$. Now,
suppose that $\sqrt{t}/(\log d + \log t) \ge \rho s$, and note that
$$
\mathbb{P}_{X,\theta}^\pi\{\|M_t\|_\infty > \lambda_{t+1}\} = \mathbb{P}_{X,\theta}^\pi\Big\{\max_{i=1,\ldots,2(d+1)}|M_t^i| > \lambda_{t+1}\Big\} = \mathbb{P}_{X,\theta}^\pi\Big\{\bigcup_{i=1}^{2(d+1)}\{|M_t^i| > \lambda_{t+1}\}\Big\} \stackrel{(a)}{\le} \sum_{i=1}^{2(d+1)}\mathbb{P}_{X,\theta}^\pi\{|M_t^i| > \lambda_{t+1}\}, \tag{67}
$$
where $M_t^i$ denotes the i-th component of $M_t$ for i = 1, . . . , 2(d + 1), and (a) follows by the union
bound. We construct an upper bound on the right-hand side of (67). For all $v\in\mathbb R$ and $i\in\{1, 2, \ldots, 2(d+1)\}$, define $\{M_k^{v,i}, k = 1, 2, \ldots\}$ such that $M_1^{v,i} = 1$ and $M_k^{v,i} = e^{\tilde\psi vM_k^i - \frac{1}{2}\tilde\psi v^2\sum_{l=1}^{k}\chi_l}$ for
k ≥ 2, where $\tilde\psi = \frac{1}{2\mu_0\sigma_0^2\upsilon_0^2}$ and $\upsilon_0 = \max\big\{\big\|\big[\begin{smallmatrix}1\\p\end{smallmatrix}\big]\otimes x\big\|_\infty : p\in[\ell,u],\ x\in\mathcal X\big\}$. Based on this, we deduce
$$
\begin{aligned}
\mathbb{P}_{X,\theta}^\pi\{|M_t^i| > \lambda_{t+1}\} &= \mathbb{P}_{X,\theta}^\pi\{M_t^i > \lambda_{t+1}\} + \mathbb{P}_{X,\theta}^\pi\{M_t^i < -\lambda_{t+1}\}\\
&\stackrel{(c)}{\le} \mathbb{P}_{X,\theta}^\pi\Big\{M_t^i > \frac{\lambda_{t+1}}{2\sqrt{t}}\sum_{l=1}^{t}\chi_l\Big\} + \mathbb{P}_{X,\theta}^\pi\Big\{M_t^i < -\frac{\lambda_{t+1}}{2\sqrt{t}}\sum_{l=1}^{t}\chi_l\Big\}\\
&\le \mathbb{P}_{X,\theta}^\pi\Big\{M_t^{v_+,i} > e^{\frac{1}{2}\tilde\psi v_+^2\sum_{l=1}^{t}\chi_l}\Big\} + \mathbb{P}_{X,\theta}^\pi\Big\{M_t^{v_-,i} > e^{\frac{1}{2}\tilde\psi v_-^2\sum_{l=1}^{t}\chi_l}\Big\},
\end{aligned}\tag{68}
$$
where $v_+ = \frac{\lambda_{t+1}}{2\sqrt{t}}$, $v_- = -\frac{\lambda_{t+1}}{2\sqrt{t}}$, and (c) follows because $\sum_{l=1}^{t}\chi_l \le 2\sqrt{t}$. Given $v\in\{v_+,v_-\}$ and $i\in\{1, \ldots, 2(d+1)\}$, we have $\mathbb{E}_{X,\theta}^\pi[M_k^{v,i}\mid\mathcal F_{k-1}] = M_{k-1}^{v,i}\,e^{-\frac{1}{2}\tilde\psi v^2\chi_k}\,\mathbb{E}_{X,\theta}^\pi[e^{\tilde\psi v\chi_kU_k^i\varepsilon_k}\mid\mathcal F_{k-1}]$ for k ≥ 2, where
$\mathcal F_k = \sigma\big(p_1,\ldots,p_k,\varepsilon_1,\ldots,\varepsilon_k,X_1,\ldots,X_{k+1}\big)$ and $U_k^i$ denotes the i-th component of $U_k$. Note that
$|\tilde\psi v\chi_kU_k^i| \le \frac{\tilde\psi\lambda_{t+1}\upsilon_0}{2\sqrt{t}} = \tfrac{1}{2}\tilde\psi\tilde c\,\upsilon_0\sqrt{\tfrac{\log d+\log t}{\sqrt{t}}} \le \tfrac{1}{2}\tilde\psi\tilde c\,\upsilon_0\sqrt{\tfrac{1}{\rho s}}$. Thus, if $\rho \ge \tilde\rho_3 = \big[\tilde\psi\tilde c\,\upsilon_0/(2\eta_0)\big]^2$, then $|\tilde\psi v\chi_kU_k^i| \le \eta_0$.
Therefore, because $\mathbb{E}_{X,\theta}^\pi[e^{\eta\varepsilon_k}\mid\mathcal F_{k-1}] \le e^{\mu_0\sigma_0^2\eta^2}$ for all η satisfying |η| ≤ η0, we have $\mathbb{E}_{X,\theta}^\pi[M_k^{v,i}\mid\mathcal F_{k-1}] \le M_{k-1}^{v,i}\,e^{-\frac{1}{2}\tilde\psi v^2\chi_k + \mu_0\sigma_0^2\tilde\psi^2v^2\chi_k(U_k^i)^2} \le M_{k-1}^{v,i}$ for k ≥ 2, where the latter inequality follows because
$\mu_0\sigma_0^2\tilde\psi(U_k^i)^2 \le \tfrac{1}{2}$. Consequently, for $v\in\{v_+,v_-\}$ and $i\in\{1,\ldots,2(d+1)\}$, $(M_k^{v,i},\mathcal F_k)$ is a supermartingale with $M_1^{v,i} = 1$. Hence, letting $\tilde c = 4\tilde\psi^{-1/2}$, we deduce that, if $\rho \ge \tilde\rho_3$, then
$$
\mathbb{P}_{X,\theta}^\pi\big\{M_t^{v,i} > e^{\frac{1}{2}\tilde\psi v^2\sum_{l=1}^{t}\chi_l}\big\} \le e^{-\frac{1}{2}\tilde\psi v^2\sum_{l=1}^{t}\chi_l} \stackrel{(d)}{\le} e^{-\frac{1}{16}\tilde\psi\lambda_{t+1}^2/\sqrt{t}} \stackrel{(e)}{=} e^{-\frac{1}{16}\tilde\psi\tilde c^2(\log d+\log t)} \stackrel{(f)}{\le} \frac{e^{-\frac{1}{16}\tilde\psi\tilde c^2\log d}}{\sqrt{t}}, \tag{69}
$$
where (d) follows because $v^2 = \frac{\lambda_{t+1}^2}{4t}$ for $v\in\{v_+,v_-\}$ and $\sum_{l=1}^{t}\chi_l \ge \tfrac{1}{2}\sqrt{t}$, (e) follows because $\lambda_{t+1} = \tilde c\,t^{1/4}\sqrt{\log d+\log t}$, and (f) follows because $\tilde\psi \ge 8\tilde c^{-2}$. Combining (68) and (69), we further deduce
that, if $\rho \ge \tilde\rho_3$, then
$$
\mathbb{P}_{X,\theta}^\pi\{|M_t^i| > \lambda_{t+1}\} \le \frac{2e^{-\frac{1}{16}\tilde\psi\tilde c^2\log d}}{\sqrt{t}}
$$
for all i = 1, . . . , 2(d + 1). Using the preceding inequality in (67), we obtain the following: if $\rho \ge \tilde\rho_3$,
then $\mathbb{P}_{X,\theta}^\pi\{\|M_t\|_\infty > \lambda_{t+1}\} \le \frac{4(d+1)e^{-\frac{1}{16}\tilde\psi\tilde c^2\log d}}{\sqrt{t}} \le \frac{\tilde\kappa_3s(\log d+\log t)}{\sqrt{t}}$, where $\tilde\kappa_3 = 8$ and the latter inequality
follows because $\tilde\psi \ge 16\tilde c^{-2}$. Q.E.D.
Proof of Lemma C.2. Note that, if $\sqrt{t}/(\log d + \log t) < \rho s$, then $\mathbb{P}_{X,\theta}^\pi\{\mu_{\min}(V_t) < \mu_3\sqrt{t}\} \le 1 < \rho s(\log d+\log t)/\sqrt{t}$ and the desired result follows by choosing $\tilde\kappa_4 \ge \rho$. Consequently, we suppose that
$\sqrt{t}/(\log d + \log t) \ge \rho s$ in the remainder of the proof. Repeating the arguments in the proof of
Lemma 1 and noting that $J_t \ge \tfrac{1}{8}(m_1-m_2)^2\sqrt{t}$ for all t ≥ 5, we deduce that $\mathbb{P}_{X,\theta}^\pi\{\mu_{\min}(V_t) < \mu_3\sqrt{t}\} \le$
Lemma 1 and noting that Jt ≥ 18 (m1 − m2 )2 t for all t ≥ 5, we deduce that PπX,θ µmin (Vt ) < µ3 t ≤
α = [ 1.1, −0.1, 0, 0.1, 0, 0.2, 0, 0.1, −0.1, 0, 0, 0.1, −0.1, 0.2, −0.2 ],
β = (−1) × [ 0.5, 0.1, −0.1, 0, 0, 0, 0, 0.2, 0.1, 0.2, 0, 0.2, −0.1, −0.2, 0 ].
The parameter space is Θ = [−1.5, 1.5]³⁰, and the demand shocks {εt} are normally distributed
with mean zero and standard deviation σ0 = 0.01.
Further, µ1 equals a vector of ones, Σ1 = 0.1 I14 (where I14 is the 14 × 14 identity matrix), µ2
equals a vector of twos, and Σ2 = 0.06 AAᵀ, where A is a 14 × 14 matrix generated randomly by
setting each element to a uniform random variable between 0 and 1. For ILQX, the regularization
parameter sequence λ = (λ1, λ2, . . .) is selected such that $\lambda_{t+1} = 0.05\,t^{1/4}\sqrt{\log d + \log t}$ for all t; this
choice is informed by the theory developed in §4.2.
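For reference, this setup can be reproduced along the following lines (a sketch: the mixture weights of the two feature clusters are assumed equal here, the random matrix A is regenerated rather than taken from the original experiments, and the value of d in the λ formula is taken as the feature dimension).

    import numpy as np

    rng = np.random.default_rng(0)

    alpha = np.array([1.1, -0.1, 0, 0.1, 0, 0.2, 0, 0.1, -0.1, 0, 0, 0.1, -0.1, 0.2, -0.2])
    beta = -np.array([0.5, 0.1, -0.1, 0, 0, 0, 0, 0.2, 0.1, 0.2, 0, 0.2, -0.1, -0.2, 0])

    sigma0 = 0.01                                    # std. dev. of demand shocks
    mu1, Sigma1 = np.ones(14), 0.1 * np.eye(14)      # first feature cluster
    A = rng.uniform(0.0, 1.0, size=(14, 14))
    mu2, Sigma2 = 2.0 * np.ones(14), 0.06 * A @ A.T  # second feature cluster

    def draw_feature(rng):
        """Draw one customer feature vector: intercept 1 followed by 14 features."""
        if rng.uniform() < 0.5:                      # assumed equal mixture weights
            z = rng.multivariate_normal(mu1, Sigma1)
        else:
            z = rng.multivariate_normal(mu2, Sigma2)
        return np.concatenate(([1.0], z))

    def lam(t, d=15, c_tilde=0.05):
        """ILQX regularization parameter: 0.05 * t^(1/4) * sqrt(log d + log t)."""
        return c_tilde * t ** 0.25 * np.sqrt(np.log(d) + np.log(t))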
Figure 13 Sensitivity of ILQX to variations in policy parameters. Panel (a) shows the impact of different constants c̃ in the near-optimal regularization parameter in Theorem 3. Panel (b) shows the impact of increasing the difference between the two experimental prices m1 and m2. Panel (c) shows the impact of scaling the experimentation cycle lengths by a constant factor κ, whereas panel (d) shows the impact of charging experimental prices one to three times per experimentation cycle. The rest of the problem parameters are the same as in §5.1.1.
In Figure 13(a), we find c̃ = 0.025 and c̃ = 0.05 to be the best constant multipliers for the
near-optimal regularization parameter in Theorem 3. We also observe that the performance of
ILQX is not too sensitive to the choice of c̃. In Figure 13(b), we observe that regret decreases with
the difference between the two experimental prices m1 and m2. Because in practice businesses are
constrained in the degree of price experimentation they can engage in, this observation suggests
choosing experimental prices that differ as much as possible within practical
limits. In Figure 13(c), we observe that more frequent price experiments can lead to substantial
limits. In Figure 13(c), we observe that more frequent price experiments can lead to substantial
improvements in the regret performance, possibly due to the increased sample size in the demand
estimation step of ILQX. In Figure 13(d), we observe that the number of times each experimental
price is charged per experimentation cycle does not affect the results very much.
Impact of using all historical data points in estimation. As explained in §4, ILQX uses
only the data points from the price experimentation periods. To examine the impact of using all
data, Figure 14 compares ILQX and a variant of ILQX that uses all historical data points (denoted
as ILQX with full data) in terms of regret performance. As shown in Figure 14(a), using all historical
data points improves expected regret performance in our simulation experiments in §5.1. However, this does
not necessarily translate into an almost sure improvement. For the sample path realized in our real-life
data set in §5.2, ILQX turns out to have slightly better performance than ILQX with full data;
see Figure 14(b). It is worth noting that both policies outperform other benchmarks in our real-life
data analysis (see §5.2.3 for the definitions of these policies).
Figure 14 Impact of using all data points versus those in the experimentation periods. Panel (a) shows the T-period regret of ILQX (dashed curve) and ILQX with full data (dash-dotted curve) in the setting described in §5.1. Panel (b) shows the T-period regret of ILQX (dashed curve), ILQX with full data (dash-dotted curve), company (lower solid curve), greedy IL (upper solid curve), and greedy EILL (dashed curve) in the setting described in §5.2, which is based on real-life data.
Figure 15 More on experimentation near the boundary of feasible prices. Panel (a) shows the T-period regret gap between greedy IL and δ-greedy IL, i.e., $\Delta_\theta^{\text{greedy IL}}(T) - \Delta_\theta^{\delta\text{-greedy IL}}(T)$, for δ ∈ {50, 100}. Panel (b) shows the T-period regret gap between greedy EILL and δ-greedy EILL, i.e., $\Delta_\theta^{\text{greedy EILL}}(T) - \Delta_\theta^{\delta\text{-greedy EILL}}(T)$, for δ ∈ {50, 100}. The feasible price set is [ℓ, u] = [0, 15000]. Problem parameters are calibrated from a real-life data set as described in §5.2; the displayed regret differences are based on the sample path of customer feature vectors observed in the data set.
References
Gill, R. D. and Levit, B. Y. (1995). Applications of the van Trees inequality: a Bayesian Cramér-Rao bound.
Bernoulli, 1(1/2):59–79.
Keskin, N. B. and Zeevi, A. (2014). Dynamic pricing with an unknown demand model: Asymptotically
optimal semi-myopic policies. Operations Research, 62(5):1142–1167.
Tropp, J. A. (2011). User-friendly tail bounds for matrix martingales. Technical report, California Institute
of Technology.