INTRODUCTION TO
ALGORITHMIC MARKETING
by Ilya Katsov
ISBN 978-0-692-98904-3
“At a time when power is shifting to consumers, while brands and retailers
are grasping for fleeting moments of attention, everyone is competing on data
and the ability to leverage it at scale to target, acquire, and retain customers.
This book is a manual for doing just that. Both marketing practitioners and
technology providers will find this book very useful in guiding them through
the marketing value chain and how to fully digitize it. A comprehensive and
indispensable reference for anyone undertaking the transformational journey
towards algorithmic marketing.”
—Eric Benhamou,
Founder and General Partner, Benhamou Global Ventures;
former CEO and Chairman of 3Com and Palm
CONTENTS
1 introduction
  1.1 The Subject of Algorithmic Marketing
  1.2 The Definition of Algorithmic Marketing
  1.3 Historical Backgrounds and Context
    1.3.1 Online Advertising: Services and Exchanges
    1.3.2 Airlines: Revenue Management
    1.3.3 Marketing Science
  1.4 Programmatic Services
  1.5 Who Should Read This Book?
  1.6 Summary
2 review of predictive modeling
  2.1 Descriptive, Predictive, and Prescriptive Analytics
  2.2 Economic Optimization
  2.3 Machine Learning
  2.4 Supervised Learning
    2.4.1 Parametric and Nonparametric Models
    2.4.2 Maximum Likelihood Estimation
    2.4.3 Linear Models
      2.4.3.1 Linear Regression
      2.4.3.2 Logistic Regression and Binary Classification
      2.4.3.3 Logistic Regression and Multinomial Classification
      2.4.3.4 Naive Bayes Classifier
    2.4.4 Nonlinear Models
      2.4.4.1 Feature Mapping and Kernel Methods
      2.4.4.2 Adaptive Basis and Decision Trees
  2.5 Representation Learning
    2.5.1 Principal Component Analysis
      2.5.1.1 Decorrelation
      2.5.1.2 Dimensionality Reduction
    2.5.2 Clustering
  2.6 More Specialized Models
    2.6.1 Consumer Choice Theory
      2.6.1.1 Multinomial Logit Model
      2.6.1.2 Estimation of the Multinomial Logit Model
    2.6.2 Survival Analysis
      2.6.2.1 Survival Function
      2.6.2.2 Hazard Function
      2.6.2.3 Survival Analysis Regression
index
bibliography
ACKNOWLEDGEMENTS
This book would not have been possible without the support and
help of many people. I am very grateful to my colleagues and friends,
Ali Bouhouch, Max Martynov, David Naylor, Penelope Conlon, Sergey
Tryuber, Denys Kopiychenko, and Vadim Kozyrkov, who reviewed the
content of the book and offered their feedback. Special thanks go to
Konstantin Perikov, who provided a lot of insightful suggestions about
search services and also helped with some of the examples.
I am indebted to Igor Yagovoy, Victoria Livschitz, Leonard Livschitz,
and Ezra Berger for supporting this project and helping with the pub-
lishing. Last, but certainly not least, thanks to Kathryn Wright, my
editor, who helped me shape the manuscript into the final product.
1
INTRODUCTION
benefit from data-driven methods. The short summary is that the sub-
ject of algorithmic marketing mainly concerns the processes that can
be found in the four areas of the marketing mix and the automation
of these processes by using data-driven techniques and econometric
methods.
Online advertising and data marketplaces are perhaps the most famous
and successful cases in the history of programmatic marketing, but
definitely not the only one. Online advertising, to a degree, was a com-
pletely new environment that provided unprecedented capabilities for
solving unprecedented challenges. Programmatic methods, however,
can be successfully applied in more traditional settings. Let us con-
sider a prominent case that, by coincidence, also began to unfold in
1978, the same year that the first spam email was sent.
The federal Civil Aeronautics Board had regulated all US interstate
air transport since 1938, setting schedules, routes, and fares based on
standard prices and profitability targets for the airlines. The Airline
Deregulation Act of 1978 eliminated many controls and enabled air-
lines to freely change prices and routes. It opened the door for low-cost
carriers, who pioneered simpler operational models, no-frills services,
and reduced labor costs. One of the most prominent examples was
PeopleExpress, which started in 1981 and offered fares about 70 per-
cent lower than the major airlines.
The low-cost carriers attracted new categories of travelers who had
rarely traveled by air before: college students visiting their families,
leisure travelers getting away for a few days, and many others. In 1984,
PeopleExpress reported revenue of $1 billion and its profits hit $60 mil-
lion [Talluri and Van Ryzin, 2004]. The advent of low-fare airlines was
a growing threat to the major carriers, who had almost no chance to
win the price war. Moreover, the established airlines could not afford to
lose their high-revenue business travelers in pursuit of the low-revenue
market.
The solution was found by American Airlines. First, they recognized
that unsold seats could be used to compete on price with the low-cost
carriers, because the marginal cost of such seats was close to zero any-
way. The problem, however, was how to prevent the business travelers
from purchasing tickets at discounted prices. The basic solution was
to introduce certain constraints on the discounted offers; for example,
the tickets had to be purchased at least three weeks in advance and
were non-refundable. The challenging part was that the surplus of seats
varied substantially across flights and the optimal allocation could be
achieved only by using dynamic optimization. In 1985, after a few years
of development, American Airlines released a system called Dynamic
Inventory Allocation and Maintenance Optimizer (DINAMO) to man-
age their prices across the board. PeopleExpress also used some simple
price-management strategies to differentiate peak and off-peak fares,
but their information technology system was much simpler than DI-
NAMO and was not able to match its efficiency. PeopleExpress started
to lose money at a rate of $50 million a month, headed straight to
bankruptcy, and eventually ceased to exist as a carrier in 1987 after
acquisition by Continental Airlines [Vasigh et al., 2013]. American Air-
lines, however, not only won the competition with PeopleExpress but
also increased its revenue by 14.5% and profits by 47.8% the year after
DINAMO was launched.
The case of American Airlines was the first major success of a rev-
enue management practice. By the early 1990s, the approach had been
adopted by other industries with perishable inventories and customers
booking the service in advance: hospitality, car rentals, and even tele-
vision ad sales. The success of revenue management in the airline in-
dustry is clearly related to the specific properties of the inventory and
demand in this domain:
• The demand varies significantly across customers, flights, and
time: the purchasing capacity of business travelers can be much
higher than that of discretionary travelers, peak flights can be
much more loaded than off-peak, etc.
• The supply, that is, the available seats, is not flexible. The airline
produces seats in large chunks by scheduling flights, and once
the flight is scheduled, the number of seats cannot be changed.
Unsold seats cannot be removed, so the airlines’ profit completely
depends on its ability to manage the demand and sell efficiently.
We can conclude from the above that revenue management can be
considered as a counterpart of supply-chain management or, alterna-
tively, a demand-chain management solution that struggles with in-
flexible production and adapts it to the demands of the market (and,
conversely, manipulates the demand to align it with the supply), as
illustrated in Figure 1.3.
This friction between supply and demand can be found not only
in air transportation but in other industries as well. Hotels and car
rentals are clearly the closest examples, but advertising, retail, and oth-
ers also demonstrate features that indicate the applicability of algorith-
mic methods for demand-chain management.
The case studies of online advertising and airline ticketing give a gen-
eral idea of how algorithmic methods advanced in the industry. In-
dustrial adoption was backed by rapid developments in marketing sci-
ence and, conversely, the development of scientific marketing methods
was boosted and propelled by industrial needs. Marketing as a dis-
cipline emerged in the early 1900s, and, for the first five decades of
its existence, it was mainly focused on the descriptive analysis of pro-
duction and distribution processes; that is, it collected the facts about
the flow of goods from producers to consumers. The idea that mar-
keting decisions could be supported by mathematical modeling and
optimization methods began to gain currency in the 1960s, a fact that
can be attributed to several factors. First, marketing science was influ-
enced by advances in operations research, a discipline that deals with
decision-making and efficiency problems in military and business ap-
plications by using statistical analysis and mathematical optimization.
Operations research, in turn, arose during World War II in the context
of military operation planning and resource optimization. Second, the
advancement of mathematical methods in marketing can be attributed
to technological changes and the adoption of the first mainframes in
organizations, which made it possible to collect more data and im-
plement data analysis and optimization algorithms. Finally, marketing
practitioners began to feel that the old ways of selling were wearing
thin and that marketing needed to be redefined as a mix of ingredi-
ents that can be controlled and optimized; this is how the concept of
the marketing mix appeared in 1960. Marketing science boomed in the
sixties and seventies of the twentieth century when numerous quan-
titative models for pricing, distribution, and product planning were
The marketing mix model defines four factors that can be controlled
by a company to influence consumer purchase decisions: product, pro-
motion, price, and place. As such, this categorization is very broad
and provides little guidance on how exactly programmatic marketing
systems should be built. So far, we have learned that a programmatic
system can be viewed as a provider of one or more functional services
that implement certain business processes, such as price or promotion
management. Consequently, we can make our problem statement more
specific by defining a set of services, each of which implements a cer-
tain function and has its own inputs (objectives) and outputs (actions).
There are different options for how the marketing mix can be broken
down into functional services depending on the industry and busi-
ness model of a particular company. We choose to define six major
functional services that are relevant for a wide range of business-to-
consumer (B2C) verticals: promotions, advertisements, search, recom-
mendations, pricing, and assortment. These six services are the main
subject of this book, and we will spend the later chapters discussing
how to design and build them. The applications and design principles
are very different across the services, but there are many relationships
We can use these guidelines to define the basic terminology and compo-
nents that can later be elaborated within the scope of the corresponding
domains.
The concept of the programmatic method accentuates the objective-
driven design approach, so we can attempt to define a common frame-
work by starting with the notion of the business objective. In order to
understand and execute the objective, a programmatic service is likely
to include a certain set of functional components that can have different
designs for different domains (see Figure 1.5):
• Since any automatic decision making is driven by data, the
decision-making pipeline starts with data collection. Examples
of input data for most marketing applications include customers’
personal and behavioral profiles, inventory data, and sales
records.
service can take and action parameters that a service can control,
such as price levels, discount values, email messages, or prod-
uct order in search result lists. These controls are used by the
programmatic service to execute its decisions, so the decisions
should eventually be expressed as parameters for available con-
trols.
This book is for everyone who wants to learn how to build advanced
marketing software systems. It will be useful for a variety of marketing
and software practitioners, but it was written with two target audiences
in mind. The first target audience is implementers of marketing soft-
ware, product managers, and software engineers who want to learn
about the features and techniques that can be used in marketing soft-
ware products and also learn about the economic foundations for these
techniques. The other target audience is marketing strategists and tech-
nology leaders who are looking for guidance on how marketing organi-
zations and marketing services can benefit from machine learning and
Big Data and how modern enterprises can leverage advanced decision
automation methods.
It is assumed that readers have an introductory background in statis-
tics, calculus, and programming. Although most methods described in
the book use relatively basic math, this book may not be suitable for
readers who are interested only in the business aspects of marketing:
this book is about marketing automation; it is not a traditional marketing
1.6 summary
ket. These services can be used internally by the company that owns
the customer base or can be sold to a third party. Programmatic com-
ponents should be self-contained, well-packaged, and able to imple-
ment a reasonably high level of abstraction to be sold as high-value
services.
and the total budget of the campaign C. Thus, the problem is defined
as
Let us now assume that the retailer wants to create two different
consumer segments and assign the strategy s_i = (k_i, c_i) to one of
these segments and the strategy s_j = (k_j, c_j) to the other segment.
With the assumption that the strategies are selected from domain S,
the optimization problem will be

\max_{s_i,\, s_j \in S} \sum_{u} \max\left\{\, q_u(k_i)\, m - c_i,\;\; q_u(k_j)\, m - c_j \,\right\}    (2.6)
etc.) and the enterprise’s own actions. One possible approach for han-
dling this dependency is to use a stateless model, treating it as a math-
ematical function, but allow for time-dependent arguments to account
for memory effects. For example, a demand prediction model can fore-
cast the demand for the next month by taking the discount levels for
the last week, last two weeks, last three weeks, etc. as arguments.
p(y \mid x)    (2.7)

The data come into play if the conditional distribution p(y | x) is complex
and should be learned from the data as opposed to being manually
defined. Consequently, we are interested in the data that contain
pairs (x, y) that are drawn from the true but unknown distribution
p_data(y | x). We will refer to these pairs as samples or data points. The
input data are often observed (collected) in a form that is not suitable
or optimal for modeling, so vector x and metric y are typically con-
structed from the raw data by using cleansing and normalizing trans-
formations. We will refer to the elements of such a prepared vector x
as features or independent variables and to y as the response label or de-
pendent variable. With the assumption that we have n samples and m
features, all feature vectors can be represented as an n m matrix, X,
called the design matrix, and all response variables as an n-dimensional
column-vector y. Each row of the design matrix X is a feature vector x
and each element of y is a response label y. All data points can then be
represented as an n pm 1q matrix, D, with the following structure:
x1 y1
x2 y2
D rX | ys
.. .. (2.9)
. .
xn yn
Our goal then is to create a statistical model that approximates the
true distribution p_data(y | x) with the distribution p_model(y | x) learned
from the data, so the economic model will actually be evaluated by
using the following approximation:

in which the left-hand side is the estimated value of the response variable.
Estimation of the most likely value of y is easier than estimation
of the entire distribution and, as we will see shortly, it can be done
without accurate estimation of the actual probability values. The economic
model can then be evaluated as

\max_{s}\ G\left( \hat{y}(x(s)) \right)    (2.12)
Previously, we have seen that the modeling task can be partially reduced
to learning of the distribution p(y | x) based on the available
samples x and y. The function p, which maps an m-dimensional vector
of features to the probability values, can be interpreted as a probability
density function for continuous y or a probability function for discrete
y. In many applications, we do not need to learn the entire distribu-
tion but only a function that predicts the most likely y response based
on the input x. The task of learning such distributions or functions is
known as supervised learning because the data contain the response
variables that “guide” the learning process. This problem comes in two
types. If the response variable is categorical, that is, y belongs to some
finite set of classes, the problem is known as classification. If the response
variable is continuous, the problem is known as regression.
p_model(y \mid x, \theta)    (2.13)

\hat{y} = \operatorname*{argmax}_{c}\ \Pr(y = c \mid x, k)    (2.15)
defined as the probability of the observed response data given that the
probability density specified by the parameter vector θ is known:
This function is referred to as the likelihood function or, simply, the like-
lihood. It is the probability of observing the training data under the
assumption that these data are drawn from the distribution specified
by the model with parameters θ. The logarithm of the likelihood func-
tion, often more convenient for analysis and calculations, is known as
the log-likelihood LL(θ). Our goal is to find the parameter vector that
maximizes the likelihood of the estimation:

\theta_{ML} = \operatorname*{argmax}_{\theta}\ \log p_{model}(y \mid X, \theta)    (2.17)

As p_data does not depend on θ and cannot be the subject of the optimization,
we only have to minimize the second term in order to minimize
the divergence, which is equivalent to maximization of the log-likelihood
defined in equation 2.19:
Figure 2.2: (a) Example of linear regression with one-dimensional feature space.
(b) Example of classification with a linear decision boundary and
two-dimensional feature space. Training points are shown as circles.
All of the models that we will consider are linear. The regression
model is linear in the sense that the dependency between the features
and response is modeled as a linear function, as shown in Figure 2.2.
If the observed dependency is not actually linear, the model may not
be accurately fitted to the data. The classification models are linear
in the sense that the boundary between the classes is modeled as a
hyperplane, so if the data are not linearly separable, that is, if the groups
of points cannot be accurately separated by a hyperplane, the models
may not be properly fitted to the data.
\hat{y}(x) = w^T x    (2.22)

\varepsilon = y - \hat{y}(x) = y - w^T x    (2.23)

With the assumption that the error has a normal distribution, the distribution
of the estimate produced by the model is also described by a
normal distribution, that is, the Gaussian with the mean w^T x and the
variance σ²:

p(y \mid x, w) = \mathcal{N}\left( y \mid w^T x,\ \sigma^2 \right)
             = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} \left( y - w^T x \right)^2 \right)    (2.24)

By inserting this probability distribution into the definition of the log-likelihood
defined in equation 2.18 and doing some algebra, we get the
following log-likelihood expression that has to be maximized:

LL(w) = \sum_{i=1}^{n} \log p(y_i \mid x_i, w) = \sum_{i=1}^{n} \log \mathcal{N}\left( y_i \mid w^T x_i,\ \sigma^2 \right)    (2.25)

\nabla_w\, LL(w) = X^T y - X^T X w    (2.29)
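Setting the gradient in equation 2.29 to zero yields the familiar normal equations, w = (XᵀX)⁻¹Xᵀy. The following minimal sketch (Python with NumPy; the synthetic data and coefficient values are purely illustrative assumptions, not an example from the book) shows that maximizing the Gaussian log-likelihood reduces to this closed-form solution:

import numpy as np

# Synthetic data: y = 2.0*x1 - 1.0*x2 + noise (illustrative values only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true + rng.normal(scale=0.5, size=200)

# Maximizing the Gaussian log-likelihood (2.25) is equivalent to setting
# the gradient (2.29) to zero:  X^T y - X^T X w = 0  =>  w = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # should be close to [2.0, -1.0]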
\Pr(y = 1 \mid x) = g(w^T x) = \frac{1}{1 + \exp(-w^T x)}

\Pr(y = 0 \mid x) = 1 - g(w^T x) = \frac{\exp(-w^T x)}{1 + \exp(-w^T x)}    (2.34)

in which g is known as a logistic function. This model is called a logistic
regression. Note that this is a classification model, despite the
confusing name. Next, we have to calculate the log-likelihood for this
distribution:

LL(w) = \sum_{i=1}^{n} \log p(y_i \mid x_i)
      = \sum_{i=1}^{n} \log\left[ g(w^T x_i)^{y_i} \left( 1 - g(w^T x_i) \right)^{1 - y_i} \right]
      = \sum_{i=1}^{n} \left[ y_i \log g(w^T x_i) + (1 - y_i) \log\left( 1 - g(w^T x_i) \right) \right]    (2.35)
used in the binary case and one linear decision boundary is not suf-
ficient. Instead, we can estimate the probability of each class c sep-
arately by using a dedicated coefficient vector wc . This means that
equation 2.33 can be rewritten as follows:
\Pr(y = c \mid x) = \frac{\Pr(x \mid y = c)\, \Pr(y = c)}{\Pr(x)}    (2.44)
The probability of the feature vector in the denominator is the same for
all classes, so it can be discarded for a classification problem:
The key assumption made by the Naive Bayes classifier is that each
feature xi is conditionally independent of every other feature, given
the class c. This means that the probability of observing the feature
vector x, given the class c, can be factorized as
\Pr(x \mid y = c) = \prod_{i=1}^{m} \Pr(x_i \mid y = c)    (2.46)
In the general case, the Naive Bayes classifier is not linear. However,
it is linear under certain assumptions that are accurate in many appli-
cations, so it is often described as linear. For example, let us consider a
case where the distribution Pr(x | y = c) is assumed to be multinomial.
This is a valid assumption, for example, for text classification when
each element of the feature vector is a word counter. The probability
of a feature vector follows a multinomial distribution with parameter
vector qc
\Pr(x \mid y = c) \propto \prod_{i=1}^{m} q_{ci}^{\,x_i}    (2.48)
in which qci is the probability of the feature value xi occurring in class
c. We can rewrite this expression in the vector form as follows:
We can now show that the decision boundary between the classes
is linear by considering the ratio of class probabilities, similarly to the
method we used for logistic regression. By assuming only two classes
y ∈ {0, 1} for the sake of simplicity, we can write

\log \frac{\Pr(y = 1 \mid x)}{\Pr(y = 0 \mid x)}
  = \log \Pr(y = 1 \mid x) - \log \Pr(y = 0 \mid x)
  = x^T \left( \log q_1 - \log q_0 \right) + \log \Pr(y = 1) - \log \Pr(y = 0)    (2.50)
modeling, but we should also realize that data representation can con-
tribute to this. For example, two linearly dependent values would not
be linearly dependent if one of them was measured on a logarithmic
scale. The opposite is also true – data sets that are not tractable for
linear methods can become tractable if mapped to a different space.
Consider the example in Figure 2.4: the one-dimensional data set on
the left-hand side consists of two classes that are not linearly separa-
ble, but the two-dimensional data set constructed from the first one by
using the mapping (x) → (x, x²) is linearly separable.
Figure 2.4: Feature mapping by using a quadratic function. (a) Original data
points. (b) Mapped data points.
This means that we can estimate the response variable by using only
dot products between the input x and training vectors xi :
\hat{y}(x) = w^T x = \sum_{i=1}^{n} a_i\, x^T x_i    (2.55)

This is a very important result because we can now avoid the explicit
feature mapping and replace it with a kernel function that encapsulates
computation of the dot product in the mapped space:

k(x, z) = \phi(x)^T \phi(z)

in which x and z are two feature vectors. Equation 2.55 can then be
rewritten purely in terms of the kernel function:

\hat{y}(x) = \sum_{i=1}^{n} a_i\, k(x, x_i)    (2.57)

in which

\phi(x) = \left( x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2 \right)    (2.60)
Note that the kernel is simply a distance function between the original
feature vectors, so the underlying expansion of dimensionality is to-
tally hidden; therefore, we can devise kernels that correspond to φpxq
with a very high or infinite number of dimensions but remain com-
putationally simple. This technique is known as the kernel trick, and a
number of machine learning methods can be extended this way. Selec-
tion of the right kernel can be a challenge, but there are a few kernel
functions that are known to be quite universal and are widely used
in practice. The choice of the kernel function also depends on the ap-
plication because it is essentially a measure of similarity between the
feature vectors – kernels that work well for consumer profiles might
not be the best choice for textual product descriptions, and so on.
Among the best known members of the kernel methods family are
support vector machines (SVMs). The basic SVM algorithms are linear
classification and regression methods, but they can be efficiently kernelized
to learn nonlinear dependencies. Consider the example of an
SVM classifier in Figure 2.5. It uses the same data that we used for
the nearest neighbor and logistic regression examples earlier, but its
decision boundary is clearly nonlinear.
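A minimal sketch of a kernelized SVM, using scikit-learn's SVC on synthetic data where one class surrounds the other; the data set and the choice of the RBF kernel are illustrative assumptions, not the book's example:

import numpy as np
from sklearn.svm import SVC

# Two classes that are not linearly separable: an inner cluster
# surrounded by a ring (illustrative data)
rng = np.random.default_rng(2)
inner = rng.normal(scale=0.5, size=(100, 2))
angles = rng.uniform(0, 2 * np.pi, 100)
ring = np.c_[np.cos(angles), np.sin(angles)] * 3 + rng.normal(scale=0.2, size=(100, 2))
X = np.vstack([inner, ring])
y = np.r_[np.zeros(100), np.ones(100)]

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)    # kernelized SVM

print(linear_svm.score(X, y))  # poor fit: no separating hyperplane exists
print(rbf_svm.score(X, y))     # close to 1.0: nonlinear boundary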
\hat{y}(x) = \sum_{i=1}^{q} w_i\, \phi_i(x)    (2.61)
Provided that the basis functions are highly adaptive and nonlinear,
we can expect this solution to outperform linear methods in appropri-
ate tasks. One of the most widely used realizations of the adaptive
basis concept is classification and regression trees, a family of meth-
ods that generate the adaptive basis φi pxq by using a greedy heuristic
algorithm. Let us consider the most basic version of this solution.
A classification or regression tree is created by recursive splitting of
the feature space into two parts by using a linear decision boundary,
as illustrated in Figure 2.6. In each step of the recursion, the decision
boundary can be selected as follows:
• First, the candidate hyperplanes that can be chosen as the bound-
ary are enumerated. One possible way is to try all dimensions
(e. g., we can split by using either a horizontal or vertical line in
Figure 2.6) and, for each dimension, to try the coordinates of all
data points in the training set.
• The region label is used to score the quality of the candidate split
by the misclassification rate (the share of incorrectly classified
training examples in the region) or another metric. The boundary
is selected as the candidate with the best score.
Once the boundary is selected, the algorithm is recursively applied
to the regions on both sides of the boundary. This algorithm produces
Figure 2.6: Example of a classification tree. (a) Classification tree model. (b)
Training data and decision boundaries that correspond to the tree.
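The greedy splitting procedure described above is essentially what off-the-shelf implementations of classification trees perform. A small sketch with scikit-learn (the data set below is an illustrative assumption, not the data behind Figure 2.6) shows how two recursive axis-aligned splits recover a rectangular class region:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: class 1 occupies the region x1 > 0.5 and x2 > 0.5, so two
# recursive axis-aligned splits are enough to recover it (illustrative)
rng = np.random.default_rng(3)
X = rng.uniform(size=(300, 2))
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)).astype(int)

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))  # the learned split rules
print(tree.score(X, y))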
2.5.1.1 Decorrelation
In marketing applications, the data typically correspond to the ob-
served inputs, properties, and outputs of some real-world marketing
process. Examples of such processes include marketing campaigns, in-
teractions between customers and products, and the interplay of price
and demand, among many others. Each feature in the design matrix
can be viewed as a signal that carries the information about the pro-
cess. We do not have complete knowledge of the process and observe
only certain projections of the process on the feature dimensions that
are available in the input data, just as a physical object can be pho-
tographed by cameras from different perspectives. For example, one
does not observe consumer tastes and thoughts directly but captures
certain signals, like purchases, that partially reflect the tastes, thoughts,
and decisions. Representations obtained this way are likely to have
some redundancy, and dimensions are likely to be correlated, just as
images of the same object from different perspectives are redundant
and correlated. This idea is illustrated in Figure 2.7.
Figure 2.7: Different data dimensions can be correlated because they are projec-
tions of the same real-life process.
Z = X\,T    (2.64)

V^T V = I    (2.67)

Unit vectors v capture the directions of the variance but not its magnitude.
Let us calculate these values separately and denote them as

\sigma_i = \lVert X v_i \rVert    (2.68)
Figure 2.8: Example of principal component analysis. (a) A data set of 500 nor-
mally distributed points and the corresponding principal compo-
nents. Features x1 and x2 are strongly correlated. (b) A decorrelated
representation obtained by using PCA. Features z1 and z2 are not
correlated.
\sigma_1 \geqslant \sigma_2 \geqslant \ldots \geqslant \sigma_r    (2.69)

\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)    (2.70)

X = U \Sigma V^T    (2.71)

U = X V \Sigma^{-1}    (2.72)
U^T U = I    (2.73)
Once the SVD of the design matrix is obtained, we can use the factors
to find the decorrelating linear transformation T. Let us consider the
product
Z = X V    (2.74)

and calculate its covariance matrix by using the fact that matrices V
and U are orthogonal:

C_Z = \frac{1}{n-1}\, Z^T Z
    = \frac{1}{n-1}\, V^T X^T X V
    = \frac{1}{n-1}\, V^T V \Sigma^2 V^T V
    = \frac{1}{n-1}\, \Sigma^2    (2.75)
As Σ2 is diagonal, the representation Z is uncorrelated. This means
that the decorrelating transformation we were looking for is actually
given by matrix V. This transformation is effectively a rotation because
the matrix of principal components is orthonormal. Note that the as-
sumption about the linearity of the transformation is quite restrictive.
In the example provided in Figure 2.8, it works well because the data
set has an elliptical shape, which is the result of the normal distribution;
therefore, the correlation between features can be removed by simple
rotation. This may not be the case for data sets with more complex
shapes.
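A short NumPy sketch of the decorrelation procedure discussed above: compute the SVD of a centered design matrix and rotate the data with the matrix of principal components V. The synthetic correlated data set is an illustrative assumption in the spirit of Figure 2.8:

import numpy as np

# Correlated two-dimensional data, similar in spirit to Figure 2.8 (illustrative)
rng = np.random.default_rng(4)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=500)
X = np.c_[x1, x2]
X = X - X.mean(axis=0)             # center the design matrix

# SVD of the design matrix: X = U Sigma V^T  (equation 2.71)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Decorrelating rotation: Z = X V  (equation 2.74)
Z = X @ Vt.T

print(np.cov(X, rowvar=False).round(3))  # off-diagonal elements are large
print(np.cov(Z, rowvar=False).round(3))  # off-diagonal elements are ~0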
Z_k = X V_k    (2.76)

\min_{A}\ \lVert X - A \rVert \quad \text{subject to} \quad \mathrm{rank}(A) \leqslant k    (2.78)

\hat{x}_{ij} = p_i^T q_j    (2.79)
in which the first entity (e. g., the customer) is somehow mapped to
numerical vector p and the second entity (e. g., the product) to numerical
vector q. The length k of these vectors is typically chosen to be much
smaller than the size of the design matrix. By rewriting this expression
in matrix form as

\hat{X} = P\, Q^T    (2.80)

we can see that the vectors p and q, which minimize the average affinity
approximation error, can actually be obtained from the low-rank
approximation expression 2.77 as

P = U_k \Sigma_k, \qquad Q = V_k    (2.81)
This result is very useful because it helps to convert sparse and re-
dundant representations of entities into compact and dense numerical
vectors. It is a very powerful and generic modeling technique, which
will be extensively used in the following chapters.
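As a hedged sketch of this technique, the snippet below factorizes a small customer-by-product matrix with a truncated SVD and reconstructs it from the dense factor vectors P and Q; the interaction counts are purely illustrative assumptions:

import numpy as np

# A small customer-by-product interaction matrix (illustrative counts)
X = np.array([
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [0, 0, 4, 5],
], dtype=float)

k = 2  # number of latent dimensions
U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = U[:, :k] * s[:k]   # customer factors: P = U_k Sigma_k  (equation 2.81)
Q = Vt[:k, :].T        # product factors:  Q = V_k

X_hat = P @ Q.T        # low-rank approximation X_hat = P Q^T  (equation 2.80)
print(np.round(X_hat, 2))  # close to X, but expressed via dense k-dimensional vectors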
2.5.2 Clustering
fit a mixture of distributions that is likely (in the sense of the maximum
likelihood principle) to generate the observed data set. This approach
is illustrated in Figure 2.10, where the clusters are determined by fitting
the mixture of three normal distributions on the data.
Figure 2.10: Example of clustering by using the mixture modeling approach. (a)
Input data set. (b) Three clusters found by fitting a mixture of three
normal distributions.
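A minimal sketch of the mixture-modeling approach with scikit-learn's GaussianMixture; the three synthetic point clouds are illustrative assumptions that mimic the setting of Figure 2.10:

import numpy as np
from sklearn.mixture import GaussianMixture

# Three point clouds drawn from different normal distributions (illustrative)
rng = np.random.default_rng(5)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3, 3], scale=0.7, size=(100, 2)),
    rng.normal(loc=[0, 4], scale=0.6, size=(100, 2)),
])

# Fit a mixture of three Gaussians by maximum likelihood (EM algorithm)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
labels = gmm.predict(X)                   # hard cluster assignments
print(gmm.means_.round(2))                # recovered component centers
print(gmm.predict_proba(X[:3]).round(2))  # soft (probabilistic) memberships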
The hidden factors h_nj are known to the decision-maker but not to
the model creator, so the utility model V_nj = V(x_nj) approximates the
true utility Y_nj with some error ε_nj that can be considered as a random
variable:
As Pnj can typically be estimated from the known statistics for alter-
natives that were available in the past, the parameters w can also be
estimated and a prediction model for Pnj can be built. This model can
be used to evaluate new alternatives specified in terms of properties x,
and, consequently, economic metrics like demand for a new product
can be predicted. In the next section, we discuss one of the most sim-
ple yet powerful and practical models, multinomial logit, to show how
computationally tractable expressions for Pnj can be obtained. This
model will be used in the subsequent chapters as a component of more
complex models that are created for specific marketing problems.
By assuming for a moment that ε_ni is given and by using the independence
of errors, we can state that

P_{ni} \mid \varepsilon_{ni} = \prod_{j \neq i} \Pr\left( \varepsilon_{nj} < \varepsilon_{ni} + V_{ni} - V_{nj} \right)    (2.92)
because the utilities are equal for both buses. A more realistic
assumption, however, is that the ratio of the car and bus choice
probabilities will remain constant no matter how many identical buses
are offered, which produces the probabilities P_car = 1/2 and
P_red bus = P_blue bus = 1/4.

in which ε̃_ni = w_3 g_n + ε_ni represents the errors that are
not independent because of the random variable g_n.
Figure 2.11: Probability of choice as a function of the utility and its derivatives
at different points.
\prod_{i} \left( P_{ni} \right)^{y_{ni}}    (2.101)

The probability that all decision-makers in the data set made their
choices as we actually observe, that is, the likelihood of the data set,
can then be expressed as

L(w) = \prod_{n=1}^{N} \prod_{i} \left( P_{ni} \right)^{y_{ni}}    (2.102)
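To make the estimation procedure concrete, the sketch below computes multinomial logit choice probabilities from utilities and evaluates the log of the likelihood 2.102 for a set of observed choices; the utility values and choices are illustrative assumptions:

import numpy as np

def choice_probabilities(V):
    """Multinomial logit probabilities P_ni = exp(V_ni) / sum_j exp(V_nj)."""
    e = np.exp(V - V.max(axis=1, keepdims=True))  # numerically stabilized softmax
    return e / e.sum(axis=1, keepdims=True)

# Utilities of 3 alternatives for 4 decision-makers, V_nj = w^T x_nj
# (the utility values below are purely illustrative)
V = np.array([
    [1.0, 0.2, -0.5],
    [0.1, 0.8,  0.3],
    [0.0, 0.0,  1.2],
    [0.5, 0.4,  0.1],
])
P = choice_probabilities(V)

# Observed choices as one-hot indicators y_ni
y = np.eye(3)[[0, 1, 2, 0]]

# Log of the likelihood of the observed data set, equation 2.102
log_likelihood = np.sum(y * np.log(P))
print(P.round(3))
print(round(log_likelihood, 3))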
Classification methods, even the most basic ones such as logistic re-
gressions, provide a powerful toolkit for estimating the probabilities
of consumer actions. For example, the response probability for a pro-
motional email can be estimated by building a model that uses cus-
tomer attributes, such as the number of purchases, as features and a
binary variable that indicates whether a customer responded to the pre-
vious promotional email as a response label. Although this approach
is widely used in practice, as we will discuss in detail in subsequent
chapters, it has a few shortcomings. First, in many marketing applica-
tions, it is more convenient and efficient to estimate the time until an
event, instead of the event probability. For example, it can be more use-
ful for a marketing system to estimate the time until the next purchase
or time until subscription cancellation, rather than the probabilities of
these events. Second, marketing data very often include records with
unknown or missed outcomes, which cannot be properly accounted
for in classification models. Going back to the example with the sub-
scription cancellation, it is often impossible to distinguish between cus-
tomers who have not defected and those who have not defected yet
because we build a predictive model at a certain point in time and
cannot wait indefinitely until the final outcomes for all customers are
observed. Consequently, we only know the outcomes for customers
who have defected and can certainly label them as negative samples;
the remaining records are incomplete because those customers may or
may not defect in the future, so one can argue that labeling these
samples as positives or negatives is not really valid.
it is not accurate to use a classification model with a binary outcome
variable determined on the basis of currently observed outcomes and
Figure 2.12: Data preparation for survival analysis. The filled circles correspond
to treatments. The crosses correspond to events. The empty circles
denote censored records.
S_t = \frac{n_t - d_t}{n_t} = 1 - \frac{d_t}{n_t}    (2.108)

\hat{S}(t) = \prod_{i \leqslant t} \left( 1 - \frac{d_i}{n_i} \right)    (2.109)
example 2.1
different times and the time of the first purchase after an email has
been recorded. The observed data set looks like this:
in which the i-th element of set t is the observed event time for the i-th
customer measured in days since the email was sent. Set δ contains the
indicators for whether each observation is censored (0) or not censored
(1). For example, the first customer made a purchase on the second day
after the email, and the third customer did not make a purchase by the
time of the analysis although she got the email three days before the
analysis date. In this context, the probability to survive means the prob-
ability of not having made a purchase at a given time. By repeatedly
applying formula 2.109, we obtain the following sequence:
Figure 2.13: An estimate of the survival function for the data set given in defini-
tion 2.110.
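A small sketch of the Kaplan–Meier estimate 2.109 implemented directly in NumPy; the event times and censoring flags below are illustrative assumptions rather than the data set of Example 2.1:

import numpy as np

def kaplan_meier(times, observed):
    """Kaplan-Meier estimate S_hat(t) = prod_{i<=t} (1 - d_i/n_i), equation 2.109."""
    order = np.argsort(times)
    times, observed = np.asarray(times)[order], np.asarray(observed)[order]
    survival, s = {}, 1.0
    for t in np.unique(times):
        n_t = np.sum(times >= t)                      # customers still at risk at t
        d_t = np.sum((times == t) & (observed == 1))  # events (purchases) at t
        s *= 1.0 - d_t / n_t
        survival[t] = s
    return survival

# Event times in days and censoring indicators (1 = purchase observed,
# 0 = censored); the values are illustrative, not the book's data set
times = [2, 4, 3, 5, 7, 7]
observed = [1, 1, 0, 1, 0, 1]
for t, s in kaplan_meier(times, observed).items():
    print(f"day {t}: S(t) = {s:.3f}")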
h(t) = -\frac{d}{dt} \log S(t)

in which

H(t) = \int_0^t h(\tau)\, d\tau = -\log S(t)    (2.117)
in which h_0(t) is the baseline hazard, r is the risk ratio that increases
or decreases the baseline hazard depending on the factors, and w is
the vector of model parameters. Note that the baseline hazard does
not depend on the individual, but the risk ratio does. In other words,
the risk ratio determines how the properties of an individual encoded
in the feature vector influence the hazard rate. The risk ratio cannot
be negative because the hazard rate is not negative, so it is typically
modeled as an exponential function:

h(t \mid w, x) = h_0(t)\, \exp\left( w^T x \right)    (2.120)

This model can be interpreted as a linear model for the log of the
risk ratio for an individual to the baseline:

\log r(w, x) = \log \frac{h(t \mid x)}{h_0(t)} = w^T x    (2.121)
Our next step is to estimate the parameters of the Cox model from
the data. The standard approach to this problem is to derive the like-
lihood of the model and then find the parameters that maximize it.
The challenge, however, is that observations can be censored, which
requires us to specify how such records will be accounted for in the
likelihood. First, let us note that each observation contributes to the
likelihood. If the i-th observation is censored, it contributes the probability
of survival up until t_i:

L(w) = \prod_{i=1}^{k} h(t_i \mid w, x_i)^{\delta_i}\, S(t_i \mid w, x_i)    (2.124)

R(t) = \{\, t_i : t_i \geqslant t \,\}    (2.125)
For simplicity, let us also assume that there are no event ties, that is,
all event times ti are distinct1 . In this case, the partial likelihood can be
defined by using the conditional probability that a particular person i
will fail at time ti , given the risk set at that time, and that exactly one
failure is going to happen [Cox, 1972, 1975]. This probability is given
by the area under the hazard curve for a small time interval dt, so the
likelihood contributed by individual i can be expressed as
L_i(w) = \frac{h(t_i \mid w, x_i)\, dt}{\sum_{j \in R(t_i)} h(t_i \mid w, x_j)\, dt}    (2.126)
1 The case with ties is more complex, but there exist a number of generalizations that account
for ties [Breslow, 1974; Efron, 1977]. In most marketing applications, we can avoid ties by
spreading the conflicting observations by a small margin.
Finally, the partial likelihood for the entire training data set is a prod-
uct of the individual partial likelihoods given by equation 2.127:
L(w) = \prod_{i=1}^{k} \left[ \frac{\exp\left( w^T x_i \right)}{\sum_{j \in R(t_i)} \exp\left( w^T x_j \right)} \right]^{\delta_i}    (2.128)

\hat{h}_0(t_i)\, dt = \frac{d_i}{\sum_{j \in R(t_i)} \exp\left( w^T x_j \right)}    (2.130)

S(t \mid x) = S_0(t)^{\exp(w^T x)}
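As a hedged illustration of how the parameters of the Cox model can be estimated, the sketch below minimizes the negative log of the partial likelihood 2.128 with SciPy on synthetic data; the covariate effect, censoring scheme, and sample size are illustrative assumptions:

import numpy as np
from scipy.optimize import minimize

def neg_log_partial_likelihood(w, X, times, events):
    """Negative log of the Cox partial likelihood, equation 2.128
    (assumes no tied event times)."""
    scores = X @ np.asarray(w)
    ll = 0.0
    for i in np.where(events == 1)[0]:
        risk_set = times >= times[i]   # R(t_i), equation 2.125
        ll += scores[i] - np.log(np.sum(np.exp(scores[risk_set])))
    return -ll

# Small synthetic data set: one covariate that accelerates the event
# (all numbers are illustrative)
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 1))
raw_times = rng.exponential(scale=np.exp(-0.8 * X[:, 0]))  # higher x -> earlier event

# Administrative censoring at a fixed cutoff time
censor_time = np.quantile(raw_times, 0.8)
events = (raw_times <= censor_time).astype(int)
times = np.minimum(raw_times, censor_time)

res = minimize(neg_log_partial_likelihood, x0=np.zeros(1), args=(X, times, events))
print(res.x)  # estimated w, expected to be roughly 0.8 for this synthetic setup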
the value is, information about other bids can help to improve the
estimate. For example, a bidder who highly values the resource
might reduce their bid if other participants bid lower, because
this additional information can indicate the presence of negative
factors that are unknown to the given bidder but are somehow
recognized by the others.
auctions. The Dutch auction ends right after the first bid, so bidders do
not receive any additional information during the process and can de-
cide on a bid in advance. Consequently, a Dutch auction is equivalent
to a first-price sealed-bid auction in the sense that, whatever strategy
the bidder chooses, it uses the same inputs and leads to the same win-
ners and prices, both for private and interdependent values. In an En-
glish auction with private values, the bidder can also evaluate the item
in advance. As the auction progresses and the price goes up, the bid-
der should always compare the current highest bid with the estimated
value and either make a new bid, calculated as the current highest plus
some small increment, or quit the auction if the price has gone above
his valuation. Hence, an English auction is equivalent to a second-price
sealed bid auction for private values, although this is not true for inter-
dependent values because the bidder can learn from the observed bids
in the case of an English auction.
We now study the Vickrey auction in detail to obtain a toolkit for
building optimization models that include auctions. We focus on the
Vickrey auction because it is convenient for analysis and widely used
in practical applications, although similar results can be obtained for
other auction types by using more advanced analysis methods.
First, we can prove that the optimal strategy for bidders is to bid
their true value. Consider Figure 2.14, in which the bidder evaluates
the item at a price v but makes a lower bid v − δ. If the second highest
bid from another bidder is p, then the following three outcomes are
possible:

1. p > v: the bidder loses; it does not matter if he bids v or v − δ
So, a bid below the true value always gives the same or a worse result
as a bid of the true value. Figure 2.15 depicts the opposite situation,
when the bid is above the true value. We again have three possible
outcomes:
2. p < v: the bidder wins and pays the price p; it does not matter if
he bids v or v + δ
We can conclude that bidding the true value is the optimal strategy.
This simple result implies that we need to focus on estimation of the
expected revenue for the bidder when discussing marketing settings
with sealed-bid auctions.
Our next step will be to take the seller’s perspective of the auction
and estimate the revenue for the seller. Let us assume that there are n
bidders participating in the auction and their bid prices V1 , . . . , Vn are
independent and identically distributed random values drawn from
some distribution Fpvq with the probability density fpvq. By recalling
that k-th order statistic Vpkq of a sample is equal to its k-th smallest
value, we can express the expected revenue as the mean of the second-
highest order statistic that corresponds to the second-highest bid:
\text{revenue} = \mathrm{E}\left[ V_{(n-1)} \right]    (2.134)

This is the probability that k − 1 bids in the sample of n bids are smaller
than v, exactly one bid falls into the range [v, v + dv], and the remaining
n − k bids are higher than v. These three conditions can be expressed
by using the bid cumulative distribution F(v) and probability
density f(v), so we get the following expression for the order statistics
probability density:

f_{V_{(k)}}(v) = \lim_{dv \to 0} \frac{1}{dv} \Pr\left( v \leqslant V_{(k)} \leqslant v + dv \right)
             = \binom{n}{k-1} \left[ F(v) \right]^{k-1} (n - k + 1)\, f(v)\, \left[ 1 - F(v) \right]^{n-k}    (2.136)

\mathrm{E}\left[ V_{(n-1)} \right] = \frac{n-1}{n+1}    (2.138)
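The closed-form result can be checked with a quick Monte Carlo simulation: draw independent uniform valuations, let everyone bid truthfully, and average the second-highest bid. The number of bidders and auctions below are illustrative assumptions:

import numpy as np

# Monte Carlo check of the expected seller revenue in a Vickrey auction
# with n bidders whose values are i.i.d. uniform on [0, 1]
rng = np.random.default_rng(7)
n, n_auctions = 5, 100_000

bids = rng.uniform(size=(n_auctions, n))        # truthful bidding is optimal
second_highest = np.sort(bids, axis=1)[:, -2]   # the winner pays the second-highest bid

print(second_highest.mean())   # simulated revenue
print((n - 1) / (n + 1))       # closed-form E[V_(n-1)] = (n-1)/(n+1)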
2.7 summary
Every product or service has its target market, the group of consumers
at which the product or service is aimed. The distinction between the
target and non-target groups is often fuzzy and ambiguous because
consumers differ in their income, buying behavior, loyalty to a brand,
and many other properties. The diversity of customers is often so high
that an offering created for an average consumer, that is, for everyone,
does not really fit anyone’s needs. This makes it critically important
for businesses to identify the most relevant consumers and tailor their
offerings on the basis of consumer properties. This problem occurs in
virtually all marketing applications, and it plays an especially impor-
tant role in advertisements and promotions because the efficiency of
these services directly depends on the ability to identify the right audi-
ence and convey the right message.
The problem of finding the optimal match between consumers and
offerings can generally be viewed from two perspectives. First, it can be
stated as finding the right offerings for a given customer. This is a prod-
uct discovery problem, which we will discuss in the following chapters
dedicated to search and recommendations. The second perspective is
that of finding the right customers for a given offering. This problem is
known as targeting, and it is the main subject of this chapter. It should
be kept in mind, however, that we draw a line between product discov-
ery and targeting services mainly based on the principal applications
(interactive browsing versus advertising), and the methodologies used
to implement the services can sometimes be viewed from both per-
spectives. Consider customer base segmentation as an example. One
can argue that segmentation identifies the right groups of customers
first and the offerings and experiences are then tailored for each seg-
ment. However, it is also true that segmentation can be viewed as a
method for allocating different offerings and experiences to the most
appropriate customers.
3.1 environment
3.2.2 Costs
3.2.3 Gains
The gains associated with a campaign can be viewed from several per-
spectives. The most straightforward element is the increase in sales vol-
ume. Both manufacturer-sponsored and retailer-sponsored campaigns
incentivize consumers to make purchases, at the expense of the cam-
paign costs, so the basic equation that describes the campaign gain will
be as follows:
\text{profit} = Q\, (P - V) - C    (3.1)

Q_c\, (P - V) - C > Q_0\, (P - V)    (3.2)
Consumers who are at the first stage need to be acquired and con-
verted into customers by using marketing actions that are specifically
designed for this purpose. Customers at the second stage need to be
treated with incentives that are focused on maximization and growth
of consumption. Finally, the customers who are about to churn need
to be identified in a timely fashion and retained. These three objectives
– acquisition, maximization, and retention – are a very popular coordi-
nate system in marketing that can be used to orient individual cam-
paigns and structure campaign portfolios. A brand should be able to
distinguish consumers who belong to different phases, and this lays
the foundation for the targeting process. As we will see later, each of
these objectives can be mapped to a predictive model in a relatively
straightforward way, so this set of targeting objectives is well suited for
programmatics.
Once the environment and business objectives are defined, we can dis-
cuss how a programmatic system can approach the problem of target-
Figure 3.4: Incremental impact of a marketing action. The upper life-cycle curve
corresponds to the aftermath of the action and the lower curve cor-
responds to the result with a no-action strategy.
The pipeline starts with the available marketing budget that can be
allocated for different marketing activities. The first step of the process
is to determine how the budget should be distributed across the possi-
ble activities: what are the main objectives and how are these objectives
\mathrm{E}\left[ G \mid u, T \right] > C    (3.6)
Figure 3.8: Measuring promotion effectiveness with test and control groups.
need to compare the expenditures between the two groups. This can
be a very convenient property if the redemption data is not available.
We will return to the statistical details of measurements at the end of
this chapter.
Targeting models and lifetime value (LTV) models are the basic build-
ing blocks of the targeting process. The purpose of a targeting model
is to quantify the fitness of a given consumer for a given business ob-
jective in a given context. For example, a model can score the fitness
of a consumer for a potato chips promotional campaign given that the
promotion will be sent tomorrow by SMS. Models can be created for
different objectives and contexts, and a targeting system often main-
tains a repository of models attributed with the corresponding meta-
data, so that relevant models can be fetched according to the criteria.
For example, one can have a model for acquisition campaigns in the
potato chips category and another one for maximization campaigns in
the soda drinks category. Models are the basic primitives that can be
combined with each other, as well as with other building blocks, to cre-
ate more complex programmatic flows. Marketing campaigns can be
assembled from models, and marketing portfolios can be assembled
from campaigns.
It should be stressed that a programmatic system can use models
both in predictive and prescriptive ways. The most direct application is
prediction of consumer properties, such as propensity to respond to
an email or expected lifetime profit. Many models, however, express
the dependency between inputs and outputs in a relatively transpar-
ent way, so the system can contain additional logic that uses this pre-
scriptive insight or can at least make some recommendations to the
marketer. For instance, the parameters of a regression model that pre-
dicts a response can indicate a positive or negative correlation with
specific communication channels or other parameters, and this insight
can be used to make additional adjustments, such as limiting the num-
ber of communications in the presence of a negative correlation. In this
section, we consider three major categories of models that can be used
separately or together:
propensity models The idea of propensity models is to estimate
the probability that a consumer will take a certain action, such as a
purchase of a certain product. The output of such models is a score
proportional to this probability that can be used to make targeting
decisions.
lifetime value models LTV models are used to quantify the value
of a customer and estimate the impact of marketing actions.
We start with a review of the data elements and data sources
used in modeling and then discuss several traditional methods that
can be viewed as heuristic propensity and LTV estimation models.
These methods typically make the assumption that the probability
of responding and the value of a customer are proportional to one
or several basic characteristics, such as the frequency of purchases.
The methods then group customers into segments so that an entire
segment can be included or excluded from a certain campaign. These
methods can be thought of as rule-based targeting. We then develop
more advanced models with statistical methods.
Data collection and preparation is one of the most important and chal-
lenging phases of modeling. Although a detailed discussion of data
preparation methodologies is beyond the scope of this book, it is worth
reviewing a few principles that can help to streamline the process and
avoid incorrect modeling. Targeting and LTV models generally aim
to predict consumer behavior as a function of observed metrics and
properties, so it is important to collect and use data in way that is con-
sistent with causal dependencies. From this perspective, data elements
can be arranged by tiers, with each tier depending on the previous
ones [Grigsby, 2016]:
primary motivations Consumer behavior is driven by fundamen-
tal factors such as valuation of a product or service, tastes, needs,
lifestyle, and preferences. Many such attributes cannot be ob-
served directly, but some data such as demographics or market-
ing channel preferences can be collected through loyalty program
registration forms and surveys or purchased from third-party
data providers.
Each tier is attributed with the metrics, such as the average expected
response rate and average spending per customer, estimated based on
the historical data. For each promotion, the optimal subset of tiers
can be determined by running the promotion costs and tier metrics
through the response modeling framework. For example, it can be de-
termined that one campaign will be profitable if only the gold tier is
targeted and another campaign has maximum profit when the gold
and silver tiers are both targeted.
Single-metric segmentation can be elaborated by adding more met-
rics into the mix. As we discussed earlier, the consumer life cycle is an
important consideration for the design of promotional campaigns, so
the ability to target individual life-cycle phases is important. The life-
cycle phases are characterized by both the total spending in a category
and the loyalty to the brand, that is, the relative spending on a brand
compared with that on other brands, so we can classify customers into
segments by using these two metrics, as shown in Figure 3.10. This ap-
proach, known as loyalty–monetary segmentation, is used in traditional
manufacturer-sponsored campaigns.
Customers who are highly loyal to the brand and spend a lot of
money in the category are clearly the most valuable customers, who
should be rewarded and retained. Customers who spend a lot in the
category but are not loyal to the given brand are the best candidates for
trial offers and so on. Similarly to the tiered segmentation, promotions
can be assigned to the optimal subset of segments by starting from the
upper left corner of the grid in Figure 3.10 and evaluating the potential
outcome of including additional segments until the bottom right corner
is reached. This approach can be considered as a simplistic targeting
method that predicts consumer value based on two metrics – spending
in the category and brand share of wallet. These are very coarse criteria
that can be improved by using predictive modeling methods.
monetary The total dollar amount spent per time unit. The mone-
tary metric is typically measured by using intervals or scores.
It is quite typical to use the same discrete scoring scale, say from 1
to 5, for all three metrics. In this case, the RFM model can be consid-
ered as a three-dimensional cube made up of cells, each of which is
determined by a triplet of metric values and corresponds to customer
the outcome period is used to generate the response label. These two in-
tervals may or may not be separated by a buffer. The buffer can be used
if one needs to predict events in the relatively distant future, instead of
predicting immediate events. For example, a model that predicts cus-
tomer churn should probably be trained with the outcome intervals
shifted into the future – it would be impractical to predict customers
who are likely to churn immediately because it gives no time to per-
form any mitigating marketing action.
variables. The curved line that connects the boxes in Figure 3.12, for
example, corresponds to the share of the bakery category in consumer
spending over the last 6 months relative to other categories. This ap-
proach allows the production of a relatively large number of features
that can be used in predictive model training and evaluation. The same
approach can be used for marketing response data and data from digi-
tal channels.
example 3.1
The Bakery Total and Dairy Total features are the total spending in
the corresponding categories during the observation period. The Bak-
ery Weekend and Dairy Weekend features are the total spending on
the weekend, so the amount spent on workdays equals the difference
between the Total and Weekend values in each category. The Credit col-
umn is the payment method, credit or cash. Finally, the Response vari-
able indicates whether the customer started to buy the dessert or not.
By inspecting this tiny data set visually, we can conclude that the natu-
ral triers of the dessert are mainly the customers who buy a lot of bak-
ery on the weekends and a lot of dairy on workdays. We choose here
to use logistic regression to build a look-alike model, although other
options including decision trees, random forests, and Naive Bayes are
often used in practice as well. By fitting the logistic regression, we get
the parameter estimates presented in table 3.2.
Note that bakery spending is positively correlated with the propen-
sity to try the product, whereas dairy spending is negatively corre-
Parameter Estimate
Bakery Total 0.0012
Bakery Weekend 0.0199
Dairy Total – 0.0043
Dairy Weekend – 0.0089
Credit – 0.4015
Table 3.2: Logistic function parameters for the training set in table 3.1.
lated1 . By evaluating the model for six profiles with different propor-
tions of bakery and dairy spending, we get the propensity score esti-
mates shown in table 3.3. One can see that only customers with high
spending on bakery and low spending on dairy have a high propensity
to try the product, regardless of the payment method. In real life, one
possible interpretation could be that dairy desserts are considered as
substitutes for bakery desserts by customers who actively buy in both
categories.
1 See Chapter 2 for a detailed discussion of logistic regression. In this example, we skip
typical steps, such as model validation and diagnostics, for the sake of simplicity.
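The mechanics of this example can be sketched in a few lines of Python. The following is a minimal illustration, not a reproduction of tables 3.1–3.3: it fits a logistic regression look-alike model with scikit-learn on a handful of made-up customer records and scores a new profile.

# Minimal look-alike model sketch: fit a logistic regression on spending
# features and score a new profile. All numbers are illustrative and do not
# reproduce tables 3.1-3.3.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: Bakery Total, Bakery Weekend, Dairy Total, Dairy Weekend, Credit
X = np.array([
    [120.0,  80.0,  40.0,  5.0, 1],   # heavy weekend bakery buyer
    [200.0, 150.0,  20.0,  0.0, 0],
    [ 30.0,   5.0, 180.0, 90.0, 1],   # heavy dairy buyer
    [ 50.0,  10.0, 150.0, 60.0, 0],
    [160.0, 120.0,  60.0, 10.0, 0],
    [ 20.0,   2.0, 120.0, 70.0, 1],
])
y = np.array([1, 1, 0, 0, 1, 0])      # 1 = the customer started to buy the dessert

model = LogisticRegression(max_iter=1000).fit(X, y)

# Score a new profile: high weekend bakery spending, low dairy spending
profile = np.array([[140.0, 100.0, 30.0, 5.0, 0]])
print("propensity score:", model.predict_proba(profile)[0, 1])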
Pr(R | T, x)        (3.12)
Figure 3.13: Measurement groups for propensity modeling with uplift [Kane
et al., 2014].
in which I(T) is the indicator function and is equal to one if the cus-
tomer x has been treated and zero otherwise [Lo, 2002]. Consequently,
the uplift is estimated as
quadrants in Figure 3.13 [Kane et al., 2014]. We can create such a model
to express the uplift as follows:
uplift(x) = Pr(TR | x)/Pr(T | x) + Pr(CN | x)/Pr(C | x) − 1
          = Pr(TR | x)/Pr(T) + Pr(CN | x)/Pr(C) − 1        (3.17)
By repeating the same transformations for the response probability
in the test group, we can also express the uplift as follows:
2 · uplift(x) = Pr(TR | x)/Pr(T) + Pr(CN | x)/Pr(C)
              − Pr(TN | x)/Pr(T) − Pr(CR | x)/Pr(C)        (3.19)
in which the probabilities in the numerators are estimated by using
a single regression model. The uplift score can often be used as an
alternative to the response probability estimated by basic propensity
models – as we will see later, a targeting system can optimize the cam-
paign ROI by selecting the recipients from those individuals with the
highest uplift score as opposed to propensity scores.
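To make the single-model approach of expression 3.17 concrete, the following Python sketch classifies customers into the four groups TR, TN, CR, and CN and combines the predicted class probabilities into an uplift score. It assumes a randomized treatment assignment and uses synthetic data, so it illustrates the transformation rather than any particular campaign.

# Single-model uplift sketch (expression 3.17): classify customers into the
# four groups TR, TN, CR, CN and combine the class probabilities.
# The data below is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=(n, 3))                      # customer features
treated = rng.integers(0, 2, size=n)             # randomized treatment flag
# Synthetic response: the treatment helps customers with a high first feature
p_resp = 1 / (1 + np.exp(-(x[:, 0] * treated - 1.0)))
response = rng.random(n) < p_resp

# Encode the four groups: 0=TR, 1=TN, 2=CR, 3=CN
group = np.where(treated == 1,
                 np.where(response, 0, 1),
                 np.where(response, 2, 3))

clf = LogisticRegression(max_iter=1000).fit(x, group)
p = clf.predict_proba(x)                         # columns follow clf.classes_
pr_t, pr_c = treated.mean(), 1 - treated.mean()

cols = {c: i for i, c in enumerate(clf.classes_)}
uplift = p[:, cols[0]] / pr_t + p[:, cols[3]] / pr_c - 1
print("mean uplift score:", uplift.mean())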
Table 3.4: Example of segments and segment metrics. Each segment can be inter-
preted in psychographic and behavioral terms. Convenience seekers,
for instance, seem to be less price sensitive and have fewer children
than consumers in other segments. This segment contains a relatively
small number of customers but makes a high contribution to the rev-
enue.
example 3.2
ID | Discount, % | Previous Purchase | Number of Emails | Purchase Time
1 0 2 5 5
2 0 2 0 10
3 0 3 0 20 (censored)
4 1 1 0 6
5 1 2 10 2
6 1 3 0 15
7 1 4 0 20 (censored)
8 1 5 5 6
9 0 2 10 8
10 1 5 5 13
11 0 0 0 20 (censored)
12 1 2 5 8
Table 3.5: Training data set for survival analysis. The censored records corre-
spond to customers who did not make a purchase during the first 20
days after the campaign announcement.
Figure 3.14: Survival curves for different numbers of emails. The purchase indi-
cator and discount depth are equal to zero for all curves.
Figure 3.15: Survival curves for different discount depths. The purchase indica-
tor and numbers of emails are equal to zero for all curves.
LTV(u) = Σ_{t=1}^{T} (R − C) · r^{t−1} / (1 + d)^{t−1}        (3.23)
we can model the net profit m not simply as a constant value R − C but
as a value that gradually increases over time as the relationship with
the consumer matures:
m_t = m_0 + (m_M − m_0) · (1 − e^{−kt})        (3.24)
example 3.3
Table 3.6: Example of the LTV calculation for a horizon of five years.
Figure 3.16: Example of a Markov chain for LTV modeling. The white circles
correspond to three different values of recency. The black circle rep-
resents the defunct state.
Each row of the matrix corresponds to the current state and each
column corresponds to the next state. Each element of the matrix is the
probability of a customer moving from the current state to the next one.
The probability that a customer who is currently in state s will end up
in the state q after t months can then be calculated as the (s, q) element
of the matrix P^t, according to the standard properties of the Markov
chain. This gives us a simple way to estimate the customer journey in
probabilistic terms if the current state is known.
From an economic standpoint, each state of the chain corresponds to
profits and costs. For example, the marketing strategy may be to spend
some budget C on each active customer (e. g., send a printed catalog)
and stop doing so after the customer moves into the defunct state. The
first state is also associated with the revenue R of the purchase. Let
us introduce a column vector G so that the net profit of the i-th state
corresponds to its i-th element:
G = (R − C, −C, −C, 0)^T        (3.26)
with column vector V containing the LTV estimates for each initial
state. The LTV of a customer is also estimated as one of the elements
of this vector based on the current customer states, that is, the recency
value in this example. This result can be compared to the standard
descriptive LTV model in equation 3.23 – we essentially replace the
static net profit and retention rate parameters with a time-dependent
probabilistic estimate.
Let us conclude the example by estimating the LTV for several dif-
ferent values of time horizon T . As before, we assume the transition
probabilities of p1 = 0.8, p2 = 0.4, and p3 = 0.1. Let us also assume
that the expected revenue of one purchase is R = $100, monthly cost
of marketing communications is C = $5, and monthly discount rate
is d = 0.001. By evaluating expression 3.28 for these parameters and
different values of the time horizon T , we get the following sequence
of LTV vectors:
V_{T=1} = ($75.0, $35.0, $9.5, $0.0)^T,
V_{T=2} = ($135.5, $48.4, $10.4, $0.0)^T,        (3.29)
V_{T=3} = ($184.0, $53.4, $10.5, $0.0)^T
We can see that the LTV heavily depends on the initial customer
state. For a time horizon of three months, that is, V_{T=3}, the LTV of
a customer who made a purchase a month ago is $184.0. After two
months, the expected LTV drops to $53.4, and finally to $10.5 after the
third month.
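The Markov chain calculation lends itself to a compact implementation. The sketch below encodes the recency chain of Figure 3.16 with the purchase probabilities p1, p2, p3 and the profit vector G of equation 3.26; the discounting convention is a simplifying assumption, so the output is indicative rather than an exact reproduction of equation 3.29.

# Markov chain LTV sketch: states are (recency 1, recency 2, recency 3, defunct).
# P encodes the purchase probabilities p1, p2, p3 from the example; G is the
# per-state net profit vector of equation 3.26. The discounting convention is
# a simplifying assumption, so the output is indicative only.
import numpy as np

p1, p2, p3 = 0.8, 0.4, 0.1
R, C, d = 100.0, 5.0, 0.001

P = np.array([
    [p1,  1 - p1, 0.0,    0.0],   # purchase -> back to recency 1, else age
    [p2,  0.0,    1 - p2, 0.0],
    [p3,  0.0,    0.0,    1 - p3],
    [0.0, 0.0,    0.0,    1.0],   # defunct is absorbing
])
G = np.array([R - C, -C, -C, 0.0])

def ltv(horizon):
    """Expected discounted profit per starting state over `horizon` months."""
    v = np.zeros(4)
    Pt = np.eye(4)
    for t in range(1, horizon + 1):
        Pt = Pt @ P
        v += (Pt @ G) / (1 + d) ** t
    return v

for T in (1, 2, 3):
    print(f"T={T}:", np.round(ltv(T), 1))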
retention rate and average profit, whereas the Markov chain model es-
timates the same factors with probabilistic analysis. A more flexible
solution for problem 3.30 can be obtained by creating regression mod-
els for both factors. The advantage of this approach is that regression
models can use a wide range of independent variables created from a
customer profile and, thus, enable predictive and prescriptive capabili-
ties.
One can see that survival analysis is a natural choice for the reten-
tion probability factor in equation 3.30. This probability directly corre-
sponds to the customer’s survival function S_u(t), so the model can be
rewritten as
LTV(u) = Σ_{t=1}^{T} S_u(t) · m(u, t)        (3.31)
example 3.4
Consider the case of a retailer who has 100,000 loyalty card holders.
The retailer plans a targeted campaign where each promotion instance
costs $1 and the potential profit of one response is $40. The average re-
sponse rate for this type of campaign and product category estimated
from historical data is 2%. On the basis that we have created a propen-
sity model that estimates the response probability for each customer,
we can score all card holders and sort them by the scores. The result
can be summarized by splitting the customers into “buckets” of equal
size where the first bucket corresponds to the customers with the high-
est scores and the last bucket corresponds to those with the lowest
scores. The targeting problem can then be defined as finding the opti-
mal number of top buckets to include in the targeting list, or, equiva-
lently, finding the threshold score that separates the top buckets from
the bottom ones. We use bucketing for the sake of convenience in this
example and this approach is often used in practice as well, but there is
nothing to stop us from doing the same calculations for individual cus-
tomers, that is, having as many buckets as customers. Let us assume
that we have 10 buckets or deciles, so that each bucket contains 10,000
customers; consequently, the average expected number of responders is
200 per bucket. In other words, we are likely to get 200 responses from
each bucket if we randomly assign customers to buckets. This number
is shown in the second column of table 3.7, and the third column con-
tains the cumulative number of responders, which reaches 2,000 or 2%
of the customer base in the bottom row.
Next, let us assume that the lowest response probability scores gener-
ated by the propensity model in each bucket are those presented in the
fourth column. By multiplying the bucket size by this probability, we
get the expected number of responses in the case of the targeted distri-
bution presented in the next two columns. The total number of respon-
ders still adds up to 2,000, of course. The ratio between the number of
responders in the case of targeted and random distributions is called
lift, and it is the key metric that describes the quality of the targeting
model. The lift is typically visualized by using a lift chart similar to the
one in Figure 3.19. This chart shows two lines that correspond to the
cumulative number of responses: the straight line corresponds to the
random distribution and the raised curve to the targeted distribution.
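A lift table of this kind takes only a few lines to build once propensity scores are available. The sketch below uses synthetic scores and responses rather than the values of table 3.7.

# Lift calculation sketch: bucket customers by propensity score into deciles
# and compare the responders captured in each decile with a random targeting
# baseline. Scores and responses below are synthetic.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
score = rng.beta(2, 60, size=n)              # synthetic propensity scores
responded = rng.random(n) < score            # conversions correlated with score

order = np.argsort(-score)                   # best scores first
deciles = np.array_split(order, 10)

cum_targeted, cum_random = 0, 0.0
baseline_per_bucket = responded.sum() / 10
print("decile  targeted  cum_targeted  cum_random  cum_lift")
for i, idx in enumerate(deciles, start=1):
    bucket_resp = int(responded[idx].sum())
    cum_targeted += bucket_resp
    cum_random += baseline_per_bucket
    print(f"{i:6d}  {bucket_resp:8d}  {cum_targeted:12d}"
          f"  {cum_random:10.0f}  {cum_targeted / cum_random:8.2f}")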
offer. If we are substantially over budget (above the ε line), then the
threshold should be set to the maximum possible scoring value to stop
the distribution completely. These two extreme points can be connected
by some growing function, as illustrated in Figure 3.21. Consequently,
we become more and more demanding of consumers as we approach
and cross our budgeting limits, and we lower the bar when we do not
encounter enough high-quality prospects.
• The third and the last phase is redemption. On the second shop-
ping trip, the consumer buys the promoted product to redeem
the coupon issued in the previous stage. The consumer is incen-
tivized to buy a product to redeem a coupon and get a discount.
This campaign template can be thought of as a customer journey
with three steps: trigger, purchase, and redemption. It can be argued
that this approach is more efficient than standalone promotions be-
cause it has a more durable impact on customer loyalty and lower
costs per unit moved [Catalina Marketing, 2014]. The dynamically de-
termined discount value in the second stage is an interesting detail
because the targeting system needs to optimize this value and fore-
cast how it will influence the campaign outcomes. This aspect is not
addressed by the targeting and budgeting processes discussed in the
previous sections. Let us consider an example that demonstrates how a
targeting system can heuristically evaluate different promotion param-
eters and forecast the campaign outcomes by using just basic statistics.
More formal discount optimization methods will be discussed later in
Chapter 6 in the context of price optimization.
example 3.5
in which r(d_i) is the average response rate predicted by the model. The
cost of coupons at level i can then be estimated as
If the overall retention strategy is focused, that is, the customers are
not normally treated with retention offers, a propensity model trained
which corresponds to the horizontal area at the top of the square in Fig-
ure 3.22, which contains many Do-Not-Disturbs and Lost Causes. This
aspect of modeling should be taken into account when the population
for model training is selected. The retention campaign can also use sur-
vival analysis to estimate the time-to-churn, which can be more conve-
nient for choosing the right moment for treatment than the probability
of churn.
The main shortcoming of the expected loss model is that it uses only
the probability to churn and not the churn uplift, that is, the difference
between the treated and non-treated churn probabilities:
A positive churn uplift means that the treatment amplifies the churn,
that is, the treatment has a negative effect. High uplift corresponds to
the upper left corner of the square in Figure 3.22. A negative churn up-
lift means that the treatment decreases the churn, which corresponds
to the lower right corner in Figure 3.22. Consequently, we want to tar-
get customers by using the inverse of the uplift as a score:
Once the targeted scores are calculated, the optimal targeting depth,
that is, the percentage of the population to be targeted, can be deter-
mined by using the ROI maximization method described earlier in Sec-
tion 3.6.2.2. The campaign can then be executed with the same target-
ing process as that used for product promotional campaigns.
This value can be averaged over time to obtain the average relative
channel contribution. The efficiency of the channel can be measured as
the ratio between the absolute channel contribution and the channel
budget, that is, the number of units sold generated by each dollar spent
on marketing activities through this channel. The following example
illustrates how the adstock model can be created and used.
example 3.6
Consider a retailer who uses two marketing channels: email and SMS.
The retailer can measure and control the intensity of marketing com-
munications through each of the channels by setting budgeting and
capping rules. The retailer also observes the sales volume. A data sam-
ple with these metrics is plotted in Figure 3.23 for 20 sequential time
intervals (we have omitted the table with numerical values for the sake
of space).
The adstock model can be fitted by solving problem 3.46 with numer-
ical optimization methods. By setting the length of the decay window
n to 3, we get the following estimates for the model parameters:
baseline: c = 28.028
email: λ_email = 0.790, w_email = 1.863
SMS: λ_sms = 0.482, w_sms = 4.884
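Problem 3.46 is not restated here, but the fitting procedure can be sketched as follows, assuming the basic geometric adstock a_t = x_t + λ a_{t−1} with a finite decay window and a least-squares fit of the sales volume; the data and the resulting estimates are synthetic.

# Adstock fitting sketch: transform channel activity x_t into adstock
# a_t = x_t + lambda * a_{t-1} (finite decay window n) and fit
# sales ~ c + w_email * a_email + w_sms * a_sms by least squares.
# Synthetic data; parameter values will not match the example.
import numpy as np
from scipy.optimize import minimize

def adstock(x, lam, n=3):
    a = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        a[t] = sum(lam ** k * x[t - k] for k in range(min(n, t + 1)))
    return a

rng = np.random.default_rng(2)
email = rng.integers(0, 10, size=20).astype(float)
sms = rng.integers(0, 10, size=20).astype(float)
sales = 28 + 1.9 * adstock(email, 0.8) + 4.9 * adstock(sms, 0.5) \
        + rng.normal(0, 2, size=20)

def loss(theta):
    c, lam_e, w_e, lam_s, w_s = theta
    pred = c + w_e * adstock(email, lam_e) + w_s * adstock(sms, lam_s)
    return np.sum((sales - pred) ** 2)

res = minimize(loss, x0=[20, 0.5, 1.0, 0.5, 1.0],
               bounds=[(0, None), (0, 1), (0, None), (0, 1), (0, None)])
print("c, lambda_email, w_email, lambda_sms, w_sms =", np.round(res.x, 3))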
Figure 3.23: Data for adstock modeling: sales volume, email activity, and SMS
activity.
Figure 3.24: Decomposition of the sales volume into the layers contributed by
different marketing channels.
a_t = 1 / (1 + exp(−µ x_t)) + λ a_{t−1}        (3.48)
3.8.1 Environment
• A user is the recipient of the ads delivered via the channels. The
user can interact with multiple channels and publishers over
time, receiving ad realizations known as impressions. From the
brand perspective, the user either eventually converts to produce
some desired outcome, such as a purchase on the brand’s website,
or does not convert. Consequently, there is a funnel of sequential
impression events Ai for each user that ends with the outcome
Y, as shown in Figure 3.25.
• Finally, the impressions and conversions are tracked by an attri-
bution system. We consider the attribution system as an abstract
entity that can trace the user identity across channels and pub-
lishers and can keep records of which user received which im-
pression from which advertiser at which point in time. The pur-
pose of the attribution system is to measure the effectiveness of
the ad campaign and provide insights into the contributions of
individual channels, advertisers, and user segments. The attribu-
tion system typically collects the information by using tracking
pixels attached to ad banners and conversion web pages; users
are identified with web browser cookies. However, the attribu-
tion process can consume additional data sources, such as pur-
chases in brick-and-mortar stores, correlate this data with online
profiles by using loyalty or credit card IDs, and measure causal
effects across online and offline channels.
In the environment model above, the brand relies on the attribution
system to measure the effectiveness of individual advertisers and ad
campaigns as a whole. The metrics produced by the attribution sys-
tem directly translate into the advertiser’s fees and the brand’s costs
and revenues, so we will spend the next section examining attribution
models and their impact on advertiser’s strategies.
CPA = C_camp / N_conv        (3.53)
Conversion, however, can be defined in different ways. One possible
approach is to count post-view actions, that is, to count the users who
visited the brand’s site or made a purchase within a certain time in-
terval (for example, within a week) after they received an impression.
A more simple method is to count the immediate clicks on advertise-
ments, which is referred to as the cost per click (CPC) model. From the
advertiser perspective, it makes sense to split the cost of the campaign
into a product of the number of impressions and the average cost of
one impression, so the CPA metric can be expressed as follows:
CPA = (N_impr · E[c_impr]) / N_conv = CR^{−1} · E[c_impr]        (3.54)
in which N_impr is the total number of impressions delivered by the
campaign, E[c_impr] is the average price paid by the brand for one
impression, and CR is the conversion rate. The advertiser’s margin is
the difference between the price paid by the brand and the bid value
placed in the RTB, so we can define the advertiser’s equivalent of the
CPA as follows:
CPA_a = CR^{−1} · (E[c_impr] − c_bid)        (3.55)
• Cost per mille (CPM) contract. The brand pays a fixed fee for each
impression, but eventually measures the overall CPA by using
the attribution system.
Both approaches are equivalent in the sense that the advertiser has
to minimize the CPA metric to satisfy the client, even for CPM con-
tracts. The fixed fee implies that the CPA metric in equation 3.54 can
be optimized by maximization of the conversion rate CR. However, the
bid value cbid in equation 3.55 is not fixed and directly influences the
conversion rate, so optimization of the CPAa metric requires joint op-
timization of CR and cbid .
The final area we need to cover is attribution in the case of multi-
ple advertisers. The most basic approach is last-touch attribution (LT),
which gives all the credit to the last impression that preceded the con-
version. Consequently, the goal of the advertiser under the LT model is
to identify customers who are likely to convert immediately after the
impression.
The basic goal of targeting under the CPA-LT model is to identify users
who are likely to convert shortly after the impression. Similarly to the
case of promotion targeting, we use a variant of look-alike modeling to
solve this problem, but we want to explicitly account for the informa-
tion about a user’s response to advertisements as opposed to selecting
natural buyers based on purchase histories. In particular, we want to
account for the performance of the currently running advertisement,
which means that we have to dynamically adjust our targeting method
based on the observed results. In other words, we want to build a self-
tuning targeting method.
We can assume that the advertiser has the following data for each
consumer profile:
• Bids and impressions. The advertiser can track the bids it made
for a given user and the impressions delivered to the user.
Figure 3.26: Desirable sampling for the targeting task. The shaded circles corre-
spond to positive and negative examples.
φ(u) = Pr(Y | u) = Pr(Y | URL_1, ..., URL_n)        (3.56)
in which URLi are binary labels equal to one if the user visited the
corresponding URL and zero otherwise. The advertiser can use differ-
ent definitions of the URL and conversion to build multiple models
φ_1(u), ..., φ_k(u) that capture different indicators of proximity:
• The URLs can be aggregated into clusters, and labels URLi can
be replaced by per-cluster binary labels that indicate whether the
user visited some URL from a cluster or not. This reduces the di-
mensionality of the problem, which can be helpful if the number
b(u) = b_base · s_1(ψ_a(u)) · s_2(ω_a(u, i) / ω_a(u))        (3.60)
Note that although the notation we have used implies that ω_a(u)
equals ψ_a(u), the advertiser can use different data samples and mod-
els to estimate ω and ψ depending on the available data and other
considerations. The steepness of the scaling functions s_1(·) and s_2(·)
determines the trade-off between conversion rates and the advertiser’s
CPA. Steep scaling functions (e. g., zero if the argument is below the
threshold and a very high value otherwise) generally maximize the
conversion rate, but these can be suboptimal from the CPA standpoint.
Scaling functions that are close to the identity function optimize the
CPA as it follows from theoretical equation 3.58 but can be suboptimal
in terms of conversion rates.
Model A1 A2 A3 A4 A5
Table 3.9: Static attribution models. The table shows the percentage of credit
assigned to each of five impressions A1 , . . . , A5 .
ence between the probability of conversion for the full set of channels
and the probability of conversion if channel Ck is removed:
because we draw sequences of length |S| from the set C \ C_k with car-
dinality |C| − 1. For example, the causal effect of channel C_3 in the
network C = {C_1, C_2, C_3} is given by the following equation:
V_3 = 1/3 · (Pr(Y | C_1, C_2, C_3) − Pr(Y | C_1, C_2))
    + 1/6 · (Pr(Y | C_1, C_3) − Pr(Y | C_1))
    + 1/6 · (Pr(Y | C_2, C_3) − Pr(Y | C_2))        (3.64)
    + 1/3 · (Pr(Y | C_3) − Pr(Y | ∅))
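The weighting scheme of equation 3.64 is the Shapley-value weighting, which generalizes to any number of channels. A minimal sketch with illustrative conversion probabilities:

# Shapley-style channel attribution sketch for a three-channel network.
# pr[S] is the estimated conversion probability given that the user was
# exposed to exactly the channels in S. Values are illustrative.
from itertools import combinations
from math import factorial

channels = {"C1", "C2", "C3"}
pr = {
    frozenset(): 0.01,
    frozenset({"C1"}): 0.02, frozenset({"C2"}): 0.03, frozenset({"C3"}): 0.015,
    frozenset({"C1", "C2"}): 0.05, frozenset({"C1", "C3"}): 0.03,
    frozenset({"C2", "C3"}): 0.04,
    frozenset({"C1", "C2", "C3"}): 0.06,
}

def contribution(k):
    """Weighted sum of the marginal conversion gains of channel k (eq. 3.64 form)."""
    others = channels - {k}
    n, total = len(channels), 0.0
    for size in range(len(others) + 1):
        for s in combinations(others, size):
            s = frozenset(s)
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            total += weight * (pr[s | {k}] - pr[s])
    return total

for k in sorted(channels):
    print(k, round(contribution(k), 4))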
The attribution formula 3.62 can be difficult to evaluate in prac-
tice because long sequences of channels have relatively low realization
probabilities, which impacts the estimation stability [Dalessandro et al.,
2012b; Shao and Li, 2011]. It can be reasonable to discard all sequences
S longer than 2 channels to produce a more simple and stable model
[Shao and Li, 2011]:
V_k = Σ_{S ⊆ C \ C_k} w_{S,k} · (Pr(Y | S ∪ C_k) − Pr(Y | S))
    ≈ w_0 · (Pr(Y | C_k) − Pr(Y | ∅))        (3.65)
    + w_1 · Σ_{j ≠ k} (Pr(Y | C_j, C_k) − Pr(Y | C_j))
w_0 = (1/|C|) · binom(|C| − 1, 0)^{−1} = 1/|C|
w_1 = (1/|C|) · binom(|C| − 1, 1)^{−1} = 1/((|C| − 1) · |C|)        (3.66)
We can therefore express the causal effect as
R = k / n        (3.68)
The obtained estimate may or may not be statistically reliable, depend-
ing on the number of individuals and conversions. If these numbers
are small, we can expect the measured rate to have high variance and
to change drastically if the same campaign is run multiple times. If
the numbers are high, we can expect more consistent results. The re-
liability of the estimate can be measured in different ways by using
different statistical frameworks. In this book, we generally advocate
the Bayesian approach, in which the conversion rate R is treated as a
random variable and its posterior distribution given the observed data
is obtained by using Bayes’ rule:
p(R | k) = p(k | R) · p(R) / p(k)        (3.69)
In words, we start with a prior belief about the rate distribution p(R),
and the observed data, that is, the number of conversions k, provide
evidence for or against our belief. The posterior distribution p(R | k) is
obtained by updating our belief based on the evidence that we see.
As the posterior rate distribution includes two factors, p(k | R) and
p(R), we need to specify these two distributions. Under the assumption
that the conversion rate is fixed, the probability that exactly k individ-
uals out of n will convert is given by a binomial distribution, with a
probability mass function of the form
p(k | R) = binom(n, k) · R^k (1 − R)^{n−k}
         = n! / (k!(n − k)!) · R^k (1 − R)^{n−k}        (3.71)
The second factor, the prior distribution ppRq, can be assumed to
be uniform or can be estimated from historical campaign data. Let us
beta(α, β) = x^{α−1} (1 − x)^{β−1} / B(α, β),
B(α, β) = ∫_0^1 x^{α−1} (1 − x)^{β−1} dx        (3.74)
The distribution of the conversion rate given n treated individuals
and k conversions is described by the beta distribution.
If the prior distribution is not uniform, it can also be modeled as the
beta distribution:
p(R) = beta(x, y)        (3.75)
in which parameters x and y can be estimated, for example, based on
historical data. In this case, the posterior distribution is still the beta
distribution:
p(R | k) ∝ p(k | R) · p(R)
         ∝ R^k (1 − R)^{n−k} · beta(x, y)
         ∝ R^{k+x−1} (1 − R)^{n−k+y−1}        (3.76)
         ∝ beta(k + x, n − k + y)
It is said that the beta distribution is the conjugate prior to the bi-
nomial distribution: if the likelihood function is binomial, the choice
of a beta prior will ensure that the posterior distribution is also beta.
Note that beta(1, 1) reduces to a uniform distribution, so result 3.73
obtained for the uniform prior is a particular case of expression 3.76.
We can now estimate the probability that the conversion rate R lies
within some credible interval [a, b] as
Pr(a ≤ R ≤ b) = ∫_a^b beta(k + 1, n − k + 1) dR        (3.77)
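In practice, the integral in expression 3.77 is just a difference of two beta CDF values. A minimal sketch using scipy:

# Credible interval for the conversion rate under a uniform prior
# (expression 3.77): the posterior is beta(k + 1, n - k + 1).
from scipy.stats import beta

n, k = 1000, 100                       # treated individuals and conversions
posterior = beta(k + 1, n - k + 1)

low, high = posterior.ppf(0.025), posterior.ppf(0.975)
print(f"95% credible interval for R: [{low:.3f}, {high:.3f}]")
print("Pr(0.08 <= R <= 0.12) =", posterior.cdf(0.12) - posterior.cdf(0.08))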
Figure 3.28: Examples of the posterior distribution p(R | k) for the uniform
prior and different sample sizes n. The mean k/n = 0.1 for all
samples. The vertical lines are the 2.5% and 97.5% percentiles of the
corresponding distributions. We start with the uniform prior, and
the more samples we get, the narrower the posterior distribution
becomes.
3.9.1.2 Uplift
The conversion rate by itself is not a sufficient measure of the quality of
a targeting algorithm or the effectiveness of a marketing campaign. As
L = R / R_0 − 1        (3.78)
in which R0 is the baseline conversion rate and R is the conversion rate
for the campaign in question. From a statistical standpoint, we also
want to measure the reliability of this estimate, that is, the probability
that the observed results are attributable to the impact of the campaign
in question relative to the baseline rather than to some external uncon-
trolled factors. The standard way to tackle this problem is random-
ized experiments. The approach is to randomly split the consumers who
can potentially be involved in the campaign into two groups (test and
control), provide the test group with the treatment (send promotion,
show ads, present a new website design, etc.), and provide the control
group with the no-action or baseline treatment. Random selection of
test and control individuals is important to ensure that the observed
difference in outcomes is not caused by a systematic bias between the
two groups, such as a difference in average income. Running the test
and control in parallel is also important to ensure equality of the test
conditions for the control groups, which might not be the case, for
example, in a comparison of new data with historical data.
The design of randomized experiments for targeted campaigns is
illustrated by Figure 3.29. The high-propensity customers identified
by the targeting algorithm are divided into test and control groups,
and the test group receives the treatment. The number of positive and
negative outcomes is measured for both groups: n_T and n_C are the
number of individuals and k_T and k_C are the number of conversions
in the test and control groups, respectively. The uplift is measured by
comparing the conversion rate of the test group k_T/n_T with that of the
control group k_C/n_C.
We now want to assess the probability Pr(R_T > R_C) or, equivalently,
to find a credible interval for uplift L. We can calculate this in a similar
manner to that we used for the credible interval of the conversion rate
in expression 3.77, but now we need to account for the joint distribution
for RT and RC :
Pr(a ≤ L ≤ b) = ∬_{a ≤ L(R_T, R_C) ≤ b} Pr(R_T, R_C) dR_T dR_C        (3.80)
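Because the posteriors of R_T and R_C are independent beta distributions under uniform priors, integral 3.80 can be evaluated by simple Monte Carlo sampling. A minimal sketch with illustrative counts:

# Monte Carlo estimate of the uplift credible interval (expression 3.80).
# Uniform priors are assumed, so the posteriors are beta(k + 1, n - k + 1).
import numpy as np

n_t, k_t = 5000, 140        # test group: individuals, conversions
n_c, k_c = 5000, 100        # control group

rng = np.random.default_rng(3)
r_t = rng.beta(k_t + 1, n_t - k_t + 1, size=100_000)
r_c = rng.beta(k_c + 1, n_c - k_c + 1, size=100_000)
uplift = r_t / r_c - 1      # expression 3.78

print("Pr(R_T > R_C) =", (r_t > r_c).mean())
print("95% credible interval for uplift:",
      np.round(np.percentile(uplift, [2.5, 97.5]), 3))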
with the control group. Although subjects can be assigned to the test
and control groups randomly, some people in the test group cannot
be exposed to the treatment because of compliance issues. The split
into compliant and non-compliant subgroups after randomization cor-
responds to the win–lose split in the bidding process when it follows
control group selection, so we can leverage the studies dedicated to
clinical trials with non-compliance.
The problem of uplift estimation with observational studies can
be approached by using different techniques. We start with a basic
method that illustrates how some concepts of causality theory can be
applied to the problem [Chalasani and Sriharsha, 2016; Rubin, 1974;
Jo, 2002].
We can see in Figure 3.31 that we have at least three conversion rates
that can be measured directly: R_C for the control group, R_T^L for the lost
bids in the test group, and R_T^W for the users who got actual impressions.
Our goal is to find the conversion rate R_C^W, which can be interpreted as
a potential conversion rate of the users who would have been won even
if they were not provided with impressions. This value is hypothetical
because we cannot go into the past, revoke the impressions we already
delivered, and see what would happen. However, it can be estimated
from the known data under certain assumptions. First, we can note
that the ratio γ between the number of users who were won and the
number of users who were lost is directly observable. By assuming that
the distribution of “winners” and “losers” is the same in both test and
control groups, we can claim that
R_C = γ · R_C^W + (1 − γ) · R_C^L        (3.82)
in which R_C^W and R_C^L are the conversion rates that we can expect from
the control users who could be won or lost, respectively, if reassigned
to the test group. The second assumption we can make is that R_C^L = R_T^L
because both groups contain only “losers” who have not been exposed
to the ad, so we do not expect any bias between them. Consequently,
we can express R_C^W by using the known values as follows:
R_C^W = (1/γ) · (R_C − (1 − γ) · R_T^L)        (3.83)
Finally, the uplift can be estimated as the ratio between the observed
R_T^W and the inferred R_C^W.
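Numerically, the whole argument reduces to a couple of lines of arithmetic. A minimal sketch with illustrative counts and rates:

# Observational uplift sketch (expressions 3.82 and 3.83). The counts below
# are illustrative: the test group splits into won and lost bids, the control
# group is never bid on.
won, lost = 6000, 4000                     # bids won / lost in the test group
gamma = won / (won + lost)                 # observable win ratio

r_t_w = 0.030                              # conversion rate of won (exposed) users
r_t_l = 0.012                              # conversion rate of lost users
r_c = 0.020                                # conversion rate of the control group

# Inferred conversion rate of control users who would have been won (3.83)
r_c_w = (r_c - (1 - gamma) * r_t_l) / gamma
uplift = r_t_w / r_c_w - 1

print("inferred R_C^W =", round(r_c_w, 4))
print("estimated uplift =", round(uplift, 3))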
Figure 3.32: Graphical model for an observational study with latent factors.
We now need to specify the state random variable S and its role in
the densities Pr(a | z, s) and Pr(y | a, s). The idea behind the latent fac-
tors is to capture “the state of the world” that is not observed directly
but can influence outcomes like the uplift. This concept can be consid-
ered as a counterpart of the potential outcomes that we discussed at
the beginning of this section, because if we can infer the state from the
observations then we can evaluate the potential outcomes for different
preconditions. For example, if we know that a given user can never be
won on the exchange, then we can predict the outcomes for assigning
this user to both the test and control groups.
The latent state can be modeled differently depending on the avail-
able data, metrics of interest, and general understanding of the domain.
We use a standard model that illustrates how the latent states can be de-
fined as functions of the observed data and how metrics like uplift can
be derived from the states [Heckerman and Shachter, 1995; Chickering
and Pearl, 1996].
From a campaign efficiency standpoint, we are interested mainly in
two properties of the user: compliance with the advertising method
(ability or inability to win a bid) and response to the advertisement
(converted or not). These properties correspond to the probabilities
Pr(a | z, s) and Pr(y | a, s) discussed above and can be considered
as the user’s internal state that systematically influences the outcomes
obtained for the user. We can enumerate the possible states separately
for compliance and response and specify a condition for each state that
indicates whether the state is possible for the observed tuple (z, a, y)
or not.
The set of possible user states is a Cartesian product of the
compliance and response behaviors, which gives us a 16-element
Table 3.10: User compliance states and conditions. States C3 and C4 should
never be the case in the scenario that we consider, but they can occur
in other environments such as omni-channel advertising.
set {s_1, ..., s_16}, in which s_i iterates through all pairs (C_p, R_q) of
compliance and response behaviors listed in tables 3.10 and 3.11:
S ∈ {s_1, ..., s_16},    s_i = (C_p, R_q),    1 ≤ p, q ≤ 4        (3.85)
in which each share µi is the ratio between the number of users in the
corresponding state si and the total number of observed users. The
metrics can then be defined as functions of µ. For example, the uplift
L(µ) can be defined as the ratio between the sum of the four µ_i values
3.9.2.2 Simulation
By assuming the model specified above, we can express the credible
interval of the metric L(µ) via the posterior distribution of the random
vector µ:
Pr(a ≤ L(µ) ≤ b) = ∫_{a ≤ L(µ) ≤ b} p(µ | data) dµ        (3.87)
in which data represents all observed tuples (z_j, a_j, y_j). Let us denote
the vector of user states as
s s1 , . . . , sm (3.88)
It can be the case that we cannot sample points directly from the mul-
tivariate distribution, but sampling from the conditional distribution
is possible. The idea in Gibbs sampling is that, rather than probabilis-
tically picking all n variables at once, we can pick one variable at a
time with the remaining variables fixed to their current values. In other
words, each variable is sampled from its conditional distribution with
the remaining variables fixed:
draw x_2^{(i)} ∼ p(x_2 | x_1^{(i)}, x_3^{(i−1)}, ..., x_n^{(i−1)})
...
Let us now come back to the distribution p(µ, s | data) and inves-
tigate how the Gibbs sampler can be used to draw samples from it.
p(s_j = s_i | µ, data) ∝ p(a_j, y_j | z_j, s_i) · µ_i        (3.93)
The last step is to specify the prior distribution Pr(µ). Recall that we
have used a beta distribution for the prior in randomized experiments
because the likelihood had a binomial distribution and the beta distri-
bution is a conjugate prior to the binomial. In a similar way, we now
have a multinomial likelihood and its conjugate prior is the Dirichlet
distribution (see Appendix A): if we choose Pr(µ) to be the Dirich-
let, the posterior distribution described in expression 3.96 will also be
Dirichlet. More formally, we can express the prior belief as a set of
counters n_i^0, which are used as parameters of the prior Dirichlet distri-
bution, and the posterior can then be expressed as
p(µ | s, data) ∝ ∏_i µ_i^{n_i} · Dir(n_1^0, ..., n_16^0)
              ∝ ∏_i µ_i^{n_i^0 + n_i − 1}        (3.96)
              ∝ Dir(n_1^0 + n_1, ..., n_16^0 + n_16)
The above equations can be plugged directly into the Gibbs sampler:
we generate the samples of µ by using expression 3.96, generate m sam-
ples of s by using equation 3.93, and then repeat this process iteratively
until we have enough realizations of vector µ to evaluate the credible
interval of L(µ).
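The following Python sketch illustrates the structure of this Gibbs sampler. To keep it short, it replaces the 16-state model of tables 3.10 and 3.11 with a simplified stand-in (two compliance types and three response types) and synthetic data, but the two sampling steps correspond to equations 3.93 and 3.96.

# Gibbs sampling sketch for an observational study with latent user states.
# Simplified stand-in for the 16-state model: compliance is reduced to
# winnable/unwinnable and response to never/always/persuadable. Synthetic data.
import numpy as np

rng = np.random.default_rng(4)

states = [(c, r) for c in ("winnable", "unwinnable")
                 for r in ("never", "always", "persuadable")]

def consistent(z, a, y, state):
    """Is observation (assignment z, impression a, conversion y) possible in state?"""
    c, r = state
    a_expected = 1 if (z == 1 and c == "winnable") else 0
    y_expected = {"never": 0, "always": 1, "persuadable": a_expected}[r]
    return a == a_expected and y == y_expected

# --- synthetic observed data (z, a, y) ---
m = 1000
true_probs = np.array([0.25, 0.05, 0.15, 0.35, 0.05, 0.15])
true_state = rng.choice(len(states), size=m, p=true_probs)
z = rng.integers(0, 2, size=m)
a = np.array([1 if (z[j] == 1 and states[true_state[j]][0] == "winnable") else 0
              for j in range(m)])
y = np.array([{"never": 0, "always": 1, "persuadable": a[j]}[states[true_state[j]][1]]
              for j in range(m)])

# --- Gibbs sampler ---
n0 = np.ones(len(states))                      # Dirichlet prior counters
mu = np.full(len(states), 1.0 / len(states))
samples = []
for it in range(300):
    # draw latent states given mu (equation 3.93: indicator likelihood times mu_i)
    counts = np.zeros(len(states))
    for j in range(m):
        w = np.array([mu[i] if consistent(z[j], a[j], y[j], s) else 0.0
                      for i, s in enumerate(states)])
        s_j = rng.choice(len(states), p=w / w.sum())
        counts[s_j] += 1
    # draw mu given the state counts (equation 3.96: Dirichlet posterior)
    mu = rng.dirichlet(n0 + counts)
    if it >= 50:                               # discard burn-in
        samples.append(mu[states.index(("winnable", "persuadable"))])

print("posterior mean share of winnable persuadables:", round(np.mean(samples), 3))
print("95% credible interval:", np.round(np.percentile(samples, [2.5, 97.5]), 3))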
scoring The targeting server scores the incentives that have passed
the previous step by evaluating propensity models associated
with the incentive against the context, including the consumer’s
historical profile. Propensity models can be dynamically selected
for an incentive based on metadata and business rules to avoid
manual binding of a model to each and every incentive.
main outputs of the data preparation process is the profile features that
can be used for training and evaluation of predictive models. The data
consolidated and cleaned by the platform can be provided with differ-
ent service level agreements for different consumers. For example, pro-
file features can be created in batch mode for analytical purposes, but
the same features are needed for real-time model evaluation in the tar-
geting server. Thus, some data preparation modules can be designed
to work in different modes. This aspect is illustrated by the feature
preparation block in Figure 3.33, which is used for both modeling and
real-time data aggregation for the data management platform.
One of the main functions of the platform is to produce propensity
models by running machine learning algorithms over the data that
come from the targeting server and external data sources. The platform
can provide tools for manual model creation and automated model
updates (re-training). In addition to that, the analytics system performs
measurements and provides capabilities for reporting and exploratory
data analysis.
Finally, the analytics platform can host a planner. This is a key com-
ponent of a programmatic marketing system that designs and opti-
mizes promotion and advertising campaigns. The planner uses histor-
ical data and statistics, business rules, best practices, and heuristics
to determine the optimal strategies (duration and type of incentives,
channels, propensity models, etc.) based on the input objective and
additional constraints like budgeting limits. It can also forecast the per-
formance of a campaign by using historical data. The planner might
have several functional blocks including
investment planner The investment planner provides a high-
level view of market opportunities derived from historical data.
It helps the end user to set the right business objectives and dis-
tribute the budget amounts across different strategic directions
and campaigns. It can be considered as a global optimization
tool.
3.11 summary
the offerings for a given consumer or the consumers for a given of-
fering.
and determine products that match these needs. This capability is the
key to building efficient customer-facing services and applications.
The first product discovery service that we will consider is the search.
The purpose of a search service is to fetch offerings that are relevant
to the customer’s search intent expressed in a search query or with
selected filters. This type of problem is addressed by information re-
trieval theory, so we have a wide range of theoretical frameworks and
practical search methods at our disposal. The primary goal of this sec-
tion is to put together, align, and adapt the frameworks and methods
that are relevant to marketing applications. Some of these methods are
borrowed from the toolkit of generic search methods developed in in-
formation retrieval theory, and others were developed specifically for
marketing and merchandising. We will be taking a practical approach
to search methods and will focus on industrial experience, techniques,
and examples, rather than information theory. At the same time, we
will try to avoid implementation details, such as data indexing, as
much as possible and will stay focused on the business value deliv-
ered through relevant search results.
We will start this chapter with a review of the environment and eco-
nomic objectives. We will then demonstrate that the problem of rele-
vant search can be expressed in terms of features, signals, and controls,
similarly to other programmatic services. We will review a number of
methods for engineering, mixing, and tuning these signals and controls
in manual mode, and we will then discuss how predictive analytics can
be leveraged for automated optimization.
4.1 environment
2. Merchandising controls
precision = R / S        (4.2)
recall = R / D        (4.3)
in which R is the number of relevant items in the result set, S is the size
of the result set, and D is the number of relevant items in the collection.
We typically need both of these metrics to describe the result set or
search method. On the one hand, recall measures the completeness of
the search results regardless of the result set size, so one can always
achieve the maximum possible recall of 1.00 by returning the entire
collection of items. On the other hand, precision measures the density
of relevant items in the result set and tells us nothing about the relevant
items that have not been fetched.
The difference between the two metrics, however, does not mean
that they are independent. First, let us review Figure 4.2 again. It sug-
gests that we can change the recall from 0 to 1 by stretching the search
results rectangle in the vertical direction; meanwhile, the precision re-
mains constant. This behavior is almost never the case for real data.
Figure 4.2: The relationship between relevant items, search results, and relevant
results.
One of the main reasons is that we describe items and specify queries
by using dimensions that are not necessarily aligned with the shape of
the item set defined by the search intent. Let us consider the example
depicted in Figure 4.3. A retailer has a large collection of shoes that are
described by using properties such as price and category. A user who
searches for affordable quality shoes might consider items scattered along
the diagonal from inexpensive dress shoes to expensive sandals to be
generally relevant. A search system, however, might not be able to re-
peat this shape. If it treats the terms affordable and shoes in a strict way,
it can achieve high precision but relatively low recall by returning, for
example, average running shoes. Loosening of the criteria increases the
recall but also scoops irrelevant items, such as expensive dress shoes,
which decreases the precision1 .
This pattern is almost constantly present in search applications, so
we typically have to choose between high-precision, low-recall search
methods and the low-precision, high-recall alternatives. Merchandis-
ing search is heavily biased towards high-precision search methods be-
cause the primary goal is to provide a user with a reasonable number
of relevant results that can quickly be reviewed.
The basic precision and recall metrics provide a useful conceptual
view of relevance but have many limitations as quantitative measures.
First, precision and recall are set-based metrics that cannot be straight-
forwardly applied to ranked search results. This limitation is critical for
merchandising search that strives to provide the user with a few valu-
able results sorted by relevance. One possible approach to account for
ranking is to go through the items in the result set from top to bottom,
1 This problem is not specific for search and often comes up in machine learning, especially
in deep-learning applications. For example, a set of photographic images is a very “curly”
area in the space of all possible two-dimensional matrices. Such sets embedded into high-
dimensional spaces are referenced as manifolds.
Figure 4.3: Items plotted by category (dress shoes, running shoes, sandals)
against price. A strict interpretation of the query captures only a
high-precision, low-recall region of the relevant items.
calculate the precision and recall at each point, and plot a precision–
recall curve. This process is illustrated by the example in Figure 4.4. Let
us assume that we have 20 items in total and 5 of them are relevant.
The result set starts with a relevant item, so the precision is 1.00 and
recall is 1⁄5. The next two items are not relevant, so the recall remains
constant but precision first drops to 1⁄2 and then to 1⁄3. Continuing this
process, we get a precision of 1⁄4 and recall of 1.00 at the twentieth item
in the result set. The precision–recall curve is thus jagged but typically
has a downward-sloping concave trend.
The precision–recall curve provides a convenient way to analyze
search quality for a single query, but we often need a compact metric
that expresses the overall performance of a search service as a single
number. One standard way to do this is to determine the mean average
precision (MAP), which first averages the precisions at each relevant
item and then averages this value over all of the queries we use in the
evaluation. If the number of queries is Q, the number of relevant items
for query q is Rq , and the precision at the k-th relevant item is Pqk ,
then
MAP = (1/Q) Σ_{q=1}^{Q} (1/R_q) Σ_{k=1}^{R_q} P_{qk}        (4.4)
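Formula 4.4 maps directly onto a few lines of code. The sketch below computes the average precision of a single ranked result list and the MAP over a set of queries; the positions of the relevant items are illustrative and are not taken from Figure 4.4.

# Mean average precision sketch (formula 4.4). Each query is represented by a
# list of relevance flags for its ranked results plus the total number of
# relevant items in the collection. Item positions are illustrative.
def average_precision(relevance_flags, total_relevant):
    hits, score = 0, 0.0
    for rank, rel in enumerate(relevance_flags, start=1):
        if rel:
            hits += 1
            score += hits / rank          # precision at this relevant item
    return score / total_relevant

def mean_average_precision(queries):
    return sum(average_precision(flags, total) for flags, total in queries) / len(queries)

# Example: a result list of 20 items with 5 relevant ones
flags = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print("AP for one query:", round(average_precision(flags, 5), 3))
print("MAP:", round(mean_average_precision([(flags, 5)]), 3))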
Figure 4.4: Precision–recall curve. The solid circles correspond to relevant items
and the empty circles correspond to irrelevant items in the result set.
For example, the MAP for the single query illustrated in Figure 4.4
is the mean of the five precision numbers for each of the five relevant
items:
DCG = Σ_{k=1}^{K} (2^{R_k} − 1) / log_2(k + 1)        (4.8)
The magnitude of the DCG calculated according to formula 4.8 can
vary depending on the number of results K. To compare DCG metrics
obtained for different queries, we need to normalize them. This can be
done by calculating the maximum possible DCG, called the Ideal DCG,
and dividing the actual DCG by this value to obtain the normalized
DCG (NDCG):
NDCG = DCG / Ideal DCG        (4.9)
The ideal DCG can be estimated by sorting the items in a search
results list by relevance grades and applying formula 4.8 to calculate
the corresponding DCG. Consequently, the NDCG is equal to one for
an ideal ranking. For example, let us consider a search results list with
six items that are scored by an expert on a scale of 0 to 4, with 0
meaning nonrelevant and 4 meaning most relevant:
4, 3, 4, 2, 0, 1 (4.10)
The value of the DCG calculated for this search result in accordance
with formula 4.8 is 28.56. The ideal ordering for this search result is
4, 4, 3, 2, 1, 0 (4.11)
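These quantities are easy to verify programmatically. A short sketch implementing formulas 4.8 and 4.9 for the grades above reproduces the DCG of 28.56 and computes the corresponding ideal DCG and NDCG:

# DCG and NDCG sketch (formulas 4.8 and 4.9) for the graded result list above.
from math import log2

def dcg(grades):
    return sum((2 ** r - 1) / log2(k + 1) for k, r in enumerate(grades, start=1))

grades = [4, 3, 4, 2, 0, 1]
ideal = sorted(grades, reverse=True)          # 4, 4, 3, 2, 1, 0

print("DCG       =", round(dcg(grades), 2))   # about 28.56
print("Ideal DCG =", round(dcg(ideal), 2))
print("NDCG      =", round(dcg(grades) / dcg(ideal), 3))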
search latency The time it takes to process a search query and re-
turn a result has a big impact on the user experience. Many retail
and web search companies have reported impressive statistics on
this matter. For instance, Amazon reported that every 100 mil-
lisecond increase in the page load time results in a 1% loss in sales.
Figure 4.5: A high-level view of a search flow and its main controls.
Product 2
Description: Fiery red dress. A black ribbon
at the waist.
The most basic thing we can do to search over such documents is to
break the descriptions up into words and allow only for single-word
search queries, so a product will be included in a result set only if
its description contains the same word as the query. The process of
breaking up a text into words or other elements such as phrases is
called tokenization, and the outputs – words in our cases – are called
tokens. For English language, tokenization is usually done by using
spaces and punctuation marks as delimiters, so the documents above
will produce the following tokens:
Product 1: [Pleated], [black], [dress], [Lightweight],
[look], [for], [the], [office]
Product 2: [Fiery], [red], [dress], [A], [black]
(4.14)
[ribbon], [at], [the], [waist]
Consequently, the query black will match both products and the
query red will only match the second product. Clearly, this method
4.3 building blocks: matching and ranking 195
provides only a matching capability and the items in the result set are
not ranked.
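The token matching method described above takes only a few lines to implement. A minimal sketch with simple whitespace-and-punctuation tokenization:

# Token matching sketch: split product descriptions into tokens and include a
# product in the result set if its tokens contain the single-word query.
import re

products = {
    "Product 1": "Pleated black dress. Lightweight look for the office.",
    "Product 2": "Fiery red dress. A black ribbon at the waist.",
}

def tokenize(text):
    # split on whitespace and punctuation, as described in the text
    return [t for t in re.split(r"[^\w']+", text) if t]

def match(query, description):
    return query in tokenize(description)

for query in ("black", "red"):
    hits = [name for name, desc in products.items() if match(query, desc)]
    print(query, "->", hits)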
Although this token matching method is extremely simple and lim-
ited, it illustrates the main search controls that we discussed earlier.
First, each token can be viewed as an individual feature that may or
may not be present in a product. The tokenization process is then an
example of feature engineering. Words are indeed reasonably good fea-
tures because they carry a strong signal about a product type, such as
shoes, and its properties, such as a black color. Second, token match-
ing is a way to produce signals about correlations between product
features and query features. Finally, the signals from all tokens are
combined together to produce the final decision on match or mismatch.
This flow is visualized in Figure 4.6.
will match only the second product from example 4.14, whereas the
query
will match both products. Boolean queries do not take into account
positions of tokens in the text and, hence, can be thought of as several
token matching queries combined together.
The second important capability that extends basic token matching is
the phrase query. A phrase query is a query that searches for documents
that contain a sequence of tokens that follow one another, as opposed
to a Boolean query, which searches for documents that contain individ-
ual tokens irrespective of their order and positions in the text. We use
square brackets to denote phrase queries and subqueries. For instance,
the following query will match the first product from example 4.14 but
not the second one:
[black dress]
This result has higher precision and lower recall than the result of the
Boolean query black AND dress, which matches both products. Boolean
and phrase queries together provide very powerful tools to control
relevancy and manage the precision–recall trade-off. A query language
that directly supports Boolean expressions is often a good solution for
expert search where users are willing to learn and use advanced search
functions, but its usage in merchandising search is limited because of
the unintuitive user experience. We will discuss how free-text search
can take advantage of complex Boolean and phrase queries in the later
sections.
It is not difficult to see that chopping a text into tokens produces re-
sults that are not optimal from a matching standpoint. In natural lan-
guage, words may have different forms and spellings that can be con-
sidered indistinguishable for almost all search intents. Some words do
not carry any meaningful information at all and generate noise signals.
This suggests that we need to perform a normalization of the raw to-
kens to create a cleaner token vocabulary. Such normalized tokens are
typically referred to as terms.
Normalization is a complex process that usually includes multiple
steps to address the different properties and phenomena of natural
language. Let us go through an example that illustrates the key trans-
formations by starting from the following original product description:
The first step is to normalize the character set because a search query
can be entered with or without diacritics and this difference typically
does not mean different search intents. Tokenizing the text and con-
verting it to a standard character set, we get the tokens
[Maison] [Kitsune] [Men’s] [Slim] [Jeans] [These]
[premium] [jeans] [come] [in] [a] [slim] [fit]
[for] [a] [fashionable] [look]
The second issue that we face is the presence of lower- and uppercase
characters that are also indistinguishable in most cases. The standard
approach is to transform each token into its lowercase form, which
gives the following result for our example:
[maison] [kitsune] [men’s] [slim] [jeans] [these]
[premium] [jeans] [come] [in] [a] [slim] [fit]
[for] [a] [fashionable] [look]
The exclusion of stop words may have both positive and negative
implications. On the one hand, it can positively influence some match-
ing and ranking methods that we discuss later because meaningless
high-frequency terms can skew some metrics used in relevancy calcu-
lations. On the other hand, the removal of stop words can result in
losses of substantial information and reduce our ability to search for
certain queries. For example, the removal of stop words prevents us
from finding phrases like to be, or not to be or distinguishing new from
not new. Stop words can also destroy the semantic relationships be-
tween the entities, so it becomes impossible to distinguish between an
object on the table and under the table.
The fourth standard normalization technique is stemming. In most
natural languages, words can change their form depending on number
(dress and dresses), tense (look and looked), possession (men and men’s),
and other factors. Stemming is a process of reducing different word
forms to the same root in order to eliminate differences that typically
relate to the same search intent. The problem of stemming is challeng-
ing because of the multiple exceptions and special cases that can be
found in natural languages. There exist multiple stemming methods, ei-
ther rule based or dictionary based, with different strengths and weak-
nesses. One popular family of rule-based stemmers is based on the
so-called Porter stemmer [Porter, 1980]. This represents a few groups
of suffix transformation rules and conditions that check that a word is
Rule Example
rational → rational
Table 4.1: An example of the rules used in the Porter stemmer. All rules in this
example require at least one switch from vowel to consonant in front
of the suffix, so the second rule applies to the word conditional but not
to rational.
d · q ≥ ‖q‖²        (4.17)
d · q ≥ 1        (4.18)
The cosine similarity, that is, a cosine of the angle between the vec-
tors, is a convenient metric that ranges from zero to one for positively
defined vectors. A cosine similarity of zero means that a document
vector is orthogonal to a query vector in the space of terms, and a simi-
larity value of one means an exact match equivalent to a Boolean query.
Unlike a Boolean query, the cosine similarity does not require opera-
tions to be specified in a query – it treats both query and document
as an unordered collection of terms. Let us illustrate this vector space
model by using an example.
example 4.1
Consider two items that have the following descriptions (for simplicity,
we assume that the descriptions have been tokenized and normalized):
Product 1: dark blue jeans blue denim fabric
Product 2: skinny jeans in bright blue
These two descriptions and the query dark jeans are represented as
binary vectors in table 4.2.
Table 4.2: An example of two documents and one query represented as binary
vectors.
The similarity values between the query and each of the documents
will be
cos(q, d_1) = (1 + 1) / (√2 · √5) ≈ 0.632
cos(q, d_2) = 1 / (√2 · √5) ≈ 0.316        (4.20)
Figure 4.7 shows the relationship between the documents and query
in a vector space. Note that the cosine similarity can be evaluated effi-
ciently because only the non-zero dimensions of the query have to be
considered.
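The numbers in expression 4.20 can be reproduced directly from binary term vectors:

# Binary vector space model sketch reproducing example 4.1: cosine similarity
# between the query "dark jeans" and two product descriptions.
from math import sqrt

def binary_vector(text, vocabulary):
    terms = set(text.split())
    return [1 if t in terms else 0 for t in vocabulary]

d1 = "dark blue jeans blue denim fabric"
d2 = "skinny jeans in bright blue"
q = "dark jeans"

vocab = sorted(set((d1 + " " + d2 + " " + q).split()))
v1, v2, vq = (binary_vector(t, vocab) for t in (d1, d2, q))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

print("cos(q, d1) =", round(cosine(vq, v1), 3))   # about 0.632
print("cos(q, d2) =", round(cosine(vq, v2), 3))   # about 0.316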
Figure 4.7: An example of the vector space model and cosine similarity for two
documents and one query. The document and query vectors are de-
picted as normalized.
The vector space model with binary vectors has two important short-
comings that negatively impact the relevance of results ranked by us-
ing this method. First, it does not take into account the term frequency
in a document. We can expect that documents that have multiple oc-
currences of query terms are more relevant than documents where
the same terms occur only once. Second, some terms can be more im-
portant than others: rarely used words are often more discriminative
and informative than frequently used words. For example, an apparel
retailer may have word clothing in most product descriptions so match-
ing this term does not signal strong relevance. The stop words that we
discussed earlier are an extreme case of this problem.
The first issue described above can be mitigated by replacing the
zeros and ones in the document vectors with the corresponding term
frequencies in the document. This variant of the vector space model is
often called the bag-of-words model. Term frequency (TF) can be defined as
the number of occurrences of term t in document d, which we denote as tf(t, d).
idf(t) = 1 + ln(N / (df(t) + 1))        (4.22)
in which N is the total number of documents in the collection, and df(t)
is the document frequency of the term. Similarly to term frequency,
a logarithm function is used to smooth down the magnitude of the
coefficient for rare terms.
Term frequency and inverse document frequency are usually com-
bined together so that the elements of a document vector are calcu-
lated as a product of values defined by equations 4.21 and 4.22 for the
corresponding term:
d^{(i)} = tf(t_i, d) · idf(t_i)        (4.23)
This is a widely used approach known as the TFIDF model. Sub-
stituting expression 4.23 into the definitions of the dot product and
Euclidean norm, we get the following formulas that can be used to cal-
culate the cosine similarity score for query q and document d under
the TFIDF model:
q · d = Σ_{t ∈ q} tf(t, d) · idf(t) · tf(t, q) · idf(t)        (4.24)
‖q‖ = ( Σ_{t ∈ q} [tf(t, q) · idf(t)]² )^{1/2}        (4.25)
‖d‖ = ( Σ_{t ∈ d} [tf(t, d) · idf(t)]² )^{1/2}        (4.26)
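Putting formulas 4.23–4.26 together, a minimal TF×IDF scorer can be sketched as follows. The term frequency below uses a logarithmic form tf = 1 + ln(count), which is an assumption standing in for formula 4.21, so the resulting scores are illustrative rather than a reproduction of example 4.2.

# TF-IDF cosine scoring sketch (formulas 4.23-4.26). The term frequency uses a
# logarithmic form tf = 1 + ln(count), which is an assumption; the exact
# smoothing of formula 4.21 may differ.
from collections import Counter
from math import log, sqrt

docs = {
    "d1": "dark blue jeans blue denim fabric".split(),
    "d2": "skinny jeans in bright blue".split(),
}
N = len(docs)
df = Counter(t for terms in docs.values() for t in set(terms))

def idf(t):
    return 1 + log(N / (df.get(t, 0) + 1))        # formula 4.22

def tf(t, terms):
    c = terms.count(t)
    return 1 + log(c) if c > 0 else 0.0           # assumed logarithmic TF

def weight_vector(terms):
    return {t: tf(t, terms) * idf(t) for t in set(terms)}

def cosine_score(query, terms):
    q, d = weight_vector(query.split()), weight_vector(terms)
    dot = sum(q[t] * d.get(t, 0.0) for t in q)
    norm_q = sqrt(sum(v * v for v in q.values()))
    norm_d = sqrt(sum(v * v for v in d.values()))
    return dot / (norm_q * norm_d)

for name, terms in docs.items():
    print(name, round(cosine_score("skinny jeans", terms), 3))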
example 4.2
Applying formulas 4.21 and 4.22, we obtain the TF and IDF values
summarized in table 4.3. Let us now score these products for the query
skinny jeans. The query norm and document norms can then be calcu-
lated in accordance with formulas 4.27 and 4.28 by using the TF and
IDF values that we just evaluated.
L_d(d_1) = √6 ≈ 2.449,    L_d(d_2) = √5 ≈ 2.236        (4.30)
L_q(q) = ( idf(jeans)² + idf(skinny)² )^{1/2} ≈ 1.163        (4.31)
Table 4.3: An example of TF and IDF calculations for two documents. The last
two lines correspond to the TF IDF vector representations of the doc-
uments.
The coordination factor is 0.50 for the first product and 1.00 for the
second one. By substituting all norms and TF/IDF values into for-
mula 4.29, we get a score of 0.062 for the first product and a much
higher score of 0.520 for the second product, which is in agreement
with the intuitive expectation that the second product is more relevant
for the query that we used.
The first document, however, looks much more relevant in this con-
text. Stemming will map the words dark, darker, and darkness to the
same root dark, which will result in higher scores for the first and sec-
ond documents because of higher term frequency. Moreover, a user
who searches for darkish shoes will get no results without stemming,
which is unlikely to be a good user experience.
The TFIDF scorer treats n-grams just like single-word terms and
calculates the cosine similarity in the vector space where each vector
element corresponds to a shingle, and the TFIDF metrics are also cal-
culated for shingles. So these products produce equal TFIDF scores
for the query black shirt if tokenized into single words (unigrams), but
the second product scores higher if bigrams are used because it ex-
plicitly contains the black shirt subphrase. It can be argued that scoring
Items can also have dynamic properties, such as sales data and user
ratings, that also carry important information about their fitness and,
ultimately, relevance. Item property values can be short strings such
as product names, long text snippets such as descriptions or reviews,
numbers, tokens from a discrete set such as brand names, or even nested
or hierarchical entities such as product variants or categories. This cre-
ates a diversity of features and signals that are measured on different
scales and may not be directly comparable. We need to find a way to
correlate all of these features with a query and mix the resulting signals
together to produce a relevance score.
One naïve approach to this problem is to blend all of the property val-
ues into one large text and use basic scoring methods to search through
this text. Although this approach is not totally meaningless, it results
in a very smooth and blurry signal that unpredictably scores search
results based on the interplay of term frequencies and text lengths. For
instance, the seemingly simple query black dress shoes can result in a
wild mix of dresses, shoes, black tuxedos, and other items that happen
to have some of the query terms in the description. To manage this
problem, we have to create a method that preserves the focused fea-
tures and signals and provides enough controls to pick the strongest
and most relevant results.
Figure 4.8: A basic schema of multifield scoring. F stands for splitting a doc-
ument into fields f1 , f2 , . . . , fn . S is a signal mixing function that
produces the final score.
example 4.3
Product 1
Name: Men’s 514 Straight-Fit Jeans.
Description: Dark blue jeans. Blue denim fabric.
Brand: Levi Strauss
....
Product 1000
Name: Leather Oxfords.
Description: Elegant blue dress shoes.
Brand: Out Of The Blue
We can expect that it is quite usual for jeans to have the term blue in
the name or description, so its IDF value will be quite low. For instance,
if we have 500 blue jeans out of a thousand products, we get
At the same time, the brand Out Of The Blue that manufactures blue
shoes may be very rare. Let us assume that we have only one product
of this brand and there are no other brand names containing the word
blue, so the IDF value for the term blue in the brand field is
Consider the query blue jeans now. The blue shoes will have a very
high score for the brand field and a low score for the description field,
whereas the blue jeans will have a relatively low score for the descrip-
tion field that matches both terms of the query but with low IDFs.
Combining the signals by using a sum or maximum function, we are
likely to get a search results list with the shoes at the top, which does
not match the search intent. The reason is that IDFs depend on term
distributions within a field; therefore, IDFs for different fields are not
comparable.
N
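The incomparability of per-field IDFs can be illustrated with a short sketch. The toy catalog below is a hypothetical construction that mirrors the example above: the term blue is common in descriptions but rare as a brand token, so the two fields produce IDF values on very different scales.

```python
import math

# A minimal sketch illustrating why IDF values computed within different fields
# are not comparable. Hypothetical catalog: 500 of 1000 products mention blue
# in the description, but only one brand name contains it.
catalog = (
    [{"description": "dark blue jeans", "brand": "levi strauss"}] * 500
    + [{"description": "black dress shoes", "brand": "acme"}] * 499
    + [{"description": "elegant dress shoes", "brand": "out of the blue"}]
)

def field_idf(term, field):
    n = len(catalog)
    df = sum(1 for item in catalog if term in item[field].split())
    return 1 + math.log(n / (df + 1))

print(field_idf("blue", "description"))   # ~1.7: a weak signal for a common word
print(field_idf("blue", "brand"))         # ~7.2: a strong signal for a rare token
```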
Multifield search has two aspects that complement each other and typ-
ically need to be solved in parallel – signal engineering and signal
equalization. Signal engineering aims to create clear and focused sig-
nals; meanwhile, signal equalization aims to mix these signals together
to produce the final results. The same relevance problem can sometimes
be solved in different ways, either by tuning the mixing function or
by constructing a better, more accurate signal. When we are searching
through multiple fields, the following types of relationship between the
fields and the search intent can be distinguished [Gormley and Tong,
2015]:
• One strong signal. It can be the case that a user searches for a
certain property that ideally should match with one of the fields
and produce a single strong signal. Signals from different fields
do not complement each other but rather compete. For instance,
a user who searches for the brand Out Of The Blue is likely to be
focused on the brand field and does not consider the color blue
to be relevant.
Figure 4.9: An example of a signal mixing pipeline that focuses on the strongest
signal.
Figure 4.10: A search result structure for the strongest signal strategy.
It is not unusual to generate several signals from the same field, such as
parallel unigram and bigram scores that can be mixed together, or to
request merchandisers to attribute items with a completely new prop-
erty that helps to produce a more focused signal than the available
features.
The search result structure in Figure 4.10 is quite basic, and we can
program more complex behavior by relaxing the signal mixing func-
tion. One possible solution is to mix weaker signals with the strongest
one in a controllable way. This can be expressed by using the following
signal mixing formula:
$$s = s_m + \alpha \sum_{i \neq m} s_i \tag{4.34}$$
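A minimal sketch of this mixing function is shown below; the field scores and the value of α are placeholders.

```python
# A minimal sketch of the mixing function in equation 4.34: the strongest field
# signal dominates the score, while the remaining signals act only as dampened
# tie-breakers. The scores and alpha below are placeholders.
def mix_signals(field_scores, alpha=0.1):
    strongest = max(field_scores)
    return strongest + alpha * (sum(field_scores) - strongest)

print(mix_signals([0.9, 0.2, 0.1]))   # 0.9 + 0.1 * (0.2 + 0.1) = 0.93
```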
[Figure: a tiered search result structure in which brand matches populate the inner tiers and name matches populate the outer tiers.]
Product 2
Name: Polo
Brand: Lacoste
It is quite intuitive that the second product is more relevant for the
query Lacoste Polo, but TFIDF calculations give a different result. Re-
call that the practical TFIDF scoring formula 4.29 for one-term fields
boils down to the following equation:
$$\text{score} = \text{(query coordination factor)} \cdot \text{(query norm)} \cdot \text{(field norm)} \cdot \text{tf}(\text{term}, \text{field}) \cdot \text{idf}^2_{\text{field}}(\text{term}) \tag{4.35}$$
The query coordination factor is 0.50 for all four fields because only
one of the query terms matches (Polo or Lacoste). The query norm, field
norm, and TF value are also the same for all terms and fields. The IDF
values for Polo and Lacoste are the same for the brand field, but different
from the Polo IDF in the name field. Consequently, the name and brand
fields (pairwise) have the same scores in both documents and the total
document scores are equal as well; it does not matter which function
we use to mix the query signals, sum or maximum. The fundamental
reason is that each field matches exactly one query term (either Polo or
Lacoste) to produce equally strong signals, but the fact that the second
document as a whole covers two terms and the first document covers
only one term is not taken into account. This issue can arise in many
cases where different facets of the same logical property are modeled
as different fields: a person name can be broken down into first and last
names, a delivery address can be split into street name, city name, and
country, and so on. Fragmented signals can lead to frustrating search
results – a document that perfectly matches a query can be present in
the search results list but may have a surprisingly low rank.
One possible way to address the problem is to merge several similar
fields into one, thereby eliminating the problem with fragmented signals.
Figure 4.13: Term-centric scoring pipeline. t1 and t2 are query terms; f1 and
f2 are document fields.
Returning to our example with the Polo and Lacoste brands, we can
notice that the term-centric approach will produce more meaningful
results for the query Lacoste Polo. The first document will get a high
score for the query term Polo from both fields and zero scores for the
term Lacoste, so the total score will be
in which the TFIDF score for Polo is 1.00 for the name field and
0.35 for the brand field due to the difference in IDF. At the same time,
the second document will get a high score for both query terms:
in which the TFIDF score for Lacoste is 1.00 for the brand field.
This result looks more relevant than the result we achieved by using
the field-centric approach.
We have seen that the structure of a search result can be derived from
the design of the signal mixing pipeline. We can turn this process up-
side down and attempt to develop a pipeline from a known search
result structure. This problem statement is of great practical value be-
cause it enables us to engineer features and scoring functions to a spec-
ification that describes a desired search result. This is closely related
to both relevance engineering and merchandising controls because the
specification can incorporate domain knowledge about relevance cri-
teria and business objectives. A programmatic system can provide an
interface that facilitates the specification of desired search results and
design of the signal mixing pipelines, as well as experimental evalua-
tion.
Let us go through an example of a relatively complex search result
specification to demonstrate the end-to-end engineering of signals and
scoring functions by using both textual and non-textual features. We
will consider the case of a fashion retailer who builds an online search
service. We will also assume that the user searches within a certain
product category to keep the problem reasonably simple (we will dis-
cuss how to achieve good precision when searching across multiple
categories in one of the next sections). Our starting point is the spec-
ification provided in Figure 4.14 that codifies the following business
rules:
• If a user searches for a certain product by its name or ID, a match-
ing product should be at the top of the search results.
Figure 4.14: An example of a search result specification that can be used to de-
sign a signal mixing pipeline.
factor, and we also have to mix it with the rating and newness features
to achieve the desired secondary sorting, as shown in Figure 4.15. We
clearly need to rescale the raw rating and newness values to convert
them into meaningful scoring factors; this can be done in many dif-
ferent ways. A raw customer rating on a scale from 1 to 5 can be too
aggressive as an amplification factor and can be tempered by using a
square root or logarithm function to reduce the gap between low-rated
and high-rated products. For instance, the magnitude of the brand sig-
nal amplified by a raw rating of 5.0 is two times higher than for a rating
of 2.5; however, by taking the square root of the rating, we reduce the
difference down to 1.41.
Combining the brand match signal with the rating and newness fac-
tors, we get a signal that corresponds to the second tier. Finally, the top
tier is crafted from the name match signal mixed with a large constant
factor that elevates the corresponding matching products to the very
top of the search results list.
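The following sketch illustrates one possible way to combine a text match signal with dampened rating and newness factors along the lines described above; the weights, the 30-day newness decay, and the constant boost for name matches are illustrative assumptions rather than a prescribed formula.

```python
import math

# A minimal sketch of rescaling dynamic attributes into scoring factors and
# mixing them with text match signals; all weights are illustrative assumptions.
def tier_score(name_match, brand_match, rating, days_since_launch):
    newness = 1.0 / (1.0 + days_since_launch / 30.0)   # decays with product age
    rating_factor = math.sqrt(rating)                  # temper the raw 1-5 rating
    second_tier = brand_match * rating_factor * (1.0 + newness)
    top_tier = 1000.0 * name_match                     # elevates exact name matches
    return top_tier + second_tier

print(tier_score(name_match=0.0, brand_match=1.0, rating=5.0, days_since_launch=10))
print(tier_score(name_match=1.0, brand_match=0.0, rating=2.5, days_since_launch=90))
```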
The search methods that we have discussed so far are all based on to-
ken matching. Although we have used a few techniques, such as stem-
ming, that deal with specific features of natural language, we have es-
sentially reduced the search problem to a mechanical comparison of to-
kens. This approach, sometimes referred to as a syntactic search, works
very well in practice and is used as a core method in most search en-
gine implementations. Syntactic search, however, has limited ability for
modeling the features of natural language that go beyond individual
terms. The meaning of words in natural language is often dependent
on the context created by the preceding and succeeding words and sen-
tences, and several types of such dependencies exist. Most of them are
related to one of the following two categories:
polysemy Polysemy is the association of one word with multiple
meanings. For instance, the word wood can refer to the material
or an area of land filled with trees. Polysemy represents a seri-
ous problem for relevance because a user might have in mind one
meaning of a word (e. g., products made of wood) but the search
engine will return documents that use the same word with a
different meaning (e. g., products that are related to woodlands,
such as forestry equipment). We have already encountered the
problem of complex concepts that are expressed with phrases
that must be treated as a single token, for example, dress shoes.
This issue can be viewed as a particular case of polysemy be-
cause the meaning of the individual words depends on the con-
text, so the meaning of the word dress depends on whether it is
followed by the word shoes. A particular case of polysemy that is
very common in merchandising applications is the usage of vo-
cabulary words in brand and product names. An example of this
problem would be the brand name Blue, which is indistinguish-
able from the color blue in queries like blue jeans.
Our goal is to make these terms identical from the querying stand-
point, so that documents containing the terms sweet and confection are
retrieved for the query candy, and the other way around. This can be
done in a few ways, and each method has advantages and disadvan-
tages.
The first approach is contraction. One of the synonyms in the list is
assigned to be the principal, and all occurrences of other synonyms
are replaced by the principal. For example, we can choose to replace
all occurrences of sweet and confection with candy, both in documents
and queries. Note that a principal does not necessarily have to be a real
word; it can be a special token that never appears in the input texts but
is used as an internal representation of a synonym group. Thus, the
contraction approach works exactly like stemming. Contraction clearly
achieves the goal of making all synonyms identical from the query-
ing standpoint, but the downside is that it collapses all synonyms into
the principal, which makes frequently used synonym terms indistin-
guishable from rarely used ones. This can negatively impact TFIDF
calculations.
The alternative to contraction is expansion. The expansion strategy
replaces each synonym instance with a full list of synonyms:
best candy shop → [best] [candy] [sweet] [confection] [shop]
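The two strategies can be sketched as simple token filters; the synonym group and the choice of candy as the principal term below follow the example above.

```python
# A minimal sketch of the two synonym handling strategies as token filters.
GROUP = ["candy", "sweet", "confection"]
PRINCIPAL = {"sweet": "candy", "confection": "candy"}

def contract(tokens):
    # contraction: replace every synonym with the principal term of its group
    return [PRINCIPAL.get(t, t) for t in tokens]

def expand(tokens):
    # expansion: replace every synonym with the full list of group members
    out = []
    for t in tokens:
        out.extend(GROUP if t in GROUP else [t])
    return out

print(contract("best sweet shop".split()))   # ['best', 'candy', 'shop']
print(expand("best candy shop".split()))     # ['best', 'candy', 'sweet', 'confection', 'shop']
```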
generic terms. For instance, an item that contains the term fruitcake will
be included in a search results list for queries with the terms cake and
bakery.
learning methods. For example, the name of a famous athlete can connote
a certain sport, a type of sports equipment, or the brand that
the athlete promotes. Our next goal will be to develop methods that
are able to learn a thesaurus automatically.
The vector space model states that a document or query can be rep-
resented as a vector in a linear space of terms. In the light of our dis-
cussion of the polysemy and synonymy problems, we know that terms
can be ambiguous and redundant, so it can be the case that a document
representation that uses terms as dimensions is not particularly good
or is at least flawed. Indeed, we have already discussed that polysemy
and synonymy can be viewed as a mismatch between words and con-
cepts, which suggests that words are a convoluted representation that
conceals semantic relationships.
We can attempt to find a better representation by changing the ba-
sis for the document space. Conceptually, we would like to map doc-
uments and queries to vectors of real numbers, so that the ranking
scores can be calculated simply as a dot product between the query
and document representations:
$$q \to \mathbf{p}, \qquad d \to \mathbf{v}$$

$$\text{score}(q, d) = \mathbf{p} \cdot \mathbf{v} = \sum_{i=1}^{k} p_i v_i \tag{4.40}$$
analyzed to create a thesaurus, that is, to match words with their syn-
onyms or related words.
The key problem of word embedding is, of course, how to construct
the new vector representations. Conceptually, we want a vector space
that preserves semantic relationships: words and documents with sim-
ilar or related semantic meanings should be collocated or lie on the
same line, whereas documents that have different semantic meanings
should not be collocated or collinear, even if they contain the same (pol-
ysemic) words. If we could construct such a space, it would be possible
to overcome the limitations of the vector space model. First, it would
be possible to find relevant documents for a query even if a document
and query did not have common terms, that is, to tackle the synonymy
problem. Second, it would be possible to rule out semantically non-
relevant documents even if they nominally contained query terms. In-
tuitively, we can assume that the semantic space can be constructed by
analyzing word co-occurrences, either globally in a document or locally
in a sentence, and identifying groups of related words. The dimensions
of the semantic space can then be defined based on these groups, and,
consequently, the vector representations of individual documents and
words will be defined in terms of affinities to the groups. It turns out
that this simple idea is very challenging to implement, and there exist
a large number of methods that use very different mathematical tech-
niques. We continue this section with a detailed discussion of several
important approaches and concrete models.
Finally, a note about terminology. Word embedding is a relatively
new term, and many semantic analysis methods, including the latent
semantic analysis and probabilistic topic modeling described later in
this section, were not developed specifically for word embedding (in
the sense of equation 4.40), but for different purposes and based on
different considerations. Most of these methods are very generic and
powerful statistical methods used in a wide range of applications from
natural language processing to evolutionary biology. These methods,
however, can also be viewed as word embedding techniques. In this
section, we choose word embedding as the main theme because it is
a convenient way to connect different semantic methods to each other,
at least in the context of merchandising search. The reader, however,
should keep in mind that it is just one possible perspective; seman-
tic analysis methods are not limited to word embedding and search,
and neither is word embedding limited to search applications. Even
in the scope of algorithmic marketing, semantic analysis methods can
be applied to many uses, including automated product attribution, rec-
ommendations, and image search.
$$X = \begin{array}{c|cccc}
 & d_1 & d_2 & \cdots & d_n \\
\hline
t_1 & \text{tf}(t_1, d_1) & \text{tf}(t_1, d_2) & \cdots & \text{tf}(t_1, d_n) \\
t_2 & \text{tf}(t_2, d_1) & \text{tf}(t_2, d_2) & \cdots & \text{tf}(t_2, d_n) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
t_m & \text{tf}(t_m, d_1) & \text{tf}(t_m, d_2) & \cdots & \text{tf}(t_m, d_n)
\end{array} \tag{4.41}$$
$$X \approx L_k V_k^T \tag{4.42}$$
$$\min_{L_k,\, V_k} \; \lVert X - L_k V_k^T \rVert \tag{4.43}$$
Recall that the solution of this problem is given by the singular value
decomposition (SVD) that we discussed in Chapter 2. It is also very im-
portant that the matrices produced by the SVD algorithm are column-
orthonormal, which means that the k concept dimensions (the columns
of matrix Vk ) will be orthogonal to each other. This essentially means
that the original vector space model vectors (the rows of matrix X)
will be decorrelated by collapsing the strongly correlated vectors into
a single principal vector. This follows our intuitive expectation – the
frequently co-occurring terms correspond to highly correlated compo-
nents of the original term vectors (the rows of matrix X), so the decor-
relation is likely to merge co-occurring terms (potentially synonyms)
into one concept vector.
Let us now describe this process more formally. We first consider the
case of full SVD, in which the number of concept dimensions k is not
limited. The SVD algorithm breaks down the matrix into three factors:
$$X = U \Sigma V^T, \qquad
U = \begin{pmatrix} \mathbf{u}_1 & \cdots & \mathbf{u}_r \end{pmatrix}, \quad
\Sigma = \begin{pmatrix} \sigma_1 & & 0 \\ & \ddots & \\ 0 & & \sigma_r \end{pmatrix}, \quad
V = \begin{pmatrix} \mathbf{v}_1^T \\ \vdots \\ \mathbf{v}_n^T \end{pmatrix} \tag{4.44}$$

in which U is an m×r matrix, Σ is an r×r diagonal matrix of singular values, and V is an n×r matrix whose rows are the concept-space representations of the documents.
$$\cos(\mathbf{v}_i, \mathbf{v}_j) = \frac{\mathbf{v}_i \cdot \mathbf{v}_j}{\lVert \mathbf{v}_i \rVert \, \lVert \mathbf{v}_j \rVert} \tag{4.45}$$
Next, we need to convert the query into a vector in the basis of con-
cepts in order to calculate the cosine similarity between the query and
documents. This process is known as query folding. We can rearrange
equation 4.44 to express the document vectors as a function of the
term–document matrix:
$$V = X^T U \Sigma^{-1} \tag{4.46}$$

$$\mathbf{p} = q^T U \Sigma^{-1} \tag{4.47}$$

$$\text{score}(q, d_i) = \cos(\mathbf{p}, \mathbf{v}_i) = \frac{\mathbf{p} \cdot \mathbf{v}_i}{\lVert \mathbf{p} \rVert \, \lVert \mathbf{v}_i \rVert} \tag{4.48}$$
Equation 4.48 defines a new scoring method, latent semantic index-
ing (LSI) scoring, which can be used as an alternative to the standard
vector space model and TFIDF approach. The principal advantage of
LSI scoring over term-based methods is the ability to fetch documents
that do not explicitly contain query terms. For example, a concept vec-
tor can include the three key terms candy, sweet, and confection, which
$$X_k = U_k \Sigma_k V_k^T, \qquad
U_k = \begin{pmatrix} \mathbf{u}_1 & \cdots & \mathbf{u}_k \end{pmatrix}, \quad
\Sigma_k = \begin{pmatrix} \sigma_1 & & 0 \\ & \ddots & \\ 0 & & \sigma_k \end{pmatrix}, \quad
V_k = \begin{pmatrix} \mathbf{v}_1^T \\ \vdots \\ \mathbf{v}_n^T \end{pmatrix} \tag{4.49}$$

in which U_k is an m×k matrix, Σ_k is a k×k matrix, and V_k is an n×k matrix.
2 See Chapter 2 for a detailed discussion of the exact meaning of significance in the context
of SVD.
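As a rough illustration of the LSI pipeline, the sketch below computes a truncated SVD of a tiny term–document matrix, folds a query into the concept space with equation 4.47, and scores documents with equation 4.48. The matrix, the query, and the use of numpy are illustrative assumptions; the matrix is not the one used in example 4.4 below.

```python
import numpy as np

# A minimal sketch of LSI scoring (equations 4.44-4.49): truncated SVD of a toy
# term-document matrix followed by query folding and cosine scoring.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 1, 0]], dtype=float)          # rows: terms, columns: documents
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k].T   # truncated factors, eq. 4.49

q = np.array([1.0, 0.0, 0.0])                    # query vector in the term basis
p = q @ U_k @ np.linalg.inv(S_k)                 # query folding, equation 4.47

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(p, V_k[i]) for i in range(V_k.shape[0])]   # equation 4.48
print(scores)
```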
example 4.4
Filtering out some stop words and applying basic normalization and
stemming, we get the following term–document matrix:
$$X = \begin{array}{l|ccc}
 & d_1 & d_2 & d_3 \\
\hline
\text{chicago} & 1 & 0 & 1 \\
\text{chocolate} & 1 & 1 & 1 \\
\text{retro} & 1 & 0 & 1 \\
\text{candy} & 1 & 1 & 0 \\
\text{made} & 1 & 0 & 0 \\
\text{love} & 1 & 1 & 1 \\
\text{sweet} & 0 & 1 & 1 \\
\text{collection} & 0 & 1 & 0 \\
\text{mini} & 0 & 1 & 0 \\
\text{heart} & 0 & 1 & 0
\end{array} \tag{4.50}$$
$$U_2 = \begin{array}{l|cc}
 & \text{concept 1} & \text{concept 2} \\
\hline
\text{chicago} & -0.318 & 0.424 \\
\text{chocolate} & -0.486 & 0.018 \\
\text{retro} & -0.318 & 0.424 \\
\text{candy} & -0.333 & -0.148 \\
\text{made} & -0.166 & 0.257 \\
\text{love} & -0.488 & 0.018 \\
\text{sweet} & -0.320 & -0.239 \\
\text{collection} & -0.168 & -0.406 \\
\text{mini} & -0.168 & -0.406 \\
\text{heart} & -0.168 & -0.406
\end{array} \tag{4.51}$$

$$\Sigma_2 = \begin{pmatrix} 3.562 & 0 \\ 0 & 1.966 \end{pmatrix} \tag{4.52}$$

$$V_2 = \begin{array}{l|cc}
 & \text{concept 1} & \text{concept 2} \\
\hline
d_1 & -0.592 & 0.505 \\
d_2 & -0.598 & -0.798 \\
d_3 & -0.541 & 0.329
\end{array} \tag{4.53}$$
Query d1 d2 d3
Chicago 0.891 -0.510 0.806
Candy 0.183 0.969 0.338
Table 4.4: The final document scores for the example of LSA calculations.
Practical experience with LSA shows that it can actually outperform the basic vector space model in
many settings. In addition to that, LSA offers the following advantages:
synonyms A low-dimensional representation is able to capture syn-
onyms and semantic relationships. LSA can also estimate dis-
tances between words to generate a thesaurus that can be used
for synonym expansions in the standard TFIDF scoring, and
there exist specialized LSA-based methods to compute semantic
similarities, such as the correlated occurrence analogue to lexical
semantics (COALS) method [Rohde et al., 2006].
Figure 4.19: Graphical model representation of the pLSA model. The outer box
represents the repeated choice of documents. The inner box repre-
sents the repeated choice of topics and terms within a document
that contains md terms. The shaded circles correspond to the ob-
served variables; the unshaded one denotes the latent variable.
In addition to that, the pLSA model assumes that terms and docu-
ments are conditionally independent given the topic, that is
$$\Pr(t \mid d, z) = \Pr(t \mid z) \tag{4.56}$$
A joint probability model over D × T can be expressed as

$$\Pr(d, t) = \Pr(d)\,\Pr(t \mid d) \tag{4.57}$$
for which the conditional probability of the term within the docu-
ment can be expressed as a sum of the probabilities over all topics:
$$\Pr(t \mid d) = \sum_z \Pr(t, z \mid d) = \sum_z \Pr(t \mid d, z)\,\Pr(z \mid d) = \sum_z \Pr(t \mid z)\,\Pr(z \mid d) \tag{4.58}$$
The next step is to learn the unobserved probabilities and, thus, infer
the latent topics. Given a set of training documents D, the likelihood
function is defined as
$$L = \Pr(D, T) = \prod_{d, t} \Pr(d, t)^{\,n(d, t)} \tag{4.60}$$
$$\begin{aligned}
\max \quad & \log L \\
\text{subject to} \quad & \sum_t \Pr(t \mid z) = 1 \\
& \sum_d \Pr(d \mid z) = 1 \\
& \sum_z \Pr(z) = 1
\end{aligned} \tag{4.62}$$
Values Pr(d | z) are known from the model, but the query representation Pr(q | z) needs to be learned for each query. This can be achieved by fixing parameters Pr(t | z) and Pr(z) and fitting model 4.62 with respect to Pr(q | z). The similarity metric can then be used to score and rank the documents in the search results list.
$$X = U \Sigma V^T \tag{4.64}$$

$$P = L S R^T \tag{4.66}$$
can be different from the topic that this term is associated with in the
context of the second document
$$\arg\max_z \; \Pr(z \mid d_j, t)$$
[Figure: a context window example for the sentence the best chocolate cakes and candies in the town, in which the words surrounding a target word form its context.]
Figure 4.22: The design of a Word2Vec neural network for a single-word context.
the input vector contains only one non-zero element, so the result will
be identical to the corresponding row of the weight matrix:
$$\mathbf{h} = W^T \mathbf{x} = \mathbf{w}_k^T \tag{4.72}$$
$$s_i = \mathbf{v}_i^T \mathbf{h}, \qquad i = 1, \ldots, n \tag{4.74}$$
$$y_i = \Pr(t_i \mid t_k) = \frac{\exp(s_i)}{\sum_{j=1}^{n} \exp(s_j)} \tag{4.75}$$
$$\frac{\partial J}{\partial s_j} = \frac{\exp(s_j)}{\sum_{i=1}^{n} \exp(s_i)} - \mathbb{I}(j = k) = y_j - \mathbb{I}(j = k) \equiv e_j \tag{4.79}$$

in which 𝕀(j = k) is the indicator function equal to one if j = k and
zero otherwise. We can see that this derivative is simply the prediction
error, so we denote it as ej . By taking the derivative with respect to the
weights of the output layer, we find the gradient for weight optimiza-
tion:
$$\frac{\partial J}{\partial v_{ij}} = \frac{\partial J}{\partial s_j} \cdot \frac{\partial s_j}{\partial v_{ij}} = e_j \, h_i \tag{4.80}$$
This result means that we should decrease weight vij if the product
ej hi is positive and increase the weight otherwise. The stochastic
gradient descent equation for weights will thus be as follows:
$$\mathbf{v}_j^{(\text{new})} = \mathbf{v}_j^{(\text{old})} - \lambda \, e_j \, \mathbf{h}, \qquad j = 1, \ldots, n \tag{4.81}$$
$$\frac{\partial J}{\partial h_i} = \sum_{j=1}^{n} \frac{\partial J}{\partial s_j} \cdot \frac{\partial s_j}{\partial h_i} = \sum_{j=1}^{n} e_j \, v_{ij} \equiv \varepsilon_i \tag{4.82}$$

$$\boldsymbol{\varepsilon} = [\varepsilon_1, \ldots, \varepsilon_m] \tag{4.83}$$
We use this result and the stochastic gradient descent to update the
weights of the hidden layer, similarly to our method for equations 4.80
and 4.81 for the weights of the output layer. Taking into account the
fact that all xj values in equation 4.84 are zeros except for xk , we find
that only the k-th row of matrix W needs to be updated:
$$\mathbf{w}_k^{(\text{new})} = \mathbf{w}_k^{(\text{old})} - \lambda \, \boldsymbol{\varepsilon}^T \tag{4.85}$$
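The derivation above can be condensed into a short numerical sketch of one stochastic gradient descent step for the single-word-context model; the vocabulary size, embedding dimension, learning rate, and random initialization are arbitrary illustrative choices.

```python
import numpy as np

# A minimal sketch of one SGD step for the single-word-context Word2Vec model
# (equations 4.72-4.85). Sizes and learning rate are illustrative.
n, m, lam = 10, 4, 0.1
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n, m))    # input-to-hidden weights
V = rng.normal(scale=0.1, size=(m, n))    # hidden-to-output weights

def train_step(k, target):
    """Update W and V for input word index k and observed output word index target."""
    global W, V
    h = W[k]                               # hidden layer, equation 4.72
    s = h @ V                              # output scores, equation 4.74
    y = np.exp(s) / np.exp(s).sum()        # softmax probabilities, equation 4.75
    e = y.copy()
    e[target] -= 1.0                       # prediction errors, equation 4.79
    eps = V @ e                            # back-propagated error, equation 4.82
    V -= lam * np.outer(h, e)              # output layer update, equations 4.80-4.81
    W[k] -= lam * eps                      # hidden layer update, equation 4.85

train_step(k=3, target=7)
```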
Figure 4.23: The Word2Vec model for a context with multiple words.
The equation for the loss function remains the same, although it
represents a different conditional probability:
$$J = -\log \Pr(t_a \mid t_{k_1}, \ldots, t_{k_q}) = -s_a + \log \sum_{j=1}^{n} \exp(s_j) \tag{4.87}$$
$$\mathbf{w}_{k_j}^{(\text{new})} = \mathbf{w}_{k_j}^{(\text{old})} - \frac{\lambda}{q} \, \boldsymbol{\varepsilon}^T, \qquad j = 1, \ldots, q \tag{4.89}$$
example 4.5
[Figure: a two-dimensional projection of word vectors in which drink-related words (juice, tea, coffee, drink) and dessert-related words (eat, cake, pie, cookie) form two separate clusters.]
Thus far, we have discussed relatively generic search methods and their
applications in merchandising search. The challenge of merchandising
search, however, goes well beyond the tuning of standard methods
and requires the creation of more specialized search techniques for
Our first goal is to improve the precision of search results, given that
documents have many categorical fields that often contain compound
and polysemic terms. We can observe that having users write
structured Boolean queries would be a great solution to this
problem. For example, the free-text query pink rose sweater becomes
much less ambiguous if the user explicitly articulates fields and com-
pound terms:
and terms from the original free-text query and by searching for docu-
ments containing these artificial queries. A query generation algorithm
is designed to produce relatively restrictive search criteria, so that doc-
uments must strongly correlate with a query to match. This increases
the probability that documents are included in a search results list not
because of accidental matching of separate terms but because the doc-
ument attributes provide a really good coverage of query terms and
phrases. This methodology can be considered as a generalization of
shingling for multifield documents.
The first step of a combinatorial phrase search is to partition the
query into sub-phrases. Let us assume that a query entered by a user
is a sequence of n terms:
$$q = [\,t_1 \; t_2 \; \ldots \; t_n\,] \tag{4.91}$$

For example, a three-term query can be split into the following sub-phrase partitions:

$$[\,t_1 \; t_2 \; t_3\,], \qquad [\,t_1\,][\,t_2 \; t_3\,], \qquad [\,t_1 \; t_2\,][\,t_3\,] \tag{4.92}$$
Finally, the Boolean queries for all partitions are executed, and the fi-
nal search result set is obtained as a union of the search results from all
Boolean queries. This is equivalent to combining all partition queries
into one big Boolean query with the OR operator. The overall struc-
ture of this query is visualized in Figure 4.25. Our partition generation
algorithm does not try to recognize compound terms in a query and
just mechanically splits it into sub-phrases. Consequently, sub-phrases
are likely to be misaligned with the compound term boundaries. For
example, the query blue calvin klein jeans can be partitioned into the
sub-phrases blue calvin and klein jeans. By combining all of the parti-
tions together, we ensure that at least some of the splits capture the
compound terms correctly.
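The partition step can be sketched as follows; the query string is taken from the example above, and turning each partition into a Boolean query of phrase filters is left out of the sketch.

```python
from itertools import combinations

# A minimal sketch of the partition step of combinatorial phrase search: every
# split of the query into contiguous sub-phrases is generated; each partition
# can then be turned into a Boolean AND of phrase filters over the fields.
def partitions(terms):
    n, result = len(terms), []
    for r in range(n):
        for cuts in combinations(range(1, n), r):
            bounds = [0, *cuts, n]
            result.append([terms[a:b] for a, b in zip(bounds, bounds[1:])])
    return result

for p in partitions("blue calvin klein jeans".split()):
    print(p)   # 8 partitions, including [['blue'], ['calvin', 'klein'], ['jeans']]
```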
example 4.6
At the same time, the query will not match a product of a sweater
type and pink color unless the brand name is rose. Moreover, a combi-
natorial phrase search becomes even more restrictive as the length of
the query grows because all terms need to be covered. This behavior is
different from the standard vector space model that appreciates every
term match and, thereby, decreases in precision as the length of the
query increases.
The disadvantage of combinatorial phrase search is that the number
of partitions grows exponentially with the number of terms in the user
query and so does the number of statements in the resulting Boolean
query. In practice, the complexity of the Boolean query can often be
reduced by excluding some statements based on the field type. For
example, it can be the case that the field color has only a few valid
possible values, so filters like
Color = [sweater]
Color = [pink rose]
can be recognized as invalid and excluded from the generated Boolean
query. This helps to keep the search results consistent and use the avail-
able display space efficiently. The combinatorial approach, however,
has its own downsides. One of the most significant issues is that strict
matching can return an empty search results list if a query contains
misspelled words or some other unfortunate combination of terms that
cannot be covered by the available documents. This behavior is unde-
sirable because it leaves the user with an empty screen instead of a list
of products, which thereby decreases the probability of a conversion.
This problem can be addressed by taking additional actions if an
empty search result is returned by the basic combinatorial phrase
search algorithm. For example, we can first attempt to run a combi-
natorial search that requires exact field matching and then fall back
to the basic vector space model that allows partial field matches. We
can develop this idea further and create a chain of search methods
with gradually decreasing precision, with each method invoked
sequentially until at least one document (or some other minimum
number of documents) matches. For example, a chain could have the
following structure:
1. Exact match. Search documents with a standard combinatorial
phrase search without normalization or stemming.
Figure 4.27: Examples of search results lists for the query evening dresses and
different approaches to data modeling.
The two variants can be merged into one product-level document with
the following structure:
Brand: Samsonite
Name: Carry-on Hardside Suitcase
Color: red black
Size: small large
The result looks reasonable because the document is scored well for
queries like red suitcase, small suitcase, and so on. The major problem
with this approach is that it loses the structural information about
nested entities, which makes it impossible to distinguish valid attribute
combinations from invalid ones. The document above is scored well for
the query small red suitcase, and this is correct because one of the vari-
ants really is small and red. However, the same document is scored
equally well for the query small black suitcase. This is not correct be-
cause none of the variants are small and black at the same time, which
makes the product non-relevant for the user’s search intent. This prob-
lem is quite challenging from the implementation standpoint because
it cannot be solved purely in terms of plain documents and requires a
search engine either to explicitly support nested entities or to operate
with variant-level documents internally and then rework the results
to group variants into products. If product filtering is implemented
correctly, product-level results can substantially improve the merchan-
dising efficiency of a search service.
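A minimal sketch of variant-level matching with grouping back into products is shown below; the catalog structure and the exact-match predicate are illustrative assumptions.

```python
# A minimal sketch of variant-level matching with grouping of variants back into
# products, so that attribute combinations are validated at the variant level.
catalog = [
    {"product": "Carry-on Hardside Suitcase",
     "variants": [{"color": "red", "size": "small"},
                  {"color": "black", "size": "large"}]},
]

def search(query_attrs):
    results = []
    for item in catalog:
        # a product matches only if at least one variant satisfies all criteria
        matching = [v for v in item["variants"]
                    if all(v.get(k) == val for k, val in query_attrs.items())]
        if matching:
            results.append({"product": item["product"], "variants": matching})
    return results

print(search({"color": "red", "size": "small"}))    # matches the red small variant
print(search({"color": "black", "size": "small"}))  # no match: no such variant exists
```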
We have found that collapsing product variants into products can be
beneficial, so we can consider the possibility of collapsing products into
product collections as the next step. This question is more complicated
because a user can have different search intents and look for either
products or product collections. For example, a user who searches for
dinnerware is likely to expect collections, whereas a user who searches
for a cup is more likely to expect individual products. This implies
that we need to make a decision dynamically about grouping based
on the query and matched results. This problem can be approached by
introducing heuristic merchandising rules to analyze the structure of
the results and matching attributes and make a grouping decision. For
instance, we can choose to replace individual products by a collection
only if a collection is generally consistent with the query, that is, all or
almost all products in the collection and the collection-level attributes
match the query. Consider the example in Figure 4.28. The query white
cup is likely to match individual products or collections that include
only white cups, but not dinnerware sets with plates, bowls, or cups of
different colors. Consequently, we present a user with a search results
list that contains mainly individual products. On the other hand, the
query white dinnerware is likely to produce a different result. We can
[Figure 4.28: grouping decisions for the queries white cup and white dinnerware. The query white cup matches individual white cups but not a dinnerware collection that also contains plates, bowls, and cups of different colors; the query white dinnerware matches the collection-level attributes of a white dinnerware set.]
3. The search results for each benchmark query are manually ana-
lyzed, and the search algorithms are tuned to improve the rele-
vance metrics.
tune scoring formulas and assess the relevance of search results by an-
alyzing user behavior and interactions with the search service. We will
devote the next sections to a discussion of these two topics.
In practice, one can create a training data set by fetching the results
lists for each query by using conventional search methods and setting
relevance grades by using expert judgment. The goal is then to learn a
ranking model that predicts the grade y from the input that consists of
a query and a document.
Similarly to other supervised learning problems, learning to rank
starts with feature engineering. As we have already mentioned, the
relevance grade is predicted for a document in the context of a certain
query, so a feature vector depends on both the document and the query.
More specifically, the following groups of features are typically used
in practice [Chapelle and Chang, 2011; Liu and Qin, 2010]:
document features This type of feature includes statistics and
attributes of the document itself, such as the following:
• The basic document statistics, such as the number of terms.
These statistics can be calculated independently for each field
and for the entire document to produce several groups of
features.
• Product classification labels, such as product type, price category,
and so on.
• Dynamic attributes and web statistics. Examples of such features
include product sales data, user ratings, and newness.
• Web search implementations of learning to rank often include
web-graph and audience-related features, such as the number
of inbound and outbound links for a web page. Although these
may have limited applicability in merchandising search, such
metrics can be valid candidates if available.
• Various statistics calculated for terms that the query and docu-
ment have in common. For example, this can be a sum or vari-
ance of term frequencies or inverted document frequencies for
common terms. These metrics can be calculated for each docu-
ment field, as well as the entire document.
• Standard text matching and similarity metrics, such as the num-
ber of common terms and TFIDF .
• Statistics related to user feedback. This includes different inter-
action probabilities, such as the probability of click (the share of
users who clicked on a given document at least once among all
users who entered a certain query), probability of the last click
(the share of users who ended their search on a given document),
probability to skip (the share of users who click on a document
below the given one), and so on.
The structure of the feature vector is summarized in Figure 4.29. The
total number of features in practical applications can reach several hun-
dreds.
It can be shown that the DCG error is upper bounded by the classi-
fication and regression losses, and minimization of the loss functions
thus helps to optimize the DCG [Cossock and Zhang, 2006; Li et al.,
2007]. However, the pointwise approach has a major downside, re-
gardless of the choice of the loss function. The issue is that we are
concerned with the relative order of items in the search results list,
not in qualitative or quantitative estimates of individual grades. For
example, the pointwise approach does not recognize that we will per-
fectly rank a results list of four items with relevance grades (1,2,3,4),
even if the grades are predicted as (2,3,4,5). Consequently, we may
take a different view on the loss function to account for the relative
order of items.
The pointwise approach is used in a number of ranking algorithms,
including McRank [Li et al., 2007] and PRank [Crammer and Singer,
2001].
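A minimal sketch of the pointwise approach is shown below: a gradient boosting regressor (one possible choice of model, not necessarily the one used by the algorithms cited above) is trained to predict relevance grades, and candidate documents are ranked by the predicted grade. The random features and grades are placeholders for real query, document, and query–document features.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# A minimal sketch of pointwise learning to rank: predict relevance grades from
# query-document feature vectors and sort candidates by the predicted grade.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 20))          # query-document feature vectors
y_train = rng.integers(1, 5, size=500)        # editorial relevance grades

model = GradientBoostingRegressor().fit(X_train, y_train)

X_candidates = rng.normal(size=(10, 20))      # documents matched for a new query
ranking = np.argsort(-model.predict(X_candidates))
print(ranking)                                # candidate indices, best first
```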
$$L_0 = \sum_{q=1}^{m} \; \sum_{i, j \,:\, y_{q,i} > y_{q,j}} L\big(f(\mathbf{x}_{q,i}),\, f(\mathbf{x}_{q,j})\big) \tag{4.99}$$
$$y_{q,d} \in \{1, 2, \ldots, K\} \tag{4.102}$$
model training. This step often requires significant human effort and
also limits the system’s ability to self-tune dynamically. We can attempt
to work around this problem by inferring relevance grades automati-
cally based on user interactions with search results. For example, the
results that nobody clicks on are likely to be irrelevant. One possible
way to leverage this information is to incorporate it into feature vectors,
as we already did in the previous section. We could take a step further
and attempt to develop a method that learns the relevance grades from
implicit feedback.
Although it is intuitively clear that users tend to click on relevant
search results and skip irrelevant ones, user behavior can communi-
cate more sophisticated relevance relationships. For example, a user
can enter a search query, browse the results, click through some of
them, reformulate the query, and click through some of the new re-
sults. All queries and documents in such a scenario are related to a
single search intent, so relevance relationships can be established both
within a single search results list and across the queries. In this sec-
tion, we consider a feedback model that captures such relationships
by using several heuristic rules [Radlinski and Joachims, 2005]. This
particular model comes from academic research, although loosely sim-
ilar methods for learning from implicit feedback have been reported
by Yahoo [Zhang et al., 2015].
The model we will consider has two groups of relevance feedback
rules. The first group, illustrated in Figure 4.30, includes two rules that
are applied in the scope of a single search query. The first rule states
that, if a user clicks on some document in the result list, this document
is more relevant than all the documents above with regard to a given
query. This is based on the assumption that a user typically reads the
results from top to bottom. The second rule is based on empirical ev-
idence (including eye-tracking studies) that a user typically considers
at least the top two results in the list before taking an action. Conse-
quently, if a user clicks on the first document in the list, it is considered
more relevant than the second one (with regard to a given query).
The second group of rules is applied to query chains, that is, query
sequences that represent different formulations of the same search in-
tent. This first requires the detection of queries that belong to the same
chain. This problem is not trivial because a user can make multiple
queries with only one search intent but formulate it differently or can
make multiple unrelated queries in a search for completely different
products. The implicit feedback model that we consider approaches
this problem by building an additional classifier that predicts whether
a pair of queries belong to the same chain or not. The model is trained
with manually classified query pairs and uses features, such as time
Figure 4.30: Implicit feedback rules for documents within one results list.
interval between queries, the number of common terms, and the num-
ber of common documents in the corresponding result lists. Once the
queries are grouped into query chains, we can introduce four addi-
tional relevance rules that can be applied to pairs of search results lists.
All of these rules are based on the assumption that queries in the chain
express the same search intent and, hence, can be considered equiva-
lent.
The first two rules in this group are presented in Figure 4.31. They
repeat the two single-query rules that we considered earlier but with
regard to adjacent queries in a query chain. Consider a chain in which
query q1 is followed by query q2 . Rule 3 mirrors rule 1 by stating that
a clicked document in the result list for query q2 is more relevant than
the preceding skipped documents with regard to query q1 because
both queries are related to one search intent. Similarly, rule 4 mirrors
rule 2.
The last two rules are shown in Figure 4.32. These rules establish
relevance relationships between the documents from different results
lists in a query chain. Rule 5 states that documents that are viewed
but not clicked on in the results list for query q1 are less relevant
than the documents that are clicked on in the results set for query
q2 . This relevance relationship is established with regard to the earlier
query. Consistently with rules 1 and 2, documents are considered to be
viewed if they are above the clicked ones or right below the last clicked
document, like document d3 in Figure 4.32. Finally, rule 6 states that
documents clicked on in the later results list are more relevant than
the first two documents in the former list. This rule is based on the
assumption that a user analyzes at least the first two results in the list
before reformulating a query.
All six rules are simultaneously evaluated against each query chain
in the training data set to produce relevance relationships in the form
$$d_i \succ_q d_j \tag{4.106}$$
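The sketch below illustrates how rules 1 and 2 could be applied to a single results list to extract preference pairs of the form 4.106; the click log format and the decision to pair a clicked document only with the non-clicked documents ranked above it are simplifying assumptions.

```python
# A minimal sketch of extracting preference pairs from a single results list by
# applying rules 1 and 2 described above.
def preference_pairs(results, clicked):
    """results: ranked document ids; clicked: set of clicked document ids."""
    pairs = []
    for pos, doc in enumerate(results):
        if doc in clicked:
            # rule 1: a clicked document is preferred to the documents above it
            pairs += [(doc, above) for above in results[:pos] if above not in clicked]
    # rule 2: a click on the first result implies preference over the second one
    if len(results) > 1 and results[0] in clicked and results[1] not in clicked:
        pairs.append((results[0], results[1]))
    return pairs

print(preference_pairs(["d1", "d2", "d3", "d4"], clicked={"d3"}))
# [('d3', 'd1'), ('d3', 'd2')]
```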
4.9 summary
two sets of features to produce relevance signals, and then uses the
signals to make ranking decisions.
[Figure: products sorted by popularity form a long-tail demand distribution, contrasting the few popular products served by traditional channels with the many niche products that digital channels can also serve.]
5.1 environment
The basic settings for recommendation services are similar to those for
search services. Similarly to search services, the primary purpose of a
recommender system is to provide a customer with a ranked list of rec-
ommended items. These recommendations can be delivered through
different marketing channels. We will assume that the recommenda-
tions are requested by a channel in real time, which is typically the
case for websites and mobile applications, although some channels
such as email can have more relaxed requirements and allow recom-
mendations to be calculated offline. The basic inputs of a recommender
system, depicted in Figure 5.2, are, however, different from those of a
search service and include the following:
[Figure 5.2: a marketing channel requests recommendations with a context and criteria; the recommender system loads the item catalog and the rating matrix and provides a ranked list of recommended items.]
unary In many cases, a rating matrix does not capture the level of
affinity between a user and item but merely registers the fact of
an interaction. For example, many interfaces have only one Like
button, so a user can either like an item or provide no feedback at
all. Another typical example of unary ratings is implicit feedback
that registers interactions between users and items but does not
capture details, such as the number of purchases, although it can
be argued that the simple quantitative properties of this feedback
are important [Hu et al., 2008]. The elements of a unary rating
matrix can take only two values – either one or unspecified.
The rating values in matrix R are often attributed with contextual
information, such as the date and time that the rating was set, the
marketing channel used by the customer to provide the rating, and so
on. This information can be used by a recommender system to figure
out which rating values are most relevant for a given context.
An important observation about rating values is that a matrix with
explicit ordinal ratings also contains implicit feedback. The reason is that
users typically tend to rate the items they like and avoid the items they
find unattractive. For example, a user can totally avoid music of cer-
tain genres or products from certain categories. Consequently, which
items are rated is important, in addition to how they are rated. In other
words, the distribution of ratings for random items is likely to be dif-
ferent from the distribution of ratings for items selected by a user. This
means that a recommender system, strictly speaking, should not rely
on the assumption that the distribution of observed ratings is represen-
tative of the distribution of missing ones [Devooght et al., 2015]. As we
will see later, some advanced recommendation methods take this con-
sideration into account and infer the implicit feedback from the rating
matrix. In a more general case, a recommender algorithm can involve
two separate rating matrices for both explicit and implicit feedback.
The second important property of a rating matrix is sparsity. A rat-
ing matrix is inherently sparse because any single user interacts with
only a tiny fraction of the available items, so that each row of the ma-
trix contains only a few known ratings and all other values are miss-
ing. Moreover, the distribution of known ratings typically exhibits the
long-tail property that we have discussed earlier. This means that a
disproportionally large number of the known ratings correspond to a
few of the most popular items, whereas niche product ratings are es-
pecially scarce. This property can be illustrated by a well-known data
$$\text{Profit} = \sum_{\text{products}} \text{Quantity sold}_{\text{product}} \times \text{Margin}_{\text{product}} \tag{5.1}$$
Our next step is to design quantitative metrics that can be used to eval-
uate the quality of recommender systems with regard to the objectives
defined in the previous section.
The quality of search results can generally be evaluated by using
expert judgement to score the relevance of items in the context of a
given query, but this approach has limited applicability for recommen-
dations because the context typically includes user profile data and,
hence, is unique for each user. This makes it challenging or impossi-
ble to manually score the quality of recommendations for every pos-
sible context. On the other hand, the rating matrix already contains
expert judgements provided by users in their own personalized con-
texts. Consequently, the recommendation problem can be viewed as a
train the model and evaluate its quality. In the case of classification,
this is done on a row-by-row basis. For example, the available data in
matrix 5.2 can be split into three sets by assigning the first row to the
training set, the second row to the validation set, and the third row
to the test set. This approach does not work well for matrix comple-
tion because it implies that the model is trained on one set of users
and evaluated on another, which is not really the case. Instead, the rat-
ing matrix is typically sampled on an element-by-element basis. This
means that a certain fraction of known ratings is removed from the
original rating matrix to leave a training matrix, and the removed rat-
ings are placed into validation or testing sets, which are later used to
evaluate the quality of prediction.
By interpreting the recommendation problem as a rating prediction
problem, we can define several quality metrics that can be linked to the
business objectives. We will spend the next sections developing these
metrics.
error (RMSE), which is measured in the same units as the original rat-
ings:
$$\text{RMSE} = \sqrt{\text{MSE}} \tag{5.6}$$

$$\text{NRMSE} = \frac{\text{RMSE}}{r_{\max} - r_{\min}} \tag{5.7}$$
The RMSE and its variations are widely used in practice for recom-
mender system evaluation as a result of their simplicity. However, the
RMSE and similar pointwise accuracy metrics have several important
drawbacks:
• As we discussed earlier, ratings typically follow a long-tail distri-
bution, which means that ratings are dense for the popular items
and sparse for those items from the long tail. This makes rating
prediction more challenging for the long-tail items relative to the
popular items and results in different prediction accuracies for
these two item groups. The RMSE does not differentiate between
these two groups and simply takes the average, so poor accuracy
for the long-tail items can be counterbalanced by high accuracy
for the popular items. To measure and control this trade-off, we
can calculate the RMSE separately for different item groups or
add item-specific weights into equation 5.5 to account for item
margins or other considerations.
$$\text{precision}(K) = \frac{|Y_u(K) \cap I_u|}{|Y_u(K)|} \tag{5.8}$$

$$\text{recall}(K) = \frac{|Y_u(K) \cap I_u|}{|I_u|} \tag{5.9}$$

$$\text{DCG} = \sum_{i=1}^{K} \frac{2^{g_i} - 1}{\log_2(i + 1)} \tag{5.10}$$
in which gi is the relevance grade of the i-th item in the list. If test set
T contains ratings provided by m users, we can define the overall DCG
as the average of the DCGs for the recommendation lists for individual
users:
$$\text{DCG} = \frac{1}{m} \sum_{u=1}^{m} \; \sum_{\substack{i \in I_u \\ R_{ui} \leq K}} \frac{2^{r_{ui}} - 1}{\log_2(R_{ui} + 1)} \tag{5.11}$$
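The following sketch computes precision, recall, and DCG at rank K for a single user according to equations 5.8–5.10; the recommendation list, the set of relevant items, and the held-out grades are illustrative placeholders.

```python
import math

# A minimal sketch of the ranking quality metrics defined above, computed for a
# single user. `recommended` is the ranked list produced by the system,
# `relevant` is the set of held-out test items, and `grades` holds the held-out
# ratings used as relevance grades.
def precision_recall_at_k(recommended, relevant, k):
    top_k = set(recommended[:k])
    hits = len(top_k & relevant)
    return hits / k, hits / len(relevant)                    # equations 5.8 and 5.9

def dcg_at_k(recommended, grades, k):
    return sum((2 ** grades.get(item, 0) - 1) / math.log2(i + 2)
               for i, item in enumerate(recommended[:k]))    # equation 5.10

recommended = ["i1", "i2", "i3", "i4"]
print(precision_recall_at_k(recommended, relevant={"i2", "i9"}, k=3))
print(dcg_at_k(recommended, grades={"i1": 3, "i2": 5}, k=3))
```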
5.3.3 Novelty
5.3.4 Serendipity
5.3.5 Diversity
5.3.6 Coverage
Thus far, we have described the environment and data sources that a
recommender system is integrated with, its business objectives, and the
metrics that can be used to evaluate the quality of recommendations.
This provides a reasonably solid foundation for the design of recom-
mendation algorithms. This task can be approached from several dif-
ferent perspectives, and there are several families of recommendation
methods that differ in the data sources leveraged to make recommen-
dations (rating matrix, catalog data, or contextual information) and
the type of rating prediction model. Although we will methodically go
through all major categories of recommendation algorithms in the rest
of this chapter, it will be worthwhile to briefly review the classification
of recommendation methods and make a few general comments before
we dive deeper into the details of individual methods.
Recommendation methods can be categorized in a number of ways,
depending on the perspective taken. From the algorithmic and informa-
tion retrieval perspectives, recommendation methods are categorized
primarily by the type of predictive model and its inputs. The corre-
sponding hierarchy is shown in Figure 5.4. Historically, the two main
families of recommendation methods are content-based filtering and
collaborative filtering. Content-based filtering primarily relies on con-
tent data, such as textual descriptions of items, and collaborative filter-
ing primarily relies on patterns in the rating matrix. Both approaches
can use either formal predictive models or heuristic algorithms that
typically search for a neighborhood of similar users or items. In addi-
tion to these core methods, there is a wide range of solutions that can
combine the multiple core algorithms into hybrid models, extend them
to account for contextual data and secondary optimization objectives,
or make recommendations in settings where the core methods are not
optimal because, for example, of a lack of data for personalization. We
will thoroughly analyze each of these approaches in the following sec-
tions.
[Figure 5.4: the hierarchy of recommendation methods. Core methods: content-based filtering (neighborhood-based, model-based) and collaborative filtering (neighborhood-based: user-based, item-based, combined; model-based: regression, latent factors). Additional methods: hybrid (switching, blending, feature augmentation), contextual (pre-filtering, post-filtering, contextual modeling), non-personalized (most-popular, neighborhood-based), and multi-objective.]
Figure 5.5: Some typical recommendation usage cases and the corresponding
categories of recommendation methods.
[Figure: a content-based recommender estimates the score of a candidate item from the content similarity between the item and the items rated in the user's profile.]
1 See Chapter 4 for a detailed discussion of such measures. One of the most common exam-
ples would be the TF IDF distance between textual descriptions discussed in Section 4.3.5.
[Figure: the architecture of a content-based recommender system: a content analyzer extracts item features, a profile learner builds the user profile from rated items and feedback, and the profile model scores candidate items to produce predicted ratings and recommendations.]
• New and rare items. A particular case of the cold-start problem is the
recommendation of new or rare items that have a few or no ratings.
A recommendation algorithm that heavily relies on ratings may not
be able to recommend such items, which negatively impacts the cov-
erage of the catalog. Content filtering is not sensitive to this problem
because it relies on content similarity. This capability is especially
important in the context of the long-tail property that we discussed
earlier – the catalog often contains many rare items that receive few
ratings, even over a long period. The same issue appears in domains
with a rapidly changing assortment, such as apparel stores, where it
can be difficult to accumulate enough statistics about items.
and Pr(w_i | c_j) is the empirical conditional probability of word w_i
(the fraction of documents of class c_j that contain this word). This ba-
sic Bayes rule needs to be extended to accommodate the multiple fields
that we have in each item document. By assuming that each item doc-
ument has F fields and each field f_qm is a text snippet that contains
|f_qm| words, we can rewrite formula 5.17 for the posterior class proba-
bility as follows:
$$\Pr(c_j \mid d) = \frac{\Pr(c_j)}{\Pr(d)} \prod_{m=1}^{F} \prod_{w_i \in f_m} \Pr(w_i \mid c_j, f_m) \tag{5.18}$$
The ranking score of the item can then be estimated as
$$\text{score}(d) = \frac{\Pr(c_1 \mid d)}{\Pr(c_0 \mid d)} \tag{5.19}$$
$$\Pr(c_j) = \frac{1}{Q} \sum_{q=1}^{Q} \alpha_{qj}, \qquad j = 0, 1 \tag{5.22}$$

$$\Pr(w_i \mid c_j, \text{field } m) = \frac{1}{L_{jm}} \sum_{q=1}^{Q} \alpha_{qj} \, n_{qm}(w_i), \qquad m = 1, \ldots, F \tag{5.23}$$
in which Ljm is the total weighted length of the texts in field m for
class j:
$$L_{jm} = \sum_{q=1}^{Q} \alpha_{qj} \, |f_{qm}|, \qquad m = 1, \ldots, F \tag{5.24}$$
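A compact sketch of the resulting profile learner and ranking score is shown below; the item texts, the class weights alpha, and the smoothing constant for unseen words are illustrative assumptions rather than an exact reproduction of the worked example that follows.

```python
from collections import Counter

# A minimal sketch of the Naive Bayes profile learner defined by equations
# 5.22-5.24 and the scoring ratio 5.19. alpha[q] = (negative, positive) class
# weights per rated item; all texts and weights below are illustrative.
def fit(items, alpha, fields):
    Q = len(items)
    prior = [sum(a[j] for a in alpha) / Q for j in (0, 1)]            # equation 5.22
    cond = {}
    for m in fields:
        for j in (0, 1):
            length = sum(alpha[q][j] * len(items[q][m].split())
                         for q in range(Q))                            # equation 5.24
            counts = Counter()
            for q in range(Q):
                for w in items[q][m].split():
                    counts[w] += alpha[q][j]
            cond[m, j] = {w: c / length for w, c in counts.items()}    # equation 5.23
    return prior, cond

def score(item, prior, cond, fields, eps=1e-6):
    ratio = prior[1] / prior[0]
    for m in fields:
        for w in item[m].split():
            ratio *= cond[m, 1].get(w, eps) / cond[m, 0].get(w, eps)   # equation 5.19
    return ratio

items = [{"title": "machine learning analytics", "synopsis": "customer behavior prediction"},
         {"title": "machine learning healthcare", "synopsis": "healthcare case studies"}]
alpha = [(2 / 9, 7 / 9), (7 / 9, 2 / 9)]
prior, cond = fit(items, alpha, fields=["title", "synopsis"])
print(score({"title": "predictive analytics", "synopsis": "customer data"},
            prior, cond, ["title", "synopsis"]))
```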
example 5.1
Book 2
Title: Machine learning for healthcare and life science
Synopsis: Case studies specific to the challenges of
working with healthcare data
Rating: 3
We first convert the textual fields into bags of words and remove
stop words to obtain the following:
$$\alpha_{11} = \frac{8 - 1}{9} = \frac{7}{9} \tag{5.25}$$

$$\alpha_{21} = \frac{3 - 1}{9} = \frac{2}{9} \tag{5.27}$$

and, accordingly, the negative-class weights are α_10 = 1 − α_11 = 2/9 and α_20 = 1 − α_21 = 7/9. The weighted length of the synopsis texts for the positive class is then

$$L_{1,\,\text{synopsis}} = \alpha_{11}\,|\text{synopsis}_1| + \alpha_{21}\,|\text{synopsis}_2| = \frac{28}{3}$$
Finally, we can estimate the conditional probabilities of words by
using expression 5.23. As an illustration, let us estimate the conditional
probability of the word price in the field synopsis, given the negative
class:
$$\Pr(\textit{price} \mid c = 0,\ \text{field} = \text{synopsis}) = \frac{\frac{2}{9} \cdot 1 + \frac{7}{9} \cdot 0}{23/3} = \frac{2}{69} \approx 0.029$$
Table 5.1: Conditional word probabilities for the title and synopsis fields. Columns: title (c = 0), title (c = 1), synopsis (c = 0), synopsis (c = 1).
analytics 0.044 0.160 0.029 0.083
applications 0.000 0.000 0.029 0.083
behavior 0.000 0.000 0.029 0.083
case 0.000 0.000 0.100 0.024
challenges 0.000 0.000 0.100 0.024
customer 0.000 0.000 0.029 0.083
data 0.044 0.160 0.130 0.110
detailed 0.000 0.000 0.029 0.083
healthcare 0.160 0.044 0.100 0.024
including 0.000 0.000 0.029 0.083
learning 0.200 0.200 0.000 0.000
life 0.160 0.044 0.000 0.000
machine 0.200 0.200 0.000 0.000
prediction 0.000 0.000 0.029 0.083
predictive 0.044 0.160 0.000 0.000
price 0.000 0.000 0.029 0.083
science 0.160 0.044 0.000 0.000
specific 0.000 0.000 0.100 0.024
studies 0.000 0.000 0.100 0.024
treatment 0.000 0.000 0.029 0.083
working 0.000 0.000 0.100 0.024
Table 5.1 provides a few useful insights into the logic of the Naive
Bayes recommender. First, we can see that the words that occur in both
positively and negatively rated items cancel each other out. For ex-
ample, both books have the words machine learning in their titles, so
each of these words has an equal probability value for the positive
and negative classes, and these values will cancel each other out in ra-
tio 5.19 used for scoring. Second, we can see that words present in the
attributes of the negatively scored book (e. g., healthcare) are interpreted
as negative signals, in the sense that their probability values for the neg-
ative class are higher than those for the positive class. Similarly, some
other words (e. g., behavior) are interpreted as positive signals. In prac-
tice, this interpretation may or may not be correct. In our example, the
user has disliked the second book about machine learning for health-
care. We do not really know the reason – it may be that this particular
book is not well written or the healthcare domain is not relevant for
the user. If it is assumed that the user has rated the book after purchas-
ing and reading it, the first explanation is probably more likely than
the second one because the user was almost certainly aware that the
book was about the healthcare domain and chose it deliberately. The
Naive Bayes model, however, interprets the word healthcare as a nega-
tive signal, so all books with this word in the title will be scored lower.
This limited ability to differentiate between the quality and relevance
of content is one of the major shortcomings of content-based filtering.
As we will see later, collaborative filtering takes a different approach
to the problem and places more emphasis on item quality signals.
N
Recommendations can also be biased towards popular items and standard
choices. This limits the ability to produce non-trivial recommendations
and recommendations for users with unusual tastes.
A typical rating matrix exhibits strong user and item biases – some
users systematically give higher (or lower) ratings than other users,
and some items systematically receive higher ratings than other items
[Koren, 2009; Ekstrand et al., 2011]. This can be explained by the fact
that some users can be more or less critical than others, and items, of
course, differ in their quality. We can account for these systematic user
and item effects by defining the baseline estimate for an unknown user
rating rui as
b_{ui} = \mu + b_u + b_i    (5.32)

in which \mu is the global average rating. The item biases can be estimated as

b_i = \frac{1}{|U_i|} \sum_{u \in U_i} \left( r_{ui} - \mu \right)    (5.33)

in which U_i is the set of users who rated item i. Then, the user biases are estimated as

b_u = \frac{1}{|I_u|} \sum_{i \in I_u} \left( r_{ui} - \mu - b_i \right)    (5.34)

in which I_u is the set of items rated by user u.
If a damping term is added to the denominators of these averages, then, for users and items with few ratings, the baseline estimate described by equation 5.32 becomes closer to the global mean and less dependent on the unreliable bias estimates.
The bias parameters can be estimated more accurately by solving the
following least squares problem [Koren, 2009]:
\min_{b_i,\, b_u} \; \sum_{(i,u) \in R} \left( r_{ui} - \mu - b_i - b_u \right)^2 + \lambda \left( \sum_u b_u^2 + \sum_i b_i^2 \right)    (5.36)
example 5.2
User 1 5 4 — 1 2 1
User 2 4 — 3 1 1 2
User 3 — 5 5 — 3 3
User 4 2 — 1 4 5 4
User 5 2 2 2 — 4 —
User 6 1 2 1 — 5 4
Let us now calculate the baseline estimates for the missing ratings.
Calculating the global average and bias values by using formulas 5.33
and 5.34, we get
\mu = 2.82
b_i = (-0.02,\; 0.42,\; -0.42,\; -0.82,\; 0.51,\; -0.02)    (5.37)
b_u = (-0.23,\; -0.46,\; 1.05,\; 0.53,\; -0.44,\; -0.31)
User 1 5 4 [ 2.16 ] 1 2 1
User 2 4 [ 2.78 ] 3 1 1 2
User 3 [ 3.85 ] 5 5 [ 3.05 ] 3 3
User 4 2 [ 3.78 ] 1 4 5 4
User 5 2 2 2 [ 1.55 ] 4 [ 2.35 ]
User 6 1 2 1 [ 1.68 ] 5 4
Table 5.4: Baseline estimates for the missing ratings (shown in brackets).
N
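The baseline computation is easy to verify numerically. The following numpy sketch applies formulas 5.32–5.34 to the rating matrix of example 5.2 and should reproduce, up to rounding, the values in expression 5.37 and the bracketed estimates in table 5.4.

```python
import numpy as np

# Rating matrix from example 5.2 (np.nan marks missing ratings)
R = np.array([
    [5, 4, np.nan, 1, 2, 1],
    [4, np.nan, 3, 1, 1, 2],
    [np.nan, 5, 5, np.nan, 3, 3],
    [2, np.nan, 1, 4, 5, 4],
    [2, 2, 2, np.nan, 4, np.nan],
    [1, 2, 1, np.nan, 5, 4],
], dtype=float)

mu = np.nanmean(R)                       # global mean rating
b_i = np.nanmean(R - mu, axis=0)         # item biases, formula 5.33
b_u = np.nanmean(R - mu - b_i, axis=1)   # user biases, formula 5.34
baseline = mu + b_u[:, None] + b_i       # baseline estimates, formula 5.32

print(np.round(mu, 2))        # ~2.82
print(np.round(b_i, 2))       # ~[-0.02, 0.42, -0.42, -0.82, 0.51, -0.02]
print(np.round(b_u, 2))       # ~[-0.23, -0.46, 1.05, 0.53, -0.44, -0.31]
print(np.round(baseline, 2))  # compare with the bracketed values in table 5.4
```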
(Figure: the user-based neighborhood approach, showing the most similar users and their known ratings in the rating matrix R.)
In the item-based approach, the neighborhood of an item consists of items that were similarly rated by the same users. Next, the rating that a given user will
give to this item is estimated based on the ratings that the user gave
to other items in the neighborhood. Once again, the key assumption is
that a user who positively rated a few items in the past will probably
like items that are rated similarly to these past choices by many other
users.
(Figure: the item-based neighborhood approach, showing items with known ratings from the same users in the rating matrix R.)
I_{uv} = I_u \cap I_v    (5.38)
Note that formula 5.40 computes the average user rating over the
set of common items Iuv , as required by the definition of the Pearson
correlation coefficient. Hence, this value is not constant for a given user
u but is unique for each pair of users. In practice, however, it is quite
common to use the global average rating for user u computed over all
items Iu rated by this user [Aggarwal, 2016].
The similarity measure allows us to identify k users who are most
similar to the target user u. The size of the neighborhood k is a param-
eter of the recommendation algorithm. As our goal is to predict the
rating for user u and item i by averaging the ratings given for this item
by other users, we select not simply the top k most similar users but
the top k most similar users who have rated item i. Let us denote this
set of peers as S_{ui}^k. This set can include fewer than k users if the rating
matrix does not contain enough ratings for item i or enough peers of
user u with commonly rated items. The rating can then be estimated
as a similarity-weighted average of the peer ratings:
\hat{r}_{ui} = \mu_u + \frac{\sum_{v \in S_{ui}^k} \text{sim}(u, v)\, \left( r_{vi} - \mu_v \right)}{\sum_{v \in S_{ui}^k} \left| \text{sim}(u, v) \right|}    (5.41)
Formula 5.41 exploits the idea of separating the user biases from the
interaction signal, which we discussed earlier in the section devoted
to baseline estimates. The global user averages µu and µv are consid-
ered biases and are initially subtracted from the raw ratings, then the
interaction signal is estimated as a product of similarity measures and
mean-centered ratings, and finally the user bias µu is added back to
account for the preferences of the target user.
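A compact implementation of this prediction rule is sketched below. It computes Pearson similarities over common items and applies formula 5.41; note that, depending on whether the averages inside the similarity are taken over common items or over all items rated by a user (both variants are mentioned above), the numbers can differ slightly from those in example 5.3.

```python
import numpy as np

def pearson_sim(R, u, v):
    """Pearson correlation between users u and v over their common items."""
    common = ~np.isnan(R[u]) & ~np.isnan(R[v])
    if common.sum() < 2:
        return 0.0
    x, y = R[u, common], R[v, common]
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float(xc @ yc / denom) if denom > 0 else 0.0

def predict_user_based(R, u, i, k=2):
    """Similarity-weighted average of mean-centered peer ratings (formula 5.41)."""
    mu = np.nanmean(R, axis=1)                      # per-user average ratings
    peers = [v for v in range(R.shape[0]) if v != u and not np.isnan(R[v, i])]
    peers.sort(key=lambda v: pearson_sim(R, u, v), reverse=True)
    peers = peers[:k]                               # the peer set S_ui^k
    num = sum(pearson_sim(R, u, v) * (R[v, i] - mu[v]) for v in peers)
    den = sum(abs(pearson_sim(R, u, v)) for v in peers)
    return mu[u] + num / den if den > 0 else mu[u]

# Example: with the rating matrix R from example 5.2, predict_user_based(R, 0, 2, k=2)
# estimates user 1's rating of the third item; compare with example 5.3.
```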
\text{sim}'(u, v) = \frac{|I_{uv}|}{|I_{uv}| + \lambda} \cdot \text{sim}(u, v)    (5.43)
w_i = \log \frac{n}{|U_i|}    (5.44)
\sigma_u = \sqrt{ \frac{ \sum_{i \in I_u} \left( r_{ui} - \mu_u \right)^2 }{ |I_u| - 1 } }    (5.48)
example 5.3
We can see that the first three users are positively correlated to each
other and negatively correlated with the last three. The similarity ma-
trix allows one to look up a neighborhood of the top k most similar
users for a given target user and mix their ratings to make a prediction.
For example, let us predict the missing rating for user 1 and The Godfa-
ther movie by assuming the neighborhood size k = 2. The most similar
neighbors are users 3 and 2 who rated The Godfather as 5 and 3, respec-
tively. Applying rating prediction formula 5.41, we get the following
estimate:
\hat{r}_{13} = \mu_1 + \frac{\text{sim}(1,3)\,(r_{33} - \mu_3) + \text{sim}(1,2)\,(r_{23} - \mu_2)}{\left|\text{sim}(1,3)\right| + \left|\text{sim}(1,2)\right|}
= 2.60 + \frac{0.94\,(5 - 4.00) + 0.87\,(3 - 2.20)}{0.94 + 0.87} = 3.50    (5.53)
By repeating this process for all missing ratings, we obtain the results
shown in table 5.5. Note that these estimates look more intuitive and
accurate than the baseline estimates in table 5.4.
N
User 1 5 4 [ 3.50 ] 1 2 1
User 2 4 [ 3.40 ] 3 1 1 2
User 3 [ 6.11 ] 5 5 [ 2.59 ] 3 3
User 4 2 [ 2.64 ] 1 4 5 4
User 5 2 2 2 [ 3.62 ] 4 [ 3.61 ]
User 6 1 2 1 [ 3.76 ] 5 4
Table 5.5: Example of ratings predicted with the user-based collaborative filter-
ing algorithm.
U_{ij} = U_i \cap U_j    (5.54)
in which µi and µj are the average ratings for items i and j, respec-
tively. This formula is the same as the Pearson correlation for users in
equation 5.39; the only difference is that users (rows) are replaced by
items (columns). All items rated by user u can then be sorted by their
similarity to given item i, and the top k most similar items can be se-
lected from this list. Let us denote this neighborhood of item i as Q_{ui}^k. Note that the neighborhood includes only the items rated by the target user u and not the most similar items in the catalog in general, so the set Q_{ui}^k converges to I_u as k increases. The rating can then be predicted as
a weighted average of ratings of the top k most similar items by using
mean-centered ratings as inputs:
\hat{r}_{ui} = \mu_i + \frac{\sum_{j \in Q_{ui}^k} \text{sim}(i, j)\, \left( r_{uj} - \mu_j \right)}{\sum_{j \in Q_{ui}^k} \left| \text{sim}(i, j) \right|}    (5.56)
The item-based approach was proposed years after the first user-based
methods appeared, but it has quickly gained popularity because of bet-
ter scalability and computational efficiency [Linden et al., 2003; Koren
and Bell, 2011]. One of the key advantages is that the total number of
items m in the system is often small enough to precalculate and store
the m × m item similarity matrix, so the top k recommendations can be
quickly looked up for a given user profile. This enables a more scalable
architecture for the recommender system: the heavy computations re-
quired to create the similarity matrix are done in the background, and
the recommendation service uses this matrix to make recommenda-
tions in real time. Although the same strategy can be applied to user-
based methods, it can be very expensive or completely impractical in
recommender systems with a high number of users. Finally, some stud-
ies found that item-based methods consistently outperform user-based
approaches in terms of prediction accuracy for certain important data
sets, such as Netflix data [Bell and Koren, 2007].
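The precompute-and-look-up pattern described above can be sketched as follows. The offline step builds an m × m matrix of mean-centered item-item similarities; the online step scores a user's unrated items against it. The online scoring rule here is a deliberately simplified similarity-weighted sum rather than the exact formula 5.56.

```python
import numpy as np

def item_similarity_matrix(R):
    """Offline step: precompute the m x m matrix of item-item similarities
    (Pearson correlation over mean-centered columns)."""
    m = R.shape[1]
    mu_i = np.nanmean(R, axis=0)
    S = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            common = ~np.isnan(R[:, i]) & ~np.isnan(R[:, j])
            if common.sum() < 2:
                continue
            x = R[common, i] - mu_i[i]
            y = R[common, j] - mu_i[j]
            denom = np.sqrt((x ** 2).sum() * (y ** 2).sum())
            if denom > 0:
                S[i, j] = S[j, i] = float(x @ y / denom)
    return S

def recommend(S, user_ratings, top_n=3):
    """Online step: score each unrated item by a similarity-weighted sum of
    the user's known ratings (a simplification of formula 5.56)."""
    scores = {}
    for j in range(S.shape[0]):
        if j in user_ratings:
            continue
        scores[j] = sum(S[i, j] * r for i, r in user_ratings.items())
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```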
At the same time, it should be noted that user-based approaches are
able to capture certain relationships that might not be recognized by
item-based methods [Koren and Bell, 2011]. Recall that the item-based
approach predicts rating rui based on the ratings that user u gave to
the items similar to i. This prediction is unlikely to be accurate if none
of the items rated by the user is similar to i. On the other hand, it may
still be the case that the user-based approach will identify users similar
to u who rated i, so the rating can be reliably predicted. As we will see
later, some advanced recommendation methods combine item-based
and user-based models to take advantage of both methods.
The ratio between the number of users and items is one of the key
considerations in the choice of approach. In many retail applications,
the item-based approach is preferable because the number of items is
smaller than the number of users. The number of items, however, can
exceed the number of users in some other domains. For example, an
article recommender system for researchers can benefit from the user-
based solution because the total number of all research articles ever
published reaches many hundreds of millions, whereas the research
community that uses the system is relatively small [Jack et al., 2016].
Figure 5.10: Item-based nearest neighbor regression.
A\, w = b    (5.66)

\min_w \; w^T A\, w - 2\, b^T w \quad \text{subject to} \quad w \ge 0    (5.68)
example 5.4
By solving this equation, we get the weights w_{32} = 1.033 for Titanic and w_{31} = -0.035 for Forrest Gump. The rating can then be predicted as

\hat{r}_{13} = \mu_3 + w_{32}\, r_{12} + w_{31}\, r_{11} = 2.40 + 1.033 \cdot 0.75 - 0.035 \cdot 2.20 = 3.09    (5.72)
in which µ3 is the average rating for The Godfather movie and the in-
put ratings r12 and r11 are taken from table 5.6. Repeating this process
for all unknown ratings, we get the final results presented in table 5.7.
N
5.7.4.2 User-based Regression
The regression analysis framework we have just developed for item-
based methods can be applied to user-based models quite straightfor-
wardly. The input to the process is the rating matrix centered by user
averages (row average is subtracted from each element in the row) or
baseline predictions. The user-based variant of the least squares prob-
lem 5.60 can then be defined as
\min_w \; \sum_{j \ne i} \left( r_{uj} - \hat{r}_{uj} \right)^2    (5.73)
User 1 5 4 [ 3.09 ] 1 2 1
User 2 4 [ 3.83 ] 3 1 1 2
User 3 [ 4.02 ] 5 5 [ 1.98 ] 3 3
User 4 2 [ 2.34 ] 1 4 5 4
User 5 2 2 2 [ 2.28 ] 4 [ 3.15 ]
User 6 1 2 1 [ 2.94 ] 5 4
Table 5.7: Example of ratings predicted with the item-based regression algorithm.
This problem needs to be solved for each target user u. Inserting the
rating prediction formula 5.58 into the previous equation, we get
\min_w \; \sum_{j \ne i} \left( r_{uj} - \sum_{v \in S_{uj}^k} w_{uv}\, r_{vj} \right)^2    (5.74)
in which w_{ij}^{(item)} and w_{uv}^{(user)} are two different sets of weights to be
learned. This model essentially sums a centered version of user-based
function 5.58 with item-based function 5.69 [Aggarwal, 2016; Koren
and Bell, 2011]. The rating function can then be inserted into the least
squares problem for the prediction error and optimized with respect to
all bias variables and weights. As the weights are learned from the data,
the user and item neighborhoods do not necessarily need to be limited
by the top k items, and the sets I_u and U_i can be used instead of Q_{ui}^k and S_{ui}^k, respectively. However, if the sets are limited by a finite value of k,
the computational complexity can be reduced, at the expense of model
accuracy.
The combined model is able to learn both item–item and user–user
relationships (see section 5.7.3 for details) and, consequently, to
combine the strengths of the two approaches. It has been shown
that combined models can outperform individual user-based and
item-based models on industrial data sets [Koren and Bell, 2011].
It is important to note that the regression framework can be used
not only to combine user-based and item-based solutions but also
to integrate neighborhood-based methods with completely different
models, including some that we will discuss in the next section.
(Figure: the rating matrix R viewed as a set of per-item regression problems; for each item i, a model f_i is trained on the other columns, and the rating is predicted as r_ui = f_i(x_u).)
• In certain cases, missing ratings can be filled with zeros. This pri-
marily applies to unary rating matrices where each element indi-
cates whether a user interacted with a given item or not [Aggar-
wal, 2016]. This approach, however, cannot be universally used
for every rating type because the insertion of default (zero) rat-
ing values results in prediction bias.
in which U_i is the set of users who rated item i, and I(x) is the
indicator function that equals one if the argument is true and zero
otherwise. The conditional probability that user u rates item j as ruj ,
given that this user had previously rated item i as ck , can be estimated
as follows:
\Pr\left( r_{uj} \mid r_{ui} = c_k \right) = \frac{\sum_{v \in U_i} I\left( r_{vj} = r_{uj} \text{ and } r_{vi} = c_k \right)}{\sum_{v \in U_i} I\left( r_{vi} = c_k \right)}    (5.81)
example 5.5
\Pr(r_{13} = c_k \mid I_1) \;\propto\; \left( 2/84035,\; 1/38880,\; 1/19440,\; 0,\; 1/38880 \right) \quad \text{for } c_k = c_1, \ldots, c_5    (5.87)
This means that a rating of 3 is the best estimate for r13 . If we re-
peat the process for all missing ratings, we get the results presented in
table 5.8.
User 1 5 4 [3] 1 2 1
User 2 4 [4] 3 1 1 2
User 3 [2] 5 5 [1] 3 3
User 4 2 [2] 1 4 5 4
User 5 2 2 2 [4] 4 [4]
User 6 1 2 1 [4] 5 4
Table 5.8: Example of ratings predicted by using the item-based Naive Bayes
collaborative filtering algorithm.
As a side note, let us also show how the Naive Bayes approach can be
connected to neighborhood-based collaborative filtering. We can make
their structural similarity more apparent by replacing the product in
equation 5.78 by a sum of logarithms and inserting it into class proba-
bility formula 5.77:
\Pr\left( r_{ui} = c_k \mid I_u \right) = \frac{\Pr(c_k)}{\Pr(I_u)} \exp\left( \sum_{j \in I_u} s_k(i, j) \right)    (5.88)

in which

s_k(i, j) = \log \Pr\left( r_{uj} \mid r_{ui} = c_k \right)    (5.89)
• The rating matrix is typically very sparse (in practice, about 99%
of ratings can be missing). This impacts the computational sta-
bility of the recommendation algorithms and leads to unreliable
estimates in cases when a user or item has no really similar neigh-
bors. This problem is often aggravated by the fact that most of the
basic algorithms are either user-oriented or item-oriented, which
limits their ability to capture all types of similarities and interac-
tions available in the rating matrix.
• The data in the rating matrix are usually highly correlated be-
cause of similarities between users and items. This means that
the signals available in the rating matrix are not only sparse but
also redundant, which contributes to the scalability problem.
The above considerations indicate that a raw rating matrix can be
a non-optimal representation of the rating signals and we should con-
sider some alternative representations that are more suitable for collab-
orative filtering purposes. To explore this idea, let us go back to square
one and reflect a little bit on the nature of recommender services. Fun-
damentally, a recommender service can be viewed as an algorithm that
predicts ratings based on some measure of affinity between a user and
item:
\hat{r}_{ui} = \text{affinity}(u, i)    (5.90)
One possible way to define this affinity measure is to take the latent
factors approach and map both users and items to points in some k-
dimensional space, so that each user and each item is represented as a
k-dimensional vector:
\hat{r}_{ui} = \text{affinity}(u, i) = \mathbf{p}_u \mathbf{q}_i^T

in which p_u and q_i are the user and item latent factor vectors. The optimal factors can be found by minimizing the difference between the known ratings and the model estimates:

\min_{P,\, Q} \; \left\| R - \hat{R} \right\|^2, \qquad \hat{R} = P\, Q^T    (5.94)

(Figure: the n × m rating matrix R approximated by the product of the n × k user factor matrix P, with rows p_u, and the transpose of the m × k item factor matrix Q, with columns q_i.)
R = U\, \Sigma\, V^T    (5.95)

\hat{R} = U_k\, \Sigma_k\, V_k^T    (5.96)
Consequently, the latent factors that are optimal from the prediction
accuracy standpoint can be obtained by means of the SVD as follows:
P = U_k\, \Sigma_k, \qquad Q = V_k    (5.97)
The latent factor representation also mitigates the sparsity problem because the signal present in the original rating matrix is ef-
ficiently condensed (recall that we select the top k dimensions with
the highest signal energy) and the latent factor matrices are not sparse.
Figure 5.14 illustrates this property. The user-based neighborhood al-
gorithm (5.14a) convolves the sparse rating vector for a given item with
the sparse similarity vector for a given user to produce the rating es-
timate. In contrast, the latent factor model (5.14b) estimates the rating
by convolving the two vectors of reduced dimensionality and higher
energy density.
Figure 5.14: Signal energy distribution in the user-based neighborhood model (a) and the latent factor model (b).
\min_{P,\, Q} \; J = \left\| R - P\, Q^T \right\|^2    (5.98)

E = R - P\, Q^T    (5.100)

P \leftarrow P + \alpha\, E\, Q
Q^T \leftarrow Q^T + \alpha\, P^T E    (5.101)
initialize p_{ud}^{(0)} randomly with mean \mu_0, for 1 \le u \le n, 1 \le d \le k
initialize q_{id}^{(0)} randomly with mean \mu_0, for 1 \le i \le m, 1 \le d \le k
...
e = r_{ui} - \hat{r}_{ui}

in which \varepsilon is the convergence threshold, \mu is the mean known rating, and \mu_0 = \mu / \sqrt{k\,|\mu|}.
The first stage of the algorithm is to initialize the latent factor ma-
trices. Selection of these initial values is not really important, but we
choose to evenly distribute the energy of known ratings among the
randomly generated latent factors. The algorithm then optimizes the
concept dimensions one by one. For each dimension, it repeatedly goes
through all ratings in the training set, predicts each rating by using the
current latent factor values, estimates the prediction error, and adjusts
the factor values in accordance with expressions 5.101. A given dimen-
sion is considered done once the convergence condition is met and the
algorithm switches to the next dimension.
Algorithm 5.1 helps to overcome the limitations of the standard
SVD. It optimizes the latent factors by cycling through the individ-
ual data points and, consequently, avoids the issues with the missing
ratings and algebraic operations over very large matrices. The itera-
tive element-by-element approach also makes the stochastic gradient
descent more convenient for practical applications than the gradient
descent, which updates entire matrices by using expressions 5.101.
example 5.6
(Figure: the users and movies of the example mapped into a three-dimensional latent factor space with dimensions d_1, d_2, and d_3; the dimensions roughly correspond to genre concepts such as action and drama.)
\hat{R} = P\, Q^T + \mu =
\begin{pmatrix}
 5.11  &  4.00  & [4.01] &  0.75  &  2.00 &  1.35 \\
 3.94  & [2.93] &  3.19  &  1.11  &  1.00 &  1.49 \\
[4.31] &  5.03  &  5.19  & [1.61] &  3.03 &  2.67 \\
 1.86  & [2.12] &  0.94  &  4.30  &  5.01 &  3.71 \\
 1.99  &  2.15  &  1.88  & [3.87] &  3.95 & [3.51] \\
 1.26  &  1.69  &  1.20  & [4.81] &  4.94 &  4.17
\end{pmatrix}    (5.105)
The results reproduce the known ratings quite accurately and predict
the missing ratings in a way that corresponds to our intuitive expecta-
tions. The accuracy of the estimates can be increased or decreased by
changing the number of dimensions, and the optimal number of di-
mensions can be determined in practice by cross-validation and the
selection of a reasonable trade-off between computational complexity
and accuracy.
N
5.8.3.2 Constrained Factorization
The standard SVD algorithm gives an optimal solution of the low-rank
approximation problem, and the factors P and Q produced by the SVD
are column-orthogonal. The stochastic gradient descent algorithm 5.1
approximates this optimal solution. If the input rating matrix is com-
plete, algorithm 5.1 converges to the same column-orthogonal outputs
as the SVD. The only difference is that the diagonal scaling matrix
present in the SVD is rolled into the two factors. However, if the input
rating matrix is not complete, the outputs produced by the algorithm
are not necessarily orthogonal. This means that the concepts remain correlated in the statistical and geometric senses. Although this does not necessarily degrade the prediction accuracy, the orthogonality of the concepts can be enforced explicitly by adding constraints to the factorization problem:
\min_{P,\, Q} \; \left\| R - P\, Q^T \right\|^2
\quad \text{subject to} \quad P^T P \text{ is a diagonal matrix}, \;\; Q^T Q \text{ is a diagonal matrix}    (5.106)
At the next step, the third user concept vector can be orthogonalized
by subtracting its projections onto the previous two
\min_{P,\, Q} \; \left\| R - P\, Q^T \right\|^2 \quad \text{subject to} \quad P \ge 0, \;\; Q \ge 0    (5.112)
initialize p_{ud}^{(0)} randomly with mean \mu_0, for 1 \le u \le n, 1 \le d \le k
initialize q_{id}^{(0)} randomly with mean \mu_0, for 1 \le i \le m, 1 \le d \le k
...
end
p_d \leftarrow p_d - \sum_{s=1}^{d-1} \text{proj}(p_d, p_s)    (projection)
q_d \leftarrow q_d - \sum_{s=1}^{d-1} \text{proj}(q_d, q_s)    (projection)
\hat{r}_{ui} = \mu + b_i + b_u + \mathbf{p}_u \mathbf{q}_i^T    (5.113)
in which µ is the global mean, bi is the item bias, bu is the user bias,
and the last term corresponds to the latent factors part of the model.
Adding a regularization term that helps to avoid overfitting on sparse
data, we translate this model into the following optimization problem:
\min \; \sum_{u,i} \left( r_{ui} - \mu - b_i - b_u - \mathbf{p}_u \mathbf{q}_i^T \right)^2 + \lambda \left( b_i^2 + b_u^2 + \| \mathbf{p}_u \|^2 + \| \mathbf{q}_i \|^2 \right)    (5.114)
b_u \leftarrow b_u + \alpha \left( e - \lambda\, b_u \right)
b_i \leftarrow b_i + \alpha \left( e - \lambda\, b_i \right)
p_{ud} \leftarrow p_{ud} + \alpha \left( e\, q_{id} - \lambda\, p_{ud} \right)    (5.115)
q_{id} \leftarrow q_{id} + \alpha \left( e\, p_{ud} - \lambda\, q_{id} \right)
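The model 5.113 and the update rules 5.115 translate almost directly into code. The sketch below is a minimal stochastic gradient descent loop over the known ratings; the learning rate, regularization weight, number of epochs, and initialization scale are illustrative assumptions.

```python
import numpy as np

def factorize(R, k=2, alpha=0.01, lam=0.05, epochs=200, seed=0):
    """SGD for the regularized biased factorization model (5.113-5.115).
    R is a dense matrix with np.nan marking missing ratings."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    mu = np.nanmean(R)
    bu, bi = np.zeros(n), np.zeros(m)
    P = rng.normal(0.0, 0.1, (n, k))
    Q = rng.normal(0.0, 0.1, (m, k))
    known = [(u, i) for u in range(n) for i in range(m) if not np.isnan(R[u, i])]
    for _ in range(epochs):
        for idx in rng.permutation(len(known)):
            u, i = known[idx]
            e = R[u, i] - (mu + bu[u] + bi[i] + P[u] @ Q[i])   # prediction error
            bu[u] += alpha * (e - lam * bu[u])                  # bias updates (5.115)
            bi[i] += alpha * (e - lam * bi[i])
            pu, qi = P[u].copy(), Q[i].copy()                   # factor updates (5.115)
            P[u] += alpha * (e * qi - lam * pu)
            Q[i] += alpha * (e * pu - lam * qi)
    return mu, bu, bi, P, Q

def predict(model, u, i):
    mu, bu, bi, P, Q = model
    return mu + bu[u] + bi[i] + P[u] @ Q[i]
```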
The preferences of a user are expressed not only by the rating values but by the positions of the known ratings as well (see section 5.1.1 for
details). We can isolate this signal about user–item interactions in the
n m implicit feedback matrix, which contains ones in the positions of
known ratings and zeros in the positions of missing ratings. Normaliz-
ing each row to a unit length, we define the implicit feedback matrix F
as
f_{ui} = \begin{cases} |I_u|^{-1/2}, & \text{if } r_{ui} \text{ is known} \\ 0, & \text{otherwise} \end{cases}    (5.116)
\min_{P,\, Q,\, Y} \; \left\| R - \left( P + F\, Y \right) Q^T \right\|^2    (5.117)
By adding user and item biases, we get the following rating predic-
tion formula:
\hat{r}_{ui} = \mu + b_i + b_u + \left( \mathbf{p}_u + |I_u|^{-1/2} \sum_{j \in I_u} \mathbf{y}_j \right) \mathbf{q}_i^T    (5.118)
in which y_j are the rows of matrix Y. The learning rules for stochastic gradient descent can be straightforwardly derived from this prediction formula:
b_u \leftarrow b_u + \alpha \left( e - \lambda_1\, b_u \right)
b_i \leftarrow b_i + \alpha \left( e - \lambda_1\, b_i \right)
p_{ud} \leftarrow p_{ud} + \alpha \left( e\, q_{id} - \lambda_2\, p_{ud} \right)    (5.119)
q_{id} \leftarrow q_{id} + \alpha \left( e \left( p_{ud} + |I_u|^{-1/2} \sum_{j \in I_u} y_{jd} \right) - \lambda_2\, q_{id} \right)
y_{jd} \leftarrow y_{jd} + \alpha \left( e\, |I_u|^{-1/2}\, q_{id} - \lambda_2\, y_{jd} \right)
5.9.1 Switching
in which Ui is the set of users who rated item i. This solution can
help us to work around the cold-start problem, which is an issue for
collaborative filtering, and, at the same time, to improve trivial rec-
ommendations produced by content-based filtering whenever possible.
A generic schema of such a switching recommender is shown in Fig-
ure 5.16. This approach, however, is somewhat rudimentary because
it is based on heuristic rules rather than a formal optimization frame-
work. We can definitely achieve better results by leveraging machine
learning and optimization algorithms to properly mix the outputs of
the individual models.
5.9.2 Blending
rating for a given pair of user and item. For each training sample j, let
us denote the vector of q model outputs (predicted rating values) as xj
and the true rating value as yj . The problem of blending the available
estimates together can then be defined as finding the blending function
b pxq that minimizes the prediction error:
\min_b \; \sum_{j=1}^{s} \left( b(x_j) - y_j \right)^2    (5.122)
One of the most basic solutions of problem 5.122 is, of course, lin-
ear regression. In this case, the combiner function is a linear function
defined as
b(x) = \mathbf{x}^T \mathbf{w}    (5.123)
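As an illustration, the least squares fit of the linear combiner 5.123 can be done in a couple of lines; the model outputs and true ratings below are hypothetical.

```python
import numpy as np

# Columns of X: predictions of q base recommenders for each training sample;
# y: true ratings. The blend b(x) = x^T w is fitted by least squares (5.122-5.123).
X = np.array([[3.1, 3.6], [4.2, 3.9], [1.8, 2.4], [2.9, 3.3]])  # hypothetical model outputs
y = np.array([3.5, 4.0, 2.0, 3.0])                              # hypothetical true ratings
X1 = np.hstack([X, np.ones((X.shape[0], 1))])                   # optional intercept column
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
blended = X1 @ w                                                 # blended rating estimates
print(np.round(w, 3), np.round(blended, 2))
```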
Models can also be added to the blend one after another. Each model is trained by using gradient
descent or stochastic gradient descent in the inner loop of algorithm 5.3.
For each iteration, we update the model by using its learning rules, pre-
dict ratings for all training samples, assemble a temporary matrix X by
appending a column of newly predicted ratings, re-optimize the blend
(the algorithm uses a linear blending function for illustration, but any
other blending model can be used), and estimate the blend prediction
error. The method does not change the error functions of individual
models and learning rules, but it changes the convergence condition
so that training stops when the overall prediction error of the blend is
minimized. In practice, the blend prediction error can continue to de-
crease after the model prediction error reaches its minimum and starts
to increase.
The predictions of the first model are then subtracted from the raw samples, the second model is trained on these residual errors, and so on. The final blend
can include predictions produced by models trained on the raw data
and predictions created based on the residual errors. Among the previ-
ously discussed models, the removal of the global rating average and
baseline predictors are basic examples of residual training techniques.
b(x) = \sum_{k=1}^{q} w_k\, x_k    (5.126)
in which fk are called feature functions and vki are static weights. In
other words, the feature functions amplify or suppress the signals from
the recommendation models, as shown in Figure 5.19. Note that this
design is quite similar to the signal mixing pipelines that we discussed
in Chapter 4, in the context of search services.
in which the outer sum iterates over all s training samples, xjk is the
rating predicted by algorithm k for the j-th training sample, gji stands
for the j-th sample meta-features, and yj is the true rating value for
sample j. To solve this problem, let us first introduce the s × qp matrix
A that contains the cross-products of predictions and meta-features for
each training sample:
in which \hat{r}_{ui}^{(c)} is the rating predicted by the content-based recom-
mender. The second step is to apply the collaborative filtering part
of the hybrid to the pseudo rating matrix to predict the rating for a
given pair of user and item. In principle, one can use any out-of-the-
box collaborative filtering algorithm to make the final prediction. The
challenge, however, is that the ratings filled in during the previous step
skew the statistics of the number of known ratings used by many col-
laborative filtering algorithms. This requires the collaborative filtering
part to be modified and several additional factors and parameters to
be introduced to fix the statistics:
• The reliability of content-based rating predictions depends on the
number of known ratings for a given user. Consequently, predic-
tions that do not have enough support should be devalued when
used in collaborative filtering. If we use a user-based neighbor-
hood model for the collaborative filtering step, we can account
for the reliability of the incoming ratings by modifying the user
similarity measure. Let us first define a normalized support vari-
able that grows proportionally to the number of ratings provided
by a user but is limited to one if the number of ratings exceeds
the threshold parameter T :
q_u = \begin{cases} 1, & |I_u| \ge T \\ |I_u| / T, & \text{otherwise} \end{cases}    (5.132)
The final rating prediction formula for the collaborative filtering part can then be defined as follows:

\hat{r}_{ui} = \mu_u + \frac{ w_u \left( \hat{r}_{ui}^{(c)} - \mu_u \right) + \sum_v \text{sim}'(u, v) \left( z_{vi} - \mu_v \right) }{ w_u + \sum_v \text{sim}'(u, v) }    (5.135)
For example, a content-based recommender or search service can first narrow the catalog down to the items that match the attributes specified by the user, and then this list can be sorted by a collaborative filtering unit. This technique, sometimes referred to as cascading, can be viewed as an extreme case of blending, where the signal from the first recommender or search service is thresholded to sort items into relevant and irrelevant buckets and the second recommender does the secondary sorting within the buckets.
\hat{r}_{ui} = R(u, i)    (5.136)

\hat{r}_{ui} = R(u, i, \text{location}, \text{time}, \ldots)    (5.137)
The recommendation task then becomes the estimation of the rating values in the empty cells of the array. Note that a multidimen-
sional array can be collapsed onto a standard two-dimensional rating
matrix by discarding the contextual information. This may require the
merging of several rating values that are projected onto one element
of the matrix. For example, if a user rated the same item several times
on different dates, only the latest value or average value can be kept
in the rating matrix. Alternatively, the rating matrix can be obtained
by selecting a certain point on the context dimension and cutting out
a two-dimensional slice from the multidimensional cube at this point.
For example, the array depicted in Figure 5.20 can be viewed as a pile of rating matrices R(t), one for each time interval. Finally, a rating ma-
trix can be created not for a certain point on the contextual dimension
but for a certain range. In the case of the example in Figure 5.20, this
would be for a certain time interval.
(Figure 5.20: the multidimensional rating array R(u, i, t) with user, item, and time dimensions; fixing a point or range on the time dimension yields two-dimensional slices R(t).)
Figure 5.21: The main steps of the non-contextual recommendation process
[Adomavicius and Tuzhilin, 2008]. U, I, and R are the dimensions
of users, items, and ratings, respectively. Recommended items are
denoted as i, j, . . .
Figure 5.22: Context-aware recommender system with contextual prefiltering
[Adomavicius and Tuzhilin, 2008].
Figure 5.23: Context-aware recommender system with contextual postfiltering
[Adomavicius and Tuzhilin, 2008].
(Figure: context-aware recommender system with contextual modeling, in which the context dimension C is retained in the model [Adomavicius and Tuzhilin, 2008].)
b_{ui} = \mu + b_u + b_i    (5.144)
in which bu and bi are the average user and item biases, respectively.
With the added assumption that user and item biases can change over
time, the time-aware version can be defined as
b_{ui} = \mu + b_u(t) + b_i(t)    (5.145)
in which b_u(t) and b_i(t) are functions of time that need to be
learned from the data. Parameter t can be defined as the number of
days measured off from some date zero in the past. In practice, user
and item components can have very different temporal dynamics and
properties, so one might need two different solutions for these two
functions [Koren, 2009]. In many practical applications, item popular-
ity changes slowly over time and each item has relatively many ratings.
Therefore, the time range can be split into multiple time intervals (for
example, several weeks each), and the item bias can be estimated for
each interval independently. This leads to the following simple time-
aware model for the item component:
b_i(t) = b_i + b_{i, \Delta t}    (5.146)
b_u(t) = b_u + w_u \cdot d_u(t)    (5.148)
Figure 5.25: The user bias drift function d_u(t) for different values of β. The function is linear when β = 1.
\hat{r}_{ui} = \mathbf{p}_u \mathbf{q}_i^T    (5.151)
in which pu and qi are the k-dimensional user and item latent factor
vectors, respectively. One can account for variability of user tastes by
adding a time drift term to each element of the latent factor vector,
similar to the corresponding expression 5.148 for baseline estimates:
in which ∆v_1(i) is the relative sales volume change (in percent) for the previous day and ∆v_2(i) is the volume change two days
ago.
\text{support}(X) = \frac{\left| \{ t \in T : X \subseteq t \} \right|}{|T|}    (5.155)

\text{confidence}(X \to Y) = \frac{\text{support}(X \cup Y)}{\text{support}(X)}    (5.157)
the rule (and the recommendation created based on this rule) as fol-
lows:
\text{revenue}(X \to Y) = \text{support}(X \to Y) \cdot \sum_{i \in Y} \text{price}(i)    (5.158)
given that a user is about to buy item X. More accurate estimates
of monetary metrics can be obtained by using the uplift modeling
techniques that we discussed earlier in the context of promotion op-
timization. A recommender system typically needs association rules
with high support and confidence levels to ensure that these rules are
reliable and discriminative. The creation of such rules from a given
transaction history is a standard data mining problem, known as fre-
quent itemset mining, affinity analysis, or market basket analysis. This
problem can be solved by using a wide range of specialized algorithms,
such as Apriori or FP-growth.
example 5.7
Transaction ID Items
1 milk, bread, eggs
2 bread, sugar
3 milk, cereal
4 bread, cereal
5 milk, bread, sugar
6 cereal, milk, bread
7 bread, cereal
8 milk, cereal
9 milk, bread, cereal, eggs
To make recommendations for a customer who buys milk, we select the rules with left-hand sides that contain milk. The recommended items are then obtained from the
right-hand sides of the rules, so that the list of recommendations for
milk includes cereal, bread, eggs, and sugar, in order of relevance.
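The support and confidence computations behind this example are simple enough to sketch directly. The snippet below evaluates rules of the form milk -> item on the transaction list of example 5.7 using expressions 5.155 and 5.157; ties in confidence (here, cereal and bread) are broken arbitrarily.

```python
transactions = [
    {"milk", "bread", "eggs"}, {"bread", "sugar"}, {"milk", "cereal"},
    {"bread", "cereal"}, {"milk", "bread", "sugar"}, {"cereal", "milk", "bread"},
    {"bread", "cereal"}, {"milk", "cereal"}, {"milk", "bread", "cereal", "eggs"},
]

def support(itemset):
    """Share of transactions containing the itemset (expression 5.155)."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs (expression 5.157)."""
    return support(lhs | rhs) / support(lhs)

# Rank single-item recommendations for a shopper who buys milk.
items = set().union(*transactions) - {"milk"}
rules = {i: (support({"milk", i}), confidence({"milk"}, {i})) for i in items}
for item, (s, c) in sorted(rules.items(), key=lambda kv: kv[1][1], reverse=True):
    print(f"milk -> {item}: support={s:.2f}, confidence={c:.2f}")
```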
• top_k(·) denotes the first k elements with a maximal ranking score. This operation truncates the original matrices X and Y to the size of n × k.
g(z) = \frac{1}{k} \sum_{i=1}^{m} M(i) \cdot I\left( z_i \le k \right)    (5.162)
in which F(i) is a feature label that equals one if the item is featured and zero otherwise. The matrix X is defined accordingly, and the blended ranking score can be expressed as

z = \varphi(y, x): \quad z_i = y_i + w\, x_i    (5.167)
5.14 summary
• Extremely wide and deep assortments with a long tail of niche prod-
ucts create a need for efficient discovery services, including search
and recommendations.
The problem of price management has a very long history. The fun-
damental aspects of pricing have been studied for centuries to explain
the interplay of supply and demand on a market. This has resulted in
the development of a comprehensive theory that describes the strate-
gic aspects of pricing, such as price structures, relationships between
price and demand, and others. These methods provide relatively coarse
price optimization methods that can, however, inform strategic pricing
decisions. The opportunity to automatically improve tactical price de-
cisions was first recognized and seized by the airline industry at the
beginning of the 1980s and can be partially attributed to the advance-
ment of digital reservation systems that enabled dynamic and agile
resource and price management. This required the development of a
totally new set of optimization methods that were later adopted in
other service industries, such as hotels and car rentals. This new, truly
algorithmic, approach, commonly referred to as revenue management or
yield management, has clearly demonstrated the power of automated price and resource management: in multiple cases, late adopters of the new techniques were bankrupted or outcompeted by the pioneers of automated price management.
Price management is closely related to other programmatic services,
especially promotions and advertisements. Price management meth-
ods can be used both to optimize discount values of promotions and
to price advertising and media resources sold to service clients. We
will start this chapter with a review of the basic principles of strategic
pricing and price optimization. We will then continue with the devel-
opment of more tactical and practical demand prediction and price op-
timization methods for market segmentation, markdowns, and clear-
ance sales. We will also briefly review the major resource allocation
methods used in service industries to set booking limits. Finally, we
will consider the assortment optimization problem, for which we can
reuse some of the building blocks developed for price management.
6.1 environment
G_i = Q_i \left( P_i - V_i \right) - C_i    (6.1)

P_i^{opt} = \arg\max_{P_i} \; Q_i(P_i) \left( P_i - V_i \right) - C_i    (6.2)
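In the simplest setting, the optimization 6.2 can be carried out by evaluating the profit over a grid of candidate prices. The demand curve, costs, and price grid below are hypothetical placeholders.

```python
import numpy as np

def optimal_price(demand, V, C, prices):
    """Grid search for the profit-optimal price (equations 6.1-6.2):
    G = Q(P) * (P - V) - C, maximized over a set of candidate prices."""
    profits = [demand(p) * (p - V) - C for p in prices]
    best = int(np.argmax(profits))
    return prices[best], profits[best]

# Hypothetical linear demand curve: Q(P) = Qmax * (1 - P / Pmax)
Qmax, Pmax, V, C = 1000, 25.0, 10.0, 500.0
demand = lambda p: Qmax * max(0.0, 1 - p / Pmax)
p_opt, g_opt = optimal_price(demand, V, C, prices=np.arange(10.0, 25.0, 0.1))
print(round(p_opt, 2), round(g_opt, 2))   # expect p_opt near (Pmax + V) / 2 = 17.5
```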
example 6.1
Table 6.1: Quantitative example that illustrates how changes in prices, costs, and
quantity sold influence profits.
The exact design of the value exchange model depends on the nature
of the products and their differences. In many cases, product features
can be evaluated with methodologies, such as conjoint analysis, that rely
on consumer surveys [Green and Srinivasan, 1978]. In certain cases, the
The concept of utility may suggest that buyers make decisions by com-
paring their willingness to pay against the price: the product is pur-
chased if, and only if, the price is below the utility. This “rational be-
havior”, however, is not an adequate model of real consumers. Valu-
ation is a subjective process that depends on how exactly the value
and price are communicated to the prospect and how the prospect per-
ceives it. Failure to properly communicate the value or price can set
the wrong expectations and displace the price boundaries in the un-
desirable direction. Efficient communication, by contrast, can improve
the perceived value of a product or diminish the value of comparable
alternatives.
Value communication, at first glance, might not look very rel-
evant for a discussion of algorithmic methods because it deals with
psychological aspects of value and price perception that can probably
not be codified in the software. It turns out, however, that analysis of
these psychological patterns can produce applicable rules that can be
incorporated into price structures and, consequently, can be accounted
for in price optimization problems.
One of the most solid frameworks that captures many important
aspects of value and price perception is prospect theory [Kahneman
and Tversky, 1979]. Prospect theory considers the evaluation process
Figure 6.3: A prospect theory value function. The actual value increment of ∆
can be perceived as a relatively small gain, whereas the actual decre-
ment of the same magnitude ∆ can be perceived as a huge loss.
\frac{\partial}{\partial p} q(p) = -Q_{max}\, w(p)    (6.8)

\epsilon = -\frac{\Delta q / q}{\Delta p / p} = -\frac{p}{q(p)} \cdot \frac{\partial}{\partial p} q(p)    (6.9)
A simple demand model can be derived under the assumption that the
willingness to pay is uniformly distributed in the range from 0 to the
maximum acceptable price P:
w(p) = \text{unif}(0, P) = \begin{cases} 1/P, & 0 \le p \le P \\ 0, & \text{otherwise} \end{cases}    (6.10)
Figure 6.4: Uniform willingness to pay and the corresponding linear demand
curve. Note that we follow the traditional economic notation in the
right-hand plot by placing price on the vertical axis, although de-
mand is considered as a function of price.
constant. This means that we need to solve the following equation for q(p):

-\frac{p}{q(p)} \cdot \frac{\partial}{\partial p} q(p) = \epsilon    (6.12)

q(p) = C\, p^{-\epsilon}    (6.13)

w(p) = -\frac{1}{q(0)} \cdot \frac{\partial}{\partial p} q(p) \;\propto\; p^{-\epsilon - 1}    (6.14)
The demand curve qppq and willingness to pay wppq with constant-
elasticity demand are depicted in Figure 6.5. Similarly to the linear
demand function, constant-elasticity demand can be a reasonable ap-
proximation for relatively small price changes. The constant-elasticity
demand correctly captures a smooth decrease in the willingness to pay
as the price grows, but it also implies that the willingness to pay – re-
call that this is the maximum acceptable price – is concentrated near
zero, which is not necessarily a realistic assumption.
q(p) = Q_{max} \cdot \frac{1}{1 + e^{a + b p}}    (6.15)

w(p) = -\frac{1}{q(0)} \cdot \frac{\partial}{\partial p} q(p) = b \left( 1 + e^a \right) \frac{e^{a + b p}}{\left( 1 + e^{a + b p} \right)^2}    (6.16)
Figure 6.6: Logit demand function for several values of the parameter b and the
corresponding willingness to pay. The line thickness is proportional
to the magnitude of b.
\mu_i(\mathbf{p}) = \frac{e^{-b_i p_i}}{\sum_j e^{-b_j p_j}}    (6.18)

Treating the sum of the competing products' terms c = \sum_{j \ne i} e^{-b_j p_j} as a constant, the share of product i can be expressed as a function of its own price:

\mu_i(p_i) = \frac{e^{-b_i p_i}}{c + e^{-b_i p_i}} = \frac{1}{1 + e^{\ln c + b_i p_i}}    (6.20)
which is the same as the logit demand model from equation 6.15.
A demand curve describes the relationship between the price and de-
manded quantity. It allows us to express a firm’s revenues and profits
as a function of price and to solve the optimization problem to deter-
mine the profit-optimal price level. Although this approach might look
like a precise way to calculate optimal prices, it can rarely produce
acceptable results because it is difficult, perhaps impossible, to esti-
mate a globally accurate demand curve that takes into account all of
the consequences of a price change, including competitors’ responses
and other strategic moves. The formal optimization problem, however,
can provide useful insights and support decision making, which is an
important step towards a programmatic solution. The analysis of de-
mand curves also helps to justify different price structures and their
key properties, which is necessary for the more advanced and auto-
mated optimizations that we will consider in later sections.
G = q(p) \cdot \left( p - V \right)    (6.21)
Solving this equation for p, we obtain the optimal price, which is the
average of P and V:
p_{opt} = \frac{P + V}{2}    (6.24)
We can substitute the optimal price into equation 6.22 to determine the
number of units that the firm is expected to sell at this price:
q_{opt} = \frac{Q_{max}}{2P} \left( P - V \right)    (6.25)

G_{opt} = \frac{Q_{max}}{4P} \left( P - V \right)^2    (6.26)
p_{opt} = V \cdot \frac{\epsilon}{\epsilon - 1}    (6.28)
the elasticity with high accuracy, and it is likely that the estimate is actually more like 1.5 ± 0.4. This leads us to a range of “optimal” prices
from $21 to $110, as shown in table 6.2. It is clear that this result has
limited practical applicability.
Elasticity       1.10   1.20   1.30   1.40   1.50   1.60   1.70   1.80   1.90
Optimal price    $110   $60    $43    $35    $30    $27    $24    $23    $21
Table 6.2: Optimal prices calculated for different values of elasticity and variable costs of $10 with the constant-elasticity demand model.
At the same time, any single price popt , no matter how optimal it is
or isn’t, represents a trade-off because some customers will not buy a
product if they consider it to be too expensive, although they would be
willing to buy it at a lower price, in between popt and V, and would
thereby positively contribute to the profit. Moreover, some customers
will tolerate prices higher than popt , although the sales volume that
they will generate is relatively small. In both cases, a firm fails to cap-
ture additional profits that lie in the triangle in between the demand
curve and the variable costs line. Price segmentation is a natural way
to overcome the limitations of a single regular price by segmenting
customers according to their willingness to pay and offering different
prices to different segments. A particular case of this strategy where
the regular price has been complemented by a higher premium price
and a lower discounted price is shown in Figure 6.8. Note how the
profit area increases relative to that in the single-price strategy.
in which p_i and q_i are the prices and quantity sold for segment i, respectively, and q_0 = 0. The quantity sold at price p_i is

q_i = Q_{max} \left( 1 - \frac{p_i}{P} \right)    (6.31)

We can find prices that maximize the profit by taking partial derivatives of G and equating them to zero. By inserting equation 6.31 into equation 6.30, setting p_0 = S and p_{n+1} = V, and doing algebraic simplifications, we find

\frac{\partial G}{\partial p_i} = \frac{Q_{max}}{P} \left( p_{i-1} - 2 p_i + p_{i+1} \right), \quad 1 \le i \le n    (6.32)

so the optimal prices satisfy

p_i = \frac{p_{i-1} + p_{i+1}}{2}    (6.33)
Figure 6.9: Example of optimal prices for four segments with variable costs of
$10 and a maximum acceptable price of $25.
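Because condition 6.33 forces every price to be the midpoint of its neighbors, the optimal segment prices are evenly spaced between the boundary values. A small sketch, assuming the upper boundary is the maximum acceptable price as in figure 6.9:

```python
import numpy as np

def segment_prices(P, V, n):
    """Prices satisfying p_i = (p_{i-1} + p_{i+1}) / 2 (equation 6.33)
    with boundaries set to the maximum acceptable price P and the
    variable cost V; the solution is an evenly spaced grid."""
    return np.linspace(P, V, n + 2)[1:-1]

# With the parameters of figure 6.9 (P = $25, V = $10, four segments),
# the prices come out evenly spaced between P and V.
print(segment_prices(P=25.0, V=10.0, n=4))   # [22. 19. 16. 13.]
```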
The total profit earned for a customer is a sum of the entrance fee and metered charges:

G = p_e^{opt} + q \cdot \left( p_m - V \right)    (6.38)

\frac{\partial G}{\partial p_m} = \frac{\partial p_e^{opt}}{\partial p_m} + \frac{\partial q}{\partial p_m} \left( p_m - V \right) + q = 0    (6.39)

Inserting equations 6.37 and 6.38 into equation 6.39 and solving it for p_m, we find that the optimal metered price should be set equal to the marginal costs:

p_m^{opt} = V    (6.40)
This means that two-part tariff pricing encourages the seller to set
the entrance fee as high as possible and lower the metered price to the
minimum, so the profit will be extracted exclusively from the entrance
fee. This strategy, for example, is widely adopted by amusement parks
that tend to charge high entrance fees rather than charge per ride.
The approach with the high entrance fee, however, faces a strong
headwind in the case of competition or heterogeneous demand where
customers are willing to purchase different quantities of a product at
a given price. This situation is depicted in Figure 6.11, where multiple
demand curves have the same slope but differ in quantity purchased.
Let us assume that each demand curve corresponds to a certain cus-
tomer segment. We now cannot set the entrance fee higher than the sur-
plus that corresponds to the lowest demand curve because we would
lose customers otherwise. Let us denote the ratio between the demand
of the i-th segment to the demand of the segment that corresponds to
the lowest demand curve as ki , so the equations for the demand curves
can be written as
q_i = Q_{max}\, k_i \left( 1 - \frac{p_m}{k_i\, P} \right), \quad k_i \ge 1    (6.41)
The profit can be expressed by summing the entrance fee with the
metered charges from each of the segments:
G = p_e + \sum_i \mu_i\, q_i \left( p_m - V \right)    (6.42)
in which µ_i is the share of segment i; that is, if the total number of customers is N, then segment i contains N·µ_i customers. The
optimal metered price can be found by taking the derivative of the
profit with respect to the metered price and equating it to zero:
\frac{\partial G}{\partial p_m} = \frac{\partial p_e}{\partial p_m} + \left( p_m - V \right) \sum_i \mu_i \frac{\partial q_i}{\partial p_m} + \sum_i \mu_i\, q_i = 0    (6.43)
Equation 6.37 for pe still holds, under the assumption that P is the
maximum acceptable price for the lowest demand curve, so we can
insert it into the equation above and, by using the fact that the sum
of all µi equals one, we derive a simple expression for the optimal
metered price:
p_m^{opt} = V + P \sum_i \mu_i \left( k_i - 1 \right) = V + P \left( \mathbb{E}[k] - 1 \right)    (6.44)
6.5.4 Bundling
example 6.2
Segment        Spreadsheet   Presentation
Sales          $100          $150
Accounting     $150          $100
Table 6.4: Example of price bundling with two products and two consumer seg-
ments.
Figure 6.12: Customer segments in which the willingness to pay for two prod-
ucts is positively correlated. The segments are depicted as black
dots.
pure bundling The price of the bundle p_B can be set equal to the aggregate willingness to pay because it is constant for all segments:
The first and second terms of equation 6.47 are the total profits
for product X and Y, respectively. Comparing equations 6.46 and
6.47, we find that GB is greater than or equal to GU for any
product prices px and py , so pure bundling is a more effective
strategy in this scenario than selling unbundled products.
p_x - V_x > p_B - V_x - V_y    (6.48)

p_x > p_B - V_y
p_y > p_B - V_x    (6.49)
segments that fall into these areas will choose to buy either prod-
uct X (rightmost triangle) or Y (leftmost triangle) instead of a
bundle, which will generate a higher revenue than that with pure
bundling.
The approach above provides relatively high flexibility for the opti-
mization of bundle prices. With the assumption that the willingness to
pay is estimated for sample segments or individual consumers, the cor-
responding points can be pinned on a plane and the optimal prices can
be searched by using numerical optimization methods and tessellating
the plane into areas with different profits depending on the segment
locations and sizes.
The first demand model we will consider was designed for assortment
optimization at Albert Heijn, a supermarket chain in the Netherlands
[Kök and Fisher, 2007]. It places a strong emphasis on consumer de-
cisions to enable a fine-grained analysis of the factors that influence
consumer choice.
A supermarket chain carries a large number of products that are
divided into merchandising categories, such as cheese, wine, cookies,
and milk. Each category is further divided into subcategories such that
products within a subcategory are similar and are often good substi-
tutes for one another but the difference between subcategories is sub-
stantial. For example, a fluid milk category can include subcategories
such as whole milk, fat-free milk, flavored milk, and so on. Supermar-
kets typically achieve very high service levels for the products that they
carry and stockouts are quite rare, so the model we will consider does
not take stockouts into account. At the same time, this demand model
was designed for assortment analysis and optimization, so it explic-
itly accounts for consumer choice, which makes it well suited to the
assortment-related problem that we will consider in later sections.
The demand for a single product can be broken down into three
separate decisions that apply for every consumer visiting a store:
• First, a consumer purchases or does not purchase from a subcategory. Let us denote the probability that a consumer purchases any product from a subcategory during the visit to the store as Pr(purchase | visit).
• Second, given a purchase, the consumer chooses a particular product j from the subcategory with probability Pr(j | purchase).
• Third, the consumer decides how many units to buy, with expected quantity E[Q | j, purchase].
The expected demand for product j is then the product of these three factors scaled by the number of store visits N:

D_j = N \cdot \Pr(\text{purchase} \mid \text{visit}) \cdot \Pr(j \mid \text{purchase}) \cdot \mathbb{E}[Q \mid j, \text{purchase}]    (6.50)
\Pr(\text{purchase} \mid \text{visit}) = \frac{1}{1 + e^{-x}}    (6.52)

which is equivalent to

x = \log \frac{\Pr(\text{purchase} \mid \text{visit})}{1 - \Pr(\text{purchase} \mid \text{visit})}    (6.53)
x_{ht} = \beta_1 + \beta_2 T_t + \beta_3 W_t + \beta_4 A_{ht} + \sum_{i=1}^{7} \beta_{4+i} B_{ti} + \sum_{i=1}^{N_E} \beta_{11+i} E_{ti}    (6.54)
with indicator variables A_{jht} for individual products that equal one if the product is promoted and zero otherwise:

A_{ht} = \frac{1}{J} \sum_{j=1}^{J} A_{jht}    (6.55)
\mathbb{E}[Q \mid j, \text{purchase}] = \lambda_j + \lambda_{N+1} A_{jht} + \lambda_{N+2} W_t + \sum_{i=1}^{N_H} \lambda_{N+2+i} E_{ti}    (6.58)
in which λ are regression coefficients and the other variables have
been defined and explained above. By substituting the individual re-
gression models above into the root equation 6.50, we obtain a fully
specified demand prediction model. This model can be adjusted to the
retailer’s business usage cases by adding more explanatory variables,
such as marketing events.
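A minimal sketch of how the three factors of equation 6.50 compose into a demand forecast is shown below. The purchase probability follows the logistic form 6.52; the product-choice and quantity models are simplified stand-ins (a softmax over hypothetical utilities and fixed expected quantities) for the regressions specified above.

```python
import numpy as np

def purchase_probability(x):
    """Logistic subcategory purchase probability, equation 6.52."""
    return 1.0 / (1.0 + np.exp(-x))

def choice_probabilities(utilities):
    """Simplified product-choice model: softmax share of each product."""
    e = np.exp(utilities - np.max(utilities))
    return e / e.sum()

def expected_demand(N, x, utilities, expected_qty):
    """Composite demand per product, equation 6.50:
    D_j = N * Pr(purchase | visit) * Pr(j | purchase) * E[Q | j, purchase]."""
    return N * purchase_probability(x) * choice_probabilities(utilities) * expected_qty

# Hypothetical inputs: 10,000 store visits, a purchase-propensity score x,
# product utilities, and expected purchase quantities per product.
D = expected_demand(N=10_000, x=-0.5, utilities=np.array([0.2, 0.0, -0.3]),
                    expected_qty=np.array([1.2, 1.0, 1.1]))
print(np.round(D, 1))
```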
Competing products and their attributes play an important role in
demand modeling even if the assortment is not the main concern. For
example, the online fashion retailer Rue La La reported that the relative
price of competing styles and the number of competing styles are in the
top three most important features in their demand prediction model
[Ferreira et al., 2016].
The second demand model we will review was developed for Zara,
a Spanish fashion retailer and the main brand of Inditex, the world’s
largest fashion group [Caro and Gallien, 2012]. The model is geared
towards sales events optimization and places strong emphasis on the
temporal dimension of the demand.
Seasonal clearance sales are an integral part of the business strategy
for many apparel retailers. A regular selling season, which is typically
biannual (fall–winter and spring–summer), is followed by a relatively
short clearance sale period that aims to sell off the remaining inventory
and free up space for the new collection for the next season. Some
retailers exercise even smaller sales cycles to overrun competitors and
get more revenues from customers by offering more diverse and fluid
assortment. Price optimization in such an environment requires the
creation of a demand model that properly accounts for seasonal effects
and stockouts caused by the exhaustion of inventory and deliberate
assortment changes.
We describe the demand model in two steps, in accordance with
the original report [Caro and Gallien, 2012]. The first step is to pre-
pare the available demand data for regression analysis by removing
seasonal variations and accounting for demand censoring due to stock-
outs. Next, the regression model itself is specified.
\delta(d) = S_W(d) \cdot \frac{S_D(\text{weekday}(d))}{\sum_{i=1}^{7} S_D(i)}    (6.59)
The first and second terms account for inter-week and intra-week
demand variations, respectively. Next, we introduce the following fac-
tor that accounts for both seasonality and on-display information to
normalize the demand for product r and week w:
k(r, w) = \sum_{h, v} \frac{S_W(r, v, h)}{S_W(r)} \sum_{d \text{ in } w} \delta(d)\, F(r, v, d, h)    (6.60)
\cdots + \alpha_{4,w} \log \left( \min \left( 1, \frac{L(r, w)}{T} \right) \right) + \alpha_{5,w} \log \frac{p_{r,w}}{p_{r,0}}    (6.62)
in which α are regression coefficients and the features are defined as
follows:
• α_{4,w}: The broken assortment effect refers to the fact that the demand for a given product can decrease as the inventory level gets low. In the fashion retail context, this can often be attributed to unpopular sizes and colors that remain after the most popular ones have sold out. This effect can be accounted for by introducing a threshold T for L(r, w), which is the aggregated inventory level for product r across all variants and stores.
• α_{5,w}: The discount depth is defined as the ratio between the current price p_{r,w} and the regular price p_{r,0}. This term is effectively a price sensitivity factor.
We can see that the model is heavily focused on the demand variabil-
ity over time because it was created for the optimization of seasonal
sales events. Demand models reported by other fashion retailers can
include more features such as brand, color and size popularity, rela-
tive prices of competing styles, and different statistics about past sales
events that can shift price sensitivity [Ferreira et al., 2016].
The demand models described in the previous sections are merely re-
gression models that are trained to forecast the sales numbers. In prac-
tice, the observed sales numbers do not necessarily match the actual de-
mand because of stockout events. If this is the case, the observed sales
volume will be lower than the actual demand, that is, the sales volume
that could potentially be achieved given an unlimited supply without
stockouts. The problem of stockouts can be especially important for
business models with seasonal sales or flash sales, where stockouts
are very frequent and a demand prediction model created based on
the observed sales volume is likely to be biased and inappropriate for price optimization.
• If the product i has already been on sale and did sell out, we
have not observed the true demand. This requires us to perform
demand unconstraining, that is, to estimate the true demand based
on the quantity sold and the historical data for other products.
• If the product is new and has never been on sale, the demand
needs to be predicted with a regression model that uses product
and event properties as features and the unconstrained demand
value as the response variable. This problem can be solved by
using the methods discussed in the previous section. This case
can be summarized as
\hat{d}_i = f(\text{product}, \text{event}) \;\rightarrow\; \hat{q}_i = \min \left\{ c_i^{new},\; \hat{d}_i \right\}    (6.66)
Figure 6.14: Example of demand curves and the determination of the demand-
unconstraining proportion [Ferreira et al., 2016]. In this example,
there are three classes of events, each of which is represented by a
typical demand curve, and the event that needs to be unconstrained
falls into the first class.
\hat{d}_i = q_i / c_{k_i}    (6.67)
This estimate can then be used for sales volume prediction, demand
modeling, and stock level optimization.
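A minimal sketch of this proportion-based unconstraining step, assuming c denotes the share of total demand that products of the same event class typically realize before the stockout time:

```python
def unconstrain(quantity_sold, proportion_observed):
    """Scale the censored sales volume by the share of total demand
    typically observed before the stockout, in the spirit of
    expressions 6.66-6.67."""
    return quantity_sold / proportion_observed

# A product sold 300 units before stocking out; similar past events
# realized about 60% of their total demand by that point in the sale.
print(unconstrain(300, 0.60))   # estimated unconstrained demand: 500.0
```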
Several segments can be combined into a group, and that group can be assigned a single price. This can be done by rewriting equation 6.68 for N segment groups S_i as

\max_p \; \sum_{i=1}^{N} \sum_{s \in S_i} \left( p_i - v_s \right) q_s(p_i)    (6.69)
example 6.3
in which p is the price per tablet, s is the package size (the num-
ber of tablets in the bottle), and h is the average household size fac-
tor, which is positive when the average household size in the store
area is relatively large and negative when the average household is
small. The demand is negatively correlated with price, as we expect.
It is also negatively correlated with the package size, which indicates
that consumers prefer smaller packages to larger ones. The last term
is positively correlated with the package size for large households and
negatively correlated for small ones, so the demand for large packages
is higher in areas with large households, which is also intuitive.
We will optimize prices for a setting with two stores with different
values of the household size factor h and two package sizes with dif-
ferent wholesale prices v, as shown in Figure 6.15.
The first scenario we consider is a fine-grained price differentiation
that jointly optimizes quantity discounts based on package sizes and
store-level prices. The goal is to find four different prices pij , in which
i corresponds to one of two package sizes and j corresponds to one of
two stores. The optimization problem can then be stated as follows:
\max_p \; \sum_{i = 1,2} \sum_{j = 1,2} s_i \left( p_{ij} - v_i \right) q(p_{ij}, s_i, h_j)    (6.71)
This optimization problem is separable and quadratic with respect to
prices because the demand function is linear. Solving the problem for
Figure 6.15: Parameters of the price optimization problem for two stores and
two package sizes.
the values from Figure 6.15, we obtain the results presented in table 6.5.
We can see that the solution justifies quantity discounts and suggests
that per-tablet prices are lower for a large package in both stores. It
also exploits the higher demand for large packages in the area with
large households by raising the price in the corresponding store.
Table 6.5: The optimal prices for a scenario with four segments.
The profit achievable in this scenario is lower than that of the first scenario, as one can see in table 6.6 where the optimization results are shown.
Package size (tablets)   Price per tablet   Demand (bottles)
25                       $0.74              1401
50                       $0.63              1195
Table 6.6: The optimal prices for a scenario with two segments.
Modeling demand shifts between segments requires a demand model that simultaneously accounts for the prices of all related seg-
ments. It is often impossible to measure cross-price elasticities (how the
demand in one segment is impacted by the price in another segment)
for all possible pairs of segments, so one has to use grosser approxima-
tions, such as the ratio between the price in a given segment and the
average prices in other segments. Another challenge of demand shift-
ing models is that cross-segment dependencies make the optimization
problem inseparable and sharply increase its computational complex-
ity. Assuming that we draw prices from a discrete set of size m and the
number of segments is n, we might need to evaluate as many as mn
price combinations if the demand model does not follow any particular
functional form that exhibits properties such as linearity or convexity.
One possible approach to price optimization with demand shifting
is to make the assumption that the demand shift is proportional to the
price difference between the segments. More specifically, if the price
for segment i is higher than the price for segment j, then the demand
for segment i decreases by Kppi pj q and the demand for segment
j increases by the same amount. Parameter K determines the amount
of demand transferred between the two segments for every dollar of
price difference. The basic optimization problem can then be rewritten
as follows to adjust the demand in each segment by the total of the
demands shifted from other segments:
\max_p \; \sum_i \left( p_i - v_i \right) \left( q_i(p_i) + K \sum_j \left( p_j - p_i \right) \right)    (6.73)
in which i and j iterate over all segments. Note that this demand
shifting model does not change the total demand, in the sense that the
sum of all shifts is zero. However, this does not mean that the total
quantity sold remains constant for any K because the demand shift
causes the optimal prices to change, which, in turn, changes the values
of the demand functions.
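Because the shifting term couples the segments, problem 6.73 generally has to be solved by searching over price combinations. The brute-force sketch below does this for two hypothetical segments with linear demand curves; the demand parameters, costs, and the value of K are illustrative assumptions.

```python
import numpy as np
from itertools import product

def optimize_with_shifting(demands, costs, K, candidate_prices):
    """Exhaustive search over price combinations for problem 6.73:
    each segment's demand is adjusted by K * sum_j (p_j - p_i)."""
    best, best_profit = None, -np.inf
    n = len(demands)
    for prices in product(candidate_prices, repeat=n):
        profit = 0.0
        for i in range(n):
            shift = K * sum(prices[j] - prices[i] for j in range(n))
            q = demands[i](prices[i]) + shift
            profit += (prices[i] - costs[i]) * max(q, 0.0)
        if profit > best_profit:
            best, best_profit = prices, profit
    return best, best_profit

# Two hypothetical segments with linear demand curves
demands = [lambda p: 800 * (1 - p / 1.2), lambda p: 1200 * (1 - p / 1.0)]
costs = [0.40, 0.35]
prices, profit = optimize_with_shifting(demands, costs, K=400,
                                         candidate_prices=np.arange(0.4, 1.2, 0.02))
print(prices, round(profit, 2))
```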
example 6.4
in which the demand shift is the sum of the pairwise price differ-
ences with other segments, or
\Delta(p_{ij}) = K \sum_{x = 1,2} \sum_{y = 1,2} \left( p_{xy} - p_{ij} \right)    (6.75)
We solve this optimization problem and obtain the prices for the four
segments presented in table 6.7. From a comparison with tables 6.5 and
6.7, we find that the demand shifting has increased price sensitivity, so
the demand for small packages has decreased but the demand for a
large package has increased because of the relatively low price per
tablet. This change in demand can be considered positive for a retailer
because it increases the total profit.
Table 6.7: The optimal prices for a scenario with four segments and demand shift. The shifting parameter K = 400.
N
6.7.1.2 Differentiation with Constrained Supply
The price optimization models that we have considered so far were fo-
cused on setting prices that achieve the highest possible profits allowed
by a given demand curve. This view on price optimization assumes
perfect replenishment, such that a seller is always able to deliver the
quantity of a product demanded at the profit-optimal price. This as-
sumption is reasonably fair for some industries, such as supermarket
retail, where it is possible to build a supply chain that almost perfectly
replenishes the inventory and stockouts are rare. However, as we have
already discussed, it does not hold in many other industries that face
different supply constraints. In this section, we will discuss a relatively
simple case, in which each market segment has a fixed capacity of a
product and we need to find the optimal global price or segment-level
prices.
Let us first consider how a product can be priced for a single market
segment if the available quantity is fixed. This problem is a stan-
dard unit-price optimization with the addition of a quantity constraint:
\max_{p,\,x} \; x\,(p - V)
subject to \; x \leq q(p), \quad x \leq C, \quad p \geq 0    (6.76)
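A minimal sketch of this constrained problem, assuming a hypothetical linear demand curve and illustrative values of the variable cost V and capacity C, is the following grid search.

# A sketch of problem 6.76: pick the price p that maximizes x * (p - V),
# where the quantity sold x cannot exceed the demand q(p) or the capacity C.
# The demand curve and the numbers below are illustrative assumptions.
def q(p):
    return max(0.0, 1000 - 20 * p)   # hypothetical linear demand

V, C = 10.0, 300.0                   # variable cost per unit and available capacity

def profit(p):
    x = min(q(p), C)                 # x <= q(p) and x <= C
    return x * (p - V)

prices = [p / 10 for p in range(0, 501)]          # candidate prices $0.0 .. $50.0
p_opt = max(prices, key=profit)
print(p_opt, profit(p_opt))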
example 6.5
in which p is the ticket price and t is the day of the week. We also
assume that the variable costs per seat are negligible because the build-
ing maintenance costs, cost of the performance, and other expenses are
pretty much constant.
The opera house might decide to set a flat price for all days, which
would lead to the following constrained optimization problem:
\max_{p,\,\mathbf{x}} \; p \sum_{t} x_t
subject to \; x_t \leq q(p, t), \quad x_t \leq C, \quad p \geq 0    (6.78)
in which t iterates through the seven days of the week. This prob-
lem instance is not particularly difficult because we can assume that
the price belongs to a relatively small discrete set, so we can evaluate
each candidate solution. We find that the optimal price is $19.80, which
corresponds to a revenue of $98,010.
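The demand model q(p, t) used in this example is not reproduced here, so the sketch below substitutes an assumed day-dependent linear demand; it only illustrates how each candidate flat price can be evaluated under the seat capacity constraint of problem 6.78.

# Sketch of the flat-price search in problem 6.78 with an assumed demand model.
C = 1000                                            # seats per performance (assumption)
day_factor = [0.6, 0.7, 0.8, 0.9, 1.0, 1.3, 1.2]    # hypothetical demand multipliers, Mon..Sun

def q(p, t):
    """Hypothetical demand for day t at ticket price p."""
    return max(0.0, day_factor[t] * (1500 - 40 * p))

def weekly_revenue(p):
    # Seats sold on each day are capped by both demand and capacity.
    return sum(p * min(q(p, t), C) for t in range(7))

candidates = [round(5 + 0.1 * i, 1) for i in range(451)]    # $5.00 .. $50.00
p_flat = max(candidates, key=weekly_revenue)
print(p_flat, round(weekly_revenue(p_flat)))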
The result above can be contrasted with variable pricing, for which
each day is considered as a separate segment and optimized accord-
Figure 6.17: Example of ticket price optimization for an opera house. The verti-
cal bars represent the number of seats sold, and the points are the
ticket prices for the corresponding days.
w(p) = \mathrm{unif}(0, P) =
\begin{cases}
1/P, & 0 \leq p \leq P \\
0, & \text{otherwise}
\end{cases}    (6.81)
Recall that uniform willingness to pay implies a linear demand
curve, which, in this context, can be interpreted as the number of
customers who buy a product at a given price and become inactive
until the end of the sale. Consequently, we can visualize the markdown
process as sliding down the demand curve, as shown in Figure 6.18.
Note that the optimization of markdown prices in this interpretation is
almost identical to the market segmentation problem that we studied
earlier, so the equations below are structurally similar to the equations
for market segmentation but have different meanings.
We find that the quantity sold during period t is

Q_t = Q_{max}\left(1 - \frac{p_t}{P}\right) - Q_{max}\left(1 - \frac{p_{t-1}}{P}\right) = \frac{Q_{max}}{P}\,(p_{t-1} - p_t)    (6.82)

Consequently, the total sales revenue is

G = \sum_{t=1}^{T} p_t \cdot \frac{Q_{max}}{P}\,(p_{t-1} - p_t)    (6.83)

Differentiating G with respect to each intermediate price and setting the derivatives to zero yields the optimality condition

p_t = \frac{p_{t-1} + p_{t+1}}{2}    (6.85)
p_0 = P, \qquad p_T = P_S    (6.86)

The markdown prices that meet relationship 6.85 and conditions 6.86
are given by

p_t^{opt} = P_S + (P - P_S)\left(1 - \frac{t}{T}\right)    (6.87)
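Assuming that P is the highest willingness to pay and P_S is the final price of the sale, schedule 6.87 is a simple arithmetic progression and can be computed directly; the parameter values below are illustrative.

# Linear markdown schedule implied by conditions 6.86 and optimality rule 6.85:
# prices form an arithmetic progression from p_0 = P down to p_T = P_S.
P, P_S, T = 100.0, 40.0, 6        # illustrative starting price, final price, number of periods

schedule = [P_S + (P - P_S) * (1 - t / T) for t in range(T + 1)]
print([round(p, 2) for p in schedule])   # [100.0, 90.0, 80.0, 70.0, 60.0, 50.0, 40.0]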
example 6.6
We substitute these parameters into problem 6.90 and solve it for dif-
ferent values of product capacity C. These solutions are presented in
the small tables below. Each solution is a price schedule with 20 values
of z: each column corresponds to one of the four weeks, and each row
corresponds to one of the five price levels, with the topmost row cor-
responding to the highest price and the lowermost row corresponding
to the lowest price. For example, the optimal price is $89 for the first,
third, and fourth weeks when the capacity is 700 items. For the same
capacity, the price is fractional for the second week, which means a mix
with 22% at $89 and 78% at $79.
We can see that the prices generally decrease over time, which re-
flects the decreasing trend in the demand rates. Another tendency
is that tight capacity constraints generally diminish and delay mark-
downs, which is also expected.
N
6.7.2.3 Price Optimization for Competing Products
One of the main challenges of price optimization is the dependen-
cies between the products. In many business domains, especially in
retail, customers constantly make a choice between competing or sub-
stitutable products by using the price of one product as a reference
point for another product, so the demand for a given product is usually
a function of both the product price and the prices of the competing
products. If this is the case, the prices cannot be optimized for each
product in isolation, and the prices for all competing products need to
be optimized jointly instead. The number of competing products can
be as high as several hundred in some applications, so the optimiza-
tion problem can become computationally intractable. In this section,
we dive into this problem and discuss the framework developed by
Rue La La, an online fashion retailer, which can significantly reduce
the computational effort [Ferreira et al., 2016].
\max_{\mathbf{p}} \sum_{i \in N} p_i\, q_i(\mathbf{p})
subject to \; p_i \in P, \quad i = 1, \ldots, n    (6.92)
m = n(k - 1) + 1    (6.94)
The utility function xkut can be learned from the data by building a
regression model such as the following:
x_{kut} = \sum_{w=1}^{W} \beta_{uw} F_{kutw}    (6.98)
will lift this volume by adding area S_1, so the total revenue and pro-
motional costs (shown in the middle plot) will both be proportional
to S_0 + S_1. A time-optimized promotion will make the revenue pro-
portional to S_0 + S_2, and its costs will be proportional to S_0/2 + S_2 (the
plot at the bottom). This difference between a flat promotion and an op-
timized promotion shows the potential to take advantage of temporal
optimization in the case of certain quantitative properties of probability
density functions.
Other industries, such as airlines, hotels, and freight distribution, might be less flexible in that regard. This
difference can be partly attributed to historical reasons, namely the
practice of setting fixed fares for different classes of service. This has
resulted in the development of a large group of methods that are based
on an alternative interpretation of the problem. These methods first ap-
peared in the airline industry and historically preceded dynamic pric-
ing, as well as most programmatic methods in general.
Assuming that a seller has operational, legal, or business constraints
that limit the ability to vary prices arbitrarily, we can turn the dynamic
price optimization problem upside down and consider an alternative
approach. The idea is to define a set of fixed price segments, typically
referred to as fare classes, and allocate a fraction of the total capacity to
every class in a way that maximizes profits. Consequently, the subject
of optimization is the capacity limits allocated for each class. A classic
example of this problem is an airline that offers three fare classes (e. g.,
economy, business, and first class) and decides how many seats of each
class should be reserved, given that the total airplane capacity is fixed.
6.8.1 Environment
• All fare classes are served from the same fixed capacity of a re-
source, and the booking limits for classes can be changed dynam-
ically. For example, an airline can allocate different percentages
of standard economy and discounted economy seats for differ-
ent flights, although the total number of economy seats remains
fixed.
Step | Protection Levels (y1, y2, y3) | Units Sold (C1, C2, C3) | Reservation Request | Action
1    | 2, 4, 8                        | 0, 0, 0                 | 1 unit in C2        | Accept
2    | 2, 4, 7                        | 0, 1, 0                 | 1 unit in C3        | Accept
3    | 2, 4, 6                        | 0, 1, 1                 | 1 unit in C3        | Accept
4    | 2, 4, 5                        | 0, 1, 2                 | 1 unit in C1        | Accept
5    | 2, 4, 4                        | 1, 1, 2                 | 1 unit in C3        | Reject
6    | 2, 4, 4                        | 1, 1, 2                 | 1 unit in C1        | Accept
7    | 2, 3, 3                        | 2, 1, 2                 | 1 unit in C2        | Accept
8    | 2, 2, 2                        | 2, 2, 2                 | 1 unit in C2        | Reject
9    | 2, 2, 2                        | 2, 2, 2                 | 1 unit in C3        | Reject
10   | 2, 2, 2                        | 2, 2, 2                 | 1 unit in C1        | Accept
11   | 1, 1, 1                        | 3, 2, 2                 | 1 unit in C1        | Accept
12   | 0, 0, 0                        | 4, 2, 2                 | —                   | Reject
Table 6.8: Example of the reservation process with nested booking limits.
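The acceptance logic behind this trace can be sketched in a few lines; the rule below accepts a request for a class only if one more booking still leaves the units protected for the more expensive classes, and the capacity, protection levels, and request stream mirror the table.

# A sketch of the acceptance logic behind table 6.8. Classes are ordered from the most
# expensive (1) to the cheapest (3); protection[j] is the number of units protected for
# classes 1..j. The capacity, protection levels, and request stream mirror the example.
capacity = 8
protection = {0: 0, 1: 2, 2: 4}          # y0 = 0, y1 = 2, y2 = 4 (y3 equals the full capacity)
requests = [2, 3, 3, 1, 3, 1, 2, 2, 3, 1, 1, 1]   # class of each incoming reservation request

remaining = capacity
for step, cls in enumerate(requests, start=1):
    # Accept only if one more booking still leaves the protected units for higher classes.
    accept = remaining - 1 >= protection[cls - 1]
    if accept:
        remaining -= 1
    print(step, f"class {cls}", "Accept" if accept else "Reject", f"remaining={remaining}")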
accept no more than C − y requests for the second class and reserve
the remaining y units for the first class requests.
Any time we receive a request for the second class, we have the
choice of accepting it or rejecting it and switching the space to the
first class. This decision can be easily analyzed in terms of expected
outcomes, as illustrated in Figure 6.21. If we accept the request, we
earn revenue of p2 . If we reject the request, close the second class,
and switch to the first class, then two outcomes are possible. On the
one hand, if the demand for the first class eventually exceeds the
remaining capacity y, we will book this unit at price p1 . On the other
hand, if the demand for the first class is lower than the remaining
capacity, the unit would not be booked at all and we would earn zero
revenue. Consequently, the acceptance condition for the second class
can be written as follows:
p_2 \geq p_1 \cdot \Pr(Q_1 \geq y)    (6.102)

or, expressing the probability through the cumulative distribution function F_1 of the first-class demand,

p_2 \geq p_1 \left(1 - F_1(y)\right)    (6.103)

Solving this condition for the protection level y, we obtain

y^{opt} = F_1^{-1}\left(1 - \frac{p_2}{p_1}\right)    (6.104)
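For a demand distribution with a known quantile function, rule 6.104 can be evaluated directly. The sketch below uses the normally distributed full-fare demand of example 6.7; the continuous quantile is rounded to a whole number of units.

from scipy.stats import norm

# Littlewood's rule (6.104): protect y units for the full-fare class so that
# p2 = p1 * Pr(Q1 >= y), i.e. y_opt = F1^{-1}(1 - p2 / p1).
p1, p2 = 300.0, 200.0            # full and discounted fares (as in example 6.7)
mu, sigma = 8.0, 2.0             # normally distributed full-fare demand (as in example 6.7)

y_opt = norm.ppf(1 - p2 / p1, loc=mu, scale=sigma)
print(round(y_opt, 2))           # continuous approximation, about 7 units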
example 6.7
Let us illustrate the optimization of two fare classes with the following
example. Consider a service provider that has 20 units of a resource
and sells them at a full price of $300 and a discounted price of $200.
The demand for the full price service is estimated to be normally dis-
tributed with a mean of 8 and a standard deviation of 2. Consequently,
the probability that the demand for the full price offering will exceed
y units can be expressed by using the cumulative distribution function
Φ of the standard normal distribution
\Pr(Q_1 \geq y) = 1 - \Phi\left(\frac{y - 8 - 0.5}{2}\right)    (6.105)
Note that we have added a shift of 0.5 because of the discrete nature
of the reservation units – the probability that the demand is exactly y
units can be approximated by integrating the demand density function
over the interval from y − 0.5 to y + 0.5. Given that we reserve
y units to be sold at full price, the marginal revenue for a full-price
unit can be defined as
r_1(y) = \$300 \cdot \Pr(Q_1 \geq y)    (6.106)
In other words, the marginal revenue is the difference between the
expected revenue from the full price segment, given that y units are
allocated for this segment, and the corresponding revenue, given that
only y − 1 units are allocated. The total expected revenue from the
full-price segment is the sum of the marginal revenues:
R_1(y) = \sum_{i=1}^{y} r_1(i)    (6.107)
y  | r1(y) | R1(y) | R2(y) | R1(y) + R2(y)
1  | $299  | $299  | $3800 | $4099
2  | $299  | $599  | $3600 | $4199
3  | $299  | $898  | $3400 | $4298
4  | $296  | $1195 | $3200 | $4395
5  | $287  | $1483 | $3000 | $4483
6  | $268  | $1751 | $2800 | $4551
7  | $232  | $1983 | $2600 | $4583
8  | $179  | $2163 | $2400 | $4563
9  | $120  | $2283 | $2200 | $4483
10 | $67   | $2351 | $2000 | $4351
11 | $31   | $2383 | $1800 | $4183
12 | $12   | $2395 | $1600 | $3995
13 | $3    | $2398 | $1400 | $3798
14 | $0    | $2399 | $1200 | $3599
15 | $0    | $2399 | $1000 | $3399
16 | $0    | $2400 | $800  | $3200
17 | $0    | $2400 | $600  | $3000
18 | $0    | $2400 | $400  | $2800
19 | $0    | $2400 | $200  | $2600
20 | $0    | $2400 | $0    | $2400
Table 6.9: Example of the protection level optimization for two fare classes.
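The computation behind table 6.9 can be reproduced with a short script; values may differ from the printed table by a dollar or two because of rounding.

from scipy.stats import norm

# Sketch of the computation behind table 6.9: marginal and cumulative revenues for
# each candidate protection level y, using the demand model of example 6.7.
C, p1, p2 = 20, 300.0, 200.0
mu, sigma = 8.0, 2.0

def tail(y):
    """Pr(Q1 >= y) with the 0.5 continuity correction from equation 6.105."""
    return 1 - norm.cdf((y - mu - 0.5) / sigma)

rows = []
R1 = 0.0
for y in range(1, C + 1):
    r1 = p1 * tail(y)                    # marginal revenue of protecting the y-th unit (6.106)
    R1 += r1                             # cumulative full-fare revenue (6.107)
    R2 = p2 * (C - y)                    # discounted-fare revenue for the remaining units
    rows.append((y, r1, R1, R2, R1 + R2))

y_best = max(rows, key=lambda row: row[-1])[0]
print(y_best)                            # protection level with the highest total revenue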
Let us take one step forward and consider a decision tree for a third
class, as shown in Figure 6.22.
Similarly to the two-class problem, a request for the third class can
be either accepted, which earns us the revenue of p3 , or rejected. The
latter action means that the third class will be closed and all further
requests will be handled in the two-class mode. This can result in two
possible outcomes:
• If the total demand for the first and second classes is below the
protection level y2 , the unit will be lost – we closed the third class
too early.
We can compare equations 6.109 and 6.110 and apply the decision
tree approach recursively to find the following relationship for the op-
timal protection levels:
p_{j+1} = p_j \cdot \Pr\Big(Q_1 + \ldots + Q_j \geq y_j^{opt} \;\Big|\; Q_1 \geq y_1^{opt} \text{ and } \ldots \text{ and } Q_1 + \ldots + Q_{j-1} \geq y_{j-1}^{opt}\Big)    (6.111)
Figure 6.23: Optimization of the protection levels for three classes by using sim-
ulations.
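A simulation-based search of this kind can be sketched as follows: for every candidate pair of protection levels, demand is sampled repeatedly, requests are assumed to arrive strictly from the cheapest class to the most expensive one, and the pair with the highest average revenue is kept. The fares, demand distributions, and capacity below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)

# Illustrative three-class setup: fares and independent normal demands (assumptions).
fares = np.array([300.0, 200.0, 150.0])          # class 1 is the most expensive
means, sds = np.array([8.0, 12.0, 25.0]), np.array([2.0, 3.0, 6.0])
C = 30                                           # total capacity

def expected_revenue(y1, y2, n_sim=4000):
    """Simulate low-fare-before-high-fare arrivals with nested protection levels y1 <= y2."""
    q = np.maximum(rng.normal(means, sds, size=(n_sim, 3)), 0).round()
    remaining = np.full(n_sim, float(C))
    sold3 = np.minimum(q[:, 2], np.maximum(remaining - y2, 0)); remaining -= sold3
    sold2 = np.minimum(q[:, 1], np.maximum(remaining - y1, 0)); remaining -= sold2
    sold1 = np.minimum(q[:, 0], remaining)
    return (fares[0] * sold1 + fares[1] * sold2 + fares[2] * sold3).mean()

best = max(((y1, y2) for y2 in range(C + 1) for y1 in range(y2 + 1)),
           key=lambda y: expected_revenue(*y))
print(best)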
6.8.4.1 EMSRa
Recall that the protection level for class j is the total capacity reserved
for classes from j down to 1. If we have already calculated the protec-
tion levels for the cheaper classes from n down to j + 1, the protection
level for j determines how to split the capacity between class j + 1 and
the more expensive classes. The idea of EMSRa is to approximate this
protection level by the sum of the protection levels obtained by apply-
ing Littlewood’s rule for class j + 1 and each of the classes from j down
to 1 separately. This means that we first calculate j pairwise protection
levels from the following equations:
p_{j+1} = p_j \cdot \Pr\big(Q_j \geq y_{j+1}^{(j)}\big)
p_{j+1} = p_{j-1} \cdot \Pr\big(Q_{j-1} \geq y_{j+1}^{(j-1)}\big)
\quad\vdots    (6.112)
p_{j+1} = p_1 \cdot \Pr\big(Q_1 \geq y_{j+1}^{(1)}\big)
The final protection level is calculated as a sum of the pairwise levels,
as illustrated in Figure 6.24.
y_j = \sum_{k=1}^{j} y_{j+1}^{(k)}    (6.113)
Comparing equations 6.112 and 6.113 with the optimal solu-
tion 6.111, we see that EMSRa uses the probabilities that separate
demands exceed the levels determined by the corresponding price
ratios to approximate the probability that the sum of demands exceeds
a certain level. As a result, EMSRa tends to be excessively conservative,
in the sense that it reserves too many units for the higher classes and,
thereby, rejects too many low-fare bookings.
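A compact sketch of EMSRa with normally distributed, independent class demands is shown below; classes are indexed from the most expensive (index 0) to the cheapest, and the fares and demand parameters are illustrative.

from scipy.stats import norm

# Sketch of EMSRa (equations 6.112 and 6.113). Fares and normal demand parameters
# are illustrative assumptions.
fares = [300.0, 200.0, 150.0, 100.0]
means = [8.0, 12.0, 20.0, 30.0]
sds = [2.0, 3.0, 5.0, 7.0]

def emsr_a(fares, means, sds):
    """Return protection levels y_1 .. y_{n-1} (capacity protected for classes 1..j)."""
    n = len(fares)
    levels = []
    for j in range(1, n):                 # protection level separating class j+1 from classes 1..j
        p_next = fares[j]                 # fare of the next cheaper class
        y = sum(norm.ppf(1 - p_next / fares[k], loc=means[k], scale=sds[k])
                for k in range(j))        # pairwise Littlewood levels, summed (6.113)
        levels.append(max(round(y), 0))
    return levels

print(emsr_a(fares, means, sds))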
6.8.4.2 EMSRb
The alternative approach to the problem is to merge the classes from j
to 1 into one virtual aggregate class that has its own demand and price
and to then apply Littlewood’s rule. The demand for an aggregate class
can be estimated as the sum of demands for the included classes:
\bar{Q}_j = \sum_{k=1}^{j} Q_k    (6.114)
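The fare of the virtual aggregate class is not specified in the excerpt above; the sketch below follows the common convention of using the demand-weighted average fare of the merged classes and then applies Littlewood’s rule, with illustrative fares and independent normal demands.

from scipy.stats import norm

# Sketch of EMSRb: classes j..1 are merged into a virtual class whose demand is the
# sum of the individual demands (6.114); the fare of the virtual class is taken here
# as the demand-weighted average fare (an assumption), and Littlewood's rule is then
# applied to the pair (virtual class, next cheaper class). All numbers are illustrative.
fares = [300.0, 200.0, 150.0, 100.0]
means = [8.0, 12.0, 20.0, 30.0]
variances = [4.0, 9.0, 25.0, 49.0]

def emsr_b(fares, means, variances):
    levels = []
    for j in range(1, len(fares)):
        agg_mean = sum(means[:j])                            # aggregate demand mean
        agg_sd = sum(variances[:j]) ** 0.5                   # independent demands assumed
        agg_fare = sum(f * m for f, m in zip(fares[:j], means[:j])) / agg_mean
        y = norm.ppf(1 - fares[j] / agg_fare, loc=agg_mean, scale=agg_sd)
        levels.append(max(round(y), 0))
    return levels

print(emsr_b(fares, means, variances))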
In other words, the lift is the ratio between the observed probability
of two items co-occurring and the co-occurrence probability calculated
under the assumption that the items are independent. Consequently,
a lift higher than one indicates affinity of the items (assuming statisti-
cal significance of the results, of course). We can measure the lift not
only for pairs of products but for pairs of categories as well, by map-
ping each item in the transaction history to its category and evaluating
expression 6.118 with the assumption that ra and rb are categories.
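A direct computation of the lift from a transaction history can be sketched as follows; the baskets are, of course, illustrative.

from collections import Counter
from itertools import combinations

# Sketch of the lift computation (expression 6.118): the ratio of the observed
# co-occurrence frequency of two items to the frequency expected under independence.
transactions = [                        # illustrative baskets
    {"beer", "chips"}, {"beer", "chips", "salsa"}, {"beer", "diapers"},
    {"chips", "salsa"}, {"milk", "bread"}, {"beer", "chips"},
]
n = len(transactions)
item_freq = Counter(item for t in transactions for item in t)
pair_freq = Counter(frozenset(p) for t in transactions for p in combinations(sorted(t), 2))

def lift(a, b):
    p_ab = pair_freq[frozenset({a, b})] / n
    return p_ab / ((item_freq[a] / n) * (item_freq[b] / n))

print(round(lift("beer", "chips"), 2))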
Let us now return to the store-layout optimization problem. We have
n product categories and n product locations, such as aisles or shelves.
The optimization problem can then be stated as the assignment of all
n categories to the n available locations.
example 6.8
1 QAP was first introduced in the context of operations research to model the following real-
life problem. There is a set of facilities and a set of locations. The objective is to assign each
facility to a location such that the total cost is minimized, with the assignment cost being a
product of the distance between locations and the flow between the facilities.
460 pricing and assortment
The floor plan is the 2 × 3 grid shown in Figure 6.25, in which each
placement represents a display shelf. In total, we have six available
placements for six categories.
D =
        1  2  3  4  5  6
    1 [ 0  1  0  1  0  0 ]
    2 [ 1  0  1  0  1  0 ]
    3 [ 0  1  0  0  0  1 ]    (6.123)
    4 [ 1  0  0  0  1  0 ]
    5 [ 0  1  0  1  0  1 ]
    6 [ 0  0  1  0  1  0 ]
We assume that the distance is only equal to one between adjacent
placements and is zero otherwise. For example, placement number five
has a distance of one to cells two, four, and six. By solving optimization
problem 6.121 for the matrices specified above, we find the optimal lay-
out presented in Figure 6.26. This small example can be easily solved
by evaluating all 6! = 720 possible permutations, but larger problems
require the use of optimization software that can handle QAP or one
of its relaxations.
N
Figure 6.26: One of the optimal store layouts for the example with six categories.
Alternative optimal layouts can, of course, be obtained by mirroring
this grid horizontally or vertically.
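A brute-force sketch of this layout problem is shown below. The distance matrix is adjacency matrix 6.123; the category affinity (lift) matrix is a hypothetical stand-in for the one used in the example, and the objective is taken to be the maximization of the total affinity between categories assigned to adjacent placements.

from itertools import permutations

# Brute-force sketch of the store layout QAP from the example: assign six categories
# to six placements so that high-affinity categories end up in adjacent placements.
# D is adjacency matrix 6.123; the lift matrix L is an illustrative assumption.
D = [
    [0, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 0],
]
L = [                                     # hypothetical pairwise lift between categories
    [0, 3, 1, 1, 2, 1],
    [3, 0, 1, 1, 1, 1],
    [1, 1, 0, 2, 1, 1],
    [1, 1, 2, 0, 1, 3],
    [2, 1, 1, 1, 0, 1],
    [1, 1, 1, 3, 1, 0],
]

def score(assignment):
    """assignment[c] is the placement of category c; reward adjacency of related categories."""
    return sum(L[a][b] * D[assignment[a]][assignment[b]]
               for a in range(6) for b in range(6))

best = max(permutations(range(6)), key=score)
print(best, score(best))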
• dj : the original demand rate for product j, that is, the number
of customers who would select product j if presented with full
assortment N. We also denote the vector of demands for all prod-
ucts as d = (d_1, . . . , d_J).
• Dj : the observed demand rate for the products, that is, the actual
number of customers per day who selected product j because
of their original intention or substitution. The observed demand
for a given product depends on the original demand and the
availability of other products because of the substitution effect,
so it can be thought of as a function D_j(f, d).
With the above notation, the assortment optimization problem for a
given store and a given category can be specified as follows:
\max_{\mathbf{f}} \sum_{j \in N} G_j\big(f_j, D_j(\mathbf{f}, \mathbf{d})\big)
subject to \; \sum_{j} f_j \leq F_0    (6.124)
G_j(f_j, D_j) = m_j D_j    (6.125)
Retailers of perishable goods should also take into account the losses
owing to disposed-of inventory, which can be modeled by introducing
a per-unit disposal loss L that applies to unsold inventory:
G_j(f_j, D_j) = m_j \min(D_j, f_j) - L\left(f_j - \min(D_j, f_j)\right)    (6.127)
For the sake of brevity, we hereafter assume that all products are
perfectly replenished, so stockouts are not possible or are negligible.
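A deliberately simplified sketch of problem 6.124 is shown below: it ignores the substitution effect (the observed demand is taken to be equal to the original demand), uses the perishable-goods margin 6.127, and enumerates the feasible facing allocations for three products. All numbers, including the number of units held by one facing, are illustrative assumptions.

from itertools import product

# Simplified sketch of problem 6.124 for three products, ignoring substitution and
# using the perishable-goods margin 6.127. All parameter values are illustrative.
margins = [2.0, 3.0, 1.5]        # m_j, margin per unit sold
loss = 0.8                       # L, loss per unit of unsold (disposed-of) inventory
demand = [40, 25, 60]            # d_j, expected daily demand per product
units_per_facing = 10            # how many units one facing holds (assumption)
F0 = 12                          # total facings available for the category

def category_profit(facings):
    total = 0.0
    for f, m, d in zip(facings, margins, demand):
        stocked = f * units_per_facing
        sold = min(d, stocked)
        total += m * sold - loss * (stocked - sold)     # equation 6.127
    return total

candidates = (f for f in product(range(F0 + 1), repeat=3) if sum(f) <= F0)
best = max(candidates, key=category_profit)
print(best, category_profit(best))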
6.11 summary
• The basic price structures include unit price, segmented price, two-
part tariffs, tying arrangements, and bundling. All of these struc-
tures can be optimized if a global demand curve is accurately esti-
mated.
\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_m), \qquad \alpha_i > 0    (A.3)
Figure A.2: Density plots for the Dirichlet distribution over the probability sim-
plex in a three-dimensional space.
The density function is completely flat when α = (1, 1, 1). The den-
sity function is bell-shaped and symmetric about the center of the
simplex when all elements of the parameter vector are equal and
greater than one. If the elements of the parameter vector are not all
equal, the bell is shifted in the direction of the parameters with the
biggest magnitudes.
biggest magnitudes. Finally, the important thing to note is that the
Dirichlet distribution is sparse for the parameter elements with small
magnitudes, in the sense that the density is concentrated in the corners,
and, consequently, the PMFs drawn from such a distribution tend to
have a strong bias towards a small subset of terms [Telgarsky, 2013].
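This effect is easy to observe by sampling: PMFs drawn with a flat parameter vector spread the probability mass fairly evenly, whereas PMFs drawn with small parameter values concentrate it on one or two terms.

import numpy as np

rng = np.random.default_rng(0)

# Sampling PMFs over five terms from Dirichlet distributions with a flat and a small
# (sparse) parameter vector; small alphas concentrate the mass on a few terms.
flat = rng.dirichlet([1.0] * 5, size=3)
sparse = rng.dirichlet([0.1] * 5, size=3)
print(np.round(flat, 3))     # relatively even PMFs
print(np.round(sparse, 3))   # most of the probability mass on one or two terms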
Anderson, C. (2008). The Long Tail: Why the Future of Business is Selling
Less of More. Hyperion.
Asuncion, A., Welling, M., Smyth, P., and Teh, Y. W. (2009). On smooth-
ing and inference for topic models. In Proceedings of the Twenty-Fifth
Conference on Uncertainty in Artificial Intelligence, UAI ’09, pages 27–
34. AUAI Press.
Belobaba, P. (1987). Air travel demand and airline seat inventory man-
agement. Technical report, Cambridge, MA: Flight Transportation
Laboratory, Massachusetts Institute of Technology.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet alloca-
tion. J. Mach. Learn. Res., 3:993–1022.
Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N.,
and Hullender, G. (2005). Learning to rank using gradient descent.
In Proceedings of the 22nd International Conference on Machine Learning,
ICML ’05.
Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., and Li, H. (2007). Learning to
rank: From pairwise approach to listwise approach. In Proceedings of
the 24th International Conference on Machine Learning, pages 129–136.
ACM.
Crocker, C., Kulick, A., and Ram, B. (2012). Real user monitoring at
Walmart.
Dalessandro, B., Perlich, C., Hook, R., Stitelman, O., Raeder, T., and
Provost, F. (2012a). Bid Optimizing and Inventory Scoring in Targeted
Online Advertising.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harsh-
man, R. (1990). Indexing by latent semantic analysis. Journal of the
American Society for Information Science, 41(6).
Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. (2003). An efficient
boosting algorithm for combining preferences. J. Mach. Learn. Res.,
4:933–969.
Frigyik, B., Kapila, A., and Gupta, M. (2010). Introduction to the Dirich-
let distribution and related processes. Technical report, University
of Washington.
Ghani, R., Probst, K., Liu, Y., Krema, M., and Fano, A. (2006). Text min-
ing for product attribute extraction. SIGKDD Explor. Newsl., pages
41–48.
Goldberg, D., Nichols, D., Oki, B. M., and Terry, D. (1992). Using col-
laborative filtering to weave an information tapestry. Commun. ACM,
35(12):61–70.
Herbrich, R., Graepel, T., and Obermayer, K. (2000). Large margin rank
boundaries for ordinal regression. In Advances in Large Margin Clas-
sifiers, pages 115–132. MIT Press.
Hu, Y., Koren, Y., and Volinsky, C. (2008). Collaborative filtering for
implicit feedback datasets. In Proceedings of the 2008 Eighth IEEE In-
ternational Conference on Data Mining, ICDM ’08, pages 263–272. IEEE
Computer Society.
Jack, K., Ingold, E., and Hristakeva, M. (2016). Mendeley suggest archi-
tecture.
Johnson, J., Tellis, G. J., and Ip, E. H. (2013). To Whom, When, and How
Much to Discount? A Constrained Optimization of Customized Temporal
Discounts, volume 89. Elsevier.
Ju, C., Bao, F., Xu, C., and Fu, X. (2015). A novel method of interest-
ingness measures for association rules mining based on profit. In
Discrete Dynamics in Nature and Society.
Kane, K., Lo, V. S., and Zheng, J. X. (2014). Mining for the truly respon-
sive customers and prospects using true-lift modeling: Comparison
of new and existing methods. Journal of Marketing Analytics, 2(4):218–
238.
Li, P., Burges, C. J. C., and Wu, Q. (2007). Mcrank: Learning to rank
using multiple classification and gradient boosting. In NIPS, pages
897–904. Curran Associates, Inc.
Lo, V. S. (2002). The true lift model: a novel data mining approach to re-
sponse modeling in database marketing. ACM SIGKDD Explorations
Newsletter, 4(2):78–86.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient esti-
mation of word representations in vector space.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b).
Distributed representations of words and phrases and their compo-
sitionality. In Advances in Neural Information Processing Systems 26.
Curran Associates, Inc.
Mobasher, B., Dai, H., Luo, T., and Nakagawa, M. (2001). Effective
personalization based on association rule discovery from web usage
data. In Proceedings of the 3rd International Workshop on Web Informa-
tion and Data Management, WIDM ’01, pages 9–15, New York, NY,
USA. ACM.
Musalem, A., Olivares, M., Bradlow, E. T., Terwiesch, C., and Corsten,
D. (2010). Structural estimation of the effect of out-of-stocks. Man-
agement Science, 56(7):1180–1197.
Perlich, C., Dalessandro, B., Raeder, T., Stitelman, O., and Provost,
F. (2013). Machine Learning for Targeted Display Advertising: Transfer
Learning in Action.
Rodriguez, M., Posse, C., and Zhang, E. (2012). Multiple objective op-
timization in recommender systems. In Proceedings of the Sixth ACM
Conference on Recommender Systems, RecSys ’12, pages 11–18, New
York, NY, USA. ACM.
Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. (2001). Item-based
collaborative filtering recommendation algorithms. In Proceedings of
the 10th International Conference on World Wide Web, pages 285–295.
ACM.
Sill, J., Takacs, G., Mackey, L. W., and Lin, D. (2009). Feature-weighted
linear stacking.
Su, X., Khoshgoftaar, T. M., Zhu, X., and Greiner, R. (2008). Imputation-
boosted collaborative filtering using machine learning classifiers. In
Proceedings of the 2008 ACM Symposium on Applied Computing, SAC
’08, pages 949–950, New York, NY, USA. ACM.
Talluri, K. and Van Ryzin, G. (2004). The Theory and Practice of Revenue
Management. International Series in Operations Research & Manage-
ment Science. Springer.
Töscher, A., Jahrer, M., and Bell, R. M. (2009). The BigChaos solution to
the Netflix Grand Prize.
Vasigh, B., Tacker, T., and Fleming, M. (2013). Introduction to Air Trans-
port Economics: From Theory to Applications.
Vulcano, G. J., van Ryzin, G. J., and Ratliff, R. (2012). Estimating pri-
mary demand for substitutable products from sales transaction data.
Operations Research, 60(2):313–334.
Walker, R. (2009). The Song Decoders. The New York Times Magazine.
Xia, Z., Dong, Y., and Xing, G. (2006). Support vector machines for
collaborative filtering. In Proceedings of the 44th Annual Southeast Re-
gional Conference, ACM-SE 44, pages 169–174, New York, NY, USA.
ACM.
Zhang, A., Goyal, A., Kong, W., Deng, H., Dong, A., Chang, Y., Gunter,
C. A., and Han, J. (2015). AdaQAC: Adaptive query auto-completion
via implicit negative feedback. In Proceedings of the 38th International
ACM SIGIR Conference on Research and Development in Information Re-
trieval, pages 143–152. ACM.
Zhang, S., Wang, W., Ford, J., and Makedon, F. (2006). Learning from
incomplete ratings using non-negative matrix factorization. In Proc.
of the 6th SIAM Conference on Data Mining, pages 549–553.