
Leakage in Data Mining:

Formulation, Detection, and Avoidance


Shachar Kaufman, School of Electrical Engineering, Tel-Aviv University, 69978 Tel-Aviv, Israel ([email protected])
Saharon Rosset, School of Mathematical Sciences, Tel-Aviv University, 69978 Tel-Aviv, Israel ([email protected])
Claudia Perlich, Media6Degrees, 37 East 18th Street, 9th floor, New York, NY 10003 ([email protected])
ABSTRACT

Deemed "one of the top ten data mining mistakes", leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competitions held recently, such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge, are evidence that this issue is as relevant today as it has ever been. While acknowledging the importance and prevalence of leakage in both synthetic competitions and real-life data mining projects, existing literature has largely left this idea unexplored. What little has been said turns out not to be broad enough to cover more complex cases of leakage, such as those where the classical i.i.d. assumption is violated, that have been recently documented. In our new approach, these cases and others are explained by explicitly defining modeling goals and analyzing the broader framework of the data mining problem. The resulting definition enables us to derive general methodology for dealing with the issue. We show that it is possible to avoid leakage with a simple specific approach to data management followed by what we call a learn-predict separation, and present several ways of detecting leakage when the modeler has no control over how the data have been collected.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications – Data mining. I.5.2 [Pattern Recognition]: Design Methodology – Classifier design and evaluation.

General Terms
Theory, Algorithms.

Keywords
Data mining, Leakage, Statistical inference, Predictive modeling.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'11, August 21-24, 2011, San Diego, California, USA. Copyright 2011 ACM 978-1-4503-0813-7/11/08…$10.00.

1. INTRODUCTION

Deemed "one of the top ten data mining mistakes" [7], leakage in data mining (henceforth, leakage) is essentially the introduction of information about the target of a data mining problem, which should not be legitimately available to mine from. A trivial example of leakage would be a model that uses the target itself as an input, thus concluding for example that "it rains on rainy days". In practice, the introduction of this illegitimate information is unintentional, and facilitated by the data collection, aggregation and preparation process. It is usually subtle and indirect, making it very hard to detect and eliminate. Leakage is undesirable as it may lead a modeler, someone trying to solve the problem, to learn a suboptimal solution, which would in fact be outperformed in deployment by a leakage-free model that could have otherwise been built. At the very least, leakage leads to overestimation of the model's performance. A client for whom the modeling is undertaken is likely to discover the sad truth about the model when performance in deployment is found to be systematically worse than the estimate promised by the modeler. Even then, identifying leakage as the reason might be highly nontrivial.

Existing literature, which we survey in Section 2, mentions leakage and acknowledges its importance and prevalence in both synthetic competitions and real-life data mining projects [e.g. 2, 7]. However, these discussions lack several key ingredients. First, they do not present a general and clear theory of what constitutes leakage. Second, these sources do not suggest practical methodologies for leakage detection and avoidance that modelers could apply to their own statistical inference problems. This gap in theory and methodology could be the reason that several major data mining competitions held recently, such as KDD-Cup 2008 or the INFORMS 2010 Data Mining Challenge, though judiciously organized by capable individuals, suffered from severe leakage. In many cases, attempts to fix leakage resulted in the introduction of new leakage which is even harder to deal with. Other competitions such as KDD-Cup 2007 and the IJCNN 2011 Social Network Challenge were affected by a second form of leakage which is specific to competitions: leakage from available external sources undermined the organizers' implicit true goal of encouraging submissions that would actually be useful for the domain. These cases, in addition to our own experience with leakage in the industry and as competitors in and organizers of data mining challenges, are examined in more detail in Section 2. We revisit them in later sections to provide a more concrete setting for our discussion.

The major contribution of this paper, aside from raising awareness to an important issue which we believe is often overlooked, is a proposal in Section 3 for a formal definition of leakage. This definition covers both the common case of leaking features and more complex scenarios that have been encountered in predictive modeling competitions. We use this formulation to facilitate leakage avoidance in Section 4, and suggest in Section 5 methodology for detecting leakage when we have limited or no

control over how the data have been collected. This methodology should be particularly useful for practitioners in predictive modeling problems, as well as for prospective competition organizers.

2. LEAKAGE IN THE KDD LITERATURE

The subject of leakage has been visited by several data mining textbooks as well as a few papers. Most of the papers we refer to are related to KDD-Cup competitions, probably due to authors of works outside of competitions locating and fixing leakage issues without reporting the process. We shall give a short chronological review here while collecting examples to be used later as case studies for our proposed definition of leakage.

Pyle [9, 10, 11] refers to the phenomenon which we call here leakage, in the context of predictive modeling, as anachronisms (something that is out of place in time), and says that "too good to be true" performance is "a dead giveaway" of its existence. The author suggests turning to exploratory data analysis in order to find and eliminate leakage sources, which we will also discuss in Section 5. Nisbet et al. [7] refer to the issue as "leaks from the future" and claim it is "one of the top 10 data mining mistakes". They repeat the same basic insights, but also do not suggest a general definition or methodology to correct and prevent leakage. These titles provide a handful of elementary but common examples of leakage. Two representative ones are: (i) an "account number" feature, for the problem of predicting whether a potential customer would open an account at a bank. Obviously, assignment of such an account number is only done after an account has been opened. (ii) An "interviewer name" feature, in a cellular company churn prediction problem. While the information "who interviewed the client when they churned" appears innocent enough, it turns out that a specific salesperson was assigned to take over cases where customers had already notified they intend to churn.

Kohavi et al. [2] describe the introduction of leaks in data mining competitions as giveaway attributes that predict the target because they are downstream in the data collection process. The authors give an example in the domain of retail website data analytics where for each page viewed the prediction target is whether the user would leave or stay to view another page. A leaking attribute is the "session length", the total number of pages viewed by the user during this visit to the website. This attribute is added to each page-view record at the end of the session. A solution is to replace this attribute with "page number in session", which describes the session length up to the current page, where prediction is required.

Subsequent work by Kohavi et al. [3] presents the common business analysis problem of characterizing big spenders among customers. The authors explain that this problem is prone to leakage since immediate triggers of the target (e.g. a large purchase or purchase of a diamond) or consequences of the target (e.g. paying a lot of tax) are usually available in collected data and need to be manually identified and removed. To show how correcting for leakage can become an involved process, the authors also discuss the more complex situation where removing the information "total purchase in jewelry" caused the information "no purchases in any department" to become fictitiously predictive. This is because each customer found in the database is there in the first place due to some purchase, and if this purchase is not in any department (still available), it has to be jewelry (which has been removed). They suggest defining analytical questions that should suffer less from leaks – such as characterizing a "migrator" (a user who is a light spender but will become a heavy one) instead of characterizing the "heavy spender". The idea is that it is better to ask analytical questions that have a clear temporal cause-and-effect structure. Of course leaks are still possible, but much harder to introduce by accident and much easier to identify. We return to this idea in Section 3. A later paper by the authors [4] reiterates the previous discussion, and adds the example of the "use of free shipping", where a leak is introduced when free shipping is provided as a special offer with large purchases.

Rosset et al. [11] discuss leakage encountered in the 2007 KDD-Cup competition. In that year's contest there were two related challenges concerning movie viewers' reviews from the famous Netflix database. The first challenge, "Who Reviewed What", was to predict whether each user would give a review for each title in 2006, given data up to 2005. The second challenge, "How Many Reviews", was to predict the number of reviews each title would receive in 2006, also using data given up to 2005. For the first challenge, a test set with actual reviews from 2006 was provided. Although disjoint sets of titles were used to construct the data sets for these two challenges, Rosset et al.'s winning submission managed to use the test set for the first problem as the target in a supervised-learning modeling approach for the second problem. This was possible due to a combination of two facts. First, up to a scaling factor and noise, the expected number of user/review pairs in the first problem's test set in which a title appears is equal to the total number of reviews which that title received in 2006. This is exactly the target for the second problem, only on different titles. Second, the titles are similar enough to share statistical properties, so from the available dynamics for the first group of titles one can infer the dynamics of the second group's. We shall revisit this complex example in Section 3, where this case will motivate us to extend our definition of leakage beyond leaking features.

Two medical data mining contests held the following year, which also exhibited leakage, are discussed in [7, 13]. KDD-Cup 2008 dealt with cancer detection from mammography data. Analyzing the data for this competition, the authors point out that the "Patient ID" feature (ignored by most competitors) has tremendous and unexpected predictive power. They hypothesize that multiple clinical study, institution or equipment sources were used to compile the data, and that some of these sources were assigned their population with prior knowledge of the patient's condition. Leakage was thus facilitated by assigning consecutive patient IDs to data from each source, that is, the merge was done without obfuscating the source. The INFORMS Data Mining Challenge 2008 competition, held the same year, addressed the problem of pneumonia diagnosis based on patient information from hospital records. The target was originally embedded as a special value of one or more features in the data given to competitors. The organizers removed these values; however, it was possible to identify traces of such removal, constituting the source of leakage in this example (e.g. a record with all condition codes missing, similarly to Kohavi's jewelry example).

Also in the recent work by Rosset et al. [13], the concept of identifying and harnessing leakage has been openly addressed as one of three key aspects for winning data mining competitions. This work provides the intuitive definition of leakage as "the unintentional introduction of predictive information about the target by the data collection, aggregation and preparation process". The authors mention that leakage might be the cause of many failures of data mining applications, and give the illustrative example of predicting people who are likely to be sick by looking at how

many work days they would end up missing. They also describe a real-life business intelligence project at IBM where potential customers for certain products were identified, among other things, based on keywords found on their websites. This turned out to be leakage since the website content used for training had been sampled at the point in time where the potential customer had already become a customer, and where the website contained traces of the IBM products purchased, such as the word "Websphere" (e.g. in a press release about the purchase or a specific product feature the client uses).

The latest INFORMS and IJCNN competitions, held in late 2010 and early 2011, are fresh examples of how leakage continues to plague predictive modeling problems and competitions in particular. The INFORMS 2010 Data Mining Challenge required participants to develop a model that predicts stock price movements, over a fixed one-hour horizon, at five-minute intervals. Competitors were provided with intraday trading data showing stock prices, sectoral data, economic data, experts' predictions and indices. The data were segmented into a training database, on which participants were expected to build their predictive models, and a test database which was used by the organizers to evaluate submissions. The surprising results were that about 30 participating groups achieved more than 0.9 AUC, with the best model surpassing 0.99 AUC. Had these models been legitimate they would have indeed made a "big impact on the finance industry" as the organizers had hoped, not to mention making their operators very wealthy individuals. Unfortunately, however, it became clear that although some steps had been taken to prevent competitors from "looking up the answers" (the underlying target stock's identity was not revealed, and the test set did not include the variable being predicted), it was still possible to build models that rely on data from the future. Having data from the future for the explanatory variables, some of which are highly cointegrated with the target (e.g. a second stock within the same sector as the target stock), and having access to publicly available stock data such as Yahoo/Google Finance (which allows finding at least good candidates for the identity of the target stock, consequently revealing all test values) was the true driver of success for these models. The organizers held two rankings of competitors, one where future information was allowed and another where it was forbidden; however, in the end they had to admit that verifying that future information was not used was impossible, and that it was probable that all models were tainted, as all modelers had been exposed to the test set.

The IJCNN 2011 Social Network Challenge presented participants with 7,237,983 anonymized edges from an undisclosed online social network and asked them to predict which of an additional set of 8,960 potential edges are in fact realized on the network as well. The winners have recently reported [3] that they had been able to recognize, through sophisticated analysis, that the social network in question was Flickr, and then to de-anonymize the majority of the data. This allowed them to use edges available from the online Flickr network to correctly predict over 60% of the edges which were identified, while the rest had to be handled classically using legitimate prediction. Similarly to other cases that have been mentioned, these rogue solutions are sometimes so elegant and insightful that they carry merit in their own right. The problem is that they do not answer the original question presented by the organizers.

Clearly, then, the issue of leakage has been observed in various contexts and problem domains, with a natural focus on predictive modeling. However, none of the discussions that we could find has addressed the issue in a general way, or suggested methodology for handling it. In the following section we make our attempt to derive a definition of leakage.

3. FORMULATION

3.1 Preliminaries and Legitimacy

In our discussion of leakage we shall define the roles of client and modeler as in Section 1, and consider the standard statistical inference framework of supervised learning and its generalizations, where we can discuss examples, targets and features. We assume the reader is familiar with these concepts. For a complete reference see [1]. Let us just lay out our notation and say that in our framework we receive from an axiomatic data preparation stage a multivariate random process $\mathcal{W} = (X, y)$. $y$ is the outcome or target generating process with samples $y_i$, target instances. Values or realizations of the random variable $y_i$ are denoted $\mathbf{y}_i$ (in bold). Similarly, $X$, $X_i$ and $\mathbf{x}_i$ are the feature-vector generating process, an instance and a realization. For individual feature generating processes, instances and realizations we use $x_j$, $x_{j,i}$ and $\mathbf{x}_{j,i}$. Specific instances $X_i$ and $y_i$ taken from the same instance of $\mathcal{W}$ are said to be $\mathcal{W}$-related. The modeler's goal is to statistically infer a target instance, $y_i$, from its associated feature-vector instance $X_i$ in $\mathcal{W}$ and from a separate group of samples of $\mathcal{W}$, called the training examples $\mathcal{W}_{tr}$. The solution to this problem is a model $\hat{y}_i = f(X_i, \mathcal{W}_{tr})$. We say that the model's observational inputs for predicting $y_i$ are $X_i$ and $\mathcal{W}_{tr}$, and this relation between the various elements in the framework is the base for our discussion.

Models containing leaks are a subclass of the broader concept of illegitimate or unacceptable models. At this level, legitimacy, which is a key concept in our formulation of leakage, is completely abstract. Every modeling problem sets its own rules for what constitutes a legitimate or acceptable solution, and different problems, even if using the same data, may have wildly different views on legitimacy. For example, a solution could be considered illegitimate if it is too complex – say, if it uses too many features or if it is not linear in its features.

However, our focus here is on leakage, which is a specific form of illegitimacy that is an intrinsic property of the observational inputs of a model. This form of illegitimacy remains partly abstract, but could be further defined as follows: Let $y$ be some random variable. We say a second random variable $x$ is $y$-legitimate if $x$ is observable to the client for the purpose of inferring $y$. In this case we write $x \in \operatorname{legit}\{y\}$.

A fully concrete meaning of legitimacy is built in to any specific inference problem. The trivial legitimacy rule, going back to the first example of leakage given in Section 1, is that the target itself must never be used for inference:

$y \notin \operatorname{legit}\{y\}$.   (1)

We could use this rule if we wanted to disqualify the winning submission to the IJCNN 2011 Social Network Challenge, for it, however cleverly, eventually uses some of the targets themselves for inference. This condition should be abided by all problems, and we refrain from explicitly mentioning it for the remaining examples we shall discuss.

Naturally, a model contains leaks with respect to a target instance $y_i$ if one or more of its observational inputs are $y_i$-illegitimate. We say that the model inherits the illegitimacy property from the

features and training examples it uses. The discussion proceeds along these two possible sources of leakage for a model: features and training examples.

3.2 Leaking Features

We begin with the more common case of leaking features. First we must extend our abstract definition of legitimacy to the case of random processes: Let $y$ be some random process. We say a second random process $x$ is $y$-legitimate if, for every pair of instances of $x$ and $y$, $x_i$ and $y_i$ respectively, which are $\mathcal{W}$-related, $x_i$ is $y_i$-legitimate. We use the same notation as we did for random variables in 3.1, and write that $x \in \operatorname{legit}\{y\}$.

Leaking features are then covered by a simple condition for the absence of leakage:

$\forall x \in X, \; x \in \operatorname{legit}\{y\}$.   (2)

That is, any feature made available by the data preparation process is deemed legitimate by the precise formulation of the modeling problem at hand, instance by instance w.r.t. its matching target.

The prevailing example for this type of leakage is what we call the no-time-machine requirement. In the context of predictive modeling, it is implicitly required that a legitimate model only build on features with information from a time earlier (or sometimes, no later) than that of the target. Formally, $x$ and $y$, made scalar for the sake of simplicity, are random processes over some time axis (not necessarily physical time). Prediction is required by the client for the target process at times $t_y$, and the $\mathcal{W}$-related feature process is observable to the client at times $t_x$. We then have:

$\operatorname{legit}\{y\} \subseteq \{x : t_x < t_y\}$.   (3)

Such a rule should be read: any legitimate feature w.r.t. the target process is a member of the right-hand-side set of features. In this case the right-hand side is the set of all features whose every instance is observed earlier than its $\mathcal{W}$-related target instance. We are assuming with this notation that $\{x\}$ contains all possible features, and use "$\subseteq$" to express that additional legitimacy constraints might also apply (otherwise "$=$" could be used).

While the simple no-time-machine requirement is indeed the most common case, one could think of additional scenarios which are still covered by condition (2). A simple extension is to require features to be observable a sufficient period of time $\Delta$ prior to $t_y$, as in (4) below, in order to preclude any information that is an immediate trigger of the target. One reason why this might be necessary is that sometimes it is too limiting to think of the target as pertaining to a point in time rather than to a rough interval. Using data observable close to $t_y$ makes the problem uninteresting. Such is the case for the "heavy spender" example from [3]. With legitimacy defined as (3) (or as (4) when $\Delta = 0$), a model may be built that uses the purchase of a diamond to conclude that the customer is a big spender, but with sufficiently large $\Delta$ this is not allowed. This transforms the problem from identification of "heavy spenders" to the suggested identification of "migrators".

$\operatorname{legit}\{y\} \subseteq \{x : t_x < t_y - \Delta\}$.   (4)

Another example, using the same random process notation, is a memory limitation, where a model may not use information older than a time $\Delta$ relative to that of the target:

$\operatorname{legit}\{y\} \subseteq \{x : t_y - \Delta < t_x\}$.   (5)

We can think of a requirement to use exactly $n$ features from a specified pool $\mathcal{F}$ of preselected features:

$\operatorname{legit}\{y\} \subseteq \mathcal{F}$,   (6)

and so on. In fact, there is a variant of example (6) which is very common: only the features selected for a specific provided dataset are considered legitimate. Sometimes this rule allows free use of the entire set:

$\operatorname{legit}\{y\} = \{x : x \in X\}$.   (7)

Usually, however, this rule is combined with (3) to give:

$\operatorname{legit}\{y\} = \{x : x \in X, \; t_x < t_y\}$.   (8)

Most documented cases of leakage mentioned in Section 2 are covered by condition (2) in conjunction with a no-time-machine requirement as in (3). For instance, in the trivial example of predicting rainy days, the target is an illegitimate feature since its value is not observable to the client when the prediction is required (say, the previous day). As another example, the pneumonia detection database in the INFORMS 2008 challenge discussed in [8, 13] implies that a certain combination of missing diagnosis code and some other features is highly informative of the target. However, this feature is illegitimate, as the patient's condition is still being studied.

It is easy to see how conditions (2) and (3) similarly apply to the account number and interviewer name examples from [10], the session length of [2] (while the corrected "page number in session" is fine), the immediate and indirect triggers described in [3, 4], the remaining competitions described in [8, 13], and the website-based features used by IBM and discussed in [13]. However, not all examples fall under condition (2).

Let us examine the case mentioned earlier of KDD-Cup 2007 as discussed in [11]. While clearly taking advantage of information from reviews given to titles during 2006 (the mere fact of using data from the future is proof, but we can also see it in action by the presence of measurable leakage – the fact that this model performed significantly better both in internal tests and the final competition), the final delivered model does not include any illegitimate feature¹. To understand what has transpired, we must address the issue of leakage in training examples.

¹ In fact the use of external sources that are not rolled back to 2005, such as using current (2007) IMDB data, is simple leakage just like in the IBM example. However, this is not the major source of leakage in this example.

3.3 Leakage in Training Examples

Let us first consider the following synthetic but illustrative example. Suppose we are trying to predict the level of a white noise process $X_t$ for $t \in \{1, \dots, n\}$, clearly a hopeless task. Suppose further that for the purpose of predicting $X_t$, $t$ itself is a legitimate feature but otherwise, as in (3), only past information is deemed legitimate – so obviously we cannot cheat. Now consider a model trained on examples taken from the same range $t \in \{1, \dots, n\}$. The proposed model is $f(t) = \mathbf{x}_t$, a table containing for each $t$ the target's realized value $\mathbf{x}_t$. Strictly speaking, the only

feature used by this model, $t$, is legitimate. Hence the model has no leakage as defined by condition (2); however, it clearly has perfect prediction performance for the evaluation set in the example. We would naturally like to capture this case under a complete definition of leakage for this problem.

In order to tackle this case, we suggest adding to (2) the following condition for the absence of leakage: for all $y_i \in \mathcal{Y}_{ev}$,

$\{\mathcal{Y}_{tr}, \mathcal{X}_{tr}\} \subseteq \operatorname{legit}\{y_i\}$,   (9)

where $\mathcal{Y}_{ev}$² is the set of evaluation target instances, and $\mathcal{Y}_{tr}$ and $\mathcal{X}_{tr}$ are the sets of training targets and feature-vectors respectively whose realizations make up the set of training examples $\mathcal{W}_{tr}$.

² We use the term evaluation as it could play the classic role of either validation or testing.

One way of interpreting this condition is to think of the information presented for training as constant features embedded into the model, and added to every feature-vector instance the model is called to generate a prediction for.

For modeling problems where the usual i.i.d. instances assumption is valid, and when without loss of generality considering all information specific to the instance being predicted as features rather than examples, condition (9) simply reduces to condition (2), since irrelevant observations can always be considered legitimate. In contrast, when dealing with problems exhibiting non-stationarity, a.k.a. concept drift [15], and more specifically the case when samples of the target (or, within a Bayesian framework, the target/feature) are not mutually independent, condition (9) cannot be reduced to condition (2). Such is the case of KDD-Cup 2007. Available information about the number of reviews given to a group of titles for the "who reviewed what" task is not statistically independent of the number of reviews given to the second group of titles, which is the target in the "how many ratings" task. The reason for this is that these reviews are all given by the same population of users over the same period in 2006, and thus are mutually affected by shared causal ancestors such as viewing and participation trends (e.g. promotions, similar media or events that get a lot of exposure, and so on). Without proper conditioning on these shared ancestors we have potential dependence, and because most of these ancestors are unobservable, and difficult to find observable proxies for, dependence is bound to occur.

3.4 Discussion

It is worth noting that leakage in training examples is not limited to the explicit use of illegitimate examples in the training process. A more dangerous way in which illegitimate examples may creep in and introduce leakage is through design decisions. Suppose for example that we have access to illegitimate data about the deployment population, but there is no evidence in training data to support this knowledge. This might prompt us to use a certain modeling approach that otherwise contains no leakage in training examples but is still illegitimate. Examples could be: (i) selecting or designing features that will have predictive power in deployment, but don't show this power on training examples, (ii) algorithm or parametric model selection, and (iii) meta-parameter value choices. This form of leakage is perhaps the most dangerous, as an evaluator may not be able to identify it even when she knows what she is looking for. The exact same design could have been brought on by theoretic rationale, in which case it would have been completely legitimate. In some domains such as time series prediction, where typically only a single history measuring the phenomenon of interest is available for analysis, this form of leakage is endemic and commonly known as data snooping / dredging [5].

Regarding concretization of legitimacy for a new problem: arguably, more often than not the modeler might find it very challenging to define, together with the client, a complete set of such legitimacy guidelines prior to any modeling work being undertaken, and specifically prior to performing preliminary evaluation. Nevertheless, it should usually be rather easy to provide a coarse definition of legitimacy for the problem, and a good place to start is to consider model use cases. The specification of any modeling problem is really incomplete without laying out these ground rules of what constitutes a legitimate model.

As a final point on legitimacy, let us mention that once it has been clearly defined for a problem, the major challenge becomes preparing the data in such a way that ensures models built on this data would be leakage free. Alternatively, when we do not have full control over data collection or when it is simply given to us, a methodology for detecting when a large number of seemingly innocent pieces of information are in fact plagued with leakage is required. This shall be the focus of the following two sections.

4. AVOIDANCE

4.1 Methodology

Our suggested methodology for avoiding leakage is a two-stage process of tagging every observation with legitimacy tags during collection and then observing what we call a learn-predict separation. We shall now describe these stages and then provide some examples.

At the most basic level, suitable for handling the more general case of leakage in training examples, legitimacy tags (or hints) are ancillary data attached to every pair of observational input instance $X_i$ and target instance $y_j$, sufficient for answering the question "is $X_i$ legitimate for inferring $y_j$?" under the problem's definition of legitimacy. With this tagged version of the database it is possible, for every example being studied, to roll back the state of

[Figure 1. An illustration of learn-predict separation: (a) a general separation; (b) time separation; (c) only targets are illegitimate.]

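To make these two stages concrete, here is a minimal sketch in plain Python for the common case, discussed below, where legitimacy tags are time-stamps. All field and function names are our own illustrative assumptions, not part of the formulation: features for a target are assembled only from observations tagged strictly before the target's time, and a cut in time separates training from evaluation targets.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    entity_id: int
    name: str
    value: float
    timestamp: int  # legitimacy tag: when this value became known

def features_for_target(observations, entity_id, target_time):
    """Build one example's features: only observations about this
    entity whose legitimacy tag precedes the target's time-stamp."""
    return {
        o.name: o.value
        for o in observations
        if o.entity_id == entity_id and o.timestamp < target_time
    }

def learn_predict_separation(targets, cut_time):
    """Split target instances at a point in time: train on targets
    observed up to the cut, evaluate only on later ones."""
    train = [t for t in targets if t["time"] <= cut_time]
    test = [t for t in targets if t["time"] > cut_time]
    return train, test
```

The strict inequality in `features_for_target` is the separation in miniature: anything time-stamped at or after the target is treated as illegitimate by construction.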
With this tagged version of the database it is possible, for every example being studied, to roll back the state of the world to a legitimate decision state, eliminating any confusion that may arise from considering only the original raw data.

In the learn-predict separation paradigm (illustrated in Figure 1), the modeler uses the raw but tagged data to construct training examples in such a way that (i) for each target instance, only those observational inputs which are purely legitimate for predicting it are included as features, and (ii) only observational inputs which are purely legitimate with respect to all evaluation targets may serve as examples. This way, by construction, we directly take care of the two types of leakage that make up our formulation: leakage in features (2) and leakage in training examples (9). To completely prevent leakage by design decisions, the modeler must be careful not to even be exposed to information beyond the separation point; for this we can only prescribe self-control.

As an example, in the common no-time-machine case where legitimacy is defined by (3), legitimacy tags are time-stamps with sufficient precision. Legitimacy tagging is implemented by time-stamping every observation. Learn-predict separation is implemented by a cut at some point in time that segments training from evaluation examples. This is what has been coined in [13] prediction about the future. Interestingly enough, this common case does not sit well with the equally common way databases are organized. Updates to database records are usually not time-stamped and not stored separately; at best, whole records end up with a single time-stamp. Records are then translated into examples, and this loss of information is often the source of all evil that allows leakage to find its way into predictive models.

The original data for the INFORMS 2008 Data Mining Challenge lacked proper time-stamping, causing observations taken before and after the target's time-stamp to end up as components of the same examples. This made time separation impossible, and models built on this data did not perform prediction about the future. On the other hand, the data for KDD-Cup 2007's "How Many Reviews" task was in itself (as far as we are aware) well time-stamped and separated: training data provided to competitors was sampled prior to 2006, while test data was sampled in and after 2006 and was not given. The fact that the training data exposed by the organizers for the separate "Who Reviewed What" task nevertheless contained leakage was due to an external source of leakage, an issue related to data mining competitions which we discuss next.

4.2 External Leakage in Competitions
Our account of leakage avoidance, especially in light of our recurring references to data mining competitions in this paper, would be incomplete without mentioning the case of external leakage. This happens when some data source, other than what is given by the client (organizer) for the purpose of performing inference, contains leakage and is accessible to modelers (competitors). Examples of this kind of leakage include the KDD-Cup 2007 "How Many Reviews" task, the INFORMS 2010 financial forecasting challenge, and the IJCNN 2011 Social Network Challenge.3

3 Although it is entirely possible that internal leakage was also present in these cases (e.g. forum discussions regarding the IJCNN 2011 competition on http://www.kaggle.com).

In these cases, it would seem that even a perfect application of the suggested avoidance methodology breaks down once the additional source of data is considered. Indeed, separation only prevents leakage from the data actually separated. The fact that other data are even considered is indeed a competition issue, or in some cases an issue of a project organized like a competition (i.e. projects within large organizations, or outsourced or government-issued projects). Sometimes this issue stems from a lack of an auditing process for submissions; most of the time, however, it is introduced to the playground on purpose.

Competition organizers, and some project clients, face a conflict of interests. On the one hand, they do not want competitors to cheat by using illegitimate data. On the other hand, they would welcome insightful competitors suggesting new sources of information. This is a common situation, but the two desires are often conflicting: when one admits not knowing which sources could be used, one also admits she cannot provide an airtight definition of what she accepts as legitimate. She may be able to say something about legitimacy in her problem, but would intentionally leave room for competitors to maneuver.

The solution to this conflict is to separate the task of suggesting broader legitimacy definitions for a problem from the modeling task that fixes the current understanding of legitimacy. Competitions should choose just one of these tasks, or hold two separate challenges: one to suggest better data, and one to predict with the given data only. The two tasks require different approaches to competition organization, a thorough account of which is beyond the scope of this paper. One approach for the first task that we will mention is live prediction. When the legitimacy definition for a data mining problem is isomorphic to the no-time-machine legitimacy definition (3) of predictive modeling, we can sometimes take advantage of the fact that a learn-predict separation over time is physically impossible to circumvent. We can then ask competitors to literally predict targets in the future (that is, at a time after the submission date) with whatever sources of data they think might be relevant, and they will not be able to cheat in this respect. For instance, the IJCNN Social Network Challenge could have asked competitors to predict new edges in the network graph a month in advance, instead of synthetically removing edges from an existing network, which left traces and the original online source for competitors to find.

5. DETECTION
Often the modeler does not have control over the data collection process. When the data are not properly tagged, the modeler cannot pursue a learn-predict separation as in the previous section. One important question is how to detect leakage when it is present in given data, as the ability to detect that there is a problem can help mitigate its effects. In the context of our formulation from Section 3, detecting leakage boils down to pointing out how conditions (2) or (9) fail to hold for the dataset in question. A brute-force solution to this task is often infeasible because datasets will always be too large. We propose the following methods for filtering leakage candidates.

Exploratory data analysis (EDA) can be a powerful tool for identifying leakage. EDA [14] is the good practice of getting more intimate with the raw data, examining them through basic and interpretable visualization or statistical tools. Prejudice-free and methodical, this kind of examination can expose leakage as patterns in the data that are surprising. In the INFORMS 2008 breast cancer example, for instance, the fact that the "patient id" is so strongly correlated with the target is surprising, if we expect ids to be assigned with little or no knowledge of the patient's diagnosis, for instance on an arrival-time basis.
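One way such an EDA screen might be partially automated is sketched below (plain Python, synthetic data; the 0.9 threshold and the feature names are arbitrary choices of ours, not a prescription): rank every candidate feature by the absolute Pearson correlation of its values with the target, and flag implausibly strong associations, such as an ID, for closer inspection.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation, written out to stay dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def leakage_candidates(features, target, threshold=0.9):
    """Rank features by |correlation| with the target and return the
    suspiciously strong ones, strongest first, for manual review."""
    scored = [(name, abs(pearson(values, target)))
              for name, values in features.items()]
    return sorted((s for s in scored if s[1] >= threshold),
                  key=lambda s: -s[1])

# Synthetic data: an "id" that tracks the label is exactly the kind of
# surprising pattern this screen should surface.
features = {
    "patient_id": [1, 2, 3, 4, 101, 102, 103, 104],
    "age": [30, 62, 45, 50, 33, 60, 41, 55],
}
target = [0, 0, 0, 0, 1, 1, 1, 1]
flagged = leakage_candidates(features, target)
```

A flagged feature is only a candidate: as noted above, a surprisingly strong predictor can still turn out to be legitimate on closer inspection.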
Of course, some surprising facts revealed by the data through basic analysis could be legitimate. For the same breast cancer example, it might be the case that family doctors direct their patients to specific diagnosis paths (which issue patient IDs) based on their initial diagnosis, which is a legitimate piece of information. Generally, however, as most worthy problems are highly nontrivial, it is reasonable that only a few surprising candidates would require closer examination to validate their legitimacy.

Initial EDA is not the only stage of modeling where surprising behavior can expose leakage. The "IBM Websphere" example discussed in Section 1 shows how the surprising behavior of a feature in the fitted model, in this case a high entropy value (the word "Websphere"), becomes apparent only after the model has been built. Another approach related to critical examination of modeling results comes from observing surprising overall model performance. In many cases we can come to expect, from our own experience or from prior or competing documented results, a certain level of performance for the problem at hand. A substantial divergence from this expected performance is surprising, and merits testing more closely, for legitimacy, the most informative observations the model is based on. The results of many participants in the INFORMS 2010 financial forecasting challenge are an example of this case, because they contradict prior evidence about the efficiency of the stock market.

Finally, perhaps the best approach, but possibly also the most expensive to implement, is early in-the-field testing of initial models. Any substantial leakage would be reflected as a difference between estimated and realized out-of-sample performance. However, this is in fact a sanity check of the model's generalization capability, and while it would work well in many cases, other issues can make it challenging or even impossible to isolate leakage as the cause of such a performance discrepancy: classical over-fitting, tangible concept drift, issues with the design of the field test such as sampling bias, and so on.

A fundamental problem with the methods for leakage detection suggested in this section is that they all require some degree of domain knowledge: for EDA one needs to know whether a good predictor is reasonable; comparison of model performance to alternative or prior state-of-the-art models requires knowledge of previous results; and the setup for early in-the-field evaluation is obviously very involved. The fact that these methods still rely on domain knowledge places an emphasis on leakage avoidance during data collection, where we have more control over the data.

6. (NOT) FIXING LEAKAGE
Once we have detected leakage, what should we do about it? In the best-case scenario, one might be able to take a step back, get access to raw data with intact legitimacy tags, and use a learn-predict separation to reconstruct a leakage-free version of the problem. The second-best scenario happens when intact data are not available but the modeler can afford to fix the data collection process and postpone the project until leakage-free data become available. In the final scenario, one just has to make do with what is available.

Because of structural constraints at work, leakage can be somewhat localized in samples. This is true in both the INFORMS 2008 and INFORMS 2009 competitions mentioned above, and also in the IBM Websphere example. When the model is used in the field, by definition all observations are legitimate and there can be no active leaks. So to the extent that most training examples are also leakage-free, the model may perform worse in deployment than in the pilot evaluation, but would still be better than random guessing and possibly competitive with models built with no leakage. This is good news, as it means that, for some problems, living with leakage without attempting to fix it could work.

What happens when we do try to fix leakage? Without explicit legitimacy tags in the data, it is often impossible to determine the legitimacy of specific observations and/or features even when it is obvious that leakage has occurred. It may be possible to partly plug the leak but not to seal it completely, and it is not uncommon that an attempt to fix leakage only makes it worse.

Usually, where there is one leaking feature, there are more. Removing the "obvious" leaks that are detected may exacerbate the effect of undetected ones. In the e-commerce example from [4], one might envision simply removing the obvious "free shipping" field; however, this kind of feature removal succeeds in completely eradicating leaks only in very few and simple scenarios. In particular, in this example one is still left with the "no purchase in any department" signature. Another example of this is the KDD-Cup 2008 breast cancer prediction competition, where the patient ID contained an obvious leak. It is by no means obvious, however, that removing this feature would leave a leakage-free dataset. Assuming different ID ranges correspond to different health care facilities (in different geographical locations, with different equipment), there may be additional traces of this in the data. If, for instance, the imaging equipment's grey scale is slightly different, and in particular grey levels are higher in the location with the high cancer rate, a model without the ID could pick up this leaking signal from the remaining data, and the performance estimate would still be optimistic (the winners show evidence of this in their report [8]).

Similar arguments can be made about the feature modification performed in INFORMS 2008 in an attempt to plug obvious leaks, which clearly created others, and about the instance removal in the organization of INFORMS 2009, which also left some unintended traces [16].

In summary, further research into general methodology for leakage correction is indeed required. Lacking such methodology, our experience is that fully fixing leakage without learn-predict separation is typically very hard, perhaps impossible, and that modeling with the remaining leakage is often the preferred alternative to futile leakage removal efforts.

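A small synthetic simulation (entirely our own construction, loosely inspired by the scenario above; all numbers are arbitrary) illustrates why removing the one obvious leaking feature may not seal the leak: the patient ID encodes the facility, the facility drives both the equipment's grey level and the cancer rate, and once the ID is dropped, a simple cut on grey level still recovers much of the illegitimate separation.

```python
import random

random.seed(0)
rows = []
for i in range(1000):
    facility = 0 if i < 500 else 1               # hidden in the ID range
    grey = random.gauss(100 + 20 * facility, 5)  # equipment differs
    cancer = 1 if random.random() < 0.05 + 0.40 * facility else 0
    rows.append({"patient_id": facility * 100000 + i,
                 "grey": grey, "cancer": cancer})

def cancer_rates(rows, feature, cut):
    """Cancer rate below vs. at-or-above a cut on a single feature."""
    lo = [r["cancer"] for r in rows if r[feature] < cut]
    hi = [r["cancer"] for r in rows if r[feature] >= cut]
    return sum(lo) / len(lo), sum(hi) / len(hi)

# Remove the obvious leak, as an organizer might.
for r in rows:
    del r["patient_id"]

# The proxy remains: a grey-level cut still separates the low- and
# high-risk facilities almost as cleanly as the ID did.
lo_rate, hi_rate = cancer_rates(rows, "grey", 110)
```

In this toy setting the remaining leak is easy to see because we built it in; in real data, as the winners' report [8] suggests, such traces can be far subtler.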
7. CONCLUSION
It should be clear by now that modeling with leakage is undesirable on many levels: it is a source of poor generalization and over-estimation of expected performance. The rich set of examples from diverse data mining domains given throughout this paper, added to our own experience, suggests that in the absence of a methodology for handling it, leakage could be the cause of many failures of data mining applications.

In this paper we have described leakage as an abstract property of the relationship between observational inputs and target instances, and showed how it can be made concrete for various problems. In light of this formulation, an approach for preventing leakage during data collection was presented that adds legitimacy tags to each observation. Also suggested were three ways of zooming in on potentially leaking features: EDA, ex-post analysis of modeling results, and early field testing. Finally, problems with fixing leakage were discussed as an area where further research is required.

Many cases of leakage happen when, in selecting the target variable from an existing dataset, the modeler neglects to consider the legitimacy definition imposed by this selection, which makes other related variables illegitimate (e.g. large purchases vs. free shipping). In other cases, the modeler is well aware of the implications of this selection, but falters when facing the tradeoff between removing potentially important predictive information and ensuring no leakage. Most instances of internal leakage in competitions were in fact of this nature, and were created by the organizers despite best attempts to avoid them.

We hope that the case studies and methodology described in this paper can help save projects and competitions from falling into the leakage trap, and allow them to encourage models and modeling approaches that will be relevant in their domains.

8. REFERENCES
[1] Hastie, T., Tibshirani, R. and Friedman, J. H. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition. Springer.
[2] Kohavi, R., Brodley, C., Frasca, B., Mason, L. and Zheng, Z. 2000. KDD-Cup 2000 organizers' report: peeling the onion. ACM SIGKDD Explorations Newsletter. 2(2).
[3] Kohavi, R. and Parekh, R. 2003. Ten supplementary analyses to improve e-commerce web sites. In Proceedings of the Fifth WEBKDD Workshop.
[4] Kohavi, R., Mason, L., Parekh, R. and Zheng, Z. 2004. Lessons and challenges from mining retail e-commerce data. Machine Learning. 57(1-2).
[5] Lo, A. W. and MacKinlay, A. C. 1990. Data-snooping biases in tests of financial asset pricing models. Review of Financial Studies. 3(3) 431-467.
[6] Narayanan, A., Shi, E. and Rubinstein, B. 2011. Link prediction by de-anonymization: how we won the Kaggle Social Network Challenge. In Proceedings of the 2011 International Joint Conference on Neural Networks (IJCNN). Preprint.
[7] Nisbet, R., Elder, J. and Miner, G. 2009. Handbook of Statistical Analysis and Data Mining Applications. Academic Press.
[8] Perlich, C., Melville, P., Liu, Y., Swirszcz, G., Lawrence, R. and Rosset, S. 2008. Breast cancer identification: KDD Cup winner's report. SIGKDD Explorations Newsletter. 10(2) 39-42.
[9] Pyle, D. 1999. Data Preparation for Data Mining. Morgan Kaufmann Publishers.
[10] Pyle, D. 2003. Business Modeling and Data Mining. Morgan Kaufmann Publishers.
[11] Pyle, D. 2009. Data Mining: Know It All. Ch. 9. Morgan Kaufmann Publishers.
[12] Rosset, S., Perlich, C. and Liu, Y. 2007. Making the most of your data: KDD-Cup 2007 "How Many Ratings" winner's report. ACM SIGKDD Explorations Newsletter. 9(2).
[13] Rosset, S., Perlich, C., Swirszcz, G., Liu, Y. and Prem, M. 2010. Medical data mining: lessons from winning two competitions. Data Mining and Knowledge Discovery. 20(3) 439-468.
[14] Tukey, J. 1977. Exploratory Data Analysis. Addison-Wesley.
[15] Widmer, G. and Kubat, M. 1996. Learning in the presence of concept drift and hidden contexts. Machine Learning. 23(1).
[16] Xie, J. and Coggeshall, S. 2010. Prediction of transfers to tertiary care and hospital mortality: a gradient boosting decision tree approach. Statistical Analysis and Data Mining. 3: 253-258.