
An Abstract Interpretation-Based Data Leakage Static Analysis

Filip Drobnjaković¹, Pavle Subotić¹, and Caterina Urban²

¹ Microsoft, Serbia
² Inria & ENS | PSL, France

Abstract. Data leakage is a well-known problem in machine learning which occurs when the training and testing datasets are not independent. This phenomenon leads to overly optimistic accuracy estimates at training time, followed by a significant drop in performance when models are deployed in the real world. This can be dangerous, notably when models are used for risk prediction in high-stakes applications. In this paper, we propose an abstract interpretation-based static analysis to prove the absence of data leakage. We implemented it in the NBLyzer framework and we demonstrate its performance and precision on 2111 Jupyter notebooks from the Kaggle competition platform.

1 Introduction

As artificial intelligence (AI) continues its unprecedented impact on society, ensuring machine learning (ML) models are accurate is crucial. To this end, ML models need to be correctly trained and tested. This iterative task is typically performed within data science notebook environments [19,9]. A notable bug that can be introduced during this process is known as data leakage [18]. Data leakages have been identified as a pervasive problem by the data science community [10,11,17]. In a number of recent cases, data leakages crippled the performance of real-world risk prediction systems, with dangerous consequences in high-stakes applications such as child welfare [1] and healthcare [24].

Data leakages arise when dependent data is used to train and test a model. This can come in the form of overlapping data sets or, more insidiously, through library transformations that create indirect data dependencies.

Example 1 (Motivating Example). Consider the following excerpt of a data science notebook (based on 569.ipynb from our benchmarks, and written in the small language that we introduce in Section 3.3):

1  data = read("data.csv")
2  X_norm = normalize(X)
3  X_train = X_norm.select[[⌊0.025 · R_X_norm⌋ + 1, ..., R_X_norm]][]
4  X_test = X_norm.select[[0, ..., ⌊0.025 · R_X_norm⌋]][]
5  train(X_train)
6  test(X_test)

Line 1 reads data from a CSV file and line 2 normalizes it. Lines 3 and 4 split the data into training and testing segments (we write R_x for the number of data rows stored in x). Finally, line 5 trains a ML model, and line 6 tests its accuracy.

In this case, a data leakage is introduced because line 2 performs a normalization before lines 3 and 4 split the data into train and test data. This implicitly uses the mean over the entire dataset to perform the data transformation and, as a result, the train and test data are implicitly dependent on each other. In our experimental evaluation (cf. Section 5), we found that this is a common pattern for introducing a data leakage in several real-world notebooks. In the following, we say that the data resulting from the normalization done in line 2 is tainted.

Mainstream methods rely on detecting data leakages retroactively [11,18]. Given a suspicious result, e.g., an overly accurate model, data analysis methods are used to identify data dependencies. However, a reasonable result may avoid suspicion from a data scientist until the model is already deployed. This is a natural use case for static analysis to detect data leakages at development time.

In this paper, we propose a static analysis for proving the absence of data leakage in data-manipulating programs: it tracks the origin of data used for training and testing and verifies that they originate from disjoint and untainted data sources. In Example 1, our analysis identifies a data leakage since X_train and X_test originate from previously normalized data (despite being disjoint).

Our static analysis (cf. Section 4) is designed within the abstract interpretation framework [5]: it is derived through successive abstractions from the (sound and complete, but not computable) collecting program semantics (cf. Section 3). This formal development allows us to formally justify the soundness of the analysis (cf. Theorem 3), and to exactly pinpoint where it can lose precision (e.g., modeling data joins, cf. Section 4.2) to guide the design of more precise abstractions, if necessary in the future (in our evaluation we found the current analysis to be sufficiently precise, cf. Section 5). Moreover, it allows a clear comparison with other related static analyses, e.g., information flow and taint analyses (cf. Section 6). Finally, this design principle allowed us to identify and overcome issues and shortcomings of previous data leakage analysis attempts [21,20].

We implemented our analysis in the NBLyzer [21] framework. We evaluate its performance on 2111 Jupyter notebooks from the Kaggle competition platform, and demonstrate that our approach scales to the performance constraints of interactive data science notebook environments while detecting 25 real data leakages with a precision of 93%. Notably, we are able to detect 60% more data leakages compared to the ad-hoc analysis previously implemented in NBLyzer.

2 Background

2.1 Data Frame-Manipulating Programs

We consider programs manipulating data frames, that is, tabular data structures with columns labeled by non-empty unique names. Let V be a set of (heterogeneous atomic) values (i.e., such as numerical or string values). We can formalize a data frame as a possibly empty (r × c)-matrix of values, where r ∈ N and c ∈ N denote the number of matrix rows and columns, respectively. The first row of non-empty data frames contains the labels of the data frame columns. Let D ≝ ⋃_{r∈N} ⋃_{c∈N} V^{r×c} be the set of all possible data frames. Given a data frame D ∈ D, we use R_D and C_D to denote the number of its rows and columns, respectively, and write hdr(D) for the set of labels of its columns. We also write D[r] for the specific row indexed with r ∈ R_D in D.

2.2 Trace Semantics

The semantics of a data frame-manipulating program is a mathematical characterization of its behavior when executed for all possible input data. We model the operational semantics of a program in a language-independent way as a transition system ⟨Σ, τ⟩, where Σ is a (potentially infinite) set of program states and the transition relation τ ⊆ Σ × Σ describes the possible transitions between states. The set of final states of the program is Ω ≝ {s ∈ Σ | ∀s′ ∈ Σ: ⟨s, s′⟩ ∉ τ}.

In the following, let Σ^{+∞} ≝ Σ⁺ ∪ Σ^ω be the set of all non-empty finite or infinite sequences of program states. A trace is a non-empty sequence of program states that respects the transition relation τ, that is, ⟨s, s′⟩ ∈ τ for each pair of consecutive states s, s′ ∈ Σ in the sequence. The trace semantics Υ ∈ P(Σ^{+∞}) generated by a transition system ⟨Σ, τ⟩ is the union of all finite traces that are terminating with a final state in Ω, and all infinite traces [2]:

Υ ≝ ⋃_{n∈N⁺} {s₀…s_{n−1} ∈ Σⁿ | ∀i < n−1: ⟨s_i, s_{i+1}⟩ ∈ τ, s_{n−1} ∈ Ω} ∪ {s₀… ∈ Σ^ω | ∀i ∈ N: ⟨s_i, s_{i+1}⟩ ∈ τ}    (1)

In the rest of the paper, we write ⟦P⟧ for the trace semantics of a program P.
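As a small concrete illustration of Equation 1 (restricted to finite traces, and assuming an explicitly enumerated finite state space), the following sketch computes Ω and the finite traces of a toy transition system:

def final_states(states, tau):
    # Ω = states with no outgoing transition
    return {s for s in states if all((s, t) not in tau for t in states)}

def finite_traces(states, tau, max_len):
    # finite traces of Υ: sequences respecting τ that end in a final state
    omega = final_states(states, tau)
    traces, frontier = [], [(s,) for s in states]
    while frontier:
        next_frontier = []
        for trace in frontier:
            last = trace[-1]
            if last in omega:
                traces.append(trace)
            elif len(trace) < max_len:
                next_frontier += [trace + (t,) for t in states if (last, t) in tau]
        frontier = next_frontier
    return traces

states = {1, 2, 3}
tau = {(1, 2), (2, 3)}
print(finite_traces(states, tau, max_len=5))  # e.g. [(3,), (2, 3), (1, 2, 3)]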

3 Concrete Data Leakage Semantics

The trace semantics fully describes the behavior of a program. However, reasoning about a particular property of a program is facilitated by the design of a semantics that abstracts away from irrelevant details about program executions. In this section, we define our property of interest — absence of data leakage — and use abstract interpretation to systematically derive, by abstraction of the trace semantics, a semantics that precisely captures this property.

3.1 (Absence of) Data Leakage

We use an extensional definition of a property as the set of elements having such a property [5,6]. This allows checking property satisfaction by set inclusion (see below) also across abstractions (cf. Theorems 1 and 2). Semantic properties of programs are properties of their semantics. Thus, properties of programs with trace semantics in P(Σ^{+∞}) are sets of sets of traces in P(P(Σ^{+∞})). The set of program properties forms a complete boolean lattice ⟨P(P(Σ^{+∞})), ⊆, ∪, ∩, ∅, P(Σ^{+∞})⟩ for subset inclusion (i.e., logical implication). The strongest property is the standard collecting semantics Λ ∈ P(P(Σ^{+∞})): Λ ≝ {Υ}, i.e., the property of "being the program with trace semantics Υ". Let ⦅P⦆ denote the collecting semantics of a program P. Then, a program P satisfies a given property H ∈ P(P(Σ^{+∞})) if and only if its collecting semantics is a subset of H: P ⊨ H ⇔ ⦅P⦆ ⊆ H. In this paper, we consider the property of absence of data leakage, which requires data used for training and data used for testing a machine learning model to be independent.
Example 2 ((In)dependent Data Frame Variables). Let us consider a program P with a single input data frame variable reading data frames with four rows and one single column with values in {3, 9}, i.e., data frames in ⋃_{r∈{1,2,3,4}} {3, 9}^r (cf. Section 2.1). Imagine that P first performs min-max normalization (i.e., rescaling all data frame values to be in the [0, 1] range) and then splits the data frame in half to use the first two rows for training and the last two rows for testing. The table below shows all possible train and test data resulting from all possible input data of this program:

input data   row 1:  3 3 3 3 9 9 9 9 3 3 3 3 9 9 9 9
             row 2:  3 3 9 9 3 3 9 9 3 3 9 9 3 3 9 9
             row 3:  3 9 3 9 3 9 3 9 3 9 3 9 3 9 3 9
             row 4:  3 3 3 3 3 3 3 3 9 9 9 9 9 9 9 9
                     ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
train data   row 1:  0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 0
             row 2:  0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 0
test data    row 1:  0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0
             row 2:  0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0

(σ denotes the execution in the twelfth column, with input 3|9|9|9; σ′ denotes the one in the last column, with input 9|9|9|9.)
In this case, train and test data are not independent: if we consider, for instance, the execution σ with input data frame value "3|9|9|9", we can change the value of its first row (i.e., r = 1 in Equation 2) from v̄ = 3 to v̄ = 9 (while leaving all other rows unchanged) to obtain an execution σ′ resulting in a difference in both train and test data (i.e., σ(U_P^train) ≠ σ′(U_P^train) and σ(U_P^test) ≠ σ′(U_P^test) in Equation 2, with train data differing at row 2 and test data differing at both rows 1 and 2).

Instead, the table below shows all possible resulting train and test data if the normalization is performed after the split into train and test data:
input data   row 1:  3 3 3 3 9 9 9 9 3 3 3 3 9 9 9 9
             row 2:  3 3 9 9 3 3 9 9 3 3 9 9 3 3 9 9
             row 3:  3 9 3 9 3 9 3 9 3 9 3 9 3 9 3 9
             row 4:  3 3 3 3 3 3 3 3 9 9 9 9 9 9 9 9
                     ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
train data   row 1:  0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0
             row 2:  0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0
test data    row 1:  0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0
             row 2:  0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0

Here train and test data remain independent, as modifying any input data row r in any execution yields another execution that may result in a difference in either train or test data but never both. Equivalently, all possible values of either train or test data are possible independently of the choice of the value of the row r.

More formally, let X be the set of all the (data frame) variables of a (data frame-manipulating) program P. We denote with I_P ⊆ X the set of its input or source data frame variables, i.e., data frame variables whose value is directly read from the input, and use U_P ⊆ X to denote the set of its used data frame variables, i.e., data frame variables used for training or testing a ML model. We write U_P^train ⊆ U_P and U_P^test ⊆ U_P for the variables used for training and testing, respectively. For simplicity, we can assume that programs are in static single-assignment form so that data frame variables are assigned exactly once: data is read from the input, transformed and normalized, and ultimately used for training and testing. Given a trace σ ∈ ⟦P⟧, we write σ(i) and σ(o) to denote the value of the data frame variables i ∈ I_P and o ∈ U_P in σ. We can now define when used data frame variables are independent in a program with trace semantics ⟦P⟧:

ind(⟦P⟧) ≝ ∀σ ∈ ⟦P⟧, i ∈ I_P, r ∈ R_i: unch(σ, i, r, U_P^test) ∨ unch(σ, i, r, U_P^train)

unch(σ, i, r, U) ≝ ∀v̄ ∈ V^{C_i}: σ(i)[r] ≠ v̄ ⇒ (∃σ′ ∈ ⟦P⟧: σ′(i)[r] = v̄ ∧ σ(i) = σ′(i) ∧ σ(I_P \ {i}) = σ′(I_P \ {i}) ∧ σ(U) = σ′(U))    (2)
where R_i and C_i stand for R_{σ(i)} (i.e., the number of rows of the data frame value of i ∈ I_P) and C_{σ(i)} (i.e., the number of columns of the data frame value of i ∈ I_P), respectively, σ(i) = σ′(i) stands for ∀r′ ∈ R_i: r′ ≠ r ⇒ σ(i)[r′] = σ′(i)[r′] (i.e., the data frame value of i ∈ I_P remains unchanged for any row r′ ≠ r), and σ(X) = σ′(X) stands for ∀x ∈ X: σ(x) = σ′(x). The definition requires that changing the value of a data source i ∈ I_P can modify data frame variables used for training (U_P^train) or testing (U_P^test), but not both: the value of data frame variables used for either training or testing in a trace σ remains the same independently of all possible values v̄ ∈ V^{C_i} of any portion (e.g., any row r ∈ R_i) of any input data frame variable i ∈ I_P in σ. Note that this definition quantifies over changes in data frame rows since the split into train and test data happens across rows (e.g., using train_test_split in Pandas), but takes into account all possible column values in each row (v̄ ∈ V^{C_i}). It also implicitly takes into account implicit flows of information by considering traces in ⟦P⟧. In particular, in terms of secure information flow, notably non-interference, this definition says that we cannot simultaneously observe different values in U_P^train and U_P^test, regardless of the values of the input data frame variables. Here we weaken non-interference to consider either U_P^train or U_P^test as low outputs (depending on which row of the input data frame variables is modified), instead of fixing the choice beforehand.

Note also that unch quantifies over all possible values v̄ ∈ V^{C_i} of the changed row r rather than quantifying over traces, to allow non-determinism: not all traces that only differ at row r ∈ R_i of data frame variable i ∈ I_P need to agree on the values of the used variables U ⊆ U_P, but all values of U that are feasible from a value of row r of i need to be feasible for all possible values of row r of i. In terms of input data (non-)usage [23], this definition says that training and testing do not use the same (portions of the) input data sources. Here we generalize the notion of data usage proposed by Urban and Müller [23] to multi-dimensional variables and allow multiple values for all outcomes but one (variables used for either training or testing) for each variation in the values of the input variables.
The absence of data leakage property can now be formally defined as the set I ≝ {⟦P⟧ ∈ P(Σ^{+∞}) | ind(⟦P⟧)} of programs (semantics) that use independent data for training and testing ML models. Thus P ⊨ I ⇔ ⦅P⦆ ⊆ I.
In the rest of this section, we derive, by abstraction of the collecting semantics Λ, a sound and complete semantics Λ̇_I that contains only and exactly the information needed to reason about (the absence of) data leakage. A further abstraction in the next section loses completeness but yields a sound and computable over-approximation of Λ̇_I that allows designing a static analysis to effectively detect data leakage in data frame-manipulating programs.

3.2 Dependency Semantics

From the definition of absence of data leakage, we observe that for reasoning about data leakage we essentially need to track the flow of information between (portions of) input data sources and data used for training or testing. Thus we can abstract the collecting semantics into a set of dependencies between (rows of) input data frame variables and used data frame variables.

We define the following Galois connection:

⟨P(P(Σ^{+∞})), ⊆⟩ ⇄ ⟨P((X × N) × (X × N)), ⊇⟩    with abstraction α_{I⇝U} and concretization γ_{I⇝U}    (3)

between sets of sets of traces and sets of relations (i.e., dependencies) between data frame variables indexed at some row. The abstraction and concretization function are parameterized by a set I ⊆ X of input variables and a set U ⊆ X of used variables of interest. In particular, the dependency abstraction α_{I⇝U}: P(P(Σ^{+∞})) → P((X × N) × (X × N)) is:

α_{I⇝U}(S) ≝ { i[r] ⇝ o[r′] | i ∈ I, r ∈ N, o ∈ U, r′ ∈ N, ∀T ∈ S: ∃σ ∈ T, v̄ ∈ V^{C_i}: ∀σ′ ∈ T: (σ(i) = σ′(i) ∧ σ(I \ {i}) = σ′(I \ {i}) ∧ σ(o)[r′] = σ′(o)[r′]) ⇒ σ′(i)[r] ≠ v̄ }
where we write i[r] ⇝ o[r′] for a dependency ⟨⟨i, r⟩, ⟨o, r′⟩⟩ between a data frame variable i ∈ I at the row indexed by r ∈ N and a data frame variable o ∈ U at the row indexed by r′ ∈ N. In particular, α_{I⇝U} extracts a dependency i[r] ⇝ o[r′] when (in all sets of traces T in the semantic property S) there is a value v̄ ∈ V^{C_i} for row r of data frame variable i that changes the value at row r′ of data frame variable o, that is, there is a value for row r′ of data frame variable o that cannot be reached if the value for row r of i is changed to v̄ (and all else remains the same, i.e., σ(i) = σ′(i) ∧ σ(I \ {i}) = σ′(I \ {i})).

Note that our dependency abstraction generalizes that of Cousot [3] to non-deterministic programs and multi-dimensional data frame variables, thus tracking dependencies between portions of data frames. As in [3], this is an abstraction of semantic properties, thus the dependencies must hold for all semantics having the semantic property: the more semantics have a semantic property, the fewer dependencies will hold for all of them. Therefore, sets of dependencies are ordered by superset inclusion ⊇ (cf. Equation 3).
Example 3 (Dependencies Between Data Frame Variables). Let us consider again the program P from Example 2. Let i denote the input data frame of the program and let o_train and o_test denote the data frames used for training and testing. In this case, for instance, we have i[1] ⇝ o_train[2] because, taking execution σ, changing only the value of i[1] from 3 to 9 yields execution σ′ which changes the value of o_train[2], i.e., all other executions either differ at other rows of i or differ at least in the value of o_train[2] (such as σ′). In fact, the set of dependencies for the whole set of executions of the program shows that o_train and o_test depend on all rows of the input data frame variable i.

Instead, performing normalization after splitting into train / test data yields {i[1] ⇝ o_train[j], i[2] ⇝ o_train[j], i[3] ⇝ o_test[j], i[4] ⇝ o_test[j]} for j ∈ {1, 2}, where o_train and o_test depend on disjoint subsets of rows of the input data frame i.
It is easy to see that the abstraction function α_{I_P⇝U_P} is a complete join morphism. Thus, γ_{I_P⇝U_P}(D) ≝ ⋃{S | α_{I_P⇝U_P}(S) ⊇ D}.

We can now define the dependency semantics Λ^⇝ ∈ P((X × N) × (X × N)) by abstraction of the collecting semantics Λ: Λ^⇝ ≝ α_{I⇝U}(Λ). In the rest of the paper, we write ⦅P⦆^⇝ to denote the dependency semantics of a program P, leaving the sets of data frame variables of interest I and U implicitly set to I_P and U_P, respectively. The dependency semantics remains sound and complete:

Theorem 1. P ⊨ I ⇔ ⦅P⦆^⇝ ⊇ α_{I_P⇝U_P}(I)

3.3 Data Leakage Semantics

As hinted by Example 3, we observe that for detecting data leakage (resp. verifying absence of data leakage), we care in particular about which rows of input data frame variables the used data frame variables depend on. In case of data leakage (resp. absence of data leakage), data frame variables used for different purposes will depend on overlapping (resp. disjoint) sets of rows of input data frame variables. Thus, we further abstract the dependency semantics Λ^⇝ pointwise [7] into a map associating with each data frame variable and each of its row indexes the set of (input) variables (indexed at some row) on which it depends.
Formally, we define the following Galois connection:

⟨P((X × N) × (X × N)), ⊇⟩ ⇄ ⟨X → (N → P(X × N)), ⊇̇⟩    with abstraction α̇ and concretization γ̇    (4)

where the abstraction α̇: P((X × N) × (X × N)) → (X → (N → P(X × N))) is:

α̇(D) ≝ λx ∈ X: (λr ∈ N: {i[r′] | i ∈ X, r′ ∈ N, i[r′] ⇝ x[r] ∈ D})    (5)

Example 4 (Data Leakage Semantics). Let us consider again the last dependencies in Example 3: {i[1] ⇝ o_train[j], i[2] ⇝ o_train[j], i[3] ⇝ o_test[j], i[4] ⇝ o_test[j]}, j ∈ {1, 2}. Its abstraction following Equation 5 is the following map:

λx:  x = o_train ↦ λr: {i[1], i[2]}   (for r ∈ {1, 2})
     x = o_test  ↦ λr: {i[3], i[4]}   (for r ∈ {1, 2})

Instead, the abstraction of the set of dependencies resulting from performing normalization before splitting into train and test data is the following map:

λx:  x = o_train ↦ λr: {i[1], i[2], i[3], i[4]}   (for r ∈ {1, 2})
     x = o_test  ↦ λr: {i[1], i[2], i[3], i[4]}   (for r ∈ {1, 2})

The abstraction function α̇ is another complete join morphism, so it uniquely determines the concretization function: γ̇(m) ≝ ⋂{D | α̇(D) ⊇̇ m}.

We finally derive our data leakage semantics Λ̇ ∈ X → (N → P(X × N)) by abstraction of the dependency semantics Λ^⇝: Λ̇ ≝ α̇(Λ^⇝). In the following, we write ⦅P⦆˙ for the data leakage semantics of a program P. The abstraction α̇ does not lose any information, so we still have both soundness and completeness:

Theorem 2. P ⊨ I ⇔ ⦅P⦆˙ ⊇̇ α̇(α_{I_P⇝U_P}(I))

We can now equivalently verify absence of data leakage by checking that data frames used for different purposes depend on disjoint (rows of) input data:

Lemma 1. P ⊨ I ⇔ ∀o₁ ∈ U_P^train, o₂ ∈ U_P^test: ⦅P⦆˙o₁ ∩ ⦅P⦆˙o₂ = ∅, where, with a slight abuse of notation, ⦅P⦆˙o stands for ⋃_{r∈dom(⦅P⦆˙o)} ⦅P⦆˙o(r), i.e., the union of all sets ⦅P⦆˙o(r) of rows of input data frame variables in the range of ⦅P⦆˙o, the data leakage semantics ⦅P⦆˙ restricted to the used data frame variable o.

Example 5 (Continued from Example 4). The first map in Example 4 satisfies Lemma 1 since the set {i[1], i[2]} of input data frame rows on which o_train depends is disjoint from the set {i[3], i[4]} of input data frame rows on which o_test depends. Thus, performing min-max normalization after splitting into train and test data does not create data leakage.

This is not the case for the second map in Example 4, where the sets of input data frame rows on which o_train and o_test depend are identical, indicating data leakage when normalization is done before the split into train and test data.

Small Data Frame-Manipulating Language The formal treatment so far is language independent. In the rest of this section, we give a constructive definition of our data leakage semantics Λ̇_I for a small data frame-manipulating language which we then use to illustrate our data leakage analysis in the next section. (Note that the actual implementation of the analysis handles more advanced constructs such as branches, loops, and procedure calls, cf. [8, Appendix D].)

We consider a simple sequential language without procedures nor references. The only variable data type is the set D of data frames. Programs in the language are sequences of statements, which belong to either of the following classes:

1. source:   y = read(name)        name ∈ W
2. select:   y = x.select[r̄][C]    r̄ ∈ N^{k ≤ R_x}, C ⊆ hdr(x)
3. merge:    y = op(x₁, x₂)        x₁, x₂ ∈ X, op ∈ {concat, join}
4. function: y = f(x)              x ∈ X, f ∈ {normalize, other}
5. use:      f(X)                  X ⊆ X, f ∈ {train, test}

where name ∈ W is a (string) data file name; we write R_x and hdr(x) for the number of rows and set of labels of the columns of the data frame (value) stored into the variable x. The source statement (representing library functions such as read_csv, read_excel, etc., in Python pandas) reads data from an input file and stores it into a variable y. The select statement (loosely corresponding to library functions such as iloc, loc, etc., in Python pandas) returns a subset data frame y of x, based on an array of row indexes r̄ and a set of column labels C. The selection parameters r̄ and C are optional: when missing, the selection includes all rows or columns of the given data frame. The merge statements are binary merge operations between data frames (the concat and join operations roughly match the default Python pandas concat and merge library functions, respectively). The function statements modify a data frame x either by normalizing it (with the normalize function) or by applying some other function. The normalize function produces a tainted data frame y (representing normalization functions such as standardization or scaling in Python Sklearn). We assume that any other function does not produce tainted data frames. Finally, use statements employ data frames for either training (f = train) or testing (f = test) a ML model.

Constructive Data Leakage Semantics We can now instantiate the definition of our data leakage semantics Λ̇_I with our small data frame-manipulating language. Given a program P ≡ S₁, ..., S_n written in our small language (where S₁, ..., S_n are statements), the set of input data frame variables I_P is given (with a slight abuse of notation, for simplicity) by the set of data files read by source statements, i.e., I_P ≝ i⟦P⟧ ≝ i⟦S_n⟧ ∘ ··· ∘ i⟦S₁⟧∅, where i⟦y = read(name)⟧I ≝ I ∪ {name} and i⟦S⟧I ≝ I for any other statement S in P. Similarly, we define the set of used variables U_P ≝ u⟦P⟧ ≝ u⟦S_n⟧ ∘ ··· ∘ u⟦S₁⟧∅, where u⟦f(X)⟧U ≝ U ∪ X and u⟦S⟧U ≝ U for any other statement S, and analogously for U_P^train ⊆ U_P (when f = train) and U_P^test ⊆ U_P (when f = test).

Our constructive data leakage semantics is ⦅P⦆˙ ≝ s⟦S_n⟧ ∘ ··· ∘ s⟦S₁⟧∅̇, where ∅̇ is the totally undefined function and the semantic function s⟦S⟧ for each statement S in P is defined as follows:

s⟦y = read(name)⟧m ≝ m[y ↦ λr ∈ R_{read(name)}: {name[r]}]

s⟦y = x.select[r̄][C]⟧m ≝ m[y ↦ λr ∈ R_{x.select[r̄][C]}: m(x)(r̄[r])]

s⟦y = concat(x₁, x₂)⟧m ≝ m[y ↦ λr ∈ R_{concat(x₁,x₂)}: m(x₁)r if r ≤ |dom(m(x₁))|; m(x₂)(r − |dom(m(x₁))|) if r > |dom(m(x₁))|]

s⟦y = join(x₁, x₂)⟧m ≝ m[y ↦ λr ∈ R_{join(x₁,x₂)}: m(x₁)←r ∪ m(x₂)→r]

s⟦y = normalize(x)⟧m ≝ m[y ↦ λr ∈ R_{normalize(x)}: ⋃_{r′∈dom(m(x))} m(x)r′]

s⟦y = other(x)⟧m ≝ m[y ↦ λr ∈ R_{other(x)}: m(x)r]

s⟦use(x)⟧m ≝ m

The semantics of the source statement maps each row r of a read data frame y to (the set containing) the corresponding row in the read data file (name[r]). The semantics of the select statement maps each row r of the resulting data frame y to the set of data sources (m(x)) of the corresponding row (r̄[r]) in the original data frame. The concat operation between two data frames x₁ and x₂ yields a data frame with all rows of x₁ followed by all rows of x₂. Thus, the semantics of concat statements accordingly maps each row r of the resulting data frame y to the set of data sources of the corresponding row in x₁ (if r ≤ |dom(m(x₁))|, that is, r falls within the size of x₁) or x₂ (if r > |dom(m(x₁))|). Instead, the join operation combines two data frames x₁ and x₂ based on a(n index) column and yields a data frame containing only the rows that have a matching value in both x₁ and x₂. Thus, the semantics of join statements maps each row r of the resulting data frame y to the union of the sets of data sources of the corresponding rows (←r and →r) in x₁ and x₂. We consider only one type of join operation (inner join) for simplicity, but other types (outer, left, or right join) can be similarly defined. The normalize function is a tainting function, so the semantics of the normalize function introduces dependencies for each row r in the normalized data frame y with the data sources (m(x)) of each row r′ of the data frame before normalization. Instead, the semantics of other (non-tainting) functions maintains the same dependencies (m(x)r) for each row r of the modified data frame y. Finally, use statements do not modify any dependency, so the semantics of use statements leaves the dependencies map unchanged.
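For straight-line programs of this language, the constructive semantics can be transcribed almost literally. Below is a minimal sketch (omitting join and other for brevity) that replays the motivating example with a hypothetical input size of 40 rows; m maps each variable to a list of per-row sets of source cells (file, row):

def s_read(m, y, name, nrows):
    m[y] = [{(name, r)} for r in range(nrows)]

def s_select(m, y, x, rows):
    m[y] = [set(m[x][r]) for r in rows]      # row r of y depends on row rows[r] of x

def s_concat(m, y, x1, x2):
    m[y] = [set(s) for s in m[x1] + m[x2]]

def s_normalize(m, y, x):
    srcs = set().union(*m[x])                # each row now depends on ALL rows of x
    m[y] = [set(srcs) for _ in m[x]]

m, R = {}, 40
s_read(m, "data", "data.csv", R)
s_normalize(m, "X_norm", "data")
split = int(0.025 * R) + 1                   # = 2
s_select(m, "X_train", "X_norm", range(split, R))
s_select(m, "X_test", "X_norm", range(0, split))

# Lemma 1 check: train and test sources must be disjoint; here they are not.
train_srcs = set().union(*m["X_train"])
test_srcs = set().union(*m["X_test"])
print(train_srcs.isdisjoint(test_srcs))      # False: data leakage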

4 Data Leakage Analysis

In this section, we abstract our concrete data leakage semantics to obtain a sound data leakage static analysis. In essence, our analysis keeps track of (an over-approximation of) the data source cells each data frame variable depends on (to detect potential explicit data source overlaps). In addition, it tracks whether data source cells are tainted, i.e., modified by a library function in such a way that could introduce data leakage (by implicit indirect data source overlaps).

4.1 Data Sources Abstract Domain

Data Frame Abstract Domain. We over-approximate data sources by means of a parametric data frame abstract domain L(C, R), where the parameter abstract domains C and R track data source columns and rows, respectively. We illustrate below two simple instances of these domains.

Column Abstraction. We propose an instance of C that over-approximates the set of column labels in a data frame. As, in practice, data frame labels are pretty much always strings, the elements of C belong to a complete lattice ⟨C, ⊑_C, ⊔_C, ⊓_C, ⊥_C, ⊤_C⟩ where C ≝ P(W) ∪ {⊤_C}; W is the set of all possible strings of characters in a chosen alphabet and ⊤_C represents a lack of information on which columns a data frame may have (abstracting any data frame). Elements in C are ordered by set inclusion extended with ⊤_C being the largest element: C₁ ⊑_C C₂ ⇔ C₂ = ⊤_C ∨ (C₁ ≠ ⊤_C ∧ C₁ ⊆ C₂). Similarly, join ⊔_C and meet ⊓_C are set union and set intersection, respectively, extended to account for ⊤_C:

C₁ ⊔_C C₂ ≝ ⊤_C if C₁ = ⊤_C ∨ C₂ = ⊤_C, and C₁ ∪ C₂ otherwise

C₁ ⊓_C C₂ ≝ C₁ if C₂ = ⊤_C; C₂ if C₁ = ⊤_C; and C₁ ∩ C₂ otherwise
Finally, the bottom element ⊥_C is the empty set ∅ (abstracting an empty data frame).
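A direct transcription of this lattice, using Python sets and None as an encoding of ⊤_C, might look as follows:

TOP = None  # ⊤_C: no information about column labels

def c_leq(c1, c2):
    return c2 is TOP or (c1 is not TOP and c1 <= c2)

def c_join(c1, c2):
    return TOP if c1 is TOP or c2 is TOP else c1 | c2

def c_meet(c1, c2):
    if c1 is TOP:
        return c2
    if c2 is TOP:
        return c1
    return c1 & c2

assert c_join({"id"}, {"city"}) == {"id", "city"}
assert c_meet({"id", "city"}, TOP) == {"id", "city"}
assert c_leq(set(), {"id"})  # ⊥_C = ∅ is below every element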

Row Abstraction. Unlike columns, data frame rows are not named. Moreover, data frames typically have a large number of rows, and ranges of rows are often added to or removed from data frames. Thus, the abstract domain of intervals [4] over the natural numbers is a suitable instance of R. The elements of R belong to the complete lattice ⟨R, ⊑_R, ⊔_R, ⊓_R, ⊥_R, ⊤_R⟩ with the set R defined as R ≝ {[l, u] | l ∈ N, u ∈ N ∪ {∞}, l ≤ u} ∪ {⊥_R}. The top element ⊤_R is [0, ∞]. Intervals in R abstract (sets of) row indexes: the concretization function γ_R: R → P(N) is such that γ_R(⊥_R) ≝ ∅ and γ_R([l, u]) ≝ {r ∈ N | l ≤ r ≤ u}. The interval domain partial order (⊑_R) and operators for join (⊔_R) and meet (⊓_R) are defined as usual (e.g., see Miné's PhD thesis [15] for reference).

In addition, we associate with each interval R ∈ R another interval idx(R) of indices: idx(⊥_R) ≝ ⊥_R and idx([l, u]) ≝ [0, u − l]; this essentially establishes a map ϕ_R: N → N between elements of γ_R(R) (ordered by ≤) and elements of γ_R(idx(R)) (also ordered by ≤). In the following, given an interval R ∈ R and an interval of indices [i, j] ∈ R (such that [i, j] ⊑_R idx(R)), we slightly abuse notation and write ϕ_R⁻¹([i, j]) for the sub-interval of R between the indices i and j, i.e., we have that γ_R(ϕ_R⁻¹([i, j])) ≝ {r ∈ γ_R(R) | ϕ_R⁻¹(i) ≤ r ≤ ϕ_R⁻¹(j)}. We need this operation to soundly abstract consecutive row selections (cf. Section 4).

Example 6 (Row Abstraction). Let us consider the interval R = [10, 14] ∈ R with idx(R) = [0, 4]. We have an isomorphism ϕ_R between {10, 11, 12, 13, 14} and {0, 1, 2, 3, 4}. Let us consider now the interval of indices [1, 3]. We then have ϕ_R⁻¹([1, 3]) = [11, 13] (since ϕ_R⁻¹(1) = 11 and ϕ_R⁻¹(3) = 13).
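The interval operations and the idx/ϕ_R⁻¹ machinery are equally direct to transcribe; a small sketch (encoding ⊥_R as None and ∞ as math.inf):

import math

BOT = None  # ⊥_R

def r_meet(a, b):
    if a is BOT or b is BOT:
        return BOT
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else BOT

def r_join(a, b):
    if a is BOT:
        return b
    if b is BOT:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))

def idx(a):
    # interval of indices associated with a: [0, u - l]
    return BOT if a is BOT else (0, a[1] - a[0])

def phi_inv(a, ij):
    # sub-interval of a between indices i and j (cf. Example 6)
    i, j = ij
    return (a[0] + i, a[0] + j)

assert idx((10, 14)) == (0, 4)
assert phi_inv((10, 14), (1, 3)) == (11, 13)
assert r_meet((0, math.inf), (5, 9)) == (5, 9)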

Data Frame Abstraction. The elements of the data frame abstract domain L(C, R) belong to a partial order ⟨L, ⊑_L⟩ where L ≝ W × C × R contains triples of a data file name file ∈ W, a column over-approximation C ∈ C, and a row over-approximation R ∈ R. In the following, we write file^C_R for the abstract data frame ⟨file, C, R⟩ ∈ L. The partial order ⊑_L compares abstract data frames derived from the same data files: X^C_R ⊑_L Y^{C′}_{R′} ≝ X = Y ∧ C ⊑_C C′ ∧ R ⊑_R R′.

We also define a predicate for whether abstract data frames overlap:

overlap(X^C_R, Y^{C′}_{R′}) ≝ X = Y ∧ C ⊓_C C′ ≠ ∅ ∧ R ⊓_R R′ ≠ ⊥_R    (6)

and partial join (⊔_L) and meet (⊓_L) over data frames from the same data files:

X^{C₁}_{R₁} ⊔_L X^{C₂}_{R₂} ≝ X^{C₁ ⊔_C C₂}_{R₁ ⊔_R R₂}        X^{C₁}_{R₁} ⊓_L X^{C₂}_{R₂} ≝ X^{C₁ ⊓_C C₂}_{R₁ ⊓_R R₂}

Finally, we define a constraining operator ↓^C_R that restricts an abstract data frame to given column and row over-approximations:

X^C_R ↓^{C′}_{R′} ≝ X^{C ⊓_C C′}_{ϕ_R⁻¹(idx(R) ⊓_R R′)}

Note that here the definition makes use of idx(R) and ϕ_R⁻¹([i, j]) to compute the correct row over-approximation.
{id,city}
Example 7 (Abstract Data Frames). Let file[10,14] abstract a data frame with
columns {id, city} and rows {10, 11, 12, 13, 14} derived from a data source f ile.
{country} {id}
The abstract data frame file[12,15] does not overlap with it, while file[12,15] does.
{id,city} {country} {id,city,country}
Joining file[10,14] with file[12,15] yields file[10,15] . Instead, the meet
{id} {id} {id,city,country} {city}
with f ile[12,15] yields file[12,14] . Finally, the constraining file[10,15] ↓[1,2]
{city} −1
results in file[11,12] (since ϕ[10,15] (1) = 11 and ϕ−1[10,15] (2) = 12).

In the rest of this section, for brevity, we simply write L instead of L(C, R).

Data Frame Set Abstract Domain. Data frame variables may depend on multiple data sources. We thus lift our abstract domain L to an abstract domain S(L) of sets of abstract data frames. The elements of S(L) belong to a lattice ⟨S, ⊑_S, ⊔_S, ⊓_S⟩ with S ≝ P(L). Sets of abstract data frames in S are maintained in a canonical form such that no abstract data frames in a set can be overlapping (cf. Equation 6). The partial order ⊑_S between canonical sets relies on the partial order between abstract data frames: S₁ ⊑_S S₂ ≝ ∀L₁ ∈ S₁ ∃L₂ ∈ S₂: L₁ ⊑_L L₂.

The join (⊔_S) and meet (⊓_S) operators perform a set union and set intersection, respectively, followed by a reduction to make the result canonical:

S₁ ⊔_S S₂ ≝ reduce_{⊔_L}(S₁ ∪ S₂)        S₁ ⊓_S S₂ ≝ reduce_{⊓_L}(S₁ ∩ S₂)

where reduce_op(S) ≝ {L₁ op L₂ | L₁, L₂ ∈ S, overlap(L₁, L₂)} ∪ {L₁ ∈ S | ∀L₂ ∈ S \ {L₁}: ¬overlap(L₁, L₂)}

Finally, we lift ↓^C_R by element-wise application: S ↓^C_R ≝ {L ↓^C_R | L ∈ S}.

Example 8 (Abstract Data Frame Sets). Let us consider the join of two abstract data frame sets S₁ = {file1^{id}_{[1,10]}, file2^{name}_{[0,100]}} and S₂ = {file1^{id}_{[9,12]}, file3^{zip}_{[0,100]}}. Before reduction, we obtain {file1^{id}_{[1,10]}, file1^{id}_{[9,12]}, file2^{name}_{[0,100]}, file3^{zip}_{[0,100]}}. The reduction operation makes the set canonical: {file1^{id}_{[1,12]}, file2^{name}_{[0,100]}, file3^{zip}_{[0,100]}}.
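Putting the pieces together, a minimal sketch of abstract data frames and of the reduction on sets (ignoring ⊤_C, and iterating the reduce step to a fixpoint so the result is canonical) reproduces Examples 7 and 8:

from dataclasses import dataclass

@dataclass(frozen=True)
class AbsDF:          # abstract data frame file^C_R
    file: str
    cols: frozenset   # column over-approximation (⊤_C omitted in this sketch)
    rows: tuple       # row interval (l, u)

def overlap(a, b):
    return (a.file == b.file and bool(a.cols & b.cols)
            and max(a.rows[0], b.rows[0]) <= min(a.rows[1], b.rows[1]))

def l_join(a, b):     # partial join, defined for frames from the same file
    return AbsDF(a.file, a.cols | b.cols,
                 (min(a.rows[0], b.rows[0]), max(a.rows[1], b.rows[1])))

def reduce_join(s):
    # merge overlapping frames until no two elements overlap
    s, changed = set(s), True
    while changed:
        changed = False
        for a in s:
            for b in s - {a}:
                if overlap(a, b):
                    s = (s - {a, b}) | {l_join(a, b)}
                    changed = True
                    break
            if changed:
                break
    return s

def s_join(s1, s2):
    return reduce_join(s1 | s2)

s1 = {AbsDF("file1", frozenset({"id"}), (1, 10)), AbsDF("file2", frozenset({"name"}), (0, 100))}
s2 = {AbsDF("file1", frozenset({"id"}), (9, 12)), AbsDF("file3", frozenset({"zip"}), (0, 100))}
print(s_join(s1, s2))  # {file1^{id}_[1,12], file2^{name}_[0,100], file3^{zip}_[0,100]}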

In the following, for brevity, we omit L and simply write S instead of S(L).

Data Frame Sources Abstract Domain. We can now define the domain X → A(S) that we use for our data leakage analysis. Elements in this abstract domain are maps from data frame variables in X to elements of a data frame sources abstract domain A(S), which over-approximates the (input) data frame variables (indexed at some row) on which a data frame variable depends. In particular, elements in A(S) belong to a lattice ⟨A, ⊑_A, ⊔_A, ⊓_A, ⊥_A⟩ where A ≝ S × B contains pairs ⟨S, B⟩ of a data frame set abstraction S ∈ S and a boolean taint flag B ∈ B ≝ {untainted, maybe-tainted} (which we also write as false and true, respectively). In the following, given an abstract element m ∈ X → A of X → A(S) and a data frame variable x ∈ X, we write m_s(x) ∈ S and m_b(x) ∈ B for the first and second component of the pair m(x) ∈ A, respectively.

The abstract domain operators apply component operators pairwise: ⊑_A ≝ ⊑_S × ≤, ⊔_A ≝ ⊔_S × ∨, ⊓_A ≝ ⊓_S × ∧, where ≤ in B is such that untainted ≤ maybe-tainted. The bottom element ⊥_A is ⟨∅, untainted⟩.

Finally, we define the concretization γ: (X → A) → (X → (N → P(X × N))): γ(m) ≝ λx ∈ X: (λr ∈ N: γ_A(m(x))), where γ_A: A → P(X × N) is γ_A(⟨S, B⟩) ≝ {X[r] | X^C_R ∈ S, r ∈ γ_R(R)} (with γ_R: R → P(N) being the concretization function for row abstractions, cf. Section 4.1). Note that γ_A uses neither B ∈ B nor C ∈ C. These are devices uniquely needed by our abstract semantics (that we define below) to track (and over-approximate the actual concrete) dependencies across program statements.

4.2 Abstract Data Leakage Semantics

Our data leakage analysis is given by ⦅P⦆♮ ≝ a⟦S_n⟧ ∘ ··· ∘ a⟦S₁⟧⊥̇_A, where ⊥̇_A maps all data frame variables to ⊥_A and the abstract semantic function a⟦S⟧ for each statement in P is defined as follows:

a⟦y = read(name)⟧m ≝ m[y ↦ ⟨{name^{⊤_C}_{[0,∞]}}, false⟩]

a⟦y = x.select[r̄][C]⟧m ≝ m[y ↦ ⟨m_s(x) ↓^C_{[min(r̄),max(r̄)]}, m_b(x)⟩ if ¬m_b(x); ⟨m_s(x), m_b(x)⟩ otherwise]

a⟦y = op(x₁, x₂)⟧m ≝ m[y ↦ ⟨m_s(x₁) ⊔_S m_s(x₂), m_b(x₁) ∨ m_b(x₂)⟩]

a⟦y = normalize(x)⟧m ≝ m[y ↦ ⟨m_s(x), true⟩]

a⟦y = other(x)⟧m ≝ m[y ↦ ⟨m_s(x), m_b(x)⟩]

a⟦use(x)⟧m ≝ m
The abstract semantics of the source statement simply maps a read data frame variable y to the untainted abstract data frame set containing the abstraction of the read data file (name^{⊤_C}_{[0,∞]}). The abstract semantics of the select statement maps the resulting data frame variable y to the abstract data frame set m_s(x) associated with the original data frame variable x; in order to soundly propagate (abstract) dependencies, m_s(x) is constrained by ↓^C_{[min(r̄),max(r̄)]} (cf. Section 4.1) only if x is untainted. The abstract semantics of merge statements merges the abstract data frame sets m_s(x₁) and m_s(x₂) and taint flags m_b(x₁) and m_b(x₂) associated with the given data frame variables x₁ and x₂. Note that such semantics is a sound but rather imprecise abstraction, in particular for the join operation. More precise abstractions can be easily defined, at the cost of also abstracting data frame contents. The abstract semantics of function statements maps the resulting data frame variable y to the abstract data frame set m_s(x) associated with the original data frame variable x; the normalize function sets the taint flag to true, while other functions leave the taint flag m_b(x) unchanged. Note that, unlike the analysis sketched by Subotić et al. [21], we do not perform any renaming or resetting of the data source mapping (cf. Section 4.2 in [21]) but keep tracking dependencies with respect to the input data frame variables. Finally, the abstract semantics of use statements leaves the abstract dependencies map unchanged.
The abstract data leakage semantics ⦅P⦆♮ is sound:

Theorem 3. P ⊨ I ⇐ γ(⦅P⦆♮) ⊇̇ α̇(α_{I_P⇝U_P}(I))

Similarly, we have the sound but not complete counterpart of Lemma 1 for practically checking absence of data leakage:

Lemma 2.

P ⊨ I ⇐ ∀o₁ ∈ U_P^train, o₂ ∈ U_P^test:
    (⋃_{r₁∈dom(γ(⦅P⦆♮)o₁)} γ(⦅P⦆♮)o₁(r₁)) ∩ (⋃_{r₂∈dom(γ(⦅P⦆♮)o₂)} γ(⦅P⦆♮)o₂(r₂)) = ∅
⇔ ∀o₁ ∈ U_P^train, o₂ ∈ U_P^test: ∀X^C_R ∈ ⦅P⦆♮_s o₁, Y^{C′}_{R′} ∈ ⦅P⦆♮_s o₂:
    ¬overlap(X^C_R, Y^{C′}_{R′}) ∧ (X = Y ⇒ ¬⦅P⦆♮_b o₁ ∧ ¬⦅P⦆♮_b o₂)

Specifically, Lemma 2 allows us to verify absence of data leakage by checking that any pair of abstract data frame sources X^C_R and Y^{C′}_{R′} for data respectively used for training (i.e., o₁) and testing (i.e., o₂) are disjoint (i.e., ¬overlap(X^C_R, Y^{C′}_{R′}), cf. Equation 6) and untainted (i.e., ¬⦅P⦆♮_b o₁ ∧ ¬⦅P⦆♮_b o₂ if X = Y, that is, if they originate from the same data file).

Example 9 (Motivating Example (Continued)). The data leakage analysis of our motivating example (cf. Example 1) is the following:

a⟦data = read("data.csv")⟧⊥̇_A = m₁ ≝ λx: ⟨{data.csv^{⊤_C}_{[0,∞]}}, false⟩ if x = data; undefined otherwise

a⟦X = data.select[][{"X_1", "X_2", "y"}]⟧m₁ = m₂ ≝ m₁[X ↦ ⟨{data.csv^{X_1,X_2,y}_{[0,∞]}}, false⟩]

a⟦X_norm = normalize(X)⟧m₂ = m₃ ≝ m₂[X_norm ↦ ⟨{data.csv^{X_1,X_2,y}_{[0,∞]}}, true⟩]

a⟦X_train = X_norm.select[[⌊0.025 · R_X_norm⌋ + 1, ..., R_X_norm]][]⟧m₃ = m₄ ≝ m₃[X_train ↦ ⟨{data.csv^{X_1,X_2,y}_{[⌊0.025·R_X_norm⌋+1, R_X_norm]}}, true⟩]

a⟦X_test = X_norm.select[[0, ..., ⌊0.025 · R_X_norm⌋]][]⟧m₄ = m₅ ≝ m₄[X_test ↦ ⟨{data.csv^{X_1,X_2,y}_{[0, ⌊0.025·R_X_norm⌋]}}, true⟩]

a⟦test(X_test)⟧(a⟦train(X_train)⟧m₅) = m₅

At the end of the analysis, X_train ∈ U^train and X_test ∈ U^test depend on disjoint but tainted abstract data frames derived from the same input file data.csv. Thus, the absence of data leakage check from Lemma 2 (rightfully) fails.
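The same run can be reproduced with a few lines of Python. The sketch below hard-codes the row split at index 2 (i.e., ⌊0.025 · R⌋ + 1 = 2 for a hypothetical R = 40), omits the column component of abstract data frames, and applies the row constraint in select as in the example above:

INF = float("inf")

def a_read(m, y, name):
    m[y] = ({(name, (0, INF))}, False)               # untainted, all rows

def a_select_rows(m, y, x, lo, hi):
    srcs, tainted = m[x]
    m[y] = ({(f, (max(l, lo), min(u, hi))) for f, (l, u) in srcs}, tainted)

def a_normalize(m, y, x):
    srcs, _ = m[x]
    m[y] = (set(srcs), True)                          # set the taint flag

def lemma2_check(m, train, test):
    (s1, t1), (s2, t2) = m[train], m[test]
    for f1, (l1, u1) in s1:
        for f2, (l2, u2) in s2:
            if f1 == f2:
                rows_overlap = max(l1, l2) <= min(u1, u2)
                if rows_overlap or t1 or t2:          # overlap, or same file and tainted
                    return False
    return True

m = {}
a_read(m, "data", "data.csv")
a_normalize(m, "X_norm", "data")
a_select_rows(m, "X_train", "X_norm", 2, INF)
a_select_rows(m, "X_test", "X_norm", 0, 1)
print(lemma2_check(m, "X_train", "X_test"))           # False: potential data leakage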

5 Experimental Evaluation

We implemented our static analysis into the open source NBLyzer [21] framework for data science notebooks. NBLyzer performs the analysis starting on an individual code cell (intra-cell analysis) and, based on the resulting abstract state, it proceeds to analyze valid successor code cells (inter-cell analysis). Whether a code cell is a valid successor or not is specified by an analysis-dependent cell propagation condition. We refer to the original NBLyzer paper [21] and to the extended version of this paper [8] for further details.

We evaluated our analysis against the data leakage analysis previously implemented in NBLyzer [21], using the same benchmarks (all experiments were done on a Ryzen 9 6900HS with 24GB DDR5 running Ubuntu 22.04). These are notebooks written to succeed in non-trivial real data science tasks and can be assumed to closely represent code written by non-novice data scientists.

We summarize the results in Table 1. For each reported alarm, we engaged 4 data scientists at Microsoft to determine true (TP) and false positives (FP). We further classified the true positives as due to a normalization taint (Taint) or overlapping data frames (Overlap).

Table 1: Alarms raised by the previous [21] vs our analysis.

Analysis | TP (Taint) | TP (Overlap) | FP
[21]     | 10         | 0            | 2
Ours     | 10         | 15           | 2

and 15 Overlap data leakage in 11 notebooks, i.e., a 1.2% bug rate, which adheres
to true positive bug rates reported for null pointers in industrial settings (e.g.,
see [12]). The previous analysis only found 10 Taint leakages in 5 notebooks. It
could not Overlap data leakages because it cannot reason at the granularity of
partial data frames. The cost for our more precise analysis is a mere 7% slow-
down. Both analyses reported 2 false positives, due to different objects having
the same function name (e.g., LabelEncoder and StandardScaler both having
the function fit transform). This issue can be solved by introducing object
sensitive type tracking in our analysis. We leave it for future work.
The full experimental evaluation is described in [8, Appendix E].

6 Related Work

Related Abstract Interpretation Frameworks and Static Analyses. As mentioned in Section 3, our framework generalizes the notion of data usage proposed by Urban and Müller [23] and the definition of dependency abstraction used by Cousot [3]. In particular, among other things, the generalization involves reasoning about dependencies (and thus data usage relationships) between multi-dimensional variables. In [23], Urban and Müller show that information flow analyses can be used for reasoning about data usage, albeit with a loss in precision unless one repeats the information flow analysis by setting each time a different input variable as high security variable (cf. Section 8 in [23]). In virtue of the generalization mentioned above, the same consideration applies to our work, i.e., information flow analyses can be used to reason about data leakage, but with an even higher cost to avoid a precision loss (the analysis needs to be repeated each time for different portions of the input data sources). Analogously, in [3], Cousot shows that information flow, slicing, non-interference, dye, and taint analyses are all further abstractions of his proposed framework (cf. Section 7 in [3]). As such, they are also further (and thus less precise) abstractions of our proposed framework. In particular, a taint analysis will only be able to detect a subset of the data leakage bugs that our analysis can find, i.e., those solely originating from library transformations but not those originating from (partially) overlapping data. Vice versa, our proposed analysis could also be used as a more fine-grained information flow or taint analysis.

Static Analysis for Data Science. Static analysis for data science is an emerging area in the program analysis community [22]. Some notable static analyses for data science scripts include an analysis for ensuring correct shape dimensions in TensorFlow programs [13], an analysis for constraining inputs based on program constraints [23], and a provenance analysis [16]. In addition, static analyses have been proposed for data science notebooks [21,14]. NBLyzer [21] contains an ad-hoc data leakage analysis that detects Taint data leakages [21]. An extension to handle Overlap data leakages was sketched in [20] (but no analysis precision results were reported). However, neither of these analyses is formalized nor formally proven sound (indeed, we found a number of soundness issues in [20] when working on our formalization). In contrast, we introduce a sound data leakage analysis that has a rigorous semantic underpinning.

7 Conclusion

We have presented an approach for detecting data leakages statically. We provide a formal and rigorous derivation, from the standard program collecting semantics via successive abstractions to a final sound and computable static analysis definition. We implement our analysis in the NBLyzer framework and demonstrate clear improvements upon previous ad-hoc data leakage analyses.

Acknowledgements. We thank our colleagues at Microsoft Azure Data Labs and Microsoft Development Centre Serbia for all their feedback and support.

References
1. Chouldechova, A., Prado, D.B., Fialko, O., Vaithianathan, R.: A Case Study of
Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening De-
cisions. In: FAT (2018)
2. Cousot, P.: Constructive Design of a Hierarchy of Semantics of a Transition Sys-
tem by Abstract Interpretation. Electronic Notes in Theoretical Computer Science
277(1-2), 47–103 (2002)
3. Cousot, P.: Abstract Semantic Dependency. In: SAS. pp. 389–410 (2019)
4. Cousot, P., Cousot, R.: Static Determination of Dynamic Properties of Programs.
In: Second International Symposium on Programming. pp. 106–130 (1976)
5. Cousot, P., Cousot, R.: Abstract Interpretation: A Unified Lattice Model for Static
Analysis of Programs by Construction or Approximation of Fixpoints. In: POPL.
pp. 238–252 (1977)
6. Cousot, P., Cousot, R.: Systematic Design of Program Analysis Frameworks. In:
POPL. pp. 269–282 (1979)
7. Cousot, P., Cousot, R.: Higher Order Abstract Interpretation (and Application to
Comportment Analysis Generalizing Strictness, Termination, Projection, and PER
Analysis. In: ICCL. pp. 95–112 (1994)
8. Drobnjaković, F., Subotić, P., Urban, C.: An Abstract Interpretation-Based Data
Leakage Static Analysis. CoRR abs/2211.16073 (2022), https://arxiv.org/
abs/2211.16073
9. Guzharina, A.: We downloaded 10m jupyter notebooks from github - this is what we learned. https://blog.jetbrains.com/datalore/2020/12/17/we-downloaded-10-000-000-jupyter-notebooks-from-github-this-is-what-we-learned/ (2020), accessed: 22-01-22
10. Kapoor, S., Narayanan, A.: Leakage and the reproducibility crisis in machine-
learning-based science. Patterns 4(9), 100804 (2023)
11. Kaufman, S., Rosset, S., Perlich, C., Stitelman, O.: Leakage in Data Mining: For-
mulation, Detection, and Avoidance. ACM Transactions on Knowledge Discovery
from Data 6(4) (2012)

12. Kharkar, A., Moghaddam, R.Z., Jin, M., Liu, X., Shi, X., Clement, C., Sundaresan,
N.: Learning to Reduce False Positives in Analytic Bug Detectors. In: ICSE. p.
1307–1316 (2022)
13. Lagouvardos, S., Dolby, J., Grech, N., Antoniadis, A., Smaragdakis, Y.: Static
Analysis of Shape in TensorFlow Programs. In: ECOOP. pp. 15:1–15:29 (2020)
14. Macke, S., Gong, H., Lee, D.J.L., Head, A., Xin, D., Parameswaran, A.G.: Fine-
Grained Lineage for Safer Notebook Interactions. CoRR abs/2012.06981 (2020),
https://arxiv.org/abs/2012.06981
15. Miné, A.: Weakly Relational Numerical Abstract Domains. Ph.D. thesis, École
Polytechnique, Palaiseau, France (2004), https://tel.archives-ouvertes.fr/
tel-00136630
16. Namaki, M.H., Floratou, A., Psallidas, F., Krishnan, S., Agrawal, A., Wu, Y., Zhu,
Y., Weimer, M.: Vamsa: Automated Provenance Tracking in Data Science Scripts.
In: KDD. pp. 1542–1551 (2020)
17. Nisbet, R., Miner, G., Yale, K.: Handbook of Statistical Analysis and Data Mining Applications. Academic Press, Boston, second edn. (2018). https://doi.org/10.1016/C2012-0-06451-4
18. Papadimitriou, P., Garcia-Molina, H.: A Model for Data Leakage Detection. In:
ICDE. pp. 1307–1310 (2009)
19. Perkel, J.: Why Jupyter is Data Scientists’ Computational Notebook of Choice.
Nature 563, 145–146 (11 2018)
20. Subotić, P., Bojanić, U., Stojić, M.: Statically Detecting Data Leakages in Data
Science Code. In: SOAP. pp. 16–22 (2022)
21. Subotić, P., Milikić, L., Stojić, M.: A Static analysis Framework for Data Science
Notebooks. In: ICSE. pp. 13–22 (2022)
22. Urban, C.: Static analysis of data science software. In: SAS. pp. 17–23 (2019)
23. Urban, C., Müller, P.: An Abstract Interpretation Framework for Input Data Us-
age. In: ESOP. pp. 683–710 (2018)
24. Wong, A., Otles, E., Donnelly, J.P., Krumm, A., McCullough, J., DeTroyer-Cooley,
O., Pestrue, J., Phillips, M., Konye, J., Penoza, C., Ghous, M.H., Singh, K.: Ex-
ternal Validation of a Widely Implemented Proprietary Sepsis Prediction Model
in Hospitalized Patients. JAMA Internal Medicine (2021)
