An Abstract Interpretation-Based Data Leakage Static Analysis
1 Introduction
As artificial intelligence (AI) continues its unprecedented impact on society, en-
suring machine learning (ML) models are accurate is crucial. To this end, ML
models need to be correctly trained and tested. This iterative task is typically
performed within data science notebook environments [19,9]. A notable bug
that can be introduced during this process is known as a data leakage [18].
Data leakages have been identified as a pervasive problem by the data science
community [10,11,17]. In a number of recent cases data leakages crippled the
performance of real-world risk prediction systems with dangerous consequences
in high-stakes applications such as child welfare [1] and healthcare [24].
Data leakages arise when dependent data is used to train and test a model. This can come in the form of overlapping data sets or, more insidiously, arise through library transformations that create indirect data dependencies.
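Consider, for instance, the following notebook snippet (a minimal pandas/scikit-learn sketch; data.csv matches the paper's running input file, while the column names and the 80/20 split are illustrative assumptions; the line numbers referenced below are marked in comments):

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    data = pd.read_csv("data.csv")                    # line 1: read data
    data = (data - data.mean()) / data.std()          # line 2: normalize
    train = data.iloc[: int(0.8 * len(data))]         # line 3: training segment
    test = data.iloc[int(0.8 * len(data)):]           # line 4: testing segment
    model = LinearRegression().fit(train[["x1", "x2"]], train["y"])  # line 5: train
    score = model.score(test[["x1", "x2"]], test["y"])               # line 6: test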
Line 1 reads data from a CSV file and line 2 normalizes it. Lines 3 and 4 split the data into training and testing segments (we write $R_x$ for the number of data rows stored in $x$). Finally, line 5 trains an ML model, and line 6 tests its accuracy.
In this case, a data leakage is introduced because line 2 performs a normal-
ization, before line 3 and 4 split into train and test data. This implicitly uses
the mean over the entire dataset to perform the data transformation and, as a
result, the train and test data are implicitly dependent on each other. In our
experimental evaluation (cf. Section 5) we found that this is a common pattern
for introducing a data leakage in several real-world notebooks. In the following,
we say that the data resulting from the normalization done in line 2 is tainted.
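The leak disappears if lines 2 to 4 are reordered so that the split comes first and each segment is normalized with its own statistics, following the normalization-after-split pattern discussed in the rest of the paper (a sketch under the same illustrative assumptions as above):

    train = data.iloc[: int(0.8 * len(data))]     # split first ...
    test = data.iloc[int(0.8 * len(data)):]
    train = (train - train.mean()) / train.std()  # ... then normalize each segment
    test = (test - test.mean()) / test.std()      # with its own statistics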
2 Background
2.1 Data Frame-Manipulating Programs
We consider programs manipulating data frames, that is, tabular data structures
with columns labeled by non-empty unique names. Let $\mathbb{V}$ be a set of (heterogeneous atomic) values (e.g., numerical or string values).
In the rest of the paper, we write $\llbracket P \rrbracket$ for the trace semantics of a program $P$.
Here train and test data remain independent, as modifying any input data row $r$ in any execution yields another execution that may result in a difference in either the train or the test data, but never both. Equivalently, all possible values of either the train or the test data are possible independently of the choice of the value of the row $r$.
More formally, let $X$ be the set of all the (data frame) variables of a (data frame-manipulating) program $P$. We denote with $I_P \subseteq X$ the set of its input or source data frame variables, i.e., data frame variables whose value is directly read from the input, and use $U_P \subseteq X$ to denote the set of its used data frame variables, i.e., data frame variables used for training or testing a ML model. We write $U^{\mathrm{train}}_P \subseteq U_P$ and $U^{\mathrm{test}}_P \subseteq U_P$ for the variables used for training and testing, respectively.
testing, respectively. For simplicity, we can assume that programs are in static
single-assignment form so that data frame variables are assigned exactly once:
data is read from the input, transformed and normalized, and ultimately used for
training and testing. Given a trace $\sigma \in \llbracket P \rrbracket$, we write $\sigma(i)$ and $\sigma(o)$ to denote the value of the data frame variables $i \in I_P$ and $o \in U_P$ in $\sigma$. We can now define when used data frame variables are independent in a program with trace semantics $\llbracket P \rrbracket$:
\[
\begin{array}{l}
\mathrm{ind}(\llbracket P \rrbracket) \stackrel{\text{def}}{=} \forall \sigma \in \llbracket P \rrbracket, i \in I_P, r \in R_i : \mathrm{unch}(\sigma, i, r, U^{\mathrm{test}}_P) \vee \mathrm{unch}(\sigma, i, r, U^{\mathrm{train}}_P) \\[0.5em]
\mathrm{unch}(\sigma, i, r, U) \stackrel{\text{def}}{=} \forall \bar{v} \in \mathbb{V}^{C_i} : \sigma(i)[r] \neq \bar{v} \Rightarrow (\exists \sigma' \in \llbracket P \rrbracket : \sigma'(i)[r] = \bar{v} \\[0.2em]
\qquad\qquad \wedge\; \sigma(i) \stackrel{\bar{r}}{=} \sigma'(i) \wedge \sigma(I_P \setminus \{i\}) = \sigma'(I_P \setminus \{i\}) \wedge \sigma(U) = \sigma'(U))
\end{array} \tag{2}
\]
where $R_i$ and $C_i$ stand for $R_{\sigma(i)}$ (i.e., the number of rows of the data frame value of $i \in I_P$) and $C_{\sigma(i)}$ (i.e., the number of columns of the data frame value of $i \in I_P$), respectively, $\sigma(i) \stackrel{\bar{r}}{=} \sigma'(i)$ stands for $\forall r' \in R_i : r' \neq r \Rightarrow \sigma(i)[r'] = \sigma'(i)[r']$ (i.e., the data frame value of $i \in I_P$ remains unchanged for any row $r' \neq r$), and $\sigma(X) = \sigma'(X)$ stands for $\forall x \in X : \sigma(x) = \sigma'(x)$. The definition requires that changing the value of a data source $i \in I_P$ can modify data frame variables used for training ($U^{\mathrm{train}}_P$) or testing ($U^{\mathrm{test}}_P$), but not both: the value of data frame variables used for either training or testing in a trace $\sigma$ remains the same independently of all possible values $\bar{v} \in \mathbb{V}^{C_i}$ of any portion (e.g., any row $r \in R_i$) of any input data frame variable $i \in I_P$ in $\sigma$. Note that this definition quantifies over changes in data frame rows, since the split into train and test data happens across rows (e.g., using train_test_split in scikit-learn), but takes into account all possible column values in each row ($\bar{v} \in \mathbb{V}^{C_i}$). It also implicitly takes into account implicit flows of information by considering traces in $\llbracket P \rrbracket$. In particular, in terms of secure information flow, notably non-interference, this definition says that we cannot simultaneously observe different values in $U^{\mathrm{train}}_P$ and $U^{\mathrm{test}}_P$, regardless of the values of the input data frame variables. Here we weaken non-interference to consider either $U^{\mathrm{train}}_P$ or $U^{\mathrm{test}}_P$ as low outputs (depending on which row of the input data frame variables is modified), instead of fixing the choice beforehand.
Note also that unch quantifies over all possible values $\bar{v} \in \mathbb{V}^{C_i}$ of the changed row $r$ rather than quantifying over traces in order to allow non-determinism: not all traces that only differ at row $r \in R_i$ of data frame variable $i \in I_P$ need to agree on the values of the used variables $U \subseteq U_P$, but all values of $U$ that are feasible in $\sigma$ must remain feasible after the change.
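To make this concrete, Definition 2 can be checked by brute force on a finite, explicitly enumerated trace set; a toy sketch (traces encoded as maps from variables to tuples of rows over a two-valued domain; all encodings are illustrative):

    from itertools import product

    # Toy encoding: a trace maps each variable to a data frame, i.e., a tuple
    # of rows, each row a tuple of values drawn from the finite domain V below.
    V = (0, 1)

    def unch(traces, sigma, inputs, i, r, used):
        """unch(sigma, i, r, U) from Equation 2: every value v for row r of
        input i is realized by some trace that keeps the other rows of i, the
        other inputs, and the used variables U unchanged."""
        columns = len(sigma[i][r])
        for v in product(V, repeat=columns):
            if sigma[i][r] == v:
                continue
            witness = any(
                s[i][r] == v
                and all(s[i][q] == sigma[i][q]
                        for q in range(len(sigma[i])) if q != r)
                and all(s[j] == sigma[j] for j in inputs if j != i)
                and all(s[u] == sigma[u] for u in used)
                for s in traces
            )
            if not witness:
                return False
        return True

    def ind(traces, inputs, u_train, u_test):
        """Definition 2: changing any row of any input leaves either the
        training or the testing data frames unchanged."""
        return all(
            unch(traces, sigma, inputs, i, r, u_test)
            or unch(traces, sigma, inputs, i, r, u_train)
            for sigma in traces
            for i in inputs
            for r in range(len(sigma[i]))
        )

    # A program copying row 0 of i to o_train and row 1 to o_test is independent,
    traces = [{"i": ((a,), (b,)), "o_train": ((a,),), "o_test": ((b,),)}
              for a in V for b in V]
    assert ind(traces, ["i"], ["o_train"], ["o_test"])
    # while one copying row 0 of i to both o_train and o_test is not.
    leaky = [{"i": ((a,), (b,)), "o_train": ((a,),), "o_test": ((a,),)}
             for a in V for b in V]
    assert not ind(leaky, ["i"], ["o_train"], ["o_test"])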
We now define a Galois connection between sets of sets of traces and sets of relations (i.e., dependencies) between data frame variables indexed at some row. The abstraction and concretization function are parameterized by a set $I \subseteq X$ of input variables and a set $U \subseteq X$ of used variables of interest. In particular, the dependency abstraction $\alpha_{I \leadsto U} : \mathcal{P}(\mathcal{P}(\Sigma^{+\infty})) \to \mathcal{P}((X \times \mathbb{N}) \times (X \times \mathbb{N}))$ is:
\[
\alpha_{I \leadsto U}(S) \stackrel{\text{def}}{=} \left\{ i[r] \leadsto o[r'] \;\middle|\;
\begin{array}{l}
i \in I, r \in \mathbb{N}, o \in U, r' \in \mathbb{N}, (\forall T \in S : \\
\exists \sigma \in T, \bar{v} \in \mathbb{V}^{C_i} : \forall \sigma' \in T : \sigma(i) \stackrel{\bar{r}}{=} \sigma'(i) \;\wedge \\
\sigma(I \setminus \{i\}) = \sigma'(I \setminus \{i\}) \wedge \sigma(o)[r'] = \sigma'(o)[r'] \\
\Rightarrow \sigma'(i)[r] \neq \bar{v})
\end{array}
\right\}
\]
where we write $i[r] \leadsto o[r']$ for a dependency $\langle\langle i, r \rangle, \langle o, r' \rangle\rangle$ between a data frame variable $i \in I$ at the row indexed by $r \in \mathbb{N}$ and a data frame variable $o \in U$ at the row indexed by $r' \in \mathbb{N}$. In particular, $\alpha_{\leadsto}$ extracts a dependency $i[r] \leadsto o[r']$ when (in all sets of traces $T$ in the semantic property $S$) there is a value $\bar{v} \in \mathbb{V}^{C_i}$ for row $r$ of data frame variable $i$ that changes the value at row $r'$ of data frame variable $o$, that is, there is a value for row $r'$ of data frame variable $o$ that cannot be reached if the value for row $r$ of $i$ is changed to $\bar{v}$ (and all else remains the same, i.e., $\sigma(i) \stackrel{\bar{r}}{=} \sigma'(i) \wedge \sigma(I \setminus \{i\}) = \sigma'(I \setminus \{i\})$).
Note that our dependency abstraction generalizes that of Cousot [3] to non-deterministic programs and multi-dimensional data frame variables, thus tracking dependencies between portions of data frames. As in [3], this is an abstraction of semantic properties, thus the dependencies must hold for all semantics having the semantic property: the more semantics have a semantic property, the fewer dependencies will hold for all of them. Therefore, sets of dependencies are ordered by superset inclusion $\supseteq$ (cf. Equation 3).
Example 3 (Dependencies Between Data Frame Variables). Let us consider again the program $P$ from Example 2. Let $i$ denote the input data frame of the program and let $o_{\mathrm{train}}$ and $o_{\mathrm{test}}$ denote the data frames used for training and testing. In this case, for instance, we have $i[1] \leadsto o_{\mathrm{train}}[2]$ because, taking execution $\sigma$, changing only the value of $i[1]$ from 3 to 9 yields execution $\sigma'$ which changes the value of $o_{\mathrm{train}}[2]$, i.e., all other executions either differ at other rows of $i$ or differ at least in the value of $o_{\mathrm{train}}[2]$ (such as $\sigma'$). In fact, the set of dependencies for the whole set of executions of the program shows that $o_{\mathrm{train}}$ and $o_{\mathrm{test}}$ depend on all rows of the input data frame variable $i$.
Instead, performing normalization after splitting into train / test data yields $\{i[1] \leadsto o_{\mathrm{train}}[j], i[2] \leadsto o_{\mathrm{train}}[j], i[3] \leadsto o_{\mathrm{test}}[j], i[4] \leadsto o_{\mathrm{test}}[j]\}$, $j \in \{1, 2\}$, where $o_{\mathrm{train}}$ and $o_{\mathrm{test}}$ depend on disjoint subsets of rows of the input data frame $i$.
It is easy to see that the abstraction function $\alpha_{I_P \leadsto U_P}$ is a complete join morphism. Thus, $\gamma_{I_P \leadsto U_P}(D) \stackrel{\text{def}}{=} \bigcup \{ S \mid \alpha_{I_P \leadsto U_P}(S) \supseteq D \}$.
We can now define the dependency semantics $\Lambda^{I \leadsto U} \in \mathcal{P}((X \times \mathbb{N}) \times (X \times \mathbb{N}))$ by abstraction of the collecting semantics $\Lambda$: $\Lambda^{I \leadsto U} \stackrel{\text{def}}{=} \alpha_{I \leadsto U}(\Lambda)$. In the rest of the paper, we write $\llparenthesis P \rrparenthesis^{\leadsto}$ to denote the dependency semantics of a program $P$, leaving the sets of data frame variables of interest $I$ and $U$ implicitly set to $I_P$ and $U_P$, respectively. The dependency semantics remains sound and complete:

Theorem 1. $P \models \mathcal{I} \Leftrightarrow \llparenthesis P \rrparenthesis^{\leadsto} \supseteq \alpha_{I_P \leadsto U_P}(\mathcal{I})$
Example 4 (Data Leakage Semantics). Let us consider again the last dependencies in Example 3: $\{i[1] \leadsto o_{\mathrm{train}}[j], i[2] \leadsto o_{\mathrm{train}}[j], i[3] \leadsto o_{\mathrm{test}}[j], i[4] \leadsto o_{\mathrm{test}}[j]\}$, $j \in \{1, 2\}$. Its abstraction following Equation 5 is the following map:

\[
\lambda x :
\begin{cases}
\lambda r : \begin{cases} \{i[1], i[2]\} & r = 1 \\ \{i[1], i[2]\} & r = 2 \end{cases} & x = o_{\mathrm{train}} \\[1.5em]
\lambda r : \begin{cases} \{i[3], i[4]\} & r = 1 \\ \{i[3], i[4]\} & r = 2 \end{cases} & x = o_{\mathrm{test}}
\end{cases}
\]

Instead, the first dependencies in Example 3 (normalization before the split) abstract to the map associating both $o_{\mathrm{train}}$ and $o_{\mathrm{test}}$, at each row, with the whole set $\{i[1], i[2], i[3], i[4]\}$ of input rows.
Theorem 2. $P \models \mathcal{I} \Leftrightarrow \dot{\llparenthesis P \rrparenthesis} \mathrel{\dot{\supseteq}} \dot{\alpha}(\alpha_{I_P \leadsto U_P}(\mathcal{I}))$
We can now equivalently verify absence of data leakage by checking that data
frames used for different purposes depend on disjoint (rows of) input data:
Example 5 (Continued from Example 4). The first map in Example 4 satisfies Lemma 1 since the set $\{i[1], i[2]\}$ of input data frame rows on which $o_{\mathrm{train}}$ depends is disjoint from the set $\{i[3], i[4]\}$ of input data frame rows on which $o_{\mathrm{test}}$ depends. Thus, performing min-max normalization after splitting into train and test data does not create data leakage. This is not the case for the second map in Example 4, where the sets of input data frame rows on which $o_{\mathrm{train}}$ and $o_{\mathrm{test}}$ depend are identical, indicating data leakage when normalization is done before the split into train and test data.
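On the maps of Example 4, the check reduces to a disjointness test over the unions of per-row source sets; a small sketch (the dictionary encoding of the two maps is illustrative):

    # Abstraction maps from Example 4: used variable -> row -> input sources.
    no_leak = {"o_train": {1: {"i[1]", "i[2]"}, 2: {"i[1]", "i[2]"}},
               "o_test":  {1: {"i[3]", "i[4]"}, 2: {"i[3]", "i[4]"}}}
    all_rows = {"i[1]", "i[2]", "i[3]", "i[4]"}
    leak = {"o_train": {1: set(all_rows), 2: set(all_rows)},
            "o_test":  {1: set(all_rows), 2: set(all_rows)}}

    def sources(m, x):
        return set().union(*m[x].values())

    def disjoint_sources(m, train, test):
        return sources(m, train).isdisjoint(sources(m, test))

    assert disjoint_sources(no_leak, "o_train", "o_test")   # no data leakage
    assert not disjoint_sources(leak, "o_train", "o_test")  # data leakage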
The dependency semantics can be computed statement by statement as $s\llbracket S_n \rrbracket \circ \cdots \circ s\llbracket S_1 \rrbracket\, \dot{\emptyset}$, where $\dot{\emptyset}$ is the totally undefined function and the semantic function $s\llbracket S \rrbracket$ for each statement $S$ in $P$ is defined as follows:
\[
\begin{array}{l}
s\llbracket y = \mathrm{read}(\mathit{name}) \rrbracket m \stackrel{\text{def}}{=} m[y \mapsto \lambda r \in R_{\mathrm{read}(\mathit{name})} : \{\mathit{name}[r]\}] \\[0.5em]
s\llbracket y = x.\mathrm{select}[\bar{r}][C] \rrbracket m \stackrel{\text{def}}{=} m[y \mapsto \lambda r \in R_{x.\mathrm{select}[\bar{r}][C]} : m(x)(\bar{r}[r])] \\[0.5em]
s\llbracket y = \mathrm{concat}(x_1, x_2) \rrbracket m \stackrel{\text{def}}{=} m\!\left[y \mapsto \lambda r \in R_{\mathrm{concat}(x_1, x_2)} :
\begin{cases}
m(x_1)\,r & r \leq |\mathrm{dom}(m(x_1))| \\
m(x_2)\,(r - |\mathrm{dom}(m(x_1))|) & r > |\mathrm{dom}(m(x_1))|
\end{cases}\right] \\[1.5em]
s\llbracket y = \mathrm{join}(x_1, x_2) \rrbracket m \stackrel{\text{def}}{=} m[y \mapsto \lambda r \in R_{\mathrm{join}(x_1, x_2)} : m(x_1)\,\overleftarrow{r} \cup m(x_2)\,\overrightarrow{r}] \\[0.5em]
s\llbracket y = \mathrm{normalize}(x) \rrbracket m \stackrel{\text{def}}{=} m[y \mapsto \lambda r \in R_{\mathrm{normalize}(x)} : \textstyle\bigcup_{r' \in \mathrm{dom}(m(x))} m(x)\,r'] \\[0.5em]
s\llbracket y = \mathrm{other}(x) \rrbracket m \stackrel{\text{def}}{=} m[y \mapsto \lambda r \in R_{\mathrm{other}(x)} : m(x)\,r] \\[0.5em]
s\llbracket \mathrm{use}(x) \rrbracket m \stackrel{\text{def}}{=} m
\end{array}
\]
The semantics of the source statement maps each row r of a read data frame y to
(the set containing) the corresponding row in the read data file (name[r]). The
semantics of the select statement maps each row r of the resulting data frame y to
the set of data sources (m(x)) of the corresponding row (r̄[r]) in the original data
frame. The concat operation between two data frames x1 and x2 yields a data
frame with all rows of x1 followed by all rows of x2 . Thus, the semantics of concat
statements accordingly maps each row r of the resulting data frame y to the set
of data sources of the corresponding row in x1 (if r ≤ |dom(m(x1 ))|, that is, r
falls within the size of x1 ) or x2 (if r > |dom(m(x1 ))|). Instead, the join operation
combines two data frames x1 and x2 based on a(n index) column and yields a
data frame containing only the rows that have a matching value in both x1 and
x2 . Thus, the semantics of join statements maps each row r of the resulting
data frame y to the union of the sets of data sources of the corresponding rows ($\overleftarrow{r}$ and $\overrightarrow{r}$) in $x_1$ and $x_2$. We consider only one type of join operation (inner
join) for simplicity, but other types (outer, left, or right join) can be similarly
defined. The normalize function is a tainting function so the semantics for the
normalize function introduces dependencies for each row r in the normalized
data frame y with the data sources (m(x)) of each row r′ of the data frame
before normalization. Instead, the semantics of other (non-tainting) functions
maintains the same dependencies (m(x)r) for each row r of the modified data
frame y. Finally, use statements do not modify any dependency so the semantics
of use statements leaves the dependencies map unchanged.
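Read operationally, these equations amount to a small interpreter over dependency maps; a toy sketch (data sources encoded as (file, row) pairs, one source set per row; the statement forms follow the grammar above, everything else is an illustrative assumption):

    # Dependency map m: variable -> list of source sets, one per data frame row.
    def s_read(m, y, name, nrows):
        m[y] = [{(name, r)} for r in range(nrows)]

    def s_select(m, y, x, rows):              # y = x.select[rows][C]
        m[y] = [set(m[x][r_orig]) for r_orig in rows]

    def s_concat(m, y, x1, x2):               # y = concat(x1, x2)
        m[y] = [set(src) for src in m[x1] + m[x2]]

    def s_join(m, y, x1, x2, pairs):          # pairs: matching rows of x1 and x2
        m[y] = [m[x1][r1] | m[x2][r2] for (r1, r2) in pairs]

    def s_normalize(m, y, x):                 # each row depends on all input rows
        all_sources = set().union(*m[x])
        m[y] = [set(all_sources) for _ in m[x]]

    def s_other(m, y, x):                     # row-wise functions keep dependencies
        m[y] = [set(src) for src in m[x]]

    m = {}
    s_read(m, "X", "data.csv", 4)
    s_normalize(m, "Xn", "X")
    s_select(m, "Xtr", "Xn", [2, 3])
    s_select(m, "Xte", "Xn", [0, 1])
    # normalization before the split: every row of Xtr and Xte depends on all rows
    assert m["Xtr"][0] == m["Xte"][0] == {("data.csv", r) for r in range(4)}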
Column Abstraction. The column abstraction over-approximates the set of data file columns a data frame variable may depend on (to detect potential explicit data source overlaps). In addition, it tracks whether data source cells are tainted, i.e., modified by a library function in a way that could introduce data leakage (by implicit indirect data source overlaps). Finally, the bottom element $\bot_C$ is the empty set $\emptyset$ (abstracting an empty data frame).
Row Abstraction. Unlike columns, data frame rows are not named. Moreover, data frames typically have a large number of rows, and ranges of rows are often added to or removed from data frames. Thus, the abstract domain of intervals [4] over the natural numbers is a suitable instance of R. The elements of R belong to the complete lattice $\langle \mathcal{R}, \sqsubseteq_R, \sqcup_R, \sqcap_R, \bot_R, \top_R \rangle$ with the set $\mathcal{R}$ defined as $\mathcal{R} \stackrel{\text{def}}{=} \{[l, u] \mid l \in \mathbb{N}, u \in \mathbb{N} \cup \{\infty\}, l \leq u\} \cup \{\bot_R\}$. The top element $\top_R$ is $[0, \infty]$. Intervals in $\mathcal{R}$ abstract (sets of) row indexes: the concretization function $\gamma_R : \mathcal{R} \to \mathcal{P}(\mathbb{N})$ is such that $\gamma_R(\bot_R) \stackrel{\text{def}}{=} \emptyset$ and $\gamma_R([l, u]) \stackrel{\text{def}}{=} \{r \in \mathbb{N} \mid l \leq r \leq u\}$. The interval domain partial order ($\sqsubseteq_R$) and operators for join ($\sqcup_R$) and meet ($\sqcap_R$) are defined as usual (e.g., see Miné's PhD thesis [15] for reference).
In addition, we associate with each interval $R \in \mathcal{R}$ another interval $\mathrm{idx}(R)$ of indices: $\mathrm{idx}(\bot_R) \stackrel{\text{def}}{=} \bot_R$ and $\mathrm{idx}([l, u]) \stackrel{\text{def}}{=} [0, u - l]$; this essentially establishes a map $\phi_R : \mathbb{N} \to \mathbb{N}$ between elements of $\gamma_R(R)$ (ordered by $\leq$) and elements of $\gamma_R(\mathrm{idx}(R))$ (also ordered by $\leq$). In the following, given an interval $R \in \mathcal{R}$ and an interval of indices $[i, j] \in \mathcal{R}$ (such that $[i, j] \sqsubseteq_R \mathrm{idx}(R)$), we slightly abuse notation and write $\phi_R^{-1}([i, j])$ for the sub-interval of $R$ between the indices $i$ and $j$, i.e., we have that $\gamma_R(\phi_R^{-1}([i, j])) \stackrel{\text{def}}{=} \{r \in \gamma_R(R) \mid \phi_R^{-1}(i) \leq r \leq \phi_R^{-1}(j)\}$. We need this operation to soundly abstract consecutive row selections (cf. Section 4).
Example 6 (Row Abstraction). Let us consider the interval $R = [10, 14] \in \mathcal{R}$ with index interval $\mathrm{idx}(R) = [0, 4]$. We have an isomorphism $\phi_R$ between $\{10, 11, 12, 13, 14\}$ and $\{0, 1, 2, 3, 4\}$. Let us consider now the interval of indices $[1, 3]$. We then have $\phi_R^{-1}([1, 3]) = [11, 13]$ (since $\phi_R^{-1}(1) = 11$ and $\phi_R^{-1}(3) = 13$).
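A minimal sketch of this row abstraction (bottom omitted, $\infty$ encoded as math.inf; idx and $\phi_R^{-1}$ as defined above):

    import math

    class Interval:
        """Row abstraction: interval [l, u] over the naturals, with u possibly
        infinite (the bottom element is omitted in this sketch)."""
        def __init__(self, l, u):
            assert l <= u
            self.l, self.u = l, u

        def idx(self):
            # idx([l, u]) = [0, u - l]
            return Interval(0, self.u - self.l)

        def phi_inv(self, i, j):
            # sub-interval of [l, u] between indices i and j
            assert 0 <= i <= j <= self.u - self.l
            return Interval(self.l + i, self.l + j)

    r = Interval(10, 14)
    assert (r.idx().l, r.idx().u) == (0, 4)   # idx([10, 14]) = [0, 4]
    sub = r.phi_inv(1, 3)
    assert (sub.l, sub.u) == (11, 13)         # as in Example 6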
Data Frame Abstraction. The elements of the data frame abstract domain $\mathsf{L}(\mathsf{C}, \mathsf{R})$ belong to a partial order $\langle \mathcal{L}, \sqsubseteq_L \rangle$ where $\mathcal{L} \stackrel{\text{def}}{=} \mathbb{W} \times \mathcal{C} \times \mathcal{R}$ contains triples of a data file name $\mathit{file} \in \mathbb{W}$, a column over-approximation $C \in \mathcal{C}$, and a row over-approximation $R \in \mathcal{R}$. In the following, we write $\mathit{file}^C_R$ for the abstract data frame $\langle \mathit{file}, C, R \rangle \in \mathcal{L}$. The partial order $\sqsubseteq_L$ compares abstract data frames derived from the same data files: $X^C_R \sqsubseteq_L Y^{C'}_{R'} \stackrel{\text{def}}{\Leftrightarrow} X = Y \wedge C \sqsubseteq_C C' \wedge R \sqsubseteq_R R'$. We also define a predicate for whether abstract data frames overlap:

\[
\mathit{overlap}(X^C_R, Y^{C'}_{R'}) \stackrel{\text{def}}{\Leftrightarrow} X = Y \wedge C \sqcap_C C' \neq \emptyset \wedge R \sqcap_R R' \neq \bot_R \tag{6}
\]

and partial join ($\sqcup_L$) and meet ($\sqcap_L$) over data frames from the same data files:

\[
X^{C_1}_{R_1} \sqcup_L X^{C_2}_{R_2} \stackrel{\text{def}}{=} X^{C_1 \sqcup_C C_2}_{R_1 \sqcup_R R_2} \qquad\qquad X^{C_1}_{R_1} \sqcap_L X^{C_2}_{R_2} \stackrel{\text{def}}{=} X^{C_1 \sqcap_C C_2}_{R_1 \sqcap_R R_2}
\]
In the rest of this section, for brevity, we simply write L instead of L(C, R).
Data Frame Set Abstract Domain. Data frame variables may depend on multiple data sources. We thus lift our abstract domain $\mathsf{L}$ to an abstract domain $\mathsf{S}(\mathsf{L})$ of sets of abstract data frames. The elements of $\mathsf{S}(\mathsf{L})$ belong to a lattice $\langle \mathcal{S}, \sqsubseteq_S, \sqcup_S, \sqcap_S \rangle$ with $\mathcal{S} \stackrel{\text{def}}{=} \mathcal{P}(\mathcal{L})$. Sets of abstract data frames in $\mathcal{S}$ are maintained in a canonical form such that no two abstract data frames in a set overlap (cf. Equation 6). The partial order $\sqsubseteq_S$ between canonical sets relies on the partial order between abstract data frames: $S_1 \sqsubseteq_S S_2 \stackrel{\text{def}}{\Leftrightarrow} \forall L_1 \in S_1\, \exists L_2 \in S_2 : L_1 \sqsubseteq_L L_2$.
The join ($\sqcup_S$) and meet ($\sqcap_S$) operators perform a set union and set intersection, respectively, followed by a reduction to make the result canonical:

\[
S_1 \sqcup_S S_2 \stackrel{\text{def}}{=} \mathrm{reduce}_{\sqcup_L}(S_1 \cup S_2) \qquad\qquad S_1 \sqcap_S S_2 \stackrel{\text{def}}{=} \mathrm{reduce}_{\sqcap_L}(S_1 \cap S_2)
\]

where $\mathrm{reduce}_{\mathit{op}}(S) \stackrel{\text{def}}{=} \{L_1 \mathbin{\mathit{op}} L_2 \mid L_1, L_2 \in S, \mathit{overlap}(L_1, L_2)\} \cup \{L_1 \in S \mid \forall L_2 \in S \setminus \{L_1\} : \neg\mathit{overlap}(L_1, L_2)\}$.

Finally, we lift $\downarrow^C_R$ by element-wise application: $S \downarrow^C_R \stackrel{\text{def}}{=} \{L \downarrow^C_R \mid L \in S\}$.
In the following, for brevity, we omit L and simply write S instead of S(L).
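The canonicalization has a direct executable reading; a sketch, assuming abstract data frames are encoded as (file, column set, row interval) triples and iterating the reduction to a fixpoint (the definition above applies op once; iterating is a pragmatic choice to reach a canonical set):

    from itertools import combinations

    # An abstract data frame: (file name, frozenset of columns, (l, u) rows).
    def overlap(d1, d2):                      # Equation 6
        (f1, c1, r1), (f2, c2, r2) = d1, d2
        return f1 == f2 and bool(c1 & c2) and max(r1[0], r2[0]) <= min(r1[1], r2[1])

    def join(d1, d2):                         # partial join on same-file frames
        (f, c1, r1), (_, c2, r2) = d1, d2
        return (f, c1 | c2, (min(r1[0], r2[0]), max(r1[1], r2[1])))

    def reduce_join(frames):
        """Merge overlapping frames until no two frames in the set overlap."""
        frames = list(frames)
        changed = True
        while changed:
            changed = False
            for d1, d2 in combinations(frames, 2):
                if overlap(d1, d2):
                    frames.remove(d1)
                    frames.remove(d2)
                    frames.append(join(d1, d2))
                    changed = True
                    break
        return set(frames)

    s = reduce_join({("data.csv", frozenset({"x"}), (0, 5)),
                     ("data.csv", frozenset({"x"}), (3, 9))})
    assert s == {("data.csv", frozenset({"x"}), (0, 9))}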
Data Frame Sources Abstract Domain. We can now define the domain $X \to \mathsf{A}(\mathsf{S})$ that we use for our data leakage analysis. Elements in this abstract domain are maps from data frame variables in $X$ to elements of a data frame sources abstract domain $\mathsf{A}(\mathsf{S})$, which over-approximates the (input) data frame variables (indexed at some row) on which a data frame variable depends. In particular, elements in $\mathsf{A}(\mathsf{S})$ belong to a lattice $\langle \mathcal{A}, \sqsubseteq_A, \sqcup_A, \sqcap_A, \bot_A \rangle$ where $\mathcal{A} \stackrel{\text{def}}{=} \mathcal{S} \times \mathcal{B}$ contains pairs $\langle S, B \rangle$ of a data frame set abstraction $S \in \mathcal{S}$ and a boolean taint flag $B \in \mathcal{B} \stackrel{\text{def}}{=} \{\text{untainted}, \text{maybe-tainted}\}$. In the following, given an abstract element $m \in X \to \mathcal{A}$ of $X \to \mathsf{A}(\mathsf{S})$ and a data frame variable $x \in X$, we write $m_s(x) \in \mathcal{S}$ and $m_b(x) \in \mathcal{B}$ for the first and second component of the pair $m(x) \in \mathcal{A}$, respectively.

The abstract domain operators apply component operators pairwise: $\sqsubseteq_A \stackrel{\text{def}}{=} \sqsubseteq_S \times \leq$, $\sqcup_A \stackrel{\text{def}}{=} \sqcup_S \times \vee$, $\sqcap_A \stackrel{\text{def}}{=} \sqcap_S \times \wedge$, where $\leq$ in $\mathcal{B}$ is such that untainted $\leq$ maybe-tainted. The bottom element $\bot_A$ is $\langle \emptyset, \text{untainted} \rangle$.
Finally, we define the concretization $\gamma : (X \to \mathcal{A}) \to (X \to (\mathbb{N} \to \mathcal{P}(X \times \mathbb{N})))$: $\gamma(m) \stackrel{\text{def}}{=} \lambda x \in X : (\lambda r \in \mathbb{N} : \gamma_A(m(x)))$, where $\gamma_A : \mathcal{A} \to \mathcal{P}(X \times \mathbb{N})$ is $\gamma_A(\langle S, B \rangle) \stackrel{\text{def}}{=} \{X[r] \mid X^C_R \in S, r \in \gamma_R(R)\}$ (with $\gamma_R : \mathcal{R} \to \mathcal{P}(\mathbb{N})$ being the concretization function for row abstractions, cf. Section 4.1). Note that $\gamma_A$ uses neither $B \in \mathcal{B}$ nor $C \in \mathcal{C}$. These are devices only needed by our abstract semantics (defined below) to track (and over-approximate the actual concrete) dependencies across program statements.
Our data leakage analysis is given by $\dot{\llparenthesis P \rrparenthesis}^{\natural} \stackrel{\text{def}}{=} a\llbracket S_n \rrbracket \circ \cdots \circ a\llbracket S_1 \rrbracket\, \dot{\bot}_A$, where $\dot{\bot}_A$ maps all data frame variables to $\bot_A$ and the abstract semantic function $a\llbracket S \rrbracket$ for each statement in $P$ is defined as follows:

\[
\begin{array}{l}
a\llbracket y = \mathrm{read}(\mathit{name}) \rrbracket m \stackrel{\text{def}}{=} m[y \mapsto \langle \{\mathit{name}^{\top_C}_{[0,\infty]}\}, \text{untainted} \rangle] \\[0.5em]
a\llbracket y = x.\mathrm{select}[\bar{r}][C] \rrbracket m \stackrel{\text{def}}{=} m\!\left[y \mapsto
\begin{cases}
\langle m_s(x) \downarrow^C_{[\min(\bar{r}), \max(\bar{r})]}, m_b(x) \rangle & \neg m_b(x) \\
\langle m_s(x), m_b(x) \rangle & \text{otherwise}
\end{cases}\right] \\[1.5em]
a\llbracket y = \mathrm{op}(x_1, x_2) \rrbracket m \stackrel{\text{def}}{=} m[y \mapsto \langle m_s(x_1) \sqcup_S m_s(x_2), m_b(x_1) \vee m_b(x_2) \rangle] \\[0.5em]
a\llbracket y = \mathrm{normalize}(x) \rrbracket m \stackrel{\text{def}}{=} m[y \mapsto \langle m_s(x), \text{maybe-tainted} \rangle] \\[0.5em]
a\llbracket y = \mathrm{other}(x) \rrbracket m \stackrel{\text{def}}{=} m[y \mapsto \langle m_s(x), m_b(x) \rangle] \\[0.5em]
a\llbracket \mathrm{use}(x) \rrbracket m \stackrel{\text{def}}{=} m
\end{array}
\]
The abstract semantics of the source statement simply maps a read data frame variable $y$ to the untainted abstract data frame set containing the abstraction of the read data file ($\mathit{name}^{\top_C}_{[0,\infty]}$). The abstract semantics of the select statement maps the resulting data frame variable $y$ to the abstract data frame set $m_s(x)$ associated with the original data frame variable $x$; in order to soundly propagate (abstract) dependencies, $m_s(x)$ is constrained by $\downarrow^C_{[\min(\bar{r}), \max(\bar{r})]}$ (cf. Section 4.1) only if $m_s(x)$ is untainted. The abstract semantics of merge statements merges the abstract data frame sets $m_s(x_1)$ and $m_s(x_2)$ and the taint flags $m_b(x_1)$ and $m_b(x_2)$ associated with the given data frame variables $x_1$ and $x_2$.
Note that this semantics is a sound but rather imprecise abstraction, in particular for the join operation. More precise abstractions can easily be defined, at the cost of also abstracting data frame contents. The abstract semantics of function statements maps the resulting data frame variable $y$ to the abstract data frame set $m_s(x)$ associated with the original data frame variable $x$; the normalize function sets the taint flag to maybe-tainted, while other functions leave the taint flag $m_b(x)$ unchanged. Note that, unlike the analysis sketched by Subotić et al. [21], we do not perform any renaming or resetting of the data source mapping (cf. Section 4.2 in [21]) but keep tracking dependencies with respect to the input data frame variables. Finally, the abstract semantics of use statements leaves the abstract dependencies map unchanged.
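The transfer functions above have a direct executable reading; a toy sketch (same triple encoding of abstract data frames as in the previous sketch, taint as a boolean with True for maybe-tainted; the $\phi$-based reindexing and the canonicalizing reduction are elided, and "ALL_COLUMNS" stands in for $\top_C$):

    INF = float("inf")

    def a_read(m, y, name):
        # read: untainted, all rows and columns of the data file
        m[y] = ({(name, "ALL_COLUMNS", (0, INF))}, False)

    def a_select(m, y, x, rows):
        # select: restrict row intervals only if the source is untainted
        s, b = m[x]
        if not b:
            lo, hi = min(rows), max(rows)
            s = {(f, c, (max(l, lo), min(u, hi))) for (f, c, (l, u)) in s}
        m[y] = (s, b)

    def a_op(m, y, x1, x2):
        # concat / join: union of sources, disjunction of taint flags
        s1, b1 = m[x1]
        s2, b2 = m[x2]
        m[y] = (s1 | s2, b1 or b2)  # canonicalizing reduction elided

    def a_normalize(m, y, x):
        # normalize: same sources, taint flag raised to maybe-tainted
        m[y] = (m[x][0], True)

    def a_other(m, y, x):
        # other row-wise functions: sources and taint unchanged
        m[y] = m[x]

    m = {}
    a_read(m, "X", "data.csv")
    a_normalize(m, "Xn", "X")
    a_select(m, "Xtr", "Xn", range(0, 2))  # tainted: rows are not restricted
    a_select(m, "Xte", "Xn", range(2, 4))
    assert m["Xtr"] == m["Xte"] and m["Xtr"][1]  # both tainted over the whole file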
The abstract data leakage semantics $\dot{\llparenthesis P \rrparenthesis}^{\natural}$ is sound:

Theorem 3. $P \models \mathcal{I} \Leftarrow \gamma(\dot{\llparenthesis P \rrparenthesis}^{\natural}) \mathrel{\dot{\supseteq}} \dot{\alpha}(\alpha_{\leadsto^{+}}(\mathcal{I}))$
Similarly, we have the sound but not complete counterpart of Lemma 1 for
practically checking absence of data leakage:
Lemma 2. $P \models \mathcal{I} \Leftarrow \forall o_1 \in U^{\mathrm{train}}_P, o_2 \in U^{\mathrm{test}}_P, L_1 \in m_s(o_1), L_2 \in m_s(o_2) : \neg\mathit{overlap}(L_1, L_2) \wedge \neg(m_b(o_1) \wedge m_b(o_2))$

where $m$ is the final abstract dependencies map computed by $\dot{\llparenthesis P \rrparenthesis}^{\natural}$: the data frame variables used for training and testing must depend on non-overlapping abstract data frames and must not both be maybe-tainted.
For instance, for a program that normalizes the data frame read from data.csv before splitting it into train and test segments, the analysis proceeds as follows (where $m_2$ maps $X$ to the untainted abstract data frame $\mathit{data.csv}^{\{\text{“X 1”}, \text{“X 2”}, \text{“y”}\}}_{[0,\infty]}$ resulting from the read statement):

\[
\begin{array}{l}
a\llbracket X\_norm = \mathrm{normalize}(X) \rrbracket m_2 = m_3 \stackrel{\text{def}}{=} m_2[X\_norm \mapsto \langle \mathit{data.csv}^{\{\text{“X 1”}, \text{“X 2”}, \text{“y”}\}}_{[0,\infty]}, \text{maybe-tainted} \rangle] \\[0.8em]
a\llbracket X\_train = X\_norm.\mathrm{select}[[\lfloor 0.025 * R_{X\_norm} \rfloor + 1, \ldots, R_{X\_norm}]][] \rrbracket m_3 = m_4 \stackrel{\text{def}}{=} \\
\qquad m_3[X\_train \mapsto \langle \mathit{data.csv}^{\{\text{“X 1”}, \text{“X 2”}, \text{“y”}\}}_{[\lfloor 0.025 * R_{X\_norm} \rfloor + 1,\, R_{X\_norm}]}, \text{maybe-tainted} \rangle] \\[0.8em]
a\llbracket X\_test = X\_norm.\mathrm{select}[[0, \ldots, \lfloor 0.025 * R_{X\_norm} \rfloor]][] \rrbracket m_4 = m_5 \stackrel{\text{def}}{=} \\
\qquad m_4[X\_test \mapsto \langle \mathit{data.csv}^{\{\text{“X 1”}, \text{“X 2”}, \text{“y”}\}}_{[0,\, \lfloor 0.025 * R_{X\_norm} \rfloor]}, \text{maybe-tainted} \rangle] \\[0.8em]
a\llbracket \mathrm{test}(X\_test) \rrbracket (a\llbracket \mathrm{train}(X\_train) \rrbracket m_5) = m_5
\end{array}
\]
At the end of the analysis, $X\_train \in U^{\mathrm{train}}$ and $X\_test \in U^{\mathrm{test}}$ depend on disjoint but tainted abstract data frames derived from the same input file data.csv. Thus, the absence of data leakage check from Lemma 2 (rightfully) fails.
5 Experimental Evaluation
We implemented our static analysis into the open source NBLyzer [21] frame-
work for data science notebooks. NBLyzer performs the analysis starting from an
individual code cell (intra-cell analysis) and, based on the resulting abstract
state, it proceeds to analyze valid successor code cells (inter-cell analysis).
Whether a code cell is a valid successor or not, is specified by an analysis-
dependent cell propagation condition. We refer to the original NBLyzer pa-
per [21] and to the extended version of this paper [8] for further details.
We evaluated³ our analysis against the data leakage analysis previously implemented in NBLyzer [21], using the same benchmarks. These are notebooks written to succeed in non-trivial real data science tasks and can be assumed to closely represent code written by non-novice data scientists.

We summarize the results in Table 1. For each reported alarm, we engaged 4 data scientists at Microsoft to determine true positives (TP) and false positives (FP). We further classified the true positives as due to a normalization taint (Taint) or overlapping data frames (Overlap).

Table 1: Alarms raised by the previous [21] vs our analysis.

                  TP
  Analysis   Taint   Overlap   FP
  [21]          10         0    2
  Ours          10        15    2

³ Experiments were done on a Ryzen 9 6900HS with 24GB DDR5 running Ubuntu 22.04.

Our analysis found 10 Taint data leakages in 5 notebooks,
and 15 Overlap data leakages in 11 notebooks, i.e., a 1.2% bug rate, which is in line with true positive bug rates reported for null pointer analyses in industrial settings (e.g., see [12]). The previous analysis only found the 10 Taint leakages in 5 notebooks. It could not detect Overlap data leakages because it cannot reason at the granularity of partial data frames. The cost of our more precise analysis is a mere 7% slowdown. Both analyses reported 2 false positives, due to different objects having the same function name (e.g., LabelEncoder and StandardScaler both having the function fit_transform). This issue can be solved by introducing object-sensitive type tracking in our analysis. We leave it for future work.
The full experimental evaluation is described in [8, Appendix E].
6 Related Work
Static Analysis for Data Science. Static analysis for data science is an
emerging area in the program analysis community [22]. Some notable static
analyses for data science scripts include an analysis for ensuring correct shape di-
mensions in TensorFlow programs [13], an analysis for constraining inputs based
on program constraints [23], and a provenance analysis [16]. In addition, static
analyses have been proposed for data science notebooks [21,14]. NBLyzer [21] contains an ad-hoc data leakage analysis that detects Taint data leakages. An extension to handle Overlap data leakages was sketched in [20] (but no analysis precision results were reported). However, neither of these analyses is formalized nor formally proven sound⁴. In contrast, we introduce a sound data leakage analysis that has a rigorous semantic underpinning.

⁴ We found a number of soundness issues in [20] when working on our formalization.
7 Conclusion
We presented a static analysis for detecting data leakage in data frame-manipulating programs, based on a formalization of data leakage as an independence property of the trace semantics and on a sound dependency and taint abstraction. Our implementation in the NBLyzer framework found real data leakages in notebooks written by data scientists, with few false positives.
References
1. Chouldechova, A., Prado, D.B., Fialko, O., Vaithianathan, R.: A Case Study of
Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening De-
cisions. In: FAT (2018)
2. Cousot, P.: Constructive Design of a Hierarchy of Semantics of a Transition System by Abstract Interpretation. Theoretical Computer Science 277(1-2), 47–103 (2002)
3. Cousot, P.: Abstract Semantic Dependency. In: SAS. pp. 389–410 (2019)
4. Cousot, P., Cousot, R.: Static Determination of Dynamic Properties of Programs.
In: Second International Symposium on Programming. pp. 106–130 (1976)
5. Cousot, P., Cousot, R.: Abstract Interpretation: A Unified Lattice Model for Static
Analysis of Programs by Construction or Approximation of Fixpoints. In: POPL.
pp. 238–252 (1977)
6. Cousot, P., Cousot, R.: Systematic Design of Program Analysis Frameworks. In:
POPL. pp. 269–282 (1979)
7. Cousot, P., Cousot, R.: Higher Order Abstract Interpretation (and Application to Comportment Analysis Generalizing Strictness, Termination, Projection, and PER Analysis). In: ICCL. pp. 95–112 (1994)
8. Drobnjaković, F., Subotić, P., Urban, C.: An Abstract Interpretation-Based Data
Leakage Static Analysis. CoRR abs/2211.16073 (2022), https://arxiv.org/
abs/2211.16073
9. Guzharina, A.: We Downloaded 10M Jupyter Notebooks from GitHub - This Is What We Learned. https://blog.jetbrains.com/datalore/2020/12/17/we-downloaded-10-000-000-jupyter-notebooks-from-github-this-is-what-we-learned/ (2020), accessed: 22-01-22
10. Kapoor, S., Narayanan, A.: Leakage and the reproducibility crisis in machine-
learning-based science. Patterns 4(9), 100804 (2023)
11. Kaufman, S., Rosset, S., Perlich, C., Stitelman, O.: Leakage in Data Mining: For-
mulation, Detection, and Avoidance. ACM Transactions on Knowledge Discovery
from Data 6(4) (2012)
12. Kharkar, A., Moghaddam, R.Z., Jin, M., Liu, X., Shi, X., Clement, C., Sundaresan,
N.: Learning to Reduce False Positives in Analytic Bug Detectors. In: ICSE. p.
1307–1316 (2022)
13. Lagouvardos, S., Dolby, J., Grech, N., Antoniadis, A., Smaragdakis, Y.: Static
Analysis of Shape in TensorFlow Programs. In: ECOOP. pp. 15:1–15:29 (2020)
14. Macke, S., Gong, H., Lee, D.J.L., Head, A., Xin, D., Parameswaran, A.G.: Fine-
Grained Lineage for Safer Notebook Interactions. CoRR abs/2012.06981 (2020),
https://arxiv.org/abs/2012.06981
15. Miné, A.: Weakly Relational Numerical Abstract Domains. Ph.D. thesis, École
Polytechnique, Palaiseau, France (2004), https://tel.archives-ouvertes.fr/
tel-00136630
16. Namaki, M.H., Floratou, A., Psallidas, F., Krishnan, S., Agrawal, A., Wu, Y., Zhu,
Y., Weimer, M.: Vamsa: Automated Provenance Tracking in Data Science Scripts.
In: KDD. pp. 1542–1551 (2020)
17. Nisbet, R., Miner, G., Yale, K.: Handbook of Statistical Analysis and Data Mining Applications. Academic Press, Boston, 2nd edn. (2018). https://doi.org/10.1016/C2012-0-06451-4
18. Papadimitriou, P., Garcia-Molina, H.: A Model for Data Leakage Detection. In:
ICDE. pp. 1307–1310 (2009)
19. Perkel, J.: Why Jupyter is Data Scientists' Computational Notebook of Choice. Nature 563, 145–146 (2018)
20. Subotić, P., Bojanić, U., Stojić, M.: Statically Detecting Data Leakages in Data
Science Code. In: SOAP. pp. 16–22 (2022)
21. Subotić, P., Milikić, L., Stojić, M.: A Static Analysis Framework for Data Science Notebooks. In: ICSE. pp. 13–22 (2022)
22. Urban, C.: Static Analysis of Data Science Software. In: SAS. pp. 17–23 (2019)
23. Urban, C., Müller, P.: An Abstract Interpretation Framework for Input Data Us-
age. In: ESOP. pp. 683–710 (2018)
24. Wong, A., Otles, E., Donnelly, J.P., Krumm, A., McCullough, J., DeTroyer-Cooley,
O., Pestrue, J., Phillips, M., Konye, J., Penoza, C., Ghous, M.H., Singh, K.: Ex-
ternal Validation of a Widely Implemented Proprietary Sepsis Prediction Model
in Hospitalized Patients. JAMA Internal Medicine (2021)