0% found this document useful (0 votes)
2 views

Continuous_Management_of_Machine_Learning-Based_Application_Behavior

The paper presents a multi-model approach for the continuous management of non-functional behaviors in machine learning (ML)-based applications, addressing the need for stability amidst model changes. It proposes a two-step process involving model assessment and substitution to ensure non-functional properties like fairness and privacy are maintained over time. The approach is designed to be algorithm-agnostic and applicable across various ML models, filling gaps in existing solutions that often prioritize accuracy over stable application behavior.

Uploaded by

eftm2024
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views

Continuous_Management_of_Machine_Learning-Based_Application_Behavior

The paper presents a multi-model approach for the continuous management of non-functional behaviors in machine learning (ML)-based applications, addressing the need for stability amidst model changes. It proposes a two-step process involving model assessment and substitution to ensure non-functional properties like fairness and privacy are maintained over time. The approach is designed to be algorithm-agnostic and applicable across various ML models, filling gaps in existing solutions that often prioritize accuracy over stable application behavior.

Uploaded by

eftm2024
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

112 IEEE TRANSACTIONS ON SERVICES COMPUTING, VOL. 18, NO.

1, JANUARY/FEBRUARY 2025

Continuous Management of Machine Learning-Based


Application Behavior
Marco Anisetti , Senior Member, IEEE, Claudio A. Ardagna , Senior Member, IEEE,
Nicola Bena , Member, IEEE, Ernesto Damiani , Senior Member, IEEE, and Paolo G. Panero

Abstract—Modern applications are increasingly driven by Ma- properties. We experimentally evaluate our solution in a real-world
chine Learning (ML) models whose non-deterministic behavior is scenario focusing on non-functional property fairness.
affecting the entire application life cycle from design to operation.
The pervasive adoption of ML is urgently calling for approaches Index Terms—Assurance, machine learning, multi-armed band-
that guarantee a stable non-functional behavior of ML-based ap- it, non-functional properties.
plications over time and across model changes. To this aim, non-
functional properties of ML models, such as privacy, confidential-
ity, fairness, and explainability, must be monitored, verified, and I. INTRODUCTION
maintained. Existing approaches mostly focus on i) implementing
ACHINE Learning (ML) has become the technique of
solutions for classifier selection according to the functional behav-
ior of ML models, ii) finding new algorithmic solutions, such as
continuous re-training. In this paper, we propose a multi-model
M choice to provide advanced functionalities and carry out
tasks hardly achievable by traditional control and optimization
approach that aims to guarantee a stable non-functional behavior algorithms [1]. Even the behavior, orchestration, and deploy-
of ML-based applications. An architectural and methodological
approach is provided to compare multiple ML models showing ment parameters of distributed systems and services, possibly
similar non-functional properties and select the model supporting offered on the cloud-edge continuum, are increasingly based
stable non-functional behavior over time according to (dynamic on ML models [2]. Concerns about the black-box nature of ML
and unpredictable) contextual changes. Our approach goes beyond have led to a societal push that involves all components of society
the state of the art by providing a solution that continuously (policymakers, regulators, academic and industrial stakeholders,
guarantees a stable non-functional behavior of ML-based appli-
cations, is ML algorithm-agnostic, and is driven by non-functional citizens) towards trustworthy and transparent ML, giving rise to
properties assessed on the ML models themselves. It consists of legislative initiatives on artificial intelligence (e.g., the AI Act
a two-step process working during application operation, where in Europe [3]).
model assessment verifies non-functional properties of ML models This scenario introduces the need for solutions that continu-
trained and selected at development time, and model substitu- ously guarantee a stable non-functional behavior of ML-based
tion guarantees continuous and stable support of non-functional
applications, a task that is significantly more complex than mere
QoS-based selection and composition (e.g., [4], [5], [6]). The
focus of such a task is to assess the non-functional properties
Received 17 November 2023; revised 15 July 2024; accepted 7 October 2024. of ML models, such as privacy, confidentiality, fairness, and
Date of publication 28 October 2024; date of current version 6 February 2025.
Research supported, in parts, by i) project BA-PHERD, funded by the European explainability, over time and across changes. The non-functional
Union – NextGenerationEU, under the National Recovery and Resilience Plan assessment of ML-based applications behavior has to cope with
(NRRP) Mission 4 Component 2 Investment Line 1.1: “Fondo Bando PRIN the ML models’ complexity, low transparency, and continuous
2022” (CUP G53D23002910006); ii) MUSA – Multilayered Urban Sustain-
ability Action – project, funded by the European Union - NextGenerationEU, evolution [7], [8]. ML models in fact are affected by model
under the National Recovery and Resilience Plan (NRRP) Mission 4 Component and data drifts, quality degradation, and accuracy loss, which
2 Investment Line 1.5: Strengthening of research structures and creation of may substantially impact on the quality and soundness of the
R&D “innovation ecosystems”, set up of “territorial leaders in R&D” (CUP
G43C22001370007, Code ECS00000037); iii) project SERICS (PE00000014) application itself.
under the NRRP MUR program funded by the EU – NextGenerationEU; iv) Recent research points to solutions where ML models evolve
projects 1H-HUB and SOV-EDGE-HUB funded by Università degli Studi di according to contextual changes (e.g., a shift in the incom-
Milano – PSR 2021/2022 – GSA – Linea 6; and v) program “Piano di Sostegno
alla Ricerca” funded by Università degli Studi di Milano. Views and opinions ing data distribution), typically via continuous re-training and
expressed are however those of the authors only and do not necessarily reflect peculiar training algorithms and ML models [9], [10], [11].
those of the European Union or the Italian MUR. Neither the European Union Other solutions consider classifier selection where a (set of)
nor the Italian MUR can be held responsible for them. (Corresponding author:
Claudio A. Ardagna.) ML model is statically or dynamically selected according to
Marco Anisetti, Claudio A. Ardagna, Nicola Bena, and Paolo G. Panero are some criteria [12], [13], [14], [15]; in this context, dynamic
with the Department of Computer Science, Università degli Studi di Milano, selection identifies the most suitable ML model for each data
20133 Milano, Italy (e-mail: [email protected]; claudio.ardagna@unimi.
it; [email protected]; [email protected]). point at inference time. Ensembles have been also considered to
Ernesto Damiani is with the Department of Computer Science, Università increase ML robustness [16], [17], [18], [19], [20]. Finally, some
degli Studi di Milano, 20133 Milano, Italy, and also with C2PS, Computer solutions have initially discussed certification-based assessment
Science Department, Khalifa University, Abu Dhabi P.O. Box: 127788, UAE
(e-mail: [email protected]). of ML-based applications [7], [8], [21]. Current approaches
Digital Object Identifier 10.1109/TSC.2024.3486226 however fall short in supporting the requirements of modern
© 2024 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see
https://creativecommons.org/licenses/by/4.0/
ANISETTI et al.: CONTINUOUS MANAGEMENT OF MACHINE LEARNING-BASED APPLICATION BEHAVIOR 113

ML-based applications. On the one hand, they disregard stable


application behavior and non-functional properties, which are
increasingly mandated by law, in favor of accuracy maximiza-
tion. On the other hand, they do not provide a general solution
that applies to any non-functional properties and ML algorithms,
and rather focus on specific, though relevant, properties (e.g.,
fairness) and algorithms (e.g., decision trees).
This paper fills in the above gaps by defining a multi-model
approach that guarantees a stable non-functional behavior of
ML-based applications. Similarly to dynamic classifier selec-
tion, our approach keeps a pool of ML models and one ML
model at time is dynamically selected during inference ac-
cording to a (set of) non-functional property; the selected ML
model is replaced only when its non-functional property de-
grades. Our approach is particularly suited for constrained and
critical scenarios with (dynamic and unpredictable) contextual Fig. 1. Overview of our approach.
changes. In such scenarios, online re-training and dynamic
classifier/ensemble selection approaches i) have a larger over- definition of rigorous assurance-based processes for ML-based
head due to the expensive training and the need to select a applications is still more an art than a science [7], [8]. We finally
model for each data point, and ii) can lead to unexpected extensively evaluate our solution focusing on non-functional
application behavior due to the arrival of new, unpredictable property fairness.
input data. The remainder of this paper is organized as follows. Section II
Our multi-model approach is built on a two-step process work- presents our reference scenario and our approach at a glance.
ing during application operation as follows. The first step, model Section III describes our building blocks, including Static MAB
assessment, verifies non-functional properties of ML models that is later extended in Section IV towards Dynamic MAB for
already trained and selected at development time. To this aim, we non-functional ML model assessment. Section V presents the
extend our previous work on Multi-Armed Bandit (MAB) [22], two strategies for model substitution. Section VI describes our
towards a dynamic MAB that assesses non-functional properties approach in an end-to-end walkthrough. Section VII presents an
of ML models at run time. The second step, model substitution, extensive experimental evaluation in a real scenario. Section VIII
is driven by the properties assessed at step i), and guarantees comparatively discusses the related work and Section IX draws
a stable support for non-functional properties over time and our conclusions.
across changes. Our approach can be used both as a complete
solution for application behavior management (from design to II. OUR APPROACH AT A GLANCE
operation), or to complement existing ML-based applications We consider a scenario where a service provider is willing
with a multi-model substitution approach. to deploy an application (service workflow) whose behavior
Our contribution is threefold. We first propose a new defini- depends on an ML model. The service provider needs to maintain
tion of non-functional property of ML models. Our definition stable performance across time in terms of quality (e.g., accuracy
departs from traditional, attribute-based properties available in of the model) and non-functional posture (e.g., fairness). Let us
the literature (e.g., [23], [24]), and includes a scoring function assume a scenario where the model behavior changes, such as
at the basis of ML models comparison and selection. We extend model drift (e.g., due to online partial or full re-training) or data
the scope of traditional properties, which are mostly based on drift (e.g., service re-deployment or migration in the cloud-edge
accuracy [12] or metrics unrelated to the model itself (e.g., continuum), which are induced by modifications in the appli-
the battery level of the device or latency [25]), to include cation operational conditions. To cope with this scenario, the
non-functional properties such as fairness and integrity, often service provider adopts a multi-model approach by designing
mandated by law. Though important, these properties are often and deploying multiple models that can be alternatively used
neglected in literature [26]. We then describe our multi-model depending on the context. This multi-model deployment can
approach for managing the non-functional behavior of ML- impact single or multiple nodes in cloud or cloud-edge scenarios.
based applications. Our approach defines a dynamic MAB for We note that the model behavior is evaluated at design time and
the assessment of the non-functional properties of ML models, continuously monitored at run time to decide which model to
and proposes two model substitution strategies built on it. The use during application operation. We also note that the service
two strategies support the dynamic choice of the model with provider can decide to substitute or not the model in operation
the best set of non-functional properties at run time, by ranking due to restrictions in the application environment, but is always
and substituting the models in a dynamically sized evaluation capable of comparing the behavior of the model in use with the
window, and performing additional early substitutions upon other alternative models and use this evidence to fine-tune them
severe non-functional degradation using an assurance-based offline.
evaluation. We note that, although a plethora of assurance tech- Fig. 1 shows an overview of the above scenario and how
niques exist for the verification of non-functional properties in we apply our multi-model approach to address the continuous
traditional service-based applications [23], [27], [28], [29], the management of ML-based application non-functional behavior.
114 IEEE TRANSACTIONS ON SERVICES COMPUTING, VOL. 18, NO. 1, JANUARY/FEBRUARY 2025

Our approach starts at development time with a set of pre- TABLE I


TERMINOLOGY
trained, candidate ML models and statically selects the model
with the best (set of) non-functional property to be used by
the application. At run time, two processes, namely, model
assessment and model substitution, continuously monitor the
non-functional property(ies) of all models and apply model sub-
stitution when necessary to maintain stable application behavior.
The two processes work in an evaluation window.
Let cl denote the set of candidate models {m0 , . . . , mk }
and m̂ the model currently in use. Process model assessment
(Section IV) evaluates models in cl according to the given non-
functional property p. It implements a Dynamic Multi-Armed
Bandit (Dynamic MAB) approach, which extends our previous
work built on traditional MAB [22] to continuously evaluate the
models.
Process model substitution (Section V) takes as input the
results of process model assessment and selects the best model m̂
to be used within the application according to two strategies. The
first strategy compares models in cl using the Dynamic MAB
in the entire evaluation window, producing a model ranking.
The best model in the ranking is then selected as the new m̂ to
be used by the application in the following evaluation window. Other approaches target non-AI systems (e.g., [31]), or do not
The second strategy extends the first one by implementing generalize over the ML algorithms or non-functional properties
early substitutions of m̂ according to metric assurance level al, (e.g., [10], [11], [32], [33], [34]). To the best of our knowledge,
measuring the model degradation. Early substitutions anticipate our multi-model approach is the first solution that guarantees
the replacement of m̂, addressing transient changes before the stable application non-functional behavior over time and is
end of the evaluation window. generic with respect to the ML algorithm and property. A
Example 1 (Reference Scenario). Our reference scenario con- detailed comparison of the approach in this paper with solutions
siders an ML-based application that supports authorities (i.e., in literature is provided in Section VIII.
courts) in estimating the bail of an individual in prison. The
application trains 5 models cl={m1 , ..., m5 } in the cloud on
III. BUILDING BLOCKS
the same dataset, containing data on past bails at national level.
Each court is then provided with a model. Let us assume that Our multi-model approach is based on three main building
the selected model is m3 (i.e., m̂=m3 ). Due to the nature of the blocks: i) execution traces (Section III-A), ii) non-functional
task, the non-functional property of interest is fairness, in terms properties (Section III-B), and iii) Multi-Armed Bandit (Sec-
of variance over some protected attributes [22]. tion III-C).
Let us assume that, at run time, m̂ shows significant biases Table I shows the terminology used in this paper.
in the presence of underrepresented/disadvantaged groups, thus
affecting predicted bails. The overall fairness of m̂ must be A. Execution Traces
evaluated and compared to the other candidate models, and a Execution traces capture the behavior of a given ML model
model substitution triggered when needed to maintain stable at run time. They can be defined as follows.
non-functional behavior. Definition 1 (Execution Trace). An execution trace et is a tu-
Our reference scenario exemplifies the four main challenges ple of the form dp, pred where i) dp is the data point (i.e., a set of
of modern ML applications: i) the definition of advanced non- features) given as input to a model, ii) pred is the predicted result.
functional properties that are typical of ML such as fairness We note that dp can also contain the raw samples given as input
and privacy; ii) the assessment and comparison of models in to a deep learning model. Execution traces can be captured, for
terms of a given non-functional property; iii) the detection of an instance, by intercepting calls to the ML-based application or
application’s non-functional property degradation at run time; through monitoring [35].
iv) the automatic substitution of models to keep the application Example 2 (Execution Trace). Following Example 1, let
behavior stable with respect to their non-functional properties. us consider an execution trace et=[age = 27, gender=male,
Existing solutions in literature cannot tackle these chal- race=latino, past-offence= 0, ...], $10K, retrieved by moni-
lenges in their entirety. For instance, QoS-aware service selec- toring model m̂, where [age= 27, ...,] is the data point sent to
tion approaches (e.g., [4], [5], [30]) maximize specific (non- m̂ and $10K is the predicted bail.
)functional metrics to build an optimum composition or re-
trieve the most suitable models. Similarly, classifier selection
approaches (e.g., [12], [13], [14], [15]) maximize quality metrics B. Non-Functional Properties
such as accuracy, continuously swapping the models and poten- Traditional non-functional properties are defined as an ab-
tially introducing fluctuations in the non-functional behavior. stract property (i.e., the property name) refined by a set of
ANISETTI et al.: CONTINUOUS MANAGEMENT OF MACHINE LEARNING-BASED APPLICATION BEHAVIOR 115

Fig. 2. Partial view of the ML property taxonomy [22].

attributes [23]. Common properties include performance, con- set of execution traces. MAB repeatedly executes an experiment,
fidentiality, integrity, availability [36]. When an ML model is whose goal is to get the highest reward that can be earned by
considered, the notion of property is redesigned [7] as follows. executing a specific action chosen among a set of alternatives.
Definition 2 (Non-Functional Property). A non-functional Every action returns a reward or a penalty with different (and un-
property p is a pair p= (p̂, S), where p̂ is an abstract property known) probabilities. The experiment is commonly associated
taken from a shared controlled vocabulary [22] and S is a score with the problem of a gambler facing different slot machines
function of the form S:{et}→R quantitatively describing how (or a single slot machine with many arms retrieving different
much an ML model supports p̂ according to its execution traces. results). In our scenario, the actions are the models mi in the
In the following, we use the dotted notation to refer to the candidate list cl and the reward is based on the score function
components of p (e.g., p.S). p.S in Definition 2.
Example 3 (Non-Functional Property). Following Exam- Definition 3 (MAB). Let cl be the set of candidate models
ple 2, property fairness can be defined as pfairness =(fairness, {m1 , ..., mk }, each associated with an unknown reward vm
variance-over-gender-race), where the score function i) gener- for non-functional property p. The goal of the MAB is to
ates a number of synthetic data points dp covering all the possible select the model m∗ providing the highest reward in a set
combinations of protected attributes gender and race; ii) sends of experiments (i.e., a set of execution traces). A probability
each dp to the model; and iii) measures the variance σ 2 over the distribution fm (y | θ) drives experiments’ rewards, with y the
predicted bails. observed reward and θ a collection of unknown parameters that
We note that the higher the variance, the lower the support for must be learned through experimentation. MAB is based on
property fairness. Bayesian inference considering that, in each experiment, the
Non-functional properties of ML can be peculiar properties success/failure odd of each model is unknown and can be shaped
purposefully defined for ML evaluation (e.g., adversarial robust- with the probability distribution Beta distribution. Let m be a
ness [21]) or a new interpretation of traditional ones (e.g., the model, its Beta distribution Betam is based on two parameters
integrity of the predictions) [7]. Fig. 2 shows a portion of our α, β∈[0, 1] (denoted as αm and βm , resp.) and its probability
taxonomy of non-functional properties, which has been fully density function can be represented as
presented in our previous work [22]. The taxonomy includes
generic properties, which are then refined by detailed properties. x(αm −1) (1 − x)βm −1
Betam (x; αm , βm ) = (1)
For example, transparency is a generic property with two sub- B (αm , βm )
properties: i) explainability, the capability to explain the model,
on the one hand, and individual decisions taken by the model, on where the normalization function B is the Euler beta function
the other hand; and ii) interpretability, the capability to predict  1
the consequences on a model when changes are observed. B (αm , βm ) = xαm −1 (1 − x)βm −1 dx (2)
0
As another example, fairness is a generic property with mul-
tiple sub-properties. For each detailed property, different score Thompson sampling [38] pulls models in cl, as a new trace
functions can be defined. For instance, Fig. 2 shows two score et is received from the application, by sampling the models’
functions for property individual fairness: variance σ 2 (used in Beta distribution. The model with the highest sampled reward
this paper) and Shapley [37]. Score functions in the taxonomy (denoted as m∗ ) is then evaluated according to p.S and et.
are general, though we note that they need to be refined and A comparison of the score function output against a threshold
instantiated in the context of an evaluation process for a specific determines the success or failure of this evaluation. Betam∗ is
ML-based application. then updated accordingly, such that m∗ is pulled more frequently
in case of successful evaluation (αm∗ increased by 1), less
C. The MAB frequently (βm∗ increased by 1), otherwise.
We use the Multi-Armed Bandit (MAB) technique [22] to Let yt denote the set of observations recorded up to the t-th
compare models according to a non-functional property p on a execution trace ett . The optimal model m∗ is selected according
116 IEEE TRANSACTIONS ON SERVICES COMPUTING, VOL. 18, NO. 1, JANUARY/FEBRUARY 2025

to probability winnerm,t : is the difference between i) the reward vm∗ (θ0 ) of the optimal
model m∗ retrieved at the end of window w and ii) the reward
winnerm,t = P (m∗ | yt ) vm∗,t (θ0 ) of the optimal model m∗,t retrieved at execution
  

trace ett .
= l m = arg max vm (θ) p (θ | yt ) dt (3) Considering that the regret is not directly observable, it can
m∈cl
be computed using the posterior probability distribution. Let
where l is the indicator function and p(θ|yt ) is the Bayesian us consider v∗ (θ(g) )=maxm∈cl vm (θ(g) ) where θ(g) is drawn
posterior probability distribution of θ given the observations from p(θ|yt ). The “regret” r in g is r(g) =v∗ (θ(g) )−vm∗,t (θ0 ),
up to the t-th execution trace. The MAB terminates when all which derives from the regret posterior probability distribution.
experiments end, that is, all traces have been received. We note that v∗ (θ(g) ) is the maximum available value within
The optimal model m∗ is used by the application (i.e., each Monte Carlo draw set g and vm∗,t (θ(g) ) is the value (alike
m̂=m∗ ) [22]. We note that, while effective at application startup, taken in g) for the best arm within each Monte Carlo simulation.
the MAB cannot be continuously applied at run time as new Regret is expressed as the percentage of the deviation from the
traces come. For this reason, the MAB in this section (Static model identified as the winner, so that draws from the posterior
MAB in the following) is only used for static model selection probability are given as follows.
at development time. We then define in Section IV a Dynamic
   
MAB as the extension of the Static MAB for run-time model (g) v∗ θ(g) − vm∗,t θ(g)
selection and substitution. r =   (4)
vm∗,t θ(g)

IV. MODEL ASSESSMENT: DYNAMIC MAB The experiment completes when 95% of the samples of a
Process model assessment compares ML models at run time simulation have a residual value less than a given percentage
according to their non-functional behavior. It takes as input the (residualr ) of the value of the best model vm∗,t (θ0 ). For-
models in the candidate list cl and the non-functional property mally, a window can be closed when percentile(r(g) , 95) ≤
p, and returns as output the models’ Beta distributions. Models vm∗,t (θ0 )×residualr . A common value for residualr is 1%; it
assessment uses the Static MAB within an evaluation window can be increased to reduce the window size, while leading to a
w of |w| execution traces, and then shifts the window of |w| greater residual. We note that the window size can be tuned in
execution traces instantiating a new Static MAB. terms of the acceptable regret using residualr .
The window size |w| can be fixed or variable. When w has In a nutshell, DMVW takes a decision based on the execution
fixed size |w|, the Dynamic MAB may not reach a statistical traces in a specific window w only. A new MAB is executed from
relevance to take a decision; in this case, i) the outcome can scratch in each window, potentially leading to a discontinuous
be sub-optimal or ii) the evaluation can be extended to the next model comparison. Due to this effect, DMVW can produce
window. fluctuations in the selection of the optimal model m∗ to be
When w has variable size |w|, our default approach, the MAB used by the application. To address these issues, we extend the
terminates the evaluation and moves to the next window only DMVW with the notion of memory in Section IV-B.
when a statistically relevant decision can be made. It is based on
the value remaining in the experiment [39], a tunable strategy B. DMVW With Memory (DMVW-Mem)
that controls both the estimation error and the window size
The DMVW with memory (DMVW-Mem) keeps track of past
requested to reach a valuable decision. In the following, we
DMVW executions to smooth the discontinuity among consec-
present our solutions based on variable window sizes, namely
utive windows. DMVW-Mem for window wj is defined on the
Dynamic MAB with Variable Window (DMVW) and DMVW
basis of the Beta distributions and corresponding parameters in
with Memory (DMVW-Mem).
window wj−1 as follows.
Definition 4 (DMVW with memory (DMVW-Mem)). A
A. Dynamic MAB With Variable Window (DMVW)
DMVW-Mem is a DMVW where the Beta distribution Betam,j
The Dynamic MAB with Variable Window (DMVW) imple- of each model m in window wj is initialized on the basis of the
ments the value remaining in the experiment using a Monte Carlo Beta distribution Betam,j−1 of the corresponding model m in
simulation. The simulation considers a random set g of sampled window wj−1 , as follows:
draws from models’ Beta distributions. It then counts the fre- r αm,j = αm,j−1 × δ
quency of each model being the winner in g as an estimation of r βm,j = βm,j−1 × δ,
the corresponding probability distribution. where δ∈[0, 1] denotes the memory size, αm,j−1 and βm,j−1
The value remaining in the experiment is based on the are α and β of Beta distribution Betam,j−1 of model m in window
minimization of the “regret” (the missed reward) due to an wj−1 . We note that the resulting αm,j and βm,j are rounded
early terminated experiment. Let θ0 denote the value of θ and down and set to 1 when equal to 0.
m∗ =arg maxm∈cl vm (θ0 ) the optimal model at the end of a In other words, DMVW-Mem initializes the Beta distributions
window w. The regret due to early termination of an experiment in each window wj according to the Beta distribution parameters
within window w is represented by vm∗ (θ0 )−vm∗,t (θ0 ), which observed in window wj−1 .
ANISETTI et al.: CONTINUOUS MANAGEMENT OF MACHINE LEARNING-BASED APPLICATION BEHAVIOR 117

We note that the ranking can also be used in case further


substitutions in wj+1 are needed. For instance, the second model
in the ranking is used when m̂ experiences an (unrecoverable)
error.
Example 5. Following Example 4, the ranking metric
rmm5 ,11 has value 110/(110 + 2)≈0.98. Let us assume that
rmm5 ,11 has the highest value: m5 substitutes the model used
in w11 , and is used for bail prediction in w12 (i.e., m̂=m5 ).
The assumption that the ranking computed for wj is appro-
priate for wj+1 does not hold when transient changes in the
models non-functional behavior are observed within wj+1 (e.g.,
a sharp change in the environmental context). In this scenario,
although m̂ becomes suboptimal, it cannot be substituted until
the following window begins. To address this issue, we propose
Fig. 3. Examples of Beta distributions. Betam5 ,11 (110, 2) is retrieved at the
an approach based on early substitution that is presented in
end of window w11 , Betam5 ,12 (11, 1) is used for the next window w12 , with Section V-B.
δ=10%.

B. Assurance-Based Substitution
Example 4. Following Example 3, let us assume that the Assurance-based substitution triggers early substitution of the
current evaluation window w11 in a given court terminates after selected model m̂ before window w terminates. It monitors m̂
200 execution traces according to DMVW-Mem. The output by computing its assurance level as follows.
of process model assessment is {Betam1 ,11 , ..., Betam5 ,11 }. Definition 6 (Assurance level). Let m̂ be the selected model
Fig. 3 shows Betam5 ,11 , where αm5 ,11 =110 and βm5 ,11 =2, and ett ∈wj an execution trace. The assurance level alt of m̂
meaning that m5 has been frequently sampled and successfully given ett is vm̂t (θ)/v∗ (θ(g) ).
evaluated. Let us then assume that the memory has size 10% According to Definition 6, alt is the ratio between i) the reward
(i.e., δ=0.1). Fig. 3 shows Betam5 ,12 defined for window w12 , vm̂t (θ) of the selected model m̂ retrieved at execution trace ett
which is initialized as: i) αm5 ,12 = 110 × 0.1 =11; ii) βm5 ,12 and ii) the reward v∗ (θ(g) ) of the optimal model m∗ , according
= 2 × 0.1 =0.2, which is then set to 1 according to Definition 4. to the Monte Carlo simulation in DMVW-Mem (Section IV-A).
We note that the assurance level can be retrieved for each model
V. MODEL SUBSTITUTION mi using the corresponding reward as numerator.
Process model substitution is executed on the basis of process The assurance level al is used to calculate the degradation
model assessment in Section IV. It takes as input the results of of the selected model. Formally, let ett be an execution trace
the DMVW-Mem evaluation in the current window, and returns in window wj . The degradation of m̂ at ett ∈wj is defined as
as output the model to be selected and used by the application follows.
in the following window. t
ali
degt = 1 − i=1 (5)
A. Ranking-Based Substitution t
Ranking-based substitution ranks models m∈cl in a given Substitution: It works as the ranking-based substitution but
window wj and determines the model m̂ to be used in the the selected model m̂ is substituted with the second model in the
following window wj+1 . Let us recall that αm (βm , resp.) is ranking before the window termination (i.e., early substitution),
incremented by 1 when p.S is successfully (unsuccessfully, iff its degradation degt exceeds threshold thr (degt >thr).
resp.) evaluated on trace et∈wj (Section III-C). Ranking-based Early substitution copes with transient changes within the
substitution is based on a metric evaluating how frequently window according to the degradation represented in thr. A
each model is selected by Thompson Sampling and successfully high (low, resp.) threshold means high (low, resp.) tolerance.
evaluated in DMVW-Mem, as follows. For instance, a high tolerance is preferable when the substi-
Definition 5 (Ranking Metric). Let wj be a window and m a tution overhead is high (e.g., when large models should be
model. The value of ranking metric rmm,j of m in wj is retrieved physically moved). A low tolerance is preferable when small
as αm,j /(αm,j + βm,j ). variations in the properties of the deployed models have a strong
According to Definition 5, rmm,j is the ratio between the impact on the application behavior. We note that, given its
number of successful evaluations of m (in terms of p.S) and the fundamental role in the substitution process, we experimentally
total number of draws computed by DMVW-Mem in wj . It is evaluated the adoption of different degradation thresholds thr in
retrieved for every model in cl and used for ranking. Section VII-C.
Substitution: At the end of window wj , the top-ranked model Example 6. Following Example 5, let us consider model m5
m̂ is selected and used within window wj+1 . The substitution as the selected model and model m4 as the second model in the
happens when wj terminates according to the value remaining ranking. Fig. 4 shows an example of the assurance levels of m̂5
in the experiment (Section IV-A). and m4 , denoted as alm̂5 ,t and alm4 ,t , respectively. Fig. 4 also
118 IEEE TRANSACTIONS ON SERVICES COMPUTING, VOL. 18, NO. 1, JANUARY/FEBRUARY 2025

updating the corresponding Beta distribution accordingly (Def-


inition 3). Then, process model assessment invokes function
monte_carlo_simulation to simulate the probabilities of mod-
els being winners. It creates a two-dimensional matrix with
dimensions |cl|×g, where g is the number of estimations. Each
cell contains samples drawn from the models’ Beta distributions;
the matrix counts the frequency of each model being winner and
approximates the probability distribution p(θ|yt ), accordingly.
Process model assessment proceeds until the minimum num-
ber of iterations is met and the value remaining in the experi-
ment permits to reach a statistically relevant decision (function
should_terminate). At this point, the evaluation window ends
(function handle_window).
When process model assessment ends, process model substi-
Fig. 4. Assurance levels of m̂5 (ranked first) and m4 (ranked second), denoted tution invokes function send_into_production ranking models
as alm̂5 ,t and alm4 ,t , respectively. The plot shows the logarithmic trend lines proportionally to the number of their successful evaluations of
and the outcomes in relation to the value remaining in the experiment of the
DMVW-Mem in a given window w.
the non-functional property (ranking metric in Definition 5).
For instance, the ranking at the end of window w1 is {m3 ,
m2 , m1 , m4 , m5 }, from best to worst. The top-ranked model
(m3 ) is pushed to production replacing m2 selected at deploy-
shows the corresponding logarithmic trend lines for readability, ment time by the Static MAB (Section V-A). Process model
and the value remaining in the experiment, using DMVW-Mem. substitution also monitors the selected model invoking function
Let us first consider ranking-based substitution only. At assurance_management. The latter verifies whether the non-
t=290, window wj terminates according to the value remaining functional behavior of the selected model is worsening with
in the experiment (Section IV-A). DMVW-Mem recomputes respect to the optimum model estimated by the Monte Carlo
the ranking; m4 is the top-ranked model while m5 the second simulation. It computes the assurance level alt for each new
one. DMVW-Mem triggers ranking-based substitution, and m4 execution trace ett (Definition 6) and uses it to retrieve the
becomes the selected model (m̂=m4 ) for window wj+1 . overall degradation (equation 5). For instance, during window
Let us then consider assurance-based substitution. We can w2 , the degradation of m3 is negligible, meaning that m3 is still
observe that alm̂5 ,t decreases as execution traces arrive. From adequate according to the data observed in w2 and does not need
t=38, alm̂5 ,t stably becomes less than 1. Around t=87, alm̂4 ,t to be substituted in advance.
overcomes alm̂5 ,t thus suggesting a possible substitution. How- When the current window w2 terminates, process model sub-
ever, the degradation of model m5 is not severe enough to justify stitution recomputes the ranking. For instance, the ranking is
the early substitution (i.e., the degradation is lower than the {m3 , m2 , m4 , m1 , m5 }, and m3 is used as the selected model
degradation threshold). for window w3 . In w3 , process model substitution observes a
constant degradation in the assurance level of m3 , reaching
the early substitution threshold. Early substitution is therefore
VI. WALKTHROUGH
triggered and the second model in the ranking (m2 ) substitutes
We present a walkthrough of our approach based on the m3 . Again when the current window w3 terminates, process
reference scenario in Example 1. Fig. 5 shows the pseudocode model substitution recomputes the ranking and m2 is confirmed
of our approach. at the top of the ranking.
The five models in the candidate list cl in Example 1 are Overall, this adaptive approach ensures that i) model substi-
first evaluated offline using the Static MAB in Section III-C, tution happens only when the decision is statistically relevant
to retrieve the optimum model m∗ that initializes our approach. according to the observed behavior (ranking-based substitution
Let us assume that model m2 is selected as the optimal model in DMVW-Mem), ii) a sub-optimal substitution decision can
(m̂=m∗ ). Our model assessment and substitution processes be fixed as soon as it is detected without waiting for the entire
(Fig. 5) begin, instantiating the DMVW-Mem. The processes evaluation window (assurance-based substitution), and iii) the
take as input i) the models in cl, ii) the observed execution entire process can be fine-tuned according to each scenario.
traces, iii) the non-functional property fairness in Example 3,
iv) the memory size δ, v) the early substitution threshold thr,
and vi) the minimum number of MAB iterations.
Execution traces observed from all the models are given as VII. EXPERIMENTAL EVALUATION
input to process model assessment. For each execution trace, We experimentally evaluated our approach focusing on: i)
the function thompson_sampling in DMVW-Mem chooses the model assessment at development time using Static MAB;
a model among those in cl by drawing a sample from each ii) the model substitution at run time using Dynamic MAB, also
model Beta distribution and retrieving the one with the high- evaluating the impact of different memory sizes; iii) quality
est value. The retrieved model is evaluated according to the and iv) performance of ranking-based and assurance-based
fairness score function (function score_function in Fig. 6), substitutions.
ANISETTI et al.: CONTINUOUS MANAGEMENT OF MACHINE LEARNING-BASED APPLICATION BEHAVIOR 119

Fig. 5. Pseudocode of our approach.

into training and test sets, where the training set includes more
than 3 million points.
We modeled the score function p.S of property fairness as
the variance (σ 2 ) of the bail amount in relation to sensitive
attributes gender and race [40], [41], [42]. Fig. 6 shows the
pseudocode of the score function and its usage according to
the threshold-based evaluation in Definition 3. We generated
five Naive Bayes models cl={m1 , ..., m5 }, each one trained
on a training set randomly extracted from the main training set.
The models showed similar performance, in terms of precision
and recall in bail estimation. We also extracted 10 test sets
corresponding to 10 individual experiments exp1 –exp10 to be
used in our experimental evaluation.
Experiments have been run on a laptop running Microsoft
Windows 10, equipped with a CPU Intel Core i7 @ 2.6
GHz and 16 GBs of RAM, using Python 3 with libraries
numpy v1.19.1 [43], pandas v1.2.5 [44], [45] and scikit-learn
v0.22.1 [46]. Datasets, code, and experimental results are avail-
Fig. 6. Pseudocode of the score function of property fairness and its usage.
able at https://doi.org/10.13130/RD_UNIMI/2G3CVO.

A. Experimental Settings
We considered the application for bail estimation and property B. Model Assessment
fairness in our reference example in Section VI. In our experi- We present the experimental evaluation of our Static MAB
ments, we used the dataset of the Connecticut State Department for model assessment at development time. We compare the
of Correction.1 This dataset provides a daily updated list of five Naive Bayes models using the Static MAB approach, by
people detained in the Department’s facilities awaiting a trial. evaluating their behavior with respect to non-functional property
It anonymously discloses data of individual people detained in fairness. Table II shows the Thompson Sampling draws for the
the correctional facilities every day starting from July 1st , 2016. five models in the candidate list on a randomly chosen sample
It contains attributes such as last admission date, race, gender, (2,000 data points) for each of the 10 experiments.
age, type of offence and facility description, in more than four Table II shows the distribution of models selected as best
millions data points (at the download date). We divided this set candidate (denoted in bold) for property fairness. Since m3 is
never selected as the best candidate, it is removed from the
1 Available at https://data.ct.gov/Public-Safety/Accused-Pre-Trial-Inmates- candidate list for the rest of the experimental evaluation. We
in-Correctional-Faciliti/b674-jy6w and downloaded on February 21st, 2020. note that comparing models based on the same algorithm (i.e.,
120 IEEE TRANSACTIONS ON SERVICES COMPUTING, VOL. 18, NO. 1, JANUARY/FEBRUARY 2025

TABLE II
STATIC MAB COMPARISON IN TERMS OF THOMPSON SAMPLING DRAWS ON A
RANDOM SAMPLE OF 2,000 DATA POINTS FOR EACH EXPERIMENT

Fig. 8. The selected model for each execution trace et of experiment exp1
with different memory sizes δ.

Fig. 7. Individual window sizes and moving average trends across all sets of
execution traces with different memory sizes for exp1 .

Naive Bayes) is more challenging than considering different al-


Fig. 9. Stacked histograms showing the ranking of the models in each of the
gorithms [22], posing our experiments in a worst-case scenario. 10 experiments with δ10 .

C. Model Substitution
We present the experimental evaluation of our process model set of execution traces in experiment exp1 , considering different
substitution using DMVW-Mem with different memory sizes memory sizes. We note that extemporaneous changes on the
(δ0 =0%, δ5 =5%, δ10 =10%, δ25 =25%). We evaluated i) the selected model are frequent without memory (δ0 ), less frequent
impact of the memory on the window size, ii) the impact of with δ5 , where clusters of continuously selected models emerge,
the ranking-based substitution in terms of stability of model and highly infrequent with δ10 . Fig. 8(d) shows a stable selection
selections, iii) the quality of the ranking-based substitution, and of model m2 , while models m4 and m5 are often not selected
iv) the quality of the assurance-based substitution. We note that preferring m1 instead. Considering the entire ranking, model
no artificial degradation was introduced during the experiments. m1 is ranked at the second position with δ25 , while m4 at the
1) Memory Size and Ranking: Fig. 7 shows the window size third position.
varying the memory in experiment exp1 with residual threshold In general, we observe that the number of changes across
residualr =0.01 (Section IV-A). We note that a bigger memory the experiments, in terms of selected models, decreases as
corresponds to a smaller window. This is expected, since the the memory increases. On average, across all experiments, it
DMVW-Mem does not start from scratch in every window, decreases by 41.18% when memory increases from δ5 to δ10
and the more DMVW-Mem knows about the models’ Beta (from 34 changes on average with δ5 to 20 changes on average
distributions, the sooner the value remaining in the experiment with δ10 ); it decreases by 20% when memory increases from δ10
reaches the threshold. Considering all the experiments, the av- to δ25 (from 20 changes on average with δ10 to 16 changes on
erage window size for δ25 is 157 confirming the trend in Fig. 7. average with δ25 ).
Let us now consider the model selected according to the Fig. 9 shows an aggregated ranking for all the experiments
DMVW-Mem ranking. Fig. 8 shows the selected model for each with δ10 . It shows the percentage of times a model has been
ANISETTI et al.: CONTINUOUS MANAGEMENT OF MACHINE LEARNING-BASED APPLICATION BEHAVIOR 121

ranked into a specific position for all the experiments. We note


that exp1 was one of the most balanced experiments in terms of
ranking, having at least three models (m1 , m2 , and m5 ) with a
similar percentage of first and second positions in the ranking.
In exp2 and exp8 , m4 and m5 were ranked at first or second
position ≈80% of the times. More specifically, m1 and m5 are
ranked as the first model in the ranking 41.39% and 27.87% of
times, respectively (Fig. 8(c)). When m5 is not ranked first, it is
ranked second 26.36% of times, while m1 17.21%.
Considering all experiments and memory sizes, when com-
pared with δ5 , we note an average decrease of ranking changes
in the first position of 34.66% with δ10 and of 62.43% with δ25 .
This is also clear from Fig. 8(d), where m2 was ranked in the
first position most of the times.
2) Quality Evaluation: We evaluated the quality of ranking-
Fig. 10. Cumulative residual error ξ̂ between DMVW (δ0 ) and the DMVW-
based substitution and assurance-based substitution varying the Mem with different memory sizes δi and number of traces for experiment exp1 .
memory. The ranking retrieved according to DMVW is used as Model substitutions are marked with “X”.
baseline.
Let R denote the function that returns as output the (current)
position in the DMVW-based ranking of the (current) top-ranked
model in the DMVW-Mem-based ranking; this position is a
number ∈[1, ..., k] with k=|cl|. The residual error ξ measures
the difference between the ranking obtained with DMVW-Mem
and DMVW, and is defined as:
 
R−1
ξ = penalty , (6)
k−1

where penalty is the residual penalty function. We note that


in case the top-ranked model according to DMVW-Mem is
top-ranked also according to DMVW, the residual error is
ξ=penalty(0); in case it is ranked last according to DMVW,
the residual error is ξ=penalty(1). Residual penalty function
penalty is defined as a sigmoid function as follows:
Fig. 11. Total number of triggered early substitutions (total), relevant early
substitutions (relevant), and successful early substitutions (success) varying
1 threshold thr and memory δ for exp1 –exp10 .
penalty(x) = , (7)
1+ e−c1 (x−c2 )
where c2 control the x of the sigmoid inflection point and c1
the slope. The residual error measures the difference in terms Considering all experiments and memory sizes, we note an av-
of ranking between the different settings. While it is not an erage cumulative residual error of 163.67 with δ5 and of 243.63
indicator of the absolute quality, we assume this measure as with δ25 , corresponding to an average increase of 48.85%. As
a valid indicator of the relative quality between the different depicted in Fig. 9, our experiments revealed a frequent variability
settings of our solution. among the best candidate models. Therefore, the most suitable
Ranking-Based Substitution: Fig. 10 shows the cumulative approach in terms of residual error was the one having lower
residual error ξ=ˆ  ξ (i.e., the sum of the error retrieved memory.
t
in each window and execution trace) for exp1 with different Assurance-Based Substitution: Using the memory settings in
memory sizes. It also shows, marked with “×”, the execution Section VII-C 1, we first evaluated the impact of the degradation
traces where model substitutions occurred due to changes at threshold thr, varying its value in thr5 =0.05, thr10 =0.10, and
the top of the ranking. We note that in this experiment the thr25 =0.25.
bigger the memory, the bigger the cumulative residual error. Fig. 11 shows the total number of triggered early substitutions
This effect is compensated by fewer model substitutions as also (denoted as total) compared with total number of substitutions
demonstrated in Section VII-C 1. We also note that, depending that really occurred at end of the window (i.e, the correct
on the application domain, the memory settings can be dynamic. early substitutions, denoted as relevant), on average across all
For instance, in scenarios where fast reaction to changes is more experiments and memory settings. We observe that in 89% of
important than stability of the selected model, the memory can the cases, an early substitution has been correctly triggered.
be lowered; it can be increased in scenarios where stability is In detail, an early substitution was correctly triggered in 81%
important to counteract fluctuations. of the cases when using thr5 , increasing to 92% when using
122 IEEE TRANSACTIONS ON SERVICES COMPUTING, VOL. 18, NO. 1, JANUARY/FEBRUARY 2025

thr25 . These results were expected, since a higher threshold


corresponds to a more severe assurance variation, and thus to
a higher likelihood of the change being correct at the end of
the window.
Fig. 11 also shows the number of successfully executed early
substitutions (denoted as success) among the relevant early
substitutions. A successful early substitution is a substitution
where the model selected for substitution is the one evaluated
by DMVW-Mem at the first position of the ranking at the end
of the window. We observe that, in 93% of the cases on average,
assurance-based early substitution took the correct decision.
This result also confirms the quality of the entire retrieved rank-
ing, meaning that when a substitution was needed, the second-
ranked model was indeed the most suitable for substitution.
We also observe that i) as the early substitution threshold
Fig. 12. Performance (average windows duration) expressed in milliseconds
thr increases, the number of early substitutions decreases. For varying threshold thr and memory δ for exp1 –exp12 .
instance, with δ10 , it decreases from 196 with thr5 to 147 with
thr25 ; ii) the difference between the number of substitutions
with thr5 and thr25 is lower than expected (e.g., from 301 to
243 with δ5 ). VIII. RELATED WORK
In other words, when a degradation occurs, it exceeds thr25 Our approach guarantees stable application behavior over
in most of the cases. Even this experiment confirms that a bigger time, by dynamically selecting the most suitable ML model
memory corresponds to fewer early substitutions. A more stable according to a (set of) non-functional property. This issue has
trend of the assurance level of the selected model was also been studied from different angles in the literature: i) classifier
observed with bigger memory. and ensemble selection, ii) functional and iii) non-functional ML
We then evaluated the duration of early substitutions, in adaptation.
terms of the number of execution traces from the moment when At the end of this section, we also present a detailed compar-
the early substitution is triggered to the end of the considered ison of our approach with the related work in terms of their cat-
window. egory, objective, type of objective (functional/non functional),
We observe that an increase in the memory and threshold and applicability to ML algorithms and properties.
results in a decrease in the number of substitutions (see Fig. 11) Classifier and ensemble selection: it refers to the techniques
and their duration. On average across all experiments, the dura- that select the most suitable (set of) classifier among a set of
tion varies from ≈290 execution traces when using δ5 and thr5 candidates. It is referred to as classifier selection when one
to 84 when using δ25 and thr25 . classifier is selected, ensemble selection, otherwise [51]. It can
Finally, we observe that, similarly to the memory tuning, be performed at training time (static), or for each (subset of) data
the substitution threshold should be fine-tuned according to the point at inference time (dynamic). The latter, often combined
different application domains, to adequately react to changes with static selection, typically shows the best performance [12].
occurred within a given window. Selection maximizes functional metrics, often accuracy. Meta-
Performance: We compared the performance of our ranking- learning is frequently used, as surveyed by Khan et al. [52].
based and assurance-based substitutions with different memory For instance, Cruz et al. [47] proposed a dynamic ensemble
settings and assurance thresholds on all the experiments. selection that considers different spatial-based criteria using
Fig. 12 shows both the ranking-based and the assurance-based a meta-classifier. Zhu et al. [14] defined a dynamic ensemble
substitution performance varying memory settings and thresh- selection based on the generation of diversified classifiers. Se-
olds. The results are presented as the average time to compute lection is based on spatial information (i.e., the most competent
an evaluation window. We note that the ranking-based approach classifiers for a region). Classifiers predictions are combined
outperformed the assurance-based approach with an average using weighted majority voting, weights depend on the clas-
improvement around 4.68%, due to the absence of assurance sifiers competency for a data point. Zhang et al. [13] defined a
metric computations and corresponding comparisons. We also dynamic ensemble selection whose selection criterion considers
level the impact of the different thresholds on performance the classifiers synergy. It evaluates the contribution of each
is negligible, with thr10 showing the best performance in all classifier to the ensemble, in terms of the accuracy retrieved with
conditions. Fig. 12 clarifies that the dominating factor is the and without the classifier. For each data point, it selects the clas-
memory size. This is due to the fact that a bigger memory corre- sifiers with a positive contribution, and uses such contribution as
sponds to a smaller window. In addition, it also corresponds to weight in predictions aggregation. Other approaches focused on
fewer substitutions positively impacting performance, because imbalanced learning. Roy et al. [12] showed that specific prepro-
each substitution requires more iterations in DMVW-Mem to cessing (e.g., oversampling of the underrepresented class) and
converge. dynamic, spatial-based selection outperform static selection in
ANISETTI et al.: CONTINUOUS MANAGEMENT OF MACHINE LEARNING-BASED APPLICATION BEHAVIOR 123

TABLE III
COMPARISON WITH RELATED WORK

this scenario. Mousavi et al. [48] also used oversampling. Static Non-functional adaptation: it refers to the techniques that
selection then defines the ensemble and its combiner (e.g., ma- adapt a ML model (and application) according to a non-
jority voting). Dynamic selection finally retrieves a subset of the functional property. Fairness is the most studied property in lit-
ensemble for each data point. Pérez-Gállego et al. [49] focused erature in both static and dynamic settings; we focus on the latter
on quantification tasks with drifts between classes. The proposed due to its connection with the work in this paper. For instance,
dynamic ensemble selection uses a specifically designed crite- Iosifidis et al. [32] designed an approach that tackles fairness and
rion, selecting the classifiers whose training distribution is the concept drift. It uses two pre-processing techniques modifying
most similar to the input data points. Our approach implements data, which are then taken as input by classifiers that can natively
a dynamic classifier selection, which departs from existing solu- adapt to concept drifts (e.g., Hoefdding trees). A similar solution
tions implementing a (dynamic) selection of a (set of) classifier is proposed by Badar et al. [50] in federated learning. It first
for each data point to maximize accuracy at inference time. detects drift, and then evaluates if fairness is no longer sup-
Our goal is rather the run-time selection and substitution of ported. It then performs oversampling as countermeasure. Zhang
the ML model to the aim of guaranteeing a stable behavior of et al. [10], [11] introduced a training algorithm based on Hoefd-
the application with respect to a specific (set of) non-functional ding trees, whose splitting criterion considers fairness and ac-
property. curacy. Such idea has also been applied to random forest mod-
Functional adaptation: it refers to the techniques that adapt a els [33]. Iosifidis et al. [34] designed an online learning algorithm
ML model (and application) according to changing conditions, that detects class imbalance and lack of fairness, and adjusts the
notably a drift, to keep quality metrics high. According to the ML model accordingly. It fixes weights during boosting (for
survey by Lu et al. [53], the possible actions upon a detected drift imbalance) and the learned decision boundary (for fairness).
are: training and using a new ML model, using ensemble pur- Our approach implements an adaptation process, which departs
posefully trained for drift, and adapting an existing ML model from existing re-training solutions using a custom algorithm
when the drift is localized to a region. The issue of drift has been focused on a specific property (fairness). Our goal is rather the
approached using dynamic classifier selection. For instance, adaptation of the overall application behavior according to any
Almeida et al. [15] designed a drift detector whose selection non-functional properties and ML algorithms.
criterion considers both spatial and concept-based information. Table III shows how our approach compares with the re-
It relies on a set of diverse classifiers that is dynamically updated, lated work in terms of Category (denoted as Cat.), Objective,
removing unnecessary classifiers and training new ones as new Objective Type, and Applicability. Category can be i) classifier
concepts emerge. Tahmasbi et al. [9] designed a novel adaptive and ensemble selection (denoted as S), ii) functional adaptation
ML model. It uses one classifier at time, and, upon drift detec- (denoted as FA), and iii) non-functional adaptation (denoted as
tion, selects the subsequent classifier with the highest quality NFA). Objective Type can be i) functional (denoted as F), and
in the last evaluation window. Our approach implements an i) non-functional (denoted √ as NF). Applicability is expressed in
adaptation process, which departs from existing solutions based terms of i) ML Model ( if applicable to any ML algorithm,
on the online re-training of individual ML models according to ≈ if applicable to a class of ML algorithms, ✗ √ if applicable to
drift or the selection of ML models that maximize quality under a specific ML algorithm only); ii) Property ( if applicable
drift. Our goal is rather the adaptation of the overall ML-based to any (non-)functional property, ≈ if applicable to a class
application according to a (arbitrary) non-functional property of (non-)functional properties, ✗ if applicable to a specific
of interest. (non-)functional property). Table III shows that our approach
124 IEEE TRANSACTIONS ON SERVICES COMPUTING, VOL. 18, NO. 1, JANUARY/FEBRUARY 2025

(last row in Table III) is the only architectural and method- [15] P. R. L. Almeida, L. S. Oliveira, A. S. Britto, and R. Sabourin, “Adapting
ological solution that supports stable non-functional behavior dynamic classifier selection for concept drift,” Expert Syst. Appl., vol. 104,
2018, pp. 67–85.
of ML-based applications. It builds on a smart and dynamic [16] R. Chen, Z. Li, J. Li, J. Yan, and C. Wu, “On collective robustness
multi-model substitution departing from expensive re-training of bagging against data poisoning,” in Proc. Int. Conf. Mach. Learn.,
approaches and inference-time classifier selection for individual Baltimore, MD, USA, 2022, pp. 3299–3319.
[17] J. Jia, X. Cao, and N. Z. Gong, “Intrinsic certified robustness of bagging
data points. against data poisoning attacks,” in Proc. Conf. Assoc. Advance. Artif.
Intell., 2021, pp. 7961–7969.
IX. CONCLUSION [18] A. Levine and S. Feizi, “Deep partition aggregation: Provable defenses
against general poisoning attacks,” in Proc. Int. Conf. Learn. Representa-
We presented a multi-model approach for the continuous tions, Vienna, Austria, 2021.
management of ML-based application non-functional behavior. [19] W. Wang, A. Levine, and S. Feizi, “Improved certified de-
fenses against data poisoning with (deterministic) finite aggrega-
Our approach guarantees a stable application behavior at run tion,” in Proc. Int. Conf. Mach. Learn., Baltimore, MD, USA, 2022,
time, over time and across model changes, where multiple ML pp. 22769–22783.
models with similar non-functional properties are available and [20] M. Anisetti, C. A. Ardagna, A. Balestrucci, N. Bena, E. Damiani, and
C. Y. Yeun, “On the robustness of random forest against untargeted data
one model is selected at time according to such properties poisoning: An ensemble-based approach,” IEEE Trans. Sustain. Comput.,
and the application context. Our approach manages (dynamic vol. 8, no. 4, pp. 540–554, Fourth Quarter, 2023.
and unpredictable) contextual changes in modern ML deploy- [21] N. Bena, M. Anisetti, G. Gianini, and C. A. Ardagna, “Certifying accuracy,
privacy, and robustness of ML-based malware detection,” SN Comput. Sci.,
ments, supporting early model substitutions based on Dynamic vol. 5, 2024, Art. no. 710.
MAB and assurance evaluation. [22] M. Anisetti, C. A. Ardagna, E. Damiani, and P. G. Panero, “A methodol-
ogy for non-functional property evaluation of machine learning models,”
REFERENCES in Proc. Int. Conf. Manage. Digit. Ecosyst., Abu Dhabi, UAE, 2020,
pp. 38–45.
[1] E. Damiani and C. Ardagna, “Certified machine-learning models,” in Proc. [23] M. Anisetti, C. A. Ardagna, and N. Bena, “Multi-dimensional certification
Int. Conf. Curr. Trends Theory Pract. Comput. Sci., Limassol, Cyprus, of modern distributed systems,” IEEE Trans. Serv. Comput., vol. 16, no. 3,
2020, pp. 3–15. pp. 1999–2012, May/Jun. 2023.
[2] T. L. Duc, R. G. Leiva, P. Casari, and P.-O. Östberg, “Machine learning [24] C. A. Ardagna and N. Bena, “Non-functional certification of modern
methods for reliable resource provisioning in edge-cloud computing: A distributed systems: A research manifesto,” in Proc. IEEE Int. Conf. Softw.
survey,” ACM Comput. Surv., vol. 52, no. 5, pp. 1–39, 2019. Serv. Eng., Chicago, IL, USA, 2023, pp. 71–79.
[3] European Union, “Regulation (EU) 2024/1689 of the European Parliament [25] J. R. Gunasekaran, C. S. Mishra, P. Thinakaran, B. Sharma, M. T.
and of the Council of 13 June 2024 laying down harmonised rules on Kandemir, and C. R. Das, “[Cocktail: A multidimensional optimization
artificial intelligence and amending Regulations (EC) No 300/2008, (EU) for model serving in cloud],” in Proc. USENIX Symp. Netw. Syst. Des.
No 167/2013, (EU) No 168/2013, (EU) 2018/858, (EU) 2018/1139 and Implementation, Renton, WA, USA, 2022, pp. 1041–1057.
(EU) 2019/2144 and Directives 2014/90/EU, (EU) 2016/797 and (EU) [26] F. Doshi-Velez and B. Kim, “Towards a rigorous science of interpretable
2020/1828 (Artificial Intelligence Act)Text with EEA relevance,” 2024. machine learning,” 2017, arXiv: 1702.08608.
[Online]. Available: https://eur-lex.europa.eu/eli/reg/2024/1689/oj [27] M. Anisetti, N. Bena, F. Berto, and G. Jeon, “A DevSecOps-based assur-
[4] K. Meng, Z. Wu, M. Bilal, X. Xia, and X. Xu, “Blockchain-enabled de- ance process for big data analytics,” in Proc. IEEE Int. Conf. Web Serv.,
centralized service selection for QoS-aware cloud manufacturing,” Expert Barcelona, Spain, 2022, pp. 1–10.
Syst., 2024, Art. no. e13602. [28] C. A. Ardagna, R. Asal, E. Damiani, and Q. Vu, “From security to
[5] B. Qolomany, I. Mohammed, A. Al-Fuqaha, M. Guizani, and J. Qadir, assurance in the cloud: A survey,” ACM Comput. Surv., vol. 48, no. 1,
“Trust-based cloud machine learning model selection for industrial IoT and pp. 1–50, 2015.
smart city services,” IEEE Internet Things J., vol. 8, no. 4, pp. 2943–2958, [29] S. Lins, S. Schneider, and A. Sunyaev, “Trust is good, control is better: Cre-
Feb. 2021. ating secure clouds by continuous auditing,” IEEE Trans. Cloud Comput.,
[6] F. Ishikawa and N. Yoshioka, “How do engineers perceive difficulties vol. 6, no. 3, pp. 890–903, Third Quarter, 2018.
in engineering of machine-learning systems? - Questionnaire survey,” in [30] M. Hosseinzadeh, H. K. Hama, M. Y. Ghafour, M. Masdari, O. H. Ahmed,
Proc. IEEE/ACM Joint 7th Int. Workshop Conducting Empirical Stud. and H. Khezri, “Service selection using multi-criteria decision making:
Ind.-6th Int. Workshop Soft. Eng. Res. Ind. Pract., Montreal, Canada, 2019, A comprehensive overview,” J. Netw. Syst. Manage., vol. 28, no. 4,
pp. 2–9. pp. 1639–1693, 2020.
[7] M. Anisetti, C. A. Ardagna, N. Bena, and E. Damiani, “Rethinking [31] C. A. Ardagna, R. Asal, E. Damiani, T. Dimitrakos, N. El Ioini, and C.
certification for trustworthy machine-learning-based applications,” IEEE Pahl, “Certification-based cloud adaptation,” IEEE Trans. Serv. Comput.,
Internet Comput., vol. 27, no. 6, pp. 22–28, Nov./Dec. 2023. vol. 14, no. 1, pp. 82–96, Jan./Feb. 2021.
[8] K. Brecker, S. Lins, and A. Sunyaev, “Artificial intelligence systems’ [32] V. Iosifidis, T. N. H. Tran, and E. Ntoutsi, “Fairness-enhancing interven-
impermanence: A showstopper for assessment?,” in Proc. Workshop Inf. tions in stream classification,” in Proc. Int. Conf. Database Expert Syst.
Technol. Syst., Hyderabad, India, 2023. Appl., Linz, Austria, 2019, pp. 261–276.
[9] A. Tahmasbi, E. Jothimurugesan, S. Tirthapura, and P. B. Gibbons, “Drift- [33] W. Zhang, A. Bifet, X. Zhang, J. C. Weiss, and W. Nejdl, “FARF: A fair
Surf: Stable-state/reactive-state learning under concept drift,” in Proc. Int. and adaptive random forests classifier,” in Proc. Pacific-Asia Conf. Knowl.
Conf. Mach. Learn., 2021, pp. 10054–10064. Discov. Data Mining, 2021, pp. 245–256.
[10] W. Zhang and E. Ntoutsi, “FAHT: An adaptive fairness-aware de- [34] V. Iosifidis, W. Zhang, and E. Ntoutsi, “Online fairness-aware learning
cision tree classifier,” in Proc. Int. Joint Conf. Artif. Intell., 2019, with imbalanced data streams,” 2021, arXiv:2108.06231.
pp. 1480–1486. [35] M. Anisetti, C. A. Ardagna, and N. Bena, “Continuous certification of
[11] W. Zhang et al., “Flexible and adaptive fairness-aware learning in non- non-functional properties across system changes,” in Proc. Serv.-Oriented
stationary data streams,” in Proc. IEEE 32nd Int. Conf. Tools Artif. Intell., Comput., ICSOC, Rome, Italy, 2023, pp. 3–18.
Baltimore, MD, USA, 2020, pp. 399–406. [36] R. N. Taylor, N. Medvidović, and E. M. Dashofy, Software Architecture:
[12] A. Roy, R. M. O. Cruz, R. Sabourin, and G. D. C. Cavalcanti, “A study Foundations, Theory, and Practice. Hoboken, NJ, USA: John Wiley &
on combining dynamic selection and data preprocessing for imbalance Sons, 2009.
learning,” Neurocomputing, vol. 286, pp. 179–192, 2018. [37] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan, “A
[13] Z.-L. Zhang and Y.-H. Zhu, “DES-SV: Dynamic ensemble selection based survey on bias and fairness in machine learning,” ACM Comput. Surv.,
on shapley value,” SSRN Preprint SSRN:4608310, 2023. vol. 54, no. 6, pp. 1–35, 2021.
[14] X. Zhu, J. Ren, J. Wang, and J. Li, “Automated machine learn- [38] O. Chapelle and L. Li, “An empirical evaluation of Thompson sampling,”
ing with dynamic ensemble selection,” Appl. Intell., vol. 53, no. 20, in Proc. Int. Conf. Neural Inf. Process. Syst., Granada, Spain, 2011,
pp. 23596–23612, 2023. pp. 2249–2257.
ANISETTI et al.: CONTINUOUS MANAGEMENT OF MACHINE LEARNING-BASED APPLICATION BEHAVIOR 125

[39] S. L. Scott, “Multi-armed bandit experiments in the online service econ- Claudio A. Ardagna (Senior Member, IEEE) s full
omy,” Appl. Stochastic Models Bus. Ind., vol. 31, no. 1, pp. 37–45, 2015. professor at the Department of Computer Science,
[40] L. Floridi, M. Holweg, M. Taddeo, J. Amaya Silva, J. Mökander, and Y. Università degli Studi di Milano, the Director of the
Wen, “CapAI – A procedure for conducting conformity assessment of CINI National Lab on Data Science, and co-founder
AI systems in line with the EU artificial intelligence act,” SSRN Preprint of Moon Cloud srl. His research interests are in the
SSRN: 4064091, 2022. area of edge-cloud and AI security and assurance,
[41] S. Maghool, E. Casiraghi, and P. Ceravolo, “Enhancing fairness and and data science. He has published more than 170
accuracy in machine learning through similarity networks,” in Proc. Int. articles and books. He has been visiting professor at
Conf. Cooperative Inf. Syst., Groningen, The Netherlands, 2023, pp. 3–20. Université Jean Moulin Lyon 3 and visiting researcher
[42] G. Vargas-Solar, C. Ghedira-Guégan, J. A. Espinosa-Oviedo, and J.-L. at BUPT, Khalifa University, GMU.
Zechinelli-Martin, “Embracing diversity and inclusion: A decolonial ap-
proach to urban computing,” in Proc. 20th ACS/IEEE Int. Conf. Comput.
Syst. Appl., Giza, Egypt, 2023, pp. 1–6.
[43] C. R. Harris et al., “Array programming with NumPy,” Nature, vol. 585,
no. 7825, pp. 357–362, Sep. 2020.
[44] Wes McKinney, “Data structures for statistical computing in Python,” in Nicola Bena (Member, IEEE) is a postdoc with the
Proc. Python Sci. Conf., Austin, TX, USA, 2010, pp. 51–56. Department of Computer Science, Università degli
[45] The Pandas Development Team, “Pandas-dev/pandas: Pandas,” Feb. 2020. Studi di Milano. His research interests are in the
[Online]. Available: https://doi.org/10.5281/zenodo.3509134 area of security of modern distributed systems with
[46] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” J. Mach. particular reference to certification, assurance, and
Learn. Res., vol. 12, pp. 2825–2830, 2011. risk management techniques. He has been visiting
[47] R. M. O. Cruz, R. Sabourin, G. D. C. Cavalcanti, and T. IngRen, “META- scholar at Khalifa University and at INSA Lyon.
DES: A dynamic ensemble selection framework using meta-learning,”
Pattern Recognit., vol. 48, no. 5, pp. 1925–1935, 2015.
[48] R. Mousavi, M. Eftekhari, and F. Rahdari, “Omni-ensemble learning
(OEL): Utilizing over-bagging, static and dynamic ensemble selection
approaches for software defect prediction,” J. Artif. Intell. Technol., vol. 27,
no. 6, 2018, Art. no. 1850024.
[49] P. Pérez-Gállego, A. Casta no, J. Ramón Quevedo, and J. José del Coz,
“Dynamic ensemble selection for quantification tasks,” Inf. Fusion, vol. 45,
pp. 1–15, 2019.
[50] M. Badar, W. Nejdl, and M. Fisichella, “FAC-Fed: Federated adaptation Ernesto Damiani (Senior Member, IEEE) is full
for fairness and concept drift aware stream classification,” Mach. Learn., professor at the Department of Computer Science,
vol. 112, pp. 2761–2786, 2023. Università degli Studi di Milano, where he leads
[51] R. M. Cruz, R. Sabourin, and G. D. Cavalcanti, “Dynamic classifier selec- the Secure Service-oriented Architectures Research
tion: Recent advances and perspectives,” Inf. Fusion, vol. 41, pp. 195–216, (SESAR) Laboratory. He is also the Founding Direc-
2018. tor of the Center for Cyber-Physical Systems, Khalifa
[52] I. Khan, X. Zhang, M. Rehman, and R. Ali, “A literature survey and University, UAE. He received an Honorary Doctorate
empirical study of meta-learning for classifier selection,” IEEE Access, from INSA Lyon for his contributions to research and
vol. 8, pp. 10262–10281, 2020. teaching on big data analytics. His research interests
[53] J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, and G. Zhang, “Learning under include cybersecurity, big data, artificial intelligence,
concept drift: A review,” IEEE Trans. Knowl. Data Eng., vol. 31, no. 12, and cloud-edge processing, and he has published over
Dec. 2019. 680 peer-reviewed articles and books. He is a Distinguished Scientist of ACM
and was a recipient of the 2017 Stephen Yau Award.

Marco Anisetti (Senior Member, IEEE) is full pro- Paolo G. Panero is currently working toward the
fessor with the Department of Computer Science,
master degree with the Department of Computer Sci-
Università degli Studi di Milano. His research in-
ence, Università degli Studi di Milano. He is IT officer
terests are in the area of computational intelligence
in a in-house public administration company where
and its application to the design and evaluation of he deals with innovation and IT services. His research
complex systems. He has been investigating inno-
interests are in the area of Machine Learning with a
vative solutions in the area of assurance evaluation
focus on models evaluation.
of cloud security and AI. In this area he defined a
new scheme for continuous and incremental cloud
security certification, based on distributed assurance
evaluation architecture.

Open Access provided by ‘Università degli Studi di Milano’ within the CRUI CARE Agreement

You might also like