Continuous Management of Machine Learning-Based Application Behavior
IEEE TRANSACTIONS ON SERVICES COMPUTING, VOL. 18, NO. 1, JANUARY/FEBRUARY 2025
Abstract—Modern applications are increasingly driven by Machine Learning (ML) models whose non-deterministic behavior is affecting the entire application life cycle from design to operation. The pervasive adoption of ML is urgently calling for approaches that guarantee a stable non-functional behavior of ML-based applications over time and across model changes. To this aim, non-functional properties of ML models, such as privacy, confidentiality, fairness, and explainability, must be monitored, verified, and maintained. Existing approaches mostly focus on i) implementing solutions for classifier selection according to the functional behavior of ML models, and ii) finding new algorithmic solutions, such as continuous re-training. In this paper, we propose a multi-model approach that aims to guarantee a stable non-functional behavior of ML-based applications. An architectural and methodological approach is provided to compare multiple ML models showing similar non-functional properties and select the model supporting stable non-functional behavior over time according to (dynamic and unpredictable) contextual changes. Our approach goes beyond the state of the art by providing a solution that continuously guarantees a stable non-functional behavior of ML-based applications, is ML algorithm-agnostic, and is driven by non-functional properties assessed on the ML models themselves. It consists of a two-step process working during application operation, where model assessment verifies non-functional properties of ML models trained and selected at development time, and model substitution guarantees continuous and stable support of non-functional properties. We experimentally evaluate our solution in a real-world scenario focusing on non-functional property fairness.

Index Terms—Assurance, machine learning, multi-armed bandit, non-functional properties.

Received 17 November 2023; revised 15 July 2024; accepted 7 October 2024. Date of publication 28 October 2024; date of current version 6 February 2025. Research supported, in part, by i) project BA-PHERD, funded by the European Union – NextGenerationEU, under the National Recovery and Resilience Plan (NRRP) Mission 4 Component 2 Investment Line 1.1: "Fondo Bando PRIN 2022" (CUP G53D23002910006); ii) MUSA – Multilayered Urban Sustainability Action – project, funded by the European Union – NextGenerationEU, under the National Recovery and Resilience Plan (NRRP) Mission 4 Component 2 Investment Line 1.5: Strengthening of research structures and creation of R&D "innovation ecosystems", set up of "territorial leaders in R&D" (CUP G43C22001370007, Code ECS00000037); iii) project SERICS (PE00000014) under the NRRP MUR program funded by the EU – NextGenerationEU; iv) projects 1H-HUB and SOV-EDGE-HUB funded by Università degli Studi di Milano – PSR 2021/2022 – GSA – Linea 6; and v) program "Piano di Sostegno alla Ricerca" funded by Università degli Studi di Milano. Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the Italian MUR. Neither the European Union nor the Italian MUR can be held responsible for them. (Corresponding author: Claudio A. Ardagna.)

Marco Anisetti, Claudio A. Ardagna, Nicola Bena, and Paolo G. Panero are with the Department of Computer Science, Università degli Studi di Milano, 20133 Milano, Italy (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Ernesto Damiani is with the Department of Computer Science, Università degli Studi di Milano, 20133 Milano, Italy, and also with C2PS, Computer Science Department, Khalifa University, Abu Dhabi P.O. Box: 127788, UAE (e-mail: [email protected]).

Digital Object Identifier 10.1109/TSC.2024.3486226

I. INTRODUCTION

MACHINE Learning (ML) has become the technique of choice to provide advanced functionalities and carry out tasks hardly achievable by traditional control and optimization algorithms [1]. Even the behavior, orchestration, and deployment parameters of distributed systems and services, possibly offered on the cloud-edge continuum, are increasingly based on ML models [2]. Concerns about the black-box nature of ML have led to a societal push that involves all components of society (policymakers, regulators, academic and industrial stakeholders, citizens) towards trustworthy and transparent ML, giving rise to legislative initiatives on artificial intelligence (e.g., the AI Act in Europe [3]).

This scenario introduces the need for solutions that continuously guarantee a stable non-functional behavior of ML-based applications, a task that is significantly more complex than mere QoS-based selection and composition (e.g., [4], [5], [6]). The focus of such a task is to assess the non-functional properties of ML models, such as privacy, confidentiality, fairness, and explainability, over time and across changes. The non-functional assessment of ML-based applications' behavior has to cope with the ML models' complexity, low transparency, and continuous evolution [7], [8]. ML models in fact are affected by model and data drifts, quality degradation, and accuracy loss, which may substantially impact the quality and soundness of the application itself.

Recent research points to solutions where ML models evolve according to contextual changes (e.g., a shift in the incoming data distribution), typically via continuous re-training and peculiar training algorithms and ML models [9], [10], [11]. Other solutions consider classifier selection where a (set of) ML model is statically or dynamically selected according to some criteria [12], [13], [14], [15]; in this context, dynamic selection identifies the most suitable ML model for each data point at inference time. Ensembles have also been considered to increase ML robustness [16], [17], [18], [19], [20]. Finally, some solutions have initially discussed certification-based assessment of ML-based applications [7], [8], [21]. Current approaches however fall short in supporting the requirements of modern
© 2024 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see
https://creativecommons.org/licenses/by/4.0/
ANISETTI et al.: CONTINUOUS MANAGEMENT OF MACHINE LEARNING-BASED APPLICATION BEHAVIOR 113
attributes [23]. Common properties include performance, confidentiality, integrity, availability [36]. When an ML model is considered, the notion of property is redesigned [7] as follows.

Definition 2 (Non-Functional Property). A non-functional property p is a pair p = (p̂, S), where p̂ is an abstract property taken from a shared controlled vocabulary [22] and S is a score function of the form S: {et} → R quantitatively describing how much an ML model supports p̂ according to its execution traces.

In the following, we use the dotted notation to refer to the components of p (e.g., p.S).

Example 3 (Non-Functional Property). Following Example 2, property fairness can be defined as p_fairness = (fairness, variance-over-gender-race), where the score function i) generates a number of synthetic data points dp covering all the possible combinations of protected attributes gender and race; ii) sends each dp to the model; and iii) measures the variance σ² over the predicted bails.

We note that the higher the variance, the lower the support for property fairness.

Non-functional properties of ML can be peculiar properties purposefully defined for ML evaluation (e.g., adversarial robustness [21]) or a new interpretation of traditional ones (e.g., the integrity of the predictions) [7]. Fig. 2 shows a portion of our taxonomy of non-functional properties, which has been fully presented in our previous work [22]. The taxonomy includes generic properties, which are then refined by detailed properties. For example, transparency is a generic property with two sub-properties: i) explainability, the capability to explain the model, on the one hand, and individual decisions taken by the model, on the other hand; and ii) interpretability, the capability to predict the consequences on a model when changes are observed.

As another example, fairness is a generic property with multiple sub-properties. For each detailed property, different score functions can be defined. For instance, Fig. 2 shows two score functions for property individual fairness: variance σ² (used in this paper) and Shapley [37]. Score functions in the taxonomy are general, though we note that they need to be refined and instantiated in the context of an evaluation process for a specific ML-based application.

C. The MAB

We use the Multi-Armed Bandit (MAB) technique [22] to compare models according to a non-functional property p on a set of execution traces. MAB repeatedly executes an experiment, whose goal is to get the highest reward that can be earned by executing a specific action chosen among a set of alternatives. Every action returns a reward or a penalty with different (and unknown) probabilities. The experiment is commonly associated with the problem of a gambler facing different slot machines (or a single slot machine with many arms retrieving different results). In our scenario, the actions are the models m_i in the candidate list cl and the reward is based on the score function p.S in Definition 2.

Definition 3 (MAB). Let cl be the set of candidate models {m_1, ..., m_k}, each associated with an unknown reward v_m for non-functional property p. The goal of the MAB is to select the model m* providing the highest reward in a set of experiments (i.e., a set of execution traces). A probability distribution f_m(y | θ) drives experiments' rewards, with y the observed reward and θ a collection of unknown parameters that must be learned through experimentation. MAB is based on Bayesian inference, considering that, in each experiment, the success/failure odds of each model are unknown and can be shaped with a Beta probability distribution. Let m be a model; its Beta distribution Beta_m is based on two parameters α, β > 0 (denoted as α_m and β_m, resp.) and its probability density function can be represented as

$$\mathit{Beta}_m(x; \alpha_m, \beta_m) = \frac{x^{\alpha_m - 1}(1 - x)^{\beta_m - 1}}{B(\alpha_m, \beta_m)} \tag{1}$$

where the normalization function B is the Euler beta function

$$B(\alpha_m, \beta_m) = \int_0^1 x^{\alpha_m - 1}(1 - x)^{\beta_m - 1}\, dx \tag{2}$$

Thompson sampling [38] pulls models in cl, as a new trace et is received from the application, by sampling the models' Beta distributions. The model with the highest sampled reward (denoted as m*) is then evaluated according to p.S and et. A comparison of the score function output against a threshold determines the success or failure of this evaluation. Beta_{m*} is then updated accordingly, such that m* is pulled more frequently in case of successful evaluation (α_{m*} increased by 1), and less frequently (β_{m*} increased by 1) otherwise.

Let y_t denote the set of observations recorded up to the t-th execution trace et_t. The optimal model m* is selected according
116 IEEE TRANSACTIONS ON SERVICES COMPUTING, VOL. 18, NO. 1, JANUARY/FEBRUARY 2025
to probability winner_{m,t}:

$$\mathit{winner}_{m,t} = P(m^{\ast} \mid y_t) = \int \mathbb{1}\!\left[ m = \operatorname*{arg\,max}_{m' \in cl} v_{m'}(\theta) \right] p(\theta \mid y_t)\, d\theta \tag{3}$$

where $\mathbb{1}$ is the indicator function and p(θ | y_t) is the Bayesian posterior probability distribution of θ given the observations up to the t-th execution trace. The MAB terminates when all experiments end, that is, when all traces have been received.

The optimal model m* is used by the application (i.e., m̂ = m*) [22]. We note that, while effective at application startup, the MAB cannot be continuously applied at run time as new traces come. For this reason, the MAB in this section (Static MAB in the following) is only used for static model selection at development time. We then define in Section IV a Dynamic MAB as the extension of the Static MAB for run-time model selection and substitution.

IV. MODEL ASSESSMENT: DYNAMIC MAB

Process model assessment compares ML models at run time according to their non-functional behavior. It takes as input the models in the candidate list cl and the non-functional property p, and returns as output the models' Beta distributions. Model assessment uses the Static MAB within an evaluation window w of |w| execution traces, and then shifts the window of |w| execution traces, instantiating a new Static MAB.

The window size |w| can be fixed or variable. When w has fixed size |w|, the Dynamic MAB may not reach statistical relevance to take a decision; in this case, i) the outcome can be sub-optimal or ii) the evaluation can be extended to the next window.

When w has variable size |w|, our default approach, the MAB terminates the evaluation and moves to the next window only when a statistically relevant decision can be made. It is based on the value remaining in the experiment [39], a tunable strategy that controls both the estimation error and the window size requested to reach a valuable decision. In the following, we present our solutions based on variable window sizes, namely Dynamic MAB with Variable Window (DMVW) and DMVW with Memory (DMVW-Mem).

A. Dynamic MAB With Variable Window (DMVW)

The Dynamic MAB with Variable Window (DMVW) implements the value remaining in the experiment using a Monte Carlo simulation. The simulation considers a random set g of sampled draws from the models' Beta distributions. It then counts the frequency of each model being the winner in g as an estimation of the corresponding probability distribution.

The value remaining in the experiment is based on the minimization of the "regret" (the missed reward) due to an early terminated experiment. Let θ_0 denote the value of θ and m* = arg max_{m∈cl} v_m(θ_0) the optimal model at the end of a window w. The regret due to early termination of an experiment within window w is represented by v_{m*}(θ_0) − v_{m*,t}(θ_0), which is the difference between i) the reward v_{m*}(θ_0) of the optimal model m* retrieved at the end of window w and ii) the reward v_{m*,t}(θ_0) of the optimal model m*,t retrieved at execution trace et_t.

Considering that the regret is not directly observable, it can be computed using the posterior probability distribution. Let us consider v*(θ^{(g)}) = max_{m∈cl} v_m(θ^{(g)}), where θ^{(g)} is drawn from p(θ | y_t). The "regret" r in g is r^{(g)} = v*(θ^{(g)}) − v_{m*,t}(θ_0), which derives from the regret posterior probability distribution. We note that v*(θ^{(g)}) is the maximum available value within each Monte Carlo draw set g and v_{m*,t}(θ^{(g)}) is the value (alike taken in g) for the best arm within each Monte Carlo simulation. Regret is expressed as the percentage of the deviation from the model identified as the winner, so that draws from the posterior probability are given as follows:

$$r^{(g)} = \frac{v^{\ast}(\theta^{(g)}) - v_{m^{\ast,t}}(\theta^{(g)})}{v_{m^{\ast,t}}(\theta^{(g)})} \tag{4}$$

The experiment completes when 95% of the samples of a simulation have a residual value less than a given percentage (residual_r) of the value of the best model v_{m*,t}(θ_0). Formally, a window can be closed when percentile(r^{(g)}, 95) ≤ v_{m*,t}(θ_0) × residual_r. A common value for residual_r is 1%; it can be increased to reduce the window size, while leading to a greater residual. We note that the window size can be tuned in terms of the acceptable regret using residual_r.

In a nutshell, DMVW takes a decision based on the execution traces in a specific window w only. A new MAB is executed from scratch in each window, potentially leading to a discontinuous model comparison. Due to this effect, DMVW can produce fluctuations in the selection of the optimal model m* to be used by the application. To address these issues, we extend the DMVW with the notion of memory in Section IV-B.

B. DMVW With Memory (DMVW-Mem)

The DMVW with memory (DMVW-Mem) keeps track of past DMVW executions to smooth the discontinuity among consecutive windows. DMVW-Mem for window w_j is defined on the basis of the Beta distributions and corresponding parameters in window w_{j−1} as follows.

Definition 4 (DMVW with memory (DMVW-Mem)). A DMVW-Mem is a DMVW where the Beta distribution Beta_{m,j} of each model m in window w_j is initialized on the basis of the Beta distribution Beta_{m,j−1} of the corresponding model m in window w_{j−1}, as follows: i) α_{m,j} = α_{m,j−1} × δ; ii) β_{m,j} = β_{m,j−1} × δ, where δ ∈ [0, 1] denotes the memory size, and α_{m,j−1} and β_{m,j−1} are the α and β of Beta distribution Beta_{m,j−1} of model m in window w_{j−1}. We note that the resulting α_{m,j} and β_{m,j} are rounded down and set to 1 when equal to 0.

In other words, DMVW-Mem initializes the Beta distributions in each window w_j according to the Beta distribution parameters observed in window w_{j−1}.
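The machinery above, Thompson-sampling pulls (Section III-C), the value-remaining stopping rule of (4), and the memory carry-over of Definition 4, can be sketched in a few lines of Python. This is our own illustration, not the paper's implementation: the function names, the `(alpha, beta)` tuple bookkeeping keyed by model name, and the default parameters are assumptions.

```python
import random

def thompson_pull(beta, rng):
    """One Thompson-sampling pull: sample every model's Beta distribution
    and return the model with the highest sampled reward."""
    sampled = {m: rng.betavariate(a, b) for m, (a, b) in beta.items()}
    return max(sampled, key=sampled.get)

def value_remaining_closes(beta, residual_r=0.01, n_draws=1000, rng=None):
    """Value remaining in the experiment (eq. (4), sketch): estimate the
    regret posterior by Monte Carlo sampling of the Beta distributions and
    close the window when the 95th percentile of the regret is at most
    residual_r times the reward of the current best model."""
    rng = rng or random.Random(0)
    # Current winner and its expected reward alpha / (alpha + beta).
    best = max(beta, key=lambda m: beta[m][0] / sum(beta[m]))
    v_best = beta[best][0] / sum(beta[best])
    regrets = sorted(
        (max(draw.values()) - draw[best]) / draw[best]
        for draw in ({m: rng.betavariate(a, b) for m, (a, b) in beta.items()}
                     for _ in range(n_draws)))
    return regrets[int(0.95 * (n_draws - 1))] <= v_best * residual_r

def mem_init(beta_prev, delta):
    """DMVW-Mem initialization (Definition 4): carry a fraction delta of the
    previous window's Beta parameters over; round down, lift zeros to 1."""
    return {m: (int(a * delta) or 1, int(b * delta) or 1)
            for m, (a, b) in beta_prev.items()}
```

With these pieces, a window loop would pull a model per trace via `thompson_pull`, increment its α (success of p.S) or β (failure), check `value_remaining_closes`, and seed the next window with `mem_init`.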
Example 4. Following Example 3, let us assume that the current evaluation window w_11 in a given court terminates after 200 execution traces according to DMVW-Mem. The output of process model assessment is {Beta_{m1,11}, ..., Beta_{m5,11}}. Fig. 3 shows Beta_{m5,11}, where α_{m5,11} = 110 and β_{m5,11} = 2, meaning that m_5 has been frequently sampled and successfully evaluated. Let us then assume that the memory has size 10% (i.e., δ = 0.1). Fig. 3 shows Beta_{m5,12} defined for window w_12, which is initialized as: i) α_{m5,12} = 110 × 0.1 = 11; ii) β_{m5,12} = 2 × 0.1 = 0.2, which is rounded down and then set to 1 according to Definition 4.

V. MODEL SUBSTITUTION

Process model substitution is executed on the basis of process model assessment in Section IV. It takes as input the results of the DMVW-Mem evaluation in the current window, and returns as output the model to be selected and used by the application in the following window.

A. Ranking-Based Substitution

Ranking-based substitution ranks models m ∈ cl in a given window w_j and determines the model m̂ to be used in the following window w_{j+1}. Let us recall that α_m (β_m, resp.) is incremented by 1 when p.S is successfully (unsuccessfully, resp.) evaluated on trace et ∈ w_j (Section III-C). Ranking-based substitution is based on a metric evaluating how frequently each model is selected by Thompson Sampling and successfully evaluated in DMVW-Mem, as follows.

Definition 5 (Ranking Metric). Let w_j be a window and m a model. The value of ranking metric rm_{m,j} of m in w_j is retrieved as α_{m,j}/(α_{m,j} + β_{m,j}).

According to Definition 5, rm_{m,j} is the ratio between the number of successful evaluations of m (in terms of p.S) and the total number of draws computed by DMVW-Mem in w_j. It is retrieved for every model in cl and used for ranking.

Substitution: At the end of window w_j, the top-ranked model m̂ is selected and used within window w_{j+1}. The substitution happens when w_j terminates according to the value remaining in the experiment (Section IV-A).

B. Assurance-Based Substitution

Assurance-based substitution triggers early substitution of the selected model m̂ before window w terminates. It monitors m̂ by computing its assurance level as follows.

Definition 6 (Assurance Level). Let m̂ be the selected model and et_t ∈ w_j an execution trace. The assurance level al_t of m̂ given et_t is v_{m̂,t}(θ)/v*(θ^{(g)}).

According to Definition 6, al_t is the ratio between i) the reward v_{m̂,t}(θ) of the selected model m̂ retrieved at execution trace et_t and ii) the reward v*(θ^{(g)}) of the optimal model m*, according to the Monte Carlo simulation in DMVW-Mem (Section IV-A). We note that the assurance level can be retrieved for each model m_i using the corresponding reward as numerator.

The assurance level al is used to calculate the degradation of the selected model. Formally, let et_t be an execution trace in window w_j. The degradation of m̂ at et_t ∈ w_j is defined as follows:

$$\mathit{deg}_t = 1 - \frac{\sum_{i=1}^{t} al_i}{t} \tag{5}$$

Substitution: It works as the ranking-based substitution, but the selected model m̂ is substituted with the second model in the ranking before the window termination (i.e., early substitution), iff its degradation deg_t exceeds threshold thr (deg_t > thr).

Early substitution copes with transient changes within the window according to the degradation represented in thr. A high (low, resp.) threshold means high (low, resp.) tolerance. For instance, a high tolerance is preferable when the substitution overhead is high (e.g., when large models should be physically moved). A low tolerance is preferable when small variations in the properties of the deployed models have a strong impact on the application behavior. We note that, given its fundamental role in the substitution process, we experimentally evaluated the adoption of different degradation thresholds thr in Section VII-C.

Example 6. Following Example 5, let us consider model m_5 as the selected model and model m_4 as the second model in the ranking. Fig. 4 shows an example of the assurance levels of m̂_5 and m_4, denoted as al_{m̂5,t} and al_{m4,t}, respectively. Fig. 4 also
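The ranking metric of Definition 5 and the degradation of (5) reduce to a few lines; the following is a minimal sketch, with helper names of our own choosing rather than the paper's.

```python
def ranking_metric(alpha, beta):
    """Definition 5: rm = alpha / (alpha + beta), the fraction of
    successful p.S evaluations over all draws in the window."""
    return alpha / (alpha + beta)

def degradation(assurance_levels):
    """Eq. (5): deg_t = 1 - (sum of the assurance levels up to et_t) / t."""
    return 1 - sum(assurance_levels) / len(assurance_levels)

def should_substitute_early(assurance_levels, thr):
    """Assurance-based substitution: swap in the second-ranked model
    as soon as the degradation of the selected model exceeds thr."""
    return degradation(assurance_levels) > thr
```

For instance, with the α and β of Example 4, `ranking_metric(110, 2)` gives 110/112, i.e., a model almost always successfully evaluated.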
A. Experimental Settings

We considered the application for bail estimation and property fairness in our reference example in Section VI. In our experiments, we used the dataset of the Connecticut State Department of Correction.¹ This dataset provides a daily updated list of people detained in the Department's facilities awaiting a trial. It anonymously discloses data of individual people detained in the correctional facilities every day starting from July 1st, 2016. It contains attributes such as last admission date, race, gender, age, type of offence, and facility description, in more than four million data points (at the download date). We divided this set into training and test sets, where the training set includes more than 3 million points.

We modeled the score function p.S of property fairness as the variance (σ²) of the bail amount in relation to sensitive attributes gender and race [40], [41], [42]. Fig. 6 shows the pseudocode of the score function and its usage according to the threshold-based evaluation in Definition 3. We generated five Naive Bayes models cl = {m_1, ..., m_5}, each one trained on a training set randomly extracted from the main training set. The models showed similar performance, in terms of precision and recall, in bail estimation. We also extracted 10 test sets corresponding to 10 individual experiments exp_1–exp_10 to be used in our experimental evaluation.

Experiments have been run on a laptop running Microsoft Windows 10, equipped with a CPU Intel Core i7 @ 2.6 GHz and 16 GB of RAM, using Python 3 with libraries numpy v1.19.1 [43], pandas v1.2.5 [44], [45], and scikit-learn v0.22.1 [46]. Datasets, code, and experimental results are available at https://doi.org/10.13130/RD_UNIMI/2G3CVO.

Fig. 6. Pseudocode of the score function of property fairness and its usage.

¹Available at https://data.ct.gov/Public-Safety/Accused-Pre-Trial-Inmates-in-Correctional-Faciliti/b674-jy6w and downloaded on February 21st, 2020.

B. Model Assessment

We present the experimental evaluation of our Static MAB for model assessment at development time. We compare the five Naive Bayes models using the Static MAB approach, by evaluating their behavior with respect to non-functional property fairness. Table II shows the Thompson Sampling draws for the five models in the candidate list on a randomly chosen sample (2,000 data points) for each of the 10 experiments.

Table II shows the distribution of models selected as best candidate (denoted in bold) for property fairness. Since m_3 is never selected as the best candidate, it is removed from the candidate list for the rest of the experimental evaluation. We note that comparing models based on the same algorithm (i.e.,
TABLE II. Static MAB comparison in terms of Thompson Sampling draws on a random sample of 2,000 data points for each experiment.

Fig. 7. Individual window sizes and moving average trends across all sets of execution traces with different memory sizes for exp_1.

Fig. 8. The selected model for each execution trace et of experiment exp_1 with different memory sizes δ.
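Fig. 6 (the paper's pseudocode) is not reproduced here; the following is a hedged sketch of a score function in the spirit of its description in Section VII-A, synthetic points over all gender × race combinations and the variance of the predicted bail. The attribute values, the `model` callable interface, and the threshold are illustrative assumptions, not the paper's artifact.

```python
from itertools import product
from statistics import pvariance

def fairness_score(model, trace, genders=("F", "M"), races=("A", "B", "C")):
    """Score p.S for property fairness (sketch): generate one synthetic
    data point per (gender, race) combination by overwriting the protected
    attributes of the incoming trace, query the model, and return the
    variance of the predicted bails (higher variance = less fair)."""
    bails = []
    for g, r in product(genders, races):
        dp = dict(trace, gender=g, race=r)  # synthetic data point
        bails.append(model(dp))
    return pvariance(bails)

def evaluate(model, trace, threshold):
    """Threshold-based evaluation (Definition 3): low variance means good
    support for fairness, hence a successful evaluation."""
    return fairness_score(model, trace) <= threshold
```

A model that ignores the protected attributes yields variance 0 and always succeeds; a model whose predicted bail depends on gender or race yields a positive variance and fails once the threshold is crossed.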
C. Model Substitution

We present the experimental evaluation of our process model substitution using DMVW-Mem with different memory sizes (δ_0 = 0%, δ_5 = 5%, δ_10 = 10%, δ_25 = 25%). We evaluated i) the impact of the memory on the window size, ii) the impact of the ranking-based substitution in terms of stability of model selections, iii) the quality of the ranking-based substitution, and iv) the quality of the assurance-based substitution. We note that no artificial degradation was introduced during the experiments.

1) Memory Size and Ranking: Fig. 7 shows the window size varying the memory in experiment exp_1 with residual threshold residual_r = 0.01 (Section IV-A). We note that a bigger memory corresponds to a smaller window. This is expected, since the DMVW-Mem does not start from scratch in every window, and the more DMVW-Mem knows about the models' Beta distributions, the sooner the value remaining in the experiment reaches the threshold. Considering all the experiments, the average window size for δ_25 is 157, confirming the trend in Fig. 7.

Let us now consider the model selected according to the DMVW-Mem ranking. Fig. 8 shows the selected model for each set of execution traces in experiment exp_1, considering different memory sizes. We note that extemporaneous changes on the selected model are frequent without memory (δ_0), less frequent with δ_5, where clusters of continuously selected models emerge, and highly infrequent with δ_10. Fig. 8(d) shows a stable selection of model m_2, while models m_4 and m_5 are often not selected, preferring m_1 instead. Considering the entire ranking, model m_1 is ranked at the second position with δ_25, while m_4 at the third position.

In general, we observe that the number of changes across the experiments, in terms of selected models, decreases as the memory increases. On average, across all experiments, it decreases by 41.18% when memory increases from δ_5 to δ_10 (from 34 changes on average with δ_5 to 20 changes on average with δ_10); it decreases by 20% when memory increases from δ_10 to δ_25 (from 20 changes on average with δ_10 to 16 changes on average with δ_25).

Fig. 9 shows an aggregated ranking for all the experiments with δ_10. It shows the percentage of times a model has been
TABLE III
COMPARISON WITH RELATED WORK
this scenario. Mousavi et al. [48] also used oversampling. Static selection then defines the ensemble and its combiner (e.g., majority voting). Dynamic selection finally retrieves a subset of the ensemble for each data point. Pérez-Gállego et al. [49] focused on quantification tasks with drifts between classes. The proposed dynamic ensemble selection uses a specifically designed criterion, selecting the classifiers whose training distribution is the most similar to the input data points. Our approach implements a dynamic classifier selection, which departs from existing solutions implementing a (dynamic) selection of a (set of) classifiers for each data point to maximize accuracy at inference time. Our goal is rather the run-time selection and substitution of the ML model with the aim of guaranteeing a stable behavior of the application with respect to a specific (set of) non-functional property.

Functional adaptation: it refers to the techniques that adapt an ML model (and application) according to changing conditions, notably a drift, to keep quality metrics high. According to the survey by Lu et al. [53], the possible actions upon a detected drift are: training and using a new ML model, using an ensemble purposefully trained for drift, and adapting an existing ML model when the drift is localized to a region. The issue of drift has been approached using dynamic classifier selection. For instance, Almeida et al. [15] designed a drift detector whose selection criterion considers both spatial and concept-based information. It relies on a set of diverse classifiers that is dynamically updated, removing unnecessary classifiers and training new ones as new concepts emerge. Tahmasbi et al. [9] designed a novel adaptive ML model. It uses one classifier at a time, and, upon drift detection, selects the subsequent classifier with the highest quality in the last evaluation window. Our approach implements an adaptation process, which departs from existing solutions based on the online re-training of individual ML models according to drift or the selection of ML models that maximize quality under drift. Our goal is rather the adaptation of the overall ML-based application according to an (arbitrary) non-functional property of interest.

Non-functional adaptation: it refers to the techniques that adapt an ML model (and application) according to a non-functional property. Fairness is the most studied property in the literature in both static and dynamic settings; we focus on the latter due to its connection with the work in this paper. For instance, Iosifidis et al. [32] designed an approach that tackles fairness and concept drift. It uses two pre-processing techniques modifying data, which are then taken as input by classifiers that can natively adapt to concept drifts (e.g., Hoeffding trees). A similar solution is proposed by Badar et al. [50] in federated learning. It first detects drift, and then evaluates if fairness is no longer supported. It then performs oversampling as a countermeasure. Zhang et al. [10], [11] introduced a training algorithm based on Hoeffding trees, whose splitting criterion considers fairness and accuracy. Such an idea has also been applied to random forest models [33]. Iosifidis et al. [34] designed an online learning algorithm that detects class imbalance and lack of fairness, and adjusts the ML model accordingly. It fixes weights during boosting (for imbalance) and the learned decision boundary (for fairness). Our approach implements an adaptation process, which departs from existing re-training solutions using a custom algorithm focused on a specific property (fairness). Our goal is rather the adaptation of the overall application behavior according to any non-functional property and ML algorithm.

Table III shows how our approach compares with the related work in terms of Category (denoted as Cat.), Objective, Objective Type, and Applicability. Category can be i) classifier and ensemble selection (denoted as S), ii) functional adaptation (denoted as FA), and iii) non-functional adaptation (denoted as NFA). Objective Type can be i) functional (denoted as F), and ii) non-functional (denoted as NF). Applicability is expressed in terms of i) ML Model (✓ if applicable to any ML algorithm, ≈ if applicable to a class of ML algorithms, ✗ if applicable to a specific ML algorithm only); ii) Property (✓ if applicable to any (non-)functional property, ≈ if applicable to a class of (non-)functional properties, ✗ if applicable to a specific (non-)functional property). Table III shows that our approach
(last row in Table III) is the only architectural and methodological solution that supports stable non-functional behavior of ML-based applications. It builds on a smart and dynamic multi-model substitution, departing from expensive re-training approaches and from inference-time classifier selection for individual data points.

IX. CONCLUSION

We presented a multi-model approach for the continuous management of the non-functional behavior of ML-based applications. Our approach guarantees a stable application behavior at run time, over time, and across model changes, where multiple ML models with similar non-functional properties are available and one model is selected at a time according to such properties and the application context. Our approach manages (dynamic and unpredictable) contextual changes in modern ML deployments, supporting early model substitutions based on Dynamic MAB and assurance evaluation.
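The MAB-based selection summarized above can be sketched as a Bernoulli Thompson-sampling bandit over the candidate models, where each arm's reward is whether the deployed model satisfied the target non-functional property (e.g., a fairness threshold) on the latest evaluation window, in the spirit of [38], [39]. The class name, the reward encoding, and the simulated satisfaction rates below are illustrative assumptions, not the paper's API; a minimal sketch follows.

```python
import random


class DynamicModelBandit:
    """Thompson-sampling selector over candidate ML models.

    Each model is a bandit arm with a Beta(alpha, beta) posterior over
    the probability that it satisfies the target non-functional
    property (e.g., fairness) in the current context.
    """

    def __init__(self, model_ids):
        # Beta(1, 1) uniform prior for every candidate model.
        self.stats = {m: [1, 1] for m in model_ids}  # [alpha, beta]

    def select(self):
        # Sample a satisfaction probability per model; deploy the best draw.
        draws = {m: random.betavariate(a, b) for m, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, model_id, satisfied):
        # Reward 1 if the assurance evaluation passed on the last window.
        a, b = self.stats[model_id]
        self.stats[model_id] = [a + 1, b] if satisfied else [a, b + 1]


# Illustrative loop: model "m2" satisfies the property 90% of the time,
# the others 40%; the bandit should converge toward deploying "m2".
random.seed(0)
bandit = DynamicModelBandit(["m1", "m2", "m3"])
true_rate = {"m1": 0.4, "m2": 0.9, "m3": 0.4}
for _ in range(500):
    m = bandit.select()
    bandit.update(m, random.random() < true_rate[m])
best = max(bandit.stats, key=lambda m: bandit.stats[m][0])
```

The sketch keeps only the posterior counts per model, so a model substitution is a constant-time decision per evaluation window; plugging in a drift-aware (Dynamic) variant would amount to discounting or resetting the counts on contextual change.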
REFERENCES

[1] E. Damiani and C. Ardagna, “Certified machine-learning models,” in Proc. Int. Conf. Curr. Trends Theory Pract. Comput. Sci., Limassol, Cyprus, 2020, pp. 3–15.
[2] T. L. Duc, R. G. Leiva, P. Casari, and P.-O. Östberg, “Machine learning methods for reliable resource provisioning in edge-cloud computing: A survey,” ACM Comput. Surv., vol. 52, no. 5, pp. 1–39, 2019.
[3] European Union, “Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence and amending Regulations (EC) No 300/2008, (EU) No 167/2013, (EU) No 168/2013, (EU) 2018/858, (EU) 2018/1139 and (EU) 2019/2144 and Directives 2014/90/EU, (EU) 2016/797 and (EU) 2020/1828 (Artificial Intelligence Act), Text with EEA relevance,” 2024. [Online]. Available: https://eur-lex.europa.eu/eli/reg/2024/1689/oj
[4] K. Meng, Z. Wu, M. Bilal, X. Xia, and X. Xu, “Blockchain-enabled decentralized service selection for QoS-aware cloud manufacturing,” Expert Syst., 2024, Art. no. e13602.
[5] B. Qolomany, I. Mohammed, A. Al-Fuqaha, M. Guizani, and J. Qadir, “Trust-based cloud machine learning model selection for industrial IoT and smart city services,” IEEE Internet Things J., vol. 8, no. 4, pp. 2943–2958, Feb. 2021.
[6] F. Ishikawa and N. Yoshioka, “How do engineers perceive difficulties in engineering of machine-learning systems? – Questionnaire survey,” in Proc. IEEE/ACM Joint 7th Int. Workshop Conducting Empirical Stud. Ind.–6th Int. Workshop Softw. Eng. Res. Ind. Pract., Montreal, Canada, 2019, pp. 2–9.
[7] M. Anisetti, C. A. Ardagna, N. Bena, and E. Damiani, “Rethinking certification for trustworthy machine-learning-based applications,” IEEE Internet Comput., vol. 27, no. 6, pp. 22–28, Nov./Dec. 2023.
[8] K. Brecker, S. Lins, and A. Sunyaev, “Artificial intelligence systems’ impermanence: A showstopper for assessment?,” in Proc. Workshop Inf. Technol. Syst., Hyderabad, India, 2023.
[9] A. Tahmasbi, E. Jothimurugesan, S. Tirthapura, and P. B. Gibbons, “DriftSurf: Stable-state/reactive-state learning under concept drift,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 10054–10064.
[10] W. Zhang and E. Ntoutsi, “FAHT: An adaptive fairness-aware decision tree classifier,” in Proc. Int. Joint Conf. Artif. Intell., 2019, pp. 1480–1486.
[11] W. Zhang et al., “Flexible and adaptive fairness-aware learning in non-stationary data streams,” in Proc. IEEE 32nd Int. Conf. Tools Artif. Intell., Baltimore, MD, USA, 2020, pp. 399–406.
[12] A. Roy, R. M. O. Cruz, R. Sabourin, and G. D. C. Cavalcanti, “A study on combining dynamic selection and data preprocessing for imbalance learning,” Neurocomputing, vol. 286, pp. 179–192, 2018.
[13] Z.-L. Zhang and Y.-H. Zhu, “DES-SV: Dynamic ensemble selection based on Shapley value,” SSRN Preprint SSRN:4608310, 2023.
[14] X. Zhu, J. Ren, J. Wang, and J. Li, “Automated machine learning with dynamic ensemble selection,” Appl. Intell., vol. 53, no. 20, pp. 23596–23612, 2023.
[15] P. R. L. Almeida, L. S. Oliveira, A. S. Britto, and R. Sabourin, “Adapting dynamic classifier selection for concept drift,” Expert Syst. Appl., vol. 104, pp. 67–85, 2018.
[16] R. Chen, Z. Li, J. Li, J. Yan, and C. Wu, “On collective robustness of bagging against data poisoning,” in Proc. Int. Conf. Mach. Learn., Baltimore, MD, USA, 2022, pp. 3299–3319.
[17] J. Jia, X. Cao, and N. Z. Gong, “Intrinsic certified robustness of bagging against data poisoning attacks,” in Proc. Conf. Assoc. Advance. Artif. Intell., 2021, pp. 7961–7969.
[18] A. Levine and S. Feizi, “Deep partition aggregation: Provable defenses against general poisoning attacks,” in Proc. Int. Conf. Learn. Representations, Vienna, Austria, 2021.
[19] W. Wang, A. Levine, and S. Feizi, “Improved certified defenses against data poisoning with (deterministic) finite aggregation,” in Proc. Int. Conf. Mach. Learn., Baltimore, MD, USA, 2022, pp. 22769–22783.
[20] M. Anisetti, C. A. Ardagna, A. Balestrucci, N. Bena, E. Damiani, and C. Y. Yeun, “On the robustness of random forest against untargeted data poisoning: An ensemble-based approach,” IEEE Trans. Sustain. Comput., vol. 8, no. 4, pp. 540–554, Fourth Quarter, 2023.
[21] N. Bena, M. Anisetti, G. Gianini, and C. A. Ardagna, “Certifying accuracy, privacy, and robustness of ML-based malware detection,” SN Comput. Sci., vol. 5, 2024, Art. no. 710.
[22] M. Anisetti, C. A. Ardagna, E. Damiani, and P. G. Panero, “A methodology for non-functional property evaluation of machine learning models,” in Proc. Int. Conf. Manage. Digit. Ecosyst., Abu Dhabi, UAE, 2020, pp. 38–45.
[23] M. Anisetti, C. A. Ardagna, and N. Bena, “Multi-dimensional certification of modern distributed systems,” IEEE Trans. Serv. Comput., vol. 16, no. 3, pp. 1999–2012, May/Jun. 2023.
[24] C. A. Ardagna and N. Bena, “Non-functional certification of modern distributed systems: A research manifesto,” in Proc. IEEE Int. Conf. Softw. Serv. Eng., Chicago, IL, USA, 2023, pp. 71–79.
[25] J. R. Gunasekaran, C. S. Mishra, P. Thinakaran, B. Sharma, M. T. Kandemir, and C. R. Das, “Cocktail: A multidimensional optimization for model serving in cloud,” in Proc. USENIX Symp. Netw. Syst. Des. Implementation, Renton, WA, USA, 2022, pp. 1041–1057.
[26] F. Doshi-Velez and B. Kim, “Towards a rigorous science of interpretable machine learning,” 2017, arXiv:1702.08608.
[27] M. Anisetti, N. Bena, F. Berto, and G. Jeon, “A DevSecOps-based assurance process for big data analytics,” in Proc. IEEE Int. Conf. Web Serv., Barcelona, Spain, 2022, pp. 1–10.
[28] C. A. Ardagna, R. Asal, E. Damiani, and Q. Vu, “From security to assurance in the cloud: A survey,” ACM Comput. Surv., vol. 48, no. 1, pp. 1–50, 2015.
[29] S. Lins, S. Schneider, and A. Sunyaev, “Trust is good, control is better: Creating secure clouds by continuous auditing,” IEEE Trans. Cloud Comput., vol. 6, no. 3, pp. 890–903, Third Quarter, 2018.
[30] M. Hosseinzadeh, H. K. Hama, M. Y. Ghafour, M. Masdari, O. H. Ahmed, and H. Khezri, “Service selection using multi-criteria decision making: A comprehensive overview,” J. Netw. Syst. Manage., vol. 28, no. 4, pp. 1639–1693, 2020.
[31] C. A. Ardagna, R. Asal, E. Damiani, T. Dimitrakos, N. El Ioini, and C. Pahl, “Certification-based cloud adaptation,” IEEE Trans. Serv. Comput., vol. 14, no. 1, pp. 82–96, Jan./Feb. 2021.
[32] V. Iosifidis, T. N. H. Tran, and E. Ntoutsi, “Fairness-enhancing interventions in stream classification,” in Proc. Int. Conf. Database Expert Syst. Appl., Linz, Austria, 2019, pp. 261–276.
[33] W. Zhang, A. Bifet, X. Zhang, J. C. Weiss, and W. Nejdl, “FARF: A fair and adaptive random forests classifier,” in Proc. Pacific-Asia Conf. Knowl. Discov. Data Mining, 2021, pp. 245–256.
[34] V. Iosifidis, W. Zhang, and E. Ntoutsi, “Online fairness-aware learning with imbalanced data streams,” 2021, arXiv:2108.06231.
[35] M. Anisetti, C. A. Ardagna, and N. Bena, “Continuous certification of non-functional properties across system changes,” in Proc. Int. Conf. Serv.-Oriented Comput. (ICSOC), Rome, Italy, 2023, pp. 3–18.
[36] R. N. Taylor, N. Medvidović, and E. M. Dashofy, Software Architecture: Foundations, Theory, and Practice. Hoboken, NJ, USA: John Wiley & Sons, 2009.
[37] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan, “A survey on bias and fairness in machine learning,” ACM Comput. Surv., vol. 54, no. 6, pp. 1–35, 2021.
[38] O. Chapelle and L. Li, “An empirical evaluation of Thompson sampling,” in Proc. Int. Conf. Neural Inf. Process. Syst., Granada, Spain, 2011, pp. 2249–2257.
[39] S. L. Scott, “Multi-armed bandit experiments in the online service economy,” Appl. Stochastic Models Bus. Ind., vol. 31, no. 1, pp. 37–45, 2015.
[40] L. Floridi, M. Holweg, M. Taddeo, J. Amaya Silva, J. Mökander, and Y. Wen, “CapAI – A procedure for conducting conformity assessment of AI systems in line with the EU Artificial Intelligence Act,” SSRN Preprint SSRN:4064091, 2022.
[41] S. Maghool, E. Casiraghi, and P. Ceravolo, “Enhancing fairness and accuracy in machine learning through similarity networks,” in Proc. Int. Conf. Cooperative Inf. Syst., Groningen, The Netherlands, 2023, pp. 3–20.
[42] G. Vargas-Solar, C. Ghedira-Guégan, J. A. Espinosa-Oviedo, and J.-L. Zechinelli-Martin, “Embracing diversity and inclusion: A decolonial approach to urban computing,” in Proc. 20th ACS/IEEE Int. Conf. Comput. Syst. Appl., Giza, Egypt, 2023, pp. 1–6.
[43] C. R. Harris et al., “Array programming with NumPy,” Nature, vol. 585, no. 7825, pp. 357–362, Sep. 2020.
[44] W. McKinney, “Data structures for statistical computing in Python,” in Proc. Python Sci. Conf., Austin, TX, USA, 2010, pp. 51–56.
[45] The Pandas Development Team, “pandas-dev/pandas: Pandas,” Feb. 2020. [Online]. Available: https://doi.org/10.5281/zenodo.3509134
[46] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.
[47] R. M. O. Cruz, R. Sabourin, G. D. C. Cavalcanti, and T. Ing Ren, “META-DES: A dynamic ensemble selection framework using meta-learning,” Pattern Recognit., vol. 48, no. 5, pp. 1925–1935, 2015.
[48] R. Mousavi, M. Eftekhari, and F. Rahdari, “Omni-ensemble learning (OEL): Utilizing over-bagging, static and dynamic ensemble selection approaches for software defect prediction,” J. Artif. Intell. Technol., vol. 27, no. 6, 2018, Art. no. 1850024.
[49] P. Pérez-Gállego, A. Castaño, J. Ramón Quevedo, and J. José del Coz, “Dynamic ensemble selection for quantification tasks,” Inf. Fusion, vol. 45, pp. 1–15, 2019.
[50] M. Badar, W. Nejdl, and M. Fisichella, “FAC-Fed: Federated adaptation for fairness and concept drift aware stream classification,” Mach. Learn., vol. 112, pp. 2761–2786, 2023.
[51] R. M. Cruz, R. Sabourin, and G. D. Cavalcanti, “Dynamic classifier selection: Recent advances and perspectives,” Inf. Fusion, vol. 41, pp. 195–216, 2018.
[52] I. Khan, X. Zhang, M. Rehman, and R. Ali, “A literature survey and empirical study of meta-learning for classifier selection,” IEEE Access, vol. 8, pp. 10262–10281, 2020.
[53] J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, and G. Zhang, “Learning under concept drift: A review,” IEEE Trans. Knowl. Data Eng., vol. 31, no. 12, Dec. 2019.

Marco Anisetti (Senior Member, IEEE) is full professor with the Department of Computer Science, Università degli Studi di Milano. His research interests are in the area of computational intelligence and its application to the design and evaluation of complex systems. He has been investigating innovative solutions in the area of assurance evaluation of cloud security and AI. In this area, he defined a new scheme for continuous and incremental cloud security certification, based on a distributed assurance evaluation architecture.

Claudio A. Ardagna (Senior Member, IEEE) is full professor at the Department of Computer Science, Università degli Studi di Milano, the Director of the CINI National Lab on Data Science, and co-founder of Moon Cloud srl. His research interests are in the area of edge-cloud and AI security and assurance, and data science. He has published more than 170 articles and books. He has been visiting professor at Université Jean Moulin Lyon 3 and visiting researcher at BUPT, Khalifa University, and GMU.

Nicola Bena (Member, IEEE) is a postdoc with the Department of Computer Science, Università degli Studi di Milano. His research interests are in the area of security of modern distributed systems, with particular reference to certification, assurance, and risk management techniques. He has been visiting scholar at Khalifa University and at INSA Lyon.

Ernesto Damiani (Senior Member, IEEE) is full professor at the Department of Computer Science, Università degli Studi di Milano, where he leads the Secure Service-oriented Architectures Research (SESAR) Laboratory. He is also the Founding Director of the Center for Cyber-Physical Systems, Khalifa University, UAE. He received an Honorary Doctorate from INSA Lyon for his contributions to research and teaching on big data analytics. His research interests include cybersecurity, big data, artificial intelligence, and cloud-edge processing, and he has published over 680 peer-reviewed articles and books. He is a Distinguished Scientist of ACM and was a recipient of the 2017 Stephen Yau Award.

Paolo G. Panero is currently working toward the master's degree with the Department of Computer Science, Università degli Studi di Milano. He is an IT officer in an in-house public administration company, where he deals with innovation and IT services. His research interests are in the area of machine learning, with a focus on model evaluation.
Open Access provided by ‘Università degli Studi di Milano’ within the CRUI CARE Agreement