Model Based Performability and Dependability Evaluation of A System With VM Migration As Rejuvenation in The Presence of Bursty Workloads
https://doi.org/10.1007/s10922-021-09619-3
Abstract
Software aging accumulation leads to increased resource consumption. In this con-
text, the memory leak is one of the well-known problems related to software aging.
A bursty workload can accelerate software aging bug activation as it requires instan-
taneous resource allocation. Then, the rapid resource allocation and deallocation
may lead to software aging through memory leaks. Moreover, a bursty workload
may cause a resource exhaustion failure in a system already overloaded by software
aging accumulation. Virtual Machine (VM) migration schedules can be used to
mitigate software aging by moving services away from a compromised physical host.
Despite the considerable progress made in this area, the state of the art still lacks a
modeling framework for performability and dependability evaluation of VM migration
as rejuvenation in a system under bursty workloads. This paper proposes a set
of Stochastic Reward Nets (SRNs) to fill this research gap. We consider
five scenarios covering different bursty workload conditions, and present a specific
model to cover the uncertainties related to bursty workloads. Our results present the
specific rejuvenation schedule to maximize system performability and dependability
for each scenario. The proposed modeling framework may be useful to support vir-
tualized environment management decisions.
* Matheus Torquato
Extended author information available on the last page of the article
3 Page 2 of 33 Journal of Network and Systems Management (2022) 30:3
1 Introduction
• RQ1: What is the VM migration scheduling policy which maximizes the system
availability?
• RQ2: What is the VM migration scheduling policy which maximizes the system
throughput?
• RQ3: What are the system reliability levels when applying the VM migration pol-
icies that maximize system availability?
We consider a virtualized system with three main components: (i) VM, running
the user’s service, (ii) Main Node, the physical machine providing resources for the
VM execution, and (iii) Standby Node, which receives the VM migration. We apply
guard functions and marking-dependent firing rates in the model to cover the following
aspects: (i) influence of burst occurrence on the acceleration of resource exhaustion;
(ii) influence of the level of resource degradation on the VM migration process;
(iii) bursty workload uncertainties (i.e., duration, intensity, and probability).
We present three case studies to exercise our models. The first shows the sys-
tem steady-state availability evaluation; the second brings the system steady-state
throughput results; the third is about the reliability evaluation. We consider a set of
scenarios in each case study covering different Asset Classes. In this paper, an Asset
Class is the definition of the associated bursty workload conditions, namely, dura-
tion, intensity, and probability. Specifically, we consider five Asset Classes ranging
from light bursts to heavier bursts. Note that these Asset Classes are illustrative for
our evaluation purposes. Ideally, the burst incidence would be collected by
observing the behavior of the real system.
The highlights of this paper are:
– We propose a set of Petri Net models for system performability and dependabil-
ity evaluation considering software aging and bursty workloads;
– The results allow the investigation of proper rejuvenation policies to maximize
the system availability or the system throughput (in some cases, both);
– Each case study presents a set of scenarios to support the decision-making pro-
cess.
Wang et al. [63] and our previous works [54, 57] tackled similar problems considering
varying workload aspects. However, these works neglect the uncertainties
related to burst occurrence, such as probability, intensity, and duration. Moreover, this
paper presents a more comprehensive result set covering availability, reliability, and
system throughput. A more detailed comparison with related works is in Sect. 7.
In this paper, we apply VM migration scheduling only as software rejuvenation,
which means that the proposed rejuvenation policy is not the direct countermeasure
to bursty workloads. Our goal is to extend previous works, such as [23, 24, 36, 54,
57], which investigate VM migration as software rejuvenation but neglect the impact
of the possible occurrence of bursty workloads in the system metrics. Nevertheless,
2 Motivation
Most public clouds, such as Google Cloud (footnote 2) and Amazon AWS (footnote 3), provide
specific Service Level Agreements (SLAs) for system availability. A breach of
an availability SLA brings revenue loss for both the provider and the client [43].
Nowadays, even smaller and private clouds may host applications that demand high
dependability and performance levels.
The main problem here is how to set up a consolidated evaluation approach to
take into account these desired metrics. In scenarios where it is hard to conduct the
evaluation based on measurements, we can use the modeling approach [16]. In
performability and dependability evaluation, measurement-based evaluation is in most
cases impractical because dependability events are infrequent (e.g., the mean
time to failure of a system can be above 1000 h [46]).
The use of the modeling approach may solve the problem for the performability
and dependability evaluation. However, a challenge is to set up realistic models. In
this research, we take a step towards more realistic models trying to cover specific
details of VM migration and the workload’s influence in the software aging process.
To the best of our knowledge, this is the first research effort towards the performability,
availability, and reliability evaluation of a system with VM migration scheduling
subject to software aging and bursty workloads. Given the relevance of these aspects, the
models may be useful to design management policies and SLAs for virtualized environment
users.
1 Bursty workload occurrence usually causes a system utilization peak.
2 https://cloud.google.com/
3 https://aws.amazon.com/
3 Methodology
4 Assumptions
This section presents our assumptions for this work. It has four subsections. Sec-
tion 4.1 presents the details of the considered system architecture. Section 4.2
describes the failure and operational modes. Section 4.3 contains the approach for
software aging and rejuvenation modeling. Finally, Sect. 4.4 explains our strategy
for bursty workload modeling.
4.1 System Architecture
The considered system architecture has three components: VM, Main Node, and
Standby Node. The physical machines are in a private network. We assume this virtualized
environment is liable to receive bursty workloads. The details of how we
incorporate the bursty workload behavior in the models are in Sect. 4.4. Figure 2
summarizes the system architecture.
The proposed model and evaluations are focused on the architecture presented
in Fig. 2. This architecture is simpler than the ones usually used in cloud data cent-
ers. Rather than proposing a model for a large data center, we decided to consider
a simpler architecture, but include details regarding the system behavior which
were missing in previous works (e.g., pre-copy phase of VM live migration, using
marking-dependent firing rates to cover software aging effects, and bursty workload
occurrence). This decision leads to a complex model which is difficult to scale up
due to state-space explosion problems [61]. We highlight two possible approaches
to cope with the scalability of the model. (1) Simulation—using simulation frame-
works (e.g., SimPy4) it is possible to simulate the aspects considered in the models
4 https://simpy.readthedocs.io/en/latest/
VM component is the most critical component for the system availability, meaning
that the system is available only when the VM is running. Therefore, any system
behavior that interrupts the running VM makes the system unavailable.
The first failure mode is the VM non-aging failure. Software or operating
system faults can interrupt the VM service, causing unavailability. As the VM
depends on the Main Node to run, any Main Node interruption will cause system
unavailability. Standby Node failures do not cause unavailability directly. However,
Standby Node unavailability prevents VM migration. Likewise, we assume that, as
long as the Standby Node is active, there is the capacity to receive the VM migra-
tion. The system suffers a short downtime on each VM migration operation. So, fre-
quent migration may have a severe impact on overall system availability. Besides
that, the system may fail due to resource exhaustion because of software aging. We
assume that the operations for system repairing after a resource exhaustion failure
comprise software rejuvenation actions like OS reboot or application restart.
Bursty workload occurrence accelerates the depletion of the resources, leading to
quicker failure due to resource exhaustion.
We adopted the VM-migrate rejuvenation technique [24, 36]. The technique con-
sists of VM migration scheduling to move the VM from a host with software aging
accumulation to another host without aging accumulation. A software component
capable of communicating with the virtualized environment is in charge of submitting
VM migration commands according to a predefined schedule. In this paper, we
call this software component the Clock. When the Clock time count reaches
the predefined schedule, it submits the VM migration command to the virtualized
environment. The virtualized environment receives the message from the Clock and
verifies the status of the system components. Then, the VM migration occurs if all the system
components are available (i.e., Main Node, VM, and Standby Node). After a successful
VM migration, the VM goes from the Main Node to the Standby Node. As
soon as the VM arrives at the Standby Node, the Standby Node swaps roles with
the Main Node, meaning that it becomes the Main Node. The previous Main Node
(VM migration source) undergoes software rejuvenation after the VM migration.
Then, once the software rejuvenation completes, the previous Main Node assumes
the role of Standby Node and waits for new VM migration requests.
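The migration procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the `Node` class, the `clock_tick` function, and the assumption that rejuvenation of the source node completes instantly are all hypothetical simplifications introduced here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    available: bool = True
    aged: bool = False  # True once software aging has accumulated


def clock_tick(elapsed, schedule, main, standby, vm_available):
    """Return (main, standby, migrated) after the Clock checks its schedule.

    Hypothetical helper: migration happens only when the schedule is reached
    and the Main Node, the VM, and the Standby Node are all available.
    """
    if elapsed < schedule:
        return main, standby, False
    if not (main.available and standby.available and vm_available):
        return main, standby, False
    # Simplifying assumption: the migration source is rejuvenated at once.
    main.aged = False
    # Role swap: the old Standby Node now hosts the VM as the Main Node.
    return standby, main, True
```

Calling `clock_tick(10, 10, Node("A", aged=True), Node("B"), True)` returns the two nodes with their roles swapped; if the Standby Node is unavailable, the roles stay unchanged and no migration occurs.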
Following the classic definitions in the paper [15], Fig. 3 presents a state machine
with the considered system behavior related to software aging and rejuvenation. The
Fig. 3 State machine of software aging and rejuvenation, adapted from [15]
state machine diagram has four states: fresh (system without software aging accumulation);
degraded (system with degraded performance or increased failure rate
due to software aging accumulation); failed (system non-operational due to a software
aging failure); and rejuvenation (system under rejuvenation actions). The initial
state is the fresh state. From this point, the expected software aging accumulation
leads the system to state degraded. If no rejuvenation action occurs, the system suf-
fers a software aging failure, going to the failed state. The system returns to fresh
state after a system repair. However, if the system performs a rejuvenation action,
it goes to the state rejuvenation. In this paper, the considered rejuvenation action
is VM migration. Therefore, we assume that the system faces the VM migration
downtime in the rejuvenation state. The software rejuvenation brings the system
back to the fresh state. As mentioned earlier, the Clock component follows a predefined
schedule. Therefore, some unnecessary migrations (i.e., VM migrations triggered
in a system without software aging accumulation) may occur depending on the time
interval between migrations.
Figure 4 presents a diagram of the burst cycle model. The circles represent
the system states, the continuous arcs represent the time delay for the state tran-
sition, the dashed arcs represent the state transition based on probability (instead
of time delay), and the gray arc with a circle represents an attribute of a specific
system state.
The initial state of the burst cycle model is the start state, which represents
the start of the time cycle between burst occurrences. After the Cycle delay, the system
state goes from start to end. In the end state, based on a probability (burstProbability),
the system can restart the cycle (going back to the start state) or suffer
a bursty workload (moving the system state from end to underBurst). The
underBurst state has the burstIntensity attribute, which, as mentioned earlier,
indicates the severity of the workload submitted during the burst. After
the time delay burstDuration, the cycle for the next burst begins. In this paper,
we incorporate the behavior of the burst cycle model using an SRN model presented
in Sect. 5.1.
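To make the burst cycle concrete, the following minimal Python sketch estimates the long-run fraction of time the system spends in the underBurst state, both in closed form (a renewal-reward argument) and by simulation. It assumes exponentially distributed Cycle and burstDuration delays, and the function names and the 3600 s cycle used in the usage note are hypothetical values chosen for illustration only.

```python
import random


def burst_time_fraction(cycle, burst_prob, burst_duration):
    """Closed-form long-run fraction of time spent in underBurst.

    Each regeneration lasts `cycle` on average, plus a burst of mean
    `burst_duration` that occurs with probability `burst_prob`.
    """
    expected_burst = burst_prob * burst_duration
    return expected_burst / (cycle + expected_burst)


def simulate_burst_fraction(cycle, burst_prob, burst_duration,
                            n_cycles=200_000, seed=1):
    """Monte Carlo estimate of the same fraction (exponential delays assumed)."""
    rng = random.Random(seed)
    total_time = 0.0
    burst_time = 0.0
    for _ in range(n_cycles):
        total_time += rng.expovariate(1.0 / cycle)        # start -> end
        if rng.random() < burst_prob:                      # end -> underBurst
            duration = rng.expovariate(1.0 / burst_duration)
            burst_time += duration                         # underBurst -> start
            total_time += duration
    return burst_time / total_time
```

For instance, with an assumed cycle of 3600 s, a burst probability of 0.1, and a burst duration of 480 s, `burst_time_fraction(3600, 0.1, 480)` gives the share of time under burst, and the simulated estimate should agree with it to within sampling error.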
In this study, we apply the burst cycle model for the bursty workload. We
used parameters from previous studies for model evaluation. However, it is
possible to customize the burst cycle model parameters. In some situations, it is
hard to find specific datasets to use as input for the analysis. The burst cycle model
approach enables system managers to set up reasonable scenarios based
on their knowledge about the environment. Therefore, they can build better
SLAs and internal policies, considering the possibilities of bursty workload
occurrence.
Finally, we highlight that in this paper, the VM is the target for the bursty
workload and not the Main Node. Therefore, a VM migration also switches the
workload to the VM migration target host during a burst.
5 Models
This section has two subsections. The first presents the details of the proposed SRN
model for the availability evaluation. The second explains the performance
evaluation model, which is an M/M/1/k queue. The models follow the relationship
presented in Fig. 5. The availability model (M1) provides two inputs to the
performance model (M2). The first is the performance penalty (Penalty) due to
resource depletion. We compute Penalty from a place of the availability model,
which represents the level of resource depletion (ResourcesDepletion place).
Besides that, M1 also provides the system unavailability (UA) as input for M2. We
compute UA observing the probability of the absence of tokens in the place UP,
which is the place used to represent the system availability. More details of the interactions
between M1 and M2 are in the next sections. Note that we consider M/M/1/k
as this is one of the most used queueing models for client-server applications. Nevertheless,
it is possible to adapt M2 to other scenarios by including the effects of
Penalty and UA in other queueing models. Section 5.2 provides details on how
we incorporate such effects in the M/M/1/k model. Literature exists to support the
design of other queueing models based on Petri Nets [5, 18, 47].
We use two separate models for performability and dependability evaluation
for the following reason. Transitions whose firing delays differ by
several orders of magnitude cause the stiffness problem [6]. In our modeling
framework, we have the mean time to failure of the Main Node, which is above
1000 h, and the system service time, measured in milliseconds. By using model
decomposition, we can mitigate stiffness, as the evaluations of performability (with
delays of milliseconds) and dependability (with transition delays of months) are performed
separately.
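As a rough illustration of the scale gap behind the stiffness problem, consider the ratio between a 1000 h time to failure and a service time of 1 ms. The 1 ms figure is an assumption made here for the arithmetic only; the text states just that service times are measured in milliseconds.

```latex
\frac{1000\ \text{h}}{1\ \text{ms}}
  = \frac{1000 \times 3600 \times 10^{3}\ \text{ms}}{1\ \text{ms}}
  = 3.6 \times 10^{9}
```

Under this assumption, the two delays differ by more than nine orders of magnitude, which is why solving both in a single model would be numerically difficult.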
5.1 Availability Model
The proposed availability model has three sub-models: (i) clock model; (ii) burst
cycle model; and (iii) system model (Fig. 6). These models interact using guard
functions and transitions with marking-dependent firing rates, which are explained
later.
The first model is the clock model. The clock model has the Clock and ReadyToMigrate
places and the Trigger and ResetClock transitions. The clock
Fig. 6 Availability model
the token in the End place and puts a token in the UnderBurst place, represent-
ing that the system is suffering a bursty workload. The BurstDuration transition
represents the time duration of the bursty workload. As long as the system is under
a burst (i.e., a token is present in the place UnderBurst), it suffers an
acceleration of resource depletion. This acceleration is related to the burst intensity
mentioned in Sect. 4.4. We model this behavior using a marking-dependent firing
delay on the transition Phase of the system model. The Phase transition represents
the progress of resource depletion. Table 2 presents the details of the transitions
with marking-dependent firing delays. The BurstDuration firing removes the
token from the UnderBurst place and puts a token in the Start place, represent-
ing the cycle restart.
The system model intends to cover three system aspects: (i) behavior of non-
aging failures and their repairs; (ii) resources depletion due to software aging; and
(iii) VM migration. We highlighted the model’s sections that represent each one of
these behaviors.
In this paper, we consider system non-aging failures (e.g., hardware or operating
system failures). At the initial state, the system model has a token in the UP place.
A token in the UP place represents the Main Node and the hosted VM running, and
the absence of tokens in the UP place represents system unavailability. From the
initial state, the Main Node can suffer a non-aging failure (MN_f transition firing).
The MN_f transition firing removes the token from the place UP and puts a token in
the place DW. The Main Node repair has two steps: (1) Main Node recovery (MN_r
transition firing, moving the token from DW to the VM_S place) and (2) VM reboot
(VM_rb firing, returning the token to the UP place). The system can also suffer a
VM non-aging failure. The firing of transition VM_f represents the occurrence of a VM non-aging
failure. After a VM failure, there are two possible next events: a
VM repair (transition VM_r firing) or a subsequent Main Node failure (MN_f2 firing,
moving the token from VM_DW to the DW place). We also cover the non-aging
failure and repair processes in the Standby Node. SN_UP and SN_DW places repre-
sent the Standby Node status related to the non-aging failures and repairs. SN_UP
place with tokens represents the Standby Node running and ready to receive VM
migration. The transitions SN_f and SN_r represent a Standby Node non-aging
failure and repair, respectively. Standby Node unavailability (token in the SN_DW
place) affects the system availability indirectly as it prevents the software rejuvena-
tion for aging failures avoidance.
As mentioned earlier, StartLM transition represents the VM migration start and
it has an associated guard function (see Table 1). We assume the Pre-copy VM live
migration [7] as the VM migration method. Pre-copy algorithm has two main phases:
(i) Pre-copy phase (transition PC)—transfer of the memory pages from the Main
Node to the Standby Node; (ii) Downtime phase (transition LM_dwt)—transfer of
the processor state and VM migration acknowledgment. StartMig firing deposits
a token in the Mig place. We used inhibitor arcs (footnote 5) to indicate that a new migration
can only start after the previous one finishes. A token in the Mig place represents
that the VM migration is in the Pre-copy phase. During the Pre-copy phase,
the system continues to run (the token stays in the UP place). The SysFail transition
represents possible system failures (i.e., Main Node, VM, or Standby Node
failures) during the Pre-copy phase. We also embed this behavior using guard functions.
A system failure during the Pre-copy phase aborts the VM migration;
thus, the SysFail transition removes the token from the Mig place. As presented in [1,
20, 33, 62], the amount of dirty memory pages affects the VM migration latency. To
represent this behavior, we used a marking-dependent firing delay in the PC transi-
tion (see Table 2). The marking-dependent firing delay increases the supposed delay
for the Pre-copy phase (Precopy variable) observing the status of system resources
depletion (i.e., number of tokens in the ResourcesDepletion place). After the
completion of the Pre-copy phase (firing of the PC transition), the system enters the
Downtime phase (token in the DW_Mig place). In the Downtime phase, the system is
unavailable. We represent this behavior by removing the token from the UP place after the PC
transition fires. The system becomes available again after the VM migration completes
(LM_dwt transition firing). As mentioned in Sect. 4.3, the previous Main
Node (i.e., the VM migration source) undergoes software rejuvenation before
assuming the role of the Standby Node. Thus, LM_dwt firing puts a token in the
SN_W place, representing that the previous Main Node is waiting for software rejuvenation.
Transition Rej represents the software rejuvenation action; its firing puts
a token back in the SN_UP place. This behavior represents the rejuvenation completion and that
the Standby Node is ready to receive VM migrations.
Finally, we describe how we model resource depletion due to software aging. To represent
the resource depletion behavior, we adopted a four-phase Erlang distribution, as
the Erlang distribution is suitable to represent Increasing Failure Rate (IFR) behavior
[14]. We used an Erlang subnet with four phases to represent the IFR in the system
model. The Erlang subnet is in the upper part of the system model. The places
AvailableResources and ResourcesDepletion are related to the system
resources depletion status. The number of tokens in the AvailableResources
5 Arcs terminating in a circle instead of an arrowhead.
denotes the amount of resources available, and the number of tokens in the
ResourcesDepletion place denotes the resources depletion status.
At the initial state, the transition Aging fires, swapping (footnote 6) the token from the
UP place. The same transition deposits four tokens in the place AvailableResources.
The number of tokens denotes the amount of resources still available for
the Main Node usage. As time passes, the Main Node starts to accumulate software
aging. The firing of transition Phase represents the progress of resource consumption:
it removes tokens from the AvailableResources place and deposits
tokens in the ResourcesDepletion place. If the software aging persists
in the system, it can suffer a resource exhaustion failure (ResourcesExhaustion
transition). We highlight that the Phase firing rate is also adjusted when the
system is under a bursty workload, meaning faster resource exhaustion. ResourcesExhaustion
firing removes the token from the UP place and puts a token in
the DW2 place. After a resource exhaustion failure, the system recovery has three phases:
(i) detection of resource exhaustion; (ii) software management to clean up residuals;
(iii) complete OS reboot. We model these steps in a single exponential transition
named Repair. After Repair firing, a token returns to the UP place, representing
that the system is available again.
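The reason a four-phase Erlang distribution captures Increasing Failure Rate behavior can be checked numerically. The sketch below computes the hazard rate of an Erlang distribution from its density and survival function. It is an illustration only; the function names and the per-phase rate of 0.01 per hour mentioned in the usage note are hypothetical, not parameters taken from the paper.

```python
import math


def erlang_survival(t, k, rate):
    """P(T > t) for an Erlang distribution with k phases, each of rate `rate`."""
    x = rate * t
    return math.exp(-x) * sum(x**n / math.factorial(n) for n in range(k))


def erlang_pdf(t, k, rate):
    """Probability density of the Erlang distribution at time t."""
    return rate**k * t**(k - 1) * math.exp(-rate * t) / math.factorial(k - 1)


def erlang_hazard(t, k, rate):
    """Hazard (failure) rate h(t) = f(t) / S(t)."""
    return erlang_pdf(t, k, rate) / erlang_survival(t, k, rate)
```

With k = 4 and an assumed rate of 0.01, the hazard evaluated at increasing times grows monotonically and stays below the per-phase rate, whereas with k = 1 (a plain exponential) it is constant. A larger per-phase rate, as under a burst, shifts the whole curve toward earlier failure.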
5.1.1 Metrics Computation
We obtain two metrics from the availability model. The first is the system availability
(A). We compute the system availability as the probability of token presence
in the place UP ($A = P\{\#\mathtt{UP} > 0\}$). We use the availability to obtain secondary
metrics such as unavailability (UA) and annual downtime in hours per year (Dwt; footnote 7). The
expressions are as follows: $UA = 1 - A$ and $Dwt = 8760 \cdot UA$. Note that, as we are
computing steady-state availability, it is possible to obtain the downtime for other
intervals (besides one year). For example, we can compute the monthly downtime
in minutes (footnote 8) using $Dwt = 43{,}200 \cdot UA$. As mentioned earlier, besides the availability-related
metrics, we also computed the Penalty, using the following expression:
$Penalty = E(\#\mathtt{ResourcesDepletion})/3$, which is a normalized value of the
expected number of tokens in the ResourcesDepletion place. We take account
of resource depletion accumulation when the place ResourcesDepletion has
tokens. Therefore, the possibilities are ResourcesDepletion with one, two, or
three tokens. To normalize the Penalty metric, we divide the expected number
of tokens in place ResourcesDepletion by three.
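The expressions above translate directly into code. The following sketch assumes the steady-state availability and the token distribution of the ResourcesDepletion place have already been obtained from a model solver; the function names and the example probabilities in the usage note are hypothetical.

```python
def unavailability(availability):
    """UA = 1 - A."""
    return 1.0 - availability


def annual_downtime_hours(availability):
    """Dwt = 8760 * UA, for a year of 365 days."""
    return 8760 * unavailability(availability)


def monthly_downtime_minutes(availability):
    """Dwt = 43,200 * UA, for a month of 30 days."""
    return 43_200 * unavailability(availability)


def penalty(depletion_distribution):
    """Penalty = E(#ResourcesDepletion) / 3.

    `depletion_distribution` maps a token count (0 to 3) to its
    steady-state probability.
    """
    expected_tokens = sum(n * p for n, p in depletion_distribution.items())
    return expected_tokens / 3
```

For example, an availability of 0.999 corresponds to about 8.76 hours of downtime per year, or about 43.2 minutes per month. A made-up distribution such as `{0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}` has an expected token count of 1.0 and therefore a Penalty of one third.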
6 Receiving and returning.
7 We consider a year with 365 days.
8 Considering a month with 30 days: 30 ⋅ 24 ⋅ 60 = 43,200.
5.2.1 Metrics Computation
9 Note that, in this case, the performance is degradable due to software aging accumulation, so
the computed metrics are related to system performability [37].
The first step of the evaluation is to use the variable Penalty in the performance
model. In some of our previous experiments [52], we noticed that a generic Web
server under software aging effects has a service rate of about one request per second.
Based on our previous experimentation, we assume that when the Penalty
reaches its maximum value, the system service rate (i.e., firing rate of transition
service) decreases to one request per second. We changed the firing
rate of the transition service according to each proposed scenario. We then obtain
the effective rate of transactions accepted in the queue (RTAQ)
using the expression $RTAQ = P\{\#\mathtt{buffer} > 0\} \cdot \lambda$, where $\lambda$ is the firing
rate of the transition arrival. However, we still have to consider the system’s
unavailability in the system’s steady-state throughput. Thus, we use the expression
$ST = RTAQ \cdot A$ to compute the system throughput (ST), where A is the system
availability.
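The throughput computation can be sketched analytically for an M/M/1/K queue. This sketch assumes that the buffer place holds the free queue slots, so that P{#buffer > 0} equals one minus the standard M/M/1/K blocking probability. That reading, the function names, and the parameter values in the usage note are assumptions made here, not details confirmed by the text.

```python
def mm1k_blocking_probability(arrival_rate, service_rate, capacity):
    """Probability that an M/M/1/K queue is full (an arrival is rejected)."""
    rho = arrival_rate / service_rate
    if rho == 1.0:
        # Limit of the general formula when arrival and service rates match.
        return 1.0 / (capacity + 1)
    return (1 - rho) * rho**capacity / (1 - rho**(capacity + 1))


def accepted_rate(arrival_rate, service_rate, capacity):
    """RTAQ: arrival rate times the probability that a slot is free."""
    blocked = mm1k_blocking_probability(arrival_rate, service_rate, capacity)
    return arrival_rate * (1 - blocked)


def system_throughput(arrival_rate, service_rate, capacity, availability):
    """ST = RTAQ * A."""
    return accepted_rate(arrival_rate, service_rate, capacity) * availability
```

As a usage example with made-up values, an arrival rate of 1, a service rate of 2, a capacity of 1, and an availability of 0.99 give a blocking probability of one third and a system throughput of 0.66 requests per time unit. A lower service rate, as caused by a higher Penalty, increases blocking and reduces the throughput.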
6 Case Studies
We used the TimeNET tool for the availability model design and evaluation [66], and the
Mercury tool [31] for the performance analysis. TimeNET has a friendly graphical
interface and provides instantaneous results, while Mercury has an easy-to-use script
language that facilitates sensitivity analysis using non-linear parameter variation.
We used the values in Table 3 as default values for our evaluations. We obtained
these values from the papers [53, 63]. Note that these values are only for reference
and should be adjusted whenever real-scenario values are available. However, these
are the most representative values that we could find to feed the models, as they were
published in reputable journals.
In the following case studies, we focused on finding the rejuvenation schedule to
maximize system availability and system throughput. First, we search for the best
rejuvenation schedule using a graphical sensitivity analysis, which explores the
models’ output varying the time interval for the VM migrations (Trigger firing
delay) from one to 720 hours (a month) using a one-hour step. Second, once we
find the availability-oriented rejuvenation schedule, we propose the last case study
to verify system reliability in the first month of the system running.
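The schedule search described above amounts to a one-dimensional sweep over the Trigger firing delay. The sketch below shows the sweep itself. The `toy_availability` function is a hypothetical placeholder with made-up coefficients that merely mimics the trade-off between migration downtime and aging failures; it is not the SRN model and does not reproduce any result of this paper. In practice, `evaluate` would call the model solver.

```python
def best_trigger(evaluate, first=1, last=720, step=1):
    """Return (trigger_hours, value) maximizing `evaluate` over the sweep."""
    best_t, best_value = None, float("-inf")
    for trigger_hours in range(first, last + 1, step):
        value = evaluate(trigger_hours)
        if value > best_value:
            best_t, best_value = trigger_hours, value
    return best_t, best_value


def toy_availability(trigger_hours):
    """Hypothetical stand-in for a model evaluation (NOT the paper's model)."""
    migration_cost = 0.002 / trigger_hours   # frequent migrations cost downtime
    aging_cost = 1e-6 * trigger_hours        # rare migrations allow aging failures
    return 1.0 - migration_cost - aging_cost
```

Calling `best_trigger(toy_availability)` scans 1 to 720 hours in one-hour steps and returns the interval where the placeholder peaks. Replacing the placeholder with a call to the availability or throughput model gives the availability-oriented or throughput-oriented schedule, respectively.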
For all the scenarios, we assume that the service hosted in the virtualized envi-
ronment is an asset, which is a possible target for a bursty workload. Depending on
the asset, the burst may be more or less likely to occur. To represent the different
asset classes, we propose five different scenarios, as in Table 4.
The case studies in this section are: Availability (Sect. 6.1), System Throughput
(Sect. 6.2), and Reliability (Sect. 6.3).
Our goal in this case study is to find the rejuvenation schedule that maximizes
the system availability. This case study aims to answer RQ1. There are two main
Table 4 Asset classes definitions
Asset class #   Burst probability (%)   Burst intensity   Burst duration
0               0.01                    2000              60 s
1               0.1                     4000              120 s
2               1                       6000              240 s
3               5                       8000              360 s
4               10                      10000             480 s
problems for availability in the scenarios covered in our study. The first arises when
applying frequent migrations. As each migration has an associated downtime, frequent
migrations degrade the steady-state availability. The second is
resource exhaustion failures caused by software aging and bursty workloads: less frequent
migrations may allow the system to reach resource exhaustion failures.
In some situations, VM migration during bursty workloads may accelerate the
exhaustion of the resources due to VM migration overhead. As mentioned earlier,
we capture the influence of bursty workloads in the VM migration process using
transitions with marking-dependent firing delays.
Figure 8 presents the availability results for each proposed asset class. The black
line represents the system availability when applying rejuvenation, and the gray line
represents the system availability without rejuvenation (Baseline). We notice that
the system availability has a peak in all the scenarios, which is the specific rejuvena-
tion trigger that maximizes system availability. After the peak, we notice the system
The goal of this case study is to find the rejuvenation policy that maximizes the
system throughput. This case study aims to answer RQ2. Figure 11 shows the results
for all the considered scenarios. We noticed that in scenarios with shorter rejuvena-
tion triggers, the system throughput stays at higher levels. After a certain point, we
noticed a drop in the system throughput rate. We can draw the following conclusions
from these results: (i) Systems with shorter migration intervals tend to remain at a
lower Penalty. Therefore, the service rate remains at higher levels, compensating for
the lower availability levels due to frequent migrations. (ii) The baseline throughput
is higher in systems with more severe bursty workloads. In the model analysis,
we noticed that in such cases, after a burst, the system fails quickly. Thus, the system
steady-state Penalty is lower than in scenarios with a lighter bursty workload.
Therefore, the steady-state system throughput tends to be higher than in the other
scenarios.
Table 7 presents the best rejuvenation schedule for the proposed scenarios. The
adopted policies for system throughput maximization are close to the results for system
availability maximization. The last column presents the percentage improvement
when comparing the baseline results and the results with rejuvenation. As in the previous
case study, we noticed that the improvement is lower in scenarios
with heavier bursty workloads.
Finally, for the sake of comparison, we plot the throughput improvement of each
scenario in Fig. 12. As in the availability results, the results from Asset Class #0 and
#1 are nearly the same.
System reliability is related to service continuity [3], i.e., the period during which
the system runs free from failures. In this case study, we investigated the system
reliability when applying the availability-oriented rejuvenation policies (Table 5).
Our goal is to answer RQ3. To conduct the reliability evaluation, we used the
availability model without the repair transitions. Thus, we compute the reliability
using the following expression: Reliability = P{UP > 0}. Unlike steady-state
availability, however, system reliability is a transient metric. Thus, our goal is to
calculate the probability of the system staying failure-free during its first month
of operation.
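To make the transient nature of this metric concrete, the following minimal sketch computes R(t) for a toy continuous-time Markov chain with an absorbing failure state. It is not the SRN used in this paper: the three states (healthy, aged, failed), the rates `aging_rate` and `failure_rate`, and the function names are hypothetical and chosen only for illustration. Removing the repair transition makes the failed state absorbing, so the probability of still being in an UP state at time t is the reliability.

```python
import math


def transient_distribution(Q, p0, t, tol=1e-12, max_terms=10_000):
    """State probabilities of a CTMC at time t, computed by uniformization."""
    n = len(Q)
    lam = max(-Q[i][i] for i in range(n))  # uniformization rate
    if lam == 0.0 or t == 0.0:
        return list(p0)
    # Embedded DTMC of the uniformized chain: P = I + Q / lam
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)]
         for i in range(n)]
    weight = math.exp(-lam * t)  # Poisson probability of 0 jumps
    vec = list(p0)
    result = [weight * v for v in vec]
    acc = weight
    k = 0
    while 1.0 - acc > tol and k < max_terms:
        k += 1
        vec = [sum(vec[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= lam * t / k  # Poisson probability of k jumps
        acc += weight
        result = [r + weight * v for r, v in zip(result, vec)]
    return result


def reliability(t, aging_rate=1 / 240, failure_rate=1 / 480):
    """R(t) = P{UP > 0} at time t for a hypothetical model without repair."""
    a, f = aging_rate, failure_rate
    # States: 0 = healthy (UP), 1 = aged (UP), 2 = failed (absorbing)
    Q = [[-a, a, 0.0],
         [0.0, -f, f],
         [0.0, 0.0, 0.0]]
    p = transient_distribution(Q, [1.0, 0.0, 0.0], t)
    return p[0] + p[1]  # probability mass in the UP states


print(f"R(720 h) = {reliability(720.0):.4f}")
```

Because the failed state has no outgoing transition, R(t) can only decrease with t, which is why the metric is evaluated at a fixed horizon (here 720 hours, one month) instead of in steady state.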
0 480
1 400
2 400
3 320
4 240
Our results are shown in Fig. 13. The black dots represent the reliability results
for the system with rejuvenation, and the gray dots represent the reliability results
for the system without rejuvenation. The dashed lines represent the 95% confidence
interval. We also performed linear regression on the reliability results to extract
functions representing the reliability curve (R(t), where t is the time) when
applying software rejuvenation policies. Table 8 presents a summary of the
reliability results. The table also presents the coefficient of determination R²,
which ranges from 0 to 1 and gives the fraction of the total variation explained by
the obtained regression model.
We noticed a steeper reliability reduction in scenarios with heavier bursty
workloads. The rapid reliability decrease in Asset Class #4 shows that the
probability of a failure-free system is almost null at the 720th hour of continuous
operation. Therefore, after this point, the rejuvenation mechanism produces no
improvement in the system reliability when compared to the baseline scenario. In
contrast, the reliability results for the first two scenarios (Asset Classes #0 and
#1) are nearly the same. In the Asset Class #0 scenario, the probability of a
failure-free system in its first month of operation is about 30%, while in the Asset
Class #1 scenario, the probability is about 29%. Using the quadratic, polynomial,
and logarithmic models for linear regression, we can achieve R² values above 0.999,
meaning that the proposed functions represent the reliability curve with substantial
fidelity. Therefore, we can use these functions to approximate the reliability
results for the desired scenarios.
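The regression step can be illustrated with a short sketch. It fits a quadratic model, which is linear in its coefficients and therefore solvable by ordinary least squares, and then computes R². The data points below are synthetic and are not taken from Fig. 13 or Table 8; the function names `fit_quadratic` and `r_squared` and the normalization of time by 720 hours are assumptions made only for this example.

```python
def fit_quadratic(ts, ys):
    """Least-squares fit of y = c0 + c1*t + c2*t**2 via the normal equations."""
    # A[i][j] = sum of t^(i+j), b[i] = sum of y * t^i
    A = [[sum(t ** (i + j) for t in ts) for j in range(3)] for i in range(3)]
    b = [sum(y * t ** i for t, y in zip(ts, ys)) for i in range(3)]
    # Gaussian elimination with partial pivoting on the augmented matrix
    M = [row + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            factor = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= factor * M[col][c]
    # Back substitution
    coeffs = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        s = sum(M[i][j] * coeffs[j] for j in range(i + 1, 3))
        coeffs[i] = (M[i][3] - s) / M[i][i]
    return coeffs


def r_squared(ys, preds):
    """Fraction of the total variation in ys explained by the predictions."""
    mean = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic reliability samples (hypothetical values, for illustration only)
hours = [0, 120, 240, 360, 480, 600, 720]
rel = [1.00, 0.93, 0.80, 0.66, 0.52, 0.40, 0.30]
xs = [h / 720.0 for h in hours]  # normalize time to keep the system well conditioned

c0, c1, c2 = fit_quadratic(xs, rel)
preds = [c0 + c1 * x + c2 * x * x for x in xs]
print(f"R(x) = {c0:.4f} + {c1:.4f} x + {c2:.4f} x^2, R^2 = {r_squared(rel, preds):.4f}")
```

The same two functions apply to a logarithmic model after replacing the regressors (for example, using log(t) in place of t), since such models are also linear in their coefficients.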
Additionally, we calculate the depletion point (i.e., point in time where the
resources are depleted) of the baseline scenarios. These results highlight how long
the system survives without rejuvenation. The depletion point results are in Table 9.
7 Related Works
There are several works on performability and dependability modeling in the
cloud. However, to the best of our knowledge, this is the first research attempt to
quantify system performability and dependability in a system with VM migration as
rejuvenation under bursty workloads. In the following, we highlight relevant related
research and compare the current state of the art with our work.
First, it is worth mentioning the works from Melo et al. [35, 36] and Machida
et al. [23], which provided the basis for our availability modeling framework.
Another paper from Machida et al. [24] provided the insights for our performance
evaluation approach. These papers focus on the availability metric, aiming to
find the proper rejuvenation schedules to maximize system availability. The paper
[24] also considers the number of transactions lost (a performance metric). However,
in that paper, the authors neglect the performance penalty due to software aging.
In our work, besides availability, we also cover reliability and system
throughput metrics. Our models also capture the influence of software aging on
system responsiveness.
Wang et al. [63] provided a relevant background for our performability
modeling. We used their paper to obtain some rates for the evaluation of our
models. In their paper, they cover the aspect of workload variation in a system
with rejuvenation. Unlike them, we cover VM migration and bursty workloads.
Moreover, we cover an extended set of metrics, such as availability and reliability.
The work of Escheikh et al. [10] also considers bursty workload occurrence in a
system with VM migration. Unlike their work, we consider a set of different asset
classes and also provide reliability results. Moreover, we adopted a different
approach to represent burst occurrence, using a specific submodel (i.e., the burst
cycle model).
Our previous papers [54, 57] are the first step of this research. We highlight the
following improvements: (1) Use of marking-dependent firing delays to improve
the model's accuracy. In the previous works, we used places with multiple
tokens and guard functions to represent resource consumption dynamics. (2) The
problem with using guard functions is that the firing rate is only adjusted when
firing the next resource consumption phase. (3) Inclusion of performability metrics.
(4) Linear regression for the reliability curves.
In the papers [25–28, 30], Machida et al. provided an extensive modeling
framework considering the aspects of performance penalty due to software aging
accumulation. For example, in the papers [25–27], the authors derive optimal
This paper presented a comprehensive SRN modeling framework for the performa-
bility and dependability evaluation of a virtualized system subject to software aging
and bursty workload. The considered system applies VM migration scheduling as
support for software rejuvenation. Our results include the metrics: steady-state avail-
ability, steady-state system throughput, and system reliability.
Regarding our main research question (RQmain: What are the performability and
dependability levels of a virtualized system with VM migration subject to software
aging and bursty workload?), we noted that these levels vary depending on
the studied scenario. For each scenario, there is a specific rejuvenation schedule
that maximizes system availability and throughput. In scenarios with lighter bursty
workload conditions (burst probability of 0.01% and 0.1%), the performability results
are nearly the same. However, in the scenarios with heavier bursty workload
conditions (burst probability of 1%, 5%, and 10%), the system performability
degradation due to the bursty workload is substantial. In such scenarios, the
rejuvenation policies tend to produce a lower system performability improvement.
We covered the main aspects of software aging and rejuvenation and the
uncertainties related to bursty workload occurrence. Besides that, we also considered
important details such as the influence of burst occurrence on resource exhaustion
and the influence of resource consumption levels on the VM migration process.
represents a system failure occurrence. The same transition moves the token from
the UP place to the DW place. The system repair is represented by the MTTR transition
(MTTR: mean time to repair). Firing the MTTR transition brings the model back to
its initial state.
We can compute the system availability using the reward measure
Availability = P{UP > 0}, which captures the probability of a token being present in
the UP place.
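For a two-place model of this kind, with one failure transition and one repair transition, the steady-state value of P{UP > 0} reduces to the familiar ratio MTTF / (MTTF + MTTR). The sketch below illustrates that calculation; the parameter names and the numeric values are hypothetical and are not taken from the paper's tables.

```python
def steady_state_availability(mttf_hours, mttr_hours):
    """Steady-state probability that the token is in the UP place."""
    return mttf_hours / (mttf_hours + mttr_hours)


def annual_downtime_hours(availability, hours_per_year=8760.0):
    """Expected downtime per year implied by a steady-state availability."""
    return (1.0 - availability) * hours_per_year


# Hypothetical parameters: a failure every 1000 h on average, 1 h to repair
availability = steady_state_availability(mttf_hours=1000.0, mttr_hours=1.0)
print(f"Availability = {availability:.6f}")
print(f"Downtime     = {annual_downtime_hours(availability):.2f} h/year")
```

Expressing availability as annual downtime is often easier to interpret when comparing rejuvenation schedules whose availabilities differ only in the later decimal places.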
Acknowledgements This work has been partially supported by the Portuguese Foundation
for Science and Technology (FCT), through the PhD Grant SFRH/BD/146181/2019, within
the scope of the project CISUC - UID/CEC/00326/2020. This work is also funded by the
European Social Fund, through the Regional Operational Program Centro 2020. This work
also received support from the AIDA (Adaptive, Intelligent and Distributed Assurance
Platform) project, funded by the Operational Program for Competitiveness and
Internationalization (COMPETE 2020) and FCT (under the CMU Portugal Program) through
Grant POCI-01-0247-FEDER-045907, and from the project TalkConnect, funded by COMPETE
2020 through Grant POCI-01-0247-FEDER-039676.
Declarations
Conflict of interest The authors declare that they have no conflict of interest.
References
1. Akoush, S., Sohan, R., Rice, A., Moore, A.W., Hopper, A.: Predicting the performance of virtual
machine migration. In: 2010 IEEE International Symposium on Modeling, Analysis and Simulation
of Computer and Telecommunication Systems, pp. 37–46. IEEE (2010)
2. Araujo, J., Matos, R., Maciel, P., Matias, R., Beicker, I.: Experimental evaluation of software aging
effects on the eucalyptus cloud computing infrastructure. In: Proceedings of the Middleware 2011
Industry Track Workshop, p. 4. ACM (2011)
3. Avizienis, A., Laprie, J.C., Randell, B., Landwehr, C.: Basic concepts and taxonomy of dependable
and secure computing. IEEE Trans. Depend. Secure Comput. 1(1), 11–33 (2004)
4. Avritzer, A., Weyuker, E.J.: Monitoring smoothly degrading systems for increased dependability.
Empir. Softw. Eng. 2(1), 59–77 (1997)
5. Bause, F.: Queueing petri nets-a formalism for the combined qualitative and quantitative analysis of
systems. In: Proceedings of 5th International Workshop on Petri Nets and Performance Models, pp.
14–23. IEEE (1993)
6. Bobbio, A.: System modelling with petri nets. In: Systems Reliability Assessment, pp. 103–143.
Springer (1990)
7. Clark, C., Fraser, K., Hand, S., Hansen, J.G., Jul, E., Limpach, C., Pratt, I., Warfield, A.: Live migra-
tion of virtual machines. In: Proceedings of the 2nd Conference on Symposium on Networked Sys-
tems Design & Implementation-Volume 2, pp. 273–286. USENIX Association (2005)
8. Cotroneo, D., Natella, R., Pietrantuono, R., Russo, S.: A survey of software aging and rejuvenation
studies. ACM J. Emerg. Technol. Comput. Syst. 10(1), 8 (2014)
9. Dohi, T., Zheng, J., Okamura, H., Trivedi, K.S.: Optimal periodic software rejuvenation policies
based on interval reliability criteria. Reliab. Eng. Syst. Saf. 180, 463–475 (2018)
10. Escheikh, M., Tayachi, Z., Barkaoui, K.: Performability evaluation of server virtualized systems
under bursty workload. IFAC-PapersOnLine 51(7), 45–50 (2018)
11. Feuerlicht, G., Burkon, L., Sebesta, M.: Cloud computing adoption: what are the issues. Syst. Integr.
18(2), 187–192 (2011)
12. Garg, S., Van Moorsel, A., Vaidyanathan, K., Trivedi, K.S.: A methodology for detection and esti-
mation of software aging. In: Proceedings Ninth International Symposium on Software Reliability
Engineering (Cat. No. 98TB100257), pp. 283–292. IEEE (1998)
13. Grottke, M., Matias, R., Trivedi, K.S.: The fundamentals of software aging. In: 2008 IEEE Interna-
tional Conference on Software Reliability Engineering Workshops (ISSRE Wksp), pp. 1–6. IEEE
(2008)
14. Gupta, A.K., Zeng, W.B., Wu, Y.: Probability and Statistical Models: Foundations for Problems in
Reliability and Financial Mathematics. Springer, New York (2010)
15. Huang, Y., Kintala, C., Kolettis, N., Fulton, N.D.: Software rejuvenation: analysis, module and
applications. In: Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of
Papers, pp. 381–390. IEEE (1995)
16. Jain, R.: The Art of Computer Systems Performance Analysis: Techniques for Experimental Design,
Measurement, Simulation, and Modeling. Wiley, New York (1990)
17. Kleinrock, L.: Queueing Systems, vol. i: Theory (1975)
18. Kounev, S.: Performance modeling and evaluation of distributed component-based systems using
queueing petri nets. IEEE Trans. Softw. Eng. 32(7), 486–502 (2006)
19. Kuchárik, M., Balogh, Z.: Modeling of uncertainty with petri nets. In: Asian Conference on Intel-
ligent Information and Database Systems, pp. 499–509. Springer (2019)
20. Liu, H., Xu, C.Z., Jin, H., Gong, J., Liao, X.: Performance and energy modeling for live migration of
virtual machines. In: Proceedings of the 20th International Symposium on High Performance Distrib-
uted Computing, pp. 171–182. ACM (2011)
21. Low, C., Chen, Y., Wu, M.: Understanding the determinants of cloud computing adoption. Ind. Manag.
Data Syst. 111(7), 1006–1023 (2011)
22. Macêdo, A., Ferreira, T.B., Matias, R.: The mechanics of memory-related software aging. In: 2010
IEEE Second International Workshop on Software Aging and Rejuvenation, pp. 1–5. IEEE (2010)
23. Machida, F., Kim, D.S., Trivedi, K.S.: Modeling and analysis of software rejuvenation in a server virtu-
alized system. In: 2010 IEEE Second International Workshop on Software Aging and Rejuvenation, pp.
1–6. IEEE (2010)
24. Machida, F., Kim, D.S., Trivedi, K.S.: Modeling and analysis of software rejuvenation in a server virtu-
alized system with live vm migration. Perform. Eval. 70(3), 212–230 (2013)
25. Machida, F., Miyoshi, N.: An optimal stopping problem for software rejuvenation in a job processing
system. In: 2015 IEEE International Symposium on Software Reliability Engineering Workshops (ISS-
REW), pp. 139–143. IEEE (2015)
26. Machida, F., Miyoshi, N.: Analysis of an optimal stopping problem for software rejuvenation in a dete-
riorating job processing system. Reliab. Eng. Syst. Saf. 168, 128–135 (2017)
27. Machida, F., Nicola, V.F., Trivedi, K.S.: Job completion time on a virtualized server subject to software
aging and rejuvenation. In: 2011 IEEE Third International Workshop on Software Aging and Rejuvena-
tion, pp. 44–49. IEEE (2011)
28. Machida, F., Nicola, V.F., Trivedi, K.S.: Job completion time on a virtualized server with software reju-
venation. ACM J. Emerg. Technol. Comput. Syst. 10(1), 10 (2014)
29. Machida, F., Xiang, J., Tadano, K., Maeno, Y.: Aging-related bugs in cloud computing software. In:
2012 IEEE 23rd International Symposium on Software Reliability Engineering Workshops, pp. 287–
292. IEEE (2012)
30. Machida, F., Xiang, J., Tadano, K., Maeno, Y.: Lifetime extension of software execution subject to
aging. IEEE Trans. Reliab. 66(1), 123–134 (2016)
31. Maciel, P., Matos, R., Silva, B., Figueiredo, J., Oliveira, D., Fé, I., Maciel, R., Dantas, J.: Mercury:
performance and dependability evaluation of systems with exponential, expolynomial, and general
distributions. In: 2017 IEEE 22nd Pacific Rim International Symposium on Dependable Computing
(PRDC), pp. 50–57. IEEE (2017)
32. Matos, R., Araujo, J., Alves, V., Maciel, P.: Characterization of software aging effects in elastic storage
mechanisms for private clouds. In: 2012 IEEE 23rd International Symposium on Software Reliability
Engineering Workshops, pp. 293–298. IEEE (2012)
33. Maziku, H., Shetty, S.: Towards a network aware vm migration: Evaluating the cost of vm migration in
cloud data centers. In: 2014 IEEE 3rd International Conference on Cloud Networking (CloudNet), pp.
114–119. IEEE (2014)
34. Mell, P., Grance, T., et al.: The nist definition of cloud computing (2011)
35. Melo, M., Araujo, J., Matos, R., Menezes, J., Maciel, P.: Comparative analysis of migration-based reju-
venation schedules on cloud availability. In: 2013 IEEE International Conference on Systems, Man, and
Cybernetics, pp. 4110–4115. IEEE (2013)
36. Melo, M., Maciel, P., Araujo, J., Matos, R., Araujo, C.: Availability study on cloud computing envi-
ronments: live migration as a rejuvenation mechanism. In: 2013 43rd Annual IEEE/IFIP International
Conference on Dependable Systems and Networks (DSN), pp. 1–6. IEEE (2013)
37. Meyer, J.F.: Performability: a retrospective and some pointers to the future. Perform. Eval. 14(3–4),
139–156 (1992)
38. Mijumbi, R., Serrat, J., Gorricho, J.L., Bouten, N., De Turck, F., Boutaba, R.: Network function virtual-
ization: state-of-the-art and research challenges. IEEE Commun. Surv. Tutor. 18(1), 236–262 (2015)
39. Murata, T.: Petri nets: properties, analysis and applications. Proc. IEEE 77(4), 541–580 (1989). https://
doi.org/10.1109/5.24143
40. Myint, M.T.H., Thein, T.: Availability improvement in virtualized multiple servers with software reju-
venation and virtualization. In: 2010 Fourth International Conference on Secure Software Integration
and Reliability Improvement, pp. 156–162. IEEE (2010)
41. Nguyen, T.A., Min, D., Choi, E., Tran, T.D.: Reliability and availability evaluation for cloud data center
networks using hierarchical models. IEEE Access 7, 9273–9313 (2019)
42. Oliveira, T., Thomas, M., Espadanal, M.: Assessing the determinants of cloud computing adoption: an
analysis of the manufacturing and services sectors. Inf. Manag. 51(5), 497–510 (2014)
43. Patterson, D.A., et al.: A simple way to estimate the cost of downtime. LISA 2, 185–188 (2002)
44. Pietrantuono, R., Russo, S.: A survey on software aging and rejuvenation in the cloud. Softw. Q. J. 1–32
(2019)
45. Salfner, F., Tröger, P., Polze, A.: Downtime analysis of virtual machine live migration. In: The Fourth
International Conference on Dependability (DEPEND 2011). IARIA, pp. 100–105 (2011)
46. Schroeder, B., Gibson, G.A.: Disk failures in the real world: What does an mttf of 1, 000, 000 hours
mean to you? FAST 7, 1–16 (2007)
47. Siddiqui, S., Darbari, M., Yagyasen, D., et al.: Modelling and simulation of queuing models through the
concept of petri nets (2020)
48. Soltesz, S., Pötzl, H., Fiuczynski, M.E., Bavier, A., Peterson, L.: ACM: Container-based operating sys-
tem virtualization: a scalable, high-performance alternative to hypervisors. ACM SIGOPS Oper. Syst.
Rev. 41, 275–287 (2007)
49. Strunk, A.: Costs of virtual machine live migration: a survey. In: 2012 IEEE Eighth World Congress on
Services, pp. 323–329. IEEE (2012)
50. Thein, T., Park, J.S.: Availability analysis of application servers using software rejuvenation and virtual-
ization. J. Comput. Sci. Technol. 24(2), 339–346 (2009)
51. Torquato, M., Araujo, J., Umesh, I., Maciel, P.: Sware: a methodology for software aging and rejuvena-
tion experiments. J. Inf. Syst. Eng. Manag. 3(2), 15 (2018)
52. Torquato, M., Maciel, P., Araujo, J., Umesh, I.: An approach to investigate aging symptoms and rejuve-
nation effectiveness on software systems. In: 2017 12th Iberian Conference on Information Systems and
Technologies (CISTI), pp. 1–6. IEEE (2017)
53. Torquato, M., Maciel, P., Vieira, M.: A model for availability and security risk evaluation for systems
with vmm rejuvenation enabled by vm migration scheduling. IEEE Access 7, 138315–138326 (2019)
54. Torquato, M., Maciel, P., Vieira, M.: Availability and reliability modeling of vm migration as rejuvena-
tion on a system under varying workload. Softw. Qual. J. 1–25 (2020)
55. Torquato, M., Torquato, L., Maciel, P., Vieira, M.: Iaas cloud availability planning using models and
genetic algorithms. In: 2019 9th Latin-American Symposium on Dependable Computing (LADC), pp.
1–10. IEEE (2019)
56. Torquato, M., Umesh, I., Maciel, P.: Models for availability and power consumption evaluation of a
private cloud with vmm rejuvenation enabled by vm live migration. J. Supercomput. 74(9), 4817–4841
(2018)
57. Torquato, M., Vieira, M.: Interacting srn models for availability evaluation of vm migration as rejuvena-
tion on a system under varying workload. In: 2018 IEEE International Symposium on Software Reli-
ability Engineering Workshops (ISSREW), pp. 300–307. IEEE (2018)
58. Torquato, M., Vieira, M.: An experimental study of software aging and rejuvenation in dockerd. In:
2019 15th European Dependable Computing Conference (EDCC), pp. 1–6. IEEE (2019)
59. Trivedi, K.S., Vaidyanathan, K., Goseva-Popstojanova, K.: Modeling and analysis of software aging
and rejuvenation. In: Proceedings 33rd Annual Simulation Symposium (SS 2000), pp. 270–279. IEEE
(2000)
60. Vaidyanathan, K., Trivedi, K.S.: A comprehensive model for software rejuvenation. IEEE Trans.
Dependable Secure Comput. 2(2), 124–137 (2005)
61. Valmari, A.: The state explosion problem. In: Advanced Course on Petri Nets, pp. 429–528. Springer
(1996)
62. Voorsluys, W., Broberg, J., Venugopal, S., Buyya, R.: Cost of virtual machine live migration in clouds:
a performance evaluation. In: IEEE International Conference on Cloud Computing, pp. 254–265.
Springer (2009)
63. Wang, D., Xie, W., Trivedi, K.S.: Performability analysis of clustered systems with rejuvenation under
varying workload. Perform. Eval. 64(3), 247–265 (2007)
64. Yeboah-Boateng, E.O., Essandoh, K.A.: Factors influencing the adoption of cloud computing by small
and medium enterprises in developing economies. Int. J. Emerg. Sci. Eng. 2(4), 13–20 (2014)
65. Zheng, J., Okamura, H., Dohi, T.: A transient interval reliability analysis for software rejuvenation mod-
els with phase expansion. Softw. Qual. J. 1–22 (2019)
66. Zimmermann, A.: Modelling and performance evaluation with timenet 4.4. In: International Confer-
ence on Quantitative Evaluation of Systems, pp. 300–303. Springer (2017)
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published
maps and institutional affiliations.
Matheus Torquato is a Ph.D. candidate at the University of Coimbra. His research
interests comprise subjects like Cloud Computing, Performance, Dependability, and
Security Modeling. His current research focuses on the design and development of
analytical models to evaluate the performance, dependability, and security of moving
target defense deployments in cloud computing. He received his Master's Degree in
Computer Science from the Federal University of Pernambuco. He is currently on leave
from his teaching activities at the Federal Institute of Alagoas, Campus Arapiraca,
to pursue a Ph.D. at the University of Coimbra. His website is:
https://www.matheustorquato.com/
Paulo Maciel received the degree in electronic engineering in 1987 and the M.Sc. and Ph.D. degrees in
electronic engineering and computer science from the Federal University of Pernambuco, Recife, Brazil,
respectively. He was a faculty member with the Department of Electrical Engineering, Pernambuco Uni-
versity, Recife, Brazil, from 1989 to 2003. Since 2001, he has been a member of the Informatics Center,
Federal University of Pernambuco, where he is currently a Full Professor. In 2011, during his sabbatical
from the Federal University of Pernambuco, he stayed with the Department of Electrical and Computer
Engineering, Edmund T. Pratt School of Engineering, Duke University, Durham, NC, USA, as a Visiting
Professor. His current research interests include performance and dependability evaluation, Petri nets and
formal models, encompassing manufacturing, embedded, computational, and communication systems as
well as power consumption analysis. Dr. Maciel is a Research Member of the Brazilian Research Council.
Marco Vieira received the Ph.D. degree from the University of Coimbra, Coimbra,
Portugal, in 2005. He is currently a Full Professor with the University of Coimbra.
He has participated in and coordinated several research projects, both at the
national and European level. His research interests include dependability and
security assessment and benchmarking, fault injection, software processes, and
software quality assurance, subjects in which he has authored or coauthored more than
200 papers in refereed conferences and journals. Prof. Vieira has served on program
committees of the major conferences of the dependability area and acted as referee
for many international conferences and journals in the dependability and security
areas.