Journal of Network and Systems Management (2022) 30:3

https://doi.org/10.1007/s10922-021-09619-3

Model-Based Performability and Dependability Evaluation of a System with VM Migration
as Rejuvenation in the Presence of Bursty Workloads

Matheus Torquato1,2 · Paulo Maciel3 · Marco Vieira1

Received: 23 December 2020 / Revised: 17 June 2021 / Accepted: 4 August 2021 /
Published online: 14 September 2021
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021

Abstract
Software aging accumulation leads to increased resource consumption. In this con-
text, the memory leak is one of the well-known problems related to software aging.
A bursty workload can accelerate software aging bug activation as it requires instan-
taneous resource allocation. Then, the rapid resource allocation and deallocation
may lead to software aging through memory leaks. Moreover, a bursty workload
may cause a resource exhaustion failure in a system already overloaded by software
aging accumulation. Virtual Machine (VM) migration schedules can be used to
mitigate software aging moving services away from a compromised physical host.
Despite the considerable progress made in this area, the state-of-the-art still lacks a
modeling framework for performability and dependability evaluation of VM migration
as rejuvenation in a system under bursty workloads. This paper proposes a set of
Stochastic Reward Net (SRN) models to fill this research gap. We consider
five scenarios covering different bursty workload conditions, and present a specific
model to cover the uncertainties related to bursty workloads. Our results present the
specific rejuvenation schedule to maximize system performability and dependability
for each scenario. The proposed modeling framework may be useful to support vir-
tualized environment management decisions.

Keywords Software aging and rejuvenation · Bursty workload · Dependability ·
Performability · Modeling

* Matheus Torquato
Extended author information available on the last page of the article


1 Introduction

Virtualization is the core of well-known platforms such as Cloud computing [34],
Virtualized Containers [48], and Network Function Virtualization [38]. Many service
providers, organizations, and companies rely on virtualized environments to
host and run their applications. Among the concerns of virtualized environment
users, we can highlight the assurance of high performance and dependability [11,
21, 42, 64].
Virtualized environments are liable to suffer from software aging problems [2,
29, 32, 51, 58]. Software aging is related to software performance and depend-
ability degradation over long-running execution times [15]. Software aging
effects occur due to the accumulation of software faults during the software life-
time. These effects can lead the system to resource exhaustion, ending up causing
crashes, hangs, and other failures [12, 13, 22].
Software rejuvenation is the countermeasure for software aging [15]. Software
rejuvenation aims at bringing the software with aging accumulation back to a
reliable and stable state after software aging cleaning [4, 59]. Standard software
rejuvenation actions comprise software restart and Operating System (OS) reboot
[60]. However, in some cases, such measures may impose unacceptable service
downtime. There are some supporting techniques used to reduce software rejuve-
nation availability impact. For example, the use of Virtual Machine (VM) migra-
tion to move a service away from a physical host with software aging accumula-
tion [52]. For further reading on this topic, we recommend papers [8, 44], which
present a comprehensive background of software aging and rejuvenation in the
cloud.
Each VM migration, even in live migration mode, has an associated downtime
[7]. A usual approach is to find a VM migration-based rejuvenation schedule that
maximizes system availability. It should avoid both: availability degradation due
to frequent migrations; and software aging failures due to lack of rejuvenation.
Previous research efforts applied analytical modeling to find the best rejuvenation
schedules in such scenarios [23, 36, 40, 50].
Software aging imposes resource consumption overhead. Consequently, it also
affects system performance. It is important to define rejuvenation schedules to
minimize the probability of software aging-related resource exhaustion. How-
ever, in realistic situations, we should consider other aspects of system resource
consumption (besides software aging) to provide better evaluations. For example,
bursty workloads may lead to faster resource exhaustion.
This paper proposes Stochastic Reward Net (SRN) models for performability
and dependability evaluation of a system with VM migration scheduling. The
considered system uses VM migration scheduling as support for software rejuve-
nation. Our evaluation scenarios cover the aspects of software aging and bursty
workloads.
The main research question is as follows:
RQmain : What are the performability and dependability levels of a virtualized
system with VM migration subject to software aging and bursty workload?


We divided this question into three subquestions:

• RQ1: What is the VM migration scheduling policy which maximizes the system
availability?
• RQ2: What is the VM migration scheduling policy which maximizes the system
throughput?
• RQ3: What are the system reliability levels when applying the VM migration pol-
icies that maximize system availability?

We consider a virtualized system with three main components: (i) the VM, running
the user’s service; (ii) the Main Node, a physical machine providing resources for
VM execution; and (iii) the Standby Node, which receives the VM migration. We apply
guard functions and marking-dependent firing rates in the model to cover the
following aspects: (i) the influence of burst occurrence on the acceleration of
resource exhaustion; (ii) the influence of the level of resource degradation on the
VM migration process; and (iii) bursty workload uncertainties (i.e., duration,
intensity, and probability).
We present three case studies to exercise our models. The first shows the sys-
tem steady-state availability evaluation; the second brings the system steady-state
throughput results; the third is about the reliability evaluation. We consider a set of
scenarios in each case study covering different Asset Classes. In this paper, an Asset
Class is the definition of the associated bursty workload conditions, namely, dura-
tion, intensity, and probability. Specifically, we consider five Asset Classes ranging
from light bursts to heavier bursts. Note that these Asset Classes are illustrative for
our evaluation purposes. In the best scenario, we would collect the burst incidence
from the system behavior observation.
The highlights of this paper are:

– We propose a set of Petri Net models for system performability and dependabil-
ity evaluation considering software aging and bursty workloads;
– The results allow the investigation of proper rejuvenation policies to maximize
the system availability or the system throughput (in some cases, both);
– Each case study presents a set of scenarios to support the decision-making pro-
cess.

Wang et al. [63] and our previous works [54, 57] tackled similar problems considering
varying workload aspects. However, these works neglect the uncertainties related to
burst occurrence, such as probability, intensity, and duration. Moreover, this
paper presents a more comprehensive result set covering availability, reliability, and
system throughput. A more detailed comparison with related works is in Sect. 7.
In this paper, we apply VM migration scheduling only as software rejuvenation,
which means that the proposed rejuvenation policy is not the direct countermeasure
to bursty workloads. Our goal is to extend previous works, such as [23, 24, 36, 54,
57], which investigate VM migration as software rejuvenation but neglect the impact
of the possible occurrence of bursty workloads in the system metrics. Nevertheless,


as presented in [35], it is possible to adjust the proposed model to cover VM migration
triggering based on the system utilization status (i.e., adjusting the VM migration
triggering in response to system utilization peaks¹).
Our set of models may be useful for conducting the performability and depend-
ability evaluation in similar system architectures. It is also possible to expand or
reduce them for specific scenarios where we consider only specific parts of the
model, such as (i) bursty workload uncertainties; (ii) general behavior of non-
aging failures and repairs; (iii) VM migration behavior; or (iv) software aging and
rejuvenation.
The rest of this paper is organized as follows. To complement the research gaps
presented above, Sect. 2 raises additional reasons for conducting this research. Sec-
tion 3 presents the details of our methodology. Section 4 presents the assumptions
behind this research. Section 5 discusses the proposed modeling framework, includ-
ing the availability and performability models. Section 6 presents the case studies.
Section 7 discusses related work. Finally, Sect. 8 presents conclusions and future
work directions.

2 Motivation

The majority of public clouds such as Google Cloud² and Amazon AWS³ provide
specific Service Level Agreements (SLAs) for system availability. The break of
an availability SLA brings revenue loss for both the provider and the client [43].
Nowadays, even smaller and private clouds may host applications that demand high
dependability and performance levels.
The main problem here is how to set up a consolidated evaluation approach to
take into account these desired metrics. In scenarios where it is hard to conduct the
evaluation based on measurements, we can use the modeling approach [16]. In the
performability and dependability evaluation, in most cases, the measurement-based
evaluation is impractical because dependability events are infrequent (e.g., the mean
time to failure of a system can be above 1000 h [46]).
The use of the modeling approach may solve the problem for the performability
and dependability evaluation. However, a challenge is to set up realistic models. In
this research, we take a step towards more realistic models trying to cover specific
details of VM migration and the workload’s influence in the software aging process.
To the best of our knowledge, this is the first research effort towards the performability,
availability, and reliability evaluation of a system with VM migration scheduling
subject to software aging and bursty workload. Due to these aspects’ relevance, the
models may be useful to design management policies and SLAs for virtualized envi-
ronment users.

¹ Bursty workload occurrence usually causes a system utilization peak.
² https://cloud.google.com/
³ https://aws.amazon.com/


Fig. 1  Research methodology—flowchart

3 Methodology

Our research methodology consists of four steps:

1. Scope definition: In this research, we evaluate the performability and dependability
of a private cloud with software rejuvenation based on VM migration. The system
mentioned is liable to suffer bursty workloads.
2. Metrics definition: In the performability and dependability domain, we focus on
three specific metrics: Availability, Reliability, and System throughput.
3. Model design: We adopted a hierarchical model approach, with the performance
model (M2) receiving inputs from the availability model (M1). We also propose
a specific model (burst cycle model) to cover the bursty workload occurrence
uncertainties (i.e., intensity, cycle, duration, and probability).
4. Model analysis and results: We propose three case studies, one for each desired
metric. In these case studies, we evaluate the impact of the VM migration schedule
on system availability and performance. We also analyze the system reliability
in the first month.

Figure 1 presents a flowchart of the proposed methodology, including the keywords
related to each step.
It is plausible that some limitations arise due to our research design decisions.
Here, we highlight two that we find important and introduce the proposed mitigation
methods. First, we decided to use modeling instead of simulation or measurements.
Modeling produces less accurate results than measurements or simulation [16]. Note
that simulation requires the implementation of specific simulators for the studied
scenarios. Measurement-based evaluations require a dedicated system to perform
experiments and workloads. Besides that, it is hard to reproduce rare events (e.g.,
long mean time to failure of virtual machines) in the environment. The acceleration
of such events may influence software aging accumulation, leading to biased results.


Fig. 2  System architecture

Second, the considered architecture is small. We acknowledge this limitation and
include further explanation in the system architecture section (Sect. 4.1).

4 Assumptions

This section presents our assumptions for this work. It has four subsections. Sec-
tion 4.1 presents the details of the considered system architecture. Section 4.2
describes the failure and operational modes. Section 4.3 contains the approach for
software aging and rejuvenation modeling. Finally, Sect. 4.4 explains our strategy
for bursty workload modeling.

4.1 System Architecture

The considered system architecture has three components: VM, Main Node and
Standby Node. The physical machines are in a private network. We assume this
virtualized environment is liable to receive bursty workloads. The details of how we
incorporate the bursty workload behavior in the models are in Sect. 4.4. Figure 2
summarizes the system architecture.
The proposed model and evaluations are focused on the architecture presented
in Fig. 2. This architecture is simpler than the ones usually used in cloud data cent-
ers. Rather than proposing a model for a large data center, we decided to consider
a simpler architecture, but include details regarding the system behavior which
were missing in previous works (e.g., pre-copy phase of VM live migration, using
marking-dependent firing rates to cover software aging effects, and bursty workload
occurrence). This decision leads to a complex model which is difficult to scale up
due to state-space explosion problems [61]. We highlight two possible approaches
to cope with the scalability of the model. (1) Simulation: using simulation frameworks
(e.g., SimPy⁴), it is possible to simulate the aspects considered in the models

⁴ https://simpy.readthedocs.io/en/latest/


in larger architectures. (2) Numerical extrapolation: using an idea similar to [55],
we can estimate the behavior of larger architectures. The idea is to consider our simple
architecture as a building block of a larger architecture and then estimate the results
for larger architectures by combining these building blocks. Both approaches are out
of the scope of this paper and are kept for future work.

4.2 Failure and Operational Modes

The VM is the most critical component for system availability, meaning
that the system is available only when the VM is running. Therefore, any system
behavior that interrupts the running VM leads to system unavailability.
The first failure mode is the VM non-aging failure: software or operating
system faults can interrupt the VM service, causing unavailability. As the VM
depends on the Main Node to run, any Main Node interruption will cause system
unavailability. Standby Node failures do not cause unavailability directly. However,
Standby Node unavailability prevents VM migration. Likewise, we assume that, as
long as the Standby Node is active, there is the capacity to receive the VM migra-
tion. The system suffers a short downtime on each VM migration operation. So, fre-
quent migration may have a severe impact on overall system availability. Besides
that, the system may fail due to resource exhaustion because of software aging. We
assume that the operations for system repairing after a resource exhaustion failure
comprise software rejuvenation actions like OS reboot or application restart.
Bursty workload occurrence accelerates the depletion of the resources, leading to
quicker failure due to resource exhaustion.

4.3 Software Aging and Rejuvenation Modeling Approach

We adopted the VM-migrate rejuvenation technique [24, 36]. The technique con-
sists of VM migration scheduling to move the VM from a host with software aging
accumulation to another host without aging accumulation. A software component
capable of communicating with the virtualized environment is in charge of submitting
VM migration commands according to a predefined schedule. In this paper, we
call this software component the Clock. When the Clock time counting reaches
the predefined schedule, it submits the VM migration command to the virtualized
environment. The virtualized environment receives the message from the Clock and
verifies the system components’ status. Then, the VM migration occurs if all the sys-
tem components are available (i.e., Main Node, VM and Standby Node). After a suc-
cessful VM migration, the VM goes from the Main Node to the Standby Node. As
soon as the VM arrives in the Standby Node, the Standby Node swaps the role with
the Main Node, meaning that it turns into the Main Node. The previous Main Node
(VM migration source) undergoes software rejuvenation after the VM migration.
Then, after the software rejuvenation completion, the previous Main Node assumes
the role of Standby Node and will wait for new VM migration requests.
Following the classic definitions in the paper [15], Fig. 3 presents a state machine
with the considered system behavior related to software aging and rejuvenation. The


Fig. 3  State machine of software aging and rejuvenation, adapted from [15]

state machine diagram has four states: fresh—system without software aging accu-
mulation; degraded—system with degraded performance or increased failure rate
due to software aging accumulation; failed—system non-operational due to a soft-
ware aging failure; and rejuvenation - system under rejuvenation actions. The initial
state is the fresh state. From this point, the expected software aging accumulation
leads the system to state degraded. If no rejuvenation action occurs, the system suf-
fers a software aging failure, going to the failed state. The system returns to fresh
state after a system repair. However, if the system performs a rejuvenation action,
it goes to the state rejuvenation. In this paper, the considered rejuvenation action
is VM migration. Therefore, we assume that the system faces the VM migration
downtime in the rejuvenation state. The software rejuvenation brings the system
back to the fresh state. As mentioned earlier, the Clock component follows a predefined
schedule. Therefore, some unnecessary migrations (i.e., VM migrations triggered
in a system without software aging accumulation) may occur, depending on the time
interval between migrations.
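
The state machine described above can be sketched as a small transition table. This is only an illustrative sketch: the event names are hypothetical labels chosen here, and the transition from fresh directly to rejuvenation is an assumption used to represent the unnecessary migrations mentioned in the text.

```python
from enum import Enum


class State(Enum):
    FRESH = "fresh"
    DEGRADED = "degraded"
    FAILED = "failed"
    REJUVENATION = "rejuvenation"


# Allowed transitions, keyed by (current state, event).
# Event names are hypothetical labels chosen for this sketch.
TRANSITIONS = {
    (State.FRESH, "aging"): State.DEGRADED,
    (State.DEGRADED, "aging_failure"): State.FAILED,
    (State.FAILED, "repair"): State.FRESH,
    (State.DEGRADED, "migrate"): State.REJUVENATION,
    # Assumption: a scheduled migration may also fire on a fresh system
    # (the "unnecessary migration" case mentioned in the text).
    (State.FRESH, "migrate"): State.REJUVENATION,
    (State.REJUVENATION, "rejuvenation_done"): State.FRESH,
}


def step(state, event):
    """Return the next state, or raise ValueError if the event is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} not allowed in state {state.value}"
        ) from None


def run(events, state=State.FRESH):
    """Apply a sequence of events, starting from the given state."""
    for event in events:
        state = step(state, event)
    return state
```

For example, `run(["aging", "migrate", "rejuvenation_done"])` walks the rejuvenation path and ends in the fresh state, while `run(["aging", "aging_failure"])` ends in the failed state.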

4.4 Bursty Workload Modeling Approach

Modeling an unexpected event such as a bursty workload is a complicated
task because the burst characteristics are often related to the targeted system.
Depending on the target, the burst duration, probability, and intensity can be higher
or lower. Therefore, setting up a generic approach for bursty workload modeling
is impractical due to the myriad of possibilities. However, proposing a set of possible
scenarios of bursty workloads can help support the design of SLAs or internal reju-
venation policies.
In this study, we consider the following four main characteristics of the bursty
workloads. Cycle—Supposed time between bursts. Burst probability—Probability
of burst occurrence. Burst duration—Time that the system spends under the bursty
workload. Burst intensity—Severity of the workload submitted to the system. We
take these four characteristics into account in a specific model named “burst cycle”.


Fig. 4  A diagram for the burst cycle model

Figure 4 presents a diagram of the burst cycle model. The circles represent
the system states, the continuous arcs represent the time delay for the state tran-
sition, the dashed arcs represent the state transition based on probability (instead
of time delay), and the gray arc with a circle represents an attribute of a specific
system state.
The initial state of the burst cycle model is the start state, which represents
the start of the time cycle between burst occurrence. After the Cycle, the system
state goes from start to end. In the end state, based on probability (burstProb-
ability), the system can restart the cycle (going back to the start state), or suf-
fer the bursty workload (moving the system state from end to underBurst). The
underBurst state has the burstIntensity attribute, which, as mentioned earlier,
indicates the severity of the workload submitted by the bursty workload. After
the time delay burstDuration, the cycle for the next burst begins. In this paper,
we incorporate the behavior of the burst cycle model using an SRN model presented
in Sect. 5.1.
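
As a rough illustration of the burst cycle described above, the sketch below estimates the long-run fraction of time the system spends under a burst. It is not the SRN model itself. The closed form follows from a renewal-reward argument (each cycle lasts the Cycle delay plus, with probability burstProbability, the burstDuration delay), and the simulation assumes exponentially distributed delays, which is an assumption of this sketch. All function names and parameter values are hypothetical.

```python
import random


def burst_time_fraction(cycle, burst_probability, burst_duration):
    """Long-run fraction of time under burst, from mean delays only."""
    expected_burst = burst_probability * burst_duration
    return expected_burst / (cycle + expected_burst)


def simulate_burst_fraction(cycle, burst_probability, burst_duration,
                            iterations=200_000, seed=1):
    """Monte Carlo estimate of the same fraction.

    Assumption of this sketch: both delays are exponentially distributed
    with the given means.
    """
    rng = random.Random(seed)
    total_time = 0.0
    time_under_burst = 0.0
    for _ in range(iterations):
        # start -> end: wait for the cycle delay.
        total_time += rng.expovariate(1.0 / cycle)
        # end -> underBurst with probability burst_probability,
        # otherwise end -> start (cycle restart).
        if rng.random() < burst_probability:
            duration = rng.expovariate(1.0 / burst_duration)
            time_under_burst += duration
            total_time += duration
    return time_under_burst / total_time
```

With hypothetical values of a 24 h cycle, a burst probability of 0.5, and a 2 h burst duration, the closed form gives 1 / 25 = 0.04, and the simulated estimate should be close to that value.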
In this study, we apply the burst cycle model for the bursty workload. We
used parameters from previous studies for model evaluation. However, it is
possible to customize burst cycle model parameters. In some situations, it is
hard to find specific datasets to use as input for the analysis. The burst cycle model
approach enables system managers to set up reasonable scenarios based
on their knowledge about the environment. Therefore, they can build better
SLAs and internal policies, considering the possibilities of bursty workload
occurrence.
Finally, we highlight that in this paper, the VM is the target for the bursty
workload and not the Main Node. Therefore, a VM migration also switches the
workload to the VM migration target host during a burst.


Fig. 5  Models relationship

5 Models

This section has two subsections. The first presents the details of the proposed SRN
model for the availability evaluation. The second explains the performance
evaluation model, which is an M/M/1/k queue. The models obey the relationship
presented in Fig. 5. The availability model (M1) provides two inputs to the
performance model (M2). The first is the performance penalty (Penalty) due to
resource depletion. We compute Penalty from a place of the availability model,
which represents the level of resource depletion (ResourcesDepletion place).
Besides that, M1 also provides the system unavailability (UA) as input for M2. We
compute UA observing the probability of the absence of tokens in the place UP,
which is the place used to represent the system availability. More details of the inter-
actions between M1 and M2 are in the next sections. Note that we consider M/M/1/k
as this is one of the most used queuing models for client-server applications. Nev-
ertheless, it is possible to adapt M2 to other scenarios by including the effects of
Penalty and UA in other queueing models. Section 5.2 provides details on how
we incorporate such effects in the M/M/1/k model. Literature exists to support the
design of other queuing models based on Petri Nets [5, 18, 47].
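
To make the role of the two inputs more concrete, the sketch below computes the textbook M/M/1/k throughput and then applies Penalty and UA in one possible way. The first two functions use the standard closed-form state probabilities. The last function is a hypothetical composition written for this example: treating Penalty as a fractional slowdown of the service rate and scaling throughput by (1 − UA) are assumptions of this sketch and are not the formulation of Sect. 5.2.

```python
def mm1k_state_probabilities(arrival_rate, service_rate, capacity):
    """Steady-state probabilities of having 0..capacity jobs in an M/M/1/k queue."""
    rho = arrival_rate / service_rate
    weights = [rho ** n for n in range(capacity + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def mm1k_throughput(arrival_rate, service_rate, capacity):
    """Accepted-job rate: arrivals that are not blocked by a full queue."""
    blocking = mm1k_state_probabilities(arrival_rate, service_rate, capacity)[-1]
    return arrival_rate * (1.0 - blocking)


def degraded_throughput(arrival_rate, service_rate, capacity,
                        penalty, unavailability):
    """Hypothetical composition of the queue with Penalty and UA.

    Assumptions of this sketch (not taken from the paper):
    - penalty in [0, 1) is a fractional slowdown of the service rate;
    - no requests are served while the system is unavailable.
    """
    if not 0.0 <= penalty < 1.0:
        raise ValueError("penalty must be in [0, 1)")
    if not 0.0 <= unavailability <= 1.0:
        raise ValueError("unavailability must be in [0, 1]")
    effective_service_rate = service_rate * (1.0 - penalty)
    queue_throughput = mm1k_throughput(arrival_rate, effective_service_rate, capacity)
    return (1.0 - unavailability) * queue_throughput
```

For instance, with an arrival rate of 1, a service rate of 2, and k = 2, the blocking probability is 1/7 and the throughput is 6/7 of the arrival rate.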
We use two separate models for performability and dependability evaluation,
for the following reason. Transitions whose firing delays differ by several orders of
magnitude cause the stiffness problem [6]. In our modeling
framework, we have the mean time to failure of the Main Node, which is above
1000 h, and the system service time, measured in milliseconds. By using the model
decomposition, we can mitigate stiffness as the evaluations of performability (with
delays of milliseconds) and dependability (with transition delays of months) are per-
formed separately.

5.1 Availability Model

The proposed availability model has three sub-models: (i) clock model; (ii) burst
cycle model and (iii) system model (Fig. 6). These models interact using guard
functions and transitions with marking-dependent firing rates which will be later
explained.
Fig. 6  Availability model

The first model is the clock model. The clock model has the Clock and ReadyToMigrate
places and the Trigger and ResetClock transitions. The clock model represents the
behavior of a software component responsible for the VM migration scheduling
process. In the initial state, the Clock place has a token, enabling the firing of
the deterministic transition Trigger, which represents that the time counting for
the VM migration is active. Then, the Trigger
transition firing removes the token from the Clock place and puts a token in the
ReadyToMigrate place. The ReadyToMigrate place with a token represents
that the system reaches the planned schedule for VM migration. However, besides
the planned schedule, the VM migration (StartLM transition firing) depends on
a few more conditions, such as: (i) Main Node and VM running (token in the UP place);
and (ii) Standby Node running (token in the SN_UP place). We embedded all these
conditions in the StartLM transition using guard functions. Table 1 contains the
meaning and the associated guard functions of all the immediate transitions. The
transition ResetClock represents the start of the time counting for the next VM
migration. The ResetClock firing depends on the token’s presence in the place
ReadyToMigrate and in the place Mig, representing that the system clock
restarts its cycle right after the beginning of the VM migration. In this model, we
assume that the clock works in this cycle permanently.
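
The guard functions of Table 1 can be read as Boolean predicates over the current marking. The sketch below is a minimal illustration, assuming a marking is a dictionary from place names to token counts. It reads the Standby Node condition as #SN_UP>0 and uses the Table 1 name StartMig for the migration start transition. It only evaluates guards: input and inhibitor arcs, which also affect enabling, are not modeled here.

```python
# Guard functions keyed by transition name. Each guard receives the marking,
# a dict mapping place names to token counts.
GUARDS = {
    "StartMig": lambda m: m["ReadyToMigrate"] > 0 and m["UP"] > 0 and m["SN_UP"] > 0,
    "ResetClock": lambda m: m["Mig"] > 0,
    "SysFail": lambda m: m["UP"] == 0 or m["SN_UP"] == 0,
}


def guard_allows(transition, marking):
    """Return True if the transition has no guard or its guard holds."""
    guard = GUARDS.get(transition)
    return True if guard is None else guard(marking)


# Hypothetical marking: schedule reached, Main Node, VM and Standby Node running.
marking = {"Clock": 0, "ReadyToMigrate": 1, "UP": 1, "SN_UP": 1, "Mig": 0}
print(guard_allows("StartMig", marking))   # guard holds for this marking
print(guard_allows("SysFail", marking))    # guard does not hold
```

Setting `marking["SN_UP"] = 0` makes the StartMig guard fail and the SysFail guard hold, which matches the statement that Standby Node unavailability prevents VM migration.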
The burst cycle model characterizes the bursty workload considered in the environment.
It obeys the behavior described in Sect. 4.4. The token in the Start
place enables the firing of the Cycle transition, representing the supposed time
between bursts. To better capture the uncertainty in this modeling process, we
use an exponential distribution instead of a deterministic delay in the Cycle
transition. The Cycle transition firing removes the token from the Start place and deposits
a token in the End place. The token’s presence in the End place enables the transitions
Burst and noBurst concurrently. As presented in [19], we assigned different
weights to the arcs of the transitions Burst and noBurst to represent
the probability of burst occurrence. The arc from End to the transition Burst has
weight = burstProb (burstProb is a variable related to the burst occurrence probability),
and the arc from End to the noBurst transition has weight = (1 − burstProb). In
the case of the noBurst firing, the token goes back to the Start place, representing
the cycle restart. Otherwise, in the case of Burst firing, the transition removes

Table 1  Immediate transitions and associated guard functions

Transition | Meaning | Associated guard function
--- | --- | ---
StartMig | VM migration start | (#ReadyToMigrate>0) AND (#UP>0) AND (#SN_UP>0)
ResetClock | Start of time counting for the next VM migration | (#Mig>0)
SysFail | System failure during migration | (#UP==0) OR (#SN_UP==0)
Burst | Immediate transition that represents the burst occurrence | No guard function
noBurst | Immediate transition that indicates that there is no burst occurrence in the cycle iteration | No guard function
Aging | Start of the resource depletion | No guard function
Clear1 and Clear2 | Resource depletion clearance due to software rejuvenation or recovery after a failure | No guard function
ResourcesExhaustion | Resource exhaustion failure | No guard function

Table 2  Transitions with marking-dependent firing delays

Transition | Marking-dependent firing delay | Meaning
--- | --- | ---
Phase | IF(#UnderBurst>0): phaseDuration/BurstIntensity ELSE phaseDuration | If the system is under a bursty workload, the resource depletion progress phase is accelerated by a factor of the burst intensity. Otherwise, the resource depletion progress follows the expected time delay (phaseDuration variable).
PC | IF(#ResourcesDepletion==0): Precopy ELSE Precopy+Precopy*(#ResourcesDepletion/3) | This marking-dependent firing delay captures the influence of the number of dirty memory pages on the VM live migration process [45, 49].

the token in the End place and puts a token in the UnderBurst place, represent-
ing that the system is suffering a bursty workload. The BurstDuration transition
represents the time duration of the bursty workload. As long as the system is under
a burst (i.e., a token is present in the UnderBurst place), it suffers accelerated
resource depletion. This acceleration is related to the burst intensity
mentioned in Sect. 4.4. We model this behavior using a marking-dependent firing
delay on the transition Phase of the system model. The Phase transition represents
the resource depletion progress. Table 2 presents the details of the transitions
with marking-dependent firing delays. The BurstDuration firing removes the
token from the UnderBurst place and puts a token in the Start place, represent-
ing the cycle restart.
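
The marking-dependent firing delays of Table 2 can be written as ordinary functions of the marking. The sketch below is a direct transcription of the two expressions, assuming a marking is a dictionary from place names to token counts. The function names and the numeric values in the usage note are hypothetical.

```python
def phase_delay(marking, phase_duration, burst_intensity):
    """Delay of the Phase transition (resource depletion progress).

    Under a burst, the expected delay is divided by the burst intensity.
    """
    if marking["UnderBurst"] > 0:
        return phase_duration / burst_intensity
    return phase_duration


def precopy_delay(marking, precopy):
    """Delay of the PC transition (Pre-copy phase of the live migration).

    The delay grows with the number of tokens in ResourcesDepletion,
    following Precopy + Precopy * (#ResourcesDepletion / 3).
    """
    depletion = marking["ResourcesDepletion"]
    if depletion == 0:
        return precopy
    return precopy + precopy * (depletion / 3)
```

With a hypothetical phaseDuration of 12 time units and a BurstIntensity of 4, the Phase delay drops from 12 to 3 while a token is in UnderBurst. With three tokens in ResourcesDepletion, the Pre-copy delay doubles.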
The system model intends to cover three system aspects: (i) behavior of non-
aging failures and their repairs; (ii) resources depletion due to software aging; and
(iii) VM migration. We highlighted the model’s sections that represent each one of
these behaviors.
In this paper, we consider system non-aging failures (e.g., hardware or operating
system failures). At the initial state, the system model has a token in the UP place.
A token in the UP place represents that the Main Node and the hosted VM are running,
and the absence of tokens in the UP place represents system unavailability. From the
initial state, the Main Node can suffer a non-aging failure (MN_f transition firing).
The MN_f transition firing removes the token from the place UP and puts a token in
the place DW. The Main Node repair has two steps: (1) Main Node recovery (MN_r
transition firing, moving the token from DW to VM_S place) and (2) VM reboot
(VM_rb firing, returning the token to the UP place). The system can also suffer a
VM non-aging failure. The transition VM_f firing represents a VM non-aging failure
occurrence. After a VM failure, two events are possible: a
VM repair (transition VM_r firing), or a subsequent Main Node failure (MN_f2 firing,
moving the token from VM_DW to the DW place). We also cover the non-aging
failure and repair processes in the Standby Node. SN_UP and SN_DW places repre-
sent the Standby Node status related to the non-aging failures and repairs. SN_UP

13
3 Page 14 of 33 Journal of Network and Systems Management (2022) 30:3

place with tokens represents the Standby Node running and ready to receive VM
migration. The transitions SN_f and SN_r represent a Standby Node non-aging
failure and repair, respectively. Standby Node unavailability (token in the SN_DW
place) affects the system availability indirectly as it prevents the software rejuvena-
tion for aging failures avoidance.
As mentioned earlier, the StartLM transition represents the VM migration start and
it has an associated guard function (see Table 1). We assume Pre-copy VM live
migration [7] as the VM migration method. The Pre-copy algorithm has two main phases:
(i) Pre-copy phase (transition PC): transfer of the memory pages from the Main
Node to the Standby Node; (ii) Downtime phase (transition LM_dwt): transfer of
the processor state and VM migration acknowledgment. StartMig firing deposits
a token in the Mig place. We used inhibitor arcs⁵ to indicate that a new migration
can only occur after the previous one has finished. A token in the Mig place
represents that the VM migration is in the Pre-copy phase. During the Pre-copy phase
the system continues to run (the token stays in the UP place). The SysFail transition
represents possible system failures (i.e., Main Node, VM or Standby Node
failures) during the Pre-copy phase. We also embed this behavior using guard
functions. A system failure during the Pre-copy phase aborts the VM migration,
thus the SysFail transition removes the token from the Mig place. As presented in
[1, 20, 33, 62], the amount of dirty memory pages affects the VM migration latency.
To represent this behavior, we used a marking-dependent firing delay in the PC
transition (see Table 2). The marking-dependent firing delay increases the supposed
delay for the Pre-copy phase (Precopy variable) according to the status of system
resources depletion (i.e., number of tokens in the ResourcesDepletion place). After
the completion of the Pre-copy phase (firing of the PC transition), the system enters
the Downtime phase (token in the DW_Mig place). In the Downtime phase the system is
unavailable. We represent this behavior by removing the token from the UP place after
the PC transition firing. The system becomes available again after the VM migration
completion (LM_dwt transition firing). As mentioned in Sect. 4.3, the previous Main
Node (i.e., the VM migration source) passes through software rejuvenation before
assuming the role of the Standby Node. Thus, LM_dwt firing puts a token in the
SN_W place, representing that the previous Main Node is waiting for the software
rejuvenation. Transition Rej represents the software rejuvenation action; its firing
puts a token back in the SN_UP place. This behavior represents the rejuvenation
completion and that the Standby Node is ready to receive VM migrations.
Finally, we describe the modeling of resource depletion due to software aging. To
represent the resource depletion behavior, we adopted a four-phase Erlang
distribution, as the Erlang distribution is suitable to represent Increasing Failure
Rate (IFR) behavior [14]. We used an Erlang subnet with four phases to represent the
IFR in the system model. The Erlang subnet is in the upper part of the system model.
The places AvailableResources and ResourcesDepletion are related to the system
resources depletion status. The number of tokens in the AvailableResources place
denotes the amount of resources available, and the number of tokens in the
ResourcesDepletion place denotes the resources depletion status.

⁵ Arcs terminating in a circle instead of an arrowhead.
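The IFR property of the Erlang distribution can be checked numerically. The following is a minimal sketch, not part of the SPN model: it assumes four identical exponential phases with the 62.5 h mean listed for AgingPhase in Table 3, ignores the burst-dependent adjustment, and uses hypothetical function names.

```python
import math

def erlang_survival(t, phases, rate):
    """P(T > t) for an Erlang distribution with `phases` stages of rate `rate`."""
    x = rate * t
    return math.exp(-x) * sum(x**n / math.factorial(n) for n in range(phases))

def erlang_hazard(t, phases, rate):
    """Instantaneous failure rate h(t) = f(t) / S(t)."""
    x = rate * t
    density = rate * math.exp(-x) * x**(phases - 1) / math.factorial(phases - 1)
    return density / erlang_survival(t, phases, rate)

PHASES = 4                  # four-phase Erlang, as in the text
PHASE_MEAN_H = 62.5         # mean time per phase (Table 3), bursts ignored
RATE = 1.0 / PHASE_MEAN_H   # per-phase rate in 1/h

# Hazard rate sampled at a few illustrative instants (hours).
hazards = [erlang_hazard(t, PHASES, RATE) for t in (50, 100, 200, 400, 800)]
for t, h in zip((50, 100, 200, 400, 800), hazards):
    print(f"t = {t:4d} h  hazard = {h:.6f} per hour")
```

The printed hazard grows with time and stays below the per-phase rate, which is the increasing-failure-rate behavior that motivates the Erlang subnet.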
In the initial state, the transition Aging fires, swapping⁶ the token from the
UP place. The same transition deposits four tokens in the place AvailableResources.
The number of tokens denotes the amount of resources still available for
the Main Node usage. As time passes, the Main Node starts to accumulate software
aging effects. The firing of the transition Phase represents the resources
consumption progress: it removes tokens from the AvailableResources place and
deposits tokens in the ResourcesDepletion place. If the software aging status
persists in the system, it can suffer a resource exhaustion failure
(ResourcesExhaustion transition). We highlight that the Phase firing rate is also
adjusted when the system is under a bursty workload, meaning faster resource
exhaustion. ResourcesExhaustion firing removes the token from the UP place and puts
a token in the DW2 place. After a resource exhaustion failure, the system recovery
has three phases: (i) detection of resource exhaustion; (ii) software management to
clean up residuals; (iii) complete OS reboot. We model these steps in a single
exponential transition named Repair. After Repair firing, a token returns to the UP
place, representing that the system is available again.

5.1.1 Metrics Computation

We obtain two metrics from the availability model. The first is the system
availability (A). We compute the system availability as the probability of token
presence in the place UP (A = P{#UP > 0}). We use the availability to obtain
secondary metrics such as unavailability (UA) and annual downtime in hours per
year⁷ (Dwt). The expressions are as follows: UA = 1 − A and Dwt = 8760 ⋅ UA. Note
that, as we are computing steady-state availability, it is possible to obtain the
downtime for other intervals (besides one year). For example, we can compute the
monthly downtime in minutes⁸ using Dwt = 43,200 ⋅ UA. The second metric, as
mentioned earlier, is the Penalty, computed using the expression
Penalty = E(#ResourcesDepletion)/3, which is a normalized value of the
expected number of tokens in the ResourcesDepletion place. We take account
of resource depletion accumulation when the place ResourcesDepletion has
tokens. Therefore, the possibilities are ResourcesDepletion with one, two or
three tokens. To normalize the Penalty metric, we divide the expected number
of tokens in the place ResourcesDepletion by three.
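The expressions above reduce to simple arithmetic once the steady-state values are known. As a minimal sketch (the function names are hypothetical, and the input values in the usage lines are illustrative rather than model outputs):

```python
HOURS_PER_YEAR = 365 * 24          # 8760, a year with 365 days
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200, a month with 30 days

def unavailability(availability):
    """UA = 1 - A."""
    return 1.0 - availability

def annual_downtime_hours(availability):
    """Dwt = 8760 * UA, in hours per year."""
    return HOURS_PER_YEAR * unavailability(availability)

def monthly_downtime_minutes(availability):
    """Dwt = 43,200 * UA, in minutes per month."""
    return MINUTES_PER_MONTH * unavailability(availability)

def penalty(expected_depletion_tokens, max_tokens=3):
    """Penalty = E(#ResourcesDepletion) / 3, a value between 0 and 1."""
    return expected_depletion_tokens / max_tokens

# Illustrative inputs only: A = 0.999 and 1.5 expected tokens.
print(annual_downtime_hours(0.999))
print(monthly_downtime_minutes(0.999))
print(penalty(1.5))
```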

⁶ Receiving and returning.
⁷ We consider a year with 365 days.
⁸ Considering a month with 30 days. 30 ⋅ 24 ⋅ 60 = 43,200.

Fig. 7  Performance model: SPN for an M/M/1/k queue

5.2 Performance Model—M/M/1/k Queue

In some situations, it is important to understand system performance behavior
when considering bursty workloads.⁹ In this section, we present a queueing model
to evaluate system performance. This performance model aims to provide system
throughput results. The queue model receives a variable named Penalty from the
availability model output. We obtain Penalty by observing the steady-state expected
number of tokens in the place ResourcesDepletion. We use the variable Penalty
to adjust the service time according to the system resources depletion.
Besides that, we also cover the influence of system unavailability on the
performability metric.
We consider that the VM runs a user application or a service that receives and
processes requests obeying an M/M/1/k queue model [17]. Machida et al. adopted
the same approach in a similar problem [24]. Figure 7 presents the SPN model used
for M/M/1/k queue metrics calculation. Transition arrival represents transaction
arrival in the system. The transaction acceptance depends on the available buffer
space (i.e., queue capacity) and system availability.
We parameterized the place buffer with k tokens. The k variable represents the
queue capacity. The transition arrival firing removes one token from the buffer
place and deposits one token in the queue place, representing a transaction arrival
in the system. Transition service represents the transaction processing. After
service firing the token returns to the buffer place, representing the finishing of
a transaction processing.
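The token flow described above can be sketched as a small token game. This is only an illustration of the enabling and firing rules: it ignores firing delays and the availability condition, the capacity K = 3 is a hypothetical value, and the helper names are not part of any tool.

```python
def enabled(marking, transition):
    """A transition is enabled when every input place holds enough tokens."""
    return all(marking[p] >= w for p, w in transition["inputs"].items())

def fire(marking, transition):
    """Return the marking reached by firing an enabled transition."""
    if not enabled(marking, transition):
        raise ValueError("transition is not enabled in this marking")
    new_marking = dict(marking)
    for place, weight in transition["inputs"].items():
        new_marking[place] -= weight
    for place, weight in transition["outputs"].items():
        new_marking[place] += weight
    return new_marking

K = 3  # hypothetical queue capacity
ARRIVAL = {"inputs": {"buffer": 1}, "outputs": {"queue": 1}}
SERVICE = {"inputs": {"queue": 1}, "outputs": {"buffer": 1}}

marking = {"buffer": K, "queue": 0}
for _ in range(K):                    # K arrivals fill the queue
    marking = fire(marking, ARRIVAL)

print(marking)                        # the buffer place is now empty
print(enabled(marking, ARRIVAL))      # further arrivals are blocked
```

Once the buffer place is empty, arrival is no longer enabled, which is how the net enforces the finite capacity of the queue.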

5.2.1 Metrics Computation

System throughput is the rate of served transactions in a limited period. We
computed system throughput observing the effective number of transactions that enter
the system.

⁹ Note that, in this case, the performance is degradable due to software aging
accumulation issues, then the computed metrics are related to system performability [37].

The first step of the evaluation is to use the variable Penalty in the performance
model. In some of our previous experiments [52], we noticed that a generic Web
server under software aging effects has a service rate of about one request per
second. Based on our previous experimentation, we assume that when the Penalty
assumes its maximum value, the system service rate (i.e., firing rate of transition
service) decreases to one request per second. We changed the firing
rate of the transition service according to each proposed scenario. We then obtain
the effective rate of transactions accepted in the queue (RTAQ) using the
expression RTAQ = P{#buffer > 0} ⋅ λ, where λ is the firing
rate of the transition arrival. However, we still have to consider the system's
unavailability in the system's steady-state throughput. Thus, we used the
expression ST = RTAQ ⋅ A to compute the system throughput (ST), where A is the system
availability.
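For an M/M/1/k queue, P{#buffer > 0} is one minus the probability that the system is full, which has a well-known closed form. The sketch below uses that closed form instead of the SPN solver; the capacity (50) and availability (0.9988) in the usage lines are hypothetical, and the service rate passed in is assumed to be already adjusted by the Penalty.

```python
def mm1k_blocking_probability(arrival_rate, service_rate, capacity):
    """Steady-state probability that an M/M/1/k system is full."""
    rho = arrival_rate / service_rate
    if rho == 1.0:
        return 1.0 / (capacity + 1)
    if rho < 1.0:
        return (1.0 - rho) * rho**capacity / (1.0 - rho**(capacity + 1))
    # For rho > 1, use the equivalent form in r = 1/rho to avoid overflow.
    r = 1.0 / rho
    return (1.0 - r) / (1.0 - r**(capacity + 1))

def system_throughput(arrival_rate, service_rate, capacity, availability):
    """ST = RTAQ * A, with RTAQ = P{#buffer > 0} * lambda."""
    accepted = 1.0 - mm1k_blocking_probability(arrival_rate, service_rate, capacity)
    rtaq = accepted * arrival_rate
    return rtaq * availability

# Rates from Table 3; capacity and availability are illustrative assumptions.
st = system_throughput(arrival_rate=1000.0, service_rate=1500.0,
                       capacity=50, availability=0.9988)
print(f"ST = {st:.4f} requests per second")
```

With a service rate well above the arrival rate, blocking is negligible and the throughput is essentially the arrival rate scaled by the availability.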

6 Case Studies

We used the TimeNet tool for the availability model design and evaluation [66], and
Mercury tool [31] for the performance analysis. TimeNet has a friendly graphical
interface and provides instantaneous results, while Mercury has an easy-to-use script
language that facilitates sensitivity analysis using non-linear parameter variation.
We used the values in Table 3 as default values for our evaluations. We obtained
these values from the papers [53, 63]. Note that these values are only for reference
and should be adjusted whenever real-scenario values are available. However, these
are the most representative values that we could find to feed the models, as they
were published in reputed journals.
In the following case studies, we focused on finding the rejuvenation schedule to
maximize system availability and system throughput. First, we search for the best
rejuvenation schedule using a graphical sensitivity analysis, which explores the
models’ output varying the time interval for the VM migrations (Trigger firing
delay) from one to 720 hours (a month) using a one-hour step. Second, once we
find the availability-oriented rejuvenation schedule, we propose the last case study
to verify system reliability in the first month of the system running.
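The schedule search described above is an exhaustive sweep over the trigger values. A minimal sketch follows, in which `toy_availability` is a purely hypothetical stand-in for the SPN evaluation: its constants are arbitrary, chosen only so that the toy curve has an interior peak, and are not derived from the models.

```python
def best_trigger(evaluate, triggers):
    """Return the trigger value with the highest evaluated metric."""
    return max(triggers, key=evaluate)

def toy_availability(trigger_h):
    """Hypothetical stand-in for the model output (not the SPN itself)."""
    migration_cost = 0.0011 / trigger_h          # frequent migrations add downtime
    aging_cost = (0.0011 / 361.0) * trigger_h    # rare migrations allow aging failures
    return 1.0 - (migration_cost + aging_cost)

# One to 720 hours with a one-hour step, as in the sensitivity analysis.
best = best_trigger(toy_availability, range(1, 721))
print(f"best trigger in the toy example: {best} h")
```

In practice, `evaluate` would wrap a call to the model solver for each trigger value; the sweep itself stays the same.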
For all the scenarios, we assume that the service hosted in the virtualized envi-
ronment is an asset, which is a possible target for a bursty workload. Depending on
the asset, the burst may be more or less likely to occur. To represent the different
asset classes, we propose five different scenarios, as in Table 4.
The case studies in this section are: Availability (Sect. 6.1), System Throughput
(Sect. 6.2) and Reliability (Sect. 6.3).

6.1 Case Study #1—Availability

Our goal in this case study is to find the rejuvenation schedule that maximizes
the system availability. This case study aims to answer RQ1. There are two main
problems for availability in the scenarios covered in our study. The first arises
when applying frequent migrations: as each migration has an associated downtime,
frequent migrations degrade the steady-state availability. The second is resource
exhaustion failures caused by software aging and bursty workloads: less frequent
migrations may allow the system to reach resource exhaustion failures.

Table 3  Parameters used in the timed transitions

Availability model
Transition name   Description                                                   Mean time
Trigger           Interval to VM live migration                                 1 → 720 h
Cycle             Supposed cycle between burst occurrences                      24 h
BurstDuration     Bursty workload duration                                      60, 120, 240, 360 or 480 s ¹
AgingPhase        Time to aging (phases)                                        62.5 h ²
Repair            Time to system recovery after a resource exhaustion failure   1 h
MN_f, MN_f2       Main Node failure delay                                       1236.706 h
MN_r              Main Node repair delay                                        1.094 h
SN_f              Standby Node failure delay                                    1236.706 h
SN_r              Standby Node repair delay                                     1.094 h
VM_f              Virtual Machine failure delay                                 2880 h
VM_r              Virtual Machine repair delay                                  30 min
VM_rb             Virtual Machine reboot delay                                  5 min
PC                VM live migration pre-copy phase time                         72 s ³
LM_dwt            VM live migration downtime                                    4 s
Rej               Rejuvenation node delay                                       2 min

Performance model
Transition name   Description           Rate
arrival           Transaction arrival   1000 requests per second
service           Service rate          1500 requests per second ⁴

¹ Depending on the scenario
² We adjust the time to aging observing the burst occurrence
³ We adjust the delay according to the number of tokens in the ResourcesDepletion place
⁴ We adjust the service rate according to the Penalty variable

Table 4  Asset classes definitions

Asset class #   Burst probability (%)   Burst intensity   Burst duration
0               0.01                    2000              60 s
1               0.1                     4000              120 s
2               1                       6000              240 s
3               5                       8000              360 s
4               10                      10000             480 s

Fig. 8  Availability of each scenario (panels (a) to (e))

Table 5  Results: availability

Asset class #   Baseline downtime (h/year)   Rej. trigger (h)   Downtime (h/year)   Downtime reduction (h/year)
0               41.14                        19                 10.56               30.58
1               41.25                        19                 10.65               30.60
2               43.06                        18                 12.78               30.28
3               53.04                        16                 22.33               27.71
4               68.39                        15                 44.51               23.88
In some situations, VM migration during bursty workloads may accelerate the
exhaustion of the resources due to VM migration overhead. As mentioned earlier,
we capture the influence of bursty workloads in the VM migration process using
transitions with marking-dependent firing delays.
Figure 8 presents the availability results for each proposed asset class. The black
line represents the system availability when applying rejuvenation, and the gray line
represents the system availability without rejuvenation (Baseline). We notice that
the system availability has a peak in all the scenarios, which corresponds to the
specific rejuvenation trigger that maximizes system availability. After the peak, the
system availability decreases, tending to the baseline scenario. This is an expected
result, as scarce VM migrations (i.e., a high rejuvenation trigger) lead the system
to the baseline (i.e., system without rejuvenation). Table 5 presents these specific
rejuvenation policies and their results regarding the annual downtime. We also notice
that more severe bursty workloads produce worse availability results. The results
from Asset Class #0 and #1 are nearly the same. Therefore, with a lighter bursty
workload, we can apply the same rejuvenation policy to maximize system availability.

Fig. 9  Downtime reduction (h/year)

Fig. 10  Sensitivity analysis of the VM migration downtime parameter
Finally, to provide a more comprehensive overview of the impacts of bursty
workloads on the system availability, Fig. 9 presents a comparison of the downtime
reduction for all the scenarios. We noticed that, in scenarios with more severe bursts,
the downtime reduction is lower. In such scenarios, it is important to set up
mechanisms to improve resiliency against bursty workloads.


Table 6  Summary of VM migration downtime parameter sensitivity analysis

Asset class #   Unavailability with 0.5 s of downtime   Unavailability with 60 s of downtime   Relative difference (%)
                per migration (h/year)                  per migration (h/year)
0               10.12                                   17.72                                  75.14
1               10.21                                   17.81                                  74.47
2               12.31                                   20.33                                  65.17
3               24.80                                   33.80                                  36.32
4               43.94                                   53.53                                  21.80

Fig. 11  System throughput of each scenario (panels (a) to (e))

6.1.1 Sensitivity Analysis of VM Migration Downtime

In the previous case study, we fixed the VM migration downtime. However, in
realistic scenarios, VM migration downtime may vary due to various reasons
(e.g., amount of memory pages to be transferred, dirty pages rate, network
bandwidth, VM migration technique). Therefore, it is important to study the effect
of VM migration downtime variation on the system steady-state availability.
In this case study, we conducted a sensitivity analysis of the VM migration downtime
parameter. We varied the VM migration downtime parameter from 0.5 seconds to one
minute with a 0.5-second step. The other parameters remain as presented in Table 3.
We fixed the rejuvenation trigger (RT) in the analysis using the results presented
in Table 5. Figure 10 presents the results.
We notice a similar linear increase in the system downtime in all observed
scenarios. The difference between the curves of Asset 0 and Asset 1 (scenarios with
lower risk of bursty workload occurrence) is negligible. In these curves, the
downtime with 60 seconds of downtime per VM migration is about 75% greater
than the downtime with 0.5 seconds. In the highest risk scenario (Asset 4), the same
difference is 22%, meaning that the relative impact of longer migration downtime is
lower than in the lower risk scenarios. Table 6 summarizes the sensitivity
analysis results.
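The linear growth can be cross-checked with a back-of-the-envelope estimate. The sketch below is an approximation, not the SPN model: it assumes that one migration completes every trigger interval and ignores aborted migrations and failures. The function name is hypothetical.

```python
HOURS_PER_YEAR = 8760.0

def migration_downtime_hours_per_year(trigger_h, downtime_s):
    """Approximate yearly downtime caused only by the migration downtime phase."""
    migrations_per_year = HOURS_PER_YEAR / trigger_h
    return migrations_per_year * downtime_s / 3600.0

# Asset class #0 uses a 19 h trigger (Table 5).
extra = (migration_downtime_hours_per_year(19, 60.0)
         - migration_downtime_hours_per_year(19, 0.5))
print(f"extra downtime from 0.5 s to 60 s: {extra:.2f} h/year")
```

Under these assumptions the estimate is roughly 7.6 h/year, close to the difference between the two unavailability columns for Asset 0 in Table 6 (17.72 − 10.12 = 7.60 h/year).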
Presumably, the higher the risk of bursty workload occurrence, the worse
the unavailability results. However, in such scenarios, where the system
accumulated downtime is already high, additional downtime may be tolerated.
In these cases, it is possible to design alternative VM migration strategies
to reach specific goals besides software rejuvenation, for example, using cold
VM migration to assure database migration and consolidation, or cryptography
algorithms to secure VM migration.

Table 7  Results: system throughput (req/s)

Asset class #   Baseline sys. throughput   Rej. trigger (h)   Sys. throughput   Improvement (%)
0               793.5799                   18                 998.7940          25.86
1               794.0027                   19                 998.7837          25.79
2               802.5243                   17                 998.5409          24.43
3               846.2617                   16                 997.1087          17.83
4               901.4916                   15                 994.9192          10.36

Fig. 12  System throughput difference comparison

6.2 Case Study #2—System Throughput

The goal in this case study is to find the rejuvenation policy that maximizes the
system throughput. This case study aims to answer RQ2. Figure 11 shows the results
for all the considered scenarios. We noticed that in scenarios with shorter
rejuvenation triggers, the system throughput stays at higher levels. After a certain
point, we noticed a drop in the system throughput rate. We can draw the following
conclusions from these results: (i) Systems with shorter migration intervals tend to
persist in a lower Penalty. Therefore, the service rate persists at higher levels,
compensating for the lower availability levels due to frequent migrations. (ii) The
baseline throughput is higher in systems with more severe bursty workloads. In the
model analysis, we noticed that in such cases, after a burst, the system fails
quickly. Thus, the system steady-state Penalty is lower than in scenarios with a
lighter bursty workload. Therefore, the steady-state system throughput tends to be
higher than in the other scenarios.

Fig. 13  Reliability of each scenario (panels (a) to (e))

Table 8  Reliability results

Asset class #   Reliability without     Reliability with        Reliability with rejuvenation                        R²
                rejuvenation (720 h)    rejuvenation (720 h)    (linear regression)
0               0.002 ± 0.001           0.305 ± 0.033           R(t) = ((−6.286e−4) ⋅ t + 1.003)²                    0.9992
1               0.002 ± 0.001           0.293 ± 0.032           R(t) = (4.350e−7) ⋅ t² + (−1.319e−3) ⋅ t + 1.008     0.9991
2               0.001 ± 0.001           0.235 ± 0.026           R(t) = (6.875e−7) ⋅ t² + (−1.558e−3) ⋅ t + 1.004     0.9996
3               0.001 ± 0.001           0.095 ± 0.013           R(t) = exp(−3.270e−3 ⋅ t + 2.007e−2)                 0.9996
4               0                       0.021 ± 0.005           R(t) = exp(−5.393e−3 ⋅ t + 5.590e−2)                 0.9995
Table 7 presents the best rejuvenation schedule for the proposed scenarios. The
adopted policies for system throughput maximization are close to the results for
system availability maximization. The last column presents the percentage improvement
when comparing the baseline results and the results with rejuvenation. As in the
previous case study, we noticed that the improvement is lower in scenarios with
heavier bursty workloads than in the others.
Finally, for the sake of comparison, we plot the throughput improvement of each
scenario in Fig. 12. As in the availability results, the results from Asset Class #0 and
#1 are nearly the same.

6.3 Case Study #3—Reliability

System reliability is related to the service continuity [3], or the period that the
system passes free from failures. In this case study, we investigated the system
reliability when applying the availability-oriented rejuvenation policies (Table 5).
Our goal is to answer RQ3. To conduct the reliability evaluation, we used the
availability model without the repair transitions. Thus, we compute the reliability
using the following expression: Reliability = P{#UP > 0}. However, unlike
steady-state availability, system reliability is a transient metric. Thus, our goal
is to calculate the probability of the system staying failure-free in its first
month of running.

Table 9  Depletion point results

Asset class #   Depletion point (h)
0               480
1               400
2               400
3               320
4               240

The obtained results are in Fig. 13. The black dots represent the reliability
results for the system with rejuvenation, and the gray dots represent the reliability
results for the system without rejuvenation. The dashed lines represent the 95%
confidence interval. We also performed linear regression on the reliability results
to extract functions representing the reliability curve (R(t), where t is the time)
when applying software rejuvenation policies. Table 8 presents a summary of the
reliability results. The table also presents the coefficient R², which determines,
in a range from 0 to 1, the fraction of the total variation explained by the
obtained regression model.
We noticed a steeper reliability reduction in scenarios with heavier bursty
workloads. The rapid reliability decrease in Asset Class #4 shows that the
probability of a failure-free system is almost null at the 720th hour of continuous
run. Therefore, after this point, the rejuvenation mechanism produces no improvement
in the system reliability when compared to the baseline scenario. However, the
reliability results for the first two scenarios (Asset Classes #0 and #1) are nearly
the same. In the Asset Class #0 scenario, the probability of a failure-free system
in its first month running is about 30%, while in the Asset Class #1 scenario, the
probability is about 29%.
Using the quadratic, polynomial, and logarithmic models for linear regression, we
can achieve R² values above 0.999, meaning that the proposed functions can
represent the reliability curve with substantial fidelity. Therefore, we can use
these functions to approximate the reliability results for the desired scenarios.
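As a usage sketch, the regression functions of Table 8 can be evaluated directly. The coefficients and the 720 h values with their confidence half-widths are copied from Table 8; the reading of the first function as a squared linear term and of the second as a quadratic in t is our interpretation of the table, and the dictionary names are hypothetical.

```python
import math

# Regression functions R(t) per asset class, with t in hours (Table 8).
REGRESSIONS = {
    0: lambda t: (-6.286e-4 * t + 1.003) ** 2,
    1: lambda t: 4.350e-7 * t**2 - 1.319e-3 * t + 1.008,
    2: lambda t: 6.875e-7 * t**2 - 1.558e-3 * t + 1.004,
    3: lambda t: math.exp(-3.270e-3 * t + 2.007e-2),
    4: lambda t: math.exp(-5.393e-3 * t + 5.590e-2),
}

# Reliability with rejuvenation at 720 h: (value, confidence half-width).
REPORTED_720H = {
    0: (0.305, 0.033),
    1: (0.293, 0.032),
    2: (0.235, 0.026),
    3: (0.095, 0.013),
    4: (0.021, 0.005),
}

for asset, regression in REGRESSIONS.items():
    fitted = regression(720.0)
    reported, half_width = REPORTED_720H[asset]
    inside = abs(fitted - reported) <= half_width
    print(f"asset {asset}: fitted = {fitted:.4f}, "
          f"reported = {reported} ± {half_width}, inside interval: {inside}")
```

Evaluating the same functions at other instants within the first 720 hours gives approximate reliability values without rerunning the transient analysis.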
Additionally, we calculate the depletion point (i.e., point in time where the
resources are depleted) of the baseline scenarios. These results highlight how long
the system survives without rejuvenation. The depletion point results are in Table 9.

7 Related Works

There are several works about performability and dependability modeling in the
cloud. However, to the best of our knowledge, this is the first research attempt to
quantify system performability and dependability in a system with VM migration as
rejuvenation under bursty workloads. In the following, we highlight relevant related
research and compare the current state of the art with our work.

Table 10  Related work comparison

Papers             Dependability metrics                               Performance metrics
[23, 35, 36, 57]   Availability                                        None
[24]               Availability                                        Number of transactions lost
[10, 63]           Availability                                        System throughput and blocking probability
[25, 26]           None                                                Job completion time
[28, 30]           Availability                                        Job completion time
[9, 65]            Interval reliability and pointwise availability     None
[41]               Availability and reliability                        None
[53]               Availability                                        None
This paper         Availability and reliability                        System throughput

First, it is worth mentioning the works from Melo et al. [35, 36] and Machida
et al. [23] which provided the basis for our availability modeling framework.
Another paper from Machida et al. [24] provided the insights for our performance
evaluation approach. These papers are focused on the availability metric, aiming to
find the proper rejuvenation schedules to maximize system availability. The paper
[24] also comprises the number of transactions lost (performance metric). However,
in this paper, the authors neglect the performance penalty due to software aging.
In our work, besides comprising availability, we also covered reliability and system
throughput metrics. Our models also capture the influence of software aging in sys-
tem responsiveness.
Wang et al. [63] provided a relevant background for our performability modeling.
We used their paper to recover some rates for our models' evaluation. In
their paper, they cover the aspect of workload variation in a system with
rejuvenation. Unlike them, we cover VM migration and bursty workloads. Moreover, we
cover an extended set of metrics, such as availability and reliability. The work of
Escheikh et al. [10] also considers bursty workload occurrence in a system with VM
migration. Unlike their work, we consider a set of different asset classes and also
provide reliability results. Moreover, we adopted a different approach to represent
burst occurrence, using a specific submodel (i.e., the burst cycle model).
Our previous papers [54, 57] are the first step of this research. We highlight the
following improvements: (1) Use of marking-dependent firing delays to improve
the model's accuracy. In the previous works, we used places with multiple
tokens and guard functions to represent resource consumption dynamics; the
problem of using guard functions is that the firing rate is only adjusted for firing
the next resource consumption phase. (2) Inclusion of performability metrics. (3)
Linear regression for the reliability curves.
In the papers [25–28, 30], Machida et al. provided an extensive modeling
framework considering the aspects of performance penalty due to software aging
accumulation. For example, in the papers [25–27], the authors derive optimal
policies for software rejuvenation submission in the environment when observing
system performance degradation. In the papers [28, 30], the authors expand their
contribution, also considering availability evaluation. Unlike them, we also
comprise system reliability and bursty workload aspects.
Zheng et al. [65] and Dohi et al. [9] studied transient metrics in a system with
rejuvenation. The main metrics in these studies are pointwise availability and
interval reliability. The former focuses on the use of phase expansion for transient
reliability analysis. The latter focuses on finding the optimal software
rejuvenation timing to maximize interval reliability. Unlike these papers, we
comprise the behavior of VM migration as rejuvenation in our models.
Nguyen et al. [41] presented the availability and reliability evaluation of a cloud
computing environment. The authors applied a hierarchical approach to the model's
design. The analysis also comprises metrics such as power consumption and cost.
However, the authors neglected the bursty workload occurrence.
Finally, we highlight the paper from Torquato et al. [53], which comprises VM
migration as rejuvenation and the security aspect. The authors proposed a metric
named RiskScore for the security evaluation of a virtualized system considering
Denial of Service and Man-in-the-middle attacks. Our approach is different because
we proposed the burst cycle as a basis for the considered bursty workload. Besides
that, unlike that paper, we also comprise the system performance in our
research, considering the possible impacts of software aging accumulation.
Table 10 summarizes the scope of the related works above in contrast with the
current paper.

8 Conclusions and Future Works

This paper presented a comprehensive SRN modeling framework for the performa-
bility and dependability evaluation of a virtualized system subject to software aging
and bursty workload. The considered system applies VM migration scheduling as
support for software rejuvenation. Our results include the metrics: steady-state avail-
ability, steady-state system throughput, and system reliability.
Regarding our main research question (RQmain: What are the performability and
dependability levels of a virtualized system with VM migration subject to software
aging and bursty workload?), we noted that these levels vary depending on
the studied scenario. There is a specific rejuvenation schedule that maximizes
system availability and throughput. In scenarios with lighter bursty workload
conditions (burst probability of 0.01% and 0.1%), the performability results are
nearly the same. However, in the scenarios with heavier bursty workload conditions
(burst probability of 1%, 5%, and 10%), the system performability degradation due to
the bursty workload is substantial. In such scenarios, the rejuvenation policies
tend to produce lower system performability improvement.
We covered the main aspects of software aging and rejuvenation and the
uncertainties related to the bursty workload occurrence. Besides that, we also
considered important details such as the influence of burst occurrence on resource
exhaustion and the influence of resource consumption levels on the VM migration
process.


Fig. 14  Petri net components

Fig. 15  Flow of a SRN simple availability model

VM migration is a standard maintenance tool for system managers. Our research
may help them to improve the understanding of the performability, availability, and
reliability impacts of applying VM migration while considering significant concerns
such as software aging and bursty workloads. The models and results presented here
can be extended to similar scenarios. Moreover, they may be helpful in setting up
virtualized environment management policies and Service Level Agreements.
Our main future work direction is to investigate the system's power consumption
aspect under software aging and bursty workloads. We can then apply an approach
similar to the one presented in our previous paper [56], in which we approximated
the power consumption using the system availability levels.

Appendix: Stochastic Reward Nets

Stochastic Reward Nets (SRN) are a sub-type of Petri Nets (PN). A PN is a


5-tuple, PN = (P, T, F, W, M0 ) where: P = {p1 , p2 , … , pn } is a finite set of places,
T = {t1 , t2 , … , tn } is a finite set of transitions, F ⊆ (P × T) ∪ (T × P) is a set of
arcs, W ∶ F → {0, 1, 2, 3, …} is a weight function, and M0 ∶ P → {0, 1, 2, 3, …} is
the initial marking [39].
The graphical representation of a PN has four main components, as presented in
Fig. 14. The places hold the tokens, the arcs indicate the relation between places
and transitions, and the PN state is altered upon a transition firing, which moves
tokens from one place to another. In SRNs, it is possible to assign time delays
to the transitions.
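The firing rule described above can be illustrated with a minimal sketch. It is not taken from the modeling tool used in this paper; the class name `PetriNet`, the place names `p1` and `p2`, and the transition name `t1` are hypothetical, and time delays are deliberately left out so that only the untimed token movement is shown.

```python
from dataclasses import dataclass, field


@dataclass
class PetriNet:
    """Untimed place/transition net (illustrative sketch only)."""
    marking: dict                                  # place -> token count (starts as M0)
    inputs: dict = field(default_factory=dict)     # transition -> {input place: arc weight}
    outputs: dict = field(default_factory=dict)    # transition -> {output place: arc weight}

    def enabled(self, transition):
        # A transition is enabled when each input place holds at least
        # as many tokens as the weight of the arc into the transition.
        arcs = self.inputs.get(transition, {})
        return all(self.marking.get(p, 0) >= w for p, w in arcs.items())

    def fire(self, transition):
        # Firing consumes tokens from input places and produces tokens
        # in output places, which changes the net state (the marking).
        if not self.enabled(transition):
            raise ValueError(f"transition {transition!r} is not enabled")
        for p, w in self.inputs.get(transition, {}).items():
            self.marking[p] -= w
        for p, w in self.outputs.get(transition, {}).items():
            self.marking[p] = self.marking.get(p, 0) + w


# Hypothetical two-place net: one token in p1, and t1 moves it to p2.
net = PetriNet(
    marking={"p1": 1, "p2": 0},
    inputs={"t1": {"p1": 1}},
    outputs={"t1": {"p2": 1}},
)
was_enabled = net.enabled("t1")
net.fire("t1")
```

After the firing, the token sits in `p2` and `t1` is no longer enabled, since its only input place is empty.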
Let us consider the flow of a simple SRN availability model in Fig. 15. In
the initial state, the system is running, represented by the token in the UP place.
The transition MTTF represents the system mean time to failure (MTTF). MTTF firing
represents a system failure occurrence. The same transition moves the token from
the UP place to the DW place. The system repair is represented by the MTTR transition
(mean time to repair (MTTR)). MTTR transition firing brings the model back to
its initial state.
We can compute the system availability using the following reward measure:
Availability = P{UP > 0}, which captures the probability that the UP place holds
at least one token.
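As a rough numerical illustration of this reward measure, the sketch below assumes that both transitions have exponentially distributed delays, in which case the model reduces to a two-state Markov chain and the steady-state probability of a token in UP is MTTF / (MTTF + MTTR). The function name and the input values (1000 h and 2 h) are hypothetical and are not parameters from this paper.

```python
def steady_state_availability(mttf_hours, mttr_hours):
    """Steady-state P{UP > 0} for the two-place UP/DW model.

    Assumes exponentially distributed failure and repair delays.
    """
    failure_rate = 1.0 / mttf_hours   # rate of the MTTF transition
    repair_rate = 1.0 / mttr_hours    # rate of the MTTR transition
    # Balance equation: pi_up * failure_rate = pi_dw * repair_rate,
    # together with pi_up + pi_dw = 1.
    return repair_rate / (failure_rate + repair_rate)


# Hypothetical inputs, chosen only for illustration.
availability = steady_state_availability(mttf_hours=1000.0, mttr_hours=2.0)
annual_downtime_hours = (1.0 - availability) * 8760.0
```

With these assumed values, the availability equals 1000/1002 and the expected downtime is between 17 and 18 hours per year. Larger SRN models generally need a numerical solver or simulation rather than a closed-form expression.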

Acknowledgements This work has been partially supported by the Portuguese Foundation for Science and
Technology (FCT), through the PhD Grant SFRH/BD/146181/2019, within the scope of the project
CISUC - UID/CEC/00326/2020. This work is also funded by the European Social Fund, through the
Regional Operational Program Centro 2020. This work also received support from the AIDA (Adaptive,
Intelligent and Distributed Assurance Platform) project, funded by the Operational Program for
Competitiveness and Internationalization (COMPETE 2020) and FCT (under the CMU Portugal Program) through
Grant POCI-01-0247-FEDER-045907, and from the TalkConnect project, funded by COMPETE 2020
through Grant POCI-01-0247-FEDER-039676.

Declarations

Conflict of interest The authors declare that they have no conflict of interest.

References
1. Akoush, S., Sohan, R., Rice, A., Moore, A.W., Hopper, A.: Predicting the performance of virtual
machine migration. In: 2010 IEEE International Symposium on Modeling, Analysis and Simulation
of Computer and Telecommunication Systems, pp. 37–46. IEEE (2010)
2. Araujo, J., Matos, R., Maciel, P., Matias, R., Beicker, I.: Experimental evaluation of software aging
effects on the eucalyptus cloud computing infrastructure. In: Proceedings of the Middleware 2011
Industry Track Workshop, p. 4. ACM (2011)
3. Avizienis, A., Laprie, J.C., Randell, B., Landwehr, C.: Basic concepts and taxonomy of dependable
and secure computing. IEEE Trans. Dependable Secure Comput. 1(1), 11–33 (2004)
4. Avritzer, A., Weyuker, E.J.: Monitoring smoothly degrading systems for increased dependability.
Empir. Softw. Eng. 2(1), 59–77 (1997)
5. Bause, F.: Queueing petri nets-a formalism for the combined qualitative and quantitative analysis of
systems. In: Proceedings of 5th International Workshop on Petri Nets and Performance Models, pp.
14–23. IEEE (1993)
6. Bobbio, A.: System modelling with petri nets. In: Systems Reliability Assessment, pp. 103–143.
Springer (1990)
7. Clark, C., Fraser, K., Hand, S., Hansen, J.G., Jul, E., Limpach, C., Pratt, I., Warfield, A.: Live migra-
tion of virtual machines. In: Proceedings of the 2nd Conference on Symposium on Networked Sys-
tems Design & Implementation-Volume 2, pp. 273–286. USENIX Association (2005)
8. Cotroneo, D., Natella, R., Pietrantuono, R., Russo, S.: A survey of software aging and rejuvenation
studies. ACM J. Emerg. Technol. Comput. Syst. 10(1), 8 (2014)
9. Dohi, T., Zheng, J., Okamura, H., Trivedi, K.S.: Optimal periodic software rejuvenation policies
based on interval reliability criteria. Reliab. Eng. Syst. Saf. 180, 463–475 (2018)
10. Escheikh, M., Tayachi, Z., Barkaoui, K.: Performability evaluation of server virtualized systems
under bursty workload. IFAC-PapersOnLine 51(7), 45–50 (2018)
11. Feuerlicht, G., Burkon, L., Sebesta, M.: Cloud computing adoption: what are the issues. Syst. Integr.
18(2), 187–192 (2011)


12. Garg, S., Van Moorsel, A., Vaidyanathan, K., Trivedi, K.S.: A methodology for detection and esti-
mation of software aging. In: Proceedings Ninth International Symposium on Software Reliability
Engineering (Cat. No. 98TB100257), pp. 283–292. IEEE (1998)
13. Grottke, M., Matias, R., Trivedi, K.S.: The fundamentals of software aging. In: 2008 IEEE Interna-
tional Conference on Software Reliability Engineering Workshops (ISSRE Wksp), pp. 1–6. IEEE
(2008)
14. Gupta, A.K., Zeng, W.B., Wu, Y.: Probability and Statistical Models: Foundations for Problems in
Reliability and Financial Mathematics. Springer, New York (2010)
15. Huang, Y., Kintala, C., Kolettis, N., Fulton, N.D.: Software rejuvenation: analysis, module and
applications. In: Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of
Papers, pp. 381–390. IEEE (1995)
16. Jain, R.: The Art of Computer Systems Performance Analysis: Techniques for Experimental Design,
Measurement, Simulation, and Modeling. Wiley, New York (1990)
17. Kleinrock, L.: Queueing Systems, vol. i: Theory (1975)
18. Kounev, S.: Performance modeling and evaluation of distributed component-based systems using
queueing petri nets. IEEE Trans. Softw. Eng. 32(7), 486–502 (2006)
19. Kuchárik, M., Balogh, Z.: Modeling of uncertainty with petri nets. In: Asian Conference on Intel-
ligent Information and Database Systems, pp. 499–509. Springer (2019)
20. Liu, H., Xu, C.Z., Jin, H., Gong, J., Liao, X.: Performance and energy modeling for live migration of
virtual machines. In: Proceedings of the 20th International Symposium on High Performance Distrib-
uted Computing, pp. 171–182. ACM (2011)
21. Low, C., Chen, Y., Wu, M.: Understanding the determinants of cloud computing adoption. Ind. Manag.
Data Syst. 111(7), 1006–1023 (2011)
22. Macêdo, A., Ferreira, T.B., Matias, R.: The mechanics of memory-related software aging. In: 2010
IEEE Second International Workshop on Software Aging and Rejuvenation, pp. 1–5. IEEE (2010)
23. Machida, F., Kim, D.S., Trivedi, K.S.: Modeling and analysis of software rejuvenation in a server virtu-
alized system. In: 2010 IEEE Second International Workshop on Software Aging and Rejuvenation, pp.
1–6. IEEE (2010)
24. Machida, F., Kim, D.S., Trivedi, K.S.: Modeling and analysis of software rejuvenation in a server virtu-
alized system with live vm migration. Perform. Eval. 70(3), 212–230 (2013)
25. Machida, F., Miyoshi, N.: An optimal stopping problem for software rejuvenation in a job processing
system. In: 2015 IEEE International Symposium on Software Reliability Engineering Workshops (ISS-
REW), pp. 139–143. IEEE (2015)
26. Machida, F., Miyoshi, N.: Analysis of an optimal stopping problem for software rejuvenation in a dete-
riorating job processing system. Reliab. Eng. Syst. Saf. 168, 128–135 (2017)
27. Machida, F., Nicola, V.F., Trivedi, K.S.: Job completion time on a virtualized server subject to software
aging and rejuvenation. In: 2011 IEEE Third International Workshop on Software Aging and Rejuvena-
tion, pp. 44–49. IEEE (2011)
28. Machida, F., Nicola, V.F., Trivedi, K.S.: Job completion time on a virtualized server with software reju-
venation. ACM J. Emerg. Technol. Comput. Syst. 10(1), 10 (2014)
29. Machida, F., Xiang, J., Tadano, K., Maeno, Y.: Aging-related bugs in cloud computing software. In:
2012 IEEE 23rd International Symposium on Software Reliability Engineering Workshops, pp. 287–
292. IEEE (2012)
30. Machida, F., Xiang, J., Tadano, K., Maeno, Y.: Lifetime extension of software execution subject to
aging. IEEE Trans. Reliab. 66(1), 123–134 (2016)
31. Maciel, P., Matos, R., Silva, B., Figueiredo, J., Oliveira, D., Fé, I., Maciel, R., Dantas, J.: Mercury:
performance and dependability evaluation of systems with exponential, expolynomial, and general
distributions. In: 2017 IEEE 22nd Pacific Rim International Symposium on Dependable Computing
(PRDC), pp. 50–57. IEEE (2017)
32. Matos, R., Araujo, J., Alves, V., Maciel, P.: Characterization of software aging effects in elastic storage
mechanisms for private clouds. In: 2012 IEEE 23rd International Symposium on Software Reliability
Engineering Workshops, pp. 293–298. IEEE (2012)
33. Maziku, H., Shetty, S.: Towards a network aware vm migration: Evaluating the cost of vm migration in
cloud data centers. In: 2014 IEEE 3rd International Conference on Cloud Networking (CloudNet), pp.
114–119. IEEE (2014)
34. Mell, P., Grance, T., et al.: The nist definition of cloud computing (2011)


35. Melo, M., Araujo, J., Matos, R., Menezes, J., Maciel, P.: Comparative analysis of migration-based reju-
venation schedules on cloud availability. In: 2013 IEEE International Conference on Systems, Man, and
Cybernetics, pp. 4110–4115. IEEE (2013)
36. Melo, M., Maciel, P., Araujo, J., Matos, R., Araujo, C.: Availability study on cloud computing envi-
ronments: live migration as a rejuvenation mechanism. In: 2013 43rd Annual IEEE/IFIP International
Conference on Dependable Systems and Networks (DSN), pp. 1–6. IEEE (2013)
37. Meyer, J.F.: Performability: a retrospective and some pointers to the future. Perform. Eval. 14(3–4),
139–156 (1992)
38. Mijumbi, R., Serrat, J., Gorricho, J.L., Bouten, N., De Turck, F., Boutaba, R.: Network function virtual-
ization: state-of-the-art and research challenges. IEEE Commun. Surv. Tutor. 18(1), 236–262 (2015)
39. Murata, T.: Petri nets: properties, analysis and applications. Proc. IEEE 77(4), 541–580 (1989).
https://doi.org/10.1109/5.24143
40. Myint, M.T.H., Thein, T.: Availability improvement in virtualized multiple servers with software reju-
venation and virtualization. In: 2010 Fourth International Conference on Secure Software Integration
and Reliability Improvement, pp. 156–162. IEEE (2010)
41. Nguyen, T.A., Min, D., Choi, E., Tran, T.D.: Reliability and availability evaluation for cloud data center
networks using hierarchical models. IEEE Access 7, 9273–9313 (2019)
42. Oliveira, T., Thomas, M., Espadanal, M.: Assessing the determinants of cloud computing adoption: an
analysis of the manufacturing and services sectors. Inf. Manag. 51(5), 497–510 (2014)
43. Patterson, D.A., et al.: A simple way to estimate the cost of downtime. LISA 2, 185–188 (2002)
44. Pietrantuono, R., Russo, S.: A survey on software aging and rejuvenation in the cloud. Softw. Qual. J. 1–32
(2019)
45. Salfner, F., Tröger, P., Polze, A.: Downtime analysis of virtual machine live migration. In: The Fourth
International Conference on Dependability (DEPEND 2011). IARIA, pp. 100–105 (2011)
46. Schroeder, B., Gibson, G.A.: Disk failures in the real world: What does an MTTF of 1,000,000 hours
mean to you? FAST 7, 1–16 (2007)
47. Siddiqui, S., Darbari, M., Yagyasen, D., et al.: Modelling and simulation of queuing models through the
concept of petri nets (2020)
48. Soltesz, S., Pötzl, H., Fiuczynski, M.E., Bavier, A., Peterson, L.: ACM: Container-based operating sys-
tem virtualization: a scalable, high-performance alternative to hypervisors. ACM SIGOPS Oper. Syst.
Rev. 41, 275–287 (2007)
49. Strunk, A.: Costs of virtual machine live migration: a survey. In: 2012 IEEE Eighth World Congress on
Services, pp. 323–329. IEEE (2012)
50. Thein, T., Park, J.S.: Availability analysis of application servers using software rejuvenation and virtual-
ization. J. Comput. Sci. Technol. 24(2), 339–346 (2009)
51. Torquato, M., Araujo, J., Umesh, I., Maciel, P.: Sware: a methodology for software aging and rejuvena-
tion experiments. J. Inf. Syst. Eng. Manag. 3(2), 15 (2018)
52. Torquato, M., Maciel, P., Araujo, J., Umesh, I.: An approach to investigate aging symptoms and rejuve-
nation effectiveness on software systems. In: 2017 12th Iberian Conference on Information Systems and
Technologies (CISTI), pp. 1–6. IEEE (2017)
53. Torquato, M., Maciel, P., Vieira, M.: A model for availability and security risk evaluation for systems
with vmm rejuvenation enabled by vm migration scheduling. IEEE Access 7, 138315–138326 (2019)
54. Torquato, M., Maciel, P., Vieira, M.: Availability and reliability modeling of vm migration as rejuvena-
tion on a system under varying workload. Softw. Qual. J. 1–25 (2020)
55. Torquato, M., Torquato, L., Maciel, P., Vieira, M.: Iaas cloud availability planning using models and
genetic algorithms. In: 2019 9th Latin-American Symposium on Dependable Computing (LADC), pp.
1–10. IEEE (2019)
56. Torquato, M., Umesh, I., Maciel, P.: Models for availability and power consumption evaluation of a
private cloud with vmm rejuvenation enabled by vm live migration. J. Supercomput. 74(9), 4817–4841
(2018)
57. Torquato, M., Vieira, M.: Interacting srn models for availability evaluation of vm migration as rejuvena-
tion on a system under varying workload. In: 2018 IEEE International Symposium on Software Reli-
ability Engineering Workshops (ISSREW), pp. 300–307. IEEE (2018)
58. Torquato, M., Vieira, M.: An experimental study of software aging and rejuvenation in dockerd. In:
2019 15th European Dependable Computing Conference (EDCC), pp. 1–6. IEEE (2019)
59. Trivedi, K.S., Vaidyanathan, K., Goseva-Popstojanova, K.: Modeling and analysis of software aging
and rejuvenation. In: Proceedings 33rd Annual Simulation Symposium (SS 2000), pp. 270–279. IEEE
(2000)


60. Vaidyanathan, K., Trivedi, K.S.: A comprehensive model for software rejuvenation. IEEE Trans.
Dependable Secure Comput. 2(2), 124–137 (2005)
61. Valmari, A.: The state explosion problem. In: Advanced Course on Petri Nets, pp. 429–528. Springer
(1996)
62. Voorsluys, W., Broberg, J., Venugopal, S., Buyya, R.: Cost of virtual machine live migration in clouds:
a performance evaluation. In: IEEE International Conference on Cloud Computing, pp. 254–265.
Springer (2009)
63. Wang, D., Xie, W., Trivedi, K.S.: Performability analysis of clustered systems with rejuvenation under
varying workload. Perform. Eval. 64(3), 247–265 (2007)
64. Yeboah-Boateng, E.O., Essandoh, K.A.: Factors influencing the adoption of cloud computing by small
and medium enterprises in developing economies. Int. J. Emerg. Sci. Eng. 2(4), 13–20 (2014)
65. Zheng, J., Okamura, H., Dohi, T.: A transient interval reliability analysis for software rejuvenation mod-
els with phase expansion. Softw. Qual. J. 1–22 (2019)
66. Zimmermann, A.: Modelling and performance evaluation with timenet 4.4. In: International Confer-
ence on Quantitative Evaluation of Systems, pp. 300–303. Springer (2017)

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published
maps and institutional affiliations.

Matheus Torquato is a Ph.D. candidate at the University of Coimbra. His research interests comprise
subjects like Cloud Computing, Performance, Dependability, and Security Modeling. His current research
focuses on the design and development of analytical models to evaluate performance, dependability, and
security of moving target defense deployments in cloud computing. He received his Master’s Degree in
Computer Science from the Federal University of Pernambuco. He is currently on leave from his
teaching activities at the Federal Institute of Alagoas, Campus Arapiraca, to pursue his Ph.D. at the
University of Coimbra. His website is: https://www.matheustorquato.com/

Paulo Maciel received the degree in electronic engineering in 1987 and the M.Sc. and Ph.D. degrees in
electronic engineering and computer science from the Federal University of Pernambuco, Recife, Brazil,
respectively. He was a faculty member with the Department of Electrical Engineering, Pernambuco Uni-
versity, Recife, Brazil, from 1989 to 2003. Since 2001, he has been a member of the Informatics Center,
Federal University of Pernambuco, where he is currently a Full Professor. In 2011, during his sabbatical
from the Federal University of Pernambuco, he stayed with the Department of Electrical and Computer
Engineering, Edmund T. Pratt School of Engineering, Duke University, Durham, NC, USA, as a Visiting
Professor. His current research interests include performance and dependability evaluation, Petri nets and
formal models, encompassing manufacturing, embedded, computational, and communication systems as
well as power consumption analysis. Dr. Maciel is a Research Member of the Brazilian Research Council.

Marco Vieira received the Ph.D. degree from the University of Coimbra, Coimbra, Portugal, in 2005. He is
currently a Full Professor with the University of Coimbra. He has participated in and coordinated several
research projects, both at the national and European level. His research interests include dependability
and security assessment and benchmarking, fault injection, software processes, and software quality
assurance, subjects in which he has authored or coauthored more than 200 papers in refereed conferences
and journals. Prof. Vieira has served on program committees of the major conferences of the
dependability area and acted as referee for many international conferences and journals in the
dependability and security areas.


Authors and Affiliations

Matheus Torquato1,2 · Paulo Maciel3 · Marco Vieira1


Paulo Maciel
[email protected]
Marco Vieira
[email protected]
1 CISUC, DEI, University of Coimbra, Coimbra, Portugal
2 Federal Institute of Alagoas, Campus Arapiraca, Arapiraca, Brazil
3 Centro de Informática, Universidade Federal de Pernambuco (CIn-UFPE), Recife, Brazil
