
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCC.2020.3015769, IEEE Transactions on Cloud Computing.

Resource Usage Cost Optimization in Cloud Computing Using Machine Learning

Patryk Osypanka, Piotr Nawrocki

Abstract—Cloud computing is gaining popularity among small and medium-sized enterprises. The cost of cloud resources plays a
significant role for these companies and this is why cloud resource optimization has become a very important issue. Numerous methods
have been proposed to optimize cloud computing resources according to actual demand and to reduce the cost of cloud services. Such
approaches mostly focus on a single factor (i.e. compute power) optimization, but this can yield unsatisfactory results in real-world cloud
workloads which are multi-factor, dynamic and irregular. This paper presents a novel approach which uses anomaly detection, machine
learning and particle swarm optimization to achieve a cost-optimal cloud resource configuration. It is a complete solution which works in a
closed loop without the need for external supervision or initialization, builds knowledge about the usage patterns of the system being
optimized and filters out anomalous situations on the fly. Our solution can adapt to changes in both system load and the cloud provider’s
pricing plan. It was tested in Microsoft’s cloud environment Azure using data collected from a real-life system. Experiments demonstrate
that over a period of 10 months, a cost reduction of 85% was achieved.

Index Terms—cloud resource usage prediction, anomaly detection, machine learning, particle swarm optimization, resource cost
optimization.

1 INTRODUCTION

COMPUTER systems are currently often located in computing clouds such as Amazon Web Services (operated by Amazon), Azure (operated by Microsoft), Google Cloud Platform (operated by Google) and many others. A computing cloud provides storage, network and computing resources to anyone who needs them. There are different cloud usage models, i.e. Infrastructure as a Service (IaaS), Platform as a Service (PaaS) or Software as a Service (SaaS), but all of them reduce management effort and downtime risk while providing high-scalability possibilities when compared to on-premise solutions. Scalability means that new instances of services (PaaS), virtual machines (IaaS) or databases (databases are partially SaaS and partially PaaS) can be added as required. In many systems, it is difficult to predict load beforehand; thus, to meet accessibility and responsiveness requirements (especially where the system is too big for frequent, on-demand adjustments), the system must be scaled up with a margin for both unforeseen load spikes and long-term load changes. This results in considerable power and storage overprovisioning and thus unnecessary spending. In many cases, companies provision resources with a large safety margin just to avoid unexpected emergencies. Sometimes they add to these resources when a problem emerges and leave them at high levels even after the problem has been fixed. Moreover, Andrae and Edler [1] estimate that in 2030, data centers will use around 3–13% of global electricity, and this is why reducing provisioned resources is also important in order to protect the environment.

Fig. 1. Example of components and their properties (VM: CPU power, RAM size, I/O limit; Database: disk size, I/O limit; Micro-service: CPU power, RAM size, I/O limit)

A cloud provider offers different components (i.e. virtual machines (VM) or databases (DB)), and each component consists of different properties (i.e. compute power (CPU), random access memory size (RAM), disk capacity or input/output operations per second (IOPS)) (Fig. 1). Our idea is to automate the process of scaling system components while taking into account the predicted usage level. In the process, we take into consideration the usage of virtual machines, application services and databases. Our solution can optimize cloud resource usage costs by predicting the demand for different resources (i.e. CPU, IOPS, memory, storage) and then adjusting cloud components accordingly. Prediction is done with the use of machine learning interpolation combined with anomaly detection. Cost reductions are achieved by provisioning cloud components that meet the demand and at the same time are optimal from the financial point of view. The optimal configuration is arrived at using a particle swarm optimization (PSO) algorithm adjusted to solving discrete problems.

• P. Osypanka is with the Department of Computer Science, AGH University of Science and Technology, al. A. Mickiewicza 30, 30-059 Krakow, Poland and ASEC S.A., ul. Wadowicka 6, 30-415 Kraków, Poland. E-mail: [email protected]
• P. Nawrocki is with the Department of Computer Science, AGH University of Science and Technology, al. A. Mickiewicza 30, 30-059 Krakow, Poland. E-mail: [email protected]
Manuscript received X X, X; revised X X, X.

The classic approach to cloud resource optimization either focuses on a single resource (e.g. CPU) and scaling parameter (e.g. number of machines) or creates resource utilization models that ignore potential unexpected changes.


The main contributions of this paper are briefly summarized as follows:
• using an anomaly detection filter to improve the quality of machine learning regression predictions;
• using an adapted PSO algorithm to solve the cloud resource reservation problem;
• using both vertical (quality) and horizontal (quantity) scaling at the same time to obtain optimal results;
• presenting experimental results with real-life data from the production system for different cloud usage models and verifying the effectiveness of the solution proposed along with actual cost reductions.

The rest of this paper is structured as follows: Section 2 contains a description of related work, Section 3 is concerned with defining in detail the cloud resource cost optimization process, Section 4 describes the implementation of the optimization solution, and Section 5 contains the conclusion and further work.

2 RELATED WORK

The literature describes various studies devoted to resource allocation optimization. For example, the authors of [2], [3] and [4] describe a solution which analyses incoming tasks and reserves virtual machine instances in a way that makes it possible to meet a deadline and is cost efficient. The solution assumes that the system is performing tasks with known CPU and memory demands. The authors of the review presented in [5] discuss different task scheduling methods which can be used in such cases. On the other hand, we optimize more generic systems which fulfil many functions and therefore cannot focus on task scheduling as we are unable to determine the relevant parameters. We must make sure that just enough cloud resources are available when needed.

In a similar manner, cloud resource management with the use of deep reinforcement learning algorithms was described by Zhang et al. [6]. The authors propose a deep Q-network as a variant of a reinforcement learning algorithm, which is initially pre-trained by a stacked autoencoder (SAQN). To address stability issues, they introduced experience replay, Q-network freeze and network normalization. The described solution assumes that the client makes requests with a resource demand which is known beforehand and is tested using an artificial load generated by HiBench, a big data benchmark suite. Our approach is to optimize generic systems which generate requests that are variable in time and whose characteristics are unknown. The tests performed show that due to anomaly detection, our solution works without initial training and is able to operate properly not only with a simulated (artificial) load, but also with real-world, noisy data.

Hilman et al. [7] propose an online incremental learning approach to predict the run time of tasks, and the authors of [8] use machine learning (ML) for the same purpose. In addition, Yang et al. [9] propose ML along with heuristic algorithms to assign tasks to the optimal virtual machine. However, the important aspect of resource management and scaling is missing from those works, as opposed to our solution which considers available component configurations.

A different approach is presented in [10] where the authors propose a system which scales resources across private and public clouds. The system scaling engine is based on queuing theory and makes it possible to extend private cloud capabilities using public cloud resources. This solution uses threshold-based policies and time-series analysis. We use machine learning instead to predict demand, which allows us to bridge the delays caused by the process of provisioning new instances.

In [11], the authors use ML for resource usage prediction and in [12], the authors propose a host of different ML algorithms as a way of improving prediction; however, in both cases no further steps beyond prediction are presented. Also, none of these solutions perform anomaly detection, which makes them prone to inaccurate predictions in case of temporary deviations.

The authors of [13] describe a system which develops virtual machine reservation plans based on CPU usage history. During evaluation, different ML algorithms are compared with OpenStack and Blazar. In addition, tests in the virtual environment (without cloud integration) present the system's performance over a year. Although the system uses ML, which makes it flexible, the authors focus on a single type of virtual machine only as contrasted with our system, which uses all VM types available from a given cloud provider to minimize overall cost. We also account for more resources (i.e. RAM) along with anomaly detection, which makes our solution more complete and accurate.

Other works present different methods of virtual machine usage optimization: a time-aware residual network [14], autonomic computing and reinforcement learning [15], deep learning [16], a combination of PPSO and NN [17], an NN with a self-adaptive differential evolution algorithm [18] and standalone neural networks [19], [20]. The authors of [21] use Naïve Bayes, and in [22] and [23] the authors use learning automata. Kaur et al. [24] propose a set of various prediction methods working in parallel and the authors of [25] use a progressive QoS prediction model and a genetic algorithm. All those works along with surveys [26], [27] focus on virtual machines, mostly on CPU and RAM usage. We extend these approaches to other cloud component types (PaaS, SaaS) and make a step further by selecting real-life, provider-dependent sets of resources. Although the authors of [28] describe a general idea for a system which would cover IaaS, PaaS and SaaS, that study does not include any tests or broader analysis of the topic.

A lot of studies describe different ways of allocating resources optimally from a cloud provider's point of view. Dorian Minarolli and Bernd Freisleben [29] describe a system which optimizes virtual machine allocation using fuzzy control. Owing to the proposed multi-agent environment, their solution is able to operate on a considerable set of virtual machines. Similarly, Singh et al. [30] propose mobile agents which manage resource allocation in the cloud provider's physical infrastructure. The authors take into consideration not just the type of physical resources available, but also their location and network infrastructure, which allows a cloud provider to reduce costs. In comparison to the above solutions, our approach is focused on cost optimization from the end-user perspective; although a reduction in server operation costs could possibly lead to a provider offering a discount, the solution proposed by us provides direct cost savings.

The solutions described in the aforementioned articles take into consideration only the provisioning of virtual machines as a cloud provider's building blocks, while we are focused on cost optimization from the end user's perspective, and thus not only IaaS, but also PaaS and SaaS are considered.

Usage prediction enables us to develop a resource usage plan. Many works describe different techniques of resource allocation. In [31], Wei et al. present a game-theoretic method while the authors of [32] propose a coral-reef and game theory-based approach. Machine learning is proposed in [33], and a combinatorial auction algorithm and a combinatorial double auction algorithm are described in [34] and [35]. Zhang et al. [36] propose machine learning-based resource allocation, and in [37] the authors put forward greedy particle swarm optimization. Our solution uses the more lightweight, although accurate, Integer-PSO algorithm described in [38], which we adapt and use for resource allocation planning purposes.

In the survey [39], Gondhi et al. review different virtual machine scheduling algorithms. Besides particle swarm optimization, which is a base for Integer-PSO, the authors describe a genetic algorithm, simulated annealing, ant colony optimization, an artificial immune system and other meta-heuristics algorithms. Despite providing comparisons of advantages and disadvantages of the methods presented, the survey does not describe complete solutions. For example, a continuous PSO algorithm has to be first adapted to the discrete resource allocation problem (Integer-PSO) and only then can it be used in the optimization process, while the aforementioned survey does not cover this adaptation. On the other hand, our work describes a complete solution which was tested on real-world data.

In addition to the research described above, there are some commercial solutions which enable cloud resource optimization. For example, scaling components as exemplified by Azure Autoscale¹, AWS Autoscale² and Google Cloud Autoscale³ are part of the cloud environment. Unfortunately, only threshold-based scaling and simple time-based scaling are available. Both of those scaling techniques require an analysis of system usage patterns which might be difficult when the system is complicated. There are also commercial cloud provider-independent systems⁴ that offer cloud resource optimization. These systems analyze spending and present it in an easy-to-understand form. Additionally, they provide hints about potential scale-downs of some cloud components or reorganizations which reduce cloud running costs. These systems are not automated, so administrators must approve the changes proposed every time they find them useful. Some of these systems advertise that they are using ML in their analysis⁵,⁶, but in fact, they offer an overall view of spending sources and a simple scheduling of component scaling plus human support, which helps reduce costs but without automation.

1. Azure Autoscale – https://azure.microsoft.com/en-us/features/autoscale
2. AWS Autoscale – https://aws.amazon.com/autoscaling
3. Google Cloud Autoscale – https://cloud.google.com/compute/docs/autoscaler
4. Azure Cost Management – https://azure.microsoft.com/en-us/services/cost-management
5. Cloud Cost Management, Efficiency and Optimization – https://www.cloudability.com
6. Next-Generation Cloud Optimization for CloudOps – https://www.densify.com

Fig. 2. Optimization setup overview (the optimization solution, consisting of the Prediction module, the Database and the Monitoring module, interacts with the system to optimize within the working system)

Our analysis of existing solutions shows that currently none tackle the problem of optimizing different types of cloud resources (IaaS, PaaS, SaaS) with proactive usage prediction, anomaly detection and efficient, cloud-provider specific, automatic resource allocation. The contribution of this study is to define such a fully automatic system along with simulations and tests of its behavior using real-life usage data. Our solution does not require initial reservation schedules or knowledge about the type of tasks performed by the system. It works with different combinations of cloud component types (IaaS, PaaS, SaaS) and accounts for various resource properties (CPU, IOPS, RAM, etc.). The cost optimization mechanism is resistant to anomalies (i.e. temporary usage spikes) and adapts to price changes (i.e. periodic discounts) as pricing policy is obtained directly from the cloud provider.

3 CLOUD RESOURCE COST OPTIMIZATION

Systems located in the cloud can be complicated and involve multiple different resource types. The demand for those resources varies over time, which is conditioned by:
1) usage patterns generated by users which depend on the time of the day and the day of the week;
2) usage patterns which depend on end-point machine configuration (usage generated by automated devices, i.e. IoT);
3) changes in system configuration (new functionalities, new devices);
4) accidental changes caused by temporary conditions (a software bug, communication issues).

A system must meet availability demands. A change in demand for cloud resources necessitates changes in those resources' configurations, which means scaling them. Resources can be scaled up or out. For example, a virtual machine can be scaled up by increasing its CPU parameters or it can be scaled out by provisioning another copy of the given VM. Depending on the cloud provider's pricing plan, either scaling up or scaling out can be more cost-effective while providing the same computing power. Scaling takes time, so it must be performed before it is needed, which requires resource usage prediction.


To meet the above requirements, we have developed a solution which performs prediction and monitoring. It consists of a Prediction module, a Monitoring module and a Database to store predicted data (Fig. 2). We designed it (Fig. 3) to periodically (every week) gather historical usage data from the last month for each resource which needs to be tailored. This task is done by the Prediction module. In the next step, the solution filters out anomalies to improve prediction quality. Next, for each resource, it makes a prediction for the next 7 days and, using all these predictions combined, calculates a cost-optimal cloud resource configuration with hourly resolution and the desired maximum resource utilization level. The maximum utilization level depends on system type (i.e. it will be lower for high-availability systems). Such a long prediction timeframe reduces prediction frequency and provides an allocation plan for the entire week for the administrator's inspection if required. Only available scaling options are considered; if a cloud provider adds new possibilities, these will be automatically included in calculations. The calculated cloud resource configuration is stored in the Database. In a separate hourly loop, using the Monitoring module, the system checks if cloud resources need to be scaled according to predictions.

Fig. 3. The optimization loop (after optimization system startup, a weekly loop: 1. gather historical data, 2. filter out anomalies, 3. make usage prediction, 4. combine predictions and calculate the optimal configuration, 5. store prediction data in the database; and an hourly loop: 6. check predictions and scale cloud resources if needed)

The logic of the Prediction module is presented in the form of an algorithm (Fig. 4):

for all component of the monitoredComponents do
  for all resource of the component do
    data = GetHistoryData(resource);
    data = AnomalyFilter(data);
    data = MedianFilter(data);
    WriteToDB(data);
    predictionData = ReadFromDB(configuredWindow);
    predictionResult = PredictUsageML(predictionData);
    predictions.Add(predictionResult);
  end for
end for
pricing = GetPricingPlan();
configuration = CalculateConfiguration(predictions, pricing);
WriteToDB(configuration);

Fig. 4. Prediction module algorithm

For every component (component) in the set of monitored components (monitoredComponents) which we optimize, the module collects (GetHistoryData()) CPU, memory and storage usage data (usage level along with the time of day and the day of the week). These data are filtered and then stored in the database to be used for prediction later on. Filtering is done by the anomaly detection algorithm [40]: first using the exchangeability martingales function (AnomalyFilter()), and next, in order to smooth the data and improve prediction quality, using a median filter (MedianFilter()). Filtering prevents unnecessary prediction distortions when resource usage changes are temporary and random. If such a change exceeds the allocated resources, the system becomes less responsive and takes longer to process requests. Filtered data are stored in the database (WriteToDB()). In the next step, the module reads the collected historical data from the database (ReadFromDB()) to predict usage for the next week. The historical data time window length affects prediction stability and adaptation rate and has to be configured according to the properties of the system being optimized (configuredWindow). The time window must be sufficiently long to observe usage patterns but sufficiently short to allow quick prediction adaptation. For every collected piece of usage data, the module develops usage predictions (PredictUsageML()) using machine learning interpolation and then stores them (predictions.Add()).

In the last stage of the algorithm, after all predictions have been made, the module obtains the current pricing plan from the cloud provider (GetPricingPlan()) and calculates the optimal resource configuration. The cloud provider defines possible scaling configurations for different cloud components; the same CPU, memory or disk storage resources can be provisioned with a different configuration and therefore at a different cost. This creates a matrix of possibilities. As the number of possible configurations is usually large, calculating all variants is not feasible and this is why the module chooses a cost-optimal configuration (CalculateConfiguration()) using a particle swarm optimization algorithm. Based on the solution described by A. S. Ajeena Beegom et al. [38], we defined our own version of the Integer-PSO algorithm which is suited to our needs.
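For illustration, the listing below is a minimal, self-contained Python sketch of the weekly loop in Fig. 4. It is not the production implementation: a simple z-score clip stands in for the exchangeability-martingale anomaly filter [40], a naive hour-of-week average stands in for the Azure Machine Learning Studio regressions, and the function and parameter names (run_prediction_module, window_hours, etc.) are illustrative assumptions only.

```python
# Minimal sketch of the weekly Prediction module loop (Fig. 4); not the authors' code.
# The anomaly filter and the predictor are deliberately simplified stand-ins.
import statistics
from scipy.signal import medfilt

def anomaly_filter(samples, threshold=3.0):
    """Replace strongly deviating points; a stand-in for the martingale-based filter [40]."""
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples) or 1.0
    return [mean if abs(x - mean) / stdev > threshold else x for x in samples]

def predict_usage(history, horizon_hours=168):
    """Naive stand-in for PredictUsageML(): average usage per hour of the week."""
    buckets = [[] for _ in range(168)]
    for t, value in enumerate(history):
        buckets[t % 168].append(value)
    profile = [statistics.fmean(b) if b else 0.0 for b in buckets]
    return [profile[t % 168] for t in range(horizon_hours)]

def run_prediction_module(usage_history, window_hours=720):
    """usage_history maps a resource name to its hourly samples; returns weekly forecasts."""
    predictions = {}
    for resource, samples in usage_history.items():
        filtered = anomaly_filter(samples)                  # AnomalyFilter()
        smoothed = list(medfilt(filtered, kernel_size=5))   # MedianFilter()
        recent = smoothed[-window_hours:]                   # configured history window
        predictions[resource] = predict_usage(recent)       # PredictUsageML()
    return predictions  # passed on to the configuration calculation step
```

In the actual system, the forecasts produced at this point are combined with the current pricing plan by the Integer-PSO based configuration step, which is formalized next.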


Given the predicted required level of resources L = [L_1, ..., L_m] (i.e. CPU core count or RAM amount) and n different component configuration types (i.e. compute-optimized, memory-optimized, general-purpose) from the cloud provider, [T_1, ..., T_n], our problem is to find a set of configurations Q which will meet the L constraint and will be cost-efficient at the same time. Q = [z_1, ..., z_n] defines how many instances of every configuration type should be used. As an example, we can take virtual machines with CPU core count (L_1) and RAM amount (L_2) as the resources examined, along with the predicted required level L = [7, 16], which means 7 CPU cores and 16 GB of RAM. A sample cloud provider offers 3 different machine types:
• T_1: 4 CPU cores, 1 GB of RAM, €12.00/month;
• T_2: 2 CPU cores, 8 GB of RAM, €14.00/month;
• T_3: 2 CPU cores, 2 GB of RAM, €10.00/month.
In this case Q, which meets the L constraint, can be defined as [1, 2, 0]. It means one virtual machine of type T_1 and 2 machines of type T_2. The maximum value k for z_i (i ∈ (1, ..., n)) which has to be taken into consideration while finding Q can be defined as the number of the least powerful configurations needed to meet the L level. Adding more resources would be more expensive and is not necessary, as L is definitely already met. Following the above example, k = 16, as 16 virtual machines of type T_1 fulfil the L requirement in terms of RAM amount. Q is defined as:

Q = [z_1, \ldots, z_n]   (1)

where ∀i ∈ (1, ..., n) 0 ≤ z_i ≤ k. The cost C of such a set is defined as:

C(Q, M) = Q \cdot M = Q \cdot [m_1, \ldots, m_n]^T = \sum_{i=1}^{n} (z_i \cdot m_i)   (2)

where m_i is the price of the T_i configuration type. The resource level P provided by Q is defined as:

P = Q \cdot \begin{bmatrix} s_{11} & \cdots & s_{1m} \\ \vdots & \ddots & \vdots \\ s_{n1} & \cdots & s_{nm} \end{bmatrix} = [P_1, \ldots, P_m]   (3)

where P_j = \sum_{i=1}^{n} (z_i \cdot s_{ij}) and s_{ij} is the j-th resource level provided by the T_i configuration type. In the example defined before, the cost is calculated as:

C = [1, 2, 0] \cdot [€12.00, €14.00, €10.00]^T = €40.00   (4)

and the resource level as:

P = [1, 2, 0] \cdot \begin{bmatrix} 4 & 1 \\ 2 & 8 \\ 2 & 2 \end{bmatrix} = [8, 17]   (5)

The cost definition for the minimization algorithm, D, is as follows:

D(C, P, L) = \begin{cases} C & \text{if } P \geq L \\ \infty & \text{otherwise} \end{cases}   (6)

where P ≥ L is defined as:

P \geq L \Leftrightarrow \forall i \in (1, \ldots, m)\; P_i \geq L_i   (7)

In the example, D = C = €40.00 as 8 ≥ 7 and 17 ≥ 16. The Q with the minimal cost can be found using the cost function D from equation (6) and the Integer-PSO algorithm. As cloud providers' pricing policies are usually complex, it is impossible to define how many minima exist in the cost function, which is discrete, as a fractional component cannot be provisioned. The final stage of the original algorithm described in [38] was altered, as we are looking for multiples of available machines [z_1, ..., z_n] rather than a task assignment configuration.

To reduce frequent configuration changes, the newly calculated configuration Q' is compared to the previous configuration. If the old Q still meets the P ≥ L constraint and if ∀i ∈ (1, ..., m) d_i < F (where d_i = (P_i − P'_i)/P_i and F is a stability factor), Q' is discarded and Q is used instead. F determines how probable it is that the algorithm will keep the previous configuration set. Continuing the example defined previously, where Q = [1, 2, 0] and P = [8, 17], we can take as an example a new predicted required level L' = [4, 15] and a new set Q' = [0, 2, 0] with P' = [4, 16], and we can define the stability factor as F = 0.4. For the CPU count, d_1 = (8 − 4)/8 = 0.5; for the RAM amount, d_2 = (17 − 16)/17 ≈ 0.06. In this case, d_i < F is not met for the CPU count (i = 1) and the new value Q' will be used. Each time the old configuration is used, F is decremented; when Q' is used, F is reset to its initial value. The final results are stored in the database (WriteToDB()) and are later used by the Monitoring module.

In a separate loop, the Monitoring module runs every hour. It monitors if a given resource must be scaled according to the predicted configuration, and scales it if needed.

To estimate the quality of the predicted components' set, we use common prediction measurements: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Relative Absolute Error (RAE) and Root Relative Squared Error (RRSE). To compare the predicted configuration with the real usage history, we defined the R metric, which is the mean of overusage errors. For the given predicted usage during hours t_1 to t_m, R is defined as:

R = \frac{\sum_{t=1}^{m} E_t}{m}   (8)

where E_t is the prediction error for the hour t, defined as:

E_t = (u_t − p_t) \cdot H(u_t − p_t)   (9)

where H is a discrete Heaviside step function:

H(n) = \begin{cases} 0, & n < 0 \\ 1, & n \geq 0 \end{cases}   (10)

p_t is the calculated level for hour t and u_t is the actual resource usage level for hour t.

In the end, we measure the average cost savings per hour, V. For a given resource and the given predicted usage of this resource during hours t_1 to t_m, V is defined as:

V = \frac{\sum_{t=1}^{m} (G_t − C_t)}{m}   (11)

where G_t is the cost of the configuration without optimization during the hour t and C_t is the cost of the predicted configuration during the hour t. Both are expressed in the cloud provider's currency.


The system defined above, which uses machine learning combined with anomaly detection along with the PSO algorithm, calculates the optimal cloud resource configuration. As a result, resource usage cost reduction is achieved.

4 EVALUATION

Based on the concept from the previous section, we have developed an optimization system which uses Azure cloud computing. The efficiency of our system was proved during tests with cloud simulators; using a simulator reduces testing time and improves testing elasticity as described in [41]. Azure (Microsoft's cloud service) exposes an API which gives access to a component's historical usage and makes it possible to get and set a component's parameters. The Azure API also exposes the current pricing plan. Since this is convenient, we focus on the Azure cloud only, especially on virtual machines (IaaS), App Services (PaaS) and the Azure SQL (SaaS). Virtual machines and App Services can be scaled in terms of Azure Compute Units (ACUs), which represent unified compute (CPU) performance power. The available RAM can be scaled for an App Service and the maximum level of input/output operations per second can be scaled for a virtual machine. SQL databases can be scaled in terms of storage size and available Database Transaction Units (DTUs), which are a blend of used memory, CPU power and IOPS level. Nevertheless, our solution is suitable for any cloud provider and any cloud resources which can be scaled.

To take advantage of the Azure environment, we selected Microsoft Azure Machine Learning Studio as our main prediction engine. Machine Learning Studio offers ready-to-use data processing and ML components. It also allows custom functions written in the R and Python languages. We tested our solution using real-life data from a working system called Terminal Management System (TMS), which is a cloud-based manager of Internet of Things devices [42]. TMS enables credit card payments in vending machines and kiosks. It consists of many endpoint devices which connect to the central server. The central server processes payment transactions and allows operators to configure and maintain end-point devices. The central server consists of micro services (which deal with payment) and virtual machines (which host management/reporting webpages). Both micro services and virtual machines connect to the SQL database (Fig. 5).

Fig. 5. Architecture of TMS – a real-life system used as the test data provider for our simulations (payment devices and web browsers connect to the Azure cloud, which hosts the Management web page, the Payment service and the Database)

Payment devices connect to the Payment service during the credit card payment process. These devices are located in Asia, Europe and America and are used mostly in unattended vending machines. This causes daily variations in resource demand. Web browsers connect to the Management webpage when the operator changes configurations or generates reports. Also, devices connect to the Management web page to report their status and check for configuration changes. The main load comes from the devices, which are configured to connect periodically. Therefore, there is no visible resource demand variation pattern. The database is used by both the Payment service and the webpage, and thus the resource demand variations visible in the payment module are also present in database usage to a certain extent. As TMS consists of components with different usage characteristics, we can test our idea in different test conditions.

Fig. 6. Time-compressed environment test architecture (the Prediction module, Database and Monitoring module of the optimization solution operate on mock components: a mock payment service, a mock management web page and a mock database fed with historical data)

We set up a test environment (Fig. 6) that allowed us to perform time-compressed tests. Instead of using the real production system (TMS), we created mock components: the Payment service, the Management web page and the Database, which were used as inputs for our solution. Data collected from the production TMS (10 months in total) were stored in a separate database for test purposes. The entire TMS system was monitored and all types of components (PaaS, IaaS and SaaS) were taken into account; ACU, RAM, IOPS, DTU and storage usage were used in tests. We implemented four different prediction types: Bayesian Linear (BL), Decision Forest Regression (DF), Boosted Decision Tree Regression (BDT) and Neural Network Regression (NN). The Bayesian approach uses linear regression enhanced by information in the form of a probability distribution. Statistical analysis is undertaken, prior knowledge about model parameters is merged with a likelihood function, and posterior estimates for the parameters are generated [43]. Decision trees are models which execute a sequence of data analyses until a decision is achieved. The Decision Forest Regression model consists of multiple decision trees. Each tree creates a prediction (a Gaussian distribution) which is compared to the combined distribution for all trees in the model [44]. Boosted Decision Tree Regression uses the MART gradient boosting algorithm which gradually builds a series of decision trees. The optimal tree is selected using an arbitrary loss function [45]. Neural Network Regression uses a neural network as a model. This type of regression is suitable for difficult problems where other regression models cannot fit a solution [46].
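The four predictors above are Azure Machine Learning Studio modules. Purely as an illustration of how such a comparison might be reproduced outside the Studio, the sketch below trains rough open-source counterparts from the same model families on a synthetic hourly usage series with hour-of-day and day-of-week features. It is not the evaluation pipeline used in this paper, and the scikit-learn estimators only approximate the Studio modules.

```python
# Approximate open-source counterparts of the four regressor families, compared on a
# synthetic hourly usage series; an illustration only, not the paper's evaluation code.
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
hours = np.arange(24 * 7 * 8)                             # 8 weeks of hourly samples
usage = 50 + 20 * np.sin(2 * np.pi * (hours % 24) / 24) + rng.normal(0, 3, hours.size)

X = np.column_stack([hours % 24, (hours // 24) % 7])      # hour of day, day of week
X_train, X_test = X[:-168], X[-168:]                      # hold out the final week
y_train, y_test = usage[:-168], usage[-168:]

models = {
    "Bayesian Linear": BayesianRidge(),
    "Decision Forest": RandomForestRegressor(n_estimators=8, max_depth=32, random_state=0),
    "Boosted Decision Tree": GradientBoostingRegressor(n_estimators=100, learning_rate=0.2),
    "Neural Network": MLPRegressor(hidden_layer_sizes=(100,), max_iter=2000, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "MAE:", round(mean_absolute_error(y_test, model.predict(X_test)), 2))
```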


Each prediction type operates in the "Tune Model Hyperparameters"⁷ self-tune mode, which means that prediction algorithm parameters are picked automatically. Prediction was performed for every type of machine learning, thus we were able to compare the results. As initial values for self-tuning, the following default configurations were used:

1) BL
• Regularization weight = 1
• Tune Model Hyperparameters maximum number of runs = 15

2) DF
• Re-sampling method = Bagging
• Number of decision trees = 8
• Maximum depth of the decision trees = 32
• Number of random splits per node = 128
• Minimum number of samples per leaf node = 1
• Tune Model Hyperparameters maximum number of runs = 5

3) BDT
• Maximum number of leaves per tree = 20
• Minimum number of samples per leaf node = 10
• Learning rate = 0.2
• Total number of trees constructed = 100
• Tune Model Hyperparameters maximum number of runs = 5


4) NN
• Hidden layers = 1, fully connected
• Number of hidden nodes = 100
• Learning rate = 0.02
• Number of iterations = 80
• The initial learning weights diameter = 0.1
• The momentum = 0
• The type of normalizer = Do not normalize

7. Tune Model Hyperparameters – https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/tune-model-hyperparameters

Fig. 7. Comparison of predictions made with different ML algorithms for different resources: (a) time-compressed prediction of ACU – PaaS (December 2018); (b) time-compressed prediction of ACU – IaaS (April 2019); (c) time-compressed prediction of DTU – SaaS (May 2019). Each panel shows the actual usage together with the Neural Network, Boosted Decision Tree, Decision Forest and Bayesian Linear predictions.

Integer-PSO was used with 300 particles in 500 epochs. As in [38], we set the inertia weight to 0.6 and the acceleration coefficients to 0.2. The maximum velocity was set to 0.1 · n, where n is the number of available configuration options, and the minimum velocity was set accordingly with the minus sign. Accuracy was set to 3 digits.
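As a hedged illustration of how such a search can be realized, the sketch below implements a generic integer-rounded PSO for the instance counts Q under the cost function D of equation (6), using the parameter values quoted above (300 particles, 500 epochs, inertia weight 0.6, acceleration coefficients 0.2, velocity clamped to ±0.1·n). It is a simplified, generic re-implementation for illustration, not the exact Integer-PSO variant of [38] adapted in this work.

```python
# Generic discrete-PSO sketch for choosing instance counts Q = [z_1..z_n] that minimize
# the cost D of equation (6). The rounding scheme is a simplification of Integer-PSO [38].
import math
import random

def integer_pso(types, prices, demand, k, particles=300, epochs=500,
                inertia=0.6, c1=0.2, c2=0.2):
    """Search for counts Q = [z_1..z_n] (0 <= z_i <= k) minimizing D from equation (6)."""
    n = len(types)
    v_max = 0.1 * n                                      # velocity clamp quoted in the text

    def as_counts(x):                                    # map a continuous position to counts
        return [int(min(k, max(0, round(v)))) for v in x]

    def fitness(x):                                      # equation (6) on the rounded position
        q = as_counts(x)
        provided = [sum(z * t[j] for z, t in zip(q, types)) for j in range(len(demand))]
        feasible = all(p >= l for p, l in zip(provided, demand))
        return sum(z * m for z, m in zip(q, prices)) if feasible else math.inf

    pos = [[random.uniform(0, k) for _ in range(n)] for _ in range(particles)]
    vel = [[0.0] * n for _ in range(particles)]
    pbest = [list(p) for p in pos]
    gbest = list(min(pos, key=fitness))

    for _ in range(epochs):
        for i in range(particles):
            for d in range(n):
                r1, r2 = random.random(), random.random()
                v = (inertia * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))
                pos[i][d] = min(float(k), max(0.0, pos[i][d] + vel[i][d]))
            if fitness(pos[i]) < fitness(pbest[i]):
                pbest[i] = list(pos[i])
            if fitness(pos[i]) < fitness(gbest):
                gbest = list(pos[i])
    return as_counts(gbest)

# Example from Section 3: three machine types, demand of 7 CPU cores and 16 GB of RAM.
print(integer_pso(types=[(4, 1), (2, 8), (2, 2)], prices=[12.0, 14.0, 10.0],
                  demand=[7, 16], k=16))                 # e.g. [1, 2, 0]: one T1 and two T2
```

Because the fitness returns infinity for infeasible configurations, the swarm is pushed towards the cheapest configuration that still satisfies the predicted demand.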
For Payment service (PaaS) optimization, we selected ACU and RAM utilization levels as optimization factors. For the Management web page (IaaS), ACU and IOPS were selected, and for the Database (SaaS), DTU and disk space were selected. Because we are using the Database equally for reading and writing, it is hard to scale it out (multiply its instances), so in this case we set k = 1 in equation (1) to limit Integer-PSO to one instance only. We performed our simulation using 10 months of data from TMS, and we compared the results with the production configuration. We also compared the results obtained from different ML algorithms. In total, we made 24 predictions (6 optimization factors multiplied by 4 algorithms). Each prediction consisted of more than 6,500 points (10 months with hourly resolution). For purposes of clarity, we chose one optimization factor for every component and presented them separately for selected periods (Fig. 7). In fact, as defined in equations (2), (3) and (7), all resources included in the component in question are calculated together. For every component, the chart presents a workload characteristic (Actual usage) and all prediction algorithm results. We selected periods so that they consisted of anomalies (Figs. 7a, 7b, 7c), visible patterns (Fig. 7b) and longer high-usage events (Fig. 7a). In addition, the negative impact of previous data on the prediction process can be observed (Fig. 7b) where, especially in the beginning, the usage level predicted is clearly lower than the actual one. Nevertheless, even this error has no negative impact on the real system as we aim to keep resource usage at 70%, which gives us a 30% safety margin.

To clearly visualize the optimization process, we chose 3 weeks (8th to 31st May 2019), one component (SaaS), one property (DTU) and one ML algorithm (DF). Fig. 8 presents the actual usage level along with the usage level after anomaly detection (with the anomalies removed). Prediction is based on the data after anomaly detection and thus it is not distorted by temporary usage spikes (Fig. 9).

Fig. 8. Comparison of the actual and anomaly filtered DTU usage level (SaaS, May 2019)

Although we are using the Integer-PSO algorithm to find the optimum component configuration, due to cloud resource granulation (the cloud provider only offers pre-configured component variants, e.g. a VM with 210 ACUs and 4,000 IOPS), the values predicted are not used exactly in the calculated configuration. In the chart (Fig. 9) we present the actual resource usage, the predicted usage and the calculated configuration based on DF prediction. Despite this granulation, we still observe a significant reduction in resource costs (Fig. 10). In the TMS system, the cost of SaaS in May 2019 equals €392 and the optimization achieved by our system reduces the cost to €23. For the entire period tested, SaaS costs were reduced by 88%, which results in savings of €4,268.

Fig. 9. Comparison of the actual DTU usage level, its DF prediction and the calculated configuration (SaaS, May 2019)


Our tests demonstrate that in the case of anomalous behavior (sudden high resource usage), the calculated configuration does not cover 100% of resource demand and the cloud provider resorts to throttling. This slows down the processing of incoming requests or, in cases of prolonged high-level usage, results in a timeout response to the client (we did not observe such long-lasting anomalies). Nevertheless, the TMS system is designed to handle such situations, as timeouts are often caused by poor network conditions at the endpoint side (in this case – the credit card payment terminal) anyway.

Fig. 10. Comparison of the cost of the database with and without the optimization (SaaS, May 2019); the chart plots the hourly database cost in euro.

During tests conducted for data between 8th and 31st May 2019, we observed a reduction in resource usage cost not only for SaaS (as presented here), but also for other component types: IaaS and PaaS. Although we did not find similar solutions or test data to compare them with our system, we used the Azure Autoscale mechanism described in Section 2 as the point of reference. Although Azure Autoscale performs only horizontal (quantity) optimization for IaaS and PaaS, we chose it for its out-of-the-box availability. As vertical (quality) optimization is not available, we used the cheapest possible components (Autoscale is not enabled for low-price PaaS) to ensure the most detailed scaling. In Table 1, we present financial savings along with the common prediction quality metrics described in Section 3 and the R (mean of overusage errors) and V (cost savings per hour) parameters calculated according to equations (8) and (11). When compared to the original value, the high savings percentage figure is caused by the considerable resource over-provisioning in the TMS system due to the tendency described in Section 1. Our anomaly detection solution makes this over-provisioning unnecessary. Azure Autoscale reduced the cost to some degree, but it was still only half as efficient as our solution. Additionally, being a reactive system, it introduced a performance decline. For IaaS, where dynamic resource demand was observed, the mean response time was 2.5 times longer and the variance of response time was 7 times higher when compared to our solution. This led to longer periods of availability issues. On the other hand, for PaaS, where the resource demand was stable, both response time and variance were similar to our system. Azure Autoscale was not able to optimize SaaS resources, and PaaS and IaaS were optimized only in one dimension; the final result was more expensive and, in the case of IaaS, performance was much lower.

Optimization introduces quality degradation when compared to the original system. During our test period (from 8th to 31st May 2019), in which the original system was highly over-provisioned, we observed a mean response time that was 4 times longer and a variance that was almost 100 times higher, while the original system's response time was almost constant. However, when tested during high usage periods (from 12th to 19th September 2019), the original system's mean response time and variance were similar to the values observed after our optimization. Despite the fact that longer response times still allow the system to operate properly, optimization with quality of user experience as a parameter will be a topic of our further studies, as mentioned in Section 5.

The optimization solution runs independently from the working system which is being optimized, and thus the optimization process does not introduce any performance overhead. The monitoring module only runs when a component change is required (usually once every couple of hours), and the prediction module runs once a week and uses Microsoft Azure Machine Learning Studio. All these operations fit in the free plans offered by Azure.

5 CONCLUSIONS AND FURTHER WORK

In this work, we present a solution for optimizing cloud resource costs. Our approach operates autonomously, in a closed-loop configuration, without any need for external tuning. We used real-world data from a production system. Tests show that the savings calculated are significant and that our system works properly, minimizing cloud resource usage and cost. A comparison between current system costs and those after optimization demonstrates that during the 10 months covered by tests, the solution, if implemented in the working system, would have resulted in savings of €6,128, which translates to an 85% cost reduction.

Our solution aims to reduce the cost of using cloud resources by predicting future demand for resources and adjusting the provisioned resources accordingly. Therefore, any cloud-based system which uses scalable resources (i.e. IaaS, PaaS or SaaS) can be optimized using our solution. Optimization is performed at the resource allocation level and knowledge of the internal structure of the system being optimized is not required; however, any performance improvements in this system will be captured by our solution and fewer resources will be provisioned in the future. Since we are using prediction techniques, the greatest cost reduction will be observed for systems with usage patterns that are complicated, hard to define and varied over time; these patterns will be determined by machine learning algorithms. Scaling resources is simpler when client-server communication is stateless, as every call can be directed to the appropriate resource independently; nevertheless, cloud providers also offer scaling of stateful communications.


TABLE 1
Savings and Quality Metrics for the Best Algorithms (May 2019)

Component                          | PaaS               | IaaS                 | SaaS
Original cost                      | €58.29             | €132.13              | €391.74
Azure Autoscale                    | €28.32             | €29.05               | -
Prediction algorithm               | Decision Forest    | Neural Network       | Decision Forest
Predicted cost                     | €12.61             | €15.30               | €23.15
Compared to original:
  Savings (€)                      | €45.68             | €116.83              | €368.58
  Savings (%)                      | 78.36%             | 88.42%               | 94.09%
  V as defined in equation (11)    | €0.08              | €0.20                | €0.64
Compared to Azure Autoscale:
  Savings (€)                      | €15.71             | €13.75               | -
  Savings (%)                      | 55.47%             | 47.33%               | -
  V as defined in equation (11)    | €0.03              | €0.02                | -
Optimization factor                | ACU / RAM          | ACU / IOPS           | DTU / Size
Root Mean Square Error             | 5.2720 / 0.1982    | 41.1866 / 17.0858    | 12.1205 / 4.4203
Mean Absolute Error                | 2.4751 / 0.0984    | 13.5780 / 7.7893     | 6.0334 / 1.8355
Relative Absolute Error            | 0.6459 / 0.4884    | 0.2213 / 0.9200      | 0.5403 / 0.2512
Root Relative Squared Error        | 0.7547 / 0.5966    | 0.4951 / 1.0397      | 0.6979 / 0.4875
R as defined in equation (8)       | 0.00 ACU / 0.00 GB | 0.00 ACU / 0.00 IOPS | 0.0514 DTU / 0.00 GB

Our solution is compatible with many cloud-based system types, i.e. IoT hubs or Enterprise Resources Planning services in the form of web services, payment gateways that process online transactions, e-commerce solutions as well as web information portals and social networks.

Time-compressed tests demonstrate that the efficiency of our solution improves over time. This is why, if historical data are available, the solution can be trained in advance to boost efficiency from the start. This topic may be the subject of our further studies. Currently, we are monitoring and storing over 100 parameters of the production system (TMS). In the future, we would like to incorporate quality of user experience criteria in our resource prediction process, which may result in better resource usage optimization and quicker system response times.

ACKNOWLEDGMENTS

The research presented in this paper was supported by funds from the Polish Ministry of Science and Higher Education assigned to the AGH University of Science and Technology.

REFERENCES

[1] A. S. Andrae and T. Edler, "On global electricity usage of communication technology: Trends to 2030," Challenges, vol. 6, no. 1, pp. 117–157, 2015.
[2] M. Mao and M. Humphrey, "Auto-scaling to minimize cost and meet application deadlines in cloud workflows," Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011.
[3] J. Yang, W. Xiao, C. Jiang, M. S. Hossain, G. Muhammad, and S. U. Amin, "AI-powered green cloud and data center," IEEE Access, vol. 7, pp. 4195–4203, 2019.
[4] S. Abrishami, "Deadline-constrained workflow scheduling algorithms for infrastructure as a service clouds," Future Generation Computer Systems, 2013.
[5] S. Memeti, S. Pllana, A. Binotto, J. Kolodziej, and I. Brandic, "A review of machine learning and meta-heuristic methods for scheduling parallel computing systems," in Proceedings of the International Conference on Learning and Optimization Algorithms: Theory and Applications, ser. LOPAL '18. New York, NY, USA: ACM, 2018, pp. 5:1–5:6.
[6] Y. Zhang, J. Yao, and H. Guan, "Intelligent cloud resource management with deep reinforcement learning," IEEE Cloud Computing, vol. 4, no. 6, pp. 60–69, 2017.
[7] M. H. Hilman, M. A. Rodriguez, and R. Buyya, "Task runtime prediction in scientific workflows using an online incremental learning approach," in 2018 IEEE/ACM 11th International Conference on Utility and Cloud Computing (UCC), 12 2018, pp. 93–102.
[8] Y. Yu, V. Jindal, F. Bastani, F. Li, and I. Yen, "Improving the smartness of cloud management via machine learning based workload prediction," in 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), vol. 02, 7 2018, pp. 38–44.
[9] R. Yang, X. Ouyang, Y. Chen, P. Townend, and J. Xu, "Intelligent resource scheduling at scale: A machine learning perspective," in 2018 IEEE Symposium on Service-Oriented System Engineering (SOSE), 3 2018, pp. 132–141.
[10] C.-C. Crecana and F. Pop, "Monitoring-based auto-scalability across hybrid clouds," Proceedings of the 33rd Annual ACM Symposium on Applied Computing, pp. 1087–1094, 2018.
[11] T. Mehmood, S. Latif, and S. Malik, "Prediction of cloud computing resource utilization," in 2018 15th International Conference on Smart Cities: Improving Quality of Life Using ICT IoT (HONET-ICT), 10 2018, pp. 38–42.
[12] I. K. Kim, W. Wang, Y. Qi, and M. Humphrey, "Cloudinsight: Utilizing a council of experts to predict future cloud application workloads," in 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), 7 2018, pp. 41–48.
[13] B. Sniezynski, P. Nawrocki, M. Wilk, M. Jarzab, and K. Zielinski, "VM reservation plan adaptation using machine learning in cloud computing," Journal of Grid Computing, Jul 2019.
[14] S. Chen, Y. Shen, and Y. Zhu, "Modeling conceptual characteristics of virtual machines for CPU utilization prediction," in Conceptual Modeling, J. C. Trujillo, K. C. Davis, X. Du, Z. Li, T. W. Ling, G. Li, and M. L. Lee, Eds. Cham: Springer International Publishing, 2018, pp. 319–333.
[15] M. Ghobaei-Arani, S. Jabbehdari, and M. A. Pourmina, "An autonomic resource provisioning approach for service-based cloud applications: A hybrid approach," Future Generation Computer Systems, vol. 78, pp. 191–210, 2018.
[16] Q. Zhang, L. T. Yang, Z. Yan, Z. Chen, and P. Li, "An efficient deep learning model to predict cloud workload for industry informatics," IEEE Transactions on Industrial Informatics, vol. 14, no. 7, pp. 3170–3178, 7 2018.
[17] A. Abdelaziz, M. Elhoseny, A. S. Salama, and A. Riad, "A machine learning model for improving healthcare services on cloud computing environment," Measurement, vol. 119, pp. 117–128, 2018.

[18] J. Kumar and A. K. Singh, "Workload prediction in cloud using artificial neural network and adaptive differential evolution," Future Generation Computer Systems, vol. 81, pp. 41–52, 2018.
[19] J. N. Witanto, H. Lim, and M. Atiquzzaman, "Adaptive selection of dynamic VM consolidation algorithm using neural network for cloud resource management," Future Generation Computer Systems, vol. 87, pp. 35–42, 2018.
[20] K. Mason, M. Duggan, E. Barrett, J. Duggan, and E. Howley, "Predicting host CPU utilization in the cloud using evolutionary neural networks," Future Generation Computer Systems, vol. 86, pp. 162–173, 2018.
[21] A. M. Al-Faifi, B. Song, M. M. Hassan, A. Alamri, and A. Gumaei, "Performance prediction model for cloud service selection from smart data," Future Generation Computer Systems, vol. 85, pp. 97–106, 2018.
[22] A. A. Rahmanian, M. Ghobaei-Arani, and S. Tofighy, "A learning automata-based ensemble resource usage prediction algorithm for cloud computing environment," Future Generation Computer Systems, vol. 79, pp. 54–71, 2018.
[23] M. Ranjbari and J. A. Torkestani, "A learning automata-based algorithm for energy and SLA efficient consolidation of virtual machines in cloud data centers," Journal of Parallel and Distributed Computing, vol. 113, pp. 55–62, 2018.
[24] G. Kaur, A. Bala, and I. Chana, "An intelligent regressive ensemble approach for predicting resource usage in cloud computing," Journal of Parallel and Distributed Computing, vol. 123, pp. 1–12, 2019.
[25] X. Chen, J. Lin, B. Lin, T. Xiang, Y. Zhang, and G. Huang, "Self-learning and self-adaptive resource allocation for cloud-based software services," Concurrency and Computation: Practice and Experience, vol. 0, no. 0, p. e4463, 2018, e4463 CPE-17-0360.
[26] C. Qu, R. N. Calheiros, and R. Buyya, "Auto-scaling web applications in clouds: A taxonomy and survey," ACM Comput. Surv., vol. 51, no. 4, pp. 73:1–73:33, Jul. 2018.
[27] Y. Al-Dhuraibi, F. Paraiso, N. Djarallah, and P. Merle, "Elasticity in cloud computing: State of the art and research challenges," IEEE Transactions on Services Computing, vol. 11, no. 2, pp. 430–447, 3 2018.
[28] H. M. Makrani, H. Sayadi, D. Motwani, H. Wang, S. Rafatirad, and H. Homayoun, "Energy-aware and machine learning-based resource provisioning of in-memory analytics on cloud," in Proceedings of the ACM Symposium on Cloud Computing, ser. SoCC '18. New York, NY, USA: ACM, 2018, pp. 517–517.
[29] D. Minarolli and B. Freisleben, "Virtual machine resource allocation in cloud computing via multi-agent fuzzy control," in 2013 International Conference on Cloud and Green Computing, 2013, pp. 188–194.
[30] A. Singh, D. Juneja, and M. Malhotra, "A novel agent based autonomous and service composition framework for cost optimization of resource provisioning in cloud computing," Journal of King Saud University - Computer and Information Sciences, vol. 29, no. 1, pp. 19–28, 2017.
[31] G. Wei, A. V. Vasilakos, Y. Zheng, and N. Xiong, "A game-theoretic method of fair resource allocation for cloud computing services," The Journal of Supercomputing, vol. 54, no. 2, pp. 252–269, Nov 2010.
[32] M. Ficco, C. Esposito, F. Palmieri, and A. Castiglione, "A coral-reefs and game theory-based approach for optimizing elastic cloud resource allocation," Future Generation Computer Systems, vol. 78, pp. 343–352, 2018.
[33] S. Sotiriadis, N. Bessis, and R. Buyya, "Self managed virtual machine scheduling in cloud systems," Information Sciences, vol. 433-434, pp. 381–400, 2018.
[34] D. Gudu, M. Hardt, and A. Streit, "Combinatorial auction algorithm selection for cloud resource allocation using machine learning," in Euro-Par 2018: Parallel Processing, M. Aldinucci, L. Padovani, and M. Torquati, Eds. Cham: Springer International Publishing, 2018, pp. 378–391.
[35] S. A. Tafsiri and S. Yousefi, "Combinatorial double auction-based resource allocation mechanism in cloud computing market," Journal of Systems and Software, vol. 137, pp. 322–334, 2018.
[36] J. Zhang, N. Xie, K. Yue, W. Li, and D. Kumar, "Machine learning based resource allocation of cloud computing in auction," Computers, Materials and Continua, vol. 56, pp. 123–135, 01 2018.
[37] Z. Zhong, K. Chen, X. Zhai, and S. Zhou, "Virtual machine-based task scheduling algorithm in a cloud computing environment," Tsinghua Science and Technology, vol. 21, no. 6, pp. 660–667, Dec 2016.
[38] A. S. Ajeena Beegom and R. M S, "Integer-PSO: a discrete PSO algorithm for task scheduling in cloud computing systems," Evolutionary Intelligence, vol. 12, 02 2019.
[39] N. K. Gondhi and A. Gupta, "Survey on machine learning based scheduling in cloud computing," in Proceedings of the 2017 International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence, ser. ISMSI '17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 57–61.
[40] G. Cherubin, A. Baldwin, and J. Griffin, "Exchangeability martingales for selecting features in anomaly detection," in Conformal and Probabilistic Prediction and Applications, 06 2018.
[41] T. Lorido-Botran, J. Miguel-Alonso, and J. A. Lozano, "A review of auto-scaling techniques for elastic applications in cloud environments," Journal of Grid Computing, 2014.
[42] A. Botta, W. de Donato, V. Persico, and A. Pescapé, "On the integration of cloud computing and internet of things," in 2014 International Conference on Future Internet of Things and Cloud, 2014, pp. 23–30.
[43] C. Bishop and M. Tipping, "Bayesian regression and classification," in Advances in Learning Theory: Methods, Models and Applications, ser. NATO Science Series, III: Computer and Systems Sciences, J. Suykens, I. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, Eds. IOS Press, 2003, pp. 267–285.
[44] A. Criminisi, J. Shotton, and E. Konukoglu, "Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning," in Foundations and Trends in Computer Graphics and Vision, January 2012, vol. 7, no. 2-3, pp. 81–227.
[45] C. J. Burges, "From RankNet to LambdaRank to LambdaMART: An overview," Microsoft, Tech. Rep. MSR-TR-2010-82, June 2010.
[46] C. M. Bishop, "Neural networks: a pattern recognition perspective," Aston University, Birmingham, Technical Report, January 1996.

Patryk Osypanka, M.Sc., is a doctoral student in the Department of Computer Science at the AGH University of Science and Technology, Krakow, Poland. He works professionally at ASEC S.A. as a software development team leader, mainly using Microsoft technologies (.Net, Azure). His research focuses on cloud computing.

Piotr Nawrocki, Ph.D., is Associate Professor in the Department of Computer Science at the AGH University of Science and Technology, Krakow, Poland. His research interests include distributed systems, computer networks, mobile systems, mobile cloud computing, Internet of Things and service-oriented architectures. He has participated in several EU research projects including MECCANO, 6WINIT, UniversAAL and national projects including IT-SOA and ISMOP. He is a member of the Polish Information Processing Society (PTI).
