Resource Usage Cost Optimization in Cloud Computing Using Machine Learning
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCC.2020.3015769, IEEE Transactions on Cloud Computing
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. XX, XX XXX 1
Abstract—Cloud computing is gaining popularity among small and medium-sized enterprises. The cost of cloud resources plays a
significant role for these companies, which is why cloud resource optimization has become an important issue. Numerous methods
have been proposed to optimize cloud computing resources according to actual demand and to reduce the cost of cloud services. Such
approaches mostly focus on single-factor (e.g. compute power) optimization, which can yield unsatisfactory results for real-world cloud
learning and particle swarm optimization to achieve a cost-optimal cloud resource configuration. It is a complete solution which works in a
closed loop without the need for external supervision or initialization, builds knowledge about the usage patterns of the system being
optimized and filters out anomalous situations on the fly. Our solution can adapt to changes in both system load and the cloud provider’s
pricing plan. It was tested in Microsoft’s cloud environment Azure using data collected from a real-life system. Experiments demonstrate
that over a period of 10 months, a cost reduction of 85% was achieved.
Index Terms—cloud resource usage prediction, anomaly detection, machine learning, particle swarm optimization, resource cost
optimization.
1 INTRODUCTION
2168-7161 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: The University of Toronto. Downloaded on October 13,2021 at 07:44:23 UTC from IEEE Xplore. Restrictions apply.
contributions of this paper are briefly summarized as follows:

• using an anomaly detection filter to improve the quality of machine learning regression predictions;
• using an adapted PSO algorithm to solve the cloud resource reservation problem;
• using both vertical (quality) and horizontal (quantity) scaling at the same time to obtain optimal results;
• presenting experimental results with real-life data from the production system for different cloud usage models and verifying the effectiveness of the solution proposed along with actual cost reductions.

The rest of this paper is structured as follows: Section 2 contains a description of related work, Section 3 is concerned with defining in detail the cloud resource cost optimization process, Section 4 describes the implementation of the optimization solution, and Section 5 contains the conclusion and further work.

2 RELATED WORK

The literature describes various studies devoted to resource allocation optimization. For example, the authors of [2], [3] and [4] describe a solution which analyses incoming tasks and reserves virtual machine instances in a way that makes it possible to meet a deadline and is cost efficient. The solution assumes that the system is performing tasks with known CPU and memory demands. The authors of the review presented in [5] discuss different task scheduling methods which can be used in such cases. On the other hand, we optimize more generic systems which fulfil many functions and therefore cannot focus on task scheduling as we are unable to determine the relevant parameters. We must make sure that just enough cloud resources are available when needed.

In a similar manner, cloud resource management with the use of deep reinforcement learning algorithms was described by Zhang et al. [6]. The authors propose a deep Q-network as a variant of a reinforcement learning algorithm, which is initially pre-trained by a stacked autoencoder (SAQN). To address stability issues, they introduced experience replay, Q-network freeze and network normalization. The described solution assumes that the client makes requests with a resource demand which is known beforehand and tested using an artificial load generated by HiBench, a big data benchmark suite. Our approach is to optimize generic systems which generate requests that are variable in time and whose characteristics are unknown. The tests performed show that due to anomaly detection, our solution works without initial training and is able to operate properly not only with a simulated (artificial) load, but also with real-world, noisy data.

Hilman et al. [7] propose an online incremental learning approach to predict the run time of tasks, and the authors of [8] use machine learning (ML) for the same purpose. In addition, Yang et al. [9] propose ML along with heuristic algorithms to assign tasks to the optimal virtual machine. However, the important aspect of resource management and scaling is missing from those works, as opposed to our solution which considers available component configurations.

A different approach is presented in [10] where the authors propose a system which scales resources across private and public clouds. The system scaling engine is based on queuing theory and makes it possible to extend private cloud capabilities using public cloud resources. This solution uses threshold-based policies and time-series analysis. We use machine learning instead to predict demand, which allows us to bridge the delays caused by the process of provisioning new instances.

In [11], the authors use ML for resource usage prediction and in [12], the authors propose a host of different ML algorithms as a way of improving prediction; however, in both cases no further steps beyond prediction are presented. Also, none of these solutions perform anomaly detection which makes them prone to inaccurate predictions in case of temporary deviations.

The authors of [13] describe a system which develops virtual machine reservation plans based on CPU usage history. During evaluation, different ML algorithms are compared with OpenStack and Blazar. In addition, tests in the virtual environment (without cloud integration) present the system's performance over a year. Although the system uses ML, which makes it flexible, the authors focus on a single type of virtual machine only as contrasted with our system, which uses all VM types available from a given cloud provider to minimize overall cost. We also account for more resources (i.e. RAM) along with anomaly detection, which makes our solution more complete and accurate.

Other works present different methods of virtual machine usage optimization: a time-aware residual network [14], autonomic computing and reinforcement learning [15], deep learning [16], a combination of PPSO and NN [17], an NN with self-adaptive differential evolution algorithm [18] and standalone neural networks [19] [20]. The authors of [21] use Naïve Bayes, and in [22] and [23] the authors use learning automata. Kaur et al. [24] propose a set of various prediction methods working in parallel and the authors of [25] use a progressive QoS prediction model and a genetic algorithm. All those works along with surveys [26] [27] focus on virtual machines, mostly on CPU and RAM usage. We extend these approaches to other cloud component types (PaaS, SaaS) and make a step further by selecting real-life, provider-dependent sets of resources. Although the authors of [28] describe a general idea for a system which would cover IaaS, PaaS and SaaS, that study does not include any tests or broader analysis of the topic.

A lot of studies describe different ways of allocating resources optimally from a cloud provider's point of view. Dorian Minarolli and Bernd Freisleben [29] describe a system which optimizes virtual machine allocation using fuzzy control. Owing to the proposed multi-agent environment, their solution is able to operate on a considerable set of virtual machines. Similarly, Singh et al. [30] propose mobile agents which manage resource allocation in the cloud provider's physical infrastructure. The authors take into consideration not just the type of physical resources available, but also their location and network infrastructure which allows a cloud provider to reduce costs. In comparison to the above solutions, our approach is focused on cost optimization from the end-user perspective; although a reduction in server operation costs could possibly lead to a provider offering a discount, the solution proposed by us provides direct cost savings. The solutions described in the aforementioned
consists of a Prediction module, a Monitoring module and a Database to store predicted data (Fig. 2). We designed it (Fig. 3) to periodically (every week) gather historical usage data from the last month for each resource which needs to be tailored. This task is done by the Prediction module. In the next step, the solution filters out anomalies to improve prediction quality. Next, for each resource, it makes a prediction for the next 7 days and, using all these predictions combined, calculates a cost-optimal cloud resource configuration with hourly resolution and the desired maximum resource utilization level. The maximum utilization level depends on system type (i.e. will be lower for high-availability systems). Such a long prediction timeframe reduces prediction frequency and provides an allocation plan for the entire week for the administrator's inspection if required. Only available scaling options are considered; if a cloud provider adds new possibilities, these will be automatically included in calculations. The calculated cloud resource configuration is stored in the Database. In a separate hourly loop, using the Monitoring module, the system checks if cloud resources need to be scaled according to predictions.

[Figs. 2 and 3 (diagram): weekly loop reading from and writing to the Database: 2. filter out anomalies; 3. make usage prediction; 4. combine predictions and calculate optimal configuration; 5. store prediction data in the database. Hourly loop: 6. check predictions and scale cloud resources if needed.]

1: for all component of the monitoredComponents do
2:     for all resource of the component do
3:         data = GetHistoryData(resource);
4:         data = AnomalyFilter(data);
5:         data = MedianFilter(data);
6:         WriteToDB(data);
7:         predictionData = ReadFromDB(configuredWindow);
8:         predictionResult = PredictUsageML(predictionData);
9:         predictions.Add(predictionResult);
10:     end for
11: end for
12: pricing = GetPricingPlan();
13: configuration = CalculateConfiguration(prediction, pricing);
14: WriteToDB(configuration);

Fig. 4. Prediction module algorithm

The logic of the Prediction module is presented in the form of an algorithm (Fig. 4). For every component (component) in the set of monitored components (monitoredComponents) which we optimize, the module collects (GetHistoryData()) CPU, memory and storage usage data (usage level along with the time of day and the day of the week). These data are filtered and then stored in the database to be used for prediction later on. Filtering is done by the anomaly detection algorithm [40]: first using the exchangeability martingales function (AnomalyFilter()), and next, in order to smooth the data and improve prediction quality, using a median filter (MedianFilter()). Filtering prevents unnecessary prediction distortions when resource usage changes are temporary and random. If such a change exceeds the allocated resources, the system becomes less responsive and takes longer to process requests. Filtered data are stored in the database (WriteToDB()). In the next step, the module reads collected historical data from the database (ReadFromDB()) to predict usage for the next week. The historical data time window length affects prediction stability and adaptation rate and has to be configured according to optimized system properties (configuredWindow). The time window must be sufficiently long to observe usage patterns but sufficiently short to allow quick prediction adaptation. For every collected piece of usage data, the module develops usage predictions (PredictUsageML()) using machine learning interpolation and then stores them (predictions.Add()).

In the last stage of the algorithm, after all predictions have been done, the module obtains the current pricing plan from the cloud provider (GetPricingPlan()) and calculates the optimal resource configuration. The cloud provider defines possible scaling configurations for different cloud components; the same CPU, memory or disk storage resources can be provisioned with a different configuration and therefore at a different cost. This creates a matrix of possibilities. As the number of possible configurations is usually large, calculating all variants is not feasible and this is why the module chooses a cost-optimal configuration (CalculateConfiguration()) using a particle swarm optimization algorithm. Based on the solution described by A. S. Ajeena Beegom et al. [38], we defined our own version of the Integer-PSO algorithm which is suited to our needs. Given the predicted required level of resources L = [L1, . . . , Lm] (i.e. CPU core count, or RAM
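One pass of the Fig. 4 loop for a single resource can be sketched in a few lines. Everything below is a simplified stand-in, not the authors' implementation: the paper's AnomalyFilter() is based on exchangeability martingales [40] and PredictUsageML() is an Azure ML regression model, whereas this sketch clips outliers by median absolute deviation, smooths with a sliding median, predicts from per-hour-of-week means, and replaces the database calls with in-memory lists.

```python
import statistics

def clip_anomalies(data, factor=3.0):
    # Stand-in for the martingale-based AnomalyFilter(): replace points
    # further than `factor` median absolute deviations from the median.
    med = statistics.median(data)
    mad = statistics.median([abs(x - med) for x in data]) or 1.0
    return [x if abs(x - med) <= factor * mad else med for x in data]

def median_filter(data, window=3):
    # MedianFilter(): smooth with a sliding median of size `window`.
    half = window // 2
    return [statistics.median(data[max(0, i - half):i + half + 1])
            for i in range(len(data))]

def predict_usage(history, horizon=168):
    # Stand-in for PredictUsageML(): predict each of the next `horizon`
    # hours as the mean of past samples at the same hour of the week.
    by_slot = {}
    for t, usage in enumerate(history):
        by_slot.setdefault(t % 168, []).append(usage)
    return [statistics.mean(by_slot.get(t % 168, [0.0]))
            for t in range(horizon)]

# One weekly pass for a single resource (steps 3-5 and 7-9 of Fig. 4):
raw = [10.0, 11.0, 95.0, 10.5, 9.5] * 68       # ~2 weeks of hourly history with spikes
filtered = median_filter(clip_anomalies(raw))  # AnomalyFilter + MedianFilter
prediction = predict_usage(filtered)           # usage plan for the next 7x24 hours
assert len(prediction) == 168
```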
amount) and n different component configuration types (i.e. compute-optimized, memory-optimized, general-purpose) from the cloud provider [T1, . . . , Tn], our problem is to find a set of configurations Q which will meet the L constraint and will be cost-efficient at the same time. Q = [z1, . . . , zn] defines how many instances of every configuration type should be used. As an example, we can take virtual machines with CPU core count (L1) and RAM amount (L2) as the resources examined along with the predicted required level as L = [7, 16], which means 7 CPU cores and 16 GB of RAM. A sample cloud provider offers 3 different machine types:

• T1: 4 CPU cores, 1 GB of RAM, €12.00/month;
• T2: 2 CPU cores, 8 GB of RAM, €14.00/month;
• T3: 2 CPU cores, 2 GB of RAM, €10.00/month.

In this case Q, which meets the L constraint, can be defined as [1, 2, 0]. It means one virtual machine of type T1 and 2 machines of type T2. The maximum value k for zi (i ∈ (1, . . . , n)) which has to be taken into consideration while finding Q can be defined as the number of the least powerful configurations needed to meet the L level. Adding more resources will be more expensive and is not necessary, as L is definitely already met. Following the above example, k = 16 as 16 virtual machines of type T1 fulfills the L requirement in terms of RAM amount. Q is defined as:

Q = [z1, . . . , zn]    (1)

where ∀i ∈ (1, . . . , n): 0 ≤ zi ≤ k. The cost C of such a set is defined as:

C(Q, M) = Q · M = Σ_{i=1}^{n} zi · mi,  M = [m1, . . . , mn]^T    (2)

where mi is the price of the Ti configuration type. The resource level P provided by Q is defined as:

P = Q · [[s11, . . . , s1m], . . . , [sn1, . . . , snm]] = [P1, . . . , Pm]    (3)

where Pj = Σ_{i=1}^{n} zi · sij and sij is the j-th resource level provided by the Ti configuration type. In the example defined before, the cost is calculated as:

C = [1, 2, 0] · [€12, €14, €10]^T = €40.00    (4)

and the resource level as:

P = [1, 2, 0] · [[4, 1], [2, 8], [2, 2]] = [8, 17].    (5)

The cost definition D for the minimization algorithm is as follows:

D(C, P, L) = C if P ≥ L; ∞ otherwise    (6)

where P ≥ L is defined as:

P ≥ L ⇔ ∀i ∈ (1, . . . , m): Pi ≥ Li.    (7)

In the example, D = C = €40.00 as 8 ≥ 7 and 17 ≥ 16.

Q with the minimal cost can be found using the cost function D from equation (6) and the Integer-PSO algorithm. As the cloud providers' pricing policies are usually complex, it is impossible to define how many minimums exist in the cost function, which is discrete, as a fractional component cannot be provisioned. The final stage of the original algorithm described in [38] was altered, as we are looking for multiples of available machines [z1, . . . , zn] rather than a task assignment configuration.

To reduce frequent configuration changes, the newly calculated configuration Q′ is compared to the previous configuration. If the old Q still meets the P ≥ L constraint and if ∀i ∈ (1, . . . , m): di < F (where di = (Pi − Pi′)/Pi and F is a stability factor), Q′ is discarded and Q is used instead. F determines how probable it is that the algorithm will keep the previous configuration set. Continuing the example defined previously where Q = [1, 2, 0] and P = [8, 17], we can take as an example a new predicted required level L′ = [4, 15], a new set Q′ = [0, 2, 0] with P′ = [4, 16], and we can define the stability factor as F = 0.4. For the CPU count, d1 = (8 − 4)/8 = 0.5; for the RAM amount, d2 = (17 − 16)/17 ≈ 0.06. In this case, di < F is not met for the CPU count (i = 1) and the new value Q′ will be used. Each time the old configuration is used, F is decremented; when Q′ is used, F is reset to its initial value. The final results are stored in the database (WriteToDB()) and are later used by the Monitoring module.

In a separate loop, the Monitoring module runs every hour. It monitors if a given resource must be scaled according to the predicted configuration, and scales it if needed.

To estimate the quality of the predicted components' set, we use common prediction measurements: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Relative Absolute Error (RAE) and Root Relative Squared Error (RRSE). To compare the predicted configuration with real usage history, we defined the R metric, which is the mean of overusage errors. For the given predicted usage during hours t1 to tm, R is defined as:

R = (Σ_{t=1}^{m} Et) / m    (8)

where Et is the prediction error for the hour t, defined as:

Et = (ut − pt) · H(ut − pt)    (9)

where H is the discrete Heaviside step function:

H(n) = 0 if n < 0; 1 if n ≥ 0    (10)

pt is the calculated level for hour t and ut is the actual resource usage level for hour t.

In the end, we measure the average cost savings per hour, V. For a given resource and given predicted usage of this resource during hours t1 to tm, V is defined as:

V = (Σ_{t=1}^{m} (Gt − Ct)) / m    (11)

where Gt is the cost of the configuration without optimization during the hour t and Ct is the cost of the predicted configuration during the hour t. Both are expressed in the cloud provider's currency.
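The worked example (three machine types, L = [7, 16]) is small enough to check by exhaustive search. The sketch below replays equations (1)-(7) and the stability check in Python; the brute-force enumeration is a stand-in for Integer-PSO, which the paper uses precisely because enumerating all variants stops being feasible for realistic numbers of configuration types.

```python
from itertools import product

# Worked example from the text: (CPU cores, RAM GB) per type and monthly price.
S = [(4, 1), (2, 8), (2, 2)]   # s_ij: resources provided by T1, T2, T3
M = [12.0, 14.0, 10.0]         # m_i: price of each type, EUR/month
L = (7, 16)                    # predicted required level: 7 cores, 16 GB

def resource_level(Q):
    # Equation (3): P_j = sum_i z_i * s_ij
    return tuple(sum(z * s[j] for z, s in zip(Q, S)) for j in range(len(L)))

def cost(Q):
    # Equation (2): C = sum_i z_i * m_i
    return sum(z * m for z, m in zip(Q, M))

def D(Q):
    # Equation (6): infinite cost when the constraint P >= L is not met.
    P = resource_level(Q)
    return cost(Q) if all(p >= l for p, l in zip(P, L)) else float("inf")

k = 16  # 16 machines of type T1 already satisfy L in terms of RAM
best = min(product(range(k + 1), repeat=len(S)), key=D)
# best is (1, 2, 0): one T1 and two T2, C = EUR 40.00, P = (8, 17)

def keep_old(Q_old, Q_new, L_new, F):
    # Stability check: keep the old set only if it still covers L_new and
    # every relative drop d_i = (P_i - P'_i) / P_i stays below F.
    P, P_new = resource_level(Q_old), resource_level(Q_new)
    if not all(p >= l for p, l in zip(P, L_new)):
        return False
    return all((p - pn) / p < F for p, pn in zip(P, P_new))

# Continuing the example: d_1 = (8 - 4) / 8 = 0.5 >= F = 0.4, so Q' is adopted.
assert keep_old((1, 2, 0), (0, 2, 0), (4, 15), F=0.4) is False
```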
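The evaluation metrics R (equations (8)-(10)) and V (equation (11)) translate directly into code. The hourly series below are made up purely to exercise the formulas; the paper's evaluation runs them over 10 months of hourly TMS data.

```python
def heaviside(n):
    # Equation (10): discrete Heaviside step function.
    return 0 if n < 0 else 1

def overusage_error(u_t, p_t):
    # Equation (9): error counts only when actual usage u_t exceeds the
    # calculated level p_t, i.e. when the configuration under-provisions.
    return (u_t - p_t) * heaviside(u_t - p_t)

def R(actual, calculated):
    # Equation (8): mean overusage error over the evaluated hours.
    return sum(overusage_error(u, p)
               for u, p in zip(actual, calculated)) / len(actual)

def V(unoptimized_costs, optimized_costs):
    # Equation (11): average cost savings per hour.
    return sum(g - c for g, c in
               zip(unoptimized_costs, optimized_costs)) / len(unoptimized_costs)

# Hypothetical 3-hour window: overusage occurs in hours 1 and 3 only.
assert R([5.0, 3.0, 8.0], [4.0, 6.0, 6.0]) == 1.0   # (1 + 0 + 2) / 3
assert V([10.0, 10.0], [6.0, 8.0]) == 3.0           # (4 + 2) / 2
```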
parameters”7 self-tune mode, which means that prediction algorithm parameters are picked automatically. Prediction was performed for every type of machine learning, thus we were able to compare the results. As initial values for self-tuning, the following default configurations were used:

1) BL
• Regularization weight = 1
• Tune Model Hyperparameters maximum number of runs = 15

2) DF
• Re-sampling method = Bagging
• Number of decision trees = 8
• Maximum depth of the decision trees = 32
• Number of random splits per node = 128
• Minimum number of samples per leaf node = 1
• Tune Model Hyperparameters maximum number of runs = 5

3) BDT
• Maximum number of leaves per tree = 20
• Minimum number of samples per leaf node = 10
• Learning rate = 0.2
• Total number of trees constructed = 100
• Tune Model Hyperparameters maximum number of runs = 5

7. Tune Model Hyperparameters – https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/tune-model-hyperparameters
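For intuition, the "Tune Model Hyperparameters maximum number of runs" setting bounds how many candidate configurations the sweep may evaluate. A minimal random-search stand-in (hypothetical grid and scoring function; the internals of the Azure ML Studio module differ) could look like:

```python
import random

def tune(param_grid, score, max_runs=5, seed=0):
    # Stand-in for the Tune Model Hyperparameters sweep: sample up to
    # `max_runs` random configurations and keep the best-scoring one.
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(max_runs):
        candidate = {name: rng.choice(values)
                     for name, values in param_grid.items()}
        s = score(candidate)
        if s > best_score:
            best_params, best_score = candidate, s
    return best_params

# Hypothetical grid around the DF defaults listed above.
grid = {"num_trees": [4, 8, 16], "max_depth": [16, 32, 64]}
chosen = tune(grid, score=lambda p: -p["num_trees"], max_runs=5)
```

In Azure ML Studio the score would be a cross-validated regression metric rather than the toy lambda used here.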
Fig. 7. Comparison of predictions made with different ML algorithms for different resources
4) NN
• Hidden layers = 1, fully connected
• Number of hidden nodes = 100
• Learning rate = 0.02
• Number of iterations = 80
• The initial learning weights diameter = 0.1
• The momentum = 0
• The type of normalizer = Do not normalize

Integer-PSO was used with 300 particles in 500 epochs. As in [38], we set the inertia weight to 0.6 and the acceleration coefficients to 0.2. The maximum velocity was set to 0.1 · n, where n is the number of available configuration options, and the minimum velocity was set accordingly with the minus sign. Accuracy was set to 3 digits.

For Payment service (PaaS) optimization, we selected ACU and RAM utilization levels as optimization factors. For the Management web page (IaaS), ACU and IOPS were selected, and for the Database (SaaS), DTU and disk space were selected. Because we are using the Database equally for reading and writing, it is hard to scale it out (multiply its instances), so in this case we set k = 1 in equation (1) to limit Integer-PSO to one instance only. We performed our simulation using 10 months of data from TMS, and we compared the results with the production configuration. We also compared the results obtained from different ML algorithms. In total, we made 24 predictions (6 optimization factors multiplied by 4 algorithms). Each prediction consisted of more than 6,500 points (10 months with hourly resolution). For purposes of clarity, we chose one optimization factor for every component and presented them separately for selected periods (Fig. 7). In fact, as defined in equations (3), (2) and (7), all resources included in the component in question are calculated together. For every component, the chart presents a workload characteristic (Actual usage) and all prediction algorithm results. We selected periods so that they consisted of anomalies (Figs. 7a, 7b, 7c), visible patterns (Fig. 7b) and […] the actual usage level along with the usage level after anomaly detection (with the anomalies removed). Prediction is based on the data after anomaly detection and thus it is not distorted by temporary usage spikes (Fig. 9).

[Fig. 8 (chart): DTU over time; legend: Actual usage, After anomaly detection; x-axis: Date, 2019-05-13 to 2019-05-27.]
Fig. 8. Comparison of the actual and anomaly filtered DTU usage level (SaaS, May 2019)

Although we are using the Integer-PSO algorithm to find the optimum component configuration, due to cloud resource granulation (the cloud provider only offers pre-configured component variants, e.g. a VM with 210 ACUs and 4,000 IOPS), the values predicted are not used exactly in the calculated configuration. In the chart (Fig. 9) we present the actual resource usage, predicted usage and the calculated configuration based on DF prediction. Despite this granulation, we still observe a significant reduction in resource costs (Fig. 10). In the TMS system, the cost of SaaS in May 2019 equals €392 and the optimization achieved by our system reduces the cost to €23. For the entire period tested, PaaS costs were reduced by 88%, which results in savings of €4,268.

[Fig. 9 (chart): DTU over time; legend: Actual usage, Decision Forest, Calculated configuration; x-axis: Date, 2019-05-13 to 2019-05-27.]
Fig. 9. Comparison of the actual DTU usage level, its DF prediction and the calculated configuration (SaaS, May 2019)

Our tests demonstrate that in the case of anomalous behavior (sudden high resource usage), the calculated configuration does not cover 100% of resource demand and the cloud provider resorts to throttling. This slows down the processing of incoming requests or, in cases of prolonged high-level usage, results in a timeout response to
[…]pared to the original system. During our test period (from 8th […]

[Fig. 10 (chart): Database cost per hour, Euro.]
TABLE 1
Savings and Quality Metrics for the Best Algorithms (May 2019)
online transactions, e-commerce solutions as well as web information portals and social networks.

Time-compressed tests demonstrate that the efficiency of our solution improves over time. This is why, if historical data are available, the solution can be trained in advance to boost efficiency from the start. This topic may be the subject of our further studies. Currently, we are monitoring and storing over 100 parameters of the production system (TMS). In the future, we would like to incorporate quality of user experience criteria in our resource prediction process, which may result in better resource usage optimization and quicker system response times.

ACKNOWLEDGMENTS

The research presented in this paper was supported by funds from the Polish Ministry of Science and Higher Education assigned to the AGH University of Science and Technology.

REFERENCES

[1] A. S. Andrae and T. Edler, "On global electricity usage of communication technology: Trends to 2030," Challenges, vol. 6, no. 1, pp. 117–157, 2015.
[2] M. Mao and M. Humphrey, "Auto-scaling to minimize cost and meet application deadlines in cloud workflows," Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011.
[3] J. Yang, W. Xiao, C. Jiang, M. S. Hossain, G. Muhammad, and S. U. Amin, "AI-powered green cloud and data center," IEEE Access, vol. 7, pp. 4195–4203, 2019.
[4] S. Abrishami, "Deadline-constrained workflow scheduling algorithms for infrastructure as a service clouds," Future Generation Computer Systems, 2013.
[5] S. Memeti, S. Pllana, A. Binotto, J. Kolodziej, and I. Brandic, "A review of machine learning and meta-heuristic methods for scheduling parallel computing systems," in Proceedings of the International Conference on Learning and Optimization Algorithms: Theory and Applications, ser. LOPAL '18. New York, NY, USA: ACM, 2018, pp. 5:1–5:6.
[6] Y. Zhang, J. Yao, and H. Guan, "Intelligent cloud resource management with deep reinforcement learning," IEEE Cloud Computing, vol. 4, no. 6, pp. 60–69, 2017.
[7] M. H. Hilman, M. A. Rodriguez, and R. Buyya, "Task runtime prediction in scientific workflows using an online incremental learning approach," in 2018 IEEE/ACM 11th International Conference on Utility and Cloud Computing (UCC), 12 2018, pp. 93–102.
[8] Y. Yu, V. Jindal, F. Bastani, F. Li, and I. Yen, "Improving the smartness of cloud management via machine learning based workload prediction," in 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), vol. 02, 7 2018, pp. 38–44.
[9] R. Yang, X. Ouyang, Y. Chen, P. Townend, and J. Xu, "Intelligent resource scheduling at scale: A machine learning perspective," in 2018 IEEE Symposium on Service-Oriented System Engineering (SOSE), 3 2018, pp. 132–141.
[10] C.-C. Crecana and F. Pop, "Monitoring-based auto-scalability across hybrid clouds," Proceedings of the 33rd Annual ACM Symposium on Applied Computing, pp. 1087–1094, 2018.
[11] T. Mehmood, S. Latif, and S. Malik, "Prediction of cloud computing resource utilization," in 2018 15th International Conference on Smart Cities: Improving Quality of Life Using ICT IoT (HONET-ICT), 10 2018, pp. 38–42.
[12] I. K. Kim, W. Wang, Y. Qi, and M. Humphrey, "CloudInsight: Utilizing a council of experts to predict future cloud application workloads," in 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), 7 2018, pp. 41–48.
[13] B. Sniezynski, P. Nawrocki, M. Wilk, M. Jarzab, and K. Zielinski, "VM reservation plan adaptation using machine learning in cloud computing," Journal of Grid Computing, Jul 2019.
[14] S. Chen, Y. Shen, and Y. Zhu, "Modeling conceptual characteristics of virtual machines for CPU utilization prediction," in Conceptual Modeling, J. C. Trujillo, K. C. Davis, X. Du, Z. Li, T. W. Ling, G. Li, and M. L. Lee, Eds. Cham: Springer International Publishing, 2018, pp. 319–333.
[15] M. Ghobaei-Arani, S. Jabbehdari, and M. A. Pourmina, "An autonomic resource provisioning approach for service-based cloud applications: A hybrid approach," Future Generation Computer Systems, vol. 78, pp. 191–210, 2018.
[16] Q. Zhang, L. T. Yang, Z. Yan, Z. Chen, and P. Li, "An efficient deep learning model to predict cloud workload for industry informatics," IEEE Transactions on Industrial Informatics, vol. 14, no. 7, pp. 3170–3178, 7 2018.
[17] A. Abdelaziz, M. Elhoseny, A. S. Salama, and A. Riad, "A machine learning model for improving healthcare services on cloud computing environment," Measurement, vol. 119, pp. 117–128, 2018.
2168-7161 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: The University of Toronto. Downloaded on October 13,2021 at 07:44:23 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCC.2020.3015769, IEEE
Transactions on Cloud Computing
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. XX, XX XXX 11
[18] J. Kumar and A. K. Singh, “Workload prediction in cloud using [38] A. S. Ajeena Beegom and R. M S, “Integer-pso: a discrete pso
artificial neural network and adaptive differential evolution,” algorithm for task scheduling in cloud computing systems,” Evolu-
Future Generation Computer Systems, vol. 81, pp. 41 – 52, 2018. tionary Intelligence, vol. 12, 02 2019.
[19] J. N. Witanto, H. Lim, and M. Atiquzzaman, “Adaptive selection [39] N. K. Gondhi and A. Gupta, “Survey on machine learning based
of dynamic vm consolidation algorithm using neural network for scheduling in cloud computing,” in Proceedings of the 2017 Inter-
cloud resource management,” Future Generation Computer Systems, national Conference on Intelligent Systems, Metaheuristics & Swarm
vol. 87, pp. 35 – 42, 2018. Intelligence, ser. ISMSI ’17. New York, NY, USA: Association for
[20] K. Mason, M. Duggan, E. Barrett, J. Duggan, and E. Howley, Computing Machinery, 2017, p. 57–61.
“Predicting host cpu utilization in the cloud using evolutionary [40] G. Cherubin, A. Baldwin, and J. Griffin, “Exchangeability mar-
neural networks,” Future Generation Computer Systems, vol. 86, pp. tingales for selecting features in anomaly detection,” Conference:
162 – 173, 2018. Conformal and Probabilistic Prediction and Applications, 06 2018.
[21] A. M. Al-Faifi, B. Song, M. M. Hassan, A. Alamri, and A. Gumaei, [41] T. Lorido-Botran, J. Miguel-Alonso, and J. A. Lozano, “A review of
“Performance prediction model for cloud service selection from auto-scaling techniques for elastic applications in cloud environ-
smart data,” Future Generation Computer Systems, vol. 85, pp. 97 – ments,” Journal of Grid Computing, 2014.
106, 2018. [42] A. Botta, W. de Donato, V. Persico, and A. Pescapé, “On the
[22] A. A. Rahmanian, M. Ghobaei-Arani, and S. Tofighy, “A learning integration of cloud computing and internet of things,” in 2014
automata-based ensemble resource usage prediction algorithm International Conference on Future Internet of Things and Cloud, 2014,
for cloud computing environment,” Future Generation Computer pp. 23–30.
Systems, vol. 79, pp. 54 – 71, 2018. [43] C. Bishop and M. Tipping, “Bayesian regression and classification,”
[23] M. Ranjbari and J. A. Torkestani, “A learning automata-based in Advances in Learning Theory: Methods, Models and Applications,
algorithm for energy and sla efficient consolidation of virtual ser. NATO Science Series, III: Computer and Systems Sciences,
machines in cloud data centers,” Journal of Parallel and Distributed J. Suykens, I. Horvath, S. Basu, C. Micchelli, and J.Vandewalle, Eds.
Computing, vol. 113, pp. 55 – 62, 2018. IOS Press, 2003, pp. 267–285.
[24] G. Kaur, A. Bala, and I. Chana, “An intelligent regressive ensemble [44] A. Criminisi, J. Shotton, and E. Konukoglu, “Decision forests: A
approach for predicting resource usage in cloud computing,” unified framework for classification, regression, density estimation,
Journal of Parallel and Distributed Computing, vol. 123, pp. 1 – 12, manifold learning and semi-supervised learning,” in Foundations
2019. and Trends in Computer Graphics and Vision, January 2012, vol. 7, no.
[25] X. Chen, J. Lin, B. Lin, T. Xiang, Y. Zhang, and G. Huang, “Self- 2-3, pp. 81–227.
learning and self-adaptive resource allocation for cloud-based soft- [45] C. J. Burges, “From ranknet to lambdarank to lambdamart: An
ware services,” Concurrency and Computation: Practice and Experience, overview,” Microsoft, Tech. Rep. MSR-TR-2010-82, June 2010.
vol. 0, no. 0, p. e4463, 2018, e4463 CPE-17-0360. [46] C. M. Bishop, “Neural networks: a pattern recognition perspective,”
[26] C. Qu, R. N. Calheiros, and R. Buyya, “Auto-scaling web appli- Aston University, Birmingham, Technical Report, January 1996.
cations in clouds: A taxonomy and survey,” ACM Comput. Surv.,
vol. 51, no. 4, pp. 73:1–73:33, Jul. 2018.
[27] Y. Al-Dhuraibi, F. Paraiso, N. Djarallah, and P. Merle, “Elasticity in
cloud computing: State of the art and research challenges,” IEEE
Transactions on Services Computing, vol. 11, no. 2, pp. 430–447, 3
2018.
[28] H. M. Makrani, H. Sayadi, D. Motwani, H. Wang, S. Rafatirad,
and H. Homayoun, “Energy-aware and machine learning-based
resource provisioning of in-memory analytics on cloud,” in Pro-
Patryk Osypanka, M.Sc., is a doctoral student in the Department of Computer Science at the AGH University of Science and Technology, Krakow, Poland. He works professionally at ASEC S.A. as a software development team leader, mainly using Microsoft technologies (.NET, Azure). His research focuses on cloud computing.
Piotr Nawrocki, Ph.D., is an Associate Professor in the Department of Computer Science at the AGH University of Science and Technology, Krakow, Poland. His research interests include distributed systems, computer networks, mobile systems, mobile cloud computing, the Internet of Things and service-oriented architectures. He has participated in several EU research projects, including MECCANO, 6WINIT and UniversAAL, and national projects including IT-SOA and ISMOP. He is a member of the Polish Information Processing Society (PTI).