
Empowering Automatic Data-Center Management with Machine Learning

Josep Ll. Berral (Universitat Politècnica de Catalunya, Barcelona, Spain)
Ricard Gavaldà (Universitat Politècnica de Catalunya, Barcelona, Spain)
Jordi Torres (Univ. Politècnica de Catalunya / Barcelona Supercomputing Center, Barcelona, Spain)

ABSTRACT

The Cloud as a computing paradigm has nowadays become crucial for most Internet business models. Managing and optimizing its performance on a moment-by-moment basis is not easy, given the amount and diversity of elements involved (hardware, applications, workloads, customer needs...). Here we show how a combination of scheduling algorithms and data mining techniques helps improve the performance and profitability of a data-center running virtualized web-services. We model the data-center's main resources (CPU, memory, IO), quality of service (viewed as response time), and workloads (incoming streams of requests) from past executions. We show how these models help scheduling algorithms make better decisions about job and resource allocation, aiming for a balance between throughput, quality of service, and power consumption.

Categories and Subject Descriptors

C.0 [Computer Systems Organization]: Modeling of computer architecture; I.2.6 [Artificial Intelligence]: Learning—Induction; K.6.2 [Management of Computing and Information Systems]: Installation Management—Performance and usage measurement, Pricing and resource allocation

Keywords

Cloud Computing, Machine Learning, Modeling, Web-Services

1. INTRODUCTION

Cloud Computing has become a crucial model for the externalization of information and IT resources for people and organizations, thanks to its "everything as a service" (platform, infrastructure, and services) capabilities. We distinguish three main actors: the cloud service provider (owner of IT resources), the cloud customer (who wants to run services on the cloud), and the final client (who uses the services). The goal of the provider is to give customers enough resources to fulfill their services' Quality of Service (QoS), while reducing the amount of resources used in order to save power.

In order to match services and resources, managers may use low-level measurements (resource, power, and operating system monitors) and high-level data (user behavior and service performance). The scenario is modeled as a set of data-center resources and a set of web services enclosed in virtual machines (VMs), each resource with a maximum usage quota and energy requirements, and each service with resource requirements (load per time unit), performance requirements, and an execution reward.

As many of the parameters involved in this optimization problem are unknown a priori and vary over time, explicit modeling is very difficult. We use data mining and machine learning methods, a more viable option, to create models from past experience for each element in the system (an application type, a workload, a physical machine (PM), a high-level service requirement). Here we present a methodology for using machine learning techniques (ML) to model the main resources of a web-service based data-center from low-level information, and to learn high-level information predictors that drive decision-making algorithms for virtualized service schedulers, without much expert knowledge or real-time supervision.¹

2. MANAGING DATA-CENTERS

In commercial data-centers the customers can run their services without knowing details of the infrastructure, paying the provider on a usage basis to ensure a Service Level Agreement (SLA) detailing, among other things, the QoS. The provider enables a VM for the customer to deploy their web-services, and adjusts the resources granted to that VM. Customers base their business on the clients using the service, so a given QoS for each service must be satisfied (e.g., response time, RT). The provider's goal is to use as few resources as possible for the VMs while granting each VM enough to satisfy the QoS agreed in the SLA. Figure 1 shows the business infrastructure.

Figure 1: Data-center business infrastructure

SAC'13, March 18–22, 2013, Coimbra, Portugal. Copyright 2013 ACM 978-1-4503-1656-9/13/03 ...$10.00.

¹ An extended version of this work is available at http://www.lsi.upc.edu/dept/techreps/llistat_detallat.php?id=1131
Our decision maker relies on a middleware such as OpenNebula [6] for monitoring (collecting high- and low-level data) and for acting (managing tasks, workloads, and VM and PM resources). Monitors read the load and resources of PMs and VMs, obtaining the following set of attributes per time unit: timestamps; number of requests; average response times; average requested bytes; and resource usage and bandwidth. From this information we can make decisions and perform the following actions: migrate VMs among PMs and adjust the resources granted to each VM.
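As a minimal illustration (not part of the original system), the per-time-unit monitoring record described above could be held in a structure like the following; the field names are our own, not an OpenNebula schema:

```python
from dataclasses import dataclass

@dataclass
class MonitorSample:
    """One monitoring sample for a VM or PM during one time unit (hypothetical layout)."""
    timestamp: float           # start of the sampling interval (epoch seconds)
    requests: int              # number of requests served in the interval
    avg_response_time: float   # average response time (seconds)
    avg_request_bytes: float   # average bytes exchanged per request
    cpu_usage: float           # CPU used in the interval (e.g. % of allocation)
    mem_usage_mb: float        # memory in use (MB)
    bandwidth_mbps: float      # observed network/disk bandwidth
```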
When making decisions in this context, often the required information 1) is not available, 2) is available but highly uncertain, or 3) cannot be read because of privacy issues. In order to overcome this lack of information and this uncertainty, we employ Machine Learning (ML) methods, basing our work on the following ML hypothesis: for each situation, there may be a model obtained by careful expert modeling and tuning that is better than any ML-learned model; but, for each situation, ML can obtain semi-automatically a model which is as good as or better than a generic model built without intensive expert knowledge or intensive tuning work.

The advantage of ML over explicit expert modeling appears when systems are complex enough that no human expert can explore all relevant possibilities, when no experts exist, or when the system changes over time so that models must be constantly rebuilt.

3. METHODOLOGY AND LEARNING

First of all we model the VM and PM behaviors (CPU, Memory and IO) from the amount of load received, to be predicted on-line, complementing the decision-making algorithm (here the PM×VM scheduler) with extra information. The input data is the load information (e.g., the estimated requests per time unit, the average computing time per request, and the average number of bytes exchanged per request). For the expected CPU and IO usage we selected the M5P algorithm [4], a decision tree holding linear regressions on its leaves, as CPU and IO usage may fall into significantly different load regimes while being reasonably linear within each. For Memory, since the web-services are memory greedy (they constantly acquire memory and only occasionally flush it), a linear regression is enough, using the load information and the memory usage from t − 1.
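A rough sketch of these resource models follows. M5P is a WEKA algorithm; here a scikit-learn regression tree stands in for it (a simplification, since M5P fits linear models in its leaves), the memory model is a linear regression with the t−1 memory value as an extra feature, and all numbers are made-up toy values:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# Load features per time unit: [requests/s, avg computing time per request (s), avg bytes per request]
X_load = np.array([[120, 0.05, 3500], [300, 0.07, 4200], [80, 0.04, 2900], [250, 0.06, 4000]])
cpu_pct = np.array([22.0, 61.0, 14.0, 50.0])       # observed CPU usage (%) for those loads

# CPU (and, analogously, IO) model: a tree capturing distinct load regimes
cpu_model = DecisionTreeRegressor(min_samples_leaf=2).fit(X_load, cpu_pct)

# Memory model: linear regression on the load plus the memory usage at t-1
mem_prev = np.array([[150.0], [180.0], [175.0], [182.0]])   # MB at t-1
mem_now = np.array([160.0, 195.0, 178.0, 190.0])            # MB at t
mem_model = LinearRegression().fit(np.hstack([X_load, mem_prev]), mem_now)

# Predict the expected resources for a new load without reading inside the VM
print(cpu_model.predict([[200, 0.06, 3800]]))
print(mem_model.predict([[200, 0.06, 3800, 170.0]]))
```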
Secondly we predict the QoS variables (the RT in this case study). Always giving each VM the maximum resources would not consolidate resources as much as possible, and giving each VM less than the minimum required for its load would degrade the RT. A common "response time to QoS" function in SLAs is to set a threshold α and a desired response time RT0, and let the SLA fulfillment level degrade linearly from 1 to 0 between RT0 and α · RT0. Our decision-making method (the allocator) predicts the degree of SLA fulfillment of a VM from its load parameters and its context, i.e., the features of the PM where it is currently or tentatively placed, the load parameters of the VMs in the same PM, and the amount of physical resources currently allocated and demanded by each VM. Here we use again the M5P method, since simple linear regressions were incapable of representing the relations between resources and RT.
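For concreteness, the SLA fulfillment level just described can be written as a small function (a sketch; the piecewise-linear shape is the one stated above, the names are ours):

```python
def sla_fulfillment(rt: float, rt0: float, alpha: float = 2.0) -> float:
    """SLA level: 1 while RT <= RT0, 0 once RT >= alpha * RT0,
    degrading linearly in between."""
    if rt <= rt0:
        return 1.0
    if rt >= alpha * rt0:
        return 0.0
    return 1.0 - (rt - rt0) / ((alpha - 1.0) * rt0)

# Example: with RT0 = 0.2 s and alpha = 2, an observed RT of 0.3 s gives a level of 0.5
print(sla_fulfillment(0.3, rt0=0.2))
```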
By learning the function f(load) → E[CPU, MEM, IO], readings from inside the VM can be replaced, and the effective resources required by a VM can be estimated depending only on its received load, without interference from stress on the VM or occupation of the PM or network. And by learning a function that predicts the RT resulting from placing a VM in a PM with a given occupation, f(status, resources) → E[RT], the scheduler can consolidate VMs without risking the RT in excess, and grant resources playing safe. Figure 2 shows our decision-making schema.

Figure 2: Information flow schema using models
Finally, following the schema from MUSE [2], the data-center benefit optimization problem can be formulated as a Mixed Integer Program (MIP), maximizing the sum of the income from customers per executed VM, minus penalties for SLA degradation, minus the power costs of the machines in use (turned on). Due to the exponential cost of MIP solving in the number of variables and constraints, solving it exactly becomes unfeasible for realistic settings. Here we use the generic Ordered First-Fit and Best-Fit algorithms for bin packing problems, and also the BackFilling and λ-Round Robin algorithms [3], specialized for load-balancing via consolidation.
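The objective being maximized can be sketched as a plain evaluation function over a candidate schedule (a simplified illustration of the terms named above, not the exact MIP; all names are ours):

```python
def schedule_benefit(schedule, income, sla_level, penalty, power_cost_per_pm):
    """Benefit of one schedule for one time unit:
    revenue per executed VM - SLA degradation penalties - power cost of the PMs in use.

    schedule: dict mapping PM -> list of VMs placed on it
    income[vm], penalty[vm]: revenue for running vm, and cost of fully violating its SLA
    sla_level[vm]: predicted SLA fulfillment in [0, 1] under this schedule
    """
    placed = [vm for vms in schedule.values() for vm in vms]
    revenue = sum(income[vm] for vm in placed)
    sla_penalty = sum(penalty[vm] * (1.0 - sla_level[vm]) for vm in placed)
    power = power_cost_per_pm * sum(1 for vms in schedule.values() if vms)
    return revenue - sla_penalty - power
```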
All such algorithms use an oracle to evaluate how well a VM "will fit" into a PM which has already been assigned some VMs. We substitute the conventional fitting functions (e.g., cpu_pm_h + cpu_vm ≤ MaxCPU_h) by the learned functions mapping task descriptions and assigned resources to response times (i.e., checking whether E[RT_vm] ≥ α · RT_0,vm, or finding the best profit according to E[RT]).
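An Ordered First-Fit pass using the learned predictor as that oracle might look as follows (a sketch; predict_rt stands for the learned f(status, resources) → E[RT] model and is an assumption of ours, not code from the paper):

```python
def first_fit_ml(vms, pms, predict_rt, rt0, alpha=2.0):
    """Place each VM on the first PM where the predicted RT of every VM on that PM
    (including the candidate) stays below alpha * RT0; the learned oracle replaces
    the static check cpu_pm + cpu_vm <= MaxCPU."""
    placement = {pm: [] for pm in pms}
    unplaced = []
    for vm in vms:
        for pm in pms:
            tentative = placement[pm] + [vm]
            if all(predict_rt(pm, tentative, v) < alpha * rt0[v] for v in tentative):
                placement[pm].append(vm)
                break
        else:
            unplaced.append(vm)   # no PM predicted to keep RT within the SLA threshold
    return placement, unplaced
```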
4. EXPERIMENTS

The details of the learning process are shown in Table 1. An important detail after the learning process is that each model shows the relevance of each attribute for each resource, so operators and architects can also learn from their system (e.g., CPU depends basically on the amount of requests, IO on the average bytes per request, and Memory depends on its previous state).

We have performed different tests to demonstrate how ML can match or improve approximate and ad-hoc algorithms using explicit knowledge, and to validate the models on real machines. The experiments have been performed using real workloads [1] and environments (Intel Xeon 4-core + Oracle VirtualBox + XAMPP software) for the model learning process, an analytic simulator (an R version of EEFSIM [5]) to compare the different ML-augmented algorithms, and real hosting machines for the model validation. For pricing, we fixed the price to 0.17 euro/hour (current EC2 pricing in Europe) and the power cost to 0.09 euro/kWh (representative of prices with most cloud-providing companies). The services in the workload have RT0 values in [0.12, 0.4] s, as experiments on our data-center showed these to be reasonable response values obtained by the web service without stress or interference. The initial α parameter is set to 2. Table 2 shows the results for the different algorithms running 20 VMs within 20 PMs for a 24-hour workload.
| Element     | ML Method    | Training   | Validation | MRE     | MAE      | StDev  | Data range               |
| Predict CPU | M5P (M = 50) | 3968 inst  | 7528 inst  | 0.164   | 2.530%   | 4.511  | [2.37, 100.0]% CPU       |
| Predict MEM | Linear Reg.  | 107 inst   | 243 inst   | 0.0127  | 4.396 MB | 8.340  | [124.2, 488.4] MB        |
| Predict IN  | M5P (M = 30) | 1623 inst  | 2423 inst  | 0.193   | 926 Pkts | 1726   | [56, 31190] #Pkts        |
| Predict OUT | M5P (M = 30) | 1623 inst  | 2423 inst  | 0.184   | 893 Pkts | 1807   | [25, 41410] #Pkts        |
| Predict RT  | M5P (M = 4)  | 38040 inst | 15216 inst | 0.00878 | 9.9 ms   | 0.0354 | [0, 2.78] s (RT̄ = 17 ms) |

Table 1: Learning details per element. All training processes are done using a random split of instances (66/34)
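The MRE and MAE columns are standard error measures over the validation split; on a 66/34 random split they could be computed along these lines (a sketch with our own helper, not the original WEKA evaluation code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_and_score(model, X, y, seed=0):
    """Train on a random 66% of the instances; return MAE and MRE on the remaining 34%."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.34, random_state=seed)
    pred = model.fit(X_tr, y_tr).predict(X_te)
    mae = np.mean(np.abs(pred - y_te))
    mre = np.mean(np.abs(pred - y_te) / np.maximum(np.abs(y_te), 1e-9))
    return mae, mre
```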

| Algorithm    | Euro   | Watt·h/h | Avg. QoS | Migrs | Avg. PMs/h |
| λRoundRobin  | 33.94  | 2114     | 0.6671   | 33    | 9.416      |
| BackFilling  | 31.32  | 1032     | 0.6631   | 369   | 6.541      |
| FirstFit     | 28.77  | 1874     | 0.5966   | 139   | 6.542      |
| FirstFit+ML  | 29.99  | 1414     | 0.6032   | 153   | 5.000      |
| BestFit      | 29.85  | 778      | 0.5695   | 119   | 2.625      |
| BestFit+ML   | 31.771 | 1442     | 0.6510   | 218   | 4.625      |

Table 2: Algorithms vs. relevant business values
From the results we observe that the versions using the learned models perform similarly to or better than the versions including expert knowledge, and they come relatively close to the ad-hoc expert algorithms, BackFilling and λ-Round Robin, run with the optimal configurations for this kind of data-center calculated in [3]. While the ML versions of the approximate algorithms are better than their expert-knowledge versions, the BestFit+ML approach is close to the ad-hoc expert algorithms in QoS and benefit.

After the initial experiments on the simulator, we moved on to validate and test the method in a real environment. The set-up consists of a small workbench composed of 5 Intel Xeon machines: 3 as data-center nodes, 1 as gateway, and 1 as an attacking machine reproducing client requests scaled by 100-300 times to produce heavy load, in a different data-center than the one used for training.
Using the same machine architecture as the one used for modeling, we could import the learned models for CPU, Memory, and IO. But as the network environment was different this time, the RT model had to be learned again. We observed that M5P, in this case, seemed to perform significantly worse than before. We trained a nearest-neighbor model, which recovered the previous performance. Let us recall that the contribution we want to emphasize is not the particular models but the methodology: this episode suggests that, methodologically, it is probably not a good idea to fix on any particular model kind, and that upon a new environment or system changes, several model kinds should always be tested. Figure 3 shows the results comparing Best-Fit versus its ML-augmented version.

We can see that plain best-fit considers that all VMs will fit, in CPU and Memory (virtualized and physical), in one machine, which degrades RT. The ML approach, instead, is able to detect from low-level measures the situations where the RT would not be achieved (because of CPU competition, but also because of memory exhaustion and network/disk competition), hence migrating enough VMs to other machines where, for example, the network interfaces are not so loaded.

Figure 3: BF-noML vs BF+ML: SLA(RT) and PMs
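The retraining episode above also suggests a simple practice: re-test several model kinds whenever the environment changes. A possible sketch using cross-validation follows, with the candidate families chosen by us as examples (the paper itself used M5P and a nearest-neighbour learner through WEKA):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

def pick_rt_model(X, y):
    """Cross-validate several model kinds on the new environment's data and
    return the one with the lowest mean absolute error, refit on all data."""
    candidates = {
        "tree": DecisionTreeRegressor(min_samples_leaf=5),
        "knn": KNeighborsRegressor(n_neighbors=5),
        "linear": LinearRegression(),
    }
    maes = {name: -cross_val_score(m, X, y, cv=5, scoring="neg_mean_absolute_error").mean()
            for name, m in candidates.items()}
    best = min(maes, key=maes.get)
    return candidates[best].fit(X, y), maes
```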
5. CONCLUSIONS

In this work we presented a methodology for modeling the cloud computing resources of a web-service based data-center using machine learning, obtaining good predictors to empower and drive decision-making algorithms for virtualized job schedulers, without the intervention of much expert knowledge. We observe that the ML-augmented algorithms often behave as well as or better than ad-hoc algorithms with expert tuning. Response time and quality of service are better maintained in some stress situations, when possible, by consolidating and de-consolidating guided by predictions of the required computing resources and the resulting RT for a given schedule.

Next steps will focus on scalability and on hierarchically modeling the cloud system as a set of data-centers where services can move not only between machines but also among locations around the world. We will also focus on the network side, including the data-center-to-client service time as another SLA objective, bringing the services near their demand.

Acknowledgments

Thanks to RDlab-LSI for their support. This work has been supported by the Spanish Ministry of Science under contract TIN2011-27479-C04-03 and FPI grant BES-2009-011987 (TIN2008-06582-C03-01), by the EU PASCAL2 Network of Excellence, and by the Generalitat de Catalunya (2009-SGR-1428).

6. REFERENCES

[1] J. Berral, R. Gavaldà, and J. Torres. Li-BCN Workload 2010, 2011. http://www.lsi.upc.edu/dept/techreps/llistat_detallat.php?id=1099.
[2] J. S. Chase, D. C. Anderson, P. N. Thakar, and A. M. Vahdat. Managing energy and server resources in hosting centers. In 18th ACM SOSP, 2001.
[3] Í. Goiri, F. Julià, R. Nou, J. Berral, J. Guitart, and J. Torres. Energy-aware Scheduling in Virtualized Datacenters. In 12th IEEE Cluster, 2010.
[4] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. SIGKDD Explor. Newsl., 11(1):10–18, 2009.
[5] F. Julià, J. Roldàn, R. Nou, O. Fitó, À. Vaqué, Í. Goiri, and J. Berral. EEFSim: Energy Efficiency Simulator, 2010.
[6] B. Sotomayor, R. S. Montero, I. M. Llorente, and I. Foster. Virtual infrastructure management in private and hybrid clouds. IEEE Internet Computing, 13(5):14–22, Sept. 2009.
