
IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 38, NO. 2, FEBRUARY 2020

Management and Orchestration of Virtual Network Functions via Deep Reinforcement Learning

Joan S. Pujol Roig, David M. Gutierrez-Estevez, and Deniz Gündüz

Abstract—Management and orchestration (MANO) of resources by virtual network functions (VNFs) represents one of the key challenges towards a fully virtualized network architecture as envisaged by 5G standards. Current threshold-based policies inefficiently over-provision network resources and under-utilize available hardware, incurring high cost for network operators and, consequently, the users. In this work, we present a MANO algorithm for VNFs allowing a central unit (CU) to learn to autonomously re-configure resources (processing power and storage), deploy new VNF instances, or offload them to the cloud, depending on the network conditions, the available pool of resources, and the VNF requirements, with the goal of minimizing a cost function that takes into account the economical cost as well as the latency and the quality-of-service (QoS) experienced by the users. First, we formulate the stochastic resource optimization problem as a parameterized action Markov decision process (PAMDP). Then, we propose a solution based on deep reinforcement learning (DRL). More precisely, we present a novel RL approach, called parameterized action twin (PAT) deterministic policy gradient, which leverages an actor-critic architecture to learn to provision resources to the VNFs in an online manner. Finally, we present numerical performance results, and map them to 5G key performance indicators (KPIs). To the best of our knowledge, this is the first work that considers DRL for MANO of VNFs' physical resources.

Index Terms—Deep reinforcement learning, resource allocation, software defined networks, virtual network functions, wireless edge processing.

I. INTRODUCTION

TRADITIONALLY, the deployment of new network functions (NFs) has been done through the acquisition and installation of proprietary hardware running licensed software. This reduces the incentives for network operators to update their network's physical architecture to offer new services or update existing ones, as it represents an increase in both the capital expenditures (CAPEX), i.e., equipment investment, equipment installation and personnel training, and the operational expenditures (OPEX), i.e., the cost of operating the system [1]. To overcome this limitation, network function virtualization (NFV) has been proposed to curtail the constant acquisition of dedicated hardware by leveraging virtualization technology to implement NFs on general purpose computers/servers [2]. With virtualization, the software implementation of a NF can be decoupled from the underlying hardware, i.e., NFs can be instantiated without the need for new equipment acquisition and installation, and they can run over commercial off-the-shelf hardware. The isolation of software from hardware allows a set of VNFs to be deployed on a shared pool of resources. This motivates a solution to manage the underlying shared infrastructure (processing power, storage, etc.) in an efficient, scalable and rapid manner.

There has been a lot of work on resource allocation for cloud networks. One of the most popular ways to address resource provisioning is threshold-based reactive approaches, where resources are added or removed if the network's condition reaches certain predefined thresholds [3]–[6]. Although this provides a simple and scalable solution to dynamic resource allocation, threshold-based criteria tend to over-provision and under-utilize network equipment (incurring high costs for the infrastructure provider) and make the management of dynamic traffic and the deployment of new types of services difficult, as network traffic models must be elaborated beforehand. In [7], the authors study the scaling of virtual machines (VMs) in a proactive way. In particular, they propose a decision-tree approach to resolve whether a VM instance should be vertically scaled, that is, more physical resources (e.g., processing power, storage) should be added, or horizontally scaled, i.e., a new VM instance should be deployed. An autonomous vertical scaling approach is proposed in [8] using Q-learning, where an agent learns how to autonomously provision resources (storage and processing power) to a VM.

With the explosion of machine learning (ML) and virtualization technologies, and their applications to communication networks, the idea of self-governing networks leveraging modern ML techniques is becoming popular among the communications research community. Chen et al. [9] propose deep double Q-learning (DDQ) and deep-SARSA solutions for mobile edge computing, where an end user terminal with limited local computation and energy resources jointly optimizes computation offloading and energy consumption selection in an autonomous manner.

Manuscript received June 23, 2019; revised October 17, 2019; accepted November 6, 2019. Date of publication December 13, 2019; date of current version February 19, 2020. This work was supported in part by the European Research Council (ERC) Starting Grant BEACON under Project 725731. The work of J. S. Pujol Roig was supported in part by the Engineering and Physical Sciences Research Council (EPSRC) and in part by Toshiba Research Europe through the iCASE Award to carry out the Ph.D. degree. (Corresponding author: Joan S. Pujol Roig.)
J. S. Pujol Roig and D. Gündüz are with the Department of Electrical and Electronic Engineering, Imperial College London, London, U.K. (e-mail: [email protected]; [email protected]).
D. M. Gutierrez-Estevez is with the Samsung Electronics R&D Institute U.K., Surrey TW18 4QE, U.K. (e-mail: [email protected]).
Digital Object Identifier 10.1109/JSAC.2019.2959263

The end user terminal decides whether to execute a computing task locally or offload it to one or more of the available edge base stations (BSs), also selecting the amount of energy to be allocated for the task in question. A proactive VM orchestration solution is proposed in [10] using Q-learning, where, given the current state, an agent decides to increase, reduce or retain the number of VMs allocated to a VNF. In [11], a deep learning approach is introduced to decide the number of VNFs that must be deployed to meet the network traffic demands. The authors formulate a classification problem, where each class corresponds to the number of VNFs that must be instantiated to be able to cope with the current traffic, and use historical labelled traffic data to train the proposed algorithm.

More recently, in the management and network orchestration (MANO) domain for wireless network resources, the use of DRL has gained traction for network slicing resource orchestration and management. These works formulate a discrete action selection optimization problem, and use well established value-based methods, e.g., Q-learning or SARSA, to solve the formulated problem. In this line of work, [12] proposes a deep Q-learning approach to radio resource slicing and priority-based core network slicing, showing its advantage in addressing demand-aware resource allocation. In [12], the authors formulate the problem of frequency band allocation and the problem of computation resource orchestration for different slices. These problems are reduced to choosing a particular configuration from a finite set of available configurations, which is done leveraging DDQNs. Similarly, in [13], a DRL solution based on DDQN is presented for multi-tenant cross-slice resource orchestration, where again, a discrete number of communication and computation resources have to be allocated to different slice tenants. Finally, a deep deterministic policy gradient (DDPG) method with an advantage function is employed in [14] to allocate bandwidth resources to different network slices. Compared to the aforementioned approaches, the continuous nature of DDPG allows for more fine-grained resource allocation.

In our work, we consider the 3GPP functional split, where a central unit (CU) deploys and maintains a set of VNFs serving the users of several distributed units (DUs). We first formulate the dynamic allocation of processing and storage resources to VNFs as a Markov decision process (MDP). The optimal solution for this problem is elusive due to prohibitively large state and action spaces. Therefore, we present a novel deep reinforcement learning (DRL) algorithm, called parameterized action twin (PAT), where we use DDPG [15] and its novel variant called twin delayed DDPG [16], as well as ideas from the parameterized action Markov decision process (PAMDP) as in [16], [17], so that an agent placed at the CU is trained to learn whether to scale vertically (add processing power and storage), horizontally (instantiate new VNFs), or to offload (send the VNFs to the cloud) based on the system state (service request arrivals, service rates, service level agreement (SLA), etc.), using a cost function that combines the economic cost, SLA requirements, and the latency experienced by the users. The proposed algorithm is deployed in a variety of scenarios and its performance is evaluated according to a defined set of 5G key performance indicators (KPIs).

The feasibility of the proposed solution relies on the assumption that the technology envisaged for NF virtualization is "containerization" [18], where containers perform operating-system-level virtualization, i.e., every time a VNF is launched a container is deployed in a physical server. A "container" is a lightweight, standalone, executable package of software that includes everything needed to run an application (code, runtime, system tools, system libraries and settings), and it is run by the operating system kernel [19]. Containers are isolated from one another and can communicate with each other through well-defined channels. We find containers to be a more appropriate virtualization technology, compared to others such as VMs, as they require less power [20], take less start-up and re-scaling time [21], [22], and, most importantly, can be rescaled on-the-fly without disrupting the service they provide.

Due to the use of an actor-critic architecture, our approach is a joint policy- and action-value-based optimization, which generally shows better convergence properties [15] compared to the value-based approaches implemented in [9], [10], [12], [13]. Moreover, Q-learning and SARSA are used for discrete action selection [23], which is not feasible for the continuous control problem addressed in this work. Although a continuous action space is considered in [14], it focuses on the allocation of a single resource (bandwidth), while we consider the allocation of two continuous resources plus a discrete action for server selection. In contrast to [10], we consider not only horizontal scaling but also vertical scaling, as well as offloading, significantly increasing the complexity of the problem. Furthermore, our algorithm works in an online manner, i.e., dynamically adapting to the network traffic, which differs from [24], where the algorithms are trained using historic labelled data and cannot adapt to new types (i.e., classes) of traffic that differ significantly from the training set. Moreover, in disagreement with what is stated in [24] for reinforcement learning (RL) approaches, our approach can use unlabelled historical data to learn, as we are interested in the network patterns (captured by the historical arrival and service times) to update the critic value-function estimates accordingly. Finally, in comparison with the cloud management algorithm presented in [7], using deep neural networks (DNNs) for function approximation can handle a higher dimensional state space, which would be challenging to capture using decision trees due to the exponential growth in the number of leaves.

The remainder of this paper is organized as follows: In Section II the system model is introduced. The problem formulation using a Markov decision process (MDP) framework is presented in Section III. In Section IV we provide an overview of the RL notation, and review the works upon which our approach is based. The proposed PAT algorithm used to train the agent is explained in Section V. Numerical results illustrating the performance of the PAT algorithm are presented in Section VI. Finally, a summary of the results and conclusions is presented in Section VII.

Notation: [·]^T denotes the transpose operation. 1(x) denotes the logical operator, which equals 1 if x is true, and 0 otherwise. For a positive integer K, [K] denotes the set {1, 2, . . . , K}.
1_N denotes the vector of 1s of size N. For a set A, we denote its power set, i.e., the set of all subsets of A, by 2^A. We define the function clip(x, x_min, x_max) ≜ max{x_min, min{x_max, x}}.

II. SYSTEM MODEL

We consider a radio access network (RAN) with the 3GPP CU-DU functional split, consisting of B small-cell BSs (the DUs), denoted by B = {B_1, B_2, . . . , B_B}, connected to a CU that is in charge of the MANO of the NFs, such as transfer of user data, mobility control, RAN sharing, positioning, session management, etc. The BSs work as remote radio heads (RRHs), i.e., relaying all their traffic to the CU. Let N = {N_0, . . . , N_{N−1}} denote the set of distinct heterogeneous NFs offered by the CU that can be instantiated by any traffic requirement in the network. Based on the users' traffic requirements from B_i, i ∈ [B], the CU deploys and maintains a subset of VNFs from the set N. See Figure 1 for an illustration of the network model.

Fig. 1. Considered system architecture.

We envisage an autonomous CU that has a local pool of resources that can be used to instantiate new VNFs, or to maintain deployed ones. The pool of local resources consists of K servers denoted by {S_1, . . . , S_K}, each with a limited processing and storage capability. We assume homogeneity across servers, such that each S_k, k ∈ [K], has the same storage size of η_max F bits and a central processing unit (CPU) of capability ρ_max C Hz. In addition to the local servers, the CU can also employ resources located at a cloud center by offloading VNFs to the cloud, albeit at an increased cost which will be specified later. We consider a central cloud with an infinite capacity resource pool, and denote it by S_{K+1}, so the set of available resources to the CU is denoted by K = {S_1, . . . , S_{K+1}}.

We consider a slotted resource allocation scheme, where new users arriving at the system wait until the start of the next slot to be allocated resources. Thus, the time horizon is discretized into decision epochs, corresponding to slots of duration T, indexed by an integer t ∈ N_+. At the beginning of each epoch, the CU decides how to allocate the network resources. We denote by N^(t) ⊂ N the set of active VNFs maintained by the CU during epoch t. The dependency of N^(t) on t emphasizes the fact that VNFs can be added or removed from the set of services over time. Let us refer to ρ_k^(t) ≤ ρ_max and η_k^(t) ≤ η_max as the total processing power and storage, respectively, of server S_k being used at epoch t. The CU is connected to the cloud via a dedicated link of capacity R^(t) Mbps. We denote by N_{k,j} the instance of N_j deployed at server S_k, and allow only one instance of each VNF to be deployed in a server.

We consider two main physical resources to be provisioned to the VNFs, CPU and memory, and assume that NF instance N_{k,j}, ∀j ∈ [N], at epoch t uses c_{k,j}^(t) C Hz of CPU capability and m_{k,j}^(t) B bits of storage at server S_k, k ∈ [K+1], where c_{k,j}^(t) ∈ [c̲_{k,j}^(t), c̄_{k,j}^(t)] and m_{k,j}^(t) ∈ [m̲_{k,j}^(t), m̄_{k,j}^(t)]. Thus, each VNF has a different resource range in which it can operate, following the definition of an elastic NF, which refers to a VNF whose QoS gracefully degrades with the scarcity of resources [25]. The range of CPU and memory resources VNF N_{k,j} is able to operate at is given by:

c̲_{k,j}^(t) = c_{j,0} + (c_{j,r} − Δc_{j,d}) u_{k,j}^(t),
c̄_{k,j}^(t) = c_{j,0} + (c_{j,r} + Δc_{j,d}) u_{k,j}^(t),
m̲_{k,j}^(t) = m_{j,0} + (m_{j,r} − Δm_{j,d}) u_{k,j}^(t),
m̄_{k,j}^(t) = m_{j,0} + (m_{j,r} + Δm_{j,d}) u_{k,j}^(t),

where u_{k,j}^(t) denotes the number of users being served by VNF instance N_{k,j} at epoch t; c_{j,0} and m_{j,0} represent the offset CPU and memory requirements, respectively, that do not depend on the number of users being served. The variables c_{j,r} and m_{j,r} account for the linear increment of CPU and memory per user being served by the particular deployment of N_j in server S_k. The values Δc_{j,d} and Δm_{j,d} are referred to as the elastic service coefficients, and define the resource range under which the VNF is able to operate.

The QoS of the u_{k,j}^(t) users served by the instance of VNF N_{k,j} is denoted by QoS_{k,j}^(t), and depends on the resources allocated to this VNF instance. VNF N_j, j ∈ [N], has a minimum QoS requirement, QoS_j^min, that must always be ensured as specified by the SLA, and a maximum perceived QoS, QoS_j^max.

We assume that QoS_{k,j}^(t), as a function of m_{k,j}^(t) and c_{k,j}^(t), is given by the following piecewise function:

QoS_{k,j}^(t) =
  QoS_j^max, if c_{k,j}^(t) > c̄_{k,j}^(t) and m_{k,j}^(t) > m̄_{k,j}^(t);
  0, if c_{k,j}^(t) < c̲_{k,j}^(t) or m_{k,j}^(t) < m̲_{k,j}^(t);
  [(QoS_j^max − QoS_j^min)/(r̄_{k,j}^(t) − r̲_{k,j}^(t))] (min{m_{k,j}^(t), m̄_{k,j}^(t)} + min{c_{k,j}^(t), c̄_{k,j}^(t)}) + (QoS_j^min r̄_{k,j}^(t) − QoS_j^max r̲_{k,j}^(t))/(r̄_{k,j}^(t) − r̲_{k,j}^(t)), otherwise,

where we define r̲_{k,j}^(t) ≜ c̲_{k,j}^(t) + m̲_{k,j}^(t) and r̄_{k,j}^(t) ≜ c̄_{k,j}^(t) + m̄_{k,j}^(t).
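To make the elastic resource model concrete, the following Python sketch evaluates the operating range and the piecewise QoS function defined above for a single VNF instance. The numeric offsets, per-user slopes and elastic coefficients in the example are placeholders for illustration, not values taken from the paper.

```python
def resource_range(u, c0, cr, dc, m0, mr, dm):
    """Elastic operating range of a VNF instance serving u users.

    Returns (c_low, c_high, m_low, m_high), following
    c_low = c0 + (cr - dc) * u and c_high = c0 + (cr + dc) * u,
    and analogously for memory.
    """
    return (c0 + (cr - dc) * u, c0 + (cr + dc) * u,
            m0 + (mr - dm) * u, m0 + (mr + dm) * u)


def qos(c, m, c_low, c_high, m_low, m_high, qos_min, qos_max):
    """Piecewise QoS of a VNF instance given its CPU (c) and memory (m) allocation."""
    if c > c_high and m > m_high:        # over-provisioned: QoS saturates
        return qos_max
    if c < c_low or m < m_low:           # below the SLA range: no service
        return 0.0
    # linear interpolation between (r_low, qos_min) and (r_high, qos_max),
    # where r aggregates the CPU and memory allocations
    r_low, r_high = c_low + m_low, c_high + m_high
    r = min(c, c_high) + min(m, m_high)
    slope = (qos_max - qos_min) / (r_high - r_low)
    return slope * r + (qos_min * r_high - qos_max * r_low) / (r_high - r_low)


# Example: a hypothetical VNF serving 4 users
lo_c, hi_c, lo_m, hi_m = resource_range(u=4, c0=1.0, cr=0.5, dc=0.1,
                                        m0=1.0, mr=0.4, dm=0.1)
print(qos(c=2.8, m=2.4, c_low=lo_c, c_high=hi_c, m_low=lo_m, m_high=hi_m,
          qos_min=1.0, qos_max=5.0))
```

By construction, the interpolation returns QoS_j^min at the lower end of the range and QoS_j^max at the upper end, matching the saturation behaviour discussed next.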
We see that N_{k,j} satisfies the SLA if and only if m_{k,j}^(t) ≥ m̲_{k,j}^(t) and c_{k,j}^(t) ≥ c̲_{k,j}^(t), as these result in QoS_{k,j}^(t) ≥ QoS_j^min, and that additional resources beyond c̄_{k,j}^(t) and m̄_{k,j}^(t) do not have an impact on the QoS perceived by the users, which saturates at QoS_j^max. Furthermore, based on the above definition, the QoS is the same for all the users served by the same VNF instance N_{k,j}, i.e., all u_{k,j}^(t) of them. The CU can also offload VNFs to the cloud, in which case the CU's local pool is not used.

We define c_k^(t) ≜ [c_{k,1}^(t), . . . , c_{k,N}^(t)] and m_k^(t) ≜ [m_{k,1}^(t), . . . , m_{k,N}^(t)]. Finally, the matrices C^(t) ≜ [c_1^(t), . . . , c_K^(t)] and S^(t) = [m_1^(t), . . . , m_K^(t)] represent the CPU and memory allocations across all the K servers at epoch t, such that ρ_k^(t) = 1_N · c_k^(t) and η_k^(t) = 1_N · m_k^(t).

The number of new service requests from all the BSs for VNF N_j, j ∈ [N], in epoch t is denoted by n_j^(t), and is assumed to follow an independent and identically distributed (i.i.d.) homogeneous Poisson process with parameter λ_j^(t); in other words, the probability of n_j^(t) new demands arriving at the CU for VNF N_j in epoch t, for a time slot of duration T, is given by:

P(n_j^(t) = n) = e^{−λ_j^(t) T} (λ_j^(t) T)^n / n!.

Remark 1: In order to capture slow variations of network traffic over time, we consider time-varying λ_j^(t) values, obtained by sampling a Gaussian distribution with parameters μ_j and σ_j and taking the maximum between the obtained value and 0, i.e., λ_j^(t) = max{x, 0}, where x ∼ N(μ_j, σ_j). We assume the value of λ_j^(t) is kept constant for a block of t_max time slots, and changes to an independent realization from the aforementioned truncated Gaussian distribution for the next block. R^(t) is also obtained by sampling a truncated Gaussian distribution: R^(t) = max{x, R_min}, where x ∼ N(μ_r, σ_r).

We model users' service times by a geometric distribution; i.e., at the end of each time slot a user will remain in the system with probability p_j, and leave the system with probability 1 − p_j, so the expected service time of a user is 1/p_j, ∀j ∈ [N].

There are three objectives the CU may want to optimize simultaneously: latency, financial cost and service quality. In order to simplify this multi-objective optimization problem, the CU minimizes the long-term weighted average of these three objectives. Next we explain each of these costs.

Latency (δT_{k,j}^(t)): The latency cost associated with VNF instance N_{k,j} during epoch t is due to three potential causes:

• VNF resizing latency δr_{k,j}^(t): associated with resizing the containers. Resizing a VNF consists of varying the amount of allocated CPU and memory resources. Docker allows resizing containers on-the-fly using the command docker update (from Docker v1.11.1). We assume that any instantiated container incurs a delay of δ_{r,c} per unit C of CPU added/removed, and δ_{r,m} per block of memory of size F added/removed. Thus, the resizing latency of the VNF instance N_{k,j} is:
  δr_{k,j}^(t) = |c_{k,j}^(t) − c_{k,j}^(t−1)| δ_{r,c} + |m_{k,j}^(t) − m_{k,j}^(t−1)| δ_{r,m}.

• Deployment latency δd_{k,j}^(t): When a new VNF N_{k,j} is instantiated on a server S_k, k ∈ [K], we consider a boot-up delay of δ_{d,b} per container. The total deployment latency of instance N_{k,j} is
  δd_{k,j}^(t) = 1(c_{k,j}^(t−1) = 0 and c_{k,j}^(t) > 0) δ_{d,b}.

• Offloading latency δoff_{K+1,j}: If a VNF instance is deployed on the cloud, in order to keep the service running, a continuous flow of information between the cloud and the CU must be retained until the VNF is terminated. This incurs a total latency of
  δoff_{K+1,j} = 2 m_{K+1,j} B / R^(t)
  for the offloaded VNF. Once a VNF is deployed in the cloud, we consider that the maximum resource utilization r̄_j^(t) is guaranteed, so that QoS_{K+1,j}^(t) = QoS_j^max.

The total latency incurred by VNF instance N_{k,j} at epoch t is

δT_{k,j}^(t) = u_{k,j}^(t) · (δd_{k,j}^(t) + δr_{k,j}^(t)), if k ∈ [K];
δT_{k,j}^(t) = u_{k,j}^(t) · δoff_{K+1,j}, otherwise.

All the users being served by instance N_{k,j} experience the same latency, hence the scaling by u_{k,j}^(t), the number of users being served by each instance.

Financial cost (CT_{k,j}^(t)): A price model that takes into account the economic implications of each VNF instance N_{k,j} configuration is developed.

• Resource cost Cr_{k,j}^(t): a financial cost of C_{r,m} per B bits of memory per epoch and C_{r,p} per C units of CPU resource per epoch for server S_k, k ∈ [K], i.e.,
  Cr_{k,j}^(t) = c_{k,j}^(t) C_{r,p} + m_{k,j}^(t) C_{r,m}.

• Server cost Ci_{k,j}^(t): Every time a server is powered on, we consider a one-time payment of C_{i,0} plus a rental cost of C_{i,v} per epoch. Hence, the server cost is given by:
  Ci_{k,j}^(t) = 1(1_N · c_k^(t−1) = 0 and 1_N · c_k^(t) > 0) C_{i,0}/N + 1(1_N · c_k^(t) > 0) C_{i,v}/N.

• Cloud cost Cc_{K+1,j}^(t): The financial cost of offloading a VNF to the cloud is modelled as a one-time payment of C_{c,0} plus a rental payment of C_{c,v} per user per epoch until the VNF is terminated. Thus, the cloud cost of VNF N_j is given by:
  Cc_{K+1,j}^(t) = 1(c_{K+1,j}^(t−1) = 0 and c_{K+1,j}^(t) > 0) C_{c,0} + m_{K+1,j} C_{c,v}.

The total financial cost of VNF instance N_{k,j} at epoch t is given by:

CT_{k,j}^(t) = u_{k,j}^(t) · (Cr_{k,j}^(t) + Ci_{k,j}^(t)), if k ∈ [K];
CT_{k,j}^(t) = u_{k,j}^(t) · Cc_{K+1,j}^(t), otherwise.
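The latency and financial cost terms can be read off directly from the definitions above. The sketch below computes them for one VNF instance in one epoch; the delay constants, prices, and the dictionary-based interface are illustrative assumptions rather than the values or code used in the paper's experiments.

```python
def resize_latency(c_prev, c_new, m_prev, m_new, d_rc, d_rm):
    """Resizing latency: proportional to the CPU and memory added/removed."""
    return abs(c_new - c_prev) * d_rc + abs(m_new - m_prev) * d_rm


def deployment_latency(c_prev, c_new, d_boot):
    """Boot-up delay, paid only when the instance goes from zero to non-zero CPU."""
    return d_boot if (c_prev == 0 and c_new > 0) else 0.0


def offloading_latency(m_cloud, link_rate):
    """Round-trip cost of keeping an offloaded VNF alive over the cloud link."""
    return 2.0 * m_cloud / link_rate


def financial_cost(local, c, m, users, prices,
                   server_newly_on=False, server_on=True,
                   cloud_newly_deployed=False):
    """Per-epoch financial cost of one VNF instance (local server vs. cloud),
    scaled by the number of served users as in the total-cost definition."""
    if local:
        cost = c * prices["cpu"] + m * prices["mem"]          # resource rental
        if server_newly_on:
            cost += prices["server_on"] / prices["n_vnfs"]    # one-time power-on share
        if server_on:
            cost += prices["server_rent"] / prices["n_vnfs"]  # per-epoch rental share
    else:
        cost = m * prices["cloud_rent"]                       # cloud rental
        if cloud_newly_deployed:
            cost += prices["cloud_on"]                        # one-time offload payment
    return users * cost


prices = {"cpu": 0.2, "mem": 0.1, "server_on": 5.0, "server_rent": 1.0,
          "cloud_on": 10.0, "cloud_rent": 0.5, "n_vnfs": 10}
print(financial_cost(local=True, c=3.0, m=2.5, users=4, prices=prices,
                     server_newly_on=True))
```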
Service level agreement (SLA_{k,j}^(t)): Each VNF instance N_{k,j}, k ∈ [K], j ∈ [N], is associated with a minimum QoS requirement, QoS_j^min, and the failure to provision resources accordingly might incur service disruption, which violates the SLA. Accordingly, we define the SLA cost at VNF instance N_{k,j} as

SLA_{k,j}^(t) = γ_j 1(QoS_{k,j}^(t) < QoS_j^min) (QoS_j^min − QoS_{k,j}^(t)) u_{k,j}^(t),

where QoS_{k,j}^(t) is the perceived QoS of VNF N_j at server S_k, and γ_j is the penalty for not fulfilling QoS_j^min. Furthermore, the SLA cost scales with the number of users being served by the VNF instance N_{k,j}, as all of them experience the same QoS.

We remark that, in order to capture the impact of the reconfiguration of a VNF container on the whole network, each objective cost function is scaled by the number of users, such that a reconfiguration that affects more users is penalized/rewarded more than one affecting fewer users.

Network cost: We define the overall network cost as the total cost incurred by all the instances deployed in the network at decision epoch t:

C_T^(t) = [ Σ_{j∈N^(t)} Σ_{k∈[K+1]} ( ω_1 δT_{k,j}^(t) + ω_2 CT_{k,j}^(t) + ω_3 SLA_{k,j}^(t) ) ] / [ Σ_{j∈N^(t)} Σ_{k∈[K+1]} u_{k,j}^(t) ],

where ω_1, ω_2, ω_3 ∈ R_+ are fixed weights independent of the VNF and the server. These weights can be tuned based on the preferences of the network operator; e.g., a network operator might be more concerned about reducing the economic cost rather than providing a high quality service. The normalization by the number of users is to balance the network cost between heavy and low traffic periods. Without such a normalization, busy traffic periods would incur higher costs regardless of the CU's performance.

VNF instance cost: For purposes that will be explained in Section IV, we define the VNF instance cost C_{k,j}^(t) incurred by instance N_{k,j} at epoch t as follows:

C_{k,j}^(t) = ( ω_1 δT_{k,j}^(t) + ω_2 CT_{k,j}^(t) + ω_3 SLA_{k,j}^(t) ) / u_{k,j}^(t).

This cost measures the contribution of a particular VNF instance to the global network cost.

III. PROBLEM FORMULATION

In this section, we formulate the resource allocation problem as an MDP. We envisage an autonomous CU with the goal of minimizing the long-term cost. To this end, we define the state space and the set of actions that the CU can take at each decision epoch.

A. MDP

At each decision epoch of an MDP, an agent observes a state s^(t) ∈ S, where S is the state space, and selects an action a^(t) ∈ A(s^(t)), where A(s^(t)) is the set of all possible actions in state s^(t). The set A = ∪_{s^(t)∈S} A(s^(t)) is referred to as the action space. Action a^(t) in state s^(t) incurs a certain cost R(s^(t), a^(t)), where R : S × A → R denotes the cost function, and the agent transitions to a new state s^(t+1) ∈ S with probability p(s^(t+1) | s^(t), a^(t)) ∈ P, where P : S × A × S → [0, 1] is a probability kernel. At each interaction, the agent maps the observed state s^(t) to a probability distribution over the action set A(s^(t)). This MDP model is thus characterized by the 4-tuple (s, a, r, p(s' | s, a)). The policy of the agent, denoted by π, specifies the probability of selecting action a^(t) = a in state s^(t) = s, given by π(a | s).

The state-value function V_π(s) for policy π at state s is defined as the expected discounted cost the agent would accumulate starting at state s and following policy π:

V_π(s) ≜ E_π [ Σ_{t=1}^{∞} γ^{t−1} R(s^(t), π(s^(t))) | s^(1) = s ],

where 0 ≤ γ ≤ 1 is the discount factor that determines how far into the future the CU "looks", i.e., γ = 0 corresponds to a "myopic" CU that focuses only on its immediate cost, while γ = 1 represents a CU concerned with the cost over the whole time horizon. The action-value function, also referred to as the Q-function, is defined as:

Q_π(s, a) = E_π [ Σ_{k=0}^{∞} γ^k R(s^(t+k), π(s^(t+k))) | s^(t) = s, a^(t) = a ].

We define the optimal value function, V*(s), as the minimum expected total discounted cost obtained starting in state s and following the optimal policy:

V*(s) = min_π E_π [V_π(s)].

The goal is to find a policy π* whose value function is the same as the optimal value function, V_{π*}(s) = V*(s). Next, we define the state and action spaces for our problem.

B. State Space

The network state space is the set of all possible configurations of the network. The state at epoch t consists of:
1) the number of arrivals for each VNF, n_j^(t), j ∈ N;
2) the deployed VNFs, N^(t);
3) the number of users being served by VNF N_j at each server, u_{k,j}^(t);
4) the cloud link capacity, R^(t);
5) the CPU resources allocated to each VNF at each server, C^(t);
6) the memory resources allocated to each VNF at each server, S^(t).

The network state at epoch t is characterized by

s^(t) = ( {n_j^(t)}_{j∈N}, N^(t), {u_{k,j}^(t)}_{k∈K,j∈N}, C^(t), S^(t), R^(t) ) ∈ S,

where

S = Z_+^N × 2^N × Z_+^{K×N} × {0, . . . , ρ_max}^{N×K} × {0, . . . , η_max}^{N×K} × R_+.
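As an illustration of how the state s^(t) above can be fed to a DNN, the following sketch flattens its components into a single fixed-length vector. The ordering, field names and array shapes are one possible convention for illustration, not the paper's implementation.

```python
import numpy as np

def build_state(arrivals, deployed, users, cpu_alloc, mem_alloc, link_rate):
    """Flatten s(t) = (n_j, N(t), u_kj, C(t), S(t), R(t)) into one vector.

    arrivals : (N,)   new requests per VNF
    deployed : (N,)   0/1 indicator of active VNFs
    users    : (K, N) users served per server and VNF
    cpu_alloc: (K, N) CPU allocated per server and VNF
    mem_alloc: (K, N) memory allocated per server and VNF
    link_rate: scalar cloud-link capacity R(t)
    """
    return np.concatenate([
        np.asarray(arrivals, dtype=float),
        np.asarray(deployed, dtype=float),
        np.asarray(users, dtype=float).ravel(),
        np.asarray(cpu_alloc, dtype=float).ravel(),
        np.asarray(mem_alloc, dtype=float).ravel(),
        [float(link_rate)],
    ])

# Example with N = 3 VNFs and K = 2 local servers
K, N = 2, 3
s = build_state(arrivals=[2, 0, 1],
                deployed=[1, 0, 1],
                users=np.zeros((K, N)),
                cpu_alloc=np.zeros((K, N)),
                mem_alloc=np.zeros((K, N)),
                link_rate=100.0)
print(s.shape)   # (2*N + 3*K*N + 1,) = (25,)
```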
C. Action Space

The CU can react to variations in the workload in three ways: vertical scaling, horizontal scaling and offloading. The CU actions are taken at the user level, that is, the CU selects an action for each user request arriving at the system. This allows the CU to allocate users requesting the same VNF to different servers.

Following [26], we employ a PAMDP formulation, where a discrete action set is defined as A_D = {a_1, a_2, . . . , a_D}, and each action a ∈ A_D is associated with n_a continuous parameters (p_1^a, . . . , p_{n_a}^a), p_i^a ∈ R. Thus, each tuple (a, p_1^a, . . . , p_{n_a}^a) represents a distinct action, and the action space is given by A = ∪_{a∈A_D} (a, p_1^a, . . . , p_{n_a}^a). In our problem, the first, discrete component denotes the server to which the user is assigned, while the remaining continuous components denote how the associated resources are updated.

1) Vertical Scaling: The vertical scaling action space, denoted by A_V = [K], refers to actions adding resources to, or removing them from, an already deployed VNF instance at epoch t [7]. Taking into account the traffic fluctuations and VNF requirements, a CU might decide to increase (decrease) the CPU and/or memory resources allocated to a deployed VNF instance independently, i.e., the memory allocation can be increased while the CPU allocation is decreased, or vice versa. Hence, we define the vertical scaling actions separately for the CPU and memory resources, as p_CPU^(t) and p_M^(t), respectively, as the change in the allocated resources with respect to time slot t − 1. We have

p_CPU^(t) ∈ { i · C | i ∈ R, −ρ_k^(t) ≤ i ≤ ρ_max − ρ_k^(t) },
p_M^(t) ∈ { i · B | i ∈ R, −η_k^(t) ≤ i ≤ η_max − η_k^(t) }.

Vertical scaling is limited by the resources of the physical server in which a container is deployed, hence the limits ρ_max − ρ_k^(t) and η_max − η_k^(t).

Note that the parameters p_CPU^(t) and p_M^(t) represent an increment/decrement of the resources already allocated to VNF N_j at server S_k, i.e., c_{k,j}^(t) = c_{k,j}^(t−1) + p_CPU^(t) and m_{k,j}^(t) = m_{k,j}^(t−1) + p_M^(t); hence, p_CPU^(t) and p_M^(t) can also take negative values. As mentioned before, all the users of a server's VNF instance equally share the allocated resources; thus, all of them are affected by the reshuffling of resources.

2) Horizontal Scaling: Horizontal scaling refers to the deployment of new containers to support an existing VNF N_j at epoch t. If the load of a VNF increases, and the CU estimates that server k at epoch t + 1 will not be able to support its operation, the CU might create another instance of the same VNF in another server. We have A_H = [K] and

p_CPU^(t) ∈ { i · C | i ∈ R, −ρ_k^(t) ≤ i ≤ ρ_max − ρ_k^(t) },
p_M^(t) ∈ { i · B | i ∈ R, −η_k^(t) ≤ i ≤ η_max − η_k^(t) },

where k denotes the server at which a new VNF instance is to be deployed using horizontal scaling. p_CPU^(t) and p_M^(t) account for the amount of CPU and memory resources to be allocated for the new deployment of N_{k,j} at server k.

3) Work Offloading: If the CU foresees that it cannot cope with a traffic fluctuation by scaling vertically or horizontally, it can decide to offload a VNF to the cloud. We define the offloading action as A_off^(t). This action is not associated with any parameter due to the assumption of unlimited CPU and memory resources at the cloud.

4) Parameterized Action Space: Following the PAMDP notation, the complete parameterized action space at epoch t is given by

A^(t) ≜ (A_V^(t), p_CPU^(t), p_M^(t)) ∪ (A_H^(t), p_CPU^(t), p_M^(t)) ∪ A_off^(t).

The cost function for our problem has been defined in Section II in detail. Note that the CU's action at each time slot consists of n_j^(t) actions in the PAMDP formulation, one for each user request. Note also that the randomness in our problem is due to the random user arrivals for each VNF, and the random service time of each user in the system. If these statistics are known, the optimal policy can be identified through dynamic programming (DP), e.g., by the value iteration algorithm. However, estimating these probabilities for our problem, which has large state and action spaces, is prohibitive, making the DP solution practically infeasible. Hence, we will instead exploit DRL to find an approximation to the optimal value function.

IV. REINFORCEMENT LEARNING FOR VNF MANAGEMENT

In the proposed RL method, the agent does not necessarily exploit (or even know) the transition probabilities governing the underlying MDP, as it learns a policy as well as the action-value functions directly from its past experience (model-free). The formulated problem suffers from the curse of dimensionality due to the prohibitively large size of the state and action spaces (continuous state space). Therefore, we employ the actor-critic method with NNs as function approximators to parameterize the policy, which allows the learning agent to directly search over the action space. Another DNN is employed to approximate the state-value functions, which are used as feedback to determine how good the current policy is.

A. Deep Reinforcement Learning (DRL)

The use of DNNs as general function approximators has been proven to work very well in a wide range of areas, such as computer vision, speech recognition, natural language processing, and, more recently, wireless networks [27]. Traditional RL methods struggle to address real-world problems due to their high complexity. In these problems, high-dimensional state spaces need to be managed in order to obtain a model that can generalize past experiences to new states. For example, tabular Q-learning uses a hash table to store the estimated cost of state-action pairs, so for continuous input states, even if quantized, this solution becomes intractable, since even with a modest 5-level quantization and a state vector of size N, 5^N entries would have to be stored (≈10^13 entries if N = 20). DRL aims to solve this problem by employing NNs as function approximators to reduce the complexity of classical RL methods.
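To ground the PAMDP formulation of Section III-C, the sketch below represents one parameterized action (a discrete server choice plus continuous CPU/memory deltas) and clips the parameters to the remaining capacity of the chosen server. The data-structure names and the convention that index K denotes offloading are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class ParamAction:
    """One PAMDP action: a discrete choice plus its continuous parameters."""
    server: int        # 0..K-1 -> scale on that server; K -> offload to the cloud
    d_cpu: float       # CPU added (+) or removed (-); ignored when offloading
    d_mem: float       # memory added (+) or removed (-); ignored when offloading


def clip_to_server(action, used_cpu, used_mem, rho_max, eta_max):
    """Keep the continuous parameters inside the physical limits of the server,
    i.e. -used <= delta <= capacity - used, for both resources."""
    k = action.server
    d_cpu = min(max(action.d_cpu, -used_cpu[k]), rho_max - used_cpu[k])
    d_mem = min(max(action.d_mem, -used_mem[k]), eta_max - used_mem[k])
    return ParamAction(k, d_cpu, d_mem)


# Example: ask for +10 CPU on server 1, which only has 6 units left
a = ParamAction(server=1, d_cpu=10.0, d_mem=-3.0)
print(clip_to_server(a, used_cpu=[20.0, 44.0], used_mem=[10.0, 2.0],
                     rho_max=50.0, eta_max=50.0))
```

The clipping mirrors the feasibility limits ρ_max − ρ_k^(t) and η_max − η_k^(t) stated above; the learning agent itself is described in the next sections.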
In [28], the authors introduce the deep Q-learning network (DQN), where a DNN is used as a function approximator for action selection on a discrete action space, based on Q-learning. Given a state, Q-learning updates the action-value estimate with the immediate reward plus a weighted version of the highest Q-estimate for the next state. Using a combination of three convolutional layers (for computer vision) and two fully connected layers (the Q-learning part), they obtain human-level results for a wide range of Atari games. Other architectures based on DQN have also been proposed, such as the duelling DQN [29], which has two distinct DNNs, one to estimate V_π(s), and the other to estimate the so-called advantage function A_π(s, a) = Q_π(s, a) − V_π(s). These methods work well for a continuous state space but are limited to a discrete action space, suffering from the curse of dimensionality when the action space is large.

To overcome the limitation of discrete action selection, in [15] the idea of DQN is extended to continuous action spaces using the deterministic policy gradient (DPG) theorem [30], in particular the deep DPG (DDPG) method. DDPG extends the use of DNNs to the actor-critic method leveraging off-policy learning, where a deterministic policy is learned using a combination of a replay buffer and target networks to ensure stability, and zero-mean Gaussian noise is added to the actions for action space exploration.

In [16] it is shown that DDPG may lead to overestimation of action values, and thereby to suboptimal policies. To overcome this problem, the authors present a novel method called twin delayed DDPG (TD3). A novel actor-critic architecture is proposed which comprises two critic networks, hence two different Q-functions are learned, and the smaller of the two estimates is used in the update rule for the critics. The proposed algorithm adds clipped noise to the target action to make it harder for the policy to exploit Q-function errors. They also propose to update the targets and the policy less frequently than the Q-functions, helping to reduce the variance.

Given the high dimensionality of both the state and the action spaces in our model, we propose a solution that leverages DNNs as policy and action-value function approximators, while exploiting the results from [16] for continuous action selection. To this end, we implement a novel architecture for PAMDP to address the problem defined in Section III.

Algorithm 1 Proposed PAT
  Initialize the actor and critic networks, i.e., θ_1, θ_2 and φ_1, φ_2, using Gaussian initialization with μ = 0, σ = 10^−2.
  Copy the parameters to the target networks, i.e., φ_1^− ← φ_1, φ_2^− ← φ_2, θ_1^− ← θ_1, θ_2^− ← θ_2.
  Initialize the replay buffer M.
  t = 0
  while t < total_timesteps do
    for i = 0, . . . , T do
      Observe state s.
      Select action a = μ_{θ_1}(s) with probability 1 − ε, or a random action a with probability ε.
      Select p = clip(μ_{θ_2}(s, a) + w, p_min, p_max), where w ∼ clip(N(0, σ^2) · (p_max, η_max), −c, c).
      Store the transition (s, (a, p), r, s') in M.
    end for
    Get a batch B = {(s, (a, p), r, s')} of randomly sampled trajectories from the replay buffer M.
    Compute the action and parameter targets:
      a^−(s') = μ_{θ_1^−}(s'),
      p^−(s', a^−) = clip(μ_{θ_2^−}(s', a^−(s')) + w, p_min, p_max).
    Compute the target estimates:
      y(r, s') = r + γ min_{i=1,2} Q_{φ_i^−}(s', a^−(s'), p^−(s', a^−(s'))).
    Update the Q-functions:
      ∇_{φ_i} (1/|B|) Σ_{(s,(a,p),r,s')∈B} ( Q_{φ_i}(s, a, p) − y(r, s') )^2, for i = 1, 2.
    Update the action policy:
      ∇_{θ_1} (1/|B|) Σ_{(s,p)∈B} Q_{φ_1}(s, μ_{θ_1}(s), p).
    Update the parameter policy:
      ∇_{θ_2} (1/|B|) Σ_{(s,a)∈B} Q_{φ_1}(s, a, μ_{θ_2}(s, a)).
    Update the target networks:
      φ_i^− = τ φ_i + (1 − τ) φ_i^−, for i = 1, 2,
      θ_i^− = τ θ_i + (1 − τ) θ_i^−, for i = 1, 2.
    t = t + 1
  end while

V. ACTOR-CRITIC METHOD IN PAMDP

The actor-critic method is a combination of value-based and policy optimization approaches. It combines the benefits of both methods, as the critic estimates the action-value function Q_φ(s, a), while the actor derives a policy π_θ(s) using the value estimates of the critic to update the policy. In this section we present our novel algorithm (see Algorithm 1), which implements the actor-critic method for a PAMDP, and which we call the parameterized action twin (PAT). For ease of notation, in the rest of the section we use s^(t) = s, s^(t+1) = s', a^(t) = a, and R(s^(t), a^(t)) = r.

A. VNF MANO Meets PAT

Before detailing the proposed PAT algorithm, we clarify its integration into the CU, and how it interacts with the environment described in Section II, as we believe it will ease the comprehension of the algorithm. At the beginning of each decision epoch t, we randomly select a VNF and proceed to serve the new demands for this VNF. The random selection of the VNF is motivated by fairness, so that we avoid always starting the process of resource allocation (when more resources are available) with the same VNF. Following the random VNF selection, we iterate over all requests of this VNF to allocate the network resources using the PAT method. For resource allocation, a snapshot of the network state is used as input to the PAT method. Based on the network state, the proposed RL algorithm decides the actions to be taken and from which server/cloud the user is served. After the allocation, a new snapshot of the network state is obtained and the agent cost described in Section V-C is computed. These transitions are stored in the memory buffer of the agent and will later be used to train the PAT algorithm, so that it can adapt to previously seen as well as new traffic patterns.
Even if a VNF does not have any new requests, we nevertheless select that VNF and apply the PAT algorithm for action selection so that, in case it was already deployed, resources can be added or removed; if not, the VNF can be deployed ahead of future traffic. In this case, the VNF instance only incurs an economical cost, as no user is served.

B. PAT Algorithm

Following [16], we use two critics in order to obtain two distinct estimates of the action-values; thus, two different DNNs, parameterized by φ_1 and φ_2, are used to estimate two different action-value functions. The aim of the two critic networks, as explained in [16], is to avoid overestimation. We find that clipping the critics' updates to the minimum of the two estimates yields better policies. Note that this update rule might introduce an underestimation bias; however, we find it more convenient in the long term to avoid convergence to suboptimal policies.

Two more DNNs, parameterized by θ_1 and θ_2, are used for the policy parameterization of the actor. The goal of the first actor network is to select the discrete action a based on the current state s, while the second network generates the continuous action parameters p = [p_CPU^(t), p_M^(t)]^T based on the outcome of the first actor network, a, and the current system state s. Thus, the joint selection (a, p) is determined by two distinct networks, in contrast to the approach in [17], where a single DNN architecture is used to determine both the action and the parameters associated with it. We find this architecture to reflect a more natural process of action selection: first deciding the discrete action a, and then choosing the associated parameters p as defined in Subsection III-C. We use a stochastic policy for discrete action selection, while a deterministic policy, which we denote by μ, is leveraged for parameter selection, i.e., parameters θ_2 map state and action (s, a) to parameters μ_{θ_2}(s, a) = p. Figure 2 illustrates the DNN structure and the flow of information.

Fig. 2. The information flow between different DNNs in the proposed architecture.

Finally, four more DNNs are employed, corresponding to the mirroring target networks, and are parametrized by φ_1^−, φ_2^−, θ_1^−, θ_2^−, respectively. Their function is explained later in this section.

1) Parameter Updates: The critics take the network state s and the action (a, p), and estimate the value function Q_{φ_i}(s, a, p), i = 1, 2. As is typical in actor-critic methods, we use off-policy temporal difference of 0, i.e., TD(0), for action-value function approximation, with the clipped update rule

Q_{φ_i}^{t+1}(s, a, p) = Q_{φ_i}^t(s, a, p) + α [ r + γ min_{i=1,2} Q_{φ_i^−}^t( s', μ_{θ_1^−}(s'), μ_{θ_2^−}(s', μ_{θ_1^−}(s')) ) − Q_{φ_i}^t(s, a, p) ],

which minimizes the following loss function, for i = 1, 2:

L_{Q_{φ_i}}(s, a, p) = (1/2) [ r + γ min_{i=1,2} Q_{φ_i^−}( s', μ_{θ_1^−}(s'), μ_{θ_2^−}(s', μ_{θ_1^−}(s')) ) − Q_{φ_i}(s, a, p) ]^2.

The action-value functions of the critics are learned through gradient descent with the update rule:

φ_i^{t+1} = φ_i^t + α [ r + γ min_{i=1,2} Q_{φ_i^−}( s', μ_{θ_1^−}(s'), μ_{θ_2^−}(s', μ_{θ_1^−}(s')) ) − Q_{φ_i^t}(s, a, p) ] ∇_{φ_i^t} Q_{φ_i^t}(s, a, p).

The critics' estimates of the joint action and parameters are gathered by the actors to update the policy. In a continuous action space, the greedy update of the policy becomes infeasible, as it requires a global maximization at every step, and going through the entire action space to maximize the estimated expected return is infeasible. Following [30], we use the critic network's gradient, which indicates the direction in which the global Q-value estimate increases, to update the policy parameters. In order to obtain the gradients, we need to perform back-propagation through one of the critic networks (we chose critic 1). It must be noted here that this gradient is not the conventional gradient with respect to the network parameters, but with respect to the input, such that for the action network θ_1 the update rule is

θ_1^{t+1} = θ_1^t + α E_{s∼ρ_{θ_1}} [ ∇_{θ_1} μ_{θ_1}(s) ∇_a Q_{φ_1}(s, a, p) |_{a=μ_{θ_1}(s)} ],

while the update rule for network θ_2 is

θ_2^{t+1} = θ_2^t + α E_{s∼ρ_{θ_2}} [ ∇_{θ_2} μ_{θ_2}(s, a) ∇_p Q_{φ_1}(s, a, p) |_{p=μ_{θ_2}(s,a)} ],

where s ∼ ρ_{θ_i} refers to the trajectories sampled using network i.

2) Stabilizing Updates: Once both the critic and the actor networks are updated, the target networks should also be updated. Target networks are used to stabilize the updates. If the same network φ_1 = φ_2 = φ were used for bootstrapping (i.e., estimating the value function of the next state, Q(s', a')) and for estimating Q(s, a), the φ network would be updated with each iteration to move closer to the target Q-values; but, at the same time, the target Q-values, which are given by the same network, would also be changing in the same direction, like a dog chasing its tail.
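The update rules above map naturally onto an implementation. Below is a condensed PyTorch sketch of one PAT optimization step: a clipped double-Q TD(0) target for the two critics, actor updates back-propagated through critic 1 only, and Polyak-averaged target networks. The module interfaces, optimizer handling, sign convention (the paper's R is a cost, so the actors descend on the critic estimate) and the soft treatment of the discrete action are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def pat_update(batch, actor_a, actor_p, critics, targets, optims,
               gamma=0.99, tau=0.005):
    """One PAT step: clipped double-Q critic update, then the two actor updates."""
    s, a, p, r, s_next = batch          # tensors; a is a (soft) one-hot over K+1 choices

    # --- critic update: TD(0) target with the minimum of the two target critics
    with torch.no_grad():
        a_next = targets["actor_a"](s_next)
        p_next = targets["actor_p"](s_next, a_next)
        q_next = torch.min(targets["critic1"](s_next, a_next, p_next),
                           targets["critic2"](s_next, a_next, p_next))
        y = r + gamma * q_next
    for name in ("critic1", "critic2"):
        loss = F.mse_loss(critics[name](s, a, p), y)
        optims[name].zero_grad(); loss.backward(); optims[name].step()

    # --- actor updates: back-propagate through critic 1 only (costs are minimized)
    loss_a = critics["critic1"](s, actor_a(s), p).mean()          # discrete head
    optims["actor_a"].zero_grad(); loss_a.backward(); optims["actor_a"].step()

    loss_p = critics["critic1"](s, a, actor_p(s, a)).mean()       # parameter head
    optims["actor_p"].zero_grad(); loss_p.backward(); optims["actor_p"].step()

    # --- soft (Polyak) update of all target networks
    pairs = [(targets["critic1"], critics["critic1"]),
             (targets["critic2"], critics["critic2"]),
             (targets["actor_a"], actor_a), (targets["actor_p"], actor_p)]
    with torch.no_grad():
        for tgt, src in pairs:
            for tp, sp in zip(tgt.parameters(), src.parameters()):
                tp.mul_(1.0 - tau).add_(tau * sp)
```

In a TD3-style schedule, the two actor updates and the target soft-updates would typically be applied less frequently than the critic updates, as mentioned in the discussion of [16] above.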
By introducing the target networks, we reduce this constant movement of the target estimates by delaying their update. The target networks are updated as

φ_i^− = τ φ_i^t + (1 − τ) φ_i^−,
θ_i^− = τ θ_i^t + (1 − τ) θ_i^−,

where τ ≤ 1 is a hyper-parameter that regulates the update speed.

Another tool to stabilize the network parameter updates is the memory buffer M, which stores the interactions of the agent with the environment; to be more precise, we store one-step trajectories, i.e., (s, a, r, s'). Once the memory is filled with enough samples (≈ 100K), we uniformly sample the memory to obtain mini-batches of N samples, which are used to compute the losses of the actor and critic. Most optimization algorithms, including gradient descent, assume that the samples from which the gradient estimate is obtained are i.i.d. Clearly this is not the case in the defined environment; however, by sampling uniformly from the memory buffer, the correlation between consecutive samples is reduced, leading to a more stable optimization of the action-parameter selection.

3) Exploitation vs. Exploration: Any RL algorithm with a deterministic policy entails a trade-off between exploitation and exploration. For discrete action selection, we use the ε-greedy policy to ensure exploration, where with probability ε a random action is selected by sampling a uniform distribution over all possible discrete actions. A high value of ε is set at the beginning to encourage exploration, but its value is reduced gradually over time until it reaches a certain minimum ε_min, where it remains stable.

Ensuring the exploration of all possible continuous parameters is not possible. We use the approach proposed in [16], where clipped zero-mean Gaussian noise is constantly added to the parameter selection policy. This approach is motivated by the assumption that similar parameters should have similar costs, and thus similar estimates, and the noise addition is used to encourage exploration. After the addition of noise, the parameter values are clipped to the allowed range [p_min, p_max], as defined in Section III-C:

μ(s, a) = clip(μ_{θ_2}(s, a) + w, p_min, p_max),  w ∼ clip(N(0, σ^2), −c, c),

where σ and c are hyperparameters. Similarly to the ε parameter for the action selection, we gradually reduce the value of c until it reaches a minimum value c_min.

4) Architecture: The DNN architectures for the action, action-parameter, and critic networks are the same. For all the networks, the inputs are processed by fully connected layers consisting of 128 and 64 units, respectively. Each fully connected layer is followed by a rectified linear unit (ReLU) activation function with negative slope 10^−2. The weights of the fully connected layers are initialized using Xavier initialization with a standard deviation of 10^−2 [31].

The input of the actor action network is the network state, and connected to its final inner-product layer there are K + 1 linear outputs corresponding to the discrete action selection (K servers plus the cloud). For the actor parameter network, the last layer comprises a hyperbolic tangent activation function scaled by p_max and η_max, with two outputs corresponding to the CPU and memory values allocated to the selected discrete action, while its inputs are the state and the selected action. Finally, the critic network gathers the state, the action, and the action parameters, and a single output value is obtained, the estimate of Q(s, a, p). We use the ADAM optimizer for both the actor and the critic, with a learning rate of l_r.

C. Agent Cost Function

In Section II we defined the global network cost as the main metric this work aims to minimize. However, we do not directly use it as the metric the agent optimizes, as we found it to be too general to guide the agent in its initial learning steps towards finding resource allocation policies that lead to good results. The goal of this subsection is to define the cost function Ψ(t) that we use in learning to provide feedback to the agent regarding its actions.

Individual actions taken by the agent have a direct impact on the performance of the selected VNF instance, but also contribute to the total network performance. Thus, in order to guide the agent to learn to allocate resources for different VNF instances, we use the VNF instance cost. However, the minimization goal of this work is the total network cost, hence we need to include it in the global picture. To this end we define the cost as follows:

Ψ(t) = ( C_{k,j}^(t) + β C_T^(t) ) / Γ_max,

where Γ_max is a hyperparameter that guarantees Ψ(t) ∈ [−1, 1], while β determines how much the agent accounts for the total network cost. This is the cost we use in the DNN training, while the network cost remains the objective function of the CU.

Furthermore, during the training phase, the proposed RL approach needs to learn the physics of the environment; that is, at the beginning of the learning process the agent might try to add/subtract more CPU or memory to a server than is available/possible. In order to teach the agent the environment's physical limitations, whenever the algorithm outputs an infeasible action we offload the user to the cloud and impose a cost of Ψ(t) = −1.

VI. NUMERICAL RESULTS

In this section we present numerical results obtained with the PAT method described in Section V. We start by presenting some benchmarks to compare the proposed algorithm with, followed by the experimental setup and the parameters used in the simulations.
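Before turning to the benchmarks, the network structure described in Section V-B.4 can be summarized in code. The following PyTorch sketch shows the three network types (discrete-action actor, parameter actor, critic) with the 128/64-unit layers and leaky-ReLU activations mentioned in the text; the input dimensioning, concatenation choices and the scaling constants of the tanh output are assumptions for illustration.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    """Shared trunk: 128- and 64-unit hidden layers with leaky-ReLU
    activations (negative slope 1e-2), as described in the text."""
    return nn.Sequential(
        nn.Linear(in_dim, 128), nn.LeakyReLU(0.01),
        nn.Linear(128, 64), nn.LeakyReLU(0.01),
        nn.Linear(64, out_dim),
    )

class ActorAction(nn.Module):
    """State -> K+1 linear outputs (K servers plus the cloud)."""
    def __init__(self, state_dim, n_servers):
        super().__init__()
        self.net = mlp(state_dim, n_servers + 1)
    def forward(self, s):
        return self.net(s)

class ActorParam(nn.Module):
    """(State, discrete action) -> 2 outputs in [-1, 1], scaled to the CPU/memory ranges."""
    def __init__(self, state_dim, action_dim, cpu_scale, mem_scale):
        super().__init__()
        self.net = mlp(state_dim + action_dim, 2)
        self.register_buffer("scale", torch.tensor([cpu_scale, mem_scale]))
    def forward(self, s, a):
        return torch.tanh(self.net(torch.cat([s, a], dim=-1))) * self.scale

class Critic(nn.Module):
    """(State, action, parameters) -> scalar estimate of Q(s, a, p)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = mlp(state_dim + action_dim + 2, 1)
    def forward(self, s, a, p):
        return self.net(torch.cat([s, a, p], dim=-1))

# Example shapes: a 25-dimensional state, 10 servers plus the cloud
q = Critic(25, 11)(torch.zeros(4, 25), torch.zeros(4, 11), torch.zeros(4, 2))
print(q.shape)   # torch.Size([4, 1])
```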
Fig. 3. Defined costs comparison between PAT and the other DRL benchmarks. The shaded regions demonstrate the standard deviation of the average evaluation over 10 trials.

A. Benchmarks

In order to assess the quality of the proposed algorithm, we compare the PAT agent with other DRL benchmark algorithms.

• Greedy: For each new user in the system, the greedy algorithm checks whether the new user's VNF is already deployed in one of the CU servers. If so, it computes the CPU and memory that the server would need to allocate to that VNF such that the new and existing users can be served from that VNF instance, i.e., the resulting VNF's QoS lies within [QoS_min, QoS_max]. If this is feasible, the VNF is resized, and the user is allocated to that server (vertical scaling). If not, another server is checked, until resources for the new user can be assigned (horizontal scaling). If no server is able to accommodate this new user, it is offloaded to the cloud.
• Cloud: This policy offloads all the traffic to the cloud.
• DRL benchmarks: To overcome the problem of discrete and continuous action selection formulated in this work, we use two distinct state-of-the-art DRL algorithms. For server selection we use DDQN, while for parameter selection we use the following algorithms:
  1) DDPG [30], with a hyperbolic tangent activation function in the outer layer scaled by the maximum values of the CPU and memory, respectively.
  2) A3C [32], where the outputs of the DNNs provide the mean and variance values of the Gaussian distributions used to sample the values of the CPU and memory. The parameter T of [32] is chosen to be 128.
  3) DDQN [28], where we discretize the CPU and memory action spaces with a resolution of 5, meaning that the total number of actions is given by η_max/5 × ρ_max/5.

The DDQN for discrete action selection and the above set of algorithms are trained recursively (discrete action network training first, followed by parameter network training) 1000 times. Each algorithm interacts with the environment for 10000 time slots.
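As a rough illustration of the greedy baseline described above, the sketch below tries vertical scaling on a server already hosting the VNF, then horizontal scaling on any other server with spare capacity, and otherwise offloads the user to the cloud. The feasibility test is reduced to a simple capacity check, and all names and numbers are hypothetical.

```python
def greedy_assign(vnf, servers, cpu_needed, mem_needed, rho_max, eta_max):
    """Return ('vertical' | 'horizontal' | 'cloud', server index or None).

    servers: list of dicts with keys 'cpu', 'mem' (currently used) and 'vnfs'
    (set of VNFs deployed on that server).
    cpu_needed/mem_needed: extra resources required so that the new and existing
    users of `vnf` can be served within [QoS_min, QoS_max].
    """
    # 1) vertical scaling: resize an instance that already hosts this VNF
    for k, srv in enumerate(servers):
        if vnf in srv["vnfs"] and srv["cpu"] + cpu_needed <= rho_max \
                and srv["mem"] + mem_needed <= eta_max:
            return "vertical", k
    # 2) horizontal scaling: deploy a new instance on any server with room
    for k, srv in enumerate(servers):
        if vnf not in srv["vnfs"] and srv["cpu"] + cpu_needed <= rho_max \
                and srv["mem"] + mem_needed <= eta_max:
            return "horizontal", k
    # 3) otherwise offload the user to the cloud
    return "cloud", None


servers = [{"cpu": 48.0, "mem": 30.0, "vnfs": {0}},
           {"cpu": 10.0, "mem": 10.0, "vnfs": set()}]
print(greedy_assign(vnf=0, servers=servers, cpu_needed=5.0, mem_needed=2.0,
                    rho_max=50.0, eta_max=50.0))
```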
Authorized licensed use limited to: Zhejiang Lab. Downloaded on August 02,2023 at 02:36:46 UTC from IEEE Xplore. Restrictions apply.
314 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 38, NO. 2, FEBRUARY 2020

TABLE I
VNF R ESOURCE R EQUIREMENTS

TABLE II
D ELAY PARAMETERS

TABLE III
PAT PARAMETERS

The arrival rates (λ_j^(t)) of the different VNFs at each epoch are sampled from normal distributions, where each VNF is characterized by a different mean and variance, listed in Table I. The values of the other parameters involved in the calculation of the cost function, and those used to reinforce the actor behaviours, are given in Table II. The PAT algorithm parameters are collected in Table III. We note that the values presented in Tables I, II and III for the numerical simulations are chosen as reasonable values that lead to a balanced allocation of the available resources between the servers and the cloud. Naturally, the values of these parameters in practice depend highly on the implementation and the technology used (memory/CPU capability), as well as on the VNFs being considered; however, our problem formulation is general, and we have reached similar observations for a large variety of parameter values.
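As an illustration of this traffic model, the short snippet below draws per-VNF arrival rates from Gaussian distributions with VNF-specific means and variances. The numerical values are placeholders standing in for the entries of Table I, not the parameters actually used in our simulations.

import numpy as np

rng = np.random.default_rng(0)

# Placeholder per-VNF traffic statistics standing in for Table I:
# (mean, standard deviation) of the arrival rate of each VNF.
traffic_stats = [(4.0, 1.0), (2.5, 0.5), (6.0, 2.0)]  # N = 3 VNFs in this sketch

def sample_arrival_rates(stats, rng):
    """Draw lambda_j^(t) for every VNF j at the current epoch."""
    rates = [rng.normal(mu, sigma) for mu, sigma in stats]
    return [max(r, 0.0) for r in rates]  # arrival rates cannot be negative

for t in range(3):  # a few epochs
    print(f"epoch {t}:", sample_arrival_rates(traffic_stats, rng))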
C. PAT Performance

The proposed PAT algorithm is run using 10 different seeds, and the average learning curves are depicted in Figure 4. Given the cost function in (1), it can be seen that the agent maximizes the SLA to the point where the cost function becomes negative, meaning that the perceived QoS outweighs the weighted combination of the latency and the economical costs. Thus, given the predefined cost function, the agent learns to minimize the cost and to utilize the servers in an online manner. That is, since the agent generates its own dataset on the fly by interacting with the environment, variations in the environment are directly fed back into the model by adding new traces to the memory buffer. Therefore, when a statistically significant change occurs, it is captured by the model.

Fig. 4. Evolution of the network cost in Eqn. (1).

D. Mapping to KPIs

We now define three KPIs for MANO in 5G networks, and map the results obtained with the PAT algorithm to these KPIs. The following KPIs are of interest for future 5G networks [33] (an illustrative computation of the first KPI is sketched after the list):
• Resource Utilisation Efficiency: Given the CPU and memory resources, resource utilisation efficiency is defined as the ratio of utilized resources to the total resources available for the execution of a VNF, for a particular number of users. With the elastic functions employed in our model, the system should achieve a higher resource utilisation efficiency, since it can shelter a much larger number of users over the same physical infrastructure.


Fig. 5. Server resource utilization.

• Cost Efficiency Gain: This metric captures the average cost of deploying and maintaining the network infrastructure to provide the required service to its users. Given the elastic nature of the VNFs deployed, the CU should be able to optimally dimension the network such that fewer resources are required to support the same services; in addition, the elastic system should avoid the usage of unnecessary resources.
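As a simple illustration of how the first KPI can be evaluated from simulation logs, the snippet below computes resource utilisation efficiency as the ratio of utilised to available CPU and memory over a run. The logged arrays and capacities are hypothetical placeholders for the quantities behind Figure 5, not measured results.

import numpy as np

# Hypothetical per-time-slot logs (placeholders for the data behind Fig. 5).
cpu_used = np.array([120.0, 200.0, 180.0])   # total CPU allocated across the servers
mem_used = np.array([100.0, 150.0, 170.0])   # total memory allocated across the servers
cpu_total, mem_total = 10 * 50.0, 10 * 50.0  # K = 10 servers, 50 units of each resource

def resource_utilisation_efficiency(used, total):
    """Average ratio of utilised to available resources over the run."""
    return float(np.mean(used / total))

print("CPU efficiency:", resource_utilisation_efficiency(cpu_used, cpu_total))
print("Memory efficiency:", resource_utilisation_efficiency(mem_used, mem_total))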
E. PAT Evaluation

In Figures 3 and 5, we present the results of the proposed PAT approach and those of the other benchmark algorithms. The comparison is carried out under exactly the same traffic patterns, i.e., the same arrival and departure times.

From Figures 3(a), 3(b) and 6 it can be observed that, for the particular set of parameters chosen, the users offloaded to the cloud experience higher delays than those served by the CU. On the contrary, in terms of financial cost, VNFs instantiated in the CU are more expensive than those instantiated in the cloud. That explains why the scheme with the lowest cloud utilization, i.e., greedy, has the lowest delay but the highest financial cost per user.

Furthermore, Figure 6 shows that all the DRL algorithms decide to allocate more users to the cloud than to the CU. This is mainly due to two factors affecting the learning process:
1) Cloud offloading does not carry any penalty. Contrary to VNF allocation in the CU, where the algorithms are penalised if the physical limitations of the servers are not respected, or if the allocated resources are not enough to fulfill the SLA, offloading a VNF to the cloud always yields a positive reward.
2) QoS drives the learning experience. Since ω3 is greater than the other two weights in the cost function of Eqn. (1), the algorithms aim to maximize the SLA term (given the corresponding RL reward function). As allocating a VNF to the CU may lead to a lower QoS if the CU fails to provide the maximum demanded resources, the DRL algorithms tend to use the cloud, where QoS_max is guaranteed.

1) Resource Utilisation Efficiency: Figure 5 shows that the PAT algorithm leads to a more efficient usage of the CPU and memory resources compared to A3C, DDPG and DDQN, as for a similar CPU and memory utilization (Figures 5(a) and 5(b)) less traffic is offloaded to the cloud (Figure 6). The efficient usage of resources accomplished by PAT is also visible in Figures 3(a) and 3(b). Even though similar resources are utilized by the other DRL algorithms, the latency cost and the financial cost achieved by PAT are lower (on average). The greedy approach aims to allocate as many users as possible to the CU, which is why its CU resources are fully utilised most of the time and the average delay of its users is the lowest, while its financial cost is the highest.

2) Cost Efficiency Gain: The comparison between the average economical cost of the PAT deployment and that of the other schemes is presented in Figure 3(b), where a gain in economical cost by the PAT algorithm is clearly visible. The economical cost difference between PAT and greedy is straightforward, as the greedy configuration entails a higher cost due to its heavier use of CU resources. The reduction in financial cost of PAT compared to the other DRL algorithms arises because the latter allocate more CPU and memory than needed: most of their traffic is directed to the cloud, and the CU resources are underutilized, incurring a higher cost per user.

3) Network Cost: Figure 3(d) shows that the proposed PAT algorithm outperforms the other approaches on the main metric of this work, the network cost defined in Eqn. (1). PAT finds a middle point between directing traffic to the CU and offloading it to the cloud, and the optimization of the server selection and the resource allocation is done jointly. Contrary to A3C, DDQN and DDPG, where the training is done iteratively, the PAT algorithm propagates the gradient of the value estimates obtained by the critics to the parameter network and the server selection network at the same time, pushing both networks towards more optimal points simultaneously and improving the efficiency of each training update.
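To make this joint update concrete, the following PyTorch-style sketch shows how a single critic estimate can be back-propagated into both the discrete server-selection actor and the continuous parameter actor in one optimisation step. It is a simplified illustration under assumed network sizes, a single critic, and a softmax relaxation of the discrete action; it is not the full PAT architecture, which uses twin critics, target networks, and the exact state and action definitions introduced earlier in the paper.

import torch
import torch.nn as nn

STATE_DIM, K, PARAM_DIM = 16, 10, 2  # assumed dimensions: K servers, CPU + memory parameters

server_actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                             nn.Linear(64, K + 1))                    # K servers + cloud (logits)
param_actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                            nn.Linear(64, PARAM_DIM), nn.Sigmoid())   # scaled CPU/memory in (0, 1)
critic = nn.Sequential(nn.Linear(STATE_DIM + (K + 1) + PARAM_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))                              # Q(state, server action, parameters)

actor_opt = torch.optim.Adam(list(server_actor.parameters()) +
                             list(param_actor.parameters()), lr=1e-3)

state = torch.randn(32, STATE_DIM)  # a batch of states, e.g., sampled from a replay buffer

# One joint actor update: the critic's value estimate is differentiated with
# respect to BOTH actors' outputs, so a single backward pass pushes the server
# selection and parameter networks simultaneously. (The critic itself would be
# trained separately on TD targets in a full implementation.)
logits = server_actor(state)
server_probs = torch.softmax(logits, dim=-1)  # relaxed discrete action
params = param_actor(state)                   # continuous CPU/memory decision
q_value = critic(torch.cat([state, server_probs, params], dim=-1))

actor_loss = -q_value.mean()  # ascend the critic's value estimate
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

The key point is that actor_loss.backward() computes gradients for both actors from the same value estimate, so each optimisation step moves the server selection and the resource parameters jointly, in contrast to the iterative training of the DRL baselines.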


Fig. 6. Cloud utilization.

Fig. 7. Network cost with constant λj.

It is somewhat surprising that the greedy algorithm performs comparably to, or at some time periods even better than, the baseline DRL algorithms. This is because the DRL algorithms, in contrast to PAT, have delays in adapting to the randomness of the environment, or tend to slightly over-provision resources in order to cope with highly time-variant traffic demands. If, however, we keep the traffic statistics (arrival rates) constant, we can see in Figure 7 that the baseline DRL algorithms outperform greedy, while PAT is still the best performing algorithm. This shows that PAT not only outperforms the other baselines in exploiting the resources in the most efficient manner in a static environment, but is also the fastest in adapting to variations in the environment.

VII. CONCLUSIONS AND FUTURE WORK

We presented a novel DRL algorithm for autonomous MANO of VNFs, where the CU learns to re-configure resources (CPU and memory), to deploy new VNF instances, or to offload VNFs to a central cloud. We formulated the corresponding stochastic resource allocation problem as an MDP. Then, we proposed a DRL-based solution for this MANO problem, presenting a novel algorithm named PAT, which leverages the actor-critic method to learn to provision network resources in an online manner, given the current network state and the requirements of the deployed VNFs. The novel architecture implements two critics for action-value function estimation (twin), while two actor networks are used to determine the action and the parameter. A deterministic policy is implemented for both the action and the parameter selection. We have shown that the proposed solution outperforms all benchmark DRL schemes as well as heuristic greedy allocation in a variety of network scenarios, including static traffic arrivals as well as highly time-varying traffic settings. To the best of our knowledge, this is the first work that considers DRL for network MANO of VNFs.

As a future research direction, we consider addressing the MANO of VNF chains. The problem addressed in this work does not take into account the likely relationships between different VNFs that form VNF chains, where NFs may have a temporal ordering in which they are requested by users. This factor greatly increases the complexity of resource allocation, as the overall user experience might be affected by a subtle VNF resource modification.

REFERENCES

[1] C. Liang and F. R. Yu, “Wireless network virtualization: A survey, some research issues and challenges,” IEEE Commun. Surveys Tuts., vol. 17, no. 1, pp. 358–380, 1st Quart., 2015.
[2] B. Han, V. Gopalakrishnan, L. Ji, and S. Lee, “Network function virtualization: Challenges and opportunities for innovations,” IEEE Commun. Mag., vol. 53, no. 2, pp. 90–97, Feb. 2015.
[3] M. M. Murthy, H. A. Sanjay, and J. Anand, “Threshold based auto scaling of virtual machines in cloud environment,” in Proc. IFIP Int. Conf. Netw. Parallel Comput., Springer, 2014, pp. 247–256.
[4] T. Lorido-Botran, J. Miguel-Alonso, and J. A. Lozano, “A review of auto-scaling techniques for elastic applications in cloud environments,” J. Grid Comput., vol. 12, no. 4, pp. 559–592, Dec. 2014.
[5] Amazon Web Services. (2016). Fleet Management Made Easy With Auto Scaling. [Online]. Available: https://aws.amazon.com/blogs/compute/fleet-management-made-easy-with-auto-scaling/
[6] Microsoft Azure. (2018). Azure Resource Manager Overview. [Online]. Available: https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-overview
[7] S. Dutta, S. Gera, A. Verma, and B. Viswanathan, “Smartscale: Automatic application scaling in enterprise clouds,” in Proc. IEEE 5th Int. Conf. Cloud Comput. (CLOUD), Jun. 2012, pp. 221–228.
[8] L. Yazdanov and C. Fetzer, “VScaler: Autonomic virtual machine scaling,” in Proc. IEEE 6th Int. Conf. Cloud Comput., Jun. 2013, pp. 212–219.
[9] X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji, and M. Bennis, “Optimized computation offloading performance in virtual edge computing systems via deep reinforcement learning,” IEEE Internet Things J., vol. 6, no. 3, pp. 4005–4018, Jun. 2018.
[10] P. Tang, F. Li, W. Zhou, W. Hu, and L. Yang, “Efficient auto-scaling approach in the Telco cloud using self-learning algorithm,” in Proc. Global Commun. Conf. (GLOBECOM), Dec. 2015, pp. 1–6.
[11] S. Rahman, T. Ahmed, M. Huynh, M. Tornatore, and B. Mukherjee, “Auto-scaling VNFs using machine learning to improve QoS and reduce cost,” in Proc. IEEE Int. Conf. Commun. (ICC), May 2018, pp. 1–6.
[12] R. Li et al., “Deep reinforcement learning for resource management in network slicing,” IEEE Access, vol. 6, pp. 74429–74441, 2018.
[13] X. Chen et al., “Multi-tenant cross-slice resource orchestration: A deep reinforcement learning approach,” IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2377–2392, Oct. 2019.


[14] C. Qi, Y. Hua, R. Li, Z. Zhao, and H. Zhang, “Deep reinforcement learning with discrete normalized advantage functions for resource management in network slicing,” IEEE Commun. Lett., vol. 23, no. 8, pp. 1337–1341, Aug. 2019.
[15] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” 2015, arXiv:1509.02971. [Online]. Available: https://arxiv.org/abs/1509.02971
[16] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” 2018, arXiv:1802.09477. [Online]. Available: https://arxiv.org/abs/1802.09477
[17] M. Hausknecht and P. Stone, “Deep reinforcement learning in parameterized action space,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2016.
[18] P. Sharma, L. Chaufournier, P. Shenoy, and Y. C. Tay, “Containers and virtual machines at scale: A comparative study,” in Proc. 17th Int. Middleware Conf., New York, NY, USA: ACM, 2016, Art. no. 1.
[19] Docker. (2018). What is a Container. [Online]. Available: https://www.docker.com/resources/what-container
[20] R. Morabito, “Power consumption of virtualization technologies: An empirical investigation,” in Proc. IEEE/ACM 8th Int. Conf. Utility Cloud Comput. (UCC), Dec. 2015, pp. 522–527.
[21] S. F. Piraghaj, A. V. Dastjerdi, R. N. Calheiros, and R. Buyya, “A framework and algorithm for energy efficient container consolidation in cloud data centers,” in Proc. IEEE Int. Conf. Data Sci. Data Intensive Sys. (DSDIS), Dec. 2015, pp. 368–375.
[22] S. Maheshwari, S. Deochake, R. De, and A. Grover, “Comparative study of virtual machines and containers for DevOps developers,” 2018, arXiv:1808.08192. [Online]. Available: https://arxiv.org/abs/1808.08192
[23] R. Sutton and A. G. Barto, Introduction to Reinforcement Learning, vol. 135. Cambridge, MA, USA: MIT Press, 1998.
[24] S. Rahman, T. Ahmed, M. Huynh, M. Tornatore, and B. Mukherjee, “Auto-scaling network resources using machine learning to improve QoS and reduce cost,” 2018, arXiv:1808.02975. [Online]. Available: https://arxiv.org/abs/1808.02975
[25] D. Gutierrez-Estevez et al., “The path towards resource elasticity for 5G network architecture,” in Proc. IEEE Wireless Commun. Netw. Conf. (WCNCW), Apr. 2018, pp. 214–219.
[26] K. Narasimhan, T. D. Kulkarni, and R. Barzilay, “Language understanding for text-based games using deep reinforcement learning,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2015.
[27] D. Gunduz, P. de Kerret, N. D. Sidiropoulos, D. Gesbert, C. R. Murthy, and M. van der Schaar, “Machine learning in the air,” IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2184–2199, Oct. 2019.
[28] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[29] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, “Dueling network architectures for deep reinforcement learning,” 2015, arXiv:1511.06581. [Online]. Available: https://arxiv.org/abs/1511.06581
[30] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proc. ICML, 2014.
[31] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. 13th Int. Conf. Artif. Intell. Statist., 2010, pp. 249–256.
[32] V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
[33] Architecture and Mechanisms for Resource Elasticity Provisioning, document EU H2020 project 5G-MoNArch, Deliverable D4.1, 2018.

Joan S. Pujol Roig received the joint B.S. and M.S. degrees in telecommunication engineering from the Polytechnic University of Catalonia in 2014. He is currently pursuing the Ph.D. degree with the Department of Electronic Engineering, Imperial College London. From August 2018 to February 2019, he held an internship position at the 5G Division, Samsung Electronics R&D Institute U.K. His current research interests include information and coding theory, interference management in cache-aided networks, resource optimization of wireless networks, deep learning, and deep reinforcement learning.

David M. Gutierrez-Estevez received the Engineering degree in telecommunications (Hons.) from the Universidad de Granada, Spain, and the M.S. and Ph.D. degrees from the Georgia Institute of Technology, Atlanta, USA. He is a Principal Research and Standards Engineer with the Samsung Electronics R&D Institute U.K., where he currently attends 3GPP meetings as a global Samsung Delegate in SA2 on data analytics for automation and other network architecture topics. He was supported by graduate fellowships from Fundacion la Caixa and Fundacion Caja Madrid from Spain. He developed his Ph.D. thesis at the Broadband Wireless Networking Laboratory under the supervision of Prof. I. F. Akyildiz, where he received the Researcher of the Year Award in 2013 for outstanding research contributions. From September 2014 to September 2015, he was a Principal Research Engineer with Huawei Technologies, Silicon Valley. Previous to that, he held an internship position at the Corporate R&D Division, Qualcomm, and Research Assistant and intern positions at the Fraunhofer Heinrich Hertz Institute and the Fraunhofer Institute for Integrated Circuits in Germany. He joined Samsung in January 2016, where he has been a Work Package Leader involved in several 5GPPP projects and working groups, leading Samsung’s involvement in the phase II project 5G-MoNArch on end-to-end network architecture. He was the technical manager of the successful phase III project proposal 5G-TOURS, a 15M EUR effort with nearly 30 partners that started in June 2019, aimed at developing 5G advanced vertical trials for Europe. His work accumulates over 1000 citations, and he is the co-inventor of multiple patents and patent applications. He has served as an Associate Editor of Computer Networks (Elsevier) and as a Track Chair and TPC member of major IEEE conferences such as INFOCOM, ICC, GLOBECOM, and VTC.

Deniz Gündüz (S’03–M’08–SM’13) received the Ph.D. degree from the NYU Tandon School of Engineering (formerly Polytechnic University) in 2007. After the Ph.D. degree, he served as a Post-Doctoral Research Associate with Princeton University and as a consulting Assistant Professor with Stanford University. He was a Research Associate with CTTC, Barcelona, Spain, until September 2012, when he joined the Electrical and Electronic Engineering Department, Imperial College London, U.K., where he is currently a Reader (Associate Professor) in information theory and communications and leads the Information Processing and Communications Laboratory (IPC-Lab). His research interests include communications and information theory, machine learning, and privacy. He is a Distinguished Lecturer of the IEEE Information Theory Society. He was a recipient of the Starting Grant of the European Research Council (ERC) in 2016, the IEEE Communications Society Communication Theory Technical Committee (CTTC) Early Achievement Award in 2017, and several best paper awards, including at IEEE ISIT, WCNC, and GlobalSIP. He served as the General Co-Chair of the 2016 IEEE Information Theory Workshop, the 2018 International ITG Workshop on Smart Antennas, and the 2019 London Symposium on Information Theory. He also served as a Guest Editor of the IEEE JSAC special issue on Machine Learning in Wireless Communication in 2019. He is an Editor of the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS and the IEEE TRANSACTIONS ON GREEN COMMUNICATIONS AND NETWORKING.

