
SPECIAL SECTION ON MOBILE MULTIMEDIA: METHODOLOGY AND APPLICATIONS

Received December 7, 2019, accepted December 19, 2019, date of publication December 23, 2019, date of current version January 2, 2020.
Digital Object Identifier 10.1109/ACCESS.2019.2961914

INVITED PAPER

Smart Power Control for Quality-Driven Multi-User Video Transmissions: A Deep Reinforcement Learning Approach
TICAO ZHANG AND SHIWEN MAO, (Fellow, IEEE)
Department of Electrical and Computer Engineering, Auburn University, Auburn, AL 36849-5201, USA
Corresponding author: Shiwen Mao ([email protected])
This work was supported in part by the NSF under Grant IIP-1822055 and Grant ECCS-1923717, and in part by the Wireless Engineering
Research and Education Center (WEREC), Auburn University, Auburn, AL, USA.

ABSTRACT Device-to-device (D2D) communications have been regarded as a promising technology to meet the dramatically increasing video data demand in the 5G network. In this paper, we consider the power control problem in a multi-user video transmission system. Due to the non-convex nature of the optimization problem, it is challenging to obtain an optimal strategy. In addition, many existing solutions require instantaneous channel state information (CSI) for each link, which is hard to obtain in resource-limited wireless networks. We develop a multi-agent deep reinforcement learning-based power control method, where each agent adaptively controls its transmit power based on the observed local states. The proposed method aims to maximize the average quality of the received videos of all users while satisfying the quality requirement of each user. After off-line training, the method can be implemented in a distributed manner such that all the users can reach their target state from any initial state. Compared with conventional optimization-based approaches, the proposed method is model-free, does not require CSI, and is scalable to large networks.

INDEX TERMS Multi-user video transmission, multi-agent deep reinforcement learning, power control,
quality of experience.

I. INTRODUCTION
Due to the popularization of wireless multimedia communication services and applications, such as mobile TV, 3D video, 360-degree video, multi-view video, and augmented reality (AR), there is an explosive growth of mobile data traffic. It is expected that mobile traffic will increase seven-fold from 2017 to 2022 [1]. Moreover, 82% of the mobile data will be video related by 2022 [1]. The dramatically increasing video data demand brings great challenges to the present and future wireless networks [2].

Device-to-device (D2D) communications have been regarded as an emerging 5G communication technology to meet the increasing data demand [3]–[5]. In D2D communications, nearby devices can establish local links so that traffic flows directly between them instead of through a base station (BS). As a result, the system spectrum efficiency and the system coverage can be potentially improved, and delay can be significantly reduced. However, interference management becomes a challenging problem in the presence of D2D links [6]. Specifically, in a multi-user communication network, a transmitter may increase its transmit power to ensure a better video quality for its corresponding receiver, but at the same time, it may degrade the performance of the links it interferes with.

Transmit power control, as a physical layer issue, has been well studied since the first generation of cellular networks [7]. Many centralized interference management methods have been developed. The weighted minimum mean square error (WMMSE) algorithm [8] and the fractional programming (FP) algorithm [9] are typical centralized algorithms. These algorithms often require precise channel state information (CSI) for all the links, which incurs considerable signaling overhead. Moreover, the complexity of centralized algorithms increases with the number of users, bringing heavy computational pressure on the power controller.

(The associate editor coordinating the review of this manuscript and approving it for publication was Dapeng Wu.)


To reduce the signaling overhead and better adapt to large-scale networks, a series of distributed algorithms have been developed. For example, in [10], the power allocation problem in cognitive wireless networks was formulated as a noncooperative game, and a stochastic power allocation approach based on conjecture-based multi-agent Q-learning was proposed. The authors in [11] proposed a Stackelberg game based power control scheme for D2D communication underlay cellular networks. By introducing a new co-tier price factor, the distributed power control algorithm can mitigate the cross-tier interference effectively. Despite their good performance, current solutions often require frequent information exchange and cannot guarantee optimal performance.

Meanwhile, we have observed that these physical layer technologies generally aim to optimize the transmission data rate or bit error rate (BER); they do not directly improve the user's quality of service (QoS) or quality of experience (QoE) when users are watching a specific video. Given the same transmission bandwidth, different videos generally have different qualities. As an application layer performance metric, video quality directly reflects the user satisfaction level, in contrast to physical layer metrics. In future mobile networks, it is more important to develop a cross-layer interference management approach that jointly considers the physical layer issues as well as the user's requirement and experience [12], [13]. Motivated by this observation, some cross-layer video transmission designs have been proposed. For example, the authors in [14] designed a quality-driven scalable video transmission framework in a non-orthogonal multiple access (NOMA) system and proposed a suboptimal power allocation algorithm. This algorithm leverages the hidden monotonic property of the problem and has a polynomial time complexity. The recent work [15] proposed a spatial modulation (SM) and NOMA integrated system for multiuser video transmission, where efficient algorithms are proposed to perform optimal power control so that the user's QoE can be maximized. A novel cross-layer optimization framework is proposed in [16] for scalable video transmission over OFDMA networks; the proposed iterative algorithm can jointly maximize the achievable sum rate and minimize the distortion among multiple videos. In our recent work [17], a cross-layer optimization framework for softcast video transmission is developed and analyzed. Compared with physical layer-only designs, such cross-layer optimization for video transmissions helps users enjoy a better perceived video quality. Despite the success of these algorithms, they require every user to have full knowledge of the CSI for all the links, which may be infeasible in practice. Besides, the formulated problem is generally non-convex, and the developed methods often lead to a sub-optimal solution.

Recently, machine learning (ML) has achieved great success in a variety of fields, such as computer vision and speech recognition. Deep reinforcement learning (DRL), as a powerful ML technique, has shown high potential for many challenging tasks, such as human-level control [18] and computer games [19]. In DRL, the agent considers the long-term reward, rather than simply obtaining the instant maximum reward. This is quite important for resource optimization problems in wireless networks, where the channel state changes rapidly. There is now an increasing interest in incorporating DRL into the design of wireless networking algorithms [20], such as mobile off-loading [21], dynamic channel access [22], [23], mobile edge computing and caching [24], [25], dynamic base station on/off switching [26], TCP congestion control [27], and resource allocation [28]–[31].

In particular, the authors in [28] consider the problem of power control in a cognitive radio system consisting of a primary user and a secondary user. With DRL, the secondary user can interact with the primary user efficiently to reach a target state after a small number of steps. Another work in [29] demonstrates the potential of DRL for power control in wireless networks. Instead of searching for the near-optimal solution by solving the challenging optimization problem, the authors develop a distributed dynamic power control scheme. This method is model-free and the system's weighted sum rate can be maximized. The authors in [30] investigated the spectrum sharing problem in vehicular networks with a DRL based solution. The multiple vehicle-to-vehicle (V2V) agents dynamically allocate their power and spectrum in a cooperative way so that their sum capacity can be maximized.

In this paper, we consider the power allocation and interference management problem in a multi-user video transmission system from the point of view of cross-layer optimization. To the best of our knowledge, this is the first work that attempts to integrate DRL for interference management to improve users' video viewing quality. The main contributions of this paper are summarized as follows.
• The proposed algorithm is based on multi-agent deep Q-learning, which is amenable to distributed implementation. It is model-free and does not require labeled training data. It can be applied to arbitrary network configurations.
• Each agent does not need to know other agents' CSI. The complexity of the proposed algorithm does not increase with the network size. This method can be applied to very large networks.
• This work is a cross-layer design which considers both the physical layer issues as well as the application layer video-related design factors. By properly designing the reward function, users can actually work in a cooperative manner to achieve a high level of satisfaction.

The remainder of this paper is organized as follows. The system model and the problem formulation are discussed in Section II. In Section III, we develop a multi-agent DRL algorithm for power control. The simulation setup is provided in Section IV. Experimental results are given in Section V, followed by conclusions in Section VI.

II. SYSTEM MODEL AND PROBLEM FORMULATION
A. PHYSICAL LAYER MODEL
We consider a wireless network consisting of N users, where all the users share a common spectrum resource.


FIGURE 1. System model for a radio network with multiple video users (N = 2).

As shown in Fig. 1, each user consists of a transmitter and receiver pair. Each receiver requests a specific video from its corresponding transmitter. We assume a cooperative system in which different users can exchange information with each other, including channel gains, power control vectors, and some acknowledgment (ACK) signals. The information exchange process can be implemented to occur once per time slot, either in a wireless or a wired manner. Conventional technologies, such as Zigbee [32], can be used to convey this information to other users in a timely fashion. Note that Zigbee uses a different frequency, hence it generates no interference to the video users. The users dynamically adjust their transmit power based on the information collected from their neighboring users. Each user has a minimum QoE requirement for the received video. We aim to develop an optimal power control policy so that the combined QoE is maximized and all the users' minimum QoE requirements are satisfied.

Let p_i, i = 1, 2, ..., N, denote the transmit power of user i. Let h_ij be the channel gain from transmitter Tx_i to receiver Rx_j, i, j ∈ {1, 2, ..., N}. The signal-to-interference-plus-noise ratio (SINR) at receiver i can be computed as

  SINR_i = |h_ii| p_i / ( Σ_{j≠i} |h_ji| p_j + σ_i² ),  i, j ∈ {1, 2, ..., N},   (1)

where σ_i² is the noise power at receiver i. We consider a free-space propagation model, so the channel gain is

  h_ij = ( λ / (4π d_ij) )²,   (2)

where λ is the signal wavelength and d_ij is the distance between transmitter Tx_i and receiver Rx_j. We denote the distance matrix as D = [d_ij].

Since all the users share the same frequency spectrum for video transmissions, they have the same bandwidth B. The data transmission rate for user i can be expressed as

  R_i(p) = B log2(1 + SINR_i),   (3)

where p = [p_1, p_2, ..., p_N] is the transmit power allocation vector. It can be seen that the transmission rate of each user is determined by the transmit power allocation vector.

B. VIDEO TRANSMISSION MODEL
For video applications, PSNR is a common objective performance measure, which is highly correlated with user-perceived video quality. The relationship between the PSNR value Q and distortion is given by

  PSNR = Q = 10 log10( 255² / MSE ),   (4)

where the mean-squared error (MSE) is used to characterize distortion.

In [33], the authors propose a general semi-analytical rate-distortion (R-D) model, which has been verified for scalable video coding (SVC) in [34]. With this model, the relationship between the rate and distortion at the encoder side can be predicted. Specifically, the video coding rate for user i can be expressed as a function of the PSNR Q_i as follows.

  F_i(Q_i) = θ_i / ( 255² · 10^(−Q_i/10) + α_i ) + β_i,  Q_i ≥ Q_{i,min},   (5)

where Q_{i,min} is the minimum PSNR value corresponding to the minimum rate F_{i,min}. The parameters θ_i, α_i, and β_i depend on the video content, the encoder, and the RTP packet loss rate. These parameters can be obtained with a curve-fitting method over at least six empirical R-D samples [34], [35] and a relevant number of iterations to achieve a high accuracy. The authors in [36] further simplify this model to reduce complexity by eliminating the parameter α_i, i.e.,

  F_i(Q_i) = θ_i / ( 255² · 10^(−Q_i/10) ) + β_i,  Q_i ≥ Q_{i,min}.   (6)

In this case, only four R-D samples are sufficient to determine the model. In this paper, we adopt the simplified model, although the developed method also applies to any other R-D model.

Without loss of generality, we assume that the overhead introduced by the network stack layers (e.g., header and trailer bits) is constant, so we ignore this overhead for simplicity. As a result, the physical layer rate (3) is assumed to be equal to the transmission rate at the application layer (6), i.e., R_i(p) = F_i(Q_i). The relationship between the PSNR of a received video and the corresponding transmit power can thus be expressed as

  Q_i(p) = F_i^(−1)(R_i(p)) = −10 log10( θ_i / (R_i(p) − β_i) ) + 20 log10 255.   (7)

C. QUALITY-DRIVEN POWER ALLOCATION PROBLEM
The ultimate goal of power control is to improve the overall video quality of all users. We formulate this problem as follows.

  max_p  Q(p) = (1/N) Σ_{i=1}^{N} Q_i(p)   (8)
  s.t.   0 ≤ p_i ≤ p_max,  ∀i,   (9)
         Q_i ≥ Q_{i,min},  ∀i,   (10)

where Q_i(p) is given in (7). Note that (9) is the system power constraint and (10) is the video quality constraint, which depends on a variety of factors such as the video content, the encoder setting, and the user's quality requirement.


Based on (3) and (6), it can be seen that the PSNR is a monotone function of the SINR. The quality constraint (10) can thus be replaced by the corresponding SINR constraint. To simplify the expression, we rewrite Problem (8) as follows.

  max_p  φ( f_1(p)/g_1(p), f_2(p)/g_2(p), ..., f_N(p)/g_N(p) )   (11)
  s.t.   0 ≤ p_i ≤ p_max,  ∀i,   (12)
         SINR_i(p) ≥ SINR_{i,min},  ∀i,   (13)

where φ(x) is an increasing function on R_+^N, expressed as

  φ(x) = −(10/N) log10( Π_{i=1}^{N} θ_i / (B log2(1 + x_i)) ) + (20/N) log10 255,   (14)

and

  f_i(p) = |h_ii| · p_i,   (15)
  g_i(p) = Σ_{j≠i} |h_ji| · p_j + σ_i².   (16)

It can be seen that Problem (11) actually belongs to the class of generalized linear fractional programming (GLFP) problems. In addition, combined with the structure of the functions f_i(x) and g_i(x), this problem is actually non-convex [37]. Generally speaking, there is no efficient solution to find the global optimal solution within polynomial time.

III. THE MULTI-AGENT DEEP REINFORCEMENT LEARNING APPROACH
In the proposed multi-user video transmission system, the transmitter of each user dynamically adjusts its transmit power based on the observed environment state. The action taken at the next time slot depends on the current observations; hence the process can be modeled as a Markov decision process (MDP). We develop a multi-agent deep reinforcement learning approach to solve the problem.

A. OVERVIEW OF DEEP REINFORCEMENT LEARNING
Reinforcement learning (RL) is an effective technique to solve MDP problems. In RL, agents learn an optimal policy through interactions with the environment, by receiving an intermediate reward together with a state update after taking each action. The received reward as well as the observed new state help adjust the control policy. The process continues until an optimal policy is found.

The most representative RL algorithm is Q-learning, where the policy is updated by an action-value function, referred to as the Q-function. Let S denote the set of possible states and A denote the set of discrete actions. The policy π(s, a) is the probability of taking an action a ∈ A when given a state s ∈ S. At time instant t, the agent takes action a^t ∈ A when observing a state s^t ∈ S. Then the agent receives a reward r^t and the next state s^(t+1) is observed. The Q-learning algorithm aims to maximize a certain reward over time. For example, we can define the cumulative reward as

  R^t = Σ_{τ=0}^{∞} γ^τ r^(t+τ),   (17)

where γ ∈ [0, 1) is a discount factor representing the tradeoff between the immediate and future rewards. γ = 0 means we only care about the immediate reward, while a larger γ means that future rewards play a more important role.

Under a policy π(s, a), the Q-function of the agent with action a and state s is defined as

  Q^π(s, a) = E_π[ R^t | s^t = s, a^t = a ].   (18)

Q-learning aims to maximize the Q-function (18). The optimal action-value function, Q*(s, a) ≜ max_π Q^π(s, a), obeys the Bellman optimality equation:

  Q*(s, a) = E_{s^(t+1)}[ r^(t+1) + γ max_{a'} Q*(s^(t+1), a') | s^t = s, a^t = a ],   (19)

where s^(t+1) is the new state after executing the state-action pair (s, a). Let q(s, a) be the state-action value function in the iteration process. Q-learning updates q(s^t, a^t) as follows.

  q(s^t, a^t) ← q(s^t, a^t) + δ [ r^(t+1) + γ max_{a'} q(s^(t+1), a') − q(s^t, a^t) ],   (20)

where δ is the learning rate.

Q-learning uses a Q-table to approximate the Q-function. When the state and action spaces are discrete and small, learning the optimal policy π is possible with Q-learning. However, when the state and action spaces become continuous and large, the problem becomes intractable. Deep Q-learning (DQL) utilizes a deep Q-network (DQN), i.e., a deep neural network (DNN), to approximate the mapping table, and thus inherits the advantages of both RL and deep learning.

Suppose the DQN is expressed as q(:, :, Θ^t), where Θ^t are the parameters of the DQN. Following the quasi-static target network method [18], we define two DQNs: the target DQN with parameters Θ^t_target and the trained DQN with parameters Θ^t_train. Θ^t_target is updated to be equal to Θ^t_train once every T_u time slots. Using the target network helps stabilize the overall network performance. Instead of training with only the current experience, the DQN uses a randomly sampled mini-batch from the experience replay memory, which stores the recent tuples (s^t, a^t, r^t, s^(t+1)).

With experience replay, the least-squares loss of the trained DQN for a sampled mini-batch D^t can be defined as

  L(Θ^t_train) = Σ_{(s^t, a^t, r^t, s^(t+1)) ∈ D^t} ( y_t^DQN(r^t, s^(t+1)) − Q(s^t, a^t; Θ^t_train) )²,   (21)

where the target output is

  y_t^DQN(r^t, s^(t+1)) = r^t + γ · max_{a'} Q(s^(t+1), a'; Θ^t_target).   (22)

This experience replay strategy helps prevent the learned policy from being trapped in a local minimum. In each training step, the stochastic gradient descent algorithm is used to minimize the training loss (21) over the mini-batch D^t.


B. MULTI-AGENT DRL FOR RESOURCE ALLOCATION
In the resource sharing scenario illustrated in Fig. 1, multiple users attempt to transmit video data to their target receivers, which can be modeled as a multi-agent DRL problem. Each user is an agent and interacts with the unknown communication environment to gain experience. The experience is then used to guide the transmit power control policy. At first glance, the power allocation problem seems to be a competitive game: if each agent maximizes its transmit power, the other users may receive severe interference. In this paper, we turn this competitive game into a cooperative game by properly designing the reward function. This way, the global system performance can be optimized.

The multi-agent RL based approach is divided into two phases: (i) the offline training phase and (ii) the online implementation phase. We assume that the system is trained in a centralized way but implemented in a distributed manner. To be more specific, in the training phase, each agent adjusts its actions based on a system performance-oriented reward. In the implementation phase, each agent observes its local states and selects the optimal power control action.

As shown in Fig. 2, each agent receives a local observation of the environment and then takes an action. These actions form a joint action vector. The agents then receive a joint reward and the environment evolves to the next state. The new local states are observed by the corresponding agents. When the reward is shared by all the agents, cooperative behavior is encouraged.

FIGURE 2. The multi-agent DRL model.

For each agent, the power control process is an MDP. Independent Q-learning [38] is one of the most widely used methods to solve the MDP problem with multiple agents. In independent Q-learning, each agent learns a decentralized policy based on its local observation and action, treating other agents as part of the environment. Note that each agent would face a non-stationary problem as other learning agents are updating their policies simultaneously. One promising solution is to use a single-agent DQN, which computes the joint actions for all agents [39]. However, the complexity will grow proportionally to the size of the joint state-action space. Moreover, the single-agent approach is not suitable for distributed implementation, which may limit its use in large networks. Recently, there have been several multi-agent DRL variants; however, there are no theoretical guarantees despite their promising empirical performance [40], [41]. In this paper, we limit the convergence analysis to providing simulation results in Section V, an approach also employed in similar prior works [42], [43]. Specifically, we investigate the impact of the learning rate on the convergence performance.

C. MDP ELEMENTS
As depicted in Fig. 2, we propose a multi-agent DRL approach where each user serves as an agent. In order to utilize DRL for power control, the state space, the action space, and the reward function need to be properly designed.

1) STATE SPACE
At time slot t, the observed state of each agent i is defined as s^t = {[I_1^t, I_2^t, ..., I_N^t], p_i^t, Γ_i^t}, where I_i^t is an indicator function, which shows whether the quality requirement of user i is satisfied or not. Specifically, it is defined as

  I_i^t = 1 if Q_i^t > Q_{i,min}, and I_i^t = 0 otherwise.   (23)

p_i^t is agent i's current transmit power and Γ_i^t is the total interference that comes from the other agents, which is defined as

  Γ_i^t = Σ_{j≠i} |h_ji| · p_j^t + σ_i².   (24)

Note that for each agent, p_i^t and Γ_i^t are local information that is readily available (no exchange needed).

2) ACTION SPACE
We assume that the transmitter of each agent chooses its transmit power from a finite set consisting of L elements,

  A = { p_max/L, 2p_max/L, ..., p_max },   (25)

where p_max is the peak power constraint for each user. As a result, the dimension of the action space is L. The agent is only allowed to pick an action a_i^t ∈ A to update its transmit power. Increasing the size of the action space may potentially increase the overall performance, but it also brings a larger training overhead and higher system complexity.
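A minimal sketch of how an agent could assemble its local observation per (23)-(24) and the discrete power set (25). The helper names and the example channel values are illustrative assumptions; only the power budget (0.4 W) and the number of levels (L = 10) come from the paper's setup.

```python
import numpy as np

def action_set(p_max, L):
    """Discrete transmit power levels per (25): {p_max/L, 2p_max/L, ..., p_max}."""
    return np.array([(l + 1) * p_max / L for l in range(L)])

def local_state(i, indicators, p, h, sigma2_i):
    """Observation of agent i per (23)-(24).

    indicators : length-N 0/1 vector, I_j^t = 1 if Q_j^t > Q_{j,min}
    p          : current transmit powers of all agents
    h          : channel gain matrix, h[j, i] = gain from Tx_j to Rx_i
    """
    N = len(p)
    gamma_i = sum(h[j, i] * p[j] for j in range(N) if j != i) + sigma2_i  # Eq. (24)
    return np.concatenate([indicators, [p[i], gamma_i]])

# Example with the paper's power budget and assumed channel values.
A = action_set(p_max=0.4, L=10)                  # 0.04 W steps up to 0.4 W
h = np.array([[4e-8, 1e-9], [1e-9, 4e-8]])       # assumed gains
s0 = local_state(0, indicators=[1, 0], p=[0.2, 0.4], h=h, sigma2_i=2e-15)
print(A, s0)
```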


3) REWARD DESIGN
One reason that makes DRL appealing is its flexibility in handling hard-to-optimize objective functions. When the system reward is properly designed according to the objective function, the system performance can be improved. For our cross-layer video quality optimization problem, the objective is to maximize the average quality of the users while also satisfying the power constraints. To achieve this goal, we define the reward function as follows [44]:

  r^t = (1/N) Σ_i q_i^t,   (26)

where

  q_i^t = Q_i^t if I_i^t = 1, and q_i^t = −100 otherwise,   (27)

such that the user's video quality constraint (10) is enforced.

So far we have assumed that all the agents share the same reward r^t and the same state s^t. In practice, such knowledge may be obtained at some additional communication cost. For the state signal, the agents only need to monitor the ACK signals sent by each other to infer whether the quality requirements are satisfied, so the communication cost would be extremely low. For the reward function, each agent computes its own quality based on (1), (3), and (7), and then broadcasts this information to the other agents via message passing [30]. For large networks, the transmission of the exact quality values may occupy considerable wireless resources. A more feasible solution is for each agent to observe only its nearby users' ACK signals and take the average quality among its neighboring users as the reward. For example, we may design the state observed by agent i as

  s_i^t = { [I_n^t | n ∈ N_i(K)], p_i^t, Γ_i^t },   (28)

where N_i(K) denotes the nearest K receivers (including agent i itself) of agent i. The reward function for each user can then be designed as

  r_i^t = (1/K) Σ_{j ∈ N_i(K)} q_j^t.   (29)

This assumption is reasonable because in large networks, only nearby D2D users are in the same interference domain.

D. LEARNING ALGORITHM
1) TRAINING STAGE
We leverage deep Q-learning with experience replay to train multiple agents for optimal power control. It has been shown that Q-learning converges to the optimal policy with probability 1 [45]. In deep Q-learning, a DQN is used to approximate the action-value function. We assume that each agent maintains a dedicated DQN that takes the current state as input and outputs the value functions corresponding to all actions.

The DQN is trained through multiple episodes. In each episode l, all agents concurrently explore the state-action space with the ε-greedy policy, i.e., the agent chooses a random action with probability ε^l and chooses the action that maximizes the estimated state-action value with probability 1 − ε^l. The ε-greedy policy helps achieve a balance between exploitation of the current best Q-value function and exploration of a better option. Each episode consists of at most T steps. In each step t, all agents collect and store the state, action, and reward tuple (s_i^t, a_i^t, r^t, s_i^(t+1)) in the experience replay memory. In each step, a mini-batch D^t is uniformly sampled from the replay memory. If all the users' video quality requirements are satisfied, the system randomly initializes the transmit power of all users and goes to the next episode. The training algorithm is presented in Algorithm 1.

Algorithm 1 Multi-Agent DRL Training Algorithm
1: Start the environment simulator, generating channels;
2: Initialize the Q-networks for all agents randomly;
3: Initialize p for all agents, and obtain s^0;
4: for each training episode do
5:   Randomly initialize the agents' transmit power;
6:   for each step do
7:     for each agent i do
8:       Observe s_i^t;
9:       Choose action a_i^t according to the ε-greedy policy;
10:    end for
11:    All agents take actions and receive the reward r^(t+1);
12:    for each agent i do
13:      Update the state s_i^(t+1);
14:      Store (s_i^t, a_i^t, r^(t+1), s_i^(t+1)) in the replay memory D_i;
15:    end for
16:    for each agent i do
17:      Uniformly sample a mini-batch from D_i;
18:      Minimize the error between the Q-network and the target network with stochastic gradient methods;
19:    end for
20:    if the QoE of each user is satisfied then
21:      Break;
22:    end if
23:  end for
24: end for

In the training stage, an agent reaches its target state if the action remains unchanged in the next state s_i^(t+1). It is easy to show that the next state s_i^(t+1) is then also a goal state. The agent will stay in the target state until the transmission is completed. As a result, the policy will converge, and we will obtain the largest estimated Q value.
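The following sketch combines the reward definition (26)-(27) and its localized variant (29) with an ε-greedy action choice as used in Algorithm 1. The penalty of −100 is taken from (27) and the 42 dB target from Section V-A; the function names and the Q-value interface are illustrative assumptions that match the earlier DQN sketch.

```python
import random
import numpy as np

def per_user_reward(psnr, psnr_min, penalty=-100.0):
    """q_i^t per (27): the PSNR if the quality target is met, a penalty otherwise."""
    return np.where(psnr > psnr_min, psnr, penalty)

def shared_reward(psnr, psnr_min):
    """System reward r^t per (26): average of the per-user terms."""
    return per_user_reward(psnr, psnr_min).mean()

def neighborhood_reward(i, psnr, psnr_min, neighbors):
    """Localized reward r_i^t per (29), averaging over the K nearest receivers."""
    q = per_user_reward(psnr, psnr_min)
    return q[neighbors[i]].mean()

def epsilon_greedy(q_values, epsilon):
    """Explore with probability epsilon, otherwise act greedily on the Q-values."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

# Example: two users with 42 dB targets; one miss pulls the shared reward down.
psnr = np.array([43.1, 40.2])
print(shared_reward(psnr, psnr_min=np.array([42.0, 42.0])))
```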


2) IMPLEMENTATION STAGE
During the implementation stage, each agent observes the environment state and then selects the action which maximizes the state-action value according to the trained Q-network. Afterwards, all agents transmit their video data with the power determined by their selected actions. The implementation algorithm is summarized in Algorithm 2. In most cases, each agent can reach its target state within one step. To handle the non-convergent cases, we add a testing loop. That is, if an agent cannot reach the target state, all the agents will update their actions based on the current state until all the agents' minimum quality requirements are satisfied.

Algorithm 2 DRL-Based Power Control Algorithm
Initialize the environment, let the agents randomly select their initial power, and obtain the initial state s^0;
1: for each agent i do
2:   for each step do
3:     Select a_i = arg max_{a∈A} Q_i(s^0, a; Θ_i*);
4:     if the quality requirement of each user is satisfied then
5:       Break;
6:     end if
7:   end for
8: end for
9: Obtain the optimal power allocation p = [a_1, a_2, ..., a_N];

Since the training procedure can be performed offline over different episodes for different network topologies, video quality requirements, video types, and channel conditions, the heavy training complexity should not be a problem in practice. Meanwhile, the online implementation complexity is extremely low, which enables many real-time applications. In practice, the trained DQN needs to be updated only when the network topology and video sequences change dramatically.

IV. SYSTEM SETUP
We next carry out experiments to validate the performance of the proposed DRL-based power control method. The maximum power (in Watt) is set to p_max = 0.4 and L is 10. The bandwidth is set to B = 500 kHz and the cell carrier frequency is set to 2.4 GHz. The noise power density for each user is −174 dBm/Hz. The distance between the transmitter and the receiver of each agent is fixed to 50 m. The agents are randomly located in a square area of 500 m × 500 m.

A. VIDEO CONFIGURATION
For simplicity, we assume our video library contains 2 video sequences in the common intermediate format (CIF). One video sequence is ''Foreman,'' which has a low spatial-temporal content complexity. The other is ''Football,'' which has a high spatial-temporal content complexity [35]. Each sequence is encoded with the High Efficiency Video Coding (HEVC) software [46]. We use the default low-delay configuration to operate the encoder with both intra encoding and motion compensation. The group of pictures (GOP) size is 4.

We enable rate control and change the target bit rate. Given a target bit rate, the video sequences are encoded into bit streams. We average the MSE between the reconstructed frame and the original frame over all the 20 frames. The PSNR value is then calculated based on (4). Based on the obtained samples, we estimate the video sequence parameters {θ_i, β_i} with a curve-fitting method. The estimated values for these parameters are listed in Table 1 and the corresponding rate-distortion curves are presented in Fig. 3.

TABLE 1. Optimal parameters for the two video sequences.

FIGURE 3. Rate-distortion curve for the two video sequences.

It can be seen that different video sequences generally exhibit quite different behaviors. For example, the rate of the video ''Football'' increases rapidly with increased PSNR value, while that of the video sequence ''Foreman'' grows quite slowly. With the same transmission rate (e.g., 500 kbps), the user who requests video sequence ''Football'' has a PSNR value of 29 dB, while the user who requests video sequence ''Foreman'' can enjoy a video quality up to around 41 dB. Hence, simply performing physical layer resource optimization may not be optimal. A cross-layer optimization is indispensable.

B. DRL PARAMETERS
In our experiments, we choose a deep neural network (DNN) to approximate the action-value function. The DNN consists of three fully connected hidden layers, which contain 32, 32, and 16 neurons, respectively. Rectified linear units (ReLUs) are used as the activation function. We adopt the Adam algorithm for loss optimization. The replay memory size is set to 200 and the batch size is set to 8. The probability of exploring new actions linearly decreases with the number of episodes, from 0.9 to 0 over the first 1000 episodes. Algorithm 1 is used to train the network and Algorithm 2 is used for distributed implementation. The hyper-parameters are listed in Table 2.

TABLE 2. DRL hyper-parameters.

V. SIMULATION RESULTS AND DISCUSSIONS
A. TWO USERS
First of all, we consider the simplest case where there are two users. The distance matrix is randomly generated. For this experiment, the distance matrix is

  D = [  50  275
        294   50 ].   (30)

The layout of the two users is shown in Fig. 4. The channel fading follows the free-space propagation model defined in (2). We also assume that the two users request the video sequences ''Football'' and ''Foreman,'' respectively. The quality requirement Q_{i,min} for both videos is set to 42 dB. Our aim is to ensure that the average quality of all the users is maximized while each user's minimum quality requirement is also satisfied.

FIGURE 4. Layout of the video users used in the simulations.

Fig. 5 shows the training loss calculated by (21) for different δ values. It can be seen that with a large learning rate, the training loss converges to 0 quickly, while with a smaller learning rate, the loss may converge slowly. For example, when δ = 0.01, even after 1500 episodes, the training loss still does not converge to 0. Meanwhile, when δ is moderate, e.g., δ = 0.5, the training loss is generally very small across all the episodes. This is also confirmed by the training reward for different values of δ plotted in Fig. 6. When δ = 0.01, the reward does not converge to a positive value, which means there is a penalty induced by some user's quality requirement not being satisfied. If we choose δ to be 0.5 or 1, the training reward stays at a stable value; however, the reward corresponding to δ = 0.5 is slightly larger than that corresponding to δ = 1. Based on these observations, a moderate value of the learning rate is preferred. In our experiments, we choose δ = 0.5.

Fig. 7 presents the two users' video quality performance versus the number of training episodes. We observe that at the beginning of the training stage, the users' video quality fluctuates slightly. This is because at first the ε value is large, so the agents tend to explore new actions. With more iterations, ε starts to decrease from 0.9 to 0. During this stage, the agent keeps exploring the unknown environment while also exploiting the gained knowledge to train the target network. After 1000 episodes, the value of ε decreases to 0 and the agent stops exploring the environment; instead, it chooses the actions that have achieved the maximum state-action values. As a result, the quality curves for the two users remain stable.

We next consider the distributed implementation stage. Note that the training stage may involve a high computational complexity; after the training process is done offline, the distributed online deployment should be very easy and fast. Fig. 8 demonstrates the performance of the proposed method. As benchmarks for the proposed algorithm, we introduce two baseline algorithms:
1) Random power method: each user randomly selects a transmit power from A;
2) Maximum power method: each user transmits its video sequences at the maximum power p_max.


We perform 1000 simulations and the users' PSNR values for the first 25 simulations are plotted in Fig. 8. In each testing episode, the agents initialize the state by randomly generating transmit powers. Then both agents observe the state and take actions according to the state-action values. Simulation results show that the agents can converge to the optimal action within one step from any initial state. In this process, no channel estimation is needed and no iterations are required, hence this approach is quite fast. Moreover, with the DRL-based approach, both users' required qualities are satisfied. As a comparison, the maximum power method and the random power method can only guarantee one user's quality requirement, while the other user's quality is below the minimum requirement.

FIGURE 5. Loss function versus the number of training episodes (N = 2).

FIGURE 6. Reward versus the number of training episodes (N = 2).

FIGURE 7. Users' QoE versus the number of training episodes (N = 2).

FIGURE 8. Users' QoE versus the number of testing episodes (N = 2).

To better compare these algorithms, we define the success rate as the ratio of the number of successful trials to the total number of tests. In all the 1000 simulations, the proposed DRL method achieves a success rate of 100%, while the success rate of the random power method is only 3% and the success rate of the maximum power method is 0. In practice, when the number of users is small and the action space is small, the users can randomly choose powers by trial and error and eventually obtain a feasible solution if one exists. However, frequent information exchange and complex iterations are usually required, which would introduce additional delays. When the network size and the action space become large, the random power allocation method will no longer work. We will demonstrate this point in the next subsection.
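A compact sketch of the evaluation just described: it runs repeated trials, applies the trained greedy policy or one of the two baselines, and counts a trial as successful when every user meets its PSNR target. The `psnr_of` callable is assumed to come from the earlier physical-layer sketch; the trial count (1000) and the 42 dB targets follow the text, and the remaining names are illustrative.

```python
import numpy as np

def random_power(A, n_users, rng):
    """Baseline 1: each user picks a random level from the action set A."""
    return rng.choice(A, size=n_users)

def max_power(A, n_users):
    """Baseline 2: every user transmits at p_max."""
    return np.full(n_users, A[-1])

def success_rate(policy, psnr_of, psnr_min, n_trials=1000, seed=0):
    """Fraction of trials in which all users meet their quality targets.

    policy  : callable(rng) -> power vector
    psnr_of : callable(power vector) -> per-user PSNR (e.g., via (1), (3), (7))
    """
    rng = np.random.default_rng(seed)
    successes = 0
    for _ in range(n_trials):
        p = policy(rng)
        successes += bool(np.all(psnr_of(p) >= psnr_min))
    return successes / n_trials

# Example wiring for the 2-user case (assumed helpers):
#   A = action_set(0.4, 10)
#   rate_random = success_rate(lambda rng: random_power(A, 2, rng), psnr_of, [42, 42])
#   rate_max    = success_rate(lambda rng: max_power(A, 2), psnr_of, [42, 42])
```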


B. LARGER NUMBER OF USERS
Now we consider the case of 5 users in the system; the layout of the users is shown in Fig. 4(b). The distance matrix is randomly generated as

  D = [  50    361.9  362.9  275.7   95.2
        279.1   50    201.3  170.8  294.1
        301.7  131     50     62.5  261.7
        289.2  133.9   56.9   50    248.8
         53.0  318.1  308.8  221.8   50  ].   (31)

We assume that the first user requests the video sequence ''Football'' with a minimum quality requirement of 34 dB and the other four users request the video sequence ''Foreman'' with a minimum quality requirement of 40 dB. The number of training episodes is set to 10000. ε is linearly decreased from 0.9 to 0 over the first 3000 episodes.

FIGURE 9. Loss function versus the number of training episodes (N = 5).

FIGURE 10. Reward versus the number of training episodes (N = 5).

FIGURE 11. Users' QoE versus the number of testing episodes with the proposed method (N = 5).

FIGURE 12. Users' QoE versus the number of testing episodes with the random power method (N = 5).

FIGURE 13. Users' QoE versus the number of testing episodes with the proposed method (N = 20).

The training loss and the training reward are depicted in Fig. 9 and Fig. 10, respectively. It can be seen that after around 3000 episodes, the training loss converges to 0. The reward approaches 40, which means there is no penalty and all the users' quality requirements are satisfied in the training stage. In the distributed implementation stage, we perform 1000 testing episodes and the agents initialize their states randomly in each episode. The PSNR performance of each user for the first 25 episodes is depicted in Fig. 11. We find that all the agents can observe their local environments and reach their target QoE within one step. The proposed multi-agent DRL approach achieves a success rate of 100% across all the testing episodes. As a comparison, the random power method and the maximum power method both have a success rate of 0. We depict the performance of the random power method in Fig. 12.

In practice, for the proposed multi-agent DRL approach, we find that the agents may face a non-stationarity problem, i.e., the trained DQN of an agent may not be able to reach the target state within one step. For example, when we set the learning rate to δ = 0.1, in most of the cases the agent can reach a target state within one step with the trained DQN from arbitrary initial states; however, there are a few cases where an agent cannot reach the target state within one step. This may be caused by the experience replay sampling process: the samples drawn from the experience replay may not reflect the current dynamics. So far, there is no theoretical solution to this problem. Possible heuristic solutions include adding the training ε into the state [30], finding a proper value of the learning rate, or performing more iterations in the testing stage until the obtained state is feasible.¹

¹We add iterations in Algorithm 2 for each testing episode. If the current state is not the target state, all the agents perform a further action based on the current state until they reach the target state. In each iteration, only ACK signals are needed; other agents' quality requirements are not needed. So the communication cost is low compared to the training process.
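A sketch of the distributed implementation step of Algorithm 2 with the extra testing loop described in the footnote: each agent repeatedly takes the greedy action for its current local observation until every user reports a satisfied quality requirement (via ACK) or an iteration cap is hit. The function names (`q_networks`, `step`) and the iteration cap are assumptions for illustration.

```python
import numpy as np

def distributed_power_control(q_networks, step, states, action_set, max_iters=10):
    """Greedy execution with a testing loop (Algorithm 2 plus the footnote's iterations).

    q_networks : list of callables, q_networks[i](state) -> L action values for agent i
    step       : callable(power vector) -> (new local states, per-user ACK flags)
    states     : initial local observations of all agents
    """
    n = len(q_networks)
    powers = np.zeros(n)
    for _ in range(max_iters):
        # Each agent acts greedily on its own trained Q-network (no CSI exchange).
        for i in range(n):
            a = int(np.argmax(q_networks[i](states[i])))
            powers[i] = action_set[a]
        states, acks = step(powers)
        if all(acks):            # every user's quality requirement is met
            break
    return powers
```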


Now we consider a more challenging task where there are 20 users, as shown in Fig. 4(c). Their locations are randomly generated. For simplicity, we assume that all users request the same video sequence ''Foreman'' and their minimum quality requirement is set to 36 dB. To better control the complexity, we assume that each agent only observes the states of its nearest 5 neighbors, i.e., K = 5. The corresponding testing stage is shown in Fig. 13. Due to space limitations, we only plot 5 users' PSNR values. Actually, all the 20 users' video quality requirements are satisfied and their average quality is maximized. As a comparison, we present the PSNR for the random power allocation method in Fig. 14, where the users' PSNR values are obviously not stable. In some cases, a user's PSNR falls below 30 dB. The success rates of both the random power allocation method and the maximum power allocation method are 0.

FIGURE 14. Users' QoE versus the number of testing episodes with the random power allocation method (N = 20).

VI. CONCLUSION AND FUTURE WORK
In this paper, we studied the quality-aware power allocation problem for multi-user video streaming. We developed a distributed model-free power allocation algorithm, which helps maximize the users' target quality. The proposed method does not require explicit channel state information, which would save significant resources. Experimental results showed that the developed multi-agent DRL approach can guarantee that all the users achieve their target quality requirements within a few steps and that the users' average quality is maximized. For future investigations, possible directions include:
1) The randomness of the layout of the D2D channels and the content of the requested videos could be considered in the training process. The agent would take the channel state and the video contents as local state information. Efficient training algorithms need to be developed so that users can take actions based on the local observations and the users' average quality can be maximized.
2) Currently, we start the training process based on the assumption that there exists at least one feasible solution. Theoretical methods should be provided to guarantee a quick examination of whether a feasible solution exists before the training process.
3) Users may work on different channels in practice. In the future, a DRL based joint spectrum and power allocation method could be developed.

REFERENCES
[1] Cisco, ''Cisco visual networking index: Forecast and trends, 2017–2022,'' Cisco, San Jose, CA, USA, Feb. 2019. [Online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-738429.html
[2] Y. Xu and S. Mao, Mobile Cloud Media: State of the Art and Outlook. Hershey, PA, USA: IGI Global, 2013, ch. 2, pp. 18–38.
[3] J. Liu, N. Kato, J. Ma, and N. Kadowaki, ''Device-to-device communication in LTE-advanced networks: A survey,'' IEEE Commun. Surv. Tutr., vol. 17, no. 4, pp. 1923–1940, 4th Quart., 2015.
[4] F. Boccardi, R. W. Heath, A. Lozano, T. L. Marzetta, and P. Popovski, ''Five disruptive technology directions for 5G,'' IEEE Commun. Mag., vol. 52, no. 2, pp. 74–80, Feb. 2014.
[5] M. N. Tehrani, M. Uysal, and H. Yanikomeroglu, ''Device-to-device communication in 5G cellular networks: Challenges, solutions, and future directions,'' IEEE Commun. Mag., vol. 52, no. 5, pp. 86–92, May 2014.
[6] G. Fodor, E. Dahlman, G. Mildh, S. Parkvall, N. Reider, G. Miklós, and Z. Turányi, ''Design aspects of network assisted device-to-device communications,'' IEEE Commun. Mag., vol. 50, no. 3, pp. 170–177, Mar. 2012.
[7] M. Chiang, P. Hande, T. Lan, and C. W. Tan, ''Power control in wireless cellular networks,'' Found. Trends Netw., vol. 2, no. 4, pp. 381–533, Apr. 2008.
[8] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, ''An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel,'' IEEE Trans. Signal Process., vol. 59, no. 9, pp. 4331–4340, Sep. 2011.
[9] K. Shen and W. Yu, ''Fractional programming for communication systems—Part I: Power control and beamforming,'' IEEE Trans. Signal Process., vol. 66, no. 10, pp. 2616–2630, May 2018.
[10] X. Chen, Z. Zhao, and H. Zhang, ''Stochastic power adaptation with multiagent reinforcement learning for cognitive wireless mesh networks,'' IEEE Trans. Mobile Comput., vol. 12, no. 11, pp. 2155–2166, Nov. 2013.
[11] G. Zhang, J. Hu, W. Heng, X. Li, and G. Wang, ''Distributed power control for D2D communications underlaying cellular network using Stackelberg game,'' in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), San Francisco, CA, USA, Mar. 2017, pp. 1–6.
[12] Z. He, S. Mao, and T. Jiang, ''A survey of QoE-driven video streaming over cognitive radio networks,'' IEEE Netw., vol. 29, no. 6, pp. 20–25, Nov./Dec. 2015.
[13] M. Amjad, M. H. Rehmani, and S. Mao, ''Wireless multimedia cognitive radio networks: A comprehensive survey,'' IEEE Commun. Surveys Tutr., vol. 20, no. 2, pp. 1056–1103, 2nd Quart., 2018.
[14] X. Jiang, H. Lu, and C. W. Chen, ''Enabling quality-driven scalable video transmission over multi-user NOMA system,'' in Proc. IEEE Conf. Comput. Commun., Honolulu, HI, USA, Apr. 2018, pp. 1952–1960.
[15] H. Lu, M. Zhang, Y. Gui, and J. Liu, ''QoE-driven multi-user video transmission over SM-NOMA integrated systems,'' IEEE J. Sel. Areas Commun., vol. 37, no. 9, pp. 2102–2116, Sep. 2019.
[16] S. Cicalo and V. Tralli, ''Distortion-fair cross-layer resource allocation for scalable video transmission in OFDMA wireless networks,'' IEEE Trans. Multimedia, vol. 16, no. 3, pp. 848–863, Apr. 2014.
[17] T. Zhang and S. Mao, ''Joint power and channel resource optimization in soft multi-view video delivery,'' IEEE Access, vol. 7, pp. 148084–148097, 2019.
[18] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, ''Human-level control through deep reinforcement learning,'' Nature, vol. 518, pp. 529–533, 2015.
[19] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, ''Mastering the game of Go with deep neural networks and tree search,'' Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[20] Y. Sun, M. Peng, Y. Zhou, Y. Huang, and S. Mao, ''Application of machine learning in wireless networks: Key techniques and open issues,'' IEEE Commun. Surveys Tutr., vol. 21, no. 4, pp. 3072–3108, 4th Quart., 2019.


[21] X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji, and M. Bennis, ''Optimized computation offloading performance in virtual edge computing systems via deep reinforcement learning,'' IEEE Internet Things J., vol. 6, no. 3, pp. 4005–4018, Jun. 2019.
[22] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, ''Deep reinforcement learning for dynamic multichannel access in wireless networks,'' IEEE Trans. Cogn. Commun. Netw., vol. 4, no. 2, pp. 257–265, Jun. 2018.
[23] O. Naparstek and K. Cohen, ''Deep multi-user reinforcement learning for distributed dynamic spectrum access,'' IEEE Trans. Wireless Commun., vol. 18, no. 1, pp. 310–323, Jan. 2019.
[24] Y. He, F. R. Yu, N. Zhao, V. C. M. Leung, and H. Yin, ''Software-defined networks with mobile edge computing and caching for smart cities: A big data deep reinforcement learning approach,'' IEEE Commun. Mag., vol. 55, no. 12, pp. 31–37, Dec. 2017.
[25] Y. Sun, M. Peng, and S. Mao, ''Deep reinforcement learning-based mode selection and resource management for green fog radio access networks,'' IEEE Internet Things J., vol. 6, no. 2, pp. 1960–1971, Apr. 2019.
[26] J. Liu, B. Krishnamachari, S. Zhou, and Z. Niu, ''DeepNap: Data-driven base station sleeping operations through deep reinforcement learning,'' IEEE Internet Things J., vol. 5, no. 6, pp. 4273–4282, Dec. 2018.
[27] K. Xiao, S. Mao, and J. K. Tugnait, ''TCP-Drinc: Smart congestion control based on deep reinforcement learning,'' IEEE Access, vol. 7, pp. 11892–11904, 2019.
[28] X. Li, J. Fang, W. Cheng, H. Duan, Z. Chen, and H. Li, ''Intelligent power control for spectrum sharing in cognitive radios: A deep reinforcement learning approach,'' IEEE Access, vol. 6, pp. 25463–25473, 2018.
[29] Y. S. Nasir and D. Guo, ''Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks,'' IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2239–2250, Oct. 2019.
[30] L. Liang, H. Ye, and G. Y. Li, ''Spectrum sharing in vehicular networks based on multi-agent reinforcement learning,'' May 2019, arXiv:1905.02910. [Online]. Available: https://arxiv.org/abs/1905.02910
[31] M. Feng and S. Mao, ''Dealing with limited backhaul capacity in millimeter-wave systems: A deep reinforcement learning approach,'' IEEE Commun. Mag., vol. 57, no. 3, pp. 50–55, Mar. 2019.
[32] J. Yick, B. Mukherjee, and D. Ghosal, ''Wireless sensor network survey,'' Comput. Netw., vol. 52, no. 12, pp. 2292–2330, Aug. 2008.
[33] K. Stuhlmuller, N. Farber, M. Link, and B. Girod, ''Analysis of video transmission over lossy channels,'' IEEE J. Sel. Areas Commun., vol. 18, no. 6, pp. 1012–1032, Jun. 2000.
[34] H. Mansour, V. Krishnamurthy, and P. Nasiopoulos, ''Channel aware multiuser scalable video streaming over lossy under-provisioned channels: Modeling and analysis,'' IEEE Trans. Multimedia, vol. 10, no. 7, pp. 1366–1381, Nov. 2008.
[35] K. Lin and S. Dumitrescu, ''Cross-layer resource allocation for scalable video over OFDMA wireless networks: Tradeoff between quality fairness and efficiency,'' IEEE Trans. Multimedia, vol. 19, no. 7, pp. 1654–1669, Jul. 2017.
[36] S. Cicalò, A. Haseeb, and V. Tralli, ''Fairness-oriented multi-stream rate adaptation using scalable video coding,'' Signal Process., Image Commun., vol. 27, no. 8, pp. 800–813, 2012.
[37] Z.-Q. Luo and S. Zhang, ''Dynamic spectrum management: Complexity and duality,'' IEEE J. Sel. Topics Signal Process., vol. 2, no. 1, pp. 57–73, Feb. 2008.
[38] M. Tan, ''Multi-agent reinforcement learning: Independent vs. cooperative agents,'' in Proc. ICML, Amherst, MA, USA, Jun. 1993, pp. 330–337.
[39] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. Torr, P. Kohli, and S. Whiteson, ''Stabilising experience replay for deep multi-agent reinforcement learning,'' in Proc. ICML, Sydney, NSW, Australia, Aug. 2017, pp. 1146–1155.
[40] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, ''Deep reinforcement learning for multi-agent systems: A review of challenges, solutions and applications,'' Dec. 2018, arXiv:1812.11794. [Online]. Available: https://arxiv.org/abs/1812.11794
[41] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente, ''Multiagent cooperation and competition with deep reinforcement learning,'' PLoS ONE, vol. 12, no. 4, Apr. 2017, Art. no. e0172395.
[42] U. Challita, L. Dong, and W. Saad, ''Proactive resource management for LTE in unlicensed spectrum: A deep learning perspective,'' IEEE Trans. Wireless Commun., vol. 17, no. 7, pp. 4674–4689, Jul. 2018.
[43] N. Zhao, Y.-C. Liang, D. Niyato, Y. Pei, M. Wu, and Y. Jiang, ''Deep reinforcement learning for user association and resource allocation in heterogeneous cellular networks,'' IEEE Trans. Wireless Commun., vol. 18, no. 11, pp. 5141–5152, Nov. 2019.
[44] F. A. Asuhaimi, S. Bu, P. V. Klaine, and M. A. Imran, ''Channel access and power control for energy-efficient delay-aware heterogeneous cellular networks for smart grid communications using deep reinforcement learning,'' IEEE Access, vol. 7, pp. 133474–133484, 2019.
[45] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[46] High Efficiency Video Coding (HEVC). Accessed: Nov. 13, 2019. [Online]. Available: https://hevc.hhi.fraunhofer.de/

TICAO ZHANG received the B.E. and M.S. degrees from the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China, in 2014 and 2017, respectively. He is currently pursuing the Ph.D. degree in electrical and computer engineering with Auburn University. His research interests include video coding and communications, machine learning, and optimization and design of wireless multimedia networks.

SHIWEN MAO (S'99–M'04–SM'09–F'19) received the Ph.D. degree in electrical and computer engineering from Polytechnic University, Brooklyn, NY, USA (now the New York University Tandon School of Engineering).
He joined Auburn University, Auburn, AL, USA, as an Assistant Professor, in 2006, was the McWane Associate Professor from 2012 to 2015, and has been the Samuel Ginn Distinguished Professor with the Department of Electrical and Computer Engineering since 2015. He has been the Director of the Wireless Engineering Research and Education Center, Auburn University, since 2015, and the Director of the NSF IUCRC FiWIN Center Auburn University site, since 2018. His research interests include wireless networks, multimedia communications, and smart grid. He is a Distinguished Speaker (2018–2021) and was a Distinguished Lecturer (2014–2018) of the IEEE Vehicular Technology Society.
Dr. Mao received the IEEE ComSoc TC-CSR Distinguished Technical Achievement Award, in 2019, the IEEE ComSoc MMTC Distinguished Service Award, in 2019, the Auburn University Creative Research and Scholarship Award, in 2018, the 2017 IEEE ComSoc ITC Outstanding Service Award, the 2015 IEEE ComSoc TC-CSR Distinguished Service Award, the 2013 IEEE ComSoc MMTC Outstanding Leadership Award, and the NSF CAREER Award, in 2010. He is a co-recipient of the IEEE ComSoc MMTC Best Journal Paper Award, in 2019, the IEEE ComSoc MMTC Best Conference Paper Award, in 2018, the Best Demo Award from the IEEE SECON 2017, the Best Paper Awards from the IEEE GLOBECOM 2019, 2016, and 2015, the IEEE WCNC 2015, and the IEEE ICC 2013, and the 2004 IEEE Communications Society Leonard G. Abraham Prize in the Field of Communications Systems. He is an Area Editor of the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, the IEEE OPEN JOURNAL OF THE COMMUNICATIONS SOCIETY, the IEEE INTERNET OF THINGS JOURNAL, the IEEE/CIC CHINA COMMUNICATIONS, and the ACM GetMobile, as well as an Associate Editor of the IEEE TRANSACTIONS ON NETWORK SCIENCE AND ENGINEERING, the IEEE TRANSACTIONS ON MULTIMEDIA, the IEEE TRANSACTIONS ON MOBILE COMPUTING, the IEEE MULTIMEDIA, the IEEE NETWORKING LETTERS, and the Digital Communications and Networks journal (Elsevier).
