Smart Power Control For Quality-Driven Multi-User Video Transmissions: A Deep Reinforcement Learning Approach
Received December 7, 2019, accepted December 19, 2019, date of publication December 23, 2019, date of current version January 2, 2020.
Digital Object Identifier 10.1109/ACCESS.2019.2961914
INVITED PAPER
INDEX TERMS Multi-user video transmission, multi-agent deep reinforcement learning, power control,
quality of experience.
controller. To reduce the signaling overhead and better adapt to large-scale networks, a series of distributed algorithms have been developed. For example, in [10], the power allocation problem in cognitive wireless networks was formulated as a noncooperative game, and a stochastic power allocation scheme with a conjecture-based multi-agent Q-learning approach was proposed. The authors in [11] proposed a Stackelberg game based power control scheme for D2D communications underlaying cellular networks. By introducing a new co-tier price factor, the distributed power control algorithm can mitigate the cross-tier interference effectively. Despite their good performance, current solutions often require frequent information exchange and cannot guarantee optimal performance.

Meanwhile, we have observed that these physical layer technologies generally aim to optimize the transmission data rate or bit error rate (BER); they do not directly improve the user's quality of service (QoS) or quality of experience (QoE) when users are watching a specific video. Given the same transmission bandwidth, different videos generally have different qualities. As an application layer performance metric, video quality directly reflects the user satisfaction level, in contrast to physical layer metrics. In future mobile networks, it is more important to develop a cross-layer interference management approach that jointly considers the physical layer issues as well as the user's requirements and experience [12], [13]. Motivated by this observation, some cross-layer video transmission designs have been proposed. For example, the authors in [14] designed a quality-driven scalable video transmission framework in a non-orthogonal multiple access (NOMA) system and proposed a suboptimal power allocation algorithm. This algorithm leverages the hidden monotonic property of the problem and has polynomial time complexity. The recent work [15] proposed a spatial modulation (SM) and NOMA integrated system for multiuser video transmission, where efficient algorithms perform optimal power control so that the user's QoE can be maximized. A novel cross-layer optimization framework is proposed in [16] for scalable video transmission over OFDMA networks; the proposed iterative algorithm can jointly maximize the achievable sum rate and minimize the distortion among multiple videos. In our recent work [17], a cross-layer optimization framework for softcast video transmission is developed and analyzed. Compared with physical layer-only designs, such cross-layer optimization for video transmission helps users enjoy a better perceived video quality. Despite the success of these algorithms, they require every user to have full knowledge of the CSI of all the links, which may be infeasible in practice. Besides, the formulated problem is generally non-convex, and the developed methods often lead to a sub-optimal solution.

Recently, machine learning (ML) has achieved great success in a variety of fields, such as computer vision and speech recognition. Deep reinforcement learning (DRL), as a powerful ML technique, has shown high potential for many challenging tasks, such as human-level control [18] and computer games [19]. In DRL, the agent considers the long-term reward, rather than simply obtaining the maximum instant reward. This is quite important for resource optimization problems in wireless networks, where the channel state changes rapidly. There is now an increasing interest in incorporating DRL into the design of wireless networking algorithms [20], such as mobile off-loading [21], dynamic channel access [22], [23], mobile edge computing and caching [24], [25], dynamic base station on/off switching [26], TCP congestion control [27], and resource allocation [28]–[31].

In particular, the authors in [28] consider the problem of power control in a cognitive radio system consisting of a primary user and a secondary user. With DRL, the secondary user can interact with the primary user efficiently to reach a target state within a small number of steps. Another work in [29] demonstrates the potential of DRL for power control in wireless networks. Instead of searching for the near-optimal solution by solving the challenging optimization problem, the authors develop a distributed dynamic power control scheme. This method is model-free and maximizes the system's weighted sum rate. The authors in [30] investigated the spectrum sharing problem in vehicular networks with a DRL based solution, where multiple vehicle-to-vehicle (V2V) agents dynamically allocate their power and spectrum in a cooperative way so that their sum capacity can be maximized.

In this paper, we consider the power allocation and interference management problem in a multi-user video transmission system from the point of view of cross-layer optimization. To the best of our knowledge, this is the first work that attempts to integrate DRL for interference management to improve users' video viewing quality. The main contributions of this paper are summarized as follows.
• The proposed algorithm is based on multi-agent deep Q-learning, which is amenable to distributed implementation. It is model-free, does not require labeled training data, and can be applied to arbitrary network configurations.
• Each agent does not need to know the other agents' CSI. The complexity of the proposed algorithm does not increase with the network size, so the method can be applied to very large networks.
• This work is a cross-layer design which considers both the physical layer issues and the application layer video-related design factors. By properly designing the reward function, users actually work in a cooperative manner to achieve a high level of satisfaction.
The remainder of this paper is organized as follows. The system model and the problem formulation are discussed in Section II. In Section III, we develop a multi-agent DRL algorithm for power control. The simulation setup is provided in Section IV. Experimental results are given in Section V, followed by conclusions in Section VI.

II. SYSTEM MODEL AND PROBLEM FORMULATION
A. PHYSICAL LAYER MODEL
We consider a wireless network consisting of N users, where all the users share a common spectrum resource.
where Q_i(p) is given in (7). Note that (9) is the system power constraint and (10) is the video quality constraint, which depends on a variety of factors such as the video content, the encoder setting, and the user's quality requirement.

Based on (3) and (6), it can be seen that PSNR is a monotone function in terms of SINR. The quality constraint (10) can thus be replaced by the corresponding SINR constraint. To simplify the expression, we rewrite Problem (8) as follows:

\max_{p} \; \phi\left( \frac{f_1(p)}{g_1(p)}, \frac{f_2(p)}{g_2(p)}, \ldots, \frac{f_N(p)}{g_N(p)} \right)    (11)
\text{s.t.} \;\; 0 \le p_i \le p_{\max}, \;\; \forall i    (12)
\mathrm{SINR}_i(p) \ge \mathrm{SINR}_{i,\min}, \;\; \forall i,    (13)

where \phi(x) is an increasing function on \mathbb{R}^N_+, expressed as

\phi(x) = -\frac{10}{N} \log_{10} \prod_{i=1}^{N} \frac{\theta_i}{B \log_2(1 + x_i)} + \frac{20}{N} \log_{10} 255,    (14)

and

f_i(p) = |h_{ii}| \cdot p_i    (15)
g_i(p) = \sum_{j \ne i} |h_{ji}| \cdot p_j + \sigma_i^2.    (16)

It can be seen that Problem (11) actually belongs to the class of generalized linear fractional programming (GLFP) problems. In addition, combined with the structure of the functions f_i(x) and g_i(x), this problem is actually non-convex [37]. Generally speaking, there is no efficient solution that finds the global optimal solution within polynomial time.
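To make the structure of Problem (11) concrete, the following Python sketch evaluates the objective (14) together with (15) and (16) for a toy configuration. The channel matrix, noise powers, and θ values below are arbitrary placeholders, not the settings used in the experiments of Section IV.

```python
import numpy as np

# Toy evaluation of the objective in (11)-(16); all numbers are illustrative.
N = 3                                      # number of users
B = 500e3                                  # bandwidth (Hz)
rng = np.random.default_rng(0)
H = rng.rayleigh(scale=1.0, size=(N, N))   # H[j, i] plays the role of |h_ji|
sigma2 = np.full(N, 1e-9)                  # receiver noise powers
theta = rng.uniform(1e3, 1e4, size=N)      # per-video rate-distortion parameters
p = rng.uniform(0.0, 0.4, size=N)          # candidate transmit powers (W)

def sinr(p):
    """SINR_i(p) = f_i(p) / g_i(p), with f_i and g_i as in (15)-(16)."""
    f = np.diag(H) * p                             # |h_ii| * p_i
    g = H.T @ p - np.diag(H) * p + sigma2          # sum_{j != i} |h_ji| * p_j + sigma_i^2
    return f / g

def phi(x):
    """Quality objective phi(x) in (14), x being the vector of SINRs."""
    rate = B * np.log2(1.0 + x)
    return (-10.0 / N) * np.log10(np.prod(theta / rate)) + (20.0 / N) * np.log10(255.0)

print("SINR:", sinr(p))
print("objective phi:", phi(sinr(p)))
```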
III. THE MULTI-AGENT DEEP REINFORCEMENT LEARNING APPROACH
In the proposed multi-user video transmission system, the transmitter of each user dynamically adjusts its transmit power based on the observed environment state. The action taken in the next time slot depends on the current observations; hence, the process can be modeled as a Markov decision process (MDP). We develop a multi-agent deep reinforcement learning approach to solve the problem.

A. OVERVIEW OF DEEP REINFORCEMENT LEARNING
Reinforcement learning (RL) is an effective technique for solving MDP problems. In RL, agents learn an optimal policy through interactions with the environment, by receiving an intermediate reward together with a state update after taking each action. The received reward as well as the observed new state help adjust the control policy. The process continues until an optimal policy is found.

The most representative RL algorithm is Q-learning, where the policy is updated by an action-value function, referred to as the Q-function. Let S denote the set of possible states and A denote the set of discrete actions. The policy π(s, a) is the probability of taking an action a ∈ A when given a state s ∈ S. At time instant t, the agent takes action a^t ∈ A when observing a state s^t ∈ S. Then the agent receives a reward r^t and the next state s^{t+1} is observed. The Q-learning algorithm aims to maximize a certain reward over time. For example, we can define the reward function as

R^t = \sum_{\tau=0}^{\infty} \gamma^{\tau} r^{t+\tau},    (17)

where γ ∈ (0, 1] is a discount factor representing the tradeoff between the immediate and future rewards: a γ close to 0 means that we mainly care about the immediate reward, while a larger γ means that future rewards play a more important role.

Under a policy π(s, a), the Q-function of the agent with action a and state s is defined as

Q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[ R^t \mid s^t = s, a^t = a \right].    (18)

Q-learning aims to maximize the Q-function (18). The optimal action-value function, Q^*(s, a) \triangleq \max_{\pi} Q_{\pi}(s, a), obeys the Bellman optimality equation

Q^*(s, a) = \mathbb{E}_{s^{t+1}}\left[ r^{t+1} + \gamma \max_{a'} Q^*(s^{t+1}, a') \,\middle|\, s^t = s, a^t = a \right],    (19)

where s^{t+1} is the new state after executing the state-action pair (s, a). Let q(s, a) be the state action-value function in the iteration process. Q-learning updates q(s^t, a^t) as follows:

q(s^t, a^t) \leftarrow q(s^t, a^t) + \delta \left[ r^{t+1} + \gamma \max_{a'} q(s^{t+1}, a') - q(s^t, a^t) \right],    (20)

where δ is the learning rate.
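As a minimal illustration of the update rule in (20), the following Python sketch runs tabular Q-learning on a small random MDP. The toy transition and reward tables are unrelated to the power control problem and serve only to show how (20) is applied.

```python
import numpy as np

# Tabular Q-learning on a toy random MDP, applying the update of Eq. (20).
rng = np.random.default_rng(1)
n_states, n_actions = 5, 4
P = rng.integers(0, n_states, size=(n_states, n_actions))  # deterministic toy transitions
R = rng.normal(size=(n_states, n_actions))                  # toy rewards
q = np.zeros((n_states, n_actions))                         # Q-table
gamma, delta, eps = 0.9, 0.1, 0.1                           # discount, learning rate, exploration

s = 0
for t in range(10_000):
    # epsilon-greedy action selection
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(q[s]))
    s_next, r = int(P[s, a]), float(R[s, a])
    # Eq. (20): q(s,a) <- q(s,a) + delta * (r + gamma * max_a' q(s',a') - q(s,a))
    q[s, a] += delta * (r + gamma * np.max(q[s_next]) - q[s, a])
    s = s_next

print(np.round(q, 2))
```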
Q-learning uses a Q-table to approximate the Q-function. When the state and action spaces are discrete and small, learning the optimal policy π is possible with Q-learning. However, when the state and action spaces become continuous and large, the problem becomes intractable. Deep Q-learning (DQL) utilizes a deep Q-network (DQN), i.e., a deep neural network (DNN), to approximate the mapping table. DQL inherits the advantages of both RL and deep learning.

Suppose the DQN is expressed as q(·, ·; Θ^t), where Θ^t are the parameters of the DQN. Following the quasi-static target network method [18], we define two DQNs: the target DQN with parameters Θ_target^t and the trained DQN with parameters Θ_train^t. Θ_target^t is updated to be equal to Θ_train^t once every T_u time slots. Using the target network helps stabilize the overall network performance. Instead of training with only the current experience, the DQN uses a randomly sampled mini-batch from the experience replay memory, which stores the recent tuples (s^t, a^t, r^t, s^{t+1}).

With experience replay, the least squares loss of the trained DQN for a sampled mini-batch D_t can be defined as

L(\Theta_{\mathrm{train}}^t) = \sum_{(s^t, a^t, r^t, s^{t+1}) \in D_t} \left( y_t^{\mathrm{DQN}}(r^t, s^{t+1}) - Q(s^t, a^t; \Theta_{\mathrm{train}}^t) \right)^2,    (21)

where the target output is

y_t^{\mathrm{DQN}}(r^t, s^{t+1}) = r^t + \gamma \cdot \max_{a'} Q(s^{t+1}, a'; \Theta_{\mathrm{target}}^t).    (22)

This experience replay strategy ensures that the learned policy does not get stuck in a local minimum. In each training step, the stochastic gradient descent algorithm is used to minimize the training loss (21) over the mini-batch D_t.
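The following PyTorch sketch shows how the target output (22) and the mini-batch loss (21) can be computed with a quasi-static target network. The network sizes and the randomly generated mini-batch are placeholders for illustration only.

```python
import torch
import torch.nn as nn

# Sketch of the loss (21) with the target output (22) under a target network.
state_dim, n_actions, batch = 8, 10, 32
q_train = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
q_target = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
q_target.load_state_dict(q_train.state_dict())   # Theta_target <- Theta_train every T_u slots

# A random replay mini-batch D_t = {(s, a, r, s')}, purely illustrative
s = torch.randn(batch, state_dim)
a = torch.randint(0, n_actions, (batch,))
r = torch.randn(batch)
s_next = torch.randn(batch, state_dim)
gamma = 0.9

with torch.no_grad():                                   # Eq. (22): y = r + gamma * max_a' Q(s', a'; Theta_target)
    y = r + gamma * q_target(s_next).max(dim=1).values
q_sa = q_train(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; Theta_train)
loss = ((y - q_sa) ** 2).sum()                          # Eq. (21)

loss.backward()                                         # gradients for one stochastic gradient step
print(float(loss))
```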
B. MULTI-AGENT DRL FOR RESOURCE ALLOCATION
In the resource sharing scenario illustrated in Fig. 1, multiple users attempt to transmit video data to the target receivers, which can be modeled as a multi-agent DRL problem. Each user is an agent and interacts with the unknown communication environment to gain experience. The experience is then used to guide the transmit power control policy. At first glance, the power allocation problem seems to be a competitive game: if each agent maximizes its transmit power, the other users may receive severe interference. In this paper, we turn this competitive game into a cooperative game by properly designing the reward function. This way, the global system performance can be optimized.

The multi-agent RL based approach is divided into two phases: (i) the offline training phase and (ii) the online implementation phase. We assume that the system is trained in a centralized way but implemented in a distributed manner. To be more specific, in the training phase, each agent adjusts its actions based on a system performance-oriented reward. In the implementation phase, each agent observes its local state and selects the optimal power control action.

As shown in Fig. 2, each agent n receives a local observation of the environment and then takes an action. These actions form a joint action vector. The agents then receive a joint reward and the environment evolves to the next state. The new local states are observed by the corresponding agents. When the reward is shared by all the agents, cooperative behavior is encouraged.

FIGURE 2. The multi-agent DRL model.

For each agent, the power control process is an MDP. Independent Q-learning [38] is one of the most widely used methods to solve the MDP problem with multiple agents. In independent Q-learning, each agent learns a decentralized policy based on its local observation and action, treating the other agents as part of the environment. Note that each agent would face a non-stationary problem, as the other learning agents are updating their policies simultaneously. One promising solution is to use a single-agent DQN, which computes the joint actions for all agents [39]. However, the complexity grows proportionally to the size of the state-action space. Moreover, the single-agent approach is not suitable for distributed implementation, which may limit its use in large networks. Recently, there have been several multi-agent DRL variants; however, there are no theoretical guarantees despite their promising empirical performance [40], [41]. In this paper, we limit the convergence analysis to providing simulation results in Section V, an approach also employed in similar prior works [42], [43]. Specifically, we investigate the impact of the learning rate on the convergence performance.

C. MDP ELEMENTS
As depicted in Fig. 2, we propose a multi-agent DRL approach where each user serves as an agent. In order to utilize DRL for power control, the state space, the action space, and the reward function need to be properly designed.

1) STATE SPACE
At time slot t, the observed state for each agent is defined as s^t = {[I_1^t, I_2^t, ..., I_N^t], p_i^t, Γ_i^t}, where I_i^t is the indicator function, which shows whether the quality requirement of user i is satisfied or not. Specifically, it is defined as

I_i^t = \begin{cases} 1, & \text{if } Q_i^t > Q_{i,\min} \\ 0, & \text{otherwise}, \end{cases}    (23)

p_i^t is agent i's current transmit power, and Γ_i^t is the total interference that comes from the other agents, which is defined as

\Gamma_i^t = \sum_{j \ne i} |h_{ji}| \cdot p_j^t + \sigma_i^2.    (24)

Note that for each agent, p_i^t and Γ_i^t are local information that is readily available (no exchange is needed).
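A minimal sketch of how an agent could assemble the local observation in (23) and (24) is given below; the channel gains, transmit powers, and measured qualities are illustrative placeholders.

```python
import numpy as np

# Building s_i^t = {[I_1^t, ..., I_N^t], p_i^t, Gamma_i^t} for a toy snapshot.
rng = np.random.default_rng(2)
N = 4
H = rng.rayleigh(size=(N, N))          # H[j, i] plays the role of |h_ji|
p = rng.uniform(0.0, 0.4, size=N)      # current transmit powers p_j^t
sigma2 = 1e-9
Q = rng.uniform(28.0, 42.0, size=N)    # current PSNRs Q_i^t (dB), placeholders
Q_min = np.full(N, 36.0)               # quality requirements Q_{i,min}

I = (Q > Q_min).astype(int)            # indicator functions, Eq. (23)

def local_state(i):
    # Total interference at receiver i, Eq. (24)
    gamma_i = sum(H[j, i] * p[j] for j in range(N) if j != i) + sigma2
    return np.concatenate([I, [p[i], gamma_i]])

print(local_state(0))
```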
2) ACTION SPACE
We assume that the transmitter of each agent chooses its transmit power from a finite set consisting of L elements,

\mathcal{A} = \left\{ \frac{p_{\max}}{L}, \frac{2 p_{\max}}{L}, \ldots, p_{\max} \right\},    (25)

where p_max is the peak power constraint for each user. As a result, the dimension of the action space is L. The agent is only allowed to pick an action a_i^t ∈ A to update its transmit power. Increasing the size of the action space may potentially increase the overall performance; meanwhile, it also brings a larger training overhead and higher system complexity.
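For reference, the action set (25) with the values used later in Section IV (p_max = 0.4 W and L = 10) can be generated as follows.

```python
import numpy as np

# Discrete action set of Eq. (25) for p_max = 0.4 W and L = 10 power levels.
p_max, L = 0.4, 10
A = np.arange(1, L + 1) * p_max / L    # {p_max/L, 2*p_max/L, ..., p_max}
print(A)                                # [0.04 0.08 ... 0.4]
```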
3) REWARD DESIGN
One reason that makes DRL appealing is its flexibility in handling hard-to-optimize objective functions. When the system reward is properly designed according to the objective function, the system performance can be improved. For our cross-layer video quality optimization problem, the objective is to maximize the average users' quality while also satisfying the power constraints.

To achieve this goal, we define the reward function as follows [44]:

r^t = \frac{1}{N} \sum_{i} q_i^t,    (26)

where

q_i^t = \begin{cases} Q_i^t, & \text{if } I_i^t = 1 \\ -100, & \text{otherwise}, \end{cases}    (27)

such that the user's video quality constraint (10) is satisfied.

So far we have assumed that all the agents share the same reward r^t and the same state s^t. In practice, such knowledge may be obtained at some additional communication cost. For the state space signal, the agents only need to monitor the ACK signals sent by each other to infer whether the quality requirements are satisfied, so the communication cost would be extremely low. For the reward function design, each agent computes its own quality based on (1), (3), and (7), and then broadcasts this information to the other agents via message passing [30]. For large networks, the transmission of the exact quality value may occupy considerable wireless resources. A more feasible solution is that each agent observes only its nearby users' ACK signals and takes the average quality among its neighboring users as its reward. For example, we may design the state observed by agent i as

s_i^t = \left\{ [I_n^t \mid n \in \mathcal{N}_i(K)], \; p_i^t, \; \Gamma_i^t \right\},    (28)

where \mathcal{N}_i(K) denotes the nearest K receivers of agent i (including agent i itself). The reward function for each user can then be designed as

r_i^t = \frac{1}{K} \sum_{j \in \mathcal{N}_i(K)} q_j^t.    (29)

This assumption is reasonable because, in large networks, only nearby D2D users are in the same interference domain.
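The following sketch computes the shared reward of (26) and (27) as well as the per-agent neighborhood reward of (29). The quality values and neighbor sets are placeholders, and the neighbor selection here is random rather than distance-based.

```python
import numpy as np

# Reward computation of Eqs. (26)-(27) and the neighborhood variant (29).
rng = np.random.default_rng(3)
N, K = 6, 3
Q = rng.uniform(28.0, 42.0, size=N)     # current PSNRs Q_i^t (dB), placeholders
Q_min = np.full(N, 36.0)

q = np.where(Q > Q_min, Q, -100.0)      # q_i^t, Eq. (27): quality if satisfied, else -100
r_global = q.mean()                     # shared reward r^t, Eq. (26)

# Per-agent reward over K nearby receivers (agent i included), Eq. (29);
# the neighbor sets are drawn at random purely for illustration.
neighbors = {i: rng.choice(N, size=K, replace=False) for i in range(N)}
r_local = {i: q[neighbors[i]].mean() for i in range(N)}

print(r_global, r_local[0])
```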
episode. The training algorithm is presented in Algorithm 1.
D. LEARNING ALGORITHM In the training stage, the agent reaches their target state
1) TRAINING STAGE if the action remains unchanged in the next state st+1 i . It is
t+1
We leverage deep Q-learning with experience replay to train easy to show that the next state si is also a goal state. The
multiple agents for optimal power control. It has been shown agent will stay on the target state until the transmission is
that Q-learning will converge to the optimal policy with prob- completed. As a result, the policy will converge, and we will
ability 1 [45]. In deep Q-learning, DQN is used to approxi- obtain the largest estimated Q value.
mate the action-value function. We assume that each agent
maintains a dedicated DQN that takes an input of the current 2) IMPLEMENTATION STAGE
state and outputs the value functions corresponding to all During the implementation stage, each agent observes
actions. the environment state and then selects an action, which
The DQN is trained through multiple episodes. In each maximizes the state-action value according to the trained
episode l, all agents concurrently explore the state-action Q-network. Afterwards, all agents transmit their video data
space with the -greedy policy, i.e., the agent chooses the with a proper power determined by their selected actions.
action that maximizes the estimated state-action value with The implementation algorithm is summarized in Algorithm 2.
probability l and chooses a random action with probabil- In most of the cases, each agent can reach their target state
ity 1 − l . The -greedy policy helps achieve a balance within 1 step. To solve the non-convergent problem, we add
between exploitation of the current best Q-value function a testing loop. That is, if the agent cannot reach the target
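The following Python skeleton mirrors the structure of Algorithm 1 for a toy two-agent setup. The environment model and the table-based Q-functions are simplified stand-ins (the proposed scheme trains one DQN per agent as described above), so the sketch only illustrates the ε-greedy exploration, the shared reward, the per-agent replay memories, and the early exit once all quality requirements are met.

```python
import random
from collections import deque

import numpy as np

# Skeleton of Algorithm 1 on a toy 2-agent problem with table-based Q-functions.
rng = np.random.default_rng(4)
N, L, p_max, Q_min = 2, 10, 0.4, 36.0
actions = np.arange(1, L + 1) * p_max / L
replay = [deque(maxlen=200) for _ in range(N)]        # per-agent replay memory D_i
q_tab = [np.zeros((2 ** N, L)) for _ in range(N)]     # toy Q-functions indexed by the indicator vector

def env_step(power):
    """Toy stand-in for the channel/video pipeline: returns per-user PSNRs."""
    return 30.0 + 20.0 * power / p_max + rng.normal(0.0, 1.0, size=N)

def encode(I):
    return int("".join(str(b) for b in I), 2)          # indicator vector -> table index

for episode in range(1000):
    eps = max(0.0, 0.9 * (1 - episode / 1000))         # exploration decays from 0.9 to 0
    a = rng.integers(L, size=N)                        # random initial powers
    I = (env_step(actions[a]) > Q_min).astype(int)
    for step in range(20):
        s = encode(I)
        a = np.array([rng.integers(L) if rng.random() < eps else int(np.argmax(q_tab[i][s]))
                      for i in range(N)])              # epsilon-greedy per agent
        Q_vals = env_step(actions[a])
        I = (Q_vals > Q_min).astype(int)
        r = np.where(I == 1, Q_vals, -100.0).mean()    # shared reward, Eq. (26)
        s_next = encode(I)
        for i in range(N):
            replay[i].append((s, a[i], r, s_next))
            batch = random.sample(replay[i], min(8, len(replay[i])))
            for (bs, ba, br, bs2) in batch:            # toy substitute for the DQN gradient step
                q_tab[i][bs, ba] += 0.1 * (br + 0.9 * q_tab[i][bs2].max() - q_tab[i][bs, ba])
        if I.all():                                    # all QoE requirements met -> next episode
            break
```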
2) IMPLEMENTATION STAGE
During the implementation stage, each agent observes the environment state and then selects the action that maximizes the state-action value according to the trained Q-network. Afterwards, all agents transmit their video data with the power determined by their selected actions. The implementation algorithm is summarized in Algorithm 2. In most cases, each agent can reach its target state within one step. To solve the non-convergence problem, we add a testing loop. That is, if an agent cannot reach the target state, all the agents will explore further actions based on the current state until all the agents' minimum quality requirements are satisfied.

Algorithm 2 DRL-Based Power Control Algorithm
Initialize the environment, let the agents randomly select their initial power, and obtain the initial state s^0;
1: for each agent i do
2:   for each step do
3:     Select a_i = arg max_{a ∈ A} Q_i(s^0, a; Θ_i^*);
4:     if the quality requirement of each user is satisfied then
5:       Break;
6:     end if
7:   end for
8: end for
9: Obtain the optimal power allocation p = [a_1, a_2, ..., a_N];

Since the training procedure can be performed offline over different episodes for different network topologies, video quality requirements, video types, and channel conditions, the heavy training complexity should not be a problem in practice. Meanwhile, the online implementation complexity is extremely low, which enables many real-time applications. In practice, the trained DQN needs to be updated only when the network topology and video sequences change dramatically.
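A sketch of the greedy action selection in Algorithm 2 is given below. The Q-network here is randomly initialized only to show the call pattern; in the proposed scheme it would be the trained per-agent DQN.

```python
import torch
import torch.nn as nn

# Greedy power selection of Algorithm 2 from a per-agent Q-network.
state_dim, n_actions = 8, 10
p_max = 0.4
actions = torch.arange(1, n_actions + 1) * p_max / n_actions

q_net = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                      nn.Linear(32, 32), nn.ReLU(),
                      nn.Linear(32, 16), nn.ReLU(),
                      nn.Linear(16, n_actions))

s0 = torch.randn(state_dim)                  # local observation of one agent (placeholder)
with torch.no_grad():
    a_idx = int(torch.argmax(q_net(s0)))     # a_i = argmax_a Q_i(s0, a; Theta_i*)
print("selected transmit power:", float(actions[a_idx]), "W")
```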
IV. SYSTEM SETUP
We next carry out experiments to validate the performance of the proposed DRL-based power control method. The maximum power (in Watts) is set to p_max = 0.4 and L is 10. The bandwidth is set to B = 500 kHz and the cell carrier frequency is set to 2.4 GHz. The noise power density for each user is −174 dBm/Hz. The distance between the transmitter and the receiver of each agent is fixed to 50 m. The agents are randomly located in a square area of 500 m × 500 m.

A. VIDEO CONFIGURATION
For simplicity, we assume our video library contains two video sequences in the common intermediate format (CIF). One video sequence is ''Foreman,'' which has a low spatial-temporal content complexity. The other is ''Football,'' which has a high spatial-temporal content complexity [35]. Each sequence is encoded by the High Efficiency Video Coding (HEVC) software [46]. We use the default low-delay configuration to operate the encoder with both intra coding and motion compensation. The group of pictures (GOP) size is 4.

We enable rate control and change the target bit rate. Given a target bit rate, the video sequences are encoded into bit streams. We average the MSE between the reconstructed frame and the original frame over all 20 frames. The PSNR value is then calculated based on (4). Based on the obtained samples, we estimate the video sequence parameters {θ_i, β_i} with a curve-fitting method. The estimated values of these parameters are listed in Table 1 and the corresponding rate-distortion curves are presented in Fig. 3.

TABLE 1. Optimal parameters for the two video sequences.

FIGURE 3. Rate-distortion curve for the two video sequences.

It can be seen that different video sequences generally exhibit quite different behaviors. For example, the rate of video ''Football'' increases rapidly with increasing PSNR, while that of the video sequence ''Foreman'' grows quite slowly. With the same transmission rate (e.g., 500 kbps), the user who requests video sequence ''Football'' obtains a PSNR of 29 dB, while the user who requests video sequence ''Foreman'' can enjoy a video quality of up to around 41 dB. Hence, simply performing physical layer resource optimization may not be optimal; a cross-layer optimization is indispensable.
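The curve-fitting step can be sketched as follows. Since the rate-distortion model itself appears earlier in the paper (Eqs. (1)–(4)) and is not reproduced in this section, the sketch assumes a common two-parameter form MSE(R) = θ/(R + β) and uses synthetic (rate, PSNR) samples rather than measurements from the encoded sequences.

```python
import numpy as np
from scipy.optimize import curve_fit

# Fitting {theta, beta} of an assumed R-D model to synthetic (rate, PSNR) samples.
rates_kbps = np.array([200.0, 300.0, 500.0, 800.0, 1200.0])
psnr_db = np.array([33.5, 35.2, 37.4, 39.1, 40.6])          # placeholder measurements

def psnr_model(R, theta, beta):
    mse = theta / (R + beta)                                 # assumed model MSE(R) = theta / (R + beta)
    return 10.0 * np.log10(255.0 ** 2 / mse)                 # PSNR = 10 log10(255^2 / MSE)

(theta_hat, beta_hat), _ = curve_fit(psnr_model, rates_kbps, psnr_db, p0=(1e4, 10.0))
print("theta =", theta_hat, "beta =", beta_hat)
```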
B. DRL PARAMETERS
In our experiments, we choose a deep neural network (DNN) to approximate the action-value function. The DNN consists of three fully connected hidden layers, which contain 32, 32, and 16 neurons, respectively. Rectified linear units (ReLUs) are used as the activation function. We adopt the Adam algorithm for loss optimization. The replay memory size is set to 200 and the batch size is set to 8. The probability of exploring new actions linearly decreases with the number of episodes, from 0.9 to 0 over the first 1000 episodes. Algorithm 1 is used to train the network and Algorithm 2 is used for distributed implementation. The hyper-parameters are listed in Table 2.
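The Q-network described above can be instantiated as follows. The 32–32–16 ReLU hidden layers, the L = 10 output dimension, and the Adam optimizer follow the stated configuration, while the input dimension (the N quality indicators plus the agent's own power and interference, shown for N = 2) and the Adam learning rate are assumptions.

```python
import torch
import torch.nn as nn

# Per-agent Q-network with three fully connected hidden layers (32, 32, 16).
N, L = 2, 10
state_dim = N + 2                     # assumed vectorization of the state in Section III-C

q_net = nn.Sequential(
    nn.Linear(state_dim, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, L),                 # one Q-value per discrete power level
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)  # learning rate is a placeholder

print(q_net)
```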
V. SIMULATION RESULT AND DISCUSSIONS
A. TWO USERS
First of all, we consider the simplest case where there are two users. The distance matrix is randomly generated.

FIGURE 5. Loss function versus the number of training episodes (N = 2).

FIGURE 8. Users' QoE versus the number of testing episodes (N = 2).

FIGURE 9. Loss function versus the number of training episodes (N = 5).

FIGURE 12. Users' QoE versus the number of testing episodes with the random power method (N = 5).

... that all the agents can observe their local environments and reach their target QoE within one step. The proposed multi-agent DRL approach has a success rate ...

... state is not the target state, all the agents perform a further action based on the current step until they reach the target state. In each iteration, only ACK signals are needed; other agents' quality requirements are not needed. So the communication cost is low compared to the training process.
Now we consider a more challenging task where there are 20 users, as shown in Fig. 4(c). Their locations are randomly generated. For simplicity, we assume that all of the users request the same video sequence ''Foreman'' and their minimum quality requirement is set to 36 dB. To better control the complexity, we assume that each agent only observes the state from its nearest 5 neighbors, i.e., K = 5. The corresponding testing stage is shown in Fig. 13. Due to space limitation, we only plot 5 users' PSNR values. Actually, all the 20 users' video quality requirements are satisfied and their average quality is maximized. As a comparison, we present the PSNR for the random power allocation method in Fig. 14, where the users' PSNR values are obviously not stable. In some cases, a user's PSNR falls below 30 dB. The success rates of both the random power allocation method and the maximum power allocation method are 0.

FIGURE 14. Users' QoE versus the number of testing episodes with the random power allocation method (N = 20).

VI. CONCLUSION AND FUTURE WORK
In this paper, we studied the quality-aware power allocation problem for multi-user video streaming. We developed a distributed, model-free power allocation algorithm, which helps maximize the users' video quality. The proposed method does not require explicit channel state information, which saves significant resources. Experiment results showed that the developed multi-agent DRL approach can guarantee that all the users achieve their target quality requirements within a few steps while the users' average quality is maximized. For future investigations, possible directions include:
1) The randomness of the layout of the D2D channels and the content of the requested videos could be considered in the training process. The agents would take the channel state and the video contents as local state information. Efficient training algorithms need to be developed so that users can take actions based on their local observations and the users' average quality can be maximized.
2) Currently, we start the training process based on the assumption that there exists at least one feasible solution. Theoretical methods should be provided to quickly check whether a feasible solution exists before the training process.

REFERENCES
[1] Cisco, ''Cisco visual networking index: Forecast and trends, 2017–2022,'' Cisco, San Jose, CA, USA, Feb. 2019. [Online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-738429.html
[2] Y. Xu and S. Mao, Mobile Cloud Media: State of the Art and Outlook. Hershey, PA, USA: IGI Global, 2013, ch. 2, pp. 18–38.
[3] J. Liu, N. Kato, J. Ma, and N. Kadowaki, ''Device-to-device communication in LTE-advanced networks: A survey,'' IEEE Commun. Surveys Tuts., vol. 17, no. 4, pp. 1923–1940, 4th Quart., 2015.
[4] F. Boccardi, R. W. Heath, A. Lozano, T. L. Marzetta, and P. Popovski, ''Five disruptive technology directions for 5G,'' IEEE Commun. Mag., vol. 52, no. 2, pp. 74–80, Feb. 2014.
[5] M. N. Tehrani, M. Uysal, and H. Yanikomeroglu, ''Device-to-device communication in 5G cellular networks: Challenges, solutions, and future directions,'' IEEE Commun. Mag., vol. 52, no. 5, pp. 86–92, May 2014.
[6] G. Fodor, E. Dahlman, G. Mildh, S. Parkvall, N. Reider, G. Miklós, and Z. Turányi, ''Design aspects of network assisted device-to-device communications,'' IEEE Commun. Mag., vol. 50, no. 3, pp. 170–177, Mar. 2012.
[7] M. Chiang, P. Hande, T. Lan, and C. W. Tan, ''Power control in wireless cellular networks,'' Found. Trends Netw., vol. 2, no. 4, pp. 381–533, Apr. 2008.
[8] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, ''An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel,'' IEEE Trans. Signal Process., vol. 59, no. 9, pp. 4331–4340, Sep. 2011.
[9] K. Shen and W. Yu, ''Fractional programming for communication systems—Part I: Power control and beamforming,'' IEEE Trans. Signal Process., vol. 66, no. 10, pp. 2616–2630, May 2018.
[10] X. Chen, Z. Zhao, and H. Zhang, ''Stochastic power adaptation with multiagent reinforcement learning for cognitive wireless mesh networks,'' IEEE Trans. Mobile Comput., vol. 12, no. 11, pp. 2155–2166, Nov. 2013.
[11] G. Zhang, J. Hu, W. Heng, X. Li, and G. Wang, ''Distributed power control for D2D communications underlaying cellular network using Stackelberg game,'' in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), San Francisco, CA, USA, Mar. 2017, pp. 1–6.
[12] Z. He, S. Mao, and T. Jiang, ''A survey of QoE-driven video streaming over cognitive radio networks,'' IEEE Netw., vol. 29, no. 6, pp. 20–25, Nov./Dec. 2015.
[13] M. Amjad, M. H. Rehmani, and S. Mao, ''Wireless multimedia cognitive radio networks: A comprehensive survey,'' IEEE Commun. Surveys Tuts., vol. 20, no. 2, pp. 1056–1103, 2nd Quart., 2018.
[14] X. Jiang, H. Lu, and C. W. Chen, ''Enabling quality-driven scalable video transmission over multi-user NOMA system,'' in Proc. IEEE Conf. Comput. Commun., Honolulu, HI, USA, Apr. 2018, pp. 1952–1960.
[15] H. Lu, M. Zhang, Y. Gui, and J. Liu, ''QoE-driven multi-user video transmission over SM-NOMA integrated systems,'' IEEE J. Sel. Areas Commun., vol. 37, no. 9, pp. 2102–2116, Sep. 2019.
[16] S. Cicalo and V. Tralli, ''Distortion-fair cross-layer resource allocation for scalable video transmission in OFDMA wireless networks,'' IEEE Trans. Multimedia, vol. 16, no. 3, pp. 848–863, Apr. 2014.
[17] T. Zhang and S. Mao, ''Joint power and channel resource optimization in soft multi-view video delivery,'' IEEE Access, vol. 7, pp. 148084–148097, 2019.
[18] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, ''Human-level control through deep reinforcement learning,'' Nature, vol. 518, pp. 529–533, 2015.
[19] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, ''Mastering the game of Go with deep neural networks and tree search,'' Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[20] Y. Sun, M. Peng, Y. Zhou, Y. Huang, and S. Mao, ''Application of machine learning in wireless networks: Key techniques and open issues,'' IEEE Commun. Surveys Tuts., vol. 21, no. 4, pp. 3072–3108, 4th Quart., 2019.
[21] X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji, and M. Bennis, ''Optimized computation offloading performance in virtual edge computing systems via deep reinforcement learning,'' IEEE Internet Things J., vol. 6, no. 3, pp. 4005–4018, Jun. 2019.
[22] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, ''Deep reinforcement learning for dynamic multichannel access in wireless networks,'' IEEE Trans. Cogn. Commun. Netw., vol. 4, no. 2, pp. 257–265, Jun. 2018.
[23] O. Naparstek and K. Cohen, ''Deep multi-user reinforcement learning for distributed dynamic spectrum access,'' IEEE Trans. Wireless Commun., vol. 18, no. 1, pp. 310–323, Jan. 2019.
[24] Y. He, F. R. Yu, N. Zhao, V. C. M. Leung, and H. Yin, ''Software-defined networks with mobile edge computing and caching for smart cities: A big data deep reinforcement learning approach,'' IEEE Commun. Mag., vol. 55, no. 12, pp. 31–37, Dec. 2017.
[25] Y. Sun, M. Peng, and S. Mao, ''Deep reinforcement learning-based mode selection and resource management for green fog radio access networks,'' IEEE Internet Things J., vol. 6, no. 2, pp. 1960–1971, Apr. 2019.
[26] J. Liu, B. Krishnamachari, S. Zhou, and Z. Niu, ''DeepNap: Data-driven base station sleeping operations through deep reinforcement learning,'' IEEE Internet Things J., vol. 5, no. 6, pp. 4273–4282, Dec. 2018.
[27] K. Xiao, S. Mao, and J. K. Tugnait, ''TCP-Drinc: Smart congestion control based on deep reinforcement learning,'' IEEE Access, vol. 7, pp. 11892–11904, 2019.
[28] X. Li, J. Fang, W. Cheng, H. Duan, Z. Chen, and H. Li, ''Intelligent power control for spectrum sharing in cognitive radios: A deep reinforcement learning approach,'' IEEE Access, vol. 6, pp. 25463–25473, 2018.
[29] Y. S. Nasir and D. Guo, ''Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks,'' IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2239–2250, Oct. 2019.
[30] L. Liang, H. Ye, and G. Y. Li, ''Spectrum sharing in vehicular networks based on multi-agent reinforcement learning,'' May 2019, arXiv:1905.02910. [Online]. Available: https://arxiv.org/abs/1905.02910
[31] M. Feng and S. Mao, ''Dealing with limited backhaul capacity in millimeter-wave systems: A deep reinforcement learning approach,'' IEEE Commun. Mag., vol. 57, no. 3, pp. 50–55, Mar. 2019.
[32] J. Yick, B. Mukherjee, and D. Ghosal, ''Wireless sensor network survey,'' Comput. Netw., vol. 52, no. 12, pp. 2292–2330, Aug. 2008.
[33] K. Stuhlmuller, N. Farber, M. Link, and B. Girod, ''Analysis of video transmission over lossy channels,'' IEEE J. Sel. Areas Commun., vol. 18, no. 6, pp. 1012–1032, Jun. 2000.
[34] H. Mansour, V. Krishnamurthy, and P. Nasiopoulos, ''Channel aware multiuser scalable video streaming over lossy under-provisioned channels: Modeling and analysis,'' IEEE Trans. Multimedia, vol. 10, no. 7, pp. 1366–1381, Nov. 2008.
[35] K. Lin and S. Dumitrescu, ''Cross-layer resource allocation for scalable video over OFDMA wireless networks: Tradeoff between quality fairness and efficiency,'' IEEE Trans. Multimedia, vol. 19, no. 7, pp. 1654–1669, Jul. 2017.
[36] S. Cicalò, A. Haseeb, and V. Tralli, ''Fairness-oriented multi-stream rate adaptation using scalable video coding,'' Signal Process., Image Commun., vol. 27, no. 8, pp. 800–813, 2012.
[37] Z.-Q. Luo and S. Zhang, ''Dynamic spectrum management: Complexity and duality,'' IEEE J. Sel. Topics Signal Process., vol. 2, no. 1, pp. 57–73, Feb. 2008.
[38] M. Tan, ''Multi-agent reinforcement learning: Independent vs. cooperative agents,'' in Proc. ICML, Amherst, MA, USA, Jun. 1993, pp. 330–337.
[39] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. Torr, P. Kohli, and S. Whiteson, ''Stabilising experience replay for deep multi-agent reinforcement learning,'' in Proc. ICML, Sydney, NSW, Australia, Aug. 2017, pp. 1146–1155.
[40] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, ''Deep reinforcement learning for multi-agent systems: A review of challenges, solutions and applications,'' Dec. 2018, arXiv:1812.11794. [Online]. Available: https://arxiv.org/abs/1812.11794
[41] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente, ''Multiagent cooperation and competition with deep reinforcement learning,'' PLoS ONE, vol. 12, no. 4, Apr. 2017, Art. no. e0172395.
[42] U. Challita, L. Dong, and W. Saad, ''Proactive resource management for LTE in unlicensed spectrum: A deep learning perspective,'' IEEE Trans. Wireless Commun., vol. 17, no. 7, pp. 4674–4689, Jul. 2018.
[43] N. Zhao, Y.-C. Liang, D. Niyato, Y. Pei, M. Wu, and Y. Jiang, ''Deep reinforcement learning for user association and resource allocation in heterogeneous cellular networks,'' IEEE Trans. Wireless Commun., vol. 18, no. 11, pp. 5141–5152, Nov. 2019.
[44] F. A. Asuhaimi, S. Bu, P. V. Klaine, and M. A. Imran, ''Channel access and power control for energy-efficient delay-aware heterogeneous cellular networks for smart grid communications using deep reinforcement learning,'' IEEE Access, vol. 7, pp. 133474–133484, 2019.
[45] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[46] High Efficiency Video Coding (HEVC). Accessed: Nov. 13, 2019. [Online]. Available: https://hevc.hhi.fraunhofer.de/

TICAO ZHANG received the B.E. and M.S. degrees from the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China, in 2014 and 2017, respectively. He is currently pursuing the Ph.D. degree in electrical and computer engineering with Auburn University. His research interests include video coding and communications, machine learning, and the optimization and design of wireless multimedia networks.

SHIWEN MAO (S'99–M'04–SM'09–F'19) received the Ph.D. degree in electrical and computer engineering from Polytechnic University, Brooklyn, NY, USA (now the New York University Tandon School of Engineering). He joined Auburn University, Auburn, AL, USA, as an Assistant Professor, in 2006, was the McWane Associate Professor from 2012 to 2015, and has been the Samuel Ginn Distinguished Professor with the Department of Electrical and Computer Engineering since 2015. He has been the Director of the Wireless Engineering Research and Education Center, Auburn University, since 2015, and the Director of the NSF IUCRC FiWIN Center Auburn University site, since 2018. His research interests include wireless networks, multimedia communications, and smart grid. He is a Distinguished Speaker (2018–2021) and was a Distinguished Lecturer (2014–2018) of the IEEE Vehicular Technology Society.
Dr. Mao received the IEEE ComSoc TC-CSR Distinguished Technical Achievement Award, in 2019, the IEEE ComSoc MMTC Distinguished Service Award, in 2019, the Auburn University Creative Research and Scholarship Award, in 2018, the 2017 IEEE ComSoc ITC Outstanding Service Award, the 2015 IEEE ComSoc TC-CSR Distinguished Service Award, the 2013 IEEE ComSoc MMTC Outstanding Leadership Award, and the NSF CAREER Award, in 2010. He is a co-recipient of the IEEE ComSoc MMTC Best Journal Paper Award, in 2019, the IEEE ComSoc MMTC Best Conference Paper Award, in 2018, the Best Demo Award from IEEE SECON 2017, the Best Paper Awards from IEEE GLOBECOM 2019, 2016, and 2015, IEEE WCNC 2015, and IEEE ICC 2013, and the 2004 IEEE Communications Society Leonard G. Abraham Prize in the Field of Communications Systems. He is an Area Editor of the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, the IEEE OPEN JOURNAL OF THE COMMUNICATIONS SOCIETY, the IEEE INTERNET OF THINGS JOURNAL, the IEEE/CIC CHINA COMMUNICATIONS, and the ACM GetMobile, as well as an Associate Editor of the IEEE TRANSACTIONS ON NETWORK SCIENCE AND ENGINEERING, the IEEE TRANSACTIONS ON MULTIMEDIA, the IEEE TRANSACTIONS ON MOBILE COMPUTING, the IEEE MULTIMEDIA, the IEEE NETWORKING LETTERS, and the Digital Communications and Networks Journal (Elsevier).