Applications of Machine Learning in Wireless Communications
Other volumes in this series:
Volume 9 Phase Noise in Signal Sources W.P. Robins
Volume 12 Spread Spectrum in Communications R. Skaug and J.F. Hjelmstad
Volume 13 Advanced Signal Processing D.J. Creasey (Editor)
Volume 19 Telecommunications Traffic, Tariffs and Costs R.E. Farr
Volume 20 An Introduction to Satellite Communications D.I. Dalgleish
Volume 26 Common-Channel Signalling R.J. Manterfield
Volume 28 Very Small Aperture Terminals (VSATs) J.L. Everett (Editor)
Volume 29 ATM: The broadband telecommunications solution L.G. Cuthbert and J.C. Sapanel
Volume 31 Data Communications and Networks, 3rd Edition R.L. Brewster (Editor)
Volume 32 Analogue Optical Fibre Communications B. Wilson, Z. Ghassemlooy and I.Z. Darwazeh (Editors)
Volume 33 Modern Personal Radio Systems R.C.V. Macario (Editor)
Volume 34 Digital Broadcasting P. Dambacher
Volume 35 Principles of Performance Engineering for Telecommunication and Information Systems M. Ghanbari, C.J. Hughes, M.C. Sinclair and J.P. Eade
Volume 36 Telecommunication Networks, 2nd Edition J.E. Flood (Editor)
Volume 37 Optical Communication Receiver Design S.B. Alexander
Volume 38 Satellite Communication Systems, 3rd Edition B.G. Evans (Editor)
Volume 40 Spread Spectrum in Mobile Communication O. Berg, T. Berg, J.F. Hjelmstad, S. Haavik and R. Skaug
Volume 41 World Telecommunications Economics J.J. Wheatley
Volume 43 Telecommunications Signalling R.J. Manterfield
Volume 44 Digital Signal Filtering, Analysis and Restoration J. Jan
Volume 45 Radio Spectrum Management, 2nd Edition D.J. Withers
Volume 46 Intelligent Networks: Principles and applications J.R. Anderson
Volume 47 Local Access Network Technologies P. France
Volume 48 Telecommunications Quality of Service Management A.P. Oodan (Editor)
Volume 49 Standard Codecs: Image compression to advanced video coding M. Ghanbari
Volume 50 Telecommunications Regulation J. Buckley
Volume 51 Security for Mobility C. Mitchell (Editor)
Volume 52 Understanding Telecommunications Networks A. Valdar
Volume 53 Video Compression Systems: From first principles to concatenated codecs A. Bock
Volume 54 Standard Codecs: Image compression to advanced video coding, 3rd Edition M. Ghanbari
Volume 59 Dynamic Ad Hoc Networks H. Rashvand and H. Chao (Editors)
Volume 60 Understanding Telecommunications Business A. Valdar and I. Morfett
Volume 65 Advances in Body-Centric Wireless Communication: Applications and state-of-the-art Q.H. Abbasi, M.U. Rehman, K. Qaraqe and A. Alomainy (Editors)
Volume 67 Managing the Internet of Things: Architectures, theories and applications J. Huang and K. Hua (Editors)
Volume 68 Advanced Relay Technologies in Next Generation Wireless Communications I. Krikidis and G. Zheng
Volume 69 5G Wireless Technologies A. Alexiou (Editor)
Volume 70 Cloud and Fog Computing in 5G Mobile Networks E. Markakis, G. Mastorakis, C.X. Mavromoustakis and E. Pallis (Editors)
Volume 71 Understanding Telecommunications Networks, 2nd Edition A. Valdar
Volume 72 Introduction to Digital Wireless Communications Hong-Chuan Yang
Volume 73 Network as a Service for Next Generation Internet Q. Duan and S. Wang (Editors)
Volume 74 Access, Fronthaul and Backhaul Networks for 5G & Beyond M.A. Imran, S.A.R. Zaidi and M.Z. Shakir (Editors)
Volume 76 Trusted Communications with Physical Layer Security for 5G and Beyond T.Q. Duong, X. Zhou and H.V. Poor (Editors)
Volume 77 Network Design, Modelling and Performance Evaluation Q. Vien
Volume 78 Principles and Applications of Free Space Optical Communications A.K. Majumdar, Z. Ghassemlooy and A.A.B. Raj (Editors)
Volume 79 Satellite Communications in the 5G Era S.K. Sharma, S. Chatzinotas and D. Arapoglou
Volume 80 Transceiver and System Design for Digital Communications, 5th Edition Scott R. Bullock
Volume 905 ISDN Applications in Education and Training R. Mason and P.D. Bacsich
Applications of Machine Learning in Wireless Communications
Edited by
Ruisi He and Zhiguo Ding
The Institution of Engineering and Technology is registered as a Charity in England & Wales
(no. 211014) and Scotland (no. SC038698).
This publication is copyright under the Berne Convention and the Universal Copyright
Convention. All rights reserved. Apart from any fair dealing for the purposes of research
or private study, or criticism or review, as permitted under the Copyright, Designs and
Patents Act 1988, this publication may be reproduced, stored or transmitted, in any
form or by any means, only with the prior permission in writing of the publishers, or in
the case of reprographic reproduction in accordance with the terms of licences issued
by the Copyright Licensing Agency. Enquiries concerning reproduction outside those
terms should be sent to the publisher at the undermentioned address:
www.theiet.org
While the authors and publisher believe that the information and guidance given in this
work are correct, all parties must rely upon their own skill and judgement when making
use of them. Neither the authors nor publisher assumes any liability to anyone for any
loss or damage caused by any error or omission in the work, whether such an error or
omission is the result of negligence or any other cause. Any and all such liability
is disclaimed.
The moral rights of the authors to be identified as authors of this work have been
asserted by them in accordance with the Copyright, Designs and Patents Act 1988.
Foreword
10 Deep learning for indoor localization based on bimodal CSI data
Xuyu Wang and Shiwen Mao
10.1 Introduction
10.2 Deep learning for indoor localization
10.2.1 Autoencoder neural network
10.2.2 Convolutional neural network
10.2.3 Long short-term memory
10.3 Preliminaries and hypotheses
10.3.1 Channel state information preliminaries
10.3.2 Distribution of amplitude and phase
10.3.3 Hypotheses
10.4 The BiLoc system
10.4.1 BiLoc system architecture
10.4.2 Off-line training for bimodal fingerprint database
10.4.3 Online data fusion for position estimation
10.5 Experimental study
10.5.1 Test configuration
10.5.2 Accuracy of location estimation
10.5.3 2.4 versus 5 GHz
10.5.4 Impact of parameter ρ
10.6 Future directions and challenges
10.6.1 New deep-learning methods for indoor localization
10.6.2 Sensor fusion for indoor localization using deep learning
10.6.3 Secure indoor localization using deep learning
10.7 Conclusions
Acknowledgments
References
Index
Foreword
concerns and ideas of this branch. Then, classic algorithms and their latest developments are reviewed, with typical applications and useful references. Furthermore, pseudocodes are provided to clarify the details of the algorithms. Each section ends with a summary that untangles the structure of the section and points to relevant applications in wireless communications.
● In Chapter 2, the use of machine learning in wireless channel modelling is presented. First of all, the background of machine-learning-enabled channel modelling is introduced. Then, four related aspects are presented: (i) propagation scenario classification, (ii) machine-learning-based multipath component (MPC) clustering, (iii) automatic MPC tracking and (iv) deep-learning-based channel modelling. The results in this chapter can serve as a reference for other channel modelling based on real-world measurement data.
● In Chapter 3, wireless channel prediction is addressed, which is a key issue for wireless communication network planning and operation. Instead of classic model-based methods, a survey of recent advances in machine-learning-based channel prediction algorithms is provided, including both batch and online methods. Experimental results obtained on real data are also provided.
● In Chapter 4, new types of channel estimators based on machine learning are introduced, which differ from traditional pilot-aided channel estimators such as least squares and linear minimum mean square error. Specifically, two newly designed channel estimators based on deep learning and one blind estimator based on the expectation-maximization algorithm are provided for wireless communication systems. Challenges and open problems for channel estimation aided by machine-learning theories are also discussed.
● In Chapter 5, cognitive radio is introduced as a promising paradigm to address spectrum scarcity and to improve the energy efficiency of next-generation mobile communication networks. In the context of cognitive radio, the necessity of signal identification techniques is first presented. A survey of signal identification techniques and recent machine-learning-based advances in this field is then provided. Finally, open problems and possible future directions for cognitive radio are briefly discussed.
● In Chapter 6, the fundamental concepts that are important in the study of compressive sensing (CS) are introduced. Three conditions are described, i.e. the null space property, the restricted isometry property and mutual coherence, which are used to evaluate the quality of sensing matrices and to demonstrate the feasibility of reconstruction. Some widely used numerical algorithms for sparse recovery are briefly reviewed, classified into two categories, i.e. convex optimization algorithms and greedy algorithms. Various examples where the CS principle has been applied to wireless sensor networks (WSNs) are illustrated.
● In Chapter 7, the enhancement of the IEEE 802.11p Medium Access Control (MAC) layer for vehicular use by applying RL is studied. The purpose of this adaptive channel access control technique is to enable more reliable, high-throughput data exchange among moving vehicles for cooperative awareness purposes. Some technical background for vehicular networks is presented, as well as some relevant existing solutions tackling similar channel-sharing problems. Finally, some new findings from combining the IEEE 802.11p MAC with
1 School of Computer and Information Technology, Beijing Jiaotong University, China
2 State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, China
(e.g. temperature, wind speed and precipitation for the past days). Now, given today's meteorological data, how can we predict the weather of the next day? A natural idea is to extract a rule from the historical meteorological data. Specifically, you need to observe and analyse what the weather was like given the meteorological data of the previous day. If you are fortunate enough to find a rule, then you will make a successful prediction. However, in most cases, the meteorological data is too large for humans to analyse. Supervised learning offers a solution to this challenge.
In fact, what you try to do in the above example is a typical supervised learning task. Formally, supervised learning is a procedure of learning a function f(·) that maps an input x (meteorological data of a day) to an output y (weather of the next day) based on a set of sample pairs T = {(x_i, y_i)}_{i=1}^{n} (historical data), where T is called a training set and y_i is called a label. If y is a categorical variable (e.g. sunny or rainy), then the task is called a classification task. If y is a continuous variable (e.g. probability of precipitation), then the task is called a regression task. Furthermore, for a new input x_0, which is called a test sample, f(x_0) will give the prediction.
In wireless communications, an important problem is estimating the channel noise in a MIMO wireless network, since knowing these parameters is essential to many tasks of a wireless network such as network management, event detection, location-based services and routing [3]. This problem can be solved by using supervised learning approaches. Let us consider a linear MIMO channel with additive white Gaussian noise, t transmitting antennas and r receiving antennas. Assume the channel model is z = Hs + u, where s ∈ R^t, u ∈ R^r and z ∈ R^r denote the signal vector, the noise vector and the received vector, respectively. The goal of the channel noise estimation problem is to estimate u given s and z. This problem can be formulated as r regression tasks, where the target of the kth regression task is to predict u_k for 1 ≤ k ≤ r. In the kth regression task, a training pair is represented as {[s^T, z_k]^T, u_k}. We can complete these tasks using any regression model (which will be introduced later in this chapter). Once the model is well trained, u_k can be predicted when a new sample [s̄^T, z̄_k]^T arrives. In this section, we will discuss three practical technologies of supervised learning.
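To make this formulation concrete, the following is a minimal sketch of the kth regression task on synthetic data; the dimensions t = r = 4, the randomly drawn channel matrix H and the use of scikit-learn's LinearRegression are illustrative assumptions, not prescriptions from the chapter.

# Sketch: channel-noise estimation as r independent regression tasks.
# Assumptions: synthetic z = H s + u data; the k-th task predicts u_k from [s^T, z_k]^T.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
t, r, n_train = 4, 4, 1000               # transmit antennas, receive antennas, samples
H = rng.normal(size=(r, t))              # channel matrix (unknown to the learner)

S = rng.normal(size=(n_train, t))        # transmitted signal vectors s
U = 0.1 * rng.normal(size=(n_train, r))  # white Gaussian noise vectors u
Z = S @ H.T + U                          # received vectors z = H s + u

k = 0                                    # build the k-th regression task
X_train = np.hstack([S, Z[:, [k]]])      # training inputs [s^T, z_k]^T
y_train = U[:, k]                        # training targets u_k

model = LinearRegression().fit(X_train, y_train)

# Predict the noise component u_k for a new pair (s_bar, z_bar_k).
s_bar = rng.normal(size=(1, t))
z_bar = s_bar @ H.T + 0.1 * rng.normal(size=(1, r))
u_k_hat = model.predict(np.hstack([s_bar, z_bar[:, [k]]]))
print("estimated u_k:", u_k_hat)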
(denoted by a circle) using the k-NN method. When k = 3, as shown in Figure 1.1(b),
the test sample will be assigned to the first class according to the majority principle.
When k = 1, as shown in Figure 1.1(c), the test sample will be assigned to the second
class since its nearest neighbour belongs to the second class. A formal description of
k-NN is presented in Algorithm 1.1.
The output of the k-NN algorithm is related to two things. One is the distance
function, which measures how near two samples are. Different distance functions will
lead to different k-NN sets and thus different classification results. The most com-
monly used distance function is the Lp distance. Given two vectors x = (x1 , . . . , xd )T
and z = (z1 , . . . , zd )T , the Lp distance between them is defined as
L_p(x, z) = \left( \sum_{i=1}^{d} |x_i - z_i|^p \right)^{1/p}.   (1.1)
Figure 1.1 An illustration of the main idea of k-NN. (a) A training set that consists of seven samples, four of which are labelled as the first class (denoted by squares) and the others are labelled as the second class (denoted by triangles). A test sample is denoted by a circle. (b) When k = 3, the test sample is classified as the first class. (c) When k = 1, the test sample is assigned to the second class

The k-NN classification rule can be written as

y_0 = \arg\max_{1 \le c \le m} \sum_{x_i \in N_k(x_0)} I(y_i = c),

where I(·) is the indicator function and N_k(x_0) denotes the set of the k nearest neighbours of x_0.
When p equals 2, the L_p distance becomes the Euclidean distance. When p equals 1, the L_p distance is also called the Manhattan distance. When p goes to ∞, it can be shown that

L_∞(x, z) = \max_{i} |x_i - z_i|.   (1.2)
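As an illustration of the k-NN rule and the L_p distance defined in (1.1), the following is a minimal sketch of a plain k-NN classifier; the seven-point toy training set only loosely mirrors Figure 1.1 and is an assumption.

import numpy as np
from collections import Counter

def lp_distance(x, z, p=2):
    """L_p distance of (1.1); p = 2 gives the Euclidean distance, p = 1 the Manhattan distance."""
    return np.sum(np.abs(x - z) ** p) ** (1.0 / p)

def knn_predict(X_train, y_train, x0, k=3, p=2):
    """Classify x0 by majority vote among its k nearest training samples."""
    dists = [lp_distance(x, x0, p) for x in X_train]
    nearest = np.argsort(dists)[:k]                 # indices of the k-NN set N_k(x0)
    votes = Counter(y_train[i] for i in nearest)    # counts of I(y_i = c) per class c
    return votes.most_common(1)[0][0]

# Toy data in the spirit of Figure 1.1: class 1 (squares) and class 2 (triangles).
X_train = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [1.1, 1.1],
                    [2.0, 2.0], [2.2, 1.9], [1.9, 2.1]])
y_train = np.array([1, 1, 1, 1, 2, 2, 2])
x0 = np.array([1.4, 1.4])

print(knn_predict(X_train, y_train, x0, k=3))   # majority vote among 3 neighbours
print(knn_predict(X_train, y_train, x0, k=1))   # label of the single nearest neighbour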
to enumerate all trees to find an effective tree. Thus, the key problem is how to con-
struct an effective tree in a reasonable span of time. A variety of methods have been
developed for learning an effective tree, such as ID3 [12], C4.5 [13] and classifica-
tion and regression tree (CART) [14]. Most of them share a similar core idea that
employs a top-down, greedy strategy to search through the space of possible decision
trees. In this section, we will focus on the commonly used CART method and its two improvements, random forest (RF) and gradient boosting decision tree (GBDT).
Similar steps will be carried out recursively for T_1 and T_2, respectively, until a stopping condition is met.
In contrast, a regression tree is used to predict continuous variables, and its partition criterion is usually chosen as the minimum mean square error. Specifically, given a training set T = {(x_i, y_i)}_{i=1}^{n}, a regression tree will divide T into T_1 and T_2 such that the following quantity is minimized:

\sum_{(x_i, y_i) \in T_1} (y_i - m_1)^2 + \sum_{(x_j, y_j) \in T_2} (y_j - m_2)^2,   (1.6)

where m_j = (1/|T_j|) \sum_{(x_i, y_i) \in T_j} y_i (j = 1, 2). For clarity, we summarize the constructing process and the predicting process in Algorithms 1.2 and 1.3, respectively.
By using Algorithm 1.2, we can construct a decision tree. However, this tree is so fine-grained that it may cause overfitting (i.e. it achieves perfect performance on the training set but gives bad predictions for test samples). An extra pruning step can improve this situation. The pruning step consists of two main phases. First, iteratively prune the tree from the leaf nodes to the root node and thus acquire a tree sequence Tree_0, Tree_1, ..., Tree_n, where Tree_0 denotes the entire tree and Tree_n denotes the tree which only contains the root node. Second, select the optimal tree from the sequence by using cross validation. For more details, readers can refer to [14]. References [15–17] demonstrate three applications of CART in wireless communications.
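To illustrate the splitting criterion (1.6), the sketch below searches a single split threshold on one feature that minimizes the summed squared error of the two resulting subsets; it is a simplified assumption, not the book's Algorithm 1.2.

import numpy as np

def best_split(x, y):
    """Find the threshold on a 1-D feature x minimizing criterion (1.6)."""
    best_t, best_err = None, np.inf
    for t in np.unique(x)[:-1]:                   # candidate thresholds
        left, right = y[x <= t], y[x > t]
        m1, m2 = left.mean(), right.mean()        # per-subset means m_1, m_2
        err = ((left - m1) ** 2).sum() + ((right - m2) ** 2).sum()
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = np.where(x < 4.0, 1.0, 3.0) + 0.1 * rng.normal(size=50)   # piecewise-constant target
print(best_split(x, y))   # a threshold close to 4.0 and a small squared error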
GBDT combines m regression trees into an additive model:

f_m(x) = \sum_{j=1}^{m} \mathrm{Tree}_j(x; Θ_j),   (1.7)
where Tree_j(x; Θ_j) denotes the jth tree with parameter set Θ_j. Given a training set {(x_1, y_1), ..., (x_n, y_n)}, the goal of GBDT is to minimize:

\sum_{i=1}^{n} L(f_m(x_i), y_i),   (1.8)

where L(·, ·) is a differentiable function which measures the difference between f_m(x_i) and y_i and is chosen according to the task.
However, it is often difficult to find an optimal solution to minimize (1.8). As a trade-off, GBDT uses a greedy strategy to yield an approximate solution. First, notice that (1.7) can be written in a recursive form:

f_j(x) = f_{j-1}(x) + \mathrm{Tree}_j(x; Θ_j), \quad j = 1, \ldots, m,   (1.9)

where we have defined f_0(x) = 0. Then, by fixing the parameters of f_{j-1}, GBDT finds the parameter set Θ_j by solving:

\min_{Θ_j} \sum_{i=1}^{n} L(f_{j-1}(x_i) + \mathrm{Tree}_j(x_i; Θ_j), y_i).   (1.10)
Replacing the loss function L(u, v) by its first-order Taylor series approximation with respect to u at u = f_{j-1}(x_i), we have

\sum_{i=1}^{n} L(f_{j-1}(x_i) + \mathrm{Tree}_j(x_i; Θ_j), y_i) \approx \sum_{i=1}^{n} \left[ L(f_{j-1}(x_i), y_i) + \frac{\partial L(f_{j-1}(x_i), y_i)}{\partial f_{j-1}(x_i)}\, \mathrm{Tree}_j(x_i; Θ_j) \right].   (1.11)

Notice that the right-hand side is a linear function with respect to Tree_j(x_i; Θ_j), and its value can be decreased by letting Tree_j(x_i; Θ_j) = −∂L(f_{j-1}(x_i), y_i)/∂f_{j-1}(x_i). Thus, GBDT trains Tree_j(·; Θ_j) by using a new training set {(x_i, −∂L(f_{j-1}(x_i), y_i)/∂f_{j-1}(x_i))}_{i=1}^{n}. The above steps are repeated for j = 1, ..., m and thus a gradient boosting tree is generated.
GBDT is known as one of the best methods in supervised learning and has been widely applied in many tasks. There are many tricks in its implementation. Two popular implementations, XGBoost and LightGBM, can be found in [23] and [24], respectively. References [25] and [26] demonstrate two applications of GBDT in obstacle detection and quality of experience (QoE) prediction, respectively.
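The greedy fitting of (1.10) and (1.11) can be made explicit for the squared loss, where the negative gradient −∂L/∂f_{j−1}(x_i) is simply the residual y_i − f_{j−1}(x_i). The sketch below is an assumption (it is not the implementation of XGBoost or LightGBM) and uses shallow scikit-learn regression trees in the role of Tree_j(·; Θ_j).

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, m=50, lr=0.1, depth=2):
    """Gradient boosting with squared loss: each tree fits the current residuals."""
    f = np.zeros_like(y, dtype=float)        # f_0(x) = 0
    trees = []
    for _ in range(m):
        residual = y - f                     # negative gradient of 0.5 * (y - f)^2
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, residual)
        trees.append(tree)
        f += lr * tree.predict(X)            # f_j = f_{j-1} + lr * Tree_j
    return trees, lr

def gbdt_predict(trees, lr, X):
    return lr * sum(tree.predict(X) for tree in trees)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
trees, lr = gbdt_fit(X, y)
print(np.mean((gbdt_predict(trees, lr, X) - y) ** 2))   # small training error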
1.1.3 Perceptron
A perceptron is a linear model for a binary-classification task and is the foundation
of the famous support vector machine (SVM) and deep neural networks (DNNs).
Intuitively, it tries to find a hyperplane to separate the input space (feature space)
into two half-spaces such that the samples of different classes lie in the different half-
spaces. An illustration is shown in Figure 1.3(a). A hyperplane in Rd can be described
by an equation wT x + b = 0, where w ∈ RD is the normal vector. Correspondingly,
Figure 1.3 (a) An illustration of a perceptron and (b) the graph representation of a
perceptron
wT x + b > 0 and wT x + b < 0 represent the two half-spaces separated by the hyper-
plane wT x + b = 0. For a sample x0 , if wT x0 + b is larger than 0, we say x0 is in the
positive direction of the hyperplane, and if wT x0 + b is less than 0, we say it is in the
negative direction.
In addition, by writing w^T x + b = [x^T, 1] · [w^T, b]^T = \sum_{i=1}^{d} x_i w_i + b, we can view [x^T, 1]^T, [w^T, b]^T and w^T x + b as the inputs, parameters and output of a perceptron, respectively. Their relation can be described by a graph, where the inputs
and output are represented by nodes, and the parameters are represented by edges,
as shown in Figure 1.3(b). This graph representation is convenient for describing the
multilayer perceptron (neural networks) which will be introduced in Section 1.1.3.3.
Suppose we have a training set T = {(x1 , y1 ), . . . , (xn , yn )}, where xi ∈ Rd and
yi ∈ {+1, −1} is the ground truth. The perceptron algorithm can be formulated as
\min_{w \in R^d,\, b \in R} L(w, b) = -\sum_{i=1}^{N} y_i (w^T x_i + b),   (1.12)
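A minimal sketch of the classical perceptron update, which performs stochastic gradient steps on (1.12) driven by misclassified samples, is given below; the learning rate and the linearly separable toy data are assumptions.

import numpy as np

def perceptron_train(X, y, lr=0.1, epochs=100):
    """Learn (w, b) such that sign(w^T x + b) matches the labels y in {+1, -1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        updated = False
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:       # misclassified: move the hyperplane
                w += lr * yi * xi
                b += lr * yi
                updated = True
        if not updated:                      # all samples correctly classified
            break
    return w, b

rng = np.random.default_rng(3)
X_pos = rng.normal(loc=[2, 2], scale=0.3, size=(20, 2))
X_neg = rng.normal(loc=[0, 0], scale=0.3, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)
w, b = perceptron_train(X, y)
print(np.all(np.sign(X @ w + b) == y))       # True on this separable toy set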
Figure 1.4 (a) Three hyperplanes which can separate the two classes of training
samples and (b) the hyperplane which maximizes the classification
margin
styles, and each of them can serve as a solution to the perceptron. However, as shown
in Figure 1.4(b), only the hyperplane which maximizes the classification margin can
serve as the solution to SVM.
According to the definition of the classification margin, the distance from any training sample to the classification hyperplane should be no less than the margin. Thus, given the training set {(x_i, y_i)}_{i=1}^{n}, the learning of SVM can be formulated as the following optimization problem:
\max_{w \in R^d,\, b \in R} γ
\text{s.t.}\ \ y_i \left( \frac{w^T x_i}{\|w\|} + \frac{b}{\|w\|} \right) \ge γ \quad (i = 1, \ldots, n),   (1.13)

where (w^T x_i/\|w\|) + (b/\|w\|) can be viewed as the signed distance from x_i to the classification hyperplane w^T x + b = 0, and the sign of y_i((w^T x_i/\|w\|) + (b/\|w\|)) denotes whether x_i lies in the right half-space. It can be shown that problem (1.13) is equivalent to

\min_{w \in R^d,\, b \in R} \frac{1}{2} \|w\|^2
\text{s.t.}\ \ y_i (w^T x_i + b) - 1 \ge 0 \quad (i = 1, \ldots, n).   (1.14)
Problem (1.14) is a quadratic programming problem [28] and can be efficiently solved
by several optimization tools [29,30].
Note that both the perceptron and SVM suppose that the training set can be
separated linearly. However, this supposition is not always correct. Correspondingly,
the soft-margin hyperplane and the kernel trick have been introduced to deal with
the non-linear situation. Please refer to [31] and [32] for more details. In addi-
tion, SVM can also be used to handle regression tasks, which is also known as
support vector regression [33]. SVM has been widely applied in many fields of wire-
less communications, such as superimposed transmission mode identification [34],
selective forwarding attacks detection [35], localization [36] and MIMO channel
learning [3].
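In practice, problem (1.14) is usually handed to an off-the-shelf solver. The sketch below is an assumption that uses scikit-learn's SVC with a linear kernel and a large penalty C to approximate the hard-margin formulation.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([2, 2], 0.3, (30, 2)),    # positive samples
               rng.normal([0, 0], 0.3, (30, 2))])   # negative samples
y = np.array([1] * 30 + [-1] * 30)

# A large C approximates the hard-margin formulation (1.14);
# a smaller C gives the soft-margin variant mentioned in the text.
clf = SVC(kernel="linear", C=1e3).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]        # learned hyperplane w^T x + b = 0
print("margin =", 2.0 / np.linalg.norm(w))    # classification margin 2/||w||
print("number of support vectors:", len(clf.support_))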
[Figure: (a) sample points and their distances to the hyperplane w^T x + b = 0; (b) the curve of g(w^T x + b) as a function of w^T x + b]
= \sum_{y_i = 1} (w^T x_i + b) - \sum_{i=1}^{n} \log\left( 1 + \exp(w^T x_i + b) \right).   (1.16)
where P(y = j|x) denotes the probability of x belonging to the jth class. The parameter set {(w_j, b_j)}_{j=1}^{k} can also be estimated by using maximum likelihood estimation. Another name for multinomial logistic regression is softmax regression, which is often used as the last layer of a multilayer perceptron, as will be introduced in the next section. Logistic regression has been applied to predict the device wireless data and location interface configurations that optimize energy consumption in mobile devices [11]. References [37], [38] and [39] demonstrate three applications of logistic regression to home wireless security, reliability evaluation and patient anomaly detection in medical wireless sensor networks, respectively.
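As a sketch of how the log-likelihood (1.16) could be maximized in practice, the snippet below performs simple gradient ascent for binary logistic regression with labels y_i ∈ {0, 1}; the toy data, learning rate and number of iterations are assumptions.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_fit(X, y, lr=0.1, epochs=500):
    """Gradient ascent on the log-likelihood (1.16); y takes values in {0, 1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)                 # P(y = 1 | x) for every training sample
        w += lr * X.T @ (y - p) / len(y)       # gradient of the log-likelihood w.r.t. w
        b += lr * np.mean(y - p)               # gradient w.r.t. b
    return w, b

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([2, 2], 0.5, (50, 2)), rng.normal([0, 0], 0.5, (50, 2))])
y = np.array([1] * 50 + [0] * 50)
w, b = logistic_fit(X, y)
pred = (sigmoid(X @ w + b) > 0.5).astype(int)
print("training accuracy:", np.mean(pred == y))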
Figure 1.6 (a) A simple neural network with three layers, where g(·) is a non-linear
activation function and (b) the curves of three commonly used
activation functions
node though we only use one for simplicity in this example. The middle layer is called
a hidden layer, because its nodes are not observed in the training process. Similar to
a Markov chain, the node values of each layer are computed only depending on the
node values of its previous layer.
Because the original perceptron is just a linear function that maps the weighted
inputs to the output of each layer, the linear algebra shows that any number of lay-
ers can be reduced to a two-layer input–output model. Thus, a non-linear activation
function g : R → R, which is usually monotonically increasing and differentiable almost everywhere [40], is introduced to achieve a non-linear mapping. Here, we list
some commonly used activation functions in Table 1.1, and their curves are shown
in Figure 1.6(b). Notice that the sigmoid function is just a modification of the prob-
ability mapping (1.15) used in logistic regression. As shown in Figure 1.6(b), the
hyperbolic tangent (tanh) function shares similar curve trace with the sigmoid func-
tion except the output range being (−1,1) instead of (0, 1). Their good mathematical
properties make them popular in early research [41]. However, they encounter dif-
ficulties in DNNs. It is easy to verify that limt→∞ g
(t) = 0 and |g
(t)| is a small
value in most areas of the domain for both of them. This property restricts their use
in DNNs since training DNNs require that the gradient of the activation function is
around 1. To meet this challenge, a rectified linear unit (ReLU) activation function is
proposed. As shown in Table 1.1, the ReLU function is piece-wise linear function and
saturates at exactly 0 whenever the input t is less than 0. Though it is simple enough,
the ReLU function has achieved great success and became the default choice in
DNNs [40].
Now, we will have a brief discussion about the training process of the multi-
layer perceptron. For simplicity, let us consider the performance of a regression task
Introduction of machine learning 15
by using the model shown in Figure 1.6(a). For convenience, denote the parameters between the input and hidden layers as a matrix W_1 = (ŵ_{11}, ..., ŵ_{1h}, ŵ_{1,h+1}) ∈ R^{(d+1)×(h+1)}, where ŵ_{1i} = (w_{1i}^T, b_{1i})^T ∈ R^{d+1} and ŵ_{1,h+1} = (0, 0, ..., 1)^T. Similarly, the parameters between the hidden and output layers are denoted as a vector w_2 = (w_{21}, ..., w_{2h}, w_{2,h+1})^T ∈ R^{h+1}, where w_{2,h+1} = b_2. Let x = (x_1, ..., x_d, 1)^T ∈ R^{d+1}, y = (y_1, ..., y_h, 1)^T ∈ R^{h+1} and z denote the input vector, the hidden vector and the output scalar, respectively. Then the relations among x, y and z can be presented as

y = g(W_1^T x),
z = g(w_2^T y),   (1.18)
where the activation function g will act on each element for a vector as input. Suppose
we expect that the model outputs z̄ for the input x, and thus the square error is given
by e = (1/2)(z − z̄)2 . We decrease this error by using the gradient descent method.
This means that ∂e/∂W1 and ∂e/∂w2 need to be computed. By the gradient chain
rule, we have
∂e/∂z = (z − z̄),
∂e/∂w_2 = (∂e/∂z)(∂z/∂w_2) = (z − z̄)\, g'(w_2^T y + b_2)\, y,   (1.19)

and

∂e/∂W_1 = (∂e/∂z)(∂z/∂y)(∂y/∂W_1),   (1.20)

where we have omitted the dimensions for simplicity. Thus, to compute ∂e/∂W_1, we first need to compute:

∂z/∂y = g'(w_2^T y)\, w_2,
∂y/∂W_1 = \left( ∂y_i/∂W_1 \right)_{i=1}^{h+1} = \left( g'(w_{1i}^T x) \cdot x e_i^T \right)_{i=1}^{h+1},   (1.21)

where e_i ∈ R^{h+1} denotes the unit vector with its ith element being 1. By plugging (1.21) into (1.20), we have

∂e/∂W_1 = (∂e/∂z)(∂z/∂y)(∂y/∂W_1) = (z − z̄)\, g'(w_2^T y + b_2)\, w_2^T \left( g'(w_{1i}^T x) \cdot x e_i^T \right)_{i=1}^{h+1}
= (z − z̄)\, g'(w_2^T y + b_2) \sum_{i=1}^{h+1} w_{2i}\, g'(w_{1i}^T x) \cdot x e_i^T.   (1.22)
Thus, we can update the parameters by using the gradient descent method to
reduce the error. In the above deduction, what we really need to calculate are just
∂e/∂z, ∂z/∂w2 , ∂z/∂y and ∂y/∂W1 . As shown in Figure 1.7(a), we find these terms
are nothing but the derivatives of the node values or the parameters of each layer with
respect to the node values of the next layer. Beginning from the output layer, ‘multiply’
them layer by layer according to the chain rule, and then we obtain the derivatives of
the square error with respect to the parameters of each layer. The above strategy is the
so-called backpropagation (BP) algorithm [42]. Equipped with the ReLU activation
function, the BP algorithm can train neural networks with dozens or even hundreds of layers, which constitutes the foundation of deep learning.
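The gradients (1.19)-(1.22) translate almost line by line into code. The sketch below is an assumption: it uses the sigmoid activation, a single hidden layer as in Figure 1.6(a) and a linear output unit, and trains the network on a toy regression problem with plain gradient descent.

import numpy as np

def g(t):            # sigmoid activation
    return 1.0 / (1.0 + np.exp(-t))

def g_prime(t):      # derivative of the sigmoid
    s = g(t)
    return s * (1.0 - s)

rng = np.random.default_rng(6)
d, h, n = 3, 8, 200
X = rng.normal(size=(n, d))
z_bar = np.tanh(X @ np.array([1.0, -2.0, 0.5]))   # regression targets

W1 = 0.1 * rng.normal(size=(d, h))   # input-to-hidden weights
b1 = np.zeros(h)
w2 = 0.1 * rng.normal(size=h)        # hidden-to-output weights
b2 = 0.0

lr = 0.05
for _ in range(2000):
    # forward pass: y = g(W1^T x + b1), z = w2^T y + b2 (linear output for regression)
    A1 = X @ W1 + b1
    Y = g(A1)
    Z = Y @ w2 + b2
    err = Z - z_bar                               # de/dz, one entry per sample
    # backward pass (BP): apply the chain rule as in (1.19)-(1.22)
    grad_w2 = Y.T @ err / n
    grad_b2 = err.mean()
    delta1 = (err[:, None] * w2) * g_prime(A1)    # error propagated to the hidden layer
    grad_W1 = X.T @ delta1 / n
    grad_b1 = delta1.mean(axis=0)
    # gradient descent update
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    w2 -= lr * grad_w2; b2 -= lr * grad_b2

print("final mean squared error:", np.mean((g(X @ W1 + b1) @ w2 + b2 - z_bar) ** 2))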
In the model shown in Figure 1.6(a), we can observe that every node of the input layer is connected to every node of the hidden layer. This connection structure is called full connection, and a layer which is fully connected (FC) to the previous layer is called an FC layer [42]. Supposing that the numbers of nodes of the two layers are m and n, the number of parameters of the full connection will be m × n, which is a large number even when m and n are moderate. Excessive parameters will slow down the training process and increase the risk of overfitting, which is especially serious in DNNs. Parameter sharing is an effective technique to meet this challenge. A representative example using the parameter-sharing technique is the convolutional neural network (CNN), which is a specialized kind of neural network for processing
Figure 1.7 (a) The derivatives of each layer with respect to its previous layer and
(b) an example of the convolution operation performed on vectors
data that has a known grid-like topology [43], such as time-series data and matrix data. The name of CNNs comes from their basic operation, called convolution (which differs slightly from the convolution in mathematics). Though the convolution operation can be performed on vectors, matrices and even tensors of arbitrary order, we will introduce the vector case here for simplicity.
To perform the convolution operation on a vector x ∈ R^d, we first need a kernel, which is also a vector k ∈ R^l with l ≤ d. Let x[i : j] denote the vector generated by extracting the elements from the ith position to the jth position of x, i.e. x[i : j] = (x_i, x_{i+1}, ..., x_j)^T. Then the convolution of x and k is defined as
x ∗ k ≜ \begin{pmatrix} ⟨x[1:l], k⟩ \\ ⟨x[2:l+1], k⟩ \\ \vdots \\ ⟨x[d−l+1:d], k⟩ \end{pmatrix} = ŷ ∈ R^{d−l+1},   (1.23)
where ⟨·, ·⟩ denotes the inner product of two vectors. See Figure 1.7(b) for an example of convolution. The convolution operation for matrices and tensors can be defined similarly by using a matrix kernel and a tensor kernel, respectively. Based on the convolution operation, a new transformation structure, distinct from the full connection, can be built as

y = g(x ∗ k),   (1.24)

where k is the parameter that needs to be trained. The layer with this kind of transformation structure is called the convolution layer. Compared with the FC layer, the number of parameters of the convolution layer is dramatically smaller. Furthermore, the size of the kernel is independent of the number of nodes of the previous layer.
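A direct transcription of definition (1.23), together with a simple non-overlapping max-pooling operation, is sketched below; the kernel values and the pooling window size are illustrative assumptions.

import numpy as np

def conv1d(x, k):
    """Convolution of (1.23): inner products of k with sliding windows of x."""
    d, l = len(x), len(k)
    return np.array([np.dot(x[i:i + l], k) for i in range(d - l + 1)])

def maxpool1d(y, window=2):
    """Non-overlapping max-pooling over windows of the given size."""
    trimmed = y[: (len(y) // window) * window]
    return trimmed.reshape(-1, window).max(axis=1)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # input vector x
k = np.array([0.5, 1.0, -0.5])                 # convolution kernel k

y = conv1d(x, k)          # feature vector of length d - l + 1 = 4
print(y)
print(maxpool1d(y, 2))    # pooled features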
It should be noted that we can set several kernels in a convolution layer to generate richer features. For example, if we choose a convolution layer with M kernels as the hidden layer in the example shown in Figure 1.6(a), then M feature vectors will be generated, one per kernel. See Figure 1.8(a) for an example of max-pooling. The max-pooling operation for matrices and tensors can be defined similarly by using windows with different dimensions. Other layers, such as normalization layers and average-pooling layers [44], are not detailed here to stay focused.
Generally, a neural network may be constructed by using several kinds of layers. For example, in a classical architecture, the first few layers usually alternate between convolution layers and max-pooling layers, and FC layers are often used as the last few layers. A simple example of an architecture for classification with a convolutional network is shown in Figure 1.8(b).
The recent ten years have witnessed earthshaking development of deep learning. The state of the art in many applications has been dramatically improved due to this development. In particular, CNNs have brought about breakthroughs in processing multidimensional data such as images and video. In addition, recurrent neural networks [42] have shone light on sequential data such as text and speech; generative adversarial networks [45] are known as a class of models which can learn to mimic the true data distribution in order to generate high-quality artificial samples, such as images and speech; deep RL (DRL) [46] is a kind of tool for solving control and decision-making problems with high-dimensional inputs, such as board games, robot navigation and smart transportation. Reference [42] is an excellent introduction to
deep learning. More details about the theory and the implementation of deep learning
can be found in [43]. For a historical survey of deep learning, readers can refer to [47].
Many open-source deep-learning frameworks, such as TensorFlow and Caffe, make
neural networks easy to implement. Readers can find abundant user-friendly tutorials
from the Internet. Deep learning has been widely applied in many fields of wireless
communications, such as network prediction [48,49], traffic classification [50,51],
modulation recognition [52,53], localization [54,55] and anomaly detection [56–58].
Readers can refer to [59] for a comprehensive survey of deep learning in mobile and
wireless networks.
Figure 1.9 Structure chart for supervised learning technologies discussed in this chapter: the k-nearest neighbour method, classification and regression trees, the support vector machine, and the multilayer perceptron and deep learning
Rather than defining classes before observing the test data, clustering allows us to
find and analyse the undiscovered groups hidden in data. From Sections 1.2.1–1.2.4,
we will discuss four representative clustering algorithms. Density estimation aims to
estimate the distribution density of data in the feature space, and thus we can find the
high-density regions which usually reveal some important characteristics of the data.
In Section 1.2.5, we will introduce a popular density-estimation method: the Gaus-
sian mixture model (GMM). Dimension reduction aims to transform data in a high-dimensional space into a low-dimensional space, where the low-dimensional representation should preserve the principal structures of the data. In Sections 1.2.6 and 1.2.7, we will discuss two practical dimension-reduction technologies: principal component analysis (PCA) and autoencoder.
1.2.1 k-Means
k-Means [61] is one of the simplest unsupervised learning algorithms which solve
a clustering problem. This method only needs one input parameter k, which is the
number of clusters we expect to output. The main idea of k-means is to find k optimal
points (in the feature space) as the representatives of k clusters according to an evalu-
ation function, and each point in the data set will be assigned to a clusterbased on
the
n
distance between the point to each representative. Given a data set X = xi ∈ Rd i=1 ,
let Xi and ri denote the ith cluster and the corresponding representative, respectively.
Then, k-means aims to find the solution of the following problem:
k
min x − ri 2
ri , Xi
i=1 x∈Xi
k
(1.27)
s.t. Xi = X
i=1
Xi Xj = ∅ (i = j).
Notice that x∈Xi x − ri 2 measures how dissimilar the points in ith cluster to the
corresponding representative, and thus the object is to minimize the sum of these
dissimilarities.
However, the above problem has been shown to be NP-hard [62], which means the global optimal solution cannot be found efficiently in general cases. As an alternative, k-means provides an iterative process to obtain an approximate solution. Initially, it randomly selects k points as initial representatives. Then it alternately conducts two steps as follows. First, partition all points into k clusters by assigning each point to the cluster with the nearest representative. Second, take the mean point of each cluster as one of the k new representatives, which reveals the origin of the name k-means. The above steps are repeated until the clusters remain stable. The whole process is summarized in Algorithm 1.6.
Let us check the correctness of k-means step by step. In the first step, when the k representatives are fixed, each point is assigned to the nearest representative and thus the objective value of (1.27) will decrease or remain unchanged. In the second step, by fixing the k clusters, we can find the optimal solutions to the sub-problems of (1.27), i.e.:

\arg\min_{r \in R^d} \sum_{x \in X_i} \|x - r\|^2 = \frac{\sum_{x \in X_i} x}{|X_i|} \quad (i = 1, \ldots, k).   (1.28)

Thus, the value of the objective function will also decrease or remain unchanged in the second step. In summary, the two steps of k-means either decrease the objective value or reach convergence.
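A compact sketch of the two alternating steps of Algorithm 1.6 is given below; the random initialization and the toy Gaussian blobs are assumptions.

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Alternate assignment and mean-update steps until the clusters stabilize."""
    rng = np.random.default_rng(seed)
    reps = X[rng.choice(len(X), k, replace=False)]        # initial representatives
    for _ in range(iters):
        # step 1: assign each point to the nearest representative
        labels = np.argmin(((X[:, None, :] - reps[None, :, :]) ** 2).sum(-1), axis=1)
        # step 2: move each representative to the mean of its cluster (cf. (1.28))
        new_reps = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else reps[i]
                             for i in range(k)])
        if np.allclose(new_reps, reps):                   # clusters remain stable
            break
        reps = new_reps
    return labels, reps

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(c, 0.2, (50, 2)) for c in ([0, 0], [3, 0], [0, 3])])
labels, reps = kmeans(X, k=3)
print(reps)   # close to the three blob centres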
Because k-means is easy to implement and has short running time for low-
dimensional data, it has been widely used in various topics and as a preprocessing
step for other algorithms [63–66]. However, three major shortcomings are known for
the original k-means algorithm. The first one is that choosing an appropriate k is a non-
trivial problem. Accordingly, X -means [67] and G-means [68] have been proposed
based on the Bayesian information criterion [69] and Gaussian distribution. They can
estimate k automatically by using model-selection criteria from statistics. The second
one is that an inappropriate choice for the k initial representatives may lead to poor
performance. As a solution, the k-means++ algorithm [70] augmented k-means with
a simple randomized seeding technique and is guaranteed to find a solution that is
O(log k) competitive to the optimal k-means solution. The third one is that k-means
fails to discover clusters with complex shapes [71]. Accordingly, kernel k-means [72]
was proposed to detect arbitrary-shaped clusters, with an appropriate choice of the
kernel function. References [73], [74] and [75] present three applications of k-means
where dist(·, ·) can be any distance function chosen according to the application. Then, the density of x, denoted by ρ(x), is defined as the number of points belonging to the ε-neighbourhood of x, i.e.:

ρ(x) = |N_ε(x)|.
Figure 1.10 (a) An illustration for the connected relation, where the two core
points x and y are connected and (b) an illustration for clusters and
outliers. There are two clusters and seven outliers denoted by
four-pointed stars
Notice that there may exist some points which do not belong to any cluster, that is, X \ \bigcup_{i=1}^{k} C_i ≠ ∅. DBSCAN assigns these points as outliers because they are far from any normal points. An illustration is shown in Figure 1.10(b).
So far, we have presented the three main steps of DBSCAN. Algorithm 1.7 summarizes the details of DBSCAN. Let the number of points be n = |X|. Finding the ε-neighbourhood of the points is the most time-consuming step, with a computational complexity of O(n^2). The other steps can be implemented with nearly linear computational complexity. Thus, the computational complexity of DBSCAN is O(n^2). The neighbour-searching step of DBSCAN can be accelerated by using spatial index technology [60] and the groups method [77]. DBSCAN can find arbitrary-shaped clusters and is robust to outliers. However, the clustering quality of DBSCAN highly depends on the parameter ε, and it is non-trivial to find an appropriate value for ε. Accordingly, the OPTICS algorithm [78] provides a visual tool to help users find the hierarchical cluster structure and determine the parameters. Some applications of DBSCAN in wireless sensor networks can be found in [79–83].
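The sketch below is an assumption that relies on scikit-learn's DBSCAN implementation, whose eps and min_samples parameters play the roles of ε and the density threshold discussed above; outliers are reported with the label −1.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(8)
cluster1 = rng.normal([0, 0], 0.2, (100, 2))
cluster2 = rng.normal([3, 3], 0.2, (100, 2))
outliers = rng.uniform(-2, 5, (10, 2))            # scattered points far from both clusters
X = np.vstack([cluster1, cluster2, outliers])

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print("clusters found:", set(labels) - {-1})
print("number of outliers (label -1):", np.sum(labels == -1))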
DBSCAN, that is, estimating density, finding core points and forming clusters. However, there are two differences between them. First, FDP detects core points based on a novel criterion, named delta-distance, rather than on the density alone. Second, FDP forms clusters by using a novel concept, named the higher density nearest neighbour (HDN), rather than the neighbourhood used in DBSCAN. Next, we will introduce the two novel concepts followed by the details of FDP.
To begin with, FDP shares the same density definition with DBSCAN. Specifically, given a data set X = {x_i}_{i=1}^{n}, the density of a point x is computed as

ρ(x) = |N_ε(x)|,   (1.32)

where N_ε(x) denotes the ε-neighbourhood of x (see (1.29)). After computing the density, FDP defines the HDN of a point x, denoted by π(x), as the nearest point whose density is higher than that of x, i.e.:

π(x) ≜ \arg\min_{y \in X,\, ρ(y) > ρ(x)} \mathrm{dist}(y, x).   (1.33)

Specially, for the point with the highest density, its HDN is defined as the farthest point in X. Then, FDP defines the delta-distance of a point x as

δ(x) = \mathrm{dist}(x, π(x)).   (1.34)

Note that the delta-distance is small for most points and is much larger only for a point that is either a local maximum of the density or an outlier, because the HDN of an outlier may be far from it. In FDP, a local maximum of the density is called a core point.
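The density, HDN and delta-distance of (1.32)-(1.34) can be computed directly, as in the sketch below (the ε value and the toy data are assumptions); points combining a high density with a large delta-distance would then be picked as core points from a decision graph such as Figure 1.11(b).

import numpy as np
from scipy.spatial.distance import cdist

def fdp_statistics(X, eps):
    """Compute density rho, higher-density nearest neighbour (HDN) and delta-distance."""
    D = cdist(X, X)                              # pairwise distances dist(x_i, x_j)
    rho = (D < eps).sum(axis=1) - 1              # (1.32): neighbours within eps, excluding self
    n = len(X)
    delta, hdn = np.zeros(n), np.full(n, -1)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]       # points with strictly higher density
        if len(higher) == 0:                     # the highest-density point
            delta[i] = D[i].max()
        else:
            hdn[i] = higher[np.argmin(D[i, higher])]   # (1.33)
            delta[i] = D[i, hdn[i]]                    # (1.34)
    return rho, delta, hdn

rng = np.random.default_rng(9)
X = np.vstack([rng.normal([0, 0], 0.1, (60, 2)), rng.normal([2, 2], 0.1, (60, 2))])
rho, delta, hdn = fdp_statistics(X, eps=0.2)
print(np.argsort(rho * delta)[-2:])   # indices of the two most core-like points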
Figure 1.11 (a) A simple data set distributed in a two-dimensional space. There
are two clusters and three outliers (denoted by green ‘x’). The core
point of each cluster is denoted by a pentagram. (b) The decision
graph corresponding to the example shown in (a), which is the plot
of δ as a function of ρ for all points
Figure 1.12 An illustration for assigning remaining points. The number in each
point denotes its density. The HDN of each point is specified by an
arrow. The three core points are denoted by pentagrams
their corresponding atom clusters are then merged through α-reachable paths on a
k-NNs graph. Next, we will introduce RNKD followed by the details of RECOME.
The RNKD is based on an NKD. For a sample x in a data set X = {x_i}_{i=1}^{n}, the NKD of x is defined as

ρ(x) = \sum_{z \in N_k(x)} \exp\left( -\frac{\mathrm{dist}(x, z)}{σ} \right),   (1.35)

where N_k(x) denotes the k-NN set of x in X, and σ is a constant which can be estimated from the data set. NKD enjoys some good properties and allows easy discrimination of outliers. However, it fails to reveal clusters with various densities. To overcome this shortcoming, RNKD is proposed with the definition of:

ρ^*(x) = \frac{ρ(x)}{\max_{z \in N_k(x) \cup \{x\}} ρ(z)}.   (1.36)
Figure 1.13 (a) The heat map of NKD for a two-dimensional data set. (b) The heat
map of RNKD for a two-dimensional data set. Figures are quoted
from [90]
and each tree is rooted at a core sample (similar to the relation shown in Figure 1.12).
Furthermore, a core sample and its descendants in the tree are called an atom cluster
in RECOME. Atom clusters form the basis of final clusters; however, a true cluster
may consist of several atom clusters. This happens when many local maxima exist
in one true cluster. Thus, a merging step is introduced to combine atom clusters into
true clusters.
RECOME treats each core sample as the representative of the atom cluster that it
belongs to and merges atom clusters by merging core samples. To do that, it defines
another graph with undirected edges, k-NN graph, as
Furthermore, on the k-NN graph, two samples x and z are called α-connected if there
exists a path x, w1 , . . . , ws , z in Gk such that ρ ∗ (wi ) > α for i = 1, . . . , s, where
α is a user-specified parameter. It can be verified that the α-connected relation is
an equivalence relation on the core sample set. RECOME divides core samples into
equivalence classes by using this relation. Correspondingly, atom clusters associated
with core samples in the same equivalent class are merged into a final cluster. For
clarity, we summarize the details of RECOME in Algorithm 1.9.
In RECOME, there are two user-specified parameters, k and α. As discussed in [90], the clustering quality of RECOME is not sensitive to k, and it is recommended to tune k in the range [√n/2, √n]. On the other hand, the clustering result of RECOME largely depends on the parameter α. In particular, as α increases, cluster granularity (i.e. the volume of clusters) decreases and cluster purity increases. In [90], the authors also provide an auxiliary algorithm to help users tune α quickly. RECOME
has been shown to be effective on detecting clusters with different shapes, densities
and scales. Furthermore, it has nearly linear computational complexity if the k-NNs
of each sample are computed in advance. In addition, readers can refer to [93] for an
application to channel modelling in wireless communications.
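The NKD of (1.35) and the RNKD of (1.36) can be computed with a few lines of NumPy, as sketched below; setting σ to the mean k-NN distance is only one of several reasonable estimates and is an assumption.

import numpy as np
from scipy.spatial.distance import cdist

def rnkd(X, k):
    """Relative NKD of (1.36), built on the NKD of (1.35)."""
    D = cdist(X, X)
    knn = np.argsort(D, axis=1)[:, 1:k + 1]              # indices of the k-NN set N_k(x)
    knn_dists = np.take_along_axis(D, knn, axis=1)
    sigma = knn_dists.mean()                             # a simple estimate of sigma
    rho = np.exp(-knn_dists / sigma).sum(axis=1)         # NKD (1.35)
    rho_star = np.empty_like(rho)
    for i in range(len(X)):
        neighbourhood = np.append(knn[i], i)             # N_k(x) united with {x}
        rho_star[i] = rho[i] / rho[neighbourhood].max()  # RNKD (1.36)
    return rho, rho_star

rng = np.random.default_rng(10)
X = np.vstack([rng.normal([0, 0], 0.1, (80, 2)),         # dense cluster
               rng.normal([3, 3], 0.5, (40, 2))])        # sparser cluster
rho, rho_star = rnkd(X, k=8)
print(rho[:80].mean() / rho[80:].mean())                 # raw densities differ strongly
print(rho_star[:80].mean(), rho_star[80:].mean())        # relative densities are comparable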
A GMM models the distribution density of the data as a mixture of k Gaussian components:

f(x) = \sum_{i=1}^{k} φ_i N(x \mid μ_i, Σ_i),   (1.38)
where φ_1, ..., φ_k are non-negative with \sum_{i=1}^{k} φ_i = 1, and

N(x \mid μ_i, Σ_i) = \frac{1}{(2π)^{d/2} |Σ_i|^{1/2}} \exp\left( -\frac{1}{2} (x − μ_i)^T Σ_i^{-1} (x − μ_i) \right)   (1.39)
is the Gaussian distribution with mean vector μ_i and covariance matrix Σ_i. The parameter k controls the capacity and the complexity of the GMM. Considering a two-dimensional data set shown in Figure 1.14(a), Figure 1.14(b) shows the fitted distribution obtained by using a GMM with k = 5, where we observe that only a fuzzy outline is preserved and most details are lost. In contrast, Figure 1.14(c) shows the case for k = 20, where we find that the fitted distribution reflects the main characteristics of the data set. In fact, by increasing k, a GMM can approximate any continuous distribution to some desired degree of accuracy. However, this does not mean that a larger k is always better, because a large k may lead to overfitting and a huge time cost for the parameter estimation. In most cases, k is inferred empirically from the data.
Figure 1.14 (a) A two-dimensional data set, (b) the fitted distribution by using
GMM with k = 5 and (c) the fitted distribution by using GMM with
k = 20
Now, we discuss the parameter estimation for GMM. Given a data set {x_j}_{j=1}^{n}, the log-likelihood of the GMM is given by

L = \sum_{j=1}^{n} \ln f(x_j) = \sum_{j=1}^{n} \ln\left( \sum_{i=1}^{k} φ_i N(x_j \mid μ_i, Σ_i) \right).   (1.40)
\max_{\{φ_i\}, \{μ_i\}, \{Σ_i\}} L
\text{s.t.}\ \ Σ_i ⪰ 0 \quad (i = 1, \ldots, k),
φ_i ≥ 0 \quad (i = 1, \ldots, k),   (1.41)
\sum_{i=1}^{k} φ_i = 1.
2. M-step: Update the parameter θ̄ with the solution that maximizes (1.43), that is,

θ̄ = \arg\max_{θ} \sum_{Z} p(Z \mid X, θ̄) \ln p(X, Z \mid θ).   (1.44)

Finally, the resulting θ̄ will be the estimated parameter set. Here, we do not discuss the correctness of the EM algorithm, to stay focused. Readers can refer to [94] for more details.
= \sum_{j=1}^{n} \sum_{z_j} p(z_j \mid x_j, θ̄) \ln p(x_j, z_j \mid θ)
= \sum_{j=1}^{n} \sum_{l=1}^{k} p(z_j = l \mid x_j, θ̄) \ln\left( φ_l N(x_j \mid μ_l, Σ_l) \right)   (1.46)
= \sum_{j=1}^{n} \sum_{l=1}^{k} γ_{jl} \ln\left( φ_l N(x_j \mid μ_l, Σ_l) \right).
\text{s.t.}\ \ Σ_l ⪰ 0 \quad (l = 1, \ldots, k),
φ_l ≥ 0 \quad (l = 1, \ldots, k),   (1.47)
\sum_{l=1}^{k} φ_l = 1,
Σ_l = \frac{1}{n_l} \sum_{j=1}^{n} γ_{jl} (x_j − μ_l)(x_j − μ_l)^T,
2 repeat
   /* E-step */
3   for l = 1 to k do
4     for j = 1 to n do
5       γ_{jl} = φ̄_l N(x_j | μ̄_l, Σ̄_l) / \sum_{i=1}^{k} φ̄_i N(x_j | μ̄_i, Σ̄_i);
6     n_l = \sum_{j=1}^{n} γ_{jl};
   /* M-step */
7   for l = 1 to k do
8     φ̄_l = n_l / n;
9     μ̄_l = (1/n_l) \sum_{j=1}^{n} γ_{jl} x_j;
10  for l = 1 to k do
11    Σ̄_l = (1/n_l) \sum_{j=1}^{n} γ_{jl} (x_j − μ̄_l)(x_j − μ̄_l)^T;
12 until convergence;
where we have defined n_l = \sum_{j=1}^{n} γ_{jl}. We conclude the whole procedure in Algorithm 1.10.
In addition to fitting the distribution density of data, the GMM can also be used for clustering data. Specifically, if we regard the k Gaussian models as the 'patterns' of k clusters, then the probability that a sample x_j comes from the lth pattern is given by

p(z_j = l \mid x_j) = \frac{φ̄_l N(x_j \mid μ̄_l, Σ̄_l)}{\sum_{i=1}^{k} φ̄_i N(x_j \mid μ̄_i, Σ̄_i)}.   (1.49)

Thus, l^* ≜ \arg\max_l p(z_j = l \mid x_j) gives the index of the cluster that x_j most likely belongs to. Furthermore, a low p(z_j = l^* \mid x_j) may imply that x_j is an outlier. Reference [95] shows a diffusion-based EM algorithm for distributed estimation of GMM in wireless sensor networks. References [96] and [97] present two applications of GMM in target tracking and signal-strength prediction, respectively.
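In practice, the EM recursion of Algorithm 1.10 is available off the shelf. The sketch below is an assumption that fits a GMM with scikit-learn and then uses (1.49) to assign cluster labels and flag low-probability samples as possible outliers.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(11)
X = np.vstack([rng.normal([0, 0], 0.3, (150, 2)),
               rng.normal([4, 1], 0.5, (150, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # EM estimation
resp = gmm.predict_proba(X)            # posterior p(z = l | x), cf. (1.49)
labels = resp.argmax(axis=1)           # most likely cluster l*
confidence = resp.max(axis=1)          # low values may indicate outliers

print(gmm.means_)                      # estimated mean vectors mu_l
print("least confident sample:", X[confidence.argmin()])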
Figure 1.15 (a) A three-dimensional data set, (b) the points are distributed near a
two-dimensional plane and (c) projection of the points onto the plane
popular one. PCA can be derived from the perspectives of both geometry and statistics.
Here we will focus on the former perspective since it meets our intuition better.
To begin with, let us consider a three-dimensional data set as shown in
Figure 1.15(a). Though all points lie in a three-dimensional space, as shown in
Figure 1.15(b), they are distributed near a two-dimensional plane. As shown in Fig-
ure 1.15(c), after projecting all points onto the plane, we can observe that, in fact,
they are distributed in a rectangle. In this example, we find that the low-dimensional
representation captures the key characteristic of the data set.
Reviewing the above example, the key step is to find a low-dimensional plane
near all points. In PCA, this problem is formalized as an optimization problem by
using linear algebra. Specifically, given a data set {xi ∈ Rd }ni=1 , PCA intends to find
a t-dimensional (t < d) plane1 that minimizes the sum of the square of the distance
between each point and its projection onto the plane. Formally, a t-dimensional plane
can be described by a semi-orthogonal matrix B = (b1 , . . . , bt ) ∈ Rd×t (i.e. BT B = I)
and a shift vector s ∈ Rd . By linear algebra, {Bz|z ∈ Rt } is a t-dimensional subspace.
¹ A formal name should be affine subspace. Here we use 'plane' for simplicity.
[Figure: a two-dimensional plane {s + Bz | z ∈ R^2} described by the shift vector s and the semi-orthogonal basis B = (b_1, b_2)]
Noticing that

\|(I − BB^T)(x_i − μ)\|^2 = \|x_i − μ\|^2 − (x_i − μ)^T BB^T (x_i − μ),   (1.53)

(1.52) is equivalent to

\max_{B} \sum_{i=1}^{n} (x_i − μ)^T BB^T (x_i − μ) = \mathrm{Tr}\left( X̄^T BB^T X̄ \right) = \mathrm{Tr}\left( B^T X̄ X̄^T B \right)
\text{s.t.}\ \ B^T B = I.   (1.54)
1 compute μ = (1/n) \sum_{i=1}^{n} x_i;
2 define X̄ = (x_1 − μ, ..., x_n − μ) ∈ R^{d×n};
3 compute the first t orthogonal eigenvectors p_1, ..., p_t of X̄X̄^T;
4 define P = (p_1, ..., p_t) ∈ R^{d×t};
5 Y = P^T X̄;
B∗ = P (1.55)
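Algorithm 1.11 maps directly onto a few lines of NumPy, as in the sketch below; the eigenvectors of X̄X̄^T associated with the t largest eigenvalues give the basis B^* = P. The synthetic three-dimensional data lying near a plane is an assumption in the spirit of Figure 1.15.

import numpy as np

def pca(X, t):
    """PCA via the eigen-decomposition of X_bar X_bar^T (columns of X are samples)."""
    mu = X.mean(axis=1, keepdims=True)            # step 1: mean vector
    X_bar = X - mu                                # step 2: centred data matrix
    C = X_bar @ X_bar.T                           # d x d scatter matrix
    eigvals, eigvecs = np.linalg.eigh(C)          # eigenvalues in ascending order
    P = eigvecs[:, ::-1][:, :t]                   # steps 3-4: top-t eigenvectors
    Y = P.T @ X_bar                               # step 5: t-dimensional representation
    return P, Y, mu

rng = np.random.default_rng(12)
# 3-D points lying close to a 2-D plane, as in Figure 1.15
Z = rng.uniform(0, 1, (2, 200))
basis = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
X = basis @ Z + 0.01 * rng.normal(size=(3, 200))

P, Y, mu = pca(X, t=2)
reconstruction = P @ Y + mu
print("mean squared projection error:", np.mean((X - reconstruction) ** 2))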
1.2.7 Autoencoder
In this section, we will introduce another dimension-reduction method, the autoencoder. An autoencoder is a neural network (see Section 1.1.3.3) used to learn an effective representation (an encoding) of a data set, where the learned code has a lower dimension than the original data.
As shown in Figure 1.17, an autoencoder consists of two parts, i.e. an encoder
f (·|η) and a decoder g(·|θ ). Each of them is a neural network, where η and θ denote the
parameter sets of the encoder and the decoder, respectively. Given an input x ∈ Rd ,
the encoder is in charge of transforming x into a code z ∈ Rt , i.e. f (x|η) = z, where
t is the length of the code with t < d. In contrast, the decoder tries to recover the
original feature x from the code z, i.e. g(z|θ ) = x̄ ∈ Rd such that x̄ ≈ x. Thus, given
Figure 1.17 The flowchart of an autoencoder. First, the encoder f(·|η) encodes an input x into a code z of short length. Then the code z is transformed by the decoder g(·|θ) into an output x̄ of the same size as x. Given a data set {x_i}_{i=1}^{n}, the objective of training is to learn the parameter sets that minimize the sum of squared errors \sum_{i=1}^{n} \|x_i − x̄_i\|^2
a data set {x_i}_{i=1}^{n}, the training process of the autoencoder can be formulated as the following optimization problem:

\min_{η, θ} \sum_{i=1}^{n} \|x_i − g(f(x_i \mid η) \mid θ)\|^2.   (1.56)
By limiting the length of the code, minimizing the objective function forces the code to capture the critical structure of the input features and ignore trivial details such as sparse noise. Thus, besides dimension reduction, an autoencoder can also be used for de-noising.
In Figure 1.18, we show a specific implementation for the dimension-reduction task on the MNIST data set of handwritten digits [41], where each sample is a grey-scale image of size 28 × 28. For simplicity, we stack the columns of each image into a vector; thus the input has a dimension of 28 × 28 = 784. As we see, the encoder consists of three FC layers, where each layer is equipped with a sigmoid activation function. The first layer non-linearly transforms an input vector with 784 dimensions into a hidden vector with 256 dimensions, and the second layer continues to reduce the dimension of the hidden vector from 256 to 128. Finally, after the transformation of the third layer, we get a code of 64 dimensions, which is far smaller than the dimension of the input vector. On the other hand, the decoder shares the same structure as the encoder except that each FC layer transforms a low-dimensional vector into a high-dimensional vector. The decoder tries to reconstruct the original input vector with 784 dimensions from the code with 64 dimensions. In addition, a sparsity constraint on the parameters can be added as a regularization term to achieve better performance. After training the autoencoder using the BP algorithm
[Figure 1.18: the autoencoder architecture, with encoder FC layers of sizes 784×256, 256×128 and 128×64 and decoder FC layers of sizes 64×128, 128×256 and 256×784; inputs and outputs have 784 (28×28) dimensions and the code has 64 (8×8) dimensions]
[Figure 1.19: example inputs, their 64-dimensional codes and the reconstructed outputs]
(see Section 1.1.3.3), we can obtain the result shown in Figure 1.19. From this figure,
we observe that an image can be reconstructed with high quality from a small-size
code, which indicates that the main feature of the original image is encoded into the
code.
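A minimal sketch of the 784-256-128-64 architecture described above is given below. The chapter does not prescribe a framework; PyTorch, the Adam optimizer, the learning rate and the random stand-in data for MNIST are all illustrative assumptions.

import torch
import torch.nn as nn

# Encoder and decoder mirroring Figure 1.18: 784 -> 256 -> 128 -> 64 -> 128 -> 256 -> 784.
encoder = nn.Sequential(nn.Linear(784, 256), nn.Sigmoid(),
                        nn.Linear(256, 128), nn.Sigmoid(),
                        nn.Linear(128, 64), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(64, 128), nn.Sigmoid(),
                        nn.Linear(128, 256), nn.Sigmoid(),
                        nn.Linear(256, 784), nn.Sigmoid())
autoencoder = nn.Sequential(encoder, decoder)

X = torch.rand(512, 784)                 # stand-in for flattened 28 x 28 MNIST images
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                   # reconstruction error, cf. (1.56)

for epoch in range(20):                  # BP training loop
    optimizer.zero_grad()
    loss = loss_fn(autoencoder(X), X)    # ||x - x_bar||^2 averaged over elements
    loss.backward()
    optimizer.step()

codes = encoder(X)                       # 64-dimensional codes
print(codes.shape, float(loss))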
We can use an encoder to compress a high-dimensional sample into a low-dimensional code and then use a decoder to reconstruct the sample from the code. An interesting question is whether we can feed a randomly generated code into the decoder to obtain a new sample. Unfortunately, in most cases, the generated samples are either very similar to the original data or meaningless. Inspired by this idea, the variational autoencoder [104] was proposed. Different from the autoencoder, the variational autoencoder tries to learn an encoder that encodes the distribution of the original data rather than the data itself. By using a well-designed objective function, the distribution of the original data can be encoded into some low-dimensional normal distributions through an encoder. Correspondingly, a decoder is trained to transform the normal
[Figure: structure chart for unsupervised learning technologies discussed in this chapter: clustering (k-means, DBSCAN, FDP and RECOME), density estimation (Gaussian mixture model) and dimension reduction (PCA and autoencoder)]
distributions into the real data distribution. Thus, one can first sample a code from the normal distributions and then feed it to the decoder to obtain a new sample. For more details about the variational autoencoder, please refer to [104]. In wireless communications and sensor networks, the autoencoder has been applied in many fields, such as data compression [105], sparse data representation [106], wireless localization [107] and anomaly detection [108].
shown to be effective on detecting clusters with various shapes, densities and scales. In addition, it also provides an auxiliary algorithm to help users select parameters.
Density estimation is a basic problem in unsupervised learning. We have presented the GMM, which is one of the most popular models for density estimation. The GMM can approximate any continuous distribution to some desired degree of accuracy as long as the parameter k is large enough, but accordingly, the time cost for the estimation of its parameters will increase. Dimension reduction plays an important role in the compression, comprehension and visualization of data. We have introduced two dimension-reduction technologies, PCA and autoencoder. PCA can be deduced from an intuitive geometric view and has been shown to be highly effective for data distributed in a linear structure. However, it may destroy non-linear topological relations in the original data. The autoencoder is a dimension-reduction method based on neural networks. Compared with PCA, the autoencoder has great potential to preserve the non-linear structure of the original data, but it needs more time to adjust parameters for a given data set. In Table 1.3, we summarize the applications of unsupervised learning in wireless communications. For more technologies of unsupervised learning, readers can refer to [8].
[Figure: the agent–environment interaction loop in RL: in state s_t the agent takes action a_t, receives reward r_t from the environment and moves to the next state s_{t+1}]
2. Here we suppose the reward function is deterministic for simplicity, though it can be a random function.
3. People often pay more attention to the short-term reward.
Introduction of machine learning 43
In addition, the strategy of the agent taking actions is defined as a policy π , where
π(a|s) gives the probability of the agent taking an action a in a state s. In other words,
a policy fully defines the behaviour of an agent. Given an initial state s0 and a policy
π, an MDP can ‘run’ as follows:
For t = 0, 1, 2, . . . :
    a_t ∼ π(·|s_t);
    r_t = R(s_t, a_t);                                                          (1.57)
    s_{t+1} ∼ P(·|s_t, a_t).
Our objective is to find a policy π* that maximizes the cumulative discounted reward
Σ_{t=0}^{∞} γ^t r_t on average.
To smooth the ensuing discussion, we need to introduce two functions, i.e. a value
function and a Q-value function. The value function with the policy π is defined as
the expectation of the cumulative discounted reward, i.e.:

V^π(s) ≜ E[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s ],   s ∈ S,                           (1.58)

and the Q-value function is defined analogously by additionally conditioning on the first action:

Q^π(s, a) ≜ E[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s, a_0 = a ],   s ∈ S, a ∈ A.        (1.59)
Intuitively, the value function and the Q-value function evaluate how good a state and
a state-action pair are under a policy π, respectively. If we expand the summations in
the value function, we have

V^π(s) = E[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s ]
       = E[ r_0 + γ Σ_{t=1}^{∞} γ^{t−1} r_t | s_0 = s ]
       = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) E[ Σ_{t=1}^{∞} γ^{t−1} r_t | s_1 = s′ ] )
       = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) V^π(s′) ),            (1.60)

and, similarly, for the Q-value function,

Q^π(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) Σ_{a′∈A} π(a′|s′) Q^π(s′, a′).      (1.61)
Equations (1.60) and (1.61) are so-called Bellman equations, which are foundations
of RL.
On the other hand, if we fix s and a, V π (s) and Qπ (s, a) in fact evaluate how good
a policy π is. Thus, a policy that maximizes V π (s) (Qπ (s, a)) will be a good candidate
for π* though s and a are fixed. A natural question is whether there exists a single policy that
maximizes V^π(s) (Q^π(s, a)) for every s ∈ S (and a ∈ A) simultaneously. The following theorem gives
a positive answer:
Theorem 1.1 ([110]). For any MDP, there exists an optimal policy π* such that

V^{π*}(s) = max_π V^π(s)   ∀s ∈ S

and

Q^{π*}(s, a) = max_π Q^π(s, a)   ∀s ∈ S and ∀a ∈ A.
According to Theorem 1.1, we can define the optimal value function and the
optimal Q-value function as

V*(·) ≜ V^{π*}(·)   and   Q*(·, ·) ≜ Q^{π*}(·, ·),                              (1.62)

respectively, which are useful in finding the optimal policy. Furthermore, if V*(·) and
Q*(·, ·) have been obtained, we can construct the optimal policy π* by letting:

π*(a|s) = 1   if a = arg max_{a∈A} Q*(s, a) = arg max_{a∈A} [ R(s, a) + γ Σ_{s′∈S} P(s′|s, a) V*(s′) ],
π*(a|s) = 0   otherwise.                                                        (1.63)
In other words, there always exists a deterministic optimal policy for any MDP. In
addition, we have the Bellman optimality equations as follows:

V*(s) = max_{a∈A} [ R(s, a) + γ Σ_{s′∈S} P(s′|s, a) V*(s′) ]                    (1.64)

and

Q*(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) max_{a′∈A} Q*(s′, a′).               (1.65)
MDP and the Bellman equations are theoretical cornerstones of RL, based on
which many RL algorithms have been derived as we will show below. For more results
regarding MDP, readers can refer to [111].
The policy iteration takes an iterative strategy to find the optimal policy π*. Given
an MDP M = ⟨S, A, P, R, γ⟩ and an initial policy π, the policy iteration alternately
executes the following two steps:
How to compute a value function? Given an MDP and a policy, the corresponding
value function can be evaluated by Bellman equation (1.60). This process is described
in Algorithm 1.12. The function sequence in Algorithm 1.12 can be proved to converge
to V π .
How to improve a policy? Given an MDP and the value function of a policy, the
policy can be improved by using (1.63). As a result, we summarize the policy iteration
algorithm in Algorithm 1.13.
The value iteration, as its name suggests, iteratively updates a value function until
it reaches the optimal value function. It has a very concise form since it simply iterates
according to the Bellman optimality equation (1.64). We present the value iteration
algorithm in Algorithm 1.14. After obtaining the optimal value function, we can
construct the optimal policy by using (1.63).
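As a concrete illustration of the value iteration just described, the following Python sketch runs (1.64) on a made-up two-state, two-action MDP and then reads off a deterministic policy via (1.63); all transition probabilities and rewards are arbitrary illustrative numbers.

import numpy as np

# Toy MDP: 2 states, 2 actions; P[s, a, s'] and R[s, a] are illustrative numbers
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):                       # value iteration, (1.64)
    Q = R + gamma * (P @ V)                 # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)                   # deterministic optimal policy via (1.63)
print("V* =", V, "pi* =", policy)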
In summary, when an MDP is given as known information, we can use either
the policy iteration or the value iteration to find the optimal policy. The value iteration
has a simpler form, but the policy iteration usually converges more quickly in
practice [111]. In wireless communications, the policy and the value iteration methods
have been applied to many tasks, such as heterogeneous wireless networks [112],
energy-efficient communications [113] and energy harvesting [114].
(Algorithm 1.13, policy iteration: alternately evaluate V^π with Algorithm 1.12 and improve the policy to π̄ via (1.63), then set π ← π̄ and repeat until π converges.)
It is defined by the expectation over all trials starting from s. Now, suppose that
we independently conduct l experiments by applying the policy π, and thus we
obtain l trajectories {τ_i}_{i=1}^{l}, where τ_i = (s ≡ s_0^(i), a_0^(i), r_0^(i), s_1^(i), a_1^(i), r_1^(i), . . . , r_{n_i}^(i)). Let
R^(i) = Σ_{t=0}^{n_i} γ^t r_t^(i). Then, according to the law of large numbers, we have

V^π(s) = E[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s ] ≈ (1/l) Σ_{i=1}^{l} R^(i)
       = (1/(l−1)) Σ_{i=1}^{l−1} R^(i) + (1/l) ( R^(l) − (1/(l−1)) Σ_{i=1}^{l−1} R^(i) )
when l is large enough. Therefore, the value function can be estimated if we can afford
numerous experiments. Similarly, the Q-value function can also be estimated by
using the MC method. However, in practice, we often face an online infinite trajec-
tory: s_0, a_0, r_0, . . . . In this situation, we can update the Q-value (or value) function
incrementally as shown in Algorithm 1.15. The truncation number n in Algorithm 1.15
is used to discard the negligible remainder of the discounted sum.
Once we estimate the Q-value function for a given policy π, the optimal policy
can be obtained iteratively as presented in Algorithm 1.16. Here ε is introduced to
occasionally try actions of small probability, i.e., ε-greedy exploration.
MC methods are unbiased and easy to implement. However, they often suffer from
high variance in practice since the MDP model in the real world may be so complicated
that a huge number of samples is required to achieve a stable estimation. This restricts
the usage of MC methods when the cost of experiments is high.
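The incremental running-mean estimate above can be written in a few lines of Python. In the sketch below, run_episode, env_step and the random-walk environment are hypothetical stand-ins introduced only to make the example self-contained.

import random

gamma = 0.9

def run_episode(policy, start_state, env_step, max_len=100):
    """Roll out one trajectory and return its discounted return from start_state."""
    s, ret, discount = start_state, 0.0, 1.0
    for _ in range(max_len):
        a = policy(s)
        s, r, done = env_step(s, a)          # hypothetical environment transition
        ret += discount * r
        discount *= gamma
        if done:
            break
    return ret

def mc_value(policy, start_state, env_step, episodes=1000):
    """Incremental mean of returns over independent trajectories (law of large numbers)."""
    V, count = 0.0, 0
    for _ in range(episodes):
        count += 1
        R_i = run_episode(policy, start_state, env_step)
        V += (R_i - V) / count                # V <- V + (R^(l) - V)/l
    return V

# Trivial random-walk environment standing in for a real MDP
def env_step(s, a):
    s_next = s + (1 if a == 1 else -1)
    return s_next, float(s_next == 3), abs(s_next) >= 3

print(mc_value(lambda s: random.choice([0, 1]), 0, env_step))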
Based on the Bellman equation (1.60), the value function of a state s can be estimated
from l samples {(r^(i), s′^(i))}_{i=1}^{l} of the immediate reward and the successor state observed
when leaving s:

V^π(s) = E[ r + γ V^π(s′) | s_0 = s ]
       ≈ (1/l) Σ_{i=1}^{l} ( r^(i) + γ V^π(s′^(i)) )                            (1.66)
       = μ_{l−1} + (1/l) ( r^(l) + γ V^π(s′^(l)) − μ_{l−1} )
       ≈ V^π(s) + (1/l) ( r^(l) + γ V^π(s′^(l)) − V^π(s) ),

where μ_{l−1} = (1/(l−1)) Σ_{i=1}^{l−1} ( r^(i) + γ V^π(s′^(i)) ). Therefore, to acquire an estimation
of V^π(s), we can update it by the fixed point iteration [111]:

V^π(s) ← V^π(s) + (1/l) ( r^(l) + γ V^π(s′^(l)) − V^π(s) ).                     (1.67)

In practice, 1/l in (1.67) is usually replaced by a monotonically decreasing sequence.
So far, we have presented the main idea of the TD learning. The detailed steps of TD
learning are summarized in Algorithm 1.17, where the learning rate sequence should
satisfy Σ_{t=0}^{∞} α_t = ∞ and Σ_{t=0}^{∞} α_t² < ∞.
Similarly, the Q-value function w.r.t. a policy can be estimated by using Algo-
rithm 1.18, which is also known as the Sarsa algorithm; its core update is, for
t = 0, 1, . . . ,

Q^π(s_t, a_t) ← Q^π(s_t, a_t) + α_t ( r_t + γ Q^π(s_{t+1}, a_{t+1}) − Q^π(s_t, a_t) ).

Based on the Sarsa algorithm, we can improve the policy alternately by using
Algorithm 1.16, where the Q-value function is estimated by the Sarsa algorithm.
On the other hand, if we choose the Bellman optimality equation (1.65) as the
iteration strategy, we can derive the famous Q-learning algorithm, as presented in
Algorithm 1.19.
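To make the single-line difference between the Sarsa update (Algorithm 1.18) and the Q-learning update (Algorithm 1.19) explicit, here is a minimal tabular sketch; the state/action sizes and the sample transition are illustrative placeholders.

import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.9
Q_sarsa = np.zeros((n_states, n_actions))
Q_qlearn = np.zeros((n_states, n_actions))

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy: bootstrap with the action actually taken in s_next
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next):
    # Off-policy: bootstrap with the greedy action in s_next (Bellman optimality, (1.65))
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# One illustrative transition (s, a, r, s', a')
s, a, r, s_next, a_next = 0, 1, 0.5, 2, 0
sarsa_update(Q_sarsa, s, a, r, s_next, a_next)
q_learning_update(Q_qlearn, s, a, r, s_next)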
In summary, TD learning, Sarsa and Q-learning are all algorithms based on
the Bellman equations and MC sampling. Among them, the goal of TD learning
and Sarsa is to estimate the value or Q-value function for a given policy, while
Q-learning aims at learning the optimal Q-value function directly. It should be noted
that, by using TD learning, we can only estimate the value function, which is not
enough to determine a policy because the state transition probability is unknown to
us. In contrast, a policy can be derived from the Q-value function, which is estimated
by Sarsa and Q-learning. In practice, Sarsa often demonstrates better performance
than Q-learning. Furthermore, all of the three methods can be improved to converge
more quickly by introducing the eligibility trace. Readers can refer to [111] for
more details.
Moreover, TD learning, Sarsa and Q-learning have been widely applied in wire-
less communications. References [115] and [116] demonstrate two applications of TD
learning in energy-aware sensor communications and detection of spectral resources,
respectively. References [117], [118] and [119] show three applications of Sarsa in
channel allocation, interference mitigation and energy harvesting, respectively. Ref-
erences [120], [121] and [122] present three applications of Q-learning in routing
protocols, power allocation and caching policy, respectively.
Algorithm 1.20: Value function approximation via a DNN
Input: a sample set D = {(s^(i), r^(i), s′^(i))}_{i=1}^{l}, batch size m, learning rate α
Output: approximate value function V̂(·, W)
1 Initialize W;
2 repeat
3   Randomly sample a subset {(s^(j), r^(j), s′^(j))}_{j=1}^{m} from D;
4   W ← W − (α/m) Σ_{j=1}^{m} ( V̂(s^(j), W) − (r^(j) + γ V̂(s′^(j), W)) ) ∇_W V̂(s^(j), W);
5 until convergence;
Now, if we have obtained a finite sample set {(s^(i), r^(i), s′^(i))}_{i=1}^{l} from the experience,
(1.71) can be estimated as

(1/l) Σ_{i=1}^{l} ( V̂(s^(i), W) − (r^(i) + γ V̂(s′^(i), W)) ) ∇_W V̂(s^(i), W).   (1.72)
Thus, we can use (1.72) to update W until convergence. The value function
approximation via DNNs is summarized in Algorithm 1.20.
On the other hand, the Q-value function can be approximated in a similar
way, as described in Algorithm 1.21. After the value function or the Q-value function
is approximated, we can work out the optimal policy by using the policy iteration
(Algorithm 1.13 or 1.16). However, for a large-scale problem, a smarter way
is to parametrize the policy by using another DNN, which will be discussed in the
following part.
Algorithm 1.21: Q-value function approximation via a DNN
Input: a sample set D = {(s^(i), a^(i), r^(i), s′^(i), a′^(i))}_{i=1}^{l}, batch size m, learning rate α
Output: approximate Q-value function Q̂(·, ·, U)
1 Initialize U;
2 repeat
3   Randomly sample a subset {(s^(j), a^(j), r^(j), s′^(j), a′^(j))}_{j=1}^{m} from D;
4   U ← U − (α/m) Σ_{j=1}^{m} ( Q̂(s^(j), a^(j), U) − (r^(j) + γ Q̂(s′^(j), a′^(j), U)) ) ∇_U Q̂(s^(j), a^(j), U);
5 until convergence;
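With automatic differentiation, one mini-batch step of Algorithm 1.20 can be sketched as follows in PyTorch (the network size, data and hyper-parameters are illustrative assumptions): the bootstrapped target r + γV̂(s′, W) is held fixed so that the gradient of the squared error reproduces (1.72); the Q-value variant of Algorithm 1.21 is analogous.

import torch
import torch.nn as nn

gamma, alpha, m = 0.9, 1e-3, 32
state_dim = 4

V_hat = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(V_hat.parameters(), lr=alpha)

# A random mini-batch {(s, r, s')} standing in for samples drawn from D
s = torch.rand(m, state_dim)
r = torch.rand(m, 1)
s_next = torch.rand(m, state_dim)

with torch.no_grad():                        # target is held fixed (semi-gradient)
    target = r + gamma * V_hat(s_next)

loss = 0.5 * ((V_hat(s) - target) ** 2).mean()   # gradient of this loss matches (1.72)
optimizer.zero_grad()
loss.backward()
optimizer.step()                             # W <- W - alpha * averaged gradient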
To begin with, let π̂(s, a, θ) denote a DNN with parameter θ that receives
two inputs s ∈ S and a ∈ A. Our goal is to learn the parameter θ such that the
expectation of the total reward is maximized, i.e.:

max_θ J(θ) ≜ E[ Σ_{t=0}^{∞} γ^t r_t | π̂(·, ·, θ) ] = ∫_τ g(τ) P{τ | π̂(·, ·, θ)} dτ,   (1.73)

where τ = (s_0, a_0, r_0, s_1, a_1, r_1, . . .) and g(τ) = Σ_{t=0}^{∞} γ^t r_t denote a trajectory and its
reward, respectively. To update θ, we need to take the gradient w.r.t. θ, that is

∇_θ J(θ) = ∫_τ g(τ) ∇_θ P{τ | π̂(·, ·, θ)} dτ.                                   (1.74)
But the gradient in (1.74) is hard to estimate directly since it acts on the trajectory
probability. Fortunately, this difficulty can be resolved by using a nice trick as follows:

∇_θ J(θ) = ∫_τ g(τ) ∇_θ P{τ | π̂(·, ·, θ)} dτ
         = ∫_τ g(τ) P{τ | π̂(·, ·, θ)} ( ∇_θ P{τ | π̂(·, ·, θ)} / P{τ | π̂(·, ·, θ)} ) dτ
         = ∫_τ g(τ) P{τ | π̂(·, ·, θ)} ∇_θ log P{τ | π̂(·, ·, θ)} dτ
         = E[ g(τ) ∇_θ log P{τ | π̂(·, ·, θ)} | π̂(·, ·, θ) ].                    (1.75)
Moreover, we have

∇_θ log P{τ | π̂(·, ·, θ)} = ∇_θ log [ P(s_0) Π_{t=0}^{∞} π̂(s_t, a_t, θ) P(s_{t+1} | s_t, a_t) ]
                          = ∇_θ [ log P(s_0) + Σ_{t=0}^{∞} log P(s_{t+1} | s_t, a_t) + Σ_{t=0}^{∞} log π̂(s_t, a_t, θ) ]
                          = Σ_{t=0}^{∞} ∇_θ log π̂(s_t, a_t, θ).                 (1.76)
Combining (1.75) and (1.76), and using the advantage Q^π − V^π as the weight of each
score function term, the policy gradient can be estimated by

∇_θ J(θ) ≈ (1/l) Σ_{i=1}^{l} ( Q^{π̂(·,·,θ)}(s^(i), a^(i)) − V^{π̂(·,·,θ)}(s^(i)) ) ∇_θ log π̂(s^(i), a^(i), θ),   (1.78)

where {(s^(i), a^(i))}_{i=1}^{l} is a sample set from the experience under the policy π̂(·, ·, θ).
So far, a remaining problem is that Q^{π̂(·,·,θ)} and V^{π̂(·,·,θ)} are unknown to us. The
answer would be using the value and Q-value function approximations as described
in Section 1.3.4.1. We summarize the whole process in Algorithm 1.22. This algo-
rithm is the famous actor–critic (AC) algorithm, where actor and critic refer to the
policy DNN and the value (Q-value) DNN, respectively.
It is worth mentioning that the AC algorithm has an extension named the asyn-
chronous advantage AC (A3C) algorithm [125]. The A3C algorithm has better convergence
and has become a standard starting point in many recent works [126].
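A minimal sketch of the actor update used in the AC algorithm, assuming the critic estimates Q(s, a, U) and V(s, W) are already available (here they are random placeholders): the advantage Q − V weights the score function ∇_θ log π̂, as in (1.78).

import torch
import torch.nn as nn

state_dim, n_actions, lr = 4, 3, 1e-3

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, n_actions), nn.LogSoftmax(dim=-1))
optimizer = torch.optim.Adam(actor.parameters(), lr=lr)

# A batch of visited states/actions and critic estimates (placeholders)
states = torch.rand(16, state_dim)
actions = torch.randint(0, n_actions, (16,))
Q_est = torch.rand(16)          # Q(s, a, U) from the critic
V_est = torch.rand(16)          # V(s, W) from the critic
advantage = (Q_est - V_est).detach()

log_pi = actor(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi_hat(a|s, theta)
loss = -(advantage * log_pi).mean()   # ascent on (1.78) expressed as descent on its negative
optimizer.zero_grad()
loss.backward()
optimizer.step()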
DRL is popular in current wireless communications research. For example,
Q-value function approximation has been applied in mobile edge computing [127],
resource allocation [128] and base station control [129]. In addition, [130], [131]
and [132] demonstrate three applications of actor–critic algorithm in quality of service
(QoS) driven scheduling, bandwidth intrusion detection and spectrum management,
respectively.
    θ ← θ + (α_3/l) Σ_{i=1}^{l} ( Q(s^(i), a^(i), U) − V(s^(i), W) ) ∇_θ log π̂(s^(i), a^(i), θ);
7 until convergence;
As shown in Figure 1.22, we have introduced three parts of RL: model-based methods, model-free
methods and DRL.
Model-based methods assume that the MDP model is given as prior information.
Based on the model information and the Bellman equations, this kind of algorithm tries
to learn the (optimal) value function, the (optimal) Q-value function and the optimal
policy. In general, model-based algorithms achieve better performance and faster conver-
gence than model-free algorithms, provided that the given MDP model is accurate.
However, model-based algorithms are rarely used in practice, since MDP models in
real world are usually too complicated to be estimated accurately.
Model-free methods are designed for the case where information of hidden MDP
is unknown. Model-free algorithms can be further divided into two subclasses: MC
methods and TD learning. Based on the law of large numbers, MC methods try to
estimate the value or Q-value function from an appropriate number of samples gener-
ated from experiments. MC methods are unbiased, but they suffer from high variance
in practice since MDP models in the real world are usually so complex that massive
numbers of samples are needed to achieve a stable result. On the other hand, TD learning integrates
the Bellman equations and MC sampling in its algorithm design. By introducing
the Bellman equations, TD learning reduces the estimation variance compared with
MC methods, though its estimation may be biased. TD learning has shown a decent
performance in practice and provides basic ideas for many subsequent RL algorithms.
DRL is proposed to deal with the condition where the number of states is
extremely large or even infinite. DRL applies DNNs to approximate the value
function, the Q-value function and the policy. Among them, the update rules of the value
function and the Q-value function approximations follow (1.72), while the policy network
is updated with the policy gradient in (1.78), as in the actor–critic algorithm.

(Figure 1.22: taxonomy of RL covering model-based methods (policy iteration, value iteration), model-free methods (Monte Carlo estimation, TD learning, Sarsa algorithm, Q-learning) and deep reinforcement learning (value function approximation, Q-value function approximation, actor–critic algorithm).)
1.4 Summary
In this chapter, we have reviewed three main branches of machine learning: super-
vised learning, unsupervised learning and RL. Supervised learning tries to learn a
function that maps an input to an output by referring to a training set. A supervised
learning task is called a classification task or a regression task according to whether
the predicted variable is categorical or continuous. In contrast, unsupervised learning
aims at discovering and exploring the inherent and hidden structures of a data set
without labels. Unsupervised learning has three main functions: clustering, density
estimation and dimension reduction. RL is commonly employed to deal with the opti-
mal decision-making in a dynamic system. By modelling the problem as the MDP, RL
seeks to find an optimal policy. An RL algorithm is called a model-based algorithm or
a model-free algorithm depending on whether the MDP model parameters are required
or not. Furthermore, if an RL algorithm applies DNNs to approximate a function, it is
also called a deep RL method.
There is no doubt that machine learning is achieving increasingly promising
results in wireless communications. However, there are several essential open-
research issues that are noteworthy in the future [59]:
1. In general, supervised models require massive training data to achieve satisfying
performance, especially for deep models. Unfortunately, unlike some popular
research areas such as computer vision and NLP, there is still a lack of high-quality,
large-volume labelled data sets for wireless applications. Moreover, due
to limitations of sensors and network equipment, collected wireless data are
usually subject to loss, redundancy, mislabelling and class imbalance. How
to implement supervised learning with limited, low-quality training data is a
significant and urgent problem in the research of wireless learning.
2. On the other hand, wireless networks generate large amounts of data every
day. However, data labelling is an expensive and time-consuming process. To
facilitate the analysis of raw wireless data, unsupervised learning is increas-
ingly essential in extracting insights from unlabelled data [134]. Furthermore,
recent success in generative models (e.g. variational autoencoder and generative
adversarial networks) greatly boosts the development of unsupervised learning.
Acknowledgement
This work is supported in part by the National Natural Science Foundation of China
(Grant No. 61501022).
References
[1] Samuel AL. Some studies in machine learning using the game of checkers.
IBM Journal of Research and Development. 1959;3(3):210–229.
[2] Tagliaferri L. An Introduction to Machine Learning; 2017. https://www.
digitalocean.com/community/tutorials/an-introduction-to-machine-learning.
[3] Feng Vs, and Chang SY. Determination of wireless networks parameters
through parallel hierarchical support vector machines. IEEE Transactions on
Parallel and Distributed Systems. 2012;23(3):505–512.
[4] Deza E, and Deza MM. Dictionary of Distances. Elsevier; 2006. Available
from: https://www.sciencedirect.com/science/article/pii/B9780444520876
500007.
[5] Everitt BS, Landau S, Leese M, et al. Miscellaneous Clustering Methods.
Hoboken, NJ: John Wiley & Sons, Ltd; 2011.
[6] Kohavi R. A study of cross-validation and bootstrap for accuracy estima-
tion and model selection. In: International Joint Conference on Artificial
Intelligence; 1995. p. 1137–1143.
[7] Samet H. The Design and Analysis of Spatial Data Structures. Boston,
MA: Addison-Wesley; 1990.
[8] Hastie T, Tibshirani R, and Friedman J. The Elements of Statistical Learning:
Data Mining, Inference and Prediction. 2nd ed. Berlin: Springer; 2008.
[9] Friedman JH. Flexible Metric Nearest Neighbor Classification; 1994. Tech-
nical report. Available from: https://statistics.stanford.edu/research/flexible-
metric-nearest- neighbor-classification.
[10] Erdogan SZ, and Bilgin TT. A data mining approach for fall detection by
using k-nearest neighbour algorithm on wireless sensor network data. IET
Communications. 2012;6(18):3281–3287.
[11] Donohoo BK, Ohlsen C, Pasricha S, et al. Context-aware energy enhance-
ments for smart mobile devices. IEEE Transactions on Mobile Computing.
2014;13(8):1720–1732.
[12] Quinlan JR. Induction of decision trees. Machine Learning. 1986;1(1):
81–106. Available from: https://doi.org/10.1007/BF00116251.
[13] Quinlan JR. C4.5: Programs for Machine Learning. San Francisco, CA, USA:
Morgan Kaufmann Publishers Inc.; 1993.
[14] Breiman L, Friedman J, Stone CJ, et al. Classification and Regression
Trees. The Wadsworth and Brooks-Cole statistics-probability series. Taylor &
Francis; 1984. Available from: https://books.google.com/books?id=JwQx-
WOmSyQC.
[15] Geurts P, El Khayat I, and Leduc G. A machine learning approach to improve
congestion control over wireless computer networks. In: International
Conference on Data Mining. IEEE; 2004. p. 383–386.
[16] Nadimi ES, Søgaard HT, and Bak T. ZigBee-based wireless sensor networks
for classifying the behaviour of a herd of animals using classification trees.
Biosystems Engineering. 2008;100(2):167–176.
[17] Coppolino L, D’Antonio S, Garofalo A, et al. Applying data mining tech-
niques to intrusion detection in wireless sensor networks. In: International
Conference on P2P, Parallel, Grid, Cloud and Internet Computing. IEEE;
2013. p. 247–254.
[18] Breiman L. Random forests. Machine Learning. 2001;45(1):5–32. Available
from: https://doi.org/10.1023/A:1010933404324.
[19] Calderoni L, Ferrara M, Franco A, et al. Indoor localization in a hospital envi-
ronment using random forest classifiers. Expert Systems with Applications.
2015;42(1):125–134.
[20] Wang Y, Wu K, and Ni LM. WiFall: Device-free fall detection by wireless
networks. IEEE Transactions on Mobile Computing. 2017;16(2):581–594.
[21] Friedman JH. Greedy function approximation: a gradient boosting machine.
Annals of Statistics. 2001;29:1189–1232.
[22] Freund Y, and Schapire RE. A decision-theoretic generalization of on-line
learning and an application to boosting. Journal of Computer and System
Sciences. 1997;55(1):119–139.
[23] Chen T, and Guestrin C. XGBoost: a scalable tree boosting system. In:
SIGKDD International Conference on Knowledge Discovery and Data
Mining. ACM; 2016. p. 785–794.
[24] Ke G, Meng Q, Finley T, et al. LightGBM: a highly efficient gradient boosting
decision tree. In: Advances in Neural Information Processing Systems; 2017.
p. 3149–3157.
[25] Yu X, Chen H, Zhao W, et al. No-reference QoE prediction model for video
streaming service in 3G networks. In: International Conference on Wire-
less Communications, Networking and Mobile Computing. IEEE; 2012.
p. 1–4.
[26] Sattiraju R, Kochems J, and Schotten HD. Machine learning based obstacle
detection for Automatic Train Pairing. In: International Workshop on Factory
Communication Systems. IEEE; 2017. p. 1–4.
[27] Novikoff AB. On convergence proofs on perceptrons. In: Proceedings of the
Symposium on the Mathematical Theory of Automata. vol. 12. New York,
NY, USA: Polytechnic Institute of Brooklyn; 1962. p. 615–622.
[28] Chi CY, Li WC, and Lin CH. Convex Optimization for Signal Processing and
Communications: From Fundamentals toApplications. Boca Raton, FL: CRC
Press; 2017.
[29] Grant M, Boyd S, and Ye Y. CVX: Matlab Software for Disciplined Convex
Programming; 2008. Available from: http://cvxr.com/cvx.
[30] Platt J. Sequential Minimal Optimization: A Fast Algorithm for Train-
ing Support Vector Machines; 1998. Technical report. Available from:
https://www.microsoft.com/en-us/research/publication/sequential-minimal-
optimization-a-fast-algorithm-for-training-support-vector-machines/.
[31] Cortes C, and Vapnik V. Support-vector networks. Machine Learning. 1995;
20(3):273–297.
[32] Schölkopf B, and Smola AJ. Learning with Kernels: Support Vector
Machines, Regularization, Optimization, and Beyond. Cambridge, MA:
MIT Press; 2002.
[33] Smola AJ, and Schölkopf B. A tutorial on support vector regression. Statistics
and Computing. 2004;14(3):199–222.
[34] Gandetto M, Guainazzo M, and Regazzoni CS. Use of time-frequency
analysis and neural networks for mode identification in a wireless software-
defined radio approach. EURASIP Journal on Applied Signal Processing.
2004;2004:1778–1790.
[35] Kaplantzis S, Shilton A, Mani N, et al. Detecting selective forwarding attacks
in wireless sensor networks using support vector machines. In: International
Conference on Intelligent Sensors, Sensor Networks and Information. IEEE;
2007. p. 335–340.
[36] Huan R, Chen Q, Mao K, et al. A three-dimension localization algorithm for
wireless sensor network nodes based on SVM. In: International Conference
on Green Circuits and Systems. IEEE; 2010. p. 651–654.
[37] Woon I, Tan GW, and Low R. Association for Information Systems. A protec-
tion motivation theory approach to home wireless security. In: International
Conference on Information Systems; 2005. p. 31.
[38] Huang F, Jiang Z, Zhang S, et al. Reliability evaluation of wireless sen-
sor networks using logistic regression. In: International Conference on
Communications and Mobile Computing. vol. 3. IEEE; 2010. p. 334–338.
[39] Salem O, Guerassimov A, Mehaoua A, et al. Sensor fault and patient
anomaly detection and classification in medical wireless sensor networks.
In: International Conference on Communications. IEEE; 2013. p. 4373–4378.
[40] Gulcehre C, Moczulski M, Denil M, et al. Noisy activation functions. In:
International Conference on Machine Learning; 2016. p. 3059–3068.
[41] LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to
document recognition. Proceedings of the IEEE. 1998;86(11):2278–2324.
[76] Ester M, Kriegel HP, Sander J, et al. A density-based algorithm for discov-
ering clusters in large spatial databases with noise. In: SIGKDD Conference
on Knowledge Discovery and Data Mining. vol. 96; 1996. p. 226–231.
[77] Kumar KM, and Reddy ARM. A fast DBSCAN clustering algorithm by
accelerating neighbor searching using Groups method. Pattern Recognition.
2016;58:39–48.
[78] Kriegel HP, Kröger P, Sander J, et al. Density-based clustering. Wiley Inter-
disciplinary Reviews: Data Mining and Knowledge Discovery. 2011;1(3):
231–240.
[79] Zhao F, Luo Hy, and Quan L. A mobile beacon-assisted localization
algorithm based on network-density clustering for wireless sensor networks.
In: International Conference on Mobile Ad-hoc and Sensor Networks. IEEE;
2009. p. 304–310.
[80] Faiçal BS, Costa FG, Pessin G, et al. The use of unmanned aerial vehicles
and wireless sensor networks for spraying pesticides. Journal of Systems
Architecture. 2014;60(4):393–404.
[81] Shamshirband S, Amini A, Anuar NB, et al. D-FICCA: a density-based
fuzzy imperialist competitive clustering algorithm for intrusion detection in
wireless sensor networks. Measurement. 2014;55:212–226.
[82] Wagh S, and Prasad R. Power backup density based clustering algorithm
for maximizing lifetime of wireless sensor networks. In: International
Conference on Wireless Communications, Vehicular Technology, Informa-
tion Theory and Aerospace & Electronic Systems (VITAE). IEEE; 2014.
p. 1–5.
[83] Abid A, Kachouri A, and Mahfoudhi A. Outlier detection for wireless sensor
networks using density-based clustering approach. IET Wireless Sensor
Systems. 2017;7(4):83–90.
[84] Rodriguez A, and Laio A. Clustering by fast search and find of density
peaks. Science. 2014;344(6191):1492–1496.
[85] Botev ZI, Grotowski JF, Kroese DP, et al. Kernel density estimation via
diffusion. The Annals of Statistics. 2010;38(5):2916–2957.
[86] Xie J, Gao H, Xie W, et al. Robust clustering by detecting density peaks and
assigning points based on fuzzy weighted k-nearest neighbors. Information
Sciences. 2016;354:19–40.
[87] Liang Z, and Chen P. Delta-density based clustering with a divide-and-
conquer strategy: 3DC clustering. Pattern Recognition Letters. 2016;73:
52–59.
[88] Wang G, and Song Q. Automatic clustering via outward statistical testing on
density metrics. IEEE Transactions on Knowledge and Data Engineering.
2016;28(8):1971–1985.
[89] Yaohui L, Zhengming M, and Fang Y. Adaptive density peak clustering
based on K-nearest neighbors with aggregating strategy. Knowledge-Based
Systems. 2017;133:208–220.
[90] Geng Ya, Li Q, Zheng R, et al. RECOME: a new density-based cluster-
ing algorithm using relative KNN kernel density. Information Sciences.
2018;436:13–30.
[106] Alsheikh MA, Lin S, Tan HP, et al. Toward a robust sparse data representation
for wireless sensor networks. In: Conference on Local Computer Networks.
IEEE; 2015. p. 117–124.
[107] Zhang W, Liu K, Zhang W, et al. Deep neural networks for wireless local-
ization in indoor and outdoor environments. Neurocomputing. 2016;194:
279–287.
[108] Feng Q, Zhang Y, Li C, et al. Anomaly detection of spectrum in wireless
communication via deep auto-encoders. The Journal of Supercomputing.
2017;73(7):3161–3178.
[109] Li R, Zhao Z, Chen X, Palicot J, and Zhang H. A Transfer Actor-Critic
Learning Framework for Energy Saving in Cellular Radio Access Networks.
IEEE Transactions on Wireless Communications. 2014;13(4):2000–2011.
[110] Puterman ML. Markov Decision Processes: Discrete Stochastic Dynamic
Programming. Hoboken, NJ: John Wiley & Sons; 2014.
[111] Sigaud O, and Buffet O. Markov Decision Processes in Artificial Intelligence.
Hoboken, NJ: John Wiley & Sons; 2013.
[112] Stevens-Navarro E, Lin Y, and Wong VW. An MDP-based vertical handoff
decision algorithm for heterogeneous wireless networks. IEEE Transactions
on Vehicular Technology. 2008;57(2):1243–1254.
[113] Mastronarde N, and van der Schaar M. Fast reinforcement learning for
energy-efficient wireless communication. IEEE Transactions on Signal
Processing. 2011;59(12):6262–6266.
[114] Blasco P, Gunduz D, and Dohler M. A learning theoretic approach to energy
harvesting communication system optimization. IEEE Transactions on
Wireless Communications. 2013;12(4):1872–1882.
[115] Pandana C, and Liu KR. Near-optimal reinforcement learning framework
for energy-aware sensor communications. IEEE Journal on Selected Areas
in Communications. 2005;23(4):788–797.
[116] Berthold U, Fu F, van der Schaar M, et al. Detection of spectral resources
in cognitive radios using reinforcement learning. In: Symposium on New
Frontiers in Dynamic Spectrum Access Networks. IEEE; 2008. p. 1–5.
[117] Lilith N, and Dogançay K. Distributed dynamic call admission control and
channel allocation using SARSA. In: Communications, 2005 Asia-Pacific
Conference on. IEEE; 2005. p. 376–380.
[118] Kazemi R, Vesilo R, Dutkiewicz E, et al. Reinforcement learning in power
control games for internetwork interference mitigation in wireless body area
networks. In: International Symposium on Communications and Information
Technologies. IEEE; 2012. p. 256–262.
[119] Ortiz A, Al-Shatri H, Li X, et al. Reinforcement learning for energy
harvesting point-to-point communications. In: Communications (ICC), 2016
IEEE International Conference on. IEEE; 2016. p. 1–6.
[120] Saleem Y, Yau KLA, Mohamad H, et al. Clustering and reinforcement-
learning-based routing for cognitive radio networks. IEEE Wireless
Communications. 2017;24(4):146–151.
2.1 Introduction
Channel modeling is one of the most important research topics for wireless com-
munications, since the propagation channel determines the performance of any
communication system operating in it. Specifically, channel modeling is a process of
exploring and representing channel features in real environments, which reveals how
radio waves propagate in different scenarios. The fundamental physical propagation
processes of the radio waves, such as reflections, diffractions, are hard to observe
directly, since radio waves typically experience multiple such fundamental interac-
tions on their way from the transmitter to the receiver. In this case, channel modeling
is developed to characterize some effective channel parameters, e.g., delay dispersion
or attenuation, which can provide guidelines for the design and optimization of the
communication system.
Most channel models are based on measurements in representative scenarios.
Data collected during such measurement campaigns usually are the impulse response
or transfer function for specific transmit and receive antenna configurations. With
the emergence of multiple-input–multiple-output (MIMO) systems, directional char-
acteristics of the channels can be extracted as well. In particular for such MIMO
measurements, high-resolution parameter estimation (HRPE) techniques can be
applied to obtain high-accuracy characteristics of the multipath components (MPCs).
Examples for HRPE include space-alternating generalized expectation-maximization
(SAGE) [1], clean [2], or joint maximum likelihood estimation (RiMAX) [3].
1. School of Computer and Information Technology, Beijing Jiaotong University, China
2. State Key Lab of Rail Traffic Control and Safety, Beijing Jiaotong University, China
3. Department of Electrical Engineering, University of Southern California, USA
In this chapter, we introduce recent progress in the above applications of machine
learning to channel modeling. The results in this chapter can provide references for
other channel-modeling work based on real-world measurement data.
In the following, we investigate in more detail the application of the SVM to distinguish
LOS/NLOS scenarios based on the channel properties.
The main goal of the algorithm described in the following is to use the machine-
learning tool, i.e., the SVM, to learn the internal features of the LOS/NLOS
parameters, which can be obtained by using parameter estimation algorithms, e.g.,
beamformers, and build an automatic classifier based on the extracted features. Con-
sequently, there are two main steps of the proposed algorithm: (i) develop the input
vector for the SVM method and (ii) adjust the parameters of the SVM method to
achieve a better accuracy of classification.
Figure 2.1 Power angle spectrum of (a) LOS and (b) NLOS scenarios, which are
estimated by using Bartlett beamformer
Figure 2.2 Histograms of the power distribution of the LOS and NLOS scenarios,
respectively
The design of the input vector is crucial for the performance of the SVM. In the described algorithm, the SVM is used
to learn the difference between the LOS and NLOS from the classified data (training
data) and distinguish the LOS and NLOS condition of the unclassified data (test data).
In this case, an input vector that is able to most clearly present the physical features
of the LOS/NLOS data can achieve the best classification accuracy.
In order to design an appropriate input vector, we first consider the main differ-
ence of physical features between the MPCs in the LOS and NLOS scenarios. First,
the average power is usually different, where the LOS scenario usually has higher
power. Second, the power distribution is another noteworthy difference between the
LOS and NLOS scenarios. Since the LOS path is blocked in the NLOS scenario, the
impact of MPCs undergoing reflections, scatterings, and diffusions is more signifi-
cant in the NLOS case. In other words, even if all such indirect MPCs are exactly
the same in LOS and NLOS scenarios, the existence of the direct MPC changes the
power distribution.
From the above, it follows that the histogram of the power is a characteristic that
can be used to distinguish the LOS/NLOS scenarios, where the abscissa represents
different power intervals, and the ordinate represents how many elements in the PAS
distribute in the different power intervals. Furthermore, to simplify the feature vector,
the number of power intervals is set at 100, with a uniform distribution in the range
of the power of the PAS, as shown in Figure 2.2. In this case, the histogram of the
power is considered as the input vector X, which can be expressed as
X = {x1 , x2 , . . . , x100 } (2.1)
By using the RBF kernel function, the training data are projected to a higher dimen-
sion, in which the difference between the LOS and NLOS data can be observed more
easily.
In this case, the histogram of the power in each PAS is considered as the feature
vector for the input of the SVM to distinguish the LOS and NLOS scenarios. Based
on our experiments, the described solution achieves nearly 94% accuracy on the
classification.
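A minimal sketch of this classifier with scikit-learn, under the assumption that the PAS of each measurement is already available (random matrices and synthetic labels stand in for real beamformer outputs): each PAS is reduced to the 100-bin power histogram of (2.1) and fed to an RBF-kernel SVM.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def pas_to_histogram(pas_db, n_bins=100):
    """Feature vector X = {x_1, ..., x_100}: histogram of the PAS power values."""
    hist, _ = np.histogram(pas_db.ravel(), bins=n_bins,
                           range=(pas_db.min(), pas_db.max()))
    return hist

# Synthetic stand-ins for measured power angle spectra (elevation x azimuth, in dB)
rng = np.random.default_rng(0)
los_pas = [rng.normal(-60, 3, (70, 180)) for _ in range(50)]
nlos_pas = [rng.normal(-70, 2, (70, 180)) for _ in range(50)]

X = np.array([pas_to_histogram(p) for p in los_pas + nlos_pas])
y = np.array([1] * len(los_pas) + [0] * len(nlos_pas))   # 1 = LOS, 0 = NLOS

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))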
In addition, the angle (azimuth/elevation) distribution of the power is also gener-
ally considered to be different in the LOS and NLOS scenarios. Since there is no LOS
component in the NLOS scenario, the received power is concentrated more in reflections
and scattering from the environment, which leads to a lower average power and a smaller
power spread in the histograms. Therefore, utilizing the angle distribution in the feature
vector may also increase the classification accuracy of the solution.
2.3.1.1 Clustering
Figure 2.3(a)–(d) shows the four stages in the iteration of clustering. The dots and
blocks in (a) represent the input MPCs and the initialized cluster-centroids, respectively,
whereas the different colors of the dots in (b)–(d) represent different categories
of clusters. The KPowerMeans algorithm requires the number of clusters as prior
information, e.g., the blue and red blocks in Figure 2.3(a) and then clusters the MPCs
preliminarily to the closest cluster-centroid, as shown in Figure 2.3(b). To accurately
measure the similarity between MPCs/clusters, the multipath component distance (MCD) is used
to measure the distance between MPCs and cluster-centroids, where the angle of
arrival (AoA), angle of departure (AoD) and delay of the MPCs/cluster-centroids are
considered. The MCD between the ith MPC and the jth MPC can be obtained as
MCD_ij = sqrt( ‖MCD_AoA,ij‖² + ‖MCD_AoD,ij‖² + MCD²_τ,ij ),                     (2.3)

where

MCD_AoA/AoD,ij = (1/2) ‖ [sin(θ_i)cos(φ_i), sin(θ_i)sin(φ_i), cos(θ_i)]^T − [sin(θ_j)cos(φ_j), sin(θ_j)sin(φ_j), cos(θ_j)]^T ‖,   (2.4)

MCD_τ,ij = ζ · (|τ_i − τ_j| / τ_max) · (τ_std / τ_max),                          (2.5)

with τ_max = max_{i,j}{|τ_i − τ_j|} and ζ an opportune delay scaling factor; various
ways to select this scaling factor have been described in the literature. After the
MPCs are clustered preliminarily, the cluster-centroids are recomputed, as shown in
Figure 2.3(c). Then, the cluster members and the cluster-centroids are alternately
recomputed in each iteration, until the data converge to stable clusters or a preset
maximum running time is reached.
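For illustration, a short Python sketch of the MCD of (2.3)–(2.5) between two MPCs; the MPC tuples, the delay statistics and the scaling factor ζ are made-up inputs.

import numpy as np

def angle_to_unit_vector(theta, phi):
    """Map (elevation theta, azimuth phi) in radians to a unit vector, as in (2.4)."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

def mcd(mpc_i, mpc_j, tau_std, tau_max, zeta=8.0):
    """MCD of (2.3); each MPC is (theta_AoA, phi_AoA, theta_AoD, phi_AoD, tau)."""
    d_aoa = 0.5 * np.linalg.norm(angle_to_unit_vector(*mpc_i[0:2])
                                 - angle_to_unit_vector(*mpc_j[0:2]))      # (2.4)
    d_aod = 0.5 * np.linalg.norm(angle_to_unit_vector(*mpc_i[2:4])
                                 - angle_to_unit_vector(*mpc_j[2:4]))      # (2.4)
    d_tau = zeta * abs(mpc_i[4] - mpc_j[4]) / tau_max * tau_std / tau_max  # (2.5)
    return np.sqrt(d_aoa ** 2 + d_aod ** 2 + d_tau ** 2)                   # (2.3)

# Two illustrative MPCs (angles in radians, delays in ns)
mpc_a = (0.8, 1.2, 0.5, -0.3, 27.0)
mpc_b = (0.9, 1.0, 0.6, -0.2, 29.0)
print(mcd(mpc_a, mpc_b, tau_std=1.0, tau_max=2.0))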
2.3.1.2 Validation
To avoid the impact of an indeterminate number of the clusters, [13] develops the
CombinedValidate method based on the combination of the Calinski–Harabasz (CH)
index and the Davies–Bouldin criterion (DB). The basic idea of CombinedValidate is
to restrict valid choices of the optimum number of clusters by a threshold set in the
DB index. Subsequently, the CH index is used to decide on the optimum number out
of the restricted set of possibilities.
Figure 2.4 The unclustered MIMO measurement data in LOS scenario from [13],
where the power of MPCs is color coded
Figure 2.5 The result of clustering from [13], where the weak MPCs are removed
From the visual inspection, the KPowerMeans can well identify the clusters closest to each
other, e.g., the clusters in red, yellow, blue, and green in Figure 2.4.
2.3.1.4 Development
It is noteworthy that the initial parameters, e.g., cluster number and position of ini-
tial cluster-centroid, have a great impact on the performance of KPowerMeans. In
KPowerMeans, the validation method is applied to select the best estimation of the
number of the clusters; thus the performance of the validation method also affects
the performance and efficiency of clustering. In [23], a performance assessment of
several cluster validation methods is presented. There it was found that the Xie–Beni
index and the generalized Dunn's index reach the best performance, although the result
also shows that none of the indices is able to always predict correctly the desired
number of clusters. Moreover, to improve the efficiency of clustering, [24] devel-
ops the KPowerMeans by using the MPCs that have the highest power as the initial
cluster-centroids. On the other hand, the study in [25] claims that as a hard partition
approach, KMeans is not the best choice for clustering the MPCs, considering that
some MPCs are located near the middle of more than one cluster, and thus cannot be
directly associated with a single cluster. Therefore, instead of using hard decisions as
the KPowerMeans, a Fuzzy-c-means-based MPC clustering algorithm is described
in [25], where soft information regarding the association of multipaths to a centroid
is considered. As a result, the Fuzzy-c-means-based MPC clustering algorithm
performs robust and automatic clustering.
where A1 and A2 denote the intercluster and intra-cluster power decay, respectively;
α²_{0,0} denotes the average power of the first MPC in the first cluster, and Γ and γ are
the cluster and MPC power decay constants, respectively.
Then, the measured power delay profile (PDP) vector P is considered as the
given signal, and the convex optimization is used to recover an original signal vec-
tor P̂, which is assumed to have the formulation (2.6). Furthermore, re-weighted l1
minimization [28], which employed the weighted norm and iterations, is performed
to enhance the sparsity of the solution.
Finally, based on the enhanced sparsity of P̂, clusters are identified from the curve
of P̂. Generally, each cluster appears as a sharp onset followed by a linear decay, in
the curve of the P̂ on a dB-scale. Hence, the clusters can be identified based on this
feature, which can be formulated as the following optimization problem:
where ‖·‖_x denotes the l_x-norm operation, and the l_0-norm operation returns the
number of nonzero coefficients, λ is a regularization parameter, and Δ_1 is the finite-
difference operator, which can be expressed as
Δ_1 = [ Δτ/|τ_1 − τ_2|   −Δτ/|τ_1 − τ_2|          0                · · ·            0
              0          Δτ/|τ_2 − τ_3|   −Δτ/|τ_2 − τ_3|          · · ·            0
              ⋮                 ⋱                  ⋱                  ⋱              ⋮
              0               · · ·        Δτ/|τ_{N−1} − τ_N|   −Δτ/|τ_{N−1} − τ_N| ]_{(N−1)×N}   (2.8)
where N is the dimension of P and P̂, and Δτ is the minimum resolvable delay difference
of the data. Δ_2 is used to obtain the turning points at which the slope changes significantly
and can be expressed as
Δ_2 = [ 1  −1   0   · · ·   0
        0   1  −1   · · ·   0
        ⋮        ⋱    ⋱     ⋮
        0   0   · · ·   1  −1 ]_{(N−2)×(N−1)}.                                  (2.9)
Note that the l_0-norm term λ_2‖Δ_1 P̂‖_0 in (2.7) is used to ensure that the recovered P̂ conforms
with the anticipated behavior of A2 in (2.6). In this case, even a small number of
clusters can be well identified by using the described algorithm. Moreover, [26] also
incorporates the anticipated behavior of A1 in (2.6) into P̂ by using a clustering-
enhancement approach.
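For illustration, a short numpy sketch (assuming a given delay grid and minimum resolvable delay Δτ) that builds the finite-difference operators Δ1 of (2.8) and Δ2 of (2.9):

import numpy as np

def delta1(tau, d_tau):
    """(N-1) x N operator of (2.8): scaled first-order difference along the delay axis."""
    N = len(tau)
    D1 = np.zeros((N - 1, N))
    for i in range(N - 1):
        w = d_tau / abs(tau[i] - tau[i + 1])
        D1[i, i], D1[i, i + 1] = w, -w
    return D1

def delta2(N):
    """(N-2) x (N-1) operator of (2.9): plain first-order difference."""
    D2 = np.zeros((N - 2, N - 1))
    for i in range(N - 2):
        D2[i, i], D2[i, i + 1] = 1.0, -1.0
    return D2

tau = np.arange(0, 300, 5.0)          # illustrative delay grid in ns
D1 = delta1(tau, d_tau=5.0)
D2 = delta2(len(tau))
print(D1.shape, D2.shape)             # (N-1, N) and (N-2, N-1)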
The details of the implementation of the sparsity-based clustering algorithm can
be found in [26]. To evaluate the performance, Figure 2.6(a) gives the cluster identifi-
cation result by using the sparsity-based algorithm, while (b) and (c) give the results
by using KMeans and KPowerMeans approaches, respectively. It can be seen that
the clusters identified by the sparsity-based algorithm show more distinct features,
where each cluster begins with a sharp power peak and ends with a low power valley
before the next cluster. This feature well conforms to the assumption of the cluster
in the SV model. On the other hand, as shown in Figure 2.6(b) and (c), the KMeans
and KPowerMeans tend to group the tail of one cluster into the next cluster, which
may lead to the parameterized intra-cluster PDP model having a larger delay spread.
More details and further analysis can be found in [26].
Figure 2.6 Example plots of PDP clustering in [26], where (a) gives the cluster
identification result by using the sparsity-based algorithm, (b) and (c)
give the results by using KMeans and KPowerMeans approaches,
respectively. Different clusters are identified by using different colors,
where the magenta lines represent the least squared regression of
PDPs within clusters
1. The KPD-based algorithm identifies clusters by using the Kernel density;
therefore, the density needs to be calculated first. For each MPC
x, the density ρ over the K nearest MPCs can be obtained as follows:
ρ_x = Σ_{y∈K_x} exp(α_y) × exp( −|τ_x − τ_y|² / (σ_τ)² ) × exp( −|Ω_{T,x} − Ω_{T,y}| / σ_{Ω_T} ) × exp( −|Ω_{R,x} − Ω_{R,y}| / σ_{Ω_R} ),   (2.10)
where y is an arbitrary MPC (y ≠ x), K_x is the set of the K nearest MPCs for the
MPC x, and σ_(·) is the standard deviation of the MPCs in the domain of (·). Specif-
ically, past studies have modeled with good accuracy the intra-cluster power
angle distribution as a Laplacian distribution [31]; therefore, the Laplacian Kernel
density is also used for the angular domain in (2.10).
Figure 2.7 Illustration of KPD clustering using the measured MPCs: Part (a)
shows the measured MPCs, where the color bar indicates the power of
an MPC. Part (b) plots the estimated density ρ, where the color bar
indicates the level of ρ. Part (c) plots the estimated density ρ ∗ , where
the color bar indicates the level of ρ ∗ . The eight solid black points
are the key MPCs with ρ ∗ = 1. Part (d) shows the clustering results
by using the KPD algorithm, where the clusters are plotted with
different colors
2. In the next step, the relative density ρ ∗ also needs to be calculated based on the
obtained density ρx , which can be expressed as
ρ*_x = ρ_x / max_{y∈K_x∪{x}} {ρ_y}.                                             (2.11)
Figure 2.7 shows an example plot of the relative density ρ ∗ . Specifically, the
relative density ρ ∗ in (2.11) can be used to identify the clusters with relatively
weak power.
3. Next, the key MPCs need to be obtained. An MPC x is labeled as a key MPC
if ρ*_x = 1, i.e., the set of key MPCs is

{x | x ∈ Ω, ρ*_x = 1},                                                          (2.12)

where Ω denotes the set of all MPCs.
In the described algorithm, the obtained key MPCs are selected as the initial
cluster-centroids. Figure 2.7(c) gives an example of the key MPCs, which are
plotted as solid black points.
4. The main goal of the KPD algorithm is to cluster MPCs based on the Kernel den-
sity, therefore, for each non-key MPC x, we define its high-density-neighboring
MPC x̃ as
x̃ = arg min_{y∈Ω, ρ_y > ρ_x} d(x, y),                                          (2.13)
where d represents the Euclidean distance. Then, the MPCs are connected based
on their own high-density-neighboring x̃ and the connection is defined as
px = {x → x̃} (2.14)
and thus a connection map ζ_1 can be obtained as follows:

ζ_1 = {p_x | x ∈ Ω}.                                                            (2.15)
In this case, the MPCs that are connected to the same key MPC in ζ1 are grouped
as one cluster.
5. For each MPC, the connection between itself and its K nearest MPCs can be
expressed as follows:
qx = {x → y, y ∈ Kx } (2.16)
where another connectedness map ζ2 can be obtained, as follows:
ζ_2 = {q_x | x ∈ Ω}.                                                            (2.17)
In this case, two key MPCs' clusters will be merged into a new cluster if the following
criteria are met:
● The two key MPCs are included in ζ2
● Any MPC belonging to the two key MPCs’ clusters has ρ ∗ > χ
where χ is a density threshold. As shown in Figure 2.7(c), clusters 2 and 3, 6 and 7
meet the conditions and are merged into new clusters, respectively.
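A compact sketch of steps 1–4 of the KPD procedure above, not the reference implementation of [29]: it evaluates the kernel density (2.10) over the K nearest MPCs, the relative density (2.11) and the key-MPC set (2.12), and connects each remaining MPC to its nearest higher-density neighbour as in (2.13); all MPC parameters are random placeholders.

import numpy as np

rng = np.random.default_rng(1)
n, K = 200, 20
tau = rng.uniform(0, 300, n)                  # delays (ns)
omega_t = rng.uniform(-180, 180, n)           # departure angles (deg)
omega_r = rng.uniform(-180, 180, n)           # arrival angles (deg)
alpha = rng.uniform(-30, -10, n)              # MPC power weights

# Euclidean distance used only to pick the K nearest MPCs
feat = np.stack([tau, omega_t, omega_r], axis=1)
d = np.linalg.norm(feat[:, None, :] - feat[None, :, :], axis=2)
knn = np.argsort(d, axis=1)[:, 1:K + 1]

rho = np.zeros(n)
for x in range(n):                            # kernel density (2.10)
    y = knn[x]
    rho[x] = np.sum(np.exp(alpha[y])
                    * np.exp(-(tau[x] - tau[y]) ** 2 / np.var(tau))
                    * np.exp(-np.abs(omega_t[x] - omega_t[y]) / np.std(omega_t))
                    * np.exp(-np.abs(omega_r[x] - omega_r[y]) / np.std(omega_r)))

rho_rel = np.array([rho[x] / max(rho[x], rho[knn[x]].max()) for x in range(n)])  # (2.11)
key_mpcs = np.where(rho_rel == 1.0)[0]        # (2.12)

# (2.13)-(2.14): connect each non-key MPC to its nearest MPC of higher density
parent = np.arange(n)
for x in range(n):
    if x in key_mpcs:
        continue
    higher = np.where(rho > rho[x])[0]
    if higher.size:
        parent[x] = int(higher[np.argmin(d[x, higher])])
print(len(key_mpcs), "key MPCs found")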
To validate the performance of the clustering result, the F-measure is used in [29],
where the precision and recall of each cluster are considered. It is noteworthy that the
validation by using the F-measure requires the ground truth of the cluster members. Gen-
erally, the ground truth is unavailable in measured channels; hence, the F-measure
can only be applied to the clustering results of simulated channels, for which the
(clustered) MPC generation mechanism, and thus the ground truth, is known. The
3GPP 3D MIMO channel model is used to simulate the channels in [29], and 300
random channels are simulated to validate the performance of the KPD-based algo-
rithm, where the conventional KPowerMeans [13] and DBSCAN [32] are shown as
comparisons. Figure 2.8 depicts the impact of the cluster number on the F-measure,
where the described algorithm shows better performance than the others, especially
in the scenarios containing more clusters, and the clustering performances of all three
reduce with the increasing number of clusters.
(Plot: F-measure versus cluster number for the KPD, KPM and DBSCAN algorithms.)

(Plot: clustering performance versus cluster number for the KPD, KPM and DBSCAN algorithms, with the silhouette coefficient on the vertical axis.)
Figure 2.9 shows the impact of the cluster angular spread on the F-measure of the
three algorithms. It is found that the F-measure generally decreases with the increasing
cluster angular spread, where the KPD-based algorithm shows the best performance
among the three candidates. Further validation and analysis can be found in [29].
2.3.4.1 TC clustering
In [33], the TCs are defined as a group of MPCs that have similar runtime and sepa-
rated from other MPCs by a minimum interval, but which may arrive from different
directions. Specifically, the minimum intercluster void interval is set to 25 ns. In other
words, the MPCs whose inter-arrival time is less than 25 ns are considered as one
TC; otherwise, they are considered as different TCs. Besides, the propagation phases
of each MPC are assumed to be uniformly distributed between 0 and 2π. The choice of different intercluster
voids results in different numbers of clusters in the delay domain; to be physically
meaningful, this parameter needs to be adapted to the environment of observation.
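A minimal sketch of this delay-domain grouping: MPCs sorted by delay are split into TCs wherever the inter-arrival time reaches the 25 ns void interval; the delays below are made-up values.

import numpy as np

def temporal_clusters(delays_ns, void_ns=25.0):
    """Group MPCs into temporal clusters whose inter-arrival time stays below void_ns."""
    delays = np.asarray(delays_ns, dtype=float)
    order = np.argsort(delays)
    clusters, current = [], [int(order[0])]
    for k in range(1, len(order)):
        if delays[order[k]] - delays[order[k - 1]] >= void_ns:
            clusters.append(current)          # gap at least the void interval: start a new TC
            current = []
        current.append(int(order[k]))
    clusters.append(current)
    return clusters

delays = [10, 18, 22, 60, 71, 140, 151, 155]      # illustrative MPC delays in ns
print(temporal_clusters(delays))                   # [[0, 1, 2], [3, 4], [5, 6, 7]]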
2.3.4.2 SL clustering
Meanwhile, SLs are defined by the main directions of arrival/departure of the signal.
Since the TCSL is based on measurements without HRPE of the MPCs, the angular
width of an SL is determined by the beamwidth of the antenna (horn or lens or phased
array) and measured over several hundred nanoseconds. A −10 dB power threshold
with respect to the maximum received angle power is set in [33] to obtain the SLs
(again, different thresholds might lead to different clusterings).
By applying the TCs and SLs, the MPCs in the time-space domain are decoupled
into temporal and spatial statistics. Since the SLs and TCs are obtained individually,
it is possible that a TC contains MPCs which belong to different SLs. On the contrary,
an SL may contain many MPCs which belong to different TCs. These cases have
been observed in real-world measurements [35–37], where the MPCs in the same TC
may be observed in different SLs, or the MPCs in the same SL may be observed in
different TCs.
The TCSL-clustering approach has low complexity, and some of its parameters
can be related to the physical propagation environment [33]. However, it requires
some prior parameters, such as the threshold to obtain the SLs, the delays and power
levels of TC.
Since the number of the power levels is limited, αT ∗ can be easily found by a sequential
search.
Nevertheless, the signal to interference plus noise ratio of the PAS has a strong
impact on the performance of the target recognition. In an LOS scenario, the clusters
are generally easily discernible, with strong power and low background noise, i.e.,
the targets can be easily recognized and detected. However, in many NLOS scenar-
ios, the power distribution of the PAS is more complicated with high background
noise, and many small clusters caused by reflections and scatterings interfere with
the recognition process. In this case, the targets that contain many small clusters are
difficult to be separated. To avoid this effect, an observation window αW is set so that
only the elements having the power level between [αL − αW , . . . , αL ] are processed
in the target recognition approach. In this case, the best selection threshold αT ∗ is
obtained by

α*_T = arg max{ δ²(α_T) | α_L − α_W ≤ α_T < α_L }.                              (2.20)
By using the observation window, the recognition process can focus on the ele-
ments with stronger power compared to the noise background. Moreover, a heuristic
sequential search is used to select an appropriate observation window size αW as fol-
lows. Parameter αW is initialized to 0.1αL at the beginning of the searching process
and keeps increasing until the following constraints are no longer satisfied:
● Size of recognized targets: S < Smax
● Power gap of each single target: A < Amax
where S is the size of the recognized targets indicating how many elements the target
consists of and Smax is the upper limit of size. Specifically, to avoid the interference
caused by the small and fragmental targets, the lower limit of the size is also consid-
ered: only a target bigger than Smin is counted, whereas the target smaller than Smin is
considered as noise rather than clusters. Parameter A is the gap between the highest
power and the mean power of each target. In each iteration, S and A are updated
based on the recognized target objects by using the new αT ∗ from (2.20), until the
above constraints are no longer satisfied.
Examples for the clustering results in an LOS and NLOS scenarios are given in
Figure 2.11(a) and (b), respectively. In the experiments in [39], the PASCT algorithm
is able to well recognize the clusters in time-varying channels without using any
high-resolution estimation algorithm.
Figure 2.11 Cluster recognition results of the (a) LOS scenario and (b) NLOS
scenario, respectively, in [39]
To accurately measure the distance between MPCs, the BMCD (balanced multipath
component distance) [41] is used here. The main difference between the BMCD and
MCD is that the BMCD introduces additional normalization factors for the angular
domains. The normalization factors are calculated as
δ_AoD/AoA = 2 · std_j( d_MCD,AoD/AoA(x_j, x̄) ) / max²_j( d_MCD,AoD/AoA(x_j, x̄) ),   (2.21)
where stdj is the standard deviation of the MCD between all MPC positions xj and
the center of data space x̄, and maxj is the corresponding maximum.
The concrete steps of the improved subtraction are expressed as follows:
1. Calculate the normalized parameter β:
β = N / Σ_{j=1}^{N} d_MPC(x_j, x̄),                                              (2.22)
where N is the total number of MPCs and dMPC (xj , x̄) is the BMCD between xj
and x̄.
where mT · β scales the actual influence of neighboring MPCs and its inverse is
called neighborhood radius. For measurement data, it is more practical to find
the appropriate radii for DoA, DoD, and delay dimension separately. Hence, both
m and d vectors contain three components:
d_MPC(x_i, x_j) = [ d_MPC,DoA(x_i, x_j), d_MPC,DoD(x_i, x_j), d_MPC,delay(x_i, x_j) ]^T.   (2.24)
3. The points xk with the highest density value are selected as the new cluster-
centroids if their density value is above a certain threshold. Stop the iteration if
all density values are lower than the threshold.
4. Subtract the new centroid from the data by updating the density values:
P_i^m = P_i^m − P_k^m · exp(−η · m^T · β · d_MPC(x_i, x_k)),                     (2.25)
where η ∈ (0, 1] is a weight parameter for the density subtraction. Return to
step 3.
Then, the number and position of the initial cluster-centroids can be determined,
and the KPowerMeans can be initialized with these values.
Specifically, to find a proper neighborhood radius, the correlation self-
comparison method [41] is used. The detailed steps are
1. Calculate the set of density values for all MPCs P ml for an increasing ml , where
ml ∈ {1, 5, 10, 15, . . .}, and the other components in m are set to be 1.
2. Calculate the correlation between P ml and P ml+1 . If the correlation increases
above a preset threshold, ml here is selected as the value for m in this dimension.
3. Compare the current cluster number C_N and the maximum cluster number C_N,max;
if C_N < C_N,max, then, for each recent cluster C_k, calculate the BMCDs between
all MPCs x_i in the current cluster and the reference points r_n according to
where α is a weight parameter. Consequently, for each cluster, only the MPCs which
have a BMCD significantly larger than the others in the same cluster are considered
to be separated. The separation is stopped if all MPCs in the clusters are below the
threshold.
Figure 2.12 compares the accuracy of detecting the cluster number by using the
improved subtraction algorithm in [41] and the MR-DMS in [42]. In the validation,
over 500 drops of the WINNER channel model scenario “urban macro cell” (C2)
are tested. In addition, two different scenarios are used where the cluster angular
spread of arrival (ASA) is varied (ASA ={6◦ , 15◦ }). From the results, the MR-DMS
achieves better performance in detecting the correct cluster number than the improved
subtraction algorithm.
Moreover, Figure 2.13 gives azimuth of AoA/AoD and delay domain clustering
results based on a MIMO measurement campaign in Bonn, where Figure 2.13(a) and
(b) is obtained by using the improved subtraction algorithm together with a run of
KPowerMeans and the MR-DMS algorithm, respectively. Details of the measurement
campaign can be found in [44], and the MPCs are extracted by using the RiMAX
algorithm.
Figure 2.12 Probability of correctly detecting the number of clusters by using the
improved subtraction algorithm in [41] and the MR-DMS in [42], vs.
cluster angular spread of arrival (ASA), where ASA = {6◦ , 15◦ }
Figure 2.13 Azimuth of AoA/AoD (ϕa, ϕd in degrees) and delay (τ in μs) domain clustering results, where (a) is obtained with the improved subtraction algorithm followed by KPowerMeans and (b) with the MR-DMS algorithm
Figure 2.14 MCD-based tracking algorithm: (a) gives the principle of the tracking
process, whereas (b) and (c) show the cases of split and mergence,
respectively
Dx,y ≤ ε (2.30)
Figure 2.15 Framework of the Kalman filter, where xc(n) are tracked objects in the
input data (X(n) , P(n) )
Figure 2.16 Tracking result in [15], shown over the AoA (rad) and AoD (rad) domains
Figure 2.16 gives the tracking result in [15]. For a single target, the Kalman
filter-based tracking algorithm can achieve high tracking accuracy. To track multiple
targets, the Kalman filter can be replaced by a particle filter [49].
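The Python sketch below illustrates the general idea of Kalman-filter-based MPC tracking with a generic constant-velocity state model over [AoA, AoD, delay]; it is not the specific filter of Figure 2.15 or [15], and the noise variances q and r are hypothetical.

import numpy as np

def kalman_track(observations, q=1e-4, r=1e-2):
    """Minimal constant-velocity Kalman tracker for one MPC trajectory.

    observations : (T, 3) array of per-snapshot [AoA, AoD, delay] estimates.
    q, r         : assumed state- and observation-noise variances.
    """
    T, dim = observations.shape
    # State = [parameters, parameter rates]; constant-velocity transition model.
    F = np.block([[np.eye(dim), np.eye(dim)],
                  [np.zeros((dim, dim)), np.eye(dim)]])
    H = np.hstack([np.eye(dim), np.zeros((dim, dim))])   # only the parameters are observed
    Q = q * np.eye(2 * dim)
    R = r * np.eye(dim)
    x = np.concatenate([observations[0], np.zeros(dim)])  # initialize from the first snapshot
    P = np.eye(2 * dim)
    track = [observations[0]]
    for z in observations[1:]:
        # Prediction step.
        x = F @ x
        P = F @ P @ F.T + Q
        # Update step with the new snapshot measurement z.
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (z - H @ x)
        P = (np.eye(2 * dim) - K @ H) @ P
        track.append(H @ x)
    return np.array(track)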
matrix. Specifically, all the parameters in the state model are assumed to be uncorrelated
with each other; in other words, each parameter evolves independently in time.
The state noise is additive real white Gaussian, the observation noise is circular
complex white Gaussian, and the two are assumed to be uncorrelated with each other
and with the state. For each path, the covariance matrix of the state noise is represented by
Q_{θ,p} = diag{σ²_{μ(τ)}, σ²_{μ(ϕ)}, σ²_{γ,Re}, σ²_{γ,Im}}, whereas the covariance matrix of the observation
noise is denoted by R_y.
Considering that the estimated parameters are real, the EKF equations can be
expressed as
where ℜ(·) and ℑ(·) denote the real and imaginary parts of their argument, respectively, P(k|k)
is the estimated error covariance matrix, J(θ̂, R_d) = ℜ{D^H_{(k)} R_y^{-1} D_{(k)}}, and D is the
Jacobian matrix. For P paths containing L parameters, D can be expressed as
D(\theta) = \frac{\partial}{\partial \theta^T} s(\theta) = \left[ \frac{\partial}{\partial \theta_1^T} s(\theta) \;\; \cdots \;\; \frac{\partial}{\partial \theta_{LP}^T} s(\theta) \right].   (2.41)
Apparently, the initialization values of the parameters of the state transition matrix
and the covariance matrix of the state noise Qθ are crucial to the performance of the
subsequent tracking/estimation with the EKF. Therefore, it is suggested in [51] to employ
another HRPE algorithm, e.g., SAGE [1] or RiMAX [3], for this purpose.
snapshots, i.e., l_{Ax,By} is the moving path from Ax to By between Si and Si+1, as shown
in Figure 2.17(a). In the probability-based tracking algorithm, each moving path l_{Ax,By}
is weighted by a moving probability P(Ax, By), as shown in Figure 2.17(b).
In the probability-based tracking algorithm, the moving paths are identified
by maximizing the total probabilities of all selected moving paths, which can be
expressed as
L^{*} = \arg\max_{L \subset \mathbb{L}} \sum_{(A_x, B_y) \in L} P(A_x, B_y)   (2.42)
where L is the selected set of moving paths and \mathbb{L} is the set of all moving
paths. Then, the moving probability P(Ax, By) is obtained by using the normalized
Figure 2.17 Illustration of the moving paths between two consecutive snapshots
in [16], where (a) is delay and azimuth domain and (b) is bipartite
graph domain
Euclidean distance DAx ,By of the vector of parameters [φ D , φA , τ , α], which can be
expressed as
P(A_x, B_y) = \begin{cases} 1, & D_{A_x,B_y} = 0, \\ 0, & D_{A_x,B_z} = 0,\ y \neq z, \\ \dfrac{1/D_{A_x,B_y}}{\sum_{z=1}^{M} D_{A_x,B_z}^{-1}}, & \text{otherwise.} \end{cases}   (2.43)
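A minimal Python sketch of the probability computation in (2.43) is given below; the path selection of (2.42) is approximated here by a one-to-one bipartite matching (Hungarian algorithm), which is an assumption not stated in [16], and the function names are illustrative.

import numpy as np
from scipy.optimize import linear_sum_assignment

def moving_probabilities(D):
    """Moving probabilities P(Ax, By) from pairwise distances, cf. (2.43).

    D : (M_old, M_new) matrix of normalized Euclidean distances between the
        MPC parameter vectors [phi_D, phi_A, tau, alpha] of two snapshots.
    """
    P = np.zeros_like(D, dtype=float)
    for x in range(D.shape[0]):
        zero = np.isclose(D[x], 0.0)
        if zero.any():
            P[x, zero] = 1.0          # exact match: probability one, all others zero
        else:
            inv = 1.0 / D[x]
            P[x] = inv / inv.sum()    # otherwise normalize the inverse distances
    return P

def match_paths(D):
    """Select moving paths that maximize the summed probabilities, cf. (2.42)."""
    P = moving_probabilities(D)
    rows, cols = linear_sum_assignment(-P)   # maximize the total probability
    return list(zip(rows, cols)), P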
[Figure: measured data compared with the Hetrakul–Taylor model (MSE = 1.60×10−3), the Saleh model (MSE = 5.20×10−4), and the NN model (MSE = 2.63×10−4) as functions of the input amplitude]
In addition, there are some other novel frameworks for neural networks which
can avoid the vanishing gradient, e.g., the restricted Boltzmann machines framework
and the deep-learning method described in [53]. Hence, there are some propagation
channel modeling studies, e.g., [54], where the amplitude frequency response of
11 paths is modeled by an MLP neural network.
adopted as the basis function. At the end of the network, the output layer receives
the outputs of the hidden layer, which are combined by linear weighting. The
mapping function between the input and output layers can be expressed as
y = f(x) = \sum_{i=1}^{m} \omega_i \, \varphi(\|x - c_i\|, \sigma_i) = \sum_{i=1}^{m} \omega_i \exp\!\left(-\frac{\|x - c_i\|^2}{2\sigma_i^2}\right)   (2.45)
where the vector x = (x1, x2, . . . , xm) represents the input data of the network, ci
and σi are the mean and standard deviation of a Gaussian function, respectively,
m is the number of hidden-layer neurons, ωi is the weight of the link between the
ith basis function and the output node, and ‖·‖ is the Euclidean norm.
The training process adjusts, through iterations, the parameters of the network
including the center and width of each neuron in the hidden layer, and the weight
vectors between the hidden and output layer.
2. Channel modeling: To model the radio channel by using a neural network,
note that the mapping between the input (i.e., transmit power and
distance) and the output (receive power and delay) is usually a nonlinear function.
Hence, the goal of neural-network-based channel modeling is to use the
network to approximate the transmission system, as shown in Figure 2.22.
In [19], the number of RBFs is set to the number of MPCs, to simulate the
transmit signal with different time delays. The output layer gives the received signal.
Besides, the width of the RBF network is obtained by
\sigma = \frac{d}{\sqrt{2M}}   (2.46)
where d is the maximum distance and M is the number of RBF nodes. In this case,
once the nodes and width of RBF network are determined, the weights of the output
layer can be obtained by solving linear equations.
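The following Python sketch illustrates this RBF-based modeling step: the width follows (2.46), the Gaussian basis functions follow (2.45), and the output weights are obtained by linear least squares. The function names and the way the centers are supplied are illustrative assumptions rather than the exact procedure of [19].

import numpy as np

def fit_rbf_channel(X, y, centers):
    """RBF-network fit sketch following (2.45)-(2.46).

    X       : (N, D) training inputs (e.g., delayed versions of the transmit signal).
    y       : (N,) desired outputs (received-signal samples).
    centers : (M, D) RBF centers, e.g., one per resolvable multipath delay.
    """
    M = centers.shape[0]
    d_max = np.max(np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2))
    sigma = d_max / np.sqrt(2 * M)                    # width rule (2.46)
    # Hidden-layer design matrix: Gaussian basis functions (2.45).
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    Phi = np.exp(-dist**2 / (2 * sigma**2))
    # Output weights follow from a linear least-squares problem.
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w, sigma

def rbf_predict(X, centers, w, sigma):
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-dist**2 / (2 * sigma**2)) @ w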
Figure 2.23 Simulation results of AWGN channel containing two pathways with
Doppler frequency shifts in [19]
Figure 2.24 Neural network used in [55], which contains two input nodes, 80
hidden nodes, and one output node. The two input nodes correspond
to frequency and distance, whereas the output node corresponds to the
power of the received signal
generally have good performance at data processing, these approaches can be further
improved by considering the physical characteristics of channel parameters. For example,
KMeans is a good conventional clustering algorithm for data processing, and the
KPowerMeans in [13] further develops it by combining the physical interpretation
of the power of MPCs with KMeans, thus achieving better clustering of
MPCs than merely using KMeans. The development of the MCD is another
example: the physical characteristics of the MPCs are considered, so the MCD
is a more accurate measure of the differences between MPCs than the Euclidean
distance for channel modeling. As for neural networks, the physical interpretation is
also important for building an appropriate network, e.g., the description of the CIR needs
to be considered while constructing the activation function of the neural network.
In addition, the disadvantages of the adopted machine-learning methods cannot be
neglected; e.g., KMeans is sensitive to initial parameters, and this feature also appears
in KPowerMeans, whose clustering result is sensitive to the assumed number of
clusters and the positions of the cluster centroids. Exploiting the physical meaning of the
parameters of these approaches is a possible way to mitigate these disadvantages. Hence, the
potential relationship between the parameters of machine-learning techniques and the physical
variables of radio channels needs to be further incorporated into the adopted
algorithms to improve accuracy.
2.6 Conclusion
In this chapter, we presented some machine-learning-based channel modeling algo-
rithms, including (i) propagation scenarios classification, (ii) machine-learning-based
MPC clustering, (iii) automatic MPC tracking, and (iv) neural network-based channel
modeling. The algorithms can be implemented to preprocess the measurement data,
extract the characteristics of the MPCs, or model the channels by directly seeking the
mapping relationships between the environments and received signals. The results in
this chapter can serve as references for other real-world measurement-based channel
modeling efforts.
References
[1] Fleury BH, Tschudin M, Heddergott R, et al. Channel Parameter Estimation
in Mobile Radio Environments Using the SAGE Algorithm. IEEE Journal on
Selected Areas in Communications. 1999;17(3):434–450.
[2] Vaughan RG, and Scott NL. Super-Resolution of Pulsed Multipath Channels
for Delay Spread Characterization. IEEE Transactions on Communications.
1999;47(3):343–347.
[3] Richter A. Estimation of radio channel parameters: models and algorithms.
Technischen Universität Ilmenau; 2005 December.
[4] Benedetto F, Giunta G, Toscano A, et al. Dynamic LOS/NLOS Statistical
Discrimination of Wireless Mobile Channels. In: 2007 IEEE 65th Vehicular
Technology Conference – VTC2007-Spring; 2007. p. 3071–3075.
[5] Guvenc I, Chong C, and Watanabe F. NLOS Identification and Mitigation for
UWB Localization Systems. In: 2007 IEEE Wireless Communications and
Networking Conference; 2007. p. 1571–1576.
[37] Rappaport TS, Sun S, Mayzus R, et al. Millimeter Wave Mobile Communica-
tions for 5G Cellular: It Will Work!. IEEE Access. 2013;1:335–349.
[38] Sun S, MacCartney GR, Samimi MK, et al. Synthesizing Omnidirectional
Antenna Patterns, Received Power and Path Loss from Directional Anten-
nas for 5G Millimeter-Wave Communications. In: Global Communications
Conference (GLOBECOM), 2015 IEEE. IEEE; 2015. p. 1–7.
[39] Huang C, He R, Zhong Z, et al. A Power-Angle Spectrum Based Clustering
and Tracking Algorithm for Time-Varying Channels. IEEE Transactions on
Vehicular Technology. 2019;68(1):291–305.
[40] Otsu N. A Threshold Selection Method from Gray-Level Histograms. IEEE
Transactions on Systems, Man, and Cybernetics. 1979;9(1):62–66.
[41] Yacob A. Clustering of Multipath Parameters Without Predefining the Number
of Clusters. Master's thesis, TU Ilmenau; 2015.
[42] Schneider C, Ibraheam M, Hafner S, et al. On the Reliability of Multipath Clus-
ter Estimation in Realistic Channel Data Sets. In: The 8th European Conference
on Antennas and Propagation (EuCAP 2014); 2014. p. 449–453.
[43] Xie XL, and Beni G. A Validity Measure for Fuzzy Clustering. IEEE
Transactions on Pattern Analysis and Machine Intelligence. 1991;13(8):841–847.
[44] Sommerkorn G, Kaske M, Schneider C, et al. Full 3D MIMO Channel Sound-
ing and Characterization in an Urban Macro Cell. In: 2014 XXXIth URSI
General Assembly and Scientific Symposium (URSI GASS); 2014. p. 1–4.
[45] He R, Renaudin O, Kolmonen V, et al. A Dynamic Wideband Directional
Channel Model for Vehicle-to-Vehicle Communications. IEEE Transactions
on Industrial Electronics. 2015;62(12):7870–7882.
[46] He R, Renaudin O, Kolmonen V, et al. Characterization of Quasi-Stationarity
Regions for Vehicle-to-Vehicle Radio Channels. IEEE Transactions on Anten-
nas and Propagation. 2015;63(5):2237–2251.
[47] Czink N, Mecklenbrauker C, and Del Galdo G. A Novel Automatic Cluster Track-
ing Algorithm. In: 2006 IEEE 17th International Symposium on Personal,
Indoor and Mobile Radio Communications; 2006. p. 1–5.
[48] Karedal J, Tufvesson F, Czink N, et al. A Geometry-Based Stochastic
MIMO Model for Vehicle-to-Vehicle Communications. IEEE Transactions on
Wireless Communications. 2009;8(7):3646–3657.
[49] Yin X, Steinbock G, Kirkelund GE, et al. Tracking of Time-Variant Radio Prop-
agation Paths Using Particle Filtering. In: 2008 IEEE International Conference
on Communications; 2008. p. 920–924.
[50] Richter A, Enescu M, and Koivunen V. State-Space Approach to Propagation
Path Parameter Estimation and Tracking. In: IEEE 6th Workshop on Signal
Processing Advances in Wireless Communications, 2005; 2005. p. 510–514.
[51] Salmi J, Richter A, and Koivunen V. MIMO Propagation Parameter Tracking
using EKF. In: 2006 IEEE Nonlinear Statistical Signal Processing Workshop;
2006. p. 69–72.
[52] Zhang QJ, Gupta KC, and Devabhaktuni VK. Artificial Neural Networks for
RF and Microwave Design – From Theory to Practice. IEEE Transactions on
Microwave Theory and Techniques. 2003;51(4):1339–1350.
[53] Hinton GE, Osindero S, and Teh YW. A Fast Learning Algorithm for Deep
Belief Nets. Neural Computation. 2006;18(7):1527–1554.
[54] Ma Y-t, Liu K-h, and Guo Y-n. Artificial Neural Network Modeling Approach
to Power-Line Communication Multi-Path Channel. In: 2008 International
Conference on Neural Networks and Signal Processing; 2008. p. 229–232.
[55] Kalakh M, Kandil N, and Hakem N. Neural Networks Model of an UWB
Channel Path Loss in a Mine Environment. In: 2012 IEEE 75th Vehicular
Technology Conference (VTC Spring); 2012. p. 1–5.
Chapter 3
Channel prediction based on machine-learning
algorithms
Xue Jiang1 and Zhimeng Zhong 2
In this chapter, the authors address the wireless channel prediction using state-of-
the-art machine-learning techniques, which is important for wireless communication
network planning and operation. Instead of the classic model-based methods, the
authors provide a survey of recent advances in learning-based channel prediction
algorithms. Some open problems in this field are then proposed.
3.1 Introduction
1 School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, China
2 Huawei Technologies Ltd., China
support vector machines (SVM) [2,3], artificial neural networks (ANN) [2,4], and
matrix completion with singular value thresholding (SVT) [5]. Aside from that,
Gaussian processes [6] and kriging-based techniques [7] have recently been suc-
cessfully used for the estimation of radio maps. In [8], kriging-based techniques
have been applied to track channel gain maps in a given geographical area. The pro-
posed kriged Kalman filtering algorithm allows to capture both spatial and temporal
correlations. These studies use batch schemes as well. In [9], an adaptive online
reconstruction methodology is proposed with adaptive projected subgradient method
(APSM) [10], which is the unique online coverage map reconstruction algorithm
having been employed so far.
This chapter mainly generalizes and reviews the learning-based coverage-map
reconstruction approaches mentioned above. The rest of this survey is organized as follows.
Section 3.2 introduces methodologies for obtaining measurements. Section 3.3 discusses
the respective traits of batch and online algorithms, and the corresponding
approaches are studied as well. Section 3.4 describes the techniques applied
to label the measurements to obtain more accurate results. The final section draws the
conclusion of the survey.
Conventional drive testing is simple and stable. Nevertheless, this methodology consumes
significant time and human effort to obtain reliable data, and the cost grows
rapidly as the studied area gets larger. Thus, it is more suitable for small-scale
areas. MDT is a relatively cost-efficient way to obtain the measurements, at the cost
of stability. Running it on users' smartphones would reduce their data budget
and the battery lifetime of the mobile device. Wide variation in the capabilities of the involved
smartphones might also result in systematic measurement errors.
where \{\varphi_j(x)\}_{j=1}^{m} is a set of m nonlinear basis functions. The loss function [17] used
for determining the estimate is given by
L_\varepsilon(\tilde{y}, y) = \begin{cases} |\tilde{y} - y| - \varepsilon, & |\tilde{y} - y| > \varepsilon \\ 0, & \text{otherwise} \end{cases}   (3.2)
with ε being a small value. The problem can be formally stated as
\min_{\omega} \; \frac{1}{N} \sum_{i=1}^{N} L_\varepsilon(\tilde{y}_i, y_i) \quad \text{s.t.} \; \|\omega\| \le \alpha   (3.3)
where ω ∈ R^m and α ∈ R_+ is an arbitrarily chosen constant parameter. It is possible,
by introducing some slack variables ξ_i, ξ̄_i, to reformulate problem (3.3) as the primal problem
\min_{\xi_i, \bar{\xi}_i, \omega} \; \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{N} (\xi_i + \bar{\xi}_i)   (3.4)
whose dual problem can be written as
\max_{\alpha_i, \bar{\alpha}_i} \; \sum_{i=1}^{N} y_i(\alpha_i - \bar{\alpha}_i) - \varepsilon \sum_{i=1}^{N} (\alpha_i + \bar{\alpha}_i) - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (\alpha_i - \bar{\alpha}_i)(\alpha_j - \bar{\alpha}_j) K(x_i, x_j)   (3.5)
\text{s.t.} \; \sum_{i=1}^{N} (\alpha_i - \bar{\alpha}_i) = 0, \quad 0 \le \alpha_i \le C, \; 0 \le \bar{\alpha}_i \le C, \; i = 1, \ldots, N
where ε and C are arbitrarily chosen constants, and K(xi , x j ) is the inner-product
kernel:
K(xi , x j ) = φ(xi )T φ(x j ) (3.6)
defined in accordance with Mercer's condition [16]. Once the dual problem (3.5) is solved,
the coefficients α_i, ᾱ_i can be used to determine the approximating function:
f(x, \omega) = \sum_{i=1}^{N} (\alpha_i - \bar{\alpha}_i) K(x, x_i).   (3.7)
Data points for which α_i − ᾱ_i ≠ 0 are defined as support vectors. Parameters ε and C
control the machine complexity; choosing them for nonlinear regression
is a difficult task that directly impacts the performance of the SVM.
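As an illustration of how such an ε-insensitive SVR could be applied to coverage-map data, the short Python sketch below uses scikit-learn's SVR with an RBF kernel; the synthetic locations and path-loss values are purely hypothetical, and C and ε are unoptimized example settings rather than values from the cited studies.

import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical data: X holds (x, y) measurement locations, y_pl the measured path loss in dB.
rng = np.random.default_rng(0)
X = rng.uniform(0, 500, size=(200, 2))
y_pl = 40 + 30 * np.log10(np.linalg.norm(X, axis=1) + 1) + rng.normal(0, 3, 200)

# C and epsilon play the roles of the constants in (3.4)-(3.5); the RBF kernel
# is one concrete choice of the inner-product kernel K(x_i, x_j) of (3.6).
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=1.0))
model.fit(X, y_pl)
y_hat = model.predict(X)     # reconstructed coverage values at the training locations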
where neural networks are used to correct the biases generated by unknown envi-
ronmental properties and algorithmic simplifications of path-loss estimations that
are common in ray-tracing techniques. The considered neural networks are shown to
substantially improve the results obtained by classic ray-tracing tools. In [22], radial basis
function (RBF) neural networks are used instead of the classic multilayer perceptron
(MLP). One of the major advantages of RBF neural networks is that they tend to learn
much faster than MLP neural networks, because their learning process can be split
into two stages for which relatively efficient algorithms exist. More specifically, a
two-stage learning approach is taken, where the first stage is composed of an unsu-
pervised clustering step via the rival penalized competitive learning approach. Then
the centers of the radial basis function are adjusted, and, once fixed, the weights are
then learned in a supervised fashion by using the celebrated recursive least squares
algorithm.
In [23], a one-layer backpropagation ANN is proposed to gauge the perfor-
mance of kriging-based coverage map estimation. A new distance measure that takes
obstacles between two points into consideration is introduced, and it is defined as
d_{i,j} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} + 10^{E}   (3.8)
where E = (10c)^{-1} \sum_{r \in W_{i,j}} L_r with W_{i,j} representing the set of obstacles between
point i and j, Lr being the path loss of the respective obstacles, and c being the free
space parameter. The first term, involving the square root, is simply the Euclidean
distance between points i and j. The term 10^E expresses the path loss caused by
obstacles. For example, if one assumes that the path-loss factor of a wall between
two points is 5 dB and the free-space parameter c is 2 dB for the environment in
which the wall resides, then the path loss between these two points due to the wall
will equal the free-space path loss over a distance of 10^{5/(10×2)}. This increase of the path loss
can be equivalently represented by an increase of the effective distance between the
two points. This new measure for the distance improves the achievable estimation
accuracy for prediction tools based on both kriging and ANNs.
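A small Python sketch of the obstacle-aware distance (3.8) is given below; the function name and inputs are illustrative.

import numpy as np

def obstacle_distance(p_i, p_j, wall_losses_db, c_db=2.0):
    """Obstacle-aware distance of (3.8).

    p_i, p_j       : (x, y) coordinates of the two points.
    wall_losses_db : path-loss values L_r (dB) of the obstacles on the segment i-j.
    c_db           : free-space parameter c (dB) of the environment.
    """
    euclidean = np.hypot(p_i[0] - p_j[0], p_i[1] - p_j[1])
    E = sum(wall_losses_db) / (10.0 * c_db)
    return euclidean + 10.0 ** E

# Example from the text: a 5 dB wall with c = 2 dB adds 10**(5/20) to the effective distance.
print(obstacle_distance((0, 0), (10, 0), [5.0], c_db=2.0))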
A common problem that arises in learning tasks is that in general we have no or
little prior knowledge of the relevance of the input data, and hence many candidate
features are generally included in order to equip algorithms with enough degrees of
freedom to represent the domain. Unfortunately, many of these features are irrelevant
or redundant, and their presence does not improve the discrimination ability. Further-
more, many inputs and a limited number of training examples generally lead to the
so-called curse of dimensionality, where the data is very sparse and provides a poor
representation of the mapping. (Deep neural networks do not perform well with lim-
ited training data.) As a remedy to this problem, dimensionality-reduction techniques
are applied to the data in practice, which transform the input into a reduced represen-
tation of features. Dimensionality-reduction techniques are usually divided into two
classes, linear methods (e.g., independent component analysis) and nonlinear meth-
ods (e.g., nonlinear PCA). In [2], a two-step approach using learning machines and
dimensionality-reduction techniques is proposed. SVMs and ANNs are used as the
learning tools, and they are combined with two dimensionality-reduction techniques,
Channel prediction based on machine-learning algorithms 115
namely, linear and nonlinear PCA. In more detail, in [2], the macrocellular path-loss
model is defined as follows:
where L_0 is the free-space path loss in dB, d is the radio path length, f is the radio frequency,
and α_buildings is an attenuation term that depends on several parameters, such as the heights
of base stations and receivers, the distance between consecutive buildings, and the height
of buildings. In [2], the function in (3.9) is learned by using a three-layer ANN, with the
three parameters as input. The estimation using dimensionality-reduction techniques
has been shown to substantially improve the prediction power over methods that use the
full dimensionality of the input. In addition, PCA-based prediction models provide
better prediction performance than nonlinear PCA-based models, and ANN-based
models tend to perform slightly better than SVM-based predictors (in the scenarios
considered in the above-mentioned studies).
The applications of neural networks discussed in this topic are considered as
function-approximation problems consisting of a nonlinear mapping from a set of
input variables, containing information about the potential receiver, onto a single output
variable representing the predicted path loss. MLPs are applied to reconstruct the path
loss in [24]. Figure 3.1 shows the configuration of an MLP with one hidden layer
and an output layer. The output of the neural network is described as
y = F_0\!\left( \sum_{j=0}^{M} w_{oj} \, F_h\!\left( \sum_{i=0}^{N} w_{ji} x_i \right) \right)   (3.10)
where woj represents the synaptic weights from neuron j in the hidden layer to the
single output neuron, xi represents the ith element of the input vector, Fh and F0 are
the activation function of the neurons from the hidden and output layers, respectively,
and wji are the connection weights between the neurons of the hidden layer and the
inputs. The learning phase of the network proceeds by adaptively adjusting the free
parameters of the system based on the mean square error between the predicted path loss (3.10)
and the measured path loss for a set of appropriately selected training examples:
E = \frac{1}{2} \sum_{i=1}^{m} (y_i - d_i)^2   (3.11)
where yi is the output value calculated by the network and di represents the expected
output.
When the error between network output and the desired output is minimized, the
learning process is terminated. Thus, the selection of the training data is critical to
achieve good generalization properties [25,26]. In coverage map reconstruction, the
neural networks are trained with the Levenberg–Marquardt algorithm, which provides
a faster convergence rate than the backpropagation algorithm with adaptive learning
rates and momentum. The Levenberg–Marquardt rule for updating the parameters is
given by
\Delta W = \left(J^T J + \mu I\right)^{-1} J^T e   (3.12)
where e is an error vector, μ is a scalar parameter, W is the matrix of network weights
(and ΔW its update), and J is the Jacobian matrix of the partial derivatives of the error components with
respect to the weights.
An important problem that occurs during neural network training is over-adaptation
(overfitting). That is, the network memorizes the training examples and does not learn
to generalize to new situations. In order to avoid over-adaptation and to achieve good
generalization performance, the training set is separated into the actual training subset
and a validation subset, typically 10%–20% of the full training set [26].
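The sketch below illustrates this MLP-based path-loss regression with a held-out validation subset; since the Levenberg–Marquardt rule (3.12) is not available in scikit-learn, the Adam solver with early stopping is used instead, and the synthetic data and layer size are hypothetical.

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical inputs: receiver coordinates; target: measured path loss (dB).
rng = np.random.default_rng(1)
X = rng.uniform(1, 1000, size=(500, 2))
y = 32.4 + 35 * np.log10(np.linalg.norm(X, axis=1)) + rng.normal(0, 4, 500)

# One hidden layer as in Figure 3.1; early stopping on a held-out validation
# subset (15% here) guards against over-adaptation.
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(20,), activation="tanh", solver="adam",
                 early_stopping=True, validation_fraction=0.15,
                 max_iter=2000, random_state=1),
)
mlp.fit(X, y)
print("training MSE:", np.mean((mlp.predict(X) - y) ** 2))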
3.3.1.3 Matrix completion
In radio map reconstruction, if the sampling rate of the area of interest is high enough,
classical signal-processing approaches can be used to reconstruct coverage maps.
However, dense sampling can be very costly or impracticable, and in general only
a subset of radio measurements of an area are available at a given time. By mak-
ing assumptions on the spatial correlation properties of radio measurements, which
are strongly related to structural properties of an area, and by fitting correspond-
ing correlation models, statistical estimators such as kriging interpolation are able
to produce precise estimates based on only few measurements. However, the price
for this precision is the high computational complexity and questionable scalability.
Nevertheless, the spatial correlation exploited by kriging approaches suggests that
coverage maps contain redundant information, so, if represented by a matrix, radio
maps can be assumed to be of low rank. This observation has led some authors to
propose the framework of low-rank matrix completion for coverage map estimation,
which is the topic of this section.
Matrix completion builds on the observation that a matrix that is of low rank or
approximately low rank can be recovered by using just a subset of randomly observed
data [27,28]. A major advantage of matrix completion is that it is able to recover a
matrix by making no assumption about the process that generates the matrix, except
that the resulting matrix is of low rank. In the context of radio map estimation, matrix
efficiently with convex optimization tools. (The relaxed problems are often analyzed
to check the number of measurements required to recover the solution to the original
NP-hard problem exactly, with high probability.) In particular, a common relaxation
of the rank minimization problem is formulated as
\min_{A} \; \|A\|_* \quad \text{s.t.} \; A_{ij} = P_{ij}, \; \forall (i,j) \in \Omega   (3.14)
where \|A\|_* denotes the nuclear norm of the matrix A, which is defined as
\|A\|_* = \sum_{k=1}^{\min(m,n)} \sigma_k(A)
with σk (·) being the kth largest singular value of a matrix. Note that (3.14) can
be converted into a semidefinite program (SDP) and hence can be solved by
interior-point methods. However, directly solving the SDP has a high complexity.
Several algorithms faster than the SDP-based methods have been proposed to solve
the nuclear norm minimization, such as SVT, fixed point continuation (FPC), and
proximal gradient descent [5]. In radio map reconstruction, the authors of [13] opt
for the SVT algorithm, which can be briefly described as follows. Starting from an
initial zero matrix Y0 , the following steps take place at each iteration:
A_i = \mathrm{shrink}(Y_{i-1}, \tau), \qquad Y_i = Y_{i-1} + \mu \, \mathcal{P}_\Omega(P - A_i)   (3.15)
with μ being a nonnegative step size. The operator 𝒫_Ω(X) is the sampling operator
associated with the set Ω: entries not contained in the index set Ω are set to zero,
while the remaining entries are kept unchanged. The operator shrink(·, τ) is the standard
rank-reduction thresholding function, which sets singular values beneath a certain
threshold τ > 0 to zero.
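A compact Python sketch of the SVT iteration (3.15) is shown below, with the shrink operator implemented through a full SVD followed by soft-thresholding of the singular values; the function name and the fixed iteration count are illustrative choices.

import numpy as np

def svt_complete(P, mask, tau, mu=1.0, n_iter=200):
    """Singular value thresholding (SVT) sketch following (3.15).

    P    : (m, n) matrix with the observed path-loss entries (zeros elsewhere).
    mask : boolean (m, n) matrix, True on the sampled set Omega.
    tau  : singular-value threshold; mu: step size.
    """
    Y = np.zeros_like(P, dtype=float)
    A = np.zeros_like(P, dtype=float)
    for _ in range(n_iter):
        # shrink(Y, tau): SVD followed by soft-thresholding of the singular values.
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        A = (U * np.maximum(s - tau, 0.0)) @ Vt
        # Gradient-type update restricted to the sampled entries.
        Y = Y + mu * mask * (P - A)
    return A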
In [13], the authors also introduce a method to improve the path-loss reconstruc-
tion via matrix completion. The idea is to define a notion of “informative areas,” which
are regions in which samples are required to improve greatly the map reconstruction.
The motivation for this approach is that, in coverage maps, there may exist nonsmooth
transitions caused by abrupt attenuation of signals, which are common when radio
waves impinge on obstacles such as large buildings, tunnels, metal constructions.
Consequently, path loss in such areas exhibits low spatial correlation, which can lead
to reconstruction artifacts that can only be mitigated by increasing the sampling rate
in those regions. In order to identify those regions, which are mathematically represented
by matrix entries, the authors of [13] resort to a family of active-learning
algorithms and, in particular, employ the query-by-committee (QbC) rationale. The general approach
is to quantify the uncertainty of the prediction in each missing value in the matrix, so
only measurements corresponding to the most uncertain entries are taken. In the QbC
rationale, the missing matrix values are first estimated by means of many different
algorithms, and only a subset of the available data is used. Assuming that the available
data budget amounts to k measurements, first the coverage map is computed by only
Channel prediction based on machine-learning algorithms 119
using l < k of the available entries. Then, three different algorithms for matrix recon-
struction are compared, and the top K = k − l entries with the largest disagreement
are chosen. New measurements for those K entries are then gathered, and a new cov-
erage map is estimated by using the new samples. The three different reconstruction
algorithms used in [13] are the SVT, the K-nearest neighbors, and the kernel APSM.
In a subsequent work [30], the authors from [13] derive an online algorithm
based on the ALS method for matrix completion. They adopt the matrix factorization
framework in which the low-rank matrix A is replaced by the low-rank product LR^T,
with L ∈ R^{m×ρ} and R ∈ R^{n×ρ}, where ρ is a prespecified overestimate of the rank
of A. Based on this framework, the rank-minimization objective is replaced by the
equivalent objective:
\min_{L,R} \; \frac{1}{2}\big(\|L\|_F^2 + \|R\|_F^2\big) \quad \text{s.t.} \; LR^T = A, \;\; A_{ij} = P_{ij}, \; \forall (i,j) \in \Omega   (3.16)
For the noisy case, the objective function for matrix completion in [30] becomes:
with γ being a regularization parameter that controls the trade-off between the close-
ness to data and the nuclear norm of the reconstructed matrix. The ALS method is
a two-step iterative method in which the objective is consecutively minimized over
one variable by holding the other constant. Hence, two quadratic programs have to
be solved in each iteration step consecutively. This amounts to solving D least-squares
problems to find the optimum solution of each row of L and R. This, however, amounts
to computing a (ρ × ρ) matrix inversion for each row, which might become prohibitive
with an increased number of samples. Therefore, the authors in [30] propose an approx-
imation algorithm, in which the coefficients of the optimum row vector are computed
one by one, which significantly reduces the computational complexity, especially for
sparse datasets. In this mindset, the online version of the ALS is proposed in a way
that, with new incoming data, only the respective coefficients are updated via this
approximated update function.
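To make the alternating least-squares idea concrete, the Python sketch below implements a plain regularized ALS over the factors L and R; it uses exact row-wise ridge solutions rather than the coefficient-wise approximation or the online update of [30], and the parameter names are hypothetical.

import numpy as np

def als_complete(P, mask, rho=5, gamma=0.1, n_iter=50):
    """Alternating least squares (ALS) sketch for the factorization model A = L R^T.

    Only a ridge term gamma on the factor norms is used here; the smoothness
    regularizer discussed below is omitted for simplicity.
    """
    m, n = P.shape
    rng = np.random.default_rng(0)
    L = rng.standard_normal((m, rho))
    R = rng.standard_normal((n, rho))
    for _ in range(n_iter):
        # Update each row of L from its observed entries (a small ridge regression).
        for i in range(m):
            idx = mask[i]
            if idx.any():
                Ri = R[idx]
                L[i] = np.linalg.solve(Ri.T @ Ri + gamma * np.eye(rho), Ri.T @ P[i, idx])
        # Update each row of R symmetrically.
        for j in range(n):
            idx = mask[:, j]
            if idx.any():
                Lj = L[idx]
                R[j] = np.linalg.solve(Lj.T @ Lj + gamma * np.eye(rho), Lj.T @ P[idx, j])
    return L @ R.T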
In addition to the online reconstruction algorithm for matrix-completion-based
coverage map reconstruction, the authors in [30] also derive a new adaptive sampling
scheme, able to outperform the QbC rationale from their previous work. They assume
that coverage maps are in general smooth. Therefore, for two neighboring matrix
entries (i1 , j1 ) and (i2 , j2 ) that satisfy |i1 − i2 | ≤ 1 and | j1 − j2 | ≤ 1, the entry difference
should be bounded by
|Ai1 j1 − Ai2 j2 | ≤
defined via the diversity of the row-wise and column-wise difference of LRT = A, or,
in mathematical terms:
s(LR^T) = \|D_x(LR^T)\|_F^2 + \|D_y(LR^T)\|_F^2
with the gradient operators Dx (A) and Dy (A) being defined as
Dx (i, j) = A(i, j + 1) − A(i, j)
Dy (i, j) = A(i + 1, j) − A(i, j)
The smoothed matrix-completion objective of [29] is then stated as
\min_{L,R} \; \frac{1}{2}\big(\|L\|_F^2 + \|R\|_F^2\big) + \lambda \big(\|D_x(LR^T)\|_F^2 + \|D_y(LR^T)\|_F^2\big)
\text{s.t.} \; LR^T = A, \quad A_{ij} = P_{ij}, \; \forall (i,j) \in \Omega   (3.19)
For the solution of the minimum problem in (3.19), an alternating iteration algorithm
over L and R is adopted in [29]. At first, L and R are chosen at random, then L is fixed
and R is optimized by a linear least square method. Then R is fixed and the cost function
is optimized over L. This procedure is repeated until no progress is observed. In [29],
the proposed smoothed low-rank reconstruction method is compared with interpola-
tion methods such as radial basis interpolation and inverse distance weighting. The
smoothed low-rank reconstruction method is shown to achieve similar reconstruction
quality with fewer samples compared to these methods.
Alternating projection methods
Note that these nuclear-norm-based algorithms require performing the full SVD of
an m × n matrix. When m or n is large, computing the full SVD is time-consuming.
Different from rank or nuclear norm minimization, a new strategy is adopted for
matrix completion. The basic motivation of the alternating projection algorithm (APA)
is to find a matrix such that it has low rank and its entries over the sample set Ω are
consistent with the available observations. Denote the known entries as P_Ω:
[P_\Omega]_{i,j} = \begin{cases} P_{i,j}, & \text{if } (i,j) \in \Omega \\ 0, & \text{otherwise.} \end{cases}   (3.20)
Then it can be formulated as the following feasibility problem [32]:
\text{find } A \quad \text{s.t.} \; \mathrm{rank}(A) \le r, \;\; A_\Omega = P_\Omega,   (3.21)
where r ≪ min(m, n) is the desired rank. Obviously, (3.21) is only suitable for the
noise-free case. For the noisy case, we use
\text{find } A \quad \text{s.t.} \; \mathrm{rank}(A) \le r, \;\; \|A_\Omega - P_\Omega\|_F \le \varepsilon_2   (3.22)
to achieve robustness to Gaussian noise. In the presence of outliers, we adopt
\text{find } A \quad \text{s.t.} \; \mathrm{rank}(A) \le r, \;\; \|A_\Omega - P_\Omega\|_p \le \varepsilon_p   (3.23)
where ε_p > 0 is a small tolerance parameter that controls the ℓ_p-norm of the fitting
error and ‖·‖_p denotes the element-wise ℓ_p-norm of a matrix over Ω, i.e.,
\|A\|_p = \left( \sum_{(i,j) \in \Omega} \big|[A]_{i,j}\big|^p \right)^{1/p}.   (3.24)
Apparently, (3.23) reduces to (3.22) when p = 2. Also, (3.23) reduces to the noise-free
case of (3.21) if εp = 0.
By defining the rank constraint set
Sr := {A|rank(A) ≤ r} (3.25)
and the fidelity constraint set
\mathcal{S}_p := \{A \mid \|A_\Omega - P_\Omega\|_p \le \varepsilon_p\},   (3.26)
the matrix completion problem of (3.23) is formulated as finding a common point of
the two sets, i.e.:
find X ∈ Sr ∩ Sp . (3.27)
For a given set S, the projection of a point Z ∉ S onto it, denoted as Π_S(Z), is defined as
\Pi_{\mathcal{S}}(Z) := \arg\min_{X \in \mathcal{S}} \|X - Z\|_F^2.   (3.28)
We adopt the strategy of alternating projection (AP) onto S_r and S_p to find a common
point lying in the intersection of the two sets [32]. That is, we alternately project onto
S_r and S_p in the kth iteration as
Y^k = \Pi_{\mathcal{S}_r}(A^k), \qquad A^{k+1} = \Pi_{\mathcal{S}_p}(Y^k).   (3.29)
The choice of p = 1 is quite robust to outliers. Other values of p < 2 may also be of
interest. The case p < 1 requires computing the projection onto a nonconvex and
nonsmooth ℓ_p-ball, which is difficult and hence not considered here. The case 1 < p < 2
involves the projection onto a convex ℓ_p-ball, which is not difficult to solve but requires
an iterative procedure. Since the choice of p = 1 is more robust than 1 < p < 2 and
computationally simpler, we can use p = 1 for outlier-robust matrix completion.
By the Eckart–Young theorem, the projection of Z ∉ S_r onto S_r can be computed
via the truncated SVD of Z:
\Pi_{\mathcal{S}_r}(Z) = \sum_{i=1}^{r} \sigma_i u_i v_i^T   (3.30)
in the interval (0, ‖z − p‖_∞) using the bisection method, where ‖·‖_∞ is the ℓ_∞-norm
of a vector. The computational complexity of the projection onto the ℓ_1-ball is
O(|Ω|), which is much lower than that of the projection onto S_r.
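The following Python sketch illustrates the alternating projection of (3.29)–(3.30) for the simpler p = 2 (Frobenius-ball) fidelity set; the ℓ1-ball projection needed for the outlier-robust p = 1 case is omitted, and the function name and iteration count are illustrative.

import numpy as np

def ap_complete(P, mask, r, eps=0.0, n_iter=200):
    """Alternating-projection matrix completion sketch, cf. (3.29)-(3.30).

    Alternates between the rank set S_r (truncated SVD) and the fidelity set;
    the p = 2 (Frobenius-ball) fidelity case is used here for simplicity.
    """
    A = np.where(mask, P, 0.0).astype(float)
    for _ in range(n_iter):
        # Projection onto S_r: keep the r dominant singular components (Eckart-Young).
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        Y = (U[:, :r] * s[:r]) @ Vt[:r]
        # Projection onto S_2 = {A : ||A_Omega - P_Omega||_F <= eps}.
        resid = (Y - P) * mask
        norm = np.linalg.norm(resid)
        A = Y.copy()
        if norm > eps:
            scale = 0.0 if eps == 0.0 else eps / norm
            A[mask] = P[mask] + scale * resid[mask]
    return A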
The selection of ε_p is critical to the performance of the APA. In the absence of noise,
the optimum is ε_p = 0. For the noisy case, ε_p is related to the noise level; roughly
speaking, larger noise requires a larger ε_p. If the probability distribution of the noise is known
a priori, we can estimate the probability distribution of the ℓ_p-norm of the noise.
Then a proper value of ε_p can be determined according to this distribution such that
the true entries are located in the ℓ_p-ball. If the distribution of the noise is unknown,
one may resort to cross-validation to determine a proper ε_p. Note that in the nuclear-norm
regularized problem
\min_{A} \; \frac{1}{2}\|A_\Omega - P_\Omega\|_F^2 + \tau \|A\|_*   (3.36)
one also faces the issue of selecting the regularization parameter τ. Clearly, an advantage
of the proposed formulation is that ε_p can be determined from the a priori noise level
without difficulty, whereas choosing τ is not as easy.
Remark: It should be pointed out that the APA is different from the iterative hard
thresholding (IHT) and its variants [33,34], although they all use a rank-r projection.
The IHT solves the rank-constrained Frobenius norm minimization
\min_{A} \; f(A) := \frac{1}{2}\|A_\Omega - P_\Omega\|_F^2, \quad \text{s.t.} \; \mathrm{rank}(A) \le r   (3.37)
using gradient projection, with the iteration step being
A^{k+1} = \Pi_{\mathcal{S}_r}\!\big(A^k - \mu \nabla f(A^k)\big)   (3.38)
where μ > 0 is the step size and ∇f is the gradient of f. Determining the step size
with a line-search scheme requires computing the projection Π_{S_r}(·) several times.
Thus, the per-iteration computational cost of the IHT is several times that of the APA.
Convergence of the alternating projection for finding a common point of two sets
was previously established for convex sets only [35]. Recently, the convergence of the
APA for nonconvex sets that satisfy a regularity condition has been investigated
[36,37]. Exploiting the fact that the rank constraint set of (3.25) satisfies prox-regularity,
and according to Theorem 5.2 of [36], we can establish the convergence of
the APA for matrix completion, as stated in the following proposition.
Proposition: The APA locally converges to a point in S_r ∩ S_p at a linear rate.
In more detail, at each iteration n, q sets are selected from the collection
{S_1, . . . , S_n} with the approach described in [9]. The intersection of these sets is the
set C_n, and the index of the sets chosen from the collection is denoted by
\mathcal{I}_{n,q} := \{ i_{r_n}^{(n)}, i_{r_n - 1}^{(n)}, \ldots, i_{r_n - q + 1}^{(n)} \} \subseteq \{1, \ldots, n\},   (3.39)
where n ≥ q and r_n is the size of the dictionary. With this selection of sets, and starting from
f̂_0 = 0, the sequence {f̂_n}_{n∈ℕ} ⊂ H is generated by
\hat{f}_{n+1} := \hat{f}_n + \mu_n \Big( \sum_{j \in \mathcal{I}_{n,q}} \omega_{j,n} P_{S_j}(\hat{f}_n) - \hat{f}_n \Big),   (3.40)
The K entries that score the largest disagreement are chosen, and drive tests are performed
to obtain the path loss at those locations.
The QbC algorithm is simple and can be implemented easily, and it enhances
the accuracy of the reconstruction, as illustrated in [13]. However,
since more than two algorithms are required to run in parallel, it is not among the
most efficient algorithms, and the predicted results can be greatly influenced if the
employed algorithms are not sufficiently stable.
line-of-sight (LOS) and non-LOS routes, are marked by red and yellow lines, respectively.
With the data points being sampled from these UE routes, our purpose is to
predict the values of the other points and hence to achieve the path-loss reconstruction.
We devise an experiment to evaluate the prediction performance of both the AP and
the SVT. In the experiment, only a fraction of the sampling points is viewed as known,
while the rest of the sampling points is considered to be unknown in advance and is
therefore to be predicted. Assuming
that Ω_1 is the subset of the set Ω consisting of those predicted points, and that P_{i,j}, (i,j) ∈ Ω_1,
denotes the true value at the (i,j) point, we compare the predicted value P̂_{i,j} with
its true value. If the condition
|\hat{P}_{i,j} - P_{i,j}| \le \delta
is satisfied, the prediction with respect to the (i,j) point is regarded as successful;
otherwise the prediction fails. In our experiment, we set δ = 20 and investigate
the successful ratio of prediction with respect to the rank r. For each value of the
matrix rank, 100 trials are carried out to calculate the average result. The proportion
of known sampling points, which are randomly selected in each trial, is 85%; the
remaining 15% of the sampling points are viewed as points to be predicted.
Note that the AP's performance is affected by the estimated rank r,
while the SVT's performance is determined by the singular-value threshold.
Hence, we evaluate the performance of each algorithm with respect to its
parameter. For the AP, Figure 3.3 plots the successful ratio of prediction
versus the rank r. It is observed that using r = 2 yields the best performance. When
r > 2, the successful ratio decreases successively with increasing rank. This
phenomenon shows that a reasonable prediction is obtained only when a tight rank
constraint with quite small r is adopted. In contrast, for the SVT, Figure 3.4 plots
its successful ratio of prediction versus the threshold of singular value. While an
appropriate threshold yields the highest successful ratio, a threshold that is too small
or too large results in a decreased successful ratio.
Figure 3.3 Successful ratio of prediction versus the rank r via AP
Based upon the optimal parameters of the two algorithms (rank r = 2 for the APA and
a singular-value threshold of 1.5 × 10^5 for the SVT), we compare the highest
successful ratio that the two algorithms can attain. Observe that the highest successful
ratio of the APA is 72.9% and that of the SVT is 67.8%; the AP thus
outperforms the SVT.
Then we evaluate the prediction errors of both algorithms by the root mean square
error (RMSE) which is defined as
\mathrm{RMSE} = 10 \log_{10} E\{\|\hat{P}_{\Omega_1} - P_{\Omega_1}\|_F^2\}.   (3.46)
Figure 3.5 plots the RMSE versus the rank r of AP, which demonstrates that the best
performance can be achieved when r = 2. This RMSE result is consistent with the
above successful ratio result. In those two aspects of evaluation, the choice of r = 2
Figure 3.4 Successful ratio of prediction versus threshold of singular value via SVT
Figure 3.5 RMSE versus the rank r via AP
Figure 3.6 RMSE versus the threshold of singular value via SVT
yields the best successful ratio and prediction error. Therefore, we can conclude that
adopting r = 2 as the estimated rank for the AP yields the best prediction. In comparison,
Figure 3.6 plots the RMSE versus the threshold of singular value for the SVT. While
the SVT attains its smallest RMSE value of 8.57 with a singular-value threshold
of 1 × 10^5, the AP obtains a smaller RMSE value of 7.92 with rank r = 2.
This comparison confirms the better performance of the AP over the SVT.
3.5 Conclusion
References
[1] G. Fodor, E. Dahlman, G. Mildh, et al., “Design aspects of network assisted
device-to-device communications,” IEEE Communications Magazine, vol. 50,
no. 3, pp. 170–177, 2012.
[2] M. Piacentini and F. Rinaldi, “Path loss prediction in urban environment using
learning machines and dimensionality reduction techniques,” Computational
Management Science, vol. 8, no. 4, pp. 371–385, 2011.
[3] R. Timoteo, D. Cunha, and G. Cavalcanti, “A proposal for path loss prediction
in urban environments using support vector regression,” in The Tenth Advanced
International Conference on Telecommunications (AICT), 2014, pp. 119–124.
[4] I. Popescu, I. Nafomita, P. Constantinou, A. Kanatas, and N. Moraitis, “Neural
networks applications for the prediction of propagation path loss in urban
environments,” in IEEE 53rd Vehicular Technology Conference (VTC), vol. 1,
2001, pp. 387–391.
[5] J.-F. Cai, E. J. Candès, and Z. Shen, “A singular value thresholding algorithm for
matrix completion,” SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–
1982, 2010.
[6] R. Di Taranto, S. Muppirisetty, R. Raulefs, D. Slock, T. Svensson, and
H. Wymeersch, “Location-aware communications for 5G networks: how loca-
tion information can improve scalability, latency, and robustness of 5G,” IEEE
Signal Processing Magazine, vol. 31, no. 6, pp. 102–112, 2014.
[7] D. M. Gutierrez-Estevez, I. F. Akyildiz, and E. A. Fadel, “Spatial cover-
age cross-tier correlation analysis for heterogeneous cellular networks,” IEEE
Transactions on Vehicular Technology, vol. 63, no. 8, pp. 3917–3926, 2014.
[8] E. Dall’Anese, S.-J. Kim, and G. B. Giannakis, “Channel gain map tracking
via distributed kriging,” IEEE Transactions on Vehicular Technology, vol. 60,
no. 3, pp. 1205–1211, 2011.
[23] A. Konak, “Predicting coverage in wireless local area networks with obstacles
using kriging and neural networks,” International Journal of Mobile Network
Design and Innovation, vol. 3, no. 4, pp. 224–230, 2011.
[24] I. Popescu, I. Nafomita, P. Constantinou, A. Kanatas, and N. Moraitis, “Neural
networks applications for the prediction of propagation path loss in urban
environments,” in Vehicular Technology Conference, 2001. VTC 2001 Spring.
IEEE VTS 53rd, vol. 1. IEEE, 2001, pp. 387–391.
[25] G. Wolfle and F. Landstorfer, “Field strength prediction in indoor environments
with neural networks,” in Vehicular Technology Conference, 1997, IEEE 47th,
vol. 1. IEEE, 1997, pp. 82–86.
[26] S. Haykin and R. Lippmann, “Neural networks, a comprehensive foundation,”
International Journal of Neural Systems, vol. 5, no. 4, pp. 363–364, 1994.
[27] E. J. Candès and Y. Plan, “Matrix completion with noise,” Proceedings of the
IEEE, vol. 98, no. 6, pp. 925–936, 2010.
[28] M. A. Davenport and J. Romberg, “An overview of low-rank matrix recovery
from incomplete observations,” IEEE Journal of Selected Topics in Signal
Processing, vol. 10, no. 4, pp. 608–622, 2016.
[29] Y. Hu, W. Zhou, Z. Wen, Y. Sun, and B. Yin, “Efficient radio map construc-
tion based on low-rank approximation for indoor positioning,” Mathematical
Problems in Engineering, vol. 2013, pp. 1–9, 2013.
[30] L. Claude, S. Chouvardas, and M. Draief, “An efficient online adaptive
sampling strategy for matrix completion,” in The 42nd IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE,
2017, pp. 3969–3973.
[31] S. Nikitaki, G. Tsagkatakis, and P. Tsakalides, “Efficient training for finger-
print based positioning using matrix completion,” in The 20th European Signal
Processing Conference (EUSIPCO). IEEE, 2012, pp. 195–199.
[32] X. Jiang, Z. Zhong, X. Liu, and H. C. So, “Robust matrix completion
via alternating projection,” IEEE Signal Processing Letters, vol. 24, no. 5,
pp. 579–583, 2017.
[33] P. Jain, R. Meka, and I. S. Dhillon, “Guaranteed rank minimization via sin-
gular value projection,” in Adv. Neural Inf. Process. Syst. (NIPS), 2010,
pp. 937–945.
[34] J. Tanner and K. Wei, “Normalized iterative hard thresholding for matrix com-
pletion,” SIAM Journal on Scientific Computing, vol. 35, no. 5, pp. S104–S125,
2013.
[35] L. Bregman, “The relaxation method of finding the common point of convex
sets and its application to the solution of problems in convex programming,”
USSR Computational Mathematics and Mathematical Physics, vol. 7, no. 3,
pp. 200–217, 1967.
[36] A. S. Lewis, D. R. Luke, and J. Malick, “Local linear convergence for alter-
nating and averaged nonconvex projections,” Foundations of Computational
Mathematics, vol. 9, no. 4, pp. 485–513, 2009.
[37] D. R. Luke, “Prox-regularity of rank constraint sets and implications for
algorithms,” Journal of Mathematical Imaging and Vision, vol. 47, no. 3,
pp. 231–328, 2013.
[38] M. Yukawa and R.-i. Ishii, “Online model selection and learning by multi-
kernel adaptive filtering,” in Signal Processing Conference (EUSIPCO), 2013
Proceedings of the 21st European. IEEE, 2013, pp. 1–5.
[39] K. Slavakis and S. Theodoridis, “Sliding window generalized kernel affine pro-
jection algorithm using projection mappings,” EURASIP Journal on Advances
in Signal Processing, vol. 2008, pp. 1–16, 2008.
[40] S. Theodoridis, K. Slavakis, and I. Yamada, “Adaptive learning in a world of
projections,” IEEE Signal Processing Magazine, vol. 28, no. 1, pp. 97–123,
2011.
Chapter 4
Machine-learning-based channel estimation
Yue Zhu1 , Gongpu Wang1 , and Feifei Gao2
Wireless communication has been a highly active research field [1]. Channel estima-
tion technology plays a vital role in wireless communication systems [2]. Channel
estimates are required by wireless nodes to perform essential tasks such as precoding,
beamforming, and data detection. A wireless network would have good performance
with well-designed channel estimates [3,4].
Recently, artificial intelligence (AI) has been a hot research topic which attracts
worldwide attention from both academic and industrial circles. AI, which aims to
enable machines to mimic human intelligence, was first proposed and founded as
an academic discipline at the Dartmouth Conference in 1956 [5]. It covers a series of
research areas, including natural language processing, pattern recognition, computer
vision, machine learning (ML), robotics, and other fields, as shown in Figure 4.1.
ML, a branch of AI, uses statistical techniques to develop algorithms that enable
computers to learn from data and make predictions or discover patterns. According
to different learning styles, ML can be divided into supervised learning, unsuper-
vised learning, semi-supervised learning, and reinforcement learning. Typical ML
algorithms include support vector machine (SVM) [6], decision tree, expectation-
maximization (EM) algorithm [7], artificial neural network (NN), ensemble learning,
Bayesian model, and so on.
Currently, one of the most attractive branches of ML is deep learning proposed
by Geoffrey Hinton in 2006 [8]. Deep learning is a class of ML algorithms that can
use a cascade of multiple layers of nonlinear processing units for feature extraction
and transformation. Its origin can be traced back to the McCulloch–Pitts (MP) model
of neuron in the 1940s [9]. Nowadays, with the rapid development in data volume
and also computer hardware and software facilities such as central processing unit,
graphic processing unit, and TensorFlow library, deep learning demonstrates powerful
abilities such as high recognition and prediction accuracy in various applications.
In short, ML is an important branch of AI, and deep learning is one key family
among various ML algorithms. Figure 4.2 depicts a simplified relationship between
AI, ML, and deep learning.
1 School of Computer and Information Technology, Beijing Jiaotong University, China
2 Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, China
Figure 4.1 Research areas of AI (natural language processing, pattern recognition, computer vision, machine learning, robotics) and the classification of ML by learning style (supervised, unsupervised, semi-supervised, and reinforcement learning) and by typical algorithms (SVM, ANN and deep learning, decision tree, EM, ensemble learning)
Figure 4.2 The relationship between AI, ML, and deep learning
where ai (t) is the attenuation and τi (t) is the delay from the transmitter to the receiver on
the ith path. An example of a wireless channel with three paths is shown in Figure 4.3.
The general expression (4.1) describes what is known as a doubly selective channel, since
there are several paths and the attenuations and delays are functions of time. Two
special cases of h(t, τ) — the flat-fading channel and the frequency-selective channel — are widely used.
Since the symbol period Ts decreases when the data rate increases, the channel
can be flat fading or frequency selective depending on the data rate. Moreover, the
delay spread is another relevant parameter. Delay spread Td is defined as the difference
in propagation delay between the longest and shortest path:
T_d = \max_{i,j} |\tau_i(t) - \tau_j(t)|.   (4.2)
When Ts is much larger than Td , the channel is flat fading. Otherwise, the channel is
frequency selective. For example, the typical delay spread in a wireless channel in
an urban area is 5 μs when the distance between transmitter and receiver is 1 km [1].
When the data rate is 1 kbps, the symbol period is 1 ms, and the channel is flat-fading
since the delay is negligible compared to the symbol period. If the data rate increases
to 1 Mbps, the symbol period Ts is 1 μs. Then the channel becomes frequency selective
due to the non-negligible delays.
Furthermore, the mobility of transmitter or receiver will induce a shift in radio
frequency, which is referred to as the Doppler shift Ds . Coherence time Tc , a parameter
related to the Doppler shift, is defined as
T_c = \frac{1}{4 D_s}.   (4.3)
If the coherence time Tc is comparable to the symbol period, the channel is time-
varying. On the other hand, in time-invariant channels, the coherence time Tc is
much larger than the symbol period (i.e., the channel remains constant). For exam-
ple, if Doppler shift Ds = 50 Hz and the transmission data rate is 1 Mbps, then the
coherence time Tc = 2.5 ms is much larger than one symbol duration 1 μs. In this
case, the channel is time invariant.
The types of wireless channels are depicted in Table 4.1.
where w(t) is an additive white Gaussian complex noise signal. The receiver is
required to recover data signal s(t) from received signal y(t); this process is called
data detection.
For data detection, the receiver requires the knowledge of h(t, τ ), which is referred
to as channel state information (CSI). To help the receiver estimate CSI, special
Table 4.1 Types of wireless channels

Channel type          Condition
Time varying          T_c comparable to T_s
Time invariant        T_c ≫ T_s
Flat fading           T_d ≪ T_s
Frequency selective   T_d comparable to or larger than T_s
where h(n, l) is the sampling version of h(t, τ ), i.e., h(n, l) = h(nTs , lTs ), and s(n − l)
is the sampling version of s(t), i.e., s(n − l) = s((n − l)Ts ), and L + 1 is the number of
multipaths and w(n) is complex white Gaussian noise with mean zero and variance σw2 .
Define y = [y(0), y(1), . . . , y(N − 1)]T , w = [w(0), w(1), . . . , w(N − 1)]T , and h =
[h(0), h(1), . . . , h(L)]T , where N is the block length. We can further write (4.6) in the
following vector form:
y = Sh + w, (4.7)
where S is an N × (L + 1) circulant matrix whose first column is s = [s(0), s(1), . . . ,
s(N − 1)]^T. Note that the sequence s is the training sequence and depends on the
choice of pilots and their values.
Two linear estimators are often utilized to obtain the estimate of h from the
received signal y. The first one is least squares (LS). It treats h as a deterministic constant
and minimizes the squared error. The LS estimate is [17]:
\hat{h} = (S^H S)^{-1} S^H y.   (4.8)
The LS estimator can be derived as follows. The squared error between the real value
and the estimated value is
J(h) = (y - Sh)^H (y - Sh) = y^H y - 2 y^H S h + h^H S^H S h   (4.9)
To minimize the error, the gradient of J (h) with respect to h is derived as
\frac{\partial J(h)}{\partial h} = -2 S^H y + 2 S^H S h   (4.10)
Setting the gradient to zero, we can then obtain the LS estimate (4.8). For simplicity,
denoting (S^H S)^{-1} S^H as S^†, the LS estimate can be rewritten as
\hat{h} = S^{\dagger} y,   (4.11)
where (·)^† represents the pseudo-inverse. It can be readily checked that the minimum
squared error of the LS estimator is
J_{\min} = J(\hat{h}) = y^H \big(I - S(S^H S)^{-1} S^H\big) y   (4.12)
The second one is the linear minimum mean square error (LMMSE) estimator.
It treats h as a random vector and minimizes the mean square error.
Define R_yy = E(yy^H), R_hh = E(hh^H), and R_hy = E(hy^H), where E(x) is the
expected value of a random variable x. The LMMSE estimator can be expressed as
\hat{h} = R_{hh} S^H (S R_{hh} S^H + \sigma_w^2 I)^{-1} y   (4.13)
The LMMSE estimator can be derived as follows. As a linear estimator, the estimate
ĥ is given as a linear combination of the received signal y:
ĥ = A y.   (4.14)
The LMMSE estimator aims to minimize the mean square error through the choice of the
linear combiner A, i.e.:
A = arg min_A E(‖h − ĥ‖²) = arg min_A E(‖h − A y‖²).   (4.15)
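To make the two estimators concrete, the following minimal NumPy sketch builds a circulant training matrix S from a hypothetical BPSK pilot sequence and computes the LS estimate (4.8) and the LMMSE estimate (4.13); the channel covariance R_hh and the noise variance are assumed known, and all variable names are illustrative placeholders rather than the chapter's notation made executable.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 64, 3                          # block length, channel memory (L + 1 taps)

# Hypothetical BPSK training sequence and the N x (L+1) circulant training matrix S
s = rng.choice([-1.0, 1.0], size=N)
S = np.column_stack([np.roll(s, l) for l in range(L + 1)])

# True channel and received block y = S h + w
h = (rng.standard_normal(L + 1) + 1j * rng.standard_normal(L + 1)) / np.sqrt(2)
sigma_w2 = 0.01
w = np.sqrt(sigma_w2 / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
y = S @ h + w

# LS estimate (4.8): h_ls = (S^H S)^{-1} S^H y
h_ls = np.linalg.solve(S.conj().T @ S, S.conj().T @ y)

# LMMSE estimate (4.13): h_lmmse = R_hh S^H (S R_hh S^H + sigma_w^2 I)^{-1} y
R_hh = np.eye(L + 1)                  # assumed prior channel covariance
h_lmmse = R_hh @ S.conj().T @ np.linalg.solve(S @ R_hh @ S.conj().T + sigma_w2 * np.eye(N), y)

print(np.linalg.norm(h - h_ls), np.linalg.norm(h - h_lmmse))
```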
The first wave, as illustrated in Figure 4.4, took place from the 1940s to the 1960s [18]. The
MP neuron model, created in 1943, laid the foundation for the research of NN. Then in
1958, Frank Rosenblatt created the first machine referred to as the perceptron [19], which
exhibited the ability of simple image recognition. The perceptron attracted huge interest
and large investments during its first decade. However, in 1969, Marvin Minsky
discovered that perceptrons were incapable of realizing the exclusive OR function.
He also pointed out that computers, due to the limited computing ability at that
time, could not effectively complete the large amount of computation required by
large-scale NNs [20], such as adjusting the weights. These two key factors led to the
first recession in the development of NN.
The second wave started in the 1980s and ended in the 1990s. In 1986, David
Rumelhart, Geoffrey Hinton, and Ronald J. Williams successfully utilized the back-
propagation (BP) algorithm [21] and effectively solved the nonlinear problems of
multilayer NNs. From then on, BP algorithms gained much popularity,
which resulted in the second upsurge of NN. Unfortunately, in the early 1990s, it was
pointed out that there existed three unsolved challenges for BP algorithms. The first is
that the optimization method obtains a local optimum, instead of the global one, when
training a multilayer NN. The second is the vanishing gradient problem, whereby the
weights of neurons close to the inputs change very little. The third is the over-fitting
problem caused by the gap between training performance and prediction performance.
In addition, the data sets for training NNs and the computing capability at the time
could not fully support the requirements of multilayer NNs. Besides, SVM [6]
attracted much attention and became a hot research topic. These factors led to a
second winter in the NN development.
The third wave emerged in 2006 when Geoffrey Hinton proposed deep belief
networks [8] to solve the problem of gradient disappearance through pretraining and
supervised fine tuning. The term deep learning became popular ever since then. Later
the success of ImageNet in 2012 provided abundant pictures for training sets and
set a good example for deep-learning research. So far, the third wave is still gaining
momentum.
[Figures: block diagram of the OFDM system — modulation, IDFT, and CP insertion at the transmitter; the channel h(n) with additive noise w(n); and CP removal, DFT, frequency-domain equalization (FDE), and demodulation at the receiver. Figure 4.7: the frame structure and data-generation chain — each frame consists of a CP plus a 128-bit QPSK pilot block followed by a CP plus a 128-bit data block.]
The data-generation process is depicted in Figure 4.7. Suppose the OFDM system
has 64 subcarriers and the length of CP is 16. In each frame, the first block contains
fixed pilot symbols, and the second data block consists of 128 random binary bits.
After QPSK modulation, IDFT, and CP insertion, the whole frame data is con-
volved with the channel vector. The channel vector is randomly selected from the
generated channel parameter sets based on the WINNER model [22]. The maximum
multipath delay is set as 16.
At the receiver side, the received signals including noise and interference in one
frame will be collected as the input of DNN after removing CP. The DNN model aims
to learn the wireless channel parameters and recover the source signals.
As illustrated in Figure 4.8, the architecture of the DNN model has five layers:
input layer, three hidden layers, and output layer. The real and imaginary parts of the
signal are treated separately. Therefore, the number of neurons in the first layer is 256.
The numbers of neurons in the three hidden layers are 500, 250, and 120, respectively.
The activation function of the hidden layers is the rectified linear unit (ReLU) function,
and that of the last layer is the sigmoid function. Each model detects 16 bits of the
transmitted data, so the dimension of the output layer is 16.
For example, the model in Figure 4.8 aims to predict the 16th–31st data bits in the
second block. Since the data block contains 128 binary bits, eight DNN models are
needed to recover the whole transmitted data part, as indicated in Figure 4.9.
[Figure 4.8: Architecture of the DNN model — an input layer of 256 neurons fed with the received pilot and data signal, three hidden layers with weights and biases (w1, b1), . . . , (w4, b4), and a 16-neuron output layer predicting one group of data bits (e.g., bits 16–31). Figure 4.9: Eight such DNN models recover the 128 data bits in groups of 16 (bits 0–15, 16–31, . . . , 112–127).]
The objective function for optimization is the L2 loss, and the optimal parameters
are obtained with the root mean square prop (RMSProp)¹ optimizer, where the
Python² environment and the TensorFlow³ architecture are utilized. Table 4.2 lists some
key parameters for training the DNN.
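As a concrete illustration of this setup, a minimal Keras sketch of one such detection model is given below, using the layer sizes stated above and, per Table 4.2, the RMSProp optimizer with learning rate 0.001 and an L2 (mean-squared-error) loss; the training-data arrays are placeholders, not the chapter's data set.

```python
import tensorflow as tf

def build_detector():
    # One of the eight per-16-bit detection models described above
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(256,)),             # real + imaginary parts of one received frame
        tf.keras.layers.Dense(500, activation="relu"),
        tf.keras.layers.Dense(250, activation="relu"),
        tf.keras.layers.Dense(120, activation="relu"),
        tf.keras.layers.Dense(16, activation="sigmoid")  # 16 recovered bits
    ])
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),
                  loss="mse")                            # L2 loss on the bit labels
    return model

# model = build_detector()
# model.fit(x_train, y_train, epochs=60, batch_size=2000)   # shapes: (num_frames, 256), (num_frames, 16)
```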
Figure 4.10 illustrates the bit error rate performance of the DNN method and
traditional estimators: LS and LMMSE. It can be seen that the LS method performs
worst and that the DNN method has the same performance as the LMMSE at low SNR.
When the SNR is over 15 dB, the LMMSE slightly outperforms the DNN method.
1. RMSProp is a stochastic gradient descent method with adaptive learning rates.
2. Python is a high-level programming language.
3. TensorFlow is an open-source software library for dataflow programming.
Table 4.2 Key parameters for training the DNN
Parameter: Value
Epochs: 60
Batch size: 2,000
Number of batches: 400
Learning rate: 0.001
Test set size: 10,000
[Figure 4.10: BER versus SNR (dB) for the deep-learning, LMMSE, and LS methods.]
[Figure 4.12: CSI feedback framework — the UE performs downlink channel estimation and encoding and feeds back the codeword to the BS, which decodes, completes, and applies the IDFT to recover the channel matrix.]
Suppose the BS has N_t transmit antennas and the UE has one antenna, and the OFDM system
has Ñ_c carriers. Denote the estimated downlink channel matrix as H̃ ∈ C^{Ñ_c × N_t}. Once
the UE estimates the channel H̃, it applies the following DFT and obtains:
H̄ = F_d H̃ F_a^H   (4.21)
where F_d and F_a are Ñ_c × Ñ_c and N_t × N_t DFT matrices, respectively. Next the UE
retains the first N_c rows of H̄ since the CSI is mainly contained in these rows. Let H
represent the truncated matrix, i.e., H = H̄(1 : N_c, :). Clearly, the matrix H contains
N = N_c × N_t elements, which indicates that the number of feedback parameters
is cut down to N.
Based on the deep-learning method, the CsiNet designs an encoder to convert
H to a vector s that only has M elements. Next the UE sends the codeword s to the
BS. The BS aims to reconstruct the original channel matrix H from the codeword s.
The compression ratio is γ = M/N. The decoder in CsiNet then recovers Ĥ from s.
After completing Ĥ to H̄, the IDFT is used to obtain the final channel matrix. In summary,
the CSI feedback approach is shown in Figure 4.12.
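A minimal NumPy sketch of the pre-processing in (4.21) and the truncation follows, under the assumption that the DFT matrices are unitary and that the first N_c delay-domain rows carry most of the CSI; the sizes and the random channel below are placeholders.

```python
import numpy as np

Nt, Nc_full, Nc = 8, 64, 8            # transmit antennas, subcarriers, retained delay rows
rng = np.random.default_rng(1)
H_tilde = (rng.standard_normal((Nc_full, Nt)) + 1j * rng.standard_normal((Nc_full, Nt))) / np.sqrt(2)

# Unitary DFT matrices F_d (Nc_full x Nc_full) and F_a (Nt x Nt)
F_d = np.fft.fft(np.eye(Nc_full)) / np.sqrt(Nc_full)
F_a = np.fft.fft(np.eye(Nt)) / np.sqrt(Nt)

H_bar = F_d @ H_tilde @ F_a.conj().T  # (4.21): angular-delay-domain channel
H = H_bar[:Nc, :]                     # keep the first Nc delay rows: the 8 x 8 truncated matrix
print(H.shape)                        # (8, 8); N = Nc * Nt elements are fed to the encoder
```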
The CsiNet is an autoencoder model based on convolution NN. Figure 4.13 shows
the architecture of the CsiNet which mainly consists of an encoder (Figure 4.14) and
a decoder (Figure 4.15).
The detailed structure of the encoder is shown in Figure 4.14. It contains two
layers. The first layer is a convolutional layer and the second is a reshape layer. The
real and imaginary parts of the truncated matrix H with dimensions 8 × 8 are the input
of the convolutional layer. This convolutional layer uses a 3 × 3 kernel to generate two
feature maps, i.e., two matrices with dimensions 8 × 8, which are the output of the layer.
Then the output feature maps are reshaped into a 128 × 1 vector. The vector enters
into a fully connected layer, and the output of the connected layer is the compressed
codeword s with eight elements.
[Figure 4.13: Overall CsiNet architecture mapping H to its recovery Ĥ. Figure 4.14: Encoder — the 8 × 8 × 2 input passes through a convolutional layer, is reshaped into a 128 × 1 vector, and a fully connected layer outputs the 8 × 1 codeword s. Figure 4.15: Decoder — a fully connected layer expands s to a 128 × 1 vector, which is reshaped to 8 × 8 × 2 and refined by two RefineNet units (convolutional layers with 8, 16, and 2 feature maps) before a final convolutional layer produces Ĥ.]
The goal of the decoder is to recover the codeword to the matrix H. The detailed
structure of the decoder is shown in Figure 4.15. The decoder comprises three
main parts: a fully connected layer (also referred to as a dense layer) and two RefineNets.
The fully connected layer first transforms the codeword s into a 128 × 1 vector, and
then the vector is reshaped into two 8 × 8 matrices, which are considered as the initial
estimate of H. Next, two RefineNets are designed to refine the estimates.
Each RefineNet has four layers: one input layer and three convolutional layers.
The input layer has two feature maps, and the three convolutional layers have eight,
sixteen, and two feature maps, respectively. All the feature maps are of the same size
as the input channel matrix size 8 × 8.
It is worth noting that in each RefineNet, there is a direct data flow from the input
layer to the end of the last convolutional layer so as to avoid gradient vanishing.
Each layer of the CsiNet executes normalization and employs a ReLU function
to activate the neurons. After two RefineNet units, the refined channel estimates will
be delivered to the final convolutional layer and the sigmoid function is also exploited
to activate the neurons.
The end of the second RefineNet in the decoder will output two matrices with
size 8 × 8, i.e., the real and imaginary parts of Ĥ, which is the recovery of H at
the BS.
MSE is chosen as the loss function for optimization, and the optimal parameters
are obtained through the ADAM algorithm. Simulation experiments are carried out in
the Python environment with the TensorFlow and Keras⁴ architecture. The key parameters
needed to train the network are listed in Table 4.3.
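The following Keras sketch mirrors the CsiNet structure described above — a 3 × 3 convolutional encoder with a dense compression layer, and a decoder with a dense expansion layer, two RefineNet blocks with identity shortcuts, and a final sigmoid-activated convolution — for an 8 × 8, two-channel input and M = 8. It is a simplified reconstruction under these assumptions (for instance, batch normalization stands in for the per-layer normalization mentioned above), not the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

Nc, Nt, M = 8, 8, 8                               # truncated channel size and codeword length

def refine_net(x):
    # RefineNet block: 8/16/2 feature maps with a shortcut from input to output
    shortcut = x
    y = layers.Conv2D(8, 3, padding="same", activation="relu")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Conv2D(16, 3, padding="same", activation="relu")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Conv2D(2, 3, padding="same", activation="relu")(y)
    return layers.Add()([shortcut, y])

inp = layers.Input(shape=(Nc, Nt, 2))             # real and imaginary parts of H
# Encoder: convolution -> reshape -> dense compression to the codeword s
x = layers.Conv2D(2, 3, padding="same", activation="relu")(inp)
x = layers.Reshape((Nc * Nt * 2,))(x)
s = layers.Dense(M)(x)                            # compressed codeword of length 8
# Decoder: dense expansion -> reshape -> two RefineNets -> final convolution
y = layers.Dense(Nc * Nt * 2)(s)
y = layers.Reshape((Nc, Nt, 2))(y)
y = refine_net(y)
y = refine_net(y)
out = layers.Conv2D(2, 3, padding="same", activation="sigmoid")(y)

csinet = tf.keras.Model(inp, out)
csinet.compile(optimizer="adam", loss="mse")      # MSE loss and ADAM optimizer, as stated above
```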
Here, we provide one example of H. The UE obtains a channel estimate and transforms
it into H. The real and imaginary parts of H are shown in Tables 4.4 and 4.5,
respectively.
The UE inputs H to the CsiNet. The encoder of the CsiNet then generates an 8 × 1
codeword:
s = [−0.17767, −0.035453, −0.094305, −0.072261,
−0.34441, −0.34731, 0.14061, 0.089002] (4.22)
4. Keras is an open-source neural network library which contains numerous implementations of commonly used neural network building blocks.
The decoder can utilize this codeword s to reconstruct the channel matrix Ĥ. Define the
distance between H and Ĥ as d = ‖H − Ĥ‖₂². In this case, we obtain d = 3.98 × 10⁻⁴.
The compression ratio is γ = M/N = 8/(8 × 8 × 2) = 1/16.
where h is the flat-fading channel to be estimated, and x(i) is the unknown modulated
BPSK signal, i.e., x(i) ∈ {+1, −1}. In this statistical model (4.23) that aims to esti-
mate h with unknown BPSK signals x(i), the received signals y(i) are the observable
variables and the transmitted signals x(i) can be considered as latent variables.
Denote y, z, and θ as the observed data, the latent variable, and the parameter to be
estimated, respectively. For the model (4.23), we can have y = [y(1), y(2), . . . , y(N )]T ,
z = [x(1), x(2), . . . , x(N )]T , and θ = h.
If the variable z can be available, the parameter θ can be estimated by maximum
likelihood approach or Bayesian estimation. Maximum likelihood estimator solves
the log-likelihood function:
L(θ) = ln P(y; θ)   (4.24)
where z is treated as known in the probability density function P(y; θ). Clearly, there is
only one unknown parameter θ in the log-likelihood function L(θ), and therefore
the estimate of θ can be obtained through maximizing⁵ L(θ):
θ̂ = arg max_θ L(θ).   (4.25)
However, if the variable z is unknown, we cannot find the estimate θ̂ from (4.25)
since the expression of L(θ) contains the unknown variable z. To address this problem, the
EM algorithm was proposed in [7] in 1977. The EM algorithm estimates the parameter θ
5. A one-dimensional search or setting the derivative to zero can obtain the optimal value of θ.
where the Bayesian equation P(y, z) = P(z)P(y|z) is utilized in the last step in (4.28).
Equation (4.28) is often intractable since it contains not only the logarithm of a sum of
multiple terms but also the unknown variable z in the function P(y|z; θ).
To address this problem of the unknown variable z, the EM algorithm rewrites the
likelihood function L(θ) as
L(θ) = ln Σ_z P(z|y; θ^(j)) · [P(z; θ) P(y|z; θ)] / P(z|y; θ^(j))   (4.29)
where LB(θ, θ ( j) ) is defined as the lower bound of the likelihood function L(θ ).
We can further simplify LB(θ, θ^(j)) as
LB(θ, θ^(j)) = Σ_z P(z|y; θ^(j)) ln [ P(y, z; θ) / P(z|y; θ^(j)) ].   (4.33)
It is worth noting that there is only one unknown parameter θ in the above
expression (4.33) of LB(θ , θ ( j) ). Therefore, we can find the ( j + 1)th iterative estimate
θ ( j+1) through:
θ^(j+1) = arg max_θ LB(θ, θ^(j))   (4.34)
        = arg max_θ Σ_z P(z|y; θ^(j)) [ln P(y, z; θ) − ln P(z|y; θ^(j))]   (4.35)
At this point, the jth iteration ends. The estimate θ^(j+1) is then used in the next round
of iteration (the E step and the M step).
The termination condition of the iterative process is
‖θ^(j+1) − θ^(j)‖ < ε,   (4.40)
or
|Q(θ^(j+1), θ^(j)) − Q(θ^(j), θ^(j))| < ε,   (4.41)
where ε is a predefined positive constant.
x(i) are unknown to the receiver. We assume that the BPSK signals are equiprobable,
i.e., P(x(i) = +1) = P(x(i) = −1) = 1/2, i = 1, 2, . . . , N .
Suppose x1 = +1 and x2 = −1, and clearly the BPSK signals x(i) can be either
x1 or x2 . The conditional probability density function of received signal y(i) given
x(i) = xk , k = 1, 2 can be expressed as
P(y(i)|x_k; h) = (1/(√(2π) σ_w)) exp( −(y(i) − h x_k)² / (2σ_w²) ),   (4.42)
P(x_k|y(i); h^(j)) = P(x_k, y(i); h^(j)) / P(y(i); h^(j))
= P(y(i)|x_k; h^(j)) P(x_k) / Σ_{m=1}^{2} P(y(i)|x_m; h^(j)) P(x_m)
= [ (1/(2√(2π) σ_w)) exp( −(y(i) − h^(j) x_k)² / (2σ_w²) ) ] / [ Σ_{m=1}^{2} (1/(2√(2π) σ_w)) exp( −(y(i) − h^(j) x_m)² / (2σ_w²) ) ]
= exp( −(y(i) − h^(j) x_k)² / (2σ_w²) ) / Σ_{m=1}^{2} exp( −(y(i) − h^(j) x_m)² / (2σ_w²) )   (4.44)
Q(h, h^(j)) = Σ_{i=1}^{N} Σ_{k=1}^{2} P(x_k|y(i); h^(j)) ln P(y(i), x_k; h)   (4.45)
h^(j+1) = [ Σ_{i=1}^{N} Σ_{k=1}^{2} P(x_k|y(i); h^(j)) x_k y(i) ] / [ Σ_{i=1}^{N} Σ_{k=1}^{2} P(x_k|y(i); h^(j)) x_k² ]   (4.47)
In conclusion, the EM algorithm presets an initial value for h and then calculates h^(j+1)
iteratively according to (4.47) until the convergence condition (4.40) or (4.41) is
satisfied.
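A compact NumPy sketch of the resulting iteration for a real flat-fading channel with equiprobable BPSK symbols is given below, assuming the noise variance σ_w² is known; the E-step evaluates the posteriors (4.44) and the M-step applies the update obtained by setting ∂Q/∂h = 0. The example parameters are placeholders.

```python
import numpy as np

def em_bpsk_channel(y, sigma_w2, h0=0.1, max_iter=100, eps=1e-6):
    """Blind EM estimate of a real flat channel from y(i) = h*x(i) + w(i), x(i) in {+1, -1}."""
    h = h0
    for _ in range(max_iter):
        # E-step: posterior P(x(i) = +1 | y(i); h), cf. (4.44)
        log_p = -(y - h) ** 2 / (2 * sigma_w2)
        log_m = -(y + h) ** 2 / (2 * sigma_w2)
        rho_p = 1.0 / (1.0 + np.exp(log_m - log_p))
        # M-step: update from setting dQ/dh = 0 (cf. (4.47)); with x_k^2 = 1 the denominator is N
        h_new = np.sum((2 * rho_p - 1) * y) / len(y)
        if abs(h_new - h) < eps:
            return h_new
        h = h_new
    return h

# Toy example: N = 6 observations, noise variance 0.1
rng = np.random.default_rng(2)
h_true, sigma_w2, N = 0.8, 0.1, 6
x = rng.choice([-1.0, 1.0], size=N)
y = h_true * x + np.sqrt(sigma_w2) * rng.standard_normal(N)
print(em_bpsk_channel(y, sigma_w2))
```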
In the following part, we provide simulation results to corroborate the proposed
EM-based channel estimator. Both real and complex Gaussian channels are simu-
lated. For comparison, Cramér–Rao Lower Bound (CRLB) is also derived. CRLB
determines a lower bound for the variance of any unbiased estimator. First, since N
observations are used in the estimation, the probability density function of y is
N
1 (y(i) − hxk )2
P(y; h) = √ exp , (4.48)
i 2πσw −2σw2
and its log-likelihood function is
ln P(y; h) = −(N/2) ln(2πσ_w²) − (1/(2σ_w²)) Σ_{i=1}^{N} (y(i) − h x(i))².   (4.49)
The first derivative of ln P(y; h) with respect to h can be derived as
∂ln P(y; h)/∂h = (1/σ_w²) Σ_{i=1}^{N} (y(i) − h x(i)) x(i).   (4.50)
Thus, the CRLB can be expressed as [17]
var(ĥ) ≥ 1 / ( −E[∂² ln P(y; h)/∂h²] ) = σ_w² / E[Σ_i x²(i)] = σ_w² / (N P_x) = CRLB   (4.51)
where Px is the average transmission power of the signals x(n).
Figures 4.16 and 4.17 depict the MSEs of the EM estimator versus SNR for
the following two cases separately: when the channel h is generated from N (0, 1),
i.e., real channel; and when it is generated from CN (0, 1), i.e., Rayleigh channel.
The observation length is set as N = 6. For comparison with the EM estimator, the
MSE curves of LS method are also plotted when the length of pilot is N /2 and N ,
respectively. The CRLBs are also illustrated as benchmarks.
[Figures 4.16 and 4.17: MSE versus SNR (dB) of the EM estimator, the LS estimator with N/2 and N pilots, and the CRLB, for the real and Rayleigh channels, respectively. Figure 4.18: MSE versus the signal length N for the EM algorithm and the LS method with N pilots at SNR = 3 and 20 dB.]
It can be seen from
Figures 4.16 and 4.17 that the EM-based blind channel estimator performs well and
approaches the CRLB at high SNR. It can also be found that the EM-based blind channel
estimator with no pilots exhibits almost the same performance as the LS estimator
with N pilots and clearly outperforms the LS estimator with N/2 pilots.
Figure 4.18 demonstrates the MSEs of the LS method and the EM algorithm versus the
length of the signal N when SNR = 3 and 20 dB, respectively. As expected, the MSEs
of the two estimators decrease as the length N increases. It can
also be found that the EM algorithm without pilots has almost the same performance
as the LS method with N pilots when the SNR is 20 dB.
References
[1] Tse D., and Viswanath P. Fundamentals of Wireless Communication. New York:
Cambridge University Press; 2005. p. 1.
[2] Cavers J.K. An analysis of pilot symbol assisted modulation for Rayleigh fading
channels. IEEE Transactions on Vehicular Technology. 1991; 40(4):686–693.
[3] Wang G., Gao F., and Tellambura C. Joint frequency offset and channel esti-
mation methods for two-way relay networks. GLOBECOM 2009–2009 IEEE
Global Telecommunications Conference. Honolulu, HI; 2009. pp. 1–5.
[4] Wang G., Gao F., Chen W., et al. Channel estimation and training design
for two-way relay networks in time-selective fading environments. IEEE
Transactions on Wireless Communications. 2011; 10(8):2681–2691.
[5] Russell S.J., and Norvig P. Artificial Intelligence: A Modern Approach (3rd
ed.). Upper Saddle River, NJ: Prentice Hall; 2010.
[6] Cortes C., Vapnik V. Support-vector networks. Machine Learning. 1995;
20(3):273–297.
[7] Dempster A.P. Maximum likelihood from incomplete data via the EM
algorithm. Journal of Royal Statistical Society B. 1977; 39(1):1–38.
[8] Hinton G.E., Osindero S., and Teh Y.-W. A fast learning algorithm for deep
belief nets. Neural Computation. 2006; 18(7):1527–1554.
[9] McCulloch W.S., and Pitts W. A logical calculus of the ideas immanent in
nervous activity. The Bulletin of Mathematical Biophysics. 1943; 5(4):115–
133.
[10] Wen C.K., Jin S., Wong K.K., et al. Channel estimation for massive MIMO
using Gaussian-mixture Bayesian learning. IEEE Transactions on Wireless
Communications. 2015; 14(3):1356–1368.
[11] Wang X., Wang G., Fan R., et al. Channel estimation with expectation maxi-
mization and historical information based basis expansion model for wireless
communication systems on high speed railways. IEEE Access. 2018; 6:72–80.
[12] Ye H., Li G.Y., and Juang B.H.F. Power of deep learning for channel estima-
tion and signal detection in OFDM Systems. IEEE Wireless Communications
Letters. 2017; 7(1):114–117.
[13] Wen C., Shih W.T., and Jin S. Deep learning for massive MIMO CSI feedback.
IEEE Wireless Communications Letters. 2018; 7(5):748–751.
[14] Samuel N., Diskin T., and Wiesel A. Deep MIMO detection. 2017 IEEE
18th International Workshop on Signal Processing Advances in Wireless
Communications (SPAWC). Sapporo, Japan; 2017. pp. 1–5.
[15] Dorner S., Cammerer S., Hoydis J., et al. Deep learning based communication
over the air. IEEE Journal of Selected Topics in Signal Processing. 2018;
12(1):132–143.
[16] Jakes W.C. Microwave Mobile Communications. New York: Wiley; 1974.
[17] Steven M.K. Fundamentals of Statistical Signal Processing: Estimation
Theory. Upper Saddle River, NJ: PTR Prentice Hall; 1993.
[18] Goodfellow I., Bengio Y., Courville A. Deep Learning. Cambridge: MIT Press;
2016.
As an intelligent radio, cognitive radio (CR) allows CR users to access and share
the licensed spectrum. Since a CR network is a typical noncooperative system, applications of
signal identification in CRs have emerged. This chapter introduces several signal
identification techniques, which are implemented based on machine-learning
theory.
The background of signal identification techniques in CRs and the motivation
for using machine learning to solve signal identification problems are introduced
in Section 5.1. A typical signal-identification system contains two parts, namely, the
modulation classifier and the specific emitter identifier, which are respectively discussed
in Sections 5.2 and 5.3. Conclusions are drawn in Section 5.3.5.
1. State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, China
user and other CR users and barely knows the identities (whether they are legitimate or
malicious) of other users in the network. Hence, signal identification plays a key role in
CRs in order to successfully process the received signals and to guarantee the safety
and fairness of the networks. In this chapter, two signal identification techniques
are introduced, modulation classification and specific emitter identification (SEI).
Figure 5.1 illustrates the diagram of a typical signal-identification system. The signal
identification system is capable of solving two typical problems that the CR networks
are confronted with: one is that the CR users have little information of the parameters
used by the licensed users and/or other CR users; the other is that with the ability
that allows any unlicensed users to access the spectrum, a certain user lacks the
identity information of other users in the network. Modulation classification can be
adopted to solve the unknown parameter problem and has vital applications in CRs.
For the case when the licensed and cognitive users share the same frequency band
for transmission and reception, the received signal at the cognitive receiver is the
superposition of signals from the licensed transmitter and its own transmitter, which
implies that the signal from the licensed user can be treated as an interference with
higher transmission power. By applying the modulation classification techniques, the
CR receiver can blindly recognize the modulation format adopted by the licensed
signal and is capable of demodulating, reconstructing, and canceling the interference
caused by the licensed user, which is the basis for processing its own signal. Furthermore,
to address the problem that the CR network is exposed to a high possibility of being attacked
or harmed by malicious users, SEI offers a way to determine a user's identity and
guarantees the safety and fairness of CR networks.
The task of signal identification is to blindly learn from the signal and the envi-
ronment to make the classification decision, behind which is the idea of the artificial
intelligence. As an approach to implement the artificial intelligence, machine learn-
ing has been introduced in signal identification for the design of the identification
algorithms. With little knowledge of the transmitted signal and the transmission
[Figure 5.1: Diagram of a typical signal-identification system — received signals from licensed and CR users undergo modulation classification (to resolve unknown parameters) and specific emitter identification (to resolve unknown identities) before further signal processing and spectrum sharing.]
where T and T_0 are the symbol and observation intervals, respectively, with T_0 ≫ T,
g(·) is the real-valued pulse shape, j = √−1, a_k(l) > 0 and φ_k(l) ∈ [0, 2π) are the
unknown amplitude and phase of the lth path at the kth receiver, respectively, wk (t)
is the complex-valued zero-mean white Gaussian noise process with noise power σk2 ,
and xs (n) is the nth transmit constellation symbol drawn from an unknown modulation
format s. The modulation format s belongs to a modulation candidate set {1, . . . , S},
which is known at the receivers. The task of the modulation classification problem
is to determine the correct modulation format to which the transmit signal belongs
based on the received signal.
The maximum likelihood algorithm is adopted as the classifier, which is optimal
when each modulation candidate is equally probable. Let Hs denote the hypothesis
that transmit symbols are drawn from the modulation format s, the likelihood function
under the hypothesis Hs is given by
p_s(y|θ) = Σ_{x_s} p_s(y|x_s, θ) p(x_s|θ)   (5.2)
1. Multiple receivers are considered to obtain the diversity gain, while the proposed algorithm is applicable to the case with only one receiver.
∝ exp{ −Σ_{k=1}^{K} (1/σ_k²) ∫_0^{T_0} |y_k(t) − f_k(x_s(n), t)|² dt }   (5.3)
where
f_k(x_s(n), t) = Σ_{n=1}^{N} Σ_{l=0}^{L−1} a_k(l) e^{jφ_k(l)} x_s(n) g(t − nT − lT).
Define xs,i (n) as the nth transmit symbol that maps to the ith constellation point
under the hypothesis Hs and assume that each constellation symbol has equal prior
probability, i.e., p(xs,i (n)) = (1/M ), with M as the modulation order of modulation
format s. The log-likelihood function Ls (θ ) is then obtained by
L_s(θ) = ln p_s(y|θ) = ln( (1/M) Σ_{i=1}^{M} exp{ −Σ_{k=1}^{K} (1/σ_k²) ∫_0^{T_0} |y_k(t) − f_k(x_{s,i}(n), t)|² dt } ).   (5.4)
θ̂_s = arg max_θ L_s(θ).   (5.6)
M-step: θ_s^(r+1) = arg max_θ J(θ | θ_s^(r))   (5.8)
where z is the complete data, which cannot be directly observed at the receivers.
Instead, the complete data is related to the observations, i.e., the received signals,
by y = K(z), where K(·) is a deterministic and non-invertible transformation. The
non-invertible property of K(·) implies that there exists more than one possible definition
of the complete data that generates the same observations. It should be noted that
these choices have a great impact on the complexity and convergence result of the EM
algorithm; bad choices of the complete data may make the algorithm invalid.
In our problem, the received signal that undergoes multipath channels is equiv-
alent to a superposition of signals from different independent paths; therefore, the
complete data can be defined as
z_kl(t) = Σ_n a_k(l) e^{jφ_k(l)} x_s(n) g(t − nT − lT) + w_kl(t)   (5.9)
where w_kl(t) is the lth noise component, which is obtained by arbitrarily decomposing
the total noise w_k(t) into L independent and identically distributed components,
i.e., Σ_{l=0}^{L−1} w_kl(t) = w_k(t). Assume that w_kl(t) follows a complex-valued zero-mean
Gaussian process with power σ_kl². The noise power σ_kl² is defined as σ_kl² = β_kl σ_k²,
where β_kl is a positive real-valued random noise decomposition factor satisfying
Σ_{l=0}^{L−1} β_kl = 1 [8]. Hence, we can rewrite the transmission model in (5.1) as
= Σ_{k=1}^{K} E_{z_k | y_k, θ_{s,k}^(r)} [ln p(z_k|θ_k)]   (5.10)
and the posterior probability that x_s(n) equals the ith constellation point is
ρ_{s,i}^(r)(n) = exp{ −Σ_{k=1}^{K} |y_k(n) − Σ_{l=0}^{L−1} a_{s,k}^(r)(l) e^{jφ_{s,k}^(r)(l)} x_{s,i}^(r)(n − l)|² / σ_k² } / Σ_{j=1}^{M} exp{ −Σ_{k=1}^{K} |y_k(n) − Σ_{l=0}^{L−1} a_{s,k}^(r)(l) e^{jφ_{s,k}^(r)(l)} x_{s,j}^(r)(n − l)|² / σ_k² }   (5.11)
where the last equality is computed under the assumption that each symbol has equal
prior probability, i.e., p(x_s(n) = X_{s,i} | θ_s^(r)) = 1/M, y_k(n) is the nth received symbol
at discrete time nT, and X_{s,i} is the ith constellation point for modulation format s.
By obtaining the posterior probability ρ_{s,i}^(r)(n), we can compute x_s^(r)(n) as
x_s^(r)(n) = Σ_{i=1}^{M} ρ_{s,i}^(r)(n) X_{s,i}.   (5.12)
We define z̄_kl(t) = Σ_n a_k(l) e^{jφ_k(l)} x_s(n) g(t − nT − lT). By computing (5.12), x_s(n)
turns into a deterministic symbol. Furthermore, since flat fading is assumed for each
path, the channel amplitude a_k(l) and phase φ_k(l) are treated as unknown deterministic
parameters. Thus, z̄_kl(t) is an unknown deterministic signal. Note that
w_kl(t) is a zero-mean white Gaussian noise process; ln p(z_k|θ_k) is then given by [9]:
ln p(z_k|θ_k) = C_1 − Σ_{l=0}^{L−1} (1/σ_kl²) ∫_0^{T_0} |z_kl(t) − z̄_kl(t)|² dt   (5.13)
By taking the derivative of (5.17) with respect to a_{s,k}(l) and setting it to zero, we can
obtain that
a_{s,k}^(r+1)(l) = (1/E^(r)) Σ_{n=1}^{N} ℜ{ x_s^(r)(n)* e^{−jφ_{s,k}^(r)(l)} ∫_0^{T_0} ẑ_{s,kl}^(r)(t) g*(t − nT − lT) dt }   (5.18)
where E^(r) = E_g Σ_{n=1}^{N} |x_s^(r)(n)|², with E_g = ∫_{−∞}^{∞} g²(t) dt as the pulse energy, ℜ{·} rep-
resents the real component of a complex variable, and (·)* denotes the conjugation
of a variable. Apparently, the second derivative of (5.17) with respect to a_{s,k}(l) is
negative definite, which implies that (5.18) is the optimal estimate of a_{s,k}(l).
By substituting (5.18) into (5.17), with the assumption that E^(r) is independent of
φ_{s,k}^(r)(l), the M-step in (5.17) is rewritten as
M-step: for l = 0, . . . , L − 1 compute
φ_{s,k}^(r+1)(l) = tan^{−1}[ ℑ{ (x_{s,l}^(r))^H ẑ_{s,kl}^(r) } / ℜ{ (x_{s,l}^(r))^H ẑ_{s,kl}^(r) } ]   (5.19)
a_{s,k}^(r+1)(l) = (1/E^(r)) Σ_{n=1}^{N} ℜ{ x_s^(r)(n)* e^{−jφ_{s,k}^(r+1)(l)} ∫_0^{T_0} ẑ_{s,kl}^(r)(t) g*(t − nT − lT) dt }   (5.20)
where x_{s,l}^(r) = [0_l^T, x_s^(r)(1), . . . , x_s^(r)(N − l)]^T, with 0_l as an l × 1 vector with all elements
equal to zero, ẑ_{s,kl}^(r) = [ẑ_{s,kl}^(r)(1), . . . , ẑ_{s,kl}^(r)(N)]^T, ℑ{·} represents the imaginary component
of a complex variable, and (·)^H denotes the conjugate transpose of a vector/matrix.
It should be noted from (5.16), (5.19), and (5.20) that by employing the EM
algorithm and properly designing the complete data, the multivariate optimization
problem in (5.26) is successfully decomposed into L separate ones, where only one
unknown parameter is optimized at each step, solving the original high-dimensional
and non-convex problem in a tractable way.
Fourth-order moment-based initialization
The most prominent problem for the EM algorithm is how to set proper initialization
points of the unknowns, from which the EM algorithm takes iterative steps to converge
to some stationary points. Since the EM algorithm has no guarantee of the convergence
to the global maxima, poor initializations enhance its probability to converge to the
local maxima. In general, the most commonly adopted initialization schemes for the
EM algorithm include the simulated annealing (SA) [11] and random restart. However,
since our problem considers multipath channels and multiple users, where a (2 × K ×
L)-dimensional initial value should be selected, it is computationally expensive for
the SA and random restart algorithms to find proper initials.
In this section, we employ a simple though effective method to find the initial
values of the unknown fadings. A modified version of the fourth-order moment-based
estimator proposed in [12] is applied to roughly estimate the multipath channels,
which are then used as the initialization points of the EM algorithm. The estimator is
expressed as
ĥ_k(l) = m_{4k}^y(p, p, p, l) / m_{4k}^y(p, p, p, p)   (5.21)
where m_{4k}^y(τ₁, τ₂, τ₃, τ₄) = E{y_k(n + τ₁) y_k(n + τ₂) y_k(n + τ₃) y_k(n + τ₄)} is the fourth-
order moment of y_k(n), and h_k(p) denotes the coefficient of the dominant path of
the channel between the transmitter and kth receiver. Without loss of generality, the
dominant path is assumed to be the leading path, i.e., p = 0.
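A minimal NumPy sketch of the initializer (5.21) for a single receiver is shown below, estimating the lag-l fourth-order moments from the received samples with p = 0; the toy BPSK signal and the tap values are placeholders, and, consistent with its role as a coarse initializer, the resulting ratios only roughly track the channel taps.

```python
import numpy as np

def moment_init(y, L):
    """Rough multipath initialization from (5.21): h_hat(l) = m4(0,0,0,l) / m4(0,0,0,0)."""
    N = len(y)
    def m4(l):
        # Sample fourth-order moment E{y(n)^3 y(n + l)} with p = 0
        return np.mean(y[: N - l] ** 3 * y[l:N])
    m400 = m4(0)
    return np.array([m4(l) / m400 for l in range(L)])

# Toy example with a hypothetical 3-tap real channel and BPSK symbols
rng = np.random.default_rng(3)
h = np.array([1.0, 0.5, 0.2])
x = rng.choice([-1.0, 1.0], size=5000)
y = np.convolve(x, h)[: len(x)] + 0.05 * rng.standard_normal(len(x))
print(moment_init(y, 3))      # rough estimate of the tap profile, used only as an EM starting point
```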
The overall modulation classification algorithm is summarized as follows:
...
    set θ_s^(∗) = θ_s^(r+1), and continue;
11. ENDFOR
12. The final decision is made by ŝ = arg max_s L_s(θ_s^(∗)).
[Figure 5.2: Probability of correct classification Pc versus SNR (dB), compared with the upper bound, for maximum initialization errors (δφk(l), δak(l)) = (π/20, 0.1), (π/10, 0.3), and (π/5, 0.5): (a) impact of the initial values of the unknowns on the proposed algorithm for QAM and (b) for PSK.]
Three sets of maximum errors are examined, namely, (δφk (l) = π/20, δak (l) = 0.1),
(δφk (l) = π/10, δak (l) = 0.3), and (δφk (l) = π/5, δak (l) = 0.5). Moreover, the classifi-
cation performance is compared to the performance upper bound, which is obtained
by using the Cramér–Rao lower bounds of the estimates of the unknowns as the vari-
ances. It is apparent that the classification performance decreases with the increase of
the maximum errors. To be specific, for both QAM and PSK modulations, the clas-
sification performance is not sensitive to the initial values for the first two sets with
smaller biases, especially for the PSK modulation, where the classification perfor-
mance is more robust against smaller initialization errors, while for the third set with
larger bias, the classification performance degrades, especially in the low signal to
noise ratio (SNR) region. In our problem, we consider a complicated case with multi-
ple receivers in the presence of multipath channels; therefore, the likelihood function
contains a large number of local extrema. Then, the EM algorithm more easily converges to
local extrema when the initial values are far away from the true values. In addition,
we can see from Figure 5.2(a) and (b) that, in the high SNR region, the classification
performance with smaller maximum errors is close to the upper bounds. It indicates
that with proper initialization points, the proposed algorithm can provide promising
performance.
Next, we consider the classification performance of the proposed algorithm using
the fourth-order moment-based initialization scheme, as shown in Figure 5.3(a) and
(b) for QAM and PSK modulations, respectively. Figure 5.3(a) depicts that the clas-
sification performance of the proposed algorithm for QAM modulations using the
fourth-order moment-based initialization method attains Pc ≥ 0.8 for SNR > 10 dB.
When compared to that taking the true values plus bias as the initial values, we can
see that the classification performance of the proposed algorithm is comparable in the
SNR region ranges from 6 to 10 dB. The results show that the fourth-order moment-
based method is feasible in the moderate SNR region. In addition, we also compare the
classification performance of the proposed algorithm with the cumulant-based meth-
ods in [14,15]. The number of samples per receiver for the cumulant-based methods is
set to Nc = 2, 000. Note that the difference of cumulant values between higher order
modulation formats (e.g., 16-QAM and 64-QAM) is small; therefore, the classifica-
tion performance of the cumulant-based approaches is limited, which saturates in the
high SNR region. It is apparent that the classification performance of the proposed
algorithm outperforms that of the cumulant-based ones. Meanwhile, it indicates that
the proposed algorithm is more sample efficient than the cumulant-based ones. Similar
results can be seen from Figure 5.3(b) when classifying PSK modulations. The advan-
tage in the probability of correct classification of the proposed algorithm is obvious
when compared to that of cumulant-based methods.
On the other hand, however, we can see from Figure 5.3(a) and (b) that the
classification performance of the proposed algorithm decreases in the high SNR
region. The possible reason is that the likelihood function in the low SNR region
is dominated by the noise, which contains less local extrema and is not sensitive to
the initialization errors. In contrast, the likelihood function in the high SNR region
is dominated by the signal, which contains more local extrema. In such a case, the
convergence result is more likely to be trapped at the local extrema, even when the
initial values are slightly far away from the true values. In addition, when comparing
Figure 5.3(a) with (b), it is noted that using the moment-based estimator, the PSK
modulations are more sensitive to initialization errors in the high SNR region.
To intuitively demonstrate the impact of the noise decomposition factor βkl on the
classification performance, we evaluate the probability of correct classification versus
the SNR in Figure 5.4, with curves parameterized by different choices of βkl . The lines
illustrate the classification performance with fixed βkl = 1/L, and the markers show
that with random βkl. As we can see, different choices of the noise decomposition
factor βkl do not affect the classification performance of the proposed algorithm.
[Figure 5.3: Probability of correct classification Pc versus SNR with the fourth-order moment-based initialization scheme: (a) the classification performance of the proposed algorithm for QAM and (b) for PSK.]
[Figure 5.4: The classification performance of the proposed algorithm for QAM with curves parameterized by different choices of βkl — (δφk(l), δak(l)) = (π/20, 0.1) and the moment-based initialization, each with fixed and random βkl.]
where gk is the unknown complex channel fading coefficient from the transmitter to
the kth receiver, wk,n is circularly symmetric complex Gaussian with the distribution
CN (0, σk2 ), and xn is the transmit CPM signal. The continuous-time complex CPM
signal is expressed as
x(t) = √(2E/T) e^{jΦ(t; I)}   (5.23)
where E is the energy per symbol, T is the symbol duration, and Φ(t; I) is the time-
varying phase. For t ∈ [nT, (n + 1)T], the time-varying phase is represented as
Φ(t; I) = πh Σ_{l=−∞}^{n−L} I_l + 2πh Σ_{l=n−L+1}^{n} I_l q(t − lT)   (5.24)
where h is the modulation index, I_l is the lth information symbol drawn from the set
{±1, . . . , ±(M − 1)}, with M as the symbol level, q(t) is the integral of the pulse shape
u(t), i.e., q(t) = ∫_0^t u(τ) dτ, t ≤ LT, and L is the pulse length. From (5.24), we can
see that a CPM format is determined by a set of parameters, denoted as {M, h, L, u(t)}.
Basically, by setting different values of these parameters, an infinite number of CPM
formats can be generated. Let S_n = {θ_n, I_{n−1}, . . . , I_{n−L+1}} be the state of the CPM signal
at t = nT, where θ_n = πh Σ_{l=−∞}^{n−L} I_l. The modulation index h is a rational number,
which can be represented by h = h_1/h_2, with h_1 and h_2 as coprime numbers. Then,
we define h_0 as the number of states of θ_n, which is given by
h_0 = h_2 if h_1 is even, and h_0 = 2h_2 if h_1 is odd.
Hence, we can obtain that the number of states of S_n is Q_0 = h_0 M^{L−1}. A trellis of state
transitions for CPM with parameters {M = 2, h = 1/2, L = 1} is shown in Figure 5.5.
[Figure 5.5: Trellis of the state transitions κ1, . . . , κ8 between states Sn and Sn+1 for a binary CPM with information symbols ±1.]
Let S be the set of CPM candidates, which is known at the receivers. The classifi-
cation task is to identify the correct CPM format s ∈ {1, . . . , S} based on the received
signals. Let y = [y_1, . . . , y_K], where y_k is the received signal at the kth receiver,
g = {g_k}_{k=1}^K, and x = {x_n}_{n=1}^N, which are the observations, unknown parameters, and
hidden variables of the HMM, respectively. We formulate this problem as a multiple
composite hypothesis testing problem, and the likelihood-based classifier is adopted
to solve it. For hypothesis Hs , meaning that the transmit signal uses the CPM format s,
Signal identification in cognitive radios using machine learning 173
the log-likelihood function ln ps (y|g) is computed. The classifier makes the final
decision on the modulation format by selecting the candidate with the largest log-likelihood, i.e., ŝ = arg max_s ln p_s(y|g).
Unlike the case for constellation-based modulation formats, where the likelihood
function p_s(y|g) is obtained by averaging over all the unknown constellation symbols
A, i.e., p_s(y|g) = Σ_{x∈A} p_s(y|x, g) p_s(x|g), since the CPM signal is time correlated, its
likelihood function cannot be calculated in such a way.
1. A denotes the state transition probability matrix, whose element α_ij = Pr{S_{n+1} = j | S_n = i} is
expressed as
α_ij = 1/M if the transition i → j is permissible, and α_ij = 0 otherwise.
The initial state distribution is
π_i = Pr{S_1 = i} = 1/Q_0.
M-step: g_s^(r+1) = arg max_g J(g_s^(r), g).   (5.28)
As shown in Figure 5.5, we denote κq as the transition from state Sn to Sn+1 , where
κq , q = 1, . . . , Q is drawn from the information sequence, with Q = Q0 M . Let
(x_n, κ_q) denote the transmit signal at t = nT corresponding to κ_q. To simplify the
notation, denote x_{1:n−1} and x_{n+1:N} as the transmit symbol sequences {x_1, . . . , x_{n−1}}
and {x_{n+1}, . . . , x_N}, respectively; we can then rewrite (5.27) as
J(g_s^(r), g) = Σ_x p_s(y, x | g_s^(r)) (log p_s(x|g) + log p_s(y|x, g))
= Σ_{k=1}^{K} Σ_{n=1}^{N} Σ_x (C_1 + log p_s(y_{k,n} | x_n, g_k)) p_s(y, x | g_s^(r))
= Σ_{k=1}^{K} Σ_{n=1}^{N} Σ_{q=1}^{Q} (C_1 + log p_s(y_{k,n} | (x_n, κ_q), g_k)) Σ_{x\x_n} p_s(y, x_{1:n−1}, (x_n, κ_q), x_{n+1:N} | g_s^(r))
= Σ_{k=1}^{K} Σ_{n=1}^{N} Σ_{q=1}^{Q} ( C_2 − p_s(y, (x_n, κ_q) | g_s^(r)) (1/σ_k²) |y_{k,n} − g_k x(S_n = z(q, Q_0))|² )   (5.29)
where z(q, Q_0) denotes the remainder of q/Q_0.
Forward–backward algorithm: Define η_s(n, q) = p_s(y, (x_n, κ_q) | g_s^(r)). Noting that
(x_n, κ_q) is equivalent to the event {S_n = i, S_{n+1} = j}, we can derive η_s(n, q) as
η_s(n, q) = ∏_{k=1}^{K} p_s(y_k, S_n = i, S_{n+1} = j | g_k)
= ∏_{k=1}^{K} p_s(y_k | S_n = i, S_{n+1} = j, g_k) p_s(S_n = i | g_k)
where υ_{k,n}(i) = p_s(y_{k,1:n} | S_n = i, g_k) p_s(S_n = i | g_k) and ω_{k,n+1}(j) = p_s(y_{k,n+1:N} | S_{n+1} =
j, g_k) are the forward and backward variables, respectively, which can be inductively
obtained by performing the forward–backward procedure as follows:
● Compute the forward variable υ_{k,n}(i):
  Initialize: υ_{k,1}(i) = π_i
  Induction: υ_{k,n}(i) = Σ_{j=1}^{Q} υ_{k,n−1}(j) α_{ji} β_i(y_{k,n})
● Compute the backward variable ω_{k,n+1}(j):
  Initialize: ω_{k,N+1}(j) = 1
  Induction: ω_{k,n}(j) = Σ_{i=1}^{Q} ω_{k,n+1}(i) α_{ij} β_i(y_{k,n}).
Finally, by taking the derivative of (5.29) with respect to g and setting it to zero,
the unknown channel fading can be estimated by
g_{s,k}^(r+1) = [ Σ_{n=1}^{N} Σ_{q=1}^{Q} η_s(n, q) y_{k,n} x*(S_n = z(q, Q_0)) ] / [ Σ_{n=1}^{N} Σ_{q=1}^{Q} η_s(n, q) |x(S_n = z(q, Q_0))|² ].   (5.31)
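To make the E-step concrete, the following NumPy sketch implements generic forward–backward recursions for a single receiver, with placeholder observation likelihoods β_i(y_n), transition matrix α, and initial distribution π; it returns the per-transition posteriors from which η_s(n, q) is formed. It is a schematic, numerically unscaled illustration rather than a full CPM classifier.

```python
import numpy as np

def forward_backward(B, alpha, pi):
    """Generic HMM forward-backward pass for one receiver.

    B[n, i] : observation likelihood beta_i(y_n) of state i at time n (N x Q0)
    alpha   : transition matrix, alpha[i, j] = Pr{S_{n+1} = j | S_n = i} (Q0 x Q0)
    pi      : initial state distribution (Q0,)
    Returns the forward variables, backward variables, and per-transition posteriors."""
    N, Q0 = B.shape
    upsilon = np.zeros((N, Q0))                  # forward variables
    upsilon[0] = pi * B[0]
    for n in range(1, N):
        upsilon[n] = (upsilon[n - 1] @ alpha) * B[n]
    omega = np.ones((N, Q0))                     # backward variables, terminal value 1
    for n in range(N - 2, -1, -1):
        omega[n] = alpha @ (B[n + 1] * omega[n + 1])
    # Posterior of each transition (i -> j) between times n and n+1, normalized per n
    eta = np.zeros((N - 1, Q0, Q0))
    for n in range(N - 1):
        eta[n] = upsilon[n][:, None] * alpha * (B[n + 1] * omega[n + 1])[None, :]
        eta[n] /= eta[n].sum()
    return upsilon, omega, eta

# Toy example: Q0 = 4 states, N = 20 symbols, random placeholder likelihoods
rng = np.random.default_rng(7)
Q0, N = 4, 20
alpha = np.full((Q0, Q0), 1.0 / Q0)             # uniform transitions (all permissible)
pi = np.full(Q0, 1.0 / Q0)
B = rng.random((N, Q0))
_, _, eta = forward_backward(B, alpha, pi)
print(eta.shape)                                # (19, 4, 4)
```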
cosine pulse shape}, {2, 3/4, 3, Gaussian pulse shape}, and {4, 3/4, 3, Gaussian
pulse shape}. For the Gaussian pulse shape, the bandwidth-time product is set to
B = 0.3.
Experiment 2: Consider the case where various values of M , h, and L are set,
with M = {2, 4}, h = {1/2, 3/4 }, L = {1, 2, 3}. The pulse shapes for L = 1, 2, 3
are set to rectangular, raised cosine, and Gaussian, respectively. In such a case,
12 different CPM candidates are considered.
For the proposed algorithm, we assume that the number of symbols per receiver is
N = 100, the stopping threshold is ε = 10−3 , and maximum iterations is Nm = 100.
For the ApEn-based algorithm, the simulation parameters are set according to [17].
Without loss of generality, we assume that the noise power at all receivers is the same.
Comparison with approximate entropy-based approach
We first consider the scenario with one receiver in the presence of AWGN and fading
channels. Figure 5.6 evaluates the classification performance of the proposed algo-
rithm when considering experiment 1 and is compared to that of the ApEn-based
algorithm in [17] as well. From Figure 5.6, we can see that in AWGN channels, the
proposed algorithm attains an acceptable classification performance, i.e., a classifica-
tion probability of 0.8, when SNR = 5 dB, and it achieves an error-free classification
performance at SNR = 10 dB.
Furthermore, we consider the classification performance in the presence of fad-
ing channels. We first initialize the unknown fading channel coefficients with the true
values plus bias for the proposed algorithm [13]. Let a_k and φ_k denote the true values
of the magnitude and phase of the fading channel, respectively, and let δa_k and δφ_k
denote the maximum errors of the magnitude and phase, respectively. The initial values
of the unknown magnitude and phase are arbitrarily chosen within [0, a_k + δa_k]
and [φ_k − δφ_k, φ_k + δφ_k], respectively.
[Figure 5.6: Probability of correct classification versus SNR for the proposed algorithm under AWGN and fading channels.]
Two sets of initials of the unknowns are evaluated, whose maximum errors are set to (δa_k, δφ_k) = (0.1, π/20) and (0.3, π/10),
respectively. It is noted that the classification performance of the proposed algorithm
outperforms that of the ApEn-based algorithm in the fading channels. In particular, the
proposed algorithm provides classification probability of 0.8 for SNR > 15 dB, while
the probability of correct classification of the ApEn-based algorithm saturates around
0.6 in the high SNR region, which is invalid for the classification in fading channels.
Impact of initialization of unknowns
Note that the BW algorithm uses the EM algorithm to estimate the unknowns; there-
fore, its estimation accuracy highly relies on the initial values of the unknowns. In
Figure 5.7, we examine the impact of the initializations of the unknowns on the
classification performance of the proposed algorithm. Both experiments 1 and 2 are
evaluated. We consider multiple receivers to enhance the classification performance,
and the number of receivers is set to K = 3. In such a case, the unknown fadings are
first estimated at each receiver independently; the estimates are then forwarded to a
fusion center to make the final decision. The initial values of the unknown parameters
are set as the true values with bias, as previously described. It is seen from Figure 5.7
that, with smaller bias, the classification performance of the proposed algorithm pro-
vides promising classification performance. For experiment 1, the proposed classifier
achieves Pc > 80% when SNR > 10 dB, and for experiment 2, Pc > 80% is obtained
when SNR = 14 dB. Apparently, the cooperative classification further enhances the
classification performance when compared to that with a single receiver. Further-
more, with large bias, it is noticed that the classification performance of the proposed
algorithm degrades in the high SNR region. This phenomenon occurs when using the
EM algorithm [19]. The main reason is that the estimation results of the EM are not
guaranteed to converge to the global maximum and are sensitive to the initialization
points of the unknowns. In the high SNR region, the likelihood function is dominated
by the signal and has more local maxima than in the low SNR region. Thus,
with large bias, the proposed scheme is more likely to converge to a local maximum,
which causes the degradation of the classification performance.
[Figure 5.7: The impact of different initial values of the unknowns on the classification performance under fading channels — Pc versus SNR (dB) for experiments 1 and 2 with multiple receivers and small or large bias.]
[Figure 5.8: The classification performance under fading channels with the simulated annealing initialization method.]
Performance with simulated annealing initialization
Next, we evaluate the classification performance of the proposed algorithm with the
SA initialization method, as illustrated in Figure 5.8. The parameters of the SA method
are set as in [13]. Experiment 1 is considered. Figure 5.8 shows that, using the SA
method to generate the initial values of the unknowns, the classification performance
of the proposed algorithm monotonically increases in the low-to-moderate SNR region
(0–10 dB). It implies that the SA scheme can provide appropriate initializations for
the proposed algorithm. Apparently, a gap is noticed between the classification per-
formance with the SA scheme and that with the true values of the unknowns plus bias.
However, note that how to determine proper initial values could be an interesting topic for
future research, which is beyond the scope of this chapter.
Ψ^[k](x(t)) = Σ_{l=1}^{L_s} α_l^[k] (x(t))^l   (5.32)
where x(t) = s(t)e^{j2πnfT} is the input signal at the power amplifier—with s(t) as the
baseband-modulated signal, f as the carrier frequency, and T as the sampling period—
{α_l^[k]} denotes the coefficients of the Taylor series, and Ψ^[k](x(t)) denotes the output
signal at the power amplifier of the kth emitter, i.e., the transmit signal of the kth
emitter. Apparently, for emitters with the same order L_s, their different coefficients
represent the specific fingerprints, which are carried through the transmit signals
Ψ^[k](x(t)).
r(t) = H_sd^[k] Σ_{l=1}^{L_s} α_l^[k] (x(t))^l + w(t).   (5.34)
[Figure 5.9: (a) The system model of the single-hop scenario — emitters 1, . . . , K transmit to the receiver D over channels H_sd^[1], . . . , H_sd^[K]; (b) the system model of the relaying scenario — emitters transmit to the relay R over H_sr^[1], . . . , H_sr^[K], and the relay forwards to D over H_rd.]
Then, the received signal at the receiver, which is forwarded by the relay, is
written as
r(t) = H_rd Φ(y(t)) + υ(t) = H_rd Φ( H_sr^[k] Ψ^[k](x(t)) + η(t) ) + υ(t)   (5.36)
where Φ(·) denotes the system response characteristic of the power amplifier of the
relay, H_rd is the unknown channel fading coefficient from the relay to the receiver,
and υ(t) is the additive noise. Similarly, we use the Taylor series to define Φ(·), which
is given by
Φ(y(t)) = Σ_{m=1}^{L_r} β_m (y(t))^m   (5.37)
where L_r denotes the order of the Taylor series for the power amplifier of the relay,
and {β_m} represent the fingerprint of the relay. Hence, the received signal is further
expressed as
r(t) = H_rd Σ_{m=1}^{L_r} β_m (y(t))^m + υ(t)   (5.38)
= H_rd Σ_{m=1}^{L_r} β_m ( H_sr^[k] Σ_{l=1}^{L_s} α_l^[k] (x(t))^l + η(t) )^m + υ(t).   (5.39)
It is obvious that the features carried by the received signal are combinations of
the fingerprints of both the emitter and the relay, meaning that the fingerprint of the
emitter is contaminated by that of the relay, which has a negative effect on SEI.
the number of zero-crossings should either be equal, or the difference is one at most;
(2) at any point, the sum of the upper and lower envelopes, respectively, defined by
the local maxima and minima, should be zero.
Let z(t) be the original signal, the EMD uses an iteration process to decompose
the original signal into the IMFs, which is described as follows [23]:
1. First, identify all local maxima and minima, then employ cubic spline
fitting to obtain the upper and lower envelopes of the signal;
2. Compute the mean of the upper and lower envelopes, denoted by μ10 (t). Subtract
μ10 (t) from z(t) to obtain the first component z10 (t), i.e., z10 (t) = z(t) − μ10 (t);
3. Basically, since the original signal is complicated, the first component does not
satisfy the IMF conditions. Thus, steps 1 and 2 are repeated p times until z1p (t)
becomes an IMF:
z1p (t) = z1(p−1) (t) − μ1p (t), p = 1, 2, . . . , (5.40)
where μ1p (t) is the mean of the upper and lower envelopes of z1(p−1) (t). We
define that
ξ = Σ_{t=0}^{T_s} |z_{1(p−1)}(t) − z_{1p}(t)|² / z_{1(p−1)}²(t)   (5.41)
where T_s is the length of the signal. Then, the stopping criterion of this sifting
process is ξ < ε. Note that an empirical value of ε is set between 0.2
and 0.3.
4. Denote c1 (t) = z1p (t) as the first IMF. Subtract it from the z(t) to obtain the
residual, which is
d1 (t) = z(t) − c1 (t). (5.42)
5. Consider the residual as a new signal, repeat steps 1 to 4 on all residuals dq (t),
q = 1, . . . , Q, to extract the remaining IMFs, i.e.:
d2 (t) = d1 (t) − c2 (t),
··· (5.43)
dQ (t) = dQ−1 (t) − cQ (t)
where Q is the number of IMFs. The stopping criterion of the iteration procedure
is when dQ (t) < ε, or it becomes a monotonic function without any oscillation.
From (5.42) and (5.43), we can rewrite z(t) as
z(t) = Σ_{q=1}^{Q} c_q(t) + d_Q(t).   (5.44)
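The sifting procedure can be sketched compactly with NumPy and SciPy; this toy implementation uses cubic-spline envelopes through the local extrema and the stopping rule (5.41), and it omits the boundary handling and refinements found in production EMD codes. All parameter values are illustrative.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def mean_envelope(z):
    """Mean of the upper and lower cubic-spline envelopes of z (None if too few extrema)."""
    t = np.arange(len(z))
    imax = argrelextrema(z, np.greater)[0]
    imin = argrelextrema(z, np.less)[0]
    if len(imax) < 2 or len(imin) < 2:
        return None
    upper = CubicSpline(imax, z[imax])(t)
    lower = CubicSpline(imin, z[imin])(t)
    return (upper + lower) / 2.0

def emd(z, eps=0.25, max_imfs=6, max_sift=50):
    """Decompose z into IMFs c_q(t) plus a residual d_Q(t), following steps 1-5 above."""
    imfs, residual = [], z.copy()
    for _ in range(max_imfs):
        if mean_envelope(residual) is None:      # residual is monotonic: stop
            break
        h = residual.copy()
        for _ in range(max_sift):                # sifting until the criterion (5.41) is met
            mu = mean_envelope(h)
            if mu is None:
                break
            h_new = h - mu
            sd = np.sum((h - h_new) ** 2 / (h ** 2 + 1e-12))   # guard against division by zero
            h = h_new
            if sd < eps:
                break
        imfs.append(h)
        residual = residual - h
    return imfs, residual

# Toy two-tone signal
t = np.linspace(0, 1, 1000)
z = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)
imfs, res = emd(z)
print(len(imfs))
```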
2. As a two-dimensional spectrum, we represent the Hilbert spectrum through a matrix, referred to as the Hilbert spectrum matrix, where the indices of the columns and rows correspond to the sampling point and instantaneous frequency, respectively, and the elements of the matrix are combinations of the instantaneous energy of the IMFs.
the nearest lower integer value. Taking ζ -bit gray scale as an example, the largest value
of the Hilbert spectrum is converted to the gray scale (2ζ − 1), while other values are
linearly scaled.
The first- and second-order moments of the gray scale image are, respectively,
defined as
μ = (1/N_H) Σ_{m=1}^{M} Σ_{n=1}^{N} B_{m,n}   (5.49)
ς = [ (1/N_H) Σ_{m=1}^{M} Σ_{n=1}^{N} (B_{m,n} − μ)² ]^{1/2}   (5.50)
where NH = M × N is the total number of pixels (elements) of the gray scale image
matrix. Note that the first-order moment interprets the average intensity of the gray
scale image, and the second-order moment describes the standard deviation of the
shades of gray.
where Hi,m,n (Hj,m,n ) denotes the (m, n)th element of the Hilbert spectrum matrix
Hi (Hj ), and E(·) is the mean of the elements. Equation (5.51) depicts the linear
dependence between Hi and Hj ; larger ρ (i,j) implies that Hi and Hj are more likely
from the same emitter; otherwise, ρ (i,j) close to zero indicates that Hi and Hj are
from diverse emitters.
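A short NumPy sketch of these image-domain features follows: the ζ-bit gray-scale conversion, the moments (5.49)–(5.50), and the correlation coefficient, assuming that (5.51) is the element-wise Pearson correlation of the two Hilbert spectrum matrices, which is consistent with the description above; the spectra used here are random placeholders.

```python
import numpy as np

def to_gray(H, zeta=8):
    """Linearly scale a Hilbert spectrum matrix to a zeta-bit gray-scale image."""
    return np.floor((2 ** zeta - 1) * H / H.max()).astype(int)

def gray_moments(B):
    """First- and second-order moments (5.49)-(5.50) of the gray-scale image B."""
    mu = B.mean()
    varsigma = np.sqrt(np.mean((B - mu) ** 2))
    return mu, varsigma

def spectrum_correlation(Hi, Hj):
    """Correlation coefficient between two Hilbert spectrum matrices (assumed form of (5.51))."""
    a = Hi - Hi.mean()
    b = Hj - Hj.mean()
    return np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))

rng = np.random.default_rng(4)
H1, H2 = rng.random((32, 64)), rng.random((32, 64))
print(gray_moments(to_gray(H1)), spectrum_correlation(H1, H2))
```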
classes is C = K(K − 1)/2. For (k1 , k2 ), we define the FDR at time–frequency spot
(ω, t) as
F^{(k₁,k₂)}(ω, t) = [ E_i{H_i^[k₁](ω, t)} − E_i{H_i^[k₂](ω, t)} ]² / Σ_{k=k₁,k₂} D_i{H_i^[k](ω, t)}   (5.52)
where H_i^[k](ω, t), i = 1, . . . , N̄_0, is the Hilbert spectrum of the ith training sequence
of the kth class at (ω, t), with N̄_0 as the number of training sequences for each class,
and E_i{H_i^[k](ω, t)} and D_i{H_i^[k](ω, t)} denote the mean and variance of the training
sequences of class k at (ω, t), respectively. From (5.52), we can see that the FDR
F^{(k₁,k₂)}(ω, t) measures the separability of the time–frequency spot (ω, t) between
classes k₁ and k₂. It indicates that a time–frequency spot (ω, t) with larger FDR
provides larger separation between the means of the two classes and smaller within-class
variance, which shows stronger discrimination.
For each combination (k₁, k₂), we define Ω = {F_1^{(k₁,k₂)}(ω, t), . . . , F_{N_H}^{(k₁,k₂)}(ω, t)}
as the original FDR sequence. Sort Ω in descending order and denote the new
FDR sequence as Ω̃ = {F̃_1^{(k₁,k₂)}(ω, t), . . . , F̃_{N_H}^{(k₁,k₂)}(ω, t)}, i.e., F̃_1^{(k₁,k₂)}(ω, t) ≥ · · · ≥
F̃_{N_H}^{(k₁,k₂)}(ω, t). Let {(ω̃_1, t̃_1), . . . , (ω̃_{N_H}, t̃_{N_H})} be the time–frequency spots which
correspond to the rearranged FDR sequence Ω̃. Then, we select the time–frequency
spots that correspond to the S largest FDR values as optimal time–frequency spots, denoted
as Z^(c) = {(ω̃_s^(c), t̃_s^(c)), s = 1, . . . , S}, c = 1, . . . , C. The total set of optimal time–
frequency spots is defined as the union of the Z^(c), i.e., Z = ∪_{c=1}^{C} Z^(c). For the same
(ω̃_s, t̃_s) appearing in different combinations (k₁, k₂), only one is retained in order to avoid
duplication, i.e., Z = {(ω̃_1, t̃_1), . . . , (ω̃_D, t̃_D)}, where D is the number of optimal
time–frequency spots without duplication, with D ≤ S · K(K − 1)/2.
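The spot-selection step can be sketched as follows: for each class pair, compute the FDR (5.52) at every time–frequency spot from the training Hilbert spectra, keep the S spots with the largest FDR, and take the union over all pairs without duplication. The array shapes and names below are illustrative placeholders.

```python
import numpy as np
from itertools import combinations

def select_spots(H_train, S=10):
    """H_train: array of shape (K, N0_bar, F, T) holding the training Hilbert spectra of K classes.
    Returns the (flattened) indices of the selected optimal time-frequency spots."""
    K = H_train.shape[0]
    means = H_train.mean(axis=1)                     # per-class mean at each spot, shape (K, F, T)
    varis = H_train.var(axis=1)                      # per-class variance at each spot, shape (K, F, T)
    selected = set()
    for k1, k2 in combinations(range(K), 2):         # C = K(K-1)/2 class pairs
        fdr = (means[k1] - means[k2]) ** 2 / (varis[k1] + varis[k2] + 1e-12)   # (5.52)
        top = np.argsort(fdr.ravel())[::-1][:S]      # S largest-FDR spots for this pair
        selected.update(top.tolist())
    return np.array(sorted(selected))                # D <= S*K*(K-1)/2 spots after deduplication

# Example with random placeholder spectra: K = 3 classes, 50 training sequences, 32 x 64 spectra
rng = np.random.default_rng(5)
H_train = rng.random((3, 50, 32, 64))
print(select_spots(H_train, S=10).shape)
```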
χ(v) = w^T v + b   (5.53)
where w is the normal vector to the hyperplane and b/‖w‖ determines the per-
pendicular offset of the hyperplane from the origin, with ‖·‖ as the Euclidean
norm.
Given a set of training data, labeled as positive and negative ones, we define
the closest distances from the positive and negative points to the hyperplane as md,+
and md,− , respectively. Then, the optimization task of the hyperplane is to make the
margin, md = md,+ + md,− , the largest. To simplify the derivation, two hyperplanes
that bound the margin are defined as
w^T v + b = 1   (5.54)
w^T v + b = −1.   (5.55)
Note that the distance between the two hyperplanes is m_d = 2/‖w‖; the original
problem of maximizing m_d can be converted to a constrained minimization problem,
which is [25,26]:
min_w (1/2)‖w‖²   (5.56)
s.t. ι_i(w^T v_i + b) ≥ 1, i = 1, . . . , N̄   (5.57)
max_λ Σ_{i=1}^{N̄} λ_i − (1/2) Σ_{i,j=1}^{N̄} λ_i λ_j ι_i ι_j ⟨v_i, v_j⟩   (5.58)
s.t. λ_i ≥ 0, i = 1, . . . , N̄   (5.59)
Σ_{i=1}^{N̄} λ_i ι_i = 0   (5.60)
where w and b are represented by w = Σ_{i=1}^{N̄} λ_i ι_i v_i and b = −(1/2)( max_{i: ι_i = −1} w^T v_i + min_{i: ι_i = 1} w^T v_i ), respectively.
The decision function is obtained by solving the optimization problem in (5.58):
χ(v) = Σ_{i=1}^{N̄} λ_i ι_i ⟨v_i, v⟩ + b   (5.61)
where ·, · denotes the inner product. The decision criterion of the correct classifica-
tion is ιl χ(ul ) > 0, i.e., the testing example ul that satisfies χ (ul ) > 0 is labeled as 1;
otherwise, it is labeled as −1.
Nonlinear SVM: For the case where vi cannot be simply distinguished by the linear
classifier, a nonlinear mapping function φ is utilized to map vi to a high-dimensional
space F, in which the categorization can be done by a hyperplane. Similarly, the
decision function is expressed as [25]:
χ(v) = Σ_{i=1}^{N̄} λ_i ι_i ⟨φ(v_i), φ(v)⟩ + b.   (5.62)
In such a case, a kernel function κ(v_i, v) is defined to avoid the computation of the inner
product ⟨φ(v_i), φ(v)⟩, which is generally intractable in high-dimensional spaces [27].³
Using w = Σ_{i=1}^{N̄} λ_i ι_i v_i and the kernel function, we rewrite (5.62) as
χ(v) = Σ_{i=1}^{N̄} λ_i ι_i κ(v_i, v) + b.   (5.63)
The decision rule is the same as that for the linear classifier.
Multi-class SVM: Next, we consider the case of multiple classes. The multi-
class classification problem is solved by reducing it to several binary classification
problems. Commonly adopted methods include one-versus-one [28], one-versus-
all [28] and binary tree architecture [29] techniques. In this section, we employ the
one-versus-one technique for the multi-class problem, by which the classification is
solved using a max-win voting mechanism, and the decision rule is to choose the
class with the highest number of votes.
The training and identification procedures of the three proposed algorithms using
the SVM are summarized as follows:
³Typical kernel functions include the Gaussian radial basis function (RBF), κ(x, y) = e^{−‖x−y‖²/(2γ²)}, and the
polynomial kernel, κ(x, y) = ⟨x, y⟩^d, with d as the sum of the exponents in each term.
6. Let {vi , ιi } be the set of training data with ιi ∈ {1, . . . , K} as the label of each
class. Then, the data is input into the SVM classifier for training, i.e., to obtain
the optimal w and b of the decision hyperplane χ (v).
Identification procedure: Let H_l(ω, t), l = 1, . . . , N, denote the Hilbert
spectrum of test sequence l at time–frequency spot (ω, t) of an unknown class,
where N is the number of test sequences.
1. For the test sequence l, extract the elements corresponding to the D opti-
mal time–frequency spots as the test vector, i.e., ul = [Hl (ω̃1 , t̃1 ), . . . ,
Hl (ω̃D , t̃D )]T ;
2. Utilize the SVM classifier to identify the test sequence. For K = 2, u_l which
satisfies χ(u_l) > 0 is labeled as class 2; otherwise, it is labeled as class 1.
For K > 2, the one-versus-one technique is applied, where the decision depends
on the max-win voting mechanism, i.e., the class with the highest number of
votes is taken as the identification result.
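A hedged sketch of this train-and-identify pipeline using a generic SVM library (scikit-learn here, standing in for LIBSVM) is shown below; the toy feature vectors, sizes and the mapping of the book's γ to the library's `gamma` parameter are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# sklearn's `gamma` multiplies -||x - y||^2, so kappa(x, y) = exp(-||x - y||^2 / (2 g^2))
# corresponds to gamma = 1 / (2 g^2); g = 0.1 follows the value quoted in this section.
g = 0.1
clf = SVC(kernel='rbf', gamma=1.0 / (2 * g ** 2), decision_function_shape='ovo')

# Toy feature vectors standing in for the Hilbert-spectrum values at the D optimal
# spots (here D = 2 and K = 3 classes, purely illustrative).
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [0.3, 0.0], [0.0, 0.3]])
X_train = np.vstack([c + 0.05 * rng.normal(size=(50, 2)) for c in centres])
y_train = np.repeat([1, 2, 3], 50)
clf.fit(X_train, y_train)

# Identification: one-versus-one max-win voting is performed internally by SVC.
X_test = centres + 0.05 * rng.normal(size=(3, 2))
print(clf.predict(X_test))          # expected: [1 2 3]
```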
(1, 0.08, 0.6)ᵀ, α^[3] = (1, 0.01, 0.01)ᵀ, α^[4] = (1, 0.01, 0.4)ᵀ and α^[5] = (1, 0.6, 0.08)ᵀ,
respectively. To be specific, for the case that K = 2, the coefficient matrix is
A_2 = [α^[1]; α^[2]]; for the case that K = 3, it is A_3 = [α^[1]; α^[2]; α^[3]]; and for K = 5,
it is A_5 = [α^[1]; α^[2]; α^[3]; α^[4]; α^[5]]. For the coefficient matrix of the power amplifier
Taylor polynomial model of the relay, we set Bᵀ = (1, 0.1, 0.1). The SVM classifier
is implemented by using the LIBSVM toolbox, in which we adopt the Gaussian
RBF, κ(x, y) = e^{−‖x−y‖²/(2γ²)}, as the kernel function with parameter γ = 0.1. For each
class, we set the number of training and test sequences as N̄_0 = N_0 = 50.
● Algorithms performance in AWGN channel
[Figures: identification probability P_c for K = 2 and K = 3, including P_c versus the number of training samples N̄_0 for N_0 = 25, 50, 75 and 100.]
                         SNR = 4 dB                 SNR = 12 dB                SNR = 20 dB
                   EM2    CB    FDR   [30]    EM2    CB    FDR   [30]    EM2    CB    FDR   [30]
K = 2  Single-hop  0.93   0.97  0.99  0.55    0.98   0.99  1.00  0.70    0.99   1.00  1.00  0.94
       Relaying    0.87   0.90  0.94  0.53    0.97   0.99  0.99  0.62    0.99   0.99  0.99  0.85
K = 3  Single-hop  0.79   0.81  0.97  0.41    0.88   0.93  0.99  0.56    0.90   0.96  0.99  0.81
       Relaying    0.72   0.70  0.89  0.38    0.86   0.91  0.98  0.51    0.89   0.95  0.99  0.72
K = 5  Single-hop  0.56   0.56  0.77  0.25    0.63   0.66  0.92  0.28    0.65   0.71  0.93  0.33
       Relaying    0.48   0.50  0.70  0.25    0.61   0.62  0.89  0.26    0.64   0.69  0.92  0.30
As expected, the FDR algorithm obtains the best identification performance,
followed by the CB algorithm and finally the EM2 algorithm. The FDR algorithm
can effectively identify emitters even when K = 5, since it extracts features with strong
separability. In addition, the proposed algorithms outperform the algorithm in [30],
especially in the relaying scenario.
● Algorithms performance comparison in non-Gaussian noise channel
Next, the identification performance of the proposed algorithms is evaluated
under the non-Gaussian noise channels and compared to that in [30]. In the simula-
tions, we assume that the non-Gaussian noise follows the Middleton Class A model,
where the probability distribution function (pdf) is expressed as [31,32]:
f_ClassA(x) = e^{−A} Σ_{m=0}^{∞} (A^m / (m! √(2πσ_m²))) e^{−x²/(2σ_m²)}    (5.64)
where A is the impulse index, and σ_m² = ((m/A) + Γ)/(1 + Γ) is the noise variance,
with Γ as the ratio of the intensity of the independent Gaussian component to the
intensity of the impulsive non-Gaussian component. We set A = 0.1 and Γ = 0.05,
respectively. In addition, we assume that the number of terms in the Class A pdf, M, is
finite, i.e., m ∈ [0, M − 1], and set M = 500 [31]. No fading is considered.
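The following minimal NumPy sketch draws samples from this model by treating (5.64) as a Poisson-weighted Gaussian mixture; the function name, the unit-power normalization σ² = 1 and the truncation handling are illustrative assumptions.

```python
import numpy as np

def middleton_class_a(n_samples, A=0.1, Gamma=0.05, M=500, sigma2=1.0, rng=None):
    # Sample the pdf in (5.64) as a Poisson-weighted Gaussian mixture:
    # m ~ Poisson(A) (truncated to m <= M - 1) selects the component, and each
    # sample is drawn from N(0, sigma_m^2) with
    # sigma_m^2 = sigma2 * ((m / A) + Gamma) / (1 + Gamma).
    rng = np.random.default_rng() if rng is None else rng
    m = np.minimum(rng.poisson(A, size=n_samples), M - 1)
    sigma_m2 = sigma2 * ((m / A) + Gamma) / (1 + Gamma)
    return rng.normal(scale=np.sqrt(sigma_m2))

# Example: impulsive noise with the chapter's parameters A = 0.1, Gamma = 0.05.
noise = middleton_class_a(10_000, rng=np.random.default_rng(0))
print(noise.std(), np.abs(noise).max())
```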
Table 5.2 summarizes the identification accuracy of the proposed algorithms and
the algorithm in [30], at SNR = 4, 12 and 20 dB in the presence of the non-Gaussian
noise channels. Comparing the results with those in the AWGN channels, it is noticed
that for K = 2 and K = 3 in the single-hop scenario, the proposed algorithms suffer
little degradation in identification performance, and in the relaying scenario, the
proposed algorithms effectively combat the negative effect of the non-Gaussian noise
in the high-SNR region. The results indicate that the proposed algorithms are appli-
cable to non-Gaussian noise channels. Furthermore, it is obvious that all proposed
algorithms outperform the conventional method in [30], which performs poorly
especially in the relaying scenario.
● Algorithms performance comparison in flat-fading channel
                         SNR = 4 dB                 SNR = 12 dB                SNR = 20 dB
                   EM2    CB    FDR   [30]    EM2    CB    FDR   [30]    EM2    CB    FDR   [30]
K = 2  Single-hop  0.90   0.91  0.98  0.55    0.98   0.99  0.99  0.70    0.99   0.99  0.99  0.94
       Relaying    0.82   0.83  0.86  0.53    0.96   0.97  0.99  0.62    0.98   0.99  0.99  0.85
K = 3  Single-hop  0.74   0.75  0.94  0.40    0.86   0.91  0.99  0.56    0.89   0.95  0.99  0.81
       Relaying    0.60   0.62  0.81  0.38    0.85   0.87  0.97  0.51    0.88   0.94  0.99  0.72
K = 5  Single-hop  0.53   0.50  0.73  0.25    0.61   0.62  0.90  0.28    0.64   0.69  0.93  0.33
       Relaying    0.41   0.43  0.63  0.25    0.57   0.57  0.88  0.26    0.62   0.67  0.91  0.30
                         SNR = 4 dB                 SNR = 12 dB                SNR = 20 dB
                   EM2    CB    FDR   [30]    EM2    CB    FDR   [30]    EM2    CB    FDR   [30]
K = 2  Single-hop  0.81   0.90  0.97  0.51    0.96   0.98  0.99  0.56    0.99   0.99  0.99  0.68
       Relaying    0.63   0.72  0.82  0.50    0.90   0.90  0.99  0.53    0.98   0.98  0.99  0.59
K = 3  Single-hop  0.65   0.73  0.91  0.36    0.86   0.89  0.98  0.40    0.90   0.93  0.99  0.54
       Relaying    0.63   0.50  0.77  0.35    0.80   0.76  0.96  0.39    0.98   0.92  0.98  0.45
K = 5  Single-hop  0.50   0.49  0.71  0.21    0.60   0.61  0.89  0.23    0.62   0.66  0.92  0.29
       Relaying    0.35   0.40  0.58  0.20    0.55   0.55  0.86  0.21    0.60   0.64  0.91  0.25
5.3.5 Conclusions
This chapter discusses two main signal identification issues in CRs, namely,
modulation classification and SEI. New challenges to signal identification
techniques have arisen when real-world environments are considered, and more
advanced and intelligent theory is required to solve these blind recognition tasks.
Machine-learning-based algorithms are introduced to solve the modulation classification
and SEI problems, and numerical results demonstrate that the proposed algorithms
provide promising identification performance.
References
[1] Mitola J, and Maguire GQJ. Cognitive radio: making software radios more
personal. IEEE Personal Communications Magazine. 1999;6(4):13–18.
[2] Haykin S. Cognitive radio: brain-empowered wireless communications. IEEE
Journal on Selected Areas of Communications. 2005;23(2):201–220.
[3] Huang C, and Polydoros A. Likelihood methods for MPSK modulation classi-
fication. IEEE Transactions on Communications. 1995;43(2/3/4):1493–1504.
[4] Kebrya A, Kim I, Kim D, et al. Likelihood-based modulation classifica-
tion for multiple-antenna receiver. IEEE Transactions on Communications.
2013;61(9):3816–3829.
[5] Swami A, and Sadler B. Hierarchical digital modulation classification using
cumulants. IEEE Transactions on Communications. 2000;48(3):416–429.
[6] Wang F, Dobre OA, and Zhang J. Fold-based Kolmogorov–Smirnov modulation
classifier. IEEE Signal Processing Letters. 2017;23(7):1003–1007.
[7] Dempster AP, Laird NM, and Rubin DB. Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society: Series B.
1977;39(1):1–38.
[8] Xie Y, and Georghiades C. Two EM-type channel estimation algorithms for
OFDM with transmitter diversity. IEEE Transactions on Communications.
2003;51(1):106–115.
[9] Trees HL. Detection Estimation and Modulation Theory, Part I – Detection,
Estimation, and Filtering Theory. New York, NY: John Wiley and Sons; 1968.
[10] Gelb A. Applied Optimal Estimation. Cambridge, MA: The MIT Press; 1974.
[11] Lavielle M, and Moulines E. A simulated annealing version of the
EM algorithm for non-Gaussian deconvolution. Statistics and Computing.
1997;7(4):229–236.
[12] Orlic VD, and Dukic ML. Multipath channel estimation algorithm for auto-
matic modulation classification using sixth-order cumulants. IET Electronics
Letters. 2010;46(19):1348–1349.
[13] Ozdemir O, Wimalajeewa T, Dulek B, et al. Asynchronous linear modula-
tion classification with multiple sensors via generalized EM algorithm. IEEE
Transactions on Wireless Communications. 2015;14(11):6389–6400.
[14] Markovic GB, and Dukic ML. Cooperative modulation classification with data
fusion for multipath fading channels. IET Electronics Letters. 2013;49(23):
1494–1496.
[15] Zhang Y, Ansari N, and Su W. Optimal decision fusion based automatic mod-
ulation classification by using wireless sensor networks in multipath fading
Chapter 6
Compressive sensing for wireless sensor networks
Over the past two decades, the rapid development of technologies in sensing, com-
puting and communication has made it possible to employ wireless sensor networks
(WSNs) to continuously monitor physical phenomena in a variety of applications, for
example, air-quality monitoring, wildlife tracking, biomedical monitoring and disas-
ter detection. Since the development of these technologies will continue to reduce the
size and the cost of sensors in the next few decades, it is believed that WSNs will
become ever more involved in our daily lives and have a growing impact on the way
we live.
A WSN can be defined as a network of sensor nodes, which can sense the physical
phenomena in a monitored field and transmit the collected information to a central
information-processing station, namely, the fusion center (FC), through wireless links.
A wireless sensor node is composed of three basic elements, i.e., a sensing unit, a
computation unit and a wireless communication unit, although the node’s physical
size and shape may differ in various applications. The rapid development of WSNs
with various types of sensors has resulted in a dramatic increase in the amount of
data that has to be transmitted, stored and processed. As the number and resolution of the
sensors grow, the main constraints in the development of WSNs are limited battery
power, limited memory, limited computational capability, limited wireless bandwidth,
the cost and the physical size of the wireless sensor node. While the sensor node
is the performance bottleneck, the FC (or any back-end processor) usually has a
comparatively high-computational capability and power. The asymmetrical structure
of WSNs motivates us to exploit compressive-sensing (CS)-related techniques and to
incorporate those techniques into a WSN system for data acquisition.
CS, also called compressed sensing or sub-Nyquist sampling, was initially pro-
posed by Candès, Romberg and Tao in [1] and Donoho in [2], who derived some
important theoretical results on the minimum number of random samples needed to
reconstruct a signal. By taking advantage of the sparse characteristic of the nat-
ural physical signals of interest, CS makes it possible to recover sparse signals
from far fewer samples than predicted by the Nyquist–Shannon sampling theorem.
¹State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, China
has only s nonzero elements. Most naturally occurring signals are not exactly but
nearly sparse under a given transform basis, which means the values of the elements
in x, when sorted, decay rapidly to zero, or follow power-law distributions, i.e., the
ith element of the sorted representation x̀ satisfies:
|x̀_i| ≤ c · i^{−p}    (6.2)
for each 1 ≤ i ≤ n, where c denotes a constant and p ≥ 1.
Transforming signals into a sparse domain has been widely used in data reduction.
For example, audio signals are compressed by projecting them into the frequency
domain, and images are compressed by projecting them into the wavelet domain and
curvelet domain. Furthermore, sometimes it is easier to manipulate or process the
information content of signals in the projected domain than in the original domain
where signals are observed. For example, by expressing audio signals in the frequency
domain, one can acquire the dominant information more accurately than by expressing
them as the amplitude levels over time. In this case, people are more interested in
the signal representation in the transformed domain rather than the signal itself in the
observed domain.
Figure 6.1 The original cameraman image vs. the compressed version
smaller than 10 and show the compressed image in the right-hand image in Figure 6.1.
The loss in perceptual quality of the compressed image compared with the original is
negligible.
6.2.1 CS model
Given a signal f ∈ R^n, we consider a measurement system that acquires m (m ≤ n)
linear measurements by projecting the signal with a sensing matrix Φ ∈ R^{m×n}. This
sensing system can be represented as
y = Φf,    (6.4)
where y ∈ R^m denotes the measurement vector.
The standard CS framework assumes that the sensing matrices are randomized
and nonadaptive, which means each measurement is derived independently of the
previously acquired measurements. In some settings, it is interesting to design fixed
and adaptive sensing matrices which can lead to improved performance. More details
about the design of sensing matrices are given in [3–7]. For now, we will concentrate
on the standard CS framework.
Remembering that the signal f can be represented by an s-sparse vector x as
expressed in (6.1), the sensing system can be rewritten as
y = ΦΨx = Ax,    (6.5)
where A = ΦΨ denotes an equivalent sensing matrix. The simplified model with
the equivalent sensing matrix A will be frequently used in this chapter unless
we need to specify the basis, not only to simplify nomenclature but also because
many important results are given in terms of the product of Φ and Ψ. More gener-
ally, measurements are considered to be contaminated by some noise term n ∈ R^m
owing to the sampling noise or the quantization process. Then the CS model can be
described as
y = Ax + n. (6.6)
In general, it is not possible to solve (6.6) even if the noise term is equal to zero,
as there are an infinite number of solutions satisfying (6.6). However, a suitable sparsity
constraint may rule out all the solutions except for the one that is expected. Therefore,
the most natural strategy to recover the sparse representation from the measurements
uses ℓ0 minimization, which can be written as

min_x ‖x‖_0    (6.7)
s.t. Ax = y.

The solution of (6.7) is the sparsest vector satisfying (6.5). However, (6.7) is a
combinatorial optimization problem and thus computationally intractable.
Consequently, as a convex relaxation of ℓ0 minimization, ℓ1 minimization is used
instead to solve for the sparse signal representation, which leads to a linear program and
is thus straightforward to solve [8]. Therefore, the optimization problem becomes:

min_x ‖x‖_1    (6.8)
s.t. Ax = y.

This program is also known as basis pursuit (BP).
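As an illustration of how (6.8) can be posed as a linear program and handed to a generic LP solver, a minimal sketch is given below; the function name and the toy problem sizes are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    # BP, min ||x||_1 s.t. Ax = y, in epigraph form over z = [x; t]:
    # minimize sum(t) subject to -t <= x <= t and Ax = y.
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])
    A_eq = np.hstack([A, np.zeros((m, n))])
    A_ub = np.block([[np.eye(n), -np.eye(n)], [-np.eye(n), -np.eye(n)]])  # x - t <= 0, -x - t <= 0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * n), A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * n + [(0, None)] * n)
    return res.x[:n]

# Tiny demo: an s-sparse x recovered from m random measurements.
rng = np.random.default_rng(1)
n, m, s = 50, 20, 3
x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.normal(size=s)
A = rng.normal(size=(m, n)) / np.sqrt(m)
x_hat = basis_pursuit(A, A @ x_true)
print(np.max(np.abs(x_hat - x_true)))   # should be close to zero
```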
In the presence of noise, the equality constraint in (6.8) can never be satisfied.
Instead, the optimization problem (6.8) can be relaxed by using BP de-noising
(BPDN) [9], which is

min_x ‖x‖_1    (6.9)
s.t. ‖Ax − y‖_2^2 ≤ ε,
where ε is an estimate of the noise level. It has been demonstrated that only
m = O(s log(n/s)) measurements [10] are required for robust reconstruction in the
CS framework.
This standard CS framework only exploits the sparse characteristics of the sig-
nal to reduce the dimensionality required for sensing the signal. A recent growing
trend relates to the use of more complex signal models that go beyond the simple
sparsity model to further enhance the performance of CS. For example, Baraniuk
et al. [11] have introduced a model-based CS, where more realistic signal models
such as wavelet trees or block sparsity are leveraged in order to reduce the number of
measurements required for reconstruction. In particular, it has been shown that robust
signal recovery is possible with m = O(s) measurements in model-based CS [11].
Ji et al. [12] introduced Bayesian CS, where a signal statistical model instead is
exploited to reduce the number of measurements for reconstruction. In [13,14], recon-
struction methods have been proposed for manifold-based CS, where the signal is
assumed to belong to a manifold. Other works that consider various sparsity mod-
els that go beyond that of simple sparsity in order to improve the performance of
traditional CS include [15–18].
Definition 6.1. A matrix A ∈ R^{m×n} satisfies the NSP in ℓ1 of order s if and only if
the following inequality holds:

‖v_S‖_1 < ‖v_{S^c}‖_1    (6.10)

for every nonzero vector v in the null space of A and every index set S ⊂ {1, . . . , n} with
|S| ≤ s, where v_S keeps the entries of v indexed by S and S^c denotes the complement of S.

The NSP highlights that vectors in the null space of the equivalent sensing matrix
A should not concentrate on a small number of elements. Based on the definition of
the NSP, the following theorem [19] guarantees the success of ℓ1 minimization with
an equivalent sensing matrix A satisfying the NSP condition.

Theorem 6.1. Let A ∈ R^{m×n}. Then every s-sparse vector x ∈ R^n is the unique solution
of the ℓ1 minimization problem in (6.8) with y = Ax if and only if A satisfies the NSP
in ℓ1 of order s.

This theorem claims that the NSP is both necessary and sufficient for successful
sparse recovery by ℓ1 minimization. However, it does not consider the presence of
noise as in (6.9). Furthermore, it is very difficult to evaluate the NSP condition for a
given matrix, since it involves calculation of the null space and testing all vectors in
this space.
Definition 6.2. A matrix A ∈ R^{m×n} satisfies the RIP of order s with a restricted
isometry constant (RIC) δ_s ∈ (0, 1) being the smallest number such that

(1 − δ_s)‖x‖_2² ≤ ‖Ax‖_2² ≤ (1 + δ_s)‖x‖_2²    (6.11)

holds for all x with ‖x‖_0 ≤ s.
The RIP quantifies the notion that the energy of sparse vectors should not be
scaled too much when projected by the equivalent sensing matrix A. It has been
established in [21] that the RIP provides a sufficient condition for exact or near-exact
recovery of a sparse signal via ℓ1 minimization.
This theorem claims that, with a reduced number of measurements, the recon-
structed vector x* is a good approximation to the original signal representation x. In
addition, for the noiseless case, any sparse representation x with support size no larger
than s can be exactly recovered by ℓ1 minimization if the RIC satisfies δ_{2s} < √2 − 1.
Improved bounds based on the RIP are derived in [22–24].
For any arbitrary matrix, computing the RIC by going through all possible sparse
signals is computationally prohibitive. Baraniuk et al. prove in [10] that any random matrix whose
entries are independent and identically distributed (i.i.d.) realizations of certain zero-mean
random variables with variance 1/m, e.g., the Gaussian distribution and the Bernoulli distri-
bution,¹ satisfies the RIP with very high probability when the number of samples
m = O(s log(n/s)).
Note that the RIP is a sufficient condition for successful reconstruction, but it is
too strict. In practice, signals with sparse representations can often be reconstructed very
well even though the sensing matrices do not satisfy the RIP.
6.2.2.3 Mutual coherence
Another way to evaluate a sensing matrix, which is not as computationally intractable
as the NSP and the RIP, is via the mutual coherence of the matrix [26], which is
given by

μ = max_{1≤i,j≤n, i≠j} |A_iᵀA_j|.    (6.13)

A small mutual coherence means that any pair of columns of the matrix A has a low coher-
ence, which eases the difficulty in discriminating components from the measurement
vector y.
¹In most of the experiments we have conducted, we use random matrices with elements drawn from i.i.d.
Gaussian distributions, since it is the typical setting found in the literature and its performance is no worse
than one with elements drawn from i.i.d. Bernoulli distributions [25].
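A direct way to evaluate (6.13) numerically is sketched below for a column-normalized matrix; the i.i.d. Gaussian example mirrors the setting described in the footnote, and the function name is an assumption.

```python
import numpy as np

def mutual_coherence(A):
    # Mutual coherence of (6.13) for a matrix with unit-norm columns.
    A = A / np.linalg.norm(A, axis=0, keepdims=True)   # normalize columns
    G = np.abs(A.T @ A)                                # |A_i^T A_j|
    np.fill_diagonal(G, 0.0)                           # exclude i = j
    return G.max()

rng = np.random.default_rng(0)
print(mutual_coherence(rng.normal(size=(20, 80))))     # i.i.d. Gaussian example
```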
Donoho, Elad and Temlyakov demonstrated in [26] that every x is the unique
sparsest solution of (6.7) if μ < 1/(2s − 1), and that the error of the solution of (6.8) is
bounded if μ < 1/(4s − 1). According to the relationship between the RIC and the
mutual coherence, i.e., δ_s ≤ (s − 1)μ [19], it is clear that if a matrix possesses a small
mutual coherence, it also satisfies the RIP condition. This means that the mutual coher-
ence condition is a stronger condition than the RIP. However, the mutual coherence is
still very attractive for sensing matrix design owing to its convenience in evaluation.
[Figure: curves of |x|^p for p = 0 (ℓ0), 0 < p < 1, p = 1 (ℓ1) and p > 1, plotted against x ∈ [−2, 2].]
and the curve of the ℓ1 norm is closer to the curve of the ℓ0 norm than that of any other ℓp
norm with p > 1.
Some equivalent formulations to (6.9) exist. For example, the least absolute
shrinkage and selection operator (LASSO) [32] instead minimizes the energy of the
detection error with an ℓ1 constraint:

min_x ‖Ax − y‖_2²    (6.14)
s.t. ‖x‖_1 ≤ η,

where η ≥ 0. Both BPDN and LASSO can be written as an unconstrained optimization
problem with some τ ≥ 0 for any η ≥ 0 in (6.14) and ε ≥ 0 in (6.9):

min_x (1/2)‖Ax − y‖_2² + τ‖x‖_1.    (6.15)

Note that the value of τ is an unknown coefficient that makes these problems equivalent.
How to choose τ is discussed in [33].
There are several methods, such as the steepest descent and the conjugate gra-
dient methods, to search for the global optimal solution of these convex-relaxed problems.
Interior-point (IP) methods, developed in the 1980s to solve convex optimization problems, are
used in [9,34] for sparse reconstruction. Figueiredo, Nowak and Wright propose a
gradient projection approach with one level of iteration [35], while the IP approaches
in [9,34] have two iteration levels, and ℓ1-magic [9,34] has three iteration levels.
Other algorithms proposed to solve (6.15) include the homotopy method [36,37],
the iterative soft-thresholding algorithm [38] and the approximate message-passing
algorithm [39].
Generally, algorithms in this category have better performance than greedy algo-
rithms in terms of the number of measurements required for successful reconstruction.
However, their high computational complexity makes them unsuitable for applications
where high-dimensional signals have to be reconstructed within a short time.
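A minimal sketch of the iterative soft-thresholding idea for (6.15) is given below; the step size based on the spectral norm, the iteration count and the toy data are illustrative choices rather than the tuned algorithm of [38].

```python
import numpy as np

def ista(A, y, tau, iters=500):
    # Iterative soft thresholding for min_x 0.5 * ||Ax - y||_2^2 + tau * ||x||_1.
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = x - (A.T @ (A @ x - y)) / L    # gradient step on the quadratic term
        x = np.sign(g) * np.maximum(np.abs(g) - tau / L, 0.0)  # soft threshold
    return x

# Toy demo: a 3-sparse vector recovered from 25 random measurements.
rng = np.random.default_rng(2)
n, m = 60, 25
x0 = np.zeros(n)
x0[[3, 17, 40]] = [1.0, -2.0, 0.5]
A = rng.normal(size=(m, n)) / np.sqrt(m)
print(np.round(ista(A, A @ x0, tau=0.01), 2)[[3, 17, 40]])
```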
The sensing matrices used in CS play a key role for successful reconstruction in
underdetermined sparse-recovery problems. A number of conditions, such as the
NSP, the RIP and mutual coherence, have been put forth in order to study the quality
of the sensing matrices and recovery algorithms. These conditions are mainly used
to address the worst case performance of sparse recovery [25,47,48]. However, the
actual reconstruction performance in practice is often much better than the worst case
performance, so that this viewpoint can be too conservative. In addition, the worst case
performance is a less typical indicator of quality in signal-processing applications than
the expected-case performance. This motivates us to investigate the design of sensing
matrices with adequate expected-case performance. Furthermore, a recent growing
trend relates to the use of more complex signal models that go beyond the simple
sparsity model to further enhance the performance of CS. The use of additional signal
knowledge also enables one to replace the conventional random sensing matrices by
optimized ones in order to further enhance CS performance (e.g., see [3–7,49–53]).
where t ≥ 0, and the function 1(•) is equal to 1 if its input expression is true, otherwise
it is equal to 0.
If t = 0, the t-averaged mutual coherence is the average of the coherences between
all column pairs. If t = μ, then the t-averaged mutual coherence μ_t is exactly equal to the
mutual coherence μ. Elad claimed that the equivalent sensing matrix A will perform
better if the coherence between its columns can be reduced. Iteratively reducing
the mutual coherence by adjusting the related pair of columns is not an efficient
approach to do this, since the coherence of all column pairs is not improved except
for the worst pair in each iteration. The t-averaged mutual coherence includes the
contribution of a batch of column pairs with high coherence. Thus, one can improve
the coherence of many column pairs by reducing the t-averaged mutual coherence.
Elad proposes an iterative algorithm to minimize μ_t(A) = μ_t(ΦΨ) with respect to
the sensing matrix Φ, assuming the basis Ψ and the parameter t are fixed and known.
In each iteration, the Gram matrix G = AᵀA is computed, and the values above t
are forced to shrink by multiplication with γ (0 < γ < 1), which can be expressed as

Ĝ_{i,j} = { γ G_{i,j},            |G_{i,j}| ≥ t
          { γ t · sign(G_{i,j}),  t > |G_{i,j}| ≥ γ t    (6.17)
          { G_{i,j},              γ t > |G_{i,j}|
where sign(•) denotes the sign function. The shrunk Gram matrix Ĝ becomes full
rank in the general case due to the operation in (6.17). To fix this, the Gram matrix
Ĝ is forced to be of rank m by applying the singular value decomposition (SVD) and
setting all the singular values to be zero except for the m largest ones. Then one can
build the square root of Ĝ, i.e., ÃT à = Ĝ, where the square root à is of size m × n.
The last step in each iteration is to find a sensing matrix Φ that makes ΦΨ closest to
Ã by minimizing ‖Ã − ΦΨ‖²_F.
The outline of Elad's algorithm is given as follows:
● Step 0: Generate an arbitrary random matrix Φ.
● Step 1: Generate a matrix A by normalizing the columns of ΦΨ.
● Step 2: Compute the Gram matrix G = AᵀA.
● Step 3: Update the Gram matrix Ĝ by (6.17).
● Step 4: Apply the SVD and set all the singular values of Ĝ to zero except for the
m largest ones.
● Step 5: Build the square root m × n matrix à by Ãᵀà = Ĝ.
● Step 6: Update the sensing matrix Φ by minimizing ‖Ã − ΦΨ‖²_F.
● Step 7: Return to step 1 if some halting condition is not satisfied.
Elad’s method aims to minimize the large absolute values of the off-diagonal
elements in the Gram matrix and thus reduces the t-averaged mutual coherence. This
method updates a number of columns at the same time in each iteration. Therefore, it
converges to a good matrix design faster than directly working on and iteratively
updating the mutual coherence. Empirical knowledge is required to determine the values
of t and γ, which affect the matrix quality and the convergence rate, respectively.
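A compact NumPy sketch of the outlined iterations is given below; the values of t and γ, the fixed iteration count in place of a halting condition, and the literal transcription of the shrinkage in (6.17) (including the diagonal entries) are illustrative assumptions.

```python
import numpy as np

def elad_design(Psi, m, t=0.2, gamma=0.5, iters=50, rng=None):
    # Psi is the n x k dictionary/basis; Phi is the m x n sensing matrix being designed.
    rng = np.random.default_rng() if rng is None else rng
    n, k = Psi.shape
    Phi = rng.normal(size=(m, n))                              # Step 0: random Phi
    for _ in range(iters):
        A = Phi @ Psi
        A = A / (np.linalg.norm(A, axis=0, keepdims=True) + 1e-12)   # Step 1: normalize columns
        G = A.T @ A                                            # Step 2: Gram matrix
        absG = np.abs(G)
        Ghat = np.where(absG >= t, gamma * G,                  # Step 3: shrink by (6.17)
               np.where(absG >= gamma * t, gamma * t * np.sign(G), G))
        U, s, _ = np.linalg.svd(Ghat)                          # Step 4: keep the m largest modes
        Atilde = np.diag(np.sqrt(s[:m])) @ U[:, :m].T          # Step 5: square root, m x k
        Phi = Atilde @ np.linalg.pinv(Psi)                     # Step 6: min ||Atilde - Phi Psi||_F
    return Phi

# Example: design a 16 x 64 sensing matrix for a random orthonormal basis.
Psi_demo = np.linalg.qr(np.random.default_rng(1).normal(size=(64, 64)))[0]
print(elad_design(Psi_demo, m=16).shape)
```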
where G_prev denotes the Gram matrix in the previous iteration, and 0 < α < 1 denotes
the forgetting parameter. Then they update the sensing matrix Φ as the matrix with
the minimum distance to the Gram matrix; the final steps of each iteration mirror those
of Elad's algorithm:
● Step 5: Build the square root m × n matrix à by Ãᵀà = G.
● Step 6: Update the sensing matrix Φ by minimizing ‖Ã − ΦΨ‖²_F.
● Step 7: Return to step 1 if some halting condition is not satisfied.
Xu et al. take the equiangular tight frame lower bound as the target of their
design, since the equiangular tight frame has minimum mutual coherence [51]. However,
the lower bound can never be achieved if an equiangular tight frame for dimensions
m × n does not exist. Although the design target is based on knowledge of the bound,
improved performance has been shown for arbitrary dimensions.
m ), . . . , (1/λ1 )
Om×(n−m) and = Jn . It uncovers the key operations performed by this optimal
sensing matrix design. In particular, this sensing matrix design (i) exposes the modes
(singular values) of the dictionary; (ii) passes the m strongest modes and filters out the
n − m weakest modes and (iii) weighs the strongest modes. This is also accomplished
by taking the matrix of right singular vectors of the sensing matrix to correspond to
the matrix of left singular vectors of the dictionary and taking the strongest modes of
the dictionary. It leads immediately to the sensing matrix design, which is consistent
with the sensing cost constraint ‖Φ‖²_F = n, as follows:

Φ = √n Φ̂/‖Φ̂‖_F = √n Γ̂ J_n Uᵀ/‖Φ̂‖_F.    (6.30)
The sensing matrix design can be generated as follows:
● Step 0: For a given n × k dictionary Ψ, perform the SVD, i.e., Ψ = UΛVᵀ, with
singular values λ_1 ≥ · · · ≥ λ_n.
poor computing capability of wireless sensor devices and the bandwidth overhead,
many error protection and retransmission schemes are not suitable for WSNs. In
addition, for small and battery-operated sensor devices, sophisticated source coding
cannot be afforded in some cases. However, as naturally occurring signals are often
compressible, CS can be viewed as a compression process. What is more interesting
is that these CS data with redundant measurements are robust against data loss, i.e.,
the original signal can be recovered without retransmission even though some data
are missing.
As shown in Figure 6.3, the conventional sequence of sampling, source coding
and channel coding is replaced by one CS procedure. Each CS measurement contains
some information about the whole signal owing to the mixture effect of the sensing
matrix. Thus, any lost measurement will not cause an inevitable information loss. With
some redundant measurements, the CS system can combat data loss and successfully
recover the original signal.
This CS-coding scheme has a low encoding cost, especially if random sampling
is used. All the measurements are acquired in the same way, and thus the number of
redundant measurements can be specified according to the fading severity of the wireless
channel. In addition, one can still use physical-layer channel coding on the CS mea-
surements. In this case, CS can be seen as a coding strategy that is applied at the appli-
cation layer, where the signal characteristics are exploited, and it can also be seen as a
replacement for the traditional sampling and source-coding procedures. If channel coding
fails, the receiver is still able to recover the original signal in the application layer.
In [56], Davenport et al. demonstrate theoretically that each CS measurement
carries roughly the same amount of signal information if random matrices are used.
Therefore, by slightly increasing the number of measurements, the system is robust to
the loss of a small number of arbitrary measurements. Charbiwala et al. show that this
CS coding approach is efficient for dealing with data loss, and cheaper than several
other approaches including Reed–Solomon encoding in terms of energy consumption
using a MicaZ sensor platform [57].
Note that the fountain codes [58]—in particular random linear fountain codes—
and network coding [59] can also be used to combat data loss by transmitting mixed
symbols, which are “equally important.” The number of received symbols for both
approaches should be no smaller than the original number of symbols for decoding,
which is not necessary in the CS-based approach owing to the use of the sparse signal
characteristic.
y = H ◦ f + n, (6.31)
also the leaking positions and the volumes of the leaks are reported to the monitoring
system. Other anomalies, such as abnormal temperature, humidity and so on, can
also be detected by WSNs. All of these anomaly detection problems can be analyzed
using the same model [66–69].
As shown in Figure 6.6, the n grid intersection points denote sources to be mon-
itored, the m yellow nodes denote sensors, and the s red hexagons denote anomalies.
The monitored phenomenon is modeled as a vector x ∈ R^n, where x_i denotes the value at
the ith monitored position. The normal situation is represented by x_i = 0, and x_i ≠ 0
represents an anomaly. The measurements of the sensors are denoted by a vector
y ∈ Rm where yj represents the jth sensor’s measurement. The relationship between
the events x and measurements y can be written as
y = Ax + n, (6.33)
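To illustrate how an anomaly pattern of this form can be recovered from far fewer sensors than grid points, the sketch below uses orthogonal matching pursuit as a generic sparse-recovery routine; the sizes, the assumed-known matrix A and the choice of OMP (rather than the specific algorithms studied in [66–69]) are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

# m sensor readings y = A x + n, with x nonzero only at the s anomalous grid points.
rng = np.random.default_rng(3)
n_points, m_sensors, s_anomalies = 100, 30, 2
x = np.zeros(n_points)
x[rng.choice(n_points, s_anomalies, replace=False)] = rng.uniform(1.0, 2.0, s_anomalies)
A = rng.normal(size=(m_sensors, n_points)) / np.sqrt(m_sensors)   # assumed-known mixing matrix
y = A @ x + 0.01 * rng.normal(size=m_sensors)

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=s_anomalies)
omp.fit(A, y)
print("true anomalies:     ", np.flatnonzero(x))
print("detected anomalies: ", np.flatnonzero(omp.coef_))
```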
Figure 6.5 Chain-type WSNs: (a) baseline data gather and (b) compressive data
gather
where hi and fi denote the channel gain and transmitted symbol corresponding to the ith
sensor node, and n ∈ Rm denotes a white Gaussian noise vector. We assume that both
the channel gains hi (i = 1, . . . , n) and sensor signature sequences φ i (i = 1, . . . , n)
are known at the receiver. It is in general impossible to solve for f ∈ R^n from only m received
measurements. However, as there are very few active sensor nodes in each time
frame, the transmitted symbols can be reconstructed by exploiting CS reconstruction
algorithms.
This reduced-dimension MAC design has been proposed in WSN applications to
save channel resource and power consumption [70,71]. Various linear and nonlinear
detectors are given and analyzed by Xie, Eldar and Goldsmith [72]. In [73], in addition
to the sparsity of active sensor nodes, the authors exploit additional correlations that
exist naturally in the signals to further improve the performance in terms of power
efficiency.
6.4.5 Localization
Accurate localization is very important in many applications, including indoor
location-based services for mobile users, equipment monitoring in WSNs and radio-
frequency-identification-based tracking. In the outdoor environment, the global posi-
tioning system (GPS) works very well for localization purposes. However, this solution
is not suitable for indoor environments. For one thing,
it is difficult to detect the signal from the GPS satellites in most buildings due to the
penetration loss of the signal. For another, the precision of civilian GPS is
about 10 m [74], while indoor location-based services usually require a much higher
accuracy than GPS provides.
Using trilateration, the position of a device in a 2-D space can be determined by
the distances from the device to three reference positions. The precision of localization
can be improved by using an increased number of distance measurements, which are
corrupted by noise in real applications. One localization technique considered for the
indoor environment uses the received signal strength (RSS) as a distance proxy, where
the distance corresponding to a particular RSS value can be looked up from a radio
map on the server. However, the RSS metric in combination with trilateration
is unreliable owing to the complex nature of indoor radio propagation [75]. Another
approach is to compare the online RSS readings with off-line observations of different
reference points, which are stored in a database. The estimated position of a device is a
grid point in the radio map. However, owing to the dynamic and unpredictable nature
of indoor radio propagation, accurate localization requires a large number of RSS
measurements. CS can be used to accurately localize a target with a small number of
RSS measurements, where the sparsity level of the signal representation is equal to 1.
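A toy numerical sketch of this idea is given below. Because the position indicator is 1-sparse, a nearest-fingerprint search over the compressed radio map already identifies the grid point; the CS-based schemes (e.g., [75]) instead run a full sparse-recovery algorithm, and all names and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n_grid, n_aps, m = 200, 12, 6
Psi = rng.uniform(-90, -30, size=(n_aps, n_grid))          # assumed offline RSS radio map (dBm)
true_idx = 57
rss_online = Psi[:, true_idx] + rng.normal(0, 1, n_aps)    # noisy online RSS reading
Phi = rng.normal(size=(m, n_aps))                          # random compression of the RSS vector
y, A = Phi @ rss_online, Phi @ Psi                         # m measurements and compressed map

# 1-sparse indicator: pick the grid point whose compressed fingerprint is closest to y.
est_idx = int(np.argmin(np.linalg.norm(A - y[:, None], axis=0)))
print("true grid point:", true_idx, "estimated grid point:", est_idx)
```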
6.5 Summary
This chapter reviews the fundamental concepts of CS and sparse recovery. Particularly,
it has been shown that compressively sensed signals can be successfully recovered
if the sensing matrices satisfy any of the three given conditions. The focus of this
chapter is on the applications in WSNs, and five cases in WSNs are presented where
the CS principle has been used to solve different problems. There are many new
emerging directions and many challenges that have to be tackled. For example, it
would be interesting to study better signal models beyond sparsity, computationally
efficient algorithms, compressive information processing, data-driven approaches,
multidimensional data and so on.
References
[1] Candès EJ, Romberg JK, and Tao T. Stable Signal Recovery from Incom-
plete and Inaccurate Measurements. Communications on Pure and Applied
Mathematics. 2006;59(8):1207–1223.
[2] Donoho DL. Compressed Sensing. IEEE Transactions on Information Theory.
2006;52(4):1289–1306.
[3] Chen W, Rodrigues MRD, and Wassell IJ. Projection Design for Statistical
Compressive Sensing: A Tight Frame Based Approach. IEEE Transactions on
Signal Processing. 2013;61(8):2016–2029.
[4] Chen W, Rodrigues MRD, and Wassell IJ. On the Use of Unit-Norm Tight
Frames to Improve the Average MSE Performance in Compressive Sensing
Applications. IEEE Signal Processing Letters. 2012;19(1):8–11.
[5] Ding X, Chen W, and Wassell IJ. Joint Sensing Matrix and Sparsifying Dic-
tionary Optimization for Tensor Compressive Sensing. IEEE Transactions on
Signal Processing. 2017;65(14):3632–3646.
[6] Chen W, and Wassell IJ. Cost-Aware Activity Scheduling for Compressive
Sleeping Wireless Sensor Networks. IEEE Transactions on Signal Processing.
2016;64(9):2314–2323.
[7] Chen W, and Wassell IJ. Optimized Node Selection for Compressive Sleep-
ing Wireless Sensor Networks. IEEE Transactions on Vehicular Technology.
2016;65(2):827–836.
[8] Baraniuk RG. Compressive Sensing. IEEE Signal Processing Magazine.
2007;24(4):118–121.
[9] Chen SS, Donoho DL, and Saunders MA. Atomic Decomposition by Basis
Pursuit. SIAM Review. 2001;43(1):129–159.
[59] Fragouli C, Le Boudec JY, and Widmer J. Network Coding: An Instant Primer.
ACM SIGCOMM Computer Communication Review. 2006;36(1):63–68.
[60] Bajwa W, Haupt J, Sayeed A, et al. Joint Source-Channel Communication for
Distributed Estimation in Sensor Networks. IEEE Transactions on Information
Theory. 2007;53(10):3629–3653.
[61] Chen W, Rodrigues MRD, and Wassell IJ. A Frechet Mean Approach for
Compressive Sensing Date Acquisition and Reconstruction in Wireless Sensor
Networks. IEEE Transactions on Wireless Communications. 2012;11(10):
3598–3606.
[62] Chen W. Energy-Efficient Signal Acquisition in Wireless Sensor Networks: A
Compressive Sensing Framework. IET Wireless Sensor Systems. 2012;2:1–8.
[63] Quer G, Masiero R, Munaretto D, et al. On the Interplay between Routing
and Signal Representation for Compressive Sensing in Wireless Sensor
Networks. In: Information Theory and Applications Workshop, 2009; 2009.
p. 206–215.
[64] Masiero R, Quer G, Munaretto D, et al. Data Acquisition through Joint
Compressive Sensing and Principal Component Analysis. In: Global
Telecommunications Conference, 2009. GLOBECOM 2009. IEEE; 2009.
p. 1–6.
[65] Luo C, Wu F, Sun J, et al. Efficient Measurement Generation and Pervasive
Sparsity for Compressive Data Gathering. IEEE Transactions on Wireless
Communications. 2010;9(12):3728–3738.
[66] Meng J, Li H, and Han Z. Sparse Event Detection in Wireless Sensor Networks
Using Compressive Sensing. In: Information Sciences and Systems, 2009.
CISS 2009. 43rd Annual Conference on; 2009. p. 181–185.
[67] Ling Q, and Tian Z. Decentralized Sparse Signal Recovery for Compressive
Sleeping Wireless Sensor Networks. IEEE Transactions on Signal Processing.
2010;58(7):3816–3827.
[68] Zhang B, Cheng X, Zhang N, et al. Sparse Target Counting and Localization
in Sensor Networks based on Compressive Sensing. In: INFOCOM, 2011
Proceedings IEEE; 2011. p. 2255–2263.
[69] Liu Y, Zhu X, Ma C, et al. Multiple Event Detection in Wireless Sensor
Networks Using Compressed Sensing. In: Telecommunications (ICT), 2011
18th International Conference on; 2011. p. 27–32.
[70] Fletcher AK, Rangan S, and Goyal VK. On-off Random Access Channels: A
Compressed Sensing Framework. IEEE Transactions on Information Theory.
submitted for publication.
[71] Fazel F, Fazel M, and Stojanovic M. Random Access Compressed Sensing
for Energy-Efficient Underwater Sensor Networks. IEEE Journal on Selected
Areas in Communications. 2011;29(8):1660–1670.
[72] Xie Y, Eldar YC, and Goldsmith A. Reduced-Dimension Multiuser Detection.
IEEE Transactions on Information Theory. 2013;59(6):3858–3874.
[73] Xue T, Dong X, and Shi Y. A Covert Timing Channel via Algorithmic
Complexity Attacks: Design and Analysis. In: Communications (ICC), 2012
IEEE International Conference on; 2012.
[74] Panzieri S, Pascucci F, and Ulivi G. An Outdoor Navigation System Using GPS
and Inertial Platform. IEEE/ASME Transactions on Mechatronics. 2002;7(2):
134–142.
[75] Feng C, Au WSA, Valaee S, et al. Compressive Sensing Based Positioning
Using RSS of WLAN Access Points. In: INFOCOM, 2010 Proceedings IEEE;
2010.
[76] Feng C, Valaee S, Au WSA, et al. Localization of Wireless Sensors via Nuclear
Norm for Rank Minimization. In: Global Telecommunications Conference
(GLOBECOM 2010), 2010 IEEE; 2010.
[77] Abid MA. 3D Compressive Sensing for Nodes Localization in WNs Based
On RSS. In: Communications (ICC), 2012 IEEE International Conference
on; 2012.
Chapter 7
Reinforcement learning-based channel sharing
in wireless vehicular networks
Andreas Pressas1 , Zhengguo Sheng1 , and Falah Ali1
In this chapter, the authors study the enhancement of the proposed IEEE 802.11p
medium access control (MAC) layer for vehicular use by applying reinforcement
learning (RL). The purpose of this adaptive channel access control technique is
enabling more reliable, high-throughput data exchanges among moving vehicles for
cooperative awareness purposes. Some technical background for vehicular networks is
presented, as well as some relevant existing solutions tackling similar channel sharing
problems. Finally, some new findings from combining the IEEE 802.11p MAC with
RL-based adaptation are presented, together with insight into the various challenges
that appear when applying such mechanisms in a wireless vehicular network.
7.1 Introduction
Vehicle-to-vehicle (V2V) technology aims to enable safer and more sophisti-
cated transportation starting with minor, inexpensive additions of communication
equipment on conventional vehicles and moving towards network-assisted fully
autonomous driving. It will be a fundamental component of the intelligent trans-
portation services and the Internet of Things (IoT). This technology allows for the
formation of vehicular ad hoc networks (VANETs), a new type of network which
allows the exchange of kinematic data among vehicles for the primary purpose of
safer and more efficient driving as well as efficient traffic management and other
third-party services. VANETs can help minimize road accidents and randomness in
driving with on-time alerts as well as enhance the whole travelling experience with new
infotainment systems, which allow acquiring navigation maps and other information
from peers.
The V2V radio technology is based on the IEEE 802.11a stack, adjusted for low
overhead operations in the dedicated short-range communications (DSRCs) spec-
trum (30 MHz in the 5.9 GHz band for Europe). It is being standardized as IEEE
802.11p [1]. The adjustments that have been made are mainly for enabling exchanges
¹Department of Engineering and Design, University of Sussex, UK
7.1.1 Motivation
VANETs are the first large-scale network to operate primarily on broadcast transmis-
sions, since the data exchanges are often relevant for vehicles within an immediate
geographical region of interest (ROI) of the host vehicle. This allows the transmission
of broadcast packets (packets not addressed to a specific MAC address), so that they
can be received from every vehicle within range without the overhead of authentica-
tion and association with an AP. Broadcasting has always been controversial for the
IEEE 802.11 family of protocols [3] since they treat unicast and broadcast frames
differently. Radio signals are likely to overlap with others in a geographical area,
and two or more stations will attempt to transmit using the same channel leading
to contention. Broadcast transmissions are inherently unreliable and more prone to
contention since the MAC specification in IEEE 802.11 does not request explicit
acknowledgements (ACK packets) on receipt of broadcast packets, to avoid the ACK
storm phenomenon, which appears when all successful receivers attempt to send back
an ACK simultaneously and consequently congest the channel. This has not changed
in the IEEE 802.11p amendment.
A MAC protocol is part of the data link layer (L2) of the Open Systems Intercon-
nection model (OSI model) and defines the rules of how the various network stations
share access to the channel. The de facto MAC layer used in IEEE 802.11-based net-
works is the carrier sense multiple access with collision avoidance (CSMA/CA)
protocol. It is a simple decentralized contention-based access scheme
which has been extensively tested in WLANs and mobile ad hoc networks (MANETs).
The IEEE 802.11p stack also employs the classic CSMA/CA MAC. Although the
proposed stack works fine for sparse VANETs with few nodes, it quickly shows its
inability to accommodate increased network traffic because of the lack of ACKs.
The lack of ACKs not only makes transmissions unreliable but also does not provide
any feedback mechanism for the CSMA/CA backoff mechanism. So it cannot adapt
and resolve contention among stations when the network is congested.
The DSRC operation requires that L1 and L2 must be built in a way that they
can handle a large number of contending nodes in the communication zone, on the
order of 50–100. The system should not collapse from saturation even if this number
is exceeded. Useful data for transportation purposes can be technical (e.g. vehicular,
proximity sensors, radars), crowd-sourced (e.g. maps, environment, traffic, parking)
or personal (e.g. Voice over Internet Protocol (VoIP), Internet radio, routes). We believe
that a significant part of this data will be exchanged through V2V links, making system
scalability a critical issue to address. There is a need for an efficient MAC protocol
for V2V communication purposes that adapts to the VANET’s density and transmitted
data rate, since such network conditions are not known a priori.
wireless links [7]. They are a subclass of MANETs, but in this case, the mobile sta-
tions are embedded on vehicles and the stationary nodes are roadside units (RSUs).
There are more differences from classic MANETs, since VANETs are limited to road
topology while moving, meaning that potentially we could predict the future positions
of the vehicles to be used for, e.g., better routing and traffic management. Additionally,
since vehicles do not have the energy restrictions of typical MANET nodes, they
can feature significant computational, communication and sensing capabilities [8].
Because of these capabilities and opportunities, many applications are envisioned for
deployment on VANETs, ranging from simple exchange of status or safety messages
between vehicles to large-scale traffic management, Internet Service provisioning
and other infotainment applications.
[Figure: the VANET system architecture across the in-vehicle, ad hoc and infrastructure domains (V2V, V2I and V2B links, RSUs, cellular access, vehicular cloud and Internet provider), together with representative applications such as driver assistance, cooperative driving, obstacle detection, neighbouring-vehicle awareness, lane passing, infotainment, intelligent traffic control, vehicle tracking and environmental/road data dissemination.]
The most suitable architecture for a VANET would be the IBSS. An STA (node)
within an IBSS acts as the AP and periodically broadcasts the SSID and other infor-
mation. The rest of the nodes receive these packets and synchronize their time and
frequency accordingly. Communication can only be established as long as the STAs
belong to the same service set.
The IEEE 802.11p amendment defines a mode called "Outside the context of a
BSS" in its L2, which enables exchanging data without the need for the station to
belong to a BSS, and thus without the overhead of the association and
authentication procedures with an AP before exchanging data.
DSRC defines seven licenced channels, as seen in Figure 7.3, each of 10 MHz
bandwidth: six service channels (SCHs) and one control channel (CCH). All safety messages, whether
transmitted by vehicles or RSUs, are to be sent in the CCH, which has to be regularly
monitored by all vehicles. The CCH can also be used by RSUs to inform approaching
vehicles of their services, with an SCH then used to exchange data with interested vehicles.
Figure 7.3 The channels available for 802.11p
[Figure: the WAVE protocol stack, with safety applications over WSMP and non-safety applications over TCP/UDP and IPv6, supported by IEEE 1609.3 (networking services) and IEEE 1609.2 (security).]
WSM version Security type Channel number Data rate TX Power PSID Length DATA
1 byte 1 byte 1 byte 1 byte 1 byte 4 bytes 2 bytes variable
Control Protocol (TCP)/UDP, which acts as an identifier indicating which application
a specific WSM is heading towards. To reduce latency, WSMP exchanges do not
necessitate the formation of a BSS, which is a requirement for SCH exchanges. The
WSMP format can be seen in Figure 7.5.
However, WSMP is not able to support classic Internet applications or the
exchange of multimedia, and it does not need to, since such applications are more
tolerant of delay or fluctuations in network performance. By supporting the IPv6
stack, which is open and already widely deployed, third-party Internet services are
easily deployable in a vehicular environment, and the cost of deployment would be
significantly lower for private investors.
function (DCF). It employs a CSMA/CA algorithm. DCF defines two access mech-
anisms to enable fair packet transmission: a two-way handshake (basic mode) and a
four-way handshake (request-to-send/clear-to-send (RTS/CTS)).
Under the basic access mechanism, a node wishing to transmit has to
sense the channel for an interframe space (IFS) known as the DCF IFS (DIFS). If the channel
is found busy during this interval, the node does not transmit but instead waits for an
additional DIFS interval plus a specific period of time known as the backoff interval,
and then tries again. If the channel is not found busy for a DIFS interval, the node
transmits.
Another optional mechanism for transmitting data packets is RTS/CTS reserva-
tion scheme. Small RTS/CTS packets are used to reserve the medium before large
packets are transmitted.
For unicast packet transmissions, in the case of a successful reception, the desti-
nation will send an ACK to the source node after a short IFS (SIFS), so that the ACK
can be given priority (since SIFS < DIFS). If the source does not receive an ACK
within a set time frame, it reactivates the sending process after the channel remains
idle for an extended IFS. If two or more nodes decrease their backoff counter to 0
simultaneously, a collision occurs. For each retransmission attempt (because of a colli-
sion and no ACK), the contention window (CW) in use is doubled, until it reaches CW_max. Upon successful
transmission, the CW resets to CW_min. The operation of CSMA/CA for both unicast and
broadcast transmissions can be seen in Figure 7.7.
Figure 7.7 A CSMA/CA cycle for both unicast and broadcast cases. It manages
channel access among transmitting nodes A and B
Two problems appear with the binary exponential backoff (BEB) mechanism when trying to establish
unicast communication among many highly mobile nodes. First, in dense wireless
networks such as VANETs, there is a higher probability that more than one node chooses
the same CW value, resulting in collisions. Second, every time a collision occurs, the CW
is doubled to avoid further collisions. However, given that the network density of a VANET
can vary a lot over small time periods because of high mobility, a node with a large
CW (because of previous failed transmissions) will wait longer than it needs to before
transmitting under lighter network conditions. This results in unnecessary delay.
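For reference, the sketch below captures the standard BEB rule being criticised: the CW doubles towards CW_max on failure, resets to CW_min on success, and the backoff is drawn uniformly from [0, CW]; the CW_min/CW_max values are illustrative defaults rather than the exact EDCA parameters.

```python
import random

CW_MIN, CW_MAX = 15, 1023    # illustrative defaults (values of the form 2^k - 1)

def next_backoff(cw, last_tx_succeeded):
    # Standard BEB: double the window after a failure, reset after a success,
    # then draw the backoff slot count uniformly from [0, CW].
    cw = CW_MIN if last_tx_succeeded else min(2 * (cw + 1) - 1, CW_MAX)
    return cw, random.randint(0, cw)

cw = CW_MIN
for success in [False, False, True]:
    cw, slots = next_backoff(cw, success)
    print("CW =", cw, "backoff slots =", slots)
```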
AC    Traffic type          CWmin    CWmax    AIFSN
3     Safety related        3        7        2
2     Voice                 3        7        3
1     Best effort           7        15       6
0     Background traffic    15       1,023    9
lowest arbitration IFS (AIFS) and CW size, so they are more likely to win the internal
contention and affect the transmission delay as little as possible (up to seven time slots
for unicast and up to three time slots for broadcast transmissions). The QoS require-
ments for various vehicular networking applications can be found in Table 7.3, taken
from [12].
other hand, when the traffic load of the network is low, a small CW size is needed so
that potential senders can access the wireless medium with a short delay [18], thus
making more efficient use of the channel bandwidth. Additionally, the time the channel
is idle because of nodes being in the backoff stage could be minimized. In an ideal
situation, there would be zero idle time (which is essentially lost and is synonymous with
bandwidth wastage) between messages, with the exception of the DIFS [23].
7.6.3 Q-learning
There are, though, many practical scenarios, such as the channel access control prob-
lem studied in this work, for which the transition probability P_{π(s)}(s, s′) or the reward
function R_{π(s)}(s, s′) is unknown, which makes it difficult to evaluate the policy π.
Q-learning [36,37] is an effective and popular algorithm for learning from delayed
reinforcement to determine an optimal policy π in the absence of transition probabil-
ity. It is a form of model-free RL which provides agents the ability to learn how to act
optimally in Markovian domains by experiencing the consequences of their actions,
without requiring maps of these domains.
In Q-learning, the agent maintains a table of Q[S, A], where S is the set of states
and A is the set of actions. At each discrete time step t = 1, 2, . . . , ∞, the agent
observes the state st ∈ S of the MDP, selects an action at ∈ A, receives the resultant
reward rt and observes the resulting next state st+1 ∈ S. This experience (st , at , rt , st+1 )
updates the Q-function at the observed state-action pair, thus providing the updated
Q(s_t, a_t). The algorithm, therefore, is defined by the function Q(s, a) that calculates the
quality of a state–action (s, a) combination. The goal of the agent is to maximize its
cumulative reward. The core of the algorithm is a value iteration update. It assumes
the current value and makes a correction based on the newly acquired information,
as in the following equation:
Q(s_t, a_t) ← Q(s_t, a_t) + α × [r_t + γ × max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)],    (7.4)
where the discount factor γ models the importance of future rewards. A factor of γ = 0
will make the agent “myopic” or short-sighted by only considering current rewards,
while a factor close to γ = 1 will make it strive for a high long-term reward. The
learning rate α quantifies to what extent the newly acquired information will override
the old information. An agent with α = 0 will not learn anything, while with α = 1, it
would consider only the most recent information. The quantity max_{a_{t+1}∈A} Q(s_{t+1}, a_{t+1})
is the maximum Q value among the possible actions in the next state. In the following
sections, we present how (7.4) is employed as a learning, self-improving control method
for managing channel access among IEEE 802.11p stations.
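The sketch below is a minimal tabular implementation of the update in (7.4) with an ε-greedy choice over a set of candidate CW values; the state encoding, the toy reward model and the hyper-parameters are illustrative assumptions rather than the exact controller evaluated later in this chapter.

```python
import numpy as np

CW_SET = [3, 7, 15, 31, 63, 127, 255]     # actions: candidate contention windows
ALPHA, GAMMA_DISC, EPS = 0.1, 0.5, 0.1    # learning rate, discount factor, exploration rate

class QAgent:
    def __init__(self, n_states):
        self.Q = np.zeros((n_states, len(CW_SET)))

    def act(self, s, rng):
        if rng.random() < EPS:                       # exploration
            return int(rng.integers(len(CW_SET)))
        return int(np.argmax(self.Q[s]))             # exploitation

    def update(self, s, a, r, s_next):
        target = r + GAMMA_DISC * self.Q[s_next].max()
        self.Q[s, a] += ALPHA * (target - self.Q[s, a])   # Eq. (7.4)

# Toy usage: reward +1 if an overheard rebroadcast confirmed the packet, else -1.
rng = np.random.default_rng(0)
agent = QAgent(n_states=4)
s = 0
for _ in range(1000):
    a = agent.act(s, rng)
    delivered = rng.random() < (0.4 + 0.08 * a)      # toy model: larger CW, fewer collisions
    r, s_next = (1.0 if delivered else -1.0), int(rng.integers(4))
    agent.update(s, a, r, s_next)
    s = s_next
print("preferred CW:", CW_SET[int(np.argmax(agent.Q.mean(axis=0)))])
```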
The adaptive backoff problem fits into the MDP formulation. RL is used to design a
MAC protocol that selects the appropriate CW parameter based on gained experience
from its interactions with the environment within an immediate communication zone.
The proposed MAC protocol features a Q-learning-based algorithm that adjusts the
CW size based on binary feedback from probabilistic rebroadcasts in order to avoid
packet collisions.
Figure 7.10 Trace of CW over time for a station in a 100-car network. The first
stage is the a priori controller training phase via (4) for 200 sec (or
Ndecay = 2000 original packets), then online stage for the remaining
time, with an exploration to exploitation ratio of 1:9
Figure 7.11 Mean network-wide CW versus training time (second half) for
networks of different densities using the Q-learning-based MAC
As shown in Figure 7.11, the Q-learning-based MAC discovers the optimum CW size of the stations in three networks of different densities.
The first step of the MAC protocol would be to set the default CW of the station
to the minimum possible value, which is suggested by the IEEE 802.11p standard.
After that, the node makes an exploratory move with probability ε (exploration) or
picks the best known action to date (highest Q value) with probability 1 − ε.
Received packet rebroadcasts can be used as ACKs, since some will definitely be overheard by the source vehicle, even assuming that the vehicles move at the maximum speed limit. These rebroadcasts can happen for forwarding purposes, and they enhance the reliability of the protocol, since they allow the original packet senders to detect collisions and provide a means to reward them when a packet is broadcast successfully.
We use probabilistic rebroadcasting for simplicity, but various routing protocols can
be used instead.
Every time a packet containing original information is transmitted, a timer is
initiated which waits for a predefined time for an overheard retransmission of that
packet, which will have the same MessageId. These broadcast packets are useful
for a short lifetime, which is the period between refreshes. So a rebroadcast packet,
received after that period, is not considered to be a valid ACK because the information
will not be relevant any more, since the nodes in VANETs attempt to broadcast fresh
information frequently (i.e. 1–10 Hz).
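A minimal sketch of this implicit-ACK bookkeeping is given below; the timeout value, the reward values and the function names are illustrative assumptions rather than the exact implementation.

    import time

    ACK_TIMEOUT = 0.1   # assumed validity period between refreshes (s), illustrative

    # Original packets awaiting an overheard rebroadcast, keyed by MessageId.
    pending = {}

    def on_original_sent(message_id):
        """Start the timer when an original packet is broadcast."""
        pending[message_id] = time.monotonic()

    def on_rebroadcast_heard(message_id):
        """Treat an overheard rebroadcast as an implicit ACK only if still fresh."""
        sent_at = pending.pop(message_id, None)
        if sent_at is None:
            return None                      # unknown or already acknowledged
        if time.monotonic() - sent_at > ACK_TIMEOUT:
            return None                      # stale: not a valid ACK any more
        return +1                            # positive reward for the learning agent

    def poll_timeouts():
        """Expired entries are treated as losses and yield a negative reward."""
        now = time.monotonic()
        expired = [mid for mid, t in pending.items() if now - t > ACK_TIMEOUT]
        for mid in expired:
            del pending[mid]
        return [(mid, -1) for mid in expired]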
[Figure: connected-car protocol stack — networking layer with WSMP, CoAP, UDP and IPv6; link layer with the IEEE 1609.4 MAC extension and the IEEE 802.11p MAC/PHY (VEINS) for V2V/V2I communication, alongside Ethernet, VPLC and the CAN/LIN/FlexRay buses connecting data-generating sensors for intra-vehicle communication]
7.8.3 Implementation
The simulation environment on which novel medium access algorithms are to be
evaluated uses SUMO and open data to reproduce accurate car mobility [42]. The
map is extracted from OpenStreetMap and converted to an XML file which defines the road network. Then random trips are generated from this road network file, and
finally these trips are converted to routes and traffic flow. The resulting files are
used in SUMO for live traffic simulation as depicted in Figure 7.13. The vehicles are
dynamically generated with unique IDs shown in green labels.
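The toolchain just described can be scripted, for example, as follows; the file names are placeholders and only commonly used options of the standard SUMO utilities (netconvert, randomTrips.py and duarouter) are assumed here.

    import subprocess

    # Illustrative sketch of the SUMO preprocessing chain described above.
    subprocess.run(["netconvert", "--osm-files", "map.osm",
                    "--output-file", "map.net.xml"], check=True)      # OSM -> road network
    # randomTrips.py ships in SUMO's tools directory.
    subprocess.run(["python", "randomTrips.py", "-n", "map.net.xml",
                    "-o", "trips.trips.xml"], check=True)             # random trips
    subprocess.run(["duarouter", "--net-file", "map.net.xml",
                    "--route-files", "trips.trips.xml",
                    "--output-file", "routes.rou.xml"], check=True)   # trips -> routes/flows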
Each node within OMNeT++, either mobile (car) or static (RSU) consists of a
network interface that uses the 802.11p PHY and MAC, and the application layer that
describes a basic safety message exchange, and a mobility module. A randomly chosen car broadcasts a periodic safety message, much like the ones specified in WSMP.
Listing 7.1 SUMO scripts and parameters to produce the needed XML files
As well as safety message exchange, connected cars can provide extra functionality and enable driving-assistance and infotainment systems, such as downloading city map content from RSUs, exchanging video for extended driver vision or even uploading traffic information to the cloud towards an efficient traffic-light system. The protocols used for such applications would differ from WSMP; Internet protocols (IPv6, UDP) are used instead, given the pervasiveness of IP-based applications.
Figure 7.14 shows an example of V2V connectivity, where a car broadcasts a safety
message to neighbouring cars within range.
effects (hidden/exposed terminals). The artificial campus map used for simulations
can be seen in Figure 7.15.
The achieved improvement on link-level contention was of primary concern, so a
multitude of tests were run for a single hop scenario, with every node being within the
range of the others. By eliminating the hidden terminal problem from the experiment
and setting an infinite queue size, packet losses from collisions can be accurately
measured. A multi-hop scenario is also presented, which makes the hidden terminal
effect apparent in the performance of the network.
The simulation run time for the proposed MAC protocol consists of two stages,
as seen in Figure 7.10. First is the approximate controller training stage, which lasts
for Ndecay = 1,800 transmitted packets (or 180 s with fb = 10 Hz). Then follows the
evaluation or online period which lasts for 120 s, in which the agent acts with an
ε = α = 0.1. During this time, we benchmark the effect of the trained controllers
regarding network performance as well as keep performing some learning for the
controller augmentation. For IEEE 802.11p simulations, only the evaluation stage is
needed, which lasts for the same time.
All cars in the network are continuously transmitting broadcast packets, such as
CAMs with a period Tb = (1/fb ) = 100 ms. The packets are transmitted using the
highest priority, voice traffic (AC_VO) AC. In VANETs, the network density changes
depending on location and time of the day. We test the performance of the novel MAC against the standard IEEE 802.11p protocol for different numbers of cars. The data rate is set at 6 Mbps so it can conveniently accommodate hundreds of vehicles within the DSRC communication range. Simulation parameters can be found in Table 7.4.
Figure 7.16 PDR versus network density for broadcasting of 256-byte packets
with fb = 10 Hz
The packet size used in this scenario is 256 bytes, and the broadcasting frequency fb is set at 10 Hz.
Figure 7.16 shows the increase in goodput when using this novel MAC protocol,
expressed as a PDR. When using the standard IEEE 802.11p, PDR decreases in
denser networks due to the increased collisions between data packets.
The PDR for the proposed Q-learning MAC is measured after the initial,
exploratory phase (since the agent by then has gained significant experience). We
observed a 37.5% increase in performance (original packets delivered) in a network
formed of 80 cars when using the modified, “learning” MAC. There is a slight loss
in performance (4%) for 20-car networks. In such sparse networks, the minimum CW is optimal, since with a big CW (waiting for more backoff time slots), transmission opportunities can be lost and the channel access delay will increase. When using our
Figure 7.17 Packet Return Time (delay) versus network density for broadcasting
of 256-byte packets with fb = 10 Hz
learning protocol, the agent still explores larger CW levels 10% of the time (ε = 0.1),
for better adaptability and augmentation of its initial controller. When the network
density exceeds 40 cars, the proposed learning MAC performs much better regarding
successful deliveries.
The round-trip time (RTT) shown in Figure 7.17 is defined as the length of time
it takes for an original broadcast packet to be sent plus the length of time it takes for
a rebroadcast of that packet to be received by the original sender. We can see that the
increased CW of the learning MAC adds to the channel-access delay time. The worst
case scenario simulated is for 100 simultaneous transceivers within the immediate
range of each other, in which the average RTT doubles to 32.8 ms when using the
Q-learning MAC. Given that both the transmission and heard retransmission are of
the same packet size, we can assume that the mean packet delivery latency is 16.4 ms
when using the learning MAC instead of 8 ms for baseline IEEE 802.11p, while PDR
is improved by 54%.
Figure 7.18 PDR versus packet size for 60 vehicles broadcasting with fb = 10 Hz
Figure 7.19 PDR versus network density for broadcasting of 256-byte packets with
fb = 10 Hz in a two-hop scenario
We see that because the hidden terminal phenomenon appears, the performance
deteriorates compared to the single hop scenario, but the performance gain regarding
packet delivery is still apparent when using Q-learning to adapt the backoff. Packets
lost are not recovered since we are concerned with the performance of the link layer.
7.10 Conclusion
A contention-based MAC protocol for V2V/V2I transmissions was introduced in this
chapter. It relies on Q-learning to discover the optimum CW by continuously interact-
ing with the network. Simulations were developed to demonstrate the effectiveness
of this learning-based MAC protocol. Results show that the proposed method allows the network to scale better with increasing network density and to accommodate higher packet delivery rates compared to the IEEE 802.11p standard. This translates to more reliable packet delivery and higher system throughput, while maintaining acceptable delay levels. Future work will focus on how the learning MAC responds to drastic changes in the networking environment by invoking the ε-decay function while online, as well as on improving fairness and transmission latency.
References
[1] IEEE. IEEE Standard for Information Technology – Telecommunications and Information Exchange between Systems – Local and Metropolitan Area Networks – Specific Requirements Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications Amendment 6: Wireless Access in Vehicular Environments, IEEE, pp. 1–51, 2010.
[2] Pressas A, Sheng Z, Ali F, et al. Contention-based learning MAC protocol
for broadcast vehicle-to-vehicle communication. In: 2017 IEEE Vehicular
Networking Conference (VNC); 2017. p. 263–270.
[3] Oliveira R, Bernardo L, and Pinto P. The influence of broadcast traffic
on IEEE 802.11 DCF networks. Computer Communications. 2009;32(2):
439–452. Available from: http://www.sciencedirect.com/science/article/pii/
S0140366408005847.
[4] Delgrossi L, and Zhang T. Vehicle Safety Communications: Protocols,
Security, and Privacy; 2012. Available from: http://dx.doi.org/10.1002/
9781118452189.ch3.
[5] Navet N, and Simonot-Lion F. In-vehicle communication networks – a
historical perspective and review; 2013.
[6] Ku I, Lu Y, Gerla M, et al. Towards Software-Defined VANET: Architecture
and Services; 2014.
[7] Achour I, Bejaoui T, and Tabbane S. Network coding approach for vehicle-to-vehicle communication: principles, protocols and benefits. In: 2014 22nd International Conference on Software, Telecommunications and Computer Networks, SoftCOM 2014; 2014. p. 154–159.
[21] Xia X, Member N, Niu Z, et al. Enhanced DCF MAC scheme for provid-
ing differentiated QoS in ITS. In: Proceedings The 7th International IEEE
Conference on Intelligent Transportation Systems (IEEE Cat No04TH8749);
2004. p. 280–285. Available from: http://ieeexplore.ieee.org/lpdocs/epic03/
wrapper.htm?arnumber=1398911.
[22] Qiu HJF, Ho IWH, Tse CK, et al. A methodology for studying 802.11p
VANET broadcasting performance with practical vehicle distribution. IEEE
Transactions on Vehicular Technology. 2015;64(10):4756–4769.
[23] Stanica R, Chaput E, and Beylot AL. Enhancements of IEEE 802.11p proto-
col for access control on a VANET control channel. In: IEEE International
Conference on Communications; 2011.
[24] Miao L, Djouani K, Wyk BJV, et al. Performance evaluation of IEEE 802.11p MAC protocol in VANETs safety applications. In: Wireless Communications and Networking Conference (WCNC), 2013 IEEE; 2013. p. 1663–1668.
[25] Choi J, So J, and Ko Y. Numerical analysis of IEEE 802.11 broadcast scheme in multihop wireless ad hoc networks. In: International Conference on Information Networking; 2005. Available from: http://www.springerlink.com/index/10.1007/b105584; http://link.springer.com/chapter/10.1007/978-3-540-30582-8_1.
[26] Torrent-Moreno M, Mittag J, Santi P, et al. Vehicle-to-vehicle communication:
Fair transmit power control for safety-critical information. IEEE Transactions
on Vehicular Technology. 2009;58(7):3684–3703.
[27] Mertens Y, Wellens M, and Mahonen P. Simulation-based performance eval-
uation of enhanced broadcast schemes for IEEE 802.11-based vehicular
networks. In: IEEE Vehicular Technology Conference; 2008. p. 3042–3046.
[28] Sutton RS, and Barto AG. Introduction to Reinforcement Learning. 1st ed.
Cambridge, MA, USA: MIT Press; 1998.
[29] Bellman R. A Markovian Decision Process. Indiana University Mathematics
Journal. 1957;6:679–684.
[30] Shoaei AD, Derakhshani M, Parsaeifard S, et al. MDP-based MAC design with
deterministic backoffs in virtualized 802.11 WLANs. IEEE Transactions on
Vehicular Technology. 2016;65(9):7754–7759.
[31] Tse Q, Si W, and Taheri J. Estimating contention of IEEE 802.11 broadcasts
based on inter-frame idle slots. In: Proc. IEEE Conf. on Local Computer
Networks – Workshops; 2013. p. 120–127.
[32] Bianchi G. Performance analysis of the IEEE 802.11 distributed coor-
dination function. IEEE Journal on Selected Areas in Communications.
2000;18(3):535–547.
[33] Liu Z, and Elhanany I. RL-MAC: a QoS-aware reinforcement learning based
MAC protocol for wireless sensor networks. In: Proc. IEEE Int. Conf. on
Netw., Sens. and Control; 2006. p. 768–773.
[34] Wu C, Ohzahata S, Ji Y, et al. A MAC protocol for delay-sensitive VANET
applications with self-learning contention scheme. In: Proc. IEEE Consumer
Comm. and Netw. Conference; 2014. p. 438–443.
[35] Yang Q, Xing S, Xia W, et al. Modelling and performance analysis of dynamic
contention window scheme for periodic broadcast in vehicular ad hoc networks.
IET Communications. 2015;9(11):1347–1354.
[36] Watkins CJCH, and Dayan P. Q-learning. Machine Learning. 1992;8(3):
279–292. Available from: http://dx.doi.org/10.1007/BF00992698.
[37] Watkins CJCH. Learning from Delayed Rewards. Cambridge, UK: King’s Col-
lege; 1989. Available from: http://www.cs.rhul.ac.uk/∼chrisw/new_thesis.pdf.
[38] Lessmann J, Janacik P, Lachev L, et al. Comparative study of wireless net-
work simulators. In: Proc. of Seventh International Conference on Networking
(ICN). IEEE; 2008. p. 517–523.
[39] Varga A, and Hornig R. An overview of the OMNeT++ simulation environ-
ment. In: Proc. of the 1st International Conference on Simulation Tools and
Techniques for Communications, Networks and Systems & Workshops. Simu-
Tools; 2008. p. 60:1–60:10. Available from: http://dl.acm.org/citation.cfm?
id=1416222.1416290.
[40] INET Framework; 2010. [Online; accessed 21-June-2016]. http://inet.
omnetpp.org.
[41] Behrisch M, Bieker L, Erdmann J, et al. SUMO – Simulation of Urban MObility – an overview. In: Proceedings of the 3rd International Conference on Advances in System Simulation (SIMUL'11); 2011. p. 63–68. Available from: http://www.thinkmind.org/index.php?view=article&articleid=simul_2011_3_40_50150.
[42] Pressas A, Sheng Z, Fussey P, et al. Connected vehicles in smart cities:
interworking from inside vehicles to outside. In: 2016 13th Annual IEEE Inter-
national Conference on Sensing, Communication, and Networking (SECON);
2016. p. 1–3.
Chapter 8
Machine-learning-based perceptual video coding
in wireless multimedia communications
Shengxi Li1 , Mai Xu2 , Yufan Liu3 , and Zhiguo Ding4
8.1 Background
At present, multimedia applications, such as Facebook and Twitter, are becoming
integral components in the daily lives of millions, leading to the explosion of big
data. Among them, videos are one of the largest types of big data [1], thus posing
a great challenge to the limited communication and storage resources. Meanwhile, due to more powerful camera hardware, video resolutions are significantly increasing, further intensifying the demand for communication and storage resources. Aiming at overcoming this resource-hungry issue, a set of video-coding standards have been proposed to condense video data, e.g., MPEG-2 [2], MPEG-4 [3], VP9 [4], H.263 [5] and H.264/AVC [6].
Most recently, as the successor of H.264/AVC, HEVC [7] was formally approved
in April, 2013. In HEVC, several new features, e.g., the quadtree-based coding
1 Department of Electrical and Electronic Engineering, Imperial College London, UK
2 Department of Electronic and Information Engineering, Beihang University, China
3 Institute of Automation, Chinese Academy of Sciences, China
4 School of Electrical and Electronic Engineering, The University of Manchester, UK
1 Planar and DC are two other intra-prediction modes.
Figure 8.1 An example of HEVC-based compression for Lena image, with different
bit allocation emphasis on ROIs. Note that (a) is the heat map of eye
fixations; (b), (c) and (d) are compressed by HEVC-MSP at 0.1 bpp
with no, well balanced and more emphasis on face regions. The
difference mean opinion scores (DMOS) for (b), (c) and (d) are 63.9,
57.5 and 70.3, respectively [11]
The organization of this chapter is as follows. The literature review is first intro-
duced in Section 8.2, from perspectives of perceptual models and incorporations in
video coding. We then present in Section 8.3 the recursive Taylor expansion (RTE)
method for optimal bit allocation toward the perceptual distortion and also provide rig-
orous proofs. The computational analysis of the proposed RTE method is introduced in
Section 8.4. For experimental validations, we first verify the proposed RTE method
on compressing one single image/frame in Section 8.5, followed by the results on
compressing video sequences in Section 8.6.
Generally speaking, the main parts of perceptual video coding are perceptual models, perceptual-model incorporation in video coding and performance evaluation. Specifically, perceptual models, which imitate the output of the HVS to specify the ROIs and non-ROIs, need to be designed first for perceptual video coding. Second, on the basis of the perceptual models and existing video-coding standards, perceptual-model incorporation in video coding needs to be developed to encode/decode the videos, mainly through removing their perceptual redundancy. Rather than incorporating perceptual models in video coding, some machine-learning-based image/video compression approaches have also been proposed during the past decade.
video coding. Mimicking processing in primate occipital and posterior parietal cortex,
Itti’s model integrates low-level visual cues, in terms of color, intensity, orientation,
flicker and motion, to generate a saliency map for selecting ROIs [17].
The other class of visual attention models is top-down processing [14,18,25–29].
The top-down visual attention models are more frequently applied to video applications, since they are more correlated with what attracts human attention. For instance, the human face [16,18,26] is one of the most important factors that draw top-down attention, especially for conversational video applications. Beyond that, a hierarchical perceptual model of the face [18] has been established, endowing unequal importance within the face region. However, the abovementioned approaches are unable to quantify the importance of the face region.
In this article, we quantify the saliency of the face and facial features by learning the saliency distribution from the eye-fixation data of training videos, collected through an eye-tracking experiment. Then, after detecting the face and facial features to automatically identify the ROI [18], the saliency map of each frame of the encoded conversational video is assigned using the learnt saliency distribution. Although the same ROI is utilized as in [18], the weight map of our scheme is more reasonable as a perceptual model for video coding, as it is in light of the learnt distribution of saliency over face regions. Note that the difference between ROI and saliency is that the former refers to the region that may attract visual attention, while the latter refers to the probability of each pixel/region attracting visual attention.
QP values are assigned to all MBs, which enhances the perceived visual quality of
compressed videos. In addition, after obtaining the importance map, the other encod-
ing parameters, such as mode decision and motion estimation search, are adjusted to
provide ROIs with more encoding resources. Xu et al. [18] proposed a new weight-
based unified R–Q (URQ) rate control scheme for compressing conversational videos,
which assigns bits according to bpw, instead of bpp in conventional URQ scheme.
Then, the quality of face regions is improved such that its perceived visual quality
is enhanced. The scheme in [18] is based on the URQ model [36], which aims at establishing the relationship between bit-rate R and quantization parameter Q, i.e., the R–Q relationship. However, since various flexible coding parameters and structures are applied in HEVC, the R–Q relationship is hard to estimate precisely [37]. There-
fore, Lagrange multiplier λ [38], which stands for the slope of R–D curve, has been
investigated. According to [37], the relationship between λ and R can be better char-
acterized in comparison with R–Q relationships. This way, on the basis of R–λ model,
the state-of-the-art R–λ rate control scheme [39] has better performance than the
URQ scheme. Therefore, on the basis of the latest R–λ scheme, this article proposes
a novel weight-based R–λ scheme to further improve the perceived video quality
of HEVC.
Denoting by $d_i$, $r_i$ and $\lambda_i$ the distortion, bits and R–D slope for the $i$th CTU, respectively, the R–D relationship and R–λ model are formulated as follows:
$d_i = c_i\, r_i^{-k_i}, \quad (8.1)$

and

$\lambda_i = -\frac{\partial d_i}{\partial r_i} = c_i k_i \cdot r_i^{-k_i-1}, \quad (8.2)$
where ci and ki are the parameters that reflect the content of the ith CTU. In the R–λ
approach [37], ri is first allocated according to the predicted mean absolute difference,
and then its corresponding λi is obtained using (8.2). By adopting a fitting relationship
between λi and QP, the QPs of all CTUs within the frame can be estimated such that
RC is achieved in HEVC. For more details, refer to [37].
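For illustration, the short sketch below maps a Lagrange multiplier to a QP; the 4.2005/13.7122 constants follow the λ–QP fitting commonly used in HM's R–λ rate control and are an assumption here, since the chapter does not state the exact fitting it adopts.

    import math

    def qp_from_lambda(lam):
        """Map a CTU-level Lagrange multiplier to a QP (assumed HM-style fitting),
        clipped to the valid HEVC QP range [0, 51]."""
        qp = 4.2005 * math.log(lam) + 13.7122
        return int(min(51, max(0, round(qp))))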
However, for HEVC-MSP, ci and ki cannot be obtained when encoding CTUs.
Thus, it is difficult to directly apply the R–λ RC approach to HEVC-MSP. In the work
of [52], the sum of the absolute transformed differences (SATD), calculated by the
sum of Hadamard transform coefficients, is utilized for HEVC-MSP. Specifically,
the modified R–λ model is
$\lambda_i = \alpha_i \left(\frac{s_i}{r_i}\right)^{\beta_i}, \quad (8.3)$
where αi and βi are the constants for all CTUs and remain the same when encoding
an image. Moreover, si denotes the SATD for the ith CTU, which measures the CTU
texture complexity. Nevertheless, SATD is too simple to reflect image content, leading
to an inaccurate R–D relationship during RC.
To avoid the above issues, we adopt a preprocessing process in calculating ci and
ki . After pre-compressing, the pre-encoded distortion, bits and λ can be obtained for
the ith CTU, which are denoted as d̄i , r̄i and λ̄i , respectively. Then, the RC-related
parameters, ci and ki , can be estimated upon (8.1) and (8.2) before encoding the
ith CTU:
$c_i = \frac{\bar d_i}{\bar r_i^{\,-\bar\lambda_i \bar r_i / \bar d_i}}, \quad (8.4)$

and

$k_i = \frac{\bar\lambda_i \cdot \bar r_i}{\bar d_i}. \quad (8.5)$
With the estimated ci and ki , the RC of the R–λ approach [37] can be implemented in
HEVC-MSP.
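A minimal sketch of this estimation step, directly transcribing (8.4) and (8.5), is given below; the function name and the way the pre-encoded statistics are supplied are illustrative assumptions.

    def estimate_rd_parameters(d_bar, r_bar, lam_bar):
        """Estimate the hyperbolic R-D parameters (c_i, k_i) of one CTU from its
        pre-encoded distortion d_bar, bits r_bar and Lagrange multiplier lam_bar."""
        k = lam_bar * r_bar / d_bar          # (8.5): k_i = lambda_bar * r_bar / d_bar
        c = d_bar / (r_bar ** (-k))          # (8.4): c_i = d_bar / r_bar^{-k_i}
        return c, k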
Here, a fast pre-compressing process is developed in our approach, which sets the
maximum CU depth to 0 for all CTUs. We have verified that this fast pre-compressing process increases the computational complexity by only about 5%, slightly more than the 3% of the SATD-based method [52]. However, this process is able to reflect the R–D relationship well, as will be verified in Section 8.5.4.
2 It needs to be pointed out that J in (8.7) is convex with regard to $r_i$ and λ, which ensures the global minimum of problem (8.7).
and bi are available before encoding the image. Once λ is known, ri can be estimated
using (8.9) for achieving the minimum J .
Meanwhile, there also exists a constraint on bit rate, which is formulated as
$\sum_{i=1}^{M} r_i = R. \quad (8.10)$

According to (8.9) and (8.10), we need to find the "proper" λ and bit allocation $r_i$ to satisfy the following equation:

$\sum_{i=1}^{M} r_i = \sum_{i=1}^{M} \left(\frac{w_i a_i b_i}{\lambda}\right)^{b_i} = R. \quad (8.11)$
After solving (8.11) to find the “proper” λ, the target bits can be assigned to each
CTU with the maximum SWPSNR.
Unfortunately, since $a_i$ and $b_i$ vary across different CTUs, (8.11) cannot be solved in closed form directly. Next, the RTE method is proposed to provide a closed-form solution.
$\sum_{i=1}^{M} r_i = \sum_{i=1}^{M} \left(\frac{w_i a_i b_i}{\lambda}\right)^{b_i} = \sum_{i=1}^{M} \tilde r_i \left(\frac{\tilde\lambda}{\lambda}\right)^{b_i} = R, \quad (8.12)$
where $\tilde\lambda$ is the current estimate of λ and $\tilde r_i = (w_i a_i b_i/\tilde\lambda)^{b_i}$ is the corresponding bit allocation. The initial estimate of $\tilde\lambda$ is obtained from the picture-level R–λ model with the fitted constants $\alpha_{\mathrm{pic}}$ and $\beta_{\mathrm{pic}}$ ($\alpha_{\mathrm{pic}} = 6.7542$ and $\beta_{\mathrm{pic}} = 1.7860$ in HM 16.0) and the SATD $s_{\mathrm{pic}}$ of the current picture. Recall that R denotes the target bits allocated to the currently encoded picture.
In the following, the RTE method is proposed to iteratively update $\tilde\lambda$ so that $\tilde\lambda \to \lambda$.
Applying an $n$th-order Taylor expansion to $(\tilde\lambda/\lambda)^{b_i}$ in (8.12) and keeping the first three orders gives

$\sum_{i=1}^{M} \tilde r_i \left(\frac{\tilde\lambda}{\lambda}\right)^{b_i} = \sum_{i=1}^{M} \tilde r_i \left[1 + \frac{\ln(\tilde\lambda/\lambda)}{1!} b_i + \cdots + \frac{(\ln(\tilde\lambda/\lambda))^{n}}{n!} b_i^{n} + \cdots\right] \approx \sum_{i=1}^{M} \tilde r_i \left[1 + \frac{\ln(\tilde\lambda/\lambda)}{1!} b_i + \frac{(\ln(\tilde\lambda/\lambda))^{2}}{2!} b_i^{2} + \frac{(\ln(\tilde\lambda/\lambda))^{3}}{3!} b_i^{3}\right]. \quad (8.14)$

Substituting (8.14) into (8.12) and denoting by $\hat\lambda$ the solution of the truncated equation, (8.12) turns into a cubic equation in $\ln\hat\lambda$:

$R = \sum_{i=1}^{M} \tilde r_i \left[1 + \frac{\ln(\tilde\lambda/\hat\lambda)}{1!} b_i + \frac{(\ln(\tilde\lambda/\hat\lambda))^{2}}{2!} b_i^{2} + \frac{(\ln(\tilde\lambda/\hat\lambda))^{3}}{3!} b_i^{3}\right] = A\,\ln^{3}\hat\lambda + B\,\ln^{2}\hat\lambda + C\,\ln\hat\lambda + D, \quad (8.15)$

where
$A = -\sum_{i=1}^{M} \tilde r_i \frac{b_i^{3}}{6}, \quad B = \sum_{i=1}^{M} \tilde r_i \left(\frac{b_i^{2}}{2} + \frac{b_i^{3}}{2}\ln\tilde\lambda\right), \quad C = -\sum_{i=1}^{M} \tilde r_i \left(b_i + b_i^{2}\ln\tilde\lambda + \frac{b_i^{3}}{2}\ln^{2}\tilde\lambda\right), \quad D = \sum_{i=1}^{M} \tilde r_i \left(1 + b_i\ln\tilde\lambda + \frac{b_i^{2}}{2}\ln^{2}\tilde\lambda + \frac{b_i^{3}}{6}\ln^{3}\tilde\lambda\right).$

Solving this cubic equation yields the closed-form solution

$\hat\lambda = e^{\left(-B - \left(\sqrt[3]{Y_1} + \sqrt[3]{Y_2}\right)\right)/(3A)}, \quad Y_{1,2} = BE + 3A\,\frac{-F \pm \sqrt{F^{2} - 4EG}}{2}, \quad (8.16)$

where $E$, $F$ and $G$ are intermediate coefficients determined by $A$, $B$, $C$, $D$ and $R$, so that $\hat\lambda$ is unique for optimizing bit allocation. After further removing the cubic-order term, (8.14) turns into a quadratic equation. We found that such a quadratic equation may have no real solution or two solutions. Meanwhile, using only one term may lead to a large approximation error and slow convergence speed, while keeping more than four terms probably makes the polynomial equations on $\ln\hat\lambda$ unsolvable.
Therefore, discarding the biquadratic and higher order terms of the Taylor expansion
is the best choice for our approach.
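The recursion can also be illustrated numerically, as in the sketch below: at each iteration the rate constraint is replaced by its third-order Taylor expansion around the current estimate and re-solved. Instead of the closed form (8.16), the cubic is solved here with numpy.roots, and the root-selection heuristic, iteration count and variable names are assumptions of this sketch.

    import numpy as np

    def rte_style_allocation(w, a, b, R, lam0, iters=3):
        """Numerical illustration of the recursive-Taylor-expansion idea for solving
        sum_i (w_i a_i b_i / lambda)^{b_i} = R; returns per-CTU target bits and lambda."""
        w, a, b = map(np.asarray, (w, a, b))
        lam = float(lam0)
        for _ in range(iters):
            r_tilde = (w * a * b / lam) ** b          # allocation under the current estimate
            t = np.log(lam)                           # expansion point ln(lambda)
            # Coefficients of A x^3 + B x^2 + C x + (D - R) = 0 with x = ln(lambda)
            A = -np.sum(r_tilde * b**3) / 6
            B = np.sum(r_tilde * (b**2 / 2 + b**3 * t / 2))
            C = -np.sum(r_tilde * (b + b**2 * t + b**3 * t**2 / 2))
            D = np.sum(r_tilde * (1 + b * t + b**2 * t**2 / 2 + b**3 * t**3 / 6))
            roots = np.roots([A, B, C, D - R])
            reals = [r.real for r in roots if abs(r.imag) < 1e-9]
            # Heuristic root choice for this sketch: the real root nearest the old estimate.
            t_new = min(reals, key=lambda v: abs(v - t)) if reals else t
            lam = float(np.exp(t_new))
        return (w * a * b / lam) ** b, lam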
However, due to the truncation of high-order terms in the Taylor expansion, $\hat\lambda$ estimated by (8.16) may not be an accurate solution to (8.12). Fortunately, as proven in Lemma 8.1, $\hat\lambda$ is more accurate³ than $\tilde\lambda$ when $\tilde\lambda < \lambda$.
Lemma 8.1. Consider $\lambda > \tilde\lambda > 0$, $b_i > 0$, and $R > 0$ for (8.12). When the solution of λ to (8.12) obtained by (8.16) is $\hat\lambda$:

$|\hat\lambda - \lambda| < |\tilde\lambda - \lambda|. \quad (8.17)$
Proof. Since $\hat\lambda$ satisfies the third-order truncation in (8.15), there exists

$R = \sum_{i=1}^{M} \tilde r_i \left(\frac{\tilde\lambda}{\hat\lambda}\right)^{b_i} = \sum_{i=1}^{M} \tilde r_i + \sum_{i=1}^{M} \tilde r_i b_i \frac{\ln(\tilde\lambda/\hat\lambda)}{1!} + \sum_{i=1}^{M} \tilde r_i b_i^{2} \frac{(\ln(\tilde\lambda/\hat\lambda))^{2}}{2!} + \sum_{i=1}^{M} \tilde r_i b_i^{3} \frac{(\ln(\tilde\lambda/\hat\lambda))^{3}}{3!}. \quad (8.18)$

In fact, $0 < (\tilde\lambda/\lambda)^{b_i} < 1$ holds for $0 < \tilde\lambda < \lambda$ and $b_i > 0$. Besides, there exists $R = \sum_{i=1}^{M} \tilde r_i (\tilde\lambda/\lambda)^{b_i}$ in (8.12). Therefore, $\sum_{i=1}^{M} \tilde r_i > R$ can be evaluated. Now suppose that $\hat\lambda \le \tilde\lambda$. Since then $\ln(\tilde\lambda/\hat\lambda) \ge 0$ and $b_i > 0$, the inequality below holds:

$\sum_{i=1}^{M} \tilde r_i + \sum_{i=1}^{M} \tilde r_i b_i \frac{\ln(\tilde\lambda/\hat\lambda)}{1!} + \sum_{i=1}^{M} \tilde r_i b_i^{2} \frac{(\ln(\tilde\lambda/\hat\lambda))^{2}}{2!} + \sum_{i=1}^{M} \tilde r_i b_i^{3} \frac{(\ln(\tilde\lambda/\hat\lambda))^{3}}{3!} > R, \quad (8.19)$

which contradicts (8.18). Hence $\tilde\lambda < \hat\lambda$ and, together with $\hat\lambda < \lambda$ established in Lemma 8.2 below, this yields $\tilde\lambda < \hat\lambda < \lambda$ and therefore (8.17). This completes the proof of Lemma 8.1.

Lemma 8.2. Consider $\lambda, \tilde\lambda > 0$, $b_i > 0$, and $R > 0$ for (8.12). If $\hat\lambda$ is the solution of λ to (8.12) obtained via (8.16), then the following holds:

$\hat\lambda < \lambda. \quad (8.20)$
3 It is obvious that $0 < b_i = 1/(k_i + 1) < 1$ and $R > 0$ in HEVC encoding.
Proof. Toward the Taylor expansion of $\sum_{i=1}^{M} \tilde r_i (\tilde\lambda/\lambda)^{b_i}$ in (8.12), we can obtain the following equations:

$R = \sum_{i=1}^{M} \tilde r_i \left(\frac{\tilde\lambda}{\lambda}\right)^{b_i} = \sum_{i=1}^{M} \tilde r_i + \sum_{i=1}^{M} \tilde r_i b_i \frac{\ln(\tilde\lambda/\hat\lambda)}{1!} + \sum_{i=1}^{M} \tilde r_i b_i^{2} \frac{(\ln(\tilde\lambda/\hat\lambda))^{2}}{2!} + \sum_{i=1}^{M} \tilde r_i b_i^{3} \frac{(\ln(\tilde\lambda/\hat\lambda))^{3}}{3!}$
$= \sum_{i=1}^{M} \tilde r_i + \sum_{i=1}^{M} \tilde r_i b_i \frac{\ln(\tilde\lambda/\lambda)}{1!} + \sum_{i=1}^{M} \tilde r_i b_i^{2} \frac{(\ln(\tilde\lambda/\lambda))^{2}}{2!} + \sum_{i=1}^{M} \tilde r_i b_i^{3} \frac{(\ln(\tilde\lambda/\lambda))^{3}}{3!} + \sum_{i=1}^{M} \tilde r_i b_i^{4} \frac{(\ln(\tilde\lambda/\lambda))^{4}}{4!} + \sum_{i=1}^{M} \tilde r_i b_i^{5} \frac{(\ln(\tilde\lambda/\lambda))^{5}}{5!} + \sum_{i=1}^{M} \tilde r_i b_i^{6} \frac{(\ln(\tilde\lambda/\lambda))^{6}}{6!} + \cdots. \quad (8.21)$

● For the case that $\ln(\tilde\lambda/\hat\lambda) > \ln(\tilde\lambda/\lambda) > 0$, $\hat\lambda < \lambda$ can be achieved.
● For $\lambda > \tilde\lambda > 0$ and $b_i > 0$, we have

$\sum_{i=1}^{M} \tilde r_i b_i^{4} \frac{(\ln(\tilde\lambda/\lambda))^{4}}{4!} + \sum_{i=1}^{M} \tilde r_i b_i^{5} \frac{(\ln(\tilde\lambda/\lambda))^{5}}{5!} + \sum_{i=1}^{M} \tilde r_i b_i^{6} \frac{(\ln(\tilde\lambda/\lambda))^{6}}{6!} + \cdots > 0. \quad (8.22)$

Combining (8.21) and (8.22) then gives

$\sum_{i=1}^{M} \tilde r_i b_i \frac{\ln(\tilde\lambda/\hat\lambda)}{1!} + \sum_{i=1}^{M} \tilde r_i b_i^{2} \frac{(\ln(\tilde\lambda/\hat\lambda))^{2}}{2!} + \sum_{i=1}^{M} \tilde r_i b_i^{3} \frac{(\ln(\tilde\lambda/\hat\lambda))^{3}}{3!} > \sum_{i=1}^{M} \tilde r_i b_i \frac{\ln(\tilde\lambda/\lambda)}{1!} + \sum_{i=1}^{M} \tilde r_i b_i^{2} \frac{(\ln(\tilde\lambda/\lambda))^{2}}{2!} + \sum_{i=1}^{M} \tilde r_i b_i^{3} \frac{(\ln(\tilde\lambda/\lambda))^{3}}{3!}. \quad (8.23)$

Moreover, viewing $\hat\lambda$ and $\lambda$ as a variable $x$, the inequality (8.23) can be analysed through (8.24). The function in (8.24) monotonously decreases to 0 along with the increase of the variable $x$ (until $x \le \tilde\lambda$):

$\sum_{i=1}^{M} \tilde r_i + \sum_{i=1}^{M} \tilde r_i b_i \frac{\ln(\tilde\lambda/x)}{1!} + \sum_{i=1}^{M} \tilde r_i b_i^{2} \frac{(\ln(\tilde\lambda/x))^{2}}{2!} + \sum_{i=1}^{M} \tilde r_i b_i^{3} \frac{(\ln(\tilde\lambda/x))^{3}}{3!}. \quad (8.24)$

Hence, $\hat\lambda < \lambda$.
Therefore, $\hat\lambda < \lambda$ holds for both cases. This completes the proof of Lemma 8.2.
Remark 8.1. Given Lemma 8.2, for the subsequent iterations of the RTE method, $0 < \tilde\lambda < \lambda$ of Lemma 8.1 can be satisfied, since the value of $\tilde\lambda$ has been replaced by that of $\hat\lambda$.

In practice, a difference between the target and actual bits may exist for each CTU. This difference may degrade
RC accuracy. To overcome this, we develop a bit reallocation process to accurately
control bit rates, meanwhile maintaining the optimization for perceptual distortion.
Specifically, for compensating the bit-rate error after encoding the ith CTU, the
target bits for the incoming K CTUs (denoted as Ti+1,i+K ) are updated by
⎛ ⎞
j=i+K
j=M
Ti+1,i+K = rj + ⎝
T− rj ⎠ .
(8.25)
j=i+1 j=i+1
bit-rate error
In (8.25), $\hat T$ is the number of bits remaining for encoding the remaining CTUs, and $r_j$ represents the target bits for the $j$th CTU obtained by our RTE method. Recall that M denotes the total
number of CTUs. Obviously, as seen from (8.25), the bit error is compensated during
encoding the next K CTUs. Here, the RTE method of Section 8.3.3 is applied to
reallocate Ti+1,i+K to the next K CTUs. Note that we follow [52] and [37] to set
K = 4, which means that bits are reassigned in the next four CTUs. Moreover, note
that due to the fast convergence speed of our RTE method, the complexity increases
little for the bit reallocation process.
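A direct transcription of (8.25) is sketched below, with 0-based indexing and illustrative names; how the remaining-bit budget is tracked in the actual encoder is an assumption of this sketch.

    def reallocate_bits(r_target, i, K, T_remaining):
        """Bit-reallocation rule of (8.25): after encoding CTU i, spread the
        accumulated bit-rate error over the next K CTUs."""
        planned_rest = sum(r_target[i + 1:])                 # sum of remaining target bits
        error = T_remaining - planned_rest                   # bits over/under-spent so far
        return sum(r_target[i + 1:i + 1 + K]) + error        # updated T_{i+1, i+K}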
Finally, we summarize our HEVC-based image compression approach in
Figure 8.2. Specifically, we first transplant RC to HEVC-MSP with a simplified
pre-compression process, and the saliency values are detected for the input image.
Then, our RTE method obtains the target bits of each CTU, which can minimize per-
ceptual distortion at a given bit rate. Next, the QP value of each CTU is estimated
using the R–λ model and QP fitting. Note that the bits need to be reallocated in the
following CTUs to bridge the gap between the target and actual bits. In addition, as will be verified in Section 8.4, little computational complexity is introduced by our RTE method, further highlighting the efficiency of our approach.
The convergence speed of the RTE method is first discussed from both theoretical and numerical perspectives. In the numerical analysis, we also provide the practical computational time of our approach.
Lemma 8.3. Let $\hat\lambda$ be the solution of λ to (8.12) after each iteration of our RTE method, and define $\Delta\lambda = (\hat\lambda - \lambda)/\lambda$. After each iteration of the RTE method, $\tilde\lambda$ is replaced by $\hat\lambda$; then $|\Delta\lambda| \to 0$ along with the iterations. Specifically, when $-0.9 < \Delta\lambda < 0$:

$|\Delta\lambda| < 0.04 \quad (8.27)$

holds after two iterations.
Proof. Since $r_i = \tilde r_i \cdot (\tilde\lambda/\lambda)^{b_i}$ and $r_i = (a_i/\lambda)^{b_i}$, we can obtain $\tilde r_i = (a_i/\tilde\lambda)^{b_i}$. Then, by combining (8.14) and (8.15), we obtain the following equation:

$R = \sum_{i=1}^{M} \tilde r_i \left(\frac{\tilde\lambda}{\hat\lambda}\right)^{b_i} = \sum_{i=1}^{M} \tilde r_i + \sum_{i=1}^{M} \tilde r_i b_i \frac{\ln(\tilde\lambda/\hat\lambda)}{1!} + \sum_{i=1}^{M} \tilde r_i b_i^{2} \frac{(\ln(\tilde\lambda/\hat\lambda))^{2}}{2!} + \sum_{i=1}^{M} \tilde r_i b_i^{3} \frac{(\ln(\tilde\lambda/\hat\lambda))^{3}}{3!}. \quad (8.28)$

Since $\forall i,\ b_i = l \in (0, 1)$ and $\sum_{i=1}^{M} r_i = R$, there exists $\sum_{i=1}^{M} \tilde r_i = \sum_{i=1}^{M} r_i \cdot (\lambda/\tilde\lambda)^{l} = R \cdot (\lambda/\tilde\lambda)^{l}$. Next, we can rewrite (8.28) as

$\left(\frac{\lambda}{\tilde\lambda}\right)^{l} \cdot \frac{R\, l^{3}}{3!} \ln^{3}\!\left(\frac{\tilde\lambda}{\hat\lambda}\right) + \left(\frac{\lambda}{\tilde\lambda}\right)^{l} \cdot \frac{R\, l^{2}}{2!} \ln^{2}\!\left(\frac{\tilde\lambda}{\hat\lambda}\right) + \left(\frac{\lambda}{\tilde\lambda}\right)^{l} \cdot \frac{R\, l}{1!} \ln\!\left(\frac{\tilde\lambda}{\hat\lambda}\right) + R \cdot \left(\frac{\lambda}{\tilde\lambda}\right)^{l} = R. \quad (8.29)$
[Figure 8.3: $(\hat\lambda - \lambda)/\lambda$ after each iteration for $l = 1$, $0.5$ and $0.01$; the first iteration yields $-0.5785$ and the second $-0.0378$]
Solving (8.29), $\hat\lambda$ can be written as

$\hat\lambda = \tilde\lambda \cdot e^{\left(1 + 2\left(\sqrt[3]{Z_1} + \sqrt[3]{Z_2}\right)\right)/l}, \quad (8.30)$

where

$Z_{1,2} = -\frac{1}{8} + \frac{1}{8} \cdot \left(-3\left(\frac{\lambda}{\tilde\lambda}\right)^{l} + 2 \pm \sqrt{9\left(\frac{\lambda}{\tilde\lambda}\right)^{2l} - 6\left(\frac{\lambda}{\tilde\lambda}\right)^{l} + 2}\right). \quad (8.31)$
As proven in Lemma 8.2, $\hat\lambda < \lambda$ holds after the first iteration of our RTE method, which means that $\Delta\lambda \in (-1, 0)$. Moreover, we empirically found that $\Delta\lambda$ for all CTUs is restricted to $(-0.9, 0)$ after the first iteration in HEVC-MSP. Then, Lemma 8.3 indicates that $|\Delta\lambda|$ can be reduced to below 0.04 in at most three iterations, quickly approaching 0. This verifies the fast convergence speed of the RTE method in terms of $\Delta\lambda$. Next, we numerically evaluate the convergence speed of our RTE method in terms of $E_a$.
Figure 8.4 Ea versus iteration times of the RTE method at various bit rates. Note
that for (a), the black dots represent λ for each CTU in Lena image.
For (b), all 38 images (from our test set of Section 8.5) were used to
calculate the approximation error Ea and the corresponding standard
deviation along with the increasing iterations [11]
3.4 GHz and 16 GB of RAM. From this test, we found out that one iteration of our
RTE method only consumes approximately 0.0015 ms for each CTU. Since it takes
at most three iterations to acquire the closed-form solution, the computational time
for our RTE method is less than 0.005 ms.
Our approach consists of two parts: bit allocation and reallocation with the RTE
method. For bit allocation, three iterations are sufficient for encoding one image, thus
consuming at most 0.005 ms. For bit reallocation, the computational time depends on
the number of CTUs of the image since each CTU requires at most three iterations
to obtain the reallocated bits. For a 1,600 × 1,280 image, the computational time of
our approach is approximately 2.5 ms because it includes 500 CTUs. This implies the
negligible computational complexity burden of our approach.
4 The ground-truth eye fixations, together with their corresponding images, can be obtained from our website at https://github.com/RenYun2016/TMM2016.
Table 8.2 Details of our test set [11]

From                 Resolution     Images (√ = face image, × = non-face image)
[50]                 1,920 × 1,080  Tourist √, Golf √, Travel √, Doctor √, Woman √, Cafe ×, Bike ×
JPEG XR test set     1,280 × 1,600  Picture01 ×, Picture06 ×, Picture10 ×, Picture14 ×, Picture30 ×
Kodak test set       768 × 512      Kodim01 ×, Kodim02 ×, Kodim03 ×, Kodim05 ×, Kodim06 ×, Kodim07 ×, Kodim08 ×, Kodim11 ×, Kodim12 ×, Kodim13 ×, Kodim14 ×, Kodim15 √, Kodim16 ×, Kodim20 ×, Kodim21 ×, Kodim22 ×, Kodim23 ×, Kodim24 ×
Kodak test set       512 × 768      Kodim04 √, Kodim09 ×, Kodim10 ×, Kodim17 ×, Kodim18 √, Kodim19 ×
Standard images      512 × 512      Tiffany √, Lena √
detection. The other 20 subjects did not have any background in saliency detection,
and they were naive to the purpose of the eye-tracking experiment. Then, a Tobii
TX60 eye tracker integrated with a monitor of a 23-in. LCD display was used to
record the eye movement at a sample rate of 60 Hz. All subjects were seated on an
adjustable chair at a distance of 60 cm from the monitor of the eye tracker. Before
the experiment, the subjects were instructed to perform the 9-point calibration for the
eye tracker. During the experiment, each image was presented in a random order and
lasts for 4 s, followed by a 2-s black image for a drift correction. All subjects were
asked to freely view each image. Overall, 9,756 fixations were collected for our 38
test images.
In our experiments, our approach was implemented in HM 16.0 with the MSP
configuration profile. Then, the non-RC HEVC-MSP [9], also on the HM 16.0 plat-
form, was utilized for comparison. The RC HEVC-MSP was also compared, the RC of
which is mainly based on [52]. Note that both our approach and the RC HEVC-MSP
have integrated RC to specify the bit rates, and the other parameters in the configu-
ration profile were set by default, the same as those of the non-RC HEVC-MSP. To
obtain the target bit rates, we encoded each image with the non-RC HEVC-MSP at
six fixed QPs, the values of which are 22, 27, 32, 37, 42 and 47. Then, the target bit
rates of our approach and the RC HEVC-MSP were set to be the actual bits obtained
by the non-RC HEVC-MSP. As such, high ranges of visual quality for compressed
images can be ensured.
[Figure 8.5 panel annotations — average gain / BD-rate saving per image over the non-RC HEVC-MSP: Tourist 1.2595 dB / 24.9553%, Golf 2.368 dB / 41.1194%, Travel 3.8423 dB / 46.4434%, Doctor 2.6338 dB / 52.3495%, Woman 4.0383 dB / 58.9304%, Kodim15 2.3264 dB / 40.8396%, Kodim04 1.3689 dB / 27.4917%, Kodim18 1.7866 dB / 29.2705%, Tiffany 1.6407 dB / 34.8788%]
Figure 8.5 EWPSNR and PSNR versus bit rates for our approach and the non-RC
HEVC-MSP [11]
[Figure 8.6 panel annotations — average gain / BD-rate saving per image over the RC HEVC-MSP: Tourist 1.6706 dB / 30.8758%, Golf 1.9525 dB / 41.0559%, Travel 3.7202 dB / 45.9429%, Doctor 3.1595 dB / 56.3626%, Woman 4.0712 dB / 54.5463%, Kodim15 2.6205 dB / 46.817%, Kodim04 1.3912 dB / 30.7179%, Kodim18 2.0058 dB / 40.3616%, Tiffany 1.9259 dB / 37.5514%]
Figure 8.6 EWPSNR and PSNR versus bit rates for our approach and the RC
HEVC-MSP [11]
Face Non-face
Avg. ± Std. Max./Min. Avg. ± Std. Max./Min. Avg. ± Std. Max./Min. Avg. ± Std. Max./Min.
QP = 47 Over non-RC 1.10 ± 0.47 2.05/0.44 1.55 ± 0.79 2.93/0.39 0.44 ± 0.19 0.95/0.14 0.71 ± 0.43 1.91/0.04
Over RC 1.19 ± 0.52 2.21/0.65 1.67 ± 0.86 2.87/0.71 0.90 ± 0.40 1.84/0.25 1.15 ± 0.55 2.51/0.24
QP = 42 Over non-RC 1.21 ± 0.43 1.83/0.39 1.71 ± 0.79 2.84/0.47 0.58 ± 0.23 1.17/0.18 0.92 ± 0.42 2.13/0.15
Over RC 1.43 ± 0.55 2.43/0.55 1.99 ± 0.80 2.98/1.07 0.97 ± 0.45 1.74/0.31 1.34 ± 0.62 2.74/0.23
QP = 37 Over non-RC 1.29 ± 0.38 1.95/0.72 1.92 ± 0.93 3.64/0.67 0.71 ± 0.29 1.23/0.25 1.16 ± 0.46 2.21/0.31
Over RC 1.42 ± 0.50 2.40/0.90 2.16 ± 0.92 3.56/0.80 1.00 ± 0.50 1.96/0.25 1.47 ± 0.65 2.83/0.51
QP = 32 Over non-RC 1.51 ± 0.51 2.48/0.67 2.23 ± 1.08 4.20/1.04 0.81 ± 0.34 1.32/0.24 1.35 ± 0.54 2.40/0.27
Over RC 1.57 ± 0.52 2.49/0.95 2.38 ± 1.10 4.18/0.91 0.99 ± 0.50 1.99/0.21 1.56 ± 0.68 2.90/0.36
QP = 27 Over non-RC 1.90 ± 0.73 3.26/0.79 2.85 ± 1.37 5.41/1.66 0.86 ± 0.36 1.48/0.33 1.49 ± 0.61 2.66/0.10
Over RC 2.01 ± 0.65 3.14/1.01 2.98 ± 1.25 5.16/1.73 0.97 ± 0.47 2.13/0.36 1.58 ± 0.73 2.77/0.23
QP = 22 Over non-RC 2.38 ± 0.92 4.14/1.26 3.60 ± 1.21 5.75/2.17 0.92 ± 0.38 1.54/0.40 1.62 ± 0.69 3.07/0.12
Over RC 2.42 ± 1.05 4.14/1.17 3.65 ± 1.21 6.30/2.07 1.15 ± 0.51 2.14/0.39 1.85 ± 0.82 3.60/0.08
Overall Over non-RC 1.56 ± 0.73 4.14/0.39 2.31 ± 1.23 5.75/0.39 0.72 ± 0.34 1.54/0.14 1.21 ± 0.61 3.07/0.04
Over RC 1.67 ± 0.76 4.14/0.55 2.47 ± 1.20 6.30/0.71 1.00 ± 0.47 2.14/0.21 1.49 ± 0.70 3.60/0.08
Table 8.4 EWPSNR difference (dB) of our approach after replacing SWPSNR with
EWPSNR as the optimization objective [11]
QP 47 42 37 32 27 22 Overall
compressing images using our approach. Specifically, Table 8.4 shows the EWPSNR
difference averaged over all 38 test images when replacing SWPSNR with EWPSNR
as the optimization objective in our approach. This reflects the influence of ROI
detection accuracy on the quality improvement of our approach. We can see from
Table 8.4 that the EWPSNR of our approach can be enhanced by 0.64 and 0.87 dB on
average for face and non-face images after replacing SWPSNR by EWPSNR as the
optimization objective. Thus, visual quality can be further improved in our approach
when ROI detection is more accurate.
Subjective quality evaluation: Next, we compare our approach with the non-
RC HEVC-MSP using DMOS. Note that the DMOS of the RC HEVC-MSP is not
evaluated in our test because it produces even worse visual quality than the non-RC
HEVC-MSP. The DMOS test was conducted by means of the single-stimulus continuous quality score, which is processed according to Rec. ITU-R BT.500 to rate the subjective
quality. The total number of subjects involved in the test is 12, consisting of 6 males and
6 females. Here, a Sony BRAVIA XDV-W600, with a 55-in. LCD, was utilized for dis-
playing the images. The viewing distance was set to be four times the image height for
rational evaluation. During the experiment, each image was displayed for 4 s, and the
order in which the images were displayed was random. Then, the subjects were asked
to rate each image after it was displayed, i.e., excellent (100–81), good (80–61), fair (60–41), poor (40–21) and bad (20–0). Finally, DMOS was computed to quantify the difference in subjective quality between the compressed and uncompressed images.
The DMOS results for the face images are tabulated in Table 8.5. Smaller values
of DMOS indicate better subjective quality. As shown in Table 8.5, our approach
has considerably better subjective quality than the non-RC HEVC-MSP at all bit
rates. Note that for all images, the DMOS values of our approach at QP = 47 are
almost equal to those of the non-RC HEVC-MSP at QP = 42, which approximately
doubles the bit rates of QP = 47. This indicates that a bit rate reduction of nearly half
can be achieved in our approach. This result is also in accordance with the ∼40%
BD-rate saving of our approach (to be discussed in Section 8.5.3). We further show
in Figure 8.7 Lena and Kodim18 compressed by our approach and by the other two approaches. Obviously, our approach, which incorporates the saliency-detection method of [50], is able to significantly improve the visual quality over the face regions (on which humans mainly focus). Consequently, our approach yields significantly better subjective quality than the non-RC and RC HEVC-MSP for face images.
In addition, the DMOS results of those eight non-face images are listed in
Table 8.6. Again, our approach is considerably superior to the non-RC HEVC-MSP
Table 8.5 DMOS results for face images between our approach and the non-RC HEVC-MSP [11]
Tourist Golf Travel Doctor Woman Kodim15 Kodim04 Kodim18 Tiffany Lena
QP = 47 Bits (bpp) 0.04 0.02 0.04 0.02 0.04 0.03 0.03 0.05 0.03 0.05
Our 57.2 58.0 56.9 56.5 61.4 64.5 68.9 55.0 59.2 57.5
Non-RC 74.3 69.6 69.1 63.9 78.4 70.1 73.9 66.3 67.6 63.9
QP = 42 Bits (bpp) 0.08 0.03 0.10 0.03 0.13 0.06 0.06 0.16 0.06 0.09
Our 45.0 50.0 42.7 47.8 43.9 50.7 53.6 43.1 43.1 47.9
Non-RC 58.5 56.3 53.7 52.1 61.3 61.2 61.9 56.9 54.1 55.5
QP = 32 Bits (bpp) 0.27 0.08 0.36 0.10 0.56 0.29 0.31 0.76 0.26 0.28
Our 28.1 35.2 26.1 34.1 28.9 30.0 30.0 20.8 27.1 36.9
Non-RC 36.4 42.0 34.0 42.3 36.0 38.7 38.8 28.5 30.2 44.0
Note: The bold values mean the best subjective quality per test QP and test image.
Figure 8.7 Subjective quality of Lena and Kodim18 images at both 0.05 bpp
(QP = 47) for three approaches [11]: (a) human fixations, (b) non-RC
HEVC-MSP, (c) RC HEVC-MSP and (d) our
approach at all bit rates. Moreover, Figure 8.8 shows two images Kodim06 and
Kodim07 compressed by our approach and by the other two approaches. From this
figure, we can see that our approach improves the subjective quality of compressed
images, as the fixated regions are with higher quality.
Table 8.6 DMOS results for non-face images between our approach and the non-RC HEVC-MSP [11]

QP = 47 Bits (bpp) 0.07 0.04 0.02 0.04 0.05 0.03 0.02 0.06
Our 53.3 59.6 65.5 62.0 56.8 63.0 71.1 67.1
Non-RC 57.2 63.1 69.9 72.1 67.0 68.1 79.2 70.2
QP = 42 Bits (bpp) 0.14 0.10 0.04 0.12 0.10 0.08 0.06 0.17
Our 36.8 50.3 50.0 52.7 50.1 54.5 56.2 55.4
Non-RC 38.9 54.2 53.4 57.6 56.3 58.7 62.1 59.3
QP = 32 Bits (bpp) 0.49 0.40 0.26 0.60 0.33 0.28 0.36 0.71
Our 30.3 31.7 33.5 34.8 36.3 34.7 35.6 32.6
Non-RC 30.8 32.6 35.2 35.6 38.0 37.9 40.8 33.8
Note: The bold values mean the best subjective quality per test QP and test image.
Figure 8.8 Subjective quality of Kodim06 and Kodim07 image at 0.04 and
0.05 bpp (QP = 47) for three approaches [11]: (a) human fixations,
(b) non-RC HEVC-MSP, (c) RC HEVC-MSP and (d) our
This is because human faces are more consistent than other objects in attracting human attention.
Meanwhile, in our approach, the saliency of face images can be better predicted than
that of non-face images. Consequently, the ROI-based compression of face images
by our approach is more effective in satisfying human perception, resulting in larger
improvements in EWPSNR, BD-rate savings and DMOS scores.
As the cost of this BD-rate saving, the computational time of our approach increases,
which is also reported in Table 8.7. Specifically, our approach increases the encoding
time by approximately 8% and 5% over non-RC and RC HEVC-MSP, respectively.
The computational time of our approach mainly comes from three parts, i.e., saliency
detection, pre-compression and RTE optimization. As discussed above (Sections 8.3.1
and 8.4.2), our pre-compression process slightly increases the computational cost
by ∼3%, while our RTE method consumes negligible computational time. Besides,
saliency detection, which is the first step in our approach, consumes ∼2% extra time.
We further compare the RC accuracy of our approach with that of the RC HEVC-MSP over all images in our test set. Since the bit reallocation process
is developed in our approach to bridge the gap between the target and actual bits, the
control accuracy of our approach with and without the bit reallocation process is also
compared. In the following, the control accuracy is evaluated from two aspects: the
CTU level and the image level.
For the evaluation of control accuracy at the CTU level, we compute the bit-rate
error of each CTU, i.e., the absolute difference between target and actual bits assigned
to one CTU. Then, Figure 8.9 demonstrates the heat maps of bit-rate errors at the CTU
level averaged over all images with the same resolutions from the Kodak and JPEG
XR sets. The heat maps of our approach and of the RC HEVC-MSP are both shown in
Figure 8.9. It can easily be observed that our approach ensures a considerably smaller
bit-rate error for almost all CTUs when compared with the RC HEVC-MSP. Note that
the accurate rate control at the CTU level is meaningful because it ensures that the
bit consumption follows the amount that it is allocated, satisfying the subjective R–D
optimization formulation of (8.6). As a result, the bits in our approach can be accu-
rately assigned to ROIs with optimal subjective quality. In contrast, the conventional
RC HEVC-MSP normally accumulates redundant bits at the end of image bitstreams,
resulting in poor performance in R–D optimization.
For the evaluation of control accuracy at the image level, the bit-rate error, defined as the absolute difference between the target and actual bits of the compressed image, is computed. Figure 8.10 shows the bit-rate errors of all 38 images from our test set in
terms of maximum, minimum, average and standard deviation values. As shown in this
figure, our approach achieves smaller bit-rate error than the RC HEVC-MSP from the
aspects of mean, standard deviation, maximum and minimum values. This verifies
the effectiveness of our approach in RC and also makes our approach more practical
because the accurate bit allocation of our approach well meets the bandwidth or storage
requirements. Furthermore, Figure 8.10 shows that the bit-rate error significantly
increases from 1.43% to 6.91% and also dramatically fluctuates once bit reallocation
is disabled in our approach. This indicates the effectiveness of the bit-reallocation
process in our approach. Note that because a simple reallocation process is also
adopted in the RC HEVC-MSP, the bit-rate errors of RC HEVC-MSP are also much
smaller than those of our approach without bit reallocation.
In summary, our approach has more accurate RC at both the CTU and image
levels compared to the RC HEVC-MSP.
Figure 8.9 Heat maps of bit-rate errors at CTU level for our approach and RC
HEVC-MSP. Each block in this figure indicates the bit-rate error of one
CTU. Note that the bit-rate errors are obtained via averaging all
images compressed by our and the RC HEVC MSP at six different bit
rates (corresponding to QP = 22, 27, 32, 37, 42, 47) [11]. (a) Kodak
768 × 512, (b) Kodak 512 × 768, (c) JPEG XR 1,280 × 1,600
[Figure 8.10 annotations — bit-rate error (%): RC HEVC-MSP Avg ± Std 2.33% ± 2.80%, Max/Min 9.83%/0.03%; our approach Avg ± Std 1.43% ± 1.24%, Max/Min 4.98%/0.03%; our approach without bit reallocation Avg ± Std 6.91% ± 4.92%, Max/Min 17.15%/0.07%]
Figure 8.10 The bit-rate errors of each single image for our approach with and
without bit reallocation, as well as the RC HEVC-MSP. The maximum,
minimum, average and standard deviation values over all images are
also provided [11]
As shown in Table 8.8, our approach still dramatically outperforms the conven-
tional approaches across different categories of images in terms of both quality and RC
error. Specifically, the SWPSNR improvement on the newly added 112 images is sim-
ilar to that on the above 38 test images. In particular, when compressing face images
at 6 QPs, our approach has 1.50 ± 0.84 dB SWPSNR increase over the conventional
RC HEVC-MSP. Moreover, the average increase in SWPSNR at six QPs is 0.75 dB
for non-face images, 0.83 dB for graphic images and 0.60 dB for aerial images. For
control accuracy, the average bit-rate errors of our approach stabilize at 1.84%–3.74%
across different categories, while the conventional RC approach in HEVC fluctuates
from 4.08% to 12.40% on average with an even larger standard deviation. This result
validates that our approach can achieve a stable and accurate RC, compared to RC
HEVC-MSP. Finally, these results also validate the generalization ability of our approach.
where Ci is the set of pixels belonging to the ith CTU. Therefore, at target bit rate R,
the optimization on S-PSNR can be formulated by
$\min \sum_{i=1}^{M} d_i \quad \text{s.t.} \quad \sum_{i=1}^{M} r_i = R. \quad (8.33)$
Table 8.8 Performance improvement of our approach over non-RC and RC HEVC-MSP approaches, for 112 test images belonging to different categories [11]

Categories: Face | Non-face | Graphic | Aerial | Overall

SWPSNR improvement (dB)
QP = 32  Over non-RC  Avg. ± Std.: 1.14 ± 0.03 | 0.54 ± 0.40 | 0.51 ± 0.28 | 0.35 ± 0.30 | 0.58 ± 0.50
                      Max./Min.:   2.82/0.11 | 1.60/0.02 | 0.81/0.14 | 1.04/0.00 | 2.82/0.00
         Over RC      Avg. ± Std.: 1.40 ± 0.76 | 0.73 ± 0.49 | 0.65 ± 0.13 | 0.59 ± 0.61 | 0.80 ± 0.66
                      Max./Min.:   2.85/0.17 | 1.80/0.01 | 0.83/0.53 | 2.97/0.00 | 2.97/0.00
All      Over non-RC  Avg. ± Std.: 1.25 ± 0.71 | 0.53 ± 0.45 | 0.50 ± 0.21 | 0.30 ± 0.33 | 0.58 ± 0.58
                      Max./Min.:   3.30/0.01 | 3.35/0.00 | 0.90/0.14 | 2.28/0.01 | 3.35/0.00
         Over RC      Avg. ± Std.: 1.50 ± 0.84 | 0.75 ± 0.59 | 0.83 ± 0.68 | 0.60 ± 0.52 | 0.84 ± 0.71
                      Max./Min.:   4.59/0.01 | 3.13/0.01 | 2.88/0.06 | 2.97/0.01 | 4.59/0.01

Bit-rate error (%)
QP = 32  RC HEVC-MSP  Avg. ± Std.: 2.40 ± 2.76 | 3.53 ± 9.11 | 6.43 ± 9.80 | 6.93 ± 9.09 | 4.78 ± 8.39
                      Max./Min.:   10.9/0.06 | 53.65/0.01 | 20.99/0.47 | 35.07/0.02 | 53.65/0.01
         Our          Avg. ± Std.: 2.72 ± 2.62 | 2.80 ± 4.15 | 1.89 ± 1.69 | 1.63 ± 3.34 | 2.28 ± 3.51
                      Max./Min.:   12.11/0.36 | 25.45/0.03 | 4.42/0.85 | 20.38/0.06 | 25.45/0.03
All      RC HEVC-MSP  Avg. ± Std.: 4.08 ± 5.51 | 7.96 ± 15.64 | 11.96 ± 21.20 | 12.40 ± 16.29 | 9.12 ± 15.07
                      Max./Min.:   33.61/0.04 | 98.81/0.00 | 86.00/0.12 | 69.12/0.00 | 98.81/0.00
         Our          Avg. ± Std.: 3.37 ± 3.63 | 3.74 ± 5.71 | 2.17 ± 1.85 | 1.84 ± 3.03 | 2.85 ± 4.37
                      Max./Min.:   25.79/0.10 | 39.32/0.01 | 7.00/0.31 | 21.39/0.00 | 39.32/0.00
In (8.33), $r_i$ is the number of bits assigned to the $i$th CTU, and M is the total number of CTUs in the current frame. To solve the above formulation, a Lagrange multiplier λ is introduced, and (8.33) can be converted to an unconstrained optimization problem:

$\min_{\{r_i\}_{i=1}^{M}} J = \sum_{i=1}^{M} (d_i + \lambda r_i). \quad (8.34)$

Here, we define J as the value of the R–D cost. By setting the derivative of (8.34) to zero, the minimization of J can be achieved by

$\frac{\partial J}{\partial r_i} = \frac{\partial \sum_{i=1}^{M} (d_i + \lambda r_i)}{\partial r_i} = \frac{\partial d_i}{\partial r_i} + \lambda = 0. \quad (8.35)$
Next, we need to model the relationship between distortion di and bit rate ri , for
solving (8.35). Note that di and ri are equivalent to S-MSE and bpp divided by the
number of pixels in a CTU, respectively. Similar to [37], we use the hyperbolic model
to investigate the relationship between sphere-based distortion S-MSE and bit-rate
bpp, on the basis of four encoded panoramic video sequences. Figure 8.11 plots the
fitting R–D curves using the Hyperbolic model, for these four sequences. In this
figure, bpp is calculated by
R
bpp = , (8.36)
f ×W ×H
where f means frame rate, and W and H stand for width and height of video, respec-
tively. Figure 8.11 shows that the Hyperbolic model is capable of fitting on the
relationship between S-MSE [55] and bpp, and R-square for the fitting curves of
four sequences are all more than 0.99. Therefore, the hyperbolic model is used in our
RC scheme as follows:
where cm and km are the parameters of the hyperbolic model that can be updated for
each CTU using the same way as [11].
The above equation can be rewritten by
$-\frac{\partial d_i}{\partial r_i} = c_i \cdot k_i \cdot r_i^{-k_i-1}. \quad (8.38)$
Given (8.35) and (8.38), the following equation holds:
$r_i = \left(\frac{c_i\, k_i}{\lambda}\right)^{1/(k_i+1)}. \quad (8.39)$
[Figure 8.11 panels: Fengjing_1, Tiyu_1, Dianying and Hangpai_2 — S-MSE versus bpp with hyperbolic fitting]
Figure 8.11 R–D fitting curves using the hyperbolic model. Note that these four
sequences are encoded by HM 15.0 with the default low delay P
profile. The bit rates are set as the actual bit rates when
compressing at four fixed QP (27, 32, 37, 42), to be described in
Section 8.6.1.1 [55]
Upon (8.39) and (8.40), the bit allocation for each CTU can be formulated as follows:
$\sum_{i=1}^{M} r_i = \sum_{i=1}^{M} \left(\frac{c_i \cdot k_i}{\lambda}\right)^{1/(k_i+1)} = R. \quad (8.41)$
Therefore, once (8.41) is solved, the target bits $r_i$ can be obtained for each CTU, with maximization of S-PSNR. In this chapter, we apply the RTE method [49] to solve (8.41) with a closed-form solution.
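For illustration only, (8.41) can also be solved by a simple bisection on λ, as sketched below; the chapter itself uses the closed-form RTE method, and the bracket values and iteration count here are assumptions.

    def allocate_bits(c, k, R, lam_lo=1e-9, lam_hi=1e9, iters=60):
        """Bisection on the Lagrange multiplier so that (8.41) is met; returns
        the per-CTU target bits from (8.39) and the found lambda."""
        def total_bits(lam):
            return sum((ci * ki / lam) ** (1.0 / (ki + 1.0)) for ci, ki in zip(c, k))
        for _ in range(iters):                   # total bits decrease as lambda grows
            lam = (lam_lo * lam_hi) ** 0.5       # geometric midpoint for a wide bracket
            if total_bits(lam) > R:
                lam_lo = lam                     # spending too many bits: raise lambda
            else:
                lam_hi = lam
        lam = (lam_lo * lam_hi) ** 0.5
        return [(ci * ki / lam) ** (1.0 / (ki + 1.0)) for ci, ki in zip(c, k)], lam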
After obtaining the optimal bit-rate allocation, quantization parameter (QP) of
each CTU can be estimated using the method of [37]. Figure 8.12 summarizes the
overall procedure of our RC scheme for panoramic video coding. Note that our RC
scheme is mainly applicable for the latest HEVC-based panoramic video coding, and
296 Applications of machine learning in wireless communications
[Figure 8.12: for frame t of the panoramic sequence, bit allocation produces r_m, QP estimation produces QP_m and the frame is encoded into the bit stream; the parameters c_m, k_m are then updated for frame t + 1]
Figure 8.12 The framework of the proposed RC scheme for panoramic video
coding [55]
8.6.1 Experiment
In this section, experiments are conducted to validate the effectiveness of our RC
scheme. Section 8.6.1.1 presents the settings for our experiments. Section 8.6.1.2 evaluates our approach in terms of R–D performance, BD-rate and Bjontegaard delta S-PSNR (BD-PSNR). Section 8.6.1.3 discusses the RC accuracy of our scheme.
8.6.1.1 Settings
Due to space limitations, eight panoramic video sequences at 4K resolution are chosen from the test set of the IEEE 1857 working group for our experiments. They are shown in Figure 8.13. These sequences are all at 30 fps with a duration of 10 s. Figure 8.13 shows the contents of these sequences, which vary from indoor to outdoor scenes and contain both people and landscapes. These panoramic video sequences are compressed by the HEVC reference software HM-15.0. Here, we implement our RC scheme in HM-15.0 and then compare it with the latest R–λ RC scheme [37], which is the default RC setting of HM-15.0. For HM-15.0, the low delay P setting is applied with the configuration file encoder_lowdelay_P_main.cfg. As in [37], we first compress the panoramic video sequences using the conventional HM-15.0 at four fixed QPs, namely 27, 32, 37 and 42. Then, the obtained bit rates are used to set the target bit rates of each sequence for both our scheme and the conventional scheme [37]. It is worth pointing out that we only compare with the state-of-the-art RC scheme [37] of HEVC for 2D video coding, since there exists no RC scheme dedicated to panoramic video coding.
Figure 8.13 Selected frames from all test panoramic video sequences [55]:
(a) Fengjing_1 (4,096 × 2,048), (b) Tiyu_1 (4,096 × 2,048),
(c) Yanchanghui_2 (4,096 × 2,048), (d) Dianying (4,096 × 2,048),
(e) Hangpai_1 (4,096 × 2,048), ( f ) Hangpai_2 (4,096 × 2,048),
(g) AerialCity (3,840 × 1,920), (h) DrivingInCountry (3,840 × 1,920)
We can see from the R–D curves in Figure 8.14 that our scheme achieves higher S-PSNR than [37] at the same bit rates, for all test sequences. Thus, our RC scheme is superior to [37] in R–D performance.
BD-PSNR and BD-rate. Next, we quantify the R–D performance in terms of BD-PSNR and BD-rate. As with the above R–D curves, we use S-PSNR in the Y channel for measuring BD-PSNR and BD-rate. Table 8.9 reports the BD-PSNR improvement of our scheme over [37]. As can be seen from this table, our scheme improves BD-PSNR by 0.1613 dB on average over [37]. Such improvement is mainly because our scheme aims at optimizing S-PSNR, while [37] optimizes PSNR. Table 8.9 also tabulates the BD-rate saving of our RC scheme with [37] as the anchor. We can see that our RC scheme saves 5.34% BD-rate on average compared with [37]. Therefore, our scheme has the potential to relieve the bandwidth-hungry issue posed by panoramic videos.
Subjective quality. Furthermore, Figure 8.15 shows the visual quality of one selected frame of the sequence Dianying, encoded by HM-15.0 with our and the conventional RC schemes at the same bit rate. We can observe that our scheme yields better visual quality than [37], with less blurring and fewer artifacts. For example, the regions of the fingers and the light produced by our scheme are much clearer than those produced by [37]. Besides, the region of the leg encoded with our RC scheme shows less blurring compared to [37]. In summary, our scheme outperforms [37] in R–D performance, as evaluated by R–D curves, BD-PSNR, BD-rate and subjective quality.
[Figure 8.14 panels: averaged Y-S-PSNR (dB) versus bit rate (kbps), comparing the conventional and our RC schemes for each test sequence.]
Figure 8.14 R–D curves of all test sequences compressed by HM-15.0 with our
and conventional RC [37] schemes [55]: (a) Fengjing 1, (b) Tiyu 1,
(c) Yanchanghui 2, (d) Dianying, (e) Hangpai 1, ( f ) Hangpai 2,
(g) AerialCity, (h) DrivingInCountry
Table 8.9 BD-rate saving and BD-PSNR enhancement for each test panoramic video sequence [55]
Name Fengjing 1 Tiyu 1 Yanchanghui 2 Dianying Hangpai 1 Hangpai 2 AerialCity DrivingInCountry Average
BD-rate saving (%) −7.63 −4.39 −3.96 −4.81 −3.87 −4.04 −5.41 −8.63 −5.34
BD-PSNR (dB) 0.2527 0.1155 0.1619 0.1441 0.1143 0.1197 0.1356 0.2464 0.1613
Figure 8.15 Visual quality of Dianying compressed at 158 kbps by HM-15.0 with
our and conventional RC [37] schemes. Note that this figure shows
the 68th frame of compressed Dianying [55]. (a) Conventional RC
scheme and (b) our scheme
8.7 Conclusion
In this chapter, we have proposed a novel HEVC-based compression approach that minimizes perceptual distortion. Benefiting from state-of-the-art saliency detection, we developed a formulation for minimizing perceptual distortion, which maintains appropriately high quality in the regions that attract attention. Then, the RTE method was proposed as a closed-form solution to our formulation, adding little extra time for minimizing perceptual distortion, followed by the bit allocation and reallocation process. Finally, we validated our approach in experiments on compressing both images and videos.
There are two possible directions for future work: (1) our approach only takes visual attention into account when improving the subjective quality of compressed images/videos. In fact, other factors of the HVS, e.g., JND, may also be integrated into our approach for perceptual compression. (2) Our approach in its present form only concentrates on minimizing perceptual distortion according to the visual attention predicted on uncompressed frames. However, the distribution of visual attention may in turn be influenced by the distortion of the compressed frames. A long-term goal of perceptual compression should thus include the loop between visual attention and perceptual distortion over compressed images/videos.
Table 8.10 S-PSNR improvement and RC accuracy of our RC scheme, compared with the conventional scheme [37,55]

Name              Fixed QP  RC error (‰) (conventional)  RC error (‰) (our)  S-PSNR improvement (dB)
Fengjing 1        27        0.07                         0.12                0.27
                  32        0.06                         0.05                0.28
                  37        0.32                         0.20                0.23
                  42        0.31                         0.23                0.15
Tiyu 1            27        0.04                         0.04                0.18
                  32        0.46                         0.08                0.13
                  37        0.01                         1.98                0.07
                  42        1.37                         3.02                0.10
Yanchanghui 2     27        0.74                         0.34                0.03
                  32        0.34                         0.52                0.17
                  37        0.25                         0.96                0.19
                  42        0.54                         1.89                0.20
Dianying          27        0.02                         1.68                −0.07
                  32        0.00                         2.00                0.11
                  37        0.26                         0.49                0.25
                  42        0.70                         0.04                0.32
Hangpai 1         27        0.05                         0.19                0.19
                  32        0.10                         0.29                0.11
                  37        0.06                         1.46                0.09
                  42        0.12                         0.10                0.12
Hangpai 2         27        0.02                         0.18                0.15
                  32        0.45                         0.27                0.14
                  37        0.42                         0.50                0.09
                  42        0.01                         0.78                0.09
AerialCity        27        0.17                         0.95                0.06
                  32        0.28                         0.74                0.12
                  37        0.06                         1.18                0.20
                  42        0.09                         4.43                0.18
DrivingInCountry  27        0.02                         2.75                0.34
                  32        0.04                         0.24                0.27
                  37        0.02                         0.52                0.21
                  42        0.05                         1.24                0.19
Average           27        0.14                         0.78                0.14
                  32        0.21                         0.52                0.17
                  37        0.17                         0.91                0.17
                  42        0.40                         1.47                0.17
Overall average             0.23                         0.92                0.16
References
[1] Chen S-C, Li T, Shibasaki R, Song X, and Akerkar R. Call for papers: Multime-
dia: the biggest big data. Special Issue of IEEE Transactions on Multimedia.
2015;17(9):1401–1403.
[2] Haskell BG, Puri A, and Netravali AN. Digital Video: An Introduction to
MPEG-2. New York: Kluwer Academic Publishers; 1997.
[3] Vetro A, Sun H, and Wang Y. MPEG-4 rate control for multiple video objects.
IEEE Transactions on Circuits and Systems for Video Technology. 1999;9(1):
186–199.
[4] Bankoski J, Bultje RS, Grange A, et al. Towards a next generation open-source
video codec. In: IS&T/SPIE Electronic Imaging. International Society for
Optics and Photonics; 2013. p. 866606-1-14.
[5] Cote G, Erol B, Gallant M, et al. H.263+: Video coding at low bit rates.
IEEE Transactions on Circuits and Systems for Video Technology. 1998;8(7):
849–866.
[6] Wiegand T, Sullivan GJ, Bjontegaard G, et al. Overview of the H.264/AVC
video coding standard. IEEE Transactions on Circuits and Systems for Video
Technology. 2003;13(7):560–576.
[7] Sullivan GJ, Ohm JR, Han WJ, et al. Overview of the high efficiency video
coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video
Technology. 2012;22(12):1649–1668.
[8] Lainema J, Bossen F, Han WJ, et al. Intra coding of the HEVC stan-
dard. IEEE Transactions on Circuits and Systems for Video Technology.
2012;22(12):1792–1801.
[9] Nguyen T, and Marpe D. Objective performance evaluation of the HEVC main
still picture profile. IEEE Transactions on Circuits and Systems for Video
Technology. 2014;25(5):790–797.
[10] Lee J, and Ebrahimi T. Perceptual video compression: a survey. IEEE Journal
of Selected Topics in Signal Processing. 2012;6(6):684–697.
[11] Li S, Xu M, Ren Y, et al. Closed-form optimization on saliency-guided
image compression for HEVC-MSP. IEEE Transactions on Multimedia.
2018;20(1):155–170.
[12] Koch K, McLean J, Segev R, et al. How much the eye tells the brain. Current
Biology. 2006;16(14):1428–1434.
[13] Wandell BA. Foundations of vision. Sinauer Associates. 1995.
[14] Doulamis N, Doulamis A, Kalogeras D, et al. Low bit-rate coding of image
sequences using adaptive regions of interest. IEEE Transactions on Circuits
and Systems for Video Technology. 1998;8(8):928–934.
[15] Yang X, Lin W, Lu Z, et al. Rate control for videophone using local percep-
tual cues. IEEE Transactions on Circuits and Systems for Video Technology.
2005;15(4):496–507.
[16] Liu Y, Li ZG, and Soh YC. Region-of-interest based resource allocation for
conversational video communication of H.264/AVC. IEEE Transactions on
Circuits and Systems for Video Technology. 2008;18(1):134–139.
[17] Li Z, Qin S, and Itti L. Visual attention guided bit allocation in video
compression. Image and Vision Computing. 2011;29(1):1–14.
[18] Xu M, Deng X, Li S, et al. Region-of-interest based conversational HEVC
coding with hierarchical perception model of face. IEEE Journal of Selected
Topics on Signal Processing. 2014;8(3):475–489.
[19] Xu L, Zhao D, Ji X, et al. Window-level rate control for smooth picture qual-
ity and smooth buffer occupancy. IEEE Transactions on Image Processing.
2011;20(3):723–734.
[20] Xu L, Li S, Ngan KN, et al. Consistent visual quality control in video
coding. IEEE Transactions on Circuits and Systems for Video Technology.
2013;23(6):975–989.
[21] Rehman A, and Wang Z. SSIM-inspired perceptual video coding for HEVC.
In: Multimedia and Expo (ICME), 2012 IEEE International Conference on.
IEEE; 2012. p. 497–502.
[22] Geisler WS, and Perry JS. A real-time foveated multi-resolution system for
low-bandwidth video communication. In: Proceedings of the SPIE: The
International Society for Optical Engineering. vol. 3299; 1998. p. 294–305.
[23] Martini MG, and Hewage CT. Flexible macroblock ordering for context-aware
ultrasound video transmission over mobile WiMAX. International Journal of
Telemedicine and Applications. 2010;2010:6.
[24] Itti L. Automatic foveation for video compression using a neurobiolog-
ical model of visual attention. IEEE Transactions on Image Processing.
2004;13(10):1304–1318.
[25] Chi MC, Chen MJ, Yeh CH, et al. Region-of-interest video coding based
on rate and distortion variations for H.263+. Signal Processing: Image
Communication. 2008;23(2):127–142.
[26] Cerf M, Harel J, Einhäuser W, et al. Predicting human gaze using low-
level saliency combined with face detection. Advances in Neural Information
Processing Systems. 2008;20:241–248.
[27] Saxe DM, and Foulds RA. Robust region of interest coding for improved
sign language telecommunication. IEEE Transactions on Information Tech-
nology in Biomedicine: A Publication of the IEEE Engineering in Medicine
and Biology Society. 2002;6(4):310–316.
[28] Sun Y, Ahmad I, Li D, et al. Region-based rate control and bit allocation for
wireless video transmission. IEEE Transactions on Multimedia. 2006;8(1):
1–10.
[29] Chi MC, Yeh CH, and Chen MJ. Robust region-of-interest determina-
tion based on user attention model through visual rhythm analysis. IEEE
Transactions on Circuits and Systems for Video Technology. 2009;19(7):
1025–1038.
[30] Cavallaro A, Steiger O, and Ebrahimi T. Semantic video analysis for adaptive
content delivery and automatic description. IEEE Transactions on Circuits and
Systems for Video Technology. 2005;15(10):1200–1209.
Saliency detection has been widely studied to predict human fixations, with various
applications in wireless multimedia communications. For saliency detection, we argue that the state-of-the-art high-efficiency video-coding (HEVC) standard can be used to generate useful features in the compressed domain. Therefore, this chapter proposes
to learn the video-saliency model, with regard to HEVC features. First, we establish an
eye-tracking database for video-saliency detection. Through the statistical analysis on
our eye-tracking database, we find out that human fixations tend to fall into the regions
with large-valued HEVC features on splitting depth, bit allocation, and motion vector
(MV). In addition, three observations are obtained from the further analysis on our eye-
tracking database. Accordingly, several features in HEVC domain are proposed on the
basis of splitting depth, bit allocation, and MV. Next, a support vector machine (SVM)
is learned to integrate those HEVC features together, for video-saliency detection.
Since almost all video data are stored in the compressed form, our method is able to
avoid both the computational cost on decoding and the storage cost on raw data. More
importantly, experimental results show that the proposed method is superior to other
state-of-the-art saliency-detection methods, either in compressed or uncompressed
domain.
9.1 Introduction
According to the study on the human visual system (HVS) [1], when a person looks
at a scene, he/she may pay much visual attention on a small region (the fovea) around
a point of eye fixation at high resolution. The other regions, namely, the peripheral
regions, are captured with little attention at low resolutions. As such, humans are
able to avoid the processing of tremendous visual data. Visual attention is therefore
1
School of Electronic and Information Engineering, Beihang University, China
2
School of Electrical and Electronic Engineering, The University of Manchester, UK
a key to perceiving the world around humans, and it has been extensively studied in the psychophysics, neurophysiology, and even computer vision communities [2]. Saliency detection is an effective way to predict the amount of human visual attention attracted by different regions in images/videos. Most recently, saliency detection has been widely applied in wireless multimedia communications and other computer vision tasks, such as object detection [3,4], object recognition [5], image retargeting [6], image-quality assessment [7], and image/video compression [8,9].
Early on, some heuristic saliency-detection methods were developed according to the understanding of the HVS. Specifically, in light of the HVS, Itti and Koch [10] found that the low-level features of intensity, color, and orientation are efficient in detecting saliency in still images. In their method, center-surround responses in those feature channels are established to yield the conspicuity maps. Then, the final saliency map is obtained by linearly integrating the conspicuity maps of all three features. For detecting saliency in videos, Itti et al. [11] proposed to add two dynamic features (i.e., motion and flicker contrast) to Itti's image saliency model [10]. Later, other advanced heuristic methods [12–18] were proposed for modeling video saliency.
Recently, data-driven methods [19–24] have emerged to learn the visual atten-
tion models from the ground-truth eye-tracking data. Specifically, Judd et al. [19]
proposed to learn a linear classifier of SVM from training data for image saliency
detection, based on several low, middle, and high-level features. For video-saliency
detection, most recently, Rudoy et al. [23] have proposed a novel method to predict
saliency by learning the conditional saliency map from human fixations over a few
consecutive video frames. This way, the inter-frame correlation of visual attention is
taken into account, such that the accuracy of video-saliency detection can be signif-
icantly improved. Rather than free-view saliency detection, a probabilistic multitask
learning method was developed in [21] for the task-driven video-saliency detection,
in which the “stimulus-saliency” functions were learned from the eye-tracking data
as the top-down attention models.
HEVC [25] was formally approved as the state-of-the-art video-coding standard
in April 2013. It achieves double coding efficiency improvement over the preced-
ing H.264/AVC standard. Interestingly, we found out that the state-of-the-art HEVC
encoder can be explored as a feature extractor to efficiently predict video saliency. As
shown in Figure 9.1, the HEVC domain features on splitting depth, bit allocation, and
MV for each coding tree unit (CTU), are highly correlated with the human fixations.
The statistical analysis of Section 9.3.2 verifies such high correlation. Therefore, we
develop several features in our method for video-saliency detection, which are based
on splitting depths, bit allocation, and MVs in HEVC domain. It is worth pointing
out that most videos exist in the form of encoded bitstreams and the features related
to entropy and motion have been well exploited by video coding at the encoder side.
Since [2] has argued that entropy and motion are very effective in video-saliency
detection, our method utilizes these well-exploited HEVC features (splitting depth,
bit allocation, and MV) at the decoder side to achieve high accurate detection on
video saliency.
Figure 9.1 An example of HEVC domain features and heat map of human fixations
for one video frame. Parts (a), (b), and (c) are extracted from the
HEVC bitstream of video BQSquare (resolution: 416 × 240) at
130 kbps. Note that in (c) only the MVs that are larger than 1 pixel are
shown. Part (d) is the heat map convolved with a 2D Gaussian filter
over fixations of 32 subjects
Generally speaking, the main motivation for using HEVC features in our saliency-detection method is twofold: (1) our method takes advantage of the sophisticated encoding of HEVC to effectively extract features for video-saliency detection. Our experimental results in this chapter also show that the HEVC features are quite effective in video-saliency detection. (2) Our method can efficiently detect video saliency from HEVC bitstreams without completely decoding the videos, thus saving both computational time and storage. Consequently, our method is generally more efficient than the aforementioned video-saliency detection methods in the pixel domain (also called the uncompressed domain), which have to decode the bitstreams into raw data. Such efficiency is also validated by our experiments.
There are only a few methods [26–28] proposed for detecting video saliency in the compressed domain of previous video-coding standards. Among these methods, the block-wise discrete cosine transform (DCT) coefficients and MVs are extracted in MPEG-2 [26] and MPEG-4 [27]. The bit allocation of H.264/AVC is exploited for saliency prediction in [28]. However, none of the above methods takes full advantage of the sophisticated features of the modern HEVC encoder, such as CTU splitting [29] and R–λ bit allocation [30]. More importantly, all methods of [26–28] fail to find out the precise impact of each compressed domain feature on attracting visual attention.
In fact, the relationship between compressed domain features and visual attention can
be learned from the ground-truth eye-tracking data. Thereby, this chapter proposes
to learn the visual attention model of videos with regard to the well-explored HEVC
features.
Similar in spirit, the latest work of [31] also makes use of HEVC features for saliency detection. Despite being conceptually similar, our method differs greatly from [31] in two aspects. From the aspect of feature extraction, our method develops pixel-wise HEVC features, while [31] directly uses block-based HEVC features with deeper decoding (e.g., inverse DCT). Instead of going deeper, our method develops shallowly decoded HEVC features with a sophisticated design of temporal and spatial differences on these features, which is less restrictive than [31]. In addition, the camera motion is detected and then removed in our HEVC features, such that our features are more effective in predicting attention. From the aspect of feature integration, compared with [31], our method is data driven, in that a learning algorithm is developed to bridge the gap between HEVC features and video saliency. Meanwhile, our data-driven method benefits from a thorough analysis of our established eye-tracking database.
Specifically, the main contributions of this chapter are listed in the following:
● We establish an eye-tracking database on viewing 33 raw videos of the latest data
sets, with the thorough analysis and observations on our database.
● We propose several saliency-detection features in HEVC domain, according to
the analysis and observations on our established eye-tracking database.
● We develop a data-driven method for video-saliency detection, with respect to the
proposed HEVC features.
The rest of this chapter is organized as follows: in Section 9.2, we briefly review
the related work on video-saliency detection. In Section 9.3, we present our eye-
tracking database as well as the analysis and observations on our database. In light
of such analysis and observations, Section 9.4 proposes several HEVC features for
video-saliency detection. Section 9.5 outlines our learning-based method, which is
based on the proposed HEVC features. Section 9.6 shows the experimental results to
validate our method. Finally, Section 9.7 concludes this chapter.
attention. Combined with 12 multi-scale bottom-up features, [13] has high accu-
racy in task-driven saliency detection. Most recently, a dynamic Bayesian network
method [35] has been proposed for learning top-down visual attention model of play-
ing video games. Besides the task of playing video games, a data-driven method [34]
on video-saliency detection was proposed with the dynamic consistency and align-
ment models, for the task of action recognition. In [34], the proposed models are
learned from the task-driven human fixations on large-scale dynamic computer vision
databases like Hollywood-2 [37] and UCF Sports [38]. In [21], Li et al. developed
a probabilistic multitask learning method to include the task-related attention mod-
els for video-saliency detection. The “stimulus-saliency” functions are learned from
the eye-tracking database, as the top-down attention models to some typical tasks
of visual search. As a result, [21] is "good at" video-saliency detection in multiple tasks, being more generic than other methods that focus on a single visual task. However, all task-driven saliency-detection methods can only deal with their specific tasks.
For free-view video-saliency detection, Kienzle et al. [20] proposed a nonpara-
metric bottom-up method to model video saliency, via learning the center-surround
texture patches and temporal filters from the eye-tracking data. Recently, Lee et al.
[24] have proposed to extract the spatiotemporal features, i.e., rarity, compactness,
center prior, and motion, for the bottom-up video-saliency detection. In their bottom-
up method, all extracted features are combined together by an SVM, which is learned
from the training eye-tracking data. In addition to the bottom-up model, Hua et al. [36]
proposed to learn the middle-level features, i.e., gists of a scene, as the top-down cue
for both video and image-saliency detection. Most recently, Rudoy et al. [23] have
proposed to detect the saliency of a video, by simulating the way that humans watch
the video. Specifically, a visual attention model is learned to predict the saliency map
of a video frame, given the fixation maps from the previous frames. As such, the inter-
frame dynamics of gaze transitions can be taken into account during video-saliency
detection.
As aforementioned, this chapter mainly concentrates on utilizing the HEVC fea-
tures for video-saliency detection. However, there is a gap between HEVC features
and human visual attention. From a data-driven perspective, machine learning can be utilized in our method to investigate the relationship between HEVC features and visual attention, according to eye-tracking data. Thus, this chapter aims at learning an SVM classifier to predict the saliency of videos using features from the HEVC domain.
different quality. Through the data analysis, we found that visual attention is almost
unchanged when videos are compressed at high or medium quality (more than 30 dB).
This is consistent with the result of [40]. Compared with the conventional databases
(e.g., SFU [41] and DIEM [42]), the utilization of these videos benefits from the
state-of-the-art test sets in providing videos with diverse resolutions and content. For
the resolution, the videos vary from 1080p (1,920 × 1,080) to 240p (416 × 240). For
the content, the videos include sport events, surveillance, video conferencing, video
games, videos with the subscript, etc.
In our eye-tracking experiment, all videos were with YUV 4:2:0 sampling. Here, the resolutions of the videos in Class A of [39] were down-sampled to 1,280 × 800, as the screen resolution of the eye tracker can only reach 1,920 × 1,080. The other videos were displayed at their original resolutions. In our experiment, the videos were displayed in a random order at their default frame rates, to reduce the influence of the video-playing order on the eye-tracking results. Besides, a blank period of 5 seconds was inserted between two consecutive videos, so that the subjects could have a proper rest to avoid eye fatigue.
There were a total of 32 subjects (18 males and 14 females, aged from 19 to 60) involved in our eye-tracking experiment. These subjects were selected from the campuses of Beihang University and Microsoft Research Asia. All subjects had normal or corrected-to-normal eyesight. Note that only two subjects were experts who work in the research field of saliency detection. The other 30 subjects did not have any research background in video-saliency detection, and they were also naive to the purpose of our eye-tracking experiment.
The eye fixations of all 32 subjects over each video frame were recorded by
a Tobii TX300 eye tracker at a sample rate of 300 Hz. The eye tracker is integrated with a 23-inch LCD monitor, and the resolution of the monitor was set to 1,920 × 1,080. All subjects were seated on an adjustable chair at a distance of around 60 cm from the screen of the eye tracker, ensuring that their horizontal line of sight was at the center of the screen. Before the experiment, subjects were instructed to perform the 9-point calibration of the eye tracker. Then, all subjects
were asked to free-view each video. After the experiment, 392,163 fixations over
13,020 frames of 33 videos were collected. Here, the eye fixations of all subjects
and the corresponding MATLAB® code for our eye-tracking database are available
online: https://github.com/remega/video_database.
For all videos of our database, the features on splitting depth, bit allocation, and
MV were extracted from the corresponding HEVC bitstreams. Then, the maps of
these features were generated for each video frame. Note that the configuration to
generate the HEVC bitstreams can be found in Section 9.6. Afterwards, a 2D Gaussian
filter was applied to all three feature maps of each video frame. For each feature map,
after sorting pixels in the descending order of their feature values, the pixels were
equally divided into ten groups according to the values of corresponding features.
For example, the group of 0%–10% stands for the set of pixels, the features of which
rank top 10%. Finally, the number of fixations belonging to each group was counted
upon all 33 videos in our database.
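A minimal sketch of this grouping analysis is given below: pixels of a smoothed feature map are ranked by feature value, split into ten equal-sized groups, and the fraction of fixations falling into each group is counted. The array names and toy data are assumptions for illustration only.

import numpy as np

def fixation_proportions(feature_map, fixation_map, n_groups=10):
    """feature_map: 2D array of a smoothed HEVC feature.
    fixation_map: 2D array of fixation counts per pixel.
    Returns the proportion of fixations in each decile of the feature,
    from the top 10% of feature values down to the bottom 10%."""
    feat = feature_map.ravel()
    fix = fixation_map.ravel().astype(float)
    order = np.argsort(-feat)              # descending feature value
    groups = np.array_split(order, n_groups)
    counts = np.array([fix[g].sum() for g in groups])
    return counts / counts.sum()

# Toy example with random data (illustrative only).
rng = np.random.default_rng(0)
feature = rng.random((240, 416))
fixations = rng.poisson(0.01, size=(240, 416))
print(fixation_proportions(feature, fixations))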
We show in Figure 9.2 the percentages of eye fixations belonging to each group, in
which the values of the corresponding HEVC features decrease alongside the groups.
From this figure, we can find out that extensive attention is drawn by the regions
with large-valued HEVC features, especially for the feature of bit allocation. For
example, about 33% fixations fall into the regions of top 10% high-valued feature of
bit allocation, whereas the percentage of those hitting the bottom 10% is much less
than 2%. Hence, the HEVC features on splitting depth, bit allocation, and MV, are
explored for video-saliency detection in our method (Section 9.4).
Figure 9.2 The statistical results for fixations belonging to different groups of pixels, in which the values of the corresponding HEVC features are sorted in descending order. Here, all 392,163 fixations of the 33 videos are used for the analysis. In this figure, the horizontal axis indicates the groups of pixels; for example, 0%–10% denotes the first group of pixels, whose features rank in the top 10%. The vertical axis shows the percentage of fixations that fall into each group
Figure 9.3 Illustration of Observation 9.1. This figure shows the heat maps of
human fixations of all 32 subjects on several selected frames of videos
BasketballDrive and Kimono. In BasketballDrive, the green box is
drawn to locate the moving basketball
Figure 9.4 Illustration of Observation 9.2. This figure shows the heat maps of
visual attention of all 32 subjects, over several selected frames of
videos vidyo1 and ParkScene
Figure 9.5 Illustration of Observation 9.3. This figure shows the map of human fixations of all 32 subjects over a selected frame of the video PeopleOnStreet. Note that in the video a lot of visual attention is attracted by the old man, who pushes a trolley and walks in the opposite direction of the crowd
attention. This completes the analysis of Observation 9.2. Note that there exists a lag of human fixations, as the door is still fixated on after the person has left. This also satisfies Observation 9.1.
Observation 9.3: An object that moves in the opposite direction to the surrounding objects is likely to receive extensive fixations.
The previous work [10] has verified that human fixations on still images are influenced by the center-surround features of color and intensity. Actually, the center-surround feature of motion also has an important effect on attracting visual attention. As seen from Figure 9.5, the old man with a trolley moves in the opposite direction to the surrounding crowd, and he attracts the majority of visual attention. Therefore,
this suggests that an object moving in the opposite direction to its surroundings (i.e., with large center-surround motion) may receive extensive fixations. This completes the analysis of Observation 9.3.
In this section, we mainly focus on exploring features in HEVC domain that can be used to efficiently detect video saliency. As analyzed above, three HEVC features, i.e., splitting depth, bit allocation, and MV, are effective in predicting video saliency. Therefore, they serve as the basic features for video-saliency detection, presented in Section 9.4.1. Note that the camera motion has to be removed for the MV feature, using an efficient algorithm developed in Section 9.4.1. Based on the three basic HEVC features, the features on temporal and spatial difference are discussed in Sections 9.4.2 and 9.4.3, respectively.
via averaging all consumed bits in the corresponding CTU. Next, the bpp is normalized to b_{ij}^k within each video frame, and it is then included as one of the basic HEVC features to detect saliency.
MV. In video coding, the MV identifies the location of the matching prediction unit (PU) in the reference frame. In HEVC, MVs are sophisticatedly derived to indicate the motion between neighboring frames. Intuitively, MV can be used to detect video saliency, as motion is an obvious cue [16] for salient regions. This intuition has also been verified by the statistical analysis of Section 9.3.2. Therefore, MV is extracted as a basic HEVC feature in our method.
During video coding, MV results from two factors: the camera motion and the object motion. It has been pointed out in [43] that, in a video, moving objects may receive extensive visual attention, while the static background normally draws little attention. It is thus necessary to distinguish moving objects from the static background. Unfortunately, the MVs of the static background may be as large as those of moving objects, due to the camera motion. On the other hand, although the temporal difference of MVs is able to make the camera motion negligible for the static background, it may also miss the moving objects. Therefore, the camera motion has to be removed from the calculated MVs to estimate the object motion for saliency detection.
Figure 9.6 shows that the camera motion can be estimated as the dominant MV in a video frame. In this chapter, we therefore develop a voting algorithm to estimate the camera motion. Assuming that m_{ij}^k is the two-dimensional MV of pixel (i, j) at the kth frame, the dominant camera motion m_k^c in this frame can be determined in the following way.
First, the static background S_k^b is roughly extracted as

S_k^b = \Big\{ (i, j) \;\Big|\; d_{ij}^k \cdot b_{ij}^k < \frac{1}{|I_k|} \sum_{(i', j') \in I_k} d_{i'j'}^k \cdot b_{i'j'}^k \Big\},    (9.1)
Figure 9.6 An example of the MV values of all PUs in (a) a frame with no camera motion and (b) a frame with right-to-left camera motion. Note that the MVs are extracted from HEVC bitstreams. In (a) and (b), the dots stand for the origin of each MV, and the blue lines indicate the intensity and angle of each MV. It can be seen that in (a) there is no camera motion, as most MV values are close to zero, whereas the camera motion in (b) is from right to left according to most MV values
for the kth frame I_k (with |I_k| pixels). This is because the static background generally has smaller splitting depth and bit allocation than the moving foreground objects. Then, the azimuth a(m_k^c) of the dominant camera motion can be calculated by voting over all MV angles in the background S_k^b:

a(\mathbf{m}_k^c) = \arg\max \, \mathrm{hist}\Big( \big\{ a(\mathbf{m}_{ij}^k) \big\}_{(i, j) \in S_k^b} \Big),    (9.2)

where a(m_{ij}^k) is the azimuth of MV m_{ij}^k, and hist(·) is the azimuth histogram of all MVs. In this chapter, 16 bins with equal angle width (= 360°/16 = 22.5°) are applied for the histogram. After obtaining a(m_k^c), the radius r(m_k^c) of the camera motion is calculated by averaging over all MVs in the selected bin of a(m_k^c). Finally, the camera motion of each frame can be obtained from a(m_k^c) and r(m_k^c).
we show in Figure 9.7 some subjective results of the camera motion estimated by our
voting algorithm (in yellow arrows), as well as the annotated ground truth of camera
motion (in blue arrows). As can be seen from this figure, our algorithm is capable of
accurately estimating the camera motion. See Appendix for more justification on the
estimation of camera motion.
Next, in order to track the motion of objects, all MVs obtained in HEVC domain need to be processed to remove the estimated camera motion. All processed MVs are then normalized within each video frame, denoted as m̂_{ij}^k. Since it has been argued in [16] that visual attention is probably attracted by moving objects, ‖m̂_{ij}^k‖_2 is utilized as one of the basic HEVC features to predict video saliency.
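The background extraction of (9.1), the azimuth voting of (9.2) and the subsequent camera-motion removal can be sketched as follows. The per-pixel depth, bpp and MV maps, the helper names and the toy data are assumptions for illustration; in particular, the radius is computed here by averaging MV magnitudes in the winning bin, which is one possible reading of the averaging step described above.

import numpy as np

def estimate_camera_motion(depth, bpp, mv, n_bins=16):
    """depth, bpp: 2D maps; mv: HxWx2 map of motion vectors.
    Returns the dominant (camera) motion vector of the frame."""
    # (9.1): rough static background = pixels whose depth*bpp is below the frame average.
    bg = (depth * bpp) < np.mean(depth * bpp)
    bg_mv = mv[bg]                                   # MVs of background pixels, shape (N, 2)
    # (9.2): vote over MV azimuths with 16 equal-width bins (22.5 degrees each).
    angles = np.arctan2(bg_mv[:, 1], bg_mv[:, 0])
    hist, edges = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
    b = np.argmax(hist)
    in_bin = (angles >= edges[b]) & (angles < edges[b + 1])
    # Radius of the camera motion = mean MV magnitude in the winning bin.
    radius = np.linalg.norm(bg_mv[in_bin], axis=1).mean()
    azimuth = 0.5 * (edges[b] + edges[b + 1])
    return np.array([radius * np.cos(azimuth), radius * np.sin(azimuth)])

def remove_camera_motion(mv, cam):
    """Subtract the estimated camera motion and return normalized MV magnitudes."""
    obj = mv - cam                                   # broadcast over HxWx2
    mag = np.linalg.norm(obj, axis=2)
    return mag / (mag.max() + 1e-12)                 # normalized within the frame

# Toy frame (illustrative only).
H, W = 60, 100
rng = np.random.default_rng(1)
depth = rng.integers(0, 4, (H, W)).astype(float)
bpp = rng.random((H, W))
mv = rng.normal([3.0, 0.0], 0.5, (H, W, 2))          # mostly rightward camera pan
cam = estimate_camera_motion(depth, bpp, mv)
feat = remove_camera_motion(mv, cam)
print(cam, feat.shape)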
Figure 9.7 The results of camera motion estimation, yielded by our voting
algorithm. The first six videos are with some extended camera motion,
whereas the last one is without any camera motion. In the frames of the
second row, the yellow and blue arrows represent the estimated and
manually annotated vectors of the camera moving from frames of the
first row to frames of the second row, respectively. Similarly, the yellow
and blue arrows in the frames of the third row show the camera motion
from frames of the second row to the third row. Refer to Appendix for
the way of annotating ground-truth camera motion
where the parameter σ_d controls the weights on the splitting-depth difference between two frames. In (9.3), d_{ij}^{k-l} is the splitting depth of pixel (i, j) at the (k − l)th frame. After considering the camera motion with our voting algorithm, we assume that (i^{k,l}, j^{k,l}) is the pixel at the (k − l)th frame matching pixel (i, j) at the kth frame. To remove the influence of the camera motion, we replace d_{ij}^{k-l} in (9.3) by d_{i^{k,l} j^{k,l}}^{k-l}. Then, (9.3) is rewritten as

\Delta_t d_{ij}^k = \frac{\sum_{l=1}^{k} \exp(-l^2/\sigma_d^2)\, \big\| d_{ij}^k - d_{i^{k,l} j^{k,l}}^{k-l} \big\|_1}{\sum_{l=1}^{k} \exp(-l^2/\sigma_d^2)}.    (9.4)

After calculating (9.4), Δ_t d_{ij}^k needs to be normalized within each video frame, as one of the temporal difference features in HEVC domain.
Furthermore, the bpp difference across neighboring frames is also regarded as a feature for saliency detection. Let Δ_t b_{ij}^k denote the temporal difference of the bpp at pixel (i, j) between the currently processed kth frame and its previous frames. Similar to (9.4), Δ_t b_{ij}^k can be obtained by

\Delta_t b_{ij}^k = \frac{\sum_{l=1}^{k} \exp(-l^2/\sigma_b^2)\, \big\| b_{ij}^k - b_{i^{k,l} j^{k,l}}^{k-l} \big\|_1}{\sum_{l=1}^{k} \exp(-l^2/\sigma_b^2)},    (9.5)

where σ_b decides the weights of the bpp difference between frames. In (9.5), with the compensated camera motion, b_{i^{k,l} j^{k,l}}^{k-l} is the bpp of pixel (i^{k,l}, j^{k,l}) at the (k − l)th frame, which matches pixel (i, j) at the kth frame.
Finally, the temporal difference of MV is also taken into account in a similar way. Recall that m̂_{ij}^k is the extracted MV of each pixel,
with the camera motion being removed. Since m̂_{ij}^k is a 2D vector, the ℓ2-norm is applied to compute the temporal difference of MVs (denoted by Δ_t m̂_{ij}^k) as follows:

\Delta_t \hat{m}_{ij}^k = \frac{\sum_{l=1}^{k} \exp(-l^2/\sigma_m^2)\, \big\| \hat{\mathbf{m}}_{ij}^k - \hat{\mathbf{m}}_{i^{k,l} j^{k,l}}^{k-l} \big\|_2}{\sum_{l=1}^{k} \exp(-l^2/\sigma_m^2)}.    (9.6)
In (9.6), m̂_{i^{k,l} j^{k,l}}^{k-l} is the MV of pixel (i^{k,l}, j^{k,l}) at the (k − l)th frame, which is the colocated pixel of (i, j) at the kth frame after the camera motion is removed by our voting algorithm.
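As a sketch, the exponentially weighted temporal differences of (9.4)–(9.6) can be computed as below for a single pixel trajectory. The motion-compensated matching (i^{k,l}, j^{k,l}) is assumed to have been resolved already, so the function simply receives the current feature value and its compensated history; the function name and example values are hypothetical.

import numpy as np

def temporal_difference(current, history, sigma):
    """current: feature value (scalar, or 2-vector for MVs) at frame k.
    history: list of motion-compensated values at frames k-1, k-2, ..., ordered by l.
    sigma: temporal weighting parameter (sigma_d, sigma_b, or sigma_m).
    Implements the weighted average of (9.4)-(9.6)."""
    current = np.atleast_1d(np.asarray(current, float))
    num, den = 0.0, 0.0
    for l, past in enumerate(history, start=1):
        w = np.exp(-(l ** 2) / (sigma ** 2))
        past = np.atleast_1d(np.asarray(past, float))
        # l1 norm for depth/bpp (scalars), l2 norm for MVs (2-vectors).
        diff = np.abs(current - past).sum() if current.size == 1 \
               else np.linalg.norm(current - past)
        num += w * diff
        den += w
    return num / den

# Splitting-depth example with sigma_d = 46 (values illustrative).
print(temporal_difference(3.0, [2.0, 2.0, 1.0], sigma=46))
# MV example with sigma_m = 26.
print(temporal_difference([1.5, -0.5], [[1.0, 0.0], [0.5, 0.2]], sigma=26))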
to compute the spatial difference of splitting depth, bit allocation, and MV. As in the
above equations, ξd , ξb , and ξm are the parameters to control the spatial weighting of
each feature.
Finally, all nine features in HEVC domain can be achieved in our saliency
detection method. Since all the proposed HEVC features are block wise, the
block-to-pixel refinement is required to obtain smooth feature maps. For the block-
to-pixel refinement, a 2D Gaussian filter is applied to three basic features. In this
chapter, the dimension and standard deviation of the Gaussian filter are tuned to be
(2h/15) × (2h/15) and (h/30), where h is the height of the video. It is worth men-
tioning that the above features on spatial and temporal difference are explored in
compressed domain with the block-to-pixel refinement, while the existing methods
compute contrast features in pixel domain (e.g., in [10,11]). Additionally, unlike the
existing methods, the camera motion is estimated and removed when calculating
the feature contrast in our method. Despite being simple and straightforward, these features are effective and efficient, as evaluated in the experiment section.
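The block-to-pixel refinement can be sketched with a Gaussian filter whose support and standard deviation follow the values quoted above ((2h/15) × (2h/15) and h/30). The use of scipy.ndimage.gaussian_filter, the nearest-neighbour upsampling via np.kron and the final normalization are implementation assumptions, not the chapter's exact code.

import numpy as np
from scipy.ndimage import gaussian_filter

def block_to_pixel(block_map, frame_h, frame_w, ctu=64):
    """Expand a CTU-wise feature map to pixel resolution and smooth it
    with a 2D Gaussian of standard deviation h/30 (h = frame height)."""
    pixel_map = np.kron(block_map, np.ones((ctu, ctu)))[:frame_h, :frame_w]
    sigma = frame_h / 30.0
    # truncate is chosen so that the kernel radius is roughly h/15, i.e. size 2h/15.
    smooth = gaussian_filter(pixel_map, sigma=sigma, truncate=(frame_h / 15.0) / sigma)
    smooth -= smooth.min()
    return smooth / (smooth.max() + 1e-12)   # normalized within the frame

# Toy 4x7 CTU map for a 240x416 frame (illustrative only).
blocks = np.random.rand(4, 7)
print(block_to_pixel(blocks, 240, 416).shape)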
Figure 9.8 summarizes the procedure of HEVC feature extraction in our saliency
detection method. As seen from Figure 9.8, the maps of nine features have been
obtained, based on splitting depth, bit allocation, and MV of HEVC bitstreams. We
argue that no single feature is sufficient on its own [2], and that different features have different impacts on saliency detection. We thus integrate the maps of all nine features with learned weights. For more details, refer to the next section.
[Figure 9.8 diagram: from the HEVC bitstreams of videos, splitting depths, bit allocation and motion vectors (after camera-motion removal) are extracted and smoothed by Gaussian filters to form the basic features; the spatial differences of splitting depths, bit allocation and motion vectors are then computed as further feature maps.]
Figure 9.8 Framework of our HEVC feature extractor for video-saliency detection
[Figure 9.9 diagram: training stage, where the HEVC bitstreams of training videos and the human fixations are fed to the HEVC feature extractor and linear SVM learning to obtain the model; test stage, where the HEVC bitstream of a test video is fed to the HEVC feature extractor, followed by linear combination and forward smoothing, to produce the video saliency map.]
In (9.8), w and b are the parameters to be learned for maximizing the margin between
positive and negative samples, and βn is a nonnegative slack variable evaluating the
degree of classification error of fn . In addition, C balances the trade-off between
the error and margin. Function φ(·) transforms the training vector of HEVC features
fn to higher dimensional space. Then, w can be seen as the linear combination of
transformed vectors:
\mathbf{w} = \sum_{m=1}^{N} \lambda_m l_m \cdot \varphi(\mathbf{f}_m),    (9.9)

so that, for any feature vector f_n,

\langle \mathbf{w}, \varphi(\mathbf{f}_n) \rangle = \sum_{m=1}^{N} \lambda_m l_m \cdot \langle \varphi(\mathbf{f}_m), \varphi(\mathbf{f}_n) \rangle.    (9.10)

Note that ⟨φ(f_m), φ(f_n)⟩ indicates the inner product of φ(f_m) and φ(f_n). To calculate (9.10), a radial basis function (RBF) kernel is introduced:

K(\mathbf{f}_m, \mathbf{f}_n) = \langle \varphi(\mathbf{f}_m), \varphi(\mathbf{f}_n) \rangle = \exp(-\gamma \|\mathbf{f}_m - \mathbf{f}_n\|_2^2),    (9.11)
where γ (> 0) stands for the kernel parameter. Here, we utilize the above RBF
kernel due to its simplicity and effectiveness. When training the C-SVC for saliency
detection, the penalty parameter C in (9.8) is set to 2−3 , and γ of the RBF kernel is
tuned to be 2−15 , such that the trained C-SVC is rather efficient in detecting saliency.
Finally, w and b can be worked out in the trained C-SVC as the model of video-saliency
detection, to be discussed below.
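A minimal training sketch with scikit-learn's SVC is shown below, using the RBF kernel of (9.11) and the C and γ values quoted above (2⁻³ and 2⁻¹⁵). The feature matrix and labels are random placeholders; in the chapter, each sample is a nine-dimensional HEVC feature vector labelled as salient or non-salient from the eye-tracking data.

import numpy as np
from sklearn.svm import SVC

# Hypothetical training data: N samples x 9 HEVC features, binary labels.
rng = np.random.default_rng(0)
X_train = rng.random((2000, 9))
y_train = (X_train[:, 1] + X_train[:, 4] > 1.0).astype(int)   # placeholder labels

# C-SVC with the RBF kernel, C = 2^-3 and gamma = 2^-15 as in the text.
svc = SVC(C=2.0 ** -3, kernel="rbf", gamma=2.0 ** -15)
svc.fit(X_train, y_train)

# The decision value plays the role of <w, phi(f)> + b for a new feature vector f.
X_test = rng.random((5, 9))
print(svc.decision_function(X_test))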
where Fk defines the pixel-wise matrix of nine HEVC features at the kth video frame.
Note that w in (9.12) is one set of weights for the binary classifier of C-SVC, which
have been obtained using the above training algorithm.
Since Observation 9.1 offers a key insight that visual attention may lag behind
the moving or new appearing objects, a forward smoothing filter is developed in our
method to take into account the saliency maps of previous frames. Mathematically, the
final saliency map Ŝk of the kth video frame is calculated by the forward smoothing
filter as follows:
\hat{S}_k = \frac{1}{t \cdot fr} \sum_{k' = k - t \cdot fr + 1}^{k} S_{k'},    (9.13)
where t (> 0) is the time duration1 of the forward smoothing, and fr is the frame
rate of the video. Note that a simple forward smoothing filter of (9.13) is utilized
here, since we mainly concentrate on extracting and integrating features for saliency
detection. Some advanced tracking filters may be applied, instead of the forward
smoothing filter in our method, for further improving the performance on saliency
detection. To model visual attention on video frames, the final saliency maps need to
be smoothed with a 2D Gaussian filter, which is in addition to the one for each single
feature map (as shown in Figure 9.8). Note that the 2D Gaussian filter here shares the
same parameters as those for feature maps.
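For illustration, the forward smoothing of (9.13) amounts to a moving average over the last t · fr frame-level saliency maps, as sketched below with arbitrary map sizes.

import numpy as np

def forward_smooth(saliency_maps, t=0.3, fr=30):
    """saliency_maps: list of 2D saliency maps S_k' for frames up to the current one.
    Returns the smoothed map of (9.13), averaging the last round(t * fr) frames."""
    window = max(1, int(round(t * fr)))
    recent = saliency_maps[-window:]
    return np.mean(recent, axis=0)

# Toy sequence of 20 random 240x416 saliency maps (illustrative only).
maps = [np.random.rand(240, 416) for _ in range(20)]
print(forward_smooth(maps, t=0.3, fr=30).shape)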
¹We found through experiments that t = 0.3 s gives the highest saliency-detection accuracy, so the time duration t of our forward smoothing was set to 0.3 s in Section 9.6.
bitstreams of all 33 videos in our database were produced for both training and test.
In HM 16.0, the low delay (LD) P main configuration was chosen. In addition, the
latest R–λ rate control scheme [30] was enabled in HM 16.0. Since the test videos are
with diverse content and resolutions, we followed the way of [30] to set the bit rates
the same as those at fixed QPs. The CTU size was set to 64 × 64 and maximum CTU
depth was 3 to allow all possible CTU partition structures for saliency detection. Each
group of pictures (GOP) was composed of 4 P frames. Other encoding parameters
were set by default, using the common encoder_lowdelay_P_main.cfg configuration
file of HM.
Other working conditions. The implementation of our method in random access
(RA) configuration is to be presented in Section 9.6.5. The rate control of RA in HM
16.0 is also enabled. In our experiments, we set all other parameters of RA via the
encoder_randomaccess_main.cfg file. Note that the GOP of RA is 8 B frames for
HM 16.0. Section 9.6.5 further presents the saliency-detection results of our method for the bitstreams of x265, which is more practical than the HM encoder in terms of encoding and decoding time.² Here, the x265 v1.8 encoder, embedded in the latest FFmpeg, was applied. For x265, both LD and RA were tested. In x265, the bit rates were chosen in the same way as for HM 16.0. The GOP structure is 4 P frames for LD and four frames (BBBP) for RA. Other parameters were all set to their defaults in FFmpeg with the x265 codec. It is worth pointing out that the x265 codec was used to extract features from the bitstreams encoded by x265, while the features of HM 16.0 bitstreams were extracted by the HM 16.0 software.
Training setting. In order to train the C-SVC, our database of Section 9.3.1 was
divided into nonoverlapping sets. For the fair evaluation, 3-fold cross validation was
conducted in our experiments, and the averaged results are reported in Sections 9.6.2
and 9.6.3. Specifically, our database was equally partitioned into three nonoverlapping
sets. Then, two sets were used as training data, and the remaining set was retained for
validating saliency detection. The cross-validation process is repeated three times, with each of the three sets used exactly once as the validation data. In the training set, 3 pixels of each video frame were randomly selected from the top 5% salient regions of the ground-truth fixation maps as positive samples. Similarly, 3 pixels of each video frame were chosen from the bottom 70% salient regions as negative samples. Then, both positive and negative samples were available in each cross validation to train the C-SVC with (9.8).
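The positive/negative sampling described above can be sketched as follows. The thresholds (top 5% and bottom 70% of the ground-truth saliency) follow the text, while the function name, the per-frame sample counts and the toy map are assumptions.

import numpy as np

def sample_pixels(gt_saliency, n_pos=3, n_neg=3, rng=None):
    """Pick positive pixels from the top 5% and negative pixels from the
    bottom 70% of the ground-truth fixation (saliency) map of one frame."""
    rng = rng or np.random.default_rng()
    flat = gt_saliency.ravel()
    pos_pool = np.flatnonzero(flat >= np.quantile(flat, 0.95))
    neg_pool = np.flatnonzero(flat <= np.quantile(flat, 0.70))
    pos = rng.choice(pos_pool, size=n_pos, replace=False)
    neg = rng.choice(neg_pool, size=n_neg, replace=False)
    return pos, neg   # flat pixel indices; features at these pixels form the training set

# Toy ground-truth saliency map (illustrative only).
gt = np.random.rand(240, 416)
pos_idx, neg_idx = sample_pixels(gt, rng=np.random.default_rng(0))
print(len(pos_idx), len(neg_idx))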
²It takes around 100 s for HM to encode a 1080p video frame on a PC with an Intel Core i7-4770 CPU and 16 GB RAM. By contrast, x265 adopts parallel computing and fast methods to encode videos, such that real-time 4K HEVC encoding can be achieved by x265.
finding bit rates suitable for all videos to ensure proper visual quality. To solve such
an issue, we follow [30] in setting the bit rates of each video for rate control the
same as those of fixed QPs. Then, we report in Figure 9.10 the AUC, CC, and NSS
results of our method at different bit rates. Note that the bit rates averaged over all
33 videos are shown, varying from 2,068 to 100 kbps. Figure 9.10 shows that our
method achieves the best performance in terms of CC and NSS, when the averaged
bit rate of rate control is 430 kbps (equal to those of fixed QP = 37). Therefore, such
bit-rate setting is used for the following evaluation. Figure 9.10 also shows that the
bit rates have slight impact on the overall performance of our method in terms of
AUC, NSS, and CC. The minimum values of AUC, NSS, and CC are above 0.82,
1.52, and 0.41, respectively, at different bit rates, which are superior to all other
methods reported in Section 9.6.3. Besides, one may observe from Figure 9.10 that the saliency-detection accuracy of some individual HEVC features fluctuates as the bit rate changes. Hence, this figure suggests that our saliency detection should not rely on a single feature.
[Figure 9.10 panels: AUC, NSS and CC versus bit rate (kbps) for the overall method, and for the individual features (MV, bit allocation, splitting depth and their temporal and spatial differences).]
Figure 9.10 Performance comparison of our method (first column) and our single
features (second to fourth columns) at different bit rates. The bit rates
of each video in our rate control are the same as those of fixed QPs,
i.e., QP = 27, 32, 35, 37, 39, 42, and 47. Here, the bit rates averaged
over all 33 videos are shown in the horizontal axis
[Figure 9.11: AUC of the temporal difference features versus σ_m/σ_b/σ_d, and AUC of the spatial difference features of splitting depth, bit allocation and MV versus ξ_d, ξ_b and ξ_m, respectively.]
On the contrary, the combination of all features is robust across various bit rates, implying the benefit of applying the C-SVC to learn to integrate all HEVC features for saliency detection.
Next, we analyze the parameters of our saliency-detection method. When computing the spatial difference features through (9.7), the parameters ξ_d, ξ_b, and ξ_m have all been traversed to find their optimal values. The results are shown in Figure 9.11. As can be seen in this figure, ξ_d, ξ_b, and ξ_m should be set to 13, 3, and 57 to optimize the saliency-detection results. In addition, the saliency-detection accuracy of the temporal difference features almost reaches its maximum when σ_d, σ_b, and σ_m of (9.4), (9.5), and (9.6) are equal to 46, 46, and 26, respectively. Finally, we adopt this optimal parameter selection for the following evaluation (i.e., ξ_d = 13, ξ_b = 3, ξ_m = 57, σ_d = 46, σ_b = 46, and σ_m = 26).
The effectiveness of the center bias in saliency detection has been verified in [45], as humans tend to pay more attention to the center of the image/video than to the surround.
Figure 9.12 ROC curves of saliency detection by our and other state-of-the-art
methods. Note that the results are averaged over frames of all test
videos of 3-fold cross validation
In this chapter, we follow [45] and impose the same center bias map B on both our and the other compared methods, for a fair comparison. Specifically, the center bias is
based on the Euclidean distance of each pixel to video frame center (ic , jc ) as follows:
(i − ic )2 + (j − jc )2
B(i, j) = 1 − , (9.14)
ic2 + jc2
where B(i, j) is the center bias value at pixel (i, j). Then, the detected saliency maps
of all methods are weighted by the above center bias maps.
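The center-bias weighting of (9.14) can be generated as below and applied by element-wise multiplication with a detected saliency map; the function name and the toy saliency map are assumptions.

import numpy as np

def center_bias(h, w):
    """Center-bias map of (9.14): 1 minus the normalized Euclidean
    distance of every pixel to the frame centre (i_c, j_c)."""
    ic, jc = h / 2.0, w / 2.0
    i, j = np.mgrid[0:h, 0:w]
    dist = np.sqrt((i - ic) ** 2 + (j - jc) ** 2)
    return 1.0 - dist / np.sqrt(ic ** 2 + jc ** 2)

# Weight a (hypothetical) detected saliency map by the center bias.
saliency = np.random.rand(240, 416)
weighted = saliency * center_bias(240, 416)
print(weighted.shape, float(weighted.max()))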
³In our experiments, we directly used the codes provided by the authors for all methods except Fang et al. [27], which we implemented ourselves as the code is not available online.
rates than others at the same false positive rates. In a word, the ROC curves illustrate
the superior performance of our method in saliency detection.
AUC and EER. In order to quantify the ROC curves, we report in Table 9.1
the AUC and EER results of our and other seven state-of-the-art methods. Here,
both mean and standard deviation are provided for the AUC and EER results of
all test video frames of 3-fold cross validation. This table shows that our method
performs better than all seven other methods. Specifically, there are AUC gains of 0.026 and 0.038 over Fang et al. [27] and OBDL [28], respectively, which also work in the compressed domain. The EER of our method decreases by 0.028 and 0.036 compared with the compressed domain methods of [27,28]. A smaller EER means that our method has a lower misclassification probability when the false positive rate equals the false negative rate. The possible reasons for the improvement of our method are that (1) the new compressed domain features (i.e., CTU structure and bit allocation) are developed in light of the latest HEVC standard; (2) the camera motion has been removed in our method; and (3) a learning mechanism is incorporated into our method to bridge the gap between HEVC features and human visual attention. Besides, our method outperforms the uncompressed domain learning-based methods [19,23], with 0.007 and 0.038 improvement in AUC as well as 0.009 and 0.029 reduction in EER. This verifies the effectiveness of the newly proposed features in the compressed domain, which benefit from the well-developed HEVC standard. However, since extensive high- and middle-level features are applied in [19], the AUC improvement of our method over [19] is small (around 0.007). Generally speaking, our method outperforms all seven other methods, whether in the compressed or uncompressed domain.
NSS, CC, and KL. Now, we concentrate on the comparison of NSS, CC, and KL
metrics to evaluate the accuracy of saliency detection on all test videos. The averaged
results (with their standard deviation) of NSS, CC, and KL, by our and other seven
state-of-the-art methods, are also reported in Table 9.1. Note that the method with a
higher value of NSS, CC or KL better predicts the human fixations. Again, it can be seen from Table 9.1 that our method improves the saliency-detection accuracy over all other methods in terms of NSS, CC, and KL. Moreover, the improvement in NSS, CC, and KL, especially CC, is much larger than that in AUC.
Saliency maps. Figure 9.13 shows the saliency maps of four randomly selected
test videos, detected by our and other seven methods, as well as the ground-truth
human fixation maps. Note that the results of only one frame for each video are
shown in these figures. From these figures, one may observe that, in comparison with all seven other methods, our method is capable of accurately locating the salient regions in a video frame, producing maps much closer to the human fixation maps. In summary, the
subjective results here, together with the objective results above, demonstrate that our
method is superior to other state-of-the-art methods in our database.
Computational time. For time-efficiency evaluation, the computational time
of our and other methods has been recorded⁴ and listed in Table 9.2. We can see
from this table that our method ranks third in terms of computational speed, only
⁴All methods were run in the same environment: MATLAB 2012b on a computer with an Intel Core i7-4770 CPU and 16 GB RAM.
Table 9.1 The averaged accuracy of saliency detection by our and other seven methods, in mean (standard deviation) of all test
videos of 3-fold cross validation over our database
Our Itti [10] Surprise [14] Judd [19] PQFT [16] Rudoy [23] Fang [27] OBDL [28]
Note: The bold values indicate the best saliency prediction results in the table.
Figure 9.13 Saliency maps of four videos selected from the first time of our
cross-validation experiments. The maps were yielded by our and other
seven methods as well as the ground-truth human fixations. Note that the
results of only one frame are shown for each selected video: (a) input,
(b) human, (c) our, (d) Itti, (e) surprise, (f) Judd, (g) PQFT, (h) Rudoy,
(i) Fang, (j) OBDL
Table 9.2 Computational time per video frame averaged over our database for our
and other seven methods
Our Itti [10] Surprise [14] Judd [19] PQFT [16] Rudoy [23] Fang [27] OBDL [28]
Time (s) 3.1 1.6 40.6 23.9 0.5 98.5 15.4 5.8
slower than Itti [10] and PQFT [16]. However, as discussed above, the performance
of Itti and PQFT is rather inferior compared with other methods, and their saliency
detection accuracy is much lower than that of our method. In summary, our method
has high time efficiency with effective saliency prediction performance. The main
reason is that our method benefits from the modern HEVC encoder and the learning
mechanism, thus not wasting much time on exploiting saliency-detection features. We
further transplanted our method into C++ program on the VS.net platform to figure
out its potential in real-time implementation. After the transplantation, our method
consumes averaged 140 ms per frame over all videos of our database and achieves
real-time detection for 480p videos at 30 frame per second (fps). It is worth pointing
out that some speeding-up techniques, like parallel computing, may further reduce the
computational time of our method for real-time saliency detection of high-resolution
videos.
Table 9.3 The averaged accuracy of saliency detection over the SFU and DIEM databases by our and the other seven methods: Itti [10], Surprise [14], Judd [19], PQFT [16], Rudoy [23], Fang [27], and OBDL [28]
Note: The bold values indicate the best saliency prediction results in the table.
Table 9.3 compares the saliency-detection accuracy of our and other methods over the SFU and DIEM databases. Again, our
method performs much better than others in terms of all five metrics. Although the
C-SVC was trained on our database, our method still significantly outperforms all
seven conventional methods over other databases.
Although the above results were mainly obtained with the codes released by the authors, it is fairer
to compare with the results reported in their papers. However, it is hard to find
papers reporting the results of all seven methods on the same database. Therefore,
we only compare with the reported results of the top-performing method. We
can see from Tables 9.1 and 9.3 that, among all methods we compared, Rudoy [23]
generally ranks highest on our, the SFU, and the DIEM databases. Thus, we implemented our
method on the same database as Rudoy [23] (i.e., the DIEM database), and then we
compared the results of our method to those of PQFT [16] and Rudoy [23],
which were reported in [23]. The comparison is provided in Table 9.4. Note that the
comparison is in terms of the median shuffled-AUC, as the shuffled version of AUC
measured with median values is what is available in [23]. Note that shuffled-AUC is much
smaller than AUC, due to the removed center bias prior. We can see from Table 9.4
that our method again performs better than [16,23].
(Figure: comparison of AUC, NSS, CC, and KL results for HM + LD, HM + RA, X265 + LD, X265 + RA, Rudoy, and Fang.)
AUC 0.73(0.10) 0.76(0.09) 0.68(0.11) 0.72(0.09) 0.75(0.09) 0.69(0.10) 0.71(0.10) 0.79(0.08) 0.69(0.12)
NSS 0.84(0.49) 1.26(0.72) 0.85(0.67) 0.97(0.55) 1.15(0.63) 0.86(0.61) 0.82(0.50) 1.38(0.70) 0.78(0.62)
CC 0.23(0.12) 0.31(0.15) 0.19(0.15) 0.23(0.12) 0.27(0.14) 0.20(0.15) 0.23(0.13) 0.35(0.15) 0.19(0.15)
KL 0.19(0.09) 0.24(0.10) 0.19(0.09) 0.22(0.08) 0.24(0.09) 0.19(0.08) 0.19(0.08) 0.27(0.09) 0.12(0.09)
EER 0.27(0.08) 0.29(0.09) 0.35(0.09) 0.33(0.08) 0.30(0.08) 0.34(0.09) 0.33(0.09) 0.27(0.09) 0.35(0.02)
0.82
0.8
0.78
0.76
0.74
AUC
0.72
0.7 MV
Temporal diff. of MV
0.68 Spatial diff. of MV
Bit allocation
Temporal diff. of bit allocation
0.66 Spatial diff. of bit allocation
Splitting depth
Temporal diff. of splitting depth
0.64 Spatial diff. of splitting depth
Six features comb.
Nine features comb.
0.62
H
×2
×2
M
65
65
+L
+R
+L
+R
D
A
Figure 9.15 AUC curves of saliency detection by each single feature and feature
combination. Six comb. and nine comb. mean the results of saliency
detection by six features (excluding the splitting-depth features) and by
all nine features, respectively. Similar results can be found for other
metrics, e.g., CC
Table 9.6 The averaged accuracy of saliency detection by our method with C-SVC
and equal weight
9.7 Conclusion
In this chapter, we found that the state-of-the-art HEVC encoder is not only efficient
in video coding but also effective in providing useful features for saliency
detection. Therefore, this chapter has proposed a novel method for learning to detect
video saliency with several HEVC features. Specifically, to facilitate the study on
video-saliency detection, we first established an eye-tracking database on viewing
33 uncompressed videos from test sets commonly used for HEVC evaluation. The
statistical analysis on our database revealed that human fixations tend to fall into the
regions with the high-valued HEVC features of splitting depth, bit allocation, and
MV. Besides, three observations were also found from our eye-tracking database.
According to the analysis and observations, we proposed to extract and then com-
pute several HEVC features, on the basis of splitting depth, bit allocation, and MV.
Next, we developed the C-SVC, as a nonlinear SVM classifier, to learn the model of
video saliency with regard to the proposed HEVC features. Finally, the experimental
results verified that our method outperforms other state-of-the-art saliency detection
methods, in terms of ROC, EER, AUC, CC, NSS, and KL metrics.
In practical wireless multimedia communications, almost all videos exist in the form
of bitstreams generated by video-coding techniques. Since HEVC is the latest
video-coding standard, there is no doubt that HEVC bitstreams will be prevalent in
the near future. Accordingly, our method, which operates in the HEVC domain, is more
practical than other state-of-the-art uncompressed-domain methods, as both the time
and the storage cost of decompressing videos can be saved.
References

Chapter 10
Deep learning for indoor localization based on bimodal CSI data
In this chapter, we incorporate deep learning for indoor localization utilizing channel
state information (CSI) with commodity 5 GHz Wi-Fi. We first introduce the state-of-
the-art deep-learning techniques including deep autoencoder network, convolutional
neural network (CNN), and recurrent neural network (RNN). We then present a deep-
learning-based algorithm to leverage bimodal CSI data, i.e., average amplitudes and
estimated angle of arrivals (AOA), for indoor fingerprinting. The proposed scheme
is validated with extensive experiments. Finally, we discuss several open research
problems for indoor localization based on deep-learning techniques.
10.1 Introduction
The proliferation of mobile devices has fostered great interest in indoor-location-based
services, such as indoor navigation, robot tracking in factories, locating workers on
construction sites, and activity recognition [1–8], all requiring accurately identifying
the locations of mobile devices indoors. The indoor environment poses a complex
radio-propagation channel, including multipath propagation, blockage, and shadow
fading, and stimulates great research efforts on indoor localization theory and sys-
tems [9]. Among various indoor-localization schemes, Wi-Fi-based fingerprinting is
probably one of the most widely used. In fingerprinting, a database is first built with
data collected from a thorough measurement of the field in the off-line training stage.
Then, the position of a mobile user can be estimated by comparing the newly received
test data with that in the database. A unique advantage of this approach is that no
extra infrastructure needs to be deployed.
Many existing fingerprinting-based indoor-localization systems use received
signal strength (RSS) as fingerprints, due to its simplicity and low hardware require-
ment [10,11]. For example, RADAR is one of the first RSS-based fingerprinting systems
that incorporate a deterministic method for location estimation [10]. For higher accu-
racy, Horus, another RSS-based fingerprinting scheme, adopts a probabilistic method
based on K-nearest neighbor (KNN) [9] for location estimation [11]. The performance
of RSS-based schemes is usually limited by two inherent shortcomings of RSS. First,
due to the multipath effect and shadow fading, the RSS values are usually highly
diverse, even for consecutively received packets at the same position. Second, RSS
value only reflects the coarse channel information, since it is the sum of the powers
of all received signals.
Unlike RSS, CSI represents fine-grained channel information, which can now
be extracted from several commodity Wi-Fi network interface cards (NIC), e.g., Intel
Wi-Fi Link 5300 NIC [12], the Atheros AR9390 chipset [13], and the Atheros AR9580
chipset [14]. CSI consists of subcarrier-level measurements of orthogonal frequency
division multiplexing (OFDM) channels. It is a more stable representation of chan-
nel characteristics than RSS. Several CSI-based fingerprinting systems have been
proposed and shown to achieve high localization accuracy [15,16]. For example, the
fine-grained indoor fingerprinting system (FIFS) [15] uses a weighted average of
CSI values over multiple antennas. To fully exploit the diversity among the multiple
antennas and subcarriers, DeepFi [16] learns a large amount of CSI data from the
three antennas and 30 subcarriers with an autoencoder. These CSI-based schemes only
use the amplitude information of CSI, since the raw phase information is extremely
random and not directly usable [17].
Recently, for the Intel 5300 NIC in 2.4 GHz, two effective methods have been
proposed to remove the randomness in raw CSI phase data. In [18], the measured
phases from 30 subcarriers are processed with a linear transformation to mitigate
the random phase offsets, which is then employed for passive human-movement
detection. In [17], in addition to the linear transformation, the difference of the
sanitized phases from two antennas is obtained and used for line-of-sight (LOS)
identification. Although both approaches can stabilize the phase information, the
mean value of phase will be zero (i.e., lost) after such processing. This is actually
caused by the firmware design of the Intel 5300 NIC when operating on the 2.4 GHz
band [19]. To address this issue, Phaser [19] proposes to exploit CSI phase in 5 GHz
Wi-Fi. Phaser constructs an AOA pseudospectrum for phase calibration with a single
Intel 5300 NIC. These interesting works motivate us to explore effectively cleansed
phase information for indoor fingerprinting with commodity 5 GHz Wi-Fi.
In this chapter, we investigate the problem of fingerprinting-based indoor local-
ization with commodity 5 GHz Wi-Fi. We first present three hypotheses on CSI
amplitude and phase information for 5 GHz OFDM channels. First, the average
amplitude over two antennas is more stable over time for a fixed location than that
from a single antenna as well as RSS. Second, the phase difference of CSI values
from two antennas in 5 GHz is highly stable. Due to the firmware design of Intel
5300 NIC, the phase differences of consecutively received packets form four clusters
when operating in 2.4 GHz. Such ambiguity makes measured phase difference unus-
able. However, we find this phenomenon does not exist in the 5 GHz band, where
all the phase differences concentrate around one value. We further design a sim-
ple multi-radio hardware for phase calibration which is different from the technique
in [19] that uses AOA pseudospectrum search with a high computation complexity,
to calibrate phase in single Intel 5300 NIC. As a result, the randomness from the
time and frequency difference between the transmitter and receiver, and the unknown
phase offset can all be removed, and stable phase information can be obtained. Third,
the calibrated phase difference in 5 GHz can be translated into AOA with considerable
accuracy when there is a strong LOS component. We validate these hypotheses with
both extensive experiments and simple analysis.
We then design BiLoc, bimodal deep learning for indoor localization with com-
modity 5 GHz Wi-Fi, to utilize the three hypotheses in an indoor fingerprinting
system [20]. In BiLoc, we first extract raw amplitude and phase data from the three
antennas, each with 30 subcarriers. We then obtain bimodal data, including average
amplitudes over pairs of antennas and estimated AOAs, with the calibration procedure
discussed above. In the off-line training stage, we adopt an autoencoder with three
hidden layers to extract the unique channel features hidden in the bimodal data and
propose to use the weights of the deep network to store the extracted features (i.e.,
fingerprints). To reduce the computational complexity, we propose a greedy learn-
ing algorithm to train the deep network in a layer-by-layer manner with a restricted
Boltzmann machine (RBM) model. In the online test stage, bimodal test data is first
collected for a mobile device. Then a Bayesian probability model based on the radial
basis function (RBF) is leveraged for accurate online position estimation.
In the rest of this chapter, preliminaries on deep learning for indoor localization
is introduced Section 10.2. Then, the three hypotheses are given in Section 10.3.
We present the BiLoc system in Section 10.4 and validate its performance in
Section 10.5. Section 10.6 discusses future research problems for indoor localization,
and Section 10.7 concludes this chapter.
Figure 10.1 The architecture of the deep autoencoder neural network, in which the original data is encoded and then decoded to obtain the reconstructed data
The deep autoencoder network can be used to extract features or reduce the dimensionality of data, and it is more powerful than principal-component-analysis-based
methods because of its nonlinear transformations with multiple hidden
layers. Figure 10.1 shows the architecture of the deep autoencoder neural network. For
training, a deep autoencoder neural network has three stages: pretraining,
unrolling, and fine-tuning [24]. In the pretraining stage, each neighboring pair of
layers is considered as an RBM, which is modeled as a bipartite undirected graphical model.
Then, a greedy algorithm is used to train the weights and biases for a stack of RBMs.
In the unrolling stage, the deep autoencoder network is unrolled to obtain the recon-
structed input data. Finally, the fine-tuning phase employs the backpropagation (BP)
algorithm for training the weights in the deep autoencoder network by minimizing
the loss function (i.e., the error).
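As a concrete illustration of the encode-decode structure described above, the following is a minimal sketch of a deep autoencoder, assuming TensorFlow/Keras is available. The layer sizes, sigmoid activations, and mean-squared-error loss are illustrative choices, and the RBM-based pretraining stage is replaced here by direct end-to-end backpropagation, so this is a sketch rather than the chapter's exact training procedure.

# Minimal deep autoencoder sketch (illustrative assumptions; not the chapter's exact setup).
import numpy as np
import tensorflow as tf

def build_autoencoder(input_dim=60, hidden=(150, 100, 50)):
    inputs = tf.keras.Input(shape=(input_dim,))
    x = inputs
    for units in hidden:                        # encoder: input -> ... -> code
        x = tf.keras.layers.Dense(units, activation='sigmoid')(x)
    for units in reversed(hidden[:-1]):         # decoder: mirror of the encoder (unrolled network)
        x = tf.keras.layers.Dense(units, activation='sigmoid')(x)
    outputs = tf.keras.layers.Dense(input_dim, activation='linear')(x)
    return tf.keras.Model(inputs, outputs)

ae = build_autoencoder()
ae.compile(optimizer='adam', loss='mse')        # fine-tuning: minimize the reconstruction error
X = np.random.rand(1000, 60).astype('float32')  # placeholder training data
ae.fit(X, X, epochs=10, batch_size=32, verbose=0)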
The first work that applies a deep autoencoder to indoor localization is
DeepFi [16,25], which is a deep autoencoder network-based indoor fingerprinting
method with CSI amplitudes. For every training location, the deep autoencoder net-
work is trained to obtain a set of weights and biases, which are used as fingerprints
for the corresponding locations. For online test, the true location is estimated based
on the Bayesian scheme. The experimental results show that the mean distance error
in a living-room environment and a laboratory environment is 1.2 and 2.3 m, respectively.
In addition, PhaseFi [26,27] uses calibrated CSI phase and also incorporates a deep
autoencoder network for indoor localization. Moreover, deep
autoencoder networks are used for device-free indoor localization [28,29]. The denois-
ing autoencoder-based indoor localization with Bluetooth Low Energy (BLE) is also
used to provide 3-D localization [30]. In this chapter, we consider deep autoencoder
networks for indoor localization using bimodal CSI data.
The convolutional layer can obtain feature maps within local regions in the pre-
vious layer’s feature maps with linear convolutional filters, which is followed by
nonlinear activation functions. The subsampling layer is to decrease the resolution of
the feature maps by downsampling over a local neighborhood in the feature maps of
the previous layer, which is invariant to distortions in the input data [33]. The feature
maps in the previous layer are pooled over a local temporal neighborhood using the
mean pooling function. Other operations such as the sum or max pooling function
can also be incorporated in the subsampling layer.
After the convolutional and subsampling layers, there is a fully connected layer,
which is a basic neural network with one hidden layer, to train the output data. More-
over, a loss function is used to measure the difference between the true location label
and the output of CNN, where the squared error or cross entropy is used as loss func-
tion for training the weights. Currently, an increasing number of CNN models are
proposed, such as AlexNet [31] and ResNet [34]. AlexNet is a larger and more complex
model, in which max pooling and the rectified linear unit (ReLU) nonlinear activation
function are used [35]. Moreover, dropout regularization is used to handle the
overfitting problem. ResNet was proposed by Microsoft; its residual block includes a
direct path between the input and the output, and the batch-normalization technique
is used to avoid vanishing or exploding gradients. ResNet is a 152-layer
residual-learning framework, which won the ILSVRC 2015 classification
competition [31].
For indoor localization problems, the CiFi [33,36] system leverages the con-
structed images with estimated AOA values with commodity 5 GHz Wi-Fi for indoor
localization. This system demonstrates localization performance that outperforms
several existing schemes, such as FIFS and Horus. Motivated by ResNet,
the ResLoc [37] system uses bimodal CSI tensor data to train a deep residual
sharing-learning model, which achieves the best performance among deep-learning-based
localization methods using CSI. CSI amplitude is also used to construct CSI images for indoor
localization [38]. In addition, input images built from the received signal strength indicator
(RSSI) of Wi-Fi signals are leveraged to train a CNN model [39,40]. CNN has also
been used for TDoA-based localization systems, where it can estimate nonlinearities in
the signal-propagation space and account for multipath effects [41].
(Figure: structure of an LSTM cell, with sigmoid (σ) and tanh gates, cell states Ct−1 and Ct, and hidden states ht−1 and ht.)
and coding scheme to mitigate frequency selective fading. Leveraging the device
driver for off-the-shelf NICs, e.g., the Intel 5300 NIC, we can extract CSI for each
received packet, which is fine-grained physical-layer (PHY) information. CSI reveals
the channel characteristics experienced by the received signal such as the multipath
effect, shadow fading, and distortion.
With OFDM, the Wi-Fi channel at the 5 GHz band can be considered as a nar-
rowband flat fading channel. In the frequency domain, the channel model can be
expressed as
Y = CSI · X + N , (10.1)
where Y and X denote the received and transmitted signal vectors, respectively,
N is the additive white Gaussian noise (AWGN), and CSI represents the channel’s
frequency response, which can be computed from Y and X .
Although a Wi-Fi receiver uses an OFDM system with 56 subcarriers for a
20 MHz channel, the Intel 5300 NIC can report 30 out of 56 subcarriers. The channel
frequency response of subcarrier i, CSIi , is a complex value, that is,
CSIi = Ii + jQi = |CSIi | exp( j∠CSIi ), (10.2)
where Ii and Qi are the in-phase component and quadrature component, respectively;
|CSIi | and ∠CSIi are the amplitude response and phase response of subcarrier i,
respectively.
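As a small illustration of (10.2), the following sketch (assuming Python with NumPy) extracts the amplitude and phase responses from complex CSI values; the array shape is a hypothetical placeholder for the per-antenna, per-subcarrier CSI reported by the tool.

import numpy as np

csi = np.random.randn(3, 30) + 1j * np.random.randn(3, 30)  # placeholder CSI, (antennas, subcarriers)
amplitude = np.abs(csi)                    # |CSI_i| per antenna and subcarrier
phase = np.angle(csi)                      # angle(CSI_i) in radians, wrapped to (-pi, pi]
in_phase, quadrature = csi.real, csi.imag  # I_i and Q_i components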
10.3.3 Hypotheses
We next present three important hypotheses about the CSI data on 5 GHz OFDM chan-
nels, which are demonstrated and tested with our measurement study and theoretical
analysis.
10.3.3.1 Hypothesis 1
The average CSI amplitude value of two adjacent antennas for the 5 GHz OFDM
channel is highly stable for a fixed location.
We find CSI amplitude values exhibit great stability for continuously received
packets at a given location. Figure 10.4 presents the cumulative distribution functions
(CDF) of the standard deviations (STD) of (i) the normalized CSI amplitude averaged
over two adjacent antennas, (ii) the normalized CSI amplitude from a single antenna,
and (iii) the normalized RSS amplitude from a single antenna, for 90 positions. At
each position, 50 consecutive packets are received by the Intel 5300 NIC operating on
the 5 GHz band. It can be seen that 90% of the test positions have an STD below 10%
in the case of averaged CSI amplitudes, while the percentage is 80% for single-antenna
CSI and 70% for single-antenna RSS. Thus, averaging
over two adjacent antennas can make CSI amplitude highly stable for a fixed location
with 5 GHz OFDM channels. We conduct the measurements over a long period of
time, including midnight and business hours. No obvious difference in the stability of
CSI is observed over different times, while RSS values exhibit large variations even
for the same position. This finding motivates us to use average CSI amplitudes of two
adjacent antennas as one of the features of deep learning in the BiLoc design.
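The stability comparison behind Hypothesis 1 can be sketched as follows (Python/NumPy assumed); the amplitude array is a synthetic placeholder, and the per-subcarrier normalization is one reasonable choice rather than the chapter's exact procedure.

import numpy as np

# amp: (packets, antennas, subcarriers) CSI amplitudes collected at one fixed position
amp = np.abs(np.random.randn(50, 3, 30) + 1j * np.random.randn(50, 3, 30))

normalize = lambda a: a / a.mean(axis=0, keepdims=True)   # per-subcarrier normalization
avg_two = 0.5 * (amp[:, 0, :] + amp[:, 1, :])             # average over antennas 1 and 2

std_avg = normalize(avg_two).std(axis=0).mean()           # STD of the averaged amplitude
std_single = normalize(amp[:, 0, :]).std(axis=0).mean()   # STD of a single antenna
print(std_avg, std_single)   # averaging over two antennas tends to give the smaller STD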
Recall that the PDF of the amplitude response of a single antenna is Gaussian
in the high SNR regime. Assuming that the CSI values of the two antennas are i.i.d.
Figure 10.4 CDF of the standard deviations of normalized average CSI amplitude,
a single CSI amplitude, and a single RSS in the 5 GHz OFDM channel
for 90 positions
(true when the antennas are more than a half wavelength apart [17]), the average
CSI amplitude also follows a Gaussian distribution, N(√(|CSI0|2 + σ 2), σ 2/2),
but with a smaller variance. This proves that stability can be improved by averaging
CSI amplitudes over two antennas [47] (as observed in Figure 10.4). We consider the
average CSI amplitudes over two antennas, rather than over three antennas or the CSI
amplitudes from only one antenna, because BiLoc employs bimodal data, i.e., estimated
AOAs and average amplitudes, and this requires the same number of input nodes for
the deep network in both modalities.
10.3.3.2 Hypothesis 2
The difference of CSI phase values between two antennas of the 5 GHz OFDM channel
is highly stable, compared to that of the 2.4 GHz OFDM channel.
Although the CSI phase information is also available from the Intel 5300 NIC,
it is highly random and cannot be directly used for localization, due to noise and
the unsynchronized time and frequency of the transmitter and receiver. Recently, two
useful algorithms are used to remove the randomness in CSI phase. The first approach
is to make a linear transform of the phase values measured from the 30 subcarriers [18].
The other one is to exploit the phase difference between two antennas in 2.4 GHz and
then remove the measured average [17]. Although both methods can stabilize the CSI
phase in consecutive packets, the average phase value they produce is always near
zero, which is different from the real phase value of the received signal.
Switching to the 5 GHz band, we find the phase difference becomes highly
stable. In Figure 10.5, we plot the measured phase differences of the 30 subcarriers
between two antennas for 200 consecutively received packets in the 5 GHz (in blue)
and 2.4 GHz (in red) bands. The phase difference of the 5 GHz channel varies between
[0.5, 1.8], which is considerably more stable than that of the 2.4 GHz channel (varies
between [−π, π ]). To further illustrate this finding, we plot the measured phase
differences on the fifth subcarrier between two antennas using polar coordinates in
Figure 10.6. We find that all the 5 GHz measurements concentrate around 30◦ , while
the 2.4 GHz measurements form four clusters around 0◦ , 90◦ , 180◦ , and 270◦ . We
conjecture that this may be caused by the firmware design of the Intel 5300 NIC when
operating on the 2.4 GHz band, which reports the phase of channel modulo π/2 rather
than 2π on the 5 GHz band [19]. Comparing to the ambiguity in the 2.4 GHz band,
the highly stable phase difference in the 5 GHz band could be very useful for indoor
localization.
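The measurement behind Hypothesis 2 can be sketched as below (Python/NumPy assumed); the complex CSI array is a synthetic placeholder with an assumed (packets, antennas, subcarriers) layout.

import numpy as np

csi = np.random.randn(200, 3, 30) + 1j * np.random.randn(200, 3, 30)  # placeholder CSI

# Per-subcarrier phase difference between antennas 1 and 2; taking the angle of the
# product with the conjugate keeps the result properly wrapped to (-pi, pi].
phase_diff = np.angle(csi[:, 0, :] * np.conj(csi[:, 1, :]))

# On 5 GHz measurements these values cluster tightly around one constant, whereas on
# 2.4 GHz they spread into four clusters separated by pi/2.
print(phase_diff.std(axis=0))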
As in Hypothesis 1, we also provide an analysis to validate the observation from
the experiments.

Figure 10.5 The measured phase differences of the 30 subcarriers between two antennas for 200 consecutively received packets in the 5 GHz (blue) and 2.4 GHz (red) bands

Figure 10.6 The measured phase differences of the fifth subcarrier between two antennas for 200 consecutively received packets in the 5 GHz (blue dots) and 2.4 GHz (red crosses) bands

Let ∠ĈSIi denote the measured phase of subcarrier i, which is given by [14,48]:

∠ĈSIi = ∠CSIi + (λp + λs)mi + λc + β + Z,    (10.4)

where ∠CSIi is the true phase; Z is the measurement noise; β is the initial phase
offset caused by the phase-locked loop; mi is the subcarrier index of subcarrier i;
and λp, λs, and λc are the phase errors caused by the packet boundary detection (PBD), the
sampling frequency offset, and the central frequency offset, respectively [48], which are
expressed by
λp = 2π Δt / N,
λs = 2π ((T′ − T)/T)(Ts/Tu) n,    (10.5)
λc = 2π Δf Ts n,
where Δt is the PBD delay, N is the fast Fourier transform (FFT) size, T and T′ are the
sampling periods of the receiver and the transmitter, respectively, Tu is the length
of the data symbol, Ts is the total length of the data symbol and the guard interval, n
is the sampling-time offset of the current packet, and Δf is the center-frequency
difference between the transmitter and the receiver. Note that we cannot obtain the
exact values of Δt, (T′ − T)/T, n, Δf, and β in (10.4) and (10.5). Moreover, λp, λs,
and λc vary for different packets with different Δt and n. Thus, the true phase ∠CSIi
cannot be derived from the measured phase value.
However, note that the three antennas of the Intel 5300 NIC use the same clock and
the same down-converter frequency. Consequently, the measured phases of subcarrier
i from two antennas have identical packet detection delay, sampling periods, and
frequency differences (and the same mi ) [19]. Thus the measured phase difference on
subcarrier i between two antennas can be approximated as

Δ∠ĈSIi = Δ∠CSIi + Δβ + ΔZ,    (10.6)

where Δ∠CSIi is the true phase difference on subcarrier i, Δβ is the unknown difference
in phase offsets, which is in fact a constant [19], and ΔZ is the noise difference.
We find that Δ∠ĈSIi is stable across different packets because, in (10.6), Δt and
n are cancelled out.
In the high SNR regime, the PDF of the phase response of subcarrier i for each
of the antennas is N(0, (σ/|CSI0|)2). Since the phase responses are independent, the
measured phase difference on subcarrier i is also Gaussian, i.e., N(Δβ, 2σ 2(1 +
1/|CSI0|2)). Note that although the variance is higher compared with that of the true
phase response, the uncertainty from the time and frequency differences is removed,
leading to much more stable measurements (as shown in Figure 10.6).
10.3.3.3 Hypothesis 3
The calibrated phase difference in 5 GHz can be translated into the AOA with
considerable accuracy when there is a strong LOS component.
The measured phase difference on subcarrier i can be translated into an estimate of the AOA, as

θ̂ = arcsin(Δ∠ĈSIi λ/(2πd)),    (10.7)

where λ is the wavelength and d is the distance between the two antennas (set to
d = 0.5λ in our experiments). Although the measured phase difference Δ∠ĈSIi is
highly stable, we still wish to remove the unknown phase offset difference Δβ to
further reduce the AOA estimation error. For commodity Wi-Fi devices, the only
existing approach for a single NIC, to the best of our knowledge, is to search for Δβ
within an AOA pseudospectrum in the range of [−π, π], which, however, has a high
time complexity [19].
In this chapter, we design a simple method to remove the unknown phase offset
difference Δβ using two Intel 5300 NICs. As in Figure 10.7, we use one Intel 5300 NIC
as transmitter and the other as receiver, while a signal splitter is used to route signal
from antenna 1 of the transmitter to antennas 1 and 2 of the receiver through cables of the same length.

Figure 10.7 The multi-radio hardware design for calibrating the unknown phase offset difference Δβ

Figure 10.8 The estimated AOAs from the 30 subcarriers using the MUSIC algorithm, while the real AOA is 14°
Since the two antennas receive the same signal, the true phase
difference Δ∠CSIi on subcarrier i is zero. We can thus obtain Δβ as the measured
phase offset difference between antennas 1 and 2 of the receiver. We also use the
same method to calibrate antennas 2 and 3 of the receiver, to obtain the unknown
phase offset difference between them as well. We find that the unknown phase offset
difference is relatively stable over time.
Having calibrated the unknown phase offset differences for the three antennas,
we then use the MUSIC algorithm for AOA estimation [49]. In Figure 10.8, the AOA
estimation using MUSIC with the calibrated phase information for the 30 subcarriers
is plotted for a high SNR signal with a known incoming direction of 14◦ . We can see
that the peak occurs at around 20◦ in Figure 10.8, indicating an AOA estimation error
of about 6◦ .
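For reference, a compact MUSIC sketch for a uniform linear array is given below (Python/NumPy assumed). It is a generic implementation that takes any matrix of complex snapshots; the chapter instead feeds the calibrated per-subcarrier phases of the three-antenna array into MUSIC, so the synthetic data preparation here is only illustrative.

import numpy as np

def music_aoa(X, d_over_lambda=0.5, n_sources=1):
    """MUSIC pseudospectrum peak for a uniform linear array.
    X: complex snapshots of shape (n_antennas, n_snapshots)."""
    M, N = X.shape
    R = X @ X.conj().T / N                      # sample covariance matrix
    _, eigvecs = np.linalg.eigh(R)              # eigenvalues in ascending order
    En = eigvecs[:, :M - n_sources]             # noise subspace
    grid = np.linspace(-90.0, 90.0, 361)
    spectrum = np.empty_like(grid)
    for k, theta in enumerate(grid):
        a = np.exp(1j * 2 * np.pi * d_over_lambda
                   * np.arange(M) * np.sin(np.deg2rad(theta)))   # steering vector
        spectrum[k] = 1.0 / (np.linalg.norm(En.conj().T @ a) ** 2)
    return grid[np.argmax(spectrum)]

# Synthetic example: one source at 14 degrees, 3 antennas, 100 noisy snapshots.
rng = np.random.default_rng(1)
steer = np.exp(1j * 2 * np.pi * 0.5 * np.arange(3) * np.sin(np.deg2rad(14.0)))
X = np.outer(steer, rng.standard_normal(100) + 1j * rng.standard_normal(100))
X += 0.1 * (rng.standard_normal(X.shape) + 1j * rng.standard_normal(X.shape))
print(music_aoa(X))                             # close to 14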
We can obtain the true incoming angle with MUSIC when the LOS component is
strong. To deal with the case with strong NLOS paths (typical in indoor environments),
we adopt a deep network with three hidden layers to learn the estimated AOAs and the
average amplitudes of adjacent antenna pairs as fingerprints for indoor localization.
As input to the deep network, the estimated AOA is obtained as follows:
θ̂ = arcsin((Δ∠ĈSIi − Δβ)λ/(2πd)) + π/2,    (10.8)

where Δβ is measured with the proposed multi-radio hardware experiment. The
estimated AOA is in the range of [0, π].
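A small sketch of (10.7)-(10.8) is given below (Python/NumPy assumed); the numerical inputs are hypothetical, and the clipping of the arcsine argument is an added safeguard against measurement noise rather than part of the original formulation.

import numpy as np

def estimate_aoa(delta_phase, delta_beta, lam, d=None):
    """Map a calibrated phase difference to an AOA estimate in [0, pi]."""
    d = 0.5 * lam if d is None else d            # antenna spacing, d = 0.5*lambda here
    s = (delta_phase - delta_beta) * lam / (2 * np.pi * d)
    s = np.clip(s, -1.0, 1.0)                    # guard against noise pushing |s| > 1
    return np.arcsin(s) + np.pi / 2              # shift from [-pi/2, pi/2] to [0, pi]

lam = 3e8 / 5.32e9                               # wavelength of an assumed 5.32 GHz channel
print(np.degrees(estimate_aoa(0.8, 0.3, lam)))   # hypothetical measured values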
5 GHz band. The Intel 5300 NIC has three antennas; at each antenna, we can read
CSI data from 30 subcarriers. Thus, we can collect 90 CSI values for every received
packet. We then calibrate the phase information of the received CSI data using our
multi-radio hardware design (see Figure 10.7). Both the estimated AOAs and average
amplitudes of two adjacent antennas are used as location features for building the
fingerprint database.
A unique feature of BiLoc is its bimodal design. With the three receiving antennas,
we can obtain two groups of data: (i) 30 estimated AOAs and 30 average amplitudes
from antennas 1 and 2 and (ii) that from antennas 2 and 3. BiLoc utilizes estimated
AOAs and average amplitudes for indoor fingerprinting for two main reasons. First,
these two types of CSI data are highly stable for any given position. Second, they are
usually complementary to each other under some indoor circumstances. For example,
when a signal is blocked, the average amplitude of the signal will be significantly
weakened, but the estimated AOA becomes more effective. On the other hand, when
the NLOS components are stronger than the LOS component, the average amplitude
will help to improve the localization accuracy.
Another unique characteristic of BiLoc is the use of deep learning to produce
feature-based fingerprints from the bimodal data in the off-line training stage, which
is quite different from the traditional approach of storing the measured data as fin-
gerprints. Specifically, we use the weights in the deep network to represent the
features-based fingerprints for every position. By obtaining the optimal weights with
the bimodal data on estimated AOAs and average amplitudes, we can establish a
bimodal fingerprint database for the training positions. The third feature of BiLoc
is the probabilistic data fusion approach for location estimation based on received
bimodal data in the online test stage.
Because of the large number of nodes and the complex model structure, it is
difficult to find the optimal weights for the input data with the maximum likelihood
method. To reduce the computational complexity, BiLoc utilizes a greedy learning
algorithm to train the weights layer by layer based on a stack of RBMs [50]. We
consider an RBM as a bipartite undirected graphical model [50] with joint distribution
Pr(h^{i−1}, h^i), as

Pr(h^{i−1}, h^i) = exp(−E(h^{i−1}, h^i)) / Σ_{h^{i−1}} Σ_{h^i} exp(−E(h^{i−1}, h^i)),    (10.10)
where E(h^{i−1}, h^i) denotes the free energy between layer (i − 1) and layer i, which is
given by

E(h^{i−1}, h^i) = −b^{i−1} h^{i−1} − b^i h^i − h^{i−1} W^i h^i,    (10.11)

where b^{i−1} and b^i are the biases for the units of layer (i − 1) and of layer i,
respectively. To obtain the joint distribution Pr(h^{i−1}, h^i), the CD-1 algorithm is used
to approximate it as [50]:

Pr(h^{i−1} | h^i) = ∏_{j=1}^{K_{i−1}} Pr(h_j^{i−1} | h^i),
Pr(h^i | h^{i−1}) = ∏_{j=1}^{K_i} Pr(h_j^i | h^{i−1}),    (10.12)
where Pr(h_j^{i−1} | h^i) and Pr(h_j^i | h^{i−1}) are given by the sigmoid belief network as follows:

Pr(h_j^{i−1} | h^i) = (1 + exp(−b_j^{i−1} − Σ_{t=1}^{K_i} W_{j,t}^i h_t^i))^{−1},
Pr(h_j^i | h^{i−1}) = (1 + exp(−b_j^i − Σ_{t=1}^{K_{i−1}} W_{j,t}^i h_t^{i−1}))^{−1}.    (10.13)
We propose a greedy algorithm to train the weights and biases for a stack of
RBMs. First, with the CD-1 method, we use the input data to train the parameters
{b0 , b1 , W1 } of the first layer RBM. Then, the parameters {b0 , W1 } are frozen, and we
sample from the conditional probability Pr(h1 |h0 ) to train the parameters {b1 , b2 , W2 }
of the second layer RBM. Next, we freeze the parameters {b0 , b1 , W1 , W2 } of the
first and second layers and then sample from the conditional probability Pr(h2 |h1 ) to
train the parameters {b2 , b3 , W3 } of the third layer RBM. In order to train the weights
and biases of each RBM, we use the CD-1 method to approximate them. For the
layer i RBM model, we estimate ĥi−1 by sampling from the conditional probability
Pr(hi−1 |hi ); by sampling from the conditional probability Pr(hi |ĥi−1 ), we can estimate
ĥi . Thus, the parameters are updated as follows:
ΔW^i = ε(h^{i−1}(h^i)^T − ĥ^{i−1}(ĥ^i)^T),
Δb^{i−1} = ε(h^{i−1} − ĥ^{i−1}),    (10.14)
Δb^i = ε(h^i − ĥ^i),

where ε is the learning rate.
Then, the BP algorithm is used to train the weights in the deep network according to the error
between the input data and the reconstructed input data. The optimal weights are
obtained by minimizing the error. In BiLoc, we use estimated AOAs and average
amplitudes as input data and obtain two sets of optimal weights for the bimodal
fingerprint database.
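The layer-wise pretraining step can be sketched as follows (Python/NumPy assumed); the layer sizes, learning rate, and batch handling are illustrative assumptions, and the update follows the CD-1 form of (10.14) in a batch-averaged way.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, eps=0.01):
    """One contrastive-divergence (CD-1) update for a single RBM layer."""
    p_h0 = sigmoid(b_hid + v0 @ W)                      # Pr(h | v0)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)  # sampled hidden state
    p_v1 = sigmoid(b_vis + h0 @ W.T)                    # reconstructed visible layer
    p_h1 = sigmoid(b_hid + p_v1 @ W)                    # hidden probabilities of the reconstruction
    n = v0.shape[0]
    W += eps * (v0.T @ p_h0 - p_v1.T @ p_h1) / n        # weight update
    b_vis += eps * (v0 - p_v1).mean(axis=0)             # visible-bias update
    b_hid += eps * (p_h0 - p_h1).mean(axis=0)           # hidden-bias update
    return W, b_vis, b_hid

# Greedy layer-wise use: train layer 1 on the input data, then feed its hidden
# probabilities as the "visible" data of layer 2, and so on.
v = rng.random((32, 60))                                # placeholder bimodal input batch
W = 0.01 * rng.standard_normal((60, 150))
b_vis, b_hid = np.zeros(60), np.zeros(150)
W, b_vis, b_hid = cd1_step(v, W, b_vis, b_hid)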
where N is the number of reference locations, li is the ith reference location in the
bimodal fingerprint database, and Pr(li ) is the prior probability that the mobile device
is considered to be at the reference location li . Without loss of generality, we assume
that Pr(li) is uniformly distributed. The posterior probability Pr(li | v1, v2) becomes:

Pr(li | v1, v2) = Pr(v1, v2 | li) / Σ_{j=1}^{N} Pr(v1, v2 | lj).    (10.16)
where v̂1 and v̂2 are the reconstructed average amplitude and reconstructed AOA,
respectively; σ1 and σ2 are the variance of the average amplitude and estimated AOA,
respectively; η1 and η2 are the parameters of the variance of the average amplitude
and estimated AOA, respectively; and ρ is the ratio for the bimodal data.
In (10.17), the average amplitudes v1 and the estimated AOAs v2 serve as the input
to the deep network, where different input nodes correspond to different CSI channels.
Then, using the test data v1 and v2, we compute the reconstructed average amplitudes
v̂1 and the reconstructed AOAs v̂2 based on fingerprint databases 1 and 2, respectively,
which are then used to compute the likelihood function Pr(v1, v2 | li).
The location of the mobile device can be finally estimated as a weighted average
of all the reference locations, which is given by
l̂ = Σ_{i=1}^{N} Pr(li | v1, v2) · li.    (10.18)
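The online estimation of (10.16) and (10.18) can be sketched as follows (Python/NumPy assumed). The radial-basis-function likelihood used here is a generic stand-in for (10.17), and reconstruct_1/reconstruct_2, which return the reconstructions from the two fingerprint databases, are hypothetical helpers.

import numpy as np

def estimate_location(v1, v2, locations, reconstruct_1, reconstruct_2,
                      rho=0.5, sigma1=1.0, sigma2=1.0):
    """Weighted-average location estimate from bimodal test data."""
    likelihood = np.empty(len(locations))
    for i in range(len(locations)):
        v1_hat = reconstruct_1(i, v1)              # reconstructed average amplitudes
        v2_hat = reconstruct_2(i, v2)              # reconstructed estimated AOAs
        likelihood[i] = (rho * np.exp(-np.linalg.norm(v1 - v1_hat) / sigma1)
                         + (1 - rho) * np.exp(-np.linalg.norm(v2 - v2_hat) / sigma2))
    posterior = likelihood / likelihood.sum()               # (10.16) with a uniform prior
    return posterior @ np.asarray(locations, dtype=float)   # (10.18): weighted average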
Figure 10.10 Layout of the computer laboratory: training positions are marked as
red squares and testing positions are marked as green dots
between two adjacent training positions is 1.8 m. The single access point is put close
to the center of the room. We collect bimodal data from 1,000 packet receptions for
each training position, and from 25 packet receptions for each test position. The deep
network used for this scenario is configured as {K1 = 150, K2 = 100, K3 = 50}. Also,
the ratio ρ for the bimodal data is set as 0.5.
Corridor: This is a 2.4 × 24 m2 corridor, as shown in Figure 10.11. In this scenario,
the AP is placed at one end of the corridor, and there are plenty of LOS paths. Ten
training positions (red squares) and ten test positions (green dots) are arranged along
a straight line. The distance between two adjacent training positions is also 1.8 m. We
also collect bimodal data from 1,000 packets for each training position and from 25
packets for each test position. The deep network used for this scenario is configured
as {K1 = 150, K2 = 100, K3 = 50}. Also, the ratio ρ for the bimodal data is set as 0.1.
Figure 10.11 Layout of the corridor: training positions are marked as red squares
and testing positions are marked as green dots
Table 10.1 Mean/STD error and execution time of the laboratory experiment
Algorithm Mean error (m) Std. dev. (m) Mean execution time (s)
Table 10.2 Mean/STD errors and execution time of the corridor experiment
Algorithm Mean error (m) Std. dev. (m) Mean execution time (s)
environment, BiLoc achieves a mean error of 1.5743 m and an STD error of 0.8312 m
across the 15 test points. In the corridor experiment, because only one access point is
used for this larger space, BiLoc achieves a mean error of 2.1501 m and an STD error
of 1.5420 m across the ten test points. BiLoc outperforms the other three benchmark
schemes with the smallest mean error, as well as with the smallest STD error, i.e.,
being the most stable scheme in both scenarios. We also compare the online test time
of all the schemes. Due to the use of bimodal data and the deep network, the mean
execution time of BiLoc is the highest among the four schemes. However, the mean
execution time is 0.6653 s for the laboratory case and 0.5440 s for the corridor case,
which is fast enough for most indoor-localization applications.
Figure 10.12 presents the CDF of distance errors of the four schemes in the
laboratory environment. In this complex propagation environment, BiLoc has 100%
of the test positions with an error under 2.8 m, while DeepFi, FIFS, and Horus
have about 72%, 52%, and 45% of the test positions with an error under 2.8 m,
respectively. For a much smaller error threshold of 1.5 m, the percentages of test positions
with a smaller error are 60%, 45%, 15%, and 5% for BiLoc, DeepFi, FIFS, and Horus,
respectively. BiLoc achieves the highest precision among the four schemes, due to
the use of bimodal CSI data (i.e., average amplitudes and estimated AOAs). In fact,
when the amplitude of a signal is strongly affected in the laboratory environment,
BiLoc can utilize the estimated AOA to mitigate this effect. However, the other
schemes, which are based solely on CSI or RSS amplitudes, will be affected.
Figure 10.13 presents the CDF of distance errors of the four schemes for the
corridor scenario. Only one access point is used at one end for this 24 m long corridor,
making it hard to estimate the location of the mobile device. For BiLoc, more than
90% of the test positions have an error under 4 m, while DeepFi, FIFS, and Horus have
about 70%, 60%, and 50% of the test positions with an error under 4 m, respectively.
For a tighter 2 m error threshold, BiLoc has 60% of the test positions with an error
below this threshold, while it is 40% for the other three schemes.

Figure 10.12 CDF of localization errors in 5 GHz for the laboratory experiment

Figure 10.13 CDF of localization errors in 5 GHz for the corridor experiment

For the corridor scenario, BiLoc mainly utilizes the average amplitudes of CSI data, because the
estimated AOAs are similar for all the training/test positions (recall that they are
aligned along a straight line with the access point at one end). This is a challenging
scenario for differentiating different test points and the BiLoc mean error is 0.5758 m
higher than that of the laboratory scenario.
Figure 10.14 CDF of localization errors in 5 and 2.4 GHz for the laboratory
experiment
Figure 10.15 Mean localization errors versus parameter, ρ, for the laboratory and
corridor experiments
Figure 10.15 presents the mean localization errors for increasing ρ for the lab-
oratory and corridor experiments. In the laboratory experiment, when ρ is increased
from 0 to 0.3, the mean error decreases from 2.6 to 1.5 m. Furthermore, the mean
error remains around 1.5 m for ρ ∈ [0.3, 0.7], and then increases from 1.5 to 2 m
when ρ is increased from 0.6 to 1. Therefore, BiLoc achieves its minimum mean
error for ρ ∈ [0.3, 0.7], indicating that both average amplitudes and estimated AOAs
are useful for accurate location estimation. Moreover, BiLoc has higher localization
accuracy, with a mean error of 1.5 m, compared with using a single modality, i.e., a mean
error of 2.6 m with only the estimated AOAs or 2.0 m with only the average amplitudes.
In the corridor experiment, we can see that the mean error remains around 2.1 m
when ρ is increased from 0 to 0.1. When ρ is further increased from 0.1 to 1, the
mean error keeps on increasing from 2.1 to about 4.3 m. Clearly, in the corridor
experiment, the estimated AOAs provide similar characteristics for deep learning and
are not useful for distinguishing the positions. Therefore, BiLoc should mainly use
the average amplitudes of CSI data for better accuracy. These experiments provide
some useful guidelines on setting the ρ value for different indoor environments.
Owing to the limited resolution of Wi-Fi signals, using only Wi-Fi RSS values cannot
achieve good performance at closely spaced locations, whereas magnetic sensor data at
such positions can differ greatly. LSTM can effectively fuse them for indoor localization [43]. In addition,
an integrated CNN and LSTM model can be used for Wi-Fi RSS or CSI image data,
which can be easily created from different access points or different subcarriers. In
fact, the LSTM model can be combined with other deep-learning models, such as the
autoencoder, GAN, deep reinforcement learning, and Bayesian models, for different
localization problems such as radio-map construction, device calibration, and environment
change. For sensor-data fusion for indoor localization, the different sensor data sources
should be normalized and aligned [23].
10.7 Conclusions
In this chapter, we proposed a bimodal deep-learning system for fingerprinting-based
indoor localization with 5 GHz commodity Wi-Fi NICs. First, the state-of-the-art
deep-learning techniques including deep autoencoder network, CNN, and LSTM
were introduced. We then extracted and calibrated CSI data to obtain bimodal CSI
data, including average amplitudes and estimated AOAs, which were used in both
the off-line and online stages. The proposed scheme was validated with extensive
experiments. We concluded this chapter with a discussion of future directions and
challenges for indoor localization problems using deep learning.
Acknowledgments
This work is supported in part by the US NSF under Grants ACI-1642133 and CNS-
1702957, and by the Wireless Engineering Research and Education Center (WEREC)
at Auburn University.
References
[1] Wang Y, Liu J, Chen Y, et al. E-eyes: Device-free location-oriented activity
identification using fine-grained WiFi signatures. In: Proc. ACM Mobicom’14.
Maui, HI; 2014. p. 617–628.
[2] Zhang D, Zhao S, Yang LT, et al. NextMe: Localization using cellular
traces in internet of things. IEEE Transactions on Industrial Informatics.
2015;11(2):302–312.
[3] Derr K, and Manic M. Wireless sensor networks node localization for various
industry problems. IEEE Transactions on Industrial Informatics. 2015;11(3):
752–762.
[4] Abu-Mahfouz A, and Hancke GP. Distance bounding: A practical secu-
rity solution for real-time location systems. IEEE Transactions on Industrial
Informatics. 2013;9(1):16–27.
[5] Pak J, Ahn C, Shmaliy Y, et al. Improving reliability of particle filter-based
localization in wireless sensor networks via hybrid particle/FIR filtering. IEEE
Transactions on Industrial Informatics. 2015;11(5):1089–1098.
[6] Ivanov S, and Nett E. Localization-based radio model calibration for fault-
tolerant wireless mesh networks. IEEE Transactions on Industrial Informatics.
2013;9(1):246–253.
[7] Lee S, Kim B, Kim H, et al. Inertial sensor-based indoor pedestrian localiza-
tion with minimum 802.15.4a configuration. IEEE Transactions on Industrial
Informatics. 2011;7(3):455–466.
[8] Wu B, and Jen C. Particle filter based radio localization for mobile robots in the
environments with low-density WLAN APs. IEEE Transactions on Industrial
Electronics. 2014;61(12):6860–6870.
[9] Liu H, Darabi H, Banerjee P, et al. Survey of wireless indoor positioning
techniques and systems. IEEETransactions on Systems, Man, and Cybernetics,
Part C. 2007;37(6):1067–1080.
[10] Bahl P, and Padmanabhan VN. Radar: An in-building RF-based user location
and tracking system. In: Proc. IEEE INFOCOM’00. Tel Aviv, Israel; 2000.
p. 775–784.
Deep learning for indoor localization based on bimodal CSI data 367
[11] Youssef M, and Agrawala A. The Horus WLAN location determination system.
In: Proc. ACM MobiSys’05. Seattle, WA; 2005. p. 205–218.
[12] Halperin D, Hu WJ, Sheth A, et al. Predictable 802.11 packet delivery from
wireless channel measurements. In: Proc. ACM SIGCOMM’10. New Delhi,
India; 2010. p. 159–170.
[13] Sen S, Lee J, Kim KH, and Congdon P. Avoiding multipath to revive inbuild-
ing WiFi localization. In: Proc. ACM MobiSys’13. Taipei, Taiwan; 2013.
p. 249–262.
[14] Xie Y, Li Z, and Li M. Precise power delay profiling with commodity WiFi.
In: Proc. ACM Mobicom’15. Paris, France; 2015. p. 53–64.
[15] Xiao J, Wu K, Yi Y, et al. FIFS: Fine-grained indoor fingerprinting system.
In: Proc. IEEE ICCCN’12. Munich, Germany; 2012. p. 1–7.
[16] Wang X, Gao L, Mao S, et al. DeepFi: Deep learning for indoor fingerprinting
using channel state information. In: Proc. WCNC’15. New Orleans, LA; 2015.
p. 1666–1671.
[17] Wu C, Yang Z, Zhou Z, et al. PhaseU: Real-time LOS identification with WiFi.
In: Proc. IEEE INFOCOM’15. Hong Kong, China; 2015. p. 2038–2046.
[18] Qian K, Wu C, Yang Z, et al. PADS: Passive detection of moving targets with
dynamic speed using PHY layer information. In: Proc. IEEE ICPADS’14.
Hsinchu, Taiwan; 2014. p. 1–8.
[19] Gjengset J, Xiong J, McPhillips G, et al. Phaser: Enabling phased array signal
processing on commodity WiFi access points. In: Proc. ACM Mobicom’14.
Maui, HI; 2014. p. 153–164.
[20] Wang X, Gao L, and Mao S. BiLoc: Bi-modal deep learning for indoor
localization with commodity 5 GHz WiFi. IEEE Access. 2017;5:4209–4220.
[21] Abadi M, Barham P, Chen J, et al. Tensorflow: A system for large-scale machine
learning. In: OSDI. vol. 16; 2016. p. 265–283.
[22] Mohammadi M, Al-Fuqaha A, Sorour S, et al. Deep learning for IoT big data
and streaming analytics: A survey. IEEE Communications Surveys & Tutorials.
2018;20(4):2923–2960.
[23] Wang X, Wang X, and Mao S. RF sensing in the Internet of Things: A gen-
eral deep learning framework. IEEE Communications Magazine. 2018;56(9):
62–67.
[24] Hinton GE, and Salakhutdinov RR. Reducing the dimensionality of data with
neural networks. Science. 2006;313(5786):504–507.
[25] Wang X, Gao L, Mao S, et al. CSI-based fingerprinting for indoor localiza-
tion: A deep learning approach. IEEE Transactions on Vehicular Technology.
2017;66(1):763–776.
[26] Wang X, Gao L, and Mao S. PhaseFi: Phase fingerprinting for indoor local-
ization with a deep learning approach. In: Proc. GLOBECOM’15. San Diego,
CA; 2015.
[27] Wang X, Gao L, and Mao S. CSI phase fingerprinting for indoor localization
with a deep learning approach. IEEE Internet of Things Journal. 2016;3(6):
1113–1123.
[44] Yang W, Wang X, Cao S, et al. Multi-class wheat moisture detection with
5 GHz Wi-Fi: A deep LSTM approach. In: Proc. ICCCN 2018. Hangzhou,
China; 2018.
[45] Wang Y, Shen Y, Mao S, et al. LASSO & LSTM integrated temporal model
for short-term solar intensity forecasting. IEEE Internet of Things Journal. In
press.
[46] Akbar MB, Taylor DG, and Durgin GD. Amplitude and phase difference esti-
mation bounds for multisensor based tracking of RFID tags. In: Proc. IEEE
RFID’15. San Diego, CA; 2015. p. 105–112.
[47] Kleisouris K, Chen Y, Yang J, et al. The impact of using multiple antennas on
wireless localization. In: Proc. IEEE SECON’08. San Francisco, CA; 2008.
p. 55–63.
[48] Speth M, Fechtel S, Fock G, et al. Optimum receiver design for wireless broad-
band systems using OFDM—Part I. IEEE Transactions on Communications.
1999;47(11):1668–1677.
[49] Schmidt R. Multiple emitter location and signal parameter estimation. IEEE
Transactions on Antennas and Propagation. 1986;34(3):276–280.
[50] Bengio Y, Lamblin P, Popovici D, et al. Greedy layer-wise training of deep
networks. In: Proc. Adv. Neural Inform. Proc. Syst. 19. Vancouver, Canada;
2007. p. 153–160.
[51] Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep
reinforcement learning. Nature. 2015;518(7540):529.
[52] Shi J, Chen J, Zhu J, et al. ZhuSuan: A library for Bayesian deep learning.
arXiv preprint arXiv:170905870. 2017.
[53] Wang X, Wang X, Mao S, et al. DeepMap: Deep Gaussian process for indoor
radio map construction and location estimation. In: Proc. IEEE GLOBECOM
2018. Abu Dhabi, United Arab Emirates; 2018.
[54] Han S, Liu X, Mao H, et al. EIE: Efficient inference engine on compressed
deep neural network. In: International Conference on Computer Architecture
(ISCA); 2016.
[55] Shokri R, Theodorakopoulos G, Troncoso C, et al. Protecting location pri-
vacy: optimal strategy against localization attacks. In: Proceedings of the 2012
ACM conference on Computer and communications security. ACM; 2012.
p. 617–627.
[56] Li T, Chen Y, Zhang R, et al. Secure crowd-sourced indoor positioning systems.
In: IEEE INFOCOM’18; 2018.
Chapter 11
Reinforcement-learning-based wireless
resource allocation
Rui Wang1
1 Department of Electrical and Electronic Engineering, The Southern University of Science and Technology, China
problem. Mathematically, these kinds of problems are not well defined. They may refer
to some systems with unknown random behavior. For example, the wireless trans-
mitter wants to make sure that the average received signal-to-interference-plus-noise ratio
(SINR) is above a certain quality level; however, the interference level at the receiver
is hard to predict without its statistics. Clearly, this problem cannot be solved unless
more information can be collected. In the transmission protocol design, the receiver
can estimate the receiving interference level and report it to the transmitter periodi-
cally, so that the transmitter can adjust its power and guarantee an acceptable average
SINR level. Hence, the procedure of problem-solving includes not only calculation
but also system observation. Stochastic approximation is such an online learning and
adapting procedure, which collects the information from each observation and finally
converges to the solution.
f (x) = 0, (11.1)
Provided the expression of f(x), this problem may be solved analytically. For
example, x = log10 a when f(x) = 10^x − a and a is a positive constant. Nevertheless,
when an explicit expression of x cannot be derived, the following iterative algorithm is useful:

xn+1 = xn − γn+1 f(xn),  n = 0, 1, 2, . . .
Strictly speaking, the solution for Problem 11.1 may not be unique; the above
algorithm is to find one feasible solution if it exists. There are a number of choices
Reinforcement-learning-based wireless resource allocation 373
on the step size {γn | n = 1, 2, . . .}. For Newton's method (also known as the
Newton–Raphson method), the step size is γn = 1/f′(xn), where f′(xn) is the first-order
derivative of f(x) at x = xn. For the case where f′(xn) cannot be obtained, a more
general choice of step size is the harmonic series γn = 1/n.

Figure 11.1 Block diagram for the iterative algorithm of Problem 11.1
An intuitive explanation on using the harmonic series as iteration step size is
provided below:
● Note that this series is monotonically decreasing. When xn is close to the solution, a smaller step size is better for fine adjustment.
● Note that Σ_{n=1}^{+∞} (1/n) = +∞, so the accumulated update −γn+1 f(x) is not negligible as long as f(x) ≠ 0. Hence, the algorithm can drive xn to the solution of f(x) = 0.
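A small numerical sketch of this iteration is given below (Python assumed); the two test functions and starting points are illustrative choices, with the Newton step shown on the log10 example and the harmonic step on a bounded function.

import math

def newton(f, df, x0, n_iter=50):
    x = x0
    for _ in range(n_iter):
        x = x - f(x) / df(x)               # step size gamma_n = 1 / f'(x_n)
    return x

def harmonic(f, x0, n_iter=100000):
    x = x0
    for n in range(1, n_iter + 1):
        x = x - (1.0 / n) * f(x)           # harmonic step size gamma_n = 1 / n
    return x

a = 3.0
print(newton(lambda x: 10**x - a, lambda x: math.log(10) * 10**x, x0=0.0))  # ~log10(3)
print(harmonic(lambda x: math.tanh(x - 0.5), x0=3.0))                       # ~0.5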
A block diagram of the iterative algorithm is illustrated in Figure 11.1, where f (x)
may be an observation of a certain system with input x. A controller, with the objective
of f(x) = 0, collects the observation of f(x) and updates the value of x.
where p(y) is the PDF of random variable Y . The fixed-point problem to be solved
becomes.
The condition (11.6) guarantees that the value of f (x) can be used to update
the variable x: x should be decreased when f (x) is positive, and vice versa. The
condition (11.7) assures that each realization f (x, Y ) can be adopted to evaluate its
expectation f (x). According to the method of stochastic approximation, the solution
of Problem 11.2 is described below.
Theorem 11.1. Let en = E[(xn − θ)2]. If γn = 1/n and the conditions (11.6) and (11.7)
are satisfied, then

lim_{n→+∞} en = 0.
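A small sketch illustrating Theorem 11.1 is given below (Python assumed). The choice f(x, Y) = x − Y is an illustrative one, so that the iteration learns the unknown mean θ of Y from one noisy observation per step, much as a transmitter could track an unknown average interference level online.

import random

def robbins_monro(sample_y, x0=0.0, n_iter=100000):
    x = x0
    for n in range(1, n_iter + 1):
        y = sample_y()                    # one noisy observation Y_n of the system
        x = x - (1.0 / n) * (x - y)       # update with f(x_n, Y_n) = x_n - Y_n
    return x

theta = 2.0                               # true (unknown) mean of Y
print(robbins_monro(lambda: random.gauss(theta, 1.0)))   # converges to about 2.0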
Please refer to the Theorem 11.1 of [1] for the rigorous proof. Some insights on
the convergence property are provided here. In fact, f (xn , Yn ) can be treated as one
estimation of f (xn ) with estimation error Zn . Thus:
f (xn , Yn ) = f (xn ) + Zn , ∀n. (11.9)
Since E[f(xn, Yn)] = f(xn), we know that E[Zn] = 0. From the iterative update equation
(11.8), it can be derived that
x1 = x0 − γ1 f(x0, Y0) = x0 − [f(x0) + Z0],
x2 = x1 − γ2 f(x1, Y1) = x0 − [f(x0) + Z0] − (1/2)[f(x1) + Z1],
· · ·
xn = xn−1 − γn f(xn−1, Yn−1) = x0 − Σ_{i=0}^{n−1} (1/(i+1)) f(xi) − Σ_{i=0}^{n−1} (1/(i+1)) Zi
   = xk − Σ_{i=k}^{n−1} (1/(i+1)) f(xi) − Σ_{i=k}^{n−1} (1/(i+1)) Zi,    (11.10)

where xk = x0 − Σ_{i=0}^{k−1} (1/(i+1)) [f(xi) + Zi].
Hence, the convergence of {xn } is discussed as follows:
n−1
1. The last term of (11.10), i=k (1/(i + 1))Zi , can be treated as the noise of
iteration. Because without it, the iteration becomes:
n−1
1
xn = xk − f (xi ), (11.11)
i=k
i+1
These types of problems aim at finding the optimal values of some transmission or
receiving parameters for each time slot (i.e., action). It implies that there is a sched-
uler, who observes the system state in each time slot, solves the above optimization
problem, and uses the solution in transmission. In many of such optimization prob-
lems, the optimization action in one time slot does not affect that of the followings.
Hence, we shall refer to this type of problems as the “single stage” optimization in the
remaining of this chapter. The following is an example of single stage optimization
formulation.
Example 11.1 (Multi-carrier power allocation). Suppose that there is one point-to-point OFDM link with NF subcarriers, and their channel gains in one certain time slot are denoted by {hi | i = 1, 2, . . . , NF}. Let pi (i = 1, 2, . . . , NF) be the transmission power on the ith subcarrier. One typical power allocation problem is to determine the transmission power on each subcarrier {pi | i = 1, 2, . . . , NF} such that the overall throughput is maximized, which can be formulated as follows:
● System state: {hi | i = 1, 2, . . . , NF}.
● Action: {pi | i = 1, 2, . . . , NF}.
● Objective: ∑_{i=1}^{NF} log2(1 + pi|hi|²/σz²).
● Constraint: ∑_{i=1}^{NF} pi ≤ P, where P is the peak transmission power.
Hence, the overall optimization problem can be written as
max_{{pi | i=1,2,...,NF}} ∑_{i=1}^{NF} log2(1 + pi|hi|²/σz²)
subject to ∑_{i=1}^{NF} pi ≤ P,
where σz² is the noise power. This problem can be solved by the well-known water-filling algorithm. Note that this is a single-stage optimization, since there is no connection between the optimizations in different time slots.
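As an illustration, the following Python sketch solves Example 11.1 by water-filling, finding the water level by bisection; the channel gains, noise power, and power budget at the bottom are arbitrary illustrative values.

```python
import numpy as np

def water_filling(gains, P, noise=1.0, tol=1e-9):
    """Maximize sum_i log2(1 + p_i*|h_i|^2/sigma^2) subject to sum_i p_i <= P.
    gains: array of channel power gains |h_i|^2. Returns the powers p_i."""
    inv_cnr = noise / np.asarray(gains, dtype=float)   # sigma^2 / |h_i|^2
    # Bisection on the water level mu, where p_i = max(0, mu - inv_cnr_i).
    lo, hi = 0.0, inv_cnr.max() + P
    while hi - lo > tol:
        mu = 0.5 * (lo + hi)
        if np.maximum(0.0, mu - inv_cnr).sum() > P:
            hi = mu          # too much power used: lower the water level
        else:
            lo = mu
    return np.maximum(0.0, lo - inv_cnr)

# Illustrative example: four subcarriers and a power budget P = 10.
p = water_filling(gains=[0.2, 1.0, 2.5, 0.8], P=10.0)
print(p, p.sum())
```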
However, the above single-stage formulation is not powerful enough to address all wireless-resource-allocation problems, especially when the scope of optimization is extended to the MAC layer and larger timescales. From the MAC-layer point of view, the BS maintains one queue for each active downlink mobile user. If the BS schedules more transmission resources to one user in a certain time slot, the traffic load for this user in the following time slots is relieved. Hence, the scheduling action in one time slot can affect those of the following ones, and a joint optimization over multiple time slots becomes necessary. The difficulty of such joint optimization is that the wireless channel is time varying. Hence, the scheduler cannot predict the channel of the following time slots and, of course, cannot determine
their scheduling parameters in advance. We shall refer to this type of problem as "multistage" optimization in the remainder of this chapter. Due to the uncertainty of the future system state, its differences from the single-stage optimization problem are as follows:
● Instead of calculating the values of the scheduling action, we should provide a mapping from every possible system state to the corresponding scheduling action, so that the system can work properly in all possible situations. This mapping is called a policy, thus:
Policy : System State → Scheduling Action. (11.14)
● The expectation should be taken over the objective and constraints (if any), since these functions depend on the random system state.
In this section, the MDP is introduced to formulate and solve this kind of multi-stage optimization problem. In order to bring out the basic principle without struggling with the mathematical details, this section only considers the discrete-time MDP with finite state and action spaces, and some mathematical proofs are omitted. Readers who are interested in a comprehensive and rigorous discussion of MDP optimization theory may refer to [2,3].
[Figure: block diagram of an MDP. At the nth stage, the system is in state sn, the controller applies the action an = Ωn(sn), and the cost g(sn, an) is incurred.]
For the multi-carrier power allocation in Example 11.1, the system state and action are
s = {hi | i = 1, 2, . . . , NF}
and
a = {pi | i = 1, 2, . . . , NF}.
In an MDP, the state of a system evolves with time in a Markovian way: given the current system state and control action, the distribution of the next system state is independent of other historical states or actions. In other words, the evolution of the system state is a Markov chain, given the control policy at each stage. Given the current (say the tth stage) system state st and control action at, the distribution of the next system state, Pr(st+1 | st, at), is called the state transition probability or transition kernel.
The expense of a control action is measured by the cost function. The cost of the system at the tth stage is a function of st and at, which is denoted as gt(st, at). In fact, gt can be a random variable given st and at, i.e., gt(st, at, ξt), where the ξt for different t are independent variables. To simplify the elaboration, we focus on the form gt(st, at) in the following discussion. Note that the cost function can be homogeneous or heterogeneous along the timeline. In particular, gt can be different with respect to the stage index t for an MDP with a finite number of stages. However, when the optimization is extended over an infinite number of stages, gt should usually be homogeneous and the stage-index subscript can be removed.
What is optimized in an MDP is not a set of parameter values but a policy, which maps the system state to a control (scheduling) action. Thus, the solution of an MDP is a "function" rather than the values of some parameters. In Example 11.1, the water-filling algorithm can be used to figure out the values of the transmission powers on all the subcarriers. This is a physical-layer point of view. The following example shows that if the scope of resource allocation is extended to the MAC layer, the value optimization turns into a policy optimization, which can be formulated as an MDP.
Pr[The number of arrival packets in one frame = n] = λ^n e^{−λ}/n!. (11.16)
The packet departure in the MAC layer is determined by the physical-layer transmission. Let q(t) be the number of packets waiting to be transmitted in the tth frame; the queue dynamics can be represented by
q(t + 1) = max{0, q(t) − d(t)} + c(t),
where d(t) and c(t) are the numbers of departure and arrival packets in the tth frame.
In the physical layer, let {hi(t) | i = 1, 2, . . . , NF} be the CSI of all the subcarriers in the tth frame, and {pi(t) | i = 1, 2, . . . , NF} be the corresponding power allocation. The number of packets that can be delivered in the tth frame is determined by the power allocation and the CSI.
Clearly, a larger transmission power leads to a larger departure rate of the transmission queue. However, some systems may have the following constraint on the average power consumption, which is particularly relevant for battery-powered devices:
lim_{T→+∞} (1/T) E[∑_{t=1}^{T} ∑_{i=1}^{NF} pi(t)] ≤ P, (11.19)
Let Q̄ be the average queue length at the transmitter; the average delay W̄ is given below according to Little's Law [4]:
W̄ = Q̄/λ = lim_{T→+∞} (1/T) E[∑_{t=1}^{T} q(t)/λ]. (11.21)
min_{{pi(t) | ∀i, t}} W̄
subject to (11.19).
Three forms of MDP formulation will be discussed in the following: first of all, we
introduce the finite-horizon MDP, where the number of stages for joint optimization
is finite. Then, we move to the infinite-horizon MDP, where two cost functions are
considered: namely, average cost and discounted cost.
where at = Ωt(st). The expectation in the above equation is with respect to the randomness of the system state at the first stage and the state transitions given the control actions. Note that with the expectation over the random system state, the overall cost function G depends on the control policies used in all the stages. With the objective of minimizing G, the problem of finite-horizon MDP is described below.
Problem 11.3 (Finite-horizon MDP). Find the optimal control policies for each stage, denoted as {Ω*t | t = 1, 2, . . . , T}, such that the overall cost G is minimized, i.e.:
{Ω*t | t = 1, 2, . . . , T} = arg min_{{Ωt | t=1,2,...,T}} G({Ωt | t = 1, 2, . . . , T}). (11.23)
Hence, in order to obtain the optimal control policy at the tth stage, it is necessary to first figure out the value function Vt+1 for all possible next states. It implies that, before calculating the optimal policy, a backward recursion for evaluating VT, VT−1, . . . , V1 sequentially is required, which is usually referred to as value iteration (VI). The VI algorithm for the finite-horizon MDP is elaborated below.
Note that the value functions are calculated from the last stage to the first one. This is because of the iterative structure depicted in the Bellman equation (11.25). As a summary, the procedure to obtain the optimal control policy for the finite-horizon MDP is described below.
● Off-line VI: Before running the system, the controller should evaluate the value functions for all the possible system states and all the stages. Their values can be stored in a table.
● Online scheduling: When the system is running, the controller should identify the system state, solve the corresponding Bellman equation, and apply the optimal action.
Hence, the solution places both computation and memory requirements on the controller, whose complexities are proportional to the size of the state space |S| and the number of stages T. In the following, we shall demonstrate the application of the finite-horizon MDP via the multi-carrier power allocation problem.
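A minimal tabular sketch of the off-line backward VI and the resulting greedy policy is given below for a generic finite-horizon MDP; the random stage costs and the time-invariant transition kernel are placeholders for the problem-specific gt and Pr(st+1 | st, at), and the terminal cost is assumed to be zero here.

```python
import numpy as np

def finite_horizon_vi(g, P, T):
    """Backward value iteration for a finite-horizon MDP.
    g[t][s, a]  : stage cost at stage t (t = 0..T-1)
    P[s, a, s'] : transition probability (assumed time-invariant here)
    Returns value tables V[t][s] and greedy policies policy[t][s]."""
    S, A = g[0].shape
    V = [np.zeros(S) for _ in range(T + 1)]       # terminal cost assumed zero
    policy = [np.zeros(S, dtype=int) for _ in range(T)]
    for t in range(T - 1, -1, -1):                # last stage -> first stage
        Q = g[t] + P @ V[t + 1]                   # Q[s, a] = g + E[V_{t+1}]
        policy[t] = Q.argmin(axis=1)
        V[t] = Q.min(axis=1)
    return V, policy

# Illustrative random MDP with 4 states, 3 actions, and horizon T = 5.
rng = np.random.default_rng(0)
S, A, T = 4, 3, 5
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
g = [rng.random((S, A)) for _ in range(T)]
V, policy = finite_horizon_vi(g, P, T)
print(V[0], policy[0])
```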
● Control policy: The control action in the tth frame (t = 1, 2, . . . , T) is the power allocation on all the subcarriers, i.e.,
at = {pi(t) | i = 1, 2, . . . , NF}.
Then the control policy in the tth frame, denoted as Ωt, can be written as
Ωt(st) = at, ∀t, st. (11.29)
● Transition kernel: The block-fading channel model is considered, and the CSIs in different frames are i.i.d. Therefore, the transition kernel can be written as
Pr(st+1 | st, at) = Pr({hi(t + 1) | i = 1, 2, . . . , NF}) Pr(q(t + 1) | st, at), (11.30)
where
q(t + 1) = max{0, q(t) − d(t)}, (11.31)
and
d(t) = ∑_{i=1}^{NF} log2(1 + pi(t)|hi(t)|²/σz²) (11.32)
is the number of bits transmitted in the tth frame. Hence, given st and at, q(t + 1) is uniquely determined.
● Cost: In the tth frame (t = 1, 2, . . . , T), the cost of the system is the total power consumption, i.e.:
gt(st, at) = ∑_{i=1}^{NF} pi(t), ∀t = 1, 2, . . . , T. (11.33)
Due to the randomness of the channel fading, a penalty is added in case there are some remaining bits after T frames of transmission (a penalty on the remaining bits in the (T + 1)th frame). Hence, the following cost is introduced for the (T + 1)th frame:
gT+1(sT+1, aT+1) = w q(T + 1), (11.34)
where w is the weight of the penalty and q(T + 1) is the number of remaining bits after T frames. Note that there is no control action in the (T + 1)th frame and aT+1 is introduced simply for notational consistency.
Hence, the overall cost to be minimized is
G = E[∑_{t=1}^{T} ∑_{i=1}^{NF} pi(t) + w q(T + 1)]. (11.35)
The expectation is taken because pi(t) (∀i, t) and q(T + 1) are random due to the channel fading. It can be observed that the choice of the weight w may have a strong impact on the scheduling policy: a small weight leads to a conservative strategy (trying to save energy) and a large weight makes the transmitter aggressive.
The Bellman equation for the above MDP is given in (11.25), where VT+1(sT+1) = w q(T + 1) can be calculated directly. However, because the space of the CSI is continuous and infinite, it is actually impossible to evaluate the other value functions. Since the CSI is i.i.d. across frames, the expectation over the CSI can be taken on both sides of the Bellman equation, which can be written as
V̄t(q(t)) = Eh[Vt(st)]
= Eh[min_{at} {gt(st, at) + ∑_{st+1} Pr(st+1 | st, at) Vt+1(st+1)}]
= Eh[min_{at} {gt(st, at) + ∑_{st+1} Pr({hi(t + 1) | ∀i}) Vt+1(st+1) Pr(q(t + 1) | st, at)}],
where Eh denotes the expectation over the CSI. Therefore, an equivalent Bellman equation with a compressed system state is obtained, whose value function V̄t (t = 1, 2, . . . , T) depends only on the QSI. The dependence on the CSI is removed from the value function, which is mainly due to the i.i.d. nature of the channel. As a result, the state space is reduced from infinite to finite, and a practical solution becomes feasible.
The off-line VI can be applied to compute the new value function V̄t for all states and stages, which is given below:
With the value functions, the optimal online scheduling when the system is running can be derived in each stage according to
Ω*t(st) = a*t = arg min_{at} {gt(st, at) + V̄t+1(q(t + 1))}. (11.38)
Note that in both the off-line VI and the online scheduling, we always need to find the optimal solution of (11.38), which can be obtained as follows. From the principle of the water-filling method, it can be derived that, with a given total transmission power in the tth frame, the optimal power allocation on each subcarrier can be written as
pi(t) = max{0, 1/βt − σz²/|hi(t)|²}, ∀i = 1, 2, . . . , NF, (11.39)
where βt is determined by the total transmission power on all the subcarriers. Therefore, the key to the solution is to find the optimal total transmission power (or βt) for the
tth frame such that the right-hand side of (11.38) is minimized. Noting that the number of information bits delivered in the tth frame is given by (11.32), the optimization problem on the right-hand side of (11.38) can be rewritten as
min_{βt} ∑_{i=1}^{NF} max{0, 1/βt − σz²/|hi(t)|²} + V̄t+1(max{0, q(t) − d(t)}). (11.40)
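A sketch of how the minimization in (11.40) over βt could be carried out numerically is shown below: for each candidate water level 1/βt, compute the water-filling powers (11.39) and the delivered bits from (11.32), and keep the candidate with the smallest objective. The value table Vbar_next, the channel gains, and the flooring of d(t) to an integer are illustrative assumptions.

```python
import numpy as np

def solve_stage(q, gains, Vbar_next, noise=1.0, n_grid=200, mu_max=50.0):
    """Grid search over the water level mu = 1/beta_t for the objective (11.40)."""
    gains = np.asarray(gains, dtype=float)
    best_cost, best_p = np.inf, None
    for mu in np.linspace(0.0, mu_max, n_grid):
        p = np.maximum(0.0, mu - noise / gains)              # powers from (11.39)
        d = int(np.sum(np.log2(1.0 + p * gains / noise)))    # bits sent, cf. (11.32)
        q_next = min(max(0, q - d), len(Vbar_next) - 1)      # remaining bits
        cost = p.sum() + Vbar_next[q_next]
        if cost < best_cost:
            best_cost, best_p = cost, p
    return best_cost, best_p

# Illustrative use: a queue of 20 bits, four subcarriers, and a dummy value
# table in which the residual penalty grows with the number of leftover bits.
Vbar_next = np.linspace(0.0, 10.0, 21)
cost, p = solve_stage(q=20, gains=[0.5, 1.2, 2.0, 0.8], Vbar_next=Vbar_next)
print(cost, p)
```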
[Figure: average cost versus T (frames) for the proposed MDP scheme and water-filling baselines with peak powers of 15, 20, 25, and 30 W.]
less cost than the physical layer approach with various peak power levels. This gain
mainly comes from the cross-frame power scheduling, which exploits the channel
temporal diversity.
where st and at = Ω(st) are the system state and control action of the tth stage, respectively. The expectation is taken over all possible state transitions, and the infinite summation usually converges due to the discount factor γ ∈ (0, 1). As a result, the infinite-horizon MDP can be mathematically described as follows:
Problem 11.4 (Infinite-horizon MDP with discounted cost). Find the optimal control policy, denoted as Ω*, such that the overall cost G is minimized, i.e.:
Ω* = arg min_Ω G(Ω) = arg min_Ω lim_{T→+∞} E[∑_{t=1}^{T} γ^{t−1} g(st, Ω(st))]. (11.42)
In order to derive the solution of the above problem, the following cost-to-go function (value function) is first defined for an arbitrary system state s1 at the first stage:
V(s1) = min_Ω lim_{T→+∞} E[∑_{t=1}^{T} γ^{t−1} g(st, Ω(st)) | s1]
= min_Ω {g(s1, a1) + lim_{T→+∞} E[∑_{t=2}^{T} γ^{t−1} g(st, at) | s1]}. (11.43)
With the definition of V, the system cost starting from the tth stage given the state st at the tth stage can be written as
min_Ω lim_{T→+∞} E[∑_{n=t}^{T} γ^{n−1} g(sn, an) | st]
= γ^{t−1} min_Ω lim_{T→+∞} E[∑_{n=t}^{T} γ^{n−t} g(sn, an) | st]
= γ^{t−1} min_Ω lim_{T′→+∞} E[∑_{k=1}^{T′} γ^{k−1} g(sk+t−1, ak+t−1) | st], (11.44)
where the second equality is due to k = n − t + 1 and T′ = T − t + 1. If we define new notation for the system state by letting s′k = sk+t−1 and a′k = ak+t−1, the minimization in the above equation can be written as
min_Ω lim_{T′→+∞} E[∑_{k=1}^{T′} γ^{k−1} g(sk+t−1, ak+t−1) | st]
= min_Ω lim_{T′→+∞} E[∑_{k=1}^{T′} γ^{k−1} g(s′k, a′k) | s′1]
= V(s′1) = V(st). (11.45)
Hence, it can be derived that
min_Ω lim_{T→+∞} E[∑_{n=t}^{T} γ^{n−1} g(sn, an) | st] = γ^{t−1} V(st). (11.46)
Since the time horizon is infinite, the optimal policy minimizing the system cost from the first stage onwards also minimizes the system cost from any arbitrary stage onwards. Hence, (11.43) can be written as
V(s1) = min_Ω {g(s1, a1) + Es2[lim_{T→+∞} E{si|i=3,4,...}[∑_{t=2}^{T} γ^{t−1} g(st, at) | s2]]}
= min_{Ω(s1)} {g(s1, a1) + Es2[min_Ω lim_{T→+∞} E{si|i=3,4,...}[∑_{t=2}^{T} γ^{t−1} g(st, at) | s2]]}
= min_{Ω(s1)} {g(s1, a1) + γ Es2[V(s2)]}, (11.47)
where the last equality is due to (11.46). Similarly, for an arbitrary system state st at an arbitrary tth stage, the Bellman equation for the infinite-horizon MDP with discounted cost can be written as follows:
V(st) = min_{Ω(st)} {g(st, at) + γ Est+1[V(st+1)]}. (11.48)
Regarding the solution, if the value function has already been calculated, it is straightforward to see that the optimal control action for an arbitrary stage is
Ω*(st) = a*t = arg min_{at} {g(st, at) + γ Est+1[V(st+1)]}, ∀t, st. (11.49)
On the other hand, the value function should satisfy the Bellman equation in (11.48). This is a fixed-point problem with a minimization on the right-hand side, and we have to rely on an iterative algorithm, namely VI. The detailed steps of VI are elaborated below; please refer to [3] for the proof of convergence.
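The following Python sketch shows the VI fixed-point iteration for a generic discounted-cost MDP with finite state and action spaces, together with the policy extraction of (11.49); the random cost and transition kernel are placeholders, and the sweep stops once the update becomes negligible.

```python
import numpy as np

def value_iteration(g, P, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Solve V(s) = min_a [ g(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    V = np.zeros(g.shape[0])
    for _ in range(max_iter):
        V_new = (g + gamma * (P @ V)).min(axis=1)   # one Bellman sweep
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = (g + gamma * (P @ V)).argmin(axis=1)   # greedy policy, cf. (11.49)
    return V, policy

# Illustrative random MDP with 5 states and 3 actions.
rng = np.random.default_rng(1)
S, A = 5, 3
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
g = rng.random((S, A))
V, policy = value_iteration(g, P)
print(V, policy)
```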
In the following section, we again use the case of multi-carrier power allocation to demonstrate the formulation via the infinite-horizon MDP with discounted cost. It takes the packet arrival at the MAC layer into consideration, which is not addressed in Section 11.2.2.1.
● System state: In the tth frame (t = 1, 2, 3, . . .), the system state st is uniquely specified by the CSI of all the subcarriers {hi(t) | i = 1, 2, . . . , NF} and the QSI q(t). The latter denotes the number of remaining packets waiting at the transmitter. Thus:
st = ({hi(t) | i = 1, 2, . . . , NF}, q(t)). (11.51)
● Control policy: The control action in the tth frame (t = 1, 2, 3, . . .) is the power allocation on all the subcarriers, i.e., at = {pi(t) | i = 1, 2, . . . , NF}. Then the control policy, denoted as Ω, can be written as
Ω(st) = at, ∀t, st. (11.52)
● Transition kernel: The block-fading channel model is considered, and the CSI is i.i.d. across frames. Therefore, the transition kernel can be written as
Pr(st+1 | st, at) = Pr({hi(t + 1) | i = 1, 2, . . . , NF}) Pr(q(t + 1) | st, at), (11.53)
where
q(t + 1) = max{0, q(t) − d(t)} + c(t), (11.54)
and
d(t) = (1/B) ∑_{i=1}^{NF} log2(1 + pi(t)|hi(t)|²/σz²) (11.55)
is the number of packets transmitted in the tth frame, and c(t) is the number of arrival packets in the tth frame. It is usually assumed that c(t) follows a Poisson arrival process with expectation λ, as in Example 11.2; thus, there are λ arrival packets per frame on average.
● Cost: The average power consumption at the transmitter is
P̄ = lim_{T→+∞} (1/T) E[∑_{t=1}^{T} ∑_{i=1}^{NF} pi(t)]. (11.56)
According to Little's Law, the average transmission delay of one packet is
W̄ = Q̄/λ = lim_{T→+∞} (1/T) E[∑_{t=1}^{T} q(t)/λ], (11.57)
where Q̄ is the average number of packets waiting at the transmitter. The weighted sum of the average power and delay is
P̄ + ηW̄ = lim_{T→+∞} (1/T) E[∑_{t=1}^{T} (η q(t)/λ + ∑_{i=1}^{NF} pi(t))], (11.58)
where η is the weight on the average transmission delay. The problem of minimizing P̄ + ηW̄ is an infinite-horizon MDP with average cost, whose solution will be introduced in the next section. Usually, people prefer to consider the discounted approximation of P̄ + ηW̄ as follows:
lim_{T→+∞} E[∑_{t=1}^{T} γ^{t−1} (η q(t)/λ + ∑_{i=1}^{NF} pi(t))]. (11.59)
The main reason for approximating the average cost by the discounted cost is that the latter has a better convergence rate in VI.
Hence, the resource allocation problem can be formulated as
min_Ω G = min_Ω lim_{T→+∞} E[∑_{t=1}^{T} γ^{t−1} (η q(t)/λ + ∑_{i=1}^{NF} pi(t))], (11.60)
where the per-stage cost is g(st, at) = η q(t)/λ + ∑_{i=1}^{NF} pi(t). This is an infinite-horizon MDP with discounted cost. The Bellman equation for this problem is
V(st) = min_{Ω(st)} {η q(t)/λ + ∑_{i=1}^{NF} pi(t) + γ Est+1[V(st+1)]}. (11.61)
Note that the space of the system state includes all possible values of the CSI, and it is actually impossible to evaluate the value function directly. Similarly to Section 11.2.2.1, since the CSI is i.i.d. across frames, the expectation with respect to the CSI can be taken on both sides of the above Bellman equation, i.e.:
V̄(q(t)) = Eh[min_{at} {η q(t)/λ + ∑_{i=1}^{NF} pi(t) + γ Est+1[V(st+1)]}]
= Eh[min_{at} {η q(t)/λ + ∑_{i=1}^{NF} pi(t) + γ ∑_{q(t+1)} Pr(q(t + 1) | st, at) V̄(q(t + 1))}]
= η q(t)/λ + Eh[min_{at} {∑_{i=1}^{NF} pi(t) + γ ∑_{c(t)} (λ^{c(t)} e^{−λ}/c(t)!) V̄(q(t + 1))}]
= η q(t)/λ + Eh,c[min_{at} {∑_{i=1}^{NF} pi(t) + γ V̄(q(t + 1))}], (11.62)
where Eh is the expectation over the CSI, Est+1 is the expectation over the next system state, and Eh,c is the expectation over the CSI and the random packet arrival.
The off-line VI can be applied to compute the value function V̄ for all possible queue lengths. In order to avoid an infinite transmission queue, we can set a finite buffer size so that overflow packets are dropped. With the value function, the optimal scheduling action can be calculated via:
Ω*(st) = a*t = arg min_{at} {∑_{i=1}^{NF} pi(t) + γ ∑_{c(t)} (λ^{c(t)} e^{−λ}/c(t)!) V̄(q(t + 1))}. (11.63)
Note that in both the off-line VI and the online scheduling, we always need to find the optimal solution of (11.63), which can be obtained with the approach introduced in Section 11.2.2.1 (i.e., water-filling with an optimized water level).
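To make (11.62)–(11.63) concrete, the sketch below evaluates the CSI-averaged value function V̄(q) with a finite buffer: it averages over random channel draws, uses a truncated Poisson kernel for the arrivals, and optimizes the water level over a coarse grid of total powers. The Rayleigh fading model, the packet size B in bits, the buffer size, and the grid are illustrative assumptions rather than part of the text.

```python
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(2)
NF, B, Qmax, lam, gamma, eta, noise = 4, 10, 30, 1.0, 0.9, 5.0, 1.0
# Truncated Poisson arrival distribution; the tail mass is folded into the last bin.
arrival = np.array([lam**c * exp(-lam) / factorial(c) for c in range(9)])
arrival[-1] += 1.0 - arrival.sum()
P_grid = np.linspace(0.0, 20.0, 11)          # candidate total transmit powers

def waterfill(gains, P_tot):
    """Water-filling (11.39) for a given total power, via bisection on the level."""
    inv = noise / gains
    lo, hi = 0.0, inv.max() + P_tot
    for _ in range(40):
        mu = 0.5 * (lo + hi)
        lo, hi = (lo, mu) if np.maximum(0.0, mu - inv).sum() > P_tot else (mu, hi)
    return np.maximum(0.0, lo - inv)

V = np.zeros(Qmax + 1)                        # value function over queue length
for _ in range(20):                           # value-iteration sweeps
    V_new = np.zeros_like(V)
    for q in range(Qmax + 1):
        est = 0.0
        for _ in range(10):                   # Monte Carlo average over i.i.d. CSI
            gains = rng.exponential(1.0, NF)  # Rayleigh-fading power gains
            best = np.inf
            for P_tot in P_grid:              # water-filling with optimized level
                p = waterfill(gains, P_tot)
                # Packets delivered this frame, floored to an integer (approximation).
                d = int(np.sum(np.log2(1.0 + p * gains / noise)) / B)
                q_next = np.minimum(Qmax, max(0, q - d) + np.arange(len(arrival)))
                best = min(best, p.sum() + gamma * float(arrival @ V[q_next]))
            est += best
        V_new[q] = eta * q / lam + est / 10
    V = V_new
print(np.round(V, 2))
```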
[Figure: average cost versus packet arrival rate λ for the proposed MDP scheme and water-filling baselines with peak powers of 10, 15, 20, and 25 W.]
Comparing with Problem 11.4, it can be observed that the discounted cost MDP
values the current cost more than the future cost (due to discount factor γ ), but the
average cost MDP values them equally. Moreover, when the discount factor γ of
Problem 11.4 is close to 1, the discounted cost MDP becomes closer to the average
cost MDP.
Unlike the case of discounted cost, the value function for the case of average cost does not have a straightforward meaning. Instead, the value function is defined via the following Bellman equation:
θ + V(st) = min_{Ω(st)} {g(st, at) + Est+1[V(st+1)]}, ∀st, (11.66)
where V(s) is the value function for system state s. As proved in [3], this Bellman equation provides the following insights into Problem 11.5:
● θ is the minimized average system cost, i.e.:
θ = min_Ω lim_{T→+∞} (1/T) E[∑_{t=1}^{T} g(st, Ω(st))]. (11.67)
● The optimal control action for an arbitrary system state st at an arbitrary tth stage can be obtained by solving the right-hand side of (11.66), i.e.:
Ω*(st) = a*t = arg min_{at} {g(st, at) + Est+1[V(st+1)]}. (11.68)
3. If the update from V^i to V^{i+1} for all system states is negligible, the iteration terminates. Otherwise, let i = i + 1 and jump to Step 2.
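A compact sketch of relative VI for the average-cost Bellman equation (11.66) is given below for a generic finite MDP: each sweep applies the Bellman operator and subtracts the value of a reference state, so that the subtracted constant converges to θ. The random cost and transition kernel are placeholders.

```python
import numpy as np

def relative_value_iteration(g, P, tol=1e-9, max_iter=100_000):
    """Solve theta + V(s) = min_a [ g(s,a) + sum_s' P(s'|s,a) V(s') ]."""
    V = np.zeros(g.shape[0])
    theta = 0.0
    for _ in range(max_iter):
        TV = (g + P @ V).min(axis=1)     # Bellman operator (no discount)
        theta = TV[0]                    # value of the reference state s = 0
        V_new = TV - theta               # re-centring keeps V bounded
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = (g + P @ V).argmin(axis=1)
    return theta, V, policy

# Illustrative random MDP with 5 states and 3 actions.
rng = np.random.default_rng(3)
S, A = 5, 3
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
g = rng.random((S, A))
print(relative_value_iteration(g, P))
```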
Its Bellman equation after taking the expectation over the CSI can be written as
θ + V̄(q(t)) = η q(t)/λ + E[min_{at} {∑_{i=1}^{NF} pi(t) + ∑_{c(t)} (λ^{c(t)} e^{−λ}/c(t)!) V̄(q(t + 1))}], (11.71)
where V̄(q(t)) is the value function with q(t) packets at the transmitter. The VI can be used to evaluate the value function for all possible queue lengths (a maximum queue length can be assumed to avoid an infinite queue). Moreover, with the value function, the optimal scheduling action at an arbitrary stage (say the tth stage) with an arbitrary system state st can be derived via:
Ω*(st) = a*t = arg min_{at} {∑_{i=1}^{NF} pi(t) + ∑_{c(t)} (λ^{c(t)} e^{−λ}/c(t)!) V̄(q(t + 1))}. (11.72)
Note that q(t + 1) depends on both pi(t) (∀i) and c(t). This problem can be solved with the approach introduced in Section 11.2.2.1 (i.e., water-filling with an optimized water level).
scheduling. Now, consider the following VI, which is supposed to be finished before running the system:
V^{i+1}(st) = min_{at} {g(st, at) + γ Est+1[V^i(st+1)]}
= min_{at} {g(st, at) + γ ∑_{st+1} Pr(st+1 | st, at) V^i(st+1)}. (11.73)
It can be observed that the VI relies on knowledge of the cost function g(st, at) and the state transition probability Pr(st+1 | st, at). In other words, the VI is infeasible if they are unknown. In the following example, we extend the power allocation example of Section 11.2.3.1 from an ideal mathematical model to a practical implementation and show that the cost function or the transition kernel (state transition probability) may be unknown in some situations.
Thus, Ω̄ is a mapping from the queue length to the power allocations for all possible CSI. With the definition of Ω̄, the Bellman equation for the example of Section 11.2.3.1 can be written in the form of (11.73), where the cost function and state transition probability of (11.73) are given by
g(st, at) = Eh[η q(t)/λ + ∑_{i=1}^{NF} pi(t)]
and
Pr(st+1 | st, at) = Eh[Pr(q(t + 1) | q(t), Ω̄(q(t)), {hi(t)})], (11.76)
respectively. Hence, the cost function requires knowledge of the CSI distribution, and the state transition probability depends on the distributions of both the CSI and the packet arrival. Without knowledge of both distributions, the off-line VI is infeasible.
In order to find the optimal policy without a priori knowledge of the system statistics, we have to perform VI in an online way, which is usually referred to as reinforcement learning. In the remainder of this section, we shall introduce two learning approaches. The first approach can be applied to the example of Section 11.2.3.1 without any knowledge of the CSI distribution, and the second one, which is called Q-learning, is more general and can also handle unknown packet-arrival statistics.
Note that this is exactly the MDP problem discussed in Section 11.2.3.1, where ξt and the system state st refer to the CSI and queue length in the tth frame, respectively. Its Bellman equation can be written as
V(st) = Eξt[min_{at} {g(st, at, ξt) + γ ∑_{st+1} Pr(st+1 | st, at, ξt) V(st+1)}]. (11.79)
In this section, we assume that ξt (∀t), g(st, at, ξt), and Pr(st+1 | st, at, ξt) can be observed or measured at each stage, but the distribution of ξt is unknown. This corresponds to the situation where the CSI distribution in the example of Section 11.2.3.1 is unknown (while the distribution of the packet arrival is known). Hence, the off-line VI is infeasible as the right-hand side of (11.79) cannot be calculated. Instead, we can first initialize a control policy, evaluate the value function corresponding to this policy via stochastic approximation in an online way, and then update the policy and reevaluate the value function again. By such an iteration, it can be proved that the Bellman equation (11.79) can finally be solved. The algorithm is elaborated below.
(11.81)
Since ξt can be observed at the tth stage (e.g., the CSI can be estimated at the beginning of each frame in the example of Section 11.2.3.1), the above optimization problem can be solved.
4. If the update of the control policy is negligible, terminate the algorithm. Otherwise, let i = i + 1 and jump to Step 2.
It can be proved that the value function and policy obtained by the above iterative algorithm, denoted as V^∞ and Ω^∞, satisfy the Bellman equation (11.79). Thus, Ω^∞ is the optimal control policy and V^∞ represents the minimum discounted cost for each initial system state. Notice that in the second step of the above algorithm, we should solve a fixed-point problem with unknown statistics. The stochastic
Thus, the value for the current system state sj is updated, and the others remain the same. As a remark, notice that g(sj, aj, ξj) + γ V^{i,j}(sj+1) is an unbiased estimate of Eξt[g(st, at, ξt) + γ ∑_{st+1} Pr(st+1 | st, at, ξt) V^i(st+1)]. Moreover, since knowledge of sj+1 is required, the above update should be calculated after observing the next system state.
3. If the update of the value function is negligible, terminate the algorithm. Otherwise, let j = j + 1 and jump to Step 2.
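The stochastic-approximation update of Step 2 (refreshing only the visited state with weight 1/(j + 1)) can be sketched as follows; the step function standing in for the observed system and the toy costs are illustrative assumptions.

```python
import numpy as np

def evaluate_policy_online(step_fn, policy, num_states, gamma=0.9, num_stages=50_000):
    """Online value-function evaluation by stochastic approximation:
    V(s_j) <- (j/(j+1)) V(s_j) + (1/(j+1)) [g(s_j, a_j, xi_j) + gamma V(s_{j+1})].
    step_fn(s, a) must return (realized cost, next state); it stands in for
    observing the real system. Only the visited state is updated each stage."""
    V = np.zeros(num_states)
    s = 0
    for j in range(1, num_stages + 1):
        a = policy(s)
        cost, s_next = step_fn(s, a)
        V[s] = (j / (j + 1)) * V[s] + (1 / (j + 1)) * (cost + gamma * V[s_next])
        s = s_next
    return V

# Illustrative two-state toy system with a random per-stage cost (playing the
# role of the unobserved randomness xi) and a fixed policy.
rng = np.random.default_rng(0)
def toy_step(s, a):
    cost = s + a + rng.random()        # realized cost g(s, a, xi)
    return cost, int(rng.integers(2))  # random next state
print(evaluate_policy_online(toy_step, policy=lambda s: 0, num_states=2))
```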
● System state: Since the CSI is i.i.d. across frames, we can treat the CSI as the independent random variable ξt instead of part of the system state. Thus:
st = {q(t)}. (11.85)
● Control policy: As elaborated in Example 11.3, when the CSI is removed from the system state, the control action becomes the power allocation for all possible CSI given the QSI. Thus, the control action of the tth frame is
at = Ω̄(st) = {Ω(q(t), {hi(t)}) | ∀i, hi(t)}. (11.86)
As a remark, note that Ω and Ω̄ represent the same scheduling behavior; however, their mathematical meanings are different: Ω is a policy with respect to the QSI and CSI, and Ω̄ is a policy with respect to the QSI only. Hence, one action of Ω̄ consists of a number of actions of Ω with the same QSI.
● Transition kernel: Given the system state, CSI, and control action of the tth frame, the transition kernel can be written as
Pr(st+1 | st, at, ξt) = Pr(q(t + 1) | q(t), {hi(t) | ∀i}, {pi(t) | ∀i}), (11.87)
where
q(t + 1) = max{0, q(t) − d(t)} + c(t), (11.88)
and
d(t) = (1/B) ∑_{i=1}^{NF} log2(1 + pi(t)|hi(t)|²/σz²) (11.89)
is the number of packets transmitted in the tth frame, and c(t) is the number of arrival packets in the tth frame. Note that the randomness of q(t + 1) comes from the random packet arrival c(t).
● Cost: The overall cost function as defined in Section 11.2.3.1 is
G = lim_{T→+∞} E[∑_{t=1}^{T} γ^{t−1} (η q(t)/λ + ∑_{i=1}^{NF} pi(t))]. (11.90)
As elaborated in Example 11.3, the Bellman equation for the above MDP problem is
V̄(q(t)) = Eξt[min_{Ω̄(q(t))} {η q(t)/λ + ∑_{i=1}^{NF} pi(t) + γ ∑_{q(t+1)} Pr(q(t + 1) | q(t), {hi(t)}, {pi(t)}) V̄(q(t + 1))}], (11.91)
or equivalently:
V̄(q(t)) = Eξt[min_{Ω̄(q(t))} {η q(t)/λ + ∑_{i=1}^{NF} pi(t) + γ ∑_{c(t)} Pr(c(t)) V̄(q(t + 1))}]. (11.92)
Note that without knowledge of the distribution of the CSI ξt, the expectations in the above Bellman equation cannot be calculated directly. Hence, we have to rely on the online value and policy iteration introduced in this section, which consists of two levels of iteration. The outer iteration updates the policy, and the inner one finds the value function corresponding to the policy. The procedure is elaborated below.
where βt depends on the total transmission power of the tth frame. βt is usually referred to as the Lagrange multiplier, as the above power allocation is derived via convex optimization [5]. Moreover, given st and ξt, the βt with respect to the initialized value function V̄^0, denoted as βt^1, can be calculated according to the right-hand side of (11.92), i.e.,
βt^1 = arg min_{βt} {η q(t)/λ + ∑_{i=1}^{NF} max{0, 1/βt − σz²/|hi(t)|²} + γ ∑_{c(t)} Pr(c(t)) V̄^0(q(t + 1))}.
Step 2 (Value function evaluation): Given the power allocation policy derived from βt^i (i = 1, 2, . . .), the corresponding value function can be calculated as follows:
● Let j = 1. Initialize the value function by V̄^{i,j} = V̄^{i−1}.
● At the jth stage, denote sj as the system state and update the value function as follows:
V̄^{i,j+1}(sj) = (j/(j + 1)) V̄^{i,j}(sj) + (1/(j + 1)) [g(sj, aj, ξj) + γ V̄^{i,j}(sj+1)], (11.95)
and
V̄^{i,j+1}(s) = V̄^{i,j}(s), ∀s ≠ sj. (11.96)
● Let j = j + 1 and repeat the above step until the iteration converges. Let V̄^i be the converged value function.
11.3.2 Q-learning
The stochastic-approximation-based learning approach in the previous section is able to handle the situation in which the controller knows the transition kernel Pr(st+1 | st, at, ξt) but does not know its expectation with respect to ξt, i.e., Pr(st+1 | st, at) = Eξt[Pr(st+1 | st, at, ξt)]. Regarding the example in Section 11.3.1.1, this corresponds to the circumstance in which the transmitter knows the distribution of the packet arrival in each frame, but not the CSI distribution. Q-learning is a more powerful tool that can solve MDP problems with an unknown transition kernel Pr(st+1 | st, at). In other words, it can handle the power allocation even without the statistics of the packet arrival.
We use the infinite-horizon MDP with discounted cost in Problem 11.4 as the example to demonstrate the method of Q-learning. First of all, the Q function is defined as
Q(s, a) = min_Ω lim_{T→+∞} E[∑_{t=1}^{T} γ^{t−1} g(st, at) | s1 = s, a1 = a]. (11.98)
In other words, the optimal control policy can be easily obtained with the Q function
of the MDP. Moreover, the Bellman equation in (11.48) can be written in the form of
Q function, i.e.:
Q(st, at) = g(st, at) + γ ∑_{st+1} Pr(st+1 | st, at) V(st+1), (11.101)
V(st) = min_{at} {g(st, at) + γ ∑_{st+1} Pr(st+1 | st, at) min_{at+1} Q(st+1, at+1)}, (11.102)
or
Q(st, at) = g(st, at) + γ ∑_{st+1} Pr(st+1 | st, at) min_{at+1} Q(st+1, at+1). (11.103)
In order to compute and store the values of the Q function, it is required that both the state and action spaces be finite. Regarding the example of Section 11.3.1.1, we should quantize the transmission power into a finite number of levels.
The Bellman equation (11.103) provides an iterative way to evaluate the Q function. The procedure is described below.
The above VI requires knowledge of the transition probability Pr(st+1 | st, at). If it is not available at the controller, the Q-learning algorithm provided below can be used instead.
Q-learning algorithm
1. Let j = 1 and initialize the Q function, denoted as Q^j.
2. At the jth stage, denote sj as the system state and aj as the action, and update the Q function as follows:
Q^{j+1}(sj, aj) = (j/(j + 1)) Q^j(sj, aj) + (1/(j + 1)) [g(sj, aj) + γ min_{aj+1} Q^j(sj+1, aj+1)],
and
Q^{j+1}(s, a) = Q^j(s, a), ∀(s, a) ≠ (sj, aj). (11.104)
Since knowledge of sj+1 is required, the above update should be calculated after observing the next system state.
3. If the update of the Q function is negligible, terminate the algorithm. Otherwise, let j = j + 1 and jump to Step 2.
In the above algorithm, the control action at each stage should be chosen to guarantee that the Q function for all pairs of system state and control action can be well trained.
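A tabular sketch of the Q-learning update above is given below, using epsilon-greedy exploration so that all state-action pairs keep being visited; the exploration rule and the toy random MDP are illustrative choices, not specified in the text.

```python
import numpy as np

class ToyEnv:
    """A tiny random MDP used only to exercise the algorithm (illustrative)."""
    def __init__(self, num_states=4, num_actions=3, seed=0):
        self.rng = np.random.default_rng(seed)
        self.P = self.rng.random((num_states, num_actions, num_states))
        self.P /= self.P.sum(axis=2, keepdims=True)
        self.g = self.rng.random((num_states, num_actions))
        self.s = 0
    def observe(self):
        return self.s
    def step(self, a):
        cost = self.g[self.s, a]
        self.s = int(self.rng.choice(len(self.P), p=self.P[self.s, a]))
        return cost, self.s

def q_learning(env, num_states, num_actions, gamma=0.9,
               num_stages=50_000, eps=0.1, seed=1):
    """Tabular Q-learning (minimum-cost form) with averaging step size 1/(j+1)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    s = env.observe()
    for j in range(1, num_stages + 1):
        # epsilon-greedy exploration keeps every (s, a) pair visited
        a = int(rng.integers(num_actions)) if rng.random() < eps else int(Q[s].argmin())
        cost, s_next = env.step(a)
        target = cost + gamma * Q[s_next].min()
        Q[s, a] = (j / (j + 1)) * Q[s, a] + (1 / (j + 1)) * target
        s = s_next
    return Q, Q.argmin(axis=1)

Q, policy = q_learning(ToyEnv(), num_states=4, num_actions=3)
print(policy)
```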
Without statistical knowledge of ξt and the packet arrival c(t), the Q-learning algorithm is provided below:
1. Let i = 1. Initialize the Q function, denoted as Q^i.
2. In the ith frame, let q(i) and βi be the system state (QSI) and the Lagrange multiplier for the power allocation, and update the Q function as follows:
Q^{i+1}(q(i), βi) = (i/(i + 1)) Q^i(q(i), βi) + (1/(i + 1)) [η q(i)/λ + ∑_{j=1}^{NF} pj(i) + γ min_β Q^i(q(i + 1), β)],
and
Q^{i+1}(s, β) = Q^i(s, β), ∀(s, β) ≠ (q(i), βi). (11.107)
Since knowledge of q(i + 1) is required, the above update should be calculated after observing the next system state.
where βt = arg min_β Q(q(t), β). Note that in this solution, the Lagrange multiplier βt is determined according to the QSI only. A better solution may be obtained if we treat both the CSI and the QSI as the system state in the Q-learning algorithm (βt is then determined according to both the CSI and the QSI). However, this requires the quantization of the CSI and a larger system complexity.
Compared with the method introduced in Section 11.3.1, it can be observed that the Q-learning approach is more general in the sense that it can be applied to situations without knowledge of the transition kernel. However, the price to pay is that the Q function depends on both the system state and the control action. Thus, the storage and computation complexities for evaluating the Q function are higher.
References
[1] Robbins H, and Monro S. A Stochastic Approximation Method. The Annals of
Mathematical Statistics. 1951;22(3):400–407.
[2] Bertsekas D. Dynamic Programming and Optimal Control: Volume I. 3rd ed.
Belmont: Athena Scientific; 2005.
[3] Bertsekas D. Dynamic Programming and Optimal Control: Volume II. 3rd ed.
Belmont: Athena Scientific; 2005.
[4] Kleinrock L. Queueing Systems. Volume 1: Theory. 1st ed. New York: Wiley-
Interscience; 1975.
[5] Boyd S, and Vandenberghe L. Convex Optimization. 1st ed. Cambridge:
Cambridge University Press; 2004.
[6] Sutton R, and Barto A. Reinforcement Learning: An Introduction. 2nd ed.
Cambridge: MIT Press; 2018.
[7] Bettesh I, and Shamai S. Optimal Power and Rate Control for Minimal Aver-
age Delay: The Single-User Case. IEEE Transactions on Information Theory.
2006;52:4115–4141.
[8] Moghadari M, Hossain E, and Le LB. Delay-Optimal Distributed Scheduling
in Multi-User Multi-Relay Cellular Wireless Networks. IEEE Transactions on
Communications. 2013;61(4):1349–1360.
[9] Cui Y, and Lau VKN. Distributive Stochastic Learning for Delay-Optimal
OFDMA Power and Subband Allocation. IEEE Transactions on Signal
Processing. 2010;58(9):4848–4858.
[10] Cui Y, and Jiang D. Analysis and Optimization of Caching and Multicast-
ing in Large-Scale Cache-Enabled Heterogeneous Wireless Networks. IEEE
Transactions on Wireless Communications. 2017;16(1):250–264.
[11] Wang R, and Lau VKN. Delay-Aware Two-Hop Cooperative Relay Commu-
nications via Approximate MDP and Stochastic Learning. IEEE Transactions
on Information Theory. 2013;59(11):7645–7670.
[12] Zhou B, Cui Y, and Tao M. Stochastic Content-Centric Multicast Scheduling
for Cache-Enabled Heterogeneous Cellular Networks. IEEE Transactions on
Wireless Communications. 2016;15(9):6284–6297.
[13] Powell WB. Approximate Dynamic Programming: Solving the Curses of
Dimensionality. 2nd ed. New Jersey:John Wiley & Sons; 2011.
Chapter 12
Q-learning-based power control in small-cell
networks
Zhicai Zhang¹, Zhengfu Li², Jianmin Zhang³, and Haijun Zhang³

¹ College of Physics and Electronic Engineering, Shanxi University, China
² Beijing Key Laboratory of Network System Architecture and Convergence, Beijing University of Posts and Telecommunications, China
³ School of Computer & Communication Engineering, University of Science and Technology Beijing, China

12.1 Introduction

In recent years, most voice and data services have occurred in indoor environments. However, due to long-distance transmission and high penetration loss, the indoor coverage of macrocells may not be good. As a result, the femtocell base station (FBS) has gained wide attention in the wireless industry [1,2]. With the exponential growth of mobile data traffic, wireless communication networks play a more and more important role in the global emissions
of carbon dioxide [3]. Obviously, the increasing energy cost will bring significant operational cost to mobile operators. On the other hand, limited battery resources cannot meet the requirements of massive data rates. The concept of green communication has therefore been proposed to develop environmentally friendly and energy-saving technologies for future wireless communications. Hence, the use of energy-aware communication technology is the trend in next-generation wireless network design.
In a two-tier network with shared spectrum, the signal-to-interference-plus-noise ratio (SINR) targets of the macrocell users and the femtocell users are coupled due to cross-tier interference. The SINR target establishes an application-related minimum QoS requirement for each user. It is reasonable to expect that, since home users deploy femtocells for their own benefit and are close to their BS, femtocell users and cellular users seek different SINRs (data rates), usually higher data rates through the femtocell. However, the QoS improvement from femtocells should not come at the expense of reduced macrocell coverage.
In practice, providing reliable delay guarantees for delay-sensitive, high-data-rate services, such as video calling and video conferencing, is a key issue in wireless communication networks. However, due to the time-varying nature of the wireless channel, it is difficult and unrealistic to apply traditional fixed-delay QoS guarantees. To solve this problem, statistical QoS metrics based on the delay-bound violation probability have been widely adopted to guarantee statistical delay QoS [4–6]. In [5], an energy-efficient design based on effective capacity (EC) is studied for delay-sensitive traffic in single-cell downlink Orthogonal Frequency Division Multiple Access (OFDMA) networks. In [6], a joint power and subcarrier allocation algorithm with delay QoS requirements is proposed for vehicular communication networks. However, as far as we know, EC-based delay provisioning in two-tier femtocell networks has not been widely studied.
In addition, due to the scarcity of spectrum, the femtocells and the macrocell usually share the same frequency band. However, in the case of co-channel operation, dense and unplanned deployment will lead to serious cross-tier and co-tier interference, which greatly limits the performance of the network. Femtocell base stations are low-power, low-cost, user-deployed wireless access points that use local broadband connections as backhaul. Not only users but also operators benefit from femtocells. On the one hand, users enjoy high-quality links; on the other hand, operators reduce operating expenses and capital expenditure thanks to service offloading and the user deployment of FBSs.
Therefore, it is necessary to design effective interference-suppression mechanisms in two-tier femtocell networks to reduce cross-tier and co-tier interference. In [7,8], the authors review interference management in two-tier femtocell networks and small-cell networks. In [9], the authors propose a novel interference coordination scheme using downlink multicell chunk allocation with dynamic inter-cell coordination to reduce co-tier interference. In [10], based on cooperative Nash bargaining game theory, a joint uplink subchannel and power-allocation algorithm is proposed for cognitive small cells to reduce cross-tier interference. In [11], in order to maximize the total capacity of all femtocell users under the constraints of
[Figure: system model. Mobile stations MS 0, MS 1, . . . , MS N, each with a transmission queue, are served by base stations B0 (macrocell) and B1, . . . , BN (femtocells); hij denotes the channel gain between MS i and BS Bj.]
at the edges of their cell to meet their received power targets, causing excessive cross-tier interference to nearby cells. Due to scalability, security, and the limited availability of backhaul bandwidth, the macrocell base station (BS) and the femtocell base stations operate as independent access points (APs).
Let i ∈ N = {0, 1, . . . , N} denote the index of the active users, where i = 0 indicates the scheduled user in the macrocell and i ∈ {1, 2, . . . , N} denotes the scheduled user in femtocell Bi. Let Bi (i ∈ N) denote the base stations, where B0 is the MBS and Bi (i ∈ N, i ≠ 0) is an FBS. We assume that each MS is allocated only one subchannel and, in order to avoid intra-cell interference, only one active MS in each cell can occupy the same frequency during each frame time slot.
The received SINR of MS i in Bi can be expressed as
γi(pi, p−i) = pi hii / (∑_{j≠i} pj hij + σi²), ∀i ∈ N, (12.1)
where pi denotes the transmit power of MS i, and p−i (−i ∈ N) denotes the transmit powers of the other MSs except MS i. hii and hij are the channel gains from MS i to BS Bi and Bj, respectively, and σi² is the variance of the additive white Gaussian noise (AWGN) at MS i.
Similarly, the received SINR of the MU is
γ0 = h0,0 p0 / (∑_{i=1}^{N} hi,0 pi + σ0²), (12.2)
where hi,0 is the channel gain from FBS Bi to the active MU and h0,0 denotes the channel gain from the MBS to its active MU.
According to Shannon's capacity formula, the ideal achievable data rate of MS i is
Ri(pi, p−i) = w log2(1 + γi(pi, p−i)), (12.3)
where w is the bandwidth of each subchannel.
where Qth is the queue-length bound and θ > 0 is the decay rate of the tail distribution of the queue length Q(∞).
If Qth → ∞, we obtain the approximation of the buffer-violation probability, Pr{Q(∞) > Qth} ≈ e^{−θQth}. We can see that a larger θ corresponds to a faster decay rate, which means a more stringent QoS constraint, while a smaller θ leads to a slower decay rate, which means a looser QoS requirement. Similarly, the delay-outage probability can be approximated by [4] Pr{Delay > Dth} ≈ ξ e^{−θδDth}, where Dth is the maximum tolerable delay, ξ is the probability of a non-empty buffer, and δ is the maximum constant arrival rate.
The concept of EC was proposed by Wu et al. in [4]; it is defined as the maximum constant arrival rate that can be supported by the time-varying channel while ensuring the statistical delay requirement specified by the QoS exponent θ. The EC is formulated as
E^c(θ) = − lim_{K→∞} (1/(Kθ)) ln(E{e^{−θ ∑_{k=1}^{K} S[k]}}), (12.5)
where S[k] is the amount of service (in bits) in the kth frame.
We assume that the channel fading coefficients remain unchanged over the frame duration T and vary independently for each frame and each MS. From (12.5), with Si[k] = T Ri[k], the EC of MS i can be simplified as
E^c_i(θi) = −(1/(θi T)) ln(E{e^{−θi T Ri[k]}}). (12.6)
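The EC in (12.6) can be estimated numerically by averaging over channel realizations. The sketch below assumes Rayleigh fading, a rate R = w log2(1 + γ), and illustrative parameter values.

```python
import numpy as np

def effective_capacity(theta, T=1e-3, w=1.0, mean_snr=10.0, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E^c(theta) = -(1/(theta*T)) * ln E[exp(-theta*T*R)]
    for a Rayleigh-faded link with average SNR `mean_snr` (linear scale)."""
    rng = np.random.default_rng(seed)
    snr = mean_snr * rng.exponential(1.0, n_samples)   # Rayleigh fading power
    rate = w * np.log2(1.0 + snr)                      # R[k] = w*log2(1 + gamma)
    return -np.log(np.mean(np.exp(-theta * T * rate))) / (theta * T)

# The EC decreases as theta grows (stricter delay constraint) and approaches
# the ergodic rate as theta -> 0.
for theta in (1e-3, 1.0, 100.0):
    print(theta, effective_capacity(theta))
```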
FBS is rational. Each player pursues the maximization of its own utility, which can be denoted as
max_{pi∈Pi} ui(pi, p*−i), ∀i ∈ N, (12.10a)
Theorem 12.1. An NE exists in the NPCG G = {N, {Pi}, ui(pi, p−i)}, if for all i ∈ N,
the following two conditions are satisfied:
1. In Euclidean space RN , the strategy set {Pi } is a non-empty, convex, and compact
subset.
2. The utility function ui (pi ,p−i ) is continuous in (pi ,p−i ) and quasi-concave in pi .
Proof. For condition (1), it is obvious that {Pi} is a non-empty, convex, and compact subset. We prove condition (2) in the following.
For fixed p−i, let hi = gi,i/(∑_{j≠0,i} gj,i pj + g0,i p0 + σ²) denote the channel gain-to-interference-plus-noise ratio of FU i, and let f(hi) be the probability density of hi. For almost all practical environments, we assume f(hi) is continuous and differentiable in hi:
Ẽ^C_i(θi) = −(1/θi) ln(∫_0^∞ e^{−θi Ri(pi)} f(hi) dhi) − u gi,0 pi. (12.12)
where p = (p0,v0, . . . , pi,vi, . . . , pN,vN) ∈ P is the action vector of all MSs at time slot t, and P = ×i∈N Pi.
Theorem 12.2. Given MS 0's strategy π0, there exists a mixed strategy {π*i, π*−i} that satisfies:
ui(π*i, π*−i) ≥ ui(πi, π*−i), (12.16)
which is an NE point.
Proof. As shown in [33], every finite strategic game has a mixed-strategy equilibrium, i.e., there exists an NE(π0) for a given π0.
Lemma 12.1. The problem has a Stackelberg equilibrium (SE) point {π*0, π*i, π*−i} (∀i ∈ N, i ≠ 0), which is a mixed strategy.
The proof of the existence of the SE point is omitted here for brevity. We will employ a reinforcement-learning mechanism, called Q-learning, to find the SE point.
12.4.2 Q-learning
Based on reinforcement learning, each femtocell can be an intelligent agent with
self-organization and self-learning ability, and its operation parameters can be opti-
mized according to the environment. Q-learning is a common reinforcement learning
method, which is widely used in self-organizing femtocell networks. It does not need
teachers’ signals. It can optimize its operation parameters through experiments and
errors. Each BS acts as an intelligent agent, maximizing its profit by interacting
directly with the environment.
We define pi,vi ∈ Pi (∀i ∈ N ) as actions of Q-learning model, and π t−i (−i ∈ N )
are environment states. In a standard Q-learning model, an agent interacts with its
environment to optimize its operation parameters. First, the agent perceives the envi-
ronment and observes its current state s ∈ S. Then, the agent selects and performs
an action a ∈ A according to a decision policy π : s → a and the environment will
change to the next state s + 1. Meanwhile, the agent receives a reward W from the
environment.
In each state, there is a Q-value associated with each action. The definition of a
Q-value is the sum of the received reward (possibly discounted) when an agent per-
forms an associated action and then follows a given policy thereafter [34]. Similarly,
the optimal Q-value is the sum of the received reward when the optimal strategy is
followed. Therefore, the Q-value can be expressed as
Q^t_π(a, s) = W^t(a, s) + λ max_{a∈A} Q^{t−1}_π(a, s + 1), (12.17)
where W^t(a, s) is the received reward when an agent performs action a in state s in time slot t, and λ denotes a discount factor, 0 ≤ λ < 1. However, at the beginning of the learning, (12.17) does not yet hold. The deviation between the optimal value and the realistic value is
ΔQ^t_π(a, s) = W^t(a, s) + λ max_{a∈A} Q^{t−1}_π(a, s + 1) − Q^{t−1}_π(a, s), (12.18)
where δ^{t+1}_{−(0,v0)} = ∏_{j∈N, j≠0} π^{t+1}_{j,vj} denotes the probability of the action vector p−(0,v0) = (p1,v1, . . . , pi,vi, . . . , pN,vN).
For MS i (∀i ∈ N, i ≠ 0), due to the fact that the FBSs can receive the MBS's transmit power strategy and there is no interference between FBSs, the reward function of MS i is
ri(pi,vi, π^{t+1}_0) = ∑_{v0=1}^{V0} δ^{t+1}_{−(i,vi)} ηi(pi,vi, p0,v0), (12.24)
where δ^{t+1}_{−(i,vi)} = π^{t+1}_{0,v0}.
ri(pi,vi, π^{t+1}_0) = ∑_{v0=1}^{V0} δ^{t+1}_{−(i,vi)} η̂i(pi,vi, p0,v0). (12.25)
Due to limited space, the convergence analysis of the proposed algorithm can be found in [35]. As Algorithm 12.1, a distributed Q-learning algorithm is proposed.
vector, where π^{t−1}_{i,ai} is the probability with which FBS Bi chooses action ai at time t − 1.
Action: Each discrete transmit power level is denoted by an action ai. Therefore, we use the action ai ∈ Ai to represent FBS Bi's transmit power. According to the policy π^{t−1}_i, FBS Bi selects transmit power ai with probability π^{t−1}_{i,ai}.
The Q-value can be formulated according to the utility function of the discrete game Gd:
Q^t_{πi}(ai, si) = W^t(ai, si) + λ max_{a∈A} Q^{t−1}_{πi}(ai, si + 1) = π^{t−1}_{i,ai} ui(pi(ai), p^t_{−i}). (12.28)
Therefore, α can be regarded as the weight of the historical learning process, which can speed up learning. Moreover, in order to ensure fast convergence, we propose a weighted-filter algorithm based on the Boltzmann distribution [31] to update the policy π^t_i:
π^t_{i,ai} = (α²/(t² + α²)) · exp(Q^t_{πi}(ai, si)/T) / ∑_{j=0}^{Mi} exp(Q^t_{πi}(j, si)/T) + (t²/(t² + α²)) · π^{t−1}_{i,ai}, (12.32)
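A short sketch of the policy update (12.32) for a single state is given below: the Boltzmann term favours actions with larger Q-values, and the weighted filter blends it with the previous policy, with α acting as a confidence weight. The temperature T, the Q-values, and the number of power levels are illustrative assumptions.

```python
import numpy as np

def boltzmann_weighted_filter(Q_row, prev_policy, t, alpha=0.5, T=1.0):
    """Policy update of (12.32) for one state: blend the Boltzmann distribution
    over the Q-values with the previous policy pi^{t-1}."""
    boltz = np.exp(Q_row / T)
    boltz /= boltz.sum()
    w = alpha**2 / (t**2 + alpha**2)       # weight of the new Boltzmann term
    return w * boltz + (1.0 - w) * np.asarray(prev_policy)

# Illustrative use: four discrete power levels, uniform previous policy.
Q_row = np.array([1.0, 2.0, 0.5, 1.5])
pi = np.full(4, 0.25)
for t in range(1, 6):
    pi = boltzmann_weighted_filter(Q_row, pi, t)
    print(t, np.round(pi, 3))
```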
[Figure: expected utilities (bit/J) versus delay exponent θ for d = 3, 10, and 20, with τ = 0.001 and α = 0.5.]
Figures 12.4 and 12.5 show the convergence of the proposed algorithm. From these figures, we can see that the proposed algorithm has a faster convergence speed than the CMAQL algorithm. The reason is that the femtocell users in the proposed Q-learning mechanism can share the transmit power strategy with the macro-user, while the value of δ^{t+1}_{−(i,vi)} is estimated only from past experience in the CMAQL algorithm.
[Figures 12.4 and 12.5: maximal Q-value (bits/J) versus time slot t for the proposed algorithm and the CMAQL algorithm.]
[Figure 12.6: average Q-value of FUs (bit/s) versus time slot t for HRLA, NGb-PCA, and the proposed BDb-WFQA.]
action through the Boltzmann distribution. From Figure 12.6, the BDb-WFQA algorithm becomes stable after 20 iterations, which confirms the convergence of the algorithm. In addition, we find that, compared with NGb-PCA and HRLA, the proposed BDb-WFQA converges faster. This is because the proposed BDb-WFQA employs the discrete powers as the action profile and uses the weighted filter to update the policy, where the filter parameter α can be regarded as a confidence weight that accelerates learning.
The average EC of the FUs is shown in Figure 12.7. It can be observed that the average EC of the FUs decreases as the delay QoS exponent θ increases for both NGb-PCA and the proposed BDb-WFQA. This is because a larger θ means a more stringent delay requirement. In addition, we find that the performance of the proposed BDb-WFQA is slightly lower than that of NGb-PCA. This is because the proposed BDb-WFQA uses a discrete action profile and may therefore miss the exact optimal power values. However, as mentioned earlier, the proposed BDb-WFQA converges faster than NGb-PCA.
The average EC of the MUs is shown in Figure 12.8. From the five curves in Figure 12.8, it can be observed that the average EC of the MUs increases with μ. Besides, we can see that when the pricing factor μ = 0, the average EC of the MUs is the smallest. This is because μ = 0 means that there is no interference constraint at the FBSs' side, so the FBSs choose the optimal transmit power to selfishly increase their EC, which causes severe cross-tier interference to the macrocell. When μ ≥ 170 dB, the MUs gain the largest average EC. This is because a sufficiently large price makes the FBSs choose the smallest transmit power; thus, the cross-tier interference each MU receives is the smallest, and the achievable average EC is the largest.
[Figure 12.7: average EC of FUs (bit/s) versus delay QoS exponent θ for NGb-PCA and BDb-WFQA.]
[Figure 12.8: average EC of MUs (bit/s) versus time slot t for pricing factors μ = 0, 150, 160, 170, and 175 dB.]
Therefore, we can choose a pricing factor that guarantees that the cross-tier interference received by the MUs is acceptable while the FBSs still achieve good EC performance.
12.6 Conclusion
References
[1] Chandrasekhar V, Andrews JG, and Gatherer A. Femtocell networks: A survey. IEEE Commun Mag. 2008;46(9):59–67.
[2] Zhang H, Jiang C, Beaulieu NC, et al. Resource allocation in spectrum-sharing
OFDMA femtocells with heterogeneous services. IEEE Trans Commun.
2014;62(7):2366–2377.
[3] Li GY, Xu Z, Xiong C, et al. Energy-efficient wireless communications:
tutorial, survey, and open issues. IEEE Wireless Commun. 2011;18(6):28–35.
[4] Wu D, and Negi R. Effective capacity: a wireless link model for support of
quality of service. IEEE Trans Wireless Commun. 2003;2(4):630–643.
[5] Xiong C, Li GY, Liu Y, et al. Energy-efficient design for downlink OFDMA
with delay-sensitive traffic. IEEE Trans Wireless Commun. 2013;12(6):
3085–3095.
[6] Zhang H, Ma Y, Yuan D, et al. Quality-of-service driven power and sub-carrier
allocation policy for vehicular communication networks. IEEE J Sel Areas
Commun. 2011;29(1):197–206.
[7] Palanisamy P, and Nirmala S. Downlink interference management in femtocell
networks-a comprehensive study and survey. In: Proc. IEEE ICICES; 2013.
p. 747–754.
[8] Zhang H, Jiang C, Cheng J, et al. Cooperative interference mitigation and
handover management for heterogeneous cloud small cell networks. IEEE
Wireless Commun. 2015;22(3):92–99.
[9] Rahman M, and Yanikomeroglu H. Enhancing cell-edge performance: a down-
link dynamic interference avoidance scheme with inter-cell coordination. IEEE
Trans Wireless Commun. 2010;9(4):1414–1425.
[10] Zhang H, Jiang C, Beaulieu NC, et al. Resource allocation for cognitive small
cell networks: a cooperative bargaining game theoretic approach. IEEE Trans
Wireless Commun. 2015;14(6):3481–3493.
[11] Zhang H, Jiang C, Mao X, et al. Interference-limited resource optimization
in cognitive femtocells with fairness and imperfect spectrum sensing. IEEE
Trans Veh Technol. 2016;65(3):1761–1771.
[12] Li Z, Lu Z, Wen X, et al. Distributed power control for two-tier femtocell net-
works with QoS provisioning based on Q-learning. In: Vehicular Technology
Conference IEEE; 2015. p. 1–6.
[13] Zhang H, Jiang C, Hu RQ, et al. Self-organization in disaster resilient
heterogeneous small cell networks. IEEE Network. 2016;30(2):116–121.
[14] Long C, Zhang Q, Li B, et al. Non-cooperative power control for wireless
ad hoc networks with repeated games. IEEE J Sel Areas Commun. 2007;25(6):
1101–1112.
428 Applications of machine learning in wireless communications
[15] Chen X, Zhang H, Chen T, et al. Improving energy efficiency in green femtocell
networks: a hierarchical reinforcement learning framework. In: Proc. IEEE
ICC, Budapest, Hungary; 2013. p. 2241–2245.
[16] Zhang Z, Wen X, Li Z, et al. QoS-aware energy-efficient power control
in two-tier femtocell networks based on Q-learning. In: Proc. ICT; 2014.
p. 313–317.
[17] Miao G, Himayat N, Li GY, et al. Low-complexity energy-efficient scheduling
for uplink OFDMA. IEEE Trans Commun. 2012;60(1):112–120.
[18] Zappone A, Alfano G, Buzzi S, et al. Energy-efficient non-cooperative resource
allocation in multi-cell OFDMA systems with multiple base station antennas.
In: IEEE GreenCom; 2011. p. 82–87.
[19] Saraydar CU, Mandayam NB, and Goodman DJ. Pareto efficiency of pricing-
based power control in wireless data networks. In: IEEE Wireless Communi-
cations and Networking Conference (WCNC); 1999. p. 231–235 vol. 1.
[20] Wang L, Chen X, Zhao Z, et al. Exploration vs exploitation for distributed
channel access in cognitive radio networks: a multi-user case study. In: 11th
International Symposium on Communications and Information Technologies
(ISCIT); 2011. p. 360–365.
[21] van den Biggelaar O, Dricot JM, Doncker PD, et al. A new distributed algorithm
for the allocation of cognitive radio sensing times. In: IEEE International
Symposium on Personal Indoor and Mobile Radio Communications (PIMRC);
2012. p. 1208–1213.
[22] Panahi FH, and Ohtsuki T. Optimal channel-sensing policy based on Fuzzy
Q-learning process over cognitive radio systems. In: IEEE International
Conference on Communications (ICC); 2013. p. 2677–2682.
[23] Qiao D, Gursoy MC, and Velipasalar S. Energy efficiency in multiaccess fading
channels under QoS constraints. EURASIP J Wireless Commun Networking.
2012;2012(1):136.
[24] Musavian L, and Le-Ngoc T. Energy-efficient power allocation for delay-
constrained systems. In: IEEE Global Communications Conference (GLOBE-
COM); 2012. p. 3554–3559.
[25] Xiong C, Li GY, Liu Y, et al. QoS driven energy-efficient design for
downlink OFDMA networks. In: IEEE Global Communications Conference
(GLOBECOM); 2012. p. 4320–4325.
[26] Jiang C, Zhang H, Ren Y, et al. Machine learning paradigms for next-generation
wireless networks. IEEE Wireless Commun. 2017;24(2):98–105.
[27] Alnwaimi G, Vahid S, and Moessner K. Dynamic heterogeneous learning
games for opportunistic access in LTE-based macro/femtocell deployments.
IEEE Trans Wireless Commun. 2015;14(4):2294–2308.
[28] Onireti O, Zoha A, Moysen J, et al. A cell outage management framework for
dense heterogeneous networks. IEEE Trans Veh Technol. 2016;65(4):2097–
2113.
[29] Rekha JU, Chatrapati KS, and Babu AV. Game Theory and Its Applications
in Machine Learning. In: Satapathy SC, Mandal JK, Udgata SK, Bhateja V.
Data-driven vehicular mobility modeling and prediction
Vehicular networks have recently been attracting increasing attention from both the industry and the research community. One of the challenges in this area is to understand vehicular mobility and to propose accurate and realistic mobility models that aid the design and evaluation of vehicular communications and networks. In this chapter, different from current works that focus on microscopic-level models describing individual mobility behaviors, we explore the use of an open Jackson queueing network framework to model macroscopic-level vehicular mobility. The proposed intuitive model can accurately describe vehicular mobility and further predict various measures of network-level performance, such as the vehicular distribution, as well as vehicular-level performance, such as the average sojourn time in each area and the number of areas a vehicle visits. Model validation based on two large-scale urban vehicular motion traces reveals that such a simple model can accurately predict a number of system measures concerning vehicular network performance. Moreover, we develop two applications to illustrate the proposed model's effectiveness in the analysis of system-level performance and the dimensioning of vehicular networks.
13.1 Introduction
Recently, as more and more vehicles are equipped with multiple sensors and heterogeneous communication access devices to enable wireless connectivity, interest in vehicular communications and networks has grown tremendously [1]. Vehicular networking is seen as a key technology for improving road safety and building intelligent transportation systems (ITSs) [2]. Many applications of vehicular networks are also emerging, including automatic collision warning, remote vehicle diagnostics, emergency management and assistance for safe driving, vehicle tracking, high-speed Internet access in automobiles, and multimedia content sharing. In the USA, the Federal Communications Commission has allocated 75 MHz of spectrum in the 5.9 GHz band for dedicated short-range communications to support such applications.
Although microscopic-level models can describe individual mobility behaviors precisely, they unfortunately fail to capture the overall mobility in the whole network. In contrast, a macroscopic-level description yields aggregate metrics such as the vehicular distribution, density, and mean velocity by treating vehicular traffic according to fluid dynamics, so that large-scale overall vehicular behaviors and traffic can be easily revealed. Furthermore, such models are indispensable for network dimensioning, answering "what if" questions such as how the network performance changes, or how the deployed network should evolve, as the number of vehicles or the communication demand scales up [7]. Thus, macroscopic-level vehicular mobility models are crucial for the development of vehicular networking protocols and algorithms.
Against this background, in this chapter we consider the problem of modeling macroscopic-level vehicular mobility. Specifically, we explore the use of an open Jackson queueing network to model the vehicular mobility among areas delimited by the intersections of the city roads. In the model, vehicles arrive in the system according to a random process, move from one area to another by making independent probabilistic transitions, and finally depart the system. The question we address is whether this simple queueing network model can accurately describe the vehicular mobility and further predict various measures of network-level performance, such as the vehicular distribution, and of vehicular-level performance, such as the average sojourn time in each area and the number of areas visited. Our novel contributions are an intuitive open Jackson queueing network model of macroscopic-level vehicular mobility, its validation against two large-scale urban vehicular motion traces, and two applications that illustrate its use in system-level performance analysis and the dimensioning of vehicular networks.
The rest of this chapter is organized as follows. After introducing related work in Section 13.2, we give the model motivation and describe the system model in Section 13.3. In Section 13.4, we derive the related system performance metrics based on the proposed model. In Section 13.5, we introduce the vehicular mobility traces used for model validation and provide the validation results, followed by two specific applications of vehicular network performance analysis in Section 13.6. Finally, we conclude the chapter in Section 13.7.
13.3 Model
Figure 13.1 City maps recovered from one day's taxi mobility traces of (a) Beijing and (b) Shanghai
In order to verify that the above data preprocessing approach does not introduce inaccurate information into the original data traces, we use the obtained location data of one day to plot the trajectories of all taxis, shown in Figure 13.1(a) and (b) for the Beijing and Shanghai data, respectively. From these two figures, we can see that our data set is so fine-grained that even one day's data can recover the map of the whole city. To further confirm the accuracy of our data processing, we compare the obtained figures with the original Beijing and Shanghai maps and find that all the trajectories lie on the city roads; the maps drawn from these trajectories are thus very similar to the original city maps.
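As a minimal illustration of this map-recovery check, the sketch below scatters one day of GPS fixes; the file name and column layout (vehicle_id, timestamp, lon, lat) are assumptions for illustration, not the format of the original traces.

# Minimal sketch: recover a city map by scattering one day's GPS fixes.
# Assumes a CSV trace with columns vehicle_id, timestamp, lon, lat
# (file name and column names are illustrative placeholders).
import csv
import matplotlib.pyplot as plt

lons, lats = [], []
with open("taxi_trace_day1.csv", newline="") as f:
    for row in csv.DictReader(f):
        lons.append(float(row["lon"]))
        lats.append(float(row["lat"]))

plt.figure(figsize=(6, 6))
plt.scatter(lons, lats, s=0.1, c="k", marker=".")  # each GPS fix is one dot
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("City map recovered from one day of taxi GPS fixes")
plt.show()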
Figure 13.2 Illustration of the area partition algorithm for a part of Shanghai city
We divide the city into areas around the important road intersections and model the vehicular traffic transiting from one area to another. Taking Figure 13.2 as an example, we first mark the important intersections surrounded by a large number of vehicles as red points and then divide the whole region into areas around the selected intersections. The method for the area partition can be chosen flexibly according to the specific application. For example, if we want to adapt the model for use in vehicular network design, i.e., deploying a roadside unit (RSU) system, then the Voronoi diagram can be used, where each point is assigned to the intersection it is nearest to, which yields the boundary of each area.
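The following is a minimal sketch of the Voronoi-style partition just described: every GPS fix is assigned to its nearest marked intersection, so the implied cell boundaries are exactly the Voronoi diagram of the intersections. The intersection coordinates are illustrative placeholders.

# Minimal sketch of the nearest-intersection (Voronoi) area assignment.
import numpy as np
from scipy.spatial import cKDTree

# (lon, lat) of the manually marked intersections (area "centres");
# the values are illustrative, not taken from the traces.
intersections = np.array([
    [121.445, 31.196],
    [121.451, 31.203],
    [121.459, 31.199],
    [121.463, 31.208],
])
tree = cKDTree(intersections)

def area_of(lon, lat):
    """Return the index of the area (nearest intersection) for a GPS fix."""
    _, idx = tree.query([lon, lat])
    return int(idx)

print(area_of(121.4525, 31.2015))  # index of the nearest intersection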
Let us consider the two-dimensional vehicular mobility defined by the sequence of steps that a vehicle travels on the city roads, modeled by the areas formed by the intersections described above. A step is denoted by a tuple (t_1, t_2, A), where A is the area, t_1 is the time the vehicle enters area A, and t_2 is the time it departs the area. In the first step, the vehicle enters the modeled region through its entering area, and after some steps, it moves out of the modeled region. Every vehicle moves in this way by transiting from one area to another. Thus, we can depict one vehicle's mobility and, by combining all the vehicles and intersections together as one system, describe the traffic flows of the whole system. Now, we are ready to introduce the queueing network that models the above vehicle-mobility scenario.
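A minimal sketch of this step representation is given below; it assumes fixes is one vehicle's time-ordered list of (timestamp, lon, lat) tuples and area_of is any area-assignment function such as the nearest-intersection sketch above (both names are illustrative).

# Minimal sketch: turn one vehicle's time-ordered GPS fixes into the
# (t1, t2, A) steps described above.
from dataclasses import dataclass

@dataclass
class Step:
    t1: float   # time the vehicle enters area A
    t2: float   # time the vehicle leaves area A
    area: int   # area identifier A

def fixes_to_steps(fixes, area_of):
    steps = []
    for t, lon, lat in fixes:
        a = area_of(lon, lat)
        if steps and steps[-1].area == a:
            steps[-1].t2 = t             # still in the same area: extend the step
        else:
            steps.append(Step(t, t, a))  # entered a new area: open a new step
    return steps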
The modeled region thus consists of N areas among which the vehicles transit: vehicles move into the system, move from one area to another, and finally move out of the system. We use a queueing network to model this system, as shown in Figure 13.3. It includes N server nodes with infinite queue size, which model the N partitioned areas of the system. The servers are denoted by the set N = {A_1, A_2, ..., A_N}. The vehicular movement into the system and from one area to another is modeled by the entrance into the queueing network and the transitions from one server to another.
Now, we describe the dynamic behaviors of vehicles moving among the different areas of the system. In such a multi-area vehicular mobility system, the vehicle dynamics occur on two different timescales. One is the long timescale, on which a vehicle may enter and depart the system. The other is the short timescale, on which a vehicle changes areas, i.e., switches from one area to another. From the viewpoint of the queueing network model, vehicles enter the system at a certain rate, stay in a server's queue, and then transfer to another server. For the long-timescale dynamics, we assume that vehicles arrive at server n, n ∈ N, with rate λ_n. When a vehicle moves to area n, it stays in this area for a period of time; we assume that the average amount of time a vehicle stays in area n is 1/μ_n, while the distribution of the staying time is arbitrary. For the short-timescale dynamics, after staying in area n for a random period of time, the vehicle switches to another area m with probability p_nm or leaves the system with probability p_n0. In this way, vehicles move from one area to another and enter or depart the system.
We have thus modeled the vehicular mobility system comprising N areas as an open network of N servers with infinite queues. In such an open system, vehicles freely join and leave the system. The exogenous arrival rate at server n is λ_n. After staying in the queue of server n for an average period of 1/μ_n, a vehicle leaves the queueing network with probability p_n0 or switches to another server m with probability p_nm; we denote the switching matrix by P. The load of server n is therefore ρ_n = λ_n/μ_n.
Figure 13.3 Queueing network model of the vehicular mobility system, with arrival rates λ_n, transition probabilities p_nm, and departure probabilities p_n0
Table 13.1 Key parameters and notations of the model

Notation   Meaning
N          Number of areas (servers) in the system
A_n        The nth area, n = 1, ..., N
λ_n        Exogenous vehicular arrival rate to area n
1/μ_n      Average time a vehicle stays in area n
p_nm       Probability of moving from area n to area m
p_n0       Probability of leaving the system from area n
p_0n       Probability of entering the system through area n
P          Switching (routing) matrix {p_nm}
ρ_n        Load of area n, ρ_n = λ_n/μ_n
W_n        Number of vehicles in area n
Since each server's queue is infinite, each vehicle is served immediately upon arrival, as if there were an infinite number of servers, and the vehicles are independent of each other. The key parameters and their notations of our model are summarized in Table 13.1.
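The parameters in Table 13.1 can be estimated directly from the per-vehicle step sequences. The sketch below is one possible estimator under the assumptions stated in the comments (the names steps_by_vehicle, n_areas, and t_obs are illustrative); it follows the chapter's convention ρ_n = λ_n/μ_n.

# Minimal sketch: estimate the Table 13.1 parameters from step sequences.
# steps_by_vehicle maps a vehicle id to its ordered list of Step objects,
# n_areas is N, and t_obs is the observation window length in seconds.
import numpy as np

def estimate_parameters(steps_by_vehicle, n_areas, t_obs):
    lam = np.zeros(n_areas)                 # exogenous arrival counts per area
    stay_time = [[] for _ in range(n_areas)]
    trans = np.zeros((n_areas, n_areas))    # counts of n -> m transitions
    leave = np.zeros(n_areas)               # counts of departures from area n

    for steps in steps_by_vehicle.values():
        lam[steps[0].area] += 1             # first area = exogenous arrival
        for cur, nxt in zip(steps, steps[1:]):
            stay_time[cur.area].append(cur.t2 - cur.t1)
            trans[cur.area, nxt.area] += 1
        last = steps[-1]
        stay_time[last.area].append(last.t2 - last.t1)
        leave[last.area] += 1               # last area = departure from system

    lam /= t_obs                            # arrivals per second, lambda_n
    mu = np.array([1.0 / np.mean(s) if s else np.inf   # service rate mu_n
                   for s in stay_time])
    total = trans.sum(axis=1) + leave
    P = trans / np.maximum(total, 1)[:, None]           # routing matrix p_nm
    p_n0 = leave / np.maximum(total, 1)                 # leave probability p_n0
    rho = lam / mu                                       # load rho_n = lambda_n / mu_n
    return lam, mu, P, p_n0, rho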
Having modeled the vehicular mobility system as the aforementioned open queueing system, further results follow easily if the system is an open Jackson network, for which well-known results about the user distribution and waiting time exist. To relate our system to a Jackson network, we need to demonstrate that the exogenous arrivals to each server follow a Poisson process. If this holds, the queueing network can be modeled as a network of infinite-server queues (i.e., M/G/∞). Thus, we need to study the properties of the exogenous arrival process of the system. By leveraging the Beijing and Shanghai traces, we find that the inter-arrival times of the actual exogenous arrival process match the exponential distribution well. Thus, the vehicular mobility system can be modeled as an open Jackson network. Based on this model, we will derive some important metrics to depict the system performance.
Figure: traffic flows at a single area A_n, with exogenous arrival rate λ_n, transition probabilities p_nj and p_jn to and from another area A_j, and departure probability p_n0
Furthermore, we obtain the per-area and joint vehicular distribution expressions by the following lemma: the joint distribution of the numbers of vehicles in the areas is

\pi(\vec{w}) = \prod_{j=1}^{N} \frac{\rho_j^{w_j} e^{-\rho_j}}{w_j!},    (13.6)

and the expected number of vehicles that stay in area n in the dynamic vehicular mobility system is ρ_n.
Proof. We consider one area, say n. The vehicle arrival rate to this area is γ_n and the mean service time is 1/μ_n. We view this area as an infinite-server node. Therefore, according to the theory of Jackson networks, for the vehicular mobility system we have

\pi(\vec{w}) = P(W_1 = w_1, \ldots, W_N = w_N) = \prod_{j=1}^{N} \frac{\rho_j^{w_j} e^{-\rho_j}}{w_j!}.    (13.7)

The marginal distribution of the number of vehicles in area n is therefore

P(W_n = w_n) = \frac{\rho_n^{w_n} e^{-\rho_n}}{w_n!}.    (13.8)

We note that the number of vehicles at node n follows a Poisson distribution with mean ρ_n. Therefore, the expected number of vehicles that stay in area n is ρ_n, which proves the lemma.
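A minimal numerical sketch of the lemma: given the loads ρ_n, the marginal distribution (13.8) and the joint distribution (13.7) can be evaluated with standard Poisson routines; the loads below are illustrative.

# Minimal sketch: evaluate the per-area distribution (13.8) and the joint
# distribution (13.7) for given area loads rho_n (illustrative values).
import numpy as np
from scipy.stats import poisson

rho = np.array([2.3, 5.1, 0.8])   # illustrative loads rho_n

def p_marginal(n, w):
    """Marginal P(W_n = w): Poisson with mean rho_n."""
    return poisson.pmf(w, rho[n])

def p_joint(w):
    """Joint P(W_1 = w_1, ..., W_N = w_N): product of the marginals."""
    return float(np.prod([poisson.pmf(wn, rn) for wn, rn in zip(w, rho)]))

print(p_marginal(0, 2))     # probability of seeing 2 vehicles in area 0
print(p_joint([2, 5, 1]))   # joint probability of one particular vector
print(rho)                  # expected number of vehicles in each area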
We define p_0n as the probability that a vehicle moves into the system through area n. From the definitions of p_n0 and p_0n, we obtain

p_{n0} = 1 - \sum_{m=1}^{N} p_{nm}, \quad n = 1, \ldots, N,    (13.9)

p_{0n} = \frac{\gamma_n}{\sum_{m=1}^{N} \gamma_m}, \quad n = 1, \ldots, N.    (13.10)
In order to better distinguish the two kinds of transitions, we refer to P = {p_nm}, 0 ≤ n, m ≤ N, as the system transition matrix on the extended state space M = {0, 1, ..., N}, where state 0 represents the outside of the system, and we denote by R the sub-matrix of P describing the transitions among the areas, R = {p_nm}, 1 ≤ n, m ≤ N.
Now, we obtain the average vehicular sojourn time by the following theorem.

Theorem 13.1. Denote the vehicular sojourn time in the system by S. The average sojourn time E[S] is given by

E[S] = \sum_{n \in \mathcal{N}} p_{0n} E[T_n],    (13.11)

where T_n denotes the remaining sojourn time in the system of a vehicle that is currently in area n.
Proof. We denote by T_n the remaining sojourn time in the system of a vehicle that is currently staying in area n, n ∈ N, i.e., the time until the vehicle moves out of the modeled region. Considering the staying duration in area n and using the Jackson queueing network model, we have

E[T_n] = \frac{1}{\mu_n} + \sum_{m \in \mathcal{N}} p_{nm} E[T_m].    (13.12)

Writing T = [E[T_1], ..., E[T_N]]^T and U = [1/\mu_1, ..., 1/\mu_N]^T, (13.12) can be expressed in matrix form as

T = U + RT,    (13.13)

whose solution is

T = (I - R)^{-1} U.    (13.14)

Averaging E[T_n] over the entry areas with probabilities p_{0n} gives (13.11), which proves the theorem.
Theorem 13.2. The average vehicular mobility length, i.e., the average number of areas a vehicle visits while in the system, denoted by E[L], is given by

E[L] = \sum_{n \in \mathcal{N}} p_{0n} E[L_n],    (13.16)

where L_n denotes the number of areas visited by a vehicle that is currently in area n.

Proof. The sojourn time obtained above measures how long a vehicle stays in the system once it has entered. In (13.13), we can change the sojourn time into the mobility length simply by setting the staying time in each area to 1, i.e., U = [1, ..., 1]^T. Writing L = [E[L_1], ..., E[L_N]]^T, we then obtain

L = (I - R)^{-1} \mathbf{1}.    (13.17)

Averaging over the entry areas with probabilities p_{0n} gives (13.16).
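Theorems 13.1 and 13.2 reduce to solving one linear system. The sketch below computes E[S] and E[L] from an illustrative routing matrix R, service rates μ_n, and entry probabilities p_0n (all values are placeholders, not estimates from the traces).

# Minimal sketch of Theorems 13.1 and 13.2: solve T = U + R T, then
# average over the entry probabilities p_0n; U = 1 gives mobility length.
import numpy as np

R = np.array([[0.0, 0.5, 0.2],     # transition matrix among areas, p_nm
              [0.3, 0.0, 0.4],
              [0.1, 0.2, 0.0]])
mu = np.array([1 / 120.0, 1 / 90.0, 1 / 60.0])   # service rates mu_n (1/s)
p0 = np.array([0.5, 0.3, 0.2])                   # entry probabilities p_0n

I = np.eye(len(mu))
U = 1.0 / mu                                     # mean staying times 1/mu_n
T = np.linalg.solve(I - R, U)                    # E[T_n], eq. (13.14)
E_S = p0 @ T                                     # average sojourn time, (13.11)

L = np.linalg.solve(I - R, np.ones(len(mu)))     # eq. (13.17) with U = 1
E_L = p0 @ L                                     # average mobility length, (13.16)

print(E_S, E_L)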
Figure 13.5 Average vehicular arrival rate to the system over the timescale of one day for the Shanghai and Beijing traces
If a vehicle does not have records in any of the areas for a period of 10 min, we assume that it has departed from the system, and we treat it as a new vehicle moving into the system when it appears again.
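A minimal sketch of this departure rule, assuming timestamps holds one vehicle's sorted record times in seconds (the name and the session-splitting helper are illustrative):

# Minimal sketch: if a vehicle has no record for 10 minutes it is treated
# as having left the system, and its next fix starts a new arrival.
GAP = 10 * 60  # 10 minutes in seconds

def split_into_visits(timestamps, gap=GAP):
    visits, current = [], [timestamps[0]]
    for prev, t in zip(timestamps, timestamps[1:]):
        if t - prev > gap:
            visits.append(current)   # vehicle departed; close this visit
            current = []
        current.append(t)
    visits.append(current)
    return visits

print(len(split_into_visits([0, 100, 200, 900, 950])))  # 2 visits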
Figure 13.6 CCDF of the aggregated vehicular arrival time (s) with exponential fittings for the Beijing and Shanghai traces
The close agreement between the empirical curves and the exponential fittings verifies the accuracy of the exponential distribution of the aggregated arrival time, which indicates that the aggregated arrivals follow a Poisson process.
Second, in order to further validate that the exogenous arrivals to each area follow a Poisson process, we investigate the distribution of the exogenous inter-arrival times of each area on the timescale of one day and select the first 15 days for study. To measure the closeness between the Poisson model and the empirical distributions, we use the Kolmogorov–Smirnov (KS) test instead of CCDF fitting, owing to the large number of curves across the areas and the 15 days. The KS statistic quantifies the distance between the empirical distribution function of the sample and the cumulative distribution function of the theoretical distribution [26]; the smaller the KS statistic, the closer the two distributions are. In our study, we set the significance level [26] of the KS test to 0.01, which corresponds to a confidence level of 99%. Figure 13.7 shows the goodness of fit measured by the acceptance ratio of the KS tests for each day, averaged over all areas. From the results, we observe that the acceptance ratio of the Beijing trace is above 90% except on the second day, which has relatively fewer vehicle mobility records. With regard to the Shanghai trace, we also note a good match between the model distribution and the empirical results; the average acceptance ratio is around 80%, which means the overall accuracy of the Poisson model is about 80%. Combining the results of Shanghai and Beijing, we conclude that the exogenous arrivals to each area can be accurately modeled as a Poisson process.
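The per-area validation can be reproduced with a standard KS test against an exponential distribution fitted by its mean, at significance level 0.01. The sketch below uses synthetic inter-arrival times as a placeholder for the per-area trace data.

# Minimal sketch of the KS check described above: test whether a sample
# of exogenous inter-arrival times is consistent with an exponential
# distribution at significance level 0.01 (synthetic placeholder data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
inter_arrivals = stats.expon.rvs(scale=12.0, size=500, random_state=rng)

scale = inter_arrivals.mean()                    # fit the exponential by its mean
result = stats.kstest(inter_arrivals, "expon", args=(0, scale))

accepted = result.pvalue > 0.01                  # 99% confidence level
print(f"KS statistic = {result.statistic:.3f}, "
      f"p-value = {result.pvalue:.3f}, accepted = {accepted}")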
We have now completed the validation of the Poisson model for the exogenous arrivals. Next, we focus on validating the results for the vehicular distribution, the sojourn time, and the mobility length, which are the important metrics obtained from the open Jackson queueing network-based vehicular mobility model.
Figure 13.7 Passing ratio of the KS tests for the vehicular arrival process on the timescale of one day for the Shanghai and Beijing traces
Figure 13.8 Vehicular distribution at six intersections in the Beijing trace, where the red dotted curves are the empirical results obtained from the trace and the blue solid curves are the theoretical results obtained by our proposed model
The closeness between the empirical and theoretical distributions is quantified by the adjusted R-square statistic computed with the MATLAB® Curve Fitting Toolbox. It can be seen from Figure 13.10 that the average adjusted R-square statistics of over 90% of the areas in the Shanghai trace are larger than 98%, and those of over 90% of the areas in the Beijing trace are larger than 95%. This confirms the accuracy of the model-based prediction of the vehicular distribution.
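The chapter computes the adjusted R-square with the MATLAB® Curve Fitting Toolbox; the sketch below is a rough Python equivalent under the assumption that the statistic is computed between an empirical vehicle-count PDF and the model's Poisson PDF, with synthetic illustrative values.

# Minimal sketch: adjusted R-square between an empirical vehicle-count PDF
# and the Poisson PDF predicted by the model (illustrative values only).
import numpy as np
from scipy.stats import poisson

counts = np.arange(0, 21)
noise = 0.005 * np.random.default_rng(1).standard_normal(counts.size)
empirical_pdf = poisson.pmf(counts, 6.0) + noise   # "measured" distribution
model_pdf = poisson.pmf(counts, 6.2)               # model prediction with load rho_n

ss_res = np.sum((empirical_pdf - model_pdf) ** 2)
ss_tot = np.sum((empirical_pdf - empirical_pdf.mean()) ** 2)
n, p = counts.size, 1                  # sample size and number of fitted parameters
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj_r2:.4f}")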
Figure 13.9 Vehicular distribution at six intersections in the Shanghai trace, where the red dotted curves are the empirical results obtained from the trace and the blue solid curves are the theoretical results obtained by our proposed model
Figure 13.10 CCDF of the adjusted R-square statistics of the per-area vehicular distribution fits for the Beijing and Shanghai traces
Table 13.2 Predicted and empirical results of average sojourn time and mobility
length of Shanghai trace
To test how the accuracy of the proposed model scales, we vary the number of involved vehicles. For both the Shanghai and Beijing traces, we sort the vehicles according to the number of positions recorded in the GPS trace. We first select the vehicles that have records for at least 80% of the trace collection time, and then put more and more vehicles into the system for testing. For the Shanghai trace, we select 1,000 vehicles (those with at least 80% of the records), 3,000 vehicles, and 4,441 vehicles, the total number of vehicles in the trace. For the Beijing trace, we set the number of vehicles to 3,000, 6,000, 10,000, and 28,590.
Table 13.3 Predicted and empirical results of average sojourn time and mobility length of Beijing trace

The predicted and empirical results for the average sojourn time and mobility length under the Shanghai and Beijing traces are shown in Tables 13.2 and 13.3, respectively. From the results for the average sojourn time, we find that the predicted results match the empirical results very well for both traces. In particular, when we only use vehicles with more complete records, the predicted results are very close to the empirical results. For example, in the Shanghai trace, the average deviation between the predicted and empirical results is only 5.7% when the number of vehicles is 1,000, while in the Beijing trace it is only 1.8% when the number of vehicles is 3,000. As the number of vehicles increases, more and more vehicles with imperfect records are included, which introduces some errors into the system and the model, but the accuracy of our model remains acceptable; for example, the prediction deviations are still within 6.9% and 3.4% when the numbers of vehicles are 4,441 and 28,590 in the Shanghai and Beijing traces, respectively. In terms of the average mobility length, the predicted results comply with the empirical results almost completely. Consequently, we conclude that our model is accurate enough to capture the vehicular mobility and to obtain the average, steady-state system performance.
13.6 Applications

In vehicular networks, faced with the random and bursty data traffic generated by vehicles, RSUs serve as the gateways to the Internet and to the infrastructure of other systems, such as the ITS. Vehicles transmit their Internet access requests and information to the RSUs, and the RSUs then query the Internet for the data and information needed by the vehicles. Therefore, deploying RSUs appropriately is critical to the performance of vehicular networks. On the one hand, the capacity and the number of deployed RSUs determine the capacity and service that can be provided to the vehicular network. On the other hand, a large number of RSUs deployed with large capacity means a higher infrastructure cost. Therefore, the decision on RSU deployment should depend on the demands of the vehicles. In a large urban city, it is very difficult to make such decisions due to the dynamics of the vehicular traffic and the randomness of vehicle mobility. However, based on our proposed vehicular mobility model, we can obtain some fundamental results on the relationship between the RSU capacity and the network performance. Using the proposed queueing network-based vehicular mobility model, we will analyze how much RSU capacity should be provided as the communication demand rises with the increasing number of urban vehicles.
In reality, owing to the infrastructure cost, it is difficult to cover the roads with enough RSUs so that each vehicle on the road can always be connected to a nearby RSU.
Figure: percentage of overloaded RSUs versus the RSU capacity C, for vehicular arrival rates λ, 2λ, 3λ, 4λ, and 5λ
In the second application, we evaluate the probability that the whole network satisfies the vehicles' communication demands and the average number of areas that satisfy the communication demands with V2I and V2V communications.
We say that an area is satisfied when every vehicle in this area can receive its demanded data rate. When the network is in the state of satisfying communication, all vehicles in the network are satisfied. For a given vehicular network, it is hard for the network to enjoy satisfying communication all the time. Therefore, the steady-state probability that the network is in satisfying communication is an important metric for evaluating the performance of the vehicular network. Another important metric is the expected number of areas that enjoy satisfying communication. We now give more precise definitions of these two metrics.
The communication capacity index of area n, denoted by \Phi_n(W_n), is defined as

\Phi_n(W_n) = \frac{c_n + p_n}{d_n(W_n)},    (13.19)

where W_n is the number of vehicles in area n, c_n is the communication capacity of the RSU deployed in area n, p_n is the capacity of the V2V communications in this area, and d_n(W_n) is the communication demand of the vehicles in the area.
Based on \Phi_n(W_n), area n enjoys satisfying communication when \Phi_n(W_n) ≥ 1, and the corresponding area satisfying probability is AS_n = P(\Phi_n(W_n) ≥ 1). The expected number of areas enjoying satisfying communication is then

NS = \sum_{n=1}^{N} P(\Phi_n(W_n) \ge 1).    (13.22)
Now, we consider area n and calculate the area satisfying probability AS_n. Here, c_n is the capacity of the RSU deployed at the center of this area, and p_n is the V2V communication capacity, which depends on the number of vehicles in the area. We assume that each vehicle i can offer a capacity of u_i, so that p_n = \sum_{i \in W_n} u_i. Assuming that each vehicle in area n requires a communication capacity of r_n, we have d_n(W_n) = W_n r_n. Hence, AS_n can be expressed as

AS_n = P\left(c_n + \sum_{i \in W_n} u_i \ge W_n r_n\right).    (13.23)
Suppose the vehicles in area n are of two types: ordinary vehicles of type j, each offering capacity u_j, and vehicles of type k, each of which has a larger capacity u_k. Let W_n^j and W_n^k denote the numbers of type-j and type-k vehicles in area n; by (13.8), they are Poisson distributed with means ρ_n^j and ρ_n^k, respectively. The satisfying probability of area n, AS_n, is then given by

AS_n = P\left(c_n + \sum_{i \in W_n} u_i \ge W_n r_n\right)
= P\left(c_n + u_j W_n^j + u_k W_n^k \ge (W_n^j + W_n^k) r_n\right)
= \sum_{w_n^j=0}^{\infty} \sum_{w_n^k=0}^{\infty} P(W_n^j = w_n^j)\, P(W_n^k = w_n^k)\, \mathbf{1}\left\{c_n + u_j w_n^j + u_k w_n^k \ge (w_n^j + w_n^k) r_n\right\}
= \sum_{0 \le (r_n - u_j) w_n^j + (r_n - u_k) w_n^k \le c_n} \frac{(\rho_n^j)^{w_n^j}}{w_n^j!} \frac{(\rho_n^k)^{w_n^k}}{w_n^k!}\, e^{-\rho_n^j - \rho_n^k}.    (13.24)
Summing over all areas, the expected number of areas enjoying satisfying communication is

NS = \sum_{n=1}^{N} P(\Phi_n(W_n) \ge 1) = \sum_{n=1}^{N} AS_n.    (13.26)
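Equation (13.24) can be evaluated numerically by truncating the two Poisson sums; summing the resulting AS_n over the areas gives NS as in (13.26). The sketch below does this under illustrative parameter values (the capacity c_n, per-vehicle offers u_j and u_k, demand r_n, and loads are all placeholders, not values from the chapter).

# Minimal sketch: evaluate AS_n of (13.24) by truncating the Poisson sums,
# then sum over areas to get NS as in (13.26). Values are illustrative.
import numpy as np
from scipy.stats import poisson

def area_satisfying_prob(c_n, u_j, u_k, r_n, rho_j, rho_k, w_max=200):
    """P(c_n + u_j*W_j + u_k*W_k >= (W_j + W_k) * r_n), W_j, W_k ~ Poisson."""
    wj = np.arange(w_max + 1)
    wk = np.arange(w_max + 1)
    pj = poisson.pmf(wj, rho_j)
    pk = poisson.pmf(wk, rho_k)
    WJ, WK = np.meshgrid(wj, wk, indexing="ij")
    satisfied = c_n + u_j * WJ + u_k * WK >= (WJ + WK) * r_n
    return float(np.sum(np.outer(pj, pk) * satisfied))

# One RSU of capacity c_n per area; ordinary vehicles offer u_j, a few
# high-capacity vehicles offer u_k, and each vehicle demands r_n.
AS = [area_satisfying_prob(c_n=2000, u_j=50, u_k=400, r_n=300,
                           rho_j=rho_j, rho_k=0.5)
      for rho_j in (3.0, 6.0, 9.0)]
NS = sum(AS)          # expected number of satisfied areas, eq. (13.26)
print(AS, NS)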
Figure: area satisfying probability AS_n versus the demand of communication rate, for area loads ρ, 3ρ, 5ρ, 7ρ, and 9ρ
Figure 13.14 Network satisfying probability PS versus the demand of communication rate, for area loads ρ, 3ρ, 5ρ, 7ρ, and 9ρ
As the demand of communication rate grows, the area satisfying probability AS_n decreases; the larger the area load, the sharper the decrease. Based on these results, we can decide how to deploy the RSU equipment according to the performance curves and the specific requirements. In terms of the network-wide performance, Figures 13.14 and 13.15 show the results for PS and NS.
Figure 13.15 Expected number of satisfied areas NS versus the demand of communication rate, for area loads ρ, 3ρ, 5ρ, 7ρ, and 9ρ
With the increase in the average demand and the area load, both PS and NS decrease. Based on these results, we can design the network system according to the requirements and decide how to deploy the infrastructure and the RSU devices supporting V2V communications.
13.7 Conclusions
In this chapter, we used an open Jackson queueing network to model macroscopic-level vehicular mobility. The proposed simple model can accurately describe vehicular mobility and predict various measures of network-level and vehicular-level performance. Based on two large-scale urban vehicular motion traces, we validated the accuracy of the proposed model. Finally, we presented two applications as examples to illustrate the effectiveness of the proposed model in the analysis of system-level performance and the dimensioning of vehicular networks.
References
[1] Khabazian M, Aissa S, and Mehmet-Ali M. Performance modeling of message
dissemination in vehicular ad hoc networks with priority. IEEE Journal on
Selected Areas in Communications. 2011;29(1):61–71.
[2] Dimitrakopoulos G, and Demestichas P. Intelligent transportation systems.
IEEE Vehicular Technology Magazine. 2010;5(1):77–84.
[18] Kelly FP. Networks of queues with customers of different types. Journal of
Applied Probability. 1975;12(3):542–554.
[19] Menasche DS, Rocha AA, Li B, et al. Content availability and bundling
in swarming systems. In: Proceedings of the 5th International Confer-
ence on Emerging Networking Experiments and Technologies. ACM; 2009.
p. 121–132.
[20] Ashtiani F, Salehi JA, and Aref MR. Mobility modeling and analytical solution
for spatial traffic distribution in wireless multimedia networks. IEEE Journal
on Selected Areas in Communications. 2003;21(10):1699–1709.
[21] Kim K, and Choi H. A mobility model and performance analysis in wire-
less cellular network with general distribution and multi-cell model. Wireless
Personal Communications. 2010;53(2):179–198.
[22] Li M, Zhu H, Zhu Y, et al. ANTS: Efficient vehicle locating based on
ant search in ShanghaiGrid. IEEE Transactions on Vehicular Technology.
2009;58(8):4088–4097.
[23] Kemeny JG, and Snell JL. Markov Chains. New York: Springer-Verlag; 1976.
[24] Kise K, Sato A, and Iwata M. Segmentation of page images using the
area Voronoi diagram. Computer Vision and Image Understanding. 1998;
70(3):370–382.
[25] Schermelleh-Engel K, Moosbrugger H, and Müller H. Evaluating the fit of structural equation models: tests of significance and descriptive goodness-of-fit measures. Methods of Psychological Research Online. 2003;8(2):23–74.
[26] Zhang G, Wang X, Liang YC, et al. Fast and robust spectrum sens-
ing via Kolmogorov–Smirnov test. IEEE Transactions on Communications.
2010;58(12):3410–3416.
[27] Câmara D, Frangiadakis N, Filali F, et al. Vehicular Delay Tolerant Networks.
Handbook of Research on Mobility and Computing: Evolving Technologies
and Ubiquitous Impacts. IGI Global; 2011. p. 356–367.
Index
Intel 5300 NIC 344–5, 349–51, 353, 355–6, 359
intelligent transportation system (ITS) 228, 240, 431–2, 451
inter-contact time 434
intra-vehicle communications 228
intrinsic mode functions (IMFs) 181
in-vehicle communication 229
inverse discrete Fourier transform (IDFT) 142
iterative algorithm 372–3
iterative hard thresholding (IHT) 124
Jackson queueing network model 433, 439, 443, 457
Jensen's inequality 151
Kalman filter-based tracking 91–2
Kalman filters 69
Keras 148
kernel k-means 22–3
Kernel-power-density (KPD)-based clustering 78–81
k-means 21–3, 40–1
k-nearest neighbours method 2–4, 19, 119, 344
Kolmogorov–Smirnov test (KS test) 446
KPowerMeans-based clustering 68, 73, 84, 102
  clustering 73–4
  cluster pruning 74–5
  development 75–6
  validation 74
kriging-based techniques 110
Kuhn–Munkres algorithm 33, 69, 95
Kullback–Leibler (KL) divergence 310
Lagrange multiplier 123, 266, 269, 294, 324, 400, 403–4
Laplacian Kernel density 78
learning-based reconstruction algorithms 109, 111
  batch algorithms 111–24
  online algorithms 124–5
learning channel access control protocols 241–2
least square (LS) 139
Levenberg–Marquardt algorithm 116
LightGBM 9
linear minimum mean square error (LMMSE) estimator 140
line-of-sight (LOS)/non-line-of-sight (NLOS) scenarios 68, 70–2
location estimation 344, 360–2
logistic regression 12–14, 19
long short-term memory (LSTM) 348
macrocell base station (MBS) 407, 412, 418
macrocell users (MU) 410, 412, 414, 425–6
Manhattan distance 4
Markov decision process (MDP) 42–5, 242, 371, 376, 378
  basic components of 378–81
  finite-horizon MDP 381–2
  infinite-horizon MDP
    with average cost 392–4
    with discounted cost 387–9
  multi-carrier power allocation with random packet arrival 389–92
matching pursuit (MP) 205–6
matrix completion 116
  alternating projection (AP) methods 121–4
  nuclear norm minimization-based methods 117–21
maximum likelihood estimates (MLEs) 149, 161–3, 171, 173
max-pooling 17–18
McCulloch–Pitts (MP) model 135
  MP neuron model 140–1
mean squared error (MSE) 148
medium access control (MAC) 216–17, 226–7, 234, 238–41, 254, 371, 377
MicaZ sensor platform 212
Middleton Class A model 193
protocol performance 251
  effect of data rate 254–5
  effect of increased network density 252–4
  effect of multi-hop 255–6
  simulation setup 251–2
Python environment 144
Q-learning 49–50, 55, 242–3, 401–4, 417–18
  MAC protocol 243
  action selection dilemma 243
  a priori approximate controller 244–6
  convergence requirements 244
  implementation details 247–8
  online controller augmentation 246–7
  procedure 418
  densely deployed scenario 419
  distributed Q-learning algorithm 419
  sparsely deployed scenario 418
Q-learning-based power control in small-cell networks 407
  noncooperative game theoretic solution 414–15
  proposed BDb-WFQA based on NPCG 420–1
  simulation and analysis 422
  simulation for BDb-WFQA algorithm 424–6
  simulation for Q-learning based on Stackelberg game 422–3
  Stackelberg game framework 416–17
  system model 411
  effective capacity 413–14
  problem formulation 414
  system description 411–12
Quadrature Phase Shift Keying (QPSK) modulation 359
quality of service (QoS) 238, 407
query by committee (QbC) 117–20, 126, 131
Q-value function 43, 47, 49, 51, 53–5
  approximation 55–6
radar sensors 228
radial basis function (RBF) 72, 114, 345
  RBF-based neural network 99–101
  RBF kernel function 72
radio resource management (RRM) 371
radio waves 67, 126
random forest (RF) 7–8, 20
rate–quantization (R–Q) model 265
Rayleigh fading 216, 385
received signal strength (RSS) 217, 343
received signal strength indicator (RSSI) 347
rectified linear unit (ReLU) activation function 14, 143, 347
recurrent neural network (RNN) 343
recursive Taylor expansion (RTE) method, minimizing perceptual distortion with 267
  bit reallocation for maintaining optimization 274–5
  optimization formulation on perceptual distortion 269–70
  rate control implementation on HEVC-MSP 267–8
  for solving the optimization formulation 270–4
reduced-dimension multiple access 216–17
  compressive data gathering 213
  multi-hop communications, WSNs with 214
  single hop communications, WSNs with 213–14
  robust data transmission 211–13
RefineNet 147–8
region-of-interest (ROI) 262
regression task 2, 19
reinforcement learning (RL) 41, 53–6, 394–6
  deep reinforcement learning 50
  policy gradient methods 51–3