Two Applications of Deep Learning in the Physical Layer of Communication Systems


Emil Björnson and Pontus Giselsson

arXiv:2001.03350v1 [cs.IT] 10 Jan 2020

Deep learning has proved itself to be a powerful tool to develop data-driven signal processing algorithms
for challenging engineering problems. By learning the key features and characteristics of the input signals,
instead of requiring a human to first identify and model them, learned algorithms can beat many man-made
algorithms. In particular, deep neural networks are capable of learning the complicated features in
nature-made signals, such as photos and audio recordings, and use them for classification and decision
making.
The situation is rather different in communication systems, where the information signals are man-made,
the propagation channels are relatively easy to model, and we know how to operate close to the
Shannon capacity limits. Does this mean that there is no role for deep learning in the development of
future communication systems?

I. RELEVANCE

The answer to the question above is “no” but for the aforementioned reasons, we need to be careful not
to reinvent the wheel. We must identify the right problems to tackle with deep learning and, even then,
not start from a blank sheet of paper. There are many signal processing problems in the physical layer of
communication systems that we already know how to solve optimally, for example, using well-established
estimation, detection, and optimization theory. Nonetheless, there are also important practical problems
where we lack acceptable solutions, for example, due to a lack of appropriate models or algorithms. In
this lecture note, we first introduce the key properties of artificial neural networks and deep learning.
The focus is not on technicalities around the training process or choice of network structure, but on what
we can practically achieve, assuming the training is carried out successfully. We will then describe three
application categories in communication engineering, whereof one exposes some fundamental weaknesses
of deep learning and two illustrate important advances that can be made by utilizing deep learning.

II. PREREQUISITES

This lecture note requires basic knowledge of linear algebra, digital communications, and probability.

E. Björnson is with Linköping University, Sweden. P. Giselsson is with Lund University, Sweden. This work was partially supported by
the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

[Figure 1: (a) An arbitrary gray box taking $x_0$ as input and giving $y$ as output. (b) A fully-connected
feed-forward network with four layers ($L = 3$) that fits into the box in (a): an input layer, hidden
layer 1, hidden layer 2, and an output layer, connected by $\hat{f}_1(x_0; \theta_1)$,
$\hat{f}_2(x_1; \theta_2)$, and $\hat{f}_3(x_2; \theta_3)$.]

Fig. 1. The gray-box input-output model in (a) is characterized by $\hat{f}$ and a parameter vector $\theta$.
It is called an artificial neural network if $\hat{f}$ has a particular structure, such as the one illustrated in (b).

III. PRELIMINARIES: ARTIFICIAL NEURAL NETWORKS AS FUNCTION APPROXIMATORS

Consider a system that takes an $n_0$-length input vector $x_0 \in \mathbb{R}^{n_0}$ and produces a $k$-length
output vector $y \in \mathbb{R}^{k}$, as illustrated in Fig. 1(a). The output is determined by the input via a
deterministic function $\hat{f}$:

$$y = \hat{f}(x_0; \theta). \tag{1}$$

The function is fixed but is characterized by an $m$-dimensional parameter vector $\theta \in \mathbb{R}^{m}$. Many
different input-output relations can be modeled in this way by changing the parameter vector $\theta$, but they
all share an underlying structure determined by the initial choice of $\hat{f}$. This is called a gray-box model.
When the function $\hat{f}$ is selected to resemble the biological neural networks in human brains, the gray
box is called an artificial neural network. The input vector $x_0$ is then viewed as the values in $n_0$ neurons
from which the function $\hat{f}$ produces the values of $y$ in $k$ other neurons. There are many different examples
of this. The classical one is a fully-connected feed-forward network, which is illustrated in Fig. 1(b). In
this case, $\hat{f}$ is a composition of $L$ functions, $\hat{f}_1, \ldots, \hat{f}_L$, which describe transitions between
neurons in an input layer to neurons in an output layer via $L - 1$ intermediate “hidden” layers. $L$ characterizes
how deep the network is. The function $\hat{f}_l$ is determined by the parameters $\theta_l = \{W_l, b_l\}$ and modeled as

$$\hat{f}_l(x_{l-1}; \theta_l) = \sigma_l(W_l x_{l-1} + b_l), \tag{2}$$



where $W_l \in \mathbb{R}^{n_l \times n_{l-1}}$ is called a weight matrix, $b_l \in \mathbb{R}^{n_l}$ is called a bias vector,
and $\sigma_l : \mathbb{R}^{n_l} \to \mathbb{R}^{n_l}$ is an element-wise non-linear function that is called an activation
function. With inspiration from the structure of the human brain, the function $\hat{f}_l$ can be interpreted as
taking the values $x_{l-1}$ in the $n_{l-1}$ neurons of layer $l - 1$, mixing the values together according to the
affine transition relation $W_l x_{l-1} + b_l$, and finally applying the activation function $\sigma_l$ to determine
the values of the $n_l$ neurons of layer $l$.
If there are four layers as in Fig. 1(b), then $L = 3$ and the complete input-output relation is

$$y = \hat{f}_3\big(\hat{f}_2\big(\hat{f}_1(x_0; \theta_1); \theta_2\big); \theta_3\big). \tag{3}$$

Hence, the composite function $\hat{f}$ is determined by the parameter vector $\theta$ containing the
$\sum_{l=1}^{L} n_l (n_{l-1} + 1)$ parameter values from $\theta_1, \theta_2, \theta_3$ (i.e., the weights and biases
from all layers).
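
To make the composition in (3) and the parameter count concrete, the following is a minimal NumPy sketch. The layer sizes, the random initialization, and the choice of tanh as activation function are assumptions made for illustration only; they are not taken from the text.

```python
import numpy as np

def init_params(layer_sizes, rng):
    """Random weights W_l and zero biases b_l for sizes [n_0, n_1, ..., n_L]."""
    return [(rng.normal(size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x0, params, activation=np.tanh):
    """Evaluate the composition in (3): x_l = sigma_l(W_l x_{l-1} + b_l)."""
    x = x0
    for W, b in params:
        x = activation(W @ x + b)
    return x

def num_parameters(layer_sizes):
    """Count sum_{l=1}^{L} n_l (n_{l-1} + 1), i.e., all weights and biases."""
    return sum(n_out * (n_in + 1)
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

rng = np.random.default_rng(0)
layer_sizes = [2, 5, 5, 4]            # assumed n_0, n_1, n_2, n_3, so L = 3
params = init_params(layer_sizes, rng)
y = forward(np.array([0.3, -0.7]), params)
print(y.shape, num_parameters(layer_sizes))
```

The count returned by `num_parameters` matches the total number of entries stored in `params`, which is the length $m$ of the parameter vector $\theta$.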
Artificial neural networks are generally used to approximate other functions, by selecting the parameter
vector θ to somehow minimize the approximation error. In particular, the category of fully-connected
feed-forward networks is capable of approximating any continuous function arbitrarily well by utilizing a
(possibly) large but finite number of parameters (and neurons) [1]. This important result can be viewed as
a generalization of Taylor polynomial approximations to functions with vector inputs and vector outputs.
Two other categories are convolutional neural networks and recurrent neural networks [2]. Each category
is believed to be better at approximating certain types of functions, in the sense of requiring fewer
parameters to achieve a certain approximation error and/or it being easier to find appropriate parameter
values in practice. Selecting the right category is important but beyond the scope of this lecture note.

A. Supervised Training of a Neural Network

The parameter vector of an artificial neural network can be tuned/trained to approximate a (possibly
unknown) function that we call $f$; that is, $\hat{f}$ should be trained to become a good estimate of $f$. This is
preferably done by supervised learning using a set of $T$ training examples consisting of input vectors
$x_t^{\mathrm{train}}$ and the corresponding output vectors $y_t^{\mathrm{train}} = f(x_t^{\mathrm{train}})$ that we want the
neural network to reproduce, for $t = 1, \ldots, T$. Let us represent these training examples as the columns of
two matrices:

$$X^{\mathrm{train}} = \begin{bmatrix} x_1^{\mathrm{train}} & \ldots & x_T^{\mathrm{train}} \end{bmatrix}, \tag{4}$$

$$Y^{\mathrm{train}} = \begin{bmatrix} y_1^{\mathrm{train}} & \ldots & y_T^{\mathrm{train}} \end{bmatrix}. \tag{5}$$

The inputs should ideally be selected independently at random from the distribution of inputs that appears
when using $f$ in reality. The training basically consists of finding the parameter $\theta^*$ that minimizes a loss
function $\ell$ that measures the approximation mismatch:

$$\theta^* = \arg\min_{\theta} \, \ell\big(\theta, X^{\mathrm{train}}, Y^{\mathrm{train}}\big). \tag{6}$$

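For example, the loss can be measured in the mean-squared sense as

$$\ell\big(\theta, X^{\mathrm{train}}, Y^{\mathrm{train}}\big) = \frac{1}{T} \sum_{t=1}^{T} \big\| y_t^{\mathrm{train}} - \hat{f}(x_t^{\mathrm{train}}; \theta) \big\|^2. \tag{7}$$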
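
As a small illustration of (7), the sketch below evaluates the mean-squared loss with the training examples stored as matrix columns, as in (4) and (5). The single affine layer `f_hat` and the synthetic data are hypothetical stand-ins chosen only to keep the example short.

```python
import numpy as np

def mse_loss(f_hat, theta, X_train, Y_train):
    """Loss in (7): average squared norm of the error over the T columns."""
    T = X_train.shape[1]
    total = 0.0
    for t in range(T):
        err = Y_train[:, t] - f_hat(X_train[:, t], theta)
        total += np.sum(err ** 2)
    return total / T

def f_hat(x, theta):
    """Hypothetical model: a single affine layer without activation."""
    W, b = theta
    return W @ x + b

rng = np.random.default_rng(1)
X_train = rng.normal(size=(3, 100))        # T = 100 inputs as columns
W_true = rng.normal(size=(2, 3))           # assumed ground-truth mapping f
Y_train = W_true @ X_train                 # corresponding outputs as columns

loss_exact = mse_loss(f_hat, (W_true, np.zeros(2)), X_train, Y_train)
loss_off = mse_loss(f_hat, (np.zeros((2, 3)), np.zeros(2)), X_train, Y_train)
print(loss_exact, loss_off)
```

The loss is essentially zero when $\theta$ reproduces the mapping that generated the data and grows as the parameters move away from it, which is what the minimization in (6) exploits.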

The goal is that the trained neural network $\hat{f}(x_0; \theta^*)$ will provide approximately the right outputs not
only for the training examples, but for any input signal $x_0$ generated in the same way. This desired property
is called generalization. Intuitively, if the unknown function $f$ is continuous and has limited variability,
we should be able to approximate it well from a large training set. We can once again make a parallel to
polynomial approximations; any scalar polynomial of order $T - 1$ is uniquely determined by $T$ samples
(training examples) of the inputs and outputs. If the polynomial order is unknown, or if the function is
only approximately polynomial, we need a larger number of samples to ensure a good approximation.
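
The polynomial parallel can be checked numerically. In this sketch, a cubic polynomial (order $T - 1$ with $T = 4$) is recovered exactly from four samples by solving a Vandermonde system; the coefficients and sample locations are arbitrary values assumed for illustration.

```python
import numpy as np

T = 4
coeffs_true = np.array([2.0, -1.0, 0.5, 3.0])    # assumed cubic, highest power first
x_train = np.array([-1.0, 0.0, 1.0, 2.0])        # T distinct input samples
y_train = np.polyval(coeffs_true, x_train)       # corresponding outputs

# Vandermonde matrix with columns x^3, x^2, x, 1; square because there are T samples
V = np.vander(x_train, T)
coeffs_hat = np.linalg.solve(V, y_train)

# The recovered polynomial also agrees at an input that was not in the training set
x_new = 5.0
print(np.polyval(coeffs_hat, x_new), np.polyval(coeffs_true, x_new))
```

With fewer than $T$ samples the system would be underdetermined, which mirrors the need for sufficiently many training examples.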
Since the training in (6) is a complicated non-convex optimization problem, huge efforts have been
dedicated to finding computationally and performance-wise acceptable suboptimal solutions. Moreover, the
generalization to unseen inputs can be improved by various regularizations, hyper-parameter choices, and
network designs [2]. Such empirical craftsmanship is not the focus of this lecture note, but we conclude:
1) Artificial neural networks can approximate any continuous function.
2) The supervised training requires a large training set with inputs/outputs to achieve a low
approximation error.

IV. FIRST EXAMPLE: SIGNAL DETECTION

The physical layer of a communication system determines how an information-bearing signal is sent
from the transmitter to the receiver over a physical channel. A critical task is the signal detection, where
the receiver tries to identify what information was sent. To describe some key properties of deep learning,
we will exemplify how it can be used for signal detection.
We consider a classical additive white Gaussian noise (AWGN) channel, where a two-dimensional
signal vector $s \in \mathbb{R}^2$ is sent. The received signal $r \in \mathbb{R}^2$ is given by

$$r = s + n, \tag{8}$$

where $n \sim \mathcal{N}(0, \sigma^2 I)$ is an independent Gaussian noise vector where the entries have variance $\sigma^2$.
We assume two bits of information are encoded into $s$ using a quadrature phase-shift keying (QPSK)
constellation. Hence, there are four possible signal points that are equally spaced on the unit circle:

$$s \in \left\{ \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}, \begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix}, \begin{bmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}, \begin{bmatrix} -1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix} \right\}. \tag{9}$$
The mapping between information bits and signals is illustrated in Fig. 2(a). Due to the additive noise,
the received signal $r$ can take any value, but the Gaussian distribution makes values close to one of the
signal points in (9) more likely than values far away. This can be seen from the red dots in Fig. 2(a),
which represent $r$ for 10,000 noise realizations with $\sigma^2 = 0.2$ that are added to each signal point.

[Figure 2: (a) Quadrature phase-shift keying for information encoding and the corresponding received signals.
Both axes span -1.5 to 1.5; the signal points are labeled 01 (upper left), 11 (upper right), 00 (lower left),
and 10 (lower right), and one red dot is marked "A received signal r". (b) Detection regions with a trained
neural network. (c) Optimal detection regions using detection theory. Panels (b) and (c) each show four
regions labeled "Detection: 01", "Detection: 11", "Detection: 00", and "Detection: 10".]

Fig. 2. We send QPSK signals over an AWGN channel, as shown in (a), and try to detect the signals at
the receiver. The detection regions produced by a trained neural network are shown in (b) and the optimal
regions obtained from detection theory are shown in (c).

Based on the received signal r, the receiver needs to guess (detect) what signal s was sent. We have
trained a neural network for this task, by taking the received signal x0 = r as input and letting the output
y be a four-dimensional vector that is one for the detected signal and has zeroes elsewhere. We used the
40,000 red dots in Fig. 2(a), and the signals s that generated these r, to train a fully-connected neural
network using standard training methods. We then applied the neural network to a wide range of possible
received signals to illustrate how it is making its detections. The colored areas in Fig. 2(b) show in which
regions the received signals are mapped to the respective information signals. Note that we have “zoomed
out” and the range of values that was shown in Fig. 2(a) is indicated by the black box.
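
The experiment described above can be imitated with a short NumPy sketch. The text does not specify the network size, loss, or optimizer, so everything below is an assumption: one tanh hidden layer with 16 neurons, a softmax output, cross-entropy loss, and plain full-batch gradient descent. It is meant as a rough stand-in for "standard training methods", not a reproduction of the network behind Fig. 2(b).

```python
import numpy as np

rng = np.random.default_rng(0)
S = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]]) / np.sqrt(2)   # rows: signal points in (9)
SIGMA2 = 0.2

def make_data(n_per_point):
    """Received signals r = s + n and the index of the signal that was sent."""
    labels = np.repeat(np.arange(4), n_per_point)
    noise = rng.normal(scale=np.sqrt(SIGMA2), size=(labels.size, 2))
    return S[labels] + noise, labels

def forward(r, params):
    """One tanh hidden layer followed by a softmax over the four signals."""
    W1, b1, W2, b2 = params
    h = np.tanh(r @ W1 + b1)
    logits = h @ W2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return h, p / p.sum(axis=1, keepdims=True)

def train(r, labels, hidden=16, steps=1000, lr=0.2):
    """Full-batch gradient descent on the cross-entropy loss (assumed choices)."""
    params = [rng.normal(scale=0.5, size=(2, hidden)), np.zeros(hidden),
              rng.normal(scale=0.5, size=(hidden, 4)), np.zeros(4)]
    onehot = np.eye(4)[labels]               # desired one-hot output vectors y
    for _ in range(steps):
        h, p = forward(r, params)
        d_logits = (p - onehot) / len(r)     # gradient of the loss w.r.t. the logits
        d_h = (d_logits @ params[2].T) * (1.0 - h ** 2)   # backpropagate through tanh
        grads = [r.T @ d_h, d_h.sum(axis=0), h.T @ d_logits, d_logits.sum(axis=0)]
        params = [w - lr * g for w, g in zip(params, grads)]
    return params

def detect(r, params):
    """Index of the signal point with the largest network output."""
    return np.argmax(forward(r, params)[1], axis=1)

r_train, labels_train = make_data(10_000)    # 40,000 training examples in total
params = train(r_train, labels_train)
r_test, labels_test = make_data(2_500)
accuracy = np.mean(detect(r_test, params) == labels_test)
print("test accuracy:", accuracy)
print("decision far outside the training range:", detect(np.array([[40.0, 25.0]]), params))
```

Evaluating `detect` on a grid of points that extends well beyond the training data is one way to draw detection regions of the kind shown in Fig. 2(b); the shape of those regions will depend on the random initialization and the assumed architecture.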

The colored detection regions produced by the neural network have peculiar asymmetric shapes, which
are not optimal. In fact, the optimal detection regions for AWGN channels are well known [3, Ch. 6]: the
received signal should be mapped to the closest signal point in terms of Euclidean distance. The optimal
detection regions are shown in Fig. 2(c). The regions are quite similar within the black box, but
deviate greatly further away. Several important observations can be made from this example:
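
For comparison, the optimal rule stated above (map the received signal to the closest signal point in Euclidean distance) takes only a few lines. This sketch uses the constellation in (9) and $\sigma^2 = 0.2$ from the text; the random seed, the index order of the signal points, and the two far-away test inputs are arbitrary assumptions.

```python
import numpy as np

S = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]]) / np.sqrt(2)   # rows: signal points in (9)

def detect_min_distance(r):
    """Map each received signal (row of r) to the index of the closest signal point."""
    d2 = np.sum((r[:, None, :] - S[None, :, :]) ** 2, axis=2)   # squared distances, shape (N, 4)
    return np.argmin(d2, axis=1)

rng = np.random.default_rng(0)
sigma2 = 0.2
sent = np.repeat(np.arange(4), 10_000)                           # 10,000 realizations per point
r = S[sent] + rng.normal(scale=np.sqrt(sigma2), size=(sent.size, 2))

error_rate = np.mean(detect_min_distance(r) != sent)
print("empirical detection error probability:", error_rate)

# Unlike a trained network, the rule is well defined arbitrarily far from the signal points
far_away = detect_min_distance(np.array([[40.0, 25.0], [-30.0, 5.0]]))
print(far_away)
```

Because the rule depends only on distances, its detection regions are the four quadrants shown in Fig. 2(c), regardless of how far the received signal lies from the constellation.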

1) If there is a known optimal algorithm, a trained neural network cannot outperform it. The detection
error probability is, however, almost the same in this example since most received signals appear
within the black box where the neural network has a decent behavior.
2) The detection regions in Fig. 2(b) are wrongly shaped since all training examples appeared inside
the black box around the signal points. Nevertheless, the long tail of the Gaussian distribution will
occasionally give received signals far outside this box. Since this never happened during training, the
neural network does not know what to do and can make strange miscategorizations. It has learned to
interpolate between training examples but not to extrapolate. This is a general issue; neural networks
are good at handling typical inputs but may generalize poorly to atypical inputs.
3) We could have used prior domain knowledge (from digital communications) to preprocess the input
signals. In this example, the neural network had to rediscover where the constellation points are,
how the noise is distributed, and how to make the right detection. If we instead computed the
Euclidean distance between the received signal and each of the four signal constellation points, we
could use these distances as input to a neural network. This would give more accurate and reliable
results since the neural network has fewer characteristics to learn, but it still cannot beat the optimal detection.

V. IS THERE A ROLE OF DEEP LEARNING IN COMMUNICATIONS?

Since signal detection in AWGN channels is easy to perform optimally, it makes little sense to utilize
artificial neural networks for that purpose. There are many similar tasks in communications where deep
learning cannot make any meaningful improvements. For example, the fundamental performance limits
were derived by Shannon [4] and we can operate close to those limits using modern channel codes.
Moreover, it is known how to perform optimal channel estimation, multi-user multiple-input multiple-
output (MIMO) processing, and transmit power allocation in many wireless communication scenarios [5].
The fact that the information signals are man-made gives us strong prior information that makes it easier
to devise effective man-made algorithms than in many other fields, where the signals are created by nature.
There are nevertheless some roles that deep learning can play in communications. Firstly, there are many
problems where a known algorithm finds the optimal solution but has a computational complexity that is too
high for real-time implementation. Secondly, there are cases where the standard system models that are
used in communications are inadequate. We will elaborate on these two applications in the remainder of this
lecture note. But before that, we stress that errors are unavoidable in the physical layer of communication
systems and are conventionally dealt with using retransmissions. This built-in fault tolerance is positive
when it comes to deep learning. It gives robustness to the strange behaviors that occasionally occur when
an atypical signal is fed into a neural network that has been trained to work well for typical input signals.
However, adversaries can also exploit atypical signals to perform jamming more efficiently [6].

[Figure 3: (a) Offline training phase: the input $x$ is fed both to the known algorithm $y = f(x)$ and to the
neural network $\hat{y} = \hat{f}(x; \theta)$; the error $y - \hat{y}$ drives the training that updates $\theta$ and
finally outputs $\theta^*$. (b) Real-time usage: the trained neural network computes $\hat{y} = \hat{f}(x; \theta^*)$.]

Fig. 3. A known algorithm $f$ can be approximated by training a neural network $\hat{f}$ to make
$f(x) \approx \hat{f}(x; \theta^*)$ for all possible inputs, as shown in (a). The training procedure will iteratively
update $\theta$ to gradually reduce the approximation errors until it converges to some $\theta^*$. If the neural
network is designed to have sufficiently low complexity, then the trained neural network in (b) can be used
in real-time applications.

VI. APPLICATION 1: ALGORITHMIC APPROXIMATION

The first important application of deep learning in communications is to approximate a known but
computationally complicated algorithm. There are many examples of iterative algorithms that asymptotically
find a global (or local) optimum to an optimization problem, but require a very large number of iterations
for convergence and/or complicated operations in each iteration [7]. Such algorithms might not be practically
useful in communication systems where latency constraints require execution times below a millisecond.
The general procedure for training a neural network for algorithmic approximation is illustrated in
Fig. 3. Suppose we have a known algorithm, represented by the function $y = f(x)$, which cannot be
implemented in real time. To address this problem using deep learning, we can first create a training set
containing a large number $T$ of input signals $x_t^{\mathrm{train}}$, for $t = 1, \ldots, T$. We then run the
algorithm $T$ times to compute the outputs

$$y_t^{\mathrm{train}} = f(x_t^{\mathrm{train}}). \tag{10}$$

After having generated the training set, we can train an artificial neural network to provide approximately
the same outputs for these inputs. More precisely, we should find an optimized parameter vector $\theta^*$ in
accordance with (6). If the training is performed well, the neural network will generalize well (i.e., provide
good outputs) to previously unseen input signals that were generated in the same way as the inputs used
for training. Simply speaking, this means that $f(x) \approx \hat{f}(x; \theta^*)$ for all inputs $x$ of practical interest.
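
The data-generation step in (10) can be sketched as follows. The function `slow_algorithm` is a hypothetical stand-in for $f$: a gradient-descent solver for a small regularized least-squares problem, chosen only because it is iterative and its result can be checked against a closed form. The problem data `A`, the regularization weight, and the sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))       # assumed problem data, fixed across all inputs
lam = 0.1                         # assumed regularization weight

def slow_algorithm(x, iterations=2000):
    """Hypothetical f: gradient descent on ||A z - x||^2 + lam ||z||^2."""
    step = 1.0 / (2.0 * (np.linalg.norm(A, 2) ** 2 + lam))   # 1 / Lipschitz constant
    z = np.zeros(A.shape[1])
    for _ in range(iterations):
        z = z - step * 2.0 * (A.T @ (A @ z - x) + lam * z)
    return z

T = 100
X_train = rng.normal(size=(A.shape[0], T))                    # inputs as columns, as in (4)
Y_train = np.column_stack([slow_algorithm(X_train[:, t])      # outputs from (10), as in (5)
                           for t in range(T)])
print(X_train.shape, Y_train.shape)
```

The pair `(X_train, Y_train)` is what the training in (6) would consume; the cost of building it is $T$ full runs of the algorithm, which happens offline.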

There are many optimization problems to be solved in communication systems. For example, at the
transmitter, power allocation between concurrent transmissions is important to limit interference [5], [7].
At the receiver, non-linear signal detection problems must be solved to deal with interference in MIMO
systems [8]. Some of these problems are convex and can be solved by off-the-shelf optimization software.
Other problems are non-convex but there exist iterative algorithms that converge to local or global optima.
In both cases, the computational complexity is often prohibitive for real-time applications, where similar
optimization problems with different input data are solved repeatedly. A neural network can then be trained
to learn approximately how the solution depends on the input data. This approximate input-output map can
be evaluated with substantially lower computational cost, as exemplified in [7], [8]. Domain knowledge
can be utilized to pre-process the input data, to focus the learning on the problem that the algorithm is
solving and not on rediscovering known properties (e.g., that the desired signal lies in a certain subspace).

There are two main approaches. One can learn the input-output mapping based on training data, as
described above, while ignoring how it was produced. Alternatively, the shape of the neural network can
be selected so that each layer mimics one iteration of a known algorithm that converges asymptotically to
an optimum. This is called deep unfolding and exploits that many first-order iterative optimization methods
have the same structure as a (recurrent) neural network [9]. The parameters of the neural network are
then trained to give a nearly optimum solution after a predefined number of iterations, thereby speeding
up the convergence. In [8], the authors “unfold” a gradient-descent-like algorithm for MIMO detection
to create a neural network where each layer performs similar operations but with optimized parameters.
When using this approach, the outputs in (10) need not be computed in advance, which simplifies the training.
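
The structure behind deep unfolding can be sketched with plain gradient descent for a least-squares problem, where each "layer" applies one gradient step with its own step size. This is an illustrative simplification under assumed problem data; it is not the architecture of [8], and no training of the step sizes is performed here.

```python
import numpy as np

def unfolded_network(x, H, step_sizes):
    """Each layer applies one gradient step on 0.5 * ||H z - x||^2.

    The per-layer step sizes play the role of trainable parameters; with a
    common fixed value the network reduces to ordinary gradient descent.
    """
    z = np.zeros(H.shape[1])
    for delta in step_sizes:                  # one layer per step size
        z = z - delta * (H.T @ (H @ z - x))
    return z

rng = np.random.default_rng(0)
H = rng.normal(size=(8, 3))                   # assumed channel-like matrix
z_true = rng.normal(size=3)
x = H @ z_true                                # noise-free observation for simplicity

delta_gd = 1.0 / np.linalg.norm(H, 2) ** 2    # classical fixed step size
for num_layers in (1, 10, 100):
    z_hat = unfolded_network(x, H, np.full(num_layers, delta_gd))
    print(num_layers, np.linalg.norm(z_hat - z_true))
```

In a trained unfolded network, the entries of `step_sizes` (and possibly further per-layer parameters) would be optimized so that a small, fixed number of layers already gives an accurate output.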

The practical benefit of this application is the complexity reduction it can provide; the neural network
will essentially learn how to make algorithmic shortcuts to strike a good balance between accuracy and
computational complexity. Another important benefit is related to hardware implementation. To solve a
practical problem with real-time constraints, we conventionally would first need to design an algorithm
and then develop a dedicated circuit based on it, which can be very time-consuming. With the help of
deep learning, we can instead predesign a general-purpose circuit that implements a neural network of a
given maximum size (i.e., number of layers and neurons) with a predetermined run time. We can then
train a neural network to perform the algorithmic task we need and, finally, load the corresponding trained
parameters (i.e., weights and biases) onto the circuit. This new approach to hardware implementation can
greatly reduce the time from when the algorithmic design begins until a product can reach the market.

A main issue with this application is the highly computationally demanding generation of desired
outputs: the more complex the algorithm $f$ is, the longer time it takes to compute $f(x_t^{\mathrm{train}})$ for
$t = 1, \ldots, T$. We are basically moving the complexity issue from the algorithmic run time to the design
process. There is a practical limit to which algorithms we can approximate in this way. If it takes 1 hour to
generate one training example, it will take 11.4 years or extreme parallelism to generate 100,000 examples.

[Figure 4: (a) Training phase: a signal $y$ passes through the unknown function $g$ to give $x = g(y)$, which is
fed to the neural network $\hat{y} = \hat{f}(x; \theta)$; the error $y - \hat{y}$ drives the training that updates
$\theta$ and finally outputs $\theta^*$. (b) Usage: the trained neural network computes
$\hat{y} = \hat{f}(x; \theta^*)$ from $x = g(y)$.]

Fig. 4. An unknown function $g$ with input $y$ is inverted using a neural network $\hat{f}$ by training it to achieve
$\hat{f}(g(y); \theta^*) \approx y$, as shown in (a). The training procedure will iteratively update $\theta$ to gradually
reduce the approximation errors until it converges to some $\theta^*$. The trained neural network in (b) can be used
to counteract the unknown function, without having to explicitly model it and estimate model parameters.
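
The arithmetic behind the 11.4 years is shown below, together with the effect of parallelism. The worker count in the usage line is a hypothetical value.

```python
hours_per_example = 1
num_examples = 100_000

total_hours = hours_per_example * num_examples
years_sequential = total_hours / (24 * 365)      # one machine running around the clock

def wall_clock_days(num_workers):
    """Days needed if the examples are generated on num_workers machines in parallel."""
    return total_hours / num_workers / 24

print(round(years_sequential, 1), "years sequentially")
print(round(wall_clock_days(1000), 1), "days with 1000 parallel workers")
```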

VII. APPLICATION 2: INVERSION OF AN UNKNOWN FUNCTION

The second important application is to invert an unknown function. In particular, non-linear distortion
can occur between the transmitter and receiver. Three prominent examples are finite-resolution quantization
in the receiver hardware, non-linear amplifiers in the transmitter hardware [10], and non-linear fiber-optical
channels [11]. While quantizers typically are designed with known properties, the latter two examples can
be represented by an unknown function g that takes a signal y as input and produces a distorted output
x = g(y). The conventional way to undo the distortion is to identify an appropriate parameterized model
of the function, then estimate the parameters from measurements, and finally create an inverse function
based on the estimates. This approach is prone to error-propagation between the three steps. An alternative
approach is to train a neural network to directly invert the function, without requiring explicit modeling
or parameter estimation. Deep learning can provide better results than the conventional approach.
The general procedure for training a neural network for function inversion is illustrated in Fig. 4. To
perform training, we need to generate a large number $T$ of possible communication signals
$y_t^{\mathrm{train}}$ and send them through the unknown function to measure

$$x_t^{\mathrm{train}} = g(y_t^{\mathrm{train}}) \quad \text{for } t = 1, \ldots, T. \tag{11}$$

It is then $x_t^{\mathrm{train}}$ that is used as input to the neural network, while $y_t^{\mathrm{train}}$ is the desired output.
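
The roles of inputs and desired outputs in (11) can be illustrated with a scalar toy example. Two simplifications are made and should be read as assumptions: the "unknown" distortion `g` is a tanh saturation that is only used to produce measurements, and the neural network is replaced by a least-squares polynomial fit so that the sketch stays short. The point is the data flow, not the choice of approximator.

```python
import numpy as np

def g(y):
    """Hypothetical distortion (saturating amplifier); treated as a black box."""
    return np.tanh(y)

rng = np.random.default_rng(0)
T = 2000
y_train = rng.uniform(-1.0, 1.0, size=T)     # known, man-made transmit signals
x_train = g(y_train)                         # measured distorted signals, as in (11)

# Fit the inverse mapping x -> y directly, without modeling g.
# A degree-7 polynomial stands in for the neural network here.
theta = np.polyfit(x_train, y_train, 7)

y_test = rng.uniform(-1.0, 1.0, size=500)
y_hat = np.polyval(theta, g(y_test))         # counteract the distortion
max_error = np.max(np.abs(y_hat - y_test))
print("largest inversion error on test signals:", max_error)
```

Note that the distorted signal is the input of the fitted model and the clean signal is its target, which is the reverse of how `g` itself maps signals.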

Different from Application 1, the creation of a training set can be very computationally efficient in
Application 2 because the outputs are man-made. Online learning when operating the communication
system is possible by occasionally sending predefined reference signals to generate new training data.
This is useful when the function g is time-varying (e.g., due to temperature variations in the hardware).
The key to successful utilization of deep learning is to identify tasks in communication systems that
currently lack an optimal solution—there is then an opportunity to beat the state-of-the-art. For example,
a common way to deal with non-linear communication hardware is to apply the Bussgang decomposition
[12] to write the output of the non-linear function $g$ as $g(y) = Dy + n$, where $D$ is a deterministic
matrix and $n$ is distortion noise that is uncorrelated with $y$ but statistically dependent on it. By treating
$n$ as if it were independent noise, one can often develop communication algorithms (e.g., for channel
estimation or data detection) that partially mitigate distortion, but such algorithms are suboptimal since the
distortion depends on the input. As shown in [10], one can achieve substantially better performance by
training neural networks instead.
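
The statement that the distortion noise is uncorrelated with the input but still dependent on it can be checked numerically in the scalar case. In this sketch, the input is standard Gaussian and the non-linearity is hard clipping at $\pm 1$; both are assumptions for illustration, and $D$ is estimated from samples rather than derived analytically.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=200_000)                 # Gaussian input signal (scalar case)

def g(y):
    """Hypothetical non-linearity: hard clipping at +/- 1."""
    return np.clip(y, -1.0, 1.0)

x = g(y)
D = np.mean(x * y) / np.mean(y * y)          # sample estimate of the Bussgang gain
n = x - D * y                                # distortion noise in g(y) = D y + n

corr_n_y = np.mean(n * y)                    # zero by construction: n is uncorrelated with y
corr_sq = np.corrcoef(n ** 2, y ** 2)[0, 1]  # clearly non-zero: n still depends on y

print("D =", D)
print("E[n y] =", corr_n_y)
print("correlation between n^2 and y^2 =", corr_sq)
```

The non-zero correlation between the squared quantities shows the dependence that is ignored when $n$ is treated as independent noise.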

VIII. LESSONS LEARNED

Although many problems in communication systems can be solved optimally, there are important cases
where deep learning can give large improvements. In particular, it can be used to reduce computational
complexity of known algorithms or to deal with non-linear hardware or channels in an efficient way.

REFERENCES
[1] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4,
pp. 303–314, Dec. 1989.
[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[3] U. Madhow, Introduction to Communication Systems. Cambridge University Press, 2014.
[4] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, 1948.
[5] E. Björnson, J. Hoydis, and L. Sanguinetti, “Massive MIMO networks: Spectral, energy, and hardware efficiency,” Foundations and
Trends in Signal Processing, vol. 11, no. 3-4, pp. 154–655, 2017.
[6] M. Sadeghi and E. G. Larsson, “Adversarial attacks on deep-learning based radio signal classification,” IEEE Wireless Communications
Letters, vol. 8, no. 1, pp. 213–216, Feb. 2019.
[7] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for interference
management,” IEEE Transactions on Signal Processing, vol. 66, no. 20, pp. 5438–5453, Oct. 2018.
[8] N. Samuel, T. Diskin, and A. Wiesel, “Deep MIMO detection,” in IEEE International Workshop on Signal Processing Advances in
Wireless Communications (SPAWC), Jul. 2017, pp. 1–5.
[9] J. R. Hershey, J. L. Roux, and F. Weninger, “Deep unfolding: Model-based inspiration of novel deep architectures,” arXiv preprint,
vol. abs/1409.2574, 2014. [Online]. Available: http://arxiv.org/abs/1409.2574
[10] Ö. T. Demir and E. Björnson, “Channel estimation in massive MIMO under hardware non-linearities: Bayesian methods versus deep
learning,” IEEE Open Journal of the Communications Society, 2019, to appear.
[11] A. D. Ellis, J. Zhao, and D. Cotter, “Approaching the non-linear Shannon limit,” Journal of Lightwave Technology, vol. 28, no. 4, pp.
423–433, Feb. 2010.
[12] J. J. Bussgang, “Crosscorrelation functions of amplitude-distorted Gaussian signals,” RLE, MIT, Tech. Rep. 216, 1952.
