Learning to Detect
Neev Samuel, Member, IEEE, Tzvi Diskin, Member, IEEE, and Ami Wiesel, Member, IEEE
Abstract—In this paper we consider Multiple-Input-Multiple-Output (MIMO) detection using deep neural networks. We introduce two different deep architectures: a standard fully connected multi-layer network, and a Detection Network (DetNet) which is specifically designed for the task. The structure of DetNet is obtained by unfolding the iterations of a projected gradient descent algorithm into a network. We compare the accuracy and runtime complexity of the proposed approaches and achieve state-of-the-art performance while maintaining low computational requirements. Furthermore, we manage to train a single network to detect over an entire distribution of channels. Finally, we consider detection with soft outputs and show that the networks can easily be modified to produce soft decisions.

Index Terms—MIMO Detection, Deep Learning, Neural Networks.

Manuscript received May, 2018. N. Samuel, T. Diskin and A. Wiesel are with the School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel (e-mail: [email protected], or see http://www.cs.huji.ac.il/~amiw/). This research was partly supported by the Heron Consortium and by ISF grant 1339/15.
I. INTRODUCTION

MULTIPLE input multiple output (MIMO) systems enable enhanced performance in communication systems by using many dimensions that account for time and frequency resources, multiple users, multiple antennas and other resources. While improving performance, these systems present difficult computational challenges when it comes to detection, since the problem is NP-Complete, and there is a growing need for sub-optimal solutions with polynomial complexity. Recent advances in the field of machine learning, specifically the success of deep neural networks in solving many problems in almost any field of engineering, suggest that a data driven approach to detection using machine learning may present a computationally efficient way to achieve near optimal detection accuracy.
A. MIMO detection

MIMO detection is a classical problem in simple hypothesis testing [1]. The maximum likelihood (ML) detector involves an exhaustive search and is the optimal detector in the sense of minimum joint probability of error for detecting all the symbols simultaneously. Unfortunately, it has an exponential runtime complexity which makes it impractical in large real time systems.

In order to overcome the computational cost of the maximum likelihood decoder, there is considerable interest in the implementation of suboptimal detection algorithms which provide a better and more flexible accuracy versus complexity tradeoff. In the high accuracy regime, sphere decoding algorithms [2], [3], [4] were proposed, based on lattice search, offering better computational complexity with rather low accuracy degradation relative to the full search. In the other regime, the most common suboptimal detectors are the linear receivers, i.e., the matched filter (MF), the decorrelator or zero forcing (ZF) detector, and the minimum mean squared error (MMSE) detector. More advanced detectors are based on decision feedback equalization (DFE), approximate message passing (AMP) [5] and semidefinite relaxation (SDR) [6], [7]. Currently, both AMP and SDR provide near optimal accuracy under many practical scenarios. AMP is simple and cheap to implement in practice, but is an iterative method that may diverge in challenging settings. SDR is more robust and has polynomial complexity, but is limited in the constellations it addresses and is much slower in practice.
B. Background on Machine Learning

Supervised machine learning is the ability to solve statistical problems using examples of inputs and their desired outputs. Unlike classical hypothesis testing, it is typically used when the underlying distributions are unknown and are characterized via sample examples. In recent years, the field witnessed the deep revolution. The "deep" adjective is associated with the use of complicated and expressive classes of algorithms, also known as architectures. These are typically neural networks with many non-linear operations and layers. Deep architectures are more expressive than shallow ones and can theoretically solve much harder and larger problems [8], but were previously considered impossible to optimize. With the advances in big data, optimization algorithms and stronger computing resources, such networks are currently state of the art in different problems from speech processing [9], [10] and computer vision [11], [12] to games [13]. Typical solutions involve dozens and even hundreds of layers which are slowly optimized off-line over clusters of computers, to provide accurate and cheap decision rules which can be applied in real-time. In particular, one promising approach to designing deep architectures is by unfolding an existing iterative algorithm [14]. Each iteration is considered a layer and the algorithm is called a network. The learning begins with the existing algorithm as an initial starting point and uses optimization methods to improve the algorithm. For example, this strategy has been shown successful in the context of sparse reconstruction [15], [16]. Leading algorithms such as Iterative Shrinkage and Thresholding and a sparse version of AMP have both been improved by unfolding their iterations into a network and learning their optimal parameters.

Following this revolution, there is a growing body of works on deep learning methods for communication systems. Exciting contributions in the context of error correcting codes include [17]–[21].
In [22] a machine learning approach is considered in order to decode over molecular communication systems where chemical signals are used for transfer of information. In these systems an accurate model of the channel is impossible to find. This approach of decoding without CSI (channel state information) is further developed in [23]. Machine learning for channel estimation is considered in [24], [25]. End-to-end detection over continuous signals is addressed in [26]. Joint learning of transmitters and receivers is considered in [27]. Parts of our work on MIMO detection using deep learning have already appeared in [28]; see also [29]. Similar ideas were discussed in [30] in the context of robust regression.

C. Main contributions

The main contribution of this paper is the introduction of two deep learning networks for MIMO detection. We show that, under a wide range of scenarios including different channel models and various digital constellations, our networks achieve near optimal detection performance with low computational complexity.

Another important result is their ability to easily provide soft outputs as required by modern communication systems. We show that for different constellations the soft outputs of our networks achieve accuracy comparable to that of the M-Best sphere decoder with low computational complexity.

From a more general learning perspective, an important contribution is DetNet's ability to perform on multiple models with a single training. Recently, there have been works on learning to invert linear channels and reconstruct signals [15], [16], [31]. To the best of our knowledge, these were developed and trained to address a single fixed channel. In contrast, DetNet is designed to handle multiple channels simultaneously with a single training phase.

The paper is organized as follows. In Section II we present the MIMO detection problem and how it is formulated as a learning problem, including the use of one-hot representations. In Section III we present two types of neural network based detectors, FullyCon and DetNet. In Section IV we consider soft decisions. In Section V we compare the accuracy and the runtime of the proposed learning based detectors against traditional detection methods, both in the hard decision and the soft decision cases. Finally, Section VI provides concluding remarks.

D. Notation

In this paper, we denote the normal distribution with mean µ and variance σ² as N(µ, σ²). The uniform distribution with minimum value a and maximum value b is denoted U(a, b). Boldface uppercase letters denote matrices, and boldface lowercase letters denote vectors. The superscript (·)^T denotes the transpose. The i'th element of the vector x is denoted x_i. Unless stated otherwise, the term independent and identically distributed (i.i.d.) Gaussian matrix refers to a matrix where each element is i.i.d. sampled from the normal distribution N(0, 1). The rectified linear unit is defined as ρ(x) = max{0, x}. When considering a complex matrix or vector, its real and imaginary parts are denoted ℜ(·) and ℑ(·), respectively. An α-Toeplitz matrix M is defined as a matrix such that [M^T M]_{i,j} = α^{|i−j|}.

II. PROBLEM FORMULATION

A. MIMO detection

We consider the standard linear MIMO model:

    ȳ = H̄x̄ + w̄,   (1)

where ȳ ∈ C^N is the received vector, H̄ ∈ C^{N×K} is the channel matrix, x̄ ∈ S̄^K is an unknown vector of independent, equal probability symbols from some finite constellation S̄ (e.g. PSK or QAM), and w̄ is a noise vector of size N with independent, zero mean complex normal variables of variance σ².

Our detectors do not assume knowledge of the noise variance σ². Hypothesis testing theory guarantees that it is unnecessary for optimal detection [1]; indeed, the ML rule does not depend on it. This is in contrast to the MMSE and AMP decoders, which exploit this parameter and are therefore less robust in cases where the noise variance is not known exactly.
B. Reparameterization

A main challenge in MIMO detection is the use of complex valued signals and various digital constellations S̄ which are less common in machine learning. In order to use standard tools and provide a unified framework, we re-parameterize the problem using real valued vectors and one-hot mappings, as described below.

First, throughout this work we avoid handling complex valued variables and use the following convention:

    y = Hx + w,   (2)

where (stacking real parts over imaginary parts)

    y = [ℜ(ȳ); ℑ(ȳ)],  w = [ℜ(w̄); ℑ(w̄)],  x = [ℜ(x̄); ℑ(x̄)],

    H = [ ℜ(H̄)  −ℑ(H̄)
          ℑ(H̄)   ℜ(H̄) ],   (3)

where y ∈ R^{2N} is the received vector, H ∈ R^{2N×2K} is the channel matrix, and x ∈ S^{2K} with S = ℜ{S̄} (which is also equal to ℑ{S̄} in the complex valued constellations we tested).

A second convention concerns the re-parameterization of the discrete constellation S = {s_1, · · · , s_{|S|}} using a one-hot mapping. With each possible s_i we associate a unit vector u_i ∈ R^{|S|}. For example, the 4 dimensional one-hot mapping of the real part of the 16-QAM constellation is defined as

    s_1 = −3 ↔ u_1 = [1, 0, 0, 0]
    s_2 = −1 ↔ u_2 = [0, 1, 0, 0]
    s_3 =  1 ↔ u_3 = [0, 0, 1, 0]
    s_4 =  3 ↔ u_4 = [0, 0, 0, 1].   (4)

We denote this mapping via the function s = f_oh(u), so that s_i = f_oh(u_i) for i = 1, · · · , |S|.
More generally, for approximate inputs which are not unit vectors, the function is defined as

    x = f_oh(x_oh) = Σ_{i=1}^{|S|} s_i [x_oh]_i.   (5)

[Figure legend: input/output variables, learned variables; × multiplication, + addition, ρ ReLU activation.]
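To make these conventions concrete, the following is a minimal NumPy sketch (ours, not the authors' TensorFlow code) of the complex-to-real transformation (2)–(3) and of the one-hot mapping f_oh in (4)–(5); all names are illustrative:

```python
import numpy as np

def complex_to_real(y_c, H_c, x_c):
    """Re-parameterize the complex model (1) into the real model (2)-(3)."""
    y = np.concatenate([y_c.real, y_c.imag])
    x = np.concatenate([x_c.real, x_c.imag])
    H = np.block([[H_c.real, -H_c.imag],
                  [H_c.imag,  H_c.real]])
    return y, H, x

# Real part of the 16-QAM constellation and its one-hot mapping, eq. (4).
S = np.array([-3.0, -1.0, 1.0, 3.0])

def f_oh(x_oh):
    """Map a (possibly approximate) one-hot vector to a symbol, eq. (5)."""
    return float(np.dot(S, x_oh))

u2 = np.array([0.0, 1.0, 0.0, 0.0])  # one-hot vector of s2 = -1
assert f_oh(u2) == -1.0
```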
This hints that two main ingredients in the architecture should be H^T y and H^T H x. Second, our construction is based on mimicking a projected gradient descent like solution for the maximum likelihood optimization. Such an algorithm would lead to iterations of the form

    x̂_{k+1} = Π[ x̂_k − δ_k ∂‖y − Hx‖²/∂x |_{x=x̂_k} ]
            = Π[ x̂_k − δ_k H^T y + δ_k H^T H x̂_k ],   (11)
where x̂_k is the estimate in the k'th iteration, Π[·] is a nonlinear projection operator, and δ_k is a step size. Intuitively, each iteration (executed by a single layer in DetNet) is a linear combination of x̂_k, H^T y, and H^T H x̂_k followed by a non-linear projection. We enrich these iterations by lifting the input to a higher dimension in each iteration and applying standard non-linearities which are common in deep neural networks. In order to further improve the performance we treat the gradient step sizes δ_k at each step as learned parameters and optimize them during the training phase. This yields the following architecture:

    q_k = x̂_{k−1} − δ_{1k} H^T y + δ_{2k} H^T H x̂_{k−1}
    z_k = ρ( W_{1k} [q_k; v_{k−1}] + b_{1k} )
    x̂_{oh,k} = W_{2k} z_k + b_{2k}
    x̂_k = f_oh(x̂_{oh,k})
    v̂_k = W_{3k} z_k + b_{3k}
    x̂_0 = 0
    v̂_0 = 0,   (12)

with the trainable parameters

    θ = {W_{1k}, b_{1k}, W_{2k}, b_{2k}, W_{3k}, b_{3k}, δ_{1k}, δ_{2k}}_{k=1}^{L}.   (13)
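For illustration, a single DetNet layer (12) can be sketched in NumPy as follows. This is a simplified forward pass under assumed shapes; the paper's implementation is in TensorFlow and the helper names here are ours:

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def detnet_layer(x_prev, v_prev, H, y, W1, b1, W2, b2, W3, b3, d1, d2, f_oh):
    """One DetNet layer, eq. (12): a learned gradient-like step, a lifting
    to a wider hidden layer, and a projection back onto the constellation.
    f_oh maps the one-hot shaped output back to symbol values, eq. (5)."""
    q = x_prev - d1 * (H.T @ y) + d2 * (H.T @ H @ x_prev)
    z = relu(W1 @ np.concatenate([q, v_prev]) + b1)  # lift: W1 is m x n, m > n
    x_oh = W2 @ z + b2                               # one-hot shaped estimate
    x_hat = f_oh(x_oh)                               # back to symbol values
    v = W3 @ z + b3                                  # unconstrained side info
    return x_hat, x_oh, v
```

In practice, H^T y and H^T H can be computed once and reused across all L layers.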
Note the similarity between (11) and the computation of q_k. When computing q_k at each layer, an explicit gradient descent step is computed, with learnable step sizes. To ensure that our networks have the advantages of wide neural networks, the parameters W_{1k} are m × n matrices with m > n, which means that multiplying by W_{1k} increases the dimension of the input. The final estimate is defined as x̂_L. For convenience, the structure of each DetNet layer is illustrated in Fig. 2.

Training deep networks is a difficult task due to vanishing gradients, saturation of the activation functions, sensitivity to initialization and more [32]. To address these challenges we adopted a loss function that takes into account the outputs of all of the layers, an idea following the notion of auxiliary classifiers presented in GoogLeNet [12]:

    l(x_oh; x̂_oh(H, y; θ)) = Σ_{l=1}^{L} log(l) ‖x_oh − x̂_{oh,l}‖².   (14)

The downside of this loss function is that it forces the output of each layer x̂_{oh,k} to be close to x_oh in order to minimize the loss. As a result, the only information passing from one layer to the next is the estimate x̂_{oh,k}, and in practice we lose the ability of the deep network to compute complicated features using many layers. In order to solve this problem we added the v̂_k variable in (12), which allows the network to pass unconstrained information from one layer to another.

In our final implementation, in order to further enhance the performance of DetNet, we added a residual feature from ResNet [11] where the output of each layer is a weighted average with the output of the previous layer:

    x̂_k = α x̂_{k−1} + (1 − α) x̂_k.   (15)
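The layer-weighted loss (14) and the residual averaging (15) are equally short; a sketch in the same illustrative style:

```python
import numpy as np

def detnet_loss(x_oh_true, x_oh_layers):
    """Eq. (14): sum over layers l of log(l) * ||x_oh - x_oh_l||^2,
    where x_oh_layers[l-1] is the one-hot output of layer l."""
    return sum(np.log(l) * np.sum((x_oh_true - x_oh_l) ** 2)
               for l, x_oh_l in enumerate(x_oh_layers, start=1))

def residual_average(x_prev, x_new, alpha):
    """Eq. (15): ResNet-style weighted average of consecutive outputs."""
    return alpha * x_prev + (1.0 - alpha) * x_new
```

Note that since log(1) = 0, the first layer's output does not directly contribute to (14).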
IV. SOFT DECISION OUTPUT

In this section, we consider a more general setting in which the MIMO detector needs to provide soft outputs. High end communication systems typically resort to iterative decoding, where the MIMO detector and the error correcting decoder iteratively exchange information on the unknowns until convergence. For this purpose, the MIMO detector must replace its hard estimates with soft posterior distributions Prob(x_j = s_i | y) for each unknown j = 1, · · · , 2K and each possible symbol i = 1, · · · , |S|. More precisely, it also needs to accept additional soft inputs, but we leave this for future work.

Computation of the posteriors is straightforward based on Bayes' law, but its complexity is exponential in the size of the signal and constellation. Similarly to the maximum likelihood algorithm in the hard decision case, this computation yields optimal accuracy yet is intractable. Thus, the goal in this section is to design networks that approximate the posteriors. At first glance, this seems difficult to learn, as we have no training set of posteriors and cannot define a loss function. Remarkably, this is not a problem, and the probabilities of arbitrary constellations can easily be recovered using the standard l2 loss function with respect to the one-hot representation x_oh. Indeed, consider a scalar x and a single s ∈ S associated with its one-hot bit x_oh; then it is well known that

    arg min_{x̂_oh} E[ ‖x_oh − x̂_oh‖² | y ] = E[ x_oh | y ]   (16)
                                            = x_est,           (17)

which is a vector of the size of S that satisfies

    x_est,i = Prob(x_oh,i = 1 | y) = Prob(x = s_i | y).

Assuming that our network is sufficiently expressive and globally optimized, the one-hot output x̂_oh will provide the exact posterior probabilities. Therefore, in practice, the one-hot output will approximate the true posterior probabilities.

V. NUMERICAL RESULTS

In this section, we provide numerical results on the accuracy and complexity of the proposed networks in comparison to competing methods. In the FC case, the results are over the 0.55-Toeplitz channel. In the VC case, and when testing the soft output performance, the results presented are over random channels where each element is sampled i.i.d. from the normal distribution N(0, 1).
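As a hedged illustration of the two channel models, the sketch below draws an i.i.d. Gaussian channel and one possible realization of an α-Toeplitz channel via a matrix square root; H = Q T^{1/2} is just one of many matrices satisfying H^T H = T, and not necessarily the authors' construction:

```python
import numpy as np

def iid_channel(n, k, rng):
    """VC case: every entry i.i.d. N(0, 1)."""
    return rng.standard_normal((n, k))

def alpha_toeplitz_channel(n, k, alpha, rng):
    """FC case: an H with [H^T H]_{i,j} = alpha^{|i-j|} (e.g. alpha = 0.55)."""
    T = alpha ** np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
    evals, evecs = np.linalg.eigh(T)                    # T is symmetric PD
    T_sqrt = evecs @ np.diag(np.sqrt(evals)) @ evecs.T  # symmetric square root
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))    # orthonormal columns
    return Q @ T_sqrt                                   # H^T H = T_sqrt^2 = T

rng = np.random.default_rng(0)
H_fc = alpha_toeplitz_channel(60, 30, 0.55, rng)
T = 0.55 ** np.abs(np.subtract.outer(np.arange(30), np.arange(30)))
assert np.allclose(H_fc.T @ H_fc, T)
```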
Fig. 2. A flowchart representing a single layer of DetNet. The network is composed of L such layers, where each layer's output is the next layer's input.
A. Implementation details

We train both networks using a variant of the stochastic gradient descent method [33], [34] for optimizing deep networks, named the Adam optimizer [35]. All networks were implemented using the Python based TensorFlow library [36].

In the FullyCon network, the number of layers was 6 and each hidden layer had 10K neurons.

In the case of the DetNet architecture we used 30 layers in all constellations and all channel sizes presented in this paper. In the hard decision case, the sizes of z_k were 4K, 4K, 8K and 12K for the BPSK, QPSK, 16QAM and 8PSK constellations respectively, and the sizes of v_k were 2K, 2K, 4K and 4K, respectively; the sizes of z_k and v_k do not depend on k. In the soft decision case, the size of z_k was 12K and the size of v_k was 4K for all of the constellations tested.
We trained FullyCon for 1,000,000 iterations with a batch size of 1,000 samples, and DetNet for approximately 100,000 iterations with a batch size of 2,000 samples (the number of iterations varied slightly depending on the constellation). We used a decaying learning rate with a starting rate of 0.0008 and a decay rate of 0.97 every 1000 iterations (the exact values might vary slightly between constellations). To give a rough idea of the computation needed during the learning phase, optimizing the detectors in our numerical results took around 3 days for both architectures on a standard Intel i7-6700 processor. Each sample was independently generated from (2) according to the statistics of x, H (either in the FC or VC model) and w. During training, the noise variance was randomly generated so that the SNR would be uniformly distributed on U(SNR_min − 1, SNR_max + 1), where SNR_min and SNR_max are the minimal and maximal SNR values over which we used the network.
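The sample generation described above might look as follows (BPSK for concreteness; the SNR normalization convention is our assumption, since the paper does not spell it out):

```python
import numpy as np

def sample_batch(H, batch_size, snr_min_db, snr_max_db, rng):
    """Draw (y, x) pairs from model (2); each sample's SNR is uniform on
    U(SNR_min - 1, SNR_max + 1), as in the training procedure above."""
    n, k = H.shape
    x = rng.choice([-1.0, 1.0], size=(batch_size, k))    # BPSK symbols
    snr = 10.0 ** (rng.uniform(snr_min_db - 1.0, snr_max_db + 1.0,
                               size=(batch_size, 1)) / 10.0)
    sig = np.sum((x @ H.T) ** 2, axis=1, keepdims=True)  # ||Hx||^2 per sample
    w = rng.standard_normal((batch_size, n)) * np.sqrt(sig / (snr * n))
    y = x @ H.T + w
    return y, x
```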
B. Competing algorithms

When presenting our network performance we shall use the following naming conventions:

FullyCon: The basic fully-connected deep architecture.
DetNet: The DetNet deep architecture.

In the hard decision scenarios, we tested our deep networks against the following detection algorithms:

ZF: The classical decorrelator, also known as the least squares or zero forcing (ZF) detector [1].
DF: The decision feedback equalization algorithm.
AMP: The approximate message passing algorithm from [5].
SDR: A decoder based on semidefinite relaxation, implemented using an efficient interior point solver [6], [7]. For the 8-PSK constellation we implemented the SDR variation suggested in [37].
SD: An implementation of the sphere decoding algorithm as presented in [38].

In the soft output case, we tested our networks against the M-Best sphere decoding algorithm as presented in [3] (originally named K-Best, but renamed here to avoid confusion with K, the transmitted signal size):

M-Best SD M=5: The M-Best sphere decoding algorithm, where the number of candidates kept is 5.
M-Best SD M=7: Same as M-Best SD M=5 with 7 candidates.

C. Accuracy results

1) Fixed Channel (FC): In the FC scenario, where we know during the learning phase over what realization of the channel we need to detect, the performance of both our networks was comparable to most of the competitors except SD. Both DetNet and FullyCon managed to achieve accuracy results comparable to SDR and AMP. This result emphasizes the notion that when learning to detect over simple scenarios such as FC, a simple network is expressive enough; and since a simple network is easier to optimize and has lower complexity, it is preferable. In Fig. 3 we present the accuracy rates over a range of SNR values in the FC model. This is a rather difficult setting, and algorithms such as AMP did not succeed to converge.

2) Varying Channel (VC): In the VC case, the accuracy results of FullyCon were poor and the network did not manage to learn how to detect properly. DetNet managed to achieve accuracy rates comparable to those of SDR and AMP, and almost comparable to those of SD, while being computationally cheaper (see the next section regarding computational resources).
Fig. 3. Comparison of the detection algorithms' BER performance in the fixed channel case over a BPSK modulated signal. (BER vs. SNR in dB; curves: ZF, DF, SDR, AMP, SD, DetNet, FullyCon.)

Fig. 4. Comparison of the detection algorithms' BER performance in the varying channel case over a BPSK modulated signal. All algorithms were tested on channels of size 60x30.

Fig. 5. Comparison of the detection algorithms' BER performance in the varying channel case over a QPSK modulated signal. All algorithms were tested on channels of size 30x20.

Fig. 6. Comparison of the detection algorithms' SER performance in the varying channel case over a 16-QAM modulated signal. All algorithms were tested on channels of size 25x15.
Fig. 7. Comparison of the detection algorithms' SER performance in the varying channel case over an 8-PSK modulated signal. All algorithms were tested on channels of size 25x15.

Fig. 8. Comparison of the detection algorithms' SER performance in the varying channel case, where each column is correlated according to a one-ring model correlation matrix created with a random parameter of the angular spread, over a QPSK modulated signal. All algorithms were tested on channels of size 15x10.

Fig. 9. Comparison of the accuracy of the soft output relative to the posterior probability in the case of a BPSK signal over a 20 × 10 real valued channel.

Fig. 10. Comparison of the accuracy of the soft output relative to the posterior probability for a 16-QAM signal over an 8 × 4 complex valued channel.
To quantify the distance between the estimated soft outputs and the posterior probability distribution, we use the Jensen-Shannon divergence, which measures the similarity between two discrete probability distributions P and Q and is defined as:

    JSD(P, Q) = (1/2) D_KL(P ‖ M) + (1/2) D_KL(Q ‖ M),
    M = (1/2)(P + Q),
    D_KL(P ‖ Q) = Σ_i P(i) log( P(i) / Q(i) ).   (18)
TensorFlow objects to Numpy objects. We note that the run-
In Fig. 9 we present the Jensen-Shannon divergence between time of SD depends on the SNR, and we therefore report a
the estimated prediction of DetNet and the posterior probabil- range of times.
ity distribution in the case of a BPSK signal over a 10x20 real An important factor when considering the run time of the
channel. In this setting we reach smaller divergence than that neural networks is the effect the batch size. Unlike classical
achieved by the M-Best algorithm. As seen in Fig. 9 adding detectors as SDR and SD, neural networks can detect over
additional layers improves the accuracy of the soft output. In entire batches of data which speeds up the detection process.
Fig. 10 we present the results over a 8x4 complex channel with This is true also for the AMP algorithm, where computation
16-QAM constellation. We can see the performance of DetNet can be made on an entire batch of signals at once. However, the
is comparable to the M-Best Sphere decoding algorithm. For improvement introduced by using batches is highly dependent
completeness, in Fig. 11 we added the 8-PSK constellation soft on the platform used (CPU/GPU/FPGA etc). Therefore, for
Run time comparison of the detection algorithms:

Constellation / channel size | Batch size | DetNet | SDR | AMP | SD
BPSK, 60x30 | 1 | 0.0066 | 0.024 | 0.0093 | 0.008–0.1

[Fig. 11: soft output accuracy for the 8-PSK constellation; vertical axis: average distance from the posterior probability.]
Fig. 12. Flops count for different algorithms over different K sizes for the QPSK constellation and a K × K channel size.

Fig. 13. Flops count for different algorithms over different K sizes for the 8PSK constellation and a K × K channel size.
TABLE IV
FLOPS COUNT COMPARISON BETWEEN DETNET, SEMIDEFINITE RELAXATION, AMP, SPHERE DECODING AND M-BEST SPHERE DECODING AS A FUNCTION OF K AND THE PARAMETERS OF THE ALGORITHMS

Algorithm | Flops count
DetNet | (K² + (3K + 2·Aux)·Hid) · L_DetNet
SDR | 6K³ · L_SDR
AMP | (2K × N + 2·Post × K) · L_AMP
SD | 2K × Nodes
M-Best SD | (23 + log(|Con| × M)) · (K × M × |Con|)
Fig. 14. Flops count for different algorithms over different K sizes for the 16QAM constellation and a K × K channel size.

In Figs. 12, 13 and 14 we present the number of flops for the QPSK, 8PSK and 16QAM constellations, respectively, for different sizes of K. The complexity of the sphere decoding algorithm is very dependent on the SNR value, so we present one graph for low SNR values and a second one for a higher SNR value. Both sphere decoding graphs are asymptotically worse than the competing detection algorithms. The flops count of DetNet is always better than SDR and worse than the AMP algorithm. In Fig. 15 we present the flops count for the 16QAM constellation in the soft decision case. While in the runtime comparison the M-Best sphere decoding algorithm was comparable to or slower than DetNet, when counting flops the M-Best algorithm is much faster than DetNet. The reason for the difference is that the M-Best algorithm is more intensive in memory accesses, which slow down the algorithm yet do not affect the flops count.
3) Accuracy-Complexity Trade-Off: An interesting feature of DetNet is that the complexity-accuracy trade-off can be decided at run-time. Each of the network's layers outputs an estimated signal, and our loss optimizes all of them. We usually use the output of the last layer as the result, since it is the most accurate, but it is possible to take the estimated output x̂_k of an earlier layer to allow faster detection. In Fig. 16 we present the accuracy as a function of the layer used as the output layer in the case of a 60x30 channel with BPSK constellation.
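A sketch of this run-time trade-off, assuming the layer computations are available as a list of callables (hypothetical helper names):

```python
def detect_with_budget(x0, v0, layers, budget):
    """Run the first `budget` DetNet layers, eq. (12), and return that
    intermediate estimate: faster detection at a small accuracy cost.
    `layers` is a list of L callables mapping (x, v) -> (x, v)."""
    x, v = x0, v0
    for layer in layers[:budget]:
        x, v = layer(x, v)
    return x
```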
VI. CONCLUSION

In this paper we investigated the ability of deep neural networks to serve as MIMO detectors. We introduced two deep learning architectures that provide promising accuracy with low and flexible computational complexity. We demonstrated their application to various digital constellations, and their ability to provide accurate soft posterior outputs. An important feature of one of our networks is its ability to detect over multiple channel realizations with a single training.

Using neural networks as a general scheme in MIMO detection still has a long way to go, and there are many open questions. These include their hardware complexity, robustness, and integration into full communication systems. Furthermore, the architectures we proposed are not flexible to changes in the constellation used or in the number of users (that is, any change in the number of users or constellation will require a new network). Nonetheless, we believe this approach is promising
and has the potential to impact future communication systems. Neural networks can be trained on realistic channel models and tune their performance for specific environments. Their architectures and batch operation are more natural to hardware implementation than algorithms such as SDR and SD. Finally, their multi-layer structure allows a flexible accuracy versus complexity trade-off, as required by many modern applications.

Fig. 16. Comparison of the average BER as a function of the layer chosen to be the output layer in the case of a 60x30 channel and BPSK constellation.

ACKNOWLEDGMENTS

We would like to thank Shai Shalev-Shwartz for many discussions throughout this research. In addition, we thank Amir Globerson and Yoav Wald for their ideas and help with the soft output networks.

REFERENCES

[1] S. Verdu, Multiuser Detection. Cambridge University Press, 1998.
[2] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, "Closest point search in lattices," IEEE Transactions on Information Theory, vol. 48, no. 8, pp. 2201–2214, 2002.
[3] Z. Guo and P. Nilsson, "Algorithm and implementation of the K-best sphere decoding for MIMO detection," IEEE Journal on Selected Areas in Communications, vol. 24, no. 3, pp. 491–503, 2006.
[13] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[14] J. R. Hershey, J. L. Roux, and F. Weninger, "Deep unfolding: Model-based inspiration of novel deep architectures," arXiv preprint arXiv:1409.2574, 2014.
[15] K. Gregor and Y. LeCun, "Learning fast approximations of sparse coding," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 399–406.
[16] M. Borgerding and P. Schniter, "Onsager-corrected deep learning for sparse linear inverse problems," in IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 2016, pp. 227–231.
[17] E. Nachmani, Y. Be'ery, and D. Burshtein, "Learning to decode linear codes using deep learning," in Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on. IEEE, 2016, pp. 341–346.
[18] E. Nachmani, E. Marciano, D. Burshtein, and Y. Be'ery, "RNN decoding of linear block codes," arXiv preprint arXiv:1702.07560, 2017.
[19] E. Nachmani, E. Marciano, L. Lugosch, W. J. Gross, D. Burshtein, and Y. Be'ery, "Deep learning methods for improved decoding of linear codes," IEEE Journal of Selected Topics in Signal Processing, 2018.
[20] T. J. O'Shea and J. Hoydis, "An introduction to machine learning communications systems," arXiv preprint arXiv:1702.00832, 2017.
[21] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, "On deep learning-based channel decoding," in 51st Annual Conference on Information Sciences and Systems (CISS). IEEE, 2017, pp. 1–6.
[22] N. Farsad and A. Goldsmith, "Detection algorithms for communication systems using deep learning," arXiv preprint arXiv:1705.08044, 2017.
[23] ——, "Neural network detection of data sequences in communication systems," arXiv preprint arXiv:1802.02046, 2018.
[24] H. Ye, G. Li, and B. Juang, "Power of deep learning for channel estimation and signal detection in OFDM systems," IEEE Wireless Communications Letters, 2017.
[25] T. O'Shea, K. Karra, and T. Clancy, "Learning approximate neural estimators for wireless channel state information," arXiv preprint arXiv:1707.06260, 2017.
[26] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, "Deep learning based communication over the air," IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 132–143, 2018.
[27] T. O'Shea, T. Erpek, and T. Clancy, "Deep learning based MIMO communications," arXiv preprint arXiv:1707.07980, 2017.
[28] N. Samuel, T. Diskin, and A. Wiesel, "Deep MIMO detection," arXiv preprint arXiv:1706.01151, 2017.
[29] T. Wang, C. Wen, H. Wang, F. Gao, T. Jiang, and S. Jin, "Deep learning for wireless physical layer: Opportunities and challenges," China Communications, vol. 14, no. 11, pp. 92–111, 2017.