Soft-Output Sphere Decoding: Performance and Implementation Aspects
Abstract: Multiple-input multiple-output (MIMO) detection algorithms providing soft information for a subsequent channel decoder pose significant implementation challenges due to their high computational complexity. In this paper, we show how sphere decoding can be used as an efficient tool to implement soft-output MIMO detection with flexible trade-offs between computational complexity and (error rate) performance. In particular, we demonstrate that single tree search, ordered QR decomposition, channel matrix regularization, and log-likelihood ratio clipping are the key ingredients for realizing soft-output MIMO detectors with near max-log performance at a computational complexity that is reasonably close to that of hard-output sphere decoding.

This work was supported by the STREP project No. IST-026905 (MASCOT) within the sixth framework programme (FP6) of the European Commission.

I. INTRODUCTION

Multiple-input multiple-output (MIMO) wireless systems employ multiple antennas on both sides of the wireless link and offer increased spectral efficiency (compared to single-antenna systems) by transmitting multiple data streams concurrently and in the same frequency band (spatial multiplexing). MIMO technology constitutes the basis for upcoming wireless communication standards, such as IEEE 802.11n and IEEE 802.16e.

The main challenge in the practical realization of MIMO wireless systems lies in the efficient implementation of the detector which needs to separate the spatially multiplexed data streams. To this end, a wide range of algorithms offering various trade-offs between performance and computational complexity have been developed [1]. Linear detection producing hard-decision outputs constitutes one extreme of the complexity/performance trade-off region, while computationally demanding a posteriori probability (APP) detection algorithms result in the opposite extreme. The computational complexity of a MIMO detection algorithm depends on the symbol constellation size and the number of spatially multiplexed data streams, but often also on the instantaneous MIMO channel realization and the signal-to-noise ratio (SNR). On the other hand, the overall decoding effort is typically constrained by system bandwidth, latency requirements, and limitations on power consumption. Implementing different algorithms, each optimized for a maximum allowed decoding effort and/or a particular system configuration, would entail a considerable hardware overhead and in addition be highly inefficient since large portions of the chip would remain idle most of the time. A practical MIMO receiver design must therefore be able to cover a wide range of complexity/performance trade-offs using a single tunable detection algorithm.

Contributions: In this (predominantly tutorial) paper, we provide a formulation of the sphere decoder [2], [3] as a tunable MIMO detector with performance ranging from that of successive interference cancellation (SIC) to that of max-log APP detection. Tuning of the detector is achieved through log-likelihood ratio (LLR) clipping, preprocessing, and imposing constraints on the maximum computational complexity of the decoder. We formulate a framework for systematically characterizing the resulting complexity/performance trade-offs. Finally, we elaborate on, and provide some refinements of, the tree-search algorithm introduced in [4] and the LLR clipping approach proposed in [5].

Outline: The remainder of this paper is organized as follows. Section II reviews the transformation of the MIMO detection and LLR computation problems into a tree-search problem. Section III reviews max-log APP sphere decoding and proposes some refinements of existing algorithms. In Section IV, we describe methods for reducing the tree-search complexity. A framework for evaluating the complexity/performance trade-offs of the resulting class of detectors is introduced in Section V. We conclude in Section VI.

II. SOFT-OUTPUT SPHERE DECODING

Consider a MIMO system with $M_T$ transmit and $M_R$ receive antennas. The coded bit-stream is mapped to $M_T$-dimensional transmit vector symbols $\mathbf{s} \in \mathcal{O}^{M_T}$, where $\mathcal{O}$ stands for the underlying complex-valued scalar constellation of cardinality $2^Q$. The individual coded bits are denoted by $x_{j,b}$, where the indices $j$ and $b$ refer to the $b$th bit in the binary label of the $j$th entry of $\mathbf{s}$, respectively. The resulting complex baseband input-output relation is given by

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n} \tag{1}$$

where $\mathbf{H}$ denotes the $M_R \times M_T$ channel matrix and $\mathbf{n}$ is an i.i.d. proper complex Gaussian distributed $M_R$-dimensional noise vector with unit variance entries.
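As an illustration of the input-output relation (1), the following sketch draws one received vector for a given channel matrix. It is a minimal example in plain Python and not part of the original paper: the function names, the unnormalized 16-QAM alphabet, and the random draw used for $\mathbf{H}$ are assumptions made only for this illustration.

```python
import math
import random


def proper_complex_gaussian(rng):
    """One proper complex Gaussian sample with unit variance.

    Real and imaginary parts are independent, each with variance 1/2.
    """
    sigma = math.sqrt(0.5)
    return complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))


def qam16():
    """Unnormalized 16-QAM alphabet (assumption: no power scaling)."""
    levels = (-3, -1, 1, 3)
    return [complex(re, im) for re in levels for im in levels]


def received_vector(H, s, rng):
    """Return y = H s + n, with H given as a list of M_R rows of length M_T."""
    return [sum(h * x for h, x in zip(row, s)) + proper_complex_gaussian(rng)
            for row in H]


rng = random.Random(0)
M_R, M_T = 4, 4
# Hypothetical channel draw; the text does not fix a distribution for H here.
H = [[proper_complex_gaussian(rng) for _ in range(M_T)] for _ in range(M_R)]
s = [rng.choice(qam16()) for _ in range(M_T)]
y = received_vector(H, s, rng)
```

The noise samples have unit variance per entry, as required by the model; any scaling of the constellation or of $\mathbf{H}$ to set the SNR is left out of this sketch.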
1424407850/06/$20.00 2071
A. Max-Log Soft-Output Computation

Soft-output MIMO detection requires the computation of LLRs for all coded bits. In order to reduce the corresponding computational complexity, we employ the max-log approximation [6]

$$L(x_{j,b}) = \min_{\mathbf{s} \in \mathcal{X}^{(0)}_{j,b}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 - \min_{\mathbf{s} \in \mathcal{X}^{(1)}_{j,b}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 \tag{2}$$

where $\mathcal{X}^{(0)}_{j,b}$ and $\mathcal{X}^{(1)}_{j,b}$ are the disjoint sets of vector symbols that have the $b$th bit in the label of the $j$th scalar symbol equal to 0 and 1, respectively. For each bit, one of the two minima in (2) is given by $\lambda^{\mathrm{ML}} = \|\mathbf{y} - \mathbf{H}\mathbf{s}^{\mathrm{ML}}\|^2$, where

$$\mathbf{s}^{\mathrm{ML}} = \arg\min_{\mathbf{s} \in \mathcal{O}^{M_T}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 \tag{3}$$

is the maximum likelihood (ML) solution. The other minimum in (2) is given by

$$\lambda^{\overline{\mathrm{ML}}}_{j,b} = \min_{\mathbf{s} \in \mathcal{X}^{(x^{\overline{\mathrm{ML}}}_{j,b})}_{j,b}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 \tag{4}$$

where the counter-hypothesis $x^{\overline{\mathrm{ML}}}_{j,b}$ denotes the binary complement of the $b$th bit in the binary label of the $j$th entry of $\mathbf{s}^{\mathrm{ML}}$. With (3) and (4) the max-log LLRs can be written as

$$L(x_{j,b}) = \begin{cases} \lambda^{\mathrm{ML}} - \lambda^{\overline{\mathrm{ML}}}_{j,b}, & x^{\mathrm{ML}}_{j,b} = 0 \\ \lambda^{\overline{\mathrm{ML}}}_{j,b} - \lambda^{\mathrm{ML}}, & x^{\mathrm{ML}}_{j,b} = 1. \end{cases} \tag{5}$$

From (5) we can conclude that efficient max-log APP MIMO detection reduces to efficiently identifying $\mathbf{s}^{\mathrm{ML}}$, $\lambda^{\mathrm{ML}}$, and $\lambda^{\overline{\mathrm{ML}}}_{j,b}$ for $j = 1, 2, \ldots, M_T$ and $b = 1, 2, \ldots, Q$ [7].

B. Max-Log APP MIMO Detection as a Tree Search

Transforming (3) and (4) into tree-search problems and using the sphere decoding algorithm [2], [3] makes it possible to compute the LLRs (5) efficiently. To this end, the channel matrix $\mathbf{H}$ is first QR-decomposed according to $\mathbf{H} = \mathbf{Q}\mathbf{R}$, where $\mathbf{Q}$ is unitary and $\mathbf{R}$ is upper-triangular with real-valued positive entries on its main diagonal. Left-multiplying (1) by $\mathbf{Q}^H$ (the superscript $H$ stands for conjugate transposition) leads to the modified input-output relation

$$\tilde{\mathbf{y}} = \mathbf{R}\mathbf{s} + \mathbf{Q}^H\mathbf{n} \quad \text{with} \quad \tilde{\mathbf{y}} = \mathbf{Q}^H\mathbf{y}$$

and hence, noting that $\mathbf{Q}^H\mathbf{n}$ has the same statistics as $\mathbf{n}$, to the equivalent formulation of $\lambda^{\mathrm{ML}}$ and $\lambda^{\overline{\mathrm{ML}}}_{j,b}$ as

$$\lambda^{\mathrm{ML}} = \min_{\mathbf{s} \in \mathcal{O}^{M_T}} \|\tilde{\mathbf{y}} - \mathbf{R}\mathbf{s}\|^2 \tag{6}$$

$$\lambda^{\overline{\mathrm{ML}}}_{j,b} = \min_{\mathbf{s} \in \mathcal{X}^{(x^{\overline{\mathrm{ML}}}_{j,b})}_{j,b}} \|\tilde{\mathbf{y}} - \mathbf{R}\mathbf{s}\|^2. \tag{7}$$

We next define the partial symbol vectors (PSVs) $\mathbf{s}^{(i)} = [\, s_i \ s_{i+1} \ \cdots \ s_{M_T} \,]^T$ and note that the $\mathbf{s}^{(i)}$ can be arranged in a tree that has its root just above level $i = M_T$ and leaves, which correspond to possible candidate symbol vectors, on level $i = 1$. After initializing $d_{M_T+1} = 0$, the Euclidean distances $d(\mathbf{s}) = \|\tilde{\mathbf{y}} - \mathbf{R}\mathbf{s}\|^2$ in (6) and (7) can be computed recursively as $d(\mathbf{s}) = d_1$ with the partial Euclidean distances (PEDs)

$$d_i = d_{i+1} + |e_i|^2, \quad i = M_T, M_T - 1, \ldots, 1 \tag{8}$$

and the distance increments (DIs)

$$|e_i|^2 = \Bigl|\, \tilde{y}_i - \sum_{j=i}^{M_T} R_{i,j}\, s_j \Bigr|^2. \tag{9}$$

Since the dependence of the PEDs $d_i$ on the symbol vector $\mathbf{s}$ is only through $\mathbf{s}^{(i)}$, we have transformed ML detection and the computation of the max-log LLRs into a weighted tree-search problem: PEDs and PSVs are associated with nodes, branches correspond to DIs. Each path from the root down to a leaf corresponds to a symbol vector $\mathbf{s} \in \mathcal{O}^{M_T}$. The leaf associated with the smallest metric in $\mathcal{O}^{M_T}$ and $\mathcal{X}^{(x^{\overline{\mathrm{ML}}}_{j,b})}_{j,b}$ corresponds to the solution of (6) and (7), respectively. The basic building block underlying the two tree traversal strategies described in the next section is the Schnorr-Euchner sphere decoder (SESD) with radius reduction [8], briefly summarized as follows: The SESD constrains the search to nodes which lie within a radius $r$ around $\tilde{\mathbf{y}}$ and traverses the tree depth-first, visiting the children of a given node in ascending order of their PEDs. The basic idea of radius reduction is to start the algorithm with $r = \infty$ and to update the radius according to $r^2 \leftarrow d(\mathbf{s})$ whenever a leaf $\mathbf{s}$ has been reached. This avoids the problem of selecting a suitable (initial) radius and leads to efficient pruning of the tree.

Throughout this paper, computational complexity is defined as the number of visited nodes. This complexity measure is directly related to the throughput of corresponding VLSI implementations [9].

III. TREE-TRAVERSAL STRATEGIES

Computing the LLRs as in (5) requires determining the metric $\lambda^{\overline{\mathrm{ML}}}_{j,b}$, which is achieved by traversing only those parts of the tree that have leaves in $\mathcal{X}^{(x^{\overline{\mathrm{ML}}}_{j,b})}_{j,b}$. Since this computation has to be carried out for every coded bit, it is immediately obvious that the resulting need for repeated tree traversals can lead to a major computational burden. In the following, we review two alternative tree-traversal strategies, proposed in [7] and [4], respectively, for solving (6) and (7). In addition, we propose some minor refinements of the tree-search algorithm introduced in [4].

A. Repeated Tree Search

An algorithm for computing the LLRs based on repeated tree search (RTS) was described in [7]. The basic idea is to start by solving (6) (using the SESD) and to rerun the SESD to solve (7) for each coded bit (i.e., $Q M_T$ times) in the vector symbol. When rerunning the SESD to determine $\lambda^{\overline{\mathrm{ML}}}_{j,b}$, the search tree is prepruned by forcing the decoder to exclude all nodes (and the corresponding subtrees) from the search for which $x_{j,b} = x^{\mathrm{ML}}_{j,b}$. This prepruning procedure is illustrated in Fig. 1. Initializing the SESD with $r = \infty$ in each of the $Q M_T$ runs required to obtain $\lambda^{\overline{\mathrm{ML}}}_{j,b}$ will lead to
high computational complexity. It is therefore important to realize that, without compromising max-log optimality, we can initialize the search radius $r_{j,b}$ by setting it equal to the minimum value of $\|\tilde{\mathbf{y}} - \mathbf{R}\mathbf{s}\|$ over all $\mathbf{s} \in \mathcal{X}^{(x^{\overline{\mathrm{ML}}}_{j,b})}_{j,b}$ found during preceding tree traversals.

The main advantage of the RTS strategy lies in the fact that each traversal of the tree can be performed using a hard-decision SESD with minimal modifications to account for the search being carried out on a prepruned tree. The main disadvantage is the repeated traversal of large parts of the tree.

[Fig. 1. Example of the prepruning procedure in the RTS approach. Counter-hypotheses to the ML solution are found by forcing the algorithm through the dashed branches. Labels in the figure: $\mathbf{x}^{\mathrm{ML}} = [\, 0 \ 1 \ 1 \,]$, $x^{\overline{\mathrm{ML}}}_1 = 1$, $x^{\overline{\mathrm{ML}}}_2 = 0$, $x^{\overline{\mathrm{ML}}}_3 = 0$.]

by the updates $\lambda^{\mathrm{ML}} \leftarrow d(\mathbf{x})$ and $\mathbf{x}^{\mathrm{ML}} \leftarrow \mathbf{x}$. In other words, for each bit in the ML hypothesis that is changed in the process of the update, the metric of the former ML hypothesis becomes the metric of the new counter-hypothesis, followed by an update of the ML hypothesis. This procedure ensures that all $\lambda^{\overline{\mathrm{ML}}}_{j,b}$ always contain the metric associated with a valid counter-hypothesis to the current ML hypothesis.

2) In the case where $d(\mathbf{x}) \geq \lambda^{\mathrm{ML}}$, only the counter-hypotheses have to be checked. For all $j$ and $b$ for which $d(\mathbf{x}) < \lambda^{\overline{\mathrm{ML}}}_{j,b}$ and $x_{j,b} = x^{\overline{\mathrm{ML}}}_{j,b}$, the decoder updates $\lambda^{\overline{\mathrm{ML}}}_{j,b} \leftarrow d(\mathbf{x})$.

Pruning criterion: The key aspect of this algorithm is the following pruning criterion. A given node $\mathbf{s}^{(i)}$ on level $i$ and the subtree originating from that node have the partial binary label $\mathbf{x}^{(i)}$ consisting of the bits $x_{j,b}$ ($b = 1, 2, \ldots, Q$ and $j = i, i+1, \ldots, M_T$). The remaining bits $x_{j,b}$ ($j = 1, 2, \ldots, i-1$) corresponding to the subtree are unknown at this point. The pruning criterion for $\mathbf{s}^{(i)}$ along with its subtree is compiled from two conditions. First, the bits in the partial binary label $\mathbf{x}^{(i)}$ are compared with the corresponding bits in the binary label of the current ML hypothesis. In this comparison, for all $j$, $b$ with $x_{j,b} = x^{\overline{\mathrm{ML}}}_{j,b}$, the corresponding counter-hypotheses $\lambda^{\overline{\mathrm{ML}}}_{j,b}$ might be affected when further searching the node's subtree. Second, all counter-hypotheses corresponding to the subtree of $\mathbf{s}^{(i)}$ with the associated metrics $\lambda^{\overline{\mathrm{ML}}}_{j,b}$ ($j = 1, 2, \ldots, i-1$) may also be updated since the corresponding bits are not yet known. In summary, the metrics which may be affected during further search in the subtree emanating from a node $\mathbf{s}^{(i)}$ are given by the set
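The two leaf-update cases described above (a new ML hypothesis in case 1, a counter-hypothesis check in case 2) can be sketched in a few lines of plain Python. This is a hedged illustration only: the class name `StsMetrics`, the flattening of the index pair $(j, b)$ into a single bit position, and the handling of the very first leaf are assumptions, and the sketch covers only the metric bookkeeping, not the tree traversal or the pruning criterion.

```python
import math


class StsMetrics:
    """ML metric plus one counter-hypothesis metric per coded bit."""

    def __init__(self, num_bits):
        self.lam_ml = math.inf                 # metric of the current ML hypothesis
        self.x_ml = None                       # binary label of the ML hypothesis
        self.lam_cf = [math.inf] * num_bits    # counter-hypothesis metrics

    def update(self, x, d):
        """Process a leaf with binary label x (sequence of 0/1) and metric d."""
        if self.x_ml is None or d < self.lam_ml:
            # Case 1: new ML hypothesis. For every bit that changes, the old
            # ML metric becomes the metric of the new counter-hypothesis.
            if self.x_ml is not None:
                for k, bit in enumerate(x):
                    if bit != self.x_ml[k]:
                        self.lam_cf[k] = self.lam_ml
            self.lam_ml, self.x_ml = d, tuple(x)
        else:
            # Case 2: only counter-hypotheses can improve, and only for bits
            # in which the leaf differs from the current ML hypothesis.
            for k, bit in enumerate(x):
                if bit != self.x_ml[k] and d < self.lam_cf[k]:
                    self.lam_cf[k] = d

    def llrs(self):
        """Max-log LLRs with the sign convention of (5)."""
        return [self.lam_ml - lam if bit == 0 else lam - self.lam_ml
                for bit, lam in zip(self.x_ml, self.lam_cf)]


# Hypothetical leaf metrics for a toy tree with two coded bits.
leaves = {(0, 0): 2.5, (0, 1): 0.5, (1, 0): 3.0, (1, 1): 1.0}
state = StsMetrics(num_bits=2)
for label, metric in leaves.items():
    state.update(label, metric)
```

In this toy run every leaf is visited, so the result coincides with evaluating (2) by exhaustive search; the benefit of the single tree search comes from combining these updates with pruning, which is not modeled here.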
[Fig. 2. Example of the STS pruning criterion ($M_T = 5$ and two bits per symbol): The partial binary label $\mathbf{x}^{(i)}$ determines which counter-hypotheses may be affected during the search of the subtree emanating from the current node.]

way of ensuring that LLR values are bounded is to clip them after the detection stage so that

$$|L(x_{j,b})| \leq L_{\max} \quad \forall\, j, b. \tag{11}$$

It has been noted in [5] that the constraint (11) can be built into the tree-search algorithm such that it leads to a reduction in search complexity. In the following, we briefly describe the application of the idea proposed in [5] to the RTS and the STS tree-traversal strategies.

a) LLR Clipping for RTS: Whenever the RTS algorithm starts to search for a counter-hypothesis, with the search radius $r_{j,b}$ initialized as described in Section III-A, we first update

$$r_{j,b} \leftarrow \min\bigl\{ r_{j,b},\ \lambda^{\mathrm{ML}} + L_{\max} \bigr\} \tag{12}$$

which ensures that (11) is satisfied. Metrics associated with counter-hypotheses for which no valid lattice point can be found are set to $\lambda^{\mathrm{ML}} + L_{\max}$.

b) LLR Clipping for STS: Whenever a leaf has been reached and a new ML hypothesis has been found after carrying out the steps in Case 1 in Section III-B, the counter-hypotheses have to be updated according to

where $\mathbf{P}$ is a suitably chosen permutation matrix. More efficient pruning of the search tree closer to the root is obtained if "stronger streams" correspond to the levels closer to the root, i.e., $\mathbf{P}$ is chosen such that the main diagonal entries of $\mathbf{R}$ in $\mathbf{H}\mathbf{P} = \mathbf{Q}\mathbf{R}$ are sorted in ascending order. In the following, this approach is termed sorted QR-decomposition (SQRD) [11].

Regularization: Poorly conditioned channel realizations $\mathbf{H}$ lead to significant search complexity due to the low effective SNR on one or multiple of the effective spatial streams. An efficient way to counter this problem is to perform the tree-search on a regularized channel matrix by computing

$$\begin{bmatrix} \mathbf{H} \\ \alpha \mathbf{I} \end{bmatrix} \mathbf{P} = \begin{bmatrix} \mathbf{Q}_1 \\ \mathbf{Q}_2 \end{bmatrix} \mathbf{R}$$

where $\mathbf{I}$ is the $M_T \times M_T$ identity matrix and $\alpha > 0$ is a suitably chosen regularization parameter. LLRs are then computed according to

$$L(x_{j,b}) = \min_{\tilde{\mathbf{s}} \in \mathcal{X}^{(0)}_{j,b}} \|\tilde{\mathbf{y}} - \mathbf{R}\tilde{\mathbf{s}}\|^2 - \min_{\tilde{\mathbf{s}} \in \mathcal{X}^{(1)}_{j,b}} \|\tilde{\mathbf{y}} - \mathbf{R}\tilde{\mathbf{s}}\|^2 \tag{14}$$

where $\tilde{\mathbf{y}} = \mathbf{Q}_1^H \mathbf{y}$ and $\tilde{\mathbf{s}} = \mathbf{P}\mathbf{s}$. Note that the LLRs in (14) need to be reordered at the end of the decoding process to account for the permutation induced by $\mathbf{P}$. Operating on a regularized version of the channel matrix clearly entails an (error rate) performance loss. However, we shall see in Section V that choosing $\alpha$ according to the minimum mean squared error (MMSE) criterion (resulting in MMSE-SQRD) as outlined in [12], degrades the performance only slightly while leading to considerable savings in terms of search complexity.

C. Run-Time Constraints

A disadvantage of all SDs is that the computational complexity required to find the ML solution (and the LLR values) depends on the realization of the channel matrix and the noise; the worst-case complexity corresponds to an exhaustive search. On the other hand, in order to meet the practically important requirement of a fixed throughput, the algorithm run-time must be constrained, which leads to a constraint on the maximum detection effort. This, in turn, generally prevents the detector from achieving ML or max-log APP performance.

A straightforward way of enforcing a run-time constraint
symbol is allowed to use up all of the remaining run-time within the block up to a safety margin of $(N - k) M_T$ visited

[Fig. 3. Comparison of repeated tree search (RTS), single tree search (STS) and the list sphere decoder (LSD) as proposed in [6]. The numbers next to the curves correspond to $L_{\max}$ for RTS and STS and to the list size in the case of the LSD. Horizontal axis: minimum SNR for a given FER.]

of the concepts described in Sections III and IV by plotting the average (over independent channel and noise realizations) number of visited nodes as a function of this minimum SNR. Since the number of visited nodes translates directly to the required chip area per throughput [9], the corresponding charts make it possible to associate an SNR penalty with a reduction in hardware complexity.
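To make the horizontal axis of these charts concrete, the sketch below shows one possible way of reading off the minimum SNR for a given FER from simulated (SNR, FER) pairs. It is an assumption-laden illustration, not the procedure used in the paper: the function name `min_snr_for_fer`, the log-linear interpolation between neighbouring simulation points, and the numbers in the example are all hypothetical.

```python
import math


def min_snr_for_fer(points, target_fer):
    """Smallest SNR (in dB) at which the interpolated FER reaches target_fer.

    points is a list of (snr_db, fer) pairs from a simulation. Between two
    neighbouring points, log10(FER) is interpolated linearly in SNR.
    Returns None if the target is not reached within the simulated range.
    """
    if target_fer <= 0.0:
        raise ValueError("target_fer must be positive")
    pts = sorted(points)
    if pts and pts[0][1] <= target_fer:
        return pts[0][0]                      # target already met at lowest SNR
    for (snr0, fer0), (snr1, fer1) in zip(pts, pts[1:]):
        if fer0 > target_fer >= fer1:
            if fer1 <= 0.0:
                return snr1                   # no errors observed at snr1
            t = ((math.log10(fer0) - math.log10(target_fer))
                 / (math.log10(fer0) - math.log10(fer1)))
            return snr0 + t * (snr1 - snr0)
    return None


# Hypothetical simulation results, not taken from the paper.
fer_curve = [(16.0, 0.1), (17.0, 0.01), (18.0, 0.001)]
snr_at_target = min_snr_for_fer(fer_curve, 0.01)
```

Pairing each such minimum SNR with the average number of visited nodes measured for the same detector configuration gives one point of a complexity/performance curve.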
All simulation results are for a rate 1/2 (generator polynomials $[133_o \ 171_o]$ and constraint length 7) convolutionally encoded $4 \times 4$ MIMO-OFDM system with 16-QAM constellation (using Gray mapping) and $N = 64$ tones. A soft-in Viterbi decoder [14] is employed. One frame consists of 1024 randomly interleaved (across space and frequency) bits and a TGn type C channel model [15] is used.

A. Comparison of Tree-Search Strategies

Fig. 3 compares the performance of RTS and STS max-log APP decoders, and the list sphere decoder (LSD) [6] for different target FERs, different values of $L_{\max}$ and, in the case of the LSD, for different list sizes. Changing the list size adjusts the complexity/performance trade-off of the LSD.

The STS approach is seen to clearly outperform the RTS strategy in terms of average complexity. We can furthermore see that for this setup max-log APP performance is achieved for $L_{\max} = 0.2$. Increasing the LLR clipping level beyond this value only increases complexity without improving performance.

The implementation of the LSD requires additional memory and logic for the administration of the candidate list, which is not accounted for in this comparison. Fig. 3 shows that even when this additional complexity is ignored, the LSD is still inferior to the STS algorithm.

B. Impact of Preprocessing and Regularization

Fig. 4 compares the impact of SQRD, MMSE-SQRD, and standard (unordered) QRD-based preprocessing on the complexity/performance trade-off of the STS algorithm at a target FER of 0.01. It can be seen that the improvement resulting from SQRD compared to unordered QRD becomes significant in the low (but realistic) complexity region. Further (minor) improvements are obtained from regularization using MMSE-SQRD. In the region where the average complexity is very high, the performance penalty resulting from regularization eventually renders MMSE-SQRD inferior to SQRD.

[Fig. 4. Comparison of unordered QRD, SQRD and MMSE-SQRD preprocessing applied to STS at a target FER of 0.01. The numbers next to the curves correspond to $L_{\max}$. For $L_{\max} \to 0$, the performance approaches that of hard-output SESD. Vertical axis: average number of visited nodes. Horizontal axis: minimum SNR for a given FER.]

C. LLR Clipping

Both Fig. 3 and Fig. 4 show that adjusting the LLR clipping level $L_{\max}$ sweeps an entire family of sphere decoders ranging from the exact max-log APP SESD (obtained, in our setup, for $L_{\max} \geq 0.2$) to hard-output SESD ($L_{\max} = 0$). The LLR clipping level is therefore an important design parameter which can be used to conveniently adjust the decoder at runtime to a given complexity constraint.

D. Run-Time Constraints

In Fig. 5, we finally demonstrate the impact of imposing a maximum run-time constraint of $N D_{\mathrm{avg}}$ visited nodes for a
[Fig. 5. Vertical axis: average number of visited nodes. One curve is labeled $D_{\mathrm{avg}} = 8$.]

VI. CONCLUSIONS

REFERENCES

[1] H. Bölcskei, D. Gesbert, C. Papadias, and A. J. van der Veen, Eds.,