Software-Defined Sphere Decoding For FPGA-based MIMO Detection

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Xuezheng Chu, Member, IEEE, and John McAllister, Member, IEEE

Abstract: Sphere Decoding (SD) is a highly effective detection technique for Multiple-Input Multiple-Output (MIMO) wireless communications receivers, offering quasi-optimal accuracy with relatively low computational complexity as compared to the ideal ML detector. Despite this, the computational demands of even low-complexity SD variants, such as Fixed Complexity SD (FSD), remain such that implementation on modern software-defined network equipment is a highly challenging process, and indeed real-time solutions for MIMO systems such as 4 × 4 16-QAM 802.11n are unreported. This paper overcomes this barrier. By exploiting large-scale networks of fine-grained software-programmable processors on Field Programmable Gate Array (FPGA), a series of unique SD implementations is presented, culminating in the only single-chip, real-time quasi-optimal SD for 4 × 4 16-QAM 802.11n MIMO. Furthermore, it demonstrates that the high performance software-defined architectures which enable these implementations exhibit cost comparable to dedicated circuit architectures.

Index Terms: MIMO, Sphere Decoder, FPGA, Multicore.

I. INTRODUCTION

Multiple-Input, Multiple-Output (MIMO) communications systems [1] exploit spatial diversity to provide wireless communications channels of unprecedented capacity and throughput, prompting their adoption in wireless communications standards such as 802.11n [2]. A generic MIMO system employing M transmit and N receive antennas is shown in Fig. 1. Effectively harnessing the benefits of MIMO technology, however, relies on the existence of accurate, high throughput receiver equipment - a very significant embedded architecture design problem. This difficulty is due to two main factors. Firstly, the high computational complexity of accurate detector algorithms such as Sphere Decoders (SDs) is apparent in the current absence of reported real-time implementations for even moderate MIMO systems, such as the 4 × 4 16-QAM topologies employed in 802.11n [3], [4], [5], [6], [7], [8]. Furthermore, such realisations should ideally be software-defined, for integration in modern network equipment and design processes
Authors are with the Institute of Electronics, Communications and Information Technology (ECIT), Queen's University Belfast (e-mail: xchu01, [email protected]).

June 22, 2012

DRAFT

(c) 2011 Crown Copyright. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].


Fig. 1. Generic MIMO Communication System (blocks: Modulation & Mapping, MIMO Channel H with path gains h11 to hMN and noise terms v1 to vN, Detection).

[9], [10], whilst current implementations of parts of SD algorithms require custom circuit architectures to achieve real-time processing. Combining these two to achieve real-time, software-defined detection is a highly challenging implementation problem.

This paper presents a unique design approach which overcomes these barriers. It extends the work in [11], [12] to create a series of unique real-time, software-defined SD processing architectures for 4 × 4 16-QAM 802.11n MIMO. Specifically, four contributions are made:

1) A highly efficient, software-defined FPGA processing architecture for baseband DSP [11], [12] is presented.
2) The architecture from 1) is used to create the first known real-time preprocessing architecture for 802.11n FSD.
3) It is shown how 1) enables the only known software-defined FSD metric calculation and sorting architecture.
4) The implementations from 2) and 3) are combined to create the only known full FSD detector for 4 × 4 16-QAM 802.11n.

The remainder of this paper is organized as follows. Section III describes the software-defined FPGA processing paradigm, before it is used to create real-time architectures for preprocessing and metric calculation and sorting in Sections IV and V respectively. Finally, Section VI exploits this approach to create the only recorded single-chip, real-time SD architecture for 4 × 4 16-QAM 802.11n MIMO.

II. BACKGROUND AND MOTIVATION

In an M-transmit, N-receive antenna MIMO system, the M-element transmitted symbol vector s suffers multipath distortion and noise corruption (v) when propagating across the channel to the receiver. Hence the N-element received symbol vector y is formulated mathematically as (1), where $\mathbf{H} \in \mathbb{C}^{N \times M}$ represents the MIMO channel, used typically as a parallel set of flat-fading subchannels via Orthogonal Frequency Division Multiplexing (OFDM).

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{v} \tag{1}$$
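As a minimal sketch of the model in (1), the Python fragment below builds a received vector and recovers the transmitted symbols by exhaustive ML search. The 2 × 2 QPSK setting, the helper names `channel_output` and `ml_detect`, and the noise-free channel are assumptions made for illustration only; they are not taken from the paper.

```python
import itertools
import random

# QPSK alphabet (P = 4 constellation points); an illustrative choice.
QPSK = [complex(a, b) for a in (-1, 1) for b in (-1, 1)]

def channel_output(H, s, v):
    """Return y = H s + v for an N x M channel matrix H given as a list of rows."""
    return [sum(h * x for h, x in zip(row, s)) + noise
            for row, noise in zip(H, v)]

def ml_detect(H, y, alphabet):
    """Exhaustive ML detection: minimise ||y - H s||^2 over all P**M candidates."""
    M = len(H[0])
    zero_noise = [0j] * len(H)

    def cost(candidate):
        predicted = channel_output(H, candidate, zero_noise)
        return sum(abs(a - b) ** 2 for a, b in zip(y, predicted))

    return min(itertools.product(alphabet, repeat=M), key=cost)

rng = random.Random(0)
M = N = 2
H = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(M)]
     for _ in range(N)]
s = [rng.choice(QPSK) for _ in range(M)]
v = [0j] * N                       # noise-free case, for a deterministic check
y = channel_output(H, s, v)
s_hat = ml_detect(H, y, QPSK)      # searches 4**2 = 16 candidate vectors
```

The same exhaustive search for a 4 × 4 16-QAM system would visit 16^4 = 65,536 candidates per received vector, which is the cost that SD variants aim to avoid.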

SD is a receiver baseband signal processing approach employed to estimate s. It offers near-ideal detection performance with significantly reduced computational complexity relative to the ideal ML detector [13], [14].

Despite this, SD algorithms in general remain computationally complex and present a significant implementation challenge, particularly in base-station equipment where demanding real-time performance metrics must be met, such as the maximum 480 Mbps, 4 µs latency required by 4 × 4 16-QAM 802.11n MIMO [2]. In the context of the industry-wide move toward software-programmable or software-defined DSP architectures which emphasise flexibility to support multiple radio standards along with implementation cost and performance [9], this is a challenging real-time implementation problem; even recorded custom circuit architectures have not proven capable of supporting real-time quasi-ML SD for 4 × 4 16-QAM 802.11n [3], [15], [5], [16], [17], [18].

A range of simplified SD variants have emerged in an attempt to alleviate this complexity problem whilst maintaining quasi-ML detection accuracy. Amongst these, Fixed-Complexity SD (FSD) is exceptional since it uniquely combines relatively low complexity, deterministic behaviour and quasi-ML accuracy [19]. FSD has a two-phase behaviour:

1) Pre-Processing (PP): The symbols of y are ordered for detection and the centre of the decoding sphere is initialised using Zero Forcing detection.
2) Metric Calculation & Sorting (MCS): An M-level decode tree performs a Euclidean distance based statistical estimation of s.

PP orders the received symbols according to the perceived distortion experienced by each. This is achieved by reordering the columns of H to give an ordered channel matrix (the general form of which is illustrated in Fig. 2(a)) via an M-phase iterative process:

1) Calculate $\mathbf{W}_i$ according to (2), where $\mathbf{H}_i$ is the channel matrix with previously selected columns zeroed.

$$\mathbf{W}_i = \mathbf{H}_i^{\dagger} = (\mathbf{H}_i^H \mathbf{H}_i)^{-1} \mathbf{H}_i^H \tag{2}$$

2) The signal $s_k$ to be detected is selected according to

$$k = \begin{cases} \arg\max_j \|(\mathbf{H}_i^{\dagger})_j\|^2 & \text{if } n_i = P \\ \arg\min_j \|(\mathbf{H}_i^{\dagger})_j\|^2 & \text{if } n_i \neq P \end{cases}$$

where $(\mathbf{H}_i^{\dagger})_j$ denotes the $j$th row of $\mathbf{H}_i^{\dagger}$.

Post-ordering, groups of M symbols undergo detection via a tree-search structure illustrated in Fig. 2(b). The node distribution at each level in the tree is given by $\mathbf{n}_S = (n_1, n_2, \ldots, n_M)^T$. During the first $n_{fs}$ levels, the worst distorted symbols undergo Full Search (FS), where the search space is fully enumerated resulting in P child nodes at level i + 1 per node at level i, where P is the number of QAM constellation points. The remaining $n_{ss}$ ($n_{ss} = M - n_{fs}$) least distorted symbols subsequently undergo Single Search (SS), where only a single candidate detected symbol is maintained between layers. For full diversity, $n_{fs}$ is given by (3). At each MCS tree level, (4) and (5) are performed.

$$n_{fs} = \lceil \sqrt{M} - 1 \rceil \tag{3}$$
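To make the tree dimensions concrete, the sketch below evaluates the full-search depth and node distribution, reading (3) as n_fs = ceil(sqrt(M) - 1). The function name `fsd_tree_shape` and the convention that the full-search entries occupy the last positions of n_S (matching the column layout of Fig. 2(a)) are assumptions for illustration.

```python
import math

def fsd_tree_shape(M, P):
    """Return (n_fs, n_ss, n_S, candidates) for an M-stream, P-point FSD tree."""
    n_fs = math.ceil(math.sqrt(M) - 1)   # full-search levels, reading of (3)
    n_ss = M - n_fs                      # single-search levels
    # Assumed convention: positions 1..n_ss are single search (one child per
    # node), positions n_ss+1..M are full search (P children per node).
    n_S = [1] * n_ss + [P] * n_fs
    candidates = P ** n_fs               # paths left for the final sort
    return n_fs, n_ss, n_S, candidates

shape_4x4_16qam = fsd_tree_shape(M=4, P=16)
```

Under these assumptions a 4 × 4 16-QAM detector has one full-search level and three single-search levels, so 16 candidate paths reach the sorting stage for each subcarrier.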


Fig. 2. FSD Algorithm Components: (a) General form of the ordered channel matrix H, an N × M array whose columns 1 to $n_{ss}$ correspond to the least distorted symbols and columns $n_{ss}+1$ to M (the $n_{fs}$ full-search columns) to the most distorted symbols, annotated with the directions of increasing distortion and detection order; (b) FSD tree search structure, in which received symbols pass through 1) Preprocessing and 2) Metric Calculation & Sorting (MCS), comprising (i) Full Search (FS), (ii) Single Search (SS) and (iii) sorting of detected symbols.

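$$\tilde{s}_i = \hat{s}_{ZF,i} + \sum_{j=i+1}^{M_t} \frac{r_{ij}}{r_{ii}} \left( \hat{s}_{ZF,j} - \hat{s}_j \right) \tag{4}$$

$$d_i = \left| \sum_{j=i}^{M_t} r_{ij} \left( \hat{s}_{ZF,j} - \hat{s}_j \right) \right|^2, \quad D_i = d_i + D_{i+1} \tag{5}$$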
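The recursion in (4) and (5) can be sketched in Python as follows. This is an illustrative reading rather than the authors' implementation: it takes (4) as the ZF centre plus the back-substituted interference from already-sliced symbols, and (5) as the squared magnitude of row i of R applied to the difference between the ZF centre and the sliced symbols. The names `fsd_mcs` and `slice_symbol`, the example R and the noise-free test vector are hypothetical.

```python
# 16-QAM alphabet (P = 16); unnormalised, for illustration.
QAM16 = [complex(a, b) for a in (-3, -1, 1, 3) for b in (-3, -1, 1, 3)]

def slice_symbol(x, alphabet):
    """Slice x to the nearest constellation point."""
    return min(alphabet, key=lambda a: abs(x - a))

def fsd_mcs(R, s_zf, alphabet, n_fs):
    """Walk the FSD tree from level M down to level 1 and return (symbols, APED)."""
    M = len(R)
    paths = [({}, 0.0)]                  # (decisions keyed by level, APED so far)
    for i in range(M - 1, -1, -1):
        extended = []
        for decided, D_next in paths:
            # Interference from levels already decided; shared by (4) and (5).
            acc = sum(R[i][j] * (s_zf[j] - decided[j]) for j in range(i + 1, M))
            if i >= M - n_fs:
                choices = alphabet                         # full search
            else:
                s_tilde = s_zf[i] + acc / R[i][i]          # equation (4)
                choices = [slice_symbol(s_tilde, alphabet)]  # single search
            for s_hat in choices:
                d_i = abs(R[i][i] * (s_zf[i] - s_hat) + acc) ** 2  # PED, (5)
                trial = dict(decided)
                trial[i] = s_hat
                extended.append((trial, d_i + D_next))     # APED, (5)
        paths = extended
    best, aped = min(paths, key=lambda p: p[1])            # sorting stage
    return [best[i] for i in range(M)], aped

# Hypothetical upper-triangular R and a noise-free ZF centre.
R = [[2.0, 0.5 + 0.1j, -0.3j, 0.2],
     [0, 1.5, 0.4 - 0.2j, 0.1j],
     [0, 0, 1.2, -0.6 + 0.3j],
     [0, 0, 0, 0.8]]
s_true = [1 + 1j, -3 + 1j, 3 - 3j, -1 - 1j]
detected, aped = fsd_mcs(R, s_true, QAM16, n_fs=1)
```

With one full-search level the loop keeps 16 paths alive throughout, which is the fixed, data-independent workload that makes FSD attractive for parallel hardware.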

In (4) and (5), $M_t$ is the number of transmit antennas (M), $r_{ij}$ refers to an entry in R, obtained via QR decomposition of H during PP, $\hat{\mathbf{s}}_{ZF}$ is the center of the constrained FSD sphere and $\tilde{s}_j$ is the $j$th detected data, which is sliced to $\hat{s}_j$ in subsequent iterations of the detection process [20]. Since $D_{i+1}$ can be considered as the Accumulated Partial Euclidean Distance (APED) at


level j = i + 1 of the MCS tree and $d_i$ as the PED in level i, the APED can be obtained by recursively applying (5) from level i = M to i = 1. The resulting candidate symbols are sorted based on their Euclidean distance measurements, and the final result produced post-sorting. In 802.11n MIMO, this collective behaviour must be replicated 108 times, once per OFDM subcarrier employed.

FSD is amongst the lowest complexity, quasi-optimal SD algorithms known [21]; along with the highly parallel nature of FSD this has proven effective in enabling real-time FSD MCS [22]. However, three outstanding issues remain in real-time software-defined FSD for standards such as 802.11n:

1) Single-chip detectors remain elusive: there is no recorded real-time architecture which integrates both PP and MCS for all 108 802.11n OFDM subcarriers.
2) The high computational complexity of PP (the M iterations of the $O(M^3)$ pseudo-inverse in (2) result in an $O(M^4)$ algorithm) has, to date, prohibited real-time implementation of even PP.
3) Existing real-time MCS realisations [22] rely on custom dedicated circuits, whilst modern equipment design processes require software-defined architectures.

Whilst technologies such as Field Programmable Gate Array (FPGA) are computationally capable of hosting real-time MCS at least, two key issues currently prevent software-defined FPGA architectures from resolving this problem:

1) Software-defined FPGA architectures, e.g. [23], [24], are too costly and low performance to meet the real-time demands of 802.11n FSD.
2) FSD detection of all 108 OFDM subcarriers in 802.11n is a large scale operation, requiring a highly scalable processing architecture.

To resolve this issue, a new approach to software-defined realisation of SD is required.
This paper presents a unique solution which demonstrates the viability of real-time, software-defined MIMO detection on FPGA, by realising FSD PP, MCS and full detector architectures which meet the 480 Mbps, 4 µs latency requirements of 802.11n. Further, we show how the resulting realisations exhibit cost comparable to custom circuit solutions. Section III describes the processing architecture exploited, before its effectiveness for FSD PP, MCS and full detection are described in Sections IV, V and VI respectively.

III. THE FPGA-BASED PROCESSING ELEMENT (FPE)

The emergence of components such as the DSP48E on recent generations of Xilinx FPGA offers unprecedented levels of computational capacity enclosed in programmable datapath components. Their programmability implies the need for data storage and circuitry for datapath control, but despite modern FPGA housing plentiful resources with which to realise these, in the form of Look Up Tables (LUTs) and Block RAM (BRAMs), existing FPGA processors are typically resource hungry and performance limited. Hence whilst modern FPGA house very high levels of programmable computational capacity, software-defined architectures capable of exploiting these resources are lacking. A unique, lean processing architecture known as the FPGA Processing Element (FPE) is proposed to resolve this deficiency. The architecture of the FPE is shown in Fig. 3.

Fig. 3. The FPE architecture (components: PC, PM, ID, RF, IMM, DSP48E-based ALU, DM, COMM, coprocessor, zero-overhead loop, branch detection and branch control; pipeline stages: Instruction Fetch, ID/RF, Source Select, EXE1, EXE2, Result Select, Write Back).

The FPE houses the minimum set of resources required for programmable operation: the instructions pointed to by the Program Counter (PC) are loaded from Program Memory (PM) and decoded by the Instruction Decoder (ID). Data operands are read either from the Register File (RF) or, in the case of immediate data, from Immediate Memory (IMM), and processed by the ALU (implemented using a Xilinx DSP48E). In addition, a Data Memory (DM) is used for bulk data storage and a Communication Adapter (COMM) performs on/off-FPE communications.

The FPE is configurable such that its architecture can be customised pre-synthesis in terms of the aspects listed in Table I; in addition, the ALU can be extended with custom coprocessors to accelerate critical operations. Further, the FPE is programmable via the instruction set described in Table II; it is currently programmed manually at the assembly level, and the instruction set is extensible to incorporate new instructions for specific coprocessors. When implemented on Xilinx Virtex 5 VLX110T FPGA, the computational capability and cost of six FPE configurations - 16 bit Real (16R), 32 bit Complex (32C) and 32 bit Real (32R) variants - are as described in Table III (see footnote 1).
TABLE I: FPE CONFIGURATION PARAMETERS

Parameter            Meaning                                        Values
DataWidth            Data wordsize                                  16/32 bits
DataType             Type of data                                   Real/complex
ALUWidth             No. DSP48E slices                              1-4
PMDepth/PMWidth      PM Capacity/Instrn. Width                      Unlimited
DMDepth/RFDepth      DM/RF Capacity                                 Unlimited
TxCOMM/RxCOMM        No. Tx/Rx ports                                1024
IMMDepth/IMMWidth    No./size immediate data memory locations       Unlimited

1 All synthesis results are post place-and-route, employing flat criteria, with neither speed nor area prioritized.


TABLE II: FPE INSTRUCTION SET (instruction classes: CTRL, ALU, MEM)

Instruction            Function
LOOP/RPT               loop/repeat
BEQ/BGT/BLT            branch if equal/greater/less
JMP                    jump
GET/PUT                load/push data from/to channel
GETCH/CLRCH            load data from/clear channels
NOP                    no operation
MUL/ADD/SUB            multiply/add/subtract
MULADD/MULSUB(FWD)     multiply-add/subtract (& forward)
COPROC                 coprocessor access
LD/ST                  load/store data from/to memory
LDIMM/STIMM            load/store data from/to IMM
LDIAR                  update IMM address register

Table III describes a range of performance/cost metrics that, to the best of the authors' knowledge, are unmatched in any other software-defined FPGA architecture; for instance, the FPE occupies only 18% of the resource of a conventional MicroBlaze processor, whilst enabling a factor 2.8 increase in computational capacity. These metrics imply that the FPE is the most likely of any software-defined FPGA architecture to support real-time FSD. Sections IV - VI examine implementation of PP, MCS and full detector architectures using the FPE processing paradigm.
TABLE III: FPE ARITHMETIC PERFORMANCE

Config   LUTs   DSP48Es   Latency (Cycles)   Clock (MHz)   Throughput (MMACs/s)
16R      90     1         4                  483           483
16C      132    1         7                  476           119
16C      172    2         5                  453           226.5
16C      140    4         5                  474           474
32R      185    2         6                  431           215.5
32R      182    3         7                  431           431
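As a small worked reading of Table III, the Clock and Throughput columns (taken in row order) imply how many clock cycles each configuration spends per multiply-accumulate. The sketch below only divides one column by the other; the variable names are illustrative and no additional data is assumed.

```python
# Clock (MHz) and Throughput (MMACs/s) columns of Table III, in row order.
clock_mhz = [483, 476, 453, 474, 431, 431]
throughput_mmacs = [483, 119, 226.5, 474, 215.5, 431]

# Cycles per MAC implied by each row: clock rate divided by MAC rate.
cycles_per_mac = [c / t for c, t in zip(clock_mhz, throughput_mmacs)]
```

Three of the six rows sustain one MAC per cycle, while the others need two or four cycles, which is worth bearing in mind when sizing multi-FPE arrays against a real-time operation count.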


IV. FPE-BASED PRE-PROCESSING USING SQRD

Section II described how the complexity of SD PP has, to date, prohibited real-time implementation for MIMO systems such as 4 × 4 802.11n. However the recent emergence of sub-optimal PP algorithms such as Sorted QRD (SQRD) [25], [26] can potentially overcome this issue. This section tests the ability of the FPE to support real-time SQRD-based PP for FSD.

SQRD-based ordering for FSD transforms the input channel matrix H to the product of a unitary matrix Q and an upper-triangular R via QR decomposition, whilst deriving order, the order of detection of the received symbols during MCS. It operates in two phases, as described in Algorithm 1 [26]. In Phase 1, Q, R, order, norm and $n_{fs}$ are initialized as shown in lines 2-5 of Algorithm 1, where $\mathbf{q}_i$ is the $i$th column of Q. Phase 2 comprises M iterations, in each of which the $k$th lowest entry in norm is identified (lines 9 & 10) before the corresponding column of R and elements in order and norm are permuted with the $i$th (line 11) and orthogonalized (lines 12-18). The resulting Q, R, and order are used for FSD MCS as defined in (4) and (5). Note the merged ordering and QRD of H in Phase 2; this avoids the M iterations of QRD in V-BLAST, enabling an observed order-of-magnitude complexity reduction for FSD PP.

The quasi-ML accuracy of floating-point SQRD-based FSD is demonstrated in [26]; however, since FPE-based realisation requires reduced-precision fixed-point arithmetic, similar verification under these conditions is required. To this end, the BER performance of SQRD-based FSD detection of a 4 × 4 16-QAM Rayleigh Fading MIMO channel has been evaluated for 32, 24 and 16 bit fixed-point variants and is compared with the ideal floating-point version in Fig. 4 (see footnote 2). As Fig. 4 shows, 16 bit arithmetic in the SQRD phase is sufficient to enable detection performance almost equal to that of the ideal floating-point solution, particularly when integer wordsizes of 9 or 10 bits are employed. Hence, FPE-based realisation is viable and 16 bit fixed-point arithmetic with a 10 bit fractional part is chosen for FPE-based SQRD.

Despite its relatively low complexity and suitability for fixed-point implementation, there are two major issues that must be resolved to enable FPE-based SQRD PP for 4 × 4 802.11n:

1) SQRD remains highly computationally demanding, as outlined in Table IV; given the capabilities of a single FPE, it appears that a large-scale multi-FPE architecture is required to enable SQRD for 4 × 4 802.11n.
2) The square root (line 12) and division (line 13) operations used in SQRD offer very low performance on sequential processors [27]; special consideration of these is required for real-time PP for 802.11n MIMO.
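The two phases of Algorithm 1 can be sketched in floating-point Python as below. This is an illustrative reading, not the fixed-point FPE program: the function name `sqrd_fsd`, the 0-based indexing, the use of |r|^2 in the norm update so that complex entries are handled, and the example matrix are all assumptions.

```python
import math

def sqrd_fsd(H):
    """Sorted QRD with FSD ordering; H is an N x M matrix given as a list of rows."""
    N, M = len(H), len(H[0])
    # Phase 1: initialisation.
    Q = [[complex(x) for x in row] for row in H]
    R = [[0j] * M for _ in range(M)]
    order = list(range(M))
    n_fs = math.ceil(math.sqrt(M) - 1)
    norm = [sum(abs(Q[n][c]) ** 2 for n in range(N)) for c in range(M)]

    def swap_columns(matrix, a, b):
        for row in matrix:
            row[a], row[b] = row[b], row[a]

    # Phase 2: merged ordering and orthogonalisation.
    for i in range(M):
        k = min(n_fs + 1, M - i)              # rank to select (1-based)
        ranked = sorted(range(i, M), key=lambda c: norm[c])
        k_i = ranked[k - 1]                   # column with the k-th lowest norm
        swap_columns(Q, i, k_i)
        swap_columns(R, i, k_i)
        order[i], order[k_i] = order[k_i], order[i]
        norm[i], norm[k_i] = norm[k_i], norm[i]
        R[i][i] = math.sqrt(norm[i])          # square root step
        for n in range(N):
            Q[n][i] /= R[i][i]                # division step
        for l in range(i + 1, M):
            R[i][l] = sum(Q[n][i].conjugate() * Q[n][l] for n in range(N))
            for n in range(N):
                Q[n][l] -= R[i][l] * Q[n][i]
            norm[l] -= abs(R[i][l]) ** 2
    return Q, R, order

# Hypothetical well-conditioned 4 x 4 channel matrix.
H = [[2 + 1j, 0.5, -0.3j, 0.1],
     [0.2, 1.5 - 0.5j, 0.4, 0.3j],
     [0.1j, -0.2, 1.8, 0.5 + 0.2j],
     [0.3, 0.1j, -0.4, 1.2 + 0.7j]]
Q, R, order = sqrd_fsd(H)
```

Each outer iteration uses exactly one square root and one reciprocal-style division, which is why these two operations are singled out for acceleration even though multiplications dominate the operation count.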

A. FPE-based Division Acceleration

Binary division is usually achieved using digital recurrence or convergence algorithms [27]. Of these alternatives, recurrence algorithms generally exhibit lower complexity and latency, and hence are usually preferred. Non-restoring
2 For clarity, the behaviours under 32 and 24 bit fixed point conditions are omitted from Fig. 4 due to the high detection performance of 16 bit solutions.


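Algorithm 1: Sorted QR decomposition for FSD

input : $\mathbf{H}$, $M$
output: $\mathbf{Q}$, $\mathbf{R}$, order

 1  Phase 1: Initialization
 2  $\mathbf{Q} = \mathbf{H}$, $\mathbf{R} = \mathbf{0}_M$, order $= [1, \ldots, M]$,
 3  $n_{fs} = \lceil \sqrt{M} - 1 \rceil$
 4  for $i \leftarrow 1$ to $M$ do
 5      norm$_i = \|\mathbf{q}_i\|^2$
 6  end
 7  Phase 2: SQRD ordering
 8  for $i \leftarrow 1$ to $M$ do
 9      $k = \min(n_{fs} + 1, M - i + 1)$
10      $k_i = \arg\min^{k}_{j=i,\ldots,M}$ norm$_j$ (index of the $k$th lowest entry)
11      Exchange columns $i$ and $k_i$ in R, order, norm and Q
12      $r_{i,i} = \sqrt{\text{norm}_i}$
13      $\mathbf{q}_i = \mathbf{q}_i / r_{i,i}$
14      for $l \leftarrow i + 1$ to $M$ do
15          $r_{i,l} = \mathbf{q}_i^H \mathbf{q}_l$
16          $\mathbf{q}_l = \mathbf{q}_l - r_{i,l}\,\mathbf{q}_i$
17          norm$_l$ = norm$_l - r_{i,l}^2$
18      end
19  end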

recurrence algorithms generally enable higher performance FPGA implementations by avoiding sophisticated control overhead [28]. Non-restoring 16 bit division [27] requires 312 cycles on a 16R FPE. This equates to approximately 1.2 MDIV/s (millions of divisions per second), which means that achieving the 120 MDIV/s required by real-time 4 × 4 SQRD for 802.11n would require at least 100 FPEs dedicated solely to division. The high resource cost such a solution could entail may potentially be avoided by exploiting the configurable nature of the FPE, specifically its ability to incorporate coprocessors within the ALU, to accelerate FPE-based division. Radix-2/4 non-restoring division coprocessors [27] are considered in this context - the structure of these coprocessors is outlined in Fig. 5. The performance, cost and efficiency (in terms of throughput per LUT, or TP/LUT) of the programmed FPE implementation (FPE-P) and radix-2/4 coprocessor augmented FPEs (FPE-R2, FPE-R4) when implemented on Virtex 5 FPGA are described in Table V. As this shows, FPE-R2 and FPE-R4 increase


Fig. 4. 4 × 4 16-QAM Fixed Point Simulation (plot title: 4x4 16QAM FSD SQRD ORDERING Fixed Point Simulation; BER versus SNR (dB); curves for 16 bit arithmetic with 8 to 13 fraction bits and for floating point).

TABLE IV: 4 × 4 SQRD OPERATIONAL COMPLEXITY

Operation    No. per second (×10^9)
+/−          3.24
×            12.72
÷            0.12
√            0.12

throughput by factors of 8.9 and 13.3 and hardware efficiency by factors of 9.4 and 10.7 respectively, as compared to FPE-P. Given the need for 120 MDIV/s for SQRD-based detection of 4 × 4 802.11n MIMO systems, the implied implementation cost and performance metrics of each option are summarised in Table V. This suggests that FPE-R2 represents the lowest cost real-time solution, enabling a 93.4% reduction in resource cost relative to FPE-P. Accordingly, this approach is adopted in the FPE-based SQRD implementation.

B. FPE-based Acceleration Of Square Root Operations

Implementing square root operations poses a similar problem to that of division - 120 MSQRT/s (million square root operations per second) are required for real-time SQRD-based detection of a 4 × 4 802.11n system. There are two primary options for achieving this: software-based execution on the native FPE, using the pencil-and-paper method [27], or use of a standard CORDIC component available in vendor IP libraries [29]. The programmed solution (FPE-P) is compared with that incorporating the CORDIC coprocessor (FPE-C) in Table VI. As this shows, FPE-C offers simultaneous increases in throughput and efficiency by factors of 23 and 10 respectively as compared

Fig. 5. SQRD Divider Coprocessor Architecture (quotient register, partial remainder, divisor with add/subtract complement, 16-bit adder; one quotient bit for radix-2, or two for radix-4, obtained per iteration).

TABLE V: SQRD DIVISION IMPLEMENTATIONS

Solution   FPEs   DSP48Es   LUTs     Throughput (MDiv/s)
FPE-P      100    100       13,600   120
FPE-R2     5      5         900      120
FPE-R4     4      4         944      144

to FPE-P. The resources required to realise real-time square-root for SQRD-based detection of 4 × 4 802.11n are summarised in Table VII. Hence FPE-C enables real-time operation whilst incurring only 11% of the resource required by FPE-P, and is adopted for realising FPE-based square root operations.
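For readers unfamiliar with the two arithmetic kernels discussed in Sections IV-A and IV-B, the sketch below gives integer versions of radix-2 non-restoring division and the digit-by-digit ("pencil-and-paper") square root. It is neither the FPE assembly routine nor the coprocessor design of [27]; the function names, unsigned integer operands and 16 bit default wordlength are assumptions chosen for illustration.

```python
def nonrestoring_divide(dividend, divisor, bits=16):
    """Unsigned radix-2 non-restoring division; one quotient bit per iteration."""
    if divisor <= 0 or not 0 <= dividend < (1 << bits):
        raise ValueError("operands out of range")
    remainder, quotient = 0, 0
    for shift in range(bits - 1, -1, -1):
        bit = (dividend >> shift) & 1
        # Shift in the next dividend bit, then subtract the divisor if the
        # partial remainder is non-negative, otherwise add it (no restore).
        if remainder >= 0:
            remainder = 2 * remainder + bit - divisor
        else:
            remainder = 2 * remainder + bit + divisor
        quotient = (quotient << 1) | (1 if remainder >= 0 else 0)
    if remainder < 0:
        remainder += divisor       # single final correction of the remainder
    return quotient, remainder

def digit_by_digit_isqrt(value, bits=16):
    """Binary digit-by-digit square root; returns floor(sqrt(value))."""
    if not 0 <= value < (1 << bits):
        raise ValueError("operand out of range")
    root, remainder = 0, value
    bit = 1 << (bits - 2)          # highest power of four inside the word
    while bit:
        if remainder >= root + bit:
            remainder -= root + bit
            root = (root >> 1) + bit
        else:
            root >>= 1
        bit >>= 2
    return root

q, r = nonrestoring_divide(1000, 7)    # 1000 = 7 * 142 + 6
root = digit_by_digit_isqrt(1000)      # 31 * 31 = 961 <= 1000 < 1024
```

Both loops produce a fixed number of result bits per iteration with only shifts, adds and compares, which is why their software cost scales with the wordlength and why a small coprocessor can shorten them substantially.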
TABLE VI: COMPARISON OF 16 BIT PSQRT, CSQRT ON FPE

Metric                    FPE-P [27]   FPE-C [29]
PM/RF locations           29/14        8/1
Cost: LUTs                142          330
Cost: DSP48Es             1            0
Clock (MHz)               367.7        350
Latency (Cycles)          191          8
Throughput (MSQRT/s)      1.93         43.6
TP/LUT (×10^-3)           13.6         132.1


TABLE VII: FPE-BASED SQRT IMPLEMENTATIONS

Solution   FPEs   DSP48Es   LUTs   Throughput (MSqrt/s)
FPE-P      63     63        8946   121.6
FPE-C      3      3         990    130.8
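The entries of Table VII follow from the per-FPE figures in Table VI and the 120 MSQRT/s requirement. The short calculation below reproduces them; the helper name `fpes_required` and the dictionary layout are illustrative, while the numbers are those quoted in Tables VI and VII.

```python
import math

def fpes_required(required_mops, per_fpe_mops):
    """Smallest number of FPEs whose combined rate meets the requirement."""
    return math.ceil(required_mops / per_fpe_mops)

REQUIRED_MSQRT = 120.0   # real-time requirement for 4 x 4 802.11n SQRD

# Per-FPE throughput (MSQRT/s) and LUT cost, from Table VI.
per_fpe = {"FPE-P": (1.93, 142), "FPE-C": (43.6, 330)}

sizing = {}
for name, (throughput, luts) in per_fpe.items():
    count = fpes_required(REQUIRED_MSQRT, throughput)
    sizing[name] = (count, count * luts, round(count * throughput, 1))

lut_ratio = sizing["FPE-C"][1] / sizing["FPE-P"][1]   # about 0.11
```

The LUT ratio of roughly 0.11 corresponds to the 11% resource figure quoted for FPE-C relative to FPE-P.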

C. FPE-based SQRD

Whilst Sections IV-A and IV-B have provided FPE-based components of sufficiently high performance and low cost to enable real-time division and square-root, integrating these components into a coherent processing architecture to perform SQRD, and replicating that behaviour to provide PP for the 108 subcarriers of 802.11n MIMO, is a large scale, challenging implementation problem. Fig. 6(a) describes the SQRD algorithm as a flow chart composed of four main iterative tasks (T1, T2.1 - T2.3). The first task, T1, conducts the iterative channel norm ordering, and computes the diagonal elements of R (lines 11 - 13 in Algorithm 1), with the subsequent concurrent tasks T2.1 - T2.3 permuting and updating Q, R and norm respectively (lines 14 - 18 in Algorithm 1).

To realise this behaviour a 4-FPE Multiple Instruction, Multiple Data (MIMD) processing architecture, illustrated in Fig. 6(b), is used; the FPEs employ 16 bit datapaths, in accordance with the analysis in Section IV, and are otherwise configured as described in Table VIII(a). FPE1 - FPE3 perform permutation of Q, R and norm and iterative updating (T2.1 - T2.3 in Fig. 6(a)), whilst FPE4 calculates the diagonal elements of R (T1). Across the architecture, SQRD-based PP of a 4 × 4 matrix occurs in three phases. Initially, H and the calculation of norm are distributed amongst the FPEs, with the separate parts of norm gathered by FPE4 to undergo ordering, division and square root. These resulting metrics are distributed to the outer FPEs for independent iterative permutation and update of Q, R and norm. Inter-FPE communication occurs via point-to-point FIFO links, chosen due to their relatively low cost on FPGA and implicit ability to synchronize the multi-FPE architecture in a data-driven manner whilst avoiding data access conflicts. The performance and cost of the 4-FPE grouping are given in Table VIII(b).

According to the metrics quoted in Table VIII(b), the throughput of each 4-FPE group is sufficient to support SQRD-based PP of 3 subcarriers within the real-time constraints of 802.11n. Hence, to implement PP for all 108 subcarriers of 802.11n, the architecture illustrated in Fig. 7, incorporating 36 groups of the 4-FPE array, is used. The mapping of subcarriers to groups is as described in Fig. 7.

D. Implementation Evaluation

When implemented on Xilinx Virtex 5 VSX240T FPGA, the cost and performance of the 802.11n PP architecture (FPE-SQRD) described in Fig. 7 are as quoted in Table IX. These results are notable since, to the best of the authors'


Fig. 6. 4 × 4 SQRD FPE-MIMD Mapping: (a) 4 × 4 SQRD flow chart, with task T1 (norm ordering and computation of $r_{i,i}$ and $\mathbf{q}_i$) followed by tasks T2.1, T2.2 and T2.3 (permutation and update of R, Q and norm), iterated for i = 1 to 4; (b) 4 × 4 SQRD architecture, with T1 mapped to FPE4 and T2.1 - T2.3 mapped to FPE1 - FPE3.

TABLE VIII: 4-FPE BASED SQRD

(a) FPE Configuration

Parameter   PM Depth   RFDepth   IMMDepth   DMDepth   TxComm   RxComm
Value       350        32        32         64        32       32

(b) FPE-SQRD Metrics

Aspect   LUTs   DSP48Es   BRAMs   Clock (MHz)   T (MSQRD/s)   Latency (µs)
Value    2109   4         0       3.15          1.07          0.9

knowledge, they constitute the only recorded real-time SQRD PP and FSD PP architectures for 4 × 4 MIMO. They achieve 32.5 MSQRD/s (millions of SQRD operations per second), exceeding the required 30 MSQRD/s. Given that they are unique in enabling real-time PP for SD-based detection for 4 × 4 MIMO, comparing with other PP realisations is difficult - whilst a number of SD implementations rely on SQRD-based PP, such as [31], [32], they do not implement it. Further, whilst the work in [33] describes an SQRD implementation, balanced objective comparison is difficult since it reports only the resource cost for an ASIC-based 2 × 2 SQRD, giving no measure of real-time performance. Table IX compares FPE-SQRD with a custom circuit-based V-BLAST detector [30]. The


Subcarrier Subcarrier Subcarrier Subcarrier Subcarrier Subcarrier 5 6 2 3 4 1 FPE1

Subcarrier Subcarrier Subcarrier 108 106 107 FPE2

FPE4 FPE4
FPE1 FPE2 FPE2 FPE1

FPE1

FPE2

FPE3

FPE4 FPE4 FPE4 FPE4

FPE4 FPE4

Core 36

FPE3

FPE3

FPE3

Core 1

Core 2

Core 3

Fig. 7.

4 4 SQRD Mapping TABLE IX 4 4 SQRD I MPLEMENTATIONS

Ref LUTs Resource DSP48Es BRAMs ELUTs (103 ) Clock (MHz) T (MSQRD/s) L (S)

FPE-SQRD 70,560 144 0 152.5 265 32.5 1.1

[30] 33,512 426 N/A 276.0 87 22 1.43

implementation in [30] does not achieve the required throughput for real-time processing and whilst FPE-SQRD consumes signicantly more LUT resource, it signicantly reduces the demand for DSP48E resources by 66%; combining these relates to an overall reduction of 45% in terms of Equivalent LUTs (ELUTs)3 , whilst enabling an increased throughput of almost 50%. Hence, this analysis shows that the FPE has enabled the only software-dened real-time PP architecture for 4 4 16-QAM 802.11n FSD, whilst incurring resource costs comparable to existing circuit architectures. Indeed, further, the FPE-SQRD it is the only real-time realisation of any kind. Section V investigates its ability to support the second major suboperation of FSD: MCS.
Footnote 3: ELUTs combine the measurement of resources on modern Xilinx FPGAs into a single quantity; the reader is referred to [34] for details.
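The percentages quoted in the comparison with [30] can be re-derived from the Table IX entries. The short Python sketch below does that arithmetic; the dictionary keys and helper names are hypothetical, and the numbers are simply transcribed from Table IX.

```python
# Values transcribed from Table IX (ELUTs in thousands, throughput in MSQRD/s).
FPE_SQRD = {"dsp48e": 144, "eluts_k": 152.5, "throughput": 32.5}
REF_30 = {"dsp48e": 426, "eluts_k": 276.0, "throughput": 22.0}

REQUIRED_MSQRD_PER_S = 30.0  # real-time requirement stated in the text


def percent_reduction(new, old):
    """Percentage by which `new` is smaller than `old`."""
    return 100.0 * (1.0 - new / old)


def percent_increase(new, old):
    """Percentage by which `new` is larger than `old`."""
    return 100.0 * (new / old - 1.0)


dsp_saving = percent_reduction(FPE_SQRD["dsp48e"], REF_30["dsp48e"])
elut_saving = percent_reduction(FPE_SQRD["eluts_k"], REF_30["eluts_k"])
throughput_gain = percent_increase(FPE_SQRD["throughput"], REF_30["throughput"])

if __name__ == "__main__":
    print(f"DSP48E reduction:    {dsp_saving:.1f}%")
    print(f"ELUT reduction:      {elut_saving:.1f}%")
    print(f"Throughput increase: {throughput_gain:.1f}%")
    print("FPE-SQRD real-time:", FPE_SQRD["throughput"] >= REQUIRED_MSQRD_PER_S)
    print("[30] real-time:     ", REF_30["throughput"] >= REQUIRED_MSQRD_PER_S)
```

The three results round to the 66%, 45% and "almost 50%" figures given in the text, and only the FPE-SQRD row clears the 30 MSQRD/s requirement.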

June 22, 2012

DRAFT


V. FPGA-BASED FSD MCS FOR 802.11N

The MCS stage of FSD for 4 × 4 16-QAM 802.11n is even more computationally demanding than SQRD-based PP, as described in Table X. When a single 4 × 4 16-QAM FSD MCS is implemented on a 16R FPE, the performance and cost are as reported under 16R-MCS in Table XI.

TABLE X
802.11N FSD OPERATIONAL COMPLEXITY

Operation    No. per second (×10^9)
+/-          32.37
×            19.20

TABLE XI
FPE-BASED MCS IMPLEMENTATIONS

                      16R-MCS    16R + Coprocessors
PM/RF Locations       4591/32    1420/32
Cost: LUTs            2520       805
Cost: DSP48Es         1          0
Clock (MHz)           367.7      350
Latency (Cycles)      3281       1420
Throughput (MOP/s)    1.9        4.5

Table XI reports a large increase in resource cost for 16R-MCS as compared to the basic 16R reported in Table III. The observed order-of-magnitude increase is a consequence of the large PM needed to house the 4591 instructions. A significant contributor to this instruction count is the set of comparison operations required for slicing (equation (4)) and for sorting the PED metrics, which require branch instructions. Associated with these branch instructions are NOPs, whose number is swollen by the FPE's deep pipeline [12]; the wasted cycles these NOPs represent dramatically increase cost and reduce throughput. Indeed, branch and NOP instructions represent 50.7% of the total number of instructions. As a result, optimising the FPE architecture to reduce the impact of these branch instructions could have a significant impact on the MCS cost/performance.

Employing ALU coprocessors, in a manner similar to that described in Section IV to accelerate division and square root operations, can significantly reduce these penalties. A SWITCH coprocessor (Fig. 8(a)), which compares the input to one of a number of pre-defined options, can be used to accelerate slicing, whilst a MIN coprocessor (Fig. 8(b)) can accelerate the sorting operation. Each of these coprocessors costs 20 LUTs, but their ability to eliminate wasted instructions can significantly reduce the PM size, leading to an overall cost decrease and performance increase when these coprocessors are used, as described in column 3 of Table XI. As this shows, including these components results in a 68% reduction in resource cost and a factor 2.3 increase in throughput. This produces an implementation capable of realising FSD MCS for a single 802.11n subcarrier in real-time, providing a good foundation unit for implementing MCS for all 108 subcarriers.

Fig. 8. FPE Coprocessors for Switch and Min Acceleration: (a) Switch Coprocessor; (b) Min Coprocessor.

A. SIMD-based Implementation of 802.11n FSD MCS

A large, coherent collection of FPEs is required to implement FSD MCS for all 108 subcarriers of 802.11n MIMO. Two important observations of the application's behaviour help guide the choice of multiprocessing architecture:

1) In the tree-structured FSD MCS (Fig. 2(b)), each tree branch performs an identical sequence of operations on distinct data streams: the definition of Single Instruction Multiple Data (SIMD) behaviour.
2) Implementing MCS for all 108 OFDM subcarriers on a single, very wide SIMD processor would require so many FPEs that the achievable clock rate is limited by the high signal fan-out needed to broadcast instructions from a central PM to a very large number of ALUs, restricting performance [11]. Hence, a collection of smaller SIMDs is used.

To enable these multi-SIMD architectures, the FPE is used as a foundation for a configurable SIMD processor component, as illustrated in Fig. 9. Note that the PC, PM, ID and IMM are now all centralised in the SIMD, and hence do not appear in each FPE way. Table XII defines the configurable aspects of the SIMD processor. All of the FPE instructions (except BEQ, BGT and BLT) can be used as SIMD instructions.
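Because the SIMD ways share one instruction stream and cannot use BEQ, BGT or BLT, slicing and sorting have to be expressed without per-way branches, which is the role the text gives to the SWITCH and MIN coprocessors. The Python sketch below is a software analogy of that idea only, not a model of the hardware: the amplitude levels `LEVELS`, the decision thresholds `THRESHOLDS` and all function names are assumptions made for illustration.

```python
# Assumed real-valued 16-QAM amplitude levels and decision thresholds.
LEVELS = (-3, -1, 1, 3)
THRESHOLDS = (-2, 0, 2)


def switch_slice(x):
    """Branch-free slicing: count the thresholds exceeded, then select a level."""
    index = sum(x > t for t in THRESHOLDS)  # booleans add up as 0 or 1
    return LEVELS[index]


def simd_slice(lanes):
    """Apply the same slicing operation to every lane, SIMD style."""
    return [switch_slice(x) for x in lanes]


def min_tree(values):
    """Pairwise minimum reduction, analogous to repeated MIN operations."""
    vals = list(values)
    while len(vals) > 1:
        reduced = [min(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:                   # carry an unpaired element forward
            reduced.append(vals[-1])
        vals = reduced
    return vals[0]


if __name__ == "__main__":
    received = [-3.7, -0.2, 0.4, 2.6]       # hypothetical per-lane inputs
    sliced = simd_slice(received)
    # Distance of each lane's input from its sliced symbol, tagged with the lane.
    distances = [((x - s) ** 2, lane)
                 for lane, (x, s) in enumerate(zip(received, sliced))]
    print("sliced symbols:", sliced)
    print("smallest distance (value, lane):", min_tree(distances))
```

Every lane executes the same sequence of comparisons and selections, so no lane needs its own program counter; this is the property that makes the operation compatible with a centralised PM.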

Fig. 9. SIMD Processor Architecture (a shared PC, PM, ID and IMM serving several FPE ways, each with its own RF and ALU).


TABLE XII
SIMD PROCESSOR CONFIGURATION PARAMETERS

Parameter            Meaning                       Values
SIMDways             No. parallel FPE elements     Unlimited
IMMDepth/IMMWidth    No./width of IMM locations    Unlimited
PMDepth/PMWidth      No./width of PM locations     Unlimited

The increasingly limiting effect of instruction broadcast from the central PM results in 16-way SIMD configurations offering the best cost/performance balance; accordingly, FSD MCS for all 108 802.11n subcarriers is implemented on a dual-layer network of such processors, as illustrated in Fig. 10. Level 1 consists of 8 SIMDs. The 802.11n subcarriers are clustered into 8 groups $\{G_i = \{j : (j-1) \bmod 8 = i\}_{j=1}^{108}\}_{i=0}^{7}$, where $G_i$ is the set of subcarriers processed by Core $i$. The 16 branches of the MCS tree for each subcarrier are processed in parallel across the 16 ways of the Level 1 SIMD onto which they have been mapped. Sorting for the subcarriers implemented in each Level 1 SIMD is performed by adjacent pairs of ways in the Level 2 SIMD; hence, given the 8 Level 1 SIMDs, the Level 2 SIMD is composed of 16 ways. The analysis in [22] shows that 16-bit data is sufficient for FSD-based detection of 4 × 4 16-QAM 802.11n, hence each FPE is configured to exploit 16-bit real-valued arithmetic. All processors exploit PMDepth = 128, RFDepth = 32 and DMDepth = 0, and communication between the two levels exploits 8-element FIFO queues. The Level 1 SIMDs incorporate SWITCH coprocessors to accelerate the slicing operation, whilst the Level 2 SIMD supports the MIN ALU extension to accelerate the sort operation.
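The grouping rule and the way counts in this paragraph can be enumerated directly. The following Python sketch applies the rule (j - 1) mod 8 = i to subcarriers 1 to 108 and tallies the SIMD ways; the constant and function names are hypothetical, and the total assumes one FPE way per MCS branch in Level 1 and two Level 2 ways per Level 1 SIMD, as the text describes.

```python
NUM_SUBCARRIERS = 108
NUM_LEVEL1_SIMDS = 8
WAYS_PER_LEVEL1_SIMD = 16          # one way per branch of the MCS tree
LEVEL2_WAYS_PER_LEVEL1_SIMD = 2    # an adjacent pair of ways sorts for each core


def subcarrier_groups(num_subcarriers=NUM_SUBCARRIERS, num_cores=NUM_LEVEL1_SIMDS):
    """Return groups[i] = subcarriers j (1-based) with (j - 1) mod num_cores == i."""
    return [
        [j for j in range(1, num_subcarriers + 1) if (j - 1) % num_cores == i]
        for i in range(num_cores)
    ]


groups = subcarrier_groups()
level2_ways = NUM_LEVEL1_SIMDS * LEVEL2_WAYS_PER_LEVEL1_SIMD
total_ways = NUM_LEVEL1_SIMDS * WAYS_PER_LEVEL1_SIMD + level2_ways

if __name__ == "__main__":
    for core, members in enumerate(groups):
        print(f"Core {core}: {len(members)} subcarriers, "
              f"first {members[:2]}, last {members[-1]}")
    print("Level 2 ways:", level2_ways)
    print("Total SIMD ways:", total_ways)
```

Under this rule Cores 0 to 3 each receive 14 subcarriers and Cores 4 to 7 receive 13, and the tally comes to 144 ways across both levels.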

Fig. 10. 802.11n OFDM MCS-SIMD Mapping (Level 1: 8 × 16-way SIMD cores, Core 0 to Core 7, fed by subcarriers 1 to 108; Level 2: one 16-way SIMD).

The program flow for each Level 1 SIMD is illustrated in Fig. 11(a). As this shows, each FPE performs a single branch of the MCS tree, with the empty parts of the program flow representing NOP instructions, used to properly synchronise movement of data into and out of memory. These NOP cycles represent 29% of the total instruction count; since they are ALU idle cycles, they should preferably be eliminated. To achieve this, the NOP cycles in one branch can be occupied by useful, independent instructions from another, i.e. the branches may be interleaved, as illustrated in Fig. 11(b). When two branches are interleaved in this way, the proportion of wasted cycles is reduced to 4%.
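The effect of interleaving can be illustrated with a toy issue model. In the Python sketch below, each branch is treated as a chain of dependent instructions whose result becomes available a fixed number of cycles after issue, so a lone branch must idle between instructions while a second branch can use those idle slots. The latency model and the names `issue_schedule` and `nop_fraction` are assumptions for illustration; the sketch does not attempt to reproduce the 29% and 4% figures measured on the FPE.

```python
def issue_schedule(num_threads, instrs_per_thread, latency):
    """Cycle-by-cycle issue order for `num_threads` independent instruction chains.

    Each chain may issue its next instruction only `latency` cycles after its
    previous one. A cycle in which no chain is ready is recorded as "NOP".
    """
    ready_at = [0] * num_threads                  # earliest issue cycle per chain
    remaining = [instrs_per_thread] * num_threads
    timeline = []
    cycle = 0
    while any(remaining):
        for t in range(num_threads):
            if remaining[t] and ready_at[t] <= cycle:
                timeline.append(f"T{t}")          # chain t issues this cycle
                remaining[t] -= 1
                ready_at[t] = cycle + latency
                break
        else:
            timeline.append("NOP")                # nothing ready: wasted cycle
        cycle += 1
    return timeline


def nop_fraction(timeline):
    """Fraction of cycles in the timeline that are NOPs."""
    return timeline.count("NOP") / len(timeline)


if __name__ == "__main__":
    single = issue_schedule(num_threads=1, instrs_per_thread=8, latency=2)
    paired = issue_schedule(num_threads=2, instrs_per_thread=8, latency=2)
    print("single branch:", single, f"NOP fraction {nop_fraction(single):.2f}")
    print("two branches: ", paired, f"NOP fraction {nop_fraction(paired):.2f}")
```

With a latency longer than the number of interleaved branches, some idle slots remain in this model, which is in the same spirit as the residual wasted cycles the text reports after two-way interleaving.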
Fig. 11. FPE Branch Interleaving: (a) Original FSD Threads; (b) Interleaved Threads. Each thread runs the Slice and APED steps for levels 4 down to 1 and ends with a PUT.

When implemented on a Xilinx Virtex 5 VSX240T FPGA, the performance and cost of the FSD-MCS for 802.11n are reported as FPE-MCS in Table XIII. As this shows, it comfortably exceeds the real-time performance criteria of 802.11n, and is the first software-defined implementation to do so.

TABLE XIII
4 × 4 16-QAM FSD IMPLEMENTATIONS

Ref        LUT       DSP48E   BRAM   ELUT (×10^3)   Clock (MHz)   T (Mbps)   L (µs)
FPE-MCS    16,601    144      0      98.5           296           502.5      0.9
[22]       13,197    160      49     168.0          150           600        N/A
[17]       18,893    64       12     N/A            100           200        N/A
[18]       6,587     0        0      N/A            52            27.7       N/A


Table XIII also compares the FPE-MCS with existing Xilinx FPGA custom circuit SD realisations. The work in [22] displays a slightly lower LUT cost and higher throughput, but the FPE-MCS enables software-programmability whilst maintaining real-time behaviour and comparable cost. The architectures in [17], [18] operate well below real-time, and the architectural changes necessary to enable real-time performance are such that direct comparison is very difficult. The work in [8] presents a very high performance 800 Mbps single-subcarrier architecture on Altera FPGA but, in common with [6], the resource and performance implications of adapting it to support all 108 802.11n subcarriers are unknown, making comparison difficult. Finally, target technology variations between the FPE-MCS and the 2 × 2 ASIC-based custom circuit in [35] also make comparison very difficult.

Given these comparisons, the novel aspect of the FPE-MCS is clear: it is the only software-defined approach which supports real-time FSD MCS for 4 × 4 16-QAM 802.11n. Similarly to the FPE-SQRD presented in Section IV, it shows that massively parallel networks of simple processors (> 140 in this case) on FPGA can support real-time processing with resource costs comparable to custom circuits. Like all software-defined radio platforms, it trades absolute performance/cost for flexibility and ease of design: it offers a predominantly software-based design approach, and hence is inherently more flexible for adaptation to other SD or even more general DSP algorithms, as well as being better suited to existing software radio design processes. In Section VI, the FPE-based design approach is applied to the design of a full FSD detector.

VI. FPGA-BASED SOFTWARE-DEFINED FSD FOR 802.11N

When the PP and MCS implementation strategies, described in Sections IV and V respectively, are combined to create a full FSD detector implementation, the cost and performance of the implementation are as reported in Table XIV. Given that this implementation realises the real-time processing requirements of 4 × 4 16-QAM 802.11n, to the best of the authors' knowledge it is the only single-chip implementation to do so, despite its software-defined nature.
TABLE XIV
4 × 4 SQRD FSD FULL DETECTOR IMPLEMENTATIONS

Aspect                     FPE-FSD
Resource: LUTs             96,115
Resource: DSP48Es          408
Resource: BRAMs            N/A
Resource: ELUTs (×10^3)    328
Clock (MHz)                189
T                          483
L (µs)                     2.3


VII. CONCLUSION

This paper has presented a uniquely capable approach for implementing SD detectors for MIMO receivers: it is the first software-defined platform to support real-time detection for applications such as 4 × 4 16-QAM 802.11n MIMO. It has shown how, by composing fine-grained, very high performance programmable components into large-scale multiprocessing architectures on FPGA, the resulting software-defined architectures satisfy the demanding real-time performance metrics of modern MIMO standards, whilst incurring resource costs of the order of existing dedicated circuit architectures. We have demonstrated this by creating three unique implementations:

1) The only recorded SQRD-based PP architecture for 4 × 4 802.11n MIMO.
2) The only recorded real-time software-defined FSD MCS architecture for 4 × 4 16-QAM 802.11n MIMO.
3) The only recorded single-chip integrated quasi-optimal detector for 4 × 4 16-QAM 802.11n MIMO.

It is important to note that their software-defined nature implicitly eases the design process for architectures such as these. However, this is only the case given supporting Computer Aided Design (CAD) and software compilation infrastructure. This paper has concentrated on demonstrating the feasibility of the architectures to support such realisations, but has constructed and programmed them manually at the Register Transfer Level (RTL) and assembly level respectively. Creating these technologies is left as future work, of similar significance to the demonstration of architectural viability presented here.

ACKNOWLEDGMENT

The authors would like to thank Prof. Roger Woods, Prof. John Thompson, Dr Chengwei Zheng and Mr. Matthew Milford for their valuable assistance in this work. This work is supported by the UK Engineering and Physical Sciences Research Council (EPSRC), under grant EP/F031017/1.

REFERENCES
[1] P. Wolniansky, G. Foschini, G. Golden, and R. Valenzuela, "V-BLAST: An Architecture for Realizing Very High Data Rates Over The Rich-Scattering Wireless Channel," in 1998 URSI Int. Symp. Signals, Systems, and Electronics, 1998, pp. 295-300.
[2] IEEE802.11n, "802.11n-2009 IEEE Local and metropolitan area networks - Specific requirements Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications Amendment 5: Enhancements for Higher Throughput," 2009.
[3] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bolcskei, "VLSI Implementation of MIMO Detection Using The Sphere Decoding Algorithm," IEEE Journal of Solid-State Circuits, vol. 40, no. 7, pp. 1566-1577, Jul 2005.
[4] X. Huang, C. Liang, and J. Ma, "System Architecture and Implementation of MIMO Sphere Decoders on FPGA," IEEE Trans. VLSI Systems, vol. 16, no. 2, pp. 188-197, 2008.
[5] M. Li, B. Bougard, W. Xu, D. Novo, L. Van Der Perre, and F. Catthoor, "Optimizing Near-ML MIMO Detector for SDR Baseband on Parallel Programmable Architectures," Design, Automation and Test in Europe (DATE), pp. 444-449, March 2008.
[6] P. Bhagawat, R. Dash, and G. Choi, "Dynamically Reconfigurable Soft Output MIMO Detector," in Proc. IEEE Intl. Conf. Computer Design (ICCD), Oct. 2008, pp. 68-73.
[7] J. Janhunen, O. Silvén, and M. Juntti, "Programmable Processor Implementations of K-best List Sphere Detector for MIMO Receiver," Elsevier Journal of Signal Processing, vol. 90, no. 1, pp. 313-323, 2009.
[8] M. Khairy, M. Abdallah, and S.-D. Habib, "Efficient FPGA Implementation of MIMO Decoder for Mobile WiMAX System," in 2009 IEEE Intl. Conf. on Communications (ICC'09), June 2009, pp. 1-5.
[9] J. Bard and V. J. Kovarik Jr., Software Defined Radio: The Software Communications Architecture. Wiley, 2007.


[10] J. H. Reed, Software Radio: A Modern Approach To Radio Engineering. Prentice Hall, 2002.
[11] X. Chu and J. McAllister, "FPGA Based Soft-core SIMD Processing: A MIMO-OFDM Fixed-Complexity Sphere Decoder Case Study," in IEEE Int. Conf. on Field-Programmable Technology (FPT), Dec. 2010, pp. 479-484.
[12] X. Chu, J. McAllister, and R. Woods, "A Pipeline Interleaved Heterogeneous SIMD Soft Processor Array Architecture for MIMO-OFDM Detection," in 7th Intl. Conf. on Reconfigurable Computing: Architectures, Tools and Applications (ARC), Mar. 2011, pp. 133-144.
[13] M. Pohst, "On The Computation of Lattice Vectors of Minimal Length, Successive Minima and Reduced Bases with Applications," SIGSAM Bull., vol. 15, no. 1, pp. 37-44, 1981.
[14] C. P. Schnorr and M. Euchner, "Lattice Basis Reduction: Improved Practical Algorithms and Solving Subset Sum Problems," Mathematical Programming, vol. 66, no. 1, pp. 181-199, 1994.
[15] J. Antikainen, P. Salmela, O. Silven, M. Juntti, J. Takala, and M. Myllyla, "Application-Specific Instruction Set Processor Implementation of List Sphere Detector," in Conf. Record of the Forty-First Asilomar Conf. on Signals, Systems and Computers, Nov. 2007, pp. 943-947.
[16] J. Janhunen, O. Silven, M. Juntti, and M. Myllyla, "Software Defined Radio Implementation of K-best List Sphere Detector Algorithm," in Intl. Conf. on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), Jul 2008, pp. 100-107.
[17] Q. Qi and C. Chakrabarti, "Parallel High Throughput Soft-output Sphere Decoder," in IEEE Workshop on Signal Processing Systems (SIPS), Oct. 2010, pp. 174-179.
[18] B. Wu and G. Masera, "A Novel VLSI Architecture of Fixed-Complexity Sphere Decoder," in 13th Euromicro Conf. on Digital System Design: Architectures, Methods and Tools, Sept. 2010, pp. 737-744.
[19] L. Barbero and J. Thompson, "Fixing the Complexity of the Sphere Decoder for MIMO Detection," IEEE Trans. Wireless Communications, pp. 2131-2142, June 2008.
[20] L. Hanzo, W. Webb, and T. Keller, Single and Multi-carrier Quadrature Amplitude Modulation: Principles and Applications for Personal Communications, WLANs and Broadcasting, 2000.
[21] C. Zheng, X. Chu, J. McAllister, and R. Woods, "Real-Valued Fixed-Complexity Sphere Decoder for High Dimensional QAM-MIMO Systems," IEEE Trans. Signal Processing, vol. 59, no. 9, pp. 4493-4499, Sept. 2011.
[22] L. G. Barbero and J. S. Thompson, "Rapid Prototyping of a Fixed-Throughput Sphere Decoder for MIMO Systems," in IEEE Intl. Conf. on Communications, Jun. 2006, pp. 3082-3087.
[23] P. Yiannacouras, J. G. Steffan, and J. Rose, "Fine-Grain Performance Scaling of Soft Vector Processors," in Intl. Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES), Oct. 2009, pp. 97-106.
[24] J. Yu, G. Lemieux, and C. Eagleston, "Vector Processing as a Soft-core CPU Accelerator," in Intl. ACM/SIGDA Symp. on Field Programmable Gate Arrays (FPGA). ACM, Feb. 2008, pp. 222-232.
[25] D. Wubben, R. Bohnke, V. Kuhn, and K.-D. Kammeyer, "MMSE Extension of V-BLAST based on Sorted QR Decomposition," in 2003 IEEE Vehicular Technology Conference (VTC 2003), Oct. 2003, pp. 508-512 Vol.1.
[26] X. Chu, J. McAllister, and R. Woods, "A Low Complexity Real-Time MIMO-Preprocessing For Fixed-Complexity Sphere Decoder," in 2011 Wireless Innovation Forum (SDR'11-WINNComm), Nov. 2011, pp. 601-604.
[27] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, 2nd ed. OUP USA, 2010.
[28] N. Sorokin, "Implementation of High-Speed Fixed-Point Dividers on FPGA," Journal of Computer Science & Technology, vol. 6, no. 1, pp. 8-11, 2006.
[29] Xilinx Inc., "LogiCORE IP CORDIC v4.0," 2011.
[30] X. Chu, K. Benkrid, and J. Thompson, "Rapid Prototyping of an Improved Cholesky Decomposition Based MIMO Detector on FPGAs," in NASA/ESA Conf. on Adaptive Hardware and Systems (AHS), 2009, pp. 369-375.
[31] N. Moezzi-Madani, T. Thorolfsson, and W. Davis, "A Low-Area Flexible MIMO Detector for WiFi/WiMAX Standards," in Design, Automation Test in Europe (DATE), March 2010, pp. 1633-1636.
[32] M. Myllyla, J. Cavallaro, and M. Juntti, "Architecture Design and Implementation of the Metric First List Sphere Detector Algorithm," IEEE Trans. VLSI Systems, vol. 19, no. 5, pp. 895-899, May 2011.
[33] J. Im, M. Cho, Y. Jung, and J. Kim, "Low-Power Low-Complexity MIMO-OFDM Baseband Processor for Wireless LANs," in IEEE Intl. Symp. on Circuits and Systems (ISCAS), May 2009, pp. 601-604.


[34] D. Sheldon, R. Kumar, R. Lysecky, F. Vahid, and D. Tullsen, "Application-Specific Customization of Parameterized FPGA Soft-Core Processors," in IEEE/ACM Intl. Conf. on Computer-Aided Design, 2006, pp. 261-268.
[35] T. Cupaiuolo, M. Siti, and A. Tomasoni, "Low-Complexity High Throughput VLSI Architecture of Soft-output ML MIMO Detector," in Proc. Design, Automation Test in Europe (DATE), March 2010, pp. 1396-1401.
