Shepard 2013 Cell Net
that it will be advantageous for base stations to dynamically switch between precoding techniques to optimize capacity, which we call adaptive precoding.

The rest of this paper is organized as follows: We provide a brief background in Section 2. In Section 3 we discuss the factors which affect performance, then use them to build a performance model in Section 4. We leverage this model to predict tradeoff points between the precoding techniques, which we present with other results in Section 5. In Section 6 we discuss future work, followed by a brief overview of related work in Section 7, then conclude in Section 8.

2. BACKGROUND

There are many forms of MU-MIMO; we focus on linear precoding since other methods are computationally infeasible in practice, or do not take advantage of the potential capacity gains from many-antenna systems. Let $s$ denote a $K \times 1$ vector representing the data-bearing symbols to $K$ users. Linear precoding creates a downlink transmission vector $s'$ for $M$ antennas by multiplying the original data vector $s$ by an $M \times K$ matrix $W$: $s' = Ws$. In the uplink the data symbols from the $K$ terminals can be recovered similarly, by performing $s = W^T s'$.

The beamforming weights, $W$, are computed according to the precoding algorithm; in this work we analyze the two predominant algorithms: conjugate and zero-forcing. Conjugate uses beamforming weights which are the complex conjugate of the channel matrix, $H$: $W_{conj} = cH^*$, which maximizes the SNR to each user, regardless of interference. Zero-forcing calculates the beamweights as a pseudo-inverse of the channel matrix, $W_{zf} = cH^*(H^T H^*)^{-1}$, which forces inter-user interference to zero.

For more detailed background, we suggest [5, 8, 3].

3. PERFORMANCE FACTORS

The factors which affect the performance of base stations employing linear precoding can be classified as either environmental or by design. The propagation environment affects the channel coherence and the precoder's spectral efficiency. The base station design determines the number of base station antennas, the number of users that can be served, and the precoding algorithm's latency. We next define each factor and its effect on performance, identify how it causes discrepant behavior in conjugate and zero-forcing precoding, and characterize it in real-world systems.

3.1 Environmental Factors

3.1.1 Channel coherence

Channel coherence describes how smooth the physical wireless channel is, in both time and frequency. Essentially, it determines how often CSI must be collected. If the channel changes too much over time, then the previously estimated channel state becomes useless. The duration over which an estimate remains valid is the coherence time. Similarly, one channel estimate is not valid for the entire spectrum. Thus, the channel state must be estimated at intervals across the entire wideband channel; the width of this interval is the coherence bandwidth.

Coherence time is determined by user mobility. Theoretical models simulate coherence time as the amount of time it [...] GHz (wavelength of 12.5 cm) a user moving at 140 mph has a coherence time of 500 µs. However, this neglects movement in the environment itself, and experimental evaluation has shown that vehicular mobility near users results in coherence intervals of less than 300 µs in the 2.4 GHz band [2]. Previous work based on LTE channel models often uses approximately 1 ms coherence times [5].

Coherence bandwidth is the approximately flat frequency interval of the channel. Delay spread in multipath environments causes the channel's frequency response to become rough. However, channels can still be approximated as smooth over the coherence bandwidth, usually derived as the inverse of the delay spread. This effectively requires the channel to be estimated at regular intervals across the spectrum to obtain accurate CSI. In LTE models the coherence bandwidth is 210 kHz, as described in further detail in [5].

Channel coherence determines the latency of CSI acquisition and how long that CSI is valid. Since the CSI is only valid temporarily, the overhead of CSI collection and precoding computation results in a direct loss of capacity. More importantly, however, this overhead is fixed with respect to channel coherence time. Thus, as channel coherence is reduced, the relative capacity loss grows. Since conjugate and zero-forcing have drastically different computational overheads, they behave differently as coherence time varies.

3.1.2 Precoder Spectral Efficiency

Zero-forcing and conjugate provide vastly different spectral efficiencies during actual data transmissions [8]. We define precoder spectral efficiency as the capacity achieved (bps/Hz) using $M$ antennas to serve $K$ users in a given environment, neglecting all CSI and computational overhead. Because these factors are neglected, precoder spectral efficiency is independent of base station implementation (for a given $M$ and $K$).

This spectral efficiency is determined by the propagation environment, specifically channel orthogonality, user distance, noise, and interference. It is important to note that the relative spectral efficiency of conjugate and zero-forcing varies significantly with SNR, as further explored in [9, 8]. However, zero-forcing is known to perform poorly in low SNR regimes, so a slightly modified form, often referred to as MMSE, should be used in these scenarios. MMSE has negligibly increased performance overhead when compared to zero-forcing, but performs much better at low SNRs, as shown in [7]. While its performance relative to conjugate still varies with SNR, the variation is not as drastic.

One approach to approximate spectral efficiency is to measure each environmental property to create a channel model and simulate precoder spectral efficiency. Alternatively, we employ a more accurate approach that uses a many-antenna base station to measure spectral efficiency directly, thus capturing the combined effect of these properties on capacity.

3.2 Design Factors

3.2.1 Number of Antennas

The number of antennas, both at the base station and added with each additional user, drastically affects the capacity in two ways. While more antennas increase spectral efficiency, they also increase CSI collection and precoding computation overhead, decreasing the amount of time available to send data.
Typically, each additional base station antenna provides a power gain (both by increasing the total transmit power and improving directionality), as well as a potential multiplexing gain (by increasing the possible number of users served simultaneously). However, when zero-forcing, each additional antenna also increases the amount of data sent to the central processor, increasing transport and processing overhead. In contrast, conjugate can be distributed in a manner requiring no additional overhead with more base station antennas.

Each additional user provides a multiplexing gain at the expense of a data slot being converted to a pilot slot, and less transmit power per user. However, in low coherence channels, it may be impossible to collect CSI for all available users and still have time left to send data, thus limiting the number of users that can be optimally served. Notably, the complexity and relative performance of each precoder grow at different rates with the number of base station antennas and users. Since zero-forcing has polynomial unparallelizable complexity, it suffers more as $M$ and $K$ increase. This indicates that the optimal number of users to serve is dependent on the precoding technique due to these differences in computational overhead.

3.2.2 Hardware Capability

The base station's hardware determines computation and data transport latency. After CSI estimation, the base station must perform the linear precoding computation before data transmission. Any delay caused by this processing results in a direct capacity loss. All linear precoding techniques require the same computation to apply the beam weights. Additionally, even traditional baseband processing for wideband systems, such as OFDM, can cause substantial delay. However, since these overheads are common to both zero-forcing and conjugate, we omit them from our analysis as they do not provide additional insight into the performance tradeoffs; they essentially have the effect of further shortening the coherence time.

While conjugate beamforming requires negligible computation beyond the basic linear precoder, zero-forcing has polynomial time complexity with regard to the number of base station antennas and users, and its matrix inverse operations have internal data dependencies which prevent them from being fully parallelized. Additionally, zero-forcing has a central data dependency: it requires CSI from each base station antenna at a central location to compute the beamforming weights, and these weights must then be sent back to each of the radios. When the base station has a large number of radios serving many users across a large bandwidth, this simple data transportation results in significant overhead, thereby decreasing the amount of usable coherence time. Thus, the performance of zero-forcing is dependent on the base station's matrix inverse and data transport performance, as well as channel bandwidth, as further described below.

Matrix Inversion. Matrix inversions have internal data dependencies which prevent full parallelization of the algorithm. As the number of simultaneously served users increases, the resulting inverse latency increase cannot be compensated for with additional hardware.

Matrix inversion is an operation that is $O(MK^2)$ and thus the incurred latency scales cubically with the number of concurrently served users (since $M \geq K$). The component operations are CORDIC rotations and divisions, which are orders of magnitude more time and resource intensive than simple multiplications and additions (matrix multiplication is also $O(MK^2)$ but far less complex and can be fully parallelized).

Additionally, the inversion must be performed for each coherence bandwidth interval across the entire wide band. For example, a system similar to LTE with a 40 MHz bandwidth and a coherence interval of 210 kHz requires 191 of these inverses.

The realtime performance of such a system depends on the type of hardware employed. We consider two realistic inversion engines. On the lower, cheaper end, we consider a high performance desktop CPU (Intel i7, 4 core, using MKL/SSE) and benchmark the matrix inversion performance. Given that each inverse can be computed in parallel, this system can perform 4 inverses at a time; thus, such a system can perform 191 15x15 matrix inversions in approximately 2500 µs. The best case method of performing a matrix inverse is to use dedicated inversion hardware such as an FPGA or ASIC. This method is far more expensive to implement, but would be appropriate for use in a next generation base station. We consider the FPGA complex matrix inversion specified in [1] and compute the expected inverse latency. For this ideal system, 191 15x15 inversions can be computed in approximately 260 µs, almost an order of magnitude less than the CPU method. Note that due to the non-parallelizable nature of the inverse algorithm, this overhead is not easily addressed by Moore's law, as additional cores cannot reduce the latency of an inverse, which grows with the number of users being served.

Data Transport Performance. Current data transport hardware, such as Ethernet or InfiniBand, ranges in throughput from 1 Gbps to over 40 Gbps. Along with inversion latency, data transport latency significantly detracts from the performance of zero-forcing transmissions due to the inherent, centralized data dependency.

This requires each channel vector to be transported from the radio, through a switch, to the central controller. Once the inverse is computed, the beamforming weights must be sent back to the radios. Thus this process requires two data transmissions (CSI forward and weights backward), each of which includes the hop latency of traveling through the switch, as well as propagation delay. The propagation delay exceeds 5 µs per kilometer, given the reduced speed of light in fiber optic cables. In general, the amount of data in both directions is symmetric, as there is both a CSI estimate and a beamweight required for each antenna on each coherence bandwidth.

Gigabit Ethernet (GbE) can transport data at a rate of 1 Gbps to 40 Gbps and has an incurred hop latency of approximately 20 µs [6]. Common Public Radio Interface (CPRI), which has similar performance to Ethernet, is typically used for data transport in cellular systems; however, it is specialized for sending continuous synchronized I/Q samples, and would have to be altered to support this application. For the round trip transportation of 191 15x15 matrices (with 32 bit complex values), a 10 GbE system incurs a latency of at least 355 µs. InfiniBand is a faster, more expensive transportation system intended for supercomputing clusters that is capable of 40 Gbps throughput with only 1 µs hop latencies [4]. For the round trip transportation of 191 15x15 matrices, this system incurs a latency of approximately 70 µs.
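The round-trip transport figures quoted above can be approximately reproduced with simple arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, and counting two hop latencies per direction is an assumption made here (the text does not state how many hops were counted). Under that assumption the 10 GbE result comes out at about 355 µs and the InfiniBand result at about 73 µs, in line with the quoted "at least 355 µs" and "approximately 70 µs".

```python
import math


def num_coherence_blocks(bandwidth_hz, coherence_bw_hz):
    """Number of coherence-bandwidth intervals needed to cover the band."""
    return math.ceil(bandwidth_hz / coherence_bw_hz)


def payload_bits(n_blocks, rows, cols, bits_per_value):
    """Bits needed to move one rows x cols matrix per coherence block."""
    return n_blocks * rows * cols * bits_per_value


def round_trip_us(bits_one_way, throughput_bps, hop_latency_us, hops_per_direction):
    """Round-trip time in microseconds: serialization plus hop latency, both ways.

    hops_per_direction is an assumption of this sketch, not a value from the text.
    """
    serialization_us = bits_one_way / throughput_bps * 1e6
    return 2 * (serialization_us + hops_per_direction * hop_latency_us)


# 40 MHz band with 210 kHz coherence bandwidth, 15x15 matrices, 32-bit values.
blocks = num_coherence_blocks(40e6, 210e3)
bits = payload_bits(blocks, 15, 15, 32)

# Assumed: two hop latencies in each direction.
ten_gbe_us = round_trip_us(bits, 10e9, 20.0, hops_per_direction=2)
infiniband_us = round_trip_us(bits, 40e9, 1.0, hops_per_direction=2)

print(blocks, bits, round(ten_gbe_us, 2), round(infiniband_us, 2))
```

Propagation delay is left out of this sketch, which matches the later assumption that processing is local.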
Notably, the data being sent to each user must also be distributed to all of the radios; however, this is a common requirement for all precoding techniques, would likely use a separate data link, and is much less sensitive to latency.

Channel Bandwidth. Practical communication systems use wide channel bandwidths in order to increase capacity. Unfortunately, as mentioned above, the frequency response of this channel is not flat, so CSI estimation and precoding computation have to be repeated at regular intervals across the band. Thus, the number of inverses and amount of data transport required scale linearly with the bandwidth. In current LTE standards the largest channel bandwidth is 40 MHz (20 MHz downlink and 20 MHz uplink, in FDD), whereas the next generation of WiFi, 802.11ac, goes up to 160 MHz bandwidths (two bonded 80 MHz bands).

4. PERFORMANCE MODEL

Using the factors discussed in the previous section, we now present the model which dictates the real-world performance of these linear precoding techniques. These factors exhibit complex interactions in real-world systems; we use our model to capture these interactions and analyze their impact on practical performance.

4.1 Parameters

A list of model parameters, sorted by their category, environment or design, is shown in Table 1. If a value is specific to a precoding technique it is denoted with a ZF or C subscript for zero-forcing and conjugate, respectively.

Variable    Description                     Unit
$C_t$       Coherence time                  s
$C_b$       Coherence bandwidth             Hz
$\Gamma$    Spectral efficiency per user    bps/Hz/u
$K$         # users                         u
$M$         # base station antennas
$S$         Data transport throughput       bps
$L$         Data transport hop latency      s
$T^{-1}$    Time to perform an inverse      s
$N_b$       # bits per CSI                  bits
$B$         Bandwidth                       Hz

$\Lambda$   % of time transmitting data     %
$E$         Channel est. overhead           s
$P$         Total processing time           s
$\Phi$      Achieved aggregate capacity     bps/Hz

Table 1: Parameters. Upper set are model inputs categorized by environment and design. Lower set are model variables.

4.2 Model Derivation

The goal of this model is to find the real-world achieved capacity of a linear precoding system when given the channel coherence, number of base station antennas, number of users, hardware capability, precoder spectral capacity, and bandwidth. At a high level, the system capacity, $\Phi$, can be shown in terms of $\Gamma$, which is determined by the environmental factors, and $\Lambda$, which is a result of the design factors:

$$\Phi = \Gamma K \Lambda \qquad (1)$$

This equation describes simultaneous data transmission to $K$ users at a rate of $\Gamma$ bps/Hz each; however, due to the overhead of channel estimation ($E$) and processing ($P$), we can actually only transmit $\Lambda$ percent of each coherence time ($C_t$), where:

$$\Lambda = \frac{C_t - E - P}{C_t} \qquad (2)$$

For each user, it takes $1/C_b$ time to collect accurate channel information for the whole spectrum (since each spectrum block can be measured in parallel), thus:

$$E = \frac{K}{C_b} \qquad (3)$$

Since conjugate does not require central processing, it has no processing overhead, so $P_C = 0$. However, due to the centralized processing requirements of zero-forcing, it must spend a large amount of time in data transport and computing inverses, and thus has a substantial additional overhead:

$$P_{ZF} = 2\left(\frac{M K \frac{B}{C_b} N_b}{S} + L\right) + \frac{B}{C_b} T^{-1} \qquad (4)$$

The first part of the equation accounts for the time it takes to send the $B/C_b$ channel vectors, each with $K$ entries that have $N_b$ bits, from the $M$ antennas to the central processor over a connection with a speed of $S$ and hop latency of $L$ (which includes propagation delay due to cable length). This is doubled, since the central processor then has to send the beamweights back to each of the $M$ radios. If the size of the beamweights and CSI differ, due to the use of codebooks, compression, or quantization, the forward and reverse links can be trivially separated to account for this asymmetry. The second component accounts for the amount of time it takes to perform the $K \times K$ inverses for each of the $B/C_b$ coherence bandwidths.

4.3 Complete Model

Combining all of the factors we see that the modeled throughput for conjugate is:

$$\Phi_C = \Gamma_C K \frac{C_t - \frac{K}{C_b}}{C_t} \qquad (5)$$

And for zero-forcing is:

$$\Phi_{ZF} = \Gamma_{ZF} K \frac{C_t - \left(\frac{K}{C_b} + 2\left(\frac{M K B N_b}{C_b S} + L\right) + \frac{B}{C_b} T^{-1}\right)}{C_t} \qquad (6)$$

5. SIMULATION

Leveraging our model we analyze the performance of practical many-antenna linear precoding under realistic constraints. We focus on scenarios where the performance of conjugate and zero-forcing cross, as they highlight the conditions when it is important to consider the tradeoffs between the two precoding techniques.

5.1 Simulation Methodology

Using the performance model described in Section 4, we input a range of realistic parameter values and analyze their impact on performance. As defined in Table 1, there are 11 input parameters to the model; in order to reduce the dimensionality in the presented results, we hold $C_b$, $M$, $N_b$, and $B$ constant, as they yield the least interesting impacts on performance. For all experiments we base the coherence bandwidth, $C_b$, and channel width, $B$, on LTE, which defines $C_b$ = 210 kHz and $B$ = 40 MHz (20 MHz uplink and 20 MHz
downlink). Our platform supports up to 64 base station antennas, so $M$ = 64. We choose the number of bits in channel estimates and beamweights to be 32 (16 real and 16 imaginary), as this offers low quantization error, and is the width used by our implementation.

We then vary the remaining 7 parameters as follows: We look at channel coherence times, $C_t$, that range from 500 µs to 100 ms, which are reasonable for real-world mobility, and in line with the LTE parameters. Using the many-antenna base station implementation described in [8] we collect the real-world spectral efficiency, $\Gamma$, achieved by conjugate and zero-forcing precoding as the number of users, $K$, varies from 1 to 15. In order to assess the impact of hardware capability, S, D, L, and $T^{-1}$, on capacity, we devise four base stations which range from low-end hardware using Ethernet to high-end custom FPGA designs using InfiniBand; the specifications are provided in Figure 1 [6, 4]. We assume that processing is local, and thus propagation delay is negligible.

[Figure 1 plot: Achieved Capacity (bps/Hz) versus Coherence Time (s)]

Type      Transport     S         L       Inv. Type
Super     InfiniBand    40 Gbps   1 µs    FPGA
Cluster   4x10GbE       40 Gbps   20 µs   8xIntel i7
High      2x10GbE       20 Gbps   20 µs   4xIntel i7
Mid       10GbE         10 Gbps   20 µs   2xIntel i7
Low       GbE           1 Gbps    20 µs   Intel i7

Figure 1: Zero-forcing and conjugate performance comparison for different hardware configurations in an M=64, K=15 system.

5.2 Results

The main factors which affect the performance tradeoffs between conjugate and zero-forcing are coherence time, hardware capability, and number of users. We design simulations which analyze each of these factors, and clearly show their impact on the tradeoff between conjugate and zero-forcing.

5.2.1 Coherence Time and Hardware Capability

We first look at the achieved capacity of conjugate and zero-forcing with regard to coherence time. Figure 1 shows that while serving 15 users simultaneously, conjugate beamforming outperforms zero-forcing at coherence times up to 38 ms in the low-end base station. We clearly see that as the coherence time drops, the overhead of zero-forcing dominates its capacity.

However, we can also see in Figure 1 that, given the specialized super high performance central processor and switch, we can reduce this tradeoff point to below 1.5 ms. Even using very high-end servers, it is still very difficult to reduce the tradeoff point to below 5 ms.

5.2.2 Number of Users

Finally, we note that as the number of users grows, the performance of zero-forcing quickly degrades under the constraint of low coherence times, as the overhead from data transport and processing dominates its capacity. Figure 2 demonstrates a scenario where conjugate begins to outperform zero-forcing with more users; with 4-6 users their performance is equivalent, but as the number of users grows to 15, zero-forcing achieves only 65% of the capacity of conjugate. This also demonstrates the criticality of choosing the optimal number of users to serve, as the capacity of zero-forcing peaks at 11 users under these constraints. We use the low-end hardware to demonstrate these effects; however, higher-end hardware will also show this behavior as the number of users increases. Our model shows that $\Lambda_{ZF} K$ (an indicator of peak capacity), under the same 30 ms coherence and 64 base station antenna scenario, is maximal at 49 users, 73 users, 83 users, and 101 users, for the mid, high, cluster, and super hardware configurations, respectively.

[Figure 2 plot: Achieved Capacity (bps/Hz) versus Number of Users, for Conjugate and ZeroForcing]

Figure 2: Zero-forcing and conjugate performance comparison for number of terminals and fixed coherence time of 30 ms with low-end hardware.

5.3 Implications

These results indicate that our model can play two important roles in the development of many-antenna base stations: (i) guiding base station design and (ii) enabling adaptive precoding. We find that conjugate beamforming will be better suited for high frequency bands where coherence is lower and antenna arrays have much smaller form factors, whereas zero-forcing will be more appropriate at lower frequencies with fewer antennas. The actual tradeoff frequencies between these regimes will be a function of user mobility and hardware implementation, and in the tradeoff region adaptive precoding will be useful.

Base station design. Using our model, base station architects can appropriately provision their design to meet real-world performance requirements. By measuring the environmental factors, they can determine the design constraints they need to meet in order to achieve their performance goals. This can help them avoid costly mistakes, such as investing in a zero-forcing system for an environment with very short coherence time.

Adaptive Precoding. The optimal precoding technique varies according to factors which change in realtime, such as the number of users or channel coherence. Thus, for deployments that encompass the tradeoff points highlighted by our results, it will be advantageous to dynamically switch between conjugate and zero-forcing through adaptive precoding. Since users exhibit widely varying mobility, their coherence time may drop below the threshold where zero-forcing is optimal, and thus the system should dynamically switch to conjugate. Notably, users can be scheduled in groups based on mobility, and thus the precoding can be adaptive not only across time and frequency, but across user grouping as well.

6. DISCUSSION AND FUTURE WORK

It is typically very difficult to capture the behavior and performance of complex real-world systems using an analytical model. Our approach addresses this issue by separating the erratic and complex behavior of the environment from the deterministic overhead imposed by the hardware design. This enables system architects to identify and address critical high-level design factors which affect performance from a hardware design perspective, then leverage empirical measurements of the environmental factors from the target topology to estimate real-world performance.

Clearly every system design has much more complex internal interactions, such as multiple levels of hardware, software, and data interconnects, which determine the actual overhead of the high-level factors. These design details can easily be incorporated into the model. As we develop our own realtime adaptive precoding system we are iteratively refining this abstract model to incorporate concrete implementation details specific to our design. Additionally, as we collect more experimental data from various propagation environments, with more simultaneous users, we will further hone the accuracy and applicability of the model.

We also note that the simulation results presented are a very conservative estimate of the real-world tradeoff points; the parameters chosen are reasonable estimates intended to demonstrate the behavior and trends of the model. Many of the common overheads, such as cyclic prefix, synchronization, control, etc., are omitted from the analysis, and have essentially the same effect as reducing the coherence time. Furthermore, many of the overhead estimates represent idealized, lower-bound overhead rather than values expected in a full implementation, e.g., data transport, computation, and CSI collection. However, these values are design and environment specific, and should be determined on a per-system basis, then incorporated into the model accordingly.

7. RELATED WORK

While there is a plethora of theoretical work on many-antenna base stations, due to the recent nature of this area, to the best of our knowledge, only one explores the tradeoffs between linear precoding techniques. In [9], Yang et al. analyze the radiated power and computational requirements of conjugate and zero-forcing linear precoders. However, when determining the performance of the precoders, the authors do not account for the time it takes to perform these additional computations, nor do they consider other practical implementation issues, such as the data transport overhead or the non-parallelizable nature of inverses. Their simulations assume a channel coherence time of 933 µs, which, as we have shown, can cause serious performance degradation in zero-forcing. While this work is very insightful from a theoretical perspective, particularly with regard to energy and spectral efficiency, it neglects the practical implementation challenges facing many-antenna precoding, which drastically affect real-world performance.

8. CONCLUDING REMARKS

Many-antenna base stations show enormous potential in multiplying the spectral capacity of wireless systems. However, it is imperative to discover and understand the real-world factors which affect their performance in order to design systems which achieve their potential capacity gain. We have analyzed and described the critical system factors which discrepantly affect the performance of the two predominant linear precoders envisioned for many-antenna beamforming. Contrary to some existing theoretical analyses, our results indicate that conjugate beamforming likely outperforms zero-forcing in many realistic scenarios. Our robust model can not only be used to help guide system design and provisioning, but also indicates that base stations can greatly benefit from adaptive precoding, enabling them to dynamically switch to the optimal precoding technique as the users and environment vary.

Acknowledgements

This work was funded in part by NSF grants CRI 0751173, MRI 0923479, NetSE 101283, MRI 1126478 and CNS 1218700. Clayton Shepard was supported by an NDSEG fellowship. We thank Ashutosh Sabharwal, Edward Knightly, Chris Hunter, and Patrick Murphy for their input and support.

References

[1] Altera. Floating-Point Megafunctions User Guide, Nov. 2011. Available at: www.altera.com/literature/ug/ug_altfp_mfug.pdf.
[2] E. Aryafar, N. Anand, T. Salonidis, and E. Knightly. Design and experimental evaluation of multi-user beamforming in Wireless LANs. In Proc. ACM MobiCom, 2010.
[3] F. Fernandes, A. Ashikhmin, and T.L. Marzetta. Inter-cell interference in noncooperative TDD large scale antenna systems. IEEE Journal on Selected Areas in Communications, 2013.
[4] InfiniBand. Available at: www.infinibandta.org.
[5] T.L. Marzetta. Noncooperative cellular wireless with unlimited numbers of base station antennas. IEEE Trans. on Wireless Communications, 2010.
[6] Netgear. PROSAFE 52-Port Gigabit Stackable Switch. Available at: www.netgear.com/business/products/switches/stackable-smart-switches/GS752TXS.aspx#two.
[7] H. Ngo. Performance Bounds for Very Large Multiuser MIMO Systems. PhD thesis, Linköping University, The Institute of Technology, 2012.
[8] C. Shepard, H. Yu, N. Anand, E. Li, T. Marzetta, R. Yang, and L. Zhong. Argos: Practical many-antenna base stations. In Proc. ACM MobiCom, 2012.
[9] H. Yang and T.L. Marzetta. Performance of conjugate and zero-forcing beamforming in large-scale antenna systems. IEEE Journal on Selected Areas in Communications, 2013.
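As a supplementary illustration, the performance model of Section 4 (Equations 1 through 6) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the class and function names are hypothetical, the clamp of the transmit fraction at zero is an addition made here, and the per-inverse time and the spectral efficiency values in the example are placeholders rather than measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class ModelInputs:
    coherence_time_s: float      # C_t
    coherence_bw_hz: float       # C_b
    bandwidth_hz: float          # B
    antennas: int                # M
    users: int                   # K
    bits_per_csi: int            # N_b
    throughput_bps: float        # S
    hop_latency_s: float         # L
    inverse_time_s: float        # T^-1, time for one K x K inverse


def estimation_overhead(p):
    """Equation 3: channel estimation time, one pilot slot of 1/C_b per user."""
    return p.users / p.coherence_bw_hz


def zf_processing_time(p):
    """Equation 4: two transport legs plus one inverse per coherence block."""
    blocks = p.bandwidth_hz / p.coherence_bw_hz
    one_way = p.antennas * p.users * blocks * p.bits_per_csi / p.throughput_bps
    return 2 * (one_way + p.hop_latency_s) + blocks * p.inverse_time_s


def transmit_fraction(p, processing_s):
    """Equation 2, clamped at zero (the clamp is an assumption of this sketch)."""
    usable = p.coherence_time_s - estimation_overhead(p) - processing_s
    return max(0.0, usable / p.coherence_time_s)


def achieved_capacity(p, efficiency_per_user, processing_s):
    """Equation 1: per-user spectral efficiency times users times transmit fraction."""
    return efficiency_per_user * p.users * transmit_fraction(p, processing_s)


# Low-end configuration from the text: 1 Gbps, 20 us hop latency, M=64, K=15,
# 30 ms coherence time. The 50 us inverse time is a placeholder value.
low_end = ModelInputs(30e-3, 210e3, 40e6, 64, 15, 32, 1e9, 20e-6, 50e-6)

p_zf = zf_processing_time(low_end)
frac_conj = transmit_fraction(low_end, 0.0)   # conjugate: P_C = 0
frac_zf = transmit_fraction(low_end, p_zf)

# Placeholder spectral efficiencies (bps/Hz per user), not measured values.
cap_conj = achieved_capacity(low_end, 1.5, 0.0)
cap_zf = achieved_capacity(low_end, 2.0, p_zf)

print(p_zf, frac_conj, frac_zf, cap_conj, cap_zf)
```

Sweeping `coherence_time_s` or `users` with measured spectral efficiencies in place of the placeholders would locate the crossover between the two precoders for a given hardware configuration.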