GPUs for OFDM-Based SDR Prototyping: A Comparative Research Study
I. INTRODUCTION
Wireless protocols are often implemented in custom hardware in order to satisfy the heavy computational requirements
within the low power margins available. Hardware implementations take longer to design and verify and therefore
require longer development times. A programmable software
implementation of the physical layer, also called Software
Defined Radio (SDR), is therefore very advantageous in terms
of supporting multiple protocols, faster time-to-market, higher
chip volumes and easy modifications. For an SDR to strike
the required balance between power and processing, it is
often necessary to choose the right underlying architecture
for its implementation. In this study, we discuss a few
architectural methods used for vectorization of code, such
as Single Instruction Multiple Packed Data (SIMpD), Single
Instruction Multiple Disjoint Data (SIMdD), and Very Long
Instruction Word (VLIW) for accomplishing parallel data
computation. Once we analyze and establish vectorization as
an intuitive way to implement SDR standards, we further
explain why vectorization is advantageous for SDRs in terms
of implementing different standards on the same device.
This paper presents our initial research findings with results
from our implementation of a generic physical layer for an
Orthogonal Frequency Division Multiplexing (OFDM) based
wireless standard. The results specifically look at the speedup
achieved in the modules that constitute the SDR baseband
(such as coding/decoding, modulation/demodulation, interleaving/deinterleaving, and IFFT/FFT) implemented on graphics processors
(GPUs) versus an implementation of these modules in MATLAB on a Core i7 machine. We do this by interfacing with
NVIDIA's GPU computing platform, Compute Unified Device
Architecture (CUDA), from within the MATLAB framework
for direct comparison of different areas. We conclude our
research study by characterizing the architecture required
for building a GPU-accelerated platform for quick and easy
prototyping of SDR applications.
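As a rough illustration of two of the baseband modules named above (modulation/demodulation and IFFT/FFT), the following is a minimal sketch in plain Python, not the MATLAB/CUDA implementation evaluated in this study. The constellation table `QAM4`, the function names, and the naive O(N^2) transform are all assumptions made for illustration; the channel is assumed ideal, and coding and interleaving are omitted.

```python
import cmath

# Hypothetical Gray-mapped 4QAM constellation: two bits per point.
QAM4 = {(0, 0): 1 + 1j, (0, 1): -1 + 1j, (1, 1): -1 - 1j, (1, 0): 1 - 1j}


def modulate(bits):
    """Map consecutive bit pairs to 4QAM constellation points."""
    return [QAM4[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


def demodulate(points):
    """Hard decision: choose the bit pair of the nearest constellation point."""
    bits = []
    for p in points:
        pair = min(QAM4, key=lambda b: abs(QAM4[b] - p))
        bits.extend(pair)
    return bits


def dft(x, inverse=False):
    """Naive O(N^2) DFT; inverse=True applies the IDFT with 1/N scaling."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def ofdm_roundtrip(bits, n_subcarriers):
    """Modulate, apply the IFFT per OFDM symbol, then FFT and demodulate."""
    points = modulate(bits)
    recovered = []
    for i in range(0, len(points), n_subcarriers):
        freq = points[i:i + n_subcarriers]   # one OFDM symbol (frequency domain)
        time = dft(freq, inverse=True)       # transmitter side: IFFT
        recovered.extend(dft(time))          # receiver side: FFT
    return demodulate(recovered)
```

Each iteration of the loop in `ofdm_roundtrip` handles one OFDM symbol and does not depend on any other iteration, which is the kind of data parallelism a GPU can exploit when many symbols are processed at once.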
II.
researched [1][6]. The major challenge for an SDR implemented to cross standards would then be the ability to realize
multiple giga instructions per second (GIPS) of flexible
baseband processing under power limitations and time
constraints. In order to implement a crossover between two
standards, the baseband must be flexible enough to keep one
standard active while continuing to sniff for other
available standards to connect to. The challenge here is to
keep the computational load within the bounds of the embedded
processor while crossing over without a break in service. High
performance demands with power and throughput restrictions
have always been a concern for devices running digital signal
processing (DSP) algorithms. Over time, these
algorithms have repeatedly been changed,
modified and customized to suit their underlying architecture
for better performance. With the availability and growth
of simultaneous processing power, the major front
for optimizing these DSP techniques now lies in parallelism.
Parallelism gives these DSP kernels the edge of performing
multiple computations simultaneously, which makes
research into its adaptation to, and limitations in, SDR significant.
III.
calculations on superscalar architectures present versus in-order vector processors with long words. One of the major
differences between the above-mentioned architectures lies in the
hardware involved in performing the required data accesses. The Single
Instruction Single Data (SISD) architecture, which is the
simplest kind, processes a sequential instruction stream
and produces an output stream. The requirement
of just one register of data per instruction
makes it simple to implement and removes the need for extra design
effort to avoid data dependencies between the different instructions
being run. The Very Long Instruction Word (VLIW) architecture
uses register streams to handle the data accesses needed for
simultaneous, differing processing of multiple independent
data elements. The instruction stream in VLIW is a combination of
multiple operations to be computed on different data, arranged
so that the output dependencies of the computations do not
overlap or, where they do, cause no computational
errors due to out-of-sequence execution. When a VLIW design as described
above is restricted to a single sequential
instruction run on multiple data elements, the result is the
Single Instruction Multiple Data (SIMD) architecture. Depending on the
application, the data being worked on come from different data
registers and can be written back to the same or to disjoint locations.
The latter case leads to Single Instruction
Multiple Disjoint Data (SIMdD). SIMdD is, however, harder
to implement and not currently used because of the complexity it
adds in terms of additional hardware and control of multiple
registers. Another technique packs the multiple data elements
for SIMD together into one single register, often
called the SIMpD architecture. In an architecture similar to SIMdD,
disjoint data are accessed through vector pointers;
this architecture is thus known as the indirect-SIMdD architecture.
Instead of explicitly specifying vector elements, in indirect-SIMdD, pointers to the source and destination elements are
provided and vector fields specify multiple indices.
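The difference between packed (SIMpD) and indirect-SIMdD data access can be sketched in software. The following Python model is only an illustration under stated assumptions: the lane count, the 8-bit lane width, and all function names are hypothetical and do not describe any particular processor. `packed_add` emulates one instruction operating on several elements packed into a single register, while `indirect_add` emulates one instruction whose operands are located through vectors of indices.

```python
LANES = 4                      # assumed: four lanes per register
LANE_BITS = 8                  # assumed: 8-bit elements
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack LANES small integers into one register-sized integer (SIMpD)."""
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & LANE_MASK) << (i * LANE_BITS)
    return reg


def unpack(reg):
    """Split a packed register back into its individual lanes."""
    return [(reg >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


LOW_BITS = pack([0x7F] * LANES)    # lower 7 bits of every lane
HIGH_BITS = pack([0x80] * LANES)   # top bit of every lane


def packed_add(a, b):
    """Add all lanes at once; each lane wraps modulo 256 on its own."""
    # Add the low 7 bits so carries cannot cross into a neighboring lane,
    # then restore each lane's top bit with an XOR.
    partial = (a & LOW_BITS) + (b & LOW_BITS)
    return partial ^ ((a ^ b) & HIGH_BITS)


def indirect_add(memory, src_a, src_b, dst):
    """Indirect-SIMdD style add: index vectors select disjoint elements."""
    # Compute every result before writing, as if all lanes ran together.
    results = [(memory[i] + memory[j]) & LANE_MASK
               for i, j in zip(src_a, src_b)]
    for d, r in zip(dst, results):
        memory[d] = r
    return memory
```

For example, `unpack(packed_add(pack([250, 1, 127, 128]), pack([10, 2, 1, 128])))` yields `[4, 3, 128, 0]`, since each lane wraps independently. In `indirect_add`, the sources and destinations need not be contiguous, which is the flexibility that costs additional addressing hardware in a real design.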
IV. CONCLUSION
Fig. 1. CPU vs GPU IFFT/FFT runtime against number of OFDM symbols with
baseband set to 512 subcarriers and 4QAM modulation
Fig. 4. CPU vs GPU IFFT/FFT runtime against number of OFDM symbols with
baseband set to 512 subcarriers and 4QAM modulation
Fig. 5. CPU vs GPU IFFT/FFT runtime against number of OFDM symbols with
baseband set to 512 subcarriers and 4QAM modulation
VIII. ACKNOWLEDGMENT
REFERENCES
[1]
[2]
[3]
[4]
Fig. 3. CPU vs GPU IFFT/FFT runtime against number of OFDM symbols with
baseband set to 512 subcarriers and 4QAM modulation
[5]
[6]
[7]
[8]
[9]