Memory Controller
Abstract. To manage power and memory wall effects, the HPC industry employs FPGA reconfigurable accelerators and vector processing cores for data-intensive scientific applications. FPGA-based vector accelerators are used to increase the performance of high-performance application kernels. However, adding more vector lanes does not improve performance if the processor/memory performance gap dominates, and performance degrades further when on/off-chip communication time becomes more critical than computation time. The system also incurs multiple delays due to the application's irregular data arrangement and complex scheduling scheme. Therefore, just like generic scalar processors, all classes of vector machines, from vector supercomputers to vector microprocessors, require data management and access units that improve the on/off-chip bandwidth and hide main memory latency.
In this work, we propose an Advanced Programmable Vector Memory Controller (PVMC), which boosts noncontiguous vector data accesses by integrating descriptors of memory patterns, a specialized on-chip memory, a memory manager in hardware, and multiple DRAM controllers. We implemented and validated the proposed system on an Altera DE4 FPGA board. The PVMC is also integrated with an ARM Cortex-A9 processor on the Xilinx Zynq All-Programmable System on Chip architecture. We compare the performance of the PVMC system with vector and scalar processor systems without PVMC. When compared with a baseline vector system, the results show that the PVMC system transfers data sets 1.40x to 2.12x faster, achieves speedups between 2.01x and 4.53x for 10 applications, and consumes 2.56 to 4.04 times less energy.
1 Introduction
Data Level Parallel (DLP) accelerators such as GPUs [1] and vector processors [2–4] are gaining popularity due to their high performance per area. New programming environments are making GPU programming easier for general-purpose applications. DLP accelerators are very efficient for HPC scientific applications because they can simultaneously process multiple data elements with a single instruction. Due to the reduced number of instructions, Single Instruction Multiple Data (SIMD) architectures decrease the fetch and decode bandwidth.
We compare the performance of the PVMC system with vector and scalar processor systems without PVMC. When compared with the baseline vector system, the results show that the PVMC system transfers data sets 1.40x to 2.12x faster, achieves speedups between 2.01x and 4.53x for 10 applications, and consumes 2.56 to 4.04 times less energy.
The remainder of the paper is organized as follows. In Section 2 we describe the architecture of a generic vector system. In Section 3 we describe the architecture of PVMC. Section 4 gives details on the PVMC support for the applications. The hardware integration of PVMC into the baseline VESPA system is presented in Section 5. A performance and power comparison of PVMC by executing application kernels is given in Section 6. Major differences between our proposal and the state of the art are described in Section 7. Section 8 summarizes our main conclusions.
2 Vector Processor
A vector processor, also known as a “single instruction, multiple data” (SIMD) CPU [12], can operate on an array of data in a pipelined fashion, one element at a time, using a single instruction. For higher performance, multiple vector lanes (VL) can be used to operate in lock-step on several elements of the vector in parallel. The structure of a vector processor is shown in Figure 1. The number of vector lanes determines the number of ALUs and elements that can be processed in parallel. By default the width (W) of each lane is 32 bits and can be adjusted to 16 or 8 bits. The maximum vector length (MVL) determines the capacity of the vector register files (RF). Increasing the MVL allows a single vector instruction to encapsulate more parallel operations, so software can specify greater parallelism in fewer vector instructions, but it also increases the vector register file size. The vector processor uses a scalar core for all control flow, branches, stack, heap, and input/output ports, since performing such processing in the vector processor is not a good use of vector resources. To reduce the hardware complexity, all integer parameters are limited to powers of two.
The vector processor takes instructions from the instruction cache, decodes them, and proceeds to the replicate pipeline stage. The replicate pipeline stage divides the elements of work requested by the vector instruction into smaller groups that are mapped onto the VL lanes. A hazard is generated when two or more concurrent vector instructions conflict and cannot execute in consecutive clock cycles. The hazard check stage examines hazards for the vector and flag register files and stalls if required. Execution occurs in the next two stages (or three stages for multiply instructions) where VL operand pairs are read from the register file and sent to the functional units in the VL lanes.
Modern vector memory units use local memories (cache or scratchpad) and transfer
data between the Main Memory and the VLs, with parameterized aspect ratio of cache
depth, line size, and prefetching. The memory crossbar (MC) connects each byte of an on-chip memory line to a VL in any given cycle. The vector system has memory instructions for describing
consecutive, strided, and indexed memory access patterns. The index memory patterns
can be used to perform scatter/gather operations. A scalar core is used to initialize
the control registers that hold parameters of vector memory instructions such as the
base address or the stride. The vector system includes vbase, vinc and vstride
dedicated registers for memory operations. The vbase registers can store any base
address which can be auto-incremented by the value stored in the vinc registers. The
vstride registers can store different constant strides for specifying strided memory
accesses. For example, if MVL is 16, the instruction vld.w vector register, vbase, vstride, vinc loads the 16 elements starting at vbase, each separated by vstride words, stores them in vector register, and finally updates vbase by adding vinc to it. The MC routes each byte of the cache line (CL), accessed simultaneously, to any lane and maps individual memory requests from a VL to the appropriate byte(s) in the CL. The vector memory unit can take requests from each VL and transfers one CL at a time. Several MCs, from one up to the number of lanes, can be used to process memory requests concurrently.
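As a rough illustration of the address arithmetic implied by the vld.w example above (a sketch, not the processor's implementation; the function name is hypothetical and 32-bit elements are assumed), the generated address sequence is:

#include <stdint.h>

#define MVL 16  /* maximum vector length used in the example above */

/* Addresses touched by vld.w VR, vbase, vstride, vinc, assuming 4-byte words. */
void strided_load_addresses(uint32_t *vbase, uint32_t vstride, uint32_t vinc,
                            uint32_t addr[MVL])
{
    for (int i = 0; i < MVL; i++)
        addr[i] = *vbase + i * vstride * 4;  /* elements spaced vstride words apart */
    *vbase += vinc;                          /* auto-increment for the next access  */
}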
The memory unit of the vector system first computes and loads the requested addresses into the Memory Queue (MQ) of each lane and then transfers the data to the lanes. If the number of switches in the MC is smaller than the number of lanes, this process takes several cycles; when all MC requests have been satisfied, the MQ shifts its contents up by MC entries. Vector chaining [13] sends the output of a vector instruction to a dependent vector instruction, bypassing the vector register file and thus avoiding serialization, so that multiple dependent vector instructions can execute simultaneously. Vector chaining can be combined with an increased number of VLs; it requires available functional units, and a large MVL increases the performance benefit of vector chaining. When a loop is vectorized and the original loop count is larger than the MVL, a technique called strip mining is applied [14]. Strip mining breaks the loop into pieces that fit into the vector registers by transforming it into two nested loops: an outer strip-control loop whose step size is a multiple of the original loop's step size, and an inner loop that keeps the original step size and the loop body. The body of the strip-mined vectorized loop operates on blocks of MVL elements: the vector components of the original loop are moved into the inner loop, and after vectorization the vectorized statements end up in the body of the outer strip-control loop. In this way, strip mining folds the array-based parallelism to fit in the available hardware.
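As a minimal sketch in plain C (independent of any particular vector ISA; MVL = 128 follows the configuration used later in Section 5), the strip-mined form of an element-wise loop is:

#define MVL 128  /* maximum vector length */

/* Strip-mined element-wise addition: the outer strip-control loop walks over
 * blocks of at most MVL elements; the inner loop is what a single vector
 * instruction would cover after vectorization. */
void vadd_strip_mined(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += MVL) {            /* outer strip-control loop  */
        int vl = (n - i < MVL) ? (n - i) : MVL;   /* last block may be shorter */
        for (int j = 0; j < vl; j++)              /* inner loop over one block */
            c[i + j] = a[i + j] + b[i + j];
    }
}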
Fig. 2. Advanced Programmable Vector Memory Controller: Vector System
As the number of processing cores and the capacity of memory components increase, the system requires a high-speed bus interconnection network that connects the processor cores and memory modules [15], [16]. The bus system includes the status bus, the control bus, the address bus and the data bus.
The status bus holds signals from multiple sources that indicate data transfer requests, acknowledgements, wait/ready and error/ok messages; it also indicates the current bus operation: memory read, memory write, or input/output. The control bus carries the signals that control data movement and information about data transfers. This bus is also used to move data between the PVMC descriptors and the vector unit's control and scalar registers. The address bus is used to identify the locations to read or write data in memory components or processing cores. The address bus of the functional units is decoded and arbitrated by the memory manager.
3.2.1 Descriptor Memory The Descriptor Memory [17],[18] is used to define data
transfer patterns. Each descriptor transfers a strided stream of data. More complex non-
contiguous data transfers can be defined by several descriptors. A single descriptor is
represented by parameters called Main Memory address, local memory address, stream,
stride and offset. The Main Memory and local memory address parameters specify the
memory locations to read and write. Stream defines the number of data elements to be
transferred. Stride indicates the distance between two consecutive memory addresses of
a stream. The offset register field is used to point to the next vector data access through
the main address.
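As an illustration (the field names and widths below are assumptions, not the hardware encoding), a single descriptor can be modelled as a small record:

#include <stdint.h>

/* Illustrative model of one PVMC descriptor (Section 3.2.1). */
typedef struct {
    uint64_t main_addr;   /* Main Memory address to read or write                */
    uint32_t local_addr;  /* location in the Specialized (on-chip) Memory        */
    uint32_t stream;      /* number of data elements to be transferred           */
    uint32_t stride;      /* distance between two consecutive element addresses  */
    uint32_t offset;      /* points to the next vector data access via main_addr */
} pvmc_descriptor;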
3.2.2 Buffer Memory The Buffer Memory architecture implements the following
features:
– Load/reuse/update to avoid accessing the same data multiple times (uses the realignment feature). It handles the increment of the base address, thus reducing loop overhead when applying strip mining.
– Data realignment when the position in the vector of the reused elements does not match their original position, or when there is a mismatch in the number of elements.
– In-order data delivery. In cooperation with the Memory Manager, which prefetches data, it ensures that the data of one pattern is sent in order to the vector lanes. This is used to implement vector chaining from/to vector memory instructions.
The Buffer Memory holds three buffers: the load buffer, the update buffer and the reuse buffer. The Buffer Memory transfers data to the vector lanes using the update buffer. The load and reuse buffers are used by the Memory Manager, which manages the Specialized Memory data (see Section 3.2.3). For example, if a vector instruction requests data that has been written recently, the Buffer Memory performs the on-chip data management and arrangement.
Fig. 3. Data Memory Buffers: Load, Reuse & Update
3.2.3 Specialized Memory The Specialized Memory keeps data close to the vector
processor. It has been designed to exploit 2D and 3D data locality. The Specialized
Memory is further divided into read and write specialized memories. Each Specialized
Memory provides a data link to the vector lanes. The Specialized Memory structure
is divided into multiple planes, as shown in Figure 2, where each plane holds rows and columns. The row defines the bit-width, and the column defines the density of the plane.
In the current configuration, the planes of the Specialized Memory have 8- to 32-bit wide
data ports, and each data port is connected to a separate lane using the update buffer.
The Specialized Memory has an address space separated from main memory. PVMC
uses special memory-memory operations that transfer data between the Specialized
Memory and the main memory. The data of the read Specialized Memory is sent di-
rectly to the vector lanes using the update buffer, and the results are written back into
the write specialized memory.
3.2.4 Main Memory The Main Memory has the largest size due to the use of external
SDRAM memory modules but also has the highest latency. The Main Memory works
independently and has multiple SDRAM modules. The SDRAM on each memory module implements multiple independent banks that can operate in parallel internally. Each bank contains multiple arrays (rows and columns) and can be accessed in parallel with other banks.
The PVMC Memory Manager manages the transfer of complex data patterns to/from
the vector lanes. The data transfer instructions are placed in the Descriptor Memory (see
Section 4.5). The Memory Manager uses the Descriptor Memory to transfer the work-
ing set of vector data. The Memory Manager is composed of two modules: the Address
Manager and the Data Manager.
3.3.1 Address Manager The Address Manager takes a vector transfer instruction
and reads the appropriate Descriptor Memory entries. The Address Manager uses one or more descriptors and maps addresses in hardware. It saves the mapped addresses into its address buffer for further reuse and rearranges them.
3.3.2 Data Manager The Data Manager is used to rearrange the output data of the vector lanes for reuse or update. The Data Manager uses the reuse, update and load buffers (shown in Figure 3) to load, rearrange and write vector data. The n shown in Figure 3 is the size of a DRAM transfer. When input and output vectors are not aligned, the Data Manager shuffles data between lanes. The Data Manager reduces loop overhead by accessing the incremented data and reusing previous data when possible. For example, if the increment is equal to 1, the Data Manager shifts the buffer by one data element and requests only one new element from the main memory. The incremented address is managed by the Address and Data Managers, which align vector data if required.
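A minimal sketch of this increment-by-one reuse (illustrative only; a window of MVL contiguous elements is assumed, and the function name is hypothetical):

#define MVL 128

/* Slide an MVL-element window by one: reuse MVL-1 previously loaded elements
 * and fetch only the newly exposed element from main memory. */
void slide_window_by_one(float window[MVL], const float *main_mem, long next_index)
{
    for (int i = 0; i < MVL - 1; i++)
        window[i] = window[i + 1];               /* reuse buffer: shift by one        */
    window[MVL - 1] = main_mem[next_index];      /* load buffer: one new element only */
}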
The Memory Manager takes memory address requests from the control bus, and the Address Manager reads the data transfer information from the Descriptor Memory. The Data Manager checks data requests against the Specialized Memory; if the data is available there, the Data Manager transfers it to the update buffer. If the requested data is not available, the Memory Manager passes the request information to the Multi DRAM Access Unit (see Section 3.4), which loads the data into the load buffer. The load buffer, together with the reuse buffer, performs data alignment and reuse where required and fills the update buffer. The update buffer transfers data to the vector lanes.
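A condensed sketch of this request path (illustrative pseudo-C; all names are placeholders, not the PVMC interface):

struct pvmc_request;  /* opaque handle for one descriptor-based request (illustrative) */

int  specialized_memory_has(const struct pvmc_request *r); /* on-chip hit check       */
void mdau_load(const struct pvmc_request *r);              /* DRAM controllers fetch  */
void align_and_reuse(const struct pvmc_request *r);        /* load + reuse buffers    */
void copy_to_update_buffer(const struct pvmc_request *r);  /* fill the update buffer  */
void send_to_vector_lanes(const struct pvmc_request *r);   /* feed the vector lanes   */

void pvmc_handle_request(const struct pvmc_request *r)
{
    if (specialized_memory_has(r)) {
        copy_to_update_buffer(r);      /* data already placed on chip                 */
    } else {
        mdau_load(r);                  /* miss: Multi DRAM Access Unit (Section 3.4)  */
        align_and_reuse(r);            /* alignment/reuse via load and reuse buffers  */
        copy_to_update_buffer(r);
    }
    send_to_vector_lanes(r);           /* update buffer feeds the vector lanes        */
}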
The Multi DRAM Access Unit (MDAU) accesses data from the main memory. The Main
Memory organization is classified into data, address, control, and chip-select buses. The
data bus that transmits data to and from the Main Memory is 64 bits wide. A shared
address bus carries row, column and bank addresses to the main memory. There is a
chip-select network that connects the MDAU to each SDRAM module. Each bit of chip
select operates a separate SDRAM module.
The MDAU can integrate multiple DRAM Controllers using separate data buses, which increases the memory bandwidth. There is one DRAM Controller per SDRAM module. In the current evaluation environment, two DRAM Controllers are integrated. Each
DRAM Controller takes memory addresses from the Memory Manager, performs ad-
dress mapping from physical address to DRAM address and reads/writes data from/to
its SDRAM module.
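As an illustration of this address-mapping step (the bit-field layout below is an assumption for a 64-bit module with 8 banks, 16K rows and 1K columns; the real split depends on the SDRAM devices used):

#include <stdint.h>

/* Illustrative physical-to-DRAM address decomposition. */
typedef struct {
    unsigned chip;    /* chip-select line, i.e. which SDRAM module */
    unsigned bank;
    unsigned row;
    unsigned column;
} dram_addr;

dram_addr map_physical(uint64_t phys)
{
    dram_addr a;
    uint64_t word = phys >> 3;          /* 8-byte words on the 64-bit data bus      */
    a.column = word & 0x3FF;            /* low 10 bits: column                      */
    a.bank   = (word >> 10) & 0x7;      /* next 3 bits: bank, enabling bank overlap */
    a.row    = (word >> 13) & 0x3FFF;   /* next 14 bits: row                        */
    a.chip   = (unsigned)(word >> 27);  /* remaining bits select the module         */
    return a;
}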
4 PVMC Functionality
In this section, we discuss the important challenges faced by the memory unit of soft
vector processors and explain our solution. The section is further divided into five
subsections: Memory Hierarchy, Memory Crossbars, Address Registers, Main Memory
Controller and Programming Vector Accesses.
4.1 Memory Hierarchy
A conventional soft vector system uses a cache hierarchy to improve data locality by providing and reusing the required data set for the functional units. With a high number of vector lanes and strided data transfers, the vector memory unit does not satisfy the spatial locality requirements. PVMC improves spatial locality by accessing more data elements than the MVL into its Specialized Memory and transferring them using the Buffer Memory. Non-unit-stride accesses do not exploit the spatial locality offered by caches, resulting in a considerable waste of resources; PVMC manages non-unit-stride memory accesses in the same way as unit-stride ones. Like the cache of a soft vector processor, the PVMC Specialized Memory temporarily holds data to speed up later accesses. Unlike a cache, data is deliberately placed in the Specialized Memory at a known location, rather than automatically cached according to a fixed hardware policy. The PVMC Memory Manager, together with the Buffer Memory, holds information about unit- and non-unit-strided accesses and updates and reuses them for future accesses.
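As a rough illustration of this waste (assuming a 64-byte cache line and 4-byte elements accessed with a stride of 16 elements), each fetched line supplies only one useful element:

\[
\text{line utilization} = \frac{4~\text{bytes used}}{64~\text{bytes fetched}} = 6.25\,\%
\]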
for (i = 0; i < length; i = i + 1)
{
    data_out[i] = a[i * 64] + b[i] + c[i];
}
(a)
for (i = 0; i < length; i += 64)
{
    VLD.S (/* Main Memory */ 0x00000000 + i, /* Vector Register */ VR0, /* Stride */ 0x40);
    VLD   (/* Main Memory */ 0x00000100 + i, /* Vector Register */ VR1);
    VADD  VR0, VR1, VR2
    VLD   (/* Main Memory */ 0x00000200 + i, /* Vector Register */ VR3);
    VADD  VR2, VR3, VR3
    VST   (/* Main Memory */ 0x10000000 + i, /* Vector Register */ VR3);
}
(b)
(c)
Fig. 4. (a) Scalar Loop (b) Vector Loop (c) PVMC Vector Loop
and efficient prefetching support. Delay and power increase for complex non-stride ac-
cesses and crossbars. The PVMC VLD instruction uses a single or multiple descriptors
to transfer data from the Main Memory to the Specialized Memory. PVMC rearranges
and manages accessed data in the Buffer Memory and transfers it to vector registers. In
Figure 5, PVMC prefetches vectors longer than MVL into the Specialized Memory. After completing the first transfer of MVL elements, the PVMC sends a signal to the vector processor acknowledging that the register elements are available for processing. In this way PVMC pipelines the data transfers and overlaps computation, address management and data transfers.
A common concern when using soft vector processors is compiler support. A soft vector processor typically requires in-line assembly code whose vector instructions are translated with a modified GNU assembler. In order to describe how PVMC is used, the
supported memory access patterns are discussed in this section. We provide C macros
which ease the programming of common access patterns through a set of function calls,
integrated with an API. The memory access information is included in the PVMC
header file and provides function calls (e.g. STRIDED(), INDEXED(), etc.) that re-
quire basic information of the local memory and the data set. The programmer has to
annotate the code using PVMC function calls. The function calls are used to transfer
the complete data set between Main Memory and Specialized Memory. PVMC supports
complex data access patterns such as strided vector accesses and transfers complex data
patterns in parallel with vector execution.
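A hedged usage sketch follows (the text names STRIDED() and INDEXED() but not their signatures; the header name and the argument order — base address, element count, stride in elements — are assumptions):

#include "pvmc.h"   /* hypothetical name for the PVMC header file mentioned above */

#define LENGTH 4096

/* Describe the accesses of the loop in Figure 4(a) once; PVMC then prefetches
 * the patterns into the Specialized Memory in parallel with vector execution. */
void describe_patterns(float *a, float *b, float *c, float *data_out)
{
    STRIDED(a,        LENGTH, 64);  /* a[i * 64]: non-unit stride of 64 elements */
    STRIDED(b,        LENGTH, 1);   /* unit-stride input stream                  */
    STRIDED(c,        LENGTH, 1);   /* unit-stride input stream                  */
    STRIDED(data_out, LENGTH, 1);   /* unit-stride output stream                 */
}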
For multiple or complex vector accesses, PVMC prefetches data using vector ac-
cess function calls (e.g. INDEXED(), etc.), arranges them according to the predefined
patterns and buffers them in the Specialized Memory. The PVMC memory manager efficiently transfers data with long strides, longer than the MVL size, and feeds it to the vector processor. For example, a 3D-stencil access requires three descriptors. Each descriptor
accesses a separate (x, y and z) vector in a different dimension, as shown in Figure 6. By
combining these descriptors, the PVMC exchanges 3D data between the Main Memory
and the Specialized Memory buffer. The values X, Y and Z define the width (row size),
height (column size) and length (plane size) respectively of the 3D memory block.
When n = 4, 25 points are required to compute one central point. The 3D-Stencil has x, z and y vectors oriented along the row, column and plane directions respectively. The x, y and
z vectors have length of 8, 9 and 8 points respectively. The vector x has unit stride, the
vector z has stride equal to row size and the vector y has stride equal to the size of one
plane, i.e. row size × column size.
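As a sketch, the three descriptors described above can be written out as follows (reusing the illustrative descriptor fields from Section 3.2.1; X and Y denote the row and column sizes, and the local addresses and offsets are left at zero for brevity):

#include <stdint.h>

typedef struct {               /* same illustrative fields as the sketch in Section 3.2.1 */
    uint64_t main_addr;
    uint32_t local_addr, stream, stride, offset;
} pvmc_descriptor;

/* Three descriptors for one 3D-stencil access: x (8 points, unit stride),
 * y (9 points, stride of one plane) and z (8 points, stride of one row). */
void setup_stencil_descriptors(pvmc_descriptor d[3],
                               uint64_t base, uint32_t X, uint32_t Y)
{
    d[0] = (pvmc_descriptor){ base, 0, 8, 1,     0 };  /* x: unit stride           */
    d[1] = (pvmc_descriptor){ base, 0, 9, X * Y, 0 };  /* y: stride = row * column */
    d[2] = (pvmc_descriptor){ base, 0, 8, X,     0 };  /* z: stride = row size     */
}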
5 Experimental Framework
In this section, we describe the PVMC and VESPA vector systems as well as the Nios
scalar system. The Altera Quartus II version 13.0 and the Nios II Integrated Develop-
ment Environment (IDE) are used to develop the systems. The systems are tested on
an Altera Stratix-IV FPGA based DE4 board. The section is further divided into three
subsections: the VESPA system, the PVMC system and the Nios system.
5.1.1 Scalar Core An SPREE [19] scalar processor is used to program the VESPA
system and perform scalar operations. The SPREE core is a 3-stage MIPS pipeline with full forwarding and has a 4K-bit branch history table for branch prediction. The SPREE core keeps working in parallel with the vector processor, with the exception of control instructions and scalar load/store instructions between the two cores.
5.1.2 Vector Core A soft vector processor called VESPA (Vector Extended Soft Pro-
cessor Architecture) [10] is used in the design. VESPA is a parameterizable design
enabling a large design space of possible vector processor configurations. These pa-
rameters can modify the VESPA compute architecture, instruction set architecture, and
memory system. The vector core uses an MVL of 128.
5.1.3 Memory System The baseline VESPA vector memory unit (shown in Figure 7)
includes an SDRAM controller, cache and bus crossbar units. The SDRAM controller
transfers data from Main Memory (SDRAM modules) to the local cache memory. The
vector core can access only one cache line at a time, which is determined by the requesting lane with the lowest lane identification number. Each byte in the accessed cache line can
be simultaneously routed to any lane through the bus crossbar. Two crossbars are used,
one read crossbar and one write crossbar.
The PVMC system manages on-chip and off-chip data movement using the Buffer Memory and the Descriptor Memory. The memory crossbar is replaced with the Buffer Memory, which rearranges and transfers data to the vector lanes. The Specialized Memory is used instead of a cache memory; it handles complex patterns and transfers them to the vector lanes with less delay.
5.4 Applications
Table 1 shows the application kernels which are executed on the vector systems along
with their memory access patterns. The set of applications covers a wide range of pat-
terns allowing us to measure the behaviour and performance of data management and
data transfer of the systems in a variety of scenarios.
6 Results and Discussion
In this section, the resources used by the memory and bus systems, the application
performance, the dynamic power and energy and the memory bandwidth of the PVMC
vector system are compared with the results of the non-PVMC vector system and the
baseline scalar systems.
Multiple memory hierarchies and different bus system configurations of PVMC &
VESPA systems are compiled using Quartus II to measure their resource usage, maxi-
mum operating frequency and leakage power.
Table 2 (a) presents the maximum frequency of the memory system for 1 to 64 vec-
tor lanes with 32kB of cache/specialized memory. The VESPA system uses crossbars
to connect each byte of the cache line to the vector lanes. Increasing the number of
lanes requires more crossbars and a larger multiplexer that routes data between vector
lanes and cache lines. This decreases the operating frequency of the system: for the VESPA vector processor, results show that increasing the number of vector lanes from 1 to 64 requires larger crossbar multiplexer switches, which forces the system to operate at a lower frequency. The PVMC Specialized Memory uses separate read and write specialized
memories that reduce the switching bottleneck. The vector lanes read data from read
Specialized Memory for processing and transfer it back to the write specialized mem-
ory. The on-chip data alignment and management is done by the Data Manager and
the buffer memory. This direct coupling of the Specialized Memory and vector lanes
using the update buffer is very efficient and allows the system to operate at a higher
clock frequency. Table 2 (b) presents the maximum frequency for the data bus to op-
erate multiple memory controllers. The PVMC data bus supports a dedicated bus for
each SDRAM controller which increases the bandwidth of the system. The data bus of
VESPA system supports only a single SDRAM controller.
Table 3 shows the resource utilization of the memory hierarchy of the VESPA and
PVMC systems. The memory hierarchy is compiled for 64 lanes with 32KB of memory
and several line sizes. Column Line Size presents cache line and update buffer size in
bytes of the VESPA and PVMC systems respectively. The VESPA system cache mem-
ory uses cache lines to transfer each byte to the vector lanes. The PVMC update buffer
is managed by the data manager and is used to transfer data to the vector lanes. Column
Reg, LUT shows the resources used by the cache controller and the memory manager of
the VESPA and PVMC systems respectively. Column Memory Bits presents the num-
ber of BRAM bits for the local memory. The PVMC memory system uses separate read and write specialized memories, and therefore it occupies twice the number of BRAM bits. The data manager of the PVMC memory system occupies 3 to 5 times fewer re-
sources than the VESPA memory system. Column Main Memory presents the resource
utilization of the DRAM controllers. The VESPA system does not support two SDRAM
controllers. Column Power shows leakage power in watts for the VESPA and PVMC
memory systems. The leakage current of the VESPA system is higher than in PVMC,
because it requires a complex crossbar network to transfer data between the cache and
the vector lanes and requires more multiplexers.
For performance comparisons, we use the applications of Table 1. We run the applica-
tions on the Nios II/e, Nios II/f and VESPA systems and compare their performance
with the proposed PVMC vector system. The Nios II/e, VESPA and PVMC systems run at 100 MHz. The VESPA and PVMC systems are compiled using 64 lanes with 32 kB of cache and Specialized Memory respectively. The Nios II/f system operates at 200 MHz using data and instruction caches of 32 kB each. All systems use a single SDRAM
controller to access the main memory.
Fig. 9. Speedup of PVMC and VESPA over Nios II/e and Nios II/f
Fig. 10. Vector and Scalar Systems: Application Kernels Execution Clocks
Figure 9 shows the speedups of the VESPA and PVMC systems over Nios II/f and Nios II/e. Results show that vector execution with PVMC is between 8.3x and 31.04x faster than the Nios II/f. When compared with the Nios II/e, the PVMC improves speed between 90x and 313x, which shows the potential of vector accelerators for high performance.
In order to rule out that the speedups over the scalar Nios processors are merely due to using SPREE as the scalar unit of the vector processor, we execute the FIR, Mat Mul and 3D-Stencil application kernels on a SPREE scalar processor alone, i.e. with the vector processor disabled. Comparing the performance of the FIR, Mat Mul and 3D-Stencil kernels on the SPREE, Nios II/e and Nios II/f scalar processors, the results show that SPREE improves speed between 5.2x and 8.6x over Nios II/e, whereas against Nios II/f the SPREE is not competitive: the Nios II/f achieves speedups between 1.27x and 1.67x over the SPREE scalar processor. The results show that Nios II/f performs better than the Nios II/e and SPREE scalar processors.
Figure 10 shows the execution time (clock cycles) for the application kernels. The X
and Y axis represent application kernels and number of clock cycles, respectively. Each
bar represents the application kernel’s computation time and memory access time in log-
arithmic scale (less is better). Figure 11 presents the speedup of PVMC over VESPA.
By using the PVMC system, the results show that the FIR kernel achieves 2.37x of
speedup over VESPA. The application kernel has streaming data accesses and requires
a single descriptor to access a stream that reduces the address generation/management
time and on-chip request/grant time. The 1D Filter accesses a 1D block of data and
achieves 3.18x of speedup. The Tri-diagonal kernel processes the matrix with sparse
data placed in diagonal format. The application kernel has a diagonal access pattern
and attains 2.68x of speedup. The Mat Mul kernel accesses row and column vectors.
PVMC uses two descriptors to access the two vectors. The row vector descriptor has
unit stride whereas the column vector has a stride equal to the size of a row. The appli-
cation yields 3.13x of speedup. RGB2CMYK and RGB2Gray take 1D block of data and
achieve 3.89x and 4.53x of speedup respectively. The Motion Estimation and Gaussian
applications take 2D block of data and achieve 2.67x and 2.16x of speedup respec-
tively. The PVMC system manages addresses of row and column vectors in hardware.
The 3D-Stencil data uses row, column and plane vectors and achieves 2.7x of speedup.
The K-Mean kernel has 1D strided and load/store accesses and achieves 2.01x of speedup. The vectorized 3D-Stencil code for VESPA always uses the whole MVL with unit-stride accesses and reads vector data by using vector address registers and vector load/store operations. The VESPA multi-banking methodology requires a larger crossbar that routes requests from load/store units to cache banks and another one from the banks back to the ports. This also increases the cache access time but reduces the simultaneous read and write conflicts.
Fig. 12. Vector & Scalar Systems: Memory Bandwidth
To measure voltage and current, the DE4 board provides a resistor to sense current/voltage and 8-channel differential 24-bit analog-to-digital converters. Table 4 presents the dynamic power and energy of the different systems using a filter application kernel with a 2 MByte input data set, 1D-block (64-element) data accesses and 127 arithmetic operations on each block of data. Column System@MHz shows the operating frequency
of the Nios II/e and Nios II/f cores and the VESPA and PVMC systems. The vector
cores execute the application kernel using different numbers of lanes while the clock
frequency is fixed to 100 MHz. To control the clock frequencies all systems use a single
phase-locked loop (PLL). Columns Reg, LUTs and Mem Bits show the amount of logic
and memory in bits respectively utilized by each system. The Nios II/e does not have
a cache memory and only uses program memory. Column Dynamic Power and Energy
presents run time measured power of scalar and vector systems while executing the
filter application kernel and calculated energy for power and execution time. Column
FPGA Core includes the power consumed by on-chip FPGA resources and PLL power.
Column SDRAM power presents the power of the SDRAM memory device. The power
of Nios II/e and Nios II/f increases with frequency. Results show that the PVMC draws
21.2% less power and 4.04x less energy than the VESPA system, both using 64 lanes.
For a single lane configuration, PVMC consumes 14.55% less power and 2.56x less
energy. This shows that PVMC improves system performance and handles data more efficiently, and the results improve with a higher number of lanes. The PVMC using a single lane and operating at 100 MHz draws 14% and 44% less power and consumes 14.5x and 8.5x less energy than a Nios II/f core operating at 100 MHz and 200 MHz respectively. When compared to a Nios II/e core at 100 MHz and 200 MHz, the PVMC system draws 0.03% and 17.3% less power respectively and consumes 2.07x and 1.21x less energy.
6.4 Bandwidth
In this section, we measure the bandwidth of the PVMC, VESPA and Nios ll/f systems
by reading and writing memory patterns. The PVMC with a single SDRAM controller
is also executed on a Xilinx Virtex-5 ML505 FPGA board, and results are very sim-
ilar. The systems have a 32-bit on-chip data bus operating at 100 MHz, which provides a maximum bandwidth of 400 MB/s. The Nios II/f uses the SGDMA controller and issues multiple load/store or Scatter/Gather DMA calls to transfer data patterns. The PVMC
can achieve maximum bandwidth by using a data transfer size equal to the data set. In
order to check the effects of memory and address management units over the system
bandwidth, we transfer data between processor and memory using three types of trans-
fers: the unit stride transfer, strided transfer, and indexed transfer (scatter/gather). The
X-axis (shown in Figure 12) presents three types of data transfers. Each data transfer
reads and writes a data set of 2MB from/to the SDRAM. All three transfers have a trans-
fer size of 1024 B. The unit stride transfer and the strided transfer use a unit stride and a 64 B stride between two consecutive elements, respectively. The indexed transfer
merges non-contiguous memory accesses to a continuous address space. The indexed
transfer reads a series of unit stride transfer instructions that specify the data to be
transferred. For the data of unit stride transfer type, results show that PVMC transfers
data 1.82x and 1.40x faster than VESPA and Nios II/f respectively. While transferring
data with the strided transfer type, PVMC improves bandwidth 2.12 and 1.79 times. Re-
sults show that PVMC improves bandwidth up to 1.78x and 1.53x for indexed transfer.
Nios II/f uses the SGDMA controller that handles transfer in parallel. SGDMA follows
the bus protocol and requires a processor that provides data transfer instructions. For
indexed transfer, Nios II/f uses multiple instructions to initialize SGDMA. SGDMA
can begin a new transfer before the previous data transfer completes with a delay called
pipeline latency. The pipeline latency increases with the number of data transfers. Each
Data Transfer requires bus arbitration, address generation and SDRAM bank/row man-
agement. The VESPA hardware prefetcher transfers data to vector core using cache
memory. The cache memory streams all cache lines, including the prefetched lines,
to the vector core using the memory crossbar. The PVMC data transfers use a few descriptors that reduce run-time address generation and address request/grant delay and improve bandwidth by managing addresses at compile time and by accessing data from multiple DRAM devices and banks in parallel.
The PVMC is also integrated with the ARM processor architecture and tested on the Xilinx Zynq All-Programmable System on Chip (SoC) [20] development platform. Vector multiplication applications are executed on the NEON coprocessor of the ARM Cortex-A9 Evaluation System. The NEON coprocessor uses SIMD instruction sets to process the vector multiplication. The ARM Cortex-A9 Evaluation System architecture uses a single NEON coprocessor with a 512-byte register file and 64-bit registers, and it performs single-precision floating-point instructions on all lanes. The ARM Generic Memory Controller (ARM-GMC) uses 512 MB of DDR3 Main Memory with 1 Gbps bandwidth. The ARM-PVMC system uses PVMC to access data from the Main Memory.
Three types of vector data structures (V1, V2, and V3) are selected to test the performance of the system architecture. The V1 vector has fixed strides between two elements, and its memory accesses are predictable at compile time. The V2 memory accesses are not predictable at compile time but are known at run time; the distance between two consecutive elements of V2 is greater than the cache line size. The V3 memory accesses are not known until run time; we used a random address generator that provides each address before computation.
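An illustrative C sketch of the three access patterns (the array sizes, the stride source and the random generator are assumptions made for the example):

#include <stdlib.h>

#define N 1024

/* V1: stride fixed at compile time; V2: stride only known at run time and larger
 * than a cache line; V3: irregular indices from a random address generator. */
void gather_v1_v2_v3(const float *src, float *v1, float *v2, float *v3,
                     long runtime_stride)
{
    for (long i = 0; i < N; i++) {
        v1[i] = src[i * 64];                 /* V1: compile-time stride of 64   */
        v2[i] = src[i * runtime_stride];     /* V2: run-time stride             */
        v3[i] = src[rand() % (N * 64)];      /* V3: random (irregular) accesses */
    }
}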
Figure 13 shows the number of clock cycles taken by ARM-PVMC and ARM-GMC to access and process the V1, V2 and V3 vectors. The Y-axis presents the number of clocks in logarithmic scale. The X-axis shows the applications with V1, V2 and V3 vector data access patterns. While performing vector multiplication on the V1-type vector, the ARM-PVMC achieves 4.12x of speedup over the ARM-GMC. The V1 addresses are aligned; therefore, the ARM system uses multiple direct memory access calls which require the start address, the end address and the size of the transfer. The ARM-PVMC organizes the known addresses in descriptors, which reduces the run-time address generation and address management time. The ARM snoop control unit (SCU) reuses and prefetches the data in parallel with the computation, but there are still on-chip bus management and Main Memory data transfer overheads. For vector V2 addition, the ARM-PVMC achieves 12.68x of speedup. The V2 stride size is only known at run time and is greater than the cache line. The baseline system uses multiple transfer calls to access the vector, which incurs address generation, on-chip bus arbitration and Main Memory bank selection time. The ARM-GMC uses multiple load/store and DMA transfer calls to access the V2 vector; this adds on-chip bus delay. The ARM-PVMC efficiently handles strides at run time and manages the addresses/data of the access patterns, translating and reordering them in hardware, in parallel with the ARM processor cores. As the vector addresses are in the form of descriptors, the PVMC on-chip bus and the memory controller manage data transfers in a single burst or a few bursts. The Specialized Memory places the complete data structure in contiguous form and feeds it to the processing cores, which reduces the local memory management time. Read/write operations on the Specialized Memory banks are performed in parallel.
Fig. 13. ARM: Generic Memory Controller and PVMC based System
The vector addition of type V3 vectors gives 1.21x of speedup. The V3 vector requires pointer/irregular data transfer calls, which incur address management, bus arbitration, scheduling and Main Memory delays. As the addresses are not known, the ARM-PVMC Memory Manager and Memory Controller are not able to work in parallel. The ARM-GMC uses generic Local Memory, Bus Management and Main Memory units. This results in a compromise on the achievable performance, because the generic units require extra handshaking and synchronization time. In order to achieve higher performance, ARM-PVMC bypasses the generic system units and introduces the Specialized Memory, Memory Management and Main Memory units. The PVMC internal units have the ability to operate independently, in parallel with each other, and ARM-PVMC achieves maximum performance when all of its units work in parallel.
7 Related Work
Yu et al. propose VIPERS [21], a vector architecture that consists of a scalar core to
manage data transfers, a vector core for processing data, an address generation logic,
and a memory crossbar to control data movement. Chou et al. present the VEGAS [11]
vector architecture with a scratchpad to read and write data and a crossbar network to
shuffle vector operations. VENICE [22] is an updated version of VEGAS, with scratch-
pad and DMA that reduces data redundancy. VENICE has limitations regarding the
rearrangement of complex data with scatter/gather support. Yiannacouras et al. pro-
pose the VESPA [10] processor that uses a configurable cache and hardware prefetch-
ing of a constant number of cache lines to improve the memory system performance.
The VESPA system uses wide processor buses that match the system cache line sizes.
VIPERS and VEGAS require a scalar Nios processor that transfers data between the
scratchpad and the main memory. A crossbar network is used to align and arrange on-
chip data. The PVMC eliminates the crossbar network and the limitation of using a
scalar processor for data transfer. PVMC manages addresses in hardware with the pat-
tern descriptors and accesses data from Main Memory without support of a scalar pro-
cessor core. The PVMC data manager rearranges on-chip data using the Buffer Memory
without a complex crossbar network which allows the vector processor to operate at
higher clock rates.
McKee et al. [23] introduce a Stream Memory Controller (SMC) system that detects and combines streams at program time and, at run time, prefetches read-streams, buffers write-streams, and reorders the accesses to use the maximum available memory bandwidth. The SMC system describes the policies that reorder streams with a
fixed stride between consecutive elements. The PVMC system prefetches both regular
and irregular streams and also supports dynamic streams whose addresses are dependent
on run-time computation. McKee et al. also proposed the Impulse memory controller
[24] [25], which supports application-specific optimizations through configurable phys-
ical address remapping. By remapping the physical addresses, applications can manage
the data to be accessed and cached. The Impulse controller works under the command
of the operating system and performs physical address remapping in software, which
may not always be suitable for HPC applications using hardware accelerators. PVMC
remaps and produces physical addresses in the hardware unit without the overhead of
operating system intervention. Based on its C/C++ language support, PVMC can be
used with any operating system that supports the C/C++ stack.
A scratchpad is a low latency memory that is tightly coupled to the CPU [26]. There-
fore, it is a popular choice for on-chip storage in real-time embedded systems. The al-
location of code/data to the scratchpad memory is performed at compile time leading
to predictable memory access latencies. Panda et al. [27] developed a complete alloca-
tion strategy for scratchpad memory to improve the average-case program performance.
The strategy assumes that the access patterns are known at compile time. Suhendra et
al. [28] aim at optimizing the worst-case performance of memory access tasks. However, in that study, scratchpad allocation is static, with static and predictable access patterns that do not change at run time, which raises performance issues when the amount of code/data is much larger than the scratchpad size. Dynamic data structure management using scratchpad techniques is more effective in general because it may keep the working set in the scratchpad. This is done by copying objects at predetermined points in the program in response to execution [29]. Dynamic data structure management requires
a dynamic scratchpad allocation algorithm to decide where copy operations should be
carried out. A time-predictable dynamic scratchpad allocation algorithm has been de-
scribed by Deverge and Puaut [29]. The program is divided into regions, each with a
different set of objects loaded into the scratchpad. Each region supports only static data
structures. This restriction ensures that every program instruction can be trivially linked
to the variables it might use. Udayakumaran et al. [30] proposed a dynamic scratchpad
allocation algorithm that supports dynamic data structures. It uses a form of data ac-
cess shape analysis to determine which instructions can access which data structures,
and thus ensures that accesses to any particular object type can only occur during the
regions where that object type is loaded into the scratchpad. However, the technique is
not time-predictable, because objects are spilled into external memory when insufficient
scratchpad space is available. The PVMC Address Manager arranges memory accesses that are unknown at compile time into pattern descriptors at run time. The PVMC Data Manager rearranges on-chip data using the Buffer Memory without a complex crossbar network, which allows the vector processor to operate at higher clock rates.
A number of off-chip DMA Memory Controllers have been suggested in the past.
The Xilinx XPS Channelized DMA Controller [31], Lattice Semiconductor’s Scatter-
Gather Direct Memory Access Controller IP [32] and Altera’s Scatter-Gather DMA
Controller [33] cores provide data transfers from non-contiguous blocks of memory by
means of a series of smaller contiguous transfers. The data transfer of these controllers
is regular and is managed/controlled by a microprocessor (Master core) using a bus pro-
tocol. PVMC extends this model by enabling the memory controller to access complex
memory patterns.
Hussain et al. proposed a vector architecture called Programmable Vector Memory Controller (PVMC) [34] and its implementation on an Altera Stratix IV 230 FPGA device. That PVMC accesses memory patterns and feeds them to a soft vector processor architecture. The Advanced Programmable Vector Memory Controller supports application-specific accelerators, scalar processors, soft vector core processors and hard-core ARM processor architectures. The advanced architecture is tested with applications having complex memory patterns.
8 Conclusion
The memory unit can easily become a bottleneck for vector accelerators. In this paper,
we have suggested a memory controller for vector processor architectures that manages
memory accesses without the support of a scalar processor. Furthermore, to improve on-chip data access, a Buffer Memory and a Data Manager are integrated that efficiently access, reuse, align and feed data to the vector processor. A Multi DRAM Access Unit
is used to improve the Main Memory bandwidth which manages the memory accesses
of multiple SDRAMs. The experimental evaluation based on the VESPA vector system
demonstrates that the PVMC based approach improves the utilization of hardware re-
sources and efficiently accesses Main Memory data. The benchmarking results show that PVMC achieves speedups between 2.01x and 4.53x for 10 applications, consumes 2.56 to 4.04 times less energy, and transfers different data set patterns 1.40x to 2.12x faster than the baseline vector system. In the future, we plan to embed run-time memory-access-aware descriptors inside PVMC for vector multi-core architectures.
References