
AMMC: Advanced Multi-core Memory Controller

Tassadaq Hussain (1,2), Oscar Palomar (1,2), Osman Unsal (1), Adrian Cristal (1,2,3), Eduard Ayguadé (1,2), Mateo Valero (1,2)
(1) Computer Sciences, Barcelona Supercomputing Center, Barcelona, Spain
(2) Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain
(3) Artificial Intelligence Research Institute (IIIA), Centro Superior de Investigaciones Científicas (CSIC), Barcelona, Spain
Email: {first}.{last}@bsc.es

Abstract: In this work, we propose an efficient scheduler and intelligent memory manager known as AMMC (Advanced Multi-Core Memory Controller), which proficiently handles data movement and computational tasks. The proposed AMMC system improves performance by managing complex data transfers at run-time and scheduling multi-cores without the intervention of a control processor or an operating system. AMMC has been coupled with a heterogeneous system that provides both general-purpose cores and application-specific accelerators. The AMMC system is implemented and tested on a Xilinx ML505 evaluation FPGA board. The performance of the system is compared with a microprocessor-based system that has been integrated with the Xilkernel operating system. Results show that the AMMC-based multi-core system consumes 48% fewer hardware resources and 27.9% less on-chip power, and achieves a 6.8x speed-up compared to the MicroBlaze-based multi-core system.

I. INTRODUCTION

The latest multi-core architectures require both programmability and performance, and they combine different types of cores, becoming heterogeneous systems. To get programmability, a part of the program is executed on general-purpose cores. To achieve performance and to increase power efficiency, compute-intensive tasks are mapped onto separate hardware accelerators or application-specific processors. The dedicated application-specific accelerator cores have a small footprint and low power dissipation and feature high performance [1].

To overcome the memory wall and to reduce system power, a memory system is needed that supports cores with low frequency and low complexity, has efficient local memory and data management, and includes an intelligent scheduler, while supporting a programming model that manages memory accesses in software so that hardware can best utilize them. In this work, we have integrated a memory controller, which we term AMMC (Advanced Multi-core Memory Controller), with a heterogeneous multi-core system having Application Specific Hardware Accelerators (ASHA) and Scalar Soft Processors (SSP). Some salient features of the AMMC are given below:

• The AMMC-based system handles heterogeneous (SSP and ASHA) cores using Symmetric and Asymmetric scheduling policies, without the support of a master core or an operating system.

• Regular and irregular access patterns of heterogeneous multi-cores are described using a separate Descriptor Memory, which reduces the on-chip communication time and run-time address generation overhead.

• The AMMC Address Manager and Scheduler handle regular and irregular pattern requests of a heterogeneous multi-core system, provide precise timing and allow the scheduling mode to be changed at run-time.

• When compared to the baseline multi-core system implemented on the Xilinx FPGA, the AMMC multi-core system achieves a 6.8x speed-up, transfers datasets up to 1.95x faster, and consumes 48% fewer hardware resources and 27.9% less on-chip power.

II. ADVANCED MULTI-CORE MEMORY CONTROLLER

In this section, we describe the AMMC system. The architecture (shown in Figure 1) is divided into five units: the Bus System (A), the Local Memory Unit (B), the Memory Manager (C), the Scheduler (D) and the Pattern Aware SDRAM Controller (E). Figure 1 also shows the Multi-Core System, which executes the applications.

The Multi-Core System can have general-purpose processors, application-specific accelerator cores or a combination of both types. The Bus System [2] provides a link between AMMC and the Multi-Core System. The AMMC Memory Unit stores data and access pattern descriptors in Specialized Memory and Descriptor Memory respectively. Each processing core has a separate Specialized Memory [3] and a number of Descriptor Memory blocks. The descriptors are programmed at compile-time and provide information about the memory access patterns and their priorities. At run-time, the AMMC Scheduler receives multiple memory read/write requests from the Multi-Core System and selects a processing core, depending upon its priority level and the scheduling policy. The Scheduler forwards the memory request to the Memory Manager. The Memory Manager is divided into the Address Manager and the Data Manager. The Address Manager takes a Task ID from the AMMC Scheduler and fetches its Descriptor Memory. Depending on the access pattern, the Address Manager uses single or multiple descriptors, and maps and rearranges addresses in hardware. The Address Manager saves mapped addresses into its Address Buffer for further reuse. The Data Manager improves the Computational Intensity [3] by organizing and managing the memory accesses. For a core processing a single computed point, the maximum achievable (ideal) Computational Intensity is 1. The Data Manager accesses the elements in the form of patterns which are required for a single output (Computed_element). After accessing the first access pattern, the Data Manager reuses and updates data where required. The Pattern Aware SDRAM Controller [3] is used to transfer data between main memory and the Specialized Memory.
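As an illustration of the Address Manager behaviour described in Section II, the following is a minimal software sketch, not the authors' hardware design. It uses the descriptor fields named in this paper (Command, Task ID, External Address, Priority, Size, Stride, Offset). The interpretation of Size as an element count, Stride as the distance between consecutive elements, and Offset as the index of the next linked descriptor is an assumption made for the example, as are the names Descriptor, expand and AddressManager.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Descriptor:
    """One Descriptor Memory entry (field meanings partly assumed, see lead-in)."""
    command: str             # "read" or "write"
    task_id: int             # identifies the requesting core's buffer
    external_address: int    # start address of the data set in main memory
    priority: int            # 1 is the highest priority
    size: int                # assumed: number of elements in the pattern
    stride: int              # assumed: distance between consecutive elements
    offset: Optional[int]    # assumed: index of the next linked descriptor, None ends the chain


def expand(table: Dict[int, Descriptor], first: int) -> List[int]:
    """Follow a chain of linked descriptors and list every address it touches."""
    addresses: List[int] = []
    seen = set()
    index: Optional[int] = first
    while index is not None:
        if index in seen:
            raise ValueError("descriptor chain loops back on itself")
        seen.add(index)
        d = table[index]
        addresses.extend(d.external_address + i * d.stride for i in range(d.size))
        index = d.offset
    return addresses


class AddressManager:
    """Hypothetical model: maps a pattern once, then reuses it from the Address Buffer."""

    def __init__(self, table: Dict[int, Descriptor]) -> None:
        self.table = table
        self.address_buffer: Dict[int, List[int]] = {}

    def addresses_for(self, first: int) -> List[int]:
        if first not in self.address_buffer:
            self.address_buffer[first] = expand(self.table, first)
        return self.address_buffer[first]


# Two linked descriptors: a dense run of 4 elements, then 3 elements with stride 8.
table = {
    0: Descriptor("read", task_id=1, external_address=0x100, priority=1, size=4, stride=1, offset=1),
    1: Descriptor("read", task_id=1, external_address=0x200, priority=1, size=3, stride=8, offset=None),
}
manager = AddressManager(table)
print(manager.addresses_for(0))  # [256, 257, 258, 259, 512, 520, 528]
```

A single descriptor covers a regular (strided) pattern, while chaining descriptors through Offset is one way an irregular pattern could be composed from several regular pieces. The second call to addresses_for with the same index returns the buffered list without recomputing it.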
The research leading to these results has received funding from the European
Research Council under the European Union's 7th FP (FP/2007-2013) / ERC
GA n. 321253. It has been partially funded by the Spanish Government
(TIN2014-34557).
Fig. 1. Architecture of the Advanced Multi-core Memory Controller

Fig. 2. Multi-Core Systems: (a) MicroBlaze (b) AMMC (c) MicroBlaze Resource Utilization (d) AMMC Resource Utilization

The AMMC Descriptor Memory holds memory access and scheduling information of applications running on the Multi-Core System. Each processing core has a separate block of Descriptor Memory. The set of parameters for a Descriptor Memory block includes Command, Task ID, External Address, Priority, Size, Stride and Offset. Command specifies whether to read or write data. The Task ID and External Address parameters hold the address of the processing core (buffer) and the main memory (SDRAM) data set respectively. The Priority defines the order in which memory accesses are entitled to be processed. The parameters Size and Stride define the type of memory access. The Offset register field is used to point to the next linked memory access pattern.

The AMMC Scheduler manages and controls the run-time requests and programmed priorities of processing cores. Each processing core's request includes a read and write memory operation. At program-time, each processing core is assigned a priority value along with the Task ID, which are placed in the Program-Time Priority Descriptor (shown in Figure 1). The processing cores are categorized into three states: busy (core is processing on local buffer), requesting (core is idle), and request & busy. In the request & busy state the core is assumed to have double or multiple buffers. During this state, the core is processing the data of one buffer while making a request to fill another buffer.

The AMMC Scheduler supports two scheduling strategies for the requests of the cores: symmetric and asymmetric. The scheduling policies are programmed statically at program-time and are executed by hardware at run-time. In the Symmetric multi-core strategy, the AMMC Task Placer (Figure 1) handles the incoming requests in FIFO (First In First Out) order and places them in the Dispatch Descriptor. The Asymmetric strategy uses the priority specified for each core and the incoming requests. Each core is assigned a fixed priority at program-time, which is placed in the Program-Time Priority Descriptor. At run-time, the Scheduler accumulates requests from the Multi-Core System. The Comparator and Task Placer maintain them in the Dispatch Descriptor. The Comparator takes requests from multiple processing cores, compares them with the programmed priorities and forwards the results to the Task Placer. The Task Placer places the requests in the Descriptor Dispatch Unit and executes a request only if its core is ready to run and no higher-priority core is in the ready state. The Dispatch Descriptor executes processing core requests sequentially.

III. EXPERIMENTAL FRAMEWORK

In this section, we describe the MicroBlaze- and AMMC-based Multi-Core Systems and the rest of our experimental setup. A Xilinx ML505 evaluation FPGA board is used to test the Multi-Core Systems. The Xilinx Integrated Software Environment and Xilinx Platform Studio are used to design the Multi-Core Systems. Xilinx Power Estimator is used for the power analysis. The section is divided into three subsections: the Computation Units, the MicroBlaze-based Multi-Core System and the AMMC-based Multi-Core System.
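The two scheduling strategies of Section II can be sketched in software as follows. This is a simplified batch model under stated assumptions, not the hardware Comparator and Task Placer: all requests are assumed to be ready at once, priority 1 is taken as the highest (as in Table I), and requests with equal priority are served in FIFO order (as stated in Section IV). The function names symmetric_order and asymmetric_order are hypothetical.

```python
import heapq
from collections import deque
from typing import Dict, List


def symmetric_order(requests: List[int]) -> List[int]:
    """Symmetric policy: dispatch core requests strictly in arrival (FIFO) order."""
    queue = deque(requests)
    order = []
    while queue:
        order.append(queue.popleft())
    return order


def asymmetric_order(requests: List[int], priority: Dict[int, int]) -> List[int]:
    """Asymmetric policy: dispatch by programmed priority (1 = highest).

    The arrival index is the second sort key, so equal priorities stay FIFO.
    """
    heap = [(priority[core], arrival, core) for arrival, core in enumerate(requests)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, core = heapq.heappop(heap)
        order.append(core)
    return order


# Core IDs in arrival order; core 1 issues two requests.
requests = [3, 1, 2, 1]
# Program-time priorities per core (illustrative values only).
priority = {1: 2, 2: 1, 3: 3}

print(symmetric_order(requests))             # [3, 1, 2, 1]
print(asymmetric_order(requests, priority))  # [2, 1, 1, 3]
```

In the symmetric case no comparison is needed, which matches the paper's remark that FIFO handling removes the scheduling time. In the asymmetric case the priority table plays the role of the Program-Time Priority Descriptor.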
A. Computation Units
There are two types of cores in our heterogeneous system: ASHA cores execute application kernels with regular memory accesses, while SSP cores execute application kernels with irregular memory access patterns. Figure 3 lists all applications used in our experiments. The ASHAs are generated by the ROCCC [4] compiler. We have chosen the MicroBlaze SSP to implement the general-purpose cores of our system. MicroBlaze is a RISC SSP architecture, optimized and implemented with FPGA resources. The multi-core system includes 2 MicroBlaze cores that run 2 applications each and 8 ASHA cores.
B. MicroBlaze-based Multi-Core System
The MicroBlaze-based Multi-Core System is used as the baseline (Figure 2(a)). The resources utilized by the MicroBlaze-based multi-core system are shown in Figure 2(c). Each general-purpose core has 32KB of data cache, which is implemented using BRAM. The design uses Xilinx Cache Links (IXCL/DXCL) for I-Cache and D-Cache memory accesses respectively. The MicroBlaze instruction prefetcher improves the system performance by using the instruction prefetch buffer and instruction cache streams.

Fig. 3. Application Kernels: (a) Regular Access Pattern (b) Irregular Access Pattern
Fig. 4. Symmetric System Performance: (a) AMMC (b) MicroBlaze

Fig. 5. Asymmetric System Performance: (a) AMMC (b) MicroBlaze (c) & (d) AMMC & MicroBlaze Systems Pipeline and Overlap Time Period

TABLE I. ASYMMETRIC SCHEDULING PRIORITY POLICIES

Kernels         FIR  FFT  Mat Mul  Lapl  3D-Sten  CRG  Huffman  In Rem  N-Body  Speed-ups
Symmetric        1    1      1      1       1      1      1        1       1      5.47x
Asymmetric
  Group I        1    4      5      3       2      6      7        8       9      6.84x
  Group II       2    3      4      5       1      8      6        9       7      5.83x
  Group III      9    6      5      4       8      4      3        2       1      3.45x
  Architecture   1    1      1      1       1      2      2        2       2      5.42x

One of the MicroBlaze softcores (Core 0, Figure 2(a)) is the master core and is used to schedule the memory requests and to manage data transfer between the multi-cores and main memory (SDRAM). The MicroBlaze cores use Xilkernel, a small, lightweight, easy-to-use Real-Time Operating System (RTOS). Its API performs scheduling, inter-process communication and synchronization with POSIX threads (pthreads). From the main function, the application spawns multiple statically declared threads using the pthread library. Each thread controls a single application kernel and manages its memory patterns.

C. AMMC-based Multi-Core System

Figure 2(b) shows the implementation of an AMMC-based Multi-Core System. SSP cores do not integrate any cache but use the local memory provided by AMMC. Similarly, there is no need for an RTOS like Xilkernel. The resources consumed by each AMMC unit are shown in Figure 2(d).

IV. RESULTS AND DISCUSSION

This section analyzes the results of experiments conducted on the AMMC- and MicroBlaze-based systems. The experiments are organized into two subsections: System Performance and Area & Power.

A. Multi-Core System Performance

The system performance is measured by executing application kernels simultaneously using different scheduling policies on the AMMC- and MicroBlaze-based systems. Due to the limited FPGA resources, 5 ASHA and 2 SSP cores are integrated with the Multi-Core System. The execution time of both systems is categorized into four factors: scheduling time (Ts), memory management time (Tm), data transfer time (Tt) and computation time (Tc). Ts holds the arbitration (request, grant and wait) time of on-chip scheduling. Tm comprises the address generation and data management time. Tt represents the data access time from external memory. It includes address mapping from physical address space to SDRAM address space, interface timing and synchronization. Tc holds the computation time of the application kernels. To measure the overlap and processing time, each application kernel is assigned four timers, which count Ts, Tm, Tt and Tc clocks.

In the symmetric scheduling policy, the requests are treated with the FIFO method, which removes the scheduling time. Figures 4(a) and (b) present the overlapped/pipelined time of the AMMC and MicroBlaze systems respectively. The X and Y axes present clock cycles and execution time factors, respectively. While running the Multi-Core System using symmetric scheduling, the results show that the AMMC system achieves a 5.47x speed-up. The current Computation Units contain application kernels with different access patterns. The symmetric scheduling policy gives higher priority to application kernels with many memory requests. These requests add on-chip bus and memory access delays; therefore, the AMMC system does not fully overlap Tm and Tt. These delays can be decreased by executing the Multi-Core System with the asymmetric scheduling policy.

We categorize the asymmetric scheduling policy into two types: the memory access based asymmetric policy and the architecture based asymmetric policy (shown in Table I). The memory access based asymmetric policy assigns priorities to the application kernels with respect to their access patterns and is further categorized into three groups. In Group I, the highest priorities (1) are allocated to application kernels having fewer memory requests and dense access patterns. For example, the applications having multiple read/write requests are given low priorities. To check the sensitivity of asymmetric scheduling to the assigned priorities, the priorities of Group I are slightly varied in Group II. In Group III, the priorities
are assigned to check the fairness of the system when priorities are not assigned properly. For example, the highest priority (1) is allocated to application kernels having the largest number of memory requests. Like the MicroBlaze Xilkernel scheduling model, the AMMC scheduling policies and memory accesses are configured statically at program-time. Unlike Xilkernel, the requests are managed and executed by hardware at run-time. The memory access based asymmetric policy performs load balancing and reduces on-chip communication and memory management delay.

Figures 5(a) & (b) present clock cycles of the AMMC and MicroBlaze systems respectively, while executing application kernels simultaneously using the memory access based asymmetric scheduling policy. The X (logarithmic scale) and Y axes present clock cycles and application kernels, respectively. Each bar represents Ts, Tm, Tt and Tc. While running all application kernels together using asymmetric scheduling, the results show that the scheduling, memory management and memory transfer of the AMMC-based system are 21x, 2.9x and 7.1x faster respectively, compared to the MicroBlaze-based system. The computation units' execution time (Tc) remains the same for both systems. Figures 5(c) and (d) present the overlapped/pipelined time of the AMMC and MicroBlaze systems respectively. The Tc of all application kernels is overlapped (shown in Figure 5(c)). In the AMMC system, Tt and Tm are dominant for the regular and irregular application kernels respectively. As all AMMC units operate in parallel, AMMC overlaps all other units under the unit that consumes the most time. For example, for regular application kernels Ts, Tm and Tc are overlapped under Tt. The MicroBlaze-based system overlaps Tc and Ts completely and partially overlaps Tm and Tt (shown in Figure 5(d)). While running all application kernels together using asymmetric scheduling with the priorities of Group I, the results show that the AMMC-based system achieves a 6.84x speed-up compared to the MicroBlaze-based system. While executing application kernels with the priorities of Group II and Group III, the AMMC-based system achieves 5.83x and 3.45x speed-ups respectively. The AMMC asymmetric scheduling policy manages system resources (application code, on-chip and off-chip memory) of the Multi-Core System without the support of an operating system.

In the architecture based asymmetric policy, the Computation Units are assigned priorities depending upon their instruction set architecture, execution and communication (request/grant) speed. The architecture based asymmetric priorities are shown in Table I. All the cores of one type get the same priority. Priority 1 is assigned to the ASHA cores, so their requests are executed with higher priority. Requests having the same priority are executed in FIFO order. While running the Multi-Core System using the architecture based asymmetric scheduling policy, the results show that the AMMC-based system achieves a 5.42x speed-up compared to the MicroBlaze-based system. From the performance evaluation, we conclude that priority-based scheduling has the potential to support scalability and load balancing.

B. Area & Power

The Xilinx V5-Lx110T device dissipates 3.15 watts of on-chip static power while running the MicroBlaze-based system. The AMMC system draws 2.27 watts of on-chip power on a V5-Lx110T device. When comparing the AMMC and MicroBlaze systems without slave units (accelerators and processors), results show that the AMMC system consumes 48% fewer slices and 27.9% less on-chip static power than the MicroBlaze system. The AMMC provides low-power and simple control characteristics by rearranging data accesses and utilizing hardware units efficiently.

V. RELATED WORK

Marchand et al. [5] have developed software and hardware implementations of the Priority Ceiling Protocol that control the multiple-unit resources in a uniprocessor environment. Yan et al. [6] have designed a hardware scheduler to assist the task scheduling of synergistic processor cores (SPCs) on a heterogeneous multi-core architecture. The scheduler supports first come first service (FCFS) and dynamic priority scheduling strategies. It acts as a helper engine for separate threads working on the active cores. The scouting hardware thread [7] not only reduces latency but also optimizes memory bandwidth usage by predicting memory accesses and by prioritizing valuable memory traffic using a separate core. The information about memory accesses is stored, thus helping the scouting core to fetch and update data from the cache. The AMMC holds information about memory patterns in the form of Descriptor Memory. Currently accessed patterns are placed in the Address Manager of AMMC. The AMMC monitors the access patterns without using a separate core and reuses these patterns for multiple cores if required.

Hussain et al. [8] [9] discussed a programmable pattern based memory controller architecture. The design is appropriate for data-intensive applications with regular access patterns only. They also propose a controller [3] that supports irregular applications running on a single core. In AMMC, by contrast, we present a mechanism that supports both application-specific accelerators and RISC cores in a heterogeneous multi-core system having regular and irregular memory access patterns.

VI. CONCLUSION

In this work, we have proposed AMMC, which schedules multi-core operations while taking processing, scheduling, memory management and memory transfer into account. The AMMC architecture supports two types of cores: the general-purpose RISC core and the application-specific hardware accelerator core. The AMMC improves the system performance by reducing the speed gap between accelerators/processors and memory and by scheduling and managing complex memory patterns without master core intervention. The AMMC system is implemented and tested on a Xilinx ML505 evaluation FPGA board. The performance of the system is compared with a microprocessor-based system that has been integrated with the Xilkernel operating system. Results show that the AMMC-based multi-core system consumes 48% fewer slices and 27.9% less on-chip static power, and achieves a 6.8x speed-up compared to the MicroBlaze-based multi-core system running a real-time operating system.

REFERENCES

[1] András Vajda et al. Programming many-core chips. Springer, 2011.
[2] Tassadaq Hussain et al. MAPC: Memory Access Pattern based Controller. In 24th International Conference on FPL, 2014.
[3] Tassadaq Hussain et al. Advanced Pattern based Memory Controller for FPGA based Applications. In 24th International Conference on HPCS.
[4] Jason Villarreal et al. Designing modular hardware accelerators in C with ROCCC 2.0. In FCCM 2010.
[5] P. Marchand and P. Sinha. A hardware accelerator for controlling access to multiple-unit resources in safety/time-critical systems. Inderscience Publishers, April 2007.
[6] L. Yan et al. Hardware Assistant Scheduling for Synergistic Core Tasks on Embedded Heterogeneous Multi-core System. In Journal of Information & Computational Science (2008).
[7] Shailender Chaudhry et al. Simultaneous speculative threading: a novel pipeline architecture implemented in Sun's ROCK processor. ACM, 2009.
[8] T. Hussain, M. Shafiq, M. Pericas, N. Nacho and E. Ayguade. PPMC: A Programmable Pattern based Memory Controller. In ARC 2012.
[9] Tassadaq Hussain and Amna Haider. PGC: A Pattern-Based Graphics Controller. Int. J. Circuits and Architecture Design, 2014.
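As a quick cross-check of the power figures reported in Section IV-B, the short calculation below derives the relative reduction from the two on-chip power values given there (3.15 W for the MicroBlaze-based system and 2.27 W for the AMMC system). The helper name percent_reduction is introduced only for this example.

```python
def percent_reduction(baseline: float, improved: float) -> float:
    """Relative saving of `improved` with respect to `baseline`, in percent."""
    return (baseline - improved) / baseline * 100.0


microblaze_watts = 3.15  # on-chip power of the MicroBlaze-based system (Section IV-B)
ammc_watts = 2.27        # on-chip power of the AMMC system (Section IV-B)

reduction = percent_reduction(microblaze_watts, ammc_watts)
print(f"{reduction:.1f}% less on-chip power")  # 27.9% less on-chip power
```

The result agrees with the 27.9% saving stated in the abstract and the conclusion.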
