High Performance Embedded Computing Systems

K. V. Palem
Abstract: Customization is central to meeting the unique size, weight, power, and execution constraints of emerging and future embedded systems. However, future High Performance Embedded Computing (HPEC) systems must face the twin challenges of escalating non-recurring engineering costs and time-to-market windows. This paper first presents an overview of the current landscape of customization techniques for embedded systems, followed by a strategy for large gains in design productivity from cross-fertilization between the domains of optimizing compiler technology and traditional design automation. Recent results with a prototype system being constructed at CREST are presented in support of this promising direction of research.

Keywords: embedded systems, customization, polymorphic computer architecture, design space exploration, compiler-driven customization.

Introduction

Customization is central to meeting the unique combination of size, weight, energy, performance and time (SWEPT) requirements of emerging high performance embedded computing (HPEC) systems over the next decade. In the past, customization meant investing the time and money to design an application specific integrated circuit (ASIC). However, the non-recurring engineering (NRE) cost of ASIC development has been rapidly escalating with each new generation of device technology. The cost structures and time to market (TTM) characteristics of ASIC development are becoming sustainable for only the highest volume applications. At the same time, the well documented "design gap" between device densities and designer productivity has been widening with each new generation of Moore's Law. Thus, a major challenge for the HPEC community over the next decade is to produce several orders of magnitude improvement in designer productivity.

Increasing the level of design abstraction and extensive design re-use have been two major sources of productivity improvement in the last decade. The challenge for the upcoming decade is to identify and exploit new sources of productivity improvement in order to sustain Moore's Law. In this paper we first examine, from the perspective of reducing NRE and TTM, the landscape of customization techniques that have emerged to date. These developments are placed in a historical perspective from whence we outline a strategy for achieving productivity gains in the design of future HPEC systems. The paper concludes with representative results from an early implementation of this strategy.

1. This research was funded in part by the DARPA PCA Program under contract #F33615-03-C-4105.

Landscape of Customization Techniques

As illustrated in Figure 1, the landscape of customization techniques for embedded systems spans a spectrum of decreasing hardware customization, starting with ASICs at one end and proceeding to microprocessors at the other. In between lie solutions that employ differing degrees of hardware customization, corresponding to tradeoffs between performance, NRE costs, and TTM.

Embedded microprocessors and their instruction set architectures (ISAs) represent software-customizable hardware and thus enable the highest chip volumes towards amortizing the manufacturing NRE costs of the microprocessor. Towards this end, the development of high level programming languages and their compilers has produced quantum jumps in designer productivity over the last two decades. However, software customization limits the range of SWEPT constraints that can be met. Subsequent efforts to customize microprocessors have included extending the ISAs with custom instructions, accompanied by design tools that automate the generation of the hardware and the accompanying software tool chains, including simulators, debuggers, and compilers [11]. Significant cost reductions accrue from the re-use of the core datapath and control. An alternative is the use of a reconfigurable fabric coupled with the core datapath to support post-deployment customization [6,10]. Variants of this approach have produced commercial instantiations where customization is supported by hardware/software design flows that target the fabric and which are supported by APIs between the fabric and the core datapath [12].

The last two technology generations have enabled the ability to tile the surface of a chip with homogeneous building blocks, or tiles, where each tile may range in granularity from a complete processor core (emerging multi-core designs), through simple RISC processors [1], down to functional units [4]. Compute tiles may be interspersed with memory tiles (typically SRAM), and the fabric as a whole may utilize a homogeneous or heterogeneous set of compute and memory tiles. Tiles are interconnected with a flexible, configurable interconnection network. Customization is achieved through the configuration of individual tiles and the configuration of the intra- and inter-tile networks.
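To make the tiled organization concrete, the following sketch models a fabric of compute and memory tiles with a configurable inter-tile network. The class names, the interspersal pattern, and the configuration API are illustrative assumptions for exposition only, not a description of any particular chip.

```python
# Illustrative sketch (not from the paper): a heterogeneous tiled fabric where
# customization = configuring individual tiles plus the inter-tile network.

class Tile:
    def __init__(self, kind):
        assert kind in ("compute", "memory")  # e.g., RISC core vs. SRAM tile
        self.kind = kind
        self.config = None  # per-tile configuration (program, mode, ...)

class TiledFabric:
    def __init__(self, rows, cols, memory_every=3):
        # Intersperse memory tiles among compute tiles (pattern is hypothetical).
        self.grid = [[Tile("memory" if (r * cols + c) % memory_every == 0 else "compute")
                      for c in range(cols)] for r in range(rows)]
        self.routes = []  # configurable interconnect: (src, dst) tile coordinates

    def configure_tile(self, r, c, config):
        self.grid[r][c].config = config

    def connect(self, src, dst):
        self.routes.append((src, dst))

fabric = TiledFabric(4, 4)
fabric.configure_tile(0, 1, {"op": "fir_filter"})  # customize one compute tile
fabric.connect((0, 0), (0, 1))                     # route memory tile -> compute tile
```

Customizing such a fabric is thus a matter of writing per-tile configurations and route tables, which is exactly the kind of target a compiler-driven flow can emit.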
Paper No. 19.4 - K. V. Palem
[Figure 1: the spectrum of customization techniques, spanning compilation to synthesis]
Research exploring this cross-fertilization between optimizing compiler technology and EDA began with the seminal work in the PICO project [8,7,9,3].

Design space exploration is a compiler-centric framework for exploring such cross-fertilization and for developing a generation of design flows where hardware customization is driven by the correct-by-construction philosophy of compilers. The design space is spanned by parametric representations of the architecture, compiler optimizations, and machine-independent metrics [5]. Navigation of this design space is driven by constraints on size, weight, power, and execution time. Central to such a framework is the parametric representation of the hardware architecture.

Polymorphic computing architectures (PCAs) are an emerging class of architectures that provide features amenable to parametric representation and optimization by compiler-centric design space exploration formulations.

Polymorphic Computing Architectures

Applications, particularly in the DoD environment, have become diverse in the range of computations they support. Different phases of an application are better suited to different models of computation. Over the years these applications were targeted to hardware that provided distinct hardware solutions for each phase, typically representing solutions along the spectrum in Figure 1. As we enter the deep submicron region of Moore's Law, the increased densities have opened the door to an emerging class of architectures defined as polymorphic computing architectures (PCAs): a class of computing architectures whose micro-architecture can morph to efficiently and effectively meet the requirements of a range of diverse HPEC applications, thereby also amortizing development cost. Created under the DARPA PCA program, such architectures have the advantages of smaller footprint (by replacing several ASIC cores), lower risk via post-deployment hardware configuration, and sustained performance over computationally diverse applications. The micro-architecture innovations are accompanied by innovations in programming abstractions and methodologies, optimizing compiler technology, and resource management. The architectures are programmed in high level languages rooted in modern imperative programming languages such as C/C++. Thus the application development flow is a compiler-driven software development flow that customizes the micro-architecture for a specific code segment. Consequently, the design NRE costs track those of software development, with the performance attributes of customized hardware.

Polymorphism extends beyond simply reconfiguring the data path and/or memory structures on chip. A morph can involve a change in the programming model. For example, an application may switch modes between the use of a dataflow execution model and a Single Instruction Stream Multiple Data Stream (SIMD) model of computation [4]. A single hardware substrate is capable of reconfiguring both data and control flow. Such architectures afford the following advantages:

• Smaller footprint: Reconfiguration enables a single fabric to replace multiple ASIC cores that would otherwise be necessary.
• Risk reduction: Easily adaptable to changing standards and post-deployment configuration.
• Reduced time to market: Reconfiguration is a software recompilation process.
• Reduced life cycle costs: With a smaller number of cores to maintain, the life cycle costs of maintaining an inventory are greatly reduced.
• Compilation vs. synthesis: A correct-by-construction approach reduces design time and time to market.

Our approach to customization is broadly founded on a novel formulation of design space exploration wherein, rather than having a fixed hardware target, the target hardware is itself malleable or polymorphic. In such a formulation the compiler assumes the responsibility of generating optimized code as well as configuring, and scheduling the reconfiguration of, the target hardware within some tightly controlled degrees of freedom. Polymorphism provides a powerful conceptual framework within which to concurrently meet the demanding constraints of NRE cost, performance, and TTM. This emerging technology of polymorphic computing will define how important classes of future embedded processors will be designed and deployed in customized embedded systems.

The MONARCH PCA

The MONARCH chip is being designed and implemented under the DARPA-funded PCA program. The chip is comprised of two main components. The first is a set of high performance RISC processors. These processors access local memory, communicate amongst themselves via a high speed on-chip network, and access global off-chip memory via the same high speed network. The second component is a hierarchically structured, configurable dataflow fabric. The fabric is comprised of a set of arithmetic clusters interspersed with memory clusters. Each arithmetic cluster is comprised of a number of high performance functional units such as adders, multipliers, shifters, etc. The interconnect between elements in a cluster, and between clusters, is configured during compilation. All clusters communicate via the programmable interconnect.

Operationally, the fabric implements a streaming, data-driven model of computation. Individual clusters, and the interconnect between them, are configured to implement the dataflow of the computation or, alternatively, SIMD computation. The fabric natively supports hardware flow control and communication between hardware elements. In one morph, streaming kernels are compiled to dataflow graphs that are then mapped onto the fabric. Once configured, input data streams are piped through the fabric to produce one or more output data streams. The goal of the compiler is to optimize throughput (keep the elements busy) and cost (minimize the area of the fabric).
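As a toy illustration of the streaming morph just described, the sketch below compiles a tiny kernel into a dataflow graph, "places" its operators on a hypothetical grid of arithmetic clusters with a greedy heuristic, and pipes a data stream through it. The graph encoding, operator set, and placement strategy are all invented for exposition and do not reflect the actual MONARCH tool chain.

```python
# Toy sketch of the streaming morph: kernel -> dataflow graph -> placement
# on a (hypothetical) grid of arithmetic clusters -> stream data through it.

# Dataflow graph: node -> (operator, input nodes). "in" is the stream input.
KERNEL = {
    "scaled": ("mul", ["in", "in"]),      # x * x
    "out":    ("add", ["scaled", "in"]),  # x*x + x
}

OPS = {"mul": lambda a, b: a * b, "add": lambda a, b: a + b}

def place(graph, n_clusters):
    """Greedy round-robin placement of operators onto clusters -- a crude
    stand-in for the place-and-route step a real compiler backend performs."""
    return {node: i % n_clusters for i, node in enumerate(graph)}

def run(graph, placement, stream):
    """Pipe a data stream through the 'configured' fabric, one value at a time."""
    out = []
    for x in stream:
        vals = {"in": x}
        for node, (op, ins) in graph.items():  # topological order by construction
            vals[node] = OPS[op](*(vals[i] for i in ins))
        out.append(vals["out"])
    return out

placement = place(KERNEL, n_clusters=4)
print(run(KERNEL, placement, [1, 2, 3]))  # x*x + x -> [2, 6, 12]
```

Once the graph is placed, the "configuration" is fixed and only data moves, which is the essence of the streaming, data-driven mode of operation.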
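More broadly, the constraint-driven navigation of a parametric design space described earlier can be caricatured as a small search loop. Every parameter, cost model, and constraint value below is invented for illustration, standing in for the compiler-derived, machine-independent metrics of [5].

```python
# Illustrative design space exploration loop (all numbers/models invented):
# enumerate parametric fabric configurations, estimate SWEPT-style metrics
# with toy cost models, discard constraint violators, keep the fastest point.

from itertools import product

def estimate(n_clusters, units_per_cluster, clock_mhz):
    """Toy analytical models standing in for compiler-derived metrics."""
    area_mm2 = 2.0 * n_clusters * units_per_cluster
    power_w = 0.05 * n_clusters * units_per_cluster * clock_mhz / 100.0
    # More parallel units -> fewer cycles for a fixed (hypothetical) workload.
    exec_us = 1e5 / (n_clusters * units_per_cluster * clock_mhz)
    return area_mm2, power_w, exec_us

def explore(max_area=50.0, max_power=5.0):
    best = None
    for n, u, f in product([2, 4, 8], [2, 4], [100, 200]):
        area, power, t = estimate(n, u, f)
        if area > max_area or power > max_power:
            continue  # violates size/power constraints
        if best is None or t < best[1]:
            best = ((n, u, f), t)
    return best

config, exec_us = explore()
```

A real flow would replace the analytical models with metrics extracted by the compiler from the program representation, but the shape of the search is the same: constraints prune the space, and an objective picks among the survivors.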
The compilation model is parametric across cluster designs and morphs, including the targeting of heterogeneous fabrics.

The programming model is that of the Streaming Virtual Machine (SVM): an API implemented with C as the base language. The SVM was developed through the Morphware Forum as part of the PCA program. The compiler front end consists of functionality common to most modern optimizing compilers. The backend instruction selection / code generation module is different from most existing compiler backends in that the target is not a sequence of assembly language instructions to be executed, but a hardware configuration for the dataflow fabric. The code generator targets a low level hardware configuration that abstracts the space of hardware configurations into a form amenable to design space exploration. Adaptations of traditional code generation techniques enable exploration of hardware configuration alternatives for small sets of program operations. Code generation occurs in two steps. First, each code segment is translated to a position-independent hardware configuration. Second, the optimized implementation is mapped/scheduled on the fabric, allocating compute and memory resources to the optimized program representation. This latter step is a form of the traditional place and route found in physical design automation tools. The optimized and mapped code forms the input to the chip-specific implementation tools.

[Figure: IPC on MONARCH architecture]

The benchmark implements a series of searches through a simulated radar target database. The high IPC rates on early implementations reflect the ability of the compiler to extract concurrency and produce different, effective configurations of the fabric across distinct computation types. Current efforts are focused on a detailed performance analysis and evaluation.

References

1. Michael Bedford Taylor, et al., "The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs," IEEE Micro, Mar/Apr 2002.
2. LSI Logic RapidChip Platform ASIC: http://www.lsilogic.com/products/rapidchip platform asic/.
3. B. Mei, et al., "Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling," Design, Automation and Test in Europe Conference and Exhibition, 2003.
4. J. Granacki and M. Vahey, "MONARCH: A High Performance Processor Architecture with Two Native Computing Modes," Proceedings of the High Performance Embedded Computing Workshop, September 2002.
5. Krishna V. Palem, Lakshmi N. Chakrapani, Sudhakar Yalamanchili, "A Framework for Compiler Driven Design Space Exploration for Embedded System Customization," Proceedings of the Ninth Asian Computing Science Conference, December 2004.