High Performance Embedded Computing Systems

K. V. Palem
Abstract: Customization is central to meeting the unique size, weight, power, and execution constraints of emerging and future embedded systems. However, future High Performance Embedded Computing (HPEC) systems must face the twin challenges of escalating non-recurring engineering costs and time-to-market windows. This paper first presents an overview of the current landscape of customization techniques for embedded systems, followed by a strategy for large gains in design productivity from cross-fertilization between the domains of optimizing compiler technology and traditional design automation. Recent results with a prototype system being constructed at CREST are presented in support of this promising direction of research.

Keywords: embedded systems, customization, polymorphic computer architecture, design space exploration, compiler-driven customization.

Introduction

Customization is central to meeting the unique combination of size, weight, energy, performance and time (SWEPT) requirements of emerging high performance embedded computing (HPEC) systems over the next decade. In the past, customization meant investing the time and money to design an application specific integrated circuit (ASIC). However, the non-recurring engineering (NRE) cost of ASIC development has been rapidly escalating with each new generation of device technology. The cost structures and time to market (TTM) characteristics of ASIC development are becoming sustainable for only the highest volume applications. At the same time, the well documented "design gap" between device densities and designer productivity has been widening with each new generation of Moore's Law. Thus, a major challenge for the HPEC community over the next decade is to produce several orders of magnitude improvement in designer productivity.

Increasing the level of design abstraction and extensive design re-use have been two major sources of productivity improvement in the last decade. The challenge for the upcoming decade is to identify and exploit new sources of productivity improvement in order to sustain Moore's Law. In this paper we first examine, from the perspective of reducing NRE and TTM, the landscape of customization techniques that have emerged to date. These developments are placed in a historical perspective from whence we outline a strategy for achieving productivity gains in the design of future HPEC systems. The paper concludes with representative results from an early implementation of this strategy.

1. This research was funded in part by the DARPA PCA Program under contract #F33615-03-C-4105.

Landscape of Customization Techniques

As illustrated in Figure 1, the landscape of customization techniques for embedded systems spans a spectrum of decreasing hardware customization, starting with ASICs at one end and proceeding to microprocessors at the other. In between lie solutions that employ differing degrees of hardware customization, corresponding to tradeoffs between performance, NRE costs, and TTM.

Embedded microprocessors and their instruction set architectures (ISAs) represent software-customizable hardware and thus enable the highest chip volumes towards amortizing the manufacturing NRE costs of the microprocessor. Towards this end, the development of high level programming languages and their compilers has produced quantum jumps in designer productivity over the last two decades. However, software customization limits the range of SWEPT constraints that can be met. Subsequent efforts to customize microprocessors have included extending the ISAs with custom instructions, accompanied by design tools that automate the generation of the hardware and the accompanying software tool chains, including simulators, debuggers, and compilers [11]. Significant cost reductions accrue from the re-use of the core datapath and control. An alternative is the use of a reconfigurable fabric coupled with the core datapath to support post-deployment customization [6,10]. Variants of this approach have produced commercial instantiations where customization is supported by hardware/software design flows that target the fabric and which are supported by APIs between the fabric and the core datapath [12].

The last two technology generations have enabled the ability to tile the surface of a chip with homogeneous building blocks, or tiles, where each tile may range in granularity from a complete processor core (emerging multi-core designs), through simple RISC processors [1], down to functional units [4]. Compute tiles may be interspersed with memory tiles (typically SRAM), and the fabric as a whole may utilize a homogeneous or heterogeneous set of compute and memory tiles. Tiles are interconnected with a flexible, configurable interconnection network. Customization is achieved through the configuration of individual tiles and the configuration of the intra- and inter-tile networks.
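To make the tiled organization concrete, the following sketch models a fabric of compute and memory tiles with a configurable inter-tile network. The class names, the interspersal pattern, and the configuration API are illustrative assumptions for exposition only, not a description of any particular chip.

```python
# Illustrative sketch (not from the paper): a heterogeneous tiled fabric where
# customization = configuring individual tiles plus the inter-tile network.

class Tile:
    def __init__(self, kind):
        assert kind in ("compute", "memory")  # e.g., RISC core vs. SRAM tile
        self.kind = kind
        self.config = None  # per-tile configuration (program, mode, ...)

class TiledFabric:
    def __init__(self, rows, cols, memory_every=3):
        # Intersperse memory tiles among compute tiles (pattern is hypothetical).
        self.grid = [[Tile("memory" if (r * cols + c) % memory_every == 0 else "compute")
                      for c in range(cols)] for r in range(rows)]
        self.routes = []  # configurable interconnect: (src, dst) tile coordinates

    def configure_tile(self, r, c, config):
        self.grid[r][c].config = config

    def connect(self, src, dst):
        self.routes.append((src, dst))

fabric = TiledFabric(4, 4)
fabric.configure_tile(0, 1, {"op": "fir_filter"})  # customize one compute tile
fabric.connect((0, 0), (0, 1))                     # route memory tile -> compute tile
```

Customizing such a fabric is thus a matter of writing per-tile configurations and route tables, which is exactly the kind of target a compiler-driven flow can emit.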
Paper No. 19.4 - K. V. Palem
[Figure 1: the spectrum of customization techniques, spanning compilation to synthesis]
Research exploring this cross-fertilization between optimizing compiler technology and EDA began with the seminal work in the PICO project [8,7,9,3].

Design space exploration is a compiler-centric framework for exploring such cross-fertilization and for developing a generation of design flows where hardware customization is driven by the correct-by-construction philosophy of compilers. The design space is spanned by parametric representations of the architecture, compiler optimizations, and machine-independent metrics [5]. Navigation of this design space is driven by constraints on size, weight, power, and execution time. Central to such a framework is the parametric representation of the hardware architecture.

Polymorphic computing architectures (PCAs) are an emerging class of architectures that provide features amenable to parametric representation and optimization by compiler-centric design space exploration formulations.

Polymorphic Computing Architectures

Applications, particularly in the DoD environment, have become diverse in the range of computations they support. Different phases of an application are better suited to different models of computation. Over the years these applications were targeted to hardware that provided distinct hardware solutions for each phase, typically representing solutions along the spectrum in Figure 1. As we enter the deep submicron region of Moore's Law, the increased densities have opened the door to an emerging class of architectures defined as polymorphic computing architectures (PCAs): a class of computing architectures whose micro-architecture can morph to efficiently and effectively meet the requirements of a range of diverse HPEC applications, thereby also amortizing development cost. Created under the DARPA PCA program, such architectures have the advantages of smaller footprint (by replacing several ASIC cores), lower risk via post-deployment hardware configuration, and sustained performance over computationally diverse applications. The micro-architecture innovations are accompanied by innovations in programming abstractions and methodologies, optimizing compiler technology, and resource management. The architectures are programmed in high level languages rooted in modern imperative programming languages such as C/C++. Thus the application development flow is a compiler-driven software development flow that customizes the micro-architecture for a specific code segment. Consequently, the design NRE costs track those of software development, with the performance attributes of customized hardware.

Polymorphism extends beyond simply reconfiguring the data path and/or memory structures on chip. A morph can involve a change in the programming model. For example, an application may switch modes between the use of a dataflow execution model and a Single Instruction Stream Multiple Data Stream (SIMD) model of computation [4]. A single hardware substrate is capable of reconfiguring both data and control flow. Such architectures afford the following advantages:

• Smaller footprint: Reconfiguration enables a single fabric to replace multiple ASIC cores that would otherwise be necessary.
• Risk reduction: Easily adaptable to changing standards and post-deployment configuration.
• Reduced time to market: Reconfiguration is a software recompilation process.
• Reduced life cycle costs: With a smaller number of cores to maintain, the life cycle costs of maintaining an inventory are greatly reduced.
• Compilation vs. synthesis: A correct-by-construction approach reduces design time and time to market.

Our approach to customization is broadly founded on a novel formulation of design space exploration wherein, rather than having a fixed hardware target, the target hardware is itself malleable or polymorphic. In such a formulation the compiler assumes the responsibility of generating optimized code as well as configuring, and scheduling the reconfiguration of, the target hardware within some tightly controlled degrees of freedom. Polymorphism provides a powerful conceptual framework within which to concurrently meet the demanding constraints of NRE cost, performance, and TTM. This emerging technology of polymorphic computing will define how important classes of future embedded processors will be designed and deployed in customized embedded systems.

The MONARCH PCA

The MONARCH chip is being designed and implemented under the DARPA-funded PCA program. The chip is comprised of two main components. The first is a set of high performance RISC processors. These processors access local memory, communicate amongst themselves via a high speed on-chip network, and access global off-chip memory via the same high speed network. The second component is a hierarchically structured, configurable dataflow fabric. The fabric is comprised of a set of arithmetic clusters interspersed with memory clusters. Each arithmetic cluster is comprised of a number of high performance functional units such as adders, multipliers, shifters, etc. The interconnect between elements in a cluster, and between clusters, is configured during compilation. All clusters communicate via the programmable interconnect.

Operationally, the fabric implements a streaming, data-driven model of computation. Individual clusters, and the interconnect between them, are configured to implement the dataflow of the computation or, alternatively, SIMD computation. The fabric natively supports hardware flow control and communication between hardware elements. In one morph, streaming kernels are compiled to dataflow graphs that are then mapped onto the fabric. Once configured, input data streams are piped through the fabric to produce one or more output data streams. The goal of the compiler is to optimize throughput (keep the elements busy) and cost (minimize the area of the fabric).
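As a toy illustration of the streaming morph just described, the sketch below compiles a tiny kernel into a dataflow graph, "places" its operators on a hypothetical grid of arithmetic clusters with a greedy heuristic, and pipes a data stream through it. The graph encoding, operator set, and placement strategy are all invented for exposition and do not reflect the actual MONARCH tool chain.

```python
# Toy sketch of the streaming morph: kernel -> dataflow graph -> placement
# on a (hypothetical) grid of arithmetic clusters -> stream data through it.

# Dataflow graph: node -> (operator, input nodes). "in" is the stream input.
KERNEL = {
    "scaled": ("mul", ["in", "in"]),      # x * x
    "out":    ("add", ["scaled", "in"]),  # x*x + x
}

OPS = {"mul": lambda a, b: a * b, "add": lambda a, b: a + b}

def place(graph, n_clusters):
    """Greedy round-robin placement of operators onto clusters -- a crude
    stand-in for the place-and-route step a real compiler backend performs."""
    return {node: i % n_clusters for i, node in enumerate(graph)}

def run(graph, placement, stream):
    """Pipe a data stream through the 'configured' fabric, one value at a time."""
    out = []
    for x in stream:
        vals = {"in": x}
        for node, (op, ins) in graph.items():  # topological order by construction
            vals[node] = OPS[op](*(vals[i] for i in ins))
        out.append(vals["out"])
    return out

placement = place(KERNEL, n_clusters=4)
print(run(KERNEL, placement, [1, 2, 3]))  # x*x + x -> [2, 6, 12]
```

Once the graph is placed, the "configuration" is fixed and only data moves, which is the essence of the streaming, data-driven mode of operation.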
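More broadly, the constraint-driven navigation of a parametric design space described earlier can be caricatured as a small search loop. Every parameter, cost model, and constraint value below is invented for illustration, standing in for the compiler-derived, machine-independent metrics of [5].

```python
# Illustrative design space exploration loop (all numbers/models invented):
# enumerate parametric fabric configurations, estimate SWEPT-style metrics
# with toy cost models, discard constraint violators, keep the fastest point.

from itertools import product

def estimate(n_clusters, units_per_cluster, clock_mhz):
    """Toy analytical models standing in for compiler-derived metrics."""
    area_mm2 = 2.0 * n_clusters * units_per_cluster
    power_w = 0.05 * n_clusters * units_per_cluster * clock_mhz / 100.0
    # More parallel units -> fewer cycles for a fixed (hypothetical) workload.
    exec_us = 1e5 / (n_clusters * units_per_cluster * clock_mhz)
    return area_mm2, power_w, exec_us

def explore(max_area=50.0, max_power=5.0):
    best = None
    for n, u, f in product([2, 4, 8], [2, 4], [100, 200]):
        area, power, t = estimate(n, u, f)
        if area > max_area or power > max_power:
            continue  # violates size/power constraints
        if best is None or t < best[1]:
            best = ((n, u, f), t)
    return best

config, exec_us = explore()
```

A real flow would replace the analytical models with metrics extracted by the compiler from the program representation, but the shape of the search is the same: constraints prune the space, and an objective picks among the survivors.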
The compilation model is parametric across cluster designs and morphs, including the targeting of heterogeneous fabrics.

The programming model is that of the Streaming Virtual Machine (SVM): an API implemented with C as the base language. The SVM was developed through the Morphware Forum as part of the PCA program. The compiler front end consists of functionality common to most modern optimizing compilers. The backend instruction selection / code generation module is different from most existing compiler backends in that the target is not a sequence of assembly language instructions to be executed, but a hardware configuration for the dataflow fabric. The code generator targets a low level hardware configuration that abstracts the space of hardware configurations into a form amenable to design space exploration. Adaptations of traditional code generation techniques enable exploration of hardware configuration alternatives for small sets of program operations. Code generation occurs in two steps. First, each code segment is translated to a position-independent hardware configuration. Second, the optimized implementation is mapped/scheduled on the fabric, allocating compute and memory resources to the optimized program representation. This latter step is a form of the traditional place and route found in physical design automation tools. The optimized and mapped code forms the input to the chip-specific implementation tools.

[Figure: IPC on MONARCH architecture]

The benchmark implements a series of searches through a simulated radar target database. The high IPC rates on early implementations reflect the ability of the compiler to extract concurrency and produce different, effective configurations of the fabric across distinct computation types. Current efforts are focused on a detailed performance analysis and evaluation.

References

1. Michael Bedford Taylor, et al., "The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs," IEEE Micro, Mar/Apr 2002.
2. LSI Logic RapidChip Platform ASIC: http://www.lsilogic.com/products/rapidchip platform asic/.
3. B. Mei, et al., "Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling," Design, Automation and Test in Europe Conference and Exhibition, 2003.
4. J. Granacki and M. Vahey, "MONARCH: A High Performance Processor Architecture with Two Native Computing Modes," Proceedings of the High Performance Embedded Computing Workshop, September 2002.
5. Krishna V. Palem, Lakshmi N. Chakrapani, Sudhakar Yalamanchili, "A Framework for Compiler Driven Design Space Exploration for Embedded System Customization," Proceedings of the Ninth Asian Computing Science Conference, December 2004.