
XCELL SOFTWARE JOURNAL
SOFTWARE SOLUTIONS FOR A PROGRAMMABLE WORLD

ISSUE 1
THIRD QUARTER 2015

The Next Logical Step in C/C++, OpenCL Programming

Exploring the SDSoC Environment: Build a Sample Design
System-level HW-SW Optimization on the Zynq SoC
SDAccel Software Application Design Flows for FPGAs
MathWorks: Make Design Trade-offs at the Desktop, Not the Lab

www.xilinx.com/xcell

Letter from the Publisher
Welcome to Xcell Software Journal

Earlier this year, Xilinx® released its SDx™ line of development environments, which enable non-FPGA experts to program Xilinx device logic using C/C++ and OpenCL™. The goal of the SDx environments is to let software developers and embedded system architects program our devices as easily as they program GPUs or CPUs. Xilinx’s FPGA hardware technology has long been able to accelerate algorithms, but it was only recently that the underlying software technology and hardware platforms reached a level where it was feasible to create these development environments for the broader software- and system-architect communities of C/C++ and OpenCL users.

The hardware platforms have evolved rapidly during this millennium. In the early 2000s, the semiconductor industry changed the game on software developers. To avoid a future in which chips reached the energy density of the sun, MPU vendors switched from monolithic MPUs to homogeneous multicore, distributed processing architectures. This switch enabled the semiconductor industry to continue to introduce successive generations of devices in cadence with Moore’s Law and even to innovate heterogeneous multicore processing systems, which we know today as systems-on-chip (SoCs). But the move to multicore has placed a heavy burden on software developers to design software that runs efficiently on these new distributed processing architectures. Xilinx has stepped in to help software developers by introducing its SDx line of development environments. The environments let developers dramatically speed their C/C++ and OpenCL code running on systems powered by next-generation processing architectures, which today are increasingly accelerated by FPGAs.

Indeed, FPGA-accelerated processing architectures, pairing MPUs with FPGAs, are fast replacing power-hungry CPU/GPU architectures in data center and other compute-intensive markets. Likewise, in the embedded systems space, new heterogeneous multicore processors such as Xilinx’s Zynq®-7000 All Programmable SoC and upcoming Xilinx UltraScale+™ MPSoC integrate multiple processors with FPGA logic on the same chip, enabling companies to create next-generation systems with unmatched performance and differentiation. FPGAs have traditionally been squarely in the domain of hardware engineers, but no longer.

Now that Xilinx has released its SDx line of development environments to use on its hardware platforms, the software world has the ability to unlock the acceleration power of the FPGA using C/C++ or OpenCL within environments that should be familiar to embedded-software and -system developers. This convergence of strong underlying compilation technology for our high-level synthesis (HLS) tool flow with programming languages and tools designed for heterogeneous architectures brings the final pieces together for software and system designers to create custom hardware accelerators in their own heterogeneous SoCs.

Xcell Software Journal is dedicated to helping you leverage the SDx environments and those from Xilinx Alliance members such as National Instruments and MathWorks®. The quarterly journal will focus on software trends, case studies, how-to tutorials, updates and outlooks for this rapidly growing user base. I’m confident that as you read the articles you will be inspired to explore Xilinx’s resources further, testing out the SDx development environments accessible through the Xilinx Software Developer Zone. I encourage you to read the Xcell Daily Blog, especially Adam Taylor’s chronicles of using the SDSoC development environment. And I invite you to contribute articles to the new journal to share your experiences with your colleagues in the vanguard of programming FPGA-accelerated systems.

Mike Santarini
Publisher

There is a new bass player for the blues jam in the sky . . .
This issue is dedicated to analyst and ESL visionary Gary Smith, 1941 – 2015.

PUBLISHER Mike Santarini, 1-408-626-5981
EDITOR Diana Scheben
ART DIRECTOR Scott Blair
DESIGN/PRODUCTION Teie, Gelwicks & Assoc., 1-408-842-2627
ADVERTISING SALES Judy Gelwicks, 1-408-842-2627
INTERNATIONAL Melissa Zhang, Asia Pacific
Christelle Moraga, Europe/Middle East/Africa
Tomoko Suto, Japan
REPRINT ORDERS 1-408-842-2627
EDITORIAL ADVISERS Tomas Evensen, Lawrence Getman, Mark Jensen

Xilinx, Inc.
2100 Logic Drive
San Jose, CA 95124-3400
Phone: 408-559-7778
FAX: 408-879-4780
www.xilinx.com/xcell/

© 2015 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

The articles, information, and other materials included in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xilinx for any loss, damage, or expense caused thereby.
CONTENTS
THIRD QUARTER 2015, ISSUE 1

VIEWPOINT
Letter from the Publisher: Welcome to Xcell Software Journal

COVER STORY
The Next Logical Step in C/C++, OpenCL Programming

XCELLENCE WITH SDSOC FOR EMBEDDED DEVELOPMENT
SDSoC, Step by Step: Build a Sample Design
Using the SDSoC IDE for System-level HW-SW Optimization on the Zynq SoC

XCELLENCE WITH SDACCEL FOR APPLICATION ACCELERATION
Compile, Debug, Optimize
Developing OpenCL Imaging Applications Using C++ Libraries

XCELLENT ALLIANCE FEATURES
MATLAB and Simulink Aid HW-SW Co-design of Zynq SoCs

XTRA READING
IDE Updates and Extra Resources for Developers

XCELL SOFTWARE JOURNAL: COVER STORY

The Next Logical Step in C/C++, OpenCL Programming

New environments allow you to maximize code performance.

by Mike Santarini
Publisher, Xcell Publications
Xilinx, Inc.

Lawrence Getman
Vice President, Corporate Strategy and Marketing
Xilinx, Inc.
Ever since Xilinx® invented and brought to market the world’s first FPGAs in the mid-1980s, these extraordinarily versatile programmable logic devices have been the MacGyver multipurpose tool of hardware engineers. With its recent releases of the SDx™ line of development environments (SDAccel™, SDSoC™ and SDNet™), Xilinx is empowering a greater number of creative minds to bring remarkable innovations to the world by enabling software developers and systems engineers (non-FPGA designers) to create their own custom software-defined hardware easily with Xilinx devices.

Before we take a look at these new environments and other software development resources from Xilinx and its Alliance members, let’s consider the evolution of processing architectures and their impact on software development.

IT’S A SOFTWARE PROBLEM …

Prior to 2000, the typical microprocessor largely comprised one giant monolithic processor core with onboard memory and a few other odds and ends, making MPUs relatively straightforward platforms on which to develop next-generation apps. For three decades leading up to that point, every 22 months, in step with Moore’s Law, microprocessor vendors would introduce devices with greater capacity and higher performance. To increase the performance, they would simply crank up the clock rate. The fastest monolithic MPU of the time, Intel’s Pentium 4, topped out at just under 4 GHz. For developers, this evolution was great; with every generation, their programs could become more intricate and perform more elaborate functions, and their programs would run faster.

Figure 1: The Zynq UltraScale+ MPSoC

But in the early 2000s, the semiconductor industry changed the game, forcing developers to adjust to a new set of rules. The shift started with the realization that if the MPU industry continued to crank up the clock on each new monolithic MPU architecture, given the silicon process technology road map and worsening transistor leakage, MPUs would soon have the same power density as the sun.

It was for this reason that the MPU industry quickly transitioned to a homogeneous multiprocessing architecture, in which computing was distributed to multiple smaller cores running at lower clock rates. The new processing model let MPU and semiconductor vendors continue to produce new generations of higher-capacity devices and reap more performance mainly from integrating more functions together in a single piece of silicon. Existing programs could not take advantage of the new distributed architectures, however, leaving software developers to figure out ways to develop programs that would run efficiently across multiple processor cores.
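To make that burden concrete, here is a minimal C++ sketch (not taken from the article) of the kind of restructuring a developer has to do by hand when moving to multiple cores: a sum that a single fast core would compute in one loop is split into chunks, one per thread. The function name `parallel_sum` and the chunking scheme are illustrative assumptions only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum a vector by splitting it into contiguous chunks, one per worker thread.
// Each worker writes only its own slot in `partial`, so no locking is needed.
long long parallel_sum(const std::vector<int>& data, unsigned workers) {
    workers = std::max(1u, workers);  // guard against a request for 0 workers
    std::vector<long long> partial(workers, 0);
    std::vector<std::thread> pool;
    const std::size_t chunk = (data.size() + workers - 1) / workers;

    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(data.size(), w * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        pool.emplace_back([&data, &partial, w, begin, end] {
            partial[w] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (auto& t : pool) t.join();  // wait for every worker to finish

    // Combine the per-thread results on the calling thread.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Calling `parallel_sum(values, std::thread::hardware_concurrency())` would request one worker per reported core. On many toolchains the code has to be built with thread support enabled (for example `-pthread` with GCC or Clang).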


Meanwhile, as these subsequent generations of silicon process technologies continued to double transistor counts, they enabled semiconductor companies to take another innovative step and integrate different types of cores on the same piece of silicon to create SoCs. These heterogeneous multiprocessor architectures posed additional challenges for embedded software developers, who now had to develop custom software stacks to get applications to run optimally on their targeted systems.

Today, the semiconductor industry is changing the game yet again, but this time software developers are welcoming the transition. Faced with another power dilemma, semiconductor and systems companies are turning to FPGA-accelerated heterogeneous processing architectures, which closely pair MPUs with FPGAs to increase system performance at a minimal power cost. This emerging architecture has been most notably leveraged in new data center processing architectures. In a now-famous paper, Microsoft researchers showed that the architectural pairing of an MPU and FPGA produced a 90 percent performance improvement with only a 10 percent power increase, producing far superior performance per watt than architectures that paired MPUs with power-hungry GPUs.

The advantages of FPGA-accelerated heterogeneous multiprocessing extend beyond data center applications. Numerous embedded systems using Xilinx’s Zynq®-7000 All Programmable SoC have greatly benefited from the devices’ on-chip marriage of ARM processors and programmable logic. Systems created with the upcoming Zynq UltraScale+™ MPSoC are bound to be even more impressive. Zynq UltraScale+ MPSoC integrates into one device multiple ARM® cores (quad Cortex™-A53 applications processors, dual Cortex-R5 real-time processors and a Mali™-400MP GPU), programmable logic, and multiple levels of security, increased safety and advanced power management (Figure 1).

But to make these FPGA-accelerated heterogeneous architectures practical for mass deployment and accessible to software developers, FPGA vendors have had to develop novel environments. In Xilinx’s case, the company offers three development platforms: SDAccel for data center developers, SDSoC for embedded systems developers, and SDNet for network line card architects and developers. The new Xilinx environments give developers the tools to accelerate their programs by easily programming slow portions of their code onto programmable logic to create optimized systems.

SDACCEL FOR OPENCL, C/C++ PROGRAMMING OF FPGA-ACCELERATED PROCESSING
The new Xilinx SDAccel development environment gives data center application developers a complete FPGA-based hardware and software solution (Figure 2). The SDAccel environment includes a fast, architecturally optimizing compiler that makes efficient use of on-chip FPGA resources. The environment provides developers with a familiar CPU/GPU-like work environment and software-development flow, featuring an Eclipse-based integrated design environment (IDE) for code development, profiling and debugging. With the environment, developers can create dynamically reconfigurable accelerators optimized for different data center applications that can be swapped in and out on the fly. Developers can use the environment to create applications that swap many kernels in and out of the FPGA during run time without disrupting the interface between the server CPU and the FPGA, for nonstop application acceleration. The SDAccel environment targets host systems based on x86 server processors and provides commercial off-the-shelf (COTS), plug-in PCIe cards that add FPGA functionality.

Figure 2: The SDAccel development environment for OpenCL, C and C++ enables up to 25x better performance/watt for data-center-application acceleration leveraging FPGAs. (Diagram labels: SDAccel, CPU/GPU Development Experience on FPGAs; OpenCL, C, C++ Application Code; Environment: Compiler, Debugger, Profiler, Libraries; x86-Based Server; PCIe FPGA-Based Accelerator Boards.)

With the SDAccel environment, developers with no prior FPGA experience can leverage SDAccel’s familiar workflow to optimize their applications and take advantage of FPGA platforms. The IDE provides
coding templates and software libraries, and it enables compiling, debugging and profiling against the full range of development targets, including emulation on the x86, performance validation using fast simulation, and native execution on FPGA processors. The environment executes the application on data-center-ready FPGA platforms complete with automatic instrumentation insertion for all supported development targets. Xilinx designed the SDAccel environment to enable CPU and GPU developers to migrate their applications to FPGAs easily while maintaining and reusing their OpenCL™, C and C++ code in a familiar workflow.

SDAccel libraries contribute substantially to the SDAccel environment’s CPU/GPU-like development experience. They include low-level math libraries and higher-productivity ones such as BLAS, OpenCV and DSP libraries. The libraries are written in C++ (as opposed to RTL) so developers can use them exactly as written during all development and debugging phases. Early in a project, all development will be done on the CPU host. Because the SDAccel libraries are written in C++, they can simply be compiled along with the application code for a CPU target, creating a virtual prototype that permits all testing, debugging and initial profiling to occur on the host. During this phase, no FPGA is needed.

SDSOC FOR EMBEDDED DEVELOPMENT OF ZYNQ SOC- AND MPSOC-BASED SYSTEMS
Xilinx designed the SDSoC development environment for embedded systems developers programming the Xilinx Zynq SoCs and soon-to-arrive Zynq UltraScale+ MPSoCs. The SDSoC environment provides a greatly simplified embedded C/C++ application
programming experience, including an easy-to-use Eclipse IDE, for applications running on bare metal or on operating systems such as Linux and FreeRTOS. It is a comprehensive development platform for heterogeneous Zynq SoC and Zynq MPSoC platform deployment (Figure 3). Complete with the industry’s first C/C++ full-system optimizing compiler, the SDSoC environment delivers system-level profiling, automated software acceleration in programmable logic, automated system connectivity generation and libraries to speed programming. It also provides a flow for customer and third-party platform developers to enable platforms to be used in the SDSoC development environment.

Figure 3: The SDSoC development environment provides a familiar embedded C/C++ application development experience, including an easy-to-use Eclipse IDE and a comprehensive design environment for heterogeneous Zynq All Programmable SoC and MPSoC deployment. (Diagram labels: C/C++ Development; Rapid system-level performance estimation; System-level Profiling; Specify C/C++ Functions for Acceleration; Full System Optimizing Compiler; SoC, MPSoC.)
• Embedded C/C++ application development experience
• System-level profiling
• Full system optimizing compiler
• Expert use model for platform developers & system architects

SDSoC provides board support packages (BSPs) for Zynq All Programmable SoC-based development boards including the ZC702 and ZC706, as well as third-party and market-specific platforms including the ZedBoard, MicroZed, ZYBO, and video and imaging development kits. The BSPs include metadata abstracting the platform from software developers and system architects to ease the creation, integration and verification of smarter heterogeneous systems.

SDNET FOR DESIGN AND PROGRAMMING OF FPGA-ACCELERATED LINE CARDS
SDNet is a software-defined specification environment using an intuitive, C-like high-level language to design the requirements and create a specification for a network line card (Figure 4). The environment enables network architects and developers to create “Softly” Defined Networks, expanding programmability and intelligence from the control to the data plane.

In contrast to traditional software-defined network architectures, which employ fixed data plane hardware with a narrow southbound API connection to the control plane, Softly Defined Networks are based on a programmable data plane with content intelligence and a rich southbound API control plane connection. This enables multiple disruptive capabilities, including support of wire-speed services that are independent of protocol complexity, provisioning
of per-flow and flexible services, and support for revolutionary in-service “hitless” upgrades while operating at 100 percent line rates.

These unique capabilities enable carriers and multiservice system operators (MSOs) to provision differentiated services dynamically without any interruption to the existing service or the need for hardware requalification or truck rolls. The environment’s dynamic service provisioning enables service providers to increase revenue and speed time to market while lowering capex and opex. Network equipment providers realize similar benefits from the Softly Defined Network platform, which allows for extensive differentiation through the deployment of content-aware data plane hardware that is programmed with the SDNet environment.

Figure 4: The SDNet environment enables network architects to create a specification in a C-like language. After a hardware team completes the design, developers can use SDNet to update or add protocols to the card in the field. (Diagram labels: SDNet, Software Defined Specification Environment for Networking; SDNet Specifications; System Architect; SDNet Compiler; LogiCORE, SmartCORE, Custom Core, SW Function; Implementation Engineer; HW/SW Implementation; SDK/API; Executable Image; FPGA or SoC; “Softly” Defined Line Card.)


EMBEDDED DEVELOPMENT ENVIRONMENTS
To further help embedded software engineers with programming, Xilinx offers a comprehensive set of embedded tools and run-time environments designed to enable embedded software developers to move efficiently from concept to production. Xilinx offers developers an Eclipse-based IDE called the Xilinx Software Development Kit (SDK), which includes editors, compilers, debuggers, drivers and libraries targeting Zynq SoCs or FPGAs with Xilinx’s 32-bit MicroBlaze™ soft core embedded in them. The environment provides out-of-the-box support for advanced features such as security and virtualization software drivers built on Xilinx’s unique Zynq SoCs and MPSoCs. This allows developers to innovate truly differentiated connected systems that are both smarter and secure.

Xilinx offers a comprehensive suite of open-source resources to develop, boot, run, debug and maintain Linux-based applications running on a Xilinx SoC or emulation platform. Xilinx provides example applications, kernel construction, Yocto recipes, multiprocessing and real-time solutions, drivers and forums, as well as many community links. Linux open-source developers will find a very comfortable environment in which to learn, develop and interact with others of like interests and needs.

A POWERFUL AND GROWING ALLIANCE OF PROGRAMMING ENVIRONMENTS
In addition to offering developers the new SDx development environments and SDK, Xilinx has built strong alliances over the past decade with companies that already have well-established development environments serving developers in specific market segments.

National Instruments (Austin, Texas) offers hardware development platforms fanatically embraced by control and test system innovators. Xilinx’s FPGAs and Zynq SoCs power the NI RIO platforms. National Instruments’ LabVIEW development environment is a user-friendly graphics-based program that runs Xilinx’s Vivado® Design Suite under the hood so that National Instruments’ customers need not know any of the details of FPGA design; indeed, some perhaps don’t even know a Xilinx device is at the heart of the RIO platforms. They can simply program their systems in the LabVIEW environment and let NI’s hardware speed the performance of designs they are developing.

MathWorks® (Natick, Mass.), for its part, added FPGA support more than a decade ago to its MATLAB®, Simulink®, HDL Coder™ and Embedded Coder® with Xilinx’s ISE® and Vivado tools running under the hood and completely automated. As a result, the users, who are mainly mathematician algorithm developers, could develop algorithms and speed algorithm performance dramatically by running the algorithms on an FPGA fabric.

Xilinx added an FPGA-architecture-level tool called System Generator to its ISE development environment more than a decade ago and, more recently, added the tool to the Vivado Design Suite to enable teams with FPGA knowledge to tweak designs for further algorithm performance gains. This combination of MathWorks and Xilinx technologies has helped customer companies produce thousands of innovative products.

A number of members in Xilinx’s Alliance ecosystem offer development tools in support of the SDx and Alliance environments; they include ARM, Lauterbach, Yokogawa Digital Computer Corp. and Kyoto Microcomputer Corp. As for OS and middleware support, Xilinx and its ecosystem of Alliance members provide customers with multiple software options, including Linux, RTOS, bare-metal, and even hypervisor and TrustZone-enabled solutions for safety and security.

For more information on the SDx environments and Xilinx’s extensive and growing developer solutions, visit Xilinx’s new Software Developer Zone.

XCELL SOFTWARE JOURNAL: XCELLENCE WITH SDSOC

SDSoC, Step by Step: Build a Sample Design

A ZedBoard example proves quick to build and optimize using the seamless environment.

by Adam Taylor
Chief Engineer
e2v

Until the release of the Xilinx® SDSoC™ development environment, the standard SoC design methodology involved a mix of disparate engineering skills. Typically, once the system architect had generated a system architecture and subsystem segmentation from the requirement, the solution would be split between functions implemented in hardware (the logic side) and functions implemented in software (the processor side). FPGA and software engineers would separately develop their respective functions and then combine and test them in accordance with the integration test plan. The approach worked for years, but the advent of more-capable SoCs, such as the Xilinx Zynq®-7000 All Programmable SoC and the upcoming Xilinx Zynq UltraScale+™ MPSoC, mandated a new design methodology.

The SDSoC methodology enables a wider user base of engineers to develop extremely high-performing systems. Engineers new to developing in the SDSoC development environment will discover that it’s easy to get a system up and running quickly and just as easy to optimize it.

A simple, representative example will illustrate how to accomplish those tasks and reap the resultant benefits. We will target a ZedBoard running Linux and using one of the built-in examples: the Matrix Multiplier and Addition Template.

A BRIEF HISTORY OF DESIGN METHODOLOGIES
The programmable logic device segment has been fast-moving since the devices’ introduction in the 1980s. At first engineers programmed the devices via schematic entry (although the earlier PLDs, such as the 22v10, were programmed via logic equations). This required that electronics engineers perform most PLD development, as logic design and optimization are typically the EE degree’s domain. As device size and capability increased, however, schematic entry naturally began to hit boundaries, as both design time and verification time rose in tandem with design complexity. Engineers needed the capability to work at a higher level of abstraction.

Enter VHDL and Verilog. Both started as languages to describe and simulate logic designs, particularly ASICs. VHDL even had its own military standard. It is a logical step that if we are describing logic behavior within a hardware description language (HDL), it would be great to synthesize the logic circuits required. The development of synthesis tools let engineers describe logic behavior typically at a register transfer level. HDLs also provided a significant boost in verification approach, allowing the development of behavioral test benches that enabled structured verification. For the first time, HDLs also enabled modularity and vendor independence.

Again, the inherent concurrency of HDLs, the register transfer level design approach and the implementation flow, which required knowledge of optimization and timing closures, ensured that the PLD development task would largely fall to EEs.

HDLs have long been the de facto standard for PLD development but have evolved over the years to take industry needs into account. VHDL alone underwent revisions in 1987 (the first year of IEEE adoption), 1993, 2000, 2002, 2007 and 2008. As happened with schematic design entry, however, HDLs are hitting up against the buffers of increases in development time, verification time and device capability.

As the PLD’s role has expanded from glue logic to acceleration peripheral and ultimately to the heart of the system, the industry has needed a new design methodology to capitalize on that evolution. In recent years, high-level synthesis (HLS) has become increasingly popular; here, the design is entered in C/C++ (using Xilinx’s Vivado® HLS) or tools such as MathWorks®’ MATLAB® or National Instruments’ LabVIEW. Such approaches begin to move the design and implementation out from the EE domain into the software realm, markedly widening the user base of potential PLD designers and cementing the PLD’s place at the heart of the system as new design methodologies unlock the devices’ capabilities.

It is therefore only natural that SoC-based designs would use HLS to generate tightly integrated development environments in which engineers could seamlessly accelerate functions in the logic side of the design. Enter the SDSoC environment.

FAMILIAR ENVIRONMENT
The SDSoC development environment is based on Eclipse, which should be familiar to most software developers (Figure 1). The environment seamlessly enables acceleration of functions within the PL side of the device. It achieves this by using the new SDSoC compiler, which can handle C or C++ programs.

Figure 1: SDSoC Welcome page

The development cycle at the highest abstraction level used in the SDSoC environment is as follows:
1. We develop our application in C or C++.
2. We profile the application to determine the performance bottlenecks.
3. Using the profiling information, we identify functions to accelerate within the PL side of the device.
4. We can then build the system and generate the SD card image.
5. Once the hardware is on the board, we can analyze the performance further and optimize the acceleration functions as required.

We can develop applications in the SDSoC environment that function variously on bare metal, FreeRTOS or Linux operating systems. The environment comes with built-in support for most of the Zynq SoC development boards, including the ZedBoard, the MicroZed and the Digilent ZYBO Zynq SoC development board. Not only can we develop our applications faster as a result, but we can use this capability to define our own underlying hardware platform for use when our custom hardware platform is ready for integration.

When we compile a program within the SDSoC environment, the output of the build process provides the suite of files required to configure the Zynq SoC from an SD card. This suite includes first- and second-stage boot loaders, along with the application and images as required for the operating system.
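The profiling step of the development cycle (step 2) can be approximated by hand before any tooling is involved. The sketch below is an illustration only: it uses plain wall-clock timing from the C++ standard library rather than the SDSoC profiler, and the names `Sample`, `time_call` and `slowest` are hypothetical helpers invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <vector>

// One profiling sample: a label and the elapsed wall-clock time in microseconds.
struct Sample {
    std::string name;
    long long micros;
};

// Run `fn` once and record how long it took.
Sample time_call(const std::string& name, const std::function<void()>& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    const auto us =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    return {name, static_cast<long long>(us.count())};
}

// Return the name of the slowest sample, i.e. the first candidate to
// consider for acceleration. Returns an empty string if there are no samples.
std::string slowest(const std::vector<Sample>& samples) {
    if (samples.empty()) return "";
    const auto it = std::max_element(
        samples.begin(), samples.end(),
        [](const Sample& a, const Sample& b) { return a.micros < b.micros; });
    return it->name;
}
```

In use, each candidate function would be wrapped in a lambda and passed to `time_call`, and the collected samples handed to `slowest`. A single run is noisy, so in practice one would repeat each measurement several times before deciding what to move.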


SDSOC EXAMPLE
Let’s look at how the SDSoC environment works and see
how quickly we can get an example up and running. We
will target a ZedBoard running Linux and using the built-
in Matrix Multiplier and Addition Template.
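For readers who have not seen the template, the computation it is named for can be sketched in plain C++. This is not the template's actual source, which is not reproduced in this article; the matrix size `N`, the row-major `Matrix` type and the function names `mmult`, `madd` and `mmult_add` are assumptions chosen for illustration. It shows only the software version that would run on the processor before any function is moved to hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t N = 4;         // hypothetical matrix dimension
using Matrix = std::vector<float>;   // N x N values stored row-major

// out = a * b (classic triple loop; the usual hot spot in this kind of code)
Matrix mmult(const Matrix& a, const Matrix& b) {
    Matrix out(N * N, 0.0f);
    for (std::size_t row = 0; row < N; ++row)
        for (std::size_t col = 0; col < N; ++col) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < N; ++k)
                acc += a[row * N + k] * b[k * N + col];
            out[row * N + col] = acc;
        }
    return out;
}

// out = a + b, element by element
Matrix madd(const Matrix& a, const Matrix& b) {
    Matrix out(N * N);
    for (std::size_t i = 0; i < N * N; ++i) out[i] = a[i] + b[i];
    return out;
}

// d = (a * b) + c: a multiply followed by an add
Matrix mmult_add(const Matrix& a, const Matrix& b, const Matrix& c) {
    return madd(mmult(a, b), c);
}
```

Keeping the multiply and the add as separate functions mirrors the idea that individual functions are the unit of acceleration: each one can be profiled and considered for hardware on its own.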
The first task, as always, is to create a project. We can
do so either from the Welcome screen (Figure 1) or by
selecting File -> New -> SDSoC project from the menu.
Selecting either option will open a dialog box that will let
us name the project and select the board and the operat-
ing system (Figure 2).
This will create a project under the Project Explorer
on the left-hand side of the SDSoC GUI. Under this proj-
ect, we will see the following folders, each with its own,
graphically unique symbol:
• SDSoC Hardware Functions: Here we will see the func-
tions we have moved into the hardware. Initially, as we
have yet to move functions, this folder will be empty.
• Includes: Expanding this folder will show all of the
C/C++ header files used in the build.
• src: This will contain the source code for the demonstration.
To ensure that we have everything correctly con-
figured not only with our SDSoC installation and
environment, but also with our development board,
we will build the demo so that it will run on only the
on-chip processing system (PS) side of the device.
Of course, the next step is to build the project. With the
project selected on the menu, we choose Project->Build
Project. It should not take too long to build, and when we
are done we will see folders as shown in Figure 3 appear
under our project within the Project Explorer. In addi-
tion to the folders described above, we will have:
• Binaries: Here we will find the Executable and Linkable
Format (ELF) files created from the software compi-
lation process.
• Archives: The object files that are linked to create the binaries reside here.
• SDRelease: This contains our boot files and reports.
Figure 2 — Creating the project


With the first demo built such that it will run only on the
Zynq SoC’s PS, let’s explore how we know it is working
as desired. Recall that SDSoC acceleration works by pro-
filing the application; the engineer then uses the profiled
information to determine which functions to move.
Figure 3 — Project Explorer view when built
We achieve profiling at the basic level by using a provided library called sds_lib.h. This provides a basic timestamp API, based on the 64-bit global counter, that lets us measure how long each function takes. With the API, we simply record the function start and stop times, and the difference constitutes the process execution time.
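The start/stop pattern described above can be sketched on a host machine as follows. This is only an illustration under stated assumptions: timestamp() is a stand-in built on std::chrono, and PerfCounter, harmonic_sum and average_ticks are hypothetical names introduced here. On the Zynq target, the timestamp would come from the sds_lib.h API rather than from the host clock.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Stand-in timestamp for a host build: nanoseconds from a monotonic clock.
// Assumption: on the target, the sds_lib.h timestamp API would be read here.
inline std::uint64_t timestamp() {
    using namespace std::chrono;
    return static_cast<std::uint64_t>(
        duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
}

// Accumulates (stop - start) differences over several timed intervals.
struct PerfCounter {
    std::uint64_t total = 0;    // accumulated ticks
    std::uint64_t calls = 0;    // number of timed intervals
    std::uint64_t started = 0;  // timestamp taken by the latest start()

    void start() { started = timestamp(); }
    void stop() { total += timestamp() - started; ++calls; }
    std::uint64_t average() const { return calls ? total / calls : 0; }
};

// Hypothetical workload standing in for the function being profiled.
inline double harmonic_sum(int n) {
    double acc = 0.0;
    for (int i = 1; i <= n; ++i) acc += 1.0 / i;
    return acc;
}

// Times `runs` calls of the workload and returns the average ticks per call.
inline std::uint64_t average_ticks(int runs, int n, double& checksum) {
    assert(runs > 0);
    PerfCounter ctr;
    checksum = 0.0;
    for (int r = 0; r < runs; ++r) {
        ctr.start();
        checksum += harmonic_sum(n);  // keep the result so the work is not optimized away
        ctr.stop();
    }
    return ctr.average();
}
```

Averaging over several runs smooths out scheduling noise, which matters more on a Linux target than on bare metal.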
The source code contains two versions of the algo-
rithm for matrix multiply and add. The so-called golden
version is not intended for offloading to the on-chip pro-
grammable logic (PL); the other version is. By building
and running these just within the PS, we can ensure that
we are comparing eggs with eggs and that both process-
es take roughly the same time to execute.
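To make the two-version arrangement concrete, here is a minimal sketch of a golden routine next to a candidate that is split into separate multiply and add functions. The dimension N, the function signatures and results_match are assumptions made for illustration; they are not taken from the template's source code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr std::size_t N = 8;  // hypothetical matrix dimension for this sketch

// Golden version: D = A * B + C computed in one software routine.
inline void mmult_madd_golden(const std::vector<float>& a, const std::vector<float>& b,
                              const std::vector<float>& c, std::vector<float>& d) {
    assert(a.size() == N * N && b.size() == N * N && c.size() == N * N);
    d.assign(N * N, 0.0f);
    for (std::size_t row = 0; row < N; ++row)
        for (std::size_t col = 0; col < N; ++col) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < N; ++k)
                acc += a[row * N + k] * b[k * N + col];
            d[row * N + col] = acc + c[row * N + col];
        }
}

// Candidate version, part 1: row-major matrix multiply (out = a * b).
inline void mmult(const float* a, const float* b, float* out) {
    for (std::size_t row = 0; row < N; ++row)
        for (std::size_t col = 0; col < N; ++col) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < N; ++k)
                acc += a[row * N + k] * b[k * N + col];
            out[row * N + col] = acc;
        }
}

// Candidate version, part 2: element-wise add (out = a + b).
inline void madd(const float* a, const float* b, float* out) {
    for (std::size_t i = 0; i < N * N; ++i) out[i] = a[i] + b[i];
}

// True when both result vectors agree element by element within `tol`.
inline bool results_match(const std::vector<float>& x, const std::vector<float>& y, float tol) {
    if (x.size() != y.size()) return false;
    for (std::size_t i = 0; i < x.size(); ++i)
        if (std::fabs(x[i] - y[i]) > tol) return false;
    return true;
}
```

Keeping the candidate as two small functions mirrors the idea that each one can be moved to the PL independently of the other.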
Figure 4 — Execution time of both functions in the PS
With the build complete, we can copy all of the files in
the SDRelease -> sd_card folder under the Project Explor-
er onto our SD card and insert the card into the ZedBoard
(with the mode pins correctly set for SD card configuration).
With a terminal program connected, once the boot sequence
has been completed we need to run the program. We type
/mnt/mult_add.elf (where mult_add is the name of the proj-
ect we have created). When I ran this on my ZedBoard, I
got the result shown in Figure 4, which demonstrates that
the two functions take roughly the same time to execute.
Having confirmed the similar execution times, we
will move the multiply function into the PL side of the
SoC. This is simple to achieve.
Looking at the file structure within the src directory
of the example, we will see:
• main.cpp, which contains the main function, golden calculation, timestamping, and calls to the mult and add functions used in the hardware side of the device;
• mmult.cpp, which contains the multiplication function to be offloaded into the hardware; and
• madd.cpp, which contains the addition function to be offloaded into the hardware.
Figure 5 — Moving the multiplier kernel to the PL side using the Project Explorer


The next step is to offload just one of these functions
to the PL side of the SoC. We can achieve this by one of two
methods:
1. Within the Project Explorer, we can expand the file such
that we can see the functions within that file, select the
function of interest, right click and select Toggle HW/SW
[H] (Figure 5).
2. We can open the file and perform the same operation under
the outline tab on the right, which shows the functions as
well (Figure 6).

Toggling the mmult() function to be accelerated within the hardware will result in an [H] being added to the back of the function (Figure 7).
We will also see the function we have selected under SDSoC Hardware Functions (beneath our project within the Project Explorer tab; Figure 8). This provides an easy way to see all of the functions that we have accelerated within our design.
Figure 6 — Moving the multiplier kernel to the PL side using the outline window
Once we have taken the steps described here, the next time we build the project the SDSoC linker will automatically call Vivado HLS and the rest of the Vivado Design Suite to implement the functions within the PL side of the SoC. As it does so, it will create the relevant software drivers to support function acceleration. From our perspective, offloading the function to the PL side of the device becomes seamless, except for the increase in performance.
Figure 7 — The mmult() function in hardware
I moved the mmult() function into the hardware and, after compilation and SD card image generation, ran it on my ZedBoard. As Figure 9 shows, the execution time was only 52,444 processor cycles, that is, 52,444 / 183,289 ≈ 0.29, or about 29 percent of the 183,289 processor cycles the same function needed when executed within the PS side of the device (Figure 4). This is a speedup of roughly 3.5x, and we achieve this considerable reduction in execution time with a simple click of the mouse.
The straightforward example presented here demonstrates the power and seamlessness of the SDSoC environment and the tightly integrated HLS functions.
Figure 8 — Identifying our accelerated functions
Figure 9 — The accelerated results
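As a quick check of the arithmetic behind these results, the two cycle counts quoted above (183,289 cycles in the PS and 52,444 cycles with mmult() in the PL) can be turned into a remaining-time ratio and a speedup factor. The helper names below are hypothetical and exist only for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Fraction of the original execution time that remains after acceleration.
inline double time_ratio(std::uint64_t before_cycles, std::uint64_t after_cycles) {
    assert(before_cycles > 0);
    return static_cast<double>(after_cycles) / static_cast<double>(before_cycles);
}

// Speedup factor: how many times faster the accelerated version runs.
inline double speedup(std::uint64_t before_cycles, std::uint64_t after_cycles) {
    assert(after_cycles > 0);
    return static_cast<double>(before_cycles) / static_cast<double>(after_cycles);
}
```

With the numbers from the article, time_ratio(183289, 52444) is a little under 0.29 and speedup(183289, 52444) is a little under 3.5.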


Using the SDSoC IDE for System-level HW-SW Optimization on the Zynq SoC
by Daniele Bagni
DSP Specialist FAE
Xilinx, Inc.
[email protected]

Nick Ni
Product Manager, SDSoC
Development Environment
Xilinx, Inc.
[email protected]


A Cholesky matrix decomposition example yields an acceleration estimate in minutes.

The Xilinx® Zynq®-7000 All Programmable SoC family represents a new dimension in embedded design, delivering unprecedented performance and flexibility to the embedded systems engineering community. These products integrate a feature-rich, dual-core ARM® Cortex™-A9 MPCore™-based processing system and Xilinx programmable logic in a single device. More than 3,000 interconnects link the
on-chip processing system (PS) to on-chip program-
mable logic (PL), enabling performance between the
two on-chip systems that simply can’t be matched
with any two-chip processor-FPGA combination.
When Xilinx released the device in 2011, the Zynq
SoC gained an instant following among a subset of
embedded systems engineers and architects well-
versed in hardware design languages and methodol-
ogies as well as in embedded software development.
The first-of-its-kind Zynq SoC today is deployed in
embedded applications ranging from wireless infra-
structure to smart factories and smart video/vision,
and it is quickly becoming the de facto standard plat-
form for advanced driver assistance systems.
To make this remarkable device available to em-
bedded engineers who have a strong software back-
ground but no HDL experience, Xilinx earlier this year
introduced the Eclipse-based SDSoC™ integrated de-
velopment environment, which enables software en-
gineers to program the programmable logic as well as
the ARM processing system of the Zynq SoC.
Let’s take a closer look at the features of the
Zynq SoC [1] and at how software engineers can
leverage the SDSoC environment to create system
designs not possible with any other processor-plus-
FPGA system. For our investigation, we will use
the Xilinx ZC702 evaluation board [2], containing a
Zynq Z-7020-1 device, as the hardware platform.
As shown in Figure 1, the Zynq SoC comprises two ma-
jor functional blocks: the PS (composed of the applica-
tion processor unit, memory interfaces, peripherals and
interconnect) and the PL (the traditional FPGA fabric).


Figure 1 — Zynq high-level architecture overview

The PS and PL are tightly coupled via interconnects compliant with the ARM® AMBA® AXI4 interface. Four high-performance (HP) AXI4 interface ports connect the PL to asynchronous FIFO interface (AFI) blocks in the PS, thereby providing a high-throughput data path between the PL and the PS memory system (DDR and on-chip memory). The AXI4 Accelerator Coherency Port (ACP) allows low-latency cache-coherent access to L1 and L2 cache directly from the PL masters. The General Purpose (GP) port comprises low-performance, general-purpose ports accessible from both the PS and PL.
In the traditional, hardware-design-centric flow, using Xilinx’s Vivado® Design Suite, designing an embedded system on the Zynq SoC requires roughly four steps:
1. A system architect decides a hardware-software partitioning scheme. Computationally intensive algorithms are the ideal candidates for hardware. Profiling results are used as the basis for identifying performance bottlenecks and running trade-off studies between data movement costs and acceleration benefits.
2. Hardware engineers take functions partitioned to hardware and convert/design them into intellectual-property (IP) cores, for example in VHDL or Verilog using Vivado, in C/C++ using Vivado High Level Synthesis (HLS) [3] or in model-based design using Vivado System Generator for DSP [4].
3. Engineers then use Vivado IP Integrator [5] to create a block-based design of the whole embedded system. The full system needs to be developed with different data movers (AXI-DMA, AXI Memory Master, AXI-FIFO, etc.) and AXI interfaces (GP, HP and ACP) connecting the PL IP with the PS. Once all design rules checks are passed within IP Integrator, the project can be exported to the Xilinx Software Development Kit (SDK) [6].
4. Software engineers develop drivers and applications targeting the ARM processors in the PS using the Xilinx SDK.
In recent years, Xilinx made substantial ease-of-use improvements to the Vivado Design Suite that enabled engineers to shorten the duration of the IP development and IP block connection steps (step 2 and part of step 3 above). For IP development, the adoption of such new design technologies as C/C++ high-level synthesis in the Vivado HLS tool and model-based design with Vivado System Generator for DSP cut development

and verification time dramatically while letting design teams use high-level abstractions to explore a greater range of architectures. Designs that took weeks to accomplish with VHDL or Verilog could be completed in days using the new tools.
Xilinx enhanced the flow further with Vivado IP Integrator. This feature of the Vivado Design Suite enables the design of a complicated hardware system, embedded or not, simply by connecting IP blocks in a graphical user interface (GUI), thereby allowing rapid hardware system integration.
The new Vivado Design Suite features made life a bit easier for design and development teams working with the Zynq SoC. But with a hardware-centric optimization workflow, not too much could be done to shorten the development time required to explore different data movers and PS-PL interfaces (part of step 3 above) and to write and debug drivers and applications (step 4). If the whole system did not meet the design requirement in terms of throughput, latency or area, the team would have to revisit the hardware architecture by modifying the system connectivity during step 3; those modifications inevitably would lead to changes in the software application in step 4. In some cases, a lack of acceleration or a hardware utilization overflow would force the team to revisit the original hardware-software partitioning. Multiple hardware and software teams would have to create another iteration of the system to explore other architectures that might meet the end requirement.
These examples show the time-to-market impact of system optimization done manually. System optimization nonetheless is critical for a tightly integrated system such as the Zynq SoC because bottlenecks often occur in the system connectivity between the PS and the PL.
The SDSoC environment greatly simplifies the Zynq SoC development process, slashing total development time by largely automating steps 2, 3 and 4. The development environment generates necessary hardware and software components to synchronize hardware and software and to preserve original program semantics, while enabling task-level parallelism and pipelined communication and computation to achieve high performance. The SDSoC environment automatically orchestrates all necessary Xilinx tools (Vivado, IP Integrator, HLS and SDK) to generate a full hardware-software system targeting the Zynq SoC, and does so with minimum user intervention.
Figure 2 — The main steps in the SDSoC design flow (C/C++ Development; System-level Profiling, with rapid system-level performance estimation; Specify C/C++ Functions for Acceleration; Full System Optimizing Compiler)
Assuming we have an application completely described in C/C++ targeting the PS and we have already decided which functions to partition to the PL for acceleration, the SDSoC development flow roughly proceeds as follows (Figure 2):
1. The SDSoC environment builds the application project using a rapid estimation flow (by calling Vivado HLS under the hood). This will provide the ballpark performance and resource estimation in minutes.

2. If we deem it necessary, we optimize the C/C++ application and the hardware functions with proper directives, and rerun the estimation until the desired performance and area are achieved.
3. The SDSoC environment then builds the full system. This process will generate the full Vivado Design Suite project and the bitstream, along with a bootable run-time software image targeting Linux, FreeRTOS or bare metal.

Figure 3 — Structure of the C/C++ test bench for the SDSoC environment

PERFORMANCE ESTIMATION OF HARDWARE VS. SOFTWARE WITH THE SDSOC ENVIRONMENT
Linear algebra is a fundamental and powerful tool in almost every discipline of engineering, allowing whole systems of equations with multidimensional variables to be solved computationally. For example, engineers can describe linear control theory systems as matrices of “states” and state changes. Digital signal processing of images is another classic example of linear algebra’s application. In particular, matrix inversion through the Cholesky decomposition is considered one of the most efficient methods for solving a system of equations or inverting a matrix. Let’s look closely at a Cholesky matrix decomposition of 64 x 64 real data in a 32-bit floating-point representation as an application example for hardware-software partitioning on the Zynq SoC.
The Cholesky decomposition transforms a positive definite matrix into the product of a lower and upper triangular matrix with a strictly positive diagonal. The matrix B is decomposed into the triangular matrix L, so that B = L' * L, with L' the transposed version of L, as illustrated in the following MATLAB® code for the case of a 4 x 4 matrix size:

A = ceil(64*randn(4,4))   % generate random data
B = A * A'                % make the matrix symmetric
L = chol(B)               % compute Cholesky decomposition
B2 = (L' * L)             % reconstruct the original matrix B

A =
   -13    53    41    20
   -19    98    12     9
     2    30   -65    33
     4   -13    61    17
B =
   5059   6113   -441   2100
   6113  10190   2419   -465
   -441   2419   6218  -3786
   2100   -465  -3786   4195
L =
   71.1266   85.9453   -6.2002   29.5248
         0   52.9472   55.7513  -56.7077
         0         0   55.4197   -7.9648
         0         0         0    6.6393
B2 =
   5059   6113   -441   2100
   6113  10190   2419   -465
   -441   2419   6218  -3786
   2100   -465  -3786   4195
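For readers who prefer C++ to MATLAB, the same 4 x 4 example can be reproduced with a textbook Cholesky routine. The sketch below is not the Linear Algebra HLS Library implementation used for the accelerator; cholesky_upper, reconstruction_error and example_b are hypothetical names, and the routine returns the upper-triangular factor so that B = R' * R, matching what chol(B) prints above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Upper-triangular Cholesky factor R of a symmetric positive definite
// n x n matrix B (row-major), so that B = R^T * R.
// Returns false if a non-positive pivot shows B is not positive definite.
inline bool cholesky_upper(const std::vector<double>& b, std::size_t n,
                           std::vector<double>& r) {
    assert(b.size() == n * n);
    r.assign(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i; j < n; ++j) {
            double sum = b[i * n + j];
            for (std::size_t k = 0; k < i; ++k)
                sum -= r[k * n + i] * r[k * n + j];
            if (i == j) {
                if (sum <= 0.0) return false;  // not positive definite
                r[i * n + i] = std::sqrt(sum);
            } else {
                r[i * n + j] = sum / r[i * n + i];
            }
        }
    }
    return true;
}

// Largest absolute difference between B and the reconstruction R^T * R.
inline double reconstruction_error(const std::vector<double>& b,
                                   const std::vector<double>& r, std::size_t n) {
    double worst = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                acc += r[k * n + i] * r[k * n + j];
            const double diff = std::fabs(acc - b[i * n + j]);
            if (diff > worst) worst = diff;
        }
    return worst;
}

// The 4 x 4 matrix B from the MATLAB listing, in row-major order.
inline std::vector<double> example_b() {
    return {5059, 6113,  -441,  2100,
            6113, 10190, 2419,  -465,
            -441, 2419,  6218,  -3786,
            2100, -465,  -3786, 4195};
}
```

Calling cholesky_upper(example_b(), 4, r) fills r with a factor whose first row starts with approximately 71.1266 and 85.9453, in line with the MATLAB output.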

Let’s see how we can obtain an estimation of the performance and resource utilization that we can expect from our application, without going through the entire build cycle.
Figure 3 shows the test bench structure suitable for the SDSoC environment. The main program allocates dynamic memory for all the empty matrices and fills them with data (either read from a file or generated randomly). It then calls the reference software function and the hardware candidate function. Finally, the main program checks the numerical results computed by both functions to test the effective correctness. Note the use of a special memory allocator called sds_alloc for each input/output array to let the SDSoC environment automatically insert a Simple DMA IP between each I/O port of the hardware accelerator; in contrast, malloc instantiates a Scatter-Gather DMA, which can handle arrays spread across multiple noncontiguous pages in the Physical Address Space. The Simple DMA is cheaper than the Scatter-Gather DMA in terms of area and performance overheads, but it requires sds_alloc to obtain physically contiguous memory.
Selecting the candidate accelerator is easily accomplished with a mouse click on a specific function via the SDSoC environment’s GUI. As shown in Figure 4, the routine cholesky_alt_top is marked with an “H” to indicate that it will be promoted to a hardware accelerator. We can also select the clock frequency for the accelerator and for the data motion cores (100 MHz as illustrated in the SDSoC project page of Figure 4).
Figure 4 — Setting the hardware accelerator core and its clock frequency from the SDSoC project page
We can now launch the “estimate speedup” process. After a few minutes of compilation, we get all the cores and the data motion network generated in a Vivado project. The SDSoC environment also generates an SD card image that comprises a Linux boot image including the FPGA bitstream and the application binary of the software-only version. We boot from this SD card and run the application on the ZC702 target platform.
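The test bench structure described earlier (allocate, fill, call the reference and the candidate, then compare) can be sketched as below. Everything here is illustrative: buffer_alloc and buffer_free are hypothetical hooks that fall back to malloc and free on a host machine, whereas on the target they would forward to sds_alloc and its matching release call so that the buffers are physically contiguous. The scaling workload is a placeholder, not the Cholesky routine.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdlib>

// Hypothetical allocation hooks. Host fallback: plain malloc/free.
// Assumption: on the Zynq target these would wrap the sds_lib.h allocator.
inline float* buffer_alloc(std::size_t count) {
    return static_cast<float*>(std::malloc(count * sizeof(float)));
}
inline void buffer_free(float* p) { std::free(p); }

// Reference software function (placeholder workload: halve every element).
inline void scale_ref(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = 0.5f * in[i];
}

// Hardware candidate: same arithmetic, written as a separate function so it
// could be the one marked for acceleration.
inline void scale_hw_candidate(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * 0.5f;
}

// Runs the test bench and returns the number of mismatching elements,
// or -1 if any allocation failed.
inline int run_testbench(std::size_t n, float tolerance) {
    assert(n > 0);
    float* in = buffer_alloc(n);
    float* ref = buffer_alloc(n);
    float* hw = buffer_alloc(n);
    int mismatches = -1;
    if (in && ref && hw) {
        for (std::size_t i = 0; i < n; ++i)
            in[i] = static_cast<float>(i % 64) - 32.0f;  // deterministic fill
        scale_ref(in, ref, n);
        scale_hw_candidate(in, hw, n);
        mismatches = 0;
        for (std::size_t i = 0; i < n; ++i)
            if (std::fabs(ref[i] - hw[i]) > tolerance) ++mismatches;
    }
    buffer_free(in);   // free(nullptr) is a no-op, so this is safe on failure
    buffer_free(ref);
    buffer_free(hw);
    return mismatches;
}
```

Routing every allocation through one pair of hooks keeps the choice between contiguous and paged memory in a single place, which is the decision that determines the data mover.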
Once Linux has booted on the board, we can execute the software-only application, and the SDSoC environment then generates the performance estimation report of Figure 5. We see both the FPGA resources utilization (26 DSP, 80 BRAM, 15,285 LUT, 17,094 FF) and the performance speedup (1.75) of the cholesky_alt_top function if executed in hardware instead of software. We can also see, from the main application point of view, that the overall speedup is lower (1.23) because of other software overhead such as malloc and data transfer. Our complete application is indeed small, focusing mainly on illustrating the SDSoC flow and design methodology; we would need more routines to be accelerated in the PL, but that is beyond the scope of this article.
Figure 5 — SDSoC-generated performance, speedup and resources estimation report
Using the SDSoC environment, we have generated this information in a few minutes without requiring synthesis and place-and-route FPGA compilation; those processes could take hours, depending on the complexity of the hardware system. Estimations like this one are often enough to analyze the system-level performance of hardware-software partitioning and let users very rapidly iterate a design to create an optimized system.
Figure 6: Vivado HLS synthesis estimation report


UNDERSTANDING THE PERFORMANCE ESTIMATION RESULTS
When the SDSoC environment compiles the application code for the estimate-speedup process, it generates an intermediate directory (_sds in Figure 5) in which it places all intermediate projects (Vivado HLS, Vivado IP Integrator, etc.). In particular, it inserts calls to a free-running ARM performance counter function, sds_clock_counter(), in the original code to measure the execution time of key parts of the program functions. That is why the target board needs to be connected with the SDSoC environment’s GUI during the estimate-speedup process. All the numbers reported in Figure 5 are measured with those counters during run-time execution. The only exception is the hardware-accelerated function, which does not exist until after the entire FPGA build (including place-and-route implementation); therefore Vivado HLS computes the hardware-accelerated function’s estimated cycles, together with the resource utilization estimates, under the hood, during the effective Vivado HLS Synthesis step.
Assuming the candidate hardware accelerator function runs at a clock frequency of F_HW MHz and needs CK_HW clock cycles for the whole computation (this is the concept of latency), and assuming the function takes CK_ARM clock cycles at a clock frequency of F_ARM MHz when executed on the ARM CPU, then the hardware accelerator achieves the same performance as the ARM CPU if the computation time is the same, that is, CK_HW / F_HW = CK_ARM / F_ARM. From this equation, we get CK_ARM = CK_HW * F_ARM / F_HW. This is the accelerator latency expressed in ARM clock cycles: migrating the function to hardware shows acceleration only if the software version takes more ARM cycles than this.
In Figure 6, we report the Vivado HLS synthesis estimation results. Note that the hardware accelerator latency is CK_HW = 83,653 cycles at a clock frequency of F_HW = 100 MHz. Since on the ZC702 board we have F_ARM = 666 MHz, we get CK_ARM = CK_HW * F_ARM / F_HW = 83,653 * 666 / 100 = 557,128, which is well aligned with the result of 565,554 cycles reported by the SDSoC environment in Figure 5. This is why the SDSoC environment can estimate the number of clock cycles that the accelerator requires without actually building it via place-and-route.

BUILDING THE HARDWARE-SOFTWARE SYSTEM WITH THE SDSOC ENVIRONMENT
Having determined that this hardware acceleration makes sense, we can implement the whole hardware and software system with the SDSoC environment. All we need to do is add the right directives (in the form of pragma commands) to specify, respectively, the FIFO interfaces (due to the sequential scan of the I/O arrays); the amount of data to be transferred at run time for any call to the accelerator; the types of AXI ports connected between the IP core in the PL and the PS; and, finally, the kind of data movers. The following C/C++ code illustrates the applications of those directives. Note that in reality the last directive is not needed, because the SDSoC environment will instantiate a Simple DMA due to the use of sds_alloc; we have included it here only for the sake of clarity.

#pragma SDS data access_pattern(A:SEQUENTIAL, L:SEQUENTIAL) // FIFO interfaces
#pragma SDS data copy(A[0:BUF_SIZE], L[0:BUF_SIZE]) // amount of data transferred
#pragma SDS data sys_port (A:ACP, L:ACP) // type of AXI ports
#pragma SDS data data_mover (A:AXI_DMA_SIMPLE, L:AXI_DMA_SIMPLE) // type of DMAs
int cholesky_alt_top(MATRIX_IN_T A[ROWS_COLS_A*ROWS_COLS_A],
                     MATRIX_OUT_T L[ROWS_COLS_A*ROWS_COLS_A]);

We can build the project in Release configuration directly from the SDSoC environment’s GUI, or we can use the Makefile reported in Figure 7 and launch it from the SDSoC Tool Command Language (Tcl) interpreter. As is the case with any tool in the Vivado Design Suite, designers can either adopt the GUI or Tcl scripting. To improve the speedup gain, we increase the clock frequency of the hardware accelerator to F_HW = 142 MHz (set by the -clkid 1 makefile flag).
Figure 7 — Makefile for the Release build
After less than half an hour of FPGA compilation, we get the bitstream to program the ZC702 board and the Executable and Linkable Format (ELF) file to execute on the Linux OS. We then measure the performance on the ZC702 board: 995,592 cycles for software-only and 402,529 cycles for hardware acceleration. Thus, the effective performance gain for the cholesky_alt_top function is 2.47.
Figure 8 illustrates the block diagram of the whole embedded system created when the SDSoC environment calls Vivado IP Integrator in a process transparent to the user (for the sake of clarity, only the AXI4 interfaces are shown). In addition, the SDSoC environment reports the Vivado IP Integrator block diagram as an HTML file to make it easy to read (Figure 9). This report clearly shows that the hardware accelerator is connected with the ACP port via a simple AXI4-DMA, whereas the GP port is used to set up the accelerator via an AXI4-Lite interface.
Figure 8 — IP Integrator block-based design done by the SDSoC environment
How much time did it take us to generate the SD card for the ZC702 board with the embedded system up and running? We needed one working day to write a C++ test bench suitable to both Vivado HLS and the SDSoC environment, and then we needed one hour of experimentation to get good results from the Linear Algebra HLS Library and one hour to create the embedded system with the SDSoC environment (the FPGA compilation process). Altogether, the process took 10 hours. We estimate that doing all this work manually (step 3 with Vivado IP Integrator and step 4 with Xilinx SDK) would have cost us at least two weeks of full-time, hard work, not counting the experience needed to use those tools efficiently.
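The break-even relation between accelerator cycles and ARM cycles discussed in this article can be checked numerically. The sketch below uses the figures quoted in the text (a latency of 83,653 cycles at 100 MHz, an ARM clock of 666 MHz and a measured software-only run of 995,592 cycles); the function names are hypothetical and introduced only for illustration.

```cpp
#include <cassert>

// Accelerator latency converted into ARM clock cycles:
// CK_ARM = CK_HW * F_ARM / F_HW (both frequencies in MHz).
inline double equivalent_arm_cycles(double ck_hw, double f_hw_mhz, double f_arm_mhz) {
    assert(f_hw_mhz > 0.0);
    return ck_hw * f_arm_mhz / f_hw_mhz;
}

// True when the software version needs more ARM cycles than the
// accelerator's ARM-cycle equivalent, i.e. when offloading pays off
// (data transfer overhead is ignored in this sketch).
inline bool hardware_is_faster(double sw_arm_cycles, double ck_hw,
                               double f_hw_mhz, double f_arm_mhz) {
    return sw_arm_cycles > equivalent_arm_cycles(ck_hw, f_hw_mhz, f_arm_mhz);
}
```

For example, equivalent_arm_cycles(83653, 100, 666) is just under 557,129, well below the 995,592 cycles of the software-only run, so hardware_is_faster(995592, 83653, 100, 666) holds.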


The SDSoC development environment enables the broader community of embedded system and software developers to target the Zynq SoC with a familiar embedded C/C++ development experience. Complete with the industry’s first C/C++ full-system optimizing compiler, the SDSoC environment delivers system-level profiling, automated software acceleration in programmable logic, automated system connectivity generation and libraries to speed development. For more information, including how to obtain the tool, visit http://www.xilinx.com/products/design-tools/software-zone/sdsoc.html.

REFERENCES
1. UG1165, Zynq-7000 All Programmable SoC: Embedded Design Tutorial
2. UG850, ZC702 Evaluation Board for the Zynq-7000 XC7Z020 All Programmable SoC User Guide
3. UG871, Vivado Design Suite Tutorial: High-Level Synthesis
4. UG948, Vivado Design Suite Tutorial: Model-Based DSP Design using System Generator
5. UG994, Vivado Design Suite User Guide: Designing IP Subsystems Using IP Integrator
6. UG782, Xilinx Software Development Kit (SDK) User Guide

Data Motion Network
Accelerator         Argument  IP Port    Direction  Declared Size (bytes)  Pragmas                           Connection
cholesky_alt_top_0  A         A          IN         4096*4                 length:(BUF_SIZE), sys_port:ACP   S_AXI_ACP:AXIDMA_SIMPLE
                    L         L          OUT        4096*4                 length:(BUF_SIZE), sys_port:ACP   S_AXI_ACP:AXIDMA_SIMPLE
                    return    AP_return  OUT        4                      -                                 M_AXI_GP0:AXILITE:0xC0

Accelerator Callsites
Accelerator         Callsite                    IP Port    Transfer Size (bytes)  Paged or Contiguous  Cacheable or Non-cacheable
cholesky_alt_top_0  cholesky_alt_tb.cpp:246:23  A          (BUF_SIZE) * 4         contiguous           cacheable
                                                L          (BUF_SIZE) * 4         contiguous           cacheable
                                                ap_return  4                      paged                cacheable

Figure 9 — SDSoC connectivity report

XCELL SOFTWARE JOURNAL: XCELLENCE WITH SDACCEL

Compile, Debug, Optimize
by Jayashree Rangarajan
Senior Engineering Director,
Interactive Design Tools
Xilinx, Inc.
[email protected]

Fernando Martinez Vallina

Software Development Manager, SDAccel
Xilinx, Inc.
[email protected]

Vinay Singh
Senior Product Marketing Manager,
SDAccel
Xilinx, Inc.
[email protected]


Xilinx’s SDAccel development environment

enables software application design flows
for FPGAs.

Xilinx® FPGA devices mainly comprise a programmable logic fabric that lets application designers exploit both spatial and temporal parallelism to maximize the performance of an algorithm or a critical kernel in a large application. At the heart of this fabric are arrays of lookup-table-based logic elements, distributed memory cells and multiply-and-accumulate units. Designers can combine those elements in different ways to implement the logic in an algorithm while achieving power consumption, throughput and latency design goals.

The combination of FPGA fabric elements into logic functions has long been the realm of hardware engineers, involving a process that resembles assembly-level coding more closely than it mimics modern software design practices. Whereas common software design procedures long ago moved beyond assembly coding, FPGA design practices have progressed at a slower pace because of the inherent differences between CPU and FPGA compilation.

In the case of CPUs and GPUs, the hardware is fixed, and all programs are compiled against a static instruction set architecture (ISA). Although the ISAs differ between CPUs and GPUs, the basic underlying compilation techniques are the same. Those similarities have enabled the evolution of design practices from handcrafted assembly code into compilation, debug and optimization design procedures that leverage the OpenCL™ C, C and C++ programming languages common to software development.

In the case of FPGA design, designers can create their own processing architecture to perform a specific workload. The ability to customize the architecture to a specific system need is a key advantage of FPGAs, but it has also acted as a barrier to adopting software development practices for FPGA application development.

Six years ago, Xilinx began a diligent R&D effort to break down this barrier by creating a development environment that brought an intuitive software development design loop to FPGAs. The Xilinx SDAccel™ development environment for OpenCL C, C and C++ enables application compile, debug and optimization for FPGA devices in ways similar to the processes used for CPUs and GPUs, with the advantage of up to 25x better performance/watt for data center application acceleration.

Software designers can use the SDAccel development environment to create and accelerate many functions and applications. Let’s look at how the SDAccel environment enables a compile, debug and optimization design loop on a median filter application.

MEDIAN FILTER
The median filter is a spatial function commonly used in image processing for the purpose of noise reduction (Figure 1). The algorithm inside the median filter uses a 3 x 3 window of pixels around a center pixel to compute the value of the center based on the median of all neighbors. The equation for this operation is:

outputPixel[i][j] =
    median(inputPixel[i-1][j-1], inputPixel[i-1][j], inputPixel[i-1][j+1],
           inputPixel[i][j-1],   inputPixel[i][j],   inputPixel[i][j+1],
           inputPixel[i+1][j-1], inputPixel[i+1][j], inputPixel[i+1][j+1]);

COMPILE
After the functionality of the median filter has been captured in a programming language such as OpenCL C, the first stage of development is compilation. On a CPU or GPU, compilation is a necessary and natural step in the software design flow. The target ISA is fixed and well known, leaving the programmer to worry only about the number of available processing cores and cache misses in the algorithm. FPGA compilation is more of an open question: At compilation time, the target ISA does not exist, the logic resources have yet to be combined into a processing fabric and the system memory architecture is yet to be defined.


Figure 1 — Median filter operation

The compiler in the SDAccel development environment provides three features that help programmers tackle those challenges: automatic extraction of parallelism among statements within a loop and across loop iterations, automatic memory architecture inference based on read and write patterns to arrays, and architectural awareness of the type and quantity of basic logic elements inside a given FPGA device. We can illustrate the importance of these three features with regard to source code for a median filter (Figure 2).

for (int y=0; y < height; y++) {

    int offset = y * width;
    int prev = offset - width;
    int next = offset + width;

    for (int x=0; x < width; x++) {

        // Get pixels within 3x3 aperture
        uint rgb[SIZE];
        rgb[0] = input[prev + x - 1];
        rgb[1] = input[prev + x];
        rgb[2] = input[prev + x + 1];

        rgb[3] = input[offset + x - 1];
        rgb[4] = input[offset + x];
        rgb[5] = input[offset + x + 1];

        rgb[6] = input[next + x - 1];
        rgb[7] = input[next + x];
        rgb[8] = input[next + x + 1];

        uint result = 0;

        // Iterate over all color channels
        for (int channel = 0; channel < 3; channel++) {
            result |= getMedian(channel, rgb);
        }

        // Store result into memory
        output[offset + x] = result;
    }
}

Figure 2 — Median filter code

The median filter operation is expressed as a series of nested loops with two main sections. The first section fetches data from an array in external memory called input and stores the values into a local array rgb. The second section of the algorithm is the “for” loop around the getMedian function; getMedian is where the computation takes place.

By analyzing the code in Figure 2, the SDAccel environment understands that there are no loop-carried dependencies on the array rgb. Each loop iteration has a private rgb copy, which can be stored on different physical resources. The other main characteristic that the SDAccel environment can derive from this code is the independent nature of calls to the getMedian function.

The version of the algorithm in Figure 2 executes the getMedian function inside a “for” loop with a fixed bound. Depending on the performance target for the filter and the FPGA selected, the SDAccel environment can either reuse the compute resources across all three channels or allocate more logic to enable spatial parallelism and run all channels at the same time. This decision, in turn, affects how memory storage for the array rgb is implemented.

From an application programmer’s perspective, the steps described above are transparent and can be thought of as -O1 to -O3 optimizations in the GNU Compiler Collection (GCC).
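Figure 2 calls getMedian without listing it. As a minimal sketch only, assuming each pixel packs 8-bit color channels into one 32-bit word and that SIZE is 9, a hypothetical getMedian could extract the requested channel from the nine window pixels, take the median, and shift it back into position so that the OR in Figure 2 assembles the output pixel. The packing, the value of SIZE and the use of std::nth_element are assumptions made for illustration, not the implementation behind the article.

```cpp
#include <algorithm>
#include <cstdint>

constexpr int SIZE = 9;  // assumption: one entry per pixel of the 3x3 window

// Hypothetical getMedian: 'channel' selects an 8-bit field of the packed pixel
// (0 = bits 0..7, 1 = bits 8..15, 2 = bits 16..23).
std::uint32_t getMedian(int channel, const std::uint32_t rgb[SIZE]) {
    const int shift = channel * 8;
    std::uint32_t values[SIZE];
    for (int i = 0; i < SIZE; ++i) {
        values[i] = (rgb[i] >> shift) & 0xFFu;  // isolate the channel
    }
    // Partial sort: afterwards the element at the middle index is the median.
    std::nth_element(values, values + SIZE / 2, values + SIZE);
    return values[SIZE / 2] << shift;  // move the median back into its field
}
```

Because every call reads rgb and writes nothing shared, the three channel calls are independent, which is the property the compiler relies on when it decides between reusing one median unit and instantiating three. In hardware the sort would more likely be a fixed compare-and-swap network; the library call here only keeps the sketch short.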

DEBUG
An axiom of software development is that application compilation does not equal application correctness. It is only after the application starts to run on the target hardware that a programmer can start to discover, trace and correct errors in the application: in other words, debug.

CPU application debug is a well-understood problem, with a multitude of tools from both commercial vendors and the open-source community available to address it. Once again, FPGAs are another story. How does an application programmer debug something that was created to implement the functionality of a piece of code at a given performance target?

The SDAccel development environment addresses this question by borrowing two concepts from the CPU world: printf and GDB debugging.

The printf function is a fundamental tool in the software programmer’s toolbox. It is available in every programming language and can be used to expose the state of key application variables during program execution. For CPU devices, this is as simple as monitoring the status of registers. There is no cost in hardware for printf functionality.

In the case of FPGAs, the implementation of printf can potentially consume logic resources that could otherwise be used for implementing algorithm functionality. The printf implementation in the SDAccel environment provides the functionality without consuming additional logic resources. The environment achieves this by separating printf data generation from the decoding and user presentation layers. In terms of hardware resources, the generation of data for printf consumes a few registers, a negligible cost in the register-rich FPGA fabric. Data decoding occurs in the driver to the FPGA. By leveraging the host CPU to execute the data decode and presentation layers for printf, a software programmer can use printf with virtually zero cost in FPGA resources.

The second technique for debugging borrowed from CPUs is the use of tools such as the GNU Project Debugger (GDB) to include breakpoints and single stepping through code. Programmers can use the SDAccel environment’s emulation modes to attach GDB to a running emulation process. The emulation process is a simulation of the application-specific hardware that the developer will execute on the FPGA device. Within the context of an emulation process, GDB can watch the state of variables, insert breakpoints and single step through code. From an application programmer’s perspective, this is identical to how GDB works on a CPU.

OPTIMIZE
After compiling and debugging, the next step in the software development cycle is to optimize the application. The principles behind application optimization on an FPGA are the same as on a CPU; the difference is in the approach. For a CPU, application code is massaged to fit into the boundaries of the cache and arithmetic units of a processor. In an FPGA, the computation logic is custom assembled for the current application. Therefore, the size of the FPGA and the application’s target performance dictate the optimization constraints.

Figure 3 — Memory access transaction trace


The compiler in the SDAccel environment automatically optimizes the compute logic. The programmer can assist the automatic optimizations by analyzing the data transfer patterns inferred from the code. Figure 3 shows the read and write transactions from the median filter code to the memories for input and output.

Each vertical line in the plot represents a transaction to memory. The green bar shows the duration of median filter function activity. The plot shows that although the median filter is always active, there are large gaps between memory transactions. The gaps represent the time it takes the median filter to switch from one transaction to the next. Since each transaction to memory accesses only a single value, the gaps between transactions represent an important performance bottleneck for this application.

One way to solve the performance problem is to state burst transactions from external memories to local memories explicitly inside the application code. The code excerpt in Figure 4 shows the use of the async_work_group_copy function employed in OpenCL C kernels. The purpose of this function call is to tell the compiler that each transaction to memory will be a burst containing multiple data values. This enables more efficient utilization of the available memory bandwidth on the target device and reduces the overall number of transactions to memory. In the code of Figure 4, the async_work_group_copy function brings the contents of entire lines from the input image in DDR memory to memories inside the kernel data path.

for (int line = 0; line < height; line++) {
    local uint linebuf0[MAX_WIDTH];
    local uint linebuf1[MAX_WIDTH];
    local uint linebuf2[MAX_WIDTH];
    local uint lineres[MAX_WIDTH];
    // Fetch Lines
    if (line == 0) {
        async_work_group_copy(linebuf0, input, width, 0);
        async_work_group_copy(linebuf1, input, width, 0);
        async_work_group_copy(linebuf2, input + width, width, 0);
    }
    …
}

Figure 4 — Median filter code with explicit burst memory transfers

The memory transaction trace in Figure 5 shows the result of using async_work_group_copy. As Figure 5 shows, the kernel involves a setup time before memory transactions occur that is not present in the original code for the median filter (Figure 2).

The setup time difference has to do with the logic derived from the code. In the original code of Figure 2, the application immediately starts a single transaction to memory and then waits for the data to be available. In contrast, the optimized code of Figure 4 determines whether a memory transaction needs to occur or whether the data is already available in the kernel’s local memory. It also allows the generated logic to schedule memory transactions back-to-back and to overlap read and write transactions.
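The line-buffer idea behind Figure 4 can be tried on a CPU. The sketch below is an analogue built on stated assumptions, not the article's kernel: std::copy_n stands in for async_work_group_copy, the image has a single 8-bit channel, border pixels are replicated, and the name medianLineBuffered is hypothetical. Each input line is copied into a local buffer in one step, the 3 x 3 window then reads only local buffers, and the function returns the number of line copies so the drop in external transactions is visible.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// 3x3 median over a width x height single-channel image using three rotating
// line buffers. Returns how many whole-line copies ("bursts") were issued.
int medianLineBuffered(const std::vector<std::uint8_t>& in,
                       std::vector<std::uint8_t>& out,
                       int width, int height) {
    std::vector<std::uint8_t> prev(width), cur(width), next(width);
    int bursts = 0;

    // One burst: copy an entire (border-clamped) input line into a local buffer.
    auto fetchLine = [&](int line, std::vector<std::uint8_t>& dst) {
        line = std::min(std::max(line, 0), height - 1);
        std::copy_n(in.begin() + static_cast<std::ptrdiff_t>(line) * width,
                    width, dst.begin());
        ++bursts;
    };

    for (int y = 0; y < height; ++y) {
        if (y == 0) {            // same start-up pattern as Figure 4
            fetchLine(0, prev);
            fetchLine(0, cur);
            fetchLine(1, next);
        } else {                 // rotate buffers, then fetch only the new line
            prev.swap(cur);
            cur.swap(next);
            fetchLine(y + 1, next);
        }
        for (int x = 0; x < width; ++x) {
            std::uint8_t window[9];
            int n = 0;
            for (int dx = -1; dx <= 1; ++dx) {
                const int xx = std::min(std::max(x + dx, 0), width - 1);
                window[n++] = prev[xx];
                window[n++] = cur[xx];
                window[n++] = next[xx];
            }
            std::nth_element(window, window + 4, window + 9);
            out[static_cast<std::size_t>(y) * width + x] = window[4];
        }
    }
    return bursts;
}
```

With this structure the number of external reads grows with the number of lines (height + 2 copies in this sketch) instead of nine single-value reads per output pixel as in Figure 2.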


Figure 5 — Memory access transaction trace after code optimization

Whether the final device is a CPU or an FPGA, profiling is an essential component of application development. The SDAccel environment’s visualization and profiler capabilities let an application programmer characterize the impact of code changes and application requirements in terms of kernel occupancy, transactions to memory and memory bandwidth utilization.

The design loop created by the operations of compile, debug and optimize is fundamental to software development flows. The SDAccel development environment enables this design loop with tools and techniques similar to the development environment on a CPU, with FPGA-based application acceleration of up to 25x better performance per watt and with a 50x to 75x latency improvement. Software programmers who use SDAccel can leverage the flexibility of the logic fabric to build high-performance, low-power applications without having to understand all of the details associated with hardware design.


Developing OpenCL
Imaging Applications
Using C++ Libraries


by Stephen Neuendorffer
Principal Engineer, Vivado HLS
Xilinx, Inc.
[email protected]

Thomas Li
Software Engineer, Vivado HLS
Xilinx, Inc.
[email protected]

Fernando Martinez Vallina
Development Manager, SDAccel
Xilinx, Inc.
[email protected]

Xilinx’s SDAccel development environment leverages the power of preexisting libraries to accelerate application design.

Imaging applications have grown in both scale and ubiquity in recent years as online pictures and videos, robotics, and driver assistance applications have proliferated. Across these domains, the core algorithms are very similar and require a development methodology that lets application developers quickly retarget and differentiate products based on markets and deployment targets.

As a result of those needs, imaging applications typically start as a software program targeting a CPU and employ library calls to standard functions. The combination of software design techniques with readily available libraries makes it easy to get started and to create a functionally correct application on a desktop.

The challenge for the developer lies in optimizing the imaging application for an execution target. By leveraging technology from Xilinx® Vivado® HLS, Xilinx’s SDAccel™ development environment makes the use of C++ libraries straightforward for OpenCL™ application developers targeting FPGAs.

SET OF PARALLEL COMPUTATION TASKS
One key characteristic of imaging applications is that they are fundamentally a set of operations on a pixel with respect to a surrounding neighborhood of pixels in space and, for some applications, in time. We therefore can think of an imaging application as a set of parallel computation tasks that a developer can execute on a CPU, GPU or FPGA.

The CPU is always the easiest target device with which to start. The code typically already runs on the CPU before optimization is considered and can leverage the wealth of available libraries. The problem with executing imaging workloads on a CPU is the achievable sustained performance. The overall performance is limited by cache hits/misses and the nontrivial task of parallelization into multiple threads running across CPU cores.


[Diagram: HOST and DEVICE blocks joined by PCIe, with two DDR memory blocks]

Figure 1 — The basic OpenCL platform contains one host and at least one device.

GPUs hold the promise of much higher performance than CPUs for imaging applications because GPU hardware was purposely built for imaging workloads. Until recent years, the drawback of GPUs for general imaging applications had been the programming model. GPU programming differed from that for CPUs, and GPU models were not portable across GPU device families. That changed with the standardization of programming for parallel systems such as GPUs under the OpenCL framework.

FPGAs provide an alternative implementation choice for imaging workloads. Developers can customize the FPGA logic fabric into workload-specific circuits. The flexibility of the FPGA fabric lets an application developer leverage the performance and power consumption benefits of custom logic while avoiding the cost and effort associated with ASIC design.

As it was for the GPU, one barrier for adoption of FPGA devices has been the programming model. Traditionally, FPGAs have been programmed with a register transfer language (RTL) such as Verilog or VHDL. Although those languages can express parallelism, the level of granularity is significantly lower than what is needed to program a CPU or a GPU. As in the case of GPUs, however, adoption of the OpenCL standard to express FPGA programming in a way that is familiar to software application developers has overcome the programming model hurdle.

OPENCL FRAMEWORK
The OpenCL framework provides a common programming model for expressing data parallel programs. The framework, which has evolved into an industry standard, is based on a platform and a memory model that are consistent across all device vendors supporting OpenCL. A device is defined as any hardware, be it a CPU, GPU or FPGA, capable of executing OpenCL kernels.

The platform in an OpenCL application defines the hardware environment in which an application is executed. Figure 1 shows the main elements of an OpenCL platform.

A platform for OpenCL always contains one host, which is typically implemented on a processor. The host is responsible for launching tasks on the device and for explicitly coordinating all data transfers between the host and the device.

In addition to the host, a platform contains at least one device. The device in the OpenCL platform is the hardware element capable of executing OpenCL kernel code. In the context of an OpenCL application, the kernel code is the computationally intensive part of the algorithm that requires acceleration.

In the case of CPU and GPU devices, the kernel code is executed on one or more cores in the device. Each core is exactly the same per the device specification; this stricture forces the application developer to modify the code to maximize performance within a fixed architecture.


For an FPGA, in contrast, the SDAccel development environment generates custom cores per the specific computation requirements of the application kernel. The application developer thus is free to explore implementation architectures based on the needs of the algorithm to reduce overall system latency and power consumption.

The second OpenCL component is the memory model (Figure 2). This model, which is common to all vendors, defines a single memory hierarchy against which a developer can create a portable application.

The main components of the memory model are the host, global, local and private memories. The host memory refers to the memory space that is accessible only to the host processor. The memories visible to the FPGA (the device) are the global, local and private memory spaces. The global memory space is accessible to both the host and the device and is typically implemented in DDR attached to the FPGA. Depending on the FPGA used on the acceleration board, a portion of the global memory can also be implemented inside the FPGA fabric. The local and private memory spaces are visible only to the kernels executing inside the FPGA fabric and are built entirely inside that fabric using block RAM (BRAM) and register resources.

[Diagram labels: Host, Host Memory, PCIe, Global Memory, Constant Global Memory, FPGA, On-chip Global Memory, Kernel A and Kernel B (each with Compute Unit 0 and Compute Unit 1), Local Memory, Private Memory, PE]

Figure 2 — The OpenCL memory model defines a single memory hierarchy for application development.

Let’s see how the SDAccel environment leverages OpenCL and C++ libraries for a stereo imaging block matching application.

STEREO BLOCK MATCHING
Stereo block matching uses images from two cameras to create a representation of the shape of an object in the field of view of the cameras. As Figure 3 shows, the algorithm uses the input images of a left and a right camera to search for the correspondence between the images. Such multi-camera image processing tasks can be applied to depth maps, image segmentation and foreground/background separation. These are, for example, all integral parts of pedestrian detection applications in driver assistance systems.

[Diagram labels: Left camera, Right camera, Viewing ray]

Figure 3 — Conceptual image of multi-camera processing

USING C++ LIBRARIES FOR VIDEO
The SDAccel development environment leverages technology from Xilinx’s Vivado HLS C-to-RTL compiler as part of the core kernel compiler, letting the SDAccel environment use kernels expressed in C and C++ in the same way as kernels expressed in OpenCL C. Application developers thus can use C++ libraries and code previously optimized in Vivado HLS to increase productivity.

The main code for the stereo block matching application is shown below.

void stereobm(
    unsigned short img_data_lr[MAX_HEIGHT*MAX_WIDTH],
    unsigned char img_data_d[MAX_HEIGHT*MAX_WIDTH],
    int rows,
    int cols)
{
#pragma HLS INTERFACE m_axi port=img_data_lr offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=img_data_d offset=slave bundle=gmem1
#pragma HLS INTERFACE s_axilite port=img_data_lr bundle=control
#pragma HLS INTERFACE s_axilite port=img_data_d bundle=control
#pragma HLS INTERFACE s_axilite port=rows bundle=control
#pragma HLS INTERFACE s_axilite port=cols bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

    hls::Mat<MAX_HEIGHT, MAX_WIDTH, HLS_8UC2> img_lr(rows, cols);

    hls::Mat<MAX_HEIGHT, MAX_WIDTH, HLS_8UC1> img_l(rows, cols);
    hls::Mat<MAX_HEIGHT, MAX_WIDTH, HLS_8UC1> img_r(rows, cols);
    hls::Mat<MAX_HEIGHT, MAX_WIDTH, HLS_16SC1> img_disp(rows, cols);
    hls::Mat<MAX_HEIGHT, MAX_WIDTH, HLS_8UC1> img_d(rows, cols);

    hls::StereoBMState<15, 32, 32> state;

#pragma HLS dataflow

    hls::AXIM2Mat<MAX_WIDTH>(img_data_lr, img_lr);
    hls::Split(img_lr, img_l, img_r);
    hls::FindStereoCorrespondenceBM(img_l, img_r, img_disp, state);
    hls::SaveAsGray(img_disp, img_d);
    hls::Mat2AXIM<MAX_WIDTH>(img_d, img_data_d);
}

Vivado HLS provides image processing functions based on the popular OpenCV framework. The functions are written in C++ and have been optimized to provide high performance in an FPGA. When synthesized into an FPGA implementation, the equivalent of anywhere from tens to thousands of RISC processor instructions are executed concurrently every clock cycle.

The code for the application uses Vivado HLS video processing functions to create the application. The application code contains C++ function calls to Vivado HLS libraries as well as pragmas to guide the compilation process. The pragmas are divided into those for interface definition and those for performance optimization.

The interface definition pragmas determine how the stereo block matching accelerator connects to the rest of the system. Since this accelerator is expressed in C++ instead of OpenCL C code, the application programmer must provide interface definition pragmas that match the assumptions of the OpenCL model in the SDAccel environment.

The pragmas marked with m_axi state that the contents of the buffer will be stored in device global memory. The pragmas marked with s_axilite are required for the accelerator to receive the base address of buffers in global memory from the host.

The performance optimization pragma in this code is dataflow. The dataflow pragma yields an accelerator in which different subfunctions can also execute concurrently.

In this accelerator, because of the underlying implementation of the hls::Mat datatype, data is also streamed between each function. This allows the FindStereoCorrespondenceBM function to start operating as soon as the Split function produces pixels, without having to wait for a complete image to be produced. The net result is a more efficient architecture and reduced processing latency relative to sequential processing of each function with full frame buffers in between them.

Imaging applications are a compute-intensive application domain with a rich set of available libraries; the devil is in optimizing the application for the execution target. The SDAccel environment lets developers leverage C++ libraries to accelerate the development of imaging applications for FPGAs programmed in OpenCL.
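For readers who want a feel for the correspondence search inside stereo block matching, here is a minimal sketch under stated assumptions. It uses a plain sum-of-absolute-differences (SAD) search along one scanline, which is a common way to do block matching and is not claimed to be what hls::FindStereoCorrespondenceBM implements. The function name bestDisparity and its parameters are hypothetical.

```cpp
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

// For the pixel at column x of the left scanline, return the disparity d in
// [0, maxDisp] whose window in the right scanline, shifted left by d, has the
// smallest sum of absolute differences. Ties keep the smaller disparity.
int bestDisparity(const std::vector<std::uint8_t>& left,
                  const std::vector<std::uint8_t>& right,
                  int x, int halfWin, int maxDisp) {
    int best = 0;
    long bestCost = std::numeric_limits<long>::max();
    for (int d = 0; d <= maxDisp; ++d) {
        long cost = 0;
        for (int k = -halfWin; k <= halfWin; ++k) {
            const int xl = x + k;      // column in the left image
            const int xr = x + k - d;  // candidate column in the right image
            if (xl < 0 || xl >= static_cast<int>(left.size()) ||
                xr < 0 || xr >= static_cast<int>(right.size())) {
                cost += 255;           // assumption: worst-case penalty off-image
                continue;
            }
            cost += std::abs(static_cast<int>(left[xl]) - static_cast<int>(right[xr]));
        }
        if (cost < bestCost) {
            bestCost = cost;
            best = d;
        }
    }
    return best;
}
```

Every candidate disparity and every window position is independent of the others, which is the kind of regular, parallel work that maps well onto custom logic.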

XCELL SOFTWARE JOURNAL: XCELLENT ALLIANCE

MATLAB and Simulink

Aid HW-SW Co-design
of Zynq SoCs
by Eric Cigan and Noam Levine
FPGA/SoC Technical Marketing
MathWorks


Model-Based Design workflow lets engineers make design trade-offs at the desktop rather than the lab.

The introduction of the Xilinx® Zynq®-7000 All Programmable SoC family in 2011 brought groundbreaking innovation to the FPGA industry. These devices, with their combination of dual-core ARM® Cortex™-A9 MPCore™ processors and ample programmable logic, offered advantages for a wealth of applications. By adopting Zynq SoCs, designers could reap the benefits of software application development on one of the industry’s most popular processors while gaining the flexibility and throughput potential provided via hardware acceleration on a high-speed, programmable logic fabric.

Using MATLAB® and Simulink® from MathWorks®, innovators today can leverage a highly integrated hardware-software workflow to create highly optimized systems. The case study presented here illustrates this model-based workflow.

When Xilinx released the first Zynq SoC in December 2011, designers seized on the idea that they could migrate their legacy, multichip solutions, built from discrete processors and FPGAs, to a single-chip platform. They could create FPGA-based accelerators on the new platform to unclog software execution bottlenecks and tap into an array of off-the-shelf, production-ready intellectual property from Xilinx and its IP partners that would address applications in digital signal processing, networking, communications and more.

The open question was how they would program the new devices. Designers imagining the potential of hardware-software co-design sought integrated workflows that would intelligently partition designs between ARM processors and programmable logic. What they found, however, were distinct hardware and software workflows: conventional embedded software development flows targeting ARM cores, alongside a combination of IP assembly, traditional RTL and emerging high-level synthesis tools for programmable logic.

INTEGRATED WORKFLOW
In September 2013, MathWorks introduced a hardware-software workflow for Zynq-7000 SoCs using Model-Based Design. In this workflow (Figure 1), designers could create models in Simulink that would represent a complete dynamic system (including a Simulink model for algorithms targeted for the Zynq SoC) and rapidly create hardware-software implementations for Zynq SoCs directly from the algorithm.

System designers and algorithm developers used simulation in Simulink to create models for a complete system (communications, electromechanical components and so forth) in order to evaluate design concepts, make high-level trade-offs, and partition algorithms into software and hardware elements. HDL code generation from Simulink enabled the creation of IP cores and high-speed I/O processing on the Zynq SoC fabric. C/C++ code generation from Simulink enabled programming of the Zynq SoC’s Cortex-A9 cores, supporting rapid embedded software iteration.

The approach enabled automatic generation of the AMBA® AXI4 interfaces linking the ARM processing system and programmable logic with support for the Zynq SoC. Integration with downstream tasks (such as C/C++ compilation and building of the executable for the ARM processing system, bitstream generation using Xilinx implementation tools, and downloading to Zynq development boards) allowed for a rapid prototyping workflow.


[Diagram: Model-Based Design for Zynq-7000 All Programmable SoCs. Labels: RESEARCH, REQUIREMENTS, DESIGN (System Modeling, Software/Hardware Partitioning, Algorithms for ARM core, Algorithms for programmable fabric), Verification, C code generation, HDL code generation, IMPLEMENTATION (Build Executable, IP Core Generation, ARM Cortex-A9, Programmable Logic), Hardware/software design iterations, INTEGRATION (Zynq-7000 SoC development boards)]

Figure 1 — Designers can create models in Simulink that represent a complete dynamic system and create hardware-software implementations for Zynq SoCs directly from the model.

Central to this workflow are two technologies: Embedded Coder® and HDL Coder™. Embedded Coder generates production-quality C and C++ code from MATLAB, Simulink and Stateflow®, with target-specific optimizations for embedded systems. Embedded Coder has become so widely adopted that when you drive a modern passenger car, take a high-speed train or fly on a commercial airline, there’s a high probability that Embedded Coder generated the real-time code guiding the vehicle. HDL Coder is the counterpart to Embedded Coder, generating VHDL or Verilog for FPGAs and ASICs, and is integrated tightly into Xilinx workflows. This mature C and HDL code generation technology forms the foundation of the Model-Based Design workflow for programmable SoCs.

Design teams using Model-Based Design in applications such as communications, image processing, smart power and motor control have adopted this workflow as the means for algorithm developers to work closely with hardware designers and embedded software developers to accelerate the implementation of algorithms on programmable SoCs. Once the generated HDL and C code is prototyped in hardware, the design team can use Xilinx Vivado® IP Integrator to integrate the code with other design components needed for production.

CASE STUDY: THREE-PHASE MOTOR CONTROL
For several reasons, custom motor controllers with efficient power conversion are one of the most popular applications to have emerged for programmable SoCs. Higher-performance, higher-efficiency initiatives are one factor. With electric motor-driven systems accounting for as much as 46 percent of global electricity consumption, attaining higher efficiency with novel control algorithms is an increasingly common motor drive design goal. Xilinx Zynq programmable logic enables precise timing, providing an ideal platform for implementing low-latency, high-efficiency drives.

Another driver is multi-axis control. Ample programmable logic and DSP resources on programmable SoCs open up possibilities for implementing multiple motor controllers on a single programmable SoC, whether motors will operate independently or in combination, as in an integrated motion control system.

Integration of industrial networking IP is a further factor. Xilinx and its IP partners offer IP for integration with EtherCAT, PROFINET and other industrial networking protocols that can be readily incorporated into programmable SoCs.


To illustrate the use of this workflow on a common motor control example, consider the case of a field-oriented control algorithm for a three-phase electric motor implemented on a Zynq-7020 SoC (details of this hardware prototyping platform are available at http://www.mathworks.com/zidk). The motor control system model includes two primary subsystems (Figure 2): a motor controller targeting the Zynq SoC that has been partitioned between the Zynq processing system and programmable logic, and a motor controller FPGA mezzanine card (FMC) connected to a brushless DC motor equipped with an encoder to measure shaft angle.

Figure 2 — The motor control system model includes two primary subsystems. The diagram’s legend distinguishes C code (from model), HDL code (from model) and hand-coded HDL; the Mode Select block lists open-loop, calibrate-encoder, closed-loop and standby modes.

We can look at hardware-software partitioning in terms of data flow:

• We assign the Velocity Control and Mode Select blocks to the ARM Cortex-A9 processing system because those blocks can run at a slower rate than other parts of the model and because they are the portions of the design most likely to be modified and recompiled during development.
• A Mode Select state machine running on the ARM core determines the motor controller operating mode (for example, open-loop operation or closed-loop regulation). This state machine manages the transitions between the start-up, open-loop control and encoder calibration modes before switching to closed-loop control mode.
• The encoder sensor signal is passed via an external port to an Encoder Peripheral in the programmable logic and then to a Position/Velocity Estimate block that computes the motor’s state (shaft position and velocity).
• A sigma-delta analog-to-digital converter (ADC) senses the motor current, and a hand-coded ADC Peripheral block processes the current.
• The Current Controller takes the motor state and current, as well as the operating mode and velocity control commands passed from the ARM core over the AXI4 interface, and computes the current controller command. When in its closed-loop mode, the Current Controller uses a proportional-integral (PI) control law, whose gains can be tuned using simulation and prototyping.
• The current controller command goes through the Voltage Conversion block and is output to the motor control FMC via the PWM Peripheral, ultimately driving the motor.

Designers can model the complete system in Simulink (Figure 3).

In Model-Based Design, the system increases to four components in the top-level Simulink model:

• an input model, which provides a commanded shaft velocity and on/off commands to the controller as stimulus;
• a model of the motor control algorithm that will be targeted for the Zynq SoC;
• a plant model, which includes the drive electronics of the FMC, a permanent-magnet synchronous machine (PMSM) model of the brushless DC motor, a model of an inertial load on the motor shaft and an encoder sensor model; and
• an output-verification model, which includes post-processing and graphics to help the algorithm developer refine and validate the model.

Figure 3 — This control-loop model for a motor control system with simulation results shows the response to a velocity pulse command.

In Simulink, we can test out the algorithm with simulation long before we start hardware testing. We can tune the PI controller gains, try various stimulus profiles and examine the effect of different processing rates. As we use simulation, though, we face a fundamental issue: Because of the disparate processing rates typical of motor control (overall mechanical response rates of 1 to 10 Hz, core controller algorithm rates of 1 to 25 kHz and programmable logic operating at 10 to 50 MHz or more), simulation times can run to many minutes or even hours. We can head off this issue with a control-loop model that uses behavioral models for the peripherals (the PWM, current sensing and encoder processing), producing the time response shown in Figure 3.

After we use the control-loop model to tune the controller, our next step is to prove out the controller in simulation using high-fidelity models that include the peripherals. We do this by incorporating timing-accurate specification models for the C and HDL components of the controller. These specification models have the necessary semantics for C and HDL code generation. With simulation, we then verify that the system with specification models tracks extremely closely to the control-loop model.

Once performance has been validated with the high-fidelity models, we move on to prototyping the controller in hardware. Following the workflow shown in Figure 1, we start by generating the IP core. The IP core generation workflow lets us choose the target development board and walks us through the process of mapping the core’s input and output ports to target interfaces, including the AXI4 interface and external ports.

Through integration with the Vivado Design Suite, the workflow builds the bitstream and programs the fabric of the Zynq-7020 SoC.

With the IP core now loaded onto the target device, the next step is to generate embedded C code from the Simulink model targeting the ARM core. The process of generating C code, compiling it and building the executable with embedded Linux is fully automated, and the prototype is then ready to run.

To run the prototype hardware and verify that it gives us results consistent with our simulation models, we build a modified Simulink model (Figure 4a) that will serve as a high-level control panel. In this model, we removed the simulation model for the plant (that is, the drive electronics, motor, load and sensor) and replaced it with I/Os to the ZedBoard.

Using this model in a Simulink session, we can turn on the motor, choose different stimulus profiles, monitor relevant signals and acquire data for later post-processing in MATLAB, but for now we can repeat the pulse test (Figure 3).

Figure 4(a) — Simulink model for testing prototype hardware

Figure 4b shows the results of the shaft rotational velocity and the phase current for the hardware prototype compared with the simulation results. The startup sequence for the hardware prototype differs noticeably from those for the two simulation models. This is to be expected, however, because the initial angle between the motor’s rotor and stator in the hardware test differs from the initial angle used in simulation, resulting in a different response as the current control algorithm drives the motor through its encoder calibration mode. When the pulse is applied at 2 seconds, the results from simulation and prototype hardware match almost exactly.

Figure 4(b) — Comparison of results from hardware prototype and simulation. Panels: measured velocity (rad/sec) and phase currents (amps) versus time (sec), each plotted for the prototype hardware, the system simulation and the control-loop simulation. Prototype’s startup response (in green) differs from those from simulation (red, purple) because of a difference in the shaft angle at t=0.

Based on these results, we could continue with further testing under different loading and operating conditions, or we could move on to performing further C and HDL optimizations.

Engineers are turning to Model-Based Design workflows to enable hardware-software implementation of algorithms on Xilinx Zynq SoCs. Simulink simulation provides early evaluation of algorithms, letting designers evaluate the algorithms’ effectiveness and make design trade-offs at the desktop rather than in the lab, with a resultant increase in productivity. Proven C and HDL code generation technology, along with hardware support for Xilinx All Programmable SoCs, provides a rapid and repeatable process for getting algorithms running on real hardware. Continuous verification between the simulation and hardware environments lets designers identify and resolve issues early in the development process.

Workflow support for Zynq-based development boards, software-defined radio kits and motor control kits is available from MathWorks. To learn more about this workflow, visit http://www.mathworks.com/zynq.

MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See http://www.mathworks.com/trademarks for a list of additional trademarks. Other product or brand names may be trademarks or registered trademarks of their respective holders.
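
As a rough illustration of the proportional-integral (PI) control law that the article's Current Controller uses in closed-loop mode, the following Python sketch implements a discrete-time PI law and exercises it against a simple series resistance-inductance load. Everything specific here is an assumption made for the example: the `PIController` class, the gains, the 25 kHz sample rate, the 24 V limit and the load values are hypothetical and are not taken from the MathWorks model or from the generated C or HDL code.

```python
from dataclasses import dataclass


@dataclass
class PIController:
    """Discrete-time PI law with output clamping (hypothetical helper)."""
    kp: float            # proportional gain (assumed value)
    ki: float            # integral gain (assumed value)
    dt: float            # sample period in seconds
    limit: float         # symmetric output limit, e.g. available bus voltage
    integral: float = 0.0

    def step(self, reference: float, measured: float) -> float:
        error = reference - measured
        candidate = self.integral + self.ki * error * self.dt
        output = self.kp * error + candidate
        # Simple anti-windup: keep the new integral only when not saturated.
        if abs(output) <= self.limit:
            self.integral = candidate
            return output
        return max(-self.limit, min(self.limit, output))


def simulate_current_loop(steps: int = 2000) -> float:
    """Drive a series R-L load toward a 1 A reference; return final current."""
    resistance, inductance = 0.5, 1e-3       # ohms, henries (assumed)
    dt = 1.0 / 25_000                        # 25 kHz controller rate (assumed)
    controller = PIController(kp=5.0, ki=2500.0, dt=dt, limit=24.0)
    current = 0.0
    for _ in range(steps):
        voltage = controller.step(1.0, current)
        # Forward-Euler update of L * di/dt = v - R * i
        current += dt * (voltage - resistance * current) / inductance
    return current


if __name__ == "__main__":
    print(f"final current: {simulate_current_loop():.4f} A")
```

Changing `kp` and `ki` in a loop like this is a small-scale analogue of the gain tuning the article performs in simulation before moving to hardware.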
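
The Mode Select state machine is described in the article only in outline: it moves the controller through start-up, open-loop control and encoder calibration before switching to closed-loop control. A minimal Python sketch of that sequencing might look like the following. The `Mode` enum, the `next_mode` function and the `stage_done` flag are hypothetical names introduced for illustration, and the transition conditions are assumptions; the actual Stateflow logic in the model may differ.

```python
from enum import Enum, auto


class Mode(Enum):
    STANDBY = auto()
    OPEN_LOOP = auto()
    CALIBRATE_ENCODER = auto()
    CLOSED_LOOP = auto()


def next_mode(mode: Mode, motor_on: bool, stage_done: bool) -> Mode:
    """Return the mode for the next controller step.

    motor_on   -- on/off command supplied as stimulus
    stage_done -- assumed flag: the current start-up stage has finished
    """
    if not motor_on:
        return Mode.STANDBY                  # an off command always wins
    if mode is Mode.STANDBY:
        return Mode.OPEN_LOOP                # begin the start-up sequence
    if mode is Mode.OPEN_LOOP and stage_done:
        return Mode.CALIBRATE_ENCODER
    if mode is Mode.CALIBRATE_ENCODER and stage_done:
        return Mode.CLOSED_LOOP
    return mode                              # otherwise hold the current mode


if __name__ == "__main__":
    mode = Mode.STANDBY
    for motor_on, stage_done in [(True, False), (True, True), (True, True), (False, False)]:
        mode = next_mode(mode, motor_on, stage_done)
        print(mode.name)
```

Keeping this logic on the ARM core, as the article does, fits the stated reasoning that these blocks run at a slower rate and are the ones most likely to be modified and recompiled.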
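
The disparity in processing rates that the article cites as the reason full-fidelity simulation is slow can be made concrete with some arithmetic. The sketch below takes the upper ends of the quoted ranges (10 Hz mechanical response, 25 kHz controller, 50 MHz programmable logic) and a 7-second run, matching the time axis of the pulse-test plot. The function name `rate_budget` and the choice of these particular endpoints are assumptions for illustration only.

```python
def rate_budget(logic_hz: float, controller_hz: float,
                mechanical_hz: float, run_seconds: float) -> dict:
    """Count how many steps each rate domain needs for one simulated run."""
    return {
        "logic_cycles_per_controller_step": logic_hz / controller_hz,
        "controller_steps_per_mechanical_period": controller_hz / mechanical_hz,
        "logic_cycles_in_run": logic_hz * run_seconds,
        "controller_steps_in_run": controller_hz * run_seconds,
    }


if __name__ == "__main__":
    budget = rate_budget(logic_hz=50e6, controller_hz=25e3,
                         mechanical_hz=10.0, run_seconds=7.0)
    for name, value in budget.items():
        print(f"{name}: {value:,.0f}")
```

With these assumed figures, a simulation that resolves every logic clock cycle has to step through 350 million cycles for the 7-second run, against 175,000 steps at the controller rate, which gives a feel for why behavioral models of the peripherals shorten simulation time.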

XCELL SOFTWARE JOURNAL: XTRA, XTRA

Xtra, Xtra

Xilinx® is constantly refining its software and updating its training and resources to help software developers design innovations with the Xilinx SDx™ development environments and related FPGA and SoC hardware platforms. Here is a list of additional resources and reading. Check for the newest quarterly updates in each issue.

SDSOC™ DEVELOPMENT ENVIRONMENT
The SDSoC environment provides a familiar embedded C/C++ application development experience, including an easy-to-use Eclipse IDE and a comprehensive design environment for heterogeneous Xilinx All Programmable SoC and MPSoC deployment. Complete with the industry’s first C/C++ full-system optimizing compiler, SDSoC delivers system-level profiling, automated software acceleration in programmable logic, automated system connectivity generation and libraries to speed programming. It lets end-user and third-party platform developers rapidly define, integrate and verify system-level solutions and enable their end customers with a customized programming environment.

• SDSoC Backgrounder (PDF)
• SDSoC User Guide (PDF)
• SDSoC User Guide: Getting Started (PDF)
• SDSoC User Guide: Platforms and Libraries (PDF)
• SDSoC Release Notes (PDF)
• Boards, Kits and Modules
• SDSoC Video Demo
• Buy/Download

SDACCEL™ DEVELOPMENT ENVIRONMENT
The SDAccel environment for OpenCL™, C and C++ enables up to 25x better performance/watt for data center application acceleration leveraging FPGAs. A member of the SDx family, the SDAccel environment combines the industry’s first architecturally optimizing compiler supporting any combination of OpenCL, C and C++ kernels, along with libraries, development boards, and the first complete CPU/GPU-like development and run-time experience for FPGAs.

• SDAccel Backgrounder
• SDAccel Development Environment: User Guide
• SDAccel Development Environment: Tutorial
• Xilinx Training: SDAccel Video Tutorials
• Boards and Kits
• SDAccel Demo

SDNET™ DEVELOPMENT ENVIRONMENT
The SDNet environment, in conjunction with Xilinx All Programmable FPGAs and SoCs, lets network engineers define line card architectures, design line cards and update them with a C-like environment. It enables the creation of “Softly” Defined Networks, a technology dislocation that goes well beyond today’s Software Defined Networking (SDN) architectures.

• SDNet Backgrounder — Xilinx
• SDNet Backgrounder — The Linley Group
• SDNet Demo

SOFTWARE DEVELOPMENT KIT (SDK)
The SDK is Xilinx’s development environment for creating embedded applications on any of its microprocessors for Zynq®-7000 All Programmable SoCs and the MicroBlaze™ soft processor. The SDK is the first application IDE to deliver true homogeneous- and heterogeneous-multiprocessor design and debug.

• Free SDK Evaluation and Download

