Xcell Software1
A PROGRAMMABLE WORLD
ISSUE 1
THIRD QUARTER 2015
SOFTWARE
The Next Logical Step
in C/C++, OpenCL
Programming
System-level HW-SW
Optimization on the Zynq SoC
www.xilinx.com/xcell
Lifecycle Technology
Quick time-to-market demands are forcing you to rethink how you design, build and deploy your
products. Sometimes it’s faster, less costly and lower risk to incorporate an off-the-shelf solution
instead of designing from the beginning. Avnet’s system-on module and motherboard solutions for
the Xilinx Zynq®-7000 All Programmable SoC can reduce development times by more than four
months, allowing you to focus your efforts on adding differentiating features and unique capabilities.
SOFTWARE
Earlier this year, Xilinx® released its SDx™ line of development environments, which enable non-FPGA experts to program Xilinx device logic using C/C++ and OpenCL™. The goal of the SDx environments is to let software developers and embedded system architects program our devices as easily as they program GPUs or CPUs. Xilinx’s FPGA hardware technology has long been able to accelerate algorithms, but it was only recently that the underlying software technology and hardware platforms reached a level where it was feasible to create these development environments for the broader software- and system-architect communities of C/C++ and OpenCL users.

The hardware platforms have evolved rapidly during this millennium. In the early 2000s, the semiconductor industry changed the game on software developers. To avoid a future in which chips reached the energy density of the sun, MPU vendors switched from monolithic MPUs to homogeneous multicore, distributed processing architectures. This switch enabled the semiconductor industry to continue to introduce successive generations of devices in cadence with Moore’s Law and even to innovate heterogeneous multicore processing systems, which we know today as systems-on-chip (SoCs). But the move to multicore has placed a heavy burden on software developers to design software that runs efficiently on these new distributed processing architectures. Xilinx has stepped in to help software developers by introducing its SDx line of development environments. The environments let developers dramatically speed their C/C++ and OpenCL code running on systems powered by next-generation processing architectures, which today are increasingly accelerated by FPGAs.

Indeed, FPGA-accelerated processing architectures, pairing MPUs with FPGAs, are fast replacing power-hungry CPU/GPU architectures in data center and other compute-intensive markets. Likewise, in the embedded systems space, new heterogeneous multicore processors such as Xilinx’s Zynq®-7000 All Programmable SoC and upcoming Xilinx UltraScale+™ MPSoC integrate multiple processors with FPGA logic on the same chip, enabling companies to create next-generation systems with unmatched performance and differentiation. FPGAs have traditionally been squarely in the domain of hardware engineers, but no longer.

Now that Xilinx has released its SDx line of development environments to use on its hardware platforms, the software world has the ability to unlock the acceleration power of the FPGA using C/C++ or OpenCL within environments that should be familiar to embedded-software and -system developers. This convergence of strong underlying compilation technology for our high-level synthesis (HLS) tool flow with programming languages and tools designed for heterogeneous architectures brings the final pieces together for software and system designers to create custom hardware accelerators in their own heterogeneous SoCs.

Xcell Software Journal is dedicated to helping you leverage the SDx environments and those from Xilinx Alliance members such as National Instruments and MathWorks®. The quarterly journal will focus on software trends, case studies, how-to tutorials, updates and outlooks for this rapidly growing user base. I’m confident that as you read the articles you will be inspired to explore Xilinx’s resources further, testing out the SDx development environments accessible through the Xilinx Software Developer Zone. I encourage you to read the Xcell Daily Blog, especially Adam Taylor’s chronicles of using the SDSoC development environment. And I invite you to contribute articles to the new journal to share your experiences with your colleagues in the vanguard of programming FPGA-accelerated systems.

Mike Santarini
Publisher

There is a new bass player for the blues jam in the sky . . .
This issue is dedicated to analyst and ESL visionary Gary Smith, 1941 – 2015.

PUBLISHER Mike Santarini
[email protected]
1-408-626-5981
EDITOR Diana Scheben
ART DIRECTOR Scott Blair
DESIGN/PRODUCTION Teie, Gelwicks & Assoc.
1-408-842-2627
ADVERTISING SALES Judy Gelwicks
1-408-842-2627
[email protected]
INTERNATIONAL Melissa Zhang, Asia Pacific
[email protected]
Christelle Moraga, Europe/Middle East/Africa
[email protected]
Tomoko Suto, Japan
[email protected]
REPRINT ORDERS 1-408-842-2627
EDITORIAL ADVISERS Tomas Evensen
Lawrence Getman
Mark Jensen

Xilinx, Inc.
2100 Logic Drive
San Jose, CA 95124-3400
Phone: 408-559-7778
FAX: 408-879-4780
www.xilinx.com/xcell/

© 2015 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

The articles, information, and other materials included in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xilinx for any loss, damage, or expense caused thereby.
CONTENTS
THIRD QUARTER 2015, ISSUE 1

VIEWPOINT
Letter from the Publisher: Welcome to Xcell Software Journal

COVER STORY
The Next Logical Step in C/C++, OpenCL Programming

XCELLENCE WITH SDSOC FOR EMBEDDED DEVELOPMENT
SDSoC, Step by Step: Build a Sample Design

XTRA READING
IDE Updates and Extra Resources for Developers
XCELL SOFTWARE JOURNAL: COVER STORY
New environments
allow you to maximize
code performance.
by Mike Santarini
Publisher, Xcell Publications
Xilinx, Inc.
[email protected]
Lawrence Getman
Vice President, Corporate Strategy and Marketing
Xilinx, Inc.
[email protected]
a few other odds and ends, making MPUs relatively straightforward platforms on which to develop next-generation apps. For three decades leading up to that point, every 22 months—in step with Moore’s Law—microprocessor vendors would introduce devices with greater capacity and higher performance. To increase the performance, they would simply crank up the clock rate. The fastest monolithic MPU of the time, Intel’s Pentium 4, topped out at just under 4 GHz. For developers, this evolution was great; with every generation, their programs could become more intricate and perform more elaborate functions, and their programs would run faster.

But in the early 2000s, the semiconductor industry changed the game, forcing developers to adjust to a new set of rules. The shift started with the realization that if the MPU industry continued to crank up the clock on each new monolithic MPU architecture, given the silicon process technology road map and worsening transistor leakage, MPUs would soon have the same power density as the sun.

It was for this reason that the MPU industry quickly transitioned to a homogeneous multiprocessing architecture, in which computing was distributed to multiple smaller cores running at lower clock rates. The new processing model let MPU and semiconductor vendors continue to produce new generations of higher-capacity devices and reap more performance mainly from integrating more functions together in a single piece of silicon. Existing programs could not take advantage of the new distributed architectures, however, leaving software developers to figure out ways to develop programs that would run efficiently across multiple processor cores.
Figure 2 — The SDAccel development environment for OpenCL, C and C++ enables up to 25x better
performance/watt for data-center-application acceleration leveraging FPGAs.
coding templates and software libraries, and it enables compiling, debugging and profiling against the full range of development targets, including emulation on the x86, performance validation using fast simulation, and native execution on FPGA processors. The environment executes the application on data-center-ready FPGA platforms complete with automatic instrumentation insertion for all supported development targets. Xilinx designed the SDAccel environment to enable CPU and GPU developers to migrate their applications to FPGAs easily while maintaining and reusing their OpenCL™, C and C++ code in a familiar workflow.

SDAccel libraries contribute substantially to the SDAccel environment’s CPU/GPU-like development experience. They include low-level math libraries and higher-productivity ones such as BLAS, OpenCV and DSP libraries. The libraries are written in C++ (as opposed to RTL) so developers can use them exactly as written during all development and debugging phases. Early in a project, all development will be done on the CPU host. Because the SDAccel libraries are written in C++, they can simply be compiled along with the application code for a CPU target—creating a virtual prototype—which permits all testing, debugging and initial profiling to occur initially on the host. During this phase, no FPGA is needed.

SDSOC FOR EMBEDDED DEVELOPMENT OF ZYNQ SOC- AND MPSOC-BASED SYSTEMS
Xilinx designed the SDSoC development environment for embedded systems developers programming the Xilinx Zynq SoCs and soon-to-arrive Zynq UltraScale+ MPSoCs. The SDSoC environment provides a greatly simplified embedded C/C++ application
[Figure 3 panel labels: C/C++ development; rapid system-level performance estimation; system-level profiling; SoC and MPSoC targets]
Figure 3— The SDSoC development environment provides a familiar embedded C/C++ application
development experience, including an easy-to-use Eclipse IDE and a comprehensive design environment
for heterogeneous Zynq All Programmable SoC and MPSoC deployment.
[Figure labels: SDNet compiler; LogiCORE, SmartCORE and custom cores; SW function implementation; engineer; HW/SW implementation; FPGA or SoC]
of per-flow and flexible services, and support for revolutionary in-service “hitless” upgrades while operating at 100 percent line rates.

These unique capabilities enable carriers and multiservice system operators (MSOs) to provision differentiated services dynamically without any interruption to the existing service or the need for hardware requalification or truck rolls. The environment’s dynamic service provisioning enables service providers to increase revenue and speed time to market while lowering capex and opex. Network equipment providers realize similar benefits from the Softly Defined Network platform, which allows for extensive differentiation through the deployment of content-aware data plane hardware that is programmed with the SDNet environment.
XCELL SOFTWARE JOURNAL: XCELLENCE WITH SDSOC
SDSoC,
Step by Step:
Build a Sample
Design
by Adam Taylor
Chief Engineer
e2v
[email protected]
Until the release of the Xilinx® SDSoC™ development environment, the standard SoC design methodology involved a mix of disparate engineering skills. Typically, once the system architect had generated a system architecture and subsystem segmentation from the requirement, the solution would be split between functions implemented in hardware (the logic side) and functions implemented in software (the processor side). FPGA and software engineers would separately develop their respective functions and then combine and test them in accordance with the integration test plan. The approach worked for years, but the advent of more-capable SoCs, such as the Xilinx Zynq®-7000 All Programmable SoC and the upcoming Xilinx Zynq UltraScale+™ MPSoC, mandated a new design methodology.

The SDSoC methodology enables a wider user base of engineers to develop extremely high-performing systems. Engineers new to developing in the SDSoC development environment will discover that it’s easy to get a system up and running quickly and just as easy to optimize it.

A simple, representative example will illustrate how to accomplish those tasks and reap the resultant benefits. We will target a ZedBoard running Linux and using one of the built-in examples: the Matrix Multiplier and Addition Template.

A BRIEF HISTORY OF DESIGN METHODOLOGIES
The programmable logic device segment has been fast-moving since the devices’ introduction in the 1980s. At first engineers programmed the devices via schematic entry (although the earlier PLDs, such as the 22v10, were programmed via logic equations). This required that electronics engineers perform most PLD development, as logic design and optimization are typically the EE degree’s domain. As device size and capability increased, however, schematic entry naturally began to hit boundaries, as both design time and verification time rose in tandem with design complexity. Engineers needed the capability to work at a higher level of abstraction.

Enter VHDL and Verilog. Both started as languages to describe and simulate logic designs, particularly ASICs. VHDL even had its own military standard. It is a logical step that if we are describing logic behavior within a hardware description language (HDL), it would be great to synthesize the logic circuits required. The development of synthesis tools let engineers describe logic behavior typically at a register transfer level. HDLs also provided a significant boost in verification approach, allowing the development of behavioral test benches that enabled structured verification. For the first time, HDLs also enabled modularity and vendor independence.

Again, the inherent concurrency of HDLs, the register transfer level design approach and the implementation flow, which required knowledge of optimization and timing closures, ensured that the PLD development task would largely fall to EEs.
HDLs have long been the de facto standard for PLD development but have evolved over the years to take industry needs into account. VHDL alone underwent revisions in 1987 (the first year of IEEE adoption), 1993, 2000, 2002, 2007 and 2008. As happened with schematic design entry, however, HDLs are hitting up against the buffers of increases in development time, verification time and device capability.

As the PLD’s role has expanded from glue logic to acceleration peripheral and ultimately to the heart of the system, the industry has needed a new design methodology to capitalize on that evolution. In recent years, high-level synthesis (HLS) has become increasingly popular; here, the design is entered in C/C++ (using Xilinx’s Vivado® HLS) or tools such as MathWorks®’ MATLAB® or National Instruments’ LabVIEW. Such approaches begin to move the design and implementation out from the EE domain into the software realm, markedly widening the user base of potential PLD designers and cementing the PLD’s place at the heart of the system as new design methodologies unlock the devices’ capabilities.

It is therefore only natural that SoC-based designs would use HLS to generate tightly integrated development environments in which engineers could seamlessly accelerate functions in the logic side of the design. Enter the SDSoC environment.

FAMILIAR ENVIRONMENT
The SDSoC development environment is based on Eclipse, which should be familiar to most software developers (Figure 1). The environment seamlessly enables acceleration of functions within the PL side of the device. It achieves this by using the new SDSoC compiler, which can handle C or C++ programs.

Figure 1 — SDSoC Welcome page

The development cycle at the highest abstraction level used in the SDSoC environment is as follows:
1. We develop our application in C or C++.
2. We profile the application to determine the performance bottlenecks.
3. Using the profiling information, we identify functions to accelerate within the PL side of the device.
4. We can then build the system and generate the SD card image.
5. Once the hardware is on the board, we can analyze the performance further and optimize the acceleration functions as required.

We can develop applications in the SDSoC environment that function variously on bare metal, FreeRTOS or Linux operating systems. The environment comes with built-in support for most of the Zynq SoC development boards, including the ZedBoard, the MicroZed and the Digilent ZYBO Zynq SoC development board. Not only can we develop our applications faster as a result, but we can use this capability to define our own underlying hardware platform for use when our custom hardware platform is ready for integration.

When we compile a program within the SDSoC environment, the output of the build process provides the suite of files required to configure the Zynq SoC from an SD card. This suite includes first- and second-stage boot loaders, along with the application and images as required for the operating system.
SDSOC EXAMPLE
Let’s look at how the SDSoC environment works and see
how quickly we can get an example up and running. We
will target a ZedBoard running Linux and using the built-
in Matrix Multiplier and Addition Template.
The first task, as always, is to create a project. We can
do so either from the Welcome screen (Figure 1) or by
selecting File -> New -> SDSoC project from the menu.
Selecting either option will open a dialog box that will let
us name the project and select the board and the operat-
ing system (Figure 2).
This will create a project under the Project Explorer
on the left-hand side of the SDSoC GUI. Under this proj-
ect, we will see the following folders, each with its own,
graphically unique symbol:
• SDSoC Hardware Functions: Here we will see the func-
tions we have moved into the hardware. Initially, as we
have yet to move functions, this folder will be empty.
• Includes: Expanding this folder will show all of the
C/C++ header files used in the build.
• src: This will contain the source code for the demonstration.
To ensure that we have everything correctly con-
figured not only with our SDSoC installation and
environment, but also with our development board,
we will build the demo so that it will run on only the
on-chip processing system (PS) side of the device.
Of course, the next step is to build the project. With the
project selected on the menu, we choose Project->Build
Project. It should not take too long to build, and when we
are done we will see folders as shown in Figure 3 appear
under our project within the Project Explorer. In addi-
tion to the folders described above, we will have:
• Binaries: Here we will find the Executable and Linkable Format (ELF) files created from the software compilation process.
• Archives: The object files that are linked to create the binaries reside here.
• SDRelease: This contains our boot files and reports.
Figure 2 — Creating the project
With the first demo built such that it will run only on the
Zynq SoC’s PS, let’s explore how we know it is working
as desired. Recall that SDSoC acceleration works by pro-
filing the application; the engineer then uses the profiled
information to determine which functions to move.
Figure 3 — Project Explorer view when built
We achieve profiling at the basic level by using a provided library called sds_lib.h. This provides a basic timestamp API, based on the 64-bit global counter, that lets us measure how long each function takes. With the API, we simply record the function start and stop times, and the difference constitutes the process execution time.
The source code contains two versions of the algo-
rithm for matrix multiply and add. The so-called golden
version is not intended for offloading to the on-chip pro-
grammable logic (PL); the other version is. By building
and running these just within the PS, we can ensure that
we are comparing eggs with eggs and that both process-
es take roughly the same time to execute.
Figure 4 — Execution time of both functions in the PS
With the build complete, we can copy all of the files in
the SDRelease -> sd_card folder under the Project Explor-
er onto our SD card and insert the card into the ZedBoard
(with the mode pins correctly set for SD card configuration).
With a terminal program connected, once the boot sequence
has been completed we need to run the program. We type
/mnt/mult_add.elf (where mult_add is the name of the proj-
ect we have created). When I ran this on my ZedBoard, I
got the result shown in Figure 4, which demonstrates that
the two functions take roughly the same time to execute.
Having confirmed the similar execution times, we
will move the multiply function into the PL side of the
SoC. This is simple to achieve.
Looking at the file structure within the src directory of the example, we will see:
• main.cpp, which contains the main function, golden calculation, timestamping, and calls to the mult and add functions used in the hardware side of the device;
• mmult.cpp, which contains the multiplication function to be offloaded into the hardware; and
• madd.cpp, which contains the addition function to be offloaded into the hardware.
Figure 5 — Moving the multiplier kernel to the PL side using the Project Explorer
Nick Ni
Product Manager, SDSoC
Development Environment
Xilinx, Inc.
[email protected]
A Cholesky matrix
The Xilinx® Zynq®-7000 All Programmable SoC family represents a new dimension
[Figure: Zynq-7000 block diagram. Processing system labels: static memory controller (Quad-SPI, NAND, NOR); dynamic memory controller (DDR3, DDR2, LPDDR2); AMBA switches; ARM CoreSight multicore and trace debug; two Cortex-A9 MPCore processors, each with a NEON/FPU engine; I/O peripherals (2x SPI, 2x I2C, 2x CAN, 2x UART, 2x GigE with DMA). Programmable logic labels: system gates, DSP, RAM; S_AXI_HP0 to S_AXI_HP3; multistandard I/Os (3.3V and high-speed 1.8V); multi-gigabit transceivers.]
The PS and PL are tightly coupled via interconnects compliant with the ARM® AMBA® AXI4 interface. Four high-performance (HP) AXI4 interface ports connect the PL to asynchronous FIFO interface (AFI) blocks in the PS, thereby providing a high-throughput data path between the PL and the PS memory system (DDR and on-chip memory). The AXI4 Accelerator Coherency Port (ACP) allows low-latency cache-coherent access to L1 and L2 cache directly from the PL masters. The General Purpose (GP) port comprises low-performance, general-purpose ports accessible from both the PS and PL.

In the traditional, hardware-design-centric flow, using Xilinx’s Vivado® Design Suite, designing an embedded system on the Zynq SoC requires roughly four steps:
1. A system architect decides a hardware-software partitioning scheme. Computationally intensive algorithms are the ideal candidates for hardware. Profiling results are used as the basis for identifying performance bottlenecks and running trade-off studies between data movement costs and acceleration benefits.
2. Hardware engineers take functions partitioned to hardware and convert/design them into intellectual-property (IP) cores—for example, in VHDL or Verilog using Vivado, in C/C++ using Vivado High Level Synthesis (HLS) [3] or in model-based design using Vivado System Generator for DSP [4].
3. Engineers then use Vivado IP Integrator [5] to create a block-based design of the whole embedded system. The full system needs to be developed with different data movers (AXI-DMA, AXI Memory Master, AXI-FIFO, etc.) and AXI interfaces (GP, HP and ACP) connecting the PL IP with the PS. Once all design rules checks are passed within IP Integrator, the project can be exported to the Xilinx Software Development Kit (SDK) [6].
4. Software engineers develop drivers and applications targeting the ARM processors in the PS using the Xilinx SDK.

In recent years, Xilinx made substantial ease-of-use improvements to the Vivado Design Suite that enabled engineers to shorten the duration of the IP development and IP block connection steps (step 2 and part of step 3 above). For IP development, the adoption of such new design technologies as C/C++ high-level synthesis in the Vivado HLS tool and model-based design with Vivado System Generator for DSP cut development
Let’s see how we can obtain an estimation of the performance and resource utilization that we can expect from our application, without going through the entire build cycle.

Figure 3 shows the test bench structure suitable for the SDSoC environment. The main program allocates dynamic memory for all the empty matrices and fills them with data (either read from a file or generated randomly). It then calls the reference software function and the hardware candidate function. Finally, the main program checks the numerical results computed by both functions to test the effective correctness.

Note the use of a special memory allocator called sds_alloc for each input/output array to let the SDSoC environment automatically insert a Simple DMA IP between each I/O port of the hardware accelerator; in contrast, malloc instantiates a Scatter-Gather DMA, which can handle arrays spread across multiple noncontiguous pages in the Physical Address Space. The Simple DMA is cheaper than the Scatter-Gather DMA in terms of area and performance overheads, but it requires sds_alloc to obtain physically contiguous memory.

Selecting the candidate accelerator is easily accomplished with a mouse click on a specific function via the SDSoC environment’s GUI. As shown in Figure 4, the routine cholesky_alt_top is marked with an “H” to indicate that it will be promoted to a hardware accelerator. We can also select the clock frequency for the accelerator and for the data motion cores (100 MHz as illustrated in the SDSoC project page of Figure 4).

We can now launch the “estimate speedup” process. After a few minutes of compilation, we get all the cores and the data motion network generated in a Vivado project. The SDSoC environment also generates an SD card image that comprises a Linux boot image

Figure 4 — Setting the hardware accelerator core and its clock frequency from the SDSoC project page
UNDERSTANDING THE PERFORMANCE ESTIMATION RESULTS
When the SDSoC environment compiles the application code for the estimate-speedup process, it generates an intermediate directory (_sds in Figure 5) in which it places all intermediate projects (Vivado HLS, Vivado IP Integrator, etc.). In particular, it inserts calls to a free-running ARM performance counter function, sds_clock_counter(), in the original code to measure the execution time of key parts of the program functions. That is why the target board needs to be connected with the SDSoC environment’s GUI during the estimate-speedup process. All the numbers reported in Figure 5 are measured with those counters during run-time execution. The only exception is the hardware-accelerated function, which does not exist until after the entire FPGA build (including place-and-route implementation); therefore Vivado HLS computes the hardware-accelerated function’s estimated cycles—together with the resource utilization estimates—under the hood, during the effective Vivado HLS Synthesis step.

Assuming the candidate hardware accelerator function runs at a clock frequency of $F_{HW}$ MHz and needs $CK_{HW}$ clock cycles for the whole computation (this is the concept of latency), and assuming the function takes $CK_{ARM}$ clock cycles at a clock frequency of $F_{ARM}$ MHz when executed on the ARM CPU, then the hardware accelerator achieves the same performance as the ARM CPU if the computation time is the same, that is, $CK_{HW} / F_{HW} = CK_{ARM} / F_{ARM}$. From this equation, we get $CK_{ARM} = CK_{HW} \cdot F_{ARM} / F_{HW}$. This is the accelerator’s latency expressed in ARM clock cycles, and therefore the break-even point: the function must take more cycles than this on the processor for migrating it to hardware to show any acceleration.

In Figure 6, we report the Vivado HLS synthesis estimation results. Note that the hardware accelerator latency is $CK_{HW}$ = 83,653 cycles at $F_{HW}$ = 100-MHz clock frequency. Since in the ZC702 board we have $F_{ARM}$ = 666 MHz and therefore $CK_{ARM} = CK_{HW} \cdot F_{ARM} / F_{HW}$ = 83,653 × 666 / 100 = 557,128, the resultant hardware acceleration is well aligned with the result of 565,554 cycles reported by the SDSoC environment in Figure 5. This is why the SDSoC environment can estimate the number of clock cycles that the accelerator requires without actually building it via place-and-route.

Figure 7 — Makefile for the Release build

BUILDING THE HARDWARE-SOFTWARE SYSTEM WITH THE SDSOC ENVIRONMENT
Having determined that this hardware acceleration makes sense, we can implement the whole hardware and software system with the SDSoC environment. All we need to do is add the right directives (in the form of pragma commands) to specify, respectively, the FIFO interfaces (due to the sequential scan of the I/O arrays); the amount of data to be transferred at run time for any call to the accelerator; the types of AXI ports connected between the IP core in the PL and the PS; and, finally, the kind of data movers. The following C/C++ code illustrates the applications of those directives. Note that in reality the last
directive is not needed, because the SDSoC environment will instantiate a Simple DMA due to the use of sds_alloc; we have included it here only for the sake of clarity.

We can build the project in Release configuration directly from the SDSoC environment’s GUI, or we can use the Makefile reported in Figure 7 and launched from the SDSoC Tool Command Language (Tcl) interpreter. As is the case with any tool in the Vivado Design Suite, designers can either adopt the GUI or Tcl scripting. To improve the speedup gain, we increase the clock frequency of the hardware accelerator to $F_{HW}$ = 142 MHz (set by the -clkid 1 makefile flag).

After less than half an hour of FPGA compilation, we get the bitstream to program the ZC702 board and the Executable and Linkable Format (ELF) file to execute on the Linux OS. We then measure the performance on the ZC702 board: 995,592 cycles for software-only and 402,529 cycles for hardware acceleration. Thus, the effective performance gain for the cholesky_alt_top function is 2.47.

Figure 8 illustrates the block diagram of the whole embedded system created when the SDSoC environment calls Vivado IP Integrator in a process transparent to the user (for the sake of clarity, only the AXI4 interfaces are shown). In addition, the SDSoC environment reports the Vivado IP Integrator block diagram as an HTML file to make it easy to read (Figure 9). This report clearly shows that the hardware accelerator is connected with the ACP port via a simple AXI4-DMA, whereas the GP port is used to set up the accelerator via an AXI4-Lite interface.

[Figure 8 block labels: ZYNQ7 Processing System (ps7); cholesky_alt_top_0 accelerator with its AXI4-Stream Accelerator Adapter (cholesky_alt_top_0_if); AXI Direct Memory Access data movers (datamover_0, datamover_1); AXI Interconnect blocks on M_AXI_GP0 and S_AXI_ACP; Processor System Reset blocks; Constant and Concat blocks]

How much time did it take us to generate the SD card for the ZC702 board with the embedded system up and running? We needed one working day to write a C++ test bench suitable to both Vivado HLS and the SDSoC environment, and then we needed one hour of experimentation to get good results from the Linear Algebra HLS Library and one hour to create the embedded system with the SDSoC environment (the FPGA compilation process). Altogether, the process took 10 hours. We estimate that doing all this work manually (step 3 with Vivado IP Integrator and step 4 with Xilinx SDK) would have cost us at least two weeks of full-time, hard work, not counting the experience needed to use those tools efficiently.
Accelerator Callsites (report table; row data not recoverable)
Columns: Accelerator Callsite | IP Port | Transfer Size (bytes) | Paged or Contiguous | Cacheable or Non-cacheable
XCELL SOFTWARE JOURNAL: XCELLENCE WITH SDACCEL
Compile, Debug, Optimize
by Jayashree Rangarajan
Senior Engineering Director,
Interactive Design Tools
Xilinx, Inc.
[email protected]
Vinay Singh
Senior Product Marketing Manager,
SDAccel
Xilinx, Inc.
[email protected]
Xilinx® FPGA devices mainly comprise a programmable logic fabric that lets application designers exploit both spatial and temporal parallelism to maximize the performance of an algorithm or a critical kernel in a large application. At the heart of this fabric are arrays of lookup-table-based logic elements, distributed memory cells and multiply-and-accumulate units. Designers can combine those elements in different ways to implement the logic in an algorithm while achieving power consumption, throughput and latency design goals.

The combination of FPGA fabric elements into logic functions has long been the realm of hardware engineers, involving a process that resembles assembly-level coding more closely than it mimics modern software design practices. Whereas common software design procedures long ago moved beyond assembly coding, FPGA design practices have progressed at a slower pace because of the inherent differences between CPU and FPGA compilation.

In the case of CPUs and GPUs, the hardware is fixed, and all programs are compiled against a static instruction set architecture (ISA). Although the ISAs differ between CPUs and GPUs, the basic underlying compilation techniques are the same. Those similarities have enabled the evolution of design practices from handcrafted assembly code into compilation, debug and optimization design procedures that leverage the OpenCL™ C, C and C++ programming languages common to software development.

In the case of FPGA design, designers can create their own processing architecture to perform a specific workload. The ability to customize the architecture to a specific system need is a key advantage of FPGAs, but it has also acted as a barrier to adopting software development practices for FPGA application development.

Six years ago, Xilinx began a diligent R&D effort to break down this barrier by creating a development environment that brought an intuitive software development design loop to FPGAs. The Xilinx SDAccel™ development environment for OpenCL C, C and C++ enables application compile, debug and optimization for FPGA devices in ways similar to the processes used for CPUs and GPUs, with the advantage of up to 25x better performance/watt for data center application acceleration.

Software designers can use the SDAccel development environment to create and accelerate many functions and applications. Let’s look at how the SDAccel environment enables a compile, debug and optimization design loop on a median filter application.

MEDIAN FILTER
The median filter is a spatial function commonly used in image processing for the purpose of noise reduction (Figure 1). The algorithm inside the median filter uses a 3 x 3 window of pixels around a center pixel to compute the value of the center based on the median of all neighbors. The equation for this operation is:

outputPixel[i][j] =
    median(inputPixel[i-1][j-1], inputPixel[i-1][j], inputPixel[i-1][j+1],
           inputPixel[i][j-1],   inputPixel[i][j],   inputPixel[i][j+1],
           inputPixel[i+1][j-1], inputPixel[i+1][j], inputPixel[i+1][j+1]);

COMPILE
After the functionality of the median filter has been captured in a programming language such as OpenCL C, the first stage of development is compilation. On a CPU or GPU, compilation is a necessary and natural step in the software design flow. The target ISA is fixed and well known, leaving the programmer to worry only about the number of available processing cores and cache misses in the algorithm. FPGA compilation is more of an open question: At compilation time, the target ISA does not exist, the logic resources have yet to be combined into a processing fabric and the system memory architecture is yet to be defined.
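As a software-only illustration of the 3 x 3 median equation given for the median filter, the following C++ sketch computes the filter on a row-major grayscale image. It is not the OpenCL C kernel from the article; the function names (median9, medianFilter3x3), the 8-bit pixel type and the choice to copy border pixels unchanged are assumptions made for the example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Median of the nine values in a 3 x 3 window.
// std::nth_element rearranges the array so that index 4 holds the median.
std::uint8_t median9(std::array<std::uint8_t, 9> window) {
    std::nth_element(window.begin(), window.begin() + 4, window.end());
    return window[4];
}

// Applies the 3 x 3 median filter to a row-major grayscale image.
// ASSUMPTION: border pixels are copied unchanged; the article does not
// say how image borders are handled.
std::vector<std::uint8_t> medianFilter3x3(const std::vector<std::uint8_t>& input,
                                          int width, int height) {
    assert(width > 0 && height > 0);
    assert(input.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::uint8_t> output(input);  // borders keep their input values
    for (int i = 1; i + 1 < height; ++i) {
        for (int j = 1; j + 1 < width; ++j) {
            // Gather inputPixel[i-1..i+1][j-1..j+1] into one window.
            std::array<std::uint8_t, 9> window{};
            int k = 0;
            for (int di = -1; di <= 1; ++di) {
                for (int dj = -1; dj <= 1; ++dj) {
                    window[k++] = input[(i + di) * width + (j + dj)];
                }
            }
            output[i * width + j] = median9(window);
        }
    }
    return output;
}
```

Calling medianFilter3x3 on a 3 x 3 image whose center is a single bright outlier replaces that outlier with the value of its neighbors, which is the noise-reduction behavior the section describes.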
for (int line = 0; line < height; line++) {
local uint linebuf0[MAX_WIDTH];
Whether the final device is a CPU or an FPGA, profiling is an essential component of application development. The SDAccel environment’s visualization and profiler capabilities let an application programmer characterize the impact of code changes and application requirements in terms of kernel occupancy, transactions to memory and memory bandwidth utilization.

The design loop created by the operations of compile, debug and optimize is fundamental to software development flows. The SDAccel development environment enables this design loop with tools and techniques similar to the development environment on a CPU, with FPGA-based application acceleration of up to 25x better performance per watt and with a 50x to 75x latency improvement. Software programmers who use SDAccel can leverage the flexibility of the logic fabric to build high-performance, low-power applications without having to understand all of the details associated with hardware design.
What’s Recent:
• Half Wheelchair, Half Segway, Half Battlebot: Unprecedented mobility for the disabled, controlled by Zynq
• Regular Universal Electronic Control Unit tester for vehicles up and running in two months thanks to NI LabVIEW and LabVIEW FPGA
• Radar looks deep into Arctic snow and ice to help develop sea-level climate models
• Passive, Wi-Fi radar that sees people through walls prototyped with NI LabVIEW and two FPGA-based USRP-2921 SDR platforms
• 500-FPGA Seismic Supercomputer performs real-time acoustic measurements on its heart of stone to simulate earthquakes
Developing OpenCL Imaging Applications Using C++ Libraries
Xilinx’s SDAccel development environment leverages the power of preexisting libraries to accelerate application design.

by Stephen Neuendorffer
Principal Engineer, Vivado HLS
Xilinx, Inc.
[email protected]
Thomas Li
Software Engineer, Vivado HLS
Xilinx, Inc.
Figure 1: The basic OpenCL platform contains one host and at least one device. (Diagram: HOST connected to DEVICE over PCIe.)
For an FPGA, in contrast, the SDAccel development environment generates custom cores per the specific computation requirements of the application kernel. The application developer thus is free to explore implementation architectures based on the needs of the algorithm to reduce overall system latency and power consumption.

The second OpenCL component is the memory model (Figure 2). This model, which is common to all vendors, defines a single memory hierarchy against which a developer can create a portable application.

The main components of the memory model are the host, global, local and private memories. The host memory refers to the memory space that is accessible only to the host processor. The memories visible to the FPGA (the device) are the global, local and private memory spaces. The global memory space is accessible to both the host and the device and is typically implemented in DDR attached to the FPGA. Depending on the FPGA used on the acceleration board, a portion of the global memory can also be implemented inside the FPGA fabric. The local and private memory spaces are visible only to the kernels executing inside the FPGA fabric and are built entirely inside that fabric using block RAM (BRAM) and register resources.

Figure 2: The OpenCL memory model defines a single memory hierarchy for application development. (Diagram: host memory on the host side; constant and global memory reached over PCIe; on the FPGA, Kernel A and Kernel B, each with compute units that have local memory and processing elements (PE) with private memory.)

Let’s see how the SDAccel environment leverages OpenCL and C++ libraries for a stereo imaging block matching application.

STEREO BLOCK MATCHING
Stereo block matching uses images from two cameras to create a representation of the shape of an object in the field of view of the cameras. As Figure 3 shows, the algorithm uses the input images of a left and a right camera to search for the correspondence between the images. Such multi-camera image processing tasks can be applied to depth maps, image segmentation and foreground/background separation. These are, for example, all integral parts of pedestrian detection applications in driver assistance systems.

USING C++ LIBRARIES FOR VIDEO
The SDAccel development environment leverages technology from Xilinx’s Vivado HLS C-to-RTL compiler as part of the core kernel compiler, letting the SDAccel environment use kernels expressed in C and C++ in the same way as kernels expressed in OpenCL C. Application developers thus can use C++ libraries and code previously optimized in Vivado HLS to increase productivity.

The main code for the stereo block matching application is shown below.

Vivado HLS provides image processing functions based on the popular OpenCV framework. The functions are written in C++ and have been optimized to provide high performance in an FPGA. When synthesized into an FPGA implementation, the equivalent of anywhere from tens to thousands of RISC processor instructions are executed concurrently every clock cycle.

The application is built from Vivado HLS video processing functions. The application code contains C++ function calls to Vivado HLS libraries as well as pragmas to guide the compilation process. The pragmas are divided into those for interface definition and those for performance optimization.

The interface definition pragmas determine how the stereo block matching accelerator connects to the rest of the system. Since this accelerator is expressed in C++ instead of OpenCL C code, the application programmer must provide interface definition
pragmas that match the assumptions of the OpenCL model in the SDAccel environment.

void stereobm(
    unsigned short img_data_lr[MAX_HEIGHT*MAX_WIDTH],
    unsigned char img_data_d[MAX_HEIGHT*MAX_WIDTH],
    int rows,
    int cols)
{
// Buffer contents are stored in device global memory (AXI4 master ports).
#pragma HLS INTERFACE m_axi port=img_data_lr offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=img_data_d offset=slave bundle=gmem1
// Buffer base addresses, scalar arguments and the return value use the AXI4-Lite control port.
#pragma HLS INTERFACE s_axilite port=img_data_lr bundle=control
#pragma HLS INTERFACE s_axilite port=img_data_d bundle=control
#pragma HLS INTERFACE s_axilite port=rows bundle=control
#pragma HLS INTERFACE s_axilite port=cols bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
    // ... (the remainder of the listing is not reproduced here)
}

The pragmas marked with m_axi state that the contents of the buffer will be stored in device global memory. The pragmas marked with s_axilite are required for the accelerator to receive the base address of buffers in global memory from the host.

The performance optimization pragma in this code is dataflow. The dataflow pragma yields an accelerator in which different subfunctions can also execute concurrently.

In this accelerator, because of the underlying implementation of the hls::Mat datatype, data is also streamed between each function. This allows the FindStereoCorrespondenceBM function to start operating as soon as the Split function produces pixels, without having to wait for a complete image to be produced. The net result is a more efficient architecture and reduced processing latency relative to sequential processing of each function with full frame buffers in between them.

Imaging applications are a compute-intensive application domain with a rich set of available libraries; the devil is in optimizing the application for the execution target. The SDAccel environment lets developers leverage C++ libraries to accelerate the development of imaging applications for FPGAs programmed in OpenCL.
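To make the idea of searching for the correspondence between a left and a right image concrete, here is a minimal, software-only C++ sketch for a single image row. It is not the Vivado HLS library code behind Split or FindStereoCorrespondenceBM. It assumes that each 16-bit word of img_data_lr packs the left pixel in the high byte and the right pixel in the low byte, and it uses a one-dimensional sum-of-absolute-differences (SAD) window; the helper names splitRow and matchRow are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Unpacks one row of 16-bit words into separate left and right 8-bit rows.
// ASSUMPTION: left pixel in the high byte, right pixel in the low byte.
void splitRow(const std::vector<std::uint16_t>& packed,
              std::vector<std::uint8_t>& left,
              std::vector<std::uint8_t>& right) {
    left.resize(packed.size());
    right.resize(packed.size());
    for (std::size_t x = 0; x < packed.size(); ++x) {
        left[x] = static_cast<std::uint8_t>(packed[x] >> 8);
        right[x] = static_cast<std::uint8_t>(packed[x] & 0xFF);
    }
}

// For each pixel x of the left row, finds the shift d in [0, maxDisparity]
// that minimizes the SAD between the window centered at x in the left row
// and the window centered at x - d in the right row.
// Pixels too close to the row ends keep a disparity of 0.
std::vector<int> matchRow(const std::vector<std::uint8_t>& left,
                          const std::vector<std::uint8_t>& right,
                          int halfWindow, int maxDisparity) {
    assert(left.size() == right.size());
    assert(halfWindow >= 0 && maxDisparity >= 0);
    const int width = static_cast<int>(left.size());
    std::vector<int> disparity(left.size(), 0);
    for (int x = halfWindow + maxDisparity; x + halfWindow < width; ++x) {
        int bestCost = -1;
        for (int d = 0; d <= maxDisparity; ++d) {
            int cost = 0;
            for (int k = -halfWindow; k <= halfWindow; ++k) {
                cost += std::abs(static_cast<int>(left[x + k]) -
                                 static_cast<int>(right[x + k - d]));
            }
            if (bestCost < 0 || cost < bestCost) {
                bestCost = cost;
                disparity[x] = d;  // keep the smallest shift on ties
            }
        }
    }
    return disparity;
}
```

Larger disparities correspond to objects closer to the cameras, which is how a disparity row becomes one line of a depth map. In an FPGA implementation the two steps would be connected by streams so that matching can begin before the whole image has been split, as the dataflow discussion describes.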
XCELL SOFTWARE JOURNAL: XCELLENT ALLIANCE
Model-Based Design workflow lets engineers make design trade-offs at the desktop rather than the lab.

The introduction of the Xilinx® Zynq®-7000 All Programmable SoC family in 2011 brought groundbreaking innovation to the FPGA industry. These devices, with their combination of dual-core ARM® Cortex™-A9 MPCore™ processors and ample programmable logic, offered advantages for a wealth of applications. By adopting Zynq SoCs, designers could reap the benefits of software application development on one of the industry’s most popular processors while gaining the flexibility and throughput potential provided via hardware acceleration on a high-speed, programmable logic fabric.

Using MATLAB® and Simulink® from MathWorks®, innovators today can leverage a highly integrated hardware-software workflow to create highly optimized systems. The case study presented here illustrates this model-based workflow.

When Xilinx released the first Zynq SoC in December 2011, designers seized on the idea that they could migrate their legacy, multichip solutions, built from discrete processors and FPGAs, to a single-chip platform. They could create FPGA-based accelerators on the new platform to unclog software execution bottlenecks and tap into an array of off-the-shelf, production-ready intellectual property from Xilinx and its IP partners that would address applications in digital signal processing, networking, communications and more.

The open question was how they would program the new devices. Designers imagining the potential of hardware-software co-design sought integrated workflows that would intelligently partition designs between ARM processors and programmable logic. What they found, however, were distinct hardware and software workflows: conventional embedded software development flows targeting ARM cores, alongside a combination of IP assembly, traditional RTL and emerging high-level synthesis tools for programmable logic.

INTEGRATED WORKFLOW
In September 2013, MathWorks introduced a hardware-software workflow for Zynq-7000 SoCs using Model-Based Design. In this workflow (Figure 1), designers could create models in Simulink that would represent a complete dynamic system (including a Simulink model for algorithms targeted for the Zynq SoC) and rapidly create hardware-software implementations for Zynq SoCs directly from the algorithm.

System designers and algorithm developers used simulation in Simulink to create models for a complete system (communications, electromechanical components and so forth) in order to evaluate design concepts, make high-level trade-offs, and partition algorithms into software and hardware elements. HDL code generation from Simulink enabled the creation of IP cores and high-speed I/O processing on the Zynq SoC fabric. C/C++ code generation from Simulink enabled programming of the Zynq SoC’s Cortex-A9 cores, supporting rapid embedded software iteration.

The approach enabled automatic generation of the AMBA® AXI4 interfaces linking the ARM processing system and programmable logic with support for the Zynq SoC. Integration with downstream tasks (such as C/C++ compilation and building of the executable for the ARM processing system, bitstream generation using Xilinx implementation tools, and downloading to Zynq development boards) allowed for a rapid prototyping workflow.
Figure 1: Designers can create models in Simulink that represent a complete dynamic system and create hardware-software implementations for Zynq SoCs directly from the model. (Diagram stages: Research and Requirements; Design, with System Modeling, Verification, C code generation for the ARM core and HDL code generation for the programmable fabric; Implementation, with Build Executable for the ARM Cortex-A9 and IP Core Generation for the Programmable Logic, plus hardware/software design iterations; Integration on Zynq-7000 SoC development boards.)

Central to this workflow are two technologies: Embedded Coder® and HDL Coder™. Embedded Coder generates production-quality C and C++ code from MATLAB, Simulink and Stateflow®, with target-specific optimizations for embedded systems. Embedded Coder has become so widely adopted that when you drive a modern passenger car, take a high-speed train or fly on a commercial airline, there’s a high probability that Embedded Coder generated the real-time code guiding the vehicle. HDL Coder is the counterpart to Embedded Coder, generating VHDL or Verilog for FPGAs and ASICs, and is integrated tightly into Xilinx workflows. This mature C and HDL code generation technology forms the foundation of the Model-Based Design workflow for programmable SoCs.

Design teams using Model-Based Design in applications such as communications, image processing, smart power and motor control have adopted this workflow [...] hardware designers and embedded software developers to accelerate the implementation of algorithms on programmable SoCs. Once the generated HDL and C code is prototyped in hardware, the design team can use Xilinx Vivado® IP Integrator to integrate the code with other design components needed for production.

CASE STUDY: THREE-PHASE MOTOR CONTROL
For several reasons, custom motor controllers with efficient power conversion are one of the most popular applications to have emerged for programmable SoCs. Higher-performance, higher-efficiency initiatives are one factor. With electric motor-driven systems accounting for as much as 46 percent of global electricity consumption, attaining higher efficiency with novel control algorithms is an increasingly common motor drive design goal. Xilinx Zynq programmable logic enables precise timing, providing an ideal platform for implementing low-latency, high-efficiency drives.

Another driver is multi-axis control. Ample programmable logic and DSP resources on programmable SoCs open up possibilities for implementing multiple motor controllers on a single programmable SoC, whether motors will operate independently or in combination, as in an integrated motion control system.

Integration of industrial networking IP is a further factor. Xilinx and its IP partners offer IP for integration with EtherCAT, PROFINET and other industrial networking protocols that can be readily incorporated into programmable SoCs.
Figure 2: The motor control system model includes two primary subsystems.
[Figure: Simulink system model. Blocks shown: System Inputs, FOC_Velocity_Encoder_Core_Algorithm (Mode Select with Disabled, Open Loop, Calibrate Encoder and Velocity Control modes; Current Convert, Volt Convert, Position/Velocity and Current Control), Motor_And_Load and Verify Outputs.]
senses the motor current, and a hand-coded ADC Peripheral block processes the current.
• The Current Controller takes the motor state and current, as well as the operating mode and velocity control commands passed from the ARM core over the AXI4 interface, and computes the current controller command. When in its closed-loop mode, the Current Controller uses a proportional-integral (PI) control law, whose gains can be tuned using simulation and prototyping.
• The current controller command goes through the Voltage Conversion block and is output to the motor control FMC via the PWM Peripheral, ultimately driving the motor.

Designers can model the complete system in Simulink (Figure 3).

In Model-Based Design, the system increases to four components in the top-level Simulink model:
• an input model, which provides a commanded shaft velocity and on/off commands to the controller as stimulus;
• a model of the motor control algorithm that will be targeted for the Zynq SoC;
• a plant model, which includes the drive electronics of the FMC, a permanent-magnet synchronous machine (PMSM) model of the brushless DC motor, a model of an inertial load on the motor shaft and an encoder sensor model; and
• an output-verification model, which includes post-processing and graphics to help the algorithm developer refine and validate the model.

In Simulink, we can test out the algorithm with simulation long before we start hardware testing. We can tune the PI controller gains, try various stimulus profiles and examine the effect of different processing rates. As we use simulation, though, we face a fundamental issue: Because of the disparate processing rates typical of motor control (overall mechanical response rates of 1 to 10 Hz, core controller algorithm rates of 1 to 25 kHz and programmable logic operating at 10 to 50 MHz or more), simulation times can run to many minutes or even hours. We can head off this issue with a control-loop model that uses behavioral models for the peripherals (the PWM, current sensing and encoder processing), producing the time response shown in Figure 3.

After we use the control-loop model to tune the controller, our next step is to prove out the controller in simulation using high-fidelity models that include the peripherals. We do this by incorporating timing-accurate specification models for the C and HDL components of the controller. These specification models have the necessary semantics for C and HDL code generation. With simulation, we then verify that the system with specification models tracks extremely closely to the control-loop model.

Once performance has been validated with the high-fidelity models, we move on to prototyping the controller in hardware. Following the workflow shown in Figure 1, we start by generating the IP core. The IP core generation workflow lets us choose the target development board and walks us through the process of mapping the core’s input and output ports to target interfaces, including the AXI4 interface and external ports.

Through integration with the Vivado Design Suite, the workflow builds the bitstream and programs the fabric of the Zynq-7020 SoC.

With the IP core now loaded onto the target device, the next step is to generate embedded C code from the Simulink model targeting the ARM core. The process of generating C code, compiling it and building the executable with embedded Linux is fully automated, and the prototype is then ready to run.

To run the prototype hardware and verify that it gives us results consistent with our simulation models, we build a modified Simulink model (Figure 4a) that will serve as a high-level control panel. In this model, we removed the simulation model for the plant (that is, the drive electronics, motor, load and sensor) and replaced it with I/Os to the ZedBoard.

Using this model in a Simulink session, we can turn on the motor, choose different stimulus profiles, monitor relevant signals and acquire data for later post-processing in MATLAB, but for now we can repeat the pulse test (Figure 3).

Figure 4b shows the results of the shaft rotational velocity and the phase current for the hardware prototype compared with the simulation results. The startup sequence for the hardware prototype differs noticeably from those for the two simulation models. This is to be expected, however, because the initial angle between the motor’s rotor and stator in the hardware test differs from the initial angle used in simulation, resulting in a different response as the current control algorithm drives the motor through its encoder calibration mode. When the pulse is applied at 2 seconds, the results from simulation and prototype hardware match almost exactly.

[Figure 4a: Simulink control-panel model. Blocks shown: Input_Source (Signal_Builder_Experiments, Sine_Wave at F = 0.1 Hz, Slider_Gain), FOC_Velocity_Encoder_C, FOC_Velocity_Encoder_FPGA_Interface (FPGA Interface For Bitstream) and Display (Scope).]
[Figure 4b: plots of measured shaft velocity and phase currents (amps) for prototype hardware, system simulation and control-loop simulation. Surviving caption fragment: "... angle at t=0."]

Based on these results, we could continue with further testing under different loading and operating conditions, or we could move on to performing further C and HDL optimizations.

Engineers are turning to Model-Based Design workflows to enable hardware-software implementation of algorithms on Xilinx Zynq SoCs. Simulink simulation provides early evaluation of algorithms, letting designers evaluate the algorithms’ effectiveness and make design trade-offs at the desktop rather than in the lab, with a resultant increase in productivity. Proven C and HDL code generation technology, along with hardware support for Xilinx All Programmable SoCs, provides a rapid and repeatable process for getting algorithms running on real hardware. Continuous verification between the simulation and hardware environments lets designers identify and resolve issues early in the development process.

Workflow support for Zynq-based development boards, software-defined radio kits and motor control kits is available from MathWorks. To learn more about this workflow, visit http://www.mathworks.com/zynq.

MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See http://www.mathworks.com/trademarks for a list of additional trademarks. Other product or brand names may be trademarks or registered trademarks of their respective holders.
For advertising inquiries (including calendar and advertising rate card), contact [email protected]
or call: 408-842-2627.
XCELL SOFTWARE JOURNAL: XTRA, XTRA
Xtra, Xtra
Xilinx® is constantly refining its software and updating its training and resources to help software developers design innovations with the Xilinx SDx™ development environments and related FPGA and SoC hardware platforms. Here is a list of additional resources and reading. Check for the newest quarterly updates in each issue.

SDSOC™ DEVELOPMENT ENVIRONMENT
The SDSoC environment provides a familiar embedded C/C++ application development experience, including an easy-to-use Eclipse IDE and a comprehensive design environment for heterogeneous Xilinx All Programmable SoC and MPSoC deployment. Complete with the industry’s first C/C++ full-system optimizing compiler, SDSoC delivers system-level profiling, automated software acceleration in programmable logic, automated system connectivity generation and libraries to speed programming. It lets end-user and third-party platform developers rapidly define, integrate and verify system-level solutions and enable their end customers with a customized programming environment.
• SDSoC Backgrounder (PDF)
• SDSoC User Guide (PDF)
• SDSoC User Guide: Getting Started (PDF)
• SDSoC User Guide: Platforms and Libraries (PDF)
• SDSoC Release Notes (PDF)
• Boards, Kits and Modules
• SDSoC Video Demo
• Buy/Download

SDACCEL™ DEVELOPMENT ENVIRONMENT
The SDAccel environment for OpenCL™, C and C++ enables up to 25x better performance/watt for data center application acceleration leveraging FPGAs. A member of the SDx family, the SDAccel environment combines the industry’s first architecturally optimizing compiler supporting any combination of OpenCL, C and C++ kernels, along with libraries, development boards, and the first complete CPU/GPU-like development and run-time experience for FPGAs.
• SDAccel Backgrounder
• SDAccel Development Environment: User Guide
• SDAccel Development Environment: Tutorial
• Xilinx Training: SDAccel Video Tutorials
• Boards and Kits
• SDAccel Demo

SDNET™ DEVELOPMENT ENVIRONMENT
The SDNet environment, in conjunction with Xilinx All Programmable FPGAs and SoCs, lets network engineers define line card architectures, design line cards and update them with a C-like environment. It enables the creation of “Softly” Defined Networks, a technology dislocation that goes well beyond today’s Software Defined Networking (SDN) architectures.
• SDNet Backgrounder (Xilinx)
• SDNet Backgrounder (The Linley Group)
• SDNet Demo

SOFTWARE DEVELOPMENT KIT (SDK)
The SDK is Xilinx’s development environment for creating embedded applications on any of its microprocessors for Zynq®-7000 All Programmable SoCs and the MicroBlaze™ soft processor. The SDK is the first application IDE to deliver true homogeneous- and heterogeneous-multiprocessor design and debug.
• Free SDK Evaluation and Download