OpenCL On FPGA
Marc Gaucheron
Intel Programmable Solutions Group
Agenda
FPGA architecture overview
FPGA Architecture: Fine-grained Massively Parallel
Millions of reconfigurable logic elements
Thousands of 20 Kb memory blocks
Thousands of variable-precision DSP blocks
Dozens of high-speed transceivers
Multiple high-speed configurable memory controllers
Multiple ARM® cores
I/O
Let’s zoom in
FPGA Architecture: Basic Elements
Basic Element
FPGA Architecture: Flexible Interconnect
FPGA Architecture: Custom Operations Using Basic Elements
16-bit add
32-bit sqrt
…
FPGA Architecture: Memory Blocks
[Figure: 20 Kb memory block with addr and data_in inputs and a data_out output]
Can be configured and grouped using the interconnect to create various cache architectures:
Few larger caches
Lots of smaller caches
FPGA Architecture: Floating Point Multiplier/Adder Blocks
[Figure: floating-point multiplier/adder block with data_in and data_out]
DSP block architecture
Elementary math functions supporting floating point
Coverage of ~70 elementary math functions, including trigonometric and miscellaneous functions
1 GHz core performance
5.5M logic elements
Heterogeneous 3D SiP integration
Up to 70% lower power
Up to 1 TB/s
Up to 10 TFLOPS
Intel 14 nm Tri-Gate
Most comprehensive security
Quad-core Cortex-A53 ARM processor
Developing with FPGA
Typical Programmable Logic Design Flow
Design specification
Design entry/RTL coding
- Behavioral or structural description of design
RTL simulation
- Functional simulation (Mentor Graphics ModelSim® or other 3rd-party simulators)
- Verify logic model & data flow (no timing delays)
Synthesis (mapping)
- Translate design into device-specific primitives
- Optimization to meet required area & performance constraints
- Quartus II synthesis, Precision Synthesis, Synplify/Synplify Pro, Design Compiler FPGA
- Result: post-synthesis netlist
Timing analysis
- Verify performance specifications were met
- Static timing analysis
Application Development Paradigm
ASIC
FPGA Programmers
Parallel Programmers
The magic trick?
HDL Coder
FpgaC
OpenCL Concepts
Setting the right expectations
OpenCL C Language
Built-in functions
OpenCL Kernels: Parallel Threads
Threads in workgroups can cooperate with each other through fast local (on-chip) memory
Data Organization
Memory hierarchy
Thread: Registers
Thread: Private memory
Workgroup: Local or shared memory
All workgroups: Global memory
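The hierarchy above can be illustrated on the host in plain C. The following is only a sketch that emulates one work-group sequentially; `LOCAL_SIZE`, `global_id`, and `run_work_group` are hypothetical names chosen for this example, and a zero global offset is assumed. A real OpenCL kernel would run the work-items in parallel and would need a barrier between the two loops.

```c
#include <assert.h>
#include <stddef.h>

enum { LOCAL_SIZE = 4 };  /* work-items per work-group (assumed value) */

/* 1-D global ID of a work-item, assuming a zero global offset. */
size_t global_id(size_t group_id, size_t local_id)
{
    assert(local_id < LOCAL_SIZE);
    return group_id * LOCAL_SIZE + local_id;
}

/* Emulates one work-group on the host, one work-item after another. */
void run_work_group(const int *global_in, int *global_out, size_t group_id)
{
    int local_mem[LOCAL_SIZE];   /* local memory: shared inside the work-group */
    size_t lid;
    int sum = 0;

    assert(global_in != NULL && global_out != NULL);

    for (lid = 0; lid < LOCAL_SIZE; lid++) {
        /* private memory: one value per work-item, read from global memory */
        int private_val = global_in[global_id(group_id, lid)];
        local_mem[lid] = private_val;
    }
    /* A real kernel would need a barrier here before reading local_mem. */
    for (lid = 0; lid < LOCAL_SIZE; lid++)
        sum += local_mem[lid];

    global_out[group_id] = sum;  /* global memory: visible to all work-groups */
}
```

Each work-group leaves one partial sum in global memory, so cooperation inside a group goes through the small local buffer and only the result travels back to global memory.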
OpenCL: abstracting FPGA away
Altera OpenCL Program Overview
OpenCL Use Model: Abstracting the FPGA away
Host
Accelerator
SoC FPGA combines these in a single device
OpenCL on GPU/Multi-Core CPU Architectures
[Figure: GPU block diagram with 16 columns of IU, SP, and Shared Memory blocks above rows of TF and L2 blocks]
Pipeline the resulting circuit with a new thread on each clock cycle to keep
functional units busy
Example Pipeline for Vector Add
8 threads for vector add example
[Figure: animation in which thread IDs 0 to 7 enter the pipeline one per clock cycle; on each cycle a new thread reaches the two Load units while earlier threads move further down the pipeline]
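The animation can be summarized with a little arithmetic. The sketch below assumes an idealized pipeline with no stalls in which exactly one new thread enters per clock cycle; the function names and the three-stage depth used in the usage note (load, add, store) are assumptions for illustration, not figures from the slides.

```c
#include <assert.h>

/* Clock cycles needed to drain n_threads through `depth` pipeline stages
   when one new thread enters on every cycle (idealized, no stalls). */
unsigned pipeline_cycles(unsigned n_threads, unsigned depth)
{
    assert(depth > 0);
    return n_threads == 0 ? 0 : depth + n_threads - 1;
}

/* Thread ID occupying `stage` at `cycle` (both 0-based), or -1 if the
   stage is empty at that time. */
int thread_in_stage(unsigned cycle, unsigned stage, unsigned n_threads)
{
    unsigned id;

    if (cycle < stage)
        return -1;               /* the first thread has not arrived yet */
    id = cycle - stage;
    return id < n_threads ? (int)id : -1;
}
```

With 8 threads and an assumed depth of 3, `pipeline_cycles(8, 3)` is 10 cycles, compared with 24 if each thread had to finish before the next one started.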
High-level code
Mem[100] += 42 * Mem[101]
CPU instructions
R0 ← Load Mem[100]
R1 ← Load Mem[101]
R2 ← Load #42
R2 ← Mul R1, R2
R0 ← Add R2, R0
Store R0 → Mem[100]
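As a cross-check of the instruction sequence, the same steps can be written in C with one local variable per register. This is a sketch only; `mul_add_step` is a hypothetical name and `mem` stands in for the memory array on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Executes Mem[100] += 42 * Mem[101] one "instruction" at a time. */
void mul_add_step(int32_t *mem)
{
    int32_t r0, r1, r2;

    assert(mem != NULL);
    r0 = mem[100];      /* R0 <- Load Mem[100] */
    r1 = mem[101];      /* R1 <- Load Mem[101] */
    r2 = 42;            /* R2 <- Load #42      */
    r2 = r1 * r2;       /* R2 <- Mul R1, R2    */
    r0 = r2 + r0;       /* R0 <- Add R2, R0    */
    mem[100] = r0;      /* Store R0 -> Mem[100] */
}
```

For example, with `mem[100] = 5` and `mem[101] = 2`, the call leaves `mem[100]` equal to 89.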
CPU activity, step by step
[Figure: the six instructions execute one after another in time, each using the same hardware block (A)]
On the FPGA we unroll the CPU hardware…
[Figure: the six instructions laid out in space, each with its own copy of the hardware block (A)]
… and specialize by position
1. Instructions are fixed. Remove “Fetch”
2. Remove unused ALU ops
3. Remove unused Load / Store
4. Wire up registers properly! And propagate state.
5. Remove dead data.
6. Reschedule!
Custom data-path on the FPGA matches your algorithm!
High-level code
Mem[100] += 42 * Mem[101]
Custom data-path
[Figure: custom data-path from two loads and the constant 42 to a store]
Build exactly what you need:
Operations
Data widths
Memory size & configuration
Efficiency:
Throughput / Latency / Power
What Hardware do we produce?
[Figure: generated hardware; labels: CRA, Load (×2), Mult-add, Store, Done?, Memory Interconnect, DDRx]
ALTERA SDK for OpenCL
OpenCL CAD Flow
[Figure: mm_kernel.cl passes through the CLANG front end (unoptimized LLVM IR), the optimizer (optimized LLVM IR), and the RTL generator (Verilog) into Quartus, together with the QSYS system description (PCIe, DDR*); mm_host.c is built by a C compiler with the ACL runtime library into program.exe; third-party or academic tools are shown beside the LLVM IR stages]
Front End: parses OpenCL extensions and intrinsics, produces LLVM IR
Middle End: code optimizations such as loop unrolling and branch elimination, leading to more efficient HW
Back End: conversion of intermediate representation into custom generated pipelined hardware
System Architecture Gen: create interfaces to the outside world (kernel to IP interconnect in QSYS). Needs to meet timing without user intervention.
OpenCL Kernel Development Flow
Modify kernel.cl
Profiler (hours)
DONE!
x86 emulator
Supports:
OpenCL syntax
Channels
printf
Profiler
Kernel Pipeline
Store
Guaranteed Timing Flow
kernel.cl
Meet timing?
Yes: DONE!
No: Reconfig kernel PLL, then DONE!
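The decision in this flow can be sketched in a few lines of C. This is an illustration under assumptions, not the SDK's implementation: it assumes that "Reconfig kernel PLL" means lowering the kernel clock to the frequency the compiled design actually achieved, and `timing_result` and `close_timing` are hypothetical names.

```c
#include <assert.h>

typedef struct {
    double clock_mhz;        /* kernel clock that will be used */
    int    pll_reconfigured; /* 1 if the kernel PLL had to be changed */
} timing_result;

/* Picks the kernel clock: keep the target when timing is met, otherwise
   fall back to the achieved maximum frequency (assumed behavior). */
timing_result close_timing(double target_mhz, double achieved_fmax_mhz)
{
    timing_result r;

    assert(target_mhz > 0.0 && achieved_fmax_mhz > 0.0);
    if (achieved_fmax_mhz >= target_mhz) {
        r.clock_mhz = target_mhz;
        r.pll_reconfigured = 0;
    } else {
        r.clock_mhz = achieved_fmax_mhz;
        r.pll_reconfigured = 1;
    }
    return r;
}
```

Either branch ends in a design that meets timing at its clock, which is what lets the flow finish without user intervention.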
Optimization Report Example: Load to Store dependency
1 kernel void prefixsum( global int* restrict A, unsigned N ) {
2 for ( unsigned i = 1 ; i < N ; i++ ) {
3 int a = A[i-1];
4 A[i] += a;
5 }
6 }
==============================================================================
| *** Optimization Report ***                                                |
==============================================================================
| Kernel: prefixsum                                              | Ln.Col    |
==============================================================================
| Loop for.body                                                  | 2.25      |
|   Pipelined execution inferred.                                |           |
|   Successive iterations launched every 321 cycles due to:      |           |
|                                                                |           |
|     Memory dependency on Load Operation from:                  | 3.21      |
|       Store Operation                                          | 4.7       |
|   Largest Critical Path Contributors:                          |           |
|       49%: Load Operation                                      | 3.21      |
|       49%: Store Operation                                     | 4.7       |
==============================================================================
Slide callouts:
Relative cost of global memory to local computation
True fix requires restructuring the code
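One possible restructuring, shown here as plain C rather than OpenCL C, keeps the running sum in a local variable so that each iteration no longer loads the value the previous iteration just stored to global memory. This is a sketch of the idea only; the function names are hypothetical, and the launch interval a compiler would actually report for it is not claimed here.

```c
#include <assert.h>
#include <stddef.h>

/* Same loop as the kernel above: iteration i loads A[i-1], which
   iteration i-1 has just stored (load-to-store dependency via memory). */
void prefixsum_memdep(int *restrict A, unsigned N)
{
    unsigned i;

    assert(A != NULL || N == 0);
    for (i = 1; i < N; i++) {
        int a = A[i - 1];
        A[i] += a;
    }
}

/* Restructured: the loop-carried value lives in `running`, so the only
   dependency between iterations is an addition on a local variable. */
void prefixsum_register(int *restrict A, unsigned N)
{
    unsigned i;
    int running;

    assert(A != NULL || N == 0);
    if (N == 0)
        return;
    running = A[0];
    for (i = 1; i < N; i++) {
        running += A[i];
        A[i] = running;
    }
}
```

Both versions produce the same array; for the input 1, 2, 3, 4, 5 the result is 1, 3, 6, 10, 15.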
Optimization Report Example: Accumulating a value
Hierarchical and interactive
Detailed Area Report (aocl analyze-area)
Additional Altera OpenCL Collateral
White papers on OpenCL
OpenCL online demos
OpenCL design examples
Instructor-Led training
Parallel Computing with OpenCL Workshop by Altera – (1 Day)
Optimization of OpenCL for Altera FPGAs Training by Altera – (1 Day)
Online training
Introduction to Parallel Computing with OpenCL
Writing OpenCL Programs for Altera FPGAs
Running OpenCL on Altera FPGAs
Single-Threaded vs. Multi-Threaded Kernels
Building Custom Platforms for Altera SDK for OpenCL
ALTERA BSP: abstracting FPGA development
An adaptable Board Support Package
[Figure: I/O infrastructure (A/D over JESD204, SDI over XCVRs, 10G network through a 10Gb MAC/UOE data interface) connected through the interconnect to kernel IP]
Prebuilt I/O infrastructure
BSP built with standard HDL tools by an FPGA developer
Channels Advantage
[Figure: two board diagrams side by side, each showing DDR3 and QDRII interfaces, CvP update, 10Gb network interfaces, the interconnect, and OpenCL kernels]
Start with OpenCL ready platforms 1/2
Start with OpenCL ready platforms 2/2
Shared Virtual Memory (SVM) Platform Model
CAPP PSL
CAPI
PCIe
QPI
VIP based BSP Customization
[Figure: video input path (SDI RX, Native PHY, CVI, VFB, DC FIFO) delivering video to the kernel; video output path (DC FIFO, VFB, CVO, Native PHY, SDI TX) taking video from the kernel; EMIF controller to external memory]
Live Demo
Developing a Custom OpenCL BSP
Deep Dive
Recommended Hardware
Development system
Available PCIe slot (if using PCIe-based accelerator card)
x86 based development system
Altera device documentation defines minimum recommended system RAM
Software Requirements
Quartus Prime
Accelerator devices installed
Quartus Prime license
AOCL Utility
Perform various tasks related to the board, drivers, and compile process
Altera OpenCL (AOCL) Utility
aoc Output Files
<kernel file>.aoco
Intermediate object file representing the created hardware system
<kernel file>.aocx
Kernel executable file used to program FPGA
Compiling the Host Program
Custom Platforms
Framework of host software and FPGA interface design to enable the use of
OpenCL on a custom board
Host Software
Custom vs. Preferred Platform
Custom Platform BSP Overview
Goals
Allow Altera® SDK for OpenCL™ to automatically create FPGA images from OpenCL kernel C code
for custom boards
Allow the compilation of OpenCL host code to easily run kernels on the FPGA board
Tools
Custom Platform Toolkit
Use one of the reference platforms as a starting point
Network Reference Platform
High Performance Computing (HPC) Reference Platform
FPGA design, software, and board bring up skills required
Reference Platforms
Custom Platform Development Support
Developing a Custom OpenCL BSP
1. These features apply to Quartus Prime Pro, which only supports Arria 10 devices and newer
Partitions and blocks
Setting Partitions¹
[Figure: entity hierarchy tree with root at the top, entities a, b, c below it, then d to h, then i to l]
1. Partitions must be set in the QSF file at this time. Partitions will be supported in the GUI in a future version of Quartus software.
Partial Reconfiguration (PR)
[Figure: a PR region with three personas: Persona A, Persona B, Persona C]
Static region
Remains constant across all PR personas
Part(s) of the design not changed by PR
Essentially the BSP
PR partition
A design partition targeted for PR
PR region
A physical location assigned to a PR partition
Contains the kernels generated by the aoc compiler
Persona
One of the variations in functionality that a PR region can take
A PR region may have more than 1 persona
Freeze Wrapper (discussed later)
Developing a Custom OpenCL BSP
Hardware Development
Hardware Procedure - Setup Environment
<platform name="windows64">
<mmdlib>%b/windows64/bin/altera_a10_ref_mmd.dll</mmdlib>
<linkflags>/libpath:%b/windows64/lib</linkflags>
<linklibs>altera_a10_ref_mmd.lib</linklibs>
<utilbindir>%b/windows64/libexec</utilbindir>
</platform>
</board_env>
<?xml version="1.0"?>
<board version="15.1" name="my_board">
OpenCL Host
PCIe-based host that connects to the Arria 10 PCIe Gen3 x8 Hard IP core
Contents of the Arria 10 Reference Platform
\hardware
Contains the Quartus Prime project templates for three board variants
Each board variant implements the entire OpenCL hardware system on a given kit
\windows64, /linux64
Contains the MMD library, kernel mode driver, and executable files of the AOCL utilities (that is, install, uninstall, flash, program, diagnose) for the OS
\source_windows64
Contains source code for the MMD library and AOCL utilities
The MMD library and the AOCL utilities are in the windows64 folder
/source
Contains source code for the MMD library and AOCL utilities
The MMD library and the AOCL utilities are in the linux64 directory
board_env.xml
eXtensible Markup Language (XML) file that describes the Reference Platform to the Altera SDK for OpenCL
Contents of Each Board Variant Directory
File | Description
quartus.ini | Contains any special Quartus Prime software options that you need to compile OpenCL kernels for the Reference Platform.
system.qsys | Legacy file that you must update with interfaces to match those defined in the board_spec.xml file, so that the compilation flow works properly. The compilation process does not include the system.qsys file in the OpenCL hardware system.
board.qsys | Qsys system that implements the board interfaces (that is, the static region) of the OpenCL hardware system.
top.qpf | Quartus Prime Project File for the OpenCL hardware system.
top.qsf | Quartus Prime Settings File for the AOCL-user compilation flow.
top.sdc | Synopsys Design Constraints File that contains board-specific timing constraints.
top.v | Top-level Verilog Design File for the OpenCL hardware system.
top_synth.qsf | Quartus Prime Settings File for the Quartus Prime revision in which the OpenCL kernel system is synthesized.
base.qsf | Quartus Prime Settings File for the base project revision. Use this revision when porting the Arria 10 Reference Platform to your own custom BSP. The Quartus Prime Pro Edition software compiles this base project revision from source code.
Do not try to compile the BSP project in the Quartus Prime software!
Hardware System Overview
Altera SDK for OpenCL-Specific Qsys Components
Required
OpenCL Clock Generator
OpenCL Kernel Interface
OpenCL Bank Divider
Altera Interface IP
PCI Express Hard IP
DDR Controller
QDR Controller
Altera Supporting IP
Avalon-MM Pipeline Bridge
Scatter Gather DMA
Uniphy Status Component
ACL Version ID
Reset Components
OpenCL Clock Generator
[Figure: kernel_clk_gen block diagram; labels: clk, reset, ctrl, PLL ROM, PLL Reset, PLL Lock, kernel_pll_locked]
OpenCL Kernel Interface
[Figure: OpenCL Kernel Interface block diagram; labels: Slave, SW Reset, sw_reset, Sys. Desc. ROM, kernel_reset, Version ID]
Hard IP for PCI Express
[Figure: OpenCL device; labels: FPGA, CU (×2), Interface, Global Memory 1, Global Memory 2, QDR SRAM, DDR3 SDRAM]
OpenCL Memory Bank Divider
[Figure: memory_bank_divider block diagram; labels: acl_bsp_memorg_host, s, Memory Splitter, bank1, bank2, bankn, clk, reset, kernel_clk, kernel_reset, Snoop Adapter, acl_bsp_snoop]
OpenCL SGDMA Controller
Connect to
Host PCIe Rx Slave
All global memories
Through Memory Bank Divider if used
Avalon-ST Interface
Fully synchronous
Supports simple and complex interface requirements
Source interface
Launches data on rising edges of associated clock
Sink interface
Latches data on rising edges of associated clock
See the Custom IP Development Using Avalon and AXI Interfaces online training, or consult the Avalon Interface Specifications document
Avalon-ST Interface Signals
Signal | Width | Direction | Description
ready | 1 | Sink → Source | Indicates the sink can accept data (backpressure control)
channel | 1-128 | Source → Sink | Channel number for data being transferred (if multiple channels supported)
error | 1-256 | Source → Sink | Bit mask marks errors affecting the data being transferred
Grayed out signals are not supported by OpenCL channels
Simple Streaming Examples
Another example
32b inverter block in datapath in Qsys system
ready used to “throttle” the transfer
[Figure: inverter block with a sink interface (ready, valid, data) on its input side and a source interface (data, valid, ready) on its output side]
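The throttling role of ready can be modeled in software. The sketch below is a cycle-by-cycle model under simplifying assumptions: the source always has valid data, a word moves in any cycle where valid and ready are both high, and the inverter itself adds no latency. `stream_invert` and its parameters are hypothetical names for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pushes up to n words through a 32-bit inverter. sink_ready[c] is 1 when
   the downstream sink accepts data in cycle c. Returns the number of words
   transferred within n_cycles. */
size_t stream_invert(const uint32_t *in, uint32_t *out, size_t n,
                     const int *sink_ready, size_t n_cycles)
{
    size_t sent = 0;
    size_t cycle;

    assert(in != NULL && out != NULL && sink_ready != NULL);

    for (cycle = 0; cycle < n_cycles && sent < n; cycle++) {
        int valid = 1;                      /* source always offers data */
        if (valid && sink_ready[cycle]) {   /* transfer on valid && ready */
            out[sent] = (uint32_t)~in[sent];
            sent++;
        }
        /* otherwise the word is held until the sink raises ready */
    }
    return sent;
}
```

With the ready pattern 1, 0, 1, 0, 0, 1, three words take six cycles to cross the block, because no data moves in the cycles where ready is low.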
Hardware Procedure – Modify the Platform
_hw.tcl File
If New Component is an IO Channel
system.qsys file
Guaranteed Timing Closure
[Figure: reconfigurable kernel PLL clocking two kernel compute engines]
Developing a Custom OpenCL BSP
Software Development
Board XML Files Overview
board_env.xml
Describes the properties of your platform
e.g. library location, utility directory
board_spec.xml
Contains metadata describing your hardware system
e.g. memory properties, device resources used, interfaces, etc
Board Environment XML
Board Description File – board_env.xml
<?xml version="1.0"?>
<board_env version="15.1" name="MyPlatformName">
<hardware dir="hardware" default="MyBoard"></hardware>
<platform name="linux64">
<mmdlib>%b/linux64/lib/libaltera_a10_ref_mmd.so</mmdlib>
<linkflags>-L%b/linux64/lib</linkflags>
<linklibs>-laltera_a10_ref_mmd </linklibs>
<utilbindir>%b/linux64/libexec</utilbindir>
</platform>
<platform name="windows64">
<mmdlib>%b/windows64/bin/altera_a10_ref_mmd.dll</mmdlib>
<linkflags>/libpath:%b/windows64/lib</linkflags>
<linklibs>altera_a10_ref_mmd.lib</linklibs>
<utilbindir>%b/windows64/libexec</utilbindir>
</platform>
</board_env>
%a references the AOCL installation directory (e.g. c:\altera\15.1\hld)
%b references your BSP installation directory (e.g. c:\altera\15.1\hld\board\MyPlatform)
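To make the placeholder substitution concrete, here is a small C sketch that expands %a and %b in a path template. It only illustrates the idea and is not how the SDK itself processes the file; `expand_board_path` is a hypothetical helper, and the directories in the usage note are made-up examples.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies tmpl to out, replacing "%a" with aocl_dir and "%b" with bsp_dir.
   Returns 0 on success, -1 if out (of size out_len) is too small. */
int expand_board_path(const char *tmpl, const char *aocl_dir,
                      const char *bsp_dir, char *out, size_t out_len)
{
    size_t used = 0;
    const char *p = tmpl;

    assert(tmpl != NULL && aocl_dir != NULL && bsp_dir != NULL && out != NULL);

    while (*p != '\0') {
        const char *piece = p;
        size_t piece_len = 1;           /* default: copy one character */

        if (p[0] == '%' && (p[1] == 'a' || p[1] == 'b')) {
            piece = (p[1] == 'a') ? aocl_dir : bsp_dir;
            piece_len = strlen(piece);
            p += 2;                     /* skip the two-character token */
        } else {
            p += 1;
        }
        if (used + piece_len >= out_len)
            return -1;                  /* no room left for the final NUL */
        memcpy(out + used, piece, piece_len);
        used += piece_len;
    }
    if (used >= out_len)
        return -1;
    out[used] = '\0';
    return 0;
}
```

For instance, with a BSP directory of `/opt/aocl/board/MyPlatform`, the template `%b/linux64/lib/libaltera_a10_ref_mmd.so` expands to `/opt/aocl/board/MyPlatform/linux64/lib/libaltera_a10_ref_mmd.so`.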
board_env.xml Elements and Attributes
linkflags: Linker flags necessary to statically link with the MMD layer
Testing board_env.xml
Board Spec XML File (1)
<?xml version="1.0"?>
<board version="0.9" name="MyBoard">
<device device_model="10ax115s2f45i2sges_dm.xml">
<used_resources>
<alms num="45000"/>
<ffs num="117500"/>
<dsps num="0"/>
<rams num="583"/>
</used_resources>
</device>
<channels>
<interface name="udp_0" port="udp0_out" type="streamsource" width="256" chan_id="eth0_in"/>
<interface name="udp_0" port="udp0_in" type="streamsink" width="256" chan_id="eth0_out"/>
</channels>
Board Spec XML File (2)
<host>
<kernel_config start="0x00000000" size="0x0100000"/>
</host>
<interfaces>
<interface name="board" port="kernel_cra" type="master" width="64" misc="0"/>
<interface name="board" port="kernel_irq" type="irq" width="1"/>
<interface name="board" port="acl_internal_snoop" type="streamsource" enable="SNOOPENABLE" width="33" clock="board.kernel_clk"/>
<kernel_clk_reset clk="board.kernel_clk" clk2x="board.kernel_clk2x" reset="board.kernel_reset"/>
</interfaces>
</board>
board_spec.xml Elements and Attributes
[channels] interface: Characteristics of each channel interface for direct kernel-to-I/O accesses
Board XML Files Review
board_env.xml
One needed for each platform
Describes the properties of your platform
e.g. library location, utility directory
board_spec.xml
One needed for each board within the platform
Contains metadata describing your hardware system
Memory properties
Channel properties
Device resources used
Control interfaces
Compile properties
etc.
Memory-Mapped Devices (MMD) Software Layer
MMD API
get_offline_info get_info
set_status_handler set_interrupt_handler
open close
read write
copy yield
shared_mem_alloc shared_mem_free
reprogram
AOCL Utilities
Summary
OpenCL + FPGA Key Benefits
Heterogeneous IO interface
Multiple 10G Ethernet
SDI, HDMI, A/D Interface
Q&A