Ecole Numérique 2016

IN2P3, Aussois, 21 Juin 2016

OpenCL On FPGA

Marc Gaucheron
INTEL Programmable Solution Group
Agenda

FPGA architecture overview


Conventional way of developing with FPGA
OpenCL: abstracting FPGA away
ALTERA BSP: abstracting FPGA development
Live Demo
Developing a Custom OpenCL BSP

2
FPGA architecture overview
FPGA Architecture: Fine-grained Massively Parallel

Millions of reconfigurable logic elements
Thousands of 20Kb memory blocks
Thousands of Variable Precision DSP blocks
Dozens of High-speed transceivers
Multiple High Speed configurable Memory Controllers
Multiple ARM® Cores

[Figure: device floorplan with I/O on each of its four sides]

Let’s zoom in
FPGA Architecture: Basic Elements

Basic Element
 - 1-bit configurable operation
 - 1-bit register (store result)

Configured to perform any 1-bit operation:
AND, OR, NOT, ADD, SUB
FPGA Architecture: Flexible Interconnect

Basic Elements are surrounded with a flexible interconnect

Wider custom operations are implemented by configuring and interconnecting Basic Elements

FPGA Architecture: Custom Operations Using Basic Elements

 - 16-bit add
 - 32-bit sqrt
 - Your custom 64-bit bit-shuffle and encode
FPGA Architecture: Memory Blocks

[Figure: a 20 Kb Memory Block with addr, data_in and data_out ports]

Can be configured and grouped using the interconnect to create various cache architectures
 - Lots of smaller caches
 - Few larger caches
FPGA Architecture: Floating Point Multiplier/Adder Blocks

Dedicated floating point multiply and add blocks

[Figure: a block with data_in and data_out ports]
DSP block architecture

Floating-Point DSP Fixed-Point DSP

12
Elementary math functions supporting floating point
Coverage of ~70 elementary math functions
Patented & published efficient mapping to FPGA hardware
 - Polynomial approximation, Horner’s method, truncated multipliers, …
Compliant to OpenCL & IEEE754 accuracy standards
Rounding mode options for fundamental operators
Half- to Double-precision

Trigonometrics misc
 - FPHypot, FPRangeReduction

Basic Floating Point
 - FPAdd, FPAddExpert, FPAddN, FPSubExpert, FPAddSub, FPAddSubExpert, FPFusedAddSub
 - FPMul, FPMulExpert, FPConstMul, FPAcc
 - FPSqrt, FPDivSqrt, FPRecipSqrt, FPCbrt, FPDiv, FPInverse
 - FPFloor, FPCeil, FPRound, FPRint, FPFrac, FPMod, FPDim, FPAbs
 - FPMin, FPMax, FPMinAbs, FPMaxAbs, FPMinMaxFused, FPMinMaxAbsFused
 - FPCompare, FPCompareFused

Trigonometrics of pi*x
 - FPSinPiX, FPCosPiX, FPTanPiX, FPCotPiX

Inverse trigonometric functions
 - FPArcsinX, FPArcsinPi, FPArccosX, FPArccosPi, FPArctanX, FPArctanPi, FPArctan2

Exp, Log and Power
 - FPLn, FPLn1px, FPLog10, FPLog2
 - FPExp, FPExpFPC, FPExpM1, FPExp2, FPExp10, FPPowr

Conversion
 - FXPToFP, FPToFXP, FPToFXPExpert, FPToFXPFused, FPToFP

Trig with argument reduction
 - FPSinX, FPCosX, FPSinCosX, FPTanX, FPCotX

Macro Operators
 - FPFusedHorner, FPFusedHornerExpert, FPFusedHornerMulti, FPFusedMultiFunction

[Figure legend: "Fixed and floating point" versus "Floating point only"]
FPGA Architecture: Configurable Connectivity = Efficiency

Blocks are connected into a custom data-path that matches your application.

Streaming data-path more efficient than copying to/from global memory
 - 1 GHz Core Performance
 - 5.5M Logic Elements
 - Heterogeneous 3D SIP Integration
 - Up to 70% Lower Power
 - Up to 1 TB/s
 - Up to 10 TFLOPS
 - Intel 14 nm Tri-Gate
 - Most Comprehensive Security
 - Quad-Core Cortex-A53 ARM Processor
Developing with FPGA

17
Typical Programmable Logic Design Flow

Design specification

Design entry/RTL coding
 - Behavioral or structural description of design

RTL simulation
 - Functional simulation (Mentor Graphics ModelSim® or other 3rd-party simulators)
 - Verify logic model & data flow (no timing delays)

Synthesis (Mapping)
 - Translate design into device specific primitives
 - Optimization to meet required area & performance constraints
 - Quartus II synthesis, Precision Synthesis, Synplify/Synplify Pro, Design Compiler FPGA
 - Result: Post-synthesis netlist

Place & route (Fitting)
 - Map primitives to specific locations inside target technology with reference to area & performance constraints
 - Specify routing resources to be used
 - Result: Post-fit netlist

Timing analysis
 - Verify performance specifications were met
 - Static timing analysis

Gate level simulation (optional)
 - Timing simulation
 - Verify design will work in target technology

PC board simulation & test
 - Simulate board design
 - Program & test device on board
 - Use on-chip tools for debugging
Application Development Paradigm

ASIC

FPGA
Programmers

Parallel
Programmers

Standard CPU Programmers

20
The magic trick ?

HDL Coder

FpgaC

21
OpenCL Concepts

22
Setting the right expectations

We have to think data parallelism

Algorithms have to be rethought at the mathematics level.

23
OpenCL C Language

Derived from ISO C99
 - No standard C99 headers, function pointers, recursion, variable length arrays, or bit fields

Additions to the language for parallelism
 - Work-items and workgroups
 - Vector types
 - Synchronization

Address space qualifiers

Built-in functions
OpenCL Kernels: Parallel Threads

A kernel is a function executed on an accelerator device
 - Array of threads, in parallel

All threads (or work-items) execute the same code, but can take different paths

Each thread has an ID, used to
 - Select input/output data
 - Make control decisions

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
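The thread-ID idea above can be tried on an ordinary CPU. The following is a minimal sketch in plain C, not OpenCL: the names kernel_body and run_kernel are hypothetical, func is an arbitrary stand-in for the slide's func(), and the sequential loop only imitates what a device would run in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-element operation standing in for func() on the slide. */
static float func(float x) { return 2.0f * x + 1.0f; }

/* What ONE work-item does: its ID selects the data it touches. */
static void kernel_body(size_t thread_id, const float *input, float *output)
{
    float x = input[thread_id];
    float y = func(x);
    output[thread_id] = y;
}

/* Sequential stand-in for the device: runs the body once per work-item.
 * On a real accelerator these iterations would execute in parallel. */
void run_kernel(size_t global_size, const float *input, float *output)
{
    assert(input != NULL && output != NULL);
    for (size_t id = 0; id < global_size; ++id)
        kernel_body(id, input, output);
}
```

Because no iteration reads what another iteration writes, the loop order does not matter, which is the property that lets the work-items run concurrently.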
OpenCL Kernels: Divide into Workgroups

Threads in workgroups can cooperate with each other through fast local (on-chip) memory
Data Organization

27
Memory hierarchy

Thread:
 Registers

Thread:
 Private memory

Workgroups:
 Local or Shared memory

All Workgroups:
 Global memory
OpenCL: abstracting FPGA away
Altera OpenCL Program Overview

2010 Research project
 - Toronto Technology Center
2011 Development started
 - Proof of concept
 - 9 customer evaluations
2012 Early Access Program
 - Demos at Supercomputing ‘12
 - Over 60 customer evaluations
2013 First public release
 - Publicly available May 2013
 - Passed Conformance Testing (>8500 programs run properly)

Public release 13.1 (Nov 2013)
 - Channels (Streaming IO)
 - Example Designs
 - SoC Support
Release 14.0 (June 2014)
 - Platforms
 - Emulator/Profiler
 - Rapid Prototyping
Release 14.1 (Nov 2014)
 - Arria 10 support
 - Shared Virtual Memory (PoC)
Release 15.1 (Nov 2015)
 - Kernel Update
 - Library support
OpenCL Use Model: Abstracting the FPGA away

Host Code

    main() {
        read_data( … );
        manipulate( … );
        clEnqueueWriteBuffer( … );
        clEnqueueNDRangeKernel(…,sum,…);
        clEnqueueReadBuffer( … );
        display_result( … );
    }

OpenCL Accelerator Code

    __kernel void sum(__global float *a,
                      __global float *b,
                      __global float *y)
    {
        int gid = get_global_id(0);
        y[gid] = a[gid] + b[gid];
    }

Host code: Standard gcc Compiler -> EXE -> Host
Accelerator code: Altera Offline Compiler -> Verilog -> Quartus II -> AOCX -> Accelerator

SoC FPGA combines these in single device
OpenCL on GPU/Multi-Core CPU Architectures

Conceptually many parallel threads


Simplified View
 Each thread runs sequentially on a different processing element (PE)
 Fixed #s of Functional Units, Registers available on each PE
 Many processing elements are available to provide significant parallel speedup
[Figure: simplified GPU block diagram. Host CPU and Work Distribution sit above an array of processing elements (IU, SP, Shared Memory), followed by TF, TEX L1, L2 and Memory blocks.]
OpenCL on FPGA

OpenCL kernels are translated into a highly parallel circuit
 - A unique functional unit is created for every operation in the kernel
   (memory loads / stores, computational operations, registers)
 - Functional units are only connected when there is some data dependence dictated by the kernel

Pipeline the resulting circuit with a new thread on each clock cycle to keep functional units busy

Amount of parallelism is dictated by the number of pipelined computing operations in the generated hardware
Example Pipeline for Vector Add

8 threads for vector add example (thread IDs 0 1 2 3 4 5 6 7)

[Figure, animated over several slides: the threads enter a Load / Load, +, Store pipeline one after another]

On each cycle the portions of the pipeline are processing different threads
While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored
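The overlap between threads in the vector add pipeline can be reproduced with a small software model. This is a rough sketch in plain C under simplifying assumptions: exactly three stages (Load, Add, Store) of one cycle each, with the two loads happening together. The type stage_reg and the function pipelined_vector_add are hypothetical names, and real generated hardware has many more stages.

```c
#include <assert.h>

/* Pipeline register between two stages; valid == 0 means a bubble. */
typedef struct { int valid; int tid; float a, b, sum; } stage_reg;

/* Pushes n threads through a 3-stage Load -> Add -> Store pipeline and
 * returns the number of clock cycles that took. */
int pipelined_vector_add(const float *a, const float *b, float *y, int n)
{
    assert(n >= 0);
    stage_reg loaded = {0};  /* output of Load, input of Add  */
    stage_reg added  = {0};  /* output of Add, input of Store */
    int next_tid = 0, stored = 0, cycles = 0;

    while (stored < n) {
        /* Stages are evaluated last-to-first so that each one sees the
         * value its predecessor produced on the PREVIOUS cycle. */
        if (added.valid) {                 /* Store stage */
            y[added.tid] = added.sum;
            ++stored;
        }
        added = loaded;                    /* Add stage */
        if (added.valid)
            added.sum = added.a + added.b;
        loaded.valid = next_tid < n;       /* Load stage */
        if (loaded.valid) {
            loaded.tid = next_tid;
            loaded.a = a[next_tid];
            loaded.b = b[next_tid];
            ++next_tid;
        }
        ++cycles;
    }
    return cycles;
}
```

In this model, on the third cycle thread 2 is loaded while thread 1 is added and thread 0 is stored, and 8 threads finish in 8 + 2 cycles rather than the 3 x 8 a one-thread-at-a-time schedule would need.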
Mapping a simple program to an FPGA

High-level code

    Mem[100] += 42 * Mem[101]

CPU instructions

    R0 <- Load Mem[100]
    R1 <- Load Mem[101]
    R2 <- Load #42
    R2 <- Mul R1, R2
    R0 <- Add R2, R0
    Store R0 -> Mem[100]
CPU activity, step by step

[Figure: the six instructions above execute one after another along the Time axis, each passing through the same CPU hardware]

On the FPGA we unroll the CPU hardware…

[Figure: the same six instructions laid out along the Space axis, each with its own copy of the hardware]

… and specialize by position

1. Instructions are fixed. Remove “Fetch”
2. Remove unused ALU ops
3. Remove unused Load / Store
4. Wire up registers properly! And propagate state.
5. Remove dead data.
6. Reschedule!
Custom data-path on the FPGA matches your algorithm!

High-level code

    Mem[100] += 42 * Mem[101]

Custom data-path

[Figure: two loads and the constant 42 feed the arithmetic, whose result goes to a store]

Build exactly what you need:
 - Operations
 - Data widths
 - Memory size & configuration

Efficiency:
 - Throughput / Latency / Power
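The two views of Mem[100] += 42 * Mem[101] can be put side by side in plain C. This is only an illustration of the idea, not compiler output: cpu_style mirrors the six-instruction sequence with a shared register file, while datapath_style names each intermediate value once, the way wires in a custom data-path would carry it. Both function names are hypothetical.

```c
#include <assert.h>

/* CPU view: six instructions stepping through a shared register file. */
void cpu_style(int *mem)
{
    assert(mem != 0);
    int r0, r1, r2;
    r0 = mem[100];      /* R0 <- Load Mem[100]  */
    r1 = mem[101];      /* R1 <- Load Mem[101]  */
    r2 = 42;            /* R2 <- Load #42       */
    r2 = r1 * r2;       /* R2 <- Mul R1, R2     */
    r0 = r2 + r0;       /* R0 <- Add R2, R0     */
    mem[100] = r0;      /* Store R0 -> Mem[100] */
}

/* Data-path view: each value is produced once and consumed directly.
 * The two loads do not depend on each other, so hardware could issue
 * them side by side; the constant 42 needs no load at all. */
void datapath_style(int *mem)
{
    assert(mem != 0);
    const int load_a  = mem[100];
    const int load_b  = mem[101];
    const int product = 42 * load_b;
    mem[100] = load_a + product;
}
```

Both functions compute the same result; the second simply makes the data dependencies explicit, which is the information a hardware generator needs in order to drop the fetch logic and unused registers.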
What Hardware do we produce?

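[Figure: block diagram of the generated hardware]
 - CRA, with PCIe on the host side
 - OpenCL-specific iterators
 - Flow control logic, including a "Done?" check
 - Load, Load, Mult-add, Store pipeline
 - Interconnect to DDRx memory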
ALTERA SDK for OpenCL

Development Flow & Features

50
OpenCL CAD Flow

[Figure, repeated over four slides with one stage highlighted each time]
 - Host path: mm_host.c -> C compiler -> program.exe, using the ACL runtime Library
 - Kernel path: mm_kernel.cl -> CLANG front end -> Unoptimized LLVM IR -> Optimizer -> Optimized LLVM IR -> RTL generator -> Verilog -> Quartus
 - Third Party or Academic Tools shown alongside the LLVM IR stages
 - System Description, QSYS, DDR* and PCIe shown alongside the RTL generator and Quartus

Front End
 - Parses OpenCL extensions and intrinsics
 - Produces LLVM IR

Middle End
 - Code optimizations such as loop unrolling and branch elimination, leading to more efficient HW

Back End
 - Conversion of intermediate representation into custom generated pipelined hardware

System Architecture Gen
 - Create interfaces to the outside world (Kernel IP to QSYS Interconnect)
 - Needs to meet timing without user intervention
OpenCL Kernel Development Flow

Modify kernel.cl, then iterate through:
 1. x86 Emulator (sec): functional bugs?
 2. Optimization Report (sec): stall-free pipeline?
 3. Profiler (hours): memory coalesced? Hardware performance met?

DONE!
x86 emulator

Enable functional debug of kernel code on an x86 system
 - Prototype support to allow users to run kernels on an x86 platform
 - Debug support for Altera vendor-specific extensions such as channels

    kernel void accel(…) {
        …
        gid = get_global_id(0);
        out[gid] = proc(data[gid]);
    }

[Figure: the kernel passes through the x86 Kernel Compiler and runs as ./kernel_tb…, printing "Running …"]

Supports
 - OpenCL syntax
 - Channels
 - Printf
Profiler

Instrument the pipeline with performance counters and profiling logic

Transfer the profiling information to the host via PCIe link

    kernel void accel(…) {
        gid = get_global_id(0);
        out[gid] = a[gid]+b[gid];
    }

[Figure: Kernel Pipeline (Load, Load, +, Store) alongside Memory Mapped Registers]
Guaranteed Timing Flow
Inputs to AOC: kernel.cl, board_spec.xml, post-fit QXP partition (PCIe, UniPHY, DMA, …)

1. Synthesis / P&R / STA on the OpenCL Kernels ONLY
2. Meet timing?
   - Yes: DONE!
   - No: reconfigure the kernel PLL, re-run STA with the new PLL value, then DONE!
Optimization Report Example: Load to Store dependency
1 kernel void prefixsum( global int* restrict A, unsigned N ) {
2   for ( unsigned i = 1 ; i < N ; i++ ) {
3     int a = A[i-1];
4     A[i] += a;
5   }
6 }

==============================================================================
| *** Optimization Report *** |
==============================================================================
| Kernel: prefixsum | Ln.Col |
==============================================================================
| Loop for.body | 2.25 |
| Pipelined execution inferred. | |
| Successive iterations launched every 321 cycles due to: | |
| | |
| Memory dependency on Load Operation from: | 3.21 |
| Store Operation | 4.7 |
| Largest Critical Path Contributors: | |
| 49%: Load Operation | 3.21 |
| 49%: Store Operation | 4.7 |
==============================================================================

Slide annotations:
 - Relative cost of global memory to local computation
 - True fix requires restructuring the code
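The slide notes that the true fix requires restructuring the code but does not show one. The following is one possible restructuring, written as plain C so it can be checked on a CPU; it is an assumption of this write-up, not the author's solution. The running sum is kept in a local variable so that no iteration has to load the value the previous iteration just stored. The function names are hypothetical, and whether a given compiler version then reports a shorter launch interval is not confirmed by the source.

```c
#include <assert.h>

/* Original form from the slide: every iteration reloads A[i-1],
 * which the previous iteration has only just written. */
void prefixsum_reload(int *A, unsigned N)
{
    for (unsigned i = 1; i < N; i++) {
        int a = A[i - 1];
        A[i] += a;
    }
}

/* Restructured form (assumed fix): the running sum lives in a local
 * variable, so the loop-carried value never goes through memory. */
void prefixsum_running(int *A, unsigned N)
{
    assert(A != 0 || N == 0);
    if (N == 0)
        return;
    int running = A[0];
    for (unsigned i = 1; i < N; i++) {
        running += A[i];   /* depends only on a register and a fresh load */
        A[i] = running;    /* store is never read back by the loop */
    }
}
```

Both versions produce the same array; the difference is only in where the loop-carried dependency lives.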
Optimization Report Example: Accumulating a value

1 kernel void test( global float* restrict input,
2                   global float* restrict output, unsigned N )
3 {
4   float mul = 1.0f;
5   for ( unsigned i = 0; i < N; i++ ) {
6     mul *= input[ i ];
7   }
8   *output = mul;
9 }
==================================================================================
| *** Optimization Report *** |
==================================================================================
| Kernel: test | Ln.Col |
==================================================================================
| Loop for.body | 5.24 |
| Pipelined execution inferred. | |
| Successive iterations launched every 3 cycles due to: | |
| | |
| Data dependency on variable mul | 4.10 |
| Largest Critical Path Contributor: | |
| 100%: Fmul Operation | 6.7 |
==================================================================================
60
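The report attributes the 3-cycle launch interval to the data dependency on mul. One common way to loosen such a dependency, sketched here in plain C, is to rotate through several independent partial products so that iteration i depends only on iteration i - LANES. The choice LANES = 3 is an assumption made to match the interval in the report, and product_interleaved is a hypothetical name; the slides do not prescribe this technique.

```c
#include <assert.h>

#define LANES 3  /* assumed to match the 3-cycle interval in the report */

/* Multiplies all N inputs using LANES independent partial products. */
float product_interleaved(const float *input, unsigned N)
{
    assert(input != 0 || N == 0);
    float partial[LANES];
    for (unsigned k = 0; k < LANES; k++)
        partial[k] = 1.0f;

    /* Iteration i updates lane i % LANES, so it only has to wait for
     * the multiply issued LANES iterations earlier. */
    for (unsigned i = 0; i < N; i++)
        partial[i % LANES] *= input[i];

    /* Short final reduction over the lanes. */
    float mul = 1.0f;
    for (unsigned k = 0; k < LANES; k++)
        mul *= partial[k];
    return mul;
}
```

Note that this changes the order of the floating-point multiplications, so the result can differ from the original loop in the last bits when the inputs are not exactly representable.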
Architecture Visualizer (Hidden)

Hierarchical
and interactive

61
Detailed Area Report (aocl analyze-area)

Per-line area break-down.
 - Very useful, as a single careless line of code can burn many FPGA resources.

62
Additional Altera OpenCL Collateral
White papers on OpenCL
OpenCL online demos
OpenCL design examples
Instructor-Led training
 Parallel Computing with OpenCL Workshop by Altera – (1 Day)
 Optimization of OpenCL for Altera FPGAs Training by Altera – (1 Day)

Online training
 Introduction to Parallel Computing with OpenCL
 Writing OpenCL Programs for Altera FPGAs
 Running OpenCL on Altera FPGAs
 Single-Threaded vs. Multi-Threaded Kernels
 Building Custom Platforms for Altera SDK for OpenCL

OpenCL board partners page

63
ALTERA BSP: abstracting FPGA development
An adaptable Board Support Package

[Figure: board diagram split into two parts]

Prebuilt BSP with standard HDL tools by FPGA developer (IO infrastructure):
 - DDR: 2 x DDR3 Memory Interface
 - QDR: 2 x QDRII Memory Interface
 - A/D: JESD204
 - SDI: XCVRs
 - 10G Network: 2 x 10Gb MAC/UOE Data Interface
 - Host: PCIe gen3x8 Host Interface

OpenCL Domain, built with Altera OpenCL Compiler:
 - Kernel IP blocks and Interconnect
Channels Advantage

[Figure: two copies of the same board diagram (DDR3, QDRII, 10Gb network and host interfaces, CvP update, interconnect, OpenCL kernels), one labeled "Standard OpenCL" and the other "Altera Vendor Extension: IO and Kernel Channels"]
Start with OpenCL ready platforms 1/2

HPC Applications Network Applications

67
Start with OpenCL ready platforms 2/2

68
Shared Virtual Memory (SVM) Platform Model

OpenCL 1.2
 - Traditional Hosted Heterogeneous Platform
 - [Figure: Host CPU with Host Memory; devices DEV 1 … N, each with its own Global Memory]

OpenCL 2.0
 - New Hosted Heterogeneous Platform with SVM
 - [Figure: Host CPU with Host Memory; Shared Virtual Memory; devices DEV 1 … N, each with Global Memory]

[Figure labels for the host-device links: CAPP, PSL, CAPI, PCIe, QPI]
VIP based BSP Customization

[Figure: block diagram of a video-capable BSP]
 - Video in path: Native PHY (RX), SDI (RX), CVI, VFB, DC FIFO, then video to kernel
 - Video out path: video from kernel, then DC FIFO, VFB, CVO, SDI (TX), Native PHY (TX)
 - Kernel, Kernel PLL and Sys PLL
 - HPS, Bridges and Adapters
 - EMIF Controller and External Memory
Live Demo
Developing a Custom OpenCL BSP

Deep Dive

72
Recommended Hardware

Development system
 Available PCIe slot (if using PCIe-based accelerator card)
 x86 based development system
 Altera device documentation defines minimum recommended system RAM

FPGA accelerator card


 PCIe interface or SoC
 DDR3 or DDR4 External Memory
 Embedded USB blaster or JTAG header

73
Software Requirements

Operating system: 64-bit [1]
 - Microsoft 64-bit Windows 7 on the x86-64 architecture
 - Red Hat Enterprise 64-bit Linux (RHEL) 6.0 on the x86-64 architecture

Quartus Prime
 - Accelerator devices installed
 - Quartus Prime license

Altera SDK for OpenCL
 - Must match Quartus Prime version
 - Altera SDK for OpenCL License

C compiler for host code
 - E.g. Microsoft® Visual Studio or GCC
 - Needed to compile the host program
 - Able to compile and link 64-bit code (except when targeting a SoC host)

[1] CentOS, Ubuntu and Windows 8 supported in a future version of the Quartus software
SDK Components

AOCL Utility
 Perform various tasks related to the board, drivers, and compile process

Altera Offline Compiler (AOC)


 Translates your OpenCL C kernel source file into an FPGA hardware image ready to be loaded onto
the Altera FPGA
Host Libraries
 Provides the OpenCL host Platform API and Runtime API to be used by OpenCL host applications
 Libraries for the host program to link to

75
Altera OpenCL (AOCL) Utility

Custom Platforms must support a set of aocl utilities
 - Executables delivered in a subdirectory within the Custom Platform files
 - AOCL looks for corresponding executables when respective aocl calls are made

install (aocl install)
 - Installs driver into the host operating system
uninstall (aocl uninstall)
 - Removes driver from the host operating system
program (aocl program <device> <kernel_file>.aocx)
 - Programs the FPGA using the provided aocx file
flash (aocl flash <device> <kernel_file>.aocx)
 - Programs base programming image into Flash
diagnose (aocl diagnose [<device_name>])
 - Confirms board functionality
aoc Output Files

<kernel file>.aoco
 - Intermediate object file representing the created hardware system

<kernel file>.aocx
 - Kernel executable file used to program FPGA

<kernel file> folder
 - <kernel file>.log
   - Main compile log including estimated resource usage, optimization report, and compile messages
 - Quartus project
   - Project files
   - Source files
   - Timing reports
 - quartus_sh_compile.log
   - Output from the Quartus software compile
   - Useful for error checking
Compiling the Host Program

Use a conventional C compiler (Visual Studio/GCC)

Add %ALTERAOCLSDKROOT%/host/include to your file search path
 - Recommended to use aocl compile-config

Include CL/opencl.h in your source code

Link to Altera OpenCL libraries
 - Link to libraries located in the %ALTERAOCLSDKROOT%/host/<OS>/lib directory
 - Recommended to use aocl link-config
Custom Platforms

Framework of host software and FPGA interface design to enable the use of
OpenCL on a custom board

[Figure: software and hardware stack]
 - Host Software: User OpenCL host application, OpenCL Lib, HAL, MMD Interface
 - FPGA Board Hardware: DMA, DDR / QDR, OpenCL kernel, IP/XCVR
 - Legend: User Application, User-Provided custom BSP, Provided by Altera
Custom vs. Preferred Platform

Description                             Custom    Preferred
Flexibility                             High      Low
Supports Custom Boards                  Yes       No
Supports Custom Interfaces              Yes       No
Design skills required                  Many      Fewer
HDL coding skills required              Yes       No
High-speed interface skills required    Yes       No
Software coding skills required         More      Some
Development Time                        Higher    Lower
Qsys system design effort required      Yes       No
Floorplanning effort required           Yes       No
Custom Platform BSP Overview

Goals
 Allow Altera® SDK for OpenCL™ to automatically create FPGA images from OpenCL kernel C code
for custom boards
 Allow the compilation of OpenCL host code to easily run kernels on the FPGA board

Tools
 - Custom Platform Toolkit
 - Use one of the reference platforms as a starting point
   - Network Reference Platform
   - High Performance Computing (HPC) Reference Platform
 - FPGA design, software, and board bring-up skills required

Not required if using an Altera Preferred Board for OpenCL
 - Download BSP from the board vendor
Reference Platforms

Stratix V Network Reference Platform
 - User Guide
 - Download from OpenCL board platforms landing page

Stratix V Reference Platform
 - Ships with the Altera SDK for OpenCL

Cyclone V SoC Reference Platform
 - User Guide
 - Ships with the Altera SDK for OpenCL

Template project - Custom BSP toolkit deliverable
 - Skeleton design

Arria 10 GX Reference Platform
 - Contact your Altera representative for the latest version
Custom Platform Development Support

Custom Platform developer page
 - https://www.altera.com/products/design-software/embedded-software-developers/opencl/developer-zone.html#Custom
Custom Platform Toolkit
 - https://www.altera.com/products/design-software/embedded-software-developers/opencl/developer-zone.html#Custom
Custom Platform Toolkit User Guide
 - https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/hb/opencl-sdk/ug_aocl_custom_platform_toolkit.pdf
Developing a Custom OpenCL BSP

Important New Quartus Software Features [1]

[1] These features apply to Quartus Prime Pro, which only supports Arria 10 devices and newer

Partitions and blocks

Partitions/blocks are like instances/entities
 - Except that most of the time, they’re one-to-one
 - In PR, one partition can have multiple blocks

In fact, a partition is always an instance
 - The partition name is the instance name
 - A partition is simply an instance that cannot be dissolved

A block is the implementation of a partition
 - Local assignments
 - A netlist
 - Placement and routing information
Setting Partitions [1]

[Figure: hierarchy of entities]
    root
    a b c
    d e f g h
    i j k l

    set_instance_assignment -name PARTITION a_block -to a
    set_instance_assignment -name PARTITION c_block -to c
    set_instance_assignment -name PARTITION bf_block -to b|f
    set_instance_assignment -name PARTITION cg_block -to c|g

[1] Partitions must be set in the QSF file at this time. Partitions will be supported in the GUI in a future version of Quartus software.
Partial Reconfiguration (PR)

The ability to reconfigure (reprogram) part of the device, while the rest of the device is running

Used by the OpenCL Runtime to program kernels without disturbing the periphery

PR for Arria 10 devices is Early Access [1] only in 15.1

[Figure: a region of the device that can hold Persona A, Persona B, or Persona C]

[1] PR for OpenCL as shown is not available for previous devices. Previous devices used Configuration via Protocol (CvP) for OpenCL.
Commonly Used PR Terms

Static region
 Remains constant across all PR personas
 Part(s) of the design not changed by PR
 Essentially the BSP

PR partition
 A design partition targeted for PR

PR region
 A physical location assigned to a PR partition
 Contains the kernels generated by the aoc compiler

Persona
 One of the variations in functionality that a PR region can take
 A PR region may have more than 1 persona

Freeze Wrapper
 discussed later

88
Developing a Custom OpenCL BSP

Hardware Development
Hardware Procedure - Setup Environment

1. Copy the Arria 10 GX Reference Platform
2. Rename the directories of the platform
3. Modify the XML files
4. Modify the environment variables
5. Conduct a base compile using boardtest.cl
6. Verify timing
7. Copy the base_qhd.qar file to the custom BSP directory [*]
8. Conduct an import compile with a simple kernel
9. Verify error-free compile

[*] This step is easy to forget.
Modify board_env.xml file [1]

Modify the <your_custom_platform>/board_env.xml file to match the names of your platform and board directories

<?xml version="1.0"?>
<board_env version="15.1" name="custom_platform">
  <hardware dir="hardware" default="my_board"></hardware>
  <platform name="linux64">
    <mmdlib>%b/linux64/lib/libaltera_a10_ref_mmd.so</mmdlib>
    <linkflags>-L%b/linux64/lib</linkflags>
    <linklibs>-laltera_a10_ref_mmd</linklibs>
    <utilbindir>%b/linux64/libexec</utilbindir>
  </platform>

  <platform name="windows64">
    <mmdlib>%b/windows64/bin/altera_a10_ref_mmd.dll</mmdlib>
    <linkflags>/libpath:%b/windows64/lib</linkflags>
    <linklibs>altera_a10_ref_mmd.lib</linklibs>
    <utilbindir>%b/windows64/libexec</utilbindir>
  </platform>
</board_env>

%b references your board installation directory

[1] The board_env.xml file will be explained in more detail in a later section.
Modify board_spec.xml file [1]

Modify the <your_custom_platform>/hardware/<board variant>/board_spec.xml file to match the name of your board directory

<?xml version="1.0"?>
<board version="15.1" name="my_board">

  <compile project="top" revision="top" qsys_file="system.qsys" generic_kernel="1">
    <generate cmd="echo"/>
    <synthesize cmd="quartus_sh -t import_compile.tcl"/>
    <auto_migrate platform_type="a10_ref" >
      <include fixes=""/>
      <exclude fixes=""/>
    </auto_migrate>
  </compile>
  .
  .
  .

[1] The board_spec.xml file will be discussed in more detail in a later section.
Features of the Arria 10 Reference Platform

OpenCL Host
 PCIe-based host that connects to the Arria 10 PCIe Gen3 x8 Hard IP core

OpenCL Global Memory


 One 2-gigabyte (GB) DDR4 SDRAM daughter card

FPGA Programming via one of the following methods:


 Partial Reconfiguration (PR) over PCIe
 External cable and the Arria 10 GX FPGA Development Kit's on-board USB-Blaster® II interface
 On-board FLASH

93
Contents of the Arria 10 Reference Platform

\hardware
 - Contains the Quartus Prime project templates for three board variants
 - Each board variant implements the entire OpenCL hardware system on a given kit

\windows64 or /linux64
 - Contains the MMD library, kernel mode driver, and executable files of the AOCL utilities (that is, install, uninstall, flash, program, diagnose) for the OS

\source_windows64
 - Contains source code for the MMD library and AOCL utilities
 - The MMD library and the AOCL utilities are in the windows64 folder

/source
 - Contains source code for the MMD library and AOCL utilities
 - The MMD library and the AOCL utilities are in the linux64 directory

board_env.xml
 - eXtensible Markup Language (XML) file that describes the Reference Platform to the Altera SDK for OpenCL
Contents of Each Board Variant Directory

Option          Description
quartus.ini     Contains any special Quartus Prime software options that you need to compile OpenCL kernels for the Reference Platform.
system.qsys     Legacy file that you must update with interfaces, to match those defined in the board_spec.xml file, for the compilation flow to work properly. The compilation process does not include the system.qsys file into the OpenCL hardware system.
board.qsys      Qsys system that implements the board interfaces (that is, the static region) of the OpenCL hardware system.
top.qpf         Quartus Prime Project File for the OpenCL hardware system.
top.qsf         Quartus Prime Settings File for the AOCL-user compilation flow.
top.sdc         Synopsys Design Constraints File that contains board-specific timing constraints.
top.v           Top-level Verilog Design File for the OpenCL hardware system.
top_post.sdc    Qsys and AOCL IP-specific timing constraints.
top_synth.qsf   Quartus Prime Settings File for the Quartus Prime revision in which the OpenCL kernel system is synthesized.
base.qsf        Quartus Prime Settings File for the base project revision. Use this revision when porting the Arria 10 Reference Platform to your own custom BSP. The Quartus Prime Pro Edition software compiles this base project revision from source code.

Do not try to compile the BSP project in the Quartus Prime software!
Hardware System Overview

96
Altera SDK for OpenCL-Specific Qsys Components

Required
 OpenCL Clock Generator
 OpenCL Kernel Interface
 OpenCL Bank Divider

Altera Interface IP
 PCI Express Hard IP
 DDR Controller
 QDR Controller

Altera Supporting IP
 Avalon-MM Pipeline Bridge
 Scatter Gather DMA
 Uniphy Status Component
 ACL Version ID
 Reset Components

97
OpenCL Clock Generator

Programmable PLL to adjust kernel clock rate


Status interfaces allow software to observe the PLL

[Block diagram: kernel_clk_gen. Interfaces: clk, reset, ctrl, pll_ref_clk, kernel_pll_locked, kernel_clk, kernel_clk2x. Internal blocks: PLL ROM, PLL Reset, PLL Lock, PLL Reconfig, PLL.]

98
OpenCL Kernel Interface

Allows the host to control the kernel compute units


[Block diagram: kernel_interface. Interfaces: clk, reset, sw_reset_in, kernel_clk, ctrl, kernel_cra, sw_reset, kernel_reset, acl_bsp_memorg, kernel_irq, kernel_irq_to_host. Internal blocks: Window Bridge, SW Reset Slave, Sys. Desc. ROM, Version ID, Mem. Org. Slave, IRQ Sync.]

99
OpenCL Kernel Interface

Interface added for each global memory system

100
Hard IP for PCI Express

PCIe Hard IP handles host-to-device communication

create_clock -period 100MHz [get_ports pcie_refclk]

Modify the top.sdc file if the refclk frequency changes
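
To illustrate the point above: when the reference clock changes, the period in the constraint has to change with it. The helper below is hypothetical (it is not part of Quartus or the SDK). It converts a frequency to a period in nanoseconds, the default SDC time unit, and formats a create_clock line for the pcie_refclk port named on this slide.

```python
def create_clock_line(port: str, freq_mhz: float) -> str:
    """Format an SDC create_clock constraint with the period in nanoseconds."""
    if freq_mhz <= 0:
        raise ValueError("frequency must be positive")
    period_ns = 1000.0 / freq_mhz  # period in ns = 1000 / frequency in MHz
    return f"create_clock -period {period_ns:.3f} [get_ports {port}]"


print(create_clock_line("pcie_refclk", 100))  # 100 MHz reference clock
print(create_clock_line("pcie_refclk", 125))  # example of a changed refclk
```

A 100 MHz refclk gives a 10.000 ns period; a 125 MHz refclk would need 8.000 ns in top.sdc.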


See the 28nm PCIe online training, the PCIe instructor-led training, and the device-specific Hard IP for PCIe User Guides
101
External Memory Buffers

A BSP may support different memory device types


 Take advantage of memory device characteristics

Memory type   Latency   Density   Cost   Usage
DDR SDRAM     high      high      low    Ideal for sequential access applications such as input/output data
QDR SRAM      low       low       high   Better suited for random access applications such as look-up tables

External memory information is specified in board_spec.xml (discussed later)

[Figure: OpenCL device. Compute units (CU) in the FPGA reach Global Memory1 (QDR SRAM) and Global Memory2 (DDR3 SDRAM) through a memory interface.]

102
OpenCL Memory Bank Divider

Interface host to kernel memory


Multiple banks support interleaving memory
Provide at least one Memory Bank Divider for each memory type
Number of banks and memory type must be entered into the board_spec.xml file (discussed later)
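
The slides do not spell out how the Memory Bank Divider maps addresses to banks, so the following is only a generic sketch of round-robin interleaving, under the assumption that consecutive chunks of interleaved_bytes rotate across the banks. The function name and the two-bank setup are hypothetical; the 1024-byte chunk size matches the interleaved_bytes value used in the board_spec.xml example later in the deck.

```python
def bank_for_address(addr: int, interleaved_bytes: int, num_banks: int) -> tuple:
    """Return (bank index, byte offset inside that bank) for a flat address."""
    chunk = addr // interleaved_bytes      # index of the interleave-sized chunk
    bank = chunk % num_banks               # chunks rotate across the banks
    # Each bank stores every num_banks-th chunk back to back.
    offset = (chunk // num_banks) * interleaved_bytes + addr % interleaved_bytes
    return bank, offset


for addr in (0, 1023, 1024, 2048, 2049):
    print(hex(addr), bank_for_address(addr, 1024, 2))
```

With two banks, a sequential sweep alternates banks every 1024 bytes, which is what lets sequential traffic use both memory interfaces at once.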

[Block diagram: memory_bank_divider. Interfaces: clk, reset, kernel_clk, kernel_reset, s, acl_bsp_memorg_host, acl_bsp_snoop, bank1, bank2, ..., bankn. Internal blocks: Memory Splitter, Snoop Adapter.]
103
OpenCL SGDMA Controller

Controlled from the Host


 PCIe Tx BAR0 Master

Connect to
 Host PCIe Rx Slave
 All global memories
Through Memory Bank Divider if used

104
Avalon-ST Interface

Used by Altera OpenCL channels or OpenCL 2.0 pipes


Standard, flexible, and modular protocol for transfer of data
 Unidirectional
 Point-to-point connections between a source and a sink

 Fully synchronous
 Supports simple and complex interface requirements

Source interface
 Launches data on rising edges of associated clock

Sink interface
 Latches data on rising edges of associated clock

Data format/definition controlled by application or component

See the Custom IP Development Using Avalon and AXI Interfaces online training, or consult the Avalon Interface Specifications document
105
Avalon-ST Interface Signals

Signal type     Width    Direction       Description

Fundamental signals
ready           1        Sink → Source   Indicates the sink can accept data (backpressure control)
valid           1        Source → Sink   Qualifies all source-to-sink signals
data            1-4096   Source → Sink   Payload of the information being transmitted
channel         1-128    Source → Sink   Channel number for data being transferred (if multiple channels supported)
error           1-256    Source → Sink   Bit mask marks errors affecting the data being transferred

Packet transfer signals
startofpacket   1        Source → Sink   Marks the beginning of the packet
endofpacket     1        Source → Sink   Marks the end of the packet
empty           1-8      Source → Sink   Indicates the number of symbols that are empty during cycles that contain the end of a packet

Grayed out signals are not supported by OpenCL channels

106
Simple Streaming Examples

Simple example
 data (presents information)
 valid (indicates data is valid)
 Both signals propagate from source to sink
 Sink cannot backpressure or stall transfer if valid is asserted
[Figure: data source driving data and valid into a data sink]

Another example
 32b inverter block in datapath in Qsys system
 ready used to “throttle” the transfer

[Figure: inverter block with a sink interface (ready, valid, data) on its input side and a source interface (data, valid, ready) on its output side]
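
The second example can be sketched as a cycle-by-cycle model. This is a simplification, not the Avalon-ST specification: it assumes a ready latency of 0, so a word transfers on any cycle where valid and ready are both high. The function name and the test data are made up for illustration.

```python
def stream_through_inverter(words, sink_ready):
    """Model a 32-bit inverter sitting between a source and a sink.

    words      -- 32-bit values the source wants to send, in order
    sink_ready -- one ready value per clock cycle, driven by the sink
    Returns the inverted words the sink accepted, in order.
    """
    accepted = []
    pending = list(words)
    for ready in sink_ready:
        valid = bool(pending)             # source asserts valid while it has data
        if valid and ready:               # transfer happens on this clock edge
            data = pending.pop(0)
            accepted.append(~data & 0xFFFFFFFF)  # 32-bit bitwise inversion
        # Otherwise the source holds data and valid until ready is high.
    return accepted


out = stream_through_inverter([0x0, 0xFFFF0000, 0x12345678], [1, 0, 0, 1, 1])
print([hex(w) for w in out])
```

The two cycles with ready low delay the second word without losing it, which is the "throttle" behavior the slide refers to.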

107
Hardware Procedure – Modify the Platform

1. Open the Quartus software project in the \boardtest\boardtest directory
2. Add or remove components in the board.qsys file
3. Add or remove signals in the system.qsys and top.v files
4. Add or remove SDC constraints in the top.sdc and top_post.sdc files
5. Add or remove LogicLock Plus regions in the base.qsf file
6. Propagate all global assignments from base.qsf to the top.qsf and top_synth.qsf files
7. Copy any files modified above into your custom BSP directory (*)
8. Conduct a base compile with boardtest.cl using several seeds
9. Verify timing
10. Copy the base_qhd.qar file to your custom BSP directory (*)
11. Conduct an import compile with a simple kernel
12. Verify an error-free compile

(*) These steps are easy to forget.


108
Component Editor

Used to import components into Qsys system


Launch from Qsys IP Catalog or File menu > New Component

109
_hw.tcl File

Only file generated by Component Editor


Describes all component settings
Portability: the _hw.tcl file plus the HDL code is all that is needed to import a component into other projects
Makes component look and feel like any other component in IP Catalog
Tcl syntax discussed in Advanced Qsys Design Methodologies class

110
If New Component is an IO Channel

Add the channel to the board_spec.xml file
<interface name="board" port="leds" type="streamsink" width="16" chan_id="leds_out"/>
type="streamsink" width="16"
 Directs the compiler to make the “leds_out” interface a streaming source that is 16 bits wide (it feeds the board-side sink)
name="board" port="leds"
 Directs the compiler to connect the “leds_out” interface on the kernel.qsys system to the “leds” interface on the board.qsys system
chan_id="leds_out"
 Directs the compiler to add an exported interface to the kernel.qsys called “leds_out”.
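
As a quick way to read such an entry back, the sketch below parses the interface line from this slide with Python's standard xml.etree.ElementTree and prints what each attribute group means. The describe function is a hypothetical helper, not an SDK tool.

```python
import xml.etree.ElementTree as ET

# The channel entry from the slide, with plain ASCII quotes.
snippet = ('<interface name="board" port="leds" type="streamsink" '
           'width="16" chan_id="leds_out"/>')


def describe(iface):
    """Summarize one <interface> channel entry in a single sentence."""
    return (f'kernel channel "{iface.get("chan_id")}" connects to '
            f'{iface.get("name")}.{iface.get("port")} '
            f'({iface.get("type")}, {int(iface.get("width"))} bits)')


print(describe(ET.fromstring(snippet)))
```

Parsing the line this way also catches typographic quotes pasted from slides, since ElementTree rejects them as malformed attribute values.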

system.qsys file

111
Guaranteed Timing Closure

Some interfaces have required clock frequencies
 PCIe 125 MHz / 250 MHz
 DDR3-1600 800 MHz
 Kernel ??
The custom board developer is responsible for delivering a locked-down, timing-clean netlist for the custom platform

[Figure: Post Place & Route Partition containing PCIe, two DDR interfaces, the PLL and reconfig logic, next to the kernel region holding the Kernel Compute Engines]

112
Developing a Custom OpenCL BSP

Software Development

113
Board XML Files Overview

Platforms must include XML files
 They describe your platform to the Altera SDK for OpenCL

board_env.xml
 Describes the properties of your platform
e.g. library location, utility directory

board_spec.xml
 Contains metadata describing your hardware system
e.g. memory properties, device resources used, interfaces, etc

114
Board Environment XML

AOCL_BOARD_PACKAGE_ROOT points to the directory where the board_env.xml file is located
Sets up board installation, enabling AOC to target specific boards
Template available in the /board_package directory of the Custom Platform Toolkit

Top level elements
 hardware element
 One platform element for each supported OS

Each platform element contains
 mmdlib, linkflags, linklibs, utilbindir

115
Board Description File – board_env.xml
<?xml version="1.0"?>
<board_env version="15.1" name="MyPlatformName">
<hardware dir="hardware" default="MyBoard"></hardware>

<platform name="linux64">
<mmdlib>%b/linux64/lib/libaltera_a10_ref_mmd.so</mmdlib>
<linkflags>-L%b/linux64/lib</linkflags>
<linklibs>-laltera_a10_ref_mmd </linklibs>
<utilbindir>%b/linux64/libexec</utilbindir>
</platform>

<platform name="windows64">
<mmdlib>%b/windows64/bin/altera_a10_ref_mmd.dll</mmdlib>
<linkflags>/libpath:%b/windows64/lib</linkflags>
<linklibs>altera_a10_ref_mmd.lib</linklibs>
<utilbindir>%b/windows64/libexec</utilbindir>
</platform>

</board_env>
%a references the AOCL installation directory (e.g. c:\altera\15.1\hld)
%b references your BSP installation directory (e.g. c:\altera\15.1\hld\board\MyPlatform)
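
A minimal sketch of how the %b placeholder resolves, assuming a plain text substitution. It parses a trimmed copy of the file above (Linux platform only) with the standard library and replaces %b with a BSP root. The platform_settings helper and the /opt/bsp/MyPlatform path are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

BOARD_ENV = """<?xml version="1.0"?>
<board_env version="15.1" name="MyPlatformName">
  <hardware dir="hardware" default="MyBoard"></hardware>
  <platform name="linux64">
    <mmdlib>%b/linux64/lib/libaltera_a10_ref_mmd.so</mmdlib>
    <linkflags>-L%b/linux64/lib</linkflags>
    <linklibs>-laltera_a10_ref_mmd</linklibs>
    <utilbindir>%b/linux64/libexec</utilbindir>
  </platform>
</board_env>"""


def platform_settings(xml_text, platform, bsp_root):
    """Return {tag: value} for one <platform>, with %b replaced by bsp_root."""
    root = ET.fromstring(xml_text)
    for plat in root.findall("platform"):
        if plat.get("name") == platform:
            return {child.tag: (child.text or "").strip().replace("%b", bsp_root)
                    for child in plat}
    raise KeyError(f"no platform named {platform!r} in board_env.xml")


settings = platform_settings(BOARD_ENV, "linux64", "/opt/bsp/MyPlatform")
for tag, value in settings.items():
    print(f"{tag}: {value}")
```

Checking the resolved paths this way is a cheap sanity test before running the aocl checks described on the Testing board_env.xml slide.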

116
board_env.xml Elements and Attributes

Element      Attributes and Descriptions
board_env    version: AOCL version used to develop the platform
             name: Name of Custom Platform board directory
hardware     dir: Subdirectory containing board variants
             default: Default board variant
platform     name: Name of OS
mmdlib       Path to the dynamic MMD libraries of the Custom Platform
linkflags    Linker flags necessary to statically link with the MMD layer
linklibs     Libraries the AOCL must statically link against
utilbindir   Directory where AOCL utility executables are located (install, uninstall, program, diagnose, and flash)

117
Testing board_env.xml

1. Set AOCL_BOARD_PACKAGE_ROOT to location of the Custom Platform


2. Run aocl board-xml-test

3. Run aoc --list-boards

118
Board Spec XML File (1)

<?xml version="1.0"?>
<board version="0.9" name="MyBoard">

<device device_model="10ax115s2f45i2sges_dm.xml">
<used_resources>
<alms num="45000"/>
<ffs num="117500"/>
<dsps num="0"/>
<rams num="583"/>
</used_resources>
</device>

<!-- DDR4-2400 -->
<global_mem name="DDR" max_bandwidth="19200" interleaved_bytes="1024" config_addr="0x18">
<interface name="board" port="kernel_mem0" type="slave" width="512" maxburst="16"
address="0x00000000" size="0x80000000" latency="240" addpipe="1" />
<interface/>
</global_mem>

<channels>
<interface name="udp_0" port="udp0_out" type="streamsource" width="256" chan_id="eth0_in"/>
<interface name="udp_0" port="udp0_in" type="streamsink" width="256" chan_id="eth0_out"/>
</channels>
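
The numbers in the global_mem entry can be cross-checked with a little arithmetic. The slides do not say how max_bandwidth was derived; the sketch below assumes a 64-bit DDR4 data bus, which is an assumption on our part, and under that assumption 2400 MT/s reproduces the 19200 value. The helper name is hypothetical.

```python
def peak_bandwidth_mb_s(mega_transfers_per_s: int, bus_width_bits: int) -> int:
    """Peak bandwidth in MB/s: transfers per second times bytes per transfer."""
    return mega_transfers_per_s * bus_width_bits // 8


# DDR4-2400 on an assumed 64-bit data bus.
print(peak_bandwidth_mb_s(2400, 64))   # 19200, same as max_bandwidth

# size="0x80000000" expressed in GiB.
print(0x80000000 // 2**30, "GiB")      # 2 GiB

# Bytes moved by one maximum-length burst: width="512" bits, maxburst="16".
print((512 // 8) * 16)                 # 1024
```

In this example one maximum burst (1024 bytes) happens to equal interleaved_bytes; the slides do not state whether that match is required.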

119
Board Spec XML File (2)

<host>
<kernel_config start="0x00000000" size="0x0100000"/>
</host>

<interfaces>
<interface name="board" port="kernel_cra" type="master" width="64" misc="0"/>
<interface name="board" port="kernel_irq" type="irq" width="1"/>
<interface name="board" port="acl_internal_snoop" type="streamsource" enable="SNOOPENABLE"
width="33" clock="board.kernel_clk"/>
<kernel_clk_reset clk="board.kernel_clk" clk2x="board.kernel_clk2x" reset="board.kernel_reset"/>
</interfaces>

<compile project="top" revision="top" qsys_file="system.qsys" generic_kernel="1">
<generate cmd="echo"/>
<synthesize cmd="quartus_sh -t import_compile.tcl"/>
<auto_migrate platform_type="MyPlatform" >
<include fixes=""/>
<exclude fixes=""/>
</auto_migrate>
</compile>

</board>

120
board_spec.xml Elements and Attributes

Element       Attributes and Descriptions
board         version: AOCL version used to develop the platform
              name: Name of the current board directory
device        device_model: Device model file describing FPGA resources
              used_resources: FPGA resources used by the BSP hardware
global_mem    name, max_bandwidth, interleaved_bytes, config_addr, interface: global memory properties
host          kernel_config: Address offset where the kernel hardware resides
[channels]    interface: Characteristics of each channel interface for direct kernel-to-I/O accesses
interfaces    interface, kernel_clk_reset: Description of the kernel interfaces connecting to and controlling the kernel hardware
compile       project, qsys_file, generate cmd, etc.: Controls Quartus Prime compilation

121
Board XML Files Review

board_env.xml
 One needed for each platform
 Describes the properties of your platform
e.g. library location, utility directory

board_spec.xml
 One needed for each board within the platform
 Contains metadata describing your hardware system
Memory properties
Channel properties
Device resources used
Control interfaces
Compile properties
etc.

122
Memory-Mapped Devices (MMD) Software Layer

Software layer for communicating with the board
 Over any medium

Used by host programs and board utilities

File I/O-like interface needs to be implemented
 Read/write/open/close etc.

To be linked to by the host program
 Statically and dynamically

Software stack, top to bottom:
Runtime (OpenCL API)
HAL for memory transfers and kernel launches
MMD layer for raw read and write operations
Kernel mode driver for accessing communication medium
Board Hardware

123
MMD API

get_offline_info get_info

set_status_handler set_interrupt_handler

open close

read write

copy yield

shared_mem_alloc shared_mem_free

reprogram
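
To make the "file I/O-like" idea concrete, here is a hypothetical Python stand-in covering only the open, close, read, and write operations from the list above. It is not the real MMD API, which is delivered as a library referenced by mmdlib in board_env.xml. The class names, method signatures, and the "acl0" device name are all assumptions for illustration, and a bytearray takes the place of the board.

```python
from abc import ABC, abstractmethod


class MmdDevice(ABC):
    """Hypothetical file-I/O-like interface a board layer would implement."""

    @abstractmethod
    def open(self, name: str) -> int: ...

    @abstractmethod
    def close(self, handle: int) -> None: ...

    @abstractmethod
    def read(self, handle: int, offset: int, size: int) -> bytes: ...

    @abstractmethod
    def write(self, handle: int, offset: int, data: bytes) -> None: ...


class InMemoryDevice(MmdDevice):
    """Backs each opened device with a bytearray instead of real hardware."""

    def __init__(self, size: int = 4096):
        self._size = size
        self._devices = {}
        self._next_handle = 0

    def open(self, name):
        handle = self._next_handle
        self._next_handle += 1
        self._devices[handle] = bytearray(self._size)
        return handle

    def close(self, handle):
        del self._devices[handle]

    def read(self, handle, offset, size):
        return bytes(self._devices[handle][offset:offset + size])

    def write(self, handle, offset, data):
        self._devices[handle][offset:offset + len(data)] = data


dev = InMemoryDevice()
h = dev.open("acl0")
dev.write(h, 0x10, b"\x01\x02\x03\x04")
print(dev.read(h, 0x10, 4))
dev.close(h)
```

Keeping the interface this narrow is what lets the layers above (HAL and runtime) stay unchanged when the communication medium changes.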

124
AOCL Utilities

Custom Platforms must support a set of aocl utilities
 Executables delivered in a subdirectory within the Custom Platform files
AOCL looks for the corresponding executables when the respective aocl calls are made
install (aocl install)
 Installs driver into the host operating system
uninstall (aocl uninstall)
 Removes driver from the host operating system
program (aocl program <device> <kernel_file>.aocx)
 Programs the FPGA using the provided aocx file
flash (aocl flash <device> <kernel_file>.aocx)
 Programs base programming image into Flash
diagnose (aocl diagnose [<device_name>])
 Confirms board functionality
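
A BSP author may want a quick check that all five utilities are present in the utilbindir directory. The sketch below is a hypothetical helper, and it assumes each executable is named after its aocl subcommand (optionally with an extension such as .exe), which the slides do not confirm. The demo uses a temporary directory rather than a real BSP.

```python
import os
import tempfile

REQUIRED_UTILITIES = ("install", "uninstall", "program", "flash", "diagnose")


def missing_utilities(utilbindir: str) -> list:
    """Return the required utility names with no matching file in utilbindir."""
    present = {os.path.splitext(name)[0] for name in os.listdir(utilbindir)}
    return [u for u in REQUIRED_UTILITIES if u not in present]


# Demo against a throwaway directory that holds only two of the five.
with tempfile.TemporaryDirectory() as d:
    for name in ("program", "diagnose.exe"):
        open(os.path.join(d, name), "w").close()
    print(missing_utilities(d))
```

For the demo directory this reports install, uninstall, and flash as missing.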

125
Summary
OpenCL + FPGA Key Benefits

Faster development vs. traditional FPGA design flow
 Puts the FPGA in the software developer's hands
 Familiar C-based development flow

Heterogeneous IO interface
 Multiple 10G Ethernet
 SDI, HDMI, A/D Interface

Higher performance/watt vs. CPU/GPGPU
 Implement exactly what you need
 Pipeline parallel structures
 Custom interconnect converging with data processing cores

Portability & obsolescence-free
 Code can transfer between different HW accelerators (CPU, GPGPU, FPGA, etc.)
 Code ports seamlessly to new generations of the FPGA
 FPGA life cycle considerably longer than CPUs or GPGPUs

127
Q&A
