
How to do real-time Deep Learning on FPGAs

Introduction

Universität Zürich - Physik Institut, 6 February 2019


Motivation:
use cases at the LHC and beyond

06.02.2019 fpa4hep: real-time deep learning on FPGAs 2


Future challenges @ LHC
Extreme bunch crossing frequency of 40 MHz → extreme data rates O(100 TB/s)

LHC TODAY:
‣ ~40 collisions/event
‣ ~10 s/event processing time

HL-LHC:
‣ ~200 collisions/event
‣ more granular detector
‣ ~minutes/event processing time
‣ flat budget for computing resources
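
As a rough order-of-magnitude check of the headline rate (my own arithmetic, assuming raw event sizes of order a few MB per bunch crossing, which is not stated on this slide):

$$40\times 10^{6}\ \text{events/s} \times \mathcal{O}(\text{few MB/event}) \approx \mathcal{O}(100\ \text{TB/s})$$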

06.02.2019 fpa4hep: real-time deep learning on FPGAs 3


Future challenges @ HL-LHC
Modern machine learning methods might be the way out!

Current event reconstruction algorithms will not be sustainable

Recast the problem instead as a machine learning problem:

‣ Excellent physics performance

‣ Intrinsically parallelizable → high speed

‣ Follow industry trends in developing new devices optimized for ML and speed up the inference

[Figure: # collisions/event ~ event complexity]

06.02.2019 fpa4hep: real-time deep learning on FPGAs 4


The LHC big data problem

[Figure: CMS trigger data flow — pp collisions at 40 MHz → Level-1 Trigger → 100 kHz, 1 MB/evt → High-Level Trigger → 1 kHz → offline computing]

• Level-1 Trigger (hardware)
  - 40 MHz in / 100 kHz out
  - Absorbs 100s of TB/s
  - 99.75% of events rejected
  - Trigger decision to be made in ~4 μs
  - Coarse local reconstruction
  - FPGAs / hardware implemented

• High-Level Trigger (software)
  - 99% rejected
  - decision in ~100s of ms
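
For orientation, the quoted rejection fractions follow directly from the rate reductions in the diagram (my arithmetic, not on the slide):

$$1-\frac{100\ \text{kHz}}{40\ \text{MHz}} = 99.75\%, \qquad 1-\frac{1\ \text{kHz}}{100\ \text{kHz}} = 99\%$$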

06.02.2019 fpa4hep: real-time deep learning on FPGAs 5


The LHC big data problem

[Figure: CMS trigger data flow — pp collisions at 40 MHz → Level-1 Trigger → 100 kHz, 1 MB/evt → High-Level Trigger → 1 kHz → offline computing]

• Level-1 Trigger (hardware)
  - 99.75% rejected
  - decision in ~4 μs

• High-Level Trigger (software)
  - 100 kHz in / 1 kHz out
  - Output: ~500 kB/event
  - Processing time ~300 ms
  - 99% rejected
  - Simplified global reconstruction
  - Software implemented on CPUs

06.02.2019 fpa4hep: real-time deep learning on FPGAs 6


The LHC big data problem

[Figure: CMS trigger data flow — pp collisions at 40 MHz → Level-1 Trigger → 100 kHz, 1 MB/evt → High-Level Trigger → 1 kHz → offline computing]

• Level-1 Trigger (hardware): 99.75% rejected, decision in ~4 μs
• High-Level Trigger (software): 99% rejected, decision in ~100s of ms

• Offline computing
  - Output: max. 1 MB/event
  - Accurate global reconstruction
  - Processing time ~20 s
  - Software implemented on CPUs

06.02.2019 fpa4hep: real-time deep learning on FPGAs 7


The LHC big data problem

[Figure: CMS trigger data flow, now with the latency scale: 1 ns → 1 μs → 100 ms → 1 s]

• Level-1 Trigger (hardware): 99.75% rejected, decision in ~4 μs
• High-Level Trigger (software): 99% rejected, decision in ~100s of ms

Deploy ML algorithms very early in the game
Challenge: strict latency constraints!

06.02.2019 fpa4hep: real-time deep learning on FPGAs 8


[Figure repeated: CMS trigger data flow with the latency scale from 1 ns to 1 s]

• Level-1 Trigger (hardware): 99.75% rejected, decision in ~4 μs
• High-Level Trigger (software): 99% rejected, decision in ~100s of ms

06.02.2019 fpa4hep: real-time deep learning on FPGAs 9


Beyond LHC
Ex: self-driving cars

A single self-driving test vehicle can produce ~30 TB/day

There are over 250 million cars on the road in the US alone

If < 1% replaced by autonomous vehicles by 2020
→ HUGE amount of data generated, not manageable by central servers!

06.02.2019 fpa4hep: real-time deep learning on FPGAs 10


Beyond LHC
Ex: self-driving cars

A single self-driving test vehicle can produce ~30 TB/day

There are over 250 million cars on the road in the US alone

If < 1% replaced by autonomous vehicles by 2020
→ HUGE amount of data generated, not manageable by central servers!

Need edge computing architectures, low power and small in size, to run powerful data analytics programs on board

NB: latency matters! Even a few milliseconds of delay can result in an accident!
The stakes are too high to wait for the answer from a distant cloud server.
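
Putting the slide's numbers together (a rough estimate of my own, not from the original): 1% of 250 million cars is 2.5 million vehicles, each producing ~30 TB/day, i.e.

$$2.5\times 10^{6}\ \text{vehicles} \times 30\ \text{TB/day} \approx 75\ \text{EB/day}$$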

06.02.2019 fpa4hep: real-time deep learning on FPGAs 11


People might have different opinions… but today
we learn about FPGAs & Machine Learning!
FPGA
What are FPGAs?
“programmable hardware”

Field Programmable Gate Arrays are reprogrammable integrated circuits

Contain an array of logic cells used to configure low-level operations (bit masking, shifting, addition)

[FPGA diagram, with a zoom on one logic cell: look-up table (logic) + flip-flop (registers)]
- DSPs (multiply-accumulate, etc.)
- Flip Flops (registers / distributed memory)
- LUTs (logic)
- Block RAMs (memories)

06.02.2019 fpa4hep: real-time deep learning on FPGAs 13


FPGA
What are FPGAs?
“programmable hardware”

Field Programmable Gate Arrays are reprogrammable integrated circuits

Contain an array of logic cells used to configure low-level operations (bit masking, shifting, addition)

Also contain embedded components:
- Digital Signal Processors (DSPs): logic units used for multiplications
- Random-access memories (RAMs): embedded memory elements

[FPGA diagram labels: DSPs (multiply-accumulate, etc.), Flip Flops (registers / distributed memory), LUTs (logic), Block RAMs (memories)]

06.02.2019 fpa4hep: real-time deep learning on FPGAs 14


FPGA
What are FPGAs?
“programmable hardware”

Field Programmable Gate Arrays are reprogrammable integrated circuits

Contain an array of logic cells embedded with DSPs, BRAMs, etc.

High-speed input/output to handle the large bandwidth

Support highly parallel algorithm implementations

Low power (relative to CPU/GPU)

Embedded components:
- Digital Signal Processors (DSPs): logic units used for multiplications
- Random-access memories (RAMs): embedded memory elements
- Flip-flops (FFs) and look-up tables (LUTs) for additions

[FPGA diagram labels: DSPs (multiply-accumulate, etc.), Flip Flops (registers / distributed memory), LUTs (logic), Block RAMs (memories)]

06.02.2019 fpa4hep: real-time deep learning on FPGAs 15


How are FPGAs programmed?

Hardware Description Languages

HDLs are programming languages which describe electronic circuits

High Level Synthesis (HLS)

- generates HDL from more common C/C++ code
- pre-processor directives and constraints are used to optimize the timing
- drastic decrease in firmware development time!

Today we use Xilinx Vivado HLS [*]

[*] https://www.xilinx.com/support/documentation/sw_manuals/xilinx2014_1/ug902-vivado-high-level-synthesis.pdf
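
To give a flavour of what the slide means by directives, here is a minimal sketch of an HLS-style C++ kernel (my own toy example, not taken from the tutorial material; the function and array names are invented). The pragmas tell the tool how to map the loop onto hardware, trading latency against resource usage:

```cpp
// Toy Vivado HLS kernel: y = sum_i w[i] * x[i] over 16 elements.
// Aside from unknown-pragma warnings, this also compiles as plain C++.
int mac16(const int w[16], const int x[16]) {
    #pragma HLS PIPELINE II=1   // ask HLS to accept a new input set every clock cycle
    int acc = 0;
    for (int i = 0; i < 16; i++) {
        #pragma HLS UNROLL      // fully unroll the loop: 16 multipliers working in parallel
        acc += w[i] * x[i];
    }
    return acc;
}
```

Removing the UNROLL directive (or giving it a partial factor) makes the tool reuse a single multiplier over several clock cycles: fewer resources, more latency. This is the same trade-off you will steer later through hls4ml's configuration rather than by hand.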
06.02.2019 fpa4hep: real-time deep learning on FPGAs 16
Neural network inference

$$x_n = g_n\left(W_{n,n-1}\, x_{n-1} + b_n\right)$$

- activation function $g_n$: precomputed and stored in BRAMs
- multiplication ($W_{n,n-1}\,x_{n-1}$): DSPs
- addition ($+\,b_n$): logic cells

$$N_{\text{multiplications}} = \sum_{n=2}^{N} L_{n-1} \times L_n$$

[Example architecture, $N$ layers with $L_n$ nodes in layer $n$ and $M$ hidden layers: input layer with 16 inputs → 64 nodes (activation: ReLU) → 32 nodes (ReLU) → 32 nodes (ReLU) → output layer with 5 outputs (activation: SoftMax)]
06.02.2019 fpa4hep: real-time deep learning on FPGAs 17


Neural network inference

$$x_n = g_n\left(W_{n,n-1}\, x_{n-1} + b_n\right), \qquad N_{\text{multiplications}} = \sum_{n=2}^{N} L_{n-1} \times L_n$$

[Same example architecture: 16 inputs → 64 (ReLU) → 32 (ReLU) → 32 (ReLU) → 5 outputs (SoftMax)]

How many resources? DSPs, LUTs, FFs?
Does the model fit in the latency requirements?
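
As a rough rule of thumb (my own estimate, not a statement from the slides): in a fully parallel implementation each product is assigned its own multiplier, so

$$N_{\text{DSP}} \approx N_{\text{multiplications}} \approx 4256$$

for this example network if all multiplications happen in the same clock cycle — a sizeable fraction of (or more than) the DSP budget of typical FPGAs. That is why the resource and latency trade-offs introduced on the next slides matter.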

06.02.2019 fpa4hep: real-time deep learning on FPGAs 18


Today you are going to implement a NN on an FPGA with this package:

hls4ml — high level synthesis for machine learning
https://hls-fpga-machine-learning.github.io/hls4ml/
Reference: https://arxiv.org/abs/1804.06913

From the paper, Sec. 2.1 "hls4ml concept": "[…] network in terms of not only performance but also resource usage and latency. Our basic task is to translate a trained neural network by taking a model architecture, weights, and biases and implementing them in HLS in an automated fashion. This automated procedure is the task of the software/firmware package, hls4ml. A schematic of a typical workflow is illustrated in Fig. 1."

Figure 1: A typical workflow to translate a model into a firmware implementation using hls4ml.
06.02.2019 fpa4hep: real-time deep learning on FPGAs 19
Efficient NN design for FPGAs
FPGAs provide huge flexibility
Performance depends on how well you take advantage of this

Constraints:
- Input bandwidth
- FPGA resources
- Latency

Today you will learn how to optimize your project (from NN training to designing an FPGA project) through:

- compression: reduce the number of synapses or neurons

- quantization: reduce the precision of the calculations (inputs, weights, biases)

- parallelization: tune how much to parallelize to make the inference faster/slower versus FPGA resources (see the sketch below)
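
To make the three knobs concrete, here is a minimal hand-written sketch of one dense layer in Vivado HLS style C++ (my own illustration, not the code hls4ml generates; the names, sizes and fixed-point precision are assumptions). Quantization shows up as the ap_fixed type, parallelization as the unroll/pipeline directives:

```cpp
#include "ap_fixed.h"

// Quantization: fixed-point types instead of 32-bit floats.
// <16,6> means 16 bits total, 6 of them integer bits (an assumed, tunable choice).
typedef ap_fixed<16, 6> data_t;

#define N_IN  16
#define N_OUT 64

// One dense layer with ReLU: out = max(0, W*in + b). Illustration only.
void dense_relu(const data_t in[N_IN], data_t out[N_OUT],
                const data_t W[N_OUT][N_IN], const data_t b[N_OUT]) {
    #pragma HLS ARRAY_PARTITION variable=in complete  // make all inputs readable in parallel
    for (int i = 0; i < N_OUT; i++) {
        #pragma HLS PIPELINE II=1   // start one output neuron per clock cycle
        data_t acc = b[i];
        for (int j = 0; j < N_IN; j++) {
            #pragma HLS UNROLL      // parallelization: all 16 products at once (16 DSPs)
            acc += W[i][j] * in[j];
        }
        out[i] = (acc > 0) ? acc : (data_t)0;  // ReLU
    }
}
```

Compression would appear here as weights pruned to zero, whose multiplications the synthesis tool can optimize away; in the hands-on you will control all three knobs through hls4ml's configuration instead of editing HLS code by hand.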

06.02.2019 fpa4hep: real-time deep learning on FPGAs 20


Today’s hls4ml hands on
• First part:

- get familiar with the package, its functionalities, and design synthesis by running it with one of the provided trained NNs

- learn how to read out an estimate of FPGA resources and latency for a NN after synthesis

- learn how to optimize the design with quantization and parallelization

• Second part:

- learn how to export the HLS design to firmware with SDAccel

• Third part:

- learn how to do model compression and study its effect on FPGA resources/latency

• Fourth part:

- learn how to accelerate NN inference firmware on a real FPGA (provided on the Amazon cloud) with SDAccel

- timing and resource studies after running on the real FPGA

06.02.2019 fpa4hep: real-time deep learning on FPGAs 21
