Evaluation of Smearing on FPGA Accelerator Cards

Salvatore Calì, Grzegorz Korcyl and Piotr Korcyl
Poland
E-mail: [email protected], [email protected], [email protected]
Recent FPGA accelerator cards promise large acceleration factors for some specific computational
tasks. In the context of Lattice QCD calculations, we investigate the possible gain of moving the
SU(3) gauge field smearing routine to such accelerators. We study Xilinx Alveo U280 cards and
use the associated Vitis high-level synthesis framework. We discuss the possible pros and cons
of such a solution based on the gathered benchmarks.
MIT-CTP/5341
The 38th International Symposium on Lattice Field Theory, LATTICE2021, 26th–30th July 2021
Zoom/Gather@Massachusetts Institute of Technology
∗ Speaker
© Copyright owned by the author(s) under the terms of the Creative Commons
Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0). https://pos.sissa.it/
1. Introduction
As computer architectures become more and more heterogeneous, it may be advantageous to delegate some steps of a calculation to the different resources present on cluster or supercomputer nodes. In such a scenario some elements can be executed in parallel by different architectures. For instance, in the CYGNUS installation, the preprocessing of data for data exchanges is accelerated by FPGAs. With this in mind, in the context of lattice QCD, we benchmark the APE link smearing routine on the Xilinx Alveo U280 accelerator card.
APE smearing [1] is a representative case of input-data averaging defined by a 9-point stencil on a data grid with the topology of a four-dimensional torus. In lattice QCD the basic degrees of freedom, located on the edges of the grid, are 3×3 complex-valued matrices belonging to the SU(3) group, called "links". Because of the non-abelian nature of that group, the averaging of neighbouring parallel links is replaced by an average of "staples", i.e. products of three link variables along the paths sketched in Figure 1. For each link one needs to evaluate 6 staples and perform the substitution
\[
U_\mu(x) \rightarrow U_\mu(x) + \sum_{i=-3}^{3} S_i(x), \qquad (1)
\]
where $S_{\pm 1}$, $S_{\pm 2}$ and $S_{\pm 3}$ are the staples in the three directions perpendicular to the direction of the link $U_\mu(x)$. The sign $\pm$ corresponds to the two possibilities, "up" or "down", "left" or "right", which in the following we denote collectively as "forward" and "backward". Eq. (1) differs from the common definition in the literature by the scaling coefficients, which are all set to 1 here; such coefficients are irrelevant as far as performance is concerned.
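As an illustration of the arithmetic behind Eq. (1), a minimal (non-HLS) C++ sketch of the per-link update could look as follows; the su3 type and the function names are chosen for the example only and do not reflect the interfaces of our kernels.

#include <array>
#include <complex>

// 3x3 complex matrix representing an SU(3) link (illustrative type).
using su3 = std::array<std::array<std::complex<double>, 3>, 3>;

// Matrix-matrix product; two of these are needed per staple, since every
// staple is a product of three links (Hermitian conjugations handled by the caller).
su3 mult(const su3 &a, const su3 &b) {
    su3 c{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Matrix-matrix sum.
su3 add(const su3 &a, const su3 &b) {
    su3 c;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            c[i][j] = a[i][j] + b[i][j];
    return c;
}

// Eq. (1) with all coefficients set to 1: U_mu(x) -> U_mu(x) + sum_i S_i(x).
// The subsequent projection back to SU(3) is not shown here.
su3 smear_link(const su3 &U, const std::array<su3, 6> &staples) {
    su3 result = U;
    for (const su3 &S : staples)      // six staples: S_{+-1}, S_{+-2}, S_{+-3}
        result = add(result, S);
    return result;
}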
From the point of view of a compute node, we assume that the host CPU supervises the main compute flow and delegates parts of the computation to different devices. Hence, we assume that the gauge links have already been transferred from the host to the High Bandwidth Memory (HBM) of the FPGA accelerator. The described implementation takes the input link variables, which are streamed from the HBM to the programmable logic, transforms them and stores them back in the HBM. This process can be iterated. Ultimately, the smeared link variables are transferred back to the host. Below we describe the details of the FPGA kernel and of the data transfer mechanisms. Our work builds on previous implementations of the CG solver [2–5]. For recent progress on an FPGA-optimized HPCG benchmark see Ref. [7].
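For concreteness, the host-side part of this flow could be organized with the XRT native C++ API roughly as in the sketch below. The kernel name ape_smear, its argument order, the data layout and the header locations (which vary between XRT releases) are assumptions made for the illustration, not the interface of our implementation.

#include <vector>
#include <xrt/xrt_device.h>
#include <xrt/xrt_kernel.h>
#include <xrt/xrt_bo.h>

int main() {
    xrt::device device(0);                                   // the Alveo U280
    auto uuid = device.load_xclbin("smearing.xclbin");       // bitstream with the kernel
    xrt::kernel smear(device, uuid, "ape_smear");            // hypothetical kernel name

    // Gauge field in the kernel's layout: 4 links per site, 9 complex = 18 reals each.
    const size_t n_links = size_t(4) * 32 * 32 * 32 * 64;    // example lattice size
    std::vector<double> links_host(n_links * 18, 0.0);
    const size_t bytes = links_host.size() * sizeof(double);

    // Device buffers placed in the HBM banks bound to the kernel arguments.
    xrt::bo links_in(device, bytes, smear.group_id(0));
    xrt::bo links_out(device, bytes, smear.group_id(1));

    links_in.write(links_host.data());
    links_in.sync(XCL_BO_SYNC_BO_TO_DEVICE);                  // host -> HBM

    auto run = smear(links_in, links_out, 50);                // e.g. 50 smearing iterations
    run.wait();                                               // kernel streams HBM -> logic -> HBM

    links_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE);               // HBM -> host
    links_out.read(links_host.data());
    return 0;
}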
Figure 1: Schematic representation of the APE link smearing. The link is an SU(3) matrix and is a basic
degree of freedom. The link being smeared is marked in red. Two blue and two black "staples" are shown,
each one being a product of three links. The full smearing routine contains another pair of "staples" in the
fourth direction.
In order to fully exploit the possibilities offered by the U280 accelerator one has to consider and implement several levels of parallelism. At the lowest level, we have data parallelism, which we can realize by instantiating several instances of a kernel to process multiple data simultaneously. For instance, the staples in the three directions can be evaluated in parallel if we instantiate three separate staple kernels (see Table 1). One level higher, one can exploit parallelism in time by pipelining the computations. Again, let us take the computation of a single staple as an example. Its evaluation in double precision takes 39 clock cycles (again, see Table 1). Using dedicated directives of the Vitis environment, we can instruct the compiler to produce a kernel which can be fed with new data every Initiation Interval (II) clock cycles (see the fourth column of Table 1). In the case of double precision this can be II = 2, which means that, at any given moment, the kernel responsible for the staple evaluation is performing computations for 39/2 ≈ 20 staples in parallel. Finally, since the smearing algorithm typically involves many iterations of the same procedure on the same data, one can construct a pipelined data flow using multiple instances of the entire smearing routine kernel, in such a way that at any given moment multiple iterations are being executed on the FPGA accelerator. This latter idea is schematically depicted in Figure 2.

The plot shows slices of the lattice with the link being smeared marked in red; the necessary staples are shown in blue and green. The upper part represents one kernel implementing one iteration of the smearing routine; the lower part is a second, separate kernel implementing the next iteration. The data flow is marked with black arrows: the original data arrives in a stream from the HBM to the programmable logic, is processed by the first kernel performing iteration n, and is subsequently sent, in the form of another stream, to the second kernel, where iteration n + 1 is executed. Finally, the data is streamed back to the HBM. The link variables shown in orange in the sketch are kept in the local memory of the kernel, in an array organized as a cyclic FIFO buffer. The black link variables have already been used and have been removed from the buffer; the grey ones will be transferred to the kernel in the next steps of the volume loop. Although we have implemented and tested this mechanism, we did not manage to compile the entire project, including the cyclic buffers, with all the constraints satisfied, because of local congestion problems in the HBM-Super Logic Region (SLR) interface. Hence, although the U280 has enough resources to implement the entire project, the performance figures quoted in the following section are based on partial compilation results.

Figure 2: Schematic view of the data flow in the HBM-kernel-kernel-HBM stream, with the cyclic buffers (orange) implemented in U/BRAM.
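The pipelining directives and the HBM-kernel-kernel-HBM streaming just described would take roughly the following shape in Vitis HLS. This is only a schematic sketch: the 512-bit packing of the link data, the stream depths, the kernel name ape_smear and the omitted body of smear_iteration are assumptions made for the illustration, not our production code.

#include <ap_int.h>
#include <hls_stream.h>

typedef ap_uint<512> word_t;   // one 512-bit HBM word

// Burst-read the input links from HBM into an on-chip stream.
static void read_from_hbm(const word_t *src, hls::stream<word_t> &out, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out.write(src[i]);
    }
}

// Write the smeared links from the stream back to HBM.
static void write_to_hbm(hls::stream<word_t> &in, word_t *dst, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        dst[i] = in.read();
    }
}

// One smearing iteration: consumes a stream of link data, produces smeared links.
// The cyclic FIFO buffer holding the neighbourhood (orange links in Figure 2)
// would be a BRAM/URAM array inside this function.
static void smear_iteration(hls::stream<word_t> &in, hls::stream<word_t> &out, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=2
        word_t w = in.read();
        // ... update the cyclic buffer, evaluate the six staples,
        //     multiply and project back to SU(3) ...
        out.write(w);
    }
}

// Top-level kernel: two smearing iterations chained through on-chip streams,
// so that iterations n and n+1 are processed concurrently (Figure 2).
extern "C" void ape_smear(const word_t *hbm_in, word_t *hbm_out, int n_words) {
#pragma HLS INTERFACE m_axi port=hbm_in  offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=hbm_out offset=slave bundle=gmem1
#pragma HLS DATAFLOW
    hls::stream<word_t> s0, s1, s2;
#pragma HLS STREAM variable=s0 depth=64
#pragma HLS STREAM variable=s1 depth=64
#pragma HLS STREAM variable=s2 depth=64
    read_from_hbm(hbm_in, s0, n_words);    // HBM -> programmable logic
    smear_iteration(s0, s1, n_words);      // iteration n
    smear_iteration(s1, s2, n_words);      // iteration n + 1
    write_to_hbm(s2, hbm_out, n_words);    // programmable logic -> HBM
}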
Combining all three levels of parallelism, together with the corresponding data transport layers, allows one to fully exploit the potential of FPGA accelerators. In practice, the feasibility of the project depends on the size of the available resources, which we discuss in the next section, and on the ability of the compiler to efficiently implement everything within the time and space constraints, on which we comment in the last section.
3. Resource consumption
The feasibility of the implementation outlined in the previous section depends on the size (in terms of logic resources) of a single kernel. In our implementation the kernel is composed of several modules instantiated as separate functions: scaling of SU(3) group elements by a scalar, addition (add_two) and multiplication, evaluation of a single staple (compute_staple_*),
multiplication by the sum of staples (multiply_by_staple) and projection back to the SU(3) group (su3_projection). On the one hand, the best performance is obtained when all the functions are
merged via the inline keyword, which allows the compiler to reshuffle and reuse resources and avoids constructing interfaces for consecutive function calls. On the other hand, when each function is left as a separate module, the compiler provides individual information on resource consumption, which makes it possible to see which elements are critical from the point of view of resource usage and also which functions reuse the same instances of lower-level kernels. Following the second possibility, we gather the relevant information on the resource consumption of the various steps of the smearing procedure in Tables 1 and 2. In order to estimate the total performance we use inlining for all functions.
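The two compilation strategies differ only by an inlining attribute attached to each function. Schematically (the su3 struct, the signatures and the omitted bodies are placeholders for illustration):

struct su3 { double re[3][3]; double im[3][3]; };   // illustrative 3x3 complex matrix type

// Kept as a separate module: the compiler reports its resource usage
// individually, at the price of interface logic around every call.
void su3_projection(su3 &U) {
#pragma HLS INLINE off
    // ... 4 iterations of the projection of Ref. [6] ...
}

// Merged into the caller via the inline keyword (or, equivalently,
// #pragma HLS INLINE): resources can be reshuffled and reused freely.
inline void multiply_by_staple(const su3 &U, const su3 staples[6], su3 &out) {
    // ... multiply the link U by the sum of the six staples ... (body omitted)
    out = U;   // placeholder so the sketch compiles
}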
As an example, Table 1 shows the structure of the multiply_by_staple function, which yields the product of the current link and the sum of the six staples, at one level of decomposition. We see that the compiler has generated three instances of the kernels grp_compute_staple_forward_fu,
Table 2: Comparison of the resource consumption of various compute kernels for different data types, compiled for the U280 card at 300 MHz with Vitis HLS 2020.2 (resources in % of the total / of one SLR).
rules out this setup. Similar observations can be made for the same kernel in single precision with II = 2. From that point of view, we conclude that the possible II is II = 8 for double precision, II = 4 for float and II = 2 for half precision. This conclusion will be confirmed by the analysis of the input data bandwidth, which we discuss in the next section. The full size of the smearing routine, composed of the staple evaluation and multiplication and of the SU(3) projection, is shown in the last four rows of Table 2, only for the parameters which fit in a single SLR.
In order to assess the performance of the setup presented above, one has to count the number of floating point operations needed for the smearing of a single link. The input data is composed of six sets of three SU(3) matrices needed for the six staples; hence, for each link we need to load 18 × 9 × 2 = 324 floating point numbers. For each staple we have two matrix-matrix multiplications, hence 12 multiplications and 6 matrix-matrix additions in total. This gives 324 × 12 + 108 = 3996 floating point operations (FLOPs). Finally, the SU(3) projection [6] requires 2790 FLOPs, where the number of iterations was set to 4.
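Adding the two contributions, the smearing of a single link thus amounts to
\[
3996 + 2790 = 6786 \ \text{FLOPs}.
\]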
As far as the data transfer is concerned, the HBM memory on the Xilinx U280 card has 32 512-bit wide ports which can run at 300 MHz. The 32 ports are divided equally among four regions of the programmable logic (SLR). From the point of view of possible routing congestion it is advisable not to exceed one SLR and to work with the 8 ports attached to it. In Table 3 we provide the size of the input in bits for the different precisions. In the second column we translate the latter into the number of 512-bit words which have to be transferred. Finally, in the third column we report the minimal (when all 8 ports are used) and maximal (when only a single port is used) number of clock cycles needed to transfer the input data for the smearing of a single link variable. This number of clock cycles translates directly into the initiation interval of the kernel, since we cannot start the kernel before all the data has arrived. The last column contains the final initiation interval for the given precision, chosen in accordance with the resource consumption presented in the previous section.
Table 3: Possible values of the initiation interval inferred from the HBM-programmable logic bandwidth.
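As an illustration of how such numbers come about, consider double precision and count, as above, only the 324 input values per link. Assuming all 8 ports of one SLR can be used,
\[
324 \times 64 \ \text{bits} = 20736 \ \text{bits} \;\;\Rightarrow\;\; \lceil 20736/512 \rceil = 41 \ \text{words}, \qquad \lceil 41/8 \rceil = 6 \ \text{clock cycles},
\]
while a single port would need 41 cycles; the counts for float and half precision follow by replacing 64 bits with 32 and 16 bits, respectively.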
With the initiation interval fixed by the available resources and the memory bandwidth, we can estimate the performance of a single kernel. We have gathered the numbers in Table 4.

We can contrast these numbers with our benchmark runs performed on the Prometheus supercomputer hosted by AGH Cyfronet in Kraków, Poland. Each node is equipped with two Intel Haswell processors with 24 cores in total. 50 iterations of the APE smearing on a lattice of size 32³ × 64 using 6 nodes took 3.0 s, which translates into 110 GFLOP/s per node.
Table 4: The initiation interval, inferred from the HBM-programmable logic memory bandwidth, sets the
performance limit on a single kernel. The last column provides estimates in GFLOP/s, assuming that 3
parallel kernels are implemented in 3 separate SLR domains.
In this work we have evaluated the performance of the APE smearing routine executed on the Xilinx Alveo U280 accelerator. Our implementation exploits several layers of parallelism offered by FPGA accelerators, as well as the benefits of the HBM memory located close to the programmable logic. Our analysis shows that a speedup compared with a CPU is possible, provided the compilation, placement and routing of all elements is successful. Although we have tested all the elements individually, and the SLR domain of the Alveo U280 is large enough to contain the complete solution, we have not yet managed to obtain the final binary, because Vitis 2020.2 fails to place and route the generated logic due to a high level of congestion. The problem remains open and will be revisited with different Vitis releases, which differ considerably in the delivered quality of results. Work in this direction is ongoing. As an additional research direction, it would be interesting to benchmark the SYCL framework for FPGAs with the code described here.
Acknowledgements
This work was supported by the Foundation for Polish Science grant no. TEAM/2017-4/39,
by the Polish Ministry for Science and Higher Education grant no. 7150/E-338/M/2018, and by the
Priority Research Area Digiworld under the program Excellence Initiative – Research University at
the Jagiellonian University. We gratefully acknowledge hardware donations from Xilinx within the
Xilinx University Program. S.C. acknowledges support from the Carl G. and Shirley Sontheimer
Research Fund at MIT.
References
[1] Albanese, M., et al. Glueball Masses and String Tension in Lattice QCD. Phys. Lett. B 192
(1987), 163–169.
[2] Korcyl, G., and Korcyl, P. Investigating the Dirac operator evaluation with FPGAs.
[3] Korcyl, G., and Korcyl, P. Optimized implementation of the conjugate gradient algorithm
for FPGA-based platforms using the Dirac-Wilson operator as an example.
[4] Korcyl, G., and Korcyl, P. Towards Lattice Quantum Chromodynamics on FPGA devices.
Comput. Phys. Commun. 249 (2020), 107029.
[5] Korcyl, P., and Korcyl, G. Implementation of the conjugate gradient algorithm in Lattice
QCD on FPGA devices. PoS LATTICE2018 (2018), 313.
[6] Della Morte, M., Shindler, A., and Sommer, R. On lattice actions for static quarks. Journal of High Energy Physics 2005, 08 (Aug 2005), 051.
[7] Zeni, A., O’Brien, K., Blott, M., and Santambrogio, M. D. Optimized implementation of the HPCG benchmark on reconfigurable hardware. In European Conference on Parallel Processing (2021), Springer, pp. 616–630.