G-GPU: A Fully-Automated Generator of GPU-like ASIC Accelerators
Tiago D. Perez∗ , Márcio M. Gonçalves† , Leonardo Gobatto† , Marcelo Brandalero‡ , José Rodrigo Azambuja† ,
Samuel Pagliarini∗
∗ Department of Computer Systems, Tallinn University of Technology (TalTech), Estonia
† Institute of Informatics, Federal University of Rio Grande do Sul (UFRGS), Brazil
‡ Brandenburg University of Technology (B-TU), Germany
Emails: {tiago.perez,samuel.pagliarini}@taltech.ee, {marcio.goncalves,leonardo.gobato,jose.azambuja}@inf.ufrgs.br, [email protected]
Abstract—Modern Systems on Chip (SoC), almost as a rule, require accelerators for achieving energy efficiency and high performance for specific tasks that are not necessarily well suited for execution in standard processing units. Considering the broad range of applications and the necessity for specialization, the design of SoCs has thus become considerably more challenging. In this paper, we put forward the concept of G-GPU, a general-purpose GPU-like accelerator that is not application-specific but still gives benefits in energy efficiency and throughput. Furthermore, we have identified an existing gap for these accelerators in ASIC, for which no known automated generation platform/tool exists. Our solution, called GPUPlanner, is an open-source generator of accelerators, from RTL to GDSII, that addresses this gap. Our analysis results show that our automatically generated G-GPU designs are remarkably efficient when compared against the popular CPU architecture RISC-V, presenting speed-ups of up to 223 times in raw performance and up to 11 times when the metric is performance derated by area. These results are achieved by executing a design space exploration of the GPU-like accelerators, where the memory hierarchy is broken in a smart fashion and the logic is pipelined on demand. Finally, tapeout-ready layouts of the G-GPU in 65nm CMOS are presented.

Index Terms—ASIC generator, domain-specific accelerators, general-purpose GPU architectures, integrated circuits

I. INTRODUCTION

New computer applications, especially in the field of Artificial Intelligence (AI), keep pushing the need for more energy-efficient hardware architectures [1]. For many years, application- and domain-specific accelerators, designed by specializing to the task at hand, have been the standard choice for achieving high energy efficiency. Canonical examples are crypto cores [2] and Graphics Processing Units (GPUs), for which even specialized programming languages and paradigms have been proposed [3]. GPU architectures focus on specialized, massively parallel many-core processors that take advantage of Thread-Level Parallelism (TLP) to handle highly parallelizable applications in a Single-Instruction Multiple-Threads (SIMT) paradigm. GPUs have traditionally been designed for graphics applications but have recently evolved into efficient general-purpose accelerators for High-Performance Computing (HPC). HPC applications have a wide range, including oil exploration, bioinformatics, and the thriving AI and Machine Learning (ML) domains [4]. NVIDIA GPUs, for instance, are used as accelerators in several top500 supercomputers.

However, despite their widespread use as accelerators, research in GPU architectures is limited due to the lack of open-source models that are at a sufficiently low level of abstraction and representative of modern architectures. To the best of our knowledge, the only configurable open-source GPU architectures available in the literature are FlexGripPlus [5] and FGPU [6]. The first is based on the decade-old NVIDIA G80 architecture and has never been deployed to an FPGA board. The second was designed specifically for FPGA platforms. Therefore, the literature has not yet tackled the challenges in designing, configuring, and implementing modern GPU architectures for ASICs – a platform that presents challenges far from those in FPGA design. Still, all commercial GPUs are designed as ASICs.

This work proposes to bridge this gap with GPUPlanner, an automated and open-source framework for generating ASIC-specific GPU-like accelerators as IP. We term these general-purpose accelerators G-GPUs. GPUPlanner helps designers in generating GPU-like accelerators through user-driven customization and automated physical implementation. Customization is performed according to a given GPU architecture through a series of parameters that define computation characteristics (e.g., number of processing units) and memory access (e.g., cache sizes), thus providing designers a high degree of scalability to better fit the generated IP into their systems. Implementation strategies explore the use of smart memories and on-demand pipeline insertion.

We evaluate our proposed framework by implementing four flavors of G-GPU architectures in terms of performance, power, and area (PPA). Additionally, we provide a reasonable comparison with the popular CPU architecture RISC-V [7], [8] in terms of raw performance speed-up and performance speed-up derated by area. Our main contribution is an open-source framework for the automated generation of GPU-like accelerators, from RTL to GDSII – the GPUPlanner.

II. HARDWARE ACCELERATORS AND OUR BASELINE GPU

In a nutshell, domain- or application-specific accelerators cost too much. Recent developments in High-Level Synthesis (HLS) [9] are encouraging and have helped accelerate the development of domain-specific hardware accelerators. Yet, for ASIC designs, either the performance-for-flexibility trade-off is not interesting, or the performance is insufficient [10]. This scenario presents itself as an opportunity where general-purpose accelerators have gained ground. Our proposed GPUPlanner framework combines the efficiency of domain-specific accelerators and the ease of use of general-purpose architectures into the G-GPU. The result is an automatically
978-3-9819263-6-1/DATE22/©2022 EDAA
...
...
[Figure: block diagram of the baseline GPU architecture — global memory; a memory controller with AXI data and control interfaces, data cache, WG dispatcher, and runtime memory/control registers; and a CU with its controller, WF scheduler, PE0...PE7, register files, CRAM, and LRAM.]
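The user-driven customization mentioned in Section I — parameters for computation (e.g., number of processing units) and memory access (e.g., cache sizes) — can be pictured as a small specification record, one per application scenario. The Python sketch below is illustrative only: the type and field names, and the example cache size, are assumptions rather than GPUPlanner's actual interface.

```python
from dataclasses import dataclass

@dataclass
class GGpuSpec:
    """Hypothetical G-GPU specification record (illustrative field names)."""
    num_cus: int          # compute units (the paper evaluates 1, 2, 4, and 8)
    pes_per_cu: int       # processing elements per CU (PE0..PE7 in the baseline)
    data_cache_kb: int    # data cache size behind the AXI data interface (assumed value)
    target_freq_mhz: int  # target operating frequency (500/590/667 in the paper)

# Example: a record matching the largest configuration evaluated in the paper.
spec = GGpuSpec(num_cus=8, pes_per_cu=8, data_cache_kb=32, target_freq_mhz=667)
print(spec)
```

In such a flow, the record would parameterize the baseline architecture shown in the figure above (one controller, a WF scheduler, and eight PEs per CU), with the generator elaborating the RTL accordingly.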
has to specify the operating frequency of the G-GPU. After surveying the possible versions of the G-GPU for the desired application scenarios, the designer can generate a specification for each scenario. Then, these specifications are contrasted with the characteristics of the intended technology to create a first-order estimation of the G-GPU PPA. In this phase, several versions may turn out to be suitable for the given specification. Still, it also might happen that no configuration suits the designer's requirements exactly, only one that might be close enough. This map is a dynamic spreadsheet, where the user inputs the delay of the memory blocks required for the non-optimized version of the G-GPU. Our map gives the maximum performance and indicates which memory has to be divided, or where pipelines have to be introduced, to enhance the performance. This is an iterative process and can be repeated until the designer reaches the desired performance. Thus, using our map, the designer can rapidly adapt his specification or create new versions of the G-GPU. The only hard constraint in our framework is that many of the G-GPU memories have to be dual-port; support for single-port memories is scheduled as future work.

From a single push of a button, our framework can perform logic and physical synthesis of the list of designs. After logic and physical synthesis, the resulting PPA is checked to guarantee it is within the initial specification. If the resulting G-GPU is out of specification, the designer should modify it and restart the process. In any case, the resulting layouts are ready to be integrated into a system as tapeout-ready IP.

IV. RESULTS AND DISCUSSION

From the exercise of the GPUPlanner, we found 12 versions with a generally worthwhile PPA trade-off. These versions have 1, 2, 4, and 8 CUs, with variants running at 500MHz, 590MHz, and 667MHz. The characteristics of each version are shown in Table I. In terms of area, the G-GPU size grows linearly with the number of CUs. The optimizations done for augmenting the performance increased the area by an average of 10% from 500MHz to 590MHz, and 2% from 590MHz to 667MHz. Thus, if power consumption is not a priority, the 667MHz version is a good fit, trading a negligible increase in area for better performance. These results demonstrate the potential scalability of the G-GPU architecture.

We chose four versions for physical synthesis: 1CU@500MHz, 1CU@667MHz, 8CU@500MHz, and 8CU@667MHz. During this phase, the G-GPU is broken into three partitions: the CU, the general memory controller (MCTRL), and the top. The density of the CU and the MCTRL was set to 70%. Because of our floorplan strategy of breaking the design into partitions, the top has a low density of 30%. Nevertheless, breaking the design into partitions allows the designer to scale the G-GPU without any extra effort: once a CU partition is fully placed and routed, it can be reused in versions with more than 1 CU by cloning the partition in the final floorplan of the design. Moreover, the user can create a collection of different CU layout blocks and easily scale the floorplan with the number of CUs for different application scenarios.

Fig. 3: Layout comparison between minimum and maximum performance of G-GPUs with 1 CU (top) and 8 CUs (bottom). [Figure shows the 1CU@500MHz and 1CU@667MHz layouts on top and the 8 CU layouts, including the 600MHz version, on the bottom; dimensions range from roughly 2500×2800 um (1 CU) to 7150×8350 um (8 CUs); the legend distinguishes untouched memories from CU-, MCTRL-, and TOP-optimized memories.]

The layouts for the versions with 1 and 8 CUs are depicted in Fig. 3. The size scale in the figure only applies between layouts with the same number of CUs. The block memories divided for augmenting the performance are highlighted in green for the CU partition, yellow and pink for the MCTRL, and blue for the top. Note how different the floorplan is between the version with optimizations running at 600MHz and the one without optimizations running at 500MHz. Block memories have to be strategically placed in order to extract the maximum performance, hence the differences in the floorplans. The layouts of the versions 1CU@500MHz, 1CU@667MHz, and 8CU@500MHz achieve the performance expected from logical synthesis (i.e., they can run at the specified clock frequency without any timing violation). However, the layout of version 8CU@667MHz can only run at 600MHz. This is explained by analyzing the floorplan of its layout (see Fig. 3): the connecting routing wires introduce a significant capacitance because of the long distance between the peripheral CUs and the general memory controller.

To fully evaluate the G-GPU as an ASIC accelerator, we compared its performance with the popular RISC-V architecture. We synthesized both architectures using the same technology as before, with an operating frequency of 667MHz, the RISC-V having 32Kb of memory and the G-GPU having 1/2/4/8 CUs. We chose seven micro-benchmarks from the AMD OpenCL SDK and increased their input sizes until crashing the RISC-V compiler. We further increased the input size of the G-GPU applications to make its compute units fully utilized. To compare the performance of the different input-size applications, we took a pessimistic approach for the G-GPU and considered that one could increase RISC-V application input sizes by multiplying its cycle count by the G-GPU/RISC-V input size ratio. These results are shown in Fig. 4.

Our first evaluation compares raw performance between G-GPU and RISC-V for the same input sizes. For applications
TABLE I: Characteristics of 12 different G-GPU solutions generated by our tool after logic synthesis in Cadence Genus.
#CU & Freq. Total Area (mm2 ) Memory Area (mm2 ) #FF #Comb. #Memory Leakage (mW) Dynamic (W) Total (W)
1@500MHz 4.19 2.68 119778 127826 51 4.62 1.97 2.055
2@500MHz 7.45 4.64 229171 214243 93 8.54 3.63 3.77
4@500MHz 13.84 8.56 437318 387246 177 16.07 6.88 7.14
8@500MHz 26.51 16.39 852094 714256 345 30.79 13.33 13.86
1@590MHz 4.66 3.15 120035 128894 68 4.73 2.57 2.66
2@590MHz 8.16 5.34 229172 221946 120 8.73 4.63 4.81
4@590MHz 15.03 9.72 436807 397995 224 16.41 8.70 9.02
8@590MHz 28.65 18.49 850559 737232 432 31.25 16.81 17.40
1@667MHz 4.77 3.26 120035 130802 71 4.65 2.62 2.72
2@667MHz 8.27 5.45 229172 222028 123 8.72 4.69 4.87
4@667MHz 15.15 9.83 436807 398124 227 16.43 8.75 9.07
8@667MHz 28.69 18.60 848511 730506 435 30.21 19.10 19.76
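The area-overhead averages quoted in Section IV (roughly 10% from 500MHz to 590MHz and roughly 2% from 590MHz to 667MHz) can be re-derived from the Total Area column of Table I. The short script below — a checking sketch with the values transcribed from the table, not part of the GPUPlanner flow — computes those averages and the per-CU area:

```python
# Total Area (mm^2) from Table I, indexed by target frequency (MHz);
# one entry per CU count (1, 2, 4, 8).
area = {
    500: [4.19, 7.45, 13.84, 26.51],
    590: [4.66, 8.16, 15.03, 28.65],
    667: [4.77, 8.27, 15.15, 28.69],
}

def avg_overhead_pct(base, opt):
    """Average relative area increase of `opt` over `base`, in percent."""
    return 100 * sum((o - b) / b for b, o in zip(base, opt)) / len(base)

print(f"500 -> 590 MHz: {avg_overhead_pct(area[500], area[590]):.1f}%")  # ~9.4%
print(f"590 -> 667 MHz: {avg_overhead_pct(area[590], area[667]):.1f}%")  # ~1.2%

# The table also shows near-linear area scaling with the CU count:
for cus, a in zip([1, 2, 4, 8], area[500]):
    print(f"{cus} CU @ 500 MHz: {a / cus:.2f} mm^2 per CU")
```

The computed averages (about 9.4% and 1.2%) are close to the rounded figures in the text. The per-CU area shrinks slightly with more CUs (from 4.19 to about 3.31 mm²), presumably because the shared memory controller and top logic are amortized over more compute units.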
with low to no parallelism, the G-GPU can be as little as 1.2 times faster than the RISC-V. As the G-GPU is a domain-specific ASIC accelerator, such results are expected, since it will not be the best option for general-purpose applications. Therefore, a user interested in implementing a G-GPU as an accelerator can use the provided data to ponder whether this type of architecture is a good fit for his system, considering only the raw speed-up.

Fig. 4: Speed-up over RISC-V. [Bar chart of raw speed-up (left axis) and speed-up derated by area (right axis) for benchmarks including mat_mul, copy, vec_mult, div_int, xcorr, and parallel_sel; the legend gives G-GPU/RISC-V area ratios of 6.5, 11.6, 21.4, and 41.0 for 1, 2, 4, and 8 CUs, respectively.]

Our second evaluation factors the previously measured area into the performance speed-up. We derated the previously measured speed-up by dividing it by the area ratio (G-GPU/RISC-V). A G-GPU with 1 CU has an area 6.5 times larger than the RISC-V's, and it achieves the best increase in performance per area, at 10.2 times the RISC-V's. On the other hand, a G-GPU with 8 CUs has an area 41 times bigger than the RISC-V's, thus achieving the lowest increase in performance per area, at 5.7 times the RISC-V's. This trend happens mainly because data dependency and global memory communication limit parallelism; thus, the increased processing power provided by a G-GPU configuration with more CUs cannot be fully exploited.

We are planning to update GPUPlanner to be able to implement the 8-CU G-GPU without performance loss. The performance problem of the layouts with 8 CUs could be solved by replicating the general memory controller, shortening the distance to the peripheral CUs and reducing the delay introduced by the routing wires. Also, we intend to include memory hierarchy support and incorporate single-port memories into GPUPlanner.

V. CONCLUSION

Our results showed that G-GPUs are feasible domain-specific ASIC accelerators. Furthermore, when the G-GPU performance is contrasted with that of a RISC-V, our architecture shows tremendous benefits for applications with high parallelism. Moreover, as GPUPlanner is an open-source framework, it gives the community the opportunity to explore the design space of GPU-like accelerators. Our work goes beyond delivering a single GPU-like accelerator in 65nm, as our tool can be easily extended to support other baseline GPU architectures and technologies.

ACKNOWLEDGMENTS

This work has been partially conducted in the project "ICT programme", supported by the European Union through the European Social Fund. This work was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, CNPq, and FAPERGS.

REFERENCES

[1] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proc. of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
[2] C. Mucci, L. Vanzolini, A. Lodi, A. Deledda, R. Guerrieri, F. Campi, and M. Toma, "Implementation of AES/Rijndael on a dynamically reconfigurable architecture," in 2007 Design, Automation & Test in Europe Conference & Exhibition, pp. 1–6, 2007.
[3] T. D. Han and T. S. Abdelrahman, "hiCUDA: High-level GPGPU programming," IEEE Trans. on Parallel and Distributed Systems, vol. 22, no. 1, pp. 78–90, 2011.
[4] P. P. Brahma, D. Wu, and Y. She, "Why deep learning works: A manifold disentanglement perspective," IEEE Trans. on Neural Networks and Learning Systems, vol. 27, no. 10, pp. 1997–2008, 2016.
[5] J. E. R. Condia, B. Du, M. Sonza Reorda, and L. Sterpone, "FlexGripPlus: An improved GPGPU model to support reliability analysis," Microelectronics Reliability, vol. 109, p. 113660, 2020.
[6] M. Al Kadi, B. Janssen, and M. Huebner, "FGPU: An SIMT-architecture for FPGAs," in ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays, pp. 254–263, ACM, 2016.
[7] M. Gautschi et al., "Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices," IEEE Trans. on VLSI Systems, vol. 25, no. 10, pp. 2700–2713, 2017.
[8] OpenHW Group, "CV32E40P RISC-V IP," 2016. https://github.com/openhwgroup/cv32e40p.
[9] A. Canis et al., "LegUp: High-level synthesis for FPGA-based processor/accelerator systems," in FPGA'11, pp. 33–36, ACM, 2011.
[10] J. Weng, S. Liu, V. Dadu, Z. Wang, P. Shah, and T. Nowatzki, "DSAGEN: Synthesizing programmable spatial accelerators," in ACM/IEEE Int. Symp. on Computer Architecture, pp. 268–281, 2020.
[11] R. Ma et al., "Specializing FGPU for persistent deep learning," ACM Trans. Reconfigurable Technol. Syst., vol. 14, July 2021.
[12] V. Gangadhar et al., "MIAOW: An open source GPGPU," in IEEE Hot Chips Symp., pp. 1–43, 2015.
[13] P. Duarte, P. Tomas, and G. Falcao, "SCRATCH: An end-to-end application-aware soft-GPGPU architecture and trimming tool," in IEEE/ACM Int. Symp. on Microarchitecture, pp. 165–177, ACM, 2017.
[14] H. E. Sumbul, K. Vaidyanathan, Q. Zhu, F. Franchetti, and L. Pileggi, "A synthesis methodology for application-specific logic-in-memory designs," in ACM/EDAC/IEEE Design Automation Conference, pp. 1–6, 2015.