G-GPU: A Fully-Automated Generator of GPU-like ASIC Accelerators
Tiago D. Perez∗ , Márcio M. Gonçalves† , Leonardo Gobatto† , Marcelo Brandalero‡ , José Rodrigo Azambuja† ,
Samuel Pagliarini∗
∗ Department of Computer Systems, Tallinn University of Technology (TalTech), Estonia
† Institute of Informatics, Federal University of Rio Grande do Sul (UFRGS), Brazil
‡ Brandenburg University of Technology (B-TU), Germany
Emails: {tiago.perez,samuel.pagliarini}@taltech.ee, {marcio.goncalves,leonardo.gobato,jose.azambuja}@inf.ufrgs.br, [email protected]
Abstract—Modern Systems on Chip (SoC), almost as a rule, require accelerators for achieving energy efficiency and high performance for specific tasks that are not necessarily well suited for execution in standard processing units. Considering the broad range of applications and the necessity for specialization, the design of SoCs has thus become considerably more challenging. In this paper, we put forward the concept of G-GPU, a general-purpose GPU-like accelerator that is not application-specific but still gives benefits in energy efficiency and throughput. Furthermore, we have identified an existing gap for these accelerators in ASIC, for which no known automated generation platform/tool exists. Our solution, called GPUPlanner, is an open-source generator of accelerators, from RTL to GDSII, that addresses this gap. Our analysis results show that our automatically generated G-GPU designs are remarkably efficient when compared against the popular CPU architecture RISC-V, presenting speed-ups of up to 223 times in raw performance and up to 11 times when the metric is performance derated by area. These results are achieved by executing a design space exploration of the GPU-like accelerators, where the memory hierarchy is broken in a smart fashion and the logic is pipelined on demand. Finally, tapeout-ready layouts of the G-GPU in 65nm CMOS are presented.

Index Terms—ASIC generator, domain-specific accelerators, general-purpose GPU architectures, integrated circuits

I. INTRODUCTION

New computer applications, especially in the field of Artificial Intelligence (AI), keep pushing the need for more energy-efficient hardware architectures [1]. For many years, application- and domain-specific accelerators, designed by specializing to the task at hand, have been the standard choice for achieving high energy efficiency. Canonical examples are crypto cores [2] and Graphics Processing Units (GPUs), for which even specialized programming languages and paradigms have been proposed [3]. GPU architectures focus on specialized, massively parallel many-core processors that take advantage of Thread-Level Parallelism (TLP) to handle highly parallelizable applications in a Single-Instruction Multiple-Threads (SIMT) paradigm. GPUs have traditionally been designed for graphics applications but have recently evolved into efficient general-purpose accelerators for High-Performance Computing (HPC). HPC applications have a wide range, including oil exploration, bioinformatics, and the thriving AI and Machine Learning (ML) domains [4]. NVIDIA GPUs, for instance, are used as accelerators in several top500 supercomputers.

However, despite their widespread use as accelerators, research in GPU architectures is limited due to the lack of open-source models that are at a sufficiently low level of abstraction and representative of modern architectures. To the best of our knowledge, the only configurable open-source GPU architectures available in the literature are FlexGripPlus [5] and FGPU [6]. The first is based on the decade-old NVIDIA G80 architecture and has never been deployed to an FPGA board. The second was designed specifically for FPGA platforms. Therefore, the literature has not yet tackled the challenges in designing, configuring, and implementing modern GPU architectures for ASICs – a platform that presents challenges far from those in FPGA design. Still, all commercial GPUs are designed as ASICs.

This work proposes to bridge this gap with GPUPlanner, an automated and open-source framework for generating ASIC-specific GPU-like accelerators as IP. We term these general-purpose accelerators G-GPUs. GPUPlanner helps designers in generating GPU-like accelerators through user-driven customization and automated physical implementation. Customization is performed according to a given GPU architecture through a series of parameters that define computation characteristics (e.g., number of processing units) and memory access (e.g., cache sizes), thus providing designers a high degree of scalability to better fit the generated IP into their systems. Implementation strategies explore the use of smart memories and on-demand pipeline insertion.

We evaluate our proposed framework by implementing four flavors of G-GPU architectures in terms of performance, power, and area (PPA). Additionally, we provide a reasonable comparison with the popular CPU architecture RISC-V [7], [8] in terms of raw performance speed-up and performance speed-up derated by area. Our main contribution is an open-source framework for the automated generation of GPU-like accelerators, from RTL to GDSII – the GPUPlanner.

II. HARDWARE ACCELERATORS AND OUR BASELINE GPU

In a nutshell, domain- or application-specific accelerators cost too much. Recent developments in High-Level Synthesis (HLS) [9] are encouraging and have helped accelerate the development of domain-specific hardware accelerators. Yet, for ASIC designs, either the performance-for-flexibility trade-off is not interesting, or the performance is insufficient [10]. This scenario presents itself as an opportunity where general-purpose accelerators have gained ground. Our proposed GPUPlanner framework combines the efficiency of domain-specific accelerators and the ease of use of general-purpose architectures into the G-GPU. The result is an automatically
978-3-9819263-6-1/DATE22/©2022 EDAA
...
...
[Figure: block diagram of the baseline GPU architecture — global memory; a memory controller with AXI data and control interfaces, data cache, WG dispatcher, and runtime memory/control registers; and a CU with its controller, WF scheduler, PE0...PE7, register files, CRAM, and LRAM.]
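The user-driven customization mentioned in Section I — parameters for computation (e.g., number of processing units) and memory access (e.g., cache sizes) — can be pictured as a small specification record, one per application scenario. The Python sketch below is illustrative only: the type and field names, and the example cache size, are assumptions rather than GPUPlanner's actual interface.

```python
from dataclasses import dataclass

@dataclass
class GGpuSpec:
    """Hypothetical G-GPU specification record (illustrative field names)."""
    num_cus: int          # compute units (the paper evaluates 1, 2, 4, and 8)
    pes_per_cu: int       # processing elements per CU (PE0..PE7 in the baseline)
    data_cache_kb: int    # data cache size behind the AXI data interface (assumed value)
    target_freq_mhz: int  # target operating frequency (500/590/667 in the paper)

# Example: a record matching the largest configuration evaluated in the paper.
spec = GGpuSpec(num_cus=8, pes_per_cu=8, data_cache_kb=32, target_freq_mhz=667)
print(spec)
```

In such a flow, the record would parameterize the baseline architecture shown in the figure above (one controller, a WF scheduler, and eight PEs per CU), with the generator elaborating the RTL accordingly.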
has to specify the operating frequency of the G-GPU. After surveying the possible versions of the G-GPU for the desired application scenarios, the designer can generate a specification for each scenario. Then, these specifications are contrasted with the characteristics of the intended technology to create a first-order estimation of the G-GPU PPA. In this phase, several versions may turn out to be suitable for the given specification. Still, it also might happen that no configuration suits the designer's requirements exactly, only one that might be close enough. This map is a dynamic spreadsheet, where the user inputs the delay of the memory blocks required for the non-optimized version of the G-GPU. Our map gives the maximum performance and indicates which memory has to be divided, or where pipelines have to be introduced, to enhance the performance. This is an iterative process and can be repeated until the designer reaches the desired performance. Thus, using our map, the designer can rapidly adapt his specification or create new versions of the G-GPU. The only hard constraint in our framework is that many of the G-GPU memories have to be dual-port; support for single-port memories is scheduled as future work.

From a single push of a button, our framework can perform logic and physical synthesis of the list of designs. After logic and physical synthesis, the resulting PPA is checked to guarantee it is within the initial specification. If the resulting G-GPU is out of specification, the designer should modify it and restart the process. In any case, the resulting layouts are ready to be integrated into a system as tapeout-ready IP.

IV. RESULTS AND DISCUSSION

From the exercise of the GPUPlanner, we found 12 versions with a generally worthwhile PPA trade-off. These versions have 1, 2, 4, and 8 CUs, with variants running at 500MHz, 590MHz, and 667MHz. The characteristics of each version are shown in Table I. In terms of area, the G-GPU size grows linearly with the number of CUs. The optimizations done for augmenting the performance increased the area by an average of 10% from 500MHz to 590MHz, and 2% from 590MHz to 667MHz. Thus, if power consumption is not a priority, the 667MHz version is a good fit, trading a negligible increase in area for better performance. These results demonstrate the potential scalability of the G-GPU architecture.

We chose four versions for physical synthesis: 1CU@500MHz, 1CU@667MHz, 8CU@500MHz, and 8CU@667MHz. During this phase, the G-GPU is broken into three partitions: the CU, the general memory controller (MCTRL), and the top. The density of the CU and the MCTRL was set to 70%. Because of our floorplan strategy of breaking the design into partitions, the top has a low density of 30%. Nevertheless, breaking the design into partitions allows the designer to scale the G-GPU without any extra effort: once a CU partition is fully placed and routed, it can be reused in versions with more than 1 CU by cloning the partition in the final floorplan of the design. Moreover, the user can create a collection of different CU layout blocks and easily scale the floorplan with the number of CUs for different application scenarios.

Fig. 3: Layout comparison between minimum and maximum performance of G-GPUs with 1 CU (top) and 8 CUs (bottom). [Figure shows the 1CU@500MHz and 1CU@667MHz layouts on top and the 8 CU layouts, including the 600MHz version, on the bottom; dimensions range from roughly 2500×2800 um (1 CU) to 7150×8350 um (8 CUs); the legend distinguishes untouched memories from CU-, MCTRL-, and TOP-optimized memories.]

The layouts for the versions with 1 and 8 CUs are depicted in Fig. 3. The size scale in the figure only applies between layouts with the same number of CUs. The block memories divided for augmenting the performance are highlighted in green for the CU partition, yellow and pink for the MCTRL, and blue for the top. Note how different the floorplan is between the version with optimizations running at 600MHz and the one without optimizations running at 500MHz. Block memories have to be strategically placed in order to extract the maximum performance, hence the differences in the floorplans. The layouts of the versions 1CU@500MHz, 1CU@667MHz, and 8CU@500MHz achieve the performance expected from logical synthesis (i.e., they can run at the specified clock frequency without any timing violation). However, the layout of version 8CU@667MHz can only run at 600MHz. This is explained by analyzing the floorplan of its layout (see Fig. 3): the connecting routing wires introduce a significant capacitance because of the long distance between the peripheral CUs and the general memory controller.

To fully evaluate the G-GPU as an ASIC accelerator, we compared its performance with the popular RISC-V architecture. We synthesized both architectures using the same technology as before, with an operating frequency of 667MHz, the RISC-V having 32Kb of memory and the G-GPU having 1/2/4/8 CUs. We chose seven micro-benchmarks from the AMD OpenCL SDK and increased their input sizes until crashing the RISC-V compiler. We further increased the input size of the G-GPU applications to make its compute units fully utilized. To compare the performance of the different input-size applications, we took a pessimistic approach for the G-GPU and considered that one could increase RISC-V application input sizes by multiplying its cycle count by the G-GPU/RISC-V input size ratio. These results are shown in Fig. 4.

Our first evaluation compares raw performance between G-GPU and RISC-V for the same input sizes. For applications
TABLE I: Characteristics of 12 different G-GPU solutions generated by our tool after logic synthesis in Cadence Genus.
#CU & Freq. Total Area (mm2 ) Memory Area (mm2 ) #FF #Comb. #Memory Leakage (mW) Dynamic (W) Total (W)
1@500MHz 4.19 2.68 119778 127826 51 4.62 1.97 2.055
2@500MHz 7.45 4.64 229171 214243 93 8.54 3.63 3.77
4@500MHz 13.84 8.56 437318 387246 177 16.07 6.88 7.14
8@500MHz 26.51 16.39 852094 714256 345 30.79 13.33 13.86
1@590MHz 4.66 3.15 120035 128894 68 4.73 2.57 2.66
2@590MHz 8.16 5.34 229172 221946 120 8.73 4.63 4.81
4@590MHz 15.03 9.72 436807 397995 224 16.41 8.70 9.02
8@590MHz 28.65 18.49 850559 737232 432 31.25 16.81 17.40
1@667MHz 4.77 3.26 120035 130802 71 4.65 2.62 2.72
2@667MHz 8.27 5.45 229172 222028 123 8.72 4.69 4.87
4@667MHz 15.15 9.83 436807 398124 227 16.43 8.75 9.07
8@667MHz 28.69 18.60 848511 730506 435 30.21 19.10 19.76
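The area-overhead averages quoted in Section IV (roughly 10% from 500MHz to 590MHz and roughly 2% from 590MHz to 667MHz) can be re-derived from the Total Area column of Table I. The short script below — a checking sketch with the values transcribed from the table, not part of the GPUPlanner flow — computes those averages and the per-CU area:

```python
# Total Area (mm^2) from Table I, indexed by target frequency (MHz);
# one entry per CU count (1, 2, 4, 8).
area = {
    500: [4.19, 7.45, 13.84, 26.51],
    590: [4.66, 8.16, 15.03, 28.65],
    667: [4.77, 8.27, 15.15, 28.69],
}

def avg_overhead_pct(base, opt):
    """Average relative area increase of `opt` over `base`, in percent."""
    return 100 * sum((o - b) / b for b, o in zip(base, opt)) / len(base)

print(f"500 -> 590 MHz: {avg_overhead_pct(area[500], area[590]):.1f}%")  # ~9.4%
print(f"590 -> 667 MHz: {avg_overhead_pct(area[590], area[667]):.1f}%")  # ~1.2%

# The table also shows near-linear area scaling with the CU count:
for cus, a in zip([1, 2, 4, 8], area[500]):
    print(f"{cus} CU @ 500 MHz: {a / cus:.2f} mm^2 per CU")
```

The computed averages (about 9.4% and 1.2%) are close to the rounded figures in the text. The per-CU area shrinks slightly with more CUs (from 4.19 to about 3.31 mm²), presumably because the shared memory controller and top logic are amortized over more compute units.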
with low to no parallelism, the G-GPU can be as little as 1.2 times faster than the RISC-V. As the G-GPU is a domain-specific ASIC accelerator, such results are expected, since it will not be the best option for general-purpose applications. Therefore, a user interested in implementing a G-GPU as an accelerator can use the provided data to ponder whether this type of architecture is a good fit for his system, considering only the raw speed-up.

Fig. 4: Speed-up over RISC-V. [Bar chart of raw speed-up (left axis) and speed-up derated by area (right axis) for benchmarks including mat_mul, copy, vec_mult, div_int, xcorr, and parallel_sel; the legend gives G-GPU/RISC-V area ratios of 6.5, 11.6, 21.4, and 41.0 for 1, 2, 4, and 8 CUs, respectively.]

Our second evaluation factors the previously measured area into the performance speed-up. We derated the previously measured speed-up by dividing it by the area ratio (G-GPU/RISC-V). A G-GPU with 1 CU has an area 6.5 times larger than the RISC-V's, and it achieves the best increase in performance per area, at 10.2 times the RISC-V's. On the other hand, a G-GPU with 8 CUs has an area 41 times bigger than the RISC-V's, thus achieving the lowest increase in performance per area, at 5.7 times the RISC-V's. This trend happens mainly because data dependency and global memory communication limit parallelism; thus, the increased processing power provided by a G-GPU configuration with more CUs cannot be fully exploited.

We are planning to update GPUPlanner to be able to implement the 8-CU G-GPU without performance loss. The performance problem of the layouts with 8 CUs could be solved by replicating the general memory controller, shortening the distance to the peripheral CUs and reducing the delay introduced by the routing wires. Also, we intend to include memory hierarchy support and incorporate single-port memories into GPUPlanner.

V. CONCLUSION

Our results showed that G-GPUs are feasible domain-specific ASIC accelerators. Furthermore, when the G-GPU performance is contrasted with that of a RISC-V, our architecture shows tremendous benefits for applications with high parallelism. Moreover, as GPUPlanner is an open-source framework, it gives the community the opportunity to explore the design space of GPU-like accelerators. Our work goes beyond delivering a single GPU-like accelerator in 65nm, as our tool can be easily extended to support other baseline GPU architectures and technologies.

ACKNOWLEDGMENTS

This work has been partially conducted in the project "ICT programme", supported by the European Union through the European Social Fund. This work was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, CNPq, and FAPERGS.

REFERENCES

[1] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proc. of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
[2] C. Mucci, L. Vanzolini, A. Lodi, A. Deledda, R. Guerrieri, F. Campi, and M. Toma, "Implementation of AES/Rijndael on a dynamically reconfigurable architecture," in 2007 Design, Automation & Test in Europe Conference & Exhibition, pp. 1–6, 2007.
[3] T. D. Han and T. S. Abdelrahman, "hiCUDA: High-level GPGPU programming," IEEE Trans. on Parallel and Distributed Systems, vol. 22, no. 1, pp. 78–90, 2011.
[4] P. P. Brahma, D. Wu, and Y. She, "Why deep learning works: A manifold disentanglement perspective," IEEE Trans. on Neural Networks and Learning Systems, vol. 27, no. 10, pp. 1997–2008, 2016.
[5] J. E. R. Condia, B. Du, M. Sonza Reorda, and L. Sterpone, "FlexGripPlus: An improved GPGPU model to support reliability analysis," Microelectronics Reliability, vol. 109, p. 113660, 2020.
[6] M. Al Kadi, B. Janssen, and M. Huebner, "FGPU: An SIMT-architecture for FPGAs," in ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays, pp. 254–263, ACM, 2016.
[7] M. Gautschi et al., "Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices," IEEE Trans. on VLSI Systems, vol. 25, no. 10, pp. 2700–2713, 2017.
[8] OpenHW Group, "CV32E40P RISC-V IP," 2016. https://github.com/openhwgroup/cv32e40p.
[9] A. Canis et al., "LegUp: High-level synthesis for FPGA-based processor/accelerator systems," in FPGA'11, pp. 33–36, ACM, 2011.
[10] J. Weng, S. Liu, V. Dadu, Z. Wang, P. Shah, and T. Nowatzki, "DSAGEN: Synthesizing programmable spatial accelerators," in ACM/IEEE Int. Symp. on Computer Architecture, pp. 268–281, 2020.
[11] R. Ma et al., "Specializing FGPU for persistent deep learning," ACM Trans. Reconfigurable Technol. Syst., vol. 14, July 2021.
[12] V. Gangadhar et al., "MIAOW: An open source GPGPU," in IEEE Hot Chips Symp., pp. 1–43, 2015.
[13] P. Duarte, P. Tomas, and G. Falcao, "SCRATCH: An end-to-end application-aware soft-GPGPU architecture and trimming tool," in IEEE/ACM Int. Symp. on Microarchitecture, pp. 165–177, ACM, 2017.
[14] H. E. Sumbul, K. Vaidyanathan, Q. Zhu, F. Franchetti, and L. Pileggi, "A synthesis methodology for application-specific logic-in-memory designs," in ACM/EDAC/IEEE Design Automation Conference, pp. 1–6, 2015.