GPU Architecture Optimization for Mobile Computing
Abdulsami Aldahlawi1, Kyung Ki Kim2, Yong-Bin Kim1
1 Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, USA
2 Department of Electronic Engineering, Daegu University, Gyeongsan, Gyeongbuk, South Korea
aldahlawi, [email protected], [email protected]
Abstract— Graphics Processing Units (GPUs) are often criticized for their high power consumption, which accompanies the massive performance they deliver. As GPUs enter the mobile market, power constraints become stricter. In this work, we evaluate power gating techniques for GPU cache arrays. The leakage power is measured at 2.28 μW in active mode, 0.61 μW in sleep mode (26.7% of active-mode leakage), and 0.034 μW in off mode (1.5% of active-mode leakage), at a 1.0 V power supply using a 45 nm standard CMOS process.

Keywords: Graphics Processing Units (GPU); Power Gating; Leakage Power.

I. INTRODUCTION
The demand for higher performance and computing capability is steadily increasing. Scientific simulations, sophisticated graphics rendering, and big data processing are all examples of modern applications that require massively parallel processing. Because the execution model of Central Processing Units (CPUs) is based on sequential processing, these applications would require a significant amount of time if executed on a CPU. However, with the easier programmability of current GPUs, the era of General-Purpose GPUs (GPGPUs) has blossomed. Furthermore, the need for GPUs is expanding to the mobile market, as smartphones are now capable of running multimedia processing applications, 3D graphics, online video games, and more. However, with the limited power supply in mobile devices, strict power constraints must be met to ensure an adequate power-performance trade-off when running these applications.

II. BACKGROUND
• Double-Precision Unit: used for double-precision operations.
• Special-Function Unit: used for fast evaluation of functions such as sine, cosine, square root, complex numbers, etc.
• Load-Store Unit: used to calculate the source and destination addresses for memory requests.
Each SM has its own private Level 1 (L1) cache, whereas all SMs in a GPU share a common Level 2 (L2) cache. Fig. 1 shows a general architecture of modern GPUs [1].

Figure 1. GPU Architecture

B. GPU Memories
The memory hierarchy in a GPU differs from that of a CPU in a few aspects. In a CPU, the register file is loaded from the L1 cache, which is loaded from the L2 cache, which in turn is loaded from main memory. The GPU memory hierarchy, however, has more components to allow efficient execution at the thread, block, and grid levels. Fig. 2 shows the hierarchy of GPU memory [1].
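As a rough illustration of the CPU-style load path described under GPU Memories (register file, then L1, then L2, then main memory), the following Python sketch models each level as a simple store that fetches from the next level on a miss. The class name `Level`, the address `0x40`, and the stored value are hypothetical and chosen only for illustration; capacity limits and eviction are deliberately ignored, so this is a toy model and not a description of any real cache.

```python
class Level:
    """One level of a load chain; `backing` is the next, slower level."""

    def __init__(self, name, backing=None):
        self.name = name
        self.backing = backing
        self.lines = {}  # address -> value currently held at this level

    def load(self, address, trace):
        trace.append(self.name)        # record every level that is visited
        if address not in self.lines:  # miss: fetch from the level below
            if self.backing is None:
                raise KeyError(address)
            self.lines[address] = self.backing.load(address, trace)
        return self.lines[address]


# Build the chain: register file <- L1 <- L2 <- main memory.
main_memory = Level("main memory")
main_memory.lines[0x40] = 7            # hypothetical address and value
l2 = Level("L2", backing=main_memory)
l1 = Level("L1", backing=l2)
register_file = Level("register file", backing=l1)

first, second = [], []
value = register_file.load(0x40, first)   # cold access walks the whole chain
register_file.load(0x40, second)          # repeat access stops at the first level
```

In this sketch the first access visits every level in order, while the second is served by the register file alone. That is the behavior the sequential load chain implies once the data has been brought in.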
Figure 4. SRAM Cache Array

The control logic generating the S1 and S2 signals takes its input from the warp scheduler. Once the SM has completed its execution, both signals are turned off and the data is lost. During idle cycles, however, S1 is on and S2 is off, which retains the data while saving some leakage power.

B. 6T SRAM Cell Sizing and Sleep Transistor Sizing
The 6T SRAM cell transistors (M1-M6) were sized to ensure proper read and write operations in a 45 nm standard CMOS technology and to satisfy the 1 GHz clock timing constraint. The sleep transistors, however, are attached to a 32-byte array. The model used to size them sums all low-Vt transistors in the array to estimate the voltage rise caused by the current rush in the worst-case scenario, in which both sleep transistors turn on at the same time [3].

IV. CONCLUSION
The powerful computing capabilities of modern GPUs come at the price of increased power consumption. Because of the limited power supply in mobile devices, and because leakage power accounts for a considerable portion of the total power consumed by a device in modern CMOS technologies, a leakage power saving technique was applied to the GPU cache memory. Using this technique, the estimated power saving in sleep mode is about 73.3% compared to normal active mode.

REFERENCES
[1] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. “GPU computing”. Proceedings of the IEEE, 96(5), 2008.
[2] S. Rusu, S. Tam, H. Muljono, J. Stinson, D. Ayres, J. Chang, R. Varada, M. Ratta, S. Kottapalli, and S. Vora. “Power Reduction Techniques for an 8-core Xeon Processor”. In Proc. IEEE ESSCIRC, 2009, 340–343.
[3] J. Kao, A. Chandrakasan, and D. Antoniadis. “Transistor Sizing Issues and Tool For Multi-Threshold CMOS Technology”. In Proc. Design Automation Conference, 2014.