Technology Prospects for Data-Intensive Computing
This article advances the idea that data-intensive computing will further cement
semiconductor technology as a foundational technology with multidimensional
pathways for growth.
By KEREM AKARVARDAR, Senior Member IEEE, AND H.-S. PHILIP WONG, Fellow IEEE
ABSTRACT | For many decades, progress in computing hardware has been closely associated with CMOS logic density, performance, and cost. As such, the slowdown in 2-D scaling, frequency saturation in CPUs, and the increased cost of design and chip fabrication for advanced technology nodes since the early 2000s have led to concerns about how semiconductor technology may evolve in the future. However, the last two decades have also witnessed a parallel development in the application landscape: the advent of big data and the consequent rise of data-intensive computing, using techniques such as machine learning. In this article, we advance the idea that data-intensive computing would further cement semiconductor technology as a foundational technology with multidimensional pathways for growth. Continued progress of semiconductor technology in this new context would require the adoption of a system-centric perspective to holistically harness logic, memory, and packaging resources. After examining the performance metrics for data-intensive computing, we present the historical trends for the general-purpose graphics processing unit (GPGPU) as a representative data-intensive computing hardware. Thereon, we estimate the values of the key data-intensive computing parameters for the next decade, and our projections may serve as a precursor for a dedicated technology roadmap. By analyzing the compiled data, we identify and discuss specific opportunities and challenges for data-intensive computing hardware technology.

KEYWORDS | Artificial intelligence (AI); AI accelerators; big data applications; CMOS technology; deep learning; DRAM chips; energy efficiency; high performance computing; machine learning; Moore's Law; multichip modules (MCMs); nonvolatile memory; roadmaps (technology planning); SRAM chips; system integration; system-in-package (SiP); system-on-chip; three-dimensional integrated circuits; wafer bonding.

Manuscript received 19 May 2021; revised 24 June 2022; accepted 18 October 2022. Date of current version 12 January 2023. (Corresponding author: Kerem Akarvardar.)
Kerem Akarvardar is with Taiwan Semiconductor Manufacturing Company (TSMC), Corporate Research, San Jose, CA 95134 USA (e-mail: [email protected]).
H.-S. Philip Wong is with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305 USA, and also with Taiwan Semiconductor Manufacturing Company (TSMC), Corporate Research, San Jose, CA 95134 USA (e-mail: [email protected]; [email protected]).
Digital Object Identifier 10.1109/JPROC.2022.3218057

I. INTRODUCTION

It would be fair to characterize the 21st century as the digital transformation age, where data occupy the central position. Fueled by the rise of the Internet, the digitization of almost all aspects of human activity leads to a relentless growth of the data generated every day. For example, the majority of the world's data have been created in the last two years alone [1], and the annual size of the global "datasphere" (the totality of data in datacenters, cell towers, PCs, smartphones, IoT (Internet of Things) devices, and so on) is expected to increase by around 3.5× from 2020 to 2025 [2]. Sensor data, social networks, multimedia digital content, GPS data, and the like amount to so-called "big data," which essentially consists of massive, unstructured information, typically in the form of text, imagery, audio, and video. Big data is primarily characterized by its unusually high volume, variety, and the frequency at which it is generated and transmitted [3].

Developing and optimizing the algorithms and software/hardware systems that execute computation in close interaction with big data is known as "data-intensive computing" [4]. Data-intensive computing involves the collection, storage, access, and analysis of big data. Out of these, the analysis of raw data, akin to the processing of raw material, creates value by providing insight, facilitating decision making, and enabling discovery [5]. Machine learning and graph analytics are among the most powerful techniques for big data analysis [6] and data-driven applications in general.
II. KEY PERFORMANCE METRICS
A distinct feature of data-intensive applications is their
reliance on simple compute-intensive kernels, such as
matrix multiplication, and the consequent amenability to
specialization and parallelization. For instance, neural net-
works are inherently suited for hardware parallelization,
since their computation is primarily based on multiply-
and-accumulate (MAC) operations [10]. Similarly, most
graph algorithms can be implemented by some form
of matrix-vector multiplication [11]. Accordingly, data-
intensive applications favor hardware acceleration based
on graphics processing units (GPUs), FPGAs, custom ASIC
1 This work is the outcome of a research collaboration between Stanford University and TSMC, and does not reflect TSMC's technology development roadmap.

Fig. 2. Basic data-intensive computing system (top) and the associated parameter descriptions (below), adapted from [18].
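As a concrete illustration of the MAC-dominated kernels mentioned above, the short sketch below counts the operations and the main-memory traffic of a dense matrix-vector product. The matrix size and the 2-byte operand width are arbitrary assumptions for illustration, not values from the article.

```python
# Illustrative sketch (not from the article): counting the work and data traffic
# of a dense matrix-vector multiply y = W @ x, the kind of MAC-dominated kernel
# discussed above. All sizes are hypothetical.

def matvec_ops_and_bytes(m: int, n: int, bytes_per_element: int = 2) -> tuple[int, int]:
    """Return (operations, bytes moved) for y = W @ x with W of shape (m, n).

    Each output element needs n multiply-and-accumulate (MAC) steps, and one MAC
    is conventionally counted as 2 operations (multiply + add). Data traffic
    assumes W, x, and y are each read/written once from main memory.
    """
    ops = 2 * m * n                                   # 2 ops per MAC
    bytes_moved = (m * n + n + m) * bytes_per_element
    return ops, bytes_moved

if __name__ == "__main__":
    ops, traffic = matvec_ops_and_bytes(4096, 4096)   # FP16 operands assumed
    print(f"{ops/1e6:.1f} MOPs, {traffic/1e6:.1f} MB moved, "
          f"{ops/traffic:.2f} ops/byte")
```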
background energy required to keep the system functional during computation [18], [25]. Furthermore, E0 = P0 · DS, with P0 defined as the constant background power. Normalizing both sides of the equation for ES by W and taking their inverse

1/EES = 1/EEp + (1/OI) · (1/EEm) + P0/TS    (2)

where EES = W/ES is the system attainable energy efficiency, EEp = 1/εp is the processor peak energy efficiency, and EEm = 1/εm is the memory peak energy efficiency.

Fig. 3(b) shows the energy-domain equivalent of the roofline plot in Fig. 3(a). Unlike in Fig. 3(a), EES versus OI exhibits a continuous function, or an "arch line" [18], due to the lack of "overlap" between compute and memory access components in the energy domain. The highest attainable EES is EEp (for P0 = 0), similar to the relation between TS and Tp. Bε = εm/εp = EEp/EEm is the energy balance point. Note that imposing OI = Bε in (2) along with P0 = 0 would result in EES = EEp/2 [18] (in contrast to TS = Tp at OI = Bτ). Ensuring OI ≫ Bε would maximize EES, which would then be dominated by the processor.

In order to reveal the similarities between the time and energy domains, one can consider the "envelope" of EES, EES(env), which can be found directly from (1) by replacing Tp with EEp and BW with EEm: EES(env) = min(EEp, OI · EEm). As shown in Fig. 3(b), at extreme OI values (for OI ≪ Bε or OI ≫ Bε) and P0 = 0, EES converges to EES(env), since it would then be determined by a single system component. Similar to the roofline plot in Fig. 3(a), a higher EEm would lift up the slanted portion of EES(env), which has a slope of 1 on the log–log scale.

Dividing (1) by (2) and rearranging the terms, the average system power, PS = ES/DS = TS/EES, is given by

PS = P̄p + P̄m + P0    (3)

P̄m = Pm · min(1, Bτ/OI)    (3a)

P̄p = Pp · min(1, OI/Bτ)    (3b)

Pp = εp/τp = Tp/EEp and Pm = εm/τm = BW/EEm are the processor and memory peak power, respectively (such that Pp/Pm = Bτ/Bε). PS is plotted as a function of OI in Fig. 3(c), again for P0 = 0, and the trends are summarized in the figure caption as a function of the relative values OI could take versus Bτ.

Fig. 3 (caption, in part): … BW = 1 TB/sec, Pp = 200 W, and Pm = 50 W. P0 = 0 is assumed to simplify the interpretation of the plots in (b) and (c). In (c), as we focus on the range of values OI could take versus Bτ (>Bε based on the power values in this example), we notice that the following hold: 1) for OI ≪ Bτ, PS converges to Pm (as P̄p ∼ 0); 2) for OI < Bτ, P̄m = Pm due to the fully utilized memory bandwidth, while the processor cores do not get enough data (leading to P̄p < Pp); 3) for OI = Bτ, PS is maximized and equal to Pm + Pp; 4) for OI > Bτ, P̄p = Pp due to the fully utilized processor cores, while the available memory bandwidth is not entirely used (hence, P̄m < Pm); and 5) for OI ≫ Bτ, PS converges to Pp (as P̄m ∼ 0).

The equations above reveal that for a given operational intensity OI, the two fundamental performance metrics, the system throughput TS and the system energy efficiency EES, depend on Tp, BW, their power-normalized versions EEp and EEm, and P0. Basic equations relating these parameters to memory and compute hardware parameters are summarized below.

For a single processor core, Tp is simply the number of operations (OPs) per cycle multiplied by the number of cycles per second (i.e., the processor core frequency, fcore). For many cores working in parallel, the peak throughput increases linearly with the core count, Ncore [26]

Tp = (OPs/cycle) · fcore · Ncore    (4)

In (4), OPs/cycle = 2 for a core that executes one MAC operation each cycle. One should keep in mind that Tp reflects the peak throughput that would be achieved if all cores were to be fully utilized. The actual throughput may end up being significantly lower than Tp depending on the core utilization [22] dictated by the compute architecture and dataflow, in addition to BW and OI.

The key parameter on the memory side, the memory peak bandwidth (in bytes/sec), is given by

BW = DR · fmem · NIO/8    (5)

where DR is the data rate associated with the main memory interface (e.g., DR = 2 for double data rate (DDR) DRAM), fmem is the memory frequency, and NIO is the memory bus width (in bits) or the total number of I/O, given by the number of memory channels multiplied by the number of I/Os per memory channel [27], [28]. In (5), the factor 1/8 allows conversion to bytes.

Finally, dividing (4) and (5), respectively, by Pp and Pm provides the peak energy efficiency parameters EEp and EEm. The parameters studied above are not the only critical ones [22], and additional data-intensive computing parameters will be discussed in Section IV. In Section III, we will review the trends associated with the parameters covered so far.
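To make the model in (1)-(5) concrete, the following sketch evaluates TS, EES, and PS as functions of OI. BW, Pp, and Pm are the illustrative values from the Fig. 3 caption; Tp = 100 TOPS and P0 = 0 are assumptions, since Tp is not recoverable from the excerpt. The printout reproduces the caption's qualitative behavior: PS peaks at OI = Bτ and approaches Pm or Pp at the extremes, and PS = TS/EES holds at every point.

```python
# A minimal sketch of the time/energy/power model in (1)-(5). BW, Pp, and Pm are
# taken from the illustrative Fig. 3 caption values; Tp is an assumed value,
# since it is not recoverable from the excerpt. P0 = 0 as in the caption.

def system_model(OI, Tp, BW, Pp, Pm, P0=0.0):
    """Return (TS, EES, PS) at operational intensity OI (operations/byte)."""
    EEp, EEm = Tp / Pp, BW / Pm          # peak energy efficiencies, ops/J and bytes/J
    B_tau = Tp / BW                      # time balance point, ops/byte
    TS = min(Tp, OI * BW)                # (1): attainable throughput, ops/s
    inv_EES = 1/EEp + (1/OI) * (1/EEm) + P0 / TS   # (2): joules per operation
    EES = 1 / inv_EES                    # attainable energy efficiency, ops/J
    Pm_bar = Pm * min(1.0, B_tau / OI)   # (3a): average memory power
    Pp_bar = Pp * min(1.0, OI / B_tau)   # (3b): average processor power
    PS = Pp_bar + Pm_bar + P0            # (3): average system power
    return TS, EES, PS

if __name__ == "__main__":
    Tp = 100e12        # assumed peak throughput, 100 TOPS
    BW = 1e12          # 1 TB/s
    Pp, Pm = 200.0, 50.0
    for OI in (10, Tp / BW, 1000):       # below, at, and above the balance point
        TS, EES, PS = system_model(OI, Tp, BW, Pp, Pm)
        print(f"OI={OI:7.1f} ops/B  TS={TS/1e12:6.1f} TOPS  "
              f"EES={EES/1e9:8.1f} GOPS/W  PS={PS:6.1f} W")
```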
…ture for accelerating a wide spectrum of parallel applications in scientific computing, data analytics, and machine learning, especially in the last decade. Modern GPUs are multicore architectures originally conceived to support real-time computer graphics and have evolved into high-throughput programmable processors with the advent of AI and big data analytics [12], [16], [29], [30], [31]. Most recently, GPU architectures specialized for compute and for gaming also became available to further improve the energy efficiency [148] (throughout this article, we will continue to use the term GPGPU, although our dataset will also include compute GPUs). GPGPUs also have a rich and systematic dataset with steady trends since the early 2000s for most of the data-intensive computing parameters. As such, the historical trends revealed in this section and the projections thereon will be based on GPGPUs, although the role of FPGAs and ASIC accelerators in running data-intensive applications has been continuously increasing [12], [32].

Modern GPGPUs are based on the "system-in-package" (SiP) concept [33], consisting of the heterogeneous integration of tightly connected logic and memory dies, typically on a silicon interposer. More generally, SiP aims to improve the overall system performance by simultaneously accounting for logic, memory, and their connectivity through 2-D multichip module (MCM), 2.5-D interposer, and 3-D stacking technologies or some combination of
…ment rate is 1.57×/two years to allow 2-TB/sec bandwidth for each GPU in (a). (c) Time-balance (Bτ = Tp/BW) data.

Fig. 8. (a) GPU clock frequency data. Boost clock has been used whenever available, consistent with the reported throughput values in [53] and GPGPU datasheets. (b) FP32 core count trends.

Fig. 10. (a) DRAM effective frequency (DR · fmem). (b) Bus width (total I/O count, NIO) data.
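As a hedged illustration of (5), the snippet below compares two hypothetical memory interfaces, a wide-but-slow 3-D DRAM-style bus and a narrow-but-fast GDDR-style bus. The data rates, frequencies, and I/O counts are placeholders, not datasheet numbers from the Fig. 10 dataset.

```python
# Hedged example of (5): peak DRAM bandwidth from data rate, memory frequency,
# and bus width. Parameter values below are illustrative placeholders only.

def peak_bandwidth_bytes_per_s(data_rate: int, f_mem_hz: float, n_io: int) -> float:
    """BW = DR * fmem * NIO / 8, in bytes per second."""
    return data_rate * f_mem_hz * n_io / 8

# A wide/slow 3-D DRAM style interface vs. a narrow/fast GDDR style one can
# reach comparable bandwidth with very different NIO and effective frequency.
wide   = peak_bandwidth_bytes_per_s(data_rate=2, f_mem_hz=1.6e9, n_io=4096)  # e.g., 4 cubes x 1024 I/O
narrow = peak_bandwidth_bytes_per_s(data_rate=2, f_mem_hz=8.0e9, n_io=384)
print(f"wide bus:   {wide/1e12:.2f} TB/s")
print(f"narrow bus: {narrow/1e12:.2f} TB/s")
```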
IV. OPPORTUNITIES AND CHALLENGES

A. A Technology Roadmap for Data-Intensive Computing
As we have emphasized in the earlier sections, con-
tinuous progress in data-intensive computing hardware
requires harnessing and co-optimizing the advances in
logic technology with those in memory and packag-
ing/integration technologies. New applications and tech-
nologies also require adequate performance metrics as
discussed in Section II. In this context, an application
domain-specific semiconductor technology roadmap for
data-intensive computing would be timely and useful,
since it would set clear targets for technology develop-
ment, such that the semiconductor industry would know

Fig. 11. (a) DRAM and (b) L2 SRAM capacity trends.

Table 1. Server GPGPU projected parameter values for the next decade, assuming the current growth rates (per two years, denoted in the "Rates" column) that were extracted from the data covering the 2008–2022 period. The 2022 values are from the best fitting line to data and not from the actual products. Projections below do not reflect TSMC's technology roadmap.
The data compiled in Table 1 could serve as a precursor for such a roadmap. Clearly, the industry would not necessarily sustain the individual rates precisely as extracted but would instead tune them or trade them off against each other depending on their technological feasibility, specific gaps, application needs, and the most critical design goal (such as the system throughput, energy efficiency, form factor, time-to-volume production, and cost). However, irrespective of the specific paths that the industry would follow in the future, the appropriate first step would have been to document the trends as they stand today, which is what we aimed to accomplish in Table 1.

B. A New Metric for Gauging Technology Advancement

Data-intensive computing systems, executing elementary operations on large volumes of data in a parallel fashion, require combining multiple semiconductor technologies [82]. Therefore, the advances in such systems cannot be tracked based on a logic technology metric alone. Moreover, the current logic metric based on the transistor minimum gate length has been obsolete since the mid-1990s, in that the node labels have been significantly different from the physical gate length on the chip. As such, a "density metric" has been put forward, in order to "gauge advances in future generations of semiconductor technologies in a holistic way, by accounting for the progress in logic, memory, and packaging/integration technologies simultaneously" [82], [83].

Concretely, the density metric consists of a three-part number: [DL, DM, DC], where DL is the density of logic transistors, DM is the bit density of main memory (currently the off-chip DRAM bit density), and DC is the density of connections between the main memory and logic, all per mm2. Device density has been the primary driver for progress in semiconductor technology in terms of power, performance, and cost. Since the 1970s, both the transistor density DL and the DRAM bit density DM have been systematically increasing. As 2-D scaling asymptotically reaches a plateau, density improvements will increasingly rely on 3-D die stacking. Device density will continue to be defined as the number of components divided by the footprint area, such that it will improve linearly with the number of 3-D layers. In contrast to DL and DM, the progress in DC has so far been characterized by somewhat discrete jumps, as new packaging technologies were introduced [82], [84]. However, a technology roadmap for data-intensive computing is likely to motivate continuous and more systematic advances in DC (along with novel system architectures [81]) to ensure predictable growth in memory bandwidth and memory energy efficiency, the two parameters that became increasingly critical in the last decade.

A quantitative example on the usage of the density metric had been provided in [82]. However, considering the multiple options for the packaging technologies, the possible presence of multiple memory types in a system including the cache hierarchy, and other architecture details that may be relevant, a systematic way of formulating DL, DM, and DC across a wide variety of data-intensive computing hardware requires more work and can be enabled through dedicated debates within the community.

C. Sustaining Transistor Count and Memory Capacity Growth

The implications of the growth rates in Table 1 with respect to transistor count and memory capacity are discussed below.

1) Sustaining Transistor Count Growth: Based on the current rate of 1.73×/two years, the transistor count in GPU die(s) is expected to get near a trillion within a decade. As mentioned earlier, the transistor count increase has been driven so far primarily by density improvement. Due to the slowdown in 2-D scaling, the reticle limit for monolithic die area, and yield/cost considerations, GPUs and CPUs optimally partitioning their function into multiple chiplets (typically combined on an organic package substrate) have already been on the market [85], [86] and enable transistor count growth beyond a single die [84]. However, considering the area and energy efficiency, device density improvements would still be needed. Assuming the current rate of 1.57×/two years in Fig. 9(c), the transistor density would grow by around tenfold, to reach ≈700 M/mm2 within a decade, as shown in Table 1. The 2-D scaling to maintain this rate is feasible from the current 5-nm [87] to the 3-nm node [88], albeit >50% of this scaling would rely on design-technology co-optimization (DTCO) [88], [89]. Beyond 3-nm, novel transistor architectures combined with further DTCO and scaling boosters are likely to offer a visible 2-D scaling rate for a number of nodes [90]. Still, a systematic transition to 3-D logic within a decade may also occur to complement 2-D scaling and support an area- and energy-efficient growth of the component count. Encouragingly, product/test-chip demonstrations of logic-on-logic die stacking at advanced nodes are already available [39], [40], [41], and technologies to support more than two layers already exist [49].
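The compound-growth arithmetic behind these projections is simple enough to write down explicitly. In the sketch below, the per-two-year rates (1.73× for transistor count, 1.57× for density) are the ones quoted above, while the 2022 baselines (65 B transistors, 70 M/mm2) are round, assumed values for illustration and not the Table 1 entries.

```python
# A small helper for the compound-growth arithmetic used in this subsection.
# Rates are the quoted per-two-year values; the 2022 baselines are assumed.

def project(value_2022: float, rate_per_2yr: float, year: int) -> float:
    """Extrapolate a 2022 baseline at a fixed per-two-year growth rate."""
    return value_2022 * rate_per_2yr ** ((year - 2022) / 2)

if __name__ == "__main__":
    print(f"10-year transistor-count factor: {1.73 ** 5:.1f}x")   # ~15.5x
    print(f"10-year density factor:          {1.57 ** 5:.1f}x")   # ~9.5x, i.e., 'around tenfold'
    # With assumed 2022 baselines of ~65 B transistors and ~70 M/mm2:
    print(f"2032 transistor count: {project(65e9, 1.73, 2032)/1e12:.2f} T")
    print(f"2032 density:          {project(70e6, 1.57, 2032)/1e6:.0f} M/mm2")
```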
2) Sustaining SRAM Capacity Growth: A large-capacity SRAM in GPUs improves the hit rate and the effective bandwidth, hence lowering the execution latency and memory access energy, particularly for memory-bound data-intensive workloads [65], [76], [175]. As shown in Fig. 11(b), there has been a steep growth in GPU on-chip L2 capacity especially in recent years, commensurate with ever increasing DNN model sizes, especially for natural language processing (NLP) applications [91]. With the current rate of 2×/two years, the L2 SRAM capacity would grow by 32-fold in a decade, approaching the GB range. Moreover, future compute GPU architectures specialized for DNN inference and training may incorporate not just L2, but also GB-range L3 SRAM on separate, dedicated dies, which would be combined with the GPU die(s) using 2.5-D/3-D integration [92].

A substantial increase in SRAM capacity has been recently observed in other high-end compute architectures as well. For example, an L3 cache has already been introduced in gaming GPUs, with an on-chip capacity of 128 MB (which has been reported to enable a significantly higher effective bandwidth and bandwidth/watt compared with a wider-I/O GDDR6 DRAM solution [175]). A similar increase of the total SRAM capacity, to 144 MB, has also been the case for the latest inference Tensor Processing Unit (TPUv4i), and the associated benefits regarding DNN inference latency were again reported to be higher than those that could be achieved by doubling the HBM bandwidth [65]. Finally, in some of the emerging DNN accelerator chips, where computing is distributed to up to thousands of cores (each with its local SRAM) across the die, the total on-chip SRAM capacity can reach 500–900 MB [176], [177], [179]. Here as well, the very high on-chip SRAM capacity is estimated to relax the DRAM bandwidth requirement, so that lower cost and higher capacity DRAM options versus HBM may potentially become viable depending on the application [179].

While a high-capacity SRAM is needed the most, fin field-effect transistor (FinFET)-based SRAM scaling has been slowing down (e.g., 1.2× density improvement from the 5- to the 3-nm node [93]), impeding an aggressive capacity growth and visible performance improvement from one node to the next. Accordingly, device solutions, such as the forksheet FET [94] and the complementary FET (CFET) [95], have been actively investigated and, if successful, would enable improved SRAM scaling factors. Nevertheless, it is also clear that continuous node-to-node improvements for many nodes have their limits on a 2-D plane. As such, a systematic transition to 3-D can be expected in combination with the 2-D solutions in the near future. The 3-D SRAM architecture had been the subject of detailed studies for the last decades [96], [97], [98], and the 3-D die-stacking technologies to make it a reality are currently available [44]. The recent 3-D SRAM-based CPU product enabling a generational performance gain [42] has been a very encouraging milestone in that regard.

3) Sustaining DRAM Capacity Growth: The current rate of DRAM capacity growth, 1.53×/two years, would require close to a 9× increase of the in-package DRAM capacity to exceed half a TB within a decade. Since 3-D DRAM is already being used in GPGPUs (originally motivated by better energy efficiency compared with GDDR DRAM [34]), the required capacity will likely be enabled by expanding the current 2.5-D/3-D advanced packaging capabilities in combination with further 2-D DRAM scaling. As an example, a future GPGPU featuring 12 HBM3 cubes (2× cube count versus [76]), each with 16 layers (2× layer count versus HBM2E [36]) and 32-Gb/layer density (2× density/layer versus HBM2E [36]), would allow a total of 8× increase in capacity, all within the provisioned capability of HBM3 [99] and 2.5-D interposer technology [38].

Although the 3-D layer count does not have a fundamental limit, as the 3-D stack gets taller, thermal issues, power delivery, and reliability challenges are exacerbated [100]. However, even before those become a problem, the package thickness can be a limiting factor, at least temporarily, in terms of mechanical design [100] and industry standards [99]. As an example, the adoption of 12-layers-tall ("12-Hi") HBM2E may have been impeded so far due to the stack height being taller than the common, 8-Hi option [101]. This makes the availability of alternative die-stacking techniques, which would allow the stack height to be reduced compared with the mainstream through-silicon via (TSV)/microbump technology (currently employed in HBM DRAM), very critical [100]. From that perspective, the test chip in [102], where 12 layers were stacked on a base die with less than 600-μm total thickness (by using SoIC (System on Integrated Chips) bonding and face-to-back stacking), has been a key demonstration.
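The in-package capacity example above can be checked with a few lines. The 12-cube/16-layer/32-Gb configuration is the one quoted in the text, while the baseline configuration is inferred from the stated 2× ratios and is therefore an assumption.

```python
# The in-package DRAM capacity arithmetic from the example above, written out.
# Cube count, layer count, and per-layer density are the values quoted in the
# text; the baseline configuration is inferred from the stated "2x" ratios.

def hbm_capacity_gb(cubes: int, layers_per_cube: int, gbit_per_layer: int) -> float:
    """Total in-package DRAM capacity in gigabytes."""
    return cubes * layers_per_cube * gbit_per_layer / 8

baseline = hbm_capacity_gb(cubes=6,  layers_per_cube=8,  gbit_per_layer=16)   # assumed HBM2E-class baseline
future   = hbm_capacity_gb(cubes=12, layers_per_cube=16, gbit_per_layer=32)   # example quoted in the text

print(f"baseline: {baseline:.0f} GB, future: {future:.0f} GB "
      f"({future/baseline:.0f}x, i.e., beyond half a TB)")
```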
4) Potential Role of NVM to Support Memory Capacity Trends: Emerging nonvolatile memories' (NVMs') scalability and amenability to monolithic 3-D integration [103] make them a good candidate for the area- and static-energy-efficient growth of memory capacity by complementing DRAM or SRAM. Considering the typical error resilience of data-intensive applications, the multilevel cell (MLC) capability of certain NVMs has the potential to further improve the bit density and support the capacity growth [104], [105]. However, NVMs also tend to have multiple shortcomings, such as high write energy and latency as well as low endurance [106]. Thus, the promise of NVMs depends on how much a given data-intensive computing architecture and application would be able to hide NVMs' weaknesses while revealing their strengths.

In the GPU context, a recent NVM-based proposal consists of significantly increasing the available GPU memory capacity by leveraging low-latency NVM solid-state drives (SSDs) [107]. As the name suggests, these devices are significantly faster than conventional NAND SSDs while providing much higher capacity than the DRAM [108]. Current low-latency SSD options are based on a vertical NAND flash structure [109] or 3-D cross-point phase-change memory [110]. The architecture proposed in [107] would enable GPUs to directly manage high-throughput and fine-granularity access to SSDs and, hence, mitigate CPU-related synchronization and I/O traffic overheads. Critical datacenter workloads, such as graph and data analytics, graph neural networks, and recommender systems with exceptionally large memory footprints, are estimated to primarily benefit from this architecture leveraging low-latency SSD storage.

Moving closer to the compute in the memory hierarchy, the spin-transfer torque magnetoresistive RAM (STT-MRAM) has been considered by all major semiconductor manufacturers as a higher density/lower leakage alternative to SRAM for last-level cache implementation [111]. In the context of GPUs, STT-MRAM-based register files have been widely explored ([112], [113], and references therein). The GPU register file is typically larger than or comparable to the L2 cache in terms of total capacity [76], [77], [78], [112]. Primarily motivated by leakage reduction compared with SRAM, the relatively slow and high-energy write operation of MRAM can be, for instance, overcome by an SRAM-based register cache minimizing the write count into MRAM [113]. In addition, stacking NVM dies with logic dies through fine-grained 3-D integration would allow combining the intrinsic benefits of the two technologies [114].
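As a brief aside on the MLC point above, a cell that reliably distinguishes n levels stores log2(n) bits, so the density benefit grows only logarithmically with the level count. The sketch below just tabulates this relation and ignores error-correction overhead.

```python
# Multilevel-cell (MLC) density note: n distinguishable levels store log2(n)
# bits per cell, so 4 levels double and 8 levels triple single-level density.
import math

for levels in (2, 4, 8, 16):
    print(f"{levels:2d} levels/cell -> {math.log2(levels):.0f} bit(s)/cell")
```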
D. Improving Memory Access Efficiency by Fine-Grained 3-D Integration

Memory access is a fundamental bottleneck for many computing systems in terms of latency and energy [115], [116] and particularly for data-intensive applications [117]. In GPUs, FPGAs, and ASIC accelerators, elevating the memory access efficiency closer to the compute capabilities would provide substantial gains at the system level, especially for memory-bound workloads, even when there is little or no improvement in the processor peak throughput or energy efficiency. For domain-specific accelerators in particular, as specialization combined with advanced logic technology drastically reduces the cost of logic computation, the memory access bottleneck becomes more apparent [66], and addressing it tends to make the biggest impact on the system performance [20]. Accordingly, improving the memory bandwidth and bandwidth/watt stands out as a major driver for future advances in semiconductor hardware technology.

Earlier in Fig. 10, we observed that the transition from GDDR DRAM to HBM widens the memory bus while reducing the frequency, hence enabling a power-efficient growth of the bandwidth [27]. Since the emergence of HBM, however, per-pin data rates have been increasing again (≈2× in HBM3 versus HBM2E [99]) along with the HBM cube count in GPUs, while NIO per cube has remained constant at 1024. Hence, an increase in the 3-D main memory I/O count per cube, together with commensurate architecture improvements [48], [71], is conceivable moving forward. In this context, one question is whether NIO in a 3-D memory could increase at a rate equal to the FP32 core Tp growth rate of 1.7×/two years (in order to prevent further widening of the gap between BW and Tp, assuming no increase in the memory effective frequency). This would require a ≈14× (1.7^5) improvement in vertical interconnect density within a decade if the dedicated TSV area were to remain constant. Considering the current HBM TSV pitch of 48 × 55 μm [36], this increase would lead to a TSV pitch of ≈13 μm by 2032. Excitingly, a 4-Hi HBM-like test structure with a 9-μm TSV/bond pitch has already been demonstrated by using chip-on-wafer, face-to-back die stacking, and SoIC bonding [102]. The SoIC bonding pitch is projected to scale to sub-micrometer dimensions [84], which can also be leveraged to do the following: 1) reduce the TSV area in order to increase the memory capacity per layer and 2) address the bandwidth deficit associated with the specialized GPU core Tp growth rate, which is much higher than that of the FP32 cores [76], [77], [78], as touched upon earlier. In addition to increasing the density, improving the quality of the connections between 3-D tiers is critical with respect to energy-efficient data movement. In that respect as well, SoIC bonding presents substantial RC delay and IR drop advantages compared with mainstream microbump/solder joint approaches [118].
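The vertical-interconnect arithmetic in the preceding paragraph can be reproduced as follows. The 1.7×/two-years rate and the 48 × 55 μm HBM TSV pitch are the values quoted above, and the pitch is assumed to scale as the inverse square root of the interconnect density.

```python
# Vertical-interconnect arithmetic from this subsection: if I/O count per cube
# must track the quoted 1.7x/two-years throughput growth at constant TSV area,
# interconnect density rises by 1.7^5 over a decade and pitch shrinks by its
# square root. The 48 x 55 um starting pitch is the HBM value cited above.
import math

density_factor = 1.7 ** 5                          # ~14x over ten years
pitch_factor = 1 / math.sqrt(density_factor)       # pitch ~ 1/sqrt(density)
start_pitch_um = math.sqrt(48 * 55)                # geometric mean of the 48 x 55 um pitch
print(f"density factor: {density_factor:.1f}x")
print(f"required pitch by 2032: {start_pitch_um * pitch_factor:.1f} um "
      f"(compare the 9-um pitch already shown in [102])")
```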
When we move from 3-D memory to 3-D memory-on-logic, vertical interconnects would reduce the memory-to-logic distance, the within-die horizontal interconnects, and the required CMOS buffer count and size [119], [120], [121]. The 3-D logic-on-memory also offers opportunities for architecture advances [81] that would enable an increased number of connections, hence an increased bandwidth, between logic and memory versus 2-D designs. Nevertheless, this "full 3-D" approach is also more prone to thermal issues (compared with a memory-only stack), which tend to degrade lifetime reliability, soft error rates, leakage power dissipation, dynamic timing error rates, and performance [122]. Addressing the thermal issues for a memory-on-logic (and logic-on-logic) stack would require technology/design/architecture co-optimization [123], which is especially critical at increased 3-D layer count. On the technology side, fine-grained micro-TSVs and bonding would also be beneficial in terms of thermal resistance due to the high density of vertical interconnects facilitating the heat flow toward the heat sink [118]. The heat flow can be further facilitated by advanced thermal interface materials [124]. More effective cooling techniques include direct jet impingement [125] and fine-grained liquid cooling at the chip [126] and package level [127]. On the design and architecture side, thermal-aware floor planning (hot-spot misalignment), memory management, and task scheduling techniques [128] may need to be additionally deployed. While the cooling techniques mature, novel 3-D products are enabled by innovative approaches minimizing the thermal impact. As an example, in [42], chip-on-wafer bonding, which allows stacking dies with dissimilar areas, is used to stack an SRAM die only on top of the SRAM area of a CPU chip, hence avoiding SRAM above the hotter compute cores while still substantially improving its capacity at a fixed footprint.

It is worth mentioning that while the current focus may be primarily on the density, count, and length of the wires between different system components, high-bandwidth and low-energy communication can be achieved by other means as well, such as silicon photonics [129], [130], [131] or wireless coupling [132], [133], which may find increased adoption moving forward to support data-intensive applications.

E. Scalability of System Throughput

Scalability of an architecture refers to how well it can be scaled up to achieve a performance metric when increasing the amount of resources, such as the number of processor cores and the memory capacity [22]. The hardware architecture for data-intensive workloads, and especially DNNs, can be highly scalable, since the inherent massive parallelism of these workloads would allow extracting more performance by merely supplying more compute hardware [134]. Accordingly, in Fig. 4(b), the increased number of transistors (hence, core count) and I/O count (through the transition to 3-D DRAM with an increasing number of cubes) collectively contribute to system peak throughput improvements by boosting Tp and BW, respectively. Scaling up has thus been a major driver behind the advances in GPUs and will remain so in the future. In particular, providing more transistors at every product generation by further 2-D scaling and/or advanced packaging will remain a key goal.

An alternative means of exploiting scalability is to allocate multiple units to collaboratively work on a given task. Multi-GPU systems that have been widely deployed in high-performance computing [31] (despite the challenges with respect to workload distribution, synchronization, and the energy efficiency of the interconnects [73]) fall into this category. A more recent example from the ASIC domain is the TPUv3 supercomputer employing 1024 chips [135]. This allows the peak throughput to linearly scale from 123 TFLOPS to 126 PFLOPS (Peta FLOPS), at the cost of a power increase from 450 W to 594 kW and a system size of about 7 ft tall and 36 ft long [135].

In summary, for the specific data-intensive computing applications where a high system throughput is the most critical parameter (while the power and area/form factor budgets are relatively abundant, such as DNN training at the datacenter), the scalability enabled by parallelism leads to a steadily growing demand for cutting-edge semiconductor technology volume. Clearly, upscaling the performance under stricter power and area constraints is a more challenging task, which we will touch upon next.

F. Sustaining Compute Energy Efficiency Trends

In Section II, we had defined the processor peak energy efficiency as EEp = Tp/Pp. EEp can be related to CMOS technology parameters by imposing Pp ≈ Pcore · Ncore (implying P0 ≈ 0), where Pcore is the peak power associated with a single core. Then, using (4) for Tp, it follows that EEp ≈ fcore/Pcore in units of MAC operations per Joule (a related note is that the common practice of benchmarking CMOS technologies by comparing power at constant frequency or vice versa [136], [137] reflects the energy efficiency differences at either fixed frequency or fixed power). Further assuming a case where the peak processor power is dominated by dynamic operation (Pcore = α · fcore · C · V², where α, C, and V are the activity factor, the core total switching capacitance, and the core supply voltage, respectively) expectedly leads to EEp ≈ 1/(α · C · V²), i.e., the inverse of the capacitor switching energy. Accordingly, in order to sustain the compute energy efficiency gains in Fig. 6(a), the two fundamental hardware technology knobs that should continue to scale are the switching capacitance C (consisting of both device and interconnect capacitances) and the supply voltage V (we will briefly touch upon α scaling later in Section IV-F5).

Up until the mid-1990s, Dennard scaling [138] had enabled a "win–win" setup for semiconductor technology, in which device density, energy efficiency, and cost were all steadily improved [139]. During this period, for a linear scaling factor of κ (whose ideal value is 0.7×, such that the area reduces to half every generation), the active energy, C · V², has scaled by κ³, since both the switching capacitance C and the supply voltage V scaled as κ. Once the supply voltage scaling started to saturate due to increased subthreshold leakage, the energy per operation scaled at a significantly lower rate than the ideal factor of κ³ [136]. Indeed, based on the processor energy efficiency data in Fig. 6(a), the energy per operation has been scaling as 1/EEp = 0.62×/two years, which is far from the ideal κ³ and rather close to κ = 0.7×/two years [58]. Nevertheless, sustaining even this current rate would require >10× scaling in energy per operation within a decade. Meeting this target while 2-D scaling slows down will require not only pushing the traditional device technology knobs to their limits but also increasingly combining them with emerging technologies, such as 3-D die stacking, along with circuit design, system architecture, software, and algorithm optimizations [79]. In the following, we will briefly cover the critical knobs, starting with the traditional ones.
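Before turning to the individual knobs, the scaling arithmetic of this subsection (and the voltage and capacitance figures quoted in Sections IV-F1 and IV-F2 below) can be summarized in a few lines; the endpoints and rates are the ones stated in the text, and the rest is compounding.

```python
# Back-of-the-envelope checks of the energy-efficiency scaling discussed in this
# subsection and quantified further in IV-F1 and IV-F2 below. Rates and
# endpoints are the ones quoted in the text.

ideal_kappa = 0.7
print(f"ideal Dennard energy scaling per node: kappa^3 = {ideal_kappa**3:.2f}x")

# Observed energy/operation trend: 1/EEp = 0.62x per two years.
per_2yr = 0.62
decade = per_2yr ** 5
print(f"energy/op after a decade at 0.62x/2yr: {decade:.3f}x  (>10x reduction needed)")

# Supply voltage: ~1.2 V (55 nm) to ~0.75 V (5 nm) over ~15 years.
v_rate = (0.75 / 1.2) ** (2 / 15)
print(f"implied V scaling: about {100 * (1 - v_rate):.0f}% per two years "
      f"(cf. the <6% figure quoted in IV-F1)")

# Switching capacitance: kappa approximated from the 1.57x/2yr density trend.
c_rate = 1.57 ** -0.5
print(f"implied C scaling: {c_rate:.2f}x per two years (~0.8x, cf. IV-F2)")
```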
1) Further Supply Voltage Scaling: The scaling of V is limited by the affordable leakage level (leading to a minimum tolerable threshold voltage), the desired ON-current, and the noise margin. The key consequence of a substantially slowed supply voltage scaling while 2-D scaling continues has been an increased power density. During the 15 years over which we have tracked the server GPU trends, the "nominal" V has been reduced by only about 450 mV (from 1.2 V at the 55-nm node [140] down to around 0.75 V at the 5-nm node [87]), which corresponds to a V scaling rate of less than 6% every two years if V were to scale at a constant rate. Today, the actual V in advanced CMOS can vary significantly according to the application domain and the associated tuning of the technology. Nanosheet devices [141], considered for future nodes, may allow a slightly lower nominal voltage than FinFETs due to better electrostatic control and lower threshold voltage variation [142]. Nevertheless, except for ultralow power applications, such as IoT [143], scaling of the supply voltage is likely saturated as an energy efficiency knob considering the predominance of device and interconnect parasitic resistances in advanced CMOS [144], leaving little room for compromise on the overdrive voltage.

2) 2-D Logic Scaling: The switching capacitance reduction to improve the energy efficiency relies primarily on 2-D scaling. C would ideally decrease as the linear scaling factor κ, which can be approximated by the inverse square root of the transistor density improvement rate in Fig. 9(c). This leads to a C scaling rate of ≈0.8×/two years for the time period over which our data have been collected. Considering that this rate actually underestimates how fast the proper logic area scales (as explained in Section III), together with the minimal scaling of V, it is fair to conclude that the major contributor to the compute energy efficiency improvements in Fig. 6(a) has been 2-D scaling, while the contributions from the circuit design, architecture, and power management knobs have been increasingly important at the system level [9]. As also mentioned in Section IV-C1, in recent CMOS nodes, this 2-D scaling has been increasingly maintained by DTCO innovations [88], [145], while the conventional contacted gate pitch and minimum metal pitch scaling have been slowing down [83], [146]. Moving forward, novel devices, such as nanosheets, forksheets, and CFETs; boosters, such as the buried power rail; and further DTCO advances will be able to maintain a visible 2-D scaling for the foreseeable future [90], [147]. This, combined with the steady improvements in back-end-of-line (BEOL) technology [184], will continue to ensure noticeable logic energy efficiency improvements, although the process integration challenges will continue to increase at each node.

3) Power Management: Techniques such as adaptive calibration, droop mitigation, active thermal management, and per-core voltage regulation have been critical in dealing with various sources of variation and minimizing the wasted energy during computation [9], [148]. How to optimally combine these techniques depends on the workload and the specific architecture [148]. Dynamic voltage and frequency scaling (DVFS) and clock gating are commonly used for reducing the dynamic power, whereas power gating eliminates the leakage by turning off the supply voltage of unused circuits [149]. A visibly improved DNN training and inference energy efficiency in GPUs by DVFS, without causing a significant performance degradation, had been reported in [150] and references therein. Power gating in GPUs can also save considerable power if the energy overhead associated with the power-gating transistors (header and footer switches) can be compensated by ensuring that the target blocks remain idle long enough [149], [151].

4) Domain-Specific Technology: Another pathway to improve the energy efficiency is CMOS technology specialization depending on the function (e.g., logic versus SRAM) and the application space (e.g., high performance [152] versus low power [143], [153]). This "domain-specific technology" (DST) concept [88] would be a natural extension to the chiplet approach, where a large SoC is disintegrated into smaller, higher yield/lower cost dies and reassembled in a 2-D, 2.5-D, and 3-D package through heterogeneous integration [9], [86], [154], [155]. DST would match different chiplets with different functions and enable technology optimizations specific to the application domain, as illustrated in Fig. 12. Hence, it would potentially allow each chiplet to achieve a more optimal power, performance, area, and cost (PPAC) due to customized cell libraries and dedicated process integration, resulting in reduced complexity and variability as well as a more robust yield.

5) Contributions to the System Energy Efficiency From the Higher Levels of the Computing Stack: Circuit design, hardware architecture, software, and algorithms constitute the higher levels of the computing stack (versus semiconductor technology at the "bottom" [79]) and will be critical to maintain—and exceed—the energy efficiency improvement rates for data-intensive computing.

The impact of circuit design on the energy efficiency has been briefly touched upon earlier, in the context of transistor density improvements by DTCO, as well as with respect to power management. In addition to those, the cross-pollination between CPU and GPU designs, regarding the switching capacitance optimizations [183] that allow higher frequencies within the same/similar power envelope, was reported to substantially contribute to GPU performance-per-watt improvements [175].

On the hardware architecture side, the solution to maximize the energy efficiency for data-intensive computing, and particularly for AI hardware, has been to sacrifice the flexibility of the general-purpose architecture (Fig. 1). As also mentioned in Section III, for general-purpose processors, the energy for useful arithmetic work remains much lower than the energy for instructions, control, and data movement [58], [64], [66], [156]. By redirecting the hardware resources supporting programmability into large arithmetic units and local memories minimizing data movement, DSAs can execute a narrow range of tasks in a significantly more energy- and area-efficient fashion than
Fig. 14. From integrated circuits to the new era of integrated chips. (Top left, top right, and lower left images are adapted from [172],
[173], and [174], respectively. Lower right image from [118], reprinted with permission, 2019 IEEE).
and 3) countless combinations of 1) and 2) with the logic technology, which itself has considerable knobs up its sleeve for the next decade [90]. In the meantime, hardware technology can also increasingly benefit from the progress on more disruptive options, such as silicon photonics. Still, semiconductor technology alone, even a more broadly defined one as in Fig. 13, may not be able to sustain the data-intensive computing energy efficiency trends at historical rates indefinitely. As such, contributions from the higher levels of the computing stack (which have already proven very effective), as well as their co-optimization with the hardware technology [65], [148], are crucial to maximize the system-level benefits. This holds even more true when it comes to realizing more ambitious energy efficiency targets well exceeding the current trend lines [160].

G. Keeping the Cost Affordable

The cost of data-intensive computing hardware is primarily determined by the compute die(s)' technology node and area, the off-chip memory capacity and bandwidth, as well as the specific packaging technology, including the cooling solution [22]. For decades, the semiconductor industry has relied on 2-D scaling to reduce the cost per transistor. However, starting with the early 2000s, sustaining 2-D scaling and improving PPA has required the incorporation of many new device, integration, and patterning technologies, along with a significantly increased number of process steps [87], [161], [162], [163]. In consequence, 2-D scaling started to provide diminishing returns in terms of cost scaling, although the trends are actually less dramatic if one were to compare subsequent nodes at the same maturity level.

Part of the increasing cost has been due to the SoC model, which, in order to improve PPA compared with its board-level equivalents, requires combining dissimilar devices, such as logic, SRAM, analog, and I/O, on the same silicon wafer monolithically [164], at the expense of increased integration complexity, an increased number of process steps, and degraded yields that are inversely proportional to die size. In that regard, heterogeneous integration of multiple chiplets within a package [9], [86], [165] offers several advantages with respect to cost, performance, and functionality: 1) partitioning a large SoC into smaller dies improves the yield significantly, lowering the cost; 2) non- or slowly scaling parts, such as analog, I/O, and SRAM, can rely on mature, hence cheaper, technology nodes, further reducing the cost; 3) for homogeneous chiplets using DST, the integration flow would become simpler, such that it can be optimized more effectively than in an SoC; 4) more functionality can be achieved compared with a monolithic die, since the total chiplet area is not bound by the reticle limit or a single technology; 5) fine-tuning the specific chiplet combinations according to customer demand would allow the companies to maintain a rich product portfolio at acceptable cost [86]; and 6) in the case of 3-D stacking, additional cost saving would be possible by customizing and simplifying the BEOL flow of the involved dies [98], [166], [167]. Moreover, it is estimated that combining the chiplet approach with 3-D integration can potentially enable further cost benefits by reducing the time to market [168].
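A standard first-order yield argument illustrates advantage 1). The Poisson defect model and the 600-mm2 / 0.1-defects-per-cm2 numbers below are textbook-style assumptions, not figures from the article, and assembly yield and die-to-die interface overheads are ignored.

```python
# A simple die-yield sketch (not from the article) illustrating advantage 1):
# with a Poisson defect model, Y = exp(-A * D0), splitting one large die into N
# chiplets raises the yield of each piece and wastes less good silicon per
# defect, before accounting for assembly yield and interface area overhead.
import math

def poisson_yield(area_mm2: float, d0_per_mm2: float) -> float:
    return math.exp(-area_mm2 * d0_per_mm2)

A, D0 = 600.0, 0.001           # assumed: 600 mm2 of logic, 0.1 defects/cm2
for n in (1, 2, 4, 8):
    y_each = poisson_yield(A / n, D0)
    print(f"{n} chiplet(s) of {A/n:5.0f} mm2: yield/die {y_each:5.1%}, "
          f"silicon scrapped {1 - y_each:5.1%}")
```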
In order to maximize the value out of chiplets, modular and standardized packaging approaches and chip-to-chip connectivity standards, which would allow an easy mix-and-match of chiplets, would be needed (considering that the packaging solutions today are mostly custom). This would improve package-level yields, speed up time to market, reduce design costs, and improve manufacturing volumes. The standardized 2.5-D package architecture mentioned in [124] is one such option, allowing ASIC chips to be combined with two, four, or six HBM cubes on an interposer using specific design protocols. The Universal Chiplet Interconnect Express (UCIe) initiative is another example, aiming to standardize on-package connectivity between chiplets [169], such that technology companies would be able to simply "slot-in lego-like chiplets" into their systems [170]. At a high level, the impact of such standardization efforts would be akin to the way standard cells facilitate the digital design process compared with full custom designs.

It is important to note that, despite the many technical challenges outlined so far, the cost effectiveness and profitability of the semiconductor industry have improved during the past decade [171], and this trend has been recently boosted by the increased remote connectivity needs. Data-intensive computing is set to make the semiconductor industry even more indispensable and profitable. Considering the tremendous resources dedicated to data-intensive computing from both academia and industry, as well as the immediate return on this investment in the form of novel, smarter applications directly affecting people's lives, this will likely be a long-lasting trend.

V. CONCLUSION

Data-intensive computing is set to support an increasingly wider range of human undertakings. Hence, it presents major opportunities for semiconductor technology advances for decades to come. For more than five decades, the semiconductor industry has been extremely successful in integrating discrete devices into chips with exponentially increasing transistor counts. In the coming decades, we will increasingly focus on integrating chip(let)s into systems (Fig. 14). Rather than optimizing a single logic processor, the focus will shift to optimizing an entire system by considering logic, memory, and their connectivity as a whole. Fine-grained heterogeneous 3-D integration, in particular, will be a key technology to improve the area and energy efficiency at the system level by connecting architecture- and technology-specialized logic and memory chiplets through progressively shorter and denser interconnects. The high parallelism and resulting scalability of data-intensive computing systems will enable a steady and predictable increase in system throughput by increasing the number of transistors, memory cells, and die-to-die interconnects. Instead of the node names currently in use, semiconductor technology progress will be characterized by more comprehensive metrics that account for the logic, memory, and interconnect density of a packaged system.

Sustaining or exceeding the data-intensive computing energy efficiency trends of the last 15 years will be the biggest technology challenge. In that respect, pushing 2-D scaling to its limits while bringing up increasingly capable memory and packaging technologies will be indispensable—yet may remain insufficient. Accordingly, these hardware technologies will need to be complemented and co-optimized with circuit design, architecture, software, and algorithm innovations to ensure an improved energy efficiency at every new generation of computing systems. A systematic cost reduction in hardware will remain another big challenge. The solution lies in maximizing the standardization and modularity of the chiplet approach, which would also reduce design costs and broaden adoption.

Acknowledgment

This work has been crystallized over the last few years and benefitted from discussions in different contexts with many of the authors' colleagues: Jin Cai, Min Cao, Carlos Diaz, Cliff Hou, Frank Lee, Frank J. C. Lee, Yujun Li, Mark Liu, L. C. Lu, Linus Lu, Xiaochen Peng, Iuliana Radu, Stefan Rusu, Winston S. Shue, Xiaoyu Sun, Chuei T. Wang, Yih (Eric) Wang, Howard C.-H. Wang, Doug Yu, and Kevin Zhang of Taiwan Semiconductor Manufacturing Company (TSMC), Hsinchu, Taiwan, and San Jose, CA, USA, and Wei-Chen (Harry) Chen, Haitong Li, and Prof. Subhasish Mitra of Stanford University, Stanford, CA. The authors would like to thank the anonymous reviewers for their valuable comments.
REFERENCES
[1] How Much Data Is Created Every Day in 2020. [7] AI Chip Market Will Hit 70B in 2026. [Online]. ASICs for Intel Stratix 10 FPGAs: A case study of
[Online]. Available: https://techjury. Available: https://www.eetimes.com/ai-chip- accelerating deep learning using TensorTile ASIC,”
net/blog/how-much-data-is-created-every- market-will-hit-70b-in-2026/# in Proc. 28th Int. Conf. Field Program. Log. Appl.
day/#gref [8] T. Hwang, “Computational power and the social (FPL), Aug. 2018, pp. 106–1064.
[2] D. Reinsel, J. Gantz, and J. Rydning, The impact of artificial intelligence,” 2018, [14] M. Papermaster, “Delivering the future of
Digitization of the World From Edge to Core. arXiv:1803.08971. high-performance computing,” in Proc. IEEE 26th
Framingham, MA, USA: International Data [9] L. T. Su, S. Naffziger, and M. Papermaster, Int. Conf. High Perform. Comput., Data, Anal.
Corporation, 2018. “Multi-chip technologies to unleash computing (HiPC), Dec. 2019, pp. 1–43.
[3] S. Kaisler, F. Armour, J. A. Espinosa, and performance gains over the next decade,” in IEDM [15] J. L. Hennessy and D. A. Patterson, Computer
W. Money, “Big data: Issues and challenges Tech. Dig., Dec. 2017, p. 1. Architecture: A Quantitative Approach, 6th ed.
moving forward,” in Proc. 46th Hawaii Int. Conf. [10] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, New York, NY, USA: Elsevier, 2018, ch. 7.
Syst. Sci., Jan. 2013, pp. 995–1004. “Efficient processing of deep neural networks: A [16] M. Garland and D. B. Kirk, “Understanding
[4] M. Gokhale, “Hardware technologies for tutorial and survey,” Proc. IEEE, vol. 105, no. 12, throughput-oriented architectures,” Commun.
high-performance data-intensive computing,” pp. 2295–2329, Dec. 2017. ACM, vol. 53, no. 11, pp. 58–66, Nov. 2010.
Computer, vol. 41, no. 4, pp. 60–68, 2008. [11] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, [17] S. Williams, A. Waterman, and D. Patterson,
[5] H. Watson, “Tutorial: Big data analytics: Concepts, “GraphR: Accelerating graph processing using “Roofline: An insightful visual performance model
technologies, and applications,” Commun. Assoc. ReRAM,” in Proc. IEEE Int. Symp. High Perform. for multicore architectures,” Commun. ACM,
Inf. Syst., vol. 34, no. 6, p. 65, 2014. Comput. Archit. (HPCA), Feb. 2018, pp. 531–543. vol. 52, no. 4, pp. 65–76, 2009.