
Received 8 August 2024; revised 25 September 2024, 7 November 2024, and 20 November 2024; accepted 20 November 2024.

Date of publication 26 November 2024; date of current version 25 December 2024.


Digital Object Identifier 10.1109/OJSSCS.2024.3506694

High-Bandwidth Chiplet Interconnects for Advanced Packaging Technologies in AI/ML Applications: Challenges and Solutions

SHENGGAO LI (Senior Member, IEEE), MU-SHAN LIN, WEI-CHIH CHEN, AND CHIEN-CHUN TSAI
(Invited Paper)

Design and Technology Platform, Taiwan Semiconductor Manufacturing Company Ltd., San Jose, CA 95134, USA
CORRESPONDING AUTHOR: S. LI (e-mail: [email protected])

ABSTRACT The demand for chiplet integration using 2.5D and 3D advanced packaging technologies has
surged, driven by the exponential growth in computing performance required by artificial intelligence and
machine learning (AI/ML). This article reviews these advanced packaging technologies and emphasizes
critical design considerations for high-bandwidth chiplet interconnects, which are vital for efficient
integration. We address challenges related to bandwidth density, energy efficiency, electromigration, power
integrity, and signal integrity. To avoid power overhead, the chiplet interconnect architecture is designed to
be as simple as possible, employing a parallel data bus with forwarded clocks. However, achieving high-
yield manufacturing and robust performance still necessitates significant efforts in design and technology
co-optimization. Despite these challenges, the semiconductor industry is poised for continued growth and
innovation, driven by the possibilities unlocked by a robust chiplet ecosystem and novel 3D-IC design
methodologies.

INDEX TERMS 3Dblox, 3D-IC, advanced packaging, artificial intelligence (AI) and compute, chiplet
integration, energy efficiency, interconnects, Universal Chiplets Interconnect Express (UCIe).

I. INTRODUCTION
THE DEMAND for artificial intelligence (AI) and machine learning (ML) technologies is growing at an unprecedented pace, far surpassing the pace predicted by Moore's Law. The amount of compute used for AI training has been growing exponentially at 4.1×/year since 2012, outpacing Moore's Law, which predicts a doubling every 24 months [1], [2], as shown in Fig. 1. The increase in the number of parameters in deep learning models enhances their flexibility and potential performance, driving the rapid growth in model complexity. However, this rate of expansion is becoming economically (training cost), technically (size of the computer clusters), and environmentally (carbon footprint) unsustainable [3], [4]. To partially meet the escalating compute demand, it is essential to focus on advancements in algorithm efficiency and semiconductor scaling, aiming to achieve not only higher compute performance but also energy-efficient compute performance [5], [6]. AI workloads require massive parallel matrix multiplication and accumulation operations, which are performed by clusters of parallel computing cores. These workloads demand extensive memory capacity and high interconnect bandwidth. To accommodate this compute need, a typical xPU/accelerator chip nowadays may consist of many compute, memory, and IO chiplets [7], [8], [9], integrated using advanced packaging technologies. Each chiplet is designed within the lithography stepper's photomask limit, or reticle size, of 26 × 33 mm².
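As an illustrative aside, the scale of this divergence can be checked with a short Python sketch; the compounding below is our own back-of-envelope arithmetic, not a figure from [1], [2]:

```python
# Compare the 4.1x/year AI-training compute trend against Moore's Law
# (2x every 24 months) over a 10-year span. Illustrative arithmetic only.
years = 10
ai_growth = 4.1 ** years                 # ~1.3e6x over 10 years
moore_growth = 2.0 ** (years * 12 / 24)  # 2x per 24 months -> 32x
print(f"AI training compute: {ai_growth:.1e}x")
print(f"Moore's Law:         {moore_growth:.0f}x")
print(f"Gap:                 {ai_growth / moore_growth:.1e}x")
```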
The use of chiplets offers several significant benefits. By breaking down a large monolithic chip into smaller chiplets, designers can customize various process technologies to enhance specific functionalities. For instance, they can employ the most advanced process node for the compute die while utilizing older generation process nodes for analog-centric IO dies and memory dies. This modular approach not only simplifies the manufacturing process but also facilitates rapid system integration, especially when standardized chiplet interfaces are utilized [10], [11]. By leveraging off-the-shelf chiplets, this method is expected to significantly reduce manufacturing costs and design cycle times.


c 2024 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License.
For more information, see https://creativecommons.org/licenses/by/4.0/

FIGURE 1. Trend in the amount of compute used to train the ML models [2].
FIGURE 2. TSMC 3DFabric technology portfolio.

As chiplet-based systems in packages grow in size and complexity, 3D integration [12] and wafer-scale system integration [13], [14], [15] will deliver superior power efficiency, exceptional performance, and enhanced cost-effectiveness [16]. However, several critical issues, which are familiar to designers at each generation of products, continue to pose significant challenges in today's larger and more intricate chiplet systems. These challenges include thermal design power (TDP), power delivery network (PDN) loss, mechanical and thermal stress, network topology and routing algorithms, interconnect throughput, energy efficiency, latency, manufacturability, redundancy and repairability, testability, and many more [16], [17], [18], [19], [20], [21], [22]. Addressing these challenges is essential to ensuring the performance and yield of advanced semiconductor solutions.

This article is structured as follows. Section II provides a summary of advanced packaging technologies. Section III discusses die-to-die interconnects for various packaging technologies in large CPU/GPU scale-up systems. Section IV delves into practical issues for chiplet interconnect design, such as serial versus parallel interfaces, chiplet I/F signaling, channel routing and signal integrity (SI), bump map planning, clock schemes, defect repair, the electrostatic discharge (ESD) roadmap, and power delivery. Section V introduces a comprehensive 3D-IC design flow. Finally, Section VI explores future development trends.

II. ADVANCED PACKAGING TECHNOLOGIES AND NEW CAPABILITIES
Lau [23] provided an excellent review of advanced packaging technologies, categorizing them into 2D, 2.xD (including 2.1D, 2.3D, and 2.5D), and 3D packaging technologies. According to this classification, if chiplets are placed directly on the package substrate, it is considered 2D packaging. When an intermediate layer, such as a thin film, bridge, or passive interposer, is used, it falls under the 2.xD category. Specifically, if the interposer is an active die with through-silicon vias (TSVs), it is classified as 3D packaging.

While this categorization is intuitive, it is also somewhat arbitrary. As packaging technologies continue to evolve, the boundaries between these categories may become increasingly blurred. To simplify discussions, most of the intermediate 2.xD technologies are often grouped under the 2.5D category. It is also possible that 2D, 2.5D, and 3D integration techniques will co-exist within advanced packaging solutions, with 3D-IC being used in a broad sense to refer to these solutions. Regardless of these distinctions, the primary focus will remain on leveraging these technologies to achieve superior performance, efficiency, and functionality in semiconductor devices.

Fig. 2 illustrates TSMC's evolving 3DFabric™ technology portfolio. The 3DFabric, as an example of advanced packaging technologies with wide adoption, is a comprehensive set of integration technologies that bring multiple chips together with closer physical proximity and higher interconnect density, all from a single vendor. This integration enables a smaller form factor, better electrical performance, and greatly enhanced data bandwidth. More importantly, these technologies allow system designers to partition a previously monolithic SoC into chiplets and build more powerful systems in a package [5]. The different 3DFabric packaging options maintain consistency. This cohesion is beneficial because the complexity of 3D-IC requires that the design rules pertaining to manufacturability are compatible and consistently verified prior to high-volume manufacturing.

Two distinct packaging platforms emerged from distinctive applications. The first one is the chip-on-wafer-on-substrate (CoWoS™) platform [24], [25], which has been in production since 2012, primarily for high-performance computing. It has three subfamilies. The CoWoS-S has a silicon interposer which allows very dense metal wires (W/S = 0.4/0.4 µm). The CoWoS-R [26], [27] has redistribution layers (RDLs) embedded in an organic interposer, with coarser wiring density (W/S = 2/2 µm). The CoWoS-L [28] combines the best of the -R and -S: local silicon interconnect (LSI) for high wiring density, and RDL in the organic substrate for better electrical performance. The -S or -L option also comes with embedded deep-trench decoupling capacitors (DTCs) [29] in the silicon interposer or bridge for enhanced power delivery.



The second one is the integrated-fanout (InFO™) platform [30]. InFO has been in mass production since 2016, initially driven by cost-effective mobile applications. InFO package on package (InFO-PoP) [30] is the first 3D fan-out wafer-level packaging to integrate an SoC with memory packages using fine-pitch copper RDLs. Due to its cost, form factor, and better signal integrity, InFO technology has evolved into many variants, significantly extending to allow integration of more functional chips for HPC applications [15]. The InFO platform also features advanced options, such as a local silicon bridge for finer-pitch metal routing and embedded decoupling capacitors for superior power delivery. InFO is a chip-first approach, where chips are placed face-down onto a temporary carrier and RDL is built up around them. CoWoS, on the other hand, is a chip-last approach, with chips first fabricated and then placed onto a silicon interposer, which is later attached to a substrate. This distinction in manufacturing steps affects integration density and thermal management. Specifically, in a chip-first approach, the silicon will undergo the thermal cycles of subsequent process steps. The cost of late-step defects is also significantly higher than in a chip-last approach.

3D stacking has been broadly used in memory products, including high-bandwidth memory (HBM) [31] and NAND flash memory [32], and is seeing adoption by chipmakers for increased compute density and data bandwidth [33], [34]. System on integrated chips (SoIC™) is for such 3D chip stacking [35]. It includes the SoIC-P with micro-bumps (pitch at 18–25 µm) and SoIC-X with advanced bonds (pitch at 3–9 µm or below) [36], [37], [38]. SoIC enables seamless integration of multiple chips in a vertically stacked configuration, unlocking new possibilities for system design and performance optimization. Furthermore, SoIC can be combined with CoWoS or InFO to form more powerful and flexible computer systems.

Chip manufacturers and outsource semiconductor assembly and test (OSAT) providers offer a range of advanced packaging technologies [23], [38], [39], each with unique (dis)advantages and tradeoffs in SI, interconnect density, manufacturability, and thermal management. For example, the embedded multidie interconnect bridge (EMIB™) [39] by Intel and the elevated fan-out bridge (EFB™) [40] by AMD both feature high-density passive bridges without TSVs, complemented by additional RDLs to enhance power integrity (PI). The selection of a particular packaging technology hinges on the specific application requirements and desired performance characteristics, especially in high-performance computing where speed and energy efficiency are crucial. This also imposes constraints and challenges on interconnect design, which will be explored in subsequent sections.

III. DIE-TO-DIE INTERCONNECT APPLICATIONS
FIGURE 3. Bump pitch scaling perspective (XSR: extreme short reach; UCIe: Universal Chiplets Interconnect Express).

Fig. 3 shows the chip package evolution from the perspective of bump pitch scaling, starting from the conventional 2D standard package type, or multichip modules (MCMs), with 110–130-µm bump pitch, to the 2.5D advanced package type (e.g., CoWoS/InFO) with ∼40-µm pitch, to the 3D chip-on-wafer or wafer-on-wafer type (e.g., SoIC) with <9-µm pitch [11]. As the bump pitch decreases, the number of die-to-die signals increases quadratically within a given area, thereby increasing bandwidth density. The selection of circuit architecture in the context of pitch scaling depends heavily on factors such as achievable reach, bandwidth, energy efficiency, and latency [41]. For example, high-speed serializers/deserializers (SerDes) operating at ∼56/112 Gb/s [42] are commonly used in MCM packages to maximize data rates per pin. In contrast, 2.5D interposers often employ high-speed parallel data buses due to their superior energy and area efficiency [41]. Meanwhile, advanced 3D stacking technologies benefit most from simpler, lower-speed data buses that utilize minimal CMOS buffers and flip-flops, with no equalizer or calibration circuits, thereby achieving the best area bandwidth density and energy efficiency [11].
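The quadratic relationship can be made concrete with a simple bump-count sketch (our own illustration; real signal counts are lower once power, ground, and clock bumps are allocated):

```python
# Count bump sites on a square grid inside a fixed 1 mm x 1 mm region:
# halving the pitch quadruples the available die-to-die connections.
for pitch_um in (130, 40, 9):            # MCM, 2.5D, and 3D bump pitches
    bumps = (1000 // pitch_um) ** 2      # grid sites per mm^2
    print(f"{pitch_um:>3}-um pitch -> ~{bumps:>5} bumps/mm^2")
```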
FIGURE 4. Die-to-die interconnect applications.

Fig. 4 depicts an example of multiple chiplets for compute performance scale-up and scaling out for AI applications. The die-to-die interconnects between the chiplets can be categorized into four types: 1) Compute to Compute and Compute to IO: Universal Chiplets Interconnect Express (UCIe™) PHY on CoWoS/InFO technology; 2) Compute to Memory: HBM™ PHY on CoWoS technology; 3) Compute to SRAM: 3D stacking with SoIC technology; and 4) IO Chiplet to External IO: XSR SerDes on standard package technology.




FIGURE 5. Chiplet interconnects design considerations.
FIGURE 6. Die-to-die interconnect applications. (a) Serial link. (b) Parallel bus—2.5D. (c) Parallel bus—3D.

Today's most widely used AI accelerators adopt this type of topology to maximize compute performance and memory access bandwidth [8], [43]. Competing technologies, such as wafer-scale systems [13], [14], offer a glimpse into likely candidates for future computing systems. The interconnects and network topologies for these systems will need to evolve accordingly to meet system performance needs.


IV. CHIPLET INTERCONNECT DESIGN CONSIDERATIONS
A. CHIPLET INTERCONNECT DESIGN OBJECTIVES AND DTCO
Breaking down a formerly monolithic SoC into multiple chiplets stitched together by high-bandwidth chiplet interconnects enables a more flexible system partition, with the benefits of improved yield and fast turnaround time using off-the-shelf chiplets. Standardization of the chiplet interface is an important milestone, as exemplified by UCIe™ [11]. Before this development, several industry-initiated chiplet interfaces were adopted to address the requirements of chiplet systems, emphasizing high bandwidth density, low latency, and high energy efficiency. Notable examples include the advanced interconnect bus (AIB™) [44], bunch of wires (BoW™) [45], open high-bandwidth interface (OpenHBI™) [46], and Lipincon™ (TSMC proprietary) [47].
Fig. 5 provides a comprehensive overview of the multifaceted design and technology co-optimization (DTCO) efforts aimed at meeting both performance and manufacturing objectives for high-speed interconnects in 2.5D or 3D chiplet-based systems [48]. The scope of DTCO encompasses a wide range of considerations, including but not limited to the following.
1) Device-Level Optimization: The focus is on pushing transistor bandwidth and noise performance to improve IO energy efficiency.
2) Package Optimization: Optimizing the package design rules on the interposer by balancing key parameters, such as line spacing, layer thickness, and via enclosure, is crucial to PI, SI, routability, and manufacturability.
3) ESD: New challenges have emerged in ESD protection and ESD modeling for chiplet systems [49]. The ESD rating for advanced packages must be carefully evaluated to ensure that the ESD area and capacitance overhead do not hinder IO energy efficiency.
4) PDN: This entails managing electromigration (EM) and IR drop, voltage droop, and cross-talk originating from power delivery.
5) Thermal Management: Key challenges include simulating hot spots accurately and mitigating thermal-cycle-induced problems, such as timing drift, mechanical stress, and EM. It involves implementing solutions, during the design stage [50] or at run-time [51], to keep devices within safe temperature ranges, thereby maintaining performance, reliability, and longevity.
6) Designing for Testability, Repairability, and Reliability: Ensuring these aspects contributes to both effective short-term testing and a long-term lifespan, which is crucial for a product's success.
7) Design Sign-Off Processes: Efficient, AI-assisted EDA tools and flows are increasingly essential for productivity and optimization [52].

B. SERIAL VERSUS PARALLEL DATA BUS
With standard packages (MCM, or 2D), the pitch of signal bumps and metal wires is coarse. One is forced to maximize per-pin data bandwidth density, using a serial link (e.g., PCIe 32/64 Gb/s and CEI 112/224 Gb/s) with differential signaling, as shown in Fig. 6(a).
Advanced packaging technologies (2.5D) allow one to use a lower data rate per signal pin with a greater number of parallel single-ended signals per unit geometry to maximize beachfront bandwidth density or area bandwidth density (e.g., UCIe ×64 at 4–32 Gb/s) [11]. A parallel interface [Fig. 6(b)] stands out in several aspects. First, a parallel interface is accompanied by a forwarded clock for jitter and skew tracking, eliminating the need for a per-lane clock data recovery (CDR) mechanism, resulting in a less complex system and lower latency.



Second, the lower data rate operation of the parallel interface means the system suffers less from channel loss, jitter, and cross-talk. Less channel equalization (EQ) is needed, eliminating circuit overheads and achieving higher bandwidth density and better energy efficiency.

FIGURE 7. Die-to-die interconnect signaling: (a) SST driver; (b) low-VDDQ NRZ driver; (c) AC-coupled [53]; and (d) simultaneously bidirectional [54].
For 3D stacking, with a high signal density (pitch P ≤ 9 µm), the 3D interconnect circuit area should be smaller than the bump area (P²) to maximize the interconnect efficiency (bandwidth density × energy efficiency). In this case, the speed of the parallel data bus is constrained to 5 Gb/s to ease the timing [11]. No calibration or adaptation is needed, effectively reducing power, latency, and area overhead. UCIe-3D™ has this spirit [Fig. 6(c)].
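A minimal sketch of this area budget follows (our own numbers; it assumes every bump carries data, so the result is an upper bound rather than a reported figure — published designs such as the 17.9-Tb/s/mm² link in [60] land below this bound because bumps are shared with power, ground, and clocks):

```python
# Bump-limited 3D interconnect: the per-lane circuit area must fit within
# the bump footprint P^2, which caps the achievable area bandwidth density.
pitch_um = 9.0                       # advanced-bond pitch P
rate_gbps = 5.0                      # low-speed parallel bus lane rate
budget_um2 = pitch_um ** 2           # per-lane area budget = P^2 = 81 um^2
density_tbps_mm2 = rate_gbps / budget_um2 * 1e6 / 1e3
print(f"Per-lane area budget: {budget_um2:.0f} um^2")
print(f"Upper-bound area bandwidth density: {density_tbps_mm2:.0f} Tb/s/mm^2")
```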

C. DIE-TO-DIE INTERCONNECT SIGNALING
Advanced packaging technologies enable closer die proximity and reduced interconnect loading, improving SI, data rates, and power efficiency. Nonreturn-to-zero (NRZ) and 4-level pulse amplitude modulation (PAM4) signaling are likely candidates for different operating speeds. In Fig. 7, a source-series terminated (SST) driver at core supply (e.g., Vdd = 0.75 V) is commonly used for optimal eye margin and impedance matching. An NFET-NFET driver has been adopted to operate at a low VDDQ (e.g., <0.3 V) to cut down power consumption [47]. However, this extra power domain may be undesirable with scarce routing resources. PAM4 is advantageous when there is a significant insertion loss benefit at the PAM4 Nyquist frequency compared to the NRZ Nyquist frequency, yet it consumes DC current at the middle levels, making it less suitable for low-loss advanced packaging channels. Another alternative low-power driver option is AC-coupling [53], which reduces driver strength and signal swing to reduce power. Simultaneous bidirectional (SBD) data transmission can also double the data bandwidth for a given beachfront [54], [55].

D. CHANNEL ROUTABILITY AND INTEGRITY ANALYSIS
FIGURE 8. Channel routability and SI optimization.
FIGURE 9. Channel optimization.

For high wiring density (e.g., at a minimum spacing of 0.4 µm), proper intersignal shielding is necessary to allow adequate crosstalk isolation and better SI.
As shown in Fig. 8, channel optimization inside a foundry involves many metrics, such as dielectric thickness, metal pitch, metal thickness, available metal layers, via enclosure, stacking rules, etc. DTCO is exercised for the interposers of each advanced technology, which often involves pushing design rules to maintain a good tradeoff between manufacturability, routability, and SI (including insertion loss and crosstalk, as shown by the plots).


Fig. 9 demonstrates two examples of UCIe D2D routing designs in two different representative packages with different shielding styles. The InFO (silicon bridge) option has an LSI with 2-µm-thick metal, and the InFO (organic substrate) option has RDL with 2.3-µm-thick metal. Both have four layers of metal for signal routing and one additional layer for the power mesh. The former has a tighter metal width/spacing granularity. With an 8-µm signal pitch in both cases, the former can afford a much wider metal shield and slightly larger signal-to-signal spacing. As such, the former is able to operate up to 32 Gb/s for the ×64 UCIe form factor, whereas the latter is only capable of 16 Gb/s with ×32 data lanes due to the more severe crosstalk.
FIGURE 10. Universal 3D bump map form factor.

E. 2.5D AND 3D FORM FACTORS
A certain form factor of an interconnect module, which encompasses module geometry, signal order, bump pitch, multimodule stacking, etc., is crucial for ensuring integration compatibility among different chiplet vendors. While this standardization introduces rigidness into the chiplet ecosystem, it simplifies intellectual property (IP) development: only a limited number of variants of the IP need to be supported. However, it is important to note that a given form factor may not always be optimal in terms of area, power, and cost. Take UCIe as an example: initially, a ×64 (64 Tx + 64 Rx) form factor was released, followed by a ×32 (32 Tx + 32 Rx) form factor for low-cost advanced packages with a smaller number of RDL layers. The initial 10-column module was targeting a 45-µm bump pitch. To further enhance area efficiency, the consortium later introduced a 16-column module for smaller bump pitches (<38 µm) and an 8-column module for larger bump pitches (>50 µm) [11]. These successive adaptations balance cost and performance to accommodate the varying requirements of different applications.

The current UCIe protocol supports symmetrical bidirectional data transmit and receive, which is typical for data communication between homogeneous xPU chiplets.

In contrast, the HBM interface, a crucial component of the chiplet ecosystem, exhibits asymmetrical memory access (read/write) bandwidth. To extend the interface bandwidth without causing severe SI issues, the forthcoming HBM4 doubles the number of bidirectional data IOs from 1024 to 2048 [31]. Scaling HBM for increased bandwidth is often constrained by routing congestion and SI issues. By transitioning the base die logic to advanced process nodes, we can shorten interconnect routes, enhancing SI and speed. Alternatively, utilizing UCIe-like SerDes IOs for the HBM interface can achieve higher lane rates with fewer signal routes, maintaining the same bandwidth density while improving SI.

One more notable chiplet application is the interface between data converters and logic processors. JESD204D is the latest standard defining high-speed serial interfaces for data converters [56]. It includes the data receive interface for analog-to-digital converters (ADCs) and the data transmit interface for digital-to-analog converters (DACs). The standards are applicable to PCB-level or MCM chiplet integration. However, a chiplet standard for data converters in advanced packages has yet to be established.

While it is conceivable to develop a one-size-fits-all chiplet standard that addresses the three unique types of systems—homogeneous bidirectional core-to-core interfaces, asymmetrical memory access interfaces, and unidirectional data converter interfaces—each system would still require different form factors to achieve optimal performance and efficiency.

3D stacking is a natural choice for achieving greater energy efficiency, primarily because the short interdie routing significantly reduces the energy required for interdie data movement. A 3D interconnect cluster is essential for forming a hard IP block with inherent timing robustness, as illustrated in Fig. 6(c). This built-in timing robustness allows for modular timing sign-off, ensuring that the timing validation of each die in a 3D stack can be conducted independently and in a self-contained manner.

FIGURE 10. Universal 3D bump map form factor.

In Fig. 10, we propose a 3D cluster structure with an AB|BA pattern, where pattern A represents the transmitter (TX) and pattern B represents the receiver (RX), or vice versa. The square-shaped A/B pattern can be configured into various sizes, such as 4 × 4, 8 × 8, or 20 × 20, depending on system requirements. The RX and TX clocks, positioned at the center of their respective regions, achieve optimal balance for each I/O pin and across the dies. Power and ground are symmetrically distributed within the IP cluster. This configuration offers the advantage of designing a single IP block with a specific poly gate orientation that can accommodate any chiplet orientation, assuming that logic-level pin remapping can be readily achieved at the chiplet level.

This structure facilitates easy SoC-level scalability, enabling various chiplet-to-chiplet stacking scenarios through IP instantiation across the SoC. We propose four options for SoC-level scalability in face-to-back (F2B) and face-to-face (F2F) connectivity: mirrored or stepped in the X-direction and mirrored or stepped in the Y-direction. Fig. 11 illustrates two integration examples.
1) Case 1: "X-mirrored/Y-mirrored/mirrored between D2D"—this configuration supports all F2F and F2B die-to-die stacking scenarios.
2) Case 2: "X-stepped/Y-stepped/not-mirrored between D2D"—this setup features the same bump map across dies. It supports F2F stacking but requires a 90° rotation for F2B stacking.



FIGURE 11. SoC-level scalability to support arbitrary 3D chiplet stacking (F2F/F2B or rotation).
FIGURE 12. Edge-aligned versus delay-matched structure. (a) Edge-aligned. (b) Delay-matched.

These flexible integration methods ensure that the IP cluster can be effectively utilized across a variety of chiplet stacking configurations, promoting scalability and efficiency in SoC design.

F. LANE DE-SKEW AND CLOCK ALIGNMENT
On top of the parallel data bus and forwarded clock topology, there is a need to align the data lanes and clock lanes such that lane-to-lane skew is minimized. Lane-to-lane matching is achieved by anti-mirror physical symmetry between the Tx and Rx in the bump map planning. However, the physical symmetry does not hold when one is to interface two different form factors. For example, for an 8-column UCIe to interface with a 10-column UCIe, the channels are intrinsically unmatched. Besides, random circuit mismatch and on-die/on-package wire mismatch add additional skew. One needs to allocate enough skew tuning range on a per-lane basis at the leaf clock tree to achieve per-lane skew calibration at the transmitter and/or receiver. The data sampling clock at the receiver is further tuned to the center of the Rx data eye for the best left and right eye margin.

Two clock topologies for forwarded clock generation are illustrated in Fig. 12. The edge-aligned topology [Fig. 12(a)] has data transitions and clock transitions aligned; a local delay-locked loop (DLL) in the Rx is adopted to generate a 90° phase-shifted clock to sample the Rx data eye. The edge-aligned topology aims for less circuitry and better energy efficiency, but it is sensitive to mismatch induced by temperature or voltage drift, making it only suitable for applications with lower data rates (e.g., below 20 Gb/s). The delay-matched topology [Fig. 12(b)] generates I/Q clocks (using a DLL or phase-locked loop (PLL), and phase interpolators) at the Tx side, with the I-clock going to the data path while the Q-clock is forwarded to the Rx. The clock and data paths are structurally matched to maintain good jitter tracking and delay tracking.

FIGURE 13. Data synchronization with or without FIFO. (a) Clock domain crossing with FIFO. (b) Shared clock domain.

In most cases, the transmit die and the receive die are under independent PLLs and clock domains. To enable robust clock domain crossing between two PLL domains, first-in–first-out (FIFO) data buffers are typically required, which incurs extra power and latency [Fig. 13(a)]. For interfaces like core-to-memory connections, enforcing a single clock domain between two stacked dies is feasible. In Fig. 13(b), we propose an alternative scheme to enable a single clock domain between two dies, where the primary clock from PLL1 is forwarded from the primary die to a secondary die


and is returned to the primary die. This allows the 3D die-to-die interface to transmit/receive data without FIFOs. The same timing margin as in Fig. 13(a) at the boundary of the first capture DFF can be retained. The timing margin for the data recapture after the Rx data flip-flop (DFF) in the primary die is slightly affected by the delay of the two forwarding clock paths, which is manageable.
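To give a rough feel for the margins involved, the unit interval shrinks quickly with lane rate; the following sketch (our own illustration, and the UI/4 centering budget is a hypothetical split, not a number from this article) relates rate to the skew that per-lane de-skew circuits must absorb:

```python
# UI vs. lane rate: de-skew range and eye-centering steps are budgeted as
# fractions of one UI. The UI/4 column is a hypothetical centering budget.
for rate_gbps in (8, 16, 32):
    ui_ps = 1e3 / rate_gbps          # one unit interval in picoseconds
    print(f"{rate_gbps:>2} Gb/s -> UI = {ui_ps:6.2f} ps, UI/4 = {ui_ps/4:5.2f} ps")
```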
FIGURE 14. Redundancy and repair.

G. REDUNDANCY AND REPAIRABILITY
Redundancy and repairability are extensively researched topics in the field of microprocessors. Shivakumar et al. [57] identified three distinct redundancy strategies.
1) Component-Level Redundancy: This involves having multiple parallel functional units, such as numerous CPU cores. In this arrangement, the failure of one or more cores does not compromise the overall functionality of the system.
2) Array Redundancy: This type of redundancy adds spare structures that can replace defective ones. A common application of array redundancy is in cache memory, where spare elements substitute for faulty ones to maintain performance.
3) Dynamic Queue Redundancy: This approach entails the ability to mark and disable defective elements dynamically, thereby preventing their use and maintaining the integrity of the system.
By leveraging these redundancy strategies, processors can achieve higher reliability and easier repairability, ensuring robust performance even in the presence of faults.
Since the dies are connected die-to-die through dense micro bumps or advanced bonds, defect detection and repair are essential to guarantee chip yield after packaging. All three strategies above can be applicable to chiplet interconnects. Fig. 14 is an example where a "Shift and Switch Repair" concept [21] is used to fix three failed lanes with just one-over-ten redundancy in hardware overhead. A probability calculation based on the binomial distribution [57] shows that this 30 + 3 joint repair method can achieve a 1000× lower failure rate than three separate 10 + 1 groups.
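A sketch of that binomial calculation follows; the per-lane defect probability below is our own assumption (the article states only the ~1000× conclusion), and the improvement factor varies with the assumed defect rate:

```python
from math import comb

def p_group_fail(n, spares, p):
    # A group fails when more lanes are defective than there are spares.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(spares + 1, n + 1))

p = 1e-3                                            # assumed per-lane defect rate
three_groups = 1 - (1 - p_group_fail(11, 1, p))**3  # three separate 10+1 groups
joint_group = p_group_fail(33, 3, p)                # one 30+3 joint group
print(f"three 10+1 groups: {three_groups:.2e}")
print(f"one 30+3 group:    {joint_group:.2e}")
print(f"improvement:       {three_groups / joint_group:,.0f}x")
```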
For mission-critical applications, such as automotive, where AI and ML are taking shape, the stakes of a processor failure are high; a dynamic reliability management technique, where a processor can respond to changing application behavior to maintain its lifetime reliability target, is beneficial [58].
Balancing repairability and SI involves making strategic tradeoffs. For instance, separating power and ground bumps is advantageous in preventing permanent short-circuit failures [22]. However, this approach may lead to increased area overhead or a compromise in SI.

FIGURE 15. ESD roadmap.

H. ESD MIGRATION
As the industry pushes for higher bandwidth, it is crucial for ESD structures to scale accordingly to prevent the large size and high capacitance of ESD diodes from becoming scaling bottlenecks. Failure to address this issue will limit IO energy efficiency. We need to establish an aggressive ESD roadmap for IOs incorporating micro bumps and advanced bonds. Fig. 15 highlights the trend for ESD capacitance and area scaling [6], also showing a reduction in the charge device model (CDM) voltage to be supported by the industry.

I. POWER DELIVERY
Take the UCIe advanced-package 10-column form factor as an example: the current density can reach over 4.1 A/mm² based on a ×64-lane module size of 388.8 µm by approximately 1000 µm, under 32-Gb/s operation and 0.6-pJ/bit energy efficiency at 0.75 V. With such a high current density, we have observed severe electromigration (EM) reliability issues on power/ground bumps, where the current was found to be three times higher than the EM limit allowed by design rules. This issue was mitigated by changing the bump material, but we also had to add more power/ground bumps and update the UCIe bump map to boost reliability and performance.
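The quoted density can be reproduced from the stated operating point; a minimal sketch (one direction of the ×64 module, with the area taken as 388.8 µm × 1000 µm):

```python
# UCIe 10-column advanced-package module: 64 lanes at 32 Gb/s, 0.6 pJ/bit,
# 0.75-V supply, module area ~0.389 mm^2.
lanes, rate_gbps, e_pj_per_bit, vdd = 64, 32, 0.6, 0.75
area_mm2 = 0.3888 * 1.0                                    # 388.8 um x ~1000 um
power_w = lanes * rate_gbps * 1e9 * e_pj_per_bit * 1e-12   # ~1.23 W
current_a = power_w / vdd                                  # ~1.64 A
print(f"Current density: {current_a / area_mm2:.1f} A/mm^2")  # ~4.2 A/mm^2
```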



Additionally, the UCIe spec supports a clock-gating mode. Going from idle to mission mode introduces a worst-case dynamic current (di/dt), leading to significant voltage droop. This results in a higher bit error rate because of reduced timing and voltage margins. The most effective di/dt reduction approach is to rely on on-die or on-package decoupling capacitors to suppress the noise ripple. The decoupling capacitor strategy involves, from top to bottom [see Fig. 16(a)], utilizing: 1) on-package discrete decoupling capacitors (OPD), typically in the µF range; 2) in-package decoupling capacitors, such as an embedded deep trench capacitor (eDTC) on a Si interposer with greater than 1000-nF/mm² capacitance density; and 3) on-die decoupling capacitors, which include super high-density MIM capacitors (SHDMIM) with a capacitance density of roughly 50 nF/mm², and device capacitors with a capacitance density of around 10 nF/mm² [29], [48]. Capacitors located on or near the top dies exhibit lower series resistance but also have a lower capacitance density. As the distance from the top die increases, the series resistance also increases. Therefore, when determining the optimal decoupling capacitor strategy, one must consider various factors, including technology, cost, area, and noise specifications.

FIGURE 16. Decoupling capacitor strategy for the PDN (PDN: power delivery network; OPD: on-package decoupling capacitor; SHDMIM: super high-density metal–insulator–metal decoupling capacitor; eDTC: embedded deep trench capacitor).

Fig. 16(b) shows an example of power impedance optimization and the voltage ripple analysis results [6], [48]. Different capacitors are utilized to suppress the power impedance in their respective frequency ranges. The OPD serves to enhance the power impedance in the 1–100 MHz range. The on-die SHDMIM suppresses the high-frequency part beyond 200 MHz. And the additional in-package eDTC can further suppress the impedance down to an even lower frequency range, around 40 MHz. With the eDTC, the voltage ripple was suppressed from 102.4 to 32.07 mVpp, near the targeted 30-mVpp specification.
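These ripple and current numbers imply a target impedance for the PDN; a simple sketch (our own illustration, reusing the module current estimated above — the real droop analysis is frequency-dependent, as Fig. 16(b) shows):

```python
# Target impedance = allowed ripple / worst-case current step. Illustrative
# only: assumes the full module current switches as one step.
ripple_v = 0.030          # ~30-mVpp ripple specification
di_a = 1.64               # assumed worst-case step: full module current
print(f"Target PDN impedance: {ripple_v / di_a * 1e3:.0f} mOhm")  # ~18 mOhm
```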
Finally, if a system exceeds its voltage droop tolerance, a comprehensive system-level strategy must be implemented to meet the required low bit error rates. Potential solutions include the following.
1) Reducing di/dt through lane staggering, which involves transitioning lanes in and out of the idle state one at a time. While this method can mitigate voltage droop, it has the drawback of increased link latency.
2) Reducing di/dt by increasing the background current during clock-gating periods. This can be achieved by keeping some or all idle lanes active. Although effective, this approach results in higher power consumption [11].
3) Reducing di/dt by lowering the operational data rate, which, while helpful in managing voltage droop, would lead to a degradation in system performance.

V. COMPREHENSIVE 3D-IC DESIGN FLOW
As illustrated in Fig. 17(a), advanced packaging architecture encompasses a diverse array of package options. These options include varying the number of dies at each level and incorporating various passive devices, such as deep trench capacitors (DTCs) and integrated passive devices (IPDs). The architecture also supports different types of horizontal connections, including silicon interposers and organic interposers, as well as various vertical connections like TSVs, through-interposer vias (TIVs), and through-mold vias (TMVs). Additionally, it offers multiple interface types, including advanced bonds, micro bumps, and C4 bumps, along with different stacking orientations, such as face-down, face-up, face-to-face, and face-to-back.

The diverse range of packaging technology offerings within a single vendor or across multiple vendors, coupled with the numerous possible combinations, significantly complicates the design process. Additionally, different EDA tools are required for various physical integration and verification tasks, involving multiple IP and tool vendors. Current EDA tools, workflows, and methodologies have been evolving significantly to meet the demands of complex 3D integrations.

To address the challenges in 3D-IC design, the 3Dblox™ open standard [59] has been established and has gained wide industry acceptance. As depicted in Fig. 17(a), 3Dblox introduces a modular approach, where each physical component within a 3D package is categorized and abstracted into specific modules. Designing a 3D system involves instantiating these modules to create interconnected objects using a high-level programming language, organized hierarchically, similar to traditional SoCs.

See Fig. 17(b) for the key features of 3Dblox. To streamline the design process, we integrated assertions directly into the language, enabling a top-down, correct-by-construction design methodology. The hierarchical instantiation feature enhances the reuse of chiplets, promoting design efficiency. With major EDA vendors and semiconductor manufacturers adopting 3Dblox, chiplet integration has become more seamless and significantly more efficient, thanks to improved interoperability. This integration will further accelerate the development and maturity of the 3D-IC ecosystem.

VI. FUTURE DEVELOPMENT TREND
A. DESIGN MODULARIZATION
Six UCIe form factors [11] have been defined for advanced packages supporting data rates from 4 to 32 Gb/s. Fig. 18(a) shows one example of the form factors. Given the various bump pitches, column numbers, data rates, and technology nodes, the development of IP becomes a time-consuming and resource-intensive process. To mitigate this challenge, a modularization concept and a compiler-compatible scheme, as illustrated in Fig. 18(b), have been implemented.

In this approach, the die-to-die interconnect is segmented into repeatable blocks, such as IO lanes, and commonly shared blocks, including the DLL, PLL, digitally controlled


delay line (DCDL), and calibration circuits. Specific floorplan elements, such as clock trees, can be customized and compiled to meet different target specifications.

FIGURE 17. (a) Abundant choices of 3D-IC architectures. (b) 3Dblox unification infrastructure.
FIGURE 18. (a) UCIe 2.0 bump map example. (b) Modular design for the chiplet I/F.

B. BANDWIDTH AND ENERGY EFFICIENCY SCALING
Bandwidth density and energy efficiency continue to be the focus for next-generation chiplet interconnects.
The package bump pitch and technology nodes significantly impact bandwidth density. Fig. 19 illustrates the area bandwidth density trend based on our first-order estimate using realistic process and package technology scaling factors. To enhance bandwidth density, one can increase the link



FIGURE 19. Technology and bandwidth scaling (Note: P30/C16 refers to a 30-µm bump pitch and the UCIe 16-column form factor).

data rate and/or decrease the interconnect bump pitch. However, higher data rates require stronger circuit driving strength and calibration, leading to larger circuit areas. Consequently, the bump pitch may need to be adjusted. For instance, with N7 technology, a 45-µm bump pitch (P45) supports 16 Gb/s, while 55 µm (P55) and 65 µm (P65) are needed for 24 and 32 Gb/s, respectively, resulting in a decline in area bandwidth density beyond 16 Gb/s. In contrast, N4/N5 (4-nm/5-nm) technology supports increased bandwidth density with data rates up to 24 Gb/s. N3 allows a further bandwidth increase. DTCO will likely modify the trend line slightly, but in general, more advanced technologies like N3 (3 nm) offer the benefit of achieving higher area/shoreline bandwidth density and energy efficiency [60].

FIGURE 20. Scaling for better energy efficiency [61].

Taking a different perspective on shoreline bandwidth density: the above study was based on UCIe bump map constraints, resulting in a higher data rate correlating with higher shoreline bandwidth density. This contrasts with the evaluation in [61], which uses pitch scaling in both the x and y directions. As the bump pitch scales down with the data rate while maintaining a bump-limited situation, the shoreline bandwidth density remains constant. In this context, lower data rates are expected to enhance energy efficiency due to reduced circuit complexity. Conversely, technology scaling can support more complex designs and increase the data rate for a given bump pitch, leading to enhancements in shoreline bandwidth (e.g., from 1.5 to 2 Tb/s/mm), as illustrated in Fig. 20.
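A simplified version of that bump-limited argument follows (our own depth and pitch/rate pairs, chosen so that the lane rate scales with pitch squared; the absolute numbers are illustrative and are not taken from [61]):

```python
# With a fixed module depth, bump-limited lanes per mm of shoreline grow as
# 1/pitch^2; scaling the per-lane rate down with pitch^2 keeps the shoreline
# bandwidth density roughly constant.
depth_mm = 0.5                                   # assumed beachfront depth
for pitch_um, rate_gbps in ((45, 32), (32, 16), (22, 8)):
    lanes_per_mm = depth_mm * 1e6 / pitch_um**2  # all bumps counted as signals
    print(f"{pitch_um}-um pitch @ {rate_gbps} Gb/s -> "
          f"{lanes_per_mm * rate_gbps / 1e3:.1f} Tb/s/mm")
```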
FIGURE 21. System-on-wafer scale-up (Source: TSMC).

C. BIGGER SYSTEMS
Due to reticle size limitations, the recent trend in AI/ML development is scaling up at the wafer level (Fig. 21) [13], [14], [15]. By combining the solutions provided by 3DFabric (or equivalent), we can effectively utilize SoIC for integrating SRAM+CPU and HBM+GPU, LSI for integrating CPU+GPU (high-density/near reach), LSI for integrating the xPU to the I/O die, passive LSI for eDTC (for on-package decoupling for supply noise mitigation), and RDL for power delivery and longer-reach data transfer at large-scale integration. This wafer-level packaging alleviates the constraints imposed by the reticle size limit, while necessitating network-on-wafer [13], [14] and heterogeneous (serial and parallel) [18] or hybrid (optical and electrical) links [62] for efficient xPU-to-xPU interconnections in the near future. Beyond wafer-level packaging, fan-out panel-level packaging


(FOPLP) [63], [64] is also on the horizon, promising higher packaging throughput, reduced costs, and potentially larger integrated systems at the panel level, where warpage control remains a significant challenge throughout the entire packaging process [65], [66].

In the meantime, the hunger for higher interconnect data bandwidth density continues; for instance, the UCIe Consortium is working on a 48/64-Gb/s proposal for interdie interconnect. For system scaling up and scaling out, on-package optical waveguides [67] and co-packaged optical engines [68] remain appealing to the industry.

Bigger systems necessitate vertical power delivery with integrated magnetic components for efficient voltage regulation [69], [70]. The larger-scale integration of CPU, GPU, HBM, SerDes, optical engines, and voltage regulators is a significant undertaking, surpassing some of the existing engineering feats [13], [14], [15]. Achieving this requires a collaborative effort across various industry partners to manage different aspects of the technology stacks to achieve high performance while ensuring exceptional power efficiency, SI, thermal management, and structural robustness.

As the chiplet ecosystem becomes more robust and 3D-IC design methodologies advance, new possibilities and greater innovations will emerge.

ACKNOWLEDGMENT
The authors would like to express their gratitude for the insightful and regular discussions on 3D integration with King-Ho Tam, Homer Liu, S. J. Yang, Jim Chang, T. C. Huang, Sandeep Goel, Cheng-Hsiang Hsieh, Frank Lee, Carlos Diaz, Stefan Rusu, and L. C. Lu.

REFERENCES
[1] D. Amodei and D. Hernandez. "AI and compute." OpenAI. 2018. [Online]. Available: https://openai.com/index/ai-and-compute/
[2] J. Sevilla, L. Heim, A. Ho, T. Besiroglu, M. Hobbhahn, and P. Villalobos, "Compute trends across three eras of machine learning," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), 2022, pp. 1–8.
[3] A. J. Lohn and M. Musser, AI and Compute: How Much Longer Can Computing Power Drive Artificial Intelligence Progress? CSET Issue Brief, Center Secur. Emerg. Technol., Washington, DC, USA, 2022.
[4] N. C. Thompson, K. Greenewald, K. Lee, and G. F. Manso, "The computational limits of deep learning," 2022, arXiv:2007.05558v2.
[5] Y.-J. Mii, "Semiconductor innovations, from device to system," in Proc. Symp. VLSI Technol. Circuits, 2022, pp. 276–281.
[6] S. Li, "F1: Transceivers for exascale: Towards Tbps/mm and sub-pJ/bit: Advanced packaging and 3DIC interconnections," in Proc. ISSCC, 2023, pp. 519–521.
[7] B. Santo (EE Times, Portland, OR, USA). Chiplets: A Short History. Mar. 2021. [Online]. Available: https://www.eetimes.com/chiplets-a-short-history/
[8] A. Tirumala and R. Wong, "NVIDIA Blackwell platform: Advancing generative AI and accelerated computing," in Proc. 36th IEEE Hot Chips Symp., 2024, pp. 1–33.
[9] R. Kaplan, "Intel Gaudi 3 AI accelerator: Architected for Gen AI training and inference," in Proc. 36th IEEE Hot Chips Symp., 2024, pp. 1–16.
[10] D. D. Sharma, G. Pasdast, Z. Qian, and K. Aygun, "Universal chiplet interconnect express (UCIe): An open industry standard for innovations with chiplets at package level," IEEE Trans. Compon., Pack. Manuf. Technol., vol. 12, no. 9, pp. 1423–1431, Sep. 2022.
[11] "Universal chiplet interconnect express (UCIe) specification revision 2.0." Accessed: Jun. 7, 2024. [Online]. Available: https://www.uciexpress.org/
[12] D. C. H. Yu, C.-T. Wang, and H. Hsia, "Foundry perspectives on 2.5D/3-D integration and roadmap," in Proc. IEEE Int. Electron Devices Meeting (IEDM), 2021, pp. 3.7.1–3.7.4.
[13] S. Lie, "Wafer-scale AI: GPU impossible performance," in Proc. 36th IEEE Hot Chips Symp., 2024, pp. 1–71.
[14] E. Talpes, D. Williams, and D. D. Sarma, "DOJO: The microarchitecture of Tesla's Exa-scale computer," in Proc. 34th Hot Chips Symp., 2023, pp. 1–28.
[15] S.-R. Chun et al., "InFO_SoW (system-on-wafer) for high performance computing," in Proc. IEEE ECTC, 2020, pp. 1–6.
[16] S. S. Iyer, "Heterogeneous integration for performance and scaling," IEEE Trans. Compon., Pack. Manuf. Technol., vol. 6, no. 7, pp. 973–982, Jul. 2016.
[17] A. B. Ahmed and A. B. Abdallah, "LA-XYZ: Low latency, high throughput look-ahead routing algorithm for 3-D network-on-chip (3D-NoC) architecture," in Proc. IEEE 6th Int. Symp. Embed. Multicore SoCs, 2012, pp. 167–174.
[18] Y. Feng, D. Xiang, and K. Ma, "Heterogeneous die-to-die interfaces: Enabling more flexible chiplet interconnection systems," in Proc. 56th Annu. IEEE/ACM Int. Symp. Microarchit., 2023, pp. 930–943.
[19] Y. Feng, D. Xiang, and K. Ma, "A scalable methodology for designing efficient interconnection network of chiplets," in Proc. IEEE Int. Symp. High-Perform. Comput. Archit. (HPCA), 2023, pp. 1059–1071.
[20] J. Yin et al., "Modular routing design for chiplet-based systems," in Proc. ACM/IEEE 45th Annu. Int. Symp. Comput. Archit. (ISCA), 2018, pp. 726–738.
[21] I. Lee, M. Cheong, and S. Kang, "Highly reliable redundant TSV architecture for clustered faults," IEEE Trans. Rel., vol. 68, no. 1, pp. 237–247, Mar. 2019.
[22] T.-H. Wang, P.-Y. Chuang, F. Lorenzelli, and E. J. Marinissen, "Test and repair improvements for UCIe," in Proc. IEEE Eur. Test Symp. (ETS), 2024, pp. 1–6.
[23] J. Lau, "Recent advances and trends in advanced packaging," IEEE Trans. Compon., Pack. Manuf. Technol., vol. 12, no. 2, pp. 228–252, Feb. 2022.
[24] R. Chaware, K. Nagarajan, and S. Ramalingam, "Assembly and reliability challenges in 3-D integration of 28-nm FPGA die on a large high density 65-nm passive interposer," IEEE Trans. Electron Devices, 2012, submitted for publication.
[25] S. Hou et al., "Wafer-level integration of an advanced logic-memory system through the second-generation CoWoS technology," IEEE Trans. Electron Devices, vol. 64, no. 10, pp. 4071–4077, Oct. 2017.
[26] Y.-H. Lin et al., "Multilayer RDL interposer for heterogeneous device and module integration," in Proc. IEEE ECTC, 2019, pp. 931–936.
[27] M. Lin et al., "Organic interposer CoWoS-R+ (plus) technology," in Proc. IEEE ECTC, 2022, pp. 1–6.
[28] Y.-C. Hu et al., "CoWoS architecture evolution for next generation HPC on 2.5D system in package," in Proc. IEEE ECTC, 2023, pp. 1022–1026.
[29] S. Hou et al., "Integrated deep trench capacitor in Si interposer for CoWoS heterogeneous integration," in Proc. IEEE IEDM, 2019, pp. 19.5.1–19.5.4.
[30] C.-F. Tseng, C. S. Liu, C.-H. Wu, and D. Yu, "InFO (wafer level integrated fan-out) technology," in Proc. IEEE ECTC, 2016, pp. 1–6.
[31] K. Kim and M.-J. Park, "Present and future challenges of high bandwidth memory (HBM)," in Proc. IEEE Int. Memory Workshop (IMW), 2024, pp. 1–4.
[32] D. B. L. Yolanda, "Wafer to wafer bonding to increase memory density," in Proc. China Semicond. Technol. Int. Conf. (CSTIC), 2022, pp. 1–4.
[33] W. Gomes et al., "Ponte Vecchio: A multi-tile 3-D stacked processor for exascale computing," in Proc. IEEE ISSCC, 2022, pp. 42–44.
[34] J. Wuu et al., "3-D V-Cache: The implementation of a hybrid-bonded 64MB stacked cache for a 7-nm ×86–×64 CPU," in Proc. IEEE ISSCC, 2022, pp. 428–429.
[35] M.-F. Chen, F.-C. Chen, W.-C. Chiou, and D. C. Yu, "System on integrated chips (SoIC™) for 3-D heterogeneous integration," in Proc. IEEE ECTC, 2019, pp. 594–599.
[36] G. Kuo et al., "A thermally friendly bonding scheme for 3-D system integration," in Proc. IEEE ECTC, 2023, pp. 1973–1976.
[37] H.-J. Chia et al., "Ultra high density low temperature SoIC with sub-0.5-µm bond pitch," in Proc. IEEE ECTC, 2023, pp. 1–4.


[38] J. Lau, "State of the art of Cu–Cu hybrid bonding," IEEE Trans. Compon., Pack., Manuf. Technol., vol. 14, no. 3, pp. 376–396, Mar. 2024.
[39] R. Mahajan et al., "Embedded multi-die interconnect bridge (EMIB)–A high-density, high-bandwidth packaging interconnect," in Proc. IEEE 66th Electron. Compon. Technol. Conf. (ECTC), 2016, pp. 557–565.
[40] R. Swaminathan, M. J. Schulte, B. Wilkerson, G. H. Loh, A. Smith, and N. James, "AMD Instinct™ MI250X accelerator enabled by elevated fanout bridge advanced packaging architecture," in Proc. IEEE Symp. VLSI Technol. Circuits, 2023, pp. 1–2.
[41] D. Tonietto, "The future of short reach interconnect," in Proc. IEEE ESSCIRC, 2022, pp. 1–8.
[42] G. Gangasani, "A 1.6-Tb/s chiplet over XSR-MCM channels using 113-Gb/s PAM-4 transceiver with dynamic receiver-driven adaptation of TX-FFE and programmable roaming taps in 5-nm CMOS," in Proc. IEEE ISSCC, 2022, pp. 122–124.
[43] A. Smith and V. Alla, "AMD Instinct MI300X generative AI accelerator and platform architecture," in Proc. 36th Hot Chips Symp., 2024, pp. 1–22.
[44] C. Liu, J. Botimer, and Z. Zhang, "A 256-Gb/s/mm-shoreline AIB-compatible 16-nm FinFET CMOS chiplet for 2.5D integration with Stratix 10 FPGA on EMIB and tiling on silicon interposer," in Proc. IEEE CICC, 2021, pp. 1–2.
[45] S. Ardalan, R. Farjadrad, M. Kuemerle, K. Poulton, S. Subramaniam, and B. Vinnakota, "An open inter-chiplet communication link: Bunch of wires (BoW)," IEEE Micro, vol. 41, no. 1, pp. 54–60, Jan./Feb. 2021.
[46] OpenHBI Specification Version 1.0, Open Comput. Project, Austin, TX, USA, 2021.
[47] M.-S. Lin et al., "A 7-nm 4-GHz Arm®-core-based CoWoS® chiplet design for high performance computing," IEEE J. Solid-State Circuits, vol. 55, no. 4, pp. 956–966, Apr. 2020.
[48] S. Li, M.-S. Lin, W.-C. Chen, and C.-C. Tsai, "Interconnect in the era of 3DIC," in Proc. IEEE CICC, 2022, pp. 1–5.
[49] Z. Pan, X. Li, W. Hao, R. Miao, Z. Yue, and A. Wang, "Challenges: ESD protection for heterogeneously integrated SoICs in advanced packaging," Electronics, vol. 13, no. 12, p. 2341, 2024.
[50] X. Wang, Y. Yang, D. Chen, and D. Li, "Intelligent design method of thermal through silicon via for thermal management of chiplet-based system," IEEE Trans. Electron Devices, vol. 70, no. 10, pp. 5273–5280, Oct. 2023.
[51] A. K. Coskun, J. L. Ayala, D. Atienza, T. S. Rosing, and Y. Leblebici, "Dynamic thermal management in 3-D multicore architectures," in Proc. Design, Autom. Test Europe Conf. Exhibit., 2009, pp. 1410–1415.
[52] L. Chen et al., "The dawn of AI-native EDA: Opportunities and challenges of large circuit models," 2024, arXiv:2403.07257.
[53] Y. Nishi, "A 0.190-pJ/bit 25.2-Gb/s/wire inverter-based AC-coupled transceiver for short-reach die-to-die interfaces in 5-nm CMOS," IEEE J. Solid-State Circuits, vol. 59, no. 4, pp. 1146–1157, Apr. 2024.
[54] Y. Nishi et al., "A 0.297-pJ/bit 50.4-Gb/s/wire inverter-based short-reach simultaneous bi-directional transceiver for die-to-die interface in 5-nm CMOS," IEEE J. Solid-State Circuits, vol. 58, no. 4, pp. 1062–1073, Apr. 2023.
[55] E. Yeung and M. Horowitz, "A 2.4-Gb/s/pin simultaneous bidirectional parallel link with per-pin skew compensation," IEEE J. Solid-State Circuits, vol. 35, no. 11, pp. 1619–1628, Nov. 2000.
[56] Serial Interface for Data Converters, JEDEC Standard JESD204D, 2023.
[57] P. Shivakumar, S. Keckler, C. Moore, and D. Burger, "Exploiting microarchitectural redundancy for defect tolerance," in Proc. 21st Int. Conf. Comput. Design, 2003, pp. 481–488.
[58] J. Srinivasan, S. Adve, P. Bose, and J. Rivers, "The case for lifetime reliability-aware microprocessors," in Proc. ISCA, 2004, pp. 276–287.
[59] J. Chang, "3Dblox: Unleash the ultimate 3DIC design productivity," in Proc. Int. Symp. Phys. Design, 2024, p. 215.
[60] M.-S. Lin et al., "A 0.296-pJ/bit 17.9-Tb/s/mm² die-to-die link in 5-nm/6-nm FinFET on a 9-µm-pitch 3-D package achieving 10.24-Tb/s bandwidth at 16-Gb/s PAM-4," in Proc. IEEE Symp. VLSI Technol. Circuits, 2024, pp. 1–2.
[61] W. Turner et al., "Leveraging micro-bump pitch scaling to accelerate interposer link bandwidths for future high-performance compute applications," in Proc. IEEE CICC, 2024, pp. 1–4.
[62] Y. Safari, R. Mohammadrezaee, D. A. Saleh, and B. Vaisband, "Hybrid interconnect infrastructure for inter-chiplet," in Proc. IEEE 74th Electron. Compon. Technol. Conf. (ECTC), 2024, pp. 2229–2236.
[63] T. Braun et al., "Challenges and opportunities for fan-out panel level packing (FOPLP)," in Proc. 9th Int. Microsyst., Pack., Assem. Circuits Technol. Conf. (IMPACT), 2014, pp. 154–157.
[64] J. H. Lau, "Patent issues of embedded fan-out wafer/panel level packaging," in Proc. China Semicond. Technol. Int. Conf. (CSTIC), 2016, pp. 1–7.
[65] T. Braun et al., "How to manipulate warpage in fan-out wafer and panel level packaging," in Proc. IEEE 74th Electron. Compon. Technol. Conf. (ECTC), 2024, pp. 1–6.
[66] C.-C. Lee, C.-P. Chang, C.-Y. Chen, H.-C. Lee, and G. C.-F. Chen, "Warpage estimation and demonstration of panel-level fan-out packaging with Cu pillars applied on a highly integrated architecture," IEEE Trans. Compon., Pack. Manuf. Technol., vol. 13, no. 4, pp. 560–569, Apr. 2023.
[67] N. Harris, "Passage: A wafer-scale, programmable photonic communication substrate," in Proc. 34th IEEE Hot Chips Symp., 2022, pp. 1–26.
[68] P. Maniotis and D. M. Kuchta, "Exploring the benefits of using co-packaged optics in data center and AI supercomputer networks: A simulation-based analysis," J. Opt. Commun. Netw., vol. 16, no. 2, pp. A143–A156, 2024.
[69] K. Zhang, "1.1 Semiconductor industry: Present & future (keynote)," in Proc. IEEE ISSCC, 2024, pp. 10–15.
[70] S. Li, "Power delivery for high-speed die to die interconnects and future 3-D-ICs," presented at the Asia-Pac. Econ. Cooper., Long Beach, CA, USA, 2024.

SHENGGAO LI (Senior Member, IEEE) received the B.S. degree in automatic control from Northwestern Polytechnical University, Xi'an, China, in 1994, the M.S. degree in automation from Tsinghua University, Beijing, China, in 1997, and the Ph.D. degree in electrical engineering from Ohio State University, Columbus, OH, USA, in 2000.
He is currently with Taiwan Semiconductor Manufacturing Company Ltd. (TSMC), Hsinchu, Taiwan, as a Director of the Mixed-Signal and RF Solutions Division, leading the high-speed SerDes, chiplet interconnects for 3-D and photonic integrations, RF/mmWave, high-speed data converters, and foundational analog design and technology co-optimization from N5 to A14. Prior to this, he was with Intel, Santa Clara, CA, USA, as a Principal Engineer, a Section Manager, and the Director of IP/SoC Strategy, where he led the 10GE and PCIe Gen3/4/5, and QPI/UPI/CXL PHY development for multiple generations of high-performance computing CPU products, from 2007 to 2020. He was a Founding Member, a Senior Principal Engineer, and a Director of Engineering with Mobert Semiconductor, an RFIC start-up based in Silicon Valley and Shanghai, from 2006 to 2007. From 2000 to 2006, he worked with the Communication Division, Intel, as an Analog Design Engineer and a Project Manager on wireless transceivers (BT/GSM), optical transceivers (OC192), and SerDes IPs (XFI, XAUI). He holds more than 50 patents in the field of high-speed communication and 3-D integrations.
Dr. Li is a Working Group Co-Chair for the UCIe Consortium on Chiplet Interconnect Standards, and a TPC Member for the Custom Integrated Circuits Conference since 2021, serving as a Co-Chair of the Wireline Subcommittee in 2025, and the Chair of the Sponsorship Subcommittee in 2023 and 2024. He has published more than 30 journal/conference papers, including in the IEEE JOURNAL OF SOLID-STATE CIRCUITS, ISSCC, ASSCC, CICC, and Integration: The VLSI Journal, and has served as an Invited Speaker for chiplet interconnects and advanced packaging technologies.


MU-SHAN LIN was born in Taiwan in 1979. He received the master's degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 2004.
He has since dedicated his career to circuit design with Taiwan Semiconductor Manufacturing Company Ltd. (TSMC), Hsinchu, Taiwan. He specializes in high-speed SerDes (56/112 Gbps) and equalization development, DDR-PHY design, and parallel-bus forwarded-clock and low-swing interconnects for 2.5D/3-D-IC package applications. He currently serves as the Technical Manager for the Advanced Connectivity Department, TSMC. His expertise is recognized through numerous IEEE publications, including contributions to ASSCC, VLSI, JSSC, and Hot Chips conferences, and he holds several patents in collaboration with TSMC.

WEI-CHIH CHEN was born in Keelung, Taiwan, in 1980. He received the master's degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 2002.
He is currently a Manager with the Advanced Connectivity Department, Taiwan Semiconductor Manufacturing Company Ltd. (TSMC), Hsinchu, where he focuses on the design of high-speed SerDes transceivers and chiplet heterogeneous integration. He has made significant contributions to PHY standards, particularly in link jitter analysis, clocking architectures, equalization techniques, and channel modeling. He has published several papers in IEEE conferences, such as VLSI, JSSC, CICC, and ISSCC, and he holds multiple patents with TSMC. His research interests include low-power, high-speed wireline interfaces, and high-throughput die-to-die interfaces.

CHIEN-CHUN TSAI received the master's degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1996.
He has been working as a Design Engineer with Taiwan Semiconductor Manufacturing Company Ltd. (TSMC), Hsinchu, Taiwan, since November 1998. Over the years, his responsibilities have included standard cell design, standard I/O, ESD, specialty I/O, high-speed SerDes, and chiplet interface design. He is currently the Department Manager of Advanced Connectivity Taiwan, TSMC, focusing on SerDes and 2.5D/3D IC chiplet interface PHY development.