Review of Chiplet-Based Design: System Architecture
Science China Information Sciences
REVIEW, October 2024, Vol. 67, Iss. 10, 200401:1–200401:20
https://doi.org/10.1007/s11432-023-3926-8
Special Topic: AI Chips and Systems for Large Language Models
Received 24 June 2023/Revised 28 September 2023/Accepted 26 December 2023/Published online 19 July 2024
Abstract Chiplet-based design, which breaks a system into multiple smaller dice (or “chiplets”) and re-
assembles them into a new system chip through advanced packaging, has received extensive attention in
the post-Moore's law era due to its advantages in terms of cost, performance, and agility. However, significant
challenges arise in this implementation approach, including the mapping of functional components onto
chiplets, the co-optimization of package and architecture, the increased latency of communication across
functions in different dies, and the uncertainty of fragmented communication subsystems, such as main-
taining deadlock freedom when independently designed chiplets are combined. Although various design
approaches attempt to address these challenges, surveying them one after another is not the most helpful
way to offer a comparative viewpoint. Accordingly, in this paper, we present a more comprehensive
and systematic strategy to survey the various approaches. First, we divide them into chiplet-based system
architecture design and interconnection design, and further classify them based on different architectures and
building blocks of interconnection. Then, we analyze and cross-compare each classification separately, and in
addition, we present a topical discussion on the evolution of memory architectures, design automation, and
other relevant topics in chiplet-based designs. Finally, some discussions on important topics are presented,
emphasizing future needs and challenges in this rapidly evolving field.
Keywords chiplet-based design, package, architecture, interconnection, silicon interposer
1 Introduction
In recent years, the progression of Moore’s law has slowed down while the demand for high-performance
integrated circuits (ICs) and the diversification of their applications has continued to grow. The conven-
tional monolithic chip design approach, though widely adopted, presents limitations in cost-effectiveness,
performance, and agility [1,2]. Fortunately, the chiplet-based design approach offers a promising solution
by disintegrating the traditional monolithic silicon chip into multiple smaller chiplets and then reassem-
bling them into a new system chip [3]. Chiplet refers to an IC block that has been specifically designed
to communicate with other chiplets to form larger, more complex ICs [4]. It should be emphasized that a chiplet is
a reusable prefabricated component in the form of a die. As a promising technology, chiplet technology
has received extensive attention in the post-Moore’s law era [3, 5–8].
Chiplet-based design has emerged as a crucial solution to break limitations in increasing integration
level, reducing time-to-market (TTM), and improving the energy efficiency of ICs. By breaking down
the monolithic chip into smaller dice, the chiplet technology improves yields, reduces design complexity,
and overcomes reticle limitations. As a result, chip systems several times, or even one to two orders of
magnitude, larger in scale become achievable. The chiplet technology facilitates the reuse of the silicon-
proven fabricated dice because they can be assembled together to form a new system through a relatively
short processing step. The TTM and non-recurring engineering (NRE) costs associated with the back-
end design process and chip manufacturing can be diluted. As the chiplets in an integrated system are
manufactured independently and can be assembled in various structures, heterogeneous integration is also
enabled, as a result, additional optimization space is opened up and system iteration and upgrade are
also simplified. Consequently, companies such as AMD, Intel, and TSMC have already used the chiplet
technology in their products and solutions [9–12].
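The yield advantage of splitting a large die into smaller chiplets can be illustrated with a simple cost sketch. This assumes a Poisson defect-yield model with known-good-die testing before assembly; the defect density, wafer cost, and die areas below are purely hypothetical illustrations, not figures for any cited product:

```python
import math

def die_yield(area_cm2, defect_density=0.1):
    """Poisson defect-yield model: Y = exp(-A * D0)."""
    return math.exp(-area_cm2 * defect_density)

def good_silicon_cost(total_area_cm2, n_chiplets, wafer_cost_per_cm2=10.0,
                      defect_density=0.1):
    """Silicon cost per good system. Each chiplet is assumed tested before
    assembly (known-good-die), so only chiplet-level yield loss is paid."""
    area = total_area_cm2 / n_chiplets
    y = die_yield(area, defect_density)
    # cost of n good chiplets = n * (raw chiplet cost / chiplet yield)
    return n_chiplets * (area * wafer_cost_per_cm2) / y

mono = good_silicon_cost(8.0, 1)   # one monolithic 800 mm^2 die
quad = good_silicon_cost(8.0, 4)   # four 200 mm^2 chiplets: cheaper per good system
```

Under this simplified model the four-chiplet split roughly halves the silicon cost per good system, because yield falls exponentially with die area; packaging and interface overheads, discussed later, would eat into this margin.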
However, chiplet-based design faces many challenges, including die-to-die interconnection, packaging,
power supply, thermal management, design methodologies, and electronic design automation (EDA) tools.
Among them, a new special challenge in chiplet-based design is the architecture design, which is the top-
level problem that governs the system performance and cost and is related to most of the issues mentioned
above. Moreover, in chiplet-based design, the coupling between the packaging, system architecture, and
interconnection becomes tighter than in the traditional monolithic system on chip (SoC). The trade-off
between performance, power, area, and cost (PPAC) requires careful choices within the design space
of packaging, system architecture, and interconnection. Unfortunately, no general and clear system-
level architecture design method has been proposed. Therefore, it is valuable to summarize and analyze
the state-of-the-art chiplet-based designs in order to provide designers with valuable design schemes for
reference. This task involves reviewing the works in the aspects of function mapping, system floorplan,
and interconnection fabric architecture.
Several surveys have already focused on some key technologies and fundamental modules of chiplet-
based design. Studies in [13, 14] have provided a comprehensive overview of chiplet-based designs from
a packaging perspective. Similarly, in [14–16], the interconnect interfaces of chiplets have been summa-
rized, with a focus mainly on the physical layer. However, further investigation that can provide guidance
for system architecture design is desired in addition. The study [17] has overviewed the computing ar-
chitecture of high-performance computing (HPC) systems. However, it mainly classifies and introduces
architectures from the perspective of packaging technology rather than the perspective of design, lack-
ing the analysis of the system architecture and especially the architecture of interconnection networks.
Furthermore, the analysis is limited to computing systems for applications such as HPC, mobile, and
personal computers (PC), while there are many other potential application fields for which the chiplet
technology is valuable. Accordingly, this article investigates and analyzes chiplet-based design issues
from the perspective of system architecture and interconnection design, with a wider range of application
domains.
We declare the following contributions:
• We provide a high-level classification of the system architecture of existing chiplet-based designs from
a design perspective, and highlight the challenges and evolution of the memory subsystem in chiplet-based
designs.
• We demonstrate the significant trend of platformization in chiplet-based system design and provide
a systematic analysis of existing solutions.
• We present a summary of emerging chiplet-based interconnect topologies and analyze the evolving
optimization of the network topology on active silicon interposers.
• We analyze and summarize the state-of-the-art modular routing schemes in chiplet-based systems
and provide a comparative analysis of centralized and distributed routing approaches.
The remainder of the paper is organized as follows: Section 1 is the introduction. Section 2 presents an
overview of packaging technologies and highlights the design challenges faced by chiplet-based systems.
Section 3 reviews chiplet-based designs from the perspectives of system architecture and interconnection.
It includes analyses of typical chiplet-based system architectures and interconnections, as well as a dis-
cussion on emerging technologies. Section 4 explores important topics for future research, with a focus
on challenges and needs. Section 5 is the summary of the purpose, content, and conclusion of this paper.
2 Packaging technologies and design challenges

In this paper, we use the definition and classification of advanced packaging from articles [18–20]. The
packaging technologies used in chiplet-based systems are classified into 2D, 2.xD, and 3D [19], as shown
in Figure 1 [18].
In 2D packaging, multichips are mounted in a single plane on a package substrate or fan-out redistri-
bution-layer (RDL) substrate [20]. The 2D package offers lower interconnect density than that of the
silicon die, but it is a mature technology with cost-effectiveness, high yield, and robustness.
In 2.xD (x=1, 3, 5) packaging, chiplets are integrated on a substrate by a fine pitch interposer,
Figure 1 (Color online) Groups of advanced packaging: 2D, 2.xD, and 3D packaging [18] Copyright 2022 IEEE.
including thin-film layers or embedded bridges (2.1D), separately fabricated fine pitch through-silicon via
(TSV)-less interposer (higher interconnect density than thin-film layer) (2.3D), and passive TSV silicon
interposer (2.5D) [20]. In particular, the 2.5D package is a widely employed package type [21–24], in
which the passive TSV silicon interposer is a piece of silicon with TSVs and RDLs [20], and a typical
example is TSMC’s chip-on-wafer-on-substrate (CoWoS) [25]. Compared with the 2D package, the 2.xD
package offers much higher interconnect density, yet introduces higher package cost.
In 3D packaging, or vertical integration, multiple layers of dies stack vertically, interconnected by TSVs
bonded with micro-bumps (µbumps) or bumpless hybrid bonding [20]. Remarkably, the widely adopted
active interposer-based designs, such as the AMD 3D V-cache stack die [26] and Intel Lakefield [27] with
Foveros technology [28], are typical of 3D packages [20]. Compared to the 2.5D package, the 3D package
offers even higher performance, lower energy per bit transition, and a smaller form factor. However, it
suffers from high technology complexity and high cost. Moreover, in 3D packaging, the thermal problem
is more challenging compared to 2D/2.5D packaging. In 2D/2.5D packaging, heat extraction can be
performed from the top face of the package. In the case of die stacking, however, thermal dissipation is a
more significant challenge, as multiple layers of compute dies must dissipate their heat through the dies
stacked above them, particularly when the logic part has high power consumption.
The main system-level technology issues in chiplet-based design include the following interrelated aspects:
the system architecture (chiefly the chiplet partition), the interconnection design, and the package scheme.
In detail, they respectively plan the mapping of the computing and processing functions across multiple
chiplets, the interconnect fabric between them, and their physical floorplan.
System architecture. The primary challenge in the aspect of system architecture is to choose the
chiplet partition scheme while addressing the potential performance degradation that may result from
the split, as the bandwidth and latency between chiplets generally tend to be inferior to those
within the same chip. Moreover, the challenges faced by systems with different computing architectures,
such as homogeneous multicore processors and heterogeneous multiprocessor SoCs (MPSoCs), are quite
different. In addition to the memory access bottlenecks and reticle limitation, the former mainly concerns
scalability and inconsistencies in access latency due to splitting, while the latter may be more concerned
with compute efficiency. Moreover, additional costs, such as the extra silicon area of interface circuits,
extra package cost, and extra masks, as well as the power consumption overhead associated with the
chiplet interface, must be taken into account. Furthermore, the reusability of the chiplets is also an
important factor to consider, which greatly determines the cost of the system.
Interconnection design. There are two differences between the interconnection in chiplet-based
systems and in monolithic chip systems, and these differences create design challenges. First, the charac-
teristics of interconnect links in chiplet-based systems are not uniform. Inter-chiplet interconnects have
lower bandwidth and higher latency compared to intra-chiplet interconnects due to physical limitations,
such as limited input-output (IO) ports and additional latency from drivers and receivers. Second, the
chiplet-based approach leads to the problem of fragmented network-on-chip (NoC), where each chiplet
contains only a piece of the overall network and these pieces are usually designed and validated indepen-
dently. This leads to additional communication hops and introduces more uncertainty in traffic, such as
new deadlocks that may occur when different chiplets are assembled together. Therefore, deadlock-free
routing technology is crucial in chiplet-based design.
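The deadlock hazard of assembling independently verified chiplets can be made concrete with the classic channel-dependency-graph test: a routing function is deadlock-free if and only if its channel dependency graph is acyclic (Dally and Seitz). The tiny two-chiplet example below is hypothetical:

```python
def has_cycle(deps):
    """deps maps each channel to the set of channels it may wait on.
    Deadlock freedom requires this dependency graph to be acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}

    def dfs(c):
        color[c] = GRAY                      # on the current DFS path
        for nxt in deps.get(c, ()):
            if color.get(nxt, WHITE) == GRAY:
                return True                  # back edge -> cycle -> deadlock
            if color.get(nxt, WHITE) == WHITE:
                color.setdefault(nxt, WHITE)
                if dfs(nxt):
                    return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and dfs(c) for c in list(deps))

# Two chiplets, each internally acyclic (deadlock-free in isolation):
chiplet_a = {'a0': {'a1'}, 'a1': set()}
chiplet_b = {'b0': {'b1'}, 'b1': set()}
assert not has_cycle(chiplet_a) and not has_cycle(chiplet_b)

# Joining them adds cross-chiplet dependencies that close a cycle:
joined = {'a0': {'a1'}, 'a1': {'b0'}, 'b0': {'b1'}, 'b1': {'a0'}}
assert has_cycle(joined)
```

This is exactly the failure mode described above: each fragment is safe, yet the assembled network deadlocks, which motivates the modular routing restrictions surveyed in Section 3.2.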
Both the system architecture and the interconnection design are heavily dependent on the packaging
scheme. The bandwidth of a die-to-die interface and the latency over its link are dominated by the
package scheme for a large scale chiplet. Additionally, the cost of the package is also non-negligible under
the current conditions. Therefore, co-optimization of package-architecture is crucial for achieving optimal
PPAC in chiplet-based systems.
3 Literature review
3.1 System architecture
The crucial aspect of system architecture design is the mapping of system functions to chiplets. Pre-
vious studies [5, 29] classify these mapping scenarios into two categories: package aggregation and SoC
disaggregation. In the former case, the systems are mainly based on the aggregation of board design
into a package, including the aggregation of processors with memory and the aggregation of processors
of different architectures. In the latter case, it represents the partitioning of a monolithic SoC into two
or more chiplets [29]. Following this line, we will first review the existing literature from an architectural
perspective and then analyze and discuss each of them.
Figure 2 (Color online) Package aggregation. (a) AMD RadeonTM Fury [38] Copyright 2015 IEEE; (b) Huawei Ascend 910 [41]
Copyright 2019 IEEE.
HBM (high bandwidth memory) is a 3D-stacked memory device in which multiple DRAM dies are
stacked on top of a logic die and connected by TSVs.
With the advancement of architecture technologies, the traditional boundary between compute and
memory in this kind of “compute die+HBM” scheme is being broken. Processing-in-memory (PIM)
architectures are evolving based on HBM. The PIM-HBM architecture, proposed in [42], embeds PIM
cores into the logic base of HBM. As a result, computations with intensive DRAM accesses can be
performed by the in-memory logic die, which clearly reduces data movement in the memory hierarchy.
Inspired by this, emerging
memory technologies such as near data processing (NDP) architectures and PIM architectures have been
intensively investigated. Obviously, these new PIM architectures are no longer typical package aggregation
schemes.
In addition to the on-package memory, multiple heterogeneous processors can be integrated in the
same package to boost system performance. For instance, Intel’s Kaby Lake-G processor integrates a
discrete graphics processor produced by AMD (RadeonTM RX Vega M) into the same package as the
microprocessor. The CPU is connected to the GPU via 8 on-package PCI express (PCIe) links, leading
to a performance improvement of up to 40% and 50% board space savings compared to the typical
motherboard design [43]. Additionally, data-intensive products such as network interface cards (NICs)
also benefit from the high-speed interface provided by package-integration. SmartNIC of Corigine, a
chiplet-based system, utilizes a dedicated data processing unit (DPU) to offload the CPU workload for
performance and power efficiency. The DPU and CPU are connected by high-bandwidth on-package
links [44]. Moreover, Intel introduced a package-integrated “CPU+FPGA” architecture, allowing CPU
and custom-designed FPGA logic to share a single physical memory via high-bandwidth on-package
links [45]. This significantly improves the performance of memory limited algorithms such as machine
learning (ML) as noted in [46].
Analysis. The package aggregation schemes employ mature board-level links and packaging process.
This approach has the advantage of high yields and low cost. Moreover, chiplets are designed in a similar
manner to traditional monolithic designs, requiring only a high-speed interface to be added and minimal
modifications to the architecture. This decreases design risk and complexity. Additionally, the chiplets
are relatively functionally complete and independent. Therefore, it is possible to compose products in a
single-chiplet form first and then in a multi-chiplet solution according to the research and development
(R&D) timeline. This shortens the TTM of products with varying performance levels.
The mapping approach in package aggregation schemes is relatively straightforward: chips in the
board-level system are mapped to dies in the package-level system. In some cases, multiple chips can
be merged into a single die. Interconnections between the chips are realized through common industry-
standard interfaces, such as PCIe. Apparently, this implementation benefits from the high-bandwidth,
low-latency, and low-power interface provided by the on-package integration. More importantly, the de-
sign effort of the chiplets is minimized, and only additional high-speed interface technologies are required
to be developed, which reduces design risk and complexity. Unfortunately, the resulting schemes are not
always optimal. On one hand, the traditional board-level interfaces are too heavy (high power consump-
tion, high latency) for systems in packages and, therefore, cannot meet the increasing energy efficiency
requirements. On the other hand, the partition of functions is constrained by the available chips or the
Figure 3 (Color online) SoC disaggregation. (a) AMD EPYC™ processor [9] Copyright 2018 IEEE; (b) Apple M1 Ultra [50]
Copyright 2022 IEEE; (c) Intel Sapphire Rapids [51] Copyright 2021 IEEE.
existing designs whose functions are not exactly optimized for the new system.
Figure 4 (Color online) SoC disaggregation. (a) AMD “Rome” processor [10] Copyright 2020 IEEE; (b) IntAct architecture [58]
Copyright 2021 IEEE.
Moreover, designers are also trying their best to reduce the latency of cross-die communication, such as
the above mentioned TSMC’s LIPINCON and Intel’s MDFIO, which are made possible by the dense
fine pitch interconnects offered by silicon interposer [12] and silicon bridge [52], respectively. Intel also
employed a metal stack with a 400 nm-pitch routing layer optimized for global interconnects, which
reduces latency by 30% at the same signal density [11].
Another challenge is that the inconsistent latency of inter-chiplet accesses can lead to non-uniform
memory access (NUMA), which is unfriendly to programmers as it requires complex logic for proper
placement of memory [54]. Additionally, some memory requests must be serviced by inter-chiplet “re-
mote” memory channels, which can lead to system throughput bottlenecks [3]. For instance, the latency
of the inter-chiplet is 56.7% higher than that of the intra-chiplet in the AMD EPYCTM processor [3]. To
mitigate this challenge, a centralized interconnect architecture can be adopted, which will be mentioned
below.
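As a rough sketch, the NUMA effect can be modeled as a weighted average of local and remote access latencies. Only the 56.7% inter-/intra-chiplet latency gap comes from [3]; the base latency and the fraction of remote requests below are illustrative assumptions:

```python
def avg_latency(local_ns, remote_ratio, remote_fraction):
    """Average memory latency when a fraction of requests is served by
    remote (inter-chiplet) memory channels.

    remote_ratio: remote latency as a multiple of local latency.
    remote_fraction: share of requests that go remote."""
    remote_ns = local_ns * remote_ratio
    return (1 - remote_fraction) * local_ns + remote_fraction * remote_ns

# Inter-chiplet latency 56.7% higher than intra-chiplet [3]; assume a
# hypothetical 80 ns local latency and 25% of requests going remote:
l = avg_latency(80.0, 1.567, 0.25)   # ≈ 91.3 ns average
```

The model shows why centralizing the fabric helps: reducing either the remote fraction or the remote/local ratio pulls the average back toward the intra-chiplet latency.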
Besides the fabric, the active interposer can also implement modules such as power management [22, 58–60],
memory-IO controllers [59, 61, 62], on-chip memory [58], and capacitors [58, 61, 62], in turn improving its
utilization.
Analysis. This architecture, which separates the interconnect fabric including those low performance
common components of systems from the computation cores, offers several advantages. Firstly, cen-
tralizing interconnections through the interconnect die leads to increased consistency in inter-chiplet
communication latency and reduced NUMA effect, thereby improving the overall performance of multi-
core processors [3]. Secondly, the fabric die and the compute die can be fabricated in different process
nodes, conserving the advanced process chip area and thereby reducing system cost. For example, the
“Rome” product consists of a 12 nm IOD and eight 7 nm compute die [10]. As shown in [3], splitting
the fabric improves the utilization of advanced process nodes, as the scalable logic (CPU and cache)
increases to 86% of the total area (56% for the former). Thirdly, it simplifies the compute die’s logic to
separate interconnect from computation, because the routing logic is moved into the fabric die. At the
same time, without an NoC router, the compute dies can be applied in a wider range of systems, because
the router design is tightly coupled to the design of the network scheme. Lastly, the fabric die
could simplify the package substrate design. For example, the first-generation EPYCTM processor with
four chiplets has exhausted all the routing resources of the substrate, making it difficult to add more
chiplets. In comparison, in the second-generation EPYCTM , the four “inner” CCDs are connected by the
central fabric die (i.e., IOD). Moreover, another four CCDs could connect to the fabric die by routing
paths underneath the “inner” CCDs. Thanks to the fabric die, the number of fully connected chiplets is
doubled to 8 [3].
Comparing systems that use 2D/2.5D packaging with those that use 3D packaging, the former has more
of a cost advantage, while the latter has more of a performance advantage. Compared to 2D/2.5D packaging, a 3D
package offers shorter interconnection distances and a higher interconnect density since the top die and
the base die are connected by µbumps or hybrid bonding. The base die enables the relay of signals over
long distances through active repeaters, leading to reduced signal delay. However, the base die used in 3D
systems is required to be large enough to accommodate all the top dies, and hence it still faces the limit
in the aspect of reticle size and yield. To address this, the cascade expansion of the base die technique,
Co-EMIB, has been proposed in [28], which is an extension combining EMIB and Foveros technologies.
Figure 5 (Color online) Multiprocessor partitioning. (a) Meteor Lake [59] Copyright 2022 IEEE; (b) Granite Rapids [63] Copyright
2023 IEEE.
Figure 6 (Color online) Tile-based architecture. (a) NVIDIA DNN accelerator and (b) their scalable NoP and NoC architec-
ture [67] Copyright 2020 IEEE; (c) Intel’s Ponte Vecchio [62] Copyright 2022 IEEE.
Another example is the Tianjic AI chip proposed in [68], which presents a multi-grained scalable architecture
similar to [66]. It can be configured with fine-grained integration (1×4 chips) for embedded systems,
or with coarse-grained integration (a 5×5-chip board design) for a cloud platform.
Another example is Intel's PVC, as shown in Figure 6(c) [61, 62], which is a processor comprising 47
functional tiles. PVC is composed of sixteen compute tiles and eight memory tiles of bandwidth-optimized
SRAM (random access memory, bandwidth-optimized, RAMBO), 3D stacked on the Foveros base dies. The compute tiles
and RAMBO tiles are connected through the base tile using the die-to-die FDI link, enabling a scalable compute
and memory architecture [62].
Analysis. This architecture provides a flexible and scalable solution for parallel processing, which is a
promising approach for building large-scale systems [69]. Moreover, each chip in the system can operate as
a standalone subsystem for smaller workloads; meanwhile, by scaling up the number of parallel computing
units, the system’s computing power can be improved. Besides, in order to reduce data exchange frequency
and simplify interconnect design, each tile uses a data buffer to store inter-tile data exchanges.
The main challenge of this architecture is the potential off-chip bandwidth wall, in which bandwidth
limitation becomes a bottleneck that limits multicore scaling [70]. Since the memory bandwidth is shared
between the compute tiles, as the number of compute tiles increases, the performance will be limited
by the memory bandwidth [70]. Apparently, it is a common concern for many-core systems [69, 71]. To
alleviate this problem, one way is to integrate more memory into the package, such as by adopting HBM
technology [40,72]. Another way is to minimize external memory access and place all data on-chip, such as
by using the PIM architecture mentioned above [73–75]. Additionally, the increasing hop count for those
packets traversing between the chiplets [76] also limits system scaling. As a result, highly-parallel memory
architectures are emerging to alleviate the traffic congestion. It involves distributing the memory access
nodes near the compute nodes, so they can be accessed in high parallel with high throughput. Several
published designs, such as the distributed MC placement adopted by [71], the distributed tensor direct
memory access (DMA) engines introduced in [77], and the distributed RAMBO cache introduced in [62],
have demonstrated significant improvements in system performance.
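The off-chip bandwidth wall described above can be sketched with a roofline-style bound, in which compute capacity scales with tile count while the shared memory bandwidth caps achievable throughput. All numbers below are hypothetical:

```python
def attainable_gops(n_tiles, peak_gops_per_tile, mem_bw_gbs, ops_per_byte):
    """Roofline-style bound: throughput is the lesser of aggregate compute
    and what the shared memory bandwidth can feed."""
    compute_bound = n_tiles * peak_gops_per_tile
    memory_bound = mem_bw_gbs * ops_per_byte
    return min(compute_bound, memory_bound)

# Hypothetical system: 2000 Gops per tile, 400 GB/s shared bandwidth,
# workload performing 10 ops per byte -> the memory cap is 4000 Gops.
scaling = [attainable_gops(n, 2000.0, 400.0, 10.0) for n in (1, 2, 4, 8)]
# throughput saturates once the memory bound is reached
```

In this sketch, doubling tiles beyond two buys nothing; raising `mem_bw_gbs` (e.g., HBM) or `ops_per_byte` (e.g., PIM, keeping data on-chip) moves the cap, which is precisely the role of the mitigation techniques listed above.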
Another issue to consider is redundancy, which brings cost and power overhead. In a tiled chiplet-
based system, each chiplet typically integrates a router, physical channels, and 4 to 6 physical interfaces,
and these circuit overheads are incurred by every chiplet. When the system consists of only a few chiplets,
the edge chiplets with inactive interfaces account for a high proportion of the total, which results in
redundancy. In addition, the physical interfaces usually do not scale well with logic shrinking [6], which
can diminish the advantages of advanced process nodes for the entire chip.
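The edge-redundancy effect can be quantified with a small sketch: in a hypothetical rows × cols mesh of chiplets, each provisioned with four die-to-die ports (N/E/S/W), the fraction of boundary-facing ports that sit idle shrinks as the array grows:

```python
def inactive_interface_fraction(rows, cols):
    """Fraction of provisioned die-to-die ports that face the package
    edge (and are therefore inactive) in a rows x cols chiplet mesh."""
    total = rows * cols * 4           # 4 ports provisioned on every chiplet
    inactive = 2 * rows + 2 * cols    # ports on the perimeter of the grid
    return inactive / total

small = inactive_interface_fraction(2, 2)   # half of all ports are idle
large = inactive_interface_fraction(8, 8)   # only one eighth are idle
```

This matches the observation above: small assemblies pay the redundancy cost most heavily, since every chiplet in a 2×2 array is an edge chiplet.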
Figure 7 (Color online) (a) Chiplet-based SoC by zGlue [86] Copyright 2021 IEEE; (b) chiplet-based system of GIA [85].
the form of active interposers can be applied to the FSMC (sockets multiple collocations) reuse scheme,
although it is based on the 2D/2.5D package. The FSMC reuse scheme shows that chiplets with the
same footprint can be integrated into a package with several standardized sockets. By combining the
interposer with different chips, its NRE cost can be amortized to a negligible level [7].
Moreover, if the substrate or interposer could be reconfigured rapidly, this would accelerate the de-
velopment of SoC. A chiplet-based SoC architecture is proposed in [86], in which chiplets with different
functions such as micro-controllers (MCUs), sensors, and FPGAs are connected by an active silicon inter-
poser, as shown in Figure 7(a) [86]. The active silicon interposer, which is called “zGlue Smart Fabric”
in [87], provides programmable connectivity that allows rapid reconfiguration of the chiplet interconnec-
tion, thus it can be reused in multiple products. A more fine-grained approach called general interposer
architecture (GIA) is proposed in [85], in which the chiplets and interposers are divided into multiple tiles
with the same size, as shown in Figure 7(b) [85]. Each tile contains a group of predefined connections
via µbumps. With this modular design, a chiplet can plug into the free tiles of the interposer [85]. To
improve the efficiency of the interconnect substrate generation process, the paper further introduces an
integration flow with the proposed design automation framework. However, the
additional reconfigurable logic leads to power and area overhead, and a yield loss for the defect-prone
active interposer [85].
As presented in the above platforms, a standard interface is necessary for a reusable platform, and an
industry standard interconnect called universal chiplet interconnect express (UCIe) has been proposed
by the UCIe consortium [83]. This approach offers high-performance, low-latency, power-efficient, and
cost-effective connectivity between heterogeneous chiplets, which enables designers to package chiplets
from different sources, including different fabs, using a wide range of packaging technologies.
3.2 Interconnection
The chiplet-based interconnections have received significant attention, and the following topics have been
extensively investigated.
Topology. The physical layout and connections between nodes and channels are determined by the
topology of the fabric, also known as the on-chip interconnect network [91]. Moreover, the selection of
the fabric topology in chiplet-based design has a significant impact on overall performance. In particular,
the fabric design is slightly different in 2D/2.5D systems (as a standalone fabric die) and in 3D systems
(as an active interposer with fabric).
Routing. According to research [23], chiplets need to be designed in a modular manner without
holistic system knowledge, so that they can be reused in different systems. However, such
a modular design approach presents a significant challenge to the networks, as assembling individually
designed chiplets together may lead to deadlocks, even if the NoC of each chiplet is designed to be
deadlock-free.
Figure 8 (Color online) Topologies of fabric in silicon interposer. (a) Mesh and Double Butterfly [96] Copyright 2014 ACM;
(b) compare of aligned and misaligned topology [47] Copyright 2015 ACM; (c) Kite topology family [97] Copyright 2020 IEEE.
The key idea of the misaligned placement in [47] is to place interposer routers “in between” neighboring chiplets instead of “aligned beneath” them, as
illustrated in Figure 8(b) [47]. With this change, the routers are shared by neighboring chiplets, so chip-
to-chip traffic and on-interposer traffic can flow simultaneously via different router ports without having
to compete for the shared links, thereby reducing queuing delays. It is a hybrid topology of both the
Double Butterfly and the Folded Torus, without increasing the router complexity or adding additional
links (which cause interposer yield loss). Following this lead, researchers proposed the Kite topology
family in [97], shown in Figure 8(c) [97], which utilizes longer links more efficiently. This results in a
reduced network diameter and higher throughput compared to [47]. It further improved the maximum
throughput by 17% versus Double Butterfly and ButterDonut. Furthermore, researchers introduced the
ClusCross topology in [98], which treats each chiplet as a cluster and adds cross-cluster long (longer
than [97]) links to increase cross-chip bandwidth. It shows a 10% latency reduction compared to the
ButterDonut. Contrary to the prior assumptions that the percentage of the active area in an interposer
should be minimized, the study [99] demonstrates how the extra “prepaid” silicon area of the interposers
can be leveraged for fault tolerance to improve yield and cost-effectiveness. Based on the same baseline
as [47, 96–98], the yield loss of a typical 32-core, four-chiplet interposer could be reduced by 0.44× by
adding fault-tolerant routers and redundant interconnects.
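The diameter reductions claimed for these topologies can be checked with a small BFS sketch. The 4×4 mesh and the added corner-to-corner links below are illustrative stand-ins, not the actual Kite or ClusCross wiring:

```python
from collections import deque

def diameter(adj):
    """Network diameter = longest shortest path, via BFS from every node."""
    best = 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

def mesh(n):
    """n x n mesh with N/E/S/W neighbor links."""
    adj = {(x, y): [] for x in range(n) for y in range(n)}
    for (x, y) in adj:
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if (x + dx, y + dy) in adj:
                adj[(x, y)].append((x + dx, y + dy))
    return adj

m = mesh(4)
base = diameter(m)   # 6 hops corner-to-corner in a 4x4 mesh

# Adding two long diagonal interposer links (in the spirit of the longer
# links exploited by Kite-style topologies) shrinks the diameter:
m[(0, 0)].append((3, 3)); m[(3, 3)].append((0, 0))
m[(0, 3)].append((3, 0)); m[(3, 0)].append((0, 3))
assert diameter(m) < base
```

A handful of well-placed long links collapses the worst-case hop count, which is the intuition behind trading interposer wiring for reduced network diameter.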
Routing schemes that rely on holistic system knowledge require system-specific configuration and
optimization of the chiplets, thus violating modularity and reusability [23]. Therefore, routing in
modular chiplet design is a key issue in the interconnection network design of chiplets. From
the perspective of routing decision implementation, it is generally divided into two routing strategies:
distributed and centralized. The following analysis will be conducted respectively.
Distributed routing. In distributed routing, each node in the network computes routes independently, typically using simple, resource-efficient routing algorithms; it is common in resource-constrained scenarios such as on-chip networks [91]. Currently, the most widely used algorithm in on-chip networks is simple dimension-order routing (DOR) [103]. DOR is a deterministic routing algorithm: all packets from node A to node B follow the same path. By imposing turn restrictions on DOR, cyclic dependencies can be avoided in the network. However, as mentioned above, for modularly designed chiplet systems, traditional DOR algorithms may still suffer from deadlock.
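As a concrete illustration, a generic textbook sketch rather than any cited implementation, XY dimension-order routing on a 2D mesh resolves the X offset fully before turning to the Y dimension, which is what makes it deterministic:

```python
def xy_route(src, dst):
    """Dimension-order (XY) routing on a 2D mesh: the packet first
    exhausts the X offset, then the Y offset, so every (src, dst)
    pair has exactly one path -- deterministic by construction."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

# Every packet from (0, 0) to (2, 1) takes the same route:
# [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because only one of the eight possible turns in each routing quadrant is ever taken, no cycle of channel dependencies can form inside a single mesh; the deadlock risk discussed above arises only when such meshes are composed across chiplet boundaries.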
Yin et al. [23] proposed modular turn restriction (MTR), which abstracts all external nodes of a chiplet as a single node and avoids deadlock loops by imposing turn restrictions on the boundary routers (BRs). The key point is that, through these abstract nodes, the network topology of other chiplets can be ignored, yielding a deadlock-free chiplet network while maintaining modularity. However, generating the turn restriction tables for BRs can be a time-consuming iterative search process. To speed up this process, a presort turn restriction algorithm (Presort-TRA) is proposed in [104]. Since the objective of the search is to minimize φ (defined as AverageDistance/AverageReachability), the proposed algorithm first presorts the candidates by their φ values and then performs the search, reducing the iterative search time by 50% [104].
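The presorting idea can be sketched as follows. The candidate fields and the deadlock-freedom check here are hypothetical placeholders; the actual search in [104] operates over turn restriction tables rather than this simplified structure:

```python
def phi(avg_distance, avg_reachability):
    # phi = AverageDistance / AverageReachability; lower is better
    return avg_distance / avg_reachability

def presort_search(candidates, is_deadlock_free):
    """Presort the candidates by phi, then scan in that order and
    return the first candidate passing the deadlock-freedom check,
    so the most promising candidates are examined first."""
    ordered = sorted(candidates,
                     key=lambda c: phi(c["avg_dist"], c["avg_reach"]))
    for cand in ordered:
        if is_deadlock_free(cand):
            return cand
    return None

# Toy candidates (fields are placeholders, not real restriction tables)
candidates = [
    {"name": "A", "avg_dist": 3.0, "avg_reach": 1.0},  # phi = 3.0
    {"name": "B", "avg_dist": 2.0, "avg_reach": 1.0},  # phi = 2.0
    {"name": "C", "avg_dist": 4.0, "avg_reach": 4.0},  # phi = 1.0
]
```

With the check `lambda c: c["name"] != "C"`, the lowest-φ candidate C is rejected and B is returned, without ever evaluating A.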
Taheri et al. [24] proposed DeFT, a deadlock-free routing algorithm based on virtual networks (VNs). Unlike [23], the turn restrictions in DeFT are imposed not on specific routers but on VNs: restricting turns in two independent VNs without cyclic dependencies, VN0 and VN1, yields a deadlock-free network. VN-based routing provides path diversity from the source to the destination node, which enables fault tolerance; this is particularly important for vertical interconnect links, which are formed during integration and have lower yield. However, VN-based routing incurs a relatively large area and power overhead, since it requires twice as many virtual channel (VC) buffers as [23]. Owing to the turn restrictions, the above routing algorithms may limit path choices, potentially causing traffic contention and ultimately degrading performance.
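The path-diversity argument can be illustrated with a toy two-VN scheme in which VN0 routes X-then-Y and VN1 routes Y-then-X. This is a simplification of DeFT's actual restriction sets, and the fault model and helper names below are our own:

```python
def route(src, dst, vn):
    """Toy VN-based routing: VN0 routes X-then-Y, VN1 routes Y-then-X.
    Each VN alone is deterministic and free of cyclic turns, and the
    two together give two candidate paths per (src, dst) pair."""
    dx, dy = dst
    x, y = src
    path = [src]
    dims = (0, 1) if vn == 0 else (1, 0)   # VN0: X first; VN1: Y first
    for d in dims:
        while (x, y)[d] != dst[d]:
            if d == 0:
                x += 1 if dx > x else -1
            else:
                y += 1 if dy > y else -1
            path.append((x, y))
    return path

def fault_tolerant_route(src, dst, broken_links):
    """Use path diversity for fault tolerance: try VN0 first, and fall
    back to VN1 if the VN0 path crosses a broken (e.g., vertical) link."""
    for vn in (0, 1):
        p = route(src, dst, vn)
        hops = list(zip(p, p[1:]))
        if not any((a, b) in broken_links or (b, a) in broken_links
                   for a, b in hops):
            return vn, p
    return None
```

If the link between (1, 0) and (2, 0) fails, the VN0 path from (0, 0) to (2, 1) becomes unusable, but the VN1 path avoids it entirely, which is exactly the fault-tolerance benefit of VN-level path diversity.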
Another way to achieve deadlock freedom is flow control, which prevents buffer occupation-request cycles from forming in the network [105]. Majumder et al. [106] proposed remote control (RC), based on the observation that each chiplet's internal network is guaranteed deadlock-free at design time, so deadlocks can only involve the inter-chiplet BRs during high congestion of outbound and inbound packets. RC uses an additional flow-control network to store outbound packets in a reserved RC buffer, thereby avoiding potential cyclic dependencies. Compared with turn-restriction-based solutions, flow-control-based solutions preserve network path diversity and achieve higher throughput. However, they increase latency due to the injection of control messages, and the additional flow-control network and the RC buffer, which must cache entire outbound packets, increase area and power consumption.
Furthermore, there is another approach, called upward packet popup (UPP) [107], that allows deadlocks to occur and periodically recovers from them. Under the same assumption as [23, 24, 106] that all chiplets are designed to be deadlock-free, the researchers in [107] observe that any newly formed deadlock always involves BRs, and in particular an upward packet. The deadlock at the BRs can therefore be detected and broken periodically: the blocked packet is routed to its destination through reserved circuit-switched channels inside the chiplets, thereby achieving deadlock recovery. Moreover, this detect-and-break strategy avoids both router turn restrictions and control injection [107], so it can achieve better throughput and latency than turn-restriction-based approaches when the traffic load and the probability of deadlock are low. However, accurately and promptly locating the router where a deadlock occurs is challenging, especially when many routers are vulnerable, and the deadlock recovery phase inevitably increases latency.
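A minimal sketch of the detect-and-break idea, assuming a simple stall-counter heuristic; the threshold mechanism and names here are illustrative, not UPP's actual detection logic:

```python
class BRDeadlockMonitor:
    """Detect-and-break sketch: if the head packet of a boundary router
    has not advanced for `threshold` consecutive cycles, declare a
    deadlock and drain the packet through a reserved escape channel
    (standing in for UPP's circuit-switched channels)."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.stall_cycles = 0
        self.escape_channel = []

    def tick(self, head_packet, advanced):
        # Reset the stall counter on progress, otherwise count up.
        self.stall_cycles = 0 if advanced else self.stall_cycles + 1
        if head_packet is not None and self.stall_cycles >= self.threshold:
            self.escape_channel.append(head_packet)  # break the cycle
            self.stall_cycles = 0
            return "recovered"
        return "ok"
```

The trade-off discussed above is visible even here: choosing the threshold too low drains packets that are merely congested, while choosing it too high delays recovery.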
Centralized routing. Centralized routing can observe the working status of each node in the network, making it easier to implement adaptive routing and thus to achieve deadlock freedom. For instance, a typical NoC architecture that employs centralized routing is the software-defined NoC (SDNoC) [108–110]. Specifically, the
Liu Y F, et al. Sci China Inf Sci October 2024, Vol. 67, Iss. 10, 200401:15
Figure 9 (Color online) (a) SDNoC within a monolithic chip [108], Copyright 2017 Elsevier; (b) SDNoC architecture within a chiplet [111], Copyright 2019 IEEE.
control network and the data network are physically separated, and the data network is controlled by a centralized network manager (NM) via the control network [108]. The software obtains routing information through the control plane and performs path computation, enabling flexible adaptive routing. Unfortunately, the complex control network introduces communication overhead. To reduce this overhead in chiplet-based systems, a lightweight SDNoC communication protocol called MicroLET is proposed in [111], as shown in Figure 9(b). The protocol consists of a new simplified message stack developed specifically for micro-scale networks and three main phases: handshaking, network monitoring, and routing. Relying on the centralized management mechanism and a turn-model routing algorithm, the system ensures deadlock freedom.
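The advantage of a global view can be sketched with the kind of congestion-weighted shortest-path computation an NM might perform. This is a generic Dijkstra illustration, not MicroLET's actual path-computation algorithm, and the link-cost dictionary is hypothetical:

```python
import heapq

def nm_compute_path(links, src, dst):
    """Centralized path computation sketch: the network manager holds a
    global view of link costs (e.g., congestion-weighted) and runs
    Dijkstra, adapting paths in a way distributed deterministic
    routing cannot."""
    adj = {}
    for (u, v), w in links.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist, prev = {src: 0}, {}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    # Walk the predecessor chain back from dst to recover the path.
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1]
```

Given link weights reflecting congestion, the NM steers traffic around a heavily loaded direct link onto a lighter multi-hop route, something a fixed deterministic algorithm at each router cannot do.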
Table 2 compares the different deadlock-free routing schemes [23, 24, 106–111]. Centralized routing adds complexity to system design because it requires an additional control network for routing and traffic control. Nevertheless, it offers a number of advantages owing to its adaptability. For example, adaptive routing can dynamically select the path from the source node to the destination node based on the network's traffic conditions, which deterministic routing cannot do, and it can exploit source-to-destination path diversity to mitigate network congestion and failures. More importantly, this adaptability makes centralized routing more practical than distributed routing for custom or irregular topologies.
4 Discussion
In this study, we have reviewed the three interrelated aspects of package scheme, system architecture, and
interconnection design. It is undeniable that cost reduction remains one of the most important drivers
for the development of most commercial products. Emerging advanced packaging technologies such as
copper-copper hybrid bonding show notable advantages in high-performance interconnects and will be
more widely adopted as manufacturing processes mature and costs decrease. This, in turn, will stimulate
innovation in system architecture and interconnection design.
In the aspect of system architecture design, the package aggregation approach benefits from minimal design effort and low technical risk, while SoC disaggregation represents a future trend, employing various disaggregation and recombination schemes to obtain a better PPC. Disaggregating an SoC into homogeneous chiplets with integrated fabric is a relatively straightforward partitioning scheme that can further increase system integration, but it faces limitations in scalability and inconsistent latency. Separating the interconnect fabric from the compute die offers better scalability and flexibility, as well as more consistent inter-chiplet latency; additionally, it helps conserve advanced-process chip area, reduce system cost, and simplify compute and substrate design. For MPSoCs, disaggregation into multiple heterogeneous tiles (chiplets) based on functional units, not only into compute, memory, and fabric dies, is a flexible and scalable approach that is promising for building high-energy-efficiency systems and achieving product serialization. However, the scale of a system based on a fabric or base die is still limited by the size of that die. One way to resolve this limitation is hybrid packaging, which connects base dies through 2.5D packaging such as a silicon bridge. The tile-based chiplet architecture also provides a flexible and scalable solution for large-scale parallel processing, but it faces the off-chip bandwidth wall that limits scaling. To address this challenge, state-of-the-art chiplet-based systems employ distributed memory architectures, bringing memory closer to the compute chiplets both horizontally and vertically and thereby unlocking significant bandwidth between memory and compute.
In the aspect of interconnection design, the separation of the fabric die facilitates the implementation of multiple topologies. In addition, several studies based on active silicon interposers demonstrate that interconnect performance can be improved by making better use of the routing resources on the interposer. Comparing deadlock-free routing schemes, turn-restriction-based schemes have the advantage of simplicity and minimal overhead, whereas flow-control- and deadlock-recovery-based schemes show better performance but tend to be more complex in design. Centralized routing schemes offer flexibility in achieving deadlock freedom, especially for irregular topologies; although centralized routing may introduce higher overhead, this can be minimized by implementing lightweight control networks. Centralized routing is therefore an attractive approach in chiplet-based design.
According to this survey, chiplet-based systems are currently dominated by homogeneous multicore processors, mainly high-performance processors. Splitting a chip into smaller, finer-grained chiplets would significantly improve yield, and this applies to MPSoCs as well. However, chiplet-based designs are not yet common in MPSoCs, likely due to obstacles such as packaging and design costs and the lack of an off-the-shelf chiplet ecosystem. Nevertheless, these obstacles are
now changing. The differentiation and customization of heterogeneous chip products have become a new driving factor, and fine-grained chiplet splitting will prove cost-effective. Moreover, with the widespread adoption of chiplet interface standards, more modular chiplet designs will emerge. This will lead to a design ecosystem based on commercial off-the-shelf chiplets and, in turn, a boom in chiplet-based designs.
In addition, platformization is a significant trend in the evolution of chiplet-based ecosystems. The fabric die, the core component that provides interconnectivity between chiplets, demonstrates the ability to develop bespoke solutions from reusable chiplets. If the fabric die itself can be reused, it helps mitigate costs and shorten development cycles, which motivates fabric dies with reconfigurable topologies, also called active interposers. However, an overly fine-grained reconfigurable active interposer may not be the optimal design choice, as it requires significant pre-configured redundant circuitry, leading to power and area overheads. Finally, many works use automation tools to analyze or optimize the topology or partition scheme, as well as to perform co-optimization across layers. Since these factors impact the PPAC of a chiplet-based system, EDA technologies that perform architectural design-space exploration are essential for chiplet-based design.
5 Conclusion
Chiplet-based design represents a key technology for the post-Moore's law era, offering a cost-effective, highly integrated, and agile design approach. Given the difficulties that such systems must overcome, i.e., choosing the partition scheme, addressing the cross-die communication bottleneck, and ensuring deadlock-free chiplet networks, chiplet-based system design is by no means an easy task. Consequently, a plethora of research papers addressing these challenges have been presented, and a state-of-the-art review can be of great help in summarizing the work done to this point.
In this paper, we have presented a survey of state-of-the-art studies using a classification-based method, from the perspective of chiplet-based system designers. We cast the chiplet design challenge as architecture design and interconnect design, and subcategorized each by design task. Accordingly, we reviewed and analyzed the literature while explicitly elucidating the design principles of the different subcategories. Moreover, in the discussion section we presented our viewpoint on chiplet-based design, based on the reviewed papers as well as our own experience, and pointed out current weaknesses and possible future trends.
The primary conclusion is that the last decade has seen an enormous research effort in chiplet-based design. However, we are still a long way from a chiplet-based system that delivers all of the promised benefits: cost effectiveness, high performance, low power, and agility. Significant progress has been made in the area of HPC, mainly through the "disintegration" of the monolithic die onto interposers to address cost and yield issues. Moreover, the research and analysis show that co-optimization of package, architecture, and interconnect is an important trend that has demonstrated significant improvements in system performance. We have argued that this responsibility must be shared collaboratively by chip designers, package designers, and EDA vendors, among others. Altogether, we are optimistic about future achievements in this field.
Acknowledgements This work was supported by Shenzhen Science and Technology Program (Grant No. JCYJ20200109143003935).
References
1 Hennessy J L, Patterson D A. A new golden age for computer architecture. Commun ACM, 2019, 62: 48–60
2 Hao Y, Xiang S Y, Han G Q, et al. Recent progress of integrated circuits and optoelectronic chips. Sci China Inf Sci, 2021,
64: 201401
3 Naffziger S, Beck N, Burd T, et al. Pioneering chiplet technology and design for the AMD EPYC™ and Ryzen™ processor families: industrial product. In: Proceedings of the ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021. 57–70
4 EPS I. Chiplet definition-IEEE electronics packaging society. https://eps.ieee.org/technology/definitions.html
5 Nagisetty R. The Path to a Chiplet Ecosystem. Technical Report, ODSA Workshop, 2019
6 Loh G H, Naffziger S, Lepak K. Understanding chiplets today to anticipate future integration opportunities and limits.
In: Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), 2021. 142–145
7 Feng Y X, Ma K S. Chiplet actuary: a quantitative cost model and multi-chiplet architecture exploration. In: Proceedings
of the 59th ACM/IEEE Design Automation Conference, 2022. 121–126
8 Suggs D, Subramony M, Bouvier D. The AMD “Zen 2” processor. IEEE Micro, 2020, 40: 45–52
9 Beck N, White S, Paraschou M, et al. “Zeppelin”: an SoC for multichip architectures. In: Proceedings of the IEEE
International Solid-State Circuits Conference (ISSCC), 2018. 40–42
10 Naffziger S, Lepak K, Paraschou M, et al. 2.2 AMD chiplet architecture for high-performance server and desktop products.
In: Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), 2020. 44–45
11 Nassif N, Munch A O, Molnar C L, et al. Sapphire rapids: the next-generation Intel Xeon scalable processor. In: Proceedings
of the IEEE International Solid-State Circuits Conference (ISSCC), 2022. 44–46
12 Lin M S, Huang T C, Tsai C C, et al. A 7-nm 4-GHz Arm®-core-based CoWoS® chiplet design for high-performance computing. IEEE J Solid-State Circ, 2020, 55: 956–966
13 Lau J H. Chiplet Heterogeneous Integration. Singapore: Springer Singapore, 2021. 413–439
14 Li T, Hou J, Yan J L, et al. Chiplet heterogeneous integration technology-status and challenges. Electronics, 2020, 9: 670
15 Jiang J, Wang Q, He G, et al. Research and prospect on chiplet technology. Microelectron Comput, 2022, 39: 1–6
16 Ma X H, Wang Y, Wang Y J, et al. Survey on chiplets: interface, interconnect and integration methodology. CCF Trans
HPC, 2022, 4: 43–52
17 Shan G B, Zheng Y W, Xing C Y, et al. Architecture of computing system based on chiplet. Micromachines, 2022, 13: 205
18 Lau J H. Recent advances and trends in advanced packaging. IEEE Trans Compon Packag Manufact Technol, 2022, 12:
228–252
19 EPS I. Heterogeneous integration roadmap-IEEE electronics packaging society. https://eps.ieee.org/hir
20 Lau J H. State-of-the-art of advanced packaging. In: Proceedings of the Chiplet Design and Heterogeneous Integration
Packaging, 2023. 1–99
21 Stow D, Akgun I, Barnes R, et al. Cost analysis and cost-driven IP reuse methodology for SoC design based on 2.5D/3D
integration. In: Proceedings of the 35th International Conference on Computer-Aided Design, 2016. 1–6
22 Kim J, Murali G, Park H, et al. Architecture, chip, and package co-design flow for 2.5D IC design enabling heterogeneous
IP reuse. In: Proceedings of the 56th Annual Design Automation Conference 2019, 2019. 1–6
23 Yin J M, Lin Z F, Kayiran O, et al. Modular routing design for chiplet-based systems. In: Proceedings of the ACM/IEEE
45th Annual International Symposium on Computer Architecture (ISCA), 2018. 726–738
24 Taheri E, Pasricha S, Nikdast M. DeFT: a deadlock-free and fault-tolerant routing algorithm for 2.5D chiplet networks.
In: Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), 2022. 1047–1052
25 Chen W C, Hu C, Ting K C, et al. Wafer level integration of an advanced logic-memory system through 2nd generation CoWoS® technology. In: Proceedings of the Symposium on VLSI Technology, 2017. 54–55
26 Wuu J, Agarwal R, Ciraula M, et al. 3D V-cache: the implementation of a hybrid-bonded 64 MB stacked cache for a 7 nm
x86-64 CPU. In: Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), 2022, 65: 428–429
27 Gomes W, Khushu S, Ingerly D B, et al. 8.1 Lakefield and mobility compute: a 3D stacked 10 nm and 22 FFL hybrid processor system in 12×12 mm², 1 mm package-on-package. In: Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), 2020. 144–146
28 Ingerly D B, Enamul K, Gomes W, et al. Foveros: 3D integration and the use of face-to-face chip stacking for logic devices.
In: Proceedings of the IEEE International Electron Devices Meeting (IEDM), 2019
29 Drucker K, Jani D, Agarwal I, et al. The open domain-specific architecture. In: Proceedings of the IEEE Symposium on
High-Performance Interconnects (HOTI), 2020. 25–32
30 Qian X H. Graph processing and machine learning architectures with emerging memory technologies: a survey. Sci China
Inf Sci, 2021, 64: 160401
31 Zou X Q, Xu S, Chen X M, et al. Breaking the von Neumann bottleneck: architecture-level processing-in-memory technology.
Sci China Inf Sci, 2021, 64: 160404
32 Zhao Y L, Yang J L, Li B, et al. NAND-SPIN-based processing-in-MRAM architecture for convolutional neural network
acceleration. Sci China Inf Sci, 2023, 66: 142401
33 An J J, Wang L F, Ye W, et al. Design memristor-based computing-in-memory for AI accelerators considering the interplay
between devices, circuits, and system. Sci China Inf Sci, 2023, 66: 182404
34 Lee H J, Mahajan R, Sheikh F, et al. Multi-die integration using advanced packaging technologies. In: Proceedings of the
IEEE Custom Integrated Circuits Conference (CICC), 2020. 1–7
35 Kurd N, Chowdhury M, Burton E, et al. 5.9 Haswell: a family of IA 22nm processors. In: Proceedings of the IEEE
International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014. 112–113
36 Hamzaoglu F, Arslan U, Bisnik N, et al. 13.1 A 1 Gb 2 GHz embedded DRAM in 22 nm tri-gate CMOS technology.
In: Proceedings of the IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014. 230–
231
37 Hammarlund P, Martinez A J, Bajwa A A, et al. Haswell: the 4th-generation Intel core processor. IEEE Micro, 2014, 34:
6–20
38 Macri J. AMD’s next generation GPU and high bandwidth memory architecture: fury. In: Proceedings of the IEEE Hot
Chips 27 Symposium (HCS), 2015. 1–26
39 Lee D U, Kim K W, Kim K W, et al. A 1.2 V 8 Gb 8-Channel 128 GB/s high-bandwidth memory (HBM) stacked DRAM
with effective I/O test circuits. IEEE J Solid-State Circ, 2015, 50: 191–203
40 Lee C C, Hung C, Cheung C, et al. An overview of the development of a GPU with integrated HBM on silicon interposer.
In: Proceedings of the IEEE 66th Electronic Components and Technology Conference (ECTC), 2016. 1439–1444
41 Liao H, Tu J J, Xia J, et al. DaVinci: a scalable architecture for neural network computing. In: Proceedings of the IEEE
Hot Chips 31 Symposium (HCS), 2019. 1–44
42 Kim S, Kim S, Cho K, et al. Processing-in-memory in high bandwidth memory (PIM-HBM) architecture with energy-efficient
and low latency channels for high bandwidth system. In: Proceedings of the IEEE 28th Conference on Electrical Performance
of Electronic Packaging and Systems (EPEPS), 2019. 1–3
43 Viswanath R, Chandrasekhar A, Srinivasan S, et al. Heterogeneous SoC integration with EMIB. In: Proceedings of the
IEEE Electrical Design of Advanced Packaging and Systems Symposium (EDAPS), 2018. 1–3
44 Finnegan J, Viljoen N. Fourth generation architecture for smartNICs. In: Proceedings of the SmartNICs Summit, 2022
45 Sheffield D. IvyTown Xeon+FPGA: the HARP program. In: Proceedings of the International Symposium on Computer
Architecture (ISCA), 2016
46 Hwang R, Kim T, Kwon Y, et al. Centaur: a chiplet-based, hybrid sparse-dense accelerator for personalized recommendations.
In: Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020. 968–981
47 Kannan A, Jerger N E, Loh G H. Enabling interposer-based disintegration of multi-core processors. In: Proceedings of the
48th International Symposium on Microarchitecture, 2015. 546–558
48 Burd T, Beck N, White S, et al. “Zeppelin”: an SoC for multichip architectures. IEEE J Solid-State Circ, 2019, 54: 133–143
49 Huang P K, Lu C Y, Wei W H, et al. Wafer level system integration of the fifth generation CoWoS®-S with high performance Si interposer at 2500 mm². In: Proceedings of the IEEE 71st Electronic Components and Technology Conference (ECTC), 2021. 101–104
50 Mattioli M. Meet the fam1ly. IEEE Micro, 2022, 42: 78–84
51 Biswas A. Sapphire rapids. In: Proceedings of the IEEE Hot Chips 33 Symposium (HCS), Palo Alto, 2021. 1–22
52 Mahajan R, Sankman R, Patel N, et al. Embedded multi-die interconnect bridge (EMIB) – a high density, high bandwidth
packaging interconnect. In: Proceedings of the IEEE 66th Electronic Components and Technology Conference (ECTC),
2016. 557–565
53 Mekkat V, Holey A, Yew P C, et al. Managing shared last-level cache in a heterogeneous multicore processor. In: Proceedings
of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 2013. 225–234
54 Lameter C. NUMA (non-uniform memory access): an overview. Queue, 2013, 11: 40–51
55 Asanović K, Avizienis R, Bachrach J, et al. The Rocket chip generator. Technical Report UCB/EECS-2016-17, 2016
56 Ishii A, Foley D, Anderson E, et al. Nvswitch and Dgx-2 Nvlink-switching chip and scale-up compute server. In: Proceedings
of the Hot Chips, 2018
57 Ishii A, Wells R. The nvlink-network switch: Nvidia’s switch chip for high communication-bandwidth superpods. In: Pro-
ceedings of the IEEE Hot Chips 34 Symposium (HCS), 2022. 1–23
58 Vivet P, Guthmuller E, Thonnart Y, et al. IntAct: a 96-Core processor with 6 chiplets 3D-Stacked on an active interposer
with distributed interconnects and integrated power management. IEEE J Solid-State Circ, 2021, 56: 79–97
59 Gomes W, Morgan S, Phelps B, et al. Meteor lake and Arrow lake Intel next-gen 3D client architecture platform with
Foveros. In: Proceedings of the IEEE Hot Chips 34 Symposium (HCS), 2022. 1–40
60 Ilderem V, Pellerano S, Tschanz J, et al. Innovations for intelligent edge. In: Proceedings of the 48th European Solid State
Circuits Conference (ESSCIRC), 2022. 41–44
61 Blythe D. Xe hpc ponte vecchio. In: Proceedings of the IEEE Hot Chips 33 Symposium (HCS), Palo Alto, 2021. 1–34
62 Gomes W, Koker A, Stover P, et al. Ponte vecchio: a multi-tile 3D stacked processor for exascale computing. In: Proceedings
of the IEEE International Solid-State Circuits Conference (ISSCC), 2022. 42–44
63 Gianos C. Architecting for flexibility and value with next gen Intel® Xeon® processors. In: Proceedings of the IEEE Hot Chips 35 Symposium (HCS), 2023. 1–15
64 Munoz R. Furthering Moore’s law integration benefits in the chiplet era. IEEE Design Test, 2024, 41: 81–90
65 Ehrett P, Austin T, Bertacco V. Chopin: composing cost-effective custom chips with algorithmic chiplets. In: Proceedings
of the 39th International Conference on Computer Design (ICCD), 2021. 395–399
66 Zimmer B, Venkatesan R, Shao Y S, et al. A 0.11 pJ/Op, 0.32–128 TOPS, scalable multi-chip-module-based deep neural
network accelerator with ground-reference signaling in 16 nm. In: Proceedings of the Symposium on VLSI Circuits, 2019.
300–301
67 Zimmer B, Venkatesan R, Shao Y S, et al. A 0.32–128 TOPS, scalable multi-chip-module-based deep neural network inference
accelerator with ground-referenced signaling in 16 nm. IEEE J Solid-State Circ, 2020, 55: 920–932
68 Pei J, Deng L, Ma C, et al. Multi-grained system integration for hybrid-paradigm brain-inspired computing. Sci China Inf
Sci, 2023, 66: 142403
69 Shao Y S, Clemons J, Venkatesan R, et al. SIMBA: scaling deep-learning inference with multi-chip-module-based architecture.
In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019. 14–27
70 Ahsan B. Off-chip bandwidth for multicore processors: managing the next big wall. Dissertation for Ph.D. Degree. New
York: The City University of New York, 2010
71 Jang H, Kim J, Gratz P, et al. Bandwidth-efficient on-chip interconnect designs for GPGPUs. In: Proceedings of the 52nd
Annual Design Automation Conference, 2015
72 Park M J, Cho H S, Yun T S, et al. A 192 Gb 12-high 896 GB/s HBM3 DRAM with a TSV auto-calibration scheme
and machine-learning-based layout optimization. In: Proceedings of the IEEE International Solid-State Circuits Conference
(ISSCC), 2022. 444–446
73 Imani M, Gupta S, Kim Y, et al. Floatpim: in-memory acceleration of deep neural network training with high precision.
In: Proceedings of the 46th International Symposium on Computer Architecture, 2019. 802–815
74 Krishnan G, Mandal S K, Pannala M, et al. SIAM: chiplet-based scalable in-memory acceleration with mesh for deep neural
networks. ACM Trans Embed Comput Syst, 2021, 20: 1–24
75 Dou C M, Chen W H, Xue C X, et al. Nonvolatile circuits-devices interaction for memory, logic and artificial intelligence.
In: Proceedings of the IEEE Symposium on VLSI Technology, 2018. 171–172
76 Zheng H, Wang K, Louri A. A versatile and flexible chiplet-based system design for heterogeneous manycore architectures.
In: Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC), 2020. 1–6
77 Jouppi N P, HyunYoon D, Ashcraft M, et al. Ten lessons from three generations shaped Google’s TPUv4i: industrial product.
In: Proceedings of the ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021. 1–14
78 Burd T, Li W, Pistole J, et al. Zen3: the AMD 2nd-generation 7 nm x86-64 microprocessor core. In: Proceedings of the
IEEE International Solid-State Circuits Conference (ISSCC), 2022. 1–3
79 Evers M, Barnes L, Clark M. The AMD next-generation “Zen 3” core. IEEE Micro, 2022, 42: 7–12
80 Loh G H. 3D-stacked memory architectures for multi-core processors. In: Proceedings of the International Symposium on
Computer Architecture, 2008. 453–464
81 Huang C H, Thakkar I G. Improving the latency-area tradeoffs for DRAM design with coarse-grained monolithic 3D (M3D)
integration. In: Proceedings of the IEEE 38th International Conference on Computer Design (ICCD), 2020. 417–420
82 Park J, Lee B, Lee H, et al. Wafer to wafer hybrid bonding for DRAM applications. In: Proceedings of the IEEE 72nd
Electronic Components and Technology Conference (ECTC), 2022. 126–129
83 Das Sharma D, Pasdast G, Qian Z, et al. Universal chiplet interconnect express (UCIe): an open industry standard for
innovations with chiplets at package level. IEEE Trans Compon Packag Manufact Technol, 2022, 12: 1423–1431
84 Sharma D D. System on a package innovations with universal chiplet interconnect express (UCIe) interconnect. IEEE Micro,
2023, 43: 76–85
85 Li F P, Wang Y, Cheng Y Q, et al. GIA: a reusable general interposer architecture for agile chiplet integration. In:
Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, 2022. 1–9
86 ChanMok M P, Chan C H, ShuiChow W C, et al. Chiplet-based system-on-chip for edge artificial intelligence. In: Proceedings
of the 5th IEEE Electron Devices Technology & Manufacturing Conference (EDTM), 2021. 1–3
87 Nasrullah J, Luo Z Q, Taylor G. Designing software configurable chips and SIPs using chiplets and zGlue. Int Symp