
KATO: Knowledge Alignment And Transfer for Transistor Sizing of Different Design and Technology

Wei W. Xing2, Weijian Fan1,3, Zhuohua Liu3, Yuan Yao4 and Yuanqi Hu4*
1 Eastern Institute of Technology, Ningbo, China
2 The University of Sheffield, U.K.
3 College of Mechatronics and Control Engineering, Shenzhen University, Shenzhen, China
4 School of Integrated Circuit Science and Engineering, Beihang University, Beijing, China
* Corresponding author. This work is supported by NSFC U23A20352 and 62271020.

arXiv:2404.14433v1 [cs.LG] 19 Apr 2024

ABSTRACT
Automatic transistor sizing in circuit design continues to be a formidable challenge. Although Bayesian optimization (BO) has achieved significant success, it is circuit-specific, limiting the accumulation and transfer of design knowledge for broader applications. This paper proposes (1) efficient automatic kernel construction, (2) the first transfer learning across different circuits and technology nodes for BO, and (3) a selective transfer learning scheme that ensures only useful knowledge is utilized. These three novel components are integrated into BO with the Multi-objective Acquisition Ensemble (MACE) to form Knowledge Alignment and Transfer Optimization (KATO), which delivers state-of-the-art performance: up to 2x simulation reduction and 1.2x design improvement over the baselines.

KEYWORDS
Transistor Sizing, Transfer Learning, Bayesian Optimization

1 INTRODUCTION
The ongoing miniaturization of circuit devices within integrated circuits (ICs) introduces increasing complexity to the design process, primarily due to the amplified impact of parasitics that can no longer be overlooked. In contrast to digital circuit design, where the use of standardized cells and electronic design automation (EDA) tools streamlines the process, analog and mixed-signal (AMS) designs still predominantly rely on the nuanced expertise of designers [4]. This reliance on analog design experts is not only costly but also introduces the risk of inconsistency and bias. Such challenges can lead to inadequate exploration of the design space and extended design times. As a consequence, they often result in sub-optimal designs in terms of power, performance, and area (PPA), underscoring the need for more efficient and objective design methodologies.

Analog circuit design involves two main steps: designing the circuit topology and sizing the transistors. The former is normally guided by expert knowledge, while the latter is labor-intensive and time-consuming. Thus, transistor sizing is normally solved by algorithms, allowing designers to dedicate their expertise to the nuanced aspects of topology selection.

Machine learning advancements have heightened interest in the application of reinforcement learning (RL) to transistor sizing. Specifically, RL agents are trained to optimize the transistor sizing process by rewarding the agent with better designs that deliver higher performance. Wang et al. [11] introduce a GCN-RL circuit designer that utilizes RL to transfer knowledge efficiently across various technology nodes and topologies, with a graph convolutional network (GCN) capturing the circuit topology, resulting in enhanced Figures of Merit (FOM) for a range of circuits. Li et al. [5] expand on this by incorporating a circuit attention network and a stochastic method to diminish layout effects, while Settaluri et al. [8] utilize a multi-agent approach for sub-block tuning in a larger circuit sizing task.

While RL's promise is clear, it has the following limitations: (1) its inherent design for complex Markov decision processes makes it an overly complex solution for the relatively straightforward optimization task of transistor sizing, leading to a great demand for design and simulation data; (2) RL's complexity and high computational cost may be excessive for smaller teams with limited resources; and (3) the trained model might contain commercially sensitive design information that can be exploited by competitors.

Bayesian optimization (BO) stands in contrast to RL, being a tried-and-tested optimization method for intricate EDA tasks. Its popularity stems from its efficiency, stability, and the fact that it does not require pretraining, making it more accessible for immediate deployment. To solve the transistor sizing challenge as a straightforward optimization problem, Lyu et al. [6] introduce weighted expected improvement (EI), transforming the constrained optimization of transistor sizing into a single-objective optimization task. Building upon this, Lyu et al. [7] present the Multi-objective Acquisition Ensemble (MACE) to facilitate massive parallel simulation, which is essential to harness the power of cluster computing. To resolve the large-scale transistor sizing challenge, Touloupas et al. [10] implement multiple local Gaussian processes (GPs) with GPU acceleration to significantly improve BO scalability. The choice of kernel function is pivotal to BO's performance. As a remedy, Bai et al. [1] propose deep kernel learning (DKL) to construct automatic GPs for efficient design space exploration.

Recognizing the similarities in transistor sizing across different technology nodes, Zhang et al. [13] utilize a Gaussian Copula to correlate technology nodes with multi-objective BO, aiming to improve the search for the Pareto frontier.

Despite its notable achievements, BO faces several challenges that impede its full potential in the application of transistor sizing: (1) DKL is powerful but data-intensive and demands meticulous neural network design and training; (2) transfer learning in BO is only possible across technology nodes and not across different designs; and (3) transfer learning in practice may degrade performance, and it is unclear when transfer learning should or should not be used. We propose Knowledge Alignment and Transfer Optimization (KATO) to resolve these challenges simultaneously. The novelty of this work includes:

(1) As far as the authors are aware, KATO is the first BO sizing method that can transfer knowledge across different circuit designs and technology nodes simultaneously.
(2) To ensure positive transfer learning, we propose a simple yet effective Bayesian selection strategy in the BO pipeline.
(3) As an alternative to DKL, we propose a Neural Kernel (Neuk), which is more powerful and stable for BO.
(4) KATO is validated on practical analog designs against state-of-the-art methods over multiple experiment setups, showcasing a 2x speedup and 1.2x performance enhancement.
2 BACKGROUND
2.1 Problem Definition
Transistor sizing is typically formulated as a constrained optimization problem. The goal is to maximize a specific performance metric while ensuring that various other metrics meet predefined constraints, i.e.,

argmax_x f_0(x)   s.t.   f_i(x) ≥ C_i, ∀i ∈ {1, ..., N_c}.   (1)

Here, x ∈ R^d represents the design variables, f_i(x) calculates the i-th performance metric, and C_i denotes the required minimum value for that metric. Given the complexity of solving this constrained optimization problem, many works convert it into an unconstrained optimization problem by defining a Figure of Merit (FOM) that combines the performance metrics. For instance, [11] defines

FOM(x) = Σ_{i=0}^{N_c} w_i · (min(f_i(x), f_i^bound) − f_i^min) / (f_i^max − f_i^min),   (2)

where f_i^bound is the pre-determined limit, f_i^max and f_i^min are the maximum and minimum values obtained through 10,000 random samples, and w_i is −1 or 1 depending on whether the metric is to be maximized or minimized. The FOM method is less desirable due to the difficulty of setting all its hyperparameters properly.
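As an illustration of Eq. (2), the sketch below computes such a FOM from a vector of simulated metrics; the metric names, bounds, and normalization values are hypothetical placeholders rather than the settings used in this paper.

```python
import numpy as np

def fom(metrics, weights, bounds, f_min, f_max):
    """Figure of Merit in the spirit of Eq. (2): clip each metric at its bound,
    normalize by the min/max observed over random samples, and form a weighted sum."""
    clipped = np.minimum(metrics, bounds)               # min(f_i(x), f_i^bound)
    normalized = (clipped - f_min) / (f_max - f_min)    # scale each metric to roughly [0, 1]
    return float(np.sum(weights * normalized))

# Hypothetical two-metric example: gain (to maximize) and total current (to minimize).
metrics = np.array([72.0, 150e-6])      # [gain_dB, I_total_A] returned by a simulator
weights = np.array([1.0, -1.0])         # w_i = +/-1 depending on the metric's direction
bounds  = np.array([80.0, 1e-3])        # pre-determined limits f_i^bound
f_min   = np.array([40.0, 50e-6])       # min over random samples (assumed values)
f_max   = np.array([90.0, 500e-6])      # max over random samples (assumed values)
print(fom(metrics, weights, bounds, f_min, f_max))
```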
2.2 Gaussian Process
A Gaussian process (GP) is a common choice of surrogate model for building the input-output mapping of complex computer code, owing to its flexibility and uncertainty quantification. We can approximate the black-box function f_0(x) by placing a GP prior: f_0(x)|θ ∼ GP(m(x), k(x, x'|θ)), where the mean function is normally assumed to be zero, i.e., m_0(x) ≡ 0, by virtue of centering the data. The covariance function can take many forms; the most common is the automatic relevance determination (ARD) kernel k(x, x'|θ) = θ_0 exp(−(x − x')^T diag(θ_1, ..., θ_l)(x − x')), with θ denoting the hyperparameters. Other kernels such as the Rational Quadratic (RQ), Periodic (PER), and Matern kernels can also be used depending on the application.

For any given design variable x, f(x) is now considered a random variable, and multiple observations f(x_i) form a joint Gaussian with covariance matrix K = [K_ij], where K_ij = k(x_i, x_j). Accounting for model inadequacy and numerical noise ε ∼ N(0, σ²), the log-likelihood L is:

L = −(1/2) y^T (K + σ²I)^{−1} y − (1/2) ln|K + σ²I| − (1/2) log(2π).   (3)

The hyperparameters θ in the kernels are estimated via maximum likelihood estimation (MLE) of L. Conditioning on y gives a predictive posterior f̂(x) ∼ N(μ(x), v(x)):

μ(x) = k^T(x) (K + σ²I)^{−1} y;   v(x) = k(x, x) − k^T(x) (K + σ²I)^{−1} k(x).   (4)
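For reference, the following minimal NumPy sketch implements the zero-mean GP regression of Eqs. (3)-(4) with an ARD kernel; it is illustrative only and is not the paper's (PyTorch-based) implementation.

```python
import numpy as np

def ard_kernel(X1, X2, theta0, lengthscales):
    """ARD kernel k(x, x') = theta0 * exp(-(x - x')^T diag(l) (x - x'))."""
    diff = X1[:, None, :] - X2[None, :, :]                      # shape (n1, n2, d)
    return theta0 * np.exp(-np.sum(lengthscales * diff**2, axis=-1))

def gp_posterior(X_train, y_train, X_test, theta0, lengthscales, noise_var):
    """Predictive mean and variance of Eq. (4) for a zero-mean GP."""
    K = ard_kernel(X_train, X_train, theta0, lengthscales) + noise_var * np.eye(len(X_train))
    K_inv = np.linalg.inv(K)
    k_star = ard_kernel(X_train, X_test, theta0, lengthscales)  # shape (n_train, n_test)
    mu = k_star.T @ K_inv @ y_train
    var = ard_kernel(X_test, X_test, theta0, lengthscales).diagonal() \
          - np.sum(k_star * (K_inv @ k_star), axis=0)
    return mu, var

# Toy usage with centered observations y (the zero-mean assumption of Section 2.2).
X = np.random.rand(20, 3); y = np.sin(X.sum(axis=1)); y -= y.mean()
mu, var = gp_posterior(X, y, np.random.rand(5, 3),
                       theta0=1.0, lengthscales=np.ones(3), noise_var=1e-4)
```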
2.3 Bayesian Optimization
To maximize f(x), we can optimize x by sequentially querying points such that each point shows an improvement I(x) = max(f̂(x) − y†, 0), where y† is the current optimum and f̂(x) is the predictive posterior in Eq. (4). The probability that x gives an improvement is

PI(x) = Φ((μ(x) − y†) / σ(x)),   (5)

which is the probability of improvement (PI). For a more informative criterion, we can take the expected improvement (EI) over the predictive posterior:

EI(x) = (μ(x) − y†) ψ(u(x)) + v(x) φ(u(x)),   (6)

where ψ(·) and φ(·) are the probability density function (PDF) and cumulative density function (CDF) of the standard normal distribution, respectively.

The candidate for the next iteration is selected by argmax_{x∈X} EI(x) using gradient-based methods, e.g., L-BFGS-B. Rather than looking only at the expected improvement, we can approach the optimum by exploring the areas with higher uncertainty, a.k.a. the upper confidence bound (UCB):

UCB(x) = μ(x) + β v(x),   (7)

where β controls the tradeoff between exploration and exploitation.
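The acquisition functions of Eqs. (5)-(7) reduce to a few lines on top of a GP posterior. The sketch below is an illustrative implementation, not the paper's code; it uses the common closed form of EI written through the standardized score u, and it scales the UCB exploration term by the predictive standard deviation, a standard variant of Eq. (7).

```python
import numpy as np
from scipy.stats import norm

def acquisitions(mu, var, y_best, beta=2.0):
    """PI, EI, and UCB for a maximization task, given GP posterior mean/variance arrays."""
    sigma = np.sqrt(np.maximum(var, 1e-12))     # guard against numerically zero variance
    u = (mu - y_best) / sigma                   # standardized improvement score
    pi = norm.cdf(u)                            # Eq. (5)
    ei = (mu - y_best) * norm.cdf(u) + sigma * norm.pdf(u)   # closed-form EI
    ucb = mu + beta * sigma                     # Eq. (7), with sigma as the uncertainty term
    return pi, ei, ucb

# Usage with the posterior from the previous sketch:
# pi, ei, ucb = acquisitions(mu, var, y_best=y.max())
```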
3 PROPOSED METHODOLOGIES
3.1 Automatic Kernel Learning: Neural Kernel
Choosing the right kernel function is vital in BO. Despite DKL's success, creating an effective network structure remains challenging. Following [9], we introduce the Neural Kernel (Neuk) as a basic unit for constructing an automatic kernel function.

Neuk is inspired by the fact that kernel functions can be safely composed by adding and multiplying different kernels. This compositional flexibility is mirrored in the architecture of neural networks, specifically within the linear layers. Neuk leverages this concept by substituting traditional nonlinear activation layers with kernel functions, facilitating automatic kernel construction.

To illustrate, consider two input vectors, x1 and x2. In Neuk, these vectors are processed through multiple kernels {h_i(x, x')}_{i=1}^{N_k}, each undergoing a linear transformation as follows:

h_i(x1, x2) = h_i(W^(i) x1 + b^(i), W^(i) x2 + b^(i)).   (8)

Here, W^(i) and b^(i) represent the weight and bias of the i-th kernel function, respectively, and h_i(·) is the corresponding kernel function. Subsequently, latent variables z are generated through a linear combination of these kernels:

z = W^(z) h(x1, x2) + b^(z),   (9)

where W^(z) and b^(z) are the weight and bias of the linear layer, and h(x1, x2) = [h_1(x1, x2), ..., h_{d_l}(x1, x2)]^T. This configuration constitutes the core unit of Neuk. For broader applications, multiple Neuk units can be stacked horizontally to form a Deep Neuk (DNeuk) or vertically to form a Wide Neuk (WNeuk). In this study, however, we utilize a single Neuk unit, finding it sufficiently flexible and efficient without excessively increasing the parameter count.

The final step in the Neuk process involves a nonlinear transformation applied to the latent variables z, ensuring the positive semi-definiteness of the kernel function:

k_neuk(x1, x2) = exp(Σ_j z_j + b^(k)).   (10)

A graphical representation of the Neuk architecture is shown in Fig. 1, along with small experiments demonstrating its effectiveness in predicting performance of a 180nm second-stage amplification circuit (see Section 4) with 100 training and 50 testing data points.

[Figure 1: Neural kernel and assessments. (a) Neural Kernel: inputs x1 and x2 pass through multiple base kernels (PER, RBF, RQ) and a linear layer to produce the output. (b) Assessment.]
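A minimal PyTorch sketch of one Neuk unit following Eqs. (8)-(10) is given below. The specific base kernels (RBF, RQ, and a periodic kernel), the latent width, and all module names are illustrative assumptions; only the overall structure (per-kernel linear maps, linear mixing, and the exponential output) is taken from the description above.

```python
import math
import torch
import torch.nn as nn

class Neuk(nn.Module):
    """Single Neural Kernel unit: per-kernel linear maps, base kernels, linear mixing, exp."""

    def __init__(self, d_in, d_latent=8):
        super().__init__()
        self.maps = nn.ModuleList([nn.Linear(d_in, d_in) for _ in range(3)])  # W^(i), b^(i)
        self.mix = nn.Linear(3, d_latent)                                     # W^(z), b^(z)
        self.bias_k = nn.Parameter(torch.zeros(1))                            # b^(k)

    @staticmethod
    def _base_kernels(a, b):
        # Three simple base kernels evaluated on the transformed inputs.
        sq = ((a - b) ** 2).sum(-1)
        rbf = torch.exp(-0.5 * sq)
        rq = (1.0 + 0.5 * sq) ** (-1.0)
        per = torch.exp(-2.0 * torch.sin(math.pi * torch.sqrt(sq + 1e-12)) ** 2)
        return torch.stack([rbf, rq, per], dim=-1)            # (..., 3)

    def forward(self, x1, x2):
        feats = []
        for i, m in enumerate(self.maps):
            a, b = m(x1), m(x2)                                # Eq. (8): per-kernel linear map
            feats.append(self._base_kernels(a, b)[..., i])
        h = torch.stack(feats, dim=-1)                         # h(x1, x2) of Eq. (9)
        z = self.mix(h)                                        # Eq. (9): linear combination
        return torch.exp(z.sum(-1) + self.bias_k)              # Eq. (10)

# k = Neuk(d_in=12); vals = k(x1, x2)   # x1, x2 of shape (n, 12) -> pairwise values (n,)
```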

[Figure 2: Knowledge Alignment and Transfer (KAT)-GP. Source data (netlist and parameters) are modeled by a neural kernel and a Gaussian process; target data pass through an encoder, the neural kernel, and a decoder, with the Delta method producing the target predictions u_t(x_t) and v_t(x_t).]

3.2 Knowledge Alignment and Transfer
In the literature, transfer learning is predominantly based on deep learning, wherein knowledge is encoded within the neural network weights, facilitating transfer through fine-tuning these weights on a target dataset. GPs, by contrast, present a fundamentally different paradigm. The predictive capability of a GP is intrinsically tied to the source data it is trained on (see Eq. (4)). This reliance on data for prediction underscores a significant challenge in applying transfer learning to GPs.

To address this challenge, we propose an innovative encoder-decoder structure, which we refer to as Knowledge Alignment and Transfer (KAT) for GPs. This approach retains the intrinsic knowledge of the source GP while aligning it with the target domain through an encoder and decoder mechanism.

Consider a source dataset D^(s) = {(x_i^(s), y_i^(s))}, on which the GP model GP(x) is trained, and a target dataset D^(t) = {(x_i^(t), y_i^(t))}. The first step introduces an encoder E(x), which maps the target input x^(t) into the source input space x^(s). This encoder accounts for potential differences in dimensionality between the source and target datasets, effectively managing any compression or redundancy.

The target outputs may have different value ranges, or even different quantities, from the source outputs. We therefore employ a decoder D(y^(s)) that transforms the output of the GP G(x) to match the target output y^(t). Thus, the KAT-GP for the target domain is expressed as

y^(t) = D(GP(E(x^(t)))).

A key aspect of this approach is that the original observations in the source GP are preserved, while the encoder and decoder are specifically trained to align and transfer knowledge between the source and target domains. The encoder and decoder themselves can be complex functions, such as deep neural networks, and their specific architectures depend on the problem and available data. In this work, both the encoder and decoder are small shallow neural networks with a linear(d_in × 32)-sigmoid-linear(32 × d_out) structure, where d_in and d_out are the input and output dimensions during implementation.

It is important to note that unless the decoder is a linear operator, KAT-GP is no longer a GP and does not admit a closed-form solution. We approximate the predictive mean and variance of the overall model through the Delta method, which employs Taylor series expansions to estimate the predictive mean μ^(t)(x^(t)) and covariance S^(t)(x^(t)) of the transformed output:

μ^(t)(x^(t)) = D(μ^(s)(E(x^(t))));   S^(t)(x^(t)) = J S^(s) J^T,   (11)

where μ^(s)(x^(s)) and S^(s)(x^(s)) are the predictive mean and covariance of GP(x), respectively, and J is the Jacobian matrix of the decoder D(·) evaluated at μ^(s)(E(x^(t))). Training KAT-GP then involves maximizing the log-likelihood in Eq. (12) using gradient descent w.r.t. the parameters of the encoder and decoder, as well as the hyperparameters of the neural kernel:

L = Σ_i log N(y_i^(t) | D(μ(E(x_i^(t)))), J_i S_i J_i^T + σ_t² I).   (12)

KAT-GP is illustrated in Fig. 2. It offers the first knowledge transfer solution for GPs with different design and performance spaces.
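The sketch below illustrates the KAT-GP construction in PyTorch: a frozen source GP posterior wrapped by a shallow encoder and decoder, with the Delta method of Eq. (11) propagating the predictive moments for a single test point. The SourceGP interface (a posterior(x) method returning a mean vector and covariance matrix) and all sizes are assumptions made for illustration; only the linear-sigmoid-linear structure is taken from the text.

```python
import torch
import torch.nn as nn
from torch.autograd.functional import jacobian

class MLP(nn.Module):
    """Shallow linear(d_in x 32)-sigmoid-linear(32 x d_out) block used for both E and D."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, 32), nn.Sigmoid(), nn.Linear(32, d_out))
    def forward(self, x):
        return self.net(x)

class KATGP(nn.Module):
    def __init__(self, source_gp, d_x_target, d_x_source, d_y_source, d_y_target):
        super().__init__()
        self.gp = source_gp                         # assumed: .posterior(x) -> (mean, cov)
        self.encoder = MLP(d_x_target, d_x_source)  # E(x): target input -> source input space
        self.decoder = MLP(d_y_source, d_y_target)  # D(y): source output -> target output space

    def predict(self, x_t):
        # x_t: a single target design point of shape (d_x_target,)
        mu_s, cov_s = self.gp.posterior(self.encoder(x_t))   # source-space moments
        mu_t = self.decoder(mu_s)                            # Eq. (11), mean
        J = jacobian(self.decoder, mu_s)                     # Jacobian of D at mu_s
        cov_t = J @ cov_s @ J.T                              # Eq. (11), covariance
        return mu_t, cov_t

# Training (not shown): maximize Eq. (12), the Gaussian log-likelihood of the target
# observations under N(mu_t, cov_t + sigma_t^2 I), w.r.t. encoder/decoder parameters
# and the neural-kernel hyperparameters.
```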
3.3 Modified Constrained MACE
In transistor sizing, it is crucial to harness the power of parallel computing by running multiple simulations simultaneously. One of the most popular solutions, MACE [12], resolves this challenge by proposing candidates lying on the Pareto frontier of the objectives

{UCB(x), PI(x), EI(x), PF(x), Σ_{i=1}^{N_c} max(0, u_i(x)), Σ_{i=1}^{N_c} max(0, u_i(x)/v_i(x))}

using the genetic search NSGA-II. Here, PF(x) is the probability of feasibility, which applies Eq. (5) to all constraint metrics, i.e., PF(x) = Π_{i=1}^{N_c} Φ((u_i(x) − C_i)/v_i(x)).

Despite its success, MACE suffers from high computational complexity, as it requires a Pareto front search over six correlated objectives. To mitigate this issue, we consider the constraints as an additional objective for the primal metric f_0(x), and the multi-objective optimization becomes

argmax {UCB(x), PI(x), EI(x)} × PF(x).   (13)

This reduction in dimensionality significantly improves efficiency, as the complexity grows exponentially with the number of objectives, while maintaining the same level of performance. Empirically, we do not observe any performance degradation.
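For illustration, the modified acquisition of Eq. (13) can be evaluated as a three-objective vector in which each acquisition value is scaled by the probability of feasibility; the Pareto search itself (e.g., NSGA-II) is omitted, and the constraint predictions u_i, v_i are assumed to come from per-constraint GP posteriors (with v_i treated as a variance).

```python
import numpy as np
from scipy.stats import norm

def prob_feasibility(u, v, C):
    """PF(x) = prod_i Phi((u_i - C_i) / sqrt(v_i)) over the constraint predictions."""
    return np.prod(norm.cdf((u - C) / np.sqrt(np.maximum(v, 1e-12))), axis=-1)

def modified_mace_objectives(mu, var, y_best, u, v, C, beta=2.0):
    """Three objectives in the spirit of Eq. (13): {UCB, PI, EI} scaled by PF, to be
    maximized jointly (e.g., with an NSGA-II search) instead of MACE's original six."""
    sigma = np.sqrt(np.maximum(var, 1e-12))
    z = (mu - y_best) / sigma
    ucb = mu + beta * sigma
    pi = norm.cdf(z)
    ei = (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)
    pf = prob_feasibility(u, v, C)
    return np.stack([ucb * pf, pi * pf, ei * pf], axis=-1)
```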
3.4 Selective Transfer Learning with BO
While transfer learning proves effective in numerous scenarios, its utility is not universal, particularly when the source and target domains differ significantly, such as between an SRAM and an ADC. It is important to note that even when the source and target datasets have an equal number of points, the utility of KAT-GP may not be immediately apparent. However, our empirical studies reveal that source and target data often exhibit distinctly different distributions, with varying concentration regions. This divergence can provide valuable insights for the optimization of the target circuit, even when the target data exceeds the source data in size. The effectiveness of transfer learning in such cases is inherently problem-dependent, requiring adaptable strategies for diverse scenarios.

To address these challenges, we propose a Selective Transfer Learning (STL) strategy, which synergizes with the batch nature of the MACE algorithm to maximize the benefits of transfer learning. This approach involves training both a KAT-GP model and a GP model trained exclusively on the target data (referred to as NeukGP, as it is equipped with a Neural Kernel). During the BO process, each model collaborates with MACE to generate a proposal Pareto front set, denoted P_i (with i = 1, 2 in this context). We randomly select w_1/(w_1 + w_2) · N_B points from P_1 to form A_1, and w_2/(w_1 + w_2) · N_B points from P_2 to form A_2. Points in A_1 and A_2 are then simulated and evaluated. The weights are initialized with the number of samples and updated based on the number of simulations that improve the current best, i.e.,

w_i = w_i + |f(A_i) > y†|_n,   (14)

where |f(A_i) > y†|_n denotes the number of points in A_i that surpass the current best objective value y†, and f can be the constrained objective or the Figure of Merit (FOM). The STL procedure is summarized in Algorithm 1.

Algorithm 1 KATO with Selective Transfer Learning
Require: source dataset D_s, initial target circuit data D_t, number of iterations N_I, batch size N_B per iteration
1: Train KAT-GP on D_s and D_t
2: Train NeukGP on D_t
3: for i = 1 → N_I do
4:   Update KAT-GP and NeukGP based on D_t
5:   Apply MACE to KAT-GP and NeukGP to generate proposal sets P_1 and P_2
6:   Form action sets A_1 and A_2 by randomly selecting w_1/(w_1 + w_2) · N_B points from P_1 and w_2/(w_1 + w_2) · N_B points from P_2
7:   Simulate A_1 and A_2 and update D_t ← D_t ∪ A_1 ∪ A_2
8:   Update w_1 and w_2 based on Eq. (14) and the best design x*
9: end for
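The sketch below illustrates one STL iteration, i.e., the batch split between the two proposal sets and the weight update of Eq. (14); the simulate callable and the Pareto sets P_1, P_2 are placeholders for the circuit simulator and the two MACE searches.

```python
import numpy as np

def stl_iteration(pareto_sets, weights, simulate, y_best, batch_size, rng=None):
    """One Selective Transfer Learning step: split the batch across the KAT-GP and NeukGP
    proposal sets in proportion to their weights, simulate the picks, and reward (Eq. (14))
    the model whose proposals improved on the current best."""
    rng = rng or np.random.default_rng()
    new_points, new_values = [], []
    for i, P in enumerate(pareto_sets):                          # P_1 (KAT-GP), P_2 (NeukGP)
        n_i = int(round(batch_size * weights[i] / sum(weights)))
        idx = rng.choice(len(P), size=min(n_i, len(P)), replace=False)
        picks = [P[j] for j in idx]
        values = [simulate(x) for x in picks]                    # expensive circuit simulations
        weights[i] += sum(v > y_best for v in values)            # Eq. (14)
        new_points += picks
        new_values += values
    y_best = max([y_best] + new_values)                          # track the best design so far
    return new_points, new_values, weights, y_best
```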
4 EXPERIMENT
To assess the effectiveness of KATO, we conducted experiments on three analog circuits: a two-stage operational amplifier (OpAmp), a three-stage OpAmp, and a bandgap circuit (depicted in Fig. 3).

[Figure 3: Schematic of the evaluation circuits. (a) Two-stage operational amplifier. (b) Three-stage operational amplifier. (c) Bandgap.]

Two-stage Operational Amplifier (OpAmp). The design variables are the lengths of the transistors in the first stage, the capacitance of the capacitors, the resistance of the resistors, and the bias currents of both stages. The objective is to minimize the total current consumption (I_total) while meeting specific performance criteria: a phase margin (PM) greater than 60 degrees, a gain-bandwidth product (GBW) over 4 MHz, and a gain exceeding 60 dB:

argmin I_total   s.t.   PM > 60°, GBW > 4 MHz, Gain > 60 dB.   (15)

Three-stage OpAmp. This variant improves the gain beyond what a two-stage OpAmp can offer by adding a third stage. It introduces more design variables, including the lengths of the transistors in the first stage, the capacitance of two capacitors, and the bias currents of all three stages. The optimization is specified as:

argmin I_total   s.t.   PM > 60°, GBW > 2 MHz, Gain > 80 dB.   (16)

Bandgap Reference Circuit. This circuit is vital for maintaining precise and stable outputs in analog and mixed-signal systems-on-a-chip. The design variables include the length of the input transistor, the widths of the bias transistors of the operational amplifier, and the resistance of the resistors. The aim is to minimize the temperature coefficient (TC), with constraints on the total current consumption (I_total less than 6 uA) and the power supply rejection ratio (PSRR larger than 50 dB):

argmin TC   s.t.   I_total < 6 uA, PSRR > 50 dB.   (17)

For the experiments, each method was repeated five times with different random seeds, and statistical results were reported. Baselines were implemented with fine-tuned hyperparameters to ensure optimal performance. All circuits were implemented using 180nm and 40nm Process Design Kits (PDKs). KATO is implemented in PyTorch with MACE (https://github.com/Alaya-in-Matrix/MACE). Experiments were carried out on a workstation equipped with an AMD 7950X CPU and 64GB RAM.

4.1 Assessment of FOM Optimization
We initiated the evaluation based on the FOM of Eq. (2). KATO was compared against SOTA single-objective Bayesian optimization techniques, SMAC-RF (https://github.com/automl/SMAC3) and MACE, along with a naive random search (RS) strategy. All methods are given 10 random simulations as the initial dataset, and the sizing results (FOM versus simulation budget) for the 180nm technology node are shown in Fig. 4. SMAC-RF is slightly better than MACE on this simple single-objective optimization task. Notably, KATO outperforms the baselines by a large margin: it consistently achieves the maximum FOM, with up to 1.2x improvement, and it takes about 50% fewer simulations to reach a similar optimal FOM. The optimal result of RS does not actually satisfy all constraints, highlighting the limitation of FOM-based optimization.

[Figure 4: Transistor sizing by optimizing FOM. (a) Two-stage OpAmp. (b) Three-stage OpAmp. (c) Bandgap.]

4.2 Assessment of Constrained Optimization
Next, we assess the proposed method on a more practical and challenging constrained optimization setup. During optimization, only designs satisfying all constraints are considered valid and included in the performance reports. To provide sufficient valid designs for the surrogate model, we first simulate 300 random designs, typically yielding about 7 valid designs, a 2.3% success rate that makes RS inapplicable to this task.

We compare KATO with SOTA constrained BO methods tailored for circuit design, namely MESMOC [2], USEMOC [3], and MACE with constraints [12]. The 180nm results are shown in Fig. 5, where MESMOC shows poor performance due to its lack of exploration, and MACE is generally good except on the three-stage OpAmp. KATO demonstrates consistent superiority, always achieving the best performance with a clear margin and, most importantly, requiring only about 50% of the simulation cost to reach the best-performing baseline. The final design performance is shown in Table 1, where KATO achieves the best objective by trading off the constraint metrics aggressively (e.g., Gain) as long as they still fulfill the requirements.

[Figure 5: Transistor sizing by constrained optimization. (a) Two-stage OpAmp. (b) Three-stage OpAmp. (c) Bandgap.]


Table 1: Transistor Sizing Optimal Performance via Optimization with Constraints

Two-stage OpAmp (180nm)
Method          I(uA)    Gain(dB)   PM(°)    GBW(MHz)
Specifications  min      >60        >60      >4
Human Expert    274.84   75.12      63.57    8.23
MESMOC          138.58   80.39      65.28    5.66
USEMOC          137.89   65.42      66.63    4.63
MACE            127.69   79.30      61.50    4.38
KATO            124.21   61.18      60.59    4.56

Three-stage OpAmp (180nm)
Method          I(uA)    Gain(dB)   PM(°)    GBW(MHz)
Specifications  min      >80        >60      >2
Human Expert    462.84   112.49     65.61    2.05
MESMOC          288.28   86.85      73.55    2.27
USEMOC          230.47   81.24      73.67    2.09
MACE            245.62   81.02      68.38    2.13
KATO            187.51   80.30      63.99    2.10

Bandgap (180nm)
Method          TC(ppm/°C)   I(uA)   PSRR(dB @100Hz)
Specifications  min          <6      >50
Human Expert    11.26        5.31    61.01
MESMOC          14.04        5.29    60.97
USEMOC          10.36        4.78    61.21
MACE            10.41        4.78    61.20
KATO            9.66         5.42    61.99
4.3 Assessment of Transfer Learning
Finally, we assess KATO with transfer learning between different topologies and technology nodes for both FOM and constrained optimization. Each experiment provides 200 random samples as the source data; the initial random target sample size is 10 for the FOM optimization and 200 for the constrained optimization. We mainly conduct transfer learning between the two-stage and three-stage OpAmps due to their similarity in topology and technology node. Note that the design variables are different.

Transfer learning between technology nodes. We first compare KATO to the SOTA BO equipped with transfer learning, namely TLMBO [13] (which is only capable of FOM optimization). Other RL-based methods, e.g., RL-GCN [11], perform poorly under the small data budget of this experiment and are thus not included in the comparison. We also compare KATO with and without transfer learning for the two-stage and three-stage OpAmps in the 40nm technology node. The statistical results over five random runs are shown in Figs. 6(a) and 6(b), which highlight the effectiveness of the transfer learning, delivering an average 2.52x speedup (defined by the simulations required to reach the best performance of KATO without transfer learning) and 1.18x performance improvement.

Transfer learning between topologies. As far as the authors know, there is no BO method that can perform transfer learning between different topologies. We therefore validate KATO with and without transfer learning between the two-stage and three-stage OpAmps in the 40nm technology node. The statistical results are shown in Figs. 6(c) and 6(d), which demonstrate the effectiveness of the transfer learning, delivering an average 2.35x speedup and 1.16x performance improvement.

Transfer learning between topologies and technology nodes. Finally, we assess transfer learning across both topologies and technology nodes. The results are shown in Figs. 6(e) and 6(f), which demonstrate the effectiveness of the transfer learning, delivering an average 2.40x speedup and 1.16x performance improvement.
[Figure 6: Transistor sizing constrained optimization with transfer learning of designs and technology nodes. (a) Two-stage OpAmp (180nm) to two-stage OpAmp (40nm). (b) Three-stage OpAmp (180nm) to three-stage OpAmp (40nm). (c) Three-stage OpAmp (40nm) to two-stage OpAmp (40nm). (d) Two-stage OpAmp (40nm) to three-stage OpAmp (40nm). (e) Three-stage OpAmp (180nm) to two-stage OpAmp (40nm). (f) Two-stage OpAmp (180nm) to three-stage OpAmp (40nm).]

The final design performance is shown in Table 2. Transfer learning between technology nodes achieves the best results, as it is the easier task. Nonetheless, the differences between the transfer learning settings are not significant. Compared to the human expert on the three-stage OpAmp, KATO shows up to 1.62x improvement in key performance.

Table 2: Transistor Sizing Optimal Performance via Optimization with Constraints with Transfer Learning

Two-stage OpAmp (40nm)
Method                  I(uA)    Gain(dB)   PM(°)    GBW(MHz)
Specifications          min      >50        >60      >4
Human Expert            308.10   51.77      71.33    7.08
KATO                    273.04   52.44      81.24    21.09
KATO (TL Node)          254.05   50.29      83.72    15.05
KATO (TL Design)        257.12   50.04      82.68    10.28
KATO (TL Node&Design)   258.01   51.23      85.78    13.31

Three-stage OpAmp (40nm)
Method                  I(uA)    Gain(dB)   PM(°)    GBW(MHz)
Specifications          min      >70        >60      >2
Human Expert            244.72   74.10      60.18    2.03
KATO                    151.09   70.23      69.85    3.49
KATO (TL Node)          118.47   74.41      71.84    2.65
KATO (TL Design)        118.71   71.46      72.92    2.43
KATO (TL Node&Design)   120.08   70.44      73.44    2.48

5 CONCLUSION
We propose KATO, a novel transfer learning approach for transistor sizing, which for the first time enables BO to transfer knowledge across different designs and technologies. Beyond improving on the SOTA, we hope the idea of KAT can inspire further research. Future extensions include applying transfer learning to many different circuits of various types, e.g., SRAM, ADC, and PLL.

REFERENCES
[1] Chen Bai, Qi Sun, Jianwang Zhai, Yuzhe Ma, Bei Yu, and Martin D. F. Wong. 2021. BOOM-Explorer: RISC-V BOOM Microarchitecture Design Space Exploration Framework. In Proc. ICCAD. IEEE, Munich, Germany, 1-9.
[2] Syrine Belakaria, Aryan Deshwal, and Janardhan Rao Doppa. 2020. Max-value Entropy Search for Multi-Objective Bayesian Optimization with Constraints. CoRR abs/2009.01721 (2020), 7825-7835. arXiv:2009.01721
[3] Syrine Belakaria, Aryan Deshwal, Nitthilan Kannappan Jayakodi, and Janardhan Rao Doppa. 2020. Uncertainty-Aware Search Framework for Multi-Objective Bayesian Optimization. In Proc. AAAI. AAAI Press, 10044-10052.
[4] Ibrahim M. Elfadel, Duane S. Boning, and Xin Li. 2019. Machine Learning in VLSI Computer-Aided Design. Springer.
[5] Yaguang Li, Yishuang Lin, Meghna Madhusudan, Arvind Sharma, Sachin Sapatnekar, Ramesh Harjani, and Jiang Hu. 2021. A Circuit Attention Network-Based Actor-Critic Learning Approach to Robust Analog Transistor Sizing. In Workshop MLCAD. IEEE, Raleigh, North Carolina, USA, 1-6.
[6] Wenlong Lyu, Pan Xue, and Fan Yang. 2018. An Efficient Bayesian Optimization Approach for Automated Optimization of Analog Circuits. IEEE Transactions on Circuits and Systems I: Regular Papers 65, 6 (June 2018), 1954-1967.
[7] Wenlong Lyu, Fan Yang, and Changhao Yan. 2018. Batch Bayesian Optimization via Multi-objective Acquisition Ensemble for Automated Analog Circuit Design. In Proc. ICML, Vol. 80. PMLR, 3312-3320.
[8] Keertana Settaluri, Zhaokai Liu, Rishubh Khurana, Arash Mirhaj, Rajeev Jain, and Borivoje Nikolic. 2021. Automated Design of Analog Circuits Using Reinforcement Learning. IEEE TCAD 41, 9 (2021), 2794-2807.
[9] Shengyang Sun, Guodong Zhang, Chaoqi Wang, Wenyuan Zeng, Jiaman Li, and Roger Grosse. 2018. Differentiable Compositional Kernel Learning for Gaussian Processes. In Proc. ICML. PMLR, 4828-4837.
[10] Konstantinos Touloupas, Nikos Chouridis, and Paul P. Sotiriadis. 2021. Local Bayesian Optimization for Analog Circuit Sizing. In Proc. DAC. IEEE, San Francisco, California, USA, 1237-1242.
[11] Hanrui Wang, Kuan Wang, and Jiacheng Yang. 2020. GCN-RL Circuit Designer: Transferable Transistor Sizing with Graph Neural Networks and Reinforcement Learning. In Proc. DAC. IEEE, San Francisco, California, USA, 1-6.
[12] Shuhan Zhang, Fan Yang, Changhao Yan, Dian Zhou, and Xuan Zeng. 2021. An Efficient Batch-Constrained Bayesian Optimization Approach for Analog Circuit Synthesis via Multiobjective Acquisition Ensemble. IEEE TCAD 41, 1 (2021), 1-14.
[13] Zheng Zhang, Tinghuan Chen, Jiaxin Huang, and Meng Zhang. 2022. A Fast Parameter Tuning Framework via Transfer Learning and Multi-Objective Bayesian Optimization. In Proc. DAC. ACM, San Francisco, California, USA, 133-138.
