

Multiagent Based Reinforcement Learning (MA-RL): An Automated Designer for Complex Analog Circuits

Jiarui Bao, Jinxin Zhang, Zhangcheng Huang, Member, IEEE, Zhaori Bi, Member, IEEE, Xingwei Feng, Xuan Zeng, Senior Member, IEEE, and Ye Lu, Member, IEEE

Abstract—Despite the effort of analog circuit design automation, complex analog circuit design currently still requires extensive manual iterations, making it labor intensive and time-consuming. Recently, reinforcement learning (RL) algorithms have been demonstrated successfully for analog circuit design optimization. However, a robust and highly efficient RL method to design analog circuits with complex design spaces has not been fully explored yet. In this work, inspired by multiagent planning theory as well as human expert design practice, we propose a multiagent-based RL (MA-RL) framework to tackle this issue. Particularly, we 1) partition the complex analog circuits into several subblocks based on topology information and effectively reduce the complexity of the design search space; 2) leverage MA-RL for the circuit optimization, where each agent corresponds to a single subblock, and the interactions between agents delicately mimic the best-design tradeoffs between circuit subblocks made by human experts; 3) introduce and compare three different multiagent RL algorithms and corresponding frameworks to demonstrate the effectiveness of the MA-RL method; 4) employ twin-delayed techniques and proximal policy optimization to further boost training stability and accomplish higher performance; 5) investigate the impacts of different reward function definitions as well as different state settings of MA-RL agents to further improve the robustness of this framework; and 6) demonstrate experiments on three different complex analog circuit topologies (gain boost amplifier, delay-locked loop, and SAR ADC) as well as knowledge transfer between two technology nodes. It is shown that the MA-RL framework can achieve the best Figure of Merits for complex analog circuit design. This work shines a light on future large-scale analog circuit system design automation.

Index Terms—Circuit design automation, complex analog circuits, multiagent reinforcement learning (MA-RL), proximal policy optimization (PPO), twin delayed deep deterministic policy gradient (DDPG).

Manuscript received 14 July 2023; revised 12 November 2023, 10 January 2024, and 11 April 2024; accepted 25 April 2024. Date of publication 8 May 2024; date of current version 22 November 2024. This work was supported in part by the National Key Research and Development Program of China under Grant 2020YFA0711900 and Grant 2020YFA0711901; in part by the National Natural Science Foundation of China Research under Project 62350610270, Project 62374034, Project 62235009, Project 62141407, and Project 62304052; in part by the Innovation Program of Shanghai Municipal Education Commission under Grant 2021-01-07-00-07-E00077; and in part by the Natural Science Foundation of Shanghai under Grant 22ZR1403500. This article was recommended by Associate Editor G. G. E. Gielen. (Corresponding authors: Zhangcheng Huang; Xuan Zeng; Ye Lu.) Jiarui Bao, Jinxin Zhang, Xingwei Feng, and Ye Lu are with the State Key Laboratory of Integrated Chips and Systems, School of Information Science and Technology, Fudan University, Shanghai 200433, China (e-mail: lu_ye@fudan.edu.cn). Zhangcheng Huang is with the Frontier Institute of Chip and System Shanghai, Fudan University, Shanghai 200433, China (e-mail: huangzc@fudan.edu.cn). Zhaori Bi and Xuan Zeng are with the State Key Laboratory of Integrated Chips and Systems, School of Microelectronics, Fudan University, Shanghai 200433, China (e-mail: [email protected]). Digital Object Identifier 10.1109/TCAD.2024.3398554

I. INTRODUCTION

DUE TO the development of the Internet of Things (IoT), 5G/6G communication, and edge computing, the demand for electronic chip integrated circuit (IC) design increases drastically. Digital circuit design implements functionalities based on standard cells and takes advantage of various computer-aided design (CAD) tools [1]. However, the design of analog and mixed-signal (AMS) circuits is still extremely labor intensive and time consuming due to their highly nonlinear behavior and complex tradeoffs among circuit Specs. It still heavily relies on iterative optimization of analog circuit topology selection, component parameter selection, and layout routing by human design experts. This is not only time consuming but also highly dependent on designer experience, making analog circuit design an "art" with large variations.

To tackle this problem, automated design efforts for analog circuits have been increasingly investigated [2], [3], [4], [5], [6], [17]. These efforts primarily focus on three different aspects: 1) automated selection of circuit topology [14], [15]; 2) synthesis and optimization of circuit parameters, such as device sizing under a determined topology [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13]; and 3) automatic layout placement and routing [16], [17]. This work focuses on the second aspect. It is also noted that some commercial EDA software begins to provide sizing functions for analog circuits; e.g., the "Global Optimization" tool in Cadence can achieve automated and good sizing results for some analog circuits, but its underlying algorithms are not revealed to the public and its capability for optimizing complex circuits with multiple subblocks is still unknown. Therefore, academic research on novel algorithms for complex analog circuit optimization is still desired: it contributes to a better understanding of these algorithms for the community, and it may also serve as a good supplement to existing tools in certain scenarios. Existing research works on analog circuit parameter optimization can be
further summarized into three categories: 1) simulation-based optimization; 2) model-based optimization; and 3) online
model-based optimization. Simulation-based approach uses the
SPICE simulator combined with the global algorithms as a
black-box tool for parameter optimization tasks in analog cir-
cuits. Commonly used global optimization algorithms include
evolutionary algorithms [3], [4], particle swarm optimization
(PSO) [18], Bayesian optimization (BO) [7], [8], [9] and
others. Model-based approach creates compact models to
obtain fast circuit evaluation, such as optimization based
on geometric programming [19], [20] and optimization based
on neural networks, combined with global algorithms [5].
However, obtaining accurate models itself is a very challenging
task as the number of circuit design parameters increases.
Optimization based on online models continuously updates
the model during the optimization process, combining the
accuracy of simulation-based optimization with the efficiency
of model-based optimization. There have been many studies
on online model optimization. For example, a GSLA online algorithm combining a genetic algorithm (GA) and an artificial neural network (ANN) has been proposed [21], and an online algorithm based on differential evolution (DE) and a surrogate model of a Gaussian process (GP) has also been proposed [22]. Recently, reinforcement learning (RL), as a variant of online model optimization, has been researched for analog design automation [10], [11], [12], [13]. Unlike the conventional online model-based optimization methods [21], [22], RL does not improve the agent model of circuit performance, but instead, it updates the policy model in the hope of achieving the best policy. A combination of RL and deep learning is used in [23] to optimize the size of the folding amplifier. Another combination of graph convolutional neural network (GCN) and deep deterministic policy gradient (DDPG) is used to achieve the parameter design of analog circuits and the topology transfer between different circuits [10]. It has also been shown that prioritized RL training with design knowledge could optimize analog circuits more efficiently [11].

However, the aforementioned RL methods have only been verified in relatively simple circuits, such as a two-stage transimpedance amplifier [10], a folded-cascode amplifier [11], and a two-stage operational amplifier [23]. Developing efficient and robust RL-related technologies for the design of more complex analog circuits in an automated fashion remains a challenging task. First, complex analog circuits, such as successive approximation register (SAR) ADCs and delay-locked loops (DLLs), typically contain multiple subcircuit blocks with complicated tradeoff considerations, and this greatly increases the difficulty of parameter optimization. To the best of our knowledge, the interactions between subblocks and their tradeoff relationships in these complex analog circuits have not been fully researched in the RL-related framework yet. Second, complex analog circuits typically contain a large number of design parameters, which may lead to search space explosion, another reason these circuits remain under-investigated, particularly in the RL framework.

Inspired by multiagent planning theory [24] and hierarchical design work [25], [26], we propose to partition a complex analog circuit into several subblocks based on the circuit topologies and design knowledge, and to assign design goals to each subblock while also keeping in mind the overall circuit targets. This effectively reduces the complexity of the design search space and helps machine learning algorithms accomplish a more efficient and exhaustive search. Additionally, we propose to employ multiagent-based RL (MA-RL) for optimizing the circuit in an automated fashion (Fig. 1). The reward functions of MA-RL are designed for each subblock as well as for the overall circuitry. Each RL agent optimizes toward its own reward, i.e., the subgoal for its subblock, while all agents interact with each other to accomplish the overall target. This artificial process properly mimics the design tradeoffs and iterations by human design experts. Finally, we introduce the multiagent twin-delayed DDPG (MATD3) technique and multiagent proximal policy optimization (MAPPO) to further improve the training stability and achieve better results.

Fig. 1. Schematic of complex analog circuit partition and its design automation using multiagent RL.

The key contributions of this work are as follows.
1) We partition the complex analog circuits into several subblocks by leveraging circuit topology information and design knowledge, and properly create a design goal for each subblock with the consideration of overall design specifications (Specs). This effectively reduces the complexity of the design search space and helps to enable a more complete search.
2) To our knowledge, we are the first to propose MA-RL methods for analog circuit automation. Within this framework, each agent corresponds to a single subblock of the circuit, and the interactions between agents mimic the best-design tradeoffs between circuit subblocks by design experts.
3) We study the impacts of different reward definitions and state settings in MA-RL agents on the performance and robustness of the algorithms, providing a more complete understanding and possible future improvement directions of the proposed framework.


4) We investigate and compare three different MA-RL algorithms in this work, including MA-DDPG, the twin-delayed DDPG (TD3) algorithm, which includes double network and delayed techniques, and the proximal policy optimization (PPO), which includes the clipped surrogate objective technique to avoid training instability and accomplish better performance.
5) Finally, we demonstrate the efficiency and effectiveness of the proposed methods through three different circuit topologies with best circuit performance and highest speed. We have also demonstrated the knowledge transfer between different technology nodes with a trained model.

II. RELATED WORK

Multiagent Planning: A multiagent environment is created where a centralized planner assigns each operator to one of the agents. When a complex plan is divided into subplans with n interactive subgoals, the complexity is reduced from O((n \times b)^d) to O(\max_i b_i^{d_i} + b \times n \times q \times d) due to the idea of parallelism, where n is the number of agents, b is the branching factor, d is the depth of the problem, and q is a measure of the positive interactions between overlapping propositions, with b_i \approx b/n and d_i \approx d/n [24].

MA-RL: RL is a machine learning technique that employs intelligent agents to take actions in an environment in order to maximize the cumulative reward. It has been proposed to undertake complex intelligent tasks, such as playing games [27] and controlling robotics [28]. MA-RL adapts actor–critic methods that consider the action policies of other agents and is able to successfully learn policies that require complex multiagent coordination. This approach shows strength compared to existing methods in cooperative as well as competitive scenarios [29]. Here, we propose to use MA-RL to optimize complex analog circuits.

DDPG: DDPG is the deterministic policy gradient algorithm for RL with continuous actions. The deterministic policy gradient is the expected gradient of the action-value function, and it can be estimated much more efficiently than the usual stochastic policy gradient. Its deterministic actor–critic can significantly outperform its stochastic counterparts in high-dimensional action spaces [30].

TD3: TD3 uses three modifications to reduce value function overestimation [31]. First, it learns two Q-value functions and uses the minimum value function estimate during policy updates. Second, a TD3 agent updates the policy and targets less frequently than the Q functions. Third, a TD3 agent adds noise to the target action when updating the policy. Here we introduce TD3 techniques in the MA-RL framework.

PPO: PPO is a policy-based RL algorithm, designed to effectively use limited data while maintaining the stability of learning performance [32]. The techniques of clipping ratio and clipped objective function are applied in PPO to limit the amplitude of policy updates, ensuring proximity between the new and old policies. This helps to prevent training instability caused by excessive policy updates.

III. METHODOLOGY

A. Problem Formulation

The analog circuit design optimization can be formulated as a bound-constrained optimization

    \max_{x \in D^n} \mathrm{FoM}(x)    (1)

where x is the parameter vector, n is the number of parameters to search, and D^n is the design space. The Figure of Merit (FoM) is the objective we aim to optimize, and it is the sum of the FoM for each subblock

    \mathrm{FoM}(x) = \sum_{i \in N} \mathrm{FoM}_i(x_i) = \sum_{i \in N} R_i    (2)

where x_i is the parameter vector in the ith subblock and N is the number of all subblocks. FoM_i(x_i) is the objective we aim to optimize in the ith subblock, and it is equivalent to the reward R_i for this subblock in the multiagent RL implementation. The reward R_i includes the reward R^i calculated from the performance Specs of the ith subblock and the reward R^t calculated from the performance Specs of the entire circuit, and the detailed definition is shown in (7)–(13). This reward definition takes into account the contribution of x_i to both the performance of the ith subblock and that of the entire circuit.

B. Framework Overview

An overview of the proposed framework is shown in Fig. 1.
1) The complex analog circuit is partitioned into several subblocks based on the topology information and design knowledge. Conventionally, a complex circuit can be viewed as a combination of electrical components and circuit nodes, where the currents of the components and the voltages of the nodes obey Kirchhoff's laws. Specifically, some nodes can form a boundary where the currents and voltages within or on the nodes are relatively independent of outside components, and this naturally divides the complex circuit into different subblocks; one example is explained in [33].
2) A MA-RL framework is adopted for the design automation task, and each subblock is assigned to a corresponding RL agent.
3) Each individual RL agent trains a model to optimize its reward, i.e., the FoM_i of subblock i, by interacting with all the agents through the circuit simulation environment.
4) The overall FoM is eventually achieved by the cooperation and competition between all subblock agents during the algorithm execution process.
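To make the flow of Sections III-A and III-B concrete, the following is a minimal Python sketch of one evaluation of a partitioned circuit. The names (agents, simulate, act, reward) are hypothetical placeholders for the MA-RL agents and the SPICE-based simulation environment described in this paper, not an implementation provided by the authors.

def evaluate_design(agents, simulate, local_states):
    """One evaluation of a partitioned circuit, mirroring (1)-(2).

    agents       : one RL agent per subblock, each proposing its own parameters
    simulate     : black-box circuit simulation returning per-subblock and
                   whole-circuit performance Specs (e.g., via SPICE)
    local_states : the current local observation of each subblock
    """
    # Decentralized execution: each agent sizes only its own subblock.
    actions = [agent.act(state) for agent, state in zip(agents, local_states)]

    # One simulation of the full circuit with all subblock parameters applied.
    subblock_specs, circuit_specs = simulate(actions)

    # Reward R_i = R^i (own subblock Specs) + R^t (entire-circuit Specs), as in (7).
    rewards = [agent.reward(specs_i, circuit_specs)
               for agent, specs_i in zip(agents, subblock_specs)]

    # The overall FoM in (2) is simply the sum of the per-agent rewards.
    fom = sum(rewards)
    return actions, rewards, fom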
C. Multiagent Reinforcement Learning Formulation

As shown in Fig. 2, the key concept of the MA-RL framework is centralized training with decentralized execution [29]. Multiple agents are created, where actor–critic RL agents are used. The actor is the policy structure searching for the action to optimize the FoM, while the critic informs the actor how good the action is and how it should improve. Critics in different agents interact with each other through the shared states and actions of all agents, and then they influence their corresponding actor networks by changing their loss functions.

Fig. 2. Detailed diagram of multiagent actor–critic RL implementation on circuit design automation.

Actor and Critic: In the implementation of the MA-RL framework, both the Actor and the Critic are ANNs with one input layer, one output layer, and two hidden layers. The input of the actor network is its state and the output is the action as described below. The input of the critic network is a concatenation of all states and actions, and its output is the Q value [29]

    Input of the critic network = (State^1, State^2, \ldots, State^N, Action^1, Action^2, \ldots, Action^N).    (3)

Due to the fact that each critic network in MA-RL can access information from all agents, i.e., each critic network's input is a concatenation of the states and actions of all agents, the optimization process of each subblock indeed takes into account the impacts of all subblocks.
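For illustration, one agent's actor and its centralized critic could be wired up in PyTorch as sketched below. The hidden-layer width of 64 and the tanh output used to keep actions in [-1, 1] are our assumptions; the paper only specifies one input layer, two hidden layers, and one output layer.

import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps one agent's local state to its normalized action vector."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions scaled to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class CentralizedCritic(nn.Module):
    """Takes the concatenation of all agents' states and actions, as in (3)."""
    def __init__(self, total_state_dim, total_action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(total_state_dim + total_action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar Q value
        )

    def forward(self, states, actions):
        # states, actions: lists of per-agent tensors
        return self.net(torch.cat(list(states) + list(actions), dim=-1))

Each of the N agents owns one such actor; the critic of agent i sees every agent's state and action, which is what allows a subblock's update to account for the rest of the circuit.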


Action Space: An action is the output of the actor network in one particular agent, and it is composed of design parameters, such as devices' sizes and passives, e.g., transistor width (W), length (L), capacitance (C), resistance (R), etc. Practically, it is a vector concatenating this information for the tunable devices within the corresponding subblock

    Action^i = (W_1^i, L_1^i, W_2^i, L_2^i, \ldots, W_k^i, L_k^i, R_1^i, R_2^i, \ldots, C_1^i, C_2^i, \ldots).    (4)

Note that all values of these design parameters are scaled and normalized in the range between -1 and 1 for the network training purpose.

State Space: In the MA-RL, the state of each agent is a vector containing the circuit information of the corresponding subblock observed at the moment. Two state vector definitions are tested and discussed in this work.

The first definition involves a set of simulated operating conditions for a subblock, such as g_m, V_sat, etc. Taking the state of agent i as an example

    State^{i,a} = (g_{m1}^i, g_{m2}^i, \ldots, V_{sat1}^i, V_{sat2}^i, \ldots).    (5)

In the second case, the state of each agent is a vector composed of the current performance reached in the corresponding subblock, and can be defined as

    State^{i,b} = (Spec_1^i, Spec_2^i, \ldots, Spec_n^i)    (6)

where Spec_n^i is the nth performance Spec of the ith subblock. For example, the state of each subblock is a set of circuit performances, such as its gain, power consumption, and so on. The comparisons of these two state definitions are tested and discussed in Section IV.

Reward: The reward R_i provides the optimization goal for agent i. It takes into account the current action's impact on both the performance of the ith subblock and that of the entire circuit, and it is defined as the sum of these two parts

    R_i = R^i + R^t.    (7)

R^i is defined as the weighted sum of the rewards obtained by the performance Specs of the ith subblock, and calculated as

    R^i = \sum_{l \in M^i} k_l^i r_l^i    (8)

where M^i is the number of Specs of the ith subblock, k_l^i is the weight of the lth performance Spec s_l^i, and the reward of the lth performance Spec r_l^i is defined as

    r_l^i =
      \begin{cases}
        -1, & \text{if } s_l^i > s_l^{i,\max} \text{ and } s_l^i \text{ is minimized} \\
        -1, & \text{if } s_l^i < s_l^{i,\min} \text{ and } s_l^i \text{ is maximized} \\
        \frac{s_l^i - s_l^{i,\min}}{s_l^{i,\max} - s_l^{i,\min}}, & \text{if } s_l^i > s_l^{i,\min} \text{ and } s_l^i \text{ is maximized} \\
        \frac{s_l^{i,\max} - s_l^i}{s_l^{i,\max} - s_l^{i,\min}}, & \text{if } s_l^i < s_l^{i,\max} \text{ and } s_l^i \text{ is minimized}
      \end{cases}    (9)

where s_l^i is the lth Spec target in the ith subblock, and s_l^{i,\min} and s_l^{i,\max} are predefined boundary values of s_l^i. For those subblock Specs which are not included in the entire circuit design Specs, the Spec boundaries are used for the reward normalization.

R^t is calculated based on the Specs of the entire circuit, and it is defined as the weighted sum of the rewards obtained by the M performance Specs

    R^t = \sum_{j \in M} k_j^t r_j^t    (10)

where M is the total number of Specs, k_j^t is the weight of the jth performance Spec s_j^t, and r_j^t is the reward of the jth performance Spec. Two definitions of r_j^t, denoted as r_j^{t,a} and r_j^{t,b}, are also tested and discussed in Section IV.

r_j^{t,a} is defined as

    r_j^{t,a} =
      \begin{cases}
        -1, & \text{if } s_j^t > s_j^{t,\max} \text{ and } s_j^t \text{ is minimized} \\
        -1, & \text{if } s_j^t < s_j^{t,\min} \text{ and } s_j^t \text{ is maximized} \\
        \frac{s_j^t - s_j^{t,\min}}{s_j^{t,\max} - s_j^{t,\min}}, & \text{if } s_j^t > s_j^{t,\min} \text{ and } s_j^t \text{ is maximized} \\
        \frac{s_j^{t,\max} - s_j^t}{s_j^{t,\max} - s_j^{t,\min}}, & \text{if } s_j^t < s_j^{t,\max} \text{ and } s_j^t \text{ is minimized}
      \end{cases}    (11)

where s_j^{t,\min} is defined as the target value of a Spec if that particular Spec is the smaller the better, while s_j^{t,\max} is defined as the target value of a Spec if that particular Spec is the larger the better; otherwise they are set to the worst boundary. Taking gain as an example, if the gain of a circuit is expected to be 10 dB or greater, then s_j^{t,\max} is set to 10 and s_j^{t,\min} is set to 0. If s_j^t is worse than the worst-boundary value, -1 is assigned as the score for this reward.
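As a concrete reference, the per-Spec normalization of (9) and (11) and the weighted sums of (8) and (10) can be sketched in Python as follows; the function names and the flat dictionary bookkeeping are ours, not the authors'.

def spec_reward(value, lo, hi, maximize):
    """Normalized per-Spec reward following (9)/(11).

    lo/hi are the predefined boundary values; for a maximized Spec the
    target sits at hi, for a minimized Spec the target sits at lo.
    """
    if maximize:
        if value < lo:
            return -1.0                      # worse than the worst boundary
        return (value - lo) / (hi - lo)      # approaches 1 at the target
    else:
        if value > hi:
            return -1.0
        return (hi - value) / (hi - lo)

def weighted_reward(specs, bounds, weights):
    """Weighted sum of per-Spec rewards as in (8) and (10).

    specs   : {"gain": 92.0, "power": 3.2, ...}   simulated values
    bounds  : {"gain": (5.0, 105.0, True), ...}   (lo, hi, maximize)
    weights : {"gain": 0.125, ...}
    """
    total = 0.0
    for name, value in specs.items():
        lo, hi, maximize = bounds[name]
        total += weights[name] * spec_reward(value, lo, hi, maximize)
    return total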

r_j^{t,b} is defined referring to [34]. Particularly, a clipping mechanism is introduced for the Spec reward definition to alleviate the problem of over-shooting certain Specs while under-weighting others. The clipping reward of the jth performance Spec r_j^{t,b} can be divided into r_{j,u}^t and r_{j,d}^t, each defined as the following:

    r_{j,u}^{t,b} =
      \begin{cases}
        -1, & \text{if } s_{j,u}^t < s_{j,u}^{t,\min} \\
        \min\left(1, \frac{s_{j,u}^t - s_{j,u}^{t,\min}}{s_{j,u}^{t,\max} - s_{j,u}^{t,\min}}\right), & \text{if } s_{j,u}^t > s_{j,u}^{t,\min} \text{ and } \sum_{j \in M} r_j^t < M \\
        \frac{s_{j,u}^t - s_{j,u}^{t,\min}}{s_{j,u}^{t,\max} - s_{j,u}^{t,\min}}, & \text{if } s_{j,u}^t > s_{j,u}^{t,\min} \text{ and } \sum_{j \in M} r_j^t \ge M
      \end{cases}    (12)

    r_{j,d}^{t,b} =
      \begin{cases}
        -1, & \text{if } s_{j,d}^t > s_{j,d}^{t,\max} \\
        \min\left(1, \frac{s_{j,d}^{t,\max} - s_{j,d}^t}{s_{j,d}^{t,\max} - s_{j,d}^{t,\min}}\right), & \text{if } s_{j,d}^t < s_{j,d}^{t,\max} \text{ and } \sum_{j \in M} r_j^t < M \\
        \frac{s_{j,d}^{t,\max} - s_{j,d}^t}{s_{j,d}^{t,\max} - s_{j,d}^{t,\min}}, & \text{if } s_{j,d}^t < s_{j,d}^{t,\max} \text{ and } \sum_{j \in M} r_j^t \ge M
      \end{cases}    (13)

where r_{j,u}^t represents the reward for the Spec s_{j,u}^t, which is the larger the better, and r_{j,d}^t represents the reward for the Spec s_{j,d}^t, which is the smaller the better. s_{j,u}^{t,\max} and s_{j,d}^{t,\min} are defined as the target values of the Spec s_{j,u}^t and the Spec s_{j,d}^t, respectively, while s_{j,u}^{t,\min} and s_{j,d}^{t,\max} are defined as the worst-boundary values. When a Spec fails to meet the worst boundary, the reward of the Spec is set to -1. When a Spec meets the design objective but the entire circuit does not satisfy its overall design objective, the reward is fixed at 1. Once the entire circuit meets its overall design objective, the reward of this particular Spec continues to increase and it is further optimized.
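A sketch of the clipped variant under the same assumptions as the previous snippet; the only new ingredient is a flag indicating whether the entire circuit already meets its overall design objective, which switches the cap at 1 on and off as described above.

def clipped_spec_reward(value, lo, hi, maximize, circuit_meets_target):
    """Clipped per-Spec reward following (12)/(13).

    While the overall circuit objective is not met, an individual Spec that
    already satisfies its own target is capped at 1; once the whole circuit
    meets its objective, the cap is removed so the Spec keeps being improved.
    """
    if maximize:
        if value < lo:
            return -1.0
        normalized = (value - lo) / (hi - lo)
    else:
        if value > hi:
            return -1.0
        normalized = (hi - value) / (hi - lo)
    return normalized if circuit_meets_target else min(1.0, normalized)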
D. Algorithm Execution Flow

In the optimization process, we first leverage the multiagent DDPG (MADDPG) algorithm [29], and the detailed algorithm flow is shown in Algorithm 1. x = (S_1, ..., S_N) and x' = (S_1', ..., S_N') denote the current and the next set of states of the circuit with N subblocks. Similarly, A = (A_1, ..., A_N) and R = (R_1, ..., R_N) denote the set of actions and rewards of the N subblocks, respectively. Each agent in MADDPG contains a critic network and an actor network. The critic network provides a Q value which contains the reward and future discounted reward of the subblock. The actor network is trained to get an action that maximizes the Q value.

Note that DDPG requires a stable environment for training, while the subblocks of a circuit interact with each other; therefore, these subblocks cannot be trained by multiple parallel non-interactive DDPG algorithms. MADDPG takes advantage of centralized training with decentralized execution to solve this problem.

Algorithm 1: MADDPG for Circuit Optimization
  Create N agents for the complex analog circuit's N sub-blocks;
  Randomly initialize the actor network μ_i(S^i | θ_i^μ) and the critic network Q_i(S^i, A^i | θ_i^Q) in the i-th agent for all i ∈ N;
  Initialize replay buffer D and batch size B;
  for t = 1 to T do
    Randomly generate initial states in each agent x = (S_1, ..., S_N);
    if1 size(D) < B then
      Generate random sample action A_i ← μ_i(S^i | θ_i^μ) in the i-th agent for i ∈ N
    else
      Select the action for the i-th agent according to the policy and exploration noise σ, A_i ← μ_i(S^i | θ_i^μ) + σ;
    end if1
    Use A = (A_1, ..., A_N) as input for circuit simulation and observe the new state x' = (S_1', ..., S_N') and reward R = (R_1, ..., R_N);
    Store transition (x, A, R, x') in replay buffer D;
    if2 size(D) > B then
      Sample a batch of B transitions (x̂, Â, R̂, x̂') from D;
      for agent i = 1 to N do
        Â_i' ← μ_i(Ŝ_i' | θ_i^μ);
        y_i ← r_i + γ · Q_i(x̂', Â_1', ..., Â_N');
        Update the critic network in the i-th agent by minimizing the loss function: L = (1/B) Σ (Q_i(x̂, Â_1, ..., Â_N) − y_i)^2;
        Update the actor network in the i-th agent by the gradient:
        ∇_{θ_i^μ} J(θ_i^μ) = (1/B) Σ ∇_{a_i} Q_i(x̂, Â_1, ..., Â_N) ∇_{θ_i^μ} μ_i(S_i | θ_i^μ) |_{Ŝ_i};
      end for
    end if2
  end for
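The centralized update inside the if2 branch can be sketched in PyTorch roughly as follows. This is a simplified single-step illustration under our own agent interface (actor, critic, and their optimizers); like Algorithm 1 it omits target networks and other practical details.

import torch
import torch.nn.functional as F

def maddpg_update(agents, batch, gamma=0.99):
    """One centralized training step over a sampled batch (Algorithm 1, if2 branch).

    agents : list of objects with .actor, .critic, .actor_opt, .critic_opt
    batch  : dict of per-agent lists of tensors:
             "states", "actions", "rewards", "next_states"
    """
    # Next actions predicted by every agent's actor (no exploration noise here).
    next_actions = [ag.actor(ns) for ag, ns in zip(agents, batch["next_states"])]

    for i, ag in enumerate(agents):
        # Bellman target y_i = r_i + gamma * Q_i(x', A'_1..A'_N).
        with torch.no_grad():
            q_next = ag.critic(batch["next_states"], next_actions)
            y = batch["rewards"][i] + gamma * q_next

        # Critic update: minimize (Q_i(x, A_1..A_N) - y)^2.
        q = ag.critic(batch["states"], batch["actions"])
        critic_loss = F.mse_loss(q, y)
        ag.critic_opt.zero_grad()
        critic_loss.backward()
        ag.critic_opt.step()

        # Actor update: ascend Q_i with respect to agent i's own action.
        actions = list(batch["actions"])
        actions[i] = ag.actor(batch["states"][i])   # re-evaluate own action with gradients
        actor_loss = -ag.critic(batch["states"], actions).mean()
        ag.actor_opt.zero_grad()
        actor_loss.backward()
        ag.actor_opt.step()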

1) In the execution phase, MADDPG establishes a replay buffer D to store the sampled transitions (x, A, R, x'). If the number of transitions in D is less than the batch size B, the untrained actor networks are used to generate random design parameters, which are then used for circuit simulation. However, once the number of transitions in D becomes larger than the batch size B, predictions of the optimal policy are made using the actor networks that have been trained. In the execution phase, a particular actor network only needs the information of its own local agent (S_i in x) to generate an action A_i. At the same time, exploration noise σ is added to the action A_i to help reach the global optimal point. The action vector A = (A_1, ..., A_N) is provided to the circuit for circuit simulation, and then the next state x' and reward R for all subblock circuits can be observed.

2) The centralized training phase is based on the transitions (x, A, R, x') collected in the execution phase. Concretely, each agent takes the information from all agents to train its critic network, and then the output of the critic network of an agent influences its corresponding actor network by changing its loss function. For example, in a complex analog circuit with N subblocks, the performance of the ith subblock is influenced not only by its own design parameters but also by those in other subblocks. In the execution of the MADDPG framework, each agent is able to take into account the design parameters and state vectors from all subblocks for calculating the Q value both for the current state and for that of the next state. Combining the reward R_i and Q, the state-action value function y_i of the ith agent is obtained using the Bellman equation, as described in Algorithm 1. The critic network is updated based on the loss function of the current Q value and y_i. Subsequently, the estimated Q value obtained by the critic network is used to compute policy gradients, which are then used to update the actor network. The centralized training method of MADDPG enables each agent to learn and interact effectively in an environment affected by the influences of other subblocks and is suitable for complex analog circuits that require multiple subblocks for tradeoff optimization.

3) For MA-RL algorithm termination, a maximum number of simulations is first set based on the design complexity of a particular circuit, e.g., 15 000 for a gain boost amplifier (GBA) circuit case. A particular design solution is updated and recorded if its reward is higher than the previous one. Finally, if the design solution does not reach the design Specs for the entire 15 000 runs, the one with the highest reward will be output. On the other side, if the design Specs are reached for a particular run and the result of this run is not updated for another 1000 simulation runs, we consider the result converged and the algorithm will be terminated.

It is noted that in prior art a "Done" flag is incorporated in the termination condition [35]; this has also been studied for the complex analog circuit design optimization task in Section IV. The "Done" flag is a Boolean variable assigned either 0 or 1, where the change in assignment indicates the end of an episode and the start of a new one. When the "Done" flag is introduced, the Bellman equation for each agent in Algorithm 1 is modified as follows:

    y_i ← r_i + γ · Q_i(x', Ã_1, \ldots, Ã_N)(1 − Done).    (14)

E. Enhancing RL Agent With Twin Delayed Technique

Similar to DQN's max operation on the value function, MADDPG also tends to overestimate the value function [31]. In the circuit automation task, it may preferentially select circuit transitions (A, S, S', R) with a large value that is overestimated and finish the circuit design at a suboptimal point. Furthermore, the DDPG algorithm appears to suffer from weak robustness and stability.

To further improve design automation quality and efficiency, the multiagent twin delayed DDPG (TD3) algorithm is introduced for model training [31], and there are three major techniques added in TD3.

1) Double Network: Two critic networks instead of one are employed for value function evaluation, and the lower-value one is used for the target update

    y_i ← r_i + γ · min(Q_{1,i}(x', Ã_1, \ldots, Ã_N), Q_{2,i}(x', Ã_1, \ldots, Ã_N)).    (15)

Algorithm 2: MATD3 for Circuit Optimization
  Provide N agents for the complex analog circuit's N sub-blocks;
  Randomly initialize the actor network μ_i(S^i | θ_i^μ) and two critic networks Q_{1,i}(S^i, A^i | θ_i^{Q_1}), Q_{2,i}(S^i, A^i | θ_i^{Q_2}) in the i-th agent;
  The pseudo-code in the next part is the same as that in Algorithm 1 except that the if2 loop is replaced by the if3 loop as shown below:
  if3 size(D) > B then
    Sample a batch of B transitions (x̂, Â, R̂, x̂') from D;
    for agent i = 1 to N do
      Obtain the target action according to the actor network and regularization noise ε, Â_i' ← μ_i(Ŝ_i' | θ_i^μ) + ε;
      y_i ← r_i + γ · min(Q_{1,i}(x̂', Â_1', ..., Â_N'), Q_{2,i}(x̂', Â_1', ..., Â_N'));
      Update the critic networks in the i-th agent by minimizing the loss functions:
      L(θ_i^{Q_1}) ← (1/B) Σ (Q_{1,i}(x̂, Â_1, ..., Â_N) − y_i)^2;
      L(θ_i^{Q_2}) ← (1/B) Σ (Q_{2,i}(x̂, Â_1, ..., Â_N) − y_i)^2;
      if4 t mod d then
        Update the actor network in the i-th agent by the gradient:
        ∇_{θ_i^μ} J(θ_i^μ) = (1/B) Σ ∇_{a_i} Q_{1,i}(x̂, Â_1, ..., Â_N) ∇_{θ_i^μ} μ_i(S^i | θ_i^μ) |_{Ŝ_i};
      end if4
    end for
  end if3

This mitigates the risk of value function overestimation caused by training noise. Similar to that of Algorithm 1, the Bellman equation for each agent in Algorithm 2 needs to be modified as the following when the "Done" flag is introduced:

    y_i ← r_i + γ · min(Q_{1,i}(x', Ã_1, \ldots, Ã_N), Q_{2,i}(x', Ã_1, \ldots, Ã_N))(1 − Done).    (16)

2) Delayed Policy Update: The actor network is updated at a lower frequency than the critic networks to minimize the expectation of the TD-error and improve the stability of circuit strategy training.

3) Target Policy Smoothing Regularization: Normally distributed noise is added to the training actions in the critic network for regularization; this reduces ANN overfitting and improves the stability of circuit training. The execution flow of these changes is detailed in Algorithm 2.
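Relative to the MADDPG update sketched earlier, the three TD3 modifications only change the target computation and the update schedule. The fragment below illustrates them under a hypothetical agent interface carrying two critics (critic1, critic2); the noise scale, clipping range, and delay d are placeholder values.

import torch

def matd3_targets(agents, batch, gamma=0.99, noise_std=0.2, noise_clip=0.5):
    """Twin-critic targets with smoothed target actions, following (15)/(16)."""
    noisy_next_actions = []
    for ag, ns in zip(agents, batch["next_states"]):
        a = ag.actor(ns)
        noise = torch.clamp(torch.randn_like(a) * noise_std, -noise_clip, noise_clip)
        noisy_next_actions.append(torch.clamp(a + noise, -1.0, 1.0))  # target policy smoothing

    targets = []
    for i, ag in enumerate(agents):
        with torch.no_grad():
            q1 = ag.critic1(batch["next_states"], noisy_next_actions)
            q2 = ag.critic2(batch["next_states"], noisy_next_actions)
            q_min = torch.min(q1, q2)            # double network: take the lower estimate
            done = batch["dones"][i]             # 0/1 "Done" flag, as in (16)
            targets.append(batch["rewards"][i] + gamma * q_min * (1.0 - done))
    return targets

# Delayed policy update: step the actors only every d critic updates, e.g.
#   if step % d == 0: run the same actor update as in the MADDPG sketch.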
Authorized licensed use limited to: UNIVERSITY OF SOUTHAMPTON. Downloaded on December 24,2024 at 12:52:20 UTC from IEEE Xplore. Restrictions apply.
4404 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 43, NO. 12, DECEMBER 2024

As an initial test, we compare the robustness of the TD3 and DDPG algorithms in the optimization of a rail-to-rail OTA [Fig. 3(a)] designed on a commercial 0.13-μm mixed-signal CMOS process PDK. The rail-to-rail OTA has 36 tunable design parameters and 7 design objectives. The design objectives are defined as follows:

    Open Loop Gain ≥ 100 dB
    Unity Gain Frequency ≥ 50 MHz
    Power ≤ 200 μW
    Slew Rate ≥ 10 V/μs
    PSRR ≥ 80 dB
    CMRR ≥ 100 dB
    Input common-mode range = 1.2 V.    (17)

In the comparison, the rewards are calculated based on (10) and (11), and then normalized according to the design Specs. For design points that meet or exceed the design Spec targets in all performances, their rewards are set to 1. Fig. 3(b) shows that DDPG successfully achieves the optimization task in two out of ten rounds, while TD3 always accomplishes the final goal. We have also repeated these tests, and we found that in general DDPG's success rate is between 20%–40% in the cases we have tested, while TD3 exhibits better robustness.

Fig. 3. (a) Schematic of rail-to-rail OTA. (b) Comparison of robustness between TD3 and DDPG algorithms in rail-to-rail OTA.

F. Enhancing RL Agent With Proximal Policy Optimization

The MAPPO algorithm is also investigated in this work. Unlike value-based MADDPG, MAPPO is believed to be immune from the problem of over-estimation of the value function caused by maximizing bias. In PPO, the output of the actor network is not a deterministic action but rather the mean and variance of an action distribution [32]. The actor network, with the assistance of the critic network, updates its parameters by maximizing a surrogate objective function. The objective is to find actions that can maximize the expected rewards and increase their corresponding probabilities.

To prevent training instability due to over-frequent updating of network parameters in the policy optimization, a technique called the Clipped Surrogate Objective is applied in PPO. It introduces a clipping function to safeguard the gap between the new and old policies, and it uses a parameter ε to control the degree of clipping. This helps enhance the stability of the algorithm and leads to more optimal solutions. In this work, PPO is used in the MA-RL framework, and the detailed flow of MAPPO is described in Algorithm 3.

Algorithm 3: MAPPO for Circuit Optimization
  Create N agents for the complex analog circuit's N sub-blocks;
  Randomly initialize the actor network μ_i(S^i | θ_i^μ) and the critic network Q_i(S^i, A^i | θ_i^Q) in the i-th agent for all i ∈ N;
  Initialize replay buffer D, batch size B, and PPO epoch M;
  for t = 1 to T do
    Randomly generate initial states in each agent x = (S_1, ..., S_N);
    if5 size(D) < B then
      Sample the action for the i-th agent according to the policy A_i ← P_i ← μ_i(S^i | θ_i^μ);
      Use A = (A_1, ..., A_N) as input for circuit simulation and observe the new state x' = (S_1', ..., S_N') and reward R = (R_1, ..., R_N);
      Store transition (x, A, R, x') in replay buffer D;
    end if5
    if6 size(D) > B then
      Sample a batch of B transitions (x̂, Â, R̂, x̂') from D;
      for agent i = 1 to N do
        Â_i' ← P̂_i ← μ_i(Ŝ_i' | θ_i^μ);
        y_i ← r_i + γ · Q_i(x̂', Â_1', ..., Â_N');
        Compute the advantage in the i-th agent: A_i^adv ← y_i − Q_i(x̂, Â_1, ..., Â_N);
        for m = 1 to M do
          Update the critic network in the i-th agent by minimizing the objective function: L = (1/B) Σ (Q_i(x̂, Â_1, ..., Â_N) − y_i)^2;
          Update the actor network in the i-th agent by maximizing the objective function:
          L(θ_i^μ) = E[ min( (μ_i(s^i|θ_i^μ)/μ_i^{old}(s^i|θ_i^μ)) A_i^adv, clip(μ_i(s^i|θ_i^μ)/μ_i^{old}(s^i|θ_i^μ), 1 − ε, 1 + ε) A_i^adv ) ];
        end for
      end for
      Empty the replay buffer D;
    end if6
  end for
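For reference, the clipped surrogate objective used in the actor update of Algorithm 3 can be written in a few lines of PyTorch; computing the probability ratio from log-probabilities of a sampled action distribution is our assumption about a typical implementation, not code from the authors.

import torch

def ppo_actor_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Clipped surrogate objective (negated so it can be minimized).

    log_prob_new : log-probabilities of the sampled actions under the current policy
    log_prob_old : log-probabilities under the policy that collected the batch
    advantage    : estimated advantage A^adv for each sample
    """
    ratio = torch.exp(log_prob_new - log_prob_old)           # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()             # maximize the surrogate

Repeating this update for M epochs over the same batch and then emptying the buffer mirrors the inner loop of Algorithm 3.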
Authorized licensed use limited to: UNIVERSITY OF SOUTHAMPTON. Downloaded on December 24,2024 at 12:52:20 UTC from IEEE Xplore. Restrictions apply.
BAO et al.: MA-RL: AN AUTOMATED DESIGNER FOR COMPLEX ANALOG CIRCUITS 4405

IV. EXPERIMENTS

To demonstrate the effectiveness of the proposed MA-RL methodology, we experiment with the reference methods and our proposed MA-RL (MADDPG, MATD3, and MAPPO) on three real-world circuit designs on different technology nodes: a GBA designed on a commercial 0.18-μm mixed-signal CMOS process PDK, a DLL circuit on a 0.13-μm mixed-signal CMOS process PDK, and an SAR ADC on a 65-nm mixed-signal CMOS process PDK. More specifically, the device sizing tasks for all circuits are accomplished, where transistor W, L and passive devices are optimized. The circuit simulator is Synopsys HSpice. The experiments are run on an Intel Core i9-10980XE CPU with a base frequency of 3.0 GHz. We compared our proposed MA-RL methods MADDPG, MATD3, and MAPPO with human design, simple Monte Carlo (MC), the GA referenced in [5], the BO referenced in [10], and single-agent RL techniques. In terms of the single-agent RL method, we experimented with the standard vanilla RL with the DDPG agent used in [10], [12], and [13]. It is noted that all algorithms are tested with the same number of simulations for a fair comparison.

A. Circuits and Their MA-RL Formulation for the Experiments

1) Gain Boost Amplifier: To illustrate the workflow and effectiveness of MA-RL, a fully differential GBA [Fig. 4(a)] is first designed in this work. The main structure of GBA is a fully differential folded-cascode [Fig. 4(b)]. To further improve the gain, two assistant amplifiers with the fully differential folded-cascode topology are utilized between the source and gate terminals of the common-gate transistors, as shown in Fig. 4(c) and (d), respectively. The specification targets of GBA include (a) gain, (b) unity gain frequency (UGF), (c) power consumption, (d) phase margin (PM), and (e) slew rate (SR). The design objectives are defined as follows:

    DC Gain ≥ 105 dB
    Unity Gain Frequency ≥ 10 MHz
    80 deg ≤ Phase Margin ≤ 90 deg
    Power ≤ 5 mW
    Slew Rate ≥ 10 V/μs.    (18)

Fig. 4. Schematic of (a) gain boost amplifier circuits, (b) main amplifier, (c) N boost amplifier, and (d) P boost amplifier.

In the GBA circuit, a total of 21 independent design parameters are selected as optimization variables, including seven widths of transistors in the main amplifier (W2, W4, W5, W6, W8, W10, and W12), seven widths of transistors in the N boost amplifier (Wn1, Wn2, Wn4, Wn5, Wn6, Wn8, and Wn10), and seven widths of transistors in the P boost amplifier (Wp1, Wp2, Wp4, Wp5, Wp6, Wp8, and Wp10). Other design parameters are treated as dependent parameters obtained from the independent design parameters. In the case of the main amplifier, the widths of W3, W7, W9, W11, and W13 are equal to the widths of W2, W6, W8, W10, and W12, respectively, due to the symmetry of the differential circuit. The search range for these design parameters is set to [5, 105] μm.

The source and gate of the common-gate transistors in the main amplifier are special nodes, and the two assistant amplifiers between the nodes form individual circuit subblocks. The MOS transistors within each assistant amplifier are bound to each other, forming a relatively closed design space. Therefore, the GBA is naturally partitioned into three subblocks for the MA-RL formulation. Thus, a three-agent MA-RL framework is employed here. The three agents are denoted as Agent1, Agent2, and Agent3, each assigned to an amplifier in Fig. 4(b)–(d), respectively.

In the MA-RL framework of this example, the action vectors of the three agents are the lists of design parameters corresponding to their respective circuit subblocks, denoted as Action1, Action2, and Action3. The state vectors of the three agents are the lists of overdrive voltages (Vod) of the MOS transistors to be optimized in their respective circuit subblocks, denoted as State1, State2, and State3. The rewards of the three agents are denoted as R1, R2, and R3, respectively. According to (7), the reward for each agent is defined as the sum of two parts. For example, the reward for Agent1 is the sum of the entire circuit reward R^t and the reward for the main amplifier R_main.

The reward R^t is obtained by normalizing and then taking a weighted summation of Specs (a)–(e), as defined in (10). The boundary for the gain of GBA is set between [5, 105] dB; the boundary for the UGF of GBA is set between [0, 10] MHz; the boundary for the PM of GBA is set between [0, 85] deg, and for PM greater than 90 degrees, r_PM^t is set to −1; the boundary for the power of GBA is set between [5, 10] mW, and r_power^t is set to −1 for power larger than 10 mW; the boundary for SR is set between [0, 10] V/μs. After evaluating the difficulty of optimizing each Spec, the weights for gain, UGF, PM, power, and SR in (10) are set as 1/8, 3/8, 1/8, 3/4, and 1/8, respectively. For each circuit subblock, the corresponding reward is calculated based on the gain, UGF, and power of the subblock amplifier itself, according to (8) and (9). In this example, the settings of the weights and boundaries of the Specs in the subblocks are consistent with those used in the reward R^t. During the training process of MA-RL, each subblock optimizes its performance through tradeoffs with other subblocks, and specific design objectives are not needed.

In the MA-RL framework, the learning and execution principles of each agent are similar to the common RL algorithm. Importantly, MA-RL employs a method of centralized training with distributed execution. Taking the main amplifier as an example, the actor network of Agent1 only needs the state of the main amplifier (State1) to output the corresponding action (Action1), while Agent1 is able to access the information of the entire circuit through the critic network, for which the input is a concatenation of the states and actions of all three agents

    (State1, State2, State3, Action1, Action2, Action3).    (19)

This allows Agent1 to take into account the influence of all other agents during the learning process, performing tradeoff optimization among subblocks to optimize the performance of the entire circuit.
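Using the boundaries and weights quoted above, the overall-circuit reward R^t for the GBA could be assembled with the weighted_reward sketch given after (11), roughly as follows; the dictionary layout and the example simulated values are our own illustration.

# Spec boundaries (lo, hi, maximize) and weights for the GBA overall reward,
# taken from the values quoted in the text above.
gba_bounds = {
    "gain":  (5.0, 105.0, True),    # dB
    "ugf":   (0.0, 10.0,  True),    # MHz
    "pm":    (0.0, 85.0,  True),    # deg; the paper also forces -1 above 90 deg (not modeled here)
    "power": (5.0, 10.0,  False),   # mW; values above 10 mW already get -1 in spec_reward
    "sr":    (0.0, 10.0,  True),    # V/us
}
gba_weights = {"gain": 1/8, "ugf": 3/8, "pm": 1/8, "power": 3/4, "sr": 1/8}

# Example: reward for one simulated design point (numbers are made up).
simulated = {"gain": 98.0, "ugf": 9.1, "pm": 82.0, "power": 4.6, "sr": 11.0}
r_total = weighted_reward(simulated, gba_bounds, gba_weights)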


2) Delay Locked Loop: A more complex DLL with 42 tunable design parameters, including transistor sizes and passives, is also designed to demonstrate the capability of the MA-RL framework. A DLL can be considered as a feedback circuit that phase-locks an output to an input without the use of an oscillator, and its schematic and timing diagram are shown in Fig. 5(a) and (b). Based on its circuit topology, it can be partitioned naturally into three subblocks, i.e., phase detector (PD), charge pump (CP), and voltage-controlled delay line (VCDL). The PD is mainly composed of standard cells and is used to detect the phase difference between the reference clock and the output clock fed back from the VCDL. There is no device sizing task for the PD and therefore its design is fixed. The schematic of the VCDL is shown in Fig. 5(d). The phase difference approaches zero and Vctrl becomes fixed once the delay time of the VCDL is equal to the period of the reference clock. The CP contains a charge pump, bias circuit, and buffer amplifier as shown in Fig. 5(e). Typically, to within 0.1 ns, the period of the output clock can be regarded as identical to that of the reference clock.

Fig. 5. Schematic of (a) DLL circuits, (b) timing diagram, (c) PD block, (d) VCDL block, and (e) CP block.

A two-agent RL framework is employed for this exercise, one agent assigned to the CP and the other assigned to the VCDL. The overall specifications of the DLL include (a) working frequency range, (b) lock time (tlock) of the circuit, (c) power consumption, (d) phase difference, and (e) jitter. The design objectives are defined as follows:

    Phase Difference ≤ 100 ps
    Lock Time ≤ 1 μs
    150 MHz ≤ Working Frequency Range ≤ 250 MHz
    Power ≤ 25 mW
    Jitter ≤ 300 ps.    (20)

Among those, (a)–(e) are normalized and added up as the reward R^t for this circuit as defined in (10). For agent 1, corresponding to the CP, the difference of I1 and I2, Vctrl, the lock time, and the gain of the amplifier within the CP are normalized using (8) as r1_CP, r2_CP, r3_CP, and r4_CP, respectively, and weighted-summed as the reward R_CP. The reward for agent 1 (R1) is the sum of R^t and R_CP. For agent 2, corresponding to the VCDL, the frequency range is used as the reward R_VCDL, and the reward for agent 2 (R2) is the sum of R^t and R_VCDL. The overall reward R_tot is the sum of all the aforementioned rewards. The state vector for the CP includes I1, I2, tlock, and Vctrl, and a list of time stamps before the circuit is locked in. The state vector for the VCDL is a list of frequencies at different Vctrl.

3) Successive Approximation Register ADC: This experiment is to design a 10-bit 50-MS/s asynchronous differential-input SAR ADC with 28 tunable design parameters, including transistor sizes and passives. The SAR ADC is a high-speed analog-to-digital converter that uses a binary search to gradually approach the digital value of the input analog signal; its circuit diagram is shown in Fig. 6. According to its circuit topology, it can be divided into four subblocks, i.e., bootstrap switch (BS), capacitive digital-to-analog converter (CDAC), comparator (COMP), and SAR logic circuit (SARL). The BS completes the sampling and holding of the input voltage; its circuit diagram is shown in Fig. 6(b). The CDAC converts the input signal of the SAR logic circuit into an analog voltage as an approximation value, as shown in Fig. 6(c). The COMP compares the approximation value with the input signal and generates a digital bit according to the comparison result. There is also a peripheral function control circuit around the COMP, as shown in Fig. 6(d). The SARL in Fig. 6(e) approaches the value of the input signal by multiple incremental approximations; it is mainly composed of standard cells, so there is no device sizing task for the SARL. Finally, we connect the output digital signal of the SAR ADC to an ideal DAC and merge the redundant bits to simulate and analyze the converted analog signal.

Fig. 6. Schematic of (a) SAR ADC circuits, (b) BS block, (c) CDAC, (d) comparator block, and (e) successive approximation register logic block.

The Spec targets of the SAR ADC include (a) effective number of bits (ENOB), (b) signal-to-noise and distortion ratio (SNDR), (c) spurious-free dynamic range (SFDR), (d) total
BAO et al.: MA-RL: AN AUTOMATED DESIGNER FOR COMPLEX ANALOG CIRCUITS 4407

harmonic distortion (THD), (e) power, and (f) FOM. The design objectives are defined as follows:

    Effective Number of Bits ≥ 9.5 bit
    Signal-to-Noise and Distortion Ratio ≥ 60 dB
    Spurious-Free Dynamic Range ≥ 65 dB
    Total Harmonic Distortion ≤ −60 dB
    Power ≤ 8 mW
    FOM ≤ 220 fJ/conv.-step.    (21)

In the SAR ADC circuit, a total of 28 independent design parameters are selected as optimization variables, including 14 design parameters in the BS (Wb1, ..., Wb6, Lb1, ..., Lb6, Cwb1, and Csb1), two design parameters in the CDAC (Cwc1 and Csc1), and 12 design parameters in the COMP (Wcom1, ..., Wcom6, Lcom1, ..., Lcom6), respectively. Based on prior design experience, the search range of the transistor width W is set to [2, 16] μm, the search range of the transistor length L is set to [0.06, 0.14] μm, and the search ranges of the capacitance width Cw and capacitance spacing Cs are set to [0.1, 0.16] μm.

A three-agent RL framework is employed for this exercise, the agents of which are assigned to the BS, CDAC, and COMP and denoted as Agent1, Agent2, and Agent3, respectively. The Specs of the SAR ADC are normalized and added up as the reward R^t as defined in (10). The boundary for ENOB is set between [0, 9.5] bit; the boundary for SNDR is set between [0, 60] dB; the boundary for SFDR is set between [0, 65] dB; the boundary for THD is set between [−60, 0] dB; the boundary for the power of the SAR ADC is set between [8, 10] mW, and r_power^t is set to −1 for power greater than 10 mW.

For Agent1, corresponding to the BS, ENOB and SFDR are normalized using (9) as r1_1 and r2_1, respectively, and added as the reward R1. For Agent2, corresponding to the CDAC, the area is used as the reward and weighted-summed as the reward R2. For Agent3, corresponding to the COMP, the delay time is used as the reward R3. The overall reward R_tot is the sum of all the aforementioned rewards. The state vectors for Agent1 and Agent3 include a list of Vod, and the state vector for Agent2 includes a list of voltage drops for each capacitor. Considering nonideal factors, we add a 1% mismatch bias to the capacitive components in the SAR ADC to simulate the effect of process bias, which better demonstrates the effectiveness of our approach in a real circuit environment.

B. Comparisons of MA-RL Under Different Reward and State Definitions

In the MA-RL framework, the reward function determines how agents evaluate different actions, while the state determines the circuit information that agents can observe at the moment. The definitions of the reward function and the state vector may impact the optimization capability and robustness of the algorithm. In Section III-C, this article introduces two definitions for the state of agent i, State^{i,a} and State^{i,b}, and two definitions for the reward of the entire circuit, R^{t,a} and R^{t,b}. In this section, these different definitions of rewards and states in MA-RL and their impacts are investigated and discussed.

TABLE I
PERFORMANCE SPECS COMPARISONS OF GAIN BOOST AMPLIFIER OBTAINED BY TEN OPTIMIZATIONS OF MATD3 WITH TWO DIFFERENT REWARD SETTINGS

The effectiveness and robustness comparisons between MATD3 using R^{t,a} and R^{t,b} are performed, where the former is denoted as MATD3a and the latter is denoted as MATD3b, respectively. These two are both tested on the GBA circuit design for 10 runs, with each optimization run terminated after a fixed number of 15 000 simulations. The design objectives, ranges of design parameters, boundary values of Specs, and weights of Specs for the overall GBA circuit and its subblock circuits are consistent with those set up for GBA in Section IV-A. The results of this experiment are shown in Table I, where the Spec values marked in red indicate an unmet design Spec objective defined in (18). It is shown that MATD3a fails for some of the design objectives in three out of ten runs, while MATD3b fails only once. This shows that the reward with the clipping mechanism may help improve the robustness of the MA-RL methodology. Nevertheless, the single failure case of MATD3b indicates that future work to improve other aspects of this framework is warranted to achieve a 100% success rate.


Fig. 7 depicts the results of ten separate optimization runs on the GBA circuit using the MATD3 algorithm under different agent state definitions. The circuits are evaluated using R^{t,b} as a comprehensive assessment Spec. The red dashed line in the graph represents the baseline of R^{t,b} that meets the design objective for the GBA (R^{t,b} = 1.5). MATD3c, shown in red in Fig. 7, uses the simulated operating conditions as the agent states [State^{i,a} in (5)]. The green data in Fig. 7 are obtained by MATD3d, using the current performance of the corresponding circuit as the agent states [State^{i,b} in (6)]. It is shown that both MATD3c and MATD3d reach similar R^{t,b} (1.95 for MATD3c and 1.93 for MATD3d). However, MATD3c demonstrates a higher success rate than that of MATD3d. We suspect that the operating conditions of the circuit contain richer underlying information compared to the circuit Specs, and they can establish a more direct connection with the circuit design parameters in the actor network. Therefore, the strategy training of the actor network can be made more stable.

Fig. 7. Maximum R^{t,b} reached after ten optimization runs of GBA using MATD3 with different state settings. MATD3c uses the simulated operating conditions as the agent state vectors and MATD3d uses the current performance of the corresponding circuit as the agent state vectors. The bold colors of red and green are failure cases for MATD3c and MATD3d, respectively.

C. Results Comparisons Between Multiagent-Based RL and Other Circuit Optimization Methods

To demonstrate the effectiveness of the multiagent technique, a test using four RL algorithms on a GBA circuit is performed and the results are shown in Fig. 8. The circuits are evaluated using R^{t,b} as a comprehensive assessment Spec. TD3 and MADDPG can be considered as two improved algorithms based on DDPG from different perspectives. The results in Fig. 8 indicate that both TD3 and the multiagent technique are beneficial for reaching a more optimal result. Compared to the improvements from TD3, incorporating the multiagent structure leads to a greater enhancement in optimization performance over conventional DDPG.

Fig. 8. Learning curves of five different RL algorithms using the reward definition of R^{t,b} tested on the GBA design.

Additionally, Fig. 8 also compares the optimization results on GBA using MATD3 with (MATD3d) and without the "Done" flag (MATD3). The episode length for MATD3d is set to 40. As shown in Fig. 8, MATD3 achieves a more optimal result compared to MATD3d, while MATD3d may converge faster. This may be attributed to the following reason. MATD3 tends to pursue a superior design even after meeting the targets within its episode length, and it does not terminate unless no higher reward is achieved for a further 1000 consecutive simulations. Meanwhile, MATD3d resets the environment and reduces the expected future rewards upon meeting the design objectives, leading the agent to converge toward the termination condition of fulfilling the design targets.

Fig. 9(a)–(c) compares the learning curves of different algorithms for three different complex analog circuits, which are DLL, GBA, and ADC. The circuits are evaluated using the FoM in (2) as a comprehensive assessment Spec, and only the optimal design results are updated and displayed in the learning curves during the simulation iteration process, i.e., a data point on the learning curves updates only if it is higher than the previous one. From Fig. 9, we see that all MA-RL methods outperform the conventional optimization algorithms, such as MC, GA, and BO, and they also achieve better results than the single-agent RL DDPG on all three circuits. This indicates that the MA-RL methodology is indeed effective in the optimization tasks for complex analog circuits with multiple subblocks. Compared to single-agent RL methods (such as DDPG), MA-RL leverages a centralized learning and decentralized execution approach. This not only reduces the complexity of the parameter optimization problems in complex analog circuits, but also balances the competition and collaboration among circuit subblocks. As a result, the MA-RL methods could reach more optimal design points in the complex analog circuits.

Furthermore, for all test cases, the optimal design points found by MATD3 and MAPPO have higher FoMs than those found by MADDPG. Specifically, MATD3 performs optimally in the GBA and SAR ADC designs, while MAPPO excels in optimizing the DLL design. This suggests the effectiveness of the multiagent twin-delayed techniques and multiagent PPO.

TABLE II
PERFORMANCE METRICS COMPARISONS OF GBA

Finally, we also investigated the individual circuit specifications reached for all circuits, as shown in Tables II–IV. All MA-RL methods show good balance on all design specification targets in the three circuits. For GBA, our proposed

TABLE III
PERFORMANCE SPECS COMPARISONS OF DLL

TABLE IV
PERFORMANCE SPECS COMPARISONS OF SAR ADC

TABLE V
PERFORMANCE SPECS OF DLL FOR DIRECT MA-RL ACTION TRANSFER FROM 0.13- TO 0.18-μm TECHNOLOGY

Fig. 9. Learning curves of different algorithms on (a) the GBA design, (b) the DLL design, and (c) the successive approximation register (SAR) ADC design. For all cases, the proposed MA-RL methods show higher FoMs, and MATD3 and MAPPO generally achieve higher FoMs than the MADDPG algorithm.

D. Knowledge Transfer Performances

To show the optimality of the design created by our MA-RL methods, we apply the MATD3-trained action, i.e., the device sizes of the DLL obtained in the 0.13-μm CMOS PDK, directly to the 0.18-μm PDK. The performance Specs of both are presented in Table V. It is demonstrated that the technology transfer is successful without any further change or optimization, except that the Vdd is scaled from 1.2 to 1.8 V. In particular, the phase difference and jitter are very close to or even better than those of the original design, and the lock time and WFR are also within the design specification. The power is larger due to the Vdd scaling. This further shows that the design point found by the MA-RL method is of high quality.
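For illustration, this direct transfer amounts to reusing the trained sizing action unchanged in the new technology, with only the supply voltage rescaled; the netlist fragment, device names, and values below are hypothetical.

# Sizing vector found by MATD3 in the 0.13-um PDK (made-up values), reused as-is.
best_action = {"W_M1": 4.2e-6, "L_M1": 3.0e-7, "W_M2": 8.0e-6, "L_M2": 3.0e-7}

template_018um = """* Hypothetical 0.18-um DLL testbench fragment
.param vdd={VDD}
M1 out in vdd vdd pch W={W_M1} L={L_M1}
M2 out in 0   0   nch W={W_M2} L={L_M2}
"""

netlist = template_018um.replace("{VDD}", "1.8")   # only Vdd changes: 1.2 V -> 1.8 V
for name, value in best_action.items():
    netlist = netlist.replace("{" + name + "}", f"{value:.3e}")
print(netlist)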


V. CONCLUSION

We propose MA-RL-based methods for the automated optimization of complex analog circuits to achieve better performance. The circuit is partitioned into subblocks based on topology information, and each subblock is assigned to an RL agent for which a specific reward is designed based on design knowledge. The agents then interact with each other during the training process to reach the overall design goal. Furthermore, the twin-delayed technique and PPO are introduced to further improve the training stability and efficiency. Finally, adopting reward clipping and setting the circuit operating conditions as the agent states can help improve the success rate of the optimization algorithm in achieving the design objectives. This work opens the pathway for the design automation of large-scale analog circuits as well as systems.
Jiarui Bao received the B.S. degree in material physics from Nanchang University, Nanchang, China, in 2018. He is currently pursuing the Ph.D. degree with the School of Information Science and Technology, Fudan University, Shanghai, China.
His research interests include analog circuit design automation and optimization, RF circuit design automation, and semiconductor devices and physics.


Jinxin Zhang received the B.S. degree in information engineering from Shandong University, Jinan, Shandong, China, in 2020, and the M.S. degree from Fudan University, Shanghai, China, in 2023.
His research interests include analog circuit design automation and reinforcement learning.

Zhangcheng Huang (Member, IEEE) received the B.S. degree in physics from Nanjing University, Nanjing, China, in 2006, and the Ph.D. degree in microelectronics from the Chinese Academy of Sciences, Beijing, China, in 2011.
From 2011 to 2013, he was an Assistant Professor with the Shanghai Institute of Technical Physics, Chinese Academy of Sciences. From 2013 to 2020, he was an Associate Professor. He is currently an Associate Professor with the Frontier Institute of Chip and System, Fudan University, Shanghai, China. His research interests include high-performance ASICs for detectors and smart imaging sensors.

Zhaori Bi (Member, IEEE) received the B.Eng. degree in electronic information engineering from Wuhan University of Technology, Wuhan, China, in 2011, and the M.S. and Ph.D. degrees in electrical engineering and computer engineering from The University of Texas at Dallas, Richardson, TX, USA, in 2013 and 2017, respectively.
He is currently an Assistant Professor with Fudan University, Shanghai, China. His current research interests include mixed-signal system-on-a-chip design, circuit performance optimization, and applications in medical AI.

Xingwei Feng received the B.S. degree in applied physics from Nanjing University of Science and Technology, Nanjing, China, in 2022. He is currently pursuing the master's degree in electronic information with Fudan University, Shanghai, China.
His research focuses on analog circuit design automation and reinforcement learning.

Xuan Zeng (Senior Member, IEEE) received the B.S. and Ph.D. degrees in electrical engineering from Fudan University, Shanghai, China, in 1991 and 1997, respectively.
She is currently a Full Professor with the Microelectronics Department. She served as the Director of the State Key Laboratory of Application Specific Integrated Circuits and Systems, Fudan University, from 2008 to 2012. She was a Visiting Professor with the Department of Electrical Engineering, Texas A&M University, College Station, TX, USA, and with the Microelectronics Department, Technische Universiteit Delft, Delft, The Netherlands, in 2002 and 2003, respectively. Her current research interests include analog circuit modeling and synthesis, design for manufacturability, high-speed interconnect analysis and optimization, and circuit simulation.
Prof. Zeng received the Best Paper Award from Integration, the VLSI Journal in 2018 and the Best Paper Award from the 8th IEEE Annual Ubiquitous Computing, Electronics and Mobile Communication Conference in 2017. She was appointed as a Changjiang Distinguished Professor by the Ministry of Education of China in 2014, and received the Chinese National Science Funds for Distinguished Young Scientists in 2011, the First-Class Natural Science Prize of Shanghai in 2012, the 10th For Women in Science Award in China in 2013, and the Shanghai Municipal Natural Science Peony Award in 2014. She is an Associate Editor of IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS: PART II, IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, and ACM TRANSACTIONS ON DESIGN AUTOMATION OF ELECTRONIC SYSTEMS.

Ye Lu (Member, IEEE) received the B.S. degree in physics from Nanjing University, Nanjing, China, in 2006, and the Ph.D. degree in physics from the University of Pennsylvania, Philadelphia, PA, USA, in 2011.
He was a Senior Research and Development Engineer with Intel Corporation, Hillsboro, OR, USA, from 2011 to 2016, and a Staff Engineer with Qualcomm, San Diego, CA, USA, from 2016 to 2019. He is currently a Faculty Member with Fudan University, Shanghai, China, where his research interests focus on semiconductor devices, device modeling, automated circuit design, and simulations.
