Design of High-Speed and Low-Power Finite-Word-Length PID Controllers

This document summarizes a research paper that proposes optimized finite-word-length (FWL) PID controller designs for embedded control applications. It discusses existing PID controller implementations and identifies shortcomings like high latency, FPGA dependence, and inability to handle varying parameters. The paper then presents three new optimized PID controller algorithms - Booth, modified Booth, and a recursive multi-bit multiplication algorithm. These algorithms enable finely-grained PID structures with bit-level precision to balance performance and power usage for different applications. The PID controllers are implemented as reconfigurable RTL IP cores to overcome limitations of prior work and better suit embedded control needs.


Design of High-Speed and Low-Power Finite-Word-Length PID Controllers


Abdelkrim K. Oudjida, Nicolas Chaillet, Ahmed Liacha, Mohamed L.
Berrandjia, Mustapha Hamerlain

To cite this version:


Abdelkrim K. Oudjida, Nicolas Chaillet, Ahmed Liacha, Mohamed L. Berrandjia, Mustapha Hamerlain. Design of High-Speed and Low-Power Finite-Word-Length PID Controllers. Journal of Control Theory and Technology, 2014, 12, pp.68-83. ⟨10.1007/s11768-014-2131-5⟩. ⟨hal-00941303⟩

HAL Id: hal-00941303


https://hal.science/hal-00941303
Submitted on 14 Mar 2014

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Design of High-Speed and Low-Power
Finite-Word-Length PID Controllers
A.K. Oudjida (1), N. Chaillet (2), A. Liacha (1), M.L. Berrandjia (1), and M. Hamerlain (1)
(1) Centre de Développement des Technologies Avancées, Algiers, Algeria
(2) FEMTO-ST Institute, Besançon, France

Abstract: ASIC or FPGA implementation of a finite-word-length PID controller requires expertise in two fields: control systems and hardware design. In this paper, we focus only on the hardware side of the problem. We show how to design configurable fixed-point PIDs for applications requiring minimal power consumption, a high control rate, or both. As multiplication is the engine of the PID, we experimented with three algorithms: Booth, modified Booth, and a new recursive multi-bit multiplication algorithm. The latter enables the construction of finely grained PID structures with bit-level and unit-time precision. This feature makes it possible to tailor the PID to the desired performance and power budget. All PIDs are implemented at RTL level as technology-independent reusable IP-cores. They are reconfigurable according to two compile-time constants: set-point word-length and latency. To make the PID design easily reproducible, all necessary implementation details are provided and discussed.

Index Terms: Design-Reuse, Embedded Finite-Word-Length (FWL) Controllers, Intellectual Property (IP), Linear Time Invariant (LTI) Systems, Low-Power and Speed Optimization, Proportional-Integral-Derivative (PID)

I. BACKGROUND AND MOTIVATION

The PID is by far the most commonly used feedback controller due to its simple structure and robust performance [1]. An important feature of this controller is that it does not require a precise analytical model of the system being controlled, which makes it very attractive for a large class of dynamic systems. While the PID is well adapted to linear-time-invariant (LTI) systems [2], it is ineffective for non-LTI ones. Nevertheless, some solutions exist, such as partitioning the non-LTI control algorithm into a linear portion and a non-linear portion [3][4][5]. The linear portion represents the major control loop and is computed using an integrated PID, while the non-linear portion, which acts as dynamic compensation to the linear one, is performed in software using a general-purpose microprocessor or a DSP.

In embedded control applications, such as small-scale mobile robots, the control-loop cycle is very tight and the power budget is very limited. A low sample rate degrades control performance, and high power consumption shortens battery lifetime. To cope with these two severe and antagonistic constraints, a PID structure that is both high-speed and low-power is of utmost importance.

Today, design-reuse [6] is a well-established design standard that helps designers cope with rapid technology changes and increasing design complexity. It consists in using predesigned, technology-independent, generic and reconfigurable IP-cores [7], most commonly implemented at register-transfer level (RTL).

However, at the RTL abstraction level, no significant optimization results can be achieved unless optimization is also undertaken at the architectural and especially the algorithmic level. To achieve such a goal, a deep insight into PID arithmetic is necessary. At this stage, the choice of a numeric representation format is a crucial issue. Compared to floating-point, the fixed-point format is the best candidate for optimized designs as it is much simpler to implement, faster, more power-efficient and requires far fewer hardware resources. However, its limited dynamic range can be a source of control instability. This problem, referred to as the finite-word-length (FWL) effect, is an active research area that aims to shorten the floating-to-fixed-point conversion time while preserving control performance [8][9].

The digital implementation of PID controllers went through several stages of evolution, initially dominated by the use of commercial-off-the-shelf (COTS) components and DSPs. Over the past few years, however, FPGAs have brought a key advantage to digital control: the inherent parallelism of the FPGA architecture allows many independent control loops to run at different deterministic rates without relying on shared resources that might slow down their responsiveness, as is the case with COTS components and DSPs [10][11].

Recent PID-related works can be classified into three categories. The largest one includes works that are straightforward FPGA implementations targeting specific applications: DC-DC converter [12], temperature control [13], multi-axis motor control [14], liquid level control [15], and Xilinx versus Altera FPGA implementation for result comparison [16]. The second category proposes methodologies that analyze the FWL effect on the PID controller in order to reduce the number of hardware resources [17][18]. Finally, the third category, paradoxically the smallest one despite the large popularity of the PID, comprises architecture-optimization works. In [19], low-power serial and parallel multiple-channel PID architectures are proposed for small mobile robots. In that work, the optimization was carried out at the macro level, considering several PIDs, rather than at the micro level (optimization of the PID itself). Nevertheless, the whole architecture would deliver much better results if combined with an optimized PID. The second work [20] proposes serial, parallel, and mixed PID architectures incorporating different numbers (1 to 3) of multiplication cores. High power consumption, even with the serial architecture, and a complex control part are the two major shortcomings of this proposal. Finally, in [21] an attractive optimized PID structure based on distributed arithmetic (DA) is presented. Although the latter exhibits interesting results in terms of resource utilization and power consumption, it suffers from three serious drawbacks: high latency (n+1 clock cycles for an n-bit set-point word-length), FPGA technology dependence,
as it is essentially based upon FPGA look-up tables (LUTs), and inability to handle time-varying PID parameters, since they are precomputed and stored in LUTs. Nevertheless, it is taken as the reference design against which our results are compared under the same conditions.

The objective of this paper is to design optimized FWL-PID structures that overcome all the above-mentioned shortcomings and are especially dedicated to embedded control applications. The PID cores are described at RTL level. They are highly reconfigurable and technology-independent, so they can be mapped on both FPGA and ASIC.

To reach this goal, a special focus was put on the optimization of the inner arithmetic of the PID. For that, we considered two discrete forms of the PID algorithm: the commercial form [22], also called the standard or ISA form, and the incremental form. These two forms went through three successive types of FPGA implementation, using the Booth multiplication algorithm (BMA) [23], the modified Booth multiplication algorithm (MBMA) [24], and a newly developed version called the recursive multibit recoding multiplication algorithm (RMRMA) [25]. Results show gradual improvements with clear superiority over those provided in [21]. PID control-rate and energy-consumption savings are respectively as follows: 32% and 25% with BMA, 177% and 23% with MBMA, 431% and 20% with RMRMA.

Our previous paper [26] introduced a limited design-space of the PID. In this paper, we extend the design-space to accommodate different application cases and provide all necessary implementation details to make the design easily reproducible.

The paper is organized as follows. In this section we outlined the main requirement specifications for an embedded PID controller. Section II introduces the two most widely used discrete versions of the PID algorithm. Sections III, IV and V deal with BMA, MBMA and RMRMA implementations, respectively. A discussion of the obtained results is given in Section VI. Section VII describes the verification method, while Section VIII shows how the FWL effect is tackled. Finally, some concluding remarks are given in Section XI.

II. THE TWO MOSTLY-USED DISCRETE VERSIONS OF PID

A typical closed-loop system using a PID controller is shown in Fig. 1, where uc(k), y(k), and u(k) are the discrete signal quantities at the kth sampling instant of the reference set-point, the process-feedback measured output, and the PID controller output, respectively.

[Fig. 1: block diagram. The set-point uc(k) and the measured output y(k) enter the PID controller through an input interface; the controller output u(k) drives the process under control through an output interface.]
Fig. 1. Typical closed-loop control system using a PID

In digital control, the commercial and incremental forms are the two most widely used discrete PID versions [1][22]. They are described by the recurrence equations (1) and (2), respectively, and their corresponding coefficients are grouped in Table I. Equations (1) and (2) are fully detailed in the Appendix.

$$u(k) = P(k) + I(k) + D(k) \tag{1}$$

where
$P(k) = A \cdot u_c(k) + B \cdot y(k)$;
$I(k) = I(k-1) + C \cdot e(k-1)$;
$D(k) = H \cdot D(k-1) + L \cdot f(k)$,
with $e(k-1) = u_c(k-1) - y(k-1)$ and $f(k) = y(k) - y(k-1)$.

$$u(k) = u(k-1) + A \cdot e(k) + B \cdot e(k-1) + C \cdot e(k-2) \tag{2}$$

where
$e(k) = u_c(k) - y(k)$;
$e(k-1) = u_c(k-1) - y(k-1)$;
$e(k-2) = u_c(k-2) - y(k-2)$.

TABLE I
COEFFICIENTS OF DISCRETE RECURRENT EQUATIONS

| Coefficient | Commercial PID | Incremental PID |
|---|---|---|
| A | $K_p b$ | $K_p \left(1 + \frac{T_s}{T_i} + \frac{T_d}{T_s}\right)$ |
| B | $-K_p$ | $-K_p \left(1 + \frac{2 T_d}{T_s}\right)$ |
| C | $K_p \frac{T_s}{T_i}$ | $K_p \frac{T_d}{T_s}$ |
| H | $\frac{T_d}{T_d + N T_s}$ | - |
| L | $-\frac{K_p T_d N}{T_d + N T_s}$ | - |

Kp is the proportional gain; Ti and Td are the integral and derivative times, respectively; N is the maximum derivative gain; b is the fraction of set-point in the proportional term; and Ts is the sampling period.

To satisfy different application cases, two IP versions are developed for each equation: one with constant coefficients and one with varying coefficients (Fig. 2). The latter requires a host-side interface (HSI) to handle the runtime change of the coefficients.

[Fig. 2: port diagrams of the four IP-cores. PID1 and PID2 take Mode, Ck, Reset, uc(k) and y(k) and produce u(k) and Done; PID3 and PID4 have the same ports without Mode. PID1 uses the constants A, B, C, H, L and PID3 the constants A, B, C; PID2 and PID4 add an HSI with the signals Din, Adr, Rw and Cs.]
Fig. 2. Various PID IP-cores. (a) commercial PID with constant coefficients; (b) commercial PID with time-varying coefficients; (c) incremental PID with constant coefficients; (d) incremental PID with time-varying coefficients.

The commercial version supports the three standard PID functioning modes (P, PI, PID), selected by the Mode input value. At the end of the u(k) computation, the Done output signal toggles for one clock cycle, and the PID enters sleep mode (all internal activity stopped except for clocking and HSI) for maximum energy conservation.
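The two recurrences of equations (1) and (2) can be sketched in a few lines of Python. This is only a floating-point behavioural model under stated assumptions, not the fixed-point RTL described in the paper: the function and class names are hypothetical, all state is assumed to start at zero, and the negative sign used for L follows the usual textbook derivation with f(k) = y(k) - y(k-1), which is an assumption here.

```python
def commercial_coeffs(kp, ti, td, n, b, ts):
    """Coefficients A, B, C, H, L of the commercial form (Table I)."""
    den = td + n * ts
    # Sign of L assumed negative (derivative acts on the measured output).
    return kp * b, -kp, kp * ts / ti, td / den, -kp * td * n / den


def incremental_coeffs(kp, ti, td, ts):
    """Coefficients A, B, C of the incremental form (Table I)."""
    return kp * (1 + ts / ti + td / ts), -kp * (1 + 2 * td / ts), kp * td / ts


class CommercialPID:
    """Equation (1): u(k) = P(k) + I(k) + D(k)."""

    def __init__(self, a, b, c, h, l_coef):
        self.a, self.b, self.c, self.h, self.l_coef = a, b, c, h, l_coef
        self.i = self.d = 0.0            # I(k-1) and D(k-1)
        self.e_prev = self.y_prev = 0.0  # e(k-1) and y(k-1)

    def step(self, uc, y):
        p = self.a * uc + self.b * y                    # P(k)
        self.i += self.c * self.e_prev                  # I(k)
        self.d = self.h * self.d + self.l_coef * (y - self.y_prev)  # D(k)
        self.e_prev, self.y_prev = uc - y, y            # store for next sample
        return p + self.i + self.d


class IncrementalPID:
    """Equation (2): u(k) = u(k-1) + A e(k) + B e(k-1) + C e(k-2)."""

    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c
        self.u = self.e1 = self.e2 = 0.0  # u(k-1), e(k-1), e(k-2)

    def step(self, uc, y):
        e = uc - y
        self.u += self.a * e + self.b * self.e1 + self.c * self.e2
        self.e1, self.e2 = e, self.e1     # shift the error history
        return self.u
```

Counting the operations in each `step` method gives five multiplications for the commercial form and three for the incremental form, which is why the multiplier dominates the hardware cost.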
III. BMA BASED PID

A straightforward parallel implementation of the PID requires 7 adders/subtractors and 5 multiplication cores for equation (1), and 4 adders/subtractors and 3 multiplication cores for equation (2). In digital hardware, the total gate count scales linearly with word length for an adder core, while it scales quadratically for a multiplier core. Thus, any effort toward a low-power optimization of the PID must focus on the implementation of the multiply-and-accumulate (MAC) function (X.Y) [27]. In this work, the optimization effort is concentrated instead on the double MAC function (X.Y+T.Z), called DMAC, which is the main building block of our PID structures. Equations (1) and (2) are partitioned accordingly.

For the FWL-PID, two's complement fixed-point representation is used, usually expressed in Q notation as Qni.nf. The values are coded in ni bits before the point (integer word length including 1 sign bit) and nf bits after the point (fractional word length). The total word length is n = ni + nf.

The Booth multiplication algorithm [23] belongs to the class of recoding algorithms, i.e. algorithms that recode one of the two operands to cope with signed two's complement multiplication. Let Y be the multiplier:

$$Y = -y_{n-1} 2^{n-1} + \sum_{j=0}^{n-2} y_j 2^j \tag{3}$$

Equation (3) can also be expressed as follows:

$$Y = \sum_{j=0}^{n-1} (y_{j-1} - y_j)\, 2^j = \sum_{j=0}^{n-1} Q_j 2^j \tag{4}$$

where $y_{-1} = 0$ and $Q_j \in \{-1, 0, 1\}$.

Consequently, the multiplier Y is divided into n slices, each of 2 bits. Each pair of contiguous slices has one bit in common. Thus, the DMAC becomes:

$$X \cdot Y + T \cdot Z = \sum_{j=0}^{n-1} (Q_j \cdot X)\, 2^j + \sum_{j=0}^{n-1} (P_j \cdot T)\, 2^j \tag{5}$$

$$= \sum_{j=0}^{n-1} \left[ Q_j \cdot X + P_j \cdot T \right] 2^j \tag{6}$$

According to (5), the Booth algorithm recodes the multiplier Y into a set of ternary digits {-1, 0, 1} in order to generate n simple partial products, which are subsequently summed. Table II summarizes the 4 cases that may occur. The value -X is easily formed by adding 1 to the complement of X.

TABLE II
BOOTH ALGORITHM

| Yj | Yj-1 | Operation |
|---|---|---|
| 0 | 0 | +0 |
| 0 | 1 | +X |
| 1 | 0 | -X |
| 1 | 1 | -0 |

A direct translation of DMAC equation (5) into an architecture (Fig. 3) requires one extra adder and two extra registers in comparison with the optimized version (Fig. 4) based on (6), called ODMAC. Fig. 3 also needs one additional clock cycle of latency. The control part responsible for producing the successive pairs (yj-1, yj) is insignificant: just one multiplexer driven by a counter.

[Fig. 3: two MAC units. In each, a 4:1 multiplexer with inputs "0", X, X, "0" (respectively "0", T, T, "0"), driven by (yj-1, yj) (respectively (zj-1, zj)), produces Qj.X (respectively Pj.T), which goes through an adder with carry-in Cin, a << j shifter and a register, for j = 0, ..., n-1. A final adder and register deliver X.Y+T.Z on 2n+1 bits.]
Fig. 3. Straightforward DMAC implementation

[Fig. 4: the same two multiplexers feed a single datapath: Qj.X and Pj.T are added, shifted by j and accumulated in one register, for j = 0, ..., n-1, delivering X.Y+T.Z on 2n+1 bits.]
Fig. 4. Optimized DMAC implementation

Based upon the ODMAC as the main building block, PID architectures are constructed for both the incremental (Fig. 5) and commercial (Fig. 6) forms, and their implementation results (Table III) are compared to those of [21]. The comparison was made under identical conditions using the same FPGA device (Spartan XC2S50E-7FT256), although relatively old, as well as the same synthesis-tool version (Xilinx ISE 9.1i). In [21], only a 16-bit word-length commercial version with constant coefficients (without HSI) is implemented. PID1 and PID3 exhibit savings of 44%, 25%, and 32% and of 62%, 35%, and 38%, respectively, in terms of gate count, power, and speed. PID3 exhibits higher savings, but at the expense of control quality. Latency is the same (17), i.e. n+1 clock cycles, for all designs (PIDX).

[Fig. 5: incremental PID (PID3-4) datapath built from a subtractor forming e(k) from uc(k) and y(k), registers holding e(k-1), e(k-2) and u(k-1), a MAC and an ODMAC fed by the coefficients A, B, C, and output adders delivering u(k) on 2n+log2(r)+2 bits.]
Fig. 5. Incremental PID architecture

Optimizing latency without sacrificing the three other criteria is the main objective of the next two sections.
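The Booth recoding of equation (4) and the merged accumulation of equation (6) can be checked with a short bit-level model. The sketch below is written in Python for illustration only; the helper names (`to_bits`, `booth_digits`, `odmac`) are hypothetical and operands are plain signed integers rather than Qni.nf values, which only differ by a fixed scaling.

```python
def to_bits(value, n):
    """Two's complement bits y_0 .. y_(n-1), least significant first."""
    if not -(1 << (n - 1)) <= value < (1 << (n - 1)):
        raise ValueError("value does not fit in n bits")
    return [(value >> j) & 1 for j in range(n)]


def booth_digits(value, n):
    """Booth digits Q_j = y_(j-1) - y_j of equation (4), with y_(-1) = 0."""
    y = to_bits(value, n)
    digits = []
    prev = 0                      # y_(-1)
    for bit in y:
        digits.append(prev - bit) # each digit is -1, 0 or +1
        prev = bit
    return digits


def odmac(x, y, t, z, n):
    """Serial evaluation of equation (6): one merged partial product per step."""
    q = booth_digits(y, n)
    p = booth_digits(z, n)
    acc = 0
    for j in range(n):            # n iterations, as in the BMA-based ODMAC
        acc += (q[j] * x + p[j] * t) << j
    return acc


# Example with 8-bit operands: (-3 * 5) + (7 * -6)
print(odmac(-3, 5, 7, -6, 8))
```

Because every digit is -1, 0 or +1, each step only needs 0, X or -X (and 0, T or -T), which is what the 4:1 multiplexers of Fig. 4 select.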


[Fig. 6: commercial PID (PID1-2) datapath built from MAC and ODMAC blocks, subtractors and registers, with internal signals e(k-1), f(k), P(k), I(k), I(k-1), D(k-1) and output u(k) on 2n+log2(r)+2 bits.]
Fig. 6. Commercial PID architecture

TABLE III
IMPLEMENTATION RESULT COMPARISON OF BMA-BASED PID

| PID Core | Total Gate Count | Power* (mW) | Max. Clock Freq. (MHz) | Latency |
|---|---|---|---|---|
| PID [21] | 16728 | 456 | 47 | 17 |
| PID1 | 9286 (44%) | 342 (25%) | 62 (32%) | 17 |
| PID2 | 10661 (36%) | 359 (21%) | 61 (30%) | 17 |
| PID3 | 6337 (62%) | 297 (35%) | 65 (38%) | 17 |
| PID4 | 7168 (57%) | 308 (32%) | 62 (32%) | 17 |

*: Dynamic power consumption at 47 MHz; (XX%): saving

IV. MBMA BASED PID

Equation (3) can also be rewritten as follows [24]:

$$Y = \sum_{j=0}^{(n/2)-1} (y_{2j-1} + y_{2j} - 2 y_{2j+1})\, 2^{2j} = \sum_{j=0}^{(n/2)-1} Q_j 2^{2j} \tag{7}$$

where $y_{-1} = 0$ and $Q_j \in \{-2, -1, 0, 1, 2\}$.

In this case, the multiplier Y is divided into n/2 slices, each of 3 bits, with one bit overlapping between adjacent slices. The proof of equation (7) is given in [28]. Thus, the DMAC equation becomes:

$$X \cdot Y + T \cdot Z = \sum_{j=0}^{(n/2)-1} \left[ Q_j \cdot X + P_j \cdot T \right] 2^{2j} \tag{8}$$

Likewise, n/2 simple partial products are generated (Table IV). Since the ODMAC is a reconfigurable RTL block, it is parameterized to suit equation (8). The adapted ODMAC architecture is depicted in Fig. 7. The only difference is that Mux(8:1) are used instead of Mux(4:1), and a hardwired (<<2j) shifter instead of (<<j).

TABLE IV
MODIFIED BOOTH ALGORITHM

| Y2j+1 | Y2j | Y2j-1 | Operation |
|---|---|---|---|
| 0 | 0 | 0 | +0 |
| 0 | 0 | 1 | +X |
| 0 | 1 | 0 | +X |
| 0 | 1 | 1 | +2X |
| 1 | 0 | 0 | -2X |
| 1 | 0 | 1 | -X |
| 1 | 1 | 0 | -X |
| 1 | 1 | 1 | -0 |

[Fig. 7: ODMAC for r=2. Two 8:1 multiplexers, driven by (y2j-1, y2j, y2j+1) and (z2j-1, z2j, z2j+1), select among "0", X and 2X (respectively "0", T and 2T) to form Qj.X and Pj.T; their sum is shifted by 2j and accumulated in a register, for j = 0, ..., (n/2)-1, delivering X.Y+T.Z on 2n+1 bits.]
Fig. 7. Optimized DMAC architecture for r=2

Compared to the BMA-based PID, the MBMA-based one (PID1, Table V) shows much better results, since latency is divided by 2 while power consumption and speed remain stable. The control rate is drastically improved, as it is equal to the maximum clock frequency divided by the latency. As the discrete commercial form (equation 1) can accommodate the three functioning modes, the implementation of PID2 produced the following power consumption values at 47 MHz: 268 mW, 313 mW, and 366 mW for the P, PI, and PID functioning modes, respectively.

TABLE V
IMPLEMENTATION RESULT COMPARISON OF MBMA-BASED PID

| PID Core | Total Gate Count | Power* (mW) | Max. Clock Freq. (MHz) | Latency |
|---|---|---|---|---|
| PID [21] | 16728 | 456 | 47 | 17 |
| PID1 | 10642 (36%) | 350 (23%) | 62 (32%) | 9 (47%) |
| PID2 | 11923 (29%) | 366 (20%) | 61 (30%) | 9 (47%) |
| PID3 | 7042 (58%) | 303 (33%) | 64 (38%) | 9 (47%) |
| PID4 | 7795 (53%) | 315 (31%) | 62 (32%) | 9 (47%) |

*: Dynamic power consumption at 47 MHz; (XX%): saving

In view of these improvements, one is encouraged to go further [24] in reducing latency by considering larger slices, such as:

$$Y = \sum_{j=0}^{(n/3)-1} (y_{3j-1} + y_{3j} + 2 y_{3j+1} - 2^2 y_{3j+2})\, 2^{3j} = \sum_{j=0}^{(n/3)-1} Q_j 2^{3j} \tag{9}$$

where $y_{-1} = 0$ and $Q_j \in \{-4, \ldots, 0, \ldots, 4\}$.

But in this case, some hard partial products are required, such as 3X and -3X, which are not easy to generate. How to circumvent this obstacle is the purpose of the next section.

V. RMRMA BASED PID

Multiplication is a fundamental operation in digital design. Its speed and power requirements are two critical factors limiting the performance of the whole system (the PID in our case). Since the publication of Booth's algorithm in 1951, a huge number of improvements have been proposed, especially after the publication of a generalized version of the modified Booth algorithm accompanied by its proof [29]. Most of the proposals aimed to reduce the number of partial products, either by employing digital optimization techniques [30][31][32] or by using larger slices (higher radices) [33]. However, experience showed [34] that beyond 4-bit slices (radix 8), the complexity of generating hard partial products cannot be managed in a realistic way. In [34], three metrics are provided for comparing the tradeoffs of higher-radix Booth recodings: the partial product compression factor (gain), the number of hard multiples that must be precomputed (computation complexity), and the partial product generation fan-in (routing complexity).

To circumvent the problem of hard partial products in higher radices, the idea proposed in [35] is to apply a recursive Booth recoding on the r-bit slice. While the idea is interesting, it relies upon a complicated mathematical formulation, leading to complex control circuitry and especially to an excessive latency (2n/r).

According to the multibit recoding algorithm presented in [29], an n-bit two's complement operand Y can be written as:

$$Y = \sum_{j=0}^{(n/r)-1} \left( y_{rj-1} + 2^0 y_{rj} + 2^1 y_{rj+1} + 2^2 y_{rj+2} + \cdots + 2^{r-2} y_{rj+r-2} - 2^{r-1} y_{rj+r-1} \right) 2^{rj} = \sum_{j=0}^{(n/r)-1} Q_j 2^{rj} \tag{10}$$

where $y_{-1} = 0$; $r \in \mathbb{N}^*$; and $Q_j \in \{-2^{r-1}, \ldots, 0, \ldots, 2^{r-1}\}$.

In this general case, the multiplier Y is divided into n/r slices, each of r+1 bits. Each pair of contiguous slices has one overlapping bit. To bypass the problem of hard partial products, MBMA (equation 7) is applied to the Qj terms. Thus, equation (10) takes the new, simpler recursive form:

$$Y = \sum_{j=0}^{(n/r)-1} \Big[ (y_{rj-1} + y_{rj} - 2 y_{rj+1})\, 2^0 + (y_{rj+1} + y_{rj+2} - 2 y_{rj+3})\, 2^2 + \cdots + (y_{rj+r-5} + y_{rj+r-4} - 2 y_{rj+r-3})\, 2^{2(\frac{r}{2}-2)} + (y_{rj+r-3} + y_{rj+r-2} - 2 y_{rj+r-1})\, 2^{2(\frac{r}{2}-1)} \Big] 2^{rj} \tag{11}$$

$$= \sum_{j=0}^{(n/r)-1} \left[ \sum_{i=0}^{(r/2)-1} (y_{rj-1+2i} + y_{rj+2i} - 2 y_{rj+1+2i})\, 2^{2i} \right] 2^{rj} \tag{12}$$

$$= \sum_{j=0}^{(n/r)-1} \left[ \sum_{i=0}^{(r/2)-1} Q_{ji}\, 2^{2i} \right] 2^{rj} \tag{13}$$

with $Q_{ji} \in \{-2, -1, 0, 1, 2\}$.

There is no need to prove equation (11) since it is a combination of equations (10) and (7), which were already proven in [29] and [28], respectively. The partitioning of operand Y according to equation (13) is illustrated in Fig. 8.

[Fig. 8: the 16+1 bits y-1, y0, ..., y15 (with y-1 = 0) are split into two overlapping (8+1)-bit slices Q0 and Q1, each of which is split into four overlapping (2+1)-bit slices Q00 to Q03 and Q10 to Q13.]
Fig. 8. Partitioning of a 16-bit Y operand with r=8

To avoid dealing with special cases, n and r must be chosen as even numbers, with r a divisor of n. Thus, the DMAC equation becomes:

$$X \cdot Y + T \cdot Z = \sum_{j=0}^{(n/r)-1} \left[ \sum_{i=0}^{(r/2)-1} (Q_{ji} \cdot X + P_{ji} \cdot T)\, 2^{2i} \right] 2^{rj} \tag{14}$$

Depending on the value of r, ranging from 2 to n, PIDs with various levels of parallelism and latencies (n/r+1) can be automatically generated with slight control complexity. The special cases r=n and r=2 correspond to the fully-parallel and fully-sequential PID, respectively. In between (r=4, ..., n/2), partially-parallel PIDs are obtained. The outstanding advantage of this algorithm (equation 13) is that hard partial products are generated using simple ones (2X, X) only. For simplified hardware and lower power consumption, the step-by-step sign-propagate technique is employed [36].

Obviously, equation (13) does not reduce the number of partial products, but it allows a modular space-time partitioning of the multibit recoding algorithm (equation 10), where n/r sets, each comprising r/2 partial products, can be generated and summed either simultaneously or iteratively. While the parallel implementation of equation (13) allows an important reduction of the critical path (using a carry-save adder, CSA), it requires too much space. Therefore, only the serial implementation is retained. In this case, latency drops from (n/2+1) to (n/r+1), whereas the critical path, whose delay is D in the case of MBMA, goes through log2(r/2) additional adder levels and is thus slightly increased to D+log2(r/2). Note that we are using a logarithmic summation tree and not a linear one (CSA-like).

An illustrative serial example with r=4 is described as follows:

$$Y = \sum_{j=0}^{(n/4)-1} \left( y_{4j-1} + y_{4j} + 2 y_{4j+1} + 2^2 y_{4j+2} - 2^3 y_{4j+3} \right) 2^{4j} \tag{15}$$

$$= \sum_{j=0}^{(n/4)-1} \left[ \sum_{i=0}^{1} (y_{4j-1+2i} + y_{4j+2i} - 2 y_{4j+1+2i})\, 2^{2i} \right] 2^{4j} \tag{16}$$

$$= \sum_{j=0}^{(n/4)-1} \left[ Q_{j0} + Q_{j1}\, 2^2 \right] 2^{4j} \tag{17}$$

$$X \cdot Y + T \cdot Z = \sum_{j=0}^{(n/4)-1} \left[ (Q_{j0} X + P_{j0} T) + (Q_{j1} X + P_{j1} T)\, 2^2 \right] 2^{4j} \tag{18}$$

The mapping of equation (18) into a serial architecture is shown in Fig. 9. Such a case (r=4) would have required the computation of hard partial products (7X, 5X, 3X) if the simple form of equation (15) had been used. Notice that MBMA is a special case of RMRMA for r=2. For r=1, equation (10) corresponds to BMA (equation 4).

Table VI gives the implementation results of PIDs with n=16 and r=4, 8, 16. For instance, PID1 with r=4 not only achieves a high improvement in latency (71%), but also maintains positive savings in power (14%) and speed (13%). These achievements are partially due to the logic trimming performed by the synthesis tool on the constant coefficients. Such an operation is impossible in the case of PID [21], since its coefficients are stored in LUTs.

TABLE VI
IMPLEMENTATION RESULT COMPARISON OF RMRMA-BASED PID

| PID Core | Total Gate Count | Power* (mW) | Max. Clock Freq. (MHz) | Latency |
|---|---|---|---|---|
| PID [21] | 16728 | 223 | 47 | 17 |
| PID1_4 | 12443 (+26%) | 191 (+14%) | 53 (+13%) | 5 (+71%) |
| PID1_8 | 15688 (+06%) | 194 (+13%) | 44 (-06%) | 3 (+82%) |
| PID1_16 | 23545 (-41%) | 217 (+03%) | 26 (-45%) | 2 (+88%) |
| PID2_4 | 22962 (-37%) | 256 (-15%) | 43 (-08%) | 5 (+71%) |
| PID2_8 | 26073 (-56%) | 204 (+08%) | 37 (-21%) | 3 (+82%) |
| PID2_16 | 40327 (-141%) | 488 (-119%) | 23 (-51%) | 2 (+88%) |

*: Dynamic power consumption at 23 MHz; PIDY_X: X = r
(+AB%): saving; (-AB%): overhead
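The latency figures of Table VI follow the n/r+1 rule, and the control rate is the maximum clock frequency divided by the latency. The short Python sketch below, given only as an illustration, recomputes both for the constant-coefficient PID1 with n=16; the function names are hypothetical and the clock frequencies are copied from Table VI.

```python
def latency(n, r):
    """Clock cycles per control output for word length n and slice size r."""
    if n % r:
        raise ValueError("r must divide n")
    return n // r + 1


def control_rate_mhz(fmax_mhz, n, r):
    """Maximum control rate = maximum clock frequency / latency."""
    return fmax_mhz / latency(n, r)


# Maximum clock frequencies (MHz) of PID1_4, PID1_8 and PID1_16 in Table VI.
PID1_FMAX = {4: 53, 8: 44, 16: 26}

rates = {r: control_rate_mhz(f, 16, r) for r, f in PID1_FMAX.items()}
for r, rate in rates.items():
    print(f"PID1_{r}: latency {latency(16, r)} cycles, {rate:.2f} MHz")

best_r = max(rates, key=rates.get)  # slice size giving the highest control rate
```

With these figures the highest control rate is reached at r=8, because the drop in clock frequency from r=8 to r=16 outweighs the one-cycle gain in latency.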
[Fig. 9: ODMAC for r=4. Four 8:1 multiplexers, driven by the overlapping 3-bit groups of the current y and z slices, select among "0", X and 2X (respectively "0", T and 2T) to form Qj0.X, Pj0.T, Qj1.X and Pj1.T. The Qj1/Pj1 pair is shifted by 2, the results are summed by adders with carry-in Cin, shifted by 4j and accumulated in a register, for j = 0, ..., (n/4)-1, delivering X.Y+T.Z on 2n+2 bits.]
Fig. 9. Optimized DMAC architecture for r=4
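The recursive recoding of equation (13) and the serial DMAC of equation (14) can be modelled at bit level as follows. This Python sketch is illustrative only: the function names are hypothetical, operands are plain signed integers, and the loop models the arithmetic of the n/r accumulation steps rather than the exact cycle behaviour of the RTL.

```python
def bit(value, k):
    """Bit y_k of a two's complement integer, with y_(-1) = 0."""
    return 0 if k < 0 else (value >> k) & 1


def rmrma_digits(value, n, r):
    """Digits Q_ji of equation (13): n/r slices of r/2 radix-4 digits each."""
    if r % 2 or n % r:
        raise ValueError("r must be even and divide n")
    if not -(1 << (n - 1)) <= value < (1 << (n - 1)):
        raise ValueError("value does not fit in n bits")
    return [
        [bit(value, r * j - 1 + 2 * i) + bit(value, r * j + 2 * i)
         - 2 * bit(value, r * j + 1 + 2 * i)
         for i in range(r // 2)]
        for j in range(n // r)
    ]


def rmrma_odmac(x, y, t, z, n, r):
    """Serial evaluation of equation (14), one slice j per accumulation step."""
    q = rmrma_digits(y, n, r)
    p = rmrma_digits(z, n, r)
    acc = 0
    for j in range(n // r):
        # r/2 simple partial products (0, +-X, +-2X and 0, +-T, +-2T) per step
        partial = sum((q[j][i] * x + p[j][i] * t) << (2 * i)
                      for i in range(r // 2))
        acc += partial << (r * j)
    return acc


# r=4 corresponds to equation (18); r=2 reduces to MBMA (equation 8).
print(rmrma_odmac(-3, 5, 7, -6, 8, 4))
```

Every digit stays in {-2, -1, 0, 1, 2} whatever the value of r, so the hard multiples 3X, 5X and 7X are never needed; increasing r only increases the number of digits handled per step.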

At this stage, a key question arises: among this panoply and 2, respectively) and one stage difference in the critical
of PIDs, which one fits the best one’s application case? The path (n-1 and n, respectively), but an important multiplexer
answer to this question is given in the next section. fanin difference (n/4 and n/2, respectively).
In terms of resource occupation, the total complexity
VI. DISCUSSION grows linearly O(r) as r multiplexers and r adders are
In embedded control, satisfactory control-rate (without required by ODMAC which is the most resource consuming
performance degradation) at minimum power consumption block of PID architecture. This is also confirmed by the
is the main requirement. To select the most adequate PID implementation results shown in Table VI. Note that each
for a given application, it’s necessary to investigate how adder of each level of MAC and ODMAC as well as the two
speed, power and hardware resources scales versus r factor ones at the output of the PID (Fig. 5 and 6) are
for a fixed word length n. Referring to equation (14) and successively extended by one bit so that the total bit size of
aided by Fig. 9, the ODMAC architecture scales as a binary the control output u(k) becomes 2n+log2(r)+2. It’s necessary
tree with one stage of r mux(8:1) followed by Log2(r)+1 to do so to prevent the apparition of a possible overflow in
stages of adders with a total of r adders too. Thus, the total the data-path which can cause signal clipping and
delay cumulated by the critical path which goes through instabilities in the closed loop response [37].
Log2(r)+2 stages increases with O(Log(r)) complexity, As for power consumption, intuitively, one would expect
whilst latency (n/r+1) decreases linearly O(r), which makes to see PID1_16 of Table VII as being the most rapid and the
the maximum control-rate increases as r increases. This is most power consumer too, for the reason that it exhibits the
confirmed by implementation results shown in Table VII smallest latency and the biggest total gate count! While it is
and VIII corresponding to PID1 and PID2, respectively. The almost true for the latter (13 MHz, before the first), it is
sole exception to this general rule is PIDX_n/2 which quite the opposite for the former (244 mW, the smallest

( P = 0.5 Vdd2 C sw Fclk ) depends linearly on the frequency


always yields to the highest control-rate compared to one). The explanation is that power consumption
PIDX_n despite the numerous tests with various n values.
This is justified since they exhibit very close latencies (3 (Fclk), which is in this case 26 MHz (the smallest one) and
TABLE VII also on the switched capacitance (Csw) which describes the
MAXIMUM POWER-CONSUMPTION AND CONTROL-LOOP-CYCLE OF PID1 average capacitance charged during each clock period
PID Power* Max. Clock Max. Control Loop (1/Fclk). In fact, Csw depends on a number of parameter
Latency
Core (mW) Freq. (MHz) Cycle (MHz)
PID [21] 456 47 17 2.76
(circuit structure, logic function, input pattern
PID1_1 342 (+25%) 62 17 3.65 (+32%) dependence…) and not only on the total gate count (more
PID1_2 350 (+23%) 62 9 7.66 (+177%) precisely, not only on the total physical capacitance of the
PID1_4 431 (+05%) 53 5 10.60 (+284%) circuit). Furthermore, a study [38] that analyzed the
PID1_8 365 (+20%) 44 3 14.67 (+431%)
dynamic power consumption in Xilinx’s FPGA revealed the
PID1_16 244 (+46%) 26 2 13.00 (+371%)
following share: 60% by routing, 16% by logic, and 14% by
*: Dynamic power consumption at maximum clock frequency; clocking. The reason is that routing is intensively
PID1_X: X=r; Max. control loop cycle = Max. clock frequency / Latency segmented, using pass logic and buffers.
TABLE VIII
MAXIMUM POWER-CONSUMPTION AND CONTROL-LOOP-CYCLE OF PID2

| PID Core | Power* (mW) | Max. Clock Freq. (MHz) | Latency | Max. Control Loop Cycle (MHz) |
|---|---|---|---|---|
| PID [21] | 456 | 47 | 17 | 2.76 |
| PID2_1 | 466 (-02%) | 61 | 17 | 3.59 (+30%) |
| PID2_2 | 475 (-04%) | 61 | 9 | 6.78 (+146%) |
| PID2_4 | 479 (-05%) | 43 | 5 | 8.60 (+211%) |
| PID2_8 | 328 (+28%) | 37 | 3 | 12.33 (+347%) |
| PID2_16 | 488 (-07%) | 23 | 2 | 11.50 (+317%) |

*: Dynamic power consumption at maximum clock frequency; PID2_X: X = r; Max. control loop cycle = Max. clock frequency / Latency

When both a high control-rate close to 13 MHz and low power are required, PID1_16 (244 mW at 13 MHz) stands as the best candidate compared to PID1_8 (323 mW at 13 MHz). However, it is worth noting that this comparison is valid only for the special case of a 16-bit word-length PID, for a given set of coefficients, mapped on the XC2S150E-7FT256 FPGA circuit and using Xilinx’s XST synthesis tool, version 9.2. Results could significantly change under other conditions, especially when considering the logic trimming process, which is essentially dependent on
the bit-arrangement of the coefficients. For a minimum influence of the trimming operation on the synthesized results, appropriate coefficients were used such that all Qj terms are represented except the null one, to avoid generating null partial products that greatly simplify the circuit logic. In fact, constant-coefficient PIDs (PID1) are somewhat unpredictable with regard to r. They are coefficient dependent. Conversely, PID2 is not affected by the trimming process since its coefficients are time varying. Implementation results given in Table VIII show that PID2_8 is the best in all aspects for the same reasons cited above. In sum, when a high control-rate is the ultimate objective, PIDX_n/2 is the best candidate whatever the value of n. But in the case where both high speed and low power are required, timing and power evaluations are necessary to decide which PID to select: either PIDX_n/2 or PIDX_n. Finally, when only low power is targeted, PIDX_1 is the best candidate. We dealt here with extreme situations only, but for a given couple (cr, pc) of control-rate and power consumption, several candidates are possible. Yet, the best PID is the one which requires the smallest gate count.

So far, speed and power have been considered in isolation from area, which becomes critical, and sometimes prohibitive, for large word-length n due to the fact that the PID is basically built of a set of multipliers (three or five) that scale quadratically with word length. The bigger the area, the higher the cost. Consequently, another advantage of the RMRMA algorithm is that it also copes with the cost issue as an additional constraint to speed and power.

We deliberately chose the Spartan2e FPGA to compare our results with those provided in [21]. A mapping of the extreme PID2 cores on a recent FPGA circuit (Virtex6) using XST version 12.1 delivered state-of-the-art results, grouped in Table IX. Note that control-rate scaled with an average factor of 2, while power dissipation scaled with an average factor of 45.

TABLE IX
MAXIMUM POWER-CONSUMPTION AND CONTROL-LOOP-CYCLE OF PID2 MAPPED ON VIRTEX6

| PID Core | Number of Slices | Power* (mW) | Max. Clock Freq. (MHz) | Latency | Max. Control Loop Cycle (MHz) |
|---|---|---|---|---|---|
| PID2_1 | 231 | 23 | 122 | 17 | 07.17 |
| PID2_8 | 1060 | 04 | 90.5 | 3 | 30.16 |
| PID2_16 | 1963 | 13 | 50.4 | 2 | 25.19 |

*: Dynamic power consumption at maximum clock frequency; PID2_X: X = r; Max. control loop cycle = Max. clock frequency / Latency

This is not surprising, since Spartan2e and Virtex6 were fabricated with two differently scaled technology processes: 150 nm and 40 nm, respectively. Therefore, the physical capacitances of the circuit in Virtex6 are considerably smaller. Additionally, the supply voltages (Vdd) used for the internal core (Vccint) and for the output blocks (Vcco) are respectively 1.8 V and 3.3 V for Spartan2e, and 1 V and 2.5 V for Virtex6. Furthermore, the advances made in CAD tools (from Xilinx ISE 9.1 to 12.1 versions) as well as in FPGA architecture, such as advanced segmented routing, contributed greatly to lowering the power consumption [39]. Power consumption evaluation studies [38][39] based on simulation and measurements, targeting the Virtex2 and Virtex6 families, revealed the following results: 5.9 µW per CLB per MHz, and 1.09 mW per 100 MHz at 38% toggle rate, respectively. These studies roughly confirm our power results, as proximate values are obtained.

Timing and power evaluations were performed in the following conditions. Delays were calculated for two types of paths: Clock-To-Setup and all paths together (Pad-To-Setup, Clock-To-Pad and Pad-To-Pad). The Clock-To-Setup gives more precise information on the delays than the other remaining paths, which depend in fact on the I/O Block (IOB) configuration (low/high fanout, CMOS, TTL, LVDS…). Thus, all delays (frequencies) presented so far are clock-to-setup delays with the highest speed grade of the FPGA circuit. As for power, we chose the highest Vcco voltage value (3.3 for Spartan2e and 2.5 for Virtex6) with a maximum toggle activity of 50%, which means that Flip-Flops (FFs) toggle one time during each clock cycle. The reason is that only single-edge triggered FFs are used for synthesis (no double-edge FFs).

VII. VERIFICATION METHOD

The PID design verification process went through several steps. First, equations (12) and (14) were tested with a random C program. Then, a severe cycle-accurate functional verification procedure using the Modelsim simulator was applied to MAC and ODMAC, as they are the main building blocks of the PID architecture. They were challenged against a set of special test cases (visual simulation), and then submitted to a random test for a very large number of vectors. Once tested successfully, the RTL PID module written in Verilog-2001 (IEEE 1364) was integrated into a Modelsim/Simulink environment for a co-simulation. At this stage, a ZOH discrete time-invariant model of a third-order continuous process ($G(s) = 1/(s+1)^3$) was chosen from the test set used by Åström and Hägglund [1] as examples of representative plants for the dynamics of typical industrial processes. To derive the PID parameters, a theoretical PID taken from the Matlab component library was tuned using floating-point numerical representation (IEEE 754 double format). The sampling period Ts was chosen based on the magnitude of the pole time constants. For this case Ts = 10 ms. The following parameters were obtained: Kp = 0.5913; Ti = 0.0523; Td = 0.0225 for N = 10 and b = 1. Calculations give the following floating-point values for the coefficients of the commercial PID:

A = 0.5913; B = -0.5913; C = 0.1130; D = 0.1836; E = -1.0860

To co-simulate the RTL PID, a conversion of the coefficients to 16-bit (Q4.12) fixed-point representation was necessary. The following slightly different values were obtained:

A = 0.5911; B = -0.5911; C = 0.1130; D = 0.1836; E = -1.0860

Note that to represent the original parameters with full precision, 44 bits are needed for the fractional part. Varied simulations were performed to verify the correctness of the PID RTL code. First, to explore the precision effect on control quality, the control outputs of PIDs with various fractional-part sizes (Q4.4, Q4.12, Q4.20) were compared to that of the Matlab floating-point PID component (Fig. 10). Simulation shows different rise-times for different precisions. The higher the precision, the closer the control output is to the ideal model. The second simulation tests the behavior of the PID after having reached the steady state (Fig. 11). For that, two perturbations are successively exerted on the control output and on the plant measure. Each time the system recovers as expected. And finally, the third simulation investigates the PID capabilities to track set-points of arbitrary amplitudes and durations (Fig. 12).
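As a rough cross-check of the coefficient values quoted in this section, the sketch below recomputes them from Kp, Ti, Td, N, b and Ts using the commercial-form expressions of the appendix, then converts them to Q4.12. Two points are assumptions rather than statements of the paper: that the D and E of this section correspond to the appendix coefficients H and L (the numerical values support this), and that the fixed-point conversion truncates toward zero. With these assumptions the sketch reproduces the 0.5911 reported for A and B; the other published values are quoted to four decimals only and need not match digit for digit.

```python
def commercial_coefficients(kp, ti, td, n, b, ts):
    """Commercial-form coefficients; names A..E follow Section VII (D, E assumed = H, L)."""
    a = kp * b
    b_coef = -kp
    c = kp * ts / ti
    d = td / (td + n * ts)
    e = -kp * n * td / (td + n * ts)
    return a, b_coef, c, d, e


def to_fixed(x, frac_bits=12):
    """Integer code of x in a Qm.frac_bits format, truncating toward zero (assumed mode)."""
    return int(x * (1 << frac_bits))


def from_fixed(q, frac_bits=12):
    """Real value represented by the integer code q."""
    return q / (1 << frac_bits)


coeffs = commercial_coefficients(kp=0.5913, ti=0.0523, td=0.0225, n=10, b=1, ts=0.01)
quantized = [from_fixed(to_fixed(v)) for v in coeffs]
# coeffs is approximately (0.5913, -0.5913, 0.11306, 0.18367, -1.08606)
```

Here `to_fixed(0.5913)` yields the integer code 2421, that is 2421/4096, which is about 0.59106.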
After a successful functional verification, the RTL code of the PID was synthesized, placed, and routed on Xilinx’s FPGA (Virtex-2). The three preceding co-simulations, but with timing back-annotation, were performed again as a last necessary software verification step before hardware integration of the PID into an FPGA evaluation board (MEMEC V2MB1000).

Finally, as an ultimate validation step, a physical test of our PIDs was performed. We built up a classical temperature control setup (Fig. 13 and 14), which consists of a tube comprising a halogen lamp (12 V, 21 W), a temperature sensor (LM35), and a DC fan (12 V, 1.68 W). Temperature regulation inside the tube is achieved by controlling either the intensity of the lamp, or the rotation speed of the fan. This is carried out by the use of two PWMs, whose duty-cycle ratios represent the PID controller output (u(k)). These two PWMs do not act directly on the fan or on the lamp but rather on transistors (IRF540) that control the power consumed by the lamp and the fan.

The sensing of the actual temperature of the tube is ensured by the LM35 component, which delivers a voltage that grows linearly with temperature (1.5 volts corresponds to 150 °C). As the maximum voltage allowed by the FPGA evaluation board (V2MB1000) is 3.3 volts, the calculation of the real temperature (T) is done as follows: T = [(val_opb_ADC * 3.3)/1023] * 100. This allows a temperature control with a minimum step of 0.32 °C.

The V2MB1000 board is connected through an RS232 port to a PC running a .net application which allows a real-time display of the temperature as well as an instantaneous tuning of the set-point.

[Fig. 10. Fixed-point versus floating-point. Curves: Set Point (Uc), PID (4.4), PID (4.12), PID (4.20), PID Ideal Model; axes: Response versus Time (s).]

[Fig. 13. Synoptic scheme of the setup. Labeled blocks: Memec V2MB1000 FPGA evaluation board; electronic device (PWM Fan, PWM Lamp); tube containing the fan, the lamp, and the LM35.]

[Fig. 14. Setup of temperature regulation. 1: FPGA evaluation board; 2: Electronic device; 3: Tube containing a fan and a lamp; 4: PC display screen.]
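The temperature conversion T = [(val_opb_ADC * 3.3)/1023] * 100 used in the setup can be sketched as follows. The constant names are illustrative, and the full-scale code 1023 is taken to mean a 10-bit converter, which the text implies but does not state.

```python
ADC_FULL_SCALE = 1023     # assumed 10-bit ADC (2**10 - 1)
V_REF = 3.3               # maximum input voltage of the V2MB1000 board, in volts
DEG_PER_VOLT = 100.0      # LM35 slope: 1.5 V corresponds to 150 degrees Celsius


def adc_to_celsius(val_opb_adc):
    """T = [(val_opb_ADC * 3.3) / 1023] * 100."""
    return (val_opb_adc * V_REF / ADC_FULL_SCALE) * DEG_PER_VOLT


# Smallest temperature increment: one ADC code, about 0.32 degrees Celsius.
step = adc_to_celsius(1) - adc_to_celsius(0)
```

With these constants, the ADC code 465 corresponds to 1.5 V, that is 150 °C, the upper point mentioned for the LM35.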

[Fig. 11. Perturbations after steady-state on control-output and on plant measure, successively. Curves: Set Point (Uc), Plant Measure (Y); axes: Response versus Time (s).]

VIII. THE FINITE WORD-LENGTH (FWL) EFFECT

Fixed-point arithmetic is employed as an approximation of real numbers (floating-point), with a fixed bit-length of the word used to represent data (Finite Word-Length). This limitation leads to performance degradation (FWL effect) mainly due to quantization of coefficients (parametric errors) and roundoff errors subsequently cumulated during the computation process (numeric noise). In fact, the FWL effect is more-or-less exaggerated depending on the control algorithm used (I/O relationship, levels of parallelism, etc) as well as on the way the computations are performed (number of bits, different/unique fixed point position, round/truncation, etc). Compared to the reference floating-point implementation, the FWL effect can be assessed using some indicators such as transfer function sensitivity, or pole sensitivity [40][41][42].
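The two error sources named above, coefficient quantization and accumulated roundoff, can be illustrated on a single accumulating term of the form I(k) = I(k-1) + c·e. The sketch below is not the paper's evaluation method: the values of c and e, the number of steps, and the truncate-toward-zero mode are all assumptions chosen for illustration.

```python
def quantize(x, frac_bits):
    """Truncate x toward zero onto a grid of step 2**-frac_bits."""
    scale = 1 << frac_bits
    return int(x * scale) / scale


def integrate(c, e, steps, frac_bits=None):
    """Accumulate I(k) = I(k-1) + c*e; quantize c and each product if frac_bits is given."""
    if frac_bits is not None:
        c = quantize(c, frac_bits)          # parametric error
    acc = 0.0
    for _ in range(steps):
        inc = c * e
        if frac_bits is not None:
            inc = quantize(inc, frac_bits)  # roundoff error, repeated at every step
        acc += inc
    return acc


C, E, STEPS = 0.1130, 0.25, 100
reference = integrate(C, E, STEPS)            # floating point, about 2.825
q12 = integrate(C, E, STEPS, frac_bits=12)    # 100 * 115 / 4096, about 2.808
q4 = integrate(C, E, STEPS, frac_bits=4)      # 0.0: every increment truncates to zero
```

With 12 fractional bits the accumulated value drifts slightly below the floating-point reference, while with 4 fractional bits each product falls below one quantization step and the accumulator never moves.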
[Fig. 12. Set-point tracking of arbitrary amplitudes and durations. Curves: Set Point (Uc), Plant Measure (Y); axes: Response versus Time (s).]

In fact, the objective is twofold: we need to provide an optimal ASIC/FPGA implementation of the FWL PID without degrading control performances. To achieve such a goal, a double expertise is required, in hardware design and in control systems. But usually, hardware designers do not master control system design, and control system experts do not have the required skills to implement and evaluate the controllers using ASIC/FPGAs [17][43]. This is why we propose, as hardware designers, a highly reconfigurable (n, r) and technology-independent FWL PID that can systematically respond to the demands of control engineers after they have modelled, simulated, and evaluated the performances provided by different bit-width fixed-point representations using the Matlab/Simulink environment, and finally opted for an appropriate word-length (n) of the setpoint. As for the latency value (r), it depends on the application domain and intended objectives. Precise guidelines on how to choose the value of r were given in section VI.

Now that the (n, r) couple is known, the FWL problem is tackled from the hardware side by simply adjusting in the RTL code the two compile-time constants: setpoint bit-size (n) and latency (r). The synthesis of such a PID generates an optimal structure that not only meets the performances specified by control engineers, but also consumes minimum power and hardware resources. This would not have been possible without the use of the new highly serialisable multi-bit multiplication algorithm (equation 13). The incorporation of equation (13) [25] into equations (1) and (2) as an efficient PID engine allows the generation of PID architectures classified as regular iterative architectures (RIA) [44], known for their high conformity with the principles of regularity and locality. In addition to equation (13), we propose in [25] several new highly serialisable multiplication algorithms, offering different features in power, space and delay, depending on the operand size (n). The reader is encouraged to explore these algorithms [25] to select the appropriate one that leads to the best performances of the controller with regard to the size (n) of the setpoint.

Regularity and locality are two important features, highly sought in hardware design as they lead to an important gain in space and delay. Regularity is a general space feature, where the repetitiveness of just one or a few elementary building-blocks (mux, adders and shifters of ODMAC, Fig. 9) and their interconnection scheme (predefined netlist) suffice to draw the whole architecture (MAC/ODMAC and then PID). On the other hand, locality is both a space and a time feature, in the sense that each building-block can only interact with its nearest surrounding neighbours, and any transaction from one building-block to the next is completed in one and only one unit time delay (clock period). Because of these two important features, our PIDs can be finely grained at bit level in space (setpoint bit-size n, latency r) and unit delay in time (latency r).

Experimental results depicted in Fig. 15 illustrate the FWL effects on temperature regulation. Reducing the fractional-part size of the set-point beyond a certain limit (4 bits) leads to a continuous fluctuation of the temperature inside the tube (Fig. 15.d). The best compromise is a 6-bit fractional part (Fig. 15.c), which ensures a correct regulation while consuming less power and hardware resources. As the temperature regulation system has very slow dynamics, speed is not a concern. Therefore, the most appropriate PID in this case is PIDX_1, as it consumes the least power. Conversely, for very fast dynamic systems, such as MEMS [45] or microrobotics applications [46], PIDX_n/2 is the most adequate option as it leads to the highest control rate.

[Fig. 15. Effect of the setpoint fractional length on temperature regulation. (a) Floating point PID; (b) Our PID with Qni.nf = Q8.8; (c) Our PID with Qni.nf = Q8.6; (d) Our PID with Qni.nf = Q8.4. Axes: Temperature (°C) versus Time (s).]

IX. SUMMARY AND CONCLUSION

Despite the large popularity of the PID controller, little attention has been paid to its optimization, either for ASIC or for FPGA integration. To break down this paradoxical situation, a series of high-speed and low-power PIDs, especially dedicated to embedded applications, was proposed. They are based on two discrete forms of the PID algorithm: the incremental form and the commercial form, both with constant and time-varying coefficients. The work focused more particularly on the commercial form with varying coefficients, as it is the most used in industry due to the higher control quality provided. Two types of optimizations were carried out: architectural and algorithmic optimizations. The former is a macro-level optimization, based on an efficient partitioning of the PID discrete equations, considering the double MAC (DMAC = XY + ZT) as the main building block of the PID architecture. An optimized version of DMAC was developed (ODMAC) for less hardware resource occupation. As for the micro-level optimization (inner optimization of ODMAC), three multiplication algorithms were experimented with: BMA, MBMA, and a new
general and recursive version of MBMA called RMRMA. In addition, some low-power design techniques were incorporated, such as sleep mode and a step-by-step sign-propagation technique.

The implementation results of the PIDs based upon these three algorithms yielded gradual improvements with a clear superiority over the results presented in [21]. For instance, concerning PID1_2 and PID1_4, savings of 177%, 23%, and 36%, and savings of 284%, 14%, and 26% are obtained in control-rate, power consumption, and total gate count, respectively. Additionally, analytical scaling-complexity evaluations with respect to the couple (n, r), confirmed also by software simulations, revealed useful information which is summarized as follows:
• PIDX_n/2 is the fastest PID, yielding the highest control-rate (30 MHz for PID2_8 mapped on Virtex6, with (n, r) = (16, 8));
• PIDX_1 is the most power-efficient PID when speed is not a concern;
• PIDX_n and PIDX_n/2 are the most efficient PIDs when both high control-rate and low power dissipation are required.

A further extension to the present work is to apply the same (or an appropriate) partitioning in conjunction with the RMRMA algorithm to the set of recurrent equations of an arbitrary number of multi-loop PID controllers taken as a whole. Finally, the new recursive multiplication algorithm (RMRMA), well adapted to large word-lengths, and which was behind the drastic optimization of the PID, can be efficiently applied to a variety of advanced control algorithms such as linear-quadratic-Gaussian (LQG) or sliding-mode controllers.

REFERENCES

[1] K. Åström, T. Hägglund, “PID Controllers: Theory, Design, and Tuning,” by the Instrument Society of America, Research Triangle Park, NC, USA, 2nd Edition, ISBN: 1-55617-516-7, Copyright 1995.
[2] D. Xue et al, “Linear Feedback Control,” by the Society for Industrial and Applied Mathematics, Copyright 2007. Available: http://www.siam.org/books/dc14/DC14Sample.pdf
[3] S. Xiaoyin et al, “A New Motion Control Hardware Architecture with FPGA based IC-Design for Robotic Manipulators,” Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 3520-3525, Orlando, Florida, May 2006.
[4] J.S. Kim, H.W. Jeon, and S. Jeung, “Hardware Implementation of Nonlinear PID Controller with FPGA based on Floating Point Operation for 6-DOF Manipulator Robot Arm,” Proceedings of the IEEE International Conference on Control Automation and Systems (ICROS), pp. 1066-1071, Seoul, Korea, October 2007.
[5] L. Qu, Y. Huang, and L. Ling, “Design and Implementation of Intelligent PID Controller based on FPGA,” Proceedings of the IEEE International Conference on Natural Computation (ICNC), pp. 511-515, 2008.
[6] M. Keating & P. Bricaud, “Reuse Methodology Manual for System on a Chip Designs,” by the Kluwer Academic Publishers, NY, USA, 3rd Edition, ISBN: 1-4020-7141-8, Copyright 2002.
[7] Reports of the International Technology Roadmap for Semiconductors (ITRS), 2007 & 2008. Available: www.itrs.net/reports.html
[8] T. Hilaire, P. Chevrel, and J.F. Whidborne, “A Unifying Framework for Finite Word Length Realizations,” IEEE Trans. on Circuits and Systems, Vol. 54, N° 8, August 2007.
[9] T. Hilaire, D. Ménard, and O. Sentieys, “Bit Accurate Roundoff Noise Analysis of Fixed-Point Linear Controllers,” Proceedings of the IEEE International Conference on Computer-Aided Control Systems (CACSD), pp. 607-612, 2008.
[10] S. Gretlein et al, “DSPs, Microprocessors and FPGAs in Control,” the Magazine of Record for the Embedded Computing Industry (RTC Magazine), March 2006.
[11] E. Manmasson et al., “FPGA in Industrial Control Applications,” IEEE Trans. on Industrial Informatics, vol. 7, N° 2, May 2011.
[12] S. Chander, P. Agarwal, and I. Gupta, “FPGA-based PID Controller for DC-DC Converter,” Proceedings of the IEEE Joint International Conference on Power Electronics, Drives and Energy Systems (PEDES), India, 2010.
[13] S. Yang et al, “The IP Core Design of PID Controller based on SOPC,” Proceedings of the IEEE International Conference on Intelligent Control and Information Processing, pp. 363-366, Dalian, China, August 2010.
[14] J. Lazaro et al, “Simulink/Modelsim Simulable VHDL PID Core for Industrial SoPC Multiaxis Controllers,” Proceedings of the IEEE 32nd Annual Conference on Industrial Electronics (IECON), pp. 3007-3011, 2006.
[15] F. Fons, M. Fons, and E. Canto, “Custom-Made Design of a Digital PID Control System,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 3, pp. 1020-1023, 2006.
[16] B.V. Sreenivasappa and R.Y. Udaykumar, “Design and Implementation of FPGA based Low Power Digital PID Controllers,” Proceedings of the IEEE International Conference on Industrial and Information Systems (ICIIS), pp. 568-573, 2009.
[17] J. Lima et al, “A Methodology to Design FPGA-based PID Controllers,” Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, pp. 2577-2583, Taipei, Taiwan, October 2006.
[18] I. Urriza et al, “Word Length Selection Method based on Mixed Simulation for Digital PID Controllers Implemented in FPGA,” Proceedings of the IEEE International Symposium on Industrial Electronics (ISIE), pp. 1965-1970, 2008.
[19] W. Zhao et al, “FPGA Implementation of Closed-Loop Control Systems for Small-Scale Robot,” Proceedings of the IEEE 12th International Conference on Advanced Robotics (ICAR), pp. 70-77, 2005.
[20] L. Samet et al, “A Digital PID Controller for Real-Time and Multi-Loop Control: a Comparative Study,” Proceedings of the IEEE International Conference on Electronics, Circuits, and Systems (ICECS), vol. 1, pp. 291-296, 1998.
[21] Y. Fong, M. Moallem, and W. Wang, “Design and Implementation of Modular FPGA-Based PID Controllers,” IEEE Trans. on Industrial Electronics, Vol. 54, N° 4, pp. 1898-1906, August 2007.
[22] B. Wittenmark, K. J. Astrom, and K.-E. Arzenin “Computer control: An overview,” Technical Report of Dept. of Automatic Control, Lund Institute of Technology, Lund, Sweden, Apr. 2003. Available: www.control.lth.se/kursdr/ifac.pdf
[23] A. D. Booth, “A Signed Binary Multiplication Technique,” Quarterly J. Mech. Appl. Math., Vol. 4, part 2, pp. 236-240, 1951.
[24] O.L. MacSorley, “High-Speed Arithmetic in Binary Computers,” Proceedings of the IRE, Vol. 49(1), pp. 67-91, January 1961.
[25] A.K. Oudjida, N. Chaillet, A. Liacha, and M.L. Berrandjia, “A New Recursive Multibit Recoding Algorithm for High-Speed and Low-Power Multiplier,” Journal of Low Power Electronics (JOLPE), vol. 8, N° 5, pp. 1-16, December 2012, American Scientific Publishers (ASP), USA.
[26] A.K. Oudjida et al., “High-Speed and Low-Power PID Structures for Embedded Applications,” Proceedings of the 21th edition of the International Workshop on Power and Timing Modeling, Optimization and Simulation PATMOS, LNCS 6951, pp. 257-266, Springer-Verlag Editor. Madrid, Spain, September 26-29, 2011.
[27] Y.H. Seo, and D.W. Kim, “A New VLSI Architecture of Parallel Multiplier-Accumulator Based on Radix-2 Modified Booth Algorithm,” IEEE Trans. on VLSI Systems, vol. 18, N° 2, Feb. 2010.
[28] L.P. Rubinfield, “A Proof of the Modified Booth Algorithm for Multiplication,” IEEE Trans. On Computers, C-24, (10), pp. 1014-1015, 1975.
[29] H. Sam, and A. Gupta, “A Generalized Multibit Recoding of Two’s Complement Binary Numbers and its Proof with Application in Multiplier Implementation,” IEEE Trans. on Computers, vol. 39, N° 8, August 1990.
[30] F. Lamberti, “Reducing the Computation Time in (Short Bit-Width)
Two’s Complement Multiplier,” IEEE Trans. on Computers, vol. 60,
N° 2, pp. 148-156, February 2011.
[31] S.R. Kuang, J.P. Wang, and C.Y. Guo, “Modified Booth Multipliers
with a Regular Partial Product Array,” IEEE Trans. on Circuit and
Systems II, Express Brief, vol. 56, N° 5, May 2009.
[32] J.Y. Kang, J.L. Gaudiot, “A Simple High-Speed Multiplier Design,”
IEEE Trans. on Computers, vol. 55, N° 10, Oct. 2006.
[33] D. Crookes and M. Jiang, “Using Signed Digit Arithmetic for Low-Power Multiplication,” Electronics Letters, vol. 43, N° 11, May 2007.
[34] P.M. Seidel, L. D. McFearin, and D.W. Matula, “Secondary Radix
Recodings for Higher Radix Multipliers,” IEEE Trans. on Computers,
vol. 54, N°2, February 2005.
[35] R.C. North, and W.H. Ku, “β-Bit Serial/Parallel Multipliers,” Journal
of VLSI Signal Processing, Kluwer Academic Publishers, Boston,
vol. 2, pp. 219-233, 1991.
[36] D.A. Henlin, M.T. Fertsch, M. Mazin, and E.T. Lewis, “A 16 bit x 16 bit Pipelined Multiplier Macrocell,” IEEE Journal of Solid-State Circuits, vol. SC-20, no. 2, pp. 542-547, 1985.
[37] J.S. Kelly et al, “Design and Implementation of Digital Controllers for
Smart Structures Using Field Programmable Gate Arrays,” Smart
Material Structure Journal, PII: S0964-1726 (97) 87085-1, pp. 559-
572, Printed in the UK, 1997.
[38] L. Shang, A.S. Kaviani, and K. Bathala, “Dynamic Power
Consumption in Virtex-II FPGA Family,” Proceedings of FPGA
Conference, pp. 157-164, Monterey, California, USA, February 2002.
[39] Xilinx Inc., “Virtex6 FPGA: Satisfying the Insatiable Demand for
Higher Bandwidth,” PN 2403, Printed in the USA, Copyright 2009.
www.xilinx.com/publications/prod_mktg/Virtex6_Product_Brief.pdf
[40] M. Gevers and G. Li, “Parametrizations in Control, Estimation and Filtering Problems,” Springer-Verlag, 1993.
[41] T. Hilaire and P. Chevrel, “Sensitivity-based pole and input-output
errors of linear filters as indicators of the implementation deterioration
in fixed-point context,” EURASIP Journal on Advances in Signal
Processing, vol. special issue on Quantization of VLSI Digital Signal
Processing Systems, January 2011.
[42] B. Lopez, T. Hilaire and L.S. Didier, “Sum-of-products Evaluation
Schemes with Fixed-Point arithmetic, and their application to IIR
filter implementation,” Proceedings of the International Conference
on Design and Architecture for Signal and Image Processing
(DASIP), Karlsruhe, Germany, Oct. 2012.
[43] M. Petko and G. Karpiel, “Semi-automatic implementation of control
algorithms in ASIC/FPGA,” Proceedings of Emerging Technologies
and Factory Automation Conference (ETFA '03), vol. 1, pp. 427-433, Sept. 2003.
[44] S.K. Rao and T. Kailath, “Regular Iterative Algorithms and their
Implementation on Processor Arrays,” Proceeding of the IEEE, vol.
76, pp. 259-269, Mar. 1988.
[45] G. Hoover et al, “Towards Understanding Architectural Tradeoffs in
Mems Closed-Loop Feedback Control,” Proceedings of the
International Conference on Compilers, Architecture, and Synthesis
for Embedded Systems (CASES’07), pp. 95-102, Salzburg, Austria,
Sep. 30-Oct. 3, 2007.
[46] R. Casanova et al, “Integration of the Control Electronics for a mm3-sized Autonomous Microrobot into a Single Chip,” Proceedings of the
IEEE International Conference on Robotics and Automation (ICRA),
pp. 3007-3012, Kobe, Japan, May 12-17, 2009.
APPENDIX

Incremental form

The standard version of PID controller is described in a differential equation as:

$$u(t) = K_p \left( e(t) + \frac{1}{T_i} \int_0^t e(\tau)\,d\tau + T_d\,\frac{de(t)}{dt} \right),$$

where e is the system error ($e(t) = u_c(t) - y(t)$), $u_c$ is the command signal (setpoint), y is the process variable (measured variable). $K_p$ is the proportional gain, $T_i$ the integration time constant, and $T_d$ the derivative time constant of the controller.

Using Laplace transform, $u(t)$ is expressed in s-domain by:

$$U(s) = K_p \left( E(s) + \frac{1}{s\,T_i}\,E(s) + s\,T_d\,E(s) \right).$$

For a small sample interval $T_s$, the continuous time variable $u(t)$ can be discretized using the following approximations:

$$\int_0^{k T_s} e(t)\,dt \approx \sum_{j=0}^{k} e(j)\,T_s \;; \qquad \frac{de(t)}{dt} \approx \frac{e(k) - e(k-1)}{T_s}.$$

k denotes the kth sampling instant ($k \cdot T_s$). Thus, $u(t)$ can be rewritten as:

$$u(k) = K_p \left( e(k) + \frac{1}{T_i} \sum_{j=0}^{k} e(j)\,T_s + T_d\,\frac{e(k) - e(k-1)}{T_s} \right) \quad \text{with} \quad e(k) = u_c(k) - y(k)$$

and

$$u(k-1) = K_p \left( e(k-1) + \frac{1}{T_i} \sum_{j=0}^{k-1} e(j)\,T_s + T_d\,\frac{e(k-1) - e(k-2)}{T_s} \right).$$

We calculate the difference:

$$u(k) - u(k-1) = K_p \left( e(k) - e(k-1) \right) + \frac{K_p}{T_i} \left( \sum_{j=0}^{k} e(j)\,T_s - \sum_{j=0}^{k-1} e(j)\,T_s \right) + K_p\,T_d \left( \frac{e(k) - e(k-1)}{T_s} - \frac{e(k-1) - e(k-2)}{T_s} \right).$$

Developing separately each term of $u(k) - u(k-1)$, we obtain:

$$K_p \left( e(k) - e(k-1) \right) = K_p\,e(k) - K_p\,e(k-1) \;;$$

$$\frac{K_p}{T_i} \left( \sum_{j=0}^{k} e(j)\,T_s - \sum_{j=0}^{k-1} e(j)\,T_s \right) = K_p\,\frac{T_s}{T_i}\,e(k) \;;$$

$$K_p\,T_d \left( \frac{e(k) - e(k-1)}{T_s} - \frac{e(k-1) - e(k-2)}{T_s} \right) = K_p\,\frac{T_d}{T_s}\,e(k) - K_p\,\frac{2\,T_d}{T_s}\,e(k-1) + K_p\,\frac{T_d}{T_s}\,e(k-2).$$

After simplifications, we get the following recurrent equation:

$$u(k) = u(k-1) + K_p \left( 1 + \frac{T_s}{T_i} + \frac{T_d}{T_s} \right) e(k) - K_p \left( 1 + 2\,\frac{T_d}{T_s} \right) e(k-1) + K_p\,\frac{T_d}{T_s}\,e(k-2)$$

$$\phantom{u(k)} = u(k-1) + A \cdot e(k) + B \cdot e(k-1) + C \cdot e(k-2).$$
This latter equation is called the incremental form of the controller. A drawback with the incremental algorithm is that it
cannot be used for P or PD controllers.
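The recurrence can be checked numerically against the discretized sum it was derived from. The sketch below assumes zero initial conditions (u, e(k-1) and e(k-2) all zero before the first sample), which the appendix does not state; function names and the test sequence are illustrative.

```python
def incremental_coeffs(kp, ti, td, ts):
    """A, B, C of u(k) = u(k-1) + A*e(k) + B*e(k-1) + C*e(k-2)."""
    a = kp * (1 + ts / ti + td / ts)
    b = -kp * (1 + 2 * td / ts)
    c = kp * td / ts
    return a, b, c


def incremental_pid(errors, kp, ti, td, ts):
    """Incremental form, assuming zero initial conditions."""
    a, b, c = incremental_coeffs(kp, ti, td, ts)
    u_prev = e1 = e2 = 0.0
    out = []
    for e in errors:
        u = u_prev + a * e + b * e1 + c * e2
        out.append(u)
        u_prev, e2, e1 = u, e1, e
    return out


def positional_pid(errors, kp, ti, td, ts):
    """Discretized sum form of u(k), with the same zero initial conditions."""
    out, acc, e_prev = [], 0.0, 0.0
    for e in errors:
        acc += e * ts                       # running sum of e(j) * Ts
        out.append(kp * (e + acc / ti + td * (e - e_prev) / ts))
        e_prev = e
    return out


errors = [1.0, 0.5, 0.25, -0.1, 0.0]
params = dict(kp=0.5913, ti=0.0523, td=0.0225, ts=0.01)
u_inc = incremental_pid(errors, **params)
u_pos = positional_pid(errors, **params)
```

Both sequences agree to within floating-point error, and A + B + C reduces to Kp·Ts/Ti, the gain left on a constant error by the integral action alone.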

Commercial form
For better performances of PID, two corrections are performed: limitation of the derivative gain and setpoint weighting. A pure derivative action will induce a very large amplification of measurement noise. The gain of the derivative must thus be limited. This can be done by approximating the transfer function $s \cdot T_d$ as follows:

$$s\,T_d \approx \frac{s\,T_d}{1 + s\,T_d / N},$$

where N is typically in the range of 3 to 20. In addition, to avoid sudden overshoots due to high variations of the setpoint, only a fraction b of $u_c$ acts on the proportional part ($b \cdot u_c - y$). Hence, the improved PID algorithm becomes:

$$U(s) = K_p \left( \left( b\,U_c(s) - Y(s) \right) + \frac{1}{s\,T_i} \left( U_c(s) - Y(s) \right) - \frac{s\,T_d}{1 + s\,T_d / N}\,Y(s) \right).$$

The U(s) expression is discretized such that the proportional, integral and derivative terms are separately obtained, as follows: $u(k) = P(k) + I(k) + D(k)$, where

$$P(k) = K_p\,b\,U_c(k) - K_p\,Y(k) \quad \text{and} \quad I(k) = I(k-1) + K_p\,\frac{T_s}{T_i} \left( U_c(k-1) - Y(k-1) \right).$$

To determine the derivative term $D(k)$, we use the differential equation representing the transfer function $G_d(s)$:

$$G_d(s) = \frac{U_d(s)}{Y(s)} = -K_p\,\frac{s\,T_d}{1 + s\,T_d / N}.$$

By performing cross products, we get:

$$U_d(s) \left( 1 + \frac{s\,T_d}{N} \right) = -K_p\,Y(s)\,s\,T_d.$$

Applying the inverse Laplace transform to this latter equation, we obtain:

$$u_d(t) = -\frac{T_d}{N}\,\frac{du_d(t)}{dt} - K_p\,T_d\,\frac{dy(t)}{dt}.$$

Consequently, the discretized form of $u_d(t)$ is:

$$D(k) = -\frac{T_d}{N}\,\frac{D(k) - D(k-1)}{T_s} - K_p\,T_d\,\frac{Y(k) - Y(k-1)}{T_s}.$$

After simplification, we obtain:

$$D(k) = \frac{T_d}{T_d + N\,T_s}\,D(k-1) - \frac{K_p\,N\,T_d}{T_d + N\,T_s} \left( Y(k) - Y(k-1) \right).$$

Finally we can write:

$$u(k) = P(k) + I(k) + D(k)$$

with

$$P(k) = A \cdot u_c(k) + B \cdot y(k) \;; \quad I(k) = I(k-1) + C \cdot e(k-1) \;; \quad D(k) = H \cdot D(k-1) + L \cdot f(k)$$

and

$$A = K_p\,b \;; \quad B = -K_p \;; \quad C = K_p\,\frac{T_s}{T_i} \;; \quad H = \frac{T_d}{T_d + N\,T_s} \;; \quad L = -\frac{K_p\,N\,T_d}{T_d + N\,T_s}.$$

By comparison with the expressions of $I(k)$ and $D(k)$ above, $e(k-1) = u_c(k-1) - y(k-1)$ and $f(k) = y(k) - y(k-1)$.
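The final set of equations maps directly onto a short loop. The following sketch is a floating-point illustration only, not the RTL structure described in the paper; it assumes zero initial conditions for I, D, e and y, and takes f(k) = y(k) - y(k-1) as implied by the derivation of D(k).

```python
def commercial_pid(setpoints, measures, kp, ti, td, n, b, ts):
    """u(k) = P(k) + I(k) + D(k) with coefficients A, B, C, H, L of the appendix."""
    a = kp * b
    b_coef = -kp
    c = kp * ts / ti
    h = td / (td + n * ts)
    l_coef = -kp * n * td / (td + n * ts)

    i_term = d_term = 0.0       # assumed zero initial state
    e_prev = y_prev = 0.0
    out = []
    for uc, y in zip(setpoints, measures):
        p_term = a * uc + b_coef * y                    # P(k) = A*uc(k) + B*y(k)
        i_term = i_term + c * e_prev                    # I(k) = I(k-1) + C*e(k-1)
        d_term = h * d_term + l_coef * (y - y_prev)     # D(k) = H*D(k-1) + L*f(k)
        out.append(p_term + i_term + d_term)
        e_prev, y_prev = uc - y, y
    return out


# Unit setpoint with the measure held at zero: only P and I contribute.
u = commercial_pid([1.0, 1.0, 1.0], [0.0, 0.0, 0.0],
                   kp=0.5913, ti=0.0523, td=0.0225, n=10, b=1, ts=0.01)
```

In this open-loop case the output starts at Kp and then grows by C = Kp·Ts/Ti at each sample, since the integral term uses the previous error e(k-1).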
