IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 64, NO. 1, JANUARY 2017

Weighted Partitioning for Fast Multiplierless Multiple-Constant Convolution Circuit
Gian Domenico Licciardo, Member, IEEE, Carmine Cappetta, Student Member, IEEE,
Luigi Di Benedetto, Member, IEEE, and Mario Vigliar

Abstract—A new radix-3 partitioning method of natural numbers, derived from the weight partition theory, is employed to build a multiplierless circuit that is well suited for multimedia filtering applications. The partitioning method allows conveniently premultiplying 32-b floating-point filter coefficients with the smallest set of parts composing an unsigned integer input. In this way, similar to the distributed arithmetic, shifters and recoding circuitry, typical of other well-known multiplier circuits, are completely substituted with simplified floating-point adders. Compared to the existing literature, targeted to both field-programmable gate array and std_cell technology, the proposed solution achieves state-of-the-art performance in terms of elaboration speed, achieving a critical path delay of about 2 ns both on a Xilinx Virtex 7 and with CMOS 90-nm std_cells.

Index Terms—Convolution, distributed arithmetic (DA), Gaussian filter, multiplier.

Manuscript received March 4, 2016; revised March 16, 2016; accepted March 16, 2016. Date of publication March 24, 2016; date of current version December 22, 2016. This brief was recommended by Associate Editor C. K. Tse.
G. D. Licciardo, C. Cappetta, and L. Di Benedetto are with the Department of Industrial Engineering (D.I.In.), University of Salerno, 84084 Salerno, Italy (e-mail: [email protected]; [email protected]; ldibenedetto@unisa.it).
M. Vigliar is with Spark SRL, 42124 Reggio Emilia, Italy (e-mail: mario.[email protected]).
Digital Object Identifier 10.1109/TCSII.2016.2546899
1549-7747 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

RECENT advancements in the elaboration of high-quality media contents have promoted an intense research activity for the improvement of filtering operators, whose hardware (HW) complexity is a major concern in applications aimed at pure speed, such as image and video elaboration [1], [2]. Such complexity, indeed, usually results in the allocation of a large number of arithmetic operators and a consequent slowdown of the overall circuit. The recent literature shows that the aforementioned issue is usually managed either by resorting to the full/partial serialization of the filters [3], [4] and folding techniques [5] or by intervening on the intrinsic complexity of fused multiply adders and multiply accumulators (MAC). Since the former way usually causes a significant reduction of the filter performance [6], the latter approach remains the most accurate way to achieve a good power, performance, and area (PPA) tradeoff. In this case, the complete removal of the multiplier circuitry is by far the preferred choice of several authors [6], [7], who resort to fast adders and shifters in place of multipliers, according to the coding of the operands, primarily canonical signed digit (CSD) and modified Booth (MB) [8], [9]. The simplification of filtering circuits becomes particularly effective when one of the operands can be reduced to a bounded set of precalculated values, as in the case of predefined filter kernels. In such cases, the distributed arithmetic (DA) method [10] can be successfully applied in order to partition multiplications into simpler shifts and additions. By using memories to store precalculated partial sums, whose number can be reduced with the help of multiple-constant multiplication (MCM) techniques [11], [12], DA can be, in principle, advantageously used in place of MB and CSD [13]. However, the actual performance of DA results from a careful tradeoff between its "natural" bit-serial operation and the parallelism by which the partial sums are calculated, which can lead to an excessive increase of mapped physical resources [10].

In this brief, a new partitioning scheme is proposed, based on radix-3 terms (called parts), derived from the solution of the weight problem [14], with the purpose of improving the performance of DA as well as the performance of general-purpose multipliers in the specific contexts of MCM. Although it is based on the same operation principle as DA, we will demonstrate that the proposed method is always advantageous in terms of mapped physical resources and elaboration speed, for multiplying 32-b floating-point (FP32) filter coefficients with integer inputs. In this way, shifters and recoding circuitry, typical of other well-known multiplier schemes, can be completely substituted by floating-point (FP) adders, whose HW complexity can be simplified to that of fixed-point adders, without undermining the accuracy. The derived implementation is adequate for multiple-constant filtering applications, working with FP precalculated kernel coefficients and input quantities included in a range of integer values compatible with multimedia elaboration. The implementation of the proposed multiplier on a high-end field-programmable gate array (FPGA) returns a total delay path of 2.456 ns to produce an IEEE-754 FP32 [15] result, starting from an 8-b unsigned integer input, while the std_cell implementation with TSMC CMOS 90-nm technology returns 2.61 ns, both in the slow/slow corner. These delays are significantly lower than those achievable by the conventional FP32 multiplier implemented with the same platform and working on the same data set, while they are comparable with std_cell implementations in 65- and 45-nm CMOS technology [16], [17].

II. UNDERLYING PARTITIONING METHOD

The problem of establishing the least number of integers and their values such that all the numbers in a limited range
can be expressed as a combination of them has been faced since the seventeenth century because it finds a large spectrum of applications, including the simplification of the circuitry devoted to filtering apparatus [18]. On this topic, several particular demonstrations have been published [19]–[21]. A general summary has been published in [14], where it has been definitively demonstrated that, having defined the set $W_r := \{3^0, 3^1, 3^2, \ldots, 3^{n-1}, R\}$ with $R = r - (3^0 + 3^1 + 3^2 + \cdots + 3^{n-1})$, every integer in the range $[0; r]$ can be obtained by a superposition of the terms in $W_r$, each multiplied by a coefficient $C_i \in \{-1, 0, 1\}$.

More generally, defining the ordered sequence of positive integers that sum to $r = \lambda_0 + \lambda_1 + \cdots + \lambda_n$ with $\lambda_0 < \lambda_1 < \cdots < \lambda_n$ as the partition of a positive integer $r$ and the set $\{\lambda_0, \ldots, \lambda_{n-1}, \lambda_n\}$ as the parts of the partition, the following can be demonstrated.

1) Every integer $0 \le q \le r$ can be written as

$$q = \sum_{i=0}^{n} C_i \lambda_i. \tag{1}$$

2) There does not exist another partition of $r$ satisfying 1) with fewer parts than $n + 1$.

An important corollary of the aforementioned properties demonstrates that every partition of $r$ is composed of exactly $n + 1 = \lfloor \log_3(2r) \rfloor + 1$ parts. Table I shows the application of the partitioning method. For example, for an 8-b input, the parts are $\{1, 3, 9, 27, 81, 134\}$; the input 23 can be rewritten as $23 = (-1)1 + (-1)3 + (0)9 + (+1)27 + (0)81 + (0)134$, namely, the set of values from Table I will be $\{-1, -1, 0, +1, 0, 0\}$.

TABLE I. APPLICATIONS OF THE PROPOSED PARTITION METHOD

Fig. 1. Scheme of the convolution circuit used as case study of the proposed decomposition method.

The aforementioned results can be applied to MAC operations between a generic vector of coefficients $A_k$ and a vector of inputs $x_k$

$$y = \sum_{k=0}^{K-1} A_k x_k. \tag{2}$$

By using (1) in (2), it results in

$$y = \sum_{k=0}^{K-1} A_k \sum_{i=0}^{n} C_{ki} \lambda_i \tag{3}$$

which can be rewritten in terms of the parts in $W_r$ as

$$y = \sum_{k=0}^{K-1} \left( \sum_{i=0}^{n-1} A_k C_{ki} 3^i + A_k C_{kn} R \right). \tag{4}$$

Considering that $n$ is small in many cases of interest and that the aforementioned partition is a linear superposition of parts, the previous results suggest that the multiplications involved in a particular filter can be reorganized as the partition of premultiplied terms, obtained by calculating a priori the product between the parts and $A_k$. This partitioning is very different from DA, which requires that $x_k$ is decomposed as $x_k = \sum_{i=0}^{s-1} b_{ki} 2^i$, where $b_{ki} \in \{0, 1\}$ represents the sign digit. That is, (2) can be DA partitioned as

$$y = \sum_{k=0}^{K-1} A_k \sum_{i=0}^{s-1} b_{ki} 2^i. \tag{5}$$

Although the use of $b_{ki}$ in place of $C_{ki}$ helps to reduce some "glue" logic to implement (5), actual values of $s$ in (5) are significantly higher than $n$ in (4). Therefore, the number of operators to implement the inner products in (4) can be strongly reduced.

III. ARCHITECTURE DESIGN

The proposed method has been used for implementing the scheme in Fig. 1 that calculates the convolution $G * I$ between the FP32 kernel vector $G$ and the input vector $I$ of unsigned integers coded with $m$ bits (Uint-m). The proposed method is used to improve the structure of the k MACs that calculate the product between the generic coefficient of the kernel and the vector of input values. The resulting architecture is schematized in Fig. 2. Once $G$ has been defined, each element of $I$ can be decomposed into $n + 1$ parts, according to Table I, and stored in an equal number of dual-port read-only memories (ROMs), after they have been premultiplied by the elements of $G$. A further $(n + 1)2^{m+1}$ bit ROM stores the $C_i$ coefficients (2 b for each one) used to select the sign of the operands. The multipliers are substituted by the $n$ adders in the dashed box of Fig. 2, distributed along a tree of depth $\lceil \log_2(n + 1) \rceil$. In principle, the adders should have an FP architecture, but it is possible to adopt a custom coding for partial results, in order to reduce their complexity without altering the accuracy of the multiplication. Starting from the standard IEEE754 FP32 coding [15], all the exponents of the premultiplied coefficients have been increased to that of the greatest one, while the number of bits of the significands has also been increased accordingly, in order to include the shifted codes and avoid truncations.

Fig. 2. Scheme of the proposed multiplier, deployed in a Gaussian convolution.

TABLE II. CUSTOM CODING APPLIED TO THE SMALLEST PREMULTIPLIED COEFFICIENT WITH UINT-8 INPUTS AND σ = 4

TABLE III. REQUIRED MEMORY AS A FUNCTION OF THE INPUT RANGE

Fig. 3. Comparison between required resources of the proposed partitioning method and the radix-2 DA, with l = 37 and (3σ + 1) = 13.

Although the proposed method can be employed in conjunction with several kinds of kernels, as a case study it has been used to implement the Gaussian filter with kernel $G(x, \sigma) = A e^{-(x^2/2\sigma^2)}$ for its large diffusion in image and video elaboration flows (e.g., visual search, blurring, segmentation, and so on), which usually work with Uint-8 input (e.g., Luma or Chroma image pixel). Owing to its separability property, for which a 2-D filter can be separated into two consecutive 1-D products [3], [18], the filter implementation in the space–time domain is typically preferred to the frequency conversion. Results from Section II allow partitioning $I$ by using the first $n + 1 = \lfloor \log_3 2(2^8 - 1) \rfloor + 1 = 6$ parts from Table I, and we write the convolution as

$$G(x, \sigma) * I(x) = \sum_{j=0}^{K-1} G_j I_{j-\frac{K-1}{2}} = \sum_{j=0}^{K-1} G_j \sum_{i=0}^{n} C_i \lambda_i = \sum_{j=0}^{K-1} \sum_{i=0}^{n} (G_j \lambda_i) C_i = \sum_{j=0}^{K-1} P_j. \tag{6}$$

In particular, all the $n$ inner products $G_j \lambda_i$ are precomputed for each value of $I$ in the range [0;255], for every kernel coefficient, since $G_j$ remains constant once $\sigma$ and $K$ have been defined. Given that the kernel length can be imposed with $K = 6\sigma + 1$ points with good accuracy [3] and that, for the Gaussian symmetry, it is possible to store only $3\sigma + 1$ values, (6) can be implemented by the scheme in Fig. 2. The input $I$ is used to access the ROMs that, in principle, are sharable by all the multipliers. Outputs from the C-ROMs provide the signals to select the inputs to the first adder's row of multipliers. The remaining ROMs, having depth $(3\sigma + 1)$, provide the $G_j \lambda_i$ coefficients that must be added toward the final result. With the purpose of eliminating the handling of the exponent, considering that the greatest premultiplied coefficient is $G_0 \lambda_n = \lambda_n / (\sqrt{2\pi}\sigma)$ and the smallest one is $G_{K-1} \lambda_0 = (\lambda_0 / (\sqrt{2\pi}\sigma)) e^{-(9\sigma^2/2\sigma^2)} = e^{-4.5} / (\sqrt{2\pi}\sigma)$, the codelength of significands is increased by $\log_2(G_0 \lambda_n / G_{K-1} \lambda_0) = \log_2(\lambda_n e^{4.5}) \approx \log_2(\lambda_n) + 6.5$ bits. Therefore, considering that $\lambda_n \equiv R = 255 - \sum_{i=0}^{4} 3^i = 134$, the codelength of the premultiplied significands becomes

$$l = \lceil 23 + \log_2(\lambda_n) + 6.5 \rceil = \lceil \log_2(134) + 29.5 \rceil = 37 \text{ bits}. \tag{7}$$
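The partition of Section II, the premultiplied sum in (6), and the codelength in (7) can be checked numerically with a short sketch. This is an illustrative software model, not the authors' circuit: the function names (`weight_partition`, `signed_coefficients`, `premultiply`, `multiplierless_product`, `significand_codelength`) are hypothetical, the coefficient selection rule (balanced ternary on the radix-3 parts, with R used only above their reach) is one possible choice, and the ceiling in the codelength is an assumption consistent with the 37-b result.

```python
import math


def weight_partition(r):
    """Return the parts [3^0, ..., 3^(n-1), R] covering the range [0, r]."""
    n = 0
    while 3 ** (n + 1) <= 2 * r:  # n = floor(log3(2r)), using integers only
        n += 1
    powers = [3 ** i for i in range(n)]
    return powers + [r - sum(powers)]  # the last part is R


def signed_coefficients(q, parts):
    """Return C_i in {-1, 0, +1} such that sum(C_i * parts[i]) == q."""
    *powers, remainder_part = parts
    half = sum(powers)  # largest value reachable by the radix-3 parts alone
    c_last = 0 if q <= half else 1  # assumed rule: use R only when needed
    value = q - c_last * remainder_part
    coeffs = []
    for _ in powers:  # balanced-ternary digits, least significant first
        digit = value % 3
        if digit == 2:
            digit = -1
        coeffs.append(digit)
        value = (value - digit) // 3
    if value != 0:
        raise ValueError("q is outside the range covered by the parts")
    return coeffs + [c_last]


def premultiply(coefficient, parts):
    """Model of the ROM contents: coefficient times each part, computed once."""
    return [coefficient * p for p in parts]


def multiplierless_product(q, parts, table):
    """Rebuild coefficient * q using only signed additions of table entries."""
    acc = 0.0
    for c, entry in zip(signed_coefficients(q, parts), table):
        if c == 1:
            acc += entry
        elif c == -1:
            acc -= entry
    return acc


def significand_codelength(lam_n, mantissa_bits=23, edge_exponent=4.5):
    """Codelength as in (7); the ceiling is an assumption of this sketch."""
    return math.ceil(mantissa_bits + math.log2(lam_n * math.exp(edge_exponent)))


if __name__ == "__main__":
    parts = weight_partition(255)
    print(parts)                           # [1, 3, 9, 27, 81, 134]
    print(signed_coefficients(23, parts))  # [-1, -1, 0, 1, 0, 0]
    g0 = 1.0 / (math.sqrt(2 * math.pi) * 4)  # central Gaussian tap, sigma = 4
    table = premultiply(g0, parts)
    print(math.isclose(multiplierless_product(23, parts, table), g0 * 23))
    print(significand_codelength(parts[-1]))  # 37
```

In this model the table built by `premultiply` plays the role of the ROMs holding the products $G_j \lambda_i$, and the branch on `c` plays the role of the sign selection driven by the C-ROM, so no multiplication by the input remains at run time.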
Furthermore, in order to reduce the impact of this enlargement on the sizes of the ROMs, the exponent has been omitted from the partial codes and then reintroduced in the final result to normalize the data to the standard FP32 format. It is worth noting that the FP coding is necessary for applications requiring a very high dynamic range [16], as in the case of inverse tone mapping [3], where ranges higher than 70 are highly recommended. An example of the employed coding is shown in Table II, applied to the smallest coefficient with Uint-8 inputs and σ = 4.

In Table III, the memory required by the proposed solution, detailed for $C_i$ and $\lambda_n$, is compared with the corresponding quantity required by a radix-2 DA. In both cases, the mapped resources refer to full-parallel architectures with codelength l = 37 and (3σ + 1) = 13. A graphical representation of the comparison is shown in Fig. 3, in which the required number of additions is also shown as a function of the input length m. For m > 4, the proposed solution is always advantageous in terms of required additions, whereas the required memory becomes significantly higher for m > 9, e.g., 15% for m = 10. In these cases, indeed, the lower number of parts of the proposed method is compensated by the higher number of bits for $C_i$. The advantages of the proposed solution become relevant when a large number of multipliers is required. In the implementation reported in the next section, with m = 8 and σ = 4, 25 MACs are required for a full-parallel circuit; thus, using the proposed radix-3 method, it is possible to save 50 adders with respect to the radix-2 DA. Considering also that the memory is sharable between all the MACs, for implementing filters with typical kernel dimensions, the proposed solution proves to be advantageous when compared to a conventional radix-2 DA.

IV. SYNTHESIS AND RESULTS

In order to give a straightforward estimation of the advantage derived from the adoption of the proposed method, a stand-alone "equivalent multiplier" has been synthesized, composed of all the memories and the adders schematized in the dashed box of Fig. 2. It has been targeted to a Xilinx Virtex 7 XC7V2000tflg1925-1 as part of the proFPGA duo application-specific integrated circuit (ASIC) prototyping board [22] and to TSMC CMOS 90-nm std_cells. Synthesis results have been reported in Table IV and compared with a conventional FP32 multiplier and a 32-b MB, targeted to the same FPGA. A fair comparison has been achieved by using the IPs provided by Xilinx for the adders and the multiplier, all configured with a three-stage pipeline. Carry-save adders have been used for the std_cell implementation of the proposed structure, while no conventional multiplier has been compared because an optimized multiplier in the same std_cell technology was not available. The work in [23] for FPGA and that in [16] and [17] for std_cells have been used as comparative terms. Although they propose slightly different architectures, to the best of our knowledge these are the most recent designs in the literature that compare with the proposed one for the similarity of the purpose. It is worth noting that the actual memory mapped in our implementation has been increased from the minimum required of 763 B to 904 B since all the premultiplied coefficients have been stored together with their 2's complement negative counterparts. Therefore, both the positive and negative premultiplied coefficients are available at the same time for the additions, and sign conversions are avoided when $C_i = -1$.

TABLE IV. SYNTHESIS OF THE PROPOSED MULTIPLIER IN COMPARISON WITH RECENT FPGA- AND STD_CELL-ORIENTED DESIGNS

As expected, Table IV shows that the FPGA is the most advantageous platform to implement the proposed multiplier, owing to the availability of hard macros to implement ROMs. Indeed, the FPGA implementation exhibits a speedup of 335% with respect to a conventional multiplier, whereas the worst path delay reduces from 8.235 to 2.456 ns in the slow/slow corner. A great advantage is observed also with respect to the MB multiplier that has a path delay of 7.882 ns. It is worth noting that the ROM access does not introduce a critical delay since it exhibits a latency of 2.1 ns, which is significantly lower than the previous value. The mapped physical resources are approximately 30% lower than those in [23], while the delay is about one-half, although this value is not representative because of the technology differences between the two target platforms. The normalized dissipated power is 81.7% of that of the conventional multiplier.

The std_cell implementation exhibits good results in comparison with the MB-based multiplier in [17] and the Braun fused MAC in [16], both implementing a two-stage pipeline. It is worth observing that, although the multiplier in [16] is not a pure multiplier, it only differs by a 24-b carry look-ahead adder. Although the solutions in [16] and [17] are implemented with shrunk 45- and 65-nm technology, the delay times are only 430 and 110 ps higher than that proposed, respectively, whereas the occupied area is about 244% and 304% times the proposed one, respectively. Considering that, for the absence of dedicated ROMs of adequate dimensions, all the memories have been implemented by lookup tables (LUTs), this is a good result that contributes to an area–power–delay product of $3.75 \times 10^4\ \mu\text{m}^2 \cdot \text{ns} \cdot \text{mW}$, which is always better than the cited solutions.

TABLE V. SYNTHESIS RESULTS OF THE 1-D GAUSSIAN CONVOLUTION CIRCUIT

In Table V, a complete circuit for 1-D Gaussian convolution, composed of 25 multipliers and an output adder tree, connected as in Fig. 2, is compared with a conventional FP32 multiplier-based solution targeted to the same FPGA and std_cells. In this case, an FPGA speedup of 570% is obtained, whereas the path delay reduces from 16.986 to 2.981 ns and the total amount of LUTs in the design is 44.21% less than the one obtained using conventional multipliers. Also, the overall power dissipation almost halves. In std_cells, it is possible to observe a reduction of about 19.52% in area and a speedup of about 11.93%. In this case, the power dissipation is 39.64% more than in the conventional case, mainly due to power dissipated by the memories. In obtaining the data in Table V, it has been considered that all the ROMs must be read from all the multipliers on the same clock edge. Although this can be easily implemented in FPGA, ASICs require a custom implementation of very small ROMs, similarly to [6] for ROM-based logic multipliers. However, the amount of memory in Tables IV and V does not represent an actual problem in real multimedia applications, where the memory requirement is on the order of megabits because of frame buffering [24] or partial data storage [3], [4], which makes, de facto, the additional area required by the multiplier's ROMs negligible. Furthermore, considering the large amount of partial additions, the proposed architecture of the multiplier could obtain an area reduction by the application of MCM techniques [25], which have been demonstrated to be effective in reducing the area when a large number of redundant terms must be calculated. Nevertheless, due to the presence of fast LUTs in contrast with heavy carry logic, such as in conventional FP32 multipliers, the proposed system should better fit the current and forthcoming CMOS technologies with smaller gate sizes.

V. CONCLUSION

In this brief, an efficient term-partitioning method has been shown, which allows implementing the circuitry for convolution operators, typically employed in filters, without multipliers, encoders, and auxiliary circuitry. These are completely substituted by simplified adders and ROMs for storing premultiplied coefficients. The proposed solution obtains state-of-the-art performance. The solution is well suited for the application of multiconstant multiplication techniques, in order to further simplify the circuit topology.

REFERENCES

[1] S. L. Chen, "VLSI implementation of an adaptive edge-enhanced image scalar for real-time multimedia applications," IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 9, pp. 1510–1522, Sep. 2013.
[2] F. C. Huang, S. Y. Huang, J. W. Ker, and Y. C. Chen, "High performance SIFT hardware accelerator for real-time image feature extraction," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 3, pp. 340–351, Mar. 2012.
[3] G. D. Licciardo, A. D'Arienzo, and A. Rubino, "Stream processor for real-time inverse tone mapping of full-HD images," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 11, pp. 2531–2539, Nov. 2015.
[4] M. Vigliar and G. D. Licciardo, "Hardware coprocessor for stripe-based interest point detection," US Patent 20 130 301 930, Nov. 14, 2013.
[5] K. K. Parhi, VLSI Signal Processing Systems: Design and Implementation. New York, NY, USA: Wiley, 2007.
[6] B. C. Paul, S. Fujita, and M. Okajima, "ROM-based logic (RBL) design: A low-power 16 bit multiplier," IEEE J. Solid-State Circuits, vol. 44, no. 11, pp. 2935–2942, Nov. 2009.
[7] S. Y. Park and P. K. Meher, "Low-power, high-throughput, and low-area adaptive FIR filter based on distributed arithmetic," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 60, no. 6, pp. 346–350, Jun. 2013.
[8] R. M. Hewlitt and E. S. Swartzlander, "Canonical signed digit representation for FIR digital filters," in Proc. IEEE Workshop Signal Process. Syst., Lafayette, LA, USA, Oct. 2000, pp. 416–426.
[9] K. Tsoumanis, N. Axelos, N. Moshopoulos, G. Zervakis, and K. Pekmestzi, "Pre-encoded multipliers based on non-redundant radix-4 signed-digit encoding," IEEE Trans. Comput., vol. 65, no. 2, pp. 670–676, Feb. 2016.
[10] A. Peled and B. Liu, "A new hardware realization of digital filters," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-22, no. 6, pp. 456–462, Dec. 1974.
[11] Y. Voronenko and M. Püschel, "Multiplierless multiple constant multiplication," ACM Trans. Algorithms, vol. 3, no. 2, pp. 1–39, May 2007.
[12] L. Aksoy, P. Flores, and J. Monteiro, "Efficient design of FIR filters using hybrid multiple constant multiplication on FPGA," in Proc. IEEE 32nd ICCD, Oct. 2014, pp. 42–47.
[13] A. Berkeman, V. Owall, and M. Torkelson, "A low logic depth complex multiplier using distributed arithmetic," IEEE J. Solid-State Circuits, vol. 35, no. 4, pp. 656–659, Apr. 2000.
[14] E. O'Shea, "Bachet's problem: As few weights to weigh them all," ArXiv e-prints, Oct. 2010.
[15] IEEE Standard for Binary Floating-Point Arithmetic, Amer. Nat. Std. Inst. (ANSI), Washington, DC, USA, IEEE 754-1985, 1985.
[16] M. A. Basiri and N. M. Sk, "An efficient hardware-based higher radix floating point MAC design," ACM Trans. Des. Autom. Electron. Syst., vol. 20, no. 1, pp. 1–25, Nov. 2014.
[17] M. Själander and P. Larsson-Edefors, "Multiplication acceleration through twin precision," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 9, pp. 1233–1246, Sep. 2009.
[18] M. Vigliar and G. D. Licciardo, "Multiplierless coprocessor for Difference of Gaussian (DOG) calculation," US Patent 20 130 301 950, Nov. 14, 2013.
[19] G. H. Hardy and E. M. Wright, An Introduction to the Theory of Numbers (Sixth Edition). New York, NY, USA: Oxford Univ. Press, 2008.
[20] S. K. Park, "The r-complete partitions," Discrete Math., vol. 183, no. 1–3, pp. 293–297, Mar. 1998.
[21] Ø. J. Rødseth, "Enumeration of M-partitions," Discrete Math., vol. 306, no. 7, pp. 694–698, Apr. 2006.
[22] "Virtex-7 Family, DS183 (v1.23)," Xilinx, San Jose, CA, USA, Jun. 23, 2015.
[23] S. Arish and R. K. Sharma, "An efficient floating point multiplier design for high speed applications using Karatsuba algorithm and Urdhva-Tiryagbhyam algorithm," in Proc. ICSC, Noida, India, Mar. 2015, pp. 303–308.
[24] W. M. Chao and L. G. Chen, "Pyramid architecture for 3840 × 2160 quad full high definition 30 frames/s video acquisition," IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 11, pp. 1499–1508, Nov. 2010.
[25] M. Potkonjak and M. B. Srivastava, "Multiple constant multiplications: Efficient and versatile framework and algorithms for exploring common subexpression elimination," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 15, no. 2, pp. 151–165, Feb. 1996.
