Weighted Partitioning For Fast Multiplierless
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 64, NO. 1, JANUARY 2017
Abstract—A new radix-3 partitioning method of natural numbers, derived from the weight partition theory, is employed to build a multiplierless circuit that is well suited for multimedia filtering applications. The partitioning method allows conveniently premultiplying 32-b floating-point filter coefficients with the smallest set of parts composing an unsigned integer input. In this way, similarly to distributed arithmetic, shifters and recoding circuitry, typical of other well-known multiplier circuits, are completely replaced by simplified floating-point adders. Compared with the existing literature, targeted to both field-programmable gate array and std_cell technology, the proposed solution achieves state-of-the-art performance in terms of elaboration speed, with a critical path delay of about 2 ns both on a Xilinx Virtex 7 and with CMOS 90-nm std_cells.

Index Terms—Convolution, distributed arithmetic (DA), Gaussian filter, multiplier.

I. INTRODUCTION

RECENT advancements in the elaboration of high-quality media contents have promoted intense research activity on the improvement of filtering operators, whose hardware (HW) complexity is a major concern in applications aimed at pure speed, such as image and video elaboration [1], [2]. Such complexity, indeed, usually results in the allocation of a large number of arithmetic operators and a consequent slowdown of the overall circuit. The recent literature shows that this issue is usually managed either by resorting to the full/partial serialization of the filters [3], [4] and folding techniques [5] or by intervening on the intrinsic complexity of fused multiply adders and multiply accumulators (MAC). Since the former way usually causes a significant reduction of the filter performance [6], the latter approach remains the most accurate way to achieve a good power, performance, and area (PPA) tradeoff. In this case, the complete removal of the multiplier circuitry is by far the preferred choice of several authors [6], [7], who resort to fast adders and shifters in place of multipliers, according to the coding of the operands, primarily canonical signed digit (CSD) and modified Booth (MB) [8], [9]. The simplification of filtering circuits becomes particularly effective when one of the operands can be reduced to a bounded set of precalculated values, as in the case of predefined filter kernels. In such cases, the distributed arithmetic (DA) method [10] can be successfully applied in order to partition multiplications into simpler shifts and additions. By using memories to store precalculated partial sums, whose number can be reduced with the help of multiple-constant multiplication (MCM) techniques [11], [12], DA can be, in principle, advantageously used in place of MB and CSD [13]. However, the actual performance of DA results from a careful tradeoff between its "natural" bit-serial operation and the parallelism by which the partial sums are calculated, which can lead to an excessive increase of mapped physical resources [10].

In this brief, a new partitioning scheme is proposed, based on radix-3 terms (called parts), derived from the solution of the weight problem [14], with the purpose of improving the performance of DA as well as that of general-purpose multipliers in the specific context of MCM. Although it is based on the same operating principle as DA, we will demonstrate that the proposed method is always advantageous in terms of mapped physical resources and elaboration speed for multiplying 32-b floating-point (FP32) filter coefficients with integer inputs. In this way, shifters and recoding circuitry, typical of other well-known multiplier schemes, can be completely replaced by floating-point (FP) adders, whose HW complexity can be simplified to that of fixed-point adders without undermining the accuracy. The derived implementation is adequate for multiple-constant filtering applications, working with FP precalculated kernel coefficients and input quantities included in a range of integer values compatible with multimedia elaboration. The implementation of the proposed multiplier on a high-end field-programmable gate array (FPGA) returns a total path delay of 2.456 ns to produce an IEEE-754 FP32 [15] result, starting from an 8-b unsigned integer input, while the std_cell implementation with TSMC CMOS 90-nm technology returns 2.61 ns, both in the slow/slow corner. These delays are significantly lower than those achievable by the conventional FP32 multiplier implemented on the same platform and working on the same data set, while they are comparable with std_cell implementations in 65- and 45-nm CMOS technology [16], [17].

II. UNDERLYING PARTITIONING METHOD

The problem of establishing the least number of integers and their values such that all the numbers in a limited range

Manuscript received March 4, 2016; revised March 16, 2016; accepted March 16, 2016. Date of publication March 24, 2016; date of current version December 22, 2016. This brief was recommended by Associate Editor C. K. Tse.
G. D. Licciardo, C. Cappetta, and L. Di Benedetto are with the Department of Industrial Engineering (D.I.In.), University of Salerno, 84084 Salerno, Italy (e-mail: [email protected]; [email protected]; ldibenedetto@unisa.it).
M. Vigliar is with Spark SRL, 42124 Reggio Emilia, Italy (e-mail: mario.[email protected]).
Digital Object Identifier 10.1109/TCSII.2016.2546899
1549-7747 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
LICCIARDO et al.: WEIGHTED PARTITIONING FOR MULTIPLE-CONSTANT CONVOLUTION CIRCUIT 67
TABLE I
APPLICATIONS OF THE PROPOSED PARTITION METHOD
Fig. 1. Scheme of the convolution circuit used as case study of the proposed decomposition method.
TABLE III
REQUIRED MEMORY AS A FUNCTION OF THE INPUT RANGE
…ration flow (e.g., visual search, blurring, segmentation, and so on), where they usually work with Uint-8 input (e.g., Luma or Chroma image pixel). Owing to its separability property, for which a 2-D filter can be separated into two consecutive 1-D products [3], [18], the filter implementation in the space–time domain is typically preferred to the frequency conversion. Results from Section II allow partitioning I by using the first

$$= \sum_{j=0}^{K-1}\sum_{i=0}^{n}(G_j\lambda_i)\,C_i = \sum_{j=0}^{K-1}P_j. \tag{6}$$

by $\log_2(G_0\lambda_n/G_{K-1}\lambda_0) = \log_2(\lambda_n e^{4.5}) = \log_2(\lambda_n) + 6.5$ bits. Therefore, considering that $\lambda_n \equiv R = 255 - \sum_{i=0}^{4}3^i = 134$, the codelength of the premultiplied significands becomes

$$l = \lceil 23 + \log_2(\lambda_n) + 6.5\rceil = \lceil\log_2(134) + 29.5\rceil = 37\ \text{bits}. \tag{7}$$
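The numbers entering Eq. (7) can be checked with a short script. The sketch below assumes, as suggested by $\lambda_n = 255 - \sum_{i=0}^{4}3^i = 134$, that an 8-b input is composed from the parts 1, 3, 9, 27, 81, and 134 with coefficients $C_i \in \{-1, 0, 1\}$. The function names and the rounding up to an integer number of bits are illustrative assumptions, not taken from the brief.

```python
import itertools
import math


def radix3_parts(max_input):
    """Powers of 3 followed by a remainder part, summing to max_input (assumed rule)."""
    parts = []
    covered = 0  # largest value reachable with the parts chosen so far
    while max_input - covered > 3 ** len(parts):
        parts.append(3 ** len(parts))
        covered += parts[-1]
    if max_input > covered:
        parts.append(max_input - covered)  # final remainder part R
    return parts


def reachable(parts):
    """All values sum(C_i * part_i) with every C_i in {-1, 0, 1} (brute force)."""
    return {
        sum(c * p for c, p in zip(coeffs, parts))
        for coeffs in itertools.product((-1, 0, 1), repeat=len(parts))
    }


def codelength(last_part, significand_bits=23, dynamic_range_bits=6.5):
    """Bits needed for a premultiplied significand, rounded up as in Eq. (7)."""
    return math.ceil(significand_bits + math.log2(last_part) + dynamic_range_bits)


parts = radix3_parts(255)
assert set(range(256)) <= reachable(parts)  # every Uint-8 value is representable
print(parts)                   # [1, 3, 9, 27, 81, 134]
print(codelength(parts[-1]))   # 37
```

The brute-force check enumerates only $3^6 = 729$ coefficient combinations, so it is cheap enough to confirm that six parts cover the whole Uint-8 range.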
TABLE IV
SYNTHESIS OF THE PROPOSED MULTIPLIER IN COMPARISON WITH RECENT FPGA- AND STD_CELL-ORIENTED DESIGNS
Furthermore, in order to reduce the impact of this enlargement on the sizes of the ROMs, the exponent has been omitted from the partial codes and then reintroduced in the final result to normalize the data to the standard FP32 format. It is worth noting that the FP coding is necessary for applications requiring a very high dynamic range [16], as in the case of inverse tone mapping [3], where ranges higher than 70 are highly recommended. An example of the employed coding is shown in Table II, applied to the smallest coefficient with Uint-8 inputs and σ = 4.

In Table III, the memory required by the proposed solution, detailed for $C_i$ and $\lambda_n$, is compared with the corresponding quantity required by a radix-2 DA. In both cases, the mapped resources refer to full-parallel architectures with codelength l = 37 and (3σ + 1) = 13. A graphical representation of the comparison is shown in Fig. 3, in which the required number of additions is also shown as a function of the input length m. For m > 4, the proposed solution is always advantageous in terms of required additions, whereas the required memory becomes significantly higher for m > 9, e.g., 15% for m = 10. In these cases, indeed, the lower number of parts of the proposed method is compensated by the higher number of bits for $C_i$. The advantages of the proposed solution become relevant when a large number of multipliers is required. In the implementation reported in the next section, with m = 8 and σ = 4, 25 MACs are required for a full-parallel circuit; thus, using the proposed radix-3 method, it is possible to save 50 adders with respect to the radix-2 DA. Considering also that the memory is sharable between all the MACs, for implementing filters with typical kernel dimensions, the proposed solution proves to be advantageous when compared to a conventional radix-2 DA.

IV. SYNTHESIS AND RESULTS

In order to give a straightforward estimation of the advantage derived from the adoption of the proposed method, a stand-alone "equivalent multiplier" has been synthesized, composed of all the memories and the adders schematized in the dashed box of Fig. 2. It has been targeted to a Xilinx Virtex 7 XC7V2000tflg1925-1 as part of the proFPGA duo application-specific integrated circuit (ASIC) prototyping board [22] and to TSMC CMOS 90-nm std_cells. Synthesis results are reported in Table IV and compared with a conventional FP32 multiplier and a 32-b MB, targeted to the same FPGA. A fair comparison has been achieved by using the IPs provided by Xilinx for the adders and the multiplier, all configured with a three-stage pipeline. Carry-save adders have been used for the std_cell implementation of the proposed structure, while no conventional multiplier has been compared, owing to the absence of an optimized multiplier in the same std_cell technology. The work in [23] for FPGA and that in [16] and [17] for std_cells have been used as comparative terms. Although they propose slightly different architectures, to the best of our knowledge these are the most recent designs in the literature that compare with the proposed one for the similarity of the purpose. It is worth noting that the actual memory mapped in our implementation has been increased from the minimum required of 763 B to 904 B since all the premultiplied coefficients have been stored together with their 2's complement negative counterparts. Therefore, both the positive and negative premultiplied coefficients are available at the same time for the additions, and sign conversions are avoided when $C_i = -1$.

As expected, Table IV shows that the FPGA is the most advantageous platform to implement the proposed multiplier, owing to the availability of hard macros to implement ROMs. Indeed, the FPGA implementation exhibits a speedup of 335% with respect to a conventional multiplier, with the worst path delay reduced from 8.235 to 2.456 ns in the slow/slow corner. A great advantage is observed also with respect to the MB multiplier, which has a path delay of 7.882 ns. It is worth noting that the ROM access does not introduce a critical delay since it exhibits a latency of 2.1 ns, which is lower than the previous value. The mapped physical resources are approximately 30% lower than those in [23], while the delay is about one-half, although this value is not representative because of the technology differences between the two target platforms. The normalized dissipated power is 81.7% of that of the conventional multiplier.

The std_cell implementation exhibits good results in comparison with the MB-based multiplier in [17] and the Braun fused MAC in [16], both implementing a two-stage pipeline. It is worth observing that, although the multiplier in [16] is not a pure multiplier, it only differs by a 24-b carry look-ahead adder. Although the solutions in [16] and [17] are implemented with shrunk 45- and 65-nm technology, the delay times are only 430 and 110 ps higher than that proposed, respectively, whereas the occupied area is about 244% and 304% of the proposed one, respectively. Considering that, owing to the absence of devoted ROMs of adequate dimensions, all the memories have been implemented by lookup tables (LUTs), which is a
good result that contributes to an area–power–delay product of 3.75 × 10⁴ μm² · ns · mW, which is always better than the cited solutions.

TABLE V
SYNTHESIS RESULTS OF THE 1-D GAUSSIAN CONVOLUTION CIRCUIT

In Table V, a complete circuit for 1-D Gaussian convolution, composed of 25 multipliers and an output adder tree, connected as in Fig. 2, is compared with a conventional FP32 multiplier-based solution targeted to the same FPGA and std_cells. In this case, an FPGA speedup of 570% is obtained, with the path delay reduced from 16.986 to 2.981 ns, and the total amount of LUTs in the design is 44.21% less than that obtained using conventional multipliers. Also, the overall power dissipation almost halves. In std_cells, it is possible to observe a reduction of about 19.52% in area and a speedup of about 11.93%. In this case, the power dissipation is 39.64% more than in the conventional case, mainly due to the power dissipated by the memories. In obtaining the data in Table V, it has been considered that all the ROMs must be read by all the multipliers on the same clock edge. Although this can be easily implemented in FPGA, ASICs require a custom implementation of very small ROMs, similarly to [6] for ROM-based logic multipliers. However, the amount of memory in Tables IV and V does not represent an actual problem in real multimedia applications, where the memory requirement is on the order of megabits because of frame buffering [24] or partial data storage [3], [4], which makes, de facto, the additional area required by the multiplier's ROMs negligible. Furthermore, considering the large amount of partial additions, the proposed architecture of the multiplier could obtain an area reduction by the application of MCM techniques [25], which have been demonstrated to be effective in reducing the area when a large number of redundant terms must be calculated. Nevertheless, due to the presence of fast LUTs in contrast with heavy carry logic, such as in conventional FP32 multipliers, the proposed system should better fit the current and forthcoming CMOS technologies with smaller gate sizes.

V. CONCLUSION

In this brief, an efficient term-partitioning method has been shown, which allows implementing the circuitry for convolution operators, typically employed in filters, without multipliers, encoders, and auxiliary circuitry. These are completely replaced by simplified adders and ROMs for storing premultiplied coefficients. The proposed solution obtains state-of-the-art performance. The solution is well suited for the application of multiple-constant multiplication techniques, in order to further simplify the circuit topology.

REFERENCES

[1] S. L. Chen, “VLSI implementation of an adaptive edge-enhanced image scalar for real-time multimedia applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 9, pp. 1510–1522, Sep. 2013.
[2] F. C. Huang, S. Y. Huang, J. W. Ker, and Y. C. Chen, “High performance SIFT hardware accelerator for real-time image feature extraction,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 3, pp. 340–351, Mar. 2012.
[3] G. D. Licciardo, A. D’Arienzo, and A. Rubino, “Stream processor for real-time inverse tone mapping of full-HD images,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 11, pp. 2531–2539, Nov. 2015.
[4] M. Vigliar and G. D. Licciardo, “Hardware coprocessor for stripe-based interest point detection,” US Patent 20 130 301 930, Nov. 14, 2013.
[5] K. K. Parhi, VLSI Signal Processing Systems: Design and Implementation. New York, NY, USA: Wiley, 2007.
[6] B. C. Paul, S. Fujita, and M. Okajima, “ROM-based logic (RBL) design: A low-power 16 bit multiplier,” IEEE J. Solid-State Circuits, vol. 44, no. 11, pp. 2935–2942, Nov. 2009.
[7] S. Y. Park and P. K. Meher, “Low-power, high-throughput, and low-area adaptive FIR filter based on distributed arithmetic,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 60, no. 6, pp. 346–350, Jun. 2013.
[8] R. M. Hewlitt and E. S. Swartzlander, “Canonical signed digit representation for FIR digital filters,” in Proc. IEEE Workshop Signal Process. Syst., Lafayette, LA, USA, Oct. 2000, pp. 416–426.
[9] K. Tsoumanis, N. Axelos, N. Moshopoulos, G. Zervakis, and K. Pekmestzi, “Pre-encoded multipliers based on non-redundant radix-4 signed-digit encoding,” IEEE Trans. Comput., vol. 65, no. 2, pp. 670–676, Feb. 2016.
[10] A. Peled and B. Liu, “A new hardware realization of digital filters,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-22, no. 6, pp. 456–462, Dec. 1974.
[11] Y. Voronenko and M. Püschel, “Multiplierless multiple constant multiplication,” ACM Trans. Algorithms, vol. 3, no. 2, pp. 1–39, May 2007.
[12] L. Aksoy, P. Flores, and J. Monteiro, “Efficient design of FIR filters using hybrid multiple constant multiplication on FPGA,” in Proc. IEEE 32nd ICCD, Oct. 2014, pp. 42–47.
[13] A. Berkeman, V. Owall, and M. Torkelson, “A low logic depth complex multiplier using distributed arithmetic,” IEEE J. Solid-State Circuits, vol. 35, no. 4, pp. 656–659, Apr. 2000.
[14] E. O’Shea, “Bachet’s problem: As few weights to weigh them all,” ArXiv e-prints, Oct. 2010.
[15] IEEE Standard for Binary Floating-Point Arithmetic, Amer. Nat. Std. Inst. (ANSI), Washington, DC, USA, IEEE 754-1985, 1985.
[16] M. A. Basiri and N. M. Sk, “An efficient hardware-based higher radix floating point MAC design,” ACM Trans. Des. Autom. Electron. Syst., vol. 20, no. 1, pp. 1–25, Nov. 2014.
[17] M. Själander and P. Larsson-Edefors, “Multiplication acceleration through twin precision,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 9, pp. 1233–1246, Sep. 2009.
[18] M. Vigliar and G. D. Licciardo, “Multiplierless coprocessor for Difference of Gaussian (DOG) calculation,” US Patent 20 130 301 950, Nov. 14, 2013.
[19] G. H. Hardy and E. M. Wright, An Introduction to the Theory of Numbers (Sixth Edition). New York, NY, USA: Oxford Univ. Press, 2008.
[20] S. K. Park, “The r-complete partitions,” Discrete Math., vol. 183, no. 1–3, pp. 293–297, Mar. 1998.
[21] Ø. J. Rødseth, “Enumeration of M-partitions,” Discrete Math., vol. 306, no. 7, pp. 694–698, Apr. 2006.
[22] “Virtex-7 Family, DS183 (v1.23),” Xilinx, San Jose, CA, USA, Jun. 23, 2015.
[23] S. Arish and R. K. Sharma, “An efficient floating point multiplier design for high speed applications using Karatsuba algorithm and Urdhva-Tiryagbhyam algorithm,” in Proc. ICSC, Noida, India, Mar. 2015, pp. 303–308.
[24] W. M. Chao and L. G. Chen, “Pyramid architecture for 3840 × 2160 quad full high definition 30 frames/s video acquisition,” IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 11, pp. 1499–1508, Nov. 2010.
[25] M. Potkonjak and M. B. Srivastava, “Multiple constant multiplications: Efficient and versatile framework and algorithms for exploring common subexpression elimination,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 15, no. 2, pp. 151–165, Feb. 1996.
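The multiplierless principle of this brief, with products replaced by additions of premultiplied coefficients read from ROMs, can be sketched behaviorally. The model below is only an illustration in double-precision Python, not the FP32 datapath that was synthesized. The part set `PARTS`, the decomposition rule, and every function name are assumptions. The adder count assumes one partial term per input bit for a radix-2 DA and one per part for the radix-3 scheme, which reproduces the 50 adders reported as saved for 25 MACs.

```python
PARTS = [1, 3, 9, 27, 81, 134]  # assumed part set for Uint-8 inputs


def decompose(x):
    """Coefficients C_i in {-1, 0, 1} such that sum(C_i * PARTS[i]) == x, 0 <= x <= 255."""
    coeffs = [0] * len(PARTS)
    if x > sum(PARTS[:-1]):      # beyond the reach of the pure powers of 3
        coeffs[-1] = 1
        x -= PARTS[-1]           # the remainder may be negative
    for i in range(len(PARTS) - 1):
        digit = x % 3            # Python's % yields 0, 1, or 2 for negative x too
        if digit == 2:
            digit = -1           # balanced-ternary digit
        coeffs[i] = digit
        x = (x - digit) // 3
    return coeffs


def build_rom(coefficient):
    """Premultiplied terms for one filter coefficient, stored with their negatives."""
    return {
        (i, sign): sign * coefficient * part
        for i, part in enumerate(PARTS)
        for sign in (1, -1)
    }


def multiply(rom, x):
    """Product coefficient * x obtained with table lookups and additions only."""
    acc = 0.0
    for i, c in enumerate(decompose(x)):
        if c:
            acc += rom[(i, c)]
    return acc


def adders_saved(m_bits, n_macs):
    """Adders saved against a radix-2 DA under the term-count assumption above."""
    return n_macs * ((m_bits - 1) - (len(PARTS) - 1))


rom = build_rom(0.125)
print(decompose(200))        # [0, 1, 1, -1, 1, 1]
print(multiply(rom, 200))    # 25.0
print(adders_saved(8, 25))   # 50
```

Storing the negated terms next to the positive ones mirrors the choice described in Section IV, where no sign conversion is needed when a coefficient equals -1.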