IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 64, NO. 1, JANUARY 2017

Weighted Partitioning for Fast Multiplierless Multiple-Constant Convolution Circuit
Gian Domenico Licciardo, Member, IEEE, Carmine Cappetta, Student Member, IEEE,
Luigi Di Benedetto, Member, IEEE, and Mario Vigliar

Abstract—A new radix-3 partitioning method of natural numbers, derived from weight partition theory, is employed to build a multiplierless circuit that is well suited for multimedia filtering applications. The partitioning method allows 32-b floating-point filter coefficients to be conveniently premultiplied by the smallest set of parts composing an unsigned integer input. In this way, similar to distributed arithmetic, the shifters and recoding circuitry typical of other well-known multiplier circuits are completely replaced by simplified floating-point adders. Compared with the existing literature, targeting both field-programmable gate array and std_cell technology, the proposed solution achieves state-of-the-art performance in terms of processing speed, with a critical path delay of about 2 ns both on a Xilinx Virtex 7 and with CMOS 90-nm std_cells.

Index Terms—Convolution, distributed arithmetic (DA), Gaussian filter, multiplier.

Manuscript received March 4, 2016; revised March 16, 2016; accepted March 16, 2016. Date of publication March 24, 2016; date of current version December 22, 2016. This brief was recommended by Associate Editor C. K. Tse.
G. D. Licciardo, C. Cappetta, and L. Di Benedetto are with the Department of Industrial Engineering (D.I.In.), University of Salerno, 84084 Salerno, Italy (e-mail: [email protected]; [email protected]; ldibenedetto@unisa.it).
M. Vigliar is with Spark SRL, 42124 Reggio Emilia, Italy (e-mail: mario.[email protected]).
Digital Object Identifier 10.1109/TCSII.2016.2546899
1549-7747 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

Recent advancements in the processing of high-quality media content have promoted intense research activity on the improvement of filtering operators, whose hardware (HW) complexity is a major concern in applications aimed at pure speed, such as image and video processing [1], [2]. Such complexity usually results in the allocation of a large number of arithmetic operators and a consequent slowdown of the overall circuit. The recent literature shows that this issue is usually managed either by resorting to the full or partial serialization of the filters [3], [4] and folding techniques [5] or by acting on the intrinsic complexity of fused multiply adders and multiply accumulators (MAC). Since the former way usually causes a significant reduction of the filter performance [6], the latter approach remains the most accurate way to achieve a good power, performance, and area (PPA) tradeoff. In this case, the complete removal of the multiplier circuitry is by far the preferred choice of several authors [6], [7], who resort to fast adders and shifters in place of multipliers, according to the coding of the operands, primarily canonical signed digit (CSD) and modified Booth (MB) [8], [9]. The simplification of filtering circuits becomes particularly effective when one of the operands can be reduced to a bounded set of precalculated values, as in the case of predefined filter kernels. In such cases, the distributed arithmetic (DA) method [10] can be successfully applied in order to partition multiplications into simpler shifts and additions. By using memories to store precalculated partial sums, whose number can be reduced with the help of multiple-constant multiplication (MCM) techniques [11], [12], DA can be, in principle, advantageously used in place of MB and CSD [13]. However, the actual performance of DA results from a careful tradeoff between its “natural” bit-serial operation and the parallelism by which the partial sums are calculated, which can lead to an excessive increase of mapped physical resources [10].

In this brief, a new partitioning scheme is proposed, based on radix-3 terms (called parts), derived from the solution of the weight problem [14], with the purpose of improving the performance of DA as well as that of general-purpose multipliers in the specific context of MCM. Although it is based on the same operating principle as DA, we will demonstrate that the proposed method is always advantageous in terms of mapped physical resources and processing speed for multiplying 32-b floating-point (FP32) filter coefficients by integer inputs. In this way, the shifters and recoding circuitry typical of other well-known multiplier schemes can be completely replaced by floating-point (FP) adders, whose HW complexity can be simplified to that of fixed-point adders without undermining the accuracy. The derived implementation is adequate for multiple-constant filtering applications, working with precalculated FP kernel coefficients and input quantities in a range of integer values compatible with multimedia processing. The implementation of the proposed multiplier on a high-end field-programmable gate array (FPGA) returns a total path delay of 2.456 ns to produce an IEEE-754 FP32 [15] result, starting from an 8-b unsigned integer input, while the std_cell implementation with TSMC CMOS 90-nm technology returns 2.61 ns, both in the slow/slow corner. These delays are significantly lower than those achievable by a conventional FP32 multiplier implemented on the same platform and working on the same data set, while they are comparable with std_cell implementations in 65- and 45-nm CMOS technology [16], [17].

II. UNDERLYING PARTITIONING METHOD

The problem of establishing the least number of integers and their values such that all the numbers in a limited range
can be expressed as a combination of them has been studied since the seventeenth century because it finds a wide spectrum of applications, including the simplification of the circuitry devoted to filtering [18]. Several particular proofs have been published on this topic [19]–[21]. A general summary has been published in [14], where it has been definitively demonstrated that, having defined the set $W_r := \{3^0, 3^1, 3^2, \ldots, 3^{n-1}, R\}$ with $R = r - (3^0 + 3^1 + 3^2 + \cdots + 3^{n-1})$, every integer in the range $[0; r]$ can be obtained by a superposition of the terms in $W_r$, each multiplied by a coefficient $C_i \in \{-1, 0, 1\}$.

More generally, defining the ordered sequence of positive integers that sum to $r = \lambda_0 + \lambda_1 + \cdots + \lambda_n$ with $\lambda_0 < \lambda_1 < \cdots < \lambda_n$ as the partition of a positive integer $r$ and the set $\{\lambda_0, \ldots, \lambda_{n-1}, \lambda_n\}$ as the parts of the partition, the following can be demonstrated.

1) Every integer $0 \le q \le r$ can be written as

$$q = \sum_{i=0}^{n} C_i \lambda_i. \tag{1}$$

2) There does not exist another partition of $r$ satisfying (1) with fewer parts than $n + 1$.

An important corollary of the aforementioned properties demonstrates that every such partition of $r$ is composed of exactly $n + 1 = \lfloor \log_3(2r) \rfloor + 1$ parts. Table I shows the application of the partitioning method. For example, for an 8-b input, the parts are $\{1, 3, 9, 27, 81, 134\}$; the input 23 can be rewritten as $23 = (-1)1 + (-1)3 + (0)9 + (+1)27 + (0)81 + (0)134$, namely, the set of coefficients from Table I is $\{-1, -1, 0, +1, 0, 0\}$.

TABLE I
APPLICATIONS OF THE PROPOSED PARTITION METHOD

The aforementioned results can be applied to MAC operations between a generic vector of coefficients $A_k$ and a vector of inputs $x_k$

$$y = \sum_{k=0}^{K-1} A_k x_k. \tag{2}$$

By using (1) in (2), it results in

$$y = \sum_{k=0}^{K-1} A_k \sum_{i=0}^{n} C_{ki} \lambda_i \tag{3}$$

which can be rewritten in terms of the parts in $W_r$ as

$$y = \sum_{k=0}^{K-1} \left( \sum_{i=0}^{n-1} A_k C_{ki} 3^i + A_k C_{kn} R \right). \tag{4}$$

Considering that $n$ is small in many cases of interest and that the aforementioned partition is a linear superposition of parts, the previous results suggest that the multiplications involved in a particular filter can be reorganized as the partition of premultiplied terms, obtained by calculating a priori the product between the parts and $A_k$. This partitioning is very different from DA, which requires that $x_k$ be decomposed as $x_k = \sum_{i=0}^{s-1} b_{ki} 2^i$, where $b_{ki} \in \{0, 1\}$ represents the sign digit. That is, (2) can be DA partitioned as

$$y = \sum_{k=0}^{K-1} A_k \sum_{i=0}^{s-1} b_{ki} 2^i. \tag{5}$$

Although the use of $b_{ki}$ in place of $C_{ki}$ helps to remove some “glue” logic in the implementation of (5), actual values of $s$ in (5) are significantly higher than $n$ in (4). Therefore, the number of operators needed to implement the inner products in (4) can be strongly reduced.

Fig. 1. Scheme of the convolution circuit used as case study of the proposed decomposition method.

III. ARCHITECTURE DESIGN

The proposed method has been used to implement the scheme in Fig. 1, which calculates the convolution $G * I$ between the FP32 kernel vector $G$ and the input vector $I$ of unsigned integers coded with $m$ bits (Uint-$m$). The proposed method is used to improve the structure of the $k$ MACs that calculate the product between the generic coefficient of the kernel and the vector of input values. The resulting architecture is schematized in Fig. 2. Once $G$ has been defined, each element of $I$ can be decomposed in $n + 1$ parts, according to Table I, and stored in an equal number of dual-port read-only memories (ROMs), after they have been premultiplied by the elements of $G$. A further $(n + 1)2^{m+1}$-bit ROM stores the $C_i$ coefficients (2 b for each
one) used to select the sign of the operands. The multipliers are replaced by the $n$ adders in the dashed box of Fig. 2, distributed along a tree of depth $\lceil \log_2(n + 1) \rceil$. In principle, the adders should have an FP architecture, but it is possible to adopt a custom coding for partial results, in order to reduce their complexity without altering the accuracy of the multiplication. Starting from the standard IEEE754 FP32 coding [15], all the exponents of the premultiplied coefficients have been increased to that of the greatest one, while the number of bits of the significands has also been increased accordingly, in order to include the shifted codes and avoid truncations.

Fig. 2. Scheme of the proposed multiplier, deployed in a Gaussian convolution.

TABLE II
CUSTOM CODING APPLIED TO THE SMALLEST PREMULTIPLIED COEFFICIENT WITH UINT-8 INPUTS AND σ = 4

TABLE III
REQUIRED MEMORY AS A FUNCTION OF THE INPUT RANGE

Fig. 3. Comparison between required resources of the proposed partitioning method and the radix-2 DA, with l = 37 and (3σ + 1) = 13.

Although the proposed method can be employed in conjunction with several kinds of kernels, as a case study it has been used to implement the Gaussian filter with kernel $G(x, \sigma) = A e^{-(x^2/2\sigma^2)}$ because of its wide use in image and video processing flows (e.g., visual search, blurring, segmentation, and so on), which usually work with Uint-8 inputs (e.g., Luma or Chroma image pixels). Owing to its separability property, for which a 2-D filter can be separated into two consecutive 1-D products [3], [18], the filter implementation in the space–time domain is typically preferred to the frequency conversion. Results from Section II allow partitioning $I$ by using the first $n + 1 = \lfloor \log_3 2(2^8 - 1) \rfloor + 1 = 6$ parts from Table I, and we write the convolution as

$$G(x, \sigma) * I(x) = \sum_{j=0}^{K-1} G_j I_{j - \frac{K-1}{2}} = \sum_{j=0}^{K-1} G_j \sum_{i=0}^{n} C_i \lambda_i = \sum_{j=0}^{K-1} \sum_{i=0}^{n} (G_j \lambda_i) C_i = \sum_{j=0}^{K-1} P_j. \tag{6}$$

In particular, all the $n$ inner products $G_j \lambda_i$ are precomputed for each value of $I$ in the range $[0; 255]$, for every kernel coefficient, since $G_j$ remains constant once $\sigma$ and $K$ have been defined. Given that the kernel length can be set to $K = 6\sigma + 1$ points with good accuracy [3] and that, owing to the Gaussian symmetry, it is possible to store only $3\sigma + 1$ values, (6) can be implemented by the scheme in Fig. 2. The input $I$ is used to access the ROMs that, in principle, are sharable by all the multipliers. Outputs from the C-ROMs provide the signals that select the inputs to the first row of adders of the multipliers. The remaining ROMs, having depth $(3\sigma + 1)$, provide the $G_j \lambda_i$ coefficients that must be added toward the final result. With the purpose of eliminating the handling of the exponent, considering that the greatest premultiplied coefficient is $G_0 \lambda_n = \lambda_n / (\sqrt{2\pi}\sigma)$ and the smallest one is $G_{K-1} \lambda_0 = (\lambda_0 / (\sqrt{2\pi}\sigma)) e^{-(9\sigma^2/2\sigma^2)} = e^{-4.5} / (\sqrt{2\pi}\sigma)$, the codelength of the significands is increased by $\log_2(G_0 \lambda_n / G_{K-1} \lambda_0) = \log_2(\lambda_n e^{4.5}) \approx \log_2(\lambda_n) + 6.5$ bits. Therefore, considering that $\lambda_n \equiv R = 255 - \sum_{i=0}^{4} 3^i = 134$, the codelength of the premultiplied significands becomes

$$l = \lceil 23 + \log_2(\lambda_n) + 6.5 \rceil = \lceil \log_2(134) + 29.5 \rceil = 37 \text{ bits}. \tag{7}$$
Furthermore, in order to reduce the impact of this enlargement on the sizes of the ROMs, the exponent has been omitted from the partial codes and then reintroduced in the final result to normalize the data to the standard FP32 format. It is worth noting that the FP coding is necessary for applications requiring a very high dynamic range [16], as in the case of inverse tone mapping [3], where ranges higher than 70 are highly recommended. An example of the employed coding is shown in Table II, applied to the smallest coefficient with Uint-8 inputs and $\sigma = 4$.

In Table III, the memory required by the proposed solution, detailed for $C_i$ and $\lambda_n$, is compared with the corresponding quantity required by a radix-2 DA. In both cases, the mapped resources refer to full-parallel architectures with codelength $l = 37$ and $(3\sigma + 1) = 13$. A graphical representation of the comparison is shown in Fig. 3, in which the required number of additions is also shown as a function of the input length $m$. For $m > 4$, the proposed solution is always advantageous in terms of required additions, whereas the required memory becomes significantly higher for $m > 9$, e.g., 15% higher for $m = 10$. In these cases, indeed, the lower number of parts of the proposed method is offset by the higher number of bits for $C_i$. The advantages of the proposed solution become relevant when a large number of multipliers is required. In the implementation reported in the next section, with $m = 8$ and $\sigma = 4$, 25 MACs are required for a full-parallel circuit; thus, using the proposed radix-3 method, it is possible to save 50 adders with respect to the radix-2 DA. Considering also that the memory is sharable between all the MACs, for implementing filters with typical kernel dimensions, the proposed solution proves to be advantageous when compared to a conventional radix-2 DA.

IV. SYNTHESIS AND RESULTS

In order to give a straightforward estimation of the advantage derived from the adoption of the proposed method, a stand-alone “equivalent multiplier” has been synthesized, composed of all the memories and the adders schematized in the dashed box of Fig. 2. It has been targeted to a Xilinx Virtex 7 XC7V2000tflg1925-1 as part of the proFPGA duo application-specific integrated circuit (ASIC) prototyping board [22] and to TSMC CMOS 90-nm std_cells. Synthesis results are reported in Table IV and compared with a conventional FP32 multiplier and a 32-b MB, targeted to the same FPGA. A fair comparison has been achieved by using the IPs provided by Xilinx for the adders and the multiplier, all configured with a three-stage pipeline. Carry-save adders have been used for the std_cell implementation of the proposed structure, while no conventional multiplier has been compared because an optimized multiplier was not available in the same std_cell technology. The work in [23] for FPGA and that in [16] and [17] for std_cells have been used as terms of comparison. Although they propose slightly different architectures, to the best of our knowledge these are the most recent designs in the literature that are comparable with the proposed one in purpose. It is worth noting that the actual memory mapped in our implementation has been increased from the minimum required of 763 B to 904 B since all the premultiplied coefficients have been stored together with their 2’s complement negative counterparts. Therefore, both the positive and negative premultiplied coefficients are available at the same time for the additions, and sign conversions are avoided when $C_i = -1$.

TABLE IV
SYNTHESIS OF THE PROPOSED MULTIPLIER IN COMPARISON WITH RECENT FPGA- AND STD_CELL-ORIENTED DESIGNS

As expected, Table IV shows that the FPGA is the most advantageous platform to implement the proposed multiplier, owing to the availability of hard macros to implement ROMs. Indeed, the FPGA implementation exhibits a speedup of 335% with respect to a conventional multiplier, with the worst path delay reducing from 8.235 to 2.456 ns in the slow/slow corner. A great advantage is also observed with respect to the MB multiplier, which has a path delay of 7.882 ns. It is worth noting that the ROM access does not introduce a critical delay since it exhibits a latency of 2.1 ns, which is significantly lower than the previous value. The mapped physical resources are approximately 30% lower than those in [23], while the delay is about one-half, although this value is not representative because of the technology differences between the two target platforms. The normalized dissipated power is 81.7% of that of the conventional multiplier.

The std_cell implementation exhibits good results in comparison with the MB-based multiplier in [17] and the Braun fused MAC in [16], both implementing a two-stage pipeline. It is worth observing that, although the multiplier in [16] is not a pure multiplier, it only differs by a 24-b carry look-ahead adder. Although the solutions in [16] and [17] are implemented with shrunk 45- and 65-nm technology, the delay times are only 430 and 110 ps higher than that proposed, respectively, whereas the occupied area is about 244% and 304% of the proposed one, respectively. Considering that, for the absence of devoted ROMs of adequate dimensions, all the memories have been implemented by lookup tables (LUTs), which is a
good result that contributes to an area–power–delay product of $3.75 \times 10^4\ \mu\text{m}^2 \cdot \text{ns} \cdot \text{mW}$, which is always better than the cited solutions.

In Table V, a complete circuit for 1-D Gaussian convolution, composed of 25 multipliers and an output adder tree, connected as in Fig. 2, is compared with a conventional FP32 multiplier-based solution targeted to the same FPGA and std_cells. In this case, a speedup of 570% is obtained on the FPGA, with the path delay reducing from 16.986 to 2.981 ns, and the total amount of LUTs in the design is 44.21% lower than that obtained using conventional multipliers. Also, the overall power dissipation almost halves. In std_cells, it is possible to observe a reduction of about 19.52% in area and a speedup of about 11.93%. In this case, the power dissipation is 39.64% higher than in the conventional case, mainly due to the power dissipated by the memories. In obtaining the data in Table V, it has been considered that all the ROMs must be read by all the multipliers on the same clock edge. Although this can be easily implemented in FPGA, ASICs require a custom implementation of very small ROMs, similarly to [6] for ROM-based logic multipliers. However, the amount of memory in Tables IV and V does not represent an actual problem in real multimedia applications, where the memory requirement is on the order of megabits because of frame buffering [24] or partial data storage [3], [4], which makes, de facto, the additional area required by the multiplier’s ROMs negligible. Furthermore, considering the large amount of partial additions, the proposed architecture of the multiplier could obtain an area reduction by the application of MCM techniques [25], which have been demonstrated to be effective in reducing the area when a large number of redundant terms must be calculated. Nevertheless, due to the presence of fast LUTs in contrast with heavy carry logic, such as in conventional FP32 multipliers, the proposed system should better fit current and forthcoming CMOS technologies with smaller gate sizes.

TABLE V
SYNTHESIS RESULTS OF THE 1-D GAUSSIAN CONVOLUTION CIRCUIT

V. CONCLUSION

In this brief, an efficient term-partitioning method has been shown, which allows implementing the circuitry for convolution operators, typically employed in filters, without multipliers, encoders, and auxiliary circuitry. These are completely replaced by simplified adders and ROMs for storing premultiplied coefficients. The proposed solution obtains state-of-the-art performance. The solution is well suited for the application of multiconstant multiplication techniques, in order to further simplify the circuit topology.

REFERENCES

[1] S. L. Chen, “VLSI implementation of an adaptive edge-enhanced image scalar for real-time multimedia applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 9, pp. 1510–1522, Sep. 2013.
[2] F. C. Huang, S. Y. Huang, J. W. Ker, and Y. C. Chen, “High performance SIFT hardware accelerator for real-time image feature extraction,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 3, pp. 340–351, Mar. 2012.
[3] G. D. Licciardo, A. D’Arienzo, and A. Rubino, “Stream processor for real-time inverse tone mapping of full-HD images,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 11, pp. 2531–2539, Nov. 2015.
[4] M. Vigliar and G. D. Licciardo, “Hardware coprocessor for stripe-based interest point detection,” US Patent 20 130 301 930, Nov. 14, 2013.
[5] K. K. Parhi, VLSI Signal Processing Systems: Design and Implementation. New York, NY, USA: Wiley, 2007.
[6] B. C. Paul, S. Fujita, and M. Okajima, “ROM-based logic (RBL) design: A low-power 16 bit multiplier,” IEEE J. Solid-State Circuits, vol. 44, no. 11, pp. 2935–2942, Nov. 2009.
[7] S. Y. Park and P. K. Meher, “Low-power, high-throughput, and low-area adaptive FIR filter based on distributed arithmetic,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 60, no. 6, pp. 346–350, Jun. 2013.
[8] R. M. Hewlitt and E. S. Swartzlantler, “Canonical signed digit representation for FIR digital filters,” in Proc. IEEE Workshop Signal Process. Syst., Lafayette, LA, USA, Oct. 2000, pp. 416–426.
[9] K. Tsoumanis, N. Axelos, N. Moshopoulos, G. Zervakis, and K. Pekmestzi, “Pre-encoded multipliers based on non-redundant radix-4 signed-digit encoding,” IEEE Trans. Comput., vol. 65, no. 2, pp. 670–676, Feb. 2016.
[10] A. Peled and B. Liu, “A new hardware realization of digital filters,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-22, no. 6, pp. 456–462, Dec. 1974.
[11] Y. Voronenko and M. Püschel, “Multiplierless multiple constant multiplication,” ACM Trans. Algorithms, vol. 3, no. 2, pp. 1–39, May 2007.
[12] L. Aksoy, P. Flores, and J. Monteiro, “Efficient design of FIR filters using hybrid multiple constant multiplication on FPGA,” in Proc. IEEE 32nd ICCD, Oct. 2014, pp. 42–47.
[13] A. Berkeman, V. Owall, and M. Torkelson, “A low logic depth complex multiplier using distributed arithmetic,” IEEE J. Solid-State Circuits, vol. 35, no. 4, pp. 656–659, Apr. 2000.
[14] E. O’Shea, “Bachet’s problem: As few weights to weigh them all,” ArXiv e-prints, Oct. 2010.
[15] IEEE Standard for Binary Floating-Point Arithmetic, Amer. Nat. Std. Inst. (ANSI), Washington, DC, USA, IEEE 754-1985, 1985.
[16] M. A. Basiri and N. M. Sk, “An efficient hardware-based higher radix floating point MAC design,” ACM Trans. Des. Autom. Electron. Syst., vol. 20, no. 1, pp. 1–25, Nov. 2014.
[17] M. Själander and P. Larsson-Edefors, “Multiplication acceleration through twin precision,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 9, pp. 1233–1246, Sep. 2009.
[18] M. Vigliar and G. D. Licciardo, “Multiplierless coprocessor for Difference of Gaussian (DOG) calculation,” US Patent 20 130 301 950, Nov. 14, 2013.
[19] G. H. Hardy and E. M. Wright, An Introduction to the Theory of Numbers (Sixth Edition). New York, NY, USA: Oxford Univ. Press, 2008.
[20] S. K. Park, “The r-complete partitions,” Discrete Math., vol. 183, no. 1–3, pp. 293–297, Mar. 1998.
[21] Ø. J. Rødseth, “Enumeration of M-partitions,” Discrete Math., vol. 306, no. 7, pp. 694–698, Apr. 2006.
[22] “Virtex-7 Family, DS183 (v1.23),” Xilinx, San Jose, CA, USA, Jun. 23, 2015.
[23] S. Arish and R. K. Sharma, “An efficient floating point multiplier design for high speed applications using Karatsuba algorithm and Urdvha-Tyriagbhyam algorithm,” in Proc. ICSC, Noidia, India, Mar. 2015, pp. 303–308.
[24] W. M. Chao and L. G. Chen, “Pyramid architecture for 3840 × 2160 quad full high definition 30 frames/s video acquisition,” IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 11, pp. 1499–1508, Nov. 2010.
[25] M. Potkonjak and M. B. Srivastava, “Multiple constant multiplications: Efficient and versatile framework and algorithms for exploring common subexpression elimination,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 15, no. 2, pp. 151–165, Feb. 1996.

Implementation of Double Precision Floating Point Radix-2 FFT Using VHDL
No ratings yet
Implementation of Double Precision Floating Point Radix-2 FFT Using VHDL
7 pages
Scaling Free CORDIC Algorithm Implementation of Sine and Cosine Function
No ratings yet
Scaling Free CORDIC Algorithm Implementation of Sine and Cosine Function
4 pages
Simple Computation of DIT FFT: International Journal of Advanced Research in Computer Science and Software Engineering
No ratings yet
Simple Computation of DIT FFT: International Journal of Advanced Research in Computer Science and Software Engineering
4 pages
Introduction To Computer Science - HW1 - Solution
No ratings yet
Introduction To Computer Science - HW1 - Solution
2 pages
Trade-Offs in Multiplier Block Algorithms For Low Power Digit-Serial FIR Filters
No ratings yet
Trade-Offs in Multiplier Block Algorithms For Low Power Digit-Serial FIR Filters
6 pages
Implementation of 16 Point Radix 2 FFT: ECE 645 Ashwin Chiluka Vamsi Krishna Teladevalapalli
No ratings yet
Implementation of 16 Point Radix 2 FFT: ECE 645 Ashwin Chiluka Vamsi Krishna Teladevalapalli
6 pages
NTC 4092 General Guide For The Counting of Microorganisms
No ratings yet
NTC 4092 General Guide For The Counting of Microorganisms
8 pages
Bu 33436438
No ratings yet
Bu 33436438
3 pages
Algorithm and Design
No ratings yet
Algorithm and Design
6 pages
Low Power Mac For Digital Fir
No ratings yet
Low Power Mac For Digital Fir
4 pages
Frequency Analyzer
No ratings yet
Frequency Analyzer
4 pages
Design and Implementation of Pipelined FFT Processor: D.Venkata Kishore, C.Ram Kumar
No ratings yet
Design and Implementation of Pipelined FFT Processor: D.Venkata Kishore, C.Ram Kumar
4 pages
Processors.: Mops Integer Dmder Ic
No ratings yet
Processors.: Mops Integer Dmder Ic
3 pages
Algebra Multiplying Monomials2
No ratings yet
Algebra Multiplying Monomials2
1 page
