Lecture - 06
Hardware Design for Finite Field Inversion
So, welcome to this class on Hardware Security. Today we shall be talking about a different topic within finite field architectures. In the last class we discussed multipliers; in today's class we will discuss another important class of arithmetic circuits in finite fields, namely the finite field inverse. So, how do we compute finite field inversions?
To start with, how do we compute a multiplicative inverse, or rather, what is the definition of a multiplicative inverse? It means that given an element a, we want to find its inverse, denoted a^(-1), in the field, such that if I multiply a with a^(-1) I get the unit element of the field, that is, I get 1.
There are different techniques for computing finite field inversions, and the most important ones are listed here. The first is the extended Euclidean algorithm, which is a generalization of the Euclidean algorithm used to compute the greatest common divisor of two numbers. There is another result, very famously known as Fermat's little theorem, which can also be used to compute multiplicative inversions.
The point here is that Fermat's little theorem is more efficient in hardware than the Euclidean technique, because the Euclidean technique, as you probably know, results in a computation that requires more intensive operations and also a larger number of clock cycles.
So essentially we resort to Fermat's little theorem; let us state how it works. When we want to compute a^(-1) modulo n, where n is the size of the field, we compute a^(-1) as a^(n-2) modulo n. Note that a has to be co-prime with n, that is, the greatest common divisor of a and n has to be 1; in other words, a belongs to the multiplicative group modulo n, which is denoted Z_n*.
For simplicity, let us choose n as a prime number, in which case all the non-zero elements, from 1 to n - 1, belong to its multiplicative group. Let us take a small example: suppose I take n = 5, which is a prime number. Then 3^(5-2) modulo 5 should be the inverse of 3, and you can check that this turns out to be 2. Indeed, if I multiply 3 with 2, I get 6, which modulo 5 is 1. So 2 is indeed the multiplicative inverse of 3 modulo 5.
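As a quick sanity check, the same computation can be reproduced with Python's built-in modular exponentiation; this is only an illustrative snippet, not part of the hardware flow.

```python
# Verify the example: the inverse of 3 modulo 5 via Fermat's little theorem.
n, a = 5, 3
a_inv = pow(a, n - 2, n)    # a^(n-2) mod n
print(a_inv)                # 2
print((a * a_inv) % n)      # 1, so 2 is indeed the inverse of 3 modulo 5
```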
To generalize this, the generalization is based upon a particular function called Euler's totient function, often denoted phi(m). Suppose that a and m are integers such that the greatest common divisor of a and m is 1, that is, a and m are mutually co-prime or relatively prime. Then the number of positive integers that are relatively prime to m and do not exceed m is denoted phi(m). This is called Euler's totient function, or simply the phi function.
(Refer Slide Time: 04:29)
Just to define it, phi(1) is taken as 1; this initializes the phi function. Subsequently, if for example I take m = 26 and calculate phi(26), it turns out to be 12, which means there are 12 numbers that are co-prime with 26. If p is prime, then it is natural to see that phi(p) will be p - 1, just as we saw in the example.
Here I have tried to tabulate the values of phi(n) for small n. You will see that phi(n) is a pretty irregular function; it is neither monotonically increasing nor monotonically decreasing, it is rather erratic. But luckily, if you know the prime factorization of n, then you can compute phi(n) quite efficiently.
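If you want to reproduce these small totient values yourself, a brute-force count straight from the definition is enough; this is illustrative only, since efficient computation would use the prime factorization mentioned above.

```python
# Brute-force Euler totient: count the integers in 1..m that are co-prime to m.
from math import gcd

def phi(m):
    return sum(1 for k in range(1, m + 1) if gcd(k, m) == 1)

print(phi(1), phi(26), phi(5))   # 1 12 4, and phi(p) = p - 1 for prime p
```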
(Refer Slide Time: 05:11)
However, I am not going into that formula here; rather, we will take a look at Fermat's little theorem, which is stated here using the phi function. If the greatest common divisor of a and m is 1, then a^(phi(m)) is congruent to 1 modulo m; this result is called Euler's theorem. Fermat's little theorem is actually a special case of Euler's theorem, where m is a prime number p. If m is prime, then phi(p) is p - 1, so what we get is that a^(p-1) is congruent to 1 modulo p.
Now consider the field GF(2^m), which has 2^m elements. This is the extension field that we saw in the last class, where you extend from GF(2) to GF(2^m) using an irreducible polynomial of degree m. Then we know that a^(2^m - 1) should be congruent to 1, which means that if I want to compute a^(-1), I need to compute a^(2^m - 2), because that gives me the multiplicative inverse. So how to compute this a^(2^m - 2), that is, the multiplicative inverse of a, is what we need to discuss.
(Refer Slide Time: 06:33)
Here there is a technique which is based on what is called an addition chain. If I want to naively compute a^(2^m - 2), you can see that I will require m - 1 squarings and m - 2 field multiplications.
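To make these counts concrete, here is a minimal software sketch over a toy field GF(2^8); the irreducible polynomial x^8 + x^4 + x^3 + x + 1 is my own choice purely for illustration, since the lecture's target field is much larger. The gf_mul and gf_sqr helpers defined here are reused by the later snippets.

```python
M = 8
IRRED = 0x11B                       # x^8 + x^4 + x^3 + x + 1 (assumed toy polynomial)

def gf_mul(a, b):
    """Shift-and-add (carry-less) multiplication with reduction modulo IRRED."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:                  # degree reached M, so reduce
            a ^= IRRED
    return r

def gf_sqr(a):
    return gf_mul(a, a)

def naive_inverse(a):
    """a^(2^M - 2) by square-and-multiply: M - 1 squarings and M - 2 multiplications."""
    r = a                           # a^(2^1 - 1)
    for _ in range(M - 2):
        r = gf_mul(gf_sqr(r), a)    # a^(2^i - 1) -> a^(2^(i+1) - 1)
    return gf_sqr(r)                # final squaring for the trailing zero bit of 2^M - 2

a = 0x57
assert gf_mul(a, naive_inverse(a)) == 1
```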
But there is a restriction: any element in the addition chain should be obtained as the sum of two previous elements. That means, if an element is u_i, then u_i should be expressible as u_j + u_k, where u_j and u_k both belong to the earlier part of the addition chain.
For example, consider this addition chain for 162: I start with 1 and my endpoint is 162. You can observe that in this chain 2 is obtained by adding 1 with itself. Likewise, 4 is obtained by adding 2 with itself, 5 by adding 4 and 1, 10 from 5 and 5, 20 from 10 and 10, 40 from 20 and 20, 80 from 40 and 40, 81 from 80 and 1, and finally 162 from 81 and 81.
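As a quick check that this really is an addition chain, the defining property can be verified directly with a small illustrative script.

```python
# Every element after the first must be the sum of two (not necessarily distinct)
# earlier elements of the chain.
chain = [1, 2, 4, 5, 10, 20, 40, 80, 81, 162]
for i, u in enumerate(chain[1:], start=1):
    assert any(u - v in chain[:i] for v in chain[:i]), u
print("valid addition chain of length", len(chain) - 1)
```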
This kind of addition chain, where one of the two summands is always the immediately previous element, is often called a Brauer chain. And what we want, if we are to compute a^(2^m - 2) this way, is that the addition chain should be as short as possible. Although finding the optimal chain is a hard problem, there are propositions in the literature which give us efficient chains that we can use to compute the multiplicative inverse.
Taking this addition chain into account, if I now want to compute a^(2^m - 2), which is my inverse, what I can do is define a quantity which I denote beta_k. Here beta_k is defined as a^(2^k - 1).
If I want to calculate a^(-1), then as I said, from Fermat's little theorem I need to calculate a^(2^m - 2), which is nothing but a^(2^(m-1) - 1) whole squared. And what is inside the parentheses is nothing but beta_(m-1) of a.
Therefore I need an efficient way to calculate beta_(m-1), and then finally I square it to get the inverse. Interestingly, this beta_(m-1) can be computed using a nice recursive formulation: I can write beta_(k+j), which is some position in the addition chain, in terms of beta_k and beta_j. So I derive a higher beta value, beta_(k+j), from the smaller beta values.
Why does it work? It is easy to see: beta_(k+j) is nothing but a^(2^(k+j) - 1). In the exponent, for convenience, I subtract 2^k and then add 2^k back, so nothing changes. Then from the first two terms I take 2^k common, so the exponent becomes (2^j - 1) times 2^k, plus (2^k - 1). You can easily note that a^(2^j - 1) is nothing but beta_j, here raised to the power 2^k, whereas the other factor, a^(2^k - 1), is beta_k itself.
So I have been able to express beta_(k+j) in terms of the smaller values beta_j and beta_k. Interestingly, you can also decompose it the other way: beta_(k+j) equals beta_k whole to the power 2^j, multiplied by beta_j. Which form I choose depends on the fact that I would like to reduce the number of squarings required: in the first form I raise beta_j to the power 2^k, which takes k squarings. So depending upon whether j or k is smaller, I write the form accordingly. If k is small, I use the first form; if j is small, I write it as beta_k whole to the power 2^j multiplied by beta_j, which is also correct.
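Continuing the toy GF(2^8) sketch from before, and reusing the same assumed gf_mul and gf_sqr helpers, the recursion can be checked numerically.

```python
def beta(k, a):
    """Reference beta_k = a^(2^k - 1), built one squaring-and-multiply step at a time."""
    r = a                                # beta_1 = a
    for _ in range(k - 1):
        r = gf_mul(gf_sqr(r), a)         # a^(2^i - 1) -> a^(2^(i+1) - 1)
    return r

# Check beta_{k+j} = (beta_j)^(2^k) * beta_k for a sample element and exponents.
a, k, j = 0x57, 3, 2
t = beta(j, a)
for _ in range(k):                       # raise beta_j to the power 2^k: k squarings
    t = gf_sqr(t)
assert beta(k + j, a) == gf_mul(t, beta(k, a))
```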
For example, suppose I want to calculate the inverse for m = 233; last day we saw a Karatsuba multiplier for 233, so we take that same multiplier and apply it here. What we want to do, therefore, is compute a^(-1), which is equal to a^(2^m - 2) with m = 233. What do I need? I need an addition chain for 232, which is m - 1, and here is such an addition chain; you can observe that it satisfies all the properties of being an addition chain. Therefore I can apply my recursive formulation and from there calculate the value of beta_232. If I know beta_232, then I just do one squaring to get the corresponding multiplicative inverse of a. So let us see step by step how we can calculate these beta values to arrive at beta_232 and then finally get the inverse.
(Refer Slide Time: 13:27)
Now you can calculate beta_2 in terms of beta_1, because you can express 2 as 1 + 1; applying the recursive formulation, beta_(1+1) is nothing but beta_1 to the power 2^1, multiplied by beta_1, and so you can calculate the corresponding value. Likewise, take any beta; for example, if I want to calculate beta_29 then, 29 being in the addition chain, I can express it as 28 + 1.
Therefore I write beta_29 as beta_28 whole to the power 2^1, multiplied by beta_1. Note that I could also have done it the other way round, as beta_1 raised to the power 2^28 multiplied by beta_28, but that would have incurred a larger number of squarings; so this is the natural choice to reduce the number of squarings. All in all, we need 232 squarings, which you can count down, and 10 multiplications. You can count them, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10, so you need 10 multiplications; and if you add up the number of squarings done in the steps, one squaring here, one squaring here, 3 squarings here and so on, you get 232 squarings in total.
Now, remember again that squaring is easy in GF(2^m) arithmetic. So this is a nice trade-off in that sense: you are basically minimizing the number of multiplications at the expense of squarings, which are quite convenient to implement in GF(2) arithmetic.
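Putting the pieces together for the toy GF(2^8) field, again reusing the assumed gf_mul and gf_sqr helpers, here is the whole Itoh-Tsujii flow with a Brauer chain for m - 1 = 7; the chain for 232 is used in exactly the same way.

```python
# Brauer chain for 7: 1, 2, 3, 6, 7. Each step (j, k) computes
# beta_{j+k} = (beta_j)^(2^k) * beta_k, i.e. k cheap squarings plus one multiplication.
CHAIN_STEPS = [(1, 1), (2, 1), (3, 3), (6, 1)]

def itoh_tsujii_inverse(a):
    beta = {1: a}                        # beta_1 = a^(2^1 - 1) = a
    for j, k in CHAIN_STEPS:
        t = beta[j]
        for _ in range(k):
            t = gf_sqr(t)
        beta[j + k] = gf_mul(t, beta[k])
    return gf_sqr(beta[M - 1])           # a^(-1) = (beta_{m-1})^2

a = 0x57
assert gf_mul(a, itoh_tsujii_inverse(a)) == 1
# Cost here: 4 multiplications and 6 + 1 = 7 = m - 1 squarings, mirroring the
# 10 multiplications and 232 squarings counted above for m = 233.
```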
So, how will I realize this with a hardware circuit? Here is a proposal. Of course, in a naive implementation I would just have a squarer and repeatedly apply the squaring function. But that may not be efficient, because you can observe that in the addition chain, as you go higher, there are steps where you do a significantly large number of squarings, like here where you do 116 squarings. Although the squaring circuit itself is efficient, if you have to apply so many squaring operations one after the other, at the end of the day your performance gets affected. So what you can do is trade off between hardware and clock cycles: rather than implementing one squarer, you can have a cascade of squarers, say u_s of them.
Here u_s is a value which you choose by some kind of analysis that I will elaborate later on; but imagine that you have u_s squarers cascaded, followed by a multiplexer circuit. Now suppose I want to compute a certain number of squarings. If that number is less than u_s, I can just multiplex out the corresponding intermediate output. For example, if 2 squarings are required, I pass the input in and give an appropriate control to the multiplexer so that the signal after 2 squaring operations gets tapped out. On the other hand, if the number of squarings is more than u_s, then you have to feed the result back,
and spend, say, the ceiling of u_i divided by u_s clock cycles. You basically keep applying the cascade one pass after the other to get the corresponding result. So, depending upon how many squarings you have, you have to configure your hardware accordingly: either you give the output after one clock cycle, or you spend more clock cycles to get the result.
You need one extra clock cycle for the final squaring operation, because you have to do a final squaring at the end. And this summation gives you the number of squarings for all the iterations which were there in the previous table. You can observe that the number of squarings done in one iteration is u_i minus u_(i-1), where u_(i-1) is one element of the addition chain and u_i is the next one; so you are doing u_i minus u_(i-1) squarings.
If you go back to the table, you can see that the total number of squarings is nothing but the summation of the differences of each u_i value from the previous one. For instance, between 28 and 14 you are doing 14 squaring operations; between 3 and 2 you are doing 1 squaring operation; between 232 and 116 you are doing 116 squaring operations; in each case it is the difference.
So the number of squarings you need to do in one iteration is u_i minus u_(i-1). Since this may often be more than u_s, as I said, you divide it by u_s and take the ceiling, and that gives you the number of clock cycles per iteration. You have l iterations, so varying i from 2 to l and summing gives you a rough total number of clock cycles to expect.
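As a rough model of that count, here is my own sketch; it assumes one clock cycle per multiplication, one per pass through the cascade of u_s squarers, and one for the final squaring, and it uses a Brauer chain for 232 consistent with the values mentioned in the lecture.

```python
from math import ceil

def ita_clock_cycles(chain, u_s):
    """Approximate cycle count: one multiplication per chain step, plus
    ceil((u_i - u_{i-1}) / u_s) cascade passes per step, plus the final squaring."""
    passes = sum(ceil((hi - lo) / u_s) for lo, hi in zip(chain, chain[1:]))
    return (len(chain) - 1) + passes + 1

chain_232 = [1, 2, 3, 6, 7, 14, 28, 29, 58, 116, 232]
for u_s in (1, 2, 4, 8):
    print(u_s, ita_clock_cycles(chain_232, u_s))
```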
(Refer Slide Time: 19:47)
Once you have this, it is fine, but at the same time you would like to optimize it. Here is a possible optimization that you can try. As you can see, in the normal Itoh-Tsujii inversion algorithm there are a lot of squaring operations. So you can do an interesting comparison: compare a squarer circuit with a quad circuit. What is a quad circuit? A quad circuit is something which computes a^4, whereas a squarer computes a^2.
Here is an example which we work out for GF(2^9). You can take an irreducible polynomial, x^9 plus so on, create the field GF(2^9), and compute the complexities, that is, the number of operations needed for the squaring and for the quad in this field.
Here are the corresponding output bits, 0 to 8, because there are 9 bits in GF(2^9). What we see here is the circuit complexity for computing the square, and this one is for the quad operation. You can observe, quite intuitively, that the number of operations in the quad is more than in the square, because in the square you are just raising to the power of 2, whereas in the quad you are raising to the power of 4.
But interestingly, look now in terms of LUTs. If you remember, in one of the earlier classes we discussed that a lookup table with 4 inputs can fit any Boolean function of 4 input variables; and even if you have a Boolean function with only 1 or 2 input variables, a whole LUT still gets consumed.
So if I calculate the number of lookup tables: where no function is computed, there is no lookup table requirement; but if you have b1 + b5, you need 1 lookup table. Likewise, for b6 alone there is no lookup table requirement, while for b2 + b6 you need 1 lookup table. Counting like this, 4 lookup tables are used for the squarer.
On the other hand, for the quad circuit, even for an output like b1 + b3 + b5 + b7, where you are doing more computation, you are still using only one lookup table. At the end you are using 6 lookup tables. So if we compare a quad circuit with a cascade of 2 squarers, where you square and then square again, you can see that there is an improvement.
Because here you require 6 lookup tables, whereas there you would have required 4 times 2, that is, 8 lookup tables. And in terms of delay, if you compare the two circuits, both of them have a delay of one lookup table level. So there seems to be an opportunity for a trade-off, where you can optimize and get a better-performing inversion circuit.
(Refer Slide Time: 22:47)
So what we did is look at this for other, larger fields. For example, for GF(2^233) you will see that a squarer circuit consumes some 153 lookup tables, whereas a quad consumes 233 lookup tables. But if you compare the quad again with 2 times 153, you see that there is about a 25 percent advantage; delay-wise, again, both are the same. So naturally this tells us that going from a squarer circuit to a quad circuit may give an advantage, because I can utilize the lookup tables better.
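The percentage quoted on the slide follows from simple arithmetic; this tiny script just re-derives the slide's numbers.

```python
# One quad (233 LUTs) versus a cascade of two squarers (2 * 153 LUTs) for GF(2^233).
quad_luts, sqr_luts = 233, 153
saving = 100 * (2 * sqr_luts - quad_luts) / (2 * sqr_luts)
print(round(saving))     # roughly the 25 percent advantage mentioned above
```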
But the question is that you also need to generalize the Itoh-Tsujii algorithm to work with quad circuits. The generalization works with powers of 4, that is, with a^(4^k); essentially we work with powers of 4 and then a final power of 2. We define alpha_k as a^(2^(n k) - 1), where n is the generalization parameter, equal to 2 for the quad. So if you take alpha_(k2), which is a^(2^(n k2) - 1), then you can calculate alpha_(k1 + k2) in terms of alpha_(k1) and alpha_(k2). The proof is very straightforward, similar to what we have already seen in the context of beta, so I am not going into that.
What about the inverse? Once you have computed this series, you can calculate a^(-1) in a very straightforward manner, depending upon whether n divides m - 1. What is n? n is the parameter through which you generalize to the quad, or to higher powers of 2; for the quad circuit n is 2.
Now, n may or may not divide m - 1. If n divides m - 1, then alpha_((m-1)/n) is defined, because (m - 1)/n is an integer; so you calculate alpha_((m-1)/n), and if you just square it you get a^(-1). If it is not defined, then you can write m - 1 as n q + r, where q and r are non-negative integers, and then a^(-1) is alpha_q whole to the power 2^r, multiplied by beta_r, with the entire thing squared.
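To see that the second case really produces the inverse, here is a small numerical check on the toy GF(2^8) field, with the same assumed helpers; for m = 8 and n = 2 we have m - 1 = 7 = 2*3 + 1, so q = 3 and r = 1.

```python
def gf_quad(a):
    return gf_sqr(gf_sqr(a))             # the quad block: a^4

def alpha(k, a):
    """Reference alpha_k = a^(4^k - 1): alpha_1 = a^3, alpha_{i+1} = (alpha_i)^4 * a^3."""
    a3 = gf_mul(gf_sqr(a), a)            # the a^3 pre-computation
    acc = a3
    for _ in range(k - 1):
        acc = gf_mul(gf_quad(acc), a3)
    return acc

a, q, r = 0x57, 3, 1                     # m - 1 = 7 = n*q + r with n = 2
t = alpha(q, a)
for _ in range(r):                       # (alpha_q)^(2^r)
    t = gf_sqr(t)
a_inv = gf_sqr(gf_mul(t, a))             # beta_r = beta_1 = a, then the final squaring
assert gf_mul(a, a_inv) == 1
```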
(Refer Slide Time: 25:27)
You can see this very clearly here. Suppose you have alpha_((m-1)/n). This is nothing but a raised to the power 2^(n times (m-1)/n) minus 1; the n and the n cancel, so you have a^(2^(m-1) - 1), whose square is the inverse by Fermat's little theorem. When n does not divide m - 1, you cannot write it in this form, because (m - 1)/n is not an integer.
In that case, look at the right hand side of the previous equation, which is alpha_q whole to the power 2^r, multiplied by beta_r. I can elaborate alpha_q as a^(2^(n q) - 1), raise it to the power 2^r, and multiply it with a^(2^r - 1); note that this last factor is a beta, not an alpha. Then we square the whole thing. What this gives is nothing but a^(2^(n q + r) - 1), squared.
Here the 2^r cancels with the minus 2^r, so in the exponent you get 2^(n q + r), and n q + r is m - 1. Therefore I can write this as a^(2^(m-1) - 1), and when I square it I again get a^(-1). So in both cases a^(-1) is correctly computed.
(Refer Slide Time: 26:55)
So what we try to do now is apply the quad Itoh-Tsujii and see what amount of advantage or improvement we get over the normal Itoh-Tsujii inversion algorithm. If you implement this, we now start with alpha_1, which is initialized to a cube, that is, a^3, and then I follow the addition chain; but now I do not need to go up to 232, rather I can stop at 116. So what I basically do is calculate alpha_2 as alpha_(1+1), alpha_3 as alpha_(2+1), and so on, exactly like what we did previously, but now the exponentiations are powers of 4.
So the computation is being done at a faster rate; but again, as you observed, if you compare one quad with 2 squarings, the delay is the same. In the initial pre-computation there is a cost that you have to pay, because in the previous case you were just initializing with a, but now you are initializing with a cube. And remember that you no longer have a squaring circuit; so if you want to compute the power of 3, you have to do a times a times a on the multiplier, which takes 2 multiplications and hence 2 clock cycles.
You also have to do the final squaring operation with a multiplication, because you do not have a squarer. So there are 3 extra multiplication operations in all. In total you will see that there are 12 multiplications, because there are 9 multiplications in the chain, and if you also accommodate these extra 3 multiplications you need to do 12 multiplications. In terms of quad operations, you have to do 115 quads instead of the 232 squarings which you did in the corresponding case when we were operating with beta. And you also save 7 clock cycles in the process.
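A quick tally of the chain up to 116 reproduces these counts; this is a small bookkeeping script, and the chain is one valid Brauer chain for 116 consistent with the values mentioned in the lecture.

```python
# One multiplication per chain step, plus 3 extra (two for a^3 and one for the
# final squaring done on the multiplier); u_i - u_{i-1} quad operations per step.
chain_116 = [1, 2, 3, 6, 7, 14, 28, 29, 58, 116]
mults = (len(chain_116) - 1) + 3
quads = sum(hi - lo for lo, hi in zip(chain_116, chain_116[1:]))
print(mults, quads)     # 12 multiplications and 115 quad operations
```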
So, if you write out the whole thing, you will see that the number of clock cycles is now l + 1 plus a summation where i goes from 2 to l - 1, whereas in the previous case the summation ran from 2 to l. The saving in clock cycles is therefore quite interesting: it is the ceiling of (u_l minus u_(l-1)) divided by u_s, minus 1, where the minus 1 accounts for the one extra clock cycle spent here. This saving can be as high as (m - 1)/2 for GF(2^m), so if you have larger dimensions, the advantage could be even more.
(Refer Slide Time: 29:25)
So, with this I will briefly stop here. In the next class I will start discussing the corresponding hardware architecture, when we try to realize this in the form of an actual design.
Thank you.