Enhancing Performance and Scalability: A Novel Hardware Architecture for 1024-Bit Miller-Rabin Primality Testing
Abstract— This paper presents a novel hardware architecture for a 1024-bit Miller-Rabin primality tester IP core, designed to enhance performance and scalability. The proposed architecture uses a multi-bit processing technique to optimize the algorithm's internal operations, modular exponentiation and modular multiplication, thereby improving the overall efficiency of the primality tester. The paper evaluates the Miller-Rabin design on the Virtex-6 FPGA (XC6VLX550T-2ff1759) device in terms of logic resource utilization, maximum operating frequency, latency, and Area x Time (AT) metrics. Our experimental results demonstrate significant performance improvements over existing approaches, making the architecture well suited for high-performance cryptographic applications.

Keywords—Cryptography, Primality testing, Miller-Rabin algorithm, Montgomery multiplier, Montgomery exponentiation, FPGA implementations

I. INTRODUCTION

The two main categories of cryptographic systems currently in use are symmetric key and asymmetric key systems. As the name implies, in symmetric key cryptography the sender and the recipient share the same key. Symmetric key cryptosystems are typically implemented as stream or block ciphers. The Data Encryption Standard (DES) and the Advanced Encryption Standard (AES) are basic examples. Key management, however, is a significant problem for these systems. In 1976, asymmetric or public key cryptography was introduced as a ground-breaking solution to this problem [1].

Asymmetric key cryptosystems use two distinct yet mathematically linked keys, the public key and the private key. Computing the private key from the public key is computationally infeasible. The sender encrypts the data with the recipient's public key, and the recipient decrypts it with the corresponding private key. RSA (Rivest-Shamir-Adleman) [2], ElGamal [3], the Digital Signature Standard (DSS) [4], and Elliptic Curve Cryptosystems (ECC) [5, 6] are examples of public key cryptosystems.

Large prime numbers underpin public key cryptography and are used extensively in the RSA, ElGamal, ECC and DSS public key cryptosystems. Ever larger primes are needed to protect these cryptosystems from sophisticated attacks on the underlying number-theoretic problems, such as the discrete logarithm and integer factorization. Early RSA used 256-bit or 512-bit keys; the recommended key lengths were progressively increased to 1024 and 2048 bits, which require 512-bit and 1024-bit prime numbers respectively. RSA key pair generation is therefore computationally expensive, and most of that computation is spent generating and validating random prime numbers.

For centuries, mathematicians and computer scientists have been challenged by the primality test, which is used to validate prime numbers [7]. Distinguishing prime from composite numbers is one of the most significant problems in mathematics. Many current cryptosystems do not appear to be secure enough, especially given the growing threat of attacks, and insecure key pair generation is one of the causes. Prime number generation is recognized as significant, and prime number validation is a crucial part of the key generation process. A large prime is typically found by testing successively generated numbers until one is prime: a pseudo-random or true-random number is chosen and its primality is checked with one of the available primality tests. Speed and accuracy are the two crucial properties of any primality testing algorithm. Deterministic algorithms guarantee 100% accuracy but typically have a high computational overhead. Randomized or probabilistic algorithms are often faster, but they carry a small error probability because they cannot guarantee whether the given number is composite or prime.

TABLE I. ALGORITHMS FOR PRIMALITY TESTING

Primality Algorithm          Complexity
AKS test                     $O(\log^5 n)$
Baillie-PSW primality test   $O((\log n)^3)$
Fermat primality test        $O(m \log n)$
Solovay-Strassen test        $O(\log n)$
Miller-Rabin test            $O(\log n)$

Table I lists the most relevant primality testing algorithms [8], together with the Big O complexity of each in terms of the number of arithmetic operations. On this basis, the Solovay-Strassen [9] and Miller-Rabin [10] algorithms are the only two approaches currently competing for use in practical primality testing. The Solovay-Strassen algorithm is not yet practical and is still in theoretical development [11]. Miller-Rabin is one of the most popular primality testing algorithms because it achieves the highest throughput and the lowest execution time, especially when implemented in hardware such as FPGAs and ASICs [12, 13].

This work introduces a novel hardware architecture for the 1024-bit Miller-Rabin primality tester IP core, aiming at scalability and high performance. Our approach applies multi-bit processing to the key algorithmic operations, specifically by choosing the radix for the Montgomery modular exponentiation and the Montgomery modular multiplication. We evaluate the performance of our
thorized licensed use limited to: AMRITA VISHWA VIDYAPEETHAM AMRITA SCHOOL OF ENGINEERING. Downloaded on May 02,2025 at 04:25:04 UTC from IEEE Xplore. Restrictions appl
architecture on the Virtex-6 FPGA (XC6VLX550T-2ff1759) device, focusing on critical metrics such as computation time and AT metrics. Our findings demonstrate a significant improvement in the hardware performance of the Miller-Rabin primality algorithm.

The rest of the paper is organized as follows. Section II gives an overview of the background and related research, providing context for the proposed Miller-Rabin primality testing method. Section III describes the hardware implementation of the method, including the design choices and strategies employed. Section IV compares the implementation results, highlighting the performance and efficiency of the proposed method. Section V concludes the paper by summarizing the findings, discussing their implications, and proposing directions for future research.

II. BACKGROUND AND RELATED WORK

Primality testing is essential for generating prime numbers. Monier [14] compared the efficiency of the Miller-Rabin and Solovay-Strassen probabilistic primality tests using a mathematical model, and concluded that, according to the model, Miller-Rabin outperforms Solovay-Strassen in both accuracy and efficiency.

In [15], Duta et al. examined the effectiveness of various primality tests to identify the most efficient ones, including the Baillie-Pomerance-Wagstaff (BPW) test, Lucas-Lehmer-Riesel (LLR) test, Proth's theorem, Solovay-Strassen test, Agrawal-Kayal-Saxena (AKS) test, Fermat's test, Adleman-Pomerance-Rumley (APR) test, Lucas-Lehmer test, Pepin's test, Miller-Rabin test, Quadratic Frobenius test, Lucas test, and Pocklington test. The tests were implemented in C# using the .NET framework, and performance was analyzed by type of primality test (deterministic or probabilistic) and for varying input sizes.

Abudaqa et al. [16] report that, among the primality tests they evaluated, Miller-Rabin is the fastest. Solovay-Strassen is more accurate than Fermat, although Fermat is usually faster. Because of its accuracy and speed, the Miller-Rabin test is the preferred option among these primality tests.

In [19], R. C. C. Cheung et al. presented a scalable design architecture for prime number validation on reconfigurable hardware. The parallelism and scalability of the design were investigated for very large prime numbers. A scalable design technique was used to translate the Rabin-Miller strong pseudoprime test into hardware. The architecture was implemented on reconfigurable FPGA devices, and its efficacy was assessed in terms of speed and size trade-offs. The scalable 1024-bit Miller-Rabin design takes 5.48 seconds on a Virtex-II XC2V6000 device.

[...] assessed. To obtain a less recursive algorithm, the authors of [20] modified the Karatsuba-Ofman algorithm and used it as the multiplier in the Miller-Rabin primality test. The design takes roughly 2.5 seconds to complete the Miller-Rabin test for a 1024-bit number on the C54x family.

In [21], Purdy et al. examined hardware architectures for the Miller-Rabin and Lucas primality tests and discussed various types of cryptographic algorithms and primality tests. The authors also demonstrated a Verilog implementation of the Baillie-PSW test on an Altera Cyclone IV GX FPGA device. The implementation takes an odd random number as input and outputs the next probable prime. The authors evaluated the outcomes of their implementation and recommended ways to improve them. In simulation with a 1024-bit random number, the design takes 47.86 ms to find a prime and uses 37% of the FPGA device resources, a cost attributed to the Baillie-PSW complexity of $O((\log n)^3)$.

In [22], Kim Dong Kyue et al. analyse various scenarios for a hardware prime generator. They find that Fermat tests and trial division, when implemented in hardware, can operate in parallel and perform significantly better than when implemented in software. Separate hardware prime generators were used to generate 512-bit and 1024-bit primes. Validating a 1024-bit prime takes 789.6 ms on the Virtex-4 FPGA device.

The main feature of our proposed design is multi-bit processing (Radix-$2^{16}$) for the Montgomery modular arithmetic operations (specifically Montgomery Modular Exponentiation and Modular Squaring), which enables faster primality testing and higher performance. The Montgomery Modular Exponentiation and Modular Squaring operations are designed to execute in parallel, which further improves efficiency.

III. PROPOSED MILLER-RABIN ARCHITECTURE

The Miller-Rabin algorithm tests whether an input number is prime or composite. Algorithm 1 illustrates the structure of the algorithm.

Algorithm 1: Miller-Rabin Primality Testing Algorithm
Inputs: N
Outputs: N is prime or composite
Write N - 1 in the form $N - 1 = m \cdot 2^k$
1. if N [...]
   a. Calculate T such that $T = a^m \bmod N$; a = 2 for binary
   b. If (T = 1 or T = N - 1), the number is prime; else composite
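As a software cross-check of Algorithm 1, the following minimal Python sketch performs one base-2 round for an odd N. The function names `decompose` and `miller_rabin_base2` are hypothetical. The repeated-squaring loop is the standard strong-pseudoprime step and is an assumption here, since the listing above does not spell it out; it is not a description of the hardware datapath.

```python
def decompose(n):
    """Write n - 1 as m * 2**k with m odd and return (m, k)."""
    m, k = n - 1, 0
    while m % 2 == 0:
        m //= 2
        k += 1
    return m, k


def miller_rabin_base2(n, a=2):
    """One strong probable-prime round for odd n > 2 (True = probably prime)."""
    m, k = decompose(n)
    t = pow(a, m, n)              # step a: T = a^m mod N
    if t == 1 or t == n - 1:      # step b: accept immediately
        return True
    # Assumption: standard squaring chain, not shown in Algorithm 1.
    for _ in range(k - 1):
        t = (t * t) % n
        if t == n - 1:
            return True
    return False                  # N is composite
```

A single base-2 round is probabilistic: 2047 = 23 * 89 passes it even though it is composite, which is why practical generators usually repeat the test with further bases.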
The architecture of the probabilistic Miller-Rabin primality algorithm is illustrated in Figure 1; the design exploits maximum parallelism. In addition, multi-bit (16-bit, Radix-$2^{16}$) processing within the Montgomery exponentiation and multiplication reduces latency, thereby enhancing the robustness and performance of the Miller-Rabin design.

Figure 2 shows the logic design flow diagram of the Miller-Rabin algorithm. The flow in Figure 2 consists of the following two major test phases:

Test Phase-1:
This is the initial test phase. The input number N is prime if it is 2, and composite if it is 1 or any other even number. If N is odd and greater than 1, testing proceeds to the next phase.

Test Phase-2:
This phase tests odd numbers. If the input number N is odd, the steps described in Algorithm 1 are carried out to determine its primality. The test reports composite if it fails, and prime otherwise.

[...] then domain conversions between the ordinary and the Montgomery domains are shown in Table II.
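The screening performed in Test Phase-1 can be sketched in a few lines of Python. The function name `phase1_screen` and the returned strings are hypothetical, and the sketch assumes a positive integer input. It checks N = 2 before the even test, because 2 is even yet prime.

```python
def phase1_screen(n):
    """Classify a positive integer n as in Test Phase-1.

    Returns "prime", "composite", or "phase-2" (hand over to Algorithm 1).
    """
    if n == 2:
        return "prime"        # checked first: 2 is even but prime
    if n == 1 or n % 2 == 0:
        return "composite"    # the text reports 1 and even numbers as composite
    return "phase-2"          # odd n > 2 continues to Test Phase-2
```

Only odd candidates reach the expensive modular exponentiation, so the screen costs a single comparison and a parity check.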
The Radix-$2^{16}$ Montgomery Modular Multiplier (R216MM) is implemented to decrease the number of clock cycles needed to execute the Montgomery multiplication, thus achieving the lowest latency. The proposed design is illustrated in Algorithm 2 below.

Algorithm 2: Montgomery Modular Multiplication (R216MM)
Inputs: $X_{n-1:0}$, $Y_{n-1:0}$, $N_{n-1:0}$ with $X, Y < N$
Outputs: $T = X \cdot Y \cdot r^{-1} \bmod N$
n: operand bit length, Mem: 16-bit stored value
1. begin
2. $T = 0$; $U = 0$;
3. for ($i = 0$; $i < n/16$; $i$++) begin
   a. $U = T + X_{15:0} \cdot Y$;
   b. $T = U + Mem \cdot N$;
   c. $T = T \ \mathrm{div}\ 2^{16}$;
   d. $X = X \ \mathrm{div}\ 2^{16}$ end
4. if $T \ge N$ then $T = T - N$;
5. return T;
6. end

The initial Montgomery multiplications with $r^2$ convert the integer operands from the integer domain to the Montgomery domain (Line 1 of Algorithm 3). Likewise, multiplying the result by 1 converts the outcome back from the Montgomery domain to the integer domain (Line 2(b) of Algorithm 3). In step 2 of Algorithm 3, each iteration performs a squaring, followed by a multiplication that depends on the scanned bit value. The architecture of the Montgomery Modular Exponentiation is shown in Figure 4.
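To make the word-serial structure of Algorithm 2 and the domain conversions concrete, here is a minimal Python model. It is a sketch under stated assumptions, not the authors' RTL. The names `mont_mul`, `mont_exp`, and `n_prime` are hypothetical. The per-iteration quotient digit `q` is assumed to play the role of `Mem`, derived from $-N^{-1} \bmod 2^{16}$, and the exponent is assumed to be scanned from the most significant bit. It requires Python 3.8 or later for `pow` with a negative exponent, and the operand length must be a multiple of 16.

```python
W = 16                 # bits consumed per iteration (radix 2**16)
MASK = (1 << W) - 1


def mont_mul(x, y, n, nbits):
    """Return x * y * r^-1 mod n with r = 2**nbits, for odd n and x, y < n."""
    n_prime = (-pow(n, -1, 1 << W)) & MASK   # assumption: -n^-1 mod 2^16
    t = 0
    for _ in range(nbits // W):
        u = t + (x & MASK) * y               # step a: add the partial product
        q = ((u & MASK) * n_prime) & MASK    # assumption: plays the role of Mem
        t = (u + q * n) >> W                 # steps b, c: exact division by 2^16
        x >>= W                              # step d: next 16-bit word of X
    return t - n if t >= n else t            # step 4: final correction


def mont_exp(a, e, n, nbits):
    """Return a**e mod n using mont_mul, scanning e from its top bit down."""
    r2 = pow(1 << nbits, 2, n)               # r^2 mod n, assumed precomputed
    a_m = mont_mul(a, r2, n, nbits)          # into the Montgomery domain
    acc = mont_mul(1, r2, n, nbits)          # Montgomery form of 1
    for bit in bin(e)[2:]:
        acc = mont_mul(acc, acc, n, nbits)   # squaring on every scanned bit
        if bit == "1":
            acc = mont_mul(acc, a_m, n, nbits)
    return mont_mul(acc, 1, n, nbits)        # multiply by 1 to leave the domain
```

For a 1024-bit operand the loop in `mont_mul` runs 1024 / 16 = 64 times, compared with 1024 iterations for bit-serial (Radix-2) processing.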
IV. RESULTS AND DISCUSSIONS

A. Development and Validation
The 1024-bit Miller-Rabin design architecture was implemented in Verilog HDL and validated on the Pico Computing M503 FPGA (device: XC6VLX550T-2ff1759). The implementation was synthesized using Xilinx ISE 14.7 for Virtex-II, Virtex-IV, and Virtex-VI to allow like-for-like comparisons. The architecture was simulated and verified with NIST test vectors to confirm its accuracy.

Fig. 4. Montgomery Modular Exponentiation architecture

B. Radices (multi-bit processing) Vs AT metrics – Analysis
Figure 5 shows the AT metrics and computation times of the Montgomery Modular Multiplier versus the processing word length, from Radix-$2^1$ to Radix-$2^{20}$ (1-bit word <w-1> to 20-bit word <w-20> processing), for 256-bit, 512-bit and 1024-bit operand lengths on the Virtex-VI FPGA device. The computation time (latency) and the Area x Time (AT) metrics decrease notably as the radix increases up to Radix-$2^{16}$, but these metrics increase beyond Radix-$2^{16}$. This behaviour arises because a higher radix requires the computation of more partial products, which in turn requires more FPGA LUTs, leading to increased area and decreased computation time (T).

The analysis shows that the AT metrics follow a decreasing trend as the radix increases. Up to Radix-$2^{16}$ there is a noticeable improvement in these metrics, indicating that the design benefits from the increased radix in terms of both area and computation time. Beyond Radix-$2^{16}$, however, the metrics start to increase: the area continues to grow linearly while latency shows almost no further improvement, so the overall efficiency, as measured by the AT metrics, begins to decline. Hence, the Radix-$2^{16}$ Montgomery Modular Multiplier (R216MM) emerges as the optimal choice for implementing the Miller-Rabin hardware architecture.

C. Comparative Analysis
The FPGA implementation of the full 1024-bit Miller-Rabin design with Montgomery exponentiation and the R216MM multiplier comprises 18129 LUTs. It operates at a worst-case frequency of 448.4 MHz and can perform a 1024-bit primality test in 64.7 ms.

Table III provides a detailed comparison between the proposed 1024-bit Miller-Rabin design and other implementations, offering insights into their performance and efficiency. The comparison highlights the strengths of the proposed design in terms of speed and resource utilization.

Compared to the design in [19], our implementation completes primality testing faster, taking approximately 2.68 seconds while using 3723 Slices on the Virtex-II FPGA device. The corresponding Area x Time (AT) metric for our design is 9.97, whereas the design in [19] has an AT metric of 15.73, indicating a less efficient use of resources.

Compared to the design presented in [22], our implementation significantly improves the completion time of primality testing, achieving 0.47 seconds with 17,640 LUTs on the Virtex-IV FPGA device. The corresponding AT metric for our design is 8.25. In contrast, the design in [22] has a higher AT metric of 19.28, requiring 24,420 hardware resources and a testing time of 0.79 seconds.

The comparison in Table III shows the advantages of our design over the existing implementations [19, 20, 21, 22] and establishes the proposed design as an efficient and effective solution for 1024-bit Miller-Rabin primality testing on FPGA devices.

TABLE III. A COMPARISON OF THE PROPOSED 1024-BIT MILLER-RABIN DESIGN ARCHITECTURE WITH OTHER 1024-BIT MILLER-RABIN DESIGN IMPLEMENTATIONS

Reference  FPGA Device           Frequency (MHz)  Computation Time (ms)  Area (Slices/LUTs)  Area x Time (x10^6 LUTs-ms)
[19]       Virtex II             -                5478                   2872                15.73
Ours       Virtex II             85               2678                   3723                9.97
[22]       Virtex IV             100              789.6                  24420               19.28
Ours       Virtex IV             144              468.7                  17640               8.255
[20]       TMS320 C54x family    160              2484.30                -                   Not analyzed
[21]       Altera Cyclone IV GX  100              48.3                   55411               2.676
Ours       Virtex VI             448.4            64.7                   18129               1.172

V. CONCLUSION

In this work, we introduced a novel hardware architecture for the 1024-bit Miller-Rabin primality tester IP core that enhances performance and scalability. By analyzing multi-bit processing techniques and selecting the most suitable radix (Radix-$2^{16}$, 16 bits processed per cycle) based on AT metrics for the Montgomery modular multiplication and exponentiation hardware implementations, we optimized the performance of the core operations in the Miller-Rabin design. This approach balances area and latency, resulting in a highly efficient design. Evaluations on Virtex-2, Virtex-4, and Virtex-6 FPGA devices demonstrated significant gains in logic resource utilization, design area, and maximum operating frequency. Our experimental results indicate that our architecture outperforms existing approaches, making it well suited for high-performance public key cryptographic applications, especially RSA and ECC key generation modules. These results underscore the effectiveness of our design in advancing primality testing efficiency.
Fig. 5. Computation times and AT metrics of the Montgomery Modular Multiplier versus multi-bit (word) processing from Radix-$2^1$ to Radix-$2^{20}$: (a) 256-bit, (b) 512-bit, (c) 1024-bit; Montgomery multiplier computation time versus multi-bit (word) processing from Radix-$2^1$ to Radix-$2^{20}$: (d) 256-bit, (e) 512-bit, (f) 1024-bit.
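The Area x Time figures used throughout the comparison can be reproduced from the reported area and computation time. The Python sketch below recomputes them for the rows of Table III that report both quantities. The helper name `area_time` and the scaling by 10^6 are assumptions inferred from the magnitudes in the table, not something the paper states explicitly.

```python
# (reference, device, computation time in ms, area in slices or LUTs), from Table III
ROWS = [
    ("[19]", "Virtex II", 5478.0, 2872),
    ("Ours", "Virtex II", 2678.0, 3723),
    ("[22]", "Virtex IV", 789.6, 24420),
    ("Ours", "Virtex IV", 468.7, 17640),
    ("[21]", "Cyclone IV GX", 48.3, 55411),
    ("Ours", "Virtex VI", 64.7, 18129),
]


def area_time(time_ms, area):
    """Area x Time product in units of 10**6 (slices or LUTs) x ms."""
    return area * time_ms / 1e6


for ref, device, time_ms, area in ROWS:
    print(f"{ref:5s} {device:14s} AT = {area_time(time_ms, area):.2f}")
```

A lower product means the same work is done with less silicon, less time, or both. Small differences in the last digits relative to the table may stem from rounding of the reported times.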