Improving The Efficiency of Cryptography Algorithms On Resource-Constrained Embedded Systems Via RISC-V Instruction Set Extensions

The document discusses improving the efficiency of cryptography algorithms like AES and SHA on resource-constrained embedded systems using RISC-V instruction set extensions. It presents the design and evaluation of RISC-V implementations of AES-128, AES-256, SHA-256, and SHA-512 both with and without specialized RISC-V instructions. Significant gains in clock cycles, memory usage, and energy efficiency were observed when using the specialized instructions.


Improving the Efficiency of Cryptography Algorithms on Resource-Constrained Embedded Systems via RISC-V Instruction Set Extensions
2023 36th SBC/SBMicro/IEEE/ACM Symposium on Integrated Circuits and Systems Design (SBCCI) | 979-8-3503-1834-0/23/$31.00 ©2023 IEEE | DOI: 10.1109/SBCCI60457.2023.10261964

Carlos Gabriel de Araujo Gewehr∗† , Fernando Gehm Moraes∗


∗ School of Technology, Pontifical Catholic University of Rio Grande do Sul – PUCRS – Porto Alegre, Brazil
† EnSilica – Porto Alegre, Brazil
[email protected], [email protected]

Abstract—This work presents the design and evaluation of RISC-V implementations of AES-128, AES-256, SHA-256, and SHA-512, both with and without specialized instructions from the Zkne and Zknh ISA extensions. In our evaluation, we use the Ibex implementation of the RISC-V ISA, a simple low-area 2-stage pipeline design, and the TinyCrypt library, a collection of low-overhead C implementations of widely employed cryptography algorithms. Several criteria relevant to low-complexity embedded systems are measured and compared, such as area costs for the hardware side and stack usage and code density for the software side, illustrating the trade-offs emerging from using specialized RISC-V instructions in the aforementioned algorithms. For AES-128, AES-256, SHA-256 and SHA-512, respectively, clock cycle count gains of 42.57x, 44.81x, 1.45x and 1.74x were observed, as well as 4.16x, 4.16x, 1.58x and 1.63x gains in memory usage efficiency and 27.81x, 28.91x, 1.45x and 1.79x gains in energy efficiency, with an overhead of 10% in die area cost. The extended TinyCrypt library with hardware accelerated implementations and extended Ibex processor RTL are available open-source at https://github.com/cggewehr/RISCV-crypto.

Index Terms—RISC-V; Instruction Set Extensions; Embedded Systems; Cryptography; Security; Hardware Acceleration.

I. INTRODUCTION

With the increasing deployment of low-complexity embedded systems such as IoT sensor nodes, battery-powered and energy-harvesting devices, secure communication is a core requirement. Such devices often do not have the necessary resources for locally computing data processing demands, delegating intensive work to the cloud or network edge. Due to this emerging computing paradigm, a significant amount of data must be securely transmitted to and from several network endpoints. This is not an easy task to accomplish, seeing as the cryptographic algorithms that enable the confidentiality and integrity of the data being transmitted, such as AES [1] and SHA-256 [2], are associated with non-trivial overheads in application performance and memory usage, which are critical concerns in the context of resource-constrained devices.

Hardware acceleration can significantly enhance performance, memory, and energy efficiency in crucial operations of an algorithm. One way to achieve hardware acceleration is by implementing specialized instructions in a general-purpose processor. Compared to loosely-coupled hardware accelerators, these instructions provide benefits for low-complexity embedded systems: (i) significant resource sharing between general and specialized components, including register file and main memory interface; (ii) trivial data transfer within the processor's register file, avoiding costly loads/stores to peripherals or shared memory; (iii) no added complexity to the interconnect and interrupt controller.

The RISC-V ISA [3] is increasingly gaining popularity because it enables customization to suit specific application requirements via Instruction Set Extensions (ISEs). Relevant to the context of security in low-complexity embedded systems are the Zkne and Zknh [4] ISEs, defining specialized instructions for AES encryption and for SHA-256 and SHA-512 hashing, respectively. As RISC-V is an open ISA, no licensing fee is required for distributing implementations of a RISC-V processor. As a result, several open-source RISC-V cores have been made publicly available at no cost [5, 6, 7]. Among such implementations, of special interest is Ibex [5], a small 2-stage pipeline design aimed at minimizing complexity, used in relevant works [8, 9, 10, 11]. Additionally, several security-minded features, such as an ECC-protected register file and dummy instruction insertion, can be obtained by simply setting appropriate parameters, making Ibex a prime choice for resource-constrained embedded systems where security is a core requirement.

The requirements for low resource usage in such contexts also apply to their software components. The TinyCrypt [12] library is designed specifically for embedded systems with limited resources. Due to its design philosophy of specifically targeting such devices, it has been chosen as a fair software context for this work, providing real-world applicability.

This work aims to present the implementation and evaluation of AES, SHA-256, and SHA-512, both with and without specialized instructions from the Zkne and Zknh ISA extensions. The original contributions of this work include:
• an open-source RTL implementation of the Ibex RISC-V processor with Zkne and Zknh extensions;
• an open-source implementation of hardware-accelerated AES, SHA-256 and SHA-512 in highly optimized assembly using RISC-V Zkne and Zknh extensions, transparent with existing deployments of the TinyCrypt library;
• a deep analysis of the trade-offs surrounding the implementation of the Zkne and Zknh extensions, evaluating gains in performance, energy and memory efficiency, and die area costs.

This work was financed in part by CNPq (grant 309605/2020-2) and FAPERGS (grant 21/2551-0002047-4).

Authorized licensed use limited to: NORTHWESTERN POLYTECHNICAL UNIVERSITY. Downloaded on April 04,2024 at 03:51:07 UTC from IEEE Xplore. Restrictions apply.
II. RELATED WORK

Previous efforts in open-source RISC-V cores have not sufficiently addressed the need for low-overhead security in resource-constrained embedded systems. The PULP project [7] provides several RISC-V cores and loosely coupled dedicated cryptography cores, easily integrated via an automated system generation tool. This does not provide an optimal solution for low-energy cryptography workloads due to the long logical distance between the main processor and hardware accelerators. Software productivity is also a negative factor, seeing as custom drivers must be developed for each accelerator.

The OpenTitan project [11] uses a similar approach with loosely-coupled accelerators, but uses a higher-complexity version of the Ibex core with Zb* bit-manipulation extensions for its main processor. It does not intend to provide a platform for use in IoT applications and, in the same manner as PULP [7], its use of loosely-coupled accelerators does not satisfy the constraints of the low-complexity context being explored.

The SCARV project [6] implements a 5-stage RISC-V core with cryptography ISEs and a software library using these ISEs to achieve hardware acceleration. The SCARV core is of significantly higher complexity than the simple 2-stage Ibex core used in this work. It does not provide a fair comparison, nor is it an adequate processor choice for the low-resource context being explored. Additionally, the software library is custom-made, again leading to challenges in software productivity when integrating it into other projects.

Zgheib et al. [8] propose a tightly-coupled AES accelerator integrated into an Ibex core. They report a 40% increase in FPGA slice usage when integrating the AES accelerator into the Ibex processor, improving performance by 46 times compared to a software implementation from OpenSSL. The authors also report that their solution consumes 44.9 times less energy than OpenSSL AES. Note that this AES implementation is not ideal for an embedded context, seeing as the SubBytes operation is computed online instead of via a lookup table in order to avoid cache timing side-channel attacks. In a low-complexity embedded context, such attacks are not applicable as there is no cache to leak timing information. A table-based SubBytes such as the TinyCrypt implementation used in the present work is expected to be both faster and more compact, although we offer no quantitative evidence for this claim.

Marshall et al. [13] present a comparative analysis of several RISC-V ISE proposals, where the Saarinen [14] work is demonstrated to be the superior extension to the 32-bit base ISA. The ISE proposed by Saarinen [14] was later standardized and ratified in the RISC-V scalar cryptography specification as the Zkne and Zknd extensions for AES encryption and decryption, respectively. Synthesis results are provided in terms of NAND2 equivalent gates. In the context of a SCARV [6] core, the Saarinen [14] ISE shows a 3% area increase, accelerating both encryption and decryption. Saarinen [14] reports a 5% increase in LUT usage in an FPGA-based evaluation considering a lower complexity single-cycle RISC-V core. No power or energy results are provided in [13, 14].

On the software side of the evaluation in [13], the ISE from [14] shows 3.5x and 1.65x performance gains for AES-128 encryption and encryption key expansion, respectively. The software comparisons consider a higher-performance T-Table based implementation, trading off memory usage for performance. It is significantly faster than the TinyCrypt implementation used in this work, at the cost of a much higher memory footprint.

Further software evaluations concerning the entire RISC-V cryptography ISE (not only Zkne and Zknd as discussed so far) are reported in [15, 16]. Nişancı [15] presents an analysis of program and static data memory usage for accelerated and non-accelerated implementations of cryptography algorithms, prioritizing execution speed in their implementations. The authors show a reduction of 83% in program memory usage in AES-128, but do not seem to consider a reduction in static data memory usage in their accelerated implementation, nor stack usage in their evaluation. It is unclear if their AES implementations use the stack at all. In this work, we show the reduction in static data and stack memory usage to be of significant importance to the gains provided by an accelerated AES implementation. For SHA-256 and SHA-512, clock cycle count gains of 43% and 40%, as well as a reduction in program memory usage by 33% and 32% are reported, respectively. Similar results are reported in the earlier work [16], evaluated in the context of the previously discussed SCARV [6] core.

Despite reporting performance in terms of clock cycles instead of instructions retired, the authors of [15] do not sufficiently describe their hardware platform, simply stating their processor core to be "a 32-bit 5-stage pipelined RISC-V processor that only supports base integer instructions". Crucial facts concerning the complexity of the RISC-V core used in their evaluation, such as operand forwarding, branch prediction, and multiplier and shifter topology, are omitted. This makes it difficult to compare results, seeing as the complexity of their RISC-V core cannot be correlated to the Ibex core we use.

In summary, the core contributions of the works reviewed in this Section are presented and compared in Table I. Note that all works reviewed present ASIC implementation results using open-source synthesis tools and generic cell libraries. Our work uses a commercial synthesis tool and cell library, providing results with a better correlation to physical implementation than previous evaluations. Deeper insights into the RISC-V cryptography ISEs, especially Zknh, in low-complexity embedded contexts are scarce in the literature.

TABLE I
RELATED WORK COMPARISON
Reference columns: [8] [13] [14] [15] [16] Our Work
Open-source RTL: X X X X
Area cost evaluation: X X X X X X
Energy cost evaluation: X X
Commercial EDA tools: X
Manufacturable cell library: X
Open-source software: X X X X
SHA-2 extensions evaluation: X X X
Established software library: X
RISC-V performance counters: X

III. RISC-V SCALAR CRYPTOGRAPHY EXTENSIONS

A. Zkne extension for AES encryption

AES [1] is a block cipher standard, widely used for confidentially transmitting information. AES iterates through its internal state in 4 fundamental operations in the following order: SubBytes, ShiftRows, MixColumns and AddRoundKey (see footnote 1). Note that the SubBytes operation involves a non-trivial inversion in $GF(2^8)$, along with the affine transformation which follows. The entirety of the compound operation in SubBytes is commonly known as a Substitution Box, or SBOX. An optimization technique employed in the TinyCrypt library used in this work is to pre-compute each SBOX value and store them in a Look-Up Table (LUT). The SBOX LUT comprises 256 bytes, and is equivalent in size to 64 RV32I instructions (4 bytes each). Assuming that computing SubBytes online requires more than 64 RV32I instructions, this can be seen as an optimization for both performance and code size.

Seeing as the ShiftRows and MixColumns operations perform shifts and multiplications by constants, this reasoning can be taken one step further by pre-computing not the result of SubBytes, but of MixColumns. In contrast to an S-Table implementation, which uses pre-computed values only for SubBytes, this approach is commonly known as a T-Table implementation, named after the tables $T_0, T_1, T_2, T_3$ defined below, where $d_{i,j}$ and $a_{i,j}$ denote the output of MixColumns and the round's initial state at row $i$ and column $j$, respectively, and $S$ denotes the SBOX LUT:

$$
\begin{bmatrix} d_{0,j} \\ d_{1,j} \\ d_{2,j} \\ d_{3,j} \end{bmatrix}
=
\begin{bmatrix} 2 & 3 & 1 & 1 \\ 1 & 2 & 3 & 1 \\ 1 & 1 & 2 & 3 \\ 3 & 1 & 1 & 2 \end{bmatrix}
\begin{bmatrix} S[a_{0,j}] \\ S[a_{1,j-1}] \\ S[a_{2,j-2}] \\ S[a_{3,j-3}] \end{bmatrix}
= T_0[a_{0,j}] \oplus T_1[a_{1,j-1}] \oplus T_2[a_{2,j-2}] \oplus T_3[a_{3,j-3}]
$$

$$
T_0[x] = \begin{bmatrix} S[x] \cdot 2 \\ S[x] \\ S[x] \\ S[x] \cdot 3 \end{bmatrix}
\quad
T_1[x] = \begin{bmatrix} S[x] \cdot 3 \\ S[x] \cdot 2 \\ S[x] \\ S[x] \end{bmatrix}
\quad
T_2[x] = \begin{bmatrix} S[x] \\ S[x] \cdot 3 \\ S[x] \cdot 2 \\ S[x] \end{bmatrix}
\quad
T_3[x] = \begin{bmatrix} S[x] \\ S[x] \\ S[x] \cdot 3 \\ S[x] \cdot 2 \end{bmatrix}
$$

Note that each T-Table is a 1-to-4 byte mapping, instead of the previous 1-to-1 byte mapping using S-Tables. This increases memory usage from 256 to 4K bytes (4 tables of 256 entries of 4 bytes each), but simplifies an entire AES round to 16 LUT lookups and 16 bitwise XOR operations. The performance to code size trade-off can be balanced by pre-computing only 1 T-Table and computing the remaining table entries online via rotations.

The Zkne extension defines two new specialized instructions: aes32esmi and aes32esi. aes32esmi can be interpreted as computing T-Table entries online in hardware, then XOR-ing the current entry with previous T-Table entries for the same output column. This provides the performance benefits of a T-Table AES implementation without the need for storing LUTs in memory, leading to gains in performance and code size, as well as energy efficiency gains via the reduction in memory accesses. Finally, aes32esi performs only a single SBOX lookup and XOR, and is used in computing round keys and in the last encryption round, which has no MixColumns.

1 We kindly assume that the reader is already familiar with the inner workings of AES. For those who are inexperienced in cryptography, we recommend [17] for an introduction to AES.

B. Zknh extension for SHA-256 and SHA-512 hashing

The SHA-2 [2] standard defines the SHA-256 and SHA-512 hash algorithms, reducing a message of arbitrary size into 256 or 512 bits, respectively. A message $M$ is padded and split into message blocks $m$ of 512 bits (SHA-256) or 1024 bits (SHA-512). An initial value $h_0$ is set, and the following message blocks are computed with $h_i = f(h_{i-1}, m_i)$. The output is taken as the final $h$ value.

For SHA-256, 4 instructions are defined: sha256sum0, sha256sum1, sha256sig0 and sha256sig1. These instructions directly implement the $\Sigma_0^{256}$, $\Sigma_1^{256}$, $\sigma_0^{256}$ and $\sigma_1^{256}$ functions defined in the SHA-2 standard [2], used in the computation of $f(h_{i-1}, m_i)$ and shown below. $ROTR^n$ and $SHR^n$ denote right rotation and logical right shift by $n$ bits, respectively.

$$\Sigma_0^{256}(x) = ROTR^{2}(x) \oplus ROTR^{13}(x) \oplus ROTR^{22}(x)$$
$$\Sigma_1^{256}(x) = ROTR^{6}(x) \oplus ROTR^{11}(x) \oplus ROTR^{25}(x)$$
$$\sigma_0^{256}(x) = ROTR^{7}(x) \oplus ROTR^{18}(x) \oplus SHR^{3}(x)$$
$$\sigma_1^{256}(x) = ROTR^{17}(x) \oplus ROTR^{19}(x) \oplus SHR^{10}(x)$$

The availability of Zknh instructions allows for an implementation that is not only faster, but also more compact and energy-efficient. Implementing these 4 functions in plain RV32I is very costly due to the use of rotations, which no RV32I instruction can directly compute. Computing a rotation by $n$ bits in RV32I requires a logical shift by $n$ bits in the forward direction, a logical shift by $32 - n$ bits in the reverse direction, and OR-ing both intermediary shifts together. This requires the use of additional temporary registers, which may lead to the compiler storing intermediary values in the stack due to the lack of registers available to the register allocator.

Similar functions are defined for SHA-512:

$$\Sigma_0^{512}(x) = ROTR^{28}(x) \oplus ROTR^{34}(x) \oplus ROTR^{39}(x)$$
$$\Sigma_1^{512}(x) = ROTR^{14}(x) \oplus ROTR^{18}(x) \oplus ROTR^{41}(x)$$
$$\sigma_0^{512}(x) = ROTR^{1}(x) \oplus ROTR^{8}(x) \oplus SHR^{7}(x)$$
$$\sigma_1^{512}(x) = ROTR^{19}(x) \oplus ROTR^{61}(x) \oplus SHR^{6}(x)$$

Since SHA-512 variables are defined to be 64 bits in size, such functions cannot be directly implemented as a single instruction yielding a 32-bit result, as is the case for SHA-256. Note that the temporary registers issue is aggravated due to the increased variable size. Six instructions are defined to accelerate SHA-512 sigma functions: sha512sig0h, sha512sig0l, sha512sig1h, sha512sig1l, sha512sum0r and sha512sum1r. With Zknh instructions, $\sigma_0^{512}$ and $\sigma_1^{512}$ are each computed with 2 different instructions, for the high and low parts of the result separately, while $\Sigma_0^{512}$ and $\Sigma_1^{512}$ are each computed via 2 executions of the same instruction, once with inverted operands, implying a rotation by 32 bits.
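The shift-and-OR rotation and the split of a 64-bit SHA-512 sigma function into 32-bit halves can be sketched in Python as follows. This is only an illustrative model, not code from TinyCrypt or from the Zknh specification: the function names are hypothetical, and the low/high expressions are derived here directly from the definition of $\sigma_0^{512}$, so they mirror the idea of separate low and high instructions rather than quoting their official semantics.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def rotr32(x, n):
    """Right rotation built from two shifts and an OR, as plain RV32I requires."""
    return ((x >> n) | (x << (32 - n))) & MASK32

def sha256_sig0(x):
    """sigma0 for SHA-256: ROTR7 xor ROTR18 xor SHR3 on a 32-bit word."""
    return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3)

def rotr64(x, n):
    """64-bit right rotation, used only as a reference."""
    return ((x >> n) | (x << (64 - n))) & MASK64

def sha512_sig0_ref(x):
    """sigma0 for SHA-512 on a full 64-bit word: ROTR1 xor ROTR8 xor SHR7."""
    return rotr64(x, 1) ^ rotr64(x, 8) ^ (x >> 7)

def sha512_sig0_low(lo, hi):
    """Low 32 bits of sigma0-512, computed from the two 32-bit halves (hypothetical helper)."""
    return ((lo >> 1) ^ (lo >> 7) ^ (lo >> 8)
            ^ (hi << 31) ^ (hi << 25) ^ (hi << 24)) & MASK32

def sha512_sig0_high(hi, lo):
    """High 32 bits of sigma0-512; SHR7 contributes no bits from the low half here."""
    return ((hi >> 1) ^ (hi >> 7) ^ (hi >> 8)
            ^ (lo << 31) ^ (lo << 24)) & MASK32

def sha512_sig0_split(x):
    """Recombine the two 32-bit results into the 64-bit sigma0-512 value."""
    lo, hi = x & MASK32, x >> 32
    return (sha512_sig0_high(hi, lo) << 32) | sha512_sig0_low(lo, hi)
```

In this sketch every call to `rotr32` stands for three base instructions (two shifts and an OR) plus a temporary register, which is the cost the text attributes to plain RV32I; the split functions show why the 64-bit case needs one operation per half of the result.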

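Returning to the T-Table construction of Section III-A, the relation between the S-Table, the four T-Tables and an instruction that computes a T-Table entry online and XORs it into an accumulator can be sketched in Python. The sketch makes several assumptions that are not taken from the paper's RTL or from TinyCrypt: the SBOX is recomputed from its mathematical definition rather than stored, each T-Table entry is packed with row 0 in the least significant byte, and `esmi_model` is a hypothetical behavioural model written from the description above, not a bit-exact rendering of aes32esmi.

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8) modulo the AES polynomial 0x11B."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r

def sbox_entry(x):
    """SubBytes for one byte: inversion in GF(2^8), then the affine transformation."""
    inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
    r = inv
    for s in range(1, 5):
        r ^= ((inv << s) | (inv >> (8 - s))) & 0xFF
    return r ^ 0x63

SBOX = [sbox_entry(x) for x in range(256)]  # the 256-byte S-Table

def t_entry(x, k):
    """32-bit entry T_k[x]: T_0 is (2S, S, S, 3S), T_k is T_0 rotated by k bytes."""
    s = SBOX[x]
    t0 = gf_mul(s, 2) | (s << 8) | (s << 16) | (gf_mul(s, 3) << 24)
    shift = 8 * k
    return ((t0 << shift) | (t0 >> (32 - shift))) & 0xFFFFFFFF

def esmi_model(rs1, rs2, bs):
    """Hypothetical model: select byte bs of rs2, compute T_bs online, XOR into rs1."""
    return rs1 ^ t_entry((rs2 >> (8 * bs)) & 0xFF, bs)

def mix_column_of_sbox(a):
    """Reference: MixColumns matrix applied to [S[a0], S[a1], S[a2], S[a3]]."""
    s = [SBOX[v] for v in a]
    m = [(2, 3, 1, 1), (1, 2, 3, 1), (1, 1, 2, 3), (3, 1, 1, 2)]
    out = 0
    for row in range(4):
        byte = 0
        for col in range(4):
            byte ^= gf_mul(s[col], m[row][col])
        out |= byte << (8 * row)
    return out

def next_column(a, round_key_word):
    """Four chained esmi_model calls accumulate one column of the next round state."""
    acc = round_key_word
    for k in range(4):
        acc = esmi_model(acc, a[k] << (8 * k), k)
    return acc
```

Storing all four tables would take 4 × 256 × 4 = 4096 bytes against 256 bytes for the S-Table alone, whereas `esmi_model` needs no table in memory at all, which is the trade-off the section describes.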
IV. EXTENDING IBEX WITH CRYPTOGRAPHY ISES

A. Zkne extension for AES encryption

The Zkne instructions are implemented in the AES Unit module, shown in Fig. 1. Our implementation is derived from the functional unit from the original ISE proposal in [14], which features additional logic for AES decryption and the SM4 cipher. The AES unit is entirely combinational, executing both the aes32esmi and aes32esi instructions in a single clock cycle. Both instructions are of an extended R type, which expects as inputs 2 registers to be read from the register file, plus a 2-bit immediate Byte Select (BS). AND gates are used to gate inputs, minimizing switching activity in the AES unit.

Fig. 1. AES Unit

Note how sub-modules in the AES Unit relate to the AES fundamental operations:
• SubBytes is performed one byte at a time in the SBOX sub-module, on the byte determined by BS in the value read from RS2;
• ShiftRows is computed by selecting the appropriate byte from RS2, which contains a column of the current AES state, via the BS immediate;
• MixColumns is computed similarly to the T-Table method shown in Section III-A, where the result of SubBytes is multiplied by [3, 1, 1, 2] and rotated by BS bytes, computing $T_{BS}$ while accounting for RISC-V little-endianness;
• AddRoundKey is computed by XOR-ing the result of MixColumns with the relevant word of the round key given in RS1.
RS1 should be equal to RD, such that RD accumulates the 4 T-Table entries and the round key, i.e. a column of the next round state.

B. Zknh extension for SHA-256 and SHA-512 hashing

Zknh instructions are also implemented in their own functional unit, shown in Fig. 2. The SHA-2 unit is again entirely combinational, with all Zknh instructions executing in a single cycle. The sigma functions discussed in Section III-B lend themselves to an efficient hardware implementation, seeing as shifts and rotations by static amounts reduce to wiring and have zero cost. The SHA-2 unit does not have a well-defined internal structure like the AES unit, and can be seen simply as a sea of XOR gates.

Fig. 2. SHA-2 Unit

C. Functional units integration

The integration of the AES and SHA-2 functional units in the Ibex processor's ID/EX pipeline stage can be seen in Fig. 3. New elements added to the processor are colored in blue, existing elements modified to accommodate the new functional units are colored in green, while unmodified elements are colored in yellow.

Fig. 3. Ibex ID/EX pipeline stage datapath

Fig. 3 also illustrates the benefits of implementing hardware acceleration as ISEs over loosely-coupled accelerators in embedded contexts. Note that only small changes in the instruction decoding and writeback logic are required to integrate functional units into the Ibex processor, while resource sharing with existing processor elements is extensively explored, namely in the register file and load/store unit.

D. Hardware-software interface

To compile software with instructions from standardized extensions, the compiler must be informed that the corresponding architectural feature is available. GCC achieves this using the -march flag. For example, to use instructions from the Zknh extension, GCC should be called with the -march flag set to rv32imc_zknh. The assembler can then generate binaries that match the encoding specified in the extension specification, from which the RTL implementation can decode and execute the specific operation to be accelerated.

The implementation of the algorithms themselves is accomplished through inline assembly inside TinyCrypt functions, conditionally compiled as either the original TinyCrypt C implementation or the optimized assembly implementation. Function declarations remain the same, such that APIs are not modified when using hardware-accelerated assembly. This makes hardware acceleration transparent to the user, requiring no additional effort in software integration and not breaking existing code using the TinyCrypt library.

V. EXPERIMENTAL EVALUATION

We compare the implementation and execution of the AES-128, AES-256, SHA-256 and SHA-512 algorithms in a baseline unmodified Ibex core and the extended Ibex core presented in Fig. 3. For both cases, we set the data independent timing flag in the Ibex control register. We consider the SecureIbex parameterization in both cases, excluding the ECC RAM and lockstep features, which are too costly. We also consider the latch-based register file, 3-cycle multiplier and no writeback stage parameterizations.

It should be noted that the TinyCrypt library does not have ready-made implementations of AES-256 and SHA-512. We obtain a comparable AES-256 implementation by setting appropriate parameters at compile-time and making small changes to the key schedule logic of the existing AES-128 implementation. For SHA-512, we consider an implementation from pull request #42. For AES, only the encryption operation (using the forward cipher) is evaluated, seeing as the most common block cipher modes of operation such as CTR do not require the inverse cipher for the decryption operation, using the forward cipher for both encryption and decryption. We use GCC version 12.2.0 for generating binaries. For all cases, we consider as inputs the example test vectors from [1, 2].

For AES, both the Key Schedule and Encryption of a single block are evaluated. For SHA-256 and SHA-512, two scenarios are considered: Full, considering the incremental API calls with non-accelerated elements such as padding and IV initialization; and Compression, in which only the hardware accelerated compression function $f(h_{i-1}, m_i)$ is evaluated.

A. Performance Evaluation

Performance counters for each function call are presented in Table II. For each function, the baseline TinyCrypt value is followed by the value for the optimized assembly using ISEs (white and grey columns in the original layout, respectively). Performance gains in clock cycles are most notable in AES encryption, showing gains of 42.57x and 44.81x, while in the Key Schedule, observed gains were 3.82x and 4.55x in AES-128 and AES-256, respectively. Our results in terms of total instructions retired very closely match AES-128 results from [13], due to similar implementation techniques being used, originally proposed in [14]. AES-256 was not evaluated in [13, 14].

SHA-2 algorithms present different results than [15], which reports speed-ups in terms of clock cycles of 1.76x and 1.65x, whereas our work shows speed-ups of 1.45x and 1.74x for the compression functions of SHA-256 and SHA-512, respectively. We theorize this to be attributed to the use of loop unrolling and higher complexity hardware in [15], but due to the omission of source code for both hardware and software in [15], we cannot elaborate any further on why our results do not match theirs. Furthermore, comparing clock cycle counts between the Full and Compression scenarios using Zknh instructions, non-trivial API overheads of roughly 15% for SHA-256 and 10% for SHA-512 are seen. It is unclear if this is considered in [15].

B. Memory Usage Evaluation

We use the GCC -fstack-usage flag to obtain static stack usage for each function and the GNU nm tool to obtain size information for functions and static data, presented in Fig. 4. Baseline TinyCrypt values are denoted with diagonal stripes while optimized assembly values are denoted with dots. The AES key schedule as well as plaintext data are not included. The AES SBOX LUT is only included in the Encryption cost, but is also used in the Key Schedule.

Fig. 4. Memory Usage for AES-128, AES-256, SHA-256 and SHA-512

Fig. 4 shows code size to be the most impactful factor in memory usage, with significant gains in code size most notably in AES (both) Encryption (3.08x), SHA-256 Compression (1.97x) and SHA-512 Compression (2.13x). Code size gains are most pronounced in AES Encryption, due to the very high density of the Zkne instructions, which execute in a short block of instructions behavior that would otherwise be spread out across many function calls. The same applies to the reduction in static data and stack usage, seeing as there is no need for either an SBOX LUT to be stored in memory or a stack-allocated temporary state for computing ShiftRows and MixColumns. The total memory usage gains are 4.16x, 1.58x and 1.63x for AES (both), SHA-256 and SHA-512, respectively.

For the SHA-2 algorithms, gains in code size are attributed more significantly to optimizing assembly by hand, maximizing the use of compressed instructions, than to the Zknh instructions themselves. Zknh instructions make better manual register allocation possible by computing sigma functions with the use of fewer temporary registers, evidenced by the proportional increase of compressed instructions, shown in Table II.

C. Die Area Cost Evaluation

We synthesize both the baseline and extended cores using a high-density 8-track cell library for a 28 nm FDSOI process from ST Microelectronics. Results are shown in Table III. We use Cadence Genus version 1903 with the PLE flow, aiming to obtain the best possible correlation to physical implementation.
TABLE II
IBEX PERFORMANCE COUNTERS FOR EXECUTIONS OF THE AES-128, AES-256, SHA-256 AND SHA-512 ALGORITHMS
(each cell: baseline TinyCrypt / optimized assembly with ISEs)

Performance Counter | AES-128 K.S. | AES-128 Enc. | AES-256 K.S. | AES-256 Enc. | SHA-256 Full | SHA-256 Comp. | SHA-512 Full | SHA-512 Comp.
Clock Cycles | 1111 / 291 | 13366 / 314 | 1596 / 351 | 18910 / 422 | 7007 / 5049 | 6248 / 4295 | 18650 / 11197 | 17550 / 10091
Instructions Retired | 697 / 213 | 8110 / 238 | 959 / 244 | 11486 / 324 | 6060 / 4088 | 5619 / 3647 | 16059 / 9286 | 15434 / 8660
LSU Busy | 190 / 58 | 2836 / 52 | 255 / 75 | 3996 / 69 | 602 / 554 | 424 / 376 | 2170 / 1242 | 1863 / 936
Fetch Wait | 138 / 10 | 640 / 15 | 217 / 19 | 892 / 19 | 145 / 189 | 79 / 128 | 173 / 403 | 95 / 319
Loads | 146 / 14 | 1872 / 48 | 195 / 15 | 2644 / 64 | 377 / 333 | 340 / 296 | 1368 / 809 | 1306 / 748
Stores | 44 / 44 | 964 / 4 | 60 / 60 | 1352 / 4 | 225 / 221 | 84 / 80 | 802 / 433 | 557 / 188
Jumps | 0 / 0 | 797 / 1 | 6 / 0 | 1141 / 1 | 82 / 36 | 62 / 16 | 98 / 36 | 78 / 16
Conditional Branches | 86 / 10 | 407 / 8 | 159 / 13 | 563 / 10 | 118 / 182 | 64 / 128 | 150 / 230 | 80 / 160
Taken Cond. Branches | 72 / 9 | 221 / 4 | 142 / 7 | 309 / 6 | 90 / 139 | 62 / 111 | 122 / 187 | 78 / 143
Compressed Instructions | 473 / 111 | 3381 / 23 | 628 / 71 | 4805 / 29 | 1921 / 2362 | 1793 / 2234 | 5626 / 4485 | 5472 / 4330
Multiplier Busy | 0 / 0 | 576 / 0 | 0 / 0 | 832 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0
Divider Busy | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0
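The clock cycle gains quoted in the text can be re-derived from the Clock Cycles row of Table II with a few lines of Python. The cycle counts below are copied from the table; the helper names and the definition of API overhead (the share of accelerated Full cycles spent outside the compression function) are assumptions made for this sketch, since the paper does not spell out how that percentage was computed.

```python
# Clock cycle counts copied from Table II as (baseline, with ISEs).
CYCLES = {
    "AES-128 K.S.": (1111, 291),
    "AES-128 Enc.": (13366, 314),
    "AES-256 K.S.": (1596, 351),
    "AES-256 Enc.": (18910, 422),
    "SHA-256 Full": (7007, 5049),
    "SHA-256 Comp.": (6248, 4295),
    "SHA-512 Full": (18650, 11197),
    "SHA-512 Comp.": (17550, 10091),
}

def speedup(name):
    """Baseline cycles divided by accelerated cycles for one function."""
    base, ise = CYCLES[name]
    return base / ise

def api_overhead(alg):
    """Assumed definition: fraction of accelerated Full cycles outside compression."""
    full = CYCLES[alg + " Full"][1]
    comp = CYCLES[alg + " Comp."][1]
    return (full - comp) / full

if __name__ == "__main__":
    for name in CYCLES:
        print(f"{name:14s} {speedup(name):6.2f}x")
    for alg in ("SHA-256", "SHA-512"):
        print(f"{alg} API overhead: {api_overhead(alg):.1%}")
```

Under this definition the overhead comes out at about 14.9% for SHA-256 and 9.9% for SHA-512, and the Full scenario speed-ups (about 1.39x and 1.67x) are somewhat lower than the Compression ones because the non-accelerated API code is unchanged.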

We set a target clock frequency of 500 MHz, which leads to a fair comparison considering both synthesized cores are reported to have 0 ns timing slack. We consider a timing worst-case PVT corner of slow process, 0.75 V and 125 °C for synthesis. We perform synthesis instantiating only the standardized performance counters, i.e., clock cycles and instructions retired.

TABLE III
SYNTHESIS RESULTS FOR THE BASELINE AND EXTENDED IBEX CORES

Synthesis results          Baseline   Extended
Cell Area (µm²)              11,238     12,446
Net Area (µm²)                6,992      7,279
Total Area (µm²)             18,230     19,726
Cell Instance Count          10,289     11,687
Equivalent NAND2 gates       34,433     38,132

From Table III, we most notably see that implementing the Zkne and Zknh extensions leads to a 10% increase in cell area. The benefits obtained through the ISEs in performance, energy and memory efficiency always proportionally outweigh the total area cost.

Gallmann et al. [18] report a 31.47 kGE value for a similar Ibex parameterization as our baseline case and tight timing constraints using a commercial synthesis tool. Nişancı et al. [15] report their higher-complexity 5-stage core to be of 19.71 kGE in cell area. No information concerning the timing constraints and PDK used was provided, only that open-source synthesis tools were used. Our baseline case is expected to present slightly higher area usage than [18], seeing as we use the SecureIbex configuration, due mainly to the ECC-protected register file. Due to the use of open-source synthesis tools, area results from [13, 14, 15, 16] should not be seen as accurate estimates for area overheads due to cryptography ISEs.

D. Energy Consumption Evaluation

We evaluate energy consumption via gate-level simulations of the synthesized cores as discussed in Section V-C. Switching activity during each function call is collected and fed into Genus for power analysis. We consider a nominal PVT corner of typical process, 0.9 V and 25 °C for power analysis. Additionally, we consider memory access energy costs obtained from CACTI [19]. For a 16 KB dual-port SRAM with low-power bit-cells in 28 nm technology, values of 63.362 fJ/bit for reads and 41.436 fJ/bit for writes are reported by the tool. Energy results are shown in Fig. 5.

Fig. 5. Energy Consumption Evaluation for the AES-128, AES-256, SHA-256 and SHA-512 algorithms

We can see that energy consumption is dominated by the RISC-V core rather than memory IO. Total energy efficiency gains of 27.81x, 28.91x, 1.45x, and 1.79x are observed for AES-128 Encryption, AES-256 Encryption, SHA-256 Compression, and SHA-512 Compression, respectively.

VI. CONCLUSION

In this work RISC-V scalar cryptography extensions have been demonstrated to greatly accelerate AES-128, AES-256, SHA-256, and SHA-512 in the context of the low-complexity 2-stage Ibex core. Providing a performance gain of 42.57x, 44.81x, 1.45x and 1.74x in terms of clock cycles for AES encryption and the SHA-2 compression functions, the combined Zkne and Zknh extensions come at a low area overhead of 10%. Furthermore, improvements in memory efficiency (considering code size, static data size and stack usage) of 4.16x, 4.16x, 1.58x and 1.63x have been measured, as well as gains in energy efficiency of 27.81x, 28.91x, 1.45x, and 1.79x.

REFERENCES

[1] M. Dworkin et al., “Advanced Encryption Standard (AES),” 2001-11-26. [Online]. Available: https://doi.org/10.6028/NIST.FIPS.197
[2] Q. Dang, “Secure Hash Standard (SHS),” 2012-03-06. [Online]. Available: https://doi.org/10.6028/NIST.FIPS.180-4
[3] RISC-V Foundation, “The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document Version 20191213,” December 2019. [Online]. Available: https://riscv.org/technical/specifications
[4] RISC-V Foundation, “RISC-V Cryptography Extensions Volume I: Scalar & Entropy Source Instructions, Document Version v1.0.1,” February 2022. [Online]. Available: https://github.com/riscv/riscv-crypto/releases/tag/v1.0.1-scalar
[5] lowRISC, “Ibex RISC-V Core,” 2018. [Online]. Available: https://github.com/lowRISC/ibex
[6] University of Bristol, “SCARV: a side-channel hardened RISC-V platform,” 2018. [Online]. Available: https://github.com/scarv/scarv
[7] ETH Zurich, “PULP platform,” 2016. [Online]. Available: https://github.com/pulp-platform/pulpino
[8] A. Zgheib et al., “Extending a RISC-V core with an AES hardware accelerator to meet IOT constraints,” in SMACD / PRIME, 2021, pp. 1–4. [Online]. Available: https://ieeexplore.ieee.org/document/9547962
[9] S. S. Philip et al., “Intermittent Computing Emulation of Ultralow-Power Processors: Evaluation of Backup Strategies for RISC-V,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 42, no. 1, pp. 82–94, 2023. [Online]. Available: https://doi.org/10.1109/TCAD.2022.3169108
[10] N. Dao et al., “FlexBex: A RISC-V with a Reconfigurable Instruction Extension,” in ICFPT, 2020, pp. 190–195. [Online]. Available: https://doi.org/10.1109/ICFPT51103.2020.00034
[11] lowRISC, “OpenTitan project,” 2017. [Online]. Available: https://github.com/lowRISC/opentitan
[12] Intel, “TinyCrypt Cryptographic Library,” 2017. [Online]. Available: https://github.com/intel/tinycrypt
[13] B. Marshall et al., “The design of scalar AES Instruction Set Extensions for RISC-V,” IACR Transactions on Cryptographic Hardware and Embedded Systems, pp. 109–136, Dec. 2020. [Online]. Available: https://doi.org/10.46586/tches.v2021.i1.109-136
[14] M.-J. O. Saarinen, “A Lightweight ISA Extension for AES and SM4,” 2020. [Online]. Available: https://arxiv.org/abs/2002.07041
[15] G. Nişancı et al., “Symmetric Cryptography on RISC-V: Performance Evaluation of Standardized Algorithms,” Cryptography, vol. 6, no. 3, p. 41, Aug. 2022. [Online]. Available: https://doi.org/10.3390/cryptography6030041
[16] B. Marshall et al., “Implementing the Draft RISC-V Scalar Cryptography Extensions,” in Hardware and Architectural Support for Security and Privacy, ser. HASP ’20. New York, NY, USA: Association for Computing Machinery, 2021. [Online]. Available: https://doi.org/10.1145/3458903.3458904
[17] C. Paar and J. Pelzl, Understanding Cryptography. Springer Berlin Heidelberg, 2010. [Online]. Available: https://doi.org/10.1007/978-3-642-04101-3
[18] N. Gallmann et al., “From Swift to Mighty: A Cost-Benefit Analysis of Ibex and CV32E40P Regarding Application Performance, Power and Area,” 2021. [Online]. Available: https://carrv.github.io/2021/papers/CARRV2021_paper_8_Gallmann.pdf
[19] R. Balasubramonian et al., “CACTI 7: New tools for interconnect exploration in innovative off-chip memories,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 14, no. 2, pp. 1–25, 2017. [Online]. Available: https://doi.org/10.1145/3085572
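
As a cross-check of the figures reported in Table III and in the energy evaluation of Section V-D, the relative area overheads and the per-access memory energy can be recomputed from the published numbers. The following Python sketch is illustrative only and is not the authors' tooling: all function and variable names are hypothetical, and it assumes that each memory access transfers one 32-bit word, which the text does not state explicitly.

```python
# Values copied from Table III: (baseline, extended) Ibex core.
TABLE_III = {
    "cell_area_um2": (11_238, 12_446),
    "net_area_um2": (6_992, 7_279),
    "total_area_um2": (18_230, 19_726),
    "cell_instances": (10_289, 11_687),
    "nand2_equivalents": (34_433, 38_132),
}

# CACTI figures quoted in Section V-D, in femtojoules per bit.
READ_FJ_PER_BIT = 63.362
WRITE_FJ_PER_BIT = 41.436

WORD_BITS = 32  # assumption: one 32-bit word per memory access


def overhead_percent(baseline, extended):
    """Relative increase of `extended` over `baseline`, in percent."""
    return 100.0 * (extended - baseline) / baseline


def access_energy_pj(fj_per_bit, bits=WORD_BITS):
    """Energy of a single memory access, in picojoules."""
    return fj_per_bit * bits / 1000.0


def memory_energy_pj(reads, writes):
    """Total memory energy in pJ for hypothetical read/write word counts."""
    return (reads * access_energy_pj(READ_FJ_PER_BIT)
            + writes * access_energy_pj(WRITE_FJ_PER_BIT))


if __name__ == "__main__":
    for metric, (base, ext) in TABLE_III.items():
        print(f"{metric:>18}: +{overhead_percent(base, ext):.1f}%")
    print(f"read:  {access_energy_pj(READ_FJ_PER_BIT):.3f} pJ per word")
    print(f"write: {access_energy_pj(WRITE_FJ_PER_BIT):.3f} pJ per word")
```

Under these assumptions the cell area grows by about 10.7% and the total area by about 8.2%, in line with the roughly 10% overhead quoted in the text, while a word-sized read or write costs about 2.03 pJ or 1.33 pJ, respectively. The read and write counts passed to `memory_energy_pj` would have to come from simulation traces, which are not reproduced here.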
