Supachai ECTI-CARD2014
Abstract
The Advanced Encryption Standard (AES) is a block
encryption algorithm established by the U.S. National Institute of
Standards and Technology (NIST) in 2001. It has been adopted by many
data security systems and is now used worldwide. Most AES
implementations target single-core processors. To achieve high
performance on large data, this work proposes an AES implementation for
multi-core processors. By exploiting the parallelism inherent in large
data, all cores work concurrently to speed up the task.
Keywords: cryptography; AES; multicore processor
1. Introduction
Information security has become an important concern today
due to the widespread use of computers. The AES algorithm is a standard
encryption algorithm with a symmetric key. It converts an
input plaintext into a ciphertext by repeated transformation rounds
with a cipher key of 128, 192 or 256 bits. Each round of the AES algorithm
performs a complex sequence of calculations, so the computational
load is quite high. To achieve high throughput, multiple processors are
required. This work proposes an AES implementation based on multicore
processors. Experiments are carried out to compare the throughput
of multicore processors with that of a single-core processor.
21-23 .. 2557
6
ECTI-CARD Proceedings 2014, Chiang Mai, Thailand
A. Sub-byte
Each byte aij in the state is replaced by its S-box value, S(aij). The S-box is
derived from the multiplicative inverse over GF(2^8).
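The S-box derivation can be sketched in a few lines. This is a minimal illustration, not the code used in this work: it assumes the standard AES reduction polynomial x^8 + x^4 + x^3 + x + 1 and the standard affine transformation (constant 0x63) applied after the inverse. The function names are chosen here for illustration only.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B  # reduce by the AES polynomial (0x11B without the top bit)
        b >>= 1
    return result


def gf_inverse(a):
    """Multiplicative inverse in GF(2^8); 0 maps to 0 by convention."""
    if a == 0:
        return 0
    # The multiplicative group has order 255, so a^254 is the inverse of a.
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox_entry(a):
    """Inverse in GF(2^8) followed by the standard AES affine transformation."""
    inv = gf_inverse(a)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


# Full 256-entry substitution table.
SBOX = [sbox_entry(a) for a in range(256)]
```

In practice the table is computed once and stored, so that sub-byte becomes a single memory lookup per byte.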
B. Shift-rows
The bytes in each row are shifted cyclically. The first row is shifted by an
offset of 0, the second row by an offset of one, and the third and fourth rows
by offsets of two and three.
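As a small sketch of this step, assuming the state is held as a list of four rows and that the cyclic shift is to the left (as in standard AES encryption), the transformation can be written as:

```python
def shift_rows(state):
    """Cyclically shift row r of a 4x4 state to the left by r positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


# Illustrative state whose entries are just their own index.
state = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
shifted = shift_rows(state)
```

The row-list layout is an assumption made for readability; an implementation on a 32-bit processor would more likely pack each row or column into one word.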
C. Mix-columns
Each column of the state is combined using an invertible linear
transformation. Each column is multiplied by the fixed matrix
\[
\begin{bmatrix}
2 & 3 & 1 & 1 \\
1 & 2 & 3 & 1 \\
1 & 1 & 2 & 3 \\
3 & 1 & 1 & 2
\end{bmatrix}
\]
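The column multiplication can be sketched as follows. This is an illustration only, assuming the standard AES field GF(2^8) with reduction polynomial x^8 + x^4 + x^3 + x + 1, where addition is XOR; the helper names are hypothetical.

```python
# Fixed mix-columns matrix (encryption direction).
MIX = [
    [2, 3, 1, 1],
    [1, 2, 3, 1],
    [1, 1, 2, 3],
    [3, 1, 1, 2],
]


def xtime(a):
    """Multiply a byte by 2 in GF(2^8)."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B  # reduce by the AES polynomial
    return a & 0xFF


def mul_coeff(byte, coeff):
    """Multiply a byte by a matrix coefficient (only 1, 2 and 3 occur)."""
    if coeff == 1:
        return byte
    if coeff == 2:
        return xtime(byte)
    if coeff == 3:
        return xtime(byte) ^ byte
    raise ValueError("unsupported coefficient")


def mix_column(col):
    """Multiply one 4-byte column by the fixed matrix; addition is XOR."""
    out = []
    for row in MIX:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= mul_coeff(byte, coeff)
        out.append(acc)
    return out


mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
```

Because the only coefficients are 1, 2 and 3, the whole step reduces to shifts and XORs, which map directly onto shl and xor style instructions.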
3. Multicore Processor
A multicore processor architecture [3] has two or more computing
units contained in a single physical processor, so it can run multiple
instructions at the same time. There are three types of system
architecture that use multiple computing units. A heterogeneous system
contains different types of cores. Asymmetric Multi-Processing (AMP)
contains two or more processors of the same type which run either
different operating systems or separate copies of the same operating
system. Symmetric Multi-Processing (SMP) has two or more CPUs of
the same type, like AMP, but all of them run
under the same operating system (Fig. 2). In this work, we use the S2
multicore processor to perform encryption and decryption on a shared
memory. The experiments use a 32-bit multicore processor
simulator, with a simple C-like language as the programming tool.
A. S2 multi-core processor
S2 [4] is a 32-bit CPU simulator that can be set up to simulate 2 to 8
cores. The processor has a three-address instruction format. The
instruction set is divided into four groups: arithmetic, logic,
control and data (Table 2).
Table 2. Instruction types of the S2 multi-core processor
Instruction type    Operation name
Arithmetic          add sub mul div mod
Logic               and or xor eq ne lt le gt ge shl shr
Control             jmp jt jf jal ret
Data                ld st push pop
B. RZ Language
RZ [5] is a small programming language that is similar to a subset of
the C language. The AES source code is developed and tested on the S2
simulator. The simulator is accurate to the clock cycle. The compiler and
simulator are publicly available.
4. Related work
In this section, we summarize several related works on AES
implementation on CPUs and GPUs. Many of them present GPU and
hardware implementations.
Barnes [6] implemented the AES algorithm on a multicore processor
using the fork and pthread functions to improve throughput, achieving
6637 Mb/s on a 32-core processor with the pthread approach. Many works have
used GPUs. Manavski [7] used an Nvidia GeForce 8800GTX to
compute AES with CUDA technology. Its 32-bit processing units can
apply sub-byte to 4 bytes in the same instruction, and the same holds for
add-round-key. The GPU therefore performs sub-byte and add-round-key 4 times per
round, whereas a CPU needs 16 times for sub-byte and add-round-key. He
achieved a throughput of 8.28 Gbit/s on 8 MB of data with 128-bit AES. In
2010, Nhat-Phuong Tran [8] presented a work that increases the size of
the AES block beyond the standard 16 bytes to reduce the overhead of data
transfer in memory. The extended block size can increase the
encryption speed by 25% to 28%, and by 603% to 853% for the GPU
implementation. However, it does not work well for small data.
Huang, Chang, Lin and Tai [9] presented a 32-bit AES on a Xilinx FPGA
(Spartan-3 XC3S200) with a throughput of 647 Mb/s. Current processors
have special instruction set extensions for AES; for example, Intel
Core i3/i5/i7 CPUs support the AES-NI instruction set extensions, with which
throughput can exceed 700 MB/s per thread [10].
This work proposes an AES implementation for multicore
processors that divides the data among the cores. The data is split into
16-byte blocks in the main memory. Each block of data is fed to a core
to run AES encryption. At the end, the ciphered blocks from all cores
are merged (see Fig. 3).
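The split, dispatch and merge flow can be sketched as below. This is not the RZ program used in this work. The block cipher is replaced by a hypothetical placeholder (`placeholder_encrypt_block`, a byte-wise XOR that is not a cipher), worker threads stand in for cores, and the input length is assumed to be a multiple of 16 bytes because padding is not described here.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 16  # AES block size in bytes


def split_blocks(data):
    """Split the input into 16-byte blocks (length must be a multiple of 16)."""
    if len(data) % BLOCK_SIZE != 0:
        raise ValueError("input length must be a multiple of 16 bytes")
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def placeholder_encrypt_block(block):
    """Hypothetical stand-in for AES block encryption; NOT a real cipher."""
    return bytes(b ^ 0xFF for b in block)


def encrypt_parallel(data, num_cores, encrypt_block=placeholder_encrypt_block):
    """Feed each block to a worker and merge the results in the original order."""
    blocks = split_blocks(data)
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        # pool.map returns results in input order, so the merge is a plain join.
        cipher_blocks = list(pool.map(encrypt_block, blocks))
    return b"".join(cipher_blocks)


data = bytes(range(64))          # four 16-byte blocks
cipher = encrypt_parallel(data, num_cores=4)
```

Because every block is processed independently, the merged output does not depend on the number of workers; only the elapsed time and the memory contention change.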
[Fig. 3. Diagram labels: Input File, Shared Memory, Blockdata to BlockdataN, Core1, Core2 to CoreN, AES Algorithm, Shared Memory, Cipher File]
Table 4. Total number of memory stall cycles (note that this is the sum
of the stalls of all cores, hence it may be larger than the number of CPU
cycles)
Data (Byte)   Core 1   Core 2   Core 3   Core 4
32            10346    19459
64            19444    35475    62383    81733
128           37640    67695    119257   150909
256           74000    131995   222702   289148
[Fig. 4. Flow of data collection. Diagram labels: Input File, Split File, AES Algorithm, Measurement, Throughput and Resource]
The experiments compare the AES program on single-core and multi-core
processors. The flow of data collection is shown in
Fig. 4. All experiments are performed under the same conditions, using the
same program (except for the number of cores) and hardware. The
multicore simulator can monitor the number of instructions executed in
each core. Each instruction is assumed to take one clock cycle, except
for memory access, which takes two cycles. When more than one
core accesses the memory at the same time, only one core is granted access;
the other cores are stalled until the first core finishes. Instruction fetch is
assumed to have no conflict. The graph in Fig. 5 shows the experiment
with multi-core configurations. The results show that the speed of AES
encryption increases when more cores are used. However, the
number of cycles with memory conflicts also increases. Therefore, there is
no further speed-up when using 3 and 4 cores. We
conclude that for this task, two cores is the best configuration (see Tables
3 and 4 and Fig. 5).
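The growth in memory stalls can be read directly from Table 4. The short sketch below only does arithmetic on the published values. It assumes that the columns labelled Core 1 to Core 4 correspond to configurations with one to four cores, and it expresses each entry relative to the first column.

```python
# Memory stall cycles from Table 4, keyed by data size in bytes.
# Assumption: the k-th value in each row is the k-core configuration.
STALL_CYCLES = {
    32: [10346, 19459],
    64: [19444, 35475, 62383, 81733],
    128: [37640, 67695, 119257, 150909],
    256: [74000, 131995, 222702, 289148],
}


def stall_growth(row):
    """Express each stall count as a multiple of the first entry in the row."""
    base = row[0]
    return [round(value / base, 2) for value in row]


growth = {size: stall_growth(row) for size, row in STALL_CYCLES.items()}
for size, ratios in growth.items():
    print(size, ratios)
```

Under this reading, the total stall count grows roughly fourfold between the first and the fourth column for the 64, 128 and 256 byte inputs, which is consistent with the observation that memory conflicts limit the speed-up beyond two cores.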
References
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10] McWilliams, Grant (6 July 2011). "Hardware AES Showdown VIA Padlock vs Intel AES-NI vs AMD Hexacore". http://grantmcwilliams.com/tech/technology/item/532-hardwareaes-showdown-via-padlock-vs-intel-aes-ni-vs-amd-hexacore. Retrieved 10 February 2014.