68 results sorted by ID
Optimized One-Dimensional SQIsign Verification on Intel and Cortex-M4
Marius A. Aardal, Gora Adj, Arwa Alblooshi, Diego F. Aranha, Isaac A. Canales-Martínez, Jorge Chavez-Saab, Décio Luiz Gazzoni Filho, Krijn Reijnders, Francisco Rodríguez-Henríquez
Public-key cryptography
SQIsign is a well-known post-quantum signature scheme due to its small combined signature and public-key size. However, SQIsign suffers from notably long signing times, and verification times are not short either. To improve this, recent research has explored both one-dimensional and two-dimensional variants of SQIsign, each with distinct characteristics. In particular, SQIsign2D's efficient signing and verification times have made it a focal point of recent research. However, the absence of...
MAYO Key Recovery by Fixing Vinegar Seeds
Sönke Jendral, Elena Dubrova
Attacks and cryptanalysis
As the industry prepares for the transition to post-quantum secure public key cryptographic algorithms, vulnerability analysis of their implementations is gaining importance. A theoretically secure cryptographic algorithm should also be able to withstand the challenges of physical attacks in real-world environments. MAYO is a candidate in the ongoing first round of the NIST post-quantum standardization process for selecting additional digital signature schemes. This paper demonstrates three...
Mystrium: Wide Block Encryption Efficient on Entry-Level Processors
Parisa Amiri Eliasi, Koustabh Ghosh, Joan Daemen
Secret-key cryptography
We present a tweakable wide block cipher called Mystrium and show it as the fastest such primitive on low-end processors that lack dedicated AES or other cryptographic instructions, such as ARM Cortex-A7.
Mystrium is based on the provably secure double-decker mode, that requires a doubly extendable cryptographic keyed (deck) function and a universal hash function.
We build a new deck function called Xymmer that for its compression part uses Multimixer-128, the fastest universal hash for...
Reality Check on Side-Channels: Lessons learnt from breaking AES on an ARM Cortex A processor
Shivam Bhasin, Harishma Boyapally, Dirmanto Jap
Attacks and cryptanalysis
AES implementations have been extensively analysed against side-channel attacks over the last two decades, particularly on resource-constrained microcontrollers. Still, less research has been conducted on AES implementations on advanced hardware platforms. In this study, we examine the resilience of AES on an ARM Cortex A72 processor within the Raspberry Pi 4B model. Unlike their microcontroller counterparts, these platforms operate within the complex ecosystem of an operating system (OS),...
Reading It like an Open Book: Single-trace Blind Side-channel Attacks on Garbled Circuit Frameworks
Sirui Shen, Chenglu Jin
Attacks and cryptanalysis
Garbled circuits (GC) are a secure multiparty computation protocol that enables two parties to jointly compute a function using their private data without revealing it to each other. While garbled circuits are proven secure at the protocol level, implementations can still be vulnerable to side-channel attacks. Recently, side-channel analysis of GC implementations has garnered significant interest from researchers.
We investigate popular open-source GC frameworks and discover that the AES...
Exploiting the Central Reduction in Lattice-Based Cryptography
Tolun Tosun, Amir Moradi, Erkay Savas
Attacks and cryptanalysis
This paper questions the side-channel security of the central reduction technique, which is widely adopted in efficient implementations of Lattice-Based Cryptography (LBC). We show that the central reduction leads to a vulnerability by creating a strong dependency between the power consumption and the sign of sensitive intermediate values. We exploit this dependency by introducing the novel absolute value prediction function, which can be employed in higher-order non-profiled multi-query...
Cache Side-Channel Attacks Through Electromagnetic Emanations of DRAM Accesses
Julien Maillard, Thomas Hiscock, Maxime Lecomte, Christophe Clavier
Attacks and cryptanalysis
Remote side-channel attacks on processors exploit hardware and micro-architectural effects observable from software measurements. So far, the analysis of micro-architectural leakages over physical side-channels (power consumption, electromagnetic field) received little treatment. In this paper, we argue that those attacks are a serious threat, especially against systems such as smartphones and Internet-of-Things (IoT) devices which are physically exposed to the end-user. Namely, we show that...
MiRitH: Efficient Post-Quantum Signatures from MinRank in the Head
Gora Adj, Stefano Barbero, Emanuele Bellini, Andre Esser, Luis Rivera-Zamarripa, Carlo Sanna, Javier Verbel, Floyd Zweydinger
Public-key cryptography
Since 2016’s NIST call for standardization of post-quantum cryptographic primitives, developing efficient post-quantum secure digital signature schemes has become a highly active area of research. The difficulty in constructing such schemes is evidenced by NIST reopening the call in 2022 for digital signature schemes, because of missing diversity in existing proposals. In this work, we introduce the new post-quantum digital signature scheme MiRitH. As direct successor of a scheme recently...
Leaking Secrets in Homomorphic Encryption with Side-Channel Attacks
Furkan Aydin, Aydin Aysu
Homomorphic encryption (HE) allows computing on encrypted data in the ciphertext domain without knowing the encryption key. It is possible, however, to break fully homomorphic encryption (FHE) algorithms by using side channels. This article demonstrates side-channel leakages of the Microsoft SEAL HE library. The proposed attack can steal encryption keys during the key generation phase by abusing the leakage of ternary value assignments that occurs during the number theoretic transform (NTT)...
Threshold Implementations in Software: Micro-architectural Leakages in Algorithms
John Gaspoz, Siemen Dhooghe
Implementation
This paper provides necessary properties to algorithmically secure first-order maskings in scalar micro-architectures. The security notions of threshold implementations are adapted following micro-processor leakage effects which are known to the literature. The resulting notions, which are based on the placement of shares, are applied to a two-share randomness-free PRESENT cipher and Keccak-f. The assembly implementations are put on a RISC-V and an ARM Cortex-M4 core. All designs are...
Hybrid scalar/vector implementations of Keccak and SPHINCS+ on AArch64
Hanno Becker, Matthias J. Kannwischer
Implementation
This paper presents two new techniques for the fast implementation of the Keccak permutation on the A-profile of the Arm architecture: First, the elimination of explicit rotations in the Keccak permutation through Barrel shifting, applicable to scalar AArch64 implementations of Keccak-f1600. Second, the construction of hybrid implementations concurrently leveraging both the scalar and the Neon instruction sets of AArch64. The resulting performance improvements are demonstrated in the example...
Efficient Constant-Time Implementation of SM4 with Intel GFNI instruction set extension and Arm NEON coprocessor
Weiji Guo
Implementation
The efficiency of constant-time SM4 implementation has been lagging behind that of AES for most internet traffic and applicable data encryption scenarios. The best performance before our works was 3.77 cpb for x86 platform (AESNI + AVX2), and 8.62 cpb for Arm platform (NEON). Meanwhile the state of art constant-time AES implementation could reach 0.63 cpb. Dedicated SM4 instruction set extensions like those optionally available in Armv8.2, could achieve comparable cpb to AES. But they are...
Efficient Multiplication of Somewhat Small Integers using Number-Theoretic Transforms
Hanno Becker, Vincent Hwang, Matthias J. Kannwischer, Lorenz Panny, Bo-Yin Yang
Implementation
Conventional wisdom purports that FFT-based integer multiplication methods (such as the Schönhage-Strassen algorithm) begin to compete with Karatsuba and Toom-Cook only for integers of several tens of thousands of bits. In this work, we challenge this belief, leveraging recent advances in the implementation of number-theoretic transforms (NTT) stimulated by their use in post-quantum cryptography. We report on implementations of NTT-based integer arithmetic on two Arm Cortex-M CPUs on...
Benchmarking and Analysing the NIST PQC Lattice-Based Signature Schemes Standards on the ARM Cortex M7
James Howe, Bas Westerbaan
Implementation
This paper presents an analysis of the two lattice-based digital signature schemes, Dilithium and Falcon, which have been chosen by NIST for standardisation, on the ARM Cortex M7 using the STM32F767ZI NUCLEO-144 development board. This research is motivated by the ARM Cortex M7 device being the only processor in the Cortex-M family to offer a double precision (i.e., 64-bit) floating-point unit, making Falcon's implementations, requiring 53 bits of double precision, able to fully run native...
On the Performance Gap of a Generic C Optimized Assembler and Wide Vector Extensions for Masked Software with an Ascon-p test case
Dor Salomon, Itamar Levi
Implementation
Efficient implementations of software masked designs constitute both an important goal and a significant challenge to Side Channel Analysis attack (SCA) security. In this paper we discuss the shortfall between generic C implementations and optimized (inline-) assembly versions while providing a large spectrum of efficient and generic masked implementations for any order, and demonstrate cryptographic algorithms and masking gadgets with reference to the state of the art. Our main goal is to...
Compressed SIKE Round 3 on ARM Cortex-M4
Mila Anastasova, Mojtaba Bisheh-Niasar, Reza Azarderakhsh, Mehran Mozaffari Kermani
Implementation
In 2016, the National Institute of Standards and Technology (NIST) initiated a standardization process among the post-quantum secure algorithms. Forming part of the alternate group of candidates after Round 2 of the process is the Supersingular Isogeny Key Encapsulation (SIKE) mechanism which attracts with the smallest key sizes offering post-quantum security in scenarios of limited bandwidth and memory resources. Even further reduction of the exchanged information is offered by the...
Curve448 on 32-bit ARM Cortex-M4
Hwajeong Seo, Reza Azarderakhsh
Implementation
Public key cryptography is widely used in key exchange and digital signature protocols. Public key cryptography requires expensive primitive operations, such as finite-field and group operations. These finite-field and group operations require a number of clock cycles to execute. By carefully optimizing these primitive operations, public key cryptography can be performed with reasonably fast execution timing. In this paper, we present the new implementation result of Curve448 on 32-bit ARM...
All the Polynomial Multiplication You Need on RISC-V
Hwajeong Seo, Hyeokdong Kwon, Siwoo Eum, Kyungbae Jang, Hyunjun Kim, Hyunji Kim, Minjoo Sim, Gyeongju Song, Wai-Kong Lee
Implementation
Polynomial multiplication is a core operation for public key cryptography, such as pre-quantum cryptography (e.g. elliptic curve cryptography) and post-quantum cryptography (e.g. code-based cryptography and multivariate-based cryptography).
For this reason, the efficient and secure implementation of polynomial multiplication has been actively conducted for high availability and security level in application services.
In this paper, we present all polynomial multiplication methods on modern...
Polynomial multiplication on embedded vector architectures
Hanno Becker, Jose Maria Bermudo Mera, Angshuman Karmakar, Joseph Yiu, Ingrid Verbauwhede
Public-key cryptography
High-degree, low-precision polynomial arithmetic is a fundamental computational primitive underlying structured lattice based cryptography. Its algorithmic properties and suitability for implementation on different compute platforms is an active area of research, and this article contributes to this line of work: Firstly, we present memory-efficiency and performance improvements for the Toom-Cook/Karatsuba polynomial multiplication strategy. Secondly, we provide implementations of those...
ROTed: Random Oblivious Transfer for embedded devices
Pedro Branco, Luís Fiolhais, Manuel Goulão, Paulo Martins, Paulo Mateus, Leonel Sousa
Cryptographic protocols
Oblivious Transfer (OT) is a fundamental primitive in cryptography, supporting protocols such as Multi-Party Computation and Private Set Intersection (PSI), that are used in applications like contact discovery, remote diagnosis and contact tracing. Due to its fundamental nature, it is utterly important that its execution is secure even if arbitrarily composed with other instances of the same, or other protocols. This property can be guaranteed by proving its security under the Universal...
CARiMoL: A Configurable Hardware Accelerator for Ring and Module Lattice-Based Post-Quantum Cryptography
Afifa Ishtiaq, Dr. Muhammad Shafique, Dr. Osman Hassan
Implementation
CARiMoL is a novel run-time Configurable Hardware Accelerator for Ring and Module Lattice-based post-quantum cryptography. Its flexible design can be configured for key-pair generation, encapsulation, and decapsulation for NewHope and CRYSTALS-Kyber schemes using the same hardware. CARiMoL offers run-time configurability for multiple security levels of NewHope and CRYSTALS-Kyber schemes, supporting both Chosen-Plaintext Attack (CPA) and Chosen-Ciphertext Attack (CCA) secure...
Optimized Implementation of SM4 on AVR Microcontrollers, RISC-V Processors, and ARM Processors
Hyeokdong Kwon, Hyunjun Kim, Siwoo Eum, Minjoo Sim, Hyunji Kim, Wai-Kong Lee, Zhi Hu, Hwajeong Seo
Implementation
The SM4 block cipher is a Chinese domestic cryptographic algorithm that was introduced in 2003. Since the algorithm was developed for use in wireless sensor networks, it is mandated in the Chinese National Standard for Wireless LAN WAPI (WLAN Authentication and Privacy Infrastructure). The SM4 block cipher uses a 128-bit block size and a 32-bit round key. It consists of 32 rounds and one reverse translation R. In this paper, we present the optimized implementation of the SM4 block...
Kyber on ARM64: Compact Implementations of Kyber on 64-bit ARM Cortex-A Processors
Pakize Sanal, Emrah Karagoz, Hwajeong Seo, Reza Azarderakhsh, Mehran Mozaffari-Kermani
Implementation
Public-key cryptography based on the lattice problem is efficient and believed to be secure in a post-quantum era. In this paper, we introduce carefully optimized implementations of Kyber encryption schemes for 64-bit ARM Cortex-A processors. Our research contribution includes several optimizations for Number Theoretic Transform (NTT), noise sampling, and AES accelerator based symmetric function implementations. The proposed Kyber512 implementation on ARM64 improved previous works by 1.72×,...
No Silver Bullet: Optimized Montgomery Multiplication on Various 64-bit ARM Platforms
Hwajeong Seo, Pakize Sanal, Wai-Kong Lee, Reza Azarderakhsh
Implementation
In this paper, we first present optimized implementations of Montgomery multiplication on 64-bit ARM processors by taking advantage of the Karatsuba algorithm and efficient multiplication instruction sets for ARM64 architectures.
The implementation of Montgomery multiplication can directly improve the performance of (pre-quantum and post-quantum) public key cryptography (e.g. CSIDH, ECC, and RSA) implementations on ARM64 architectures. Last but not least, the performance of Karatsuba...
Classic McEliece Implementation with Low Memory Footprint
Johannes Roth, Evangelos Karatsiolis, Juliane Krämer
Public-key cryptography
The Classic McEliece cryptosystem is one of the most trusted quantum-resistant cryptographic schemes. Deploying it in practical applications, however, is challenging due to the size of its public key. In this work, we bridge this gap. We present an implementation of Classic McEliece on an ARM Cortex-M4 processor, optimized to overcome memory constraints. To this end, we present an algorithm to retrieve the public key ad-hoc. This reduces memory and storage requirements and enables the...
Optimized Software Implementations for the Lightweight Encryption Scheme ForkAE
Arne Deprez, Elena Andreeva, Jose Maria Bermudo Mera, Angshuman Karmakar, Antoon Purnal
Secret-key cryptography
In this work we develop optimized software implementations for ForkAE, a second round candidate in the ongoing NIST lightweight cryptography standardization process. Moreover, we analyze the performance and efficiency of different ForkAE implementations on two embedded platforms: ARM Cortex-A9 and ARM Cortex-M0. First, we study portable ForkAE implementations. We apply a decryption optimization technique which allows us to accelerate decryption by up to 35%.
Second, we go on to explore...
Fixslicing AES-like Ciphers: New bitsliced AES speed records on ARM-Cortex M and RISC-V
Alexandre Adomnicai, Thomas Peyrin
Implementation
The fixslicing implementation strategy was originally introduced as a new representation for the hardware-oriented GIFT block cipher to achieve very efficient software constant-time implementations. In this article, we show that the fundamental idea underlying the fixslicing technique is not of interest only for GIFT, but can be applied to other ciphers as well. Especially, we study the benefits of fixslicing in the case of AES and show that it allows to reduce by 52% the amount of...
A Fast and Compact RISC-V Accelerator for Ascon and Friends
Stefan Steinegger, Robert Primas
Implementation
Ascon-p is the core building block of Ascon, the winner in the lightweight category
of the CAESAR competition. With ISAP, another Ascon-p-based AEAD scheme is currently competing
in the 2nd round of the NIST lightweight cryptography standardization project.
In contrast to Ascon, ISAP focuses on providing hardening/protection against a large
class of implementation attacks, such as DPA, DFA, SFA, and SIFA, entirely on mode-level.
Consequently, Ascon-p can be used to realize a wide range of...
Second-Order Masked Lookup Table Compression Scheme
Annapurna Valiveti, Srinivas Vivek
Implementation
Masking by lookup table randomisation is a well-known technique used to achieve side-channel attack resistance for software implementations, particularly, against DPA attacks. The randomised table technique for first- and second-order security requires about m * 2^n bits of RAM to store an (n, m)-bit masked S-box lookup table. Table compression helps in reducing the amount of memory required, and this is useful for highly resource-constrained IoT devices. Recently, Vadnala (CT-RSA 2017)...
Compact domain-specific co-processor for accelerating module lattice-based key encapsulation mechanism
Jose Maria Bermudo Mera, Furkan Turan, Angshuman Karmakar, Sujoy Sinha Roy, Ingrid Verbauwhede
Implementation
We present a domain-specific co-processor to speed up Saber, a post-quantum key encapsulation mechanism competing on the NIST Post-Quantum Cryptography standardization process. Contrary to most lattice-based schemes, Saber doesn’t use NTT-based polynomial multiplication. We follow a hardware-software co-design approach: the execution is performed on an ARM core and only the most computationally expensive operation, i.e., polynomial multiplication, is offloaded to the co-processor to obtain a...
Post-Quantum TLS on Embedded Systems
Kevin Bürstinghaus-Steinbach, Christoph Krauß, Ruben Niederhagen, Michael Schneider
Implementation
We present our integration of post-quantum cryptography (PQC), more specifically of the post-quantum KEM scheme Kyber for key establishment and the post-quantum signature scheme SPHINCS$^+$, into the embedded TLS library mbed TLS. We measure the performance of these post-quantum primitives on four different embedded platforms with three different ARM processors and an Xtensa LX6 processor. Furthermore, we compare the performance of our experimental PQC cipher suite to a classical TLS variant...
Rank-metric Encryption on Arm-Cortex M0
Ameirah al Abdouli, Emanuele Bellini, Florian Caullery, Marc Manzano, Victor Mateu
Implementation
Since its invention by McEliece in 1978, cryptography based on Error Correcting Codes (ECC) has suffered from the reputation of not being suitable for constrained devices. Indeed, McEliece's scheme and its variants have large public keys and relatively long ciphertexts.
Recent works on these downsides explored the possible use of ECC based on rank metric instead of Hamming metric.
These codes were introduced in the late 80's to eliminate errors with repeating patterns, regardless of their...
Efficient and secure software implementations of Fantomas
Rafael J. Cruz, Antonio Guimarães, Diego F. Aranha
Implementation
In this paper, the efficient software implementation and side-channel resistance of the LS-Design construction is studied through a series of software implementations of the Fantomas block cipher, one of its most prominent instantiations. Target platforms include resource-constrained ARM devices like the Cortex-M3 and M4, and more powerful processors such as the ARM Cortex-A15 and modern Intel platforms. The implementations span a broad range of characteristics: 32-bit and 64-bit versions,...
Optimized implementation of the NIST PQC submission ROLLO on microcontroller
Jérôme Lablanche, Lina Mortajine, Othman Benchaalal, Pierre-Louis Cayrel, Nadia El Mrabet
Implementation
We present in this paper an efficient implementation of the code-based cryptosystem ROLLO, a candidate in the NIST PQC project, on a device available on the market. This implementation benefits from the existing hardware by using a crypto co-processor contained in an already deployed microcontroller to speed up operations in $\mathbb{F}_{2^m}$. Optimizations are then made on operations in $\mathbb{F}_{2^m}^n$. Finally, the cryptosystem outperforms the public key exchange protocol ECDH for a...
Optimized SIKE Round 2 on 64-bit ARM
Hwajeong Seo, Amir Jalali, Reza Azarderakhsh
Implementation
In this work, we present the first highly-optimized implementation
of Supersingular Isogeny Key Encapsulation (SIKE) submitted
to NIST's second round of post quantum standardization process,
on 64-bit ARMv8 processors. To the best of our knowledge, this work
is the first optimized implementation of SIKE round 2 on 64-bit ARM
over SIKEp434 and SIKEp610. The proposed library is explicitly optimized
for these two security levels and provides constant-time implementation
of the SIKE mechanism on...
Optimized Supersingular Isogeny Key Encapsulation on ARMv8 Processors
Amir Jalali, Reza Azarderakhsh, Mehran Mozaffari Kermani, Matthew Campagna, David Jao
Public-key cryptography
In this work, we present highly-optimized constant-time software libraries for Supersingular Isogeny Key Encapsulation (SIKE) protocol on ARMv8 processors. Our optimized hand-crafted assembly libraries provide the most efficient timing results on 64-bit ARM-powered devices. Moreover, the presented libraries can be integrated into any other cryptography primitives targeting the same finite field size. We design a new mixed implementation of field arithmetic on 64-bit ARM processors by...
Towards Optimized and Constant-Time CSIDH on Embedded Devices
Amir Jalali, Reza Azarderakhsh, Mehran Mozaffari Kermani, David Jao
Public-key cryptography
We present an optimized, constant-time software library for commutative supersingular isogeny Diffie-Hellman key exchange (CSIDH) proposed by Castryck et al. which targets 64-bit ARM processors. The proposed library is implemented based on highly-optimized field arithmetic operations and computes the entire key exchange in constant-time. The proposed implementation is resistant to timing attacks. We adopt optimization techniques to evaluate the highest performance CSIDH on ARM-powered...
FPGA-based High-Performance Parallel Architecture for Homomorphic Computing on Encrypted Data
Sujoy Sinha Roy, Furkan Turan, Kimmo Jarvinen, Frederik Vercauteren, Ingrid Verbauwhede
Implementation
Homomorphic encryption is a tool that enables computation on encrypted data and thus has applications in privacy-preserving cloud computing. Though conceptually amazing, implementation of homomorphic encryption is very challenging and typically software implementations on general purpose computers are extremely slow. In this paper we present our year-long effort to design a domain specific architecture in a heterogeneous Arm+FPGA platform to accelerate homomorphic computing on encrypted...
Deep Learning to Evaluate Secure RSA Implementations
Mathieu Carbone, Vincent Conin, Marie-Angela Cornelie, Francois Dassance, Guillaume Dufresne, Cecile Dumas, Emmanuel Prouff, Alexandre Venelli
Implementation
This paper presents the results of several successful profiled side-channel attacks against a secure implementation of the RSA algorithm. The implementation was running on an ARM Core SC 100 completed with a certified EAL4+ arithmetic co-processor. The analyses have been conducted by three experts' teams, each working on a specific attack path and exploiting information extracted either from the electromagnetic emanation or from the power consumption. Particular attention is paid to the...
Adiantum: length-preserving encryption for entry-level processors
Paul Crowley, Eric Biggers
We present HBSH, a simple construction for tweakable length-preserving encryption which
supports the fastest options for hashing and stream encryption for processors
without AES or other crypto instructions, with a provable
quadratic advantage bound. Our composition Adiantum uses NH, Poly1305, XChaCha12,
and a single AES invocation. On an ARM Cortex-A7 processor, Adiantum decrypts
4096-byte messages at 10.6 cycles per byte, over five times faster than
AES-256-XTS, with a constant-time...
SIDH on ARM: Faster Modular Multiplications for Faster Post-Quantum Supersingular Isogeny Key Exchange
Hwajeong Seo, Zhe Liu, Patrick Longa, Zhi Hu
Implementation
We present high-speed implementations of the post-quantum supersingular isogeny Diffie-Hellman key exchange (SIDH) and the supersingular isogeny key encapsulation (SIKE) protocols for 32-bit ARMv7-A processors with NEON support. The high performance of our implementations is mainly due to carefully optimized multiprecision and modular arithmetic that finely integrates both ARM and NEON instructions in order to reduce the number of pipeline stalls and memory accesses, and a new Montgomery...
Assessing the Feasibility of Single Trace Power Analysis of Frodo
Joppe W. Bos, Simon Friedberger, Marco Martinoli, Elisabeth Oswald, Martijn Stam
Implementation
Lattice-based schemes are among the most promising post-quantum schemes, yet the effect of both parameter and implementation choices on their side-channel resilience is still poorly understood. Aysu et al. (HOST'18) recently investigated single-trace attacks against the core lattice operation, namely multiplication between a public matrix and a "small" secret vector, in the context of a hardware implementation. We complement this work by considering single-trace attacks against software...
Vectorizing Higher-Order Masking
Benjamin Grégoire, Kostas Papagiannopoulos, Peter Schwabe, Ko Stoffelen
Implementation
The cost of higher-order masking as a countermeasure against side-channel attacks is often considered too high for practical scenarios, as protected implementations become very slow. At Eurocrypt 2017, the bounded moment leakage model was proposed to study the (theoretical) security of parallel implementations of masking schemes. Work at CHES 2017 then brought this to practice by considering an implementation of AES with 32 shares, bitsliced inside 32-bit registers of ARM Cortex-M...
Compact Energy and Delay-Aware Authentication
Muslum Ozgur Ozmen, Rouzbeh Behnia, Attila A. Yavuz
Authentication and integrity are fundamental security services that are critical for any viable system. However, some of the emerging systems (e.g., smart grids, aerial drones) are delay-sensitive, and therefore their safe and reliable operation requires delay-aware authentication mechanisms. Unfortunately, the current state-of-the-art authentication mechanisms either incur heavy computations or lack scalability for such large and distributed systems. Hence, there is a crucial need for...
Micro-Architectural Power Simulator for Leakage Assessment of Cryptographic Software on ARM Cortex-M3 Processors
Yann Le Corre, Johann Großschädl, Daniel Dinu
Implementation
Masking is a common technique to protect software implementations of symmetric cryptographic algorithms against Differential Power Analysis (DPA) attacks. The development of a properly masked version of a block cipher is an incremental and time-consuming process since each iteration of the development cycle involves a costly leakage assessment. To achieve a high level of DPA resistance, the architecture-specific leakage properties of the target processor need to be taken into account....
ARM2GC: Succinct Garbled Processor for Secure Computation
Ebrahim M Songhori, M Sadegh Riazi, Siam U Hussain, Ahmad-Reza Sadeghi, Farinaz Koushanfar
We present ARM2GC, a novel secure computation framework based on Yao’s Garbled Circuit (GC) protocol and the ARM processor. It allows users to develop privacy-preserving applications using standard high-level programming languages (e.g., C) and compile them using off-the-shelf ARM compilers (e.g., gcc-arm). The main enabler of this framework is the introduction of Skip-Gate, an algorithm that dynamically omits the communication and encryption cost of the gates whose outputs are independent...
Post-Quantum Key Exchange on ARMv8-A -- A New Hope for NEON made Simple
Silvan Streit, Fabrizio De Santis
Implementation
NewHope and NewHope-Simple are two recently proposed post-quantum key exchange protocols based on the hardness of the Ring-LWE problem. Due to their high security margins and performance, there have already been discussions and proposals for integrating them into Internet standards, like TLS, and anonymity network protocols, like Tor. In this work, we present constant-time and vector-optimized implementations of NewHope and NewHope-Simple for ARMv8-A 64-bit processors which target high-speed...
TruSpy: Cache Side-Channel Information Leakage from the Secure World on ARM Devices
Ning Zhang, Kun Sun, Deborah Shands, Wenjing Lou, Y. Thomas Hou
Implementation
As smart, embedded devices are increasingly integrated into our daily life, the security of these devices has become a major concern. The ARM processor family, which powers more than 60% of embedded devices, introduced TrustZone technology to offer security protection via an isolated execution environment called secure world. Caches in TrustZone-enabled processors are extended with a non-secure (NS) bit to indicate whether a cache line is used by the secure world or the normal world. This...
NewHope on ARM Cortex-M
Erdem Alkim, Philipp Jakubeit, Peter Schwabe
Implementation
Recently, Alkim, Ducas, Pöppelmann, and Schwabe proposed a Ring-LWE-based
key exchange protocol called NewHope (USENIX Security 2016) and illustrated that
this protocol is very efficient on large Intel processors.
Their paper also claims that the parameter choice enables efficient
implementation on small embedded processors.
In this paper we show that these claims are actually correct
and present NewHope software for the ARM Cortex-M family of
32-bit microcontrollers. More specifically, our...
NEON-SIDH: Efficient Implementation of Supersingular Isogeny Diffie-Hellman Key-Exchange Protocol on ARM
Brian Koziel, Amir Jalali, Reza Azarderakhsh, Mehran Mozaffari Kermani, David Jao
In this paper, we investigate the efficiency of implementing a post-quantum key-exchange protocol over isogenies (PQCrypto 2011) on ARM-powered embedded platforms. We propose to employ new primes to speed up constant-time finite field arithmetic and perform isogenies quickly. Montgomery multiplication and reduction are employed to produce a speedup of 3 over the GNU Multiprecision Library. For curve arithmetic, a uniform differential addition scheme for double point multiplication and...
FourQNEON: Faster Elliptic Curve Scalar Multiplications on ARM Processors
Patrick Longa
Implementation
We present a high-speed, high-security implementation of the recently proposed elliptic curve FourQ (ASIACRYPT 2015) for 32-bit ARM processors with NEON support. Exploiting the versatile and compact arithmetic of this curve, we design a vectorized implementation that achieves high-performance across a large variety of ARM platforms. Our software is fully protected against timing and cache attacks, and showcases the impressive speed of FourQ when compared with other curve-based alternatives....
Towards Practical Tools for Side Channel Aware Software Engineering: `Grey Box' Modelling for Instruction Leakages
David McCann, Elisabeth Oswald, Carolyn Whitnall
Power (along with EM, cache and timing) leaks are of considerable concern for developers who have to deal with cryptographic components as part of their overall software implementation, in particular in the context of embedded devices. Whilst there exist some compiler tools to detect timing leaks, similar progress towards pinpointing power and EM leaks has been hampered by limits on the amount of information available about the physical components from which such leaks originate.
We suggest...
NEON PQCryto: Fast and Parallel Ring-LWE Encryption on ARM NEON Architecture
Reza Azarderakhsh, Zhe Liu, Hwajeong Seo, Howon Kim
Implementation
Recently, ARM NEON architecture has occupied a significant
share of tablet and smartphone markets due to its low cost
and high performance. This paper studies efficient techniques of
lattice-based cryptography on ARM processor and presents the
first implementation of ring-LWE encryption on ARM NEON
architecture. In particular, we propose a vectorized version of
Iterative Number Theoretic Transform (NTT) for high-speed
computation. We present a 32-bit variant of SAMS2 technique,
original...
DPA, Bitslicing and Masking at 1 GHz
Josep Balasch, Benedikt Gierlichs, Oscar Reparaz, Ingrid Verbauwhede
Implementation
We present DPA attacks on an ARM Cortex-A8 processor running at 1 GHz. This high-end processor is typically found in portable devices such as phones and tablets. In our case, the processor sits in a single board computer and runs a full-fledged Linux operating system. The targeted AES implementation is bitsliced and runs in constant time and constant flow. We show that, despite the complex hardware and software, high clock frequencies and practical measurement issues, the implementation can...
SoC it to EM: electromagnetic side-channel attacks on a complex system-on-chip
J. Longo, E. De Mulder, D. Page, M. Tunstall
Increased complexity in modern embedded systems has presented various important challenges with regard to side-channel attacks. In particular, it is common to deploy SoC-based target devices with high clock frequencies in security-critical scenarios; understanding how such features align with techniques more often deployed against simpler devices is vital from both destructive (i.e., attack) and constructive (i.e., evaluation and/or countermeasure) perspectives. In this paper, we investigate...
Efficient Arithmetic on ARM-NEON and Its Application for High-Speed RSA Implementation
Hwajeong Seo, Zhe Liu, Johann Großschädl, Howon Kim
Implementation
Advanced modern processors support Single Instruction Multiple Data (SIMD) instructions (e.g. Intel-AVX, ARM-NEON), and a massive body of
research on vector-parallel implementations of modular arithmetic, a crucial component of modern public-key cryptography ranging from RSA, ElGamal and DSA to ECC, has been conducted.
In this paper, we introduce a novel Double Operand Scanning (DOS) method to speed-up multi-precision squaring with non-redundant representations on SIMD...
Montgomery Modular Multiplication on ARM-NEON Revisited
Hwajeong Seo, Zhe Liu, Johann Großschädl, Jongseok Choi, Howon Kim
Implementation
Montgomery modular multiplication constitutes the "arithmetic foundation"
of modern public-key cryptography with applications ranging from RSA, DSA
and Diffie-Hellman over elliptic curve schemes to pairing-based cryptosystems. The increased prevalence of SIMD-type instructions in commodity processors (e.g. Intel SSE, ARM NEON) has initiated a massive body of research on vector-parallel implementations of Montgomery modular multiplication. In this paper, we introduce the Cascade Operand...
Providing Root of Trust for ARM TrustZone using On-Chip SRAM
Shijun Zhao, Qianying Zhang, Guangyao Hu, Yu Qin, Dengguo Feng
Implementation
We present the design, implementation and evaluation of the root of trust for the Trusted Execution Environment (TEE) provided by ARM TrustZone based on SRAM Physical Unclonable Functions (PUFs). We first implement a building block which provides the foundations for the root of trust: secure key storage and truly random source. The building block doesn't require on or off-chip secure non-volatile memory to store secrets, but provides a high-level security: resistance to physical attackers...
A Leakage Resilient MAC
Daniel P. Martin, Elisabeth Oswald, Martijn Stam, Marcin Wojcik
We put forward the first practical message authentication code (MAC) which is provably secure against continuous leakage under the Only Computation Leaks Information (OCLI) assumption. Within the context of continuous leakage, we introduce a novel modular proof technique: while most previous schemes are proven secure directly in the face of leakage, we reduce the (leakage) security of our scheme to its non-leakage security. This modularity, while known in other contexts, has two advantages:...
On the Applicability of Time-Driven Cache Attacks on Mobile Devices (Extended Version)
Raphael Spreitzer, Thomas Plos
Applications
Cache attacks are known to be sophisticated attacks against cryptographic implementations on desktop computers. Recently, also investigations of such attacks on testbeds with processors that are employed in mobile devices have been done. In this work we investigate the applicability of Bernstein's timing attack and the cache-collision attack by Bogdanov et al. in real environments on three state-of-the-art mobile devices. These devices are: an Acer Iconia A510, a Google Nexus S, and a...
Efficient and Secure Algorithms for GLV-Based Scalar Multiplication and their Implementation on GLV-GLS Curves (Extended Version)
Armando Faz-Hernandez, Patrick Longa, Ana H. Sanchez
We propose efficient algorithms and formulas that improve the performance of side-channel protected elliptic curve computations with special focus on scalar multiplication exploiting the Gallant-Lambert-Vanstone (CRYPTO 2001) and Galbraith-Lin-Scott (EUROCRYPT 2009) methods. Firstly, by adapting Feng et al.'s recoding to the GLV setting, we derive new regular algorithms for variable-base scalar multiplication that offer protection against simple side-channel and timing attacks. Secondly, we...
Efficient Implementation of Bilinear Pairings on ARM Processors
Gurleen Grewal, Reza Azarderakhsh, Patrick Longa, Shi Hu, David Jao
As hardware capabilities increase, low-power devices such
as smartphones represent a natural environment for the efficient
implementation of cryptographic pairings. Few works in the literature
have considered such platforms despite their growing importance in a
post-PC world. In this paper, we investigate the efficient computation
of the Optimal-Ate pairing over Barreto-Naehrig curves in software at
different security levels on ARM processors. We exploit
state-of-the-art techniques and...
Fast and compact elliptic-curve cryptography
Mike Hamburg
Implementation

Elliptic curve cryptosystems have improved greatly in speed over the past few years. In this paper we outline a new elliptic curve signature and key agreement implementation which achieves record speeds while remaining relatively compact. For example, on Intel Sandy Bridge, a curve with about $2^{250}$ points produces a signature in just under 60k clock cycles, verifies in under 169k clock cycles, and computes a Diffie-Hellman shared secret in under 153k clock cycles. Our...
Automatic Quantification of Cache Side-Channels
Boris Köpf, Laurent Mauborgne, Martin Ochoa
Implementation
The latency gap between caches and main memory has been successfully exploited for recovering sensitive input to programs, such as cryptographic keys from implementation of AES and RSA. So far, there are no practical general-purpose countermeasures against this threat. In this paper we propose a novel method for automatically deriving upper bounds on the amount of information about the input that an adversary can extract from a program by observing the CPU's cache behavior. At the heart of...
SHA-3 on ARM11 processors
Peter Schwabe, Bo-Yin Yang, Shang-Yi Yang
Implementation
This paper presents high-speed assembly implementations of the 256-bit-output versions of all five SHA-3 finalists and of SHA-256 for the ARM11 family of processors. We report new speed records for all of the six implemented functions. For example, our implementation of the round-3 version of JH-256 is 35% faster than the fastest implementation of the round-2 version of JH-256 in eBASH. Scaled with the number of rounds this is more than a 45% improvement. We also improve upon previous assembly...
Affine Pairings on ARM
Tolga Acar, Kristin Lauter, Michael Naehrig, Daniel Shumow
Implementation
Pairings on elliptic curves are being used in an increasing number of cryptographic applications on many different devices and platforms, but few performance numbers for cryptographic pairings have been reported on embedded and mobile devices.
In this paper we give performance numbers for affine and projective pairings on a dual-core Cortex A9 ARM processor and compare performance of the same implementation across three platforms: x86, x86-64 and ARM. Using a fast inversion in the base...
Optimizing Multiprecision Multiplication for Public Key Cryptography
Michael Scott, Piotr Szczechowiak
Implementation
In this paper we recall the hybrid method of Gura et al. for multi-precision multiplication which is an improvement on the basic Comba method and which exploits the increased number of registers available on modern architectures in order to avoid duplicated loads from memory. We then show how to improve and generalise the method for application across a wide range of processor types, setting some new records in the process.
Low Cost Security: Explicit Formulae for Genus 4 Hyperelliptic Curves
Jan Pelzl, Thomas Wollinger, Christof Paar
Public-key cryptography
It is widely believed that genus four hyperelliptic curve
cryptosystems (HECC) are not attractive for practical applications
because of their complexity compared to systems based on lower
genera, especially elliptic curves. Our contribution shows that
for low cost security applications genus-4 hyperelliptic curves
(HEC) can outperform genus-2 HEC and that we can achieve a
performance similar to genus-3 HEC. Furthermore our implementation
results show that a genus-4 HECC is an alternative...
NewHope and NewHope-Simple are two recently proposed post-quantum key exchange protocols based on the hardness of the Ring-LWE problem. Due to their high security margins and performance, there have already been discussions and proposals for integrating them into Internet standards, like TLS, and anonymity network protocols, like Tor. In this work, we present time-constant and vector-optimized implementations of NewHope and NewHope-Simple for ARMv8-A 64-bit processors which target high-speed...
As smart, embedded devices are increasingly integrated into our daily life, the security of these devices has become a major concern. The ARM processor family, which powers more than 60% of embedded devices, introduced TrustZone technology to offer security protection via an isolated execution environment called secure world. Caches in TrustZone-enabled processors are extended with a non-secure (NS) bit to indicate whether a cache line is used by the secure world or the normal world. This...
Recently, Alkim, Ducas, Pöppelmann, and Schwabe proposed a Ring-LWE-based key exchange protocol called NewHope (USENIX Security 2016) and illustrated that this protocol is very efficient on large Intel processors. Their paper also claims that the parameter choice enables efficient implementation on small embedded processors. In this paper we show that these claims are actually correct and present NewHope software for the ARM Cortex-M family of 32-bit microcontrollers. More specifically, our...
In this paper, we investigate the efficiency of implementing a post-quantum key-exchange protocol over isogenies (PQCrypto 2011) on ARM-powered embedded platforms. We propose to employ new primes to speed up constant-time finite field arithmetic and perform isogenies quickly. Montgomery multiplication and reduction are employed to produce a speedup of 3 over the GNU Multiprecision Library. For curve arithmetic, a uniform differential addition scheme for double point multiplication and...
We present a high-speed, high-security implementation of the recently proposed elliptic curve FourQ (ASIACRYPT 2015) for 32-bit ARM processors with NEON support. Exploiting the versatile and compact arithmetic of this curve, we design a vectorized implementation that achieves high performance across a large variety of ARM platforms. Our software is fully protected against timing and cache attacks, and showcases the impressive speed of FourQ when compared with other curve-based alternatives....
Power (along with EM, cache and timing) leaks are of considerable concern for developers who have to deal with cryptographic components as part of their overall software implementation, in particular in the context of embedded devices. Whilst there exist some compiler tools to detect timing leaks, similar progress towards pinpointing power and EM leaks has been hampered by limits on the amount of information available about the physical components from which such leaks originate. We suggest...
Recently, the ARM NEON architecture has occupied a significant share of tablet and smartphone markets due to its low cost and high performance. This paper studies efficient techniques of lattice-based cryptography on ARM processors and presents the first implementation of ring-LWE encryption on the ARM NEON architecture. In particular, we propose a vectorized version of Iterative Number Theoretic Transform (NTT) for high-speed computation. We present a 32-bit variant of SAMS2 technique, original...
We present DPA attacks on an ARM Cortex-A8 processor running at 1 GHz. This high-end processor is typically found in portable devices such as phones and tablets. In our case, the processor sits in a single board computer and runs a full-fledged Linux operating system. The targeted AES implementation is bitsliced and runs in constant time and constant flow. We show that, despite the complex hardware and software, high clock frequencies and practical measurement issues, the implementation can...
Increased complexity in modern embedded systems has presented various important challenges with regard to side-channel attacks. In particular, it is common to deploy SoC-based target devices with high clock frequencies in security-critical scenarios; understanding how such features align with techniques more often deployed against simpler devices is vital from both destructive (i.e., attack) and constructive (i.e., evaluation and/or countermeasure) perspectives. In this paper, we investigate...
Advanced modern processors support Single Instruction Multiple Data (SIMD) instructions (e.g. Intel-AVX, ARM-NEON), and a massive body of research has been conducted on vector-parallel implementations of modular arithmetic, which is a crucial component of modern public-key cryptography, including RSA, ElGamal, DSA and ECC. In this paper, we introduce a novel Double Operand Scanning (DOS) method to speed up multi-precision squaring with non-redundant representations on SIMD...
Montgomery modular multiplication constitutes the "arithmetic foundation" of modern public-key cryptography with applications ranging from RSA, DSA and Diffie-Hellman over elliptic curve schemes to pairing-based cryptosystems. The increased prevalence of SIMD-type instructions in commodity processors (e.g. Intel SSE, ARM NEON) has initiated a massive body of research on vector-parallel implementations of Montgomery modular multiplication. In this paper, we introduce the Cascade Operand...
We present the design, implementation and evaluation of the root of trust for the Trusted Execution Environment (TEE) provided by ARM TrustZone based on SRAM Physical Unclonable Functions (PUFs). We first implement a building block which provides the foundations for the root of trust: secure key storage and a truly random source. The building block doesn't require on-chip or off-chip secure non-volatile memory to store secrets, but provides a high level of security: resistance to physical attackers...
We put forward the first practical message authentication code (MAC) which is provably secure against continuous leakage under the Only Computation Leaks Information (OCLI) assumption. Within the context of continuous leakage, we introduce a novel modular proof technique: while most previous schemes are proven secure directly in the face of leakage, we reduce the (leakage) security of our scheme to its non-leakage security. This modularity, while known in other contexts, has two advantages:...
Cache attacks are known to be sophisticated attacks against cryptographic implementations on desktop computers. Recently, such attacks have also been investigated on testbeds with processors that are employed in mobile devices. In this work we investigate the applicability of Bernstein's timing attack and the cache-collision attack by Bogdanov et al. in real environments on three state-of-the-art mobile devices. These devices are: an Acer Iconia A510, a Google Nexus S, and a...
We propose efficient algorithms and formulas that improve the performance of side-channel protected elliptic curve computations with special focus on scalar multiplication exploiting the Gallant-Lambert-Vanstone (CRYPTO 2001) and Galbraith-Lin-Scott (EUROCRYPT 2009) methods. Firstly, by adapting Feng et al.'s recoding to the GLV setting, we derive new regular algorithms for variable-base scalar multiplication that offer protection against simple side-channel and timing attacks. Secondly, we...
As hardware capabilities increase, low-power devices such as smartphones represent a natural environment for the efficient implementation of cryptographic pairings. Few works in the literature have considered such platforms despite their growing importance in a post-PC world. In this paper, we investigate the efficient computation of the Optimal-Ate pairing over Barreto-Naehrig curves in software at different security levels on ARM processors. We exploit state-of-the-art techniques and...

Elliptic curve cryptosystems have improved greatly in speed over the past few years. In this paper we outline a new elliptic curve signature and key agreement implementation which achieves record speeds while remaining relatively compact. For example, on Intel Sandy Bridge, a curve with about $2^{250}$ points produces a signature in just under 60k clock cycles, verifies in under 169k clock cycles, and computes a Diffie-Hellman shared secret in under 153k clock cycles. Our...
The latency gap between caches and main memory has been successfully exploited for recovering sensitive input to programs, such as cryptographic keys from implementations of AES and RSA. So far, there are no practical general-purpose countermeasures against this threat. In this paper we propose a novel method for automatically deriving upper bounds on the amount of information about the input that an adversary can extract from a program by observing the CPU's cache behavior. At the heart of...
This paper presents high-speed assembly implementations of the 256-bit-output versions of all five SHA-3 finalists and of SHA-256 for the ARM11 family of processors. We report new speed records for all of the six implemented functions. For example, our implementation of the round-3 version of JH-256 is 35% faster than the fastest implementation of the round-2 version of JH-256 in eBASH. Scaled with the number of rounds this is more than a 45% improvement. We also improve upon previous assembly...
Pairings on elliptic curves are being used in an increasing number of cryptographic applications on many different devices and platforms, but few performance numbers for cryptographic pairings have been reported on embedded and mobile devices. In this paper we give performance numbers for affine and projective pairings on a dual-core Cortex A9 ARM processor and compare performance of the same implementation across three platforms: x86, x86-64 and ARM. Using a fast inversion in the base...
In this paper we recall the hybrid method of Gura et al. for multi-precision multiplication which is an improvement on the basic Comba method and which exploits the increased number of registers available on modern architectures in order to avoid duplicated loads from memory. We then show how to improve and generalise the method for application across a wide range of processor types, setting some new records in the process.
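For orientation, the basic Comba (product-scanning) method that the abstract above takes as its baseline can be sketched as follows. This is a minimal illustration only, not the hybrid method of Gura et al. and not the improved method of the paper; the function names `to_limbs`, `from_limbs` and `comba_mul` and the 32-bit limb width are assumptions chosen for the example.

```python
def to_limbs(x, n, w=32):
    """Split a non-negative integer into n little-endian limbs of w bits (hypothetical helper)."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(n)]


def from_limbs(limbs, w=32):
    """Recombine little-endian w-bit limbs into an integer (hypothetical helper)."""
    return sum(limb << (w * i) for i, limb in enumerate(limbs))


def comba_mul(a, b, w=32):
    """Product-scanning multiplication of two non-empty little-endian limb lists.

    Each output limb (column k) is finished before moving to the next:
    all partial products a[i] * b[k - i] are summed into one accumulator,
    the low w bits are stored, and the rest is carried into column k + 1.
    """
    mask = (1 << w) - 1
    n, m = len(a), len(b)
    out = [0] * (n + m)
    acc = 0  # column accumulator; real code keeps it in a few registers
    for k in range(n + m - 1):
        lo = max(0, k - m + 1)
        hi = min(k, n - 1)
        for i in range(lo, hi + 1):
            acc += a[i] * b[k - i]
        out[k] = acc & mask
        acc >>= w
    out[n + m - 1] = acc  # the final carry fits in one limb
    return out


if __name__ == "__main__":
    x = 0x0123456789ABCDEF0123456789ABCDEF
    y = 0xFEDCBA9876543210FEDCBA9876543210
    product = from_limbs(comba_mul(to_limbs(x, 4), to_limbs(y, 4)))
    print(product == x * y)
```

In this column-wise form every limb of both operands is reloaded for each column it contributes to; the hybrid method recalled in the abstract uses the larger register files of modern architectures to avoid such duplicated loads.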
It is widely believed that genus four hyperelliptic curve cryptosystems (HECC) are not attractive for practical applications because of their complexity compared to systems based on lower genera, especially elliptic curves. Our contribution shows that for low cost security applications genus-4 hyperelliptic curves (HEC) can outperform genus-2 HEC and that we can achieve a performance similar to genus-3 HEC. Furthermore our implementation results show that a genus-4 HECC is an alternative...