[RFC] Constant-Time Coding Support

Summary

We (@kumarak, @frabert, @hbrodin, @wizardengineer, and I, of Trail of Bits) propose a Clang “constant-time selection” builtin that cryptographers can use to ensure that their compiled C and C++ code selects between values in constant time, consistently across target architectures. Our builtin will selectively bypass optimizations that are beneficial for most compiled code but that can replace intended constant-time operations with variable-time jumps or branches, contrary to cryptographers’ needs and expectations. We would love feedback on our approach.

Motivation

An attacker who finds interesting variable-time control flow in compiled code can repeatedly time it to learn when it processes sensitive values, and to potentially even learn what those values are. Rather than using the ternary operator as is typical in non-cryptographic source code, a cryptographic library developer writing a selection between two values based on some condition often uses a bitwise recipe intended to protect this data from timing leaks.

mask = -(cond);
result = (mask & a) | (~mask & b);

Ideally, this recipe would ensure the resulting compiled selection between a and b based on cond lacks variable-time target instructions like branching, jumps, or secret-dependent memory accesses, because variable-time instructions can expose sensitive data to any attacker who can take timing measurements. But recipes like this not only obfuscate code for non-cryptographers, they also cannot bypass all current and potential future LLVM IR and backend optimizations, which means their use does not always result in code that will execute in constant time.
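
Written out as a complete helper, the recipe looks roughly like the following (a sketch only; the function name is ours, and, as this section explains, the compiler offers no guarantee that this form survives optimization):

#include <stdint.h>

/* Returns a when cond is 1 and b when cond is 0, using only bitwise
   operations so that, at the source level, no branch depends on cond. */
static inline uint64_t ct_select_u64(uint64_t cond, uint64_t a, uint64_t b) {
    uint64_t mask = 0 - cond;   /* all-ones if cond == 1, all-zeros if cond == 0 */
    return (mask & a) | (~mask & b);
}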

Recent work [Geimer 2025, Schneider 2024] shows that early iterations of the InstCombine pass (which runs during the -O1 opt pipeline) replace the constant-time selection recipe’s mask creation and bitwise operations with an IR select. Then, for x86-64 (for instance), select IR instructions may initially be lowered to conditional move (cmovcc) target instructions. This is good, since cmov-family instructions run in constant time. But backend optimizations like the x86-cmov-conversion pass then replace cmovcc with a conditional jump or branch, which means the pass output is no longer constant-time.

The source code developer can use verification tools [crocs-muni] to identify introduced variable-time instructions like these to try to prevent their use. But, the preventative measures the developer can take generally amount to: either writing raw constant-time assembly directly, which is not portable; or learning and using an academic language [Bacelar Almeida 2017, Cauligi 2017] that produces unportable assembly or produces LLVM bitcode that is vulnerable to the same backend optimizations as code otherwise compiled with Clang; or turning off core LLVM optimizations entirely [Geimer 2025], which is generally impractical since operations that should be constant-time normally run in the context of code that otherwise benefits from optimization.

Threat model

An attacker can remotely monitor time-sensitive code execution on some host over a network connection. They can choose their own program inputs. They can take multiple measurements, enabling them to overcome any noise or network jitter introduced by their remote position. A local attacker has all the remote attacker’s capabilities and can additionally run their own code on the same host where the code being timed runs. This capability also enables the local attacker to take high-precision measurements directly on the host. Both types of attackers (local, remote) are in scope for this work [Pornin 2025]. Types of timing attacks other than those that exploit variable-time branching are out of scope.

Examples

These are some uses of the constant-time selection recipe in C and C++ where, if a constant-time Clang builtin existed today, using it instead could prevent variable execution timing.

uint64_t constant_time_lookup(const size_t secret_idx, 
  const uint64_t table[16]) {

    uint64_t result = 0;
    for (size_t i = 0; i < 8; i++) {
        const bool cond = i == secret_idx;
        const uint64_t mask = (-(int64_t)cond);
        result |= table[i] & mask;     
    }

    return result;
}
constant_time_lookup:
  xor     eax, eax
  xor     ecx, ecx
  jmp     .LBB0_1 
.LBB0_3:
  mov     rdx, qword ptr [rsi + 8*rcx]
.LBB0_4:
  or      rax, rdx
  inc     rcx
  cmp     rcx, 8
  je      .LBB0_5
.LBB0_1:
  cmp     rdi, rcx
  je      .LBB0_3
  xor     edx, edx
  jmp     .LBB0_4
.LBB0_5:
  ret

Example 1: C code (top) reproduced from [Sprenkels 2019] and the corresponding assembly (bottom) from Compiler Explorer using Clang (20.1.0) for x86-64 with -O1. The C code was written so that its execution time would not depend on secret_idx. Unfortunately, the resulting x86-64 asm includes a conditional jump based on this secret value (cmp rdi, rcx followed by je .LBB0_3) and a secret-dependent memory access (mov rdx, qword ptr [rsi + 8*rcx]). This means it is vulnerable to timing attacks. Compiler Explorer link here.

void cmovznz4(uint64_t cin, uint64_t *x, uint64_t *y, uint64_t *r) {
    uint64_t mask = ~FStar_UInt64_eq_mask(cin, (uint64_t)0U);
    uint64_t r0 = (y[0U] & mask) | (x[0U] & ~mask);
    uint64_t r1 = (y[1U] & mask) | (x[1U] & ~mask);
    uint64_t r2 = (y[2U] & mask) | (x[2U] & ~mask);
    uint64_t r3 = (y[3U] & mask) | (x[3U] & ~mask);

    r[0U] = r0;   
    r[1U] = r1;   
    r[2U] = r2;   
    r[3U] = r3; 
}
cmovznz4:
  mv      a5, a1
  beqz    a0, .LBB0_2
  mv      a5, a2 
.LBB0_2:
  beqz    a0, .LBB0_5
  addi    a6, a2, 8
  bnez    a0, .LBB0_6
.LBB0_4:
  addi    a4, a1, 16
  j       .LBB0_7
.LBB0_5:
  addi    a6, a1, 8
  beqz    a0, .LBB0_4
.LBB0_6:
  addi    a4, a2, 16
.LBB0_7:
  ld      a7, 0(a5)
  ld      a5, 0(a6)
  ld      a6, 0(a4)
  beqz    a0, .LBB0_9
  addi    a1, a2, 24
  j       .LBB0_10
.LBB0_9:
  addi    a1, a1, 24
.LBB0_10:
  ld      a0, 0(a1)
  sd      a7, 0(a3)
  sd      a5, 8(a3)
  sd      a6, 16(a3)
  sd      a0, 24(a3)
  ret

Example 2: A C function intended to be constant-time, reproduced from the appendix of [Schneider 2024] (top). The corresponding RISC-V rv64gc assembly (bottom) was produced in Compiler Explorer with Clang (trunk) with -O1. This asm runs in variable time, since it includes branching dependent on the values of inputs to cmovznz4. Compiler Explorer link here.

#define SIZE 256
#define CONSTANT 1665

void expand_secure(int16_t r[SIZE], const uint8_t msg[32]) {
    unsigned int i,j;
    int16_t mask;

    for(i=0; i < SIZE/8; i++) {
        for(j=0; j < 8; j++) {
            mask = -(int16_t)((msg[i] >> j)&1);
            r[8*i+j] = mask & CONSTANT;
        }
    }
}
expand_secure:
  xor  eax, eax
  jmp  .LBB0_1
.LBB0_5:
  inc  rax
  add rdi, 16
  cmp rax, 32
  je  .LBB0_6
.LBB0_1:
  xor ecx, ecx
  jmp .LBB0_2
.LBB0_4:
  mov word ptr [rdi + 2*rcx], dx
  inc rcx
  cmp rcx, 8
  je  .LBB0_5
.LBB0_2:
  movzx r8d, byte ptr [rsi + rax]
  xor   edx, edx
  bt    r8d, ecx
  jae  .LBB0_4
  mov  edx, 1665
  jmp .LBB0_4
.LBB0_6:
  ret

Example 3: C code intended to be constant-time, reproduced from [Purnal 2024] (top). The resulting assembly (bottom) produced in Compiler Explorer with Clang (trunk) for x86-64 with -O1 includes an input-dependent variable-time sequence of bt, jae, mov. Compiler Explorer link here.

Technical Approach

Proposed work

Our paired Clang builtin and LLVM intrinsic will enable the source developer to bypass any optimizations that do not respect the constant-time selection recipe. __builtin_ct_select(cond, a, b) will call a co-developed LLVM intrinsic llvm.ct.select(cond, a, b) so that optimizations that would otherwise introduce variable-time, data-dependent control flow can be bypassed just for the selection. The intrinsic will then be lowered to a ctselect pseudo-instruction. Later IR transformations or target instruction transformations (of machine IR) will recognize our pseudo-instruction and emit constant-time lowerings for it.

A previous effort around ten years ago [Simon 2018] implemented a similarly motivated Clang builtin for constant-time selection, but supported it only in the x86-64 backend. Unfortunately, this earlier work was neither proposed to this community for feedback nor upstreamed. We also observed that it does not preserve the constant-time property through modern x86-64 backend optimizations to the final x86-64 output [Simon 2017]. Closing this gap, each backend that sees our ctselect pseudo-instruction will lower it using cmovcc or the target-appropriate equivalent (like csel for ARMv8+) when such instructions are available, or otherwise with constant-time bitwise target instructions.

Example (pseudocode) usage

Before

mask = -(cond);
result = (a & mask) | (b & ~mask);

Proposed After

#define HAS_CT_SELECT __has_builtin(__builtin_ct_select)

#if HAS_CT_SELECT
    #define CTSELECT(mask, a, b) __builtin_ct_select((mask),(a),(b))
#else
    #define CTSELECT(mask, a, b) (((a) & (mask)) | ((b) & ~(mask)))
#endif

result = CTSELECT(mask, a, b);

Limitations

This work is scoped to address only branching-related timing attacks. This means that:

  • There are (as previously mentioned) other types of side-channel attacks, such as those that exploit the timing of cache-related operations, that this work cannot mitigate.
  • This work is not intended to, and in fact cannot, prevent hardware-level security issues like Spectre or Rowhammer.

Constant-time selection alone cannot handle all constant-time cryptographic coding needs (e.g., division). This means that further constant-time functionality either independent of this work or built on top of __builtin_ct_select may need to be added in the future.

Our initial implementation does not use ARM DIT or Intel DOIT. Not every ARM target supports DIT, and not every Intel target supports DOIT. Adding support for DIT and DOIT to the ARM and x86 LLVM backends would require further significant changes and discussion. Moreover, enabling DOIT/DIT features requires OS-level privileges, e.g., to write to the related MSR (model-specific register) on x86. We think support for these instructions could be added at the source library level, outside LLVM, once our initial implementation is in place.

Open Implementation Questions

Avoiding node merging

To maintain the constant-time guarantees we are describing here, the machine instructions we generate must be guaranteed not to be merged by later peephole optimizations. Currently, the implementation we are sketching uses custom emission logic, built on the instruction lowering APIs, to do this:

MachineInstr *MI = BuildMI(…, TII->get(AArch64::CSELWr), …);
MI->setFlag(MachineInstr::NoMerge);

We’d love to hear what methods might exist today to obtain the same effect via regular instruction selection patterns. Below is a rough sketch of what we are thinking: a pseudocode TableGen target definition containing an example “NoMerge” annotation that, when included, prevents further node merging:

def : Pat<
(AArch64ctselect GPR32:$tval, GPR32:$fval, (i32 imm:$cc), NZCV),
(CSELWr !!NoMerge!! GPR32:$tval, GPR32:$fval, (i32 imm:$cc)) >;

This method would provide a way to specify target-specific lowerings for the new pseudo-instruction without the need to write custom expansion logic. However, there remains the issue of how to safely handle the intrinsic on targets for which the pseudo-instruction does not yet have custom logic. In an ideal world, we’d be able to handle this at the SelectionDAG level by writing generic expansion logic that turns ctselect into a tree of bitwise operations, tagged with something similar to the NoMerge example tag. The idea would be to make sure that regular patterns cannot match against nodes tagged with this flag, and that this flag is then maintained in the generated machine instructions. Such a mechanism does not currently exist in LLVM to the best of our knowledge, and would probably be a large architectural change. Any feedback on possible alternatives is welcome.

Fallback support

After implementing our core mechanism initially for at least the x86 and ARM backends, we plan to expand backend support to more architectures, e.g., AArch64, MIPS, RISC-V. Since a fail-open strategy could result in unintuitive behaviour for the source developer, we currently plan to fail closed, meaning that any backend that does not yet implement ctselect but receives input source that includes the builtin would fail to compile it and produce an error. The source developer could use __has_builtin to check for ctselect, in order to provide an alternative implementation at the source level for cases when ctselect is not yet supported. We would appreciate thoughts on whether fail-closed or something else is most suitable here.

ARM Thumb support

While a source code developer may use the -mthumb command-line flag or specify a target triple that includes Thumb to force Thumb instruction generation, the Clang default at present is to generate A32 (ARM) assembly only, unless the source developer has otherwise used the Thumb attribute in their code. For these reasons, for now we plan to implement ARM target lowering for our intrinsic only for A32. We would also like input on whether Thumb-mode support would be useful before we add it.

Future Work

According to [Bernstein 2024], constant-time boolean selection alone cannot support all cryptographic coding needs. Once we have implemented __builtin_ct_select, we will additionally publish a source library of constant-time helpers that use our selection builtin; this source-level work will also demonstrate how to correctly use the builtin in source code.
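
As a flavor of what such a helper might look like (a sketch only, assuming the proposed builtin; the helper name is illustrative):

#include <stdint.h>

/* Conditionally swap two words: if cond is nonzero, *a and *b are exchanged;
   otherwise they are left unchanged. Either way, the same instructions run. */
static inline void ct_cswap_u64(uint64_t cond, uint64_t *a, uint64_t *b) {
    uint64_t x = *a, y = *b;
    *a = __builtin_ct_select(cond, y, x);
    *b = __builtin_ct_select(cond, x, y);
}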

Beyond the initial implementation, we also see several promising directions for extending this work:

Further builtins

If there is appetite, we would propose extending our work at the LLVM and Clang level with builtins for more constant-time recipes [Aumasson 2019, Intel 2022].

Further languages

For example, Rust could also benefit from our implementation. This would provide Rust’s cryptographic ecosystem, including projects like RustCrypto and ring, with the same constant-time guarantees without extensive implementation effort. The modular nature of this work means that any language targeting LLVM IR could potentially reuse parts of our implementation like the intrinsic llvm.ct.select.

Further architectures

Beyond the architectures currently targeted (x86-64, ARMv7, AArch64, RISC-V, MIPS-32), our approach naturally extends to other LLVM backends like WebAssembly (WASM), where constant-time guarantees are especially challenging due to the abstract nature of the execution environment and the variety of runtime implementations.

3 Likes

The biggest problem here, I think, is the fact that there’s no underlying model for what it means to “use” a value that needs constant-time guarantees. It’s important not only to lower the intrinsic itself correctly, but also to avoid introducing other instructions that touch the value. To get this right, I think you need a new LLVM IR type: a “constant-time value” that only permits certain operations. Your discussion of instruction merging sort of touches on this, but not in a comprehensive way.

Probably suppressing one or two optimizations is sufficient for 99% of the problems you’re likely to see, but if we’re going to introduce a feature specifically for crypto algorithms, I’m not sure users will be happy with 99%.

Practical 32-bit Arm codebases these days either target M-class microcontrollers (which are always T32) or use Thumb2 for reduced code size.

1 Like

I have no context on the crypto side of this.

A means of robustly requesting cmov / mask style single basic block code (and possibly of requesting branches) does come up periodically though. It is really annoying to find the compiler heuristics have gone the wrong way on that.

An intrinsic which takes bool (or vector of bits) and two arguments of the same type and chooses between them without splitting the basic block is a legitimately useful thing for high performance code, completely independent of crypto. I want that for x64 and for amdgpu codegen control.

Can we propose / review / ship that as a feature on its own merit, which happens to help the crypto goal as well as the control over machine performance one?

(I’d like the dual as well, a branchy-select which splits the basic block, but I have some doubts about that interacting as well with other passes)

I agree that in order to provide hard end-to-end guarantees we need some notion of secret types. But this comes with significant implementation effort, as well as additional complexity. A constant-time selection primitive addresses the most pressing issue in a simple way. I think there is value in that.

Especially as the primitive is going to be necessary anyway, so it’s not wasted effort even if we want to later represent it as a select over a secret type instead of an intrinsic.

We should not conflate the cryptographic constant-time use case and the “I want a cmov to avoid branch mispredicts” use cases.

The latter is already supported by LLVM in the form of select with !unpredictable metadata. Rust exposes this as std::hint::select_unpredictable() and uses this for things like binary search, where the introduction of branches is catastrophic for performance. I don’t know whether Clang exposes this primitive.

1 Like

I agree with @efriedma-quic. A new LLVM IR type (and a new type in C) provides better isolation between new logic and existing optimizations. It also avoids mixing ct primitives and normal arithmetic.

I previously proposed an attribute-based solution in [RFC] Constant Time Execution Guarantees in LLVM. It was intended to reuse the existing codebase and maintain good performance. However, after a deep investigation of OpenSSL and BoringSSL, I found that their patterns are significantly different from other applications. Therefore, adding a set of primitives and writing specific optimizations for crypto applications seems feasible.

As you pointed out in the limitations, `__builtin_ct_select` doesn’t meet all the needs. I’d like to introduce a list of intrinsics like `llvm.ct.add/sub/mul/and/or/xor/shl/lshr/ashr/fshl/fshr/…`. In addition, we may need two explicit builtins to allow conversions between normal integer types and `secret` types.

Besides the comments above, I still have some concerns about the builtin-based solution:

1. Can these intrinsic calls be moved? On some platforms (e.g., x86 and AArch64), data-independent execution latency is controlled by some CSRs. If the user modifies the CSR before calling ct primitives, the execution order between the CSR modification and the ct arithmetic cannot be changed. Perhaps we can learn something from the FP env modeling in LLVM.

2. How do we model the side effects of ct primitives (e.g., that the total latency of ct primitive execution is equal on all possible paths)? I’d like to convert the constraints into SMT expressions. Then we can verify the IR transformation in Alive2.

3. The only difference between this approach and the value barrier trick is auto-vectorization. How much effort is required to add support for secret types in LV and SLP? How do we declare a vector secret type in C? We may use `typedef int secret_int4 __attribute__((ext_vector_type(4))) __attribute__((secret));`.

4. Can we get in touch with developers in the OpenSSL/RustCrypto community? Some insights from downstream users would be valuable. There are similar RFCs on LLVM Discourse, but crypto library developers rarely get involved in the discussion.

I agree with @efriedma-quic that Thumb support would be more useful than Arm. If you were to pick one of A32/T32, it would be better to choose T32.

Pre-Cortex CPUs (2006 or so) needed to make more use of A32, but with the advent of Thumb 2 there is little reason to use it.

Arm’s only implementations with hardware tampering mitigations (SecurCore SC000 and SC300), where this would be especially useful, are Thumb-only.

1 Like

Hi there, I’m one of the leads of RustCrypto.

I previously proposed an attribute-based solution in [RFC] Constant Time Execution Guarantees in LLVM. It was intended to reuse the existing codebase and maintain good performance. However, after a deep investigation of OpenSSL and BoringSSL, I found that their patterns are significantly different from other applications. Therefore, adding a set of primitives and writing specific optimizations for crypto applications seems feasible.

We are still in a position where we would prefer to replace code using new intrinsics to get constant-time guarantees, as opposed to incremental hardening. If I understand how your proposal was supposed to work, it sounds like it might make it difficult to mix constant-time and non-constant-time operations, e.g. only performing constant-time operations on values containing secrets, and using non-constant-time operations on non-secret values for performance.

In the past some of us had worked with Chandler Carruth to come up with a proposal to add “secret integer types” to Rust, which would ideally lower to LLVM types that ensure only the instructions on your list are executed on them, never branching on them or using them in pointer calculations:

I believe there was some internal work in Google to implement this for Rust+LLVM+RISC-V but I’m not sure that ever saw the light of day.

Using types for this purpose prevents confusion around “forgetting to use the constant time version of a function”. Ideally it should be impossible to misuse such types, aside from converting them to non-secret integer types and then performing non-constant-time operations on them.

Regarding the OP, we’ve worked around x86-cmov-conversion using inline assembly (emitting cmov family on x86, csel on ARM, with a “best effort” portable fallback with no guarantees):

I think it would be great to have first-class support for something like this in LLVM. It has become commonplace for new cryptographic specifications to use a pseudocode “CMOV”-like function to describe where to apply this sort of predication.
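
For readers unfamiliar with that workaround, a minimal illustration of the inline-assembly approach might look like the following (illustrative only, not the actual RustCrypto code; the helper name is hypothetical):

#include <stdint.h>

/* Select a when cond is nonzero, else b. On x86-64, force a cmov via inline
   assembly so the backend cannot turn the selection into a branch; elsewhere,
   fall back to the best-effort bitwise recipe (no guarantees). */
static inline uint64_t cmov_u64(uint64_t cond, uint64_t a, uint64_t b) {
#if defined(__x86_64__)
    uint64_t result = b;
    __asm__("test %[cond], %[cond]\n\t"   /* ZF = (cond == 0)        */
            "cmovnz %[a], %[result]"      /* result = a if cond != 0 */
            : [result] "+r"(result)
            : [cond] "r"(cond), [a] "r"(a)
            : "cc");
    return result;
#else
    uint64_t mask = 0 - (uint64_t)(cond != 0);
    return (a & mask) | (b & ~mask);
#endif
}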

3 Likes

Yes, I’ve always thought that some kind of a “secret” tag on some of your data would be what I’d most want. (I’m an LLVM developer, but with my other hat on, I’m the maintainer of PuTTY, which has always had its own crypto implementations, and uses a dynamic-analysis test suite to check they come out of the compiler in a form that meets all the constant-time requirements I know of.)

There are two ways a non-CT operation on secret data can get into your output code. One is if the compiler puts it there by “helpfully” optimizing the CT idiom you wrote in the source code. But the other is if you accidentally wrote a non-CT idiom in the source code in the first place. Ideally you’d like to prevent both – you’d like a guarantee that the code coming out of the compiler is constant-time if it gets through the compiler at all, and you’d prefer a compile-time failure to silently emitting a security hole. Then the compiler is helping me achieve my aim, instead of me having to fight against it.

But, as @tarcieri says, you do still want the compiler to be allowed to use non-CT operations on values that aren’t secret: partly for performance, and partly because in some cases you have no choice at all. (An obvious case is looping over the elements of a variable-sized array, as long as the size isn’t supposed to be a secret. A much more complicated example is the compact encoding of NTRU Prime ciphertexts, which requires a great deal of fiddly non-secret arithmetic and conditionals to process the encoding format, interleaved with careful handling of the secret data.)

So that’s always suggested to me the idea of tagging some of my variables as secret; then if I explicitly write a non-CT operation on a secret variable I get a compile error, and once I’ve fixed that, the CT operations in my source file stay CT. Meanwhile, I can make my loop bounds not be tagged “secret”, and then I’m still allowed to write a for statement.

On the other hand, if we are doing this one primitive at a time, then here are a couple of other things that spring to mind beyond __builtin_ct_select:

Integer comparison (both equality and ordered), giving an output that can be used as the selection input to __builtin_ct_select.

Min/max, which you can implement with a comparison followed by a selection, but if the CPU has (constant-time) min and max instructions, it would be a shame not to be able to use them.

Bit shift with a secret shift distance. Not all CPUs guarantee shift instructions are constant-time with respect to the distance, so a cautious approach is to shift by 1, 2, 4, 8, 16 bits in turn and __builtin_ct_select based on the corresponding bit of the shift count. But, again, if the CPU does give you the guarantee you want, you can discard that precaution and go faster. So a compiler intrinsic that would do the safe thing if necessary and the fast thing if possible would be useful.
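
As a sketch of the cautious fallback described here (assuming the proposed __builtin_ct_select; the helper name is made up):

#include <stdint.h>

/* Left-shift x by a secret distance 0..31 without a data-dependent shift
   instruction: shift by 1, 2, 4, 8, 16 in turn and select on each bit of
   the (secret) shift count. The shift amounts themselves are public. */
static inline uint32_t ct_shl_u32(uint32_t x, uint32_t secret_shift) {
    for (unsigned bit = 0; bit < 5; bit++) {
        uint32_t shifted = x << (1u << bit);
        x = __builtin_ct_select((secret_shift >> bit) & 1, shifted, x);
    }
    return x;
}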

2 Likes

Thank you for your information!

I think this list of operations is a good reference for adding new intrinsics.

I notice that the `Unresolved questions` section in the RFC also mentions the memory zeroing and register spilling problem. From your perspective, what is the boundary of the compiler’s guarantee? In other words, which metrics can be “observed” by the attacker?

1 Like

We’ve observed spilling of things like cryptographic keys. When people report this as an issue to us, we note it’s a curious threat model if an attacker is able to observe the memory of a computer where you’re performing a cryptographic operation.

Zeroization is a particularly tricky problem, and one where we ship a rather suboptimal solution. I’ll sadly say I think zeroization is more about compliance with requirements stipulating that you have made some effort to clear secrets from memory than it is about shipping a fully airtight, high-assurance solution.

This paper is a great treatment of the issue, including the deficiencies in our current solution: https://eprint.iacr.org/2023/1713.pdf

The best option I’m aware of is to create a new call stack, perform some expensive cryptographic operation which would involve a multitude of function calls on that stack, then at the end note how large the stack grew during that operation and zeroize the entire call stack on completion. I have not seen any ergonomic Rust wrapper for doing something like that, though, nor do I have any familiarity with what LLVM might already provide for implementing something like that. In such a case, spilling cryptographic keys onto the stack doesn’t matter, since they get erased when the operation completes.

I believe a similar approach is used in the Linux kernel, however?

2 Likes

Memory zeroing and register spilling: from a more general-LLVM than Rust perspective, it would certainly be nice to guarantee that secrets weren’t left in stack slots or general-purpose variables on return from a function that had handled them. In an ideal world such things are unobservable, but we all know this isn’t that world, and you can expose the contents of a previous callee’s stack slot by calling another function that accidentally uses a stack buffer uninitialised (Heartbleed), and surely the same can happen with a non-callee-saved register.

Any such thing is UB on the part of the followup code that leaked the secret, of course, but that’s pure victim-blaming – a good compiler guarantee would be that the secret isn’t available in the first place to leak through that kind of C-programming error.

1 Like

If there was a mainstream language that was practical for cryptographic implementations, it would be fair criticism that adding constant-time select might not fix enough issues to make C attractive, but the reality is that cryptographic implementations, today, are overwhelmingly written in C, and implementers tolerate having to deal with 100% of the issues they’re likely to see. Users would be thrilled with 99% of them solved.

1 Like

I second the idea of providing a set of constant-time programming builtins so that we can constrain default compiler transformations; these builtins should be sufficient for most code that needs this behavior, without affecting other users.

2 Likes

1→ Can be modeled by describing the intrinsic as having side effects; that will prevent any movement.

2→ Latency of operations is a microarchitectural construct, usually covered by -mtune flags that model per-march latencies. Latencies are somewhat useful for performance heuristics, not for security guarantees: given the difficulty of modeling modern OOO cores and the variety of value speculation they perform, it is hard to guarantee anything that is not architectural. E.g., instructions with DIT (ARM) give strong architectural guarantees of non-value-dependent execution time for that instruction if executed in DIT mode. And you can compose a data flow graph of strictly DIT instructions and prove the DAG has data-independent timing.

I think if we have ct builtins and a data flow graph composed of these builtins and other operations, we can add strict compiler guarantees for the DAG. I think this enforcement should be added.

2 Likes

We’re inclined to agree, but that is not what we are trying to achieve with this proposal. Indeed, trying to achieve an end-to-end guarantee that no data-dependent timing variances are introduced would require a more sophisticated system with some sort of taint-tracking pass. That means changing the semantics of the C language itself, which we would consider out of scope for the time being, even before considering the potential intractability of the problem.

Yes, we recognize the latter issue as being important, but our main goal here is to make sure that code that is already battle-tested for existing architectures can be more easily ported and refactored by eliminating ugly hacks and inline assembly, essentially trying to avoid this:

The idea behind this proposal is not to detect when data-dependent timing variances are introduced; instead, it is to make sure that code that was already written with constant-time guarantees in mind is lowered in a semantically correct way. It will be the programmer’s responsibility to make sure that sensitive data is not used as, e.g., a branch condition.

Similarly, we consider concerns about platform-specific control registers out of scope, because we feel that the compilation stage is not the right place to manipulate them: again, we assume that programmers who are writing the sort of code that needs constant-time guarantees are aware of the various methods to enable them on each platform. Our concern is to make sure the compiler doesn’t break the constant-time contract in a way that’s not controllable by the programmer. Manipulating platform-specific control registers is most likely best left to specialized libraries/utility code.

Please note that, although out of scope for this particular proposal, it is still possible to reuse the intrinsics introduced here to build static analysis passes that detect unsafe usage of time-sensitive values: for example, one could build a pass that detects when the result of a constant-time selection is used as part of a branch condition.

These are excellent suggestions; I would encourage everyone to come up with a minimal set of required primitives from which cryptographic code can be built. For example, on your point about min/max: since we are currently planning on modeling these primitives as SelectionDAG nodes, nothing prevents writing an optimization that takes a CTSELECT(CTCMP(a, b), a, b) and turns it into a MAX(a, b) if the target is known to implement MAX with a constant-time instruction. Members of Trail of Bits’ cryptography team suggest adding integer multiplication and division builtins.

  1. These intrinsics are modeled with explicit side effects, so the optimizer cannot move or duplicate them. Other intrinsics that set CSRs (e.g., for DIT/DOIT) will be executed in order with the constant-time intrinsics, thus avoiding any reordering problems. This is similar to how some of the FP intrinsics are modeled in LLVM.
  2. The intrinsics are lowered through a single, unambiguous path, with no opportunistic optimizations or CFG creation. The primitives are implemented using a dedicated set of SelectionDAG nodes, bypassing machine IR optimizations and generating fixed instruction sequences that leverage DIT/DOIT features to provide architectural constant-time guarantees. Avoiding these optimizations carries a performance cost, but we consider this an acceptable tradeoff for the security guarantees provided. Alive2 is used for functional verification and is not intended to assess performance or latency.
  3. I believe that defining any kind of data type for secrets is the responsibility of language frontends such as C, Rust, etc., not of the constant-time primitives. These primitives operate on existing LLVM IR types, and we welcome language frontend developers to define new data types or operations around them.
2 Likes

Members of Trail of Bits’ cryptography team suggest adding integer multiplication and division builtins.

I certainly agree a “multiplication, but make sure it’s constant-time if that’s not already guaranteed” builtin would be nice.

Looking at my own cryptography code, I use remainder more often than division – reducing mod 3329 in ML-KEM, for example, and reducing mod a variable small prime in NTRU. I think maybe the only constant-time machine-integer division in PuTTY that requires the output quotient is used in multiprecision integer division, for the rare cases where you need to do a one-off modular reduction so that it isn’t worth setting up a Montgomery context.

Going back to a thing in the initial post, incidentally, about Arm DIT:

Our initial implementation does not use ARM DIT or Intel DOIT. Not every ARM target supports DIT, and not every Intel target supports DOIT. Adding support for DIT and DOIT to the ARM and x86 LLVM backends would require further significant changes and discussion. Moreover, enabling DOIT/DIT features requires OS-level privileges, e.g., to write to the related MSR (model-specific register) on x86.

I agree that I wouldn’t want the compiler to emit a DIT-setting instruction, for the reason you say (not available on all platforms and will cause an illegal instruction trap if it isn’t), and also, because I believe Arm’s current guidance is that the operation of setting DIT can potentially have a performance cost, so you don’t want to set it right inside each crypto kernel anyway – better to set it once, call multiple crypto functions in a tight cluster (like MAC-checking and decrypting your SSH packet), then unset DIT when it’s all done. Or if your program isn’t doing any significant CPU churning other than crypto then perhaps you don’t even bother, and just set it at startup and leave it set permanently.

So I’d prefer that a compiler’s CT mode simply emits code that is safe if DIT is enabled, and leaves the responsibility with me to actually enable it, which lets me decide how to do it: check with the OS (e.g. hwcap) to ensure the instruction is available, and then make my own choice about how much code I set it around.

3 Likes

Regarding environmental state such as DIT/DOIT, I’d like to share some thoughts (mostly a summary of points already made):

  1. As is the goal with secret-related attributes, don’t push the burden to the developers.
  2. Do not generate direct DIT/DOIT logic in LLVM itself and do not tailor the infrastructure strictly to their design, as additional environmental state may need to be changed, especially on CPU designs prior to specific vulnerabilities’ disclosure.
  3. Do not make assumptions about the overhead of toggling environmental flags, as some may require syscalls and some may not.

We recently published a small high-level write-up without evaluation (sorry!) about the state of side-channel vulnerabilities involving environmental mitigations, an outline of which mitigations may be required on sample platforms (mostly Intel and Apple), and proposals on how to address the constant evolution of mitigations and developer interaction: Secret Types Require OS-Backed Secrecy Code Sections.

The main ideas are:

  1. Explicitly declare secrecy code sections, outside of which secret-qualified variables may not be accessed. In a naive implementation, entering it would call a prologue that enables mitigations and exiting it would call an epilogue that disables them.
  2. Said prologue and epilogue would be provided by the operating system via dynamic linking, so that security updates can be applied globally, especially when they involve new flags (such as chicken bits not wired to DIT/DOIT, as was the case with the data memory-dependent prefetcher (DMP) with Apple M1 and M2).
  3. The scope of the secrecy code section allows the developer to give strong hints about when exactly environmental mitigations should be toggled, to not run into performance issues from either the overhead of toggling them or the performance penalty during execution with mitigations on. However, various compiler optimizations may still apply if secure, and instructions may be moved in and out.

The annotation of variables (or memory in general) also allows for secret memory isolation, which may be crucial in mitigating the effects of DMPs. Something well-known but hardly discussed is that real-world DMPs (like Apple’s) can still trigger on “secret memory” if “public memory” in close proximity is accessed, even if the former is never accessed itself. Isolating “secret memory” entirely allows for it to be unmapped outside of the secrecy code sections above, which can be done in a performant way using Intel MPK or ARM Permission Overlays. The safest way to combat side-effects of the prefetching window is basically never mapping “secret memory” while DMP is on.

The ideas above are all outlined in the linked paper. It also references very important work laying the foundations for a potential implementation, such as Annotating, Tracking, and Protecting Cryptographic Secrets with CryptoMPK for a similar approach exploiting Intel MPK (albeit for defense-in-depth rather than side-channel mitigation).

3 Likes

Hi everyone,

In our original RFC, we proposed:

“Since a fail-open strategy could result in unintuitive behaviour for the source developer, we currently plan to fail closed, meaning that any backend that does not yet implement ctselect but receives input source that includes the builtin will fail to compile it and produce an error.”

However, as the first step in showing this work is viable in code, we tried fail-open instead and think it could be better. We’re in the process of implementing a fail-open approach with automatic fallback and would like community feedback.

When a target doesn’t support native constant-time select, we automatically expand llvm.ct.select into a branch-free masked-merge pattern using bitwise operations. This ensures constant-time execution even on targets without dedicated hardware support.

The implementation preserves the constant-time property by preventing optimization passes from reintroducing branches. We’ve tested this on RISC-V, WebAssembly, and MIPS.

Questions and concerns we currently have:

  1. Is fail-open with automatic fallback the right approach? Or should we stick with fail-closed (compilation error) for unsupported targets?

  2. What limitations should we acknowledge with this fallback approach? For example:

    • It prevents DAG-level optimizations but may not prevent all target-level optimizations

    • Some targets (like RISC-V) may still lower certain patterns to branches in backend passes

  3. How should different targets be handled - which deserve explicit optimization vs generic fallback?

We’re particularly interested in hearing from backend maintainers about whether this arithmetic expansion approach provides sufficient constant-time guarantees for your architecture’s cryptographic use cases.

Looking forward to your thoughts and any alternative approaches you might suggest.

2 Likes

If you support pointers as arguments to this intrinsic, this expansion is incompatible with non-integral pointers (or, in the more fine-grained world being introduced by [DataLayout][LangRef] Split non-integral and unstable pointer properties by arichardson · Pull Request #105735 · llvm/llvm-project · GitHub, unstable pointers, and non-integral pointers with external state).

1 Like

IIRC we only need to support integral values. If the index is a secret, we will perform loads for all possible positions (See also openssl/include/internal/constant_time.h at 389728876b51de0df9f97b6a295948ebec1e0f0c · openssl/openssl · GitHub ). For the ct.select case, load ct.select(cond, arr + idx1, arr + idx2) is not allowed. The user should write ct.select(cond, load arr + idx1, load arr + idx2) instead.
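
For illustration, the all-positions lookup pattern described here might be written with the proposed builtin roughly as follows (a sketch; assumes __builtin_ct_select as proposed, function name hypothetical):

#include <stddef.h>
#include <stdint.h>

/* Table lookup at a secret index: load every entry and select among the
   loaded values, so the secret feeds only the selection, never an address
   or a branch condition. */
uint64_t ct_table_lookup(const uint64_t table[16], size_t secret_idx) {
    uint64_t result = 0;
    for (size_t i = 0; i < 16; i++) {
        result = __builtin_ct_select(i == secret_idx, table[i], result);
    }
    return result;
}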

1 Like