TLS support in GPU programming

I tried to use a TLS (thread-local storage) variable with the ROCm compiler and with NVCC, but failed to get the expected behavior from either compiler. ROCm crashes when compiling the code, and NVCC seems to ignore the __thread keyword, treating the TLS variable as an ordinary global. Does anybody know why TLS is not well supported in GPU programming?

This is unimplemented, and kind of a hassle: you need to allocate a copy of the global for every work item in the dispatch, which is an unknown and potentially large number. I would hope you would get a clean error if you attempt to use __thread.

Could thread locals be implemented by allocating them at the start of the stack/scratch in kernels?

(I think for x86/glibc TLS is at the start of the stack as well.)

Maybe, but that’s using up precious stack space. I think the correct way to handle this is a new address space, plus buffers using the magic work-item-ID-indexed SRD configuration.

Being able to codegen thread_local on the GPU would be nice, but right now it’s just unimplemented (Compiler Explorer). Another issue would be initializers: even if we had some special buffer set up, we’d need backend support to write values into it on kernel start, which would then be a performance issue if we ever wanted that to work across multiple (non-LTO) link jobs.

Right now the closest thing you can get to TLS is burning most of your LDS budget on it, i.e.:

__shared__ int local[1024];

int &thread = local[threadIdx.x];
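Fleshed out, that workaround might look like the following sketch. The names, the use of dynamic shared memory, and the launch configuration are my assumptions, not something from the thread:

```cuda
// Sketch: per-lane "TLS" carved out of dynamic shared memory (LDS).
// Every `extern __shared__` declaration aliases the same per-block
// allocation, so a device "library" function can reach the calling
// lane's slot without the value being passed down the call chain.
__device__ void bump(void) {
  extern __shared__ int tls_slot[];
  tls_slot[threadIdx.x] += 1;  // this lane's private slot
}

__global__ void kernel(int *out) {
  extern __shared__ int tls_slot[];
  tls_slot[threadIdx.x] = (int)threadIdx.x;  // per-lane "initializer"
  bump();
  out[threadIdx.x] = tls_slot[threadIdx.x];
}

// Launch reserving one int of LDS per thread in the block:
//   kernel<<<grid, block, block.x * sizeof(int)>>>(out);
```

Note this spends blockDim.x * sizeof(int) of the per-block LDS budget and only gives block-scoped lifetime, which is why it amounts to "burning" LDS.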

Also, perhaps because I don’t understand the use case here, why do you want TLS instead of just declaring a variable in your entry point/kernel?

Clang reports an error indicating that the compiler doesn’t support dynamic initialization of thread-local variables. That looks good to us.
Compiler Explorer and Compiler Explorer
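For illustration, this is the shape of declaration that trips that diagnostic (a sketch; seed is a made-up function, and the exact wording of clang’s error may differ by target):

```cuda
__device__ int seed();        // runtime (non-constant) initializer

// clang rejects this: dynamic initialization of a thread-local
// variable is not supported here
thread_local int x = seed();
```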

We’d like to have a TLS variable that can be accessed freely from our library functions. If it is declared at the entry of the kernel, we need to pass a reference to the local variable down to every device function in the call trace.

Side note: If by “thread-local” you do mean “private to one lane of execution”, this ends up morally equivalent to an alloca() in the kernel entry point.

The trouble with allowing this sort of thing as a global declaration is that, if there’s more than one kernel in your source code, you have to plumb the value through to all of them, probably wasting registers.

That is, something like

thread_local int x;

__global__ void f(int a) {
  ...
  g(a);
  ...
}

__device__ void g(int b) {
  ...
}

would need to become

__global__ void f(int a) {
  int x;
  ...
  g(a, x); // *maybe* passed by value if you can get away with it
  ...
}

__device__ void g(int b, [const] int& x) {
  ...
}

which is a rather irritating transformation to do in general.

(Like, I’m sure no one would complain if this got implemented, but I don’t get the sense this is an extremely desirable feature. Patches welcome, I’d say.)

I think this works in theory, but I wouldn’t for a second recommend anyone actually do it, since it’ll blow up depending on how the stack gets allocated (Compiler Explorer). Mostly just a curiosity.

// Pseudo-"TLS": a pointer into address space 5 (private/scratch) at
// offset 0, so each lane sees its own copy of the pointee.
int [[clang::address_space(5)]] *tls = (int [[clang::address_space(5)]] *)(0);

[[clang::amdgpu_kernel, gnu::visibility("protected")]] void kernel() {
  *tls = __builtin_amdgcn_workitem_id_x();  // write this lane's work-item ID
}