[RFC] "Stack" dialect

I briefly talked about this proposal in a comment here, but I think I can refine it further.

Goal
Most general-purpose programming languages store their local variables on the stack (let’s not think about optimization for now). However, MLIR has no dialect that natively supports this.
If someone wants to build a general-purpose programming language using MLIR, such a dialect would be very helpful.

For example,

int main() {
  int a = 0;    // stored on the stack
  int b = 1;    // stored on the stack
  return a + b; // loaded from the stack, then added
}

New types to introduce

  1. !stack.lv type (l-value)
     Represented as !stack.lv<type>.

This type represents values that have been pushed onto the stack. By default, a value that is not on the stack is called an r-value; once pushed onto the stack, it is typed as an l-value.

  2. !stack.ref type
     This type represents a reference to an l-value. This is useful if one wants to represent a value that "references" another value. For example, in Rust:
let mut a = 0;
let mut b = &mut a; //borrows (references a)

If a is updated, the value observed through b is updated as well. In this case, b internally points to the stack address of a. The type of b would be !stack.ref<!stack.lv<type>>.

New operations

  1. %res = stack.push %value
     This operation pushes a value onto the stack, converting an r-value to an l-value.

  2. %ref = stack.reference %value
     This operation creates a reference to the l-value %value. Note that %value must be of !stack.lv type.

  3. %value = stack.deepCopy %lv
     This operation copies an l-value and creates an r-value. For trivially copyable types (such as scalar types), this is a simple memcpy. For memrefs or tensors, this creates a new instance, independent from %lv. (It should be internally translated to something like memref.copy or tensor.copy.)

  4. %value = stack.shallowCopy %lv
     This operation moves an l-value and creates an r-value. For trivially copyable types, this is the same as stack.deepCopy. But for types with internal indirections (such as tensor or memref types), this operation only copies the enclosing structure, without copying the actual data. See the sketch after this list.
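To make the difference concrete, here is a hypothetical sketch using the proposed syntax (op names and types as introduced above; nothing here is final):

%m = memref.alloc() : memref<2xi32>
%lv = stack.push %m : (memref<2xi32>) -> !stack.lv<memref<2xi32>>
// deepCopy: allocates a new buffer and copies the data (lowers to memref.copy).
%deep = stack.deepCopy %lv : (!stack.lv<memref<2xi32>>) -> memref<2xi32>
// shallowCopy: copies only the descriptor; %shallow aliases the same buffer.
%shallow = stack.shallowCopy %lv : (!stack.lv<memref<2xi32>>) -> memref<2xi32>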

Here I have only stated a very basic set of operations.

The C++ example given above may be converted as follows (of course, the stack operations can eventually be optimized away where possible; stack.copy below is shorthand for stack.deepCopy):

func.func @main() -> i32 {
  %c0 = arith.constant 0 : i32
  %a = stack.push %c0 : (i32) -> !stack.lv<i32>
  %c1 = arith.constant 1 : i32
  %b = stack.push %c1 : (i32) -> !stack.lv<i32>

  %a_rv = stack.copy %a : (!stack.lv<i32>) -> i32
  %b_rv = stack.copy %b : (!stack.lv<i32>) -> i32
  %rtn = arith.addi %a_rv, %b_rv : i32
  func.return %rtn : i32
}

The Rust example can be represented as something like:

// let mut a = 0;
// let mut b = &mut a; // borrows (references) a
%c0 = arith.constant 0 : i32
%a = stack.push %c0 {mutable = true} : (i32) -> !stack.lv<i32>
%b = stack.reference %a : (!stack.lv<i32>) -> !stack.ref<!stack.lv<i32>>
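A store through the reference would need a dedicated op; as a sketch, stack.assign below is a hypothetical name, not part of the operation set above:

// *b = 2; // writing through the borrow updates a
%c2 = arith.constant 2 : i32
stack.assign %c2, %b : (i32, !stack.ref<!stack.lv<i32>>) -> ()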

I applied this idea and have already built a kind of general-purpose language for optimizing HPC workloads. I found the above functionality useful for that, and thought it would be great if MLIR supported it natively.

If you like this idea, there is a lot more to discuss. We could track the lifetimes of tensors or memrefs, and implement some crucial language features that general-purpose languages often provide.

I’d like to receive feedback and questions about this proposal.

Have you thought of having an LLVM-like model where instead of getting an lvalue you simply get a pointer to an allocated value?

The problem I have with this is that it partially introduces a notion of "big types" that you want to access in parts. Should this be unified somehow?

What would be the type system for shallow copies?

At first glance it seems like something a tree-attribute grammar would have at the frontend, not something we would want in the IR?

Could you elaborate on the reason for the existence of such a feature or scaffolding?

Yes, but this is a more high-level approach which abstracts the LLVM-like model; it can be lowered into the ptr-like semantics used in the LLVM model.
Generally, a higher-level abstraction is easier to optimize and analyze. I consider this a first step toward evolving MLIR to support general-purpose languages. Many optimization passes and analyses can be done more easily on a higher-level abstraction, such as forwarding l-values to r-values (which is copy elision in the LLVM-like model, sketched below) or lifetime analysis of stack values.
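As a hypothetical before/after of that forwarding rewrite (assuming the ops proposed above):

// Before: an r-value is pushed and immediately copied back.
%lv = stack.push %v : (i32) -> !stack.lv<i32>
%rv = stack.copy %lv : (!stack.lv<i32>) -> i32
%sum = arith.addi %rv, %rv : i32

// After forwarding: uses of %rv become %v, and the push becomes
// dead if %lv has no other users.
%sum = arith.addi %v, %v : i32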

Indeed, this can also be represented at the AST level (I guess that’s what you mean by "tree-attribute grammar").
However, it is much easier to analyze and reason about the program when it can be translated to MLIR as-is. Language builders don’t have to build their own infrastructure for running analyses on an AST; they can simply lower their AST directly to this dialect and perform the analyses with the MLIR infrastructure.

This is similar to what Rust does with its MIR: language builders could build something like MIR using MLIR. I believe the power of MLIR comes from its ability to abstract higher-level concepts and analyze things that wouldn’t be easy at the LLVM IR level.

I don’t quite understand the proposal, but to begin with: what is the semantic difference between stack.push/stack.copy and alloca/load?

I’d say the difference between stack.push/stack.copy and llvm.alloca/llvm.load is the abstraction level. llvm.alloca and llvm.load only do simple byte-level memory operations, without considering "what is being copied & allocated".

stack.push can, in most cases, be translated to a set of llvm.alloca & llvm.store, since turning an r-value into an l-value does not require a full deep copy.

stack.copy will be translated into llvm.load if the value is a simple scalar (such as i32). However, if the value is a tensor or a memref (or possibly another type which cannot be fully copied with simple byte manipulation; this happens often when someone builds a real language), stack.copy would be translated into memref.copy or tensor.copy, after performing llvm.load on the LLVM-level representation of the memref.

I believe this concept cannot be abstracted using llvm.load or llvm.alloca alone. If a language builder wants to define another kind of complex type with internal pointers, they can use this operation and write their own semantics for copying the type.
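For example, with a hypothetical user-defined type !mylang.string that owns an internal buffer (the type and op names are invented for illustration), the lowering of stack.copy would call whatever copy routine the type’s author registers:

%s = "mylang.make_string"() : () -> !mylang.string
%lv = stack.push %s : (!mylang.string) -> !stack.lv<!mylang.string>
// Lowers to mylang's own deep-copy routine rather than a byte copy.
%copy = stack.copy %lv : (!stack.lv<!mylang.string>) -> !mylang.string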

%c0 = arith.constant 0 : i32
%a = stack.push %c0 {mutable = true} : (i32) -> !stack.lv<i32>
%b = stack.reference %a : (!stack.lv<i32>) -> !stack.ref<!stack.lv<i32>>

can be translated into something like

%c0 = constant 0 : i32
%a = alloca i32, align 4
store i32 %c0, ptr %a
// %b is the same as %a here (can be canonicalized out)

If we had a memref, it would look something like

%tensor = memref.alloc() : memref<2xi32>
%pushed = stack.push %tensor : (memref<2xi32>) -> !stack.lv<memref<2xi32>>
%copied = stack.copy %pushed : (!stack.lv<memref<2xi32>>) -> memref<2xi32>

which can be lowered to

%memref = memref.alloc() : memref<2xi32>
%llvm_memref = builtin.unrealized_conversion_cast %memref : memref<2xi32> to !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%c1 = llvm.mlir.constant(1 : i64) : i64

// stack.push
%pushed = llvm.alloca %c1 x !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)> : (i64) -> !llvm.ptr
llvm.store %llvm_memref, %pushed : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>, !llvm.ptr

// stack.copy
%loaded = llvm.load %pushed : !llvm.ptr -> !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%memref2 = builtin.unrealized_conversion_cast %loaded : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)> to memref<2xi32>
%copied = memref.alloc() : memref<2xi32>
memref.copy %memref2, %copied : memref<2xi32> to memref<2xi32>

Here, "builtin.unrealized_conversion_cast" will be canonicalized away after full lowering. (I wrote this pseudo-example to clarify the concept.)

To summarize, I think this can be meaningful since it brings the concepts of l-values, r-values, and the program stack to a higher level. For someone who wants to build a general-purpose language using MLIR, this feature would be helpful for implementing important concepts of the language.

I don’t understand this. Could you give an example of a transformation that benefits from this? Or is it only a convenience for frontend developers?

What I meant by "cannot be abstracted using llvm.load or llvm.alloca" is:

First, types that potentially hold internal data (memrefs or tensors, or potentially other types that real-world languages use) cannot be expressed using low-level operations in one shot.
Second, the LLVM dialects are too low-level for representing a program AST directly, so it is convenient to have an abstraction layer.

For example, if someone plans to build something similar to Rust in MLIR with tensors,

// initialize a (2x3) tensor with 0
let mut a = tensor((2,3), 0); // the r-value tensor is "moved" into "a" (uses stack.push)
// initialize a (2x3) tensor with 1
let b = tensor((2,3), 1); // the r-value tensor is "moved" into "b" (uses stack.push)
let mut c = &mut a; // c "borrows" a (uses stack.reference)
c[[0, 0]] = 2; // updating through c updates "a" as well, since it is borrowed

let mut aCopy = copy(a); // aCopy is copied from "a" (uses stack.copy, which can be lowered to memref.copy)

// Add two tensors together
let result = b + c; // b and c are popped from the stack and added (uses stack.shallowCopy)
println!("{:?}", result);
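A hypothetical lowering of the first few lines, assuming a frontend-provided tensor-initialization op (mylang.tensor_init is invented for illustration):

%a0 = "mylang.tensor_init"() {value = 0 : i32} : () -> tensor<2x3xi32>
%a = stack.push %a0 {mutable = true} : (tensor<2x3xi32>) -> !stack.lv<tensor<2x3xi32>>
%b0 = "mylang.tensor_init"() {value = 1 : i32} : () -> tensor<2x3xi32>
%b = stack.push %b0 : (tensor<2x3xi32>) -> !stack.lv<tensor<2x3xi32>>
%c = stack.reference %a : (!stack.lv<tensor<2x3xi32>>) -> !stack.ref<!stack.lv<tensor<2x3xi32>>>
%aCopy = stack.copy %a : (!stack.lv<tensor<2x3xi32>>) -> tensor<2x3xi32>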

Of course this can all be expressed using the raw LLVM dialect, but it can be abstracted more easily using the "stack" dialect I am proposing.

This is a convenience layer for frontend developers who want to build a real-world language using MLIR (rather than an optimization layer).

My thinking about MLIR is that it has strong potential for building compilers for general-purpose languages, but the community is mostly focused on AI workload optimization. I thought this could be an initial phase of bringing MLIR to general-purpose programming.

One strong motivation is optimizing stack allocations, copies and data movement in a way that cannot be done with alloca. Do you have such a case?


Of course, but my question wasn’t about llvm.alloca directly: I was wondering about the concept of alloca returning a pointer to a "memory location with automatic scope" and the use of load/store to access the value at this location.
It isn’t clear to me whether you’re using different terminology for the same concept or whether there is a fundamental difference.
Note also that right now we’re adding a ptr dialect, for example, which is independent from LLVM.

Then having a generic alloca and load/store on top of this would be a matter of showing some value (vs. letting the dialect that introduces the type add its own ops/lowerings, for example), for instance along the lines of @rengolin’s question right above (generic optimization for stack allocations).