Even with bug 39844 fixed, we can still get extra memcpys depending on field ordering:

#include <stdlib.h>

struct SV {
  size_t capacity;
  size_t disc;
  size_t data[40];

  static SV make() {
    SV ret;
    ret.capacity = 0;
    ret.disc = 0;
    return ret;
  }
};

struct L {
  SV a;
  SV b;
};

template<class T>
struct Allocation {
  T *vec;
  void init(T s) {
    *vec = s;
  }
};

void bar(Allocation<L> a, double g) {
  L s = { SV::make(), SV::make() };
  a.init(s);
}

produces

bar(Allocation<L>, double):             # @bar(Allocation<L>, double)
        subq    $680, %rsp              # imm = 0x2A8
        xorps   %xmm0, %xmm0
        movaps  %xmm0, (%rsp)
        movaps  %xmm0, 336(%rsp)
        movq    %rsp, %rsi
        movl    $672, %edx              # imm = 0x2A0
        callq   memcpy
        addq    $680, %rsp              # imm = 0x2A8
        retq

but moving capacity to the end gives:

bar(Allocation<L>, double):             # @bar(Allocation<L>, double)
        movq    $0, (%rdi)
        xorps   %xmm0, %xmm0
        movups  %xmm0, 328(%rdi)
        movq    $0, 664(%rdi)
        retq
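For reference, the constants in the assembly follow directly from the struct layout. The sketch below is only an illustration: it restates SV and L without make() (which does not affect layout) and assumes size_t has no padding requirements beyond its own size, as on x86-64.

```cpp
#include <cstddef>

// Same data members as in the report; make() is omitted because a static
// member function does not change the object layout.
struct SV {
    std::size_t capacity;
    std::size_t disc;
    std::size_t data[40];
};

struct L {
    SV a;
    SV b;
};

// 2 scalar fields + 40 array elements, all size_t, so no padding is expected.
static_assert(sizeof(SV) == 42 * sizeof(std::size_t), "unexpected padding in SV");
// b starts right after a; with 8-byte size_t this is the 336 in the asm.
static_assert(offsetof(L, b) == sizeof(SV), "unexpected padding in L");
// Total copy size; with 8-byte size_t this is the 672 passed to memcpy.
static_assert(sizeof(L) == 2 * sizeof(SV), "unexpected tail padding in L");
```

With capacity moved to the end of SV, a.capacity lands at offset 328 and b.disc at 336, so a single 16-byte store at 328 covers both, which matches the second listing.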
The difference from bug 39844 is the additional SV field in L.
GCC compiles it as you'd expect:

bar(Allocation<L>, double):
        pxor    %xmm0, %xmm0
        movups  %xmm0, (%rdi)
        movups  %xmm0, 336(%rdi)
        ret
Adding Lang here, who did some work on memcpy optimization of struct copies & might be interested.
Reduced example of the problem:

define void @test(i8* %p) {
  %a = alloca [8 x i64]
  %a8 = bitcast [8 x i64]* %a to i8*
  call void @llvm.memset.p0i8.i32(i8* %a8, i8 0, i32 16, i1 false)
  %a8_32 = getelementptr inbounds i8, i8* %a8, i32 32
  call void @llvm.memset.p0i8.i32(i8* %a8_32, i8 0, i32 16, i1 false)
  call void @llvm.memcpy.p0i8.p0i8.i32(i8* %p, i8* %a8, i32 64, i1 false)
  ret void
}

declare void @llvm.memset.p0i8.i32(i8*, i8, i32, i1)
declare void @llvm.memcpy.p0i8.p0i8.i32(i8*, i8*, i32, i1)

This IR stays invariant under -memcpyopt.

The basic transformation we'd want to do here is to convert

x = alloca
stores inside x
y = memcpy x
// x not used past here

into

x = alloca
y = memcpy x
stores inside y

in which case the memcpy of alloca (undef) may be elided entirely. This is basically what call slot optimization does, just not for calls but for stores. A complication is that it needs to work for sequences of stores+memsets, not just a single instruction.
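At source level, the before/after shapes of that transformation look roughly like the sketch below. The function names are made up for illustration, and it assumes p does not alias the temporary and that nothing else reads or writes p in between, since the rewrite is only sound under those conditions.

```cpp
#include <cstddef>
#include <cstring>

// "Before": mirrors the reduced IR. A 64-byte temporary receives two 16-byte
// memsets (offsets 0 and 32) and is then copied to p in full.
void fill_via_temporary(unsigned char *p) {
    unsigned char a[64];              // uninitialized, like the alloca
    std::memset(a, 0, 16);
    std::memset(a + 32, 0, 16);
    std::memcpy(p, a, sizeof a);      // also copies 32 indeterminate bytes
}

// "After": the initializing writes go straight to the destination, and the
// copy of the otherwise-uninitialized temporary disappears.
void fill_in_place(unsigned char *p) {
    std::memset(p, 0, 16);
    std::memset(p + 32, 0, 16);
}

// Helper: checks only the bytes both versions define (0..15 and 32..47).
bool defined_bytes_are_zero(const unsigned char *p) {
    for (std::size_t i = 0; i < 16; ++i) {
        if (p[i] != 0 || p[32 + i] != 0) return false;
    }
    return true;
}
```

The two versions agree on every byte the original defines. They differ only in the gaps (16..31 and 48..63), where the first writes indeterminate data and the second leaves p untouched, which is why dropping the memcpy is allowed.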
Nikita, why does the ordering of the fields make a difference?