40574 – Field ordering still causes extra memcpy

Status:	NEW

Alias:	None

Product:	libraries
Classification:	Unclassified
Component:	Scalar Optimizations
Version:	trunk
Hardware:	PC All

Importance:	P enhancement
Assignee:	Unassigned LLVM Bugs

URL:
Keywords:

Depends on:
Blocks:

Reported:	2019-02-02 11:53 PST by Jeff Muizelaar
Modified:	2021-10-24 07:45 PDT
CC List:	7 users

See Also:
Fixed By Commit(s):

Description Jeff Muizelaar 2019-02-02 11:53:43 PST

Even with bug 39844 fixed, we can still get an extra memcpy depending on field ordering:

#include <stdlib.h>
struct SV {
        size_t capacity;
        size_t disc;
        size_t data[40];

        static SV make() {
                SV ret;
                ret.capacity = 0;
                ret.disc = 0;
                return ret;
        }
};

struct L {
        SV a;
        SV b;
};

template<class T>
struct Allocation {
    T *vec;
    void init(T s) {
        *vec = s;
    }
};

void bar(Allocation<L> a, double g) {
        L s = { SV::make(), SV::make() };
        a.init(s);
}

produces

bar(Allocation<L>, double):                # @bar(Allocation<L>, double)
        subq    $680, %rsp              # imm = 0x2A8
        xorps   %xmm0, %xmm0
        movaps  %xmm0, (%rsp)
        movaps  %xmm0, 336(%rsp)
        movq    %rsp, %rsi
        movl    $672, %edx              # imm = 0x2A0
        callq   memcpy
        addq    $680, %rsp              # imm = 0x2A8
        retq

but moving capacity to the end gives:

bar(Allocation<L>, double):                # @bar(Allocation<L>, double)
        movq    $0, (%rdi)
        xorps   %xmm0, %xmm0
        movups  %xmm0, 328(%rdi)
        movq    $0, 664(%rdi)
        retq
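
The reordered struct itself is not shown in the report. The sketch below is one layout consistent with the store offsets in the second listing (0, 328, and 664), assuming `disc` stays first, `capacity` moves after `data`, and `size_t` is 8 bytes. The names `SVReordered` and `LReordered` are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>

// Hypothetical variant of SV with `capacity` moved to the end.
// Assumption: `disc` remains the first field; the report does not say.
struct SVReordered {
    std::size_t disc;
    std::size_t data[40];
    std::size_t capacity;

    static SVReordered make() {
        SVReordered ret;
        ret.capacity = 0;
        ret.disc = 0;
        return ret;  // `data` is deliberately left uninitialized, as in the report
    }
};

struct LReordered {
    SVReordered a;
    SVReordered b;
};

// With an 8-byte size_t these evaluate to 328, 336, and 336 respectively.
static_assert(offsetof(SVReordered, capacity) == 41 * sizeof(std::size_t),
              "capacity follows disc and the 40-element data array");
static_assert(sizeof(SVReordered) == 42 * sizeof(std::size_t),
              "no padding between or after the fields");
static_assert(offsetof(LReordered, b) == sizeof(SVReordered),
              "b starts immediately after a");
```

Under these assumptions `a.capacity` (offset 328) and `b.disc` (offset 336) are adjacent, which would explain the single 16-byte `movups` at 328, with `a.disc` at 0 and `b.capacity` at 664 stored separately.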

Comment 1 Jeff Muizelaar 2019-02-02 11:55:32 PST

The difference from bug 39844 is the additional SV field in L

Comment 2 Jeff Muizelaar 2019-02-15 06:19:37 PST

GCC compiles it as you'd expect:

bar(Allocation<L>, double):
        pxor    %xmm0, %xmm0
        movups  %xmm0, (%rdi)
        movups  %xmm0, 336(%rdi)
        ret

Comment 3 David Blaikie 2019-02-15 09:19:44 PST

Adding Lang here, who did some work on memcpy optimization of struct copies & might be interested.

Comment 4 Nikita Popov 2019-03-26 14:16:02 PDT

Reduced example of the problem:

define void @test(i8* %p) {
  %a = alloca [8 x i64]
  %a8 = bitcast [8 x i64]* %a to i8*
  call void @llvm.memset.p0i8.i32(i8* %a8, i8 0, i32 16, i1 false)
  %a8_32 = getelementptr inbounds i8, i8* %a8, i32 32
  call void @llvm.memset.p0i8.i32(i8* %a8_32, i8 0, i32 16, i1 false)
  call void @llvm.memcpy.p0i8.p0i8.i32(i8* %p, i8* %a8, i32 64, i1 false)
  ret void
}

declare void @llvm.memset.p0i8.i32(i8*, i8, i32, i1)
declare void @llvm.memcpy.p0i8.p0i8.i32(i8*, i8*, i32, i1)

This IR is left unchanged by -memcpyopt.

The basic transformation we'd want to do here is to convert

    x = alloca
    stores inside x
    y = memcpy x
    // x not used past here

into

    x = alloca
    y = memcpy x
    stores inside y

in which case the memcpy of alloca (undef) may be elided entirely.

This is basically what call slot optimization does, just not for calls but for stores. A complication is that it needs to work for sequences of stores+memsets, not just a single instruction.
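
As a rough source-level analogue of the transformation described above, here is the reduced example modelled in C++. This is only an illustration of the before and after shapes, not how the pass is or would be implemented; the function names `copy_via_temporary` and `store_directly` are hypothetical.

```cpp
#include <cstring>

// Before: zero two 16-byte ranges of a 64-byte temporary, then copy the
// whole temporary to the destination. Mirrors the reduced IR example.
void copy_via_temporary(unsigned char *p) {
    unsigned char tmp[64];         // like the alloca: contents start out undefined
    std::memset(tmp, 0, 16);       // stores inside x
    std::memset(tmp + 32, 0, 16);  // stores inside x
    std::memcpy(p, tmp, 64);       // y = memcpy x; tmp is not used past here
}

// After: the stores are redirected to the destination, and the copy of the
// still-undefined temporary is dropped.
void store_directly(unsigned char *p) {
    std::memset(p, 0, 16);
    std::memset(p + 32, 0, 16);
}
```

The two functions agree on bytes 0..15 and 32..47 of the destination. They differ on the remaining bytes: the first version overwrites them with undefined values, while the second leaves them untouched. Since those bytes were undefined in the original, leaving the old contents in place is an acceptable refinement, provided the temporary really is unused afterwards.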

Comment 5 Jeff Muizelaar 2019-03-27 08:40:42 PDT

Nikita, why does the ordering of the fields make a difference?