[x86_64] Improve __int128 argument passing (in ix86_expand_move).

Message ID 014901d9b002$094f5ec0$1bee1c40$@nextmovesoftware.com
State Accepted
Headers
Series [x86_64] Improve __int128 argument passing (in ix86_expand_move). |

Checks

Context Check Description
snail/gcc-patch-check success Github commit url

Commit Message

Roger Sayle July 6, 2023, 12:04 p.m. UTC
  Passing 128-bit integer (TImode) parameters on x86_64 can sometimes
result in surprising code.  Consider the example below (from PR 43644):

__uint128 foo(__uint128 x, unsigned long long y) {
  return x+y;
}

which currently results in 6 consecutive movq instructions:

foo:    movq    %rsi, %rax
        movq    %rdi, %rsi
        movq    %rdx, %rcx
        movq    %rax, %rdi
        movq    %rsi, %rax
        movq    %rdi, %rdx
        addq    %rcx, %rax
        adcq    $0, %rdx
        ret

The underlying issue is that during RTL expansion, we generate the
following initial RTL for the x argument:

(insn 4 3 5 2 (set (reg:TI 85)
        (subreg:TI (reg:DI 86) 0)) "pr43644-2.c":5:1 -1
     (nil))
(insn 5 4 6 2 (set (subreg:DI (reg:TI 85) 8)
        (reg:DI 87)) "pr43644-2.c":5:1 -1
     (nil))
(insn 6 5 7 2 (set (reg/v:TI 84 [ x ])
        (reg:TI 85)) "pr43644-2.c":5:1 -1
     (nil))

which by combine/reload becomes

(insn 25 3 22 2 (set (reg/v:TI 84 [ x ])
        (const_int 0 [0])) "pr43644-2.c":5:1 -1
     (nil))
(insn 22 25 23 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 0)
        (reg:DI 93)) "pr43644-2.c":5:1 90 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 93)
        (nil)))
(insn 23 22 28 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 8)
        (reg:DI 94)) "pr43644-2.c":5:1 90 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 94)
        (nil)))

where the heavy use of SUBREG SET_DESTs creates challenges for both
combine and register allocation.

The improvement proposed here is to avoid these problematic SUBREGs
by adding (two) special cases to ix86_expand_move.  For insn 4, which
sets a TImode destination from a paradoxical SUBREG, to assign the
lowpart, we can use an explicit zero extension (zero_extendditi2 was
added in July 2022), and for insn 5, which sets the highpart of a
TImode register we can use the *insvti_highpart_1 instruction (that
was added in May 2023, after being approved for stage1 in January).
This allows combine to work its magic, merging these insns into a
*concatditi3 and from there into other optimized forms.

So for the test case above, we now generate only a single movq:

foo:    movq    %rdx, %rax
        xorl    %edx, %edx
        addq    %rdi, %rax
        adcq    %rsi, %rdx
        ret

But there is a little bad news.  This patch causes two (minor) missed
optimization regressions on x86_64; gcc.target/i386/pr82580.c and
gcc.target/i386/pr91681-1.c.  As shown in the test case above, we're
no longer generating adcq $0, but instead using xorl.  For the other
FAIL, register allocation now has more freedom and is (arbitrarily)
choosing a register assignment that doesn't match what the test is
expecting.  These issues are easier to explain and fix once this patch
is in the tree.

The good news is that this approach fixes a number of long standing
issues, that need to checked in bugzilla, including PR target/110533
which was just opened/reported earlier this week.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with only the two new FAILs described above.  Ok for mainline?

2023-07-06  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        PR target/43644
        PR target/110533
        * config/i386/i386-expand.cc (ix86_expand_move): Convert SETs of
        TImode destinations from paradoxical SUBREGs (setting the lowpart)
        into explicit zero extensions.  Use *insvti_highpart_1 instruction
        to set the highpart of a TImode destination.

gcc/testsuite/ChangeLog
        PR target/43644
        PR target/110533
        * gcc.target/i386/pr110533.c: New test case.
        * gcc.target/i386/pr43644-2.c: Likewise.


Thanks in advance,
Roger
--
  

Comments

Uros Bizjak July 6, 2023, 12:47 p.m. UTC | #1
On Thu, Jul 6, 2023 at 2:04 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Passing 128-bit integer (TImode) parameters on x86_64 can sometimes
> result in surprising code.  Consider the example below (from PR 43644):
>
> __uint128 foo(__uint128 x, unsigned long long y) {
>   return x+y;
> }
>
> which currently results in 6 consecutive movq instructions:
>
> foo:    movq    %rsi, %rax
>         movq    %rdi, %rsi
>         movq    %rdx, %rcx
>         movq    %rax, %rdi
>         movq    %rsi, %rax
>         movq    %rdi, %rdx
>         addq    %rcx, %rax
>         adcq    $0, %rdx
>         ret
>
> The underlying issue is that during RTL expansion, we generate the
> following initial RTL for the x argument:
>
> (insn 4 3 5 2 (set (reg:TI 85)
>         (subreg:TI (reg:DI 86) 0)) "pr43644-2.c":5:1 -1
>      (nil))
> (insn 5 4 6 2 (set (subreg:DI (reg:TI 85) 8)
>         (reg:DI 87)) "pr43644-2.c":5:1 -1
>      (nil))
> (insn 6 5 7 2 (set (reg/v:TI 84 [ x ])
>         (reg:TI 85)) "pr43644-2.c":5:1 -1
>      (nil))
>
> which by combine/reload becomes
>
> (insn 25 3 22 2 (set (reg/v:TI 84 [ x ])
>         (const_int 0 [0])) "pr43644-2.c":5:1 -1
>      (nil))
> (insn 22 25 23 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 0)
>         (reg:DI 93)) "pr43644-2.c":5:1 90 {*movdi_internal}
>      (expr_list:REG_DEAD (reg:DI 93)
>         (nil)))
> (insn 23 22 28 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 8)
>         (reg:DI 94)) "pr43644-2.c":5:1 90 {*movdi_internal}
>      (expr_list:REG_DEAD (reg:DI 94)
>         (nil)))
>
> where the heavy use of SUBREG SET_DESTs creates challenges for both
> combine and register allocation.
>
> The improvement proposed here is to avoid these problematic SUBREGs
> by adding (two) special cases to ix86_expand_move.  For insn 4, which
> sets a TImode destination from a paradoxical SUBREG, to assign the
> lowpart, we can use an explicit zero extension (zero_extendditi2 was
> added in July 2022), and for insn 5, which sets the highpart of a
> TImode register we can use the *insvti_highpart_1 instruction (that
> was added in May 2023, after being approved for stage1 in January).
> This allows combine to work its magic, merging these insns into a
> *concatditi3 and from there into other optimized forms.

How about we introduce *insvti_lowpart_1, similar to
*insvti_highpart_1, in the hope that combine is smart enough to also
combine these two instructions? IMO, faking insert to lowpart of the
register with zero_extend is a bit overkill, and could hinder some
other optimization opportunities (as perhaps hinted by failing
testcases).

Uros.

> So for the test case above, we now generate only a single movq:
>
> foo:    movq    %rdx, %rax
>         xorl    %edx, %edx
>         addq    %rdi, %rax
>         adcq    %rsi, %rdx
>         ret
>
> But there is a little bad news.  This patch causes two (minor) missed
> optimization regressions on x86_64; gcc.target/i386/pr82580.c and
> gcc.target/i386/pr91681-1.c.  As shown in the test case above, we're
> no longer generating adcq $0, but instead using xorl.  For the other
> FAIL, register allocation now has more freedom and is (arbitrarily)
> choosing a register assignment that doesn't match what the test is
> expecting.  These issues are easier to explain and fix once this patch
> is in the tree.
>
> The good news is that this approach fixes a number of long standing
> issues, that need to checked in bugzilla, including PR target/110533
> which was just opened/reported earlier this week.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with only the two new FAILs described above.  Ok for mainline?
>
> 2023-07-06  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         PR target/43644
>         PR target/110533
>         * config/i386/i386-expand.cc (ix86_expand_move): Convert SETs of
>         TImode destinations from paradoxical SUBREGs (setting the lowpart)
>         into explicit zero extensions.  Use *insvti_highpart_1 instruction
>         to set the highpart of a TImode destination.
>
> gcc/testsuite/ChangeLog
>         PR target/43644
>         PR target/110533
>         * gcc.target/i386/pr110533.c: New test case.
>         * gcc.target/i386/pr43644-2.c: Likewise.
>
>
> Thanks in advance,
> Roger
> --
>
  
Roger Sayle July 6, 2023, 1:48 p.m. UTC | #2
> On Thu, Jul 6, 2023 at 2:04 PM Roger Sayle <roger@nextmovesoftware.com>
> wrote:
> >
> >
> > Passing 128-bit integer (TImode) parameters on x86_64 can sometimes
> > result in surprising code.  Consider the example below (from PR 43644):
> >
> > __uint128 foo(__uint128 x, unsigned long long y) {
> >   return x+y;
> > }
> >
> > which currently results in 6 consecutive movq instructions:
> >
> > foo:    movq    %rsi, %rax
> >         movq    %rdi, %rsi
> >         movq    %rdx, %rcx
> >         movq    %rax, %rdi
> >         movq    %rsi, %rax
> >         movq    %rdi, %rdx
> >         addq    %rcx, %rax
> >         adcq    $0, %rdx
> >         ret
> >
> > The underlying issue is that during RTL expansion, we generate the
> > following initial RTL for the x argument:
> >
> > (insn 4 3 5 2 (set (reg:TI 85)
> >         (subreg:TI (reg:DI 86) 0)) "pr43644-2.c":5:1 -1
> >      (nil))
> > (insn 5 4 6 2 (set (subreg:DI (reg:TI 85) 8)
> >         (reg:DI 87)) "pr43644-2.c":5:1 -1
> >      (nil))
> > (insn 6 5 7 2 (set (reg/v:TI 84 [ x ])
> >         (reg:TI 85)) "pr43644-2.c":5:1 -1
> >      (nil))
> >
> > which by combine/reload becomes
> >
> > (insn 25 3 22 2 (set (reg/v:TI 84 [ x ])
> >         (const_int 0 [0])) "pr43644-2.c":5:1 -1
> >      (nil))
> > (insn 22 25 23 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 0)
> >         (reg:DI 93)) "pr43644-2.c":5:1 90 {*movdi_internal}
> >      (expr_list:REG_DEAD (reg:DI 93)
> >         (nil)))
> > (insn 23 22 28 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 8)
> >         (reg:DI 94)) "pr43644-2.c":5:1 90 {*movdi_internal}
> >      (expr_list:REG_DEAD (reg:DI 94)
> >         (nil)))
> >
> > where the heavy use of SUBREG SET_DESTs creates challenges for both
> > combine and register allocation.
> >
> > The improvement proposed here is to avoid these problematic SUBREGs by
> > adding (two) special cases to ix86_expand_move.  For insn 4, which
> > sets a TImode destination from a paradoxical SUBREG, to assign the
> > lowpart, we can use an explicit zero extension (zero_extendditi2 was
> > added in July 2022), and for insn 5, which sets the highpart of a
> > TImode register we can use the *insvti_highpart_1 instruction (that
> > was added in May 2023, after being approved for stage1 in January).
> > This allows combine to work its magic, merging these insns into a
> > *concatditi3 and from there into other optimized forms.
> 
> How about we introduce *insvti_lowpart_1, similar to *insvti_highpart_1, in the
> hope that combine is smart enough to also combine these two instructions? IMO,
> faking insert to lowpart of the register with zero_extend is a bit overkill, and could
> hinder some other optimization opportunities (as perhaps hinted by failing
> testcases).

The use of ZERO_EXTEND serves two purposes, both the setting of the lowpart
and of informing the RTL passes that the highpart is dead.  Notice in the original
RTL stream, i.e. current GCC, insn 25 is inserted by the .286r.init-regs pass, clearing
the entirety of the TImode register (like a clobber), and preventing TI:84 from
occupying the same registers as DI:93 and DI:94.

If the middle-end had asked the backend to generate a SET to STRICT_LOWPART
then our hands would be tied, but a paradoxical SUBREG allows us the freedom
to set the highpart bits to a defined value (we could have used sign extension if
that was cheap), which then simplifies data-flow and liveness analysis.  Allowing the
highpart to contain undefined or untouched data is exactly the sort of security
side-channel leakage that the clear regs pass attempts to address.

I can investigate an *insvti_lowpart_1, but I don't think it will help with this
issue, i.e. it won't prevent init-regs from clobbering/clearing TImode parameters.

> > So for the test case above, we now generate only a single movq:
> >
> > foo:    movq    %rdx, %rax
> >         xorl    %edx, %edx
> >         addq    %rdi, %rax
> >         adcq    %rsi, %rdx
> >         ret
> >
> > But there is a little bad news.  This patch causes two (minor) missed
> > optimization regressions on x86_64; gcc.target/i386/pr82580.c and
> > gcc.target/i386/pr91681-1.c.  As shown in the test case above, we're
> > no longer generating adcq $0, but instead using xorl.  For the other
> > FAIL, register allocation now has more freedom and is (arbitrarily)
> > choosing a register assignment that doesn't match what the test is
> > expecting.  These issues are easier to explain and fix once this patch
> > is in the tree.
> >
> > The good news is that this approach fixes a number of long standing
> > issues, that need to checked in bugzilla, including PR target/110533
> > which was just opened/reported earlier this week.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with only the two new FAILs described above.  Ok for mainline?
> >
> > 2023-07-06  Roger Sayle  <roger@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         PR target/43644
> >         PR target/110533
> >         * config/i386/i386-expand.cc (ix86_expand_move): Convert SETs of
> >         TImode destinations from paradoxical SUBREGs (setting the lowpart)
> >         into explicit zero extensions.  Use *insvti_highpart_1 instruction
> >         to set the highpart of a TImode destination.
> >
> > gcc/testsuite/ChangeLog
> >         PR target/43644
> >         PR target/110533
> >         * gcc.target/i386/pr110533.c: New test case.
> >         * gcc.target/i386/pr43644-2.c: Likewise.
> >
> > Thanks in advance,
> > Roger
> > --
> >
  
Uros Bizjak July 6, 2023, 2:20 p.m. UTC | #3
On Thu, Jul 6, 2023 at 3:48 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
> > On Thu, Jul 6, 2023 at 2:04 PM Roger Sayle <roger@nextmovesoftware.com>
> > wrote:
> > >
> > >
> > > Passing 128-bit integer (TImode) parameters on x86_64 can sometimes
> > > result in surprising code.  Consider the example below (from PR 43644):
> > >
> > > __uint128 foo(__uint128 x, unsigned long long y) {
> > >   return x+y;
> > > }
> > >
> > > which currently results in 6 consecutive movq instructions:
> > >
> > > foo:    movq    %rsi, %rax
> > >         movq    %rdi, %rsi
> > >         movq    %rdx, %rcx
> > >         movq    %rax, %rdi
> > >         movq    %rsi, %rax
> > >         movq    %rdi, %rdx
> > >         addq    %rcx, %rax
> > >         adcq    $0, %rdx
> > >         ret
> > >
> > > The underlying issue is that during RTL expansion, we generate the
> > > following initial RTL for the x argument:
> > >
> > > (insn 4 3 5 2 (set (reg:TI 85)
> > >         (subreg:TI (reg:DI 86) 0)) "pr43644-2.c":5:1 -1
> > >      (nil))
> > > (insn 5 4 6 2 (set (subreg:DI (reg:TI 85) 8)
> > >         (reg:DI 87)) "pr43644-2.c":5:1 -1
> > >      (nil))
> > > (insn 6 5 7 2 (set (reg/v:TI 84 [ x ])
> > >         (reg:TI 85)) "pr43644-2.c":5:1 -1
> > >      (nil))
> > >
> > > which by combine/reload becomes
> > >
> > > (insn 25 3 22 2 (set (reg/v:TI 84 [ x ])
> > >         (const_int 0 [0])) "pr43644-2.c":5:1 -1
> > >      (nil))
> > > (insn 22 25 23 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 0)
> > >         (reg:DI 93)) "pr43644-2.c":5:1 90 {*movdi_internal}
> > >      (expr_list:REG_DEAD (reg:DI 93)
> > >         (nil)))
> > > (insn 23 22 28 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 8)
> > >         (reg:DI 94)) "pr43644-2.c":5:1 90 {*movdi_internal}
> > >      (expr_list:REG_DEAD (reg:DI 94)
> > >         (nil)))
> > >
> > > where the heavy use of SUBREG SET_DESTs creates challenges for both
> > > combine and register allocation.
> > >
> > > The improvement proposed here is to avoid these problematic SUBREGs by
> > > adding (two) special cases to ix86_expand_move.  For insn 4, which
> > > sets a TImode destination from a paradoxical SUBREG, to assign the
> > > lowpart, we can use an explicit zero extension (zero_extendditi2 was
> > > added in July 2022), and for insn 5, which sets the highpart of a
> > > TImode register we can use the *insvti_highpart_1 instruction (that
> > > was added in May 2023, after being approved for stage1 in January).
> > > This allows combine to work its magic, merging these insns into a
> > > *concatditi3 and from there into other optimized forms.
> >
> > How about we introduce *insvti_lowpart_1, similar to *insvti_highpart_1, in the
> > hope that combine is smart enough to also combine these two instructions? IMO,
> > faking insert to lowpart of the register with zero_extend is a bit overkill, and could
> > hinder some other optimization opportunities (as perhaps hinted by failing
> > testcases).
>
> The use of ZERO_EXTEND serves two purposes, both the setting of the lowpart
> and of informing the RTL passes that the highpart is dead.  Notice in the original
> RTL stream, i.e. current GCC, insn 25 is inserted by the .286r.init-regs pass, clearing
> the entirety of the TImode register (like a clobber), and preventing TI:84 from
> occupying the same registers as DI:93 and DI:94.
>
> If the middle-end had asked the backend to generate a SET to STRICT_LOWPART
> then our hands would be tied, but a paradoxical SUBREG allows us the freedom
> to set the highpart bits to a defined value (we could have used sign extension if
> that was cheap), which then simplifies data-flow and liveness analysis.  Allowing the
> highpart to contain undefined or untouched data is exactly the sort of security
> side-channel leakage that the clear regs pass attempts to address.
>
> I can investigate an *insvti_lowpart_1, but I don't think it will help with this
> issue, i.e. it won't prevent init-regs from clobbering/clearing TImode parameters.

Thanks for the explanation, the patch is OK then.

Thanks,
Uros.

>
> > > So for the test case above, we now generate only a single movq:
> > >
> > > foo:    movq    %rdx, %rax
> > >         xorl    %edx, %edx
> > >         addq    %rdi, %rax
> > >         adcq    %rsi, %rdx
> > >         ret
> > >
> > > But there is a little bad news.  This patch causes two (minor) missed
> > > optimization regressions on x86_64; gcc.target/i386/pr82580.c and
> > > gcc.target/i386/pr91681-1.c.  As shown in the test case above, we're
> > > no longer generating adcq $0, but instead using xorl.  For the other
> > > FAIL, register allocation now has more freedom and is (arbitrarily)
> > > choosing a register assignment that doesn't match what the test is
> > > expecting.  These issues are easier to explain and fix once this patch
> > > is in the tree.
> > >
> > > The good news is that this approach fixes a number of long standing
> > > issues, that need to checked in bugzilla, including PR target/110533
> > > which was just opened/reported earlier this week.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32}
> > > with only the two new FAILs described above.  Ok for mainline?
> > >
> > > 2023-07-06  Roger Sayle  <roger@nextmovesoftware.com>
> > >
> > > gcc/ChangeLog
> > >         PR target/43644
> > >         PR target/110533
> > >         * config/i386/i386-expand.cc (ix86_expand_move): Convert SETs of
> > >         TImode destinations from paradoxical SUBREGs (setting the lowpart)
> > >         into explicit zero extensions.  Use *insvti_highpart_1 instruction
> > >         to set the highpart of a TImode destination.
> > >
> > > gcc/testsuite/ChangeLog
> > >         PR target/43644
> > >         PR target/110533
> > >         * gcc.target/i386/pr110533.c: New test case.
> > >         * gcc.target/i386/pr43644-2.c: Likewise.
> > >
> > > Thanks in advance,
> > > Roger
> > > --
> > >
>
  
Manolis Tsamis Sept. 1, 2023, 10:44 a.m. UTC | #4
Hi Roger,

I've (accidentally) found a codegen regression that I bisected down to
this patch.
For these two functions:

typedef struct {
  float minx, miny;
  float maxx, maxy;
} AABB;

int TestOverlap(AABB a, AABB b) {
  return a.minx <= b.maxx
      && a.miny <= b.maxy
      && a.maxx >= b.minx
      && a.maxx >= b.minx;
}

int TestOverlap2(AABB a, AABB b) {
  return a.miny <= b.maxy
      && a.maxx >= b.minx;
}

GCC used to produce this code:

TestOverlap:
        comiss  xmm3, xmm0
        movq    rdx, xmm0
        movq    rsi, xmm1
        movq    rax, xmm3
        jb      .L10
        shr     rdx, 32
        shr     rax, 32
        movd    xmm0, eax
        movd    xmm4, edx
        comiss  xmm0, xmm4
        jb      .L10
        movd    xmm1, esi
        xor     eax, eax
        comiss  xmm1, xmm2
        setnb   al
        ret
.L10:
        xor     eax, eax
        ret
TestOverlap2:
        shufps  xmm0, xmm0, 85
        shufps  xmm3, xmm3, 85
        comiss  xmm3, xmm0
        jb      .L17
        xor     eax, eax
        comiss  xmm1, xmm2
        setnb   al
        ret
.L17:
        xor     eax, eax
        ret

After this patch codegen gets much worse:

TestOverlap:
        movq    rax, xmm1
        movq    rdx, xmm2
        movq    rsi, xmm0
        mov     rdi, rax
        movq    rax, xmm3
        mov     rcx, rsi
        xchg    rdx, rax
        movd    xmm1, edx
        mov     rsi, rax
        mov     rax, rdx
        comiss  xmm1, xmm0
        jb      .L10
        shr     rcx, 32
        shr     rax, 32
        movd    xmm0, eax
        movd    xmm4, ecx
        comiss  xmm0, xmm4
        jb      .L10
        movd    xmm0, esi
        movd    xmm1, edi
        xor     eax, eax
        comiss  xmm1, xmm0
        setnb   al
        ret
.L10:
        xor     eax, eax
        ret
TestOverlap2:
        movq    rdx, xmm2
        movq    rax, xmm3
        movq    rsi, xmm0
        xchg    rdx, rax
        mov     rcx, rsi
        mov     rsi, rax
        mov     rax, rdx
        shr     rcx, 32
        shr     rax, 32
        movd    xmm4, ecx
        movd    xmm0, eax
        comiss  xmm0, xmm4
        jb      .L17
        movd    xmm0, esi
        xor     eax, eax
        comiss  xmm1, xmm0
        setnb   al
        ret
.L17:
        xor     eax, eax
        ret

I saw that you've been improving i386 argument passing, so maybe this
is just a missed case of these additions?

(Can also be seen here https://godbolt.org/z/E4xrEn6KW)

PS: I found the code that clang generates, with cmpleps + pextrw to
avoid the fp->int->fp + shr interesting. I wonder if something like
this could be added to GCC as well.

Thanks!
Manolis

On Thu, Jul 6, 2023 at 5:21 PM Uros Bizjak via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Thu, Jul 6, 2023 at 3:48 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
> >
> > > On Thu, Jul 6, 2023 at 2:04 PM Roger Sayle <roger@nextmovesoftware.com>
> > > wrote:
> > > >
> > > >
> > > > Passing 128-bit integer (TImode) parameters on x86_64 can sometimes
> > > > result in surprising code.  Consider the example below (from PR 43644):
> > > >
> > > > __uint128 foo(__uint128 x, unsigned long long y) {
> > > >   return x+y;
> > > > }
> > > >
> > > > which currently results in 6 consecutive movq instructions:
> > > >
> > > > foo:    movq    %rsi, %rax
> > > >         movq    %rdi, %rsi
> > > >         movq    %rdx, %rcx
> > > >         movq    %rax, %rdi
> > > >         movq    %rsi, %rax
> > > >         movq    %rdi, %rdx
> > > >         addq    %rcx, %rax
> > > >         adcq    $0, %rdx
> > > >         ret
> > > >
> > > > The underlying issue is that during RTL expansion, we generate the
> > > > following initial RTL for the x argument:
> > > >
> > > > (insn 4 3 5 2 (set (reg:TI 85)
> > > >         (subreg:TI (reg:DI 86) 0)) "pr43644-2.c":5:1 -1
> > > >      (nil))
> > > > (insn 5 4 6 2 (set (subreg:DI (reg:TI 85) 8)
> > > >         (reg:DI 87)) "pr43644-2.c":5:1 -1
> > > >      (nil))
> > > > (insn 6 5 7 2 (set (reg/v:TI 84 [ x ])
> > > >         (reg:TI 85)) "pr43644-2.c":5:1 -1
> > > >      (nil))
> > > >
> > > > which by combine/reload becomes
> > > >
> > > > (insn 25 3 22 2 (set (reg/v:TI 84 [ x ])
> > > >         (const_int 0 [0])) "pr43644-2.c":5:1 -1
> > > >      (nil))
> > > > (insn 22 25 23 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 0)
> > > >         (reg:DI 93)) "pr43644-2.c":5:1 90 {*movdi_internal}
> > > >      (expr_list:REG_DEAD (reg:DI 93)
> > > >         (nil)))
> > > > (insn 23 22 28 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 8)
> > > >         (reg:DI 94)) "pr43644-2.c":5:1 90 {*movdi_internal}
> > > >      (expr_list:REG_DEAD (reg:DI 94)
> > > >         (nil)))
> > > >
> > > > where the heavy use of SUBREG SET_DESTs creates challenges for both
> > > > combine and register allocation.
> > > >
> > > > The improvement proposed here is to avoid these problematic SUBREGs by
> > > > adding (two) special cases to ix86_expand_move.  For insn 4, which
> > > > sets a TImode destination from a paradoxical SUBREG, to assign the
> > > > lowpart, we can use an explicit zero extension (zero_extendditi2 was
> > > > added in July 2022), and for insn 5, which sets the highpart of a
> > > > TImode register we can use the *insvti_highpart_1 instruction (that
> > > > was added in May 2023, after being approved for stage1 in January).
> > > > This allows combine to work its magic, merging these insns into a
> > > > *concatditi3 and from there into other optimized forms.
> > >
> > > How about we introduce *insvti_lowpart_1, similar to *insvti_highpart_1, in the
> > > hope that combine is smart enough to also combine these two instructions? IMO,
> > > faking insert to lowpart of the register with zero_extend is a bit overkill, and could
> > > hinder some other optimization opportunities (as perhaps hinted by failing
> > > testcases).
> >
> > The use of ZERO_EXTEND serves two purposes, both the setting of the lowpart
> > and of informing the RTL passes that the highpart is dead.  Notice in the original
> > RTL stream, i.e. current GCC, insn 25 is inserted by the .286r.init-regs pass, clearing
> > the entirety of the TImode register (like a clobber), and preventing TI:84 from
> > occupying the same registers as DI:93 and DI:94.
> >
> > If the middle-end had asked the backend to generate a SET to STRICT_LOWPART
> > then our hands would be tied, but a paradoxical SUBREG allows us the freedom
> > to set the highpart bits to a defined value (we could have used sign extension if
> > that was cheap), which then simplifies data-flow and liveness analysis.  Allowing the
> > highpart to contain undefined or untouched data is exactly the sort of security
> > side-channel leakage that the clear regs pass attempts to address.
> >
> > I can investigate an *insvti_lowpart_1, but I don't think it will help with this
> > issue, i.e. it won't prevent init-regs from clobbering/clearing TImode parameters.
>
> Thanks for the explanation, the patch is OK then.
>
> Thanks,
> Uros.
>
> >
> > > > So for the test case above, we now generate only a single movq:
> > > >
> > > > foo:    movq    %rdx, %rax
> > > >         xorl    %edx, %edx
> > > >         addq    %rdi, %rax
> > > >         adcq    %rsi, %rdx
> > > >         ret
> > > >
> > > > But there is a little bad news.  This patch causes two (minor) missed
> > > > optimization regressions on x86_64; gcc.target/i386/pr82580.c and
> > > > gcc.target/i386/pr91681-1.c.  As shown in the test case above, we're
> > > > no longer generating adcq $0, but instead using xorl.  For the other
> > > > FAIL, register allocation now has more freedom and is (arbitrarily)
> > > > choosing a register assignment that doesn't match what the test is
> > > > expecting.  These issues are easier to explain and fix once this patch
> > > > is in the tree.
> > > >
> > > > The good news is that this approach fixes a number of long standing
> > > > issues, that need to checked in bugzilla, including PR target/110533
> > > > which was just opened/reported earlier this week.
> > > >
> > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > > and make -k check, both with and without --target_board=unix{-m32}
> > > > with only the two new FAILs described above.  Ok for mainline?
> > > >
> > > > 2023-07-06  Roger Sayle  <roger@nextmovesoftware.com>
> > > >
> > > > gcc/ChangeLog
> > > >         PR target/43644
> > > >         PR target/110533
> > > >         * config/i386/i386-expand.cc (ix86_expand_move): Convert SETs of
> > > >         TImode destinations from paradoxical SUBREGs (setting the lowpart)
> > > >         into explicit zero extensions.  Use *insvti_highpart_1 instruction
> > > >         to set the highpart of a TImode destination.
> > > >
> > > > gcc/testsuite/ChangeLog
> > > >         PR target/43644
> > > >         PR target/110533
> > > >         * gcc.target/i386/pr110533.c: New test case.
> > > >         * gcc.target/i386/pr43644-2.c: Likewise.
> > > >
> > > > Thanks in advance,
> > > > Roger
> > > > --
> > > >
> >
  
Roger Sayle Sept. 1, 2023, 12:35 p.m. UTC | #5
Hi Manolis,
Many thanks.  If you haven't already, could you create/file a
bug report at https://gcc.gnu.org/bugzilla/ which ensures this
doesn't get lost/forgotten.  It provides a PR number for tracking
discussions, and patches/fixes with PR numbers are (often)
prioritized during the review and approval process.

I'll investigate what's going on.  Either my "improvements"
need to be disabled for V2SF arguments, or the middle/back
end needs to figure out how to efficiently shuffle these values,
without reload moving them via integer registers, at least as
efficiently as we were before.  As you/clang show, we could
do better.

Thanks again, and sorry for any inconvenience.
Best regards,
Roger
--

> -----Original Message-----
> From: Manolis Tsamis <manolis.tsamis@vrull.eu>
> Sent: 01 September 2023 11:45
> To: Uros Bizjak <ubizjak@gmail.com>
> Cc: Roger Sayle <roger@nextmovesoftware.com>; gcc-patches@gcc.gnu.org
> Subject: Re: [x86_64 PATCH] Improve __int128 argument passing (in
> ix86_expand_move).
> 
> Hi Roger,
> 
> I've (accidentally) found a codegen regression that I bisected down to this patch.
> For these two functions:
> 
> typedef struct {
>   float minx, miny;
>   float maxx, maxy;
> } AABB;
> 
> int TestOverlap(AABB a, AABB b) {
>   return a.minx <= b.maxx
>       && a.miny <= b.maxy
>       && a.maxx >= b.minx
>       && a.maxx >= b.minx;
> }
> 
> int TestOverlap2(AABB a, AABB b) {
>   return a.miny <= b.maxy
>       && a.maxx >= b.minx;
> }
> 
> GCC used to produce this code:
> 
> TestOverlap:
>         comiss  xmm3, xmm0
>         movq    rdx, xmm0
>         movq    rsi, xmm1
>         movq    rax, xmm3
>         jb      .L10
>         shr     rdx, 32
>         shr     rax, 32
>         movd    xmm0, eax
>         movd    xmm4, edx
>         comiss  xmm0, xmm4
>         jb      .L10
>         movd    xmm1, esi
>         xor     eax, eax
>         comiss  xmm1, xmm2
>         setnb   al
>         ret
> .L10:
>         xor     eax, eax
>         ret
> TestOverlap2:
>         shufps  xmm0, xmm0, 85
>         shufps  xmm3, xmm3, 85
>         comiss  xmm3, xmm0
>         jb      .L17
>         xor     eax, eax
>         comiss  xmm1, xmm2
>         setnb   al
>         ret
> .L17:
>         xor     eax, eax
>         ret
> 
> After this patch codegen gets much worse:
> 
> TestOverlap:
>         movq    rax, xmm1
>         movq    rdx, xmm2
>         movq    rsi, xmm0
>         mov     rdi, rax
>         movq    rax, xmm3
>         mov     rcx, rsi
>         xchg    rdx, rax
>         movd    xmm1, edx
>         mov     rsi, rax
>         mov     rax, rdx
>         comiss  xmm1, xmm0
>         jb      .L10
>         shr     rcx, 32
>         shr     rax, 32
>         movd    xmm0, eax
>         movd    xmm4, ecx
>         comiss  xmm0, xmm4
>         jb      .L10
>         movd    xmm0, esi
>         movd    xmm1, edi
>         xor     eax, eax
>         comiss  xmm1, xmm0
>         setnb   al
>         ret
> .L10:
>         xor     eax, eax
>         ret
> TestOverlap2:
>         movq    rdx, xmm2
>         movq    rax, xmm3
>         movq    rsi, xmm0
>         xchg    rdx, rax
>         mov     rcx, rsi
>         mov     rsi, rax
>         mov     rax, rdx
>         shr     rcx, 32
>         shr     rax, 32
>         movd    xmm4, ecx
>         movd    xmm0, eax
>         comiss  xmm0, xmm4
>         jb      .L17
>         movd    xmm0, esi
>         xor     eax, eax
>         comiss  xmm1, xmm0
>         setnb   al
>         ret
> .L17:
>         xor     eax, eax
>         ret
> 
> I saw that you've been improving i386 argument passing, so maybe this is just a
> missed case of these additions?
> 
> (Can also be seen here https://godbolt.org/z/E4xrEn6KW)
> 
> PS: I found the code that clang generates, with cmpleps + pextrw to avoid the fp-
> >int->fp + shr interesting. I wonder if something like this could be added to GCC as
> well.
> 
> Thanks!
> Manolis
> 
> On Thu, Jul 6, 2023 at 5:21 PM Uros Bizjak via Gcc-patches <gcc-
> patches@gcc.gnu.org> wrote:
> >
> > On Thu, Jul 6, 2023 at 3:48 PM Roger Sayle <roger@nextmovesoftware.com>
> wrote:
> > >
> > > > On Thu, Jul 6, 2023 at 2:04 PM Roger Sayle
> > > > <roger@nextmovesoftware.com>
> > > > wrote:
> > > > >
> > > > >
> > > > > Passing 128-bit integer (TImode) parameters on x86_64 can
> > > > > sometimes result in surprising code.  Consider the example below (from PR
> 43644):
> > > > >
> > > > > __uint128 foo(__uint128 x, unsigned long long y) {
> > > > >   return x+y;
> > > > > }
> > > > >
> > > > > which currently results in 6 consecutive movq instructions:
> > > > >
> > > > > foo:    movq    %rsi, %rax
> > > > >         movq    %rdi, %rsi
> > > > >         movq    %rdx, %rcx
> > > > >         movq    %rax, %rdi
> > > > >         movq    %rsi, %rax
> > > > >         movq    %rdi, %rdx
> > > > >         addq    %rcx, %rax
> > > > >         adcq    $0, %rdx
> > > > >         ret
> > > > >
> > > > > The underlying issue is that during RTL expansion, we generate
> > > > > the following initial RTL for the x argument:
> > > > >
> > > > > (insn 4 3 5 2 (set (reg:TI 85)
> > > > >         (subreg:TI (reg:DI 86) 0)) "pr43644-2.c":5:1 -1
> > > > >      (nil))
> > > > > (insn 5 4 6 2 (set (subreg:DI (reg:TI 85) 8)
> > > > >         (reg:DI 87)) "pr43644-2.c":5:1 -1
> > > > >      (nil))
> > > > > (insn 6 5 7 2 (set (reg/v:TI 84 [ x ])
> > > > >         (reg:TI 85)) "pr43644-2.c":5:1 -1
> > > > >      (nil))
> > > > >
> > > > > which by combine/reload becomes
> > > > >
> > > > > (insn 25 3 22 2 (set (reg/v:TI 84 [ x ])
> > > > >         (const_int 0 [0])) "pr43644-2.c":5:1 -1
> > > > >      (nil))
> > > > > (insn 22 25 23 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 0)
> > > > >         (reg:DI 93)) "pr43644-2.c":5:1 90 {*movdi_internal}
> > > > >      (expr_list:REG_DEAD (reg:DI 93)
> > > > >         (nil)))
> > > > > (insn 23 22 28 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 8)
> > > > >         (reg:DI 94)) "pr43644-2.c":5:1 90 {*movdi_internal}
> > > > >      (expr_list:REG_DEAD (reg:DI 94)
> > > > >         (nil)))
> > > > >
> > > > > where the heavy use of SUBREG SET_DESTs creates challenges for
> > > > > both combine and register allocation.
> > > > >
> > > > > The improvement proposed here is to avoid these problematic
> > > > > SUBREGs by adding (two) special cases to ix86_expand_move.  For
> > > > > insn 4, which sets a TImode destination from a paradoxical
> > > > > SUBREG, to assign the lowpart, we can use an explicit zero
> > > > > extension (zero_extendditi2 was added in July 2022), and for
> > > > > insn 5, which sets the highpart of a TImode register we can use
> > > > > the *insvti_highpart_1 instruction (that was added in May 2023, after
> being approved for stage1 in January).
> > > > > This allows combine to work its magic, merging these insns into
> > > > > a
> > > > > *concatditi3 and from there into other optimized forms.
> > > >
> > > > How about we introduce *insvti_lowpart_1, similar to
> > > > *insvti_highpart_1, in the hope that combine is smart enough to
> > > > also combine these two instructions? IMO, faking insert to lowpart
> > > > of the register with zero_extend is a bit overkill, and could
> > > > hinder some other optimization opportunities (as perhaps hinted by failing
> testcases).
> > >
> > > The use of ZERO_EXTEND serves two purposes, both the setting of the
> > > lowpart and of informing the RTL passes that the highpart is dead.
> > > Notice in the original RTL stream, i.e. current GCC, insn 25 is
> > > inserted by the .286r.init-regs pass, clearing the entirety of the
> > > TImode register (like a clobber), and preventing TI:84 from occupying the
> same registers as DI:93 and DI:94.
> > >
> > > If the middle-end had asked the backend to generate a SET to
> > > STRICT_LOWPART then our hands would be tied, but a paradoxical
> > > SUBREG allows us the freedom to set the highpart bits to a defined
> > > value (we could have used sign extension if that was cheap), which
> > > then simplifies data-flow and liveness analysis.  Allowing the
> > > highpart to contain undefined or untouched data is exactly the sort of security
> side-channel leakage that the clear regs pass attempts to address.
> > >
> > > I can investigate an *insvti_lowpart_1, but I don't think it will
> > > help with this issue, i.e. it won't prevent init-regs from clobbering/clearing
> TImode parameters.
> >
> > Thanks for the explanation, the patch is OK then.
> >
> > Thanks,
> > Uros.
> >
> > >
> > > > > So for the test case above, we now generate only a single movq:
> > > > >
> > > > > foo:    movq    %rdx, %rax
> > > > >         xorl    %edx, %edx
> > > > >         addq    %rdi, %rax
> > > > >         adcq    %rsi, %rdx
> > > > >         ret
> > > > >
> > > > > But there is a little bad news.  This patch causes two (minor)
> > > > > missed optimization regressions on x86_64;
> > > > > gcc.target/i386/pr82580.c and gcc.target/i386/pr91681-1.c.  As
> > > > > shown in the test case above, we're no longer generating adcq
> > > > > $0, but instead using xorl.  For the other FAIL, register
> > > > > allocation now has more freedom and is (arbitrarily) choosing a
> > > > > register assignment that doesn't match what the test is
> > > > > expecting.  These issues are easier to explain and fix once this patch is in
> the tree.
> > > > >
> > > > > The good news is that this approach fixes a number of long
> > > > > standing issues, that need to checked in bugzilla, including PR
> > > > > target/110533 which was just opened/reported earlier this week.
> > > > >
> > > > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > > > bootstrap and make -k check, both with and without
> > > > > --target_board=unix{-m32} with only the two new FAILs described above.
> Ok for mainline?
> > > > >
> > > > > 2023-07-06  Roger Sayle  <roger@nextmovesoftware.com>
> > > > >
> > > > > gcc/ChangeLog
> > > > >         PR target/43644
> > > > >         PR target/110533
> > > > >         * config/i386/i386-expand.cc (ix86_expand_move): Convert SETs of
> > > > >         TImode destinations from paradoxical SUBREGs (setting the lowpart)
> > > > >         into explicit zero extensions.  Use *insvti_highpart_1 instruction
> > > > >         to set the highpart of a TImode destination.
> > > > >
> > > > > gcc/testsuite/ChangeLog
> > > > >         PR target/43644
> > > > >         PR target/110533
> > > > >         * gcc.target/i386/pr110533.c: New test case.
> > > > >         * gcc.target/i386/pr43644-2.c: Likewise.
> > > > >
> > > > > Thanks in advance,
> > > > > Roger
> > > > > --
> > > > >
> > >
  
Manolis Tsamis Sept. 1, 2023, 1:01 p.m. UTC | #6
On Fri, Sep 1, 2023 at 3:35 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Manolis,
> Many thanks.  If you haven't already, could you create/file a
> bug report at https://gcc.gnu.org/bugzilla/ which ensures this
> doesn't get lost/forgotten.  It provides a PR number for tracking
> discussions, and patches/fixes with PR numbers are (often)
> prioritized during the review and approval process.
>
Sure, opened as PR111267.

> I'll investigate what's going on.  Either my "improvements"
> need to be disabled for V2SF arguments, or the middle/back
> end needs to figure out how to efficiently shuffle these values,
> without reload moving them via integer registers, at least as
> efficiently as we were before.  As you/clang show, we could
> do better.
>
Sounds good, thanks a lot!
Manolis

> Thanks again, and sorry for any inconvenience.
> Best regards,
> Roger
> --
>
> > -----Original Message-----
> > From: Manolis Tsamis <manolis.tsamis@vrull.eu>
> > Sent: 01 September 2023 11:45
> > To: Uros Bizjak <ubizjak@gmail.com>
> > Cc: Roger Sayle <roger@nextmovesoftware.com>; gcc-patches@gcc.gnu.org
> > Subject: Re: [x86_64 PATCH] Improve __int128 argument passing (in
> > ix86_expand_move).
> >
> > Hi Roger,
> >
> > I've (accidentally) found a codegen regression that I bisected down to this patch.
> > For these two functions:
> >
> > typedef struct {
> >   float minx, miny;
> >   float maxx, maxy;
> > } AABB;
> >
> > int TestOverlap(AABB a, AABB b) {
> >   return a.minx <= b.maxx
> >       && a.miny <= b.maxy
> >       && a.maxx >= b.minx
> >       && a.maxx >= b.minx;
> > }
> >
> > int TestOverlap2(AABB a, AABB b) {
> >   return a.miny <= b.maxy
> >       && a.maxx >= b.minx;
> > }
> >
> > GCC used to produce this code:
> >
> > TestOverlap:
> >         comiss  xmm3, xmm0
> >         movq    rdx, xmm0
> >         movq    rsi, xmm1
> >         movq    rax, xmm3
> >         jb      .L10
> >         shr     rdx, 32
> >         shr     rax, 32
> >         movd    xmm0, eax
> >         movd    xmm4, edx
> >         comiss  xmm0, xmm4
> >         jb      .L10
> >         movd    xmm1, esi
> >         xor     eax, eax
> >         comiss  xmm1, xmm2
> >         setnb   al
> >         ret
> > .L10:
> >         xor     eax, eax
> >         ret
> > TestOverlap2:
> >         shufps  xmm0, xmm0, 85
> >         shufps  xmm3, xmm3, 85
> >         comiss  xmm3, xmm0
> >         jb      .L17
> >         xor     eax, eax
> >         comiss  xmm1, xmm2
> >         setnb   al
> >         ret
> > .L17:
> >         xor     eax, eax
> >         ret
> >
> > After this patch codegen gets much worse:
> >
> > TestOverlap:
> >         movq    rax, xmm1
> >         movq    rdx, xmm2
> >         movq    rsi, xmm0
> >         mov     rdi, rax
> >         movq    rax, xmm3
> >         mov     rcx, rsi
> >         xchg    rdx, rax
> >         movd    xmm1, edx
> >         mov     rsi, rax
> >         mov     rax, rdx
> >         comiss  xmm1, xmm0
> >         jb      .L10
> >         shr     rcx, 32
> >         shr     rax, 32
> >         movd    xmm0, eax
> >         movd    xmm4, ecx
> >         comiss  xmm0, xmm4
> >         jb      .L10
> >         movd    xmm0, esi
> >         movd    xmm1, edi
> >         xor     eax, eax
> >         comiss  xmm1, xmm0
> >         setnb   al
> >         ret
> > .L10:
> >         xor     eax, eax
> >         ret
> > TestOverlap2:
> >         movq    rdx, xmm2
> >         movq    rax, xmm3
> >         movq    rsi, xmm0
> >         xchg    rdx, rax
> >         mov     rcx, rsi
> >         mov     rsi, rax
> >         mov     rax, rdx
> >         shr     rcx, 32
> >         shr     rax, 32
> >         movd    xmm4, ecx
> >         movd    xmm0, eax
> >         comiss  xmm0, xmm4
> >         jb      .L17
> >         movd    xmm0, esi
> >         xor     eax, eax
> >         comiss  xmm1, xmm0
> >         setnb   al
> >         ret
> > .L17:
> >         xor     eax, eax
> >         ret
> >
> > I saw that you've been improving i386 argument passing, so maybe this is just a
> > missed case of these additions?
> >
> > (Can also be seen here https://godbolt.org/z/E4xrEn6KW)
> >
> > PS: I found the code that clang generates, with cmpleps + pextrw to avoid the fp-
> > >int->fp + shr interesting. I wonder if something like this could be added to GCC as
> > well.
> >
> > Thanks!
> > Manolis
> >
> > On Thu, Jul 6, 2023 at 5:21 PM Uros Bizjak via Gcc-patches <gcc-
> > patches@gcc.gnu.org> wrote:
> > >
> > > On Thu, Jul 6, 2023 at 3:48 PM Roger Sayle <roger@nextmovesoftware.com>
> > wrote:
> > > >
> > > > > On Thu, Jul 6, 2023 at 2:04 PM Roger Sayle
> > > > > <roger@nextmovesoftware.com>
> > > > > wrote:
> > > > > >
> > > > > >
> > > > > > Passing 128-bit integer (TImode) parameters on x86_64 can
> > > > > > sometimes result in surprising code.  Consider the example below (from PR
> > 43644):
> > > > > >
> > > > > > __uint128 foo(__uint128 x, unsigned long long y) {
> > > > > >   return x+y;
> > > > > > }
> > > > > >
> > > > > > which currently results in 6 consecutive movq instructions:
> > > > > >
> > > > > > foo:    movq    %rsi, %rax
> > > > > >         movq    %rdi, %rsi
> > > > > >         movq    %rdx, %rcx
> > > > > >         movq    %rax, %rdi
> > > > > >         movq    %rsi, %rax
> > > > > >         movq    %rdi, %rdx
> > > > > >         addq    %rcx, %rax
> > > > > >         adcq    $0, %rdx
> > > > > >         ret
> > > > > >
> > > > > > The underlying issue is that during RTL expansion, we generate
> > > > > > the following initial RTL for the x argument:
> > > > > >
> > > > > > (insn 4 3 5 2 (set (reg:TI 85)
> > > > > >         (subreg:TI (reg:DI 86) 0)) "pr43644-2.c":5:1 -1
> > > > > >      (nil))
> > > > > > (insn 5 4 6 2 (set (subreg:DI (reg:TI 85) 8)
> > > > > >         (reg:DI 87)) "pr43644-2.c":5:1 -1
> > > > > >      (nil))
> > > > > > (insn 6 5 7 2 (set (reg/v:TI 84 [ x ])
> > > > > >         (reg:TI 85)) "pr43644-2.c":5:1 -1
> > > > > >      (nil))
> > > > > >
> > > > > > which by combine/reload becomes
> > > > > >
> > > > > > (insn 25 3 22 2 (set (reg/v:TI 84 [ x ])
> > > > > >         (const_int 0 [0])) "pr43644-2.c":5:1 -1
> > > > > >      (nil))
> > > > > > (insn 22 25 23 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 0)
> > > > > >         (reg:DI 93)) "pr43644-2.c":5:1 90 {*movdi_internal}
> > > > > >      (expr_list:REG_DEAD (reg:DI 93)
> > > > > >         (nil)))
> > > > > > (insn 23 22 28 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 8)
> > > > > >         (reg:DI 94)) "pr43644-2.c":5:1 90 {*movdi_internal}
> > > > > >      (expr_list:REG_DEAD (reg:DI 94)
> > > > > >         (nil)))
> > > > > >
> > > > > > where the heavy use of SUBREG SET_DESTs creates challenges for
> > > > > > both combine and register allocation.
> > > > > >
> > > > > > The improvement proposed here is to avoid these problematic
> > > > > > SUBREGs by adding (two) special cases to ix86_expand_move.  For
> > > > > > insn 4, which sets a TImode destination from a paradoxical
> > > > > > SUBREG, to assign the lowpart, we can use an explicit zero
> > > > > > extension (zero_extendditi2 was added in July 2022), and for
> > > > > > insn 5, which sets the highpart of a TImode register we can use
> > > > > > the *insvti_highpart_1 instruction (that was added in May 2023, after
> > being approved for stage1 in January).
> > > > > > This allows combine to work its magic, merging these insns into
> > > > > > a
> > > > > > *concatditi3 and from there into other optimized forms.
> > > > >
> > > > > How about we introduce *insvti_lowpart_1, similar to
> > > > > *insvti_highpart_1, in the hope that combine is smart enough to
> > > > > also combine these two instructions? IMO, faking insert to lowpart
> > > > > of the register with zero_extend is a bit overkill, and could
> > > > > hinder some other optimization opportunities (as perhaps hinted by failing
> > testcases).
> > > >
> > > > The use of ZERO_EXTEND serves two purposes, both the setting of the
> > > > lowpart and of informing the RTL passes that the highpart is dead.
> > > > Notice in the original RTL stream, i.e. current GCC, insn 25 is
> > > > inserted by the .286r.init-regs pass, clearing the entirety of the
> > > > TImode register (like a clobber), and preventing TI:84 from occupying the
> > same registers as DI:93 and DI:94.
> > > >
> > > > If the middle-end had asked the backend to generate a SET to
> > > > STRICT_LOWPART then our hands would be tied, but a paradoxical
> > > > SUBREG allows us the freedom to set the highpart bits to a defined
> > > > value (we could have used sign extension if that was cheap), which
> > > > then simplifies data-flow and liveness analysis.  Allowing the
> > > > highpart to contain undefined or untouched data is exactly the sort of security
> > side-channel leakage that the clear regs pass attempts to address.
> > > >
> > > > I can investigate an *insvti_lowpart_1, but I don't think it will
> > > > help with this issue, i.e. it won't prevent init-regs from clobbering/clearing
> > TImode parameters.
> > >
> > > Thanks for the explanation, the patch is OK then.
> > >
> > > Thanks,
> > > Uros.
> > >
> > > >
> > > > > > So for the test case above, we now generate only a single movq:
> > > > > >
> > > > > > foo:    movq    %rdx, %rax
> > > > > >         xorl    %edx, %edx
> > > > > >         addq    %rdi, %rax
> > > > > >         adcq    %rsi, %rdx
> > > > > >         ret
> > > > > >
> > > > > > But there is a little bad news.  This patch causes two (minor)
> > > > > > missed optimization regressions on x86_64;
> > > > > > gcc.target/i386/pr82580.c and gcc.target/i386/pr91681-1.c.  As
> > > > > > shown in the test case above, we're no longer generating adcq
> > > > > > $0, but instead using xorl.  For the other FAIL, register
> > > > > > allocation now has more freedom and is (arbitrarily) choosing a
> > > > > > register assignment that doesn't match what the test is
> > > > > > expecting.  These issues are easier to explain and fix once this patch is in
> > the tree.
> > > > > >
> > > > > > The good news is that this approach fixes a number of long
> > > > > > standing issues, that need to checked in bugzilla, including PR
> > > > > > target/110533 which was just opened/reported earlier this week.
> > > > > >
> > > > > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > > > > bootstrap and make -k check, both with and without
> > > > > > --target_board=unix{-m32} with only the two new FAILs described above.
> > Ok for mainline?
> > > > > >
> > > > > > 2023-07-06  Roger Sayle  <roger@nextmovesoftware.com>
> > > > > >
> > > > > > gcc/ChangeLog
> > > > > >         PR target/43644
> > > > > >         PR target/110533
> > > > > >         * config/i386/i386-expand.cc (ix86_expand_move): Convert SETs of
> > > > > >         TImode destinations from paradoxical SUBREGs (setting the lowpart)
> > > > > >         into explicit zero extensions.  Use *insvti_highpart_1 instruction
> > > > > >         to set the highpart of a TImode destination.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog
> > > > > >         PR target/43644
> > > > > >         PR target/110533
> > > > > >         * gcc.target/i386/pr110533.c: New test case.
> > > > > >         * gcc.target/i386/pr43644-2.c: Likewise.
> > > > > >
> > > > > > Thanks in advance,
> > > > > > Roger
> > > > > > --
> > > > > >
> > > >
>
  

Patch

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 567248d..92ffa4b 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -429,6 +429,16 @@  ix86_expand_move (machine_mode mode, rtx operands[])
 
     default:
       break;
+
+    case SUBREG:
+      /* Transform TImode paradoxical SUBREG into zero_extendditi2.  */
+      if (TARGET_64BIT
+	  && mode == TImode
+	  && SUBREG_P (op1)
+	  && GET_MODE (SUBREG_REG (op1)) == DImode
+	  && SUBREG_BYTE (op1) == 0)
+	op1 = gen_rtx_ZERO_EXTEND (TImode, SUBREG_REG (op1));
+      break;
     }
 
   if ((flag_pic || MACHOPIC_INDIRECT)
@@ -532,6 +542,24 @@  ix86_expand_move (machine_mode mode, rtx operands[])
 	}
     }
 
+  /* Use *insvti_highpart_1 to set highpart of TImode register.  */
+  if (TARGET_64BIT
+      && mode == DImode
+      && SUBREG_P (op0)
+      && SUBREG_BYTE (op0) == 8
+      && GET_MODE (SUBREG_REG (op0)) == TImode
+      && REG_P (SUBREG_REG (op0))
+      && REG_P (op1))
+    {
+      wide_int mask = wi::mask (64, false, 128);
+      rtx tmp = immed_wide_int_const (mask, TImode);
+      op0 = SUBREG_REG (op0);
+      tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp);
+      op1 = gen_rtx_ZERO_EXTEND (TImode, op1);
+      op1 = gen_rtx_ASHIFT (TImode, op1, GEN_INT (64));
+      op1 = gen_rtx_IOR (TImode, tmp, op1);
+    }
+
   emit_insn (gen_rtx_SET (op0, op1));
 }
 
diff --git a/gcc/testsuite/gcc.target/i386/pr110533.c b/gcc/testsuite/gcc.target/i386/pr110533.c
new file mode 100644
index 0000000..513bcd4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr110533.c
@@ -0,0 +1,9 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O0" } */
+
+__attribute__((naked))
+void fn(__int128 a) {
+    asm("ret");
+}
+
+/* { dg-final { scan-assembler-not "mov" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr43644-2.c b/gcc/testsuite/gcc.target/i386/pr43644-2.c
new file mode 100644
index 0000000..d470b0a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr43644-2.c
@@ -0,0 +1,9 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+
+unsigned __int128 foo(unsigned __int128 x, unsigned long long y)
+{
+  return x+y;
+}
+
+/* { dg-final { scan-assembler-times "movq" 1 } } */