[x86_64] PR target/110551: Tweak mulx register allocation using peephole2.

Message ID 00c901da0b56$4a3ff750$debfe5f0$@nextmovesoftware.com
State Unresolved
Headers
Series [x86_64] PR target/110551: Tweak mulx register allocation using peephole2. |

Checks

Context Check Description
snail/gcc-patch-check warning Git am fail log

Commit Message

Roger Sayle Oct. 30, 2023, 5:26 p.m. UTC
  This patch is a follow-up to my previous PR target/110551 patch, this
time to address the additional move after mulx, seen on TARGET_BMI2
architectures (such as -march=haswell).  The complication here is
that the flexible multiple-set mulx instruction is introduced into
RTL after reload, by split2, and therefore can't benefit from register
preferencing.  This results in RTL like the following:

(insn 32 31 17 2 (parallel [
            (set (reg:DI 4 si [orig:101 r ] [101])
                (mult:DI (reg:DI 1 dx [109])
                    (reg:DI 5 di [109])))
            (set (reg:DI 5 di [ r+8 ])
                (umul_highpart:DI (reg:DI 1 dx [109])
                    (reg:DI 5 di [109])))
        ]) "pr110551-2.c":8:17 -1
     (nil))

(insn 17 32 9 2 (set (reg:DI 0 ax [107])
        (reg:DI 5 di [ r+8 ])) "pr110551-2.c":9:40 90 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 5 di [ r+8 ])
        (nil)))

Here insn 32, the mulx instruction, places its results in si and di,
and then immediately after decides to move di to ax, with di now dead.
This can be trivially cleaned up by a peephole2.  I've added an
additional constraint that the two SET_DESTs can't be the same
register to avoid confusing the middle-end, but this has well-defined
behaviour on x86_64/BMI2, encoding a umul_highpart.

For the new test case, compiled on x86_64 with -O2 -march=haswell:

Before:
mulx64: movabsq $-7046029254386353131, %rdx
        mulx    %rdi, %rsi, %rdi
        movq    %rdi, %rax
        xorq    %rsi, %rax
        ret

After:
mulx64: movabsq $-7046029254386353131, %rdx
        mulx    %rdi, %rsi, %rax
        xorq    %rsi, %rax
        ret

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?

2023-10-30  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        PR target/110551
        * config/i386/i386.md (*bmi2_umul<mode><dwi>3_1): Tidy condition
        as operands[2] with predicate register_operand must be !MEM_P.
        (peephole2): Optimize a mulx followed by a register-to-register
        move, to place result in the correct destination if possible.

gcc/testsuite/ChangeLog
        PR target/110551
        * gcc.target/i386/pr110551-2.c: New test case.


Thanks in advance,
Roger
--
  

Comments

Uros Bizjak Nov. 1, 2023, 10:04 a.m. UTC | #1
On Mon, Oct 30, 2023 at 6:27 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This patch is a follow-up to my previous PR target/110551 patch, this
> time to address the additional move after mulx, seen on TARGET_BMI2
> architectures (such as -march=haswell).  The complication here is
> that the flexible multiple-set mulx instruction is introduced into
> RTL after reload, by split2, and therefore can't benefit from register
> preferencing.  This results in RTL like the following:
>
> (insn 32 31 17 2 (parallel [
>             (set (reg:DI 4 si [orig:101 r ] [101])
>                 (mult:DI (reg:DI 1 dx [109])
>                     (reg:DI 5 di [109])))
>             (set (reg:DI 5 di [ r+8 ])
>                 (umul_highpart:DI (reg:DI 1 dx [109])
>                     (reg:DI 5 di [109])))
>         ]) "pr110551-2.c":8:17 -1
>      (nil))
>
> (insn 17 32 9 2 (set (reg:DI 0 ax [107])
>         (reg:DI 5 di [ r+8 ])) "pr110551-2.c":9:40 90 {*movdi_internal}
>      (expr_list:REG_DEAD (reg:DI 5 di [ r+8 ])
>         (nil)))
>
> Here insn 32, the mulx instruction, places its results in si and di,
> and then immediately after decides to move di to ax, with di now dead.
> This can be trivially cleaned up by a peephole2.  I've added an
> additional constraint that the two SET_DESTs can't be the same
> register to avoid confusing the middle-end, but this has well-defined
> behaviour on x86_64/BMI2, encoding a umul_highpart.
>
> For the new test case, compiled on x86_64 with -O2 -march=haswell:
>
> Before:
> mulx64: movabsq $-7046029254386353131, %rdx
>         mulx    %rdi, %rsi, %rdi
>         movq    %rdi, %rax
>         xorq    %rsi, %rax
>         ret
>
> After:
> mulx64: movabsq $-7046029254386353131, %rdx
>         mulx    %rdi, %rsi, %rax
>         xorq    %rsi, %rax
>         ret
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?

It looks that your previous PR110551 patch regressed
-march=cascadelake [1]. Let's fix these regressions first.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-October/634660.html

Uros.

> 2023-10-30  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         PR target/110551
>         * config/i386/i386.md (*bmi2_umul<mode><dwi>3_1): Tidy condition
>         as operands[2] with predicate register_operand must be !MEM_P.
>         (peephole2): Optimize a mulx followed by a register-to-register
>         move, to place result in the correct destination if possible.
>
> gcc/testsuite/ChangeLog
>         PR target/110551
>         * gcc.target/i386/pr110551-2.c: New test case.
>
>
> Thanks in advance,
> Roger
> --
>
  
Roger Sayle Nov. 1, 2023, 12:58 p.m. UTC | #2
Hi Uros,

> From: Uros Bizjak <ubizjak@gmail.com>
> Sent: 01 November 2023 10:05
> Subject: Re: [x86_64 PATCH] PR target/110551: Tweak mulx register allocation
> using peephole2.
> 
> On Mon, Oct 30, 2023 at 6:27 PM Roger Sayle <roger@nextmovesoftware.com>
> wrote:
> >
> >
> > This patch is a follow-up to my previous PR target/110551 patch, this
> > time to address the additional move after mulx, seen on TARGET_BMI2
> > architectures (such as -march=haswell).  The complication here is that
> > the flexible multiple-set mulx instruction is introduced into RTL
> > after reload, by split2, and therefore can't benefit from register
> > preferencing.  This results in RTL like the following:
> >
> > (insn 32 31 17 2 (parallel [
> >             (set (reg:DI 4 si [orig:101 r ] [101])
> >                 (mult:DI (reg:DI 1 dx [109])
> >                     (reg:DI 5 di [109])))
> >             (set (reg:DI 5 di [ r+8 ])
> >                 (umul_highpart:DI (reg:DI 1 dx [109])
> >                     (reg:DI 5 di [109])))
> >         ]) "pr110551-2.c":8:17 -1
> >      (nil))
> >
> > (insn 17 32 9 2 (set (reg:DI 0 ax [107])
> >         (reg:DI 5 di [ r+8 ])) "pr110551-2.c":9:40 90 {*movdi_internal}
> >      (expr_list:REG_DEAD (reg:DI 5 di [ r+8 ])
> >         (nil)))
> >
> > Here insn 32, the mulx instruction, places its results in si and di,
> > and then immediately after decides to move di to ax, with di now dead.
> > This can be trivially cleaned up by a peephole2.  I've added an
> > additional constraint that the two SET_DESTs can't be the same
> > register to avoid confusing the middle-end, but this has well-defined
> > behaviour on x86_64/BMI2, encoding a umul_highpart.
> >
> > For the new test case, compiled on x86_64 with -O2 -march=haswell:
> >
> > Before:
> > mulx64: movabsq $-7046029254386353131, %rdx
> >         mulx    %rdi, %rsi, %rdi
> >         movq    %rdi, %rax
> >         xorq    %rsi, %rax
> >         ret
> >
> > After:
> > mulx64: movabsq $-7046029254386353131, %rdx
> >         mulx    %rdi, %rsi, %rax
> >         xorq    %rsi, %rax
> >         ret
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures.  Ok for mainline?
> 
> It looks that your previous PR110551 patch regressed -march=cascadelake [1].
> Let's fix these regressions first.
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-October/634660.html
> 
> Uros.

This patch fixes that "regression".  Originally, the test case in PR110551 contained
one unnecessary mov on "default" x86_targets, but two extra movs on BMI2
targets, including -march=haswell and -march=cascadelake.  The first patch
eliminated one of these MOVs, this patch eliminates the second.  I'm not sure
that you can call it a regression, the added test failed when run with a non-standard
-march setting.  The good news is that test case doesn't have to be changed with
this patch applied, i.e. the correct intended behaviour is no MOVs on all architectures.

I'll admit the timing is unusual; I had already written and was regression testing a
patch for the BMI2 issue, when the -march=cascadelake regression tester let me
know it was required for folks that helpfully run the regression suite with non
standard settings.  i.e. a long standing bug that wasn't previously tested for by
the testsuite.

> > 2023-10-30  Roger Sayle  <roger@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         PR target/110551
> >         * config/i386/i386.md (*bmi2_umul<mode><dwi>3_1): Tidy condition
> >         as operands[2] with predicate register_operand must be !MEM_P.
> >         (peephole2): Optimize a mulx followed by a register-to-register
> >         move, to place result in the correct destination if possible.
> >
> > gcc/testsuite/ChangeLog
> >         PR target/110551
> >         * gcc.target/i386/pr110551-2.c: New test case.
> >

Thanks again,
Roger
--
  
Uros Bizjak Nov. 1, 2023, 7:22 p.m. UTC | #3
On Wed, Nov 1, 2023 at 1:58 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Uros,
>
> > From: Uros Bizjak <ubizjak@gmail.com>
> > Sent: 01 November 2023 10:05
> > Subject: Re: [x86_64 PATCH] PR target/110551: Tweak mulx register allocation
> > using peephole2.
> >
> > On Mon, Oct 30, 2023 at 6:27 PM Roger Sayle <roger@nextmovesoftware.com>
> > wrote:
> > >
> > >
> > > This patch is a follow-up to my previous PR target/110551 patch, this
> > > time to address the additional move after mulx, seen on TARGET_BMI2
> > > architectures (such as -march=haswell).  The complication here is that
> > > the flexible multiple-set mulx instruction is introduced into RTL
> > > after reload, by split2, and therefore can't benefit from register
> > > preferencing.  This results in RTL like the following:
> > >
> > > (insn 32 31 17 2 (parallel [
> > >             (set (reg:DI 4 si [orig:101 r ] [101])
> > >                 (mult:DI (reg:DI 1 dx [109])
> > >                     (reg:DI 5 di [109])))
> > >             (set (reg:DI 5 di [ r+8 ])
> > >                 (umul_highpart:DI (reg:DI 1 dx [109])
> > >                     (reg:DI 5 di [109])))
> > >         ]) "pr110551-2.c":8:17 -1
> > >      (nil))
> > >
> > > (insn 17 32 9 2 (set (reg:DI 0 ax [107])
> > >         (reg:DI 5 di [ r+8 ])) "pr110551-2.c":9:40 90 {*movdi_internal}
> > >      (expr_list:REG_DEAD (reg:DI 5 di [ r+8 ])
> > >         (nil)))
> > >
> > > Here insn 32, the mulx instruction, places its results in si and di,
> > > and then immediately after decides to move di to ax, with di now dead.
> > > This can be trivially cleaned up by a peephole2.  I've added an
> > > additional constraint that the two SET_DESTs can't be the same
> > > register to avoid confusing the middle-end, but this has well-defined
> > > behaviour on x86_64/BMI2, encoding a umul_highpart.
> > >
> > > For the new test case, compiled on x86_64 with -O2 -march=haswell:
> > >
> > > Before:
> > > mulx64: movabsq $-7046029254386353131, %rdx
> > >         mulx    %rdi, %rsi, %rdi
> > >         movq    %rdi, %rax
> > >         xorq    %rsi, %rax
> > >         ret
> > >
> > > After:
> > > mulx64: movabsq $-7046029254386353131, %rdx
> > >         mulx    %rdi, %rsi, %rax
> > >         xorq    %rsi, %rax
> > >         ret
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32}
> > > with no new failures.  Ok for mainline?
> >
> > It looks that your previous PR110551 patch regressed -march=cascadelake [1].
> > Let's fix these regressions first.
> >
> > [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-October/634660.html
> >
> > Uros.
>
> This patch fixes that "regression".  Originally, the test case in PR110551 contained
> one unnecessary mov on "default" x86_targets, but two extra movs on BMI2
> targets, including -march=haswell and -march=cascadelake.  The first patch
> eliminated one of these MOVs, this patch eliminates the second.  I'm not sure
> that you can call it a regression, the added test failed when run with a non-standard
> -march setting.  The good news is that test case doesn't have to be changed with
> this patch applied, i.e. the correct intended behaviour is no MOVs on all architectures.

I was not worried about the extra mov, but more about [2], also
referred from [1], but it looks that that regression was wrongly
attributed to your patch.

[2] https://gcc.gnu.org/pipermail/gcc-regression/2023-October/078391.html

> I'll admit the timing is unusual; I had already written and was regression testing a
> patch for the BMI2 issue, when the -march=cascadelake regression tester let me
> know it was required for folks that helpfully run the regression suite with non
> standard settings.  i.e. a long standing bug that wasn't previously tested for by
> the testsuite.
>
> > > 2023-10-30  Roger Sayle  <roger@nextmovesoftware.com>
> > >
> > > gcc/ChangeLog
> > >         PR target/110551
> > >         * config/i386/i386.md (*bmi2_umul<mode><dwi>3_1): Tidy condition
> > >         as operands[2] with predicate register_operand must be !MEM_P.
> > >         (peephole2): Optimize a mulx followed by a register-to-register
> > >         move, to place result in the correct destination if possible.
> > >
> > > gcc/testsuite/ChangeLog
> > >         PR target/110551
> > >         * gcc.target/i386/pr110551-2.c: New test case.

The patch is OK.

Thanks,
Uros.
  
Jiang, Haochen Nov. 2, 2023, 2:06 a.m. UTC | #4
> -----Original Message-----
> From: Uros Bizjak <ubizjak@gmail.com>
> Sent: Thursday, November 2, 2023 3:23 AM
> To: Roger Sayle <roger@nextmovesoftware.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [x86_64 PATCH] PR target/110551: Tweak mulx register allocation
> using peephole2.
> 
> On Wed, Nov 1, 2023 at 1:58 PM Roger Sayle <roger@nextmovesoftware.com>
> wrote:
> >
> >
> > Hi Uros,
> >
> > > From: Uros Bizjak <ubizjak@gmail.com>
> > > Sent: 01 November 2023 10:05
> > > Subject: Re: [x86_64 PATCH] PR target/110551: Tweak mulx register
> allocation
> > > using peephole2.
> > >
> > > On Mon, Oct 30, 2023 at 6:27 PM Roger Sayle
> <roger@nextmovesoftware.com>
> > > wrote:
> > > >
> > > >
> > > > This patch is a follow-up to my previous PR target/110551 patch, this
> > > > time to address the additional move after mulx, seen on TARGET_BMI2
> > > > architectures (such as -march=haswell).  The complication here is that
> > > > the flexible multiple-set mulx instruction is introduced into RTL
> > > > after reload, by split2, and therefore can't benefit from register
> > > > preferencing.  This results in RTL like the following:
> > > >
> > > > (insn 32 31 17 2 (parallel [
> > > >             (set (reg:DI 4 si [orig:101 r ] [101])
> > > >                 (mult:DI (reg:DI 1 dx [109])
> > > >                     (reg:DI 5 di [109])))
> > > >             (set (reg:DI 5 di [ r+8 ])
> > > >                 (umul_highpart:DI (reg:DI 1 dx [109])
> > > >                     (reg:DI 5 di [109])))
> > > >         ]) "pr110551-2.c":8:17 -1
> > > >      (nil))
> > > >
> > > > (insn 17 32 9 2 (set (reg:DI 0 ax [107])
> > > >         (reg:DI 5 di [ r+8 ])) "pr110551-2.c":9:40 90 {*movdi_internal}
> > > >      (expr_list:REG_DEAD (reg:DI 5 di [ r+8 ])
> > > >         (nil)))
> > > >
> > > > Here insn 32, the mulx instruction, places its results in si and di,
> > > > and then immediately after decides to move di to ax, with di now dead.
> > > > This can be trivially cleaned up by a peephole2.  I've added an
> > > > additional constraint that the two SET_DESTs can't be the same
> > > > register to avoid confusing the middle-end, but this has well-defined
> > > > behaviour on x86_64/BMI2, encoding a umul_highpart.
> > > >
> > > > For the new test case, compiled on x86_64 with -O2 -march=haswell:
> > > >
> > > > Before:
> > > > mulx64: movabsq $-7046029254386353131, %rdx
> > > >         mulx    %rdi, %rsi, %rdi
> > > >         movq    %rdi, %rax
> > > >         xorq    %rsi, %rax
> > > >         ret
> > > >
> > > > After:
> > > > mulx64: movabsq $-7046029254386353131, %rdx
> > > >         mulx    %rdi, %rsi, %rax
> > > >         xorq    %rsi, %rax
> > > >         ret
> > > >
> > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > > and make -k check, both with and without --target_board=unix{-m32}
> > > > with no new failures.  Ok for mainline?
> > >
> > > It looks that your previous PR110551 patch regressed -march=cascadelake [1].

Actually it is not only on -march=cascadelake, w/o -march=cascadelake will also
fail.

Thx,
Haochen

> > > Let's fix these regressions first.
> > >
> > > [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-October/634660.html
> > >
> > > Uros.
> >
> > This patch fixes that "regression".  Originally, the test case in PR110551 contained
> > one unnecessary mov on "default" x86_targets, but two extra movs on BMI2
> > targets, including -march=haswell and -march=cascadelake.  The first patch
> > eliminated one of these MOVs, this patch eliminates the second.  I'm not sure
> > that you can call it a regression, the added test failed when run with a non-standard
> > -march setting.  The good news is that test case doesn't have to be changed with
> > this patch applied, i.e. the correct intended behaviour is no MOVs on all
> architectures.
> 
> I was not worried about the extra mov, but more about [2], also
> referred from [1], but it looks that that regression was wrongly
> attributed to your patch.
> 
> [2] https://gcc.gnu.org/pipermail/gcc-regression/2023-October/078391.html
> 
> > I'll admit the timing is unusual; I had already written and was regression testing a
> > patch for the BMI2 issue, when the -march=cascadelake regression tester let me
> > know it was required for folks that helpfully run the regression suite with non
> > standard settings.  i.e. a long standing bug that wasn't previously tested for by
> > the testsuite.
> >
> > > > 2023-10-30  Roger Sayle  <roger@nextmovesoftware.com>
> > > >
> > > > gcc/ChangeLog
> > > >         PR target/110551
> > > >         * config/i386/i386.md (*bmi2_umul<mode><dwi>3_1): Tidy condition
> > > >         as operands[2] with predicate register_operand must be !MEM_P.
> > > >         (peephole2): Optimize a mulx followed by a register-to-register
> > > >         move, to place result in the correct destination if possible.
> > > >
> > > > gcc/testsuite/ChangeLog
> > > >         PR target/110551
> > > >         * gcc.target/i386/pr110551-2.c: New test case.
> 
> The patch is OK.
> 
> Thanks,
> Uros.
  

Patch

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index eb4121b..a314f1a 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -9747,13 +9747,37 @@ 
 	  (match_operand:DWIH 3 "nonimmediate_operand" "rm")))
    (set (match_operand:DWIH 1 "register_operand" "=r")
 	(umul_highpart:DWIH (match_dup 2) (match_dup 3)))]
-  "TARGET_BMI2
-   && !(MEM_P (operands[2]) && MEM_P (operands[3]))"
+  "TARGET_BMI2"
   "mulx\t{%3, %0, %1|%1, %0, %3}"
   [(set_attr "type" "imulx")
    (set_attr "prefix" "vex")
    (set_attr "mode" "<MODE>")])
 
+;; Tweak *bmi2_umul<mode><dwi>3_1 to eliminate following mov.
+(define_peephole2
+  [(parallel [(set (match_operand:DWIH 0 "general_reg_operand")
+		   (mult:DWIH (match_operand:DWIH 2 "register_operand")
+			      (match_operand:DWIH 3 "nonimmediate_operand")))
+	      (set (match_operand:DWIH 1 "general_reg_operand")
+		   (umul_highpart:DWIH (match_dup 2) (match_dup 3)))])
+   (set (match_operand:DWIH 4 "general_reg_operand")
+	(match_operand:DWIH 5 "general_reg_operand"))]
+  "TARGET_BMI2
+   && ((REGNO (operands[5]) == REGNO (operands[0])
+        && REGNO (operands[1]) != REGNO (operands[4]))
+       || (REGNO (operands[5]) == REGNO (operands[1])
+	   && REGNO (operands[0]) != REGNO (operands[4])))
+   && peep2_reg_dead_p (2, operands[5])"
+  [(parallel [(set (match_dup 0) (mult:DWIH (match_dup 2) (match_dup 3)))
+	      (set (match_dup 1)
+		   (umul_highpart:DWIH (match_dup 2) (match_dup 3)))])]
+{
+  if (REGNO (operands[5]) == REGNO (operands[0]))
+    operands[0] = operands[4];
+  else
+    operands[1] = operands[4];
+})
+
 (define_insn "*umul<mode><dwi>3_1"
   [(set (match_operand:<DWI> 0 "register_operand" "=r,A")
 	(mult:<DWI>
diff --git a/gcc/testsuite/gcc.target/i386/pr110551-2.c b/gcc/testsuite/gcc.target/i386/pr110551-2.c
new file mode 100644
index 0000000..4936adf
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr110551-2.c
@@ -0,0 +1,12 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -march=haswell" } */
+
+typedef unsigned long long uint64_t;
+
+uint64_t mulx64(uint64_t x)
+{
+    __uint128_t r = (__uint128_t)x * 0x9E3779B97F4A7C15ull;
+    return (uint64_t)r ^ (uint64_t)( r >> 64 );
+}
+
+/* { dg-final { scan-assembler-not "movq" } } */