[RFC] postreload cse'ing vector constants

Message ID 3b0984ef-c532-c29c-732a-1c9b569e134c@linux.ibm.com
State New, archived

Commit Message

Robin Dapp Sept. 7, 2022, 2:40 p.m. UTC
  Hi,

I recently looked into a sequence like

 vzero %v0
 vlr   %v2, %v0
 vlr   %v3, %v0.

Ideally we would like to use vzero for all of these sets in order to not
create dependencies.

For some instances of this problem I found the offending snippet to be
the postreload cse pass. If there is a non hard reg whose value is
equivalent to an existing hard reg, it will replace the non hard reg.
The costs are only compared if the respective operand is a CONST_INT_P,
otherwise we always replace.

The comment before says:
   /* See if REGNO fits this alternative, and set it up as the
      replacement register if we don't have one for this
      alternative yet and the operand being replaced is not
      a cheap CONST_INT.  */

Now, in my case we have a CONST_VECTOR consisting of CONST_INTs (zeros).
This is obviously not a CONST_INT, so the substitution takes place,
resulting in a "vlr" instead of a "vzero".
Would it not make sense to always compare costs here? Some backends have
instructions for loading vector constants and there could also be
backends able to load floating point constants directly.

For my snippet getting rid of the CONST_INT check suffices because the
costs are similar and no replacement happens.  Was this originally a
shortcut for performance reasons?  I thought we were not checking that
many alternatives and only locally at this point anymore.
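
To make the difference concrete, here is a toy model of the condition (a
sketch only, not GCC code).  Operand, replace_old and replace_new are made-up
names, the integer cost merely stands in for what set_src_cost () would
return, and the equal costs are an assumption matching "the costs are
similar" above.

```cpp
// Toy stand-ins for RTX codes; these are not GCC's real data structures.
enum class Code { Reg, ConstInt, ConstVector };

struct Operand {
  Code code;
  int cost;  // hypothetical stand-in for the value set_src_cost () returns
};

// Old condition: only a CONST_INT operand is protected by a cost comparison.
constexpr bool replace_old(Operand op, Operand reg) {
  return op.code != Code::ConstInt || op.cost > reg.cost;
}

// Proposed condition: always compare costs, whatever the rtx code.
constexpr bool replace_new(Operand op, Operand reg) {
  return op.cost > reg.cost;
}

// Assumed costs: loading the zero vector is as cheap as a register copy.
constexpr Operand zero_vector{Code::ConstVector, 4};
constexpr Operand hard_reg{Code::Reg, 4};

static_assert(replace_old(zero_vector, hard_reg),
              "old rule substitutes the register (vlr)");
static_assert(!replace_new(zero_vector, hard_reg),
              "new rule keeps the constant (vzero)");
```

Because the comparison is a strict ">", a tie leaves the constant in place,
which is what avoids the extra register dependency in this model.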

Any comments or ideas?

Regards
 Robin

  

Comments

Jeff Law Sept. 7, 2022, 3:06 p.m. UTC | #1
On 9/7/2022 8:40 AM, Robin Dapp via Gcc-patches wrote:
> Hi,
>
> I recently looked into a sequence like
>
>   vzero %v0
>   vlr   %v2, %v0
>   vlr   %v3, %v0.
>
> Ideally we would like to use vzero for all of these sets in order to not
> create dependencies.
>
> For some instances of this problem I found the offending snippet to be
> the postreload cse pass. If there is a non hard reg whose value is
> equivalent to an existing hard reg, it will replace the non hard reg.
> The costs are only compared if the respective operand is a CONST_INT_P,
> otherwise we always replace.
>
> The comment before says:
>     /* See if REGNO fits this alternative, and set it up as the
>        replacement register if we don't have one for this
>        alternative yet and the operand being replaced is not
>        a cheap CONST_INT.  */
>
> Now, in my case we have a CONST_VECTOR consisting of CONST_INTS (zeros).
>   This is obviously no CONST_INT therefore the substitution takes place
> resulting in a "vlr" instead of a "vzero".
> Would it not make sense to always compare costs here? Some backends have
> instructions for loading vector constants and there could also be
> backends able to load floating point constants directly.
>
> For my snippet getting rid of the CONST_INT check suffices because the
> costs are similar and no replacement happens.  Was this originally a
> shortcut for performance reasons?  I thought we were not checking that
> many alternatives and only locally at this point anymore.
>
> Any comments or ideas?
It looks sensible to me.  ISTM this should be driven by costs, not by 
any particular rtx codes.  Your patch takes things in the right direction.

Did you do any archeology into this code to see if there was any 
history that might shed light on why it doesn't just use the costing 
models?

jeff
  
Robin Dapp Sept. 7, 2022, 3:33 p.m. UTC | #2
> Did you do any archeology into this code to see if there was any 
> history that might shed light on why it doesn't just use the costing 
> models?
This one was buried under some dust :)

commit 0254c56158b0533600ba9036258c11d377d46adf
Author: John Carr <jfc@mit.edu>
Date:   Wed Jun 10 06:00:50 1998 +0000

    reload1.c (reload_cse_simplify_operands): Do not call gen_rtx_REG for each alternative.

    Wed Jun 10 08:56:27 1998  John Carr  <jfc@mit.edu>
            * reload1.c (reload_cse_simplify_operands): Do not call gen_rtx_REG
            for each alternative.  Do not replace a CONST_INT with a REG unless
            the reg is cheaper.

    From-SVN: r20402

Back then we didn't have vectors I suppose but apart from that I don't
see a compelling reason not to unconditionally check costs from this.
It seems like we did even more unconditional replacing before it,
including CONST_INTs.

Regards
 Robin
  
Jeff Law Sept. 7, 2022, 3:49 p.m. UTC | #3
On 9/7/2022 9:33 AM, Robin Dapp wrote:
>> Did you do any archeology into this code to see if there was any
>> history that might shed light on why it doesn't just use the costing
>> models?
> This one was buried under some dust :)
>
> commit 0254c56158b0533600ba9036258c11d377d46adf
> Author: John Carr <jfc@mit.edu>
> Date:   Wed Jun 10 06:00:50 1998 +0000
>
>      reload1.c (reload_cse_simplify_operands): Do not call gen_rtx_REG
> for each alternative.
>
>      Wed Jun 10 08:56:27 1998  John Carr  <jfc@mit.edu>
>              * reload1.c (reload_cse_simplify_operands): Do not call
> gen_rtx_REG
>              for each alternative.  Do not replace a CONST_INT with a REG
> unless
>              the reg is cheaper.
>
>      From-SVN: r20402
>
> Back then we didn't have vectors I suppose but apart from that I don't
> see a compelling reason not to unconditionally check costs from this.
> It seems like we did even more unconditional replacing before it,
> including CONST_INTs.
Which is this from the mail archives:

https://gcc.gnu.org/pipermail/gcc-patches/1998-June/000308.html

I would tend to agree that for equal cost the constant would be 
preferred, since that should be better from a scheduling/dependency 
standpoint.  So it seems to me we can drive this purely from a costing 
standpoint.

jeff
  
Robin Dapp Sept. 8, 2022, 1:04 p.m. UTC | #4
> Which is this from the mail archives:
> 
> https://gcc.gnu.org/pipermail/gcc-patches/1998-June/000308.html
> 
> I would tend to agree that for equal cost that the constant would be 
> preferred since that should be better from a scheduling/dependency 
> standpoint.   So it seems to me we can drive this purely from a costing 
> standpoint.

I did bootstrapping and ran the testsuite on x86(-64), aarch64, Power9
and s390.  Everything looks good except two additional fails on x86
where code actually looks worse.

gcc.target/i386/keylocker-encodekey128.c

17c17,18
<       movaps  %xmm4, k2(%rip)
---
>       pxor    %xmm0, %xmm0
>       movaps  %xmm0, k2(%rip)

gcc.target/i386/keylocker-encodekey256.c:

19c19,20
<       movaps  %xmm4, k3(%rip)
---
>       pxor    %xmm0, %xmm0
>       movaps  %xmm0, k3(%rip)

Regards
 Robin
  
Robin Dapp Sept. 27, 2022, 5:40 p.m. UTC | #5
> I did bootstrapping and ran the testsuite on x86(-64), aarch64, Power9
> and s390.  Everything looks good except two additional fails on x86
> where code actually looks worse.
> 
> gcc.target/i386/keylocker-encodekey128.c
> 
> 17c17,18
> <       movaps  %xmm4, k2(%rip)
> ---
>>       pxor    %xmm0, %xmm0
>>       movaps  %xmm0, k2(%rip)
> 
> gcc.target/i386/keylocker-encodekey256.c:
> 
> 19c19,20
> <       movaps  %xmm4, k3(%rip)
> ---
>>       pxor    %xmm0, %xmm0
>>       movaps  %xmm0, k3(%rip)

Before the patch and after postreload we have:

(insn (set (reg:V2DI xmm0)
        (reg:V2DI xmm4))
     (expr_list:REG_DEAD (reg:V2DI 24 xmm4)
        (expr_list:REG_EQUIV (const_vector:V2DI [
                    (const_int 0 [0]) repeated x2
                ])))))
(insn (set (mem/c:V2DI (symbol_ref:DI ("k2"))
        (reg:V2DI xmm0))))

which is converted by cprop_hardreg to:

(insn (set (mem/c:V2DI (symbol_ref:DI ("k2")))
        (reg:V2DI xmm4))))

With the change there is:

(insn (set (reg:V2DI xmm0)
        (const_vector:V2DI [
                (const_int 0 [0]) repeated x2
            ])))
(insn (set (mem/c:V2DI (symbol_ref:DI ("k2")))
        (reg:V2DI xmm0))))

which is not simplified further because xmm0 needs to be explicitly
zeroed while xmm4 is assumed to be zeroed by encodekey128.  I'm not
familiar with this so I'm supposing this is correct even though I found
"XMM4 through XMM6 are reserved for future usages and software should
not rely upon them being zeroed." online.

Even if xmm4 were zeroed explicitly, I guess in this case the simple
costing of mov reg,reg vs mov reg,imm (with the latter not being more
expensive) falls short?  cprop_hardreg can actually propagate the zeroed
xmm4 into the next move.
The same mechanism could possibly even elide many such moves, which would
mean we'd unnecessarily emit many mov reg,0?  Hmm...
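
A rough sketch of that concern, as a toy model rather than anything
resembling the real cprop_hardreg pass: Move, forward_copies, before_change
and after_change are invented for illustration, and the model simply assumes
the temporary register is dead after the store.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy move instruction "dst <- src"; illustrative only, not GCC RTL.
struct Move {
  std::string dst;
  std::string src;
  bool src_is_reg;  // false when src is a constant such as "0"
};

// Very rough model of forwarding a hard-register copy: when "tmp <- reg" is
// directly followed by "x <- tmp", rewrite the pair to "x <- reg".
// Assumption: tmp is dead after the second move, so the copy can be dropped.
std::vector<Move> forward_copies(const std::vector<Move> &insns) {
  std::vector<Move> out;
  for (std::size_t i = 0; i < insns.size(); ++i) {
    const Move &cur = insns[i];
    if (cur.src_is_reg && i + 1 < insns.size()
        && insns[i + 1].src_is_reg && insns[i + 1].src == cur.dst) {
      out.push_back({insns[i + 1].dst, cur.src, true});
      ++i;  // the copy itself is no longer needed
      continue;
    }
    out.push_back(cur);
  }
  return out;
}

// Sequence without the change: copy from the known-zero xmm4, then store.
std::vector<Move> before_change() {
  return {{"xmm0", "xmm4", true}, {"k2", "xmm0", true}};
}

// Sequence with the change: materialize zero in xmm0, then store.
std::vector<Move> after_change() {
  return {{"xmm0", "0", false}, {"k2", "xmm0", true}};
}

// Usage example: the copy is forwarded away, the constant load is not.
void check_toy_model() {
  assert(forward_copies(before_change()).size() == 1);
  assert(forward_copies(after_change()).size() == 2);
}
```

In this model the register copy collapses into a single store from xmm4,
while the constant load has nothing to forward from and stays as a second
instruction, mirroring the movaps vs. pxor+movaps difference above.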
  
H.J. Lu Sept. 27, 2022, 7:39 p.m. UTC | #6
On Tue, Sep 27, 2022 at 10:46 AM Robin Dapp via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> > I did bootstrapping and ran the testsuite on x86(-64), aarch64, Power9
> > and s390.  Everything looks good except two additional fails on x86
> > where code actually looks worse.
> >
> > gcc.target/i386/keylocker-encodekey128.c
> >
> > 17c17,18
> > <       movaps  %xmm4, k2(%rip)
> > ---
> >>       pxor    %xmm0, %xmm0
> >>       movaps  %xmm0, k2(%rip)
> >
> > gcc.target/i386/keylocker-encodekey256.c:
> >
> > 19c19,20
> > <       movaps  %xmm4, k3(%rip)
> > ---
> >>       pxor    %xmm0, %xmm0
> >>       movaps  %xmm0, k3(%rip)
>
> Before the patch and after postreload we have:
>
> (insn (set (reg:V2DI xmm0)
>         (reg:V2DI xmm4))
>      (expr_list:REG_DEAD (reg:V2DI 24 xmm4)
>         (expr_list:REG_EQUIV (const_vector:V2DI [
>                     (const_int 0 [0]) repeated x2
>                 ])))))
> (insn (set (mem/c:V2DI (symbol_ref:DI ("k2"))
>         (reg:V2DI xmm0))))
>
> which is converted by cprop_hardreg to:
>
> (insn (set (mem/c:V2DI (symbol_ref:DI ("k2")))
>         (reg:V2DI xmm4))))
>
> With the change there is:
>
> (insn (set (reg:V2DI xmm0)
>         (const_vector:V2DI [
>                 (const_int 0 [0]) repeated x2
>             ])))
> (insn (set (mem/c:V2DI (symbol_ref:DI ("k2")))
>         (reg:V2DI xmm0))))
>
> which is not simplified further because xmm0 needs to be explicitly
> zeroed while xmm4 is assumed to be zeroed by encodekey128.  I'm not
> familiar with this so I'm supposing this is correct even though I found
> "XMM4 through XMM6 are reserved for future usages and software should
> not rely upon them being zeroed." online.

I opened:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107061

> Even if xmm4 were zeroed explicitly, I guess in this case the simple
> costing of mov reg,reg vs mov reg,imm (with the latter not being more
> expensive) falls short?  cprop_hardreg can actually propagate the zeroed
> xmm4 into the next move.
> The same mechanism could possibly even elide many such moves which would
> mean we'd unnecessarily emit many mov reg,0?  Hmm...

This sounds like an issue.
  
Robin Dapp Sept. 28, 2022, 4:48 p.m. UTC | #7
> I opened:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107061

The online docs for encodekey256 also say

XMM4 through XMM6 are reserved for future usages and software should not
rely upon them being zeroed.

I believe we also zero there.

> This sounds like an issue.

So with your patch for encodekey128 the regression is gone and we zero
(pxor) xmm0 in both versions.  The case I outlined before does not
actually happen since cprop_hardreg propagates the (newly) zeroed
register to the use sites rather than zeroing every time.

I guess this just leaves the situation where we implicitly know that a
reg is zero and, by zeroing another one instead, we miss the cprop_hardreg
opportunity.  Not sure how common this is and if it's a blocker for this
patch.  No regressions on x86, aarch64, power9 and s390 now.  Most
likely we don't check to that granularity in the test suite and even
here it was more of an accidental hit.
  
Robin Dapp Nov. 3, 2022, 12:38 p.m. UTC | #8
Should we go ahead with this, i.e. push the change and wait for fallout?
 I guess we're still early enough in the cycle for that.  There are no
regressions anymore on s390, Power9, x86 and aarch64 (at least on the
farm machines I checked).

Regards
 Robin
  
Jeff Law Nov. 20, 2022, 4:40 p.m. UTC | #9
On 11/3/22 06:38, Robin Dapp wrote:
> Should we go ahead with this, i.e. push the change and wait for fallout?
>   I guess we're still early enough in the cycle for that.  There are no
> regressions anymore on s390, Power9, x86 and aarch64 (at least on the
> farm machines I checked).

That would be my recommendation (go forward asap so that there's more 
time to find any fallout).


jeff
  

Patch

diff --git a/gcc/postreload.cc b/gcc/postreload.cc
index 41f61d326482..934439733d52 100644
--- a/gcc/postreload.cc
+++ b/gcc/postreload.cc
@@ -558,13 +558,12 @@  reload_cse_simplify_operands (rtx_insn *insn, rtx testreg)
                  if (op_alt_regno[i][j] == -1
                      && TEST_BIT (preferred, j)
                      && reg_fits_class_p (testreg, rclass, 0, mode)
-                     && (!CONST_INT_P (recog_data.operand[i])
-                         || (set_src_cost (recog_data.operand[i], mode,
-                                           optimize_bb_for_speed_p
-                                            (BLOCK_FOR_INSN (insn)))
-                             > set_src_cost (testreg, mode,
-                                             optimize_bb_for_speed_p
-                                              (BLOCK_FOR_INSN (insn))))))
+                     && (set_src_cost (recog_data.operand[i], mode,
+                                       optimize_bb_for_speed_p
+                                        (BLOCK_FOR_INSN (insn)))
+                         > set_src_cost (testreg, mode,
+                                         optimize_bb_for_speed_p
+                                          (BLOCK_FOR_INSN (insn)))))
                    {
                      alternative_nregs[j]++;
                      op_alt_regno[i][j] = regno;