[1/2] Using pli(paddi) and rotate to build 64bit constants

Message ID 20220901032400.23692-1-guojiufu@linux.ibm.com
State New, archived
Headers
Series [1/2] Using pli(paddi) and rotate to build 64bit constants |

Commit Message

Jiufu Guo Sept. 1, 2022, 3:24 a.m. UTC
  Hi,

As mentioned in PR106550, since pli could support 34bits immediate, we could
use less instructions(3insn would be ok) to build 64bits constant with pli.

For example, for constant 0x020805006106003, we could generate it with:
asm code1:
pli 9,101736451 (0x6106003)
sldi 9,9,32
paddi 9,9, 2130000 (0x0208050)

or asm code2:
pli 10, 2130000
pli 9, 101736451
rldimi 9, 10, 32, 0

Testing with simple cases as below, run them a lot of times:
f1.c
long __attribute__ ((noinline)) foo (long *arg,long *,long*)
{
  *arg = 0x2351847027482577;
}
5insns: base
pli+sldi+paddi: similar -0.08%
pli+pli+rldimi: faster +0.66%

f2.c
long __attribute__ ((noinline)) foo (long *arg, long *arg2, long *arg3)
{
  *arg = 0x2351847027482577;
  *arg2 = 0x3257845024384680;
  *arg3 = 0x1245abcef9240dec;
}
5nisns: base
pli+sldi+paddi: faster +1.35%
pli+pli+rldimi: faster +5.49%

f2.c would be more meaningful.  Because 'sched passes' are effective for
f2.c, but 'scheds' do less thing for f1.c.

Compare with previous patch:
https://gcc.gnu.org/pipermail/gcc-patches/2022-August/599525.html
This one updates code slightly and extracts changes on rs6000.md to a
seperate patch.

This patch pass boostrap and regtest on ppc64le(includes p10).
Is it ok for trunk?

BR,
Jeff(Jiufu)


	PR target/106550

gcc/ChangeLog:

	* config/rs6000/rs6000.cc (rs6000_emit_set_long_const): Add 'pli' for
	constant building.

gcc/testsuite/ChangeLog:

	* gcc.target/powerpc/pr106550.c: New test.

---
 gcc/config/rs6000/rs6000.cc                 | 39 +++++++++++++++++++++
 gcc/testsuite/gcc.target/powerpc/pr106550.c | 14 ++++++++
 2 files changed, 53 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr106550.c
  

Comments

Segher Boessenkool Sept. 1, 2022, 9:52 p.m. UTC | #1
Hi!

This patch is a clear improvement :-)

On Thu, Sep 01, 2022 at 11:24:00AM +0800, Jiufu Guo wrote:
> As mentioned in PR106550, since pli could support 34bits immediate, we could
> use less instructions(3insn would be ok) to build 64bits constant with pli.

> For example, for constant 0x020805006106003, we could generate it with:
> asm code1:
> pli 9,101736451 (0x6106003)
> sldi 9,9,32
> paddi 9,9, 2130000 (0x0208050)

3 insns, 2 insns dependent on the previous, each.

> or asm code2:
> pli 10, 2130000
> pli 9, 101736451
> rldimi 9, 10, 32, 0

3 insns, 1 insn dependent on both others.

> Testing with simple cases as below, run them a lot of times:
> f1.c
> long __attribute__ ((noinline)) foo (long *arg,long *,long*)
> {
>   *arg = 0x2351847027482577;
> }
> 5insns: base
> pli+sldi+paddi: similar -0.08%
> pli+pli+rldimi: faster +0.66%

This mostly tests how well this micro-benchmark is scheduled.  More time
is spent in the looping and function calls (not shown)!

> f2.c
> long __attribute__ ((noinline)) foo (long *arg, long *arg2, long *arg3)
> {
>   *arg = 0x2351847027482577;
>   *arg2 = 0x3257845024384680;
>   *arg3 = 0x1245abcef9240dec;
> }
> 5nisns: base
> pli+sldi+paddi: faster +1.35%
> pli+pli+rldimi: faster +5.49%
> 
> f2.c would be more meaningful.  Because 'sched passes' are effective for
> f2.c, but 'scheds' do less thing for f1.c.

It still is a too small example to mean much without looking at a
pipeview, or at the very least perf.  But the results show a solid
improvement as expected ;-)

> gcc/ChangeLog:
> 	PR target/106550
> 	* config/rs6000/rs6000.cc (rs6000_emit_set_long_const): Add 'pli' for
> 	constant building.

"Use pli." ?

> gcc/testsuite/ChangeLog:
> 	PR target/106550
> 	* gcc.target/powerpc/pr106550.c: New test.

> +  else if (TARGET_PREFIXED)
> +    {
> +      /* pli 9,high32 + pli 10,low32 + rldimi 9,10,32,0.  */

But not just 9 and 10.  Use A and B or X and Y or H and L or something
like that?

The comment goes...

> +      if (can_create_pseudo_p ())
> +	{

... here.

> +	  temp = gen_reg_rtx (DImode);
> +	  rtx temp1 = gen_reg_rtx (DImode);
> +	  emit_move_insn (copy_rtx (temp), GEN_INT ((ud4 << 16) | ud3));
> +	  emit_move_insn (copy_rtx (temp1), GEN_INT ((ud2 << 16) | ud1));
> +
> +	  emit_insn (gen_rotldi3_insert_3 (dest, temp, GEN_INT (32), temp1,
> +					   GEN_INT (0xffffffff)));
> +	}
> +

No blank line here please.

> +      /* pli 9,high32 + sldi 9,32 + paddi 9,9,low32.  */
> +      else
> +	{

The comment goes here, in the block it refers to.  Comments for a block
are the first thing *in* the block.

> +	  emit_move_insn (copy_rtx (dest), GEN_INT ((ud4 << 16) | ud3));
> +
> +	  emit_move_insn (copy_rtx (dest),
> +			  gen_rtx_ASHIFT (DImode, copy_rtx (dest),
> +					  GEN_INT (32)));
> +
> +	  bool can_use_paddi = REGNO (dest) != FIRST_GPR_REGNO;

There should be a test that we so the right thing (or *a* right thing,
anyway; a working thing; but hopefully a reasonably fast thing) for
!can_use_paddi.

> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr106550.c
> @@ -0,0 +1,14 @@
> +/* PR target/106550 */
> +/* { dg-options "-O2 -std=c99 -mdejagnu-cpu=power10" } */
> +
> +void
> +foo (unsigned long long *a)
> +{
> +  *a++ = 0x020805006106003;
> +  *a++ = 0x2351847027482577;  
> +}
> +
> +/* 3 insns for each constant: pli+sldi+paddi or pli+pli+rldimi.
> +   And 3 additional insns: std+std+blr: 9 insns totally.  */
> +/* { dg-final { scan-assembler-times {(?n)^\s+[a-z]} 9 } } */

Also test the expected insns separately please?  The std's (just with
\mstd so it will catch all variations as well), the blr, the pli's and
the rldimi etc.?

We also should test all special cases as well.  Especially those that do
not happen all over the place, we will notice something is broken there
easy enough.  But unlikely cases can take years to show up.

Okay for trunk with the formatting fixed.  Thank you!


Segher
  
Kewen.Lin Sept. 2, 2022, 4:07 a.m. UTC | #2
Hi Jeff,

Thanks for the patch, some comments on nits are inline.

on 2022/9/1 11:24, Jiufu Guo wrote:
> Hi,
> 
> As mentioned in PR106550, since pli could support 34bits immediate, we could
> use less instructions(3insn would be ok) to build 64bits constant with pli.
> 
> For example, for constant 0x020805006106003, we could generate it with:
> asm code1:
> pli 9,101736451 (0x6106003)
> sldi 9,9,32
> paddi 9,9, 2130000 (0x0208050)
> 
> or asm code2:
> pli 10, 2130000
> pli 9, 101736451
> rldimi 9, 10, 32, 0
> 
> Testing with simple cases as below, run them a lot of times:
> f1.c
> long __attribute__ ((noinline)) foo (long *arg,long *,long*)
> {
>   *arg = 0x2351847027482577;
> }
> 5insns: base
> pli+sldi+paddi: similar -0.08%
> pli+pli+rldimi: faster +0.66%
> 
> f2.c
> long __attribute__ ((noinline)) foo (long *arg, long *arg2, long *arg3)
> {
>   *arg = 0x2351847027482577;
>   *arg2 = 0x3257845024384680;
>   *arg3 = 0x1245abcef9240dec;
> }
> 5nisns: base
> pli+sldi+paddi: faster +1.35%
> pli+pli+rldimi: faster +5.49%
> 
> f2.c would be more meaningful.  Because 'sched passes' are effective for
> f2.c, but 'scheds' do less thing for f1.c.
> 
> Compare with previous patch:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-August/599525.html
> This one updates code slightly and extracts changes on rs6000.md to a
> seperate patch.
> 
> This patch pass boostrap and regtest on ppc64le(includes p10).
> Is it ok for trunk?
> 
> BR,
> Jeff(Jiufu)
> 
> 
> 	PR target/106550
> 
> gcc/ChangeLog:
> 
> 	* config/rs6000/rs6000.cc (rs6000_emit_set_long_const): Add 'pli' for
> 	constant building.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/powerpc/pr106550.c: New test.
> 
> ---
>  gcc/config/rs6000/rs6000.cc                 | 39 +++++++++++++++++++++
>  gcc/testsuite/gcc.target/powerpc/pr106550.c | 14 ++++++++
>  2 files changed, 53 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr106550.c
> 
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index df491bee2ea..1ccb2ff30a1 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -10181,6 +10181,45 @@ rs6000_emit_set_long_const (rtx dest, HOST_WIDE_INT c)
>  			gen_rtx_IOR (DImode, copy_rtx (temp),
>  				     GEN_INT (ud1)));
>      }
> +  else if (TARGET_PREFIXED)
> +    {
> +      /* pli 9,high32 + pli 10,low32 + rldimi 9,10,32,0.  */
> +      if (can_create_pseudo_p ())
> +	{
> +	  temp = gen_reg_rtx (DImode);
> +	  rtx temp1 = gen_reg_rtx (DImode);
> +	  emit_move_insn (copy_rtx (temp), GEN_INT ((ud4 << 16) | ud3));
> +	  emit_move_insn (copy_rtx (temp1), GEN_INT ((ud2 << 16) | ud1));
> +

Nit: copy_rtx here seems not necessary, as both temp and temp1 are with CODE REG.
The function copy_rtx returns the given rtx for code REG.

> +	  emit_insn (gen_rotldi3_insert_3 (dest, temp, GEN_INT (32), temp1,
> +					   GEN_INT (0xffffffff)));
> +	}
> +
> +      /* pli 9,high32 + sldi 9,32 + paddi 9,9,low32.  */
> +      else
> +	{
> +	  emit_move_insn (copy_rtx (dest), GEN_INT ((ud4 << 16) | ud3));
> +
> +	  emit_move_insn (copy_rtx (dest),
> +			  gen_rtx_ASHIFT (DImode, copy_rtx (dest),
> +					  GEN_INT (32)));
> +
> +	  bool can_use_paddi = REGNO (dest) != FIRST_GPR_REGNO;
> +

The REGNO usage has asserted dest is with CODE REG, if it's always true
I don't see why we need copy_rtx around.  Or do I miss something?

> +	  /* Use paddi for the low32 bits.  */
> +	  if (ud2 != 0 && ud1 != 0 && can_use_paddi)
> +	    emit_move_insn (dest, gen_rtx_PLUS (DImode, copy_rtx (dest),
> +						GEN_INT ((ud2 << 16) | ud1)));
> +	  /* Use oris, ori for low32 bits.  */
> +	  if (ud2 != 0 && (ud1 == 0 || !can_use_paddi))
> +	    emit_move_insn (ud1 != 0 ? copy_rtx (dest) : dest,
> +			    gen_rtx_IOR (DImode, copy_rtx (dest),
> +					 GEN_INT (ud2 << 16)));
> +	  if (ud1 != 0 && (ud2 == 0 || !can_use_paddi))
> +	    emit_move_insn (dest, gen_rtx_IOR (DImode, copy_rtx (dest),
> +					       GEN_INT (ud1)));
> +	}
> +    }
>    else
>      {
>        temp = !can_create_pseudo_p () ? dest : gen_reg_rtx (DImode);
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr106550.c b/gcc/testsuite/gcc.target/powerpc/pr106550.c
> new file mode 100644
> index 00000000000..c6f4116bb9a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr106550.c
> @@ -0,0 +1,14 @@
> +/* PR target/106550 */
> +/* { dg-options "-O2 -std=c99 -mdejagnu-cpu=power10" } */
> +

Need to check power10_ok, like:
/* { dg-require-effective-target power10_ok } */

Nit: -std=c99 is not needed?

BR,
Kewen
  
Jiufu Guo Sept. 2, 2022, 6:56 a.m. UTC | #3
Hi,

Segher Boessenkool <segher@kernel.crashing.org> writes:

> Hi!
>
> This patch is a clear improvement :-)
>
> On Thu, Sep 01, 2022 at 11:24:00AM +0800, Jiufu Guo wrote:
>> As mentioned in PR106550, since pli could support 34bits immediate, we could
>> use less instructions(3insn would be ok) to build 64bits constant with pli.
>
>> For example, for constant 0x020805006106003, we could generate it with:
>> asm code1:
>> pli 9,101736451 (0x6106003)
>> sldi 9,9,32
>> paddi 9,9, 2130000 (0x0208050)
>
> 3 insns, 2 insns dependent on the previous, each.
Yeap.
>
>> or asm code2:
>> pli 10, 2130000
>> pli 9, 101736451
>> rldimi 9, 10, 32, 0
>
> 3 insns, 1 insn dependent on both others.
Yes.
>
>> Testing with simple cases as below, run them a lot of times:
>> f1.c
>> long __attribute__ ((noinline)) foo (long *arg,long *,long*)
>> {
>>   *arg = 0x2351847027482577;
>> }
>> 5insns: base
>> pli+sldi+paddi: similar -0.08%
>> pli+pli+rldimi: faster +0.66%
>
> This mostly tests how well this micro-benchmark is scheduled.  More time
> is spent in the looping and function calls (not shown)!
>
>> f2.c
>> long __attribute__ ((noinline)) foo (long *arg, long *arg2, long *arg3)
>> {
>>   *arg = 0x2351847027482577;
>>   *arg2 = 0x3257845024384680;
>>   *arg3 = 0x1245abcef9240dec;
>> }
>> 5nisns: base
>> pli+sldi+paddi: faster +1.35%
>> pli+pli+rldimi: faster +5.49%
>> 
>> f2.c would be more meaningful.  Because 'sched passes' are effective for
>> f2.c, but 'scheds' do less thing for f1.c.
>
> It still is a too small example to mean much without looking at a
> pipeview, or at the very least perf.  But the results show a solid
> improvement as expected ;-)
Right, checking how the 'cycles' are using on each instructions would be
more meaningful to demonstrate how the runtime is changing.
>
>> gcc/ChangeLog:
>> 	PR target/106550
>> 	* config/rs6000/rs6000.cc (rs6000_emit_set_long_const): Add 'pli' for
>> 	constant building.
>
> "Use pli." ?
Thanks, will update.
>
>> gcc/testsuite/ChangeLog:
>> 	PR target/106550
>> 	* gcc.target/powerpc/pr106550.c: New test.
>
>> +  else if (TARGET_PREFIXED)
>> +    {
>> +      /* pli 9,high32 + pli 10,low32 + rldimi 9,10,32,0.  */
>
> But not just 9 and 10.  Use A and B or X and Y or H and L or something
> like that?
OK,  will updata accordingly.
>
> The comment goes...
>
>> +      if (can_create_pseudo_p ())
>> +	{
>
> ... here.
>
>> +	  temp = gen_reg_rtx (DImode);
>> +	  rtx temp1 = gen_reg_rtx (DImode);
>> +	  emit_move_insn (copy_rtx (temp), GEN_INT ((ud4 << 16) | ud3));
>> +	  emit_move_insn (copy_rtx (temp1), GEN_INT ((ud2 << 16) | ud1));
>> +
>> +	  emit_insn (gen_rotldi3_insert_3 (dest, temp, GEN_INT (32), temp1,
>> +					   GEN_INT (0xffffffff)));
>> +	}
>> +
>
> No blank line here please.
>
>> +      /* pli 9,high32 + sldi 9,32 + paddi 9,9,low32.  */
>> +      else
>> +	{
>
> The comment goes here, in the block it refers to.  Comments for a block
> are the first thing *in* the block.
OK, great! I like the format you sugguested here :-)
>
>> +	  emit_move_insn (copy_rtx (dest), GEN_INT ((ud4 << 16) | ud3));
>> +
>> +	  emit_move_insn (copy_rtx (dest),
>> +			  gen_rtx_ASHIFT (DImode, copy_rtx (dest),
>> +					  GEN_INT (32)));
>> +
>> +	  bool can_use_paddi = REGNO (dest) != FIRST_GPR_REGNO;
>
> There should be a test that we so the right thing (or *a* right thing,
> anyway; a working thing; but hopefully a reasonably fast thing) for
> !can_use_paddi.
To catch this test point, we need let the splitter run after RA,
and register 0 happen to be the dest of an assignment.
Oh, below case would be useful for this test point:

/* { dg-options "-O2 -std=c99 -mdejagnu-cpu=power10 -fdisable-rtl-split1" } */
/* force the constant splitter run after RA: -fdisable-rtl-split1
   a few assignments to make sure r0 is allocated as dest of an assign.  */

void
foo (unsigned long long *a)
{
  *a++ = 0x020805006106003;
  *a++ = 0x2351847027482587;
  *a++ = 0x22513478874a2578;
  *a++ = 0x02180570670b003;
  *a++ = 0x2311847029488587;
  *a++ = 0x335184b02748757f;
  *a++ = 0x720805006096003;
  *a++ = 0x23a18b70a74e257e;
  *a++ = 0x2a518a70a74a2567;
  *a++ = 0x5208a5da0606a03;
  *a++ = 0x1391a47a2749257a;
  *a++ = 0x235a847027488576;
  *a++ = 0x23a1847027482677;  
}

/* { dg-final { scan-assembler-times {\moris\M} 1 } } */
/* { dg-final { scan-assembler-times {\mori\M} 1 } } */

I will add this test case in patch.
Is this ok?  Any sugguestions?
                                           
>
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr106550.c
>> @@ -0,0 +1,14 @@
>> +/* PR target/106550 */
>> +/* { dg-options "-O2 -std=c99 -mdejagnu-cpu=power10" } */
>> +
>> +void
>> +foo (unsigned long long *a)
>> +{
>> +  *a++ = 0x020805006106003;
>> +  *a++ = 0x2351847027482577;  
>> +}
>> +
>> +/* 3 insns for each constant: pli+sldi+paddi or pli+pli+rldimi.
>> +   And 3 additional insns: std+std+blr: 9 insns totally.  */
>> +/* { dg-final { scan-assembler-times {(?n)^\s+[a-z]} 9 } } */
>
> Also test the expected insns separately please?  The std's (just with
> \mstd so it will catch all variations as well), the blr, the pli's and
> the rldimi etc.?
The reason of using "(?n)^\s+[a-z]" is to keep this test case pass no
matter the splitter running before or after RA.

Yes, using real instructions to check would be better.
Will update the case to check real instructions 'pli' and 'rldimi'.

>
> We also should test all special cases as well.  Especially those that do
> not happen all over the place, we will notice something is broken there
> easy enough.  But unlikely cases can take years to show up.
Understand :)

BR,
Jeff(Jiufu)

>
> Okay for trunk with the formatting fixed.  Thank you!
>
>
> Segher
  
Peter Bergner Sept. 2, 2022, 3:29 p.m. UTC | #4
On 9/1/22 4:52 PM, Segher Boessenkool wrote:
> On Thu, Sep 01, 2022 at 11:24:00AM +0800, Jiufu Guo wrote:
>> As mentioned in PR106550, since pli could support 34bits immediate, we could
>> use less instructions(3insn would be ok) to build 64bits constant with pli.
> 
>> For example, for constant 0x020805006106003, we could generate it with:
>> asm code1:
>> pli 9,101736451 (0x6106003)
>> sldi 9,9,32
>> paddi 9,9, 2130000 (0x0208050)
> 
> 3 insns, 2 insns dependent on the previous, each.
> 
>> or asm code2:
>> pli 10, 2130000
>> pli 9, 101736451
>> rldimi 9, 10, 32, 0
> 
> 3 insns, 1 insn dependent on both others.

Yeah, the improvement here is the fewer dependent instructions, since
2 64-bit + 1 32-bit instructions is the same size as 5 32-bit insns.
Those 5 32-bit insns are all dependent on the previous insn, so not ideal.

It's too bad we don't have a paddis or poris insns where we could specify
in the prefix a shift of 32-bits rather than the normal 16-bits.
If we had those, we could generate the constant with just 2 64-bit insns.

Peter
  
Segher Boessenkool Sept. 2, 2022, 4:12 p.m. UTC | #5
Hi!

On Fri, Sep 02, 2022 at 02:56:21PM +0800, Jiufu Guo wrote:
> >> +      /* pli 9,high32 + sldi 9,32 + paddi 9,9,low32.  */
> >> +      else
> >> +	{
> >
> > The comment goes here, in the block it refers to.  Comments for a block
> > are the first thing *in* the block.
> OK, great! I like the format you sugguested here :-)

It's the normal GCC style, not my invention :-)

> >> +	  emit_move_insn (copy_rtx (dest), GEN_INT ((ud4 << 16) | ud3));
> >> +
> >> +	  emit_move_insn (copy_rtx (dest),
> >> +			  gen_rtx_ASHIFT (DImode, copy_rtx (dest),
> >> +					  GEN_INT (32)));
> >> +
> >> +	  bool can_use_paddi = REGNO (dest) != FIRST_GPR_REGNO;
> >
> > There should be a test that we so the right thing (or *a* right thing,
> > anyway; a working thing; but hopefully a reasonably fast thing) for
> > !can_use_paddi.
> To catch this test point, we need let the splitter run after RA,
> and register 0 happen to be the dest of an assignment.

Or force the testcase to use r0 some other way.  Well, "forcing" cannot
be done, but we can probably encourage it (via a local register asm for
example, or by tying the var to the output of an asm that is hard reg 0,
or perhaps there are other ways as well :-) )

> I will add this test case in patch.
> Is this ok?  Any sugguestions?

Sounds useful yes.  Maybe describe the expected output in words as well
(in the testcase, not in email)?

> >> +/* 3 insns for each constant: pli+sldi+paddi or pli+pli+rldimi.
> >> +   And 3 additional insns: std+std+blr: 9 insns totally.  */
> >> +/* { dg-final { scan-assembler-times {(?n)^\s+[a-z]} 9 } } */
> >
> > Also test the expected insns separately please?  The std's (just with
> > \mstd so it will catch all variations as well), the blr, the pli's and
> > the rldimi etc.?
> The reason of using "(?n)^\s+[a-z]" is to keep this test case pass no
> matter the splitter running before or after RA.

Ah.  Some short comment in the testcase please?

Thanks again,


Segher
  
Segher Boessenkool Sept. 2, 2022, 4:20 p.m. UTC | #6
On Fri, Sep 02, 2022 at 10:29:35AM -0500, Peter Bergner wrote:
> On 9/1/22 4:52 PM, Segher Boessenkool wrote:
> > On Thu, Sep 01, 2022 at 11:24:00AM +0800, Jiufu Guo wrote:
> >> As mentioned in PR106550, since pli could support 34bits immediate, we could
> >> use less instructions(3insn would be ok) to build 64bits constant with pli.
> > 
> >> For example, for constant 0x020805006106003, we could generate it with:
> >> asm code1:
> >> pli 9,101736451 (0x6106003)
> >> sldi 9,9,32
> >> paddi 9,9, 2130000 (0x0208050)
> > 
> > 3 insns, 2 insns dependent on the previous, each.
> > 
> >> or asm code2:
> >> pli 10, 2130000
> >> pli 9, 101736451
> >> rldimi 9, 10, 32, 0
> > 
> > 3 insns, 1 insn dependent on both others.
> 
> Yeah, the improvement here is the fewer dependent instructions, since
> 2 64-bit + 1 32-bit instructions is the same size as 5 32-bit insns.

It also helps CSE if you do say 0x1200aa00bb0034 and 0x1200aa00bb0056,
or even just 0x1200aa001200aa maybe (we probably have a separate pattern
for the latter though :-) )


Segher
  
Jiufu Guo Sept. 5, 2022, 6:22 a.m. UTC | #7
"Kewen.Lin" <linkw@linux.ibm.com> writes:

> Hi Jeff,
>
> Thanks for the patch, some comments on nits are inline.
>
> on 2022/9/1 11:24, Jiufu Guo wrote:
>> Hi,
>> 
>> As mentioned in PR106550, since pli could support 34bits immediate, we could
>> use less instructions(3insn would be ok) to build 64bits constant with pli.
>> 
>> For example, for constant 0x020805006106003, we could generate it with:
>> asm code1:
>> pli 9,101736451 (0x6106003)
>> sldi 9,9,32
>> paddi 9,9, 2130000 (0x0208050)
>> 
>> or asm code2:
>> pli 10, 2130000
>> pli 9, 101736451
>> rldimi 9, 10, 32, 0
>> 
>> Testing with simple cases as below, run them a lot of times:
>> f1.c
>> long __attribute__ ((noinline)) foo (long *arg,long *,long*)
>> {
>>   *arg = 0x2351847027482577;
>> }
>> 5insns: base
>> pli+sldi+paddi: similar -0.08%
>> pli+pli+rldimi: faster +0.66%
>> 
>> f2.c
>> long __attribute__ ((noinline)) foo (long *arg, long *arg2, long *arg3)
>> {
>>   *arg = 0x2351847027482577;
>>   *arg2 = 0x3257845024384680;
>>   *arg3 = 0x1245abcef9240dec;
>> }
>> 5nisns: base
>> pli+sldi+paddi: faster +1.35%
>> pli+pli+rldimi: faster +5.49%
>> 
>> f2.c would be more meaningful.  Because 'sched passes' are effective for
>> f2.c, but 'scheds' do less thing for f1.c.
>> 
>> Compare with previous patch:
>> https://gcc.gnu.org/pipermail/gcc-patches/2022-August/599525.html
>> This one updates code slightly and extracts changes on rs6000.md to a
>> seperate patch.
>> 
>> This patch pass boostrap and regtest on ppc64le(includes p10).
>> Is it ok for trunk?
>> 
>> BR,
>> Jeff(Jiufu)
>> 
>> 
>> 	PR target/106550
>> 
>> gcc/ChangeLog:
>> 
>> 	* config/rs6000/rs6000.cc (rs6000_emit_set_long_const): Add 'pli' for
>> 	constant building.
>> 
>> gcc/testsuite/ChangeLog:
>> 
>> 	* gcc.target/powerpc/pr106550.c: New test.
>> 
>> ---
>>  gcc/config/rs6000/rs6000.cc                 | 39 +++++++++++++++++++++
>>  gcc/testsuite/gcc.target/powerpc/pr106550.c | 14 ++++++++
>>  2 files changed, 53 insertions(+)
>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr106550.c
>> 
>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>> index df491bee2ea..1ccb2ff30a1 100644
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -10181,6 +10181,45 @@ rs6000_emit_set_long_const (rtx dest, HOST_WIDE_INT c)
>>  			gen_rtx_IOR (DImode, copy_rtx (temp),
>>  				     GEN_INT (ud1)));
>>      }
>> +  else if (TARGET_PREFIXED)
>> +    {
>> +      /* pli 9,high32 + pli 10,low32 + rldimi 9,10,32,0.  */
>> +      if (can_create_pseudo_p ())
>> +	{
>> +	  temp = gen_reg_rtx (DImode);
>> +	  rtx temp1 = gen_reg_rtx (DImode);
>> +	  emit_move_insn (copy_rtx (temp), GEN_INT ((ud4 << 16) | ud3));
>> +	  emit_move_insn (copy_rtx (temp1), GEN_INT ((ud2 << 16) | ud1));
>> +
>
> Nit: copy_rtx here seems not necessary, as both temp and temp1 are with CODE REG.
> The function copy_rtx returns the given rtx for code REG.
>
>> +	  emit_insn (gen_rotldi3_insert_3 (dest, temp, GEN_INT (32), temp1,
>> +					   GEN_INT (0xffffffff)));
>> +	}
>> +
>> +      /* pli 9,high32 + sldi 9,32 + paddi 9,9,low32.  */
>> +      else
>> +	{
>> +	  emit_move_insn (copy_rtx (dest), GEN_INT ((ud4 << 16) | ud3));
>> +
>> +	  emit_move_insn (copy_rtx (dest),
>> +			  gen_rtx_ASHIFT (DImode, copy_rtx (dest),
>> +					  GEN_INT (32)));
>> +
>> +	  bool can_use_paddi = REGNO (dest) != FIRST_GPR_REGNO;
>> +
>
> The REGNO usage has asserted dest is with CODE REG, if it's always true
> I don't see why we need copy_rtx around.  Or do I miss something?
Thanks a lot for point this out!
Yes, now rs6000_emit_set_long_const is called on DImode for constant
splitter; and it should be always with CODE REG, and then copy_rtx would
not be needed here! 
I would update the patch accordingly.

>
>> +	  /* Use paddi for the low32 bits.  */
>> +	  if (ud2 != 0 && ud1 != 0 && can_use_paddi)
>> +	    emit_move_insn (dest, gen_rtx_PLUS (DImode, copy_rtx (dest),
>> +						GEN_INT ((ud2 << 16) | ud1)));
>> +	  /* Use oris, ori for low32 bits.  */
>> +	  if (ud2 != 0 && (ud1 == 0 || !can_use_paddi))
>> +	    emit_move_insn (ud1 != 0 ? copy_rtx (dest) : dest,
>> +			    gen_rtx_IOR (DImode, copy_rtx (dest),
>> +					 GEN_INT (ud2 << 16)));
>> +	  if (ud1 != 0 && (ud2 == 0 || !can_use_paddi))
>> +	    emit_move_insn (dest, gen_rtx_IOR (DImode, copy_rtx (dest),
>> +					       GEN_INT (ud1)));
>> +	}
>> +    }
>>    else
>>      {
>>        temp = !can_create_pseudo_p () ? dest : gen_reg_rtx (DImode);
>> diff --git a/gcc/testsuite/gcc.target/powerpc/pr106550.c b/gcc/testsuite/gcc.target/powerpc/pr106550.c
>> new file mode 100644
>> index 00000000000..c6f4116bb9a
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr106550.c
>> @@ -0,0 +1,14 @@
>> +/* PR target/106550 */
>> +/* { dg-options "-O2 -std=c99 -mdejagnu-cpu=power10" } */
>> +
>
> Need to check power10_ok, like:
> /* { dg-require-effective-target power10_ok } */
>
> Nit: -std=c99 is not needed?
Thanks for catching this!


BR,
Jeff(Jiufu)
>
> BR,
> Kewen
  
Jiufu Guo Sept. 5, 2022, 6:25 a.m. UTC | #8
Hi,

Segher Boessenkool <segher@kernel.crashing.org> writes:

> Hi!
>
> On Fri, Sep 02, 2022 at 02:56:21PM +0800, Jiufu Guo wrote:
>> >> +      /* pli 9,high32 + sldi 9,32 + paddi 9,9,low32.  */
>> >> +      else
>> >> +	{
>> >
>> > The comment goes here, in the block it refers to.  Comments for a block
>> > are the first thing *in* the block.
>> OK, great! I like the format you sugguested here :-)
>
> It's the normal GCC style, not my invention :-)
>
>> >> +	  emit_move_insn (copy_rtx (dest), GEN_INT ((ud4 << 16) | ud3));
>> >> +
>> >> +	  emit_move_insn (copy_rtx (dest),
>> >> +			  gen_rtx_ASHIFT (DImode, copy_rtx (dest),
>> >> +					  GEN_INT (32)));
>> >> +
>> >> +	  bool can_use_paddi = REGNO (dest) != FIRST_GPR_REGNO;
>> >
>> > There should be a test that we so the right thing (or *a* right thing,
>> > anyway; a working thing; but hopefully a reasonably fast thing) for
>> > !can_use_paddi.
>> To catch this test point, we need let the splitter run after RA,
>> and register 0 happen to be the dest of an assignment.
>
> Or force the testcase to use r0 some other way.  Well, "forcing" cannot
> be done, but we can probably encourage it (via a local register asm for
> example, or by tying the var to the output of an asm that is hard reg 0,
> or perhaps there are other ways as well :-) )
Specify register is very helpful here! Both below two cases could help
register 0 to be selected for a variable.  Great!  Thanks!
code1
-------
void __attribute__ ((noinline)) foo (unsigned long long *a)
{
  register long long d asm("r0") = 0x1245abcef9240dec;
  long long n;
  asm("cntlzd %0, %1" : "=r"(n) : "r"(d));
  *a = n;
}

code2
-----------
register long long d asm ("r0");

void __attribute__ ((noinline)) foo ()
{
  d = 0x2351847027482577ULL;
}


>
>> I will add this test case in patch.
>> Is this ok?  Any sugguestions?
>
> Sounds useful yes.  Maybe describe the expected output in words as well
> (in the testcase, not in email)?
OK.   Will update the patch accordingly.
>
>> >> +/* 3 insns for each constant: pli+sldi+paddi or pli+pli+rldimi.
>> >> +   And 3 additional insns: std+std+blr: 9 insns totally.  */
>> >> +/* { dg-final { scan-assembler-times {(?n)^\s+[a-z]} 9 } } */
>> >
>> > Also test the expected insns separately please?  The std's (just with
>> > \mstd so it will catch all variations as well), the blr, the pli's and
>> > the rldimi etc.?
>> The reason of using "(?n)^\s+[a-z]" is to keep this test case pass no
>> matter the splitter running before or after RA.
>
> Ah.  Some short comment in the testcase please?
Thanks.  We can check individual instructions to check better asm
pli+pli+rldimi is generated, since the splitter could run in split1 pass.

BR,
Jeff(Jiufu)

>
> Thanks again,
>
>
> Segher
  
Segher Boessenkool Sept. 5, 2022, 1:42 p.m. UTC | #9
On Mon, Sep 05, 2022 at 02:25:29PM +0800, Jiufu Guo wrote:
> > On Fri, Sep 02, 2022 at 02:56:21PM +0800, Jiufu Guo wrote:
> > Or force the testcase to use r0 some other way.  Well, "forcing" cannot
> > be done, but we can probably encourage it (via a local register asm for
> > example, or by tying the var to the output of an asm that is hard reg 0,
> > or perhaps there are other ways as well :-) )
> Specify register is very helpful here! Both below two cases could help
> register 0 to be selected for a variable.  Great!  Thanks!

> >> The reason of using "(?n)^\s+[a-z]" is to keep this test case pass no
> >> matter the splitter running before or after RA.
> >
> > Ah.  Some short comment in the testcase please?
> Thanks.  We can check individual instructions to check better asm
> pli+pli+rldimi is generated, since the splitter could run in split1 pass.

It's always a tradeoff: if we scan for too specific code the testcase
will need a lot of maintenance, and that does not scale.  If on the
other hand we do test less than we really want, well, that is less than
we really want to test :-)

The changes / additions we agreed to are preapproved btw.  But please
post the thing you eventually commit to the mailing list :-)


Segher
  
Jiufu Guo Sept. 6, 2022, 12:34 p.m. UTC | #10
Segher Boessenkool <segher@kernel.crashing.org> writes:

> On Mon, Sep 05, 2022 at 02:25:29PM +0800, Jiufu Guo wrote:
>> > On Fri, Sep 02, 2022 at 02:56:21PM +0800, Jiufu Guo wrote:
>> > Or force the testcase to use r0 some other way.  Well, "forcing" cannot
>> > be done, but we can probably encourage it (via a local register asm for
>> > example, or by tying the var to the output of an asm that is hard reg 0,
>> > or perhaps there are other ways as well :-) )
>> Specify register is very helpful here! Both below two cases could help
>> register 0 to be selected for a variable.  Great!  Thanks!
>
>> >> The reason of using "(?n)^\s+[a-z]" is to keep this test case pass no
>> >> matter the splitter running before or after RA.
>> >
>> > Ah.  Some short comment in the testcase please?
>> Thanks.  We can check individual instructions to check better asm
>> pli+pli+rldimi is generated, since the splitter could run in split1 pass.
>
> It's always a tradeoff: if we scan for too specific code the testcase
> will need a lot of maintenance, and that does not scale.  If on the
> other hand we do test less than we really want, well, that is less than
> we really want to test :-)
Yes :)
>
> The changes / additions we agreed to are preapproved btw.  But please
> post the thing you eventually commit to the mailing list :-)
Sure! I updated the patch for testcases and also for code part; I would
submit for review before commit again for new comments. :)

The updated patch is as:
https://gcc.gnu.org/pipermail/gcc-patches/2022-September/601150.html

Thanks again!


BR,
Jeff(Jiufu)

>
>
> Segher
  

Patch

diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index df491bee2ea..1ccb2ff30a1 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -10181,6 +10181,45 @@  rs6000_emit_set_long_const (rtx dest, HOST_WIDE_INT c)
 			gen_rtx_IOR (DImode, copy_rtx (temp),
 				     GEN_INT (ud1)));
     }
+  else if (TARGET_PREFIXED)
+    {
+      /* pli 9,high32 + pli 10,low32 + rldimi 9,10,32,0.  */
+      if (can_create_pseudo_p ())
+	{
+	  temp = gen_reg_rtx (DImode);
+	  rtx temp1 = gen_reg_rtx (DImode);
+	  emit_move_insn (copy_rtx (temp), GEN_INT ((ud4 << 16) | ud3));
+	  emit_move_insn (copy_rtx (temp1), GEN_INT ((ud2 << 16) | ud1));
+
+	  emit_insn (gen_rotldi3_insert_3 (dest, temp, GEN_INT (32), temp1,
+					   GEN_INT (0xffffffff)));
+	}
+
+      /* pli 9,high32 + sldi 9,32 + paddi 9,9,low32.  */
+      else
+	{
+	  emit_move_insn (copy_rtx (dest), GEN_INT ((ud4 << 16) | ud3));
+
+	  emit_move_insn (copy_rtx (dest),
+			  gen_rtx_ASHIFT (DImode, copy_rtx (dest),
+					  GEN_INT (32)));
+
+	  bool can_use_paddi = REGNO (dest) != FIRST_GPR_REGNO;
+
+	  /* Use paddi for the low32 bits.  */
+	  if (ud2 != 0 && ud1 != 0 && can_use_paddi)
+	    emit_move_insn (dest, gen_rtx_PLUS (DImode, copy_rtx (dest),
+						GEN_INT ((ud2 << 16) | ud1)));
+	  /* Use oris, ori for low32 bits.  */
+	  if (ud2 != 0 && (ud1 == 0 || !can_use_paddi))
+	    emit_move_insn (ud1 != 0 ? copy_rtx (dest) : dest,
+			    gen_rtx_IOR (DImode, copy_rtx (dest),
+					 GEN_INT (ud2 << 16)));
+	  if (ud1 != 0 && (ud2 == 0 || !can_use_paddi))
+	    emit_move_insn (dest, gen_rtx_IOR (DImode, copy_rtx (dest),
+					       GEN_INT (ud1)));
+	}
+    }
   else
     {
       temp = !can_create_pseudo_p () ? dest : gen_reg_rtx (DImode);
diff --git a/gcc/testsuite/gcc.target/powerpc/pr106550.c b/gcc/testsuite/gcc.target/powerpc/pr106550.c
new file mode 100644
index 00000000000..c6f4116bb9a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr106550.c
@@ -0,0 +1,14 @@ 
+/* PR target/106550 */
+/* { dg-options "-O2 -std=c99 -mdejagnu-cpu=power10" } */
+
+void
+foo (unsigned long long *a)
+{
+  *a++ = 0x020805006106003;
+  *a++ = 0x2351847027482577;  
+}
+
+/* 3 insns for each constant: pli+sldi+paddi or pli+pli+rldimi.
+   And 3 additional insns: std+std+blr: 9 insns totally.  */
+/* { dg-final { scan-assembler-times {(?n)^\s+[a-z]} 9 } } */
+