[RFC] rs6000: split complicated constant to memory

Message ID 20220815052519.194582-1-guojiufu@linux.ibm.com
State New, archived

Commit Message

Jiufu Guo Aug. 15, 2022, 5:25 a.m. UTC
  Hi,

This patch tries to put a constant into the constant pool if building the
constant requires 3 or more instructions.
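
As a rough illustration of that threshold, here is a minimal sketch in
plain C.  It is not GCC's num_insns_constant: estimate_build_insns,
insns_for_s32 and prefer_constant_pool are hypothetical names, and the
model only knows the li / lis / ori / sldi / oris pattern that appears in
the examples of this mail.  The real function knows more tricks, so treat
the result as an estimate only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model (not GCC code): instructions needed to
   build a value that fits in 32 bits with sign extension.  */
static int
insns_for_s32 (int32_t v)
{
  if (v >= -0x8000 && v <= 0x7fff)
    return 1;			/* li */
  if ((v & 0xffff) == 0)
    return 1;			/* lis */
  return 2;			/* lis + ori */
}

/* Estimate the length of a lis/ori/sldi/oris/ori style sequence for C.
   The unsigned-to-signed casts assume two's complement wrapping, which
   is what GCC implements.  */
static int
estimate_build_insns (uint64_t c)
{
  int64_t s = (int64_t) c;
  if (s >= INT32_MIN && s <= INT32_MAX)
    return insns_for_s32 ((int32_t) s);

  /* Build the upper 32 bits, then shift them into place with sldi.  */
  int n = insns_for_s32 ((int32_t) (uint32_t) (c >> 32)) + 1;
  if (c & 0xffff0000u)
    n++;			/* oris for bits 16..31 */
  if (c & 0xffffu)
    n++;			/* ori for bits 0..15 */
  return n;
}

/* The condition the patch tests: more than 2 instructions.  */
static bool
prefer_constant_pool (uint64_t c)
{
  return estimate_build_insns (c) > 2;
}
```

Under this model 0x1234567800000000 needs 3 instructions and
384307168202282325 needs 5, so both would go to the constant pool, while
a value such as 0x12345678 (lis + ori) would still be built.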

But I have a concern: I'm not sure this patch is really profitable.

In my tests: 1. in simple cases, where the instructions do not run in
parallel, loading the constant from memory may be faster; but 2. when
some of the instructions can run in parallel, loading the constant from
memory is no win compared with building it.  See the examples below.

For f1.c and f3.c, 'loading' the constant is acceptable at runtime;
for f2.c and f4.c, 'loading' the constant is visibly slower.

Real-world code contains both kinds of sequences.

So, I'm not sure whether we should push this patch.

I ran each of the functions below 1000000000 times to measure the runtime.
f1.c:
long foo (long *arg, long*, long *)
{
  *arg = 0x1234567800000000;
}
asm building constant:
	lis 10,0x1234
	ori 10,10,0x5678
	sldi 10,10,32
vs.  asm loading
	addis 10,2,.LC0@toc@ha
	ld 10,.LC0@toc@l(10)
The runtimes of 'building' and 'loading' are similar: sometimes
'building' is faster, sometimes 'loading' is faster, and the difference
is slight.

f2.c
long foo (long *arg, long *arg2, long *arg3)
{
  *arg = 0x1234567800000000;
  *arg2 = 0x7965234700000000;
  *arg3 = 0x4689123700000000;
}
asm building constant:
	lis 7,0x1234
	lis 10,0x7965
	lis 9,0x4689
	ori 7,7,0x5678
	ori 10,10,0x2347
	ori 9,9,0x1237
	sldi 7,7,32
	sldi 10,10,32
	sldi 9,9,32
vs. loading
	addis 7,2,.LC0@toc@ha
	addis 10,2,.LC1@toc@ha
	addis 9,2,.LC2@toc@ha
	ld 7,.LC0@toc@l(7)
	ld 10,.LC1@toc@l(10)
	ld 9,.LC2@toc@l(9)
For this case, 'loading' is always slower than 'building' (>15%).

f3.c
long foo (long *arg, long *, long *)
{
  *arg = 384307168202282325;
}
asm building constant:
	lis 10,0x555
	ori 10,10,0x5555
	sldi 10,10,32
	oris 10,10,0x5555
	ori 10,10,0x5555
For this case, 'building' (through 5 instructions) is slower, and 'loading'
is ~5% faster.
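
To check the arithmetic of the five-instruction sequence, the following
small sketch emulates the instructions in C.  The helpers lis, ori, oris,
sldi and build5 are hypothetical names written for this illustration;
they model only the register result of each instruction.

```c
#include <stdint.h>

/* lis rD,imm: load the sign-extended 16-bit immediate shifted left 16.  */
static uint64_t
lis (uint16_t imm)
{
  int64_t s = (imm ^ 0x8000) - 0x8000;	/* sign-extend 16 bits */
  return (uint64_t) (s * 65536);
}

/* ori rD,rS,imm: OR the immediate into bits 0..15.  */
static uint64_t
ori (uint64_t r, uint16_t imm)
{
  return r | imm;
}

/* oris rD,rS,imm: OR the immediate into bits 16..31.  */
static uint64_t
oris (uint64_t r, uint16_t imm)
{
  return r | ((uint64_t) imm << 16);
}

/* sldi rD,rS,sh: shift left; bits shifted out of the register are lost.  */
static uint64_t
sldi (uint64_t r, unsigned sh)
{
  return r << sh;
}

/* The lis/ori/sldi/oris/ori sequence, one 16-bit chunk per immediate.  */
static uint64_t
build5 (uint16_t hh, uint16_t hl, uint16_t lh, uint16_t ll)
{
  uint64_t r = lis (hh);
  r = ori (r, hl);
  r = sldi (r, 32);
  r = oris (r, lh);
  r = ori (r, ll);
  return r;
}
```

With the immediates of f3.c, build5 (0x555, 0x5555, 0x5555, 0x5555)
yields 0x0555555555555555, which is 384307168202282325.  Each instruction
depends on the previous result, so a single constant forms a serial chain
of five.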

f4.c
long foo (long *arg, long *arg2, long *arg3)
{
  *arg = 384307168202282325;
  *arg2 = -6148914691236517205;
  *arg3 = 768614336404564651;
}
asm building constant:
	lis 7,0x555
	lis 10,0xaaaa
	lis 9,0xaaa
	ori 7,7,0x5555
	ori 10,10,0xaaaa
	ori 9,9,0xaaaa
	sldi 7,7,32
	sldi 10,10,32
	sldi 9,9,32
	oris 7,7,0x5555
	oris 10,10,0xaaaa
	oris 9,9,0xaaaa
	ori 7,7,0x5555
	ori 10,10,0xaaab
	ori 9,9,0xaaab
In this case, since the 'building' sequences run in parallel, 'loading' is
~8% slower.  On p10, 'loading' (through 'pld') is also >4% slower.
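
For reference, a timing loop of the kind described above could look
roughly like the sketch below.  This is not the harness used for the
numbers in this mail: time_calls, store_fn and store_one are hypothetical
names, and store_one only stands in for f1.c.  It assumes an LP64 target
where long is 64 bits.

```c
#include <time.h>

typedef void (*store_fn) (long *, long *, long *);

/* Stand-in for f1.c (assumes 64-bit long).  */
static void
store_one (long *arg, long *arg2, long *arg3)
{
  (void) arg2;
  (void) arg3;
  *arg = 0x1234567800000000;
}

/* Call FN ITERATIONS times and return the CPU time used, in seconds.  */
static double
time_calls (store_fn fn, unsigned long iterations,
	    long *a, long *b, long *c)
{
  /* The volatile pointer keeps the compiler from inlining the callee
     and then deleting the loop.  */
  store_fn volatile f = fn;
  clock_t start = clock ();
  for (unsigned long i = 0; i < iterations; i++)
    f (a, b, c);
  return (double) (clock () - start) / CLOCKS_PER_SEC;
}
```

A call such as time_calls (store_one, 1000000000UL, &a, &b, &c) then gives
one sample.  For a real comparison the function under test would be
compiled in its own file, once with and once without the patch, and the
loop overhead is part of every sample.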


BR,
Jeff(Jiufu)

---
 gcc/config/rs6000/rs6000.cc                | 14 ++++++++++++++
 gcc/testsuite/gcc.target/powerpc/pr63281.c | 11 +++++++++++
 2 files changed, 25 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c
  

Comments

Richard Biener Aug. 15, 2022, 8:07 a.m. UTC | #1
On Mon, Aug 15, 2022 at 7:26 AM Jiufu Guo via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi,
>
> This patch tries to put the constant into constant pool if building the
> constant requires 3 or more instructions.
>
> But there is a concern: I'm wondering if this patch is really profitable.
>
> Because, as I tested, 1. for simple case, if instructions are not been run
> in parallel, loading constant from memory maybe faster; but 2. if there
> are some instructions could run in parallel, loading constant from memory
> are not win comparing with building constant.  As below examples.
>
> For f1.c and f3.c, 'loading' constant would be acceptable in runtime aspect;
> for f2.c and f4.c, 'loading' constant are visibly slower.
>
> For real-world cases, both kinds of code sequences exist.
>
> So, I'm not sure if we need to push this patch.
>
> Run a lot of times (1000000000) below functions to check runtime.
> f1.c:
> long foo (long *arg, long*, long *)
> {
>   *arg = 0x1234567800000000;
> }
> asm building constant:
>         lis 10,0x1234
>         ori 10,10,0x5678
>         sldi 10,10,32
> vs.  asm loading
>         addis 10,2,.LC0@toc@ha
>         ld 10,.LC0@toc@l(10)
> The runtime between 'building' and 'loading' are similar: some times the
> 'building' is faster; sometimes 'loading' is faster. And the difference is
> slight.

I wonder if it is possible to decide this during scheduling - choose the
variant that, when the result is needed, is cheaper?  Post-RA might
be a bit difficult (I see the load from memory needs the TOC, but then
when the TOC is not available we could just always emit the build form),
and pre-reload precision might not be good enough to make this worth
the experiment?

Of course the scheduler might be lacking on the technical side as well.

  
Segher Boessenkool Aug. 15, 2022, 9:12 p.m. UTC | #2
Hi!

On Mon, Aug 15, 2022 at 01:25:19PM +0800, Jiufu Guo wrote:
> This patch tries to put the constant into constant pool if building the
> constant requires 3 or more instructions.
> 
> But there is a concern: I'm wondering if this patch is really profitable.
> 
> Because, as I tested, 1. for simple case, if instructions are not been run
> in parallel, loading constant from memory maybe faster; but 2. if there
> are some instructions could run in parallel, loading constant from memory
> are not win comparing with building constant.  As below examples.
> 
> For f1.c and f3.c, 'loading' constant would be acceptable in runtime aspect;
> for f2.c and f4.c, 'loading' constant are visibly slower. 
> 
> For real-world cases, both kinds of code sequences exist.
> 
> So, I'm not sure if we need to push this patch.
> 
> Run a lot of times (1000000000) below functions to check runtime.
> f1.c:
> long foo (long *arg, long*, long *)
> {
>   *arg = 0x1234567800000000;
> }
> asm building constant:
> 	lis 10,0x1234
> 	ori 10,10,0x5678
> 	sldi 10,10,32
> vs.  asm loading
> 	addis 10,2,.LC0@toc@ha
> 	ld 10,.LC0@toc@l(10)

This is just a load insn, unless this is the only thing needing the TOC.
You can use crtl->uses_const_pool as an approximation here, to figure
out if we have that case?

> The runtime between 'building' and 'loading' are similar: some times the
> 'building' is faster; sometimes 'loading' is faster. And the difference is
> slight.

When there is only one constant, sure.  But that isn't the expensive
case we need to avoid :-)

> 	addis 9,2,.LC2@toc@ha
> 	ld 7,.LC0@toc@l(7)
> 	ld 10,.LC1@toc@l(10)
> 	ld 9,.LC2@toc@l(9)
> For this case, 'loading' is always slower than 'building' (>15%).

Only if there is nothing else to do, and only in cases where code size
does not matter (i.e. microbenchmarks).

> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c
> @@ -0,0 +1,11 @@
> +/* PR target/63281 */
> +/* { dg-do compile { target lp64 } } */
> +/* { dg-options "-O2 -std=c99" } */

Why std=c99 btw?  The default is c17.  Is there something we need to
disable here?


Segher
  
Jiufu Guo Aug. 16, 2022, 3:50 a.m. UTC | #3
Hi,

Richard Biener <richard.guenther@gmail.com> writes:

> On Mon, Aug 15, 2022 at 7:26 AM Jiufu Guo via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> Hi,
>>
>> This patch tries to put the constant into constant pool if building the
>> constant requires 3 or more instructions.
>>
>> But there is a concern: I'm wondering if this patch is really profitable.
>>
>> Because, as I tested, 1. for simple case, if instructions are not been run
>> in parallel, loading constant from memory maybe faster; but 2. if there
>> are some instructions could run in parallel, loading constant from memory
>> are not win comparing with building constant.  As below examples.
>>
>> For f1.c and f3.c, 'loading' constant would be acceptable in runtime aspect;
>> for f2.c and f4.c, 'loading' constant are visibly slower.
>>
>> For real-world cases, both kinds of code sequences exist.
>>
>> So, I'm not sure if we need to push this patch.
>>
>> Run a lot of times (1000000000) below functions to check runtime.
>> f1.c:
>> long foo (long *arg, long*, long *)
>> {
>>   *arg = 0x1234567800000000;
>> }
>> asm building constant:
>>         lis 10,0x1234
>>         ori 10,10,0x5678
>>         sldi 10,10,32
>> vs.  asm loading
>>         addis 10,2,.LC0@toc@ha
>>         ld 10,.LC0@toc@l(10)
>> The runtime between 'building' and 'loading' are similar: some times the
>> 'building' is faster; sometimes 'loading' is faster. And the difference is
>> slight.
>
> I wonder if it is possible to decide this during scheduling - chose the
> variant that, when the result is needed, is cheaper?  Post-RA might
> be a bit difficult (I see the load from memory needs the TOC, but then
> when the TOC is not available we could just always emit the build form),
> and pre-reload precision might be not good enough to make this worth
> the experiment?
Thanks a lot for your comments!

Yes, Post-RA may not handle all cases.
If there is no TOC available, we are not able to load the constant
through the TOC.  As Segher pointed out, crtl->uses_const_pool may serve
as an approximation.
The sched2 pass could optimize some cases (e.g. f2.c and f4.c), but in
some cases it may not spread out those 'building' instructions.

So, maybe we could add a peephole after sched2: if the five instructions
that build the constant are still consecutive, replace them with a 'load'
(after checking that the TOC is available).
But I'm not sure it is worth it.

>
> Of course the scheduler might lack on the technical side as well.


BR,
Jeff(Jiufu)

  
Jiufu Guo Aug. 16, 2022, 6:45 a.m. UTC | #4
Jiufu Guo <guojiufu@linux.ibm.com> writes:

> Hi,
>
> Richard Biener <richard.guenther@gmail.com> writes:
>
>> On Mon, Aug 15, 2022 at 7:26 AM Jiufu Guo via Gcc-patches
>> <gcc-patches@gcc.gnu.org> wrote:
>>>
>>> Hi,
>>>
>>> This patch tries to put the constant into constant pool if building the
>>> constant requires 3 or more instructions.
>>>
>>> But there is a concern: I'm wondering if this patch is really profitable.
>>>
>>> Because, as I tested, 1. for simple case, if instructions are not been run
>>> in parallel, loading constant from memory maybe faster; but 2. if there
>>> are some instructions could run in parallel, loading constant from memory
>>> are not win comparing with building constant.  As below examples.
>>>
>>> For f1.c and f3.c, 'loading' constant would be acceptable in runtime aspect;
>>> for f2.c and f4.c, 'loading' constant are visibly slower.
>>>
>>> For real-world cases, both kinds of code sequences exist.
>>>
>>> So, I'm not sure if we need to push this patch.
>>>
>>> Run a lot of times (1000000000) below functions to check runtime.
>>> f1.c:
>>> long foo (long *arg, long*, long *)
>>> {
>>>   *arg = 0x1234567800000000;
>>> }
>>> asm building constant:
>>>         lis 10,0x1234
>>>         ori 10,10,0x5678
>>>         sldi 10,10,32
>>> vs.  asm loading
>>>         addis 10,2,.LC0@toc@ha
>>>         ld 10,.LC0@toc@l(10)
>>> The runtime between 'building' and 'loading' are similar: some times the
>>> 'building' is faster; sometimes 'loading' is faster. And the difference is
>>> slight.
>>
>> I wonder if it is possible to decide this during scheduling - chose the
>> variant that, when the result is needed, is cheaper?  Post-RA might
>> be a bit difficult (I see the load from memory needs the TOC, but then
>> when the TOC is not available we could just always emit the build form),
>> and pre-reload precision might be not good enough to make this worth
>> the experiment?
> Thanks a lot for your comments!
>
> Yes, Post-RA may not handle all cases.
> If there is no TOC avaiable, we are not able to load the const through
> TOC.  As Segher point out: crtl->uses_const_pool maybe an approximation
> way.
> Sched2 pass could optimize some cases(e.g. for f2.c and f4.c), but for
> some cases, it may not distrubuted those 'building' instructions.
>
> So, maybe we add a peephole after sched2.  If the five-instructions
> to building constant are still successive, then using 'load' to replace
> (need to check TOC available).
> While I'm not sure if it is worthy.

Oh, checking the object files (from GCC bootstrap and SPEC), it is rare
for the five instructions to be consecutive.  Usually 1 (or 2) of the
insns are scheduled elsewhere and the other 4 (or 3) stay consecutive.
So a peephole may not be very helpful.
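
The kind of check meant here can be sketched on a toy instruction list.
This is only an illustration in plain C, not an rs6000 peephole2 pattern:
struct insn, enum opcode and both functions are hypothetical, and a real
peephole would match RTL rather than opcode names.

```c
#include <stdbool.h>
#include <stddef.h>

enum opcode { OP_LIS, OP_ORI, OP_SLDI, OP_ORIS, OP_OTHER };

/* Toy instruction: just an opcode and a destination register.  */
struct insn
{
  enum opcode op;
  int dest;
};

/* True if lis/ori/sldi/oris/ori on one register starts at START with no
   other instruction in between.  */
static bool
is_consecutive_build5 (const struct insn *insns, size_t n, size_t start)
{
  static const enum opcode pattern[5]
    = { OP_LIS, OP_ORI, OP_SLDI, OP_ORIS, OP_ORI };

  if (start > n || n - start < 5)
    return false;
  for (size_t i = 0; i < 5; i++)
    if (insns[start + i].op != pattern[i]
	|| insns[start + i].dest != insns[start].dest)
      return false;
  return true;
}

/* Count non-overlapping consecutive five-instruction sequences.  */
static size_t
count_consecutive_build5 (const struct insn *insns, size_t n)
{
  size_t count = 0;
  size_t i = 0;
  while (i + 5 <= n)
    if (is_consecutive_build5 (insns, n, i))
      {
	count++;
	i += 5;
      }
    else
      i++;
  return count;
}
```

On the f3.c sequence this finds one match; on an interleaved sequence
like the one in f4.c it finds none, which is the situation described
above for most real object files.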

BR,
Jeff(Jiufu)

  
Jiufu Guo Aug. 17, 2022, 2:32 a.m. UTC | #5
Hi,

Segher Boessenkool <segher@kernel.crashing.org> writes:

> Hi!
>
> On Mon, Aug 15, 2022 at 01:25:19PM +0800, Jiufu Guo wrote:
>> This patch tries to put the constant into constant pool if building the
>> constant requires 3 or more instructions.
>> 
>> But there is a concern: I'm wondering if this patch is really profitable.
>> 
>> Because, as I tested, 1. for simple case, if instructions are not been run
>> in parallel, loading constant from memory maybe faster; but 2. if there
>> are some instructions could run in parallel, loading constant from memory
>> are not win comparing with building constant.  As below examples.
>> 
>> For f1.c and f3.c, 'loading' constant would be acceptable in runtime aspect;
>> for f2.c and f4.c, 'loading' constant are visibly slower. 
>> 
>> For real-world cases, both kinds of code sequences exist.
>> 
>> So, I'm not sure if we need to push this patch.
>> 
>> Run a lot of times (1000000000) below functions to check runtime.
>> f1.c:
>> long foo (long *arg, long*, long *)
>> {
>>   *arg = 0x1234567800000000;
>> }
>> asm building constant:
>> 	lis 10,0x1234
>> 	ori 10,10,0x5678
>> 	sldi 10,10,32
>> vs.  asm loading
>> 	addis 10,2,.LC0@toc@ha
>> 	ld 10,.LC0@toc@l(10)
>
> This is just a load insn, unless this is the only thing needing the TOC.
> You can use crtl->uses_const_pool as an approximation here, to figure
> out if we have that case?

Thanks for pointing this out!
crtl->uses_const_pool is set to 1 in force_const_mem.
create_TOC_reference would be called after force_const_mem.
One concern: there may be cases where crtl->uses_const_pool is not
cleared to zero after the related symbols are optimized out.

>
>> The runtime between 'building' and 'loading' are similar: some times the
>> 'building' is faster; sometimes 'loading' is faster. And the difference is
>> slight.
>
> When there is only one constant, sure.  But that isn't the expensive
> case we need to avoid :-)
Yes. If there are other instructions around, the scheduler could run the
'building' instructions in parallel with them.  If we emit the 'building'
instructions in the split1 pass (before sched1), they are more likely to
be scheduled well.  Then the 'building' form may not be bad.

>
>> 	addis 9,2,.LC2@toc@ha
>> 	ld 7,.LC0@toc@l(7)
>> 	ld 10,.LC1@toc@l(10)
>> 	ld 9,.LC2@toc@l(9)
>> For this case, 'loading' is always slower than 'building' (>15%).
>
> Only if there is nothing else to do, and only in cases where code size
> does not matter (i.e. microbenchmarks).
Yes, 'loading' may save code size slightly.
>
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c
>> @@ -0,0 +1,11 @@
>> +/* PR target/63281 */
>> +/* { dg-do compile { target lp64 } } */
>> +/* { dg-options "-O2 -std=c99" } */
>
> Why std=c99 btw?  The default is c17.  Is there something we need to
> disable here?
Oh, this option is not required.  Thanks!

BR,
Jeff(Jiufu)

>
>
> Segher
  

Patch

diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index 4b727d2a500..3798e11bdbc 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -10098,6 +10098,20 @@  rs6000_emit_set_const (rtx dest, rtx source)
 	  c = ((c & 0xffffffff) ^ 0x80000000) - 0x80000000;
 	  emit_move_insn (lo, GEN_INT (c));
 	}
+      else if (base_reg_operand (dest, mode)
+	       && num_insns_constant (source, mode) > 2)
+	{
+	  rtx sym = force_const_mem (mode, source);
+	  if (TARGET_TOC && SYMBOL_REF_P (XEXP (sym, 0))
+	      && use_toc_relative_ref (XEXP (sym, 0), mode))
+	    {
+	      rtx toc = create_TOC_reference (XEXP (sym, 0), copy_rtx (dest));
+	      sym = gen_const_mem (mode, toc);
+	      set_mem_alias_set (sym, get_TOC_alias_set ());
+	    }
+
+	  emit_insn (gen_rtx_SET (dest, sym));
+	}
       else
 	rs6000_emit_set_long_const (dest, c);
       break;
diff --git a/gcc/testsuite/gcc.target/powerpc/pr63281.c b/gcc/testsuite/gcc.target/powerpc/pr63281.c
new file mode 100644
index 00000000000..469a8f64400
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c
@@ -0,0 +1,11 @@ 
+/* PR target/63281 */
+/* { dg-do compile { target lp64 } } */
+/* { dg-options "-O2 -std=c99" } */
+
+void
+foo (unsigned long long *a)
+{
+  *a = 0x020805006106003;
+}
+
+/* { dg-final { scan-assembler-times {\mp?ld\M} 1 } } */