tree-optimization/113552 - fix num_call accounting in simd clone vectorization

Message ID 20240123115653.47E5C136A4@imap1.dmz-prg2.suse.org
State Accepted
Headers
Series tree-optimization/113552 - fix num_call accounting in simd clone vectorization |

Checks

Context Check Description
snail/gcc-patch-check success Github commit url

Commit Message

Richard Biener Jan. 23, 2024, 11:56 a.m. UTC
  The following avoids using exact_log2 on the number of SIMD clone calls
to be emitted when vectorizing calls since that can easily be not
a power of two in which case it will return -1.  For different simd
clones the number of calls will differ by a multiply with a power of two
only so using floor_log2 is good enough here.

Bootstrap and regtest running on x86_64-unknown-linux-gnu.

	PR tree-optimization/113552
	* tree-vect-stmts.cc (vectorizable_simd_clone_call): Use
	floor_log2 instead of exact_log2 on the number of calls.
---
 gcc/tree-vect-stmts.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Jakub Jelinek Jan. 23, 2024, 12:01 p.m. UTC | #1
On Tue, Jan 23, 2024 at 12:56:52PM +0100, Richard Biener wrote:
> The following avoids using exact_log2 on the number of SIMD clone calls
> to be emitted when vectorizing calls since that can easily be not
> a power of two in which case it will return -1.  For different simd
> clones the number of calls will differ by a multiply with a power of two
> only so using floor_log2 is good enough here.
> 
> Bootstrap and regtest running on x86_64-unknown-linux-gnu.
> 
> 	PR tree-optimization/113552
> 	* tree-vect-stmts.cc (vectorizable_simd_clone_call): Use
> 	floor_log2 instead of exact_log2 on the number of calls.

Is there any target which supports non-power-of-two simdlen?
If not, perhaps we should add !pow2p_hwi (num_calls) to the continue;
condition a few lines earlier?

> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 09749ae3817..1dbe1115da4 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -4071,7 +4071,7 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info,
>  	    || (nargs != simd_nargs))
>  	  continue;
>  	if (num_calls != 1)
> -	  this_badness += exact_log2 (num_calls) * 4096;
> +	  this_badness += floor_log2 (num_calls) * 4096;
>  	if (n->simdclone->inbranch)
>  	  this_badness += 8192;
>  	int target_badness = targetm.simd_clone.usable (n);
> -- 
> 2.35.3

	Jakub
  
Richard Biener Jan. 23, 2024, 12:03 p.m. UTC | #2
On Tue, 23 Jan 2024, Jakub Jelinek wrote:

> On Tue, Jan 23, 2024 at 12:56:52PM +0100, Richard Biener wrote:
> > The following avoids using exact_log2 on the number of SIMD clone calls
> > to be emitted when vectorizing calls since that can easily be not
> > a power of two in which case it will return -1.  For different simd
> > clones the number of calls will differ by a multiply with a power of two
> > only so using floor_log2 is good enough here.
> > 
> > Bootstrap and regtest running on x86_64-unknown-linux-gnu.
> > 
> > 	PR tree-optimization/113552
> > 	* tree-vect-stmts.cc (vectorizable_simd_clone_call): Use
> > 	floor_log2 instead of exact_log2 on the number of calls.
> 
> Is there any target which supports non-power-of-two simdlen?
> If not, perhaps we should add !pow2p_hwi (num_calls) to the continue;
> condition a few lines earlier?

Is non-power-of-two simdlen a thing?  Note there's nothing wrong
with non-power-of-two num_calls, with VF == 4 and a group size
of 3 you get 12 lanes and either 3 (simdlen == 4) or 6 (simdlen == 2)
calls.

Iff non-power-of-two simdlen is a thing then we could bias
by + num_calls (no idea why we use *_log2 in the first place, but it
was that way since the beginning).

Richard.

> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index 09749ae3817..1dbe1115da4 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -4071,7 +4071,7 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info,
> >  	    || (nargs != simd_nargs))
> >  	  continue;
> >  	if (num_calls != 1)
> > -	  this_badness += exact_log2 (num_calls) * 4096;
> > +	  this_badness += floor_log2 (num_calls) * 4096;
> >  	if (n->simdclone->inbranch)
> >  	  this_badness += 8192;
> >  	int target_badness = targetm.simd_clone.usable (n);
> > -- 
> > 2.35.3
> 
> 	Jakub
> 
>
  
Jakub Jelinek Jan. 23, 2024, 12:21 p.m. UTC | #3
On Tue, Jan 23, 2024 at 01:03:46PM +0100, Richard Biener wrote:
> On Tue, 23 Jan 2024, Jakub Jelinek wrote:
> 
> > On Tue, Jan 23, 2024 at 12:56:52PM +0100, Richard Biener wrote:
> > > The following avoids using exact_log2 on the number of SIMD clone calls
> > > to be emitted when vectorizing calls since that can easily be not
> > > a power of two in which case it will return -1.  For different simd
> > > clones the number of calls will differ by a multiply with a power of two
> > > only so using floor_log2 is good enough here.
> > > 
> > > Bootstrap and regtest running on x86_64-unknown-linux-gnu.
> > > 
> > > 	PR tree-optimization/113552
> > > 	* tree-vect-stmts.cc (vectorizable_simd_clone_call): Use
> > > 	floor_log2 instead of exact_log2 on the number of calls.
> > 
> > Is there any target which supports non-power-of-two simdlen?
> > If not, perhaps we should add !pow2p_hwi (num_calls) to the continue;
> > condition a few lines earlier?
> 
> Is non-power-of-two simdlen a thing?  Note there's nothing wrong
> with non-power-of-two num_calls, with VF == 4 and a group size
> of 3 you get 12 lanes and either 3 (simdlen == 4) or 6 (simdlen == 2)
> calls.
> 
> Iff non-power-of-two simdlen is a thing then we could bias
> by + num_calls (no idea why we use *_log2 in the first place, but it
> was that way since the beginning).

Ah, with SLP I can understand it doesn't have to be a power of two,
the original
+       if (n->simdclone->simdlen
+           < (unsigned) LOOP_VINFO_VECT_FACTOR (loop_vinfo))
+         this_badness += (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo))
+                          - exact_log2 (n->simdclone->simdlen)) * 1024;
was written for loop vectorization only and I think correctly assumed
power of 2 loop vectorization factors as well as simdlens.

I admit I don't remember why log2 rather than the count has been used,
the first version of the patch is
https://gcc.gnu.org/pipermail/gcc-patches/2013-November/374728.html
But if we keep using log2, perhaps better ceil_log2 because num_calls of
3 is certainly more expensive than 2.

	Jakub
  
Richard Biener Jan. 23, 2024, 12:40 p.m. UTC | #4
On Tue, 23 Jan 2024, Jakub Jelinek wrote:

> On Tue, Jan 23, 2024 at 01:03:46PM +0100, Richard Biener wrote:
> > On Tue, 23 Jan 2024, Jakub Jelinek wrote:
> > 
> > > On Tue, Jan 23, 2024 at 12:56:52PM +0100, Richard Biener wrote:
> > > > The following avoids using exact_log2 on the number of SIMD clone calls
> > > > to be emitted when vectorizing calls since that can easily be not
> > > > a power of two in which case it will return -1.  For different simd
> > > > clones the number of calls will differ by a multiply with a power of two
> > > > only so using floor_log2 is good enough here.
> > > > 
> > > > Bootstrap and regtest running on x86_64-unknown-linux-gnu.
> > > > 
> > > > 	PR tree-optimization/113552
> > > > 	* tree-vect-stmts.cc (vectorizable_simd_clone_call): Use
> > > > 	floor_log2 instead of exact_log2 on the number of calls.
> > > 
> > > Is there any target which supports non-power-of-two simdlen?
> > > If not, perhaps we should add !pow2p_hwi (num_calls) to the continue;
> > > condition a few lines earlier?
> > 
> > Is non-power-of-two simdlen a thing?  Note there's nothing wrong
> > with non-power-of-two num_calls, with VF == 4 and a group size
> > of 3 you get 12 lanes and either 3 (simdlen == 4) or 6 (simdlen == 2)
> > calls.
> > 
> > Iff non-power-of-two simdlen is a thing then we could bias
> > by + num_calls (no idea why we use *_log2 in the first place, but it
> > was that way since the beginning).
> 
> Ah, with SLP I can understand it doesn't have to be a power of two,
> the original
> +       if (n->simdclone->simdlen
> +           < (unsigned) LOOP_VINFO_VECT_FACTOR (loop_vinfo))
> +         this_badness += (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo))
> +                          - exact_log2 (n->simdclone->simdlen)) * 1024;
> was written for loop vectorization only and I think correctly assumed
> power of 2 loop vectorization factors as well as simdlens.

Note even without SLP we could vectorize a group size of 3 with
interleaving.  But yes, VF for interleaving is a power of two and
so is simdlen so the above should have worked.

> I admit I don't remember why log2 rather than the count has been used,
> the first version of the patch is
> https://gcc.gnu.org/pipermail/gcc-patches/2013-November/374728.html
> But if we keep using log2, perhaps better ceil_log2 because num_calls of
> 3 is certainly more expensive than 2.

I think only the relative accounting to the other this_badness
increments matter since as said, at least with power-of-two simdlen
we'd get different {ceil,floor}_log2 values for different simdlen.

It's never going to be 3 vs 2 but 3 * 2^n vs. 3 * 2^m so floor or
ceil doesn't matter.  In fact we could have just using
some inverse of exact_log2 (n->simdclone->simdlen).  That is,
it's only simdlen that's varying in this addend to this_badness.

For the exact_log2 case the behavior is unchanged and we now only
get a sensible number for the others now.  Maybe log2 was for the
fear of overflow or over-accounting compared to the other increments.
When overflow is not an issue we could also use
floor_log2 (num_calls * 4096) to be more "precise".  I don't know
why we have all these different weights and why they are the way
they are - but it's a lot of apples and oranges we put together
in a magic number to compare ;)

I prefer to do a minimal change now, but both floor and ceil work
for me, so if you prefer I can change (but as said, it would only
matter with the mixing with the other cost factors, and in unknown
ways since the desire is not spelled out).

Richard.
  
Jakub Jelinek Jan. 23, 2024, 12:43 p.m. UTC | #5
On Tue, Jan 23, 2024 at 01:40:15PM +0100, Richard Biener wrote:
> It's never going to be 3 vs 2 but 3 * 2^n vs. 3 * 2^m so floor or
> ceil doesn't matter.  In fact we could have just using
> some inverse of exact_log2 (n->simdclone->simdlen).  That is,
> it's only simdlen that's varying in this addend to this_badness.
> 
> For the exact_log2 case the behavior is unchanged and we now only
> get a sensible number for the others now.  Maybe log2 was for the
> fear of overflow or over-accounting compared to the other increments.
> When overflow is not an issue we could also use
> floor_log2 (num_calls * 4096) to be more "precise".  I don't know
> why we have all these different weights and why they are the way
> they are - but it's a lot of apples and oranges we put together
> in a magic number to compare ;)
> 
> I prefer to do a minimal change now, but both floor and ceil work
> for me, so if you prefer I can change (but as said, it would only
> matter with the mixing with the other cost factors, and in unknown
> ways since the desire is not spelled out).

Ok as is then.

	Jakub
  

Patch

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 09749ae3817..1dbe1115da4 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -4071,7 +4071,7 @@  vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info,
 	    || (nargs != simd_nargs))
 	  continue;
 	if (num_calls != 1)
-	  this_badness += exact_log2 (num_calls) * 4096;
+	  this_badness += floor_log2 (num_calls) * 4096;
 	if (n->simdclone->inbranch)
 	  this_badness += 8192;
 	int target_badness = targetm.simd_clone.usable (n);