[PATCH/RFC,08/10] aarch64: Don't use CEIL for vector_store in aarch64_stp_sequence_cost

Message ID bc85799abb2616dcac511424a1b50b57e48c2556.1694657494.git.linkw@linux.ibm.com
State Unresolved
Series vect: Move costing next to the transform for vect store

Checks

Context Check Description
snail/gcc-patch-check warning Git am fail log

Commit Message

Kewen.Lin Sept. 14, 2023, 3:11 a.m. UTC
  This costing adjustment patch series exposes one issue in
aarch64 specific costing adjustment for STP sequence.  It
causes the below test cases to fail:

  - gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
  - gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
  - gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
  - gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c

Take the below function extracted from ldp_stp_15.c as
example:

void
dup_8_int32_t (int32_t *x, int32_t val)
{
  for (int i = 0; i < 8; ++i)
    x[i] = val;
}

Without my patch series, during slp1 it gets:

  val_8(D) 2 times unaligned_store (misalign -1) costs 2 in body
  node 0x10008c85e38 1 times scalar_to_vec costs 1 in prologue

then the final vector cost is 3.

With my patch series, during slp1 it gets:

  val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
  val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
  node 0x10004cc5d88 1 times scalar_to_vec costs 1 in prologue

but the final vector cost is 17.  The unaligned_store count is
actually unchanged, but the final vector costs differ, because
the aarch64-specific handling below computes them differently:

  /* Apply the heuristic described above m_stp_sequence_cost.  */
  if (m_stp_sequence_cost != ~0U)
    {
      uint64_t cost = aarch64_stp_sequence_cost (count, kind,
						 stmt_info, vectype);
      m_stp_sequence_cost = MIN (m_stp_sequence_cost + cost, ~0U);
    }

For the former, since the count is 2, function
aarch64_stp_sequence_cost returns 2 from "CEIL (count, 2) * 2".
For the latter, the costing is split into two calls with
count 1; aarch64_stp_sequence_cost returns 2 each time,
so it returns 4 in total.

For this case, the stmt with scalar_to_vec also contributes
4 to m_stp_sequence_cost, so the final m_stp_sequence_cost
values are 6 (2+4) vs. 8 (4+4).

Given that scalar_costs->m_stp_sequence_cost is 8, and the
checking and re-assignment below:

  else if (m_stp_sequence_cost >= scalar_costs->m_stp_sequence_cost)
    m_costs[vect_body] = 2 * scalar_costs->total_cost ();

For the former, the vector body cost isn't changed; but for
the latter, the vector body cost becomes double the scalar
cost, which is 8 for this case, so it becomes 16, bigger
than what we expect.

I'm not sure why CEIL is used for the return value in the
unaligned_store case of function aarch64_stp_sequence_cost,
but when I changed it to "return count;" (which gets back
to the previous costs), no failures were exposed in
regression testing.  I expected that if the previous
unaligned_store count is even, this adjustment doesn't
change anything, and if it's odd, the adjustment may reduce
the result by one, but I'd guess such cases are few.
Besides, as the comments above m_stp_sequence_cost note,
the current handling seems temporary, so maybe a tweak like
this can be accepted.  I'm posting this RFC/PATCH to
request comments on this one-line change.

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_stp_sequence_cost): Return
	count directly instead of the adjusted value computed with CEIL.
---
 gcc/config/aarch64/aarch64.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Richard Sandiford Sept. 18, 2023, 8:41 a.m. UTC | #1
Kewen Lin <linkw@linux.ibm.com> writes:
> This costing adjustment patch series exposes one issue in
> aarch64 specific costing adjustment for STP sequence.  It
> causes the below test cases to fail:
>
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
>
> Take the below function extracted from ldp_stp_15.c as
> example:
>
> void
> dup_8_int32_t (int32_t *x, int32_t val)
> {
>     for (int i = 0; i < 8; ++i)
>           x[i] = val;
> }
>
> Without my patch series, during slp1 it gets:
>
>   val_8(D) 2 times unaligned_store (misalign -1) costs 2 in body
>   node 0x10008c85e38 1 times scalar_to_vec costs 1 in prologue
>
> then the final vector cost is 3.
>
> With my patch series, during slp1 it gets:
>
>   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
>   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
>   node 0x10004cc5d88 1 times scalar_to_vec costs 1 in prologue
>
> but the final vector cost is 17.  The unaligned_store count is
> actually unchanged, but the final vector costs become different,
> it's because the below aarch64 special handling makes the
> different costs:
>
>   /* Apply the heuristic described above m_stp_sequence_cost.  */
>   if (m_stp_sequence_cost != ~0U)
>     {
>       uint64_t cost = aarch64_stp_sequence_cost (count, kind,
> 						 stmt_info, vectype);
>       m_stp_sequence_cost = MIN (m_stp_sequence_cost + cost, ~0U);
>     }
>
> For the former, since the count is 2, function
> aarch64_stp_sequence_cost returns 2 as "CEIL (count, 2) * 2".
> While for the latter, it's separated into twice calls with
> count 1, aarch64_stp_sequence_cost returns 2 for each time,
> so it returns 4 in total.
>
> For this case, the stmt with scalar_to_vec also contributes
> 4 to m_stp_sequence_cost, then the final m_stp_sequence_cost
> are 6 (2+4) vs. 8 (4+4).
>
> Considering scalar_costs->m_stp_sequence_cost is 8 and below
> checking and re-assigning:
>
>   else if (m_stp_sequence_cost >= scalar_costs->m_stp_sequence_cost)
>     m_costs[vect_body] = 2 * scalar_costs->total_cost ();
>
> For the former, the body cost of vector isn't changed; but
> for the latter, the body cost of vector is double of scalar
> cost which is 8 for this case, then it becomes 16 which is
> bigger than what we expect.
>
> I'm not sure why it adopts CEIL for the return value for
> case unaligned_store in function aarch64_stp_sequence_cost,
> but I tried to modify it with "return count;" (as it can
> get back to previous cost), there is no failures exposed
> in regression testing.  I expected that if the previous
> unaligned_store count is even, this adjustment doesn't
> change anything, if it's odd, the adjustment may reduce
> it by one, but I'd guess it would be few.  Besides, as
> the comments for m_stp_sequence_cost, the current
> handlings seems temporary, maybe a tweak like this can be
> accepted, so I posted this RFC/PATCH to request comments.
> this one line change is considered.

It's unfortunate that doing this didn't show up a regression.
I guess it's not a change we explicitly added tests to guard against.

But the point of the condition is to estimate how many single stores
(STRs) and how many paired stores (STPs) would be generated.  As far
as this heuristic goes, STP (storing two values) is as cheap as STR
(storing only one value).  So the point of the CEIL is to count 1 store
as having equal cost to 2, 3 as having equal cost to 4, etc.

For a heuristic like that, costing a vector stmt once with count 2
is different from costing 2 vector stmts with count 1.  The former
makes it obvious that the 2 vector stmts are associated with the
same scalar stmt, and are highly likely to be consecutive.  The latter
(costing 2 stmts with count 1) could also happen for unrelated stmts.

ISTM that costing once with count N provides strictly more information
to targets than costing N times with count 1.  Is there no way we can
keep the current behaviour?  E.g. rather than costing a stmt immediately
within a loop, could we just increment a counter and cost once at the end?

Thanks,
Richard

> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64.cc (aarch64_stp_sequence_cost): Return
> 	count directly instead of the adjusted value computed with CEIL.
> ---
>  gcc/config/aarch64/aarch64.cc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 37d414021ca..9fb4fbd883d 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -17051,7 +17051,7 @@ aarch64_stp_sequence_cost (unsigned int count, vect_cost_for_stmt kind,
>  	  if (!aarch64_aligned_constant_offset_p (stmt_info, size))
>  	    return count * 2;
>  	}
> -      return CEIL (count, 2) * 2;
> +      return count;
>  
>      case scalar_store:
>        if (stmt_info && STMT_VINFO_DATA_REF (stmt_info))
  
Richard Biener Sept. 18, 2023, 8:53 a.m. UTC | #2
On Mon, Sep 18, 2023 at 10:41 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Kewen Lin <linkw@linux.ibm.com> writes:
> > This costing adjustment patch series exposes one issue in
> > aarch64 specific costing adjustment for STP sequence.  It
> > causes the below test cases to fail:
> >
> >   - gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
> >   - gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> >   - gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
> >   - gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
> >
> > Take the below function extracted from ldp_stp_15.c as
> > example:
> >
> > void
> > dup_8_int32_t (int32_t *x, int32_t val)
> > {
> >     for (int i = 0; i < 8; ++i)
> >           x[i] = val;
> > }
> >
> > Without my patch series, during slp1 it gets:
> >
> >   val_8(D) 2 times unaligned_store (misalign -1) costs 2 in body
> >   node 0x10008c85e38 1 times scalar_to_vec costs 1 in prologue
> >
> > then the final vector cost is 3.
> >
> > With my patch series, during slp1 it gets:
> >
> >   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
> >   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
> >   node 0x10004cc5d88 1 times scalar_to_vec costs 1 in prologue
> >
> > but the final vector cost is 17.  The unaligned_store count is
> > actually unchanged, but the final vector costs become different,
> > it's because the below aarch64 special handling makes the
> > different costs:
> >
> >   /* Apply the heuristic described above m_stp_sequence_cost.  */
> >   if (m_stp_sequence_cost != ~0U)
> >     {
> >       uint64_t cost = aarch64_stp_sequence_cost (count, kind,
> >                                                stmt_info, vectype);
> >       m_stp_sequence_cost = MIN (m_stp_sequence_cost + cost, ~0U);
> >     }
> >
> > For the former, since the count is 2, function
> > aarch64_stp_sequence_cost returns 2 as "CEIL (count, 2) * 2".
> > While for the latter, it's separated into twice calls with
> > count 1, aarch64_stp_sequence_cost returns 2 for each time,
> > so it returns 4 in total.
> >
> > For this case, the stmt with scalar_to_vec also contributes
> > 4 to m_stp_sequence_cost, then the final m_stp_sequence_cost
> > are 6 (2+4) vs. 8 (4+4).
> >
> > Considering scalar_costs->m_stp_sequence_cost is 8 and below
> > checking and re-assigning:
> >
> >   else if (m_stp_sequence_cost >= scalar_costs->m_stp_sequence_cost)
> >     m_costs[vect_body] = 2 * scalar_costs->total_cost ();
> >
> > For the former, the body cost of vector isn't changed; but
> > for the latter, the body cost of vector is double of scalar
> > cost which is 8 for this case, then it becomes 16 which is
> > bigger than what we expect.
> >
> > I'm not sure why it adopts CEIL for the return value for
> > case unaligned_store in function aarch64_stp_sequence_cost,
> > but I tried to modify it with "return count;" (as it can
> > get back to previous cost), there is no failures exposed
> > in regression testing.  I expected that if the previous
> > unaligned_store count is even, this adjustment doesn't
> > change anything, if it's odd, the adjustment may reduce
> > it by one, but I'd guess it would be few.  Besides, as
> > the comments for m_stp_sequence_cost, the current
> > handlings seems temporary, maybe a tweak like this can be
> > accepted, so I posted this RFC/PATCH to request comments.
> > this one line change is considered.
>
> It's unfortunate that doing this didn't show up a regression.
> I guess it's not a change we explicitly added tests to guard against.
>
> But the point of the condition is to estimate how many single stores
> (STRs) and how many paired stores (STPs) would be generated.  As far
> as this heuristic goes, STP (storing two values) is as cheap as STR
> (storing only one value).  So the point of the CEIL is to count 1 store
> as having equal cost to 2, 3 as having equal cost to 4, etc.
>
> For a heuristic like that, costing a vector stmt once with count 2
> is different from costing 2 vector stmts with count 1.  The former
> makes it obvious that the 2 vector stmts are associated with the
> same scalar stmt, and are highly likely to be consecutive.  The latter
> (costing 2 stmts with count 1) could also happen for unrelated stmts.
>
> ISTM that costing once with count N provides strictly more information
> to targets than costing N times with count 1.  Is there no way we can
> keep the current behaviour?  E.g. rather than costing a stmt immediately
> within a loop, could we just increment a counter and cost once at the end?

I suppose we can.  But isn't the logic currently (or before the series) cheated
for variable-strided stores with ncopies > 1?  That is, while it sounds like
reasonable heuristics you can't really rely on this as the vectorizer doesn't
currently provide the info whether two vector loads/stores are adjacent?

Making sure we only pass count > 1 for adjacent load/store would be possible
though.  We should document this with comments though.

Richard.

>
> Thanks,
> Richard
>
> > gcc/ChangeLog:
> >
> >       * config/aarch64/aarch64.cc (aarch64_stp_sequence_cost): Return
> >       count directly instead of the adjusted value computed with CEIL.
> > ---
> >  gcc/config/aarch64/aarch64.cc | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 37d414021ca..9fb4fbd883d 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -17051,7 +17051,7 @@ aarch64_stp_sequence_cost (unsigned int count, vect_cost_for_stmt kind,
> >         if (!aarch64_aligned_constant_offset_p (stmt_info, size))
> >           return count * 2;
> >       }
> > -      return CEIL (count, 2) * 2;
> > +      return count;
> >
> >      case scalar_store:
> >        if (stmt_info && STMT_VINFO_DATA_REF (stmt_info))
  
Kewen.Lin Sept. 20, 2023, 2:40 a.m. UTC | #3
Hi,

on 2023/9/18 16:53, Richard Biener wrote:
> On Mon, Sep 18, 2023 at 10:41 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Kewen Lin <linkw@linux.ibm.com> writes:
>>> This costing adjustment patch series exposes one issue in
>>> aarch64 specific costing adjustment for STP sequence.  It
>>> causes the below test cases to fail:
>>>
>>>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
>>>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
>>>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
>>>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
>>>
>>> Take the below function extracted from ldp_stp_15.c as
>>> example:
>>>
>>> void
>>> dup_8_int32_t (int32_t *x, int32_t val)
>>> {
>>>     for (int i = 0; i < 8; ++i)
>>>           x[i] = val;
>>> }
>>>
>>> Without my patch series, during slp1 it gets:
>>>
>>>   val_8(D) 2 times unaligned_store (misalign -1) costs 2 in body
>>>   node 0x10008c85e38 1 times scalar_to_vec costs 1 in prologue
>>>
>>> then the final vector cost is 3.
>>>
>>> With my patch series, during slp1 it gets:
>>>
>>>   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
>>>   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
>>>   node 0x10004cc5d88 1 times scalar_to_vec costs 1 in prologue
>>>
>>> but the final vector cost is 17.  The unaligned_store count is
>>> actually unchanged, but the final vector costs become different,
>>> it's because the below aarch64 special handling makes the
>>> different costs:
>>>
>>>   /* Apply the heuristic described above m_stp_sequence_cost.  */
>>>   if (m_stp_sequence_cost != ~0U)
>>>     {
>>>       uint64_t cost = aarch64_stp_sequence_cost (count, kind,
>>>                                                stmt_info, vectype);
>>>       m_stp_sequence_cost = MIN (m_stp_sequence_cost + cost, ~0U);
>>>     }
>>>
>>> For the former, since the count is 2, function
>>> aarch64_stp_sequence_cost returns 2 as "CEIL (count, 2) * 2".
>>> While for the latter, it's separated into twice calls with
>>> count 1, aarch64_stp_sequence_cost returns 2 for each time,
>>> so it returns 4 in total.
>>>
>>> For this case, the stmt with scalar_to_vec also contributes
>>> 4 to m_stp_sequence_cost, then the final m_stp_sequence_cost
>>> are 6 (2+4) vs. 8 (4+4).
>>>
>>> Considering scalar_costs->m_stp_sequence_cost is 8 and below
>>> checking and re-assigning:
>>>
>>>   else if (m_stp_sequence_cost >= scalar_costs->m_stp_sequence_cost)
>>>     m_costs[vect_body] = 2 * scalar_costs->total_cost ();
>>>
>>> For the former, the body cost of vector isn't changed; but
>>> for the latter, the body cost of vector is double of scalar
>>> cost which is 8 for this case, then it becomes 16 which is
>>> bigger than what we expect.
>>>
>>> I'm not sure why it adopts CEIL for the return value for
>>> case unaligned_store in function aarch64_stp_sequence_cost,
>>> but I tried to modify it with "return count;" (as it can
>>> get back to previous cost), there is no failures exposed
>>> in regression testing.  I expected that if the previous
>>> unaligned_store count is even, this adjustment doesn't
>>> change anything, if it's odd, the adjustment may reduce
>>> it by one, but I'd guess it would be few.  Besides, as
>>> the comments for m_stp_sequence_cost, the current
>>> handlings seems temporary, maybe a tweak like this can be
>>> accepted, so I posted this RFC/PATCH to request comments.
>>> this one line change is considered.
>>
>> It's unfortunate that doing this didn't show up a regression.
>> I guess it's not a change we explicitly added tests to guard against.
>>
>> But the point of the condition is to estimate how many single stores
>> (STRs) and how many paired stores (STPs) would be generated.  As far
>> as this heuristic goes, STP (storing two values) is as cheap as STR
>> (storing only one value).  So the point of the CEIL is to count 1 store
>> as having equal cost to 2, 3 as having equal cost to 4, etc.
>>

Thanks for the explanation and ...

>> For a heuristic like that, costing a vector stmt once with count 2
>> is different from costing 2 vector stmts with count 1.  The former
>> makes it obvious that the 2 vector stmts are associated with the
>> same scalar stmt, and are highly likely to be consecutive.  The latter
>> (costing 2 stmts with count 1) could also happen for unrelated stmts.
>>
>> ISTM that costing once with count N provides strictly more information
>> to targets than costing N times with count 1.  Is there no way we can
>> keep the current behaviour?  E.g. rather than costing a stmt immediately
>> within a loop, could we just increment a counter and cost once at the end?
> 
> I suppose we can.  But isn't the logic currently (or before the series) cheated
> for variable-strided stores with ncopies > 1?  That is, while it sounds like
> reasonable heuristics you can't really rely on this as the vectorizer doesn't
> currently provide the info whether two vector loads/stores are adjacent?
> 
> Making sure we only pass count > 1 for adjacent load/store would be possible
> though.  We should document this with comments though.

... the suggestion!

This suggestion sounds applicable to both loads and stores.  If the
others in this series look fine, I guess we can file a bug for the
exposed test cases first, then fix it with a follow-up patch covering
both load and store.

BR,
Kewen

> 
> Richard.
> 
>>
>> Thanks,
>> Richard
>>
>>> gcc/ChangeLog:
>>>
>>>       * config/aarch64/aarch64.cc (aarch64_stp_sequence_cost): Return
>>>       count directly instead of the adjusted value computed with CEIL.
>>> ---
>>>  gcc/config/aarch64/aarch64.cc | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>>> index 37d414021ca..9fb4fbd883d 100644
>>> --- a/gcc/config/aarch64/aarch64.cc
>>> +++ b/gcc/config/aarch64/aarch64.cc
>>> @@ -17051,7 +17051,7 @@ aarch64_stp_sequence_cost (unsigned int count, vect_cost_for_stmt kind,
>>>         if (!aarch64_aligned_constant_offset_p (stmt_info, size))
>>>           return count * 2;
>>>       }
>>> -      return CEIL (count, 2) * 2;
>>> +      return count;
>>>
>>>      case scalar_store:
>>>        if (stmt_info && STMT_VINFO_DATA_REF (stmt_info))
  

Patch

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 37d414021ca..9fb4fbd883d 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -17051,7 +17051,7 @@  aarch64_stp_sequence_cost (unsigned int count, vect_cost_for_stmt kind,
 	  if (!aarch64_aligned_constant_offset_p (stmt_info, size))
 	    return count * 2;
 	}
-      return CEIL (count, 2) * 2;
+      return count;
 
     case scalar_store:
       if (stmt_info && STMT_VINFO_DATA_REF (stmt_info))